#9851 Packager Dashboard - Post Mortem after deployment of new release to prod
Closed: Fixed 3 years ago by kevin. Opened 3 years ago by frantisekz.

Describe what you would like us to do:


This morning, I went ahead and deployed a new version of Fedora Packager Dashboard (for the rest of this post, I'll be talking about its backend, oraculum). The same build had been running just fine in staging for a few days.

Moments after the deployment, I saw that the pod with the API endpoint (oraculum-api-endpoint) died. Something sent SIGTERM to its entrypoint process (gunicorn serving oraculum). There wasn't anything weird in the logs (part of the logs is at the end of this post), just a stream of successfully served requests and then, out of nowhere, a SIGTERM. OpenShift spawned another pod, which met a similar fate within a minute.

This madness stopped after the oraculum-workers finished their sync-up with Bugzilla. The workers weren't dying, though; only the main process was (workers live in separate pods).

Is it possible that the cluster somehow sent the SIGTERM? Can we do something about that so it can't happen in the future? I can't figure out what else but the cluster could have done that. I know that resource limits can be configured for pods, but I didn't see any applied to the pods running oraculum.
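
For reference, such limits are normally defined as a resources block on the container spec, roughly like the sketch below (made-up values for illustration, not what is actually deployed):

  resources:
    requests:
      cpu: 500m        # guaranteed share used for scheduling
      memory: 512Mi
    limits:
      cpu: "1"         # hard ceiling; exceeding it throttles the container
      memory: 1Gi      # exceeding this gets the container OOM-killed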

For more context, what happened after the upgrade was a cache purge for all the package bugs, due to some restructuring for private-bugs support. This was expected; oraculum is designed to cope with a missing cache and is able to schedule jobs to build the cache even before the planned cache refresh (that is, however, a very resource-heavy operation that usually doesn't happen, since users normally get their data served from the cache and no planning is involved on the api-endpoint). It started doing just that; however, plenty of users were opening the dashboard (or, more probably, had their dashboards open and the frontend was trying to query data), and that created a huge load spike. So it kept dying until the cache was slowly rebuilt.

According to the metrics in OpenShift, it wasn't RAM usage that was high (though I might be wrong here, I don't know how much I can depend on those pretty-looking graphs in OpenShift), but I'd expect there was a significant CPU load.

Logs before the end (IP addresses and usernames masked):

0.0.0.0 - - [07/Apr/2021:09:37:03 +0000] "GET /api/v1/packager_dashboard/user HTTP/1.1" 200 293633 "https://packager-dashboard.fedoraproject.org/user " "Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:87.0) Gecko/20100101 Firefox/87.0"
0.0.0.0 - - [07/Apr/2021:09:37:03 +0000] "GET /api/v1/packager_dashboard/user HTTP/1.1" 200 50250 "https://packager-dashboard.fedoraproject.org/user " "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36"
0.0.0.0 - - [07/Apr/2021:09:37:03 +0000] "GET /api/v1/packager_dashboard/user HTTP/1.1" 200 8622 "https://packager-dashboard.fedoraproject.org/user " "Mozilla/5.0 (Linux; Android 8.0.0; SM-G965F Build/R16NW) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.116 Mobile Safari/537.36"
0.0.0.0 - - [07/Apr/2021:09:37:03 +0000] "GET /api/v1/packager_dashboard/user HTTP/1.1" 200 38859 "https://packager-dashboard.fedoraproject.org/user " "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36"
0.0.0.0 - - [07/Apr/2021:09:37:04 +0000] "GET /api/v1/packager_dashboard/user HTTP/1.1" 200 62760 "https://packager-dashboard.fedoraproject.org/user " "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
[2021-04-07 09:37:07 +0000] [1] [INFO] Handling signal: term
0.0.0.0 - - [07/Apr/2021:09:37:10 +0000] "GET /api/v1/packager_dashboard/user HTTP/1.1" 200 4541964 "https://packager-dashboard.fedoraproject.org/user " "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:87.0) Gecko/20100101 Firefox/87.0"
[2021-04-07 09:37:10 +0000] [35] [INFO] Worker exiting (pid: 35)
[2021-04-07 09:37:10 +0000] [1] [INFO] Shutting down: Master

Thanks for any input and ideas!

When do you need this to be done by? (YYYY/MM/DD)



Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-gain, low-trouble, ops

3 years ago

This is most likely caused by the readiness/liveness check that was failing during this time.
If this check fails 3 times in a row, the pod is killed and another one is started.
According to the node log, this check was indeed failing during that timespan (8:40 - 10:35 UTC):

Readiness probe for "oraculum-api-endpoint-24-tntwg_oraculum(75efc44b-9784-11eb-9d3e-52540090ed9f):oraculum-api-endpoint" failed (failure): Get http://10.128.2.180:8080/: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

You may need to adjust the initialDelaySeconds parameter to wait a bit more before doing the initial check.
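
For illustration, the probe section of the oraculum-api-endpoint deployment could look roughly like this; the path, port, and values below are assumptions for the sketch, not the current oraculum configuration:

  livenessProbe:             # failing liveness checks are what get the pod killed
    httpGet:
      path: /
      port: 8080
    initialDelaySeconds: 30  # wait longer after startup before the first check
    periodSeconds: 10
    timeoutSeconds: 1        # the default; see the note below about raising it
    failureThreshold: 3      # three consecutive failures restart the container
  readinessProbe:            # failing readiness only takes the pod out of the service
    httpGet:
      path: /
      port: 8080
    initialDelaySeconds: 30
    timeoutSeconds: 1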

--
It can also be the probe timeout, which is currently 1 second. Since the CPU load was high, maybe the app wasn't able to respond quickly enough.
If that is the case, you can simply set the timeoutSeconds parameter a little bit higher.
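
Continuing the sketch above (again an example value, not the current setting):

  readinessProbe:
    httpGet:
      path: /
      port: 8080
    timeoutSeconds: 5        # up from the 1 second default, so a busy app can still answer in time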

Yep, that sounds completely plausible @darknao :)

Please re-open if you need us to do anything further.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
