#116 Greenwave can fail OpenShift liveness check if a request takes a long time to be serviced
Closed 6 years ago Opened 6 years ago by dcallagh.

In Fedora's OpenShift deployment of Greenwave, we observed:

https://os.fedoraproject.org/console/project/greenwave/browse/pods/greenwave-web-8-2pbtr?tab=events

Unhealthy
Liveness probe failed: Get http://10.131.0.64:8080/healthcheck: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
158 times in the last 19 hours

Killing
Killing container with id docker://web:pod "greenwave-web-8-2pbtr_greenwave(96634b5e-fc84-11e7-a141-5254008d18bb)" container "web" is unhealthy, it will be killed and re-created.
27 times in the last 19 hours

The logs show that Greenwave seems to be handling requests normally. However, we currently run each pod with a single gunicorn worker. This means that if a particular request is slow to be serviced, the OpenShift liveness probe can get stuck queued behind that request and exceed the probe's configured timeout, causing OpenShift to kill the pod.
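To illustrate the failure mode, here is a minimal sketch (not Greenwave's actual code) of a Flask-style app served by a single synchronous gunicorn worker; the route names are invented for the example:

```python
import time

from flask import Flask

app = Flask(__name__)

@app.route("/decision")
def slow_decision():
    # Stand-in for a real request that takes longer than the probe timeout.
    time.sleep(30)
    return "done"

@app.route("/healthcheck")
def healthcheck():
    # The endpoint the liveness probe hits.
    return "OK"
```

Run it with a single sync worker (gunicorn --workers 1 app:app) and start a request to /decision: while that request is being serviced, a GET to /healthcheck queues behind it, and a probe with a short timeout gives up with exactly the "Client.Timeout exceeded while awaiting headers" error shown above.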


Easiest solution is probably just to run some larger number of workers. I am trying to remember why it is set to 1 right now...

We run the datagrepper pod with 4 gunicorn workers and haven't seen issues.

I've submitted a PR on bodhi to improve the check script, but we should fix Greenwave before we can test these changes.
So could you let us know once you've fixed this? :)

The Gunicorn docs say, about choosing number of workers:

Gunicorn relies on the operating system to provide all of the load balancing when handling requests. Generally we recommend (2 x $num_cores) + 1 as the number of workers to start off with. While not overly scientific, the formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.
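For concreteness, that formula is usually expressed in a gunicorn config file along these lines (a sketch only; the filename and config-file approach are just for illustration):

```python
# gunicorn.conf.py (illustrative)
import multiprocessing

# "(2 x $num_cores) + 1" from the gunicorn docs.
workers = multiprocessing.cpu_count() * 2 + 1
```

Whether cpu_count() reports anything meaningful from inside a pod is another question.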

I'm not sure how that translates to OpenShift land. How many cores are exposed to each pod? I don't think OpenShift even gives you a way to know that (as an application developer).

Anyway, we can change it to something like 8 workers, which should be enough to avoid this problem. Of course, if the pod ends up handling 8 slow requests concurrently, the liveness check can still time out...

Added --workers 8 to the Dockerfile that is used by the BuildConfig in Fedora's OpenShift:

https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=b53181dce936571287413a50f6d93830a11ae541

although we also maintain copies of the Dockerfile in the official container/greenwave dist-git and in greenwave's git repo, so the same change should go into those places for consistency...

Applied this change on stage and prod. OpenShift has done its magic and now the pods are running 8 gunicorn workers each. This should alleviate the liveness probe failures.

The pods in the most recent Greenwave deployment on os.fedoraproject.org haven't been failing their liveness check, at least in the ~18 hours since that deployment was done.

(There doesn't seem to be any way to look at historical events for pods which were part of previous deleted deployments.)

But I think it's safe to assume that --workers 8 has fixed this problem. I'll apply it in greenwave's source tree itself and also the official Fedora container dist-git.

Metadata Update from @ralph:
- Issue status updated to: Closed (was: Open)

6 years ago
