#11657 Bodhi pod restarting
Closed: Fixed 3 months ago by kevin. Opened 5 months ago by mattia.

During the last week I have been receiving two or three alerts per day about the bodhi web pod restarting.
From the OpenShift console it looks like there are spikes in CPU and memory consumption just before the pod gets restarted.
However, I have no access to the logs, so could someone look into what the issue is there?


Seems like some sort of db issue?

2023-12-03 18:06:34,173 INFO [bodhi.server][MainThread] Bodhi ready and at your service!
2023-12-03 18:06:34,782 ERROR [bodhi.server][ThreadPoolExecutor-0_0] Error caught. Handling HTML response.
Traceback (most recent call last):
File "/usr/lib64/python3.11/site-packages/sqlalchemy/engine/base.py", line 1910, in _execute_context
self.dialect.do_execute(
File "/usr/lib64/python3.11/site-packages/sqlalchemy/engine/default.py", line 736, in do_execute
cursor.execute(statement, parameters)
psycopg2.OperationalError: unexpected response from server; first received character was "�"
lost synchronization with server: got message type "6", length 876033280

some more tracebacks... then...

[SQL: SELECT releases.id AS releases_id, releases.name AS releases_name, releases.long_name AS releases_long_name, releases.version AS releases_version, releases.id_prefix AS releases_id_prefix, releases.branch AS releases_branch, releases.dist_tag AS releases_dist_tag, releases.stable_tag AS releases_stable_tag, releases.testing_tag AS releases_testing_tag, releases.candidate_tag AS releases_candidate_tag, releases.pending_signing_tag AS releases_pending_signing_tag, releases.pending_testing_tag AS releases_pending_testing_tag, releases.pending_stable_tag AS releases_pending_stable_tag, releases.override_tag AS releases_override_tag, releases.mail_template AS releases_mail_template, releases.state AS releases_state, releases.composed_by_bodhi AS releases_composed_by_bodhi, releases.create_automatic_updates AS releases_create_automatic_updates, releases.package_manager AS releases_package_manager, releases.testing_repository AS releases_testing_repository, releases.eol AS releases_eol
FROM releases
WHERE releases.name = %(name_1)s OR releases.name = %(name_2)s OR releases.version = %(version_1)s
LIMIT %(param_1)s]
[parameters: {'name_1': 'F26', 'name_2': 'F26', 'version_1': 'F26', 'param_1': 1}]
(Background on this error at: https://sqlalche.me/e/14/e3q8)

It might be that it's getting a big request and failing the liveness probe? Not sure.

The 'lost synchronization' error just after the pod restart is something I've always seen (perhaps it is trying to access the database before everything is fully up).
The other SQL select doesn't seem very demanding, just a single release queried by name...

Perhaps we should just re-deploy and things will be automagically fixed, as has happened other times?...

I've just fired a new build of the bodhi-base image and the pods were automatically redeployed... let's see how things go today.

BTW, before the pods restarted I noticed that the bodhi-celery pod was using over 10 GB of RAM... I don't know if this is the expected usage for Celery. After the restart it is now stable at 1.5 GB.

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Reporter (was: Needs Review)

5 months ago

It restarted again at 10:39 UTC.
The liveness probe failed at 10:27 and 10:38 and restarted the pod, but meanwhile updates were being processed properly and I can't see any spike in CPU or memory consumption, so...

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Waiting on Reporter)
- Issue tagged with: Needs investigation

5 months ago

My guess is now that the ongoing networking problems cause the pod to be unreachable from whatever service is monitoring its availability, so it gets restarted even though it is working correctly.

It's not... bodhi is all in IAD2, the networking issue was in RDU2. ;)

What seems to happen is that the bodhi pod gets stuck processing something, the health check fails, and OpenShift kills and restarts it.

But what is causing it to get stuck, that's the question.

One more thing I noticed: those restarts seem to have some regularity.
They seem to happen every day at 0, 6, 12 and 18 UTC. Probably some cron script or Celery task.

So, my theory is:

some large request or set of requests comes in then, bodhi starts processing it, the database gets really busy churning on it, then OpenShift sees bodhi not responding and so it kills the web pods. ;(

I have done several things to try and fix this:

  • I tweaked the db config to hopefully be more performant. It should have more workers and use more memory now. So ideally it would process things fast enough to avoid the issue.
  • I increased bodhi's liveness probe time from 10s to 15s.

Along the way I found that bodhi's 'mid' pod, which is supposed to do db schema updates, wasn't defined correctly. I fixed that and it should be working now. (I had to do that to change the timeout anyhow.)

I've noticed that bodhi doesn't have any request logs, so it's hard to say what's hitting it every 6 hours. ;( Would it be possible to enable those somehow? (at least for a short time)
I guess we'll see where we're at come the next spike.

Still rebooting :pensive:
Can we look at the Apache logs to find the request that causes the high load? Perhaps the error log or the access log?
Otherwise I'll look into how to print the requested URL from Pyramid.
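Since Pyramid applications are plain WSGI apps, one low-effort way to get request logs would be a tiny logging middleware wrapped around the app. This is only a stdlib-based sketch with hypothetical names (`request_logger`, `demo_app`), not Bodhi's actual wiring:

```python
import logging
from wsgiref.util import setup_testing_defaults

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("request_log")  # hypothetical logger name


def request_logger(app):
    """WSGI middleware: log method, path and query string of each request."""
    def middleware(environ, start_response):
        query = environ.get("QUERY_STRING", "")
        url = environ.get("PATH_INFO", "") + ("?" + query if query else "")
        log.info("%s %s", environ.get("REQUEST_METHOD", "-"), url)
        return app(environ, start_response)
    return middleware


# Minimal demo app standing in for the real Bodhi WSGI application.
def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


wrapped = request_logger(demo_app)
environ = {}
setup_testing_defaults(environ)
environ["PATH_INFO"] = "/updates/"
environ["QUERY_STRING"] = "page=18050"
body = b"".join(wrapped(environ, lambda status, headers: None))
print(body.decode())  # ok
```

In a real deployment the middleware would wrap the app where the WSGI pipeline is assembled (or be registered as a Pyramid tween), so it could be enabled just for a short time.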

Or we can try to update to bodhi 8.0 (which I'm going to release) and see if, by any chance, the problem gets fixed...

Yeah. ;(

So a few observations:

  • There's a ton of crawler traffic. I wonder if there's any benefit to letting crawlers crawl here at all? We do block some paths, but perhaps we should just disallow * ?

  • I looked at the proxy logs and I couldn't find anything that broke things, but there are some crazy crawler hits like:

https://bodhi.fedoraproject.org//updates/?ike=pdftk-java-3.3.1&page=593
and
https://bodhi.fedoraproject.org//updates/?page=18050
and
https://bodhi.fedoraproject.org/updates/?page=1384&releases=F34&status=stable

but all those loaded just fine here. (Although perhaps they are cached somehow now?)

  • Looking around at the errors we see led me to:

https://stackoverflow.com/questions/43944787/sqlalchemy-celery-with-scoped-session-error/54751019#54751019

basically a problem with using connection pools and database connections. I have no idea whether this applies to bodhi's code, but it might be worth looking into?

  • It would be great if we could get the last request it was processing when it became unresponsive, but I'm not sure how possible that is. ;(

I've locally tested this PR:
https://github.com/mattiaverga/bodhi/commit/6b657c467866159383dcd2afa4a1855c7449abcd

It makes SQLAlchemy disable the connection pool and sets a 30-second timeout on queries.
However, I'm not comfortable proposing it as "hey, this will fix all problems", since I'm not an expert on these topics... perhaps there's someone in infra / releng / CPE who knows better what impact this change may have.
AFAIK all db operations Bodhi runs are "atomic": a connection is opened at the start and closed at the end, so I don't think we're reusing connections from a pool. Perhaps we only need to set the query timeout.

If this turns out to be good, I could officially include it in the next Bodhi release, in a more customizable way via the config file rather than hardcoded.
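For readers unfamiliar with the SQLAlchemy side, the idea in the linked commit can be sketched roughly like this. This is a hypothetical illustration, not Bodhi's actual engine setup: the PostgreSQL-specific `statement_timeout` line is shown as a comment, and the runnable part uses in-memory SQLite so it works anywhere:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Roughly what the patch described above does (hypothetical sketch).
# Against PostgreSQL/psycopg2 one would also cap per-query time, e.g.:
#   create_engine(db_url, poolclass=NullPool,
#                 connect_args={"options": "-c statement_timeout=30000"})
# where 30000 is milliseconds, i.e. the 30-second timeout.
engine = create_engine("sqlite://", poolclass=NullPool)

# With NullPool, every connect() opens a brand-new DBAPI connection and
# closing it really closes it: nothing is held in a pool for reuse.
with engine.connect() as conn:
    value = conn.execute(text("SELECT 1")).scalar()

print(value)
```

The trade-off is that every request pays the connection-setup cost, which is why trying the timeout and the NullPool change separately, as discussed below in the thread, seems sensible.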

Cool, yeah, I don't know if this is the problem, related to the problem, or completely unrelated. ;)

Perhaps @zlopez and/or @abompard could take a look tomorrow?

My main concern here is the timeout. @mattia do you think that 30 sec will be enough for a query?

My main concern here is the timeout. @mattia do you think that 30 sec will be enough for a query?

Well, I think so... if I remember correctly, in the past we had a 30-second timeout set on http replies which was causing errors, but that was because of crazy loops in serializing objects, not because of slow db replies.

Maybe we can apply only the timeout change as a hotpatch (leaving out the null connection pool for the moment), see what happens, and revert it quickly if things explode?

Or try with Bodhi 8.0 (which I released and built in koji a couple of days ago). FYI, if anyone is going to upgrade Bodhi, I've prepared a couple of ansible PRs needed for the upgrade to 8.0:
- https://pagure.io/fedora-infra/ansible/pull-request/1617
- https://pagure.io/fedora-infra/ansible/pull-request/1686

+1 for trying the change and seeing if it helps with the issue

So does 8.0 have that NullPool change?

I think we should roll that out this week.... stg first?

So does 8.0 have that NullPool change?

No, I've not pushed that change into Bodhi's code. I'd like to check whether v8.0 already fixes the problem as a side effect, since it already has code changes targeted at sqlalchemy > 1.4.
If the problem persists I'd like to apply those changes (NullPool and timeout) in two separate patches, applied directly in ansible, to check whether either of them has an effect.

I think we should roll that out this week.... stg first?

bodhi-stg is already on 8.0.1 ;-)
@adamwill asked for https://github.com/fedora-infra/bodhi/pull/5581 to be merged before the prod deployment, I'll bake an 8.0.2 ASAP.

thanks. please give me a heads up when deploying 8.x to prod, as I'll have to change the openQA scheduler code at that point, or else it will start unnecessarily re-running tests.

ok, can you do ansible PRs for the nullPool / timeout? we can try them in prod with 7.x.x ?

Or do you prefer to just wait and push 8.0.x out to prod, then try the nullPool / timeout changes?

Web pods crashed again today... I've submitted https://pagure.io/fedora-infra/ansible/pull-request/1755 to revert the nullPool patch and apply a timeout of 30 seconds to all database queries. I don't think any query should require more than that to retrieve data.

Side note: I suspected that the database errors seen right at web pod startup were due to the other bodhi pods (celery, celery-beat and consumer) being started before the web pod. Today, after kevin merged the above patch, the web pod started before the others and indeed no errors were logged. So it might be useful to make sure the web pod always starts before the others, but I don't see any parameter for that in the OpenShift docs...

Setting a timeout on database queries didn't work out, the pod is still restarting.
I am now looking at the pod configuration side. What is the purpose of activeDeadlineSeconds: 21600 set on the pod? It looks suspicious: 21600 seconds is exactly the six hours between restarts...

The docs say: "ActiveDeadlineSeconds is the duration in seconds that the deployer pods for this deployment config may be active on a node before the system actively tries to terminate them."
So I read that as the amount of time a deployment rollout can take before it is terminated as not having worked.

But the last deployment:

  Started:      Sun, 28 Jan 2024 17:02:23 +0000
  Finished:     Sun, 28 Jan 2024 17:03:34 +0000

So, it took less than a minute and finished. ;)

Nothing we've tried seems to solve the problem, so I'm just taking shots in the dark.
In another thread I remember reading from you, kevin, that some builders were showing CPU spikes with certain kernel versions. That looks like what is happening on the Bodhi web pod: the high CPU load doesn't seem due to an external request or to a database query taking too long to respond.
So I checked in a terminal which kernel the pod is running and got a (to me) surprising output:

$ uname -r
5.14.0-284.48.1.el9_2.x86_64

The base image is an F38 machine, so why is it running an EL9 kernel? Also:

$ dnf list kernel --all --showduplicates
Last metadata expiration check: 0:00:59 ago on Sat Feb 10 09:50:11 2024.
Available Packages
kernel.x86_64                                                      6.2.9-300.fc38                                                       fedora 
kernel.x86_64                                                      6.7.3-100.fc38                                                       updates

$ dnf list installed kernel                       
Error: No matching Packages to list

Yes, this is how containers work. ;) They are not VMs...

Since they run in our OpenShift, and that OpenShift uses Red Hat CoreOS and thus the RHEL kernel, it's expected to see that kernel. ;)

I am not sure what else could be going on here... perhaps once again @darknao could glance at it and find the problem in 2 minutes? ;)

One random thing I wondered... could it be that a connection to waiverdb/greenwave is hanging and causing the process on the bodhi side to hang too long?

I checked the logs and saw a lot of liveness probe failures.
Each event matches a "spike" of CPU (from 0 to 0.5 CPU, almost nothing) and a spike of network usage (from ~100 kb/s to 1.5 Mb/s).
The liveness probe is configured to query the webserver every minute and, after 3 failures, kill the pod.
I also noticed the pod logs a lot of db errors when it's restarted, such as:

Feb 10 12:11:06 sqlalchemy.exc.ResourceClosedError: This result object does not return rows. It has been closed automatically.

I suspect the db is overloaded from a previous query, but there is not much that can be done from the OpenShift perspective.
It's not uncommon to add a db connection check in the container entrypoint, or an init container, to ensure the DB is ready to accept connections before starting the application.
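As an illustration of that suggestion, an init container could block pod startup until the database answers. This is only a sketch: the image, hostname and variable names below are placeholders, not the actual Bodhi deployment config:

```yaml
# Hypothetical initContainer for the web pod (image and host are placeholders)
initContainers:
  - name: wait-for-db
    image: example.io/some-image-with-postgresql-client:latest  # placeholder
    command:
      - sh
      - -c
      # pg_isready exits 0 only once the server accepts connections
      - until pg_isready -h "$DB_HOST" -q; do echo "waiting for db"; sleep 2; done
    env:
      - name: DB_HOST
        value: db.example.org  # placeholder hostname
```

With something like this in place, the "lost synchronization" errors seen right after a restart should no longer appear, since the application only starts once the DB is reachable.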

Here are the liveness probe logs for today:

00:15:32.825658    3796 prober.go:107] "Probe failed" probeType="Liveness"

00:17:32.825654    3796 prober.go:107] "Probe failed" probeType="Liveness"
00:18:32.824647    3796 prober.go:107] "Probe failed" probeType="Liveness"
00:19:32.825726    3796 prober.go:107] "Probe failed" probeType="Liveness"
[killed]

00:21:32.825682    3796 prober.go:107] "Probe failed" probeType="Liveness"
00:23:32.825690    3796 prober.go:107] "Probe failed" probeType="Liveness"
00:25:32.825694    3796 prober.go:107] "Probe failed" probeType="Liveness"


06:08:32.825652    3796 prober.go:107] "Probe failed" probeType="Liveness"
06:10:32.825677    3796 prober.go:107] "Probe failed" probeType="Liveness"
06:12:32.826652    3796 prober.go:107] "Probe failed" probeType="Liveness"
06:14:32.825666    3796 prober.go:107] "Probe failed" probeType="Liveness"
06:15:32.825679    3796 prober.go:107] "Probe failed" probeType="Liveness"


12:08:32.824693    3796 prober.go:107] "Probe failed" probeType="Liveness"
12:09:32.825696    3796 prober.go:107] "Probe failed" probeType="Liveness"
12:10:32.824650    3796 prober.go:107] "Probe failed" probeType="Liveness"
[killed]

12:13:32.825794    3796 prober.go:107] "Probe failed" probeType="Liveness"
12:14:32.824680    3796 prober.go:107] "Probe failed" probeType="Liveness"
12:15:32.825734    3796 prober.go:107] "Probe failed" probeType="Liveness"
[killed]

12:17:32.824700    3796 prober.go:107] "Probe failed" probeType="Liveness"
12:20:32.826658    3796 prober.go:107] "Probe failed" probeType="Liveness"


18:07:32.825661    3796 prober.go:107] "Probe failed" probeType="Liveness"

18:09:32.825677    3796 prober.go:107] "Probe failed" probeType="Liveness"
18:10:32.825662    3796 prober.go:107] "Probe failed" probeType="Liveness"

18:12:32.825695    3796 prober.go:107] "Probe failed" probeType="Liveness" 
18:13:32.825668    3796 prober.go:107] "Probe failed" probeType="Liveness"

18:19:32.824676    3796 prober.go:107] "Probe failed" probeType="Liveness"

Each of those "unresponsiveness" events seems to last from 7 to 12 minutes, but not all of them trigger a restart.
I recommend changing the liveness configuration as follows, so the pod can survive those events:

livenessProbe:
  periodSeconds: 120
  failureThreshold: 8

That would give you 16 minutes before the pod gets killed.
That will not solve the DB issue, but at least, the pod will not be restarted.

(btw, DeploymentConfigs are deprecated. You should convert everything in this namespace to Deployments)

I did some more testing with the liveness probe during the last event; the /healthz/live request took from 10 to 60 seconds to respond.

So my updated recommendation is:

livenessProbe:
  periodSeconds: 120
  timeoutSeconds: 60
  failureThreshold: 5
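For reference, the kill window implied by these settings is roughly periodSeconds × failureThreshold, since Kubernetes restarts the pod only after that many consecutive probe failures (a quick check of the arithmetic, helper name is ours):

```python
# Kubernetes kills a pod after `failureThreshold` consecutive liveness
# probe failures, probing once every `periodSeconds`.
def kill_window_minutes(period_seconds: int, failure_threshold: int) -> float:
    return period_seconds * failure_threshold / 60

print(kill_window_minutes(120, 8))  # earlier recommendation: 16.0 minutes
print(kill_window_minutes(120, 5))  # updated recommendation: 10.0 minutes
```

Both windows comfortably exceed the 7-to-12-minute unresponsiveness events observed above.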

I've pushed that... let's see how it goes.

So far I've seen no more restarts and nothing seems affected by those spikes. Shall we consider this "fixed" and maybe figure out the real cause in the future? (or maybe the spikes will fade away by themselves)

Yeah. I also did some tweaking on the database server, but I'm not sure that's related.

But sure, let's close for now.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 months ago
