Two composes failed. I tried to resume them, but then bodhi-backend failed:
```
F31M-testing : 1 updates (requested)
F30M-testing : 1 updates (requested)
```
```
Dec 22 00:49:45 bodhi-backend01.phx2.fedoraproject.org systemd[1]: Started Bodhi's Celery worker.
Dec 22 00:49:48 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:49:48,756: INFO/MainProcess] Connected to amqps://bodhi:**@rabbitmq.fedoraproject.org:5671//bodhi
Dec 22 00:49:48 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:49:48,789: INFO/MainProcess] mingle: searching for neighbors
Dec 22 00:49:49 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:49:49,902: INFO/MainProcess] mingle: all alone
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:50:56,777: CRITICAL/MainProcess] Unrecoverable error: NotFound(404, "NOT_FOUND - failed to perform operation on queue 'has_koji_mount' in vhost '/bodhi' due to timeout", (50, 10), 'Queue.declare')
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: Traceback (most recent call last):
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/worker/worker.py", line 205, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.blueprint.start(self)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     step.start(parent)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     return self.obj.start()
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/worker/consumer/consumer.py", line 317, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     blueprint.start(self)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/bootsteps.py", line 119, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     step.start(parent)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/worker/consumer/tasks.py", line 41, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     c.connection, on_decode_error=c.on_decode_error,
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/app/amqp.py", line 297, in TaskConsumer
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     **kw
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/messaging.py", line 386, in __init__
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.revive(self.channel)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/messaging.py", line 408, in revive
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.declare()
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/messaging.py", line 421, in declare
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     queue.declare()
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/entity.py", line 605, in declare
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self._create_queue(nowait=nowait, channel=channel)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/entity.py", line 614, in _create_queue
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.queue_declare(nowait=nowait, passive=False, channel=channel)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/entity.py", line 649, in queue_declare
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     nowait=nowait,
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/channel.py", line 1154, in queue_declare
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     spec.Queue.DeclareOk, returns_tuple=True,
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/abstract_channel.py", line 80, in wait
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.connection.drain_events(timeout=timeout)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/connection.py", line 500, in drain_events
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     while not self.blocking_read(timeout):
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/connection.py", line 506, in blocking_read
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     return self.on_inbound_frame(frame)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/method_framing.py", line 55, in on_frame
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     callback(channel, method_sig, buf, None)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/connection.py", line 510, in on_inbound_method
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     method_sig, payload, content,
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/abstract_channel.py", line 126, in dispatch_method
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     listener(*args)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/channel.py", line 282, in _on_close
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     reply_code, reply_text, (class_id, method_id), ChannelError,
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: amqp.exceptions.NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'has_koji_mount' in vhost '/bodhi' due to timeout
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:  -------------- celery@bodhi-backend01.phx2.fedoraproject.org v4.2.1 (windowlicker)
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: ---- **** -----
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: --- * ***  * -- Linux-5.3.8-200.fc30.x86_64-x86_64-with-fedora-30-Thirty 2019-12-22 00:49:46
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: -- * - **** ---
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- [config]
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- .> app:         __main__:0x7f99cc8f9650
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- .> transport:   amqps://bodhi:**@rabbitmq.fedoraproject.org:5671//bodhi
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- .> results:     rpc://
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - *** --- * --- .> concurrency: 30 (prefork)
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: --- ***** -----
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:  -------------- [queues]
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:                 .> has_koji_mount   exchange=has_koji_mount(direct) key=has_koji_mount
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [tasks]
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   . compose
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   . handle_update
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org systemd[1]: bodhi-celery.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org systemd[1]: bodhi-celery.service: Failed with result 'exit-code'.
```
This is blocking updates, so would be good to fix asap.
@cverna and @abompard or anyone else on the bodhi team, can you look?
So, I managed to get it working again by rebooting rabbitmq02 (I rebooted 01 also, but it didn't seem to matter).
I don't understand what state it was in tho...
Before the reboot:
```
[root@rabbitmq01 ~][PROD]# rabbitmqctl list_queues -p "/pubsub"
Listing queues
amqp_to_zmq	386
faf	10697
openqa.stg_scheduler	0
openqa_relvalamiconsumer.test	6438
federation: amq.topic -> rabbit@rabbitmq01.phx2.fedoraproject.org:/public_pubsub:amq.topic	0
robosignatory	1
fedora-build-checks	0
greenwave	0
wiki	577508
```
After the reboot:
```
[root@rabbitmq03 ~][PROD]# rabbitmqctl list_queues -p /pubsub --online
Listing queues
openqa_checkcomp	0
openqa_relvalamiconsumer.test	6438
bodhi_composer	0
openqa_wiki_reporter	0
centos-ci	1029
bodhi	0
openqa_autocloudreporter.stg	0
amqp_bridge_verify_missing	19
federation: zmq.topic -> rabbit@rabbitmq01.phx2.fedoraproject.org:/public_pubsub:zmq.topic	0
mts	0
the-new-hotness	55
robosignatory	0
openqa_scheduler	0
openqa_resultsdb_reporter	0
openqa_autocloudreporter	0
openqa.stg_scheduler	0
faf	10707
amqp_to_zmq	0
federation: amq.topic -> rabbit@rabbitmq01.phx2.fedoraproject.org:/public_pubsub:amq.topic	0
resultsdb_ci_listener	0
wiki	578159
openqa_relvalamiconsumer	0
openqa_relvalconsumer.test	0
fedora-build-checks	0
koschei	1
openqa_relvalconsumer	0
coreos	0
greenwave	38
```
I'd appreciate any ideas on how this happened and how to prevent it. :)
This might be related: no builds have been pushed through bodhi to rawhide for half a day ...
https://bodhi.fedoraproject.org/releases/F32
Looks like first, moving things from testing → stable stopped happening, and then moving stuff from pending → testing stopped a few hours later, as well.
Metadata Update from @mizdebsk: - Issue tagged with: bodhi
@kevin it seems that many of the rawhide updates are blocked because the builds are waiting to be signed. Or they are still tagged as f32-signing-pending.
Could you check if robosignatory needs a restart? If it is working as expected, we might need to re-tag the builds waiting to be signed so that the rawhide gating pipeline can restart.
I found basically all the openQA consumers weren't picking up messages after the outage, even though they looked as if they had reconnected successfully and were working. I had to restart all the services manually.
I suspect we need to restart basically all fedora-messaging consumers that haven't been restarted yet to be sure things are working.
I did restart it, and it's been signing new things fine.
I wonder if the messages for those things were lost and we need to retag them to get it to pick them up now? I can try that...
I'm consolidating:
#8482 Koji build fails with "GenericError: Build already in progress" and #8480 koji builds not getting tagged
since I think they are all this same issue and we can debug and provide status here.
Symptoms:
Things I have done:
I guess let's see if things are better now; if someone could figure out how to get the updates pushes out, that would be good.
The odd thing about all this is that there aren't any changes I can see that were made when it started happening...
So, responsiveness is much better, and signing seems to have caught up... however, tasks are sometimes still not finishing right. They linger in tagging.
Hello infra
Thanks a lot for taking care of this issue. :D
One of my builds was not tagged, and hence the build failed. However, when I rebuild, I get an error message stating that "the build already exists". I am unable to find the build, and hence am stuck. I can bump the Release tag and rebuild, but I am wondering if there is a cached copy of the build somewhere that can be used instead.
I am not sure if this is the correct place to ask, but I didn't want to bug the devel list, and hence am posting here.
Is there something we can do to get the borked bodhi updates un-stuck?
With a primitive bodhi query, it looks like there are 100+ updates stuck (?) in pending state across the three active releases:
https://bodhi.fedoraproject.org/updates/?status=pending&request=testing
There are also 200+ updates with a request for stable, which seems like an unusually high number (and at least some of them have already been ejected from pushes):
https://bodhi.fedoraproject.org/updates/?status=testing&request=stable
And, what can a packager do if an update is "ejected" by bodhi due to being in the wrong tag?
I took care of the ejected updates, they should be in a correct state in bodhi now.
For the rest of the updates we just need to run new updates composes. We finally managed to unblock the composes from last weekend, so we should be able to run a set of fresh composes.
I'll try to do that today and monitor it, since koji still seems slow at times, which impacts bodhi.
@cverna can you fix https://bodhi.fedoraproject.org/updates/FEDORA-2019-9184b1e049 please?
Also I see that https://bodhi.fedoraproject.org/updates/FEDORA-2019-c6029679d3 got stuck somehow..
I could resume my builds by manually tagging them with: `koji tag-build f31-updates-candidate <package-name>`
I hope this helps
Thanks a lot @tartina, it worked beautifully. :smile:
I reinstalled proxy101 and proxy110 as fedora 31.
Builders are still dropping off and stopping checking in, with "koji.TaskManager:1210 Attempting to take task..." messages.
Some updates pushes finally finished, but there are some new ones stuck now.
I guess it still could be a proxy issue, or a koji bug, but I still don't see anything that could have changed or any errors showing what's breaking.
Perhaps the koji folks could tell us what the "attempting to take task" hang means?
@tkopecek any ideas?
Another thing: There are a TON (273+) '/usr/bin/bodhi-approve-testing' cron jobs on bodhi-backend01
@cverna Could this be contributing to issues or another symptom?
I think it is a symptom, the cronjob is dealing with tagging builds in koji for rawhide gating so if koji is slow to process these requests I guess that could leave the cron in a weird state.
The update eject message is also a symptom: during the compose, the builds are moved to the stable tag (`f31-updates`, for example), but some of the tag-build tasks in koji fail, which makes the whole compose fail. So the next time we run the compose, some of the builds in the updates are already tagged into the stable tag, but the update is not in the stable status.
For example, this is a tag-build task that failed: https://koji.fedoraproject.org/koji/taskinfo?taskID=39945241
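The stuck state described above is a simple mismatch between koji's tags and bodhi's status. A hypothetical predicate to spot it (names and tag semantics here are illustrative, not Bodhi's actual code):

```python
def needs_manual_fix(update_status, requested, build_tags, stable_tag, candidate_tag):
    """True when a half-failed compose left the builds already in the
    destination tag while the update's status never advanced -- the
    state that makes Bodhi eject the update from the next compose."""
    destination = stable_tag if requested == "stable" else candidate_tag
    already_tagged = destination in build_tags
    status_advanced = update_status == requested
    return already_tagged and not status_advanced

# Build already tagged into f31-updates, but the update is still "testing"
# with a pending request for stable: the stuck case from this thread.
print(needs_manual_fix("testing", "stable",
                       {"f31-updates"}, "f31-updates", "f31-updates-candidate"))  # True
```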
Is it possible this is also causing - or is caused by - greenwave / taskotron issues?
No matter which bodhi update I look at, I get the "failed to talk to greenwave" error message instead of the "ignored" state, and no taskotron results get fetched for any of my updates, not even ones that are over a year old.
I wouldn't think so, but it's possible...
I would agree, except the place it hangs is adding a comment to an update. Look at all the locks in db01's bodhi2 db:
```
...snip a bunch more of the same...
 bodhi2 |  |            | ExclusiveLock | t | virtualxid    | bodhi2 | 2019-12-29 01:33:07.221469+00 | 00:48:26.857518 | 21308 | UPDATE comments SET update_id=176147, user_id=91 WHERE comments.id = 1180412
 bodhi2 |  | 2721427923 | ExclusiveLock | t | transactionid | bodhi2 | 2019-12-29 01:24:07.640285+00 | 00:57:26.438702 | 13713 | UPDATE comments SET update_id=176147, user_id=91 WHERE comments.id = 1180397
 bodhi2 |  | 2721388468 | ExclusiveLock | t | transactionid | bodhi2 | 2019-12-29 00:54:06.819603+00 | 01:27:27.259384 | 18230 | UPDATE comments SET update_id=176147, user_id=91 WHERE comments.id = 1180386
 bodhi2 |  | 2721428424 | ExclusiveLock | t | transactionid | bodhi2 | 2019-12-29 01:39:07.720449+00 | 00:42:26.358538 | 26261 | UPDATE comments SET update_id=176147, user_id=91 WHERE comments.id = 1180414
 bodhi2 |  | 2721428952 | ExclusiveLock | t | transactionid | bodhi2 | 2019-12-29 01:54:07.422068+00 | 00:27:26.656919 |  6601 | UPDATE comments SET update_id=176147, user_id=91 WHERE comments.id = 1180421
 bodhi2 |  | 2721411721 | ExclusiveLock | t | transactionid | bodhi2 | 2019-12-29 01:06:08.436176+00 | 01:15:25.642811 | 29656 | UPDATE comments SET update_id=176147, user_id=91 WHERE comments.id = 1180391
 bodhi2 |  | 2721370917 | ExclusiveLock | t | transactionid | bodhi2 | 2019-12-29 00:45:06.902888+00 | 01:36:27.176099 |  9914 | UPDATE comments SET update_id=176147, user_id=91 WHERE comments.id = 1180383
(226 rows)
```
All stuck on trying to add comments to that one update.
Here's one example/datapoint: a scratch build for which the build completed after 50min, but it took Koji more than 14 hours to close the task.
Can't build for F31 RetryError: unable to retry call 11 (method host.initBuild) for session 92593309 https://koji.fedoraproject.org/koji/taskinfo?taskID=39986788
RetryError: unable to retry call 11 (method host.initBuild) for session 92593309
I tried a bunch more things today to isolate it.
- It's not caused by the balancer setup in apache on the proxies.
- It's not caused by the koji hubs being overloaded.
- I tried switching back to haproxy with no real change.
- I tried adjusting balancer limits and options with no real change.
I am down to thinking it's a koji bug, but I have no idea why it would appear now rather than right after an upgrade. There were schema changes between 1.18.1 and 1.19.0, so we can't downgrade that far. 1.19.1 was only some small fixes that don't seem likely to have anything to do with this.
Things are somewhat better now, but the issue is still happening. I think if we can unstick bodhi we can probably get updates flowing again, but to fix the underlying issue we probably need some koji developers, or someone able to trace the koji codepaths.
The failure seems to be:
A builder is working along and then tries to take a task:

```
Dec 31 00:46:17 buildhw-03.phx2.fedoraproject.org kojid[1982746]: 2019-12-31 00:46:17,562 [INFO] {1982746} koji.TaskManager:1210 Attempting to take task 40002458
```
The task shows actually as being on that builder after that so I think it's taken it, but it does not ever start processing it, and it fails to check into the hub after that.
Thanks Kevin for all your work on this difficult tricky problem.
Another funky scratch build, this time it doesn't start building, while I see newer koschei builds getting processed ahead of it. Also seems to agree with your diagnosis. It seems the problem is intermittent?
I am restarting builders as they fail to check in from time to time to keep things going...
(This is probably completely unrelated (I should file a Koji bug), but I often see failed Koji builds pointing one to "see root.log" when actually the failure is in the "build.log": this has been happening for months though now...)
Should a koji bug be opened?
Looking at the spec diff, 1.19.1 did add a new service, koji-sweep-db. Did we enable that in prod? Could it be involved?
Sure, I was hoping to catch people here, but that's a good idea. Filed: https://pagure.io/koji/issue/1897
We did not enable that (although we have had a local cron job that does the same thing since forever).
Okay.
For whatever it's worth, by my reading of the code, the error is in TaskManager.takeTask(), which is called from TaskManager.getNextTask(), which is part of the daemon main loop.
The main loop is very simple, it pretty much just calls TaskManager.updateBuildroots(), TaskManager.updateTasks() and TaskManager.getNextTask() in a loop forever. So we must be getting stuck somewhere in that fairly tight loop.
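From that description, the daemon's main loop has roughly this shape; a simplified sketch with stubbed methods, not Koji's actual code (only the three method names come from the reading above):

```python
import time

class TaskManager:
    """Simplified stand-in for kojid's TaskManager (not the real implementation)."""

    def updateBuildroots(self):
        pass  # reconcile local buildroots with the hub

    def updateTasks(self):
        pass  # report status of running tasks back to the hub

    def getNextTask(self):
        # Asks the hub for work and tries to take a task. If the RPC reply
        # never arrives, the whole daemon blocks right here -- which would
        # explain a builder that "takes" a task and then goes silent.
        return False

def main_loop(tm, iterations=3, sleep=0):
    """One blocking call anywhere in this loop stalls the entire daemon."""
    for _ in range(iterations):
        tm.updateBuildroots()
        tm.updateTasks()
        tm.getNextTask()
        time.sleep(sleep)

main_loop(TaskManager())
```

The point of the sketch: there is no concurrency here, so a single hung XML-RPC call in `getNextTask()` stops check-ins entirely.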
Are you logging at debug or info level ATM? Either way, if I were doing it, my next step would just be to throw a ton more log statements into the TaskManager to try and figure out exactly where it gets stuck...
I've set up a few builders with `--debug --debug-task --debug-xmlrpc` and am waiting for them to drop out so I can examine the logs.
buildhw-10:
```
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,073 [INFO] {530977} koji.TaskManager:1210 Attempting to take task 40025476
...
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2580 url: https://koji.fedoraproject.org/kojihub?session-id=92635914&session-key=<redacted>&callnum=1093
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 headers: {'User-Agent': 'koji/1', 'Content-Type': 'text/xml', 'Content-Length': '159'}
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 data: "<?xml version='1.0'?>\n<methodCall>\n<methodName>host.openTask</methodName>\n<params>\n<param>\n<value><int>40025476</int></value>\n</param>\n</params>\n</methodCall>\n"
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 stream: True
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 verify: '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem'
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 timeout: 43200
Jan 01 09:38:02 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 09:38:02,460 [DEBUG] {535641} koji:2213 Opening new requests session
```
So, it waited almost 3 hours before opening a new session (the timeout was set to 3 hours on the proxies)
koji01:
```
[Wed Jan 01 06:44:28.106504 2020] [:error] [pid 5258] 2020-01-01 06:44:28,106 [INFO] m=host.openTask u=buildhw-10.phx2.fedoraproject.org p=5258 r=10.5.126.9:33346 koji.xmlrpc: Completed method host.openTask for session 92635914 (#1093): 0.010418 seconds, rss 36736, stime 0.362359'
```
koji02:
```
[Wed Jan 01 06:44:28.050079 2020] [:error] [pid 26682] 2020-01-01 06:44:28,049 [INFO] m=host.getLoadData u=buildhw-10.phx2.fedoraproject.org p=26682 r=10.5.126.9:41044 koji.xmlrpc: Completed method host.getLoadData for session 92635914 (#1091): 0.185286 seconds, rss 36196, stime 0.166596'
```
So, I think it's waiting for the hub to send it something back and not getting it, then times out after 3 hours. (although I now set this timeout much lower)
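The 3-hour wait matches a client-side read timeout: the builder sits blocked in a socket read, waiting for a reply that was lost. A minimal stdlib illustration of the underlying mechanism (not Koji code): with a short timeout the lost reply becomes a quick, retryable error instead of a silent multi-hour hang.

```python
import socket

# A connected socket pair stands in for the builder <-> hub connection.
hub, builder = socket.socketpair()

# With no timeout, recv() would block indefinitely if the hub never answers.
# A short timeout turns the hang into an exception the caller can handle
# by reconnecting and retrying the call.
builder.settimeout(0.2)

try:
    builder.recv(4096)      # the "hub" sends nothing: simulated lost reply
    timed_out = False
except socket.timeout:
    timed_out = True

hub.close()
builder.close()
print(timed_out)  # True: the client notices in 0.2s instead of 3 hours
```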
I think that this may be Koji bug 1840 - I was seeing similar symptoms in a different Koji installation (ref: BST-916) and the above patch fixed them.
I don't think that's it, as it's not just slower than normal; builders drop out and stop checking in entirely.
But in case it was, I created a build with that patch and applied it and restarted kojid and... the problem persists. :(
Thanks for the suggestion... open to other ideas.
I've looked at the RabbitMQ cluster and found the following things:
rabbitmqctl list_channels
One more thing: when no binding is declared in our rabbit/queue ansible role, the queue is bound to every message on the bus, which seems counter-intuitive and dangerous. This is why the wiki queue is so enormous. I'll purge it, unsubscribe it, and fix the role.
I've fiddled with the RabbitMQ cluster a bit, and restarted rabbitmq on the nodes one after the other. It led the cluster into a weird state, and restarting node 02 got it out of it again. In the meantime, it looks like we lost the messages stored in the queues :-( IIRC only the faf user (ABRT) had many of them.
The following user accounts did not reconnect properly:
openqa.stg_scheduler
openqa_relvalconsumer
the-new-hotness
I think the the-new-hotness consumer needs to be restarted.
I'm not sure how we got into the initial issue but as far as I can tell things seem to be back to normal now. I'm guessing something went wrong between node 02 and the rest of the cluster, and queues got out of sync.
The newer versions of RabbitMQ have a rabbitmq-diagnostics command that could have been useful... :-/
Please tell me if you keep seeing weird behaviour with the RabbitMQ cluster
Thanks for the heads-up on the queues. The autocloudreporter queue you mentioned before the edit is obsolete and can be removed; we decommissioned autocloud, so autocloudreporter no longer needs to run, and I've turned it off.
I'll remove the queue, then, thanks. I edited the comment because I realized I had a way to know if there was a consumer running before the outage, so I only kept the accounts that had one.
BTW, in case it's of any interest to you, this is what happened to one of the consumers ~3 hours before I manually restarted it after seeing your comment. (The new fpaste appears to have 1 day as the longest expiry, so read quick!) After those errors you'd sort of expect the service to fail out and automatically restart, but...no? This sort of thing happens often enough that I have a daily cron job on the openQA boxes that just restarts all the consumers...
Yeah that's because when I restarted node 02, suddenly all queues were gone, and since consumers don't have the permission to create queues (only ansible can do that) they failed with what was interpreted as a permanent error. When I restarted node02 again, the queues were back (but empty).
seems wrong that the consumer service apparently doesn't treat it as a fatal error but continues running in a zombie state, though...
Oh yeah one more thing: I just published a new version of fedora-messaging (2.0.1) that fixes a bug that prevents publishing after the client has reconnected after a connection timeout. It's on PyPI and I'm building it for Fedora right now (Rawhide done, stable releases to go).
I agree it seems wrong, are you running the consumer with the "fedora-messaging consume" command? If so we should catch it and exit with RC > 0.
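What that fix amounts to: the consume entry point should translate a permanent broker error (like the 404 on queue declare) into a nonzero exit, so systemd's restart policy kicks in instead of the process lingering as a zombie. A hypothetical sketch of the pattern, not fedora-messaging's actual code:

```python
import sys

class PermanentError(Exception):
    """Stand-in for an unrecoverable broker error (e.g. 404 NOT_FOUND on declare)."""

def run_consumer(consume):
    """Wrap a consumer loop so permanent failures exit with RC > 0,
    letting the service manager restart the unit."""
    try:
        consume()
    except PermanentError as e:
        print(f"fatal: {e}", file=sys.stderr)
        sys.exit(1)

def broken_consume():
    # Simulates the failure mode from this thread: the queue vanished
    # and the consumer lacks permission to re-create it.
    raise PermanentError("NOT_FOUND - no queue 'has_koji_mount' in vhost '/bodhi'")

try:
    run_consumer(broken_consume)
except SystemExit as e:
    print(e.code)  # 1
```

With `Restart=on-failure` in the unit file, an exit code of 1 would get the consumer restarted automatically instead of requiring a manual `systemctl restart`.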
I'm using the fm-consumer@ service pattern. The boxes have fedora-messaging-1.7.2-1.fc31.noarch; I see that 2.0.0 made stable now, I can try them on that...
Interesting. I wonder then... could the fedora-messaging plugin to koji be the source of our problems? It only runs on the 2 hubs, so I am surprised it's so many connections... it does also seem to be emitting messages ok as far as I know... On the other hand, koji is odd in that it's wsgi script based, so perhaps each apache thread connects?
@pingou I know you debugged this for the missing signing messages, perhaps we can see whats going on again there?
I just restarted builders that were not checking in, and in the last 20min... none have dropped out.
So far all builders are staying in. Looks like the issue is gone.
So, something with the rabbitmq cluster was the cause all along.
We still have to clean up updates...
I tried resuming some, but they ejected everything, even though I thought they were in the right tags. We are going to have to go back to those, tag them how bodhi expects and resubmit.
I'm going to just resume the last 2, have them eject things and then at least we can do pushes of things that are stuck pending, and we can then go back and try and resubmit everything that was ejected.
ok. I got the f30 updates/updates-testing to complete normally.
However, we now need to clean up all those ejected updates, and I am not sure the best way to do that.
basically we need to either:
a)
1. Find all ejected updates since mid-December (we have fedmsgs, so this should be pretty easy).
2. For each one, if it was going to stable, tag its builds with -updates; if it was going to testing, tag its builds with candidate.
3. For each one, resubmit to testing or stable, unless the maintainer submitted it for stable already.

or

b) Find all updates that were ejected and resubmit them for testing or stable, unless they were in testing and the maintainer already submitted to stable. Can we get bodhi to check if the update is already in the destination tags and then just say "oh, it's already there, moving on..."?

or

c) Some magic in the bodhi db to tell it that the things already tagged are done pushing (but this won't close bugs or email the package announce list, right?).
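The first option is mechanical enough to script. A rough sketch of the per-update decision; field names and the release prefix are illustrative assumptions, not Bodhi's actual schema (it assumes the ejection fedmsgs carry each update's requested destination):

```python
def cleanup_actions(ejected_updates, release_prefix="f31"):
    """For each ejected update, decide which koji tag its builds need
    and where to resubmit it. Field names here are made up."""
    actions = []
    for up in ejected_updates:
        if up["request"] == "stable":
            tag = f"{release_prefix}-updates"
        else:  # it was headed to testing
            tag = f"{release_prefix}-updates-candidate"
        actions.append({
            "update": up["id"],
            "tag_builds_into": tag,
            "resubmit_to": up["request"],
        })
    return actions

# The two stuck updates mentioned earlier in the thread, as example input:
plan = cleanup_actions([
    {"id": "FEDORA-2019-9184b1e049", "request": "stable"},
    {"id": "FEDORA-2019-c6029679d3", "request": "testing"},
])
for a in plan:
    print(a["update"], a["tag_builds_into"], a["resubmit_to"])
```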
I've disabled the bodhi auto pushing, there are no composes to resume now, so we just need to clean up ejected updates and do a new push.
as I mentioned on IRC, if possible it'd be good to look out for cases where there is a later update for the same package, and not push an ejected/stuck update if a later update already went through...
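The obsolescence check suggested here is also easy to express: keep only the newest stuck update per package, so an old ejected update never gets pushed over a newer one that already went through. A sketch with made-up fields:

```python
def drop_superseded(stuck_updates):
    """Given stuck updates as (package, submission_time, update_id) tuples,
    keep only the most recent update per package."""
    newest = {}
    for pkg, when, uid in stuck_updates:
        if pkg not in newest or when > newest[pkg][0]:
            newest[pkg] = (when, uid)
    return {uid for _, uid in newest.values()}

keep = drop_superseded([
    ("nano", "2019-12-20", "FEDORA-2019-aaaa"),
    ("nano", "2019-12-29", "FEDORA-2019-bbbb"),  # newer nano update wins
    ("vim",  "2019-12-22", "FEDORA-2019-cccc"),
])
print(sorted(keep))  # ['FEDORA-2019-bbbb', 'FEDORA-2019-cccc']
```

(ISO-formatted date strings compare correctly as plain strings, which keeps the sketch dependency-free.)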
I have done that and it looks like it is correctly processing messages now.
Yeah, I think we are 100% back to normal. I am going to leave this open however so we can perhaps try and figure out root cause next week so we can make sure it never happens again. ;)
I'm investigating this, there's a lot of interesting information in the RabbitMQ logs.
There was one more unfixed issue: the federation plugin wasn't relaying messages between the internal /pubsub vhost and the public /public_pubsub vhost.
I saw that the federation system had no active link:
```
# rabbitmqctl eval 'rabbit_federation_status:status().'
[]
```
Also, the web UI showed that one of the federation queues had 15k messages in it, while it should be empty since relaying should be instantaneous.
I deleted and re-created the policy that sets up relaying:
(Deletion was done in the web UI, but can be done with `rabbitmqctl clear_policy -p /public_pubsub pubsub-to-public_pubsub`.)

```
# rabbitmqctl set_policy -p /public_pubsub --apply-to exchanges pubsub-to-public_pubsub "^(amq|zmq)\.topic$" '{"federation-upstream":"pubsub-to-public_pubsub"}'
Setting policy "pubsub-to-public_pubsub" for pattern "^(amq|zmq)\\.topic$" to "{\"federation-upstream\":\"pubsub-to-public_pubsub\"}" with priority "0"
```
And then federation started to work again, the queue cleared up and I saw active links in the federation status:
```
# rabbitmqctl eval 'rabbit_federation_status:status().'
[[{exchange,<<"zmq.topic">>},
  {upstream_exchange,<<"zmq.topic">>},
  {type,exchange},
  {vhost,<<"/public_pubsub">>},
  {upstream,<<"pubsub-to-public_pubsub">>},
  {id,<<"c1e7747425938349520c60dda5671b2758e210b8">>},
  {status,running},
  {local_connection,<<"<rabbit@rabbitmq01.phx2.fedoraproject.org.2.10643.38>">>},
  {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
  {timestamp,{{2020,1,6},{12,46,58}}}],
 [{exchange,<<"amq.topic">>},
  {upstream_exchange,<<"amq.topic">>},
  {type,exchange},
  {vhost,<<"/public_pubsub">>},
  {upstream,<<"pubsub-to-public_pubsub">>},
  {id,<<"b40208be0a999cc93a78eb9e41531618f96d4cb2">>},
  {status,running},
  {local_connection,<<"<rabbit@rabbitmq01.phx2.fedoraproject.org.2.10613.38>">>},
  {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
  {timestamp,{{2020,1,6},{12,46,59}}}]]
```
My guess is that since federation is basically an AMQP client listening on one side and publishing on the other, the issue we've had previously with the cluster (that I'm still investigating) caused the publishing part of the process to fail/timeout.
After spending quite a few hours reading logs and docs, I think I have a better idea of what happened.
First, I must explain how clustering works in RabbitMQ. Every queue has a primary node and secondary nodes where the messages are mirrored. When a client connects to any node in the cluster, its operations are transparently forwarded to the primary node for the queue the client is working with. In the event of a node failure, a new primary is elected, and when the failed node comes back, the messages are copied back to it so it can catch up.
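For reference, this mirroring is configured through an `ha-*` policy in RabbitMQ 3.x (classic mirrored queues). The commands below are illustrative of the mechanism, not necessarily the exact policy our cluster runs:

```shell
# Illustrative only: mirror every queue in /pubsub to all nodes, with new
# mirrors syncing automatically when they join. The policy name "ha-all"
# and the vhost are examples, not taken from our configuration.
rabbitmqctl set_policy -p /pubsub ha-all "^" \
    '{"ha-mode":"all","ha-sync-mode":"automatic"}'

# Inspect which node holds the primary (pid) and which mirrors are in sync:
rabbitmqctl list_queues -p /pubsub name pid slave_pids synchronised_slave_pids
```

The second command is also useful during an incident: a queue whose `synchronised_slave_pids` is shorter than its `slave_pids` still has mirrors catching up.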
In our case, I think that didn't work well because when we had the network partition, we had two queues that were very large (a few million messages), and copying those messages took several minutes, longer than the flapping time. In addition, during the outage some of the nodes were restarted to try to get them back into shape. This had two results that I saw in the logs:
```
** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rabbitmq01.phx2.fedoraproject.org'}
memory resource limit alarm set on node 'rabbit@rabbitmq01.phx2.fedoraproject.org'. Publishers will be blocked until this alarm clears
```
To avoid this kind of issue in the future, I would recommend keeping an eye on queue sizes. I can add a default Nagios warning that would alert at, say, 10k messages. We can also set RabbitMQ limits on the queues, which would enforce this automatically (limits are possible on queue length or message age).
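A minimal sketch of what such a check could look like, assuming it reads `queue<TAB>count` lines as produced by `rabbitmqctl list_queues name messages`; the function name and output format are made up for illustration, only the 10k threshold comes from this discussion:

```shell
# Hypothetical Nagios-style queue-depth check (sketch, not the deployed
# check). Reads "queue<TAB>message_count" lines on stdin and warns on any
# queue at or above the threshold.
check_queue_depth() {
    warn=${1:-10000}
    status=0
    while read -r name count; do
        if [ "$count" -ge "$warn" ]; then
            echo "WARNING: queue $name has $count messages (>= $warn)"
            status=1
        fi
    done
    if [ "$status" -eq 0 ]; then
        echo "OK: all queues below $warn messages"
    fi
    return $status
}
```

It would be fed from the broker, e.g. `rabbitmqctl list_queues name messages | check_queue_depth 10000`, and its exit status maps naturally onto Nagios OK/WARNING semantics.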
Also, I think we should make it a good practice not to restart a RabbitMQ node without first checking that no queue sync is taking place at that moment. It is visible in the logs as:
```
Mirrored queue 'queue_name' in vhost '/pubsub': Synchronising: XX messages to synchronise
Mirrored queue 'queue_name' in vhost '/pubsub': Synchronising: all slaves already synced
```
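A tiny helper along these lines could make the pre-restart check mechanical. This is a hypothetical sketch, not an existing tool; it just extracts the queues that reported pending sync work from a chunk of broker log:

```shell
# Hypothetical helper: feed it broker log lines on stdin and it prints the
# mirrored queues that reported pending sync work. Empty output for the
# window you scanned suggests no sync is in flight.
sync_in_progress() {
    grep -o "Mirrored queue '[^']*' in vhost '[^']*': Synchronising: [0-9][0-9]* messages" \
        | sed "s/: Synchronising.*//"
}
```

For example, `journalctl -u rabbitmq-server --since -10min | sync_in_progress` (the unit name is an assumption about how the broker is run on our hosts).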
RabbitMQ can detect network partitions and gives details about which nodes see which, so I think we should let it handle the situation and recover in whatever way it prefers. Look for this in the logs:
```
Partial partition detected
```
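What RabbitMQ does on its own after detecting a partition is governed by its `cluster_partition_handling` setting (the documented values are `ignore`, `autoheal`, and `pause_minority`). Which value our cluster uses is not recorded in this ticket, so treat the command below as a way to find out, not as a statement of our configuration:

```shell
# Inspect the partition-handling strategy currently configured on a node.
# (Setting name and possible values -- ignore, autoheal, pause_minority --
# are RabbitMQ's; what our cluster uses is not shown in this ticket.)
rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'

# The value is set in the broker config, e.g. in a classic
# /etc/rabbitmq/rabbitmq.config:
#   [{rabbit, [{cluster_partition_handling, pause_minority}]}].
```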
One last thing: newer versions of RabbitMQ have better diagnostics tools that could have been useful. RHEL8+OS15 should provide an upgrade path (see #8167), but we need to check some of our code.
Does this make sense?
And thanks again to all those who help diagnose, mitigate and fix this, especially during the holidays.
@abompard awesome detective work :-). Could you please update the RabbitMQ SOP troubleshooting section (https://fedora-infra-docs.readthedocs.io/en/latest/sysadmin-guide/sops/rabbitmq.html#troubleshooting) with the useful commands, how to access the web interface, etc.?
Thanks :-)
My vote would be for the Nagios checks watching all queues for 10k messages rather than the queue size limit (at least at first; if we run into more issues, the queue size limit may become necessary).
Closing this now as abompard landed monitoring checks. :)
Thanks everyone!
Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)