#8477 koji / bodhi / rabbitmq slowness and errors
Closed: Fixed 8 months ago by kevin. Opened 9 months ago by kevin.

Two composes were failed, I tried to resume them, but then bodhi-backend failed

F31M-testing : 1 updates (requested)
F30M-testing : 1 updates (requested)

Dec 22 00:49:45 bodhi-backend01.phx2.fedoraproject.org systemd[1]: Started Bodhi's Celery worker.
Dec 22 00:49:48 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:49:48,756: INFO/MainProcess] Connected to amqps://bodhi:**@rabbitmq.fedoraproject.org:5671//bodhi
Dec 22 00:49:48 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:49:48,789: INFO/MainProcess] mingle: searching for neighbors          
Dec 22 00:49:49 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:49:49,902: INFO/MainProcess] mingle: all alone                                                                                     
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [2019-12-22 00:50:56,777: CRITICAL/MainProcess] Unrecoverable error: NotFound(404, "NOT_FOUND - failed to perform operation on queue 'has_koji_mou
nt' in vhost '/bodhi' due to timeout", (50, 10), 'Queue.declare')                                            
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: Traceback (most recent call last):   
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/worker/worker.py", line 205, in start                                                             
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.blueprint.start(self)       
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/celery/bootsteps.py", line 119, in start                                                                 
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     step.start(parent)               
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/bootsteps.py", line 369, in start                                                                 
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     return self.obj.start()
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/s[47/1868]
ges/celery/worker/consumer/consumer.py", line 317, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     blueprint.start(self)            Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/celery/bootsteps.py", line 119, in start                                                                 Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     step.start(parent)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/celery/worker/consumer/tasks.py", line 41, in start
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     c.connection, on_decode_error=c.o
n_decode_error,                                                                                              Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/celery/app/amqp.py", line 297, in TaskConsumer                                                           
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     **kw                             Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/kombu/messaging.py", line 386, in __init__                                                               Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.revive(self.channel)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/messaging.py", line 408, in revive                                                                 
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.declare()
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/messaging.py", line 421, in declare 
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     queue.declare()
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/entity.py", line 605, in declare                                                                   
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self._create_queue(nowait=nowait,
 channel=channel)                                                                                            
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/entity.py", line 614, in _create_queue
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.queue_declare(nowait=nowait, passive=False, channel=channel)                  
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/kombu/entity.py", line 649, in queue_declare
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     nowait=nowait,                   Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/amqp/channel.py", line 1154, in queue_declare                                                            Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     spec.Queue.DeclareOk, returns_tup
le=True,                                                                                                     
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/amqp/abstract_channel.py", line 80, in wait                                                              Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     self.connection.drain_events(time
out=timeout)                                                                                                 
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/amqp/connection.py", line 500, in drain_events                                                           
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     while not self.blocking_read(time
out):                                                                                                        Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packa
ges/amqp/connection.py", line 506, in blocking_read
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     return self.on_inbound_frame(fram
e)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/method_framing.py", line 55, in on_frame
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     callback(channel, method_sig, buf, None)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/connection.py", line 510, in on_inbound_method
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     method_sig, payload, content,
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/abstract_channel.py", line 126, in dispatch_method
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     listener(*args)
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   File "/usr/lib/python3.7/site-packages/amqp/channel.py", line 282, in _on_close
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:     reply_code, reply_text, (class_id, method_id), ChannelError,
Dec 22 00:50:56 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: amqp.exceptions.NotFound: Queue.declare: (404) NOT_FOUND - failed to perform operation on queue 'has_koji_mount' in vhost '/bodhi' due to timeout
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:  
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:  -------------- celery@bodhi-backend01.phx2.fedoraproject.org v4.2.1 (windowlicker)
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: ---- **** -----
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: --- * ***  * -- Linux-5.3.8-200.fc30.x86_64-x86_64-with-fedora-30-Thirty 2019-12-22 00:49:46
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: -- * - **** ---
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- [config]
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- .> app:         __main__:0x7f99cc8f9650
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- .> transport:   amqps://bodhi:**@rabbitmq.fedoraproject.org:5671//bodhi
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - ** ---------- .> results:     rpc://
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: - *** --- * --- .> concurrency: 30 (prefork)
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: -- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: --- ***** -----
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:  -------------- [queues]
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:                 .> has_koji_mount   exchange=has_koji_mount(direct) key=has_koji_mount
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:                 
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]: [tasks]
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   . compose
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org celery-3[68766]:   . handle_update
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org systemd[1]: bodhi-celery.service: Main process exited, code=exited, status=1/FAILURE
Dec 22 00:50:58 bodhi-backend01.phx2.fedoraproject.org systemd[1]: bodhi-celery.service: Failed with result 'exit-code'.

This is blocking updates, so would be good to fix asap.

@cverna and @abompard or anyone else on the bodhi team, can you look?


So, I managed to get it working again by rebooting rabbitmq02 (I rebooted 01 also, but it didn't seem to matter).

I don't understand what state it was in tho...

Before the reboot:

[root@rabbitmq01 ~][PROD]# rabbitmqctl list_queues -p "/pubsub"                                              
Listing queues
amqp_to_zmq     386
faf     10697
openqa.stg_scheduler    0
openqa_relvalamiconsumer.test   6438
federation: amq.topic -> rabbit@rabbitmq01.phx2.fedoraproject.org:/public_pubsub:amq.topic      0
robosignatory   1
fedora-build-checks     0
greenwave       0
wiki    577508

After the reboot:

[root@rabbitmq03 ~][PROD]# rabbitmqctl list_queues -p /pubsub --online
Listing queues
openqa_checkcomp        0
openqa_relvalamiconsumer.test   6438
bodhi_composer  0
openqa_wiki_reporter    0
centos-ci       1029
bodhi   0
openqa_autocloudreporter.stg    0
amqp_bridge_verify_missing      19
federation: zmq.topic -> rabbit@rabbitmq01.phx2.fedoraproject.org:/public_pubsub:zmq.topic      0
mts     0
the-new-hotness 55
robosignatory   0
openqa_scheduler        0
openqa_resultsdb_reporter       0
openqa_autocloudreporter        0
openqa.stg_scheduler    0
faf     10707
amqp_to_zmq     0
federation: amq.topic -> rabbit@rabbitmq01.phx2.fedoraproject.org:/public_pubsub:amq.topic      0
resultsdb_ci_listener   0
wiki    578159
openqa_relvalamiconsumer        0
openqa_relvalconsumer.test      0
fedora-build-checks     0
koschei 1
openqa_relvalconsumer   0
coreos  0
greenwave       38

I'd appreciate any ideas on how this happened and how to prevent it. :)

This might be related: no builds have been pushed through bodhi to rawhide for half a day ...

https://bodhi.fedoraproject.org/releases/F32

Looks like first, moving things from testing → stable stopped happening, and then moving stuff from pending → testing stopped a few hours later, as well.

Metadata Update from @mizdebsk:
- Issue tagged with: bodhi

9 months ago

@kevin it seems that many of the rawhide updates are blocked because the builds are waiting to be signed. Or they are still tagged as f32-signing-pending.

Could you check if robosignatory needs a restart ? if it is working as expected we might need to tag again the builds to be signed so that the rawhide gating pipeline can restart.

I found basically all the openQA consumers weren't picking up messages after the outage, even though they looked as if they had reconnected successfully and were working. I had to restart all the services manually.

I suspect we need to restart basically all fedora-messaging consumers that haven't been restarted yet to be sure things are working.

I did restart it, and it's been signing new things fine.

I wonder if the messages for those things were lost and we need to retag them to get it to pick them up now?
I can try that...

I'm consolidating:

#8482 Koji build fails with "GenericError: Build already in progress"
and
#8480 koji builds not getting tagged

since I think they are all this same issue and we can debug and provide status here.

Symptoms:

  • koji was very slow to repond to requests
  • some koji builds are failing builds with "GenericError: Build already in progress"
  • Some koji builders are, after a while just not updating. They have various errors about this and a kojid restart brings them back (for a while).
  • signing is happening, but it seems to get stuck after a while. At the step where the actual signing is all done and the bridge is talking to koji to import tha signatures.
  • bodhi has 2 failed pushes that it can't seem to complete. (quite possibly because ostree signing is not happening due to last point)
  • bodhi pushes on resume are ejecting a bunch of builds.

Things I have done:

  • I rebooted proxy101/110.
  • I rebooted db-koji01
  • I've been restarting the sign-bridge when I notice it stuck
  • I rebooted koji01/koji02 and now response time is much much better, but I am not sure the issues are all solved.
  • disabled the db backup on db-koji02 to avoid more slowness from that.
  • I've tried to restart/resume bodhi pushes, but they are in a weird state, might need some db poking (but please tell me what to do to do this moving forward).

I guess lets see if things are better now and if someone could figure out how to get the updates pushes out that would be good.

The odd thing about all this is that there isn't any changes I can see that were made when it started happening...

So, responsiveness is much better, and signing seems to have caught up... however, tasks are sometimes still not finishing right. The linger in tagging.

Hello infra

Thanks a lot for taking care of this issue. :D

So, responsiveness is much better, and signing seems to have caught up... however, tasks are sometimes still not finishing right. The linger in tagging.

One of my builds was not tagged, and hence the build failed. However, when I rebuild again, I get an error message stating that "the build already exists". I am unable to find the build and hence am stuck. I can update the Release tag and rebuild, but I am wondering if there is a cached copy somewhere of the same that can be used.

I am not sure if this is the correct place to ask, but I didn't want to bug the devel list, and hence am posting here.

Is there something we can do to get the borked bodhi updates un-stuck?

With a primitive bodhi query, it looks like there are 100+ updates stuck (?) in pending state across the three active releases:

https://bodhi.fedoraproject.org/updates/?status=pending&request=testing

There are also 200+ updates with a request for stable, which seems like an unusually high number (and at least some of them have already been ejected from pushes):

https://bodhi.fedoraproject.org/updates/?status=testing&request=stable

And, what can a packager do if an update is "ejected" by bodhi due to being in the wrong tag?

Is there something we can do to get the borked bodhi updates un-stuck?
With a primitive bodhi query, it looks like there are 100+ updates stuck (?) in pending state across the three active releases:
https://bodhi.fedoraproject.org/updates/?status=pending&request=testing
There are also 200+ updates with a request for stable, which seems like an unusually high number (and at least some of them have already been ejected from pushes):
https://bodhi.fedoraproject.org/updates/?status=testing&request=stable
And, what can a packager do if an update is "ejected" by bodhi due to being in the wrong tag?

I took care of the ejected updates, they should be in a correct state in bodhi now.

For the rest of the updates we just need to run a new updates composes. We finally managed to unblock the composes from last weekend so we should be able to run a set of fresh compose.

I ll try to do that today and monitor it, since koji still seems slow at times which impacts bodhi.

One of my builds was not tagged, and hence the build failed. However, when I rebuild again, I get an error message stating that "the build already exists". I am unable to find the build and hence am stuck. I can update the Release tag and rebuild, but I am wondering if there is a cached copy somewhere of the same that can be used.

I could resume my builds by manual tagging them with:
koji tag-build f31-updates-candidate <package-name>

I hope this helps

I could resume my builds by manual tagging them with:
koji tag-build f31-updates-candidate <package-name>
I hope this helps

Thanks a lot @tartina, it worked beautifully. :smile:

I reinstalled proxy101 and proxy110 as fedora 31.

Builders are still dropping off and stopping checkin with: "koji.TaskManager:1210 Attempting to take task..." messages.

Some updates pushes finally finished, but there are some new ones stuck now.

I guess it still could be a proxy issue, or a koji bug, but I still don't see anything that could have changed or any errors showing whats breaking.

Perhaps the koji folks could tell us what the "attempting to take task" hang means?

@tkopecek any ideas?

Another thing: There are a TON (273+) '/usr/bin/bodhi-approve-testing' cron jobs on bodhi-backend01

@cverna Could this be contributing to issues or another symptom?

Another thing: There are a TON (273+) '/usr/bin/bodhi-approve-testing' cron jobs on bodhi-backend01
@cverna Could this be contributing to issues or another symptom?

I think it is a symptom, the cronjob is dealing with tagging builds in koji for rawhide gating so if koji is slow to process these requests I guess that could leave the cron in a weird state.

The update eject message is also a symptom, during the compose the builds are move to the stable tag ('f31-updates`) for example but some of the tag build task in koji fail which make the whole compose fail.
So next time we run the compose, some of the builds in the updates are already tagged in the stable tag but the update is not in the stable status.

For example this is tag build that failed --> https://koji.fedoraproject.org/koji/taskinfo?taskID=39945241

Is it possible this is also causing - or is caused by - greenwave / taskotron issues?

No matter which bodhi update I look at, I get the "failed to talk to greenwave" error message instead of the "ignored" state, and no taskotron results get fetched for any of my updates, not even ones that are over a year old.

Is it possible this is also causing - or is caused by - greenwave / taskotron issues?
No matter which bodhi update I look at, I get the "failed to talk to greenwave" error message instead of the "ignored" state, and no taskotron results get fetched for any of my updates, not even ones that are over a year old.

I wouldn't think so, but it's possible...

Another thing: There are a TON (273+) '/usr/bin/bodhi-approve-testing' cron jobs on bodhi-backend01
@cverna Could this be contributing to issues or another symptom?

I think it is a symptom, the cronjob is dealing with tagging builds in koji for rawhide gating so if koji is slow to process these requests I guess that could leave the cron in a weird state.

I would agree except for the place it hangs is adding a comment to a update.
Look at all the locks in db01's bodhi2 db:

...snip a bunch more of the same...
| bodhi2  |                           |               | ExclusiveLock    | t       | virtualxid    | bodhi2   
| 2019-12-29 01:33:07.221469+00 | 00:48:26.857518  | 21308 | UPDATE comments SET update_id=176147, user_id=91
 WHERE comments.id = 1180412
 bodhi2  |                           |    2721427923 | ExclusiveLock    | t       | transactionid | bodhi2   
| 2019-12-29 01:24:07.640285+00 | 00:57:26.438702  | 13713 | UPDATE comments SET update_id=176147, user_id=91
 WHERE comments.id = 1180397
 bodhi2  |                           |    2721388468 | ExclusiveLock    | t       | transactionid | bodhi2   
| 2019-12-29 00:54:06.819603+00 | 01:27:27.259384  | 18230 | UPDATE comments SET update_id=176147, user_id=91
 WHERE comments.id = 1180386
 bodhi2  |                           |    2721428424 | ExclusiveLock    | t       | transactionid | bodhi2   
| 2019-12-29 01:39:07.720449+00 | 00:42:26.358538  | 26261 | UPDATE comments SET update_id=176147, user_id=91
 WHERE comments.id = 1180414
 bodhi2  |                           |    2721428952 | ExclusiveLock    | t       | transactionid | bodhi2   
| 2019-12-29 01:54:07.422068+00 | 00:27:26.656919  |  6601 | UPDATE comments SET update_id=176147, user_id=91
 WHERE comments.id = 1180421
 bodhi2  |                           |    2721411721 | ExclusiveLock    | t       | transactionid | bodhi2   
| 2019-12-29 01:06:08.436176+00 | 01:15:25.642811  | 29656 | UPDATE comments SET update_id=176147, user_id=91
 WHERE comments.id = 1180391
 bodhi2  |                           |    2721370917 | ExclusiveLock    | t       | transactionid | bodhi2   
| 2019-12-29 00:45:06.902888+00 | 01:36:27.176099  |  9914 | UPDATE comments SET update_id=176147, user_id=91
 WHERE comments.id = 1180383
(226 rows)

All stuck on trying to add comments to that one update.

Here's one example/datapoint: a scratch build for which the build completed after 50min, but it took Koji more than 14 hours to close the task.

Can't build for F31
RetryError: unable to retry call 11 (method host.initBuild) for session 92593309
https://koji.fedoraproject.org/koji/taskinfo?taskID=39986788

I tried a bunch more things today to isolate it.

Its not caused by the balancer setup in apache on the proxies.

It's not caused by the koji hubs from being overloaded

I tried switching back to haproxy with no real change.

I tried adjusting balancer limits and options with no real change.

I am down to thinking it's a koji bug, but I have no idea why it would appear now, not after an upgrade. There were schema changes between 1.18.1 and 1.19.0, so we can't downgrade to there. 1.19.1 was only some small fixes that don't seem likely to have anything to do with this.

Things are somewhat better now, but the issue is still happening. I think if we can unstick bodhi we can probibly get updates flowing again, but to fix the underlying issue we need probibly some koji developers or someone able to trace the koji codepaths.

The failure seems to be:

  • a builder is working along and then tries to take a task:
    Dec 31 00:46:17 buildhw-03.phx2.fedoraproject.org kojid[1982746]: 2019-12-31 00:46:17,562 [INFO] {1982746} ko
    ji.TaskManager:1210 Attempting to take task 40002458

  • The task shows actually as being on that builder after that so I think it's taken it, but it does not ever start processing it, and it fails to check into the hub after that.

Thanks Kevin for all your work on this difficult tricky problem.

Another funky scratch build, this time it doesn't start building - while I see new koschei builds getting processed ahead. Also seems to agree without your diagnosis. It seems the problem is intermittent?

I am restarting builders as they fail to check in from time to time to keep things going...

(This is probably completely unrelated (I should file a Koji bug), but I often see failed Koji builds pointing one to "see root.log" when actually the failure is in the "build.log": this has been happening for months though now...)

Should a koji bug be opened?

Looking at the spec diff, 1.19.1 did add a new service, koji-sweep-db. Did we enable that in prod? Could it be involved?

Should a koji bug be opened?

Sure, I was hoping to catch people here, but thats a good idea. Filed: https://pagure.io/koji/issue/1897

Looking at the spec diff, 1.19.1 did add a new service, koji-sweep-db. Did we enable that in prod? Could it be involved?

We did not enable that (although we have had a local cron that does that same thing since forever)

Okay.

For whatever it's worth, by my reading of the code, the error is in TaskManager.takeTask(), which is called from TaskManager.getNextTask(), which is part of the daemon main loop.

The main loop is very simple, it pretty much just calls TaskManager.updateBuildroots(), TaskManager.updateTasks() and TaskManager.getNextTask() in a loop forever. So we must be getting stuck somewhere in that fairly tight loop.

Are you logging at debug or info level ATM? Either way, if i were doing it, my next step would just be to throw a ton more log statements into the TaskManagerto try and figure out exactly where it gets stuck...

I've setup a few builders with --debug --debug-task --debug-xmlrpc and am waiting for them to drop out to examine the logs.

buildhw-10:

Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,073 [INFO] {530977} koji.TaskManager:1210 Attempting to take task 40025476
...
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2580 url: https://koji.fedoraproject.org/kojihub?session-id=92635914&session-key=<redacted>&callnum=1093
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koj i:2589 headers: {'User-Agent': 'koji/1', 'Content-Type': 'text/xml', 'Content-Length': '159'}
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koj i:2589 data: "<?xml version='1.0'?>\n<methodCall>\n<methodName>host.openTask</methodName>\n<params>\n<param>\n<value><int>40025476</int></value>\n</param>\n</params>\n</methodCall>\n"
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 stream: True
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 verify: '/etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem'
Jan 01 06:44:28 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 06:44:28,087 [DEBUG] {530977} koji:2589 timeout: 43200
Jan 01 09:38:02 buildhw-10.phx2.fedoraproject.org kojid[530977]: 2020-01-01 09:38:02,460 [DEBUG] {535641} koji:2213 Opening new requests session

So, it waited almost 3 hours before opening a new session (the timeout was set to 3 hours on the proxies)

koji01:

[Wed Jan 01 06:44:28.106504 2020] [:error] [pid 5258] 2020-01-01 06:44:28,106 [INFO] m=host.openTask u=buildhw-10.phx2.fedoraproject.org p=5258 r=10.5.126.9:33346 koji.xmlrpc: Completed method host.openTask for session 92635914 (#1093): 0.010418 seconds, rss 36736, stime 0.362359'

koji02:

[Wed Jan 01 06:44:28.050079 2020] [:error] [pid 26682] 2020-01-01 06:44:28,049 [INFO] m=host.getLoadData u=buildhw-10.phx2.fedoraproject.org p=26682 r=10.5.126.9:41044 koji.xmlrpc: Completed method host.getLoadData for session 92635914 (#1091): 0.185286 seconds, rss 36196, stime 0.166596'

So, I think it's waiting for the hub to send it something back and not getting it, then times out after 3 hours. (although I now set this timeout much lower)

I think that this may be Koji bug 1840 - I was seeing similar symptoms in a different Koji installation (ref: BST-916) and the above patch fixed them.

I think that this may be Koji bug 1840 - I was seeing similar symptoms in a different Koji installation (ref: BST-916) and the above patch fixed them.

I don't think thats it, as it's not just slower than normal, but builders drop out and stop checking in entirely.

But in case it was, I created a build with that patch and applied it and restarted kojid and... the problem persists. :(

Thanks for the suggestion... open to other ideas.

I've looked at the RabbitMQ cluster and found the following things:

  • There are currently 175 open connections for the koji user, that seems a lot. I don't know the koji architecture, maybe it's explained by each connection being on a different process. Not sure if we can reduce that or not.
  • Since the end of december the number of connections has been growing every day and dropping back during the night, see this graph. I don't know yet which app is causing this, but no app has a higher connection count than koji (by far) so it's a likely cause
  • There was a queue on the server with more than 300k messages and nothing reading them, because of a configuration mistake (it was subscribed to everything) and because it has been setup for testing with no consumer for now. I've purged the queue. Large queues are harder to sync between cluster nodes and it's recommended to keep queues as empty as possible in RabbitMQ
  • There's also the faf queue that has 11k messages and no consumer, I've emailed @msuchy to ask him if I can purge it
  • On the servers, the command rabbitmqctl list_channels lists some channels and then hangs, I'm investigating that.

One more thing: when no binding is declared in our rabbit/queue ansible role, the queue is bound to every message on the bus, which seems counter-intuitive and dangerous. This is why the wiki queue is so enormous. I'll purge it, unsubscribe it, and fix the role.

I've fiddled with the RabbitMQ cluster a bit, and restarted rabbitmq on the nodes one after the other. It led the cluster into a weird state, and restarting node 02 got it out of again. In the meantime it looks like we lost the messages stored in the queues :-( IIRC only the faf user (ABRT) had many of them.

The following user accounts did not reconnect properly:

  • openqa.stg_scheduler
  • openqa_relvalconsumer
  • the-new-hotness, which is already piling up some messages.

I think the the-new-hotness consumer need to be restarted.

I'm not sure how we got into the initial issue but as far as I can tell things seem to be back to normal now. I'm guessing something went wrong between node 02 and the rest of the cluster, and queues got out of sync.

The newer versions of RabbitMQ have a rabbitmq-diagnostics command that could have been useful... :-/

Please tell me if you keep seeing weird behaviour with the RabbitMQ cluster

Thanks for the heads-up on the queues. The autocloudreporter queue you mentioned before the edit is obsolete and can be removed; we decommissioned autocloud so autocloudreporter no longer needs to run, I've turned it off.

I'll remove the queue, then, thanks.
I edited the comment because I realized I had a way to know if there was a consumer running before the outage, so I only kept the accounts that had one.

BTW, in case it's of any interest to you, this is what happened to one of the consumers ~3 hours before I manually restarted it after seeing your comment. (The new fpaste appears to have 1 day as the longest expiry, so read quick!) After those errors you'd sort of expect the service to fail out and automatically restart, but...no? This sort of thing happens often enough that I have a daily cron job on the openQA boxes that just restarts all the consumers...

Yeah that's because when I restarted node 02, suddenly all queues were gone, and since consumers don't have the permission to create queues (only ansible can do that) they failed with what was interpreted as a permanent error. When I restarted node02 again, the queues were back (but empty).

seems wrong that the consumer service apparently doesn't treat it as a fatal error but continues running in a zombie state, though...

Oh yeah one more thing: I just published a new version of fedora-messaging (2.0.1) that fixes a bug that prevents publishing after the client has reconnected after a connection timeout. It's on PyPI and I'm building it for Fedora right now (Rawhide done, stable releases to go).

I agree it seems wrong, are you running the consumer with the "fedora-messaging consume" command? If so we should catch it and exit with RC > 0.

I'm using the fm-consumer@ service pattern. The boxes have fedora-messaging-1.7.2-1.fc31.noarch; I see that 2.0.0 made stable now, I can try them on that...

Interesting. I wonder then... could the fedora-messaging plugin to koji be the source of our problems? It only runs on the 2 hubs, so I am surprised it's so many connections... it does also seem to be emitting messages ok as far as I know... On the other hand, koji is odd in that it's wsgi script based, so perhaps each apache thread connects?

@pingou I know you debugged this for the missing signing messages, perhaps we can see whats going on again there?

I just restarted builders that were not checking in, and in the last 20min... none have dropped out.

So far all builders are staying in. Looks like the issue is gone.

So, something with the rabbitmq cluster was the cause all the time.

We still have to clean up updates...

I tried resuming some, but they ejected everything, even though I thought they were in the right tags. We are going to have to go back to those, tag them how bodhi expects and resubmit.

I'm going to just resume the last 2, have them eject things and then at least we can do pushes of things that are stuck pending, and we can then go back and try and resubmit everything that was ejected.

ok. I got the f30 updates/updates-testing to complete normally.

However, we now need to clean up all those ejected updates, and I am not sure the best way to do that.

basically we need to either:

a)
1 - find all ejected updates since mid december (we have fedmsgs, so this should be pretty easy)
2. - for each one, if it was going to stable, tag builds with -updates if it was going to testing tag builds with candidate.
3 - for each one, resubmit to testing or stable, unless the maintainer submitted it for stable already

or

a) we find all updates that were ejected and resubmit them for testing or stable unless they were testing and the maintainer already submitted to stable
b) Can we get bodhi to check if the update is already in the destination tags and then just say 'oh, it's already there, moving on...' ?

or

a) Some magic in the bodhi db to tell it those things tagged in already are done pushing
(But this won't close bugs or email the package announce list right)?

I've disabled the bodhi auto pushing, there are no composes to resume now, so we just need to clean up ejected updates and do a new push.

as I mentioned on IRC, if possible it'd be good to look out for cases where there is a later update for the same package, and not push an ejected/stuck update if a later update already went through...

I think the the-new-hotness consumer need to be restarted.

I have done that and it looks like it is correctly processing messages now.

Yeah, I think we are 100% back to normal. I am going to leave this open however so we can perhaps try and figure out root cause next week so we can make sure it never happens again. ;)

I'm investigating this, there's a lot of interesting information in the RabbitMQ logs.

There was one more unfixed issue: the federation plugin wasn't relaying messages between the internal /pubsub vhost and the public /public_pubsub vhost.

I saw that the federation system had no active link:

# rabbitmqctl eval 'rabbit_federation_status:status().'
[]

Also, the web UI showed that one of the federation queues had 15k message in it, while it should be empty since relaying should be instantaneous.

I deleted and re-created the policy that sets up relaying:

(deletion was done in the web UI but can be done with rabbitmqctl clear_policy -p /public_pubsub pubsub-to-public_pubsub)
# rabbitmqctl set_policy -p /public_pubsub --apply-to exchanges pubsub-to-public_pubsub "^(amq|zmq)\.topic$" '{"federation-upstream":"pubsub-to-public_pubsub"}'
Setting policy "pubsub-to-public_pubsub" for pattern "^(amq|zmq)\\.topic$" to "{\"federation-upstream\":\"pubsub-to-public_pubsub\"}" with priority "0"

And then federation started to work again, the queue cleared up and I saw active links in the federation status:

# rabbitmqctl eval 'rabbit_federation_status:status().'
[[{exchange,<<"zmq.topic">>},
  {upstream_exchange,<<"zmq.topic">>},
  {type,exchange},
  {vhost,<<"/public_pubsub">>},
  {upstream,<<"pubsub-to-public_pubsub">>},
  {id,<<"c1e7747425938349520c60dda5671b2758e210b8">>},
  {status,running},
  {local_connection,<<"<rabbit@rabbitmq01.phx2.fedoraproject.org.2.10643.38>">>},
  {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
  {timestamp,{{2020,1,6},{12,46,58}}}],
 [{exchange,<<"amq.topic">>},
  {upstream_exchange,<<"amq.topic">>},
  {type,exchange},
  {vhost,<<"/public_pubsub">>},
  {upstream,<<"pubsub-to-public_pubsub">>},
  {id,<<"b40208be0a999cc93a78eb9e41531618f96d4cb2">>},
  {status,running},
  {local_connection,<<"<rabbit@rabbitmq01.phx2.fedoraproject.org.2.10613.38>">>},
  {uri,<<"amqps://rabbitmq01.phx2.fedoraproject.org/%2Fpubsub">>},
  {timestamp,{{2020,1,6},{12,46,59}}}]]

My guess is that since federation is basically an AMQP client listening on one side and publishing on the other, the issue we've had previously with the cluster (that I'm still investigating) caused the publishing part of the process to fail/timeout.

After spending quite a few hours reading logs and docs, I think I have a better idea of what happened.

First, I must explain how clustering works in RabbitMQ. Every queue has a primary node and secondary nodes where the messages are mirrored. When a client connect to any node in the cluster, its operations are transparently forwarded to the primary node for the queue that the client works with.
In the event of a node failure, a new primary is elected, and when the failed node comes back the messages are copied back to it so it can catch up.

In our case, I think that didn't work well because when we had the network partition, we had two queues that were very large (a few millions of messages), and copying those messages took a few minutes, longer than the flapping time. In addition, during the outage some of the nodes were restarted to try to get them back into shape. This had two results that I saw in the logs:

  • there were log messages about inconsistant Mnesia database events (** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rabbitmq01.phx2.fedoraproject.org'})
  • syncing those messages eat up a lot of memory, and we reached a watermark after which the publishers are temporarily unauthorized to publish, until the alert clears (memory resource limit alarm set on node 'rabbit@rabbitmq01.phx2.fedoraproject.org'. Publishers will be blocked until this alarm clears)

To avoid this kind of issue in the future, I would recommend keeping an eye on the queue sizes. I can add a default Nagios warning that would alert at, say, 10k messages. We can also set RabbitMQ limits on the queues which would automatically enforce this (there are possible limits on queue size or message age).

Also, I think we should make a good practice to not restart a rabbitmq node without checking that there is no queue sync taking place at that moment. It is visible in the logs (Mirrored queue 'queue_name' in vhost '/pubsub': Synchronising: XX messages to synchronise followed by Mirrored queue 'queue_name' in vhost '/pubsub': Synchronising: all slaves already synced).

RabbitMQ can detect network partitions (look for Partial partition detected in the logs) and gives details about which nodes sees which, so I think we should let it handle the situation and recover in whatever way it prefers.

One last thing: newer versions of RabbitMQ have better diagnostics tools that could have been useful. RHEL8+OS15 should provide an upgrade path (see #8167), but we need to check some of our code.

Does this make sense?

And thanks again to all those who help diagnose, mitigate and fix this, especially during the holidays.

@abompard awesome detective work :-). Could you please update the Rabbitmq SOP troubleshooting section (https://fedora-infra-docs.readthedocs.io/en/latest/sysadmin-guide/sops/rabbitmq.html#troubleshooting) with the useful commands and the how to access the web interface etc ...

Thanks :-)

To avoid this kind of issue in the future, I would recommend keeping an eye on the queue sizes. I can add a default Nagios warning that would alert at, say, 10k messages. We can also set RabbitMQ limits on the queues which would automatically enforce this (there are possible limits on queue size or message age).

My vote would be for the nagios checks checking all queues for 10k messages rather than the queue size limit (at least at first, if we run into more issues, then the queue size limit may become necessary).

Closing this now as abompard landed monitoring checks. :)

Thanks everyone!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

8 months ago

Login to comment on this ticket.

Metadata