After the network updates this morning, the scm_requests toddler isn't processing.
Things I have checked:
Yet, toddlers never sees or acts on them.
It has to be something with rabbitmq not routing correctly, but I've run the playbook over it and looked around, and it all seems right. ;(
@abompard perhaps you can see what's happening here?
Or @ryanlerch perhaps?
I've looked at it, and it's very weird indeed. I don't understand what's going on. I have a script to replay messages from datagrepper; I ran it on 5127cd64-c119-47c3-ba5f-0c3fdfcc90d4, and I see it going through the bus if I listen to it with a local consumer, but toddlers never sees it. Even at a logging level where fedora-messaging would log any received message before toddlers has a chance to start processing it.
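For reference, a local debug consumer along these lines can be spun up with the fedora-messaging API; this is just a sketch, and the queue name below is a throwaway placeholder, not the production toddlers queue:

# Minimal sketch of a local debug consumer (placeholder queue name).
from fedora_messaging import api, config

config.conf.setup_logging()

def callback(message):
    # Print every message as it arrives, before any real processing happens.
    print(message.id, message.topic)

queue = "debug-local-consumer"  # hypothetical queue name
api.consume(
    callback,
    bindings=[{
        "exchange": "amq.topic",
        "queue": queue,
        "routing_keys": ["io.pagure.*.pagure.issue.new"],
    }],
    queues={queue: {"durable": False, "auto_delete": True,
                    "exclusive": True, "arguments": {}}},
)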
I've tried unsubscribing the toddlers queue from io.pagure.*.pagure.issue.new and subscribing it again, but it didn't help. I'm running the toddlers consumer in a debug container to watch it more closely, I'll report here what I can find.
OK, it's really weird. Actually, unsubscribing the toddlers queue and subscribing it again did work; it's now getting the topics it didn't get before. I checked with rabbitmqctl list_bindings on all 3 cluster members, and none was missing a binding before the remove/re-add operation I did.
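Roughly the operation that ended up fixing it, as a sketch using pika; the broker URL, credentials, vhost and queue name here are placeholders, not the production values:

# Placeholder connection details; substitute the real broker/vhost/queue.
import pika

params = pika.URLParameters(
    "amqps://user:password@rabbitmq01.iad2.fedoraproject.org/%2Fpubsub")
conn = pika.BlockingConnection(params)
channel = conn.channel()
# Drop the apparently-stale binding, then recreate it.
channel.queue_unbind(queue="toddlers", exchange="amq.topic",
                     routing_key="io.pagure.*.pagure.issue.new")
channel.queue_bind(queue="toddlers", exchange="amq.topic",
                   routing_key="io.pagure.*.pagure.issue.new")
conn.close()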
I'm not sure what happened, but more importantly I don't know how many other queues are affected by the same issue.
My queued epel10 branch requests finally happened, but the builds are failing:
Building perl-Devel-OverloadInfo-0.007-7.el10 for epel10-candidate
Created task: 122696143
Task info: https://koji.fedoraproject.org/koji/taskinfo?taskID=122696143
Watching tasks (this may be safely interrupted)...
122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free
122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free -> open (buildhw-x86-14.iad2.fedoraproject.org)
122696156 buildSRPMFromSCM (/rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free
122696156 buildSRPMFromSCM (/rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free -> open (buildvm-a64-06.iad2.fedoraproject.org)
122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): open (buildhw-x86-14.iad2.fedoraproject.org) -> FAILED: BuildError: package perl-Devel-OverloadInfo not in list for tag epel10.0-testing-candidate
  0 free  1 open  0 done  1 failed
122696156 buildSRPMFromSCM (/rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): open (buildvm-a64-06.iad2.fedoraproject.org) -> closed
  0 free  0 open  1 done  1 failed
122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808) failed
Related? Or am I just being impatient? (These builds were possible straight after branching earlier in the week, but they would result in "pending" updates in bodhi and I'd have to push them to testing myself.)
The koji sync listener consumer wasn't processing. I restarted it and am running a sync.
So, please hang on...
All the packages added today should be set in koji now.
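For anyone who wants to double-check a specific package, a quick read-only sketch against the koji hub API (the tag name is taken from the build error above):

# Check whether the package is on the tag's package list after the sync.
import koji

session = koji.ClientSession("https://koji.fedoraproject.org/kojihub")
pkgs = session.listPackages(tagID="epel10.0-testing-candidate", inherited=True)
print(any(p["package_name"] == "perl-Devel-OverloadInfo" for p in pkgs))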
I restarted the other 2 nodes of the rabbitmq cluster. My only theory is that somehow one node was in a disconnected state (even though the cluster status showed them all connected)...
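One way to cross-check that theory is to ask each node, via the management API, which peers it thinks it has been partitioned from; just a sketch, and the URL and credentials below are placeholders:

# Placeholder management-API URL and credentials.
import requests

resp = requests.get(
    "https://rabbitmq01.iad2.fedoraproject.org:15671/api/nodes",
    auth=("monitor", "password"), timeout=30)
resp.raise_for_status()
for node in resp.json():
    # 'partitions' lists the peers this node considers itself partitioned from.
    print(node["name"], node["running"], node["partitions"])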
I tried to look at the rabbit.log files, but there's a ton of noise in there; it's hard to tell what might matter. ;(
For investigation purposes, the first message not to be processed instantly by the scm-request-processor was https://pagure.io/releng/fedora-scm-requests/issue/65902. Any significant event in our infra on 2024-08-29 between 15:47:00 UTC and 16:14:30 UTC?
Possibly relevant in rabbitmq01's logs:
2024-08-29 15:53:03.242 [error] <0.172.0> ** Node 'rabbit@rabbitmq03.iad2.fedoraproject.org' not responding ** ** Removing (timedout) connection **
2024-08-29 15:53:03.244 [info] <0.426.0> rabbit on node 'rabbit@rabbitmq03.iad2.fedoraproject.org' down
2024-08-29 15:53:03.249 [warning] <0.3012.0> Management delegate query returned errors: [{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.249 [warning] <0.3013.0> Management delegate query returned errors: [{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.249 [warning] <0.3014.0> Management delegate query returned errors: [{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.251 [warning] <0.3720.0> Management delegate query returned errors: [{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.254 [error] <0.166.0> ** Node 'rabbit@rabbitmq02.iad2.fedoraproject.org' not responding ** ** Removing (timedout) connection **
2024-08-29 15:53:03.258 [warning] <0.3729.0> Management delegate query returned errors: [{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.286 [info] <0.426.0> Node rabbit@rabbitmq03.iad2.fedoraproject.org is down, deleting its listeners
2024-08-29 15:53:03.303 [info] <0.426.0> rabbit on node 'rabbit@rabbitmq02.iad2.fedoraproject.org' down
2024-08-29 15:53:03.320 [info] <0.426.0> Node rabbit@rabbitmq02.iad2.fedoraproject.org is down, deleting its listeners
2024-08-29 15:53:03.330 [info] <0.426.0> node 'rabbit@rabbitmq03.iad2.fedoraproject.org' down: net_tick_timeout
2024-08-29 15:53:03.330 [info] <0.426.0> node 'rabbit@rabbitmq02.iad2.fedoraproject.org' down: net_tick_timeout
And later:
2024-08-29 16:10:38.283 [info] <0.426.0> node 'rabbit@rabbitmq02.iad2.fedoraproject.org' up
2024-08-29 16:10:38.284 [error] <0.225.0> Mnesia('rabbit@rabbitmq01.iad2.fedoraproject.org'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rabbitmq02.iad2.fedoraproject.org'}
2024-08-29 16:10:38.308 [info] <0.426.0> Autoheal request received from 'rabbit@rabbitmq02.iad2.fedoraproject.org'
2024-08-29 16:10:38.308 [info] <0.4125.1860> connection <0.4125.1860> (10.3.163.75:35746 -> 10.3.163.78:5671): user 'centos-odcs-private-queue' authenticated and granted access to vhost '/centos-odcs'
2024-08-29 16:10:38.331 [error] <0.225.0> Mnesia('rabbit@rabbitmq01.iad2.fedoraproject.org'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rabbitmq03.iad2.fedoraproject.org'}
2024-08-29 16:10:38.332 [info] <0.4552.1860> Connection <0.4552.1860> (10.3.163.79:57909 -> 10.3.163.78:5671) has a client-provided name: Federation link (upstream: pubsub-to-public_pubsub, policy: pubsub-to-public_pubsub)
2024-08-29 16:10:38.335 [info] <0.426.0> Autoheal decision
  * Partitions: [['rabbit@rabbitmq01.iad2.fedoraproject.org','rabbit@rabbitmq03.iad2.fedoraproject.org'],['rabbit@rabbitmq02.iad2.fedoraproject.org']]
  * Winner: 'rabbit@rabbitmq01.iad2.fedoraproject.org'
  * Losers: ['rabbit@rabbitmq02.iad2.fedoraproject.org']
2024-08-29 16:10:38.335 [info] <0.426.0> Autoheal: I am the winner, waiting for ['rabbit@rabbitmq02.iad2.fedoraproject.org'] to stop
2024-08-29 16:10:38.335 [info] <0.426.0> node 'rabbit@rabbitmq03.iad2.fedoraproject.org' up
2024-08-29 16:10:38.336 [info] <0.4552.1860> connection <0.4552.1860> (10.3.163.79:57909 -> 10.3.163.78:5671 - Federation link (upstream: pubsub-to-public_pubsub, policy: pubsub-to-public_pubsub)): user 'pubsub_federation' authenticated and granted access to vhost '/pubsub'
2024-08-29 16:10:38.338 [info] <0.426.0> Autoheal request received from 'rabbit@rabbitmq03.iad2.fedoraproject.org' when healing; ignoring
2024-08-29 16:10:38.714 [info] <0.426.0> rabbit on node 'rabbit@rabbitmq02.iad2.fedoraproject.org' down
2024-08-29 16:10:38.717 [info] <0.426.0> Keeping rabbit@rabbitmq02.iad2.fedoraproject.org listeners: the node is already back
2024-08-29 16:10:38.719 [info] <0.426.0> Autoheal: final node has stopped, starting...
2024-08-29 16:10:38.724 [info] <0.426.0> Autoheal finished according to winner 'rabbit@rabbitmq01.iad2.fedoraproject.org'
2024-08-29 16:10:38.948 [error] <0.225.0> Mnesia('rabbit@rabbitmq01.iad2.fedoraproject.org'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rabbitmq03.iad2.fedoraproject.org'}
Then it looks like rabbitmq might have been restarted:
2024-08-29 16:10:43.048 [info] <0.43.0> Application mnesia exited with reason: stopped
2024-08-29 16:10:43.381 [info] <0.22732.1860>
 Starting RabbitMQ 3.7.23 on Erlang 22.0.7
 Copyright (C) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL. See https://www.rabbitmq.com/
2024-08-29 16:10:43.383 [info] <0.22732.1860>
 node           : rabbit@rabbitmq01.iad2.fedoraproject.org
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.config
 cookie hash    : [redacted]
 log(s)         : /var/log/rabbitmq/rabbit.log
                : /var/log/rabbitmq/rabbit_upgrade.log
 database dir   : /var/lib/rabbitmq/mnesia/rabbit
2024-08-29 16:10:43.397 [info] <0.22732.1860> Running boot step pre_boot defined by app rabbit
2024-08-29 16:10:43.397 [info] <0.22732.1860> Running boot step rabbit_core_metrics defined by app rabbit
2024-08-29 16:10:43.398 [info] <0.22732.1860> Running boot step rabbit_alarm defined by app rabbit
2024-08-29 16:10:43.404 [info] <0.5342.1860> Memory high watermark set to 3884 MiB (4073355673 bytes) of 9711 MiB (10183389184 bytes) total
2024-08-29 16:10:43.410 [info] <0.9967.1860> Enabling free disk space monitoring
2024-08-29 16:10:43.410 [info] <0.9967.1860> Disk free limit set to 50MB
2024-08-29 16:10:43.415 [info] <0.22732.1860> Running boot step code_server_cache defined by app rabbit
2024-08-29 16:10:43.415 [info] <0.22732.1860> Running boot step file_handle_cache defined by app rabbit
[...]
2024-08-29 16:10:45.268 [info] <0.19046.1860> Server startup complete; 6 plugins started.
It looks like we've had a network partition for about 16m30s. That may be the cause.
Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops
Closing this as it is already resolved and the tickets are processed.
Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)