#12157 scm_request_processor not processing
Closed: Fixed 18 hours ago by zlopez. Opened a month ago by kevin.

After the network updates this morning, the scm_requests toddler isn't processing.

Things I have checked:

  • Pagure is sending messages
  • They are getting to RabbitMQ
  • They are in datagrepper

Yet, toddlers never sees or acts on them.

It has to be something with RabbitMQ not routing correctly, but I have run the playbook over it and looked around, and it all seems right. ;(

@abompard perhaps you can see what's happening here?


I've looked at it, and it's very weird indeed. I don't understand what's going on.
I have a script to replay messages from datagrepper. I ran it on 5127cd64-c119-47c3-ba5f-0c3fdfcc90d4: I can see the message going through the bus when I listen with a local consumer, but toddlers never sees it, even at a logging level where fedora-messaging would log any received message before toddlers has a chance to start processing it.
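
For reference, the replay boils down to something like this (a sketch: the datagrepper /id endpoint and fedora_messaging.api.publish are the standard interfaces, but the response-key handling here is my assumption about the v1 format):

# Minimal replay sketch: fetch an archived message from datagrepper by
# its ID and re-publish it on the bus with fedora-messaging.
import requests
from fedora_messaging import api, message

DATAGREPPER = "https://apps.fedoraproject.org/datagrepper"

def replay(msg_id):
    # Fetch the archived message; is_raw returns the bare message dict.
    resp = requests.get(f"{DATAGREPPER}/id", params={"id": msg_id, "is_raw": "true"})
    resp.raise_for_status()
    archived = resp.json()
    # Re-publish with the original topic and body.
    api.publish(message.Message(topic=archived["topic"], body=archived["msg"]))

replay("5127cd64-c119-47c3-ba5f-0c3fdfcc90d4")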

I've tried unsubscribing the toddlers queue from io.pagure.*.pagure.issue.new and subscribing it again, but it didn't help. I'm running the toddlers consumer in a debug container to watch it more closely; I'll report here what I can find.
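
For anyone following along, an easy way to watch what a queue receives, independently of the application, is fedora-messaging's stock printer callback (the config path here is an assumption):

# Throwaway consumer that prints every message it receives; the queue
# and bindings come from the config file.
fedora-messaging --conf /etc/fedora-messaging/toddlers.toml consume --callback=fedora_messaging.example:printer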

OK, it's really weird. Actually, unsubscribing the toddlers queue and subscribing it again did work: it's now getting the topics it didn't get before. I checked with rabbitmqctl list_bindings on all three cluster members, and none of them was missing a binding before the remove/re-add operation I did.
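
For the record, the check and the remove/re-add amounted to roughly the following (standard rabbitmqctl/rabbitmqadmin invocations; the exchange and queue names are assumptions based on the topic above):

# List the queue's bindings (run on every cluster member).
rabbitmqctl list_bindings source_name destination_name routing_key | grep toddlers

# Drop the binding, then declare it again.
rabbitmqadmin delete binding source=amq.topic destination=toddlers destination_type=queue properties_key='io.pagure.*.pagure.issue.new'
rabbitmqadmin declare binding source=amq.topic destination=toddlers destination_type=queue routing_key='io.pagure.*.pagure.issue.new'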

I'm not sure what happened, but more importantly I don't know how many other queues are affected by the same issue.

My queued epel10 branch requests finally happened, but the builds are failing:

Building perl-Devel-OverloadInfo-0.007-7.el10 for epel10-candidate
Created task: 122696143
Task info: https://koji.fedoraproject.org/koji/taskinfo?taskID=122696143
Watching tasks (this may be safely interrupted)...
122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free
122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free -> open (buildhw-x86-14.iad2.fedoraproject.org)
  122696156 buildSRPMFromSCM (/rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free
  122696156 buildSRPMFromSCM (/rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): free -> open (buildvm-a64-06.iad2.fedoraproject.org)
122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): open (buildhw-x86-14.iad2.fedoraproject.org) -> FAILED: BuildError: package perl-Devel-OverloadInfo not in list for tag epel10.0-testing-candidate
  0 free  1 open  0 done  1 failed
  122696156 buildSRPMFromSCM (/rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808): open (buildvm-a64-06.iad2.fedoraproject.org) -> closed
  0 free  0 open  1 done  1 failed

122696143 build (epel10-candidate, /rpms/perl-Devel-OverloadInfo.git:b53f422abcbe449bf8cfa93400200d137f3ba808) failed

Related? Or am I just being impatient? (These builds have been possible straight after branching earlier in the week, but they would result in "pending" updates in Bodhi and I'd have to push them to testing myself.)

The koji sync listener consumer wasn't processing. I restarted it and am running a sync.

So, please hang on...

All the packages added today should be set in koji now.

I restarted the other two nodes of the RabbitMQ cluster. My only theory is that one node was somehow in a disconnected state (even though the cluster status showed them all connected)...
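
For reference, the checks and restarts were just the standard commands, something like:

# What does each node think the cluster looks like? On 3.7 the output
# includes a partitions section, which should be empty.
rabbitmqctl cluster_status

# Restart the broker on the suspect nodes, one at a time.
systemctl restart rabbitmq-server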

I tried to look at the rabbit.log files, but there's a ton of noise in there; it's hard to tell what might matter. ;(
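
Grepping for just the cluster-membership events cuts the noise down a lot:

# Keep only the lines about nodes appearing/disappearing and healing.
grep -E 'nodedown|net_tick_timeout|[Pp]artition|Autoheal|not responding' /var/log/rabbitmq/rabbit.log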

For investigation purposes, the first message not to be processed instantly by the scm-request-processor was https://pagure.io/releng/fedora-scm-requests/issue/65902. Was there any significant event in our infra on 2024-08-29 between 15:47:00 UTC and 16:14:30 UTC?
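
To help narrow it down, datagrepper itself can list everything that hit the bus in that window (its /raw endpoint takes start/end as epoch seconds):

# Everything published on the bus during the suspect window.
start=$(date -u -d '2024-08-29 15:47:00' +%s)
end=$(date -u -d '2024-08-29 16:14:30' +%s)
curl "https://apps.fedoraproject.org/datagrepper/raw?start=$start&end=$end&rows_per_page=100"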

Possibly relevant in rabbitmq01's logs:

2024-08-29 15:53:03.242 [error] <0.172.0> ** Node 'rabbit@rabbitmq03.iad2.fedoraproject.org' not responding **
** Removing (timedout) connection **
2024-08-29 15:53:03.244 [info] <0.426.0> rabbit on node 'rabbit@rabbitmq03.iad2.fedoraproject.org' down
2024-08-29 15:53:03.249 [warning] <0.3012.0> Management delegate query returned errors:
[{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.249 [warning] <0.3013.0> Management delegate query returned errors:
[{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.249 [warning] <0.3014.0> Management delegate query returned errors:
[{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.251 [warning] <0.3720.0> Management delegate query returned errors:
[{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.254 [error] <0.166.0> ** Node 'rabbit@rabbitmq02.iad2.fedoraproject.org' not responding **
** Removing (timedout) connection **
2024-08-29 15:53:03.258 [warning] <0.3729.0> Management delegate query returned errors:
[{<6238.2644.0>,{exit,{nodedown,'rabbit@rabbitmq02.iad2.fedoraproject.org'},[]}},{<6239.2599.0>,{exit,{nodedown,'rabbit@rabbitmq03.iad2.fedoraproject.org'},[]}}]
2024-08-29 15:53:03.286 [info] <0.426.0> Node rabbit@rabbitmq03.iad2.fedoraproject.org is down, deleting its listeners
2024-08-29 15:53:03.303 [info] <0.426.0> rabbit on node 'rabbit@rabbitmq02.iad2.fedoraproject.org' down
2024-08-29 15:53:03.320 [info] <0.426.0> Node rabbit@rabbitmq02.iad2.fedoraproject.org is down, deleting its listeners
2024-08-29 15:53:03.330 [info] <0.426.0> node 'rabbit@rabbitmq03.iad2.fedoraproject.org' down: net_tick_timeout
2024-08-29 15:53:03.330 [info] <0.426.0> node 'rabbit@rabbitmq02.iad2.fedoraproject.org' down: net_tick_timeout

And later:

2024-08-29 16:10:38.283 [info] <0.426.0> node 'rabbit@rabbitmq02.iad2.fedoraproject.org' up
2024-08-29 16:10:38.284 [error] <0.225.0> Mnesia('rabbit@rabbitmq01.iad2.fedoraproject.org'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rabbitmq02.iad2.fedoraproject.org'}
2024-08-29 16:10:38.308 [info] <0.426.0> Autoheal request received from 'rabbit@rabbitmq02.iad2.fedoraproject.org'
2024-08-29 16:10:38.308 [info] <0.4125.1860> connection <0.4125.1860> (10.3.163.75:35746 -> 10.3.163.78:5671): user 'centos-odcs-private-queue' authenticated and granted access to vhost '/centos-odcs'
2024-08-29 16:10:38.331 [error] <0.225.0> Mnesia('rabbit@rabbitmq01.iad2.fedoraproject.org'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@rabbitmq03.iad2.fedoraproject.org'}
2024-08-29 16:10:38.332 [info] <0.4552.1860> Connection <0.4552.1860> (10.3.163.79:57909 -> 10.3.163.78:5671) has a client-provided name: Federation link (upstream: pubsub-to-public_pubsub, policy: pubsub-to-public_pubsub)
2024-08-29 16:10:38.335 [info] <0.426.0> Autoheal decision
  * Partitions: [['rabbit@rabbitmq01.iad2.fedoraproject.org','rabbit@rabbitmq03.iad2.fedoraproject.org'],['rabbit@rabbitmq02.iad2.fedoraproject.org']]
  * Winner:     'rabbit@rabbitmq01.iad2.fedoraproject.org'
  * Losers:     ['rabbit@rabbitmq02.iad2.fedoraproject.org']
2024-08-29 16:10:38.335 [info] <0.426.0> Autoheal: I am the winner, waiting for ['rabbit@rabbitmq02.iad2.fedoraproject.org'] to stop
2024-08-29 16:10:38.335 [info] <0.426.0> node 'rabbit@rabbitmq03.iad2.fedoraproject.org' up
2024-08-29 16:10:38.336 [info] <0.4552.1860> connection <0.4552.1860> (10.3.163.79:57909 -> 10.3.163.78:5671 - Federation link (upstream: pubsub-to-public_pubsub, policy: pubsub-to-public_pubsub)): user 'pubsub_federation' authenticated and granted access to vhost '/pubsub'
2024-08-29 16:10:38.338 [info] <0.426.0> Autoheal request received from 'rabbit@rabbitmq03.iad2.fedoraproject.org' when healing; ignoring
2024-08-29 16:10:38.714 [info] <0.426.0> rabbit on node 'rabbit@rabbitmq02.iad2.fedoraproject.org' down
2024-08-29 16:10:38.717 [info] <0.426.0> Keeping rabbit@rabbitmq02.iad2.fedoraproject.org listeners: the node is already back
2024-08-29 16:10:38.719 [info] <0.426.0> Autoheal: final node has stopped, starting...
2024-08-29 16:10:38.724 [info] <0.426.0> Autoheal finished according to winner 'rabbit@rabbitmq01.iad2.fedoraproject.org'
2024-08-29 16:10:38.948 [error] <0.225.0> Mnesia('rabbit@rabbitmq01.iad2.fedoraproject.org'): ** ERROR ** mnesia_event got {inconsistent_database, starting_partitioned_network, 'rabbit@rabbitmq03.iad2.fedoraproject.org'}

Then it looks like RabbitMQ might have been restarted:

2024-08-29 16:10:43.048 [info] <0.43.0> Application mnesia exited with reason: stopped
2024-08-29 16:10:43.381 [info] <0.22732.1860> 
 Starting RabbitMQ 3.7.23 on Erlang 22.0.7
 Copyright (C) 2007-2019 Pivotal Software, Inc.
 Licensed under the MPL.  See https://www.rabbitmq.com/
2024-08-29 16:10:43.383 [info] <0.22732.1860> 
 node           : rabbit@rabbitmq01.iad2.fedoraproject.org
 home dir       : /var/lib/rabbitmq
 config file(s) : /etc/rabbitmq/rabbitmq.config
 cookie hash    : [redacted]
 log(s)         : /var/log/rabbitmq/rabbit.log
                : /var/log/rabbitmq/rabbit_upgrade.log
 database dir   : /var/lib/rabbitmq/mnesia/rabbit
2024-08-29 16:10:43.397 [info] <0.22732.1860> Running boot step pre_boot defined by app rabbit
2024-08-29 16:10:43.397 [info] <0.22732.1860> Running boot step rabbit_core_metrics defined by app rabbit
2024-08-29 16:10:43.398 [info] <0.22732.1860> Running boot step rabbit_alarm defined by app rabbit
2024-08-29 16:10:43.404 [info] <0.5342.1860> Memory high watermark set to 3884 MiB (4073355673 bytes) of 9711 MiB (10183389184 bytes) total
2024-08-29 16:10:43.410 [info] <0.9967.1860> Enabling free disk space monitoring
2024-08-29 16:10:43.410 [info] <0.9967.1860> Disk free limit set to 50MB
2024-08-29 16:10:43.415 [info] <0.22732.1860> Running boot step code_server_cache defined by app rabbit
2024-08-29 16:10:43.415 [info] <0.22732.1860> Running boot step file_handle_cache defined by app rabbit
[...]
2024-08-29 16:10:45.268 [info] <0.19046.1860> Server startup complete; 6 plugins started.

It looks like we've had a network partition for about 16m30s. That may be the cause.
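
For context, the partition behaviour is governed by cluster_partition_handling in /etc/rabbitmq/rabbitmq.config, and the "Autoheal decision" lines above suggest ours is set to autoheal, i.e. roughly:

%% Classic Erlang config format used by RabbitMQ 3.7. With autoheal,
%% the cluster picks a winning partition once the network recovers and
%% restarts the losers, which matches the "I am the winner" log lines.
[
  {rabbit, [
    {cluster_partition_handling, autoheal}
  ]}
].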

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

a month ago

Closing this as it is already resolved and the tickets are processed.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

18 hours ago
