While checking queue sizes this morning I found out that the greenwave queue on our rabbitmq server has 1336 messages in it (at the time of this writing).
I've pinged @lholecek and @gnaponie on IRC about this and I'm opening this ticket to track this.
I've added monitoring on the queue size to help catch this earlier next time: https://pagure.io/fedora-infra/ansible/c/fd98793878e7a5772e9e66f3d85741358cabe082?branch=master
The consumer pod got stuck on the following exception:
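For readers unfamiliar with this kind of check, here is a rough sketch of what a queue-size check can look like. It is not the check from the commit above: the thresholds and the `check_queue` helper are made up for illustration, and the input is assumed to be a dict with the `messages`, `messages_ready`, `messages_unacknowledged` and `consumers` counters that RabbitMQ reports for a queue.

```python
# Hypothetical thresholds, NOT the values used by the real monitoring.
WARNING = 100
CRITICAL = 1000

LABELS = {0: "OK", 1: "WARNING", 2: "CRITICAL"}


def check_queue(stats, warning=WARNING, critical=CRITICAL):
    """Return a Nagios-style (exit_code, status_line) for one queue.

    `stats` is assumed to be a dict of per-queue counters.
    """
    worst = 0
    parts = []

    # Message counters: the higher the backlog, the worse the state.
    for key in ("messages", "messages_ready", "messages_unacknowledged"):
        value = int(stats.get(key, 0))
        if value >= critical:
            level = 2
        elif value >= warning:
            level = 1
        else:
            level = 0
        worst = max(worst, level)
        parts.append(f"{key} {LABELS[level]} ({value})")

    # A queue with no consumer attached will only ever grow.
    consumers = int(stats.get("consumers", 0))
    level = 0 if consumers >= 1 else 2
    worst = max(worst, level)
    parts.append(f"consumers {LABELS[level]} ({consumers})")

    return worst, f"RABBITMQ_QUEUE {LABELS[worst]} - " + " ".join(parts)
```

With the backlog from this ticket (1336 messages) and these made-up thresholds, the sketch reports CRITICAL even though a consumer is still attached, which is the situation a stuck consumer produces.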
[greenwave.consumers.consumer INFO] Getting greenwave info
[fedora_messaging.twisted.protocol INFO] Successfully consumed message from topic org.fedoraproject.prod.resultsdb.result.new (message id 14ed11d8-6557-4fa3-9b0e-a4de684abf60)
[fedora_messaging.twisted.protocol INFO] Consuming message from topic org.fedoraproject.prod.taskotron.result.new (message id e4ac1d25-f143-41b7-af7b-f635be7d83bd)
[greenwave.consumers.fedora_messaging_consumer INFO] Received message from fedora-messaging with topic: org.fedoraproject.prod.taskotron.result.new
[fedora_messaging.twisted.protocol INFO] Successfully consumed message from topic org.fedoraproject.prod.taskotron.result.new (message id e4ac1d25-f143-41b7-af7b-f635be7d83bd)
[fedora_messaging.twisted.protocol WARNING] The connection to the broker was lost (ConnectionDone()), consumer halted; the connection should restart and consuming will resume.
[fedora_messaging.twisted.protocol INFO] Disconnect requested, but AMQP connection already gone
[fedora_messaging.twisted.protocol INFO] Disconnect requested, but AMQP connection already gone
Unhandled error in Deferred:
[fedora_messaging.twisted.protocol INFO] Disconnect requested, but AMQP connection already gone
Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/pika/callback.py", line 233, in process
    callback(*args, **keywords)
  File "/usr/lib/python3.7/site-packages/pika/adapters/twisted_connection.py", line 1240, in _on_connection_failed
    d.errback(exc)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3.7/site-packages/fedora_messaging/twisted/factory.py", line 337, in on_ready_errback
    self._client_deferred.errback(wrapped_failure)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 561, in _startRunCallbacks
    raise AlreadyCalledError
twisted.internet.defer.AlreadyCalledError:
[fedora_messaging.twisted.factory ERROR] The connection failed with an unexpected exception; please report this bug.
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration
It probably just needs a restart.
A proper solution would involve fixing the library where it got stuck and, ideally, writing a liveness probe to watch the consumer's health.
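To make the liveness probe idea a bit more concrete, here is a minimal sketch of one possible approach, a heartbeat file. Nothing here exists in greenwave or fedora-messaging: the file path, the maximum age and all function names are assumptions for illustration. The consumer would call `touch_heartbeat()` periodically while its broker connection is healthy, and the probe command would exit with `probe_exit_code()`.

```python
import os
import time

# Hypothetical values, chosen only for this sketch.
HEARTBEAT_FILE = "/tmp/greenwave-consumer-heartbeat"
MAX_AGE_SECONDS = 120


def touch_heartbeat(path=HEARTBEAT_FILE):
    """Record that the consumer is still healthy (called by the consumer)."""
    # Create the file if it is missing, then bump its modification time.
    with open(path, "a"):
        os.utime(path, None)


def is_alive(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the heartbeat file was updated recently enough."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        # No heartbeat file at all: treat the consumer as not alive.
        return False
    if now is None:
        now = time.time()
    return (now - mtime) <= max_age


def probe_exit_code(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS):
    """Exit code for the probe command: 0 means healthy, 1 means restart."""
    return 0 if is_alive(path, max_age) else 1
```

The heartbeat should be driven by a timer tied to the connection state rather than by message handling alone, otherwise an idle but healthy consumer would be restarted for no reason.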
The consumer pod was redeployed and seems to be healthy.
zodbot | RECOVERY - rabbitmq01.iad2.fedoraproject.org/Check queue greenwave is OK: RABBITMQ_QUEUE OK - messages OK (0) messages_ready OK (0) messages_unacknowledged OK (0) consumers OK (1) All queues under the thresholds (noc01)
Looks to be fine now.
Thanks @lholecek
@kevin this is the same issue we saw in mirror_ansible_pagure on the batcave and on robosignatory yesterday, the weird hiccup I told you about.
Metadata Update from @pingou: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)