#9120 greenwave queue in rabbitmq has over 1300 messages
Closed: Fixed 3 years ago by pingou. Opened 3 years ago by pingou.

Describe what you would like us to do:


While checking queue sizes this morning I found that the greenwave queue on our RabbitMQ server has 1336 messages in it (at the time of writing).

I've pinged @lholecek and @gnaponie on IRC about it and am opening this ticket to track the issue.

When do you need this to be done by? (YYYY/MM/DD)



I've added monitoring on the queue size to help catch this earlier next time: https://pagure.io/fedora-infra/ansible/c/fd98793878e7a5772e9e66f3d85741358cabe082?branch=master
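For anyone wondering what such a check boils down to, here is a rough sketch of the threshold logic, not the code from the commit above. The function name and the threshold values are made up for illustration; the `messages` and `consumers` keys are the per-queue fields reported by the RabbitMQ management API, passed in here as a plain dict so the sketch does not need a broker.

```python
# Illustrative sketch only: check_queue and its default thresholds are
# hypothetical and are not taken from the fedora-infra ansible commit.
OK, WARNING, CRITICAL = 0, 1, 2


def check_queue(stats, warn=100, crit=1000, min_consumers=1):
    """Return a Nagios-style (status, summary) tuple for one queue.

    `stats` is a dict with the per-queue fields `messages` and `consumers`.
    """
    messages = stats.get("messages", 0)
    consumers = stats.get("consumers", 0)

    # A queue nobody consumes from is a problem even while it is still short.
    if consumers < min_consumers:
        return CRITICAL, f"CRITICAL - consumers ({consumers}) messages ({messages})"
    if messages >= crit:
        return CRITICAL, f"CRITICAL - messages ({messages}) consumers ({consumers})"
    if messages >= warn:
        return WARNING, f"WARNING - messages ({messages}) consumers ({consumers})"
    return OK, f"OK - messages ({messages}) consumers ({consumers})"
```

With these example thresholds, the 1336 messages reported in this ticket would come back as critical, while an empty queue with one consumer would come back as OK.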

The consumer pod got stuck on the following exception:

[greenwave.consumers.consumer INFO] Getting greenwave info
[fedora_messaging.twisted.protocol INFO] Successfully consumed message from topic org.fedoraproject.prod.resultsdb.result.new (message id 14ed11d8-6557-4fa3-9b0e-a4de684abf60)
[fedora_messaging.twisted.protocol INFO] Consuming message from topic org.fedoraproject.prod.taskotron.result.new (message id e4ac1d25-f143-41b7-af7b-f635be7d83bd)
[greenwave.consumers.fedora_messaging_consumer INFO] Received message from fedora-messaging with topic: org.fedoraproject.prod.taskotron.result.new
[fedora_messaging.twisted.protocol INFO] Successfully consumed message from topic org.fedoraproject.prod.taskotron.result.new (message id e4ac1d25-f143-41b7-af7b-f635be7d83bd)
[fedora_messaging.twisted.protocol WARNING] The connection to the broker was lost (ConnectionDone()), consumer halted; the connection should restart and consuming will resume.
[fedora_messaging.twisted.protocol INFO] Disconnect requested, but AMQP connection already gone
[fedora_messaging.twisted.protocol INFO] Disconnect requested, but AMQP connection already gone
Unhandled error in Deferred:
[fedora_messaging.twisted.protocol INFO] Disconnect requested, but AMQP connection already gone

Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/pika/callback.py", line 233, in process
    callback(*args, **keywords)
  File "/usr/lib/python3.7/site-packages/pika/adapters/twisted_connection.py", line 1240, in _on_connection_failed
    d.errback(exc)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 568, in _startRunCallbacks
    self._runCallbacks()
--- <exception caught here> ---
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 654, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/lib/python3.7/site-packages/fedora_messaging/twisted/factory.py", line 337, in on_ready_errback
    self._client_deferred.errback(wrapped_failure)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 501, in errback
    self._startRunCallbacks(fail)
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 561, in _startRunCallbacks
    raise AlreadyCalledError
twisted.internet.defer.AlreadyCalledError: 

[fedora_messaging.twisted.factory ERROR] The connection failed with an unexpected exception; please report this bug.
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/twisted/internet/defer.py", line 1418, in _inlineCallbacks
    result = g.send(result)
StopIteration

It probably just needs a restart.

A proper solution would involve fixing the library where it got stuck and, ideally, writing a liveness probe to watch the consumer's health.
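One possible shape for such a probe is a heartbeat file, sketched below under some assumptions: the consumer would call `touch_heartbeat()` after each message it handles, and the pod's liveness probe would run `main()` as an exec command. The file path, the maximum age and all function names are hypothetical; nothing here is part of greenwave or fedora-messaging.

```python
import os
import time

# Hypothetical values, chosen only for this sketch.
HEARTBEAT_FILE = "/tmp/greenwave-consumer-heartbeat"
MAX_AGE_SECONDS = 600


def touch_heartbeat(path=HEARTBEAT_FILE):
    """Record that the consumer just handled a message."""
    # Create the file if it is missing, then set its mtime to now.
    with open(path, "a"):
        os.utime(path, None)


def is_alive(path=HEARTBEAT_FILE, max_age=MAX_AGE_SECONDS, now=None):
    """Return True if the heartbeat file was touched within max_age seconds."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # No heartbeat file yet: treat the consumer as not alive.
        return False
    return age <= max_age


def main():
    """Exit code for an exec-style probe: 0 is healthy, 1 is unhealthy."""
    return 0 if is_alive() else 1
```

A probe script would end with `sys.exit(main())`. The catch with tying the heartbeat to message handling is that a quiet queue also lets it go stale, so the maximum age has to be longer than the longest expected gap between messages, otherwise a healthy pod gets restarted.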

The consumer pod was redeployed and seems to be healthy.

 zodbot | RECOVERY - rabbitmq01.iad2.fedoraproject.org/Check queue greenwave is OK: RABBITMQ_QUEUE OK - messages OK (0) messages_ready OK (0) messages_unacknowledged OK (0) consumers OK (1) All queues under the thresholds (noc01) 

Looks to be fine now.

Thanks @lholecek

@kevin this is the same issue we saw in mirror_ansible_pagure in the batcave and on robosignatory yesterday, the weird hiccup I told you about.

Metadata Update from @pingou:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago
