#125 osci-pipelines queues accumulating unacked messages
Closed 2 years ago by mvadkert. Opened 3 years ago by pingou.

This morning nagios notified us of an osci-pipelines queue going over its threshold on unacked messages. So I went and looked at the admin interface of the Fedora-infra rabbitmq cluster where I noticed the following:

Screenshot_from_2020-07-20_15-52-08.png

Basically, it looks like some of the OSCI pipelines queues are accumulating unacked messages.

Here is what each osci-pipelines queue is bound to (a sketch of how bindings like these are declared follows the list):

osci-pipelines-queue-0
  - org.centos.ci.#
  - org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete
  - org.fedoraproject.prod.pagure.pull-request.comment.added
  - org.fedoraproject.prod.pagure.pull-request.new
  - org.fedoraproject.prod.pungi.compose.status.change
osci-pipelines-queue-1
  - org.centos.ci.#
  - org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete
osci-pipelines-queue-2
  - org.centos.ci.#
  - org.fedoraproject.prod.copr.build.start
  - org.fedoraproject.prod.pagure.pull-request.new
osci-pipelines-queue-3
  - org.centos.ci.#
  - org.fedoraproject.prod.buildsys.task.state.change
osci-pipelines-queue-4
  - org.centos.ci.#
  - org.fedoraproject.prod.buildsys.tag
  - org.fedoraproject.prod.copr.build.start
osci-pipelines-queue-5
  - org.centos.ci.#
osci-pipelines-queue-6
  - org.centos.ci.#
osci-pipelines-queue-7
  - org.centos.ci.#
osci-pipelines-queue-8 (Not found)
osci-pipelines-queue-9
  - org.centos.ci.#
  - org.fedoraproject.prod.pagure.pull-request.comment.added
osci-pipelines-queue-10
  - org.centos.ci.#
osci-pipelines-queue-11
  - org.centos.ci.#
osci-pipelines-queue-12
  - org.centos.ci.#
  - org.fedoraproject.prod.buildsys.tag
osci-pipelines-queue-13
  - org.centos.ci.#
osci-pipelines-queue-14
  - org.centos.ci.#
osci-pipelines-queue-15
  - org.centos.ci.#
rpminspect
  - org.fedoraproject.prod.buildsys.build.state.change
  - org.fedoraproject.prod.buildsys.build.state.change.#
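
For illustration, a minimal sketch of how bindings like these are declared with pika; the exchange name (`amq.topic`) and the broker URL are assumptions, not the actual OSCI deployment code:

```python
# Sketch only: bind a queue to topics on a topic exchange.
# The exchange name and broker URL below are assumptions, not the real deployment values.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))  # placeholder broker
channel = connection.channel()

# Each binding routes messages whose topic matches the routing key into the queue;
# "#" matches any number of topic words.
for routing_key in (
    "org.centos.ci.#",
    "org.fedoraproject.prod.buildsys.task.state.change",
):
    channel.queue_bind(
        queue="osci-pipelines-queue-3",
        exchange="amq.topic",
        routing_key=routing_key,
    )

connection.close()
```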

@jimbair @astepano @msrb any idea what is going on here?

tagging @tflink because of the rpminspect job.

osci-pipelines-queue-0 - should be subscribed only to "org.fedoraproject.prod.pungi.compose.status.change"

osci-pipelines-queue-2 - should be subscribed only to "org.fedoraproject.prod.pagure.pull-request.new"

As an FYI, I saw this this morning:

07:22:20  zodbot | RECOVERY - rabbitmq01.iad2.fedoraproject.org/Check queue osci-pipelines-queue-2 is OK: RABBITMQ_QUEUE OK - messages OK (9393) messages_ready OK (0) messages_unacknowledged OK (9393) consumers OK (2) All queues under the thresholds (noc01)                                                                                                                                                                      

Which is basically nagios telling us that the queue osci-pipelines-queue-2 went back under the threshold (which is at 10k).
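
For reference, the counters that check looks at (messages ready vs. unacknowledged, plus the consumer count) can also be read from the RabbitMQ management API; a sketch, where the management port, vhost and credentials are all placeholders:

```python
# Sketch only: query the RabbitMQ management API for the same counters nagios alerts on.
# The management port, vhost and credentials below are placeholders.
import requests

BASE = "https://rabbitmq01.iad2.fedoraproject.org:15671/api"  # assumed management endpoint
AUTH = ("monitoring", "secret")                               # hypothetical credentials
VHOST = "%2Fpubsub"                                           # URL-encoded vhost, assumed

resp = requests.get(f"{BASE}/queues/{VHOST}/osci-pipelines-queue-2", auth=AUTH)
resp.raise_for_status()
q = resp.json()
print(q["messages_ready"], q["messages_unacknowledged"], q["consumers"])
```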

@pingou do you know if there is any TTL configured for the queues?

I didn't check them all, but all the osci-pipelines ones I've checked have x-message-ttl: 864000000, and rpminspect doesn't have any.

You can see the TTL in the screenshot above btw, in the 5th column
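
For context, 864000000 ms is 10 days. A per-queue message TTL like this is set as an x-argument when the queue is declared; a minimal sketch (placeholder broker URL, not the actual ansible that created the queues):

```python
# Sketch: declaring a durable queue with a 10-day message TTL, to illustrate
# the x-message-ttl argument mentioned above. Broker URL is a placeholder.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="osci-pipelines-queue-0",
    durable=True,
    arguments={"x-message-ttl": 10 * 24 * 60 * 60 * 1000},  # 864000000 ms = 10 days
)
connection.close()
```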

Yes, the TTL for the OSCI queues was set when the queues were created: https://pagure.io/fedora-infra/ansible/pull-request/118#request_diff

I'm not sure how rpminspect queues were created.

@pingou could you please unsubscribe the following queues from the following topics? (A sketch of the corresponding unbind calls follows the list.)

osci-pipelines-queue-0
- org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete
- org.fedoraproject.prod.pagure.pull-request.comment.added
- org.fedoraproject.prod.pagure.pull-request.new

osci-pipelines-queue-2
- org.fedoraproject.prod.copr.build.start
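
A sketch of the corresponding unbind calls; the exchange name and broker URL are assumptions:

```python
# Sketch of the requested unbinds; exchange name and broker URL are assumptions.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))  # placeholder
channel = connection.channel()

for routing_key in (
    "org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete",
    "org.fedoraproject.prod.pagure.pull-request.comment.added",
    "org.fedoraproject.prod.pagure.pull-request.new",
):
    channel.queue_unbind(
        queue="osci-pipelines-queue-0",
        exchange="amq.topic",
        routing_key=routing_key,
    )

channel.queue_unbind(
    queue="osci-pipelines-queue-2",
    exchange="amq.topic",
    routing_key="org.fedoraproject.prod.copr.build.start",
)
connection.close()
```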

Done:

osci-pipelines-queue-0
  - org.fedoraproject.prod.pungi.compose.status.change
osci-pipelines-queue-2
  - org.fedoraproject.prod.pagure.pull-request.new

In case this helps to clarify: https://stackoverflow.com/questions/31915773/rabbitmq-what-are-ready-and-unacked-types-of-messages. Basically, it looks like the pipeline took the messages from the queue but never acked/nacked them, so the server does not know whether they have been processed or not.
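
To illustrate the Ready/Unacked distinction: with manual acknowledgements, a fetched message sits in "Unacked" until the consumer acks it, and a nack with requeue puts it back in "Ready". A minimal pika sketch, not the pipeline's actual consumer code; `process()` is a hypothetical handler:

```python
# Sketch of a consumer with manual acknowledgements (not the pipeline's code).
# Until basic_ack is called, the broker counts the delivered message as "unacked".
import pika

def on_message(channel, method, properties, body):
    try:
        process(body)  # hypothetical processing step
        channel.basic_ack(delivery_tag=method.delivery_tag)  # message leaves the queue
    except Exception:
        # Nack with requeue=True puts the message back into "Ready".
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))  # placeholder
channel = connection.channel()
channel.basic_consume(queue="osci-pipelines-queue-2", on_message_callback=on_message, auto_ack=False)
channel.start_consuming()
```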

The queue in our Jenkins seems empty, so I don't know why those messages are unacked.

Regarding osci-pipelines-queue-3 specifically: this is a trigger for scratch builds. As the scratch builds were causing too much load, we have currently disabled this build in the pipeline. Could this cause the queue to grow, or would Jenkins just process the messages and simply not run the job?

Does jenkins ack the message to rabbitmq when it is done processing them?

Does jenkins ack the message to rabbitmq when it is done processing them?

@msrb @astepano would you know how Jenkins handles this? ^

The rpminspect queue was backed up due to an infra problem; the jenkins instance is working through it, but the queue is quite long still: http://fedora-build-checks.apps.ci.centos.org/

That makes me feel like Jenkins should be acking them if rpminspect is?

The rpminspect queue was backed up due to an infra problem;

Looking at the first screenshot, on our side it looks like all messages were accepted by Jenkins; it just has not reported back whether they were correctly processed (ack) or not (nack).

the jenkins instance is working through it, but the queue is quite long still: http://fedora-build-checks.apps.ci.centos.org/
That makes me feel like Jenkins should be acking them if rpminspect is?

Could be, but then that means there are almost 100K messages in the backlog to be acked, and looking at the two screenshots, it doesn't seem to be catching up :(

I'll do some poking around and see what I can find; we can chat here or on IRC. :) Either works for me. In a few meetings now but I'll let you know when I find something helpful.

queue-4 should be used by https://osci-jenkins.ci.fedoraproject.org/job/RabbitMQ_Copr_Heartbeat/ -- which is not available. I think that's why the queue is so big. In any case, this heartbeat job is just a test job so if all messages get automatically dropped when their TTL is reached, then it's still fine.

I think that's why the queue is so big.

Well, the queue itself is empty, there are 0 messages in "Ready", so all messages have been picked up by some consumer that then has not acked/nacked them.

We rebooted jenkins as we were seeing the following errors in Jenkins:

WARNING: Exception occurred when closing channel: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - unknown delivery tag 218944, class-id=60, method-id=80)

After the reboot, jenkins filled up with a lot of new jobs (currently at 726 jobs): https://jenkins-continuous-infra.apps.ci.centos.org/

Can you check the dashboard to see if it's better?

The state of things this morning:

osci-pipelines-queue-4 has 33.8K unacked messages
osci-pipelines-queue-12 has 7k unacked messages

The state of things this morning:
osci-pipelines-queue-4 has 33.8K unacked messages
osci-pipelines-queue-12 has 7k unacked messages

This is expected. The Jenkins instance that was processing the queue-4 is not running at this point (it was just a dummy "heartbeat" job) and the queue-12 is used by the upcoming ELN pipelines. The work on the ELN pipeline is still in progress, so the pipeline is not processing all messages.

To mitigate this issue, I'd propose lowering the TTL for all OSCI queues. It's currently set to 10 days, but I think 5 days should do just fine for us. That way the broker will drop those unprocessed messages more quickly.
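
A sketch of one way the TTL could be lowered without recreating the queues, via a policy set through the management API; the policy name, pattern, vhost, credentials and management endpoint are all assumptions, and whether a policy is the right mechanism here depends on how the queues are managed in ansible:

```python
# Sketch only: set a policy lowering message TTL on the osci-pipelines queues.
# Host, credentials, vhost, policy name and pattern are placeholders.
import requests

BASE = "https://rabbitmq01.iad2.fedoraproject.org:15671/api"  # assumed management endpoint
AUTH = ("admin", "secret")                                    # hypothetical credentials
VHOST = "%2Fpubsub"                                           # URL-encoded vhost, assumed

policy = {
    "pattern": "^osci-pipelines-queue-.*",
    "apply-to": "queues",
    "definition": {"message-ttl": 5 * 24 * 60 * 60 * 1000},  # 432000000 ms = 5 days
}
resp = requests.put(f"{BASE}/policies/{VHOST}/osci-ttl", json=policy, auth=AUTH)
resp.raise_for_status()
```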

Considering the messages are unacked: while they are not processed, something is still picking them up from the queue (otherwise the messages would be "Ready", not "unacked"), so I'm not sure lowering the TTL would help for these.

Shall I try to purge the queues -4 and -12 or do you want to keep them as is?

@msrb should I purge the queues for -4 and -12?

This morning:

osci-pipelines-queue-4 has 38.9K unacked messages
osci-pipelines-queue-12 has 8.5k unacked messages

@pingou feel free to purge :wink: thanks

Arf, purging doesn't do anything (it works for "Ready" messages, but not "unacked" ones :()
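
That matches how purging works: whether done from the management UI or programmatically, a purge only drops messages in "Ready"; unacked messages stay attached to their consumer channels. A minimal pika sketch with a placeholder broker URL:

```python
# Sketch: purging a queue only removes "Ready" messages; unacked messages stay
# attached to their consumer channels until they are acked, nacked, or the
# consumer's connection is closed. Broker URL is a placeholder.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))
channel = connection.channel()
purged = channel.queue_purge(queue="osci-pipelines-queue-4")
print(f"purged {purged.method.message_count} ready messages")  # 0 here, since everything is unacked
connection.close()
```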

So, looking more into this I ran into: http://rabbitmq.1065348.n5.nabble.com/Message-unacked-Queue-not-responding-td23561.html

I'll highlight:

It's a common mistake to miss the basic_ack. It's an easy error, but the consequences are serious. Messages will be redelivered when your client quits (which may look like random redelivery), but RabbitMQ will eat more and more memory as it won't be able to release any unacked messages. 

That could explain some of the behavior you saw earlier when you rebooted one/some of the machines.

I have the IP address of the consumers attached to the queues -4 and -12 if that can help you (because I do see consumers attached to them!)

Looks like we'll need to find these consumers if we want to move the messages back to "Ready" so we can purge the queues: https://stackoverflow.com/a/40933535
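
A sketch of how those consumers and their connections could be located and force-closed via the management API, which should requeue their unacked messages as "Ready"; the management port, vhost and credentials are placeholders:

```python
# Sketch: list the consumers on a queue and close their connections so the broker
# requeues their unacked messages as "Ready". Placeholder host, credentials and
# vhost; double-check before running this against production.
import requests

BASE = "https://rabbitmq01.iad2.fedoraproject.org:15671/api"  # assumed management endpoint
AUTH = ("admin", "secret")                                    # hypothetical credentials
VHOST = "%2Fpubsub"                                           # URL-encoded vhost, assumed

for queue in ("osci-pipelines-queue-4", "osci-pipelines-queue-12"):
    info = requests.get(f"{BASE}/queues/{VHOST}/{queue}", auth=AUTH).json()
    for consumer in info.get("consumer_details", []):
        conn_name = consumer["channel_details"]["connection_name"]
        print(f"{queue}: closing {conn_name}")
        encoded = requests.utils.quote(conn_name, safe="")
        requests.delete(f"{BASE}/connections/{encoded}", auth=AUTH)
```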

I can try to restart the instance where both jobs seem to be running. I hope that the JMS plugin will not try to fetch all 30+k messages at once :)

Stop it, let me clear the queue and then restart it, otherwise jenkins will be pulling all these messages

This morning, the number of unacked messages on the whole instance is 0 \o/

Therefore, I think we can close this ticket :)

Metadata Update from @mvadkert:
- Issue status updated to: Closed (was: Open)

2 years ago
