#125 osci-pipelines queues accumulating unacked messages
Closed 2 years ago by mvadkert. Opened 3 years ago by pingou.

This morning nagios notified us of an osci-pipelines queue going over its threshold on unacked messages. So I went and looked at the admin interface of the Fedora-infra rabbitmq cluster where I noticed the following:

Screenshot_from_2020-07-20_15-52-08.png

Basically, it looks like some of the OSCI pipelines queues are accumulating unacked messages.

Here is what each osci-pipelines queue is bound to (a sketch of how bindings like these are declared follows the list):

osci-pipelines-queue-0
  - org.centos.ci.#
  - org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete
  - org.fedoraproject.prod.pagure.pull-request.comment.added
  - org.fedoraproject.prod.pagure.pull-request.new
  - org.fedoraproject.prod.pungi.compose.status.change
osci-pipelines-queue-1
  - org.centos.ci.#
  - org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete
osci-pipelines-queue-2
  - org.centos.ci.#
  - org.fedoraproject.prod.copr.build.start
  - org.fedoraproject.prod.pagure.pull-request.new
osci-pipelines-queue-3
  - org.centos.ci.#
  - org.fedoraproject.prod.buildsys.task.state.change
osci-pipelines-queue-4
  - org.centos.ci.#
  - org.fedoraproject.prod.buildsys.tag
  - org.fedoraproject.prod.copr.build.start
osci-pipelines-queue-5
  - org.centos.ci.#
osci-pipelines-queue-6
  - org.centos.ci.#
osci-pipelines-queue-7
  - org.centos.ci.#
osci-pipelines-queue-8 (Not found)
osci-pipelines-queue-9
  - org.centos.ci.#
  - org.fedoraproject.prod.pagure.pull-request.comment.added
osci-pipelines-queue-10
  - org.centos.ci.#
osci-pipelines-queue-11
  - org.centos.ci.#
osci-pipelines-queue-12
  - org.centos.ci.#
  - org.fedoraproject.prod.buildsys.tag
osci-pipelines-queue-13
  - org.centos.ci.#
osci-pipelines-queue-14
  - org.centos.ci.#
osci-pipelines-queue-15
  - org.centos.ci.#
rpminspect
  - org.fedoraproject.prod.buildsys.build.state.change
  - org.fedoraproject.prod.buildsys.build.state.change.#
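
For illustration, a minimal sketch of how bindings like these are declared with pika; the exchange name (`amq.topic`) and the broker URL are assumptions, not the actual OSCI deployment code:

```python
# Sketch only: bind a queue to topics on a topic exchange.
# The exchange name and broker URL below are assumptions, not the real deployment values.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))  # placeholder broker
channel = connection.channel()

# Each binding routes messages whose topic matches the routing key into the queue;
# "#" matches any number of topic words.
for routing_key in (
    "org.centos.ci.#",
    "org.fedoraproject.prod.buildsys.task.state.change",
):
    channel.queue_bind(
        queue="osci-pipelines-queue-3",
        exchange="amq.topic",
        routing_key=routing_key,
    )

connection.close()
```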

@jimbair @astepano @msrb any idea what is going on here?

tagging @tflink because of the rpminspect job.

osci-pipelines-queue-0 - should be subscribed only to "org.fedoraproject.prod.pungi.compose.status.change"

osci-pipelines-queue-2 - should be subscribed only to "org.fedoraproject.prod.pagure.pull-request.new"

As an FYI, I saw this this morning:

07:22:20  zodbot | RECOVERY - rabbitmq01.iad2.fedoraproject.org/Check queue osci-pipelines-queue-2 is OK: RABBITMQ_QUEUE OK - messages OK (9393) messages_ready OK (0) messages_unacknowledged OK (9393) consumers OK (2) All queues under the thresholds (noc01)                                                                                                                                                                      

Which is basically nagios telling us that the queue osci-pipelines-queue-2 went back under the threshold (which is at 10k).
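
For reference, the counters that check looks at (messages ready vs. unacknowledged, plus the consumer count) can also be read from the RabbitMQ management API; a sketch, where the management port, vhost and credentials are all placeholders:

```python
# Sketch only: query the RabbitMQ management API for the same counters nagios alerts on.
# The management port, vhost and credentials below are placeholders.
import requests

BASE = "https://rabbitmq01.iad2.fedoraproject.org:15671/api"  # assumed management endpoint
AUTH = ("monitoring", "secret")                               # hypothetical credentials
VHOST = "%2Fpubsub"                                           # URL-encoded vhost, assumed

resp = requests.get(f"{BASE}/queues/{VHOST}/osci-pipelines-queue-2", auth=AUTH)
resp.raise_for_status()
q = resp.json()
print(q["messages_ready"], q["messages_unacknowledged"], q["consumers"])
```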

@pingou do you know if there is any TTL configured for the queues?

I didn't check them all, but all the osci-pipelines ones I've checked have x-message-ttl: 864000000, and rpminspect doesn't have any.

You can see the TTL in the screenshot above btw, in the 5th column
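
For context, 864000000 ms is 10 days. A per-queue message TTL like this is set as an x-argument when the queue is declared; a minimal sketch (placeholder broker URL, not the actual ansible that created the queues):

```python
# Sketch: declaring a durable queue with a 10-day message TTL, to illustrate
# the x-message-ttl argument mentioned above. Broker URL is a placeholder.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))
channel = connection.channel()
channel.queue_declare(
    queue="osci-pipelines-queue-0",
    durable=True,
    arguments={"x-message-ttl": 10 * 24 * 60 * 60 * 1000},  # 864000000 ms = 10 days
)
connection.close()
```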

Yes, the TTL for the OSCI queues was set when the queues were created: https://pagure.io/fedora-infra/ansible/pull-request/118#request_diff

I'm not sure how rpminspect queues were created.

@pingou could you please unsubscribe the following queues from the following topics? (A sketch of the corresponding unbind calls follows the list.)

osci-pipelines-queue-0
- org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete
- org.fedoraproject.prod.pagure.pull-request.comment.added
- org.fedoraproject.prod.pagure.pull-request.new

osci-pipelines-queue-2
- org.fedoraproject.prod.copr.build.start
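
A sketch of the corresponding unbind calls; the exchange name and broker URL are assumptions:

```python
# Sketch of the requested unbinds; exchange name and broker URL are assumptions.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))  # placeholder
channel = connection.channel()

for routing_key in (
    "org.fedoraproject.prod.bodhi.update.status.testing.koji-build-group.build.complete",
    "org.fedoraproject.prod.pagure.pull-request.comment.added",
    "org.fedoraproject.prod.pagure.pull-request.new",
):
    channel.queue_unbind(
        queue="osci-pipelines-queue-0",
        exchange="amq.topic",
        routing_key=routing_key,
    )

channel.queue_unbind(
    queue="osci-pipelines-queue-2",
    exchange="amq.topic",
    routing_key="org.fedoraproject.prod.copr.build.start",
)
connection.close()
```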

Done:

osci-pipelines-queue-0
  - org.fedoraproject.prod.pungi.compose.status.change
osci-pipelines-queue-2
  - org.fedoraproject.prod.pagure.pull-request.new

In case this helps to clarify: https://stackoverflow.com/questions/31915773/rabbitmq-what-are-ready-and-unacked-types-of-messages. Basically, it looks like the pipeline took the messages from the queue but never acked/nacked them, so the server does not know whether they have been processed or not.
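
To illustrate the Ready/Unacked distinction: with manual acknowledgements, a fetched message sits in "Unacked" until the consumer acks it, and a nack with requeue puts it back in "Ready". A minimal pika sketch, not the pipeline's actual consumer code; `process()` is a hypothetical handler:

```python
# Sketch of a consumer with manual acknowledgements (not the pipeline's code).
# Until basic_ack is called, the broker counts the delivered message as "unacked".
import pika

def on_message(channel, method, properties, body):
    try:
        process(body)  # hypothetical processing step
        channel.basic_ack(delivery_tag=method.delivery_tag)  # message leaves the queue
    except Exception:
        # Nack with requeue=True puts the message back into "Ready".
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))  # placeholder
channel = connection.channel()
channel.basic_consume(queue="osci-pipelines-queue-2", on_message_callback=on_message, auto_ack=False)
channel.start_consuming()
```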

The queue in our Jenkins seems empty, so I don't know why those messages are unacked.

Regarding osci-pipelines-queue-3 specifically: this is a trigger for scratch builds. As the scratch builds were causing too much load, we have currently disabled this build in the pipeline. Could this cause the queue to grow, or would Jenkins just process the messages and simply not run the job?

Does jenkins ack the message to rabbitmq when it is done processing them?

Does jenkins ack the message to rabbitmq when it is done processing them?

@msrb @astepano would you know how Jenkins handles this? ^

The rpminspect queue was backed up due to an infra problem; the jenkins instance is working through it, but the queue is quite long still: http://fedora-build-checks.apps.ci.centos.org/

That makes me feel like Jenkins should be acking them if rpminspect is?

The rpminspect queue was backed up due to an infra problem;

Looking at the first screenshot, on our side it looks like all messages were accepted by Jenkins; it just has not reported back whether they were correctly processed (ack) or not (nack).

the jenkins instance is working through it, but the queue is quite long still: http://fedora-build-checks.apps.ci.centos.org/
That makes me feel like Jenkins should be acking them if rpminspect is?

Could be, but then that means there are almost 100K messages in the backlog to be acked, and looking at the two screenshots, it doesn't seem to be catching up :(

I'll do some poking around and see what I can find; we can chat here or on IRC. :) Either works for me. In a few meetings now but I'll let you know when I find something helpful.

queue-4 should be used by https://osci-jenkins.ci.fedoraproject.org/job/RabbitMQ_Copr_Heartbeat/ -- which is not available. I think that's why the queue is so big. In any case, this heartbeat job is just a test job so if all messages get automatically dropped when their TTL is reached, then it's still fine.

I think that's why the queue is so big.

Well, the queue itself is empty, there are 0 messages in "Ready", so all messages have been picked up by some consumer that then has not acked/nacked them.

We rebooted jenkins as we were seeing the following errors in Jenkins:

WARNING: Exception occurred when closing channel: channel is already closed due to channel error; protocol method: #method<channel.close>(reply-code=406, reply-text=PRECONDITION_FAILED - unknown delivery tag 218944, class-id=60, method-id=80)

After the reboot, jenkins filled up with a lot of new jobs (currently at 726 jobs): https://jenkins-continuous-infra.apps.ci.centos.org/

Can you check the dashboard to see if it's better?

The state of things this morning:

osci-pipelines-queue-4 has 33.8K unacked messages
osci-pipelines-queue-12 has 7k unacked messages

The state of things this morning:
osci-pipelines-queue-4 has 33.8K unacked messages
osci-pipelines-queue-12 has 7k unacked messages

This is expected. The Jenkins instance that was processing the queue-4 is not running at this point (it was just a dummy "heartbeat" job) and the queue-12 is used by the upcoming ELN pipelines. The work on the ELN pipeline is still in progress, so the pipeline is not processing all messages.

To mitigate this issue, I'd propose lowering the TTL for all OSCI queues. It's currently set to 10 days, but I think 5 days should do just fine for us. That way the broker will drop those unprocessed messages more quickly.
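
A sketch of one way the TTL could be lowered without recreating the queues, via a policy set through the management API; the policy name, pattern, vhost, credentials and management endpoint are all assumptions, and whether a policy is the right mechanism here depends on how the queues are managed in ansible:

```python
# Sketch only: set a policy lowering message TTL on the osci-pipelines queues.
# Host, credentials, vhost, policy name and pattern are placeholders.
import requests

BASE = "https://rabbitmq01.iad2.fedoraproject.org:15671/api"  # assumed management endpoint
AUTH = ("admin", "secret")                                    # hypothetical credentials
VHOST = "%2Fpubsub"                                           # URL-encoded vhost, assumed

policy = {
    "pattern": "^osci-pipelines-queue-.*",
    "apply-to": "queues",
    "definition": {"message-ttl": 5 * 24 * 60 * 60 * 1000},  # 432000000 ms = 5 days
}
resp = requests.put(f"{BASE}/policies/{VHOST}/osci-ttl", json=policy, auth=AUTH)
resp.raise_for_status()
```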

Considering the messages are unacked: while they are not processed, something is still picking them up from the queue (otherwise the messages would be "Ready", not "unacked"), so I'm not sure lowering the TTL would help for these.

Shall I try to purge the queues -4 and -12 or do you want to keep them as is?

@msrb should I purge the queues for -4 and -12?

This morning:

osci-pipelines-queue-4 has 38.9K unacked messages
osci-pipelines-queue-12 has 8.5k unacked messages

@pingou feel free to purge :wink: thanks

Arf, purging doesn't do anything (it works for "Ready" messages, but not "unacked" ones :()
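
That matches how purging works: whether done from the management UI or programmatically, a purge only drops messages in "Ready"; unacked messages stay attached to their consumer channels. A minimal pika sketch with a placeholder broker URL:

```python
# Sketch: purging a queue only removes "Ready" messages; unacked messages stay
# attached to their consumer channels until they are acked, nacked, or the
# consumer's connection is closed. Broker URL is a placeholder.
import pika

connection = pika.BlockingConnection(pika.URLParameters("amqp://localhost"))
channel = connection.channel()
purged = channel.queue_purge(queue="osci-pipelines-queue-4")
print(f"purged {purged.method.message_count} ready messages")  # 0 here, since everything is unacked
connection.close()
```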

So, looking more into this I ran into: http://rabbitmq.1065348.n5.nabble.com/Message-unacked-Queue-not-responding-td23561.html

I'll highlight:

It's a common mistake to miss the basic_ack. It's an easy error, but the consequences are serious. Messages will be redelivered when your client quits (which may look like random redelivery), but RabbitMQ will eat more and more memory as it won't be able to release any unacked messages. 

That could explain some of the behavior you saw earlier when you rebooted one/some of the machines.

I have the IP address of the consumers attached to the queues -4 and -12 if that can help you (because I do see consumers attached to them!)

Looks like we'll need to find these consumers if we want to move the messages back to "Ready" so we can purge the queues: https://stackoverflow.com/a/40933535
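
A sketch of how those consumers and their connections could be located and force-closed via the management API, which should requeue their unacked messages as "Ready"; the management port, vhost and credentials are placeholders:

```python
# Sketch: list the consumers on a queue and close their connections so the broker
# requeues their unacked messages as "Ready". Placeholder host, credentials and
# vhost; double-check before running this against production.
import requests

BASE = "https://rabbitmq01.iad2.fedoraproject.org:15671/api"  # assumed management endpoint
AUTH = ("admin", "secret")                                    # hypothetical credentials
VHOST = "%2Fpubsub"                                           # URL-encoded vhost, assumed

for queue in ("osci-pipelines-queue-4", "osci-pipelines-queue-12"):
    info = requests.get(f"{BASE}/queues/{VHOST}/{queue}", auth=AUTH).json()
    for consumer in info.get("consumer_details", []):
        conn_name = consumer["channel_details"]["connection_name"]
        print(f"{queue}: closing {conn_name}")
        encoded = requests.utils.quote(conn_name, safe="")
        requests.delete(f"{BASE}/connections/{encoded}", auth=AUTH)
```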

I can try to restart the instance where both jobs seem to be running. I hope that the JMS plugin will not try to fetch all 30+k messages at once :)

Stop it, let me clear the queue and then restart it, otherwise jenkins will be pulling all these messages

This morning, the number of unacked messages on the whole instance is 0 \o/

Therefore, I think we can close this ticket :)

Metadata Update from @mvadkert:
- Issue status updated to: Closed (was: Open)

2 years ago
