#5249 Monitor the size of queued_messages in the FMN db
Closed: Fixed 5 years ago Opened 7 years ago by ralph.

It started overflowing recently, and we deleted some entries for clogged users, which got it flowing again.

To detect this in the future, we should monitor the size of the queued_messages table in the notifications db in both collectd and nagios.

Do we have any examples of db table size checks out there already? Those would be nice and easy to copy, if we do.
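For the collectd half, something along these lines could be run from the exec plugin. This is only a rough sketch: the db name, readonly role, host and value identifiers below are placeholders, not our actual config.

```python
#!/usr/bin/env python
# Rough sketch of a collectd exec-plugin probe for the queued_messages
# table. All connection details here are placeholders, not real config.
import socket
import psycopg2

conn = psycopg2.connect(dbname="notifications", user="readonly", host="db-fmn")
cur = conn.cursor()

# Row count is the interesting number: it grows when delivery clogs up.
cur.execute("SELECT count(*) FROM queued_messages;")
rows = cur.fetchone()[0]

# On-disk size is a useful second gauge.
cur.execute("SELECT pg_total_relation_size('queued_messages');")
size_bytes = cur.fetchone()[0]
conn.close()

# Emit values in the PUTVAL format collectd's exec plugin reads.
# (In practice this would loop on collectd's interval; kept one-shot here.)
host = socket.getfqdn()
print('PUTVAL "%s/fmn/gauge-queued_messages_rows" N:%d' % (host, rows))
print('PUTVAL "%s/fmn/gauge-queued_messages_bytes" N:%d' % (host, size_bytes))
```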


If no one is working on this, I'm interested in fixing it. I'm new around here, so I'll ask a lot of questions that might have been answered somewhere else.

What db is the FMN db using (postgres, mysql, etc...)?

Is there a readonly user that accesses the db?

Would you like the notification to send an email to a subset of people?

Will this be a standalone script or built into another monitoring program?

What would be the preferred language (bash, python, etc...)?

I am in the process of rewriting FMN's backend, splitting its core business into a group of workers.

Maybe we can wait for this new backend, as it will require more monitoring: we'll be adding two queues, one between the fedmsg consumer and the workers and one between the workers and the backend (the part doing the IO), and if we split the backends we might add even more queues that will also need monitoring.

As for the questions:

What db is the FMN db using (postgres, mysql, etc...)?
* almost all our apps are using PostgreSQL

Is there a readonly user that accesses the db?
* we might have a readonly user, and if we don't, it'd be easy enough to add one

Would you like the notification to send an email to a subset of people?
* I think the original idea would be to add the monitoring to collectd and nagios itself, so no need to re-invent who is notified of what

Will this be a standalone script or built into another monitoring program?
* I think this got answered above

What would be the preferred language (bash, python, etc...)?
* I guess bash or python would be preferred

@pingou I think this re-write was done a while back... what can we monitor here to ensure processing is going along as expected?

Metadata Update from @kevin:
- Issue tagged with: easyfix

7 years ago

So there are two rabbitmq queues that we could monitor: one called worker and the other called backend.

I know @jcline wanted to rename one of them, though I forget whether we ended up doing it.

There is already a nagios check provided by @puiterwijk that monitors the rabbitmq queue for basset, so we could base the work on this.
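The basset check @puiterwijk wrote probably looks different, but a queue-depth check for those two queues could be roughly like the sketch below. The host, credentials and thresholds are made up; only the queue names worker and backend come from the discussion above.

```python
#!/usr/bin/env python
# Rough sketch of a nagios-style RabbitMQ queue depth check. Host,
# credentials and thresholds are placeholders, not the real setup.
import sys
import pika

QUEUES = ["worker", "backend"]
WARN, CRIT = 500, 5000  # arbitrary example thresholds

conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
channel = conn.channel()

worst = 0  # 0=OK, 1=WARNING, 2=CRITICAL
details = []
for name in QUEUES:
    # passive=True only inspects the queue, it never creates it.
    depth = channel.queue_declare(queue=name, passive=True).method.message_count
    details.append("%s=%d" % (name, depth))
    if depth >= CRIT:
        worst = max(worst, 2)
    elif depth >= WARN:
        worst = max(worst, 1)

conn.close()
print("%s - %s" % (["OK", "WARNING", "CRITICAL"][worst], ", ".join(details)))
sys.exit(worst)
```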

I didn't rename any queues so that should still be correct.

So, there are three queues that need to be checked: the worker rabbitmq queue (pretty sure we already have nagios for this), the backend rabbitmq queue (I'm not sure whether we have monitoring on this, but it should be easy), and the queued_messages database table, which is also a sort of queue.
This table is used for digest capturing, and by the IRC backend until it's connected (which should be improved).
Monitoring queued_messages is slightly trickier because nagios would need access to the database, but with that in place it'd basically be a `select count(*) from queued_messages;`.
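Once nagios has a readonly db account, the check itself could be as small as something like this sketch (connection details and thresholds are made up, the standard nagios exit codes are the only real convention used here):

```python
#!/usr/bin/env python
# Sketch of a nagios check for the queued_messages row count.
# Connection details and thresholds are placeholders.
import sys
import psycopg2

WARN, CRIT = 1000, 10000  # example thresholds, not tuned values

try:
    conn = psycopg2.connect(dbname="notifications", user="readonly", host="db-fmn")
    cur = conn.cursor()
    cur.execute("SELECT count(*) FROM queued_messages;")
    count = cur.fetchone()[0]
    conn.close()
except psycopg2.Error as err:
    print("UNKNOWN - could not query queued_messages: %s" % err)
    sys.exit(3)

if count >= CRIT:
    print("CRITICAL - queued_messages has %d rows" % count)
    sys.exit(2)
elif count >= WARN:
    print("WARNING - queued_messages has %d rows" % count)
    sys.exit(1)
print("OK - queued_messages has %d rows" % count)
sys.exit(0)
```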

Unless I'm mistaken, the RabbitMQ piece was done a while back. The backend queues weren't being monitored correctly because the Nagios services were misnamed on the server compared to the client configs. I fixed that with https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?h=master&id=40879a4fa8437c69e8fc87a71fb6dd84a9a7f937. Are there other RabbitMQ queues we need to monitor? I didn't see any more on notifs-backend01.

Metadata Update from @smooge:
- Issue assigned to keitellf (was: tammyb5)

6 years ago

I think this is complete. If anyone thinks it's not, please feel free to reopen.

:door:

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 years ago
