It started overflowing recently and we deleted some entries for clogged users which got it flowing again.
To detect this in the future, we should monitor the size of the queued_messages table in the notifications db in both collectd and nagios.
Do we have any examples of db table size checks out there already? Those would be nice and easy to copy, if we do.
If no one is working on this, I am interested in fixing it. I'm new around here, so I'll ask a lot of questions that might have been answered somewhere else.
What db is FMN db using (postgres, mysql, etc...)?
Is there a readonly user that accesses the db?
Would you like the notification to send an email to a subset of people?
Will this be a standalone script or built into another monitoring program?
What would be the preferred language (bash, python, etc...)?
I am in the process of rewriting FMN's backend, splitting its core business into a group of workers.
Maybe we can wait for this new backend, as it will require more monitoring. We'll be adding 2 queues: one between the fedmsg consumer and the workers, and one between the workers and the backend (the part doing the IO). If we split the backends, we might add even more queues that will also need monitoring.
As for the questions:
What db is FMN db using (postgres, mysql, etc...)? * almost all our apps are using PostgreSQL
Is there a readonly user that accesses the db? * we might have a readonly user, and if we don't it'd be easy enough to add one
Would you like the notification to send an email to a subset of people? * I think the original idea would be to add the monitoring to collectd and nagios itself, so no need to re-invent who is notified of what
Will this be a standalone script or built into another monitoring program? * I think this got answered above
What would be the preferred language (bash, python, etc...)? * I guess bash or python would be preferred
@pingou I think this re-write was done a while back... what can we monitor here to ensure processing is going along as expected?
Metadata Update from @kevin: - Issue tagged with: easyfix
So there are two rabbitmq queues that we could monitor: one called worker and the other backend.
I know @jcline wanted to rename one of them, though I forgot whether we did it in the end.
There is already a nagios check provided by @puiterwijk that monitors the rabbitmq queue for basset, so we could base the work on this.
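For illustration only, here is a minimal sketch of how a queue-depth check could read the message count for one of these queues. It is not the existing basset check, whose implementation isn't shown in this ticket. It assumes the RabbitMQ management plugin is enabled and reachable over HTTP; the function names, base URL, vhost, and credentials are hypothetical placeholders.

```python
import base64
import json
import urllib.request
from urllib.parse import quote


def queue_depth(payload):
    """Extract the total message count from a management-API queue object.

    `payload` is the JSON text returned for a single queue. The "messages"
    field can be missing before the broker has gathered stats, so default to 0.
    """
    return int(json.loads(payload).get("messages", 0))


def fetch_queue(base_url, vhost, queue, user, password):
    """Fetch the JSON description of one queue from the management API.

    All arguments are placeholders supplied by the caller. The vhost and
    queue name are percent-encoded, so the default vhost "/" becomes "%2F".
    """
    url = f"{base_url}/api/queues/{quote(vhost, safe='')}/{quote(queue, safe='')}"
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.read().decode()
```

A check for the worker queue would then be something like `queue_depth(fetch_queue(base_url, vhost, "worker", user, password))`, compared against whatever thresholds make sense for normal load.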
I didn't rename any queues so that should still be correct.
So, there are three queues that need to be checked: the worker rabbitmq queue (pretty sure we already have nagios for this), the backend rabbitmq queue (I'm not sure, we might have monitoring on this, should be easy), and the queued_messages database table that's also a sort of queue. This table is used for digest capturing, and by the IRC backend until it's connected (which should be improved). Monitoring queued_messages is slightly more tricky because nagios would need access to the database, but with that done it'd basically be a select count(*) from queued_messages;
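As a rough sketch of what such a check might look like (not an existing script), the count query can be wrapped in a Nagios-style plugin that maps the row count to the standard exit codes 0/1/2/3. The function names and thresholds are hypothetical, and the code takes any Python DB-API connection so it stays independent of the driver.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def count_queued(conn):
    """Return the number of rows in queued_messages via a DB-API connection."""
    cur = conn.cursor()
    try:
        cur.execute("SELECT count(*) FROM queued_messages")
        return cur.fetchone()[0]
    finally:
        cur.close()


def evaluate(count, warn, crit):
    """Map a row count to a (Nagios exit code, status line) pair."""
    if count >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif count >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # Text after '|' is Nagios performance data: label=value;warn;crit
    return code, f"{label} - {count} queued messages | queued={count};{warn};{crit}"


def run_check(conn, warn, crit):
    """Run the query and report UNKNOWN if the database cannot be read."""
    try:
        count = count_queued(conn)
    except Exception as exc:
        return UNKNOWN, f"UNKNOWN - could not query queued_messages: {exc}"
    return evaluate(count, warn, crit)
```

With a PostgreSQL driver such as psycopg2 (an assumption, since the thread doesn't name one), the plugin would open a connection as the readonly user, call `run_check(conn, warn, crit)`, print the status line, and exit with the returned code.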
Unless I'm mistaken, the RabbitMQ piece was done a while back. The backend queues weren't being monitored correctly because the Nagios services were misnamed on the server compared to the client configs. I fixed that with https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?h=master&id=40879a4fa8437c69e8fc87a71fb6dd84a9a7f937. Are there other RabbitMQ queues we need to monitor? I didn't see any more on notifs-backend01.
Metadata Update from @smooge: - Issue assigned to keitellf (was: tammyb5)
I think this is complete. If anyone thinks it's not, please feel free to reopen.
:door:
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)