We currently have a number of Nagios alerts firing on busgateway01, notifs-backend01 and notifs-web01.
We should fix these. In some cases it's probably best to remove the check entirely, but in other cases we should try to fix the check and make it work.
On busgateway01 (all alerts acknowledged, each with 1 comment):
Check datanommer for recent ansible messages: CRITICAL for 7d 20h 53m 8s (last check 01-17-2023 23:37:00, attempt 3/3), output "CRIT: no ansible messages in 604800 seconds"
Check datanommer for recent greenwave messages: CRITICAL for 5d 5h 13m 5s (last check 01-17-2023 23:37:03, attempt 3/3), output "CRIT: no greenwave messages in 172800 seconds"
Check fedmsg consumers and producers hub: CRITICAL for 238d 18h 13m 0s (last check 01-17-2023 23:36:58, attempt 3/3), output "ERROR: Nommer not found among installed plugins"
Check datanommer for recent rpm sign messages: UNKNOWN for 111d 23h 43m 15s (last check 01-17-2023 23:36:57, attempt 3/3), output "Usage: /usr/lib64/nagios/plugins/check_datanommer_timesince.py CATEGORY WARNING_THRESHOLD CRITICAL_THRESHOLD"
Check fedmsg-hub consumers backlog: UNKNOWN for 111d 23h 43m 11s (last check 01-17-2023 23:36:30, attempt 3/3), output "UNKNOWN: fedmsg consumer Nommer not found"
Check fedmsg-hub consumers exceptions: UNKNOWN for 111d 23h 43m 11s (last check 01-17-2023 23:37:01, attempt 3/3), output "UNKNOWN: fedmsg consumers Nommer not found"
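The datanommer alerts on busgateway01 follow the standard Nagios plugin convention: exit code 0 is OK, 1 is WARNING, 2 is CRITICAL and 3 is UNKNOWN. The rpm sign check reporting UNKNOWN with its own usage text suggests it is being invoked with the wrong arguments, rather than finding stale data. The sketch below is hypothetical and is not the code of check_datanommer_timesince.py: the `evaluate()` function, its message formats and the choice to pass in the elapsed time (instead of querying datanommer) are all assumptions for illustration.

```python
# Hypothetical sketch of a "time since last message" Nagios check.
# NOT the real check_datanommer_timesince.py; names and messages are assumed.

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Usage line copied from the alert output above.
USAGE = ("Usage: /usr/lib64/nagios/plugins/check_datanommer_timesince.py "
         "CATEGORY WARNING_THRESHOLD CRITICAL_THRESHOLD")


def evaluate(args, seconds_since_last):
    """Return (exit_code, message).

    args: command-line arguments without the program name.
    seconds_since_last: seconds since the newest message in the category,
        or None if none was found (assumed to come from a datanommer query).
    """
    # Wrong number of arguments: report UNKNOWN with the usage text.
    if len(args) != 3:
        return UNKNOWN, USAGE
    category = args[0]
    try:
        warn, crit = int(args[1]), int(args[2])
    except ValueError:
        # Non-numeric thresholds are also a usage error.
        return UNKNOWN, USAGE
    if seconds_since_last is None or seconds_since_last >= crit:
        return CRITICAL, f"CRIT: no {category} messages in {crit} seconds"
    if seconds_since_last >= warn:
        return WARNING, f"WARN: no {category} messages in {warn} seconds"
    return OK, f"OK: last {category} message {seconds_since_last} seconds ago"
```

For example, `evaluate(["ansible", "86400", "604800"], None)` returns `(2, "CRIT: no ansible messages in 604800 seconds")`; the 86400-second warning threshold is made up, since only the critical values appear in the alerts.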
On notifs-backend01 (all alerts acknowledged, each with 1 comment):
Check backend email queue size: UNKNOWN for 49d 6h 32m 17s (last check 01-17-2023 23:35:11, attempt 3/3), output "NRPE: Unable to read output"
Check backend irc queue size: UNKNOWN for 49d 6h 32m 11s (last check 01-17-2023 23:35:17, attempt 3/3), output "NRPE: Unable to read output"
Check fedmsg-hub consumers backlog: UNKNOWN for 84d 19h 46m 0s (last check 01-17-2023 23:35:12, attempt 3/3), output "UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-hub.socket does not exist"
Check fedmsg-hub consumers exceptions: UNKNOWN for 84d 19h 45m 59s (last check 01-17-2023 23:35:17, attempt 3/3), output "UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-hub.socket does not exist"
Check worker queue size: UNKNOWN for 49d 6h 32m 8s (last check 01-17-2023 23:35:15, attempt 3/3), output "NRPE: Unable to read output"
All of these should be fixed. This is fallout from moving notifs to Python 3. Perhaps the checks/plugins are still using Python 2?
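The two fedmsg-hub alerts on notifs-backend01 fail before measuring anything: the hub's monitoring socket is missing. That would be expected if fedmsg-hub is no longer running on the host after the Python 3 move, though this ticket does not confirm it. A minimal, hypothetical sketch of the pre-flight step such a check performs; `check_monitoring_socket()` is an illustrative name, not the real plugin, and only the socket path and the "does not exist" message come from the alerts.

```python
# Hypothetical pre-flight step of a fedmsg-hub monitoring check.
# Only the path and the "does not exist" message are taken from the alerts.
import os
import stat

# Standard Nagios exit codes used here.
OK, UNKNOWN = 0, 3

SOCKET_PATH = "/var/run/fedmsg/monitoring-fedmsg-hub.socket"


def check_monitoring_socket(path=SOCKET_PATH):
    """Return (exit_code, message) based on the monitoring socket's presence."""
    if not os.path.exists(path):
        # The state notifs-backend01 has been reporting.
        return UNKNOWN, f"UNKNOWN - {path} does not exist"
    if not stat.S_ISSOCK(os.stat(path).st_mode):
        # Something exists at the path, but it is not a socket.
        return UNKNOWN, f"UNKNOWN - {path} is not a socket"
    # A real check would now connect and read backlog/exception stats.
    return OK, f"OK - {path} is present"
```

With this ordering, a host where fedmsg-hub never starts can only ever report UNKNOWN, which matches what Nagios has been showing for these two checks.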
On notifs-web01 (alert acknowledged, with 1 comment):
http-apps.fedoraproject.org-notifications-fmn.web: CRITICAL for 106d 0h 23m 8s (last check 01-17-2023 23:42:08, attempt 3/3), output "HTTP CRITICAL: HTTP/1.1 308 PERMANENT REDIRECT - string 'Notifications' not found on 'http://localhost:80/notifications' - 579 bytes in 0.003 second response time"
This should also be fixed. It is most likely also related to the Python 3 move.
Metadata Update from @zlopez: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: low-gain, medium-trouble, ops
Metadata Update from @aheath1992: - Issue assigned to aheath1992
[backlog]
This just needs folks to work on it. ;)
Created https://pagure.io/fedora-infra/ansible/pull-request/1367 for the one check that can be removed. For the rest of the checks, I do not have the permissions needed to investigate what the actual error is.
[backlog] Most of the checks have been cleaned up; only 2 remain.
The fedmsg-hub alerts are still failing. I'm not sure whether it's a client-side issue or a problem on another system that the client can't connect to.
I have worked with other members to update Nagios, fixing the broken checks and removing the unnecessary ones.
Metadata Update from @aheath1992: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)