#11090 nagios: busgateway01 and notifs nagios alerts
Closed: Fixed a year ago by aheath1992. Opened 2 years ago by kevin.

We currently have a number of nagios alerts triggering on busgateway01 and notifs-backend01/notifs-web01.

We should fix these. In some cases it's probably best to remove the check entirely, but in other cases we should try to fix it and make it work.

On busgateway01:

Check datanommer for recent ansible messages
This service has 1 comment associated with it This service problem has been acknowledged
CRITICAL 01-17-2023 23:37:00 7d 20h 53m 8s 3/3 CRIT: no ansible messages in 604800 seconds

  • We can probably just leave this one. It's due to ansible on batcave01 running python3.9 while fedora-messaging is using python3.6. When we upgrade batcave01 to rhel9 they should hopefully be on the same python.

Check datanommer for recent greenwave messages
This service has 1 comment associated with it This service problem has been acknowledged
CRITICAL 01-17-2023 23:37:03 5d 5h 13m 5s 3/3 CRIT: no greenwave messages in 172800 seconds

  • This one should be removed. greenwave no longer emits messages.

Check fedmsg consumers and producers hub
This service has 1 comment associated with it This service problem has been acknowledged
CRITICAL 01-17-2023 23:36:58 238d 18h 13m 0s 3/3 ERROR: Nommer not found among installed plugins

  • This one should be fixed. It's related to our changes with datanommer/datagrepper.

Check datanommer for recent rpm sign messages
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:36:57 111d 23h 43m 15s 3/3 Usage: /usr/lib64/nagios/plugins/check_datanommer_timesince.py CATEGORY WARNING_THRESHOLD CRITICAL_THRESHOLD

  • This one should be fixed. There should be signing messages.
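For reference, the UNKNOWN state above is what a Nagios plugin reports when it is called with the wrong arguments and prints its usage line. The sketch below is not the real check_datanommer_timesince.py; it is a minimal illustration, assuming a hypothetical `evaluate` helper and a hypothetical `seconds_since_last` input in place of the actual datanommer query, of how a CATEGORY WARNING_THRESHOLD CRITICAL_THRESHOLD check maps onto the standard Nagios exit codes. It is not a diagnosis of the actual alert.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

# Hypothetical usage line, modelled on the one in the alert output.
USAGE = "Usage: check_timesince CATEGORY WARNING_THRESHOLD CRITICAL_THRESHOLD"


def evaluate(argv, seconds_since_last):
    """Return (exit_code, message) for a timesince-style check.

    argv: the arguments after the program name.
    seconds_since_last: age in seconds of the newest message in the
    category (assumed input; a real plugin would look this up itself).
    """
    # Anything other than exactly three arguments ends in UNKNOWN + usage.
    if len(argv) != 3:
        return UNKNOWN, USAGE
    category, warn_raw, crit_raw = argv
    try:
        warn, crit = int(warn_raw), int(crit_raw)
    except ValueError:
        # Non-numeric thresholds are also a usage error.
        return UNKNOWN, USAGE
    if seconds_since_last >= crit:
        return CRITICAL, f"CRIT: no {category} messages in {crit} seconds"
    if seconds_since_last >= warn:
        return WARNING, f"WARN: no {category} messages in {warn} seconds"
    return OK, f"OK: {category} message seen {seconds_since_last} seconds ago"
```

With this shape, any call that does not pass exactly three arguments (for example a category name split into two words by the command definition) ends up as UNKNOWN with the usage text, regardless of whether messages are actually arriving.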

Check fedmsg-hub consumers backlog
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:36:30 111d 23h 43m 11s 3/3 UNKNOWN: fedmsg consumer Nommer not found

  • This one should be fixed.

Check fedmsg-hub consumers exceptions
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:37:01 111d 23h 43m 11s 3/3 UNKNOWN: fedmsg consumers Nommer not found

  • This one should be fixed.

notifs-backend01:

Check backend email queue size
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:35:11 49d 6h 32m 17s 3/3 NRPE: Unable to read output

Check backend irc queue size
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:35:17 49d 6h 32m 11s 3/3 NRPE: Unable to read output

Check fedmsg-hub consumers backlog
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:35:12 84d 19h 46m 0s 3/3 UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-hub.socket does not exist

Check fedmsg-hub consumers exceptions
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:35:17 84d 19h 45m 59s 3/3 UNKNOWN - /var/run/fedmsg/monitoring-fedmsg-hub.socket does not exist

Check worker queue size
This service has 1 comment associated with it This service problem has been acknowledged
UNKNOWN 01-17-2023 23:35:15 49d 6h 32m 8s 3/3 NRPE: Unable to read output

All these should be fixed. This is fallout from our moving notifs to python3. The check/plugins might be using python2?
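One quick way to test the python2 theory is to look at the shebang line of each plugin script. The following is a rough sketch, assuming the plugins are `*.py` files in a single directory; the helper names `interpreter_of` and `python2_suspects` are made up for illustration and are not part of any existing tooling.

```python
from pathlib import Path


def interpreter_of(script):
    """Return the shebang line of a script without '#!', or None if absent."""
    with open(script, "rb") as handle:
        first_line = handle.readline().decode("utf-8", errors="replace").strip()
    if not first_line.startswith("#!"):
        return None
    return first_line[2:].strip()


def python2_suspects(plugin_dir):
    """List (filename, shebang) for scripts that name 'python' or 'python2*'."""
    suspects = []
    for path in sorted(Path(plugin_dir).glob("*.py")):
        interpreter = interpreter_of(path)
        if interpreter is None:
            continue
        tokens = interpreter.split()
        if not tokens:
            continue
        # Take the basename of the interpreter, looking through '/usr/bin/env'.
        command = tokens[0].rsplit("/", 1)[-1]
        if command == "env" and len(tokens) > 1:
            command = tokens[1]
        # Plain 'python' is ambiguous, so it is flagged for a manual look.
        if command == "python" or command.startswith("python2"):
            suspects.append((path.name, interpreter))
    return suspects
```

Running `python2_suspects("/usr/lib64/nagios/plugins")` on the affected host would list the scripts worth checking by hand. An empty result would not rule out other causes of "NRPE: Unable to read output", such as permissions or a missing module.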

notifs-web01:

http-apps.fedoraproject.org-notifications-fmn.web
This service has 1 comment associated with it This service problem has been acknowledged
CRITICAL 01-17-2023 23:42:08 106d 0h 23m 8s 3/3 HTTP CRITICAL: HTTP/1.1 308 PERMANENT REDIRECT - string 'Notifications' not found on 'http://localhost:80/notifications' - 579 bytes in 0.003 second response time

This should also be fixed. It's most likely related to the python3 move as well.
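To see what the check is actually getting back, it can help to request the URL without following redirects and look at the status, the Location header, and whether the expected string is in the body. This is only a sketch using the standard library; the `probe` function is a hypothetical helper, not what the Nagios check itself runs.

```python
import http.client


def probe(host, port, path, expected, timeout=10):
    """Fetch path once, without following redirects, and summarise the reply."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        body = response.read().decode("utf-8", errors="replace")
        return {
            "status": response.status,
            # Where the server wants the client to go, if it redirected.
            "location": response.getheader("Location"),
            # Whether the string the check looks for is in this response.
            "found": expected in body,
        }
    finally:
        conn.close()
```

Calling `probe("localhost", 80, "/notifications", "Notifications")` on notifs-web01 should show the 308 and its redirect target. If the target page does contain the string, pointing the check at that URL (or letting it follow redirects) may be enough; if not, the expected string itself needs updating.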


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: low-gain, medium-trouble, ops

2 years ago

Metadata Update from @aheath1992:
- Issue assigned to aheath1992

2 years ago

[backlog]

This just needs folks to work on it. ;)

Created https://pagure.io/fedora-infra/ansible/pull-request/1367 for the one check that can be removed, the rest of the checks I do not have permissions to check and see what the true error is.

[backlog]
Most of the checks were cleaned up; only 2 remain.

The fedmsg-hub alerts are still giving problems. I'm not sure if it's a client-side issue or an issue with another system that the client can't connect to.

Have worked with members to update Nagios, fixing checks or removing unnecessary ones.

Metadata Update from @aheath1992:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a year ago


Metadata
Boards 1
ops Status: Backlog