The container registry is listed as having an Important SLE, yet one of our registries was down for about 11 hours (see #9230 for details) and we didn't get any Nagios notification about the issue.
Monitoring should be improved so that we are notified about this kind of issue sooner.
I would like to work on it.
Metadata Update from @smooge:
- Issue assigned to nasirhm
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops
@mizdebsk I think we can get a notification when a systemd-monitored service enters the failed state if we add OnFailure= to the unit.
For more details about the option, see https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Specifiers
We have an existing Nagios setup that would be trivial to configure to cover the OCI registry, for example by adding checks for:
- presence of word "fedora" under https://registry.fedoraproject.org/v2/_catalog (that covers v2 API)
- presence of word "rawhide" under https://registry.fedoraproject.org/repo/fedora/tags/ (covers web interface)
Hint: the relevant file in ansible.git is roles/nagios_server/templates/nagios/services/websites.cfg.j2
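To make the two checks above concrete, here is a minimal standalone sketch of the logic in Python. It is not the Nagios configuration itself (that would go in the template named in the hint) and the function names `fetch_body`, `check` and `main` are made up for illustration. It only assumes the usual Nagios plugin convention of exit code 0 for OK and 2 for CRITICAL.

```python
import urllib.request

OK, CRITICAL = 0, 2  # conventional Nagios plugin exit codes

# (URL, expected word) pairs taken from the two checks suggested above.
CHECKS = [
    ("https://registry.fedoraproject.org/v2/_catalog", "fedora"),
    ("https://registry.fedoraproject.org/repo/fedora/tags/", "rawhide"),
]


def fetch_body(url, timeout=10):
    """Return the response body as text; raises on network or HTTP errors."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def check(url, word, fetch=fetch_body):
    """Return (status, message) for one URL/word pair."""
    try:
        body = fetch(url)
    except Exception as exc:  # any fetch failure is reported as CRITICAL
        return CRITICAL, f"CRITICAL: {url} unreachable: {exc}"
    if word in body:
        return OK, f"OK: found '{word}' at {url}"
    return CRITICAL, f"CRITICAL: '{word}' missing at {url}"


def main():
    """Run every check, print one line per check, return the worst status."""
    results = [check(url, word) for url, word in CHECKS]
    for _, message in results:
        print(message)
    return max(status for status, _ in results)
```

Running it as a script would just mean calling `sys.exit(main())`. The `fetch` parameter is there only so the logic can be exercised without touching the network.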
Thank you very much @mizdebsk and @seddik for the pointers. I will work on it after work today.
Take your time. We are in beta freeze, so changes to monitoring will need to wait until the freeze ends, or follow the FBR SOP.