#9231 Improve monitoring for container registry
Opened 3 months ago by mizdebsk. Modified 12 days ago

The container registry is listed as having an Important SLE, yet one of our registries was down for about 11 hours (see #9230 for details) and we didn't get any Nagios notification about the issue.
Monitoring should be improved so that we are notified about this kind of issue sooner.

I would like to work on it.

Metadata Update from @smooge:
- Issue assigned to nasirhm
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

3 months ago

@mizdebsk I think we can get a notification when a systemd-monitored service enters the failed state if we add OnFailure= to the unit.
For more details about the option, see https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Specifiers

We have an existing Nagios setup that would be trivial to configure to cover the OCI registry -- for example by adding checks for:
- presence of the word "fedora" under https://registry.fedoraproject.org/v2/_catalog (covers the v2 API)
- presence of the word "rawhide" under https://registry.fedoraproject.org/repo/fedora/tags/ (covers the web interface)
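
To make the two checks above concrete, here is a minimal sketch of the same logic as a standalone Nagios-style plugin in Python. It is only an illustration: the function names `evaluate` and `check_url` are hypothetical, and the real change would more likely reuse the string-matching checks the existing Nagios setup already provides rather than a new script. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import sys
import urllib.request

# Conventional Nagios plugin exit codes.
OK = 0
CRITICAL = 2


def evaluate(body, word):
    """Return (exit_code, message) depending on whether word occurs in body."""
    if word in body:
        return OK, "OK - found %r" % word
    return CRITICAL, "CRITICAL - %r not found in response" % word


def check_url(url, word, timeout=10.0):
    """Fetch url and report whether the response body contains word."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # URLError, HTTPError and socket timeouts are all OSError subclasses,
        # so an unreachable or erroring registry is reported as CRITICAL.
        return CRITICAL, "CRITICAL - %s: %s" % (url, exc)
    return evaluate(body, word)


# Only runs when called as: script.py <url> <word>
if __name__ == "__main__" and len(sys.argv) == 3:
    code, message = check_url(sys.argv[1], sys.argv[2])
    print(message)
    sys.exit(code)
```

Assuming the script were saved under a hypothetical name such as `check_registry.py`, it could be run as `check_registry.py https://registry.fedoraproject.org/v2/_catalog fedora` for the API check and with the tags URL and `rawhide` for the web interface check.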

Hint: the relevant file in ansible.git is roles/nagios_server/templates/nagios/services/websites.cfg.j2

Thank you very much @mizdebsk and @seddik for the pointers. I will work on it after work today.

Take your time. We are in beta freeze, so changes to monitoring will need to wait until the freeze ends, or follow the FBR SOP.


Could you give an update?

I can work on that if needed.


Boards 1
ops Status: Backlog