#9231 Improve monitoring for container registry
Closed: Fixed 3 years ago by mizdebsk. Opened 3 years ago by mizdebsk.

The container registry is listed as having an Important SLE, yet one of our registries was down for about 11 hours (see #9230 for details) and we didn't get any Nagios notification about the issue.
Monitoring should be improved so that we are notified about this kind of issue sooner.


I would like to work on it.

Metadata Update from @smooge:
- Issue assigned to nasirhm
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

3 years ago

@mizdebsk I think we can get a notification when a systemd-monitored service enters the failed state if we add OnFailure= to the unit.
For more details about the option, see https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Specifiers

We have an existing Nagios setup that would be trivial to configure to cover the OCI registry -- for example, by adding checks for:
- presence of word "fedora" under https://registry.fedoraproject.org/v2/_catalog (that covers v2 API)
- presence of word "rawhide" under https://registry.fedoraproject.org/repo/fedora/tags/ (covers web interface)

Hint: the relevant file in ansible.git is roles/nagios_server/templates/nagios/services/websites.cfg.j2
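
To illustrate what the two content checks above amount to, here is a minimal standalone sketch in Python. It is not the change that was deployed (that would live in the Nagios configuration file named in the hint); the function names `evaluate` and `check_url` are hypothetical, and only the URLs and expected words come from this ticket. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import http.client
import sys
import urllib.request

# Conventional Nagios plugin exit codes.
OK = 0
CRITICAL = 2


def evaluate(body, expected):
    """Return (exit_code, message) depending on whether `expected` occurs in `body`."""
    if expected in body:
        return OK, "OK: found %r" % expected
    return CRITICAL, "CRITICAL: %r not found in response" % expected


def check_url(url, expected, timeout=10.0):
    """Fetch `url` and verify that the response body contains `expected`.

    Any network or HTTP error (including 4xx/5xx responses, which urllib
    raises as HTTPError, a subclass of OSError) is reported as CRITICAL.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        return CRITICAL, "CRITICAL: %s unreachable: %s" % (url, exc)
    return evaluate(body, expected)


if __name__ == "__main__":
    # The two checks proposed in this ticket.
    checks = [
        ("https://registry.fedoraproject.org/v2/_catalog", "fedora"),
        ("https://registry.fedoraproject.org/repo/fedora/tags/", "rawhide"),
    ]
    worst = OK
    for url, word in checks:
        code, message = check_url(url, word)
        print("%s (%s)" % (message, url))
        worst = max(worst, code)
    sys.exit(worst)
```

Keeping the string match in `evaluate` separate from the network fetch makes the pass/fail logic easy to verify without touching the live registry.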

Thank you very much @mizdebsk and @seddik for the pointers. I will work on it after work today.

Take your time. We are in beta freeze, so changes to monitoring will need to wait until the freeze ends, or follow the FBR SOP.

Hi,

Could you give an update?

I can work on that if needed.

Monitoring for container registry is still needed, patches are welcome. Please let me know if you need any help with implementing this.

The change has been deployed and can be seen e.g. here and here.
Thank you for your contribution. This issue is resolved.

Metadata Update from @mizdebsk:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago


Metadata
Boards 1
ops Status: Done