Issue #9231: Improve monitoring for container registry - fedora-infrastructure

Closed: Fixed 3 years ago by mizdebsk. Opened 3 years ago by mizdebsk.

The container registry is listed as having an Important SLE, yet one of our registries was down for about 11 hours (see #9230 for details) and we didn't get any Nagios notification about the issue.
Monitoring should be improved so that we are notified about this kind of issue sooner.

nasirhm commented 3 years ago

I would like to work on it.

Metadata Update from @smooge:
- Issue assigned to nasirhm
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

3 years ago

seddik commented 3 years ago

@mizdebsk I think we can get a notification when a systemd-monitored service enters the failed state if we add OnFailure= to the unit.
For more details about the option, see https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Specifiers

Edited 3 years ago by seddik

mizdebsk commented 3 years ago

We have an existing Nagios setup that would be trivial to configure to cover the OCI registry, for example by adding checks for:
- presence of word "fedora" under https://registry.fedoraproject.org/v2/_catalog (that covers v2 API)
- presence of word "rawhide" under https://registry.fedoraproject.org/repo/fedora/tags/ (covers web interface)

Hint: the relevant file in ansible.git is roles/nagios_server/templates/nagios/services/websites.cfg.j2
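
To illustrate what the two checks above test, here is a minimal Python sketch of a "fetch a URL and look for a word" probe. It is not the Nagios configuration itself, which lives in the file mentioned above. The function names, the `CHECKS` list, and the 10-second timeout are assumptions made for illustration. Only the two URLs, the expected words, and the standard Nagios plugin status codes (0 = OK, 2 = CRITICAL) are taken as given.

```python
import urllib.request

# Standard Nagios plugin status codes.
OK, CRITICAL = 0, 2

# The two checks proposed in the thread: (URL, word expected in the body).
CHECKS = [
    ("https://registry.fedoraproject.org/v2/_catalog", "fedora"),
    ("https://registry.fedoraproject.org/repo/fedora/tags/", "rawhide"),
]


def evaluate_body(body: str, expected: str) -> tuple[int, str]:
    """Return a (status, message) pair depending on whether `expected` is in `body`."""
    if expected in body:
        return OK, f"OK - found {expected!r}"
    return CRITICAL, f"CRITICAL - {expected!r} not found"


def check_url(url: str, expected: str, timeout: float = 10.0) -> tuple[int, str]:
    """Fetch `url` and check its body; any network or HTTP error is CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and HTTPError are subclasses of OSError
        return CRITICAL, f"CRITICAL - request failed: {exc}"
    return evaluate_body(body, expected)


def run_checks(checks=CHECKS) -> int:
    """Run every check, print one line per check, and return the worst status."""
    worst = OK
    for url, expected in checks:
        status, message = check_url(url, expected)
        print(f"{url}: {message}")
        worst = max(worst, status)
    return worst
```

Calling `run_checks()` and passing the result to `sys.exit()` would turn this into a standalone probe. In the real setup the same test would more likely be expressed as a Nagios service definition next to the existing website checks.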

nasirhm commented 3 years ago

Thank You very much @mizdebsk and @seddik for the pointers, I will work on it after work today.

mizdebsk commented 3 years ago

Take your time. We are in beta freeze, so changes to monitoring will need to wait until the freeze ends, or follow the FBR SOP.

seddik commented 3 years ago

Hi,

Could you give an update?

darknao commented 3 years ago

I can work on that if needed

mizdebsk commented 3 years ago

Monitoring for container registry is still needed, patches are welcome. Please let me know if you need any help with implementing this.

darknao commented 3 years ago

Related PR fedora-infra/ansible#321 has been merged

mizdebsk commented 3 years ago

The change has been deployed and can be seen e.g. here and here.
Thank you for your contribution. This issue is resolved.

Metadata Update from @mizdebsk:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata

Assignee: nasirhm
Tags: monitoring, easyfix, medium-gain, medium-trouble, ops
Blocking: None
Depending on: None
Priority: Waiting on Assignee
Boards: ops (Status: Done)