As suggested by @kevin in #6043 we should probably have some monitoring for waiverdb now that it is in production. OpenShift already has its own monitoring and recovery stuff for restarting crashed pods, etc. So we don't need to replicate that. But it would be nice to have a Nagios monitor just checking that https://waiverdb-web-waiverdb.app.os.fedoraproject.org/healthcheck is returning Health check OK.
Or perhaps it should hit https://waiverdb-web-waiverdb.app.os.fedoraproject.org/api/v1.0/waivers/ and check that it gets a JSON response in a reasonable amount of time?
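To make the second option concrete, here is a rough sketch of what "gets a JSON response in a reasonable amount of time" could mean as a Nagios-style check. This is only an illustration, not what the existing check_website command does. The function names, the 5 and 10 second thresholds, and the assumption that the response is a JSON object with a top-level "data" key are all hypothetical.

```python
import json
import time
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def evaluate(body, elapsed, warn_secs=5.0, crit_secs=10.0):
    """Map a response body and elapsed time to (exit_code, message).

    The thresholds and the expected top-level "data" key are assumptions.
    """
    try:
        payload = json.loads(body)
    except ValueError:
        return CRITICAL, "CRITICAL: response is not valid JSON"
    if not isinstance(payload, dict) or "data" not in payload:
        return CRITICAL, 'CRITICAL: JSON response has no "data" key'
    if elapsed >= crit_secs:
        return CRITICAL, "CRITICAL: response took %.2fs" % elapsed
    if elapsed >= warn_secs:
        return WARNING, "WARNING: response took %.2fs" % elapsed
    return OK, "OK: JSON response in %.2fs" % elapsed


def check(url, timeout=10.0):
    """Fetch url, time the request, and evaluate the result."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8")
    except Exception as exc:  # any network or HTTP failure is critical
        return CRITICAL, "CRITICAL: request failed: %s" % exc
    return evaluate(body, time.monotonic() - start)


def main(argv):
    """Print the status line and return the exit code for argv[1]."""
    code, message = check(argv[1])
    print(message)
    return code
```

To use it as a plugin, it would be wired up with something like sys.exit(main(sys.argv)) and passed the waivers URL as its only argument.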
So I was going to try proposing a patch for this, but I am getting a bit lost in the Nagios setup...
If I am understanding it correctly, right now all the web checks are done using NRPE to execute the check on the actual server itself, connecting to localhost. Like this for example:
define service {
  hostgroup_name        tagger
  service_description   http-tagger-internal
  check_command         check_website!localhost!/tagger/api/v1/score/ralph/!libravatar.org
  max_check_attempts    8
  use                   internalwebsitetemplate
  #event_handler        restart_httpd
}
But for the apps running in OpenShift, there is nowhere for the checks to be run remotely. Should they just run on the nagios server itself in that case?
So I'm thinking we want to add another block in roles/nagios_server/templates/nagios/services/websites.cfg.j2 something like this ...?
define service {
  hostgroup_name        ???SOMETHINGHERE???
  service_description   http-waiverdb-api
  check_command         check_website_ssl!waiverdb-web-waiverdb.app.os.fedoraproject.org!/api/v1.0/waivers/!"data": [
  max_check_attempts    8
  use                   internalwebsitetemplate
}
Yeah - I've started plugging away at this here:
https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=0fe0721f04186cd3e686118bc97c35f5d737a4fa
I set the check to run on the proxies, which is a little unnecessary. It means ten different servers will all check to see if they can get to greenwave and waiverdb.
After a few iterations, the checks are green now.
Going to close this for now. Feel free to reopen or file anew if there's something else we need.
Metadata Update from @ralph: - Issue status updated to: Closed (was: Open)