#6423 Add Nagios monitoring for waiverdb HTTP endpoint
Closed 6 years ago Opened 6 years ago by dcallagh.

As suggested by @kevin in #6043, we should probably have some monitoring for waiverdb now that it is in production. OpenShift already has its own monitoring and recovery for restarting crashed pods, etc., so we don't need to replicate that. But it would be nice to have a Nagios monitor that just checks that https://waiverdb-web-waiverdb.app.os.fedoraproject.org/healthcheck is returning "Health check OK".

Or perhaps it should hit https://waiverdb-web-waiverdb.app.os.fedoraproject.org/api/v1.0/waivers/ and check that it gets a JSON response in a reasonable amount of time?
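
To make the second option concrete, here is a rough sketch of what such a check could look like as a standalone Nagios-style plugin in Python. It is only an illustration, not what the Fedora Nagios setup actually runs: the `evaluate` and `main` names, the 5-second warning threshold and the 10-second timeout are assumptions, and the test for a top-level "data" key is a guess at the shape of the waivers listing.

```python
"""Hypothetical Nagios-style check for the waiverdb API (illustration only)."""
import json
import time
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2

URL = "https://waiverdb-web-waiverdb.app.os.fedoraproject.org/api/v1.0/waivers/"
WARN_SECONDS = 5.0      # assumed "reasonable amount of time"; not from the ticket
TIMEOUT_SECONDS = 10.0  # assumed hard timeout for the request


def evaluate(status, body, elapsed, warn_seconds=WARN_SECONDS):
    """Map an HTTP status, response body and elapsed time to (exit_code, message)."""
    if status != 200:
        return CRITICAL, f"CRITICAL: HTTP {status}"
    try:
        payload = json.loads(body)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return CRITICAL, "CRITICAL: response is not valid JSON"
    # Assumption: the waivers listing is a JSON object with a "data" key.
    if not isinstance(payload, dict) or "data" not in payload:
        return CRITICAL, 'CRITICAL: JSON has no "data" key'
    if elapsed > warn_seconds:
        return WARNING, f"WARNING: response took {elapsed:.2f}s"
    return OK, f"OK: JSON response in {elapsed:.2f}s"


def main():
    """Fetch the URL once, print a one-line status and return the exit code."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=TIMEOUT_SECONDS) as resp:
            status, body = resp.status, resp.read()
    except Exception as exc:  # HTTP errors, timeouts, DNS failures
        print(f"CRITICAL: {exc}")
        return CRITICAL
    code, message = evaluate(status, body, time.monotonic() - start)
    print(message)
    return code
```

To run it as a plugin, the script would end with `raise SystemExit(main())` so that Nagios sees the exit code. Keeping the decision logic in `evaluate` means it can be exercised without touching the network.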


So I was going to try proposing a patch for this, but I am getting a bit lost in the Nagios setup...

If I am understanding it correctly, right now all the web checks are done using NRPE to execute the check on the actual server itself, connecting to localhost. Like this for example:

define service {
  hostgroup_name        tagger
  service_description   http-tagger-internal
  check_command         check_website!localhost!/tagger/api/v1/score/ralph/!libravatar.org
  max_check_attempts    8   
  use                   internalwebsitetemplate
  #event_handler         restart_httpd
}

But for the apps running in OpenShift, there is nowhere for the checks to be run remotely. Should they just run on the nagios server itself in that case?

So I'm thinking we want to add another block in roles/nagios_server/templates/nagios/services/websites.cfg.j2 something like this ...?

define service {
  hostgroup_name        ???SOMETHINGHERE???
  service_description   http-waiverdb-api
  check_command         check_website_ssl!waiverdb-web-waiverdb.app.os.fedoraproject.org!/api/v1.0/waivers/!"data": [
  max_check_attempts    8   
  use                   internalwebsitetemplate
}

Yeah - I've started plugging away at this here:

https://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=0fe0721f04186cd3e686118bc97c35f5d737a4fa

I set the check to run on the proxies, which is a little unnecessary. It means ten different servers will all check to see if they can get to greenwave and waiverdb.

After a few iterations, the checks are green now.

Going to close this for now. Feel free to reopen or file anew if there's something else we need.

Metadata Update from @ralph:
- Issue status updated to: Closed (was: Open)

6 years ago
