#482 Monitor websites at the app server level
Closed: Fixed None Opened 16 years ago by ricky.

Right now, I think we only monitor our apps by looking at responses from the proxy server. When the apps are load balanced, we can miss problems that are only affecting one server. (Side question: does nagios make sure that the HTTP response code is 200?)

What do you think is the best way to handle this sort of thing? Would we have to copy our current nagios checks and change them to hit the app servers (which can have different ports and paths)?

Right now, here are the apps/app servers that I can think of (I may have missed some):
* mirrormanager (app4/app5)
* bodhi (releng1)
* koji (publictest8)
* pkgdb (app4/app5)
* fas (fas1/fas2)
* wiki (app2, although it's sometimes on app1)
* transifex (app4)
* damned lies (app1/app2)


Oops, correction: the wiki is on both app1 and app2, app1 is for write operations, and app2 is handling everything else.

Nope, this is a bad idea and would ultimately cause more spam. There are two monitoring philosophies I'm currently looking at:

1) internal - Monitors machine health (drive space, etc.)

2) external - End-to-end test of the application, usually via a GET on a web page that hits the database. Typically this is where my pager alerts come from as well.

So I saw that app1 was in a bad state. We probably need to set nagios up to do proper flap detection (i.e., it can learn when a service keeps going up and down, which likely indicates a bad app server or proxy server, and then notify us).
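
As a rough sketch of what enabling flap detection could involve: the directives below are standard Nagios options, but the threshold values, the host and service names, the `check_http` command, and the `defaulttemplate` template are assumptions for illustration only. The thresholds are percentages of state change over recent checks; a service is considered flapping once it rises above the high threshold and stops flapping once it drops below the low one.

```
# nagios.cfg: global switch and default thresholds (percent state change).
enable_flap_detection=1
low_service_flap_threshold=5.0
high_service_flap_threshold=20.0

# Object config: per-service settings; host and service names are placeholders.
define service {
    use                      defaulttemplate
    host_name                app1
    service_description      wiki
    check_command            check_http
    flap_detection_enabled   1
    low_flap_threshold       5.0
    high_flap_threshold      20.0
    # "f" asks for a notification when flapping starts and stops.
    notification_options     w,c,r,f
}
```

While a service is flapping, Nagios holds back the normal problem and recovery notifications for it, which is what should cut down on the alert noise.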

I've been trying to get the sysadmin-noc team in better shape, but unfortunately it seems like the few volunteers we have in the group can only commit an hour or two every other week, which so far hasn't been enough. Proper monitoring is far more complex than it seems. For example, we shouldn't be getting nearly as many alerts as we actually get. Part of this is monitoring; part of it is that our environment is getting unstable. Without more people I'm not sure what to do about the former, and the latter is getting put on the back burner for other things like the wiki.

I suggest modifying the following files to monitor the apps. In an ideal world, I would like to try this in a test environment first.

  1. manifests/services/nagios.pp
    In class nrpe:
    From:
    configfile { '/etc/nagios/nrpe.cfg':
    source => 'system/nrpe.cfg',
    require => Package[nrpe],
    notify => Service[nrpe],
    }

    To:
    templatefile { '/etc/nagios/nrpe.cfg':
    content => template("system/nagios/nrpe.cfg"),
    require => Package[nrpe],
    notify => Service[nrpe],
    }


  1. configs/system/nagios/services/procs.cfg
    Add a new service for each application to be modified.
    define service {
    hostgroup_name servers ???
    service_description Monitor <application name>
    check_command check_by_nrpe!check_<application_name>
    use defaulttemplate
    retry_check_interval 5 ???
    max_check_attempts 12 ???
    }

  1. configs/system/nrpe.cfg
    Conditionally add a new command to monitor an application
    <% if <app>Running.to_s == "1" %>
    command[check_<app>]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a 'path_to_application' -u <app user id>
    <% end %>

  1. manifests/nodes/app4.fedora.phx.redhat.com.pp
    Assuming the application runs on app4
    Add the following line:
    $<app>Running = 1

    at the top of the file.

This has been put on hold pending the investigation of Zabbix as a replacement for nagios.

Now that we have haproxy set up for most of our apps, this is good enough for me.
