Issue #482: Monitor websites at the app server level - fedora-infrastructure

Closed: Fixed. Opened 16 years ago by ricky.

Right now, I think we only monitor our apps by looking at responses from the proxy server. When the apps are load balanced, we can miss problems that are only affecting one server. (Side question: does nagios make sure that the HTTP response code is 200?)

What do you think is the best way to handle this sort of thing? Would we have to copy our current nagios checks and change them to hit the app servers (which can have different ports and paths)?

Right now, here are the apps/app servers that I can think of (I may have missed some):
* mirrormanager (app4/app5)
* bodhi (releng1)
* koji (publictest8)
* pkgdb (app4/app5)
* fas (fas1/fas2)
* wiki (app2, although it's sometimes on app1)
* transifex (app4)
* damned lies (app1/app2)
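
A minimal sketch of the idea above, bypassing the proxy and asking every app server directly whether it answers with HTTP 200. The hostnames, ports, and paths in the usage note are placeholders, not the real backend URLs, and `fetch_status` / `failing_backends` are names made up for illustration:

```python
from typing import Callable, Dict, List
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails outright."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, DNS failure, ...


def failing_backends(
    backends: Dict[str, List[str]],
    fetch: Callable[[str], int] = fetch_status,
) -> Dict[str, List[str]]:
    """Map each app to the backend URLs that did not answer with HTTP 200."""
    failures: Dict[str, List[str]] = {}
    for app, urls in backends.items():
        bad = [url for url in urls if fetch(url) != 200]
        if bad:
            failures[app] = bad
    return failures
```

Calling `failing_backends({"pkgdb": ["http://app4.example/pkgdb/", "http://app5.example/pkgdb/"]})` (hypothetical URLs) would return only the backends that are misbehaving, which is exactly the case a check through the load-balanced proxy can hide.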

ricky commented 16 years ago

Oops, correction: the wiki is on both app1 and app2. app1 is for write operations, and app2 handles everything else.

mmcgrath commented 16 years ago

Nope, this is a bad idea and would ultimately cause more spam. There are two monitoring philosophies I'm currently looking at:

1) internal - Monitors machine health (drive space, etc.)

2) external - End-to-end test of the application, usually via a GET on a web page that hits the database. Typically this is where my pager alerts come from as well.

So I saw that app1 was in a bad state. We probably need to set nagios up to do proper flap detection (i.e., it can learn when a service keeps going up and down, which likely means a bad app server or proxy server, and then notify us).
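
To make the flap detection idea concrete, here is a simplified sketch: look at the recent history of check results, compute what percentage of consecutive results changed state, and compare that against a high and a low threshold. This is not nagios's exact algorithm (nagios weights recent changes more heavily), and the threshold values are assumptions chosen for illustration:

```python
from typing import Sequence


def percent_state_change(states: Sequence[str]) -> float:
    """Percentage of consecutive check results whose state differs."""
    if len(states) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (len(states) - 1)


def is_flapping(
    states: Sequence[str],
    was_flapping: bool = False,
    low: float = 20.0,   # assumed threshold: stop flapping below this
    high: float = 50.0,  # assumed threshold: start flapping at or above this
) -> bool:
    """Apply two thresholds so the flapping flag does not itself flap."""
    pct = percent_state_change(states)
    if was_flapping:
        return pct >= low
    return pct >= high
```

A service that alternates between OK and CRITICAL on every check scores 100% and is flagged, while one that fails once and recovers stays below the high threshold and only produces the normal alert.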

I've been trying to get the sysadmin-noc team in better shape, but unfortunately it seems like the few volunteers we have in the group can only commit an hour or two every other week, which so far hasn't been enough. Proper monitoring is far more complex than it seems. For example, we shouldn't be getting nearly as many alerts as we actually get. Part of this is monitoring; part of it is that our environment is getting unstable. Without more people I'm not sure what to do about the former, and the latter is getting put on the back burner for other things like the wiki.

fchiulli commented 15 years ago

Suggest that the following files be modified for monitoring apps. In an ideal world, I would like to test this in a test environment.

manifests/services/nagios.pp
In class nrpe:
From:
    configfile { '/etc/nagios/nrpe.cfg':
        source => 'system/nrpe.cfg',
        require => Package[nrpe],
        notify => Service[nrpe],
    }

To:
    templatefile { '/etc/nagios/nrpe.cfg':
        content => template('system/nagios/nrpe.cfg'),
        require => Package[nrpe],
        notify => Service[nrpe],
    }

configs/system/nagios/services/procs.cfg
Add a new service for each application to be monitored.
    define service {
        hostgroup_name servers ???
        service_description Monitor <application name>
        check_command check_by_nrpe!check_<application_name>
        use defaulttemplate
        retry_check_interval 5 ???
        max_check_attempts 12 ???
    }

configs/system/nrpe.cfg
Conditionally add a new command to monitor an application:
    <% if <app>Running.to_s == "1" %>
    command[check_<app>]=/usr/lib/nagios/plugins/check_procs -c 1:1 -a 'path_to_application' -u <app user id>
    <% end %>

manifests/nodes/app4.fedora.phx.redhat.com.pp
Assuming the application runs on app4, add the following line at the top of the file:
    $<app>Running = 1
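
For readers unfamiliar with the `check_procs -c 1:1 -a ... -u ...` command proposed above: it counts the processes owned by the given user whose command line contains the given string, and goes critical unless the count is within the range 1:1, i.e. exactly one. The sketch below only models that decision; `Proc`, `evaluate_proc_check`, and the message wording are made up for illustration and are not part of the plugin:

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

OK, CRITICAL = 0, 2  # standard nagios plugin exit codes


@dataclass
class Proc:
    user: str  # owner of the process
    args: str  # full command line


def evaluate_proc_check(
    procs: Iterable[Proc],
    arg_substring: str,
    user: str,
    crit_min: int = 1,
    crit_max: int = 1,
) -> Tuple[int, str]:
    """Count matching processes; critical if the count is outside the range."""
    count = sum(1 for p in procs if p.user == user and arg_substring in p.args)
    if crit_min <= count <= crit_max:
        return OK, f"OK: {count} matching process(es)"
    return CRITICAL, f"CRITICAL: {count} matching process(es)"
```

Note that a range of exactly one also alerts when two copies of the application are running, so an app that forks workers would need a wider range.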

fchiulli commented 15 years ago

This has been put on hold pending the investigation of Zabbix as a replacement for nagios.

ricky commented 15 years ago

Now that we have haproxy set up for most of our apps, this is good enough for me.

Metadata

Assignee: fchiulli
Tags: None
Blocking: None
Depending on: None
Priority: None