#2899 Add some nagios checks to autoqa01.qa and autoqa-stg01.qa

Created 5 years ago by athmane

We need to add:

  • ping checks to autoqa01.qa and autoqa-stg01.qa
  • checks if autotestd is running on autoqa01.qa

I used netcat to discover some public services which may need to be monitored:

  • SSH/22
  • HTTP/80
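A scan like the netcat one mentioned above could be sketched as follows. This assumes an `nc` build that supports `-z` (connect without sending data) and `-w` (timeout); the function names `probe` and `report_ports` are hypothetical and not from this ticket.

```shell
#!/bin/bash
# Sketch only: function names are illustrative, not from the ticket.

# Return 0 if a TCP connection to host ($1) port ($2) succeeds within 3 seconds.
# Assumes nc supports -z and -w.
probe() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Print one "host:port open" or "host:port closed" line per port given.
report_ports() {
    local host=$1
    shift
    local port
    for port in "$@"; do
        if probe "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port closed"
        fi
    done
}
```

For example, `report_ports autoqa01.qa 22 80` would report on the two services listed above.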

Also we should add a new contact for qa.

For the autotestd check we need 'nagios-plugins-procs' and its dependencies on autoqa01.qa (this box is currently not managed with puppet).
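For illustration, the process check boils down to something like the sketch below. It uses `pgrep` rather than the `check_procs` plugin itself, so it is a stand-in and not what nagios would actually run. The function names are hypothetical, and matching `autotestd` against the full command line is an assumption about how the daemon shows up in the process list.

```shell
#!/bin/bash
# Sketch only: a pgrep-based stand-in for a process check.
# Exit status follows the Nagios plugin convention: 0 = OK, 2 = CRITICAL.

# Count processes whose full command line matches the pattern.
count_procs() {
    pgrep -f -c -- "$1"
}

# Report whether at least one matching process is running.
check_running() {
    local name=$1
    local count
    count=$(count_procs "$name")
    if [ "${count:-0}" -ge 1 ]; then
        echo "PROCS OK: $count process(es) matching '$name'"
        return 0
    fi
    echo "PROCS CRITICAL: no process matching '$name'"
    return 2
}
```

It would be called as `check_running autotestd`.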

The following checks have been implemented:

  • ICMP Ping

  • SSH

  • Autotest Frontend (HTTP)

Adding jlaska for comment here.

I think we need:

  • plugins installed on the machines.

  • a list of emails/accounts that should get alerts? Or should we just make it 'sysadmin-qa' for ease?

  • Any other tests needed.

Replying to [comment:3 kevin]:

Adding jlaska for comment here.

I think we need:

  • plugins installed on the machines.

I've not used nagios before, but am happy to install any needed packages

  • a list of emails/accounts that should get alerts? Or should we just make it 'sysadmin-qa' for ease?

sysadmin-qa would be good, the only concern I have is that sysadmin-qa must also be members of sysadmin ... which generates a lot of email for events sysadmin-qa cannot act on (or isn't expected to). Is there a way to reduce email traffic or unsubscribe sysadmin-qa@ from sysadmin@ ?

  • Any other tests needed.

Autotest has a concept of whether a test client is active or in a failed state. There is a command-line tool (or XMLRPC interface) used to discover client state. Is there a way to trigger an email should autotest's view of a test client's state change? Or, maybe easier, only show when a system enters the autotest Failed state?

  • Everything working as normal (no output) ... {{{

atest host list -s Failed

}}}

  • A problem that needs investigation (output) ... {{{

atest host list -s Failed --unlocked

Host          Status  Locked  Platform  Labels
10.5.124.162  Failed  False   x86_64    fc15, virt
}}}

Replying to [comment:4 jlaska]:

Autotest has a concept of whether a test client is active or in a failed state. There is a command-line tool (or XMLRPC interface) used to discover client state. Is there a way to trigger an email should autotest's view of a test client's state change? Or, maybe easier, only show when a system enters the autotest Failed state?

  • Everything working as normal (no output) ... {{{

atest host list -s Failed

}}}

  • A problem that needs investigation (output) ... {{{

atest host list -s Failed --unlocked

Host          Status  Locked  Platform  Labels
10.5.124.162  Failed  False   x86_64    fc15, virt
}}}

I was just browsing Monitoring tickets and saw this. Don't know if it was resolved before, but since last change was made three months ago and I can't reach the autoqa01.qa box from bastion, here's my two cents.

We can run this very simple script (it does its job) from cron to trigger emails if something is wrong with autotest.

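{{{
#!/bin/bash

# List unlocked hosts that autotest reports as Failed (empty when all is well).
run=$(atest host list -s Failed --unlocked)

# Send mail only when the command produced output.
if [ -n "$run" ]; then
    echo "$run" | mail -s "There is a problem that needs investigation on $(hostname --fqdn)" put_email_address_here
fi
}}}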

Plain and simple.

If we want to add this check to nagios, we should probably write a new plugin. If we require this -- let me know, I will figure something out.
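A rough sketch of what such a plugin might look like follows. It relies on the standard Nagios plugin exit codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN) and on the behaviour described earlier in this ticket, where no output from `atest host list -s Failed --unlocked` means nothing is wrong. The function names and status messages are hypothetical, and treating a non-zero exit from `atest` as UNKNOWN is an assumption.

```shell
#!/bin/bash
# Sketch of a Nagios-style plugin around "atest host list".
# Nagios plugin exit codes: 0 = OK, 2 = CRITICAL, 3 = UNKNOWN.

# Wrapper around the atest call so it can be swapped out when testing.
list_failed_hosts() {
    atest host list -s Failed --unlocked
}

check_autotest_hosts() {
    local output
    # Assumption: atest exits non-zero when it cannot query the server.
    if ! output=$(list_failed_hosts 2>&1); then
        echo "AUTOTEST UNKNOWN: could not query host list"
        return 3
    fi
    # No output means no unlocked host is in the Failed state.
    if [ -z "$output" ]; then
        echo "AUTOTEST OK: no unlocked hosts in Failed state"
        return 0
    fi
    echo "AUTOTEST CRITICAL: hosts in Failed state"
    echo "$output"
    return 2
}
```

Saved as a standalone script, it would end with `check_autotest_hosts; exit $?` so that nagios sees the status code.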

If I understand the requirement correctly, this looks quite easy to write as a nagios plugin.

And new contact(s) can be set up for QA in nagios without the need to subscribe to any other list.

If I understood correctly and you still want this, assign it to me and I will provide a puppet patch.

The only missing info is who should get the alerts. Do you have a mailing list or a list of email addresses to use?

Adding Tim and Adam here.

What monitoring (if any) should we have for the qa network machines, and who should get notices?

Is this ticket still active?

somewhat. We are kind of waiting here to see what kind of monitoring we do want to add (if any) from the folks that control those machines. ;)

ok, the checks we have now are good.

Please file a new ticket for any more.
