#2883 add monitoring for mgmt interfaces
Closed: Fixed. Opened 12 years ago by kevin.

All the physical machines in phx2/qa have -mgmt interfaces.

We should monitor them.
Some information gathering is required before adding them.
Check each -mgmt host listed in DNS and confirm:

  • it's pingable from noc01
  • it's listening on http (port 80/tcp)
  • it's listening on https (port 443/tcp)

For any that are not answering, we need to work on fixing them.
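The three checks above could be scripted roughly as follows. This is only a sketch, not the script that produced the results in this ticket: the function names (`tcp_open`, `pingable`, `report_line`, `survey`) are made up for illustration, the `ping` flags assume the Linux iputils `ping`, and "listening" is interpreted as "a TCP connection to the port succeeds".

```python
import socket
import subprocess


def tcp_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def pingable(host, count=1, wait=3):
    """Return True if the system ping exits 0 (assumes Linux iputils flags)."""
    try:
        proc = subprocess.run(
            ["ping", "-c", str(count), "-W", str(wait), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # ping binary missing or not executable.
        return False
    return proc.returncode == 0


def report_line(check, host, ok):
    """Format one result in the 'check host : OK|ERR' style used in this ticket."""
    return f"{check} {host} : {'OK' if ok else 'ERR'}"


def survey(hosts):
    """Run ping, http (80/tcp) and https (443/tcp) checks for each host."""
    lines = []
    for host in hosts:
        lines.append(report_line("ping", host, pingable(host)))
        lines.append(report_line("http", host, tcp_open(host, 80)))
        lines.append(report_line("https", host, tcp_open(host, 443)))
    return lines
```

Run from noc01, `survey(["bnfs01-mgmt", "tape02-mgmt"])` would return one line per host and check. A TCP connect only shows that the port is open; it does not confirm that the web interface behind it actually works.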

Then add checks for:

  • ping to make sure they are responding
  • http
  • https

Pinging from noc01:

{{{

ping bnfs01-mgmt : OK
ping ;s390-hub-mgmt : ERR
ping tape02-mgmt : OK
ping tape02 : OK
ping backup01-mgmt : OK
ping backup03-mgmt : OK
ping bc01-mgmt : OK
ping bc02-mgmt : OK
ping bxen04-mgmt : OK
ping virthost-comm01-mgmt : OK
ping bxen03-mgmt : OK
ping bvirthost01-mgmt : ERR
ping virthost-comm01-mgmt : ERR
ping download01-mgmt : OK
ping download02-mgmt : OK
ping download03-mgmt : OK
ping download04-mgmt : OK
ping download05-mgmt : OK
ping junk01-mgmt : OK
ping junk02-mgmt : OK
ping junk05-mgmt : OK
ping ;ppc-comm01-mgmt : ERR
ping ;ppc-comm02-mgmt : ERR
ping qa01-mgmt : OK
ping qa02-mgmt : OK
ping qa03-mgmt : OK
ping qa04-mgmt : OK
ping qa05-mgmt : OK
ping qa06-mgmt : OK
ping qa07-mgmt : OK
ping qa08-mgmt : OK
ping ppc04-mgmt : OK
ping sign-vault01-mgmt : OK
ping sign-vaultXX-mgmt : OK
ping unknown00-mgmt : OK
ping unknown01-mgmt : ERR
ping unknown02-mgmt : ERR
ping unknown03-mgmt : ERR
ping unknown04-mgmt : OK
ping unknown05-mgmt : OK
ping unknown06-mgmt : OK
ping unknown07-mgmt : ERR
ping unknown08-mgmt : ERR
ping unknown09-mgmt : ERR
ping virthost01-mgmt : OK
ping virthost02-mgmt : ERR
ping virthost03-mgmt : OK
ping virthost13-mgmt : OK
ping xen01-mgmt : OK
ping junk05-mgmt : OK
ping xen03-mgmt : OK
ping xen04-mgmt : ERR
ping xen05-mgmt : OK
ping xen08-mgmt : ERR
ping xen09-mgmt : OK
ping junk03-mgmt : OK
ping xen15-mgmt : OK
ping xen16-mgmt : ERR
ping xen17-mgmt : ERR
ping xen18-mgmt : ERR
ping xen19-mgmt : ERR

}}}

HTTP tests:
{{{
http bnfs01-mgmt : OK
http ;s390-hub-mgmt : ERR
http tape02-mgmt : OK
http tape02 : OK
http backup01-mgmt : OK
http backup03-mgmt : OK
http bc01-mgmt : OK
http bc02-mgmt : OK
http bxen04-mgmt : OK
http virthost-comm01-mgmt : OK
http bxen03-mgmt : OK
http bvirthost01-mgmt : ERR
http virthost-comm01-mgmt : OK
http download01-mgmt : OK
http download02-mgmt : OK
http download03-mgmt : OK
http download04-mgmt : OK
http download05-mgmt : OK
http junk01-mgmt : OK
http junk02-mgmt : OK
http junk05-mgmt : OK
http ;ppc-comm01-mgmt : ERR
http ;ppc-comm02-mgmt : ERR
http qa01-mgmt : OK
http qa02-mgmt : OK
http qa03-mgmt : OK
http qa04-mgmt : OK
http qa05-mgmt : OK
http qa06-mgmt : OK
http qa07-mgmt : OK
http qa08-mgmt : OK
http ppc04-mgmt : OK
http sign-vault01-mgmt : OK
http sign-vaultXX-mgmt : OK
http unknown00-mgmt : ERR
http unknown01-mgmt : ERR
http unknown02-mgmt : ERR
http unknown03-mgmt : ERR
http unknown04-mgmt : ERR
http unknown05-mgmt : ERR
http unknown06-mgmt : ERR
http unknown07-mgmt : ERR
http unknown08-mgmt : ERR
http unknown09-mgmt : ERR
http virthost01-mgmt : OK
http virthost02-mgmt : ERR
http virthost03-mgmt : OK
http virthost13-mgmt : OK
http xen01-mgmt : OK
http junk05-mgmt : OK
http xen03-mgmt : OK
http xen04-mgmt : ERR
http xen05-mgmt : OK
http xen08-mgmt : ERR
http xen09-mgmt : OK
http junk03-mgmt : OK
http xen15-mgmt : OK
http xen16-mgmt : ERR
http xen17-mgmt : ERR
http xen18-mgmt : ERR
http xen19-mgmt : ERR
}}}

HTTPS tests:
{{{
https bnfs01-mgmt : ERR
https ;s390-hub-mgmt : ERR
https tape02-mgmt : ERR
https tape02 : ERR
https backup01-mgmt : OK
https backup03-mgmt : ERR
https bc01-mgmt : OK
https bc02-mgmt : ERR
https bxen04-mgmt : ERR
https virthost-comm01-mgmt : ERR
https bxen03-mgmt : ERR
https bvirthost01-mgmt : ERR
https virthost-comm01-mgmt : ERR
https download01-mgmt : OK
https download02-mgmt : ERR
https download03-mgmt : OK
https download04-mgmt : OK
https download05-mgmt : OK
https junk01-mgmt : OK
https junk02-mgmt : OK
https junk05-mgmt : OK
https ;ppc-comm01-mgmt : ERR
https ;ppc-comm02-mgmt : ERR
https qa01-mgmt : OK
https qa02-mgmt : OK
https qa03-mgmt : OK
https qa04-mgmt : OK
https qa05-mgmt : OK
https qa06-mgmt : OK
https qa07-mgmt : OK
https qa08-mgmt : OK
https ppc04-mgmt : ERR
https sign-vault01-mgmt : OK
https sign-vaultXX-mgmt : ERR
https unknown00-mgmt : ERR
https unknown01-mgmt : ERR
https unknown02-mgmt : ERR
https unknown03-mgmt : ERR
https unknown04-mgmt : ERR
https unknown05-mgmt : ERR
https unknown06-mgmt : ERR
https unknown07-mgmt : ERR
https unknown08-mgmt : ERR
https unknown09-mgmt : ERR
https virthost01-mgmt : ERR
https virthost02-mgmt : ERR
https virthost03-mgmt : ERR
https virthost13-mgmt : OK
https xen01-mgmt : ERR
https junk05-mgmt : OK
https xen03-mgmt : OK
https xen04-mgmt : ERR
https xen05-mgmt : OK
https xen08-mgmt : ERR
https xen09-mgmt : OK
https junk03-mgmt : OK
https xen15-mgmt : OK
https xen16-mgmt : ERR
https xen17-mgmt : ERR
https xen18-mgmt : ERR
https xen19-mgmt : ERR

}}}
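To work with result lists like the three above, the lines can be parsed into a small structure. The sketch below assumes the `check host : STATUS` layout shown in this ticket; `parse_results` and `failing` are illustrative names, and the rule that a host with both an OK and an ERR entry counts as reachable is an assumption.

```python
def parse_results(text):
    """Map (check, host) to the set of statuses seen for that pair."""
    results = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or " : " not in line:
            continue  # skip blank lines and anything not in the expected layout
        left, status = line.rsplit(" : ", 1)
        check, host = left.split(None, 1)
        results.setdefault((check, host), set()).add(status)
    return results


def failing(results, check):
    """Hosts whose entries for this check never include OK."""
    return sorted(
        host
        for (name, host), statuses in results.items()
        if name == check and "OK" not in statuses
    )


# A few lines copied from the ping results in this ticket.
SAMPLE = """
ping bnfs01-mgmt : OK
ping ;s390-hub-mgmt : ERR
ping virthost-comm01-mgmt : OK
ping virthost-comm01-mgmt : ERR
ping xen04-mgmt : ERR
"""

print(failing(parse_results(SAMPLE), "ping"))
```

Keeping a set of statuses per host makes duplicate entries visible, such as virthost-comm01-mgmt, which appears twice in the ping list with different results.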

Great. ;)

Please leave the following out of monitoring:

{{{

ping ;s390-hub-mgmt : ERR
ping ;ppc-comm01-mgmt : ERR
ping ;ppc-comm02-mgmt : ERR
ping unknown01-mgmt : ERR
ping unknown02-mgmt : ERR
ping unknown03-mgmt : ERR
ping unknown07-mgmt : ERR
ping unknown08-mgmt : ERR
ping unknown09-mgmt : ERR

http ;s390-hub-mgmt : ERR
http ;ppc-comm01-mgmt : ERR
http ;ppc-comm02-mgmt : ERR
http unknown00-mgmt : ERR
http unknown01-mgmt : ERR
http unknown02-mgmt : ERR
http unknown03-mgmt : ERR
http unknown04-mgmt : ERR
http unknown05-mgmt : ERR
http unknown06-mgmt : ERR
http unknown07-mgmt : ERR
http unknown08-mgmt : ERR
http unknown09-mgmt : ERR

https ;s390-hub-mgmt : ERR

https ;ppc-comm01-mgmt : ERR
https ;ppc-comm02-mgmt : ERR
https ppc04-mgmt : ERR
https sign-vaultXX-mgmt : ERR
https unknown00-mgmt : ERR
https unknown01-mgmt : ERR
https unknown02-mgmt : ERR
https unknown03-mgmt : ERR
https unknown04-mgmt : ERR
https unknown05-mgmt : ERR
https unknown06-mgmt : ERR
https unknown07-mgmt : ERR
https unknown08-mgmt : ERR
https unknown09-mgmt : ERR

}}}

These have now been cleaned up and removed from DNS:

{{{

ping xen08-mgmt : ERR
ping xen16-mgmt : ERR
ping xen17-mgmt : ERR
ping xen18-mgmt : ERR
ping xen19-mgmt : ERR

}}}

I've fixed https on the following:

{{{

bnfs01-mgmt
backup03-mgmt
bc02-mgmt
bxen04-mgmt
virthost-comm01-mgmt
bxen03-mgmt
download02-mgmt

}}}

Also, tape02 and tape02-mgmt are the same machine, and it has no https, so monitor only tape02-mgmt, for ping and http.

After filtering, the following hosts are not pingable from noc01 (DNS seems OK):

bvirthost01-mgmt: ERR
virthost02-mgmt: ERR
xen04-mgmt: ERR
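The filtering that leads to the three hosts above can be reproduced as a set difference. This is a sketch of the idea, not necessarily how the list was produced. The host names are copied from this ticket; `still_failing` and `ALSO_OK` are illustrative names, and treating virthost-comm01-mgmt as reachable (it appears in the ping results once as OK and once as ERR) is an assumption.

```python
# Hosts that returned ERR in the ping results quoted in this ticket.
PING_ERR = [
    ";s390-hub-mgmt", "bvirthost01-mgmt", "virthost-comm01-mgmt",
    ";ppc-comm01-mgmt", ";ppc-comm02-mgmt",
    "unknown01-mgmt", "unknown02-mgmt", "unknown03-mgmt",
    "unknown07-mgmt", "unknown08-mgmt", "unknown09-mgmt",
    "virthost02-mgmt", "xen04-mgmt", "xen08-mgmt",
    "xen16-mgmt", "xen17-mgmt", "xen18-mgmt", "xen19-mgmt",
]

# Hosts to be left out of ping monitoring.
EXCLUDED = {
    ";s390-hub-mgmt", ";ppc-comm01-mgmt", ";ppc-comm02-mgmt",
    "unknown01-mgmt", "unknown02-mgmt", "unknown03-mgmt",
    "unknown07-mgmt", "unknown08-mgmt", "unknown09-mgmt",
}

# Hosts that have since been removed from DNS.
REMOVED_FROM_DNS = {
    "xen08-mgmt", "xen16-mgmt", "xen17-mgmt", "xen18-mgmt", "xen19-mgmt",
}

# Assumption: listed twice in the ping results (OK and ERR), treated as reachable.
ALSO_OK = {"virthost-comm01-mgmt"}


def still_failing(errors, *skip_sets):
    """Return the hosts in errors that are not in any of the skip sets, sorted."""
    skipped = set().union(*skip_sets)
    return sorted(host for host in set(errors) if host not in skipped)


print(still_failing(PING_ERR, EXCLUDED, REMOVED_FROM_DNS, ALSO_OK))
# ['bvirthost01-mgmt', 'virthost02-mgmt', 'xen04-mgmt']
```

Without the `ALSO_OK` set, virthost-comm01-mgmt would also show up in the result, so its duplicate entry may be worth a second look.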

Manual verification:

{{{
[athmane@noc01 ~]$ ping bvirthost01-mgmt.phx2.fedoraproject.org
PING bvirthost01-mgmt.phx2.fedoraproject.org (10.5.126.224) 56(84) bytes of data.
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=2 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=3 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=4 Destination Host Unreachable
^C
--- bvirthost01-mgmt.phx2.fedoraproject.org ping statistics ---
4 packets transmitted, 0 received, +3 errors, 100% packet loss, time 3710ms

[athmane@noc01 ~]$ ping virthost02-mgmt.phx2.fedoraproject.org
PING virthost02-mgmt.phx2.fedoraproject.org (10.5.126.223) 56(84) bytes of data.
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=1 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=2 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=3 Destination Host Unreachable
^C
--- virthost02-mgmt.phx2.fedoraproject.org ping statistics ---
5 packets transmitted, 0 received, +3 errors, 100% packet loss, time 4541ms

[athmane@noc01 ~]$ ping xen04-mgmt.phx2.fedoraproject.org
PING xen04-mgmt.phx2.fedoraproject.org (10.5.126.204) 56(84) bytes of data.
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=2 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=3 Destination Host Unreachable
From noc01.phx2.fedoraproject.org (10.5.126.41) icmp_seq=4 Destination Host Unreachable
^C
--- xen04-mgmt.phx2.fedoraproject.org ping statistics ---
6 packets transmitted, 0 received, +3 errors, 100% packet loss, time 5920ms
}}}

Yes, all three of those should work, but do not. ;(

We are going to need to power cycle those machines to get the mgmt interfaces to reset and, hopefully, come back up. So I would like to add them to monitoring and then ack the alerts, so that we know they are pending problems to be fixed and we know when they come back up. ;)
