Issue #3191: reparent nagios sites - fedora-infrastructure

Closed: Fixed None Opened 12 years ago by kevin.

Not entirely sure this is an easyfix, but it shouldn't be too hard.

Our main nagios instance (noc01) is at our main site (phx2).

It monitors machines at our other sites as well.

When there's a network issue reaching one site from phx2, we get a bunch of alerts for all the things it can't reach. Instead, we should add a parent for each of our remote sites that's a router/gateway for that network. Then we can get ONE alert that the site is down, and not a flurry of them.

Additionally, noc02 (at telia) does the same thing when it can't reach a site, so it would be good to fix it too.

Sites:

tummy
telia
ibiblio
osuosl
serverbeach
internetx

Hopefully we can identify a 'last hop' router for each of those to monitor and parent all the hosts at that site to it so we only get one alert.
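
A minimal sketch of why a per-site parent collapses the flurry into one alert. It is a simplified model, not Nagios itself: a host whose check fails is treated as DOWN if it has no parent or its parent is up, and as UNREACHABLE if its parent failed too. It also assumes notifications are only sent for DOWN hosts (that is, `u` is not in the hosts' notification options). The router name `tummy-gw` and the host list are hypothetical.

```python
# Hypothetical topology: each host maps to its single parent
# (None = directly reachable from the Nagios server).
PARENTS = {
    "tummy-gw": None,        # assumed last-hop router for the site
    "tummy01": "tummy-gw",
    "proxy03": "tummy-gw",
    "smtp-mm02": "tummy-gw",
}


def classify(parents, failed):
    """Return {host: state} for every host whose check failed.

    A failed host is DOWN when it has no parent or its parent is up,
    and UNREACHABLE when its parent failed as well.
    """
    states = {}
    for host in failed:
        parent = parents.get(host)
        if parent is not None and parent in failed:
            states[host] = "UNREACHABLE"
        else:
            states[host] = "DOWN"
    return states


def alerts(parents, failed):
    """Hosts that would notify, assuming only DOWN states notify."""
    return sorted(h for h, s in classify(parents, failed).items() if s == "DOWN")


# The whole site drops off the network, so every check fails.
site_outage = set(PARENTS)
print(alerts(PARENTS, site_outage))                     # router as parent
print(alerts({h: None for h in PARENTS}, site_outage))  # no parenting at all
```

With the router as parent only the router itself alerts; with no parenting every host at the site does.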

mangas commented 12 years ago

attachment
ffm-b6-link.cfg

mangas commented 12 years ago

Hi,

That patch will reparent the hosts at telia and hopefully resolve the alerts problem.

The ffm-b6-link.cfg file is the host config file to monitor the peer that is common to all telia nodes; for some reason it wouldn't appear in the diff.

Not all the hosts needed reparenting, because they all depend on telia01, but I did it anyway.

ctria commented 12 years ago

We discussed this with mangas on IRC; I'm putting some comments here for further discussion. Having multiple parents in Nagios means that you have multiple ways to reach the node. Thus the parents' availability is ORed in order to find out whether an outage of a host is DOWN or UNREACHABLE.
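
The ORing described above can be sketched roughly as follows. This is a simplified model rather than Nagios's actual code: a failed host counts as UNREACHABLE only when none of its parents is up; if any parent is up, the failure is attributed to the host itself and it is DOWN. All host names and parent lists here are hypothetical.

```python
def host_state(host, parents, up):
    """Classify a host in a simplified Nagios-style reachability model.

    parents: dict mapping host -> list of parent hosts (missing = none)
    up:      set of hosts whose own check currently succeeds
    """
    if host in up:
        return "UP"
    host_parents = parents.get(host, [])
    # Parent availability is ORed: one reachable parent means there is
    # still a path to the host, so the failure is the host's own.
    if not host_parents or any(p in up for p in host_parents):
        return "DOWN"
    return "UNREACHABLE"


# Hypothetical parent lists for illustration.
parents = {
    "remote-host.vpn": ["site-gw", "vpn-hub"],
    "remote-host": ["site-gw"],
}
up = {"vpn-hub"}  # site-gw and everything behind it is failing

print(host_state("remote-host.vpn", parents, up))  # second parent is still up
print(host_state("remote-host", parents, up))      # only parent is not up
```

In this model the two-parent host is reported DOWN (and would alert) even though the real cause is the site outage, while the single-parent host is correctly UNREACHABLE.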

Now, from a check I did on [[https://admin.fedoraproject.org/nagios/cgi-bin//statusmap.cgi?host=all&createimage&time=1331628431&canvas_x=0&canvas_y=0&canvas_width=957&canvas_height=919&max_width=0&max_height=0&layout=5&layermode=exclude | admin.fp.o/nagios]], it looks like all hosts have a parent, so we are limited to the following cases:
- hosts have the wrong parent, so a failed parent affects the wrong children
- hosts have multiple parents, so one failed parent doesn't affect their availability (I've seen that case with some vpn hostnames)
- parents are checked less frequently than their children, so their status is not updated before the children's
- something that we haven't thought of

Would it be easy to get a list of cases that sent email but should not have? Then we could check the config of those specific cases and, if something is wrong, extrapolate it to other cases too.

kevin commented 12 years ago

Yeah, I think it would be good to examine some more specific cases we want to fix... in general things should only have one parent, or at most two. (we have a vpn, but if the main network is down, the vpn is very likely down as well since it uses that. ;)

So, here's some of the cases we want to fix:

internetx01 (the host) was down and we got:

{{{
Mar 10 19:34:15 <zodbot> PROBLEM - proxy02-wildcard.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Mar 10 19:34:21 <zodbot> PROBLEM - proxy02-fpo.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Mar 10 19:34:27 <zodbot> PROBLEM - proxy02.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Mar 10 19:34:43 <zodbot> PROBLEM - 85.236.55.7-people02 is DOWN: PING CRITICAL - Packet loss = 100% (noc02)
Mar 10 19:35:13 <zodbot> PROBLEM - 85.236.55.5-internetx is DOWN: PING CRITICAL - Packet loss = 100% (noc02)
Mar 10 19:35:17 <zodbot> PROBLEM - 85.236.55.6-internetx is DOWN: PING CRITICAL - Packet loss = 100% (noc02)
Mar 10 19:35:37 <zodbot> PROBLEM - ns05.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Mar 10 19:35:45 <zodbot> PROBLEM - ns05.fedoraproject.org is DOWN: PING CRITICAL - Packet loss = 100% (noc02)
Mar 10 19:36:38 <zodbot> PROBLEM - internetx01.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Mar 10 19:36:46 <zodbot> PROBLEM - people02.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
}}}

tummy01 host was unreachable:

{{{
Feb 26 15:43:28 <zodbot> PROBLEM - proxy03-fpo.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Feb 26 15:43:33 <zodbot> PROBLEM - proxy03-wildcard.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Feb 26 15:43:38 <zodbot> PROBLEM - proxy03.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
Feb 26 15:43:45 <zodbot> PROBLEM - tummy01.vpn.fedoraproject.org is DOWN: CHECK_NRPE: Socket timeout after 20 seconds. (noc01)
}}}

The parenting on the above 2 could be improved...

Looking back in history, it looks like telia has another issue.
telia gets occasional very high packet loss to/from phx2:

{{{
Feb 04 09:19:53 <zodbot> PROBLEM - smtp-mm01.fedoraproject.org/ICMP Ping is CRITICAL: PING CRITICAL - Packet loss = 70%, RTA = 176.67 ms (noc01)
Feb 04 09:22:13 <zodbot> PROBLEM - proxy05-fpo.vpn.fedoraproject.org/ICMP Ping is WARNING: PING WARNING - Packet loss = 50%, RTA = 175.31 ms (noc01)
Feb 04 09:27:15 <zodbot> PROBLEM - proxy05-fpo.vpn.fedoraproject.org/ICMP Ping is CRITICAL: PING CRITICAL - Packet loss = 70%, RTA = 175.89 ms (noc01)
Feb 04 09:33:54 <zodbot> PROBLEM - proxy05-wildcard.vpn.fedoraproject.org/ICMP Ping is CRITICAL: PING CRITICAL - Packet loss = 80%, RTA = 174.67 ms (noc01)
Feb 04 09:44:55 <zodbot> PROBLEM - smtp-mm01.fedoraproject.org/ICMP Ping is WARNING: PING WARNING - Packet loss = 44%, RTA = 174.58 ms (noc01)
Feb 04 09:49:55 <zodbot> PROBLEM - smtp-mm01.fedoraproject.org/ICMP Ping is CRITICAL: PING CRITICAL - Packet loss = 70%, RTA = 176.39 ms (noc01)
Feb 04 09:53:53 <zodbot> PROBLEM - proxy05-wildcard.vpn.fedoraproject.org/ICMP Ping is WARNING: PING WARNING - Packet loss = 50%, RTA = 173.94 ms (noc01)
Feb 04 09:58:55 <zodbot> PROBLEM - proxy05-wildcard.vpn.fedoraproject.org/ICMP Ping is CRITICAL: PING CRITICAL - Packet loss = 60%, RTA = 176.41 ms (noc01)
}}}

I don't think this can be solved in parenting, but we should fix the parenting there too in any case.
Hopefully that gives some examples we can solve. ;)

mangas commented 12 years ago

It looks like proxy02, ns05 and people02 had both bastion-vpn and internetx01 as parents, which in this case triggered the alerts: because bastion-vpn was still up, the hosts were considered DOWN rather than UNREACHABLE.

proxy03 had the same problem with tummy01, and proxy03-fpo was parented by proxy03 and bastion-vpn. In this case I think bastion-vpn should be dropped and proxy03 kept, that is, if it's the correct parent.
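
One way to find other hosts with the same problem would be to list every host definition that names more than one parent. A rough sketch, assuming the standard Nagios object syntax (`define host { ... }` blocks with `host_name` and a comma-separated `parents` directive). The sample definitions are invented for illustration and are not the real Fedora config, and the parser does not resolve templates (`use`).

```python
import re

# Matches "define host { ... }" but not hostgroup/hostdependency blocks.
HOST_BLOCK = re.compile(r"define\s+host\s*\{(.*?)\}", re.DOTALL)


def parse_hosts(config_text):
    """Return {host_name: [parents]} for each 'define host' block."""
    hosts = {}
    for body in HOST_BLOCK.findall(config_text):
        directives = {}
        for line in body.splitlines():
            line = line.split(";", 1)[0].strip()  # drop ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)           # directive, value
            if len(parts) == 2:
                directives[parts[0]] = parts[1].strip()
        name = directives.get("host_name")
        if name:
            raw = directives.get("parents", "")
            hosts[name] = [p.strip() for p in raw.split(",") if p.strip()]
    return hosts


def multi_parent_hosts(config_text):
    """Hosts that list more than one parent, with those parents."""
    return {h: p for h, p in parse_hosts(config_text).items() if len(p) > 1}


# Invented sample, loosely modelled on the hosts named in this ticket.
SAMPLE = """
define host {
    host_name   proxy03-fpo.vpn.fedoraproject.org
    parents     proxy03.vpn.fedoraproject.org, bastion-vpn
}
define host {
    host_name   tummy01.fedoraproject.org
}
"""

for host, host_parents in multi_parent_hosts(SAMPLE).items():
    print(host, "->", ", ".join(host_parents))
```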

mangas commented 12 years ago

attachment
changes

kevin commented 12 years ago

ok, I see the issue here, and we will need to figure out how to fix it.

ANY host we have defined as vpn.fedoraproject.org means that it goes over the vpn to reach that host.
Logically speaking they can only have bastion-vpn as their parent, because there's only one network link there. However we can fix this by:

Creating host records for each of the machines at remote sites, so, for example for tummy:

proxy03.fedoraproject.org
smtp-mm02.fedoraproject.org
unbound-tummy01.fedoraproject.org

(for any that don't already exist).

Then, as parent, each of those has 'tummy01.fedoraproject.org' (which also needs a host check record).

Then, on the vpn side:

proxy03.vpn.fedoraproject.org has as parents proxy03.fedoraproject.org and bastion-vpn.
smtp-mm02.vpn.fedoraproject.org has as parents smtp-mm02.fedoraproject.org and bastion-vpn.
unbound-tummy01.vpn.fedoraproject.org has as parents unbound-tummy01.fedoraproject.org and bastion-vpn.

Does that make sense?
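
A rough sketch of the scheme above, generating the host records for one site. It only emits `host_name` and `parents`; `address` and check settings are left out, and the template name `generic-host` is a hypothetical placeholder for whatever template the real config uses.

```python
def site_host_stanzas(site_parent, machines, vpn_parent="bastion-vpn",
                      domain="fedoraproject.org", template="generic-host"):
    """Build Nagios host definitions for one remote site.

    site_parent: short name of the site's main host (e.g. "tummy01")
    machines:    short names of the other machines at that site
    template:    hypothetical host template name; adjust to the real one
    """
    def stanza(name, parents):
        lines = [
            "define host {",
            f"    use        {template}",
            f"    host_name  {name}",
        ]
        if parents:
            lines.append(f"    parents    {','.join(parents)}")
        lines.append("}")
        return "\n".join(lines)

    site_host = f"{site_parent}.{domain}"
    out = [stanza(site_host, [])]
    for machine in machines:
        external = f"{machine}.{domain}"
        vpn = f"{machine}.vpn.{domain}"
        # External record hangs off the site host ...
        out.append(stanza(external, [site_host]))
        # ... and the vpn record off its external twin plus the vpn hub.
        out.append(stanza(vpn, [external, vpn_parent]))
    return "\n\n".join(out)


print(site_host_stanzas("tummy01", ["proxy03", "smtp-mm02", "unbound-tummy01"]))
```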

skvidal commented 12 years ago

That seems to make sense to me. If we stored the parent information in infra-hosts we could change it and track it as things move around.

mangas commented 12 years ago

attachment
newChanges

marcelk commented 12 years ago

Hi,

what action is still required here? Please excuse the silly question: I have to ask because I cannot really say how well parented the hosts are, due to my unfamiliarity with the server infrastructure.

kevin commented 12 years ago

Well, we need to do basically what I outlined in comment #5 I think...

It's sort of cheating, but we set up non-vpn and vpn versions of each host check and then parent the end vpn nodes on their real nodes. That way we can get only one alert.

I can just try and do this to one of our sites to explain what I am thinking and confirm it works.

mangas commented 12 years ago

Adjusting the parenting config for proxy03/proxy03-vpn and tummy01, and adding a tummy01 dependency
changes.2

spack commented 11 years ago

Working on this and I think I'm making good progress.

Maybe we should keep the parent definitions as is (plus some corrections), so we stay close to the network topology, and use host dependencies for the VPN hosts?

kevin commented 11 years ago

Replying to [comment:9 spack]:

> Working on this and I think I'm making good progress.
>
> Maybe we should keep the parent definitions as is (plus some corrections), so we stay close to the network topology, and use host dependencies for the VPN hosts?

Yes, that may well work out. Then when a host is down, the dependency will work so that the vpn is marked down as well and the flood of alerts is reduced.
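
A simplified model of how such a host dependency could cut the alerts. It mimics a Nagios `hostdependency` with `notification_failure_criteria d,u`: notifications for the dependent host are suppressed while a host it depends on is DOWN or UNREACHABLE. The host names and the dependency map are assumptions for illustration.

```python
def should_notify(host, states, dependencies,
                  failure_criteria=("DOWN", "UNREACHABLE")):
    """Decide whether a problem notification goes out for `host`.

    states:       {host: "UP" | "DOWN" | "UNREACHABLE"}
    dependencies: {dependent_host: [hosts it depends on]}
    """
    if states.get(host, "UP") == "UP":
        return False  # nothing to report
    for master in dependencies.get(host, []):
        if states.get(master, "UP") in failure_criteria:
            return False  # suppressed by the failed master host
    return True


# Hypothetical dependency: the vpn record depends on the external record.
deps = {"proxy03.vpn": ["proxy03"]}

host_down = {"proxy03": "DOWN", "proxy03.vpn": "DOWN"}
vpn_only = {"proxy03": "UP", "proxy03.vpn": "DOWN"}

print([h for h in host_down if should_notify(h, host_down, deps)])
print([h for h in vpn_only if should_notify(h, vpn_only, deps)])
```

When the machine itself is down only the external record alerts; when just the vpn is down the vpn record still alerts.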

kevin commented 11 years ago

I finally sat down and committed this rework. :)

Instead of adding deps on the vpn side, I just did:

- Replaced all vpn hosts with their external hostname.
- Set a new nrpe check on all vpn clients to ping back to the server. This should let us see when it's just the vpn that's down.
- Reparented everything.
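
For illustration only, one possible shape for the logic behind such a ping-back check: turn the packet-loss figure from `ping` summary output into a standard Nagios plugin exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The thresholds are hypothetical, and this is not the check that was committed; in practice the stock `check_ping` plugin run through NRPE covers the same ground.

```python
import re

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard Nagios plugin exit codes

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")


def status_from_ping_output(output, warn_pct=20.0, crit_pct=60.0):
    """Map `ping` summary text to (exit_code, message).

    warn_pct and crit_pct are hypothetical packet-loss thresholds.
    """
    match = LOSS_RE.search(output)
    if match is None:
        return UNKNOWN, "PING UNKNOWN - could not parse packet loss"
    loss = float(match.group(1))
    if loss >= crit_pct:
        return CRITICAL, f"PING CRITICAL - Packet loss = {loss:g}%"
    if loss >= warn_pct:
        return WARNING, f"PING WARNING - Packet loss = {loss:g}%"
    return OK, f"PING OK - Packet loss = {loss:g}%"


# Summary line in the format printed by Linux iputils ping.
sample = "4 packets transmitted, 0 received, 100% packet loss, time 3004ms"
code, message = status_from_ping_output(sample)
print(code, message)
```

A wrapper script on the vpn client would run `ping` against the server's vpn address, pass the output to this function, print the message and exit with the code.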

Metadata

Assignee

spack
