#4044 Write nagios checks to ensure specific fedmsg consumers are running in memory
Closed: Fixed None Opened 10 years ago by ralph.

Background: fedmsg consumers run as plugins to a single fedmsg-hub daemon. For instance, a fedmsg-hub daemon runs on badges-backend01. It loads a single BadgesConsumer plugin that does the work of awarding badges.

It is possible, even common, for the daemon to start up fine but for one of its consumers to raise an exception at startup. The daemon continues to run, but with zero plugins loaded; it does zero work.

This is deceiving because when nagios looks at the process list, it sees fedmsg-hub running and it thinks everything is fine.

We need some fedmsg CLI command that interrogates a fedmsg twisted process and asks it for the list of consumers that it successfully loaded. We then need to write nagios/nrpe checks to ensure consumers we want to run are running.


Replying to [comment:1 ralph]:

This is blocking on https://github.com/fedora-infra/fedmsg/issues/194

I'm up for working on this, if that's possible for an apprentice.

Replying to [comment:2 sheap]:

I'm up for working on this, if that's possible for an apprentice.

Yeah, definitely! Thanks!

FWIW, the interface will need to be added to the moksha hub. The code for that lives at https://github.com/mokshaproject/moksha

If you have any questions, please do ask either me (threebean) or lmacken in #fedora-apps.

sheap, there's a patch pending on moksha which should make this possible: https://github.com/mokshaproject/moksha/pull/11

Here is first try. What do I need to fix?

Hey Janez! I just got the collectd version of this working!

See here for the graph -> https://admin.fedoraproject.org/collectd/bin/graph.cgi?hostname=notifs-backend01.phx2.fedoraproject.org;plugin=fedmsg;plugin_instance=hub;type=queue_length;type_instance=FMNConsumer_backlog;begin=-3800

See here for the script -> http://infrastructure.fedoraproject.org/cgit/ansible.git/tree/roles/collectd/fedmsg-service/templates/fedmsg-service-collectd.py

And then you use the ansible role like this -> http://infrastructure.fedoraproject.org/cgit/ansible.git/tree/playbooks/groups/notifs-backend.yml#n66


You should, if you can, log in to the notifs-backend01.stg host and try to run your check there. It might help to figure out what needs to be done next. If you can't log in there, please speak up in #fedora-admin and someone will get you into the fi-apprentice program and hooked up with ssh access.

Some thoughts:

  • It just changed recently, but the ipc monitoring socket is located at /var/run/fedmsg/monitoring-PROCESS_NAME.socket now. See the collectd script for an example.
  • Right now the script loops over all installed consumers and fails if any one of them is not initialized. That won't quite do as there are some cases where we have a consumer installed, but it is intentionally not initialized.
  • The nagios check should probably take the consumer name as an argument. That way in nagios we can set up three checks: one for ConsumerX, another for ConsumerY, and another for ConsumerZ. If we add another consumer later, we just add another check but re-use the same script.
  • In the sys.exit(0) case, I think nagios wants you to print something like "OK - fedmsg-blah has ConsumerX initialized just fine". That shows up in the status reports so humans reading it can understand.
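
A minimal sketch of a per-consumer check along those lines. It assumes the monitoring socket output has already been read and decoded into a dict, and that each consumer entry carries `name` and `initialized` fields; that payload shape and the `check_consumer` helper are assumptions for illustration, not taken from the actual script. The exit codes are the standard Nagios plugin ones (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_consumer(status, process_name, consumer_name):
    """Return (exit_code, message) for a single named consumer.

    `status` is assumed to be the decoded monitoring payload, shaped like
    {"consumers": [{"name": ..., "initialized": bool}, ...]}.
    """
    for consumer in status.get("consumers", []):
        if consumer.get("name") != consumer_name:
            continue
        if consumer.get("initialized"):
            return OK, "OK - %s has %s initialized just fine" % (
                process_name, consumer_name)
        return CRITICAL, "CRITICAL - %s has %s loaded but not initialized" % (
            process_name, consumer_name)
    # The consumer we were asked about is not reported at all.
    return CRITICAL, "CRITICAL - %s does not report %s" % (
        process_name, consumer_name)


# Hypothetical payload, for illustration only.
sample = {"consumers": [
    {"name": "ConsumerX", "initialized": True},
    {"name": "ConsumerY", "initialized": False},
]}

code, message = check_consumer(sample, "fedmsg-blah", "ConsumerX")
print(message)
```

In a real check the script would print the message and then call `sys.exit(code)`, so nagios gets both the human-readable line and the state. Because only the named consumer is inspected, a consumer that is installed but intentionally not initialized does not trip the check unless someone explicitly configures a check for it.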

Future thoughts that make things more complicated; we can ignore them for now.
- It would be cool if the check script could handle checking Producers as well as Consumers. See the notifs-backend01.stg ipc output for an example of some producers.
- It would be cool if we could have other separate nagios checks that raise a WARNING or an ERROR if the backlog for a consumer is over 100 or over 500 or something. Similarly, if the exceptions count is over 1 or 10, etc. That would make for three different nagios checks for each consumer on each host.
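
A rough sketch of what such a threshold check could look like, assuming each consumer entry in the monitoring output exposes numeric `backlog` and `exceptions` fields (the field names and the `check_threshold` helper are assumptions for illustration). The 100/500 and 1/10 limits are just the example numbers from the list above.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def check_threshold(label, value, warning, critical):
    """Map a numeric reading onto a Nagios state using two thresholds."""
    if value > critical:
        return CRITICAL, "CRITICAL - %s is %d (over %d)" % (
            label, value, critical)
    if value > warning:
        return WARNING, "WARNING - %s is %d (over %d)" % (
            label, value, warning)
    return OK, "OK - %s is %d" % (label, value)


# Hypothetical reading for one consumer; field names are assumptions.
consumer = {"name": "ConsumerX", "backlog": 250, "exceptions": 0}

print(check_threshold("ConsumerX backlog", consumer["backlog"], 100, 500)[1])
print(check_threshold("ConsumerX exceptions", consumer["exceptions"], 1, 10)[1])
```

Taking the thresholds as parameters keeps one script reusable for both the backlog and the exceptions check, with the limits set per host in the nagios configuration.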

I'm gonna try and improve script. I will ping you if I need help.

What about this check? It checks all producers and consumers passed as arguments on command line. Socket is hardcoded at the moment but that will be changed. What do you think Ralph?

Yeah! Looks good to me :) Were you able to log in and test it on any of the staging nodes, e.g. notifs-backend01.stg?

Just a minor thing -- I might change it from a WARNING to an ERROR if a particular consumer isn't initialized. It is usually a pretty bad thing to discover.

With that, some testing, and making the ipc socket filename flexible we can probably ship this out soon.

I was able to log in to notifs-backend01.stg and test it as much as possible. But it should be tested properly, since that is a limited environment.
I will change WARNING to ERROR. Slowly we're getting there.

Checks for consumer's backlog and exceptions. These checks check just one consumer, but I can change that so that they can check several consumers. Just say what you would like to see.
One more thing: Ralph, feel free to change the print messages. They might sound awkward, since I'm a non-native English speaker.

Janez, these look really good.

One minor thing, which I can just fix when these get committed:

  • check_fedmsg_producers_consumers.py, the sys.exit(1) needs to be a sys.exit(2), I think.

Also, with the fname = '/var/run/fedmsg/monitoring-fedmsg-hub.socket' line: we can't hardcode the 'fedmsg-hub' name in there because we also have the 'fedmsg-irc' process (which works the same way), the 'fedmsg-relay' process, and the 'fedmsg-gateway' process. All of the code should work the same way for them, we just need to allow for their name.

Perhaps add it as sys.argv[1] for each check.
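
For instance, something like the following, where `monitoring_socket_path` is a hypothetical helper name and the path pattern is the one from the `fname` line above:

```python
def monitoring_socket_path(process_name):
    """Build the monitoring socket path for a given fedmsg process."""
    return "/var/run/fedmsg/monitoring-%s.socket" % process_name


# In a check script, process_name would come from sys.argv[1].
for name in ("fedmsg-hub", "fedmsg-irc", "fedmsg-relay", "fedmsg-gateway"):
    print(monitoring_socket_path(name))
```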

Would you be interested in adding these to the puppet repo? I.e., developing a patch for that here?

In the case of the socket line I think the best solution is to have an ansible variable that is passed from the playbook. Let's use the same approach that you did for collectd.

I will try to add these to the repo. If there are problems I will contact you for help.

One thing. There is nagios stuff in ansible and in puppet repo. Which one is the right one to add fedmsg checks?

Replying to [comment:15 janeznemanic]:

One thing. There is nagios stuff in ansible and in puppet repo. Which one is the right one to add fedmsg checks?

Unfortunately, it is not clear at the moment. I added collectd stuff for this only to ansible and skipped puppet for the time being.

Host groups that have fedmsg daemons that are under the control of ansible are:
- summershum (fedmsg-hub)
- badges-backend (fedmsg-hub)
- notifs-backend (fedmsg-hub)

The nagios head node is still controlled by puppet, so I think you're going to have to make modifications there no matter what.

Hosts that have fedmsg daemons running that are still under puppet are:
- busgateway01 (fedmsg-hub, fedmsg-relay, fedmsg-gateway)
- app01 (fedmsg-relay)
- value03 (fedmsg-irc)
- pkgs01 (fedmsg-hub)

The socket file name is now composed from sys.argv[1] and a hardcoded string.

And the first step towards fully implemented nagios checks. Something to start with.

Ralph, could you give a short comment on the above patch? Am I on the right track?
I need to add consumers and producers to every command. But I'm not sure how to find consumers and producers that run on relevant hosts. I assume that I could probably get names of consumers and producers from fedmsg configuration files (endpoints-* files). Ralph could you give me a few tips how to find names of producers and consumers?

Replying to [comment:19 janeznemanic]:

Ralph, could you give a short comment on the above patch? Am I on the right track?
I need to add consumers and producers to every command. But I'm not sure how to find consumers and producers that run on relevant hosts. I assume that I could probably get names of consumers and producers from fedmsg configuration files (endpoints-* files).

It looks good to me!

Ralph could you give me a few tips how to find names of producers and consumers?

That information isn't documented anywhere. Having nagios checks will be nice because we can point and say "see, here is what should be running".

To figure it out: if you log into each host, you should see them in the /var/run/fedmsg/monitoring-%s.socket output.
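
As a sketch of pulling the names out of that output once it has been read and decoded into a dict: the `consumers`/`producers` keys with a `name` field per entry are an assumed payload shape, `list_names` is a hypothetical helper, and `ExampleProducer` is a made-up name.

```python
def list_names(status):
    """Return sorted consumer and producer names from a decoded payload."""
    return {
        kind: sorted(entry["name"] for entry in status.get(kind, []))
        for kind in ("consumers", "producers")
    }


# Hypothetical payload, for illustration only.
sample = {
    "consumers": [{"name": "FMNConsumer", "initialized": True}],
    "producers": [{"name": "ExampleProducer", "initialized": True}],
}

for kind, names in sorted(list_names(sample).items()):
    print("%s: %s" % (kind, ", ".join(names)))
```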

Managed to log into all machines except pkgs01. Here's the list of fedmsg consumers and producers on machines.

The latest patch. I think the addition of new checks to nagios should be almost done. At the moment values for warning and critical are passed to consumer's check for exceptions and backlog. Ralph would you like to see those values hardcoded in scripts? Or is the current solution OK? Anyway what is OK, what do you want me to change, what can I improve ...

Janez, this is all deployed and pushed out now (both for puppet and ansible).

I set threshold levels for warning and critical and filled in the couple missing consumers.

http://infrastructure.fedoraproject.org/cgit/ansible.git/commit/?id=c96195506a772d1d1f857c46c45967d6064a1a77

Thanks so much for working on this. It's great to have it in place. :)
