#6756 Add httpd monitoring for mailman
Closed: Fixed 6 years ago Opened 7 years ago by mizdebsk.

Today httpd on mailman01.phx2.fedoraproject.org was stopped for more than 4 hours and we didn't get any notification about this.

Mar 01 01:08:24 httpd.service failed.
Mar 01 05:16:41 Starting The Apache HTTP Server...

Nagios notifications should be added to prevent this from happening in future.


Metadata Update from @smooge:
- Issue assigned to smooge

7 years ago

Metadata Update from @smooge:
- Assignee reset

7 years ago

If an apprentice needs help on this please contact @smooge

I am going to work on it.

Metadata Update from @mizdebsk:
- Issue assigned to dbruno

7 years ago

Metadata Update from @kevin:
- Issue priority set to: Waiting on Asignee

7 years ago

I'll work on this- I just recently dealt with a similar problem on my home server.

Yeah, that's not going to work here. Let me give a bit of background:

All our configuration is in our ansible repo (which you can clone on batcave01.phx2.fedoraproject.org with 'git clone /git/ansible' or you can look at it remotely at: https://infrastructure.fedoraproject.org/cgit/ansible.git/ ).

The nagios configuration we use uses ansible templates to generate a lot of the config files. You can see the live config files on noc01.phx2.fedoraproject.org under /etc/nagios/

For this ticket we are looking for a patch against the ansible repo to add website monitoring for lists.fedoraproject.org. You will want to look at the existing website checks in roles/nagios_server/templates/nagios/services/websites.cfg.j2 (the websites check template), add one for lists there, and then generate a patch and attach it to this ticket.

Happy to help with patch creating or any other workflow issues...

And thanks much for working on this!

My patch edited parts of the config for how nagios runs- are you trying to tell me that I should be looking at our ansible repo's nagios directory instead and attaching a new config to this post?

I'm just a lurker, but doesn't the check mailman api service already check http on mailman01? The service uses the check_mailman_api command, which uses check_http (via check_nrpe).

That service check was added in 1cb897a76 ("Clean up nagios client for old stuff that no longer matters. Add a mailman api check. It gets a 401 now, but at least that tells us it's working.", 2018-02-06), about a month before this ticket and the incident which led to it.

Perhaps the notifications were not sent due to some other error or they were not sent to the right people? Looking at the nagios web interface, it appears no notifications were sent for the check mailman api service on the date in question:

https://nagios.fedoraproject.org/nagios/cgi-bin//notifications.cgi?host=mailman01.phx2.fedoraproject.org&service=check+mailman+api&type=0&archive=73

Looking at a more recent date, there were notifications sent for downtime (via fedmsg and ircbot):

https://nagios.fedoraproject.org/nagios/cgi-bin//notifications.cgi?host=mailman01.phx2.fedoraproject.org&service=check+mailman+api&type=0&archive=6

It should be possible to check the nagios logs and/or the web interface for more details on whether the check mailman api service failed the test at all on 2018-03-01. If it did not, then the check is in need of improvement. If it did, then the problem is why notifications were not sent, or were not sent to the right places/people.
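The reasoning above can be sketched as a small log filter. This is only an illustration: it assumes the usual Nagios Core log layout (`[epoch] SERVICE ALERT: host;service;state;...` and `[epoch] SERVICE NOTIFICATION: contact;host;service;...`), and the sample lines, host name and service name are made up, not taken from noc01.

```python
import re
from datetime import date, datetime, timezone

# Assumed log layout: "[epoch] KIND: semicolon-separated fields"
LINE_RE = re.compile(r"^\[(\d+)\] (SERVICE ALERT|SERVICE NOTIFICATION): (.*)$")


def summarize(lines, host, service, day):
    """Count alerts and notifications for one host/service on one UTC day."""
    counts = {"SERVICE ALERT": 0, "SERVICE NOTIFICATION": 0}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp, kind, rest = match.groups()
        if datetime.fromtimestamp(int(stamp), tz=timezone.utc).date() != day:
            continue
        fields = rest.split(";")
        # Notification lines carry the contact name before host and service.
        offset = 1 if kind == "SERVICE NOTIFICATION" else 0
        if fields[offset:offset + 2] == [host, service]:
            counts[kind] += 1
    return counts


def diagnose(counts):
    """Turn the counts into one of the three conclusions discussed above."""
    if counts["SERVICE ALERT"] == 0:
        return "check never failed: the check needs improvement"
    if counts["SERVICE NOTIFICATION"] == 0:
        return "check failed but nothing was sent: investigate notifications"
    return "check failed and notifications were sent: verify the recipients"


# Made-up sample lines for illustration only, not real log data.
SAMPLE = [
    "[1519866504] SERVICE ALERT: host.example.org;example check;CRITICAL;HARD;3;example output",
    "[1519866504] SERVICE ALERT: other.example.org;example check;CRITICAL;HARD;3;example output",
]
counts = summarize(SAMPLE, "host.example.org", "example check", date(2018, 3, 1))
print(counts, "->", diagnose(counts))
```

Against a real log you would pass the open log file as `lines` along with the actual host and service description; the log file location on the nagios server is not stated in this ticket.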

Our mailman servers have two http services listening on different ports - mailman3.service listens on port 8001, while httpd.service listens on port 80. The former is monitored with check_mailman_api mentioned above, while the latter is not monitored by nagios. This issue is about adding monitoring for httpd on port 80.
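To make the distinction concrete, here is a minimal sketch that probes each port separately. It only tests whether something accepts TCP connections, which is much weaker than what a real nagios check does; the function names are hypothetical and the port numbers are the ones given above.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def probe_services(host, services):
    """Probe every named service port on the host independently."""
    return {name: port_is_open(host, port) for name, port in services.items()}


# Ports as described above: httpd on 80, mailman3 on 8001.
MAILMAN_SERVICES = {"httpd.service": 80, "mailman3.service": 8001}
```

Calling `probe_services("mailman01.phx2.fedoraproject.org", MAILMAN_SERVICES)` from a host that can reach it would report the two services separately, which shows why a healthy port 8001 says nothing about port 80.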

So then wouldn't my nagios updater work? We'd just have to insert the correct changes into the .txt files for it to append the correct changes. Maybe we might have to change the script to append to different files, but wouldn't my program still fix this?

I can create a Version 3, which appends to the httpd config as well if we need that. I'm just trying to automate the process using a bash script.

@apmarek, using ansible is how things are automated in the fedora infrastructure. It's quite powerful and more easily managed than running individual scripts to add/change configuration.

Based on what @kevin and @mizdebsk said, if you clone the infrastructure ansible repo and take a look at roles/nagios_server/templates/nagios/services/websites.cfg.j2 you'll see a number of existing website checks. Adding one there for lists.fedoraproject.org should be all that's needed.

I think a patch to add that will look something like this:

diff --git i/roles/nagios_server/templates/nagios/services/websites.cfg.j2 w/roles/nagios_server/templates/nagios/services/websites.cfg.j2
index 04151e0ec..f37839e08 100644
--- i/roles/nagios_server/templates/nagios/services/websites.cfg.j2
+++ w/roles/nagios_server/templates/nagios/services/websites.cfg.j2
@@ -146,6 +146,13 @@ define service {
   check_command         check_website!communityblog.fedorainfracloud.org!/!Fedora Community Blog
 }

+define service {
+  host_name             lists.fedoraproject.org
+  service_description   http-lists.fedoraproject.org
+  check_command         check_website_ssl!lists.fedoraproject.org!/archives/!Fedora Mailing-Lists
+  use                   websitetemplate
+}
+
 {% if vars['nagios_location'] == 'internal' %}
 ##
 ## Other Frontend Websites 

I'm not familiar enough with the infra nagios config to know whether this check should be inside of the if vars['nagios_location'] == 'internal' conditional, so you'll want to research a little more or ask for some clarification on that.

Of course, you should defer to @kevin and @mizdebsk on this, as I'm just an interested observer. I could always be wrong (as I was regarding the existing check mailman api service check). :)

Hopefully this is helpful in getting you going toward a patch for this issue.

@tmz Your patch looks good, but it's only half of the solution I would like to see. The check verifies whether the website at https://lists.fedoraproject.org is externally available and contains a particular string. The check can be run on both the internal (noc01) and external (noc02) nagios instances, so it is correctly placed outside of the "if internal" condition. By its nature, this check depends on more than just the http service running on the mailman servers; for example, it could fail due to problems with our proxies, an expired SSL certificate, etc. Because of that, in addition to the proposed "external" check it would be good to also have a more specific "internal" one that would check for the same string on http://mailman01.phx2.fedoraproject.org.
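As a rough illustration of what "available with a particular string in it" means, here is a sketch of the logic. It is not the check_website command or the plugin behind it, and it does not reproduce the real plugin's handling of status codes; `evaluate` and `check_page` are hypothetical names. The exit codes are the standard Nagios plugin ones.

```python
import urllib.error
import urllib.request

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate(status, body, expected):
    """Map an HTTP status and page body to a (Nagios exit code, message) pair."""
    if status is None:
        return CRITICAL, "no HTTP response"
    if status >= 400:
        return CRITICAL, f"HTTP {status}"
    if expected not in body:
        return CRITICAL, f"HTTP {status} but string {expected!r} not found"
    return OK, f"HTTP {status}, string {expected!r} found"


def check_page(url, expected, timeout=10.0):
    """Fetch url and evaluate it; connection and TLS failures are CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return evaluate(resp.status, body, expected)
    except urllib.error.HTTPError as err:
        return evaluate(err.code, "", expected)
    except OSError as err:  # URLError and socket timeouts are OSError subclasses
        return CRITICAL, f"request failed: {err}"
```

The external check would correspond to `check_page("https://lists.fedoraproject.org/archives/", "Fedora Mailing-Lists")` and the internal one to the same call against `http://mailman01.phx2.fedoraproject.org/archives/`. Only the first can fail because of proxies or certificates, which is the point of having both.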

Good points on adding an internal check as well @mizdebsk. @apmarek, do you want to take that and build a finished patch from this? The change is relatively straight-forward.

The check I included above would be copied and placed inside the if vars['nagios_location'] == 'internal' condition of the websites.cfg.j2 template. The host_name field and the host parameter of the check command would be changed (to mailman01 and mailman01.phx2.fedoraproject.org, respectively -- though both could use the fully-qualified name, the other entries are all hostgroups and use a short name, so mailman01 seems more consistent there). Lastly, the check command would drop the _ssl suffix. All together the internal http check looks something like this:

define service {
  host_name             mailman01
  service_description   http-mailman01.phx2.fedoraproject.org-internal
  check_command         check_website!mailman01.phx2.fedoraproject.org!/archives/!Fedora Mailing-Lists
  use                   websitetemplate
}
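For anyone who finds it easier to follow as code, the changes described above can be sketched as a transformation of the external definition. This is purely illustrative (the real change is a hand edit of the template); `render_service` and `internal_variant` are hypothetical helpers, and the service_description naming simply follows the example above.

```python
def render_service(fields):
    """Render a dict as a Nagios 'define service' block."""
    width = max(len(key) for key in fields) + 3
    lines = [f"  {key.ljust(width)}{value}" for key, value in fields.items()]
    return "define service {\n" + "\n".join(lines) + "\n}"


def internal_variant(external, short_host, fqdn):
    """Derive the internal service dict from the external one."""
    command, _old_host, path, expected = external["check_command"].split("!")
    internal = dict(external)
    internal["host_name"] = short_host
    internal["service_description"] = f"http-{fqdn}-internal"
    # Drop the _ssl suffix and point the check at the backend host.
    plain = command[: -len("_ssl")] if command.endswith("_ssl") else command
    internal["check_command"] = "!".join([plain, fqdn, path, expected])
    return internal


external = {
    "host_name": "lists.fedoraproject.org",
    "service_description": "http-lists.fedoraproject.org",
    "check_command": "check_website_ssl!lists.fedoraproject.org!/archives/!Fedora Mailing-Lists",
    "use": "websitetemplate",
}
internal = internal_variant(external, "mailman01", "mailman01.phx2.fedoraproject.org")
print(render_service(internal))
```

The printed block should carry the same fields as the internal definition shown above.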

I'm happy to attach a patch doing all this, but I don't want to prevent @apmarek from getting the chance to submit a patch and learn a little more about the infrastructure process. This ticket has been open for a few months, so a few more days seems worth waiting if it can be used to help an apprentice gain some experience. :-)

One thing I'm not clear on is the status of mailman02. It exists in the mailman inventory group (and thus in the nagios mailman hostgroup), but it doesn't seem to be fully configured or functional -- as far as http I mean, the host is available via ssh. If that's not intentional, then we should use the hostgroup_name mailman in the internal check and know that when this check is added that mailman02 will fail. I looked in the git history and didn't find a definitive answer on the intended status of mailman02.

As always, comments or corrections to anything above are most welcome.

I'll work on it, but since I'm relatively new to nagios and these apps (I'm about to take my RHCSA) I may get stuck a bit, so I'll see you guys on IRC.

What kind of check_command should I issue? I know that some of the status checking functions in this config check the SSL cert or whether a website is up or down, but what about the mail server?

@apmarek, the check_commands are already in place (check_website and check_website_ssl), so you only have to add the service definitions. The diff in this comment from yesterday shows the service for checking externally. And this comment has a service definition for checking internally, as @mizdebsk suggested.

Here's an example of how that might look when it's all put together: 5bdb7d91d ("nagios: add httpd monitoring for mailman", 2018-05-13). Take a look at that and see if it makes sense to you. (You can use the view button at the upper right of the patch to view the full file in context, to make it clearer where the service definitions are placed.)

I commented where I made my changes. Is this what you were looking for? EDIT: whoops, forgot the external checking.

Looking at the attached websites.cfg.j2, I only see one new service instead of two. The service is checking the internal mailman01 host, but it's outside of the if vars['nagios_location'] == 'internal' condition. That doesn't quite test what we want externally. (It would happen to work on the external nagios server, noc02, because mailman02 can be reached via the VPN from noc02.)

That internal service check should be moved into the internal block. Another service, using lists.fedoraproject.org and the check_website_ssl check_command as I have here (and in this comment), should be added for the external check. Feel free to copy the websites.cfg.j2 file from my repository unless you notice that it is incomplete or incorrect. :)

It's also easier to see the suggested changes if you use git to generate a diff rather than attaching the full file. That makes adding a comment noting where your suggested changes are unnecessary as well. In your git clone of the fedora ansible repo, you can make your changes and then run git diff >/tmp/nagios-mailman-http-check.patch and attach that.

Ideally, once you've got the changes done, you'll commit them locally so you can explain what's being changed in the commit message. Then you would use git format-patch -1 to generate a patch file (the patch will be output to the current directory with a name like 0001-nagios-add-httpd-monitoring-for-mailman.patch). The commit message I would write goes something like this:

nagios: add httpd monitoring for mailman

Ensure we are notified if the mailman http service is unavailable.  Test
both internal (mailman01.phx2) and external (lists.fpo) addresses.  The
internal check avoids issues caused by proxies, expired TLS certificates
and other external factors.

Resolves: https://pagure.io/fedora-infrastructure/issue/6756
Helped-by: Mikolaj Izdebski <mizdebsk@redhat.com>
Helped-by: Kevin Fenzi <kevin@scrye.com>
Signed-off-by: Todd Zullinger <tmz@pobox.com>

Obviously, you would replace my address in the Signed-off-by field with your own (and perhaps add a Helped-by: for me). You might also rewrite the commit message in your own words if you would explain the changes differently.

I'm not sure that a detailed commit message is required for a patch here, but it's always a good habit to be in. Many upstream projects insist on patches having commit messages which fully explain why the change is being made.

Apologies for being so verbose or if I'm telling you something about git which you already know. I'd rather err on the side of too much detail than too little. Hopefully this is helpful in explaining more about the process of generating and submitting patches.

crap, forgot to write my second change... :D I'm fixing it now

Metadata Update from @mizdebsk:
- Issue tagged with: monitoring

7 years ago

@apmarek, are you still planning to submit a patch wrapping up the changes discussed? If not, perhaps this patch can be applied to resolve this ticket:

0001-nagios-add-httpd-monitoring-for-mailman.patch

I believe this addresses all the feedback and suggestions so far. If it doesn't, let me know and I can adjust it. I may still have push access and could push it directly, but I'm not sure about that as I haven't been very active in some time.

It seems like applying the patch I attached with my previous comment is a reasonable solution at this point (assuming it does indeed address all the comments which were brought up in this ticket). If that looks good I can push it to ansible directly and save someone the time of downloading and applying it with git am.

looks good to me. My apologies for the delay in answering earlier queries.

Thanks @smooge, I pushed that to the ansible git repo. And no worries, the goal was to save you and the infrastructure team a little time for the many other tasks in motion. :)

This was done a while ago... closing now.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 years ago
