httpd.service on the resultsdb01.qa.fedoraproject.org host crashed and was down for more than 37 hours, yet we didn't get any alert about it. Monitoring of the service should be added to prevent such a long outage from happening in the future.
Jan 01 01:43:23 resultsdb01.qa.fedoraproject.org systemd[1]: httpd.service: A process of this unit has been killed by the OOM killer.
Jan 01 01:43:58 resultsdb01.qa.fedoraproject.org systemd[1]: httpd.service: Failed with result 'oom-kill'.
Jan 01 01:43:58 resultsdb01.qa.fedoraproject.org systemd[1]: httpd.service: Consumed 1h 24min 33.687s CPU time.
Jan 02 14:51:45 resultsdb01.qa.fedoraproject.org systemd[1]: Starting The Apache HTTP Server...
Jan 02 14:51:45 resultsdb01.qa.fedoraproject.org httpd[812580]: [Thu Jan 02 14:51:45.939277 2020] [env:warn] [pid 812580:tid 140307113316672] AH01506: PassEnv variable HOSTNAME was undefined
Jan 02 14:51:46 resultsdb01.qa.fedoraproject.org httpd[812580]: Server configured, listening on: port 80
Jan 02 14:51:46 resultsdb01.qa.fedoraproject.org systemd[1]: Started The Apache HTTP Server.
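For reference, the "more than 37 hours" figure can be checked from the two journal timestamps above. This is only a small sketch: the journal lines carry no year, so 2020 is assumed from the httpd log entry, and `parse_journal_ts` is a hypothetical helper name.

```python
from datetime import datetime


def parse_journal_ts(stamp, year=2020):
    """Parse a journal-style 'Mon DD HH:MM:SS' stamp; the year is an assumption."""
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")


# Unit entered the failed state, then was started again (taken from the log above).
failed = parse_journal_ts("Jan 01 01:43:58")
started = parse_journal_ts("Jan 02 14:51:46")

outage = started - failed
outage_hours = outage.total_seconds() / 3600

print(outage)                  # 1 day, 13:07:48
print(round(outage_hours, 2))  # 37.13
```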
Would something akin to this be sufficient? I’m going by dedf9486721d28637c77f9bf27bd59470c8ebeca.
Attachment: 0001-nagios-Add-httpd-monitoring-for-resultsdb01.patch (/fedora-infrastructure/issue/raw/files/f19a41cdaa762d5aec297fedfe1d5e9bc977b65916848bca8b7434af01f5d512-0001-nagios-Add-httpd-monitoring-for-resultsdb01.patch)
Has this issue been reviewed?
Oops. I totally missed the update on this one...
That looks like it should be ok, but the hostname has changed, and so much of the nagios setup has changed that the patch won't apply.
Can someone rebase it and use the new name (resultsdb01.iad2.fedoraproject.org)?
Attachment: 0002-nagios-Add-httpd-monitoring-for-resultsdb01.patch (/fedora-infrastructure/issue/raw/files/ab0bfa21d453f51db530bdc397ebe2a9e4c9fe58401113fd617544a34022a616-0002-nagios-Add-httpd-monitoring-for-resultsdb01.patch)
I'm not yet familiar with the naming scheme here, so I hope it's ok. I've removed the SSL bit, as it seems this service is HTTP-only right now. Is that correct?
Yep, that looks great. :)
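To illustrate the kind of plain-HTTP probe discussed above, here is a minimal sketch of a Nagios-style plugin. It is not the content of the attached patches, which are not reproduced here. The function names `probe`, `classify`, and `main` and the warning/critical thresholds are assumptions made for illustration; the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
import http.client
import sys
import urllib.error
import urllib.request

# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def probe(url, timeout=10.0):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, DNS failure, malformed reply, etc.
        return None


def classify(status):
    """Map a status code (or None) to (exit_code, message). Thresholds are assumed."""
    if status is None:
        return CRITICAL, "HTTP CRITICAL - no response"
    if status >= 500:
        return CRITICAL, f"HTTP CRITICAL - status {status}"
    if status >= 400:
        return WARNING, f"HTTP WARNING - status {status}"
    return OK, f"HTTP OK - status {status}"


def main(argv):
    # Plain HTTP only, matching the note above that the service has no SSL.
    url = argv[1] if len(argv) > 1 else "http://resultsdb01.iad2.fedoraproject.org/"
    code, message = classify(probe(url))
    print(message)
    return code

# To run as a standalone plugin, finish the file with: sys.exit(main(sys.argv))
```

A dead httpd, as in the OOM-kill case from the log, would show up here as a refused connection, which `probe` turns into `None` and `classify` reports as CRITICAL.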
Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)