We need to add nagios monitoring to a few pagure services on pagure02, such as:
Ideally, nagios should just try to restart a service once or twice if it fails.
Whenever possible
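One way to get the "restart once or twice" behaviour would be a nagios event handler. The following is only a rough sketch under assumptions: `handle_service_event` is a made-up name, the arguments are assumed to be the standard `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros plus a unit name, and the `systemctl restart` call stands in for whatever mechanism would actually reach the pagure host (the handler itself runs on the nagios server).

```shell
# Hypothetical event handler logic; not taken from the real ansible repo.
# Arguments: <state> <state type> <attempt number> <systemd unit>
handle_service_event() {
    state=$1
    state_type=$2
    attempt=$3
    unit=$4
    # Only act while the problem is still SOFT, and only on the first
    # two check attempts, so a service is restarted at most twice.
    if [ "$state" = "CRITICAL" ] && [ "$state_type" = "SOFT" ] && [ "$attempt" -le 2 ]; then
        echo "restarting $unit (attempt $attempt)"
        # Placeholder: assumes the unit can be restarted from here.
        systemctl restart "$unit"
    fi
}
```

With this logic nothing happens once the state goes HARD, so a service that keeps failing would still alert normally.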
Metadata Update from @zlopez: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: high-gain, medium-trouble, monitoring, ops
I will try to work on this during the week!
@pingou PR opened, but not completed => https://pagure.io/fedora-infra/ansible/pull-request/343 I need your feedback to finish the work!
Pingou and others are out until early January. I would not expect any reviews until then.
So, that PR was merged... but it's incomplete.
Nothing in it installs the check script on the pagure nagios_client side?
Can you add a new PR to do that @seddik ? :)
i will take a look :)
I think this is largely a duplicate of ticket #6441
So, I am going to close that one in favor of this one.
We need to install the monitor script and also monitor pkgs01. :)
new PR added to update pagure config
The systemd units for pkgs are:
pagure_api_key_expire_mail.service
pagure_ev.service
pagure_logcom.service
pagure_mirror_project_in.service
pagure_webhook.service
pagure_worker.service
PR in progress ....
ok. I pushed out the pagure02 fix. There was a debugging line I removed... but it doesn't seem to work right still. Might be the output isn't exactly what nagios expects?
OK - Systemd units are active
But other plugins use:
PROCS OK: 574 processes | procs=574;;;0;
ok, so I looked at this a tiny bit more.
The nrpe call from noc01 is hanging:
/usr/lib64/nagios/plugins/check_nrpe -H pagure02.fedoraproject.org -c check_systemd_units
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.
I think this is happening because the script is using systemctl status and that's not completing because it shows journal entries with a pager?
When the nrpe command is running I see on pagure02:
nrpe 381200 0.0 0.0 9772 1836 ? S 00:04 0:00 bash /usr/lib64/nagios/plugins/check_systemd_units
nrpe 381201 0.0 0.0 83740 4592 ? S 00:04 0:00 systemctl status pagure_ci
nrpe 381202 0.0 0.0 9184 1040 ? S 00:04 0:00 grep -E Active:
nrpe 381203 0.0 0.0 22448 1644 ? S 00:04 0:00 awk { print $2 }
root 381237 0.0 0.0 12112 1108 pts/0 R+ 00:04 0:00 grep --color=auto nrpe
So, perhaps switch to using 'systemctl show', which is supposed to be machine parsable? Something like:
systemctl show pagure_ci --property=ActiveState
will just output:
ActiveState=active
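As a rough sketch of that idea (the helper name `unit_state` is made up here, and the real check script may be structured differently), the status/grep/awk pipeline could be replaced by a single `systemctl show` call plus a parameter expansion:

```shell
# Hypothetical helper: print just the ActiveState value of one unit.
unit_state() {
    # --property limits the output to one line such as "ActiveState=active",
    # so there is no journal excerpt to page through.
    line=$(systemctl show "$1" --property=ActiveState)
    # Strip everything up to and including the first "=".
    printf '%s\n' "${line#*=}"
}
```

Calling `unit_state pagure_ci` on a host where that unit is running should print `active`.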
I'm not sure what format the output should have... but https://nagios-plugins.org/doc/guidelines.html#PLUGOUTPUT might be worth following?
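For illustration, here is a minimal sketch of a check that follows those guidelines: one status line, optional performance data after a `|`, and the exit code carrying the state (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The function name `check_units`, the `SYSTEMD` prefix and the `inactive` perfdata label are assumptions, not what the deployed script uses.

```shell
# Hypothetical check: the units to inspect are passed as arguments.
check_units() {
    bad=0
    details=""
    for unit in "$@"; do
        # Machine-readable query; yields e.g. "ActiveState=active".
        state=$(systemctl show "$unit" --property=ActiveState)
        state=${state#*=}
        if [ "$state" != "active" ]; then
            bad=$((bad + 1))
            details="$details $unit=$state"
        fi
    done
    if [ "$bad" -gt 0 ]; then
        # Status text, then "|" and perfdata, as other plugins print it.
        echo "SYSTEMD CRITICAL: $bad of $# units not active:$details | inactive=$bad;;;0;"
        return 2
    fi
    echo "SYSTEMD OK: $# units active | inactive=0;;;0;"
    return 0
}

# Example use at the end of a plugin script:
#   check_units pagure_ev pagure_worker; exit $?
```

Treating every non-active state as CRITICAL is a simplification; states such as `activating` might deserve a WARNING instead.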
Thanks!
@seddik will you have time to look into this? Or should I ask someone else to give it a try? No shame if you are busy...
Hello
I will take a look this week. Sorry for the delay.
On Mon, May 17, 2021, 21:37 Kevin Fenzi pagure@pagure.io wrote:
kevin added a new comment to an issue you are following: `` @seddik will you have time to look into this? Or should I ask someone else to give it a try? No shame if you are busy... `` To reply, visit the link below or just reply to this email https://pagure.io/fedora-infrastructure/issue/9461
This is fixed now. :) Thanks!
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)