#9461 Add monitoring to services on pagure02
Closed: Fixed 2 years ago by kevin. Opened 3 years ago by pingou.

Describe what you would like us to do:


We need to add nagios monitoring to a few pagure services on pagure02, such as:

  • pagure_ci
  • pagure_ev
  • pagure_fast_worker
  • pagure_loadjson
  • pagure_logcom
  • pagure_medium_worker
  • pagure_milter
  • pagure_mirror
  • pagure_slow_worker
  • pagure_webhook
  • pagure_worker
  • pagure_mirror_project_in.timer

Ideally, nagios should just try to restart a service once or twice if it fails.
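
One way to get the "restart once or twice" behaviour is a Nagios event handler. The sketch below is only an illustration: `restart_on_failure` is a hypothetical name, the arguments are assumed to be passed in from the standard `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros, and it assumes the handler runs somewhere that is allowed to restart the unit (for example on pagure02 itself, via NRPE).

```shell
# Hypothetical event handler body (a sketch, not the deployed handler).
# Arguments: service state, state type, check attempt number, systemd unit.
restart_on_failure() {
    state=$1
    state_type=$2
    attempt=$3
    unit=$4
    max_restarts=2   # assumption: "once or twice" means at most two tries

    # Only act while the service is failing.
    [ "$state" = "CRITICAL" ] || return 0

    # Restart during the first SOFT retries; once the state goes HARD,
    # leave the unit alone so a human gets the alert.
    if [ "$state_type" = "SOFT" ] && [ "$attempt" -le "$max_restarts" ]; then
        systemctl restart "$unit"
    fi
}
```

Restarting only on SOFT states means a service that keeps failing still ends up as a normal HARD alert instead of being restarted forever.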

When do you need this to be done by? (YYYY/MM/DD)


Whenever possible


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: high-gain, medium-trouble, monitoring, ops

3 years ago

I will try to get to this during the week!

@pingou PR opened, but not complete yet: https://pagure.io/fedora-infra/ansible/pull-request/343
I need your feedback to finish the work!

Pingou and others are out until early January. I would not expect any reviews until then.

So, that PR was merged... but it's incomplete.

I don't see anywhere that it installs the check script on the pagure nagios_client side?

Can you add a new PR to do that @seddik ? :)

I will take a look :)

I think this is largely a duplicate of ticket #6441

So, I am going to close that one in favor of this one.

We need to install the monitor script and also monitor pkgs01. :)

New PR added to update the pagure config.

The systemd units for pkgs are:
pagure_api_key_expire_mail.service
pagure_ev.service
pagure_logcom.service
pagure_mirror_project_in.service
pagure_webhook.service
pagure_worker.service

PR in progress...

OK, I pushed out the pagure02 fix. There was a debugging line that I removed... but it still doesn't seem to work right. Maybe the output isn't exactly what nagios expects?

/usr/lib64/nagios/plugins/check_systemd_units

OK - Systemd units are active

But other plugins produce output like this:

/usr/lib64/nagios/plugins/check_procs

PROCS OK: 574 processes | procs=574;;;0;

ok, so I looked at this a tiny bit more.

The nrpe call from noc01 is hanging:

/usr/lib64/nagios/plugins/check_nrpe -H pagure02.fedoraproject.org -c check_systemd_units
CHECK_NRPE STATE CRITICAL: Socket timeout after 10 seconds.

I think this is happening because the script is using systemctl status, and that's not completing because it shows journal entries with a pager?

When the nrpe command is running I see on pagure02:

nrpe 381200 0.0 0.0 9772 1836 ? S 00:04 0:00 bash /usr/lib64/nagios/plugins/check_systemd_units
nrpe 381201 0.0 0.0 83740 4592 ? S 00:04 0:00 systemctl status pagure_ci
nrpe 381202 0.0 0.0 9184 1040 ? S 00:04 0:00 grep -E Active:
nrpe 381203 0.0 0.0 22448 1644 ? S 00:04 0:00 awk { print $2 }
root 381237 0.0 0.0 12112 1108 pts/0 R+ 00:04 0:00 grep --color=auto nrpe

So, perhaps switch to using 'systemctl show' which is supposed to be machine parsable?
Something like:
systemctl show pagure_ci --property=ActiveState
will just output:
ActiveState=active
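
A minimal sketch of that idea, assuming the check script stays a shell script. `unit_state` is a hypothetical helper name, not something taken from the existing plugin; `--no-pager` is added as an extra precaution against the hang seen above.

```shell
# Print the ActiveState of one systemd unit, e.g. "active" or "failed".
# "systemctl show" emits KEY=VALUE lines and does not print journal
# entries the way "systemctl status" does.
unit_state() {
    systemctl show "$1" --property=ActiveState --no-pager 2>/dev/null \
        | sed 's/^ActiveState=//'
}
```

For example, `unit_state pagure_ci` should print just the state word, which is easier to compare against than the grep/awk pipeline over `systemctl status`.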

I'm not sure about the format of the output... but https://nagios-plugins.org/doc/guidelines.html#PLUGOUTPUT might be worth following?
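
As a rough illustration of what following those guidelines could look like: one status line, optional performance data after a `|`, and the usual return codes (0 = OK, 2 = CRITICAL, 3 = UNKNOWN). The function name `check_units`, the `SYSTEMD` prefix and the `failed` perfdata label are assumptions made for this sketch, not what the deployed plugin uses.

```shell
# Sketch: check every unit passed as an argument and report Nagios-style.
check_units() {
    if [ "$#" -eq 0 ]; then
        echo "SYSTEMD UNKNOWN: no units given"
        return 3
    fi

    failed=""
    total=0
    bad=0
    for unit in "$@"; do
        total=$((total + 1))
        # Machine-readable query; strip the "ActiveState=" prefix.
        state=$(systemctl show "$unit" --property=ActiveState --no-pager 2>/dev/null)
        state=${state#ActiveState=}
        if [ "$state" != "active" ]; then
            bad=$((bad + 1))
            failed="$failed $unit($state)"
        fi
    done

    if [ "$bad" -gt 0 ]; then
        echo "SYSTEMD CRITICAL: $bad of $total units not active:$failed | failed=$bad;;;0;$total"
        return 2
    fi
    echo "SYSTEMD OK: $total units active | failed=0;;;0;$total"
    return 0
}
```

A wrapper script would call something like `check_units pagure_ci pagure_ev pagure_worker` and exit with its return code, so nagios sees both the text and the status.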

Thanks!

@seddik will you have time to look into this? Or should I ask someone else to give it a try?
No shame if you are busy...

Hello,

I will take a look this week.
Sorry for the delay.

This is fixed now. :) Thanks!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago
