#11577 Add Zabbix monitoring to releng compose machines
Closed: Fixed 5 days ago by dkirwan. Opened 4 months ago by humaton.

The following list of machines should have the Zabbix agent installed.
Apart from basic system resource monitoring, we would like to monitor the health of the nightly cron jobs.

machines:
[releng_compose]
compose-x86-01.iad2.fedoraproject.org
compose-branched01.iad2.fedoraproject.org
compose-rawhide01.iad2.fedoraproject.org
compose-iot01.iad2.fedoraproject.org

[releng_compose_stg]
compose-x86-01.stg.iad2.fedoraproject.org
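
For illustration only: one common way to monitor the health of a nightly cron job is to have the job record a timestamp when it finishes successfully, and to alert when that timestamp gets too old. The sketch below shows just the staleness check. The function name and the 26-hour threshold are assumptions made for this example, not what was deployed on these hosts.

```python
import time
from typing import Optional

# Hypothetical threshold: a nightly job should have succeeded within ~26 hours
# (24 hours plus some slack for a slow compose).
MAX_AGE_SECONDS = 26 * 60 * 60


def cron_job_healthy(last_success: float,
                     now: Optional[float] = None,
                     max_age: float = MAX_AGE_SECONDS) -> bool:
    """Return True if the last successful run is no older than max_age seconds.

    last_success and now are Unix timestamps; now defaults to the current time.
    """
    if now is None:
        now = time.time()
    return (now - last_success) <= max_age
```

A monitoring item could then call something like `cron_job_healthy(last_success_timestamp)` and raise a problem when it returns False.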


Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: high-gain, medium-trouble, monitoring, ops

4 months ago

Working on a SOP to document how this ticket is tackled and ultimately resolved; it's a work in progress. Hoping that one of the releng guys will actually do the work once the SOP is completed:

https://hackmd.io/30ueDBr1SR6zewqGjKF0LQ

Fingers crossed this SOP will then be a good reference for anyone else wanting to get involved with adding Zabbix monitoring elsewhere in Fedora Infra.

Metadata Update from @dkirwan:
- Issue priority set to: Needs Review (was: Waiting on Assignee)

3 months ago

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

3 months ago

Metadata Update from @patrikp:
- Issue assigned to patrikp

3 months ago

Metadata Update from @patrikp:
- Issue priority set to: None (was: Waiting on Assignee)

3 months ago

Metadata Update from @dkirwan:
- Issue priority set to: Waiting on Assignee

3 months ago

Opened tickets with RH IT to open up the Zabbix server and agent ports between the networks.

Server can currently connect to agent ports on the releng hosts, but releng hosts are timing out when connecting back to the server. Still troubleshooting.
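
A plain TCP connect test is a quick way to check each direction independently while troubleshooting this kind of one-way timeout. The sketch below assumes the Zabbix default ports (10050 for the agent, 10051 for the server); if the deployment overrides them, the constants need adjusting. The helper name is made up for this example.

```python
import socket

# Zabbix default ports; assumed here, adjust if the deployment overrides them.
AGENT_PORT = 10050   # server -> agent direction
SERVER_PORT = 10051  # agent -> server direction


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures alike.
        return False
```

Run from a releng host, `can_connect("<zabbix server hostname>", SERVER_PORT)` returning False while the server can still reach `AGENT_PORT` on that host would match the symptom described above and point at the firewall between the networks.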

OK, the RH firewall has been opened up to allow Zabbix communications, and the agent is installed on the releng hosts, so I can start working on this SOP again.

Awesome. :)

So, a few other things:

  • Can you nuke that vnet* rule? I see those still alerting... and as we figured in staging, due to whatever quirk, bridges show up as 10MB connections so it alerts on them all the time.

  • All the iad2 hosts are in, but we still have all the non-iad2 ones. ;) There are several ways we could address those, so we should discuss options. We could just connect to them over our VPN (all the external hosts should be on the VPN, I think). We would need to add a VPN endpoint on the zabbx01 VM and then add all the other ones with '$name.vpn.fedoraproject.org'. Or I think Zabbix has proxy / spoke things? We could look at setting up some new VMs in each external DC and have those monitor the local machines and phone back to the hub. That sounds like it would be more work to me, but I'm not sure how those work entirely.

  • I think @ryanlerch found some Matrix integration for Zabbix. We should look at that and what it would take to set it up.

  • And there's likely a bunch of old one-off special Nagios checks we need to consider. I am not sure how best to approach that. Perhaps we just make a list of all hosts, then go through them slowly over time, one by one, and see if there are any 'non-standard' checks? Or we could try to untangle those out of the Nagios Ansible stuff. Ideas welcome on that. ;)

Thanks for all the work on this...

Oh, and thinking about it more, the external hosts we probably should just connect to directly. If we use the VPN, then all monitoring on all of those will depend on the VPN being up and working. Currently Nagios connects to them directly, I am pretty sure. So, that's just adding the '$name.fedoraproject.org' hosts that Nagios has (all the external ones not in iad2).

Completed work on this, and developed a SOP showing what was required: https://docs.fedoraproject.org/en-US/infra/sysadmin_guide/sop_add_zabbix_monitoring_to_releng_compose/

The remaining work is to update the SOP with what was learned during implementation, but there is now Zabbix monitoring in place on the cron jobs running on the releng_compose machines.

Metadata Update from @dkirwan:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

5 days ago

Great!

Small item I happened to note... you have a 'rm /tmp/fedora-compose-rawhide' at the end, but default is '-i' so it doesn't do anything... need a -f there?


Metadata
Boards 1
ops Status: Backlog