#11393 Replace Nagios with Zabbix in Fedora Infrastructure
Opened 10 months ago by zlopez. Modified a month ago

Describe what you would like us to do:


Currently Fedora Infrastructure uses Nagios for monitoring services. We want to switch to Zabbix because it's better maintained than Nagios.

When do you need this to be done by? (YYYY/MM/DD)


No rush


Metadata Update from @zlopez:
- Issue assigned to dkirwan

10 months ago

Ansible role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/zabbix/zabbix_server
Server deployed in staging for testing: https://zabbix.stg.fedoraproject.org/

Currently testing with the wiki01.stg.iad2.fedoraproject.org host, updating and debugging the zabbix_agentd.conf to get it working with the deployed server. Hoping to get this config added to ansible role: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/zabbix/zabbix_agent

@darknao and @aheath1992 had been looking at zabbix in the past. I think they are already talking with @dkirwan but wanted to mention them in the ticket.

Also :100:

@dkirwan we cannot connect with a FAS account?

Not yet! @seddik I'll look at getting that configured with accounts.stg asap though!

Metadata Update from @dkirwan:
- Issue unmarked as blocking: #11245

9 months ago

Screenshot_from_2023-07-04_10-17-36.png

Debugged the zabbix_agentd configuration; clients will now auto register and begin sharing system information with the server. Will get this into the zabbix_agent role.

Metadata Update from @dkirwan:
- Issue marked as blocking: #11245

9 months ago

Updated zabbix_agent role with the latest config. It's configured to auto enrol instances.

Currently only configured on the following hosts in staging:

playbooks/groups/oci-registry.yml
playbooks/groups/wiki.yml
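
For anyone following along, here is a rough sketch of the kind of zabbix_agentd.conf settings that active-agent auto-registration relies on. `Server`, `ServerActive`, `Hostname` and `HostMetadata` are real agent parameters, but the values below (server address, metadata string) are assumptions for illustration only, not what the zabbix_agent role actually ships.

```python
def render_agent_conf(server: str, hostname: str, metadata: str) -> str:
    """Render the zabbix_agentd.conf lines relevant to auto-registration.

    All values passed in are illustrative; the real role may template
    different parameters or values.
    """
    params = {
        "Server": server,          # server allowed to query the agent (passive checks)
        "ServerActive": server,    # server the agent contacts itself (active checks)
        "Hostname": hostname,      # name the agent registers under
        "HostMetadata": metadata,  # free-form string an auto-registration action can match on
    }
    return "\n".join(f"{key}={value}" for key, value in params.items()) + "\n"


if __name__ == "__main__":
    # Hypothetical server address and metadata; the hostname is the test host from this ticket.
    print(render_agent_conf(
        server="zabbix.stg.fedoraproject.org",
        hostname="wiki01.stg.iad2.fedoraproject.org",
        metadata="fedora-staging",
    ))
```

The agent contacts the `ServerActive` address on its own, which is what lets the server enrol a host it has never seen before; `HostMetadata` is only useful if an auto-registration action on the server is set up to match it.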

@dkirwan I'm available to help work on this. I've been looking for something that will get me up to speed on the FI systems/tools and provide some value. I have extensive experience with monitoring solutions (and Ansible) so would only need guidance on how you'd like to see the rest of the work accomplished, what needs doing, and ensuring I've got the needed access to things. Right now, I only have sysadmin-devel level access. Please let me know if this is of interest to you.

Nice :D So what's blocking me at the moment: I'm trying to figure out how to get Zabbix hooked into our FAS system. It needs to use SAML [1]. I'm currently experimenting before testing out a configuration on the staging ipsilon.

Once this work is complete and we're able to authenticate members, the real work can begin: we need to start migrating everything from Nagios over to Zabbix. A good way to start is getting familiar with the nagios_server and nagios_client roles.

At the moment it might require a lot of deep diving and research: find out how something is monitored currently and how it might best be done in the Zabbix ecosystem. Once we have the answer we can make the change via ansible and test it out in staging.

I can help with the SAML2 part if needed. Just let me know :)

Here is the config on Zabbix side to set up:

IdP entity ID: https://id.fedoraproject.org/saml2/metadata
SSO service URL: https://id.fedoraproject.org/saml2/SSO/Redirect
SLO service URL: leave empty
Username attribute: username
SP entity ID: https://zabbix.stg.fedoraproject.org
SP name ID format: urn:oasis:names:tc:SAML:2.0:nameid-format:transient
Sign: untick all
Encrypt: untick all

SAML2 configuration is complete.
Note that in 6.0, a user must exist in Zabbix before being able to login with FAS.
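
If someone wants to keep these settings reproducible rather than clicking through the UI, the same values could in principle be pushed through the Zabbix JSON-RPC API. This is only a sketch: the setup described above was done in the frontend, and the property names below follow my reading of the 6.0 `authentication.update` method, so please check them against the API reference before relying on it. The function only builds the request body and does not send anything.

```python
import json


def saml_auth_payload(auth_token: str) -> dict:
    """Build a JSON-RPC body mirroring the SAML settings listed in this ticket.

    Property names are assumed from the Zabbix 6.0 authentication object;
    nothing is sent to a server here.
    """
    return {
        "jsonrpc": "2.0",
        "method": "authentication.update",
        "params": {
            "saml_auth_enabled": 1,
            "saml_idp_entityid": "https://id.fedoraproject.org/saml2/metadata",
            "saml_sso_url": "https://id.fedoraproject.org/saml2/SSO/Redirect",
            "saml_slo_url": "",  # SLO service URL: left empty
            "saml_username_attribute": "username",
            "saml_sp_entityid": "https://zabbix.stg.fedoraproject.org",
            "saml_nameid_format": "urn:oasis:names:tc:SAML:2.0:nameid-format:transient",
            # "Sign: untick all" and "Encrypt: untick all"
            "saml_sign_messages": 0,
            "saml_sign_assertions": 0,
            "saml_sign_authn_requests": 0,
            "saml_sign_logout_requests": 0,
            "saml_sign_logout_responses": 0,
            "saml_encrypt_nameid": 0,
            "saml_encrypt_assertions": 0,
        },
        "auth": auth_token,  # API session token of a sufficiently privileged user
        "id": 1,
    }


if __name__ == "__main__":
    # "REPLACE_ME" is a placeholder, not a real token.
    print(json.dumps(saml_auth_payload("REPLACE_ME"), indent=2))
```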

So awesome thanks for that @darknao

Ok regarding the users needing to be created, I've a few ideas in mind how to manage that. Let me write up something on discussion including the report on the current state of Zabbix in staging and then invite feedback from everyone.

Ok, we've rolled out the zabbix agent to the majority of the staging instances. There are a few here and there where we can't easily meet dependencies, e.g. the rhel7 boxes, and some staging instances are not technically staging, or at least not iad2 based.

We've also 2 instances that are showing as inaccessible after the fact so will troubleshoot these quickly before we start writing up the SOPs to cover what we have so far.

All instances showing accessible in staging now, added SOPs related to zabbix to the Fedora Infra sysadmin guide.

Work completed. After code freeze, we should be ready to deploy a server in production, then begin the work of replacing Nagios monitoring service by service.

I would love to help on this, replacing nagios services by zabbix.

Just waiting until F39 full release and end of freeze before getting Zabbix running in production.

Currently debugging some issues showing in the staging instance related to network and disk load on the bvmhost machines in staging.

In order to make it easier for others to contribute, we also need to break the workload down into small pieces, and perhaps have tickets for every service being migrated from nagios.

Using releng as a prototype since it's a green field situation: there is little to no monitoring currently in place. We'll implement Zabbix checks and document how we did it, so it can become a reference for others wanting to get involved and take on some of this work later.

Can follow along in the following ticket: https://pagure.io/fedora-infrastructure/issue/11577

With freeze over, started work on the creation of the Zabbix VMs for production. We've created the VM, now debugging the networking/tls via Apache/Haproxy.

Once completed will deploy Zabbix using our playbook/roles already developed.

Prod instance deployed: https://zabbix.fedoraproject.org

Need to debug some issues with user sync for the sysadmin-noc group. In the meantime Guest access is also enabled if you want to login for a look.

We still cannot connect with a FAS account? Is Guest access allowed for contributors?

@seddik currently members of the group sysadmin-noc have accounts created on the zabbix server with elevated privileges and can then login via FAS, but everyone may login via Guest user.
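
Since accounts have to exist in Zabbix before a FAS login works, the sync boils down to comparing group membership against the users Zabbix already knows about. A minimal sketch of that comparison, assuming both lists have already been fetched somehow (the usernames below are made up, and this is not how the actual sync is implemented):

```python
def users_to_create(group_members, existing_usernames):
    """Return group members with no Zabbix account yet, sorted for stable output."""
    return sorted(set(group_members) - set(existing_usernames))


if __name__ == "__main__":
    # Hypothetical data: members of a FAS group vs. accounts already in Zabbix.
    members = ["alice", "bob", "carol"]
    existing = ["bob", "guest"]
    print(users_to_create(members, existing))
```

Anything this returns would still need to be created on the server with the right role and user group before those members can log in via FAS.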

Will soon have a SOP that contributors can follow along as a reference if they wish to contribute to adding Zabbix monitoring to the various services in Fedora Infra.

Just before Christmas, opened tickets with RH IT to open up Zabbix server and agent ports between the networks.

Managed to get the rabbitmq production hosts to auto enroll with the server. Still debugging some iptables rule issues.

All prod hosts now have agents running, but not all accessible! Few remaining firewall issues to debug, on the releng hosts and ipsilon.
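
When chasing these firewall issues, a quick TCP connect test from either side can tell a blocked port apart from an agent that isn't running. A small sketch, assuming the default Zabbix ports (10050 for the agent, 10051 for the server) are in use; the host in the example is a placeholder, not a real machine:

```python
import socket

ZABBIX_AGENT_PORT = 10050   # default port the agent listens on for passive checks
ZABBIX_SERVER_PORT = 10051  # default port the server listens on for active agents


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder hostname; substitute the host being debugged.
    host = "host.example.org"
    for port in (ZABBIX_AGENT_PORT, ZABBIX_SERVER_PORT):
        print(host, port, "open" if port_open(host, port) else "closed/filtered")
```

This only shows that something accepts connections on the port; it does not confirm the agent will answer the server, since the agent also restricts which addresses it talks to via its `Server` setting.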

Just capturing some requirements @kevin listed on the releng ticket:

Awesome. :)

So, a few other things:

    Can you nuke that vnet* rule? I see those still alerting... and as we figured in staging, due to whatever quirk, bridges show up as 10MB connections so it alerts on them all the time.

    All the iad2 hosts are in, but we still have all the non iad2 ones. ;) There are several ways we could address those so we should discuss options. We could just connect to them over our vpn (all the external hosts should be on the vpn I think). We would need to add a vpn endpoint on the zabbix01 vm and then add all the other ones with '$name.vpn.fedoraproject.org'. Or I think zabbix has proxy / spoke things? We could look at setting up some new VMs in each external dc and have those monitor the local machines and phone back to the hub. That sounds like it would be more work to me, but I'm not sure how those work entirely.

    I think @ryanlerch found some matrix integration for zabbix. We should look at that and what it would take to set it up.

    and there's likely a bunch of old one off special nagios checks we need to consider. I am not sure how best to approach that. Perhaps we just make a list of all hosts, then go through them slowly over time one by one and see if there's any 'non standard' checks? Or we could try and untangle those out of the nagios ansible stuff. Ideas welcome on that. ;)

Thanks for all the work on this... 

Oh, and thinking about it more, we should probably just connect to the external hosts directly. If we use the vpn, then all monitoring on all of those will depend on the vpn being up and working. Currently nagios directly connects to them, I am pretty sure. So, that's just adding the '$name.fedoraproject.org' hosts that nagios has (all the external ones not in iad2).

Regarding the vnet* rule, nuking it is tricky: the mechanism which autodiscovers and then applies it is pretty complex, but I'll keep looking at it! The way we configure this on the centos side is very different! I'd rather not just delete everything and implement it the same way, as we would lose a lot of other fancy things! For the moment I've reduced that particular alert's severity to Information, and @ryanlerch has already figured out how to prevent such alerts firing via the matrix bot :)

I'll have to figure out how to determine which iface type corresponds with these bridges, then I can probably modify the alert rule to ignore them if a match occurs.

Screenshot_2024-01-17_at_11.28.16.png

Ok, got the vnet* rules sorted, no longer showing up. I found exactly where it can be configured in the base template applied to all the current hosts:

Linux by Zabbix agent active > macros >

{$NET.IF.IFNAME.MATCHES}
{$NET.IF.IFNAME.NOT_MATCHES}
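
For reference, as far as I understand the template, network interface discovery keeps an interface only if its name matches the first macro and does not match the second. A rough sketch of that logic, handy for trying out a pattern before changing the macro. The regex values here are hypothetical examples, not the values set on the server, and Python's `re` is only an approximation of the regex engine Zabbix uses:

```python
import re


def interface_discovered(name: str, matches: str, not_matches: str) -> bool:
    """Mimic the discovery filter: name must match `matches` and must not match `not_matches`."""
    return (
        re.search(matches, name) is not None
        and re.search(not_matches, name) is None
    )


if __name__ == "__main__":
    # Hypothetical macro values for illustration only.
    matches = r"^.*$"                  # stand-in for {$NET.IF.IFNAME.MATCHES}
    not_matches = r"^(lo|vnet[0-9]+)$"  # stand-in for {$NET.IF.IFNAME.NOT_MATCHES}
    for iface in ("eth0", "lo", "vnet3", "br0"):
        state = "discovered" if interface_discovered(iface, matches, not_matches) else "ignored"
        print(iface, state)
```

Excluding interfaces at discovery time means no items or triggers get created for them at all, which is why the vnet* alerts stop rather than just being silenced.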

I think I have the vpn configured on the zabbix01, but might be best to go with direct connections in that case. The zabbix agent should already be installed on the proxies outside iad2 too, so I shouldn't need a freeze break request thankfully. I'll see if I can add one and get it working from the server side with a direct connection.

Metadata
Boards 1
ops Status: Backlog