#9714 Resources to deploy Prometheus and Zabbix in fedora infrastructure
Closed: Fixed 3 years ago by pingou. Opened 3 years ago by asaleh.

As an ARC team initiative we want to investigate Prometheus and Zabbix
as our new monitoring and metrics solutions, by:

  • Installing Zabbix server in a VM, and hooking up the staging dist-git to it with an agent
  • Installing Prometheus in our OpenShift and collecting metrics for Anitya with it (because Michal has a specific use case: getting notified about crashlooping pods)
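
The crashlooping-pod use case could be checked against Prometheus roughly as follows. This is only a sketch under assumptions that are not in the ticket: that kube-state-metrics is being scraped (it exposes `kube_pod_container_status_waiting_reason`), that Anitya runs in a namespace called `anitya`, and that Prometheus answers at the placeholder URL below. The helper names are made up for illustration.

```python
import json
from urllib.parse import urlencode

# Assumption: placeholder address, not a real Fedora endpoint.
PROMETHEUS_URL = "http://prometheus.example.org:9090"


def crashloop_query(namespace):
    """PromQL selecting containers currently waiting in CrashLoopBackOff."""
    return (
        "kube_pod_container_status_waiting_reason"
        '{namespace="%s",reason="CrashLoopBackOff"} == 1' % namespace
    )


def query_url(namespace, base_url=PROMETHEUS_URL):
    """URL for the Prometheus instant-query HTTP endpoint (/api/v1/query)."""
    return base_url + "/api/v1/query?" + urlencode({"query": crashloop_query(namespace)})


def crashlooping_pods(response_body):
    """Extract sorted, de-duplicated pod names from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed: %r" % payload.get("error"))
    results = payload["data"]["result"]
    return sorted({r["metric"].get("pod", "<unknown>") for r in results})
```

Fetching `query_url("anitya")` and passing the response body to `crashlooping_pods` would give the list of pods to notify about; the same PromQL expression could probably serve as the basis for an alerting rule instead of polling.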

In the process we want to be able to answer the questions posed in the latest mailing list thread and, by the end, have a setup that can lead directly into migrating us away from Nagios. The questions (mostly from Kevin):

  • How can we provision both of them automatically from ansible?
    Ideally when we add some new host we just run something and it
    configures the needed places.

  • Can we get zabbix to pull from prometheus? It might be nice if at
    least all the alerting could be in zabbix.

  • Can zabbix handle our number of machines? I know a long time ago when
    we tried to deploy zabbix it couldn't keep up. So perhaps some kind of
    load testing? or adding all builders to it or something?

  • How flexible is the alerting? I think we may want to revisit things
    from our current nagios setup. I think we have some good things
    (alerts always happen on irc first so if someone sees it they can look)
    and some bad things (alerts get acked and the problem gets worse, checks
    get disabled and never turned back on, etc). It would be nice to divvy
    things up into at least some big SLEs... so mirrorlists being down would
    wake the world, but badges would just send email/irc until someone
    looked.

  • Can zabbix/prometheus do any of our metrics needs?
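
The tiering idea in the alerting question (mirrorlists wake the world, badges only send email/irc) can be pictured as a small routing table. This is a sketch only: the tier names, the `page` channel and the default tier are assumptions for illustration, and neither Zabbix nor Prometheus is configured this way literally.

```python
# Hypothetical tiers and channel names, used only for illustration.
# IRC is listed first in every tier, matching "alerts always happen on irc first".
CHANNELS_BY_TIER = {
    "critical": ["irc", "email", "page"],  # "wake the world"
    "low": ["irc", "email"],               # wait until someone looks
}

# Only these two service examples come from the ticket.
TIER_BY_SERVICE = {
    "mirrorlists": "critical",
    "badges": "low",
}


def channels_for(service, default_tier="low"):
    """Return the notification channels for a service, in delivery order."""
    tier = TIER_BY_SERVICE.get(service, default_tier)
    return list(CHANNELS_BY_TIER[tier])
```

With this table, `channels_for("mirrorlists")` includes the paging channel while `channels_for("badges")` stops at irc and email; whichever tool ends up doing the alerting would need an equivalent mapping.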

To do this we will need:

  • 1 VM to run the Zabbix server, plus the access required to run the rbac playbook to provision VMs. This could be done by assigning siddharthvipul1 and asaleh either sysadmin-noc, sysadmin-noc for stg, or a new group with access to at least the VM that will house Zabbix and dist-git on staging
  • Access to openshift as cluster-admin (to be able to install operator that will install/configure prometheus, requires configuring clusterroles and clusterrolebindings)

to be able to install operator that will install/configure prometheus, requires configuring clusterroles and clusterrolebindings

Can this be done via ansible?

The idea being the usual: if openshift's server goes down, we get new hardware, run the playbook and everything comes back to life.

Yes, the installation of the Prometheus operator is already automated, and porting the shell script to Ansible shouldn't be hard.
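
For reference, the clusterroles and clusterrolebindings mentioned above are plain Kubernetes RBAC objects, so they can be generated and applied from automation. The sketch below builds a pair of them as JSON; all object names, the service account, the namespace and the rule list are hypothetical placeholders, and the rules are not the operator's actual required permissions.

```python
import json

RBAC_API = "rbac.authorization.k8s.io"

# Illustrative rules only; NOT the real permission set of the Prometheus operator.
EXAMPLE_RULES = [
    {
        "apiGroups": ["monitoring.coreos.com"],
        "resources": ["prometheuses", "servicemonitors"],
        "verbs": ["get", "list", "watch"],
    }
]


def cluster_role(name, rules):
    """Build a ClusterRole manifest as a plain dict."""
    return {
        "apiVersion": RBAC_API + "/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": rules,
    }


def cluster_role_binding(name, role_name, sa_name, sa_namespace):
    """Bind a ClusterRole to a service account in the given namespace."""
    return {
        "apiVersion": RBAC_API + "/v1",
        "kind": "ClusterRoleBinding",
        "metadata": {"name": name},
        "roleRef": {"apiGroup": RBAC_API, "kind": "ClusterRole", "name": role_name},
        "subjects": [
            {"kind": "ServiceAccount", "name": sa_name, "namespace": sa_namespace}
        ],
    }


def render_list(items):
    """Wrap manifests in a v1 List and serialize them as JSON."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": items}, indent=2)
```

The output of `render_list([...])` could be piped to `oc apply -f -` by a cluster-admin, or the same structures could be written directly as Ansible task definitions when the shell script is ported.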

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops, request-for-resources

3 years ago

I was added to sysadmin-noc and have access to the correct VM.
This can be closed (from my perspective).

Metadata Update from @pingou:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago


Metadata
Boards 1
ops Status: Done