Issue #112: Prometheus + Grafana monitoring stack for Fedora CoreOS - centos-infra


#112 Prometheus + Grafana monitoring stack for Fedora CoreOS

Closed: Fixed 3 years ago by siddharthvipul1. Opened 3 years ago by cverna.

FCOS services expose Prometheus metrics [0][1], but we currently have nowhere to gather and store this data.
Would it be possible to leverage the CentOS OpenShift cluster to deploy a Prometheus + Grafana stack?

Ideally we would like to have Prometheus with a persistent volume, a Grafana instance sitting next to it, OIDC access to both consoles, and RBAC that lets us change the configmaps.

[0] - https://status.updates.coreos.fedoraproject.org/metrics
[1] - https://fcos-metrics-31.lucabruno.net/bridge?selector=zincati
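For readers unfamiliar with what such an endpoint returns: Prometheus scrapes a plain-text format with one sample per line. The following is a minimal sketch of reading that format, not a full parser. The metric names in `SAMPLE` are made up for illustration and are not taken from the FCOS endpoints above; timestamps and label values containing commas are not handled.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format samples into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp support here).
        series, _, raw_value = line.rpartition(" ")
        name, labels = series, {}
        if "{" in series:
            name, _, label_part = series.partition("{")
            for pair in label_part.rstrip("}").split(","):
                if pair:
                    key, _, val = pair.partition("=")
                    labels[key.strip()] = val.strip().strip('"')
        samples.append((name, labels, float(raw_value)))
    return samples


# Hypothetical payload, only to show the shape of the format.
SAMPLE = """\
# HELP example_updates_total Hypothetical counter, not a real FCOS metric.
# TYPE example_updates_total counter
example_updates_total{stream="stable"} 3
example_up 1
"""

if __name__ == "__main__":
    for sample in parse_metrics(SAMPLE):
        print(sample)
```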

Metadata Update from @siddharthvipul1:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble

3 years ago

Metadata Update from @dkirwan:
- Issue tagged with: centos-ci-infra

3 years ago

dkirwan commented 3 years ago

It's pretty straightforward to turn on the user workload monitoring stack [1]. It requires a PV to be available and a configuration to be added; see example: [2].

We will have to do a few things, like giving users permission to create PrometheusRule and ServiceMonitor objects, etc. There is no doubt also an RBAC permission that gives them view access to the monitoring section of the OpenShift UI.

Might also need to look into adding an extra configuration for Alertmanager to route the alerts to the right people.

Edited 3 years ago by dkirwan

asaleh commented 3 years ago

Currently blocked by https://pagure.io/centos-infra/issue/142 to provide persistence,
but https://prometheus-dashboards.apps.ocp.ci.centos.org/graph is configured to collect the two endpoints.

If you want to add more (or have https://prometheus.io/docs/operating/integrations/#remote-endpoints-and-storage in mind) let me know or
submit a PR against https://pagure.io/gatingdashboards/blob/master/f/prometheus/prometheus-config.yaml
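As a rough illustration of what adding an endpoint involves, the sketch below turns a metrics URL into a Prometheus scrape job using the standard `scrape_configs` fields (`job_name`, `scheme`, `metrics_path`, `params`, `static_configs`). The job names are hypothetical, and the actual layout of `prometheus-config.yaml` in the gatingdashboards repository is not shown in this thread, so treat this as an assumption about its shape.

```python
import json
from urllib.parse import urlsplit, parse_qs


def scrape_job(job_name, url):
    """Build a Prometheus scrape job (as a dict) from a metrics endpoint URL."""
    parts = urlsplit(url)
    job = {
        "job_name": job_name,
        "scheme": parts.scheme,
        # Prometheus defaults to /metrics when no path is configured.
        "metrics_path": parts.path or "/metrics",
        "static_configs": [{"targets": [parts.netloc]}],
    }
    # Query parameters map to the `params` field: name -> list of values.
    params = parse_qs(parts.query)
    if params:
        job["params"] = params
    return job


if __name__ == "__main__":
    # Job names below are placeholders chosen for this example.
    jobs = [
        scrape_job("fcos-updates-status",
                   "https://status.updates.coreos.fedoraproject.org/metrics"),
        scrape_job("fcos-zincati-bridge",
                   "https://fcos-metrics-31.lucabruno.net/bridge?selector=zincati"),
    ]
    # JSON is valid YAML, so this output can be pasted under `scrape_configs:`.
    print(json.dumps(jobs, indent=2))
```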

lucab commented 3 years ago

@asaleh thanks for the update!

OIDC to Prometheus dashboard is working fine. I already see some targets being scraped. I've submitted https://pagure.io/gatingdashboards/pull-request/1 to update the list of fedora-coreos endpoints.

OIDC to Grafana dashboard is also working fine. However it lacks the Prom datasource, and I didn't manage to add it (possibly due to AuthN troubles).
No pressure, but if you manage to wire Grafana and Prometheus together I can happily add our contrib dashboard mixins.

lucab commented 3 years ago

@dkirwan @asaleh gentle ping, both for the endpoint fix PR and for the Grafana setup.

asaleh commented 3 years ago

@lucab Apologies it has been taking so long, got stuck on fixing Bodhi integration tests. Will try to merge and update today/tomorrow.

asaleh commented 3 years ago

@lucab so, I at least managed to update the prometheus config.

Based on https://prometheus-dashboards.apps.ocp.ci.centos.org/targets, scraping https://fcos-31-metrics-cdn.lucabruno.net/bridge?selector=zincati didn't work.

I don't think I will be able to properly wire Grafana and Prometheus together just yet, as this POC is lacking persistent storage. I want to look into that afterwards.

There's still a plan to invest proper resources into creating a Fedora monitoring stack in Q1 2021, so this is a bit on the back burner.

Let me know if this is still useful enough for you, or if I should close this issue and let you know once we are working to set up something better supported :)

lucab commented 3 years ago

@asaleh thanks, I was doing some development work on that instance so it was effectively down. The new config works.

I am a bit sad about having to go back to "fedora-infra monitoring will happen next cycle", as I have been in this loop for the last three years, but I understand that's part of the planning.

On this ephemeral Prometheus instance, is it at least possible to have a public endpoint (without authN) so that an external Grafana can be plugged into it on the fly?

asaleh commented 3 years ago

yeah, I think with this level of POC, I can safely disable the authn, as it just scrapes already public data anyway :)

https://prometheus-dashboards.apps.ocp.ci.centos.org/graph is now public
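With authentication disabled, any external client (Grafana included) can talk to the standard Prometheus HTTP API under `/api/v1/`. The sketch below shows an instant query against that API using only the Python standard library. It assumes the instance at the URL above is reachable, which this thread alone cannot guarantee; `fetch` is an illustrative helper name.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the comment above; availability is an assumption.
BASE_URL = "https://prometheus-dashboards.apps.ocp.ci.centos.org"


def instant_query_url(base_url, promql):
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(payload):
    """Extract (labels, value) pairs from an instant-vector query response."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1]))
            for item in payload["data"]["result"]]


def fetch(base_url, promql, timeout=10):
    """Run an instant query over HTTP (needs network access)."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))


if __name__ == "__main__":
    # `up` is the built-in series reporting whether each scrape target is healthy.
    for labels, value in fetch(BASE_URL, "up"):
        print(labels.get("job"), labels.get("instance"), value)
```

In Grafana itself, the equivalent step is adding a Prometheus datasource that points at the same base URL.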

I am figuring out how to make these sorts of things more granular, but I wanted to give you something quick.

W.r.t. 'proper monitoring next quarter': that is the reason why I am trying to make these little POCs as useful as possible. On one hand, we may have to make do with this sort of half-baked solution for a while. On the other, I hope it will be easier to say 'let us turn this half-baked thing several people are already using into something real; I need myself and two other people with infra access for three months' than 'so, we need monitoring, what is it, nobody knows'.

PS: we actually have a Grafana deployment in os.fp.o; if you want to introduce those dashboard mixins to that configuration, go ahead:

url: https://monitor-dashboard-web-monitor-dashboard.app.os.fedoraproject.org/
configuration: https://pagure.io/fedora-infra/ansible/blob/master/f/roles/openshift-apps/monitor-dashboard/files

asaleh commented 3 years ago

OK, I had some time to sit with the POC again. It has some persistence now (I had to use PostgreSQL and a Promscale adapter, but it seems to work), and I imported the source and the dashboard mixin into our Ansible definitions.

You can see the dashboard here:
https://monitor-dashboard-web-monitor-dashboard.app.os.fedoraproject.org/d/VrG7vdLGk/fedora-coreos-updates-zincati?orgId=1

I would consider this issue resolved?

Metadata Update from @arrfab:
- Issue assigned to asaleh

3 years ago

lucab commented 3 years ago

@asaleh yes, and thanks for importing my Grafana mixin as well.

For my own reference, the relevant configuration fragments are these:
* Prometheus: https://pagure.io/gatingdashboards/blob/master/f/prometheus/prometheus-config.yaml
* Grafana dashboards: https://pagure.io/fedora-infra/ansible/blob/main/f/roles/openshift-apps/monitor-dashboard/templates

asaleh commented 3 years ago

OK, I will close the issue and let you know once we have something more stable. I hope that in a few months we will be running all of this on fedora-infra in less of a POC state :-)

Metadata Update from @siddharthvipul1:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata

Assignee: asaleh
Tags: medium-gain, medium-trouble, centos-ci-infra
Blocking: None
Depending on: None
Priority: Waiting on Assignee
Boards: CentOS CI Infra (Status: Backlog)
