#7 Notes on Prometheus research.
Merged 3 years ago by siddharthvipul1. Opened 3 years ago by asaleh.

@@ -28,10 +28,26 @@ 

   -  Can zabbix handle our number of machines?

   -  How flexible is the alerting?

  

+ Main takeaway

+ -------------

+ 

+ We managed to create proof-of-concept monitoring solutions with both Prometheus and Zabbix.

+ 

+ The initial configuration proved to have more pitfalls than expected:

+ with Prometheus, especially in the integration with OpenShift and its other auxiliary services,

+ and with Zabbix, especially in correctly setting up iptables and network permissions

+ and in configuring a reasonable setup for user access and user-account management.

+ 

+ Despite these setbacks, we still feel this would be an improvement over our current Nagios-based setup.

+ 

+ To get a basic overview of Prometheus, you can watch this short tech talk by Adam Saleh

+ (accessible only to Red Hat): https://drive.google.com/file/d/1-uEIkS2jaJ2b8V_4y-AKW1J6sdZzzlc9/view

+ or read the more in-depth report in the relevant sections of this documentation.

  

  .. toctree::

      :maxdepth: 1

  

-     prometheus

+     prometheus_for_ops

+     prometheus_for_dev

      faq

  

docs/monitoring_metrics/prometheus_for_dev.rst docs/monitoring_metrics/prometheus.rst
file renamed
+16 -33
@@ -1,36 +1,3 @@ 

- Monitoring / Metrics with Prometheus

- ========================

- 

- For deployment, we used combination for configuration of prometheus operator and application-monitoring operator.

- 

- Beware, most of the deployment notes could be mostly obsolete in really short time.

- The POC was done on OpenShift 3.11, which limited us in using older version of prometheus operator,

- as well as the no longer maintained application-monitoring operator.

- 

- In openshift 4.x that we plan to use in the near future, there is  supported way integrated in the openshift deployment:

- 

- * https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html

- * https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack

- * https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

- 

- The supported stack is more limited, especially w.r.t. adding user defined pod- and service-monitors, but even if we would want to

- run additional prometheus instances, we should be able to skip the instalation of the necessary operators, as all of them should already be present.

- 

- 

- Notes on operator deployment

- -------------------

- 

- The deployment in question was done by configuring the CRDs, roles and rolebinding and operator setup:

- 

- The definitions are as follows:

- - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd

- - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd

- - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

- 

- Once the operator is correctly running, you just define a prometheus crd and it will create prometheus deployment for you.

- 

- The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

- 

  Notes on application monitoring self-service

  ---------------------------------

  
@@ -96,3 +63,19 @@ 

  these are not available to change from the developers namespaces.

  

  The managing and acknowledging of the alerts can be done in alert-manager in rudimentary fashion.

+ 

+ Notes on instrumenting the application

+ --------------------------------------

+ 

+ Prometheus expects the applications it scrapes metrics from

+ to be services that expose a '/metrics' endpoint with metrics in the correct

+ format.

+ 

+ There are libraries that help with this for many different languages,

+ confusingly called client libraries, even though they usually export metrics as an HTTP server endpoint:

+ https://prometheus.io/docs/instrumenting/clientlibs/

+ 

+ As part of the proof of concept we instrumented the Bodhi application

+ to collect data through the prometheus_client Python library:

+ https://github.com/fedora-infra/bodhi/pull/4079

+ 
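
+ For illustration, a minimal sketch of what such instrumentation looks like with the

+ prometheus_client library (the metric name, label and port below are made-up placeholders,

+ not Bodhi's actual metrics):

+ 

+ ::

+ 

+     # Minimal prometheus_client sketch: expose a counter on a /metrics HTTP endpoint.

+     import time

+     from prometheus_client import Counter, start_http_server

+ 

+     # Counters only ever go up; the metric name and label are illustrative.

+     REQUESTS = Counter("myapp_requests_total", "Total requests handled", ["endpoint"])

+ 

+     if __name__ == "__main__":

+         # Serve metrics in the Prometheus exposition format at http://localhost:8000/metrics

+         start_http_server(8000)

+         while True:

+             REQUESTS.labels(endpoint="/").inc()

+             time.sleep(5)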

docs/monitoring_metrics/prometheus_for_ops.rst
file added
+115 -0
@@ -0,0 +1,115 @@ 

+ Monitoring / Metrics with Prometheus

+ ========================================

+ 

+ For deployment, we used a combination of the prometheus-operator and the application-monitoring operator.

+ 

+ Beware, most of these deployment notes could become obsolete in a really short time.

+ The POC was done on OpenShift 3.11, which limited us to using an older version of the prometheus-operator,

+ as well as the no-longer-maintained application-monitoring operator.

+ 

+ In OpenShift 4.x, which we plan to use in the near future, there is a supported way integrated into the OpenShift deployment:

+ 

+ * https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html

+ * https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack

+ * https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

+ 

+ The supported stack is more limited, especially w.r.t. adding user-defined pod- and service-monitors, but even if we wanted to

+ run additional Prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
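
+ 

+ For reference, a service-monitor is a small Kubernetes object telling Prometheus which services to

+ scrape; a minimal sketch might look like this (the names, labels and port are placeholders, not an

+ actual POC manifest):

+ 

+ ::

+ 

+     apiVersion: monitoring.coreos.com/v1

+     kind: ServiceMonitor

+     metadata:

+       name: example-app

+       labels:

+         monitoring-key: middleware

+     spec:

+       selector:

+         matchLabels:

+           app: example-app

+       endpoints:

+       - port: web

+         path: /metrics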

+ 

+ 

+ Notes on operator deployment

+ ----------------------------

+ 

+ The operator pattern is often used with Kubernetes and OpenShift for more complex deployments.

+ Instead of applying all of the configuration to deploy your services, you deploy a special,

+ smaller service called an operator, which has the necessary permissions to deploy and configure the complex service.

+ Once the operator is running, instead of configuring the service itself with service-specific config-maps,

+ you create operator-specific Kubernetes objects, so-called CRDs.

+ 

+ The deployment of the operator in question was done by configuring the CRDs, roles, rolebindings and the operator setup itself.

+ 

+ The definitions are as follows:

+ - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd

+ - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd

+ - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

+ 

+ Once the operator is correctly running, you just define a Prometheus custom resource and it will create a Prometheus deployment for you.
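
+ 

+ For example, a minimal sketch of such a Prometheus object might look like this (the name,

+ namespace and selector label are placeholders rather than the actual POC configuration):

+ 

+ ::

+ 

+     apiVersion: monitoring.coreos.com/v1

+     kind: Prometheus

+     metadata:

+       name: application-monitoring

+       namespace: application-monitoring

+     spec:

+       replicas: 1

+       serviceAccountName: prometheus

+       serviceMonitorSelector:

+         matchLabels:

+           monitoring-key: middleware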

+ 

+ The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

+ 

+ 

+ Notes on application monitoring operator deployment

+ ---------------------------------------------------

+ 

+ The application-monitoring operator was created to solve the integration of Prometheus, Alertmanager and Grafana.

+ After you configure it, it configures the relevant operators responsible for these services.

+ 

+ The most interesting difference between configuring this shared operator

+ and configuring these operators individually is that it configures some of the integrations for you,

+ and it integrates well with OpenShift's auth system through the OAuth proxy.

+ 

+ The biggest drawback is that the application-monitoring operator is an orphaned project,

+ but because it mostly configures other operators, it is relatively simple to just recreate 

+ the configuration for both Prometheus and Alertmanager

+ and deploy the prometheus and alertmanager operators without the help of the application-monitoring operator.

+ 

+ Notes on persistence

+ --------------------

+ 

+ By default, Prometheus expects to have a writable /prometheus folder

+ that can serve as persistent storage.

+ 

+ For a persistent volume to work for this purpose, it

+ **needs to have a POSIX-compliant filesystem**, and the NFS we currently have configured does not.

+ This is discussed in the `operational aspects <https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects>`_

+ section of the Prometheus documentation.

+ 

+ The easiest supported way to have a POSIX-compliant filesystem is to `set up local-storage <https://docs.openshift.com/container-platform/3.11/install_config/configuring_local.html>`_ 

+ in the cluster.

+ 

+ In 4.x versions of OpenShift `there is a local-storage-operator <https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-local.html>`_ for this purpose.

+ 

+ This is the simplest way to have working persistence, but it prevents us from having multiple instances

+ across OpenShift nodes, as the pod is using the underlying filesystem on the node.
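
+ 

+ For reference, a hand-written StorageClass and local PersistentVolume might look roughly like the

+ sketch below (path, capacity and hostname are placeholders; in practice the linked local-storage

+ setup or operator generates these objects):

+ 

+ ::

+ 

+     apiVersion: storage.k8s.io/v1

+     kind: StorageClass

+     metadata:

+       name: local

+     provisioner: kubernetes.io/no-provisioner

+     volumeBindingMode: WaitForFirstConsumer

+     ---

+     apiVersion: v1

+     kind: PersistentVolume

+     metadata:

+       name: prometheus-local-pv

+     spec:

+       capacity:

+         storage: 10Gi

+       accessModes:

+       - ReadWriteOnce

+       storageClassName: local

+       local:

+         path: /mnt/local-storage/prometheus

+       nodeAffinity:

+         required:

+           nodeSelectorTerms:

+           - matchExpressions:

+             - key: kubernetes.io/hostname

+               operator: In

+               values:

+               - node01.example.org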

+ 

+ To ask the operator to create a persisted Prometheus, you specify in its configuration, e.g.:

+ 

+ ::

+ 

+     retention: 24h

+     storage:

+         volumeClaimTemplate:

+             spec:

+                 storageClassName: local

+                 resources:

+                     requests:

+                         storage: 10Gi

+ 

+ By default, retention is set to 24 hours and can be overridden.

+ 

+ 

+ Notes on long-term storage

+ --------------------------

+ 

+ Usually, Prometheus itself is set up to store its metrics for a shorter amount of time,

+ and it is expected that for long-term storage and analysis there is some other storage solution,

+ such as InfluxDB or TimescaleDB.

+ 

+ We are currently running a POC that synchronizes Prometheus with TimescaleDB (running on PostgreSQL)

+ through a middleware service called `promscale <https://github.com/timescale/promscale>`_.

+ 

+ Promscale just needs access to an appropriate PostgreSQL database

+ and can be configured through PROMSCALE_DB_PASSWORD and PROMSCALE_DB_HOST.

+ 

+ By default it will ensure the database has TimescaleDB installed and configures its database

+ automatically.
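
+ 

+ As an illustration, those environment variables could be passed to the promscale container roughly

+ like this (the image tag, host value and secret name are placeholders, not the actual POC deployment):

+ 

+ ::

+ 

+     containers:

+     - name: promscale

+       image: timescale/promscale:latest

+       ports:

+       - containerPort: 9201

+       env:

+       - name: PROMSCALE_DB_HOST

+         value: "timescaledb.example.svc"

+       - name: PROMSCALE_DB_PASSWORD

+         valueFrom:

+           secretKeyRef:

+             name: promscale-db

+             key: password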

+ 

+ We set up Prometheus with a directive to use the promscale service as a backend:

+ https://github.com/timescale/promscale

+ 

+ ::

+ 

+     remote_write:

+     - url: "http://promscale:9201/write"

+     remote_read:

+     - url: "http://promscale:9201/read" 

\ No newline at end of file