#7 Notes on Prometheus research.
Merged 3 years ago by siddharthvipul1. Opened 3 years ago by asaleh.

@@ -28,10 +28,26 @@ 

   -  Can zabbix handle our number of machines?

   -  How flexible is the alerting?

  

+ Main takeaway

+ -------------

+ 

+ We managed to create proof-of-concept monitoring solutions with both Prometheus and Zabbix.

+ 

+ The initial configuration proved to have more pitfalls than expected:

+ with Prometheus, especially in the integration with OpenShift and its other auxiliary services,

+ and with Zabbix, especially in correctly setting up iptables and network permissions

+ and in configuring a reasonable setup for user access and user-account management.

+ 

+ Despite these setbacks, we still feel this would be an improvement over our current Nagios-based setup.

+ 

+ To get a basic overview of Prometheus, you can watch this short tech talk by Adam Saleh

+ (accessible only to Red Hat): https://drive.google.com/file/d/1-uEIkS2jaJ2b8V_4y-AKW1J6sdZzzlc9/view

+ or read the more in-depth report in the relevant sections of this documentation.

  

  .. toctree::

      :maxdepth: 1

  

-     prometheus

+     prometheus_for_ops

+     prometheus_for_dev

      faq

  

docs/monitoring_metrics/prometheus_for_dev.rst docs/monitoring_metrics/prometheus.rst
file renamed
+16 -33
@@ -1,36 +1,3 @@ 

- Monitoring / Metrics with Prometheus

- ========================

- 

- For deployment, we used combination for configuration of prometheus operator and application-monitoring operator.

- 

- Beware, most of the deployment notes could be mostly obsolete in really short time.

- The POC was done on OpenShift 3.11, which limited us in using older version of prometheus operator,

- as well as the no longer maintained application-monitoring operator.

- 

- In openshift 4.x that we plan to use in the near future, there is  supported way integrated in the openshift deployment:

- 

- * https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html

- * https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack

- * https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

- 

- The supported stack is more limited, especially w.r.t. adding user defined pod- and service-monitors, but even if we would want to

- run additional prometheus instances, we should be able to skip the instalation of the necessary operators, as all of them should already be present.

- 

- 

- Notes on operator deployment

- -------------------

- 

- The deployment in question was done by configuring the CRDs, roles and rolebinding and operator setup:

- 

- The definitions are as follows:

- - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd

- - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd

- - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

- 

- Once the operator is correctly running, you just define a prometheus crd and it will create prometheus deployment for you.

- 

- The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

- 

  Notes on application monitoring self-service

  ---------------------------------

  
@@ -96,3 +63,19 @@ 

  these are not available to change from the developers namespaces.

  

  The managing and acknowledging of the alerts can be done in alert-manager in rudimentary fashion.

+ 

+ Notes on instrumenting the application

+ --------------------------------------

+ 

+ Prometheus expects the applications it scrapes metrics from

+ to be services that expose a '/metrics' endpoint with metrics in the correct

+ format.

+ 

+ There are libraries that help with this for many different languages,

+ confusingly called client libraries, even though they usually export metrics as an HTTP server endpoint:

+ https://prometheus.io/docs/instrumenting/clientlibs/

+ 

+ As part of the proof of concept we instrumented the Bodhi application

+ to collect data through the prometheus_client Python library:

+ https://github.com/fedora-infra/bodhi/pull/4079

+ 
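
+ For illustration, a minimal sketch of what such instrumentation looks like with the

+ prometheus_client library (the metric name, label and port below are made-up placeholders,

+ not Bodhi's actual metrics):

+ 

+ ::

+ 

+     # Minimal prometheus_client sketch: expose a counter on a /metrics HTTP endpoint.

+     import time

+     from prometheus_client import Counter, start_http_server

+ 

+     # Counters only ever go up; the metric name and label are illustrative.

+     REQUESTS = Counter("myapp_requests_total", "Total requests handled", ["endpoint"])

+ 

+     if __name__ == "__main__":

+         # Serve metrics in the Prometheus exposition format at http://localhost:8000/metrics

+         start_http_server(8000)

+         while True:

+             REQUESTS.labels(endpoint="/").inc()

+             time.sleep(5)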

docs/monitoring_metrics/prometheus_for_ops.rst
file added
+115 -0
@@ -0,0 +1,115 @@ 

+ Monitoring / Metrics with Prometheus

+ ========================================

+ 

+ For deployment, we used a combination of the prometheus-operator and the application-monitoring operator.

+ 

+ Beware, most of these deployment notes could become obsolete in a really short time.

+ The POC was done on OpenShift 3.11, which limited us to using an older version of the prometheus-operator,

+ as well as the no-longer-maintained application-monitoring operator.

+ 

+ In OpenShift 4.x, which we plan to use in the near future, there is a supported way integrated into the OpenShift deployment:

+ 

+ * https://docs.openshift.com/container-platform/4.7/monitoring/understanding-the-monitoring-stack.html

+ * https://docs.openshift.com/container-platform/4.7/monitoring/configuring-the-monitoring-stack.html#configuring-the-monitoring-stack

+ * https://docs.openshift.com/container-platform/4.7/monitoring/enabling-monitoring-for-user-defined-projects.html

+ 

+ The supported stack is more limited, especially w.r.t. adding user-defined pod- and service-monitors, but even if we wanted to

+ run additional Prometheus instances, we should be able to skip the installation of the necessary operators, as all of them should already be present.
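
+ 

+ For reference, a service-monitor is a small Kubernetes object telling Prometheus which services to

+ scrape; a minimal sketch might look like this (the names, labels and port are placeholders, not an

+ actual POC manifest):

+ 

+ ::

+ 

+     apiVersion: monitoring.coreos.com/v1

+     kind: ServiceMonitor

+     metadata:

+       name: example-app

+       labels:

+         monitoring-key: middleware

+     spec:

+       selector:

+         matchLabels:

+           app: example-app

+       endpoints:

+       - port: web

+         path: /metrics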

+ 

+ 

+ Notes on operator deployment

+ ----------------------------

+ 

+ The operator pattern is often used with Kubernetes and OpenShift for more complex deployments.

+ Instead of applying all of the configuration to deploy your services, you deploy a special,

+ smaller service called an operator, which has the necessary permissions to deploy and configure the complex service.

+ Once the operator is running, instead of configuring the service itself with service-specific config-maps,

+ you create operator-specific Kubernetes objects, so-called CRDs.

+ 

+ The deployment of the operator in question was done by configuring the CRDs, roles, rolebindings and the operator setup itself.

+ 

+ The definitions are as follows:

+ - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/prometheus-operator-crd

+ - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator-crd

+ - https://github.com/prometheus-operator/prometheus-operator/tree/v0.38.3/example/rbac/prometheus-operator

+ 

+ Once the operator is correctly running, you just define a Prometheus custom resource and it will create a Prometheus deployment for you.
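
+ 

+ For example, a minimal sketch of such a Prometheus object might look like this (the name,

+ namespace and selector label are placeholders rather than the actual POC configuration):

+ 

+ ::

+ 

+     apiVersion: monitoring.coreos.com/v1

+     kind: Prometheus

+     metadata:

+       name: application-monitoring

+       namespace: application-monitoring

+     spec:

+       replicas: 1

+       serviceAccountName: prometheus

+       serviceMonitorSelector:

+         matchLabels:

+           monitoring-key: middleware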

+ 

+ The POC lives in https://pagure.io/fedora-infra/ansible/blob/main/f/playbooks/openshift-apps/application-monitoring.yml

+ 

+ 

+ Notes on application monitoring operator deployment

+ ---------------------------------------------------

+ 

+ The application-monitoring operator was created to solve the integration of Prometheus, Alertmanager and Grafana.

+ After you configure it, it configures the relevant operators responsible for these services.

+ 

+ The most interesting difference between configuring this shared operator

+ and configuring these operators individually is that it configures some of the integrations for you,

+ and it integrates well with OpenShift's auth system through the OAuth proxy.

+ 

+ The biggest drawback is that the application-monitoring operator is an orphaned project,

+ but because it mostly configures other operators, it is relatively simple to just recreate 

+ the configuration for both Prometheus and Alertmanager

+ and deploy the prometheus and alertmanager operators without the help of the application-monitoring operator.

+ 

+ Notes on persistence

+ --------------------

+ 

+ By default, Prometheus expects to have a writable /prometheus folder

+ that can serve as persistent storage.

+ 

+ For a persistent volume to work for this purpose, it

+ **needs to have a POSIX-compliant filesystem**, and the NFS we currently have configured does not.

+ This is discussed in the `operational aspects <https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects>`_

+ section of the Prometheus documentation.

+ 

+ The easiest supported way to have a POSIX-compliant filesystem is to `set up local-storage <https://docs.openshift.com/container-platform/3.11/install_config/configuring_local.html>`_ 

+ in the cluster.

+ 

+ In 4.x versions of OpenShift `there is a local-storage-operator <https://docs.openshift.com/container-platform/4.7/storage/persistent_storage/persistent-storage-local.html>`_ for this purpose.

+ 

+ This is the simplest way to have working persistence, but it prevents us from having multiple instances

+ across OpenShift nodes, as the pod is using the underlying filesystem on the node.
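
+ 

+ For reference, a hand-written StorageClass and local PersistentVolume might look roughly like the

+ sketch below (path, capacity and hostname are placeholders; in practice the linked local-storage

+ setup or operator generates these objects):

+ 

+ ::

+ 

+     apiVersion: storage.k8s.io/v1

+     kind: StorageClass

+     metadata:

+       name: local

+     provisioner: kubernetes.io/no-provisioner

+     volumeBindingMode: WaitForFirstConsumer

+     ---

+     apiVersion: v1

+     kind: PersistentVolume

+     metadata:

+       name: prometheus-local-pv

+     spec:

+       capacity:

+         storage: 10Gi

+       accessModes:

+       - ReadWriteOnce

+       storageClassName: local

+       local:

+         path: /mnt/local-storage/prometheus

+       nodeAffinity:

+         required:

+           nodeSelectorTerms:

+           - matchExpressions:

+             - key: kubernetes.io/hostname

+               operator: In

+               values:

+               - node01.example.org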

+ 

+ To ask the operator to create a persisted Prometheus, you specify in its configuration, e.g.:

+ 

+ ::

+ 

+     retention: 24h

+     storage:

+         volumeClaimTemplate:

+             spec:

+                 storageClassName: local

+                 resources:

+                     requests:

+                         storage: 10Gi

+ 

+ By default, retention is set to 24 hours and can be overridden.

+ 

+ 

+ Notes on long-term storage

+ --------------------------

+ 

+ Usually, Prometheus itself is set up to store its metrics for a shorter amount of time,

+ and it is expected that for long-term storage and analysis there is some other storage solution,

+ such as InfluxDB or TimescaleDB.

+ 

+ We are currently running a POC that synchronizes Prometheus with TimescaleDB (running on PostgreSQL)

+ through a middleware service called `promscale <https://github.com/timescale/promscale>`_.

+ 

+ Promscale just needs access to an appropriate PostgreSQL database

+ and can be configured through PROMSCALE_DB_PASSWORD and PROMSCALE_DB_HOST.

+ 

+ By default it will ensure the database has TimescaleDB installed and configures its database

+ automatically.
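
+ 

+ As an illustration, those environment variables could be passed to the promscale container roughly

+ like this (the image tag, host value and secret name are placeholders, not the actual POC deployment):

+ 

+ ::

+ 

+     containers:

+     - name: promscale

+       image: timescale/promscale:latest

+       ports:

+       - containerPort: 9201

+       env:

+       - name: PROMSCALE_DB_HOST

+         value: "timescaledb.example.svc"

+       - name: PROMSCALE_DB_PASSWORD

+         valueFrom:

+           secretKeyRef:

+             name: promscale-db

+             key: password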

+ 

+ We set up Prometheus with a directive to use the promscale service as a backend:

+ https://github.com/timescale/promscale

+ 

+ ::

+ 

+     remote_write:

+     - url: "http://promscale:9201/write"

+     remote_read:

+     - url: "http://promscale:9201/read" 

\ No newline at end of file