#1296 websites: add monitoring rules
Merged a year ago by kevin. Opened a year ago by darknao.
fedora-infra/ darknao/ansible websites  into  main

@@ -99,3 +99,9 @@ 

      template: deployment.yml

      objectname: deployment.yml

      when: env == "staging"

+ 

+   - role: openshift/object

+     app: websites

+     file: prometheusRules.yml

+     objectname: prometheusRules.yml

+     when: env == "staging"

@@ -0,0 +1,57 @@ 

+ apiVersion: monitoring.coreos.com/v1

+ kind: PrometheusRule

+ metadata:

+   name: alerts

+ spec:

+   groups:

+   - name: jobFailed

+     rules:

+     - alert: JobFailed

+       annotations:

+         description: Job {{$labels.namespace}}/{{$labels.job_name}} has failed.

+         summary: At least one job has failed.

+       expr: kube_job_failed > 0

+       labels:

+         severity: warning

+   - name: BuildFailed

+     rules:

+     - alert: BuildFailed

+       annotations:

+         description: Build {{$labels.namespace}}/{{$labels.buildconfig}} ({{$labels.build}}) has failed.

+         summary: Build {{$labels.buildconfig}} has failed.

+       expr: openshift_build_status_phase_total{build_phase="failed"} > 0

+       labels:

+         severity: warning

+   - name: PodFailing

+     rules:

+     - alert: PodPending

+       annotations:

+         description: Pod {{$labels.namespace}}/{{$labels.pod}} is in pending state for more than 10m.

+         summary: Pod {{$labels.pod}} is in pending state.

+       expr: kube_pod_status_phase{phase="Pending"} > 0

+       for: 10m

+       labels:

+         severity: warning

+     - alert: PodRestarted

+       annotations:

+         description: Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} has restarted.

+         summary: Containers in pod {{$labels.pod}} has restarted.

+       expr: rate(kube_pod_container_status_restarts_total[10m]) * 60 * 10 > 0

+       labels:

+         severity: warning

+     - alert: PodCrashLoop

+       annotations:

+         description: Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} has restarted {{ printf "%.2f" $value }} in the last 15 minutes.

+         summary: Pod {{$labels.pod}} is in CrashLoop state.

+       expr: rate(kube_pod_container_status_restarts_total[15m]) * 60 * 15 > 2

+       labels:

+         severity: warning

+       for: 15m

+     - alert: PodOOMKilled

+       annotations:

+         description: Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} ran out

+           of memory and has been killed.

+         summary: Containers in pod {{$labels.pod}} has been OOMKilled.

+       expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} > 0

+       labels:

+         severity: warning

@@ -18,7 +18,6 @@ 

        ADD . /websites

        WORKDIR /websites

        RUN npm install

-       RUN ./i18n_gen_yml.sh

        RUN npm run generate

  

        FROM quay.io/fedora/fedora:37

@@ -57,6 +57,13 @@ 

      objectname: appowners.yml

      template_fullpath: "{{roles_path}}/openshift/project/templates/appowners.yml"

  

+ - name: alertmanager.yml

+   include_role:

+     name: openshift/object

+   vars:

+     objectname: alertmanager.yml

+     template_fullpath: "{{roles_path}}/openshift/project/templates/alertmanager.yml"

+ 

  - name: egresspolicy.yml

    include_role:

      name: openshift/object

@@ -0,0 +1,16 @@ 

+ apiVersion: monitoring.coreos.com/v1beta1

+ kind: AlertmanagerConfig

+ metadata:

+   name: appowners-alerts

+   namespace: "{{app}}"

+ spec:

+   receivers:

+   - emailConfigs:

+     - sendResolved: true

+       to: "{{ appowners | product(['fedoraproject.org']) | map('join', '@') | join(',') }}"

+     name: default

+   route:

+     groupBy:

+     - alertname

+     - namespace

+     receiver: default

This PR sets up an Alertmanager email receiver (AlertmanagerConfig) in user namespaces, using the app owners' fedoraproject.org email addresses.
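
To illustrate how the `to:` field in the AlertmanagerConfig template is built, here is a minimal standalone playbook that evaluates the same filter chain. The `appowners` values (`alice`, `bob`) are hypothetical placeholders; in the real role the list comes from the app's variables.

```yaml
# Minimal sketch: evaluates the same Jinja2 expression as the template.
# The appowners list below is a hypothetical example, not real data.
- hosts: localhost
  gather_facts: false
  vars:
    appowners:
      - alice
      - bob
  tasks:
    - name: Show the rendered recipient list
      ansible.builtin.debug:
        # product() pairs each owner with the domain,
        # map('join', '@') turns each pair into an address,
        # join(',') builds the comma-separated recipient string.
        msg: "{{ appowners | product(['fedoraproject.org']) | map('join', '@') | join(',') }}"
```

With these placeholder values, the message renders as `alice@fedoraproject.org,bob@fedoraproject.org`.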

It also adds a few Prometheus rules for the websites namespace to detect failed pods, jobs, and builds.
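
A note on the restart rules: `rate(...[10m]) * 60 * 10` multiplies a per-second rate by the 600 seconds in the window, which gives the estimated number of restarts over the last 10 minutes. PromQL's `increase()` is defined as exactly that product, so the rule could also be written as sketched below. This is an equivalent alternative for illustration, not what this PR merges.

```yaml
# Alternative sketch (not part of this PR): same alert, using increase().
# increase(v[10m]) is rate(v[10m]) multiplied by the 600 seconds of the window.
- alert: PodRestarted
  annotations:
    description: Container {{$labels.container}} in Pod {{$labels.namespace}}/{{$labels.pod}} has restarted.
    summary: Containers in pod {{$labels.pod}} has restarted.
  expr: increase(kube_pod_container_status_restarts_total[10m]) > 0
  labels:
    severity: warning
```
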

Two requirements must be met for this to work:

  • Alert routing must be enabled for user-defined projects (1), by adding the following to the cluster-monitoring-config config map in the openshift-monitoring namespace:

    data:
      config.yaml: |
        alertmanagerMain:
          enableUserAlertmanagerConfig: true

  • The SMTP configuration must be moved cluster-wide so that these settings don't have to be replicated for each receiver. This can be done in the Alertmanager configuration (Administration → Cluster Settings → Configuration → Alertmanager):

    global:
      smtp_from: sysadmin-openshift@ocp.stg.fedoraproject.org
      smtp_smarthost: 'bastion01.iad2.fedoraproject.org:25'
      smtp_hello: ocp.stg.fedoraproject.org
      smtp_require_tls: false
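
For context, the first requirement corresponds to a complete ConfigMap object roughly like the one below. This is a sketch that assumes the config map holds no other settings; on a real cluster, any keys already present under `config.yaml` need to be kept alongside the new one.

```yaml
# Sketch of the full object for the first requirement.
# Assumption: no other monitoring settings exist yet; merge with
# whatever is already in config.yaml rather than overwriting it.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    alertmanagerMain:
      enableUserAlertmanagerConfig: true
```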

Related: https://pagure.io/fedora-infrastructure/issue/10671

Build succeeded.

2 new commits added

  • websites: add alerts for pod/job/build errors
  • websites: remove unused i18n script
a year ago

Build succeeded.

Build succeeded.

Nice. Can we make this so we can leverage it for all apps? Or is that the idea, but this is the first one to test with?

Otherwise +1 here... step in the right direction. :)

The alertmanager config with appowners email will be deployed on all apps, but after that, it's up to the appowners to decide which events they want to be notified about.

The rules here are just an example of the kind of events we could monitor, and a way to make sure this is working as expected.

After that, I guess we could set up some base alert rules using the ones here for any new & existing apps.

rebased onto de196fd

a year ago

Pull-Request has been merged by kevin

a year ago

Build succeeded.