#1619 Creating a prometheus resource in cloud-softwarefactory namespace hangs
Closed: Invalid a month ago by arrfab. Opened a month ago by mhuin.

Hello,

Following https://pagure.io/centos-infra/issue/1604 our service account should be allowed to create a Prometheus resource in our namespace.

Attempting to create such a resource ends up hanging:

kubectl describe prometheus/prometheus
Name:         prometheus
Namespace:    cloud-softwarefactory
Labels:       app=observability-stack
              run=prometheus
Annotations:  <none>
API Version:  monitoring.coreos.com/v1
Kind:         Prometheus
Metadata:
  Creation Timestamp:  2025-03-13T16:04:30Z
  Generation:          1
  Resource Version:    425407538
  UID:                 6b09a3f5-c7d8-43f7-9f55-54a08759eba2
Spec:
  Config Maps:
    prometheus-extra-scrapeconfig
  Enable Admin API:     false
  Evaluation Interval:  30s
  External Labels:
    Softwarefactory:  centosinfra-prod
  Pod Monitor Namespace Selector:
  Pod Monitor Selector:
    Match Expressions:
      Key:       sf-monitoring
      Operator:  Exists
  Port Name:     web
  Resources:
    Requests:
      Memory:  400Mi
  Rule Namespace Selector:
  Rule Selector:
    Match Expressions:
      Key:       sf-monitoring
      Operator:  Exists
  Rules:
    Alert:
  Scrape Interval:       30s
  Service Account Name:  sf-service-account
  Service Monitor Namespace Selector:
  Service Monitor Selector:
    Match Expressions:
      Key:       sf-monitoring
      Operator:  Exists
Events:          <none>
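
For reference, the manifest we applied is roughly equivalent to the following (reconstructed from the describe output above; exact label-key casing under externalLabels, and whether the empty namespace selectors were null or {}, may differ from the original):

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
  namespace: cloud-softwarefactory
  labels:
    app: observability-stack
    run: prometheus
spec:
  serviceAccountName: sf-service-account
  configMaps:
    - prometheus-extra-scrapeconfig
  enableAdminAPI: false
  portName: web
  scrapeInterval: 30s
  evaluationInterval: 30s
  externalLabels:
    softwarefactory: centosinfra-prod
  resources:
    requests:
      memory: 400Mi
  # namespace selectors show up empty in describe; null means "own namespace only",
  # {} means "all namespaces"
  podMonitorNamespaceSelector: {}
  podMonitorSelector:
    matchExpressions:
      - key: sf-monitoring
        operator: Exists
  ruleNamespaceSelector: {}
  ruleSelector:
    matchExpressions:
      - key: sf-monitoring
        operator: Exists
  serviceMonitorNamespaceSelector: {}
  serviceMonitorSelector:
    matchExpressions:
      - key: sf-monitoring
        operator: Exists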

No events are logged for the resource; I'd expect at least one pod running a prometheus container to be up. Could you check the prometheus operator's logs and see if any error message appears related to this resource?

Thanks!


Metadata Update from @arrfab:
- Issue tagged with: centos-ci-infra, high-trouble, investigation, low-gain

a month ago

@gwmngilfen: do you think you can take that one? I'm currently busy with plenty of other internal tasks.

@arrfab if I knew where to look, sure :)

I see multiple OCP clusters mentioned in https://docs.centos.org/infra-docs/infra/dns/#static-zones but I don't seem able to reach them, so I'm obviously looking in the wrong place. If you can point me to the right spot I can take a look at the cluster logs and see what it's annoyed about.

Metadata Update from @arrfab:
- Issue assigned to gwmngilfen

a month ago

@gwmngilfen: thanks for having a look. That's the OCP CI cluster, so https://docs.centos.org/centos-sig-guide/ci/ has the link to the console and also explains how to download the oc client to interact with it once authenticated through your ACO/FAS account (and per the ansible ci inventory, you can see you already have cluster admin rights).

Thanks @arrfab, I'm in.

Regarding this:

Could you check the prometheus operator's logs and see if any error message appears related to this resource?

Looking at Operators > Installed Operators, for all projects, I don't see Prometheus as an installed operator. In fact, I see only one operator at all, which suggests I'm looking in the wrong place - any pointers?

It should be installed by default on any OpenShift cluster. The API resources are also listed for it:

kubectl api-resources | grep monitoring.coreos.com
alertmanagerconfigs   amcfg      monitoring.coreos.com/v1beta1   true   AlertmanagerConfig
alertmanagers         am         monitoring.coreos.com/v1        true   Alertmanager
podmonitors           pmon       monitoring.coreos.com/v1        true   PodMonitor
probes                prb        monitoring.coreos.com/v1        true   Probe
prometheuses          prom       monitoring.coreos.com/v1        true   Prometheus
prometheusrules       promrule   monitoring.coreos.com/v1        true   PrometheusRule
servicemonitors       smon       monitoring.coreos.com/v1        true   ServiceMonitor
thanosrulers          ruler      monitoring.coreos.com/v1        true   ThanosRuler

OK, paired up with @mhuin to dig into this, but we're a bit stuck.

The relevant namespace appears to be openshift-monitoring, which has a bunch of pods like:

prometheus-adapter
prometheus-k8s
prometheus-operator
prometheus-operator-admission-webhook

and so forth. However, we tailed the logs on prometheus-operator (all containers) and found (a) nothing has been logged since Mar 9th, and (b) nothing was logged while @mhuin recreated the resource.
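
For reference, the tailing was along these lines (a sketch; prometheus.yaml is a hypothetical filename for the CR manifest shown earlier):

# follow all containers of the operator deployment while the CR is recreated
kubectl -n openshift-monitoring logs deploy/prometheus-operator --all-containers=true -f

# in another terminal, recreate the Prometheus resource
kubectl -n cloud-softwarefactory delete prometheus prometheus
kubectl -n cloud-softwarefactory apply -f prometheus.yaml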

The current suspicion is that the operator isn't watching the cloud-sf namespace for events, but we're unsure how to check that.

Any ideas @arrfab?
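
For what it's worth, one way to inspect the watch scope might be to look at the operator's container arguments (a sketch, assuming the scope is set via flags such as --namespaces / --deny-namespaces, as in upstream prometheus-operator):

# look for any namespace-scoping flags on the operator deployment
kubectl -n openshift-monitoring get deploy prometheus-operator -o yaml | grep -i -- '--.*namespace'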

OK, some update for you @mhuin ...

Looks like there is a piece of work needed to convert from "just enough monitoring for OpenShift itself" to "monitoring any other tenant can use", which is detailed here.

Until that's done, you won't be able to use the Prometheus entities that OpenShift has deployed for itself. We'll have to scope/prioritise that appropriately.

Thanks for the update. Just to clarify, I don't want to use a Prometheus instance currently deployed by OpenShift (with the intent to monitor the cluster); I was hoping to deploy and manage a Prometheus instance myself, in my namespace, using the custom resource provided by the monitoring operator.

For the record, another approach would be to enable user workload monitoring so that our application metrics get collected by the centralized cluster monitoring system. However, I wouldn't expect you to want to add these extra metrics to your collection, nor to allow us access to the cluster's monitoring instance to scrape our metrics from there.
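
(For context, on recent OpenShift releases that is typically a cluster-admin toggle along these lines - shown only as a sketch, not something we're asking for:)

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    # enables the user-workload-monitoring stack cluster-wide
    enableUserWorkload: true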

So let's close this issue; we will deploy Prometheus the regular way, without the operator - it's just more work on our side to get the configuration set up.
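
For completeness, "the regular way" means a plain Deployment plus a hand-maintained configuration ConfigMap instead of operator-generated scrape configs - a rough sketch with hypothetical names and image tag:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: cloud-softwarefactory
  labels:
    run: prometheus
spec:
  replicas: 1
  selector:
    matchLabels:
      run: prometheus
  template:
    metadata:
      labels:
        run: prometheus
    spec:
      serviceAccountName: sf-service-account
      containers:
        - name: prometheus
          image: quay.io/prometheus/prometheus:v2.53.0   # hypothetical tag, pin whatever version is current
          args:
            - --config.file=/etc/prometheus/prometheus.yml
          ports:
            - name: web
              containerPort: 9090
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config   # hypothetical ConfigMap holding prometheus.yml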

Closing, per @mhuin's last comment

Metadata Update from @arrfab:
- Issue close_status updated to: Invalid
- Issue status updated to: Closed (was: Open)

a month ago
