Hello,
Following https://pagure.io/centos-infra/issue/1604 our service account should be allowed to create a Prometheus resource in our namespace.
Attempting to create such a resource ends up hanging:
kubectl describe prometheus/prometheus
Name:         prometheus
Namespace:    cloud-softwarefactory
Labels:       app=observability-stack
              run=prometheus
Annotations:  <none>
API Version:  monitoring.coreos.com/v1
Kind:         Prometheus
Metadata:
  Creation Timestamp:  2025-03-13T16:04:30Z
  Generation:          1
  Resource Version:    425407538
  UID:                 6b09a3f5-c7d8-43f7-9f55-54a08759eba2
Spec:
  Config Maps:
    prometheus-extra-scrapeconfig
  Enable Admin API:     false
  Evaluation Interval:  30s
  External Labels:
    Softwarefactory:  centosinfra-prod
  Pod Monitor Namespace Selector:
  Pod Monitor Selector:
    Match Expressions:
      Key:       sf-monitoring
      Operator:  Exists
  Port Name:  web
  Resources:
    Requests:
      Memory:  400Mi
  Rule Namespace Selector:
  Rule Selector:
    Match Expressions:
      Key:       sf-monitoring
      Operator:  Exists
  Rules:
    Alert:
  Scrape Interval:       30s
  Service Account Name:  sf-service-account
  Service Monitor Namespace Selector:
  Service Monitor Selector:
    Match Expressions:
      Key:       sf-monitoring
      Operator:  Exists
Events:  <none>
No events are logged for the resource; I'd expect at least one pod running a prometheus container to be up. Could you check the prometheus operator's logs and see if any error message appears related to this resource?
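For reference, this is the sort of thing I'd expect to show something (rough commands; the pod label is a guess based on what recent operator versions set):

# events involving the Prometheus object in our namespace
oc -n cloud-softwarefactory get events --field-selector involvedObject.kind=Prometheus

# pods the operator should have spawned from that spec
oc -n cloud-softwarefactory get pods -l app.kubernetes.io/name=prometheus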
Thanks!
Metadata Update from @arrfab: - Issue tagged with: centos-ci-infra, high-trouble, investigation, low-gain
@gwmngilfen: do you think you can take that one? I'm currently busy with plenty of other internal tasks.
@arrfab if I knew where to look, sure :)
I see multiple OCP clusters mentioned in https://docs.centos.org/infra-docs/infra/dns/#static-zones but I don't seem able to reach them, so I'm obviously looking in the wrong place. If you can point me to the right spot I can take a look at the cluster logs and see what it's annoyed about.
Metadata Update from @arrfab: - Issue assigned to gwmngilfen
@gwmngilfen: thanks for having a look. That's the OCP CI cluster, so https://docs.centos.org/centos-sig-guide/ci/ has the link to the console and also explains how to download the oc client to interact with it once authenticated through your ACO/FAS account (and per the ansible ci inventory, you can see you already have cluster admin rights).
Thanks @arrfab, I'm in.
Regarding this:
Could you check the prometheus operator's logs and see if any error message appears related to this resource?
Looking at Operators > Installed Operators, for all projects, I don't see Prometheus as an installed operator. In fact I see only one operator at all, which suggests I'm looking in the wrong place - any pointers?
It should be installed by default on any OpenShift cluster. The API resources for it are also listed:
kubectl api-resources | grep monitoring.coreos.com
alertmanagerconfigs   amcfg      monitoring.coreos.com/v1beta1   true   AlertmanagerConfig
alertmanagers         am         monitoring.coreos.com/v1        true   Alertmanager
podmonitors           pmon       monitoring.coreos.com/v1        true   PodMonitor
probes                prb        monitoring.coreos.com/v1        true   Probe
prometheuses          prom       monitoring.coreos.com/v1        true   Prometheus
prometheusrules       promrule   monitoring.coreos.com/v1        true   PrometheusRule
servicemonitors       smon       monitoring.coreos.com/v1        true   ServiceMonitor
thanosrulers          ruler      monitoring.coreos.com/v1        true   ThanosRuler
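If it's not showing up under Installed Operators, something like this should at least reveal where the operator pods actually live (just a rough check):

oc get pods -A | grep -i prometheus-operator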
OK, paired up with @mhuin to dig into this, but we're a bit stuck.
The relevant namespace appears to be openshift-monitoring, which has a bunch of pods like:

prometheus-adapter
prometheus-k8s
prometheus-operator
prometheus-operator-admission-webhook
and so forth. However, we tailed the logs on prometheus-operator (all containers) and found (a) nothing has been logged since Mar 9th, and (b) nothing was logged while @mhuin recreated the resource.
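For anyone retracing this, the tail was roughly:

oc -n openshift-monitoring logs deploy/prometheus-operator --all-containers -f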
The current suspicion is that the operator isn't watching the cloud-softwarefactory namespace for events, but we're unsure how to check that.
Any ideas @arrfab?
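Our best guess so far (untested, and assuming the operator bundled with OpenShift takes the same flags as the upstream prometheus-operator) would be to inspect the arguments it is started with:

# flags like --namespaces or --prometheus-instance-namespaces would limit which namespaces it watches
oc -n openshift-monitoring get deployment prometheus-operator \
  -o jsonpath='{.spec.template.spec.containers[*].args}'

but we're not confident that's the right place to look.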
OK, some update for you @mhuin ...
Looks like there is a piece of work needed to convert from "just enough monitoring for OpenShift itself" to "monitoring any other tenant can use", which is detailed here.
Until that's done, you won't be able to use the Prometheus entities that OpenShift has deployed for itself. We'll have to scope/prioritise that appropriately.
Thanks for the update. Just to clarify, I don't want to use the Prometheus that OpenShift has deployed for itself (to monitor the cluster); I was hoping to deploy and manage a Prometheus instance myself, in my namespace, using the Custom Resource provided by the monitoring operator.
For the record, another approach would be to enable user workload monitoring so that our application metrics get collected by the centralized cluster monitoring system. However, I wouldn't expect you to want to add these extra metrics to your collection, nor to allow us access to the cluster's monitoring instance to scrape our metrics from there.
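For reference, per the OpenShift documentation, enabling it would roughly amount to (assuming a recent OCP 4.x):

cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF

but that's a cluster-level change, so I understand it may not be something you want to take on.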
So let's close this issue; we will deploy Prometheus the regular way, without the operator - it's just more work on our side to get the configuration set up.
Closing, per @mhuin's last comment
Metadata Update from @arrfab: - Issue close_status updated to: Invalid - Issue status updated to: Closed (was: Open)