#7115 We need application monitoring against openshift routers
Opened 2 years ago by puiterwijk. Modified 21 days ago

Right now we have Nagios checks against the proxies for various services to make sure the entire stack is up, but we also need checks for all OpenShift applications against the OpenShift nodes.

Sometimes the routers get into a funky state and stop updating their configurations, so one router picks up pod changes and the other doesn't. When that happens, 50% of requests fail with an "Application Not Available" error.
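To illustrate why a check through the proxies can miss this, here is a minimal sketch that classifies per-router probe results for one application. It assumes each router has already been probed separately and the outcomes collected in a dict; the function names and status labels are hypothetical, not part of any existing check.

```python
def classify_router_results(results):
    """Classify per-router probe results for a single application.

    `results` maps a router node name to True (application served) or
    False (for example "Application Not Available").
    The returned labels are illustrative only.
    """
    if not results:
        return "unknown"
    ok = sum(1 for served in results.values() if served)
    if ok == len(results):
        return "ok"
    if ok == 0:
        return "down"
    # Some routers serve the app and some do not: the stale-config case.
    return "diverged"


def failing_fraction(results):
    """Fraction of requests expected to fail if load is spread evenly."""
    if not results:
        return 0.0
    failed = sum(1 for served in results.values() if not served)
    return failed / len(results)
```

With two routers where one is stale, `classify_router_results` reports `"diverged"` and `failing_fraction` gives 0.5, which matches the 50% failure rate described above.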

So we need e.g. checks against os-node0{1,2} for https://{greenwave,bodhi}.fedoraproject.org.
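As a rough sketch of what that check matrix could look like, the snippet below builds one `check_http` command line per (node, application) pair. It assumes the stock `check_http` plugin, where `-I` is the address to connect to, `-H` is the virtual host name, `-S` enables TLS and `--sni` enables SNI. The short node names and the helper names are assumptions for illustration; the real Nagios configuration may look different.

```python
from itertools import product

# Expanded from os-node0{1,2} and {greenwave,bodhi}.fedoraproject.org.
NODES = ["os-node01", "os-node02"]
APPS = ["greenwave.fedoraproject.org", "bodhi.fedoraproject.org"]


def check_http_command(node, app_host):
    """Build a check_http argv that hits one router node directly.

    -I: address to connect to (the router node, bypassing the proxies)
    -H: virtual host name sent in the request (the application)
    -S: use TLS; --sni: send the host name in the TLS handshake
    """
    return ["check_http", "-I", node, "-H", app_host, "-S", "--sni"]


def all_checks(nodes=NODES, apps=APPS):
    """Return one command per (node, application) pair."""
    return [check_http_command(n, a) for n, a in product(nodes, apps)]


if __name__ == "__main__":
    for argv in all_checks():
        print(" ".join(argv))
```

Connecting to each node by address while sending the application's host name is what lets a single stale router show up as a failure of its own.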

So, to be clear: we need a local check_http on those two machines that checks, over HTTP, the endpoints of all the applications we serve there?

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

2 years ago

Metadata Update from @smooge:
- Issue assigned to smooge

10 months ago

Metadata Update from @mizdebsk:
- Issue tagged with: monitoring

5 months ago

I've implemented this for Koschei in commit 7ba589c. The check succeeds on os-node0[1-3] but fails on os-node0[4-5]. Are OpenShift routers running on nodes 04 and 05?

Edit: haproxy is not running on nodes 04 and 05. Is this expected?

Routers only run on 'infra' tagged nodes. You can list the nodes carrying that label with:

➜ ~ oc get nodes -l 'node-role.kubernetes.io/infra=true'
NAME                               STATUS    ROLES           AGE       VERSION
os-node01.phx2.fedoraproject.org   Ready     compute,infra   362d      v1.11.0+d4cacc0
os-node02.phx2.fedoraproject.org   Ready     compute,infra   362d      v1.11.0+d4cacc0
os-node03.phx2.fedoraproject.org   Ready     compute,infra   362d      v1.11.0+d4cacc0
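If the node list needs to be consumed by a script, the table output above can be parsed in a few lines. This is only a sketch: it assumes the default column layout shown above, and the function name is hypothetical.

```python
def infra_node_names(oc_output):
    """Extract node names from the default `oc get nodes` table output."""
    names = []
    for line in oc_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if not fields or fields[0] == "NAME":
            continue
        names.append(fields[0])
    return names


# Sample taken from the output quoted above.
SAMPLE = """\
NAME                               STATUS    ROLES           AGE       VERSION
os-node01.phx2.fedoraproject.org   Ready     compute,infra   362d      v1.11.0+d4cacc0
os-node02.phx2.fedoraproject.org   Ready     compute,infra   362d      v1.11.0+d4cacc0
os-node03.phx2.fedoraproject.org   Ready     compute,infra   362d      v1.11.0+d4cacc0
"""

if __name__ == "__main__":
    for name in infra_node_names(SAMPLE):
        print(name)
```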

I will need to create an Ansible host group with these nodes and make Nagios check only the hosts in that group.

That should be fine I would think.

I've defined the os_infra_nodes host group and made Nagios check only the infra nodes. What remains to be done is to add checks for all other relevant apps besides Koschei.
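For readers unfamiliar with Ansible inventories, the sketch below renders a host group in INI inventory format. The group name and node names come from this ticket, but the helper is hypothetical and the actual group in the infrastructure repository may be defined differently.

```python
def render_inventory_group(group, hosts):
    """Render one Ansible INI inventory group as text."""
    lines = [f"[{group}]"] + list(hosts)
    return "\n".join(lines) + "\n"


INFRA_NODES = [
    "os-node01.phx2.fedoraproject.org",
    "os-node02.phx2.fedoraproject.org",
    "os-node03.phx2.fedoraproject.org",
]

if __name__ == "__main__":
    print(render_inventory_group("os_infra_nodes", INFRA_NODES), end="")
```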

Metadata Update from @smooge:
- Assignee reset

21 days ago
