#8194 openshift/stg: cluster-wide downtime
Closed: Fixed a month ago by kevin. Opened a month ago by lucab.

The staging OpenShift cluster is having some hiccups and a bit of downtime.
Symptoms:
* frontend endpoint at https://os.stg.fedoraproject.org/ seems to work
* authN is broken after the initial redirection: the IDP responds with "503 - Application is not available"
* services on the cluster are also broken: the proxy in front of them responds with the same "503 - Application is not available"

From IRC logs, at ~8h UTC this was already being investigated by @cverna.


Metadata Update from @cverna:
- Issue assigned to cverna

a month ago

So all our compute nodes are in the NotReady state:

[root@os-master01 ~][STG]# oc get nodes
NAME                                     STATUS     ROLES           AGE       VERSION
os-master01.stg.phx2.fedoraproject.org   Ready      master          1y        v1.11.0+d4cacc0
os-master02.stg.phx2.fedoraproject.org   Ready      master          1y        v1.11.0+d4cacc0
os-master03.stg.phx2.fedoraproject.org   Ready      master          1y        v1.11.0+d4cacc0
os-node01.stg.phx2.fedoraproject.org     NotReady   compute,infra   1y        v1.11.0+d4cacc0
os-node02.stg.phx2.fedoraproject.org     NotReady   compute,infra   1y        v1.11.0+d4cacc0
os-node03.stg.phx2.fedoraproject.org     NotReady   compute,infra   1y        v1.11.0+d4cacc0
os-node04.stg.phx2.fedoraproject.org     NotReady   compute,infra   1y        v1.11.0+d4cacc0
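For anyone wanting to spot this quickly, here is a minimal sketch that filters the listing above down to the affected nodes. The function name `not_ready_nodes` is made up for illustration, and it only assumes the column layout shown in the `oc get nodes` output.

```shell
#!/usr/bin/env bash
# Print the names of nodes whose STATUS column is not exactly "Ready".
# Reads `oc get nodes` output on stdin and skips the header line.
# Note: a compound status such as "Ready,SchedulingDisabled" is also listed,
# because the comparison is an exact string match.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Example use on a master (not executed here):
#   oc get nodes | not_ready_nodes
```

With the listing above as input, this should print the four os-node0x hosts and none of the masters.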

After looking at the atomic-openshift-node service logs on a compute node, it seems to hang after loading the cert/key pair from /etc/origin/node/certificates/kubelet-client-current.pem.

Checking the certificate expiration date, we have:

openssl x509 -enddate -noout -in /etc/origin/node/certificates/kubelet-client-2018-09-10-22-38-43.pem
notAfter=Sep 10 22:34:00 2019 GMT

So our certs expired yesterday. I am not sure how we generate these or how we deploy them, so I'll wait for @kevin to be around and work with him.

I have also checked the production nodes and the certs will expire on Sep 27 22:14:00 2019 GMT so we should probably renew them now.
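To make the "how long do we have" question concrete, here is a rough sketch that turns the `notAfter=` line printed by `openssl x509 -enddate` into a number of days. The helper name `days_until_expiry` and its optional second argument are assumptions for illustration, and it relies on GNU `date -d` being available.

```shell
#!/usr/bin/env bash
# Print the whole days between a reference time and an openssl "notAfter=" line.
# $1: the line printed by `openssl x509 -enddate -noout -in <cert>`
# $2: optional reference time as epoch seconds (defaults to now)
# Integer division truncates toward zero, so 0 means less than a day on
# either side of expiry; a negative value means expired at least a day ago.
# Assumes GNU date (for `date -d`).
days_until_expiry() {
    local enddate="${1#notAfter=}"        # strip the "notAfter=" prefix
    local now_epoch="${2:-$(date +%s)}"
    local end_epoch
    end_epoch="$(date -u -d "$enddate" +%s)" || return 1
    echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Example use on a node (not executed here):
#   days_until_expiry "$(openssl x509 -enddate -noout \
#       -in /etc/origin/node/certificates/kubelet-client-current.pem)"
```

For a plain yes/no answer, `openssl x509 -checkend <seconds>` does a similar job: it exits non-zero if the cert expires within that many seconds.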

I am working on fixing this issue now.

Metadata Update from @kevin:
- Issue assigned to kevin (was: cverna)
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: OpenShift

a month ago

Things should be back up and working for the most part, but I'm still trying to fix metrics, so there may be some brief outages as it rolls new certs for those.

thanks @kevin!

> I have also checked the production nodes and the certs will expire on Sep 27 22:14:00 2019 GMT so we should probably renew them now.

Can we handle prod too so we don't have an even bigger problem there? In general, how do we make sure our certs don't expire on us like this in the future?

After freeze, yes, we can do prod. I don't want to disrupt things before then.

I also have a commit ready that will add checking of all these certs to nagios. :)
I'll push that after freeze as well. Although I suppose I could do a freeze break when I get time...
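For readers curious what such a check might look like, here is a rough sketch of the threshold logic. It is not the commit mentioned above: the function name `cert_nagios_status` and the 30/7 day thresholds are assumptions, and only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) follow the usual Nagios plugin convention.

```shell
#!/usr/bin/env bash
# Map "days until the cert expires" to a Nagios-style status line and code.
# $1: days left (integer, may be negative if already expired)
# $2: warning threshold in days  (illustrative default: 30)
# $3: critical threshold in days (illustrative default: 7)
cert_nagios_status() {
    local days="$1" warn="${2:-30}" crit="${3:-7}"
    if ! [[ "$days" =~ ^-?[0-9]+$ ]]; then
        echo "UNKNOWN - could not determine certificate expiry"
        return 3
    elif (( days <= crit )); then
        echo "CRITICAL - certificate expires in ${days} days"
        return 2
    elif (( days <= warn )); then
        echo "WARNING - certificate expires in ${days} days"
        return 1
    fi
    echo "OK - certificate expires in ${days} days"
    return 0
}
```

With these example thresholds, a cert with about 16 days left (roughly where prod was at the time of this ticket) would already be reported as WARNING.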

In any case this should be all fixed now.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago

FYI, fixed prod certs now.
