Unplanned Outage: OpenShift cluster - 2021-05-18 17UTC
There was an outage starting at 2021-05-18 17UTC,
which lasted about 45 minutes.
The issue was that all the nodes stopped being ready, with the error:
"Part of the existing bootstrap client certificate is expired"
All the pods were still there, but the nodes dropped off because of the expired certificate.
I did two things to bring it back (although it's not clear whether both were needed):
1. On os-master01: oc get csr -o name | xargs oc adm certificate approve
There were tons of old CSRs that had not been approved. The nodes may have requested new certs that were not being approved for some reason.
2. On os-control01: ansible-playbook -i cluster-inventory -e openshift_certificate_expiry_fail_on_warn=False playbooks/redeploy-certificates.yml
This ran OK at first, but then failed on os-node01. After that, things started working, so I think step 1 was the fix.
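If this comes up again, it may be worth approving only the CSRs that are still pending, rather than everything `oc get csr -o name` returns. A minimal sketch follows. The helper name `pending_csrs` is made up for illustration, and it assumes the last column of `oc get csr --no-headers` output is the condition, which is worth checking against the oc version on the cluster:

```shell
#!/bin/sh
# pending_csrs (hypothetical helper): reads `oc get csr --no-headers` output
# on stdin and prints the name (first column) of every request whose last
# column is exactly "Pending". Assumes CONDITION is the last column.
pending_csrs() {
  awk '$NF == "Pending" { print $1 }'
}

# Assumed usage on a master (not run here; `xargs -r` is the GNU option that
# skips the command when there is no input):
#   oc get csr --no-headers | pending_csrs | xargs -r oc adm certificate approve
```

Running the `oc get csr --no-headers | pending_csrs` part alone first shows which requests would be approved before anything is changed.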
Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)