#9963 Unplanned Outage: OpenShift cluster - 2021-05-18 17UTC
Closed: Fixed a month ago by kevin. Opened a month ago by kevin.

There was an outage starting at 2021-05-18 17:00 UTC,
which lasted about 45 minutes.

The issue was that all the nodes stopped being ready, reporting:

"Part of the existing bootstrap client certificate is expired"

All the pods were still there, but the nodes dropped off due to the expired cert.


I did two things to bring it back (although it's not clear whether both were needed):

  1. On os-master01: oc get csr -o name | xargs oc adm certificate approve
    There were tons of old CSRs that had not been approved. The nodes may have requested new certs that, for some reason, were never approved.

  2. On os-control01: ansible-playbook -i cluster-inventory -e openshift_certificate_expiry_fail_on_warn=False playbooks/redeploy-certificates.yml
    This ran OK at first, but then failed on os-node01. At that point things started working again, so I think step 1 was the fix.
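
If this comes up again, it may be safer to approve only the CSRs that are still pending rather than everything `oc get csr` returns. The following is a minimal sketch, not what was run during the outage. It assumes the CONDITION column is the last field of `oc get csr --no-headers` output, and `pending_csrs` is a hypothetical helper name introduced here for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get csr --no-headers` output on stdin and
# print the NAME (first field) of every row whose last field, assumed to be
# the CONDITION column, is exactly "Pending".
pending_csrs() {
    awk '$NF == "Pending" { print $1 }'
}

# Intended use on a master such as os-master01 (not executed by this sketch):
#   oc get csr --no-headers | pending_csrs | xargs -r oc adm certificate approve
```

Listing the pending names first, without the final `xargs` stage, gives a chance to review what would be approved. The `-r` flag stops `xargs` from running the approve command when nothing is pending.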

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago
