Hi all, the certificate renewal logic for updates.coreos.fedoraproject.org (and possibly more) stopped working properly, likely some time ago.
During the night the existing certificate reached its expiration date:
depth=0 CN = updates.coreos.fedoraproject.org
verify error:num=10:certificate has expired
notAfter=Jan 18 21:12:12 2022 GMT
See https://github.com/coreos/fedora-coreos-tracker/issues/1072.
@nb ran the proxies playbook and this seems to be fixed now.
$ export SITE_URL="updates.coreos.fedoraproject.org"
$ export SITE_SSL_PORT="443"
$ openssl s_client -connect ${SITE_URL}:${SITE_SSL_PORT} \
    -servername ${SITE_URL} 2> /dev/null | openssl x509 -noout -dates
notBefore=Jan 19 15:05:10 2022 GMT
notAfter=Apr 19 15:05:09 2022 GMT
Metadata Update from @dustymabe: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
Should we do a RCA first?
Well, that one endpoint is fixed but generally ingress does not seem to be in a healthy place.
Just as a quick example, some more broken TLS endpoints on this same project:
* status.updates.coreos.fedoraproject.org
* raw-updates.coreos.fedoraproject.org
* status.raw-updates.coreos.fedoraproject.org
Metadata Update from @lucab: - Issue status updated to: Open (was: Closed)
I've fixed the rest of these. ;)
The cause is that the renewals happen via ansible on playbook runs. Normally we run playbooks pretty often to change things, but I was on PTO for much of December/early January and things were very quiet, so no one had to run any playbooks. ;(
Ideally we would think about some better monitoring on this at least (which we do have for some certs/endpoints, but not all) or a way to auto-renew/deploy.
This was fixed by running the proxies playbook against those endpoints...
Metadata Update from @kevin: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)
As the renewal of these certificates seems to be a manual operation by the sysadmins, do you have a monitoring/alerting system where we can add our well-known TLS endpoints so that you get an alert/call-to-action if they are about to expire?
Additionally, these services are running on an openshift cluster and the certificates come from Let's Encrypt (LE). To the best of my knowledge, there is room for having those renewals auto-managed, avoiding the need for manual operations in the first place.
We do have checks for some certs in nagios. We could add more there...
Or I wonder... for the letsencrypt ones we could perhaps just generate an email report.
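A rough sketch of what such a report could look like, reusing the openssl check from earlier in this ticket. This is only an illustration, not what is deployed: it assumes bash, GNU date (for `date -d`) and openssl on the host, and the function names (`days_left`, `not_after`, `report`) are made up for this example.

```shell
#!/usr/bin/env bash
# Sketch of a certificate expiry report. Assumes bash, GNU date and openssl.
# Function names are hypothetical; nothing here is taken from the real playbooks.

# The FCOS update service endpoints mentioned in this ticket.
ENDPOINTS="updates.coreos.fedoraproject.org
raw-updates.coreos.fedoraproject.org
status.updates.coreos.fedoraproject.org
status.raw-updates.coreos.fedoraproject.org"

# days_left NOT_AFTER [NOW]
# Whole days from NOW (default: current time) until NOT_AFTER, truncated
# toward zero. Zero or negative means expired or expiring within a day.
days_left() {
    local end now
    end=$(date -u -d "$1" +%s) || return 1
    now=$(date -u -d "${2:-now}" +%s) || return 1
    echo $(( (end - now) / 86400 ))
}

# not_after HOST [PORT]
# Prints the notAfter date of the certificate served by HOST (port 443 by default).
not_after() {
    openssl s_client -connect "$1:${2:-443}" -servername "$1" </dev/null 2>/dev/null \
        | openssl x509 -noout -enddate | cut -d= -f2
}

# report
# Prints one line per endpoint with the remaining certificate lifetime.
report() {
    local host end
    for host in $ENDPOINTS; do
        end=$(not_after "$host")
        if [ -n "$end" ]; then
            echo "$host: $(days_left "$end") days left (notAfter=$end)"
        else
            echo "$host: could not fetch certificate"
        fi
    done
}
```

Calling `report` from a cron job and piping the output into whatever mailer is available would give a periodic email; the endpoint list and any warning threshold would need adjusting to taste.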
We could look at managing them in openshift after we migrate to the OpenShift 4 cluster. 3.11 doesn't have a handy operator for this. ;(
As I've been asked separately, for the FCOS update service we have a total of 4 production TLS endpoints:
* updates.coreos.fedoraproject.org
* raw-updates.coreos.fedoraproject.org
* status.updates.coreos.fedoraproject.org
* status.raw-updates.coreos.fedoraproject.org
@kevin Until better infra is available, could we add the above endpoints to the Nagios cert check? It'd be good to have some degree of protection against this happening again.
Added in b388a003b41b55b3b9c18903965d772d7c196396 (of course nagios is weird and I will probably need to tweak it, but that's the idea anyhow. :)