#10485 IAD2: certificate renewal stopped on updates.coreos.fedoraproject.org
Closed: Fixed with Explanation 2 years ago by kevin. Opened 2 years ago by lucab.

Hi all,
the certificate renewal logic for updates.coreos.fedoraproject.org (and possibly more) stopped working properly, likely some time ago.

During the night the existing certificate reached its expiration date:

depth=0 CN = updates.coreos.fedoraproject.org
verify error:num=10:certificate has expired
notAfter=Jan 18 21:12:12 2022 GMT

See https://github.com/coreos/fedora-coreos-tracker/issues/1072.


@nb ran the proxies playbook and this seems to be fixed now.

$ export SITE_URL="updates.coreos.fedoraproject.org"
$ export SITE_SSL_PORT="443"
$ openssl s_client -connect "${SITE_URL}:${SITE_SSL_PORT}" \
  -servername "${SITE_URL}" < /dev/null 2> /dev/null | openssl x509 -noout -dates
notBefore=Jan 19 15:05:10 2022 GMT
notAfter=Apr 19 15:05:09 2022 GMT

Metadata Update from @dustymabe:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)


Should we do an RCA first?

Well, that one endpoint is fixed but generally ingress does not seem to be in a healthy place.

Just as a quick example, some more broken TLS endpoints on this same project:
* status.updates.coreos.fedoraproject.org
* raw-updates.coreos.fedoraproject.org
* status.raw-updates.coreos.fedoraproject.org

Metadata Update from @lucab:
- Issue status updated to: Open (was: Closed)


I've fixed the rest of these. ;)

The cause is that the renewals happen via Ansible on playbook runs. Normally we run playbooks pretty often to change things, but I was on PTO for much of December/early January and things were very quiet, so no one had to run any playbooks. ;(

Ideally we would think about some better monitoring on this at least (which we do have for some certs/endpoints, but not all) or a way to auto-renew/deploy.

This was fixed by running the proxies playbook against those endpoints...

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)


As the renewal of these certificates seems to be a manual operation by the sysadmins, do you have a monitoring/alerting system where we can add our well-known TLS endpoints so that you get an alert/call-to-action if they are about to expire?

Additionally, these services are running on an OpenShift cluster and the certificates come from Let's Encrypt (LE). To the best of my knowledge, there is room to have those renewals auto-managed, avoiding the need for manual operations in the first place.

We do have checks for some certs in nagios. We could add more there...

Or I wonder... for the Let's Encrypt ones we could perhaps just generate an email report.

We could look at managing them in OpenShift after we migrate to the OpenShift 4 cluster. 3.11 doesn't have a handy operator for this. ;(

As I've been asked separately, for the FCOS update service we have a total of 4 production TLS endpoints:
updates.coreos.fedoraproject.org
raw-updates.coreos.fedoraproject.org
status.updates.coreos.fedoraproject.org
status.raw-updates.coreos.fedoraproject.org

@kevin Until better infra is available, could we add the above endpoints to the Nagios cert check? It'd be good to have some degree of protection against this happening again.
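As a stopgap illustration of the kind of expiry check being asked for, here is a minimal sketch. It is not the actual Fedora Nagios check. It assumes GNU date (for `date -d`) and the openssl CLI; the function names and the `WARN_DAYS` threshold are hypothetical, and the return codes only loosely follow the usual Nagios plugin convention (0 OK, 1 WARNING, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch only: assumes GNU date (for `date -d`) and the openssl CLI.
# Function names and WARN_DAYS are hypothetical, not part of any existing check.

# Print the notAfter date of the certificate served by HOST ($1) on PORT ($2, default 443).
fetch_not_after() {
  openssl s_client -connect "$1:${2:-443}" -servername "$1" </dev/null 2>/dev/null \
    | openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2
}

# Print whole days between NOW_EPOCH ($2) and NOT_AFTER ($1); negative once expired.
days_until_expiry() {
  expiry_epoch=$(date -u -d "$1" +%s) || return 1
  echo $(( (expiry_epoch - $2) / 86400 ))
}

# Report one endpoint ($1); warn when fewer than WARN_DAYS (default 14) days remain.
check_endpoint() {
  not_after=$(fetch_not_after "$1")
  if [ -z "$not_after" ]; then
    echo "UNKNOWN $1: could not read certificate"
    return 3
  fi
  days=$(days_until_expiry "$not_after" "$(date -u +%s)") || return 3
  if [ "$days" -lt "${WARN_DAYS:-14}" ]; then
    echo "WARNING $1: certificate expires in $days days ($not_after)"
    return 1
  fi
  echo "OK $1: certificate valid for $days more days"
}
```

It could be run from cron over the four endpoints listed above, for example `for h in updates.coreos.fedoraproject.org raw-updates.coreos.fedoraproject.org; do check_endpoint "$h"; done`, mailing any non-OK lines. Since Let's Encrypt certificates are valid for 90 days (as the notBefore/notAfter dates earlier in this ticket show), a threshold of a couple of weeks leaves time to run the playbook.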

Added in b388a003b41b55b3b9c18903965d772d7c196396 (of course Nagios is weird and I will probably need to tweak it, but that's the idea anyhow). :)
