Today (Friday April 16th, at 10:17pm UTC), Zabbix informed us (CentOS infra team) that the noggin instance deployed in the Fedora OpenShift was no longer available. The displayed message is the standard one from OpenShift:
Application is not available
The application is currently not serving requests at this endpoint. It may not have been started or is still starting.
Possible reasons you are seeing this page:
- The host doesn't exist. Make sure the hostname was typed correctly and that a route matching this hostname exists.
- The host exists, but doesn't have a matching path. Check if the URL path was typed correctly and that the route was created using the desired path.
- Route and path matches, but all pods are down. Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.
Just creating a ticket so that someone from the Fedora Infra team can have a look.
Thanks!
It came back by itself and we are working out why it went down.
Metadata Update from @smooge: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: authentication, medium-gain, medium-trouble, ops, outage
It seems dirsrv (389) crashed or exited on all 3 ipa servers:
[16/Apr/2021:20:22:25.242751483 +0000] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meToipa03.iad2.fedoraproject.org" (ipa03:389): Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
[16/Apr/2021:20:24:17.922391453 +0000] - ERR - setup_pr_read_pds - Not listening for new connections - too many fds open
So, I think we ran out of fds? Perhaps we need to raise a limit here.
Metadata Update from @kevin: - Issue untagged with: authentication, medium-gain, medium-trouble, ops, outage - Issue priority set to: Needs Review (was: Waiting on Assignee)
Metadata Update from @kevin: - Issue tagged with: authentication, medium-gain, medium-trouble, ops, outage
nsslapd-maxdescriptors is the relevant parameter in dirsrv. You might also need to raise the ulimit of the process (if that is smaller), and maybe even the fs.file-max sysctl setting, depending on how many descriptors are needed.
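To see which of those limits is actually the tight one, something like the following could be run on an ipa server. This is only a rough sketch, assuming a Linux host with /proc mounted; the function names (parse_max_open_files, fd_report) are made up for illustration, and the pid of the dirsrv process has to be looked up separately. It does not read nsslapd-maxdescriptors itself, only the process ulimit, the current fd count and fs.file-max.

```python
import os
from pathlib import Path


def to_int(value):
    """Convert a limits field to int; 'unlimited' becomes None."""
    return None if value == "unlimited" else int(value)


def parse_max_open_files(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text."""
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields look like: Max open files <soft> <hard> files
            fields = line.split()
            return to_int(fields[3]), to_int(fields[4])
    raise ValueError("no 'Max open files' line found")


def fd_report(pid):
    """Collect fd usage and limits for one process (hypothetical helper)."""
    proc = Path("/proc") / str(pid)
    soft, hard = parse_max_open_files((proc / "limits").read_text())
    return {
        "open_fds": len(os.listdir(proc / "fd")),  # needs root or same user
        "soft_limit": soft,                        # the process ulimit -n
        "hard_limit": hard,
        "fs_file_max": int(Path("/proc/sys/fs/file-max").read_text()),
    }
```

Calling fd_report(pid) with the pid of the ns-slapd process returns a dict; if open_fds is close to soft_limit, the ulimit is the one to raise, regardless of what nsslapd-maxdescriptors is set to.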
Metadata Update from @smooge: - Issue priority set to: Waiting on Assignee (was: Needs Review)
This needs to wait until after the freeze.
Metadata Update from @kevin: - Issue untagged with: outage - Issue tagged with: unfreeze
So, looking... we have:
nsslapd-maxdescriptors: 262144
which seems like a lot to me.
I'm going to close this now and investigate more if it reoccurs. For now we have been fine.
Metadata Update from @kevin: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)