#9884 accounts.centos.org and accounts.fedoraproject.org are unreachable
Closed: Fixed 2 days ago by kevin. Opened a month ago by arrfab.

Today (Friday, April 16th, at 10:17pm UTC), Zabbix informed us (CentOS Infra team) that the noggin instance deployed in the Fedora OpenShift was no longer available.
The displayed message is the standard one from OpenShift:

Application is not available

The application is currently not serving requests at this endpoint. It may not have been started or is still starting.


Possible reasons you are seeing this page:

    The host doesn't exist. Make sure the hostname was typed correctly and that a route matching this hostname exists.
    The host exists, but doesn't have a matching path. Check if the URL path was typed correctly and that the route was created using the desired path.
    Route and path matches, but all pods are down. Make sure that the resources exposed by this route (pods, services, deployment configs, etc) have at least one pod running.

Just creating a ticket so that someone from the Fedora Infra team can have a look.

Thanks !


It came back by itself, and we are working out why it went down.

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: authentication, medium-gain, medium-trouble, ops, outage

a month ago

It seems all 3 ipa servers had dirsrv / 389 crash/exit:

[16/Apr/2021:20:22:25.242751483 +0000] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meToipa03.iad2.fedoraproject.org" (ipa03:389): Attempting to release replica, but unable to receive endReplication extended operation response from the replica. Error -5 (Timed out)
[16/Apr/2021:20:24:17.922391453 +0000] - ERR - setup_pr_read_pds - Not listening for new connections - too many fds open

So, I think we ran out of fds? Perhaps we need to raise a limit here.
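
To check whether the other IPA servers logged the same thing at the same time, something like the following could be run over each dirsrv errors log. This is only a rough sketch: the regular expression is inferred from the two lines quoted above, not from any documented log format, and the function names are made up for illustration. It also assumes an English/C locale so that `%b` matches "Apr".

```python
import re
from datetime import datetime

# Pattern inferred from the two log lines quoted in this ticket (an assumption,
# not an official description of the dirsrv errors log format).
LINE_RE = re.compile(
    r"^\[(?P<ts>\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2})\.(?P<frac>\d+) (?P<tz>[+-]\d{4})\]"
    r" - (?P<level>\w+) - (?P<source>\S+) - (?P<msg>.*)$"
)


def parse_error_line(line):
    """Parse one errors-log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    # The fractional part has nanosecond precision, which strptime cannot parse,
    # so it is dropped here and only whole seconds are kept.
    when = datetime.strptime(f"{m['ts']} {m['tz']}", "%d/%b/%Y:%H:%M:%S %z")
    return {
        "when": when,
        "level": m["level"],
        "source": m["source"],
        "message": m["msg"],
    }


def fd_exhaustion_events(lines):
    """Return the parsed entries whose message mentions 'too many fds open'."""
    events = []
    for line in lines:
        entry = parse_error_line(line)
        if entry is not None and "too many fds open" in entry["message"]:
            events.append(entry)
    return events
```

Feeding it the errors log from each of the three servers would show whether the "too many fds open" messages line up in time.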

Metadata Update from @kevin:
- Issue untagged with: authentication, medium-gain, medium-trouble, ops, outage
- Issue priority set to: Needs Review (was: Waiting on Assignee)

a month ago

Metadata Update from @kevin:
- Issue tagged with: authentication, medium-gain, medium-trouble, ops, outage

a month ago

nsslapd-maxdescriptors is the relevant parameter in dirsrv. You might also need to raise the ulimit of the process (if that is smaller), and maybe even the fs.file-max sysctl setting, depending on how many descriptors are needed.
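
A quick way to see which of those three limits is the tightest, as a sketch only. It assumes Linux, where the per-process limit shows up as the "Max open files" row of `/proc/<pid>/limits`; the helper names are hypothetical, and the numbers in the usage note are illustrative, not values taken from the IPA servers. Keep in mind that fs.file-max is a system-wide total rather than a per-process cap, so comparing it directly is only a rough check.

```python
def parse_nofile(limits_text):
    """Return (soft, hard) for 'Max open files' from the text of /proc/<pid>/limits.

    A value of 'unlimited' is returned as None.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            convert = lambda v: None if v == "unlimited" else int(v)
            return convert(soft), convert(hard)
    raise ValueError("no 'Max open files' line found")


def tightest_limit(maxdescriptors, soft_nofile, file_max):
    """Return (name, value) of the smallest known limit.

    maxdescriptors: nsslapd-maxdescriptors from the dirsrv config
    soft_nofile:    soft 'Max open files' limit of the process (None if unlimited)
    file_max:       fs.file-max sysctl value (system-wide, not per process)
    """
    candidates = {
        "nsslapd-maxdescriptors": maxdescriptors,
        "process soft nofile": soft_nofile,
        "fs.file-max": file_max,
    }
    known = {name: value for name, value in candidates.items() if value is not None}
    name = min(known, key=known.get)
    return name, known[name]
```

For example, `tightest_limit(262144, 8192, 1000000)` returns `("process soft nofile", 8192)`, which would mean the ulimit is what needs raising first.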

Metadata Update from @smooge:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

a month ago

This needs to wait until after the freeze.

Metadata Update from @kevin:
- Issue untagged with: outage
- Issue tagged with: unfreeze

a month ago

So, looking... we have:

nsslapd-maxdescriptors: 262144

which seems like a lot to me.
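
To judge whether that really is a lot, it helps to compare it with how many descriptors the process actually holds. A minimal sketch, assuming Linux (`/proc/<pid>/fd`) and enough privileges to read another process's fd directory; the function names are hypothetical and the pid of ns-slapd would need to be looked up separately.

```python
import os


def open_fd_count(pid):
    """Count the descriptors a process currently has open, via /proc/<pid>/fd."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def usage_ratio(open_fds, limit):
    """Fraction of the descriptor limit currently in use."""
    return open_fds / limit
```

With `open_fd_count(pid)` for the ns-slapd process and 262144 as the limit, a ratio that stays near zero in normal operation would suggest the limit itself is fine and something else (a leak or a connection storm) used up the descriptors during the outage.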

I'm going to close this now and investigate more if it recurs. For now we have been fine.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 days ago

