#11733 noggin is currently down (impacting both Fedora and CentOS noggin frontend)
Closed: Fixed with Explanation a month ago by zlopez. Opened a month ago by arrfab.

https://accounts.fedoraproject.org and https://accounts.centos.org are currently marked as undeployed in OpenShift, and so are not usable.


Metadata Update from @zlopez:
- Issue assigned to zlopez
- Issue priority set to: None (was: Needs Review)
- Issue tagged with: high-gain, medium-trouble, ops

a month ago

I'm the one responsible for that, and I'm currently trying to restore IPA.

What happened: I accidentally deleted replication agreements from all IPA machines.

What I did: I tried to restore the missing replication agreements by following this guide, which didn't help.

We tried to restore IPA to its former glory, but we weren't able to figure out how. This is the error we always encountered:

ldap_child[56078]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Decrypt integrity check failed. Unable to create GSSAPI-encrypted LDAP connection.

What we tried:
Restore ipa01 from old backup (daily backup)
Reinstall ipa03 from scratch and use ipa-restore from ipa02 data backup
Reinstall ipa02 from scratch and use ipa-restore from ipa02 data backup
Tried to regenerate kerberos keytab

Currently Kerberos and authentication work with the partially broken ipa02, so we are leaving it in this state and will try to get some help from the IPA folks when they are available.

The current state:

ipa01 (rhel8) is up, but ipa services are all turned off there.
ipa02 (rhel8) is up and ipa is running there, but has errors/issues.
ipa03 is down

On ipa02 you can get a Kerberos ticket for admin, but when trying to use the client, it alternates between:

ipa: ERROR: Insufficient access: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Credential cache is empty)

and

ipa: ERROR: ProtocolError: <ProtocolError for ipa02.iad2.fedoraproject.org/ipa/session/json: 401 Unauthorized>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/__init__.py", line 120, in get_package
    plugins = api._remote_plugins
AttributeError: 'API' object has no attribute '_remote_plugins'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ipalib/cli.py", line 1469, in run
    api.finalize()
  File "/usr/lib/python3.6/site-packages/ipalib/plugable.py", line 753, in finalize
    self.__do_if_not_done('load_plugins')
  File "/usr/lib/python3.6/site-packages/ipalib/plugable.py", line 432, in __do_if_not_done
    getattr(self, name)()
  File "/usr/lib/python3.6/site-packages/ipalib/plugable.py", line 632, in load_plugins
    for package in self.packages:
  File "/usr/lib/python3.6/site-packages/ipalib/__init__.py", line 952, in packages
    ipaclient.remote_plugins.get_package(self),
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/__init__.py", line 128, in get_package
    plugins = schema.get_package(server_info, client)
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/schema.py", line 546, in get_package
    schema = Schema(client)
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/schema.py", line 395, in __init__
    fingerprint, ttl = self._fetch(client, ignore_cache=read_failed)
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/schema.py", line 407, in _fetch
    client.connect(verbose=False)
  File "/usr/lib/python3.6/site-packages/ipalib/backend.py", line 69, in connect
    conn = self.create_connection(*args, **kw)
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 1064, in create_connection
    command([], {})
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 1276, in _call
    return self.__request(name, args)
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 1243, in __request
    verbose=self.__verbose >= 3,
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 730, in single_request
    response.msg)
xmlrpc.client.ProtocolError: <ProtocolError for ipa02.iad2.fedoraproject.org/ipa/session/json: 401 Unauthorized>
ipa: ERROR: an internal error has occurred
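To see how often each of the two failures shows up across repeated attempts, the captured outputs could be tallied with something like the sketch below. It is only an illustration: `classify_ipa_error`, `tally`, and the labels are hypothetical names, and the matching relies solely on substrings that appear in the output quoted in this ticket.

```python
from collections import Counter

# Substrings taken from the two error outputs quoted above.
# The labels on the left are made up for illustration.
SIGNATURES = {
    "empty_ccache": "Credential cache is empty",
    "http_401": "401 Unauthorized",
}


def classify_ipa_error(stderr_text: str) -> str:
    """Return the label of the first known signature found in the text."""
    for label, needle in SIGNATURES.items():
        if needle in stderr_text:
            return label
    return "unknown"


def tally(outputs):
    """Count how many captured outputs fall into each failure class."""
    return Counter(classify_ipa_error(text) for text in outputs)
```

Feeding `tally` a list of captured client outputs gives a per-class count, which may help tell whether the alternation is regular or random.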

/var/log/httpd/error_log has:

[Tue Jan 23 00:08:08.715155 2024] [wsgi:error] [pid 5370:tid 139692100007680] (11)Resource temporarily unavailable: [client 10.3.163.76:58076] mod_wsgi (pid=5370): Unable to connect to WSGI daemon process 'kdcproxy' on '/etc/httpd/run/wsgi.4365.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
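When scanning error_log for more entries like this, it can help to pull out which WSGI daemon process and socket are involved. A minimal sketch follows, assuming other occurrences are worded like the single line quoted above; `parse_wsgi_connect_error` is a hypothetical helper, not part of mod_wsgi or IPA.

```python
import re

# Pattern written against the one error_log line quoted above; other
# mod_wsgi messages may be worded differently and will not match.
WSGI_CONNECT_RE = re.compile(
    r"\(pid=(?P<pid>\d+)\): Unable to connect to WSGI daemon process "
    r"'(?P<daemon>[^']+)' on '(?P<socket>[^']+)'"
)


def parse_wsgi_connect_error(line: str):
    """Return (pid, daemon, socket) from a matching line, or None."""
    # Pasted logs are often wrapped, so collapse all whitespace runs first.
    match = WSGI_CONNECT_RE.search(" ".join(line.split()))
    if match is None:
        return None
    return int(match["pid"]), match["daemon"], match["socket"]
```

For the line above this yields the pid 5370, the daemon name `kdcproxy`, and the socket path, which can then be compared against what actually exists under `/etc/httpd/run/`.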

Currently $ kinit mtasaka@FEDORAPROJECT.ORG is unresponsive, is this issue related to this ticket? Or should I file another ticket?

@mtasaka That would be related; Kerberos is not in a great state either.

What do you see in krb5kdc.log on ipa01 and ipa02?

What do you see in krb5kdc.log on ipa01 and ipa02?

( This is not directed to me, right? I am not infra member, I just wanted to do fedpkg build so I have to do kinit beforehand but currently it is unresponsive. )

@mtasaka No that wasn't directed to you.

We are currently trying to find and fix the root cause of the outage with @abbra

It seems that the issue is with missing SIDs and ID ranges for entries. These are required by the version of the IPA server we currently have installed.

Right now we are assigning ID ranges and SIDs to the existing entries in the IPA database.

Just to mention, this affects the mass rebuild as well. A qsort regression in glibc (https://bugzilla.redhat.com/show_bug.cgi?id=2259845) is a blocker, so we might need to rebuild things again or stop the build until it gets sorted out. For that we need to get root on the compose-branched machine, which needs auth fixed!

We were able to regenerate SIDs and ID ranges. But there are still issues with kdcproxy and ipa-tomcat on ipa01.

https://accounts.fedoraproject.org is working now, and so is https://accounts.centos.org. IPA is still not in great shape, but most of the authentication is working now.

ipa02 seems to be processing things OK for now.

ipa03 isn't installing due to a kra issue. We are going to try and address that tomorrow when more ipa folks are around to assist.

Once that's done and in sync, we can look at redoing 01 and then finally 02.

In any case everything should be working for the moment.

If people have issues with Noggin, clear your cookies for accounts.fedoraproject.org.

I confirmed that it works for me now.

Thanks

Current status:

ipa01 and ipa03 are reinstalled with rhel9 and (mostly) operating normally (see below)
ipa02 is still rhel8

The last issue is KRA. It's enabled on ipa02, but not on 01/03. We have a process to remove it from 02, which we are going to try tomorrow.

Once that's out, we can then reinstall ipa02 and everything should be back to normal.
We should also write up a blameless retrospective of this incident (what went wrong, how we could do better, etc).

From an end user perspective everything should be working normally again now.

I tested the IPA UI and can't log in; it outputs:

IPA error 907 : NetworkError
cannot connect to 'ldapi://%2Frun%2Fslapd-FEDORAPROJECT-ORG.socket': 

The error message usually means that dirsrv (389-DS) is not running. Check ipactl status.
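As a quick complement to `ipactl status`, the percent-encoded `ldapi://` URL from the error can be decoded to a filesystem path and checked directly. This is a rough sketch with hypothetical helper names; a missing socket is consistent with dirsrv not running, but an existing socket does not prove the server is accepting connections.

```python
import os
import stat
from urllib.parse import unquote

LDAPI_PREFIX = "ldapi://"


def ldapi_socket_path(url: str) -> str:
    """Decode the percent-encoded socket path from an ldapi:// URL."""
    if not url.startswith(LDAPI_PREFIX):
        raise ValueError(f"not an ldapi URL: {url!r}")
    return unquote(url[len(LDAPI_PREFIX):])


def socket_exists(path: str) -> bool:
    """True if the path exists and is a UNIX domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False
```

For the URL in the error above, `ldapi_socket_path` returns `/run/slapd-FEDORAPROJECT-ORG.socket`, and `socket_exists` on that path, run on the IPA server itself, tells whether the socket file is present.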

ipactl status shows everything green

... and seems to be working now btw :-)

I created a work document on hackmd.io for better collaboration on the outage.

The whole IPA situation is now resolved and we have all three servers running. Thanks to everybody who helped with that.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

a month ago
