https://accounts.fedoraproject.org and https://accounts.centos.org are currently marked as undeployed in openshift, and so not usable
Metadata Update from @zlopez: - Issue assigned to zlopez - Issue priority set to: None (was: Needs Review) - Issue tagged with: high-gain, medium-trouble, ops
I'm the one responsible for that, and I'm currently trying to restore IPA.
What happened: I accidentally deleted replication agreements from all IPA machines.
What I did: I tried to restore the missing replication agreements by following this guide, which didn't help.
We tried to restore IPA to its former glory, but we weren't able to figure out how. This is the error we always encountered:
ldap_child[56078]: Failed to initialize credentials using keytab [MEMORY:/etc/krb5.keytab]: Decrypt integrity check failed. Unable to create GSSAPI-encrypted LDAP connection.
What we tried:
- Restore ipa01 from old backup (daily backup)
- Reinstall ipa03 from scratch and use ipa-restore from ipa02 data backup
- Reinstall ipa02 from scratch and use ipa-restore from ipa02 data backup
- Tried to regenerate kerberos keytab
Currently Kerberos and authentication work with the partially broken ipa02, so we are leaving it in this state and will try to get some help from the IPA folks when they are available.
The current state:
- ipa01 (rhel8) is up, but ipa services are all turned off there.
- ipa02 (rhel8) is up and ipa is running there, but has errors/issues.
- ipa03 is down.
On ipa02 you can get a kerberos ticket for admin, but when trying to use the client, it alternates between:
ipa: ERROR: Insufficient access: SASL(-1): generic failure: GSSAPI Error: Unspecified GSS failure. Minor code may provide more information (Credential cache is empty)
and
ipa: ERROR: ProtocolError: <ProtocolError for ipa02.iad2.fedoraproject.org/ipa/session/json: 401 Unauthorized>
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/__init__.py", line 120, in get_package
    plugins = api._remote_plugins
AttributeError: 'API' object has no attribute '_remote_plugins'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ipalib/cli.py", line 1469, in run
    api.finalize()
  File "/usr/lib/python3.6/site-packages/ipalib/plugable.py", line 753, in finalize
    self.__do_if_not_done('load_plugins')
  File "/usr/lib/python3.6/site-packages/ipalib/plugable.py", line 432, in __do_if_not_done
    getattr(self, name)()
  File "/usr/lib/python3.6/site-packages/ipalib/plugable.py", line 632, in load_plugins
    for package in self.packages:
  File "/usr/lib/python3.6/site-packages/ipalib/__init__.py", line 952, in packages
    ipaclient.remote_plugins.get_package(self),
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/__init__.py", line 128, in get_package
    plugins = schema.get_package(server_info, client)
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/schema.py", line 546, in get_package
    schema = Schema(client)
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/schema.py", line 395, in __init__
    fingerprint, ttl = self._fetch(client, ignore_cache=read_failed)
  File "/usr/lib/python3.6/site-packages/ipaclient/remote_plugins/schema.py", line 407, in _fetch
    client.connect(verbose=False)
  File "/usr/lib/python3.6/site-packages/ipalib/backend.py", line 69, in connect
    conn = self.create_connection(*args, **kw)
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 1064, in create_connection
    command([], {})
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 1276, in _call
    return self.__request(name, args)
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 1243, in __request
    verbose=self.__verbose >= 3,
  File "/usr/lib64/python3.6/xmlrpc/client.py", line 1154, in request
    return self.single_request(host, handler, request_body, verbose)
  File "/usr/lib/python3.6/site-packages/ipalib/rpc.py", line 730, in single_request
    response.msg)
xmlrpc.client.ProtocolError: <ProtocolError for ipa02.iad2.fedoraproject.org/ipa/session/json: 401 Unauthorized>
ipa: ERROR: an internal error has occurred
/var/log/httpd/error_log has:
[Tue Jan 23 00:08:08.715155 2024] [wsgi:error] [pid 5370:tid 139692100007680] (11)Resource temporarily unavailable: [client 10.3.163.76:58076] mod_wsgi (pid=5370): Unable to connect to WSGI daemon process 'kdcproxy' on '/etc/httpd/run/wsgi.4365.0.1.sock' after multiple attempts as listener backlog limit was exceeded or the socket does not exist.
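For anyone triaging similar logs, the failure signatures quoted so far can be collected into a small helper. This is only a minimal sketch in Python: the function name `classify_failures` and the short labels are hypothetical, and only the matched substrings come from the messages in this ticket.

```python
# Hypothetical triage helper: the substrings are copied from the errors quoted
# in this ticket; the labels and the function name are illustrative only.
SIGNATURES = {
    "Decrypt integrity check failed": "keytab",
    "Credential cache is empty": "ccache",
    "401 Unauthorized": "session",
    "Unable to connect to WSGI daemon process 'kdcproxy'": "kdcproxy",
}


def classify_failures(log_text):
    """Return the sorted labels of all known signatures found in log_text."""
    return sorted(
        label for needle, label in SIGNATURES.items() if needle in log_text
    )
```

Feeding it the concatenated client output and httpd error_log would show at a glance which of the symptoms above are present at the same time.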
Currently $ kinit mtasaka@FEDORAPROJECT.ORG is unresponsive. Is this issue related to this ticket, or should I file another ticket?
@mtasaka That would be related; Kerberos is not in a great state either.
What do you see in krb5kdc.log on ipa01 and ipa02?
(This is not directed at me, right? I am not an infra member. I just wanted to do fedpkg build, so I have to run kinit beforehand, but currently it is unresponsive.)
@mtasaka No, that wasn't directed at you.
We are currently trying to find and fix the root cause of the outage with @abbra
It seems that the issue is with missing SIDs and ID ranges for entries. These are required by the version of the IPA server we currently have installed.
Right now we are assigning ID ranges and SIDs to existing entries in the IPA database.
Just to mention, this affects the mass rebuild shenanigans as well. There was a blocker in qsort causing a regression in glibc (https://bugzilla.redhat.com/show_bug.cgi?id=2259845). For this we might need to rebuild things again or stop the build until it gets sorted out, but for that we need to get to root on the compose-branched machine, which needs auth fixed!
We were able to regenerate SIDs and ID ranges. But there are still issues with kdcproxy and ipa-tomcat on ipa01.
https://accounts.fedoraproject.org is working now. Same for https://accounts.centos.org. IPA is still not in great shape, but most of the authentication is working now.
ipa02 seems to be processing things ok for now.
ipa03 isn't installing due to a kra issue. We are going to try and address that tomorrow when more ipa folks are around to assist.
Once that's done and in sync we can look at redoing 01 and then finally 02.
In any case everything should be working for the moment.
If people have issues with Noggin, clear your cookies for accounts.fedoraproject.org.
I confirmed that it works for me now.
Thanks
Current status:
- ipa01 and ipa03 are reinstalled with rhel9 and (mostly) operating normally (see below)
- ipa02 is still rhel8
The last issue is KRA. It's enabled on ipa02, but not on 01/03. We have a process to remove it from 02, which we are going to try tomorrow.
Once that's out, we can then reinstall ipa02 and everything should be back to normal. We should also write up a blameless retrospective of this incident (what went wrong, how we could do better, etc).
From an end user perspective everything should be working normally again now.
I tested the IPA UI and can't log in. It outputs:
IPA error 907 : NetworkError cannot connect to 'ldapi://%2Frun%2Fslapd-FEDORAPROJECT-ORG.socket':
The error message usually means that dirsrv (389-DS) is not running. Check ipactl status.
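A rough sketch of checking that output programmatically, assuming `ipactl status` prints one `<name> Service: <STATE>` line per service (treat the exact format as an assumption and verify it against your own output). The function name `stopped_services` is hypothetical.

```python
def stopped_services(status_output):
    """Return names of services whose reported state is not RUNNING.

    Assumes lines shaped like 'Directory Service: RUNNING'. Lines without
    ' Service: ' (for example a trailing 'ipa: INFO: ...' line) are ignored.
    """
    stopped = []
    for line in status_output.splitlines():
        name, sep, state = line.rpartition(" Service: ")
        if sep and state.strip() != "RUNNING":
            stopped.append(name.strip())
    return stopped
```

An empty result would correspond to "everything green"; a stopped Directory Service entry would match the explanation given for error 907.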
ipactl status shows everything green
... and seems to be working now btw :-)
I created a work document on hackmd.io for better collaboration on the outage.
The whole IPA situation is now resolved and we have all three servers running. Thanks to everybody who helped with that.
Metadata Update from @zlopez: - Issue close_status updated to: Fixed with Explanation - Issue status updated to: Closed (was: Open)