#12149 ipa cluster degraded
Closed: Fixed with Explanation 5 days ago by kevin. Opened 2 months ago by kevin.

I was running ansible playbooks over servers (as one does) and something went very wrong with the ipa playbook.

I noticed alerts, then looked and the playbook was in the

name: configure replication

block of roles/ipa/server/main.yml

It had run it on all 3 cluster members.
ipa01.iad2.fedoraproject.org failed with an error saying 'sorry, removing this server would leave you without a CA server'
ipa02.iad2.fedoraproject.org completed removing the server
ipa03.iad2.fedoraproject.org was running, but I stopped it.

So, the state right now looks like ipa01/03 are both up and syncing.
ipa02 is uninstalled.

However, on 01/03 I can't run ipactl status (it hangs).
Auth doesn't seem to be functioning even though those servers are up.

I tried to re-run the playbook and set ipa02 back up as a replica, but it hangs trying to register as a client.

I am not sure what to do to bring things back to normal, or how to debug this further. ;(

CC: @abbra @abompard or anyone else who can help


I'm trying to restart IPA on 03 because pki-tomcatd was not running. It hung while restarting dirsrv, and I'm getting this in the logs:

[23/Aug/2024:06:59:54.486639278 +0000] - WARN - NSMMReplicationPlugin - prot_stop - Incremental protocol for replica "agmt="cn=ipa03.iad2.fedoraproject.org-to-ipa01.iad2.fedoraproject.org" (ipa01:389)" did not shut down properly.

That blocked for 8 minutes, then the restart completed. ipactl status says RUNNING for all services.

Then I ran ipactl restart on ipa01. It blocked on:

[23/Aug/2024:07:12:58.015075195 +0000] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=ipa01.iad2.fedoraproject.org-to-ipa03.iad2.fedoraproject.org" (ipa03:389): Replication bind with GSSAPI auth resumed

Possibly because there may be data to replicate. It's hanging there.
That said, I notice that the data-only-backup job has been running since Aug 23 05:09:11 and is still running; maybe it's preventing dirsrv from shutting down. I killed it and restarted the restart process, but that didn't work any better.

I manually restarted the dirsrv@FEDORAPROJECT-ORG.service, and that worked. It's complaining that it can't replicate with ipa02, which is expected. But ipactl restart --debug still hangs after checking that dirsrv is active, and it doesn't give much info about what it's trying to do. Given the traceback when I Ctrl-C it, it's trying to connect to the LDAP server and hangs there.

Going back to ipa03, I notice that the LDAP server isn't responding there either. I can restart it with systemctl restart, but ipactl status hangs when trying to contact the LDAP server.
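
A quick way to check whether the LDAP server is answering at all, independently of ipactl, is a plain anonymous search against the local instance (just a generic probe, not something specific to this setup; if this hangs too, the problem is in dirsrv itself):

# ldapsearch -x -H ldap://localhost -s base -b "" namingContexts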

I'll investigate some more and report here.

OK, apparently ipa01 eventually restarted properly. I just had to restart pki-tomcatd, which had failed to start because it couldn't contact the LDAP server; now that LDAP is running, restarting pki-tomcatd alone worked fine.

I can now kinit on ipa01, ipactl status is all RUNNING. I'll try again on ipa03 and give it more time to restart.

On ipa03 I'm now getting this in a loop:

[23/Aug/2024:08:23:56.587397984 +0000] - ERR - NSMMReplicationPlugin - changelog program - repl_plugin_name_cl - agmt="cn=ipa03.iad2.fedoraproject.org-to-ipa01.iad2.fedoraproject.org" (ipa01:389): CSN 65b2a692000100340000 not found, we aren't as up to date, or we purged
[23/Aug/2024:08:23:56.588062322 +0000] - ERR - NSMMReplicationPlugin - send_updates - agmt="cn=ipa03.iad2.fedoraproject.org-to-ipa01.iad2.fedoraproject.org" (ipa01:389): Data required to update replica has been purged from the changelog. If the error persists the replica must be reinitialized.

I'm going to reinitialize the replica. Running:

[root@ipa03 ~][PROD-IAD2]# ipa-replica-manage re-initialize --from ipa01.iad2.fedoraproject.org

The update succeeded in 531 seconds, but I'm still getting the error. I've checked that the data seems to be up to date on ipa03 (we have users created today). I'll reinitialize the replica on 01 as well.
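
That is, presumably the mirror of the command above, run on ipa01:

# ipa-replica-manage re-initialize --from ipa03.iad2.fedoraproject.org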

It finished a little short of 500 seconds, but I still have the same errors in ipa03's logs.
I don't know what more to do.

(side note: I was able to get a Kerberos ticket and log in to Pagure, so fedorastatus.org should probably change its 'red' downtime status)

Same here for the side note: I'm able to get a Kerberos ticket and log in to Pagure.

We had a new person sign up for a FAS account and log in to both Matrix and Pagure within the last half hour to hour.

Yeah, there's still a problem with fasjson, but I think I can fix that... either by setting ipa02 back up as a replica/server or just telling fasjson not to query it.

Thanks so much for getting things somewhat back @abompard !

We shouldn't reinstall ipa01 though, as it's the CA renewal master or whatever... cert master?

I'll try and get ipa02 back and/or fix fasjson.

ipa02 is all back and claims to be in sync. They all do.

I'm still seeing occasional errors from fasjson. It's when it hits ipa01, I think.

 10.3.163.54: NOT_ALLOWED_TO_DELEGATE: authtime 1724289457,  HTTP/fasjson.fedoraproject.org@FEDORAPROJECT.ORG for ldap/ipa03.iad2.fedoraproject.org@FEDORAPROJECT.ORG, KDC can't fulfill requested option

I don't understand why this would happen only on 01.

It seems odd that it's ldap/ipa03... the delegation has HTTP?

OK, I might have worked around it by just taking ipa01 out of internal DNS, so fasjson stops querying it.

So, we need to figure out why ipa01 isn't working right for that.

Can you show where I can find these delegation rules in either Fedora infrastructure or fasjson code?

I see the following: https://github.com/fedora-infra/fasjson/blob/dev/devel/ansible/roles/fasjson/templates/setup-fasjson-service.sh, is that correct? This set of rules relies on the existing IPA delegation rules for the http and ldap services, and if ldap/ipa03.iad2 is missing from the ipa-ldap-... delegation target, that would explain your issue.

Additionally, I think it is fundamentally wrong to have an Ansible playbook that removes pre-existing IPA servers from the existing topology and then recreates everything from scratch. That is not what you want for an existing deployment. You need to ensure that your servers exist, not remove existing ones, because they already contain valuable data (Fedora accounts).

I'll let @abompard speak to fasjson... the only delegation rules I see are the ones from the playbook and those are all http/ipa...

Yes, the playbook is not supposed to do this if the server is already deployed; it's intended to help you redeploy a failed server or the like.
We could likely move this off to a manual playbook for just those rare times. The logic should normally cause it to skip that entire block, but somehow in this case it tried to redeploy all the servers. ;(

Yeah, ipa01 has no service delegation rules or targets:

[root@ipa01 ~][PROD-IAD2]# ipa servicedelegationrule-find
----------------------------------
0 service delegation rules matched
----------------------------------
----------------------------
Number of entries returned 0
----------------------------
[root@ipa01 ~][PROD-IAD2]# ipa servicedelegationtarget-find
------------------------------------
0 service delegation targets matched
------------------------------------
----------------------------
Number of entries returned 0
----------------------------

This is pretty bad because IPA installs some for its own purposes. So there's been some data screwup somewhere. Possibly my fault when I ran ipa-replica-manage re-initialize. Not sure.

Can we reimport the entire db from 02 or 03 to 01? I would think ipa-replica-manage re-initialize would do that, but now I'm a little bit afraid to mess things up even more 😬

You can recover the standard delegation targets on the specific IPA server by running:

# ipa-ldap-updater /usr/share/ipa/updates/30-s4u2proxy.update

Logs will be appended to /var/log/ipaupgrade.log.

Thanks! At first it looked like it didn't do anything:

# ipa-ldap-updater /usr/share/ipa/updates/30-s4u2proxy.update
Update complete, no data were modified
The ipa-ldap-updater command was successful

but actually that's my bad: I assumed I was logged in as admin on the terminal, and I wasn't. I was logged in as a service that does not have permission to view the delegation rules & targets.

That's how it actually is right now:

[root@ipa01 ~][PROD-IAD2]# ipa servicedelegationrule-find
----------------------------------
4 service delegation rules matched
----------------------------------
  Delegation name: fasjson-delegation
  Member principals: HTTP/fasjson.fedoraproject.org@FEDORAPROJECT.ORG

  Delegation name: fasjson-http-delegation

  Delegation name: ipa-http-delegation
  Member principals: HTTP/id.fedoraproject.org@FEDORAPROJECT.ORG, HTTP/ipa01.iad2.fedoraproject.org@FEDORAPROJECT.ORG, HTTP/ipa02.iad2.fedoraproject.org@FEDORAPROJECT.ORG

  Delegation name: koji-http
  Member principals: HTTP/ipatest.fedorainfracloud.org@FEDORAPROJECT.ORG
----------------------------
Number of entries returned 4
----------------------------
[root@ipa01 ~][PROD-IAD2]# ipa servicedelegationtarget-find
------------------------------------
4 service delegation targets matched
------------------------------------
  Delegation name: ipa-cifs-delegation-targets

  Delegation name: ipa-http-delegation-targets
  Member principals: HTTP/ipa01.iad2.fedoraproject.org@FEDORAPROJECT.ORG, HTTP/ipa02.iad2.fedoraproject.org@FEDORAPROJECT.ORG, HTTP/ipa03.iad2.fedoraproject.org@FEDORAPROJECT.ORG

  Delegation name: ipa-ldap-delegation-targets
  Member principals: ldap/ipa01.iad2.fedoraproject.org@FEDORAPROJECT.ORG, ldap/ipa02.iad2.fedoraproject.org@FEDORAPROJECT.ORG

  Delegation name: koji
  Member principals: HTTP/koji.fedoraproject.org@FEDORAPROJECT.ORG
----------------------------
Number of entries returned 4
----------------------------

It looks like ipa03 is missing in a couple places, so I'll fix that and check that FASJSON & Noggin work.
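
Presumably something along these lines to add ipa03 back to the rule and target it is missing from (guessing at the exact invocations, these aren't copied from a terminal):

# ipa servicedelegationrule-add-member ipa-http-delegation --principals=HTTP/ipa03.iad2.fedoraproject.org@FEDORAPROJECT.ORG
# ipa servicedelegationtarget-add-member ipa-ldap-delegation-targets --principals=ldap/ipa03.iad2.fedoraproject.org@FEDORAPROJECT.ORG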

Did you run the ipa-ldap-updater part on ipa03?

Nope, I ran that on ipa01 because I thought that's where the problem was. Turns out it was ipa03 all along! (brings up the Scooby Doo unmasking meme)

ok. So, what happened is this:

We have a task that runs 'ipactl status' and, if it returns non-zero, assumes the host needs replication setup. :(

So, when I was running the playbook it got:
(on all ipa01/02/03)

Aug 23 2024 02:38:57    9       FAILED  determine whether we need to set up replication {"changed": true, "stdout": "Directory Service: RUNNING\nkrb5kdc Service: RUNNING\nkadmin Service: RUNNING\nhttpd Service: RUNNING\nipa-custodia Service: RUNNING\npki-tomcatd Service: STOPPED\nipa-otpd Service: RUNNING", "stderr": "1 service(s) are not running", "rc": 3, "cmd": "ipactl status", "start": "2024-08-23 02:38:54.361340", "end": "2024-08-23 02:38:57.529259", "delta": "0:00:03.167919", "msg": "non-zero return code", "stdout_lines": ["Directory Service: RUNNING", "krb5kdc Service: RUNNING", "kadmin Service: RUNNING", "httpd Service: RUNNING", "ipa-custodia Service: RUNNING", "pki-tomcatd Service: STOPPED", "ipa-otpd Service: RUNNING"], "stderr_lines": ["1 service(s) are not running"], "_ansible_no_log": false, "task_start": 1724380734.1333277, "task_end": 1724380737.548895, "task_name": "determine whether we need to set up replication", "task_module": "shell", "task_args": {"_raw_params": "ipactl status"}, "task_register": "replication_status", "task_tags": ["ipa/server", "config"], "task_when": ["not ipa_initial"], "task_userid": "kevin"}
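
Based on the task fields in that log entry (task_name, task_module, task_register, task_when), the task in roles/ipa/server presumably looks roughly like this; this is a reconstruction from the log, not the actual playbook source:

- name: determine whether we need to set up replication
  shell: ipactl status
  register: replication_status
  when: not ipa_initial
  tags:
  - ipa/server
  - config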

So, it claimed tomcat wasn't running?

But looking at logs:

Aug 23 00:56:11 ipa01.iad2.fedoraproject.org server[1792652]:  java.base@17.0.12/java.lang.Thread.run(Thread.java:840)
Aug 23 03:10:41 ipa01.iad2.fedoraproject.org systemd[1]: Stopping PKI Tomcat Server pki-tomcat...

It was running until I tried to manually restart things at 03:10:41...

So, I am not sure why it was saying it wasn't running?

In any case, I think we should either come up with a much better test than the ipactl status output, and/or move this replication setup to a manual playbook that admins can run when they know a new server needs replication set up...

Metadata Update from @phsmoura:
- Issue tagged with: high-gain, medium-trouble, ops

2 months ago

Metadata Update from @phsmoura:
- Issue priority set to: Waiting on Assignee (was: Needs Review)

2 months ago

Thanks for the report! I think moving this to a manual playbook is safer.

The ipactl status check was added by me, because the previous logic didn't work at all. It was based on a marker file, but that file was already created and so the replication never happened. Maybe there is a better way to do it.

So, how can we best move this off, or make 100% sure it doesn't run on a working replica?

Perhaps we could set up an extra variable, 'ipa_replica_setup' or something, and only run that part when it is manually passed with '-e ipa_replica_setup=true'?
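
Something like this, roughly (just sketching the idea; the variable name is from above, the rest is made up and not the actual role):

- name: configure replication
  # only runs when -e ipa_replica_setup=true is passed explicitly
  when: ipa_replica_setup | default(false) | bool
  block:
    - name: placeholder for the existing replication tasks
      debug:
        msg: "replication setup would run here"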

Or we just remove that part entirely and make a manual/ipa-replica-setup.yml playbook I guess.

The second option seems good. Or we can create a file when the replica setup is done and check whether it's present on the machine on the next run.

Let me assign myself to this and get it to the finish line.

Metadata Update from @zlopez:
- Issue assigned to zlopez

11 days ago

I created a PR with the file check, let me know if this is OK.

There are issues when reinstalling a replica with newer PKI.

Please see https://pagure.io/freeipa/issue/9673 and https://pagure.io/freeipa/issue/9674

So I tried to play with manual confirmation of replica reinstall in the playbook today, and I encountered an issue with the pause module. This module is only run once for all the hosts in the group, which doesn't work nicely when I'm running with -l staging.

I tried a few ways to force the prompt to be shown for each host, but in the end it failed with an unsafe conditional error.

I ended up with a compromise: ask the sysadmin about replication once for all hosts, and then start replication on any server that is missing the log file created during replication.

I tested that on staging and it works nicely. Even if somebody answers yes, it will not do anything unless the log file is missing, so yes will only take effect on a completely new server or after a sysadmin deletes the ipainstall.log file.
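
Roughly, the approach looks like this (just a sketch of what is described above; the exact file path and task wording are assumptions, not the contents of the PR):

- name: ask whether replication should be set up
  # the pause module prompts once for the whole play, not per host
  pause:
    prompt: "Set up replication on servers missing the install log? (yes/no)"
  register: replica_prompt

- name: check for the log file created during a previous replica install
  stat:
    path: /var/log/ipainstall.log   # assumed location of the ipainstall.log mentioned above
  register: ipainstall_log

- name: configure replication
  include_tasks: replica-install.yml   # stands in for the existing replication tasks
  when:
    - replica_prompt.user_input | default('no') == 'yes'
    - not ipainstall_log.stat.exists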

Is anything missing, or could we close this now?

Let's close it. Thanks!

Metadata Update from @kevin:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

5 days ago
