Issue #10358: IPA backup fails on ipa03 and ipa02.stg - fedora-infrastructure

Closed: Fixed with Explanation 3 months ago by zlopez. Opened 2 years ago by abompard.

Going through my admin@fpo email I noticed an error from ipa03's backup script:

Error: Local roles CA do not match globally used roles CA, KRA. A backup done on this host would not be complete enough to restore a fully functional, identical cluster.
The ipa-backup command failed. See /var/log/ipabackup.log for more information

I've traced it to this patch of May 2020, which may have landed recently in an update.

Apparently the backup fails because some global roles are absent from the local instance. I'm not sure exactly what that means, but indeed when I run:

ldapsearch -b cn=masters,cn=ipa,cn=etc,dc=fedoraproject,dc=org

I see the KRA role for ipa01, ipa02 but not ipa03. Do we want to somehow add this role to ipa03 or should we just not run the backup on ipa03?

(same thing happens in ipa02.stg with the DNS and DNSKeySync roles)

Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: medium-gain, medium-trouble, ops

2 years ago

kevin commented 2 years ago

Ah yes, the KRA problem. ;)

We originally had ipa01.phx2.fedoraproject.org and ipa02.phx2.fedoraproject.org and both were KRA enabled/clustered.

Then we added ipa01.iad2.fedoraproject.org and ipa02.iad2.fedoraproject.org and synced them, all 4 were KRA enabled.

Then we nuked ipa01.phx2 and ipa02.phx2 because we no longer were in that datacenter.

Then we added ipa03.iad2.fedoraproject.org, but I was completely unable to get it to sync KRA. It gave a bunch of weird errors and failed. I read somewhere that this was because the original server that set up that function (ipa01.phx2) was gone and we can't add any more KRA replicas because of it. Not sure if that's really true, but it sounds somewhat plausible.

We don't use KRA currently, so I went on to other things.

We can:

Just remove KRA / disable / drop it somehow from existing servers and drive on.
or
Try and figure out how to get it to add ipa03.iad and ipa02.stg into their KRA replication

KRA might be useful/nice someday, but I don't know.

kevin commented 2 years ago

Here's what happens if you try and enable it:

[root@ipa03 ~][PROD-IAD2]# ipa-kra-install --no-host-dns                                            
Directory Manager password:                                                                         

/usr/lib/python3.6/site-packages/urllib3/connection.py:376: SubjectAltNameWarning: Certificate for ipa03.iad2.fedoraproject.org has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
Lookup failed: Preferred host ipa03.iad2.fedoraproject.org does not provide KRA.                    
Custodia uses 'ipa02.iad2.fedoraproject.org' as master peer.                                        

===================================================================                                 
This program will setup Dogtag KRA for the IPA Server.                                              


Configuring KRA server (pki-tomcatd). Estimated time: 2 minutes                                     
  [1/10]: creating ACIs for admin                                                                   
  [2/10]: creating installation admin user                                                          
  [3/10]: configuring KRA instance                                                                  
Failed to configure KRA instance                                                                    
See the installation logs and the following files/directories for more information:                 
  /var/log/pki/pki-tomcat                                                                           
  [error] RuntimeError: KRA configuration failed.                                                   

Your system may be partly configured.                                                               
If you run into issues, you may have to re-install IPA on this server.                              

KRA configuration failed.                                                                           
The ipa-kra-install command failed. See /var/log/ipaserver-kra-install.log for more information

abompard commented 2 years ago

Looking at the designated log file I see:

Exception: PKI subsystem 'KRA' for instance 'pki-tomcat' already exists!
  File "/usr/lib/python3.6/site-packages/pki/server/pkispawn.py", line 575, in main
    scriptlet.spawn(deployer)
  File "/usr/lib/python3.6/site-packages/pki/server/deployment/scriptlets/initialization.py", line 163, in spawn
    deployer.instance.verify_subsystem_does_not_exist()
  File "/usr/lib/python3.6/site-packages/pki/server/deployment/pkihelper.py", line 837, in verify_subsystem_does_not_exist
    self.mdict['pki_instance_name']))

Is there a way to remove the KRA configuration from Tomcat? Maybe @cheimes knows?

cheimes commented 2 years ago

IPA has no uninstaller for KRA. There is no supported way to remove KRA services from a cluster.

zlopez commented 2 years ago

[backlog refinement]
We need to fix the KRA to continue forward with the backups. We would appreciate the help of IPA folks on this, so the best way forward is probably to file a ticket on IPA tracker.
@abompard Do you know where we can do that?

zlopez commented 2 years ago

[backlog refinement]
We could try to remove KRA from ipa01, but there isn't any obvious way to do it.
Maybe @abompard could help here.

kevin commented 2 years ago

Note that this now only happens on ipa03. For some reason 02 and 02.stg are replicating KRA fine now.

zlopez commented a year ago

[backlog refinement]
The situation is still the same. There is a plan to update IPA machines to RHEL9, maybe this will solve the issue.

kevin commented a year ago

I'm going to try and do staging when we are in f38 final freeze (then, if it goes well we can do prod after f38 release).

zlopez commented 11 months ago

@kevin Did you have time to try this?

kevin commented 11 months ago

Nope, too much unplanned work/other work. :(

It's still definitely on my list tho... hopefully next week?

zlopez commented 7 months ago

Here is the guide for IPA migration to RHEL9

zlopez commented 7 months ago

Trying to understand how we create new VMs in Fedora Infra. In the documentation I found only this stub, and from looking around I figured out this:

Create new machine variables in inventory/host_vars/
Add the machine to the correct group in inventory/inventory
Run the corresponding playbook

If this is the correct way to do it, I'm not sure where to obtain the IP to set in the host_vars file.

I assume that we probably have a playbook or script that will do all this for you, but I can't find it in ansible repo.

kevin commented 7 months ago

We did a learning topic on it a while back: https://meetbot.fedoraproject.org/fedora-meeting-3/2021-05-27/infrastructure.2021-05-27-16.00.log.html

Basically yes, you set up a hostname in the dns repo using an unused ip address (both forward and reverse dns), then you use that in host variables in ansible.
You also need to pick which vmhost it's going to be set up on.

When you then run the playbook it sees that the hostname doesn't ping or exist on the vmhost, so it runs virt-install with the indicated kickstart, lets it install, then waits for ssh to come up after reboot and then continues the config process. If the machine pings/is defined on the host, it just skips all the install/provision tasks and moves on to configuration.
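The install-or-skip decision described above can be sketched as a small function. This is only an illustration of the described behaviour, not the actual playbook logic; the function name `provisioning_phases` and the phase labels are made up for the example.

```python
def provisioning_phases(pings, defined_on_vmhost):
    """Return the phases the playbook would go through, per the description above.

    Hypothetical helper: the real playbook implements this with Ansible tasks.
    """
    phases = []
    # Install only when the host neither answers ping nor is defined on the vmhost.
    if not (pings or defined_on_vmhost):
        phases.append("virt-install with the indicated kickstart")
        phases.append("wait for ssh to come up after reboot")
    # Configuration always runs, whether or not an install happened.
    phases.append("configuration")
    return phases


print(provisioning_phases(pings=False, defined_on_vmhost=False))
print(provisioning_phases(pings=True, defined_on_vmhost=True))
```

For an existing machine, only the configuration phase is returned, which matches the "skips all the install/provision tasks" behaviour.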

zlopez commented 7 months ago

So I played with a RHEL9 VM for IPA on staging and here is a summary of what I did and where I got stuck:

I added a new machine ipa03.stg.iad2.fedoraproject.org and updated the related documentation
I ran the groups/ipa.yml playbook
Found out that the replication task will never run, because the /etc/ipa/default.conf is always created by ipa/client: Enroll system as IPA client task (I'm not sure if this is on purpose or just nobody noticed)
I tried to run the ipa-replica-install command manually and ended up with Failed to start replication (Found this bugzilla ticket and it was solved 3 years ago, but the log looks exactly the same)

After the error the machine ends up in an inconsistent state and I needed to destroy it using ansible-playbook /srv/web/infra/ansible/playbooks/destroy_virt_inst.yml -e target=ipa03.stg.iad2.fedoraproject.org.

Not sure how to continue now.

Edited 7 months ago by zlopez

kevin commented 7 months ago

Perhaps we can ask the ipa folks for help here? IPA list might be good? or fedora-aaa channel on irc?

I think you should need to enroll as a client first, but perhaps we got ordering wrong in the playbook. You could run with '-t ipa/client' to just run the ipa/client tag and get the client setup, then re-run the full one? But looking it seems like client is before server, so this should have worked?

It might be good to gather the exact call and output from the replica-install command so we can debug more?

zlopez commented 7 months ago

You are correct about the order, but it doesn't work in our ansible roles, because ipa-replica-install is only executed when /etc/ipa/default.conf file is missing, but it's already created by ipa-client-install. It seems like this file shouldn't be created or at least not block the server deployment.

I will try to ask in fedora-aaa channel and see, thanks for pointing me that way.
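To make the ordering problem concrete, here is a minimal sketch of a "run only when the file is missing" guard. It assumes the role's condition is a plain existence check on /etc/ipa/default.conf, as described above; the function name `replica_install_would_run` is hypothetical and the temporary file only stands in for what the client enrollment creates.

```python
import os
import tempfile


def replica_install_would_run(conf_path="/etc/ipa/default.conf"):
    """Mirror the guard described above: run only when the file is missing."""
    return not os.path.exists(conf_path)


with tempfile.TemporaryDirectory() as tmp:
    conf = os.path.join(tmp, "default.conf")
    before_client = replica_install_would_run(conf)
    # Stand-in for the client enrollment step creating the file.
    open(conf, "w").close()
    after_client = replica_install_would_run(conf)

print(before_client, after_client)
```

Because the client enrollment runs first and creates the file, the guard is already false by the time the replica task is evaluated.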

zlopez commented 7 months ago

After no response in the fedora-aaa channel and a few quick mails on the freeipa-users mailing list, I created an issue on the freeipa tracker.

zlopez commented 7 months ago

Digging more into this, it seems that the issue is probably somewhere in kerberos:

Sep 27 14:21:23 ipa01.stg.iad2.fedoraproject.org ns-slapd[2437722]: [27/Sep/2023:14:21:23.518893204 +0000] - ERR - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meToipa03.stg.iad2.fedoraproject.org" (ipa03:389) - Replication bind with GSSAPI auth failed: LDAP error 49 (Invalid credentials) ()
Sep 27 14:21:26 ipa01.stg.iad2.fedoraproject.org ns-slapd[2437722]: GSSAPI Error: Unspecified GSS failure.  Minor code may provide more information (Server krbtgt/STG.IAD2.FEDORAPROJECT.ORG@STG.FEDORAPROJECT.ORG not found in Kerberos database)
Sep 27 14:21:32 ipa01.stg.iad2.fedoraproject.org ns-slapd[2437722]: [27/Sep/2023:14:21:32.618342270 +0000] - ERR - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meToipa03.stg.iad2.fedoraproject.org" (ipa03:389) - Replication bind with GSSAPI auth failed: LDAP error -1 (Can't contact LDAP server) ()

I'm surprised to see that kerberos server is not found in the database.

If anybody wants to try it again, this is how to reproduce it on ipa03.stg:

Remove existing remains of replication:
(On ipa03.stg) ipa-server-install --uninstall
(On ipa01.stg) ipa server-del ipa03.stg.iad2.fedoraproject.org --force
Run the replication again

ipa-replica-install --setup-ca --setup-kra --admin-password=XXX --no-host-dns --mkhomedir --no-ntp --unattended --no-ssh --no-sshd --force-join --log-file=/var/log/ipainstall.log --domain=STG.FEDORAPROJECT.ORG --server=ipa01.stg.iad2.fedoraproject.org

abompard commented 6 months ago

I found the cause! After much digging I found an error message on ldap's log on ipa03 about a sasl packet exceeding the max size. We need to change the nsslapd-maxsasliosize parameter at install time, and for that we need to use the --dirsrv-config-file config option of ipa-replica-install.
Note that this option was used on ipa02, but pointing to a file in /tmp, so I couldn't retrieve what was there. I've made one with the following content:

dn: cn=config
changetype: modify
replace: nsslapd-maxsasliosize
nsslapd-maxsasliosize: 3145728

And with this, the replica install succeeds. I'm going to make an ansible PR so we save that for next time.
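As a small illustration, the LDIF above can be generated from a byte count, which also makes it visible that 3145728 is 3 MiB. This is a sketch only; the function name `dirsrv_config_ldif` is hypothetical, and the resulting text would still have to be written to a file and passed via --dirsrv-config-file.

```python
# 3 MiB expressed in bytes; this equals the 3145728 used in the LDIF above.
MAX_SASL_IO_BYTES = 3 * 1024 * 1024


def dirsrv_config_ldif(max_sasl_io_size=MAX_SASL_IO_BYTES):
    """Build the LDIF modification that sets nsslapd-maxsasliosize."""
    return (
        "dn: cn=config\n"
        "changetype: modify\n"
        "replace: nsslapd-maxsasliosize\n"
        f"nsslapd-maxsasliosize: {max_sasl_io_size}\n"
    )


print(dirsrv_config_ldif(), end="")
```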

abompard commented 6 months ago

https://pagure.io/fedora-infra/ansible/pull-request/1609

zlopez commented 5 months ago

Commit 94478cc8 relates to this ticket

zlopez commented 5 months ago

I tried to run the playbook after merging the PR and it went through with only one error with dnf active-timer, but it's ignored by ansible.

fatal: [ipa03.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "cmd": ["systemctl", "is-active", "dnf-automatic-install.timer"], "delta": "0:00:00.010844", "end": "2023-11-09 10:17:53.059639", "msg": "non-zero return code", "rc": 3, "start": "2023-11-09 10:17:53.048795", "stderr": "", "stderr_lines": [], "stdout": "inactive", "stdout_lines": ["inactive"]}

After that I tried to verify status of IPA installation

[root@ipa03 ~][STG]# ipactl status
Directory Service: RUNNING
krb5kdc Service: RUNNING
kadmin Service: RUNNING
httpd Service: RUNNING
ipa-custodia Service: RUNNING
pki-tomcatd Service: RUNNING
ipa-otpd Service: RUNNING
ipa: INFO: The ipactl command was successful

It seems that the DNS capabilities are missing (noticed that we have --no-host-dns in RHEL9 replica)

[root@ipa03 ~][STG]# ipa server-role-find --status enabled --server ipa03.stg.iad2.fedoraproject.org
----------------------
2 server roles matched
----------------------
  Server name: ipa03.stg.iad2.fedoraproject.org
  Role name: CA server
  Role status: enabled

  Server name: ipa03.stg.iad2.fedoraproject.org
  Role name: KRA server
  Role status: enabled
----------------------------
Number of entries returned 2
----------------------------

[root@ipa01 ~][STG]# ipa server-role-find --status enabled --server ipa01.stg.iad2.fedoraproject.org
----------------------
3 server roles matched
----------------------
  Server name: ipa01.stg.iad2.fedoraproject.org
  Role name: CA server
  Role status: enabled

  Server name: ipa01.stg.iad2.fedoraproject.org
  Role name: DNS server
  Role status: enabled

  Server name: ipa01.stg.iad2.fedoraproject.org
  Role name: KRA server
  Role status: enabled
----------------------------
Number of entries returned 3
----------------------------

kevin commented 5 months ago

DNS is not something we manage in IPA... so it shouldn't be a service there.

But something isn't too right... I think it's ipa02.stg... dirserv wasn't running/able to restart there. I updated packages and ipa-server-upgrade on it, but that didn't help.

It looks like it's missing some schema member? If you all could take a closer look that would be great.

I think this is also causing staging sssd's on hosts to not work right (see the ticket about access to compose-x86-01.stg).

zlopez commented 5 months ago

Couldn't we just try to replace it with RHEL9 so it will be same as ipa03.stg?

kevin commented 5 months ago

Well, we could... but is ipa03.stg all right? If we do redo ipa02, we should save off the virt xml (virsh dumpxml ipa02.stg.iad2.fedoraproject.org > file) and lvrename the logical volume ( lvrename /dev/vg_guests/ipa02.stg.iad2.fedoraproject.org /dev/vg_guests/ipa02.stg.iad2.fedoraproject.org-el8) so we can revert back to it if needed (by just undefining whatever new one, redefining it with the old virt xml and putting the lv back).

Do you / @abompard want to try that? We should sort something here, as it's blocking testing on compose-x86-01.stg. ;(

zlopez commented 5 months ago

I was able to reinstall ipa02.stg to RHEL9; it took a few adjustments to the ipa/server ansible role to make it run through. The only thing that I needed to do manually was to remove the replication entry from ipa01.stg (ipa server-del ipa02.stg.iad2.fedoraproject.org --force). This should only be needed if an existing server needs to be reinstated from scratch.

I tried running ipactl status on ipa02.stg and everything seems to be in order. In case anything is wrong there are backups on vmhost-x86-02.stg.

zlopez commented 5 months ago

I was looking today at IPA topology on staging and found out that there is already a topology set up between ipa01-ipa02 and ipa01-ipa03 for both ca and domain. Not sure why the topology on https://id.stg.fedoraproject.org/ipa/ui/#/p/topology-graph doesn't show that (it's possible that I just can't see it because of some missing permissions on staging).

[root@ipa01 ~][STG]# ipa topologysegment-find ca
------------------
2 segments matched
------------------
  Segment name: ipa01.stg.iad2.fedoraproject.org-to-ipa02.stg.iad2.fedoraproject.org
  Left node: ipa01.stg.iad2.fedoraproject.org
  Right node: ipa02.stg.iad2.fedoraproject.org
  Connectivity: both

  Segment name: ipa01.stg.iad2.fedoraproject.org-to-ipa03.stg.iad2.fedoraproject.org
  Left node: ipa01.stg.iad2.fedoraproject.org
  Right node: ipa03.stg.iad2.fedoraproject.org
  Connectivity: both
----------------------------
Number of entries returned 2
----------------------------

So the only topology that is missing is between ipa02-ipa03. I can add this manually by running ipa topologysegment-add, but I wonder how it was done on production that it actually works differently. I don't see anything about topology in ansible playbook.

@kevin Should I run it manually? Or is there some clever way that I don't know about? Also, does IPA work as the CA, and do I need to reassign that role before replacing ipa01.stg (see the migration guide, steps 1.3-1.5)?
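Finding which segments are absent can be sketched as a comparison of all server pairs against the existing segments. This assumes a full mesh between the three servers is what is wanted, which the thread does not confirm; the function name `missing_segments` is hypothetical, and the result only tells you which pairs would need ipa topologysegment-add.

```python
from itertools import combinations


def missing_segments(servers, existing):
    """Return server pairs with no segment, treating segments as undirected."""
    existing_pairs = {frozenset(pair) for pair in existing}
    return [
        pair
        for pair in combinations(sorted(servers), 2)
        if frozenset(pair) not in existing_pairs
    ]


servers = [
    "ipa01.stg.iad2.fedoraproject.org",
    "ipa02.stg.iad2.fedoraproject.org",
    "ipa03.stg.iad2.fedoraproject.org",
]
# Segments as reported by `ipa topologysegment-find ca` above.
existing = [(servers[0], servers[1]), (servers[0], servers[2])]

print(missing_segments(servers, existing))
```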

kevin commented 5 months ago

Yeah, adding to the playbook would be nice...

backups don't seem happy in staging...

ipa02.stg:

/etc/cron.daily/data-only-backup.sh:

ls: cannot access '/var/lib/ipa/backup/ipa-data-*': No such file or directory
Error: Local roles CA, KRA do not match globally used roles CA, DNS, DNSKeySync, KRA. A backup done on this host would not be complete enough to restore a fully functional, identical cluster.
The ipa-backup command failed. See /var/log/ipabackup.log for more information

ipa03.stg:

/etc/cron.daily/data-only-backup.sh:

ls: cannot access '/var/lib/ipa/backup/ipa-data-*': No such file or directory
Error: Local roles CA, KRA do not match globally used roles CA, DNS, DNSKeySync, KRA. A backup done on this host would not be complete enough to restore a fully functional, identical cluster.
The ipa-backup command failed. See /var/log/ipabackup.log for more information

So, I think we need to remove DNS and DNSKeySync?
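To see which roles block the backup, the error text itself can be parsed and the two role lists compared. This is a rough sketch that assumes the message keeps exactly the wording shown above; it is not how ipa-backup computes the mismatch, and `roles_missing_locally` is a hypothetical name.

```python
import re

# Pattern based solely on the wording of the error message quoted above.
ROLE_ERROR = re.compile(
    r"Local roles (?P<local>.+?) do not match globally used roles (?P<global>.+?)\. "
)


def roles_missing_locally(message):
    """Return globally used roles absent from the local server, or None if no match."""
    match = ROLE_ERROR.search(message)
    if match is None:
        return None
    local = {role.strip() for role in match.group("local").split(",")}
    global_roles = {role.strip() for role in match.group("global").split(",")}
    return sorted(global_roles - local)


message = (
    "Error: Local roles CA, KRA do not match globally used roles "
    "CA, DNS, DNSKeySync, KRA. A backup done on this host would not be "
    "complete enough to restore a fully functional, identical cluster."
)
print(roles_missing_locally(message))
```

For the staging messages this yields DNS and DNSKeySync, the two roles in question.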

zlopez commented 5 months ago

I would say that redeploying ipa01.stg on RHEL 9 will solve that as it will be replicated from other existing machine.

zlopez commented 5 months ago

I was able to automate the topology segment creation and now I'm trying to redeploy ipa01.stg after following the migration guide, but I'm hitting this ERROR Unable to find IPA Server to join. It seems that the DNS is still pointing to ipa01.stg as LDAP instance :-(

2023-11-29T16:48:56Z DEBUG [IPA Discovery]
2023-11-29T16:48:56Z DEBUG Starting IPA discovery with domain=stg.fedoraproject.org, servers=None, hostname=ipa01.stg.iad2.fedoraproject.org
2023-11-29T16:48:56Z DEBUG Search for LDAP SRV record in stg.fedoraproject.org
2023-11-29T16:48:56Z DEBUG Search DNS for SRV record of _ldap._tcp.stg.fedoraproject.org
2023-11-29T16:48:56Z DEBUG DNS record found: 0 100 389 ipa01.stg.iad2.fedoraproject.org.

I will look into that tomorrow to see what I can find.
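The "DNS record found" line in the log carries standard SRV record data: priority, weight, port and target, in that order. A minimal sketch of reading those fields follows; `SrvRecord` and `parse_srv` are hypothetical names, and this does no DNS lookup itself.

```python
from typing import NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def parse_srv(rdata):
    """Split SRV record data into its four fields, dropping the trailing dot."""
    priority, weight, port, target = rdata.split()
    return SrvRecord(int(priority), int(weight), int(port), target.rstrip("."))


record = parse_srv("0 100 389 ipa01.stg.iad2.fedoraproject.org.")
print(record.target, record.port)
```

Here the only LDAP SRV target is ipa01.stg itself, the machine being redeployed, so discovery has no other server to join.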

zlopez commented 5 months ago

Staging instance is deployed now, we needed to add DNS records for ipa02.stg and ipa03.stg for LDAP.

zlopez commented 5 months ago

We need to rebuild python3-collectd-ipa package for RHEL 9, as this is not available currently and fails the deployment.

abompard commented 5 months ago

I made that package, I'll rebuild it.

abompard commented 5 months ago

Done, it should appear in the infra repo in a short while: https://koji.fedoraproject.org/koji/taskinfo?taskID=109723821

zlopez commented 5 months ago

@abompard Thanks

zlopez commented 5 months ago

I was able to execute the whole playbook with the package available.

I was also able to solve ipa: ERROR: Operations error: Allocation of a new value for range cn=posix ids,cn=distributed numeric assignment plugin,cn=plugins,cn=config failed! Unable to proceed. during ipa user-add by manually adding the id ranges to IPA servers using this guide. The ranges were probably removed during replication process.

Also the https://id.stg.fedoraproject.org/ipa/ui/ is now opening, although it throws kerberos error during login (in /var/log/httpd/error_log) ipa: INFO: 401 Unauthorized: kinit: Generic error (see e-text) while getting initial credentials

zlopez commented 5 months ago

After debugging this with @darknao for some time, we found out that the SIDs are missing for users.

We tried to enable them by ipa config-mod --enable-sid --add-sids, but it fails on permission denied on /etc/krb5.conf.ipabkp and we didn't figure out why. It doesn't seem to be a SELinux issue, at least not after we did the relabel of the machine.

zlopez commented 5 months ago

After some more digging I found out that this is really a SELinux issue. More specifically this one:

Nov 30 14:05:46 ipa01 tag_audit_log: type=AVC msg=audit(1701353143.393:173): avc:  denied  { write } for  pid=3214 comm="org.freeipa.ser" name="etc" dev="dm-0" ino=33685633 scontext=system_u:system_r:ipa_helper_t:s0 tcontext=system_u:object_r:etc_t:s0 tclass=dir permissive=0
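The denial above is easier to read once split into its parts. The sketch below pulls the permission, source type, target type and object class out of an AVC line of this shape; `parse_avc` is a hypothetical helper written for this one format, not a general audit log parser.

```python
import re


def parse_avc(line):
    """Extract the main fields from an AVC denial line shaped like the one above."""
    # key=value pairs, where the value is either quoted or a run of non-spaces.
    fields = dict(re.findall(r'(\w+)=("[^"]*"|\S+)', line))
    perms = re.search(r"\{ ([^}]*) \}", line)
    return {
        "permissions": perms.group(1).split() if perms else [],
        "comm": fields.get("comm", "").strip('"'),
        # SELinux contexts are user:role:type:level; the type is the third part.
        "source_type": fields["scontext"].split(":")[2],
        "target_type": fields["tcontext"].split(":")[2],
        "tclass": fields["tclass"],
    }


line = (
    'type=AVC msg=audit(1701353143.393:173): avc:  denied  { write } for  '
    'pid=3214 comm="org.freeipa.ser" name="etc" dev="dm-0" ino=33685633 '
    "scontext=system_u:system_r:ipa_helper_t:s0 "
    "tcontext=system_u:object_r:etc_t:s0 tclass=dir permissive=0"
)
print(parse_avc(line))
```

Read this way, the line says a process in the ipa_helper_t domain was denied write on a directory labelled etc_t, which fits the permission error on /etc/krb5.conf.ipabkp.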

zlopez commented 5 months ago

@darknao was able to generate the SIDs by following the freeipa troubleshooting guide and it seems that everything is now working as it should. I noticed that even the backups are now going through without issue.

zlopez commented 4 months ago

Noticed that the ipa02.stg is currently failing when running /etc/cron.daily/cleanup-stage-users.

Traceback (most recent call last):
  File "/etc/cron.daily/cleanup-stage-users", line 15, in <module>
    client.login_kerberos()
  File "/usr/lib/python3.9/site-packages/python_freeipa/client.py", line 247, in login_kerberos
    return self._wrap_in_dns_discovery(self._login_kerberos)
  File "/usr/lib/python3.9/site-packages/python_freeipa/client.py", line 176, in _wrap_in_dns_discovery
    return function(*args, **kwargs)
  File "/usr/lib/python3.9/site-packages/python_freeipa/client.py", line 266, in _login_kerberos
    raise Unauthorized(response.text)
python_freeipa.exceptions.Unauthorized: <!DOCTYPE html>

Not sure why this is failing only on ipa02.stg.

I will check that out once I have some spare time.

Metadata Update from @zlopez:
- Issue assigned to zlopez

4 months ago

zlopez commented 4 months ago

Today I had some time to look at the script issue. It happened only on ipa02.stg, because it was set as a master server during replacement of ipa01.stg. Just running the playbook on staging fixed the issue and the script is working without error on ipa01.stg.

zlopez commented 3 months ago

Today I'm replacing ipa03 in production.

First run of the playbook failed, but this was caused by not doing ipa server-del ipa03.iad2.fedoraproject.org before running the deployment playbook.

zlopez commented 3 months ago

Second run failed on KRA configuration during replication.

"Error" : "Unable to add KRA connector for https://ipa03.iad2.fedoraproject.org:8443: KRA connector already exists"

That was a completely clean VM.

I found a similar issue and how to solve it in the freeipa mailing list archive. Let's see if that works.

zlopez commented 3 months ago

This is currently blocked by https://pagure.io/fedora-infrastructure/issue/11733, will continue on this after this is resolved.

zlopez commented 3 months ago

All the machines are now running on RHEL9 without KRA. The backup should now work without issue.

Metadata Update from @zlopez:
- Issue close_status updated to: Fixed with Explanation
- Issue status updated to: Closed (was: Open)

3 months ago

Metadata

Assignee: zlopez
Tags: medium-gain, medium-trouble, ops
Blocking: None
Depending on: None
Priority: Waiting on Assignee
Boards: ops (Status: Backlog)
Related Pull Requests: #1609 (Merged 5 months ago)
