#9196 [Tracker] Random nightly failure in ipa-replica-install: Failed to start replication
Opened 2 years ago by frenaud. Modified 2 years ago

Issue

FreeIPA nightly tests randomly fail while setting up replication. See for instance this test report and its logs.

Package Version and Platform:

  • Platform: Fedora 36
  • Package and version: 389-ds-base-2.1.1-2
    The full package list is available here.

Steps to Reproduce

Steps to reproduce the behavior:
1. on the master, install the IPA server with ipa-server-install --domain ipa.test --realm IPA.TEST -a Secret123 -p Secret123 --setup-dns --auto-forwarders --auto-reverse -U
2. on the replica, install an IPA client with ipa-client-install --domain ipa.test --realm IPA.TEST -p admin -w Secret123 --server server.ipa.test -U
3. on the replica, promote the machine to a replica with kinit admin; ipa-replica-install -U
The replica installation fails randomly.
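
Because the failure is intermittent, one way to hunt for it outside the CI is to repeat the promotion in a loop on the replica until it trips. This is only a sketch: it reuses the commands from the steps above, the admin password Secret123 from those steps, and the uninstall command the installer itself suggests on failure.

# on the replica, after steps 1 and 2 above
while :; do
    echo Secret123 | kinit admin
    if ! ipa-replica-install -U; then
        break   # the failing run is captured in /var/log/ipareplica-install.log
    fi
    # installation succeeded: tear it down and re-enroll the client for another try
    ipa-server-install --uninstall -U
    ipa-client-install --domain ipa.test --realm IPA.TEST -p admin -w Secret123 \
        --server server.ipa.test -U
done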

Expected behavior

Replica installation should succeed.

Initial investigation

The replica installation fails in the step setting up initial replication, with the following error:

...
  [27/42]: creating DS keytab
  [28/42]: ignore time skew for initial replication
  [29/42]: setting up initial replication
Starting replication, please wait until this has completed.

Update in progress, 1 seconds elapsed
Update in progress, 2 seconds elapsed
Update in progress, 3 seconds elapsed
Update in progress, 4 seconds elapsed
Update in progress, 5 seconds elapsed
Update in progress, 6 seconds elapsed
Update in progress, 7 seconds elapsed
Update in progress, 8 seconds elapsed
Update in progress, 9 seconds elapsed
Update in progress, 10 seconds elapsed
Update in progress, 11 seconds elapsed
Update in progress, 12 seconds elapsed
Update in progress, 13 seconds elapsed
Update in progress, 14 seconds elapsed
Update in progress, 15 seconds elapsed
[ldap://master.ipa.test:389] reports: Update failed! Status: [Error (49) - LDAP error: Invalid credentials - no response received]

  [error] RuntimeError: Failed to start replication
Failed to start replication
The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information
Your system may be partly configured.
Run /usr/sbin/ipa-server-install --uninstall to clean up.

The replica installer performs the following steps:
- create a connection to the master, bind as fqdn=replica0.ipa.test,cn=computers,cn=accounts,dc=ipa,dc=test
- fetch nsDS5ReplicaId from the master
- increment and update the value on the master
- add replica config on the replica in cn=replica,cn=dc\3Dipa\2Cdc\3Dtest,cn=mapping tree,cn=config
- set changelog maxage to 30d on the replica
- on the replica, create a special user entry cn=ldap/master.ipa.test@IPA.TEST,cn=config so that the SASL mapping can find a valid user during the first replication session, and add this user to nsDS5ReplicaBindDN
- on the replica, create a SASL mapping cn=Peer Master,cn=mapping,cn=sasl,cn=config:

dn: cn=Peer Master,cn=mapping,cn=sasl,cn=config
objectClass: top
objectClass: nsSaslMapping
cn: Peer Master
nsSaslMapRegexString: ^[^:@]+$
nsSaslMapBaseDNTemplate: cn=config
nsSaslMapFilterTemplate: (cn=&@IPA.TEST)
nsSaslMapPriority: 1

This maps the Kerberos principal ldap/master.ipa.test@IPA.TEST to the entry cn=ldap/master.ipa.test@IPA.TEST,cn=config.
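
A quick way to confirm that both pieces (the bind target entry and the Peer Master mapping) are actually in place on the replica is to read them back. This is a sketch only: the hostnames are the ones used in this ticket, and it assumes binding as cn=Directory Manager (password prompted for with -W).

# run against the replica
ldapsearch -LLL -H ldap://replica0.ipa.test:389 -D "cn=Directory Manager" -W \
    -b "cn=ldap/master.ipa.test@IPA.TEST,cn=config" -s base

ldapsearch -LLL -H ldap://replica0.ipa.test:389 -D "cn=Directory Manager" -W \
    -b "cn=Peer Master,cn=mapping,cn=sasl,cn=config" -s base

Each search should return exactly one entry; if either is missing, the GSSAPI bind coming from the master cannot be mapped to a valid identity.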

- add replica config on the master in cn=replica,cn=dc\3Dipa\2Cdc\3Dtest,cn=mapping tree,cn=config
- set changelog maxage to 30d on the master
- on the master, make sure the group cn=replication managers,cn=sysaccounts,cn=etc,dc=ipa,dc=test exists and contains the principals of both the master and the replica
- create the replication agreement on the master: cn=meToreplica0.ipa.test,cn=replica,cn=dc\=ipa\,dc\=test,cn=mapping tree,cn=config
- create the replication agreement on the replica: cn=meTomaster.ipa.test,cn=replica,cn=dc\=ipa\,dc\=test,cn=mapping tree,cn=config
- start replication by setting nsds5BeginReplicaRefresh=start on the master (entry cn=meToreplica0.ipa.test,...)
- read the entry back and check whether the replication has started. This is the step that fails.
  The audit logs show the MOD operation happened at 20220704162708. After that, a search is run every second to check the replication status, but the replication never starts.
  The master's error log shows the following error:
ERR - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meToreplica0.ipa.test" (replica0:389) - Replication bind with GSSAPI auth failed: LDAP error 49 (Invalid credentials) ()

and the replica access log shows the master trying to connect but failing:

[04/Jul/2022:16:27:09.084070988 +0000] conn=4 op=0 BIND dn="" method=sasl version=3 mech=GSSAPI
[04/Jul/2022:16:27:09.093387701 +0000] conn=4 op=0 RESULT err=49 tag=97 nentries=0 wtime=0.000044549 optime=0.009317877 etime=0.009359701 - SASL(-13): authentication failure: GSSAPI Failure: gss_accept_sec_context
[04/Jul/2022:16:27:09.097120258 +0000] conn=4 op=1 UNBIND

The connection should be authenticated as cn=ldap/master.ipa.test@IPA.TEST,cn=config, since the master binds with its Kerberos principal.
It seems that the SASL mapping is not working.
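
The failing bind can be reproduced by hand from the master, outside the installer, which makes it easier to correlate with the replica's access log. A sketch under two assumptions: the DS keytab is at /etc/dirsrv/ds.keytab (the usual location on an IPA master), and a scratch credential cache path of our choosing is used so other tickets are not touched.

# run on the master
export KRB5CCNAME=FILE:/tmp/repl-debug-ccache    # scratch ccache
kinit -kt /etc/dirsrv/ds.keytab ldap/master.ipa.test@IPA.TEST
ldapwhoami -Y GSSAPI -H ldap://replica0.ipa.test:389

# when the mapping works, this prints the mapped identity,
#   dn: cn=ldap/master.ipa.test@IPA.TEST,cn=config
# when the problem reproduces, the bind fails with the same err=49 /
# "Invalid credentials" seen in the replica access log above

# the state of the initialization can also be polled directly on the master:
ldapsearch -LLL -H ldap://master.ipa.test:389 -D "cn=Directory Manager" -W \
    -b "cn=meToreplica0.ipa.test,cn=replica,cn=dc\=ipa\,dc\=test,cn=mapping tree,cn=config" \
    -s base nsds5BeginReplicaRefresh nsds5ReplicaLastInitStatus nsds5ReplicaLastInitEnd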

Companion issue opened against 389-ds: https://github.com/389ds/389-ds-base/issues/5361


Also seen in PR #1842 with test_caless_TestServerReplicaCALessToCAFull and test_cert (Fedora 36 with 389-ds-base-2.1.1-2.fc36.x86_64)

Reproduced in [testing_master_latest] PR #1942
test_replication_layouts_TestLineTopologyWithoutCA: report, logs

Reproduced in [testing_ipa-4.10_latest], PR #1957
test_backup_and_restore_TestReplicaInstallAfterRestore: report, logs

Reproducible in testing_master_latest PR 1965 Report

Reproducible in testing_master_latest PR 2014 Report

Reproducible in testing_master_latest PR 2043 Report

Is it possible that some of the test runs un-install and then re-install the same replica?
(If so, the existing server may still hold Kerberos tickets for the previous incarnation of the replica in memory until they expire, and a restart of its LDAP service may be needed.)
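
If that were the cause, restarting the directory server on the existing master before re-running the promotion should be enough, since a restart drops whatever Kerberos credentials and GSSAPI security contexts the running ns-slapd still holds. A sketch; the instance name IPA-TEST is assumed from the FreeIPA convention of deriving it from the realm IPA.TEST.

# on the master
systemctl restart dirsrv@IPA-TEST.service

# then retry the promotion on the replica
kinit admin
ipa-replica-install -U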

The CI uses a common Vagrant image but creates a new VM instance for each test and installs the server, replica, and client from scratch, so the failure cannot be caused by something left over from a previous test.
The issue is also seen in tests that do not uninstall/reinstall, so it is probably not linked to a stale Kerberos ticket.
