#8637 [Tracker] Nightly failure in test_backup_and_restore.py::TestBackupAndRestoreWithReplica::test_full_backup_and_restore_with_replica
Closed: worksforme 3 years ago by frenaud. Opened 3 years ago by frenaud.

Issue

The nightly test test_backup_and_restore.py::TestBackupAndRestoreWithReplica::test_full_backup_and_restore_with_replica is failing intermittently; see for instance [testing_master_latest_selinux] PR #617, with the following logs and report:

self = <ipatests.test_integration.test_backup_and_restore.TestBackupAndRestoreWithReplica object at 0x7f254610fca0>
cert_sign_request = {'master.ipa.test': '/tmp/tmp.TH7tIvEkKZ', 'replica0.ipa.test': '/tmp/tmp.gGQlgojvGq', 'replica1.ipa.test': '/tmp/tmp.rp20ftjtcU'}

    def test_full_backup_and_restore_with_replica(self, cert_sign_request):
        # check prerequisites
        self.check_replication_success(self.master)
        self.check_replication_success(self.replica1)

        self.master.run_command(
            ['ipa', 'service-add', 'TEST/' + self.master.hostname])

        tasks.user_add(self.master, 'test1_master')
        tasks.user_add(self.replica1, 'test1_replica')

        with restore_checker(self.master):
            backup_path = tasks.get_backup_dir(self.master)

            # change data after backup
            self.master.run_command(['ipa', 'user-del', 'test1_master'])
            self.replica1.run_command(['ipa', 'user-del', 'test1_replica'])
            tasks.user_add(self.master, 'test2_master')
            tasks.user_add(self.replica1, 'test2_replica')

            # simulate master crash
            self.master.run_command(['ipactl', 'stop'])
            tasks.uninstall_master(self.master, clean=False)

            logger.info("Stopping and disabling oddjobd service")
            self.master.run_command([
                "systemctl", "stop", "oddjobd"
            ])
            self.master.run_command([
                "systemctl", "disable", "oddjobd"
            ])

            self.master.run_command(['ipa-restore', '-U', backup_path])

        status = self.master.run_command([
            "systemctl", "status", "oddjobd"
        ])
        assert "active (running)" in status.stdout_text

        # replication should not work after restoration
        # create users to force master and replica to try to replicate
        tasks.user_add(self.master, 'test3_master')
        tasks.user_add(self.replica1, 'test3_replica')
        self.check_replication_error(self.master)
        self.check_replication_error(self.replica1)
        assert {'admin', 'test1_master', 'test1_replica', 'test3_master'} == \
            self.get_users(self.master)
        assert {'admin', 'test2_master', 'test2_replica', 'test3_replica'} == \
            self.get_users(self.replica1)

        # reestablish and check replication
        self.replica1.run_command(['ipa-replica-manage', 're-initialize',
                                  '--from', self.master.hostname])
        # create users to force master and replica to try to replicate
        tasks.user_add(self.master, 'test4_master')
        tasks.user_add(self.replica1, 'test4_replica')
        self.check_replication_success(self.master)
>       self.check_replication_success(self.replica1)

test_integration/test_backup_and_restore.py:616: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
test_integration/test_backup_and_restore.py:545: in check_replication_success
    tasks.wait_for_replication(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

ldap = <ipatests.pytest_ipa.integration.host.LDAPClientWithoutCertCheck object at 0x7f2544a51e50>
timeout = 30
target_status_re = 'Error \\(0\\) Replica acquired successfully: Incremental update succeeded'
raise_on_timeout = True

    def wait_for_replication(ldap, timeout=30,
                             target_status_re=r'^0 |^Error \(0\) ',
                             raise_on_timeout=False):
        """Wait for all replication agreements to reach desired state

        With defaults waits until updates on all replication agreements are
        done (or failed) and exits without exception
        :param ldap: LDAP client
        authenticated with necessary rights to read the mapping tree
        :param timeout: Maximum time to wait, in seconds
        :param target_status_re: Regexp of status to wait for
        :param raise_on_timeout: if True, raises AssertionError if status not
            reached in specified time

        Note that this waits for updates originating on this host, not those
        coming from other hosts.
        """
        logger.debug('Waiting for replication to finish')
        start = time.time()
        while True:
            status_attr = 'nsds5replicaLastUpdateStatus'
            progress_attr = 'nsds5replicaUpdateInProgress'
            entries = ldap.get_entries(
                DN(('cn', 'mapping tree'), ('cn', 'config')),
                filter='(objectclass=nsds5replicationagreement)',
                attrs_list=[status_attr, progress_attr])
            logger.debug('Replication agreements: \n%s', _entries_to_ldif(entries))
            statuses = [entry.single_value[status_attr] for entry in entries]
            wrong_statuses = [s for s in statuses
                              if not re.match(target_status_re, s)]
            if any(e.single_value[progress_attr] == 'TRUE' for e in entries):
                msg = 'Replication not finished'
                logger.debug(msg)
            elif wrong_statuses:
                msg = 'Unexpected replication status: %s' % wrong_statuses[0]
                logger.debug(msg)
            else:
                logger.debug('Replication finished')
                return
            if time.time() - start > timeout:
                logger.error('Giving up wait for replication to finish')
                if raise_on_timeout:
>                   raise AssertionError(msg)
E                   AssertionError: Unexpected replication status: Error (11) Replication error acquiring replica: Unable to acquire replica: the replica has the same Replica ID as this one. Replication is aborting. (duplicate replica ID detected)

The test scenario is the following:
- install master + replica
- back up the master
- delete one entry on the master and one on the replica (these entries are present in the backup)
- add one entry on the master and one on the replica (these entries are not in the backup)
- stop + uninstall the master to simulate a master crash
- restore the backup on the master
- add one entry on the master and one on the replica; replication is not working (as expected)
- re-initialize the replica from the master

The replica logs show an unexpected error: Unexpected replication status: Error (11) Replication error acquiring replica: Unable to acquire replica: the replica has the same Replica ID as this one. Replication is aborting. (duplicate replica ID detected).
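
For reference, the duplicate replica ID can be confirmed by reading the nsDS5ReplicaId attribute of the replica entries under cn=mapping tree,cn=config on both servers; with the bug reproduced, master and replica report the same value for dc=ipa,dc=test. A minimal diagnostic sketch (not part of the test), assuming python-ldap, Directory Manager credentials and plain LDAP access to both hosts; the hostnames and password below are placeholders:

    import ldap  # python-ldap

    DM_PASSWORD = 'Secret.123'  # placeholder Directory Manager password

    def get_replica_ids(host):
        """Return {replica DN: nsDS5ReplicaId} for all suffixes on the server."""
        conn = ldap.initialize('ldap://%s' % host)
        conn.simple_bind_s('cn=Directory Manager', DM_PASSWORD)
        entries = conn.search_s(
            'cn=mapping tree,cn=config', ldap.SCOPE_SUBTREE,
            '(objectclass=nsds5replica)', ['nsDS5ReplicaId'])
        return {dn: vals[0].decode()
                for dn, attrs in entries
                for attr, vals in attrs.items()
                if attr.lower() == 'nsds5replicaid'}

    # With the bug reproduced, both calls show the same ID for dc=ipa,dc=test,
    # matching the "duplicate replica ID detected" error above.
    print(get_replica_ids('master.ipa.test'))
    print(get_replica_ids('replica1.ipa.test'))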

From the dirsrv error logs on the master: the re-initialize (total update of the replica) starts at 17:26:06 and finishes at 17:26:09, then the duplicate replica ID error appears at 17:26:11:

[28/Dec/2020:17:26:06.606344805 +0000] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=meToreplica1.ipa.test" (replica1:389)".
[28/Dec/2020:17:26:09.928476702 +0000] - INFO - NSMMReplicationPlugin - repl5_tot_run - Finished total update of replica "agmt="cn=meToreplica1.ipa.test" (replica1:389)". Sent 517 entries.
[28/Dec/2020:17:26:11.267721396 +0000] - ERR - NSMMReplicationPlugin - multimaster_extop_StartNSDS50ReplicationRequest - conn=42 op=5 replica="dc=ipa,dc=test": Unable to acquire replica: error: duplicate replica ID detected

From the dirsrv [error logs] on the replica, we can see the re-init (import) operation, the warning replica_reload_ruv - New data for replica dc=ipa,dc=test does not match the data in the changelog, and finally the duplicate replica ID failure:

[28/Dec/2020:17:26:06.633513500 +0000] - ERR - ipa-topology-plugin - ipa_topo_be_state_change - backend userRoot is going offline; inactivate plugin
[28/Dec/2020:17:26:06.635335432 +0000] - NOTICE - NSMMReplicationPlugin - multimaster_be_state_change - Replica dc=ipa,dc=test is going offline; disabling replication
[28/Dec/2020:17:26:07.060033252 +0000] - INFO - bdb_instance_start - Import is running with nsslapd-db-private-import-mem on; No other process is allowed to access the database
[28/Dec/2020:17:26:09.498733025 +0000] - INFO - import_monitor_threads - import userRoot: Workers finished; cleaning up...
[28/Dec/2020:17:26:09.701756870 +0000] - INFO - import_monitor_threads - import userRoot: Workers cleaned up.
[28/Dec/2020:17:26:09.702662454 +0000] - INFO - bdb_import_main - import userRoot: Indexing complete.  Post-processing...
[28/Dec/2020:17:26:09.703466451 +0000] - INFO - bdb_import_main - import userRoot: Generating numsubordinates (this may take several minutes to complete)...
[28/Dec/2020:17:26:09.716224241 +0000] - INFO - bdb_import_main - import userRoot: Generating numSubordinates complete.
[28/Dec/2020:17:26:09.717447861 +0000] - INFO - bdb_get_nonleaf_ids - import userRoot: Gathering ancestorid non-leaf IDs...
[28/Dec/2020:17:26:09.718494639 +0000] - INFO - bdb_get_nonleaf_ids - import userRoot: Finished gathering ancestorid non-leaf IDs.
[28/Dec/2020:17:26:09.719200832 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Starting sort of ancestorid non-leaf IDs...
[28/Dec/2020:17:26:09.719733718 +0000] - INFO - ldbm_get_nonleaf_ids - import userRoot: Finished sort of ancestorid non-leaf IDs.
[28/Dec/2020:17:26:09.721312058 +0000] - INFO - bdb_ancestorid_new_idl_create_index - import userRoot: Creating ancestorid index (new idl)...
[28/Dec/2020:17:26:09.726706207 +0000] - INFO - bdb_ancestorid_new_idl_create_index - import userRoot: Created ancestorid index (new idl).
[28/Dec/2020:17:26:09.727636286 +0000] - INFO - bdb_import_main - import userRoot: Flushing caches...
[28/Dec/2020:17:26:09.728667912 +0000] - INFO - bdb_import_main - import userRoot: Closing files...
[28/Dec/2020:17:26:09.770103676 +0000] - INFO - bdb_import_main - import userRoot: Import complete.  Processed 518 entries in 2 seconds. (259.00 entries/sec)
[28/Dec/2020:17:26:10.048730164 +0000] - ERR - ipa-topology-plugin - ipa_topo_be_state_change - backend userRoot is coming online; checking domain level and init shared topology
[28/Dec/2020:17:26:10.054533592 +0000] - NOTICE - NSMMReplicationPlugin - multimaster_be_state_change - Replica dc=ipa,dc=test is coming online; enabling replication
[28/Dec/2020:17:26:10.059856177 +0000] - WARN - NSMMReplicationPlugin - replica_reload_ruv - New data for replica dc=ipa,dc=test does not match the data in the changelog.
 Recreating the changelog file. This could affect replication with replica's  consumers in which case the consumers should be reinitialized.
[28/Dec/2020:17:26:10.073041644 +0000] - ERR - cos-plugin - cos_dn_defs_cb - Skipping CoS Definition cn=Password Policy,cn=accounts,dc=ipa,dc=test--no CoS Templates found, which should be added before the CoS Definition.
[28/Dec/2020:17:26:10.077797996 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=groups,cn=compat,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.078843847 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=computers,cn=compat,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.080430800 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=ng,cn=compat,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.081574290 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target ou=sudoers,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.082359050 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=users,cn=compat,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.083638188 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.084482838 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.085234696 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.085882875 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.087166069 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.087803539 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.088672740 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.089610959 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.090501794 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.091667909 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.092427613 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=vaults,cn=kra,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.094307013 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=ad,cn=etc,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.102441988 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=casigningcert cert-pki-ca,cn=ca_renewal,cn=ipa,cn=etc,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.103528809 +0000] - WARN - NSACLPlugin - acl_parse - The ACL target cn=casigningcert cert-pki-ca,cn=ca_renewal,cn=ipa,cn=etc,dc=ipa,dc=test does not exist
[28/Dec/2020:17:26:10.118134172 +0000] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
[28/Dec/2020:17:26:10.118755627 +0000] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
[28/Dec/2020:17:26:10.119617610 +0000] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
[28/Dec/2020:17:26:10.120245605 +0000] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
[28/Dec/2020:17:26:11.804087261 +0000] - ERR - NSMMReplicationPlugin - acquire_replica - agmt="cn=meTomaster.ipa.test" (master:389): Unable to aquire replica: the replica has the same Replica ID as this one. Replication is aborting.
[28/Dec/2020:17:26:11.805085684 +0000] - ERR - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=meTomaster.ipa.test" (master:389): Incremental update failed and requires administrator action
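
The same Error (11) status can also be checked from the command line on the replica; a hedged example (the exact output wording is an assumption, based on the nsds5replicaLastUpdateStatus string in the traceback above, and ipa-replica-manage prompts for the Directory Manager password unless -p is given):

    import subprocess

    # Run on replica1: list this host's replication agreements verbosely; the
    # "last update status" line carries the same nsds5replicaLastUpdateStatus
    # value that wait_for_replication() polls over LDAP.
    out = subprocess.run(
        ['ipa-replica-manage', 'list', '-v', 'replica1.ipa.test'],
        capture_output=True, text=True, check=True).stdout
    print(out)
    # expected to contain something like:
    #   last update status: Error (11) Replication error acquiring replica: ...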

Opening an issue on the 389-ds side: https://github.com/389ds/389-ds-base/issues/4515

Metadata Update from @frenaud:
- Issue tagged with: tracker

3 years ago

The issue was not seen in the nightly tests during the last month; closing.

Metadata Update from @frenaud:
- Issue close_status updated to: worksforme
- Issue status updated to: Closed (was: Open)

3 years ago
