Issue #49863: CleanAllRUV test suite fails to restore master4 in the topology - 389-ds-base

389-ds-base

#49863 CleanAllRUV test suite fails to restore master4 in the topology

Closed: wontfix 4 years ago by spichugi. Opened 5 years ago by spichugi.

Issue Description

Sometimes after we successfully finish some test function, we try to restore master4
in the topology (restore_master4 function). But the tests start to fail during
test_replication function (after successful master4 restoration) -
https://pagure.io/389-ds-base/blob/master/f/dirsrvtests/tests/suites/replication/cleanallruv_test.py#_160

This mostly happens after test_abort or test_abort_restart.
But I've seen the failure happening after test_clean_restart too.

It looks like a timing issue but it happens also on fast machines.

We can see the error log having these lines in the end:

on master1

[18/Jul/2018:16:48:05.207259621 -0400] - ERR - NSMMReplicationPlugin - ruv_compare_ruv - RUV [database RUV] does not contain element [{replica 4} 5b4fa75c000000040000 5b4fa76a000800040000] which is present in RUV [changelog max RUV]
[18/Jul/2018:16:48:05.209129423 -0400] - WARN - NSMMReplicationPlugin - replica_check_for_data_reload - For replica dc=example,dc=com there were some differences between the changelog max RUV and the database RUV.  If there are obsolete elements in the database RUV, you should remove them using the CLEANALLRUV task.  If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.
[18/Jul/2018:16:48:05.320759190 -0400] - INFO - slapd_daemon - slapd started.  Listening on All Interfaces port 39001 for LDAP requests
[18/Jul/2018:16:48:19.975883756 -0400] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=002" (host-172-16-36-24:39002): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[18/Jul/2018:16:48:19.977935906 -0400] - WARN - NSMMReplicationPlugin - acquire_replica - agmt="cn=003" (host-172-16-36-24:39003): Unable to receive the response for a startReplication extended operation to consumer (Can't contact LDAP server). Will retry later.
[18/Jul/2018:16:48:20.481433078 -0400] - INFO - NSMMReplicationPlugin - repl5_tot_run - Beginning total update of replica "agmt="cn=temp_004" (host-172-16-36-24:39004)".
[18/Jul/2018:16:48:21.114461980 -0400] - ERR - NSMMReplicationPlugin - check_flow_control_tot_init - agmt="cn=temp_004" (host-172-16-36-24:39004) -  Total update flow control gives time (2000 msec) to the consumer before sending more entries [ msgid sent: 1453, rcv: 452]) If total update fails you can try to increase nsds5ReplicaFlowControlPause and/or decrease nsds5ReplicaFlowControlWindow in the replica agreement configuration
[18/Jul/2018:16:48:22.370839738 -0400] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=002" (host-172-16-36-24:39002): Replication bind with SIMPLE auth resumed
[18/Jul/2018:16:48:22.373100328 -0400] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=003" (host-172-16-36-24:39003): Replication bind with SIMPLE auth resumed
[18/Jul/2018:16:48:25.676770766 -0400] - INFO - NSMMReplicationPlugin - repl5_tot_run - Finished total update of replica "agmt="cn=temp_004" (host-172-16-36-24:39004)". Sent 1519 entries.
[18/Jul/2018:16:48:28.646462533 -0400] - ERR - NSMMReplicationPlugin - multimaster_extop_StartNSDS50ReplicationRequest - conn=4 op=3 replica="dc=example,dc=com": Unable to acquire replica: error: duplicate replica ID detected

on master4:

[18/Jul/2018:16:48:25.440484901 -0400] - NOTICE - NSMMReplicationPlugin - multimaster_be_state_change - Replica dc=example,dc=com is coming online; enabling replication
[18/Jul/2018:16:48:25.485255650 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
[18/Jul/2018:16:48:25.485971910 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
[18/Jul/2018:16:48:25.486589078 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding the replication changelog RUV, this may take several minutes...
[18/Jul/2018:16:48:25.487219702 -0400] - NOTICE - NSMMReplicationPlugin - changelog program - _cl5ConstructRUV - Rebuilding replication changelog RUV complete.  Result 0 (Success)
[18/Jul/2018:16:48:25.744562488 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa75c000000040000
[18/Jul/2018:16:48:25.746187981 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa762000200040000
[18/Jul/2018:16:48:25.747534085 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa762000c00040000
[18/Jul/2018:16:48:25.748864428 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa763000100040000
[18/Jul/2018:16:48:25.750210622 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa769000100040000
[18/Jul/2018:16:48:25.751621244 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa76a000000040000
[18/Jul/2018:16:48:25.753006696 -0400] - ERR - NSMMReplicationPlugin - csnplCommit: can't find csn 5b4fa76a000800040000
[18/Jul/2018:16:48:28.655153846 -0400] - ERR - NSMMReplicationPlugin - acquire_replica - agmt="cn=001" (host-172-16-36-24:39001): Unable to aquire replica: the replica has the same Replica ID as this one. Replication is aborting.
[18/Jul/2018:16:48:28.656000657 -0400] - ERR - NSMMReplicationPlugin - repl5_inc_run - agmt="cn=001" (host-172-16-36-24:39001): Incremental update failed and requires administrator action

The replicaID from dse.ldif:

[root@host-172-16-36-24 dirsrv]# grep -rni replicaid
slapd-master1/dse.ldif.bak:1330:nsDS5ReplicaId: 1
slapd-master1/dse.ldif.startOK:1330:nsDS5ReplicaId: 1
slapd-master1/dse.ldif:1330:nsDS5ReplicaId: 1
slapd-master2/dse.ldif.bak:1330:nsDS5ReplicaId: 2
slapd-master2/dse.ldif.startOK:1329:nsDS5ReplicaId: 2
slapd-master2/dse.ldif:1330:nsDS5ReplicaId: 2
slapd-master3/dse.ldif.bak:1330:nsDS5ReplicaId: 3
slapd-master3/dse.ldif.startOK:1329:nsDS5ReplicaId: 3
slapd-master3/dse.ldif:1330:nsDS5ReplicaId: 3
slapd-master4/dse.ldif.bak:1330:nsDS5ReplicaId: 4
slapd-master4/dse.ldif:1330:nsDS5ReplicaId: 4

Package Version and Platform

389-ds-base built on master with https://pagure.io/389-ds-base/pull-request/49846

Steps to reproduce

The issue is not 100% reproducible.
It can be reproduced by running suites/replication/cleanallruv_test.py many times.

mreynolds commented 5 years ago

Simon try this patch out. It seems to work for me. The main part of the fix is the change to m4rid()

Metadata Update from @mreynolds:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None

5 years ago

vashirov commented 5 years ago

@mreynolds, I've ran cleanallruv test with your patch several times on a machine where it previously failed, it now passes all the time. Thanks!

spichugi commented 5 years ago

It still fails for me on the internal tool virtual machine.

[root@host-172-16-36-17 ds]# py.test -v dirsrvtests/tests/suites/replication/cleanallruv_test.py
==================== test session starts ======================
platform linux -- Python 3.6.6, pytest-3.6.3, py-1.5.4, pluggy-0.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
metadata: {'Python': '3.6.6', 'Platform': 'Linux-4.17.3-200.fc28.x86_64-x86_64-with-fedora-28-Twenty_Eight', 'Packages': {'pytest': '3.6.3', 'py': '1.5.4', 'pluggy': '0.6.0'}, 'Plugins': {'metadata': '1.7.0', 'html': '1.19.0'}}
389-ds-base: 1.4.0.13-20180724gitfd8ca9a.fc28
nss: 3.38.0-1.0.fc28
nspr: 4.19.0-1.fc28
openldap: 2.4.46-2.fc28
svrcore: not installed
FIPS: 0

rootdir: /mnt/tests/rhds/tests/upstream/ds, inifile:
plugins: metadata-1.7.0, html-1.19.0
collected 8 items

dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean PASSED [ 12%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean ERROR [ 12%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean_restart ERROR [ 25%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_clean_force ERROR [ 37%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort ERROR [ 50%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort_restart ERROR [ 62%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_abort_certify ERROR [ 75%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_stress_clean ERROR [ 87%]
dirsrvtests/tests/suites/replication/cleanallruv_test.py::test_multiple_tasks_with_force ERROR [100%]

============== ERRORS ================

It has the same ERROR as before:

 @pytest.fixture()
 def m4rid(request, topology_m4):
     log.debug("-------------- BEGIN RESET of m4 -----------------")
    repl = ReplicationManager(DEFAULT_SUFFIX)
>       repl.test_replication_topology(topology_m4.ms.values())

>       raise Exception("Replication did not sync in time!")
E       Exception: Replication did not sync in time!

Edited 5 years ago by spichugi

spichugi commented 5 years ago

Okay, the tests does pass on a faster machine still fails on a slower. We can increase the timeouts then.

mreynolds commented 5 years ago

How is it failing? What are the errors being reported?

spichugi commented 5 years ago

How is it failing? What are the errors being reported?

The same errors I reported in the issue description.
On a fast machine, it does pass 100% times though. I am trying to tweak some timeouts now...

spichugi commented 5 years ago

Mark, your diff was applied to https://pagure.io/389-ds-base/pull-request/49846
Thanks!

mreynolds commented 5 years ago

@spichugi - if this is fixed and you close this ticket?

Metadata Update from @mreynolds:
- Issue assigned to spichugi
- Issue set to the milestone: 1.4.0

5 years ago

spichugi commented 4 years ago

The issue is no longer present

Metadata Update from @spichugi:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

4 years ago

spichugi commented 3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2922

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

Metadata

Assignee

spichugi

Tags

None

Blocking

None

Depending on

None

Priority

None

Milestone

1.4.0

reviewstatus

None

rhbz

None

origin

None

Attachments 1

cleanallruv2.patch

Attached 5 years ago View Comment

389-ds-base

Source Code

#49863 CleanAllRUV test suite fails to restore master4 in the topology Closed: wontfix 4 years ago by spichugi. Opened 5 years ago by spichugi.

Issue Description

Package Version and Platform

Steps to reproduce

Metadata

Attachments 1

#49863 CleanAllRUV test suite fails to restore master4 in the topology

Closed: wontfix 4 years ago by spichugi. Opened 5 years ago by spichugi.