https://bugzilla.redhat.com/show_bug.cgi?id=750425
Description of problem: Data loss during the promotion operation (Slave to Master).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Step-1: Set up a topology in which the Master replicates to the Slave and the Slave replicates to the Consumer: Master -> Slave -> Consumer.

Step-2: Make sure all three servers are in sync. For example, assume all are in sync up to CSN5 (5 records were added to the Master, CSN1 through CSN5).

Step-3: Delete the replication agreement from Master to Slave and the one from Slave to Consumer.

Step-4: Promote the Slave to master. The promotion steps are:

- Delete the Supplier DN (cn=suppdn,cn=config) from the Slave.

- Delete the "cn=replica" entry for the suffix "o=USA" using ldapmodify. As a result, the changelog file is deleted. Ex:

    dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
    changetype: delete

- Modify the cn=o=USA,cn=mapping tree,cn=config entry as below. Ex:

    dn: cn=o=USA,cn=mapping tree,cn=config
    changetype: modify
    replace: nsslapd-state
    nsslapd-state: backend

    dn: cn=o=USA,cn=mapping tree,cn=config
    changetype: modify
    delete: nsslapd-referral

- Recreate the "cn=replica" entry for the suffix as below.

    dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
    changetype: add
    objectClass: nsds5replica
    objectClass: top
    nsDS5ReplicaRoot: o=USA
    nsDS5ReplicaType: 3
    nsDS5Flags: 1
    nsDS5ReplicaId: 10
    nsds5ReplicaPurgeDelay: 1
    nsds5ReplicaTombstonePurgeInterval: -1
    cn: replica

  Note: assign the same nsDS5ReplicaId value that the original Master had. In my case, the original Master's replica ID was 10.

- Restart the slapd process. The Slave has now become a Master.

Am I missing anything in the promotion operation, or is this not the right way to do the promotion?

Step-5: Add a replication agreement between the Slave (newly promoted Master) and the Consumer. At this point both the Slave and the Consumer are in sync up to CSN5. Do not initialize the Consumer when creating the agreement. Slave (newly promoted Master) -> Consumer.

Step-6: Add 5 more entries to the Slave that was promoted to Master above. Assume the CSNs for these 5 entries are CSN6 through CSN10.

Step-7: Only the last few of these 5 entries are replicated to the Consumer; the earlier ones are missing, yet replication does not halt.

Actual results:

Expected results:

Additional info:
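The promotion modifications in Step-4 can be collected into a single LDIF stream for ldapmodify. The following is only a sketch that assembles the text quoted in this ticket; the function name `build_promotion_ldif` and its parameters are hypothetical, the DN forms are copied from the report as written, and whether this sequence is the correct promotion procedure is exactly what the reporter is asking.

```python
def build_promotion_ldif(suffix: str, replica_id: int,
                         supplier_dn: str = "cn=suppdn,cn=config") -> str:
    """Assemble the LDIF records listed in Step-4 of this ticket.

    Hypothetical helper: it only formats text and does not talk to a server.
    """
    mapping_dn = f"cn={suffix},cn=mapping tree,cn=config"
    replica_dn = f"cn=replica,{mapping_dn}"
    records = [
        # Remove the supplier bind DN from the old Slave.
        [f"dn: {supplier_dn}", "changetype: delete"],
        # Remove the old replica entry (this also drops the changelog).
        [f"dn: {replica_dn}", "changetype: delete"],
        # Make the suffix writable and remove the referral.
        [f"dn: {mapping_dn}", "changetype: modify",
         "replace: nsslapd-state", "nsslapd-state: backend"],
        [f"dn: {mapping_dn}", "changetype: modify",
         "delete: nsslapd-referral"],
        # Recreate the replica entry, reusing the old Master's replica ID.
        [f"dn: {replica_dn}", "changetype: add",
         "objectClass: nsds5replica", "objectClass: top",
         f"nsDS5ReplicaRoot: {suffix}", "nsDS5ReplicaType: 3",
         "nsDS5Flags: 1", f"nsDS5ReplicaId: {replica_id}",
         "nsds5ReplicaPurgeDelay: 1",
         "nsds5ReplicaTombstonePurgeInterval: -1", "cn: replica"],
    ]
    # LDIF records are separated by a blank line.
    return "\n\n".join("\n".join(r) for r in records) + "\n"


if __name__ == "__main__":
    print(build_promotion_ldif("o=USA", 10))
```

The output could be saved to a file and passed to ldapmodify; the slapd restart from the last bullet of Step-4 still has to be done separately.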
snippet of error log master1.html
Bug description:
1. Set up Master (replica ID = 1), Hub, and Consumer.
2. Shut down Master, reconfigure Hub into a master, and assign it the Master's replica ID (replica ID = 1).
3. Generate an agreement for the NewMaster (ex-Hub) pointing to the same Consumer.
4. Now, without initializing the Consumer from the NewMaster, add multiple entries to the NewMaster. (I added 5 entries: test0, ..., test4.)

Then the first couple of entries are dropped on the Consumer. (Sometimes replication starts from test1, test2, or test3, but not from test0.)
The cause of the problem is that the purl in the Consumer's RUV is updated when the Hub is reconfigured into the NewMaster, but the CSN and min CSN are not. When new adds/modifies are made to the NewMaster, the obsolete CSN is used to position the cursor in the newly created changelog, and the positioning fails:

[18/Jan/2012:14:02:37 -0800] agmt="cn=master" (kiki:10391) - clcache_load_buffer: rc=-30988

(see the attachment master1.html)
Then _cl5PositionCursorForReplay retries positioning the cursor using SupplierRUV. Because multiple updates are being made in this test case, the CSN picked up at that moment may belong not to the first update but to the second or a later one. (Note that if there is only one update, it is replicated correctly.)
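The failure mode described above can be illustrated with a toy model. This is not the changelog code from 389-ds-base: CSNs are plain integers, the changelog is a sorted list, and `replay_from` and `dropped_updates` are hypothetical names. The model only assumes what the analysis states: the Consumer's stale CSN is not found in the recreated changelog, and the retry anchors on whatever CSN the supplier RUV holds at that moment.

```python
def replay_from(changelog, consumer_max_csn, supplier_csn_at_retry):
    """Return the CSNs that would be sent to the consumer in this toy model.

    changelog: sorted CSNs present in the newly created changelog.
    consumer_max_csn: the (possibly obsolete) CSN recorded in the consumer RUV.
    supplier_csn_at_retry: the CSN the supplier RUV holds when the retry runs;
        it must be present in the changelog.
    """
    if consumer_max_csn in changelog:
        # Normal case: resume right after the last CSN the consumer has seen.
        start = changelog.index(consumer_max_csn) + 1
    else:
        # Fallback: the stale CSN is not in the new changelog, so anchor on
        # the supplier's current CSN, which may already be past the first
        # new update. Raises ValueError if that CSN is not in the changelog.
        start = changelog.index(supplier_csn_at_retry)
    return changelog[start:]


def dropped_updates(changelog, consumer_max_csn, supplier_csn_at_retry):
    """CSNs newer than the consumer's state that are never sent."""
    sent = replay_from(changelog, consumer_max_csn, supplier_csn_at_retry)
    return [c for c in changelog if c > consumer_max_csn and c not in sent]


if __name__ == "__main__":
    # Consumer is at CSN 5; the new changelog only holds CSN 6..10.
    print(dropped_updates([6, 7, 8, 9, 10], 5, 8))
    # With a single update the anchor is that update, so nothing is lost.
    print(dropped_updates([6], 5, 6))
```

In this model the number of dropped updates depends only on how far the supplier RUV has advanced when the retry happens, which matches the observation that replication sometimes starts from test1, test2, or test3.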
git patch file (master) 0001-Trac-Ticket-18-Data-inconsitency-during-replication.patch
Reviewed by Rich (Thanks!!!)
Pushed to master.
$ git merge trac18
Updating 1bde54d..daaae1c
Fast-forward
 ldap/servers/plugins/replication/repl5_ruv.c | 19 ++++++++++++-------
 1 files changed, 12 insertions(+), 7 deletions(-)
$ git push
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.06 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   1bde54d..daaae1c  master -> master
Steps to verify:
1. Set up Master (replica ID = 1), Hub, and Consumer.
2. Shut down Master, reconfigure Hub into a master, and assign it the Master's replica ID (replica ID = 1).
3. Generate an agreement for the NewMaster (ex-Hub) pointing to the same Consumer.
4. Now, without initializing the Consumer from the NewMaster, add multiple entries to the NewMaster. For instance, add 5 entries, uid=test0, ..., uid=test4, with a single ldapadd command line.

If all 5 entries are replicated to the Consumer correctly, the fix is verified.
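For step 4 of the verification, the five entries can be generated as one LDIF file so that a single ldapadd invocation adds them all. The sketch below is an assumption-laden helper, not part of the fix: the suffix `o=USA` is borrowed from the original report, the inetOrgPerson object classes and attribute values are illustrative choices, and both function names are hypothetical.

```python
def build_test_entries_ldif(count: int = 5, suffix: str = "o=USA") -> str:
    """Build LDIF for uid=test0 .. uid=test{count-1} under the given suffix.

    The suffix and object classes are assumptions for illustration only.
    """
    records = []
    for i in range(count):
        uid = f"test{i}"
        records.append("\n".join([
            f"dn: uid={uid},{suffix}",
            "objectClass: top",
            "objectClass: person",
            "objectClass: organizationalPerson",
            "objectClass: inetOrgPerson",
            f"uid: {uid}",
            f"cn: {uid}",
            f"sn: {uid}",
        ]))
    # LDIF records are separated by a blank line.
    return "\n\n".join(records) + "\n"


def missing_on_consumer(expected_uids, consumer_uids):
    """Return the expected uids that did not reach the consumer, sorted."""
    return sorted(set(expected_uids) - set(consumer_uids))


if __name__ == "__main__":
    print(build_test_entries_ldif())
    expected = [f"test{i}" for i in range(5)]
    # Example of the buggy outcome: only the last three entries arrived.
    print(missing_on_consumer(expected, ["test2", "test3", "test4"]))
```

After the add, the uids found on the Consumer (for example from an ldapsearch) can be passed to `missing_on_consumer`; an empty result corresponds to the "all 5 are replicated" condition.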
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=788745
Added initial screened field value.
Metadata Update from @nhosoi: - Issue assigned to nhosoi - Issue set to the milestone: 1.2.10.a7
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/18
If you want to receive further updates on the issue, please navigate to the GitHub issue and click the subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)