#18 Data inconsistency during replication
Closed: wontfix None Opened 12 years ago by mkosek.

https://bugzilla.redhat.com/show_bug.cgi?id=750425

Description of problem:

Data loss during the promotion operation (Slave to Master).
Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
Step-1:

Have a topology in which Master replicates to Slave and Slave replicates to
Consumer.

Master -> Slave -> Consumer.

Step-2:
Make sure that all servers are in sync at this time. For example, assume all are
in sync up to CSN5 (5 records were added to Master, CSN1 through CSN5).

Step-3:

Delete the replication agreement from Master to Slave and also from Slave to
consumer.

Step-4:

Promote the Slave to master.  Promotion steps are given below.

-       Delete the Supplier DN (cn=suppdn,cn=config) from Slave.
-       Delete the "cn=replica" entry for the suffix "o=USA" using ldapmodify. As a
result, the changelog file is deleted.
Ex: dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: delete
-       Modify the cn=o=USA,cn=mapping tree,cn=config entry as below.
Ex: dn: cn=o=USA,cn=mapping tree,cn=config
changetype: modify
replace: nsslapd-state
nsslapd-state: backend

dn: cn=o=USA,cn=mapping tree,cn=config
changetype: modify
delete: nsslapd-referral
-       Recreate the "cn=replica" entry for the suffix as below.
dn: cn=replica,cn=o=USA,cn=mapping tree,cn=config
changetype: add
objectClass: nsds5replica
objectClass: top
nsDS5ReplicaRoot: o=USA
nsDS5ReplicaType: 3
nsDS5Flags: 1
nsDS5ReplicaId: 10  --> Please assign the same nsDS5ReplicaId value that the
master had. In my case, the original master replica ID was 10.
nsds5ReplicaPurgeDelay: 1
nsds5ReplicaTombstonePurgeInterval: -1
cn: replica
-       Restart the slapd process. Now Slave becomes Master.

Am I missing anything in the promotion operation, or is this not the right
way to do the promotion?
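
The replica entry recreated in the promotion steps above depends only on the suffix and the replica ID. The following Python sketch just renders that LDIF text. The function name render_replica_ldif is hypothetical, the attribute values are copied from this report rather than recommended, and the DN is formatted exactly as in the report, without any escaping.

```python
def render_replica_ldif(suffix: str, replica_id: int) -> str:
    """Render the 'cn=replica' add entry from the promotion steps as LDIF text.

    Hypothetical helper: values mirror the ones quoted in this report.
    """
    lines = [
        f"dn: cn=replica,cn={suffix},cn=mapping tree,cn=config",
        "changetype: add",
        "objectClass: nsds5replica",
        "objectClass: top",
        f"nsDS5ReplicaRoot: {suffix}",
        "nsDS5ReplicaType: 3",
        "nsDS5Flags: 1",
        # Must match the replica ID the original master used (10 in this report).
        f"nsDS5ReplicaId: {replica_id}",
        "nsds5ReplicaPurgeDelay: 1",
        "nsds5ReplicaTombstonePurgeInterval: -1",
        "cn: replica",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_replica_ldif("o=USA", 10), end="")
```

The output could be saved to a file and passed to ldapmodify in the same way as the hand-written entry.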

Step-5:

Add the replication agreement between Slave (newly promoted Master) and Consumer.
At this time both Slave and Consumer are in sync up to CSN5. When creating the
agreement, do not initialize the Consumer.

Slave (newly promoted as Master) -> Consumer.

Step-6:

Add 5 more entries to Slave, which was promoted to Master above. Let's
assume the CSNs for these 5 entries are CSN6 through CSN10.

Step-7:

Now you will see that, of the last 5 entries, only the last few get replicated,
and replication does not halt.


Actual results:

Expected results:


Additional info:

Bug description:
1. Set up Master (replica ID = 1), Hub, and Consumer.
2. Shut down Master, reconfigure Hub into a master, and assign it the Master's replica ID (replica ID = 1).
3. Generate an agreement for the NewMaster (ex-Hub) pointing to the same Consumer.
4. Now, without initializing consumer on the new master, add multiple entries to the new master.
(I added 5 entries: test0, ..., test4)
Then the first couple of entries are dropped on the Consumer.
(Sometimes replication starts from test1, test2 or test3, but not from test0.)

The cause of the problem is that the purl in the Consumer's RUV is updated when the Hub is reconfigured to NewMaster, but the CSN and min CSN are not. When new adds/modifies are made to the NewMaster, the obsolete CSN is used to position the cursor in the newly created changelog, but that fails:
[18/Jan/2012:14:02:37 -0800] agmt="cn=master" (kiki:10391) - clcache_load_buffer: rc=-30988
(see the attachment master1.html)

Then _cl5PositionCursorForReplay retries positioning the cursor using the SupplierRUV, but since multiple updates are being made in this test case, the CSN picked up at that moment to position the cursor may not be that of the first update but of the second or a later one. (Note that if there is only one update, it is replicated correctly.)
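
To make the failure mode concrete, here is a toy Python model of the cursor positioning described above. It is not the 389-ds code. The function replayed_changes and the CSN labels are hypothetical, and the sketch assumes that the fallback starts replay at the CSN read from the supplier RUV, so older changes are never sent.

```python
def replayed_changes(changelog, consumer_csn, supplier_csn):
    """Return the changes replayed to the consumer in this toy model.

    changelog    -- CSNs in the supplier's changelog, oldest first
    consumer_csn -- last CSN recorded for this replica ID in the consumer's RUV
    supplier_csn -- CSN read from the supplier RUV when the fallback runs
    """
    if consumer_csn in changelog:
        # Normal case: resume right after the last change the consumer has seen.
        start = changelog.index(consumer_csn) + 1
    else:
        # Fallback (assumed behavior): position at the CSN found in the
        # supplier RUV; anything older than it is never sent.
        start = changelog.index(supplier_csn)
    return changelog[start:]


# The changelog was recreated on promotion, so it holds only CSN6..CSN10.
new_changelog = ["CSN6", "CSN7", "CSN8", "CSN9", "CSN10"]

if __name__ == "__main__":
    # The fallback runs after the third add was recorded: CSN6 and CSN7 are dropped.
    print(replayed_changes(new_changelog, "CSN5", "CSN8"))
    # With a single update the fallback can only pick that update, so nothing is lost.
    print(replayed_changes(["CSN6"], "CSN5", "CSN6"))
```

In this model the stale CSN5 is missing from the recreated changelog, so the outcome depends entirely on which CSN the supplier RUV holds when the retry happens.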

Reviewed by Rich (Thanks!!!)

Pushed to master.

$ git merge trac18
Updating 1bde54d..daaae1c
Fast-forward
ldap/servers/plugins/replication/repl5_ruv.c | 19 ++++++++++++-------
1 files changed, 12 insertions(+), 7 deletions(-)

$ git push
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.06 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
1bde54d..daaae1c master -> master

Steps to verify:
1. Set up Master (replica ID = 1), Hub, and Consumer.
2. Shut down Master, reconfigure Hub into a master, and assign it the Master's replica ID (replica ID = 1).
3. Generate an agreement for the NewMaster (ex-Hub) pointing to the same Consumer.
4. Now, without initializing consumer on the NewMaster, add multiple entries to the NewMaster.
For instance, add 5 entries: uid=test0, ..., uid=test4 with one ldapadd command-line.
If all 5 are replicated to the Consumer correctly, the fix is verified.
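
The final check in the verification steps can be sketched in Python as a comparison between the uids that were added and the uids found on the Consumer. The helper name missing_on_consumer is hypothetical, and collecting the uids from the Consumer (for example with a search) is left out of the sketch.

```python
def missing_on_consumer(expected_uids, consumer_uids):
    """Return the expected uids that are absent from the consumer, in order."""
    present = set(consumer_uids)
    return [uid for uid in expected_uids if uid not in present]


if __name__ == "__main__":
    expected = [f"test{i}" for i in range(5)]  # uid=test0 .. uid=test4
    # Hypothetical search result from a Consumer that still shows the bug.
    found = ["test2", "test3", "test4"]
    print(missing_on_consumer(expected, found))
```

An empty result means all 5 entries reached the Consumer.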

Added initial screened field value.

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.2.10.a7

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/18

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago
