Issue #48261: Data loss on replication topology - 389-ds-base

Closed: wontfix 3 years ago by spichugi. Opened 8 years ago by baburaje12.

This issue concerns data loss in a replication topology.

I have observed data loss during replication between the supplier/hub and the consumer after the changelog DB file or the replica entry on the master/hub has been deleted for some reason.

Please note that the hub and the consumer are imported with stale data, and the consumer is not meant to be initialized when the new replication agreement is created. The test scenario is outlined below.

My topology looks like this:

(o=dev and c=test)   (o=dev and c=test)   (o=dev and c=test)   (o=dev and c=test)
Master ============= Hub1 =============== Hub2 =============== Consumer

I created two suffixes (o=dev and c=test) on all instances and set up replication for both, so that both suffixes are replicated from the master all the way down to the consumer. Assume that 10 entries have been added to each suffix in the topology (i.e. CSN1-CSN10) and that all servers are in sync at this point.

Reproducer steps:

1) Run db2ldif with the "-r" option on the Hub/supplier for both suffixes (o=dev, o=test). Make sure the replica instance is stopped before performing this step.
2) Delete the changelog and recreate it on the Hub2 side.
3) Delete the supplier DN (cn=Replication manager,cn=config) and re-add it on the Hub2 side.
4) Delete the replication agreements between Hub2 and the consumer for both suffixes (o=dev and o=test).
5) Delete the replica and re-add it for both suffixes (o=dev, o=test) on the Hub2 side.
6) Add 5 more entries (CSN11-CSN15) to the o=dev suffix ONLY, on the supplier side, and check that they are present on the supplier, Hub1, and Hub2. At this point o=dev has CSN1-CSN15 and o=test has CSN1-CSN10 (the 5 new entries were added only to o=dev).
7) Stop both the Hub2 and consumer slapd instances.
8) On the Hub2 side, import the data with ldif2db from the LDIF file taken in step 1.
9) Start both slapd instances (Hub2 and consumer).
10) Delete the supplier DN (cn=Replication manager,cn=config) and re-add it on the consumer side.
11) Delete the replica and re-add it for both suffixes (o=dev, o=test) on the consumer side.
12) Add the replication agreements between Hub2 and the consumer for both suffixes (o=dev and o=test).
13) Stop both slapd instances (Hub2 and consumer).
14) On the consumer side, import the data from the same LDIF file taken in step 1.
15) Start the Hub2 and consumer slapd instances.
16) Add another 5 entries to both suffixes (o=dev, o=test) on the master side (CSN16-CSN20).

Check the entries on the supplier, Hub2, and the consumer. You can now see that the entries added in step 6 (CSN11-CSN15) are missing on the consumer side for one suffix (o=dev).

Output on supplier/Hub1/Hub2:

o=dev: CSN1-CSN20 entries
o=test: CSN1-CSN15 entries

Output on the consumer side:

o=dev: CSN1-CSN10 entries only (5 entries are missing here)
o=test: CSN1-CSN15 entries

I have verified this against the latest code base and see the same behavior. Any suggestions are welcome.

I believe I have found the root cause of the issue.

Root cause:

When the changelog and the replica entry are deleted (steps 2 and 5), the recreated changelog no longer holds the previous max CSN of the o=test suffix. After the LDIF file is imported on the consumer (step 14), the consumer advertises an old max CSN that cannot be found in the hub's changelog. From the code, I can see that this is a condition that can arise in a replication session: the consumer's maxCSN can be located neither in the changelog database nor in the purge RUV list. When this happens, the supplier assumes that the cause is its own database having been initialized or reloaded. On that premise, it tries to determine the cursor value from its own RUV (maxCSN).
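The fallback described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the 389-ds-base implementation: CSNs are modelled as the plain integers 1-20 used in the reproducer, the purge RUV is simplified to a set of CSNs, the names `pick_anchor` and `updates_to_send` are hypothetical, and Hub2's recreated changelog is assumed to hold only CSN11-CSN15 when the consumer first connects.

```python
from typing import Iterable, List, Set


def pick_anchor(consumer_max: int,
                changelog: Iterable[int],
                purged: Set[int],
                supplier_max: int) -> int:
    """Return the CSN after which the supplier starts sending (toy model)."""
    if consumer_max in set(changelog) or consumer_max in purged:
        # Normal case: the consumer's position is known to the supplier.
        return consumer_max
    # Fallback described in the report: the supplier assumes its own
    # database was reloaded and anchors on the maxCSN of its own RUV.
    return supplier_max


def updates_to_send(changelog: Iterable[int], anchor: int) -> List[int]:
    """Changes strictly newer than the anchor, in CSN order."""
    return sorted(c for c in changelog if c > anchor)


# Assumption: the recreated Hub2 changelog only holds CSN11-CSN15.
hub2_changelog = [11, 12, 13, 14, 15]
# The consumer imported the step 1 LDIF, so it holds CSN1-CSN10.
consumer_entries = set(range(1, 11))

anchor = pick_anchor(max(consumer_entries), hub2_changelog, set(),
                     supplier_max=15)

# Step 16: CSN16-CSN20 reach Hub2 and are appended to its changelog.
hub2_changelog += [16, 17, 18, 19, 20]
sent = updates_to_send(hub2_changelog, anchor)
consumer_entries.update(sent)

missing = sorted(set(range(1, 21)) - consumer_entries)
print("anchor:", anchor)
print("sent:", sent)
print("missing on consumer:", missing)
```

In this model the consumer's CSN10 is not in Hub2's changelog, so the anchor becomes Hub2's own maxCSN (15) and only CSN16-CSN20 are sent, which leaves CSN11-CSN15 missing on the consumer. That matches the o=dev result reported above, but the real server logic involves per-replica RUV elements and is more involved than this sketch.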

I have identified the exact function where I can see the problem: ruv_get_min_or_max_csn() in repl5_ruv.c.

The function returns min{maxcsns of all RUV elements} if get_the_max == 0,
or max{maxcsns of all RUV elements} if get_the_max != 0.
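The selection rule stated above can be sketched in a few lines of Python. This is an illustration of the described semantics only, not a translation of the C code in repl5_ruv.c: the name `min_or_max_of_maxcsns` is hypothetical, CSNs are modelled as plain integers, and RUV elements without a maxcsn are assumed to be represented as None and skipped.

```python
from typing import Optional, Sequence


def min_or_max_of_maxcsns(maxcsns: Sequence[Optional[int]],
                          get_the_max: int) -> Optional[int]:
    """Pick the smallest or largest maxcsn across RUV elements (toy model).

    maxcsns holds one value per RUV element; None means the element has
    no maxcsn yet (an assumption of this sketch).
    """
    present = [c for c in maxcsns if c is not None]
    if not present:
        # No element carries a maxcsn, so there is nothing to return.
        return None
    # get_the_max == 0 selects the minimum, any other value the maximum.
    return max(present) if get_the_max != 0 else min(present)


# Two RUV elements with maxcsns 10 and 15, plus one element without a CSN.
elements = [10, None, 15]
print(min_or_max_of_maxcsns(elements, 0))  # minimum of the maxcsns
print(min_or_max_of_maxcsns(elements, 1))  # maximum of the maxcsns
```

With the values above the two calls select 10 and 15 respectively, which shows how the choice of get_the_max changes the point from which a supplier would resume sending.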

Please let me know your thoughts.

rmeggins commented 8 years ago

What is the exact version of 389-ds-base you are using? What is the platform?

nhosoi commented 7 years ago

Per triage, push the target milestone to 1.3.6.

Metadata Update from @rmeggins:
- Issue assigned to lkrispen
- Issue set to the milestone: 1.3.6.0

7 years ago

Metadata Update from @mreynolds:
- Custom field component reset (from Replication - General)
- Issue close_status updated to: None
- Issue set to the milestone: 1.3.7.0 (was: 1.3.6.0)

6 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to None
- Issue set to the milestone: 1.4.2 (was: 1.3.7.0)

4 years ago

Metadata Update from @vashirov:
- Issue priority set to: normal (was: critical)
- Issue set to the milestone: 1.4.4 (was: 1.4.2)

4 years ago

spichugi commented 3 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1592

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix
- Issue status updated to: Closed (was: Open)

3 years ago

Metadata

- Assignee: lkrispen
- Tags: None
- Blocking: None
- Depending on: None
- Priority: normal
- Milestone: 1.4.4
- reviewstatus: None
- rhbz: None
- origin: Community
