#47442 repl problems with single-valued attrs, conflicts, modrdn
Opened 6 years ago by rmeggins. Modified 3 months ago

setup mmr m1 and m2

add an entry m1 with a single valued attr - replicate to m2

1)
pause replication
on m1 - ldapmodify
dn: entry
changetype: modify
delete: svattr
svattr: origvalue

sleep 1

on m2 - ldapmodify
dn: entry
changetype: modify
delete: svattr
svattr: origvalue
-
add: svattr
svattr: newvalue

unpause replication m1 to m2
sleep 5
unpause replication m2 to m1
sleep 5

the m2 mod should "win" - but it does not - different results on each server

2) same as 1) - but do the m2 mod first, then the m1 mod, and unpause m2-m1 first, then m1-m2

3) delete the svattr from m1 - replicate to m2
pause replication
on m1 - ldapmodify
dn: entry
changetype: modify
replace: svattr
svattr: origvalue
-
add: svattr
svattr: newvalue
-
delete: svattr
svattr: origvalue

sleep 1

modrdn - newrdn "svattr=newvalue" deleteoldrdn=1

sleep 1

on m2
modrdn - newrdn "svattr=origvalue"

unpause m1-m2
sleep 5
unpause m2-m1
sleep 5

the entries will not be in sync

4) same as 3) but do the ops on m2 first, then on m1, then unpause m2-m1 first, then m1-m2
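
For scenarios 1) and 2), the expected "last mod wins" outcome can be sketched with a toy model. This is not the server's code. The `Mod` class, the `resolve` function and the integer csns are assumptions made for illustration. Each mod carries the csn of its operation, and the surviving value is computed from the set of mods alone, so the order in which a server receives them cannot change the result.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Iterable, Optional


@dataclass(frozen=True)
class Mod:
    csn: int    # stand-in for a real csn; larger means later
    kind: str   # "add" or "delete"
    value: str


def resolve(mods: Iterable[Mod]) -> Optional[str]:
    """Return the surviving value of a single-valued attribute, or None."""
    mods = list(mods)
    survivors = []
    for add in mods:
        if add.kind != "add":
            continue
        # An add is cancelled by a delete of the same value with a later csn.
        deleted = any(
            d.kind == "delete" and d.value == add.value and d.csn > add.csn
            for d in mods
        )
        if not deleted:
            survivors.append(add)
    if not survivors:
        return None
    # Single-valued attribute: the surviving add with the highest csn wins.
    return max(survivors, key=lambda m: m.csn).value


history = [
    Mod(0, "add", "origvalue"),     # initial entry
    Mod(1, "delete", "origvalue"),  # mod on m1
    Mod(2, "delete", "origvalue"),  # mod on m2, first part
    Mod(2, "add", "newvalue"),      # mod on m2, second part (same operation)
]

# Every arrival order must give the same answer.
results = {resolve(order) for order in permutations(history)}
print(results)
```

In this model both masters end up with `newvalue`, which is the result the ticket expects from the real servers.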


There is another scenario using M1,M2, M3
M1,M2,M3 are in sync, using employeenumber as svattr

stop M2,M3
start M1

on M1
delete: employeenumber
employeenumber: oldnumber
-
add: employeenumber
employeenumber: oldnumber+1

stop M1
start M2
on M2:
delete: employeenumber
employeenumber: oldnumber
-
add: employeenumber
employeenumber: oldnumber+2

stop M2
start M3
on M3:
delete: employeenumber
employeenumber: oldnumber
-
add: employeenumber
employeenumber: oldnumber+3

start M1
start M2

M1 and M2 have oldnumber+3
M3 has oldnumber+2

That was from a deployment where I found the issue; two masters will probably show the problem as well.

I wrote a doc for update resolution for single valued attributes:
http://port389.org/wiki/Update_resolution_for_single_valued_attributes

and implemented a test suite using lib389. There are 707 test cases based on the tables in chapter 6, and about 300 are failing. For an example based on comment 6, where there are three masters but updates were made on only two of them, the result after replication convergence is:

M1
nscpentrywsi: employeeNumber;vucsn-536b5a98000000c80001: 21000
M2
nscpentrywsi: employeeNumber;vucsn-536b5a20000000640001: 11000
M3
nscpentrywsi: employeeNumber;adcsn-536b5a20000000640000;vucsn-536b5a98000000c80001: 21000

So not only do the values differ, but even where the values are identical the replication metadata differs.
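
To compare these state lines it helps to split them into their parts. The sketch below assumes that a csn string is 20 hex digits laid out as timestamp (8), sequence number (4), replica id (4) and sub-sequence number (4). The helper names `parse_csn` and `parse_wsi` are made up for this example and are not part of lib389.

```python
from typing import Dict, NamedTuple, Tuple


class CSN(NamedTuple):
    timestamp: int
    seqnum: int
    rid: int
    subseq: int


def parse_csn(text: str) -> CSN:
    """Split a 20-hex-digit csn string (assumed layout: 8/4/4/4)."""
    if len(text) != 20:
        raise ValueError(f"unexpected csn length: {text!r}")
    return CSN(
        int(text[0:8], 16),
        int(text[8:12], 16),
        int(text[12:16], 16),
        int(text[16:20], 16),
    )


def parse_wsi(line: str) -> Tuple[str, Dict[str, CSN], str]:
    """Return (attribute name, {csn type: CSN}, value) for one state line."""
    prefix = "nscpentrywsi: "
    if line.startswith(prefix):
        line = line[len(prefix):]
    description, value = line.rsplit(": ", 1)
    name, *options = description.split(";")
    csns = {}
    for option in options:
        kind, _, raw = option.partition("-")  # e.g. "vucsn", "536b5a98..."
        csns[kind] = parse_csn(raw)
    return name, csns, value


m3 = "nscpentrywsi: employeeNumber;adcsn-536b5a20000000640000;vucsn-536b5a98000000c80001: 21000"
name, csns, value = parse_wsi(m3)
print(name, value, csns["adcsn"].rid, csns["vucsn"].rid)
```

Because `CSN` is a tuple, two csns compare field by field, so `csns["vucsn"] > csns["adcsn"]` holds for the M3 line above.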

I will concentrate on fixing this testcase next

The reason for the failure in the example in comment 8 is that state resolution treats every value found in the deleted values as a pending value and makes it the present value. Checking whether these values have an update csn fixes these cases, but a deleted value could probably just be removed instead of being moved to the deleted values.

The next type of failure is:
on M1 delete the single valued attribute (by an empty replace)
on M1 make it distinguished.
As a result the value is part of the rdn, but the attribute is gone.

Need to verify if this is not an effect of the previous fix.

Fixed the scenario in comment 9

There were two problems. In both, the csn of the modrdn is greater than the csn of the delete.
If the delete was received after the modrdn, there are situations where the value has only an MDCSN but no VDCSN. Only the VDCSN is used in the comparison, so it was not detected that the modrdn was after the delete.
If the modrdn was received after the delete, the attr state was deleted. It was correctly detected that the modrdn is more recent, but the attr was not moved to the present attributes, so the value did not show up in the entry.

After fixing this, the number of failing tests was considerably reduced: 297 --> 204

Now I'm looking into a scenario where it is a bit unclear what the expected behaviour really is:

Let the entry have the single valued attribute employeenumber
employeenumber: 1000

On M1:
changetype: modify
replace: employeenumber
employeenumber: 2000

On M2
changetype: modify
delete: employeenumber
employeenumber: 1000

After replication converges, the attribute has the value 2000 on all servers. What is correct depends on how the mod on M2 is viewed.
It is deleting the single value of an attribute, so it is equivalent to deleting the attribute. If this is done by

delete: employeenumber

or
replace: employeenumber
-

Then the value is removed on all servers.

If one views it just as an attempt to delete a specific value, then if the value was replaced before the delete, the delete fails with 'non existing value' and the replacement value remains.

I'm inclined to go with the current behaviour. It would also fit with the "single master" model, if a delete of a specific value is applied after the value was changed, the delete just fails. So it can be justified and does not change behaviour - I modified the test suite, now there are "only" 171 failures remaining
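
The "single master" reading can be sketched as a replay of the mods in csn order against one entry. This is a simplified illustration with made-up names (`apply_mod`, `replay`, `NoSuchValue`), not server code. It assumes that a delete naming a value the entry no longer holds simply fails and leaves the entry unchanged.

```python
class NoSuchValue(Exception):
    """Raised when a delete targets a value or attribute the entry lacks."""


def apply_mod(entry, op, attr, value=None):
    """Apply one mod to a dict-based entry holding single-valued attributes."""
    if op == "replace":
        if value is None:
            entry.pop(attr, None)  # empty replace removes the attribute
        else:
            entry[attr] = value
    elif op == "delete":
        if attr not in entry or (value is not None and entry[attr] != value):
            raise NoSuchValue(f"{attr}: {value}")
        del entry[attr]
    else:
        raise ValueError(f"unsupported op: {op}")


def replay(entry, mods):
    """Apply mods in csn order; a failing mod leaves the entry unchanged."""
    result = dict(entry)
    for mod in mods:
        try:
            apply_mod(result, *mod)
        except NoSuchValue:
            pass  # the delete "just fails", as on a single master
    return result


start = {"employeenumber": "1000"}
replace_2000 = ("replace", "employeenumber", "2000")
delete_1000 = ("delete", "employeenumber", "1000")
delete_attr = ("delete", "employeenumber")

print(replay(start, [replace_2000, delete_1000]))  # value-specific delete arrives late
print(replay(start, [delete_1000, replace_2000]))  # value-specific delete arrives first
print(replay(start, [replace_2000, delete_attr]))  # whole-attribute delete arrives late
```

With the value-specific delete the entry keeps 2000 in either order, while the whole-attribute delete applied last removes the attribute. That matches the two readings discussed above.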

Most of the remaining failures involve modrdn operations. One specific scenario is:

On M1:
changetype: modrdn
newrdn: employeenumber=nnn
deleteoldrdn: 0

On M2:
changetype: modify
replace: employeenumber
-

if the csn of the change on M2 is later than on M1 then the rdn is employeenumber=nnn,<suffix>
but the attribute is not present.
There are two different failures.
In the server receiving a delete after the modrdn (with csn del > csn modrdn), the code doesn't detect that the attr has to be distinguished and cannot be deleted.
The other way round is more complicated:
if the modrdn is received after the delete, urp doesn't even know that the value has to be distinguished. The current code calls entry_add_rdn_csn() after entry_apply_mods_wsi() has been executed. Trying to call it before fails, because before entry_apply_mods_wsi() the attr is not in the present attrs and so the csn cannot be set.

Resolved the issues regarding the mdcsn: the mdcsn needs to be cleared for the attrs in the old_rdn and set for the attrs in the new_rdn before calling entry_apply_mods_wsi.
resolved a few more scenarios

and now I am down to 15 failures

These are all scenarios where the attribute is distinguished and there are concurrent state changes (to different states) on different masters. It is not really obvious what the correct value of the attribute should be (will update the doc), but the value is also not consistent across the servers.
Will try to find out if we can at least get a consistent state.

There is a problem with state resolution for concurrent modrdn operations.
If two modrdn operations are done concurrently, one has a higher csn. When the modrdn with the lower csn is replayed, the urp preop plugin detects this and the modrdn is ignored.
The effect is that the dn is always updated to respect the latest modrdn, but the attribute state resolution is only called when the later modrdn is applied, so inconsistent states can result.
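
A toy model of this effect, with a hypothetical `Replica` class that only records for which csns attribute state resolution ran. It is a sketch of the description above, not of the actual urp code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Replica:
    rdn: str
    rdn_csn: int = 0
    # csns of the modrdn ops for which attribute state resolution was run
    resolved: List[int] = field(default_factory=list)

    def replay_modrdn(self, csn: int, newrdn: str) -> bool:
        """Apply a modrdn unless a later one was already applied."""
        if csn < self.rdn_csn:
            return False  # older rename is ignored, no state resolution
        self.rdn, self.rdn_csn = newrdn, csn
        self.resolved.append(csn)
        return True


# Server a receives the renames in csn order, server b in reverse order.
a = Replica("employeenumber=1000")
a.replay_modrdn(1, "employeenumber=2000")
a.replay_modrdn(2, "employeenumber=3000")

b = Replica("employeenumber=1000")
b.replay_modrdn(2, "employeenumber=3000")
b.replay_modrdn(1, "employeenumber=2000")

print(a.rdn == b.rdn)          # the dn converges
print(a.resolved, b.resolved)  # but state resolution ran a different number of times
```

Both servers end with the same rdn, but only the server that saw the renames in csn order ran resolution for both of them. That is where the attribute state can diverge.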

It would take a lot of changes to modrdn to fully handle these situations.

I also noticed a side effect that in the case of the ignored modrdn the maxcsn in the ruv is not updated and the op is replayed until an effective mod is received.

Hi Ludwig,

Could you update this ticket with the current status?

Do we want to push this ticket to 1.3.5?

Thanks!

yes, should be 1.3.5, change is not simple, so will not get it ready earlier

Replying to [comment:18 lkrispen]:

> yes, should be 1.3.5, change is not simple, so will not get it ready earlier

Thank you, Ludwig!

Per triage, push the target milestone to 1.3.6.

Metadata Update from @lkrispen:
- Issue assigned to lkrispen
- Issue set to the milestone: 1.3.6.0

2 years ago

Metadata Update from @mreynolds:
- Custom field component reset (from Replication - General)
- Custom field reviewstatus reset (from needinfo)
- Custom field rhbz reset (from 0)
- Issue set to the milestone: 1.3.7.0 (was: 1.3.6.0)

2 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to None
- Issue set to the milestone: 1.4.2 (was: 1.3.7.0)

3 months ago
