#359 Database RUV could mismatch the one in changelog under the stress

Created 4 years ago by nhosoi
Modified 2 months ago

Found in a stress test.

On one master (blademtv5), the database RUV element for another master (blademtv1) stopped being updated, while the corresponding RUV in the changelog kept being updated. The mismatch stops replication of changes that originate on blademtv1.

The cause: simultaneous MODRDN operations caused conflicts, and one conflict resolution failed, which left an uncommitted CSN in the CSN list of the RUV element. That uncommitted CSN made it impossible to get the max CSN needed to update the database RUV.

{{{
Fix description:
. csnplRollUp (csnpl.c) - To get the first committed csndata, if
  there are preceding uncommitted CSNs in the csnpl list, this
  patch skips them and returns the first committed CSN.
. llistRemoveCurrentAndGetNext (llist.c) - When the last item
  in the list is removed, the tail pointer is initialized, too.
. multimaster_preop_*|bepreop_* (repl5_plugins.c) - process_operation
  is moved from multimaster_preop_* to multimaster_bepreop_* so that
  the uncommitted CSN that process_operation sets in the csnpl
  (RUV element) is not left uncommitted; the commit is done at
  the BE_TXN_POST timing.

}}}
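The first two items of the fix description can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code in csnpl.c or llist.c: the `pending_node` and `pending_list` types and the `pl_append` and `pl_rollup` functions are hypothetical, a CSN is reduced to a single integer, and the sketch reports the largest committed CSN because that is the value the ticket says the RUV update needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the csnpl/llist structures. */
typedef struct pending_node {
    uint64_t csn;               /* CSN reduced to one comparable integer */
    bool committed;
    struct pending_node *next;
} pending_node;

typedef struct {
    pending_node *head;
    pending_node *tail;
} pending_list;

/* Append a CSN to the end of the list; returns false if allocation fails. */
static bool pl_append(pending_list *pl, uint64_t csn, bool committed)
{
    assert(pl != NULL);
    pending_node *node = malloc(sizeof(*node));
    if (node == NULL) {
        return false;
    }
    node->csn = csn;
    node->committed = committed;
    node->next = NULL;
    if (pl->tail != NULL) {
        pl->tail->next = node;
    } else {
        pl->head = node;
    }
    pl->tail = node;
    return true;
}

/*
 * Remove every committed node and leave uncommitted ones in place.
 * Returns true if at least one committed CSN was found and stores the
 * largest one in *max_committed.
 */
static bool pl_rollup(pending_list *pl, uint64_t *max_committed)
{
    assert(pl != NULL && max_committed != NULL);
    bool found = false;
    pending_node *prev = NULL;
    pending_node *cur = pl->head;

    while (cur != NULL) {
        pending_node *next = cur->next;
        if (cur->committed) {
            if (!found || cur->csn > *max_committed) {
                *max_committed = cur->csn;
            }
            found = true;
            if (prev != NULL) {
                prev->next = next;
            } else {
                pl->head = next;
            }
            if (cur == pl->tail) {
                /* Removing the last item: fix the tail as well, so it
                 * never points at freed memory (NULL for an empty list). */
                pl->tail = prev;
            }
            free(cur);
        } else {
            prev = cur;         /* skip the uncommitted CSN, keep it listed */
        }
        cur = next;
    }
    return found;
}
```

With a list holding an uncommitted CSN followed by two committed ones, `pl_rollup` reports the largest committed CSN and leaves only the uncommitted entry behind, with `head` and `tail` both pointing at it.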

Looks good. I would still like to know what changed in 1.2.10 that caused this problem.

{{{
Fix description:
. csnplRollUp (csnpl.c) - To get the first committed csndata, if
  there are preceding uncommitted CSNs in the csnpl list, this
  patch skips them and returns the first committed CSN.
. llistRemoveCurrentAndGetNext (llist.c) - When the last item
  in the list is removed, the tail pointer is initialized, too.
. ldbm_back_add, ldbm_back_modrdn (ldbm_add.c, ldbm_modrdn.c) -
  Make sure SLAPI_RESULT_CODE and SLAPI_PLUGIN_OPRETURN are set
  not just when the transaction is started, but in general.
  If an error occurs, the RESULT_CODE triggers removal of the CSN
  from the RUV element.
. plugin_call_func (plugin.c) - When the plugin type is be pre/post
  op, respect the fatal error code (-1) instead of ORing the
  results from all the plugins. The error code -1 is checked
  in ldbm_back_add and ldbm_back_modrdn to distinguish it from the
  URP operation bits.
}}}
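The plugin_call_func item describes a rule for combining return codes: a fatal -1 is kept as-is, and anything else is treated as flag bits to be ORed together. A minimal sketch of that rule follows, assuming non-fatal results are non-negative bit flags. The names `PLUGIN_FATAL` and `combine_be_plugin_results` are hypothetical and are not the identifiers used in plugin.c; the sketch also stops merging at the first fatal code, which the real function may not do.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: -1 is the fatal error code named in the fix description. */
#define PLUGIN_FATAL (-1)

/*
 * Combine the return codes of several backend pre/post op plugins.
 * A fatal code is returned unchanged; all other codes are assumed to be
 * non-negative flag bits and are ORed together.
 */
static int combine_be_plugin_results(const int *results, size_t count)
{
    assert(results != NULL || count == 0);
    int combined = 0;

    for (size_t i = 0; i < count; i++) {
        if (results[i] == PLUGIN_FATAL) {
            return PLUGIN_FATAL;    /* fatal wins; do not merge further */
        }
        combined |= results[i];
    }
    return combined;
}
```

A caller can then test for `PLUGIN_FATAL` first and only afterwards inspect individual bits, which is the distinction the description draws between the fatal code and the URP operation bits.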

Sorry, I have to back off.

Found another problem in the patch. :(

[08/May/2012:12:08:54 -0700] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 1 ldap://<host>:<port>} 4fa96f07000000010000 4fa96f07000000010000] which is present in RUV [database RUV]

Okay, the problem (ruv_compare_ruv) was introduced prior to my patch.

If you go back beyond this commit, the problem disappears.
commit 0f50544
Date: Mon Apr 23 13:36:04 2012 -0400
Ticket #337 - RFE - Improve CLEANRUV functionality

Replying to [comment:9 nhosoi]:
> Okay, the problem (ruv_compare_ruv) was introduced prior to my patch.
>
> If you go back beyond this commit, the problem disappears.
> commit 0f50544
> Date: Mon Apr 23 13:36:04 2012 -0400
> Ticket #337 - RFE - Improve CLEANRUV functionality

But I thought you were seeing the original problem for ticket #359 with 1.2.10? I believe that patch only applies to 1.2.11 and later. Or are you saying your patch for #359 works with 1.2.10 but not with 1.2.11 because of #337?

Replying to [comment:10 rmeggins]:
> Replying to [comment:9 nhosoi]:
> > Okay, the problem (ruv_compare_ruv) was introduced prior to my patch.
> >
> > If you go back beyond this commit, the problem disappears.
> > commit 0f50544
> > Date: Mon Apr 23 13:36:04 2012 -0400
> > Ticket #337 - RFE - Improve CLEANRUV functionality
>
> But I thought you were seeing the original problem for ticket #359 with 1.2.10? I believe that patch only applies to 1.2.11 and later. Or are you saying your patch for #359 works with 1.2.10 but not with 1.2.11 because of #337?

Sorry about the confusion. But there is something odd going on...
On my F16 machine, I tested both. My local build from master shows the problem (ruv_compare_ruv) once #337 is included. I don't see it with my local build from the 1.2.10 branch with my patch. (See #337: just installing 2 masters + 1 hub shows the problem.)

On blademtv5, I installed my local 1.2.10.8 build with my patch (no #337) on top of Michael's test env, which showed this error:
[07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 2 ldap://blademtv4-6.lab.sjc.redhat.com:38001} 4f91ecf6000000020000 4fa2f707000000020000] which is present in RUV [database RUV]
[07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: for replica o=my_suffix.com there were some differences between the changelog max RUV and the database RUV. If there are obsolete elements in the database RUV, you should remove them using CLEANRUV task. If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.

Unfortunately, since RUV on the server blademtv5 was broken anyway, it was hard for me to figure out the problem. So, I switched to test it on my machine.

Replying to [comment:11 nhosoi]:
> Replying to [comment:10 rmeggins]:
>
> On blademtv5, I installed my local 1.2.10.8 build with my patch (no #337) on top of Michael's test env, which showed this error:
> [07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - ruv_compare_ruv: RUV [changelog max RUV] does not contain element [{replica 2 ldap://blademtv4-6.lab.sjc.redhat.com:38001} 4f91ecf6000000020000 4fa2f707000000020000] which is present in RUV [database RUV]
> [07/May/2012:18:16:51 -0700] NSMMReplicationPlugin - replica_check_for_data_reload: Warning: for replica o=my_suffix.com there were some differences between the changelog max RUV and the database RUV. If there are obsolete elements in the database RUV, you should remove them using CLEANRUV task. If they are not obsolete, you should check their status to see why there are no changes from those servers in the changelog.
>
> Unfortunately, since RUV on the server blademtv5 was broken anyway, it was hard for me to figure out the problem. So, I switched to test it on my machine.

I believe the above problem was caused by the broken max RUVs in the changelog. To fix it, I had to reinitialize each master and perform an update operation on each one to refresh its max RUV. Now all the masters write the expected max RUVs to the changelog at shutdown, and the following restart no longer reports the ruv_compare_ruv error.
dbid: 0000014d000000000000
max ruv:
{replicageneration} 4f91eb54000000010000
{replica 3} 4fa19249000000030000 4fa19249000000030000
{replica 4} 4faaac03000000040000 4faabc99000000040000
{replica 2} 4faabd9c000100020000 4faabd9c000100020000
{replica 1} 4faabea2000000010000 4faabea2000000010000
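The 20-digit values in the dump above are CSNs in their string form. As a reading aid, here is a small sketch that splits such a string into its fields, assuming the usual 389-ds layout of 8 hex digits of timestamp, 4 of sequence number, 4 of replica ID and 4 of sub-sequence number. The `csn_fields` type and the `csn_parse` function are hypothetical helpers written for this note, not part of the server's API.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical decoded form of a CSN string (assumed 8+4+4+4 hex layout). */
typedef struct {
    uint32_t timestamp;     /* seconds since the epoch */
    uint16_t seqnum;
    uint16_t replica_id;
    uint16_t subseqnum;
} csn_fields;

/* Parse a 20-character hex CSN string; returns false if it is malformed. */
static bool csn_parse(const char *text, csn_fields *out)
{
    unsigned int ts = 0, seq = 0, rid = 0, sub = 0;

    assert(out != NULL);
    if (text == NULL || strlen(text) != 20) {
        return false;
    }
    /* Reject anything that is not a hex digit before handing it to sscanf. */
    for (size_t i = 0; i < 20; i++) {
        if (!isxdigit((unsigned char)text[i])) {
            return false;
        }
    }
    if (sscanf(text, "%8x%4x%4x%4x", &ts, &seq, &rid, &sub) != 4) {
        return false;
    }
    out->timestamp = (uint32_t)ts;
    out->seqnum = (uint16_t)seq;
    out->replica_id = (uint16_t)rid;
    out->subseqnum = (uint16_t)sub;
    return true;
}
```

Applied to `4faabd9c000100020000` from the `{replica 2}` line, the replica ID field decodes to 2 and the sequence number to 1, which matches the replica label in the dump.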

Reviewed by Rich (Thank you!!!)

Pushed to master.

$ git merge trac359
Updating 4d7d59e..f0f74b5
Fast-forward
ldap/servers/plugins/replication/csnpl.c | 23 ++++++++++++---------
ldap/servers/plugins/replication/llist.c | 8 +++++-
ldap/servers/plugins/usn/usn.c | 4 ++-
ldap/servers/slapd/back-ldbm/ldbm_add.c | 29 ++++++++++++++-------------
ldap/servers/slapd/back-ldbm/ldbm_modrdn.c | 29 ++++++++++++++-------------
ldap/servers/slapd/plugin.c | 10 +++++++-
6 files changed, 60 insertions(+), 43 deletions(-)

$ git push
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 2.36 KiB, done.
Total 15 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
4d7d59e..f0f74b5 master -> master

Pushed to 389-ds-base-1.2.10 branch.

$ git push origin ds1210-local:389-ds-base-1.2.10
Enter passphrase for key '/home/nhosoi/.ssh/id_rsa':
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 2.54 KiB, done.
Total 15 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
4c31c0d..ed1ebf6 ds1210-local -> 389-ds-base-1.2.10

Pushed to 389-ds-base-1.2.11 branch.

$ git push origin ds1211-local:389-ds-base-1.2.11
Enter passphrase for key '/home/nhosoi/.ssh/id_rsa':
Counting objects: 39, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (23/23), done.
Writing objects: 100% (23/23), 3.28 KiB, done.
Total 23 (delta 18), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
6041d86..c89ea2f ds1211-local -> 389-ds-base-1.2.11

Linked to Bugzilla bug: https://bugzilla.redhat.com/show_bug.cgi?id=821176 (''Red Hat Enterprise Linux 6'')

0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch - csnplRollUp leak
0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch

ack on "0001-Trac-Ticket-359-Database-RUV-could-mismatch-the-one-.patch - csnplRollUp leak"

e5bdf55..4d982e5 389-ds-base-1.2.10 -> 389-ds-base-1.2.10
commit changeset 4d982e5/389-ds-base
f9dfeea..b5f3f98 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit changeset b5f3f98/389-ds-base
59ac943..12567ff master -> master
commit changeset 12567ff/389-ds-base

Added initial screened field value.

I came across this ticket when investigating #49008.

The RUV cannot be updated while an uncommitted CSN sits in the pending list, and this is how one gets stuck there: when URP decides to ignore an operation, the CSN is already in the pending list.
URP sets an LDAP result code in the pblock and turns the operation into a NOOP (success).
Later, send_ldap_result is called with err=0, and the result code is reset.
An uncommitted CSN would be cancelled only in process_postop, but because process_postop sees success, it does not cancel it.

In my opinion, the correct solution would be to roll up the pending list only up to the first uncommitted CSN and to move the cancelling of uncommitted CSNs to the bepostop calls; perhaps the error could be handled in write_changelog_and_ruv(), which is always called.
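The proposed rollup rule can be sketched as follows. This is a hypothetical model and not a patch against csnpl.c: the `pending_csn` type and the `rollup_until_uncommitted` function are invented for illustration, the pending list is modelled as an array assumed to be ordered by CSN, and a CSN is reduced to a single integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pending-list entry; the list is assumed ordered by CSN. */
typedef struct {
    uint64_t csn;
    bool committed;
} pending_csn;

/*
 * Walk the list from the front and stop at the first uncommitted CSN.
 * Returns how many leading entries were committed (and could be removed)
 * and stores the last of them in *last_committed. If the very first
 * entry is uncommitted, returns 0 and leaves *last_committed untouched.
 */
static size_t rollup_until_uncommitted(const pending_csn *list, size_t count,
                                       uint64_t *last_committed)
{
    assert(list != NULL || count == 0);
    assert(last_committed != NULL);
    size_t i = 0;

    while (i < count && list[i].committed) {
        *last_committed = list[i].csn;
        i++;
    }
    return i;
}
```

Under this rule, a list of CSNs 10 (committed), 20 (uncommitted), 30 (committed) rolls up only to 10. Once 20 has been cancelled and removed, as the comment suggests doing in the bepostop calls, the same list rolls up to 30. The rule therefore depends on uncommitted CSNs always being either committed or cancelled; otherwise the rollup stays blocked at the stale entry.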

2 months ago

Metadata Update from @nhosoi:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.2.10

ack

Replication - General

1.2.10

Community

defect
