#49008 aborted operation can leave RUV in incorrect state
Closed: fixed 2 years ago Opened 3 years ago by lkrispen.

if a plugin operation succeeded, but the operation itself fails and is aborted the RUV is in an incorrect state (rolled up to the succesful plugin op).

Here is test case:
enable memberof and add two members to a group, the second has no objectclasss allowing memberof, so the complete operation fails with err=65

[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 57ff838e000000640000 into pending list <<<< main MOD
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 57ff838e000100640000 into pending list <<<< succesful memberof
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_update_ruv: successfully committed csn 57ff838e000100640000
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin - ruv_add_csn_inprogress: successfully inserted csn 57ff838e000200640000 into pending list <<< failing memberof
[13/Oct/2016:14:48:21 +0200] NSMMReplicationPlugin -  csn=57ff838e000200640000 process postop: canceling operation csn
[13/Oct/2016:14:48:22 +0200] NSMMReplicationPlugin - conn=866 op=1 csn=57ff838e000000640000 process postop: canceling operation csn
[13/Oct/2016:14:48:22 +0200] NSMMReplicationPlugin - agmt="cn=to200" (localhost:4945): {replica 100 ldap://localhost:30522} 57f4cb95000000640000 57ff838e000100640000 57ff8295

nsds50ruv: {replicageneration} 57bf05e40000012c0000
nsds50ruv: {replica 100 ldap://localhost:30522} 57f4cb95000000640000 57ff838e000100640000

looks like the pending list mechanism does not work correctly


Looks like the behaviour was introduce by the fix for #359.

We have a real dilemma in how to rollup the pending list and decide when to update the RUV.

if we do it like it is done at the moment (after fix #359) the ruv gets updated even if there is a smaller csn not committed.

but if we do it like before, the uncommited csn would prevent to update the RUV, but if the parent operation is cancelled, it could potentially leave uncommitted csns in the pending list and would then prevent to update the RUV forever.

A fix has to pass the test for #359

This can be seen when having locker not able to resolve deadlocks. The messages that could show we are in presence of this bug are:

==============================
[17/Oct/2016:10:32:07 +0000] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: retry (49) the transaction (csn=5804e0db000800030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
[17/Oct/2016:10:32:07 +0000] NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn: failed to write entry with csn (5804e0db000800030000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
[17/Oct/2016:10:32:07 +0000] NSMMReplicationPlugin - write_changelog_and_ruv: can't add a change for <entry dn=""> (uniqid: 40d752a2-4fbb11e5-93a0d259-b00ed7d3, optype: 32) to changelog csn 5804e0db000800030000
[17/Oct/2016:10:32:07 +0000] - SLAPI_PLUGIN_BE_TXN_POST_DELETE_FN plugin returned error code but did not set SLAPI_RESULT_CODE
[17/Oct/2016:10:32:07 +0000] agmt="<agmt dn="">" (<host>:389) - Can't locate CSN 5804e0dc000200060000 in the changelog (DB rc=-30988). If replication stops, the consumer may need to be reinitialized.
=========================================

Thanks for the details, Ludwig. I agree the #359 introduced a bug.

if we do it like it is done at the moment (after fix #359) the ruv gets updated even if there is a smaller csn not committed.

And this proposal looks correct.
https://fedorahosted.org/389/ticket/359#comment:23

Here is my summary and a proposal from an oflien discussion:

{{{
Problem:

A csn pending list is a list of CSNs which have not yet been used to update the maxcsn of the RUV. There are two scenarios wher it is necessary to keep a pending list and only update the RUV with a specific CSN of this list after all CSN in the list which are smaller are committed.

Scenario 1: There are parallel write operations.

Assume operations 1,2,3 are started with csn_1 < csn_2 < csn_3. If op_3 completes first and would commit and update the ruv, the server would send op_3 to a consumer, the consumer would update it's RUV with csn_3 and if later op_1 and op_2 are committed they would not be sent to the consumer because their csn < csn_3.

NOTE: this scenario is not possible with the current use of the backend lock, but I think this very scenario was the reason to introduce the pending list.

Scenario 2: Internal operations inside the main operation.

An operation is started with csn_1. Plugins are called and can trigger internal updates with csn_2,....csn_5. All these operations are done in nested transactions- If one of the plugins or the main operation fails, the transactions are aborted and undone. None of the changes will survive and they will not be in the changelog. If any of the CSNs csn_1 ... csn_5 would have updated the RUV then the RUV would be ahead of the changelog and cursor position will fail. Therefore the pending list should ensure that only after all the csns 1..5 have successfully be committed the RUV is updated.

There are two variants of this scenario:

Scenario 2.1: the main operation is a direct client operation, so all csns have the same replica id

Scenario 2.2: the main operation is a replicated operation, so the csn for the main operation and the internal operation have different replica ids

NOTE: There is a nightmare variant of this scenario:

Scenario 2.3: One of the plugins triggers an internal update in an otehr backend with different RUV and pending lists.

What is currently implemented ?

We have two types of csn pending list.

1] A pending list of min csns maintained in the replica object.

This is a very specific list and only used until the min csn in the RUV for the local replicaID is set.

(I think this could be achieved by using the ruv pending list, see below)

2] a pending list in each ruv element

When a csn is created or received (either in replica_generate_next_csn() or in process_operation() in mmr preop), it is inserted into the pending list of the ruv element with the corresponding replica ID.

When an operation fails, in process_postop the csn is cancelled and removed from the pending list.

When an operation succeeds write_changelog_and_ruv() triggers a special handling of the pending list.

It sets the state of the csn of the completed operation to committed and it rolls up the pending list to detect the highest committed csn and updates the RUV with this csn.

What are the problems with the current implementation ?

All csns in the pending list are seen as csns of independent operation, the fact that csn of internal ops are tied to the main op is not reflected. If the main op is aborted the ruv can already have been updated with a csn from an internal op.

Since the fix for ticket #359 the pending list rollup ignores not committed csns and selects the highest commited csn to update the ruv

the pending lists for the different ruv elements are treated independently. Even if the fix for #359 would be reverted, this would handle only scenario 2.1, not 2.2

there is an asymmetry between committing and cancelling an csn. The commit is inside the txn and backend lock, wheres the cancelling is in postop and in some cases (urp) the ldap_errror could already have been reset and the csn will not be cancelled and remain uncommitted (this was the scenario which was fixed in #359)

What should be done ?

Here is a proposal for redesign of pending list management.

First some terminology:

Primary csn: the csn of the external operation, direct or replicated

Secondary csn: csns of internal operations, generated inside the op with the primary csn

Full operation (need a better name): the set of operations for a primary and its secondary csns

And some conditions:

All operations for a primary csn and its secondary csns are processed in one thread

There are only one or two replica IDs used in the pending list for one replica in a full operation. In the scenario 2.3 there are several replicas affected, but in each this condition holds.

Proposal:

Maintain the primary csn in thread local data, like it is used for the agreement name (or txn stack): prim_csn.

Extend the data structure of the pending list to keep prim_csn for each inserted csn

If a csn is created or received check prim_csn: if it exists use it, if it doesn't exist set it

when inserting a csn to the pending list pass the prim_csn

when cancelling a csn, if it is the prim_csn also cancell all secondary csns

when committing a csn,

if it is not the primary csn, just set the committed flag

if it is the prim_csn trigger the pending list rollup, stop at the first not committed csn

if the RID of the prim_csn is not the local RID also rollup the pending list for the local RID.

Does this also handle scenario 2.3, where a full operation spans more than one backend ?

No, but it could be extended and instead of only staring a prim_csn in the thread data also store a refrence to the affected replica. Cancelling or committing could then run replica_update_ruv for multiple replicas.
}}}

The proposal might not work in some deployments.

The basic idea is:
an operation is started, direct update or replicated, then the csn of this op is the primary csn
then plugins are called and can generate mods with new csns
when the main operation is committed or aborted the pending list is rolled up and the RUV is updated or the pending list is cleaered

this nesting of ops and committing is true for the transaction, but the ruv update is done by the repl plugin in the txn_post call in write_chnagelog_and_ruv
and since replication is a plugin itself the write_changelog_and_ruv for the primary csn could be called and then another plugin could generate an additional mod.

so if repl pugin is called after the plugins triggering internal ops, everything is fine, if not no guarantee.

I have been discussing this already today with Thierry and there are a few options to handle this:

  • change the precedence of the repl plugin so that it will always be last

  • move the write_changelog_and_ruv out of the plugin call, this is possible by a meachnism already used in managing the ruv_context, register a function in the pblock, call it between plugin_call_plugins() and dblayer_txn_[commit|abort]

  • split the write_changelog and update_ruv, do the write_changelog stll inside the txn and do the update_ruv in the postop

Just a minor commend about the patch, when prim_csn is set in thread private (ruv_add_csn_inprogress line 1648), it calls

{{{
set_thread_primary_csn(csn_dup(csn));
}}}

But set_thread_primary_csn also duplicate the csn before setting it into the thread private. The first duplicate will leak.

good catch, it was introduced in the last moment, when I looked at th compiler warnings and it said that

{{{
set_thread_primary_csn(csn);
}}}
looses the const from csn, and i just thought, yes it will be freed, so it has to be duplictaed.
we need to remove one of the csn_dup() calls

Also, if setting NULL it should free the current contents of the thread data.

Replying to [comment:12 mreynolds]:

Also, if setting NULL it should free the current contents of the thread data.

this is done in the destructor provided here:#
{{{
PR_NewThreadPrivateIndex (&thread_primary_csn, csnplFreeCSN);
}}}

Replying to [comment:13 lkrispen]:

Replying to [comment:12 mreynolds]:

Also, if setting NULL it should free the current contents of the thread data.

this is done in the destructor provided here:#
{{{
PR_NewThreadPrivateIndex (&thread_primary_csn, csnplFreeCSN);
}}}

Yup, my bad, thanks for confirming!

Reading again the patch it looks really great.
Just minor points:

  • it is not clear to me if you think we need to remove a csn_dup when calling set_thread_primary_csn (ruv_add_csn_inprogress) because set_thread_primary_csn will itself call again csn_dup.
  • May be add a comment in csnpldata_free/struct csnpldata mentioning that prim_csn is a pointer on thread private data and should not be freed. freed by further set_thread_primary_csn.
  • in ruv_update_ruv, if the csn is not the prim_csn it just returns, shouldn't also set the csn as committed ?

Thanks for looking into this again.

The dup of the prim_csn is just an error (I was removing compiler warnings and did the wrong thing).
Adding a comment is also a good idea

In my concept csn should onlöy be comitted when the primary csn is committed, otherwise the pending list rollup could updatet the RUV with a csn which still can be aborted

I fully agree and prefer the option to commit sub-csn in a raw when the primary is committed.
My remark was just about https://fedorahosted.org/389/ticket/49008#comment:7
In that update you mentioned

{{{
when committing a csn,

if it is not the primary csn, just set the committed flag
}}}

That is not exactly what is implemented. But as I said I prefer your implementation

Test case looks good to me. And it passes with the fix.

Very nice! However, there are indentation issues in csnplCommitAll() :)

I've cleaned up the test case a bit. Feel free to add it to your commit or I can push it for you, if you're okay with the changes.

Replying to [comment:20 mreynolds]:

Very nice! However, there are indentation issues in csnplCommitAll() :)

seem new patch

Yes very nice job.
Just minor thing, csnplCommitAll and csnplRemoveAll do always return 0.
It could simplify the code not testing their return value.

thanks for the comment - you are right, but if you do not insist on it I would like to keep it as is.
I did not change the logic of the caller when moving from csnplCommit/Remove to ..All, at the moment I do not see when they should return an error, but I am not sure it will stay like this
Also for the csnplCommitAll call there are two different log messages depending on the rc, now only the "success" info msg will be logged, but I would not like to not check an rc and always log success.

I agree, failures of those routines (even if they do not occur now) should be caught. So I am fine with current patch (except the indentation reported by Mark). You have my ACK

backported patch to 1.3.5 attached

committed to master:

commit 0c7ef56

Add CI test written by Ludwig Krispenz lkrispen@redhat.com.

To ssh://git.fedorahosted.org/git/389/ds.git
0c7ef56..360d797 master -> master
commit 360d797
Author: Simon Pichugin spichugi@redhat.com
Date: Tue Jan 17 15:32:38 2017 +0100

committed to 1.3.5:

commit 79a3dea

0001-fix-regression-of-patch-for-49008-handle-read-only-r.patch

Thanks for the patch, Ludwig!

I have one question... Is the change ok for HUB? It shares the same RID?

If it works, you have my ack.

The replicaid on a hub should also be the read_only_replicaid, but I am not definitely sure what happens when downgrading a master to a hub.
An alternative patch would be to check if the ruv element exists, new patch attached

Metadata Update from @tbordaz:
- Issue set to the milestone: 1.2.11.33

2 years ago

Metadata Update from @mreynolds:
- Custom field component reset (from Replication - General)
- Custom field reviewstatus reset (from review?)
- Issue close_status updated to: None

2 years ago

Adjust CI test for new MO plugin behavior

0001-Ticket-49008-Adjust-CI-test-for-new-memberOf-behavio.patch

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to review

2 years ago

That's clever. I think I did something else to fix this but ack from me!

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to ack (was: review)

2 years ago

That's clever. I think I did something else to fix this but ack from me!

Yeah I'm worried about that, as we are making MO plugin bullet proof. But it's an easy plugin to trigger errors to test other conditions. Most other plugins ignore replicated ops. So if we keep enforcing proper values for auto objectclass it could actually break some CI tests like it did to this one. But that's a discussion for the other issue.

5d5b9c6..d74ba63 master -> master

04904b6..d77e285 389-ds-base-1.3.6 -> 389-ds-base-1.3.6

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Fix MO betxn CI test suite:

e1dca08..d42024e master -> master

2141ea7..6bd1b7d 389-ds-base-1.3.6 -> 389-ds-base-1.3.6

Login to comment on this ticket.

Metadata