#47712 betxn: retro changelog broken after cancelled transaction
Closed: wontfix None Opened 7 years ago by nhosoi.

Description of problem:
If the first transaction after enabling the retro changelog plugin fails, the
database gets corrupted and any subsequent successful operation is not recorded
into changelog.

Version-Release number of selected component (if applicable):
389-ds-base-1.3.1.6-18.el7

How reproducible:
Always

Steps to Reproduce:
1. Turn on the retro changelog plugin. Restart.
2. Create a transaction that fails. E.g. by auto membership plugin.
3. Add entry in a transaction that succeeds.

Actual results:
ldapsearch -b "cn=changelog" does not show any sub-entries
log:
[04/Feb/2014:11:06:03 +0100] auto-membership-plugin -
automember_add_member_value: Unable to add
"cn=tuser,ou=people,dc=example,dc=com" as a "member" value to group
"cn=testgroup,ou=groups,dc=example,dc=com" (No such object).
[04/Feb/2014:11:06:03 +0100] DSRetroclPlugin - replog: an error occured while
adding change number 1, dn = changenumber=1,cn=changelog: Already exists.

Expected results:
The successful entry should be recorded in changelog

Note: bug reproducer is available in the bugzilla


The bugzilla contains another scenario: If there are alreday changes in the retrochange log then the effect of the failed transaction is that an chnagenumber is missing.
If the missing number is "1" the error behaviour is more severe.

I think the problem is caused that if the txn is aborted the generated changenumber is not reset. Need to check if this is possible - or if generating the changenumber can handle this.

General question: Is it mandatory that changenumbers in the retro changelog are without gaps ?

Replying to [comment:4 lkrispen]:

The bugzilla contains another scenario: If there are alreday changes in the retrochange log then the effect of the failed transaction is that an chnagenumber is missing.
If the missing number is "1" the error behaviour is more severe.

I think the problem is caused that if the txn is aborted the generated changenumber is not reset. Need to check if this is possible - or if generating the changenumber can handle this.

General question: Is it mandatory that changenumbers in the retro changelog are without gaps ?

It should be ok, but will probably be unexpected, and may break unsuspecting clients. We should try to avoid gaps if possible. I think USN and DNA can avoid gaps?

It should be ok, but will probably be unexpected, and may break unsuspecting clients. We should try to avoid gaps if possible. I think USN and DNA can avoid gaps?

There are already calls to release_changenumber in case of errors to avoid gaps, but maybe we miss some scenario

The problem in this specific scenario is that automember postop is called after retrocl postop. Since automember fails, the txn is aborted, but the internal cn is not reset.

This is a more general problem that plugins use static variables, but if an operation fails the changes to these variables are not reset.

An other observation is that if a txn is aborted because of plugin failures, the plugin are called again just before the abort. If they check the result code in the pblock they just return, but do all plugins behave like this ?

Replying to [comment:7 lkrispen]:

The problem in this specific scenario is that automember postop is called after retrocl postop. Since automember fails, the txn is aborted, but the internal cn is not reset.

Can the order of retrocl be changed to after automember? Can retrocl be changed to be called after all other plugins? Does order matter?

This is a more general problem that plugins use static variables, but if an operation fails the changes to these variables are not reset.

Yes. Really, for all of these counters that depend on db transaction semantics, we should really use a BDB Sequence, which would handle the value increment and rollback automatically. Unfortunately, I don't know if Sequence would be generally applicable to situations like DNA where you may not just be incrementing a simple scalar.

An other observation is that if a txn is aborted because of plugin failures, the plugin are called again just before the abort. If they check the result code in the pblock they just return, but do all plugins behave like this ?

Not sure.

the good news: setting plugin precedence to 99 for the retro CL fixes the issue
the bad news: the retro CL has already its precedence explicitely set to 25, with the following justification:
" Bug 518084 - Fix out of order retro changelog entries

When using the retro changelog plugin, post-op plugins that perform
internal operations (such as memberOf) can result in the internal
operation preceeding the original operation in the changelog."

It seems that the retrocl should have two BE_TXN_POSTOP functions - the first one is called very early, and the second one is called very late - the first one writes to the changelog, and the second one checks to see if any other BE_TXN_POSTOP plugin returned an error that caused the changelog write to be aborted, and will rollback the counter.

unfortunately it is not only the counter.

In the first scenario where the first operation fails, the changenumber is set always to 1 (just like setting back the counter), but it then fails because the changelog entry from the aborted transaction is still in the entry cache and an error "already exists is returned".
Even if we call retrocl later again, it has no api to remove an entry from the entry cache

The changelog entry which survived an aborted operation is also accessable from clients, a search
ldapsearch .... -s base -b "changenumber=,cn=changelog" returns the entry, after a restart it is gone.

I found a solution to prevent the entry to stay in the cache, but it is a bit brute force:
- define a new operation flag OP_FLAG_NEVER_CACHE
- in write_replog_db set this flag when adding a retro changelog entry
- in ldbm_back_add check the flag and do CACHE_REMOVE (not only CACHE_RETURN)

It works, but now all added retro cl entries do not stay in cache after creation.

The problem with the gap is much harder, tried to call retrocl_update_lastchangenumber() if retrocl_postob is called with an error set in the pblock, but it already hangs in the call triggereb by the failing automember mod, maybe careful passing/setting the txn would work, but I realized that these calls are all done before the txn is finally aborted, so they still see a valid retrocl changenumber.
Postponing the reset to a normal postop plugin call for the retro cl, when the txn is already aborted cour result in problems with conncurrently genertaed retrocl entrise in an other thread.

$ git merge ticket47712
Updating f573012..adc4dd0
Fast-forward
ldap/servers/plugins/retrocl/retrocl.h | 1 +
ldap/servers/plugins/retrocl/retrocl_cn.c | 17 +++++++++++++++--
ldap/servers/plugins/retrocl/retrocl_po.c | 7 ++++++-
ldap/servers/slapd/back-ldbm/ldbm_add.c | 4 ++++
ldap/servers/slapd/slapi-plugin.h | 1 +
ldap/servers/slapd/slapi-private.h | 1 +
6 files changed, 28 insertions(+), 3 deletions(-)
$ git push origin master
Counting objects: 27, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (14/14), done.
Writing objects: 100% (14/14), 2.27 KiB, done.
Total 14 (delta 12), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
f573012..adc4dd0 master -> master

Metadata Update from @lkrispen:
- Issue assigned to lkrispen
- Issue set to the milestone: 1.3.3 - 5/14 (May)

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1047

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

7 months ago

Login to comment on this ticket.

Metadata