It was introduced in the fix for 543633, to fix this situation:
2) If slapd went down in the window between writing to the datastore and writing to the changelog, then the last update against the replica will never appear on remote replicas. The incremental protocol will continue once tickled as described above, but the last change made before the crash will be missing and this is not detected by the protocol.
But with the current code, the changelog is written inside the same transaction as the changes to the database, so there cannot be an inconsistent state after recovery: either both are recovered or both are undone.
It should also be verified whether part 1) of the original bug description is still true: 1) After starting back up, the RUV will frequently have a max CSN in the past as compared to the changelog and compared to remote replicas. This means that any updates in the changelog that were not yet sent before the crash will continue to not be sent after slapd comes back up, until a new update arrives. Then the RUV will leap ahead, the incremental protocol will position replay at the remote replica's max CSN, and the unsent updates from before the crash and also the new update will be replayed.
In general we need to find a way to ensure consistency of the database RUV with the applied changes and the changelog after a (disorderly) shutdown: whether it should be regularly updated from the replica RUV by the housekeeping thread, or whether it is possible without performance loss inside the modify operation. It should also be considered whether a solution relies on serialization by the backend log, or remains valid if changes can be applied in parallel.
== Here is the current status ==
- Since 1.2.7 (633168) the changelog DB and the data store DB share the same environment, so they are covered by the same transaction. After a possible recovery, DB and changelog are in sync; in particular, the changelog RUV is in sync with the updates in the DB.
- Since 1.2.8 (543633), the RUV entry in the data store DB is also written under the same txn. So with this enhancement, after a possible recovery, the data store RUV in the DB and the changelog RUV are in sync. Before that enhancement, only the in-memory RUV reflected the changelog RUV; the data store RUV was flushed to disk periodically (every 30s).
- The drawback of the 543633 fix is that it writes two RUVs (changelog and data store) with the same information. Ticket 564 is to evaluate whether it is possible to keep only one RUV in sync with the data store. If we let the data store RUV fall behind the changelog RUV and a crash occurs, then after recovery the changelog RUV is in sync with the actual data store, but there is a difference between the DB RUV and the CL RUV. This difference (where the CL RUV covers the DB RUV), in case of a crash, should trigger the update of the DB RUV from the CL RUV. Currently (replica_check_for_data_reload), if there is any difference between the DB RUV and the CL RUV, the changelog is recreated and the CL RUV is aligned with the DB RUV. The fix would be to detect that we are restarting after a crash and, in that case, align the DB RUV with the CL RUV (in addition, the updates from 543633 could be removed).
- The benefit of the latter fix (ticket 564) is an improvement of the write throughput (it avoids computing/updating a RUV). During my tests, most of the bottleneck was in txn_commit, which was hiding part of the benefit of the fix. With txn batch = 0 (default), the gain of the fix is 3%. With txn batch set to the number of client threads, the gain is 6%.
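To make the restart decision above concrete, here is a minimal sketch in C. It uses a deliberately simplified RUV model (one lexically comparable max-CSN string per replica ID); `simple_ruv`, `ruv_covers`, and `startup_ruv_action` are hypothetical names for illustration, not the actual 389-ds API.

```c
#include <string.h>

/* Hypothetical, simplified RUV: one max CSN string per replica ID. */
typedef struct {
    int replica_id;
    const char *max_csn; /* lexically comparable CSN string, "" if none */
} replica_elem;

typedef struct {
    replica_elem elems[8];
    int nelems;
} simple_ruv;

/* a covers b iff, for every replica in b, a knows a CSN >= b's max CSN. */
static int ruv_covers(const simple_ruv *a, const simple_ruv *b)
{
    for (int i = 0; i < b->nelems; i++) {
        int found = 0;
        for (int j = 0; j < a->nelems; j++) {
            if (a->elems[j].replica_id == b->elems[i].replica_id) {
                found = strcmp(a->elems[j].max_csn, b->elems[i].max_csn) >= 0;
                break;
            }
        }
        if (!found)
            return 0;
    }
    return 1;
}

typedef enum { NOTHING_TO_DO, ALIGN_DB_RUV_FROM_CL, RECREATE_CHANGELOG } ruv_action;

/* Startup decision sketched from the discussion above: after a detected
 * disorderly shutdown, if the CL RUV strictly covers the DB RUV, align the
 * DB RUV from the CL RUV instead of recreating the changelog; any other
 * mismatch keeps the current behavior of distrusting the changelog. */
static ruv_action startup_ruv_action(int disorderly_shutdown,
                                     const simple_ruv *cl_ruv,
                                     const simple_ruv *db_ruv)
{
    int cl_covers_db = ruv_covers(cl_ruv, db_ruv);
    int db_covers_cl = ruv_covers(db_ruv, cl_ruv);

    if (cl_covers_db && db_covers_cl)
        return NOTHING_TO_DO;        /* RUVs already in sync */
    if (disorderly_shutdown && cl_covers_db)
        return ALIGN_DB_RUV_FROM_CL; /* DB RUV lagging after a crash */
    return RECREATE_CHANGELOG;       /* otherwise, recreate the CL */
}
```

For example, with a DB RUV lagging the CL RUV, a detected crash yields `ALIGN_DB_RUV_FROM_CL`, while the same mismatch on a clean startup still recreates the changelog.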
== Here are the next steps ==
- Prepare a fix review
- Run acceptance
Review of the fix for ticket564 0001-Ticket-564-Ticket-description.patch
The code to detect and handle the RUV in the disorderly shutdown case looks good. I would change this:
{{{
} // slapi_disordely_shutdown
}}}
Not sure what the current rules are for using C++ style comments in C code, but to maximize portability, please use C style comments
As for the modify_context code - if we are getting rid of that, I would like to see a better way to make sure we update the database RUV inside the operation transaction.
Thanks for the review. I will modify the disorderly shutdown part accordingly.
Since 543633 (1.2.8) the database RUV update is done inside the op txn. It is done in three steps:
 1. compute the RUV mods and apply them to the RUV entry
 2. write the RUV entry
 3. update the entry in the entry cache
In order to be sure that the update of the RUV stays inside the op txn, we may keep steps 1 and 2 within the op txn, but the update of the entry cache can be done outside of the op txn. The impact is that a RUV entry lookup (through SRCH) may return an out-of-date RUV. Step 1 cannot be done before the op txn because client operations are assigned a CSN within the op txn.
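A sketch of the ordering being discussed, with stub functions that only record the call sequence (all names here are hypothetical illustrations, not the actual 389-ds functions): steps 1 and 2 stay inside the op txn, while the entry-cache update moves after the commit.

```c
#include <string.h>

/* Records the order in which the (stubbed) steps run. */
static char call_log[256];

static void record(const char *step)
{
    strcat(call_log, step);
    strcat(call_log, ";");
}

/* Stubs for the steps named above; real code would touch the DB. */
static void txn_begin(void)                  { record("txn_begin"); }
static void compute_and_apply_ruv_mods(void) { record("ruv_mods"); }     /* step 1: CSN assigned inside the txn */
static void write_ruv_entry(void)            { record("write_ruv"); }    /* step 2 */
static void txn_commit(void)                 { record("txn_commit"); }
static void update_entry_cache(void)         { record("cache_update"); } /* step 3, outside the txn */

/* Proposed ordering: steps 1 and 2 inside the op txn, step 3 after commit.
 * A concurrent SRCH between txn_commit and cache_update may therefore
 * read a stale RUV from the entry cache. */
static const char *modify_op_sketch(void)
{
    call_log[0] = '\0';
    txn_begin();
    compute_and_apply_ruv_mods();
    write_ruv_entry();
    txn_commit();
    update_entry_cache();
    return call_log;
}
```

The trade-off is visible in the sequence: everything durable commits atomically, and only the in-memory cache view can briefly lag.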
Oops, I forgot... another option is to create a configuration parameter that enables/disables the update of the RUV entry in the op txn.
I ran quite an intensive test using servers with your patch applied. Steps:
 1. Set up MMR.
 2. Pound Master1 with ldclt (add, del, and mod) for an hour and repeat {kill -9 M2; sleep 30; start-slapd M2; sleep 30} while ldclt was running.
 3. Wait until the changes are completely replicated and compare the DBs.
The result was perfect. The 2 DBs were identical. Great job, Thierry!!
M1 RUV:
{{{
rdn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff
nsds50ruv: {replicageneration} 51156df2000000010000
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 51156ed0000000010000 51157eb5000a00010000
nsds50ruv: {replica 2 ldap://localhost.localdomain:390}
nsruvReplicaLastModified: {replica 1 ldap://localhost.localdomain:389} 51157eb4
nsruvReplicaLastModified: {replica 2 ldap://localhost.localdomain:390} 00000000
}}}
M2 RUV:
{{{
rdn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff
nsds50ruv: {replicageneration} 51156df2000000010000
nsds50ruv: {replica 2 ldap://localhost.localdomain:390}
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 51156ed0000000010000 51157eb5000a00010000
nsruvReplicaLastModified: {replica 2 ldap://localhost.localdomain:390} 00000000
nsruvReplicaLastModified: {replica 1 ldap://localhost.localdomain:389} 51158beb
}}}
Noriko,
did you also try to repeatedly kill M1, to see if it will correctly recreate its ruv and keep the changelog ?
Replying to [comment:10 lkrispen]:
Noriko, did you also try to repeatedly kill M1, to see if it will correctly recreate its ruv and keep the changelog ?
I'd say that side is not tested enough. I killed it just a couple of times, since I had to restart ldclt each time. I will have to ask Thierry to do more testing on it... ;)
I want to thank you so much for the extensive tests you did. I know how difficult it is to test this corner case, and the effort it requires to run/check such tests for a long time. I sincerely appreciate your help.
About the test Ludwig suggested: I did it during the unit tests, so I do not have a real available test case. I instrumented the code to trigger a crash (on demand) during a backend update (ldbm_back_modify). The crash occurred after txn_open and the updates of DB+index, but before the txn commit. After recovery, I checked that startup detected the disorderly shutdown, recreated the DB RUV (according to the CL RUV) and kept the CL. I also verified that replication can continue after the startup. I realize that the code fix could be improved so that the CL is recreated in the case of a disorderly shutdown where the DB RUV is ahead of the CL RUV:
{{{
if (disordely_shutdown && covers(CL_ruv, DB_ruv) && !covers(DB_ruv, CL_ruv))
    then recreate_ruv(DB_ruv)
    else { recreate_CL; ....}
}}}
I cannot imagine how the CL RUV would be behind the DB RUV (during a standard startup), but it would be more secure to throw away the CL if we are not sure of it.
A test I did not do: with M1 under ldclt load, stop M2 while keeping M1 under ldclt load; kill M1; start M2 + ldclt on M2; then start/recover M1 and check that M2 and M1 resynchronize. This would verify that the M1 CL can retrieve M2's status and send its updates.
best regards thierry
'''Here is the current status'''
In my last update, I reported a hang condition with the fix. The hang occurred in the fourway test suite, but other test suites (MMR and basic) also contained failures. The hang was the first one I investigated.
Today I ran the basic+MMR acceptance suites 3 times. Each time I got 100% success. I cannot reproduce the hang, and the other test suites were also successful.[[BR]] Before running the acceptances, I had to investigate strange failures that were related to my environment (an rpm to remove, a broken initial install, a wrong engage file...). These environment issues may be related to the failures I saw mid-February.
'''Here are the next steps'''
The suggested fix is valid. I ran the full MMR acceptance 10 times, and the only failures that occurred were due to a problem in a fractional test case itself. After fixing it, all tests were consistently 100% successful.
I reproduced the hang condition, and it was an environment problem: it occurs if I rerun the MMR acceptance on the same configured instance. If I run the acceptance from a fresh install, the acceptance is successful.
'''Here are the next steps'''
 * The main concern to address before including that fix is to check how monitoring tools (repl monitor and cn=monitor) are impacted.[[BR]] The issue is that they need to monitor the in-memory RUV rather than the DB RUV, else they will be up to ~20sec late.
The called script is /usr/bin/repl-monitor.pl. It retrieves all the replicas on a given instance and for each of them it searches the replication status with the following lookup:
{{{
ldapsearch -LLL -h localhost -p 3002 -D "cn=directory manager" -w Secret123 -s one -b "dc=com" "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectClass=nsTombstone))" nsds50ruv nsruvReplicaLastModified

dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=com
nsds50ruv: {replicageneration} 5176aebb000000010000
nsds50ruv: {replica 2 ldap://pctbordaz.redhat.com:3002} 5176af9c000000020000 5177c8dd000000020000
nsds50ruv: {replica 1 ldap://pctbordaz.redhat.com:3001} 5176af7c000000010000 5177c8bf000000010000
nsruvReplicaLastModified: {replica 2 ldap://pctbordaz.redhat.com:3002} 5177c8dd
nsruvReplicaLastModified: {replica 1 ldap://pctbordaz.redhat.com:3001} 5177c8bf
}}}
Replying to [comment:20 tbordaz]:
'''Here is the current status'''
repl-monitor is monitoring the database RUV. The called script is /usr/bin/repl-monitor.pl. It retrieves all the replicas on a given instance and for each of them it searches the replication status with the following lookup:
{{{
ldapsearch -LLL -h localhost -p 3002 -D "cn=directory manager" -w Secret123 -s one -b "dc=com" "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectClass=nsTombstone))" nsds50ruv nsruvReplicaLastModified

dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=com
nsds50ruv: {replicageneration} 5176aebb000000010000
nsds50ruv: {replica 2 ldap://pctbordaz.redhat.com:3002} 5176af9c000000020000 5177c8dd000000020000
nsds50ruv: {replica 1 ldap://pctbordaz.redhat.com:3001} 5176af7c000000010000 5177c8bf000000010000
nsruvReplicaLastModified: {replica 2 ldap://pctbordaz.redhat.com:3002} 5177c8dd
nsruvReplicaLastModified: {replica 1 ldap://pctbordaz.redhat.com:3001} 5177c8bf
}}}
With the ticket564 fix, the monitored RUV can be up to 20sec late.
That's not good. What if a customer has custom written monitoring apps that use this RUV entry?
'''Here are the next steps'''
 * Need to change /usr/bin/repl-monitor.pl to monitor the in-memory RUV
The in-memory RUV is not accessible in 389-ds.
I created https://fedorahosted.org/389/ticket/47350 to support in memory RUV access
attachment 0002-Ticket-564-Is-ldbm_txn_ruv_modify_context-still-requ.patch
{{{
git merge ticket564
Updating 202c983..7f5268f
Fast-forward
 ldap/servers/plugins/replication/repl5_replica.c | 73 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------
 ldap/servers/plugins/replication/repl5_ruv.c     | 31 +++++++++++++++++++++++++++++++
 ldap/servers/plugins/replication/repl5_ruv.h     |  1 +
 ldap/servers/slapd/back-ldbm/dblayer.c           |  1 +
 ldap/servers/slapd/back-ldbm/ldbm_config.c       |  2 +-
 ldap/servers/slapd/plugin.c                      | 11 +++++++++++
 ldap/servers/slapd/slapi-plugin.h                |  2 ++
 7 files changed, 91 insertions(+), 30 deletions(-)
}}}
{{{
git push origin master
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 3.01 KiB, done.
Total 15 (delta 13), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   202c983..7f5268f  master -> master
}}}
{{{
commit 7f5268f
Author: Thierry bordaz (tbordaz) <tbordaz@redhat.com>
Date:   Wed Jun 5 11:09:25 2013 +0200
}}}
Metadata Update from @tbordaz: - Issue assigned to tbordaz - Issue set to the milestone: 1.3.2 - 05/13 (May)
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/564
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)