#564 Is ldbm_txn_ruv_modify_context still required?
Closed: Fixed. Opened 6 years ago by lkrispen.

It was introduced in the fix for 543633, to fix this situation:

2) If slapd went down in the window between writing to the datastore and writing to the changelog, then the last update against the replica will never appear on remote replicas.
The incremental protocol will continue once tickled as described above, but the last change made before the crash will be missing and this is not detected by the protocol.

But with the current code, the changelog is written inside the same transaction as the changes to the database, so there cannot be an inconsistent state after recovery: either both are
recovered or both are undone.

It should also be verified whether part 1) of the original bug description is still true:
1) After starting back up, the RUV will frequently have a max CSN in the past as compared to the changelog and compared to remote replicas. This means that any updates in the changelog that were not yet sent before the crash will continue to not be sent after slapd comes back up, until a new update arrives. Then the RUV will leap ahead, the incremental protocol will position replay at the remote replica's max CSN, and the unsent updates from before the crash and also the new update will be replayed.

In general we need to find a way to ensure consistency of the database RUV with the applied changes and the changelog after a (disorderly) shutdown, whether it should be regularly updated from the replica RUV by the housekeeping thread, or whether it is possible without performance loss inside the modify operation.
It should also be considered whether a solution relies on serialization by the backend log, or remains valid if changes can be applied in parallel.


== Here is the current status ==

    - Since 1.2.7 (633168), the changelog DB and the data store DB share the same
      environment.
      Since they are covered by the same transaction, after a possible recovery the DB and the changelog are in sync.
      In particular, the changelog RUV is in sync with the updates in the DB.

    - Since 1.2.8 (543633), the RUV entry in the data store DB is also updated under the same
      txn.
      So with this enhancement, after a possible recovery, the data store RUV in the DB and the changelog RUV are in sync.
      Before that enhancement, only the in-memory RUV reflected the changelog RUV; the data store RUV was flushed to disk
      periodically (every 30s).

    - The drawback of the 543633 fix is that it writes two RUVs (changelog and data store) with the same information.
      Ticket 564 is to evaluate whether it is possible to keep only one RUV in sync with the data store.
      If we let the data store RUV lag behind the changelog RUV and a crash occurs, then after recovery
      the changelog RUV is in sync with the actual data store, but there is a difference between the DB RUV and the CL RUV.
      This difference (where the CL RUV covers the DB RUV) should, in the crash case, trigger the update of the DB RUV from
      the CL RUV.
      Currently (replica_check_for_data_reload), if there is any difference between the DB RUV and the CL RUV,
      the changelog is recreated and the CL RUV is aligned with the DB RUV.
      The fix would be to detect that we are restarting after a crash and, in that case, align the DB RUV with the CL RUV
      (in addition, the updates from 543633 could be removed).

    - The benefit of the latter fix (ticket 564) is an improvement of the write throughput (it avoids computing/updating a second RUV).
      During my tests, most of the bottleneck was in txn_commit, which was hiding part of the benefit of the fix.
      With txn batch = 0 (default), the gain from the fix is 3%. With txn batch set to the number of client threads,
      the gain is 6%.

== Here are the next steps ==

    - Prepare a fix review

    - Run acceptance

The code to detect and handle the RUV in the disorderly shutdown case looks good. I would change this:

{{{
} // slapi_disordely_shutdown
}}}

Not sure what the current rules are for using C++ style comments in C code, but to maximize portability, please use C style comments

As for the modify_context code - if we are getting rid of that, I would like to see a better way to make sure we update the database RUV inside the operation transaction.

Thanks for the review. I will modify the disorderly shutdown part accordingly.

Since 543633 (1.2.8), the database RUV update is done inside the op txn.
It is done in three steps:
- 1) compute the RUV mods and apply it on the RUV entry
- 2) write the RUV entry
- 3) update the entry in the entry cache

In order to be sure that the update of the RUV stays inside the op txn, we may
keep steps 1 and 2 within the op txn. But the update of the entry cache can be done
outside of the op txn. The impact is that a RUV entry lookup (through SRCH)
may return an out-of-date RUV.
Step 1 cannot be done before the op txn because client operations are assigned
a CSN within the op txn.

Oops, I forgot... another option is to create a configuration parameter that enables/disables the update of the RUV entry in the op txn.

I ran quite an intensive test using servers with your patch applied.
Steps:
1. setup MMR.
2. pound Master1 with ldclt (add, del, and mod) for an hour and repeat {kill -9 M2; sleep 30; start-slapd M2; sleep 30} while ldclt was running.
3. wait till the changes are completely replicated and compared the DBs.
The result was perfect. The 2 DBs were identical. Great job, Thierry!!
M1 RUV:
{{{
rdn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff
nsds50ruv: {replicageneration} 51156df2000000010000
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 51156ed0000000010000 51157eb5000a00010000
nsds50ruv: {replica 2 ldap://localhost.localdomain:390}
nsruvReplicaLastModified: {replica 1 ldap://localhost.localdomain:389} 51157eb4
nsruvReplicaLastModified: {replica 2 ldap://localhost.localdomain:390} 00000000
}}}
M2 RUV:
{{{
rdn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff
nsds50ruv: {replicageneration} 51156df2000000010000
nsds50ruv: {replica 2 ldap://localhost.localdomain:390}
nsds50ruv: {replica 1 ldap://localhost.localdomain:389} 51156ed0000000010000 51157eb5000a00010000
nsruvReplicaLastModified: {replica 2 ldap://localhost.localdomain:390} 00000000
nsruvReplicaLastModified: {replica 1 ldap://localhost.localdomain:389} 51158beb
}}}

Noriko,

did you also try to repeatedly kill M1, to see if it will correctly recreate its RUV and keep the changelog?

Replying to [comment:10 lkrispen]:

Noriko,

did you also try to repeatedly kill M1, to see if it will correctly recreate its RUV and keep the changelog?

I'd say that side is not tested enough. I killed it just a couple of times, since I had to restart ldclt each time. I have to ask Thierry to do more testing on it... ;)

Noriko,

I want to thank you so much for the extensive tests you did. I know how difficult it is to test this corner case, and the effort it requires to run/check such tests for a long time. I sincerely appreciate your help.

About the tests Ludwig suggested: I did them during unit testing, so I do not have a readily available test case. I instrumented the code to trigger a crash (on demand) during a backend update (ldbm_back_modify). The crash occurred after txn_open and the updates of the DB and indexes, but before the txn commit.
After recovery, I checked that startup detected the disorderly shutdown, recreated the DB RUV (according to the CL RUV) and kept the CL. I also verified that replication can continue after the startup.
I realize that the code fix could be improved so that the CL is recreated in the case where there is a disorderly shutdown but the DB RUV is ahead of the CL RUV:
{{{
if (disorderly_shutdown && covers(CL_ruv, DB_ruv) && !covers(DB_ruv, CL_ruv)) then recreate_ruv(DB_ruv)
else { recreate_CL; ... }
}}}
I cannot imagine how the CL RUV could be behind the DB RUV (during a standard startup), but it would be safer to throw away the CL if we are not sure of it.

A test I did not do: with M1 under ldclt load, stop M2, keep M1 under ldclt load, then kill M1.
Start M2 plus ldclt on M2, start/recover M1, and check that M2 and M1 resynchronize.
This would verify that the M1 CL can retrieve M2's status and send its updates.

best regards
thierry

== Here is the current status ==

    - The fix is incomplete.[[BR]]
      While running the acceptance tests it triggers, from time to time, a hang when a repl_agmt gets disabled.
      The hang is between the thread reading the RA results and an unknown thread. In fact the hang was detected
      late (during shutdown) and most of the threads had disappeared. The hang looks like:

{{{
(gdb) info thread
  Id   Target Id         Frame
  9    Thread 0x7fc6ba252700 (LWP 17604) "ns-slapd" 0x00000035d6cea9f3 in select () from /usr/lib64/libc.so.6
  8    Thread 0x7fc6b9a51700 (LWP 17605) "ns-slapd" 0x00000035d6cea9f3 in select () from /usr/lib64/libc.so.6
  7    Thread 0x7fc6b9250700 (LWP 17606) "ns-slapd" 0x00000035d6cea9f3 in select () from /usr/lib64/libc.so.6
  6    Thread 0x7fc6b8a4f700 (LWP 17607) "ns-slapd" 0x00000035d6cea9f3 in select () from /usr/lib64/libc.so.6
  5    Thread 0x7fc6b3fff700 (LWP 17608) "ns-slapd" 0x00000035d700b595 in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib64/libpthread.so.0
  4    Thread 0x7fc6997ea700 (LWP 17644) "ns-slapd" 0x00000035d6cea9f3 in select () from /usr/lib64/libc.so.6
  3    Thread 0x7fc698fe9700 (LWP 20831) "ns-slapd" 0x00000035d7008e60 in pthread_join () from /usr/lib64/libpthread.so.0
* 2    Thread 0x7fc6437fe700 (LWP 20833) "ns-slapd" 0x00000035d700ddcd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
  1    Thread 0x7fc6c2a18800 (LWP 17601) "ns-slapd" 0x00000035d6cea9f3 in select () from /usr/lib64/libc.so.6
(gdb) where
#0  0x00000035d700ddcd in __lll_lock_wait () from /usr/lib64/libpthread.so.0
#1  0x00000035d7009c8c in _L_lock_1024 () from /usr/lib64/libpthread.so.0
#2  0x00000035d7009c35 in pthread_mutex_lock () from /usr/lib64/libpthread.so.0
#3  0x0000003dc42235c9 in PR_Lock () from /usr/lib64/libnspr4.so
#4  0x00007fc6bdc0a1ab in conn_read_result_ex (conn=0x7fc630022130, retoidp=0x0, retdatap=0x0, returned_controls=0x0, send_msgid=-1, resp_msgid=0x7fc6437fdd50, block=0) at ldap/servers/plugins/replication/repl5_connection.c:332
#5  0x00007fc6bdc0d3b7 in repl5_inc_result_threadmain (param=0x7fc68002b660) at ldap/servers/plugins/replication/repl5_inc_protocol.c:295
#6  0x0000003dc4228cf3 in ?? () from /usr/lib64/libnspr4.so
#7  0x00000035d7007d14 in start_thread () from /usr/lib64/libpthread.so.0
#8  0x00000035d6cf168d in clone () from /usr/lib64/libc.so.6
(gdb) frame 4
#4  0x00007fc6bdc0a1ab in conn_read_result_ex (conn=0x7fc630022130, retoidp=0x0, retdatap=0x0, returned_controls=0x0, send_msgid=-1, resp_msgid=0x7fc6437fdd50, block=0) at ldap/servers/plugins/replication/repl5_connection.c:332
332         PR_Lock(conn->lock);
(gdb) print conn->lock
$7 = (PRLock *) 0x7fc63000f320
(gdb) x/8lx 0x7fc63000f320
0x7fc63000f320: 0x0000000000000002  0x00000001000044d0
0x7fc63000f330: 0x0000000000000003  0x0000000000000000
0x7fc63000f340: 0x0000000000000000  0x0000000000000000
0x7fc63000f350: 0x0000000000000000  0x0000000000000000
(gdb) print 0x44d0
$9 = 17616
}}}

== Here are the next steps ==

    - Investigate the reason for the hang

'''Here is the current status'''

  • In my last update, I reported a hang condition with the fix. The hang occurred in the fourway test suite, but other test suites (MMR and basic) also contained failures. The hang was the first one I investigated.

  • Today I have run the basic and MMR acceptance suites 3 times. Each time I got 100% success. I cannot reproduce the hang, and the other test suites were also successful.[[BR]]
    Before running the acceptances, I had to investigate strange failures that were related to my environment (an rpm to remove, a broken initial install, a wrong engage file...). These environment issues may be related to the failures I saw mid-Feb.

'''Here are the next steps'''

  • I will continue to run MMR acceptance in a loop over the night

'''Here is the current status'''

  • The suggested fix is valid.
    I ran the full MMR acceptance 10 times, and the only failures that occurred were due to a problem in a fractional test case itself. After fixing it, all tests were consistently 100% successful.

  • I reproduced the hang condition and it was an environment problem:
    it occurs if I rerun the MMR acceptance on the same configured instance.
    If I run the acceptance from a fresh install, it is successful.

'''Here are the next steps'''

  • Will wait for further discussion on whether we want to introduce this fix in main.
    This fix desynchronises the DB RUV from the DB itself, and that needs further discussion.
    Note: there remains the possibility of introducing it behind a configuration parameter.

'''Here are the next steps'''
* The main concern to address before including this fix is to check how monitoring tools (repl-monitor and cn=monitor) are impacted.[[BR]]
The issue is that they would need to monitor the in-memory RUV rather than the DB RUV; otherwise they will lag by ~20s.
'''Here are the next steps'''

  • identify the tools and check how they monitor

'''Here is the current status'''

  • repl-monitor is monitoring the database RUV

The called script is /usr/bin/repl-monitor.pl. It retrieves all the replicas on a given instance and for each of them it searches the replication status with the following lookup:

{{{
ldapsearch -LLL -h localhost -p 3002 -D "cn=directory manager" -w Secret123 -s one -b "dc=com" "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectClass=nsTombstone))" nsds50ruv nsruvReplicaLastModified
dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=com
nsds50ruv: {replicageneration} 5176aebb000000010000
nsds50ruv: {replica 2 ldap://pctbordaz.redhat.com:3002} 5176af9c000000020000 5177c8dd000000020000
nsds50ruv: {replica 1 ldap://pctbordaz.redhat.com:3001} 5176af7c000000010000 5177c8bf000000010000
nsruvReplicaLastModified: {replica 2 ldap://pctbordaz.redhat.com:3002} 5177c8dd
nsruvReplicaLastModified: {replica 1 ldap://pctbordaz.redhat.com:3001} 5177c8bf
}}}

  • with ticket564 fix, the monitored RUV can be up to 20sec late.

'''Here are the next steps'''

  • need to change /usr/bin/repl-monitor.pl to monitor in memory RUV

'''Here is the current status'''

  • the in-memory RUV is not reported by replica_config_search (a search on cn=replica,cn="${SUFFIX}",cn=mapping tree,cn=config).

'''Here are the next steps'''

  • check whether it is available through another method and, if not, whether it is acceptable to report it in replica_config_search

Replying to [comment:20 tbordaz]:

'''Here is the current status'''

  • repl-monitor is monitoring the database RUV

The called script is /usr/bin/repl-monitor.pl. It retrieves all the replicas on a given instance and for each of them it searches the replication status with the following lookup:

{{{
ldapsearch -LLL -h localhost -p 3002 -D "cn=directory manager" -w Secret123 -s one -b "dc=com" "(&(nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff)(objectClass=nsTombstone))" nsds50ruv nsruvReplicaLastModified
dn: nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff,dc=com
nsds50ruv: {replicageneration} 5176aebb000000010000
nsds50ruv: {replica 2 ldap://pctbordaz.redhat.com:3002} 5176af9c000000020000 5177c8dd000000020000
nsds50ruv: {replica 1 ldap://pctbordaz.redhat.com:3001} 5176af7c000000010000 5177c8bf000000010000
nsruvReplicaLastModified: {replica 2 ldap://pctbordaz.redhat.com:3002} 5177c8dd
nsruvReplicaLastModified: {replica 1 ldap://pctbordaz.redhat.com:3001} 5177c8bf
}}}

  • with ticket564 fix, the monitored RUV can be up to 20sec late.

That's not good. What if a customer has custom-written monitoring apps that use this RUV entry?

'''Here are the next steps'''

  • need to change /usr/bin/repl-monitor.pl to monitor in memory RUV

{{{
git merge ticket564
Updating 202c983..7f5268f
Fast-forward
 ldap/servers/plugins/replication/repl5_replica.c | 73 ++++++++++++++++++++++++++++++++++++++++++++-----------------------------
 ldap/servers/plugins/replication/repl5_ruv.c     | 31 +++++++++++++++++++++++++++++++
 ldap/servers/plugins/replication/repl5_ruv.h     |  1 +
 ldap/servers/slapd/back-ldbm/dblayer.c           |  1 +
 ldap/servers/slapd/back-ldbm/ldbm_config.c       |  2 +-
 ldap/servers/slapd/plugin.c                      | 11 +++++++++++
 ldap/servers/slapd/slapi-plugin.h                |  2 ++
 7 files changed, 91 insertions(+), 30 deletions(-)

git push origin master
Counting objects: 29, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (15/15), done.
Writing objects: 100% (15/15), 3.01 KiB, done.
Total 15 (delta 13), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   202c983..7f5268f  master -> master

commit 7f5268f
Author: Thierry bordaz (tbordaz) tbordaz@redhat.com
Date:   Wed Jun 5 11:09:25 2013 +0200
}}}

Metadata Update from @tbordaz:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.2 - 05/13 (May)
