The deadlock occurs when two MODs target the same entry. One MOD locks the entry in the entry cache (find_entry_internal?) and then tries to acquire the backend lock (in txn_begin). The second MOD has already acquired the backend lock (txn_begin) and hangs while trying to lock the entry in the entry cache (cache_lock_entry).
The deadlock occurs while doing performance measurements. To stress performance and bypass the I/O bottleneck, I tuned:
 - nsslapd-threadnumber: 100
 - nsslapd-db-transaction-batch-val: 100
 - nsslapd-backend-opt-level: 7

This tuning helped to reproduce the hang but is not the cause of it.
Deadlock:
MOD ("cn=mr000006001,o=People,o=test_bis_create")
{{{
Thread 89 (Thread 0x7f84a97fa700 (LWP 13200)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f84d1118943 in PR_EnterMonitor (mon=0x7f84480064b0) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
#2  0x00007f84c8a2795a in cache_lock_entry () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3  0x00007f84c8a6bde0 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#4  0x00007f84d2d1c8e1 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#5  0x00007f84d2d1dc1f in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f84d31fc3c1 in connection_threadmain ()
#7  0x00007f84d111de3b in _pt_root (arg=0x7f84d46f7b30) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#8  0x00007f84d0abdee5 in start_thread (arg=0x7f84a97fa700) at pthread_create.c:309
#9  0x00007f84d07ecb8d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
}}}

MOD ("cn=mr000006001,o=People,o=test_bis_create")
{{{
Thread 87 (Thread 0x7f84a87f8700 (LWP 13202)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f84d1118943 in PR_EnterMonitor (mon=0x7f84d3feb9d0) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
#2  0x00007f84c8a2c3b7 in dblayer_lock_backend () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3  0x00007f84c8a311be in dblayer_txn_begin () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#4  0x00007f84c8a6ca77 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#5  0x00007f84d2d1c8e1 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f84d2d1dc1f in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007f84d31fc3c1 in connection_threadmain ()
#8  0x00007f84d111de3b in _pt_root (arg=0x7f84d47321b0) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#9  0x00007f84d0abdee5 in start_thread (arg=0x7f84a87f8700) at pthread_create.c:309
}}}
To reproduce:
On F20, 389-DS 1.3.4, 32-core hardware machine: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz, 32 GB memory.

Install 389-DS 1.3.4 (master branch c3389a4). Create a suffix and trigger MOD updates on a low number of entries (using ldclt) so that several MODs apply to the same entry.
This happens quite often with FreeIPA vault CI test. (Test has not been pushed to repository yet https://www.redhat.com/archives/freeipa-devel/2015-October/msg00124.html)
I have not been able to reproduce it manually.
Description of the test:
 1. install IPA master with KRA
 2. install IPA replica without KRA
 3. install KRA component on replica
 4. uninstall KRA component on replica

Each step adds a new vault on the replica and a new vault on the master.

DS on the master server usually hangs after uninstallation of the KRA component on the master.
I think this problem can also be reproduced (to be confirmed) by doing many SASL auths with the same dn (here 'uid=admin,cn=users,cn=accounts,SUFFIX') in parallel.
The deadlock looks weird (comment #7). Thread 17 should not hold the entry lock, because it has not yet acquired the backend lock. One hypothesis is that thread 17 acquired the entry lock and did not release it. That hangs thread 25, which has already acquired the backend lock. Later, thread 17 processes ANOTHER operation and waits for the backend lock. What I do not yet understand is how a thread can acquire the entry lock and not release it.

These errors are logged:
{{{ [09/Oct/2015:15:45:53 +0200] - Operation error fetching Null DN (f683a431-6e8b11e5-8e69a0e8-cc754ef1), error -30993. [09/Oct/2015:15:45:53 +0200] - dn2entry_ext: Failed to get id for changenumber=340,cn=changelog from entryrdn index (-30993) [09/Oct/2015:15:45:53 +0200] - Operation error fetching changenumber=340,cn=changelog (null), error -30993. [09/Oct/2015:15:45:53 +0200] DSRetroclPlugin - replog: an error occured while adding change number 340, dn = changenumber=340,cn=changelog: Operations error. [09/Oct/2015:15:45:53 +0200] retrocl-plugin - retrocl_postob: operation failure [1]
}}}
But many SASL binds succeeded (dn="uid=admin,cn=users,cn=accounts,SUFFIX") after this failure, so it should not explain the hang.
attachment 0001-Ticket-47978-Deadlock-between-two-MODs-on-the-same-e.patch
This is a side effect of https://fedorahosted.org/389/ticket/47834. That fix assumed that the entry needs to be unlocked only if it is still in the cache.

The fix is running the CI test_vault. Without the fix the test hangs almost systematically; with the fix there was no hang out of 4 runs.

Next steps: let the test run over the weekend; wait for the review.
Great work, Thierry!
Just to be conservative, could you please do this?
{{{
922 if (cache_is_in_cache(&inst->inst_cache, ec)) {
924 if (ec_locked || cache_is_in_cache(&inst->inst_cache, ec)) {
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
}}}
Do we want to backport this fix to the older versions? I see the same code in all the versions down to 1.2.11.
Hi Noriko,
Thank you so much for your review. Your remark makes sense, but I have a doubt:
{{{
792     if (cache_replace( &inst->inst_cache, e, ec ) != 0 ) {
793         MOD_SET_ERROR(ldap_result_code, LDAP_OPERATIONS_ERROR, retry_count);
794         goto error_return;
795     }
796     /* e uncached */
797     /* we must return both e (which has been deleted) and new entry ec to cache */
798     /* cache_replace removes e from the cache hash tables */
799     cache_unlock_entry( &inst->inst_cache, e );
800     CACHE_RETURN( &inst->inst_cache, &e );
801     /* lock new entry in cache to prevent usage until we are complete */
802     cache_lock_entry( &inst->inst_cache, ec );
803     ec_locked = 1;
}}}
At 792, 'ec' is in the cache, but at 800 (CACHE_RETURN) I wonder whether it cannot be put out of the cache. Before https://fedorahosted.org/389/ticket/47834, the flag was confusing: it was named 'ec_in_cache', but I think that was wrong; it actually meant 'ec is locked'.

The only action needed when 'ec_locked' is set is to unlock, not to return 'ec' from the cache.
{{{
924     if (ec_locked) {
925         cache_unlock_entry(&inst->inst_cache, ec);
926     } else if (e) {
}}}
Anyway I agree with you, better to be conservative. I will apply your suggested fix.
You are right, it applies to many versions. Should it be fixed in 1.2.11, 1.3.3, 1.3.4 and master?
Thank you for your agreement. ;)
Yes, let's do that.
I'm giving "ack" to the new patch.
''Master''
{{{
git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   9fefc13..f1e30b0  master -> master

commit f1e30b0
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Fri Oct 16 18:18:01 2015 +0200
}}}
''1.3.4''
{{{
git push origin 389-ds-base-1.3.4
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   2a05a3a..f27c164  389-ds-base-1.3.4 -> 389-ds-base-1.3.4

commit f27c164
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Fri Oct 16 18:18:01 2015 +0200
}}}
''1.3.3''
{{{
git push origin 389-ds-base-1.3.3
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   b20756b..6844592  389-ds-base-1.3.3 -> 389-ds-base-1.3.3

commit 6844592
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Fri Oct 16 18:18:01 2015 +0200
}}}
''1.2.11''
{{{
git push origin 389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   72a6380..bdaed4e  389-ds-base-1.2.11 -> 389-ds-base-1.2.11

commit bdaed4e
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Mon Oct 19 11:03:54 2015 +0200
}}}
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1273550
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1273552
Replying to [comment:11 tbordaz]:
The root cause of the problem is likely identified. During a modify, the modified entry gets into the entry cache and is locked. If, after the be_txn_postop/txn_commit and before the modify returns, the modified entry gets out of the entry cache, the entry is not unlocked. This is a side effect of https://fedorahosted.org/389/ticket/47834: that fix assumed the entry needs to be unlocked only if it is still in the cache. The fix is running the CI test_vault. Without the fix the test hangs almost systematically; with the fix there was no hang out of 4 runs. Next steps: let the test run over the weekend; wait for the review.
To run the CI test_vault:
{{{ sudo IPATEST_YAML_CONFIG=/root/ipa-test-config.yaml ipa-run-tests test_integration/test_vault.py --logging-level debug --pdb }}}
Metadata Update from @nhosoi: - Issue assigned to tbordaz - Issue set to the milestone: 1.2.11.33
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1309
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)