The deadlock occurs when two MODs target the same entry. One MOD locks the entry in the entry cache (find_entry_internal?) and then tries to acquire the backend lock (in txn_begin). The second MOD has already acquired the backend lock (txn_begin) and hangs while trying to lock the entry in the entry cache (cache_lock_entry).
The deadlock occurs while doing performance measurements. To stress performance and bypass the I/O bottleneck, I tuned:
 - nsslapd-threadnumber: 100
 - nsslapd-db-transaction-batch-val: 100
 - nsslapd-backend-opt-level: 7

This tuning helped to reproduce the hang but is not the cause of it.
Deadlock:
MOD ("cn=mr000006001,o=People,o=test_bis_create")
{{{
Thread 89 (Thread 0x7f84a97fa700 (LWP 13200)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f84d1118943 in PR_EnterMonitor (mon=0x7f84480064b0) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
#2  0x00007f84c8a2795a in cache_lock_entry () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3  0x00007f84c8a6bde0 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#4  0x00007f84d2d1c8e1 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#5  0x00007f84d2d1dc1f in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f84d31fc3c1 in connection_threadmain ()
#7  0x00007f84d111de3b in _pt_root (arg=0x7f84d46f7b30) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#8  0x00007f84d0abdee5 in start_thread (arg=0x7f84a97fa700) at pthread_create.c:309
#9  0x00007f84d07ecb8d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
}}}

MOD ("cn=mr000006001,o=People,o=test_bis_create")
{{{
Thread 87 (Thread 0x7f84a87f8700 (LWP 13202)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f84d1118943 in PR_EnterMonitor (mon=0x7f84d3feb9d0) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
#2  0x00007f84c8a2c3b7 in dblayer_lock_backend () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#3  0x00007f84c8a311be in dblayer_txn_begin () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#4  0x00007f84c8a6ca77 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#5  0x00007f84d2d1c8e1 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#6  0x00007f84d2d1dc1f in do_modify () from /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007f84d31fc3c1 in connection_threadmain ()
#8  0x00007f84d111de3b in _pt_root (arg=0x7f84d47321b0) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#9  0x00007f84d0abdee5 in start_thread (arg=0x7f84a87f8700) at pthread_create.c:309
}}}
To reproduce:
On F20, 389-DS 1.3.4, 32-core hardware machine: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz, 32 GB memory.

Install 389-DS 1.3.4 (master branch c3389a4). Create a suffix and trigger MOD updates on a low number of entries (using ldclt) so that several MODs apply to the same entry.
This happens quite often with FreeIPA vault CI test. (Test has not been pushed to repository yet https://www.redhat.com/archives/freeipa-devel/2015-October/msg00124.html)
I have not been able to reproduce it manually.
Description of the test:
 1. install IPA master with KRA
 2. install IPA replica without KRA
 3. install KRA component on replica
 4. uninstall KRA component on replica

Each step adds a new vault on the replica and a new vault on the master.

DS on the master server usually hangs after uninstallation of the KRA component on the master.
I think this problem can also be reproduced (to be confirmed) by doing many SASL auths with the same dn (here 'uid=admin,cn=users,cn=accounts,SUFFIX') in parallel.
The deadlock looks weird (comment #7). Thread 17 should not hold the entry lock, because it has not yet acquired the backend lock. One hypothesis is that thread 17 acquired the entry lock and did not release it. That hangs thread 25, which has already acquired the backend lock. Later, thread 17 processes ANOTHER operation and waits for the backend lock. What I do not yet understand is how a thread can acquire the entry lock and not release it.

These errors are logged:
{{{ [09/Oct/2015:15:45:53 +0200] - Operation error fetching Null DN (f683a431-6e8b11e5-8e69a0e8-cc754ef1), error -30993. [09/Oct/2015:15:45:53 +0200] - dn2entry_ext: Failed to get id for changenumber=340,cn=changelog from entryrdn index (-30993) [09/Oct/2015:15:45:53 +0200] - Operation error fetching changenumber=340,cn=changelog (null), error -30993. [09/Oct/2015:15:45:53 +0200] DSRetroclPlugin - replog: an error occured while adding change number 340, dn = changenumber=340,cn=changelog: Operations error. [09/Oct/2015:15:45:53 +0200] retrocl-plugin - retrocl_postob: operation failure [1]
}}}
But many SASL binds succeeded (dn="uid=admin,cn=users,cn=accounts,SUFFIX") after this failure, so it should not explain the hang.
attachment 0001-Ticket-47978-Deadlock-between-two-MODs-on-the-same-e.patch
This is a side effect of https://fedorahosted.org/389/ticket/47834. That fix assumed that the entry needs to be unlocked only if it is still in the cache.

The fix is running the CI test_vault. Without the fix the test hangs almost systematically; with the fix there was no hang out of 4 runs.

Next steps: let the test run over the weekend; wait for the review.
Great work, Thierry!
Just to be conservative, could you please do this?
{{{
922 if (cache_is_in_cache(&inst->inst_cache, ec)) {
924 if (ec_locked || cache_is_in_cache(&inst->inst_cache, ec)) {
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
}}}
Do we want to backport this fix to the older versions? I see the same code in all the versions down to 1.2.11.
Hi Noriko,
Thank you so much for your review. Your remark makes sense, but I have a doubt:
{{{
792     if (cache_replace( &inst->inst_cache, e, ec ) != 0 ) {
793         MOD_SET_ERROR(ldap_result_code, LDAP_OPERATIONS_ERROR, retry_count);
794         goto error_return;
795     }
796     /* e uncached */
797     /* we must return both e (which has been deleted) and new entry ec to cache */
798     /* cache_replace removes e from the cache hash tables */
799     cache_unlock_entry( &inst->inst_cache, e );
800     CACHE_RETURN( &inst->inst_cache, &e );
801     /* lock new entry in cache to prevent usage until we are complete */
802     cache_lock_entry( &inst->inst_cache, ec );
803     ec_locked = 1;
}}}
At 792, 'ec' is in the cache, but at 800 (CACHE_RETURN) I wonder whether it cannot be put out of the cache. Before https://fedorahosted.org/389/ticket/47834, the flag was confusing: it was named 'ec_in_cache', but I think that was wrong; it actually meant 'ec is locked'.

The only action needed when 'ec_locked' is set is to unlock, not to return 'ec' from the cache.
{{{
924     if (ec_locked) {
925         cache_unlock_entry(&inst->inst_cache, ec);
926     } else if (e) {
}}}
Anyway I agree with you, better to be conservative. I will apply your suggested fix.
You are right, it applies to many versions. Should it be fixed in 1.2.11, 1.3.3, 1.3.4 and master?
Thank you for your agreement. ;)
Yes, let's do that.
I'm giving "ack" to the new patch.
''Master''
{{{
git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   9fefc13..f1e30b0  master -> master

commit f1e30b0
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Fri Oct 16 18:18:01 2015 +0200
}}}
''1.3.4''
{{{
git push origin 389-ds-base-1.3.4
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   2a05a3a..f27c164  389-ds-base-1.3.4 -> 389-ds-base-1.3.4

commit f27c164
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Fri Oct 16 18:18:01 2015 +0200
}}}
''1.3.3''
{{{
git push origin 389-ds-base-1.3.3
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   b20756b..6844592  389-ds-base-1.3.3 -> 389-ds-base-1.3.3

commit 6844592
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Fri Oct 16 18:18:01 2015 +0200
}}}
''1.2.11''
{{{
git push origin 389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.09 KiB, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   72a6380..bdaed4e  389-ds-base-1.2.11 -> 389-ds-base-1.2.11

commit bdaed4e
Author: Thierry Bordaz <tbordaz@redhat.com>
Date:   Mon Oct 19 11:03:54 2015 +0200
}}}
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1273550
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1273552
Replying to [comment:11 tbordaz]:
The root cause of the problem is likely identified. During a modify, the modified entry gets into the entry cache and is locked. If, after the be_txn_postop/txn_commit and before the modify returns, the modified entry gets out of the entry cache, the entry is not unlocked. This is a side effect of https://fedorahosted.org/389/ticket/47834: that fix assumed the entry needs to be unlocked only if it is still in the cache. The fix is running the CI test_vault. Without the fix the test hangs almost systematically; with the fix there was no hang out of 4 runs. Next steps: let the test run over the weekend; wait for the review.
To run the CI test_vault:
{{{ sudo IPATEST_YAML_CONFIG=/root/ipa-test-config.yaml ipa-run-tests test_integration/test_vault.py --logging-level debug --pdb }}}
Metadata Update from @nhosoi: - Issue assigned to tbordaz - Issue set to the milestone: 1.2.11.33
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1309
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)