#47931 deadlock with memberof plugin and retrocl
Closed: Fixed None Opened 5 years ago by lkrispen.

In IPA upgrade this deadlock was encountered:

Thread 4 (Thread 0x7f27f97e2700 (LWP 11679)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f282fa1a1af in _L_lock_1026 () from /lib64/libpthread.so.0
#2  0x00007f282fa1a151 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f2833f46d18) at ../nptl/pthread_mutex_lock.c:134
#3  0x00007f28300722c9 in PR_Lock (lock=lock@entry=0x7f2833f46d18) at ../../../nspr/pr/src/pthreads/ptsynch.c:174
#4  0x00007f2830072883 in PR_EnterMonitor (mon=0x7f2833f46d10) at ../../../nspr/pr/src/pthreads/ptsynch.c:529
#5  0x00007f282612bff6 in dblayer_txn_begin () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#6  0x00007f282614bba3 in ldbm_back_add () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#7  0x00007f2831c4ec1a in op_shared_add () from /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f2831c4f4a3 in add_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0
#9  0x00007f28249daf0d in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#10 0x00007f2831ca7cb0 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#11 0x00007f2831ca7f08 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#12 0x00007f2826166b91 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#13 0x00007f2831c94fa1 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0
#14 0x00007f2831c95a54 in modify_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0
#15 0x00007f28258446a1 in memberof_fix_memberof_callback () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#16 0x00007f2831cbc3cd in send_ldap_search_entry_ext () from /usr/lib64/dirsrv/libslapd.so.0
#17 0x00007f2831cbcc0c in send_ldap_search_entry () from /usr/lib64/dirsrv/libslapd.so.0
#18 0x00007f2831c9a593 in iterate.isra.0.constprop.3 () from /usr/lib64/dirsrv/libslapd.so.0
#19 0x00007f2831c9a73a in send_results_ext.constprop.2 () from /usr/lib64/dirsrv/libslapd.so.0
#20 0x00007f2831c9be21 in op_shared_search () from /usr/lib64/dirsrv/libslapd.so.0
#21 0x00007f2831cabc8e in search_internal_callback_pb () from /usr/lib64/dirsrv/libslapd.so.0
#22 0x00007f2825845ef7 in memberof_fixup_task_thread () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#23 0x00007f2830077c10 in _pt_root (arg=0x7f28140485e0) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#24 0x00007f282fa17f33 in start_thread (arg=0x7f27f97e2700) at pthread_create.c:309
#25 0x00007f282f745ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 36 (Thread 0x7f283204f700 (LWP 11643)):
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f282fa1a1af in _L_lock_1026 () from /lib64/libpthread.so.0
#2  0x00007f282fa1a151 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7f2833f9fd38) at ../nptl/pthread_mutex_lock.c:134
#3  0x00007f28300722c9 in PR_Lock (lock=lock@entry=0x7f2833f9fd38) at ../../../nspr/pr/src/pthreads/ptsynch.c:174
#4  0x00007f2830072883 in PR_EnterMonitor (mon=0x7f2833f9fd30) at ../../../nspr/pr/src/pthreads/ptsynch.c:529
#5  0x00007f2825846eca in memberof_postop_del () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so
#6  0x00007f2831ca7cb0 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0
#7  0x00007f2831ca7f08 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0
#8  0x00007f28261594de in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so
#9  0x00007f2831c5beb0 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0
#10 0x00007f2831c5c062 in delete_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0
#11 0x00007f28249db6de in delete_changerecord () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#12 0x00007f28249db9b1 in changelog_trim_thread_fn () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so
#13 0x00007f2830077c10 in _pt_root (arg=0x7f28100008e0) at ../../../nspr/pr/src/pthreads/ptthread.c:204
#14 0x00007f282fa17f33 in start_thread (arg=0x7f283204f700) at pthread_create.c:309
#15 0x00007f282f745ead in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

There is a scope for memberof defined:

memberofentryscope: dc=ipa,dc=dev,dc=lan

which should exclude cn=changelog, but the check is only done after acquiring the memberof lock.


I think for postop_add and postop_del this check coould be done when memberof_oktodo() is called

Ludwig, setting the next 1.3.2 to the milestone.
Please feel free to adjust it to your plan.
Thanks!

Is this still an issue in the IPA upgrade (from which version to which)?

If so, could you put the 389-ds-base version and the stacktraces with the matched debuginfo (there are some functions which options and lines are missing...)?

Setting to FUTURE.

Comment by Ludwig:
global lock and lmdb redesign should handle this

Version: 389-ds-base-1.3.3.1-15.el7_1.x86_64

Replying to [comment:10 mreynolds]:
Thanks to Mark for picking up this ticket.
Setting Milestone to 1.3.4.3...

Hi Mark!

I'm reviewing your fix. First, I'm trying to understand the "fix description"... The deadlock among plug-ins is always hard to solve, isn't it? ;)
{{{
Bug Description: When concurrently updating multiple backends the
memberOf and retrocl plugins can deadlock on each
other. This is caused by the required retrocl lock,
and the db lock on the changenumber index in the
retrocl db.

Fix Description: Added scoping to the retrocl that allows subtrees/suffixes
to be included or excluded. Also moved the existing
memberOf scoping outside of its global lock.
}}}
So, the deadlock was caused by the memberof scoping inside the global lock and retrocl lock (retrocl_internal_lock or retrocl_cn_lock?)? And basically just moving the memberof scoping out solves the deadlock?

Replying to [comment:14 nhosoi]: > Hi Mark! > > I'm reviewing your fix. First, I'm trying to understand the "fix description"... The deadlock among plug-ins is always hard to solve, isn't it? ;) It is, and this fix is almost a workaround. The retrocl design makes it very difficult to avoid deadlocks when dealing with multiple backends. > {{{ > Bug Description: When concurrently updating multiple backends the > memberOf and retrocl plugins can deadlock on each > other. This is caused by the required retrocl lock, > and the db lock on the changenumber index in the > retrocl db. > > Fix Description: Added scoping to the retrocl that allows subtrees/suffixes > to be included or excluded. Also moved the existing > memberOf scoping outside of its global lock. > }}} > So, the deadlock was caused by the memberof scoping inside the global lock and retrocl lock (retrocl_internal_lock or retrocl_cn_lock?)? And basically just moving the memberof scoping out solves the deadlock? There are two different deadlocks this patch tries to address. The first one is very easy to reproduce, the other I can not reproduce. Here is the stack from the first deadloack that is easy to reproduce: {{{ Thread 14 #2 0x00007fc6b38420b9 in PR_Lock () from /lib64/libnspr4.so #3 0x00007fc6a7ec99b0 in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #4 0x00007fc6b546156f in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #5 0x00007fc6b5461893 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #6 0x00007fc6a965d231 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #7 0x00007fc6b544e905 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0 #8 0x00007fc6b544f387 in modify_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0 #9 0x00007fc6a8d354a2 in memberof_fix_memberof_callback () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #10 0x00007fc6a8d35f65 in memberof_modop_one_replace_r () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #11 0x00007fc6a8d362a6 in memberof_mod_smod_list () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #12 0x00007fc6a8d3758d in memberof_postop_modify () from /usr/lib64/dirsrv/plugins/libmemberof-plugin.so #13 0x00007fc6b546156f in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #14 0x00007fc6b5461893 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #15 0x00007fc6a965d231 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #16 0x00007fc6b544e905 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0 #17 0x00007fc6b544fbe7 in do_modify () from /usr/lib64/dirsrv/libslapd.so.0 Thread 13 (Thread 0x7fc678ff1700 (LWP 1931)): #0 0x00007fc6b31ed590 inpthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fc6adc811fb in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so #2 0x00007fc6adc8059b in __db_tas_mutex_lock () from /lib64/libdb-5.3.so #3 0x00007fc6add2aa91 in __lock_get_internal () from /lib64/libdb-5.3.so #4 0x00007fc6add2b657 in __lock_get () from /lib64/libdb-5.3.so #5 0x00007fc6add576af in __db_lget () from /lib64/libdb-5.3.so #6 0x00007fc6adc9e4f0 in __bam_search () from /lib64/libdb-5.3.so #7 0x00007fc6adc89144 in __bamc_search () from /lib64/libdb-5.3.so #8 0x00007fc6adc8b074 in __bamc_get () from /lib64/libdb-5.3.so #9 0x00007fc6add43ba3 in __dbc_iget () from /lib64/libdb-5.3.so #10 0x00007fc6add53092 in __dbc_get_pp () from /lib64/libdb-5.3.so #11 0x00007fc6a9672b70 in ldbm_back_seq () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #12 0x00007fc6b54650f9 in seq_internal_callback_pb () from /usr/lib64/dirsrv/libslapd.so.0 #13 0x00007fc6b54652a9 in slapi_seq_callback () from /usr/lib64/dirsrv/libslapd.so.0 #14 0x00007fc6a7ec8919 in retrocl_update_lastchangenumber () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #15 0x00007fc6a7ec89bb in retrocl_assign_changenumber () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #16 0x00007fc6a7ec99b5 in retrocl_postob () from /usr/lib64/dirsrv/plugins/libretrocl-plugin.so #17 0x00007fc6b546156f in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #18 0x00007fc6b5461893 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #19 0x00007fc6a965d231 in ldbm_back_modify () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #20 0x00007fc6b544e905 in op_shared_modify () from /usr/lib64/dirsrv/libslapd.so.0 #21 0x00007fc6b544fbe7 in do_modify () from /usr/lib64/dirsrv/libslapd.so.0 }}} Thread 14 is updating many members in one mod operation, so the retrocl is being called multiple times in the modify operation. It holds a lock on the changennumber index(via retrocl plugin). Then there are updates occuring on a different backend (thread 13) that grab the retrocl lock, but then gets stuck waiting on the changenumber index which thread 13 holds. But then thread 14, which has the changenumber lock, tries to grab the retrocl lock again - and we deadlock between the two threads. So by adding scoping to retrocl, where we exclude the other backends, we avoid this deadlock. The second deadlock, the stack is listed in this ticket, I can not reproduce. But... By moving the scoping check outside of the memberOf lock, and combined with the new retrocl scoping, should address this deadlock as well. Really, the problem is the retro changelog design. The retrocl plugin almost needs to be either rewritten/redesigned entirely, or simply removed and the replication changelog extended to fulfill the needs that the retrocl was providing. So for now, this patch addresses the issues that current customers are running into with IPA, and it also adds some new functionality/flexibility to the retrocl plugin.

Thank you for your explanation, Mark! Now I understood the design of your fix.

2 minor issues...
1. Is 'i' supposed to get incremented?
{{{
645 / Check if the the entry include scope is a child of the sdn /
646 static Slapi_DN
647 memberof_scope_is_child_of_dn(MemberOfConfig
config, Slapi_DN sdn)
648 {
649 int i = 0;
650
651 while(config->entryScopes && config->entryScopes[i]){
652 if(slapi_sdn_issuffix(config->entryScopes[i], sdn)){
653 return config->entryScopes[i];
654 }
655 }
656 return NULL;
657 }
}}}
2. Is retro changelog applicable to the dynamic plug-in? If yes, retrocl_includes|excludes could leak?
{{{
48 static Slapi_DN retrocl_includes = NULL;
49 static Slapi_DN
retrocl_excludes = NULL;
391 retrocl_excludes = (Slapi_DN )slapi_ch_calloc(sizeof(Slapi_DN *),num_vals+1);
411 retrocl_includes = (Slapi_DN
)slapi_ch_calloc(sizeof(Slapi_DN
),num_vals+1);
}}}

Replying to [comment:16 nhosoi]:

Thank you for your explanation, Mark! Now I understood the design of your fix.

2 minor issues...
1. Is 'i' supposed to get incremented?
{{{
645 / Check if the the entry include scope is a child of the sdn /
646 static Slapi_DN
647 memberof_scope_is_child_of_dn(MemberOfConfig
config, Slapi_DN *sdn)
648 {
649 int i = 0;
650
651 while(config->entryScopes && config->entryScopes[i]){
652 if(slapi_sdn_issuffix(config->entryScopes[i], sdn)){
653 return config->entryScopes[i];
654 }
655 }
656 return NULL;
657 }
}}}

Good catch

  1. Is retro changelog applicable to the dynamic plug-in? If yes, retrocl_includes|excludes could leak?
    {{{
    48 static Slapi_DN retrocl_includes = NULL;
    49 static Slapi_DN
    retrocl_excludes = NULL;
    391 retrocl_excludes = (Slapi_DN )slapi_ch_calloc(sizeof(Slapi_DN *),num_vals+1);
    411 retrocl_includes = (Slapi_DN
    )slapi_ch_calloc(sizeof(Slapi_DN *),num_vals+1);
    }}}

This is freed in the plugin stop function, and that's how dynamic plugins work. Any change to plugin's config triggers the plugin to be restarted. So this is not leaked.

Hi Mark, thanks for waiting. Please push it!

6b138a2..fd959ac master -> master
commit fd959ac
Author: Mark Reynolds mreynolds@redhat.com
Date: Tue Aug 4 12:19:31 2015 -0400

e62b481..d810847 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit d810847

Coverity issues found, reopening...

fd959ac..5daea97 master -> master
commit 5daea97
Author: Mark Reynolds mreynolds@redhat.com
Date: Mon Aug 10 10:42:40 2015 -0400

d810847..9a0047e 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 9a0047e

08eabd5..f86dab3 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit f86dab3

f86dab3..e6343e6 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit e6343e6

Debug logging was accidentally left in the source - it has now been removed:

29c09a5..db7153f master -> master
commit db7153f
Author: Mark Reynolds mreynolds@redhat.com
Date: Wed Aug 19 10:03:50 2015 -0400

2c5e0d5..1781280 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 1781280

895dc4f..5c48f28 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit 5c48f28

Replying to [comment:27 mreynolds]:
Probably, a silly question... :p Could you tell me where test_ticket47931 is called?

Replying to [comment:28 nhosoi]:

Replying to [comment:27 mreynolds]:
Probably, a silly question... :p Could you tell me where test_ticket47931 is called?

Simon modified create-ticket.py to make new ticket tests more compliant with the pytest way. So now in the main function we call

{{{
pytest.main("-s %s" % CURRENT_FILE)
}}}

Which executes all the functions that start with "test".

Thanks Noriko!

4277f2b..48470e7 master -> master
commit 48470e7
Author: Mark Reynolds mreynolds@redhat.com
Date: Fri Aug 28 15:01:58 2015 -0400

8c9c2fc..22e4de7 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 22e4de7

f10e2e6..e7d3b2f 389-ds-base-1.3.3 -> 389-ds-base-1.3.3
commit e7d3b2f

Metadata Update from @mreynolds:
- Issue assigned to mreynolds
- Issue set to the milestone: 1.3.3.13

2 years ago

Login to comment on this ticket.

Metadata