#49079 deadlock on cos cache rebuild
Closed: fixed 2 years ago Opened 2 years ago by tbordaz.

The problem is easily reproducible using replication (see test case) but may also exist on a standalone instance.

A dedicated thread rebuilds the cos cache when it is notified that a cos definition was updated. To do so, this thread, while holding the notification lock, rebuilds the cache by reading the definitions from the backend, which requires acquiring db pages.

A thread that updates a cos definition notifies the rebuild thread. The update acquires db pages first and then tries to acquire the notification lock.

This is a classic deadlock scenario: the two locks are taken in opposite orders by the two threads.
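The inversion described above can be sketched with plain pthread mutexes. All names here (`change_lock` standing in for the cos notification lock, `db_lock` standing in for a backend db page lock) are hypothetical illustrations, not the actual server code. In this sketch both threads take the locks in the same order, so it always terminates; swapping the two `pthread_mutex_lock` calls in `update_thread` reproduces the hang.

```c
#include <pthread.h>

/* Hypothetical stand-ins for the two locks involved:
 * change_lock ~ the cos cache notification lock,
 * db_lock     ~ a backend db page lock. */
static pthread_mutex_t change_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t db_lock = PTHREAD_MUTEX_INITIALIZER;
static int rebuilt = 0, notified = 0;

/* Rebuild path (cf. cos_cache_wait_on_change): takes the
 * notification lock, then needs db pages to re-read the
 * cos definitions: change_lock -> db_lock. */
static void *rebuild_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&change_lock);
    pthread_mutex_lock(&db_lock);
    rebuilt = 1;
    pthread_mutex_unlock(&db_lock);
    pthread_mutex_unlock(&change_lock);
    return NULL;
}

/* Update path (cf. cos_cache_change_notify called from a post-op
 * plugin): in the bug it already holds db pages and then takes the
 * notification lock, i.e. db_lock -> change_lock, the opposite
 * order.  Here we deliberately use the SAME order as the rebuild
 * thread, so the sketch cannot deadlock. */
static void *update_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&change_lock);  /* same order: change_lock first */
    pthread_mutex_lock(&db_lock);
    notified = 1;
    pthread_mutex_unlock(&db_lock);
    pthread_mutex_unlock(&change_lock);
    return NULL;
}

/* Run both threads to completion; returns 1 if both finished. */
int run_both(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, rebuild_thread, NULL);
    pthread_create(&b, NULL, update_thread, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return rebuilt && notified;
}
```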


Signature of the deadlock is:

{{{
Thread 41 (Thread 0x7fa6dce71700 (LWP 4933)):
#0  pthread_cond_wait@@GLIBC_2.3.2 (...)
#1  0x00007fa6e89802f3 in __db_pthread_mutex_condwait (...)
#2  __db_hybrid_mutex_suspend (...)
#3  0x00007fa6e897f640 in __db_tas_mutex_lock_int (...)
#4  __db_tas_mutex_lock (...)
#5  0x00007fa6e8a29d0a in __lock_get_internal (...)
#6  0x00007fa6e8a2a7f0 in __lock_get (...)
#7  0x00007fa6e8a56132 in __db_lget (...)
#8  0x00007fa6e899d5e5 in __bam_search (...)
#9  0x00007fa6e8988256 in __bamc_search (...)
#10 0x00007fa6e8989d0f in __bamc_get (...)
#11 0x00007fa6e8a42c76 in __dbc_iget (...)
#12 0x00007fa6e8a4344d in __dbc_get (...)
#13 0x00007fa6e8a4f863 in __db_get (...)
#14 0x00007fa6e8a53143 in __db_get_pp (...)
#15 0x00007fa6e3d11b6b in id2entry (...)
#16 0x00007fa6e3d4d2dd in ldbm_back_next_search_entry_ext (...)
#17 0x00007fa6f179782b in iterate (...)
#18 0x00007fa6f1797daa in send_results_ext (...)
#19 0x00007fa6f1799811 in op_shared_search (...)
#20 0x00007fa6f17a9ede in search_internal_callback_pb (...)
#21 0x00007fa6f17aa459 in slapi_search_internal_callback_pb (...)
#22 0x00007fa6e6c4da14 in cos_cache_add_dn_tmpls (...)
#23 cos_cache_add_defn (...)
#24 cos_dn_defs_cb (...)
#25 0x00007fa6f17bb49d in send_ldap_search_entry_ext (...)
#26 0x00007fa6f17bbd2c in send_ldap_search_entry (...)
#27 0x00007fa6f1797c59 in iterate (...)
#28 0x00007fa6f1797daa in send_results_ext (...)
#29 0x00007fa6f1799811 in op_shared_search (...)
#30 0x00007fa6f17a9ede in search_internal_callback_pb (...)
#31 0x00007fa6f17aa459 in slapi_search_internal_callback_pb (...)
#32 0x00007fa6e6c4eec0 in cos_cache_add_dn_defs (...)
#33 cos_cache_build_definition_list (...)
#34 cos_cache_create (...)
#35 0x00007fa6e6c4f70f in cos_cache_wait_on_change (...)
#36 0x00007fa6ef97f96b in _pt_root (...)
#37 0x00007fa6ef31fdc5 in start_thread (...)
#38 0x00007fa6ef04d1cd in clone (...)

Thread 20 (Thread 0x7fa6c0ff9700 (LWP 4955)):
#0  __lll_lock_wait (...)
#1  0x00007fa6ef321d38 in _L_lock_975 () from /lib64/libpthread.so.0
#2  0x00007fa6ef321ce1 in __GI___pthread_mutex_lock (...)
#3  0x00007fa6ef979e39 in PR_Lock (...)
#4  0x00007fa6f17c7f7a in slapi_lock_mutex (...)
#5  0x00007fa6e6c4ff04 in cos_cache_change_notify (...)
#6  0x00007fa6e6c4abfd in cos_post_op (...)
#7  0x00007fa6f17a4d38 in plugin_call_func (...)
#8  0x00007fa6f17a4fc3 in plugin_call_list (...)
#9  plugin_call_plugins (...)
#10 0x00007fa6f1792575 in op_shared_modify (...)
#11 0x00007fa6f1793014 in modify_internal_pb (...)
#12 0x00007fa6f1793b13 in slapi_modify_internal_pb (...)
#13 0x00007fa6e3a96926 in urp_fixup_modify_entry (...)
#14 0x00007fa6e3a96a62 in mod_namingconflict_attr (...)
#15 0x00007fa6e3a974ea in urp_annotate_dn (...)
#16 0x00007fa6e3a977fb in urp_add_operation (...)
#17 0x00007fa6e3a7d2e8 in multimaster_bepreop_add (...)
#18 0x00007fa6f17a4d38 in plugin_call_func (...)
#19 0x00007fa6f17a4fc3 in plugin_call_list (...)
#20 plugin_call_plugins (...)
#21 0x00007fa6e3d297b1 in ldbm_back_add (...)
#22 0x00007fa6f1748ee8 in op_shared_add (...)
#23 0x00007fa6f174a1b0 in do_add (...)
#24 0x00007fa6f1c7e9b3 in connection_dispatch_operation (...)
#25 connection_threadmain (...)
#26 0x00007fa6ef97f96b in _pt_root (...)
#27 0x00007fa6ef31fdc5 in start_thread (...)
#28 0x00007fa6ef04d1cd in clone (...)
}}}

A small note: is cos_cache_create_unloc() meant to be cos_cache_create_unlock()?

Otherwise, I understand the fix and how it works; it seems reasonable. I'm wondering if there is a cleaner way to achieve this by reordering the backend lock relative to the change_lock, or with monitors.
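One monitor-style alternative to a strict lock-ordering rule is to make the notification lock protect only a dirty flag, so it is never held while touching the backend. The sketch below is purely illustrative (the `_sketch` function names and the flag are hypothetical, not the committed fix): the updater takes the lock only briefly to set the flag and signal, and the rebuild thread drops the lock before reading definitions from the backend, so the lock can no longer invert with db page locks.

```c
#include <pthread.h>

/* Hypothetical monitor: change_lock guards only cache_dirty and
 * is never held across backend (db) access. */
static pthread_mutex_t change_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t change_cv = PTHREAD_COND_INITIALIZER;
static int cache_dirty = 0;

/* Updater side (cf. cos_cache_change_notify): may hold db pages,
 * but only takes change_lock for a few instructions, never waiting
 * on db work, so no cyclic wait is possible. */
void cos_cache_change_notify_sketch(void)
{
    pthread_mutex_lock(&change_lock);
    cache_dirty = 1;
    pthread_cond_signal(&change_cv);
    pthread_mutex_unlock(&change_lock);
}

/* Rebuild side: consume the flag under the lock, then release the
 * lock BEFORE reading the definitions from the backend. */
int cos_cache_wait_and_rebuild_sketch(int (*rebuild)(void))
{
    pthread_mutex_lock(&change_lock);
    while (!cache_dirty)
        pthread_cond_wait(&change_cv, &change_lock);
    cache_dirty = 0;
    pthread_mutex_unlock(&change_lock);
    return rebuild();   /* backend reads happen without change_lock */
}

/* Minimal self-test: one thread notifies, the main path rebuilds. */
static int stub_rebuild(void) { return 1; }

static void *notifier(void *arg)
{
    (void)arg;
    cos_cache_change_notify_sketch();
    return NULL;
}

int selftest(void)
{
    pthread_t t;
    int r;
    pthread_create(&t, NULL, notifier, NULL);
    r = cos_cache_wait_and_rebuild_sketch(stub_rebuild);
    pthread_join(t, NULL);
    return r;
}
```

Because the flag persists, the signal cannot be lost even if the notifier runs before the waiter reaches `pthread_cond_wait`.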

Regardless, I think this solution will work.

Thanks William and Ludwig for the reviews.

'''git push origin master'''
{{{
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.90 KiB | 0 bytes/s, done.
Total 7 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   9feca58..ac44337  master -> master
}}}

commit ac44337
Author: Thierry Bordaz tbordaz@redhat.com
Date: Tue Jan 10 14:32:53 2017 +0100

Pushed to 389-ds-base-1.3.5:
79a3dea..3ac12cb 389-ds-base-1.3.5 -> 389-ds-base-1.3.5
commit 3ac12cb

Metadata Update from @tbordaz:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.6 backlog

2 years ago

Hey Thierry, is this issue now resolved? Would you like to close it?

Metadata Update from @firstyear:
- Issue close_status updated to: None

2 years ago

Metadata Update from @tbordaz:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

2 years ago
