#49079 deadlock on cos cache rebuild
Closed: wontfix 7 years ago Opened 7 years ago by tbordaz.

The problem is easily reproducible using replication (see test case) but may also exists on standalone instance.

There is a thread responsible to rebuild cos cache when it is notified that a cos definition was updated. To do so this thread, holding the notification lock, will rebuild the cache reading the definitions (on backend) trying to acquire db page.

A thread that updates a cos definition will notify the first thread. The update acquire db page and tries to acquire notification lock.

This is a scenario of deadlock because of locks taken in the opposite order.


Signature of the deadlock is {{{ Thread 41 (Thread 0x7fa6dce71700 (LWP 4933)): #0 pthread_cond_wait@@GLIBC_2.3.2 (...) #1 0x00007fa6e89802f3 in __db_pthread_mutex_condwait (...) #2 __db_hybrid_mutex_suspend (...) #3 0x00007fa6e897f640 in __db_tas_mutex_lock_int (...) #4 __db_tas_mutex_lock (...) #5 0x00007fa6e8a29d0a in __lock_get_internal (...) #6 0x00007fa6e8a2a7f0 in __lock_get (...) #7 0x00007fa6e8a56132 in __db_lget (...) #8 0x00007fa6e899d5e5 in __bam_search (...) #9 0x00007fa6e8988256 in __bamc_search (...) #10 0x00007fa6e8989d0f in __bamc_get (...) #11 0x00007fa6e8a42c76 in __dbc_iget (...) #12 0x00007fa6e8a4344d in __dbc_get (...) #13 0x00007fa6e8a4f863 in __db_get (...) #14 0x00007fa6e8a53143 in __db_get_pp (...) #15 0x00007fa6e3d11b6b in id2entry (...) #16 0x00007fa6e3d4d2dd in ldbm_back_next_search_entry_ext (...) #17 0x00007fa6f179782b in iterate (...) #18 0x00007fa6f1797daa in send_results_ext (...) #19 0x00007fa6f1799811 in op_shared_search (...) #20 0x00007fa6f17a9ede in search_internal_callback_pb (...) #21 0x00007fa6f17aa459 in slapi_search_internal_callback_pb (...) #22 0x00007fa6e6c4da14 in cos_cache_add_dn_tmpls (...) #23 cos_cache_add_defn (...) #24 cos_dn_defs_cb (...) #25 0x00007fa6f17bb49d in send_ldap_search_entry_ext (...) #26 0x00007fa6f17bbd2c in send_ldap_search_entry (...) #27 0x00007fa6f1797c59 in iterate (...) #28 0x00007fa6f1797daa in send_results_ext (...) #29 0x00007fa6f1799811 in op_shared_search (...) #30 0x00007fa6f17a9ede in search_internal_callback_pb (...) #31 0x00007fa6f17aa459 in slapi_search_internal_callback_pb (...) #32 0x00007fa6e6c4eec0 in cos_cache_add_dn_defs (...) #33 cos_cache_build_definition_list (...) #34 cos_cache_create (...) #35 0x00007fa6e6c4f70f in cos_cache_wait_on_change (...) #36 0x00007fa6ef97f96b in _pt_root (...) #37 0x00007fa6ef31fdc5 in start_thread (...) #38 0x00007fa6ef04d1cd in clone (...) Thread 20 (Thread 0x7fa6c0ff9700 (LWP 4955)): #0 __lll_lock_wait (...) #1 0x00007fa6ef321d38 in _L_lock_975 () from /lib64/libpthread.so.0 #2 0x00007fa6ef321ce1 in __GI___pthread_mutex_lock (...) #3 0x00007fa6ef979e39 in PR_Lock (...) #4 0x00007fa6f17c7f7a in slapi_lock_mutex (...) #5 0x00007fa6e6c4ff04 in cos_cache_change_notify (...) #6 0x00007fa6e6c4abfd in cos_post_op (...) #7 0x00007fa6f17a4d38 in plugin_call_func (...) #8 0x00007fa6f17a4fc3 in plugin_call_list (...) #9 plugin_call_plugins (...) #10 0x00007fa6f1792575 in op_shared_modify (...) #11 0x00007fa6f1793014 in modify_internal_pb (...) #12 0x00007fa6f1793b13 in slapi_modify_internal_pb (...) #13 0x00007fa6e3a96926 in urp_fixup_modify_entry (...) #14 0x00007fa6e3a96a62 in mod_namingconflict_attr (...) #15 0x00007fa6e3a974ea in urp_annotate_dn (...) #16 0x00007fa6e3a977fb in urp_add_operation (...) #17 0x00007fa6e3a7d2e8 in multimaster_bepreop_add (...) #18 0x00007fa6f17a4d38 in plugin_call_func (...) #19 0x00007fa6f17a4fc3 in plugin_call_list (...) #20 plugin_call_plugins (...) #21 0x00007fa6e3d297b1 in ldbm_back_add (...) #22 0x00007fa6f1748ee8 in op_shared_add (...) #23 0x00007fa6f174a1b0 in do_add (...) #24 0x00007fa6f1c7e9b3 in connection_dispatch_operation (...) #25 connection_threadmain (...) #26 0x00007fa6ef97f96b in _pt_root (...) #27 0x00007fa6ef31fdc5 in start_thread (...) #28 0x00007fa6ef04d1cd in clone (...) }}}

A small note, is cos_cache_create_unloc()? meant to be cos_cache_create_unlock()?

Otherwise, I can understand the fix and how it works, it seems reasonable. I'm wondering if there is a cleaner way to achieve this with reordering of the backend lock compared to the change_lock, or with monitors.

Regardless, I think this solution will work,

Thanks William and Ludwig for the reviews

'''git push origin master'''
Counting objects: 7, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 1.90 KiB | 0 bytes/s, done.
Total 7 (delta 4), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
9feca58..ac44337 master -> master

commit ac44337
Author: Thierry Bordaz tbordaz@redhat.com
Date: Tue Jan 10 14:32:53 2017 +0100

Pushed to 389-ds-base-1.3.5:
79a3dea..3ac12cb 389-ds-base-1.3.5 -> 389-ds-base-1.3.5
commit 3ac12cb

Metadata Update from @tbordaz:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.6 backlog

7 years ago

Hey Thierry, is this issue now resolved? Would you like to close it?

Metadata Update from @firstyear:
- Issue close_status updated to: None

7 years ago

Metadata Update from @tbordaz:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2138

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

Login to comment on this ticket.

Metadata