If a server in an MMR environment is under load (processing adds and deletes) and you try to initialize the database (ldif2db.pl), you can deadlock the server:
#0  0x000000378e40e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000378e4093be in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x000000378e409326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003a8d023fe9 in PR_Lock () from /lib64/libnspr4.so
#4  0x00007f0d153113a8 in replica_get_generation (r=0x12c8790) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:957
#5  0x00007f0d1530c84d in copy_operation_parameters (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/plugins/replication/repl5_plugins.c:923
#6  0x00007f0d1530bab9 in multimaster_preop_delete (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/plugins/replication/repl5_plugins.c:391
#7  0x00007f0d18abddb9 in plugin_call_func (list=0xf66860, operation=423, pb=0x7f0cfc0192c0, call_one=0) at ../ds/ldap/servers/slapd/plugin.c:1453
#8  0x00007f0d18abdc6c in plugin_call_list (list=0xf57ac0, operation=423, pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/plugin.c:1415
#9  0x00007f0d18abc200 in plugin_call_plugins (pb=0x7f0cfc0192c0, whichfunction=423) at ../ds/ldap/servers/slapd/plugin.c:398
#10 0x00007f0d18a67584 in op_shared_delete (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/delete.c:355
#11 0x00007f0d18a670e6 in delete_internal_pb (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/delete.c:242
#12 0x00007f0d18a66f2d in slapi_delete_internal_pb (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/delete.c:185
#13 0x00007f0d15314b8e in _delete_tombstone (tombstone_dn=0x12acb60 "dc=example,dc=com", uniqueid=0x7f0d15355b10 "ffffffff-ffffffff-ffffffff-ffffffff", ext_op_flags=131072) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:2723
#14 0x00007f0d15313d65 in _replica_configure_ruv (r=0x12c8790, isLocked=1) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:2225
#15 0x00007f0d15311efe in replica_reload_ruv (r=0x12c8790) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:1318
    ^-- replica_reload_ruv() takes the repl lock, as does frame #3
#16 0x00007f0d153169ce in replica_enable_replication (r=0x12c8790) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:3612
#17 0x00007f0d1530d87e in multimaster_be_state_change (handle=0x7f0d1530d7cf, be_name=0x7f0cfc00b3b0 "userRoot", old_be_state=2, new_be_state=1) at ../ds/ldap/servers/plugins/replication/repl5_plugins.c:1487
#18 0x00007f0d18a9ffd8 in mtn_be_state_change (be_name=0x7f0cfc00b3b0 "userRoot", old_state=2, new_state=1) at ../ds/ldap/servers/slapd/mapping_tree.c:237
#19 0x00007f0d18aa65a7 in mtn_internal_be_set_state (be=0xfa2310, state=1) at ../ds/ldap/servers/slapd/mapping_tree.c:3584
#20 0x00007f0d18aa6628 in slapi_mtn_be_enable (be=0xfa2310) at ../ds/ldap/servers/slapd/mapping_tree.c:3634
#21 0x00007f0d155b4132 in import_all_done (job=0x7f0c9802a790, ret=0) at ../ds/ldap/servers/slapd/back-ldbm/import.c:1118
#22 0x00007f0d155b4ec4 in import_main_offline (arg=0x7f0c9802a790) at ../ds/ldap/servers/slapd/back-ldbm/import.c:1510
#23 0x00007f0d155b4f19 in import_main (arg=0x7f0c9802a790) at ../ds/ldap/servers/slapd/back-ldbm/import.c:1530
#24 0x0000003a8d029a73 in ?? () from /lib64/libnspr4.so
#25 0x000000378e407851 in start_thread () from /lib64/libpthread.so.0
#26 0x000000378e0e890d in clone () from /lib64/libc.so.6
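For context, the trace shows a single thread acquiring the same replica lock twice: replica_reload_ruv() (frame #15) takes it, then the internal DELETE issued by _delete_tombstone() re-enters the replication plugin, where replica_get_generation() (frame #4) tries to take it again. Since an NSPR PRLock is not re-entrant, the thread blocks on itself. A minimal standalone sketch of that failure mode, assuming only NSPR and using illustrative names (not the actual repl5_replica.c code):

/* Sketch of the self-deadlock: a PRLock is not re-entrant, so the
 * second PR_Lock() on the same thread blocks forever. All names here
 * are hypothetical stand-ins for the functions in the trace above. */
#include <nspr.h>

static PRLock *repl_lock;

static void get_generation(void)   /* stands in for replica_get_generation (frame #4) */
{
    PR_Lock(repl_lock);            /* second acquisition on this thread: hangs */
    /* ... read the data generation ... */
    PR_Unlock(repl_lock);
}

static void reload_ruv(void)       /* stands in for replica_reload_ruv (frame #15) */
{
    PR_Lock(repl_lock);            /* first acquisition */
    get_generation();              /* internal DELETE re-enters the plugin */
    PR_Unlock(repl_lock);          /* never reached */
}

int main(void)
{
    repl_lock = PR_NewLock();
    reload_ruv();                  /* deadlocks here */
    PR_DestroyLock(repl_lock);
    return 0;
}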
If we want to make the lock re-entrant, we could use PRMonitor instead of PRLock... Is it too invasive?
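For reference, a hedged sketch of what the monitor buys, again with illustrative names rather than the patched repl5_replica.c code: a PRMonitor may be re-entered by the thread that already owns it, provided every PR_EnterMonitor() is matched by a PR_ExitMonitor(), so the nested acquisition above completes instead of hanging:

/* Same call shape as the PRLock sketch, but with a re-entrant PRMonitor.
 * Names are hypothetical, not the actual repl5_replica.c code. */
#include <nspr.h>

static PRMonitor *repl_monitor;

static void get_generation(void)
{
    PR_EnterMonitor(repl_monitor);  /* re-entry by the owning thread succeeds */
    /* ... read the data generation ... */
    PR_ExitMonitor(repl_monitor);
}

static void reload_ruv(void)
{
    PR_EnterMonitor(repl_monitor);
    get_generation();               /* no self-deadlock with a monitor */
    PR_ExitMonitor(repl_monitor);
}

int main(void)
{
    repl_monitor = PR_NewMonitor();
    reload_ruv();                   /* completes normally */
    PR_DestroyMonitor(repl_monitor);
    return 0;
}

The trade-off is a slightly heavier primitive (a monitor also carries wait/notify state), but it removes the need to carefully unlock and relock around every internal operation.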
Replying to [comment:5 nhosoi]:
It shouldn't be too bad, as all the changes would be in repl5_replica.c, but it is a corner case. Anyway, I will work on it, as I'm not thrilled about all the locking/unlocking to work around the deadlock.
revision 0001-Ticket-47781-Server-deadlock-if-online-import-starte.patch
New patch attached, thanks for the suggestion of using PRMonitor.
Thanks, Mark! Also, thanks for updating the milestone.
Ticket has been cloned to Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1117021
git merge ticket47781
Updating badd354..0e11f71
Fast-forward
 ldap/servers/plugins/replication/repl5_replica.c | 270 +++++++++++++++++++++++++++---------------------
git push origin master
To ssh://git.fedorahosted.org/git/389/ds.git
   badd354..0e11f71  master -> master
commit 0e11f71
   942e892..5b46542  389-ds-base-1.3.2 -> 389-ds-base-1.3.2 (commit 5b46542)
   54bf43d..9bd67e2  389-ds-base-1.3.1 -> 389-ds-base-1.3.1 (commit 9bd67e2e869ace5db5e059a8cdcf60350928d06b)
   0ad19ce..d6d3731  389-ds-base-1.2.11 -> 389-ds-base-1.2.11 (commit d6d3731)
CI test 0001-Ticket-47781-Add-CI-test.patch
git merge ticket47781
Updating 0e11f71..2f2d95b
Fast-forward
 dirsrvtests/tickets/ticket47781_test.py | 235 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
git push origin master
   0e11f71..2f2d95b  master -> master
commit 2f2d95b
Just a minor remark: you should use EXPORT_REPL_INFO rather than the literal 'repl-info' in the export task args.
Use predefined property names 0001-Ticket-47781-CI-test-use-predefined-property-name-va.patch
Replying to [comment:15 tbordaz]:
I used EXPORT_REPL_INFO and TASK_WAIT where applicable. New patch attached; can you please review it, Thierry?
   2f2d95b..daf4b42  master -> master (commit daf4b42)
Metadata Update from @tbordaz: - Issue assigned to mreynolds - Issue set to the milestone: 1.2.11.30
389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in 389-ds-base's GitHub repository.
This issue has been cloned to GitHub and is available here:
- https://github.com/389ds/389-ds-base/issues/1113
If you want to receive further updates on the issue, please navigate to the GitHub issue and click on the Subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)