#47781 Server deadlock if online import started while server is under load
Closed: wontfix. Opened 10 years ago by mreynolds.

If a server in an MMR environment is under load (doing adds and deletes) and you try to initialize the database (ldif2db.pl), you can deadlock the server:

#0  0x000000378e40e054 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x000000378e4093be in _L_lock_995 () from /lib64/libpthread.so.0
#2  0x000000378e409326 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000003a8d023fe9 in PR_Lock () from /lib64/libnspr4.so
#4  0x00007f0d153113a8 in replica_get_generation (r=0x12c8790) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:957
#5  0x00007f0d1530c84d in copy_operation_parameters (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/plugins/replication/repl5_plugins.c:923
#6  0x00007f0d1530bab9 in multimaster_preop_delete (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/plugins/replication/repl5_plugins.c:391
#7  0x00007f0d18abddb9 in plugin_call_func (list=0xf66860, operation=423, pb=0x7f0cfc0192c0, call_one=0) at ../ds/ldap/servers/slapd/plugin.c:1453
#8  0x00007f0d18abdc6c in plugin_call_list (list=0xf57ac0, operation=423, pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/plugin.c:1415
#9  0x00007f0d18abc200 in plugin_call_plugins (pb=0x7f0cfc0192c0, whichfunction=423) at ../ds/ldap/servers/slapd/plugin.c:398
#10 0x00007f0d18a67584 in op_shared_delete (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/delete.c:355
#11 0x00007f0d18a670e6 in delete_internal_pb (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/delete.c:242
#12 0x00007f0d18a66f2d in slapi_delete_internal_pb (pb=0x7f0cfc0192c0) at ../ds/ldap/servers/slapd/delete.c:185
#13 0x00007f0d15314b8e in _delete_tombstone (tombstone_dn=0x12acb60 "dc=example,dc=com", uniqueid=0x7f0d15355b10 "ffffffff-ffffffff-ffffffff-ffffffff", ext_op_flags=131072)
    at ../ds/ldap/servers/plugins/replication/repl5_replica.c:2723
#14 0x00007f0d15313d65 in _replica_configure_ruv (r=0x12c8790, isLocked=1) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:2225

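replica_reload_ruv() (frame #15, below) takes the repl lock, and frame #3 (PR_Lock, called from replica_get_generation) then tries to take it again on the same thread, so the thread blocks on itself.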

#15 0x00007f0d15311efe in replica_reload_ruv (r=0x12c8790) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:1318
#16 0x00007f0d153169ce in replica_enable_replication (r=0x12c8790) at ../ds/ldap/servers/plugins/replication/repl5_replica.c:3612
#17 0x00007f0d1530d87e in multimaster_be_state_change (handle=0x7f0d1530d7cf, be_name=0x7f0cfc00b3b0 "userRoot", old_be_state=2, new_be_state=1) at ../ds/ldap/servers/plugins/replication/repl5_plugins.c:1487
#18 0x00007f0d18a9ffd8 in mtn_be_state_change (be_name=0x7f0cfc00b3b0 "userRoot", old_state=2, new_state=1) at ../ds/ldap/servers/slapd/mapping_tree.c:237
#19 0x00007f0d18aa65a7 in mtn_internal_be_set_state (be=0xfa2310, state=1) at ../ds/ldap/servers/slapd/mapping_tree.c:3584
#20 0x00007f0d18aa6628 in slapi_mtn_be_enable (be=0xfa2310) at ../ds/ldap/servers/slapd/mapping_tree.c:3634
#21 0x00007f0d155b4132 in import_all_done (job=0x7f0c9802a790, ret=0) at ../ds/ldap/servers/slapd/back-ldbm/import.c:1118
#22 0x00007f0d155b4ec4 in import_main_offline (arg=0x7f0c9802a790) at ../ds/ldap/servers/slapd/back-ldbm/import.c:1510
#23 0x00007f0d155b4f19 in import_main (arg=0x7f0c9802a790) at ../ds/ldap/servers/slapd/back-ldbm/import.c:1530
#24 0x0000003a8d029a73 in ?? () from /lib64/libnspr4.so
#25 0x000000378e407851 in start_thread () from /lib64/libpthread.so.0
#26 0x000000378e0e890d in clone () from /lib64/libc.so.6
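
To make the failure mode in the trace concrete, here is a minimal sketch in Python rather than C. It is an analogy only: threading.Lock stands in for the non-re-entrant PRLock, and the function names reload_ruv and get_generation are hypothetical stand-ins for the frames above, not the server's code. A timeout is used on the inner acquire so the demo reports the problem instead of hanging.

```python
import threading

# Non-re-entrant lock, standing in for the PRLock that guards the replica.
repl_lock = threading.Lock()


def get_generation():
    # Inner call (compare frame #4): needs the replica lock.
    # The timeout is only here so the demo returns instead of hanging.
    acquired = repl_lock.acquire(timeout=0.2)
    if not acquired:
        return None  # without the timeout this thread would block forever
    try:
        return "generation"
    finally:
        repl_lock.release()


def reload_ruv():
    # Outer call (compare frame #15): holds the lock while calling into
    # code that eventually needs the same lock on the same thread.
    with repl_lock:
        return get_generation()


result = reload_ruv()
print("self-deadlock detected" if result is None else result)
```

Called on its own, get_generation() succeeds; it only fails when the calling thread already holds the lock, which is the situation the import thread ends up in.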

If we want to make the lock re-entrant, we could use PRMonitor instead of PRLock... Is it too invasive?

Replying to [comment:5 nhosoi]:

If we want to make the lock re-entrant, we could use PRMonitor instead of PRLock... Is it too invasive?

It shouldn't be too bad, as all the changes would be in repl5_replica.c, but it is a corner case. Anyway, I will work on it, as I'm not thrilled about all the locking/unlocking needed to work around the deadlock.
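
For readers unfamiliar with the difference, a rough sketch of what a re-entrant lock buys here, again as a Python analogy and not the actual patch: threading.RLock plays the role of PRMonitor, which the same thread may enter more than once as long as each entry is matched by an exit. The Replica class and its methods are hypothetical names for illustration.

```python
import threading


class Replica:
    """Toy replica object guarded by a re-entrant lock (PRMonitor analog)."""

    def __init__(self, generation):
        self._lock = threading.RLock()
        self._generation = generation

    def get_generation(self):
        # Safe whether or not the calling thread already holds the lock.
        with self._lock:
            return self._generation

    def reload_ruv(self):
        # Holds the lock and calls back into code that takes it again.
        # With a re-entrant lock the nested acquire just bumps a counter.
        with self._lock:
            return self.get_generation()


replica = Replica("gen-1")
print(replica.reload_ruv())
```

The trade-off is that callers no longer need to drop and retake the lock around nested calls, at the cost of making it less obvious which code paths run with the lock held.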

Replying to [comment:5 nhosoi]:

If we want to make the lock re-entrant, we could use PRMonitor instead of PRLock... Is it too invasive?

New patch attached, thanks for the suggestion of using PRMonitor.

Thanks, Mark! Also, thanks for updating the milestone.

git merge ticket47781
Updating badd354..0e11f71
ldap/servers/plugins/replication/repl5_replica.c | 270 +++++++++++++++++++++++++++---------------------------

git push origin master
To ssh://git.fedorahosted.org/git/389/ds.git
badd354..0e11f71 master -> master

commit 0e11f71

942e892..5b46542 389-ds-base-1.3.2 -> 389-ds-base-1.3.2
commit 5b46542

54bf43d..9bd67e2 389-ds-base-1.3.1 -> 389-ds-base-1.3.1
commit 9bd67e2e869ace5db5e059a8cdcf60350928d06b

0ad19ce..d6d3731 389-ds-base-1.2.11 -> 389-ds-base-1.2.11
commit d6d3731

git merge ticket47781
Updating 0e11f71..2f2d95b
dirsrvtests/tickets/ticket47781_test.py | 235 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

git push origin master
0e11f71..2f2d95b master -> master

commit 2f2d95b

Just a typo remark, you should use EXPORT_REPL_INFO rather than 'repl-info' in the export task args.

Replying to [comment:15 tbordaz]:

Just a typo remark, you should use EXPORT_REPL_INFO rather than 'repl-info' in the export task args.

I used EXPORT_REPL_INFO and TASK_WAIT where applicable. New patch attached. Can you please review it, Thierry?

Metadata Update from @tbordaz:
- Issue assigned to mreynolds
- Issue set to the milestone:

8 years ago

389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in 389-ds-base's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/389ds/389-ds-base/issues/1113

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click the subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

4 years ago

