#47976 deadlock in mep delete post op
Closed: wontfix None Opened 6 years ago by tbordaz.

A deadlock occurs in a single thread processing a DEL operation. The deadlock is dynamic and there is no clear testcase.

The deadlock occurs in the database lock while the DEL operation under transaction holds a db_lock in write, then issue an internal DEL without the parent txn. The internal DEL needs the db_lock in read, but as it is not using the same transaction the lock is not granted

Thread 6 (Thread 0x7f52c1feb700 (LWP 17418)):
#0  0x00007f52f2c0c705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f52ed55d2f3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
#2  0x00007f52ed55c640 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
#3  0x00007f52ed606cea in __lock_get_internal () from /lib64/libdb-5.3.so
#4  0x00007f52ed6077d0 in __lock_get () from /lib64/libdb-5.3.so
#5  0x00007f52ed633112 in __db_lget () from /lib64/libdb-5.3.so
#6  0x00007f52ed57a5f5 in __bam_search () from /lib64/libdb-5.3.so
#7  0x00007f52ed565256 in __bamc_search () from /lib64/libdb-5.3.so
#8  0x00007f52ed566d0f in __bamc_get () from /lib64/libdb-5.3.so
#9  0x00007f52ed61fc56 in __dbc_iget () from /lib64/libdb-5.3.so
#10 0x00007f52ed62c843 in __db_get () from /lib64/libdb-5.3.so
#11 0x00007f52ed630123 in __db_get_pp () from /lib64/libdb-5.3.so
#12 0x00007f52e97535fb in id2entry (be=0x1cc48b0, id=id@entry=4, txn=txn@entry=0x0, err=err@entry=0x7f52c1fe6638) at ldap/servers/slapd/back-ldbm/id2entry.c:323
#13 0x00007f52e9777c20 in ldbm_back_delete (pb=0x7f5284061cf0) at ldap/servers/slapd/back-ldbm/ldbm_delete.c:463
#14 0x00007f52f4c1d380 in op_shared_delete (pb=pb@entry=0x7f5284061cf0) at ldap/servers/slapd/delete.c:364
#15 0x00007f52f4c1d532 in delete_internal_pb (pb=pb@entry=0x7f5284061cf0) at ldap/servers/slapd/delete.c:242
#16 0x00007f52f4c1d7e3 in slapi_delete_internal_pb (pb=pb@entry=0x7f5284061cf0) at ldap/servers/slapd/delete.c:185
#17 0x00007f52e907c3cd in mep_del_post_op (pb=<optimized out>) at ldap/servers/plugins/mep/mep.c:2556
#18 0x00007f52f4c638c5 in plugin_call_func (list=0x1bcae30, operation=operation@entry=563, pb=pb@entry=0x3184f90, call_one=call_one@entry=0) at ldap/servers/slapd/plugin.c:1453
#19 0x00007f52f4c63a78 in plugin_call_list (pb=0x3184f90, operation=563, list=<optimized out>) at ldap/servers/slapd/plugin.c:1415
#20 plugin_call_plugins (pb=pb@entry=0x3184f90, whichfunction=whichfunction@entry=563) at ldap/servers/slapd/plugin.c:398
#21 0x00007f52e9776f07 in ldbm_back_delete (pb=0x3184f90) at ldap/servers/slapd/back-ldbm/ldbm_delete.c:1091
#22 0x00007f52f4c1d380 in op_shared_delete (pb=pb@entry=0x3184f90) at ldap/servers/slapd/delete.c:364
#23 0x00007f52f4c1d643 in do_delete (pb=0x3184f90) at ldap/servers/slapd/delete.c:128
#24 0x0000000000417688 in connection_dispatch_operation (pb=<optimized out>, op=0x2dbc8e0, conn=0x7f52e02b5190) at ldap/servers/slapd/connection.c:587
#25 connection_threadmain () at ldap/servers/slapd/connection.c:2372
#26 0x00007f52f32679eb in _pt_root () from /lib64/libnspr4.so
#27 0x00007f52f2c08df3 in start_thread () from /lib64/libpthread.so.0
#28 0x00007f52f293601d in clone () from /lib64/libc.so.6

The delete is done under backend-transcation and owns in write a id2entry page (2):

    800006b9 dd= 8 locks held 185  write locks 85   pid/thread 17379/139993418741504 flags 0    priority 100
    800006b9 READ         30 HELD    ipaca/ancestorid.db       page          1
    ...
    800006b9 WRITE         1 HELD    userRoot/id2entry.db      page          2
    ...

Then the internal delete is hanging waiting for this same page lock to do a read.

       2 dd=103 locks held 0    write locks 0    pid/thread 17379/139993418741504 flags 0    priority 100
       2 READ          1 WAIT    userRoot/id2entry.db      page          2

The test case I did is the following but I was only able to reproduce once:
1 - install ipa server + replica
2 - stop the replica (I reproduced without any replication session)
3 - run the script lot_mods.sh
note: I was testing on 1.3.1 where MOD/REPL kept the delete values in the entry. So the entry was growing. This explain why the test create/delete the test entry. The growth of the entry can help to reproduce. In 1.3.3 and after MOD/REPL do not keep deleted values. The deadlock occured during the 'ipa user-del'


Could this be still an issue on the latest versions? (When the deadlock was observed, what version of 389-ds-base was running?)

If the deadlock still occurs, I'm interested in reproducing this bug with the 389-ds-base only env...

What's the minimum condition to reproduce it?
Replication, relint, memberOf, linkedAttr, USN, DNA, and mep?
If you could attach the configurations for the plug-ins, I'd appreciate it.

Hello Noriko,

The VM where the problem occurred was no longer available.
I prepared the test case on a new VM and attached the plugin config. The DS version is 1.3.2 (while I first hit the problem on 1.3.1). IMHO this is not that important as the DS plugin config depends on IPA setup.

Also reported in freeipa-users https://www.redhat.com/archives/freeipa-users/2015-September/msg00331.html {{{ 80003201 dd= 6 locks held 234 write locks 110 pid/thread 6134/140366703388416 (7FA9AB7E6700)flags 0 priority 100 .. 80003201 WRITE 1 HELD userRoot/id2entry.db page 2 ... 80003201 WRITE 4 HELD userRoot/objectclass.db page 2 80003201 READ 2 HELD userRoot/objectclass.db page 2 Thread 9 (Thread 0x7fa9ab7e6700 (LWP 6173)): #0 0x00007fa9dda41705 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x00007fa9d860e2f3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so #2 0x00007fa9d860d640 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so #3 0x00007fa9d86b7cea in __lock_get_internal () from /lib64/libdb-5.3.so #4 0x00007fa9d86b87d0 in __lock_get () from /lib64/libdb-5.3.so #5 0x00007fa9d86e4112 in __db_lget () from /lib64/libdb-5.3.so #6 0x00007fa9d862b5f5 in __bam_search () from /lib64/libdb-5.3.so #7 0x00007fa9d8616256 in __bamc_search () from /lib64/libdb-5.3.so #8 0x00007fa9d8617d0f in __bamc_get () from /lib64/libdb-5.3.so #9 0x00007fa9d86d0c56 in __dbc_iget () from /lib64/libdb-5.3.so #10 0x00007fa9d86dd843 in __db_get () from /lib64/libdb-5.3.so #11 0x00007fa9d86e1123 in __db_get_pp () from /lib64/libdb-5.3.so #12 0x00007fa9d248949b in id2entry () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #13 0x00007fa9d24af7dd in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #14 0x00007fa9dfc60190 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0 #15 0x00007fa9dfc60342 in delete_internal_pb () from /usr/lib64/dirsrv/libslapd.so.0 #16 0x00007fa9d1da4739 in mep_del_post_op () from /usr/lib64/dirsrv/plugins/libmanagedentries-plugin.so #17 0x00007fa9dfcac280 in plugin_call_func () from /usr/lib64/dirsrv/libslapd.so.0 #18 0x00007fa9dfcac4d8 in plugin_call_plugins () from /usr/lib64/dirsrv/libslapd.so.0 #19 0x00007fa9d24ae42e in ldbm_back_delete () from /usr/lib64/dirsrv/plugins/libback-ldbm.so #20 0x00007fa9dfc60190 in op_shared_delete () from /usr/lib64/dirsrv/libslapd.so.0 #21 0x00007fa9dfc60453 in do_delete () from /usr/lib64/dirsrv/libslapd.so.0 #22 0x00007fa9e017a37e in connection_threadmain () #23 0x00007fa9de09c7bb in _pt_root () from /lib64/libnspr4.so #24 0x00007fa9dda3ddf5 in start_thread () from /lib64/libpthread.so.0 #25 0x00007fa9dd76b1ad in clone () from /lib64/libc.so.6 36 dd=127 locks held 0 write locks 0 pid/thread 6134/140366703388416 (7FA9AB7E6700) flags 0 priority 100 36 READ 1 WAIT userRoot/id2entry.db page 2 }}}

I agree with Ludwig on the email list that you should not have to do anything to mep - the txn should already be set correctly in ldbm_delete.c

It looks like 2f4f74b introduced a problem - every database argument must pass in the transaction handle, but this commit removed it (and mea culpa for not noticing this when I reviewed the patch :P )
{{{
diff --git a/ldap/servers/slapd/back-ldbm/ldbm_delete.c b/ldap/servers/slapd/back-ldbm/ldbm_delete.c
index 61af3a6..3556658 100644
--- a/ldap/servers/slapd/back-ldbm/ldbm_delete.c
+++ b/ldap/servers/slapd/back-ldbm/ldbm_delete.c
...
@@ -449,12 +446,31 @@ ldbm_back_delete( Slapi_PBlock pb )
...
+ * Although a rare case, multiple parents from repl conflict could exist.
+ * In such case, if a parent entry is found just by parentsdn
+ * (find_entry2modify_only_ext), a wrong parent could be found,
+ * and numsubordinate count could get confused.
+
/
+ ID pid = (ID)strtol(pid_str, (char )NULL, 10);
+ parent = id2entry(be, pid ,NULL, &retval);
+ if (parent && cache_lock_entry(&inst->inst_cache, parent)) {
+ /
Failed to obtain parent entry's entry lock
/
+ CACHE_RETURN(&(inst->inst_cache), &parent);
+ goto error_return;
}}}

id2entry here should have used the &txn.

While py.test may order the test cases, it's not a guaranteed ordering behaviour. What you have in test_ticket47976_1 and test_ticket47976_2 might be better suited in test_ticket47976_init, and then that itself should become either a decorator, fixture or called at the start of test_ticket47976_3. You should be able to just decorate like this:

{{{
@pytest.fixture(scope="module")
def ds_topology(request):
...
return TopologyStandalone(standalone)

@pytest.fixture()
def topology(ds_topology):
...
# Do your setup here from _init, _1, _2
...
return ds_topology #This is the ds instance from the former fixture.

Renamed from _3

test_ticket47976(topology):
...

}}}

I hope that helps!

Thanks Ludwig and Rich for the reviews !!

{{{
git merge ticket47976
Updating 0d1ce9d..83c98d9
Fast-forward
ldap/servers/slapd/back-ldbm/ldbm_delete.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

commit 83c98d9
Author: Thierry Bordaz tbordaz@redhat.com
Date: Tue Nov 3 15:59:54 2015 +0100

git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 983 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
0d1ce9d..83c98d9 master -> master
}}}

Replying to [comment:12 firstyear]:

While py.test may order the test cases, it's not a guaranteed ordering behaviour. What you have in test_ticket47976_1 and test_ticket47976_2 might be better suited in test_ticket47976_init, and then that itself should become either a decorator, fixture or called at the start of test_ticket47976_3. You should be able to just decorate like this:

{{{
@pytest.fixture(scope="module")
def ds_topology(request):
...
return TopologyStandalone(standalone)

@pytest.fixture()
def topology(ds_topology):
...
# Do your setup here from _init, _1, _2
...
return ds_topology #This is the ds instance from the former fixture.

Renamed from _3

test_ticket47976(topology):
...

}}}

I hope that helps!

Hi,

Thanks for the review. In fact I was not really confident with the order of execution of the test cases.
This order is important and your suggested fix would work. Now I think that if we rename 'topology' fixture into ds_topology we would need to change the create-test.py script.

I will do your suggested modifications and retest.

'''1.3.4'''
git push origin 389-ds-base-1.3.4
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 983 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
a8d30b3..55434d3 389-ds-base-1.3.4 -> 389-ds-base-1.3.4

'''1.3.3'''
git push origin 389-ds-base-1.3.3
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 985 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
6844592..a4db5ec 389-ds-base-1.3.3 -> 389-ds-base-1.3.3

'''1.2.11'''
git push origin 389-ds-base-1.2.11
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 974 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
bdaed4e..24a38f2 389-ds-base-1.2.11 -> 389-ds-base-1.2.11

Looks the fix is already pushed. Closing this ticket.

Thank you, Mark, for the review via email!

To ssh://git.fedorahosted.org/git/389/ds.git

Pushed to master:
7110db9..a12dc36 master -> master
commit a12dc36
Author: Simon Pichugin spichugi@redhat.com
Date: Tue Aug 2 17:29:00 2016 +0200

Metadata Update from @tbordaz:
- Issue assigned to tbordaz
- Issue set to the milestone: 1.3.5.4

4 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1307

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

8 months ago

Login to comment on this ticket.

Metadata