#48166 libdb deadlock while performing modrdns
Opened 4 years ago by mreynolds. Modified 2 years ago

While running a modrdn stress test, it appears that there is a deadlock between the modrdn operation and the checkpointing thread:

#0  0x00007f92ddb7eca0 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007f92d687a1c3 in __db_hybrid_mutex_suspend () from /lib64/libdb-5.3.so
#2  0x00007f92d68795a8 in __db_tas_mutex_lock () from /lib64/libdb-5.3.so
#3  0x00007f92d69946fb in __memp_fget () from /lib64/libdb-5.3.so
#4  0x00007f92d68970b1 in __bam_search () from /lib64/libdb-5.3.so
#5  0x00007f92d6882126 in __bamc_search () from /lib64/libdb-5.3.so
#6  0x00007f92d68865a4 in __bamc_put () from /lib64/libdb-5.3.so
#7  0x00007f92d693db85 in __dbc_iput () from /lib64/libdb-5.3.so
#8  0x00007f92d6938d1e in __db_put () from /lib64/libdb-5.3.so
#9  0x00007f92d694e5d4 in __db_put_pp () from /lib64/libdb-5.3.so
#10 0x00007f92d46de02f in idl_new_insert_key (be=<optimized out>, db=<optimized out>, key=<optimized out>, id=4, txn=<optimized out>,
    a=<optimized out>, disposition=0x0) at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/idl_new.c:830
#11 0x00007f92d46dcec5 in idl_insert_key (be=be@entry=0xd2e9d0, db=db@entry=0x7f92340029f0, key=key@entry=0x7f92b2fefc10, id=id@entry=4,
    txn=txn@entry=0x7f9268004e40, a=a@entry=0xe09f70, disposition=disposition@entry=0x0)
    at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/idl_shim.c:144
#12 0x00007f92d46ecbe8 in addordel_values_sv (be=be@entry=0xd2e9d0, db=0x7f92340029f0, indextype=<optimized out>, vals=<optimized out>,
    id=id@entry=4, flags=flags@entry=1, txn=txn@entry=0x7f92b2ff2550, a=0xe09f70, idl_disposition=idl_disposition@entry=0x0,
    buffer_handle=buffer_handle@entry=0x0, type=<optimized out>)
    at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/index.c:1962
#13 0x00007f92d46ed642 in index_addordel_values_ext_sv (be=be@entry=0xd2e9d0, type=<optimized out>, vals=0x7f9268008a10,
    evals=evals@entry=0x0, id=id@entry=4, flags=flags@entry=1, txn=txn@entry=0x7f92b2ff2550, idl_disposition=idl_disposition@entry=0x0,
    buffer_handle=buffer_handle@entry=0x0) at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/index.c:2156
#14 0x00007f92d46edbe4 in index_addordel_values_sv (be=be@entry=0xd2e9d0, type=<optimized out>, vals=<optimized out>,
    evals=evals@entry=0x0, id=id@entry=4, flags=flags@entry=1, txn=txn@entry=0x7f92b2ff2550)
    at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/index.c:2021
#15 0x00007f92d46ee3ab in index_add_mods (be=0xd2e9d0, mods=<optimized out>, olde=olde@entry=0x7f9294009a50,
    newe=newe@entry=0x7f926802e960, txn=txn@entry=0x7f92b2ff2550)
    at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/index.c:674
#16 0x00007f92d470bde0 in modrdn_rename_entry_update_indexes (ptxn=ptxn@entry=0x7f92b2ff2550, pb=pb@entry=0x7f92b2ff4ae0, e=0x7f9294009a50,
    ec=ec@entry=0x7f92b2ff2578, smods1=smods1@entry=0x7f92b2ff2670, smods2=smods2@entry=0x7f92b2ff2690, smods3=smods3@entry=0x7f92b2ff2650,
    li=<optimized out>) at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/ldbm_modrdn.c:1901
#17 0x00007f92d470c76b in ldbm_back_modrdn (pb=<optimized out>)
    at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/ldbm_modrdn.c:1053
#18 0x00007f92e045d047 in op_shared_rename (pb=pb@entry=0x7f92b2ff4ae0, passin_args=0)
    at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/modrdn.c:652
#19 0x00007f92e045d885 in do_modrdn (pb=pb@entry=0x7f92b2ff4ae0) at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/modrdn.c:256
#20 0x00000000004183b9 in connection_dispatch_operation (pb=0x7f92b2ff4ae0, op=0xf4b530, conn=0x7f92e0800aa0)
    at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/connection.c:655
#21 connection_threadmain () at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/connection.c:2534

The checkpointing thread appears to own the db lock that the modrdn thread is waiting on.

This is the modrdn thread:

(gdb) thread 13
[Switching to thread 13 (Thread 0x7f4e697f2700 (LWP 4177))]
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
185     62:     movl    (%rsp), %edi
(gdb) up
#1  0x00007f4e8e0e1570 in __db_pthread_mutex_condwait (env=0x2616fa0, mutex=1965, mutexp=0x7f4e888a6090, timespec=0x0) at ../src/mutex/mut_pthread.c:321
321                     RET_SET((pthread_cond_wait(&mutexp->u.m.cond,
(gdb) p *mutexp
$1 = {u = {m = {mutex = {__data = {__lock = 0, __count = 0, __owner = 0, __nusers = 1, __kind = 128, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}},
        __size = '\000' <repeats 12 times>, "\001\000\000\000\200", '\000' <repeats 22 times>, __align = 0}, cond = {__data = {__lock = 0, __futex = 1, __total_seq = 1, __wakeup_seq = 0, __woken_seq = 0,
          __mutex = 0xffffffffffffffff, __nwaiters = 2, __broadcast_seq = 0},
        __size = "\000\000\000\000\001\000\000\000\001", '\000' <repeats 23 times>, "\377\377\377\377\377\377\377\377\002\000\000\000\000\000\000", __align = 4294967296}}, rwlock = {__data = {__lock = 0,
        __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 1, __nr_readers_queued = 128, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 4294967296, __flags = 1},
      __size = '\000' <repeats 12 times>, "\001\000\000\000\200", '\000' <repeats 27 times>, "\001\000\000\000\001\000\000\000\000\000\000", __align = 0}}, tas = 0 '\000', sharecount = {value = 1}, wait = 1,
  pid = 4145, tid = 139975259846400, mutex_next_link = 0, alloc_id = 15, mutex_set_wait = 0, mutex_set_nowait = 17, mutex_set_rd_wait = 0, mutex_set_rd_nowait = 48, hybrid_wait = 0, hybrid_wakeup = 0,
  flags = 49}

Not sure what all the fields mean, but tid = 139975259846400 == 0x7f4e87a3f700, and this is the checkpoint thread:

Thread 41 (Thread 0x7f4e87a3f700 (LWP 4149)):
#0  0x00007f4e9519e5f3 in select () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f4e8e29fd19 in __os_sleep (env=0x2616fa0, secs=1, usecs=0) at ../src/os/os_yield.c:90
#2  0x00007f4e8e29fcb9 in __os_yield (env=0x2616fa0, secs=1, usecs=0) at ../src/os/os_yield.c:48
#3  0x00007f4e8e299d8e in __memp_sync_int (env=0x2616fa0, dbmfp=0x0, trickle_max=0, flags=4, wrote_totalp=0x0, interruptedp=0x0) at ../src/mp/mp_sync.c:483
#4  0x00007f4e8e2b1a7a in __txn_checkpoint (env=0x2616fa0, kbytes=0, minutes=0, flags=0) at ../src/txn/txn_chkpt.c:242
#5  0x00007f4e8e2b14fd in __txn_checkpoint_pp (dbenv=0x26ecf90, kbytes=0, minutes=0, flags=0) at ../src/txn/txn_chkpt.c:81
#6  0x00007f4e8bf3aaa7 in checkpoint_threadmain (param=<optimized out>) at /home/mareynol/workspaces/389-ds-base/ds/ldap/servers/slapd/back-ldbm/dblayer.c:4769
#7  0x00007f4e95ad7c2b in _pt_root (arg=0x26f65f0) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#8  0x00007f4e95477ee5 in start_thread (arg=0x7f4e87a3f700) at pthread_create.c:309
#9  0x00007f4e951a6d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
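The tid in the mutex dump is printed in decimal, while gdb's thread list shows thread ids in hex; a quick conversion confirms the lock holder really is the checkpoint thread:

```python
# The mutexp dump above reports the holder as tid = 139975259846400;
# gdb's thread list shows the checkpoint thread as 0x7f4e87a3f700.
tid = 139975259846400
print(hex(tid))  # 0x7f4e87a3f700 -- matches Thread 41 above
```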

Reproducer script (needs to be run from ds/dirsrvtests/):
reliab7.5_test.py

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.3.6 backlog

2 years ago

Can we still reproduce this issue @mreynolds ?

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to new
- Issue close_status updated to: None

2 years ago

Can we still reproduce this issue @mreynolds ?

Yes, this is still a problem, but somewhat of a corner case...

Edit dirsrvtests/tests/stress/reliability/reliab_7_5_test.py and change "rename_s" to "rename", then run it; the server will get deadlocked in fewer than 10 passes.

It does not hang if we use "rename_s", as it waits for the result (removing the timing condition).
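To illustrate the timing condition (a plain-Python stand-in, not python-ldap or any 389-ds code; `modrdn` is a hypothetical placeholder for the server-side work): rename_s-style calls wait for each result before issuing the next operation, while rename-style calls let requests overlap in time, which is what opens the race window with the checkpoint thread:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def modrdn(dn, newrdn):
    """Stand-in for a server-side modrdn; the real work would hit libdb."""
    time.sleep(0.01)
    return 0  # LDAP_SUCCESS

with ThreadPoolExecutor(max_workers=4) as pool:
    # rename_s-style: wait for each result, so operations from this
    # client never overlap in time.
    for i in range(5):
        assert pool.submit(modrdn, f"uid=u{i}", f"uid=u{i}new").result() == 0

    # rename-style: fire off all requests first, collect results later;
    # the operations overlap, which is the condition that exposes the
    # deadlock with the checkpointing thread.
    futures = [pool.submit(modrdn, f"uid=u{i}", f"uid=u{i}new") for i in range(5)]
    assert all(f.result() == 0 for f in futures)
```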

This issue seems to be coming up even in production.
We are tracking a very similar issue downstream for libdb on RHEL-based systems here:
https://bugzilla.redhat.com/show_bug.cgi?id=1349779
I am, however, having a hard time reproducing this by running the test case, as it generally takes a lot of passes for the deadlock to show up on my box (definitely not fewer than 10).

Metadata Update from @mreynolds:
- Custom field reviewstatus reset (from new)
- Issue set to the milestone: 1.3.7 backlog (was: 1.3.6 backlog)

2 years ago

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4 backlog (was: 1.3.7 backlog)

2 years ago
