#48341 deadlock on connection mutex
Closed: wontfix None Opened 8 years ago by lkrispen.

The slapd process is hanging and the pstack shows:

Thread 7 (Thread 0x7f8ab310c700 (LWP 7761)):
#0  0x00007f8b0050bb7d in poll () from /lib64/libc.so.6
#1  0x00007f8b00e43967 in _pr_poll_with_poll () from /lib64/libnspr4.so
#2  0x00007f8b02f239e8 in connection_read_operation ()
#3  0x00007f8b02f248df in connection_threadmain ()
#4  0x00007f8b00e477bb in _pt_root () from /lib64/libnspr4.so
#5  0x00007f8b007e8df5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f8b005161ad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f8b02eee840 (LWP 7715)):
#0  0x00007f8b007eef7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8b007ead68 in _L_lock_975 () from /lib64/libpthread.so.0
#2  0x00007f8b007ead11 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f8b00e41cb9 in PR_Lock () from /lib64/libnspr4.so
#4  0x00007f8b02f291e5 in slapd_daemon ()
#5  0x00007f8b02f1c17c in main ()

thread 7 holds a c_mutex and the main thread is blocked when iterating the conn table.
The mutex lock in connection read operation was introduced with the fix for #47320


I'm interested in Thread 7. I think PR_Poll called from connection_read_operation does not wait and times out in 1 sec (1000 / milliseconds /). If it is a new operation, it bails there. If it is not new, it usually waits until the ioblock timeout expires.

Are there any client side issue that cannot continue sending the requests?

Also, if the ioblock timeout is set to some small value like 5 seconds, the connection in Thread 7 fails and the main thread comes back to normal? That is, strictly speaking, it is not a typical deadlock?

Another question is #47320 put PR_Lock(conn->c_mutex) covering almost the entire connection_read_operation. Could it be unlocked when PR_Poll gets timed out and relock even for this short time? It might help the main thread to scan the connection table?
{{{
--- a/ldap/servers/slapd/connection.c
+++ b/ldap/servers/slapd/connection.c
@@ -1223,9 +1223,11 @@ int connection_read_operation(Connection conn, Operation op, ber_tag_t tag, i
} else {
/
The turbo mode may cause threads starvation.
Do a yield here to reduce the starving.
*/
+ PR_Unlock(conn->c_mutex);
PR_Sleep(PR_INTERVAL_NO_WAIT);
+ PR_Lock(conn->c_mutex);
continue;
}}}

Replying to [comment:4 nhosoi]:

I'm interested in Thread 7. I think PR_Poll called from connection_read_operation does not wait and times out in 1 sec (1000 / milliseconds /). If it is a new operation, it bails there. If it is not new, it usually waits until the ioblock timeout expires.

yes, but even if ioblocktimeout is set to a few seconds that can be too much for otehr connections

Are there any client side issue that cannot continue sending the requests?

If the main thread is blocked no new connections can be accepted

Also, if the ioblock timeout is set to some small value like 5 seconds, the connection in Thread 7 fails and the main thread comes back to normal? That is, strictly speaking, it is not a typical deadlock?

well, if ioblocktimeout is set it is not forever, but can still be too long
Another question is #47320 put PR_Lock(conn->c_mutex) covering almost the entire connection_read_operation. Could it be unlocked when PR_Poll gets timed out and relock even for this short time? It might help the main thread to scan the connection table?

I think unlocking while sleep is possible, but no solution.
Do we really need the lock while callin PR_Poll ?

The other solution would be not to wait for the lock in the main thread. If it continuously iterates over the connection table and cannot get a lock it should just continue, get it next time.
But I don't see that this can be done with NSPR. we need something like pthread_mutex_trylock()

A new occurrence during 'ipa-server-uninstall' with 389-ds-base-1.3.4.5-1.fc23.x86_64 During an established SSL connection, DS is redoing a handcheck {{{ [07/Dec/2015:20:36:12 +0100] conn=10 fd=70 slot=70 SSL connection from 10.34.47.242 to 10.34.47.242 [07/Dec/2015:20:36:12 +0100] conn=10 TLS1.2 128-bit AES; client CN=CA Subsystem,O=IDM.LAB.ENG.BRQ.REDHAT.COM; issuer CN=Certificate Authority,O=IDM.LAB.ENG.BRQ.REDHAT.COM [07/Dec/2015:20:36:12 +0100] conn=10 TLS1.2 client bound as uid=pkidbuser,ou=people,o=ipaca ... [07/Dec/2015:20:43:07 +0100] conn=10 op=42 MOD dn="cn=caIPAserviceCert_mod,ou=certificateProfiles,ou=ca,o=ipaca" [07/Dec/2015:20:43:07 +0100] conn=10 op=42 RESULT err=0 tag=103 nentries=0 etime=1 [Switching to thread 27 (Thread 0x7f1545ffb700 (LWP 21198))] #0 0x00007f156bc37ffd in poll () at ../sysdeps/unix/syscall-template.S:84 84 T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS) (gdb) where #0 0x00007f156bc37ffd in poll () at ../sysdeps/unix/syscall-template.S:84 #1 0x00007f156c5671db in poll (__timeout=5000, __nfds=1, __fds=0x7f1545ffa7c0) at /usr/include/bits/poll2.h:46 #2 pt_poll_now (op=op@entry=0x7f1545ffa820) at ../../../nspr/pr/src/pthreads/ptio.c:583 #3 0x00007f156c567eb6 in pt_Continue (op=0x7f1545ffa820) at ../../../nspr/pr/src/pthreads/ptio.c:706 #4 pt_Recv (fd=0x5597243dc6b0, buf=0x5597243d0668, amount=<optimized out>, flags=<optimized out>, timeout=4294967295) at ../../../nspr/pr/src/pthreads/ptio.c:1865 #5 0x00007f156d0d4bbe in ssl_DefRecv (ss=ss@entry=0x5597243d0260, buf=<optimized out>, len=5, flags=flags@entry=0) at ssldef.c:61 #6 0x00007f156d0cfb6f in ssl3_GatherData (flags=0, gs=0x5597243d0618, ss=0x5597243d0260) at ssl3gthr.c:58 #7 ssl3_GatherCompleteHandshake (ss=ss@entry=0x5597243d0260, flags=flags@entry=0) at ssl3gthr.c:335 #8 0x00007f156d0d01a4 in ssl3_GatherAppDataRecord (ss=ss@entry=0x5597243d0260, flags=flags@entry=0) at ssl3gthr.c:451 #9 0x00007f156d0daac1 in DoRecv (flags=<optimized out>, len=<optimized out>, out=<optimized out>, ss=<optimized out>) at sslsecur.c:616 #10 ssl_SecureRecv (ss=0x5597243d0260, buf=0x5597243df350 "od\nvisible=false\npaca\npolicyset.serverCertSet.9.default.params.crlDistPointsIssuerType_0=DirectoryName\npolicyset.serverCertSet.9.default.params.crlDistPointsNum=1\npolicyset.serverCertSet.9.default.par"..., len=512, flags=0) at sslsecur.c:1237 #11 0x00007f156d0de6d1 in ssl_Recv (fd=<optimized out>, buf=0x5597243df350, len=512, flags=0, timeout=0) at sslsock.c:2393 #12 0x0000559722548e61 in connection_read_ldap_data (conn=0x7f1552d33280, conn=0x7f1552d33280, err=<synthetic pointer>) at ldap/servers/slapd/connection.c:1087 #13 connection_read_operation (conn=conn@entry=0x7f1552d33280, op=op@entry=0x55972426a8e0, tag=tag@entry=0x7f1545ffaad8, remaining_data=remaining_data@entry=0x7f1545ffaad4) at ldap/servers/slapd/connection.c:1163 #14 0x0000559722549670 in connection_threadmain () at ldap/servers/slapd/connection.c:1584 #15 0x00007f156c56a5cb in _pt_root (arg=0x559723f27ef0) at ../../../nspr/pr/src/pthreads/ptthread.c:212 #16 0x00007f156bf0960a in start_thread (arg=0x7f1545ffb700) at pthread_create.c:334 #17 0x00007f156bc43a9d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109 (gdb) print *conn $1 = {c_sb = 0x559723f1f1d0, c_sd = 70, c_ldapversion = 3, c_dn = 0x7f14f40180c0 "uid=pkidbuser,ou=people,o=ipaca", c_isroot = 0, c_isreplication_session = 0, c_authtype = 0x7f14f401ec30 "SSL", c_external_dn = 0x7f14f40180c0 "uid=pkidbuser,ou=people,o=ipaca", c_external_authtype = 0x5597225629f0 "SSL", cin_addr = 0x5597243dc880, cin_destaddr = 0x5597243dc900, c_domain = 0x0, c_ops = 0x559723f34810, c_gettingber = 1, c_currentber = 0x0, c_starttime = 1449516972, c_connid = 10, c_maxthreadscount = 0, c_maxthreadsblocked = 0, c_opsinitiated = 44, c_opscompleted = 43, c_threadnumber = 2, c_refcnt = 3, c_mutex = 0x5597243dc6f0, c_pdumutex = 0x5597243dc7a0, c_idlesince = 1449517387, c_idletimeout = 0, c_idletimeout_handle = 8, c_private = 0x5597243df310, c_flags = 1, c_needpw = 0, c_client_cert = 0x559724009580, c_prfd = 0x559724357ef0, c_ci = 70, c_fdi = -1, c_next = 0x7f1552d33118, c_prev = 0x7f1552d333e8, c_bi_backend = 0x0, c_extension = 0x5597243ddfc0, c_sasl_conn = 0x5597243dc980, c_local_ssf = 0, c_sasl_ssf = 0, c_ssl_ssf = 128, c_unix_local = 0, c_local_valid = 0, c_local_uid = 0, c_local_gid = 0, c_pagedresults = {prl_maxlen = 0, prl_count = 0, prl_list = 0x0}, c_push_io_layer_cb = 0x0, c_pop_io_layer_cb = 0x0, c_io_layer_cb_data = 0x0, c_ct = 0x559723f1cc80, c_tp = 0x0, c_ns_close_jobs = 0} }}} For some reason the client (ipa-server-uninstall) does not respond. And the listener thread trying to build the connection list to poll is hanging on the connection waiting for a handcheck. {{{ (gdb) thread 1 (gdb) where #0 __lll_lock_wait () at ../sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1 0x00007f156bf0bae6 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x5597243dc6f0) at ../nptl/pthread_mutex_lock.c:136 #2 0x00007f156c564949 in PR_Lock (lock=0x5597243dc6f0) at ../../../nspr/pr/src/pthreads/ptsynch.c:177 #3 0x000055972254f275 in handle_pr_read_ready (num_poll=<optimized out>, ct=<optimized out>) at ldap/servers/slapd/daemon.c:1743 #4 slapd_daemon (ports=0x7ffef9905e80) at ldap/servers/slapd/daemon.c:1267 #5 0x0000559722540bf3 in main (argc=7, argv=0x7ffef99064a8) at ldap/servers/slapd/main.c:1117 (gdb) frame 3 #3 0x000055972254f275 in handle_pr_read_ready (num_poll=<optimized out>, ct=<optimized out>) at ldap/servers/slapd/daemon.c:1743 1743 PR_Lock( c->c_mutex ); (gdb) print *c $4 = {c_sb = 0x559723f1f1d0, c_sd = 70, c_ldapversion = 3, c_dn = 0x7f14f40180c0 "uid=pkidbuser,ou=people,o=ipaca", c_isroot = 0, c_isreplication_session = 0, c_authtype = 0x7f14f401ec30 "SSL", c_external_dn = 0x7f14f40180c0 "uid=pkidbuser,ou=people,o=ipaca", c_external_authtype = 0x5597225629f0 "SSL", cin_addr = 0x5597243dc880, cin_destaddr = 0x5597243dc900, c_domain = 0x0, c_ops = 0x559723f34810, c_gettingber = 1, c_currentber = 0x0, c_starttime = 1449516972, c_connid = 10, .... (gdb) print *c->c_mutex $5 = {mutex = {__data = {__lock = 2, __count = 0, __owner = 21198, __nusers = 1, __kind = 3, __spins = 52, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\316R\000\000\001\000\000\000\003\000\000\000\064", '\000' <repeats 18 times>, __align = 2}, notified = {length = 0, cv = {{cv = 0x0, times = 0}, {cv = 0x0, times = 0}, {cv = 0x0, times = 0}, {cv = 0x0, times = 0}, {cv = 0x0, times = 0}, {cv = 0x0, times = 0}}, link = 0x0}, locked = 1, owner = 139729345427200} (gdb) thread 27 [Switching to thread 27 (Thread 0x7f1545ffb700 (LWP 21198))] }}}

In thread 27 - what is the poll() timeout? Can we decrease the poll() timeout to be very small? If the connection is not available immediately, the poll should just return EWOULDBLOCK and connection_read_operation should retry the read.

For info: tests with idletimeout of 30 seconds is not helping at all (in the context of ipa).

Replying to [comment:10 gparente]:

For info: tests with idletimeout of 30 seconds is not helping at all (in the context of ipa).

I would think ioblocktimeout is what you are should be setting, not idletimeout.

Thanks Mark. That had been done longer ago without any improvement. ioblocktimeout was set to 20000

In the hung state the connection is blocked by the active thread and the main thread is waiting. For the connection c_gettingber == 1, so the main thread would do nothing if we would shortly release the lock in the working thread. The code is:

{{{
if ( c->c_mutex != NULL )
{
PR_Lock( c->c_mutex );
if ( connection_is_active_nolock (c) && c->c_gettingber == 0 )
{
.......
}
PR_Unlock( c->c_mutex );
}}}

the access to c_gettingber should be atomic, couldn't we try a simple fix like:

{{{
if ( c->c_mutex != NULL )
{
+ if (c->c_gettingber) continue;
+
PR_Lock( c->mutex);
if ( connection_is_active_nolock (c) && c->c_gettingber == 0 )
{
.......
}
PR_Unlock( c->c_mutex );
}}}

My understanding is that gettingber==1 should prevent the deamon to poll on it (https://git.fedorahosted.org/cgit/389/ds.git/tree/ldap/servers/slapd/daemon.c#n1643).

So I agree your fix will prevent the deadlock but what I miss is why we have been polling this connection.
May be there is special condition while handling handcheck where gettingber is dropped while a thread is keeping holding the connection lock.

Replying to [comment:15 tbordaz]:

So I agree your fix will prevent the deadlock but what I miss is why we have been polling this connection.
May be there is special condition while handling handcheck where gettingber is dropped while a thread is keeping holding the connection lock.

we don't. The deadlock and fix is in handle_pr_read_ready, where we loop over ALL active connections and eg check if idle timeout is exceeded. In the current implementation we always take the lock and if c_gettingber is set do nothing, but just release the lock and move on to the next connection.
I think the check of c_gettingber before taking the lock is safe and avoids taking the lock

well, my last comment was not completeƶy correct, we check if tehre is connection activity, but only if C_gettingber == 0 - and this is not the case

committed to master:
commit d73b573

committed to 1.3.4:
commit a1635fc

Metadata Update from @tbordaz:
- Issue assigned to lkrispen
- Issue set to the milestone: 1.3.4.7

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1672

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

3 years ago

Login to comment on this ticket.

Metadata