The slapd process is hanging and the pstack shows:
{{{
Thread 7 (Thread 0x7f8ab310c700 (LWP 7761)):
#0  0x00007f8b0050bb7d in poll () from /lib64/libc.so.6
#1  0x00007f8b00e43967 in _pr_poll_with_poll () from /lib64/libnspr4.so
#2  0x00007f8b02f239e8 in connection_read_operation ()
#3  0x00007f8b02f248df in connection_threadmain ()
#4  0x00007f8b00e477bb in _pt_root () from /lib64/libnspr4.so
#5  0x00007f8b007e8df5 in start_thread () from /lib64/libpthread.so.0
#6  0x00007f8b005161ad in clone () from /lib64/libc.so.6

Thread 1 (Thread 0x7f8b02eee840 (LWP 7715)):
#0  0x00007f8b007eef7d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f8b007ead68 in _L_lock_975 () from /lib64/libpthread.so.0
#2  0x00007f8b007ead11 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007f8b00e41cb9 in PR_Lock () from /lib64/libnspr4.so
#4  0x00007f8b02f291e5 in slapd_daemon ()
#5  0x00007f8b02f1c17c in main ()
}}}
Thread 7 holds a c_mutex and the main thread is blocked while iterating the connection table. The mutex lock in connection_read_operation was introduced with the fix for #47320.
I'm interested in Thread 7. I think PR_Poll, called from connection_read_operation, does not wait: it times out in 1 sec (1000 /* milliseconds */). If it is a new operation, it bails out there. If it is not new, it usually waits until the ioblock timeout expires.
Is there any client-side issue that prevents it from continuing to send requests?
Also, if the ioblock timeout is set to some small value like 5 seconds, would the connection in Thread 7 fail and the main thread come back to normal? That is, strictly speaking, this is not a typical deadlock?
Another question: #47320 put PR_Lock(conn->c_mutex) around almost the entire connection_read_operation. Could it be unlocked when PR_Poll times out and relocked, even for this short time? It might help the main thread to scan the connection table?
{{{
--- a/ldap/servers/slapd/connection.c
+++ b/ldap/servers/slapd/connection.c
@@ -1223,9 +1223,11 @@ int connection_read_operation(Connection *conn, Operation **op, ber_tag_t *tag, i
 	} else {
 		/* The turbo mode may cause threads starvation.
 		   Do a yield here to reduce the starving. */
+		PR_Unlock(conn->c_mutex);
 		PR_Sleep(PR_INTERVAL_NO_WAIT);
+		PR_Lock(conn->c_mutex);
 		continue;
}}}
Replying to [comment:4 nhosoi]:
Yes, but even if ioblocktimeout is set to a few seconds, that can be too much for other connections.
If the main thread is blocked, no new connections can be accepted.
> Also, if the ioblock timeout is set to some small value like 5 seconds, the connection in Thread 7 fails and the main thread comes back to normal? That is, strictly speaking, it is not a typical deadlock?

well, if ioblocktimeout is set it is not forever, but it can still be too long

> Another question is #47320 put PR_Lock(conn->c_mutex) covering almost the entire connection_read_operation. Could it be unlocked when PR_Poll gets timed out and relock even for this short time? It might help the main thread to scan the connection table?
I think unlocking while sleeping is possible, but it is no solution. Do we really need the lock while calling PR_Poll?
The other solution would be not to wait for the lock in the main thread: if it continuously iterates over the connection table and cannot get a lock, it should just continue and get it on the next pass. But I don't see that this can be done with NSPR; we need something like pthread_mutex_trylock().
In Thread 7 - what is the poll() timeout? Can we decrease the poll() timeout to be very small? If the connection is not available immediately, poll() should just time out (and a non-blocking read return EWOULDBLOCK), and connection_read_operation should retry the read.
For info: tests with an idletimeout of 30 seconds are not helping at all (in the context of ipa).
Replying to [comment:10 gparente]:
I would think ioblocktimeout is what you should be setting, not idletimeout.
Thanks Mark. That had already been done a while ago, without any improvement; ioblocktimeout was set to 20000.
In the hung state the connection is blocked by the active thread and the main thread is waiting. For this connection c_gettingber == 1, so the main thread would do nothing even if we briefly released the lock in the worker thread. The code is:
{{{
if ( c->c_mutex != NULL ) {
    PR_Lock( c->c_mutex );
    if ( connection_is_active_nolock (c) && c->c_gettingber == 0 )
    {
        .......
    }
    PR_Unlock( c->c_mutex );
}
}}}
The access to c_gettingber should be atomic; couldn't we try a simple fix like:
{{{
if ( c->c_mutex != NULL ) {
+   if (c->c_gettingber) continue;
    PR_Lock( c->c_mutex );
    if ( connection_is_active_nolock (c) && c->c_gettingber == 0 )
    {
        .......
    }
    PR_Unlock( c->c_mutex );
}
}}}
My understanding is that gettingber == 1 should prevent the daemon from polling on it (https://git.fedorahosted.org/cgit/389/ds.git/tree/ldap/servers/slapd/daemon.c#n1643).
So I agree your fix will prevent the deadlock, but what I miss is why we have been polling this connection. Maybe there is a special condition while handling the handshake where gettingber is dropped while a thread keeps holding the connection lock.
Replying to [comment:15 tbordaz]:
We don't. The deadlock, and the fix, are in handle_pr_read_ready, where we loop over ALL active connections and e.g. check if the idle timeout is exceeded. In the current implementation we always take the lock; if c_gettingber is set we do nothing, just release the lock and move on to the next connection. I think checking c_gettingber before taking the lock is safe and avoids taking the lock unnecessarily.
Well, my last comment was not completely correct: we do check if there is connection activity, but only if c_gettingber == 0 - and that is not the case here.
attachment 0001-Ticket-48341-deadlock-on-connection-mutex.patch
committed to master: commit d73b573
committed to 1.3.4: commit a1635fc
Metadata Update from @tbordaz: - Issue assigned to lkrispen - Issue set to the milestone: 1.3.4.7
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1672
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)