customer experienced a hang with the following pattern:
one thread in poll holding the connection mutex
Thread 5 (Thread 0x7f17417ea700 (LWP 28874)): #0 0x00007f1791afec3d in poll () at ../sysdeps/unix/syscall-template.S:81 #1 0x00007f1792436967 in poll (__timeout=1000, __nfds=1, __fds=0x7f17417e9730) at /usr/include/bits/poll2.h:46 #2 _pr_poll_with_poll (pds=0x7f17417e99c0, npds=1, timeout=<optimized out>) at ../../../nspr/pr/src/pthreads/ptio.c:3922 #3 0x00007f1794729105 in connection_read_operation () #4 0x00007f179472976e in connection_threadmain () #5 0x00007f179243a7bb in _pt_root (arg=0x7f179512b170) at ../../../nspr/pr/src/pthreads/ptthread.c:212 #6 0x00007f1791ddbdc5 in start_thread (arg=0x7f17417ea700) at pthread_create.c:308 #7 0x00007f1791b0928d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
the main thread is blocke on this connection mutex.
Thread 1 (Thread 0x7f17946ec840 (LWP 28828)): #0 pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185 #1 0x00007f17924352c3 in PR_EnterMonitor (mon=0x7f17958b2c80) at ../../../nspr/pr/src/pthreads/ptsynch.c:592 #2 0x00007f179472b5ed in setup_pr_read_pds () #3 0x00007f179472f5a3 in slapd_daemon () #4 0x00007f1794720c6c in main ()
Steps to reproduce:
Use a client which does not support ldap and connect to the secure port:
openssl s_client -connect localhost:636 and then type in some text and hit Enter
other clients will be blocked until this connecion is stopped or ioblocktimeout is exceeded
I had suggested this fix:
{{{ diff --git a/ldap/servers/slapd/daemon.c b/ldap/servers/slapd/daemon.c
index 90d1f7d..7ede31b 100644 --- a/ldap/servers/slapd/daemon.c +++ b/ldap/servers/slapd/daemon.c @@ -1543,7 +1543,7 @@ setup_pr_read_pds(Connection_Table *ct, PRFileDesc n_tcps, PRFileDesc s_tcps { connection_table_move_connection_out_of_active_list(ct,c); } - else + else if (c->c_gettingber == 0) { PR_EnterMonitor(c->c_mutex); if (c->c_flags & CONN_FLAG_CLOSING }}}
it avoids the hang, but I think we skip some code, which should not be skipped.
When holding the mutex in the current code, we do a couple of checks and settings, most of them are only done if c_gettingber == 0, so taking the lock only in this case will suffice.
But in case c_gettingber == 1 we also do block the socket from polling:
{{{ c->c_fdi = SLAPD_INVALID_SOCKET_INDEX; }}}
This should not be skipped, I'm wondering if we can do one of the following:
1] rely on changing the value to be atomic, do the assignment outside the lock, eg
{{{ if (c_gettingber == 0) { PR_EnterMonitor(c_mutex) .... PR_ExitMonitor } else { c->c_fdi = SLAPD_INVALID_SOCKET_INDEX; } }}}
2] since c_gettingber set to 1 triggers the change to an invalid socket index, why not do it at the time c_gettingber is set ?
You do not need to use s_client to reproduce this, and you do not need to use the SSL port. You can easily reproduce this with nc: {{{
}}} You just need something that sends at least 1 byte of data to the server (in the above case, just a newline), and holds open the connection so that the server will poll for more data.
Yeah, it is going to be tricky to use the same c_gettingber hack in setup_pr_read_pds. If c_gettingber == 1, or c->c_threadnumber >= max_threads_per_conn, setup_pr_read_pds sets c->c_fdi = SLAPD_INVALID_SOCKET_INDEX, which is used in handle_pr_read_ready to get the poll out_flags for the fd. In a worst case scenario, setup_pr_read_pds sees that c_gettingber == 1 and skips this setting. Then, c_gettingber is set to 0. Then, in handle_pr_read_ready, c_gettingber == 0, but c->c_fdi != SLAPD_INVALID_SOCKET_INDEX, so handle_pr_read_ready checks the out_flags, and crashes and other race conditions can happen.
Replying to [comment:6 rmeggins]:
but why couldn't we set c->c_fdi = SLAPD_INVALID_SOCKET_INDEX at the same tome we set c_gettingber = 1 ?
Because we test for {{{c->c_fdi == SLAPD_INVALID_SOCKET_INDEX}}} here in {{{handle_pr_read_ready()}}}: {{{ if (c->c_fdi != SLAPD_INVALID_SOCKET_INDEX) }}} but we don't do {{{c_gettingber = 1}}} until later, in the call to {{{connection_activity()}}} {{{ if ((connection_activity( c, maxthreads )) == -1) { }}} {{{ connection_activity(Connection conn, int maxthreads) ... / set these here so setup_pr_read_pds will not add this conn back to the poll array */ conn->c_gettingber = 1; }}}
but we do the check in handle_pr_read_ready only if c_gettingber == 0 ?
I need to check again the timing, flow of activity on a connection and when these fields change
AFAICT, we need to set {{{c->c_fdi = SLAPD_INVALID_SOCKET_INDEX}}} in {{{setup_pr_read_pds}}} so that we can use it to test in {{{handle_pr_read_ready}}}
attachment 0001-Ticket-48882-server-can-hang-in-connection-list-proc.patch
implemented a new fix based on Noriko's research about the (non public) existence of a PR_TestAndEnterMonitor() function.
The use of this test and enter functionality avoids the hang in the main thread and avoids all the potential side effects discussed with other suggestions
Thank you for porting the API, Ludwig!!
committed to master:
ldap/servers/slapd/daemon.c | 69 +++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 68 insertions(+), 1 deletion(-)
New commits: commit 7110db9
f823ea0..19c4638 389-ds-base-1.3.4 -> 389-ds-base-1.3.4 commit 19c4638 Author: Ludwig Krispenz lkrispen@redhat.com Date: Mon Aug 1 10:47:31 2016 +0200
Ludwig, is it ok to close this ticket as fixed?
Thanks!
Metadata Update from @nhosoi: - Issue assigned to lkrispen - Issue set to the milestone: 1.3.6 backlog
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1941
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)
Login to comment on this ticket.