#48882 server can hang in connection list processing
Closed: Fixed None Opened 3 years ago by lkrispen.

customer experienced a hang with the following pattern:

one thread in poll holding the connection mutex

Thread 5 (Thread 0x7f17417ea700 (LWP 28874)):
#0  0x00007f1791afec3d in poll () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f1792436967 in poll (__timeout=1000, __nfds=1, __fds=0x7f17417e9730) at /usr/include/bits/poll2.h:46
#2  _pr_poll_with_poll (pds=0x7f17417e99c0, npds=1, timeout=<optimized out>) at ../../../nspr/pr/src/pthreads/ptio.c:3922
#3  0x00007f1794729105 in connection_read_operation ()
#4  0x00007f179472976e in connection_threadmain ()
#5  0x00007f179243a7bb in _pt_root (arg=0x7f179512b170) at ../../../nspr/pr/src/pthreads/ptthread.c:212
#6  0x00007f1791ddbdc5 in start_thread (arg=0x7f17417ea700) at pthread_create.c:308
#7  0x00007f1791b0928d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

the main thread is blocke on this connection mutex.

Thread 1 (Thread 0x7f17946ec840 (LWP 28828)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f17924352c3 in PR_EnterMonitor (mon=0x7f17958b2c80) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
#2  0x00007f179472b5ed in setup_pr_read_pds ()
#3  0x00007f179472f5a3 in slapd_daemon ()
#4  0x00007f1794720c6c in main ()

Steps to reproduce:

Use a client which does not support ldap and connect to the secure port:

openssl s_client -connect localhost:636

and then type in some text and hit Enter

other clients will be blocked until this connecion is stopped or ioblocktimeout is exceeded


I had suggested this fix:

{{{
diff --git a/ldap/servers/slapd/daemon.c b/ldap/servers/slapd/daemon.c

index 90d1f7d..7ede31b 100644
--- a/ldap/servers/slapd/daemon.c
+++ b/ldap/servers/slapd/daemon.c
@@ -1543,7 +1543,7 @@ setup_pr_read_pds(Connection_Table *ct, PRFileDesc
n_tcps, PRFileDesc s_tcps
{
connection_table_move_connection_out_of_active_list(ct,c);
}
- else
+ else if (c->c_gettingber == 0)
{
PR_EnterMonitor(c->c_mutex);
if (c->c_flags & CONN_FLAG_CLOSING
}}}

it avoids the hang, but I think we skip some code, which should not be skipped.

When holding the mutex in the current code, we do a couple of checks and settings, most of them are only done if c_gettingber == 0, so taking the lock only in this case will suffice.

But in case c_gettingber == 1 we also do block the socket from polling:

{{{
c->c_fdi = SLAPD_INVALID_SOCKET_INDEX;
}}}

This should not be skipped, I'm wondering if we can do one of the following:

1] rely on changing the value to be atomic, do the assignment outside the lock, eg

{{{
if (c_gettingber == 0) {
PR_EnterMonitor(c_mutex)
....
PR_ExitMonitor
} else {
c->c_fdi = SLAPD_INVALID_SOCKET_INDEX;
}
}}}

2] since c_gettingber set to 1 triggers the change to an invalid socket index, why not do it at the time c_gettingber is set ?

You do not need to use s_client to reproduce this, and you do not need to use the SSL port. You can easily reproduce this with nc:
{{{

{ echo "" ; sleep 10 ; } | nc localhost 389

}}}
You just need something that sends at least 1 byte of data to the server (in the above case, just a newline), and holds open the connection so that the server will poll for more data.

Yeah, it is going to be tricky to use the same c_gettingber hack in setup_pr_read_pds. If c_gettingber == 1, or c->c_threadnumber >= max_threads_per_conn, setup_pr_read_pds sets c->c_fdi = SLAPD_INVALID_SOCKET_INDEX, which is used in handle_pr_read_ready to get the poll out_flags for the fd. In a worst case scenario, setup_pr_read_pds sees that c_gettingber == 1 and skips this setting. Then, c_gettingber is set to 0. Then, in handle_pr_read_ready, c_gettingber == 0, but c->c_fdi != SLAPD_INVALID_SOCKET_INDEX, so handle_pr_read_ready checks the out_flags, and crashes and other race conditions can happen.

Replying to [comment:6 rmeggins]:

Yeah, it is going to be tricky to use the same c_gettingber hack in setup_pr_read_pds. If c_gettingber == 1, or c->c_threadnumber >= max_threads_per_conn, setup_pr_read_pds sets c->c_fdi = SLAPD_INVALID_SOCKET_INDEX, which is used in handle_pr_read_ready to get the poll out_flags for the fd. In a worst case scenario, setup_pr_read_pds sees that c_gettingber == 1 and skips this setting. Then, c_gettingber is set to 0. Then, in handle_pr_read_ready, c_gettingber == 0, but c->c_fdi != SLAPD_INVALID_SOCKET_INDEX, so handle_pr_read_ready checks the out_flags, and crashes and other race conditions can happen.

but why couldn't we set c->c_fdi = SLAPD_INVALID_SOCKET_INDEX at the same tome we set c_gettingber = 1 ?

but why couldn't we set c->c_fdi = SLAPD_INVALID_SOCKET_INDEX at the same tome we set c_gettingber = 1 ?

Because we test for {{{c->c_fdi == SLAPD_INVALID_SOCKET_INDEX}}} here in {{{handle_pr_read_ready()}}}:
{{{
if (c->c_fdi != SLAPD_INVALID_SOCKET_INDEX)
}}}
but we don't do {{{c_gettingber = 1}}} until later, in the call to {{{connection_activity()}}}
{{{
if ((connection_activity( c, maxthreads )) == -1) {
}}}
{{{
connection_activity(Connection conn, int maxthreads)
...
/
set these here so setup_pr_read_pds will not add this conn back to the poll array */
conn->c_gettingber = 1;
}}}

but we do the check in handle_pr_read_ready only if c_gettingber == 0 ?

I need to check again the timing, flow of activity on a connection and when these fields change

AFAICT, we need to set {{{c->c_fdi = SLAPD_INVALID_SOCKET_INDEX}}} in {{{setup_pr_read_pds}}} so that we can use it to test in {{{handle_pr_read_ready}}}

implemented a new fix based on Noriko's research about the (non public) existence of a PR_TestAndEnterMonitor() function.

The use of this test and enter functionality avoids the hang in the main thread and avoids all the potential side effects discussed with other suggestions

Thank you for porting the API, Ludwig!!

committed to master:

ldap/servers/slapd/daemon.c | 69 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 68 insertions(+), 1 deletion(-)

New commits:
commit 7110db9

f823ea0..19c4638 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 19c4638
Author: Ludwig Krispenz lkrispen@redhat.com
Date: Mon Aug 1 10:47:31 2016 +0200

Ludwig, is it ok to close this ticket as fixed?

Thanks!

Metadata Update from @nhosoi:
- Issue assigned to lkrispen
- Issue set to the milestone: 1.3.6 backlog

2 years ago

Login to comment on this ticket.

Metadata