#49289 At shutdown, DS can abort when a thread fails to queue its connection
Closed: wontfix 7 years ago. Opened 7 years ago by tbordaz.

Issue Description

With the attached test case, there is a fairly frequent abort (with DS compiled with DEBUG).
The abort occurs during shutdown, when a worker thread fails to queue its connection fd on the event framework.

Package Version and Platform

Master

Steps to reproduce

  1. enable core dumps (ulimit -c unlimited)
  2. run the attached testcase

Actual results

An abort in a debug build.
In a normal build: an error message like "ns_connection_post_io_or_closing - post I/O job for conn xx for fd=yy failed to be added to event queue"

Expected results

No error message or abort


During a regular shutdown

(gdb) print slapd_shutdown
$3 = 1  # SLAPI_SHUTDOWN_SIGNAL
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f55cea7f800 (LWP 4893))]
#0  0x00007f55cbd15aed in write () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007f55cbd15aed in write () from /lib64/libpthread.so.0
#1  0x00007f55cc36c11e in pt_Write (fd=0x12648a0, buf=0x7ffefff4c270, amount=178) at ../../../nspr/pr/src/pthreads/ptio.c:1315
#2  0x0000000000423bc3 in slapi_write_buffer (fd=0x12648a0, buf=0x7ffefff4c270, amount=178)
    at $DIR/389-ds-base/ldap/servers/slapd/fileio.c:48
#3  0x00007f55ce1a26cf in vslapd_log_error (fp=0x12648a0, sev_level=25, subsystem=0x4366d2 "op_thread_cleanup", 
    fmt=0x436660 "slapd shutting down - signaling operation threads - op stack size %d max work q size %d max work q stack size %d\n", ap=0x7ffefff4cb38, locked=1) at $DIR/389-ds-base/ldap/servers/slapd/log.c:2400
#4  0x00007f55ce1a204a in slapd_log_error_proc_internal (sev_level=25, subsystem=0x4366d2 "op_thread_cleanup", 
    fmt=0x436660 "slapd shutting down - signaling operation threads - op stack size %d max work q size %d max work q stack size %d\n", ap_err=0x7ffefff4cb20, ap_file=0x7ffefff4cb38)
    at $DIR/389-ds-base/ldap/servers/slapd/log.c:2234
#5  0x00007f55ce1a2961 in slapi_log_error (loglevel=25, subsystem=0x4366d2 "op_thread_cleanup", 
    fmt=0x436660 "slapd shutting down - signaling operation threads - op stack size %d max work q size %d max work q stack size %d\n") at $DIR/389-ds-base/ldap/servers/slapd/log.c:2450
#6  0x0000000000418f9d in op_thread_cleanup ()
    at $DIR/389-ds-base/ldap/servers/slapd/connection.c:1966
#7  0x000000000041cf10 in slapd_daemon (ports=0x7ffefff4ce70, tp=0x14092e0)
    at $DIR/389-ds-base/ldap/servers/slapd/daemon.c:1206
#8  0x0000000000425e45 in main (argc=5, argv=0x7ffefff4cfa8)
    at $DIR/389-ds-base/ldap/servers/slapd/main.c:1181

Nunc-stans fails to queue an fd

(gdb) where
#0  0x00007f55cb730a98 in raise () from /lib64/libc.so.6
#1  0x00007f55cb73269a in abort () from /lib64/libc.so.6
#2  0x00007f55cc3559ec in PR_Assert (s=0x437b81 "job_result == PR_SUCCESS", 
    file=0x436d08 "$DIR/389-ds-base/ldap/servers/slapd/daemon.c", ln=1821)
    at ../../../nspr/pr/src/io/prlog.c:553
#3  0x000000000041e1fc in ns_connection_post_io_or_closing (conn=0x7f55ce9a8a10)
    at $DIR/389-ds-base/ldap/servers/slapd/daemon.c:1821
#4  0x0000000000417f32 in connection_make_readable_nolock (conn=0x7f55ce9a8a10)
    at $DIR/389-ds-base/ldap/servers/slapd/connection.c:1366
#5  0x0000000000418b75 in connection_threadmain ()
    at $DIR/389-ds-base/ldap/servers/slapd/connection.c:1822
#6  0x00007f55cc36e68b in _pt_root (arg=0x15e5000) at ../../../nspr/pr/src/pthreads/ptthread.c:216
#7  0x00007f55cbd0d60a in start_thread () from /lib64/libpthread.so.0
#8  0x00007f55cb7fea4d in clone () from /lib64/libc.so.6
(gdb) frame 3
#3  0x000000000041e1fc in ns_connection_post_io_or_closing (conn=0x7f55ce9a8a10)
    at $DIR/389-ds-base/ldap/servers/slapd/daemon.c:1821
1821            PR_ASSERT(job_result == PR_SUCCESS);
(gdb) print conn->c_flags   # connection not marked as closing
$1 = 0
(gdb) print job_result
$2 = PR_FAILURE
(gdb) list 1805,1826
1805    #ifdef DEBUG
1806            PR_ASSERT(0 == connection_acquire_nolock(conn));
1807    #else
1808            if (connection_acquire_nolock(conn) != 0) { /* event framework now has a reference */
1809                /* 
1810                 * This has already been logged as an error in ./ldap/servers/slapd/connection.c
1811                 * The error occurs when we get a connection in a closing state.
1812                 * For now we return, but there is probably a better way to handle the error case.
1813                 */
1814                return;
1815            }
1816    #endif
1817            PRStatus job_result = ns_add_io_timeout_job(conn->c_tp, conn->c_prfd, &tv,
1818                          NS_JOB_READ|NS_JOB_PRESERVE_FD,
1819                          ns_handle_pr_read_ready, conn, NULL);
1820    #ifdef DEBUG
1821            PR_ASSERT(job_result == PR_SUCCESS);
1822    #endif
1823            if (job_result != PR_SUCCESS) {
1824                slapi_log_err(SLAPI_LOG_WARNING, "ns_connection_post_io_or_closing", "post I/O job for "
1825                    "conn %" PRIu64 " for fd=%d failed to be added to event queue\n", conn->c_connid, conn->c_sd);
1826            } else {

Additional thread that may participate in the abort

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f55b0542700 (LWP 4928))]
#0  0x00007f55cb7ff043 in epoll_wait () from /lib64/libc.so.6
(gdb) where
#0  0x00007f55cb7ff043 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f55cbae24b8 in epoll_dispatch () from /lib64/libevent-2.0.so.5
#2  0x00007f55cbaccadd in event_base_loop () from /lib64/libevent-2.0.so.5
#3  0x00007f55ce68e72f in ns_event_fw_loop (ns_event_fw_ctx=0x1409450)
    at $DIR/389-ds-base/src/nunc-stans/ns/ns_event_fw_event.c:310
#4  0x00007f55ce68c054 in event_loop_thread_func (arg=0x14092e0)
    at $DIR/389-ds-base/src/nunc-stans/ns/ns_thrpool.c:558
#5  0x00007f55cbd0d60a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f55cb7fea4d in clone () from /lib64/libc.so.6

Metadata Update from @tbordaz:
- Custom field type adjusted to defect

7 years ago

Added the testcase. Note the testcase was created (by @lkrispen) for another use-case: it triggers the abort, but needs to be adapted to capture the failure at shutdown.
ticket49289_test.py

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.3.7.0

7 years ago

Metadata Update from @firstyear:
- Issue assigned to firstyear

7 years ago

I'll look into this ASAP.

I can't reproduce this, but I think I know the cause.

https://pagure.io/389-ds-base/issue/48184

Because we don't close connections on shutdown, there is a short window between nunc-stans being told to stop and the event loop actually stopping. In that window, because the connections are left open, it is possible for an IO event to trigger. But nunc-stans is already in the "shutdown" phase, so the job rearm returns a failure: nunc-stans does not accept new work once it is shutting down.

I think this is the scenario you are observing here. I knew it could happen, but I've never been able to actually observe it. As a result, it may become more important to properly clean up our connections on shutdown, since the failure can now be reproduced.
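
To make the window concrete, here is a minimal sketch of the rejection on the nunc-stans side, assuming the job-add path simply refuses work once the thread pool is shutting down (the real internals are more involved, and add_job_sketch is a hypothetical name):

    #include <nspr.h>
    #include <nunc-stans.h>

    /* Hypothetical sketch, not the real nunc-stans internals. At the time
     * of this report the job-add functions returned a bare PRStatus, so a
     * caller could not tell "rejected because shutting down" apart from a
     * genuine failure. */
    static PRStatus
    add_job_sketch(ns_thrpool_t *tp)
    {
        if (ns_thrpool_is_shutdown(tp)) {
            /* The event loop (thread 2 above) may still be sitting in
             * epoll_wait(), so IO events can still fire on the open
             * connections, but no new jobs are accepted from here on. */
            return PR_FAILURE;
        }
        /* ... allocate the job and arm it on the event loop ... */
        return PR_SUCCESS;
    }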

Indeed, I think it is a rare and timing-dependent issue.
Cleaning up connections at shutdown looks good. In addition, would it be OK to harden connection_make_readable_nolock so that it does not call ns_connection_post_io_or_closing if the server is shutting down?
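
A minimal sketch of that hardening, keyed off the slapd_shutdown flag inspected in the gdb session above (whether this is the right flag, and the right place to test it, is exactly the question being asked):

    /* Sketch of the proposed guard; the existing bookkeeping in
     * connection_make_readable_nolock is elided, and the _sketch suffix
     * marks this as illustrative rather than the in-tree function. */
    extern int slapd_shutdown; /* the flag printed as $3 = 1 above */

    static void
    connection_make_readable_nolock_sketch(Connection *conn)
    {
        /* ... existing "make readable" bookkeeping ... */
        if (!slapd_shutdown) {
            ns_connection_post_io_or_closing(conn);
        }
        /* else: the event framework is going away, so do not try to
         * queue new IO jobs; the shutdown path disposes of the conn. */
    }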

Just my five cents: it is not a rare issue; I see it every time on each of our replicated servers in production (v1.3.6.x) during a service restart. Examples:
[13/Jun/2017:17:09:02.838214460 +0200] - WARN - ns_connection_post_io_or_closing - post I/O job for conn 2 for fd=257 failed to be added to event queue
[17/Jun/2017:20:42:21.803831257 +0200] - WARN - ns_connection_post_io_or_closing - post I/O job for conn 350 for fd=262 failed to be added to event queue
[17/Jun/2017:20:43:11.430922229 +0200] - WARN - ns_connection_post_io_or_closing - post I/O job for conn 974 for fd=329 failed to be added to event queue

@pj101 Yeah, that's the "non-assert" path. It's the same cause, but in a production build we don't trigger the assert; we just accept the failure and move on.

0001-Ticket-49289-Improve-result-handling-from-connection.patch

This improves the return codes of NS, so we can handle this case a bit better in the logs. Sadly, the root cause can't easily be solved today without a very invasive change. I would like to look into doing that soon, however.
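
To show what improved result handling can look like at this call site, here is a sketch in the style of the nunc-stans ns_result_t API, reusing the call from the listing above; treat the exact names (NS_SHUTDOWN, NS_SUCCESS) and log levels as assumptions rather than as the committed patch:

    ns_result_t job_result = ns_add_io_timeout_job(conn->c_tp, conn->c_prfd, &tv,
                                                   NS_JOB_READ | NS_JOB_PRESERVE_FD,
                                                   ns_handle_pr_read_ready, conn, NULL);
    if (job_result == NS_SHUTDOWN) {
        /* Expected during shutdown: note it quietly and give up on the job. */
        slapi_log_err(SLAPI_LOG_TRACE, "ns_connection_post_io_or_closing",
                      "post I/O job for conn %" PRIu64 " rejected, event framework "
                      "is shutting down\n", conn->c_connid);
    } else if (job_result != NS_SUCCESS) {
        /* A genuine failure while the server is running: keep the loud warning. */
        slapi_log_err(SLAPI_LOG_WARNING, "ns_connection_post_io_or_closing",
                      "post I/O job for conn %" PRIu64 " for fd=%d failed to be "
                      "added to event queue\n", conn->c_connid, conn->c_sd);
    }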

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review

7 years ago

The patch looks great. Ack.
Just a minor question: in ns_connection_post_io_or_closing in debug mode, why drop the assert if the result differs from both shutdown and success?

Metadata Update from @tbordaz:
- Custom field reviewstatus adjusted to ack (was: review)

7 years ago

When the server is running, a failure to add the job means we did something bad, like violating the job state machine or failing to allocate memory. We want to make a lot of noise here.

In the other case, if we are shutting down, we are going to trash the job anyway, so we don't mind if it fails to queue; we just give up on it. It'll leak resources, but that is the case with the old handler and conntable today anyway.
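
In code, that rationale amounts to something like the following in the DEBUG branch (a sketch under the same ns_result_t assumption as above; the committed patch is the authority):

    #ifdef DEBUG
        /* Tolerate a shutdown rejection, but still assert loudly on any other
         * failure (job state machine violation, allocation failure, ...). */
        PR_ASSERT(job_result == NS_SUCCESS || job_result == NS_SHUTDOWN);
    #endif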

commit f56b8d5
To ssh://git@pagure.io/389-ds-base.git
b23e560..f56b8d5 master -> master

Metadata Update from @firstyear:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

7 years ago

389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in 389-ds-base's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/389ds/389-ds-base/issues/2348

If you want to receive further updates on the issue, please navigate to the GitHub issue
and click on the Subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

4 years ago
