With the attached test case, there is a fairly frequent abort (when DS is compiled with DEBUG). The abort occurs during shutdown when a thread fails to register its fd.
Version: Master
Actual results: An abort in a debug build. In a normal build, an error message like "ns_connection_post_io_or_closing - post I/O job for conn xx for fd=yy failed to be added to event queue".
Expected results: No error message or abort.
Steps to reproduce: Run the attached test case; the failure occurs during a regular shutdown.
(gdb) print slapd_shutdown
$3 = 1    # SLAPI_SHUTDOWN_SIGNAL
(gdb) thread 3
[Switching to thread 3 (Thread 0x7f55cea7f800 (LWP 4893))]
#0  0x00007f55cbd15aed in write () from /lib64/libpthread.so.0
(gdb) where
#0  0x00007f55cbd15aed in write () from /lib64/libpthread.so.0
#1  0x00007f55cc36c11e in pt_Write (fd=0x12648a0, buf=0x7ffefff4c270, amount=178) at ../../../nspr/pr/src/pthreads/ptio.c:1315
#2  0x0000000000423bc3 in slapi_write_buffer (fd=0x12648a0, buf=0x7ffefff4c270, amount=178) at $DIR/389-ds-base/ldap/servers/slapd/fileio.c:48
#3  0x00007f55ce1a26cf in vslapd_log_error (fp=0x12648a0, sev_level=25, subsystem=0x4366d2 "op_thread_cleanup", fmt=0x436660 "slapd shutting down - signaling operation threads - op stack size %d max work q size %d max work q stack size %d\n", ap=0x7ffefff4cb38, locked=1) at $DIR/389-ds-base/ldap/servers/slapd/log.c:2400
#4  0x00007f55ce1a204a in slapd_log_error_proc_internal (sev_level=25, subsystem=0x4366d2 "op_thread_cleanup", fmt=0x436660 "slapd shutting down - signaling operation threads - op stack size %d max work q size %d max work q stack size %d\n", ap_err=0x7ffefff4cb20, ap_file=0x7ffefff4cb38) at $DIR/389-ds-base/ldap/servers/slapd/log.c:2234
#5  0x00007f55ce1a2961 in slapi_log_error (loglevel=25, subsystem=0x4366d2 "op_thread_cleanup", fmt=0x436660 "slapd shutting down - signaling operation threads - op stack size %d max work q size %d max work q stack size %d\n") at $DIR/389-ds-base/ldap/servers/slapd/log.c:2450
#6  0x0000000000418f9d in op_thread_cleanup () at $DIR/389-ds-base/ldap/servers/slapd/connection.c:1966
#7  0x000000000041cf10 in slapd_daemon (ports=0x7ffefff4ce70, tp=0x14092e0) at $DIR/389-ds-base/ldap/servers/slapd/daemon.c:1206
#8  0x0000000000425e45 in main (argc=5, argv=0x7ffefff4cfa8) at $DIR/389-ds-base/ldap/servers/slapd/main.c:1181
Nunc-stans fails to queue an fd
(gdb) where
#0  0x00007f55cb730a98 in raise () from /lib64/libc.so.6
#1  0x00007f55cb73269a in abort () from /lib64/libc.so.6
#2  0x00007f55cc3559ec in PR_Assert (s=0x437b81 "job_result == PR_SUCCESS", file=0x436d08 "$DIR/389-ds-base/ldap/servers/slapd/daemon.c", ln=1821) at ../../../nspr/pr/src/io/prlog.c:553
#3  0x000000000041e1fc in ns_connection_post_io_or_closing (conn=0x7f55ce9a8a10) at $DIR/389-ds-base/ldap/servers/slapd/daemon.c:1821
#4  0x0000000000417f32 in connection_make_readable_nolock (conn=0x7f55ce9a8a10) at $DIR/389-ds-base/ldap/servers/slapd/connection.c:1366
#5  0x0000000000418b75 in connection_threadmain () at $DIR/389-ds-base/ldap/servers/slapd/connection.c:1822
#6  0x00007f55cc36e68b in _pt_root (arg=0x15e5000) at ../../../nspr/pr/src/pthreads/ptthread.c:216
#7  0x00007f55cbd0d60a in start_thread () from /lib64/libpthread.so.0
#8  0x00007f55cb7fea4d in clone () from /lib64/libc.so.6
(gdb) frame 3
#3  0x000000000041e1fc in ns_connection_post_io_or_closing (conn=0x7f55ce9a8a10) at $DIR/389-ds-base/ldap/servers/slapd/daemon.c:1821
1821            PR_ASSERT(job_result == PR_SUCCESS);
(gdb) print conn->c_flags   # connection not marked as closing
$1 = 0
(gdb) print job_result
$2 = PR_FAILURE
(gdb) list 1805,1826
1805    #ifdef DEBUG
1806        PR_ASSERT(0 == connection_acquire_nolock(conn));
1807    #else
1808        if (connection_acquire_nolock(conn) != 0) { /* event framework now has a reference */
1809            /*
1810             * This has already been logged as an error in ./ldap/servers/slapd/connection.c
1811             * The error occurs when we get a connection in a closing state.
1812             * For now we return, but there is probably a better way to handle the error case.
1813             */
1814            return;
1815        }
1816    #endif
1817        PRStatus job_result = ns_add_io_timeout_job(conn->c_tp, conn->c_prfd, &tv,
1818                                                    NS_JOB_READ|NS_JOB_PRESERVE_FD,
1819                                                    ns_handle_pr_read_ready, conn, NULL);
1820    #ifdef DEBUG
1821        PR_ASSERT(job_result == PR_SUCCESS);
1822    #endif
1823        if (job_result != PR_SUCCESS) {
1824            slapi_log_err(SLAPI_LOG_WARNING, "ns_connection_post_io_or_closing", "post I/O job for "
1825                          "conn %" PRIu64 " for fd=%d failed to be added to event queue\n", conn->c_connid, conn->c_sd);
1826        } else {
An additional thread that may contribute to the abort
(gdb) thread 2
[Switching to thread 2 (Thread 0x7f55b0542700 (LWP 4928))]
#0  0x00007f55cb7ff043 in epoll_wait () from /lib64/libc.so.6
(gdb) where
#0  0x00007f55cb7ff043 in epoll_wait () from /lib64/libc.so.6
#1  0x00007f55cbae24b8 in epoll_dispatch () from /lib64/libevent-2.0.so.5
#2  0x00007f55cbaccadd in event_base_loop () from /lib64/libevent-2.0.so.5
#3  0x00007f55ce68e72f in ns_event_fw_loop (ns_event_fw_ctx=0x1409450) at $DIR/389-ds-base/src/nunc-stans/ns/ns_event_fw_event.c:310
#4  0x00007f55ce68c054 in event_loop_thread_func (arg=0x14092e0) at $DIR/389-ds-base/src/nunc-stans/ns/ns_thrpool.c:558
#5  0x00007f55cbd0d60a in start_thread () from /lib64/libpthread.so.0
#6  0x00007f55cb7fea4d in clone () from /lib64/libc.so.6
Metadata Update from @tbordaz: - Custom field type adjusted to defect
Add testcase. Note the testcase was created (by @lkrispen ) for another use case. It triggers the abort but needs to be adapted to capture the failure at shutdown. <img alt="ticket49289_test.py" src="/389-ds-base/issue/raw/71bf01db0569aa8e4067867ba7de61a4363680e1eac7725f587233f2fb3a662e-ticket49289_test.py" />
Metadata Update from @mreynolds: - Issue set to the milestone: 1.3.7.0
Metadata Update from @firstyear: - Issue assigned to firstyear
I'll look into this asap.
I can't reproduce this, but I think I know the cause.
https://pagure.io/389-ds-base/issue/48184
Because we don't close connections on shutdown, there is a short window between nunc-stans being told to stop and the event loop actually stopping. In this window, because the connections are left open, it's possible for an I/O event to trigger. Because nunc-stans is in the "shutdown" phase, the job rearm returns a failure, because we don't accept new work once we are shutting down.
I think this is the scenario you are observing here - I knew it could happen, but I've never been able to actually observe it. As a result, it may become more important to properly clean up our connections on shutdown, since the issue can now be reproduced.
Indeed, I think it is a rare and dynamic issue. Cleaning up connections at shutdown looks good. In addition, would it be ok to harden connection_make_readable_nolock so that it does not call ns_connection_post_io_or_closing if the server is shutting down?
Just my 5 cents: it is not a rare issue; I see it every time on each of our replicated servers in production (v1.3.6.x) during a service restart. Examples:
[13/Jun/2017:17:09:02.838214460 +0200] - WARN - ns_connection_post_io_or_closing - post I/O job for conn 2 for fd=257 failed to be added to event queue
[17/Jun/2017:20:42:21.803831257 +0200] - WARN - ns_connection_post_io_or_closing - post I/O job for conn 350 for fd=262 failed to be added to event queue
[17/Jun/2017:20:43:11.430922229 +0200] - WARN - ns_connection_post_io_or_closing - post I/O job for conn 974 for fd=329 failed to be added to event queue
@pj101 Yeah, that's the "non-assert" path. It's the same cause, but in a production build we don't trigger the assert, we just accept the failure and move on.
<img alt="0001-Ticket-49289-Improve-result-handling-from-connection.patch" src="/389-ds-base/issue/raw/ce833d102be78d03756a0f05db373aabe06d866dd2bf44a95ccdedbba1e2e1b3-0001-Ticket-49289-Improve-result-handling-from-connection.patch" />
This improves the return codes of NS, so we can handle this case a bit better in the logs. Sadly today, the root cause can't easily be solved without a very invasive change. I would like to look into performing this soon however.
Metadata Update from @firstyear: - Custom field reviewstatus adjusted to review
The patch looks great. Ack. Just a minor question: in ns_connection_post_io_or_closing, in debug mode, why drop the assert if the result differs from both shutdown and success?
Metadata Update from @tbordaz: - Custom field reviewstatus adjusted to ack (was: review)
When the server is running, if we fail to add the job, that means we did something bad, like violating the job state machine or failing to allocate memory. We want to make a lot of noise here.
In the other case, if we are shutting down, we are going to trash the job anyway, so we don't mind if it fails to queue, and we just give up on it. It'll leak resources, but this is the case with the old handler and conntable today anyway.
commit f56b8d5 To ssh://git@pagure.io/389-ds-base.git b23e560..f56b8d5 master -> master
Metadata Update from @firstyear: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2348
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: fixed)