#49569 deadlock in nunc-stans
Opened 2 years ago by mreynolds. Modified a month ago

Issue Description

Random nunc-stans deadlock. I was testing anonymous searches and anonymous resource limits when I noticed this. The server was idle at the time of the deadlock.

Interesting threads:

Thread 42 (Thread 0x7fe29fbc7700 (LWP 15586)):
#0  0x00007fe2dba6009d in __lll_lock_wait () at /lib64/libpthread.so.0
#1  0x00007fe2dba58e98 in pthread_mutex_lock () at /lib64/libpthread.so.0
#2  0x00007fe2de7cf5f3 in ns_job_done (job=0x7565e00) at ../389-ds-base/src/nunc-stans/ns/ns_thrpool.c:760
#3  0x000000000041e3c1 in ns_connection_post_io_or_closing (conn=0x44bfe00) at ../389-ds-base/ldap/servers/slapd/daemon.c:1693
#4  0x0000000000419b49 in disconnect_server_nomutex_ext (conn=0x44bfe00, opconnid=354, opid=1, reason=-4996, error=0, schedule_closure_job=1)
    at ../389-ds-base/ldap/servers/slapd/connection.c:2301
#5  0x0000000000419bf7 in disconnect_server_nomutex (conn=0x44bfe00, opconnid=354, opid=1, reason=-4996, error=0) at ../389-ds-base/ldap/servers/slapd/connection.c:2314
#6  0x0000000000419909 in disconnect_server (conn=0x44bfe00, opconnid=354, opid=1, reason=-4996, error=0) at ../389-ds-base/ldap/servers/slapd/connection.c:2209
#7  0x00007fe2de527987 in do_disconnect_server (conn=0x44bfe00, opconnid=354, opid=1) at ../389-ds-base/ldap/servers/slapd/plugin_internal_op.c:160
#8  0x00007fe2de53c8e8 in flush_ber (pb=0x44a6060, conn=0x44bfe00, op=0x1d30000, ber=0x452b7c0, type=2) at ../389-ds-base/ldap/servers/slapd/result.c:1780
#9  0x00007fe2de53c56c in send_ldap_search_entry_ext (pb=0x44a6060, e=0x501dbc0, ectrls=0x0, attrs=0x0, attrsonly=0, send_result=0, nentries=0, urls=0x0)
    at ../389-ds-base/ldap/servers/slapd/result.c:1661
#10 0x00007fe2de53b023 in send_ldap_search_entry (pb=0x44a6060, e=0x501dbc0, ectrls=0x0, attrs=0x0, attrsonly=0) at ../389-ds-base/ldap/servers/slapd/result.c:1043
#11 0x00007fe2de511243 in send_entry (pb=0x44a6060, e=0x501dbc0, operation=0x1d30000, attrs=0x0, attrsonly=0, pnentries=0x7fe29fbc20c0)
    at ../389-ds-base/ldap/servers/slapd/opshared.c:1059
#12 0x00007fe2de51180c in iterate (pb=0x44a6060, be=0x1a675f0, send_result=1, pnentries=0x7fe29fbc20c0, pagesize=-1, pr_statp=0x7fe29fbc20cc)
    at ../389-ds-base/ldap/servers/slapd/opshared.c:1390
#13 0x00007fe2de511c1a in send_results_ext (pb=0x44a6060, send_result=1, nentries=0x7fe29fbc20c0, pagesize=-1, pr_stat=0x7fe29fbc20cc)
    at ../389-ds-base/ldap/servers/slapd/opshared.c:1607
#14 0x00007fe2de510857 in op_shared_search (pb=0x44a6060, send_result=1) at ../389-ds-base/ldap/servers/slapd/opshared.c:819
#15 0x0000000000433e5f in do_search (pb=0x44a6060) at ../389-ds-base/ldap/servers/slapd/search.c:335
#16 0x0000000000416ca3 in connection_dispatch_operation (conn=0x44bfe00, op=0x1d30000, pb=0x44a6060) at ../389-ds-base/ldap/servers/slapd/connection.c:648
#17 0x0000000000418b86 in connection_threadmain () at ../389-ds-base/ldap/servers/slapd/connection.c:1769
#18 0x00007fe2dc0ba0eb in _pt_root () at /lib64/libnspr4.so
#19 0x00007fe2dba5636d in start_thread () at /lib64/libpthread.so.0
#20 0x00007fe2db541e1f in clone () at /lib64/libc.so.6


Thread 26 (Thread 0x7fe2c2c05700 (LWP 15563)):
#0  0x00007fe2dba5c90b in pthread_cond_wait@@GLIBC_2.3.2 () at /lib64/libpthread.so.0
#1  0x00007fe2dc0b47d3 in PR_EnterMonitor () at /lib64/libnspr4.so
#2  0x000000000041e6e8 in ns_handle_pr_read_ready (job=0x7565e00) at ../389-ds-base/ldap/servers/slapd/daemon.c:1782
#3  0x00007fe2de7ce653 in work_job_execute (job=0x7565e00) at ../389-ds-base/src/nunc-stans/ns/ns_thrpool.c:291
#4  0x00007fe2de7cf1a4 in event_cb (job=0x7565e00) at ../389-ds-base/src/nunc-stans/ns/ns_thrpool.c:642
#5  0x00007fe2de7d131c in event_cb (fd=64, event=2, arg=0x7565e00) at ../389-ds-base/src/nunc-stans/ns/ns_event_fw_event.c:118
#6  0x00007fe2db8169d9 in event_base_loop () at /lib64/libevent-2.0.so.5
#7  0x00007fe2de7d1862 in ns_event_fw_loop (ns_event_fw_ctx=0x1aaa280) at ../389-ds-base/src/nunc-stans/ns/ns_event_fw_event.c:308
#8  0x00007fe2de7ceffc in event_loop_thread_func (arg=0x1ca0e40) at ../389-ds-base/src/nunc-stans/ns/ns_thrpool.c:581
#9  0x00007fe2dba5636d in start_thread () at /lib64/libpthread.so.0
#10 0x00007fe2db541e1f in clone () at /lib64/libc.so.6

It's not a deadlock in nunc-stans.

It's a deadlock in DS, which still misuses nunc-stans to this day.

The issue looks like this: an I/O event occurs on the socket and we try to enter the connection monitor.

But the conn monitor is held by a previous connection that is serving a do_search, and the client has become unresponsive. Because the do_search runs in connection_threadmain rather than under work_job_execute, the nunc-stans safety guarantees aren't upheld with regard to iojob disposal. So we hold the conn monitor while in send_result, but we can't acquire the job to mark it as done.

If the do_search were run from a worker correctly, the job lock would be held, preventing the second worker from trying to work on the read_ready. It's pretty likely the content of thread 26 is actually a client-disconnect I/O event that we haven't serviced yet, tbh.

This will be resolved by https://pagure.io/389-ds-base/issue/48977 which will replace connection_threadmain.

Metadata Update from @firstyear:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None

2 years ago

PS: This won't deadlock the server, it just deadlocks that pair of worker threads and that iojob. The server will otherwise continue but it's still not a good situation.

Metadata Update from @firstyear:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1544477

2 years ago

@firstyear I do not fully understand how the move from connection_threadmain to an ns thread would prevent the deadlock.
My understanding is that currently we have an ns job lock (job->monitor) and a connection lock (c_mutex). The deadlock occurs because the event thread detecting I/O (holding the ns job lock) needs to set connection flags (c_mutex), while at the same time a worker thread is sending data on the same connection (c_mutex) and decides to stop the job (job lock).
If we replace the worker thread with a nunc-stans thread, that nunc-stans thread may still decide to stop the job while it is sending data (so with the same locking order: c_mutex, then job lock).

Switching from worker threads to nunc-stans threads (https://pagure.io/389-ds-base/issue/48977) looks like a significant change. Do you think a job-cleaning thread could be a transient option? For example: go through the conn_table, and if a connection is to be closed, release the c_mutex and call ns_job_done.

Okay, so this has to do with the "job safety" model, and how the current DS integration violates it in some areas.

An NS job assumes exclusive control over the fd and the job data. So when a job triggers and is run through nunc-stans' work_job_execute, what happens is:

work_job_execute(job) {
    lock(job);
    // assert the job state machine is valid
    job->callback(job); // actually run the user's work
    // move the job to the next state (done, rearm, etc.)
    unlock(job);
}

So for the duration of the "callback" the job lock is held: this prevents anyone CHANGING the job data, the job triggering in another worker, its status changing, anything. Now imagine that inside this "callback" we also take the connection lock ...

The issue here is that when connection_threadmain is running, it is NOT using this path. It goes into ns_connection_post_io_or_closing, locks the connection, then tries to submit "hey, dispose of this job now because no more IO can occur on it".

But the other thread is actually in work_job_execute! and it's locked in the opposite order!

So this "thread" to clean up and do ns_job_done won't help, because in a way that's what we already have.

There are a few options.

One is to not use connection_threadmain, and have jobs execute properly under work_job_execute, so they lock correctly, in order: job, then connection.

The other is to add an external interface to ns to allow external locking of the job, so that we can take the lock in the right order.

Really, the first solution is the "best", but it may impact async multithreading of LDAP, because some design flaws of NS prevent this.

The "perfect" solution is to merge connection + nunc-stans together, so there is one lock.

I think for now, I might add the external lock interface so we can put these in order properly.

And then later I have to rewrite connection handling to work correctly with this.

@firstyear many thanks for your clear and patient explanations.

I do agree with your conclusion: with nunc-stans the design is to have only one lock (merge the job + connection locks).

So async operations will be delayed. Correct?

It will be delayed in the current design, but when I get some time I'll fix it.

So the fix today will:

  • retain async working ability
  • give slapi_connection the ability to direct NS to take a lock in the right order

I'll start work on this soon :)

Metadata Update from @firstyear:
- Issue assigned to firstyear

2 years ago

I've started on this now, :) @tbordaz

@tbordaz So I've started on this, and sadly it looks like it will be more invasive than I want, but it's the only real fix. The issue is that jobs should have a 1:1 relation with the fd, but the current integration makes a new job per I/O event on the fd. So to fix it I have to make jobs live longer, which is sadly a more invasive change.

Not as big as a full fix up, but still more than 'a few lines' :(

So we'll need lots of testing to guarantee this patch I think.

@firstyear, thanks for the continuous investigation. I agree the basic design is NS job <==> connection/fd. I can imagine how complex it is to move from those two locks to a single one.

I mentioned above the possibility of a cleanup task.
Currently, if a thread (42) sends data (holding the fd) and decides to close the connection, it triggers a new job. Would it be possible for it to just flag the connection to be closed and stop sending data?
Then we could have a cleaning NS thread (a job) that acquires the fd and processes the closure.
I am not familiar with NS, but the cleaning thread could be a job spawned for each opened connection. This cleaning thread would have the same locking order (NS job, then fd) as thread 26.

Well, the issue is that a cleanup task won't help here, because the call to ns_job_done has to happen in the current model, since you are freeing the associated job. You still run the risk of the lock inversion. You would need to make a freelist of disposed jobs that connections are done with, then have a thread to clean that.

Really, though, I've done the work now, and it's actually not looking too bad to fix the integration so that jobs are 1:1 with the fd/connection. I have to do a bunch of testing today and tomorrow to be sure it's correct, of course.

Erghh, the more I look at this the worse it gets. We have such a mess between ns + conntable and what actually drives the events. It's a miracle it works at all. I'll need some time to think about this, but don't be surprised if I come back with "just fix it properly" as the answer ... I don't think there is a quick fix out of this.

@firstyear FYI, the same STR test that systematically fails in 7.5 does not fail in 389-ds-base-1.3.6.1-19.el7_4.
It may be more difficult to hit in 7.4, or a recent 7.5 fix introduced/revealed it.

Reverting the fix for ticket 48184, which is what introduced this regression:

3552312..fda63dc master -> master

65d0e49..804002e 389-ds-base-1.3.8 -> 389-ds-base-1.3.8

d5e2f74..52bc5d0 389-ds-base-1.3.7 -> 389-ds-base-1.3.7

Great work to find and revert this. I'll still apply and submit the proper fix soon for 1.4.x, which fixes the way we use nunc-stans.

@firstyear I've tested your patch; unfortunately ipa-server-install fails to complete. I'm attaching the errors log file from ns-slapd. There are interesting errors that I haven't seen before, like:

[19/Apr/2018:13:03:25.863436169 -0400] - ERR - new_passwdPolicy - Failed to find active system threads, refusing to allocate pwpolicy

at startup and

[19/Apr/2018:13:05:24.706751484 -0400] - ERR - ns_handle_pr_read_ready - conn 108 for fd=78 - socket is already closing

during normal operation.
errors_49636

The first one I found during my work; I'm surprised to see it again. That would indicate something I missed during development, but it's hard to recreate...

The "already closing" is saying "hey you tried to read on a socket that's shutting down". So there must be something else going on here.

#49569 described two symptoms:
- This deadlock BZ
- The required backout of #48184 (https://pagure.io/389-ds-base/issue/49569#comment-502760)

The ticket #49569 (aka #1544477) is dedicated to the second symptom and can now be closed.

The first (initial) symptom, the deadlock (BZ 1608746), requires a new ticket to be opened so it can be tracked upstream.

Closing this ticket, as it was fixed with https://pagure.io/389-ds-base/issue/49569#comment-502760

Metadata Update from @tbordaz:
- Issue status updated to: Closed (was: Open)

a year ago

Metadata Update from @tbordaz:
- Issue status updated to: Open (was: Closed)

4 months ago

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.4.2 (was: 1.3.7.0)

3 months ago

JFYI, I tried to build PR#49636 with TSAN. test_nuncstans fails during make check (it fails on master too).

tsan.lt-test_nuncstans.3246

Okay, I'll have a look at this soon because this is on my todo list.

Metadata Update from @tbordaz:
- Issue tagged with: Nunc Stans

a month ago
