#50905 intermittent SSL hang with rhds
Closed: wontfix 3 years ago by mreynolds. Opened 4 years ago by tbordaz.

The deadlock occurs in NSS callbacks while two threads access the same connection:

Thread 32 (...):
#0  0x00007f1b2918da35 in pthread_cond_wait@@GLIBC_2.3.2 (...) at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f1b297e4483 in PR_EnterMonitor (...) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
#2  0x00007f1b2a15e474 in ssl_SecureRecv (...) at sslsecur.c:787
#3  0x00007f1b2a162461 in ssl_Recv (...) at sslsock.c:2943
#4  0x000055df54bdf630 in sasl_io_start_packet (...) at ldap/servers/slapd/sasl_io.c:200
#5  0x000055df54bdfc46 in sasl_io_recv (...) at ldap/servers/slapd/sasl_io.c:480
#6  0x000055df54bcf161 in connection_read_ldap_data (...) at ldap/servers/slapd/connection.c:1097
#7  0x000055df54bcf161 in connection_read_operation (...) at ldap/servers/slapd/connection.c:1181
#8  0x000055df54bcf997 in connection_threadmain (...) at ldap/servers/slapd/connection.c:1630
        conn = 0x55df5909bd00
#9  0x00007f1b297e9bfb in _pt_root (...) at ../../../nspr/pr/src/pthreads/ptthread.c:201
#10 0x00007f1b29189ea5 in start_thread (...) at pthread_create.c:307
#11 0x00007f1b288358cd in clone (...) at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Thread 16 (...):
#0  0x00007f1b2919054d in __lll_lock_wait (...) at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f1b2918bed1 in _L_lock_1093 (...) at /lib64/libpthread.so.0
#2  0x00007f1b2918be72 in __GI___pthread_mutex_lock (...) at ../nptl/pthread_mutex_lock.c:133
#3  0x00007f1b297e3e59 in PR_Lock (...) at ../../../nspr/pr/src/pthreads/ptsynch.c:171
#4  0x00007f1b2a1623c2 in ssl_Send (...) at sslsock.c:2960
#5  0x00007f1b2a15774c in ssl_DefSend (...) at ssldef.c:105
#6  0x00007f1b2a147d9c in ssl3_SendRecord (...) at ssl3con.c:2440
#7  0x00007f1b2a147f18 in ssl3_SendApplicationData (...) at ssl3con.c:2553
#8  0x00007f1b2a15e789 in ssl_SecureSend (...) at sslsecur.c:973
#9  0x00007f1b2a162718 in ssl_Write (...) at sslsock.c:3004
#10 0x000055df54bd26e1 in write_function (...) at ldap/servers/slapd/daemon.c:2004
#11 0x000055df54bd26e1 in openldap_write_function (...) at ldap/servers/slapd/daemon.c:2078
#12 0x00007f1b2a39a92e in ber_int_sb_write (...) at sockbuf.c:445
#13 0x00007f1b2a396c0b in ber_flush2 (...) at io.c:246
#14 0x00007f1b2a396d1d in ber_flush (...) at io.c:211
#15 0x00007f1b2ba48c2c in flush_ber (.., conn = 0x55df5909bd00, ...) at ldap/servers/slapd/result.c:1761
#16 0x00007f1b2ba4ae30 in send_ldap_result_ext (...) at ldap/servers/slapd/result.c:600
        conn = 0x55df5909bd00
#17 0x00007f1b2ba4aff1 in send_ldap_result (...) at ldap/servers/slapd/result.c:199
#18 0x000055df54be1ded in ids_sasl_check_bind (...) at ldap/servers/slapd/saslbind.c:1125
#19 0x000055df54bc9ad1 in do_bind (...) at ldap/servers/slapd/bind.c:363
#20 0x000055df54bd0768 in connection_dispatch_operation (...) at ldap/servers/slapd/connection.c:600
#21 0x000055df54bd0768 in connection_threadmain (...) at ldap/servers/slapd/connection.c:1790
#22 0x00007f1b297e9bfb in _pt_root (...) at ../../../nspr/pr/src/pthreads/ptthread.c:201
#23 0x00007f1b29189ea5 in start_thread (...) at pthread_create.c:307

Thread 32 is waiting in ssl_SecureRecv for an SSL monitor (read op).
Thread 16 is waiting in ssl_SecureSend for an SSL lock (bind).
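
To make the suspected scenario concrete: the classic way two threads hang like this is by taking the same two primitives in opposite order. Below is a minimal, purely illustrative NSPR sketch (this is not the NSS code; lockA and monB are hypothetical stand-ins for the PRLock and PRMonitor seen in the frames above). If thread 32 already holds the lock that thread 16 wants, and thread 16 already holds the monitor that thread 32 wants, both block forever:

/* Illustrative ABBA deadlock with NSPR primitives; lockA/monB are made-up names. */
#include <nspr.h>

static PRLock *lockA;    /* stands in for the PRLock ssl_Send blocks on (PR_Lock) */
static PRMonitor *monB;  /* stands in for the monitor ssl_SecureRecv blocks on (PR_EnterMonitor) */

static void thread32_like(void *arg)
{
    PR_Lock(lockA);                            /* already holds the lock thread 16 wants */
    PR_Sleep(PR_MillisecondsToInterval(100));  /* widen the race window */
    PR_EnterMonitor(monB);                     /* blocks here, like frame #1 of thread 32 */
    PR_ExitMonitor(monB);
    PR_Unlock(lockA);
}

static void thread16_like(void *arg)
{
    PR_EnterMonitor(monB);                     /* already holds the monitor thread 32 wants */
    PR_Sleep(PR_MillisecondsToInterval(100));
    PR_Lock(lockA);                            /* blocks here, like frame #3 of thread 16 */
    PR_Unlock(lockA);
    PR_ExitMonitor(monB);
}

int main(void)
{
    lockA = PR_NewLock();
    monB = PR_NewMonitor();
    PRThread *t32 = PR_CreateThread(PR_USER_THREAD, thread32_like, NULL,
                                    PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                    PR_JOINABLE_THREAD, 0);
    PRThread *t16 = PR_CreateThread(PR_USER_THREAD, thread16_like, NULL,
                                    PR_PRIORITY_NORMAL, PR_GLOBAL_THREAD,
                                    PR_JOINABLE_THREAD, 0);
    PR_JoinThread(t32);  /* never returns once both threads are blocked */
    PR_JoinThread(t16);
    return 0;
}

Whether the two NSS locks are really taken in reverse order on the read and write paths is exactly what needs to be confirmed from the core.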


Metadata Update from @tbordaz:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1801327

4 years ago

I suspect this may be a duplicate of https://pagure.io/389-ds-base/issue/50636.

In both tickets, a SASL bind registers new callbacks. Without the #50636 fix, a reader thread (connection_read_ldap_data) can push the callbacks while registration is not yet complete. For #50636 this leads to a crash. Could it also lead to a deadlock?
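
To illustrate the window I mean (a hypothetical sketch, not the actual slapd code; the struct and function names below are made up): the bind thread installs the new SASL I/O callbacks in several steps, and without the #50636 fix a reader thread can dereference them in between.

/* Hypothetical illustration of the suspected window; not the real slapd structures. */
#include <stddef.h>

typedef int (*io_read_fn)(void *ctx, void *buf, size_t len);

struct conn_io {
    io_read_fn read_cb;   /* swapped by the bind thread during the SASL bind */
    void      *read_ctx;  /* swapped in a separate, later store              */
};

/* Bind thread (think ids_sasl_check_bind): registration is not atomic. */
void install_sasl_io(struct conn_io *io, io_read_fn new_cb, void *new_ctx)
{
    io->read_cb  = new_cb;    /* step 1 is visible to other threads ...             */
    io->read_ctx = new_ctx;   /* ... before step 2, unless the connection is locked */
}

/* Reader thread (think connection_read_ldap_data): can run between the two
 * stores above and call the new callback with the old context. */
int do_read(struct conn_io *io, void *buf, size_t len)
{
    return io->read_cb(io->read_ctx, buf, len);
}

Depending on what a half-registered callback then does with the SSL layer, I can imagine that ending either in the #50636 crash or in a hang.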

For the moment I propose waiting for feedback on a version containing the #50636 fix.

Metadata Update from @tbordaz:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

4 years ago

Well, my concern is that we don't have visibility into which locks are deadlocked here. Can we get the info from t32 frame 1 and t16 frame 3 about which locks they are trying to acquire and which thread ids currently hold them? Pthread mutexes are used inside the NSPR mutexes, and they record the id of the thread currently holding them, so we can use that to establish which locks were taken and in what order, because the current stack trace doesn't show what we are taking or whether we are taking it in reverse order.

Do you have a coredump to get that info from? Or just the bt?
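
For reference, this is all I mean by the holding thread id: on Linux, glibc's NPTL records the owner's kernel tid inside the mutex itself, so it can be read straight out of a coredump and matched against the LWP numbers in the backtrace. A minimal, glibc-specific illustration (the __data.__owner field is an NPTL internal, so this is not portable; the address in the comment is just the one from your trace):

/* glibc/Linux only: shows where the owner tid of a pthread mutex lives. */
#include <pthread.h>
#include <stdio.h>

int main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&m);
    /* NPTL stores the holder's kernel tid in the mutex; from a coredump the
     * same field can be read with gdb, e.g.
     *   (gdb) print ((pthread_mutex_t *) 0x55df7a6e44d0)->__data.__owner
     * and then compared against the LWP ids shown by `info threads`. */
    printf("owner tid: %d\n", m.__data.__owner);
    pthread_mutex_unlock(&m);
    return 0;
}

The pthread mutex address for thread 16 is already visible in its frame 2, so that one can be checked directly; the monitor thread 32 is waiting on would presumably need its internal lock chased the same way.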

@firstyear, thanks for your feedback.

I can only provide part of the pstacks. Both threads use the same connection. There are two different NSS locks (a monitor and a thread lock) and it looks like a deadlock scenario where the two locks are acquired in opposite order.

Thread 32 (Thread 0x7f1af7e27700 (LWP 28659)):
#0  0x00007f1b2918da35 in pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f1b297e4483 in PR_EnterMonitor (mon=0x55df6de1a630) at ../../../nspr/pr/src/pthreads/ptsynch.c:592
        self = 139753804691200
#2  0x00007f1b2a15e474 in ssl_SecureRecv (ss=0x55df5e56a000, buf=0x7f1af7e26650, len=7, flags=0) at sslsecur.c:787
        rv = 0
...

Thread 16 (Thread 0x7f1affe37700 (LWP 28643)):
#0  0x00007f1b2919054d in __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f1b2918bed1 in _L_lock_1093 () at /lib64/libpthread.so.0
#2  0x00007f1b2918be72 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x55df7a6e44d0) at ../nptl/pthread_mutex_lock.c:133
        cnt = 11
        max_cnt = -1
        type = 3
        id = <optimized out>
#3  0x00007f1b297e3e59 in PR_Lock (lock=0x55df7a6e44d0) at ../../../nspr/pr/src/pthreads/ptsynch.c:171
...

Waiting for feedback on the #50636 fix. I will have a look at the provided cores to check whether the thread id of the holder thread confirms the deadlock.

See, this is what I think is weird: you shouldn't be blocking on the lock that is used in a condition variable, and both of the locks seem to be in NSS, so I'm wondering how it's possible that we've taken anything in reverse order. That's why I think it would be good to see the thread id of whoever is holding the locks.

I agree both locks are in NSS, but I do not think they are part of a condition variable.
This deadlock is weird and I agree it is worth debugging further.
Both threads T32 and T16 can acquire the XmitBufLock (a monitor), but as they are reader/sender they should not be acquiring the same lock. So yes, hopefully the core will confirm whether they are in deadlock or whether there is another player.

Great, I look forward to hearing more from you @tbordaz :)

Metadata Update from @mreynolds:
- Issue priority set to: major
- Issue set to the milestone: 1.3.10 (was: 0.0 NEEDS_TRIAGE)

4 years ago

Metadata Update from @tbordaz:
- Issue assigned to tbordaz

4 years ago

03bb5ba..03103f1 389-ds-base-1.4.2
c4e7736..21638c0 389-ds-base-1.4.1
3a8d5bc..fe19978 389-ds-base-1.3.10

Fixing the sasl_bind crash/hang requires this fix and also https://pagure.io/389-ds-base/issue/50636

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

3 years ago

389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests
will be accepted only in 389-ds-base's GitHub repository.

This issue has been cloned to GitHub and is available here:
- https://github.com/389ds/389-ds-base/issues/3958

If you want to receive further updates on this issue, please navigate to the GitHub issue
and click the subscribe button.

Thank you for your understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago
