#49815 Thread gets stuck spinning after upgrade to 389-ds-base-1.3.7.5-24.el7_5
Closed: wontfix 3 years ago Opened 3 years ago by orion.

Issue Description

After upgrade from 389-ds-base-1.3.7.5-21.el7_5.x86_64 to 389-ds-base-1.3.7.5-24.el7_5, after running of some period of time (2-12 hours) a thread can get stuck consuming 100% cpu.

htop thread view show the PID to be 3359. pstack:

Thread 36 (Thread 0x7fcc9578c700 (LWP 3359)):
#0  0x00007fccaeb8e9d7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fccafb5d50d in PR_Sleep (ticks=0) at ../../../nspr/pr/src/pthreads/ptthread.c:787
#2  0x00007fccb203bc89 in work_job_execute (job=0x55897987e960) at src/nunc-stans/ns/ns_thrpool.c:291
#3  0x00007fccb203cbe3 in event_cb (fd=<optimized out>, event=<optimized out>, arg=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:118
#4  0x00007fccaf08ca14 in event_process_active_single_queue (activeq=0x5589736a5d70, base=0x558973346500) at event.c:1350
#5  0x00007fccaf08ca14 in event_process_active (base=<optimized out>) at event.c:1420
#6  0x00007fccaf08ca14 in event_base_loop (base=0x558973346500, flags=flags@entry=1) at event.c:1621
#7  0x00007fccb203ceae in ns_event_fw_loop (ns_event_fw_ctx=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:308
#8  0x00007fccb203bac9 in event_loop_thread_func (arg=0x5589732f6c00) at src/nunc-stans/ns/ns_thrpool.c:581
#9  0x00007fccaf4fcdd5 in start_thread (arg=0x7fcc9578c700) at pthread_create.c:308
#10 0x00007fccaeba9b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113

strace shows spin cycle:

epoll_wait(14, [], 32, 0)               = 0
sched_yield()                           = 0
sched_yield()                           = 0

with the occasional:

sched_yield()                           = 0
epoll_ctl(14, EPOLL_CTL_ADD, 159, {EPOLLIN, {u32=159, u64=159}}) = 0
epoll_wait(14, [{EPOLLIN, {u32=17, u64=17}}], 32, 0) = 1
read(17, "a", 1)                        = 1
sched_yield()                           = 0

or

sched_yield()                           = 0
epoll_wait(14, [{EPOLLIN, {u32=159, u64=159}}], 32, 0) = 1
epoll_ctl(14, EPOLL_CTL_DEL, 159, 0x7fcc9578bb70) = 0
futex(0x5589736fd34c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x5589736fd348, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x5589794ae2fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x5589794ae2f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
sched_yield()
sched_yield()                           = 0
epoll_wait(14, [{EPOLLIN, {u32=9, u64=9}}], 32, 0) = 1
accept(9, {sa_family=AF_INET6, sin6_port=htons(59198), inet_pton(AF_INET6, "::ffff:10.10.30.201", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 158
fcntl(158, F_GETFL)                     = 0x2 (flags O_RDWR)
fcntl(158, F_SETFL, O_RDWR|O_NONBLOCK)  = 0
setsockopt(158, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(158, SOL_TCP, TCP_NODELAY, [0], 4) = 0
getsockname(158, {sa_family=AF_INET6, sin6_port=htons(389), inet_pton(AF_INET6, "::ffff:10.10.41.4", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
epoll_ctl(14, EPOLL_CTL_ADD, 158, {EPOLLIN, {u32=158, u64=158}}) = 0
sched_yield()                           = 0

Package Version and Platform

Scientific Linux 7.5
389-ds-base-1.3.7.5-24.el7_5.x86_64

Steps to reproduce

Actual results

Not sure.

After several hours, ns-slapd stopped functioning and eventually ran into:

[27/Jun/2018:21:43:34.503657050 -0600] - ERR - libdb - BDB2520 /var/lib/dirsrv/slapd-NWRA-COM/db/log.0000007218: log file unreadable: Too many open files
[27/Jun/2018:21:43:34.511042759 -0600] - ERR - libdb - BDB0061 PANIC: Too many open files
[27/Jun/2018:21:43:34.604851998 -0600] - ERR - libdb - BDB0060 PANIC: fatal region error detected; run recovery
[27/Jun/2018:21:43:34.606730942 -0600] - CRIT - deadlock_threadmain - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[27/Jun/2018:21:43:34.621337249 -0600] - ERR - libdb - BDB0060 PANIC: fatal region error detected; run recovery
[27/Jun/2018:21:43:34.623196488 -0600] - ERR - trickle_threadmain - Serious Error---Failed to trickle, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)

Metadata Update from @mreynolds:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None

3 years ago

From IRC:

<mreynolds> orionp: just curious how long was the server up for before the problem occurred? Any idea how many connections were processed before the issue happened?
<mreynolds> If you still have the access logs from that time "logconv.pl -V" would be useful

<orionp> mreynolds: first time it was about 1.5 hours (before the cpu spin started). Second time about 11 hours
<orionp> logconv takes a while...
<orionp> Lots of : Unindexed Component #457 (notes=U) Search Base: cn=groups,cn=accounts,dc=nwra,dc=com Search Filter: (&(objectclass=ipaexternalgroup)(ipaexternalmember=sSID))

I do see paged searches which is what was supposed to be fixed in 389-ds-base-1.3.7.5-24

----------- Access Log Output ------------

Start of Logs:    28/Jun/2018:07:55:30.243693764
End of Logs:      28/Jun/2018:15:35:20.009007335

Processed Log Time:  7 Hours, 39 Minutes, 49.765316608 Seconds

Restarts:                     1
Secure Protocol Versions:
  - TLS1.2 client bound as uid=pkidbuser,ou=people,o=ipaca (4 connections)
  - TLS1.2 256-bit AES; client  (4 connections)
  - TLS1.2 256-bit AES-GCM (119 connections)

Peak Concurrent Connections:  11
Total Operations:             241980
Total Results:                244345
Overall Performance:          100.0%

Total Connections:            6012          (0.22/sec)  (13.07/min)
 - LDAP Connections:          5888          (0.21/sec)  (12.80/min)
 - LDAPI Connections:         94            (0.00/sec)  (0.20/min)
 - LDAPS Connections:         30            (0.00/sec)  (0.07/min)
 - StartTLS Extended Ops:     95            (0.00/sec)  (0.21/min)

Searches:                     196570        (7.12/sec)  (427.48/min)
Modifications:                854           (0.03/sec)  (1.86/min)
Adds:                         1             (0.00/sec)  (0.00/min)
Deletes:                      1             (0.00/sec)  (0.00/min)
Mod RDNs:                     0             (0.00/sec)  (0.00/min)
Compares:                     0             (0.00/sec)  (0.00/min)
Binds:                        4638          (0.17/sec)  (10.09/min)

Proxied Auth Operations:      0
Persistent Searches:          2
Internal Operations:          0
Entry Operations:             0
Extended Operations:          39573
Abandoned Requests:           343
Smart Referrals Received:     0

VLV Operations:               0
VLV Unindexed Searches:       0
VLV Unindexed Components:     0
SORT Operations:              0

Entire Search Base Queries:   2
Paged Searches:               16982        <-- This could be the cause
Unindexed Searches:           0
Unindexed Components:         42524

Broken Pipes:                 0
Connections Reset By Peer:    0
Resource Unavailable:         0
Max BER Size Exceeded:        0

----- Current Open Connection IDs ----- 

Conn Number:  3792 (10.10.10.7)
Conn Number:  3735 (10.10.40.30)
Conn Number:  3728 (10.10.41.21)
Conn Number:  2 (10.10.41.4)
Conn Number:  3693 (10.10.41.4)
Conn Number:  3820 (10.10.20.201)
Conn Number:  3756 (192.168.183.86)
Conn Number:  3731 (192.168.183.92)
Conn Number:  3828 (local)
Conn Number:  3771 (10.0.1.74)
Conn Number:  3815 (10.10.41.27)
Conn Number:  3738 (10.0.0.203)
Conn Number:  3802 (10.10.20.3)
Conn Number:  53 (10.10.41.4)
Conn Number:  3750 (10.10.40.25)
Conn Number:  3647 (10.10.10.31)
Conn Number:  3 (10.10.41.4)
Conn Number:  3703 (10.10.20.12)
Conn Number:  3679 (10.0.1.41)
Conn Number:  52 (10.10.41.4)
Conn Number:  3805 (10.10.30.27)
Conn Number:  3794 (10.0.1.76)
Conn Number:  5 (local)


----- Total Connection Codes -----

B1     3487   Bad Ber Tag Encountered                 
U1     2499   Cleanly Closed Connections              
A1       21   Client Aborted Connections              
B4       10   Server failed to flush data (response) back to Client

That first behaviour is an expected part of nunc-stans. It's designed so that idle threads yield and recheck "in case" they missed a notification sometimes between sleeps. Not my idea, but it indicates an issue with nspr notifications .... I never investigated though.

The second part is the likely cause, but not sure what causes that. Resource leak?

I wonder if this could be a consequence of https://pagure.io/389-ds-base/issue/49765
This bug was observed with async operation. Do you know if some requests are async ?

The original symptom was some hanging connection but I do not know if it was cpu consuming.

@mreynolds, am I correct that #49765 is not included in 1.3.7.5-24

I wonder if this could be a consequence of https://pagure.io/389-ds-base/issue/49765
This bug was observed with async operation. Do you know if some requests are async ?

There are paged searches in the logconv output above

The original symptom was some hanging connection but I do not know if it was cpu consuming.
@mreynolds, am I correct that #49765 is not included in 1.3.7.5-24

It is not in any 1.3.7 builds - there are no bugzilla's listed in the ticket. @tbordaz do you know if there is a BZ?

Metadata Update from @mreynolds:
- Issue set to the milestone: 1.3.7.0

3 years ago

This ticket #49815 is not due to https://pagure.io/389-ds-base/issue/49848.

#49848 exists in both 1.3.7.5-21 and 1.3.7.5-24

Any progress here? Anything else I can provide to help?

@orion I think not, as there hasn't been much activity. Without a reproducer it's hard to get to the bottom. Are you seeing this issue regularly?

Metadata Update from @tbordaz:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1605554

3 years ago

@orion, the RC of that bug is likely identified as connection leak.

To close a connection, DS schedule a closure handler. If an other handler was scheduled at that time, DS cancels it before registering the closure handler. Unfortunately the cancelation may fail (because of the state of the handler) so we have two handlers. It is not supported or does not work well, so that the closure handler is never called.

At the end of the day the event thread is continuously called on that connection because of the closure handle that stay pending.

Could you check, when the 100% CPU occurs that the number of opened connection is increasing.

A possibility to investigate would be to test if it exists a handler before registering the read handler ns_handle_pr_read_ready.

For the moment, we decided to disable NS as we have not identified a fix for this issue.

@tbordaz Reminder that there is that NS patch awaiting review.

For now disabling nunc-stans seems like the best course of action.

Just to state that I do not appear to be seeing this with 389-ds-base-1.3.8.4-15.el7

Metadata Update from @mreynolds:
- Issue close_status updated to: duplicate
- Issue status updated to: Closed (was: Open)

3 years ago

Just to state that I do not appear to be seeing this with 389-ds-base-1.3.8.4-15.el7

Starting in 389-ds-base-1.3.8.4-10 nunc-stans has been disabled by default. Since we are aware of the many issues with nunc-stans, which are being tracked in other tickets/bugzilla, I closed this out.

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2874

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: duplicate)

2 years ago

Login to comment on this ticket.

Metadata
Attachments 1
Attached 3 years ago View Comment