After upgrading from 389-ds-base-1.3.7.5-21.el7_5.x86_64 to 389-ds-base-1.3.7.5-24.el7_5, after running for some period of time (2-12 hours), a thread can get stuck consuming 100% CPU.
The htop thread view shows the thread ID (LWP) to be 3359. pstack:
Thread 36 (Thread 0x7fcc9578c700 (LWP 3359)):
#0  0x00007fccaeb8e9d7 in sched_yield () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007fccafb5d50d in PR_Sleep (ticks=0) at ../../../nspr/pr/src/pthreads/ptthread.c:787
#2  0x00007fccb203bc89 in work_job_execute (job=0x55897987e960) at src/nunc-stans/ns/ns_thrpool.c:291
#3  0x00007fccb203cbe3 in event_cb (fd=<optimized out>, event=<optimized out>, arg=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:118
#4  0x00007fccaf08ca14 in event_process_active_single_queue (activeq=0x5589736a5d70, base=0x558973346500) at event.c:1350
#5  0x00007fccaf08ca14 in event_process_active (base=<optimized out>) at event.c:1420
#6  0x00007fccaf08ca14 in event_base_loop (base=0x558973346500, flags=flags@entry=1) at event.c:1621
#7  0x00007fccb203ceae in ns_event_fw_loop (ns_event_fw_ctx=<optimized out>) at src/nunc-stans/ns/ns_event_fw_event.c:308
#8  0x00007fccb203bac9 in event_loop_thread_func (arg=0x5589732f6c00) at src/nunc-stans/ns/ns_thrpool.c:581
#9  0x00007fccaf4fcdd5 in start_thread (arg=0x7fcc9578c700) at pthread_create.c:308
#10 0x00007fccaeba9b3d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113
strace shows a spin cycle:
epoll_wait(14, [], 32, 0) = 0
sched_yield() = 0
sched_yield() = 0
with the occasional:
sched_yield() = 0
epoll_ctl(14, EPOLL_CTL_ADD, 159, {EPOLLIN, {u32=159, u64=159}}) = 0
epoll_wait(14, [{EPOLLIN, {u32=17, u64=17}}], 32, 0) = 1
read(17, "a", 1) = 1
sched_yield() = 0
or
sched_yield() = 0
epoll_wait(14, [{EPOLLIN, {u32=159, u64=159}}], 32, 0) = 1
epoll_ctl(14, EPOLL_CTL_DEL, 159, 0x7fcc9578bb70) = 0
futex(0x5589736fd34c, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x5589736fd348, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
futex(0x5589794ae2fc, FUTEX_WAKE_OP_PRIVATE, 1, 1, 0x5589794ae2f8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1
sched_yield()
sched_yield() = 0
epoll_wait(14, [{EPOLLIN, {u32=9, u64=9}}], 32, 0) = 1
accept(9, {sa_family=AF_INET6, sin6_port=htons(59198), inet_pton(AF_INET6, "::ffff:10.10.30.201", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 158
fcntl(158, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(158, F_SETFL, O_RDWR|O_NONBLOCK) = 0
setsockopt(158, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(158, SOL_TCP, TCP_NODELAY, [0], 4) = 0
getsockname(158, {sa_family=AF_INET6, sin6_port=htons(389), inet_pton(AF_INET6, "::ffff:10.10.41.4", &sin6_addr), sin6_flowinfo=0, sin6_scope_id=0}, [28]) = 0
epoll_ctl(14, EPOLL_CTL_ADD, 158, {EPOLLIN, {u32=158, u64=158}}) = 0
sched_yield() = 0
Scientific Linux 7.5
389-ds-base-1.3.7.5-24.el7_5.x86_64
Not sure.
After several hours, ns-slapd stopped functioning and eventually ran into:
[27/Jun/2018:21:43:34.503657050 -0600] - ERR - libdb - BDB2520 /var/lib/dirsrv/slapd-NWRA-COM/db/log.0000007218: log file unreadable: Too many open files
[27/Jun/2018:21:43:34.511042759 -0600] - ERR - libdb - BDB0061 PANIC: Too many open files
[27/Jun/2018:21:43:34.604851998 -0600] - ERR - libdb - BDB0060 PANIC: fatal region error detected; run recovery
[27/Jun/2018:21:43:34.606730942 -0600] - CRIT - deadlock_threadmain - Serious Error---Failed in deadlock detect (aborted at 0x0), err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
[27/Jun/2018:21:43:34.621337249 -0600] - ERR - libdb - BDB0060 PANIC: fatal region error detected; run recovery
[27/Jun/2018:21:43:34.623196488 -0600] - ERR - trickle_threadmain - Serious Error---Failed to trickle, err=-30973 (BDB0087 DB_RUNRECOVERY: Fatal error, run database recovery)
Attaching full pstack provided by orion
[Attachment: pstack (/389-ds-base/issue/raw/files/2a27eb7e17ce674805e58d73e8603159bee5e04b43b2fcced23ee2e91d75b3ce-pstack)]
Metadata Update from @mreynolds:
- Custom field component adjusted to None
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None
- Custom field type adjusted to None
- Custom field version adjusted to None
From IRC:
<mreynolds> orionp: just curious how long was the server up for before the problem occurred? Any idea how many connections were processed before the issue happened?
<mreynolds> If you still have the access logs from that time "logconv.pl -V" would be useful
<orionp> mreynolds: first time it was about 1.5 hours (before the cpu spin started). Second time about 11 hours
<orionp> logconv takes a while...
<orionp> Lots of:
  Unindexed Component #457 (notes=U)
    Search Base:   cn=groups,cn=accounts,dc=nwra,dc=com
    Search Filter: (&(objectclass=ipaexternalgroup)(ipaexternalmember=sSID))
I do see paged searches, which is what was supposed to be fixed in 389-ds-base-1.3.7.5-24.
----------- Access Log Output ------------
Start of Logs:               28/Jun/2018:07:55:30.243693764
End of Logs:                 28/Jun/2018:15:35:20.009007335
Processed Log Time:          7 Hours, 39 Minutes, 49.765316608 Seconds
Restarts:                    1
Secure Protocol Versions:
  - TLS1.2 client bound as uid=pkidbuser,ou=people,o=ipaca (4 connections)
  - TLS1.2 256-bit AES; client (4 connections)
  - TLS1.2 256-bit AES-GCM (119 connections)
Peak Concurrent Connections: 11
Total Operations:            241980
Total Results:               244345
Overall Performance:         100.0%

Total Connections:           6012    (0.22/sec) (13.07/min)
 - LDAP Connections:         5888    (0.21/sec) (12.80/min)
 - LDAPI Connections:        94      (0.00/sec) (0.20/min)
 - LDAPS Connections:        30      (0.00/sec) (0.07/min)
 - StartTLS Extended Ops:    95      (0.00/sec) (0.21/min)

Searches:                    196570  (7.12/sec) (427.48/min)
Modifications:               854     (0.03/sec) (1.86/min)
Adds:                        1       (0.00/sec) (0.00/min)
Deletes:                     1       (0.00/sec) (0.00/min)
Mod RDNs:                    0       (0.00/sec) (0.00/min)
Compares:                    0       (0.00/sec) (0.00/min)
Binds:                       4638    (0.17/sec) (10.09/min)

Proxied Auth Operations:     0
Persistent Searches:         2
Internal Operations:         0
Entry Operations:            0
Extended Operations:         39573
Abandoned Requests:          343
Smart Referrals Received:    0

VLV Operations:              0
VLV Unindexed Searches:      0
VLV Unindexed Components:    0
SORT Operations:             0

Entire Search Base Queries:  2
Paged Searches:              16982   <-- This could be the cause
Unindexed Searches:          0
Unindexed Components:        42524

Broken Pipes:                0
Connections Reset By Peer:   0
Resource Unavailable:        0
Max BER Size Exceeded:       0

----- Current Open Connection IDs -----
Conn Number: 3792 (10.10.10.7)
Conn Number: 3735 (10.10.40.30)
Conn Number: 3728 (10.10.41.21)
Conn Number: 2    (10.10.41.4)
Conn Number: 3693 (10.10.41.4)
Conn Number: 3820 (10.10.20.201)
Conn Number: 3756 (192.168.183.86)
Conn Number: 3731 (192.168.183.92)
Conn Number: 3828 (local)
Conn Number: 3771 (10.0.1.74)
Conn Number: 3815 (10.10.41.27)
Conn Number: 3738 (10.0.0.203)
Conn Number: 3802 (10.10.20.3)
Conn Number: 53   (10.10.41.4)
Conn Number: 3750 (10.10.40.25)
Conn Number: 3647 (10.10.10.31)
Conn Number: 3    (10.10.41.4)
Conn Number: 3703 (10.10.20.12)
Conn Number: 3679 (10.0.1.41)
Conn Number: 52   (10.10.41.4)
Conn Number: 3805 (10.10.30.27)
Conn Number: 3794 (10.0.1.76)
Conn Number: 5    (local)

----- Total Connection Codes -----
B1   3487   Bad Ber Tag Encountered
U1   2499   Cleanly Closed Connections
A1   21     Client Aborted Connections
B4   10     Server failed to flush data (response) back to Client
That first behaviour is an expected part of nunc-stans. It's designed so that idle threads yield and recheck "in case" they missed a notification sometime between sleeps. Not my idea, but it indicates an issue with nspr notifications ... I never investigated though.
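For illustration only, here is a minimal sketch (not the actual nunc-stans source) of the kind of idle loop that produces the epoll_wait(..., 0) / sched_yield() pattern in the strace output above: the thread polls with a zero timeout, yields, and rechecks, so an idle worker looks like a busy syscall loop.

/* Illustrative sketch only -- NOT the actual nunc-stans code.  Shows how
 * a "poll with zero timeout, then yield and recheck" idle loop shows up
 * as a 100% CPU spin under strace. */
#include <sched.h>
#include <stdio.h>
#include <sys/epoll.h>

#define MAX_EVENTS 32

static void
idle_loop(int epfd)
{
    struct epoll_event events[MAX_EVENTS];

    for (;;) {
        /* Zero timeout: returns immediately when nothing is ready,
         * i.e. the epoll_wait(..., 32, 0) = 0 lines in the strace. */
        int n = epoll_wait(epfd, events, MAX_EVENTS, 0);
        if (n > 0) {
            continue; /* dispatch the ready jobs here */
        }
        /* Nothing ready: yield and recheck "in case" a notification was
         * missed -- this is the repeated sched_yield() = 0. */
        sched_yield();
    }
}

int
main(void)
{
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        perror("epoll_create1");
        return 1;
    }
    idle_loop(epfd); /* never returns; strace it to see the spin */
    return 0;
}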
The second part is the likely cause, but I'm not sure what causes it. A resource leak?
I wonder if this could be a consequence of https://pagure.io/389-ds-base/issue/49765. That bug was observed with async operations. Do you know if some of the requests are async?
The original symptom was some hanging connection, but I do not know if it was consuming CPU.
@mreynolds, am I correct that #49765 is not included in 1.3.7.5-24?
There are paged searches in the logconv output above
It is not in any 1.3.7 builds - there are no bugzillas listed in the ticket. @tbordaz, do you know if there is a BZ?
Metadata Update from @mreynolds:
- Issue set to the milestone: 1.3.7.0
This ticket #49815 is not due to https://pagure.io/389-ds-base/issue/49848.
#49848 exists in both 1.3.7.5-21 and 1.3.7.5-24
Any progress here? Anything else I can provide to help?
@orion I think not, as there hasn't been much activity. Without a reproducer it's hard to get to the bottom of it. Are you still seeing this issue regularly?
Metadata Update from @tbordaz:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1605554
@orion, the root cause of that bug has likely been identified as a connection leak.
To close a connection, DS schedules a closure handler. If another handler was already scheduled at that time, DS cancels it before registering the closure handler. Unfortunately the cancellation may fail (because of the state of the handler), so we end up with two handlers. That situation is not supported (or does not work well), and the closure handler is never called.
At the end of the day, the event thread is continuously woken up for that connection because of the closure handler that stays pending.
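To make the described sequence concrete, here is a rough sketch of that race. All of the names below (conn_t, handler_t, cancel_handler, schedule_handler, close_connection) are placeholders for illustration, not the real 389-ds-base / nunc-stans API:

/* Hypothetical sketch of the described closure-handler race.
 * None of these names are the actual 389-ds-base / nunc-stans API. */
#include <stdio.h>
#include <stdlib.h>

typedef struct conn conn_t;
typedef void (handler_fn)(conn_t *);

typedef struct handler {
    handler_fn *fn;
    int armed; /* 1 = already running/armed, cannot be cancelled */
} handler_t;

struct conn {
    handler_t *pending; /* e.g. a read-ready handler */
    handler_t *closure; /* handler that is supposed to close the conn */
};

static handler_t *
schedule_handler(handler_fn *fn, int armed)
{
    handler_t *h = calloc(1, sizeof(*h));
    h->fn = fn;
    h->armed = armed;
    return h;
}

/* Cancellation fails when the handler is already armed/running. */
static int
cancel_handler(handler_t *h)
{
    if (h->armed) {
        return -1;
    }
    free(h);
    return 0;
}

static void
close_conn_handler(conn_t *c)
{
    (void)c;
    printf("closure handler ran, connection released\n");
}

static void
close_connection(conn_t *c)
{
    if (c->pending != NULL) {
        if (cancel_handler(c->pending) == 0) {
            c->pending = NULL;
        } else {
            printf("cancel failed: previous handler still attached\n");
            /* BUG: fall through and register the closure handler anyway */
        }
    }
    /* If the cancellation failed, two handlers are now attached to the
     * same connection.  In the real server the closure handler never
     * gets invoked in that state, the connection leaks, and the event
     * thread keeps waking up for it -- the observed 100% CPU spin. */
    c->closure = schedule_handler(close_conn_handler, 0);
}

int
main(void)
{
    /* Simulate a connection whose read handler is already armed. */
    conn_t c = { .pending = schedule_handler(NULL, 1), .closure = NULL };
    close_connection(&c);
    return 0;
}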
Could you check, when the 100% CPU occurs, whether the number of open connections is increasing?
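One rough way to watch for that (a sketch, not an official tool): periodically count the open file descriptors of the ns-slapd process via /proc/<pid>/fd. An fd count that keeps climbing while the CPU is pinned would point at leaked connections, and it would also match the earlier "Too many open files" BDB errors. The connection counters under cn=monitor should show the same trend from the LDAP side.

/* Minimal sketch: count the open file descriptors of a process by
 * listing /proc/<pid>/fd.  Run it repeatedly against the ns-slapd PID
 * and watch whether the count keeps growing. */
#include <dirent.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
    char path[64];
    struct dirent *de;
    long count = 0;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }
    snprintf(path, sizeof(path), "/proc/%s/fd", argv[1]);

    DIR *d = opendir(path);
    if (d == NULL) {
        perror(path);
        return 1;
    }
    while ((de = readdir(d)) != NULL) {
        if (de->d_name[0] != '.') { /* skip "." and ".." */
            count++;
        }
    }
    closedir(d);
    printf("%s: %ld open fds\n", path, count);
    return 0;
}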
A possibility to investigate would be to test whether a handler already exists before registering the read handler ns_handle_pr_read_ready.
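A sketch of that idea, reusing the placeholder conn_t/handler_t types from the closure-race sketch above (the real code would do this around ns_handle_pr_read_ready, but register_read_handler and the pending/closure fields here are made up for illustration):

/* Hypothetical guard: refuse to register a new read handler while any
 * other handler is still attached to the connection.  Placeholder code
 * reusing the conn_t/handler_t/schedule_handler sketch above. */
static int
register_read_handler(conn_t *c, handler_fn *read_ready)
{
    if (c->pending != NULL || c->closure != NULL) {
        /* A handler already exists for this connection; do not stack a
         * second one on top of it.  Report the conflict (or reuse the
         * existing handler) instead of silently registering another. */
        return -1;
    }
    c->pending = schedule_handler(read_ready, 0);
    return 0;
}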
For the moment, we decided to disable NS (nunc-stans), as we have not identified a fix for this issue.
@tbordaz Reminder that there is that NS patch awaiting review.
For now disabling nunc-stans seems like the best course of action.
Just to state that I do not appear to be seeing this with 389-ds-base-1.3.8.4-15.el7
Metadata Update from @mreynolds:
- Issue close_status updated to: duplicate
- Issue status updated to: Closed (was: Open)
Starting in 389-ds-base-1.3.8.4-10, nunc-stans has been disabled by default. Since we are aware of the many issues with nunc-stans, which are being tracked in other tickets/bugzillas, I closed this out.
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2874
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: duplicate)