When a nunc-stans job is rearmed, the job is enqueued (event_q) and the event thread is notified.
The event thread then dequeues the job, locks it, launches its callback, and unlocks the job. The problem is that during rearm, the job lock is released only after the notification. So if the event thread is scheduled immediately upon notification, and the rearmed job is the one it dequeues, the event thread will "hang" until the thread running the rearm unlocks the job.
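To illustrate the ordering issue, here is a minimal sketch using plain pthreads. It is not the nunc-stans code: demo_job_t, demo_queue_t, demo_rearm and demo_run are hypothetical names, and a one-slot queue with a condition variable stands in for event_q and the wakeup. The rearm path releases the job lock before it enqueues and notifies, so the event thread should never find the job locked. The event thread probes with pthread_mutex_trylock only to count contention.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical, simplified job; not the real nunc-stans job structure. */
typedef struct demo_job {
    pthread_mutex_t lock; /* protects armed and runs */
    int armed;
    int runs;             /* number of times the "callback" ran */
} demo_job_t;

/* Hypothetical one-slot queue standing in for event_q plus the wakeup. */
typedef struct demo_queue {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    demo_job_t *pending;  /* NULL when the queue is empty */
    int stop;
    int done;             /* callbacks completed so far */
    int contended;        /* times the event thread found the job locked */
} demo_queue_t;

static void *demo_event_thread(void *arg)
{
    demo_queue_t *q = arg;
    for (;;) {
        pthread_mutex_lock(&q->lock);
        while (q->pending == NULL && !q->stop)
            pthread_cond_wait(&q->cond, &q->lock);
        if (q->pending == NULL) { /* stop requested, nothing left to run */
            pthread_mutex_unlock(&q->lock);
            return NULL;
        }
        demo_job_t *job = q->pending;
        q->pending = NULL;
        pthread_mutex_unlock(&q->lock);

        /* Probe first so contention can be counted, then lock for real. */
        int contended = 0;
        if (pthread_mutex_trylock(&job->lock) != 0) {
            contended = 1;
            pthread_mutex_lock(&job->lock);
        }
        job->armed = 0;
        job->runs++; /* stands in for launching the job callback */
        pthread_mutex_unlock(&job->lock);

        pthread_mutex_lock(&q->lock);
        q->contended += contended;
        q->done++;
        pthread_cond_broadcast(&q->cond);
        pthread_mutex_unlock(&q->lock);
    }
}

/* Rearm with the job lock released BEFORE the enqueue and notification.
 * The ticket describes the opposite order (notify first, unlock later),
 * which is what lets the event thread block on job->lock. */
static void demo_rearm(demo_queue_t *q, demo_job_t *job)
{
    pthread_mutex_lock(&job->lock);
    job->armed = 1;
    pthread_mutex_unlock(&job->lock);

    pthread_mutex_lock(&q->lock);
    q->pending = job;
    pthread_cond_broadcast(&q->cond);
    pthread_mutex_unlock(&q->lock);
}

/* Rearms one job `rounds` times and returns the contention count. */
int demo_run(int rounds)
{
    demo_job_t job = { .armed = 0, .runs = 0 };
    demo_queue_t q = { .pending = NULL, .stop = 0, .done = 0, .contended = 0 };
    pthread_t tid;

    pthread_mutex_init(&job.lock, NULL);
    pthread_mutex_init(&q.lock, NULL);
    pthread_cond_init(&q.cond, NULL);
    int rc = pthread_create(&tid, NULL, demo_event_thread, &q);
    assert(rc == 0);

    for (int i = 1; i <= rounds; i++) {
        demo_rearm(&q, &job);
        pthread_mutex_lock(&q.lock);
        while (q.done < i) /* wait until the callback for this round ran */
            pthread_cond_wait(&q.cond, &q.lock);
        pthread_mutex_unlock(&q.lock);
    }

    pthread_mutex_lock(&q.lock);
    q.stop = 1;
    pthread_cond_broadcast(&q.cond);
    pthread_mutex_unlock(&q.lock);
    pthread_join(tid, NULL);

    assert(job.runs == rounds);
    return q.contended;
}
```

Calling demo_run(100) from a main() and linking with -pthread should report no contention, because the job is always unlocked by the time it becomes visible to the event thread. This only shows the ordering idea and is not necessarily how the issue was addressed upstream.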
Signature of the bug is
Thread 54 (Thread 0x7feb416ee700 (LWP 1305)):
#0 0x00007feb7be2f42d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007feb7be2ade6 in _L_lock_870 () from /lib64/libpthread.so.0
#2 0x00007feb7be2acdf in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007feb7e5c96cc in update_event () from /usr/lib64/dirsrv/libnunc-stans.so.0
#4 0x00007feb7e5c99ed in get_new_event_requests.isra.3 () from /usr/lib64/dirsrv/libnunc-stans.so.0
#5 0x00007feb7e5c9ad1 in wakeup_cb () from /usr/lib64/dirsrv/libnunc-stans.so.0
#6 0x00007feb7e5c9bf9 in work_job_execute () from /usr/lib64/dirsrv/libnunc-stans.so.0
#7 0x00007feb7e5ca9cb in event_cb () from /usr/lib64/dirsrv/libnunc-stans.so.0
#8 0x00007feb7bbe9a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#9 0x00007feb7e5cac4e in ns_event_fw_loop () from /usr/lib64/dirsrv/libnunc-stans.so.0
#10 0x00007feb7e5c9a39 in event_loop_thread_func () from /usr/lib64/dirsrv/libnunc-stans.so.0
#11 0x00007feb7be28e25 in start_thread () from /lib64/libpthread.so.0
#12 0x00007feb7b70a34d in clone () from /lib64/libc.so.6

Thread 17 (Thread 0x7feb2eec9700 (LWP 1342)):
#0 0x00007feb7b6eee47 in sched_yield () from /lib64/libc.so.6
#1 0x00007feb7c48931d in PR_Sleep () from /lib64/libnspr4.so
#2 0x00007feb7e5c9905 in internal_ns_job_rearm () from /usr/lib64/dirsrv/libnunc-stans.so.0
#3 0x00007feb7e5ca0c2 in ns_add_io_timeout_job () from /usr/lib64/dirsrv/libnunc-stans.so.0
#4 0x000056012824254b in ns_connection_post_io_or_closing ()
#5 0x000056012823f67a in connection_threadmain ()
#6 0x00007feb7c4889bb in _pt_root () from /lib64/libnspr4.so
#7 0x00007feb7be28e25 in start_thread () from /lib64/libpthread.so.0
#8 0x00007feb7b70a34d in clone () from /lib64/libc.so.6
Consequences: The problem is transient: once the rearming thread releases the job lock, the event thread can continue. However, while it waits for the lock, the server is likely not to process incoming events, which are then handled with a delay. There were reports that DS may transiently miss new connections (accept). That is a possibility, but I am not sure of it.
Since 7.4 (1.3.6)
No reproducer identified so far.
DS may appear to hang for a short period (not processing new connections, new requests, signals, timers, ...).
This period should be very short, so it is not clear whether it can have a significant impact.
The event thread should never wait for a lock.
Metadata Update from @tbordaz: - Issue assigned to tbordaz
Metadata Update from @mreynolds: - Custom field component adjusted to None - Custom field origin adjusted to None - Custom field reviewstatus adjusted to None - Custom field type adjusted to None - Custom field version adjusted to None - Issue set to the milestone: 1.4.0
https://pagure.io/389-ds-base/pull-request/49869 is canceled in favor of https://pagure.io/389-ds-base/pull-request/49636
Closing this ticket as "will not fix".
Metadata Update from @tbordaz: - Issue close_status updated to: wontfix - Issue status updated to: Closed (was: Open)
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2907
If you want to receive further updates on the issue, please navigate to the github issue and click the subscribe button.
Thank you for understanding. We apologize for any inconvenience.