Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1433850
Please note that this Bug is private and may not be accessible as it contains confidential Red Hat customer information.
Description of problem: DS instance hangs when its trying to delete a suffix. This is observed during dsCoreLong categorized tests being executed. dsCoreLong categorized tests usually takes not more than 10hrs to complete. With 389-ds-base-1.3.6.1-x, the execution goes for more than 3 days and it never completes. Hence, I logged in to the system to check whats going on. It looks like the deletion of suffix has been hanging and the CPU the utilization goes up to 100%. Version-Release number of selected component (if applicable): 389-ds-base-1.3.6.1-3 How reproducible: Consistently with 1.3.6.x Steps to Reproduce: 1. Run dsCoreLong categorized tests. 2. Wait for 10 to 12 hrs to finish. 3. If execution not completed within 12 hrs, login to the test machine. 4. Check for error message in accountinact_reliab.run.out. 5. It shows "ldap_delete: Operation not allowed on nonleaf" Actual results: Deletion of suffix hangs for ever. Expected results: Suffix deletion should be completed without any server hang. Additional info: Not reproducible with 389-ds-base-1.3.5.x builds. Also, when I tried to restart the server, it wasn't responding. Few lines from accountinact_reliab.run.out file: DelIndex_DefaultList: DONE DelIndex cn=index,cn=myairius.com,cn=ldbm database,cn=plugins,cn=config (leaf) ldap_delete: Operation not allowed on nonleaf backend for dc=airius,dc=com is myairius.com ldap_result: Can't contact LDAP server (-1) TS_cleanup: Start SaveLog: successfully saved logs Few lines from DS error logs: ==> /var/log/dirsrv/slapd-deftestinst/errors <== [20/Mar/2017:02:54:40.113410843 -0400] - INFO - main - 389-Directory/1.3.6.1 B2017.069.1413 starting up [20/Mar/2017:02:54:40.151251686 -0400] - NOTICE - ldbm_back_start - found 3881980k physical memory [20/Mar/2017:02:54:40.152143831 -0400] - NOTICE - ldbm_back_start - found 388196k avaliable [20/Mar/2017:02:54:40.153060938 -0400] - NOTICE - ldbm_back_start - total cache size: 439677747 B; [20/Mar/2017:02:54:40.154994458 -0400] - NOTICE - dblayer_start - Detected Disorderly Shutdown last time Directory Server was running, recovering database. [20/Mar/2017:02:54:40.342359431 -0400] - INFO - slapd_daemon - slapd started. Listening on All Interfaces port 21835 for LDAP requests [20/Mar/2017:02:55:03.946709660 -0400] - INFO - op_thread_cleanup - slapd shutting down - signaling operation threads - op stack size 1 max work q size 1 max work q stack size 1 [20/Mar/2017:02:55:03.948811205 -0400] - INFO - slapd_daemon - slapd shutting down - closing down internal subsystems and plugins [20/Mar/2017:02:55:03.956834991 -0400] - INFO - dblayer_pre_close - Waiting for 4 database threads to stop (gdb) p *be->be_lock $3 = {__data = {__lock = 0, __nr_readers = 1313, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 1, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = "\000\000\000\000!\005", '\000' <repeats 14 times>, "\001", '\000' <repeats 34 times>, __align = 5639292059648} Thread 33 #0 pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85 #1 0x00007f5958eb32ea in slapi_rwlock_wrlock (rwlock=<optimized out>) at ldap/servers/slapd/slapi2nspr.c:238 #2 0x00007f5958e3ed9c in slapi_be_Wlock (be=<optimized out>) at ldap/servers/slapd/backend_manager.c:396 #3 0x00007f5958e7ce66 in slapi_mtn_be_stopping (be=0x55eb64fe5400) at ldap/servers/slapd/mapping_tree.c:3736 #4 0x00007f594d3c607a in ldbm_instance_delete_instance_entry_callback (pb=<optimized out>, entryBefore=<optimized out>, entryAfter=<optimized out>, returncode=0x7f5926499428, returntext=0x7f5926499470 "", arg=<optimized out>) at ldap/servers/slapd/back-ldbm/ldbm_instance_config.c:1239 #5 0x00007f5958e4a369 in dse_call_callback (pb=pb@entry=0x7f5926499a50, operation=operation@entry=32, flags=flags@entry=1, entryBefore=entryBefore@entry=0x55eb66d42300, entryAfter=entryAfter@entry=0x0, returncode=returncode@entry=0x7f5926499428, returntext=returntext@entry=0x7f5926499470 "", pdse=<optimized out>) at ldap/servers/slapd/dse.c:2637 #6 0x00007f5958e4d7a8 in dse_delete (pb=0x7f5926499a50) at ldap/servers/slapd/dse.c:2499 #7 0x00007f5958e43f50 in op_shared_delete (pb=pb@entry=0x7f5926499a50) at ldap/servers/slapd/delete.c:331 #8 0x00007f5958e442d2 in do_delete (pb=pb@entry=0x7f5926499a50) at ldap/servers/slapd/delete.c:97 #9 0x000055eb6256f4f2 in connection_dispatch_operation (pb=0x7f5926499a50, op=0x55eb65e541a0, conn=0x55eb660226a8) at ldap/servers/slapd/connection.c:623 #10 connection_threadmain () at ldap/servers/slapd/connection.c:1772 #11 0x00007f59571f69bb in _pt_root (arg=0x55eb64f409a0) at ../../../nspr/pr/src/pthreads/ptthread.c:216 #12 0x00007f5956b96dc5 in start_thread (arg=0x7f592649a700) at pthread_create.c:308 #13 0x00007f595647834d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113 Occurs after an error 66, trying to delete the index container on the same suffix - not sure if it's related or not
Metadata Update from @mreynolds: - Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1433850
Metadata Update from @mreynolds: - Issue assigned to mreynolds
This was a regression introduced by this ticket:
https://pagure.io/389-ds-base/issue/49051
Metadata Update from @mreynolds: - Custom field reviewstatus adjusted to new - Custom field type adjusted to defect
<img alt="0001-Issue-49192-Deleting-suffix-can-hang-server.patch" src="/389-ds-base/issue/raw/files/5746f85de1148a8f606a04792df6baa1620c182448d32875be045e1e561c39fc-0001-Issue-49192-Deleting-suffix-can-hang-server.patch" />
Metadata Update from @mreynolds: - Custom field origin adjusted to QE - Custom field reviewstatus adjusted to review (was: new)
I'm wondering if the following 2 error cases also need to call slapi_be_Unlock(be)? 742 if (!bind_target_entry) { 743 slapi_pblock_set(pb, SLAPI_PB_RESULT_TEXT, "No such entry"); 744 send_ldap_result(pb, LDAP_INVALID_CREDENTIALS, NULL, "", 0, NULL); 745 goto free_and_return; 746 }
770 if (-1 == myrc) { 771 / need_new_pw failed; need_new_pw already send_ldap_result in it. / 772 goto free_and_return; 773 }
For the second one, maybe we could move this unlock prior to the line 767 if (auth_response_requested) ? 774 if (be) { 775 slapi_be_Unlock(be); 776 }
Actually this is looking like more of a problem in pw_verify_be_dn() which is where the lock is taken, but not always unlocked. @firstyear do you know why you did it that way? Basically you only unlock it if there is an error, but it seems like we should always unlock it. I'm sure there was a reason for it, so I'd like to know what that was before proceeding with a fix.
It is indeed a mistake. It should be unlocked before return rc;
I think it will only affect a sasl bind though because that's the only path where we use pw_verify_be_dn. I'll make a patch now.
Does this fix it? <img alt="0001-Ticket-49192-DS-hangs-deleting-non-leaf-entry.patch" src="/389-ds-base/issue/raw/files/7693c3691890d0a6103bff14eeedf0fc4af591003763535ea8b04d05bc80b6b0-0001-Ticket-49192-DS-hangs-deleting-non-leaf-entry.patch" />
Actually this was my original fix to the issue, but in fact, we ALWAYS need to unlock it - always! The be lock is taken when we select the mapping tree backend, so matter what the result code is it still needs to be unlocked. I'll be attaching my patch shortly.
Note - this hang has nothing to do with sasl binds - see the testcase:
<img alt="0001-Issue-49192-Deleting-suffix-can-hang-server.patch" src="/389-ds-base/issue/raw/files/2f03bc72a9c13279874513a8c21277546d878520733d4e4dea2feb317b383c4b-0001-Issue-49192-Deleting-suffix-can-hang-server.patch" />
Looks good to me. (but I will leave the privilege to ack to William ;).
(but I will leave the privilege to ack to William ;). acked.
Metadata Update from @nhosoi: - Custom field reviewstatus adjusted to ack (was: review)
092f08a..3c869cc master -> master
I'm happy with this fix :)
Metadata Update from @mreynolds: - Issue close_status updated to: fixed - Issue status updated to: Closed (was: Open)
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/2251
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
subscribe
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: fixed)
Login to comment on this ticket.