#49192 Directory Server hangs and utilizes high CPU when deleting a non-leaf entry
Closed: wontfix 7 years ago Opened 7 years ago by mreynolds.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1433850

Please note that this Bug is private and may not be accessible as it contains confidential Red Hat customer information.

Description of problem: The DS instance hangs when trying to delete a suffix.
This is observed while the dsCoreLong categorized tests are being executed. The
dsCoreLong categorized tests usually take no more than 10 hrs to complete. With
389-ds-base-1.3.6.1-x, the execution runs for more than 3 days and never
completes. Hence, I logged in to the system to check what was going on. It looks
like the deletion of the suffix has been hanging and the CPU utilization goes
up to 100%.

Version-Release number of selected component (if applicable):
389-ds-base-1.3.6.1-3

How reproducible: Consistently with 1.3.6.x

Steps to Reproduce:
1. Run dsCoreLong categorized tests.
2. Wait for 10 to 12 hrs to finish.
3. If execution has not completed within 12 hrs, log in to the test machine.
4. Check for error message in accountinact_reliab.run.out.
5. It shows "ldap_delete: Operation not allowed on nonleaf"

Actual results: Deletion of the suffix hangs forever.

Expected results: Suffix deletion should be completed without any server hang.

Additional info: Not reproducible with 389-ds-base-1.3.5.x builds. Also, when I
tried to restart the server, it was not responding.

A few lines from the accountinact_reliab.run.out file:
DelIndex_DefaultList: DONE
DelIndex cn=index,cn=myairius.com,cn=ldbm database,cn=plugins,cn=config (leaf)
ldap_delete: Operation not allowed on nonleaf
backend for dc=airius,dc=com is myairius.com
ldap_result: Can't contact LDAP server (-1)
TS_cleanup: Start
SaveLog: successfully saved logs

A few lines from the DS error logs:
==> /var/log/dirsrv/slapd-deftestinst/errors <==
[20/Mar/2017:02:54:40.113410843 -0400] - INFO - main - 389-Directory/1.3.6.1
B2017.069.1413 starting up
[20/Mar/2017:02:54:40.151251686 -0400] - NOTICE - ldbm_back_start - found
3881980k physical memory
[20/Mar/2017:02:54:40.152143831 -0400] - NOTICE - ldbm_back_start - found
388196k avaliable
[20/Mar/2017:02:54:40.153060938 -0400] - NOTICE - ldbm_back_start - total cache
size: 439677747 B;
[20/Mar/2017:02:54:40.154994458 -0400] - NOTICE - dblayer_start - Detected
Disorderly Shutdown last time Directory Server was running, recovering
database.
[20/Mar/2017:02:54:40.342359431 -0400] - INFO - slapd_daemon - slapd started.
Listening on All Interfaces port 21835 for LDAP requests
[20/Mar/2017:02:55:03.946709660 -0400] - INFO - op_thread_cleanup - slapd
shutting down - signaling operation threads - op stack size 1 max work q size 1
max work q stack size 1
[20/Mar/2017:02:55:03.948811205 -0400] - INFO - slapd_daemon - slapd shutting
down - closing down internal subsystems and plugins
[20/Mar/2017:02:55:03.956834991 -0400] - INFO - dblayer_pre_close - Waiting for
4 database threads to stop



(gdb) p *be->be_lock
$3 = {__data = {__lock = 0, __nr_readers = 1313, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 1, __writer = 0, __shared = 0, 
    __pad1 = 0, __pad2 = 0, __flags = 0}, __size = "\000\000\000\000!\005", '\000' <repeats 14 times>, "\001", '\000' <repeats 34 times>, __align = 5639292059648}


Thread 33
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
#1  0x00007f5958eb32ea in slapi_rwlock_wrlock (rwlock=<optimized out>) at ldap/servers/slapd/slapi2nspr.c:238
#2  0x00007f5958e3ed9c in slapi_be_Wlock (be=<optimized out>) at ldap/servers/slapd/backend_manager.c:396
#3  0x00007f5958e7ce66 in slapi_mtn_be_stopping (be=0x55eb64fe5400) at ldap/servers/slapd/mapping_tree.c:3736
#4  0x00007f594d3c607a in ldbm_instance_delete_instance_entry_callback (pb=<optimized out>, entryBefore=<optimized out>, 
    entryAfter=<optimized out>, returncode=0x7f5926499428, returntext=0x7f5926499470 "", arg=<optimized out>)
    at ldap/servers/slapd/back-ldbm/ldbm_instance_config.c:1239
#5  0x00007f5958e4a369 in dse_call_callback (pb=pb@entry=0x7f5926499a50, operation=operation@entry=32, flags=flags@entry=1, 
    entryBefore=entryBefore@entry=0x55eb66d42300, entryAfter=entryAfter@entry=0x0, returncode=returncode@entry=0x7f5926499428, 
    returntext=returntext@entry=0x7f5926499470 "", pdse=<optimized out>) at ldap/servers/slapd/dse.c:2637
#6  0x00007f5958e4d7a8 in dse_delete (pb=0x7f5926499a50) at ldap/servers/slapd/dse.c:2499
#7  0x00007f5958e43f50 in op_shared_delete (pb=pb@entry=0x7f5926499a50) at ldap/servers/slapd/delete.c:331
#8  0x00007f5958e442d2 in do_delete (pb=pb@entry=0x7f5926499a50) at ldap/servers/slapd/delete.c:97
#9  0x000055eb6256f4f2 in connection_dispatch_operation (pb=0x7f5926499a50, op=0x55eb65e541a0, conn=0x55eb660226a8)
    at ldap/servers/slapd/connection.c:623
#10 connection_threadmain () at ldap/servers/slapd/connection.c:1772
#11 0x00007f59571f69bb in _pt_root (arg=0x55eb64f409a0) at ../../../nspr/pr/src/pthreads/ptthread.c:216
#12 0x00007f5956b96dc5 in start_thread (arg=0x7f592649a700) at pthread_create.c:308
#13 0x00007f595647834d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


This occurs after an error 66 (LDAP_NOT_ALLOWED_ON_NONLEAF) while trying to delete the index container on the same suffix; not sure whether that is related.

Metadata Update from @mreynolds:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1433850

7 years ago

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

7 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to new
- Custom field type adjusted to defect

7 years ago

Metadata Update from @mreynolds:
- Custom field origin adjusted to QE
- Custom field reviewstatus adjusted to review (was: new)

7 years ago

I'm wondering if the following two error cases also need to call slapi_be_Unlock(be)?
742 if (!bind_target_entry) {
743 slapi_pblock_set(pb, SLAPI_PB_RESULT_TEXT, "No such entry");
744 send_ldap_result(pb, LDAP_INVALID_CREDENTIALS, NULL, "", 0, NULL);
745 goto free_and_return;
746 }

770 if (-1 == myrc) {
771     /* need_new_pw failed; need_new_pw already send_ldap_result in it. */
772     goto free_and_return;
773 }

For the second one, maybe we could move this unlock prior to line 767, if (auth_response_requested)?
774 if (be) {
775 slapi_be_Unlock(be);
776 }

Actually this is looking like more of a problem in pw_verify_be_dn() which is where the lock is taken, but not always unlocked. @firstyear do you know why you did it that way? Basically you only unlock it if there is an error, but it seems like we should always unlock it. I'm sure there was a reason for it, so I'd like to know what that was before proceeding with a fix.

It is indeed a mistake. It should be unlocked before return rc;

I think it will only affect a sasl bind though because that's the only path where we use pw_verify_be_dn. I'll make a patch now.

Actually this was my original fix to the issue, but in fact we ALWAYS need to unlock it - always! The be lock is taken when we select the mapping tree backend, so no matter what the result code is, it still needs to be unlocked. I'll be attaching my patch shortly.

Looks good to me. (but I will leave the privilege to ack to William ;).

(but I will leave the privilege to ack to William ;).
acked.

Metadata Update from @nhosoi:
- Custom field reviewstatus adjusted to ack (was: review)

7 years ago

I'm happy with this fix :)

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/2251

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for any inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago
