Issue #49192: Directory Server hangs and utilizes high CPU when deleting a non-leaf entry - 389-ds-base

389-ds-base

#49192 Directory Server hangs and utilizes high CPU when deleting a non-leaf entry

Closed: wontfix 7 years ago Opened 7 years ago by mreynolds.

Ticket was cloned from Red Hat Bugzilla (product Red Hat Enterprise Linux 7): Bug 1433850

Please note that this Bug is private and may not be accessible as it contains confidential Red Hat customer information.

Description of problem: DS instance hangs when its trying to delete a suffix.
This is observed during dsCoreLong categorized tests being executed. dsCoreLong
categorized tests usually takes not more than 10hrs to complete. With
389-ds-base-1.3.6.1-x, the execution goes for more than 3 days and it never
completes. Hence, I logged in to the system to check whats going on. It looks
like the deletion of suffix has been hanging and the CPU the utilization goes
up to 100%.

Version-Release number of selected component (if applicable):
389-ds-base-1.3.6.1-3

How reproducible: Consistently with 1.3.6.x

Steps to Reproduce:
1. Run dsCoreLong categorized tests.
2. Wait for 10 to 12 hrs to finish.
3. If execution not completed within 12 hrs, login to the test machine.
4. Check for error message in accountinact_reliab.run.out.
5. It shows "ldap_delete: Operation not allowed on nonleaf"

Actual results: Deletion of suffix hangs for ever.

Expected results: Suffix deletion should be completed without any server hang.

Additional info: Not reproducible with 389-ds-base-1.3.5.x builds. Also, when I
tried to restart the server, it wasn't responding.

Few lines from accountinact_reliab.run.out file:
DelIndex_DefaultList: DONE
DelIndex cn=index,cn=myairius.com,cn=ldbm database,cn=plugins,cn=config (leaf)
ldap_delete: Operation not allowed on nonleaf
backend for dc=airius,dc=com is myairius.com
ldap_result: Can't contact LDAP server (-1)
TS_cleanup: Start
SaveLog: successfully saved logs

Few lines from DS error logs:
==> /var/log/dirsrv/slapd-deftestinst/errors <==
[20/Mar/2017:02:54:40.113410843 -0400] - INFO - main - 389-Directory/1.3.6.1
B2017.069.1413 starting up
[20/Mar/2017:02:54:40.151251686 -0400] - NOTICE - ldbm_back_start - found
3881980k physical memory
[20/Mar/2017:02:54:40.152143831 -0400] - NOTICE - ldbm_back_start - found
388196k avaliable
[20/Mar/2017:02:54:40.153060938 -0400] - NOTICE - ldbm_back_start - total cache
size: 439677747 B;
[20/Mar/2017:02:54:40.154994458 -0400] - NOTICE - dblayer_start - Detected
Disorderly Shutdown last time Directory Server was running, recovering
database.
[20/Mar/2017:02:54:40.342359431 -0400] - INFO - slapd_daemon - slapd started.
Listening on All Interfaces port 21835 for LDAP requests
[20/Mar/2017:02:55:03.946709660 -0400] - INFO - op_thread_cleanup - slapd
shutting down - signaling operation threads - op stack size 1 max work q size 1
max work q stack size 1
[20/Mar/2017:02:55:03.948811205 -0400] - INFO - slapd_daemon - slapd shutting
down - closing down internal subsystems and plugins
[20/Mar/2017:02:55:03.956834991 -0400] - INFO - dblayer_pre_close - Waiting for
4 database threads to stop



(gdb) p *be->be_lock
$3 = {__data = {__lock = 0, __nr_readers = 1313, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 1, __writer = 0, __shared = 0, 
    __pad1 = 0, __pad2 = 0, __flags = 0}, __size = "\000\000\000\000!\005", '\000' <repeats 14 times>, "\001", '\000' <repeats 34 times>, __align = 5639292059648}


Thread 33
#0  pthread_rwlock_wrlock () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_rwlock_wrlock.S:85
#1  0x00007f5958eb32ea in slapi_rwlock_wrlock (rwlock=<optimized out>) at ldap/servers/slapd/slapi2nspr.c:238
#2  0x00007f5958e3ed9c in slapi_be_Wlock (be=<optimized out>) at ldap/servers/slapd/backend_manager.c:396
#3  0x00007f5958e7ce66 in slapi_mtn_be_stopping (be=0x55eb64fe5400) at ldap/servers/slapd/mapping_tree.c:3736
#4  0x00007f594d3c607a in ldbm_instance_delete_instance_entry_callback (pb=<optimized out>, entryBefore=<optimized out>, 
    entryAfter=<optimized out>, returncode=0x7f5926499428, returntext=0x7f5926499470 "", arg=<optimized out>)
    at ldap/servers/slapd/back-ldbm/ldbm_instance_config.c:1239
#5  0x00007f5958e4a369 in dse_call_callback (pb=pb@entry=0x7f5926499a50, operation=operation@entry=32, flags=flags@entry=1, 
    entryBefore=entryBefore@entry=0x55eb66d42300, entryAfter=entryAfter@entry=0x0, returncode=returncode@entry=0x7f5926499428, 
    returntext=returntext@entry=0x7f5926499470 "", pdse=<optimized out>) at ldap/servers/slapd/dse.c:2637
#6  0x00007f5958e4d7a8 in dse_delete (pb=0x7f5926499a50) at ldap/servers/slapd/dse.c:2499
#7  0x00007f5958e43f50 in op_shared_delete (pb=pb@entry=0x7f5926499a50) at ldap/servers/slapd/delete.c:331
#8  0x00007f5958e442d2 in do_delete (pb=pb@entry=0x7f5926499a50) at ldap/servers/slapd/delete.c:97
#9  0x000055eb6256f4f2 in connection_dispatch_operation (pb=0x7f5926499a50, op=0x55eb65e541a0, conn=0x55eb660226a8)
    at ldap/servers/slapd/connection.c:623
#10 connection_threadmain () at ldap/servers/slapd/connection.c:1772
#11 0x00007f59571f69bb in _pt_root (arg=0x55eb64f409a0) at ../../../nspr/pr/src/pthreads/ptthread.c:216
#12 0x00007f5956b96dc5 in start_thread (arg=0x7f592649a700) at pthread_create.c:308
#13 0x00007f595647834d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:113


Occurs after an error 66, trying to delete the index container on the same suffix - not sure if it's related or not

Metadata Update from @mreynolds:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1433850

7 years ago

Metadata Update from @mreynolds:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1433850

7 years ago

Metadata Update from @mreynolds:
- Issue assigned to mreynolds

7 years ago

mreynolds commented 7 years ago

This was a regression introduced by this ticket:

https://pagure.io/389-ds-base/issue/49051

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to new
- Custom field type adjusted to defect

7 years ago

mreynolds commented 7 years ago

Metadata Update from @mreynolds:
- Custom field origin adjusted to QE
- Custom field reviewstatus adjusted to review (was: new)

7 years ago

nhosoi commented 7 years ago

I'm wondering if the following 2 error cases also need to call slapi_be_Unlock(be)?
742 if (!bind_target_entry) {
743 slapi_pblock_set(pb, SLAPI_PB_RESULT_TEXT, "No such entry");
744 send_ldap_result(pb, LDAP_INVALID_CREDENTIALS, NULL, "", 0, NULL);
745 goto free_and_return;
746 }

770 if (-1 == myrc) {
771 / need_new_pw failed; need_new_pw already send_ldap_result in it. /
772 goto free_and_return;
773 }

For the second one, maybe we could move this unlock prior to the line 767 if (auth_response_requested) ?
774 if (be) {
775 slapi_be_Unlock(be);
776 }

mreynolds commented 7 years ago

Actually this is looking like more of a problem in pw_verify_be_dn() which is where the lock is taken, but not always unlocked. @firstyear do you know why you did it that way? Basically you only unlock it if there is an error, but it seems like we should always unlock it. I'm sure there was a reason for it, so I'd like to know what that was before proceeding with a fix.

firstyear commented 7 years ago

It is indeed a mistake. It should be unlocked before return rc;

firstyear commented 7 years ago

I think it will only affect a sasl bind though because that's the only path where we use pw_verify_be_dn. I'll make a patch now.

firstyear commented 7 years ago

Does this fix it?

mreynolds commented 7 years ago

Actually this was my original fix to the issue, but in fact, we ALWAYS need to unlock it - always! The be lock is taken when we select the mapping tree backend, so matter what the result code is it still needs to be unlocked. I'll be attaching my patch shortly.

mreynolds commented 7 years ago

Note - this hang has nothing to do with sasl binds - see the testcase: