#50542 Entry cache contention during base search
Closed: wontfix 4 years ago by mreynolds. Opened 4 years ago by tbordaz.

Issue Description

Entry cache lock is acquired/release each time that the server needs find/return (aka refcnt++/refcnt--) an entry.
This can create a contention if many threads are constantly find/return entry.
For example, deref plugin can derefence a large group and find/return each members of the group.

In case of base search, the entry is find/return to retrieve the base of the search. Then the entry is find/return when sending back the entry. So a base search acquire/release the entry cache lock four times for a same entry.

If a group contains 10K members and is derefenced by 100 clients, the entry cache lock will be acquired/release 4M times.

Package Version and Platform

All version

Steps to reproduce

  1. testcase to provide

Actual results

Monitoring cache_lock/cache_unlock (PR_Monitor), we can see that its duration varies a lot from few microsec to several 10ms.

Expected results

Entry cache being unique and protected by a lock we can not eliminate contention but in specific case of base search (like deref plugin) we could reduce by 2 the number of lock acquire/release and reduce contention


Metadata Update from @tbordaz:
- Custom field origin adjusted to None
- Custom field reviewstatus adjusted to None

4 years ago

Metadata Update from @tbordaz:
- Issue assigned to tbordaz

4 years ago

Metadata Update from @tbordaz:
- Custom field origin adjusted to IPA (was: None)
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=1724761
- Issue set to the milestone: 1.3.9

4 years ago

In agreement with @lkrispen and @firstyear , the patch is pushed without the testcase
(In order to get early feedback).

7a24286..ca915d5 master
ab24aa4..4f98f3b 389-ds-base-1.4.0
5956610..35b42aa 389-ds-base-1.3.9

Not closing the ticket until the testcase is pushed

Latest master crashes in suites/filter/filter_match_test.py (in test_valid_invalid_attributes or test_valid_invalid_modes). It doesn't happen all the time, even with ASAN.

Here's the ASAN output:

=================================================================
==16378==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000048 (pc 0x7efe96d44393 bp 0x7efe84ae99e0 sp 0x7efe84ae9920 T6)
==16378==The signal is caused by a READ memory access.
==16378==Hint: address points to the zero page.
    #0 0x7efe96d44392 in ldbm_back_entry_release ldap/servers/slapd/back-ldbm/ldbm_search.c:1913
    #1 0x7efe9e21b39d  (/usr/lib64/dirsrv/libslapd.so.0+0x18a39d)
    #2 0x7efe9e2206df in op_shared_search (/usr/lib64/dirsrv/libslapd.so.0+0x18f6df)
    #3 0x7efe9e25452a  (/usr/lib64/dirsrv/libslapd.so.0+0x1c352a)
    #4 0x7efe99c550da in cos_cache_add_dn_defs ldap/servers/plugins/cos/cos_cache.c:1051
    #5 0x7efe99c550da in cos_cache_build_definition_list ldap/servers/plugins/cos/cos_cache.c:666
    #6 0x7efe99c550da in cos_cache_create_unlock ldap/servers/plugins/cos/cos_cache.c:458
    #7 0x7efe99c550da in cos_cache_creation_lock ldap/servers/plugins/cos/cos_cache.c:586
    #8 0x7efe99c560bc in cos_cache_wait_on_change ldap/servers/plugins/cos/cos_cache.c:419
    #9 0x7efe9de5a8a8 in _pt_root ../../.././nspr/pr/src/pthreads/ptthread.c:201
    #10 0x7efe9ddf75a1 in start_thread /usr/src/debug/glibc-2.29-24-g2ec0b166bf/nptl/pthread_create.c:486
    #11 0x7efe9dc72162 in __clone (/lib64/libc.so.6+0xfb162)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV ldap/servers/slapd/back-ldbm/ldbm_search.c:1913 in ldbm_back_entry_release
Thread T6 created by T0 here:
    #0 0x7efe9e44b965 in pthread_create (/lib64/libasan.so.5+0x3a965)
    #1 0x7efe9de5a58a in _PR_CreateThread ../../.././nspr/pr/src/pthreads/ptthread.c:433

==16378==ABORTING

I confirm the crash and got a better stack for frame #1:

  #0 0x7f7fb518c391 in ldbm_back_entry_release /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/back-ldbm/ldbm_search.c:1913
  #1 0x7f7fc5db7092 in cache_return_target_entry /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/opshared.c:213
  #2 0x7f7fc5dbcba0 in op_shared_search /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/opshared.c:933
  #3 0x7f7fc5e02d34 in search_internal_callback_pb /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin_internal_op.c:730
  #4 0x7f7fc5e015b0 in slapi_search_internal_callback_pb /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/slapd/plugin_internal_op.c:520
  #5 0x7f7fb8254f38 in cos_cache_add_dn_defs /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/cos/cos_cache.c:1051
  #6 0x7f7fb82528a9 in cos_cache_build_definition_list /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/cos/cos_cache.c:666
  #7 0x7f7fb8251d64 in cos_cache_create_unlock /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/cos/cos_cache.c:458
  #8 0x7f7fb82521fd in cos_cache_creation_lock /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/cos/cos_cache.c:586
  #9 0x7f7fb8251b50 in cos_cache_wait_on_change /home/lkrispen/TEST/caa/ws/389-ds-base/ldap/servers/plugins/cos/cos_cache.c:419
  #10 0x7f7fc35c9487 in _pt_root ../../.././nspr/pr/src/pthreads/ptthread.c:201
  #11 0x7f7fc2f66593 in start_thread (/lib64/libpthread.so.0+0x7593)
  #12 0x7f7fc282be5e in __GI___clone (/lib64/libc.so.6+0xf9e5e

which is in the new function introduced by this patch :-(

@lkrispen, this fixed failures in filter test suite and a similar crash in suites/password/pwdModify_test.py::test_pwd_modify_with_subsuffix. There are only failures known before the entry cache contention fix.

cherry-=pociked fix to 1.3.10

a6f9e10..d871134 389-ds-base-1.3.10 -> 389-ds-base-1.3.10

Added Ludwig's patch to fix crash on 1.3.10

d871134..95acf7a 389-ds-base-1.3.10 -> 389-ds-base-1.3.10

@lkrispen - you still need to cherry-pick your fix to 1.4.x and 1.3.9, thanks!

Here are Ludwig's commit:

7edcaff (master - 1.4.1)

2672461 1.4.0

d871134..95acf7a 389-ds-base-1.3.10 -> 389-ds-base-1.3.10

2c1bd9a 1.3.9

Metadata Update from @mreynolds:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

4 years ago

0001-fix-for-50542-crashes-in-filter-tests.patch
this fixes the crashes for me, but haven't done further tests

Thanks @lkrispen for finding the culprit of the crash, it was a nasty one.
Just a minor comment, the failing function (ldbm_back_entry_release) existed before, #50542 did not modified it, just started using it. I guess this function was implemented but no longer used since was started using asan.

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/3598

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: fixed)

3 years ago

Login to comment on this ticket.

Metadata