#47629 random crash in send_ldap_search_entry_ext()

Created 3 years ago by pspacek
Modified 14 days ago

Hello,

my 389-ds-base-1.3.2.8-1.fc20.x86_64 is randomly crashing somewhere in send_ldap_search_entry_ext() function. I have captured coredump and stack traces, I will attach them to this ticket.

You can analyze the coredump in Brno lab, vm-123.

I have a feeling that it is related to connection handling/connection termination, but I'm not able to find solid reproducer. The crash in the same function occurs randomly but I have seen that several times.

errors printed by GDB after running 'gcore' command
gcore.log

output from "rpm -qa"
rpm-qa.log

Looks like the search is related to syncrepl.

@richm: looking at backtrace.log I do see only one search:
rawbase = 0x7f6224000e40 "cn=dns-100x100, dc=ipa,dc=test"
fstr = 0x7f6224001420 "(|(objectClass=idnsConfigObject)(objectClass=idnsZone)(objectClass=idnsForwardZone)(objectClass=idnsRecord))

which is client srch request, so I don't see the relation to syncrepl.

@pspacek: what makes you think it is related to connection management ? syncrepl in persistent mode keeps the connection and operation, maybe there is a side effect of freeing some parts of the conn.
Could you collect a few more core dumps from the next crashes ?

Thread 42 is in sync_send_results() referencing connection conn = 0x7f624c313410. This is the same connection that is doing the search - Thread 32 - #7 0x00007f6260db024b in flush_ber (pb=pb @entry=0x7f622f7f5ae0, conn=conn @entry=0x7f624c313410

pspacek mentioned that he only sees the crash when using syncrepl.

yes, sync_send_results is the thread for the peristent phase, but it still waiting on the condition variable to get activated once the search has sent the initial content, so this should be ok.

I agree that the crash is most likely related to syncrepl, but don't see how.

pspacek, do you see any pattern when it happens, do you use cookies ?

I found that send_ldap_search_entry() operates on uninitialized memory sometimes. See attachment:valgrind.log from line 1030.

I have no idea if it is related or not.

The attachment:valgrind.log file also contains memory leaks, some of them look serious. At least the leak on line 2951 seems to affect any any.

I used following commands (running in parallel) for this test:
{{{
while true; do sleep $(echo "scale=5; 0.05 + 0.$RANDOM" | bc); pkill -9 ldapsearch; done
while true; do ldapsearch -H "ldap://vm-123:389" -w password -D 'cn=Directory manager' -b "cn=dns,dc=ipa,dc=test" -E sync=rp; done
}}}

This code runs ldapsearch with synrepl in refresh & persist mode in a loop and kills the client at arbitrary times.

rhel 6.5 389-ds-base 1.3.2 valgrind with -E sync=rp
slapd.vg.32575

rhel 6.5 389-ds-base 1.3.2 valgrind with -E sync=ro
slapd.vg.21622

0001-fix-various-problems-that-cause-a-lot-of-valgrind-no.patch
0001-fix-various-problems-that-cause-a-lot-of-valgrind-no.patch

I ran a lot of tests with various combinations of search parameters. I even ran mozldap ldapsearch with persistent search. All of them are mostly valgrind clean except for ldapsearch -E sync=rp https://fedorahosted.org/389/attachment/ticket/47629/slapd.vg.32575 and -E sync=ro https://fedorahosted.org/389/attachment/ticket/47629/slapd.vg.21622

Here is the valgrind related patch I have been using:
https://fedorahosted.org/389/attachment/ticket/47629/0001-fix-various-problems-that-cause-a-lot-of-valgrind-no.patch

It looks like the code in sync_send_results() has mostly been copied from ps_send_results(), with the exception of the following: ps_send_results has some extra pblock cleanup, starting at psearch.c:401; ps_send_results calls connection_remove_operation_ext() which is required in the new no-malloc style op stack; there is this special code in connection_threadmain():
{{{
/ If this op isn't a persistent search, remove it /
if ( pb->pb_op->o_flags & OP_FLAG_PS ) {
PR_Lock( conn->c_mutex );
connection_release_nolock (conn); / psearch acquires ref to conn - release this one now /
PR_Unlock( conn->c_mutex );
/ ps_add makes a shallow copy of the pb - so we
* can't free it or init it here - just memset it to 0
* ps_send_results will call connection_remove_operation_ext to free it
/
memset(pb, 0, sizeof(*pb));
}}}
If the intention is that sync_send_results() "owns" the operation pblock and is responsible for its lifecycle management, then somewhere in the syncrepl code the pb->pb_op->o_flags needs to set the OP_FLAG_PS flag so that connection_threadmain knows not to touch this pblock.

Or, we can add some other flag (OP_FLAG_PBLOCK_COPIED?)

The OP_FLAG_PS flag is set, it is set in the pre_search plugin:
sync.h:#define OP_FLAG_SYNC_PERSIST 0x01
sync_refresh.c: slapi_operation_set_flag(operation, OP_FLAG_SYNC_PERSIST);

Regarding the valgrind report, I don't see where uninitialized memory is used, the cookie is set/get by the slapi_set/get_object_extension

All the crashes are in the refresh phase when the existing entries are sent and the pre_entry plugin adds a control, so the sync_send_results is not yet active, just waiting to get notified

second thought, sync_send_results will also check if op is abandoned and does some cleanup, will check.

I understand the crash I'm seeing in the test scenario. The intention of the refresh and persist implementation is to
- send all requested initial entries (all or depending on a cookie)
- send all modified entries using a seperate thread running sync_send_results
and sending modified entries should only start when th einitial refresh is completed.
This works with cookies, but without cookie the sync_send_result start to early and interleaves with sending the regular entries. The problem is that the persitent thhread uses the same operation as the original search thread (no conflict if it starts after refresh is complete), and both set and get: SLAPI_SEARCH_RESULT_ENTRY and free it when done, so in some cases on thread tries to use an entry which was just freed.
A fix would be to correctly delay the start of the sync_send_results

The other "crash" doesn't seem to be a crash. If running the test under gdb, gdb halts when the connetion is terminated while writing lin __libc_send and logs: Program received signal SIGPIPE, but with continue the process continues and the error is handled in flush_ber calling do_disconnect_server.

with the attached patch the regular crashes in the test env provided by pspacek are no longer reproducable. But the client machine did runout of memory after some time, so long duration test was not yet possible

next I will look into the problems reported by valgrind

fix retested in new environment, could not reproduce teh crash.
The mem leak will be handled in ticket 47672

Hi Ludwig,

I have a question not related to that fix. In sync_persist_terminate, it may call sync_remove_request to remove a request from sync_request_list.
In sync_remove_request I have not seen any free of the request, only removal from the list. How the request is retrieved to be freed ?

regards
thierry

@thiery: in sync_persist_terminate active is set to false and complete to true, so the thread terminates and frees the request it is serving, in sync_send_results()

$ git merge ticket47629
Updating 36c48b8. 659b777
Fast-forward
ldap/servers/plugins/sync/sync.h | 7 ++++---
ldap/servers/plugins/sync/sync_persist.c | 3 ++-
ldap/servers/plugins/sync/sync_refresh.c | 18 ++++++++++++------
3 files changed, 18 insertions(+), 10 deletions(-)
$ git push origin master
Counting objects: 17, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.30 KiB, done.
Total 9 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
36c48b8. 659b777 master -> master

$ git push origin 389-ds-base-1.3.2
Counting objects: 17, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (9/9), done.
Writing objects: 100% (9/9), 1.30 KiB, done.
Total 9 (delta 6), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
270ad9a. d06de4e 389-ds-base-1.3.2 -> 389-ds-base-1.3.

14 days ago

Metadata Update from @pspacek:
- Issue set to the milestone: 1.3.2.11

Login to comment on this ticket.

ack

IPA

Directory Server

defect

cancel