#48188 segfault in ns-slapd due to accessing Slapi_DN freed in pre bind plug-in
Closed: Fixed None Opened 4 years ago by wetpaste.

Hopefully this is the appropriate place for this.

I'm running freeIPA on centos 7. Each of my client servers are doing authentication and group lookup via vanilla ldap. I am getting pretty consistent segfaults in the ns-slapd process, especially when I am doing mass changes via ansible. This triggers a flood of ldap requests that end up crashing the service, I haven't been able to pinpoint the exact conditions but it doesn't take long for it to buckle.

version 1.3.3.1 on centos 7


Thread 1 seems to indicate that there might be memory corruption (the DN value is messed up, which leads to the crash). If possible, running ns-slapd under valgrind would help to pinpoint where the memory corruption is occuring. Details on doing this are here:

http://www.port389.org/docs/389ds/FAQ/faq.html#memory-growth

It looks like you are using compat tree with ID Views in FreeIPA. What exact IPA and slapi-nis versions do you have?

Hello wetpaste, did you have any luck to reproduce the crash and get the valgrind output?

Sorry I have not yet, I had turned off valgrind before I was able to replicate because of some strange issues I was having right before I left for vacation. I will work on getting it back in place and hopefully have some results in the next few days. Thanks for your help!

I'm having trouble replicating this when valgrind is running. It seems to slow things down to the point where I cannot get enough throughput to the server to cause a crash. I'm testing this very indirectly though, as it seems to happen when I am doing high-throughput ansible playbooks (around 15 ssh threads at once through a single jump-host). Each time ansible makes a sudo/ssh call it's doing lookups against ldap. However when valgrind is running, Its so slow that I get all kinds of timeouts from ansible so in order to get relief I bumped it down to 5 threads.

If I could figure out how to mimick the traffic that is normally sent over to the ldap server, I could get around trying to figure out where in the stack[ansible -> ssh -> ssh jump host -> host -> sudo??] it's timing out.

If anyone has any ideas let me know.

Replying to [comment:6 wetpaste]:

snip

Yeah, valgrind makes the server crawl.

If I could figure out how to mimick the traffic that is normally sent over to the ldap server, I could get around trying to figure out where in the stack[ansible -> ssh -> ssh jump host -> host -> sudo??] it's timing out.

If anyone has any ideas let me know.

The first step would be to use logconv.pl to get a report of the operations.

The next step would be to try to replay the same sequence of operations. IHAGHS (I have a github script) https://github.com/richm/scripts/blob/master/replayAccessAudit.py that can replay a sequence of operations from the access (and audit log, if there are updates).

Hello wetpaste,

Could there be any progress in your investigation/debugging?

I think we haven't heard your 389-ds-base revision, yet? Output of "rpm -q 389-ds-base"?

Any logs in the error log? (/var/log/dirsrv/slapd-YOURSERVERID/errors)

It might be useful to see some operations served just before the server's crash.
http://www.port389.org/docs/389ds/FAQ/faq.html#printing-access-log-buffer

In the access log, do you see lots of simple paged results requests? (notes=P)

Any updates? If not, we are going to close this bug.

Current 389-ds-base:

{{{
$ rpm -q 389-ds-base
389-ds-base-1.3.3.1-16.el7_1.x86_64
}}}

I'm going to try using logconv and replay to valgrind. Looks like it crashed again a couple times 2 days ago.

nhosoi: nothing is in the error logs except for the startup output.

Replying to [comment:7 rmeggins]:

The first step would be to use logconv.pl to get a report of the operations.
Can you clarify what this is for? Do you want a type of statistics report?

The next step would be to try to replay the same sequence of operations. IHAGHS (I have a github script) https://github.com/richm/scripts/blob/master/replayAccessAudit.py that can replay a sequence of operations from the access (and audit log, if there are updates).

Trying this now. Thanks!

EDIT: one issue, I can't seem to get the script to work and am baffled to as why. I looked at the source and, with my limited python knowledge, it looks okay...
{{{
$ python replayAccessAudit.py -c access -b 4 -e 73773
Traceback (most recent call last):
File "replayAccessAudit.py", line 666, in <module>
if len(args.c) == 0 and len(args.u) == 0:
AttributeError: 'Namespace' object has no attribute 'c'
}}}

Replying to [comment:11 wetpaste]:

Replying to [comment:7 rmeggins]:

The first step would be to use logconv.pl to get a report of the operations.
Can you clarify what this is for? Do you want a type of statistics report?

Yes. logconv.pl can produce a lot of statistics/reports from the access log. See man logconv.pl.

The next step would be to try to replay the same sequence of operations. IHAGHS (I have a github script) https://github.com/richm/scripts/blob/master/replayAccessAudit.py that can replay a sequence of operations from the access (and audit log, if there are updates).

Trying this now. Thanks!

EDIT: one issue, I can't seem to get the script to work and am baffled to as why. I looked at the source and, with my limited python knowledge, it looks okay...
{{{
$ python replayAccessAudit.py -c access -b 4 -e 73773
Traceback (most recent call last):
File "replayAccessAudit.py", line 666, in <module>
if len(args.c) == 0 and len(args.u) == 0:
AttributeError: 'Namespace' object has no attribute 'c'
}}}

Fixed. Please try it again.

Hello wetpaste,

Could there be any updates?

I'm kind of out of ideas. I have not been able to replicate the issue except for when I am in production. If I use the script to replay, I cannot get the same crash out of the app. Cannot get it to replicate while valgrind is running.

I revisited the stacktraces and noticed there is a possibility the dn/rawdn variables could be trashed by the IPA slapi-nis/compat plug-in.

in do_bind() dn points inside of sdn - so we need to set dn = NULL after the last time it is used before calling the PRE_BIND plugins, and set dn back to slapi_sdn_get_dn(sdn) after the PRE_BIND plugin (or not, if it is not used after that).

{{{
if (pb_sdn != sdn) {
}}}
pb_sdn != sdn doesn't mean strcmp(pb_sdn.ndn, sdn.ndn). That is, although it doesn't make sense for the plugin to change the pointer without changing the actual dn, it is still possible. I suggest using slapi_sdn_compare to make sure they are really different.

Replying to [comment:22 rmeggins]:

{{{
if (pb_sdn != sdn) {
}}}
pb_sdn != sdn doesn't mean strcmp(pb_sdn.ndn, sdn.ndn). That is, although it doesn't make sense for the plugin to change the pointer without changing the actual dn, it is still possible. I suggest using slapi_sdn_compare to make sure they are really different.

But the issue is not the dn value, but the memory...

slapi_sdn_compare basically compares the normalized dn from the 2 Slapi_DN to compare.

If a pre-bind plug-in happens to recreate a new sdn from the same normalized dn, we crashes there...
1845 backend_bind_cb(Slapi_PBlock pb)
1914 slapi_pblock_get(pb, SLAPI_BIND_TARGET_SDN, &sdn);
1916 slapi_sdn_free(&sdn);
1918 sdn = slapi_sdn_new_dn_byref(ndn);
1919 slapi_pblock_set(pb, SLAPI_BIND_TARGET_SDN, (void
) sdn);

There are two issues

1) If the plugin changes the SLAPI_BIND_TARGET_SDN pointer, we need to use it instead of the current sdn and dn. For this, it is sufficient to do pb_sdn != sdn.

2) If the plugin changes the SLAPI_BIND_TARGET_SDN value, we need to select a different backend. It is possible (but not very useful) for the plugin to change the pointer, but use the same value.

I'm also not sure if it is sufficient to call be = slapi_be_select(sdn);. The bind code sets the be like this:
{{{
/ We could be serving multiple database backends. Select the appropriate one /
if (slapi_mapping_tree_select(pb, &be, &referral, errorbuf) != LDAP_SUCCESS) {
send_nobackend_ldap_result( pb );
be = NULL;
goto free_and_return;
}
}}}
Do we need to have similar logic in order to change the backend? What if the plugin selects a non-existent backend that causes be == defbackend_get_backend()?

We also need to set the new backend in the pblock: slapi_pblock_set( pb, SLAPI_BACKEND, be );

git patch file (master) -- Additional fixes based upon the comments by Rich in comment 24
0001-Ticket-48188-segfault-in-ns-slapd-due-to-accessing-S.2.patch

Thanks to Simo for the patch & Thanks to Rich for the review and the suggestion.

Pushed to master:
c41d36d..8212a89 master -> master
commit 40e0d0f
commit 8212a89

Pushed to 389-ds-base-1.3.4:
d9f03f5..a215c00 389-ds-base-1.3.4 -> 389-ds-base-1.3.4
commit 6871f4f
commit a215c00

Metadata Update from @wetpaste:
- Issue assigned to nhosoi
- Issue set to the milestone: 1.3.4.5

2 years ago

Login to comment on this ticket.

Metadata