#305 Certain CMP operations hang or cause ns-slapd to crash
Closed: wontfix None Opened 12 years ago by imorgan.

On a frequent basis, ldapcompare operations against the pwdpolicySubentry attribute will hang. The access log shows the CMP operation, but no corresponding result. Once this has occurred, a second attempt at the same ldapcompare command will cause the server to crash. Even if the server does not crash, it cannot be shut down cleanly.

This problem was first observed with 1.2.9.16, but also occurs with
1.2.10.1. The problem is most easily reproducible after reinitializing
the server, but can occur at any time.

The server is one of a pair of servers configured for multimaster
replication. Both servers are running on RHEL 6.2.


Detailed description of the initial problem
initial-problem

A simplified and redacted version of the class-of-service configuration
cos-setup.ldif

gdb analysis of 1.2.10.1 during ldapcompare hang
389-hang.txt

it is blowing up during search operations
389stacktrace.txt

This looks like it is blowing up during search operations too

Bug appears to be directly related to my adding a second pwpolicy today. After deleting the policy, my crashes stopped.

Looking at both the hang and crash stack traces, we can see that two threads are in cos_cache_query_attr(). Both the hang and the crash happen at line 2393 during a free operation (a double free). With the fix, instead of allocating a new normalized DN for the targetTree, freeing the current targetTree (which is shared data), and reassigning the new value, I am just modifying the existing pointer.

Previously I was able to crash the server with less than 10 concurrent searches. I have now run the new code through over 1.25 million searches without issue.

Sending fix out for review...

Mark

slapi_dn_normalize_original is deprecated - the problem with doing DN normalization is that you cannot be guaranteed that you can always do it in place - converting to certain escape sequences will cause the string to grow. It looks like the real problem here is locking - there should be no way that another thread can free or change pTargetTree->val out from under the current thread. If the real problem is locking, then you could still run into weird problems if one thread is normalizing the DN out from under another thread - there could be odd characters in the DN that would cause strange errors at runtime.
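To illustrate the point about growth, here is a minimal sketch of a toy escaper. It is not the slapi normalization code: `toy_escape_value` and `toy_needs_escape` are hypothetical names, and the set of escaped characters is an assumption chosen for the example. It only shows that replacing one character with a backslash plus two hex digits makes the result longer than the input, so the original buffer cannot be reused safely.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the slapi API: returns 1 if c is one of
 * the characters this toy chooses to hex-escape. */
static int toy_needs_escape(char c)
{
    return c != '\0' && strchr(",+\"\\<>;", c) != NULL;
}

/* Returns a newly allocated copy of value in which each special character
 * is replaced by a backslash and two hex digits. The caller frees the
 * result. Returns NULL if allocation fails. */
char *toy_escape_value(const char *value)
{
    assert(value != NULL);

    /* First pass: an escaped character needs 3 bytes instead of 1. */
    size_t need = 1; /* terminating NUL */
    for (const char *p = value; *p; p++)
        need += toy_needs_escape(*p) ? 3 : 1;

    char *out = malloc(need);
    if (out == NULL)
        return NULL;

    /* Second pass: copy, expanding special characters. */
    char *w = out;
    for (const char *p = value; *p; p++) {
        if (toy_needs_escape(*p)) {
            snprintf(w, 4, "\\%02X", (unsigned char)*p);
            w += 3;
        } else {
            *w++ = *p;
        }
    }
    *w = '\0';
    return out;
}
```

For the three-byte input `a,b` this produces the five-byte string `a\2Cb`, which is why the result has to go into a fresh allocation rather than over the original.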

I wonder if we should even be normalizing at that point, as we are potentially normalizing the same DN string multiple times. There are also the thread safety issues Rich points out where the DN can be modified by one thread while another is reading it. It seems like we should only normalize once when it is added to the CoS cache. Perhaps that portion of the code is already protected by locking as well.
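The normalize-once idea can be sketched as follows. This is only an illustration under assumptions: `toy_cache`, `toy_normalize`, `toy_cache_build` and `toy_cache_in_scope` are hypothetical stand-ins, not the structures or functions in cos_cache.c, and the normalizer just lowercases and drops spaces after commas. The point is that the shared string is allocated and assigned only in the build path, under the lock, and the query path never frees or rewrites it.

```c
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a cache entry; the real CoS cache differs. */
typedef struct toy_cache {
    pthread_mutex_t lock;
    char *target_tree; /* normalized once at build time, then read-only */
} toy_cache;

/* Toy normalizer (assumption, not slapi): lowercase copy with spaces that
 * follow a comma dropped. Returns a new allocation or NULL. */
static char *toy_normalize(const char *dn)
{
    assert(dn != NULL);
    char *out = malloc(strlen(dn) + 1); /* this toy never grows the string */
    if (out == NULL)
        return NULL;
    char *w = out;
    int after_comma = 0;
    for (const char *p = dn; *p; p++) {
        if (after_comma && *p == ' ')
            continue;
        after_comma = (*p == ',');
        *w++ = (char)tolower((unsigned char)*p);
    }
    *w = '\0';
    return out;
}

/* Build path: the only place the shared string is replaced, and only
 * while holding the lock. Returns 0 on success, -1 on allocation failure. */
int toy_cache_build(toy_cache *c, const char *raw_dn)
{
    char *norm = toy_normalize(raw_dn);
    if (norm == NULL)
        return -1;
    pthread_mutex_lock(&c->lock);
    free(c->target_tree);
    c->target_tree = norm;
    pthread_mutex_unlock(&c->lock);
    return 0;
}

/* Query path: suffix comparison only; never frees or modifies the
 * shared string. entry_dn is expected to be normalized already. */
int toy_cache_in_scope(toy_cache *c, const char *entry_dn)
{
    int hit = 0;
    pthread_mutex_lock(&c->lock);
    if (c->target_tree != NULL) {
        size_t n = strlen(entry_dn);
        size_t m = strlen(c->target_tree);
        hit = n >= m && strcmp(entry_dn + (n - m), c->target_tree) == 0;
    }
    pthread_mutex_unlock(&c->lock);
    return hit;
}
```

With this shape, any number of concurrent queries can run without one thread freeing a pointer another thread is about to free or read, because no query ever owns the string.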

I didn't see any errors/problems with the previous fix, but I did have concerns. I moved the DN normalization to the cache building code, which runs under a lock. Sending new fix out for review...

[mareynol@localhost plugins]$ git merge ticket305
Updating 6fd5d70..142c8f0
Fast-forward
ldap/servers/plugins/cos/cos_cache.c | 24 ++++++++++--------------
1 files changed, 10 insertions(+), 14 deletions(-)

[mareynol@localhost plugins]$ git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 934 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
6fd5d70..142c8f0 master -> master

stack trace of hung server with 1.2.10.3
bt-during-hang.txt

Looked at the stack trace, and there is no deadlock. It looks like the server is running. The only thread that is doing anything is thread #34, and it appears to be doing something in slapi_attr_basetype:

Thread 34 (Thread 0x7fd68ffff700 (LWP 2479)):
#0 0x0000003002e41c1f in slapi_attr_basetype (type=0x7fd69400d915 "mber", buf=0x7fd68fffa2f0 "gidNu", bufsize=256) at ldap/servers/slapd/attr.c:384
    i = <value optimized out>
#1 0x0000003002eb8461 in vattr_map_lookup (type_to_find=<value optimized out>, result=0x7fd68fffa438) at ldap/servers/slapd/vattr.c:1889
    basetype = 0x0
    tmp = 0x0
    buf = "gidNu", '\000' <repeats 171 times>, "X6\340\002\060\000\000\000 \362\340\002\060\000\000\000\020,M\243\326\177\000\000pX\000$\326\177\000\000\320w\001\224\326\177\000\000\377\377\377\377\000\000\000\000@E\236\001\000\000\000\000\251\334\344\002\060\000\000\000+\000\000\000\000\000\000\000\237\337\344\002\060\000\000"
#2 0x0000003002eb8542 in vattr_map_namespace_sp_getlist (dn=0x1a67310, type_to_find=0x7fd69400d910 "gidNumber") at ldap/servers/slapd/vattr.c:2186
...
...

It would be nice to have a few stack traces to see if there is actually any progress. Also, what is the exact compare operation you are doing? I need to reproduce this so I'd like the exact steps.

Thanks,
Mark

Sorry, I found the compare in the "initial problem" attachment. Continuing investigation...

To ssh://git.fedorahosted.org/git/389/ds.git
3f960dc..55135e3 master -> master
commit changeset:55135e3/389-ds-base
Author: Rich Megginson rmeggins@redhat.com
Date: Tue Mar 13 11:45:32 2012 -0600


Metadata Update from @imorgan:
- Issue assigned to rmeggins
- Issue set to the milestone: 1.2.10.4

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/305

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Fixed)

4 years ago
