ldapcompare operations against the pwdpolicySubentry attribute frequently hang. The access log shows the CMP operation, but no corresponding result. Once this has occurred, a second attempt at the same ldapcompare command causes the server to crash. Even if the server does not crash, it cannot be shut down cleanly.
This problem was first observed with 1.2.9.16, but also occurs with 1.2.10.1. The problem is most easily reproducible after reinitializing the server, but can occur at any time.
The server is one of a pair of servers configured for multimaster replication. Both servers are running on RHEL 6.2.
Detailed description of the initial problem: initial-problem
A simplified and redacted version of the class-of-service configuration: cos-setup.ldif
gdb analysis of 1.2.10.1 during ldapcompare hang: 389-hang.txt
ns-slapd coredump from 1.2.10.1: 389-coredump.txt
it is blowing up during search operations: 389stacktrace.txt
This looks like it is blowing up during search operations too
Bug appears to be directly related to my adding a second pwpolicy today. After deleting the policy, my crashes stopped.
Looking at both the hang and crash stack traces, we can see that we have two threads in cos_cache_query_attr(). Both the hang and the crash happen at line 2393, during a free operation (a double free). The old code allocated a new normalized DN for the targetTree, freed the current targetTree (which is shared data), and reassigned the new value. With the fix, I am just modifying the existing value in place instead.
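For readers following along, here is a minimal sketch of the allocate, free, reassign pattern described above. It is an illustration only, not the code from cos_cache.c: the struct `toy_target_tree` and the functions `toy_lower_copy`, `toy_tree_new`, and `toy_renormalize_shared` are hypothetical names, and lowercasing stands in for real DN normalization. Run from a single thread it is harmless; the comments mark where two unsynchronized threads could both free the same old pointer.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a shared CoS cache value; not the real struct. */
typedef struct {
    char *val;
} toy_target_tree;

/* Toy "normalization": return a heap-allocated lowercased copy of s. */
static char *toy_lower_copy(const char *s)
{
    size_t n = strlen(s);
    char *out = malloc(n + 1);
    if (out == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = (char)tolower((unsigned char)s[i]);
    }
    out[n] = '\0';
    return out;
}

/* Create a tree value that owns a heap copy of dn. */
toy_target_tree *toy_tree_new(const char *dn)
{
    toy_target_tree *t = malloc(sizeof(*t));
    if (t == NULL) {
        return NULL;
    }
    t->val = malloc(strlen(dn) + 1);
    if (t->val == NULL) {
        free(t);
        return NULL;
    }
    strcpy(t->val, dn);
    return t;
}

/* The risky pattern: allocate a new value, free the shared one, reassign.
 * If two threads call this on the same t with no lock held, both can read
 * the same old t->val and both can pass it to free(): a double free. */
int toy_renormalize_shared(toy_target_tree *t)
{
    assert(t != NULL && t->val != NULL);
    char *fresh = toy_lower_copy(t->val);
    if (fresh == NULL) {
        return -1;
    }
    free(t->val);   /* a second thread may still hold, or also free, this pointer */
    t->val = fresh; /* window between free and reassign leaves a dangling pointer */
    return 0;
}
```

A double free of this kind corrupts the allocator's bookkeeping, which would be consistent with seeing either a hang or a crash inside free().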
Previously I was able to crash the server with fewer than 10 concurrent searches. I have now run the new code through over 1.25 million searches without issue.
Sending fix out for review...
Mark
slapi_dn_normalize_original is deprecated. The problem with doing DN normalization is that you cannot be guaranteed that you can always do it in place: converting to certain escape sequences will cause the string to grow. It looks like the real problem here is locking. There should be no way that another thread can free or change pTargetTree->val out from under the current thread. If the real problem is locking, then you could still run into weird problems if one thread is normalizing the DN out from under another thread: there could be odd characters in the DN that would cause strange errors at runtime.
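To make the "string can grow" point concrete, here is a small sketch. It is not the slapi normalizer and makes no claim about which escapes the real code emits; `toy_escape_plus` is a hypothetical function that rewrites each '+' as the three-character hex escape `\2B`, which is enough to show that the output may not fit in the input buffer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy escaper, NOT the real slapi normalizer: replaces each
 * '+' with the three-character hex escape "\2B" and returns a new
 * heap-allocated string that the caller must free. */
char *toy_escape_plus(const char *in)
{
    assert(in != NULL);

    /* Each '+' becomes three characters, so it adds two bytes. */
    size_t extra = 0;
    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '+') {
            extra += 2;
        }
    }

    char *out = malloc(strlen(in) + extra + 1);
    if (out == NULL) {
        return NULL;
    }

    char *q = out;
    for (const char *p = in; *p != '\0'; p++) {
        if (*p == '+') {
            memcpy(q, "\\2B", 3);
            q += 3;
        } else {
            *q++ = *p;
        }
    }
    *q = '\0';
    return out;
}
```

Because the result can be longer than the input, a transformation like this needs a fresh allocation, and swapping that allocation into shared data is exactly the step that has to be protected by a lock.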
I wonder if we should even be normalizing at that point, as we are potentially normalizing the same DN string multiple times. There are also the thread safety issues Rich points out, where the DN can be modified by one thread while another is reading it. It seems like we should only normalize once, when the DN is added to the CoS cache. Perhaps that portion of the code is already protected by locking as well.
attachment 0001-Ticket-305-Certain-CMP-operations-hang-or-cause-ns-s.patch
I didn't see any errors/problems with the previous fix, but I did have concerns. I moved the DN normalization to the cache building code, which runs under a lock. Sending new fix out for review...
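The approach described here (normalize once while building the cache, under a lock, so that queries only read) can be sketched roughly as follows. This is an assumption-laden illustration rather than the committed patch: `toy_cos_entry`, `toy_cache_lock`, `toy_normalize_copy`, `toy_cache_build`, and `toy_cache_query` are hypothetical names, a plain pthread mutex stands in for whatever lock the CoS cache actually uses, and lowercasing stands in for DN normalization.

```c
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical one-entry cache; the real CoS cache is more elaborate. */
typedef struct {
    char *target_tree; /* stored already normalized */
} toy_cos_entry;

static pthread_mutex_t toy_cache_lock = PTHREAD_MUTEX_INITIALIZER;
static toy_cos_entry toy_cache = { NULL };

/* Toy "normalization": heap-allocated lowercased copy of dn. */
static char *toy_normalize_copy(const char *dn)
{
    size_t n = strlen(dn);
    char *out = malloc(n + 1);
    if (out == NULL) {
        return NULL;
    }
    for (size_t i = 0; i < n; i++) {
        out[i] = (char)tolower((unsigned char)dn[i]);
    }
    out[n] = '\0';
    return out;
}

/* Build step: normalize once, then swap the value in while holding the lock. */
int toy_cache_build(const char *raw_dn)
{
    assert(raw_dn != NULL);
    char *norm = toy_normalize_copy(raw_dn);
    if (norm == NULL) {
        return -1;
    }
    pthread_mutex_lock(&toy_cache_lock);
    free(toy_cache.target_tree); /* safe: no reader can hold it outside the lock */
    toy_cache.target_tree = norm;
    pthread_mutex_unlock(&toy_cache_lock);
    return 0;
}

/* Query step: read-only comparison; never frees or rewrites the shared value. */
int toy_cache_query(const char *normalized_dn)
{
    assert(normalized_dn != NULL);
    int match = 0;
    pthread_mutex_lock(&toy_cache_lock);
    if (toy_cache.target_tree != NULL) {
        match = (strcmp(toy_cache.target_tree, normalized_dn) == 0);
    }
    pthread_mutex_unlock(&toy_cache_lock);
    return match;
}
```

With this split, the query path has nothing to free, so concurrent queries cannot double free the target tree value no matter how they interleave.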
[mareynol@localhost plugins]$ git merge ticket305
Updating 6fd5d70..142c8f0
Fast-forward
 ldap/servers/plugins/cos/cos_cache.c | 24 ++++++++++--------------
 1 files changed, 10 insertions(+), 14 deletions(-)

[mareynol@localhost plugins]$ git push origin master
Counting objects: 13, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (7/7), done.
Writing objects: 100% (7/7), 934 bytes, done.
Total 7 (delta 5), reused 0 (delta 0)
To ssh://git.fedorahosted.org/git/389/ds.git
   6fd5d70..142c8f0  master -> master
stack trace of hung server with 1.2.10.3 bt-during-hang.txt
Sorry, I found the compare in the "initial problem" attachment. Continuing investigation...
0001-Ticket-305-Certain-CMP-operations-hang-or-cause-ns-s.patch 0001-Ticket-305-Certain-CMP-operations-hang-or-cause-ns-s.2.patch
To ssh://git.fedorahosted.org/git/389/ds.git
   3f960dc..55135e3  master -> master
commit changeset:55135e3/389-ds-base
Author: Rich Megginson rmeggins@redhat.com
Date: Tue Mar 13 11:45:32 2012 -0600
Metadata Update from @imorgan: - Issue assigned to rmeggins - Issue set to the milestone: 1.2.10.4
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/305
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)