After a successful BIND:

{{{
[access log]
...
[19/Feb/2014:08:52:54 +0000] conn=107 op=0 BIND dn="uid=opencms, ou=apps,dc=mydomain,dc=com" method=128 version=3
[19/Feb/2014:08:52:54 +0000] conn=107 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=opencms, ou=apps,dc=mydomain,dc=com"
...
}}}

the application issues paged searches, which the server answers correctly:

{{{
[access log]
...
[19/Feb/2014:09:57:53 +0000] conn=107 op=62 SRCH base="dc=mydomain,dc=com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=user1))" attrs="mail postalAddress description uid sn postalCode givenName"
[19/Feb/2014:09:57:53 +0000] conn=107 op=62 RESULT err=0 tag=101 nentries=1 etime=0 notes=P
...
}}}

But then the same bound connection makes a non-paged search:

{{{
[access log]
...
[19/Feb/2014:09:57:54 +0000] conn=107 op=133 SRCH base="dc=mydomain,dc=com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=user2))" attrs="mail postalAddress description uid sn postalCode givenName"
[19/Feb/2014:09:57:54 +0000] conn=107 op=133 RESULT err=0 tag=101 nentries=1 etime=0
...
}}}

and the server dies after the next search operation:

{{{
[access log]
...
[19/Feb/2014:09:57:54 +0000] conn=107 op=134 SRCH base="dc=mydomain,dc=com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=user3))" attrs="mail postalAddress description uid sn postalCode givenName"
...
}}}

{{{
[error log]
...
[19/Feb/2014:09:57:54 +0000] - pagedresults_parse_control_value: invalid cookie: -1
}}}
And the slapd instance dies.
I think the server should be protected against this kind of incorrect client usage.
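For reference, here is a minimal libldap sketch of the client sequence described above: a simple paged search followed by a plain search on the same bound connection. This is only an illustration of the pattern, not the actual OpenCMS client; the host, password, and DNs are placeholders.

{{{
/* Sketch of the reported client pattern: one bound connection that first
 * issues a simple paged results search and then a plain search.
 * Host, credentials, and DNs below are placeholders, not from this ticket. */
#include <ldap.h>

int main(void)
{
    LDAP *ld = NULL;
    int version = LDAP_VERSION3;
    char pw[] = "secret";                      /* placeholder password */
    struct berval cred = { sizeof(pw) - 1, pw };
    LDAPControl *page = NULL, *ctrls[2] = { NULL, NULL };
    LDAPMessage *res = NULL;

    if (ldap_initialize(&ld, "ldap://ldap.example.com:389") != LDAP_SUCCESS) {
        return 1;
    }
    ldap_set_option(ld, LDAP_OPT_PROTOCOL_VERSION, &version);
    ldap_sasl_bind_s(ld, "uid=opencms,ou=apps,dc=mydomain,dc=com",
                     LDAP_SASL_SIMPLE, &cred, NULL, NULL, NULL);

    /* Paged search: attach the simple paged results control (page size 100). */
    ldap_create_page_control(ld, 100, NULL, 0, &page);
    ctrls[0] = page;
    ldap_search_ext_s(ld, "dc=mydomain,dc=com", LDAP_SCOPE_SUBTREE,
                      "(&(objectClass=organizationalPerson)(uid=user1))",
                      NULL, 0, ctrls, NULL, NULL, 0, &res);
    ldap_msgfree(res);
    ldap_control_free(page);

    /* Plain (non-paged) search on the very same connection. */
    ldap_search_ext_s(ld, "dc=mydomain,dc=com", LDAP_SCOPE_SUBTREE,
                      "(&(objectClass=organizationalPerson)(uid=user2))",
                      NULL, 0, NULL, NULL, NULL, 0, &res);
    ldap_msgfree(res);

    ldap_unbind_ext_s(ld, NULL, NULL);
    return 0;
}
}}}

Built against OpenLDAP's libldap (link with -lldap -llber).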
What version of 389-ds-base are you using? This might be an issue that has already been fixed.
Please follow the instructions here for generating a stack trace of the crash. That will help us to identify if this is an issue that we have already fixed:
http://port389.org/wiki/FAQ#Debugging_Crashes
It works for me (tested with the master).
{{{
[..] conn=8 op=0 BIND dn="uid=tuser0,o=my.com" method=128 version=3
[..] conn=8 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=tuser0,o=my.com"
[..] conn=8 op=1 SRCH base="o=my.com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=tuser1))" attrs="mail postalAddress description uid sn postalCode givenName"
[..] conn=8 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[..] conn=8 op=2 SRCH base="o=my.com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=tuser1))" attrs="mail postalAddress description uid sn postalCode givenName"
[..] conn=8 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[..] conn=8 op=3 SRCH base="o=my.com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=tuser1))" attrs="mail postalAddress description uid sn postalCode givenName"
[..] conn=8 op=3 RESULT err=0 tag=101 nentries=1 etime=0
[..] conn=8 op=5 UNBIND
[..] conn=8 op=5 fd=64 closed - U1
}}}
Please note: if a search is a "non paged search" -- with no LDAP_CONTROL_PAGEDRESULTS, it won't call pagedresults_parse_control_value. So, I'm puzzled why you see "pagedresults_parse_control_value: invalid cookie: -1" in the error log...

{{{
461     if ( slapi_control_present (ctrlp, LDAP_CONTROL_PAGEDRESULTS,
462                                 &ctl_value, &iscritical) )
463     {
464         rc = pagedresults_parse_control_value(pb, ctl_value,
465                                               &pagesize, &pr_idx);
}}}
If you could provide your client program/script to reproduce the problem, it may help our investigation. Thanks.
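For background on the control being discussed here: a simple paged results request carries an opaque cookie, and the client sends back the cookie returned with each page to ask for the next one. Below is a minimal libldap sketch of that round trip, assuming an already bound LDAP handle; the base DN and filter are placeholders, not taken from this ticket.

{{{
#include <ldap.h>

/* Fetch all pages of a search, sending back the cookie from each response.
 * Assumes "ld" is an already bound LDAP handle. */
int fetch_all_pages(LDAP *ld, const char *base, const char *filter)
{
    struct berval cookie = { 0, NULL };
    int rc = LDAP_SUCCESS, more = 1;

    while (more) {
        LDAPControl *page = NULL, *reqctrls[2] = { NULL, NULL };
        LDAPControl **resctrls = NULL, *prctrl = NULL;
        LDAPMessage *res = NULL;
        ber_int_t estimate = 0;
        int err = 0;

        /* Page size 100; the first request carries no cookie. */
        ldap_create_page_control(ld, 100, cookie.bv_val ? &cookie : NULL, 0, &page);
        reqctrls[0] = page;
        rc = ldap_search_ext_s(ld, base, LDAP_SCOPE_SUBTREE, filter,
                               NULL, 0, reqctrls, NULL, NULL, 0, &res);
        ldap_control_free(page);
        if (rc != LDAP_SUCCESS) {
            if (res) {
                ldap_msgfree(res);
            }
            break;
        }

        /* Pull the paged results response control and its cookie. */
        ldap_parse_result(ld, res, &err, NULL, NULL, NULL, &resctrls, 1);
        if (cookie.bv_val) {
            ber_memfree(cookie.bv_val);
            cookie.bv_val = NULL;
            cookie.bv_len = 0;
        }
        prctrl = resctrls ? ldap_control_find(LDAP_CONTROL_PAGEDRESULTS,
                                              resctrls, NULL) : NULL;
        if (prctrl) {
            ldap_parse_pageresponse_control(ld, prctrl, &estimate, &cookie);
        }
        ldap_controls_free(resctrls);

        /* An empty cookie means the server has no more pages for this search. */
        more = (cookie.bv_len > 0);
    }
    if (cookie.bv_val) {
        ber_memfree(cookie.bv_val);
    }
    return rc;
}
}}}

Only requests that carry this control reach pagedresults_parse_control_value on the server side, which is why the error message above is surprising for a non-paged search.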
Replying to [comment:1 nkinder]:
What version of 389-ds-base are you using? This might be an issue that has already been fixed. Please follow the instructions here for generating a stack trace of the crash. That will help us to identify if this is an issue that we have already fixed: http://port389.org/wiki/FAQ#Debugging_Crashes
This is the startup info:

{{{
[20/Feb/2014:16:16:53 +0000] - 389-Directory/1.2.11.25 B2013.325.1951 starting up
}}}

These are the RPMs installed from EPEL:

{{{
[root@server ~]# rpm -qa | grep -i 389
389-ds-1.2.2-1.el6.noarch
389-ds-console-doc-1.2.6-1.el6.noarch
389-ds-base-libs-1.2.11.25-1.el6.x86_64
389-admin-1.1.35-1.el6.x86_64
389-admin-console-doc-1.1.8-1.el6.noarch
389-admin-console-1.1.8-1.el6.noarch
389-dsgw-1.1.11-1.el6.x86_64
389-adminutil-1.1.19-1.el6.x86_64
389-ds-console-1.2.6-1.el6.noarch
389-console-1.1.7-1.el6.noarch
389-ds-base-1.2.11.25-1.el6.x86_64
}}}
I have not yet run more tests with the application that crashes the server when it uses paged searches; I have disabled it for now. I will run some tests following the debugging instructions you have provided.
Replying to [comment:2 nhosoi]:
It works for me (tested with the master).

{{{
[..] conn=8 op=0 BIND dn="uid=tuser0,o=my.com" method=128 version=3
[..] conn=8 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=tuser0,o=my.com"
[..] conn=8 op=1 SRCH base="o=my.com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=tuser1))" attrs="mail postalAddress description uid sn postalCode givenName"
[..] conn=8 op=1 RESULT err=0 tag=101 nentries=1 etime=0
[..] conn=8 op=2 SRCH base="o=my.com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=tuser1))" attrs="mail postalAddress description uid sn postalCode givenName"
[..] conn=8 op=2 RESULT err=0 tag=101 nentries=1 etime=0
[..] conn=8 op=3 SRCH base="o=my.com" scope=2 filter="(&(objectClass=organizationalPerson)(uid=tuser1))" attrs="mail postalAddress description uid sn postalCode givenName"
[..] conn=8 op=3 RESULT err=0 tag=101 nentries=1 etime=0
[..] conn=8 op=5 UNBIND
[..] conn=8 op=5 fd=64 closed - U1
}}}

Please note: if a search is a "non paged search" -- with no LDAP_CONTROL_PAGEDRESULTS, it won't call pagedresults_parse_control_value. So, I'm puzzled why you see "pagedresults_parse_control_value: invalid cookie: -1" in the error log...

{{{
461     if ( slapi_control_present (ctrlp, LDAP_CONTROL_PAGEDRESULTS,
462                                 &ctl_value, &iscritical) )
463     {
464         rc = pagedresults_parse_control_value(pb, ctl_value,
465                                               &pagesize, &pr_idx);
}}}

If you could provide your client program/script to reproduce the problem, it may help our investigation. Thanks.
The LDAP client application that causes the problem is an OpenCMS module that provides LDAP support (Alkacon OCEE LDAP for OpenCMS v8). It appears that it uses com.sun.jndi.ldap beans.
I have not yet run more tests with the application that crashes the server when it uses paged searches. I have disabled paged searches in the module, and without them it works fine. I will run some tests with more debug info.
For reference, a Red Hat Bugzilla has been opened for a very similar report: bz 1071707 - rhds91 389-ds-base-1.2.11.15-31.el6_5 crash on paged searches followed by simple srch
I'm trying to reproduce the problem, but so far no luck. Using the same connection (conn=8), I ran one operation without the SIMPLE PAGED RESULTS control (op=2) in the middle of the others. But the following "next page" simple paged results requests were processed without causing the reported crash. I also issued the non SIMPLE PAGED RESULTS search as op=1 or later, but there was no difference. If you could provide us a stacktrace from the crash and/or a reproducer, it'd be a big help...

{{{
[..] conn=8 op=0 BIND dn="uid=TVradmin0,ou=People,o=my.com" method=128 version=3
[..] conn=8 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=tvradmin0,ou=people,o=my.com"
[..] conn=8 op=1 SRCH base="o=my.com" scope=2 filter="(uid=*)" attrs="uid cn mail"
[..] conn=8 op=1 RESULT err=0 tag=101 nentries=1 etime=0 notes=U,P
[..] conn=8 op=2 SRCH base="o=my.com" scope=2 filter="(cn=*)" attrs="uid cn mail"
[..] conn=8 op=2 RESULT err=0 tag=101 nentries=100 etime=0
[..] conn=8 op=3 SRCH base="o=my.com" scope=2 filter="(objectClass=*)" attrs="uid cn mail"
[..] conn=8 op=3 RESULT err=0 tag=101 nentries=1 etime=0 notes=U,P
[..] conn=8 op=4 SRCH base="o=my.com" scope=2 filter="(uid=*)" attrs="uid cn mail"
[..] conn=8 op=4 RESULT err=0 tag=101 nentries=1 etime=0 notes=U,P
[..] conn=8 op=5 SRCH base="o=my.com" scope=2 filter="(cn=*)" attrs="uid cn mail"
[..] conn=8 op=5 RESULT err=0 tag=101 nentries=1 etime=0 notes=P
[..]
}}}
I think the problem is caused when the same connection makes concurrent requests before the paged searches have finished. This afternoon we will take some time to apply the stack trace configuration on the laboratory host (http://port389.org/wiki/FAQ#Debugging_Crashes), and we will force searches from the app client. When it crashes, we will upload the stack trace file.
Thanks a lot.
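To make that hypothesis concrete, here is a minimal libldap sketch (an illustration, not the actual client) of two searches overlapping on one connection: an asynchronous paged search, and a plain search sent before the paged one has been read back. The handle is assumed to be already bound; DNs and filters are placeholders.

{{{
#include <sys/time.h>
#include <ldap.h>

/* Two searches overlapping on one connection: an asynchronous simple paged
 * results search, then a plain search sent before the paged one has been
 * read back. Assumes "ld" is an already bound LDAP handle. */
int overlap_searches(LDAP *ld)
{
    LDAPControl *page = NULL, *ctrls[2] = { NULL, NULL };
    LDAPMessage *res = NULL;
    int paged_msgid = -1, plain_msgid = -1;
    struct timeval tv = { 10, 0 };

    /* First request: paged search (page size 5), sent asynchronously. */
    ldap_create_page_control(ld, 5, NULL, 0, &page);
    ctrls[0] = page;
    ldap_search_ext(ld, "dc=mydomain,dc=com", LDAP_SCOPE_SUBTREE,
                    "(objectClass=organizationalPerson)", NULL, 0,
                    ctrls, NULL, NULL, 0, &paged_msgid);
    ldap_control_free(page);

    /* Second request on the same connection, issued before any result of the
     * paged search has been consumed. */
    ldap_search_ext(ld, "dc=mydomain,dc=com", LDAP_SCOPE_SUBTREE,
                    "(&(objectClass=organizationalPerson)(uid=user2))", NULL, 0,
                    NULL, NULL, NULL, 0, &plain_msgid);

    /* Drain whatever arrives; the loop ends on the 10 second timeout once
     * both operations have been answered. */
    while (ldap_result(ld, LDAP_RES_ANY, LDAP_MSG_ALL, &tv, &res) > 0) {
        ldap_msgfree(res);
        res = NULL;
    }
    return 0;
}
}}}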
Replying to [comment:8 nhosoi]:
attachment access.buf
Hello!
We have successfully obtained a core dump after a server crash:

{{{
[root@server ~]# cat /var/log/ldap/ldap-access | grep conn=674
[07/Mar/2014:12:26:49 +0000] conn=674 fd=69 slot=69 connection from X.X.X.X to Y.Y.Y.Y
[07/Mar/2014:12:26:49 +0000] conn=674 op=0 BIND dn="uid=opencms-ldap,o=Applications,o=gobiernodecanarias,c=es" method=128 version=3
[07/Mar/2014:12:26:49 +0000] conn=674 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="uid=opencms-ldap,c=es"
[07/Mar/2014:12:26:49 +0000] conn=674 op=1 SRCH base="c=es" scope=2 filter="(&(objectClass=organizationalPerson)(uid=emoruser))" attrs="mail postalAddress description uid sn postalCode givenName"
[07/Mar/2014:12:26:49 +0000] conn=674 op=1 RESULT err=0 tag=101 nentries=1 etime=0 notes=P
[07/Mar/2014:12:26:49 +0000] conn=674 op=2 ABANDON targetop=Simple Paged Results
[07/Mar/2014:12:26:56 +0000] conn=674 op=3 SRCH base="c=es" scope=2 filter="(&(objectClass=organizationalPerson)(uid=admin))" attrs="mail street telephoneNumber postalAddress description uid sn postalCode givenName l"
[07/Mar/2014:12:26:56 +0000] conn=674 op=3 RESULT err=0 tag=101 nentries=3 etime=0
[07/Mar/2014:12:26:56 +0000] conn=674 op=4 SRCH base="c=es" scope=2 filter="(&(objectClass=organizationalPerson)(uid=lbaruser))" attrs="mail postalAddress description uid sn postalCode givenName"
}}}

The last SRCH (op=4) is the operation after which the server crashed.
After the crash, we processed the core dump with gdb. I attach the generated file (access.buf).
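The interesting line in the log above is the ABANDON of the Simple Paged Results operation. For illustration only, this is roughly how a libldap client would produce such an abandon (the actual application uses JNDI, so this is an assumption about the pattern, not its code; the handle is assumed bound and the names are placeholders taken from the log):

{{{
#include <ldap.h>

/* An asynchronous paged search whose message id is then abandoned instead of
 * fetching further pages. Assumes "ld" is an already bound LDAP handle. */
int abandon_paged_search(LDAP *ld)
{
    LDAPControl *page = NULL, *ctrls[2] = { NULL, NULL };
    int msgid = -1;

    ldap_create_page_control(ld, 5, NULL, 0, &page);
    ctrls[0] = page;
    ldap_search_ext(ld, "c=es", LDAP_SCOPE_SUBTREE,
                    "(&(objectClass=organizationalPerson)(uid=emoruser))",
                    NULL, 0, ctrls, NULL, NULL, 0, &msgid);
    ldap_control_free(page);

    /* Give up on the outstanding paged search; this is the kind of abandon
     * that shows up above as "ABANDON targetop=Simple Paged Results". */
    ldap_abandon_ext(ld, msgid, NULL, NULL);
    return 0;
}
}}}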
attachment access2.buf
attachment access3.buf
{{{
gdb -ex 'set print elements 0' -ex 'set confirm off' -ex 'set pagination off' -ex 'print loginfo.log_access_buffer.top' -ex 'thread apply all bt' -ex 'quit' /usr/sbin/ns-slapd /var/log/ldap/core.8789 > access4.buf 2>&1
}}}

access4.buf
Hello,
I have attached a stack trace of the crash including "thread apply all bt".
Thanks,
Jamie
attachment stacktrace.1395445012.txt
I might be facing something similar; our host dies on a daily basis. The server is a read-only slave with slightly older RPMs, running on RedHat 6.5:

{{{
389-ds-base-1.2.11.15-32.el6_5.x86_64
389-ds-base-debuginfo-1.2.11.15-32.el6_5.x86_64
}}}
I included [https://fedorahosted.org/389/raw-attachment/ticket/47707/stacktrace.1395445012.txt stacktrace] from our host.
Reproducer for this ticket: paged_def.c
Core dump backtrace: core_signature.txt
Steps to reproduce it:
{{{
dn: cn=userXXXX,o=redhat
objectclass: inetorgperson
cn: userXXXX
sn: userXXXX
description: descXXXX
userpassword: userXXXX
}}}
(In my test case I have replication of o=redhat configured. That could be irrelevant, but remember that this is a timing issue.)

NOTE: in my test case the sn attribute is not indexed.

We should see a core dump with the signature shown in core_signature.txt.
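For convenience, here is a small helper (an assumption of mine, not part of the attached reproducer) that writes user entries matching the template above to an LDIF file; the entry count of 15000 is only borrowed from the test data set size mentioned later in this ticket.

{{{
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("users.ldif", "w");
    int i;

    if (!f) {
        return 1;
    }
    /* 15000 entries; the count is an assumption borrowed from the data set
     * mentioned later in this ticket. */
    for (i = 0; i < 15000; i++) {
        fprintf(f,
                "dn: cn=user%d,o=redhat\n"
                "objectclass: inetorgperson\n"
                "cn: user%d\n"
                "sn: user%d\n"
                "description: desc%d\n"
                "userpassword: user%d\n\n",
                i, i, i, i, i);
    }
    fclose(f);
    return 0;
}
}}}

The resulting users.ldif can then be imported or added with ldapadd before running paged_def.c.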
Diffs I have used to fix the core dump: diffs_pagedresults.txt
Hi German,
Your fix looks good to me. May I ask you to run one more test with your patch?
Could you keep the paged results search connection open until it reaches the search time limit? I'd like to see the connection get disconnected and how the paged result handles are cleaned up then. If no double free is observed, I think #47706 is fixed as well...
Also, when adding a patch, it would be nice if you could generate it via git; it can then be applied to other developers' local trees more easily, and the patch carries your name and comments. Please take a look at this page, and let us know if you have any questions: http://directory.fedoraproject.org/wiki/GIT_Rules
I'm trying to duplicate the crash using your test client. I tested it against the server built from the master and 389-ds-base-1.2.11 branches (without applying your patch) on Fedora 19, but it looks like the servers work just fine.
Following your instructions, I set up 2-way MMR and imported 15000 user entries, then started the test program "testpagedsearch". It's been running for more than 30 min...
How long did it take for you to crash the server? Which 389-ds-base branch and OS did you use for your debugging? Any advice to reproduce the crash would be greatly appreciated...
I cannot reproduce it in F19 with paged_def.c
389-ds-base-1.3.1.22-1.fc19.x86_64
I have found the reason why the crash is not reproduced in the master branch.
In fact, https://fedorahosted.org/389/ticket/47623 is protecting (in a way) the code that is provoking the crash; a check has been added by that fix.
So, most likely the assignment is not taking place there. The crash happens only when *index == -1 (which is the case when the operation has been abandoned).
The proposed patch was protecting this code a little more, since *index could be -1, and it also checks whether the operation has been abandoned in another part of the control decoding. But the crash was provoked only by the wrong assignment, imho.
I would still add a check that *index != -1 before the condition, though.
Regards,
German
Hi,
I want to share some additional investigation from my side.
I have used the master branch and my reproducer. I cannot crash the server, BUT from time to time I see the message:

{{{
[29/Apr/2014:12:03:49 +0200] - pagedresults_parse_control_value: invalid cookie: -1
}}}
So, this part in the code:
{{{
if (!conn->c_pagedresults.prl_list[index].pr_mutex) {
    conn->c_pagedresults.prl_list[index].pr_mutex = PR_NewLock();
}
}}}
is being called with *index==-1 indeed.
Now, I revert the latest changes to that source file (to get back to the customer's version):

{{{
git checkout d53e822 ldap/servers/slapd/pagedresults.c
git commit -m"reverting revs" ldap/servers/slapd/pagedresults.c
}}}

I rebuild it and it crashes. Not very often, but I do manage to reproduce the crash, right after getting the message:

{{{
[29/Apr/2014:12:25:42 +0200] - pagedresults_parse_control_value: invalid cookie: -1
}}}
To sum up:
- With the reverted (customer) version: a crash once the message "pagedresults_parse_control_value: invalid cookie: -1" appears.
- With master: you can wait and reproduce just the message "pagedresults_parse_control_value: invalid cookie: -1", and see that it is not crashing.
Some more hints to reproduce:
- Use the reproducer paged_def.c and change the ldap search to a quicker one, for instance (sn=user100) instead of (sn-user) in line 51.
- Don't let the reproducer run for long. The crash is nearly immediate, BUT you must restart dirsrv each time. Why? I think (I have not fully verified) that prl_list is not cleaned up after the operation has been abandoned, so the count keeps increasing. I am talking here about the file reverted to rev d53e822.
The crash is not taking place, but something still seems weird: a request for a new page (of an existing search) arrives, but prl_list has already been cleaned up? And that's why *index == -1.
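To make the suggested protection concrete, here is a small self-contained model of the guard proposed above (rejecting an invalid slot index such as -1 before touching prl_list[index]). The structs only mimic the field names quoted in this thread; this is not 389-ds-base code and not the actual patch.

{{{
#include <stdio.h>
#include <stdlib.h>

/* Model types only; field names mimic those quoted in this thread. */
typedef struct slot { void *pr_mutex; } slot_t;
typedef struct conn_model { slot_t *prl_list; int prl_count; } conn_model_t;

/* Returns 0 on success, -1 when the slot index is invalid (for example -1,
 * left over from an abandoned operation). */
static int touch_slot(conn_model_t *conn, int index)
{
    if (index < 0 || index >= conn->prl_count) {
        /* Refuse the request instead of writing before the start of the array. */
        fprintf(stderr, "invalid cookie: %d\n", index);
        return -1;
    }
    if (!conn->prl_list[index].pr_mutex) {
        conn->prl_list[index].pr_mutex = malloc(1); /* stand-in for PR_NewLock() */
    }
    return 0;
}

int main(void)
{
    conn_model_t conn = { calloc(4, sizeof(slot_t)), 4 };

    if (!conn.prl_list) {
        return 1;
    }
    touch_slot(&conn, 0);   /* valid slot: lazily creates its lock */
    touch_slot(&conn, -1);  /* abandoned request: rejected, no out-of-bounds write */

    free(conn.prl_list[0].pr_mutex);
    free(conn.prl_list);
    return 0;
}
}}}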
Bug Description: If a simple paged search request was sent to the server and the request was abandoned, the paged result slot in the connection table was not properly released (by setting pr_current_be to NULL). Since the slot did not look available for the next request even though it was, the next request failed to get a valid slot number, and the initial slot number -1 failed to be replaced with the real slot number. Until the fix for "Ticket #47623 fix memleak caused by 47347" was made, this overwrote the allocated array's [-1] location, which usually stores the metadata of the allocated memory. That crashed the server in the next realloc, since the corrupted memory was passed to the function.
Fix Description: This patch cleans up the abandoned slot so it can be reused. Also, an additional check is added so that the metadata is not broken.
Special thanks to German Parente (gparente@redhat.com) for providing the reproducer and analysing the crash.
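As a standalone illustration of the failure mode in the bug description (writing to the [-1] location of a heap-allocated array and crashing on the next realloc), the following self-contained program typically aborts with heap corruption when run. It contains no 389-ds-base code; it only models the mechanism.

{{{
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int *slots = malloc(8 * sizeof(int));
    volatile int index = -1;   /* analogous to the slot number that was never
                                * replaced: "invalid cookie: -1" */

    if (!slots) {
        return 1;
    }
    memset(slots, 0, 8 * sizeof(int));

    /* Undefined behaviour: this write typically lands in the allocator's
     * bookkeeping that sits just before the array. */
    slots[index] = 42;

    /* The corruption is usually detected here, mirroring "crashed the server
     * in the next realloc". */
    slots = realloc(slots, 16 * sizeof(int));
    free(slots);
    return 0;
}
}}}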
Git patch file (master): 0001-Ticket-47707-389-DS-Server-crashes-and-dies-while-ha.patch
Reviewed by Rich (Thank you!!)
Pushed to master: d341b77..087356f master -> master commit 087356f
Pushed to 389-ds-base-1.3.2: 65c5e58..2132875 389-ds-base-1.3.2 -> 389-ds-base-1.3.2 commit 2132875
Pushed to 389-ds-base-1.3.1: ea1b127..40e86e7 389-ds-base-1.3.1 -> 389-ds-base-1.3.1 commit 40e86e74fb4ecc0fc5a1027d8241945d9b2564e0
Pushed to 389-ds-base-1.2.11: 00a7594..b2ee65d 389-ds-base-1.2.11 -> 389-ds-base-1.2.11 commit b2ee65d
Metadata Update from @gparente: - Issue assigned to nhosoi - Issue set to the milestone: 1.2.11.30
389-ds-base is moving from Pagure to Github. This means that new issues and pull requests will be accepted only in 389-ds-base's github repository.
This issue has been cloned to Github and is available here: - https://github.com/389ds/389-ds-base/issues/1042
If you want to receive further updates on the issue, please navigate to the github issue and click on subscribe button.
Thank you for understanding. We apologize for any inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Fixed)