#47997 Individual Threads Always Answering With Empty Result Sets
Closed: wontfix. Opened 9 years ago by jbern.

We have a CentOS 6.6 system running Zarafa which, in turn, uses 389-server as its backend (unencrypted connection to 127.0.0.1). We had a major "users not known anymore"-style malfunction on 05-Jan that wasn't fully analyzed, another after a reboot on 15-Jan, and a third on 19-Jan, also after a reboot.

After 15-Jan, I tore into the logs and found a correlation between the malfunctions in Zarafa and postfix, and entries in 389-server's logs saying that it successfully returned "nentries=0" for lookups where entries should have been found.

Those erroneous answers correlated with specific threads(?) - "conn=3" on 15-Jan, "conn=4" and "conn=5" on 19-Jan - which would always return an empty result:

$ for CONN in `grep ' conn=[1-9][0-9]* op=[1-9].* RESULT ' \
> ANALYSIS.slapd-access-15Jan.excerpt | sed -e 's/.* conn=//' \
>    -e 's/ op=.*//' | sort -nu` ; do
>    grep " conn=$CONN "'op=[1-9].* RESULT ' ANALYSIS.slapd-access-15Jan.excerpt | \
>       sed -e 's/.* nentries=//' -e 's/ .*//' | \
>       awk '{ n++; if ($1==0) z++; } END { if ((n>9)&&((100*z/n)>95)) \
>          printf "conn %4u\t%3u / %3u\t%3u %%\n",'"$CONN"',z,n,(100*z/n); }'
> done
conn    3       981 / 981       100 %
conn   93        91 /  94        96 %
conn  236       189 / 193        97 %
conn  372        73 /  75        97 %
conn  700        91 /  94        96 %
conn 1068        91 /  94        96 %
conn 1468        23 /  24        95 %
conn 1577       125 / 128        97 %

(Low conn numbers indicate that the thread is talking with the most long-running and basic Zarafa processes. Since there is a high percentage of requests where an empty set is the correct result, I cannot rule out the possibility that the problem occasionally appears in later threads as well and then simply goes unnoticed.)
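For readers who want to run the same check, the loop above can be condensed into a single awk pass. The following is only a sketch against a fabricated three-line excerpt — the `/tmp` file names are placeholders; the real input would be the slapd access log:

```shell
# Fabricated sample; in practice, point this at the slapd access log.
cat > /tmp/access.sample <<'EOF'
[15/Jan/2015:20:31:31 +0100] conn=3 op=106 RESULT err=0 tag=101 nentries=0 etime=0 notes=P
[15/Jan/2015:20:31:31 +0100] conn=4 op=68 RESULT err=0 tag=101 nentries=1 etime=0 notes=P
[15/Jan/2015:20:31:32 +0100] conn=3 op=107 RESULT err=0 tag=101 nentries=0 etime=0 notes=P
EOF
# One pass: count RESULT lines and empty results per connection.
awk '/ RESULT / {
       conn = ""; nent = ""
       for (i = 1; i <= NF; i++) {
         if ($i ~ /^conn=/)     conn = substr($i, 6)
         if ($i ~ /^nentries=/) nent = substr($i, 10)
       }
       n[conn]++
       if (nent == 0) z[conn]++
     }
     END {
       for (c in n)
         printf "conn %s\t%u / %u\t%u %%\n", c, z[c], n[c], 100 * z[c] / n[c]
     }' /tmp/access.sample | sort | tee /tmp/access.summary
```

With the sample above this prints one line per connection: 2 of 2 results empty (100 %) for conn=3, 0 of 1 for conn=4.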

Here's an example of two different threads giving different replies for the same lookup in the same second:

[15/Jan/2015:20:31:31 +0100] conn=4 op=68 SRCH base="dc=pawisda,dc=de" scope=2 filter="(&(&(objectClass=zarafa-user)(|(objectClass=zarafa-user)(objectClass=zarafa-contact)))(|(uid=friedrich)))" attrs="objectClass zarafaSharedStoreOnly zarafaResourceType zarafaSecurityGroup uidNumber nsUniqueId ou cn cn modifyTimestamp"
[15/Jan/2015:20:31:31 +0100] conn=4 op=68 RESULT err=0 tag=101 nentries=1 etime=0 notes=P
[15/Jan/2015:20:31:31 +0100] conn=3 op=106 SRCH base="dc=pawisda,dc=de" scope=2 filter="(&(&(objectClass=zarafa-user)(|(objectClass=zarafa-user)(objectClass=zarafa-contact)))(|(uid=friedrich)))" attrs="objectClass zarafaSharedStoreOnly zarafaResourceType zarafaSecurityGroup uidNumber nsUniqueId ou cn cn modifyTimestamp"
[15/Jan/2015:20:31:31 +0100] conn=3 op=106 RESULT err=0 tag=101 nentries=0 etime=0 notes=P

(The server was started at 20:28:37, conn=3 was opened at 20:30:07 and kept giving out empty results until 22:13:59. We rebooted again at 22:29.)

# rpm -qa | grep 389-ds
389-ds-base-libs-1.2.11.15-48.el6_6.x86_64
389-ds-console-1.2.6-1.el6.noarch
389-ds-1.2.2-1.el6.noarch
389-ds-base-1.2.11.15-48.el6_6.x86_64
389-dsgw-1.1.11-1.el6.x86_64
389-ds-console-doc-1.2.6-1.el6.noarch

No updates were installed on the system between 21-Dec and 08-Jan, but there were no extra reboots in that period either. The 389-server's error log remains empty during the incidents.

Since any single "user doesn't exist" reply might be wrong, we're currently forced to have the server return temporary failures (4xx SMTP codes) instead of permanent ones, and misaddressed e-mails pile up in the central mailserver's queue. Could someone please point us to a fix, or to further ways of analyzing the problem?


I have a question regarding the example you put in the original report. I see "notes=P" in the RESULT line, which means you are searching with a "simple paged results" request, i.e. the results are returned in pages. For instance, is the search operation on conn=3 the first round of the paged search or the second?

Here is an example of a simple paged results search: I sent a request with page size 550 against 1000 matching entries. The first round returns 550 entries, the second the remaining 450.
[..] conn=14 op=1 SRCH base="o=stress_ssl" scope=2 filter="(cn=test)" attrs="distinguishedName"
[..] conn=14 op=1 RESULT err=0 tag=101 nentries=550 etime=0 notes=P
[..] conn=14 op=2 SRCH base="o=stress_ssl" scope=2 filter="(cn=test)" attrs="distinguishedName"
[..] conn=14 op=2 RESULT err=0 tag=101 nentries=450 etime=0 notes=P

I'm not sure that this answers your question, and I don't have insight into Zarafa's code, but:

-- the request shown is for a single user/mailbox and IMHO should never have more than one match
-- in agreement with that, searches for a "uid" yielding != 1 result (and outside the affected thread) seem to be exceedingly rare in the log (and then yield 0 entries)
-- the surrounding requests within conn=3 search for patterns different from the one shown
-- (''across'' threads, 4/68 obviously ''does'' lend itself to 3/106 being a continuation)
-- other searches (exploding e-mail aliases) return up to 79 entries in a single reply
-- another uid-related search had conn=3 reply with nentries=0 ''two minutes'' apart from other threads answering nentries=1 to the exact same request

I wish I could read the zarafa code that is sending the search requests to the Directory Server. Judging from this man page, it is indeed sending the searches with the simple paged results control:
{{{
http://linux.die.net/man/5/zarafa-ldap.cfg
ldap_page_size
Limit result sets in pages of this size, downloading fewer results at a time from the LDAP server.
Default: 1000
}}}

another uid-related search had conn=3 reply with nentries=0 two minutes apart from other threads answering nentries=1 to the exact same request

Unfortunately, the simple paged results control itself is not logged in the access log, so we cannot tell whether a given search request is for the first page or a later one.

If you are interested, please run this command line. I'd expect it to return one entry and stop there, since ldapsearch understands there are no more entries and closes the connection.
$ ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -w <your_password> -b "dc=pawisda,dc=de" -E pr=1 "(&(&(objectClass=zarafa-user)(|(objectClass=zarafa-user)(objectClass=zarafa-contact)))(|(uid=friedrich)))" dn

Would it be possible to find a simple reproducer without using zarafa?

I'm still not convinced that paging has anything to do with it, as the request in question really has just '''one''' match (and ldap_page_size is at the default 1000):

{{{
[root@zcp zarafa]# ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -w cat -b "dc=pawisda,dc=de" -E pr=1000 "(&(&(objectClass=zarafa-user)(|(objectClass=zarafa-user)(objectClass=zarafa-contact)))(|(uid=friedrich)))" dn
[...]
dn: uid=friedrich,ou=People,dc=pawisda,dc=de

[root@zcp zarafa]# ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -w cat -b "dc=pawisda,dc=de" -E pr=1 "(&(&(objectClass=zarafa-user)(|(objectClass=zarafa-user)(objectClass=zarafa-contact)))(|(uid=friedrich)))" dn
[...]
dn: uid=friedrich,ou=People,dc=pawisda,dc=de
}}}

Actual multi-result queries work as I'd expect:

{{{
[root@zcp zarafa]# ldapsearch -LLLx -h localhost -p 389 -D 'cn=directory manager' -w cat -b "dc=pawisda,dc=de" -E pr=1 $REQUEST_FOR_USERLIST dn
[...]
dn: uid=Administrator,ou=People,dc=pawisda,dc=de

pagedresults: estimate=79 cookie=MA==

Press [size] Enter for the next {1|size} entries.

dn: uid=worsch,ou=People,dc=pawisda,dc=de

pagedresults: estimate=78 cookie=MA==

Press [size] Enter for the next {1|size} entries.
[...]
}}}

I wish I could read the zarafa code that is sending the search requests to the Directory Server.

I never dug into the code myself (not a developer :-S ), but Zarafa ''is'' Open Source:

http://download.zarafa.com/community/final/7.1/7.1.11-46050/sourcecode/

FWIW, I see the pertinent ''defaults'' within zcp-7.1.11.tar.gz , in zarafa-7.1.11/provider/plugins/LDAPUserPlugin.cpp ...

Would it be possible to find a simple reproducer without using zarafa?

If my log analysis is correct, the trick is to get a connection / thread that shows the incorrect behavior, and those ''might'' be limited to the first connections after 389-server's startup ...

I'll try to script ldapsearch calls like above and see whether that triggers the symptom, and how often.

Hello jbern,

I wrote a small test program based upon the code you mentioned (zarafa-7.1.11/provider/plugins/LDAPUserPlugin.cpp -- thanks!!), but the server always returns 1 result, as expected...
# egrep nentries= access | egrep notes=P | egrep -v nentries=1
#
I looped the simple paged results search on a connection some 50000 times, and ran a dozen of the test programs simultaneously, with no luck...

I have one question regarding the bind dn used by zarafa. If you run this grep, what is returned?
# grep "conn=3" access | egrep "\<BIND\>"
And could you please find the RESULT of the operation?
# grep "conn=3 op=0" access

Is it different from the other connections?

And what is the value of this config param: nsslapd-allow-anonymous-access? Is it off?

Thanks,
--noriko

Hello jbern,

Are you still having this problem? Have you had a chance to look at comment 7?
https://fedorahosted.org/389/ticket/47997#comment:7

We'd greatly appreciate it if you could provide us with more information.

Thanks,
--noriko

I'm terribly sorry, we had some intervening hi-prio activities. I'll look into the analysis you suggested, hopefully this evening.

In the meantime, as preliminary information: We had no choice but to do ''some'' reboots of the server in the meantime, and pretty much made it SOP to check for "never a nonempty reply" threads immediately afterwards, like here:

{{{
$ for CONN in `grep '^.13.Feb.2015:00:[3-5]. conn=[1-9][0-9]* op=[1-9].* RESULT ' \
>    /var/log/dirsrv/slapd-zcp/access | sed -e 's/.* conn=//' \
>    -e 's/ op=.*//' | sort -nu` ; do
>    grep "^.13.Feb.2015:00:[3-5]. conn=$CONN "'op=[1-9].* RESULT ' \
>       /var/log/dirsrv/slapd-zcp/access | sed -e 's/.* nentries=//' -e 's/ .*//' | \
>       awk '{ n++; if ($1==0) z++; } END { if ((n>5)&&((100*z/n)>90)) printf \
>          "conn %4u\t%3u / %3u\t%3u %%\n",'"$CONN"',z,n,(100*z/n); }'
> done
conn    5         7 /   7       100 %
conn    6        15 /  15       100 %
conn    7        17 /  17       100 %
conn    8        18 /  18       100 %
conn    9        15 /  15       100 %
conn   20        91 /  94        96 %
conn   41       114 / 115        99 %
conn  249        17 /  18        94 %
conn  332        65 /  66        98 %
}}}

(In this case, there were also visible malfunctions in that test logins to WebAccess and iCal services were refused.)

So far (on the order of four reboots), the problem reappeared ''every'' time after the reboot, and apparently ''always'' got fixed by restarting just the zarafa-server process ''once''.

(I should probably also point out that SELinux is currently in "permissive" mode.)

OK, technically it already is evening here, so ...

I notice that I didn't explicitly say this before, so let me add it here: We're not running 389-server standalone, but as part of the (Red Hat) Directory Server application, as offered by the CentOS repositories.

I had a little permanent client running as well; it collected ~133000 responses (with ldapsearch and an explicit base), and the only errors (which may have been empty replies ''or no reply at all'') were recorded at the times when the server got rebooted.

I'm afraid that I don't see a parameter "nsslapd-allow-anonymous-access" in any config file. I'll check with some colleagues to make sure I'm not missing something here.

Cross-referencing with the bind dn used looks promising. conn=3 from the original incident used an '''empty''' one:

{{{
[15/Jan/2015:20:30:07 +0100] conn=3 op=0 BIND dn="" method=128 version=3
[15/Jan/2015:20:30:07 +0100] conn=3 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
}}}

So did the suspicious conns from the last reboot (nos. 5-9), but not the neighboring ones:

{{{
[13/Feb/2015:00:39:01 +0100] conn=4 op=0 SRCH base="ou=People,dc=pawisda,dc=de" scope=1 filter="(&(zarafaAliases=technik@linworks.de)(zarafaAccount=1))" attrs="mail"
[13/Feb/2015:00:39:01 +0100] conn=4 op=0 RESULT err=0 tag=101 nentries=0 etime=0 notes=U
[13/Feb/2015:00:39:55 +0100] conn=5 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=9 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=8 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=6 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=9 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=8 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=7 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=6 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=7 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=5 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=10 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=10 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[13/Feb/2015:00:39:55 +0100] conn=11 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=11 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[13/Feb/2015:00:39:55 +0100] conn=12 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=12 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
}}}

One second, let me kibo the current logs ...

... umh, not ''that'' good a match: there are 684841 BINDs in the current logs and 1897 of them have an empty bind dn ... quite evenly distributed over office hours.
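To see how those empty-dn BINDs distribute over time, one can bucket them per hour. A sketch against a fabricated excerpt (the `/tmp` file names are placeholders; the real input would be the access log):

```shell
# Fabricated sample; the real input would be the slapd access log.
cat > /tmp/access.binds <<'EOF'
[13/Feb/2015:09:15:01 +0100] conn=100 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:09:47:22 +0100] conn=113 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[13/Feb/2015:10:02:45 +0100] conn=150 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:10:30:00 +0100] conn=161 op=0 BIND dn="" method=128 version=3
EOF
# Keep only anonymous binds, cut the timestamp down to date:hour, then tally.
grep ' BIND dn="" ' /tmp/access.binds | \
    sed -e 's/^\[//' -e 's/^\([^:]*:[0-9][0-9]\):.*/\1/' | \
    sort | uniq -c | tee /tmp/binds.hourly
```

With the sample above this yields one anonymous bind in the 09:00 hour and two in the 10:00 hour.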

Hmmm, let me try to correlate with the start times of the zarafa-server process that night ...

{{{
Fri Feb 13 00:39:02 2015: Starting zarafa-server version 7,1,11,46050, pid 2310
Fri Feb 13 00:43:27 2015: Starting zarafa-server version 7,1,11,46050, pid 3903
}}}

{{{
$ grep -c '^.13/Feb/2015:00:4[3-9]:.* BIND dn="" ' access.20150212-143217
0
}}}

... OK, I should check back with the system's designer about that, I guess. :-?

Replying to [comment:11 jbern]:

OK, technically it already is evening here, so ...

I notice that I didn't explicitly say this before, so let me add it here: We're not running 389-server standalone, but as part of the (Red Hat) Directory Server application, as offered by the CentOS repositories.

I'm not sure what you're getting at here. There is no difference between 389-server and RHDS (either as RHDS or centos-ds). There is a difference between 389/RHDS and IdM/freeipa, as the latter adds some additional functionality via plugins and slapi-nis.

I had a little permanent client running as well; it collected ~133000 responses (with ldapsearch and an explicit base), and the only errors (which may have been empty replies ''or no reply at all'') were recorded at the times when the server got rebooted.

I'm afraid that I don't see a parameter "nsslapd-allow-anonymous-access" in any config file. I'll check with some colleagues to make sure I'm not missing something here.

If the attribute is set to the default value, it will not be in dse.ldif. You'll have to do an ldapsearch of cn=config to see all of the parameters.

Cross-referencing with the bind dn used looks promising. conn=3 from the original incident used an '''empty''' one:

{{{
[15/Jan/2015:20:30:07 +0100] conn=3 op=0 BIND dn="" method=128 version=3
[15/Jan/2015:20:30:07 +0100] conn=3 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
}}}

That usually means an anonymous bind e.g. ldapsearch without specifying -D or some other auth mechanism.

So did the suspicious conns from the last reboot (nos. 5-9), but not the neighboring ones:

{{{
[13/Feb/2015:00:39:01 +0100] conn=4 op=0 SRCH base="ou=People,dc=pawisda,dc=de" scope=1 filter="(&(zarafaAliases=technik@linworks.de)(zarafaAccount=1))" attrs="mail"
[13/Feb/2015:00:39:01 +0100] conn=4 op=0 RESULT err=0 tag=101 nentries=0 etime=0 notes=U
[13/Feb/2015:00:39:55 +0100] conn=5 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=9 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=8 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=6 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=9 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=8 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=7 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=6 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=7 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=5 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn=""
[13/Feb/2015:00:39:55 +0100] conn=10 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=10 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[13/Feb/2015:00:39:55 +0100] conn=11 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=11 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
[13/Feb/2015:00:39:55 +0100] conn=12 op=0 BIND dn="cn=Directory Manager" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=12 op=0 RESULT err=0 tag=97 nentries=0 etime=0 dn="cn=directory manager"
}}}

One second, let me kibo the current logs ...

... umh, not ''that'' good a match: there are 684841 BINDs in the current logs and 1897 of them have an empty bind dn ... quite evenly distributed over office hours.

I'm sorry. What are you trying to do here? Figure out which connections are from zarafa? Is there any other ldap client running on the same machine as zarafa? Is the directory server running behind a load balancer? If not, then the connection line from the access log contains the IP address of the zarafa machine.

Hmmm, let me try to correlate with the start times of the zarafa-server process that night ...

{{{
Fri Feb 13 00:39:02 2015: Starting zarafa-server version 7,1,11,46050, pid 2310
Fri Feb 13 00:43:27 2015: Starting zarafa-server version 7,1,11,46050, pid 3903
}}}

{{{
$ grep -c '^.13/Feb/2015:00:4[3-9]:.* BIND dn="" ' access.20150212-143217
0
}}}

... OK, I should check back with the system's designer about that, I guess. :-?

The server is running Zarafa and everything needed to make it work (which includes the 389-server and a bunch of other software, but IIUC no LDAP client besides Zarafa itself). A moderate number of LDAP requests arrive from remote machines, almost half of them from the monitoring system. :-}

What are you trying to do here?

Find correlations. :-}

It seems that the problematic connections are the only ones to use the IPv6 localhost address (::1) '''and''' the empty bind dn (with many more doing ''either'' one ''or'' the other), but I don't see what that might be telling me ...
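The cross-check I did here can be automated by joining each connection's "connection from" line with its op=0 BIND line. A sketch against a fabricated excerpt (the `/tmp` file names are placeholders; the "connection from" line format follows what 389-server writes to the access log):

```shell
# Fabricated sample; the real input would be the slapd access log.
cat > /tmp/access.corr <<'EOF'
[13/Feb/2015:00:39:55 +0100] conn=5 fd=64 slot=64 connection from ::1 to ::1
[13/Feb/2015:00:39:55 +0100] conn=5 op=0 BIND dn="" method=128 version=3
[13/Feb/2015:00:39:55 +0100] conn=10 fd=65 slot=65 connection from 127.0.0.1 to 127.0.0.1
[13/Feb/2015:00:39:55 +0100] conn=10 op=0 BIND dn="cn=Directory Manager" method=128 version=3
EOF
# Remember each conn's source address, flag anonymous first binds,
# then print the connections matching both criteria (::1 AND empty dn).
awk '/ connection from / {
       for (i = 1; i <= NF; i++) if ($i ~ /^conn=/) c = substr($i, 6)
       src[c] = $(NF - 2)                 # address right after "connection from"
     }
     / op=0 BIND / {
       for (i = 1; i <= NF; i++) if ($i ~ /^conn=/) c = substr($i, 6)
       anon[c] = ($0 ~ / dn="" /)
     }
     END {
       for (c in src)
         if (src[c] == "::1" && anon[c]) print "suspect conn=" c
     }' /tmp/access.corr | tee /tmp/suspects.out
```

With the sample above, only conn=5 (anonymous bind over ::1) is reported; conn=10 is skipped because it binds as Directory Manager over 127.0.0.1.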

Replying to [comment:13 jbern]:

The server is running Zarafa and everything needed to make it work (which includes the 389-server and a bunch of other software, but IIUC no LDAP client besides Zarafa itself). A moderate amount of LDAP requests arrive from remote machines, almost half of them from the monitoring system. :-}

What are you trying to do here?

Find correlations. :-}

It seems that the problematic connections are the only ones to use the IPv6 localhost address (::1) '''and''' the empty bind dn (with many more doing ''either'' one ''or'' the other), but I don't see what that might be telling me ...

Hello, any progress on your side? Has Zarafa started working with 389-ds-base?

Regarding (::1): if you bind anonymously (no bind user) via localhost, the following lines are logged in the access log.
[..] conn=20 fd=65 slot=65 connection from ::1 to ::1
[..] conn=20 op=0 BIND dn="" method=128 version=3

Thanks,
--noriko

Since we haven't heard from you for a couple of weeks, we assume the problem was solved.

Let us close this ticket for now. Please reopen it when it comes back. Thanks.

For documentation: The issue was indeed caused by a bug introduced in Zarafa 7.1.11, where several threads could concurrently start using the LDAP backend/plugin (and didn't properly guard against the relevant config not yet having been read?).

Zarafa 7.1.12 fixed that bug, but - when started in safe mode - it still admonishes the admin that "if you weren't in safe mode, I'd ''still'' drop accounts right now". These log entries seem to be untrue, however; Zarafa has opened a new ticket for that issue:

https://jira.zarafa.com/browse/ZCP-13243

Metadata Update from @jbern:
- Issue assigned to nhosoi
- Issue set to the milestone: N/A

7 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1328

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: Invalid)

3 years ago
