Issue #299: dirsrv becomes unresponsive after an unknown amount of time - 389-ds-base

Closed: wontfix. Opened 12 years ago by crazed.

The environment:

600 client nodes using LDAP for auth (sssd 1.5.x)
2x dirsrv instances in multi-master configuration
dirsrv machines are Xen VMs with 1 core and 4 GB of RAM

A couple of days ago I ran into odd behavior where dirsrv would stop responding to LDAP queries. Each time dirsrv became unresponsive, I had to kill -9 the instance to bring it down and then start it back up. I did some troubleshooting on IRC with richm: we tried increasing the maximum number of file descriptors, though the limit was already 8192 and no descriptor problems were logged. The logconv.pl output did appear to show a lot of file descriptors in use. We also tried limiting concurrency by lowering the thread count, but this did not appear to help either. Eventually, as a last-ditch effort, I rebuilt the indexes using db2index.pl, and this stopped the problem. I'm not 100% sure why this fixed it, as there were no errors in the logs to indicate that the indexes were at fault. I've attached a stack trace, and below is the output from logconv.pl.

----------- Access Log Output ------------

Restarts:                     0

Total Connections:            139189
Peak Concurrent Connections:  55268
Total Operations:             2831767
Total Results:                2840487
Overall Performance:          100.0%

Searches:                     1947842
Modifications:                32
Adds:                         23
Deletes:                      0
Mod RDNs:                     0

Persistent Searches:          0
Internal Operations:          0
Entry Operations:             0
Extended Operations:          870
Abandoned Requests:           11138
Smart Referrals Received:     0

VLV Operations:               6
VLV Unindexed Searches:       6
SORT Operations:              170
SSL Connections:              138436

Entire Search Base Queries:   3710
Unindexed Searches:           146033

FDs Taken:                    139189
FDs Returned:                 83940
Highest FD Taken:             1397

Broken Pipes:                 0
Connections Reset By Peer:    0
Resource Unavailable:         0

Binds:                        883000
Unbinds:                      75139

 LDAP v2 Binds:               324
 LDAP v3 Binds:               882676
 SSL Client Binds:            0
 Failed SSL Client Binds:     0
 SASL Binds:                  0

 Directory Manager Binds:     61254
 Anonymous Binds:             818332
 Other Binds:                 3414
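
As a rough aid to reading the summary above, here is a minimal Python sketch (not part of the original report) that derives a few differences from the counters. The helper names `parse_summary` and `derive` and the derived metric names are hypothetical; the numbers are copied from the logconv.pl output. A gap between FDs taken and FDs returned may simply reflect connections that were still open, or that closed outside the analyzed log window, so treat the results as hints rather than a diagnosis.

```python
import re

# Counters copied verbatim from the logconv.pl summary in this ticket.
LOGCONV_SUMMARY = """\
Total Connections:            139189
Searches:                     1947842
Unindexed Searches:           146033
FDs Taken:                    139189
FDs Returned:                 83940
Binds:                        883000
Unbinds:                      75139
"""


def parse_summary(text):
    """Parse 'Name:   <integer>' lines into a dict of counters."""
    counters = {}
    for line in text.splitlines():
        match = re.match(r"\s*([^:]+):\s+(\d+)\s*$", line)
        if match:
            counters[match.group(1).strip()] = int(match.group(2))
    return counters


def derive(counters):
    """Compute simple differences and ratios (hypothetical metric names)."""
    return {
        # Descriptors taken but not seen returned in the analyzed log.
        "fds_outstanding": counters["FDs Taken"] - counters["FDs Returned"],
        # Binds that were not followed by an explicit unbind.
        "binds_without_unbind": counters["Binds"] - counters["Unbinds"],
        # Share of searches that logconv.pl flagged as unindexed.
        "unindexed_search_pct": round(
            100.0 * counters["Unindexed Searches"] / counters["Searches"], 1
        ),
    }


if __name__ == "__main__":
    for name, value in derive(parse_summary(LOGCONV_SUMMARY)).items():
        print(f"{name}: {value}")
```

The number of unindexed searches may be relevant to why rebuilding the indexes helped, but nothing in the logs confirms that.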

crazed commented 12 years ago

stack trace of dirsrv
389stacktrace.txt

mreynolds commented 12 years ago

Can the dse.ldif, access, and error logs please be provided?

Thanks,
Mark

crazed commented 12 years ago

attachment
dse.ldif

crazed commented 12 years ago

attachment
ldap_errors

crazed commented 12 years ago

attachment
ldap_access.gz

crazed commented 12 years ago

Mark,

I've attached the data you requested. The errors were happening from the 9th of February into the 10th; unfortunately, I only have access logs from the 10th because of the log rotation scheme. Also, I've replaced our domain name with foo for sanitization.

Thanks,
Allan

mreynolds commented 12 years ago

Hi Allan,

Thanks for getting me this information!

I'm not going to really get to look at this until next week, but I might have a workaround/solution for now.

Set nsslapd-ioblocktimeout to 15 seconds under cn=config. It's in milliseconds so the value would be 15000.
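
For illustration, the change described above could be expressed as an LDIF modify record. The Python sketch below only builds that text. The function name `ioblocktimeout_ldif` is hypothetical, and the attribute name, the `cn=config` entry, and the seconds-to-milliseconds conversion are the only details taken from this comment.

```python
def ioblocktimeout_ldif(seconds):
    """Build an LDIF modify record that sets nsslapd-ioblocktimeout.

    The attribute is expressed in milliseconds, so 15 seconds becomes 15000.
    """
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    millis = int(seconds * 1000)
    lines = [
        "dn: cn=config",
        "changetype: modify",
        "replace: nsslapd-ioblocktimeout",
        f"nsslapd-ioblocktimeout: {millis}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the record for the 15-second value suggested in this ticket.
    print(ioblocktimeout_ldif(15), end="")
```

The resulting text would typically be applied with an LDAP client such as ldapmodify, bound as a user allowed to change cn=config. The exact invocation depends on the installation and is not shown in this ticket.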

I've seen this same issue before, and it relates to SSL connections. I think the code can be improved, but the fix would still require setting the ioblocktimeout. If this setting doesn't help, don't unset it, as we will still need it moving forward.

Thanks,
Mark

mreynolds commented 12 years ago

Hi Allan,

Did you have a chance to test the config setting? How often were you running into this problem?

As I said before, I think the code can be slightly improved, but, based on the stack you provided, that change would not affect the scenario you were hitting. I believe this config setting will resolve your issue, but let me know what's going on.

Thanks,
Mark

mreynolds commented 12 years ago

Any update?

mreynolds commented 12 years ago

Any news yet? If I don't hear anything back by next week I will close this out.

Regards,
Mark

mreynolds commented 12 years ago

Closing ticket...