The environment:
A couple of days ago I saw odd behavior where dirsrv would stop responding to LDAP queries. Each time dirsrv became unresponsive, I had to kill -9 the instance to bring it down and then start it back up. I did some troubleshooting on IRC with richm. We tried increasing the maximum number of file descriptors, though it was already at 8192 and no problems were logged. Judging by the logconv.pl output, a lot of file descriptors did appear to be in use. We also tried limiting concurrency by lowering the thread count, but this did not appear to help either. Eventually, as a last-ditch effort, I rebuilt the indexes using db2index.pl, and this stopped the crashing problem. I'm not 100% sure why this fixed it, as there were no errors in the logs to indicate that the indexes were a problem. I've attached a stack trace, and below is the output from logconv.pl.
----------- Access Log Output ------------

Restarts: 0
Total Connections: 139189
Peak Concurrent Connections: 55268
Total Operations: 2831767
Total Results: 2840487
Overall Performance: 100.0%

Searches: 1947842
Modifications: 32
Adds: 23
Deletes: 0
Mod RDNs: 0

Persistent Searches: 0
Internal Operations: 0
Entry Operations: 0
Extended Operations: 870
Abandoned Requests: 11138
Smart Referrals Received: 0

VLV Operations: 6
VLV Unindexed Searches: 6
SORT Operations: 170
SSL Connections: 138436

Entire Search Base Queries: 3710
Unindexed Searches: 146033

FDs Taken: 139189
FDs Returned: 83940
Highest FD Taken: 1397

Broken Pipes: 0
Connections Reset By Peer: 0
Resource Unavailable: 0

Binds: 883000
Unbinds: 75139

LDAP v2 Binds: 324
LDAP v3 Binds: 882676
SSL Client Binds: 0
Failed SSL Client Binds: 0
SASL Binds: 0

Directory Manager Binds: 61254
Anonymous Binds: 818332
Other Binds: 3414
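As a rough sanity check on the file descriptor and index numbers in this summary, the gap between FDs taken and FDs returned and the share of unindexed searches can be computed directly. The sketch below is only illustrative: it assumes the summary is available as text with one `Name: value` statistic per line, and the helper name `parse_summary` is hypothetical, not part of logconv.pl. The sample values are copied from the output above.

```python
# Sample statistics copied from the logconv.pl summary in this ticket.
SAMPLE = """\
Total Connections: 139189
Peak Concurrent Connections: 55268
Searches: 1947842
Unindexed Searches: 146033
FDs Taken: 139189
FDs Returned: 83940
"""


def parse_summary(text):
    """Parse 'Name: integer' lines into a dict; other lines are skipped."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


stats = parse_summary(SAMPLE)

# Descriptors handed out to connections but not yet given back.
fds_outstanding = stats["FDs Taken"] - stats["FDs Returned"]

# Percentage of all searches that could not use an index.
unindexed_pct = 100.0 * stats["Unindexed Searches"] / stats["Searches"]

print(f"FDs not returned: {fds_outstanding}")
print(f"Unindexed searches: {unindexed_pct:.1f}% of all searches")
```

With these figures, roughly 55,000 descriptors were taken and not returned within the logged period, which is close to the reported peak concurrent connection count. This is only a reading of the summary counters and does not by itself identify the cause.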
stack trace of dirsrv 389stacktrace.txt
Can the dse.ldif, access, and error logs please be provided?
Thanks, Mark
attachment dse.ldif
attachment ldap_errors
attachment ldap_access.gz
Mark,
I've attached the data you requested. The errors were happening from the 9th of February into the 10th. Unfortunately, I only have access logs from the 10th because of the log rotation scheme. Also, I've replaced our domain name with foo for sanitization.
Thanks, Allan
Hi Allan,
Thanks for getting me this information!
I'm not going to really get to look at this until next week, but I might have a workaround/solution for now.
Set nsslapd-ioblocktimeout to 15 seconds under cn=config. It's in milliseconds so the value would be 15000.
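The seconds-to-milliseconds conversion and the resulting change record can be sketched as follows. This is a minimal illustration, not an official tool: the function name `ioblocktimeout_ldif` is hypothetical, and the script only builds a standard LDIF modify record for `cn=config` without contacting any server.

```python
def ioblocktimeout_ldif(seconds):
    """Build an LDIF modify record that sets nsslapd-ioblocktimeout.

    The attribute value is expressed in milliseconds, so the timeout
    given in seconds is multiplied by 1000.
    """
    millis = seconds * 1000
    lines = [
        "dn: cn=config",
        "changetype: modify",
        "replace: nsslapd-ioblocktimeout",
        f"nsslapd-ioblocktimeout: {millis}",
        "",  # trailing blank line terminates the LDIF record
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # 15 seconds, as suggested in this comment.
    print(ioblocktimeout_ldif(15))
```

The printed record could then be saved to a file and applied with a client such as ldapmodify while bound as Directory Manager. The host, port, and bind details depend on the deployment and are not specified in this ticket.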
I've seen this same issue before, and it relates to SSL connections. I think the code can be improved, but it still requires we set the ioblocktimeout. If this setting doesn't help, don't unset it as we will still need it moving forward.
Hi Allan,
Did you have a chance to test the config setting? How often were you running into this problem?
Like I said before, I think the code can be slightly improved, but that change would not affect the scenario you were hitting, based on the stack you provided. I believe this config setting will resolve your issue, but let me know what's going on.
Any update?
Any news yet? If I don't hear anything back by next week I will close this out.
Regards, Mark
Closing ticket...
Metadata Update from @mreynolds: - Issue assigned to mreynolds - Issue set to the milestone: N/A
389-ds-base is moving from Pagure to GitHub. This means that new issues and pull requests will be accepted only in 389-ds-base's GitHub repository.
This issue has been cloned to GitHub and is available here: https://github.com/389ds/389-ds-base/issues/299
If you want to receive further updates on the issue, please navigate to the GitHub issue and click the subscribe button.
Thank you for understanding. We apologize for all inconvenience.
Metadata Update from @spichugi: - Issue close_status updated to: wontfix (was: Invalid)