#7997 Error causes Directory Server to stop (FD Table Full)
Opened 2 months ago by jhildreth. Modified 2 months ago

Issue

[27/Jun/2019:21:45:56.695431162 +0000] - ERR - accept_and_configure - PR_Accept() failed, Netscape Portable Runtime error -5971 (Process open FD table is full.)
[27/Jun/2019:21:45:56.700826623 +0000] - ERR - accept_and_configure - PR_Accept() failed, Netscape Portable Runtime error -5971 (Process open FD table is full.)

Steps to Reproduce

  1. The issue doesn't occur while in slave mode (no data being accessed).
  2. Once operational work is failed over to the replica, the directory server fails within 6-8 hours with the errors above.
  3. The replica had been re-created: we removed the old one entirely from IPA and rebuilt it. The issue still occurred.

Actual behavior

Directory Server fails on the replica, and thus all operations of the web site that uses it as the authentication mechanism fail.

Expected behavior

Directory Server should not fail and the FD table should not fill.

Version/Release/Distribution

ipa-server-4.6.4-10.el7_6.3.x86_64
ipa-client-4.6.4-10.el7_6.3.x86_64
389-ds-base-1.3.8.4-23.el7_6.x86_64
pki-ca-10.5.9-13.el7_6.noarch
krb5-server-1.15.1-37.el7_6.x86_64


See https://access.redhat.com/documentation/en-us/red_hat_directory_server/9.0/html/performance_tuning_guide/system-tuning#file-descriptors
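
For reference, a minimal sketch of raising that limit on an EL7 system (the drop-in path and the 16384 value are illustrative assumptions, not taken from the guide): create a systemd drop-in for the dirsrv@<instance> unit, then reload and restart.

# /etc/systemd/system/dirsrv@.service.d/limits.conf -- illustrative drop-in
[Service]
LimitNOFILE=16384

# apply the change
systemctl daemon-reload
systemctl restart dirsrv.target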

In general, though, I don't entirely follow the terminology you are using. IPA does not configure slave (consumer) replication. It is also unclear what you mean by failover.

I suspect this is going to need to be moved to https://pagure.io/389-ds-base/ if you continue to have problems as it isn't an issue with IPA itself.

This is actually in IPA, which obviously integrates 389 DS. Since 389 DS is installed on the system and controlled by ipactl, would this not be an issue for IPA as well? If I am wrong here, that is fine.

What I mean by replica is a replica of the master IPA server created via ipa-replica-install.

As for failover: we have an IPA instance in each of our two data centers (DCs). Only one DC is operational at a time for the web site using the IPA instance, so if DC1 is active then the IPA instance at DC1 is the one accepting connections; conversely, when DC2 becomes the active site for the web site, its IPA instance is the active one. The IPA instance in DC1 is working properly without issue, while the one in DC2 fails after 6-8 hours.

@jhildreth, the nunc-stans DS connection handler suffered from a connection leak that can exhaust the connection table. By default nunc-stans should be disabled in your version, but it is worth eliminating this possibility.

Could you confirm that:
ldapsearch -D "cn=directory manager" -W -b "cn=config" -s base nsslapd-enable-nunc-stans
dn: cn=config
nsslapd-enable-nunc-stans: off

Yep, it is set to off:

dn: cn=config
nsslapd-enable-nunc-stans: off

The DS and process FD limits could be the same, but here it hits the process limit.
It could be a bad process limit or some application not closing connections.
Is the open files limit in /proc/<ds_pid>/limits large enough? And what does this report:
ldapsearch -D "cn=directory manager" -W -b "cn=config" -s base nsslapd-conntablesize nsslapd-maxdescriptors
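
For a quick check of the former, assuming the DS process runs as ns-slapd (the usual 389 DS binary name):

grep 'Max open files' /proc/$(pidof ns-slapd)/limits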

Non-working system /proc/<ds_pid>/limits file:

Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            8388608     unlimited   bytes
Max core file size        0           unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             15036       15036       processes
Max open files            1024        8192        files
Max locked memory         65536       65536       bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       15036       15036       signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

Working system /proc/<ds_pid>/limits file:

Limit                     Soft Limit  Hard Limit  Units
Max cpu time              unlimited   unlimited   seconds
Max file size             unlimited   unlimited   bytes
Max data size             unlimited   unlimited   bytes
Max stack size            8388608     unlimited   bytes
Max core file size        0           unlimited   bytes
Max resident set          unlimited   unlimited   bytes
Max processes             15065       15065       processes
Max open files            1024        8192        files
Max locked memory         65536       65536       bytes
Max address space         unlimited   unlimited   bytes
Max file locks            unlimited   unlimited   locks
Max pending signals       15065       15065       signals
Max msgqueue size         819200      819200      bytes
Max nice priority         0           0
Max realtime priority     0           0
Max realtime timeout      unlimited   unlimited   us

Both systems are set to the following for the LDAP attributes:

dn: cn=config
nsslapd-conntablesize: 8192
nsslapd-maxdescriptors: 1024
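
As an aside, nsslapd-maxdescriptors (1024) matches the soft "Max open files" limit shown above. If it needed raising, a minimal sketch of changing the attribute (the 8192 value here is illustrative and must stay within the process hard limit; this attribute typically requires a server restart to take effect):

ldapmodify -D "cn=directory manager" -W <<EOF
dn: cn=config
changetype: modify
replace: nsslapd-maxdescriptors
nsslapd-maxdescriptors: 8192
EOF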

One thing I am wondering about regarding my IPA setup:

My lone "real" MASTER is in One Data Center (DC1) along with one of its replicas
My issue Replica and another to fail to if needed are in a geographically distant Data Center (DC2).

All of the replicas have replication agreements only with the "real" master, so the two in DC2 sync with DC1 over a long-distance network connection. Could this issue be caused by this setup? Should I remove the replication agreement between the problem replica in DC2 and the master, and instead give it an agreement with the "if needed" system only, so that it gets its updates from within the same site?

What does an inactive master mean?

All masters are masters. The only distinguishing features are the optional services (CA, DNS, KRA) and the original is the default CRL generator and CA renewal master.

You want to avoid a single point of failure where if one master goes away then all replication agreements are broken and you end up with a split brain.
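
For example, with domain level 1 managed topology, a direct agreement between the two DC2 replicas could be added with something like the following (host names and segment name are hypothetical):

ipa topologysegment-add domain dc2-repl1-to-dc2-repl2 \
    --leftnode=replica1.dc2.example.com \
    --rightnode=replica2.dc2.example.com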

Regarding the DS failure to accept new connections: the replica (where the problem happens) has the same limits as the master (where it does not happen), so I have no clue what could differ.
I suggest you monitor over time the number of open files (lsof -p <ds_pid>) and connections (something like 'lsof -i tcp:389 -a -p <ds_pid>') to confirm whether the limit is reached because of connections, and what their status is (established, close-wait). Access logs can also confirm whether connections are closed or not from the DS point of view.
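
A minimal sketch of such monitoring, assuming the DS process runs as ns-slapd and a 60-second sampling interval:

PID=$(pidof ns-slapd)
while sleep 60; do
    # total open files held by the process
    echo "$(date) open_files=$(lsof -p "$PID" | wc -l)"
    # port-389 connections broken down by TCP state
    lsof -nP -i tcp:389 -a -p "$PID" | awk 'NR>1 {print $NF}' | sort | uniq -c
done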
