#2564 pki-tomcat for 10+ minutes before generating cert
Closed: Fixed None Opened 4 years ago by afsilva.

Hello Everyone,

My name is Anderson Silva, and I am the newest member of the Red Hat IT IAM team. We are currently suffering a PROD issue that happens in stage as well. I am going to do my best to describe this, but feel free to ask me more questions, logs, files.

I go to:

ca.prod.host:8443/ca/ee/ca/ and then click on "Directory-Authenticated User Dual-Use ECC Certificate Enrollment"

When I get into that site, and I enter my Username and Pin+Token (for linotp auth) and give a certificate label, and hit submit the pki-tomcat host will spin for several minutes taking longer than 10 minutes to return a cert.


this is a debug output of the tomcat pki, you will see that at 15:26 the request starts, and then it only returns at 15:42 with the SSL handshake. I have the ldap logs as well
ansilva.request.ca01.pki

note that the connection on the ldap server only happens at 13:42, nothing came at 13:26 when it said the authentication was going to happen
ansilva.request.frac-slave02

Note: on the first attached filed, I used timestamp 15:26, but I meant 13:26.

Also, restarting the tomcat service resolves the issue until it starts happening again eventually.

I have taken a thread dumps and heap dumps at a different time in which this issue was happening. I am attaching them as well. These thread dumps and heap dumps are at a different time than the two first files I uploaded.

thread dump... I see a null exception on ldap, and then a lot of messages related to "parking to wait"
messages

I can't attach the heap dumps, but I can make then available somewhere if needed.

more data, this time thread dump and heap dump are related to the request.

<ansilva> file: http://file.rdu.redhat.com/~ansilva/ansilva-request-2016-12-15-1833--1849.tar.gz
<ansilva> request started at 18:33 - from http-bio-exec-3 (on debug) file
<ansilva> request seem to have gone to frac-slave01, which has access-frac-ldap1

message file contains thread dump

Most of the requests I've done in prod have worked after 16 minutes or so, but every once in a while they will also fail, and the RHCS UI (via browser) spits out the following error: Sorry, your request is not submitted. The reason is "LDAPException caught from operation. null".

[16/Dec/2016:13:27:10]http-bio-8443-exec-9: UidPwdDirAuthentication: Authenticating: Searching for uid=ansilva base DN=dc=redhat,dc=com

and then silence... and then...

[16/Dec/2016:13:42:39]http-bio-8443-exec-9: SSL handshake happened
[16/Dec/2016:13:42:39]http-bio-8443-exec-2: CMSServlet:service() uri = /ca/services
[16/Dec/2016:13:42:39]http-bio-8443-exec-2: CMSServlet: services start to service.
[16/Dec/2016:13:42:39]http-bio-8443-exec-2: MainPageServlet process
[16/Dec/2016:13:42:39]http-bio-8443-exec-2: CMSServlet: curDate=Fri Dec 16 13:42:39 UTC 2016 id=services time=0
[16/Dec/2016:13:42:40]http-bio-8443-exec-9: Authenticating: User authentication failure: netscape.ldap.LDAPException: error result (48)
[16/Dec/2016:13:42:40]http-bio-8443-exec-9: Authenticating: closing bad connection
[16/Dec/2016:13:42:40]http-bio-8443-exec-9: returnConn: mNumConns now 5
LDAPException caught from operation. null
at com.netscape.cms.authentication.UidPwdDirAuthentication.authenticate(UidPwdDirAuthentication.java:271)
at com.netscape.cms.authentication.DirBasedAuthentication.authenticate(DirBasedAuthentication.java:417)
at com.netscape.cms.servlet.processors.CAProcessor.authenticate(CAProcessor.java:446)
at com.netscape.cms.servlet.processors.CAProcessor.authenticate(CAProcessor.java:495)
at com.netscape.cms.servlet.cert.EnrollmentProcessor.processEnrollment(EnrollmentProcessor.java:149)
at com.netscape.cms.servlet.profile.ProfileSubmitServlet.processEnrollment(ProfileSubmitServlet.java:224)
at com.netscape.cms.servlet.profile.ProfileSubmitServlet.process(ProfileSubmitServlet.java:124)
at com.netscape.cms.servlet.base.CMSServlet.service(CMSServlet.java:513)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
at sun.reflect.GeneratedMethodAccessor33.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:277)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:274)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:549)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:309)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:169)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:297)
at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:55)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:191)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:187)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:186)
at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:52)
at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:277)
at org.apache.catalina.security.SecurityUtil$1.run(SecurityUtil.java:274)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAsPrivileged(Subject.java:549)
at org.apache.catalina.security.SecurityUtil.execute(SecurityUtil.java:309)
at org.apache.catalina.security.SecurityUtil.doAsPrivilege(SecurityUtil.java:249)
at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:237)
at org.apache.catalina.core.ApplicationFilterChain.access$000(ApplicationFilterChain.java:55)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:191)
at org.apache.catalina.core.ApplicationFilterChain$1.run(ApplicationFilterChain.java:187)
at java.security.AccessController.doPrivileged(Native Method)
at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:186)
at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:220)
at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:122)
at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:501)
at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:950)
at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:116)

I will be taking some tcpdumps, and looking into bypassing the loadbalancer today to see if the LB is getting in the way of this.

As an update to this ticket. I have a ticket open with Red Hat support, and I know they are in touch with upstream (aka you guys). It does seems like the issue is related to our load balancer and/or network, but I still think some sort of time out needs to be implemented on the pki tomcat side to avoid 15 min 30sec requests.

So, unless some new development happens with Red Hat support, this ticket could be summarized as a feature request to add time out for LDAP requests in PKI Tomcat. Thoughts?

Update:

We have discovered the issue, in a nutshell: "our F5 (the load balancer) is closing the connections after the idle timeout is consumed and by default, the java socket won't set a keepalive"

Taken from our GSS case, this was provided by Severin:

Launiau, Severin on Jan 06 2017 at 09:40 AM -08:00
For sure we can't let the service with unlimited idle time on the load-balancer. See their docs: https://support.f5.com/csp/#/article/K7606

"Important: Specifying an indefinite idle time-out for connectionless protocols, such as User Datagram Protocol (UDP), can lead to stalled connections and resource exhaustion. Specifying an indefinite idle time-out for connection-oriented protocols, such as TCP, can lead to resource exhaustion if connections are not shut down gracefully. In most cases, F5 recommends that you set the idle time-out to a small as possible finite value to sustain the application."

I would instead suggest to set that keepalive somewhere in the application. Apparently there isn't anything we can set directly into CS.cfg: I tried all of the options I mentioned in my post yesterday, nothing worked.

So, I had a look at the source code in pki-core-10.2.6:
From base/server/cmscore/src/com/netscape/cmscore/ldapconn/LdapBoundConnection.java: that's where we retrieve the config parameters and open the connection. It calls a "new LdapJssSSLSocketFactory()".
From base/server/cmscore/src/com/netscape/cmscore/ldapconn/LdapJssSSLSocketFactory.java:

public Socket makeSocket(String host, int port) throws LDAPException {

[...]
s = new SSLSocket(host, port);
s.setUseClientMode(true);
s.enableV2CompatibleHello(false);
[...]

So it's creating a SSLSocket object. Looking at the official doc for that one, it extends java.net.Socket. Very likely the default is to keep it disabled.

As such, if there is no way to apply that to a more global config file, what I suggest then is to add a s.setKeepAlive(true) in that block.

see http://www-archive.mozilla.org/projects/security/pki/jss/javadoc/org/mozilla/jss/ssl/SSLSocket.html#setKeepAlive%28boolean%29

Any thoughts?

interesting finding
the LDAPS connection goes through JSS / Java NSS, and there is no timeout or parameter exposed in the CS LDAP connection in the CS.cfg configuration file.
The keepalive parameter is the global system TCP configuration, like for example
net.ipv4.tcp_keepalive_time = 7200
or
net.ipv4.tcp_fin_timeout = 60
We need to review this request.
But would it make sense to try a smaller net.ipv4.tcp_keepalive_time value as a workaround?

The above patch provides a parameter to enable the TCP Keep-Alive option. Here's the documentation:
http://pki.fedoraproject.org/wiki/DS_Connection#TCP_Options

Is it possible to get this patch built into an RPM? src or binary so we can try it on our dev environment?

We are currently running pki-core-10.2.6-9.el7

Based on comment #11 I'm proposing to fix this in 10.2 and all later minor releases.

patch tested, Customer Support ticket updated. I think we are all good!

Per CS meeting this bug should be fixed in 10.3 and later versions.

Fixed in master:

  • b3ee1c28f658a70468c5a5fcf3cb4840574be756

Fixed in 10.3 branch:

  • 4252656c27f230a5198a01a6085dad4b8e4df59f

TCP Keepalive sounds related to my attempt to improve general performance of FreeIPA and IPA vault.

Question: Was TCP keepalive disable before? Or did you enable it and add a new option to disable it on demand?

Metadata Update from @afsilva:
- Issue set to the milestone: 10.3.10

4 years ago

Dogtag PKI is moving from Pagure issues to GitHub issues. This means that existing or new
issues will be reported and tracked through Dogtag PKI's GitHub Issue tracker.

This issue has been cloned to GitHub and is available here:
https://github.com/dogtagpki/pki/issues/2684

If you want to receive further updates on the issue, please navigate to the
GitHub issue and click on Subscribe button.

Thank you for understanding, and we apologize for any inconvenience.

Login to comment on this ticket.

Metadata