#48315 RHDS rarely crashes/shuts down somewhere in a pkidestroy and pkispawn workflow when re-provisioning slaves
Closed: wontfix 5 years ago Opened 6 years ago by mharmsen.

This ticket was cloned from PKI TRAC Ticket #1471 - RHDS rarely crashes/shuts down somewhere in a pkidestroy and pkispawn workflow when re-provisioning slaves:

I've had this happen to me twice in total, out of the hundreds of installs I've done.

The infrastructure is a master CA with RHDS running on a separate machine, and a slave CA with RHDS running on a separate machine: 4 machines or unique instances in total. What I recall happening is being unhappy with the slave install for some reason, then doing a pkidestroy, then doing a pkispawn on the slave. The pkispawn never completes because it is unable to contact the LDAP server to set up the replication agreement. When I log in to the RHDS node that the slave is pointing at, sure enough RHDS is no longer running. Once I start it back up and run pkispawn on the slave again, things work as they should.

I get the feeling that it may have something to do with the replication agreements, but I can't reproduce it reliably enough and haven't spent the time digging in the logs to prove it. I'm not sure whether removing the replication agreement or trying to create it causes the crash.

It's also possible that it's something else, or that I'm doing something weird to cause the problem, but I'm never interacting directly with the RHDS box and nothing else is doing any LDAP operations against it, so it definitely seems like something in RHCS is causing the problem.

Should the problem occur again, or if I figure out how to reproduce it reliably, I'll flesh out this bug some more. Otherwise, I'm curious whether other people might chime in, say they've experienced something similar, and provide more info. If there is no input after a while, feel free to close the bug.

This isn't a blocker or big deal for us. Just wanted to put it out there in case others are seeing it.

rhds_crash.tar.gz​ (49.7 KB) - added by dminnich 3 months ago.

Further info from [https://fedorahosted.org/pki/ticket/1471 PKI TRAC Ticket #1471 - RHDS rarely crashes/shuts down somewhere in a pkidestroy and pkispawn workflow when re-provisioning slaves]:
{{{

I just saw this happen again. The setup was like this; each entity is a separate machine: Master ca01 -> ldap01, Clone ca02 -> ldap02. Several KRAs, OCSPs, and a 3rd CA also existed, but I don't think any of that is relevant.

A full install of these components had taken place in the past and was working fine. I decided to do a re-install with a new version of RHCS. To do that, I pkidestroy'ed and yum removed everything.

I then yum installed the latest release and pkispawn'ed on ca01 and ca02. The pkispawn on ca02 failed because RHDS on ldap02 went down. I had not touched the LDAP server between the uninstall and re-install process, and I noticed that RHDS on ldap02 was in fact running before I issued the pkispawn on ca02. So something in the pkispawn of a re-install of a clone CA seems to kill RHDS.

Note that this is happening with pki-ca-10.2.6-2 and redhat-ds-base-10.0.0-1.el7dsrv.x86_64.

Attached are:
- pkispawn config of the clone
- pkispawn log of the clone
- debug log of the clone
- access logs for RHDS on ldap02
- error logs for RHDS on ldap02

Both the RHCS debug log and the RHDS error log talk about VLV.

It almost looks like RHCS tells RHDS to delete some data so that it can import it again. The problem is that RHDS shuts down to delete the data, so the RHCS install never finishes. One thing I will mention is that if RHDS is supposed to go down and bring itself back up, the way that we are calling pkispawn through puppet may not be allowing a long enough wait time for this to happen. I'd try to test this theory by using pkispawn directly, but I can't get this to happen often enough or know the exact steps to do so.

RHDS:
[23/Jul/2015:15:10:39 +0000] - Deleted Virtual List View Search (caRenewal-pki-tomcat).
[23/Jul/2015:15:10:39 +0000] - Deleted Virtual List View Index.
[23/Jul/2015:15:10:39 +0000] - Deleted Virtual List View Index.
[23/Jul/2015:15:10:39 +0000] - Deleted Virtual List View Search (caRevocation-pki-tomcat).
[23/Jul/2015:15:10:39 +0000] - Deleted Virtual List View Search (caRevocation-pki-tomcat).
[23/Jul/2015:15:10:40 +0000] - ldbm: Bringing intca02.pki.qa.int.phx1.redhat.com offline...

RHCS:
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: initializing with mininum 3 and maximum 15 connections to host intca02.ldap.qa.int.phx1.redhat.com port 636, secure connection, true, authentication type 1
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: increasing minimum connections by 3
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: new total available connections 3
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: new number of connections 3
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: In LdapBoundConnFactory::getConn()
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: masterConn is connected: true
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: getConn: conn is connected true
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: getConn: mNumConns now 2
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: importLDIFS: param=preop.internaldb.post_ldif
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: importLDIFS(): ldif file = /usr/share/pki/ca/conf/vlv.ldif
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: importLDIFS(): ldif file copy to /var/lib/pki/pki-tomcat/ca/conf/vlv.ldif
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: importLDIFS(): LDAP Errors in importing /var/lib/pki/pki-tomcat/ca/conf/vlv.ldif
[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: LDAPUtil:importLDIF: exception in adding entry cn=allCerts-pki-tomcat, cn=intca02.pki.qa.int.phx1.redhat.com, cn=ldbm database, cn=plugins, cn=config: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)

[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: LDAPUtil:importLDIF: exception in adding entry cn=allExpiredCerts-pki-tomcat, cn=intca02.pki.qa.int.phx1.redhat.com, cn=ldbm database, cn=plugins, cn=config: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)

[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: LDAPUtil:importLDIF: exception in adding entry cn=allInvalidCerts-pki-tomcat, cn=intca02.pki.qa.int.phx1.redhat.com, cn=ldbm database, cn=plugins, cn=config: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)

[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: LDAPUtil:importLDIF: exception in adding entry cn=allInValidCertsNotBefore-pki-tomcat, cn=intca02.pki.qa.int.phx1.redhat.com, cn=ldbm database, cn=plugins, cn=config:netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)

[23/Jul/2015:15:10:50][http-bio-8443-exec-10]: LDAPUtil:importLDIF: exception in adding entry cn=allNonRevokedCerts-pki-tomcat, cn=intca02.pki.qa.int.phx1.redhat.com, cn=ldbm database, cn=plugins, cn=config: netscape.ldap.LDAPException: IO Error creating JSS SSL Socket (-1)
}}}
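The puppet-timeout theory above (pkispawn being re-run before RHDS finishes coming back up) can be sketched as a small poll-and-retry wrapper around an LDAP health check. This is only an illustration of the idea; the host, port, and config file name below are placeholders, not values from this ticket:

```shell
# Minimal sketch: poll a command until it succeeds, up to a limit,
# so pkispawn is only re-run once the directory server answers again.
wait_for() {
    tries=$1; shift
    while [ "$tries" -gt 0 ]; do
        "$@" && return 0
        tries=$((tries - 1))
        sleep 1
    done
    return 1
}

# Hypothetical usage (host, port, and clone-ca.cfg are placeholders):
# wait_for 60 ldapsearch -x -H ldaps://ldap02.example.com:636 -s base -b "" \
#     && pkispawn -s CA -f clone-ca.cfg
```

Wiring something like this into the puppet run would at least distinguish "RHDS restarted slowly" from "RHDS actually crashed and stayed down".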

Hi Matt,

Could you please provide the version number?
rpm -q 389-ds-base

And can we have core files from the crash? This does not look like a core or stack traces...
https://fedorahosted.org/389/attachment/ticket/48315/rhds_crash.tar.gz
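For future occurrences, a usable core could be captured by enabling core dumps for ns-slapd before reproducing. The following is a rough sketch of the usual procedure; the instance name and paths are assumptions, so check the 389-ds debugging documentation for the exact steps on your version:

```shell
# Rough sketch (run as root); the instance name "slapd-ldap02" is hypothetical.

# ns-slapd drops privileges, so allow setuid processes to dump core:
sysctl -w fs.suid_dumpable=1

# Remove the core size limit for the dirsrv service:
echo "ulimit -c unlimited" >> /etc/sysconfig/dirsrv
systemctl restart dirsrv@slapd-ldap02

# After a crash, with debuginfo installed, pull a full backtrace:
# debuginfo-install 389-ds-base
# gdb /usr/sbin/ns-slapd /path/to/core
# (gdb) bt full
```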

Hi Matt,

Do we happen to have any update on this ticket?

Thanks!

Update from Dustin:
On 01/15/2016 05:25 AM, Dustin Minnich wrote:

When I rebuilt our Dev and QA environments (30 RHCS machines), I did not experience this issue. One thing that was done differently was a full uninstall and reinstall of RHDS before the RHCS loads.

My theory now is that in some but not all cases an RHCS uninstall may not clean up old replication agreements, OR an RHCS install may in some cases see an existing replication entry and not properly remove and re-create it. In reacting to this situation, RHCS then does something that causes RHDS to crash. I'm not sure what the logic in that code is, and really this is just a guess based on my past experiences.
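The stale-agreement theory could be checked directly: replication agreements live under cn=config, so any leftovers from a previous install are visible with a plain search. The host, credentials, and example DN below are placeholders, not values from this ticket:

```shell
# List any replication agreements still present in cn=config, to spot
# a stale one before re-running pkispawn (host/credentials are placeholders):
ldapsearch -x -H ldap://ldap02.example.com:389 \
    -D "cn=Directory Manager" -W \
    -b cn=config "(objectClass=nsds5ReplicationAgreement)" dn

# A leftover agreement could then be removed by DN before the re-install
# (the DN below is purely illustrative):
# ldapdelete -x -H ldap://ldap02.example.com:389 -D "cn=Directory Manager" -W \
#     'cn=cloneAgreement1,cn=replica,cn="dc=example,dc=com",cn=mapping tree,cn=config'
```

Comparing this listing before and after a failed clone pkispawn would show whether RHCS is re-using or fighting with an existing agreement.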

This problem has always been rare and fleeting, so it's also possible that sometimes I'm lucky and sometimes I'm not, and it doesn't relate to what I just said.

I really wish I could be more helpful with this. Should I ever experience the issue again, I'll update the ticket, or open a new ticket if you feel there isn't enough info in this current one to keep it open.

Setting the target milestone to 1.3.6 backlog.

Metadata Update from @mharmsen:
- Issue set to the milestone: 1.3.6 backlog

5 years ago

Is this issue still valid?

Metadata Update from @firstyear:
- Custom field reviewstatus adjusted to review
- Issue close_status updated to: None

5 years ago

Metadata Update from @mreynolds:
- Custom field reviewstatus adjusted to new (was: review)

5 years ago

Please open a new issue if you can reproduce this issue.

Metadata Update from @firstyear:
- Issue close_status updated to: worksforme
- Issue status updated to: Closed (was: Open)

5 years ago

389-ds-base is moving from Pagure to Github. This means that new issues and pull requests
will be accepted only in 389-ds-base's github repository.

This issue has been cloned to Github and is available here:
- https://github.com/389ds/389-ds-base/issues/1646

If you want to receive further updates on the issue, please navigate to the github issue
and click on subscribe button.

Thank you for understanding. We apologize for all inconvenience.

Metadata Update from @spichugi:
- Issue close_status updated to: wontfix (was: worksforme)

2 years ago
