#8212 Nightly test failure in test_integration/test_backup_and_restore.py::TestReplicaInstallAfterRestore::test_replica_install_after_restore: ipa-restore does not detect PKI start failure
Closed: worksforme 2 years ago by frenaud. Opened 4 years ago by frenaud.

The nightly test test_integration/test_backup_and_restore.py::TestReplicaInstallAfterRestore::test_replica_install_after_restore failed in PR #147 - testing_master_389ds.
The test failed during replica installation, trying to communicate with PKI server on the master:

[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30]   [5/29]: configuring certificate server instance
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] Failed to configure CA instance: CalledProcessError(Command ['/usr/sbin/pkispawn', '-s', 'CA', '-f', '/tmp/tmpz1kefvsi'] returned non-zero exit status 1: 'WARNING : pkiparser      Unable to get security domain info: 503 Server Error: Service Unavailable for url: https://master.ipa.test:443/ca/rest/securityDomain/domainInfo\n')
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] See the installation logs and the following files/directories for more information:
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30]   /var/log/pki/pki-tomcat
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30]   [error] RuntimeError: CA configuration failed.
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] CA configuration failed.
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] The ipa-replica-install command failed. See /var/log/ipareplica-install.log for more information
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] Your system may be partly configured.
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] Run /usr/sbin/ipa-server-install --uninstall to clean up.
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] 
[ipatests.pytest_ipa.integration.host.Host.replica1.cmd30] Exit code: 1

The journal log on the master shows that PKI failed to start during the ipa-restore command:

Feb 29 21:17:23 master.ipa.test ipa-pki-wait-running[34880]: ipa-pki-wait-running: Connection failed: HTTPConnectionPool(host='master.ipa.test', port=8080): Read timed out. (read timeout=1.0)
Feb 29 21:17:24 master.ipa.test systemd[1]: pki-tomcatd@pki-tomcat.service: Start-post operation timed out. Stopping.
Feb 29 21:17:24 master.ipa.test systemd[1]: pki-tomcatd@pki-tomcat.service: Control process exited, code=killed, status=15/TERM
Feb 29 21:17:25 master.ipa.test systemd[1]: pki-tomcatd@pki-tomcat.service: Failed with result 'timeout'.
Feb 29 21:17:25 master.ipa.test systemd[1]: Failed to start PKI Tomcat Server pki-tomcat.
Feb 29 21:17:25 master.ipa.test audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pki-tomcatd@pki-tomcat comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'
Feb 29 21:17:25 master.ipa.test systemd[1]: pki-tomcatd@pki-tomcat.service: Consumed 34.599s CPU time.
Feb 29 21:17:25 master.ipa.test systemd[1]: Reached target PKI Tomcat Server.

Need to investigate how it is possible for ipa-restore to not detect the PKI start failure.


Are the full logs available anywhere?

The full IPA is started twice so having some idea which one failed could be useful (once as a start and once as a restart).

But it all happens in a big try/finally block so if a service failed to start then it should either raise an exception or have a non-zero returncode.

This makes my head hurt. ipactl in the restore shows that everything started just fine:

2020-02-29T21:15:03Z DEBUG args=['/usr/sbin/ipactl', 'start']
2020-02-29T21:17:26Z DEBUG Process finished, return code=0
2020-02-29T21:17:26Z DEBUG stdout=Starting Directory Service
Starting krb5kdc Service
Starting kadmin Service
Starting named Service
Starting httpd Service
Starting ipa-custodia Service
Starting pki-tomcatd Service
Starting ipa-otpd Service
Starting ipa-dnskeysyncd Service

When in fact it failed and was reported properly by systemd:

Feb 29 21:15:41 master.ipa.test systemd[1]: Starting PKI Tomcat Server pki-tomcat...
Feb 29 21:15:48 master.ipa.test pki-server[34725]: pki-tomcat instance migrated
Feb 29 21:15:54 master.ipa.test pkidaemon[34785]: SUCCESS: Successfully archived '/var/lib/pki/pki-tomcat/conf/ca/archives/CS.cfg.bak.20200229211554'
Feb 29 21:15:54 master.ipa.test pkidaemon[34785]: SUCCESS: Successfully backed up '/var/lib/pki/pki-tomcat/conf/ca/CS.cfg.bak'
Feb 29 21:15:55 master.ipa.test server[34879]: options used: -Dcatalina.base=/var/lib/pki/pki-tomcat -Dcatalina.home=/usr/share/tomcat -Djava.endorsed.dirs= -Djava.io.tmpdir=/var/lib/pki/pki-tomcat/temp -Djava.util.logging.config.file=/var/lib/pki/pki-tomcat/conf/logging.properties -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.security.manager -Djava.security.policy==/var/lib/pki/pki-tomcat/conf/catalina.policy
Feb 29 21:17:24 master.ipa.test systemd[1]: pki-tomcatd@pki-tomcat.service: Start-post operation timed out. Stopping.
Feb 29 21:17:24 master.ipa.test systemd[1]: pki-tomcatd@pki-tomcat.service: Control process exited, code=killed, status=15/TERM
Feb 29 21:17:25 master.ipa.test systemd[1]: pki-tomcatd@pki-tomcat.service: Failed with result 'timeout'.
Feb 29 21:17:25 master.ipa.test systemd[1]: Failed to start PKI Tomcat Server pki-tomcat.
Feb 29 21:17:25 master.ipa.test audit[1]: SERVICE_START pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=pki-tomcatd@pki-tomcat comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=failed'

So the ipautil.run to call systemctl start on the service should have blown up.

A typical startup would look like:

Starting pki-tomcatd Service
ipa: DEBUG: Starting external process
ipa: DEBUG: args=['/bin/systemctl', 'start', 'pki-tomcatd.target']
ipa: DEBUG: Process finished, return code=0
ipa: DEBUG: request POST http://ipa.example.test:8080/ca/admin/ca/getStatus
ipa: DEBUG: request body ''
ipa: DEBUG: response status 200
ipa: DEBUG: response headers Content-Type: application/xml
Content-Length: 191
Date: Tue, 05 Jan 2021 20:06:13 GMT

ipa: DEBUG: response body (decoded): b'<?xml version="1.0" encoding="UTF-8" standalone="no"?><XMLResponse><State>1</State><Type>CA</Type><Status>running</Status><Version>10.9.4-1.module_el8.3.0+500+458aeb54</Version></XMLResponse>'
ipa: DEBUG: wait_for_open_ports: localhost [8080, 8443] timeout 120
ipa: DEBUG: waiting for port: 8080
ipa: DEBUG: SUCCESS: port: 8080
ipa: DEBUG: waiting for port: 8443
ipa: DEBUG: SUCCESS: port: 8443
ipa: DEBUG: Start of pki-tomcatd.target complete

I think we should pass --debug to ipactl start in ipa-restore to get the full logging. I think that's all we can do for now since we can't otherwise see what happened.

Looking further into the logs it doesn't match the current code (even for ipa-4-8).

I wonder if this execution predates https://pagure.io/freeipa/issue/8226 so there is no call to ipactl restart. I wonder if that would have also addressed the pki issue.

It doesn't explain why nothing was raised though.

Lowering the priority as the issue is random and happens rarely.

Metadata Update from @frenaud:
- Issue priority set to: low

3 years ago

This issue hasn't occurred in the last 12 months, closing for now. Feel free to re-open if it happens again.

Metadata Update from @frenaud:
- Issue close_status updated to: worksforme
- Issue status updated to: Closed (was: Open)

2 years ago

Login to comment on this ticket.

Metadata