#9381 Race condition in ipa-server-upgrade where pki-tomcat needs dirsrv while it's stopped
Closed: fixed a year ago by frenaud. Opened a year ago by pgn674.

Request for enhancement

As a FreeIPA server admin, I want ipa-server-upgrade to complete successfully.

Issue

On my systems ipa-server-upgrade failed when upgrading from 4.9.8 to 4.9.11.

Following are lots of supporting logs, leading to the root cause.

Running it manually showed:

IPA server upgrade failed: Inspect /var/log/ipaupgrade.log and run command ipa-server-upgrade manually.
Unexpected error - see /var/log/ipaupgrade.log for details:
RemoteRetrieveError: Failed to authenticate to CA REST API
The ipa-server-upgrade command failed. See /var/log/ipaupgrade.log for more information

/var/log/ipaupgrade.log showed:

2023-05-22T21:50:18Z INFO Migrating profile 'caECServerCertWithSCT'
2023-05-22T21:50:18Z DEBUG request GET https://ipa01.qa.example.net:8443/ca/rest/account/login
2023-05-22T21:50:18Z DEBUG response status 404
2023-05-22T21:50:18Z DEBUG response body (decoded): b'...HTTP Status 404 \xe2\x80\x93 Not Found...'

Manually curling https://ipa01.qa.example.net:8443/ca/rest/account/login afterwards showed the 404 was still happening.

journalctl --unit pki-tomcatd@pki-tomcat.service showed

May 22 19:44:10 ipa01.qa.example.net server[3309172]: SEVERE: One or more listeners failed to start. Full details will be found in the appropriate container log file
May 22 19:44:10 ipa01.qa.example.net server[3309172]: SEVERE: Context [/ca] startup failed due to previous errors
May 22 19:44:12 ipa01.qa.example.net server[3309172]: SEVERE: One or more listeners failed to start. Full details will be found in the appropriate container log file
May 22 19:44:12 ipa01.qa.example.net server[3309172]: SEVERE: Context [/acme] startup failed due to previous errors

/var/log/pki/pki-tomcat/ca/debug.2023-05-22.log and /var/log/pki/pki-tomcat/acme/debug.2023-05-22.log showed:

2023-05-22 19:44:10 [main] FINE: LdapBoundConnection: Connecting to ipa01.qa.example.net:636 with client cert auth
2023-05-22 19:44:10 [main] SEVERE: Unable to create socket: java.net.ConnectException: Connection refused (Connection refused)
2023-05-22 19:44:11 [main] INFO: Loading LDAP database configuration from /etc/pki/pki-tomcat/ca/CS.cfg
2023-05-22 19:44:12 [main] SEVERE: Unable to create socket: java.net.ConnectException: Connection refused (Connection refused)

But I was able to telnet to that LDAP endpoint no problem using telnet ipa01.qa.example.net 636
Running systemctl restart pki-tomcatd@pki-tomcat.service made the 8443 curl work, but the next ipa-server-upgrade run still failed and pki-tomcat was unhappy again.

From /var/log/ipaupgrade.log and /var/log/dirsrv/slapd-QA-EXAMPLE-NET/errors, I could see that slapd was being stopped and started right at the time that pki-tomcat's ca and acme containers were trying to connect to slapd. As soon as they were unable to connect, they started returning 404 forever (until someone ran systemctl restart pki-tomcatd@pki-tomcat.service).

See below for my workaround.

Steps to Reproduce

  1. Run ipa-server-upgrade

Actual behavior

It fails to complete, showing the RemoteRetrieveError error above.

Expected behavior

It completes successfully.

Version/Release/Distribution

package freeipa-server is not installed
package freeipa-client is not installed
ipa-server-4.9.11-5.module+el8.8.0+18146+a1d8660b.x86_64
ipa-client-4.9.11-5.module+el8.8.0+18146+a1d8660b.x86_64
389-ds-base-1.4.3.34-1.module+el8.7.0+18367+58a49cb0.x86_64
package pki-ca is not installed
krb5-server-1.18.2-22.el8_7.x86_64

My workaround

Edit /usr/lib/python3.6/site-packages/ipaserver/install/server/upgrade.py. At about line 1787, inside upgrade_configuration(), after the kra.start('pki-tomcat'), and before the certmonger_service = services.knownservices.certmonger, add this:

    logger.info('Sleeping 10 seconds for pki-tomcat ca and acme to finish starting up')
    time.sleep(10)

Also, put import time at the top. Then, run ipa-server-upgrade. This time it completes successfully.

Additional info:

These systems are AWS EC2 instances of size m5.large, with 2 vCPU cores and 8 GiB of memory. They are running Red Hat Enterprise Linux 8.8 (Ootpa), Linux 4.18.0-240.15.1.el8_3.x86_64.

Closing comments

I can see that in [Migrate CRL publish directory] the script waits for ports 8080 and 8443 to be connectable after starting pki-tomcat before it goes on, but at that point the ca and acme containers haven't finished starting yet. The next step ([Verifying that KDC configuration is using ipa-kdb backend]) briefly stops dirsrv (ns-slapd), and on my systems it happens to line up perfectly with ca and acme trying to connect to slapd. And it seems those two Tomcat containers do not retry after getting failures, and just respond with 404 forever instead.

I'm not sure what the right solution is (specifically code the pki-tomcat to also wait for https://ipa01.qa.example.net:8443/ca/rest/account/login to be working?), but I would like to see this race condition get addressed.


@pgn674
can you check if there is a drop-in file on your server:

# cat /etc/systemd/system/pki-tomcatd@pki-tomcat.service.d/ipa.conf
[Service]
Environment=LC_ALL=C.UTF-8
ExecStartPost=/usr/libexec/ipa/ipa-pki-wait-running

If not, create the directory + file manually, launch systemctl --system daemon-reload then ipa-server-upgrade. This file ensures that PKI is ready (CA subsystem available) before the start command returns.
On new installations (post RHEL 8.1), this file is created at install time but it looks like an upgrade from 8.0 to 8.1+ does not create it, resulting in PKI not fully initialized when the upgrade is run.

Metadata Update from @frenaud:
- Issue assigned to frenaud

a year ago

Metadata Update from @frenaud:
- Custom field rhbz adjusted to https://bugzilla.redhat.com/show_bug.cgi?id=2215336

a year ago

Metadata Update from @frenaud:
- Custom field on_review adjusted to https://github.com/freeipa/freeipa/pull/6889

a year ago

master:

  • 0472067 Upgrade: add PKI drop-in file if missing
  • d76f8fc Integration test: add a test for upgrade and PKI drop-in file

ipa-4-10:

  • f25003a Upgrade: add PKI drop-in file if missing
  • 392e60e Integration test: add a test for upgrade and PKI drop-in file

ipa-4-9:

  • 86c1426 Upgrade: add PKI drop-in file if missing
  • 356ec5c Integration test: add a test for upgrade and PKI drop-in file

Metadata Update from @frenaud:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

a year ago

Hi, I know it's been a long time but I'm returning just to answer the question.

I have not run any FreeIPA upgrade on the server since I filed this bug. Currently, /etc/systemd/system/pki-tomcatd@pki-tomcat.service.d/ipa.conf and its folder do not exist on the server. Also, this instance was started from an AMI (Amazon Machine Image) named RHEL-8.0.0_HVM-20190426-x86_64-1-Hourly2-GP2, which I think was the latest RHEL 8 AMI as of April of 2019.

So it looks like your RHEL 8.0 to 8.1 in-place upgrade thought hit the root cause right on.

I don't see any reason to change this issue's status from Closed: fixed to anything else. Thanks for the fix!

Log in to comment on this ticket.

Metadata