#8096 Nightly test failure in test_integration/test_topology.py::TestTopologyOptions::()::test_topology_updated_on_replica_install_remove::setup while getting RA cert
Opened 8 months ago by frenaud. Modified 2 months ago

Issue

The nightly test failed during ipa-server-install while trying to get the RA cert.
Logs: PR #3785, artifacts on ipa-4-8 branch.

Same issue also randomly happened in other test suites (for instance PR #3652, artifacts on master branch) .

Additional info:

The ipa-server-install logs show the following:

2019-10-12T18:11:39Z DEBUG   [12/29]: requesting RA certificate from CA
2019-10-12T18:11:39Z DEBUG Starting external process
2019-10-12T18:11:39Z DEBUG args=['/usr/bin/openssl', 'pkcs7', '-inform', 'DER', '-print_certs', '-out', '/var/lib/ipa/tmpbzp2261x']
2019-10-12T18:11:39Z DEBUG Process finished, return code=0
2019-10-12T18:11:39Z DEBUG stdout=
2019-10-12T18:11:39Z DEBUG stderr=
2019-10-12T18:11:39Z DEBUG Starting external process
2019-10-12T18:11:39Z DEBUG args=['/usr/bin/openssl', 'pkcs12', '-nokeys', '-clcerts', '-in', '/root/ca-agent.p12', '-out', '/var/lib/ipa/tmpzopsbyp7', '-passin', 'file:/tmp/tmphewltv7e']
2019-10-12T18:11:40Z DEBUG Process finished, return code=0
2019-10-12T18:11:40Z DEBUG stdout=
2019-10-12T18:11:40Z DEBUG stderr=
2019-10-12T18:11:40Z DEBUG Starting external process
2019-10-12T18:11:40Z DEBUG args=['/usr/bin/openssl', 'pkcs12', '-nocerts', '-in', '/root/ca-agent.p12', '-out', '/var/lib/ipa/tmp4ugfb6sb', '-passin', 'file:/tmp/tmpf4c4gmv6', '-nodes']
2019-10-12T18:11:42Z DEBUG Process finished, return code=0
2019-10-12T18:11:42Z DEBUG stdout=
2019-10-12T18:11:42Z DEBUG stderr=
2019-10-12T18:11:42Z DEBUG certmonger request is in state dbus.String('GENERATING_KEY_PAIR', variant_level=1)
2019-10-12T18:11:47Z DEBUG certmonger request is in state dbus.String('NEED_KEYINFO_READ_PIN', variant_level=1)
2019-10-12T18:11:52Z DEBUG certmonger request is in state dbus.String('NEED_KEYINFO_READ_PIN', variant_level=1)
2019-10-12T18:11:57Z DEBUG certmonger request is in state dbus.String('NEED_KEYINFO_READ_PIN', variant_level=1)
2019-10-12T18:12:02Z DEBUG certmonger request is in state dbus.String('NEED_KEYINFO_READ_PIN', variant_level=1)
2019-10-12T18:12:07Z DEBUG certmonger request is in state dbus.String('NEED_KEYINFO_READ_PIN', variant_level=1)
2019-10-12T18:12:12Z DEBUG wait_for_request raised request timed out
2019-10-12T18:12:12Z DEBUG Cert request 20191012181142 failed: TIMEOUT (None)
2019-10-12T18:12:12Z DEBUG 20191012181142 not in final state, continue waiting
[...]
2019-10-12T18:16:48Z DEBUG certmonger request is in state dbus.String('NEED_KEYINFO_READ_PIN', variant_level=1)
2019-10-12T18:16:53Z DEBUG wait_for_request raised request timed out
2019-10-12T18:16:53Z DEBUG Cert request 20191012181142 failed: TIMEOUT (None)
2019-10-12T18:16:53Z DEBUG Request 20191012181142 reached resubmit deadline
2019-10-12T18:16:53Z DEBUG Traceback (most recent call last):
  File "/usr/lib/python3.7/site-packages/ipaserver/install/service.py", line 603, in start_creation
    run_step(full_msg, method)
  File "/usr/lib/python3.7/site-packages/ipaserver/install/service.py", line 589, in run_step
    method()
  File "/usr/lib/python3.7/site-packages/ipaserver/install/cainstance.py", line 877, in __request_ra_certificate
    resubmit_timeout=api.env.certmonger_wait_timeout
  File "/usr/lib/python3.7/site-packages/ipalib/install/certmonger.py", line 366, in request_and_wait_for_cert
    "Certificate issuance failed ({}: {})".format(state, ca_error)
RuntimeError: Certificate issuance failed (TIMEOUT: None)

2019-10-12T18:16:53Z DEBUG   [error] RuntimeError: Certificate issuance failed (TIMEOUT: None)

and the journal shows:

Oct 12 18:11:39 master.ipa.test certmonger[18239]: 2019-10-12 18:11:39 [18239] No hooks set for ca-pre-save command.
Oct 12 18:11:40 master.ipa.test certmonger[18239]: 2019-10-12 18:11:39 [18239] No hooks set for ca-post-save command.
Oct 12 18:11:40 master.ipa.test certmonger[18239]: 2019-10-12 18:11:40 [18296] Certificate "Local Signing Authority" valid for 31622398s.
Oct 12 18:11:43 master.ipa.test certmonger[18239]: 2019-10-12 18:11:43 [18320] Internal error reading key from "/var/lib/ipa/ra-agent.key".
Oct 12 18:11:43 master.ipa.test certmonger[18239]: 2019-10-12 18:11:43 [18320] error:0909006C:PEM routines:get_name:no start line
Oct 12 18:11:44 master.ipa.test certmonger[18239]: 2019-10-12 18:11:44 [18322] Error opening key file "/var/lib/ipa/ra-agent.key" for writing: File exists.
Oct 12 18:11:44 master.ipa.test certmonger[18239]: 2019-10-12 18:11:44 [18322] Attempting to open key file "/var/lib/ipa/ra-agent.key.Wezl4qBf.key" for writing.
Oct 12 18:11:44 master.ipa.test certmonger[18239]: 2019-10-12 18:11:44 [18323] Internal error reading key from "/var/lib/ipa/ra-agent.key".
Oct 12 18:11:44 master.ipa.test certmonger[18239]: 2019-10-12 18:11:44 [18323] error:0909006C:PEM routines:get_name:no start line

I almost think we need to install an instrumented certmonger (or I need to make general improvements). There are some non-unique error messages in certmonger. The message "Error opening key file" appears 10 times across 5 mechanisms.

certmonger tries to open the key file with O_CREAT | O_EXCL which can return "File exists".

Why the file exists I have no idea though. Previous cleanup failed?

Based on the "Attempting" message I'm assuming this is happening in cm_keygen_o_main()

So certmonger can't open the existing file exclusively so it uses a temporary file instead. That seems to work, or at least it doesn't throw any errors. But I don't see where certmonger moves the newly generated key over. I'll play around with this.

Metadata Update from @rcritten:
- Issue assigned to rcritten

8 months ago

I think the certmonger error may be a red herring. Looking at the install log it looks like it tried, and failed, to install IPA three times. The first time it seems to have failed during the DS installation step, the subsequent two with the reported error.

I wonder if the first failure left a remnant, though it's hard to know because it doesn't seem to have gotten that far and we have no logging to know one way or the other because the logs are overwritten each time.

I wonder if the CA install failed initially for some reason but not far enough that the RA agent would have been marked for cleanup. I'll look into that.

The CA uninstaller is pretty clear about removing it in ipaserver/install/ca.py so if it got that far it would have been removed.

I don't know that root cause can be determined because logs aren't maintained, only the last one is available.

As near as I can tell if a test class has multiple tests and one fails then the test framework will uninstall whatever mess was left over and try to re-install and continue on. This avoids false positives but the logs for the failed install are not saved which makes diagnosing it difficult.

@ftrivino have you seen this before? Is saving the logs for failed individual runs something that can be done? I'm not all that familiar with the guts of the test framework code. I could probably figure it out over time but is this something your team can handle better?

I don't know that root cause can be determined because logs aren't maintained, only the last one is available.
As near as I can tell if a test class has multiple tests and one fails then the test framework will uninstall whatever mess was left over and try to re-install and continue on. This avoids false positives but the logs for the failed install are not saved which makes diagnosing it difficult.
@ftrivino have you seen this before?

@rcritten this issue is intermittent and looks like it's becoming more frequent. I haven't seen it before but I think we should try to reproduce it, and we believe this is happening in our last nightly regressions runs: e.g. https://github.com/freeipa/freeipa/pull/3833

Is saving the logs for failed individual runs something that can be done?

The logs are storage in our AWS bucket forever. Other artifacts and vagrant environments are destroyed after three days as part of auto-cleaner process.

I'm not all that familiar with the guts of the test framework code. I could probably figure it out over time but is this something your team can handle better?

These reports are kept forever: http://freeipa-org-pr-ci.s3-website.eu-central-1.amazonaws.com/jobs/3075acbc-faab-11e9-be30-fa163ec27a04/

is there anything there that you are missing?

Also happened in test_integration/test_caless.py::TestServerReplicaCALessToCAFull::()::test_replica_ipa_ca_install, please see issue #8253

Issue also happened in fedora-latest/test_ipahealthcheck in the test test_integration/test_ipahealthcheck.py::TestIpaHealthCheckWithoutDNS::test_ipa_dns_systemrecords_check::setup in PR #4511

Login to comment on this ticket.

Metadata