#8425 Nightly test failure in test_cert.test_cert.TestInstallMasterClient (certmonger timeout)
Closed: fixed 8 months ago by abbra. Opened 9 months ago by fcami.

testing_master_latest Nightly PR #305 failed in test_cert.
logs

    def wait_for_request(host, request_id, timeout=120):
        for _i in range(0, timeout, 5):
            result = host.run_command(
                "getcert list -i %s | grep status: | awk '{ print $2 }'" %
                request_id
            )

            state = result.stdout_text.strip()
            logger.info("certmonger request is in state %s", state)
            if state in ('CA_REJECTED', 'CA_UNREACHABLE', 'CA_UNCONFIGURED',
                         'NEED_GUIDANCE', 'NEED_CA', 'MONITORING'):
                break
            time.sleep(5)
        else:
>           raise RuntimeError("request timed out")
E           RuntimeError: request timed out
INFO     ipatests.pytest_ipa.integration.host.Host.master.IPAOpenSSHTransport:transport.py:391 RUN getcert list -i test-request | grep status: | awk '{ print $2 }'
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd128:transport.py:513 RUN getcert list -i test-request | grep status: | awk '{ print $2 }'
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd128:transport.py:557 SUBMITTING
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd128:transport.py:217 Exit code: 0
INFO     ipatests.pytest_ipa.integration.tasks:tasks.py:2233 certmonger request is in state SUBMITTING
INFO     ipatests.pytest_ipa.integration.host.Host.master.IPAOpenSSHTransport:transport.py:391 RUN getcert list -i test-request | grep status: | awk '{ print $2 }'
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd129:transport.py:513 RUN getcert list -i test-request | grep status: | awk '{ print $2 }'
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd129:transport.py:557 SUBMITTING
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd129:transport.py:217 Exit code: 0
INFO     ipatests.pytest_ipa.integration.tasks:tasks.py:2233 certmonger request is in state SUBMITTING
INFO     ipatests.pytest_ipa.integration.host.Host.master.IPAOpenSSHTransport:transport.py:391 RUN getcert list -i test-request | grep status: | awk '{ print $2 }'
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd130:transport.py:513 RUN getcert list -i test-request | grep status: | awk '{ print $2 }'
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd130:transport.py:557 SUBMITTING
DEBUG    ipatests.pytest_ipa.integration.host.Host.master.cmd130:transport.py:217 Exit code: 0
INFO     ipatests.pytest_ipa.integration.tasks:tasks.py:2233 certmonger request is in state SUBMITTING

Also happened in [testing_master_latest] in PR #317, see logs.

The test creates the subCA
[Thu Jul 30 01:05:38.750262 2020] [wsgi:error] [pid 23852:tid 140053211711232] [remote 192.168.122.104:46110] ipa: INFO: [jsonserver_session] admin@IPA.TEST: caacl_add_ca/1('hosts_services_caIPAserviceCert', version='2.239', ca=('mysubca',)): SUCCESS

then certmonger makes the request:
Jul 30 01:05:39 master.ipa.test certmonger[24999]: 2020-07-30 01:05:39 [24999] Running enrollment helper "/usr/libexec/certmonger/ipa-server-guard".

and IPA never gets the request.

This suggest that ipa-server-guard couldn't get the lock.

This is also reported with each getcert execution, not sure if it is related:
Jul 30 01:05:44 master.ipa.test getcert[25013]: usr/lib/api/apiutil.c Could not open /run/lock/opencryptoki/LCK..APIlock

Metadata Update from @rcritten:
- Issue assigned to rcritten

8 months ago

@frenaud did some investigation in this, or another related ticket, and determined that indeed the lock is getting hung up because certmonger is in some cases issuing a KILL to make child processes go away!?

The way to duplicate this behavior is:

while [ true ]; do getcert modify-ca -c IPA -e "/usr/libexec/certmonger/ipa-server-guard /usr/libexec/certmonger/ipa-submit"; done

If you start certmonger with a strace -e signal you'll eventually see a ton of KILLs.

Interestingly in certmonger there is a proper waidpid() immediately after the SIGKILL call. The side effect of removing the kill is that operations are actually done more serially because of the lock, so therefore much slower. certmonger runs in tight loop so any one call can affect others. Try running the above while loop then try getcert list to see what I mean.

I'm going to propose some changes to ipa-server-guard that will reduce its execution time by half for unsupported operations and by about 25% otherwise by removing unnecessary imports (six, print_statement) and moving others to be JIT (certs and traceback).

master:

  • 18a8a41 Improve performance of ipa-server-guard

ipa-4-8:

  • aac7179 Improve performance of ipa-server-guard

Metadata Update from @abbra:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

8 months ago

Login to comment on this ticket.

Metadata