#8500 "ipa ca-show" returns 0 even if it fails to retrieve a subCA
Opened 3 years ago by fcami. Modified 3 years ago

Issue

[root@ipa9 ~]# ipa ca-show test_subca_master 
ipa: ERROR: The certificate for test_subca_master is not available on this server.
  Name: test_subca_master
  Description: subca
  Authority ID: 58cefae9-a8c3-4330-a58a-59546cf0ec18
  Subject DN: CN=test_subca_master
  Issuer DN: CN=Certificate Authority,O=LAPTOP.EXAMPLE.ORG
[root@ipa9 ~]# echo $?
0

@ftweedal What is the implication of trying to retrieve a subca cert from the CA and getting a 404, which drives this exception? Is it that replication hasn't happened yet, or that the user doesn't have permissions?

The exception is displayed to the user via the messages API so this isn't treated as an error and the return value from the command is 0 rather than 1.

ipalib/frontend.py::output_for_cli() is pretty limited in what it considers an error beyond exceptions (limited to add/remove memberships and searches).

@rcritten @fcami it occurs because key/cert replication did not occur yet. Previously, this caused the whole command to fail. It was observed in relation to https://pagure.io/freeipa/issue/7964. Commit 854d305 implemented a change to tolerate this condition.

If the condition does not resolve itself quickly, it could be due to an ipaca DB replication error, custodia not servicing requests properly, or some other kind of key retrieval error (GSS-API error, bad custodia keys, ...)

Maybe there should be a health check for LWCA keys and certs being present (if there isn't already one).

IMO this behaviour is not a bug, since the condition legitimately arises in normal operation. But if it does not resolve itself quickly, then it points to a problem somewhere else.

The problem we're trying to solve is automation. If the return value is not reliable then the test (or user) will have to scrape the output to look for the message. This just seems wrong.
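To make the scraping workaround concrete, here is a minimal sketch of what a test would have to do today. It assumes the error line always starts with the "ipa: ERROR:" prefix seen in the output above, and it checks both stdout and stderr because this ticket does not say which stream the message goes to. The helper names are made up for illustration.

```python
import subprocess

# Prefix taken from the output quoted in this ticket; assumed to be stable.
ERROR_PREFIX = "ipa: ERROR:"


def output_has_error(output: str) -> bool:
    """Return True if any line of the CLI output starts with the error prefix."""
    return any(line.startswith(ERROR_PREFIX) for line in output.splitlines())


def ca_show_succeeded(name: str) -> bool:
    """Run `ipa ca-show NAME`; treat a printed error as failure even if the exit status is 0."""
    proc = subprocess.run(
        ["ipa", "ca-show", name], capture_output=True, text=True
    )
    combined = "\n".join([proc.stdout, proc.stderr])
    return proc.returncode == 0 and not output_has_error(combined)
```

Every caller that cares about this condition would need its own copy of something like this, which is the fragility being described.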

Do you want to return non-zero for this case, even though it is not necessarily an error?
I am OK with it in this case because although it is an expected condition, if it persists then something is wrong.

A way to hint a requested exit status to the frontend without throwing an exception would be nice. It will have to be done in a backwards-compatible way.

I'm trying to understand what the conditions are where this would occur to determine if we should return non-zero or not. It seems odd to have a transient error like this beyond the case of slow replication (which we can't really detect).

There are some tests that do things like ipa ca-add foo immediately followed by ipa ca-show foo which are fragile because of this. It isn't possible based on the returncode to know if something, anything, went wrong. Was foo created? Probably. Would we want to do something reactive to the non-zero return code? I dunno. Maybe we should do a retry of the get on failure.

It's almost like the return code should be 0.5 or something. It mostly worked :-)

If it's transient, it's not an error. If it persists, it is an error :)

Maybe we should use a specific nonzero exit status to signal this condition, and document what it means. In the context of a test, if it occurs we can ignore it if appropriate.
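A rough sketch of that idea, not based on the actual frontend code: the client picks its exit status from the messages attached to a result. The message shape (a dict with a "type" key), the function name, and the value 3 are all assumptions for illustration; nothing in this ticket fixes them.

```python
from typing import Iterable, Mapping

EXIT_OK = 0
# Hypothetical dedicated status for "entry found, certificate not yet available".
EXIT_CERT_NOT_YET_AVAILABLE = 3


def exit_status_for(messages: Iterable[Mapping[str, str]]) -> int:
    """Map result messages to an exit status without raising an exception."""
    for message in messages:
        # Assumed shape: each message carries a "type" such as "error" or "warning".
        if message.get("type") == "error":
            return EXIT_CERT_NOT_YET_AVAILABLE
    return EXIT_OK
```

A test could then accept both 0 and the documented status after ca-add, while still failing on any other nonzero value.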

It's why I wonder if there should be a server-side retry, wait for an entry, something, so that server-side it can be determined to be transient or not.

If someone does a ca-add they will almost certainly get a 0 response code and think everything is a-ok. That doesn't seem to be guaranteed in this case.

It's in the same class of issue as replication problems in general. Say you did user-add... that will always return zero, but it doesn't mean everything replicated properly.

But let's think it through... if ca-show sees the IPA CA object, then IPA replication succeeded. If it does not see the corresponding Dogtag object (or its certificate) then either the Dogtag LDAP object has not replicated yet, or the custodia key replication did not succeed yet. Retry? Sure, but how long is "enough"? 5 seconds sounds reasonable. Maybe we retry eagerly (say, at 1s intervals, and give up after 5 attempts?)
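The eager retry could look roughly like the following sketch. It is not the eventual implementation: `fetch` stands in for whatever call retrieves the certificate and is assumed to return None while the certificate is unavailable. The function name and the injectable `sleep` parameter are made up for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_fetch(
    fetch: Callable[[], Optional[T]],
    attempts: int = 5,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call fetch() up to `attempts` times, waiting `interval` seconds between tries.

    Returns the first non-None result, or None if every attempt came back empty.
    """
    for attempt in range(attempts):
        result = fetch()
        if result is not None:
            return result
        # Do not sleep after the final attempt; just give up.
        if attempt < attempts - 1:
            sleep(interval)
    return None
```

With the defaults this waits at most 4 seconds in total (four pauses between five attempts), so the caller only reports the condition once it has outlasted ordinary replication lag.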

There is no perfect solution, but I think this is a satisfactory approach.
