#178 Define Behavior on Task Failure
Closed: Invalid 6 years ago Opened 9 years ago by tflink.

Taskotron isn't very consistent about how task failures behave at this time. This makes it difficult to monitor failing tasks and to handle task failures consistently.

This is a parent tracking task to cover:
define runner behavior on task failure (exit code etc.)
how should failures be reported?
how should crashed tests be reported?
sending appropriate notifications on task failure

The first part of this task is to propose how taskotron should behave on task failure. This will include:
Runner
- exit code usage
- signaling failure (if anything beyond exit code in the runner is needed)
Reporting
- How should failures be reported?
- Which failures should be reported?
Notifications
- Which failure modes should notify taskotron admins?
- How should those notifications work (at least in the short/medium term)?


To define "failure" behaviour, we should IMHO start by describing the "clean" path. There might be stuff I got wrong, so please feel free to correct me where required.

# Koji emits a message
# The trigger parses it and spawns a new task in the execution engine
* Now, the scheduler should also create the Job in ResultsDB, with SCHEDULED status, and pass its id along with the other arguments to the execution engine (see the sketch after this list)
# The slave starts the task execution
* The job's status should be set to RUNNING
* This should ideally be done by adding a step to the BuildFactory()
# runtask.py performs all the steps of the given yml recipe
# The job's state is changed to COMPLETED, once again preferably by a step in the BuildFactory()
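
To make this flow a bit more concrete, here is a minimal sketch of the SCHEDULED → RUNNING → COMPLETED/CRASHED transitions. The URL, payload, response shape and helper names are illustrative assumptions, not the actual ResultsDB API.

```
# Illustrative sketch of the job state flow described above; the ResultsDB
# endpoint, payload, and response shape are assumptions.
import requests

RESULTSDB_URL = 'http://localhost/resultsdb/api'  # placeholder

def create_job(taskname, item):
    # step 2: the trigger creates the job in SCHEDULED state
    resp = requests.post('%s/jobs' % RESULTSDB_URL,
                         data={'name': taskname, 'item': item,
                               'status': 'SCHEDULED'})
    resp.raise_for_status()
    return resp.json()['id']  # job id, passed on to the execution engine

def set_job_status(job_id, status):
    # steps 3 and 5: the slave flips the job to RUNNING, then finally to
    # COMPLETED or CRASHED depending on how the run ended
    resp = requests.post('%s/jobs/%s' % (RESULTSDB_URL, job_id),
                         data={'status': status})
    resp.raise_for_status()
```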

For starters, I believe we should make do with just COMPLETED and CRASHED as the task's outcomes. If anything goes wrong in any of the steps, we should just set the task to CRASHED and inspect the reasons.

What I'd like to see, though, is a way to distinguish between failures in libtaskotron and everywhere else (meaning the actual check's code). My initial idea would be to add a flag (a simple key-value pair, something like test_body: True) to the respective directive(s). The runner would then wrap the directives' execution in a general try...except; if an exception is raised, it would either re-raise it, thus ending with exit code 1 (for directives without the test_body flag), or just print the exception traceback to stderr/stdout and end with exit code 3 (or a different one, should we find a better fit).
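
As a rough illustration of the above, here is a minimal sketch of how the runner could wrap directive execution. The Directive interface, its flags attribute, and the exact exit code value are assumptions, not existing libtaskotron code.

```
# Minimal sketch of the proposed error handling in the runner. The Directive
# interface, its flags attribute and the exact exit code are assumptions.
import sys
import traceback

EXIT_CHECK_CRASHED = 3  # proposed exit code for failures in the check's own code

def run_directives(directives):
    for directive in directives:
        try:
            directive.run()
        except Exception:
            if directive.flags.get('test_body'):
                # the check's own code crashed: print the traceback and use
                # a distinct exit code so the caller can tell the cases apart
                traceback.print_exc(file=sys.stderr)
                sys.exit(EXIT_CHECK_CRASHED)
            # libtaskotron/boilerplate code crashed: re-raise, which makes
            # the interpreter exit with code 1
            raise
```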

Having this differentiation between the user code failing and the 'boilerplate' code failing would enable us to send an additional notification when the user code fails, alerting the check developer that something is probably wrong with their code. We could then also differentiate the task's "execution failures" further, e.g. set NEEDS_INSPECTION when the exception happens in the library code, and CRASHED for the check code.

//Note:// I'm not sure whether there is a way to store the exit code of one BuildFactory() step and pass it to another. This would be required for step #5 to be able to decide whether to set the resulting state to COMPLETED or CRASHED (NEEDS_INSPECTION...). As far as I understand the BuildBot docs, it does not seem to be possible, but the actual implementation is IMHO a bit out of scope for this ticket and could be solved e.g. by storing the exit code in a pre-defined file under /var/run.
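
As a rough sketch of that file-based workaround, assuming a pre-agreed path and the exit codes proposed above (both the path and the state mapping are assumptions, not an existing convention):

```
# Hypothetical sketch of passing the exit code between build steps via a file.
# The path and the state names follow the proposal above.
EXITCODE_FILE = '/var/run/taskotron/runtask_exitcode'  # assumed location

def record_exitcode(code):
    # written by the step that runs runtask.py
    with open(EXITCODE_FILE, 'w') as f:
        f.write(str(code))

def resolve_job_state():
    # read by the final step that updates the job in ResultsDB
    try:
        with open(EXITCODE_FILE) as f:
            code = int(f.read().strip())
    except (IOError, ValueError):
        return 'NEEDS_INSPECTION'  # exit code missing or unreadable
    if code == 0:
        return 'COMPLETED'
    if code == 3:
        return 'CRASHED'           # the check's own code crashed
    return 'NEEDS_INSPECTION'      # library/boilerplate failure
```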

Since all of this depends quite heavily on ResultsDB, we might also consider having a separate notification for when ResultsDB is unreachable.
When it can't be reached during #1, it might be good to be able to "try again later" instead of just throwing the job away altogether (I'm not sure how the trigger is implemented, and whether this is possible and/or even a good idea); a minimal retry sketch follows below.
When the ResultsDB query fails in any other step, we should IMHO send a notification saying that something ran but was not logged. This could then be used to possibly re-schedule the job (or something completely different).
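
To illustrate the "try again later" idea, here's a minimal sketch of a retry wrapper the trigger could put around its ResultsDB call; the function name, retry count and delay are assumptions, not existing trigger code.

```
# Hypothetical retry wrapper for the trigger's ResultsDB call. The retry
# count and delay are arbitrary.
import time
import requests

def post_with_retries(url, data, attempts=3, delay=30):
    for attempt in range(attempts):
        try:
            resp = requests.post(url, data=data)
            resp.raise_for_status()
            return resp
        except requests.RequestException:
            if attempt == attempts - 1:
                raise  # give up; notify admins instead of silently dropping the job
            time.sleep(delay)
```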

This sounds like a good start.

Metadata Update from @tflink:
- Issue marked as depending on: #177

6 years ago

I believe this is quite outdated now.

Metadata Update from @kparal:
- Issue close_status updated to: Invalid

6 years ago
