#413 RFE: Allow specifying whether a single arch failure auto cancels sibling tasks from "koji build" command line option
Closed: Fixed 4 years ago Opened 5 years ago by mbooth.

I'm talking about the "build_arch_can_fail" option, which is set to true in Fedora these days:

https://pagure.io/koji/blob/master/f/builder/kojid.conf#_90

Quite often I want my long-running builds (cough Eclipse cough) to "fail-fast" so that I do not end up waiting hours for a build only to find a single arch failed at the first hurdle.

It would be great if I could override the default failure behaviour on a per-build basis, through an option to the koji command line tool, for example: "koji build --fail-fast=true"


side note: you can always cancel the build if you see that one of the subtasks failed

Yeah, if I notice it, that's what I do :-)

Talking on the #fedora-devel channel with @tibbs -- depending on the circumstances that lead to the change in the default behaviour, we wondered if the inverse would be a more preferable set-up. I.e., change the default behaviour back to failing fast and require the option "--fail-fast=false" to override and force all subtasks to run to completion.

On reflection, I think I like this way better than my original proposal. Thoughts?

The problem with canceling builds is that many people don't. At one point recently there were four libreoffice scratch builds going at once, consuming most of the s390x hosts. They timed out on s390x after two days but are still going on ppc64.

I would much rather have fail_fast be the default, but alterable by the user doing the build if they actually want the current behavior. That way you'd have to take some action if you really want to occupy one of the slower builders even though it's already known that the overall build will fail.

I understand that this was changed in Fedora in part because there was no way for a packager to specify which behavior they wanted. But with the secondary arches moving in to the main build setup, we're now running into the situation where builds can block for hours waiting on a free host, and so having fail-fast for most builds would really help to keep the queue clear.

So I guess for full flexibility here, we'd need the ability to specify the default in kojid, as well as the cli option to override.

Ping @kevin -- is this request something that Fedora would adopt? (the ability to default to fail-fast)

I didn't realize this wasn't already a koji configurable, sorry.

I can imagine more complicated situations where only "resource constrained" architectures would fail early, or fail-early behavior would only kick in when there are no available builders in the pool or something like that, but simpler is probably better here.

Just to add perhaps some more useful evidence. As I write this:

  • The four libreoffice builds are still going for whatever reason; they've been running for over eight days now.
  • There are three mongodb scratch builds running on s390x which will all eventually fail.
  • There are two scratch builds of GeographicLib which have failed on all architectures but have yet to even start on s390x.

Now, we could start an education campaign asking people to please cancel their scratch builds instead of just fixing and resubmitting, but I'm not entirely sure it would help.

It's not configurable the way this ticket describes.

The existing code provides a build_arch_can_fail boolean config option (defaults to false) for kojid that switches the behavior with no user override.
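As a rough illustration of what that boolean switches, here is a toy model. It is not kojid's actual code: the function name `run_arch_subtasks`, the input dictionary, and the state labels are all made up for the example.

```python
def run_arch_subtasks(results, build_arch_can_fail=False):
    """Toy model of per-arch subtasks (hypothetical, not kojid code).

    `results` maps arch -> True (subtask succeeds) or False (subtask fails),
    in the order the subtasks would finish. Returns arch -> state label.
    """
    states = {}
    failed = False
    for arch, ok in results.items():
        if failed and not build_arch_can_fail:
            # Fail-fast: once one arch has failed, remaining siblings are canceled.
            states[arch] = "canceled"
            continue
        if ok:
            states[arch] = "closed"
        else:
            states[arch] = "failed"
            failed = True
    return states


# One early failure on x86_64, with and without build_arch_can_fail.
outcome = {"x86_64": False, "s390x": True, "ppc64": True}
fast = run_arch_subtasks(outcome)
slow = run_arch_subtasks(outcome, build_arch_can_fail=True)
```

In both cases the overall build still fails; the only difference is whether the slower builders stay occupied.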

The initial request in this ticket was to provide a user option to disable build_arch_can_fail on a per-build basis.

Then folks suggested that the builds should fast fail by default and only continue subtasks when an option is specified.

In all this, we need to consider

  • there may be Koji instances that do not want this feature to be available at all (or are there?)
  • we should do our best to be backwards compatible and not surprise our users

Does it make sense to give build_arch_can_fail more states? (enabled/disabled/expected/allowed)

  • enabled - user can do nothing, behaviour is forced to can_fail
  • disabled - user can do nothing, kojid behaves old way, failing on first error
  • expected - default behaviour is enabled with possible user override via --fail-fast
  • allowed - default behaviour is disabled with possible user override via --fail-slow

I'm not happy with the naming, and maybe it is also unnecessarily complicated.
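To make the four proposed states concrete, here is a minimal sketch of how they could resolve to a yes/no answer. Everything in it is hypothetical: the `CanFail` enum, the `arch_can_fail` function, and the `fail_fast`/`fail_slow` arguments only mirror the names floated in this ticket and are not existing Koji code.

```python
from enum import Enum


class CanFail(Enum):
    """Hypothetical states for an extended build_arch_can_fail setting."""
    ENABLED = "enabled"    # forced can_fail, user flags ignored
    DISABLED = "disabled"  # forced fail on first error, user flags ignored
    EXPECTED = "expected"  # can_fail by default, user may pass --fail-fast
    ALLOWED = "allowed"    # fail-fast by default, user may pass --fail-slow


def arch_can_fail(state, fail_fast=False, fail_slow=False):
    """Return True if sibling subtasks keep running after one arch fails."""
    if state is CanFail.ENABLED:
        return True
    if state is CanFail.DISABLED:
        return False
    if state is CanFail.EXPECTED:
        # Default is to keep going unless the user asked to fail fast.
        return not fail_fast
    # CanFail.ALLOWED: default is fail-fast unless the user asked to fail slow.
    return fail_slow
```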

I guess the question is how many of the options above are desired? Currently we only have the first two, depending on the value of build_arch_can_fail.

If we add the user-facing options, does anyone out there want to disallow those? That is, force fail-fast even if the user passes --fail-slow, or vice versa? My gut feeling is probably not. The task is going to fail either way.

Metadata Update from @tkopecek:
- Issue set to the milestone: 1.14

5 years ago

I would say that only the first three make sense. Should I go with these?

Except I think #4 is specifically being asked for. See @mbooth's comment:

Talking on the #fedora-devel channel with @tibbs -- depending on the circumstances that lead to the change in the default behaviour, we wondered if the inverse would be a more preferable set-up. I.e., change the default behaviour back to failing fast and require the option "--fail-fast=false" to override and force all subtasks to run to completion.

On reflection, I think I like this way better than my original proposal. Thoughts?

In fact, if I had to do it all over again, I might be inclined to support only #4 (of course, that is not an option now).

Yes, the fourth option is the one which would be useful in the current situation.

I don't understand why this would be any more complicated than a default setting on the koji server (fail-fast or not) and a flag passed by the client which overrides the default setting. Is there some reason to add additional states for ignoring a client-supplied flag? If there is, why wouldn't that be just a second flag on the server?

Where we're at:

  • build_arch_can_fail is a kojid config option that affects all builds
  • no user override

Where PR #432 gets us:

  • adds a user override to disable build_arch_can_fail per build

(this is basically the 1-3 cases above)

Where we probably need to be:

  • Make the default failure mode (fast or slow) configurable in kojid
  • Allow the user to override the failure mode per build

I don't think there is much harm in allowing the user this latitude, but I suppose we could add another config option to control that as well.
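A minimal sketch of that end state, assuming a kojid-side default, an optional per-build request, and an optional switch to ignore user requests. The names `effective_fail_mode`, `default_mode`, `requested`, and `allow_override` are invented here for illustration and are not part of Koji.

```python
VALID_MODES = ("fast", "slow")


def effective_fail_mode(default_mode, requested=None, allow_override=True):
    """Pick the failure mode for one build (hypothetical helper).

    default_mode   -- the mode configured on the builder ("fast" or "slow")
    requested      -- the mode asked for by the user, or None if not given
    allow_override -- whether the builder honours the user's request
    """
    if default_mode not in VALID_MODES:
        raise ValueError("invalid default fail mode: %r" % (default_mode,))
    if requested is None or not allow_override:
        return default_mode
    if requested not in VALID_MODES:
        raise ValueError("invalid requested fail mode: %r" % (requested,))
    return requested
```

With `allow_override=True` this is just "server default, client flag wins", which is the simple model asked for earlier in the thread.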

If I'm right about where we want to get to, then the option naming in #432 probably needs adjusting. I think passing a single fail_mode option would be simpler than passing either fail_fast or fail_slow.

If I'm right about where we want to get to,

I think so -- it certainly satisfies my original request, thanks for working on it.

then the option naming in #432 probably needs adjusting. I think passing a single fail_mode option would be simpler than passing either fail_fast or fail_slow.

I agree there's probably not much point in having two options. Maybe --fail_mode=[fast|slow] is better than my original suggestion of --fail_fast=[true|false] because it allows for the addition of future new failure modes... --fail_mode=spectacular maybe? :-)
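For what it's worth, a single option with a fixed set of choices is easy to express. The sketch below uses Python's standard argparse in a standalone script; it is not the koji client's own option parsing, and the `--fail_mode` flag and the `parse_build_args` helper are only the hypothetical names from this discussion.

```python
import argparse


def parse_build_args(argv):
    """Parse a hypothetical 'build' command line with a --fail_mode option."""
    parser = argparse.ArgumentParser(prog="build-sketch")
    parser.add_argument(
        "--fail_mode",
        choices=["fast", "slow"],
        default=None,  # None means "use whatever the builder is configured with"
        help="override the default failure mode for this build",
    )
    parser.add_argument("target")
    parser.add_argument("srpm")
    return parser.parse_args(argv)


opts = parse_build_args(["--fail_mode=fast", "f27", "eclipse.src.rpm"])
```

Adding a new mode later would only mean extending the `choices` list.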

I'm less concerned about the command line option than the API itself.
