Issue #413: RFE: Allow specifying whether a single arch failure auto cancels sibling tasks from "koji build" command line option - koji


#413 RFE: Allow specifying whether a single arch failure auto cancels sibling tasks from "koji build" command line option

Closed: Fixed 6 years ago Opened 6 years ago by mbooth.

I'm talking about the "build_arch_can_fail" option, which is set to true in Fedora these days:

https://pagure.io/koji/blob/master/f/builder/kojid.conf#_90

Quite often I want my long-running builds (cough Eclipse cough) to "fail-fast" so that I do not end up waiting hours for a build only to find a single arch failed at the first hurdle.

It would be great if I could override the default failure behaviour on a per-build basis, through an option to the koji command line tool, for example: "koji build --fail-fast=true"

mikem commented 6 years ago

sounds reasonable

mikem commented 6 years ago

side note: you can always cancel the build if you see that one of the subtasks failed

mbooth commented 6 years ago

> side note: you can always cancel the build if you see that one of the subtasks failed

Yeah, if I notice it, that's what I do :-)

tkopecek commented 6 years ago

PR #432

mbooth commented 6 years ago

Talking on the #fedora-devel channel with @tibbs -- depending on the circumstances that led to the change in the default behaviour, we wondered if the inverse would be a preferable set-up. I.e., change the default behaviour back to failing fast and require the option "--fail-fast=false" to override and force all subtasks to run to completion.

On reflection, I think I like this way better than my original proposal. Thoughts?

tibbs commented 6 years ago

The problem with canceling builds is that many people don't. At one point recently there were four libreoffice scratch builds going at once, consuming most of the s390x hosts. They timed out on s390x after two days but are still going on ppc64.

I would much rather have fail_fast be the default, but alterable by the user doing the build if they actually want the current behavior. That way you'd have to take some action if you really want to occupy one of the slower builders even though it's already known that the overall build will fail.

I understand that this was changed in Fedora in part because there was no way for a packager to specify which behavior they wanted. But with the secondary arches moving into the main build setup, we're now running into the situation where builds can block for hours waiting on a free host, and so having fail-fast for most builds would really help to keep the queue clear.

mikem commented 6 years ago

So I guess for full flexibility here, we'd need the ability to specify the default in kojid, as well as the cli option to override.

Ping @kevin -- is this request something that Fedora would adopt? (the ability to default to fail-fast)

tibbs commented 6 years ago

I didn't realize this wasn't already a koji configurable, sorry.

I can imagine more complicated situations where only "resource constrained" architectures would fail early, or fail-early behavior would only kick in when there are no available builders in the pool or something like that, but simpler is probably better here.

Just to add perhaps some more useful evidence. As I write this:

- The four libreoffice builds are still going for whatever reason; they've been running for over eight days now.
- There are three mongodb scratch builds running on s390x which will all eventually fail.
- There are two scratch builds of GeographicLib which have failed on all architectures but have yet to even start on s390x.

Now, we could start an education campaign asking people to please cancel their scratch builds instead of just fixing and resubmitting, but I'm not entirely sure it would help.

mikem commented 6 years ago

> I didn't realize this wasn't already a koji configurable, sorry.

It's not configurable the way this ticket describes.

The existing code provides a build_arch_can_fail boolean config option (defaults to false) for kojid that switches the behavior with no user override.

The initial request in this ticket was to provide a user option to disable build_arch_can_fail on a per-build basis.

Then folks suggested that the builds should fast fail by default and only continue subtasks when an option is specified.

In all this, we need to consider:

- there may be Koji instances that do not want this feature to be available at all (or are there?)
- we should do our best to be backwards compatible and not surprise our users
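The two behaviours under discussion can be illustrated with a toy model. This is only a sketch, not Koji code: the `ArchTask` class, the `run_build` function, the state labels, and the durations are all hypothetical, and it assumes every arch task starts at the same time (it ignores time spent waiting for a free builder).

```python
from dataclasses import dataclass


@dataclass
class ArchTask:
    arch: str
    duration: int   # hypothetical time units the task needs to finish
    succeeds: bool  # whether the task would succeed if left to finish


def run_build(tasks, fail_fast):
    """Return (overall_ok, states, elapsed) for a toy parallel build.

    fail_fast=True  -> the first failure cancels siblings still running.
    fail_fast=False -> every sibling runs to completion before the
                       build is reported as failed.
    """
    failure_times = [t.duration for t in tasks if not t.succeeds]
    first_failure = min(failure_times) if failure_times else None

    states = {}
    for t in tasks:
        still_running = first_failure is not None and t.duration > first_failure
        if fail_fast and still_running:
            states[t.arch] = "canceled"
        else:
            states[t.arch] = "closed" if t.succeeds else "failed"

    if fail_fast and first_failure is not None:
        elapsed = first_failure
    else:
        elapsed = max(t.duration for t in tasks)
    return first_failure is None, states, elapsed


tasks = [
    ArchTask("x86_64", 1, False),  # fails at the first hurdle
    ArchTask("ppc64", 30, True),
    ArchTask("s390x", 48, True),
]
print(run_build(tasks, fail_fast=True))
print(run_build(tasks, fail_fast=False))
```

In both modes the overall build fails; the only difference in this model is how long the slower builders stay occupied before that result is reported.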

tkopecek commented 6 years ago

Does it make sense to give build_arch_can_fail more states? (enabled/disabled/expected/allowed)

- enabled - user can do nothing, behaviour is forced to can_fail
- disabled - user can do nothing, kojid behaves the old way, failing on first error
- expected - default behaviour is enabled, with possible user override via --fail-fast
- allowed - default behaviour is disabled, with possible user override via --fail-slow

I'm not happy with the naming, and maybe it is also unnecessarily complicated.
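The four proposed states can be read as a small decision table. The following is a minimal sketch of that reading, not the code from PR #432: the `CanFailPolicy` enum and the `siblings_continue` function are hypothetical names, and the user flag is modelled as a plain string.

```python
from enum import Enum
from typing import Optional


class CanFailPolicy(Enum):
    ENABLED = "enabled"    # siblings always finish; no user override
    DISABLED = "disabled"  # always fail on first error; no user override
    EXPECTED = "expected"  # siblings finish unless the user passes --fail-fast
    ALLOWED = "allowed"    # fail on first error unless the user passes --fail-slow


def siblings_continue(policy: CanFailPolicy, user_flag: Optional[str] = None) -> bool:
    """Return True if sibling arch tasks keep running after one arch fails.

    user_flag is "fail-fast", "fail-slow", or None (hypothetical encoding).
    """
    if policy is CanFailPolicy.ENABLED:
        return True
    if policy is CanFailPolicy.DISABLED:
        return False
    if policy is CanFailPolicy.EXPECTED:
        return user_flag != "fail-fast"
    # ALLOWED: fail fast by default, continue only on explicit request
    return user_flag == "fail-slow"
```

Under this reading, the first two states ignore the user flag entirely, while the last two differ only in which behaviour is the default.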

mikem commented 6 years ago

I guess the question is how many of the options above are desired? Currently we only have the first two, depending on the value of build_arch_can_fail.

If we add the user facing options, does anyone out there want to disallow those? That is, force fail-fast even if user passes --fail-slow, or vice-versa? My gut feeling is probably not. The task is going to fail either way.

Metadata Update from @tkopecek:
- Issue set to the milestone: 1.14

6 years ago

tkopecek commented 6 years ago

I would say that only the first three make sense. Should I go with these?

mikem commented 6 years ago

> I would say that only the first three make sense. Should I go with these?

Except I think #4 is specifically being asked for. See @mbooth's comment:

> Talking on the #fedora-devel channel with @tibbs -- depending on the circumstances that led to the change in the default behaviour, we wondered if the inverse would be a preferable set-up. I.e., change the default behaviour back to failing fast and require the option "--fail-fast=false" to override and force all subtasks to run to completion.
>
> On reflection, I think I like this way better than my original proposal. Thoughts?

In fact, if I had to do it all over again, I might be inclined to support only #4 (of course, that is not an option now).

tibbs commented 6 years ago

Yes, the fourth option is the one which would be useful in the current situation.

I don't understand why this would be any more complicated than a default setting on the koji server (fail-fast or not) and a flag passed by the client which overrides the default setting. Is there some reason to add additional states for ignoring a client-supplied flag? If there is, why wouldn't that be just a second flag on the server?

mikem commented 6 years ago

Where we're at:

- build_arch_can_fail is a kojid config option that affects all builds
- no user override

Where PR #432 gets us:

- adds a user override to disable build_arch_can_fail per build

(this is basically the 1-3 cases above)

Where we probably need to be:

- Make the default failure mode (fast or slow) configurable in kojid
- Allow the user to override the failure mode per build

I don't think there is much harm in allowing the user this latitude, but I suppose we could add another config option to control that as well.

If I'm right about where we want to get to, then the option naming in #432 probably needs adjusting. I think passing a single fail_mode option would be simpler than passing either fail_fast or fail_slow.

mbooth commented 6 years ago

> If I'm right about where we want to get to,

I think so -- it certainly satisfies my original request, thanks for working on it.

> then the option naming in #432 probably needs adjusting. I think passing a single fail_mode option would be simpler than passing either fail_fast or fail_slow.

I agree there's probably not much point in having two options. Maybe --fail_mode=[fast|slow] is better than my original suggestion of --fail_fast=[true|false] because it allows for the addition of future new failure modes... --fail_mode=spectacular maybe? :-)
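As a rough illustration of the single-option idea, here is a sketch using Python's standard `argparse`. It is not the koji CLI: the `build-sketch` program name, the `build_parser` and `effective_fail_mode` functions, and the `allow_override` parameter (standing in for the extra config option floated earlier in the thread) are all assumptions made for this example.

```python
import argparse


def build_parser():
    """Parser for a hypothetical build command with one fail-mode option."""
    parser = argparse.ArgumentParser(prog="build-sketch")
    parser.add_argument(
        "--fail_mode",
        choices=["fast", "slow"],  # new modes could be added here later
        default=None,              # None means "use the server default"
        help="override the server's default failure mode for this build",
    )
    return parser


def effective_fail_mode(server_default, cli_value, allow_override=True):
    """Combine the server-side default with an optional per-build override."""
    if cli_value is not None and allow_override:
        return cli_value
    return server_default


args = build_parser().parse_args(["--fail_mode=slow"])
print(effective_fail_mode("fast", args.fail_mode))
```

Keeping the option's default as `None` lets the caller tell "no preference given" apart from an explicit choice, so the server default only applies when the user stays silent.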

Edited 6 years ago by mbooth

mikem commented 6 years ago

I'm less concerned about the command line option than the API itself.

mikem commented 6 years ago

Commit 8bb44e7 fixes this issue

tkopecek commented 6 years ago

Commit f22dcaf fixes this issue

Metadata

- Assignee: None
- Tags: None
- Blocking: None
- Depending on: None
- Priority: None
- Milestone: 1.14
- Size: None
