#340 koji doesn't take plugins into account when median capacity is calculated
Opened 3 years ago by vrutkovs. Modified a year ago

Koji assumes that every host in a channel can run a given task when getNextTask calculates the median capacity. This is not necessarily true.

The OSBS team hit this when a host (containerbuild) that has a plugin for the buildContainer task was moved to a testing channel where other hosts were added. The containerbuild host had its capacity set to 24, the buildContainer task weight is 2.0, and max-jobs on the builder was reset to 13.

Nevertheless, Koji scheduled only 10 of the 13 builds in the queue, because the median capacity algorithm expected the other hosts to take the remaining three jobs. They couldn't, as none of them had the necessary plugin installed.

getNextTask should be able to filter out hosts that can't run the task when the median capacity is calculated.
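To illustrate the stall described in this report, here is a minimal sketch of a median check. It is not Koji's actual code: the function name `below_median`, the choice of median element, and the definition of availability as capacity minus current load are assumptions. The capacity of 24, the task weight of 2.0 and the ten running builds come from the report; the two other hosts and their values are hypothetical.

```python
def below_median(our_avail, bin_avail):
    """Return True if our available capacity is below the bin's median.

    bin_avail lists the available capacity of every host in the same
    channel/arch bin, including this host. Nothing here knows whether
    the other hosts have the plugin needed to run the task.
    """
    ranked = sorted(bin_avail, reverse=True)
    median = ranked[(len(ranked) - 1) // 2]
    return our_avail < median


# Figures from the report: capacity 24, ten running builds of weight 2.0.
our_avail = 24 - 10 * 2.0

# Hypothetical: two idle hosts in the same channel, neither of which
# has the buildContainer plugin installed.
others = [8.0, 8.0]

# The plugin-equipped host is below the median, so it declines, even
# though it is the only host that could actually run the task.
declines = below_median(our_avail, [our_avail] + others)
```

With these assumed numbers the host keeps declining the remaining builds while it still has room for two more tasks of weight 2.0.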

Without a significant refactor, there is no reasonable way for one host to know that another host cannot take a particular task for reasons other than arch or channel.

What we should do in the short term is implement a short delay when the host is in the bottom half, instead of constantly rejecting it. This would still give hosts in the upper half a chance, but keep situations like this from stalling out forever.
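A rough sketch of the short-delay idea follows. It is a hypothetical illustration, not Koji's implementation: the class name `SkipDelay`, the 180-second default and the injected clock are all assumptions made for the example.

```python
import time


class SkipDelay:
    """Decline a task for a limited time while this host is below the median."""

    def __init__(self, delay=180.0, clock=time.monotonic):
        self.delay = delay        # seconds to keep declining (assumed value)
        self.clock = clock        # injectable so the logic can be tested
        self.first_skipped = {}   # task id -> time we first declined it

    def should_skip(self, task_id, our_avail, bin_avail):
        ranked = sorted(bin_avail, reverse=True)
        median = ranked[(len(ranked) - 1) // 2]
        if our_avail >= median:
            # Upper half: take the task right away.
            self.first_skipped.pop(task_id, None)
            return False
        now = self.clock()
        first = self.first_skipped.setdefault(task_id, now)
        if now - first < self.delay:
            # Still inside the delay: give the upper half a chance.
            return True
        # Delay expired and nobody took the task, so stop declining.
        del self.first_skipped[task_id]
        return False
```

A host in the bottom half declines a task only until the delay runs out, so a task that no other host can run is eventually picked up rather than being rejected forever.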

Actually, it looks like I have some partial work on this from a few months back.

Metadata Update from @mikem:
- Issue assigned to mikem

3 years ago

Here is the partial work that was mentioned: https://github.com/mikem23/koji-playground/commits/bin-lower-half

Still needs unit tests

Metadata Update from @mikem:
- Issue set to the milestone: 1.17

2 years ago

I've cleaned up the work above and submitted it as PR #1176.

Commit d424fb0 relates to this ticket

The short-term fix from #1176 will be in 1.17. A proper fix will likely have to wait for 1.18 or later.

Metadata Update from @mikem:
- Issue set to the milestone: 1.18 (was: 1.17)

a year ago

Metadata Update from @dgregor:
- Custom field Size adjusted to None
- Issue set to the milestone: None (was: 1.18)

a year ago

Removing from 1.18 as the long-term work will get rolled up into some upcoming schedule refactoring.
