#318 Outline possible conventions for storing tasks in a dist-git repo
Closed: Fixed. Opened 8 years ago by tflink.

With the namespaced repos that packagedb now supports, it seems most likely that we'll be using those instead of trying to come up with some other solution.

This still leaves us with questions around what conventions we want to use. Come up with conventions we could use for these new task repos and how most things could be overridden. Examples that were discussed after DevConf 2016 include:
* have tasks in dirs off the main repo - one task per dir
* assume that tasks are to be run @ koji build completion unless otherwise specified
* have some sort of .taskrun file (needs better name) which can override defaults.

Present a proposal to the group for discussion.


After thinking about this and talking to @tflink and @kparal, I think that prior to deciding the conventions, we first need to investigate how we access task repos. The problem is that we need to access a task repo twice:
1. when taskotron-trigger gets a fedmsg to pull information on which task we actually will run
2. on a minion when libtaskotron is about to run the task

In order not to be slowed down with git cloning twice, three options come to mind:
1. Instead of the git pull in a trigger, just download one file (git archive can do that) from the task repo that would contain info on what should be scheduled
2. Ask the infra folks if we can somehow "mount" the task repos, or have them hosted close to us, so it would be fast for us to look at a repo in the trigger
3. Create a "super complex caching" system to cache scheduling information. We can emit a fedmsg when task maintainers push to a task repo; the caching system would listen for those fedmsgs and cache repos on the trigger machine. But that could create some fun race conditions, like a koji build being ready before the caching system has cached the check, so the check would not be run on that build. I would rather avoid implementing something like this if we can do something simpler, at least for starters.
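Option 1 could look roughly like the sketch below (repo URL, branch, and file name are hypothetical). Note that `git archive --remote` only works if the server allows `git upload-archive`, which would need to be confirmed with infra:

```python
import subprocess

def archive_file_cmd(repo_url, branch, path):
    """Build a `git archive` command that fetches a single file from a
    remote repo without cloning it. Works only if the server permits
    `git upload-archive` (dist-git may or may not)."""
    return ["git", "archive", "--remote=" + repo_url, branch, path]

def fetch_single_file(repo_url, branch, path):
    """Run git archive and unpack the single file from the tar stream
    into the current directory."""
    tar = subprocess.run(archive_file_cmd(repo_url, branch, path),
                         check=True, capture_output=True).stdout
    subprocess.run(["tar", "-x", path], input=tar, check=True)
```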

Option 1 is easy to implement and wouldn't slow things down that much, but check maintainers would need to maintain yet another file.
With option 2 we could have scheduling information directly in task formulas, but according to @tflink it's not possible from the infra side.
Option 3 is error-prone and scares me. :)

Better ideas? Thoughts?

It'd be at least twice, no? Actually, N+1: once for the trigger and then once for each of the N tasks contained in the repo?

I can think of some ways to keep 3) a bit less risky and scary but I agree that we should see if we can get something less complicated to work.

The first thing I'm wondering is if we're looking at premature optimization here - how many problems do we really think we'd cause if we started doing shallow clones of the task repo every time it was needed? I have a hard time believing that would cause problems anytime in the near future and if/when it does start being an issue, we can start looking at a more complicated solution.

! In #704#9696, @tflink wrote:
> It'd be at least twice, no? Actually, N+1: once for the trigger and then once for each of the N tasks contained in the repo?

True. Thanks for clarifying.

> I can think of some ways to keep 3) a bit less risky and scary but I agree that we should see if we can get something less complicated to work.
>
> The first thing I'm wondering is if we're looking at premature optimization here - how many problems do we really think we'd cause if we started doing shallow clones of the task repo every time it was needed? I have a hard time believing that would cause problems anytime in the near future and if/when it does start being an issue, we can start looking at a more complicated solution.

Probably. I just wanted to see if anyone has better ideas right away. I am not against trying a git clone each time if you feel that wouldn't cause many problems.

OK, let's try to do a shallow clone of the pkg dir from dist-git even for taskotron-trigger needs, each time it receives a new koji_build event, and see how fast or slow it is.

Tim, if you talked to Fedora Infra, can you give us some details why we can't have the whole qa dist-git mounted read-only over nfs? Are we not in the same datacenter?

We can't have the whole dist-git volume mounted over nfs because it lives on the local disk of the box serving the repos, not on a chunk of remote storage. My understanding is that it isn't impossible but it's a lot of work and not something that's going to happen soon.

I just chatted with infra now that things are less on fire today and they don't have any objections to us starting off with N+1 shallow clones of the task/check repo.

Unless we can come up with any better ideas, I think we should keep planning around the assumption that we'll be doing shallow clones of the repos on clients/triggers when needed. It'd be awesome to have an early draft of this ready for the qadevel meeting on Monday.

So here's how I think it could work. Let's have this formula for a reference:

```yaml
name: check_firefox_sanity
desc: Checks sanity of firefox
maintainer: jdoe

scheduling:
    run: True # default, can be omitted

input:
    args:
        - koji_build

# actions
```
For the following, assume we have consumers in taskotron-trigger for each supported input_type; currently we have koji_build and koji_tag (and consume which we don't use).

Now, example:

  1. taskotron-trigger gets a fedmsg that firefox was just built in koji (koji_build item_type)
  2. taskotron-trigger does something like `fedpkg clone tasks/firefox && fedpkg switch $branch`, where `$branch` is determined within taskotron-trigger's koji_build consumer from the fedmsg
  3. taskotron-trigger sees e.g. 3 directories in the repo
    * check_firefox_sanity
    * check_firefox_something_on_koji_tag
    * check_firefox_something_else_on_koji_tag
    For each directory, read $directory_name/$directory_name.yml (require the yaml file to be named the same as the task directory?) and look into the scheduling section. If there's run: False, skip the task (if the scheduling section is omitted, run: True is the default; we might omit the section if there's no benefit). Otherwise look into the input section, and if args contains koji_build, a job (item: firefox-x.x, item_type: koji_build, check: $directory_name, branch: $branch) is added to the list of tasks to be scheduled (currently in addition to rpmlint, which is our koji_build check).
  4. buildbot then does `fedpkg clone tasks/firefox && fedpkg switch $branch` and runs libtaskotron on a buildslave with arguments received from taskotron-trigger. We'll need to have some namespacing for tasks to differentiate between our tasks (rpmlint, depcheck, upgradepath) and dist-git ones so we know which repo to clone.

From task maintainer point of view, it's just pushing tests into the repo, each in its own directory. No scheduling info is required by default. Everything is retrieved from the existing input section. The scheduling section might be added for some use cases; like if the maintainer doesn't want to run the check but wants to keep it in the repo for future use.
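The trigger-side decision in step 3 could be sketched roughly like this (a sketch, not taskotron code; it assumes PyYAML and a static formula file name, here "runtask.yml", which is one of the naming options discussed):

```python
import os
import yaml  # PyYAML, assumed available

def tasks_to_schedule(repo_dir, item_type):
    """Walk a cloned task repo and return the task directories that
    should be scheduled for the given trigger event type
    (e.g. "koji_build"), following the proposal above."""
    scheduled = []
    for name in sorted(os.listdir(repo_dir)):
        formula = os.path.join(repo_dir, name, "runtask.yml")
        if not os.path.isfile(formula):
            continue
        with open(formula) as f:
            data = yaml.safe_load(f) or {}
        # the scheduling section is optional; run: True is the default
        sched = data.get("scheduling") or {}
        if not sched.get("run", True):
            continue
        # schedule only if the formula's input args match the event type
        inp = data.get("input") or {}
        if item_type in (inp.get("args") or []):
            scheduled.append(name)
    return scheduled
```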

Does that make sense? Any additions to the scheduling section? Other thoughts? Let's shoot holes into the proposal.

Overall, I like it and can't think of too many things that we might want to change.

I like the idea of using the input section to determine when the task should run. I suspect we'll want the ability to override it at some point but let's cross that bridge if/when we get there - I can't think of anything it'd need for now.

The other input_type that I've been thinking about is dist-git commit. It might be worth asking around before committing to it but enabling checks when a pkg git repo has been changed seems like something that would be desired.

> For each directory read $directory_name/$directory_name.yml (require yaml file to be named same as task directory?)

For a naming convention, I'd say either require the yaml file to be named the same as the task directory or have a standard name like runtask.yaml. I think that the static name has less potential for confusion.

> taskotron-trigger does something like fedpkg clone tasks/firefox && fedpkg switch $branch, where $branch is determined within taskotron-trigger's koji_build consumer from the fedmsg

I wonder if we'd be better off doing the raw git operations and skipping fedpkg. That way we could do shallow clones (I think that fedpkg does full clones by default) and not worry about being directly dependent on yet another package. Is there an advantage to using fedpkg that I'm not thinking about?

It'd be interesting to do some tests and figure out how long it takes to do a shallow clone vs. a full clone. I suspect that it would be enough of a speedup to justify doing, if only to save on data transfer from the git host but I could easily be wrong.
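Such a comparison could be scripted roughly as follows (a sketch; real numbers depend on the git host and the repo's history depth):

```python
import subprocess
import time

def timed_clone(repo_url, dest, shallow=False):
    """Clone a repo and return elapsed seconds. With shallow=True only
    the branch tip is fetched (--depth 1), which should cut transfer
    time for repos with long history."""
    cmd = ["git", "clone", "--quiet"]
    if shallow:
        cmd += ["--depth", "1"]
    cmd += [repo_url, dest]
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start
```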

> Currently we have koji_build and koji_tag (and consume which we don't use).

I assume that "consume" was a typo for compose?

! In #704#9736, @tflink wrote:
> Overall, I like it and can't think of too many things that we might want to change.
>
> I like the idea of using the input section to determine when the task should run. I suspect we'll want the ability to override it at some point but let's cross that bridge if/when we get there - I can't think of anything it'd need for now.
>
> The other input_type that I've been thinking about is dist-git commit. It might be worth asking around before committing to it but enabling checks when a pkg git repo has been changed seems like something that would be desired.

Yeah, I don't think adding more consumers is much work.

> > For each directory read $directory_name/$directory_name.yml (require yaml file to be named same as task directory?)
>
> For a naming convention, I'd say either require the yaml file to be named the same as the task directory or have a standard name like runtask.yaml. I think that the static name has less potential for confusion.

The static name seems better to me.

> > taskotron-trigger does something like fedpkg clone tasks/firefox && fedpkg switch $branch, where $branch is determined within taskotron-trigger's koji_build consumer from the fedmsg
>
> I wonder if we'd be better off doing the raw git operations and skipping fedpkg. That way we could do shallow clones (I think that fedpkg does full clones by default) and not worry about being directly dependent on yet another package. Is there an advantage to using fedpkg that I'm not thinking about?

I was using fedpkg just so it's clear at first glance that we're cloning a dist-git check; I was not implying we should use it. So yeah, use raw git.

> It'd be interesting to do some tests and figure out how long it takes to do a shallow clone vs. a full clone. I suspect that it would be enough of a speedup to justify doing, if only to save on data transfer from the git host but I could easily be wrong.

> > Currently we have koji_build and koji_tag (and consume which we don't use).
>
> I assume that "consume" was a typo for compose?

Yep, it's a typo :)

! In #704#9735, @mkrizek wrote:
> scheduling:
>     run: True # default, can be omitted

This new section looks good to me.

> For each directory read $directory_name/$directory_name.yml (require yaml file to be named same as task directory?) and look into scheduling section,

I don't understand one thing. I thought the plan was to be able to run the checks completely without any formula at all by default? And only if people want to tweak something (defaults are not good enough), they would need to create it? At least that was my impression from our discussions around DevConf. Of course we don't need to do it now, I'm just asking.

> We'll need to have some namespacing for tasks to differentiate between our tasks (rpmlint, depcheck, upgradepath) and dist-git ones so we know what repo to clone.

If we decide to use raw git URLs directly, I think we avoid this problem, right? We simply pass buildbot a full git url (and some other arguments specifying the branch, etc).
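Passing a full git URL could make the data handed to buildbot a simple, self-contained mapping like this (field names are hypothetical, not taskotron's actual wire format):

```python
def task_spec(git_url, branch, task_dir, item, item_type):
    """Assemble the job data the trigger could hand to buildbot.
    Carrying a full clone URL avoids assuming a shared repo base URL,
    so tasks hosted outside dist-git would work too."""
    return {
        "git_url": git_url,      # full clone URL, no shared baseurl assumed
        "branch": branch,        # e.g. "f24", derived from the fedmsg
        "task": task_dir,        # directory name inside the task repo
        "item": item,            # e.g. "firefox-45.0-1.fc24"
        "item_type": item_type,  # e.g. "koji_build"
    }
```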

! In #704#9737, @mkrizek wrote:
> > For a naming convention, I'd say either require the yaml file to be named the same as the task directory or have a standard name like runtask.yaml. I think that the static name has less potential for confusion.
>
> The static name seems better to me.

I agree. As a minor nitpick, we could decide whether we want to use .yml or .yaml file extension and then use the same approach everywhere, because currently we're pretty random at that throughout our project.

! In #704#9875, @kparal wrote:
> > For each directory read $directory_name/$directory_name.yml (require yaml file to be named same as task directory?) and look into scheduling section,
>
> I don't understand one thing. I thought the plan was to be able to run the checks completely without any formula at all by default? And only if people want to tweak something (defaults are not good enough), they would need to create it? At least that was my impression from our discussions around DevConf. Of course we don't need to do it now, I'm just asking.

I don't understand - how would running a task without a formula work? Am I just forgetting a previous conversation here?

> > We'll need to have some namespacing for tasks to differentiate between our tasks (rpmlint, depcheck, upgradepath) and dist-git ones so we know what repo to clone.
>
> If we decide to use raw git URLs directly, I think we avoid this problem, right? We simply pass buildbot a full git url (and some other arguments specifying the branch, etc).

Yeah, that probably makes more sense. The assumption we currently make that all repos have the same baseurl isn't really practical going forward.

> > The static name seems better to me.
>
> I agree. As a minor nitpick, we could decide whether we want to use .yml or .yaml file extension and then use the same approach everywhere, because currently we're pretty random at that throughout our project.

Agreed. I don't really care which one we use but consistency is good :)

I broke this proposal into tickets (#749, #750, #751) under #564 and am closing this, we'll deal with details in those sub-tickets. If there are objections, please reopen, thanks.
