#7498 Support on-the-fly tarball generation in Koji
Opened 2 years ago by fweimer. Modified 4 days ago

I've just filed this fedpkg-minimal feature request:

It would be great if we could get this to work on Fedora Koji, with a suitable whitelist of supported repositories.


Nothing would stop us from doing this, except for the fact it's a bad idea. At this point, I don't care if you have a "whitelist" because that list is completely arbitrary.

The problem with it is that stuff can shift and change at any time, meaning that the source content is not stable and there's no way to guarantee anything that remotely looks like reproducibility. Especially if we're generating from Git repositories, since Git repositories explicitly allow for rewriting history, such that all refs are unstable.

I'd rather hold us to a higher standard and be able to reproduce any Fedora build using the same actions in Koji (distgit-to-srpm->build->publish).

Sorry, I don't understand the problem with source stability. Koji will still store the SRPM, so the build will still be fully reproducible from that.

I don't see how we gain reproducibility by forcing developers to do this on their workstations.

By the same token, I don't see why we should grant some repositories the "privilege" of being checked out directly. What qualifications would we use to identify which ones are "good" vs which ones are "bad"?

If we decided to go down this path, I'd probably say that we'd need to fork every source repository into Fedora's VCS, to maintain total source stability. That also changes the dynamic of how we'd build packages entirely.

The problem with the SRPM alone is that it doesn't count as upstream content, since it's no longer just a pure container of "their code" and "our code". It's a generated intermediate artifact.

One thing we do have is that in most cases, we're able to say that we're using what upstream gives us (in the form of their tarballs) to build our releases. That changes when we use the VCS directly. We can do all kinds of weird transformations and no one would be able to easily figure it out. It also makes it much easier to do monkey-patching without ever contributing back to the upstream projects (a la Debian, et al).

And generally, developers aren't making tarballs out of goop. Most things are forge-hosted, so tarballs come from the release URLs. Things that aren't forge-hosted publish release tarballs on their own. And things without tarballs generally aren't supposed to be used as releases, which is a clear delineation.

So I guess the better question is, what problem are you trying to solve? What thing is so difficult and annoying that you want Koji to do it instead?

We already have to do these weird transformations today because tarball + patches is so burdensome. I would even suggest that it discourages upstream work: First, we spend effort on shoehorning Git repositories into Fedora, time which could be used more productively. Second, if we have a downstream patch, upstreaming it will introduce a conflict, which is generally difficult to resolve with the basic tarball+patches tooling that RPM provides.

Eventually, I want to use Git-based tooling (whether it's Gerrit or Pagure, we'll see) and remove a random developer workstation from the release process, where a crucial step is performed (tarball generation), which is difficult to review at the Git level. In addition, for Fedora at least, I want to use actual merges from upstream to bring in upstream changes, rather than serializing everything into a list of patches.

I also want to use the same process in Fedora glibc and downstream, so that less developer training is needed. Downstream, we have hundreds of backported patches. In Fedora 28, we currently have 53 upstream patches since the release, plus the Fedora-specific patches. For Fedora 27, we will soon have 146 upstream patches.

What about the projects that don't use Git? Are they going to be second or third-tier citizens in Fedora?

Right now, all projects have that "pain" equally, Git or otherwise.

@ngompa, I don't think additional Git support takes away features for other SCMs. The fact that developers do not upload SRPMs themselves already expresses a strong preference for Git and src.fedoraproject.org as the hosting platform (something that Debian does not do).

The reason developers don't upload source RPMs directly is because we're not fools. We want 1:1 mapping from source content (spec+patches+cache tarball checkout) to built artifacts (SRPM, binary RPMs).

I've worked in distributions where that's not the case, and it's madness because you can't figure out what the heck happened before and after unless you are snapshotting a mirror of the SRPMs.

It doesn't express preference toward Git or anything else. Heck, we only switched to Git from CVS after we wrapped all the ugly Git CLI stuff into fedpkg.

Creating the tarball directly from git actually gives you better auditability, because it removes the human from the process. You can look at the tarball and know (based on naming) exactly what git commit it came from, and trace the history from there. With a release tarball, you don't know what the person did to generate it, or exactly what commit it was based on. You only have a release number, which may or may not correlate to a git tag and/or branch.
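A minimal sketch of the naming idea above, assuming the tarball would be produced with `git archive`. The function name `archive_command` and the choice to embed the full commit ID in the file name are illustrative assumptions, not something Koji or fedpkg does today. The function only builds the command line; it does not run it.

```python
def archive_command(name: str, commit: str) -> list[str]:
    """Build a `git archive` command whose output name embeds the commit ID."""
    # Require a full SHA-1 so the tarball name maps to exactly one commit.
    if len(commit) != 40 or any(c not in "0123456789abcdef" for c in commit):
        raise ValueError("expected a full 40-character lowercase SHA-1 commit ID")
    stem = f"{name}-{commit}"
    return [
        "git", "archive",
        "--format=tar.gz",          # gzip-compressed tar
        f"--prefix={stem}/",        # top-level directory inside the archive
        f"--output={stem}.tar.gz",  # file name carries the commit ID
        commit,
    ]


if __name__ == "__main__":
    print(" ".join(archive_command("example", "0123456789abcdef0123456789abcdef01234567")))
```

Rejecting branch and tag names here is a design choice: only a full commit ID makes the name traceable to one specific state of the repository.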

If maintainers of other SCMs want to opt-in to this approach, we could investigate adding support for those SCMs as well. Most SCMs have the ability to generate tarballs from a commit.

One problem/challenge with getting Koji to do this would be that the builders (which also do the SCMtoSRPM step) do not have access to the internet, for reproducibility and auditability reasons.
So you'd also need a fix for that.

Regardless, I think that this should first be filed and fixed in upstream koji (https://pagure.io/koji/) before releng has anything to do about this.

@puiterwijk The SRPM builders have access to the lookaside cache, so there is some form of Internet access. Whether you use plain HTTPS via curl or Git-over-HTTPS does not seem to make a huge difference here.

I'm not sure I'm buying @ngompa's argument that this shouldn't be done for some things unless we revamp everything. That seems like a very high bar for no really strong reason. Likewise, if we can do something useful for Git, we shouldn't have to support other SCMs. This is "perfect as enemy of good" taken all the way to absurdity.

@mattdm So then, what do we decide as "good" repos vs "bad" ones? Which ones are worth trusting such that we're okay with direct git checkouts for this? How about submodules? LFS? ...?

@fweimer they have restricted access to certain whitelisted services/targets that are regarded as "controlled by Fedora Infra", not full internet access, based on target IP addresses.
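As a rough illustration of IP-based restriction like this, here is a sketch of an allowlist check. The networks listed are documentation-only ranges chosen for the example, and `builder_may_connect` is a hypothetical helper; the real builder firewall rules are not shown in this ticket.

```python
import ipaddress

# Hypothetical allowlist. These are reserved documentation ranges (RFC 5737),
# standing in for "services controlled by Fedora Infra".
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def builder_may_connect(target_ip: str) -> bool:
    """Return True if the target address falls inside an allowed network."""
    addr = ipaddress.ip_address(target_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("192.0.2.10", "198.51.100.20", "203.0.113.5"):
        print(ip, builder_may_connect(ip))
```

Under a scheme like this, an arbitrary upstream Git host would be rejected unless its addresses were added to the list, which is why the network question matters for option (a) style proposals.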

I'm unclear on what's being proposed here. I know it's using git repos instead of source from lookaside, but is it:

a) any arbitrary git repo on the internet? Or a list of such repos/sites that are acceptable?
or
b) that there's another collection of git repos created in Fedora Infrastructure for this purpose?

a is a non starter, IMHO. b I guess would depend on what advantages it provides and how it's maintained.

I'm not sure a is really significantly worse than what we have now — someone can replace a versioned tarball without changing the version, too, right? (And I've seen it happen.) But that said, I think there are other advantages to b, too: it could solve the network access issue, and there's a lot of great stuff we could do with our own local repos.

I'm not sure a is really significantly worse than what we have now — someone can replace a versioned tarball without changing the version, too, right? (And I've seen it happen.)

Sure, but:

  • We would be depending on remote sites to be up and reachable to do builds.
  • If we built against git hash deadbeef and want to reproduce it or just see what exactly was in it, remote could have re-written history so it's now different. If we made or used a tar.gz that we control we could diff the two to see what was changed.
  • If history gets re-written all our hashes against the remote repo are now wrong and we have no idea what was built against what and can't reproduce them.
  • Some projects go away, along with their git repos, so now we have no idea what we built and can't build it again.
    I'm sure I could go on. ;)
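The "diff the two tarballs" point can be sketched as follows, assuming both archives are available locally as bytes. The helper names (`member_digests`, `diff_tarballs`, `make_tarball`) are made up for this example, and the comparison looks only at regular-file contents, not at permissions or timestamps.

```python
import hashlib
import io
import tarfile


def member_digests(tar_bytes: bytes) -> dict[str, str]:
    """Map each regular file in a tarball to the SHA-256 of its contents."""
    digests = {}
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tar:
        for member in tar:
            if member.isfile():
                data = tar.extractfile(member).read()
                digests[member.name] = hashlib.sha256(data).hexdigest()
    return digests


def diff_tarballs(old: bytes, new: bytes) -> dict[str, list[str]]:
    """Report which files were added, removed, or changed between two tarballs."""
    a, b = member_digests(old), member_digests(new)
    return {
        "added": sorted(b.keys() - a.keys()),
        "removed": sorted(a.keys() - b.keys()),
        "changed": sorted(n for n in a.keys() & b.keys() if a[n] != b[n]),
    }


def make_tarball(files: dict[str, bytes]) -> bytes:
    """Build a small gzip-compressed tarball in memory (used for the demo)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


if __name__ == "__main__":
    old = make_tarball({"a.c": b"int a;", "b.c": b"int b;"})
    new = make_tarball({"a.c": b"int a;", "b.c": b"long b;", "c.c": b"int c;"})
    print(diff_tarballs(old, new))
```

This only works if both archives still exist, which is the argument for keeping a controlled copy rather than relying on the remote.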

But that said, I think there are other advantages to b, too: it could solve the network access issue, and there's a lot of great stuff we could do with our own local repos.

Possibly yeah, that's why I said depends on advantages.

There was a long thread about this at least twice that I recall. Once when we were moving from cvs to git and some folks wanted exploded source too, and another time a number of years later where people wanted that again... but I don't think it ever got to a high enough yummy to trouble ratio to do.

a is not at all an option for the reasons @kevin mentioned.
b I don't know what advantages it provides. But, if they are worth it then we can look into them.

@kevin Every build contains an srpm which contains a tarball with all the sources, as they were checked out at the time of the build. If you want to reproduce something, you can rebuild that srpm or extract the tarball from it and inspect it. Why is this any worse than what we have today, with a tarball in the lookaside cache, with very little provenance? The download links for release tarballs can (and do) go away as well.

But in the common case, the dist-git repo will point to the upstream repo and commit, and a user can go to that repo and view the full history of the source that was built into the srpm. Isn't that a better experience than what we have today, when that chain of custody always ends at the tarball in our lookaside cache?

But in the common case, the dist-git repo will point to the upstream repo and commit, and a user can go to that repo and view the full history of the source that was built into the srpm. Isn't that a better experience than what we have today, when that chain of custody always ends at the tarball in our lookaside cache?

No, because that repo isn't guaranteed to be around. As I said earlier, I think that if we wanted to allow this, we'd need to go full-on repo mirroring to make it work. And that also adds some complexity because repositories are different types, different features in use, and so on. In the space of purely Git repos, Pagure does not yet have all the necessary functionality to mirror every single Git repo, principally because it lacks support for Git LFS and Git-Annex.

That's not even getting into the mess related to supporting Mercurial, Bazaar, SVN, and CVS repositories. We'd need some kind of process to handle those in a sensible manner too.

No, because that repo isn't guaranteed to be around.

@ngompa I sincerely don't get this. There's no guarantee that upstream tarballs will be around at their original location, either.

I'd get the concern if the proposal was to go straight from repo to binary without keeping a snapshot... but that's not the suggestion.

But, let's say we're doing repo-mirroring. Just because we can't necessarily support all possible external version control (either non-git, or git with features we don't have) doesn't mean we couldn't support some.

No, because that repo isn't guaranteed to be around.

@ngompa I sincerely don't get this. There's no guarantee that upstream tarballs will be around at their original location, either.

That's why we download them and upload them into the lookaside cache. With no local copy of the sources, we're at the mercy of remote availability.

I'd get the concern if the proposal was to straight from repo to binary without keeping a snapshot... but that's not the suggestion.

That is the suggestion, though. It'd just be a pointer to the upstream repo and commit.

Well, the bug says "support for creating tarballs from Git repositories". That tarball then gets used in building the RPM, including the source RPM.

As I am understanding your concern, the issue is with a sequence like

  1. Do scratch build
  2. External repo changes
  3. Attempt real build

where #3 could be very different from #1, in a surprising way.

What if, instead of just generating the tarball on the fly, the feature would instead add to the lookaside cache, and then use that for the build?

@ngompa The srpm, and the tarball inside it, are the local copy of the source. You can always extract the tarball from the srpm and inspect it, using the current approach and the new proposal.

@mattdm If the upstream repo is rewritten, then the commit ID referenced in the source-repos file will no longer exist, and step 3 (the real build) will fail, which is exactly what you want.

@mattdm If the upstream repo is rewritten, then the commit ID referenced in the source-repos file will no longer exist, and step 3 (the real build) will fail, which is exactly what you want.

That is still worse than what we have now.

I'm not sure a is really significantly worse than what we have now — someone can replace a versioned tarball without changing the version, too, right? (And I've seen it happen.)

Sure, but we'd still have every copy, because lookaside stores them by their checksums. You can't just "blow away" a tarball.

To clarify, the tarballs are stored with the checksum of the tarball as the filename. The sources file is the only way we know what they originally were.

To clarify, the tarballs are stored with the checksum of the tarball as the filename. The sources file is the only way we know what they originally were.

Not really. The lookaside cache is storing the tarballs in paths like /<namespace>/<package_name>/<filename>/sha512/<hash>/<filename>.
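To make that layout concrete, here is a small sketch that derives such a path from a file's contents. `lookaside_path` is a hypothetical helper written for this example, and the namespace, package, and file names are sample values; only the path pattern itself comes from the comment above.

```python
import hashlib


def lookaside_path(namespace: str, package: str, filename: str, data: bytes) -> str:
    """Build /<namespace>/<package_name>/<filename>/sha512/<hash>/<filename>."""
    digest = hashlib.sha512(data).hexdigest()
    return f"/{namespace}/{package}/{filename}/sha512/{digest}/{filename}"


if __name__ == "__main__":
    # Same file name, different contents: the two uploads land on different paths.
    print(lookaside_path("rpms", "example", "example-1.0.tar.gz", b"original"))
    print(lookaside_path("rpms", "example", "example-1.0.tar.gz", b"replaced"))
```

Because the hash is part of the path, a re-released tarball with the same name does not overwrite the earlier one, and the original file name stays recoverable from the path.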

Metadata Update from @mohanboddu:
- Issue tagged with: meeting

2 years ago

@ngompa What's the problem with treating the tarball/srpm as the local cache? Why is that worse than what we have now?

Pagure does not yet have all the necessary functionality to mirror every single Git repo, principally because it lacks support for Git LFS and Git-Annex.

Er...does anyone use Git-Annex to store source code? As far as LFS, sure; though are you aware of any software in the Fedora package/container set that uses that today?

That's not even getting into the mess related to supporting Mercurial, Bazaar, SVN, and CVS repositories. We'd need some kind of process to handle those in a sensible manner too.

For gnome-continuous (which mirrors all upstream repositories, same as rdgo), those were so rare that it was easier to leave them as tarballs - and I just imported tarballs into git.

BTW, this issue is #2 on my list here: https://github.com/cgwalters/rpmdistro-gitoverlay/blob/master/doc/reworking-fedora-releng.md#create-a-production-git-mirror-to-augmentreplace-the-lookaside-cache

I'll also add that this system is totally opt-in, so for any upstream that can't support on-the-fly tarball generation (because they use git extensions or other SCMs), they can continue to use tarballs+lookaside cache as always.

Ping, any update here? @mohanboddu was this ever discussed at a meeting?

Unfortunately in Ceph we cannot use plain "git archive" because we have so many Git submodules in the tree that point at random other bundled projects. Ceph has a special git-archive-all.sh script that recursively archives everything into the official release tarball. But it's a mess.

At least it runs in Jenkins instead of a random workstation.

(On the subject of mirroring, if we tried to mirror all our submodules, that would mean rewriting https://github.com/ceph/ceph/blob/master/.gitmodules to point at different repo URLs, so the sha1 in ceph.git would change, and I'll need some tool to maintain that "use mirrors" commit during rebases over time.)
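A sketch of why that rewrite changes hashes, assuming a mirror that hosts each submodule under a single base URL. The `.gitmodules` content and both URLs below are invented for the example; `point_at_mirror` is a hypothetical helper. The blob hash calculation follows Git's object format (SHA-1 over a `blob <size>\0` header plus the contents).

```python
import hashlib
import re


def git_blob_sha1(data: bytes) -> str:
    """Compute the object ID Git assigns to a file's contents (SHA-1 repos)."""
    header = b"blob " + str(len(data)).encode() + b"\0"
    return hashlib.sha1(header + data).hexdigest()


def point_at_mirror(gitmodules: str, mirror_base: str) -> str:
    """Rewrite every `url = ...` line to <mirror_base>/<last path component>."""
    base = mirror_base.rstrip("/")

    def repl(match: re.Match) -> str:
        repo = match.group(2).rstrip("/").rsplit("/", 1)[-1]
        return f"{match.group(1)}{base}/{repo}"

    return re.sub(r"(?m)^([ \t]*url[ \t]*=[ \t]*)(\S+)[ \t]*$", repl, gitmodules)


if __name__ == "__main__":
    original = (
        '[submodule "src/example"]\n'
        "\tpath = src/example\n"
        "\turl = https://example.org/upstream/example.git\n"
    )
    mirrored = point_at_mirror(original, "https://mirror.example.net/")
    print(mirrored)
    # The file's blob ID changes, so the tree and commit IDs above it change too.
    print(git_blob_sha1(original.encode()) != git_blob_sha1(mirrored.encode()))
```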

It's not a good reason, but it's a reason why the https://download.ceph.com tarballs end up being more reproducible for me, and the "git archive" thing in rpkg can't apply to my situation yet :(

It would be cool to implement what Copr has with .copr/Makefile srpm, where you can run arbitrary commands to generate the SRPM. rpkg could use .rpkg/Makefile tar or something if you want to keep buildSRPMfromSCM largely as-is.

(On the subject of mirroring, if we tried to mirror all our submodules, that would mean rewriting https://github.com/ceph/ceph/blob/master/.gitmodules to point at different repo URLs, so the sha1 in ceph.git would change, and I'll need some tool to maintain that "use mirrors" commit during rebases over time.)

No; the way GNOME Continuous works, which is the same model rewritten into rpmdistro-gitoverlay - git is mirrored recursively, but only the build process uses that git repo. This indeed means there's not a convenient command to clone recursively from that repo as an outside user (AFAIK), but it wouldn't be too hard to write a script to do so I believe.

If you look at what e.g. Android does with the repo tool - it's basically a custom tool to clone many git repos recursively locally, except instead of submodules they use more custom logic. Clearly a fedpkg prep type tool could learn about the recursive submodules in the same way.

You can do all of this today in Fedora without specific Koji changes or specific client tooling. All you need is an upstream hosted on GitHub or GitLab. Or pretty much every modern source hosting service except pagure because of:
https://pagure.io/pagure/issue/861

The required macros are all merged in redhat-rpm-config; the last missing mile is documentation, which is awaiting merge here:
https://src.fedoraproject.org/rpms/redhat-rpm-config/pull-request/51

If you control the upstream you are packaging, it’s a huge productivity enabler:
1. do all the changes you want in the upstream repo branch or in a downstream fork
2. (optionally) tag the state you think will work
3. bump the commit id or tag or version in the spec (no other change needed)
4. spectool
5. build and test

(3/4/5 can be done on another vm/system with different security measures, they do not require write access to the upstream repo)
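Step 3 of the list above can be sketched as a one-line edit to the spec. This assumes the spec already carries a single `%global commit` line, as forge-macro specs pinned to a commit do; `bump_commit` is a hypothetical helper, not part of fedpkg or spectool, and the sample spec text in the demo is made up.

```python
import re

# Matches a line such as "%global commit 0123abc" and captures the prefix.
_COMMIT_LINE = re.compile(r"(?m)^(%global[ \t]+commit[ \t]+)\S+[ \t]*$")


def bump_commit(spec_text: str, new_commit: str) -> str:
    """Replace the value on the spec's single `%global commit` line."""
    updated, count = _COMMIT_LINE.subn(lambda m: m.group(1) + new_commit, spec_text)
    if count != 1:
        raise ValueError(f"expected exactly one '%global commit' line, found {count}")
    return updated


if __name__ == "__main__":
    spec = "Name: demo\n%global commit 1111111\n%forgemeta\n"
    print(bump_commit(spec, "2222222"))
```

Nothing else in the spec is touched, which matches the "no other change needed" claim in step 3.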

If the result does not work as expected, commit fixes/move the tag and repeat.

When it all works as expected tag as official upstream release or PR your fork.

This is pretty much the same workflow as creating fake upstream archives to test changes, except it’s all tracked cleanly in SCM and is miles ahead usability-wise.

Unfortunately in Ceph we cannot use plain "git archive" because we have so many Git submodules in the tree that point at random other bundled projects.

Bundling is bad, it breaks rpm version/upgrade management, and can have unexpected legal consequences. Package those cleanly in separate specs and your problems will go away.

Unfortunately in Ceph we cannot use plain "git archive" because we have so many Git submodules in the tree that point at random other bundled projects.

Bundling is bad, it breaks rpm version/upgrade management, and can have unexpected legal consequences. Package those cleanly in separate specs and your problems will go away.

Ceph will never do that, though I'm curious if @ktdreyer knows all the reasons why they do it.

The last time I looked at it, it's because they don't want to contribute or handle their dependencies properly. They even forked a bunch of things to freeze them and mutate them rather than using them properly through APIs.

Speaking of which, Ceph already breaks our guidelines by not enumerating bundled dependencies, but that's a separate problem.

So far, I've seen no good reason for allowing arbitrary VCS access. And no one has presented a solution for handling upstreams properly when mirroring Git sources that have LFS, Annex, or submodule stuff. And this solution ignores all the non-Git VCSes, which still make up ~35% of upstreams today.

Metadata Update from @syeghiay:
- Issue assigned to mohanboddu

a year ago

We got all derailed here... let's try and focus again.

@fweimer can you walk us through how your workflow would look if we had this enabled? ie, how it would be better?

I think the reproducibility is minor but still there (in the case you build something, upstream dies or changes hashes or does something wacky) you can't just rebuild the src.rpm right? You would need to modify it to use the exact source that was used on that build? Or no?

The availability issue is also there for me: If remote is down, build fails right? I suppose it would be easy enough to switch back to an uploaded tar.gz in case remote is down and we urgently need to get something built?

You can get 99% of the convenience without the lack of auditability just by using the existing forge fedora macros.

You declare your forge url in the specfile. You declare the target tag, version or commit. You do a spectool -g followed by an rpmbuild -bs and you get a src.rpm that can be used in all our tools (mock, copr and koji), and uploaded to become the Fedora golden source.
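The two commands described above can be written down as a small plan. This is only a sketch: `srpm_steps` and `render` are names invented here, the functions print the commands rather than run them, and directory configuration (where spectool saves the download versus where rpmbuild looks for sources) is left out.

```python
import shlex


def srpm_steps(spec_path: str) -> list[list[str]]:
    """Return the commands that turn a forge-macro spec into a source RPM."""
    return [
        ["spectool", "-g", spec_path],   # fetch the sources the spec points at
        ["rpmbuild", "-bs", spec_path],  # build only the source package
    ]


def render(steps: list[list[str]]) -> str:
    """Format the steps as shell lines, quoting arguments where needed."""
    return "\n".join(shlex.join(step) for step in steps)


if __name__ == "__main__":
    print(render(srpm_steps("demo.spec")))
```

Keeping the steps as argument lists rather than one shell string avoids quoting problems if a spec path contains spaces.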

And then it does not matter if upstream reworks its scm or rehosts somewhere else or disappears, because the tarball has been generated at packaging time, and has been checked at packaging time, and it’s no better or worse than to rely on a traditional tarball that could be changed upstream after the Fedora download.

You can even test pre-releases with the spec pointing to commit ids on a feature branch in your own private pagure.io forked repo, then get the branch merged and tagged in the master repo once the tests are done, with the same spec, replacing only the forked repo url and tested commit id with the master repo url and official version tag once the whole thing is done.

You'll spend more time waiting on pagure.io than changing lines in the spec.

@fweimer can you walk us through how your workflow would look if we had this enabled? ie, how it would be better?

I make a commit to a Fedora repository with a cherry-picked upstream commit (using git cherry-pick, perhaps with some commit message adjustments for bug references, but no separate tools and no additional file changes). I push and run fedpkg build, and the infrastructure does the rest.

I think the reproducibility is minor but still there (in the case you build something, upstream dies or changes hashes or does something wacky) you can't just rebuild the src.rpm right? You would need to modify it to use the exact source that was used on that build? Or no?

Given that we keep the source RPM, I think rebuilding is not critical from a reproducibility perspective.

The availability issue is also there for me: If remote is down, build fails right?

Yes, but I would expect the repository to live on src.fedoraproject.org, so that wouldn't be a change from today.

I suppose it would be easy enough to switch back to a uploaded tar.gz in case remote is down and we urgently need to get something built?

I don't think this would be necessary if the Git repository is hosted by the right Fedora infrastructure.

Again, this is what the forge macros allow doing today

Take
https://src.fedoraproject.org/rpms/go-rpm-macros/blob/master/f/go-rpm-macros.spec

I could replace the

Version:  3.0.8

with a

%global commit HASH

before the

 %forgemeta

line and everything would still work (in fact that’s how the whole package was polished before inclusion in Fedora, because I couldn’t be bothered to tag a version for every commit I wanted to test).

All it would take is fedpkg build doing the spectool -g automatically instead of needing a separate spectool -g / rpmbuild -bs step

Metadata Update from @amoloney:
- Issue tagged with: backlog

a month ago

Metadata Update from @cverna:
- Assignee reset

4 days ago
