Issue #1309: bundling Provides needlessly bloat metadata - packaging-committee

I would agree with the statement in the issue title if you dropped "needlessly". Right now, this metadata is the only way to reliably query all of Fedora, for example, to check what packages are impacted by security vulnerabilities. This metadata is also used by Red Hat ProdSec.

Looking into all the source code of all packages in Fedora is not a suitable (or feasible) replacement for easily accessible metadata. As @sgallagh noted on the ticket that caused this discussion (here: https://pagure.io/fedora-rust/rust2rpm/issue/139), policies are not set in stone - we might be able to do something better, but for now, Provides: bundled(...) virtual provides for packages that bundle dependencies are what's required.

ngompa commented 6 months ago

I'm opposed to this. This makes it too easy to obscure, and frankly, the automation around generating these Provides for Rust, Go, etc. means that there's less excuse to be inaccurate about it.

Requiring people to download and dig out files means that it's just much more painful for distro-wide auditing.

sgallagh commented 6 months ago

@walters I think we might need to break down what you're asking for a little bit. You've proposed a solution without fully defining the problem you're trying to solve.

If your main concern is that the presence of bundled(foo) is causing a prohibitive bloat in the repodata, then one option is to move the "source of truth" to another location. That information really should not be lost however, which is what your proposal of bundled(rust) does.

The reason for maintaining this data is to allow us to easily identify what software needs to be rebuilt in the event of a critical bug or CVE in one of the bundled dependencies. Saying "The source of truth is one thing - the source code" is not helpful, because it is prohibitively difficult for us (Fedora) to scan the source code of every package we ship to find references to affected packages. That information needs to be aggregated in some way. Then that information needs to be retrievable in a useful manner.

Now, maybe we could avoid using the repodata as that aggregation and retrieval method. This means coming up with a way for packagers to provide this information. Right now, the bundled Provides mechanism covers all of these requirements and does so with minimal Packager effort (in some cases like in Node.js, the Provides are even detected and added automatically, thanks to specialized tools in the build system). If we were to adopt a new approach, the goal needs to be that the decrease in repodata size must be of greater value to Fedora than any additional Packager or security team cost that such an approach would add.

Perhaps a simpler approach to improving repodata size would be for us to post-process the XML and remove the bundled(foo) entries into a separate, searchable database. The mechanism that packagers use to provide this information would remain the same, the repodata would be smaller and (if properly designed) the database could be more efficient than dnf repoquery for locating affected packages.
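A minimal sketch of that post-processing idea, assuming the usual createrepo-style primary.xml layout (a `package` element with a `name` child and `rpm:entry` elements under `format/rpm:provides`). The function name `split_bundled` and the `bundled` table schema are hypothetical and chosen only for illustration; no such tool is described in this thread.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Namespaces assumed for createrepo-style primary.xml.
NS = {
    "c": "http://linux.duke.edu/metadata/common",
    "rpm": "http://linux.duke.edu/metadata/rpm",
}


def split_bundled(primary_xml: str, db: sqlite3.Connection) -> str:
    """Move bundled(...) provides from primary XML into an SQLite table.

    Returns the XML with those entries removed. Hypothetical helper,
    not an existing Fedora tool.
    """
    ET.register_namespace("", NS["c"])
    ET.register_namespace("rpm", NS["rpm"])
    root = ET.fromstring(primary_xml)
    db.execute(
        "CREATE TABLE IF NOT EXISTS bundled "
        "(package TEXT, provide TEXT, version TEXT)"
    )
    for pkg in root.findall("c:package", NS):
        name = pkg.findtext("c:name", namespaces=NS)
        provides = pkg.find("c:format/rpm:provides", NS)
        if provides is None:
            continue
        # Copy the child list so entries can be removed while iterating.
        for entry in list(provides):
            if entry.get("name", "").startswith("bundled("):
                db.execute(
                    "INSERT INTO bundled VALUES (?, ?, ?)",
                    (name, entry.get("name"), entry.get("ver")),
                )
                provides.remove(entry)
    db.commit()
    return ET.tostring(root, encoding="unicode")
```

Packagers would keep writing the same Provides; only the published repodata and the lookup path would change. A lookup such as `SELECT package FROM bundled WHERE provide = 'bundled(crate(foo))'` would then stand in for the corresponding repoquery.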

walters commented 6 months ago

because it is prohibitively difficult for us (Fedora) to scan the source code of every package we ship to find references to affected packages.

Why is it difficult? Spell it out explicitly please.

I think I know why you say that, but let's be more specific for how accessing repository metadata is easier than source code.

I think that's the problem to fix.

decathorpe commented 6 months ago

Well ... the size of all SRPMs in Fedora is, best I can remember, on the order of 100GB. Repository metadata for these bundled() Provides is, likely, on the order of 10MB or less. I would say having to download (or keep up to date) a complete local mirror of the "source" repositories, weighing in at 100GB or so, would be very prohibitive (supposing that there would even be a standard way of auditing those sources to begin with).

sgallagh commented 6 months ago

because it is prohibitively difficult for us (Fedora) to scan the source code of every package we ship to find references to affected packages.

Why is it difficult? Spell it out explicitly please.

You're the agent of change in this conversation. The onus is on you to demonstrate how your approach improves things. How exactly would you discover, generically among all of Fedora software, what things are being bundled? Every language and every project handles their dependencies in different ways (sometimes multiple ways in single, large projects). If you can describe a way to identify this information at least as accurately as we currently do by requiring packagers to provide it, then that will make this viable.

I think I know why you say that, but let's be more specific for how accessing repository metadata is easier than source code.

Because repodata is by its very nature structured for data retrieval. Source code is a recipe; repodata is a menu.

I think that's the problem to fix.

I'll admit to a great deal of skepticism here, but I'll bite: how would you fix this?

walters commented 6 months ago

You're the agent of change in this conversation.

Fair!

I'll admit to a great deal of skepticism here, but I'll bite: how would you fix this?

Convert the lookaside cache into a git repository to start.

The onus is on you to demonstrate how your approach improves things. How exactly would you discover, generically among all of Fedora software, what things are being bundled?

The package build process is doing it at build time. That tooling gets converted to run against imported source code to start.

Once the lookaside cache is replaced with a git repository, it also becomes possible to do things like CI checks against it, which is where it should be done...

james commented 6 months ago

I am very sympathetic to the generic problem of "repo md should be smaller", and generally will happily try to help any progress. Also I understand progress is often made in many small steps. Saying that, unless Monday morning is getting to me, some data here suggests this would be a very small step:

% sudo dnf repoquery --available --provides > abcd
% wc -l abcd
465148 abcd
% fgrep bundled abcd | wc -l
6855
% fgrep bundled abcd | fgrep crate | wc -l
400
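For reference, the same counts can be turned into shares of the total. This is a small sketch that mirrors the substring matching of the fgrep pipeline above; the function name `bundled_share` is made up for illustration, and the figures at the bottom are the ones reported in the transcript.

```python
def bundled_share(provides_lines):
    """Count total, bundled, and bundled-crate provides by substring match.

    Mirrors `wc -l`, `fgrep bundled`, and `fgrep bundled | fgrep crate`.
    """
    total = bundled = crates = 0
    for line in provides_lines:
        total += 1
        if "bundled" in line:
            bundled += 1
            if "crate" in line:
                crates += 1
    return total, bundled, crates


# Figures reported in the transcript above.
total, bundled, crates = 465148, 6855, 400
print(f"bundled provides: {bundled / total:.2%} of all provides")
print(f"bundled crates:   {crates / total:.2%} of all provides")
```

On those figures, bundled Provides come to roughly 1.5% of all provides, and bundled crates to under 0.1%.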

walters commented 6 months ago

some data here suggests this would be a very small step:

Also fair!

Though one thing to bear in mind is that it looks like most of the provides data is actually O(total packages) because each package has a provides for itself, and there's provides between subpackages...whereas bundling provides grow at O(N * M), even though it's just 1% today. There are a lot of packages as a baseline.

fgrep bundled abcd | wc -l

Incidentally it's more efficient to use grep -c bundled abcd.

Also for the record I briefly glanced at C9S repodata for baseos/appstream, and that also is much smaller (as you'd expect) but also the same orders of magnitude:

$ wc -l < provides
234574
$ grep -c bundled /tmp/provides 
2892

Interestingly it's not that much different with crb enabled.

Also tangentially related to this one thing I looked at at one point is a "repodata optimizer" that would remove redundant provides that aren't needed to depsolve within itself. e.g.

$ rpm -q --provides glibc | wc -l
106

95% of that is redundant because basically all packages are going to transitively depend on glibc anyways, and even the ones that don't will do so via e.g. libc.so.6(GLIBC_2.33)(64bit) not libc.so.6(GLIBC_2.3.4)(64bit) (truly ancient provides).
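A rough sketch of what such a "repodata optimizer" pass might look like, under strong simplifying assumptions: dependencies are compared as plain strings (no version ranges, file dependencies, or rich dependencies), and anything outside the repository is ignored, which is exactly the third-party concern raised in the next comment. The function name and data layout are hypothetical.

```python
def prune_provides(packages):
    """Drop provides that nothing inside the same repository requires.

    `packages` maps a package name to {"provides": set, "requires": set}.
    A package's own name is always kept so it stays installable by name.
    """
    needed = set()
    for meta in packages.values():
        needed |= meta["requires"]
    return {
        name: {p for p in meta["provides"] if p in needed or p == name}
        for name, meta in packages.items()
    }


repo = {
    "glibc": {
        "provides": {
            "glibc",
            "libc.so.6(GLIBC_2.33)(64bit)",
            "libc.so.6(GLIBC_2.3.4)(64bit)",
        },
        "requires": set(),
    },
    "bash": {
        "provides": {"bash"},
        "requires": {"libc.so.6(GLIBC_2.33)(64bit)"},
    },
}
pruned = prune_provides(repo)
```

In this toy repository the GLIBC_2.3.4 symbol version is dropped because nothing in the repository requires it, while the GLIBC_2.33 one survives.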

james commented 6 months ago

Also tangentially related to this one thing I looked at at one point is a "repodata optimizer" that would remove redundant provides that aren't needed to depsolve within itself

There are a lot of possibilities around this, with likely some huge wins, but the big hurdle you need to cross is how to convince mattdm/FESCO/everyone that the optimizations are worth non-Fedora repos. being second class. Maybe if you try just optimizing requires that could be seen as a big enough win (esp. if you could blacklist certain packages due to "freeworld" issues).

You also mentioned splitting into more than one repo. and that again could go pretty deep, and likely be mostly political as everyone sees it as making their packages second class and/or the return of fedora-core. But maybe you could get some traction if you proposed the split as the packages in fedora-eln and everything else (I wouldn't use core as the name though ;).

ngompa commented 6 months ago

Anything involving splitting Fedora in two is probably dead in the water. It was Herculean to merge them, and any split would result in a second-class effect somewhere. Especially if we used ELN as the basis for a split, that's literally the return of Fedora "Core"/"Extras" split.

Edited 6 months ago by ngompa

mulhern commented 6 months ago

It would help if bundled(Provides) for Rust were actually accurate. Taken from closed RFE[1] comment:

"""
cargo-auditable would be very useful here: https://github.com/rust-secure-code/cargo-auditable , if the purpose is to detect what executables actually contain any code from a CVE'd package.

Compiling with this tool enabled, which is done easily, just embeds, in the executable, the information about what dependencies have actually been compiled into that executable. This information can then be extracted w/ rust-audit-info:
https://crates.io/crates/rust-audit-info.

cargo-vendor, out of the box, overestimates the packages actually depended on significantly. For instance, stratisd's vendor directory has 180 separate dependencies, but on my machine, the actual dependency count is 120. I've seen other packages where the actual/static dependency ratio is a whole lot smaller than stratisd's 2/3.

If you filter the vendor tarfile, then you are not actually bundling all the dependencies which your binary rpm claims, and you have proved, by building the executable, that these dependencies don't go in your executable.

There's a partial ordering here:

num_packages(cargo-vendor) >= num_packages(filtered cargo-vendor result) >= num_packages(cargo-auditable)

It is possible that cargo-auditable might, due to a bug, omit a dependency actually included. filtered cargo-vendor never can, because if it did, the executable would not have compiled.

That it is necessary, for some kind of legal reasons, to include in the bundled(Provides) every package that is in the vendor directory is in direct conflict with the other use of bundled(Provides), for security purposes and checking CVEs.
"""

[1] https://pagure.io/fedora-rust/rust2rpm/issue/139#comment-876784

decathorpe commented 6 months ago

Reading the previous comments, I think two different issues are being conflated:

1. which dependencies are bundled / vendored (does not affect package license)
2. which dependencies end up statically linked into shipped binaries (does affect package license)

With respect to Rust packaging, there are now RPM macros to determine both:

The %cargo_vendor_manifest macro writes the dependency tree to a file, using the same logic as the cargo vendor command. This is intended to be used with the new RPM generator for automatically generating Provides: bundled(crate(...)) for vendored dependencies.
The %cargo_license and %cargo_license_summary macros use a different logic, and only print those dependencies (and their licenses) from the dependency tree that are actually going to be statically linked into shipped binaries.

However, if you use one of these things as a source of truth for the other one, it will either include too much or not enough, obviously (since 1. is a superset of 2.).

I'm not sure if dependencies that are only vendored for the purpose of being used at build-time or for running tests need to be declared with Provides: bundled(...), but they are currently included in the output of %cargo_vendor_manifest, to be on the safe side, and to actually match the contents of the vendor tarballs generated by cargo vendor.

But on the other hand, when it comes to determining the license of everything that ends up in statically linked binaries, including build-time or test-only dependencies is obviously wrong, which is why they are excluded in %cargo_license* macro output.

I have put in a lot of thought about these two things, and the current implementations in the Rust RPM macros reflect what I think is "correct" considering both the current Packaging Guidelines and the updated Legal Guidance from Red Hat Legal, and they have also been influenced by what other people have done when dealing with similar problems in other language stacks.

These things are also not the only source of information I use when rebuilding Rust applications for security vulnerabilities. They are just the best starting point (because I try to avoid needlessly rebuilding packages if they're not affected by an issue):

1. query repository metadata (with something like dnf repoquery --whatrequires rust-$foo-devel --recursive | grep src) to get which packages depend on crate $foo at build-time
2. filter out packages that don't ship binaries
3. for the remaining packages, check whether they are actually affected by the issue in crate $foo
4. rebuild packages

For checking packages that build with vendored dependencies, step two is not necessary, because there are no packages that don't ship binaries but use vendored dependencies:

1. query repository metadata (with something like dnf repoquery --whatprovides "bundled(crate(foo))")
2. check whether the packages in the query results are actually affected by the issue in crate $foo
3. rebuild packages

I have also looked at cargo-auditable (even started packaging it at some point), but I didn't move forward with it, because I'm not sure if it would even work for our purposes. I suspect the symbols it injects during the build process would get stripped by debuginfo handling in RPM.

Additionally, using cargo-auditable gets you back to the original problem: That data would then end up in package contents, not in package metadata, so it would be much less discoverable, and very difficult to query.

mulhern commented 6 months ago

I only know about packaging what I've gleaned, quite painfully, from observation, but:

I don't think that the last objection to using cargo-auditable has to be true. Once the executables are built during the packaging process the information is in the executables. It should be possible to extract it when generating the bundled Provides information. Then it would be in the package metadata, because the packaging process would have put it there. Perhaps you are saying that it is impossible to do this with macros? I don't think it should be harder than anything else that I have seen done. One would use rust-audit-info to extract the information from the executable (as JSON) and stow it in a file somewhere for some sort of further processing later.
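As a sketch of that extraction step: assuming rust-audit-info prints JSON with a top-level "packages" list whose entries carry "name" and "version" (and an optional "root" flag for the top-level crate), the Provides lines could be derived as below. The JSON layout is an assumption about the tool's output, and the function name is hypothetical; no such macro or generator is described in this thread.

```python
import json


def bundled_provides(audit_json):
    """Turn audit JSON into sorted `Provides: bundled(crate(...))` lines.

    Assumes a {"packages": [{"name": ..., "version": ..., "root": ...}]}
    layout; the root crate is skipped because it is the package itself.
    """
    data = json.loads(audit_json)
    lines = set()
    for pkg in data.get("packages", []):
        if pkg.get("root"):
            continue
        lines.add(
            f"Provides: bundled(crate({pkg['name']})) = {pkg['version']}"
        )
    return sorted(lines)


sample = json.dumps({
    "packages": [
        {"name": "demo-app", "version": "1.0.0", "root": True},
        {"name": "serde", "version": "1.0.190"},
        {"name": "libc", "version": "0.2.150"},
    ]
})
for line in bundled_provides(sample):
    print(line)
```

The crate names and versions in `sample` are invented for the example.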

Regarding the process, I would like the step "check whether ..." to be made easier in some cases by being eliminated altogether because cargo-auditable has been used to give a smaller set of bundled Provides.

I'm really interested in CVEs and I believe that a CVE does not affect a package that ships binaries if the package with the CVE has not been statically linked into any of the binaries that are shipped and that cargo-auditable will give me that information and that that information can, in principle, be extracted from the executables and included in the package metadata.

mulhern commented 6 months ago

I don't know what "the new RPM generator for automatically generating Provides: bundled(crate(...))" is. That would be the key to knowing if it and cargo-auditable could cooperate.

decathorpe commented 6 months ago

I don't think that the last objection to using cargo-auditable has to be true. Once the executables are built during the packaging process the information is in the executables. It should be possible to extract it when generating the bundled Provides information. Then it would be in the package metadata, because the packaging process would have put it there. Perhaps you are saying that it is impossible to do this with macros? I don't think it should be harder than anything else that I have seen done. One would use rust-audit-info to extract the information from the executable (as JSON) and stow it in a file somewhere for some sort of further processing later.

If this is what you want to do, then there's no need to go on a detour via cargo-auditable. The information about what's getting statically linked into the binary is already available at build time (it's what the %cargo_license macro determines). Dumping this information in a place where an automatic dependency generator can pick it up would be trivial.

I'm really interested in CVEs and I believe that a CVE does not affect a package that ships binaries if the package with the CVE has not been statically linked into any of the binaries that are shipped and that cargo-auditable will give me that information and that that information can, in principle, be extracted from the executables and included in the package metadata.

This is not true, especially since it's very easy to generate code at build-time in Rust. Both build scripts and procedural macros inject code at build time, and dependencies for them are also only needed at build time. But if one of these crates is found to produce problematic code, packages will need to be rebuilt even though the problematic crate is not statically linked into them, but the code they generated is.

I don't know what "the new RPM generator for automatically generating Provides: bundled(crate(...))" is. That would be the key to knowing if it and cargo-auditable could cooperate.

Probably not. As mentioned above, I don't think we need cargo-auditable to do what you want at all.

mulhern commented 6 months ago

I'm really interested in CVEs and I believe that a CVE does not affect a package that ships binaries if the package with the CVE has not been statically linked into any of the binaries that are shipped and that cargo-auditable will give me that information and that that information can, in principle, be extracted from the executables and included in the package metadata.

This is not true, especially since it's very easy to generate code at build-time in Rust. Both build scripts and procedural macros inject code at build time, and dependencies for them are also only needed at build time. But if one of these crates is found to produce problematic code, packages will need to be rebuilt even though the problematic crate is not statically linked into them, but the code they generated is.

You are correct that, e.g., procedural macros may be used to generate code but may not themselves be included in the binary and that errors in those macros would, potentially, affect the generated executable. This is something I am well aware of, but I was trying to use the language that you seemed to prefer, and ended up misspeaking.

In any case, cargo-auditable claims to: "Know the exact crate versions used to build your Rust executable. Audit binaries for known bugs or security vulnerabilities in production, at scale, with zero bookkeeping."

They seem to be making the claim that the information they embed includes all packages used, which is actually the information that is desired. I am going to try asking them if that is true.

mulhern commented 6 months ago

https://github.com/rust-secure-code/cargo-auditable/issues/126

mulhern commented 6 months ago

So, it looks like we have the following partial orderings and that we can use set notation to express them:

dependencies(%cargo_license) \subset dependencies(cargo-auditable)
dependencies(cargo-auditable) \subset dependencies(cargo-vendor)
often: cardinality(dependencies(cargo-auditable)) << cardinality(dependencies(cargo-vendor))
actual dependencies used (found dynamically) \subset dependencies(cargo-auditable)
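The chain above can be checked mechanically once each tool's output is reduced to a set of crate names. The sets below are invented purely for illustration (they are not real output from any of the tools), and `<=` is used for "subset or equal", since some of the sets may coincide.

```python
# Hypothetical dependency sets for one package; the names are illustrative only.
actually_used = {"libc"}
license_deps = {"libc", "serde"}               # as from %cargo_license
auditable_deps = license_deps | {"syn"}        # as from cargo-auditable
vendor_deps = auditable_deps | {"winapi", "criterion"}  # as from cargo-vendor


def is_chain(*sets):
    """True if each set is a subset of (or equal to) the next one."""
    return all(a <= b for a, b in zip(sets, sets[1:]))


print(is_chain(license_deps, auditable_deps, vendor_deps))
print(actually_used <= auditable_deps)
print(len(auditable_deps), "of", len(vendor_deps), "vendored crates recorded")
```

A check like this fails as soon as a tool that should report a superset omits a crate, which is the kind of bug discussed earlier for cargo-auditable.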

mulhern commented 6 months ago

I have just worked out the following:

cargo-vendor and cargo-auditable are both really using variations of what is obtained by cargo-metadata, the command-line tool. cargo-vendor vendors every dependency in the "cargo-metadata --all-features" result unless it is a path dependency.

cargo-auditable just restricts the set of dependencies by:
1. specifying the appropriate target triple
2. hijacking the cargo command-line to pull out any other command-line options which match those that can be passed to cargo-metadata
3. stripping out dev dependencies (as I have been told, although I don't see where that happens)

I was also told that cargo-metadata is hard-coded to use the v1 dependency resolver, not the v2 resolver that is the default for newer versions of Rust. The v1 resolver should find strictly more dependencies than the v2 one does because it does more feature unification.

cargo-auditable's further bit of cleverness, which can be ignored for our purposes, is inserting the result into the executable.

So, it is, in principle, not hard to extend some rpm macro for a cargo build so that, as well as running the build command, it can run the cargo-metadata command with the same feature and other relevant arguments as are used for the actual build and the appropriate target triple. This would yield a result that constituted a superset of the packages actually used to build the executable, but not nearly as large as the maximal cargo-metadata output which is used by cargo-vendor. It should be this result that should be included in the bundled Provides, not the maximal cargo-vendor result.
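A sketch of that idea, assuming the build's target triple and feature selection are known to the macro: it builds a `cargo metadata` invocation restricted to one platform and then keeps only the packages that appear in the resolved graph. The helper names are hypothetical, and this version does not strip dev-dependencies, so it only approximates what cargo-auditable records.

```python
import json


def metadata_command(target, features=(), no_default_features=False):
    """Build a `cargo metadata` call matching the build's target and features."""
    cmd = ["cargo", "metadata", "--format-version", "1",
           "--filter-platform", target]
    if no_default_features:
        cmd.append("--no-default-features")
    if features:
        cmd += ["--features", ",".join(features)]
    return cmd


def resolved_crates(metadata_json):
    """Return sorted (name, version) pairs for packages in the resolve graph.

    Caveat: dev-dependencies of workspace members are not removed here.
    """
    data = json.loads(metadata_json)
    resolved_ids = {node["id"] for node in data["resolve"]["nodes"]}
    return sorted(
        (pkg["name"], pkg["version"])
        for pkg in data["packages"]
        if pkg["id"] in resolved_ids
    )
```

The resulting list, minus the package itself, is what would feed the bundled Provides in place of the full vendor directory listing.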

This would add more complication to the already existing complication but would reduce the bundled Provides bloat and would reduce the number of false positive affected-by-CVE'd package judgements.

decathorpe commented 6 months ago

Ok, let me repeat myself then:

It seems there is a disagreement over what needs to be declared as bundled Provides, and for what purposes it is used.

Right now, %cargo_vendor_manifest lists everything that is included when running cargo vendor, because the fact is that these dependencies are vendored in the package.

If you want to track which dependencies actually get statically linked into the binary, that is a different problem and we should not be using a mechanism that is intended for a different purpose for this.

There's been an open issue in rust2rpm about doing something like this: https://pagure.io/fedora-rust/rust2rpm/issue/39

TL;DR: I don't think the syntax of Provides: bundled(...) should be used for two semantically different things. If we want metadata for statically linked dependencies, that should use different syntax, as well.

But in that case, we're back at the problem: Doing both of these things (since they appear to be necessary for different purposes) will make metadata even bigger :)

mulhern commented 6 months ago

I was not there when the concept of bundled(Provides) was first introduced.

But, I think it's weird that, for Rust:

1. the rpm that does bundle the vendor tarfile, i.e., the source rpm, does not provide the bundled Provides information.
2. the rpm that does not bundle the vendor tarfile, i.e., the binary rpm, does provide bundled Provides information, and this information includes many potential dependencies (sometimes the majority of the entries so included) which had no more to do with the build of the binaries packaged in the binary rpm than any randomly selected crate on crates.io did.

Maybe if someone could explain how we got to this place I could understand better.

decathorpe commented 6 months ago

I was not there when the concept of bundled(Provides) was first introduced.

But, I think it's weird that, for Rust:

1. the rpm that does bundle the vendor tarfile, i.e., the source rpm, does not provide the bundled Provides information.

2. the rpm that does not bundle the vendor tarfile, i.e., the binary rpm, does provide bundled Provides information, and this information includes many potential dependencies (sometimes the majority of the entries so included) which had no more to do with the build of the binaries packaged in the binary rpm than any randomly selected crate on crates.io did.

Maybe if someone could explain how we got to this place I could understand better.

As far as I know, this is because (1) is not possible in RPM. Any Provides for package X are attached to the built package X, not the source package X. If built package X does not exist, the Provides are not applied to the source package X either, they are applied to nothing (a common footgun, since this does not even cause RPM to print warnings).

Opened 6 months ago by walters. Modified 6 months ago
