#1955 Let's get rid of filedeps (FESCo edition)
Opened 5 years ago by mattdm. Closed 5 years ago.

See FPC issue 714.

The issue: arbitrary file-name dependencies have significant overhead, yet provide only a small benefit. Specifically, file deps add over eight million data points to a dataset which would otherwise be well under 400,000. Very few packages actually use these, and it appears that most that do are using them in error. The few that are taking advantage could instead use virtual deps, and the feature of allowing install-by-filename could be provided by a dnf plugin.

Getting rid of that — or even just limiting them to files in whitelisted directories like /usr/bin — would provide immediate speed benefits and reduce bandwidth. This is a particular frustration in small cloud instances and in containers, but really can help us everywhere.
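(To make the plugin idea concrete: below is a rough, hypothetical sketch of what an install-by-filename command could look like, assuming dnf's Python command API (dnf.cli.Command and the register_command decorator) and hawkey's file filter. The command name "install-path" and the class are made up; this is not an existing plugin.)

```python
# Hypothetical sketch of an install-by-filename dnf plugin command.
# "install-path" and InstallPathCommand are made-up names, not an existing plugin.
import dnf
import dnf.cli
import dnf.exceptions
import dnf.plugin


@dnf.plugin.register_command
class InstallPathCommand(dnf.cli.Command):
    aliases = ('install-path',)
    summary = 'Install the package that owns a given file path'

    @staticmethod
    def set_argparser(parser):
        parser.add_argument('paths', nargs='+', help='absolute file paths to resolve')

    def configure(self):
        # Ask dnf to load repos, activate the sack, and run depsolving for us.
        demands = self.cli.demands
        demands.sack_activation = True
        demands.available_repos = True
        demands.resolving = True
        demands.root_user = True

    def run(self):
        for path in self.opts.paths:
            # Filtering by file entry is exactly the point where filelists
            # metadata (or a whitelisted entry in primary.xml) is needed.
            pkgs = self.base.sack.query().available().filter(file=path).latest().run()
            if not pkgs:
                raise dnf.exceptions.Error('no available package provides {}'.format(path))
            self.base.package_install(pkgs[0])
```

With something like this installed, dnf install-path /usr/bin/whatever would keep the convenience while letting the default metadata download shrink.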

FPC wasn't able to move on this; there's a sort of circular argument with DNF functionality which leaves the whole thing stuck. FPC suggested escalating to FESCo, which I am doing here.


arbitrary file-name dependencies have significant overhead

I would really love to see this ^^ quantified.

arbitrary file-name dependencies have significant overhead

I would really love to see this ^^ quantified.

See the FPC ticket, and in particular https://pagure.io/packaging-committee/issue/714#comment-464348

@ignatenkobrain notes that this results in about a 10-second delay on initial run on rawhide, as I understand it not even counting extra download time. And on my F28 system today, the filelists.xml.gz file is 45MB — a significant fraction of a minimal container's actual functional contents. (Note that the "whitelisted" special directories are in primary.xml.gz, which is 16MB. I think we could meaningfully reduce the whitelist too, but simply going to only that would be a huge step.) Oh, and it's also 15MB in updates, so we're really looking at 60MB of basically useless dependency information on every system, vm, and container.
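For anyone who wants to check these numbers against another repo or release, here's a small stdlib-only sketch (the base URL is a placeholder) that reads repomd.xml and prints the compressed size of each metadata file:

```python
# Rough sketch: list metadata files and their compressed sizes from a repo's repomd.xml.
# BASE_URL is a placeholder; point it at any yum/dnf repository.
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = 'https://example.org/fedora/releases/28/Everything/x86_64/os'  # placeholder
NS = {'repo': 'http://linux.duke.edu/metadata/repo'}

with urllib.request.urlopen(BASE_URL + '/repodata/repomd.xml') as resp:
    root = ET.fromstring(resp.read())

for data in root.findall('repo:data', NS):
    mdtype = data.get('type')                       # e.g. primary, filelists, updateinfo
    size = data.find('repo:size', NS)               # compressed size in bytes
    href = data.find('repo:location', NS).get('href')
    mb = int(size.text) / (1024 * 1024) if size is not None else float('nan')
    print('{:15s} {:8.1f} MB  {}'.format(mdtype, mb, href))
```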

Oh, and it's also 15MB in updates, so we're really looking at 60MB of basically useless dependency information on every system, vm, and container.

That's really only the case if someone does 'dnf update' inside the container, correct? In an immutable image world, that isn't very common. I wouldn't want any dnf metadata in my container image at all, etc. I'd think pet containers would suffer the most from that.

That's really only the case if someone does 'dnf update' inside the container, correct? In an immutable image world, that isn't very common. I wouldn't want any dnf metadata in my container image at all, etc. I'd think pet containers would suffer the most from that.

Updating containers by redeploying from updated versions is better, but even then all of this is pulled down when building the container from rpms (which often involves running dnf update and dnf install). It'd be nice for that to be faster too.

The long form of the details is in the old ticket 714, but it was ignored for six months:

It looks like:

  • a few directory dependencies,

  • a pile of font dependencies probably from a mis-packaging, and

  • a couple of remaining packages that are probably trivially refactored to get rid of them

Alternative:

  • Why not attack the '31 different packages', get them cleaned up, and run some performance stats?

  • As the root issue seems to be the performance of rebuilding repodata, why not attack that problem rather than a symptom? For the createrepo run times and the unpacking / re-building issues, this really calls for caching the 99.99 percent of the data that is unchanged, and simply invalidating and rebuilding the small subset that has changed most of the time, along with a 'wipe the world and rebuild' option (which seems to be the present, incredibly naive approach); a rough sketch of this caching idea follows below. As I recall Seth's approach to repodata 'back in the day', it was intentionally a cheap proof of concept rather than a thoughtful design -- the repo data format was considered, but the implementation details and the 'easy wins' in caching have not been thoughtfully approached, so far as I can tell.
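(This is not how createrepo_c is actually implemented, and the helper names below are invented, but as a sketch of the checksum-keyed caching idea from the bullet above: reuse the generated per-package metadata whenever the package file's digest is unchanged, and regenerate only the rest. As far as I know createrepo's --update option already moves in this direction.)

```python
# Illustrative sketch of checksum-keyed caching for per-package repodata snippets.
# extract_metadata() is a hypothetical stand-in for whatever produces the XML for one rpm.
import hashlib
import json
import pathlib

CACHE_FILE = pathlib.Path('repodata-cache.json')


def sha256_of(path):
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(1 << 20), b''):
            h.update(chunk)
    return h.hexdigest()


def rebuild(rpm_dir, extract_metadata):
    cache = json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}
    fresh = {}
    for rpm in sorted(pathlib.Path(rpm_dir).glob('*.rpm')):
        digest = sha256_of(rpm)
        entry = cache.get(rpm.name)
        if entry and entry['digest'] == digest:
            # Unchanged package: reuse the cached snippet instead of re-extracting.
            fresh[rpm.name] = entry
        else:
            # New or changed package: regenerate only this entry.
            fresh[rpm.name] = {'digest': digest, 'xml': extract_metadata(rpm)}
    CACHE_FILE.write_text(json.dumps(fresh))
    return [fresh[name]['xml'] for name in sorted(fresh)]
```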

TL;DR proposal:

I would 'table' this proposal and put it on hold for N months,

  • get the package cleanups done, and

  • there is a new 'easyfix' set of low-hanging fruit in cleaning out packages still carrying '/bin/' and '/sbin/' around

  • get a formal study of the implementation of cache invalidation done, and THEN see if this seems well advised.

I suspect the urgency for this proposal will fade away, and there will be bigger fish to fry; there is also, though out of scope for 'traditional' Fedora per se, potential for cross-fertilization.
I also previously mentioned:

  • 'skinnying down' installations (container-based comes to mind) by breaking huge dependency chains is possible with file dependencies -- and almost impossible at the package dependency level

My opinion on this is that the FPC ticket doesn't need to "move forward", it just needs to be closed.

The packaging guidelines include a general discouragement of file dependencies as well as a recognition that they can be necessary in some cases. Eliminating them entirely is simply not reasonable. Work to reduce them would be something that could certainly be undertaken but doesn't require any changes to existing guidelines in any case. It shouldn't, however, be undertaken with the expectation that it would have any performance benefit because that would need dnf changes which may or may not ever happen.

The proposal has been stalled in the packaging committee because none of us really has the gumption to just close the ticket as something that isn't going to be accepted. Of course that's not going to happen now that this ticket exists. Of course, if FESCo decides to do something like instruct the dnf developers to stop supporting file deps entirely, then obviously the guidelines would need to change. But that would be a rather significant change with rather important (and somewhat negative) side effects which leads me to believe that the chances of that happening are slim.

As a dnf CLI user, I do find the ability to run dnf install /usr/bin/whatever to be a very nice feature (and similarly dnf provides '*libwhatever.so'). I suppose it would be fine if that were a plugin as suggested, though it would make it less discoverable for new users. I don't find the data useless on my desktop system, though I can see it being useless on managed servers. Perhaps if we had the plugin @mattdm suggested then we could have it installed by default on workstation to alleviate my discoverability concern, while leaving it uninstalled by default on other editions?

As @jwboyer pointed out, containers would probably be scrubbing the metadata out in the final image after building anyway, so I don't think it should practically affect them.

My understanding of the issue is that:
– file deps would still have to be supported by dnf because of non-distro packages, even if Fedora stopped using them
– right now there's a small number (64 reqs / 31 packages) of file deps in non-whitelisted dirs, and a larger number (426 reqs) in whitelisted dirs
– not downloading filelists would provide significant bandwidth savings, since filelists.xml.gz at 47 MB compressed is bigger than all other metadata combined
– not loading filelists would provide noticeable improvement in dnf startup (10s)
– dnf currently downloads and always uses both primary.xml.gz and filelists.xml.gz
– the way to implement lazy loading of filelists would be to load them when a file dep is encountered and redo the dependency analysis. libsolv supports this.
– dnf developers didn't want to implement lazy loading, possibly because file deps (whitelisted and not) are used in Fedora
– dnf currently has no support for this. @ignatenkobrain says a major rework of libdnf would be needed to support it.
– the packaging guidelines say file deps SHOULD NOT be used
– the packaging guidelines make no differentiation between whitelisted and non-whitelisted locations
– if dnf were to implement lazy loading of filelists.xml, the benefit would be immediate, without getting rid of them in all packages, as long as the popular packages didn't use them.
– as long as dnf does not support lazy loading, there is no actual benefit from getting rid of file deps in packages

In the light of this, I think that we should try to get rid of non-whitelisted file deps, and allow the whitelisted ones, which are much more frequently used and cost much less. If we get this done, we can always iterate.

Disabling loading of file deps entirely in dnf or moving it to a plugin is not useful, because that'd break external packages. The dnf changes to support lazy loading are the most important. Without them everything else is moot. So it'd be great to quantify how much work would be needed for dnf to support lazy loading.

If dnf developers say that this is possible, then two things also need to happen, and should be done even before the changes in dnf are implemented:
1. packaging guidelines should be updated to just differentiate non-whitelisted file deps and discourage their use even more strongly than whitelisted ones (I'm intentionally vague here, leaving the details to FPC)
2. those 31 packages need to be cleaned up. I think the Mass Change Policy with some provenpackager involvement should be used to expedite this.

tl;dr Lazy loading of filelists.xml (non-whitelisted file deps) would provide noticeable bandwidth and runtime savings, but needs dnf/libdnf changes. We need a statement from the dnf developers on whether they could implement this.

In the 2018-08-06 FESCo meeting, it was agreed that we would defer the decision on this ticket until after Flock (and a chance to talk with the DNF team). (+1:5,+0:0,-1:0)

Metadata Update from @jsmith:
- Issue tagged with: meeting

5 years ago

@jdieter is working on the zchunk feature (https://fedoraproject.org/wiki/Changes/Zchunk_Metadata) -- I wonder how much that will mitigate the issues we're seeing now.

Thoughts?

I started another thread of discussion on rpm-ecosystem@ and fedora-devel@: https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org/thread/BRX7BAAFUS4HCRDN4J243FFBSECEFNTV/.

@jdieter is working on the zchunk feature

zchunk helps of course, but this change is beneficial independently of zchunk. What zchunk helps with is making repeat downloads of metadata faster. But there are situations where there's just a single download, and caching is not relevant. The example that was given on the mailing list is a container installation, which is created with some set of packages and where file deps should never be necessary. Skipping the download makes the container installation faster and also saves some disk space, which in the case of a container could be a nontrivial percentage.

Is there some analysis of the existing file deps with the reasons why they exist?

@till see https://pagure.io/packaging-committee/issue/714#comment-464348

The updated list for today's rawhide is:
 32  '^/s?bin/'
 24  '^/etc/'
393  '^/usr/s?bin/'
 64  rest
512  total
so the numbers have gone up a bit.
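(For reference, these buckets can be reproduced with a short stdlib-only script along these lines — the path to primary.xml.gz is a placeholder — by scanning every rpm:entry under rpm:requires in primary.xml for names starting with '/':)

```python
# Rough sketch: count file-name Requires in a repo's primary.xml.gz, grouped by directory.
# PRIMARY is a placeholder path; download primary.xml.gz from the repodata/ directory first.
import gzip
import re
import xml.etree.ElementTree as ET
from collections import Counter

PRIMARY = 'primary.xml.gz'  # placeholder
NS = {'c': 'http://linux.duke.edu/metadata/common',
      'rpm': 'http://linux.duke.edu/metadata/rpm'}
BUCKETS = [("^/s?bin/", re.compile(r'^/s?bin/')),
           ("^/etc/", re.compile(r'^/etc/')),
           ("^/usr/s?bin/", re.compile(r'^/usr/s?bin/'))]

counts = Counter()
with gzip.open(PRIMARY) as f:
    root = ET.parse(f).getroot()

for pkg in root.findall('c:package', NS):
    for entry in pkg.findall('./c:format/rpm:requires/rpm:entry', NS):
        name = entry.get('name', '')
        if not name.startswith('/'):
            continue  # not a file dependency
        for label, pattern in BUCKETS:
            if pattern.match(name):
                counts[label] += 1
                break
        else:
            counts['rest'] += 1

print(dict(counts), 'total:', sum(counts.values()))
```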

The discussion on rpm-ecosystem@ was quite long. I'll try to provide a summary here:
- a whitelist of the file patterns that are included in primary.xml would have to be recorded in primary.xml itself, so that the resolver can know whether it needs filelists.xml to resolve a name, and so that we can change the patterns in the future (a minimal sketch of this decision follows after this list).
- the patterns in Fedora are ... err ... surprising: /etc|/usr/lib/sendmail|bin/. So stuff like /var/www/moodle/web/admin/tool/recyclebin/classes/base_bin.php lands in primary.xml.
- A potentially tricky issue occurs when filelists.xml needs to be downloaded, but the download fails because the metadata was rewritten in the meantime. Essentially, we want to download the full set of fresh metadata and restart resolution.
- lazy loading of primary extensions is supported in libsolv.
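(A minimal sketch of the decision this enables, under the assumption that primary.xml carried the whitelist. The pattern set and function names below are illustrative, not an existing libdnf API: filelists.xml only needs to be fetched when some file dependency falls outside the advertised patterns.)

```python
# Illustrative sketch: decide whether filelists.xml must be downloaded for resolution.
# In the proposed scheme the whitelist would be read from primary.xml; here it is
# hard-coded to roughly match the current createrepo_c/libsolv patterns quoted above.
import re

WHITELIST = re.compile(r'/etc|/usr/lib/sendmail|bin/')


def needs_filelists(file_requires):
    """True if some file dependency cannot be answered from primary.xml alone."""
    return any(not WHITELIST.search(path) for path in file_requires)


def resolve(file_requires, fetch_filelists, run_solver):
    # First pass: try to resolve with primary.xml only.
    if needs_filelists(file_requires):
        # Lazy path: fetch filelists.xml (restarting with fresh metadata if the
        # repo was rewritten in the meantime) and redo the dependency analysis.
        fetch_filelists()
    return run_solver()
```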

There's some stuff to do:
- update createrepo_c to add the patterns to primary.xml
- update dnf to make use of the patterns list and only load filelists lazily
- update the patterns to contain less junk, and maybe add libexec to them
- update dnf to support lazy downloading of filelists.xml and make it opt-in
- make lazy downloading the default

I guess the first three points are easier than the last two, but they'd be useful on their own, because they would speed up dnf resolution, even if not downloads.

the patterns in Fedora are ... err ... surprising

That's upstream createrepo_c and libsolv. So SUSE does the same I believe. @ngompa, correct me if I'm wrong.

If we just implemented "update dnf to support lazy downloading of filelists.xml", it would already help 99% of users. Everything else is just an improvement on top of that.

+0.5

I think we should roll this out and see how many bug reports we get. I'm sure there will be things we missed and interactions that we didn't foresee.

This was discussed in yesterday's FESCo meeting (2018-08-20):
AGREED: ask dnf folks to put lazy loading (or reduced repodata loads) on their roadmap, close ticket (+1: 5, -1: 0, +0: 0)

The FPC ticket https://pagure.io/packaging-committee/issue/714 was accepted a few days ago.
Nirik filed https://bugzilla.redhat.com/show_bug.cgi?id=1619368 for the dnf RFE for lazy loading.

Metadata Update from @zbyszek:
- Issue untagged with: meeting
- Issue status updated to: Closed (was: Open)

5 years ago
