#1796 pkgset: optimize cache check (saves 20 minutes)
Merged 20 days ago by lsedlar. Opened 20 days ago by adamwill.
adamwill/pungi pkgset-fix-cache-check into master

@@ -265,7 +265,7 @@
         for arch in arch_list:
             self.rpms_by_arch.setdefault(arch, [])
             for i in other.rpms_by_arch.get(arch, []):
-                if i.file_path in self.file_cache:
+                if i.file_path in self.file_cache.file_cache:
                     # TODO: test if it really works
                     continue
                 if inherit_to_noarch and exclusivearch_list and arch == "noarch":

The pkgset phase takes around 35 minutes in current composes.
Around 20 minutes of that is spent creating these per-arch
subsets of the global package set. In a rather roundabout way
(see #1794), I figured out that almost all of this time is
spent in this cache check, which is broken for a subtle reason.

Python's in operator works by first attempting to call the
container's magic __contains__ method. If the container does
not implement __contains__, it falls back to iteration: it
iterates over the container until it either finds what it's
looking for or runs out. (If the container implements
neither, you get a TypeError.)

The FileCache instance's file_cache attribute is a plain Python
dict. dicts have a very efficient __contains__ implementation,
so foo in somedict is basically always very fast no matter how
huge the dict is. FileCache itself, though, implements __iter__
by returning an iterator over the file_cache dict's keys, but it
does not implement __contains__. So when we do
foo in self.file_cache, Python has to iterate over every key
in the dict until it hits foo or runs out. This is massively
slower than foo in self.file_cache.file_cache, which uses the
dict's efficient __contains__ method.
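
To make the shape of the problem concrete, here is a minimal
sketch (a stand-in class, not the real kobo FileCache) showing
the two lookups side by side:

class FakeFileCache:
    # stand-in with the same shape as described above: a file_cache
    # dict plus __iter__, but no __contains__
    def __init__(self):
        self.file_cache = {}

    def __iter__(self):
        return iter(self.file_cache)

cache = FakeFileCache()
cache.file_cache["/path/to/foo.rpm"] = object()

# no __contains__, so this falls back to __iter__ and walks the keys
print("/path/to/foo.rpm" in cache)
# this is the fix: the dict's own __contains__ is a constant-time lookup
print("/path/to/foo.rpm" in cache.file_cache)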

Because these package sets are so huge, and we're looping over
one huge set and checking each package from it against the cache
of another, increasingly huge, set, this effect becomes massive.
To make it even worse, I ran a few tests where I added a debug log
if we ever hit the cache, and it looks like we never actually do -
so every check has to iterate through the entire dict.

We could probably remove this entirely, but changing it to check
the dict instead of the FileCache instance makes it just about as
fast as taking it out, so I figured let's go with that in case
there's some unusual scenario in which the cache does work here.

Signed-off-by: Adam Williamson <awilliam@redhat.com>

I wrote a little test script to demonstrate how in works, and prove this bug on a much smaller scale:

import time

# neither __contains__ nor __iter__: "in" has nothing to fall back to
class Nope:
    def __init__(self):
        self.dic = dict()

# __iter__ only: "in" falls back to iterating over the keys one by one
class Iter:
    def __init__(self):
        self.dic = dict()
        self.first = True

    def __iter__(self):
        if self.first:
            print("we're in iteration!")
            self.first = False
        return iter(self.dic)

# __iter__ and __contains__: "in" uses the dict-backed __contains__
class Contains:
    def __init__(self):
        self.dic = dict()
        self.first = True

    def __iter__(self):
        if self.first:
            print("we're in iteration!")
            self.first = False
        return iter(self.dic)

    def __contains__(self, item):
        if self.first:
            print("we're in containment!")
            self.first = False
        return self.dic.__contains__(item)


it = Iter()
"foo" in it
contains = Contains()
"foo" in contains
nope = Nope()
try:
    "foo" in nope
except Exception as err:
    print(f"oops, that didn't work: {err}")

print("testing contains")
start = time.time()

for i in range(10000):
    contains.dic[i] = i

for i in range(10000):
    "foo" in contains

end = time.time()
print(f"took {end - start} secs!")

print("testing iterator")
start = time.time()

for i in range(10000):
    it.dic[i] = i

for i in range(10000):
    "foo" in it

end = time.time()
print(f"took {end - start} secs!")

The first part demonstrates that in prefers __contains__, falls back to iteration, and errors out if neither is implemented. The second shows how much slower iteration is than a dict's __contains__ method. On my system, this is the output:

[adamw@xps13a pungi (pkgset-fix-cache-check)]$ python3 /tmp/test.py
we're in iteration!
we're in containment!
oops, that didn't work: argument of type 'Nope' is not iterable
testing contains
took 0.002552032470703125 secs!
testing iterator
took 1.0152699947357178 secs!

Note in the test script we 'only' put 10000 items in the dicts. In a compose, we have about 300,000 packages, and since we do multiple loops, I think we probably hit the cache check like...a million times, or something? It's a lot, anyway. :P
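
As a rough back-of-the-envelope (the ~300,000 package figure is from above; four arch loops is just an assumed number for illustration), the scale looks something like:

checks = 4 * 300_000          # assumed number of arch loops * packages checked
keys_scanned = 300_000        # worst case per miss: walk a dict roughly this big
print(checks * keys_scanned)  # very roughly, on the order of 10**11 key comparisons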

I tested this practically by just running multiple test composes with and without this change. Every single time, the "Merging package sets" steps take ~20 minutes without this change, and just seconds with it. I've done this I think five times now. It seems like a completely solid result.

Another option would be to add FileCache.__contains__(self, item) that would simply return the result of item in self.file_cache. You'd get a bit more isolation, as users of FileCache would not need to know about its internal structure.
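
A minimal sketch of what that could look like (assuming, as described above, a file_cache dict and an existing __iter__; this is not the real kobo code):

class FileCache(object):
    # sketch only, not the real kobo implementation
    def __init__(self):
        self.file_cache = {}

    def __iter__(self):
        return iter(self.file_cache)

    def __contains__(self, name):
        # delegate membership tests to the underlying dict, so that
        # "path in cache" is a fast dict lookup instead of a key scan
        return name in self.file_cache

With that in place, the call site could keep the original "if i.file_path in self.file_cache:" check and still get the fast path.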

Yeah, I will probably send a PR to do that too (you'll notice that's how the test script works). It's late tonight, didn't get around to it yet. But it's a bit academic for here specifically, because pungi does need to know the structure later: it really wants to teleport entries directly into the FileCache's file_cache (rather than using its add method) for...reasons.

Ok, makes sense to continue messing around with it. ;)

Great catch!

In my tests I used tags with ~30k RPMs, and the slowdown wasn't noticeable; the patch doesn't really change the run time there in a measurable way. But when I tried a tag with ~421k packages, it was very visible. In a very non-scientific test, merging each arch goes from about 10 minutes to under 10 seconds.

But it's a bit academic for here specifically, because pungi does need to know the structure later: it really wants to teleport entries directly into the FileCache's file_cache (rather than using its add method) for...reasons.

If I'm reading the code correctly, Pungi does both. When it adds the entry by file path, it uses add(). When it has the object to cache, it puts it there directly to avoid re-parsing headers.

However, I don't think it makes sense to check the cache in the merge method. We only ever call merge on an empty package set object. So the only way the check will ever hit is if the same rpm is present multiple times. That's not impossible, but requires arguably incorrect configuration. My test with ~30k packages actually hit for two RPMs, because a build that's present in pkgset_koji_tag is also listed explicitly in pkgset_koji_builds.

Pull-Request has been merged by lsedlar

20 days ago