#1 depcheck: download only rpm headers?
Opened 9 years ago by kparal. Modified 6 years ago

For depcheck, we currently download whole RPMs (tens to hundreds of MB). The RPM headers alone would be sufficient, since we need just the metadata. Investigate what needs to be done to make the change. The largest obstacle is probably mash: investigate whether it can work with header-only RPMs when creating the repository metadata.


This ticket had the following Differential requests assigned:
D223
D232

It's possible to download only the RPM header using:

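import signal
import subprocess

def reset_sigpipe():
    # Python ignores SIGPIPE by default; restore the default action so that
    # curl and tee are terminated once rpm stops reading.
    signal.signal(signal.SIGPIPE, signal.SIG_DFL)

def get_rpm_header(name, url):
    # rpm -qp reads the package from stdin (/dev/fd/0) and stops once it has
    # the header, so the local copy written by tee ends up truncated.
    # Note: url and name are interpolated into the shell command unquoted.
    cmd = 'curl -s {} | tee {} | rpm -qp /dev/fd/0'.format(url, name)
    subprocess.call(['bash', '-e', '-c', cmd], preexec_fn=reset_sigpipe)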

This code is taken from the [[ https://github.com/msimacek/koschei | Koschei ]] project. From initial testing, it seems that mash and depcheck work with header-only RPMs.

Just a note - this "Scares the s**t out of @tflink", so we should really (at least) verify with RPM devs, and all the important folks, that this is ok.

I'd like to use a pure-Python approach: either using the RPM Python API, or at least downloading the first $size bytes of every package (where $size >= $max_header_size) and then running the rpm -qip command on them. Please consult with the RPM developers and the Koschei developers. Thanks.
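A minimal sketch of the "download only the first $size bytes" idea, using an HTTP Range request from the Python standard library. The function names are hypothetical, and it assumes the mirror honours Range requests; a server that ignores the header answers with the whole file, which is why the read is capped as well.

```python
import urllib.request

def build_prefix_request(url, size):
    """Build a GET request asking only for the first `size` bytes of `url`."""
    if size <= 0:
        raise ValueError("size must be positive")
    req = urllib.request.Request(url)
    # Range is inclusive on both ends, so the last byte index is size - 1.
    req.add_header("Range", "bytes=0-{}".format(size - 1))
    return req

def fetch_prefix(url, size):
    """Return up to `size` bytes from the start of `url` (needs network access)."""
    with urllib.request.urlopen(build_prefix_request(url, size)) as resp:
        # Cap the read in case the server ignored the Range header.
        return resp.read(size)
```

The downloaded prefix could then be written to a file and passed to rpm -qip, as suggested above.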

I have published a Differential request. It isn't a complete solution; I am publishing it only to discuss the approaches we can take.

Downloading the header data is all well and good, but we need to make sure that libsolv correctly handles it. I will not merge this until the following conditions are met:

Verify libsolv does correctly handle just the rpm header data, or have patches submitted and accepted upstream to do so.
Have RPM devs verify that this will not break the world.
Have tflink sign off on this.

@jdulaney, you are the one who should verify it, as you (evidently) have broad libsolv experience and knowledge. Also, I thought that libsolv operates on repo metadata, not on RPMs? But I guess you know better; I'm not really that much into those "magical" parts of the Depcheck code.

btw @jsedlak

{meme, src=megusta}

@jdulaney I have been on the #openSUSE IRC channel and learned that libsolv has nothing to do with this: hawkey/libsolv only uses repo metadata. Mash and createrepo are the utilities that we should check.

Sorry, it's 5:42 AM local and I'm still awake. Investigating.

@jsedlak tested this quite a bit and there appears to be no problem with mash/createrepo directive. But we're not totally sure we want to use this, it has been posted just to present a possible solution (mainly useful for development for people with limited bandwidth, like us) and start a discussion. @jsedlak is going to prepare one alternative solution as well, feeding rpm directly.

After poking at this, libsolv needs the header, lead, and sig-head as well; basically everything up to the actual payload. So, if you can get everything but the payload, then we'll have something useful.

Yep, the code is written that way. It reads the first 112 bytes (which contain the lead and the start of the signature header), then calculates the length of the signature header (plus padding), and then reads the package header that precedes the payload. But I am afraid that @tflink will not like it :-). It's a little bit of magic.
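For readers wondering where the 112 comes from: the RPM lead is 96 bytes and each header structure starts with a 16-byte intro (3-byte magic, version, 4 reserved bytes, then the index entry count and data size as big-endian 32-bit integers). The following is a rough sketch of the size arithmetic only, not the code from the Differential request; all function names are hypothetical.

```python
import struct

LEAD_SIZE = 96    # fixed-size RPM lead
INTRO_SIZE = 16   # magic(3) + version(1) + reserved(4) + nindex(4) + hsize(4)
ENTRY_SIZE = 16   # size of one index entry
HEADER_MAGIC = b"\x8e\xad\xe8"

def section_size(intro):
    """Total size of a header structure, given its 16-byte intro."""
    if len(intro) < INTRO_SIZE or intro[:3] != HEADER_MAGIC:
        raise ValueError("not an RPM header structure")
    nindex, hsize = struct.unpack(">II", intro[8:16])
    return INTRO_SIZE + ENTRY_SIZE * nindex + hsize

def main_header_offset(first_112):
    """Offset of the main header: end of the signature header, padded to 8 bytes."""
    intro = first_112[LEAD_SIZE:LEAD_SIZE + INTRO_SIZE]
    end = LEAD_SIZE + section_size(intro)
    return end + (-end % 8)

def payload_offset(main_offset, main_intro):
    """Offset where the payload starts, given the main header's 16-byte intro."""
    return main_offset + section_size(main_intro)
```

Everything before the offset returned by payload_offset is what needs to be downloaded; the payload itself can be skipped.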

{meme, src=freddie, below="it\'s a kind of magic"}

Yeah, I'm not so sure I like that; it sounds like there's the potential for reliability issues as well as forwards compatibility.

Forward compatibility shouldn't be a problem. It checks the RPM file version, and if it is anything other than 3, it falls back to downloading complete RPMs. So it isn't really forward compatible, but it will not fail if (when) RPM file format version 4 appears.
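The version check described above might look roughly like this. It is only an illustration with hypothetical names, assuming the check is done on the major version byte that follows the 4-byte magic in the lead.

```python
RPM_LEAD_MAGIC = b"\xed\xab\xee\xdb"
SUPPORTED_MAJOR = 3

def can_use_header_only(lead_bytes):
    """Return True only for a lead with the expected magic and major version 3."""
    if len(lead_bytes) < 5 or lead_bytes[:4] != RPM_LEAD_MAGIC:
        return False
    # Byte 4 of the lead is the file format major version.
    return lead_bytes[4] == SUPPORTED_MAJOR
```

A caller would download the complete RPM whenever this returns False.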

@jsedlak, if you can, please also publish a second diff that uses python-rpm, a named FIFO, and a background thread. It will be useful for comparison.

I have added a revision that uses python-rpm, a named pipe, and threading. Now we can discuss and compare the different approaches: curl | tee | rpm -qp vs. the rpm header reading magic vs. the named pipe.
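To make the named pipe variant easier to picture, here is a stdlib-only sketch of the plumbing: a background thread feeds data into a FIFO and the consumer stops reading early. It is not the revision's code. In particular, the python-rpm header parsing is replaced here by a plain read of the first bytes, and the function names are made up for illustration.

```python
import os
import tempfile
import threading

def feed_fifo(path, chunks):
    """Write chunks into the FIFO; stop quietly once the reader has gone away."""
    try:
        with open(path, "wb") as fifo:
            for chunk in chunks:
                fifo.write(chunk)
    except BrokenPipeError:
        # The reader closed its end after getting what it needed.
        pass

def read_prefix_via_fifo(chunks, size):
    """Return the first `size` bytes of the stream, passed through a named pipe."""
    tmpdir = tempfile.mkdtemp()
    path = os.path.join(tmpdir, "pkg.fifo")
    os.mkfifo(path)
    writer = threading.Thread(target=feed_fifo, args=(path, chunks))
    writer.start()
    try:
        # Stand-in for the consumer; the real one would parse an RPM header.
        with open(path, "rb") as fifo:
            data = fifo.read(size)
    finally:
        writer.join()
        os.remove(path)
        os.rmdir(tmpdir)
    return data
```

In the real case the chunks would come from the HTTP download, and the download could be aborted as soon as the writer sees the broken pipe.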

Note: msimacek from Koschei also implemented this approach (piping data to rpm) independently of me; [[ http://paste.fedoraproject.org/132369/10343664/ | here is his solution ]].
