#290 consider some validation/retry/check for openRemoteFile
Closed: Fixed 4 years ago by tkopecek. Opened 7 years ago by kevin.

For a while the Fedora koji setup has had a machine running squid with apache behind it for packages (kojipkgs01.phx2.fedoraproject.org/kojipkgs.fedoraproject.org). This config works fine, but it is a single point of failure if that host is down or unavailable.

We recently tried several things:

  • kojipkgs.fedoraproject.org pointing to proxy01 and proxy10, so requests go to apache on those proxies, then via haproxy to kojipkgs01 or kojipkgs02 (a pair).

  • kojipkgs.fedoraproject.org pointing to kojipkgs01 and kojipkgs02 in DNS so requests go round robin to each of them.

  • kojipkgs.fedoraproject.org pointing to just proxy01, then its apache and haproxy to kojipkgs01/02, or even just kojipkgs01.

All these paths resulted in builds failing to unpack src.rpms, which would show as:

DEBUG util.py:435: error: unpacking of archive failed on file /builddir/build/SOURCES/graphviz-2.40.1.tar.gz;587cd943: cpio: read
DEBUG util.py:435: error: /builddir/build/originals/graphviz-2.40.1-2.fc26.src.rpm cannot be installed

and indeed the src.rpm fetched by openRemoteFile here was incomplete/invalid.

So, I wonder if there's some way to verify the download of the src.rpm, and/or retry if it's not complete, that would avoid this issue. We would really love to not have a single kojipkgs host as a single point of failure if we can at all avoid it. I would think the http server reports the size (Content-Length) and the client could check the received byte count against that, or perhaps there's some way to validate the rpm after download... whatever works.
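To make the size-check-and-retry idea concrete, here is a minimal sketch in Python. It is not koji's openRemoteFile code: `copy_and_verify`, `download_with_retry` and the `start_download` callable are hypothetical names, and the byte chunks and expected length are assumed to come from whatever HTTP client is in use.

```python
import io


class IncompleteDownload(Exception):
    """Raised when the received byte count differs from the announced size."""


def copy_and_verify(chunks, out, expected_length):
    """Write chunks to out, then compare the total with expected_length.

    expected_length is the integer from the Content-Length header, or None
    if the server did not send one (then nothing can be checked).
    """
    received = 0
    for chunk in chunks:
        out.write(chunk)
        received += len(chunk)
    if expected_length is not None and received != expected_length:
        raise IncompleteDownload(
            "expected %d bytes, got %d" % (expected_length, received))
    return received


def download_with_retry(start_download, attempts=3):
    """Retry until a complete body arrives or attempts run out.

    start_download is a hypothetical callable that opens a fresh request and
    returns (iterable_of_byte_chunks, expected_length_or_None).
    """
    last_error = None
    for _ in range(attempts):
        chunks, expected = start_download()
        buf = io.BytesIO()
        try:
            copy_and_verify(chunks, buf, expected)
            return buf.getvalue()
        except IncompleteDownload as exc:
            last_error = exc  # short read: throw the buffer away and retry
    raise last_error
```

With requests, the chunks could presumably come from `resp.iter_content(8192)` on a streamed response and the length from `resp.headers.get("Content-Length")`. Note that if the server applies a Content-Encoding such as gzip, the decoded byte count may not match Content-Length, so the comparison only makes sense for unencoded bodies.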


So, it turns out our problem was squid in smp mode. It was sometimes causing connections to hang.

However, I think it would still be good to:

  • set a timeout on the src.rpm download so it doesn't wait forever.
  • Check the src.rpm and see that it's valid and matches what you expected to download.

I've implemented a crude RPM sanity check in #293, which should do a reasonable job at verifying the RPM.
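For readers who have not looked at #293, a crude check of this kind might look roughly like the sketch below. This is not the code from #293; `looks_like_rpm` is a hypothetical helper that only inspects the fixed 96-byte RPM lead and the start of the signature header.

```python
import struct

RPM_LEAD_MAGIC = b"\xed\xab\xee\xdb"   # first four bytes of every RPM file
RPM_HEADER_MAGIC = b"\x8e\xad\xe8"     # starts the signature header
LEAD_SIZE = 96                         # the lead is a fixed-size structure
HEADER_INTRO_SIZE = 16                 # magic, version, reserved, two counts


def looks_like_rpm(data, expect_source=None):
    """Return True if data begins like an RPM file.

    expect_source: True to require a source RPM, False to require a binary
    RPM, None to accept either.
    """
    if len(data) < LEAD_SIZE + HEADER_INTRO_SIZE:
        return False                   # too short to even hold the lead
    if data[:4] != RPM_LEAD_MAGIC:
        return False
    # Bytes 6-7 of the lead hold the package type: 0 = binary, 1 = source.
    (rpm_type,) = struct.unpack(">H", data[6:8])
    if expect_source is not None and (rpm_type == 1) != expect_source:
        return False
    # The signature header follows the lead directly.
    return data[LEAD_SIZE:LEAD_SIZE + 3] == RPM_HEADER_MAGIC
```

A check like this catches an empty file or an HTML error page saved under a .src.rpm name, but it would not notice a payload that was cut short later in the file. Catching that needs a size or digest comparison on top.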

A timeout is a bit more tricky, since both httplib2 and requests (#294) implement a timeout as "time to first byte", while we are interested in "no further bytes received in X seconds after transfer start".
I think getting this would require code like DNF has, with "less than X bytes/second for Y seconds", which is tricky to accomplish at a reasonably high level of abstraction (requests vs compatrequests).
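As a rough illustration of the "less than X bytes/second for Y seconds" rule, the sketch below wraps any iterator of byte chunks. It is not DNF's implementation and not tied to requests or httplib2; `guard_low_speed`, `TransferStalled` and the injectable `clock` parameter are hypothetical names chosen for this example.

```python
import time


class TransferStalled(Exception):
    """Raised when throughput stays below the minimum for a whole window."""


def guard_low_speed(chunks, min_bytes_per_sec, window_secs,
                    clock=time.monotonic):
    """Yield chunks unchanged, aborting if the transfer is too slow.

    Throughput is measured over consecutive windows of at least window_secs
    seconds. If a window averages under min_bytes_per_sec, TransferStalled
    is raised.
    """
    window_start = clock()
    window_bytes = 0
    for chunk in chunks:
        yield chunk
        window_bytes += len(chunk)
        now = clock()
        elapsed = now - window_start
        if elapsed >= window_secs:
            if window_bytes / elapsed < min_bytes_per_sec:
                raise TransferStalled(
                    "%d bytes in %.1f seconds" % (window_bytes, elapsed))
            # Window was fast enough: start measuring a new one.
            window_start = now
            window_bytes = 0
```

The limitation is that the check only runs when a chunk actually arrives. If the underlying read blocks forever, the generator never gets control back, so this would have to be paired with a socket-level read timeout that makes each read return or raise eventually.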

Metadata Update from @dgregor:
- Custom field Size adjusted to None
- Issue set to the milestone: 1.20

4 years ago

Metadata Update from @jcupova:
- Issue tagged with: testing-done

4 years ago
