#5846 move away from md5 for look-aside cache
Closed: Fixed 2 years ago Opened 5 years ago by till.

The lookaside cache uses md5, but something more secure like sha-256 or sha-512 should be used instead. Maybe it should even be made to allow easy changes in the future.

This was previously discussed in [https://fedorahosted.org/fedora-infrastructure/ticket/3358 Intrastructure ticket 3358].

A few explanations on the patches I just attached.

The pyrpkg patch introduces a new option in the configuration file: lookasidefallbackhash.

This hash is to be used when the verification fails using the normal lookasidehash. However, files are always uploaded with the normal lookasidehash.

Then, the fedpkg changes just adapt to that new option, and set the lookasidehash sha256 instead of md5, while setting the new lookasidefallbackhash to md5.

With these 3 changes applied, fedpkg now uploads sources with the sha256 hash, but can still verify previously uploaded sources.

Wouldn't it be easier to change the syntax of the sources file, so it is clear, which hash was used?

If the old format of the sources file is used, it is md5, otherwise the sources file says which one to use.

What are the consumers of the sources file except rpkg?

Wouldn't it be easier to change the syntax of the sources file, so it is clear, which hash was used?

Other tools are based on the pyrpkg API, I know of at least:
nbpkg: my own fedpkg equivalent at $dayjob
goosepkg: GoOSe Linux's fedpkg equivalent
* rhpkg: RHEL's fedpkg equivalent

There might be others, for example I wouldn't be surprised if Centos was using it too.

The fact that it was md5 previously is configured in the /etc/fedpkg.conf file, so you can't assume that all of these tools have been using md5 so far: they might very well have started from scratch with sha256 or example.

To me, this is a strong argument in favour of being able to specify the previously used hash, as a fallback.

It doesn't preclude the fact that we could indeed change the format of the sources file, of course.

You mentioned the advantage: it makes it explicit which hash was used.

The drawback is that the currently trivial way to fill it without fedpkg would not work any more:

$ md5sum foo-1.0.tar.xz > sources

In any case, I'd be happy to provide additional patches to implement this if people feel like it would be interesting. I still think that introducing a « previous hash » configuration option is important.

It seems to me that the hash used can be determined by the length of the hash string. Those are constant.

$ for hash in md5sum sha{1,256,512}sum; do len=$( $hash ~/.bashrc | awk '{printf "%s", $1}' | wc -c ); printf "%-10s: %3s\n" $hash $len; done
md5sum : 32
sha1sum : 40
sha256sum : 64
sha512sum : 128

It would even be possible to determine individually for different items in a single source file and have the tools use the appropriate tool to verify each item (though mixing them seems like it would be rather pointless ;).

But teaching the tools that verify sources to use the non-changing format of the various hash algorithms would allow changes to be made to the hash used to generate sources without requiring that every packager updated fedpkg first, wouldn't it? Then, as soon as the builders were updated to recognize a new hash algorithm, it could be used in a sources file.

(Apologies for chiming in without any code to show, I feel guilty about that.)

If we're going to change the hash, we might want to go right to SHA-3 (Keccak).

It might be necessary that the server side needs to be adjusted as well. Maybe someone who knows can confirm or deny this.

Indeed, I completely forgot about the server side. :-)

There's a lot of "md5" in the source, but it all seems to be just names of variables which were chosen assuming we were using md5 as the hashing. Those could be renamed to make things clearer, but that wouldn't change anything not to do it.

The only thing really related to md5 I could find with a quick look is this:

That obviously needs to be adapted.

I'll see if I can get some time today to work on that, but anyone should feel free to beat me to it. ;-)

Indeed, I didn't thinka bout that.

And as a bonus, it makes the change much simpler! :)

So, I've been attaching quite a few patches here, and I can't find a way to mark some as obsolete, so here are the details.

First, these patches can be totally ignored, as I found tmz's suggestion comment 8 would be better than what I had done:

  • 0001-lookaside-Add-a-new-fallback-hash.patch​
  • 0001-lookaside-Adapt-to-the-rpkg-change.patch​
  • 0002-lookaside-Use-sha256-instead-of-md5-by-default.patch​

Then, 0001-lookaside-Determine-which-hash-function-was-used.patch is a patch to apply to rpkg, which implements the following:

  • when verifying a source file, determine the hash used from the length of the hashed string
  • nothing changes when uploading: the hash used is the one set in the configuration file

Finally, 0001-lookaside-Support-more-than-just-md5sum.patch is for the upload.cgi script on the lookaside server, so that when a file is uploaded, we don't assume the hash is md5, but determine it from the length of the hashed string.

One note though: the patch for the lookaside drops compatibility with older versions of Python which didn't have the hashlib module.

Those versions didn't have hash functions other than md5 and sha1, so it wouldn't have been possible to deal with sha256 or sha512.

I was assured that our lookaside runs on EL 6 though, which has a recent enough Python, and it's the system I tested these changes on.

I just sent some patches to the appropriate lists, based on the discussion at a previous releng meeting:

As Dennis requested, this avoids breaking the pyrpkg API by limiting the client-side changes to fedpkg only.

Has anybody looked at git-annex as a way to store references to source files, and fetch them down? https://git-annex.branchable.com/

It would be more of a change than what's been talked about here, but it would remove bits of homegrown software, and would more natively fit with the rest of the git structure. Just a thought, I'm not really going to be putting any work in here :)

An update on the current status.

This follows the migration plan I had outlined on the mailing-list:


Staging has distgit in Ansible, so I've rebased the first part of my patches to it, and Pierre-Yves merged them. (thanks!)

The current status is that the staging lookaside can now accept uploads as SHA512, but still falls back on MD5. It can also store the uploaded sources in the new directory layout (containing the hash type), but doesn't do it for MD5.

So all in all, with all this merged, we are preparing for the migration, but absolutely nothing should change.

Again, this is only in staging for now.

This was deferred to after the Fedora 22 release:

Lots of things have been happening for this ticket, but I never really mentioned it in here.

We're getting close to the point where we could actually migrate, most of the preparation work has been done.

So here's a long-overdue status report, with as much detail as possible.

There doesn't seem to be notification emails sent when I edited the previous comment as things got done, so here's a new one to make sure interested folks receive an email and know where we stand. :-)

As detailed in the previous comment, we are now at the point where we could just flip the switch in fedpkg to start using the new {{{sources}}} file format and sha512 hashes.

At least all the client code is ready and in all Fedora/EPEL branches. The server code is ready as well and has been used for a few months already.

Now all that remains is decide when we do that cert change (ticket #6111) and do this at the same time.

we are now using sha512 for the chacksums of lookaside cache

Metadata Update from @ausil:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 years ago

Login to comment on this ticket.