#2427 change nightly cvs checkout seeds to be combined git checkout seed
Closed: Fixed None Opened 13 years ago by kevin.

= Change Requested =

Currently we have nightly checkout seeds for the old cvs server at:
http://cvs.fedoraproject.org/webfiles/
These are still generated every night, even though cvs never changes now.
We should disable these, but it would also be nice if we could do something toward a combined git checkout of all packages in its place.

= Reasoning =

We provided this service for people wanting to check out and look at all cvs files; we should be able to do this with git as well. (I have no idea if it will be larger or smaller, or if it could work at all.)

At the very least we should stop offering the stale cvs ones. ;)


Yes, we should definitely stop offering the old checkouts.

The reason we needed to do that is partly due to general CVS suckage, but even with git it would certainly be easier to pull a tarball, especially since due to the multi-repository layout we have no simple way to check out everything. I'm not sure if you can just tar up a git repository and have it work as a repo after you untar it somewhere else.

Any progress on this?

Currently I have a script that clones each repo, using the package list from https://admin.fedoraproject.org/pkgdb/lists/bugzilla?tg_format=plain
and keeping only the entries with "Fedora" in the first column.

It would be nice if all repos were available via something rsyncable; that would possibly load the servers less than git pull...
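
The filtering step described above could look roughly like the sketch below. It assumes the plain-text list is pipe-separated with the collection name in the first column and the package name in the second; that layout is an assumption and is not confirmed in this ticket. The function name `fedora_packages` and the sample rows are hypothetical.

```python
def fedora_packages(listing: str, collection: str = "Fedora") -> list[str]:
    """Return package names whose first column equals `collection`.

    Assumes pipe-separated rows: collection|package|... (not confirmed here).
    """
    names = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        fields = line.split("|")
        # Exact match, so "Fedora EPEL" rows are not picked up by accident.
        if len(fields) >= 2 and fields[0] == collection:
            names.append(fields[1])
    return names


# Made-up sample rows in the assumed layout.
sample = """\
# collection|package|summary|owner
Fedora|bash|The GNU Bourne Again shell|alice
Fedora EPEL|bash|The GNU Bourne Again shell|bob
Fedora|zsh|Powerful interactive shell|carol
"""
print(fedora_packages(sample))
```

Matching the whole first column, rather than searching for the substring "Fedora", keeps other collections whose names merely start with "Fedora" out of the result.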

Attaching a script for a checkout seed that I think works. Caveats:

  • It only works for creating an anonymous checkout.
  • I'm not sure that it's correct. If someone has more git experience than me, that would be great :-)

I did a diff against a fedpkg clone of a repo to attempt to validate this. There are many differences but like I say, I'm not a git expert so I don't know if these are problems or if they can be fixed in the script. If someone would like to take a look that would be great.

Also, we could try this using git clone and then fixing up the url in the .git/config. nirik and I thought we'd want to avoid that as it might be too CPU intensive but we haven't tested it.
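
As an illustration of what "fixing up the url in the .git/config" might involve, here is a minimal sketch that rewrites the `url` entry of the `origin` remote in the text of a config file. It is not the attached script; the function name `set_origin_url` and both URLs in the example are hypothetical. In practice `git remote set-url origin <url>` does the same job and is probably the safer choice.

```python
import re


def set_origin_url(config_text: str, new_url: str) -> str:
    """Return config_text with the url of [remote "origin"] replaced."""
    out = []
    in_origin = False
    for line in config_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("["):
            # Track which section we are in; only origin is rewritten.
            in_origin = stripped == '[remote "origin"]'
        elif in_origin and re.match(r"url\s*=", stripped):
            indent = line[: len(line) - len(line.lstrip())]
            newline = "\n" if line.endswith("\n") else ""
            line = f"{indent}url = {new_url}{newline}"
        out.append(line)
    return "".join(out)


# Hypothetical example: a clone made from a local path on the server,
# repointed at a placeholder anonymous URL.
before = (
    "[core]\n"
    "\tbare = false\n"
    '[remote "origin"]\n'
    "\turl = /srv/example/bash.git\n"
    "\tfetch = +refs/heads/*:refs/remotes/origin/*\n"
)
print(set_origin_url(before, "git://example.invalid/bash.git"))
```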

Here's the version that uses git clone instead of cp -l:
make-git-checkout-seed.2.sh

Here are some statistics from running the scripts on pkgs01.stg:

Initial run of the git clone script:

3h11m and 7.33GB of disk, of which 2.79GB is the tarball we're outputting.

Initial run of the cp -rl version:

2h43m and 6GB of disk, of which 2.80GB is the tarball we're outputting.

I'll try a second run of git clone today to give a baseline for git pull (in production, it will be somewhat longer -- the git pull in stg won't pick up any changes.)

The git pull run (where nothing changed) took 2h23m. The tarball is very slightly bigger (by 1.7MB) than the one from the original git clone run, which surprised me a little. Does git pull change the data in some way? Do we need to run git gc on the cloned directories once in a while?

I think that a large fraction of the time is being spent running tar and xz. I'm not sure that there's any way we can avoid that. Also note that I'm using xz -2 in the script, which in the past has been a sweet spot between time and compression (it compresses better than either gzip -9 or bzip2 and is faster than bzip2). I'll attach the latest version of the script here.
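
For readers who want to try the tar-plus-xz step on their own, the sketch below produces a `.tar.xz` at compression preset 2 using Python's standard `tarfile` module, which corresponds to `xz -2`. This is not the attached script, which is a shell script; the function name `make_seed_tarball` and its parameters are hypothetical.

```python
import tarfile
from pathlib import Path


def make_seed_tarball(src_dir, out_path, preset=2):
    """Pack src_dir into an xz-compressed tarball at the given preset.

    preset=2 mirrors `xz -2`; higher presets compress better but run slower.
    """
    src = Path(src_dir)
    # "w:xz" writes an xz-compressed archive; `preset` is passed to lzma.
    with tarfile.open(str(out_path), "w:xz", preset=preset) as tar:
        # arcname keeps only the top-level directory name in the archive,
        # so unpacking does not recreate the full source path.
        tar.add(str(src), arcname=src.name)
    return Path(out_path)
```

Calling `make_seed_tarball("checkout", "checkout.tar.xz")` on a directory of clones, and timing it at a few presets, is one way to check whether preset 2 is still the sweet spot for this data.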

git clone & git pull version of the script
make-git-checkout-seed.sh

I've deployed this to production. The stg test run hasn't completed yet but it looks pretty good. It's set to run every Sunday. You should be able to grab a tarball with the seed in this directory: http://pkgs.fedoraproject.org/repo/ on Monday.

I just realized that there's a race condition when creating the tarball where users could download the tar.xz as it's generating and get incomplete results. I'll look at modifying the script to address that but it might not be deployed in time for the first run.
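
One common way to avoid this kind of race is to build the tarball under a temporary name in the same directory and rename it into place only once it is complete. The sketch below shows that pattern; it is not necessarily what the deployed script does, and the names `publish_atomically` and `build` are hypothetical.

```python
import os
import tempfile


def publish_atomically(build, final_path):
    """Run build(tmp_path), then rename the finished file to final_path.

    The temporary file lives in the destination directory so the final
    rename stays on one filesystem and is atomic on POSIX systems.
    """
    final_dir = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(
        dir=final_dir, prefix=".seed-", suffix=".partial"
    )
    os.close(fd)
    try:
        build(tmp_path)
        # mkstemp creates the file with mode 0600; make it world-readable.
        os.chmod(tmp_path, 0o644)
        # Downloaders see either the old complete file or the new one.
        os.replace(tmp_path, final_path)
    except BaseException:
        # On failure, remove the partial file and leave any old tarball alone.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

With this approach a failed or interrupted run leaves last week's tarball in place instead of a truncated one.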

After a few changes, the script successfully went through two runs in stg. Looks good. We can close this once we see that a real checkout was generated in production on Sunday's cron run.

Ran in production and created a 4GB tarball which is almost twice as large as the staging data. It may have been quite a while ago that stg was synced, though, and every branch and new package will add something to the totals. I also saw an error from the script that I believe to be due to this being the first time the script was run. Running it a second time to make sure.

Looks like it worked. Closing ticket.
