#722 [dist-git] add scripts to clear lookaside cache of old sources
Merged 4 years ago by dturecek. Opened 5 years ago by dturecek.
copr/ dturecek/copr clear-lookaside  into  master

@@ -82,6 +82,8 @@ 

  

  cp -a conf/logrotate %{buildroot}%{_sysconfdir}/logrotate.d/copr-dist-git

  

+ mv %{buildroot}%{_bindir}/remove_unused_sources %{buildroot}%{_bindir}/copr-prune-dist-git-sources

+ 

  # for ghost files

  touch %{buildroot}%{_var}/log/copr-dist-git/main.log

  

@@ -0,0 +1,15 @@ 

+ #!/usr/bin/python3

+ import os

+ import subprocess

+ import re

+ 

+ lookaside_cache = "/var/lib/dist-git/cache/lookaside/pkgs"

+ git_dir = "/var/lib/dist-git/git"

+ 

+ for user in os.listdir(lookaside_cache):

+     for project in os.listdir(os.path.join(lookaside_cache, user)):

+         for package in os.listdir(os.path.join(lookaside_cache, user, project)):

+             pkg_git_dir = os.path.join(git_dir, user, project, package + ".git")

+             pkg_lookaside_dir = os.path.join(lookaside_cache, user, project, package)

+             # TODO use the script from dist-git once it's merged there

+             subprocess.call(['/usr/bin/copr-prune-dist-git-sources', pkg_git_dir, pkg_lookaside_dir])

@@ -0,0 +1,83 @@ 

+ #!/bin/bash

+ 

+ # TODO

+ # This is a copy of a script from dist-git and is only here until the original

+ # is merged and released. The original should then be used and this should be deleted.

+ # https://github.com/release-engineering/dist-git/pull/24

+ 

+ Usage() {

+     cat <<EOF

+ Usage:

+     $0 <package_git_directory> <package_lookaside_directory>

+ 

+     Removes all tarballs in <package_lookaside_directory> except

+     for tarballs that are referenced by the latest commit of each

+     branch in <package_git_directory>.

+ EOF

+ }

+ 

+ die() { echo "$*" 1>&2 ; exit 1; }

+ 

+ if [[ $# != 2 ]]; then

+     Usage

+     exit 1

+ fi

+ 

+ pkg_git_dir="$1"

+ pkg_lookaside_dir="$2"

+ 

+ if [ ! -d "$pkg_git_dir" ]; then

+     echo "$pkg_git_dir is not a valid directory."

+     exit 1

+ fi

+ if [ ! -d "$pkg_lookaside_dir" ]; then

+     echo "$pkg_lookaside_dir is not a valid directory."

+     exit 1

+ fi

+ 

+ pushd "$pkg_git_dir" > /dev/null || exit 1

+ 

+ whitelist=()

+ 

+ # find sources that are referenced by the latest commit in any of the branches

+ for branch in $(git for-each-ref --format="%(refname:short)" refs/heads); do

+     while read -r line; do

+         set -- $line

+         hash=$1

+         filename=$2

+         # skip projects using the new format

+         test $# -eq 0 && continue

+         test $# -ne 2 && die "Unsupported format. Only the old '<SUM> <FILENAME>' format is used."

+         whitelist+=("$hash","$filename")

+     done < <(git show "$branch":sources)

+ done

+ 

+ # remove all source files that are not referenced

+ while read -r file; do

+     old_IFS=$IFS

+     IFS=/

+     set -- $file

+     IFS=$old_IFS

+ 

+     # safety measure, if this is really the layout we expect the first and

+     # third part matches

+     test "$1" = "$3" || continue

+ 

+     filename=$1

+     hash=$2

+ 

+     keep=false

+     for source in "${whitelist[@]}"; do

+         IFS=','

+         set -- $source

+         IFS=$old_IFS

+         # keep sources where tarname and hash match the referenced ones

+         if test "$1" = "$hash" -a "$2" = "$filename"; then

+             keep=true

+             break

+         fi

+     done

+ 

+     "$keep" && continue

+     unlink "$pkg_lookaside_dir/$file"

+ done < <( cd "$pkg_lookaside_dir" || exit 1 ; find . -mindepth 3 -maxdepth 3 -type f  -printf '%P\n' )

This adds a script that goes over every package in the lookaside cache and calls dist-git's script to remove all sources except those referenced by the latest commit of each branch.

I'm temporarily adding the dist-git script here so that it can be used straight away before it's merged in dist-git.
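The pruning rule applied by the bash script in the diff can be sketched in Python as follows. This is only an illustration, not the code that was merged: the function names `parse_sources` and `files_to_remove` are hypothetical, and the sketch assumes the old `<SUM> <FILENAME>` format of the `sources` file and the `<filename>/<hash>/<filename>` lookaside layout that the script checks for.

```python
def parse_sources(text):
    """Parse old-format `sources` content into a set of (hash, filename) pairs."""
    referenced = set()
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # blank line, nothing to record
        if len(parts) != 2:
            raise ValueError("unsupported sources format: %r" % line)
        referenced.add((parts[0], parts[1]))
    return referenced


def files_to_remove(lookaside_files, referenced):
    """Return the relative lookaside paths that no branch head references.

    Each path is expected to look like '<filename>/<hash>/<filename>'.
    """
    stale = []
    for path in lookaside_files:
        parts = path.split("/")
        # Safety check mirroring the bash script: skip anything that does
        # not match the expected layout (first and third part identical).
        if len(parts) != 3 or parts[0] != parts[2]:
            continue
        filename, digest = parts[0], parts[1]
        if (digest, filename) not in referenced:
            stale.append(path)
    return stale
```

In the real script, the set of referenced pairs is the union over all branches of the `sources` file at each branch head, and the stale paths are unlinked relative to the package's lookaside directory.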

This can be huge. I think that 3 for-loops would be better.

dist-git/run/clear_lookaside_pache.py

There seems to be a typo in the filename

You don't need those parentheses here, it can simply be '/'.join(...)

Is the i variable used for anything? Pylint doesn't report it as an unused variable, so maybe it is used, but I don't see where.

dist-git/run/remove_unused_sources

Can we add a note somewhere saying that the script is a copy of the one proposed in
https://github.com/release-engineering/dist-git/pull/24/ and that we therefore want to remove it once a new dist-git is released?

rebased onto 37e640d18337fd5333c346b2a0126a7f2aa53478

5 years ago

2 new commits added

  • [dist-git] temporarily add a dist-git script to clean one package
  • [dist-git] add script to clear lookaside cache of old sources
5 years ago

Thanks for the feedback, I've addressed all of your points.

2 new commits added

  • [dist-git] temporarily add a dist-git script to clean one package
  • [dist-git] add script to clear lookaside cache of old sources
5 years ago

Rebasing so that the remove_unused_sources script matches the one in the dist-git PR.

1 new commit added

  • [dist-git] call clear_lookaside_cache with only two arguments
5 years ago

3 new commits added

  • [dist-git] call remove_unused_sources with only two arguments
  • [dist-git] temporarily add a dist-git script to clean one package
  • [dist-git] add script to clear lookaside cache of old sources
5 years ago

And call remove_unused_sources correctly now that it only needs path to pkg_git_dir and pkg_lookaside_dir.

Could we call e.g. /usr/libexec/copr-prune-dist-git-sources? Two points -> non-relative path, and name the file with "copr_" prefix.

In general, I think this is fine. I commented on the issues related to the prune script itself in the upstream dist-git PR.

Metadata Update from @praiskup:
- Pull-request tagged with: needs-work

5 years ago

rebased onto 238fe6a3329f420cbd7bf5a472b9b8953429e9b9

4 years ago

I've updated this to match the current code in dist-git PR, and I've moved the upstream's script to /usr/bin/copr-prune-dist-git-sources.

rebased onto 0e98dbb40aa5c5ff8665b226c0102dfeec541a8f

4 years ago

We noticed that projects that have "pull-request" CoprDirs still don't have dist-git support implemented properly, and we mix up their dist-git content with the "production" branches/lookaside. Therefore, instead of losing something we don't want to lose, I suggest skipping projects which contain some sub-directory named 'pr:<ID>'.

rebased onto ebd6a71a68bff3ebe713669eaf1c3dc58d1e2077

4 years ago

I changed the script so that it skips all packages for which there is a pull-request project that has the same package.

And I've updated the dist-git script to match the upstream PR.

not only the *\:pr\:* dirs should be skipped, but also the directories which have any *\:pr\:* counterpart

Off hand I don't see why you need this, can you please add a comment?

That's because I don't want to skip all projects of a user that has a PR project, as there might be other packages built there as well (not affected by the PR project). So if there is a PR project, this finds all packages that were built there, and when going over the main loop I skip all projects building such packages.

However, I am not sure if this really is needed, as for example for @copr all the packages are skipped anyway.

Ok, I misunderstood the intention, right - I think your plan is correct then. You only need to fix the iteration over projects (you skip e.g. all projects which don't have "pr:ID" in dir name).

rebased onto 7699d7779c95ea9d029b49abb993ff1f9c81486c

4 years ago

It seems that if package "foo" is in a pull-request of any project, it's going to be ignored in all my projects ... but if "project1:pr:1" contains the package "foo", it doesn't mean I want to skip the package in "project2".

rebased onto 0ad8eb38b0475f009c8e38e4fc7611dddeca0680

4 years ago

Now I see what you mean. I think now I've got it right.

You can split by whole colon-pr-colon string.

pkgs = skip_pkgs.get(project_name, [])

rebased onto 017adf7d02ce11b7420064e1187679981e673643

4 years ago

using set() would be more correct I guess, you could just do pkgs.add(package) without checking for existence
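The review suggestions above (split on the whole `:pr:` string, look packages up with `skip_pkgs.get(project_name, ...)`, and store them in a set) could be combined roughly as below. This is a sketch under assumptions, not the code from the PR: the helper names `collect_skip_pkgs` and `should_skip` are hypothetical, and the input is a plain dict standing in for the directory listing of one user.

```python
from collections import defaultdict

PR_MARKER = ":pr:"


def collect_skip_pkgs(projects):
    """Map each main project name to the set of packages built in its PR dirs.

    `projects` maps a project directory name to an iterable of package names.
    """
    skip_pkgs = defaultdict(set)
    for project, packages in projects.items():
        if PR_MARKER not in project:
            continue
        # "project1:pr:1" -> "project1"
        main_project = project.split(PR_MARKER)[0]
        for package in packages:
            # A set needs no membership check before adding.
            skip_pkgs[main_project].add(package)
    return skip_pkgs


def should_skip(skip_pkgs, project, package):
    """True if `package` was built in a PR dir belonging to `project`."""
    return package in skip_pkgs.get(project, set())
```

Keying the map by the main project name addresses the earlier concern in the thread: a package found in "project1:pr:1" is skipped only in "project1", not in the user's other projects.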

Correct, plus I'd say that if "project" ends with "PR:ID", remove everything without asking...

Well, that could affect running builds .... so remove every file/directory older than 14 days, e.g.

rebased onto a35192d10c2216978cca92cfe8d97e45ae331472

4 years ago

Can you please remove regular files from colon-PR-colon projects here, if the package directory is older than some constant?
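An age-based cleanup like the one requested here might look like the following sketch. It is hypothetical: the 14-day constant comes from the earlier comment, the names `MAX_AGE_DAYS`, `is_older_than` and `stale_regular_files` are made up for illustration, and the age is checked per file via its modification time, whereas the request speaks of the age of the package directory.

```python
import os
import time

MAX_AGE_DAYS = 14  # assumed threshold, taken from the review comment


def is_older_than(path, days=MAX_AGE_DAYS, now=None):
    """True if the modification time of `path` is more than `days` days ago."""
    now = time.time() if now is None else now
    return now - os.stat(path).st_mtime > days * 24 * 3600


def stale_regular_files(package_dir, days=MAX_AGE_DAYS, now=None):
    """List regular files (no symlinks) under `package_dir` older than `days`."""
    stale = []
    for root, _dirs, files in os.walk(package_dir):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if is_older_than(path, days, now):
                stale.append(path)
    return sorted(stale)
```

The function only reports candidates; a caller would still have to unlink them, which keeps the destructive step separate and easy to dry-run.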

We actually have separate git repositories for the pull-request CoprDirs! So there's no clash with the "main" CoprDir at all. It means we can treat the PR directory like any other, and we don't need a list of "blacklisted" packages; it will be much simpler.

Sorry for spreading confusion, I really don't know why I thought this
wasn't implemented yet (it seems to have been done correctly since the beginning).

rebased onto 6723c9724ca46cdb0827936028314e4c75f73a33

4 years ago

I see... I should have checked that.

So now I just iterate over each user/project/package, right?

rebased onto a9c03840c16998190e778a3075cea4b7fcce5489

4 years ago

rebased onto dde79d4

4 years ago

Pull-Request has been merged by dturecek

4 years ago

Metadata Update from @dturecek:
- Pull-request untagged with: needs-work

4 years ago