#1804 copr_prune_results common tracebacks
Closed: Fixed 3 years ago by praiskup. Opened 3 years ago by praiskup.

Traceback 1

[2021-04-29 05:16:17,745][ ERROR][PID:350511][copr_prune_results][copr_prune_results.py:run_prunerepo:69] Error pruning chroot omos/drop-selinux-disable/fedora-rawhide-aarch64
Traceback (most recent call last):
  File "/usr/bin/copr_prune_results.py", line 67, in run_prunerepo
    clean_copr(chroot_path, prune_days, verbose=True) 
  File "/usr/bin/copr_prune_results.py", line 205, in clean_copr
    shutil.rmtree(dir_path)
  File "/usr/lib64/python3.9/shutil.py", line 718, in rmtree
    _rmtree_safe_fd(fd, path, onerror)
  File "/usr/lib64/python3.9/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib64/python3.9/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  File "/usr/lib64/python3.9/shutil.py", line 655, in _rmtree_safe_fd
    _rmtree_safe_fd(dirfd, fullname, onerror)
  [Previous line repeated 3 more times]
  File "/usr/lib64/python3.9/shutil.py", line 659, in _rmtree_safe_fd
    onerror(os.rmdir, fullname, sys.exc_info())
  File "/usr/lib64/python3.9/shutil.py", line 657, in _rmtree_safe_fd
    os.rmdir(entry.name, dir_fd=topfd)
PermissionError: [Errno 13] Permission denied: 'var'

Traceback 2

[2021-04-29 05:16:17,871][ ERROR][PID:350517][copr_prune_results][copr_prune_results.py:run_prunerepo:69] Error pruning chroot omos/selinux-policy-testing/fedora-34-x86_64
Traceback (most recent call last):
  File "/usr/bin/copr_prune_results.py", line 62, in run_prunerepo
    rpms = get_rpms_to_remove(chroot_path, days=prune_days, log=LOG)
  File "/usr/lib/python3.9/site-packages/prunerepo/helpers.py", line 194, in get_rpms_to_remove
    if time.time() - get_package_build_time(rpm, log) < days * 24 * 3600:
  File "/usr/lib/python3.9/site-packages/prunerepo/helpers.py", line 63, in get_package_build_time
    stdout = run_cmd(query_cmd, log)
  File "/usr/lib/python3.9/site-packages/prunerepo/helpers.py", line 54, in run_cmd
    raise PrunerepoException("Command {} failed".format(str_cmd))
prunerepo.helpers.PrunerepoException: Command /usr/bin/rpm -qp --queryformat %{BUILDTIME} /var/lib/copr/public_html/results/omos/selinux-policy-testing/fedora-34-x86_64/02010948-selinux-policy/selinux-policy-minimum-3.14.8-3.20210221_101800.210f301.fc34.noarch.rpm failed

Traceback 3:

[2021-04-29 05:17:27,195][ ERROR][PID:350518][copr_prune_results][copr_prune_results.py:run_prunerepo:69] Error pruning chroot aikidouke/yadm/fedora-rawhide-x86_64
Traceback (most recent call last):
  File "/usr/bin/copr_prune_results.py", line 62, in run_prunerepo
    rpms = get_rpms_to_remove(chroot_path, days=prune_days, log=LOG)
  File "/usr/lib/python3.9/site-packages/prunerepo/helpers.py", line 181, in get_rpms_to_remove
    latest_rpms = get_rpms(get_latest_packages_cmd, directory, log)
  File "/usr/lib/python3.9/site-packages/prunerepo/helpers.py", line 71, in get_rpms
    stdout = run_cmd(repoquery_cmd, log)  # returns srpms as well
  File "/usr/lib/python3.9/site-packages/prunerepo/helpers.py", line 54, in run_cmd
    raise PrunerepoException("Command {} failed".format(str_cmd))
prunerepo.helpers.PrunerepoException: Command dnf repoquery --repofrompath=prunerepo_query,/var/lib/copr/public_html/results/aikidouke/yadm/fedora-rawhide-x86_64 --repo=prunerepo_query --refresh --queryformat=%{location} --quiet --setopt=skip_if_unavailable=False --latest-limit=1 failed

```


Tracaback 3 is because of what I wrote here https://pagure.io/copr/copr/pull-request/1796, when repoquery fail prunerepo log Error: Failed to download metadata for repo 'prunerepo_query': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried, but it is now just a debug print so in the log you can just see Error pruning chroot aikidouke/yadm/fedora-rawhide-x86_64

Traceback 2, it looks like it tried to get the build time of the package which wasn't there. That src.rpm was paired with the noarch.rpm file, even if it is no longer in that directory. So wrapping the call get_package_build_time (rpm, log) in try-except could help.

The traceback 3 is caused by entirely missing repodata, I think (some bug). Traceback is caused by removed RPMs, but the repodata still point at them... (some other bug). Traceback 1 is fixed by #1805.

What is a serious problem is the frequency of these issues:
$ grep Traceback /var/log/copr-backend/pruner-full-log.log | wc -l
1812

It means that we have 1812 broken repositories :( and we need to find a way to
fix them (re-running createrepo there?).

Ok, I think that Traceback 2 is already solved ... it looks like result of copr_prune_repo.py failures in previous releases (or older prunerepo failures). Before we a) removed the RPMs and b) then we run the createrepo_c (and appstream-builder, etc.).

The problem we see in Traceback 2 is successful RPM removal, but non-successful createrepo run (thus the old metadata are kept, pointing to already non-existing RPMs). The new code does this vice-versa.

Examples of traceback 3:

  1. https://download.copr.fedorainfracloud.org/results/jenslody/codeblocks/opensuse-tumbleweed-x86_64/00768203-codeblocks/backend.log.gz (old code, rsync gave up too early)
    Another instances:
    https://download.copr.fedorainfracloud.org/results/avsej/zig/fedora-rawhide-i386/00772397-zig/backend.log.gz
    https://download.copr.fedorainfracloud.org/results/kwenning/pacemaker-2.0/epel-6-x86_64/00740185-pacemaker/backend.log.gz

  2. https://download.copr.fedorainfracloud.org/results/till/cura:pr:8/fedora-33-x86_64/01640706-cura-fdm-materials/backend.log.gz (this is action processing failure)
    Another instances:
    https://download.copr.fedorainfracloud.org/results/bowlofeggs/bodhi/fedora-rawhide-aarch64/01903205-bodhi/backend.log.gz
    https://download.copr.fedorainfracloud.org/results/frostyx/copr-dev/fedora-32-x86_64/01600160-copr-backend/backend.log.gz
    https://download.copr.fedorainfracloud.org/results/jonny/Monero/fedora-32-x86_64/01696503-monero/backend.log.gz
    ...

  3. EPEL 7 in https://copr.fedorainfracloud.org/coprs/willo/vagrant/builds/
    dunno, but old build as hell, clean_copr() already removed leftovers, likely
    the first case.

The full list:

# grep "Command dnf repoquery" /var/log/copr-backend/pruner-full-log.log  | sed 's/.*prunerepo_query,//' | sed 's/ .*//' | wc -l
385

Ful list of affected directories, with contents, is here:
https://copr-be.cloud.fedoraproject.org/stats/issue-1804-traceback-3.log

I think the pattern is clear... the older builds are usually the case 1., the
newer builds are 2.

So I think we should merge the related PRs, and release new copr-backend. This will avoid any future instances for TB1. TB2 is resolved now so shouldn't appear again. TB3 is something I haven't seen for some time, but I'm not sure what fixed it (the initial createrepo usually works nowadays), but we would have at least better logging with the recent patches..

Once we have deployed the new backend, we should recursively traverse all the result directories, and
1. fix file permissions causing TB1
2. submit a full createrepo action for all affected projects
3. run full copr_prune_results --no-mtime-optimzations
4. enable the cron job ... and watch residual problems in logs

Can anyone please take a look at the proposal?

Metadata Update from @praiskup:
- Issue assigned to praiskup

3 years ago

Commit 9872e8e relates to this ticket

Commit 60a7edf relates to this ticket

Can anyone please take a look at the proposal above?

I see no problem with your proposal :-)

Commit e3ad541 relates to this ticket

2) step blocked, we need to apply another pr #1813.... I'll continue later. In the meantime, I believe the permissions from 1. will be fixed.

3) step 3 tried, though it ended with:

[2021-05-03 15:57:28,346][ ERROR][PID:2941109][copr_prune_results][copr_prune_results.py:run_prunerepo:69] Error pruning chroot @python/python3.10/fedora-rawhide-x86_64
Traceback (most recent call last):
  File "/usr/bin/copr_prune_results.py", line 65, in run_prunerepo
    call_copr_repo(directory=chroot_path, rpms_to_remove=rpms,
  File "/usr/lib/python3.9/site-packages/copr_backend/helpers.py", line 641, in call_copr_repo
    result = run_cmd(cmd, timeout=timeout, logger=logger, catch_timeout=True)
  File "/usr/lib/python3.9/site-packages/copr_backend/helpers.py", line 91, in run_cmd
    process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=shell, encoding="utf-8")
  File "/usr/lib64/python3.9/subprocess.py", line 951, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib64/python3.9/subprocess.py", line 1821, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 7] Argument list too long: 'copr-repo'

I've run prunerepo and createrepo manually against that large project, to miminimize the command length. The next prune repo run took:

# date ; time runuser -c "PYTHONUNBUFFERED=1 python3 /usr/bin/copr_prune_results.py --no-mtime-optimization &> /dev/null" - copr
Mon May  3 08:29:52 PM UTC 2021

real    399m31.577s
user    463m17.717s
sys     107m3.633s

Without --no-mtime-optimization it will be even faster.

The next daily runs, at least for now, but I created #1817 that we should resolve soon.

Metadata Update from @praiskup:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

Log in to comment on this ticket.

Metadata
Related Pull Requests
  • #1809 Merged 3 years ago
  • #1806 Merged 3 years ago
  • #1805 Closed 3 years ago