#11092 unable to build fedora-toolbox containers
Closed: Fixed 2 months ago by petersen. Opened 4 months ago by petersen.

For some weeks now it seems not possible to build fedora-toolbox containers in the buildsystem:

$ fedpkg container-build
Created task: 93133373
Task info: https://koji.fedoraproject.org/koji/taskinfo?taskID=93133373
Watching tasks (this may be safely interrupted)...
93133373 buildContainer (noarch): free
93133373 buildContainer (noarch): free -> open (buildvm-x86-03.iad2.fedoraproject.org)
93133373 buildContainer (noarch): open (buildvm-x86-03.iad2.fedoraproject.org) -> FAILED: Fault: <Fault 2001: 'Image build failed. Error in plugin orchestrate_build: {"x86_64": {"all_rpm_packages": "404 Client Error for http+docker://localhost/v1.26/containers/fb3a2a38ce3811093589effb48a5b160d18318142adff453c1c49cc5e598d56a/archive?path=%2Fvar%2Flib%2Frpm: Not Found (\\"lstat /usr/lib/sysimage/rpm: no such file or directory\\")"}}. OSBS build id: fedora-toolbox-f37-65bbb-3'>
  0 free  0 open  0 done  1 failed

93133373 buildContainer (noarch) failed

Rawhide fails in the same way - I didn't try f36 today.

  • When do you need this?

asap assuming F37 might get released next week


The current fedora-toolbox images are from August

Metadata Update from @phsmoura:
- Issue tagged with: medium-gain, medium-trouble, ops

4 months ago

Just to add the main problem is that of course fedora-toolbox:37 still has older fedora-repos with testing enabled - so it is not really a good UX.

I dunno why this kind of breakage seems to happen often around release time...
I personally feel fedora-toolbox should really be considered an important release artifact - though I think it is not currently tracked as part of releases - maybe it should be?

I am trying to find out what is happening here with no luck so far.

Interestingly the container image builds seem to finish on aarch64.

I can confirm fedora-toolbox container image builds are still failing on x86_64 as above (as I expected) for rawhide, f37, and f36 (presumably f35 too):

$ TZ=UTC koji-tool tasks -m buildcontainer -u petersen
fedora-toolbox buildContainer TaskFailed 2022-10-25 03:32:40UTC (3m 36s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93416337
fedora-toolbox buildContainer TaskFailed 2022-10-25 03:18:52UTC (3m 31s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93416140
fedora-toolbox buildContainer TaskFailed 2022-10-25 03:11:49UTC (3m 12s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93416034
fedora-toolbox buildContainer TaskFailed 2022-10-17 08:40:43UTC (3m 10s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93133373
fedora-toolbox buildContainer TaskFailed 2022-10-17 08:27:14UTC (3m 12s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93132962
fedora-toolbox buildContainer TaskFailed 2022-09-29 07:33:26UTC (2m 30s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92413086
fedora-toolbox buildContainer TaskFailed 2022-09-29 06:59:51UTC (2m 30s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92412692
fedora-toolbox buildContainer TaskFailed 2022-09-28 08:04:40UTC (2m 44s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92393395
fedora-toolbox buildContainer TaskFailed 2022-09-28 07:53:17UTC (2m 56s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92393215
fedora-toolbox buildContainer TaskClosed 2022-08-15 17:12:17UTC (3m 32s) https://koji.fedoraproject.org/koji/taskinfo?taskID=90840230

I can still reproduce this across rawhide, f37 and f36.

From the rawhide build logs, we can see that the aarch64 build succeeded:

2022-11-08 12:05:06,590 - atomic_reactor.plugin - DEBUG - plugin 'remove_built_image' finished in 2s
2022-11-08 12:05:06,697 - atomic_reactor.inner - INFO - Dockerfile used for build:
FROM registry.fedoraproject.org/fedora:38

ENV NAME=fedora-toolbox VERSION=38
LABEL com.github.containers.toolbox="true" \
      com.github.debarshiray.toolbox="true" \
      com.redhat.component="$NAME" \
      name="$NAME" \
      version="$VERSION" \
      usage="This image is meant to be used with the toolbox command" \
      summary="Base image for creating Fedora toolbox containers" \
      maintainer="Debarshi Ray <rishi@fedoraproject.org>"

COPY README.md /

RUN sed -i '/tsflags=nodocs/d' /etc/dnf/dnf.conf
RUN dnf -y swap coreutils-single coreutils-full

COPY missing-docs /
RUN dnf -y reinstall $(<missing-docs)
RUN rm /missing-docs

COPY extra-packages /
RUN dnf -y install $(<extra-packages)
RUN rm /extra-packages

RUN dnf clean all

LABEL "release"="2" "authoritative-source-url"="registry.fedoraproject.org" "distribution-scope"="public" "vendor"="Fedora Project" "build-date"="2022-11-08T12:02:12.283052" "architecture"="arm64" "vcs-type"="git" "vcs-ref"="32bf967e6d08cc406c56758ee037ed4fb3b978a5" "com.redhat.build-host"="osbs-aarch64-node01.iad2.fedoraproject.org"

2022-11-08 12:05:06,697 - atomic_reactor.inner - INFO - build has finished successfully \o/

... but the x86_64 build didn't:

2022-11-08 12:03:50,962 - atomic_reactor.plugin - DEBUG - plugin 'remove_built_image' finished in 0s
2022-11-08 12:03:51,014 - atomic_reactor.inner - INFO - Dockerfile used for build:
FROM registry.fedoraproject.org/fedora:38

ENV NAME=fedora-toolbox VERSION=38
LABEL com.github.containers.toolbox="true" \
      com.github.debarshiray.toolbox="true" \
      com.redhat.component="$NAME" \
      name="$NAME" \
      version="$VERSION" \
      usage="This image is meant to be used with the toolbox command" \
      summary="Base image for creating Fedora toolbox containers" \
      maintainer="Debarshi Ray <rishi@fedoraproject.org>"

COPY README.md /

RUN sed -i '/tsflags=nodocs/d' /etc/dnf/dnf.conf
RUN dnf -y swap coreutils-single coreutils-full

COPY missing-docs /
RUN dnf -y reinstall $(<missing-docs)
RUN rm /missing-docs

COPY extra-packages /
RUN dnf -y install $(<extra-packages)
RUN rm /extra-packages

RUN dnf clean all

LABEL "release"="2" "authoritative-source-url"="registry.fedoraproject.org" "distribution-scope"="public" "vendor"="Fedora Project" "build-date"="2022-11-08T12:02:12.919528" "architecture"="x86_64" "vcs-type"="git" "vcs-ref"="32bf967e6d08cc406c56758ee037ed4fb3b978a5" "com.redhat.build-host"="osbs-node02.iad2.fedoraproject.org"

2022-11-08 12:03:51,014 - atomic_reactor.inner - ERROR - image build failed: plugin 'all_rpm_packages' raised an exception: RuntimeError: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/lib/python3.10/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugins/post_rpmqa.py", line 72, in gather_output
    bits, _ = self.tasker.get_archive(container_id, RPMDB_PATH)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 375, in get_archive
    return self.tasker.get_archive(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 961, in get_archive
    return self.d.get_archive(container_id, dir_path)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 243, in hooked
    return retry(orig_attr, *args, retry=self.retry_times, **kwargs)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 210, in retry
    return function(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/docker/api/container.py", line 748, in get_archive
    self._raise_for_status(res)
  File "/usr/lib/python3.10/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/lib/python3.10/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugin.py", line 265, in run
    plugin_response = plugin_instance.run()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugins/post_rpmqa.py", line 47, in run
    plugin_output = self.gather_output()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugins/post_rpmqa.py", line 77, in gather_output
    raise RuntimeError(ex) from ex
RuntimeError: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/atomic-reactor", line 33, in <module>
    sys.exit(load_entry_point('atomic-reactor==3.14.0', 'console_scripts', 'atomic-reactor')())
  File "/usr/lib/python3.10/site-packages/atomic_reactor/cli/main.py", line 315, in run
    cli.run()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/cli/main.py", line 297, in run
    args.func(args)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/cli/main.py", line 95, in cli_inside_build
    build_inside(input_method=args.input, input_args=args.input_arg,
  File "/usr/lib/python3.10/site-packages/atomic_reactor/inner.py", line 619, in build_inside
    build_result = dbw.build_docker_image()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/inner.py", line 545, in build_docker_image
    postbuild_runner.run()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugin.py", line 306, in run
    raise PluginFailedException(msg) from ex
atomic_reactor.plugin.PluginFailedException: plugin 'all_rpm_packages' raised an exception: RuntimeError: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")

I'd bet this has something to do with the rpmdb transition and the code in https://github.com/containerbuildsystem/atomic-reactor/blob/308b5b9b4b450e69b69ded1a7a200db7e8da19e2/atomic_reactor/plugins/rpmqa.py hardcoding /var/lib/rpm

Yeah, it's only extracting /var/lib/rpm which is now a symlink.

Oh, yeah, indeed. /var/lib/rpm is now a symlink:

$ ls -ld /var/lib/rpm
lrwxrwxrwx. 1 root root 26 May  4  2022 /var/lib/rpm -> ../../usr/lib/sysimage/rpm

... even on my Fedora 36. I must have missed this transition.
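For illustration, here is a minimal Python sketch of the kind of fix being discussed: resolve the symlink first, then request the real directory. This is not the atomic-reactor code. `resolve_rpmdb_path` is a hypothetical helper, and it works on a local directory tree, whereas the real plugin fetches the path through the Docker API.

```python
import os
import tempfile

RPMDB_PATH = "/var/lib/rpm"  # the path requested in the traceback above


def resolve_rpmdb_path(root, path=RPMDB_PATH):
    """Return the real location of `path` inside the image tree at `root`.

    Hypothetical helper: follows relative symlinks such as
    /var/lib/rpm -> ../../usr/lib/sysimage/rpm and reports the target
    as an absolute path within the image. Absolute symlink targets are
    not re-rooted under `root` (a known limitation of this sketch).
    """
    full = os.path.join(root, path.lstrip("/"))
    real = os.path.realpath(full)
    return "/" + os.path.relpath(real, os.path.realpath(root))


if __name__ == "__main__":
    # Build a throwaway tree that mimics the f36+ layout.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "usr/lib/sysimage/rpm"))
        os.makedirs(os.path.join(root, "var/lib"))
        os.symlink("../../usr/lib/sysimage/rpm",
                   os.path.join(root, "var/lib/rpm"))
        print(resolve_rpmdb_path(root))
```

On a tree without the symlink (the f35 layout) the helper simply returns /var/lib/rpm, so the same call would cover both cases.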

Do we do something different for aarch64?

It's worth noting that this isn't affecting all container builds. The flatpak ones are fine. This might be due to it not using the rpmqa option tho?

So, perhaps there's some way to disable/turn off the rpmqa call here until atomic-reactor can fix that hard coded path?

@cverna do you have any thoughts here? :)

And indeed the f35 (which does not have the rpm db change) toolbox container still builds fine.

Hmm couldn't we just patch the atomic-reactor fedora package for now as a quick initial workaround?
At least we could test that in rawhide?

(seems upstream development is now based on OSBS 2 already iiuc)

It seems that the rpmqa plugin hasn't run on aarch64 at all. Was it disabled in Fedora infra for aarch64?

I opened https://bugzilla.redhat.com/show_bug.cgi?id=2142731 too against atomic-reactor for good measure, but at this point a Fedora PR would probably be more useful, right?
I don't see how patching the path could hurt?

I opened https://src.fedoraproject.org/rpms/atomic-reactor/pull-request/12
with the rpmdb path patch for rawhide.

This shouldn't break flatpak builds, right?

Well, sadly as I feared... it's not using the atomic-reactor from the building repo, but rather from the buildroot image (which is currently f36).

I can update it to f37 easily, but then we have to get an atomic-reactor update pushed out in f37 before we can test it.

I don't understand the atomic-reactor plugins, but there's got to be some way to just disable that plugin. ;(

ok, so I added a sed to the buildroot image Dockerfile and rebuilt it. That gets past that plugin... but now...

2022-11-15 20:34:32,190 platform:- - atomic_reactor.plugins.compare_components - WARNING - Comparison mismatch for component libgcc:
2022-11-15 20:34:32,190 platform:- - atomic_reactor.plugins.compare_components - WARNING - aarch64: libgcc-12.1.1-1.fc36 (999f7cbf38ab71f4)
2022-11-15 20:34:32,190 platform:- - atomic_reactor.plugins.compare_components - WARNING - x86_64: libgcc-12.2.1-3.fc38 (809a8d7ceb10b464)

it's somehow building f36 on the aarch64 cluster, but f38 on the x86_64 cluster? Will try looking some more...
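To make the warning above concrete, here is a rough Python sketch of a cross-architecture component check. It is only an approximation of what compare_components appears to do, judging by the log. `find_mismatches` and its input shape are hypothetical, and the real plugin also compares signatures, which this sketch ignores.

```python
def find_mismatches(components_by_arch):
    """Return {name: {arch: version}} for components whose versions differ.

    `components_by_arch` maps an architecture name to a
    {package_name: version_release} dict. Packages that are missing on
    some architecture are skipped (an assumption of this sketch).
    """
    archs = list(components_by_arch)
    if not archs:
        return {}
    common = set.intersection(*(set(components_by_arch[a]) for a in archs))
    mismatches = {}
    for name in sorted(common):
        versions = {a: components_by_arch[a][name] for a in archs}
        if len(set(versions.values())) > 1:
            mismatches[name] = versions
    return mismatches


if __name__ == "__main__":
    # Values taken from the warning in the build log.
    builds = {
        "aarch64": {"libgcc": "12.1.1-1.fc36"},
        "x86_64": {"libgcc": "12.2.1-3.fc38"},
    }
    for name, versions in find_mismatches(builds).items():
        print(name, versions)
```

A .fc36 package on one side and a .fc38 package on the other suggests the two clusters started from different base images rather than from slightly different package sets.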

Sorry, I missed the ping in this ticket.

We can disable plugins in https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/orchestrator_customize.json and https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/worker_customize.json

The buildroot image is defined here --> https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/buildroot-Dockerfile-production.j2#_1 and it is pointing to Fedora 36 now.

IIRC the Dockerfile ends up being on the master node under /etc/osbs/ if you want to do the test locally before running ansible.

Ping me on IRC if needed, I'll be happy to help.

ok.

Any help here would be quite welcome. ;)

I got around the rpmdb problem, but now it's failing in that compare_components thing and I have no idea why. ;(

I tried a few f37 and f36 builds [0][1] today and they failed in the component compare plugin. I think this is because both our base Fedora container images for F37 and F36 don't have the same content for x86 and aarch64.

I have also noticed that we don't seem to have F37 builds for the base image. I am going to try to get the base image updated with the same content for both architectures.

@petersen How much do you need/want to have aarch64 builds? since we could disable aarch64 for toolbox to get a successful build while we try to fix this issue.

[0] - https://koji.fedoraproject.org/koji/taskinfo?taskID=94488442
[1] - https://koji.fedoraproject.org/koji/taskinfo?taskID=94488670

Ok we lost the aarch64 cluster :-(

[root@osbs-aarch64-master01 ~][PROD-IAD2]# oc get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2022-11-24T14:58:39Z is after 2022-11-24T12:07:29Z
[root@osbs-aarch64-master01 ~][PROD-IAD2]# 

I am not 100% sure how we regenerate and redeploy certificates on OCP clusters :-(
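As a small aid for reading the x509 error above, this Python sketch computes how long ago the certificate expired from the two timestamps in the message. `parse_utc` and `expired_for` are hypothetical helper names. The sketch handles only the "has expired" half of the error, not "not yet valid".

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timestamp in the form used by the error, e.g. 2022-11-24T12:07:29Z."""
    parsed = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def expired_for(not_after, now):
    """Return a timedelta since expiry, or None if the cert is still valid."""
    delta = parse_utc(now) - parse_utc(not_after)
    return delta if delta.total_seconds() > 0 else None


if __name__ == "__main__":
    # Both timestamps are copied from the `oc get nodes` error.
    print(expired_for("2022-11-24T12:07:29Z", "2022-11-24T14:58:39Z"))
```

With the values from the error, the certificate had expired a little under three hours before the command was run.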

Also the aarch64 cluster is running on Fedora 33 boxes, so we should probably try to do a full redeploy :-(

I poked around and I think I have it all back with renewed certs.

We can't upgrade it from f33 because docker is no longer in f34+ and openshift origin / 3.11 doesn't support any newer either.

Thanks kevin,

I gave it another try this morning and it looks like the x86 cluster cannot grab the logs from the aarch64 cluster.

I am not 100% sure why, but it might be a service account token that needs to be regenerated. I don't have permission to run the OSBS playbooks, but it could be worth running playbooks/groups/osbs/configure-osbs.yml and playbooks/groups/osbs/osbs-post-install.yml

2022-11-25 09:07:52,793 platform:- - atomic_reactor.inner - ERROR - image build failed: {"aarch64": {"general": "b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Get https://osbs-aarch64-node02.iad2.fedoraproject.org:10250/containerLogs/worker/fedora-toolbox-f37-65bbb-11-build/custom-build?follow=true: dial tcp: lookup osbs-aarch64-node02.iad2.fedoraproject.org on 10.3.170.147:53: read udp 10.3.170.147:43152-\\\\u003e10.3.170.147:53: read: connection refused\",\"code\":500}'", "pod": {"reason": "Succeeded"}}}

configure-osbs gives:

RUNNING HANDLER [Remove the previous buildroot image] ************************************************
Friday 25 November 2022  18:42:10 +0000 (0:00:02.462)       0:01:57.657 *******                      
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'requests'
fatal: [osbs-master01.iad2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (Docker SDK for Python: docker above 5.0.0 (Python >= 3.6) or docker before 5.0.0 (Python 2.7) or docker-py (Python 2.6)) on osbs-master01.iad2.fedoraproject.org's Python /usr/bin/python3.6. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter, for example via `pip install docker` (Python >= 3.6) or `pip install docker==4.4.4` (Python 2.7) or `pip install docker-py` (Python 2.6). The error was: No module named 'requests'"}
fatal: [osbs-aarch64-master01.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'requests'
fatal: [osbs-node01.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (Docker SDK for Python: docker above 5.0.0 (Python >= 3.6) or docker before 5.0.0 (Python 2.7) or docker-py (Python 2.6)) on osbs-node01.stg.iad2.fedoraproject.org's Python /usr/bin/python3.6. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter, for example via `pip install docker` (Python >= 3.6) or `pip install docker==4.4.4` (Python 2.7) or `pip install docker-py` (Python 2.6). The error was: No module named 'requests'"}
fatal: [osbs-aarch64-node01.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'requests'
fatal: [osbs-node02.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (Docker SDK for Python: docker above 5.0.0 (Python >= 3.6) or docker before 5.0.0 (Python 2.7) or docker-py (Python 2.6)) on osbs-node02.stg.iad2.fedoraproject.org's Python /usr/bin/python3.6. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter, for example via `pip install docker` (Python >= 3.6) or `pip install docker==4.4.4` (Python 2.7) or `pip install docker-py` (Python 2.6). The error was: No module named 'requests'"}
fatal: [osbs-aarch64-node02.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
fatal: [osbs-aarch64-node01.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
fatal: [osbs-aarch64-node02.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}

:(

So, I poked around at this some more.

First, we hit a problem with koji-containerbuild. The new f37 one (in updates) uses a different channel and then also updates some OSBS handling. See https://pagure.io/fedora-infrastructure/issue/11020
I avoided that for now by downgrading back to the version we had before with f36.

Second. The aarch64 cluster nodes seem fine for a while, then they drop off.

Finally the current error seems to be:
atomic_reactor.plugin.PluginFailedException: {"aarch64": {"general": "HTTPSConnectionPool(host='osbs-aarch64-master01.iad2.fedoraproject.org', port=8443): Read timed out."}}
which I guess as you said is the builder trying to get the aarch64 result? Not clear.

So, I am wondering if we should just re-install the aarch64 cluster, since it seems unhappy somehow.
But perhaps we can figure out what's going on.

ok. We finally got a build of the f37 toolbox container. :)

Can you please try rawhide and any others you want to do and confirm it's working now?

very sorry for this long saga.

Thanks so much!

Can confirm builds succeeding now, yay

Metadata Update from @petersen:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

2 months ago
