For some weeks now it has not been possible to build fedora-toolbox containers in the build system:
$ fedpkg container-build
Created task: 93133373
Task info: https://koji.fedoraproject.org/koji/taskinfo?taskID=93133373
Watching tasks (this may be safely interrupted)...
93133373 buildContainer (noarch): free
93133373 buildContainer (noarch): free -> open (buildvm-x86-03.iad2.fedoraproject.org)
93133373 buildContainer (noarch): open (buildvm-x86-03.iad2.fedoraproject.org) -> FAILED: Fault: <Fault 2001: 'Image build failed. Error in plugin orchestrate_build: {"x86_64": {"all_rpm_packages": "404 Client Error for http+docker://localhost/v1.26/containers/fb3a2a38ce3811093589effb48a5b160d18318142adff453c1c49cc5e598d56a/archive?path=%2Fvar%2Flib%2Frpm: Not Found (\\"lstat /usr/lib/sysimage/rpm: no such file or directory\\")"}}. OSBS build id: fedora-toolbox-f37-65bbb-3'>
  0 free  0 open  0 done  1 failed

93133373 buildContainer (noarch) failed
Rawhide fails in the same way - I didn't try f36 today.
ASAP, assuming F37 might get released next week.
The current fedora-toolbox images are from August
Metadata Update from @phsmoura: - Issue tagged with: medium-gain, medium-trouble, ops
Just to add: the main problem is that fedora-toolbox:37 still has the older fedora-repos with testing enabled, so it is not really a good UX.
I dunno why this kind of breakage seems to happen so often around release time... I personally feel fedora-toolbox should really be considered an important release artifact. I think it is not currently tracked as part of releases; maybe it should be?
I am trying to find out what is happening here with no luck so far.
Interestingly the container image builds seem to finish on aarch64.
I can confirm fedora-toolbox container image builds are still failing on x86_64 as above (as I expected) for rawhide, f37, and f36 (presumably f35 too):
$ TZ=UTC koji-tool tasks -m buildcontainer -u petersen
fedora-toolbox buildContainer TaskFailed 2022-10-25 03:32:40UTC (3m 36s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93416337
fedora-toolbox buildContainer TaskFailed 2022-10-25 03:18:52UTC (3m 31s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93416140
fedora-toolbox buildContainer TaskFailed 2022-10-25 03:11:49UTC (3m 12s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93416034
fedora-toolbox buildContainer TaskFailed 2022-10-17 08:40:43UTC (3m 10s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93133373
fedora-toolbox buildContainer TaskFailed 2022-10-17 08:27:14UTC (3m 12s) https://koji.fedoraproject.org/koji/taskinfo?taskID=93132962
fedora-toolbox buildContainer TaskFailed 2022-09-29 07:33:26UTC (2m 30s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92413086
fedora-toolbox buildContainer TaskFailed 2022-09-29 06:59:51UTC (2m 30s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92412692
fedora-toolbox buildContainer TaskFailed 2022-09-28 08:04:40UTC (2m 44s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92393395
fedora-toolbox buildContainer TaskFailed 2022-09-28 07:53:17UTC (2m 56s) https://koji.fedoraproject.org/koji/taskinfo?taskID=92393215
fedora-toolbox buildContainer TaskClosed 2022-08-15 17:12:17UTC (3m 32s) https://koji.fedoraproject.org/koji/taskinfo?taskID=90840230
I can still reproduce this across rawhide, f37 and f36.
From the rawhide build logs, we can see that the aarch64 build succeeded:
2022-11-08 12:05:06,590 - atomic_reactor.plugin - DEBUG - plugin 'remove_built_image' finished in 2s
2022-11-08 12:05:06,697 - atomic_reactor.inner - INFO - Dockerfile used for build:
FROM registry.fedoraproject.org/fedora:38

ENV NAME=fedora-toolbox VERSION=38
LABEL com.github.containers.toolbox="true" \
      com.github.debarshiray.toolbox="true" \
      com.redhat.component="$NAME" \
      name="$NAME" \
      version="$VERSION" \
      usage="This image is meant to be used with the toolbox command" \
      summary="Base image for creating Fedora toolbox containers" \
      maintainer="Debarshi Ray <rishi@fedoraproject.org>"

COPY README.md /

RUN sed -i '/tsflags=nodocs/d' /etc/dnf/dnf.conf
RUN dnf -y swap coreutils-single coreutils-full

COPY missing-docs /
RUN dnf -y reinstall $(<missing-docs)
RUN rm /missing-docs

COPY extra-packages /
RUN dnf -y install $(<extra-packages)
RUN rm /extra-packages

RUN dnf clean all
LABEL "release"="2" "authoritative-source-url"="registry.fedoraproject.org" "distribution-scope"="public" "vendor"="Fedora Project" "build-date"="2022-11-08T12:02:12.283052" "architecture"="arm64" "vcs-type"="git" "vcs-ref"="32bf967e6d08cc406c56758ee037ed4fb3b978a5" "com.redhat.build-host"="osbs-aarch64-node01.iad2.fedoraproject.org"

2022-11-08 12:05:06,697 - atomic_reactor.inner - INFO - build has finished successfully \o/
... but the x86_64 build didn't:
2022-11-08 12:03:50,962 - atomic_reactor.plugin - DEBUG - plugin 'remove_built_image' finished in 0s
2022-11-08 12:03:51,014 - atomic_reactor.inner - INFO - Dockerfile used for build:
FROM registry.fedoraproject.org/fedora:38

ENV NAME=fedora-toolbox VERSION=38
LABEL com.github.containers.toolbox="true" \
      com.github.debarshiray.toolbox="true" \
      com.redhat.component="$NAME" \
      name="$NAME" \
      version="$VERSION" \
      usage="This image is meant to be used with the toolbox command" \
      summary="Base image for creating Fedora toolbox containers" \
      maintainer="Debarshi Ray <rishi@fedoraproject.org>"

COPY README.md /

RUN sed -i '/tsflags=nodocs/d' /etc/dnf/dnf.conf
RUN dnf -y swap coreutils-single coreutils-full

COPY missing-docs /
RUN dnf -y reinstall $(<missing-docs)
RUN rm /missing-docs

COPY extra-packages /
RUN dnf -y install $(<extra-packages)
RUN rm /extra-packages

RUN dnf clean all
LABEL "release"="2" "authoritative-source-url"="registry.fedoraproject.org" "distribution-scope"="public" "vendor"="Fedora Project" "build-date"="2022-11-08T12:02:12.919528" "architecture"="x86_64" "vcs-type"="git" "vcs-ref"="32bf967e6d08cc406c56758ee037ed4fb3b978a5" "com.redhat.build-host"="osbs-node02.iad2.fedoraproject.org"

2022-11-08 12:03:51,014 - atomic_reactor.inner - ERROR - image build failed: plugin 'all_rpm_packages' raised an exception: RuntimeError: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")
Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/docker/api/client.py", line 268, in _raise_for_status
    response.raise_for_status()
  File "/usr/lib/python3.10/site-packages/requests/models.py", line 960, in raise_for_status
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 404 Client Error: Not Found for url: http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugins/post_rpmqa.py", line 72, in gather_output
    bits, _ = self.tasker.get_archive(container_id, RPMDB_PATH)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 375, in get_archive
    return self.tasker.get_archive(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 961, in get_archive
    return self.d.get_archive(container_id, dir_path)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 243, in hooked
    return retry(orig_attr, *args, retry=self.retry_times, **kwargs)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/core.py", line 210, in retry
    return function(*args, **kwargs)
  File "/usr/lib/python3.10/site-packages/docker/utils/decorators.py", line 19, in wrapped
    return f(self, resource_id, *args, **kwargs)
  File "/usr/lib/python3.10/site-packages/docker/api/container.py", line 748, in get_archive
    self._raise_for_status(res)
  File "/usr/lib/python3.10/site-packages/docker/api/client.py", line 270, in _raise_for_status
    raise create_api_error_from_http_exception(e)
  File "/usr/lib/python3.10/site-packages/docker/errors.py", line 31, in create_api_error_from_http_exception
    raise cls(e, response=response, explanation=explanation)
docker.errors.NotFound: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugin.py", line 265, in run
    plugin_response = plugin_instance.run()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugins/post_rpmqa.py", line 47, in run
    plugin_output = self.gather_output()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugins/post_rpmqa.py", line 77, in gather_output
    raise RuntimeError(ex) from ex
RuntimeError: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/bin/atomic-reactor", line 33, in <module>
    sys.exit(load_entry_point('atomic-reactor==3.14.0', 'console_scripts', 'atomic-reactor')())
  File "/usr/lib/python3.10/site-packages/atomic_reactor/cli/main.py", line 315, in run
    cli.run()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/cli/main.py", line 297, in run
    args.func(args)
  File "/usr/lib/python3.10/site-packages/atomic_reactor/cli/main.py", line 95, in cli_inside_build
    build_inside(input_method=args.input, input_args=args.input_arg,
  File "/usr/lib/python3.10/site-packages/atomic_reactor/inner.py", line 619, in build_inside
    build_result = dbw.build_docker_image()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/inner.py", line 545, in build_docker_image
    postbuild_runner.run()
  File "/usr/lib/python3.10/site-packages/atomic_reactor/plugin.py", line 306, in run
    raise PluginFailedException(msg) from ex
atomic_reactor.plugin.PluginFailedException: plugin 'all_rpm_packages' raised an exception: RuntimeError: 404 Client Error for http+docker://localhost/v1.26/containers/3d3e6847d772287201af8cbd7549a7120385e46b1abda1092d874427ce90df59/archive?path=%2Fvar%2Flib%2Frpm: Not Found ("lstat /usr/lib/sysimage/rpm: no such file or directory")
I'd bet this has something to do with the rpmdb transition and the code in https://github.com/containerbuildsystem/atomic-reactor/blob/308b5b9b4b450e69b69ded1a7a200db7e8da19e2/atomic_reactor/plugins/rpmqa.py hardcoding /var/lib/rpm
Yeah, it's only extracting /var/lib/rpm which is now a symlink.
Oh, yeah, indeed. /var/lib/rpm is now a symlink:
$ ls -ld /var/lib/rpm
lrwxrwxrwx. 1 root root 26 May 4 2022 /var/lib/rpm -> ../../usr/lib/sysimage/rpm
... even on my Fedora 36. I must have missed this transition.
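To illustrate the idea of not hardcoding a single rpmdb path, here is a minimal Python sketch that probes an unpacked root filesystem for the real database directory and skips the compatibility symlink. It is not atomic-reactor's code: `find_rpmdb_dir`, `make_rootfs` and the candidate ordering are hypothetical names and assumptions for illustration; only the two paths come from this ticket.

```python
import os
import tempfile

# Candidate rpmdb locations, newest layout first.
# Both paths appear in this ticket; the ordering is an assumption.
RPMDB_CANDIDATES = ("/usr/lib/sysimage/rpm", "/var/lib/rpm")


def find_rpmdb_dir(rootfs, candidates=RPMDB_CANDIDATES):
    """Return the first candidate that is a real directory (not a symlink)
    inside the unpacked root filesystem `rootfs`, or None if none exists."""
    for candidate in candidates:
        full = os.path.join(rootfs, candidate.lstrip("/"))
        if os.path.isdir(full) and not os.path.islink(full):
            return candidate
    return None


def make_rootfs(new_layout):
    """Hypothetical helper: build a throwaway rootfs with the old or new layout."""
    rootfs = tempfile.mkdtemp()
    os.makedirs(os.path.join(rootfs, "var/lib"))
    if new_layout:
        # New layout: real directory under /usr, symlink left in /var/lib.
        os.makedirs(os.path.join(rootfs, "usr/lib/sysimage/rpm"))
        os.symlink("../../usr/lib/sysimage/rpm",
                   os.path.join(rootfs, "var/lib/rpm"))
    else:
        # Old layout: /var/lib/rpm is the real directory.
        os.makedirs(os.path.join(rootfs, "var/lib/rpm"))
    return rootfs


if __name__ == "__main__":
    print(find_rpmdb_dir(make_rootfs(new_layout=True)))
    print(find_rpmdb_dir(make_rootfs(new_layout=False)))
```

Whether the build plugin can probe the container this way depends on how it accesses the container filesystem, which this sketch does not model.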
Do we do something different for aarch64?
It's worth noting that this isn't affecting all container builds. The flatpak ones are fine. This might be due to it not using the rpmqa option tho?
So, perhaps there's some way to disable/turn off the rpmqa call here until atomic-reactor can fix that hard coded path?
@cverna do you have any thoughts here? :)
And indeed the f35 (which does not have the rpm db change) toolbox container still builds fine.
I opened https://github.com/containerbuildsystem/atomic-reactor/issues/2027
Hmm couldn't we just patch the atomic-reactor fedora package for now as a quick initial workaround? At least we could test that in rawhide?
(seems upstream development is now based on OSBS 2 already iiuc)
It seems that the rpmqa plugin hasn't run on aarch64 at all. Was it disabled in Fedora infra for aarch64?
I opened https://bugzilla.redhat.com/show_bug.cgi?id=2142731 too against atomic-reactor for good measure, but at this point a Fedora PR would probably be more useful, right? I don't see how patching the path could hurt?
I opened https://src.fedoraproject.org/rpms/atomic-reactor/pull-request/12 with the rpmdb path patch for rawhide.
This shouldn't break flatpak builds, right?
Well, sadly as I feared... it's not using the atomic-reactor from the building repo, but rather from the buildroot image (which is currently f36).
I can update it to f37 easily, but then we have to get an atomic-reactor update pushed out in f37 before we can test it.
I don't understand the atomic-reactor plugins, but there's got to be some way to just disable that plugin. ;(
ok, so I added a sed to the buildroot image Dockerfile and rebuilt it. That gets past that plugin... but now...
2022-11-15 20:34:32,190 platform:- - atomic_reactor.plugins.compare_components - WARNING - Comparison mismatch for component libgcc:
2022-11-15 20:34:32,190 platform:- - atomic_reactor.plugins.compare_components - WARNING - aarch64: libgcc-12.1.1-1.fc36 (999f7cbf38ab71f4)
2022-11-15 20:34:32,190 platform:- - atomic_reactor.plugins.compare_components - WARNING - x86_64: libgcc-12.2.1-3.fc38 (809a8d7ceb10b464)
it's somehow building f36 on the aarch64 cluster, but f38 on the x86_64 cluster? Will try looking some more...
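For readers unfamiliar with this kind of check, a rough Python sketch of a cross-architecture component comparison follows. It is only an illustration of the idea, not the compare_components plugin itself: the function name and data shape are assumptions, and `example-pkg` is a made-up placeholder. The libgcc versions are the ones from the warning above.

```python
def compare_components(per_arch):
    """per_arch maps arch -> {component name: version-release}.

    Return {name: {arch: version-release or None}} for every component
    whose version differs (or is missing) on at least one architecture.
    """
    names = set()
    for components in per_arch.values():
        names.update(components)

    mismatches = {}
    for name in sorted(names):
        versions = {arch: comps.get(name) for arch, comps in per_arch.items()}
        if len(set(versions.values())) > 1:
            mismatches[name] = versions
    return mismatches


if __name__ == "__main__":
    builds = {
        # libgcc values are taken from the log; example-pkg is a placeholder.
        "aarch64": {"libgcc": "12.1.1-1.fc36", "example-pkg": "1.0-1"},
        "x86_64": {"libgcc": "12.2.1-3.fc38", "example-pkg": "1.0-1"},
    }
    for name, versions in compare_components(builds).items():
        print("Comparison mismatch for component", name, versions)
```

A mismatch in the dist tag (fc36 vs fc38) for a base package like libgcc suggests the two clusters started from different base images rather than from different package sets on top of the same base.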
Sorry, I missed the ping in this ticket.
We can disable plugins in https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/orchestrator_customize.json and https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/worker_customize.json
The buildroot image is defined here --> https://pagure.io/fedora-infra/ansible/blob/main/f/files/osbs/buildroot-Dockerfile-production.j2#_1 and it is pointing to Fedora 36 now.
IIRC the Dockerfile ends up being on the master node under /etc/osbs/ if you want to do the test locally before running ansible.
Ping me on IRC if needed, I'll be happy to help.
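As a rough sketch of what disabling the plugin in those customize files could look like, the Python below adds an entry to a `disable_plugins` list. This assumes the files use a `disable_plugins` list of `plugin_type`/`plugin_name` pairs, and that `all_rpm_packages` is registered under `postbuild_plugins`; both are assumptions that should be checked against the actual files linked above. `disable_plugin` is a hypothetical helper, not part of OSBS.

```python
import json


def disable_plugin(customize, plugin_type, plugin_name):
    """Return a copy of the customize dict with the plugin listed under
    "disable_plugins" (assumed schema), without adding duplicates."""
    result = dict(customize)
    disabled = list(result.get("disable_plugins", []))
    entry = {"plugin_type": plugin_type, "plugin_name": plugin_name}
    if entry not in disabled:
        disabled.append(entry)
    result["disable_plugins"] = disabled
    return result


if __name__ == "__main__":
    # Start from an empty customization; a real file may have more keys.
    updated = disable_plugin({}, "postbuild_plugins", "all_rpm_packages")
    print(json.dumps(updated, indent=4))
```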
ok.
Any help here would be quite welcome. ;)
I got around the rpmdb problem, but now it's failing in that compare_components thing and I have no idea why. ;(
I tried a few f37 and f36 builds [0][1] today and they failed in the component compare plugin. I think this is because both our base Fedora container images for F37 and F36 don't have the same content for x86 and aarch64.
I have also noticed that we don't seem to have F37 builds for the base image. I am going to try to get the base image updated, with the same content for both architectures.
@petersen How much do you need/want to have aarch64 builds? We could disable aarch64 for toolbox to get a successful build while we try to fix this issue.
[0] - https://koji.fedoraproject.org/koji/taskinfo?taskID=94488442
[1] - https://koji.fedoraproject.org/koji/taskinfo?taskID=94488670
Ok we lost the aarch64 cluster :-(
[root@osbs-aarch64-master01 ~][PROD-IAD2]# oc get nodes
Unable to connect to the server: x509: certificate has expired or is not yet valid: current time 2022-11-24T14:58:39Z is after 2022-11-24T12:07:29Z
[root@osbs-aarch64-master01 ~][PROD-IAD2]#
I am not 100% sure how we regenerate and redeploy certificates on OCP clusters :-(
Also the aarch64 cluster is running on Fedora 33 boxes, so we should probably try to do a full redeploy :-(
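To make the x509 message above concrete: the check that failed is a plain timestamp comparison between the current time and the certificate's expiry. A small Python sketch using the two timestamps from that error (the function names are hypothetical, and this does not read a real certificate):

```python
from datetime import datetime, timezone


def parse_utc(stamp):
    """Parse a timestamp like 2022-11-24T12:07:29Z into an aware datetime."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc)


def expired_for(not_after, now):
    """Return how long the certificate has been expired as a timedelta,
    or None if `now` is not past `not_after`."""
    delta = parse_utc(now) - parse_utc(not_after)
    return delta if delta.total_seconds() > 0 else None


if __name__ == "__main__":
    # Both timestamps are copied from the oc error message.
    print(expired_for("2022-11-24T12:07:29Z", "2022-11-24T14:58:39Z"))
```

With those values the certificate had expired a little under three hours before the command was run.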
I poked around and I think I have it all back with renewed certs.
We can't upgrade it from f33 because docker is no longer in f34+ and openshift origin / 3.11 doesn't support any newer either.
Thanks kevin,
I gave it another try this morning and it looks like the x86 cluster cannot grab the logs from the aarch64 cluster.
I am not 100% sure why, but it might be a service account token that needs to be regenerated. I don't have permission to run the OSBS playbooks, but it could be worth running playbooks/groups/osbs/configure-osbs.yml and playbooks/groups/osbs/osbs-post-install.yml
2022-11-25 09:07:52,793 platform:- - atomic_reactor.inner - ERROR - image build failed: {"aarch64": {"general": "b'{\"kind\":\"Status\",\"apiVersion\":\"v1\",\"metadata\":{},\"status\":\"Failure\",\"message\":\"Get https://osbs-aarch64-node02.iad2.fedoraproject.org:10250/containerLogs/worker/fedora-toolbox-f37-65bbb-11-build/custom-build?follow=true: dial tcp: lookup osbs-aarch64-node02.iad2.fedoraproject.org on 10.3.170.147:53: read udp 10.3.170.147:43152-\\\\u003e10.3.170.147:53: read: connection refused\",\"code\":500}'", "pod": {"reason": "Succeeded"}}}
configure-osbs gives:
RUNNING HANDLER [Remove the previous buildroot image] ************************************************
Friday 25 November 2022 18:42:10 +0000 (0:00:02.462) 0:01:57.657 *******
Friday 25 November 2022 18:42:10 +0000 (0:00:02.462) 0:01:57.657 *******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'requests'
fatal: [osbs-master01.iad2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (Docker SDK for Python: docker above 5.0.0 (Python >= 3.6) or docker before 5.0.0 (Python 2.7) or docker-py (Python 2.6)) on osbs-master01.iad2.fedoraproject.org's Python /usr/bin/python3.6. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter, for example via `pip install docker` (Python >= 3.6) or `pip install docker==4.4.4` (Python 2.7) or `pip install docker-py` (Python 2.6). The error was: No module named 'requests'"}
fatal: [osbs-aarch64-master01.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'requests'
fatal: [osbs-node01.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (Docker SDK for Python: docker above 5.0.0 (Python >= 3.6) or docker before 5.0.0 (Python 2.7) or docker-py (Python 2.6)) on osbs-node01.stg.iad2.fedoraproject.org's Python /usr/bin/python3.6. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter, for example via `pip install docker` (Python >= 3.6) or `pip install docker==4.4.4` (Python 2.7) or `pip install docker-py` (Python 2.6). The error was: No module named 'requests'"}
fatal: [osbs-aarch64-node01.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ModuleNotFoundError: No module named 'requests'
fatal: [osbs-node02.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "msg": "Failed to import the required Python library (Docker SDK for Python: docker above 5.0.0 (Python >= 3.6) or docker before 5.0.0 (Python 2.7) or docker-py (Python 2.6)) on osbs-node02.stg.iad2.fedoraproject.org's Python /usr/bin/python3.6. Please read the module documentation and install it in the appropriate location. If the required library is installed, but Ansible is using the wrong Python interpreter, please consult the documentation on ansible_python_interpreter, for example via `pip install docker` (Python >= 3.6) or `pip install docker==4.4.4` (Python 2.7) or `pip install docker-py` (Python 2.6). The error was: No module named 'requests'"}
fatal: [osbs-aarch64-node02.stg.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
fatal: [osbs-aarch64-node01.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
fatal: [osbs-aarch64-node02.iad2.fedoraproject.org]: FAILED! => {"changed": false, "module_stderr": "/bin/sh: /usr/bin/python3.6: No such file or directory\n", "module_stdout": "", "msg": "The module failed to execute correctly, you probably need to set the interpreter.\nSee stdout/stderr for the exact error", "rc": 127}
:(
So, I poked around at this some more.
First, we hit a problem with koji-containerbuild. The new f37 one (in updates) uses a different channel and also changes some OSBS handling. See https://pagure.io/fedora-infrastructure/issue/11020. I avoided that for now by downgrading back to the version we had before with f36.
Second, the aarch64 cluster nodes seem fine for a while, then they drop off.
Finally the current error seems to be: atomic_reactor.plugin.PluginFailedException: {"aarch64": {"general": "HTTPSConnectionPool(host='osbs-aarch64-master01.iad2.fedoraproject.org', port=8443): Read timed out."}} which I guess as you said is the builder trying to get the aarch64 result? Not clear.
So, I am wondering if we should just re-install the aarch64 cluster; it seems somehow not happy. But perhaps we can figure out what's going on.
ok. We finally got a build of the f37 toolbox container. :)
Can you please try rawhide and any others you want to do and confirm it's working now?
very sorry for this long saga.
Thanks so much!
Can confirm builds succeeding now, yay
Metadata Update from @petersen: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)