#9040 Error 503 from osbs.fedoraproject.org
Closed: Fixed a month ago by mobrien. Opened 2 months ago by kalev.

Flatpak builds are all failing, with koji getting a 503 error from osbs.fedoraproject.org:

$ fedpkg flatpak-build
Created task: 45789122
Task info: https://koji.fedoraproject.org/koji/taskinfo?taskID=45789122
Watching tasks (this may be safely interrupted)...
45789122 buildContainer (noarch): free
45789122 buildContainer (noarch): free -> open (buildvm-x86-03.iad2.fedoraproject.org)
45789122 buildContainer (noarch): open (buildvm-x86-03.iad2.fedoraproject.org) -> FAILED: Fault: <Fault 1: 'Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 833, in urlopen
    return self.urlopen(
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 833, in urlopen
    return self.urlopen(
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 833, in urlopen
    return self.urlopen(
  [Previous line repeated 5 more times]
  File "/usr/lib/python3.8/site-packages/urllib3/connectionpool.py", line 819, in urlopen
    retries = retries.increment(method, url, response=response, _pool=self)
  File "/usr/lib/python3.8/site-packages/urllib3/util/retry.py", line 436, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='osbs.fedoraproject.org', port=443): Max retries exceeded with url: /apis/build.openshift.io/v1/namespaces/osbs-fedora/builds/?labelSelector=koji-task-id%3D45789122 (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/osbs/http.py", line 59, in request
    stream = HttpStream(url, *args, verbose=self.verbose, **kwargs)
  File "/usr/lib/python3.8/site-packages/osbs/http.py", line 156, in __init__
    self.req = self.session.request(method, url, **args)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 533, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/lib/python3.8/site-packages/requests/sessions.py", line 646, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.8/site-packages/requests/adapters.py", line 507, in send
    raise RetryError(e, request=request)
requests.exceptions.RetryError: HTTPSConnectionPool(host='osbs.fedoraproject.org', port=443): Max retries exceeded with url: /apis/build.openshift.io/v1/namespaces/osbs-fedora/builds/?labelSelector=koji-task-id%3D45789122 (Caused by ResponseError('too many 503 error responses'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/site-packages/koji/daemon.py", line 1339, in runTask
    response = (handler.run(),)
  File "/usr/lib/python3.8/site-packages/koji/tasks.py", line 329, in run
    return koji.util.call_with_argcheck(self.handler, self.params, self.opts)
  File "/usr/lib/python3.8/site-packages/koji/util.py", line 258, in call_with_argcheck
    return func(*args, **kwargs)
  File "/usr/lib/koji-builder-plugins/builder_containerbuild.py", line 889, in handler
    results = self.runBuilds(src, target_info, archlist,
  File "/usr/lib/koji-builder-plugins/builder_containerbuild.py", line 558, in runBuilds
    semi_results = self.createContainer(**kwargs)
  File "/usr/lib/koji-builder-plugins/builder_containerbuild.py", line 630, in createContainer
    build_response = create_method(**orchestrator_create_build_args)
  File "/usr/lib/python3.8/site-packages/osbs/api.py", line 68, in catch_exceptions
    return func(*args, **kwargs)
  File "/usr/lib/python3.8/site-packages/osbs/api.py", line 1009, in create_orchestrator_build
    return self._do_create_prod_build(**kwargs)
  File "/usr/lib/python3.8/site-packages/osbs/api.py", line 811, in _do_create_prod_build
    builds_for_koji_task = self._get_not_cancelled_builds_for_koji_task(koji_task_id)
  File "/usr/lib/python3.8/site-packages/osbs/api.py", line 288, in _get_not_cancelled_builds_for_koji_task
    all_builds_for_task = self.os.list_builds(koji_task_id=koji_task_id).json()['items']
  File "/usr/lib/python3.8/site-packages/osbs/core.py", line 594, in list_builds
    return self._get(url)
  File "/usr/lib/python3.8/site-packages/osbs/core.py", line 191, in _get
    return self._con.get(
  File "/usr/lib/python3.8/site-packages/osbs/http.py", line 46, in get
    return self.request(url, "get", **kwargs)
  File "/usr/lib/python3.8/site-packages/osbs/http.py", line 68, in request
    raise OsbsNetworkException(url, str(ex), '',
osbs.exceptions.OsbsNetworkException: () HTTPSConnectionPool(host='osbs.fedoraproject.org', port=443): Max retries exceeded with url: /apis/build.openshift.io/v1/namespaces/osbs-fedora/builds/?labelSelector=koji-task-id%3D45789122 (Caused by ResponseError('too many 503 error responses'))
'>
  0 free  0 open  0 done  1 failed

45789122 buildContainer (noarch) failed

It would be great to debug this on IRC as a group so we can see how to do so moving forward.

Metadata Update from @kevin:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: OSBS, groomed, high-trouble, medium-gain

2 months ago

So, we fixed the first thing (not being able to reach osbs); it was a RHIT firewall issue.

I resubmitted the build:

Resubmitting the following task:
Task: 45789122
Type: buildContainer
Request Parameters:
Owner: kalev
State: failed
Created: Tue Jun 16 05:03:22 2020
Started: Tue Jun 16 05:04:41 2020
Finished: Tue Jun 16 05:12:42 2020
Host: buildvm-x86-03.iad2.fedoraproject.org

Resubmitted task 45789122 as new task 45836600
Watching tasks (this may be safely interrupted)...
45836600 buildContainer (noarch): free
45836600 buildContainer (noarch): free -> open (buildvm-x86-03.iad2.fedoraproject.org)

But then it seemed to hang. Looking at the master:

thunderbird-master-c0ed5-1-build 0/1 ContainerCreating 0 14m

it seems it's missing some secrets now:

Jun 17 21:05:39 osbs-master01.iad2.fedoraproject.org atomic-openshift-node[3810]: E0617 21:05:39.974995    3810 nestedpendingoperations.go:267] Operation for "\"kubernetes.io/secret/...snip...-kojisecret-secret\...snip...  failed. No retries permitted until 2020-06-17 21:07:41.974981079 +0000 UTC m=+646185.869543478 (durationBeforeRetry 2m2s). Error: "MountVolume.SetUp failed for volume \"kojisecret-secret\" (UniqueName: \"kubernetes.io/secret/...snip...-kojisecret-secret\") pod \"thunderbird-master-c0ed5-1-build\" (UID: \"...snip...\") : secrets \"kojisecret\" not found"

I don't know where they are supposed to be defined. @cverna ?
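
(For reference, a rough sketch of how one might check for and recreate the missing secret with the OpenShift client, assuming the osbs-fedora namespace from the traceback above; the certificate file paths are hypothetical placeholders:)

# confirm "kojisecret" really is absent from the build namespace
oc -n osbs-fedora get secrets | grep koji

# recreate it from local certificate files (paths made up for illustration)
oc -n osbs-fedora create secret generic kojisecret \
    --from-file=cert=/path/to/koji.crt \
    --from-file=ca=/path/to/koji-ca.crt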

cverna and I worked on this today.

We fixed the secrets issue, then ran into an issue where the buildroot image had not been built correctly. We rebuilt it on both the nodes and the master, which allowed a flatpak build of thunderbird to proceed. The build did, however, fail with the following error:

Log Tail:         File "/usr/lib/python3.7/site-packages/atomic_reactor/cli/main.py", line 99, in cli_inside_build
                    substitutions=args.substitute)
                  File "/usr/lib/python3.7/site-packages/atomic_reactor/inner.py", line 617, in build_inside
                    logger.info("Dockerfile used for build:\n%!s(MISSING)", dbw.builder.original_df)
                AttributeError: 'NoneType' object has no attribute 'original_df'
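
(To get more context than the log tail, the full log of the failed build can be pulled from the master with the OpenShift client; a sketch, with the pod name taken from the ContainerCreating listing earlier, which may differ between attempts, and the namespace taken from the traceback at the top of the ticket:)

oc -n osbs-fedora logs thunderbird-master-c0ed5-1-build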

The journalctl logs are also filled with the following error, although I'm not sure the two are related:

Failed to get system container stats for "/system.slice/atomic-openshift-node.service": failed to get cgroup stats for "/system.slice/atomic-openshift-node.service": failed to get container info for "/system.slice/atomic-openshift-node.service": unknown container "/system.slice/atomic-openshift-node.service"

Nice, thanks for looking at this and getting it further! Who should we bring in for the atomic-reactor issue?

OK, so looking at the orchestrator.log, it seems that it fails when fetching the git repo from src.fp.o:

2020-06-22 13:09:04,164 - osbs.utils - DEBUG - cloning '['git', 'clone', '-b', 'master', '--single-branch', '--depth', '1', 'https://src.fedoraproject.org/container/tools.git', '/tmp/tmppn18nm6o/tools']'
2020-06-22 13:09:05,232 - osbs.utils - INFO - retrying command '['git', 'clone', '-b', 'master', '--single-branch', '--depth', '1', 'https://src.fedoraproject.org/container/tools.git', '/tmp/tmppn18nm6o/tools']':
 'b"Cloning into '/tmp/tmppn18nm6o/tools'...\nfatal: unable to access 'https://src.fedoraproject.org/container/tools.git/': Failed to connect to src.fedoraproject.org port 443: Connection refused\n"'
2020-06-22 13:10:06,300 - osbs.utils - INFO - retrying command '['git', 'clone', '-b', 'master', '--single-branch', '--depth', '1', 'https://src.fedoraproject.org/container/tools.git', '/tmp/tmppn18nm6o/tools']':
 'b"Cloning into '/tmp/tmppn18nm6o/tools'...\nfatal: unable to access 'https://src.fedoraproject.org/container/tools.git/': Failed to connect to src.fedoraproject.org port 443: Connection refused\n"'
2020-06-22 13:12:07,420 - osbs.utils - INFO - retrying command '['git', 'clone', '-b', 'master', '--single-branch', '--depth', '1', 'https://src.fedoraproject.org/container/tools.git', '/tmp/tmppn18nm6o/tools']':
 'b"Cloning into '/tmp/tmppn18nm6o/tools'...\nfatal: unable to access 'https://src.fedoraproject.org/container/tools.git/': Failed to connect to src.fedoraproject.org port 443: Connection refused\n"'

So I think this is another RHIT firewall thing that needs to be configured. @kevin, could you confirm?

For any firewall issues I need a lot of info to debug:
1. What is the IP address being blocked?
2. What is the destination IP address it is trying to get to?
3. What are the ports?

src.fedoraproject.org is the proxies, and osbs-node01 and osbs-master01 can curl https://src.fedoraproject.org/

Where is this running?


This is running from a container build on osbs-master01, so it might be docker not being able to access src.fp.o:

[root@osbs-master01 ~][PROD-IAD2]# docker run -it --rm --entrypoint /bin/bash buildroot
[root@e5f286cd7f3b /]# git clone -b master --single-branch --depth 1 https://src.fedoraproject.org/container/tools.git /tmp/tmppn18nm6o/tools
Cloning into '/tmp/tmppn18nm6o/tools'...
fatal: unable to access 'https://src.fedoraproject.org/container/tools.git/': Failed to connect to src.fedoraproject.org port 443: Connection refused
[root@e5f286cd7f3b /]# curl https://src.fedoraproject.org/
curl: (7) Failed to connect to src.fedoraproject.org port 443: Connection refused
[root@e5f286cd7f3b /]# 
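
To gather the information requested above (source IP, destination IP, ports), something along these lines from inside the same container should do (assuming getent, ip and curl are all present in the buildroot image):

getent hosts src.fedoraproject.org        # destination IP the container resolves
ip -4 addr show dev eth0                  # source IP on the docker bridge
curl -v https://src.fedoraproject.org/    # shows the exact address:port of the connect attempt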

We have some special iptables rules for osbs and docker (https://pagure.io/fedora-infra/ansible/blob/master/f/files/osbs/fix-docker-iptables.production). I guess that needs to be updated for iad2.
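
For context, the rules in that script are of roughly this shape, so the fix should mostly be swapping the destination addresses (a sketch only; 172.17.0.0/16 is docker's default bridge range, and 10.3.163.75 is a placeholder, not the real iad2 proxy IP):

# allow containers on the docker bridge to reach an infra host on 443
iptables -I FORWARD -s 172.17.0.0/16 -d 10.3.163.75 -p tcp --dport 443 -j ACCEPT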

@mobrien do you want to try to look at that iptables script ^^

I can have a look and see what I can figure out.

git grep 10.5.126 shows all kinds of 'sins' that need to be dealt with.. and yeah.. osbs has a lot of things needing updates :)

files/osbs/fedora-dnsmasq.conf
files/osbs/fix-docker-iptables.production
playbooks/groups/osbs-cluster.yml
playbooks/groups/osbs/osbs-post-install.yml

I am amazed anything has been working.

It looks as though all the IPs specified in the iptables file are phx2-related (10.5.0.0/16) and will need to be updated to iad2 (10.3.0.0/16). I will see if I can track down all the new corresponding IP addresses.
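
A quick way to enumerate the stale addresses first (paths taken from the git grep list above; the phx2-to-iad2 mapping is not mechanical, so each address still has to be looked up by hand rather than sed'ed):

git grep -n '10\.5\.' -- files/osbs playbooks/groups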

I have updated the IPs in this PR https://pagure.io/fedora-infra/ansible/pull-request/147# as per the comment from @smooge

There is one I couldn't track down which is mentioned in the comment on the PR

@cverna I have updated the IPs and ran the playbook. I can reach src.fp.o from inside a container, so that's definitely an improvement. I tried to do a flatpak build of thunderbird again, but it fails with the following error:

atomic_reactor.plugin.PluginFailedException: plugin 'bump_release' raised an exception: ConnectionError: HTTPSConnectionPool(host='cdn.registry.fedoraproject.org', port=443): Max retries exceeded with url: /v2/flatpak-build-base/blobs/sha256:de5d0ed6ba20c9d6f62dc58eceabd0e4304afbd51462a615ef824a90182797b2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd439292d90>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

I then tried to curl that blob from the registry inside a docker container and it was successful, so I'm not sure what is going on there:

[root@osbs-master01 ~][PROD-IAD2]# docker run -it --rm --entrypoint /bin/bash buildroot
[root@1bfc11c7101d /]# curl -i cdn.registry.fedoraproject.org/v2/flatpak-build-base/blobs/sha256:de5d0ed6ba20c9d6f62dc58eceabd0e4304afbd51462a615ef824a90182797b2
HTTP/1.1 200 OK
Content-Type: application/octet-stream
Content-Length: 2957
Connection: keep-alive
Date: Tue, 23 Jun 2020 09:36:54 GMT
...
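
One caveat with that test: the traceback above failed on port 443, but curl without an explicit scheme defaults to plain http on port 80, so the successful response may not have exercised the same path. Repeating it with https would be a closer reproduction:

curl -I https://cdn.registry.fedoraproject.org/v2/flatpak-build-base/blobs/sha256:de5d0ed6ba20c9d6f62dc58eceabd0e4304afbd51462a615ef824a90182797b2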

The candidate registry was added to the iptables rules, which allowed the pod to connect, but there appears to be an issue with the secrets when pushing a build, which causes a failure:

atomic_reactor.plugin.PluginFailedException: {"x86_64": {"tag_and_push": "Command '['skopeo', 'copy', '--authfile=/var/run/secrets/atomic-reactor/v2-registry-dockercfg/.dockercfg', 'oci:/tmp/tmpkx547cwo/flatpak-oci-image:app/org.mozilla.Thunderbird/x86_64/stable', 'docker://candidate-registry.fedoraproject.org/thunderbird:f32-flatpak-candidate-60554-20200624080303-x86_64']' returned non-zero exit status 1."}}
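
One way to separate an auth problem from a network one is to re-run the failing copy by hand with debug logging, which shows the actual HTTP exchange with the registry (this is the command from the error above with skopeo's global --debug flag added; the /tmp path lives inside the build pod, so it would have to be run there):

skopeo --debug copy --authfile=/var/run/secrets/atomic-reactor/v2-registry-dockercfg/.dockercfg oci:/tmp/tmpkx547cwo/flatpak-oci-image:app/org.mozilla.Thunderbird/x86_64/stable docker://candidate-registry.fedoraproject.org/thunderbird:f32-flatpak-candidate-60554-20200624080303-x86_64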

Both a container build and a flatpak build have now been carried out successfully.

The last remaining issue was authentication with the candidate registry: the registry was only accepting authentication requests from 10.5.x.x IP addresses (phx2). It has now been updated to accept them only from 10.3.x.x IP addresses (iad2).
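
(For the record, that kind of source-range restriction is typically a one-line change where the registry's auth endpoint is configured; a sketch, assuming an Apache 2.4-style access rule rather than the actual file in ansible:

    Require ip 10.3.0.0/16

with the old 10.5.0.0/16 entry removed.)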

Metadata Update from @mobrien:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

a month ago

Awesome, thanks for all the long debugging sessions and fixing it!
