#9242 Flatpak builds fail with "Network is unreachable"
Closed: Fixed 3 years ago by kevin. Opened 3 years ago by kalev.

I've tried a bunch of flatpak builds today and all have failed with a similar error:

$ fedpkg flatpak-build
Created task: 49473433
Task info: https://koji.fedoraproject.org/koji/taskinfo?taskID=49473433
Watching tasks (this may be safely interrupted)...
49473433 buildContainer (noarch): free
49473433 buildContainer (noarch): free -> open (buildvm-x86-04.iad2.fedoraproject.org)
49473433 buildContainer (noarch): open (buildvm-x86-04.iad2.fedoraproject.org) -> FAILED: Fault 2001: Image build failed. Error in pod: exitCode 1, containerID docker://967a290cb72d03596d9595925c6b1706fb64702c1cf2e924c94def64655b216d, reason:

 send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python3.7/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='cdn.registry.fedoraproject.org', port=443): Max retries exceeded with url: /v2/flatpak-build-base/blobs/sha256:de5d0ed6ba20c9d6f62dc58eceabd0e4304afbd51462a615ef824a90182797b2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f2aa794eb50>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/atomic-reactor", line 11, in <module>
    load_entry_point('atomic-reactor==1.6.47', 'console_scripts', 'atomic-reactor')()
  File "/usr/lib/python3.7/site-packages/atomic_reactor/cli/main.py", line 318, in run
    cli.run()
  File "/usr/lib/python3.7/site-packages/atomic_reactor/cli/main.py", line 300, in run
    args.func(args)
  File "/usr/lib/python3.7/site-packages/atomic_reactor/cli/main.py", line 99, in cli_inside_build
    substitutions=args.substitute)
  File "/usr/lib/python3.7/site-packages/atomic_reactor/inner.py", line 615, in build_inside
    build_result = dbw.build_docker_image()
  File "/usr/lib/python3.7/site-packages/atomic_reactor/inner.py", line 492, in build_docker_image
    prebuild_runner.run()
  File "/usr/lib/python3.7/site-packages/atomic_reactor/plugin.py", line 309, in run
    raise PluginFailedException(msg)
atomic_reactor.plugin.PluginFailedException: plugin 'bump_release' raised an exception: ConnectionError: HTTPSConnectionPool(host='cdn.registry.fedoraproject.org', port=443): Max retries exceeded with url: /v2/flatpak-build-base/blobs/sha256:de5d0ed6ba20c9d6f62dc58eceabd0e4304afbd51462a615ef824a90182797b2 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f2aa794eb50>: Failed to establish a new connection: [Errno 101] Network is unreachable'))

OSBS build id: gnome-weather-master-5da33-3
  0 free  0 open  0 done  1 failed

49473433 buildContainer (noarch) failed

Metadata Update from @mohanboddu:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: groomed, high-trouble, medium-gain, ops

3 years ago

I don't think it's the registry being broken this time. I would suspect something is wrong with the network setup on the build host.

The error is "Network is unreachable", which should mean that there is no route to the host (it's trying to connect to cdn.registry.fedoraproject.org). I don't know how the name resolution is done; does it resolve to an internal host instead of going to the actual cdn?

[root@bastion01 ~][PROD-IAD2]# host cdn.registry.fedoraproject.org
cdn.registry.fedoraproject.org is an alias for d18n9n1e9qt6fn.cloudfront.net.
d18n9n1e9qt6fn.cloudfront.net has address 13.32.182.79
d18n9n1e9qt6fn.cloudfront.net has address 13.32.182.81
d18n9n1e9qt6fn.cloudfront.net has address 13.32.182.72
d18n9n1e9qt6fn.cloudfront.net has address 13.32.182.117
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:4400:1c:bec0:f4c0:93a1
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:f000:1c:bec0:f4c0:93a1
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:1a00:1c:bec0:f4c0:93a1
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:ee00:1c:bec0:f4c0:93a1
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:d200:1c:bec0:f4c0:93a1
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:fe00:1c:bec0:f4c0:93a1
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:6400:1c:bec0:f4c0:93a1
d18n9n1e9qt6fn.cloudfront.net has IPv6 address 2600:9000:2009:c000:1c:bec0:f4c0:93a1

Yeah, this is not going to work... those boxes can't talk out to the internet except to one or two IPs.

So, I don't really have a full picture of how this is supposed to work, but I found that the ansible file roles/httpd/reverseproxy/templates/reversepassproxy.registry-generic.conf has:

{% if env == "production" %}
RewriteCond %{HTTP:VIA} !cdn77
RewriteCond %{HTTP:VIA} !cloudfront
RewriteCond %{SERVER_NAME} !^registry-no-cdn.fedoraproject.org$
RewriteCond %{REQUEST_METHOD} !^(PATCH|POST|PUT|DELETE|HEAD)$
RewriteCond %{REMOTE_HOST} !^osbs-.*$
RewriteRule ^/v2/(.*)/blobs/([a-zA-Z0-9:]*) https://cdn.registry.fedoraproject.org/v2/$1/blobs/$2 [R]
{% endif %}

... which seems to be redirecting blob requests to cdn.registry.fedoraproject.org, unless the request already came in via the CDN, is a write, or was made to registry-no-cdn.fedoraproject.org

That only works if the tool follows the redirect through the reverse proxy, though. If it is just doing an IP lookup and pulling directly, it is going to time out with no route.
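To make the rule concrete, the path-matching half of it can be sketched in Python. The regex is copied from the RewriteRule above; the digest in the example is made up for illustration:

```python
import re

# Same pattern as the RewriteRule in reversepassproxy.registry-generic.conf:
#   ^/v2/(.*)/blobs/([a-zA-Z0-9:]*)
BLOB_RULE = re.compile(r"^/v2/(.*)/blobs/([a-zA-Z0-9:]*)$")

def redirect_target(path):
    """Return the CDN URL the rule would 302 to, or None if it doesn't match."""
    m = BLOB_RULE.match(path)
    if m is None:
        return None
    return ("https://cdn.registry.fedoraproject.org/v2/%s/blobs/%s"
            % (m.group(1), m.group(2)))

# Blob pulls get redirected to the CDN...
print(redirect_target("/v2/flatpak-build-base/blobs/sha256:0123abcd"))
# ...but manifest requests are left alone:
print(redirect_target("/v2/flatpak-build-base/manifests/latest"))  # None
```

So a client that follows the 302 ends up talking to CloudFront directly, which is exactly what the build VMs cannot reach.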

OK, but something must have changed in the infra, because things were working just fine before I went on PTO two weeks ago.

I suspect the following commit. Would it be possible to try reverting it and see if it makes things work?

commit e03a7c35bdcf0017652e07312edf8ce3f728e102
Author: Kevin Fenzi <kevin@scrye.com>
Date:   Fri Aug 7 15:40:30 2020 -0700

    registry: try fixing osbs rule and try bypassing varnish

    I think the lack of ! on the osbs rule meant that nothing ever went to
    the cdn. This increases load on the real registry a lot.

    Also, we are using varnish here, but lets try and just go via haproxy.
    varnish might be having problems keeping all the 404s in memory/cache.
    The cdn thing should help that, but since we have cloudfront I don't
    think we also need to use varnish here.

    Signed-off-by: Kevin Fenzi <kevin@scrye.com>

I went ahead and did a PR to test this theory, https://pagure.io/fedora-infra/ansible/pull-request/216

@smooge, Can you merge it to see if it helps, please?

I am on PTO, but I don't think reverting this is correct. It worked before because ALL traffic was going to the registry and none to the cdn, swamping our proxies and resulting in a bunch of container problems.

"The RewriteCond directive defines a rule condition. One or more RewriteCond can precede a RewriteRule directive. The following rule is then only used if both the current state of the URI matches its pattern, and if these conditions are met."

reverting this makes ONLY osbs use the cdn.

I think instead we need a rule saying !build.iad2.fedoraproject.org as host (i.e., if the host is not build.iad2.fedoraproject.org, use the cdn; otherwise use the registry directly).

But apache rewrites are confusing; perhaps we need to OR some of the conditions instead of ANDing them (the default). I just know that before, it was causing 0 traffic to go to the cdn.
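For reference, successive RewriteCond lines combine with an implicit AND, and an explicit [OR] flag makes two conditions alternatives. A generic sketch (not the actual rule from the repo):

```apache
# Conditions chain with implicit AND; [OR] joins the next pair as alternatives.
RewriteCond %{HTTP:VIA} cloudfront [OR]
RewriteCond %{HTTP:VIA} cdn77
# The rule below fires if EITHER Via header matched above.
RewriteRule ^/v2/ - [L]
```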

OK, I think I have this fixed.

I added:

RewriteCond expr "! -R '10.3.169.0/24'"

which means that if the request is not from that subnet (along with all the other negative checks), it gets sent to the cdn.
And if it is from that subnet, it will go direct to the registry and work.
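The -R operator in an Apache expression tests whether the client address falls inside the given subnet. The same check can be mirrored in Python with the stdlib ipaddress module (the sample addresses below are made up):

```python
import ipaddress

# The builders' subnet from the RewriteCond: expr "! -R '10.3.169.0/24'"
BUILDER_NET = ipaddress.ip_network("10.3.169.0/24")

def goes_to_cdn(remote_ip):
    """True: the request would be redirected to the CDN.
    False: it came from the builder subnet and is served directly."""
    return ipaddress.ip_address(remote_ip) not in BUILDER_NET

print(goes_to_cdn("10.3.169.42"))   # False: a buildvm, served directly
print(goes_to_cdn("198.51.100.7"))  # True: external client, sent to the CDN
```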

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

3 years ago

