#11973 rpm-ostree error (503) from hetzner to https://d2uk5hbyrobdzx.cloudfront.net
Opened 3 months ago by stkoelle. Modified 3 days ago

Describe what you would like us to do:

Fedora CoreOS Zincati has started to behave very flakily.

It reports 503 errors like:

Jun 05 10:39:27 rs185 zincati[4105]:     error: While pulling 72cf2f80ba1496d478e110d03e1199d9d21382840e96ffeddf4303eb040fbb55: Server returned HTTP 503  

Digging deeper, the errors are coming from rpm-ostree:

Jun 05 10:39:21 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/41/5200e439bfc968bb9a26b74e29b0bc7c87c119db9f9749a25172c82afee61a.filez>: Server returned HTTP 503
Jun 05 10:39:23 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/99/601d82918ce51d6267d837cdaba5df5c5448cecf08cd0de61d601a7d27120a.filez>: Server returned HTTP 503
Jun 05 10:39:25 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/db/e3e23f2e97dfac8f3a518e66249d58dea1dd3e102cdfeeb4d34367ec9e1020.filez>: Server returned HTTP 503
Jun 05 10:39:25 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/56/dff8ea8710640a29f53b8de39f8bc11b0ef6559685601b2f8f5679d676f8d0.filez>: Server returned HTTP 503
Jun 05 10:39:27 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez>: Server returned HTTP 503
Jun 05 10:39:27 rs185 rpm-ostree[3686]: Txn Deploy on /org/projectatomic/rpmostree1/fedora_coreos failed: While pulling 72cf2f80ba1496d478e110d03e1199d9d21382840e96ffeddf4303eb040fbb55: Server returned HTTP 503

When fetching those URLs with curl, it works:

curl -v https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez
* Host d2uk5hbyrobdzx.cloudfront.net:443 was resolved.
* IPv6: (none)
* IPv4: 18.244.20.208, 18.244.20.19, 18.244.20.87, 18.244.20.212
*   Trying 18.244.20.208:443...
* Connected to d2uk5hbyrobdzx.cloudfront.net (18.244.20.208) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/pki/tls/certs/ca-bundle.crt
*  CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 / x25519 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=*.cloudfront.net
*  start date: Oct 10 00:00:00 2023 GMT
*  expire date: Sep 19 23:59:59 2024 GMT
*  subjectAltName: host "d2uk5hbyrobdzx.cloudfront.net" matched cert's "*.cloudfront.net"
*  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: d2uk5hbyrobdzx.cloudfront.net]
* [HTTP/2] [1] [:path: /objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez]
* [HTTP/2] [1] [user-agent: curl/8.6.0]
* [HTTP/2] [1] [accept: */*]
> GET /objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez HTTP/2
> Host: d2uk5hbyrobdzx.cloudfront.net
> User-Agent: curl/8.6.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/2 200
< content-length: 15160
< server: Apache
< strict-transport-security: max-age=31536000; includeSubDomains; preload
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: same-origin
< strict-transport-security: max-age=31536000; includeSubDomains; preload
< last-modified: Mon, 29 Apr 2024 04:50:48 GMT
< apptime: D=3711
< x-fedora-appserver: kojipkgs02.iad2.fedoraproject.org
< x-varnish: 299826652
< via: 1.1 kojipkgs02.iad2.fedoraproject.org (Varnish/7.3), 1.1 93f1c701362eb59a676baaac7ea81bd8.cloudfront.net (CloudFront)
< accept-ranges: bytes
< x-fedora-proxyserver: proxy01.iad2.fedoraproject.org
< x-fedora-requestid: ZlqsG5WluXd4lnAadIKB5gAAA44
< date: Thu, 06 Jun 2024 05:08:17 GMT
< x-cache: Hit from cloudfront
< x-amz-cf-pop: FRA56-P11
< x-amz-cf-id: bzydx-VPNTc2ypEluova1rfZ8VJVAbXq28_nBilZFIe3haM2hmT6lA==
< age: 11394
<
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* Failure writing output to destination
* Connection #0 to host d2uk5hbyrobdzx.cloudfront.net left intact

I find it interesting that rpm-ostree complains about a 503 for a different URL on each run.
Is this by design? Otherwise this might be difficult for the CDN.
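
As a data point, something like this could list which objects returned a 503 in the current boot, so successive runs can be compared (a rough sketch; I'm assuming the daemon logs under the rpm-ostreed.service unit):

# List the distinct CloudFront object URLs that returned HTTP 503 since boot.
journalctl -b -u rpm-ostreed.service --no-pager \
    | grep -oE 'https://d2uk5hbyrobdzx\.cloudfront\.net/objects/[^>]+' \
    | sort -u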

When do you need this to be done by? (YYYY/MM/DD)


~ 2 months?


Workaround: restart Zincati in a loop; it will work every other time:

#!/bin/bash

# Function to restart the zincati service
# Function to restart the zincati service
restart_service() {
    echo "HTTP 503 error detected. Restarting zincati.service..."
    if systemctl restart zincati.service; then
        echo "zincati.service restarted successfully."
    else
        echo "Failed to restart zincati.service."
    fi
}

# Monitor the journal logs for zincati.service
journalctl -fu zincati.service | while read -r line; do
    if echo "$line" | grep -q "Server returned HTTP 503"; then
        restart_service
    fi
done
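
To keep the watcher running without an open shell, it could be installed and started as a transient unit, roughly like this (the script name, path, and unit name here are just placeholders):

# Install the script above and run it as a transient systemd unit.
install -m 0755 watch-zincati.sh /usr/local/bin/watch-zincati.sh
systemd-run --unit=watch-zincati --collect /usr/local/bin/watch-zincati.sh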

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation

3 months ago

What version of rpm-ostree do you have there?

FCOS: 40.20240504.3.0 and 40.20240416.3.1
This is rpm-ostree 2024.5.

OK, in ostree 2024.6 there's a fix to retry on 5xx errors (retrying was already intended, but it wasn't actually happening). I don't know if there's any way for you to test 2024.6?
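
A rough way to check what is installed and, if acceptable for that host, try a stream that may already carry the newer ostree (a sketch; whether testing/next already ships ostree 2024.6 would need checking, and switching streams has its own implications):

# Show the rpm-ostree and libostree versions currently on the host.
rpm-ostree --version
rpm -q ostree

# Optionally rebase to the 'next' stream, which picks up newer components sooner.
sudo rpm-ostree rebase fedora:fedora/x86_64/coreos/next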

Also, I looked at the CloudFront stats. There's been a pretty constant 0.02% error rate. I am not sure why it's hitting you so often, but it's a really small percentage and it has seemingly always been there. The change we made to HAProxy a while back didn't seem to change this any.

@stkoelle so let's see if the problem persists once you get onto a version of FCOS with rpm-ostree 2024.6?

I think there is another underlying problem/bug with rpm-ostree's downloads. I will monitor further to pinpoint it. Right now I have a success rate of around 10%.
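
To put a number on it from this host's side, a crude probe could repeatedly fetch one of the failing objects and tally the status codes (a sketch; note that plain curl succeeded above, so this may not reproduce what libostree sees):

# Fetch the same object 50 times and count the HTTP status codes returned.
url='https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez'
for i in $(seq 1 50); do
    curl -s -o /dev/null -w '%{http_code}\n' "$url"
done | sort | uniq -c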

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Reporter (was: Waiting on Assignee)

3 months ago

I am still getting many 503s right now:
Jun 20 09:21:33 rs184 zincati[6632]: [ERROR zincati::update_agent::actor] failed to stage deployment: rpm-ostree deploy failed:
Jun 20 09:21:33 rs184 zincati[6632]: error: While pulling a65ed051ae3c7ae658f19bee19ff36be19723070282305382890a793904f6f5e: Server returned HTTP 503

The CloudFront stats somehow don't seem to cover that.

Yeah, it's very odd. Are you still seeing it now?

Can you do a traceroute / ping to the CF endpoint and see if it's something between you and it?

For now the problem is gone.

Metadata Update from @stkoelle:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

19 days ago

The problem is still here:

Sep 12 12:11:17 bitbucketrunner1 rpm-ostree[1119979]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/12/ad2630f0905778084a499e5cc7a0d11db16484779374599cb5d9371578fe27.dirtree>: Server returned HTTP 503
Sep 12 12:11:17 bitbucketrunner1 rpm-ostree[1119979]: Txn Deploy on /org/projectatomic/rpmostree1/fedora_coreos failed: While pulling 759c112d3a3d1f762ba368106411fcc4edf8b1c39323aca99269741c88e6a597: Server returned HTTP 503

I am confused by the tracepath; ping looks OK:

tracepath d2uk5hbyrobdzx.cloudfront.net
1?: [LOCALHOST]                      pmtu 1500
1:  ???                                                   0.389ms
1:  ???                                                   0.400ms
2:  core22.fsn1.hetzner.com                               1.108ms
3:  hos-tr4.ex3k9.dc4.fsn1.hetzner.com                    4.888ms
4:  amazon.hetzner.com                                    4.933ms asymm  5
5:  no reply
6:  no reply
7:  no reply
8:  no reply
9:  no reply
10:  no reply
...
25:  no reply
26:  no reply
27:  no reply
28:  no reply
29:  no reply
30:  no reply
Too many hops: pmtu 1500
Resume: pmtu 1500


root@bitbucketrunner1 /var/home/koelle # ping d2uk5hbyrobdzx.cloudfront.net
PING d2uk5hbyrobdzx.cloudfront.net (18.244.20.19) 56(84) bytes of data.
64 bytes from server-18-244-20-19.fra56.r.cloudfront.net (18.244.20.19): icmp_seq=1 ttl=250 time=5.22 ms
64 bytes from server-18-244-20-19.fra56.r.cloudfront.net (18.244.20.19): icmp_seq=2 ttl=250 time=5.17 ms

--- d2uk5hbyrobdzx.cloudfront.net ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 5.173/5.197/5.222/0.024 ms
root@bitbucketrunner1 /var/home/koelle # tracepath server-18-244-20-19.fra56.r.cloudfront.net
1?: [LOCALHOST]                      pmtu 1500
1:  ???                                                   0.485ms
1:  ???                                                   0.368ms
2:  core24.fsn1.hetzner.com                              27.966ms
3:  core1.fra.hetzner.com                                 5.006ms
4:  213-133-113-126.clients.your-server.de                5.946ms asymm  5
5:  no reply
6:  no reply
7:  no reply
8:  no reply
9:  no reply
...
15:  no reply
^C 

Maybe it's an issue with their peering?
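
If it is peering, a longer-running path probe might show loss or latency spikes past the Hetzner/Amazon handoff. mtr isn't in FCOS by default, but from a toolbox container (or any nearby box) something like this gives per-hop loss over time:

# Report per-hop packet loss over 100 cycles to the CloudFront endpoint.
mtr --report --report-cycles 100 d2uk5hbyrobdzx.cloudfront.net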

Metadata Update from @stkoelle:
- Issue status updated to: Open (was: Closed)

3 days ago

