Zincati on Fedora CoreOS has started to behave very flakily.
It reports 503 errors like:
Jun 05 10:39:27 rs185 zincati[4105]: error: While pulling 72cf2f80ba1496d478e110d03e1199d9d21382840e96ffeddf4303eb040fbb55: Server returned HTTP 503
Digging deeper, the error comes from rpm-ostree:
Jun 05 10:39:21 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/41/5200e439bfc968bb9a26b74e29b0bc7c87c119db9f9749a25172c82afee61a.filez>: Server returned HTTP 503
Jun 05 10:39:23 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/99/601d82918ce51d6267d837cdaba5df5c5448cecf08cd0de61d601a7d27120a.filez>: Server returned HTTP 503
Jun 05 10:39:25 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/db/e3e23f2e97dfac8f3a518e66249d58dea1dd3e102cdfeeb4d34367ec9e1020.filez>: Server returned HTTP 503
Jun 05 10:39:25 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/56/dff8ea8710640a29f53b8de39f8bc11b0ef6559685601b2f8f5679d676f8d0.filez>: Server returned HTTP 503
Jun 05 10:39:27 rs185 rpm-ostree[3686]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez>: Server returned HTTP 503
Jun 05 10:39:27 rs185 rpm-ostree[3686]: Txn Deploy on /org/projectatomic/rpmostree1/fedora_coreos failed: While pulling 72cf2f80ba1496d478e110d03e1199d9d21382840e96ffeddf4303eb040fbb55: Server returned HTTP 503
Fetching those URLs with curl works:
curl -v https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez
* Host d2uk5hbyrobdzx.cloudfront.net:443 was resolved.
* IPv6: (none)
* IPv4: 18.244.20.208, 18.244.20.19, 18.244.20.87, 18.244.20.212
* Trying 18.244.20.208:443...
* Connected to d2uk5hbyrobdzx.cloudfront.net (18.244.20.208) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* CAfile: /etc/pki/tls/certs/ca-bundle.crt
* CApath: none
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 / x25519 / RSASSA-PSS
* ALPN: server accepted h2
* Server certificate:
* subject: CN=*.cloudfront.net
* start date: Oct 10 00:00:00 2023 GMT
* expire date: Sep 19 23:59:59 2024 GMT
* subjectAltName: host "d2uk5hbyrobdzx.cloudfront.net" matched cert's "*.cloudfront.net"
* issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
* SSL certificate verify ok.
* Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: d2uk5hbyrobdzx.cloudfront.net]
* [HTTP/2] [1] [:path: /objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez]
* [HTTP/2] [1] [user-agent: curl/8.6.0]
* [HTTP/2] [1] [accept: */*]
> GET /objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez HTTP/2
> Host: d2uk5hbyrobdzx.cloudfront.net
> User-Agent: curl/8.6.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
< HTTP/2 200
< content-length: 15160
< server: Apache
< strict-transport-security: max-age=31536000; includeSubDomains; preload
< x-frame-options: SAMEORIGIN
< x-xss-protection: 1; mode=block
< x-content-type-options: nosniff
< referrer-policy: same-origin
< strict-transport-security: max-age=31536000; includeSubDomains; preload
< last-modified: Mon, 29 Apr 2024 04:50:48 GMT
< apptime: D=3711
< x-fedora-appserver: kojipkgs02.iad2.fedoraproject.org
< x-varnish: 299826652
< via: 1.1 kojipkgs02.iad2.fedoraproject.org (Varnish/7.3), 1.1 93f1c701362eb59a676baaac7ea81bd8.cloudfront.net (CloudFront)
< accept-ranges: bytes
< x-fedora-proxyserver: proxy01.iad2.fedoraproject.org
< x-fedora-requestid: ZlqsG5WluXd4lnAadIKB5gAAA44
< date: Thu, 06 Jun 2024 05:08:17 GMT
< x-cache: Hit from cloudfront
< x-amz-cf-pop: FRA56-P11
< x-amz-cf-id: bzydx-VPNTc2ypEluova1rfZ8VJVAbXq28_nBilZFIe3haM2hmT6lA==
< age: 11394
<
Warning: Binary output can mess up your terminal. Use "--output -" to tell
Warning: curl to output it to your terminal anyway, or consider "--output
Warning: <FILE>" to save to a file.
* Failure writing output to destination
* Connection #0 to host d2uk5hbyrobdzx.cloudfront.net left intact
I find it interesting that rpm-ostree complains about a 503 for a different URL on each run. Is this by design? Otherwise this might be difficult for the CDN.
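Since rpm-ostree requests many objects per pull, one successful curl does not rule out intermittent 503s. A rough sketch of probing the same URLs repeatedly and tallying the status codes could look like this. The function names `probe_url`, `tally_codes`, and `probe_many` are made up for illustration, and sequential curl requests are only an approximation of how libostree fetches objects.

```shell
#!/bin/bash

# Print the HTTP status code for one GET request (curl prints 000 if the
# connection itself fails). The response body is discarded.
probe_url() {
    curl --silent --output /dev/null --write-out '%{http_code}' --max-time 30 "$1"
}

# Read one status code per line on stdin and print "code: count" per code.
tally_codes() {
    sort | uniq -c | awk '{ print $2 ": " $1 }'
}

# Usage: probe_many COUNT URL...
# Request every URL COUNT times and summarize the status codes seen.
probe_many() {
    local count=$1
    shift
    local url i
    for url in "$@"; do
        for ((i = 0; i < count; i++)); do
            probe_url "$url"
            echo    # newline after each status code so tally_codes sees one per line
        done
    done | tally_codes
}
```

For example, `probe_many 50 https://d2uk5hbyrobdzx.cloudfront.net/objects/13/06f20ecd9aa85efc08c51ffdc7ee51fdfd2242cdc5aa24b11dd1fd171d3975.filez` would request that object 50 times and print how often each status code came back.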
~2 months?
Workaround: restart zincati in a loop; it works every other time:
#!/bin/bash

# Function to restart the zincati service
restart_service() {
    echo "HTTP 503 error detected. Restarting zincati.service..."
    # Test the exit status of systemctl directly instead of checking $?
    if systemctl restart zincati.service; then
        echo "zincati.service restarted successfully."
    else
        echo "Failed to restart zincati.service."
    fi
}

# Monitor the journal logs for zincati.service
journalctl -fu zincati.service | while read -r line; do
    if echo "$line" | grep -q "Server returned HTTP 503"; then
        restart_service
    fi
done
Metadata Update from @zlopez: - Issue priority set to: Waiting on Assignee (was: Needs Review) - Issue tagged with: Needs investigation
What version of rpm-ostree do you have there?
FCOS 40.20240504.3.0 and 40.20240416.3.1; this is rpm-ostree 2024.5.
OK, ostree 2024.6 includes a fix to retry on 5xx errors (retrying was always intended, but it wasn't working properly). Is there any way for you to test 2024.6?
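To illustrate what retrying on 5xx errors means in practice, here is a minimal sketch with exponential backoff. It is not ostree's implementation. The helper names `is_retryable` and `fetch_with_retry`, the default of 5 attempts, and the 1-second starting delay are all assumptions made for the example.

```shell
#!/bin/bash

# Succeed only for 5xx status codes, which this sketch treats as transient.
is_retryable() {
    [[ $1 =~ ^5[0-9][0-9]$ ]]
}

# Usage: fetch_with_retry URL DEST [MAX_ATTEMPTS]
# Download URL to DEST, retrying on 5xx with a delay that doubles each time.
# Any other non-200 result (including curl's 000 for connection failures)
# is treated as fatal in this sketch.
fetch_with_retry() {
    local url=$1 dest=$2 max_attempts=${3:-5}
    local delay=1 attempt code
    for ((attempt = 1; attempt <= max_attempts; attempt++)); do
        code=$(curl --silent --output "$dest" --write-out '%{http_code}' "$url")
        if [[ $code == 200 ]]; then
            return 0
        fi
        if ! is_retryable "$code"; then
            echo "giving up on $url: HTTP $code is not retryable" >&2
            return 1
        fi
        echo "attempt $attempt/$max_attempts for $url got HTTP $code" >&2
        # Back off before the next attempt, but not after the last one.
        if ((attempt < max_attempts)); then
            sleep "$delay"
            delay=$((delay * 2))
        fi
    done
    return 1
}
```

With a retry like this, a single transient 503 on one object no longer fails the whole transfer, which is the behaviour the fix is meant to restore.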
Also, I looked at cloudfront stats. There's been a pretty constant 0.02% error rate. I am not sure why it's hitting you so often, but it's a really small percentage and it's seemingly always been there. The change we made to haproxy a while back didn't seem to change this any.
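A small per-request error rate can still fail a noticeable share of pulls when nothing retries, because one pull fetches many objects. The sketch below estimates this under two assumptions that are not confirmed anywhere in this ticket: failures are independent, and a pull fails as soon as any single request returns an error. The function name `pull_success_rate` and the request counts are hypothetical.

```shell
#!/bin/bash

# Usage: pull_success_rate ERROR_RATE REQUESTS
# Print the probability that all REQUESTS succeed when each one fails
# independently with probability ERROR_RATE: (1 - p)^n.
pull_success_rate() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.4f\n", (1 - p) ^ n }'
}
```

With the 0.02% rate, `pull_success_rate 0.0002 1000` comes out at roughly 0.82 and `pull_success_rate 0.0002 10000` at roughly 0.14. Under this simple model a success rate near 10% would need more than 10,000 requests per pull, so either the pulls are that large or the error rate seen from the affected hosts is higher than the aggregate figure.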
@stkoelle so let's see if the problem persists once you get onto a version of FCOS with rpm-ostree 2024.6?
I think there is another underlying problem/bug in rpm-ostree's downloads. I will keep monitoring to pinpoint it better. My success rate is around 10%.
Metadata Update from @zlopez: - Issue priority set to: Waiting on Reporter (was: Waiting on Assignee)
I am still getting many 503s right now:
Jun 20 09:21:33 rs184 zincati[6632]: [ERROR zincati::update_agent::actor] failed to stage deployment: rpm-ostree deploy failed:
Jun 20 09:21:33 rs184 zincati[6632]: error: While pulling a65ed051ae3c7ae658f19bee19ff36be19723070282305382890a793904f6f5e: Server returned HTTP 503
The CloudFront stats somehow don't seem to cover this.
Yeah, it's very odd. Are you still seeing it now?
Can you do a traceroute / ping to the CF endpoint and see if it's something between you and it?
We are still seeing a lot of them; I will do the traces.
<img alt="2024-07-24_17-14.png" src="/fedora-infrastructure/issue/raw/files/6e2060f1ded99782f140cbfe99327063eac898e015b39595336280e4a6fe6d92-2024-07-24_17-14.png" />
Any news here?
For now the problem is gone.
Metadata Update from @stkoelle: - Issue close_status updated to: Fixed - Issue status updated to: Closed (was: Open)
The problem is still here:
rned HTTP 503
Sep 12 12:11:17 bitbucketrunner1 rpm-ostree[1119979]: libostree HTTP error from remote fedora for <https://d2uk5hbyrobdzx.cloudfront.net/objects/12/ad2630f0905778084a499e5cc7a0d11db16484779374599cb5d9371578fe27.dirtree>: Server returned HTTP 503
Sep 12 12:11:17 bitbucketrunner1 rpm-ostree[1119979]: Txn Deploy on /org/projectatomic/rpmostree1/fedora_coreos failed: While pulling 759c112d3a3d1f762ba368106411fcc4edf8b1c39323aca99269741c88e6a597: Server returned HTTP 503
The tracepath output confuses me, but ping looks OK:
tracepath d2uk5hbyrobdzx.cloudfront.net
 1?: [LOCALHOST] pmtu 1500
 1: ??? 0.389ms
 1: ??? 0.400ms
 2: core22.fsn1.hetzner.com 1.108ms
 3: hos-tr4.ex3k9.dc4.fsn1.hetzner.com 4.888ms
 4: amazon.hetzner.com 4.933ms asymm 5
 5: no reply
 6: no reply
 7: no reply
 8: no reply
 9: no reply
10: no reply
...
25: no reply
26: no reply
27: no reply
28: no reply
29: no reply
30: no reply
 Too many hops: pmtu 1500
 Resume: pmtu 1500
root@bitbucketrunner1 /var/home/koelle # ping d2uk5hbyrobdzx.cloudfront.net
PING d2uk5hbyrobdzx.cloudfront.net (18.244.20.19) 56(84) bytes of data.
64 bytes from server-18-244-20-19.fra56.r.cloudfront.net (18.244.20.19): icmp_seq=1 ttl=250 time=5.22 ms
64 bytes from server-18-244-20-19.fra56.r.cloudfront.net (18.244.20.19): icmp_seq=2 ttl=250 time=5.17 ms
--- d2uk5hbyrobdzx.cloudfront.net ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 5.173/5.197/5.222/0.024 ms
root@bitbucketrunner1 /var/home/koelle # tracepath server-18-244-20-19.fra56.r.cloudfront.net
 1?: [LOCALHOST] pmtu 1500
 1: ??? 0.485ms
 1: ??? 0.368ms
 2: core24.fsn1.hetzner.com 27.966ms
 3: core1.fra.hetzner.com 5.006ms
 4: 213-133-113-126.clients.your-server.de 5.946ms asymm 5
 5: no reply
 6: no reply
 7: no reply
 8: no reply
 9: no reply
...
15: no reply
^C
Maybe it's an issue with their peering?
Metadata Update from @stkoelle: - Issue status updated to: Open (was: Closed)