#11317 sporadic HTTP 404s returned when fetching valid commits from the Fedora IoT ostree remote
Closed: Fixed 10 months ago by kevin. Opened 11 months ago by miabbott.

Describe what you would like us to do:

Please help us understand why the Fedora IoT ostree remote is returning HTTP 404 errors for ostree commits that are known to be good.

When do you need this to be done by? (YYYY/MM/DD)

2023/06/16


@jkonecny came into the #iot Matrix channel reporting that the Anaconda integration tests were failing because they were getting HTTP 404 errors for ostree commits on the fedora/rawhide/x86_64/iot ref. (See: https://matrix.to/#/!EJxoBZsXFkznRPFGzT:matrix.org/$EgHAQciVZUAEp78aA8KJxEFhm0e-vVxa0ymwQE4-9Ao?via=libera.chat&via=fedora.im&via=matrix.org)

We both tried debugging this behavior with raw curl requests against the Fedora IoT ostree remote (https://ostree.fedoraproject.org/iot) and with ostree commands.

What we observed were sporadic HTTP 404 errors when fetching commits that are known to exist in the ostree repo.
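A raw check of one failing object looked roughly like this (a sketch; the commit checksum is the one from the report below, and the 404 only reproduces intermittently):

curl -sI https://ostree.fedoraproject.org/iot/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit | head -n 1
# sporadically prints an HTTP 404 status line even though the commit exists in the repo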

For example, @jkonecny said:

hmm, having similar issue even on stable ref:

17:25:56,589 ERR python3:libostree HTTP error from remote fedora-iot for <https://ostree.fedoraproject.org/iot/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit>: Server returned HTTP 404

yes it's there too:

sudo ostree pull --commit-metadata-only --verbose --depth=-1 fedora-iot:fedora/stable/x86_64/iot
OT: Preparing transaction in repository 0x5634936752d0
OT: Pushing lock non-blocking with timeout 30
OT: Opening repo lock file
OT: Push lock: state=unlocked, depth=0
OT: Locking repo shared
OT: Reusing tmpdir staging-f32f9888-20c5-4cd1-bb8f-6477ad548d58-MFcsXj
OT: resuming legacy transaction
OT: starting fetch of 91ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit (detached)
OT: fetch of 91ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit (detached) complete
OT: starting fetch of 91ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit
OT: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
OT: fetch of 91ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit complete
OT: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
OT: Request caught error: Server returned HTTP 404
OT: pull: idle, exiting mainloop
OT: Aborting transaction in repository 0x5634936752d0
OT: Popping lock non-blocking with timeout 30
OT: Pop lock: state=shared, depth=1
OT: Unlocking repo

error: Server returned HTTP 404

Then I tried to pull the commit with ostree and did so successfully:

this is chugging along on my IoT device right now: ostree --verbose pull fedora-iot fedora/stable/x86_64/iot@91ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822

We suspect the server-side logs contain evidence of the HTTP 404 errors we encountered, which might give us a glimpse into what is going wrong.

One theory: the IoT ostree repo is backed by a CloudFront CDN:

# cat /etc/ostree/remotes.d/fedora-iot.conf 
[remote "fedora-iot"]
url=https://ostree.fedoraproject.org/iot
gpg-verify=true
gpgkeypath=/etc/pki/rpm-gpg/
contenturl=mirrorlist=https://ostree.fedoraproject.org/iot/mirrorlist
# curl -L https://ostree.fedoraproject.org/iot/mirrorlist
https://d2ju0wfl996cmc.cloudfront.net/

...perhaps there is some interaction there where the CDN is not populated or not serving up the commits successfully? The cache needs to be primed, as it were?
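One way to test that theory would be to skip the mirrorlist indirection and query the CloudFront distribution directly (a sketch; assumes object paths are resolved relative to the mirror base URL, as ostree does with a mirrorlist contenturl):

curl -sI https://d2ju0wfl996cmc.cloudfront.net/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit | head -n 1
# a 404 here, while the origin serves the object fine, would point at the CDN cache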


We do use cloudfront... should I try just invalidating it and having it recache?

Are the failing attempts coming from aws instances?

Metadata Update from @zlopez:
- Issue priority set to: Waiting on Assignee (was: Needs Review)
- Issue tagged with: Needs investigation, medium-gain, ops

11 months ago

I was able to reproduce the 404 behavior in a VM on my workstation.

@jkonecny said that the Anaconda integration tests were hitting the 404 more reliably; I don't know where they run, so we'll have to wait to hear from them.

Hi, no we don't use AWS. I ran these tests on my local machine (CZ, Brno).

If you want to, you can definitely do testing.

ok, I invalidated the cloudfront cache for /objects/91/*
Can you see if that fixes the above one? (That way we'll know whether it's the cloudfront cache or something deeper.)
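(For reference, a CloudFront invalidation like that is done with the AWS CLI; roughly the following, with a made-up distribution ID:)

aws cloudfront create-invalidation \
    --distribution-id EXAMPLEDISTID \
    --paths "/objects/91/*"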

Hi, unfortunately it didn't help. I'm still getting:

sudo ostree pull --commit-metadata-only --verbose --depth=-1 fedora-iot:fedora/stable/x86_64/iot
OT: Preparing transaction in repository 0x561e828cc770
OT: Pushing lock non-blocking with timeout 30
OT: Opening repo lock file
OT: Push lock: state=unlocked, depth=0
OT: Locking repo shared
OT: Using new tmpdir staging-dc694ef4-1f65-461d-95ed-3939ead4c1f1-Qg2yBH
OT: starting fetch of 6341c68b2344f304a97962641cef0ebf882b88a8ff523f7b0c3e0aad3427af8b.commit (detached)
OT: fetch of 6341c68b2344f304a97962641cef0ebf882b88a8ff523f7b0c3e0aad3427af8b.commit (detached) complete
OT: starting fetch of 6341c68b2344f304a97962641cef0ebf882b88a8ff523f7b0c3e0aad3427af8b.commit
OT: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
OT: fetch of 6341c68b2344f304a97962641cef0ebf882b88a8ff523f7b0c3e0aad3427af8b.commit complete
OT: _ostree_fetcher_should_retry_request: error: 112:1 Server returned HTTP 404, n_retries_remaining: 5
OT: Request caught error: Server returned HTTP 404
OT: pull: idle, exiting mainloop
OT: Aborting transaction in repository 0x561e828cc770
OT: Popping lock non-blocking with timeout 30
OT: Pop lock: state=shared, depth=1
OT: Unlocking repo

error: Server returned HTTP 404

It is a different commit now, and I'm getting a 404 for exactly this commit every time.

:(

ok, I think I might see what's going on. Can you try another commit/sync?

Next one :)

sudo ostree pull --commit-metadata-only --verbose --depth=-1 fedora-iot:fedora/stable/x86_64/iot
OT: Preparing transaction in repository 0x564dda2392d0
OT: Pushing lock non-blocking with timeout 30
OT: Opening repo lock file
OT: Push lock: state=unlocked, depth=0
OT: Locking repo shared
OT: Reusing tmpdir staging-37b23f9b-9fba-441a-9477-cf6e7389aa79-VF5m3V
OT: resuming legacy transaction
OT: starting fetch of cdfcb9f6cf0a00d5c3242ccb3651c4c92489c4a2d1d84ef560b30859370f02d5.commit (detached)
OT: fetch of cdfcb9f6cf0a00d5c3242ccb3651c4c92489c4a2d1d84ef560b30859370f02d5.commit (detached) complete
OT: starting fetch of cdfcb9f6cf0a00d5c3242ccb3651c4c92489c4a2d1d84ef560b30859370f02d5.commit
OT: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
OT: fetch of cdfcb9f6cf0a00d5c3242ccb3651c4c92489c4a2d1d84ef560b30859370f02d5.commit complete
OT: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
OT: Request caught error: Server returned HTTP 404
OT: pull: idle, exiting mainloop
OT: Aborting transaction in repository 0x564dda2392d0
OT: Popping lock non-blocking with timeout 30
OT: Pop lock: state=shared, depth=1
OT: Unlocking repo

error: Server returned HTTP 404

I wonder if it would be helpful to you: I can try to give you SSH access so you can test it yourself. If that would be helpful, please contact me.

Well, how about I just force cloudfront to invalidate all of its cache for this and we see if that clears it up?

At least that will tell us if it's caching or not.

Ok, done.

Metadata Update from @kevin:
- Issue assigned to kevin

11 months ago

Interesting. Still getting 404:

sudo ostree pull --commit-metadata-only --verbose --depth=-1 fedora-iot:fedora/stable/x86_64/iot
OT: Preparing transaction in repository 0x557c9a03d2d0
OT: Pushing lock non-blocking with timeout 30
OT: Opening repo lock file
OT: Push lock: state=unlocked, depth=0
OT: Locking repo shared
OT: Reusing tmpdir staging-d93d2131-16a9-4c27-9dbe-0fbff525fc24-MAqGLh
OT: resuming legacy transaction
OT: starting fetch of 9c94694a40b3a8937dc97be9b969afe17b8d417db4a20bbfbfe6f0e71e58c552.commit (detached)
OT: fetch of 9c94694a40b3a8937dc97be9b969afe17b8d417db4a20bbfbfe6f0e71e58c552.commit (detached) complete
OT: starting fetch of 9c94694a40b3a8937dc97be9b969afe17b8d417db4a20bbfbfe6f0e71e58c552.commit
OT: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
OT: fetch of 9c94694a40b3a8937dc97be9b969afe17b8d417db4a20bbfbfe6f0e71e58c552.commit complete
OT: _ostree_fetcher_should_retry_request: error: 125:1 Server returned HTTP 404, n_retries_remaining: 5
OT: Request caught error: Server returned HTTP 404
OT: pull: idle, exiting mainloop
OT: Aborting transaction in repository 0x557c9a03d2d0
OT: Popping lock non-blocking with timeout 30
OT: Pop lock: state=shared, depth=1
OT: Unlocking repo

error: Server returned HTTP 404

So I looked a bit at this in my spare time this weekend.

https://ostree.fedoraproject.org/iot/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit returns a 404, and it's hitting our proxies.

https://kojipkgs.fedoraproject.org/compose/iot/repo/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit is fine, and it also goes via our proxies.

I am not clear on how the ostree path works. I will need to do some more digging. It seems like the problem is on our end though.
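The discrepancy can be reproduced with plain curl (a quick sketch comparing the two URLs above):

# via the ostree.fedoraproject.org path -> 404
curl -sI https://ostree.fedoraproject.org/iot/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit | head -n 1
# via kojipkgs -> succeeds
curl -sI https://kojipkgs.fedoraproject.org/compose/iot/repo/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit | head -n 1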

Can someone do an ostree pull -v and see if that provides more info on what path it's taking?

That aligns with what I tested. Using the kojipkgs link (which we had been using until now) works fine, but when the iot link is used we get 404s. However, I wasn't aware it's a proxy.

I tested ostree pull -v --commit-metadata-only --verbose --depth=-1 fedora-iot:fedora/stable/x86_64/iot and the output is the same as before.

The commit failing now is 9221e365b4629aa4c0f00ec39bb0d8ac574738598b2c7ebd9fdf9e7c341bd0c1.commit.

Hi, any progress? Can I help somehow with this?

I guess we need someone to trace the exact path and try and see where this issue is happening.

I've got it on my list, but I simply have not had any spare cycles yet to do it.

I'll try and get to it soon.

So, what appears to happen is this:

the client sends a request to https://ostree.fedoraproject.org/iot/mirrorlist, then summary.sig, then summary, then config

the mirrorlist contains:

https://d2ju0wfl996cmc.cloudfront.net/

The summary.sig doesn't exist.

summary seems to be an ostree ref?

config has:

[core]
repo_version=1
mode=archive-z2

So, it looks to me like it's hitting ostree.fedoraproject.org to get that mirror (cloudfront) and then going direct to that. ;(
Of course cloudfront queries kojipkgs, so there could be something there, but at a quick glance I don't see any 404s there from cloudfront requests except:

127.0.0.1 - - [15/Jun/2023:22:02:01 +0000] "GET /compose/iot/repo/delta-indexes/5_/OF0wToXcJ5NR6KUixpmK9Wr4fsPca7XWURz_EzCUo.index HTTP/1.1" 404 196 "-" "Amazon CloudFront"
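For the record, the request sequence above can be replayed by hand with curl to see which host serves which piece (a rough sketch based on the observations here):

# 1. the mirrorlist on ostree.fedoraproject.org returns the content base URL
curl -s https://ostree.fedoraproject.org/iot/mirrorlist
# -> https://d2ju0wfl996cmc.cloudfront.net/
# 2. repo metadata (summary, config) comes from the metadata URL
curl -s https://ostree.fedoraproject.org/iot/config
# 3. objects are then fetched relative to the mirror base, e.g.:
curl -sI https://d2ju0wfl996cmc.cloudfront.net/objects/91/ed83a67e0c55b9812ca9a87ef7f0963db236eb4c6e8e33eee095fbb735d822.commit | head -n 1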

I do see there's a different cloudfront distribution for iot than the one I invalidated before.
I just invalidated the other one now.

Can you check if you still see the problem?

Unfortunately still seeing the error :( :

sudo ostree pull -v --commit-metadata-only --verbose --depth=-1 fedora-iot:fedora/stable/x86_64/iot
OT: Preparing transaction in repository 0x55f3d0501770
OT: Pushing lock non-blocking with timeout 30
OT: Opening repo lock file
OT: Push lock: state=unlocked, depth=0
OT: Locking repo shared
OT: Reusing tmpdir staging-659210ac-68a6-4a42-b055-51701f6891fc-05RJSj
OT: resuming legacy transaction
OT: starting fetch of bc99d2ec8d127c5571e6ed0ec1a78bd7ff3bb99a1ce49ce6282c86606b5a63fc.commit (detached)
OT: fetch of bc99d2ec8d127c5571e6ed0ec1a78bd7ff3bb99a1ce49ce6282c86606b5a63fc.commit (detached) complete
OT: starting fetch of bc99d2ec8d127c5571e6ed0ec1a78bd7ff3bb99a1ce49ce6282c86606b5a63fc.commit
OT: _ostree_fetcher_should_retry_request: error: unset, n_retries_remaining: 5
OT: fetch of bc99d2ec8d127c5571e6ed0ec1a78bd7ff3bb99a1ce49ce6282c86606b5a63fc.commit complete
OT: _ostree_fetcher_should_retry_request: error: 112:1 Server returned HTTP 404, n_retries_remaining: 5
OT: Request caught error: Server returned HTTP 404
OT: pull: idle, exiting mainloop
OT: Aborting transaction in repository 0x55f3d0501770
OT: Popping lock non-blocking with timeout 30
OT: Pop lock: state=shared, depth=1
OT: Unlocking repo

error: Server returned HTTP 404

ok. I am not sure what to try next here. I guess I could enable logging on the cloudfront side and hopefully see why it's not caching some objects?

We don't see this on any other ostree repos, right? Just iot? Fedora CoreOS, Silverblue, etc. are all not seeing it?

Hi, I found the issue. It seems to be on my configuration side, but it still doesn't resolve the Anaconda use case.

Anyway, the pull works as expected if I add:
contenturl=mirrorlist=https://ostree.fedoraproject.org/iot/mirrorlist
to /etc/ostree/remotes.d/fedora-iot.conf. So it seems the mirrorlist contenturl is required to resolve this issue.

If this is expected, then we can close this issue; the problem is that Anaconda does not support /etc/ostree/remotes.d/fedora-iot.conf, AFAIK. I'll ask the ostree developers whether this is something missing on the Anaconda side or not.
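For reference, the full remote config that works for me (matching the one shown earlier in this ticket):

# cat /etc/ostree/remotes.d/fedora-iot.conf
[remote "fedora-iot"]
url=https://ostree.fedoraproject.org/iot
gpg-verify=true
gpgkeypath=/etc/pki/rpm-gpg/
contenturl=mirrorlist=https://ostree.fedoraproject.org/iot/mirrorlist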

Thanks a lot for all your effort and sorry for taking your time... :(

Ah ha!

We have https://ostree.fedoraproject.org/iot/mirrorlist pointing to the cloudfront cache... does that factor into things?

I'll go ahead and close this now then, but do reopen or file a new ticket if there's anything we can adjust from our end.

Metadata Update from @kevin:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

10 months ago
