#387 atomic workstation failures with stale file handle
Closed as Fixed 6 years ago. Opened 6 years ago by dustymabe.

We have been trying to build Atomic Workstation for a few days and haven't really been successful. We consistently get a stale file handle error:

Committing: 100%
error: While writing rootfs to mtree: fstatat(18/f352b4dfa8c0892a1a89dd62a541c800399f4e96d43bc23ccf6c78fb66b6bd.filez): Stale file handle

It would be easy to blame NFS for this, but I think it warrants some investigation: our Atomic Host ostree composes have been succeeding while the Atomic Workstation ones fail, so all other things being equal, Atomic Workstation is the one failing. It also fails consistently at the same place, which makes me think it's not necessarily a networking issue, but rather either a problem in ostree/rpm-ostree, or an NFS problem that ostree/rpm-ostree aggravates.

I do notice that Rawhide composes seem to be working fine. The only difference I can see there is a newer version of rpm-ostree. We should get the new rpm-ostree into Fedora 27 stable (it should be in the next run) and see if that helps.

Here are the error logs:

  • 1
  • 2 - this one succeeded, but there are missing objects in the repo (see the ostree fsck sketch below for one way to verify this):
Nov 21 10:41:50 localhost.localdomain ostree[2755]: libostree HTTP error from remote fedora-workstation for <https://kojipkgs.fedoraproject.org/compose/updates/atomic/deltas/rM/AH5m0iAva2gB_4KLzfVBD7RMQiHpdVZLhI3bdUxCs/superblock>: Server returned HTTP 404
Nov 21 10:41:50 localhost.localdomain ostree[2755]: libostree HTTP error from remote fedora-workstation for <https://kojipkgs.fedoraproject.org/compose/updates/atomic/objects/8d/9d7dcc283355a3bd956c323febb2f4bd9a3de5f9c5bef71f2c491955b3ecf5.filez>: Server returned HTTP 404
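
For completeness, a minimal sketch of checking a repo for missing or corrupt objects with ostree's built-in verifier; the repo path here is only an example, point it at the actual NFS-mounted archive repo:

    # Example path, not the real compose location.
    REPO=/mnt/koji/compose/atomic/repo

    # Walk all commits and their dirtree/content objects and report anything
    # that is missing or fails checksum verification.
    ostree fsck --repo="$REPO"

    # Optionally remove corrupted loose objects so a later pull can re-fetch them.
    ostree fsck --repo="$REPO" --delete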

Pungi is writing directly into an NFS-mounted archive repo here? Are there concurrent writes?

One tricky thing here is that a lot of what libostree is doing for local filesystem repos is almost an anti-pattern for NFS, mainly our use of the tmp/ dir for staging. See also https://github.com/ostreedev/ostree/issues/1184

A lot of those issues go away with the "compose into bare-user, pull-local to archive" pattern.
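
A minimal sketch of that pattern, assuming a scratch repo path, ref name, and treefile name that are just placeholders (this is not what pungi does today):

    # Local scratch repo on the builder's own disk; bare-user mode works
    # unprivileged and keeps libostree's tmp/ staging off NFS.
    ostree init --repo=/var/tmp/compose-repo --mode=bare-user

    # Compose into the local repo. treefile.json stands in for the real
    # fedora-atomic-workstation treefile.
    rpm-ostree compose tree --repo=/var/tmp/compose-repo treefile.json

    # Publish by pulling the finished commit into the NFS-mounted archive repo
    # (destination path and ref name are examples).
    ostree pull-local --repo=/mnt/koji/compose/atomic/repo \
        /var/tmp/compose-repo fedora/27/x86_64/workstation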

Metadata Update from @dustymabe:
- Issue tagged with: releng, workstation

6 years ago

Pungi is writing directly into an NFS-mounted archive repo here? Are there concurrent writes?

Yes and yes. It's being written into an NFS-mounted repo, and there are multiple composes going on at the same time.

The main reason for using a networked repo is that the compose could happen on any Koji builder, and we also want the new commit we make to have a parent commit. It would be nice to use local storage for the compose and then use pull-local, but we haven't implemented that yet.

One tricky thing here is that a lot of what libostree is doing for local filesystem repos is almost an anti-pattern for NFS, mainly our use of the tmp/ dir for staging.

Where is the tmp dir located? Could it be made to use a local tmp instead of one on NFS?

See also https://github.com/ostreedev/ostree/issues/1184

What is the recommendation based on that issue? Set the fsync option to disabled?
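
(If that's the suggestion, my understanding is that it's a per-repo config knob along these lines, with an example repo path; it trades crash-safety for speed, and I'm not sure it addresses the stale-handle errors specifically.)

    # core.fsync=false skips fsync() calls when writing objects into the repo.
    ostree config --repo=/mnt/koji/compose/atomic/repo set core.fsync false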

A lot of those issues go away with the "compose into bare-user, pull-local to archive" pattern.

Is the use of an archive repo a particular problem here? i.e., would using bare-user make us less likely to see these problems?

Metadata Update from @dustymabe:
- Issue tagged with: F27

6 years ago

I found one aarch64 run that failed in this same way during this time frame: 1

Where is the tmp dir located? Could it be made to use a local tmp instead of one on NFS?

In $repo/tmp - I suspect (not sure) that it's concurrency there that's the issue. There were some issues fixed upstream here. Changing pungi to use a local bare-user repo is effectively doing "local tmp". I am not sure rpm-ostree should be in the game of detecting and special casing NFS, but I'm not opposed to it either. We could probably add an rpm-ostree option to disable its use of a staging dir.
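
(For context on why "local tmp" can't simply be split out: the staging dir lives inside the repository itself, so it sits on whatever filesystem the repo is on. Example listing below, with a made-up repo path; the directory names are the standard layout ostree init creates.)

    $ ls /mnt/koji/compose/atomic/repo
    config  extensions  objects  refs  state  tmp
    # tmp/ is where libostree stages objects before moving them into objects/,
    # so with an NFS-mounted repo that staging traffic is also on NFS.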

I filed https://pagure.io/pungi/pull-request/805 - I think it will help but I'll still need to do some libostree work here.

In $repo/tmp - I suspect (not sure) that it's concurrency there that's the issue. There were some issues fixed upstream here.

Good to know. Do you think it would be worth backporting that to F27 to see if that solves the problem?

Changing pungi to use a local bare-user repo is effectively doing "local tmp". I am not sure rpm-ostree should be in the game of detecting and special casing NFS, but I'm not opposed to it either. We could probably add an rpm-ostree option to disable its use of a staging dir.

I definitely agree that special-casing NFS is not desirable. Could we not just add an option to tell rpm-ostree which tmp staging dir to use? i.e., use /tmp (local fs) while the repo we operate on is /nfs/mounted/repo. Is that what --workdir does?
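
(What I have in mind would look roughly like the following; the treefile name and paths are placeholders, and whether --workdir actually covers the repo's own tmp/ staging is exactly what I'm unsure about.)

    # --workdir puts rpm-ostree's working files (package cache, rootfs assembly)
    # on local disk while the target repo stays on NFS.
    rpm-ostree compose tree \
        --repo=/nfs/mounted/repo \
        --workdir=/var/tmp/compose-workdir \
        fedora-atomic-workstation.json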

I filed https://pagure.io/pungi/pull-request/805 - I think it will help but I'll still need to do some libostree work here.

Thanks! Just for dummies, why is using a bare-user repo attractive here, since we're going to end up putting it in an archive repo anyway?

It's in the docs

This is because OSTree has to re-checksum and recompress the content each time it's committed. (Most of the CPU time is spent in compression which gets thrown away if the content turns out to be already stored).
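
(Concretely: an archive repo stores content objects gzip-compressed as .filez files, while bare/bare-user repos store them as plain .file objects, so committing into archive mode pays the compression cost up front even for content that is already present. Illustrative object names below, reusing a checksum from the log above:)

    # archive mode: compressed content objects
    objects/8d/9d7dcc283355a3bd956c323febb2f4bd9a3de5f9c5bef71f2c491955b3ecf5.filez
    # bare-user mode: uncompressed content objects
    objects/8d/9d7dcc283355a3bd956c323febb2f4bd9a3de5f9c5bef71f2c491955b3ecf5.file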

It's in the docs

This is because OSTree has to re-checksum and recompress the content each time it's committed. (Most of the CPU time is spent in compression which gets thrown away if the content turns out to be already stored).

Yes, but in your pull request you are making a temporary empty repo, so none of the content would already exist in it and we wouldn't save any operations?

Also I had a few other questions in https://pagure.io/atomic-wg/issue/387#comment-480937, if you don't mind answering.

rpm-ostree-2017.10-3.fc27 just made it to stable, so we can close this, although I fully intend to review and get https://pagure.io/pungi/pull-request/805 merged as well.

Metadata Update from @dustymabe:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 years ago
