#281 Figure out comprehensive strategy for atomic host container storage
Closed: Fixed 6 years ago Opened 6 years ago by dustymabe.

We have system containers storing things in /var/lib/containers/storage. We have docker/moby containers storing things in /var/lib/docker. We might move to overlay2 on a single big root partition in the future, we might not.

We have discussed these topics in a few mailing list threads [1] [2] and have also run out of storage on atomic host before [3].

Let's incorporate the good info from those discussions, put some thought into this, and try to figure out a proper strategy going forward.


Metadata Update from @dustymabe:
- Issue tagged with: F27

6 years ago

Following from that thread, the proposal from colin is that we move to one big / with overlayfs by default for our cloud/vagrant images, with all space taken by the root filesystem by default. This is something we had talked about in the past, but we had not formally proposed until now. Pending negative feedback in the f26 timeframe, I'm +1 to this.

The other part of this puzzle is making sure that all container storage can be mounted on another block device/filesystem easily.

So proposal is:

  • 1 large filesystem by default
  • Ability to easily move container storage to another filesystem on system setup.

I'm going to do some work on this; though we're blocked a bit in testing by rawhide issues.

Metadata Update from @dustymabe:
- Issue tagged with: host

6 years ago

I tested the 20170615.n.0 FAH iso; it looks like we'll need this:

diff --git a/docker.spec b/docker.spec
index b654082..b170151 100644
--- a/docker.spec
+++ b/docker.spec
@@ -569,10 +569,14 @@ echo 'STORAGE_DRIVER=overlay2' >> %{repo}-storage-setup-workstation
 ln -s %{repo}-storage-setup-workstation %{repo}-storage-setup-cloud
 # create server override config
 ln -s %{repo}-storage-setup-workstation %{repo}-storage-setup-server
-# create atomic override config
+# create atomic override config; see https://pagure.io/atomic-wg/issue/281
+%if 0%{?fedora} >= 27
+ln -s %{repo}-storage-setup-workstation %{repo}-storage-setup-atomichost
+%else
 cp %{repo}-storage-setup-server %{repo}-storage-setup-atomichost
 echo 'CONTAINER_ROOT_LV_NAME=docker-root-lv' >> %{repo}-storage-setup-atomichost
 echo 'CONTAINER_ROOT_LV_MOUNT_PATH=/var/lib/docker' >> %{repo}-storage-setup-atomichost
+%endif #atomichost
 %endif # custom_storage
 popd

Basically all the editions become the same thing. Seem sane?
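To make "all the editions become the same thing" concrete: with the patch above, the atomichost override is just a symlink to the workstation file, whose effective content (per the earlier `echo` in the spec) is a single line. A sketch writing it to a scratch path purely for illustration (the real file is installed by the docker package, not created by hand):

```shell
# Illustration only: the effective storage-setup override shared by all
# editions after the patch is just the overlay2 driver selection.
# Writing to /tmp here rather than the real packaged path.
mkdir -p /tmp/css-demo
cat > /tmp/css-demo/docker-storage-setup <<'EOF'
STORAGE_DRIVER=overlay2
EOF
cat /tmp/css-demo/docker-storage-setup
```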

seems sane to me - i asked @dwalsh and @vgoyal to weigh in - also i'll bring this up in the next meeting we have.

Metadata Update from @dustymabe:
- Issue tagged with: meeting

6 years ago

So this relies on the fact that atomic users will not use the devicemapper graph driver. If they really need devicemapper, they will need to add an additional disk (which requires extra effort and planning).

So has it been established that overlayfs works well for container workloads and that most users don't need to go back to devicemapper?

In F26, we wanted to retain capability to easily go back to devicemapper if things don't work out well with overlayfs. Given F26 is not GA, I guess we don't even have that data to make a decision?

@vgoyal: Not quite; the AH storage is the same as Server, we only use up to 15G. For real baremetal machines, they'll have larger disks, and hence there will be reserved space in the VG.

For cloud images, it's up to the operator; they can use cloud functionality to expand the root disk, or they can allocate a separate disk.
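As a sketch of the cloud-image path: the operator grows the root disk from the cloud side, and cloud-init expands the partition and filesystem on first boot. The user-data below is hypothetical and written to /tmp for illustration; cloud-init's defaults typically do this automatically, so spelling it out is only needed to override behavior.

```shell
# Hypothetical cloud-init user-data: grow the root partition and resize
# the root filesystem on first boot. These are the stock cloud-init
# growpart/resize_rootfs modules, shown explicitly for illustration.
cat > /tmp/user-data <<'EOF'
#cloud-config
growpart:
  mode: auto
  devices: ['/']
resize_rootfs: true
EOF
```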

I think I am missing something. So the rootfs, which used to be 3G by default, will now be 15G by default.

Ok, so for cloud images we will continue to have a volume group, and when the image is grown, the additional space will go into the volume group and one can carve out space for a thin pool. Yep, that will work.

Thinking of this change as "3G → 15G by default" is indeed a good and simple way to phrase it. It's not entirely accurate, since we also need to factor in the default size of the cloud images and the vagrant box. But it's a good first approximation.

But also we're no longer using a separate partition by default for AH.

(Incidentally, I have a pending proposal to also use this same partitioning for "Atomic Workstation", though for somewhat different reasons: https://pagure.io/pungi-fedora/pull-request/257 )

For reference, here's the first commit: http://pkgs.fedoraproject.org/cgit/rpms/fedora-productimg-atomic.git/commit/?id=b3ca9664e36e6a55ff1762aef4dbaafc37bde563

Discussed in the atomic working group meeting today:

Changing Fedora 27 and beyond to default to overlayfs on the root partition will help us simplify our storage setup and align better with other Fedora variants. This seems like a reasonable change to make, assuming that overlayfs proves stable for the f26 cycle.

Metadata Update from @dustymabe:
- Issue untagged with: meeting

6 years ago

I swear we were defaulting to XFS before, but I may have been confused by the fact that it's hardcoded in the cloud kickstart.

Any objections to backporting this to f26?

diff --git a/installclass_atomic.py b/installclass_atomic.py
index cce1470..7363360 100644
--- a/installclass_atomic.py
+++ b/installclass_atomic.py
@@ -36,6 +36,7 @@ class AtomicInstallClass(FedoraBaseInstallClass):
     name = "Atomic Host"
     sortPriority = 11000
     hidden = False
+    defaultFS = "xfs"

     def setDefaultPartitioning(self, storage):
         # 3GB is obviously arbitrary, but we have to pick some default.

(We could also consider backporting unified storage)

Any objections to backporting this to f26?

I think if we make a good case for why XFS vs ext4 then backporting it to f26 would be a reasonable thing to do. I'd hate to do so 'just because', though.

We already discussed the rationale as an aside in the larger partitioning discussion earlier. F27 AH defaults to XFS already; that's what Fedora Server does, which also matches the downstream RHEL Atomic Host. The dynamic inodes in particular are nice for overlayfs.

For what it's worth, I was talking with a Red Hat filesystems engineer and he strongly recommended moving to XFS across all of Fedora. Pretty much the only downside is lack of filesystem shrinking, and I don't think that's a big issue with AH. Not that we have to do it because RH says so -- I mean that that's an opinion I have some faith in.

I mean, not that I don't also have faith in you guys. :)

Pretty much the only downside is lack of filesystem shrinking,

yeah, which I think sucks, since we had that for ext4 and I actually used it a decent amount. That's just a personal gripe, though, not one I've really had to deal with much in real life.

Issue #197 has been folded into this issue. Please do look at that issue, as it has some specific use-cases we want to solve.

Specifically, we want to be sure that the full disk is allocated on AWS instances.

All the partitioning, sizing, and resizing concerns mentioned in this issue vanish with a certain other filesystem, which does all resizes (grow, shrink, add and remove devices) online and atomically and typically in a single command. Whether scripted or user issued, the commands are shorter, easier to understand, complete faster and are safer.

The gotcha, though, is that I haven't used it with overlayfs; a cursory search yields no hits. But it seems sane to allow Docker to continue to use overlayfs for the shared page cache benefit, and even snapshotting (if Docker supports that overlayfs feature now?).

But the main pro is that you can have separate fstrees read-only or read-write mounted, but they share the same storage pool, without hard barriers between them.

shrug

"certain other filesystem"? Can you please be less opaque?

Ahh sorry, I kinda figured realistically there are only three options: ext4, XFS, and Btrfs, and the only one not mentioned so far is Btrfs.

There are some hits from people using it in AWS contexts, but I have not yet run across Btrfs + overlayfs. So I started a thread on linux-btrfs@vger.kernel.org to see if anyone's using containers with Btrfs + overlayfs. As far as I'm aware it's not a pathological combination; I'd guess that to the vast majority it seems redundant, but they each bring different things to the table.

@chrismurphy ZFS is still out there, even if we can't touch it.

Yeah, I wasn't considering anything we don't have in anaconda, nor anything that hasn't already been in the Fedora kernel for some time.

Plus ZFS lacks fs shrink, so you can't remove block devices arbitrarily; it also lacks online replication and seeding. So even if it weren't for licensing, it wouldn't be the direction I'd go in.

Anyway, I refuse to wade into another XFS vs. Btrfs debate, and starting one here won't result in us having a storage plan.

I read a list of problems with negative arguments already in this thread, not supplied by me. And I've presented something that obviates literally all of them. I see it as advice, not debate.

I'm saying that the XFS vs. Btrfs decision was made elsewhere for Fedora, and we can't change it in this issue.

@jberkus
Specifically, we want to be sure that the full disk is allocate on AWS instances.

I just tried this with a rawhide qcow (not on AWS, but with kvm/qemu, giving it a bigger disk). I don't end up with a large root filesystem, just the 3G one, with docker using the root filesystem + overlayfs.

[root@cloudhost sysconfig]# lsblk
NAME              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda                 8:0    0   40G  0 disk 
├─sda1              8:1    0  300M  0 part /boot
└─sda2              8:2    0 39.7G  0 part 
  └─atomicos-root 253:0    0    3G  0 lvm  /sysroot

There is some more configuration that needs to be done to make this happen.
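For the layout shown above, that extra configuration amounts to growing the root LV into the free VG space and then growing the filesystem online. A sketch of the manual version, wrapped in a function so it is not executed here (it requires root and the atomicos LVM layout from the lsblk output):

```shell
# Sketch only: grow atomicos-root into the remaining VG free space, then
# grow the mounted filesystem online. Device and mount paths match the
# lsblk output above; adjust for other layouts.
grow_root() {
  lvextend -l +100%FREE /dev/atomicos/root   # consume all free VG space
  xfs_growfs /sysroot                        # XFS grows online; ext4 would use resize2fs
}
type grow_root >/dev/null && echo "grow_root defined"
```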

OK, in Fedora 27 and rawhide right now we default to overlayfs on the root filesystem. The root filesystem is automatically extended on cloud images to fill all available space (this can be overridden using cloud-init).

Early in this ticket we discussed:

  • 1 large filesystem by default
  • Ability to easily move container storage to another filesystem on system setup.

The first is covered (it is the default) and the 2nd can be done using CONTAINER_ROOT_LV* options to container-storage-setup.
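A sketch of that second item: the CONTAINER_ROOT_LV_NAME and CONTAINER_ROOT_LV_MOUNT_PATH options appear in the spec diff earlier in this ticket; the size value here is illustrative, and the file is written to /tmp for demonstration (the real file is /etc/sysconfig/docker-storage-setup).

```shell
# Illustrative values only: carve a dedicated LV out of the VG and mount
# it at /var/lib/docker, keeping overlay2 as the storage driver.
cat > /tmp/docker-storage-setup <<'EOF'
STORAGE_DRIVER=overlay2
CONTAINER_ROOT_LV_NAME=docker-root-lv
CONTAINER_ROOT_LV_SIZE=40%FREE
CONTAINER_ROOT_LV_MOUNT_PATH=/var/lib/docker
EOF
```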

The only part I don't think we've covered is the storage-for-system-containers discussion. In the case where we move container storage to another filesystem, we aren't doing the same for system containers.

I've opened a new issue to pick up this topic and carry it forward: https://pagure.io/atomic-wg/issue/350

Closing this ticket since most of the work is done.

Metadata Update from @dustymabe:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

6 years ago

Not sure why we want to consume the whole disk/VG by default. I am a little concerned about whether people have tested overlay on top of overlay for the container runtime, and what issues are involved.

I guess once F27 makes overlay its rootfs and the container runtime sets up overlay on top, we will find out.

Hum? What do you mean by overlay on overlay? I don't know why anyone would do that.

Try grabbing an image from today's compose and boot it - for example with the Vagrant box, we use all 40GB by default for the rootfs.

For infrastructure clouds we'll expand to whatever the cloud does. We don't use a separate LV for containers by default, but that can be enabled via cloud-init for those cases. (For the Vagrant box you'd have to shut down docker first)

Hum? What do you mean by overlay on overlay? I don't know why anyone would do that.

I thought docker sets up overlay mount points for each container. And if the rootfs is overlay, doesn't that become overlay on top of overlay?

One thing that came up in a quick IRC discussion was:

<vgoyal> walters: last I checked that xfs on rootfs does not enable quota by default.
<vgoyal> walters: and with overlayfs openshift needs quota support to limit max container size

I'd agree we should probably make this change by default.
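For context on the quota point: with the filesystem backing /var/lib/docker being XFS mounted with the prjquota option, docker's overlay2 driver can cap each container's writable layer via the overlay2.size storage option. A hypothetical daemon.json sketch, written to /tmp for illustration:

```shell
# Hypothetical: overlay2.size requires the backing XFS to be mounted
# with prjquota; on a root filesystem that means e.g. rootflags=prjquota
# on the kernel command line. The 10G cap is an arbitrary example.
mkdir -p /tmp/docker-etc
cat > /tmp/docker-etc/daemon.json <<'EOF'
{
  "storage-driver": "overlay2",
  "storage-opts": ["overlay2.size=10G"]
}
EOF
```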
