#186 switch to overlay2
Closed: Fixed 4 years ago Opened 4 years ago by walters.

<vgoyal> walters: so there are two options. One is changing default to overlay2, which is common across all variants
<vgoyal> walters: another new option is specifying where the storage from overlayfs comes from. Does it come from root filesystem or from free space in volume group
<vgoyal> for atomic, rootfs is just 3G and if overlayfs uses rootfs, then it will fill up pretty fast.
<vgoyal> so we created a new option in dss, DOCKER_ROOT_VOLUME
<vgoyal> if one specifies DOCKER_ROOT_VOLUME=yes, then dss looks for free space in volume group and creates a logical volume, makes a file system on this and mounts on /var/lib/docker/
<vgoyal> now when docker starts, it will put all images and containers in the new volume ( and not logical volume backing rootfs)
<vgoyal> given workstation rootfs uses all free space, workstation will not require this by default.
<vgoyal> but server and atomic leave free space in volume group and by default that can be used for docker. (like devicemapper graph driver)
<vgoyal> so DOCKER_ROOT_VOLUME=yes is required only for the server and atomic variants (and not the workstation one)


Upstream docker-storage-setup will continue to have devicemapper as the default graph driver. So for this to work properly on Atomic Host, we probably require the following two options to be set in /etc/sysconfig/docker-storage-setup.

STORAGE_DRIVER=overlay2
DOCKER_ROOT_VOLUME=yes

By default, a logical volume will be created which will consume 40% of the free space. The rest will remain free and can be used for growing either the rootfs or the docker logical volume. But the growing step is manual; there is no automation.

If we want to give all the free space to docker by default, then we will also have to set:

DOCKER_ROOT_VOLUME_SIZE=100%FREE
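Putting the pieces together, a complete configuration for this scheme might look like the following. This is a sketch based on the options discussed above; it writes to a local example file rather than /etc so it can be inspected before being copied into place:

```shell
# Sketch: combined docker-storage-setup configuration for overlay2 on a
# dedicated root volume, using the options discussed in this thread.
# Written to a local example file; copy it to
# /etc/sysconfig/docker-storage-setup on a real host.
cat > docker-storage-setup.example <<'EOF'
# Use overlay2 instead of the devicemapper default
STORAGE_DRIVER=overlay2
# Carve a dedicated LV out of VG free space and mount it on /var/lib/docker
DOCKER_ROOT_VOLUME=yes
# Give docker all remaining free space instead of the 40% default
DOCKER_ROOT_VOLUME_SIZE=100%FREE
EOF
cat docker-storage-setup.example
```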

So I just did a test of using OverlayFS on Fedora Atomic 25 as a cloud-init option. This resulted in having the whole disk as one volume, shared with the overlay, instead of multiple partitions. That's the desirable default setup, I think. Am I missing something?

So I just did a test of using OverlayFS on Fedora Atomic 25 as a cloud-init option. This resulted in having the whole disk as one volume, shared with the overlay, instead of multiple partitions. That's the desirable default setup, I think. Am I missing something?

That is one way to set it up, but I believe @vgoyal @walters and @dwalsh still want a separate filesystem to back the overlayfs so that your docker storage can't cannibalize your root filesystem. This is more "production" like. So here with overlayfs we gain the ability to "re-use" the rootfs if we want to, or to set it up similar to how we were doing it in the past with DM. The "requirement" of the separate fs in the past was the PITA part that new users didn't understand. Now we can do it either way.

Except that docker storage can still cannibalize your rootfs, because in Atomic we put volumes and images on the rootfs. So it's never been an effective measure.

Except that docker storage can still cannibalize your rootfs, because in Atomic we put volumes and images on the rootfs. So it's never been an effective measure.

That is why the DOCKER_ROOT_VOLUME option was added to docker-storage-setup. It tells you to put "everything" over there: https://github.com/projectatomic/docker-storage-setup/pull/175/commits/ee035598cbbd8c194ab3f6830e38972dee24744a

if you use DOCKER_ROOT_VOLUME=yes, that will create a new logical volume, create xfs filesystem on it, and mount it on /var/lib/docker/. And now if you are using overlay2 driver, it should put all images, containers and volumes on this volume and leave rootfs untouched.

So I just did a test of using OverlayFS on Fedora Atomic 25 as a cloud-init option. This resulted in having the whole disk as one volume, shared with the overlay, instead of multiple partitions. That's the desirable default setup, I think. Am I missing something?

I am curious how the storage configuration looks on this image. Can you run `lsblk`?

So with this image booted, rootfs will consume all the space available on disk?

There are so many configurations now, that I forget which one is which.

Aha, thanks. Seems like DOCKER_ROOT_VOLUME should default to "on" then if we're going to do partitioning. That leaves us with two questions:

1) What do we feel is the best default setup for an overlay-based system?

2) Do we need to support switch-back to devicemapper on a running system?

For (1), what I'm concerned about with partition-by-default is that it tends to make a no-options-simple install unusable; that is, it requires getting the user involved with partitioning on the first Atomic Host they install. This is because we can't dynamically size the rootfs volume based on available space, so we end up with the default too-small 3GB we currently have. Personally, I can't even imagine any automated configuration of two partitions which wouldn't require the user to make decisions about partitioning, even in simple cases.

So what I'm saying is that, while serious production users are going to want separate partitions, we should consider leaving that up to those who are willing to feed cloud-init (or Kickstart) parameters, and know their expected disk usage. And that the default should be "one big shared volume", because that's the only "zero information" decision we can make.

Defaults are a question of "what do we do with no user input". The moment we talk about getting user input, we're no longer concerned with defaults.

However, if we need to support (2), then this kind of decides it for us. I'm not sure that we do, though. Arguments?

Sharing rootfs and overlay2 makes configuration easy. Agreed with that. But 2) is a requirement. If overlay2 does not work for a user, we want to give them an option to go back to devicemapper.

overlay2 is fairly new and is not fully POSIX compliant, so it can spring surprises for certain workloads, and users will be pretty upset if they don't have a way to go back and instead have to configure and boot into a new instance of atomic-host.

If 2) was not a requirement, I would have advocated for 1) for sure as that makes configuration very easy.

My understanding is that fedora workstation, fedora server and fedora atomic-host are all following different default partitioning schemes.

Fedora workstation consumes all the free disk space by default. I am assuming that cloud-init based images grow the rootfs to the full available disk capacity on first boot. If that's the case, then having a shared rootfs there makes sense, as we never provided a proper devicemapper setup out of the box there.

Fedora Server has now switched to the notion of providing a limited-size rootfs and leaving the rest of the space free. I am assuming the fedora server based image has the same behavior. This was done so that devicemapper could be set up out of the box properly (no loopback devices). So in this case, setting up overlay2 on the free space will make more sense.

And fedora atomic-host falls in the same category as fedora server. The only difference is that in atomic host the size of the rootfs is much smaller (3G).

In summary, to me being able to go back to devicemapper is important as overlay2 is very new and might not work for all workloads. So keeping docker images/containers on a separate volume will help.

OK, so regarding (2), a follow-up question:

Is there any mechanism by which I can take a running system configured for overlay (with two partitions) and convert it to devicemapper without:

a) needing to re-partition?
b) needing to delete all containers on the system?
c) needing to erase all images & volumes on the system?

That is, does a "flip back to devicemapper" path actually exist?

I think you will have to destroy all images and containers stored in overlay and then setup devicemapper from that freed space.

Having said that, in theory a path might exist. If overlay2 does not use all the free space, then one could set up devicemapper in the rest of the free space and use atomic commands to migrate images/containers from overlay2 to devicemapper.

Having said that, the migration logic has been racy and is a less-used option.

Shishir wrote support for migrating containers from one storage backend to another.

@dwalsh, you know more about it: how well does it work, and should we rely on this option or not?

dwalsh mentioned a way to flip between them
https://www.spinics.net/linux/fedora/fedora-cloud/msg07620.html
Missing from that sequence is actually configuring the new storage if it doesn't exist yet.

I think putting custom partitioning into the hands of users, and then supporting those arbitrary layouts, is asking for endless trouble. Pick your battles, ignore the rest.

The more versatile production solution for dealing with runaway usage of space is quotas. But lack of familiarity causes people to keep running back to the familiar torture of fs resize and repartitioning. I'm hopeful the storaged and Cockpit folks will one day help solve this. Partitioning to solve these problems is so last century.
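For reference, the quota-based alternative could look roughly like this with XFS project quotas. This is a sketch, not a tested recipe: it assumes /var is an XFS filesystem mounted with the prjquota option (which is not the current Atomic default), and the project id and 20g limit are arbitrary examples, so the commands are wrapped in a function rather than executed:

```shell
# Sketch: cap docker storage with an XFS project quota instead of a
# separate partition. Assumes /var is XFS mounted with prjquota; project
# id 10 and the 20g hard limit are illustrative values only.
limit_docker_with_quota() {
  echo '10:/var/lib/docker' >> /etc/projects        # map project id to a directory
  echo 'docker:10'          >> /etc/projid          # give the project a name
  xfs_quota -x -c 'project -s docker' /var          # initialize the project tree
  xfs_quota -x -c 'limit -p bhard=20g docker' /var  # hard block limit for the project
}
```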

Chris: so the problem with @dwalsh's tool is that we're already talking about making the rootfs -- and thus /var -- many times smaller than docker_root. Which means that it wouldn't actually work for any users with the default partitioning proposed.

What I'm hearing is the "flip back to devicemapper without repartitioning" path doesn't actually exist.

What I'm hearing is the "flip back to devicemapper without repartitioning" path doesn't actually exist.

Flipping back to devicemapper will require free space, and that would be possible only if ext4 and xfs had online shrink support. That way one could have shrunk the rootfs and made space for devicemapper consumption.

XFS does not support shrinking, and ext4 supports only offline shrinking. So this option does not work.

Flipping from one to the other will take free space somewhere for the 'atomic storage export/import' operation to temporarily store docker images and containers to.

A way around the xfs lack of shrink issue is to put the filesystem containing /var onto a thinly provisioned LV (be it a dir on rootfs or its own volume). After 'atomic storage reset' wipes the docker storage, issue fstrim, and all the previously used extents will be returned to the thin pool, which can then be returned to the VG, which can then be reassigned to a new docker thin pool. Convoluted in my opinion, but doable.

The problem I'm having migrating from devicemapper to overlay is that adding /var to fstab isn't working. Systemd picks it up, but no mount command is issued. There seems to be a problem making sure it happens after the ostree switchroot, as there is no /var directory prior to the ostree rootfs being set up.

Flipping from one to the other will take free space somewhere for the 'atomic storage export/import' operation to temporarily store docker images and containers to.
A way around the xfs lack of shrink issue is to put the filesystem containing /var onto a thinly provisioned LV (be it a dir on rootfs or its own volume). After 'atomic storage reset' wipes the docker storage, issue fstrim, and all the previously used extents will be returned to the thin pool, which can then be returned to the VG, which can then be reassigned to a new docker thin pool. Convoluted in my opinion, but doable.

IIUC, you are saying to use a thin LV for the rootfs to work around the xfs shrink issue? People have tried that in the past and it has been discussed many times. There are still issues with xfs on top of a thin LV and with how the no-space situation is handled. Bottom line: we are not there yet.

So if we can't put the rootfs on a thin LV and xfs can't be shrunk, then the only way to flip back to devicemapper is to not let the rootfs use all the free space, and instead leave free space that can be used by either overlay2 or devicemapper for container storage.

So, to summarize:

  1. The main reason given for keeping "two partitions" by default with docker-storage-setup is so that users can easily switch back to devicemapper if there are critical issues with OverlayFS.

  2. However, there is currently no tool which actually allows users to switch back to devicemapper without repartitioning.

  3. Therefore, the reason given for maintaining two partitions as the default is invalid.

  4. If that reason is invalid, we should again consider making "one big partition" the default for Overlay2 installations.

I disagree with 2.

We have tools that allow you to switch back to devicemapper if there is partitioning, which is why we want to keep partitioning. If it were easy to switch from no partitioning to partitioned, then I would agree with just defaulting to overlay without partitions.

The main reason given for keeping "two partitions" by default with docker-storage-setup is so that users can easily switch back to devicemapper if there are critical issues with OverlayFS.

I would like to also point out that one other benefit would be to prevent containers from cannibalizing your root partition.

However, there is currently no tool which actually allows users to switch back to devicemapper without repartitioning.

IMHO in the world of container registries and such I would be less worried about the content that exists on the machine (container images, containers etc). If there are two LVs (and thus two partitions) one can choose DM or overlay2. If they want to switch between the two, blow away the one that exists and start over. If you want to keep your containers, store them somewhere else temporarily.

Therefore, the reason given for maintaining two partitions as the default is invalid.

If that reason is invalid, we should again consider making "one big partition" the default for Overlay2 installations.

I prefer overlay2 and would like to see there be only one option so that we can have less confusion in the future. However, giving users the choice is nice as well. Maybe there is a way to achieve both on startup.

@dwalsh see comments above about why those tools won't actually work in practice. If we can work around those, then that changes things. But right now what I'm hearing is "you can switch back, but only if you have unallocated space equal to or greater than your existing docker partition".

The main reason given for keeping "two partitions" by default with docker-storage-setup is so that users can easily switch back to devicemapper if there are critical issues with OverlayFS.

I would like to also point out that one other benefit would be to prevent containers from cannibalizing your root partition.

This might be of interest to counter my previous point:
Implement XFS quota for overlay2

Also, using partitioning to limit Docker's space consumption only makes sense if we can somehow automagically "right-size" the two partitions. In our current code, it doesn't matter how much space docker eats up, because we've only given the user 3GB on their root partition, which means they're already out of space anyway, just from log files.

@jberkus The tools will work fine if you just want to start fresh and blow away your container images.

atomic storage reset

Should delete everything, then you change your default backend using

atomic storage modify --driver ...

Only time you would need extra space would be if you wanted to export/import your images.
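The blow-away path described above could be scripted roughly as follows. This is a sketch using the atomic commands mentioned in this thread, wrapped in a function since it destroys all local container data:

```shell
# Sketch: switch storage backends by discarding existing images and
# containers, per the `atomic storage` commands discussed above.
# DANGEROUS: deletes all local docker storage.
switch_storage_driver() {
  local new_driver=$1                           # e.g. devicemapper or overlay2
  atomic storage reset                          # wipe /var/lib/docker contents
  atomic storage modify --driver "$new_driver"  # change the default backend
  systemctl restart docker                      # storage setup reruns on restart
}
```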

Without using separate partitions, I end up having to reinstall the system in order to setup separate partitions.

If you are using containers correctly, destroying your images should not be too painful. :^)

@dwalsh aha. The current docs emphasize export/import, so I thought it was required.

Lemme test that, but if it works that's a powerful argument for maintaining dual partitions for backwards compatibility. If we're doing that, though, is there anything we can do about sizing the rootfs better?

If we're doing that, though, is there anything we can do about sizing the rootfs better?

there is the ROOT_SIZE variable in docker-storage-setup that allows you to specify the size of the root partition. We could theoretically get more detailed about how we determine the "default" size, though; i.e. default to the lesser of 6G or 30% of the entire disk, or something like that. So for systems with a big disk, the root size would then be 6G.

We would also make sure that we never go below 3G.
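That heuristic (the lesser of 6G or 30% of the disk, floored at 3G) is easy to sketch. This is just an illustration of the proposal, not what docker-storage-setup currently does:

```shell
# Sketch of the proposed default-sizing heuristic (not current
# docker-storage-setup behavior): root size is the lesser of 6G or 30%
# of the disk, but never below 3G.
root_size_gb() {
  local disk_gb=$1
  local size=$(( disk_gb * 30 / 100 ))  # 30% of the disk, integer GB
  [ "$size" -gt 6 ] && size=6           # cap at 6G on big disks
  [ "$size" -lt 3 ] && size=3           # never go below 3G
  echo "$size"
}
root_size_gb 100   # big disk: 30% would be 30G, capped to 6
root_size_gb 20    # 30% is exactly 6G
root_size_gb 8     # 30% is 2G, floored to 3
```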

vgoyal
IIUC, you are saying to use a thin LV for the rootfs to work around the xfs shrink issue? People have tried that in the past and it has been discussed many times. There are still issues with xfs on top of a thin LV and with how the no-space situation is handled. Bottom line: we are not there yet.

You mean thin pool exhaustion? Right now the atomic host default uses the docker devicemapper driver which is XFS on a dm-thin pool. So I don't understand why one is OK and the other isn't.

So if we can't put the rootfs on a thin LV and xfs can't be shrunk, then the only way to flip back to devicemapper is to not let the rootfs use all the free space.

When hosted in the cloud, isn't it typical to charge for allocated space whether it's actively used or not?

jberkus
If that reason is invalid, we should again consider making "one big partition" the default for Overlay2 installations.

Yes. It's the same effort to add more space (partition, LV, raw/qcow2), make it an LVM PV, and add to the VG and then let docker-storage-setup create a docker-pool thin pool from that extra space.

dwalsh
We have tools that allow you to switch back to devicemapper if there is partitioning, which is why we want to keep partitioning. If it were easy to switch from no partitioning to partitioned, then I would agree with just defaulting to overlay without partitions.

My interpretation of jberkus' "one big partition" is a rootfs LV that uses all available space in the VG, reserving nothing. But it's still possible to add a PV to that VG and either grow the rootfs for continued use of overlay2, or fall back to devicemapper. I don't interpret it literally to mean dropping LVM. You'd probably want some way of doing online fs resize as an option, and that requires rootfs on LVM or Btrfs, not a plain partition.

I think it's a coin toss having this extra space already available in the VG, vs expecting the admin to enlarge the backing storage or add an additional device, which is then added to the VG, which can then grow rootfs (overlay2) or be used as fallback with the Docker devicemapper driver.

dustymabe
I would like to also point out that one other benefit would be to prevent containers from cannibalizing your root partition.

Not possible by making /var a separate file system; you'd have to use quotas. Ostree owns /var; it must be a directory on the rootfs at present.

I prefer overlay2 and would like to see there be only one option so that we can have less confusion in the future. However, giving users the choice is nice as well. Maybe there is a way to achieve both on startup.

You could have two kickstarts: overlay2 and devicemapper, and each kickstart is specified using a GRUB menu entry on the installation media. The devicemapper case uses the existing kickstart and depends on the existing docker-storage-setup "use 40% of VG free space for a dm-thin pool"; the overlay2 kickstart would cause the installer to use all available space for rootfs, leaving no unused space in the VG.

vgoyal
IIUC, you are saying that use a thin LV for rootfs to work around xfs shrink issue? People have tried that in the past and there have been talks about that many a times. There are still issues with xfs on top of thin lv and how no space situation is handled etc. Bottom line, we are not there yet.

You mean thin pool exhaustion? Right now the atomic host default uses the docker devicemapper driver which is XFS on a dm-thin pool. So I don't understand why one is OK and the other isn't.

There are outstanding bugs and issues against that. Error handling was not graceful, and there were instances of containers hanging when the thin pool was full, where the only solution was to reboot the system. So it is not fine as such; it's just that we don't seem to have better options. People have been talking about much closer interaction between xfs and the thin pool for quite some time.

Anaconda developers tried setting up a thin pool out of the box in the past and eventually backed it out due to various issues.

In short, putting the rootfs on a thin LV increases the complexity of the default setup, and it is difficult to recover if something goes bad (thin pool full). A lot of people don't like the idea of overprovisioning the rootfs; they would rather have peace of mind with a pre-allocated rootfs.

vgoyal
In short, putting the rootfs on a thin LV increases the complexity of the default setup, and it is difficult to recover if something goes bad (thin pool full). A lot of people don't like the idea of overprovisioning the rootfs; they would rather have peace of mind with a pre-allocated rootfs.

OK, got it. Without overprovisioning, rootfs on a thin LV carries the same risk as Docker devicemapper using a dm-thin pool; what unacceptably increases risk is overprovisioning the rootfs, which is necessarily what happens when using fstrim on it to recoup extents for a devicemapper-based reversion. Fair enough. I have seen it explode spectacularly, with total data loss of all LVs using the thin pool and repair tools unable to repair it; not just the fs that was writing when the exhaustion happened.

When hosted in the cloud, isn't it typical to charge for allocated space whether it's actively used or not?

Not sure what this has to do with how we partition the storage between rootfs and docker.

jberkus
If that reason is invalid, we should again consider making "one big partition" the default for Overlay2 installations.

Yes. It's the same effort to add more space (partition, LV, raw/qcow2), make it an LVM PV, and add to the VG and then let docker-storage-setup create a docker-pool thin pool from that extra space.

I think the server variant also adds all space to a VG, then carves out an LV for the rootfs and leaves the rest of the space free in the VG. docker-storage-setup can then use this space for image/container storage. For now devicemapper makes use of it, and going forward it will be used by default for overlay2.

dwalsh
We have tools that allow you to switch back to devicemapper if there is partitioning, which is why we want to keep partitioning. If it were easy to switch from no partitioning to partitioned, then I would agree with just defaulting to overlay without partitions.

My interpretation of jberkus' "one big partition" is a rootfs LV that uses all available space in the VG, reserving nothing. But it's still possible to add a PV to that VG and either grow the rootfs for continued use of overlay2, or fall back to devicemapper. I don't interpret it literally to mean dropping LVM. You'd probably want some way of doing online fs resize as an option, and that requires rootfs on LVM or Btrfs, not a plain partition.
I think it's a coin toss having this extra space already available in the VG, vs expecting the admin to enlarge the backing storage or add an additional device, which is then added to the VG, which can then grow rootfs (overlay2) or be used as fallback with the Docker devicemapper driver.

Asking users to either grow the existing disk or add more disks to the VG to make space for devicemapper might not always be feasible. For example, somebody may give me a VM to work with where I have no control over the management of that VM and am supposed to work with the resources provided. I think we need to provide the option of going back to devicemapper without requiring more disk space to be added to the VM.

I prefer overlay2 and would like to see there be only one option so that we can have less confusion in the future. However, giving users the choice is nice as well. Maybe there is a way to achieve both on startup.

You could have two kickstarts: overlay2 and devicemapper, and each kickstart is specified using a GRUB menu entry on the installation media. The devicemapper case uses the existing kickstart and depends on the existing docker-storage-setup "use 40% of VG free space for a dm-thin pool"; the overlay2 kickstart would cause the installer to use all available space for rootfs, leaving no unused space in the VG.

So this is the image generation part? We would generate and ship two kinds of images, and users would download one of them based on the default storage driver they want to use?

dustymabe
I would like to also point out that one other benefit would be to prevent containers from cannibalizing your root partition.

Not possible by making /var a separate file system; you'd have to use quotas. Ostree owns /var; it must be a directory on the rootfs at present.

With DOCKER_ROOT_VOLUME and overlayfs using that, all of /var/lib/docker would be taken care of. Please let me know if I'm wrong.

I prefer overlay2 and would like to see there be only one option so that we can have less confusion in the future. However, giving users the choice is nice as well. Maybe there is a way to achieve both on startup.

You could have two kickstarts: overlay2 and devicemapper, and each kickstart is specified using a GRUB menu entry on the installation media. The devicemapper case uses the existing kickstart and depends on the existing docker-storage-setup "use 40% of VG free space for a dm-thin pool"; the overlay2 kickstart would cause the installer to use all available space for rootfs, leaving no unused space in the VG.

So I hardly ever use interactive installs, but that is a valid case. I would think most people would be using their own kickstart file if they are installing a server fresh and they would set up storage the way they want it to be set up, right? I tend to think more about the cloud use case where you spin up a preconfigured image. What I was referring to is having docker-storage-setup be able to make the switch for us. It turns out that we have the storage configured like this in the baked images (note this is before docker-storage-setup runs):

-bash-4.3# lsblk
NAME              MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb                 8:16   0   10G  0 disk 
sdc                 8:32   0  368K  0 disk 
sda                 8:0    0   20G  0 disk 
├─sda2              8:2    0 19.7G  0 part 
│ └─atomicos-root 253:0    0    9G  0 lvm  /sysroot
└─sda1              8:1    0  300M  0 part /boot

This means we can essentially look at whether the user provided overlay or DM and do whatever they asked:
- If they provided overlay, then we can just extend the root partition and go on our merry way.
- If they also specified DOCKER_ROOT_VOLUME=yes, then they want overlay on another partition. Did they specify a partition? Yes: use that one. No: create an LV.
- If they provided DM, then create new LVs and set it up just like we have been doing before this discussion started.
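The decision table above could be sketched as follows. This is hypothetical logic, not actual docker-storage-setup code; the inputs mirror the sysconfig variables discussed in this thread, and the output action names are illustrative:

```shell
# Sketch of the proposed decision logic. Inputs mirror the sysconfig
# variables from this thread; outputs are illustrative action names.
choose_storage_layout() {
  local driver=$1 docker_root_volume=$2
  case "$driver" in
    overlay|overlay2)
      if [ "$docker_root_volume" = "yes" ]; then
        echo "create-docker-root-lv"   # separate LV mounted on /var/lib/docker
      else
        echo "grow-rootfs"             # one big shared volume
      fi ;;
    devicemapper)
      echo "create-thin-pool" ;;       # the pre-existing dm-thin setup
    *)
      echo "unsupported-driver" ;;
  esac
}
choose_storage_layout overlay2 yes      # prints create-docker-root-lv
choose_storage_layout overlay2 no       # prints grow-rootfs
choose_storage_layout devicemapper no   # prints create-thin-pool
```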

And what happens if the user doesn't provide anything? That's the "defaults" case.

And what happens if the user doesn't provide anything? That's the "defaults" case.

This is the big question. We are essentially debating over if we should enable DOCKER_ROOT_VOLUME=yes by default or not.

And what happens if the user doesn't provide anything? That's the "defaults" case.

This is the big question. We are essentially debating over if we should enable DOCKER_ROOT_VOLUME=yes by default or not.

Right. And I think it is a good idea to enable it by default, because that allows users to switch back to devicemapper easily without having to add more disk space to the VM.

dustymabe
With DOCKER_ROOT_VOLUME and overlayfs using that, all of /var/lib/docker would be taken care of. Please let me know if I'm wrong.

It'll work on a conventional installation. I'm skeptical it'll work on an rpm-ostree installation because /var is already a bind mount performed by ostree during the startup process. So I'm pretty sure ostree is going to have to know about the "true nature" of a separate var partition, mount it, then bind mount it correctly.

I tend to think more about the cloud use case where you spin up a preconfigured image. What I was referring to is having docker-storage-setup be able to make the switch for us.

I don't have a strong opinion on where the proper hinting belongs to indicate which driver to use. The user already has to set up #cloud-config, so maybe the hint belongs in there; either it does something to storage which is then understood by docker-storage-setup, or the hint is just a baton telling docker-storage-setup to do it. It just depends on which is more flexible and maintainable.

This means we can essentially look at whether the user provided overlay or DM and do whatever they asked:
- If they provided overlay, then we can just extend the root partition and go on our merry way.
- If they also specified DOCKER_ROOT_VOLUME=yes, then they want overlay on another partition. Did they specify a partition? Yes: use that one. No: create an LV.
- If they provided DM, then create new LVs and set it up just like we have been doing before this discussion started.

Seems reasonable. But I have zero confidence at the moment that ostree can handle a separate /var file system; it's a question for Colin what assumptions are being made. I think it assumes /var is a directory that it bind mounts somewhere, and if it's really a separate volume, then something has to mount it first before it can be bind mounted elsewhere.

An additional trick is testing any changes against Btrfs, where mounting subvolumes explicitly is actually a bind mount behind the scenes. That should just work, but...

Dusty:

Right, and that question comes down to "how much do we care about revertability vs. user experience". It's not an easy question to answer. In the long run, DOCKER_ROOT_VOLUME=no as the default is the obvious answer. But for F26? Not so sure.

chrismurphy
Seems reasonable. But I have zero confidence at the moment that ostree can handle a separate /var file system; it's a question for Colin what assumptions are being made. I think it assumes /var is a directory that it bind mounts somewhere, and if it's really a separate volume, then something has to mount it first before it can be bind mounted elsewhere.

hmm. So I'm not sure about everything you've said, because you've thrown around some concepts that I might not fully understand. However, what I can do is test. I grabbed a Fedora 25 Atomic system and did not allow docker to run on first boot (systemd.mask=docker systemd.mask=docker-storage-setup on the kernel command line). I then did ostree admin unlock --hotfix so I could modify the contents of the tree. Then I grabbed the latest upstream docker-storage-setup and installed everything to the system with make install.

I then configured /etc/sysconfig/docker-storage-setup with:

STORAGE_DRIVER=overlay2
DOCKER_ROOT_VOLUME=yes

and rebooted the system. Now I get:

-bash-4.3# lsblk
NAME                          MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sdb                             8:16   0   10G  0 disk 
sdc                             8:32   0  368K  0 disk 
sda                             8:0    0   20G  0 disk 
├─sda2                          8:2    0  5.7G  0 part 
│ ├─atomicos-docker--root--lv 253:1    0  1.1G  0 lvm  /var/lib/docker
│ └─atomicos-root             253:0    0    3G  0 lvm  /sysroot
└─sda1                          8:1    0  300M  0 part /boot
-bash-4.3# 
-bash-4.3# blkid
/dev/sda1: UUID="1cffb3b3-f5c4-4c73-9e4c-adb168f1cefa" TYPE="ext4" PARTUUID="82b21228-01"
/dev/sda2: UUID="l5jqv8-ZxTX-jIfh-ve4J-aqID-mAZi-O5mU5n" TYPE="LVM2_member" PARTUUID="82b21228-02"
/dev/mapper/atomicos-root: UUID="96a6e82b-98e5-4ab3-8034-72b61540c166" TYPE="xfs"
/dev/sdc: UUID="2017-01-09-18-25-56-00" LABEL="cidata" TYPE="iso9660"
/dev/mapper/atomicos-docker--root--lv: UUID="3f5ee97d-f612-46c4-abe5-21799e4830b1" TYPE="xfs"
-bash-4.3# 
-bash-4.3# docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 0
Server Version: 1.12.5
Storage Driver: overlay2
 Backing Filesystem: xfs
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: null host bridge overlay
Swarm: inactive
Runtimes: oci runc
Default Runtime: oci
Security Options: seccomp selinux
Kernel Version: 4.8.15-300.fc25.x86_64
Operating System: Fedora 25 (Atomic Host)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 2
CPUs: 2
Total Memory: 3.859 GiB
Name: cloudhost.localdomain
ID: YKSF:TWGT:FNJH:B553:F3FK:RFHJ:OUAK:AYOO:T5NP:WBTL:KZFI:MYSY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Insecure Registries:
 127.0.0.0/8
Registries: docker.io (secure)

The mount is handled by the var-lib-docker.mount systemd unit:

-bash-4.3# systemctl cat var-lib-docker.mount 
# /etc/systemd/system/var-lib-docker.mount
[Unit]
Description=Mount docker-root-lv on docker root directory.
Before=docker-storage-setup.service

[Mount]
What=/dev/atomicos/docker-root-lv
Where=/var/lib/docker
Type=xfs
Options=defaults

[Install]
WantedBy=docker-storage-setup.service

Am I missing something? Did I make some bad assumptions somewhere in this test?

dustymabe
Am I missing something? Did I make some bad assumptions somewhere in this test?

Nope, works for me as well. /var is still a directory on the ext4 rootfs, but it looks like a new LV is created at 40% of the free space in the VG, formatted XFS, and var-lib-docker.mount mounts it at /var/lib/docker; that mount file is created by the code triggered by DOCKER_ROOT_VOLUME=yes.

I did additionally try a migration from devicemapper to overlay2 using atomic storage export + reset + modify + import, and it does work. There is no automatic space recapture of the docker-root-lv LV; however, the user could delete that LV after the modify step, reboot so docker-storage-setup sets up the dm-thin pool, and then do the import. I'm assuming in any case that there needs to be temp space somewhere for the exported containers.

Exported containers get written by default to /var/lib/atomic/migrate; this can be overridden with the --dir option.
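Put together, the migration described above is roughly the following sequence (a sketch only, run as root on an Atomic Host; the exact atomic storage flags, including --driver, may vary by version):

```
systemctl stop docker
atomic storage export                      # defaults to /var/lib/atomic/migrate; override with --dir
atomic storage reset                       # tears down the existing graph storage
atomic storage modify --driver overlay2    # switch the graph driver for the next setup
# optionally remove the old docker-root-lv here to recapture its space, then
# reboot so docker-storage-setup re-creates storage with the new driver
systemctl reboot
atomic storage import                      # after reboot, re-import from /var/lib/atomic/migrate
```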

Should atomic storage reset do the automatic space recapture of the docker-root-lv? @vgoyal ?

@dwalsh yes, reset of storage should remove docker-root-lv automatically. If it does not, then it is a bug which should be taken care of.

Also @chrismurphy mentioned that he tried migrating from devicemapper to overlay2. I am assuming he did not create docker-root-lv for the devicemapper case and created it only for the overlay2 case. If so, then one can't remove docker-root-lv, as it is being used by overlay2.

@chrismurphy did you set up docker-root-lv for the devicemapper case and expect it to be cleaned up on reset?

Sorry for the confusing report.

docker-root-lv was created automatically when /etc/sysconfig/docker-storage-setup contained

STORAGE_DRIVER=overlay2
DOCKER_ROOT_VOLUME=yes

Upon stopping docker and issuing atomic storage reset, this LV is removed.

If I don't make changes to /etc/sysconfig/docker-storage-setup then a docker-pool LV (which is actually a dm thin pool) is created; and upon stopping docker and issuing atomic storage reset, this pool is likewise removed.

OK I think we will settle on having an extra partition that is used as the root store for /var/lib/docker on Atomic Host (i.e. we will have DOCKER_ROOT_VOLUME=yes be the default) along with default to overlay2. We may revisit this in a future release.

Here is a page that dwalsh made for the change request in fedora. Might need some more fleshing out: https://fedoraproject.org/wiki/Changes/DockerOverlay2

We are making docker-storage-setup more generic (to be usable by other container runtime as well, apart from docker). So in an attempt to do that, we are introducing new options.

EXTRA_VOLUME_NAME="docker-root-lv"
EXTRA_VOLUME_MOUNT_DIR="/var/lib/docker"

this will set up a new volume and mount it on /var/lib/docker. This is very similar to DOCKER_ROOT_VOLUME=yes; it's just that the semantics are more generic.
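Under this proposal, the Atomic Host defaults sketched earlier would be expressed with a config along these lines (illustrative only; the option names are the ones proposed here and could still change):

```shell
# Sketch of /etc/sysconfig/docker-storage-setup using the proposed
# generic options (names under discussion, not yet final)
STORAGE_DRIVER=overlay2
EXTRA_VOLUME_NAME="docker-root-lv"
EXTRA_VOLUME_MOUNT_DIR="/var/lib/docker"
```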

Shishir is working on it and I think it should be ready in a week. I am reviewing patches.

IOW, the basic theme remains the same; it's just that the exact options we use are a little different.

It might be better to use the new options, as we will deprecate DOCKER_ROOT_VOLUME=yes once the new options work fine.

We are making docker-storage-setup more generic (to be usable by other container runtime as well, apart from docker). So in an attempt to do that, we are introducing new options.
EXTRA_VOLUME_NAME="docker-root-lv"
EXTRA_VOLUME_MOUNT_DIR="/var/lib/docker"

I would vote that we keep it easy for the user. Why don't we have CONTAINER_ROOT_VOLUME=yes instead of DOCKER_ROOT_VOLUME=yes. You can still have variables like EXTRA_VOLUME_NAME and EXTRA_VOLUME_MOUNT_DIR but just make them have default values that only need to be specified if the user doesn't like the defaults.

small nit: I would rather have them be CONTAINER_VOLUME_NAME and CONTAINER_VOLUME_MOUNT_DIR.

take it a step farther back and we should call it container storage setup?


I would vote that we keep it easy for the user. Why don't we have CONTAINER_ROOT_VOLUME=yes instead of DOCKER_ROOT_VOLUME=yes.

Problem with that is that we don't know where to mount the volume after creation. Right now, we mount it on /var/lib/docker. Also we scan /etc/sysconfig/docker for the -g option and try to mount it there. But all that soon becomes very specific to docker.

Hence, instead we wanted the user to pass in the directory where the volume should be mounted. And if the user is passing in a directory, that itself means "create the extra volume".

I was thinking that EXTRA_VOLUME_MOUNT_DIR is more generic because we don't care how the container runtime uses it: whether it is their root, or they use it to carve out volumes, or for something else. That's the reason I suggested this naming scheme instead. But I am not too particular, and I will be fine with CONTAINER_ROOT_VOLUME_MOUNT_DIR as well.

Dusty, shishir has created a new PR for this change. Feel free to review it.

https://github.com/projectatomic/docker-storage-setup/pull/181

You can still have variables like EXTRA_VOLUME_NAME and EXTRA_VOLUME_MOUNT_DIR but just make them have default values that only need to be specified if the user doesn't like the defaults.

By default these will be nil; that is, no extra volume will be set up. Users will have to specify values if they want an extra volume to be set up.

small nit: I would rather have them be CONTAINER_VOLUME_NAME and CONTAINER_VOLUME_MOUNT_DIR.

I am fine with above naming.

take it a step farther back and we should call it container storage setup?

That's the plan. We are doing changes slowly and ultimately will rename it to container-storage-setup.

Thinking more about it, I think EXTRA_VOLUME_NAME is more intuitive. The reason is that this script sets up the thin pool volume as well, and that's a container volume too. EXTRA sort of makes it explicit that this volume is on top of the regular volumes set up by this script.

IOW, when somebody sets up a thin pool, that's a container volume too, and it will be easy to confuse the thin pool volume with the container volume specified by CONTAINER_VOLUME_NAME.

A couple of things here. We plan on renaming docker-storage-setup -> container-storage-setup. We want container-storage-setup to be able to set up storage for multiple container runtimes: Docker, CRI-O, perhaps others.

So having magic tools that figure out mount points like /var/lib/docker needs to stop. We are already hacking up the script in order to support docker-latest.

So naming the mount points and the volume name should not be much of a hardship. Also the distribution has full ability to hard code this into the config file.

The docker package can set up a unit file that sets the LABELS appropriately for docker, and CRI-O/OCID can set up the defaults for it.

But we end up with just one container-storage-setup package.

DON'T Hard code "dockerisms" into any more tools.

vgoyal
Thinking more about it, I think EXTRA_VOLUME_NAME is more intuitive. The reason is that this script sets up the thin pool volume as well, and that's a container volume too. EXTRA sort of makes it explicit that this volume is on top of the regular volumes set up by this script.

This script will only set up the thin pool when you are using devicemapper right? So the script would only make a thin pool LV when running for dm and only make a regular LV when running for overlay. So one type of volume gets created.

I think if you want them to be different names then they should really be different names:

  • OVERLAY_VOLUME_NAME
  • DM_THIN_VOLUME_NAME
  • etc.

I also don't think it's intuitive to the user that specifying OVERLAY_VOLUME_NAME implies CONTAINER_ROOT_VOLUME=yes (I know you are proposing that there be no CONTAINER_ROOT_VOLUME variable).

dwalsh
DON'T Hard code "dockerisms" into any more tools.

Then who is going to modify the docker options in /etc/sysconfig/docker-storage to specify devicemapper or overlay? I think if you have to code the tool to be able to do that then having a "default" location for a mountpoint is not that big of a deal.

User edits /etc/sysconfig/docker-storage-setup which should include documentation on what can be changed.

User edits /etc/sysconfig/ocid-storage-setup which should include documentation on what can be changed.

container-storage-setup will output /etc/sysconfig/docker-storage and /etc/sysconfig/ocid-storage.

The ocid-storage.spec and docker-storage.spec will execute something like:

container-storage-setup --input /etc/sysconfig/docker-storage-setup --output /etc/sysconfig/docker-storage

container-storage-setup --input /etc/sysconfig/ocid-storage-setup --output /etc/sysconfig/ocid-storage

This script will only set up the thin pool when you are using devicemapper right? So the script would only make a thin pool LV when running for dm and only make a regular LV when running for overlay. So one type of volume gets created.

No. Nothing stops you from saying.

STORAGE_DRIVER=devicemapper
EXTRA_VOLUME_NAME=foo
EXTRA_VOLUME_MOUNT_DIR=/var/lib/docker/volumes

In this case a thin pool LV will be created as well as an extra volume will be created and mounted on /var/lib/docker/volumes.

I think if you want them to be different names then they should really be different names:

OVERLAY_VOLUME_NAME

container-storage-setup does not care whether the caller sets up overlay on top of the extra volume or not. So calling it OVERLAY_VOLUME_NAME would not be right. It is just a logical volume; the caller can use the vfs driver, or the aufs driver, or something else on top if they want to.

@dustymabe why are your comments appearing twice?

vgoyal
This script will only set up the thin pool when you are using devicemapper right? So the script would only make a thin pool LV when running for dm and only make a regular LV when running for overlay. So one type of volume gets created.

No. Nothing stops you from saying.
STORAGE_DRIVER=devicemapper
EXTRA_VOLUME_NAME=foo
EXTRA_VOLUME_MOUNT_DIR=/var/lib/docker/volumes
In this case a thin pool LV will be created as well as an extra volume will be created and mounted on /var/lib/docker/volumes.

ok, I understand the behavior now, but I still think it shouldn't be EXTRA_VOLUME_NAME; CONTAINER_ROOT_VOLUME_NAME would be better. Thoughts?

@dustymabe why your comments are appearing twice?

probably because I clicked "refresh" on the page. I'm guessing this is a bug in pagure.


Let's go with

CONTAINER_LV_NAME
CONTAINER_LV_MOUNT_PATH

That makes it clear this is a logical volume, and we need to put examples into the configuration.

Let's go with
CONTAINER_LV_NAME
CONTAINER_LV_MOUNT_PATH

I am fine with these names.

CONTAINER_LV_NAME
CONTAINER_LV_MOUNT_PATH

That is better, but I still think it's worth adding "root" in there so that the user knows this is where the container runtime's root will be mounted vs. the backend storage for the container runtime. As @vgoyal pointed out, DM could have /var/lib/docker mounted on an LV as well as a thin pool that is used for its storage. Can we go with:

CONTAINER_ROOT_LV_NAME
CONTAINER_ROOT_LV_MOUNT_PATH

CONTAINER_LV_NAME
CONTAINER_LV_MOUNT_PATH

That is better, but I still think it's worth adding "root" in there so that the user knows this is where the container runtime's root will be mounted vs. the backend storage for the container runtime. As @vgoyal pointed out, DM could have /var/lib/docker mounted on an LV as well as a thin pool that is used for its storage. Can we go with:

CONTAINER_ROOT_LV_NAME
CONTAINER_ROOT_LV_MOUNT_PATH

I find ROOT confusing. ROOT means /root, or /, or the rootfs of the container. But I am bike-shedding. I think we pick one, put an explanation in the config file, and be done with it. BTW, this discussion should have gone on in the pull request. :^(

Right, ROOT can be confusing. We already have a knob called "ROOT_SIZE" which specifies the size of the rootfs of the system.

While one can argue that prefixing ROOT with CONTAINER should remove that confusion.

So I can live with CONTAINER_ROOT_LV_NAME and CONTAINER_ROOT_LV_MOUNT_PATH if that's more intuitive.

@dustymabe Just merged the changes for CONTAINER_ROOT_LV_NAME and CONTAINER_ROOT_LV_MOUNT_PATH upstream.

If you like, you might want to give it a try.
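For reference, trying out the merged options would mean a config roughly like this (a sketch; the LV name and mount path shown are illustrative values, not mandated defaults):

```shell
# Sketch of /etc/sysconfig/docker-storage-setup with the merged option names
STORAGE_DRIVER=overlay2
CONTAINER_ROOT_LV_NAME="docker-root-lv"
CONTAINER_ROOT_LV_MOUNT_PATH="/var/lib/docker"
```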

Thanks for the heads up, and sorry for the double comments if it happens again!

Metadata Update from @dustymabe:
- Issue assigned to dustymabe

4 years ago

this is effectively done and will be shipped with f26

Metadata Update from @dustymabe:
- Issue unmarked as depending on: #197

4 years ago

Metadata Update from @dustymabe:
- Issue close_status updated to: Fixed
- Issue status updated to: Closed (was: Open)

4 years ago
