#23 OEM preload configuration design
Opened 2 years ago by ngompa. Modified 3 months ago

Typically with Windows preloads on computers, there is a preloaded "recovery partition" that includes a straightforward way to do a full system reset in case things have gone very wrong.

Now, we don't really have a concept of this in Fedora yet, but we could do some clever things with Btrfs snapshots and send/receive for system resets.

With Fedora getting preloaded on Lenovo laptops, perhaps @mpearson may have some ideas here of what he'd like to see architecturally for this sort of thing.


Metadata Update from @ngompa:
- Issue set to the milestone: Future Release
- Issue tagged with: Desktop

2 years ago

Having a recovery partition would be nice.

I'm currently getting beaten up by our support team as my "system recovery" instructions start off with "get the latest Fedora install image from the Fedora site". They are of the opinion that customers will expect the exact same install image the system came with - which is of course the standard Fedora image (without some documents) but I'm still getting some grief and we're looking at how we can make our factory image available to customers for "recovery".

I'm expecting Linux users to be technically competent enough to burn a USB stick and do an install - but having a recovery partition and the ability to 'factory reset' under most scenarios would be nice to have. If there is something we can do to help with this let me know.

Mark

So there are a couple of options here that I can think of right now:

  1. Lenovo, at preload time, can create an initial read-only snapshot of the finalized environment. A small recovery utility could exist in the system alongside a copy of the initial volume snapshot. A "recovery" would allow you to reset (without purging the user data) by deleting the root subvolume and doing a btrfs send of the snapshot volume back onto the system Btrfs volume as a new subvolume. This would effectively reset the system. If user data reset is also desired, then just deleting the home subvolume and re-creating it would be enough to deal with that.

  2. We can go with the hammer approach and just have an archive image of the whole volume. Recovery would work by effectively blowing away the main volume and doing a dd back on it to reset the state.

The first one is slightly fancy since it leverages the properties of our Btrfs setup and lets you quickly do a reset without user data loss in the /home. The second is "easy" but has the consequence of erasing everything always, like how it's done for Windows.
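
The two options above can be sketched as command plans. This is a minimal sketch, not a tested recovery tool. It assumes the reset runs from a separate recovery environment with the top-level Btrfs volume mounted at a hypothetical `/mnt/top`, the Fedora default subvolume names `root` and `home`, and a hypothetical send stream at `/recovery/factory.stream` holding a read-only snapshot named `factory`. The functions only build command strings; nothing is executed.

```python
import shlex


def reset_plan(top="/mnt/top", stream="/recovery/factory.stream",
               snap="factory", wipe_home=False):
    """Return shell command strings for an option 1 style reset (not executed)."""
    q = shlex.quote
    plan = [
        # Drop the damaged system root.
        f"btrfs subvolume delete {q(top + '/root')}",
        # Replay the stored factory snapshot; it arrives read-only as <top>/<snap>.
        f"btrfs receive -f {q(stream)} {q(top)}",
        # Received subvolumes are read-only, so take a writable snapshot as the new root.
        f"btrfs subvolume snapshot {q(top + '/' + snap)} {q(top + '/root')}",
    ]
    if wipe_home:
        # Optional user data reset: delete and re-create the home subvolume.
        plan += [
            f"btrfs subvolume delete {q(top + '/home')}",
            f"btrfs subvolume create {q(top + '/home')}",
        ]
    return plan


def image_restore_plan(image="/recovery/factory.img", device="/dev/sdX"):
    """Option 2: overwrite the whole target with an archived image (erases everything)."""
    q = shlex.quote
    return [f"dd if={q(image)} of={q(device)} bs=4M conv=fsync"]


if __name__ == "__main__":
    for line in reset_plan(wipe_home=False):
        print(line)
```

The difference in blast radius shows up directly in the plans: option 1 never touches `home` unless asked to, while option 2 is a single `dd` that overwrites everything on the target.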

@mpearson What's the time frame for hardware shipping with Fedora 33? My understanding is one model is shipping with Fedora 32 now. But it might help to know the "earliest" vs "latest" date for Fedora 33.

In other words, there might need to be an interim solution to the problem. (a) include a USB stick with whatever image you want; or (b) your "mega" image for imaging the system on the production line could include a "mini" image that can be consumed by Fedora Media Writer, and make it a recommended first step that the user create a restore USB stick.

To encrypt in the near term, one of these options is needed, because enabling encryption currently means a mandatory clean install.

The (c) version 2.0 might be as simple as a post-install script that, if the file system is Btrfs, just makes a snapshot before the reboot. It could be used to do a kind of rollback to the clean install state, keeping /home data, and even keeping the old system root. The user can scrape whatever they need out of the old /etc and /var and then delete the failed system root subvolume.
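
A minimal sketch of that "version 2.0" idea, again only building argument lists rather than running anything. The mount point `/mnt/top`, the snapshot name `factory`, and the `root.failed-<date>` naming are all hypothetical; the sketch also assumes the system boots by subvolume name (`subvol=root`), so that a new subvolume called `root` is picked up on the next boot.

```python
from datetime import date


def snapshot_cmd(top="/mnt/top", name="factory"):
    """Read-only snapshot of the freshly installed root, taken before the first reboot."""
    return ["btrfs", "subvolume", "snapshot", "-r", f"{top}/root", f"{top}/{name}"]


def rollback_cmds(top="/mnt/top", name="factory", today=None):
    """Set the old root aside instead of deleting it, then restore from the snapshot."""
    stamp = (today or date.today()).isoformat()
    old_root = f"{top}/root.failed-{stamp}"
    return [
        # A subvolume can be renamed like a directory, which keeps the old system root.
        ["mv", f"{top}/root", old_root],
        # A writable snapshot of the read-only factory snapshot becomes the new root.
        ["btrfs", "subvolume", "snapshot", f"{top}/{name}", f"{top}/root"],
    ]


if __name__ == "__main__":
    print(snapshot_cmd())
    for cmd in rollback_cmds():
        print(cmd)
```

Because `home` is a separate subvolume it is never named in either plan. Once the user has copied what they need out of the renamed root, it can be removed with `btrfs subvolume delete`.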

Version 3.0? Needs discovery and design. What other failure modes do we care about? Bootloader problems? File system problems? Other confusion? #10 encrypt at first-boot? Should recovery be like a live boot environment with a full graphical installer? Or should it be a "headless" confirm, and wipe everything, no options other than "are you really f'n sure??" Whether and how to verify the recovery image is untampered? How does recovery get booted intentionally and not accidentally? Is it OK if it gets stale or does it need updating? How to update it and how often?
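
On the "is the recovery image untampered" question, the simplest building block is a digest check. A minimal sketch, assuming the expected SHA-256 value comes from somewhere the user already trusts (for example a signed manifest, which is not shown here and is itself an assumption). On its own this detects corruption; it only says something about tampering if the expected digest is protected.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Stream the file in chunks so a multi-gigabyte image never sits in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def image_is_intact(path, expected_hex):
    """Compare the computed digest with the expected one (case-insensitive hex)."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```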

If it doesn't significantly increase complexity or scope creep to serve multiple use cases: reset, recovery, reprovision, major version upgrades, Fedora as "image factory" etc. then it may get more resources.

Btrfs might play a role, it has some neat options for different kinds of replication and transparent compression. But a fair chunk involves firmware, bootloader, UI/UX concerns, demand for simplicity, testing.

Option 1 sounds good to me. We create a snapshot of the disk just before the OOBE runs (that's what manufacturing uses) so could tweak that. There shouldn't be any user data on the system at that point.
Can there be a 'recovery' option in the grub menu that uses the original snapshot? I've still got to get my head around btrfs :)

We have P1G3 and P15 that are in the pipeline with Fedora but I don't think the dates are going to match up well with F33 (infamous last words - but I'm hoping we get those out before F33 is released so they will be on F32). We're really strapped for test resources so I don't know how/when we're going to be able to do an F33 image update - I'd like to do that but it may be delayed a bit.

Encryption would be great but isn't on my todo list yet - mostly because the todo list is too long. So for now I'm not worrying about it. If there is a solution that is encrypted-disk friendly, that would be a bonus.

My thinking was that recovery would just return it to the same state as when it was shipped from the factory. User data is blown away etc. It doesn't cover firmware updates :)

I like the idea of having a read-only snapshot to revert to, but I think we should also have a regular recovery partition regardless, in case someone manages to really screw things up. As @chrismurphy said though, the tricky part here is writing some tooling with good UX to guide the user through the process and explain the various tradeoffs.

A Windows or macOS style "recovery partition" is typically around 500M - 700M. These are separate partitions, and thus resilient against anything that isn't outright hardware failure.

This is the same size as the Fedora netinstaller. The desktop LiveOS images are around 2G. Is a Fedora recovery image a 3rd kind of image? Or could it be a variation on one of the existing images? There is GRUB bootloader support for squashfs, so it would be possible to directly boot the recovery from GRUB.

If the netinstall or LiveOS image payload were a Btrfs image, instead of squashfs, we'd get a couple of benefits: (1) integrity verification, we could drop the media check option in the bootloader in favor of always-on verification; (2) using Btrfs' native overlay, a.k.a. seed device feature, the replication to a Btrfs target does not decompress the data. It's really fast.

In case the destination is ext4 or XFS, they still benefit from (1). And while they miss out on (2) they're not worse off. A Btrfs image doesn't compress as well as squashfs.
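
For anyone not familiar with the seed device feature mentioned in (2), the usual sequence can be sketched as a command plan. This is a minimal sketch that only builds argument lists; the device names and the mount point are hypothetical, and an image file would first need to be attached to a loop device, which is not shown.

```python
def seed_replication_plan(seed_dev, target_dev, mountpoint="/mnt/sysimage"):
    """Commands to replicate a Btrfs seed image onto a target device (not executed)."""
    return [
        # Done once at image build time: flag the source file system as a seed.
        ["btrfstune", "-S", "1", seed_dev],
        # A seed file system mounts read-only.
        ["mount", seed_dev, mountpoint],
        # Add the writable target as a second device of the same file system.
        ["btrfs", "device", "add", target_dev, mountpoint],
        # With a writable device present, the file system can go read-write.
        ["mount", "-o", "remount,rw", mountpoint],
        # Removing the seed migrates its data onto the target.
        ["btrfs", "device", "remove", seed_dev, mountpoint],
    ]


if __name__ == "__main__":
    for cmd in seed_replication_plan("/dev/loop0", "/dev/nvme0n1p3"):
        print(" ".join(cmd))
```

The last step is where the replication happens, and it is the step that copies already compressed extents rather than decompressing and rewriting them.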

How to update the recovery partition? Maybe casync.

This recovery environment could also be a much friendlier way of doing disaster recovery, should system root become unbootable. Currently the user splats headlong into a dracut prompt.

If someone is willing to add support to our image build tools to do this, I think this is definitely worth considering for Fedora 34. Then we can generate a custom "live" image for OEM preload and use that as the starting point.

Statistics

2.1 GB (SI units) / 1.9 GiB (IEC units). The squashfs.img "payload" on Fedora-Workstation-Live-x86_64-33-1.2.iso.
5.7 GB / 5.3 GiB Workstation as-installed, no compression
2.8 GB / 2.6 GiB Workstation as-installed, zstd:1
2.6 GB / 2.4 GiB Workstation as-installed, zstd:11


~ 705 MB Fedora netinstaller ISO
~ 500 MB Windows 10 WinRE partition.
~ 600 MB macOS recovery
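
To put the first group of numbers on one scale, here is a small worked calculation of each size relative to the uncompressed as-installed Workstation. The sizes are the GB figures listed above; treating the squashfs payload as holding roughly the same content as the installed system is an assumption made only for this comparison.

```python
SIZES_GB = {
    "squashfs payload (Live ISO)": 2.1,
    "as-installed, no compression": 5.7,
    "as-installed, zstd:1": 2.8,
    "as-installed, zstd:11": 2.6,
}


def ratios(sizes, baseline="as-installed, no compression"):
    """Size of each variant as a fraction of the uncompressed install."""
    base = sizes[baseline]
    return {name: round(size / base, 2) for name, size in sizes.items()}


if __name__ == "__main__":
    for name, ratio in ratios(SIZES_GB).items():
        print(f"{ratio:.2f}  {name}")
```

Under that assumption the squashfs payload comes out at about 0.37 of the uncompressed size, against 0.49 for zstd:1 and 0.46 for zstd:11.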

Is an enduring cost of almost 3 GB acceptable? Restore would be fast, and require no internet. But the very first update will effectively replace everything by downloading everything a clean netinstaller would.

Should the environment provide a way for the user to scrape user data off the drive before obliterating it in favor of a clean state? What would that environment need to look like to be useful? Right now, a totally failed boot drops us to a dracut prompt. Only experts will be able to do a recovery from there, though they could copy /home onto a USB stick from the CLI. A partially failed boot drops us into a maintenance mode that requires root, but the root user isn't enabled, so users are quite literally stuck in no man's land. If there were an alternative way to boot that helps do better disaster recovery in general, covering more scenarios might help acquire more resources.

Truly the absolute simplest and most reliable recommendation right now is to download and install Fedora Media Writer, let it download the latest Fedora, and have it create a USB stick for you.

Hello,

I'm Freeman, a Linux engineer on @mpearson's team at Lenovo, based in Shanghai, China.

For system recovery, I prefer to use a separate partition, and this is what we're developing in the Fedora preload for the next generations of ThinkPad products.
In the new Fedora preload we plan to build a Fedora-based livecd and put it on another EFI partition (usually the second one), together with a Fedora system image in squashfs format.
As shown in the following image:

partitions-fedora-full.png

The livecd for recovery is built using livecd-tools and customized to launch a simple gnome-shell desktop and a GUI recovery tool which provides a step-by-step guide for the end user to do a factory reset.

A GRUB menuentry will be added so that the end user can enter the recovery environment.

As @chrismurphy asked, would you accept a separate partition (3GiB ~ 4GiB) on your local storage device for recovery?

I don't think it'd be a problem to have ~5GB allocated for recovery. But I think you should also consider the idea of being able to reset without data loss of /home in the future, since Btrfs can give you that kind of flexibility.

@zhanggyb

  • Partition #2 I think should have a new partition type GUID created for this purpose: "Linux recovery partition". The announcement should go to the parted, gdisk, and util-linux (fdisk) developers - it's probably sufficient to file issues with their upstreams or on their lists - and the Wikipedia GPT entry should be updated. I do not think there should be two EFI System partitions on a system. While the UEFI spec doesn't disallow it, I think it's likely to result in confusion.

  • The tool should be very clear about any option that will obliterate user data, including snapshots, including dual boot; it's trivially easy on UEFI to install Windows after Fedora.

  • @javierm can recommend what should be dropped into /etc/grub.d to make sure that your recovery boot menu entry is always added whenever grub-mkconfig is run by the user, however rare that should be these days. I'm not sure if it should get its own /etc/grub.d/20_recovery item, or other.
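
As a starting point for the /etc/grub.d question, here is a minimal sketch that renders the text of such a drop-in. Everything specific in it is an assumption for illustration: the entry title, the kernel and initrd paths on the recovery partition (`/vmlinuz`, `/initrd.img`), and the use of the dracut live-image options `root=live:UUID=...` and `rd.live.image`. It borrows the `exec tail` trick from the stock 40_custom file so the entry is copied verbatim into the generated config.

```python
def grub_recovery_script(fs_uuid, title="Factory recovery"):
    """Render the text of an /etc/grub.d drop-in that emits one menu entry."""
    return f"""#!/bin/sh
exec tail -n +3 $0
# Copied verbatim into grub.cfg when the config is regenerated (same trick as 40_custom).
menuentry '{title}' {{
    search --no-floppy --fs-uuid --set=root {fs_uuid}
    linux /vmlinuz root=live:UUID={fs_uuid} rd.live.image quiet
    initrd /initrd.img
}}
"""


if __name__ == "__main__":
    # Placeholder UUID; the real one would be the recovery partition's file system UUID.
    print(grub_recovery_script("0000-0000"), end="")
```

The rendered text would be saved as something like /etc/grub.d/20_recovery and marked executable, since files in that directory are only run if they have the executable bit set.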

Any ideas whether, how and how often to upgrade $recovery partition? What if we could share all or part of an image used for installing Fedora? The more we repurpose, the better.

1.5G -rw-rw-r--. 1 chris chris 1.5G Jan  2 20:34 deltaiso
1.9G -rw-r--r--. 1 chris chris 1.9G Jan  2 20:24 Fedora-Workstation-Live-x86_64-34-1.2.iso
1.9G -rw-r--r--. 1 chris chris 1.9G Jan  2 20:24 Fedora-Workstation-Live-x86_64-35-1.2.iso

75% might seem high, but it's still a 25% savings on two files that ostensibly are completely different and highly compressed due to the use of squashfs+xz compression. Maybe casync or zchunk can do better.
