Issue #3: GSOC: Ability for bootloader to choose old deployment on failures - fedora-iot

Closed: fixed 5 years ago Opened 5 years ago by dustymabe.

Lots of possibilities discussed here. We can use hardware watchdog features and possibly set "boot once" in GRUB/UEFI features to achieve this goal.

Metadata Update from @pbrobinson:
- Issue tagged with: GSoC

5 years ago

walters commented 5 years ago

This is a dup of https://pagure.io/fedora-iot/issue/2#comment-511930 right?

dustymabe commented 5 years ago

> This is a dup of https://pagure.io/fedora-iot/issue/2#comment-511930 right?

Yes, but I think 'booting' is a separate topic, so I created this issue. Copying in your words here:

@walters talked with Lennart in person about this. He said that they'd discussed this in systemd but never implemented anything related to it.

One specific interesting topic that was discussed is having a protocol between the bootloader and the runtime where the "boot count" is stored as a suffix on the filename. Something like /boot/loader/entries/ostree-fedora-0.conf.5 where the .5 is new and represents "number of times to try this entry".

On success, drop the suffix to just /boot/loader/entries/ostree-fedora-0.conf. On failure, decrement the count. If the count reaches 0, skip to the next entry.
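
To make the naming convention above concrete, here is a minimal Python sketch of the filename handling. The helper names (split_entry, on_success, on_failure, is_bootable) are hypothetical and not part of systemd, GRUB, or libostree; the thread only describes the convention, and who performs each rename (bootloader or runtime) is left open here.

```python
import re
from typing import Optional, Tuple

# Matches "<something>.conf" with an optional ".<number>" tries suffix.
_ENTRY_RE = re.compile(r"^(?P<base>.+\.conf)(?:\.(?P<tries>\d+))?$")


def split_entry(name: str) -> Tuple[str, Optional[int]]:
    """Split 'x.conf.5' into ('x.conf', 5); an entry without a suffix yields None."""
    match = _ENTRY_RE.match(name)
    if match is None:
        raise ValueError(f"not a loader entry name: {name!r}")
    tries = match.group("tries")
    return match.group("base"), (int(tries) if tries is not None else None)


def on_success(name: str) -> str:
    """A successful boot drops the suffix, marking the entry as known-good."""
    base, _ = split_entry(name)
    return base


def on_failure(name: str) -> str:
    """A failed boot decrements the remaining tries, never going below 0."""
    base, tries = split_entry(name)
    if tries is None:
        return name  # already known-good; nothing to count down
    return f"{base}.{max(tries - 1, 0)}"


def is_bootable(name: str) -> bool:
    """Entries whose count has reached 0 are skipped in favour of the next entry."""
    _, tries = split_entry(name)
    return tries is None or tries > 0
```

For example, on_failure("ostree-fedora-0.conf.5") gives "ostree-fedora-0.conf.4", and on_success on either name gives "ostree-fedora-0.conf".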

lorbus commented 5 years ago

I need to wrap my head around what parts of the machinery this is dealing with exactly.

Could the flow look something like this?:

rpm-ostree upgrade;
add suffix to new bootloader entry;
change bootloader config to point to new-entry.suffix;
reboot;

# this could be greenboot.target on systems using the health check framework,
# or just multi-user on systems without it, or any other target
check (configurable) target is active to determine boot success;

if boot success:
  remove suffix from bootloader entry;
  change bootloader config accordingly;
else:
  case suffix>0:
    suffix--;
    change bootloader config accordingly;
    reboot;
  case suffix=0:
    rpm-ostree rollback;
    add suffix=5 to first/latest older bootloader entry without suffix;
    change bootloader config to point to old-entry.suffix;
    reboot;
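
As a rough way to reason about the flow above, the following Python sketch replays a sequence of boot outcomes for a freshly deployed entry. The function name simulate and the outcome labels are made up for illustration, and it assumes the pseudocode is read literally: the rollback only happens on a failed boot that finds the suffix already at 0.

```python
from typing import Iterable, Tuple


def simulate(boot_results: Iterable[bool], tries: int = 5) -> Tuple[str, int]:
    """Replay boot outcomes (True = target reached) for a new deployment.

    Returns (outcome, boots_used), where outcome is "confirmed",
    "rolled-back", or "pending" if the recorded boots ran out first.
    """
    boots = 0
    for ok in boot_results:
        boots += 1
        if ok:
            # Success: the suffix would be removed from the entry.
            return ("confirmed", boots)
        if tries == 0:
            # Failure with suffix=0: rpm-ostree rollback, then reboot.
            return ("rolled-back", boots)
        # Failure with suffix>0: decrement and reboot into the same entry.
        tries -= 1
    return ("pending", boots)
```

Read this way, a starting suffix of 5 allows six failed boots before the rollback, since the sixth failure is the first one to see suffix=0. Whether that off-by-one is intended would be worth settling in the design.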

If I'm not mistaken libostree handles the bootloader config generation on upgrade/rollback, so would it make sense to put the suffix changing functionality there?

I guess the part with the target active check and if/else pseudocode could remain a separate thing.

RFC @walters @jlebon @pbrobinson @dustymabe

Edited 5 years ago by lorbus

jlebon commented 5 years ago

> Could the flow look something like this?:

Hmm, is handling upgrades in scope for this, though? My understanding was that we were keeping this general so that it applies equally well for IoT devices as for cloud VMs with their own update operators.

> If I'm not mistaken libostree handles the bootloader config generation on upgrade/rollback, so would it make sense to put the suffix changing functionality there?

Yeah, I'm a bit fuzzy about that part as well. I think even doing the suffix manipulation outside of libostree would still require tweaking it (which is not off the table, of course), though maybe @walters can elaborate on what he had in mind.

lorbus commented 5 years ago

Copying over @jktjkt's comment from #2:

Hi there, re the bootloader -> systemd protocol for tracking a number of boot attempts:

We're doing something similar on our embedded system. We've got two read-only rootfs partitions and two R/W config partitions. These are called slot-A (rootfs-A + cfg-A) and slot-B. At any given time, just one of these slots is active while the other one is inactive. When a system is booted from, say, slot-A, any update is installed into slot-B, and the bootloader is set to perform the next boot from slot-B. There are a number of attempts to try for each slot, and once that number gets too low, the next slot becomes active again. This is a pretty common scheme, and there have been a bunch of talks from various FLOSS projects and companies on how they are solving this.
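
The slot selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Slot class, the function names, and the default of 3 attempts are hypothetical, and this is not the actual logic of RAUC, U-Boot, or Barebox.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Slot:
    name: str
    attempts_left: int


def choose_and_consume(slots: List[Slot]) -> Optional[Slot]:
    """Pick the first slot (in priority order) that still has attempts left.

    Booting a slot consumes one attempt, so a slot that never confirms a
    good boot eventually runs out and the next slot takes over.
    """
    for slot in slots:
        if slot.attempts_left > 0:
            slot.attempts_left -= 1
            return slot
    return None  # nothing bootable; a rescue slot could be handled here


def mark_good(slot: Slot, attempts: int = 3) -> None:
    """Called by the running system after a successful boot to refill the counter."""
    slot.attempts_left = attempts
```

With slots ordered as [Slot("B", 2), Slot("A", 3)] right after an update into slot-B, two failed boots of B leave its counter at 0 and the third boot falls back to A.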

We're currently using RAUC for updating the entire system. Everything is integrated with our bootloader (U-Boot), and the attempt counters are tracked in U-Boot's environment. We've also verified that this works with grub2, and the RAUC project also integrates well with Barebox' bootchooser (Barebox is another bootloader). For a reference, check our patch to U-Boot which ensures that stuff works.

There are also other projects which target a similar use case, for example swupdate. People from both swupdate and RAUC gave a number of talks about where they are coming from and what problems they solve. My recommendation is to get in touch with them. For example, some systems might not have space for two copies of the rootfs, others might want to have a special, never-updated rescue slot, etc.

One thing that I was disappointed with is systemd's support for watchdog. Basically, my bootloader sets up a HW watchdog which reliably reboots the CPU unless the kernel periodically keeps telling it "hey, everything is OK". This is typically done via systemd, but there's a nasty bug in there because systemd happily keeps hitting the watchdog saying "hey, I'm up" even if it fails to reach its target. For more details, see [systemd-devel] Later activation of the HW watchdog which unfortunately didn't go anywhere. A TL;DR version of that is "I want systemd to only start touching the watchdog once it successfully reaches its default target".
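
The behaviour asked for in that TL;DR can be illustrated with a small Python sketch. It is not how systemd works and not a proposed patch; run_keepalive and default_target_active are hypothetical names, and the readiness check via systemctl is just one possible choice.

```python
import subprocess
import time
from typing import Callable


def default_target_active() -> bool:
    """One possible readiness check: ask systemd whether default.target is active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", "default.target"], check=False
    )
    return result.returncode == 0


def run_keepalive(
    is_ready: Callable[[], bool],
    pet: Callable[[], None],
    rounds: int,
    interval: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Pet the watchdog only after the system has become ready once.

    Until is_ready() returns True, the watchdog is left alone, so a boot
    that never reaches its target is reset by the hardware timeout.
    Returns the number of keepalives sent.
    """
    ready = False
    pets = 0
    for _ in range(rounds):
        ready = ready or is_ready()  # stop polling once readiness was seen
        if ready:
            pet()
            pets += 1
        sleep(interval)
    return pets
```

On a real device, pet would typically write to a watchdog device node such as /dev/watchdog that is kept open for the lifetime of the process, and interval would need to be comfortably shorter than the hardware timeout. Both details are assumptions about the target system.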

If you're interested in this, feel free to reach me at jan.kundrat@cesnet.cz. I don't have (human) bandwidth for tracking this thread :), and we're using Buildroot, not Fedora's IoT for our builds anyway.

With kind regards,
Jan

lorbus commented 5 years ago

This is done. In Fedora 29, one can activate the boot counting feature by setting a boot_counter grubenv var, e.g. grub2-editenv - set boot_counter=3
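
For scripting this from a provisioning or update tool, a thin Python wrapper around the command above might look like the following. The function names are hypothetical; only the grub2-editenv invocations themselves ("- set" from the comment above, plus "- list" to read the environment back) are real commands, and running them requires root on a GRUB-based Fedora system.

```python
import subprocess
from typing import Dict, List


def set_boot_counter_cmd(tries: int) -> List[str]:
    """Build the grub2-editenv command that arms boot counting with `tries` attempts."""
    if tries < 0:
        raise ValueError("tries must be non-negative")
    return ["grub2-editenv", "-", "set", f"boot_counter={tries}"]


def parse_grubenv_list(output: str) -> Dict[str, str]:
    """Parse the key=value lines printed by `grub2-editenv - list`."""
    env: Dict[str, str] = {}
    for line in output.splitlines():
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without '='
            env[key.strip()] = value
    return env


def read_grubenv() -> Dict[str, str]:
    """Read the current GRUB environment block (needs grub2-editenv on PATH)."""
    result = subprocess.run(
        ["grub2-editenv", "-", "list"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_grubenv_list(result.stdout)


def arm_boot_counter(tries: int = 3) -> None:
    """Run the command; the default of 3 mirrors the example in the comment above."""
    subprocess.run(set_boot_counter_cmd(tries), check=True)
```

Keeping command construction and output parsing separate from the subprocess calls makes the logic easy to check on a machine without GRUB installed.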

Metadata Update from @pbrobinson:
- Issue close_status updated to: fixed
- Issue status updated to: Closed (was: Open)

5 years ago
