#16 Consider adding automatic drive maintenance
Opened 3 years ago by atim. Modified 2 years ago

"openSUSE" using some automatic, scheduled maintenance for BTRFS partition, like balance and scrub. More details you can find on their page:
https://en.opensuse.org/SDB:Disable_btrfsmaintenance#Performing_manual_maintenance

Can we implement something similar in Fedora? For example, scheduled HDD defragmentation would be very useful, I guess. Balance is useful as well; it saves me some disk space when I run it from time to time.


I think the balance stuff is OK; scrub might be reasonable on a monthly basis. I'd like to avoid defrag as a policy, and rather use mount -o autodefrag for spinning rust and not at all for anything else.
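For reference, autodefrag is set per mount, so it could go into fstab; a minimal sketch (the UUID and subvolume name are placeholders, not from this ticket):

```
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /home  btrfs  subvol=home,autodefrag  0 0
```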

@josef thanks. I've always wondered why autodefrag still isn't the default mount option for HDDs. Is it not optimal in terms of performance in some cases?

RE: balance
(open)SUSE are in a different position, with automatic snapshots and a somewhat aggressive retention policy.

Ultimately, any remaining block group allocation problems should be fixed in-kernel, or by an in-tree, upstream-supported and -delivered daemon. If we ship a balance script run on a timer, we're papering over the problems rather than actually fixing them. For years I've actively recommended NOT running these scripts in Fedora and (open)SUSE, in order to flush out the remaining issues.

RE: autodefrag
The detection of rotational media is not completely reliable: virtio-blk devices and all USB devices, for example, are always reported as rotational by sysfs. And the installer developers don't want to add a facility to deduce this better and conditionally set mount options like autodefrag based on it. For now, I think we're stuck with documentation and word of mouth for HDD+desktop=autodefrag. There are some workloads that aren't well suited to autodefrag: it's not snapshot aware, and it's probably not great for large, busy databases either. But for desktop use cases with a spinning hard drive, it's a good fit.
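To make the caveat concrete, here is a small Python sketch (function names are made up for illustration) of reading the sysfs flag an installer would have to rely on; per the above, virtio-blk and USB devices report 1 regardless of the actual medium:

```python
from pathlib import Path

def is_rotational(flag_text: str) -> bool:
    """Interpret the contents of /sys/block/<dev>/queue/rotational.

    The kernel reports "1" for spinning media -- but also, as noted
    above, for virtio-blk and USB devices regardless of the medium,
    so this flag must not be trusted blindly.
    """
    return flag_text.strip() == "1"

def autodefrag_candidates():
    """Yield block devices whose sysfs flag claims they are rotational."""
    for flag in Path("/sys/block").glob("*/queue/rotational"):
        if is_rotational(flag.read_text()):
            yield flag.parts[3]  # the device name, e.g. "sda"

if __name__ == "__main__":
    for dev in autodefrag_candidates():
        print(f"{dev}: claims rotational; autodefrag *might* help")
```

Even with this, the thread's point stands: the flag alone can't distinguish a real HDD from a virtio or USB device lying about its medium.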

RE: scrub
This could use desktop integration so that the user is notified of issues. Not only scrub, but any detected read or write time error. Right now, EIO handling is up to the application.

Well it would be easy to add a gnome-settings-daemon plugin that runs scrub once per month and reports failures.

How would we detect read or write errors when they occur? Obviously the desktop can't see when EIO is returned to applications. Does btrfs have some interface for this?

I don't see anything in sysfs.

There's BTRFS_IOC_GET_DEV_STATS, used by btrfs device stats, which can be polled. This is a persistent set of stats (optionally resettable) stored in on-disk metadata, e.g.:

[/dev/nvme0n1p7].write_io_errs    0
[/dev/nvme0n1p7].read_io_errs     0
[/dev/nvme0n1p7].flush_io_errs    0
[/dev/nvme0n1p7].corruption_errs  0
[/dev/nvme0n1p7].generation_errs  0

One gotcha: I'm pretty sure this needs root privileges. The user-space command returns an error without them.
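As a sketch of what a polling desktop component could do: run btrfs device stats (as root, per the gotcha above) and parse its output. This Python fragment (helper names hypothetical) handles the format shown above:

```python
def parse_device_stats(output: str) -> dict:
    """Parse `btrfs device stats` output into {counter_name: value}.

    Expects lines like "[/dev/nvme0n1p7].write_io_errs    0".
    """
    stats = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 2 and "]." in parts[0]:
            counter = parts[0].split("].", 1)[1]
            stats[counter] = int(parts[1])
    return stats

def should_notify(stats: dict) -> bool:
    """Any nonzero counter is worth surfacing to the desktop."""
    return any(v > 0 for v in stats.values())

# The sample output quoted above:
SAMPLE = """\
[/dev/nvme0n1p7].write_io_errs    0
[/dev/nvme0n1p7].read_io_errs     0
[/dev/nvme0n1p7].flush_io_errs    0
[/dev/nvme0n1p7].corruption_errs  0
[/dev/nvme0n1p7].generation_errs  0
"""

if __name__ == "__main__":
    print(should_notify(parse_device_stats(SAMPLE)))  # False for the sample
```

Since the counters are cumulative, a real plugin would want to remember the last-seen values and notify only on increases (or reset the counters after reporting).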

And there are various btrfs kernel messages, which means parsing dmesg/kmsg. We would notify on critical, and maybe also on warn.

I'm not opposed to packaging https://github.com/kdave/btrfsmaintenance. I can't parse all of the pieces to determine whether it's practical to only include scrub and balance. Considerations:

(a) package not installed nor any service enabled by default
(b) balance – only recommended on a case-by-case basis, once all information about the problem is collected; I don't want users thinking they need to run regular incantations manually, but I also don't want to encourage automated balance that just ends up papering over bugs that need to be fixed.
(c) balance and scrub – need to figure out what systemd slice they ought to go in, to make sure they're on the absolute last rung of the resource control ladder. FF hitting an sqlite database for spam should get more priority than this.
(d) defrag is not snapshot aware; it has a chance of causing more problems than it solves.

Autodefrag is probably better, maybe also with xattr-level granularity. A heuristic based on estimated seek/rotational latency could be fed back in to adjust the aggressiveness of autodefrag, rather than having to get smart about when and where to set the xattr or mount option.
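Regarding (c), one way to push the jobs down the resource-control ladder would be a systemd drop-in along these lines (a sketch only; the unit name matches btrfsmaintenance, but the weights are illustrative guesses, not a tested policy):

```ini
# /etc/systemd/system/btrfs-scrub.service.d/resource-control.conf
[Service]
IOSchedulingClass=idle
CPUSchedulingPolicy=idle
Nice=19
# cgroup-level controls, effective when the relevant controllers are enabled:
IOWeight=1
CPUWeight=1
```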

I'm not opposed to packaging https://github.com/kdave/btrfsmaintenance.

I've sent a draft .spec for review. Maybe @ngompa would be interested and could review it.

btrfsmaintenance is packaged and ready for testing. Thanks @ngompa for the help.

Metadata Update from @ngompa:
- Issue tagged with: Dev, Utils

3 years ago

Metadata Update from @ngompa:
- Issue assigned to atim

3 years ago

Thanks.

I see btrfs-trim.service and btrfs-defrag.service, which I think will confuse users. I'm not sure what's more appropriate: asking upstream to remove them, or removing them in Fedora.

Agree about removing btrfs-trim.service completely, and we should probably also remove the TRIM block in the config file, /etc/sysconfig/btrfsmaintenance. But since we don't have an autodefrag feature option for specified directories yet, btrfs-defrag.service looks useful to me. And even when the autodefrag feature is ready and available, btrfs-defrag.service will still be useful until Fedora begins using the snapshots feature everywhere. Also, defrag is disabled by default in the current upstream configuration, so I hope it's safe to install and use in its current state.

So maybe keep btrfs-defrag.service for now and just drop TRIM stuff?

Agree about removing btrfs-trim.service completely, and we should probably also remove the TRIM block in the config file, /etc/sysconfig/btrfsmaintenance.

I'd support an issue/PR upstream for this. /usr/lib/systemd/system/fstrim.service and its associated timer have been provided by the util-linux package for a while; I expect most distros have them by now.

btrfs-defrag.service looks useful to me

Is it user configurable what paths will be defragmented? Or is it either enabled or disabled?

autodefrag looks for a particular write pattern within files to consider a file a candidate for defrag; therefore these files aren't worse off being defragmented, even if there is a snapshot regime in place. The same is not true of the defragment subcommand, which can quickly explode space consumption. There is a warning to this effect in man btrfs-filesystem.

So maybe keep btrfs-defrag.service for now and just drop TRIM stuff?

I suppose it's better for folks to stick with this upstream-maintained package rather than resort to something else.

Until we're sure the resource control isolation portion of the maintenance scripts is in a good place, I think all the timers should be disabled by default. Scrub seems benign because it does no writes but it does have heavy read IO that could negatively impact certain workloads.

Is it user configurable what paths will be defragmented? Or is it either enabled or disabled?

Paths are configurable, and no paths are set by default: https://github.com/kdave/btrfsmaintenance/blob/master/sysconfig.btrfsmaintenance#L17
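For context, the defrag knob in that sysconfig file defaults to an empty path list, so the service does nothing until the admin opts in; roughly (see the linked file for the authoritative variable name and default):

```
# /etc/sysconfig/btrfsmaintenance
# Which paths btrfs-defrag.service will touch; empty = do nothing.
BTRFS_DEFRAG_PATHS=""
```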

Until we're sure the resource control isolation portion of the maintenance scripts is in a good place, I think all the timers should be disabled by default. Scrub seems benign because it does no writes but it does have heavy read IO that could negatively impact certain workloads.

Current behavior: everything is disabled by default. The units also carry this configuration:

IOSchedulingClass=idle
CPUSchedulingPolicy=idle

FWIW I think we should enable at least a monthly scrub by default on desktop systems, and not block that on having UI integration or on figuring out what to do with the other things.
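For the shape of such a default, a minimal monthly systemd timer would look something like this (a sketch with a hypothetical unit name; btrfsmaintenance ships its own timer units):

```ini
# btrfs-scrub-monthly.timer (hypothetical name)
[Unit]
Description=Monthly btrfs scrub

[Timer]
OnCalendar=monthly
# Catch up after the machine was off at the scheduled time:
Persistent=true
RandomizedDelaySec=1h

[Install]
WantedBy=timers.target
```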

@mattdm

Scrub does integrity checking by verifying checksums for all data and metadata. It will do fixups when there's redundancy, e.g. dup metadata on HDDs, and raid1 and above. In the single-drive SSD case it won't fix detected errors by default; it just reports them (in dmesg, and thus the journal/system logger).

If a user has 50G on NVMe, a scrub will max out the read capacity of the drive and take less than a minute. Conversely, 2T on an HDD will take about 5 hours.
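Those figures are straightforward throughput arithmetic, assuming roughly 2 GB/s sustained reads for the NVMe and 110 MB/s for the HDD (illustrative numbers, not measurements):

```python
def scrub_hours(data_bytes: float, read_rate_bytes_per_s: float) -> float:
    """Rough scrub duration: data size divided by sustained read rate."""
    return data_bytes / read_rate_bytes_per_s / 3600

# 50 GB on NVMe at ~2 GB/s: well under a minute.
nvme_seconds = scrub_hours(50e9, 2e9) * 3600
# 2 TB on an HDD at ~110 MB/s: about 5 hours.
hdd_hours = scrub_hours(2e12, 110e6)

print(f"NVMe: {nvme_seconds:.0f} s, HDD: {hdd_hours:.1f} h")
# -> NVMe: 25 s, HDD: 5.1 h
```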

Scrub defaults to ionice ioprio_class=idle, but I don't know whether either the BFQ or none IO schedulers honor it. We mostly depend on BFQ in Fedora right now. Could the scrub compete with some users' workloads in a way they notice adversely? I don't know. Maybe.

Scrub will dump the full path to the filename into dmesg, and thus the journal/system log, for any file that has a checksum error. This could result in unintended leaks if the user shares their dmesg in bug reports. The passive integrity checking that always happens on reads uses the subvolid and inode number, not full paths.

There is an idea for /sys/fs/btrfs/<uuid>/devinfo/<devid>/scrub_speed_max, which btrfsmaintenance could leverage to limit the rate. If we lower the rate to a 2 M/s trickle, it'd take days. If we keep the rate high, it might become noticeable by competing with other user workloads. Possibly more sophisticated resource control could make sure everything else gets its minimum IO, leaving the leftover to scrub. That way it's non-competitive but still runs at the maximum possible speed.

But I'm not yet sure what the best way to do this on Fedora is.

@catanzaro There are ioctls for start, stop, resume, and getting statistics, but I don't see an interface in libbtrfsutil. Btrfs stores per-device statistics in the file system itself as cumulative counters, which can be reset. There is also a local status file in /var/lib/btrfs that has summary statistics and is mainly used by the user-space tool to track progress, so that if a scrub is cancelled it can be resumed where it left off. As for identifying corrupt files, the current interface is dmesg.


Metadata
Board: Development, Status: In Review