#36 Write amplification [meta]
Opened 3 years ago by atim. Modified 2 years ago

Let's publicise btrfs write amplification data. Share your statistics and observations about how much data is written to your disk under different workloads and setups. Share any tips and gotchas about how to minimize writes safely, if that is possible.


I've found that btrfs writes to disk much more than XFS, for example, while the system is idling. Most of the time this is not an issue, but for cheap SSDs with bad firmware, or for a Raspberry Pi with NAND flash running 24/7, it can wear out the drive earlier.

I did a few tests on Fedora 33 Workstation. The journal and auditd were set to write to RAM during the tests in order to reduce writes.
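
For reference, here is a sketch of how the journal can be kept in RAM. The drop-in path is my own choice, not necessarily what was used in the tests:

```ini
# /etc/systemd/journald.conf.d/volatile.conf (hypothetical drop-in path)
# Storage=volatile keeps the journal only in /run/log/journal (tmpfs),
# so journald itself generates no disk writes.
[Journal]
Storage=volatile
```

auditd has a similar knob (`write_logs = no` in /etc/audit/auditd.conf), though exactly how it was configured for these tests isn't stated.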

Results were obtained from the Samsung 970 EVO Plus's SMART data. Default F33 disk partitioning.

  • btrfs [kernel 5.8] [zstd:1,space_cache=v2]

    • 67MB/h
    • 584GB per year
  • btrfs [kernel 5.9] [no deviations from default]

    • 37.5MB/h
    • 324GB per year
  • xfs [kernel 5.8] [no deviations from default]

    • 6.8MB/h
    • 60GB per year

Is there any room for improvement in future versions? Are there any safe tweaks which could help to minimize disk writes? I've tried commit=300, but the result is exactly the same as with the default (30).
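
For anyone wanting to reproduce the commit=300 experiment, a sketch of the mount option (the UUID and subvolume name are placeholders, not from my setup):

```
# /etc/fstab sketch: raise the btrfs transaction commit interval to 300 s
# (default is 30 s); longer intervals batch more metadata per commit.
UUID=<your-fs-uuid>  /  btrfs  subvol=root,compress=zstd:1,commit=300  0 0
```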

This seems to be something @josef might know about?

Metadata Update from @ngompa:
- Issue assigned to josef
- Issue tagged with: Kernel


Hmm, that's odd; it makes it seem like something is changing the fs if we're btrfs but not if we're xfs. Can you install bpftrace and run this for, say, 5 minutes and upload the output?

bpftrace -e 'kprobe:btrfs_dirty_inode { @[kstack(),comm] = count(); }'

and then do it again with kprobe:xfs_fs_dirty_inode, I'd be interested to see where these writes are coming from and why it's not happening as often on xfs.

Guess: systemd-journald touches the journal often, and dnf does repo updates every hour. On top of that, enabling compression makes the file trees much taller; add in wandering trees and that means more metadata writes. But compression's net savings on writes overall vastly exceed the cost of the wandering-trees effect.

I'm trying the bpftrace command but I'm not getting anything out of it.

# bpftrace -e 'kprobe:btrfs_dirty_inode { @[kstack(),comm] = count(); }'
Attaching 1 probe...
^C


#

Update: ah, OK, it just took longer than 5 minutes. I'm not sure what the sweet spot is, maybe 10 or 15 minutes.
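
One way to avoid guessing the duration is to let bpftrace terminate itself with an interval probe; a sketch (the 900-second window is just an example, not anything prescribed in this thread):

```
# Count btrfs_dirty_inode callers for 15 minutes, then print maps and exit
bpftrace -e '
kprobe:btrfs_dirty_inode { @[kstack(), comm] = count(); }
interval:s:900 { exit(); }
'
```

On exit, bpftrace prints all maps automatically, so the counts land on stdout without needing ^C.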

cannot attach kprobe, probe entry may not exist
Error attaching probe: 'kprobe:xfs_dirty_inode'

Is it a bcc bug? btrfs_dirty_inode works.

btrfsdirtyinode.txt

Here is mine ~15 min on btrfs:
btrfsdirtyinode.txt

I want to emphasize that my systemd-journald and auditd were not writing to disk during this test; they were using RAM storage. With the journal and auditd on disk it writes much more, and I constantly see how quickly Data Units Written grows in the Samsung SMART data.

And yes, as Chris already said, this may not be an issue when the system is not idling but actively used, especially with compression enabled, since that compensates. At the same time I'm not 100% sure, and I don't have clear numbers right now, but my gut feeling is that btrfs with compression still writes more than ext4/xfs on my regular desktop workload.

dnf does repo updates every hour

I've monitored this in different ways: checking every 5 minutes, and checking how much it writes after a few hours. It writes small amounts, but often.

A maximally vanilla F33 Workstation without any extensions, background apps and such. Xorg + nvidia 455.38; maybe this also matters, who knows.

and then do it again with kprobe:xfs_fs_dirty_inode

If only I had known that before. 🙂 Now I have btrfs everywhere.

I recently copied a large amount of data from a ZFS drive to a btrfs drive and noticed that the read speed was consistently well below the write speed. Both file systems were uncompressed. Unfortunately, I cannot repeat the test now, but I have a virtual machine where I could experiment if I could monitor the number of changed blocks (not just written ones). I believe that when writing to btrfs, some areas are repeatedly overwritten, so the amount of data written alone will not tell you what was actually changed.

Here is another anecdotal data point.

I have a Dell XPS 13 laptop which is just about to turn three months old. It has always been running Fedora 34 on btrfs. Although I only use the computer for light coding and web browsing, smartctl reports 1.70TB written after just 100 hours of time powered on. This amounts to 17GB of writes per hour.

Compare these figures with my roughly year-old Asus Zenbook running Fedora on ext4. Although both laptops endure similar usage patterns, smartctl on the Zenbook reports ~1.8TB written over approximately 1000 hours.
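
Putting the two machines in the same units makes the gap clearer; a quick back-of-the-envelope using the numbers above:

```shell
# Writes per hour for each laptop, in MB/h (figures from the comment above)
xps_written_gb=1700   # ~1.70 TB over 100 powered-on hours (btrfs)
xps_hours=100
zen_written_gb=1800   # ~1.8 TB over ~1000 powered-on hours (ext4)
zen_hours=1000

xps_rate_mb_h=$(( xps_written_gb * 1000 / xps_hours ))
zen_rate_mb_h=$(( zen_written_gb * 1000 / zen_hours ))
echo "XPS (btrfs): ${xps_rate_mb_h} MB/h"
echo "Zenbook (ext4): ${zen_rate_mb_h} MB/h"
```

That works out to roughly 17 GB/h versus 1.8 GB/h, close to a tenfold difference, though the workloads are only "similar", not identical.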

Lenovo Thinkpad bought 6/15/2021, and as of today:

Power On Hours:                     1,280
Data Units Read:                    5,735,879 [2.93 TB]
Data Units Written:                 5,581,658 [2.85 TB]

This is dual boot Windows and Fedora, and Fedora is used 98% of the time. There's dozens of snapshots.

Write amplification is terrible with btrfs on an average desktop workload. After a year of observation I can clearly see the difference. Compression can't help much here; it's still several times more than on ext4, mostly because of edge cases like databases. Kmail wrote 150GB in just 20 minutes. And it will keep wearing the SSD constantly while idling.

Maybe it's good on Facebook servers and such workloads, but definitely not on average desktops.

Btrfs has a fair bit of write amplification with snapshots; this is the cost of doing business. We are working on disk format changes to address those. But again, FB tracks write amplification across our entire fleet, which has significantly heavier workloads than the consumer desktop. If we were seeing noticeably higher write amplification in real-world scenarios it would cost us millions of dollars, and we're simply not seeing it.

This is not to say there aren't problems; I know there are, and we are working to make the write amplification problem much better. In the meantime compression makes a huge difference here: you end up with less data, which offsets the metadata write amplification. Mounting with -o space_cache=v2 also makes a pretty big difference; I would recommend that as well.

In the end write amplification is highly workload dependent. There are fundamental costs to the features that come with btrfs, and users need to decide for themselves if snapshots, crcs, built-in compression, and volume management are worth the cost of possibly higher metadata write amplification.
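
For anyone wanting to try the space_cache=v2 suggestion, a sketch of the fstab entry (the UUID and subvolume are placeholders; as I understand it, once the free-space tree has been created on a mount it persists for subsequent mounts):

```
# /etc/fstab sketch: enable the v2 free space cache (free space tree)
# alongside compression, as recommended above.
UUID=<your-fs-uuid>  /  btrfs  subvol=root,compress=zstd:1,space_cache=v2  0 0
```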

