
ffs: Flush device write cache on fsync/fdatasync
Needs Review · Public

Authored by tmunro on Aug 27 2022, 3:25 AM.
Tags
None
Subscribers
This revision needs review, but there are no reviewers specified.

Details

Reviewers
None
Summary

On other operating systems and in ZFS, fsync() and fdatasync() flush volatile write caches so that you can't lose recent writes if the power goes out. In UFS, they don't. High-end server equipment with non-volatile cache doesn't have this problem because the controller/drive cache survives power loss (NVRAM, batteries, supercaps), so you might not always want this. Consumer and prosumer equipment, on the other hand, might only have enough power to protect its own metadata on power loss, and for cloud/rental systems with many kinds of virtualised storage, who really knows?

This is just a proof-of-concept patch to see what others think of the idea and maybe get some clues from experts about whether this is the right way to go about it. The only control for now is a rather blunt vfs.ffs.nocacheflush (following the sysctl naming pattern from ZFS), but I guess we might want a mount point option, and some smart way to figure out from geom if it's necessary.
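
To make the knob concrete, it would look something like the following; this is an illustrative sketch rather than the exact hunk in the diff:

    /* Sketch: vfs.ffs.nocacheflush, default 0 (flushing enabled). */
    static int nocacheflush = 0;

    SYSCTL_DECL(_vfs_ffs);
    SYSCTL_INT(_vfs_ffs, OID_AUTO, nocacheflush, CTLFLAG_RWTUN,
        &nocacheflush, 0,
        "Do not flush the device write cache on fsync/fdatasync");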

On my SK Hynix storage in a ThinkPad, with an 8KB-block file system, I can open(O_DSYNC) and then pwrite(..., 8192, 0) about 49.2k times/sec with vfs.ffs.nocacheflush=1 (matching unpatched behaviour); with vfs.ffs.nocacheflush=0 it drops to about 2.5k times/sec, or much lower if there is other activity on the device. That's in the right ballpark compared to other operating systems on similar hardware (Linux XFS, Windows NTFS). The writes appear via dtrace/dwatch -X io as WRITE followed by FLUSH.
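
For anyone who wants to reproduce the measurement, the test amounts to a loop like this (a rough sketch; the file name and iteration count are arbitrary):

    /* Open a file with O_DSYNC and time repeated 8KB pwrites to offset 0. */
    #include <err.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <time.h>
    #include <unistd.h>

    int
    main(void)
    {
        static char buf[8192];
        struct timespec t0, t1;
        double secs;
        int fd, i, iters = 10000;

        memset(buf, 'x', sizeof(buf));
        fd = open("testfile", O_WRONLY | O_CREAT | O_DSYNC, 0644);
        if (fd == -1)
            err(1, "open");
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < iters; i++)
            if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf))
                err(1, "pwrite");
        clock_gettime(CLOCK_MONOTONIC, &t1);
        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%.1f synchronous writes/sec\n", iters / secs);
        close(fd);
        return (0);
    }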

On other OSes, O_DSYNC is sometimes handled differently, using FUA writes rather than writes plus flushes so that you don't have to wait for other incidental data sitting in the device cache. That seems to be a whole separate can of worms (it often doesn't really work on consumer gear), so in this initial experiment I'm just using BIO_FLUSH for that case too.
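
In terms of code shape, the fsync/fdatasync path ends up doing roughly this once the ordinary buffer writes have completed; ffs_flush_devcache() is a hypothetical helper name (see the inline discussion below for what it would contain):

    /* Sketch of the call site; ffs_flush_devcache() is hypothetical. */
    error = ffs_syncvnode(vp, MNT_WAIT, 0);
    if (error == 0 && !nocacheflush)
        error = ffs_flush_devcache(VFSTOUFS(vp->v_mount));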

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

tmunro created this revision.
tmunro edited the summary of this revision.

How does this compare to just disabling write caches in the drives?

sys/ufs/ffs/ffs_vnops.c
259

This looks more appropriate for geom/. Possibly other consumers also need it, e.g. msdosfs?

You mean conceptually, or the performance I get on my drive? I have to admit that I only got as far as finding 5.27.1.4 in the NVMe base spec and realising that I want to send that in a Set Features command, and seeing that there is a patch D32700 that might make that easier, but not figuring out how to do it with unpatched nvmecontrol (I guess maybe admin-passthru and some raw bytes?). Got an example command?

Ah, there are sysctls for ata and scsi, but not nvme, to make this easy. For ata and scsi it was easier to set it in the drive, and we didn't notice any real slowdown... but we have a sequential write pattern, so I wouldn't expect much.

I need to get to all of his reviews... they are mostly good and I've been too busy with too many other things...

sys/ufs/ffs/ffs_vnops.c
290

this sure looks a lot like g_bio_flush()

sys/ufs/ffs/ffs_vnops.c
290

g_io_flush() you mean? Yeah, it does. Differences: that one uses BIO_ORDERED, whereas in this scenario we have already waited for the writes to complete, so there are no further ordering requirements; the length and offset seem to be swapped (huh?); and it fails via biowait() if the device returns EOPNOTSUPP in bio_error, which I was hiding. So perhaps it needs to gain a flags argument, and perhaps I should figure out how to avoid sending the flush in the first place when it isn't supported.
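
Something along these lines is what I have in mind, sketched against the mount's geom consumer (um_cp); this is an illustration of the idea discussed here, not the exact code in the diff:

    /*
     * Like g_io_flush(), but without BIO_ORDERED (the data writes have
     * already been waited for, so there is nothing left to order against)
     * and treating an unsupported flush as success.  Assumes the usual
     * headers pulled in by ffs_vnops.c plus <geom/geom.h>.
     */
    static int
    ffs_flush_devcache(struct ufsmount *ump)
    {
        struct bio *bp;
        int error;

        bp = g_alloc_bio();
        bp->bio_cmd = BIO_FLUSH;
        bp->bio_done = NULL;
        /* Same offset/length convention as g_io_flush(), odd as it looks. */
        bp->bio_offset = ump->um_cp->provider->mediasize;
        bp->bio_length = 0;
        g_io_request(bp, ump->um_cp);
        error = biowait(bp, "ffsflush");
        g_destroy_bio(bp);
        /* Devices without a volatile write cache may not support BIO_FLUSH. */
        if (error == EOPNOTSUPP)
            error = 0;
        return (error);
    }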