Page MenuHomeFreeBSD

buf: Add bfinval() routine to discard data
ClosedPublic

Authored by cem on Sep 5 2019, 7:55 PM.
Tags
None
Referenced Files
Unknown Object (File)
Jan 17 2024, 5:11 AM
Unknown Object (File)
Jan 14 2024, 5:12 AM
Unknown Object (File)
Dec 9 2023, 11:02 AM
Unknown Object (File)
Sep 8 2023, 11:17 PM
Unknown Object (File)
Aug 25 2023, 8:35 PM
Unknown Object (File)
Aug 25 2023, 8:32 PM
Unknown Object (File)
Aug 25 2023, 8:31 PM
Unknown Object (File)
Aug 25 2023, 8:30 PM
Subscribers
None

Details

Summary

'Force invalidate' a buffer. This is intended to be used to allow layers
above the buffer cache to make more informed decisions about when discarding
dirty buffers without successful write is acceptable.

As a proof of concept, use in msdosfs to handle failures to mark the on-disk
'dirty' bit during rw mount or ro->rw update.

PR: 210316
Submitted by: avg, imp (earlier versions of bfinval)

Test Plan

We talked about it in the bug, Andriy / Warner proposed a concrete
implementation, and Kirk was supportive. So let's actually do the thing!

This Differential is prompted by real-world pain with a sometimes-RO device
that does not report WP (nor SWP) correctly. (It is sometimes RW; we cannot
blacklist it with a quirk.)

Basic ggate example repro from the PR comment #5 seems to work fine.

$ dd if=/dev/zero of=./image bs=1M count=100
$ newfs_msdos ./image

$ ggatel create -o ro ./image
ggate0
$ mount -t msdosfs /dev/ggate0 /mnt/tst ; echo $?
g_vfs_done():ggate0[WRITE(offset=512, length=4096)]error = 1
mount_msdosfs: /dev/ggate0: Operation not permitted
1
$ sysctl vfs.numdirtybuffers
vfs.numdirtybuffers: 0

Further testing with gnop to ensure similar handling across a range of possible errors.

ENODEV:

$ mdconfig -a -t swap -s 100M
md0
$ newfs_msdos /dev/md0

// ENODEV = 19
$ gnop create -e 19 -r 0 -w 100 -v md0
GEOM_NOP: Device md0.nop created.
$ mount -t msdosfs /dev/md0.nop /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 19
vfs_donmount: R/W mount failed, possibly R/O media, trying R/O mount
0

// Oops, autoro.

$ mount
...
/dev/md0.nop on /mnt/tst (msdosfs, local, read-only)
$ umount /mnt/tst

// Try again with explicit -o rw.

$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 19
mount_msdosfs: /dev/md0.nop: Operation not supported by device
1
$ sysctl vfs.numdirtybuffers
vfs.numdirtybuffers: 0

EIO:

$ gnop destroy md0.nop
$ gnop create -e 5 -r 0 -w 100 -v md0
$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 5
mount_msdosfs: /dev/md0.nop: Input/output error
1

EACCES:

$ gnop destroy md0.nop
$ gnop create -e 13 -r 0 -w 100 -v md0
$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $? ; sysctl vfs.numdirtybuffers
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 13
mount_msdosfs: /dev/md0.nop: Permission denied
1
vfs.numdirtybuffers: 0

ENXIO:

$ gnop destroy md0.nop
$ gnop create -e 6 -r 0 -w 100 -v md0
$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $? ; sysctl vfs.numdirtybuffers
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 6
mount_msdosfs: /dev/md0.nop: Device not configured
1
vfs.numdirtybuffers: 0

Finally, test the ro -> rw upgrade path.

$ gnop destroy md0.nop
$ gnop create -e 19 -r 0 -w 100 -v md0
$ mount -t msdosfs -o ro /dev/md0.nop /mnt/tst ; echo $?
0
$ mount
/dev/md0.nop on /mnt/tst (msdosfs, local, read-only)

$ mount -u -o current,rw /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 19
mount_msdosfs: /dev/md0.nop: Operation not supported by device
1

$ mount
/dev/md0.nop on /mnt/tst (msdosfs, local, read-only)

$ sysctl vfs.numdirtybuffers
vfs.numdirtybuffers: 0

Looks good.

Future work would be extending something like this to other filesystems that
set a dirty bit on mount.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 26295
Build 24774: arc lint + arc unit

Event Timeline

sys/fs/msdosfs/msdosfs_fat.c
1149

This is unrelated and should be committed separately.

1180

Please follow style, all declarations should be located at the start of the function. But see below, this variable might be not needed.

1188

We do not reference bug numbers in comments, source should be enough to understand.

1191

I would use incore() instead, to clearly state the intent of getting rid of failed write. The devvp vnode is not locked and in principle this buffer might be processed after bwrite().

sys/fs/msdosfs/msdosfs_vfsops.c
337

Could the window where the MSDOSFSMNT_RONLY flag was cleared, allow to create dirty data and metadata ?

cem marked an inline comment as done.Sep 5 2019, 9:22 PM
cem added inline comments.
sys/fs/msdosfs/msdosfs_fat.c
1149

Sure, it was just a trivial cleanup nearby.

1191

Doesn't geom prevent access by multiple writers? I suppose devvp could be a non-GEOM device. What would it mean if incore() did not find the buffer? The write eventually succeeded through the (?)bufdaemon or some other devvp writer?

Is there a reason to prefer incore() over getblk(GB_NOCREAT)? If we use incore we have to pull at least some of the buf locking logic out of getblk into the caller; maybe more.

sys/fs/msdosfs/msdosfs_vfsops.c
337

Good catch, thanks. (Though the condition is preexisting.)

Some msdos VOPs check MNT_RDONLY and prevent modification, but coverage is incomplete. I don't see sys/kern vfs routines checking for MNT_RDONLY.

It probably makes sense to prevent concurrent dirty data leaking in by adding a parameter to markvoldirty() allowing override of pmp MSDOSFSMNT_RONLY flag.

cem marked 2 inline comments as done.
  • Fix race in the msdosfs ro->rw upgrade case by permitting a variant of markvoldirty to bypass the msdosfs RONLY flag otherwise preventing writes
cem marked an inline comment as done.Sep 6 2019, 2:56 AM
cem added inline comments.
sys/fs/msdosfs/msdosfs_fat.c
1191

I attempted to use incore() here, but bundirty() panics due to the buf still being on QUEUE_DIRTY.

I guess this may be as simple as needing bremfree(), another thing getblk() does for me.

(Also, it seems incore() has a BO lock -> buf lock race that getblk does not. I don't know if that matters.)

cem marked an inline comment as done.
  • Take kib's suggestion of using incore() in place of getblk(). Not sure if it is obviously an improvement, but it seems to work (same tests pass).
  • (Style concern about C99 declaration is obviated by removal of variable.)
sys/fs/msdosfs/msdosfs_fat.c
1191

From geom PoV, multiple writers are entities like VFS consumers, and not threads. Other thread might start modifications to the mount point.

The change to add rw_upgrade looks good.

1222

Don't we need to check for the buffer error state before calling bfinval() ? Otherwise we might destroy somebody else work.

In fact, after reading the updated comment, I start thinking that neither incore() nor getblkx() are the right interfaces there, since we loose ownership of the buffer.

I suggest to add a flag, e.g. B_DIEONERR, which would be processed by brelse() and which cause the same action as bfinval() on BIO_ERROR. Then this fragment would simply set B_DIEONERR before calling bwrite().

sys/fs/msdosfs/msdosfs_fat.c
1222

Hm, would it make sense to instead add a variant of bwrite() that does not brelse the buffer, leaving it to the caller? That way our ownership of the buf is maintained. The approach is a little more flexible than adding another brelse side-effect flag.

sys/fs/msdosfs/msdosfs_fat.c
1222

I do not want yet another bwrite() variant, we already have dozen of them. But I am fine with adding a buffer flag that prevents brelse(). Fine details is that the flag should be mutually exclusive with B_ASYNC.

cem marked 2 inline comments as done.

Drop bfinval, add B_INVALONERR flag with appropriate behavior.

Tests described in Differential continue to pass.

kib added inline comments.
sys/fs/msdosfs/msdosfs_fat.c
1195

Please drop the reference to PR in source code.

This revision is now accepted and ready to land.Sep 11 2019, 8:21 PM
This revision was automatically updated to reflect the committed changes.