Page MenuHomeFreeBSD

buf: Add bfinval() routine to discard data
ClosedPublic

Authored by cem on Thu, Sep 5, 7:55 PM.

Details

Summary

'Force invalidate' a buffer. This is intended to be used to allow layers
above the buffer cache to make more informed decisions about when discarding
dirty buffers without successful write is acceptable.

As a proof of concept, use in msdosfs to handle failures to mark the on-disk
'dirty' bit during rw mount or ro->rw update.

PR: 210316
Submitted by: avg, imp (earlier versions of bfinval)

Test Plan

We talked about it in the bug, Andriy / Warner proposed a concrete
implementation, and Kirk was supportive. So let's actually do the thing!

This Differential is prompted by real-world pain with a sometimes-RO device
that does not report WP (nor SWP) correctly. (It is sometimes RW; we cannot
blacklist it with a quirk.)

Basic ggate example repro from the PR comment #5 seems to work fine.

$ dd if=/dev/zero of=./image bs=1M count=100
$ newfs_msdos ./image

$ ggatel create -o ro ./image
ggate0
$ mount -t msdosfs /dev/ggate0 /mnt/tst ; echo $?
g_vfs_done():ggate0[WRITE(offset=512, length=4096)]error = 1
mount_msdosfs: /dev/ggate0: Operation not permitted
1
$ sysctl vfs.numdirtybuffers
vfs.numdirtybuffers: 0

Further testing with gnop to ensure similar handling across a range of possible errors.

ENODEV:

$ mdconfig -a -t swap -s 100M
md0
$ newfs_msdos /dev/md0

// ENODEV = 19
$ gnop create -e 19 -r 0 -w 100 -v md0
GEOM_NOP: Device md0.nop created.
$ mount -t msdosfs /dev/md0.nop /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 19
vfs_donmount: R/W mount failed, possibly R/O media, trying R/O mount
0

// Oops, autoro.

$ mount
...
/dev/md0.nop on /mnt/tst (msdosfs, local, read-only)
$ umount /mnt/tst

// Try again with explicit -o rw.

$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 19
mount_msdosfs: /dev/md0.nop: Operation not supported by device
1
$ sysctl vfs.numdirtybuffers
vfs.numdirtybuffers: 0

EIO:

$ gnop destroy md0.nop
$ gnop create -e 5 -r 0 -w 100 -v md0
$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 5
mount_msdosfs: /dev/md0.nop: Input/output error
1

EACCES:

$ gnop destroy md0.nop
$ gnop create -e 13 -r 0 -w 100 -v md0
$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $? ; sysctl vfs.numdirtybuffers
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 13
mount_msdosfs: /dev/md0.nop: Permission denied
1
vfs.numdirtybuffers: 0

ENXIO:

$ gnop destroy md0.nop
$ gnop create -e 6 -r 0 -w 100 -v md0
$ mount -t msdosfs -o rw /dev/md0.nop /mnt/tst ; echo $? ; sysctl vfs.numdirtybuffers
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 6
mount_msdosfs: /dev/md0.nop: Device not configured
1
vfs.numdirtybuffers: 0

Finally, test the ro -> rw upgrade path.

$ gnop destroy md0.nop
$ gnop create -e 19 -r 0 -w 100 -v md0
$ mount -t msdosfs -o ro /dev/md0.nop /mnt/tst ; echo $?
0
$ mount
/dev/md0.nop on /mnt/tst (msdosfs, local, read-only)

$ mount -u -o current,rw /mnt/tst ; echo $?
g_vfs_done():md0.nop[WRITE(offset=512, length=4096)]error = 19
mount_msdosfs: /dev/md0.nop: Operation not supported by device
1

$ mount
/dev/md0.nop on /mnt/tst (msdosfs, local, read-only)

$ sysctl vfs.numdirtybuffers
vfs.numdirtybuffers: 0

Looks good.

Future work would be extending something like this to other filesystems that
set a dirty bit on mount.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

cem created this revision.Thu, Sep 5, 7:55 PM
kib added inline comments.Thu, Sep 5, 8:15 PM
sys/fs/msdosfs/msdosfs_fat.c
1149 ↗(On Diff #61704)

This is unrelated and should be committed separately.

1177 ↗(On Diff #61704)

Please follow style, all declarations should be located at the start of the function. But see below, this variable might be not needed.

1185 ↗(On Diff #61704)

We do not reference bug numbers in comments, source should be enough to understand.

1188 ↗(On Diff #61704)

I would use incore() instead, to clearly state the intent of getting rid of failed write. The devvp vnode is not locked and in principle this buffer might be processed after bwrite().

sys/fs/msdosfs/msdosfs_vfsops.c
323 ↗(On Diff #61704)

Could the window where the MSDOSFSMNT_RONLY flag was cleared, allow to create dirty data and metadata ?

cem edited the test plan for this revision. (Show Details)Thu, Sep 5, 8:52 PM
cem marked an inline comment as done.Thu, Sep 5, 9:22 PM
cem added inline comments.
sys/fs/msdosfs/msdosfs_fat.c
1149 ↗(On Diff #61704)

Sure, it was just a trivial cleanup nearby.

1188 ↗(On Diff #61704)

Doesn't geom prevent access by multiple writers? I suppose devvp could be a non-GEOM device. What would it mean if incore() did not find the buffer? The write eventually succeeded through the (?)bufdaemon or some other devvp writer?

Is there a reason to prefer incore() over getblk(GB_NOCREAT)? If we use incore we have to pull at least some of the buf locking logic out of getblk into the caller; maybe more.

sys/fs/msdosfs/msdosfs_vfsops.c
323 ↗(On Diff #61704)

Good catch, thanks. (Though the condition is preexisting.)

Some msdos VOPs check MNT_RDONLY and prevent modification, but coverage is incomplete. I don't see sys/kern vfs routines checking for MNT_RDONLY.

It probably makes sense to prevent concurrent dirty data leaking in by adding a parameter to markvoldirty() allowing override of pmp MSDOSFSMNT_RONLY flag.

cem updated this revision to Diff 61708.Thu, Sep 5, 10:00 PM
cem marked 2 inline comments as done.
  • Fix race in the msdosfs ro->rw upgrade case by permitting a variant of markvoldirty to bypass the msdosfs RONLY flag otherwise preventing writes
cem marked an inline comment as done.Fri, Sep 6, 2:56 AM
cem added inline comments.
sys/fs/msdosfs/msdosfs_fat.c
1188 ↗(On Diff #61704)

I attempted to use incore() here, but bundirty() panics due to the buf still being on QUEUE_DIRTY.

I guess this may be as simple as needing bremfree(), another thing getblk() does for me.

(Also, it seems incore() has a BO lock -> buf lock race that getblk does not. I don't know if that matters.)

cem updated this revision to Diff 61714.Fri, Sep 6, 3:21 AM
cem marked an inline comment as done.
  • Take kib's suggestion of using incore() in place of getblk(). Not sure if it is obviously an improvement, but it seems to work (same tests pass).
  • (Style concern about C99 declaration is obviated by removal of variable.)
kib added inline comments.Fri, Sep 6, 7:53 AM
sys/fs/msdosfs/msdosfs_fat.c
1188 ↗(On Diff #61704)

From geom PoV, multiple writers are entities like VFS consumers, and not threads. Other thread might start modifications to the mount point.

The change to add rw_upgrade looks good.

1222 ↗(On Diff #61714)

Don't we need to check for the buffer error state before calling bfinval() ? Otherwise we might destroy somebody else work.

In fact, after reading the updated comment, I start thinking that neither incore() nor getblkx() are the right interfaces there, since we loose ownership of the buffer.

I suggest to add a flag, e.g. B_DIEONERR, which would be processed by brelse() and which cause the same action as bfinval() on BIO_ERROR. Then this fragment would simply set B_DIEONERR before calling bwrite().

cem added inline comments.Fri, Sep 6, 8:03 AM
sys/fs/msdosfs/msdosfs_fat.c
1222 ↗(On Diff #61714)

Hm, would it make sense to instead add a variant of bwrite() that does not brelse the buffer, leaving it to the caller? That way our ownership of the buf is maintained. The approach is a little more flexible than adding another brelse side-effect flag.

kib added inline comments.Fri, Sep 6, 11:14 AM
sys/fs/msdosfs/msdosfs_fat.c
1222 ↗(On Diff #61714)

I do not want yet another bwrite() variant, we already have dozen of them. But I am fine with adding a buffer flag that prevents brelse(). Fine details is that the flag should be mutually exclusive with B_ASYNC.

cem updated this revision to Diff 61938.Wed, Sep 11, 5:01 PM
cem marked 2 inline comments as done.

Drop bfinval, add B_INVALONERR flag with appropriate behavior.

Tests described in Differential continue to pass.

kib accepted this revision.Wed, Sep 11, 8:21 PM
kib added inline comments.
sys/fs/msdosfs/msdosfs_fat.c
1191 ↗(On Diff #61938)

Please drop the reference to PR in source code.

This revision is now accepted and ready to land.Wed, Sep 11, 8:21 PM
This revision was automatically updated to reflect the committed changes.