Page MenuHomeFreeBSD

vfs_cluster.c: Do not propagate VOP_BMAP errors to the caller
ClosedPublic

Authored by arrowd on Jul 11 2025, 7:54 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Oct 12, 7:57 PM
Unknown Object (File)
Sun, Oct 12, 7:57 PM
Unknown Object (File)
Sun, Oct 12, 7:57 PM
Unknown Object (File)
Sun, Oct 12, 8:21 AM
Unknown Object (File)
Sun, Oct 5, 3:11 AM
Unknown Object (File)
Sat, Oct 4, 8:10 PM
Unknown Object (File)
Sep 26 2025, 11:38 PM
Unknown Object (File)
Sep 26 2025, 10:58 PM

Details

Summary

The code that makes this VOP_BMAP call tries to perform a read-ahead I/O
operation. Failing to do that for any reason isn't fatal for cluster_read(),
because we still can return some data to the caller. This is consistent with
other places within cluster_read(), where error returned by VOP_BMAP is not
returned to the caller - see the if (nblks > 1) block above the changed lines
and if (reqbp) at the end of the function.

PR: 264196

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 65348
Build 62231: arc lint + arc unit

Event Timeline

Interesting, although it's still unclear what's going on. Perhaps the error should be reset upon some condition, like in commit 0c38e3dbbf6e where a similar pattern could be seen?

markj added inline comments.
sys/kern/vfs_cluster.c
264

Why is VOP_BMAP failing to begin with?

sys/kern/vfs_cluster.c
264

This is in the context of a FUSE implementation of NTFS (filesystems/ntfs port). I haven't debugged this problem yet, but this seems to be an orthogonal to this one?

sys/kern/vfs_cluster.c
264

There is a VOP returning an error, we don't know why, but the solution must be to silence all errors? I don't think this is the right approach.

Your patch may be right in the end, but to reach that conclusion I think it is necessary to understand the root cause of the error.

sys/kern/vfs_cluster.c
264

As far as I understand, this VOP_BMAP call happens in the read-ahead code path. The cluster_read() does not treat read-ahead failures as fatal, because the request to read something is fulfilled anyways and data can be returned to the caller. The fact that hiding this error fixes reading via dd adds to this argument.

sys/kern/vfs_cluster.c
264

I understand, but we are not hiding just "this error" but all errors, and the code has worked this way for over 20 years. So I think it is reasonable to understand the root cause better before changing anything.

sys/kern/vfs_cluster.c
264

Ok, I'll dive into ntfs-3g, but as a last argument - the error coming from VOP_BMAP is already ignored if we hit the if (reqbp) case.

sys/kern/vfs_cluster.c
264

I am not saying the patch is wrong or that there are no bugs in the current version of cluster_read(). I am saying that the justification for the patch is not complete and doesn't meet the de facto standard for changing such code.

asomers added a subscriber: asomers.

I have tested this patch and confirm that it works. I've also written a minimal reproduction case, which I'll post as a separate review. And while I understand Mark's concern, I tend to agree with arrowd's approach. BMAP should be advisory, and readahead should be optional, so errors in neither should cause the main read to fail.
But I would like to see a better commit message. It should contain enough information to describe the problem without resorting to Bugzilla, since Bugzilla might not be around forever.

This revision now requires changes to proceed.Jul 14 2025, 11:28 PM

I did some research on the NTFS side and found some seemingly useful info:

I don't know if it is correct for our FUSE implementation to set blocksize based on this sysctl. Does this info ring a bell for someone?

I have no other idea how to move this forward from my side. Could someone please help me pushing this?

@arrowd I agree with your change. The only thing I wanted different is a more detailed commit message. Of course, Mark might disagree.

@asomers wrote:

The only thing I wanted different is a more detailed commit message. Of course, Mark might disagree.

Frankly, I'm also generally uncomfortable about changing the code which worked for 20 years without fully understanding the kitchen behind.

@markj wrote:

I don't think this is the right approach. Your patch may be right in the end, but to reach that conclusion I think it is necessary to understand the root cause of the error.

So I'm with Mark here. Even if the unconditional error silencing inside the loop here is indeed correct, we still need to know exactly what's going on to be able to write that aforementioned detailed commit message. :-)

code which worked for 20 years

How do you tell it is working? It might be that this code path was never triggered by anything but fusefs-ntfs with UBLIO disabled, which I think is quite possible. FUSE may work as a mocking framework for VOP_* operations - for testing purposes one can make any operation fail in arbitrary matter and the caller code should correctly handle the failure.

without fully understanding the kitchen behind.

I already provided my analysis in the inline comment https://reviews.freebsd.org/D51254#1171703 Unless I'm wrong with it, we now do understand what's going on and the proposed change in correct on its own, even without taking fusefs-ntfs into the context.

A bit of more research: this is what Linux does

https://github.com/torvalds/linux/blob/76eeb9b8de9880ca38696b2fb56ac45ac0a25c6c/fs/fuse/file.c#L2516
https://github.com/torvalds/linux/blob/f777d1112ee597d7f7dd3ca232220873a34ad0c8/fs/fuse/inode.c#L1825

It does not use a fixed value for blocksize for BMAP calls, but instead consults the s_blocksize field of struct superblock. This field is seemingly filled with PAGE_SIZE that is equal to 4096 and that matches what ntfs-3g expects.

I'm now certain that we should change https://github.com/freebsd/freebsd-src/blob/f261b63307fca34f27e4d12384d19cb543b4867a/sys/fs/fuse/fuse_vfsops.c#L444 to something else, but I still have no idea to what.

We already ignore VOP_BMAP() error when we handle the requested buffer itself, looking how many contiguous blocks we can read with it, see line 207. Then, ignoring the error and not accumulating more read-ahead buffers is even more ok, From this PoV I do not think that patch deserves such hot discussion, it can be argued that it adds the consistency, and more important case already does exactly this.

For UFS or any other in-tree local fs (msdosfs, cd9660, udf) it is practically impossible to get a failure from VOP_BMAP(), unless the fs is corrupted or the driver is failing. So I do not think it would really affect UFS.

Looking at the ntfs code, BMAP is failing because fusefs is calling the fuse BMAP operation with blocksize == maxiosize, and that'll generally be much larger than the NTFS cluster size. I'm not sure why fusefs passes that instead of the block size (or why the BMAP operation would take the blocksize as a parameter to begin with, is the block size not fixed by the client filesystem?), but it makes sense that the filesystem driver would return an error in that case.

And yes, it's ok in general for VOP_BMAP to fail. Before, I was concerned that VOP_BMAP errors would cause vnode_pager_generic_getpages() to misbehave, but I think fusefs doesn't use it.

So I have no objection to the change, but yes the commit log message should be more detailed.

sys/kern/vfs_cluster.c
262

Better would be to just write (void)VOP_BMAP(...);.

why the BMAP operation would take the blocksize as a parameter to begin with, is the block size not fixed by the client filesystem?

For instance, UFS has fragments, which already means that block size is varying.

Another case where varying block size would matter (in much more significant way than fragments) are extent-based fs. We do not have any right now, but would we have, the existing bmap interface fits.

In D51254#1215182, @kib wrote:

why the BMAP operation would take the blocksize as a parameter to begin with, is the block size not fixed by the client filesystem?

For instance, UFS has fragments, which already means that block size is varying.

Another case where varying block size would matter (in much more significant way than fragments) are extent-based fs. We do not have any right now, but would we have, the existing bmap interface fits.

And in fact NTFS is extent-based, so this probably explain that thing about passing the max block size. It would be more correct to pass the current extent size, which cluster cannot.

But anyway, everything that vfs_cluster does is optional and we only must ensure that the very first buffer of the cluster is validated.

Updated the commit message, waiting for the approval.

sys/kern/vfs_cluster.c
262

We still need to break out of the loop here in case of error.

This revision was not accepted when it landed; it landed in state Needs Revision.Wed, Oct 22, 5:00 PM
This revision was automatically updated to reflect the committed changes.
This comment was removed by arrowd.