Page MenuHomeFreeBSD

geom: Handle truncated I/O in g_*_data()
Needs ReviewPublic

Authored by markj on Jul 23 2021, 10:04 PM.



We have several helper routines for dispatching I/O to a provider. They
are used internally by the kernel and not by the data path. None of
them handle partial I/O, but in some rare cases this is possible. Note
in particular that g_io_check() silently truncates the request to fit
within the bounds of the provider.

Truncated reads can happen in the somewhat contrived scenario where a
file-backed memory disk is larger than the file itself. I/O to a
file-backed memory disk is treated similarly to I/O to a vnode: reads
past EOF return 0 bytes, and writes past EOF extend the file. Various
classes uses g_read_data() to look for their metadata blocks when
tasting, and assume that a successful read indicates that the supplied
buffer was filled with data by the provider. If a short read occurs,
this is not the case, and they end up tasting an uninitialized buffer.
So, change g_read_data() to check for this scenario, since I believe it
is always a bug to indicate success there.

I modified g_write_data() and g_delete_data() accordingly, though I'm
not as certain about this. See also commits 8948179aba54, de66da73746f
and 9dcafe16d4e8. In particular, the change reverts commit 9dcafe16d4e8
since I believe it is redundant with this change.

Reported by: KMSAN (g_read_data() bug)
MFC after: 1 week

Test Plan

Regression test suite

Diff Detail

rS FreeBSD src repository - subversion
Lint OK
No Unit Test Coverage
Build Status
Buildable 40654
Build 37543: arc lint + arc unit

Event Timeline

markj added reviewers: sobomax, mav, imp, asomers, trasz, kib.

Generally, this is a quite unusual interface conventions. Typical unix approach is to tolerate short io and not make it error, instead having the caller to interpret and make the decision.

I think that the conversions of short io to EIO warrant a kernel message. This is a user misconfiguration if not the driver bug.

In D31293#704913, @kib wrote:

Generally, this is a quite unusual interface conventions. Typical unix approach is to tolerate short io and not make it error, instead having the caller to interpret and make the decision.

I'm sorry, I missed this comment.

In practice, block devices are expected to set BIO_ERROR when completing I/O with a non-zero residual. This assumption appears to be pervasive: for instance, bufdone() may call softdep_disk_write_complete(), which appears to assume that (bp->b_ioflags & BIO_ERROR) == 0 implies bp->b_resid == 0. Virtually all users of g_(read|write)_data() make this assumption as well. I think most of the usual reasons for permitting short I/O do not really apply at the block layer.

vnode-backed MDs violate this assumption in multiple ways:

  1. the example given in the description, where a BIO_READ can hit EOF
  2. mdstart_vnode() does not convert short I/O from VOP_READ and VOP_WRITE to errors, nor does it attempt to retry any residual request

So I suspect that the real solution may be to make vnode-backed MDs conform better to the kernel's expectations of block devices.

I hit this area of residual data handling in GEOM several times over the years. It would be good to formalize and make use of it somehow. But I never saw any reasonable use for it with any block device. It could possibly be used to specify how much data were read/written before some error occurred, but the problem is that in case of parallel execution, done by many classes bio_completed may not mean the first number of bytes of the request has completed, but SOME number of bytes. So it is pretty much useless from that perspective.

I think vnode-based providers should zero-fill the buffer if reading beyond the end of file (if that really happen somehow, that IMO should not).