Page MenuHomeFreeBSD

geom: Handle truncated I/O in g_*_data()
ClosedPublic

Authored by markj on Jul 23 2021, 10:04 PM.

Details

Summary

We have several helper routines for dispatching I/O to a provider. They
are used internally by the kernel and not by the data path. None of
them handle partial I/O, but in some rare cases this is possible. Note
in particular that g_io_check() silently truncates the request to fit
within the bounds of the provider.

Truncated reads can happen in the somewhat contrived scenario where a
file-backed memory disk is larger than the file itself. I/O to a
file-backed memory disk is treated similarly to I/O to a vnode: reads
past EOF return 0 bytes, and writes past EOF extend the file. Various
classes uses g_read_data() to look for their metadata blocks when
tasting, and assume that a successful read indicates that the supplied
buffer was filled with data by the provider. If a short read occurs,
this is not the case, and they end up tasting an uninitialized buffer.
So, change g_read_data() to check for this scenario, since I believe it
is always a bug to indicate success there.

I modified g_write_data() and g_delete_data() accordingly, though I'm
not as certain about this. See also commits 8948179aba54, de66da73746f
and 9dcafe16d4e8. In particular, the change reverts commit 9dcafe16d4e8
since I believe it is redundant with this change.

Reported by: KMSAN (g_read_data() bug)
MFC after: 1 week

Test Plan

Regression test suite

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj added reviewers: sobomax, mav, imp, asomers, trasz, kib.

Generally, this is a quite unusual interface conventions. Typical unix approach is to tolerate short io and not make it error, instead having the caller to interpret and make the decision.

I think that the conversions of short io to EIO warrant a kernel message. This is a user misconfiguration if not the driver bug.

In D31293#704913, @kib wrote:

Generally, this is a quite unusual interface conventions. Typical unix approach is to tolerate short io and not make it error, instead having the caller to interpret and make the decision.

I'm sorry, I missed this comment.

In practice, block devices are expected to set BIO_ERROR when completing I/O with a non-zero residual. This assumption appears to be pervasive: for instance, bufdone() may call softdep_disk_write_complete(), which appears to assume that (bp->b_ioflags & BIO_ERROR) == 0 implies bp->b_resid == 0. Virtually all users of g_(read|write)_data() make this assumption as well. I think most of the usual reasons for permitting short I/O do not really apply at the block layer.

vnode-backed MDs violate this assumption in multiple ways:

  1. the example given in the description, where a BIO_READ can hit EOF
  2. mdstart_vnode() does not convert short I/O from VOP_READ and VOP_WRITE to errors, nor does it attempt to retry any residual request

So I suspect that the real solution may be to make vnode-backed MDs conform better to the kernel's expectations of block devices.

I hit this area of residual data handling in GEOM several times over the years. It would be good to formalize and make use of it somehow. But I never saw any reasonable use for it with any block device. It could possibly be used to specify how much data were read/written before some error occurred, but the problem is that in case of parallel execution, done by many classes bio_completed may not mean the first number of bytes of the request has completed, but SOME number of bytes. So it is pretty much useless from that perspective.

I think vnode-based providers should zero-fill the buffer if reading beyond the end of file (if that really happen somehow, that IMO should not).

I forgot about this review (again). I would like to commit the change as is: I found the problem is not limited to md(4) devices, it is also present in some GEOM classes. For instance, if a request extends past the end of mediasize, g_part_start() will simply return a truncated result. So g_label_ufs_taste_common(), which tries to find a superblock at various fixed offsets in a provider, may end up scanning uninitialized data when tasting a small bootcode partition. For this reason, I think it is not useful to print a message upon a partial I/O.

I think vnode-based providers should zero-fill the buffer if reading beyond the end of file (if that really happen somehow, that IMO should not).

Should g_part_start() reject requests that extend beyond the provider's mediasize?

Should g_part_start() reject requests that extend beyond the provider's mediasize?

I suspect that it should never receive requests extending beyond in first place, being trucated by g_io_check(), and the checks in g_part_start() are always false.

This revision is now accepted and ready to land.Jan 19 2022, 9:09 PM

So let's say i have a block device that's 10.5MB long, and I do a "dd if=/dev/this of=foo bs=1m" How big is foo?

In D31293#767794, @imp wrote:

So let's say i have a block device that's 10.5MB long, and I do a "dd if=/dev/this of=foo bs=1m" How big is foo?

I think 10.5MB. g_io_check() will truncate the last I/O to 0.5MB, while physio() backing geom_dev's reads "knows" how to return bio_resid to user-space.

In D31293#767797, @mav wrote:
In D31293#767794, @imp wrote:

So let's say i have a block device that's 10.5MB long, and I do a "dd if=/dev/this of=foo bs=1m" How big is foo?

I think 10.5MB. g_io_check() will truncate the last I/O to 0.5MB, while physio() backing geom_dev's reads "knows" how to return bio_resid to user-space.

I think you are right.

In D31293#767831, @imp wrote:
In D31293#767797, @mav wrote:
In D31293#767794, @imp wrote:

So let's say i have a block device that's 10.5MB long, and I do a "dd if=/dev/this of=foo bs=1m" How big is foo?

I think 10.5MB. g_io_check() will truncate the last I/O to 0.5MB, while physio() backing geom_dev's reads "knows" how to return bio_resid to user-space.

I think you are right.

Right. To be clear, this change has/should have no impact on user I/O.