Page MenuHomeFreeBSD

bhyve/virtio: Rework iovec handling functions for efficiency and clarity
Needs ReviewPublic

Authored by rosenfeld_grumpf.hope-2000.org on Oct 30 2025, 7:38 AM.
Tags
None
Referenced Files
F146314248: D53468.id171526.diff
Sun, Mar 1, 4:10 PM
Unknown Object (File)
Fri, Feb 27, 10:14 AM
Unknown Object (File)
Sun, Feb 22, 6:28 PM
Unknown Object (File)
Sun, Feb 22, 6:28 PM
Unknown Object (File)
Sun, Feb 22, 4:23 PM
Unknown Object (File)
Sat, Feb 21, 3:41 PM
Unknown Object (File)
Fri, Feb 20, 11:28 PM
Unknown Object (File)
Fri, Feb 20, 10:48 AM

Details

Reviewers
jhb
corvink
markj
Group Reviewers
bhyve
Summary

Add check_iov_len() to check whether an iovec array covers a certain
length without the need to call count_iov() on the whole array first.

Garbage-collect the 'seek' argument to buf_to_iov(), used only by
virtio-scsi control request handling. The apparent benefit of using it
to copy only the final status byte instead of the whole TMF or AN
request (25 and 21 bytes, respectively) is dubious at best, given that
the extra code to handle this in buf_to_iov() allocates and frees a new
iovec array and uses seek_iov(), which traverses the whole array and
copies iovecs around.

Replace seek_iov() and truncate_iov(), used only by virtio-scsi, with
a single function split_iov() which combines the functionality of both
in a more efficient way:
While seek_iov() always copies all iovecs past the seek offset into a
new iovec array, split_iov() works in place and doesn't copy iovecs
unless actually necessary. By using split_iov(), we can avoid almost
all copying of iovecs in I/O handling code paths of virtio-scsi.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 70780
Build 67663: arc lint + arc unit

Event Timeline

Thanks for splitting your initial commit! Note that this can also be split into smaller commits to make it easier to read: add check_iov_len, update buf_to_iov and replace seek + truncate by split. While I'm personally preferring splitting the commits, I'd be fine keeping it as is too.

usr.sbin/bhyve/iov.c
36

It's already included by iov.h.

usr.sbin/bhyve/iov.h
41

Why removing the parameter names from existing functions?

usr.sbin/bhyve/pci_virtio_scsi.c
493–521

IMO, it's a bit confusing, that we're passing two pointers here which are related to each other. Wouldn't it make things more clear to pass a single iov pointer and then passing an niov_in and niov_out length to this function?

Rework split_iov() as requested by Corvin.

usr.sbin/bhyve/iov.c
36

I think that to prevent header dependency issues, every source or header file should always include every header file whose definitions and declarations it uses directly. This file uses now uses the standard bool type and thus needs to include stdbool.h.

usr.sbin/bhyve/iov.h
41

Given that these parameter names in function prototypes are ignored by the compiler, not even causing a warning if the name is different from what the actual function implementation is using, they serve very little purpose if any purpose at all. I generally consider them noise that adds no value but clutters up the code and lengthens lines needlessly.

For this reason I don't add them for new functions or functions that I change, and in a case like this where it's relatively unintrusive I change the others, too, to make it all nice and consistent.

usr.sbin/bhyve/pci_virtio_scsi.c
493–521

I reworked it to return the data_iov pointer instead and streamlined the split_iov() code somewhat. Please let me know what you think?

rosenfeld_grumpf.hope-2000.org retitled this revision from bhyve/virtio: rework iovec handling functions for efficiency and clarity to bhyve/virtio: Rework iovec handling functions for efficiency and clarity.Mon, Feb 9, 4:23 PM

Sync with latest illumos bits after code review.

Use a size_t type where appropriate instead of the abomination that is uint.

Looks ok to me aside from the comment on split_iov().

usr.sbin/bhyve/iov.c
45

Conceptually, this function is splitting an iovec array in two and returning the number of entries on the left side in *niov, and the number of entries on the right side in *niov_rem. I think I have trouble with the meaning of "remaining" here, I would naively expect *niov_rem to be the number of iovecs before offset, not after. More generally, the "remainder" is what's left of the original after you take something away, but here it's used to refer to that which is being taken away, so the implementation is somewhat confusing to me.

Can we please rename the variables and reword the comment accordingly?

143

Not something to be fixed in this patch, but we should be checking for overflow here...

156

Same here with respect to overflow.

usr.sbin/bhyve/iov.c
45

I guess that's just another linguistic nuance that escaped me, not being a native english speaker. :-)

I've tried to reword it to be clearer, using neutral terms "1st part" and "2nd part", and renaming the variables accordingly. Please let me know if that's what you had in mind.

143

How would you suggest to handle detection of an overflow here?

Currently, this code implicitly relies on the limits set by the virtio device emulation to prevent overflows, and the guest honoring them. Given that an overflow at this point is quite unlikely to happen other than for some sort of severe error or corruption in another place, I wonder whether it's reasonable to just use an assertion here, making the reliance on sensible I/O size and iovec count limits explicit rather than implicit. (I kinda doubt we'll see single I/O requests in the exabyte range anytime soon).

This looks ok to me, but I'd prefer to simplify the split_iov() implementation some more. Those functions operate on untrusted data and should be easy to read and audit.

usr.sbin/bhyve/iov.c
45

Thank you, I think this is better, but for my taste this implementation still seems too complicated. My attempt is below, what do you think?

+/*
+ * Given an iovec and an offset, split the iovec into two at the offset and
+ * return a pointer to the beginning of the second iovec.
+ *
+ * The caller is responsible for providing an extra iovec in the array for the
+ * split.  That is, there should be space for *niov1 + 1 iovecs in the array.
+ */
+struct iovec *
+split_iov(struct iovec *iov, size_t *niov1, size_t offset, size_t *niov2)
+{
+       size_t count, resid;
+
+       /* Find the iovec entry that contains the offset. */
+       resid = offset;
+       for (count = 0; count < *niov1; count++) {
+               if (resid < iov[count].iov_len)
+                       break;
+               resid -= iov[count].iov_len;
+       }
+
+       if (resid == 0 || count == *niov1) {
+               /* Easy case, we don't have to split an iovec entry. */
+               *niov2 = *niov1 - count;
+               *niov1 = count;
+               if (*niov2 == 0)
+                       return (NULL);
+               return (&iov[count]);
+       }
+
+       /* The entry iov[count] needs to be split. */
+       *niov1 = count + 1;
+       *niov2 = *niov1 - count;
+       memmove(&iov[count + 1], &iov[count], sizeof(struct iovec) * (*niov2));
+       iov[count].iov_len = resid;
+       iov[count + 1].iov_base = (char *)iov[count].iov_base + resid;
+       iov[count + 1].iov_len -= resid;
+       return (&iov[count + 1]);
+}
+
97

Someone independently reported that this truncate_iov() is simply broken. It always returns exactly one entry because the test toseek <= iov[i].iov_len is always true. :(

I will post a minimal patch to fix this, so that it can easily be backported to existing stable branches. After that, your patch will just remove the implementation.

143

We have to take into account the possibility of a malicious guest which intentionally sets up descriptors to trigger overflow here.

Perhaps it'd be easiest to 1) assert against overflow in these generic functions, 2) add checks to vq_getchain() to abort processing if we detect that we're building an iovec with total length greater than SIZE_MAX. As part of that I think I'd add a new struct which combines the iov array pointer and array length rather than passing everything separately.