
nvme: Call vm_fault_hold_pages instead of vmapbuf
ClosedPublic

Authored by imp on Aug 25 2025, 11:01 AM.

Details

Summary

Use the underlying mechanism of vmapbuf instead of this legacy
interface. This means we don't have to allocate a buf, and can store the
page array on the stack, since it will be small enough for the transfers
the vast majority of cards can do. Those that can do larger transfers
(> 512k) have provisions to split up requests.

I opted for the simplicity of a hard limit of 1KB for this array, which
usually holds just a few entries, rather than doing the malloc/free
dance, since some users have been complaining about the old vmapbuf
bottleneck.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

imp requested review of this revision. Aug 25 2025, 11:01 AM

Note: this replaces much of an earlier patch series that made bufs easier to use by eliminating the need entirely.

sys/dev/nvme/nvme_ctrlr.c
1294

I don't quite understand this assertion. What if vm_fault_quick_hold_pages() fails?

1405

Is there any advantage to sizing the array this way vs. just having vm_page_t upages[NVME_MAX_PAGES]?

sys/dev/nvme/nvme_ctrlr.c
1294

hmmm... Yea, I should. I misread some code. We have a couple of issues, though, in the tree that I misread that might need attention:

sys/contrib/vchiq/interface/vchiq_arm/vchiq_2835_arm.c

actual_pages = vm_fault_quick_hold_pages(&p->p_vmspace->vm_map,
    (vm_offset_t)buf, count,
    (type == PAGELIST_READ ? VM_PROT_WRITE : 0 ) | VM_PROT_READ, pages, num_pages);

if (actual_pages != num_pages) {
        vm_page_unhold_pages(pages, actual_pages);
        free(pagelist, M_VCPAGELIST);
        return (-ENOMEM);
}

vm_page_unhold_pages will loop forever given how it's coded (and underflow a signed int, which is UB, to boot). There are a few others that are sloppy and just return in this case.

and in the linuxulator we have:

count = vm_fault_quick_hold_pages(map, start, len, prot, pages, nr_pages);
return (count == -1 ? -EFAULT : nr_pages);

which needs help, no?

Anyway, I'll clean these up and post a review to understand this interface better.

1405

Less stack usage almost all the time. Usually a lot less, since npages is typically < 10.
But I had coded it as a malloc, then an alloca, and then this construct. Maybe you're right. We're shallow in the stack here, and we have big stacks these days due to ZFS (which won't be on the stack either before or after).

sys/dev/nvme/nvme_ctrlr.c
1294

The stack thing is also used in this oops from tcp_usrreq.c:

nheld = atop(round_page(((vm_offset_t)sbp) +
    (vm_size_t)outsbsz) - trunc_page((vm_offset_t)sbp));
vm_page_t ma[nheld];
if (vm_fault_quick_hold_pages(
    &curproc->p_vmspace->vm_map, (vm_offset_t)sbp,
    outsbsz, VM_PROT_READ | VM_PROT_WRITE, ma,
    nheld) < 0) {
        error = EFAULT;
        break;
}
sys/dev/nvme/nvme_ctrlr.c
1294

Ah, I knew I'd seen the assert pattern somewhere:

        ps->npages = vm_fault_quick_hold_pages(map, start, end - start,
            VM_PROT_WRITE, ps->pages, n);

        DDP_LOCK(toep);
        if (ps->npages < 0) {
                free(ps, M_CXGBE);
                return (EFAULT);
        }

        KASSERT(ps->npages == n, ("hold_aio: page count mismatch: %d vs %d",
            ps->npages, n));

in t4_ddp.c. I'm not sure what's going on in t4_cpl_io.c, which doesn't seem to check that it has enough pages. And I'm not at all sure what to do with vfs_vnops.c, which seems to know how many, but doesn't check.

Though looking at the vm_fault_quick_hold_pages() code, it looks like it will always return the count of pages or -1, so maybe all these examples could be cleaned up the other direction.

Check return -- unsure if I should keep the assert or not

I think this is ok, my comments are fairly minor.

sys/dev/nvme/nvme_ctrlr.c
1289

IMO it'd be a bit cleaner to compute max_pages here and return it by making the max_pages parameter an out-param. Or, have a helper function that just wraps vm_fault_quick_hold_pages() and returns the number of pages held or returns EFAULT or EIO if vm_fault_quick_hold_pages() fails or the user asked for too many pages, respectively.

1405

Using malloc() is fine too, since we're in the user ioctl path here I believe. I just prefer not to have this kind of dynamic stack allocation when possible.

sys/dev/nvme/nvme_ctrlr.c
1289

But you have to know how many pages to allocate for the array...
I'm not sure that I like either of these alternatives.

1405

Except that we're trading one malloc for another. The whole point of eliminating vmapbuf is to avoid allocating a buf, which was a choke point in performance measurements one of our vendors ran. While it is the user ioctl path, people are pushing it faster and faster.

I don't see what's wrong with doing the stack allocation like this. Why does that give you heartburn?

markj added inline comments.
sys/dev/nvme/nvme_ctrlr.c
1289

In my head both callers would look like this:

vm_page_t ma[NVME_MAX_PAGES];
int error, npages;

error = nvme_hold_user_pages(addr, len, is_read, ma, &npages, cb, cb_arg);
if (error != 0)
    return (error);
...
vm_page_unhold_pages(ma, npages);

and we'd have

static int
nvme_hold_user_pages(...)
{
    int npages;

    if (nvme_npages(addr, len) > NVME_MAX_PAGES)
        return (EIO);
    npages = vm_fault_quick_hold_pages(...);
    if (npages < 0)
        return (EFAULT);
    *npagesp = npages;
    return (0);
}

In the current patch, both callers have some complicated expression to calculate max_pages, and then the subroutine asserts that the calculation is right. Why not do the calculation in the subroutine itself?

1405

There's nothing really wrong with it here, it's just a dangerous pattern that I'd rather not see used more than necessary.

I didn't realize you explicitly wanted to avoid using malloc(). What's here is ok, but I'd still advocate for just having an array of size NVME_MAX_PAGES. If a user can ask for that many pages, then we should assume that they will. If that leaves us with too little stack space, then it's going to be a problem even with this dynamic sizing, so if the array is going to be on the stack we might as well do the simple thing.

This revision is now accepted and ready to land. Aug 25 2025, 9:39 PM
sys/dev/nvme/nvme_ctrlr.c
50

Are all added headers needed?

1405

I agree. The fixed size array takes exactly 1K of stack, which is large, but should be fine for top kernel code.

update, per review (except header files)

This revision now requires review to proceed. Aug 26 2025, 2:07 AM
This revision is now accepted and ready to land. Aug 26 2025, 2:28 AM
sys/dev/nvme/nvme_ctrlr.c
1289

You still have to check whether nvme_npages(addr, len) > NVME_MAX_PAGES before calling vm_fault_quick_hold_pages(). Note that vm_fault_quick_hold_pages() panics the kernel if you don't provide a large enough array (which is kind of a silly interface to be fair).

sys/dev/nvme/nvme_ctrlr.c
1289

Yes, we can be more friendly and just return the error in this case.

Add check for enough pages and preen .h's to what's required to compile.

This revision now requires review to proceed. Aug 26 2025, 4:31 PM

Don't need a separate bool. max_pages will do the trick.

sys/dev/nvme/nvme_ctrlr.c
1433

This is squashing both error cases (request too big; failed to fault pages) to EFAULT.

npages is a better variable name in these two functions

sys/dev/nvme/nvme_ctrlr.c
1433

Yes, I know; in both places. I thought about it, though: if we don't have enough pages elsewhere, we can return this value. And if we made vm_fault_quick_hold_pages return an error in that case, the rest of the VM system would return EFAULT and we'd not know... The only way around that is to check where we check for max_xfer_size. De facto, max_xfer_size is more like 128k or 256k, so it will catch this issue there (in a rather unfortunately noisy way). I can add a second condition there (and in the other routine) as well. That would be marginally better at not too great a cost, I think.

check earlier for too big. This is folded into EIO, but other too-large
transactions are as well. The message will happen, at least, though it could be
spammy in a buggy application.

markj added inline comments.
sys/dev/nvme/nvme_ctrlr.c
1433

kib's proposed vm_fault_grab_hold_pages_e() addresses that problem.

Some refactoring as I suggested earlier with the code snippet would also avoid the duplicate nvme_npages() checks. That is, have separate helper functions for 1) fetching user pages, 2) building a request structure from them.

This revision is now accepted and ready to land. Aug 27 2025, 6:16 PM
sys/dev/nvme/nvme_ctrlr.c
1433

I'll redo with kib's thing, but we do need a check somewhere that we have enough pages before we write to the array. Ideally, that would be an assert-level check, since it's easier to unwind earlier in the function. But we'll see how it turns out. I think the separate functions would be more complicated, though... I'll explore that when I refactor for kib's new interface.

This revision now requires review to proceed. Sep 2 2025, 4:36 PM
imp retitled this revision from nvme: Call vm_fault_quick_hold_pages instead of vmapbuf to nvme: Call vm_fault_hold_pages_e instead of vmapbuf. Sep 2 2025, 4:37 PM
imp edited the test plan for this revision. (Show Details)
sys/dev/nvme/nvme_ctrlr.c
1273

IMO the interface would be somewhat cleaner if the function returned 0/error, and take a pointer to *req to return the request.

update per review: new api and return error, not request

imp retitled this revision from nvme: Call vm_fault_hold_pages_e instead of vmapbuf to nvme: Call vm_fault_hold_pages instead of vmapbuf. Sep 3 2025, 3:05 PM
imp edited the test plan for this revision. (Show Details)
This revision was not accepted when it landed; it landed in state Needs Review. Sep 3 2025, 3:56 PM
This revision was automatically updated to reflect the committed changes.