
nvme: Call vm_fault_hold_pages instead of vmapbuf
ClosedPublic

Authored by imp on Aug 25 2025, 11:01 AM.

Details

Summary

Use the underlying mechanism of vmapbuf instead of this legacy
interface. This means we don't have to allocate a buf, and can store the
page array on the stack, since it will be small enough for the transfers
the vast majority of cards can do. Those that can do larger transfers
(> 512k) have provisions to split up requests.

I opted for the simplicity of a hard limit of 1KB for this array, which
usually holds just a few entries, rather than doing the malloc/free
dance, since some users have been complaining about the old vmapbuf
bottleneck.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

imp requested review of this revision. Aug 25 2025, 11:01 AM

Note: this replaces much of an earlier patch series that made bufs easier to use by eliminating the need entirely.

sys/dev/nvme/nvme_ctrlr.c
1294

I don't quite understand this assertion. What if vm_fault_quick_hold_pages() fails?

1405

Is there any advantage to sizing the array this way vs. just having vm_page_t upages[NVME_MAX_PAGES]?

sys/dev/nvme/nvme_ctrlr.c
1294

hmmm... Yea, I should. I misread some code. We have a couple of issues, though, in the tree that I misread that might need attention:

sys/contrib/vchiq/interface/vchiq_arm/vchiq_2835_arm.c

actual_pages = vm_fault_quick_hold_pages(&p->p_vmspace->vm_map,
    (vm_offset_t)buf, count,
    (type == PAGELIST_READ ? VM_PROT_WRITE : 0 ) | VM_PROT_READ, pages, num_pages);

if (actual_pages != num_pages) {
        vm_page_unhold_pages(pages, actual_pages);
        free(pagelist, M_VCPAGELIST);
        return (-ENOMEM);
}

vm_page_unhold_pages will loop forever given how it's coded (and underflow a signed int, which is UB, to boot). There are a few others that are sloppy and just return in this case.

and in the linuxulator we have:

count = vm_fault_quick_hold_pages(map, start, len, prot, pages, nr_pages);
return (count == -1 ? -EFAULT : nr_pages);

which needs help, no?

Anyway, I'll clean these up and post a review to understand this interface better.

1405

Less stack usage almost all the time. Usually a lot less, since npages is typically < 10.
But I had coded it as a malloc, then an alloca, and then this construct. Maybe you're right. We're shallow in the stack here, and we have big stacks these days due to ZFS (which won't be on the stack either before or after).

sys/dev/nvme/nvme_ctrlr.c
1294

The stack thing is also used in this oops from tcp_usrreq.c:

nheld = atop(round_page(((vm_offset_t)sbp) +
    (vm_size_t)outsbsz) - trunc_page((vm_offset_t)sbp));
vm_page_t ma[nheld];
if (vm_fault_quick_hold_pages(
    &curproc->p_vmspace->vm_map, (vm_offset_t)sbp,
    outsbsz, VM_PROT_READ | VM_PROT_WRITE, ma,
    nheld) < 0) {
        error = EFAULT;
        break;
}
sys/dev/nvme/nvme_ctrlr.c
1294

Ah, I knew I'd seen the assert pattern somewhere:

        ps->npages = vm_fault_quick_hold_pages(map, start, end - start,
            VM_PROT_WRITE, ps->pages, n);

        DDP_LOCK(toep);
        if (ps->npages < 0) {
                free(ps, M_CXGBE);
                return (EFAULT);
        }

        KASSERT(ps->npages == n, ("hold_aio: page count mismatch: %d vs %d",
            ps->npages, n));

in t4_ddp.c. I'm not sure what's going on in t4_cpl_io.c, which doesn't seem to check that it has enough pages. And I'm not at all sure what to do with vfs_vnops.c, which seems to know how many, but doesn't check.

Though looking at the vm_fault_quick_hold_pages() code, it looks like it will always return the count of pages or -1, so maybe all these examples could be cleaned up the other direction.

Check return -- unsure if I should keep the assert or not

I think this is ok, my comments are fairly minor.

sys/dev/nvme/nvme_ctrlr.c
1289

IMO it'd be a bit cleaner to compute max_pages here and return it by making the max_pages parameter an out-param. Or, have a helper function that just wraps vm_fault_quick_hold_pages() and returns the number of pages held or returns EFAULT or EIO if vm_fault_quick_hold_pages() fails or the user asked for too many pages, respectively.

1405

Using malloc() is fine too, since we're in the user ioctl path here I believe. I just prefer not to have this kind of dynamic stack allocation when possible.

sys/dev/nvme/nvme_ctrlr.c
1289

But you have to know how many pages to allocate for the array...
I'm not sure that I like either of these alternatives.

1405

Except that we're trading one malloc for another. The whole point of eliminating vmapbuf is to avoid allocating a buf, which was a choke point in performance measurements one of our vendors ran. While it is the user ioctl path, people are pushing it faster and faster.

I don't see what's wrong with doing the stack allocation like this. Why does that give you heartburn?

markj added inline comments.
sys/dev/nvme/nvme_ctrlr.c
1289

In my head both callers would look like this:

vm_page_t ma[NVME_MAX_PAGES];
int error, npages;

error = nvme_hold_user_pages(addr, len, is_read, ma, &npages, cb, cb_arg);
if (error != 0)
    return (error);
...
vm_page_unhold_pages(ma, npages);

and we'd have

static int
nvme_hold_user_pages(...)
{
    int npages;

    if (nvme_npages(addr, len) > NVME_MAX_PAGES)
        return (EIO);
    npages = vm_fault_quick_hold_pages(...);
    if (npages < 0)
        return (EFAULT);
    *npagesp = npages;
    return (0);
}

In the current patch, both callers have some complicated expression to calculate max_pages, and then the subroutine asserts that the calculation is right. Why not do the calculation in the subroutine itself?

1405

There's nothing really wrong with it here, it's just a dangerous pattern that I'd rather not see used more than necessary.

I didn't realize you explicitly wanted to avoid using malloc(). What's here is ok, but I'd still advocate for just having an array of size NVME_MAX_PAGES. If a user can ask for that many pages, then we should assume that they will. If that leaves us with too little stack space, then it's going to be a problem even with this dynamic sizing, so if the array is going to be on the stack we might as well do the simple thing.

This revision is now accepted and ready to land. Aug 25 2025, 9:39 PM
sys/dev/nvme/nvme_ctrlr.c
50

Are all added headers needed?

1405

I agree. The fixed size array takes exactly 1K of stack, which is large, but should be fine for top kernel code.

update, per review (except header files)

This revision now requires review to proceed. Aug 26 2025, 2:07 AM
This revision is now accepted and ready to land. Aug 26 2025, 2:28 AM
sys/dev/nvme/nvme_ctrlr.c
1289

You still have to check whether nvme_npages(addr, len) > NVME_MAX_PAGES before calling vm_fault_quick_hold_pages(). Note that vm_fault_quick_hold_pages() panics the kernel if you don't provide a large enough array (which is kind of a silly interface to be fair).

sys/dev/nvme/nvme_ctrlr.c
1289

Yes, we can be more friendly and just return the error in this case.

Add check for enough pages and preen .h's to what's required to compile.

This revision now requires review to proceed. Aug 26 2025, 4:31 PM

Don't need a separate bool. max_pages will do the trick.

sys/dev/nvme/nvme_ctrlr.c
1433

This is squashing both error cases (request too big; failed to fault pages) to EFAULT.

npages is a better variable name in these two functions

sys/dev/nvme/nvme_ctrlr.c
1433

Yes, I know; in both places. I thought about it, though: if we don't have enough pages elsewhere, we can return this value. And if we made vm_fault_quick_hold_pages return an error in that case, the rest of the VM system would return EFAULT and we'd not know... The only way around that is to check where we check for max_xfer_size. De facto, max_xfer_size is more like 128k or 256k, so it will catch this issue there (in a rather unfortunately noisy way). I can add a second condition there (and in the other routine) as well. That would be marginally better at not too great a cost, I think.

check earlier for too big. This is folded into EIO, but other too-large
transactions are as well. The message will happen, at least, though it could be
spammy in a buggy application.

markj added inline comments.
sys/dev/nvme/nvme_ctrlr.c
1433

kib's proposed vm_fault_grab_hold_pages_e() addresses that problem.

Some refactoring as I suggested earlier with the code snippet would also avoid the duplicate nvme_npages() checks. That is, have separate helper functions for 1) fetching user pages, 2) building a request structure from them.

This revision is now accepted and ready to land. Aug 27 2025, 6:16 PM
sys/dev/nvme/nvme_ctrlr.c
1433

I'll redo with kib's thing, but we do need a check somewhere that we have enough pages before we write to the array. Ideally, that would be an assert-level check, since it's easier to unwind earlier in the function. But we'll see how it turns out. I think the separate functions would be more complicated, though... I'll explore that when I refactor for kib's new interface.

This revision now requires review to proceed. Sep 2 2025, 4:36 PM
imp retitled this revision from nvme: Call vm_fault_quick_hold_pages instead of vmapbuf to nvme: Call vm_fault_hold_pages_e instead of vmapbuf. Sep 2 2025, 4:37 PM
imp edited the test plan for this revision. (Show Details)
sys/dev/nvme/nvme_ctrlr.c
1273

IMO the interface would be somewhat cleaner if the function returned 0/error, and take a pointer to *req to return the request.

update per review: new api and return error, not request

imp retitled this revision from nvme: Call vm_fault_hold_pages_e instead of vmapbuf to nvme: Call vm_fault_hold_pages instead of vmapbuf. Sep 3 2025, 3:05 PM
imp edited the test plan for this revision. (Show Details)
This revision was not accepted when it landed; it landed in state Needs Review. Sep 3 2025, 3:56 PM
This revision was automatically updated to reflect the committed changes.