
Add the MEM_EXTRACT_PADDR ioctl to /dev/mem.
ClosedPublic

Authored by markj on Aug 30 2020, 5:41 PM.

Details

Summary

Currently we have no good mechanism for resolving userspace virtual
addresses to physical addresses. In principle one could use /dev/kmem
to walk page tables from userspace but this seems fraught. This diff
adds a privileged ioctl to provide this functionality.

The intended use-case is DPDK, which performs DMA from userspace and
also wants to be able to determine whether a set of large pages is
physically contiguous. Currently it uses contigmem.ko (shipped with
DPDK) to both allocate and create superpage mappings, and provide the
physical address of each large page. The aim is to replace this module
with the POSIX shm-based interface that Kostik wrote.

Note, DPDK operates in one of two modes: "IOVA as PA" and "IOVA as VA".
The former is for devices not behind an IOMMU and is the mode that
requires MEM_EXTRACT_PADDR. "IOVA as VA" mode makes use of a custom kernel
driver which provides an interface to program the IOMMU in front of a
device, and does not need to be able to resolve UVAs. However, this
kernel driver is not implemented for FreeBSD at the moment.
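
To make the contiguity requirement concrete, here is a minimal sketch of the check a consumer such as DPDK might perform once it has a physical address for each large page (for example, one MEM_EXTRACT_PADDR call per page). The function name and parameters are hypothetical and are not taken from DPDK or from this diff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: paddrs[i] is the physical address backing the i-th
 * large page of a virtually contiguous region, and page_size is the large
 * page size in bytes (e.g. 2MB or 1GB).
 */
bool
large_pages_contiguous(const uint64_t *paddrs, size_t npages,
    uint64_t page_size)
{
	assert(page_size != 0);
	for (size_t i = 1; i < npages; i++) {
		/* Each page must start exactly where the previous one ends. */
		if (paddrs[i] != paddrs[i - 1] + page_size)
			return (false);
	}
	return (true);
}
```

A region with zero or one page is trivially reported as contiguous.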

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj requested review of this revision. Aug 30 2020, 5:41 PM
share/man/man4/mem.4
69 ↗(On Diff #76372)

Maybe MEM_EXTRACT_PADDR?

79 ↗(On Diff #76372)

I would explicitly mention that me_vaddr is an input and that me_paddr and me_domain are outputs.

It may also make sense to note that the result for unwired memory may be invalid by the time it is used.
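
A sketch of the input/output split being requested, for illustration only. The field names come from this review; the struct name, the field types, the helper, and the reading of me_domain as a memory domain are assumptions and are not copied from the diff.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative mirror of the ioctl argument; types are assumed. */
struct mem_extract_sketch {
	uint64_t me_vaddr;	/* input: user virtual address to resolve */
	uint64_t me_paddr;	/* output: physical address backing me_vaddr */
	int	 me_domain;	/* output: memory domain of that page (assumed) */
};

/* Hypothetical helper: set the input field and clear the outputs. */
struct mem_extract_sketch
mem_extract_request(const void *va)
{
	struct mem_extract_sketch me;

	assert(va != NULL);
	memset(&me, 0, sizeof(me));
	me.me_vaddr = (uint64_t)(uintptr_t)va;
	return (me);
}
```

As noted above, unless the memory is wired the output fields are only a snapshot and may no longer describe the mapping by the time the caller uses them.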

sys/dev/mem/memdev.c
106 ↗(On Diff #76372)

Do we want to return an error when the address is not mapped? Userspace can check me_paddr for 0, but to me this feels like leaking a detail of the specific interface used for the implementation.

markj added inline comments.
sys/dev/mem/memdev.c
106 ↗(On Diff #76372)

I agree it seems better to use an error number for this case. I am not sure how we can distinguish an unmapped address from a valid mapping of the page at PA 0, if it exists. AFAIK there is nothing prohibiting this.

markj marked an inline comment as done.

Address feedback.

share/man/man4/mem.4
83 ↗(On Diff #76377)

I think the more interesting case of malfunction is when the page is reclaimed after the ioctl, or even reclaimed and then a different physical page is assigned and mapped there.

sys/dev/mem/memdev.c
106 ↗(On Diff #76372)

On amd64, PA 0 has to be excluded from any use due to L1TF.
I have a vague memory that busdma has issues with PA 0 as well.

markj marked an inline comment as done.

Try to clarify possible races in the man page.

kib added inline comments.
sys/dev/mem/memdev.c
108 ↗(On Diff #76408)

I wonder if ENOMEM is more common for this situation.

This revision is now accepted and ready to land. Aug 31 2020, 5:48 PM

I only wonder whether it would be nice to do this for multi-page ranges, and whether that would let DPDK use fewer ioctls. For example, there could be a me_length output struct member telling DPDK how many bytes (not pages) are both physically and virtually contiguous from the start of the me_vaddr input address. You could then get an entire superpage in one ioctl. It's true that right now pmap_extract() doesn't give you that easily. You could imagine a variant of pmap_extract() that returned the length, at least for things like superpages; there might not be a point in walking additional pages when DPDK won't care.

Another option might be for me_length to be both an input and an output, with DPDK saying how much virtual address space it cares about and the ioctl returning the range of contiguous PA, up to me_length in size.

I'm just not sure whether DPDK does enough ioctls for this to matter, or whether page at a time is OK as-is. For a long-running process that just wires some buffers at startup, this ioctl is probably a fixed cost at startup and not a hot path while running, so page at a time is fine.
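
The in/out me_length idea could be sketched roughly as follows. This is only an illustration of the proposed semantics computed over an array of per-page physical addresses; the function name and parameters are hypothetical, the queried address is assumed to be page-aligned, and nothing like this is part of the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical: paddrs[i] is the physical address of the i-th base page
 * starting at the queried virtual address. Returns the number of bytes,
 * capped at max_len, that are physically contiguous from paddrs[0].
 */
uint64_t
contiguous_length(const uint64_t *paddrs, size_t npages, uint64_t page_size,
    uint64_t max_len)
{
	uint64_t len;
	size_t i;

	assert(page_size != 0);
	if (npages == 0)
		return (0);
	len = page_size;
	/* Stop at the first discontinuity or once the caller's cap is met. */
	for (i = 1; i < npages && len < max_len; i++) {
		if (paddrs[i] != paddrs[i - 1] + page_size)
			break;
		len += page_size;
	}
	return (len < max_len ? len : max_len);
}
```

Here max_len plays the role of the input value of me_length and the return value plays the role of its output value.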

For DPDK the overhead almost certainly doesn't matter - it is calling the ioctl once per large page, i.e., once per 2MB or 1GB page that it manages, and it's done at application startup time. In other words, at the time that DPDK is calling the ioctl, it has already mapped a set of superpages (which later get remapped once the physical addresses are known) and it knows the underlying large page size.

I did think about a more general interface but in the absence of pmap support and a compelling reason to minimize ioctl overhead I just decided to go with the approach in the diff. The other option I considered was to add a sysctl akin to KERN_PROC_VMMAP which returns physical address info for the entire address space (or swap block or fs block info for non-resident pages), but this is more complicated and unnecessary for me. As far as I understand Linux provides something similar in /proc/<pid>/pagemap.
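
For comparison, a minimal sketch of decoding one 64-bit entry read from Linux's /proc/<pid>/pagemap. The bit layout used here (bit 63 for "page present", bits 0-54 for the page frame number) follows the Linux kernel's pagemap documentation; the macro and function names are hypothetical. On recent Linux kernels the PFN field reads as zero without sufficient privilege, so this is only meaningful for a privileged reader.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGEMAP_PRESENT		(UINT64_C(1) << 63)
#define PAGEMAP_PFN_MASK	((UINT64_C(1) << 55) - 1)

/*
 * Hypothetical helper: convert a pagemap entry plus an offset within the
 * page into a physical address. Returns false if the page is not present.
 */
bool
pagemap_entry_to_paddr(uint64_t entry, uint64_t page_size, uint64_t page_off,
    uint64_t *paddr)
{
	assert(paddr != NULL && page_off < page_size);
	if ((entry & PAGEMAP_PRESENT) == 0)
		return (false);
	*paddr = (entry & PAGEMAP_PFN_MASK) * page_size + page_off;
	return (true);
}
```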

Ok, if DPDK already "knows" it's a large page, then I don't see a reason to add a length. Yes, Linux does have the 'pagemap' thing which I've seen used to determine NUMA and superpage stuff from user space in a similar way to mincore(2).

sys/dev/mem/memdev.c
108 ↗(On Diff #76408)

Perhaps EFAULT?

sys/dev/mem/memdev.c
108 ↗(On Diff #76408)

At least mlock(2) and mincore(2) return ENOMEM.
EFAULT typically means that the actual access was made and faulted.
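
The mincore(2) precedent can be observed with a small sketch like the one below, which queries a page that has just been unmapped. The function name is hypothetical. The vec argument is passed through a void * cast because its declared type differs between systems (char * on FreeBSD, unsigned char * on Linux).

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/*
 * Returns the errno set by mincore(2) for an unmapped page, 0 if the call
 * unexpectedly succeeds, or -1 if the setup itself failed.
 */
int
mincore_errno_on_unmapped(void)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	unsigned char vec[1];
	void *p;

	assert(pagesz > 0);
	p = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (p == MAP_FAILED)
		return (-1);
	if (munmap(p, (size_t)pagesz) != 0)
		return (-1);
	/* The range is no longer mapped; mincore only inspects the mapping. */
	if (mincore(p, (size_t)pagesz, (void *)vec) == 0)
		return (0);
	return (errno);
}
```

Both the FreeBSD and Linux manual pages document ENOMEM for a range that is not fully mapped, which matches the argument made above.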

markj retitled this revision from Add the MEM_EXTRACT ioctl to /dev/mem. to Add the MEM_EXTRACT_PADDR ioctl to /dev/mem. Sep 1 2020, 2:05 PM
markj edited the summary of this revision.

Return ENOMEM if the address is not present in the physical map.

This revision now requires review to proceed. Sep 1 2020, 2:05 PM
share/man/man4/mem.4
81 ↗(On Diff #76480)

ENOMEM

206 ↗(On Diff #76480)

Do you need to add ENOMEM to this list?

markj marked 2 inline comments as done.

Harmonize the man page and code.

share/man/man4/mem.4
82 ↗(On Diff #76481)

not mapped or not faulted in.

share/man/man4/mem.4
82 ↗(On Diff #76481)

The ambiguity suggests that we should look at both the vm_map and pmap to see if the address is valid and mapped, and use different errno values for the two cases.

markj marked an inline comment as done.

Return EINVAL if the address is not present in the vm_map, ENOMEM
if the address is not present in the physical map.

Add a "state" field after some discussion with kib@. This can be extended to
provide more information about the mapped page's state. For now just use it to
indicate whether a given VA is valid (i.e., covered by a map_entry) and mapped.

sys/dev/mem/memdev.c
111 ↗(On Diff #76506)

ME_STATE_MAPPED?
Also, why use a bitfield? IMO ME_STATE_MAPPED and ME_STATE_RESIDENT (and ME_STATE_INVALID instead of 0) are good enough alone.

I would also consider checking MAP_ENTRY_USER_WIRED and returning a state like ME_STATE_LOCKED.

sys/dev/mem/memdev.c
111 ↗(On Diff #76506)

Because some of the states are orthogonal. Suppose I added _LOCKED. Then after:

mlock(addr, PAGE_SIZE);
mprotect(addr, PAGE_SIZE, PROT_NONE);

the virtual page will be valid and locked, but not entered into the pmap.
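
The two-call sequence above can be wrapped in a small runnable sketch. The function name is hypothetical. The code only shows that both calls can succeed on the same page; whether the page is then absent from the pmap is kernel state that this program cannot observe.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/*
 * Returns 0 if a page was mapped and then made inaccessible, -1 on failure.
 * *locked reports whether mlock(2) succeeded; it can fail when the
 * RLIMIT_MEMLOCK limit is low.
 */
int
lock_then_revoke(bool *locked)
{
	long pagesz = sysconf(_SC_PAGESIZE);
	char *addr;
	int rc;

	assert(locked != NULL && pagesz > 0);
	addr = mmap(NULL, (size_t)pagesz, PROT_READ | PROT_WRITE,
	    MAP_PRIVATE | MAP_ANON, -1, 0);
	if (addr == MAP_FAILED)
		return (-1);
	addr[0] = 1;	/* fault the page in */
	*locked = (mlock(addr, (size_t)pagesz) == 0);
	/* Revoke all access; the mapping itself (and any lock) remains. */
	rc = mprotect(addr, (size_t)pagesz, PROT_NONE);
	(void)munmap(addr, (size_t)pagesz);
	return (rc);
}
```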

sys/dev/mem/memdev.c
111 ↗(On Diff #76506)

If pte is zero, it cannot be valid for this interface.

Convert the me_state bitmask to return a single value.
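
A single-valued state, as opposed to a bitmask, might be derived along these lines. The enum, its constant names, and the function are hypothetical and chosen for illustration; they are not the identifiers used in the diff. The sketch also assumes that an address present in the pmap is always covered by a map entry.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed names and values, for illustration only. */
enum me_state_sketch {
	ME_SKETCH_INVALID,	/* VA not covered by any map entry */
	ME_SKETCH_VALID,	/* covered by a map entry, not in the pmap */
	ME_SKETCH_MAPPED	/* covered by a map entry and in the pmap */
};

enum me_state_sketch
classify_va(bool has_map_entry, bool in_pmap)
{
	/* Assumption of this sketch: no pmap mapping without a map entry. */
	assert(has_map_entry || !in_pmap);
	if (!has_map_entry)
		return (ME_SKETCH_INVALID);
	return (in_pmap ? ME_SKETCH_MAPPED : ME_SKETCH_VALID);
}
```

With this shape each state implies the one before it, so one value is enough; an orthogonal property such as "locked" would not fit the same ordering.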

sys/dev/mem/memdev.c
111 ↗(On Diff #76506)

I ended up not adding a check for MAP_ENTRY_USER_WIRED for now:

  • We return this information already from the vmmap sysctl.
  • It is not really a property of the physical mapping, as I pointed out.
kib added inline comments.
sys/dev/mem/memdev.c
111 ↗(On Diff #76506)

My point in suggesting to return LOCKED was to provide information that the result is *still* valid and not subject to races with pagedaemon.

This revision is now accepted and ready to land. Sep 1 2020, 8:22 PM
sys/dev/mem/memdev.c
111 ↗(On Diff #76506)

In the one use-case I have so far, the application knows that this is not a problem. I slightly prefer to wait until there is some legitimate reason to add the new state.

This revision was automatically updated to reflect the committed changes.