Details

Reviewers

jhb
corvink
markj
andrew
manu

Group Reviewers

bhyve

Commits

rGa4197ea47777: vmm: Add support for guest NUMA emulation

Summary

This patch adds the necessary kernelspace bits required for supporting NUMA domains in bhyve VMs.

The layout of system memory segments and how they're created has been reworked.
Each guest NUMA domain will now have its own memory segment. Furthermore, the patch allows users to tweak the domain's backing vm_object domainset(9) policy.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

bnovkov created this revision.Mar 30 2024, 4:33 PM

Herald added a reviewer: bhyve. · View Herald TranscriptMar 30 2024, 4:33 PM

Herald added subscribers: bcran, rgrimes, imp. · View Herald Transcript

bnovkov requested review of this revision.Mar 30 2024, 4:33 PM

bnovkov added a child revision: D44566: libvmmapi: Add support for setting up and configuring guest NUMA domains.

It's not clear to me why we don't extend the vm_memmap structure instead.

Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:

We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.

I think this patch ignores the second goal and makes it harder to implement in the future. It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

IMO a better approach would be to start by finding a way to assign a domain ID to each memory segment. This might require extending some existing interfaces in libvmmapi, particularly vm_setup_memory().

In D44565#1016805, @markj wrote:

It's not clear to me why we don't extend the vm_memmap structure instead.

Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:

We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.

We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.

I think this patch ignores the second goal and makes it harder to implement in the future.

You're right, the primary goal was to have a way of faking NUMA topologies in a guest for kernel testing purposes. I did consider the second goal but ultimately decided to focus on the "fake" bits first and implement the rest in a separate patch.
I'll rework the patch so that it covers both goals.

It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

I'm not that happy about directly specifying PA ranges directly. The only other thing I could think of is to let the user specify the amount of memory per-domain and let bhyve deal with PA ranges, do you think that this is a more sane approach?
As for the CPU affinities, these are needed for SRAT but that can be done purely from userspace. I've kept them in vmm in case we might want to get NUMA topology info using bhyvectl, but I guess that information can be obtained from the guest itself. I'll remove the cpusets.

Reworked patch and updated summary.

Update the patch to allow tweaking memory policies for emulated domains:

Each emulated domain now has its own 'sysmem' memory segment
vm_alloc_memseg now optionally takes a domainset and a domainset(9) policy, which are validated by the new domainset_populate function
The allocated vm_object is then configured to use said domain policy, if any

bnovkov added a parent revision: D46608: kern_domainset: Split domainset validation logic into a separate function.Sep 8 2024, 6:12 PM

Rebase on top of recent vmm memory unification changes

Herald added a reviewer: andrew. · View Herald TranscriptJun 23 2025, 7:02 PM

Herald added a reviewer: andrew. · View Herald Transcript

Herald added a reviewer: manu. · View Herald Transcript

Herald added subscribers: riscv, emaste. · View Herald Transcript

markj added inline comments.Jun 29 2025, 4:40 PM

sys/amd64/include/vmm_dev.h
57–60	This structure is part of the ioctl interface, so changing it this way breaks binary compatibility for the VM_ALLOC_MEMSEG ioctl and others. This is important in practice, mainly for running bhyve in a jail that's older than the host kernel. So either you need to add compat ioctls (see VM_RUN_13 for an example), or define alternate ioctls which take a domainset parameter, i.e., VM_ALLOC_MEMSEG_DOMAIN etc.. I think the former is better. I think the compat layer will also need to handle segid mapping, i.e., handling old values for VM_BOOTROM etc..
sys/dev/vmm/vmm_mem.c
13	This should be sorted after malloc.h.
191	Redundant since you overwrite the buffer in the next line.
192	In principle, vm_alloc_memseg() is a generic subroutine for vmm code, so the fact that `ds_mask` must be a userspace pointer is surprising. IMO this copyin() should be done closer to the ioctl layer, probably handling it in alloc_memseg() is fine. Then this routine can be used even if the request isn't coming from userspace.
sys/dev/vmm/vmm_mem.h
22	Since the use of VM_MAXSYSMEM below relies on this.

Address @markj 's comments:

Move domainset parsing code to the ioctl handler
Add COMPAT handlers for pre-15.0 memory segments

bnovkov marked 5 inline comments as done.Jul 3 2025, 10:26 PM

bnovkov added inline comments.

sys/amd64/include/vmm_dev.h
57–60	Thanks, I failed to take that into account back when I was writing the initial patch. I've added the corresponding compat handlers.
sys/dev/vmm/vmm_mem.c
192	right, agreed; I've moved the copyin parts to the ioctl handler.

This is looking pretty good, I just had one question about compatibility, the other comments are just nits.

sys/dev/vmm/vmm_dev.c
314	This new compat code can be amd64-only as well: the arm64 and riscv bhyve ports are new in 15.
315	Extra newline
648
651	Since `mask` is allocated in the block above, can we declare `mask` and free it in that block too?
sys/dev/vmm/vmm_mem.c
190	As a matter of style, this function is inconsistent about using braces for single-statement bodies. Either style is ok per style(9), but it should be consistent.
sys/dev/vmm/vmm_mem.h
27	This commit should arguably remove the libvmmapi definitions of these constants. Hmm. These constants are used directly by consumers (e.g., via vm_create_devmem()). If there is out-of-tree code which uses them, it'll break when libvmmapi.so is updated. grub2-bhyve is an example of an out-of-tree consumer, but I believe it's not affected. Do we think this is unlikely to break anything? Alternately, is it possible to avoid changing these values entirely?

Address @markj 's comments.

sys/dev/vmm/vmm_mem.h
27	I am personally not aware of any out-of-tree users that would be affected by this. However, I think we might be able to keep the old values and append the newer ones. Something along the lines of: enum { VM_SYSMEM = 0, VM_BOOTROM, VM_FRAMEBUFFER, VM_PCIROM, VM_SYSMEM_EMUL, VM_SYSMEM_EMUL_END=VM_SYSMEM_EMUL + VM_MAXSYSMEM, VM_MEMSEG_END }; (am open to suggestions for a better name, nothing else comes to mind atm). `alloc_memseg` and a couple of other functions would have to detect and route `VM_SYSMEM` to `VM_SYSMEM_EMUL`, but that's straightforward to implement. Does this seem reasonable? I don't think we can avoid changing `VM_MEMSEG_END` though.

I think this is ok with the change I suggested inline.

sys/dev/vmm/vmm_mem.h
27	I think there is no problem with changing `VM_MEMSEG_END`. The other option here is to bump VMMAPI_VERSION in libvmmapi and bump the DSO version as well, as was done in commit 7d9ef309bd09c for example. I think that's fine.

This revision is now accepted and ready to land.Jul 15 2025, 10:43 PM

bnovkov added inline comments.Jul 27 2025, 4:32 PM

sys/dev/vmm/vmm_mem.h
27	Bumped, thanks!

Closed by commit rGa4197ea47777: vmm: Add support for guest NUMA emulation (authored by bnovkov). · Explain WhyJul 27 2025, 4:33 PM

This revision was automatically updated to reflect the committed changes.

bnovkov added a commit: rGa4197ea47777: vmm: Add support for guest NUMA emulation.

I think this broke an already installed grub-bhyve with libvmmapi.so.6 for me - I had no COMPAT_FREEBSD14 in the kernel.

Can we add this to RELNOTES/UPDATING?

vmm: Add support for specifying NUMA configuration
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 159237

lib/libvmmapi/Makefile

lib/libvmmapi/vmmapi.h

sys/amd64/include/vmm_dev.h

sys/arm64/include/vmm_dev.h

sys/dev/vmm/vmm_dev.c

sys/dev/vmm/vmm_mem.h

sys/dev/vmm/vmm_mem.c

sys/riscv/include/vmm_dev.h

vmm: Add support for specifying NUMA configurationClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 159237

lib/libvmmapi/Makefile

lib/libvmmapi/vmmapi.h

sys/amd64/include/vmm_dev.h

sys/arm64/include/vmm_dev.h

sys/dev/vmm/vmm_dev.c

sys/dev/vmm/vmm_mem.h

sys/dev/vmm/vmm_mem.c

sys/riscv/include/vmm_dev.h

vmm: Add support for specifying NUMA configuration
ClosedPublic
Actions

Revision Contents
Changeset List