
vmm: Add support for specifying NUMA configuration
ClosedPublic

Authored by bnovkov on Mar 30 2024, 4:33 PM.

Details

Summary

This patch adds the kernel-space bits required to support NUMA domains in bhyve VMs.

The layout of system memory segments, and the way they are created, has been reworked: each guest NUMA domain now has its own memory segment. Furthermore, the patch lets users tweak the domainset(9) policy of each domain's backing vm_object.
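In kernel terms, applying such a policy to a segment's backing object boils down to handing the object an interned domainset. The sketch below is purely illustrative; the helper name, the omission of locking, and the use of the object's domain field are assumptions made here, not code from this patch.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/domainset.h>

#include <vm/vm.h>
#include <vm/vm_object.h>

/*
 * Illustrative sketch only: attach a domainset(9) allocation policy to
 * the VM object backing one guest NUMA domain's memory segment.
 */
static int
memseg_set_domain_policy(vm_object_t obj, const domainset_t *mask,
    int policy, int prefer)
{
        struct domainset ds;

        bzero(&ds, sizeof(ds));
        DOMAINSET_COPY(mask, &ds.ds_mask);
        ds.ds_policy = policy;          /* e.g. DOMAINSET_POLICY_PREFER */
        ds.ds_prefer = prefer;          /* only meaningful for PREFER */

        /*
         * domainset_create() returns a reference to an interned,
         * immutable domainset that the page allocator will consult
         * when pages are faulted into the object.
         */
        obj->domain.dr_policy = domainset_create(&ds);
        if (obj->domain.dr_policy == NULL)
                return (EINVAL);
        return (0);
}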

Diff Detail

Repository
rG FreeBSD src repository
Lint: Skipped
Unit Tests: Skipped

Event Timeline

It's not clear to me why we don't extend the vm_memmap structure instead.

Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:

  • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
  • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.

I think this patch ignores the second goal and makes it harder to implement in the future. It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

IMO a better approach would be to start by finding a way to assign a domain ID to each memory segment. This might require extending some existing interfaces in libvmmapi, particularly vm_setup_memory().
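For concreteness, an extension along those lines could look something like the sketch below; vm_setup_memory_domains() and struct vm_numa_domain are hypothetical names invented here to illustrate the suggestion, not part of libvmmapi.

#include <sys/types.h>

#include <vmmapi.h>

/*
 * Hypothetical sketch of a domain-aware memory-setup interface: one
 * system memory segment is created per guest NUMA domain, each carrying
 * a domain ID and, optionally, a host domain to back it with.  Nothing
 * here is part of the real libvmmapi API.
 */
struct vm_numa_domain {
        size_t  size;           /* guest memory in this domain, in bytes */
        int     host_domain;    /* backing host domain, or -1 for "any" */
};

int     vm_setup_memory_domains(struct vmctx *ctx,
            const struct vm_numa_domain *domains, int ndomains,
            enum vm_mmap_style style);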

> It's not clear to me why we don't extend the vm_memmap structure instead.
>
> Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:
>
>   • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
>   • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.
>
> I think this patch ignores the second goal and makes it harder to implement in the future.

You're right, the primary goal was to have a way of faking NUMA topologies in a guest for kernel testing purposes. I did consider the second goal but ultimately decided to focus on the "fake" bits first and implement the rest in a separate patch.
I'll rework the patch so that it covers both goals.

> It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

I'm not that happy about specifying PA ranges directly. The only other thing I could think of is to let the user specify the amount of memory per domain and have bhyve deal with the PA ranges; do you think that is a saner approach?
As for the CPU affinities, these are needed for the SRAT, but that can be done purely from userspace. I've kept them in vmm in case we want to fetch NUMA topology info through bhyvectl, but that information can be obtained from the guest itself, so I'll remove the cpusets.
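For reference, the memory side of the SRAT can indeed be built entirely in userspace by filling ACPICA's ACPI_SRAT_MEM_AFFINITY subtables, roughly as in the sketch below. How an entry gets appended to bhyve's generated tables is bhyve-specific and not shown here, and the include path is an assumption about the build environment.

#include <stdint.h>
#include <string.h>

/* ACPICA table definitions; the exact include path depends on the build. */
#include <contrib/dev/acpica/include/acpi.h>

/*
 * Sketch: fill one SRAT memory-affinity entry declaring that the guest
 * physical range [gpa, gpa + len) belongs to the given proximity
 * (NUMA) domain.
 */
static void
srat_fill_mem_affinity(ACPI_SRAT_MEM_AFFINITY *mem, uint32_t domain,
    uint64_t gpa, uint64_t len)
{
        memset(mem, 0, sizeof(*mem));
        mem->Header.Type = ACPI_SRAT_TYPE_MEMORY_AFFINITY;
        mem->Header.Length = sizeof(*mem);
        mem->ProximityDomain = domain;
        mem->BaseAddress = gpa;
        mem->Length = len;
        mem->Flags = ACPI_SRAT_MEM_ENABLED;
}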

bnovkov edited the summary of this revision.

Reworked patch and updated summary.

bnovkov edited the summary of this revision.

Update the patch to allow tweaking memory policies for emulated domains:

  • Each emulated domain now has its own 'sysmem' memory segment
  • vm_alloc_memseg now optionally takes a domainset and a domainset(9) policy, which are validated by the new domainset_populate function (the kind of checks involved are sketched after this list)
  • The allocated vm_object is then configured to use said domain policy, if any
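A rough idea of the kind of checks such validation might perform is sketched below; the function name, parameters, and the MAXMEMDOM check are illustrative assumptions, not the body of domainset_populate.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/domainset.h>

/*
 * Illustrative sketch of the kind of validation a routine like
 * domainset_populate() might perform; names and checks here are
 * assumptions, not the patch's code.
 */
static int
validate_guest_domainset(const domainset_t *mask, int policy, int prefer)
{
        /* The mask must name at least one memory domain. */
        if (DOMAINSET_EMPTY(mask))
                return (EINVAL);
        /*
         * Reject domains beyond the compile-time limit; a real
         * implementation would check against the host's domain count.
         */
        if (DOMAINSET_FLS(mask) > MAXMEMDOM)
                return (EINVAL);

        switch (policy) {
        case DOMAINSET_POLICY_ROUNDROBIN:
        case DOMAINSET_POLICY_FIRSTTOUCH:
        case DOMAINSET_POLICY_INTERLEAVE:
                break;
        case DOMAINSET_POLICY_PREFER:
                /* A preferred domain must be part of the mask. */
                if (prefer < 0 || !DOMAINSET_ISSET(prefer, mask))
                        return (EINVAL);
                break;
        default:
                return (EINVAL);
        }
        return (0);
}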

Rebase on top of recent vmm memory unification changes

sys/amd64/include/vmm_dev.h
77

This structure is part of the ioctl interface, so changing it this way breaks binary compatibility for the VM_ALLOC_MEMSEG ioctl and others. This is important in practice, mainly for running bhyve in a jail that's older than the host kernel.

So either you need to add compat ioctls (see VM_RUN_13 for an example), or define alternate ioctls which take a domainset parameter, i.e., VM_ALLOC_MEMSEG_DOMAIN etc. I think the former is better.

I think the compat layer will also need to handle segid mapping, i.e., handling old values for VM_BOOTROM etc.
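The general shape of the compat approach might look like the following sketch; all struct names, field names (ds_policy, ds_mask, ...), and sizes here are illustrative assumptions, not the layouts used by this patch.

#include <stddef.h>
#include <string.h>

/*
 * Illustrative compat sketch (cf. VM_RUN_13): freeze the pre-change
 * struct layout under an old ioctl number and translate it in the
 * handler.  All names and fields here are assumptions.
 */

/* The memseg request as the old ABI laid it out. */
struct vm_memseg_old {
        int     segid;
        size_t  len;
        char    name[16];
};

/* The new layout: the old fields plus NUMA policy information. */
struct vm_memseg_new {
        int             segid;
        size_t          len;
        char            name[16];
        int             ds_policy;      /* domainset(9) policy, 0 = default */
        const void      *ds_mask;       /* user pointer to the domain mask */
        size_t          ds_mask_size;
};

/*
 * Called from the compat ioctl handler: widen an old-ABI request into
 * the new in-kernel representation before the common code runs.
 */
static void
vm_memseg_old_to_new(const struct vm_memseg_old *o, struct vm_memseg_new *n)
{
        memset(n, 0, sizeof(*n));
        n->segid = o->segid;    /* segid remapping would also happen here */
        n->len = o->len;
        memcpy(n->name, o->name, sizeof(o->name));
        /* The old ABI carries no domainset: keep the default policy. */
}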

sys/dev/vmm/vmm_mem.c
12 ↗(On Diff #157460)

This should be sorted after malloc.h.

194 ↗(On Diff #157460)

Redundant since you overwrite the buffer in the next line.

195 ↗(On Diff #157460)

In principle, vm_alloc_memseg() is a generic subroutine for vmm code, so the fact that ds_mask must be a userspace pointer is surprising. IMO this copyin() should be done closer to the ioctl layer; handling it in alloc_memseg() is probably fine. Then this routine can be used even if the request isn't coming from userspace.
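A minimal sketch of that split, with illustrative names: the ioctl path copies the mask into kernel memory, so vm_alloc_memseg() only ever sees a kernel pointer (or NULL).

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/domainset.h>

/*
 * Illustrative sketch: performed in (or near) the ioctl handler so that
 * vm_alloc_memseg() never has to assume its caller passed a userspace
 * pointer.
 */
static int
copyin_domainset_mask(const void *umask, size_t size, domainset_t *kmask)
{
        /* Reject masks whose size does not match the kernel's type. */
        if (size != sizeof(domainset_t))
                return (ERANGE);
        return (copyin(umask, kmask, size));
}

alloc_memseg() would call something like this and then hand the kernel-space mask down to vm_alloc_memseg().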

sys/dev/vmm/vmm_mem.h
22 ↗(On Diff #157460)

Since the use of VM_MAXSYSMEM below relies on this.

Address @markj's comments:

  • Move domainset parsing code to the ioctl handler
  • Add COMPAT handlers for pre-15.0 memory segments
bnovkov added inline comments.
sys/amd64/include/vmm_dev.h
77

Thanks, I failed to take that into account back when I was writing the initial patch.
I've added the corresponding compat handlers.

sys/dev/vmm/vmm_mem.c
195 ↗(On Diff #157460)

Right, agreed; I've moved the copyin parts to the ioctl handler.

This is looking pretty good. I just had one question about compatibility; the other comments are just nits.

sys/dev/vmm/vmm_dev.c
296

This new compat code can be amd64-only as well: the arm64 and riscv bhyve ports are new in 15.

297

Extra newline

547-550

Since mask is allocated in the block above, can we declare mask and free it in that block too?

sys/dev/vmm/vmm_mem.c
190 ↗(On Diff #157912)

As a matter of style, this function is inconsistent about using braces for single-statement bodies. Either style is ok per style(9), but it should be consistent.

sys/dev/vmm/vmm_mem.h
27 ↗(On Diff #157912)

This commit should arguably remove the libvmmapi definitions of these constants.

Hmm. These constants are used directly by consumers (e.g., via vm_create_devmem()). If there is out-of-tree code that uses them, it'll break when libvmmapi.so is updated. grub2-bhyve is an example of an out-of-tree consumer, but I believe it's not affected. Do we think this is unlikely to break anything? Alternatively, is it possible to avoid changing these values entirely?

bnovkov marked 7 inline comments as done.

Address @markj's comments.

sys/dev/vmm/vmm_mem.h
27 ↗(On Diff #157912)

I am personally not aware of any out-of-tree users that would be affected by this.
However, I think we might be able to keep the old values and append the newer ones. Something along the lines of:

enum {
        VM_SYSMEM = 0,
        VM_BOOTROM,
        VM_FRAMEBUFFER,
        VM_PCIROM,
        VM_SYSMEM_EMUL,
        VM_SYSMEM_EMUL_END = VM_SYSMEM_EMUL + VM_MAXSYSMEM,
        VM_MEMSEG_END
};

(I'm open to suggestions for a better name; nothing else comes to mind at the moment.)

alloc_memseg and a couple of other functions would have to detect and route VM_SYSMEM to VM_SYSMEM_EMUL (roughly as sketched below), but that's straightforward to implement. Does this seem reasonable?

I don't think we can avoid changing VM_MEMSEG_END, though.
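A compact sketch of that routing follows; VM_MAXSYSMEM's value and the helper name are placeholders, and the enum simply repeats the proposal above so the snippet stands on its own.

/* Placeholder value; the patch defines the real per-VM domain limit. */
#define VM_MAXSYSMEM    8

enum {
        VM_SYSMEM = 0,          /* legacy value, kept for old consumers */
        VM_BOOTROM,
        VM_FRAMEBUFFER,
        VM_PCIROM,
        VM_SYSMEM_EMUL,         /* first emulated NUMA domain segment */
        VM_SYSMEM_EMUL_END = VM_SYSMEM_EMUL + VM_MAXSYSMEM,
        VM_MEMSEG_END
};

/*
 * Sketch of the routing alloc_memseg() and friends would do: an
 * old-style VM_SYSMEM request lands on emulated domain 0's segment,
 * while every other segid passes through unchanged.
 */
static int
memseg_route_segid(int segid)
{
        if (segid == VM_SYSMEM)
                return (VM_SYSMEM_EMUL);
        return (segid);
}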

I think this is ok with the change I suggested inline.

sys/dev/vmm/vmm_mem.h
27 ↗(On Diff #157912)

I think there is no problem with changing VM_MEMSEG_END.

The other option here is to bump VMMAPI_VERSION in libvmmapi and bump the DSO version as well, as was done in commit 7d9ef309bd09c for example. I think that's fine.
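For illustration, that route amounts to something like the snippet below; the version number is a placeholder, not the value actually committed.

/* lib/libvmmapi/vmmapi.h: bump the API version consumers can test for. */
#define VMMAPI_VERSION  9999    /* placeholder value only */

/*
 * The library's Makefile would bump SHLIB_MAJOR at the same time so
 * that existing binaries keep using the old libvmmapi.so.
 */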

This revision is now accepted and ready to land. (Tue, Jul 15, 10:43 PM)
sys/dev/vmm/vmm_mem.h
27 ↗(On Diff #157912)

Bumped, thanks!

This revision was automatically updated to reflect the committed changes.