vmm: Add support for specifying NUMA configuration
Needs Review · Public

Authored by bnovkov on Mar 30 2024, 4:33 PM.
Details

Reviewers
jhb
corvink
markj
Group Reviewers
bhyve
Summary

This patch adds the necessary kernelspace bits required for supporting NUMA domains in bhyve VMs.

The layout of system memory segments and the way they are created have been reworked.
Each guest NUMA domain will now have its own memory segment. Furthermore, the patch allows users to tweak the domainset(9) policy of each domain's backing vm_object.
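
For illustration only, here is a rough sketch of what applying such a domainset(9) policy to a segment's backing object could look like on the kernel side. The helper name is made up, and the assumption that the policy lands in the vm_object's domainset_ref field reflects the stock kernel, not necessarily this diff:

```c
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rwlock.h>
#include <sys/domainset.h>

#include <vm/vm.h>
#include <vm/vm_object.h>

/*
 * Hypothetical helper (not from this diff): attach a NUMA allocation
 * policy to the VM object backing one guest domain's memory segment.
 * Pages later allocated for this object are taken according to
 * 'policy', e.g. DOMAINSET_PREF(0) to prefer host domain 0.
 */
static void
vm_memseg_set_policy(vm_object_t obj, struct domainset *policy)
{
	VM_OBJECT_WLOCK(obj);
	obj->domain.dr_policy = policy;	/* vm_object's domainset_ref */
	VM_OBJECT_WUNLOCK(obj);
}
```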

Diff Detail

Repository
rG FreeBSD src repository
Lint: Skipped
Unit Tests: Skipped

Event Timeline

It's not clear to me why we don't extend the vm_memmap structure instead.

Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:

  • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
  • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.

I think this patch ignores the second goal and makes it harder to implement in the future. It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

IMO a better approach would be to start by finding a way to assign a domain ID to each memory segment. This might require extending some existing interfaces in libvmmapi, particularly vm_setup_memory().
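
To make that suggestion concrete, a per-domain variant of the libvmmapi setup path might look roughly like the declarations below. This is purely hypothetical: the struct and function names are invented for illustration, and the real interface could just as well be an extension of vm_memmap or of vm_setup_memory() itself:

```c
#include <sys/param.h>
#include <sys/cpuset.h>

#include <vmmapi.h>

/*
 * Hypothetical sketch only: describe one guest NUMA domain as a size
 * plus the host domain its backing memseg should prefer (-1 for "no
 * preference").  The point is simply to give each memory segment a
 * domain ID, as suggested above.
 */
struct vm_numa_domain {
	size_t	size;		/* guest memory in this domain, in bytes */
	int	host_domain;	/* host NUMA domain to prefer, or -1 */
};

/* A vm_setup_memory()-style call taking one entry per guest domain. */
int	vm_setup_numa_memory(struct vmctx *ctx,
	    const struct vm_numa_domain *doms, int ndoms,
	    enum vm_mmap_style vms);
```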

> It's not clear to me why we don't extend the vm_memmap structure instead.
>
> Stepping back for a second, the goal of this patch is not really clear to me. I can see two possibilities:
>
>   • We want to create a fake NUMA topology, e.g., to make it easier to use bhyve to test NUMA-specific features in guest kernels.
>   • We want some way to have bhyve/vmm allocate memory from multiple physical NUMA domains on the host, and pass memory affinity information to the guest. In that case, vmm itself needs to ensure, for example, that the VM object for a given memseg has the correct NUMA allocation policy.
>
> I think this patch ignores the second goal and makes it harder to implement in the future.

You're right; the primary goal was to have a way of faking NUMA topologies in a guest for kernel-testing purposes. I did consider the second goal but ultimately decided to focus on the "fake" bits first and implement the rest in a separate patch.
I'll rework the patch so that it covers both goals.

> It also appears to assume that each domain can be described with a single PA range, and I don't really understand why vmm needs to know the CPU affinity of each domain.

I'm not that happy about specifying PA ranges directly. The only other thing I could think of is to let the user specify the amount of memory per domain and let bhyve deal with the PA ranges. Do you think that is a more sane approach?
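
Just to make that alternative concrete, this toy sketch shows the "let bhyve deal with PA ranges" idea: take per-domain sizes from the user and lay the domains out back to back. A real implementation would of course have to respect the hole below 4 GiB and any other reserved guest-physical regions:

```c
#include <stdint.h>

/* Toy sketch: turn user-supplied per-domain sizes into contiguous
 * guest-physical ranges.  Real code would have to account for the
 * memory hole below 4 GiB and other reserved regions. */
struct gpa_range {
	uint64_t base;
	uint64_t len;
};

static void
layout_domains(const uint64_t *sizes, struct gpa_range *out, int ndoms)
{
	uint64_t gpa = 0;

	for (int i = 0; i < ndoms; i++) {
		out[i].base = gpa;
		out[i].len = sizes[i];
		gpa += sizes[i];
	}
}
```
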
As for the CPU affinities, these are needed for the SRAT, but that table can be built purely from userspace. I've kept them in vmm in case we want to get NUMA topology info using bhyvectl, but I guess that information can be obtained from the guest itself. I'll remove the cpusets.
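
For context on the SRAT point above: the userspace side amounts to emitting one memory-affinity entry per emulated domain into the guest's SRAT. The sketch below uses a hand-rolled struct that mirrors the 40-byte SRAT Memory Affinity structure from the ACPI specification; real code would presumably reuse the ACPICA definition (ACPI_SRAT_MEM_AFFINITY) and bhyve's existing ACPI table machinery, neither of which is shown here:

```c
#include <stdint.h>
#include <string.h>

/*
 * Mirrors the SRAT Memory Affinity structure from the ACPI spec
 * (type 1, 40 bytes); shown only to illustrate what userspace has to
 * emit per emulated domain.
 */
struct srat_mem_affinity {
	uint8_t		type;			/* 1 = memory affinity */
	uint8_t		length;			/* 40 */
	uint32_t	proximity_domain;	/* guest NUMA domain ID */
	uint16_t	reserved1;
	uint64_t	base_address;		/* guest PA of the range */
	uint64_t	range_length;		/* size of the range */
	uint32_t	reserved2;
	uint32_t	flags;			/* bit 0: enabled */
	uint64_t	reserved3;
} __attribute__((packed));

static void
srat_fill_mem_affinity(struct srat_mem_affinity *mem, uint32_t domain,
    uint64_t base, uint64_t len)
{
	memset(mem, 0, sizeof(*mem));
	mem->type = 1;
	mem->length = sizeof(*mem);
	mem->proximity_domain = domain;
	mem->base_address = base;
	mem->range_length = len;
	mem->flags = 1;		/* enabled */
}
```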

bnovkov edited the summary of this revision.

Reworked patch and updated summary.

bnovkov edited the summary of this revision.

Update the patch to allow tweaking memory policies for emulated domains:

  • Each emulated domain now has its own 'sysmem' memory segment
  • vm_alloc_memseg now optionally takes a domainset and a domainset(9) policy, which are validated by the new domainset_populate function (see the sketch after this list)
  • The allocated vm_object is then configured to use said domain policy, if any
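
The diff body is not reproduced on this page, so the following is only a guess at the general shape of such a validation step; the function name, the ndomains parameter, and the exact checks are illustrative, and the real domainset_populate may well differ:

```c
#include <sys/param.h>
#include <sys/domainset.h>
#include <sys/errno.h>

/*
 * Sketch of a domainset_populate()-style check (not the code from the
 * diff): reject unknown policies and masks that name domains the host
 * does not have, then copy the result into a struct domainset.
 * 'ndomains' would typically be the host's vm_ndomains; handling of
 * DOMAINSET_POLICY_PREFER's preferred domain is omitted.
 */
static int
domainset_populate_sketch(struct domainset *ds, const domainset_t *mask,
    int policy, int ndomains)
{
	if (policy <= DOMAINSET_POLICY_INVALID ||
	    policy > DOMAINSET_POLICY_INTERLEAVE)
		return (EINVAL);
	if (DOMAINSET_EMPTY(mask))
		return (EINVAL);
	for (int i = ndomains; i < DOMAINSET_SETSIZE; i++)
		if (DOMAINSET_ISSET(i, mask))
			return (EINVAL);

	DOMAINSET_COPY(mask, &ds->ds_mask);
	ds->ds_policy = policy;
	return (0);
}
```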