
bhyve: Add support for specifying VM NUMA configuration
Closed, Public

Authored by bnovkov on Mar 30 2024, 4:33 PM.

Details

Summary

This patch adds basic support for adding NUMA domains to a bhyve VM.

The user can define NUMA domains using the -n flag, which expects a domain ID, a CPU set, and a memory size for each domain.
After parsing the node configurations, we use the interfaces added in previous patches to set the NUMA configuration for the virtual machine.
We then use that configuration to build the ACPI Static Resource Affinity Table (SRAT), which conveys the NUMA topology to the guest.

Users can optionally configure domainset(9) allocation policies for each domain.
Since each NUMA domain is essentially a separate SYSMEM segment, we can parse user-provided domainset(9) policies and install them into the backing vm_object of the appropriate segment.
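
For illustration, here is a minimal sketch of how a single -n argument of the form id=0,size=5G,cpus=0-4 could be tokenized. The struct layout and the parse_numa_option() helper are hypothetical and not taken from the patch; expand_number(3) is the stock libutil routine for size suffixes such as "5G", and the CPU list is kept as a raw string here rather than being parsed into a cpuset.

#include <sys/types.h>

#include <err.h>
#include <libutil.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct numa_node_opt {
        int      id;
        uint64_t size;          /* domain memory size in bytes */
        char     cpus[64];      /* raw CPU list, e.g. "0-4" */
};

static void
parse_numa_option(char *opt, struct numa_node_opt *node)
{
        char *kv, *key, *value;

        memset(node, 0, sizeof(*node));
        node->id = -1;
        while ((kv = strsep(&opt, ",")) != NULL) {
                key = strsep(&kv, "=");
                value = kv;
                if (value == NULL)
                        errx(4, "invalid NUMA option '%s'", key);
                if (strcmp(key, "id") == 0)
                        node->id = (int)strtol(value, NULL, 10);
                else if (strcmp(key, "size") == 0) {
                        /* expand_number(3) handles suffixes like "5G". */
                        if (expand_number(value, &node->size) != 0)
                                errx(4, "invalid size '%s'", value);
                } else if (strcmp(key, "cpus") == 0) {
                        strlcpy(node->cpus, value, sizeof(node->cpus));
                } else
                        errx(4, "unknown NUMA option key '%s'", key);
        }
}

The caller would pass a writable copy of optarg, since strsep(3) modifies its argument.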

Test Plan

I've tested the patch by booting a 10GB FreeBSD VM with two domains: -n id=0,size=5G,cpus=0-4 -n id=1,size=5G,cpus=5-9.
I can confirm that FreeBSD detects the specified domains properly.

bojan@dev /u/h/bojan> sysctl vm.phys_segs
vm.phys_segs: 
SEGMENT 0:

start:     0x1000
end:       0xa0000
domain:    0
free list: 0xffffffff81c073f0
...
SEGMENT 10:

start:     0x180000000
end:       0x2b6124000
domain:    1
free list: 0xffffffff81c078d0
...
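
As an additional in-guest sanity check (not part of this revision), a trivial program can query vm.ndomains, a standard sysctl on NUMA-aware FreeBSD kernels, and confirm that the guest sees as many domains as were passed via -n:

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>

int
main(void)
{
        int ndomains;
        size_t len = sizeof(ndomains);

        if (sysctlbyname("vm.ndomains", &ndomains, &len, NULL, 0) != 0)
                err(1, "sysctlbyname(vm.ndomains)");
        printf("guest sees %d NUMA domain(s)\n", ndomains);
        return (0);
}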

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

bnovkov edited the summary of this revision.
bnovkov edited the test plan for this revision.

Update patch.

bnovkov edited the summary of this revision.

Update patch:

  • bhyve now uses cpuset(1)'s domainset(9) policy parser, allowing users to configure domainset(9) allocation policies for vm_objects backing each NUMA domain
  • Added two usage examples to the manpage
usr.sbin/bhyve/bhyve.8
279

To me this description is unclear. NUMA might refer to GPAs or HPAs. As I understand from the code, this option lets you:

  • divide guest physical memory into emulated domains, described by the SRAT;
  • map those emulated domains to NUMA policies for the host physical memory which backs them.

Moreover, I think this lets you configure one independent of the other. In other words, you could use this option to:

  • configure a NUMA policy for all of guest memory while keeping a single emulated domain (which is mostly redundant because cpuset(1) can already provide that)
  • configure emulated domains while retaining the default domainset policy for all of them (useful for testing purposes).

Is that more or less correct?

303

Stray newline.

usr.sbin/bhyve/bhyverun.c
253

This could be for (i = 0; i < VM_MAXMEMDOM; i++)

bnovkov added inline comments.
usr.sbin/bhyve/bhyve.8
279

To me this description is unclear. NUMA might refer to GPAs or HPAs. As I understand from the code, this option lets you:

  • divide guest physical memory into emulated domains, described by the SRAT;
  • map those emulated domains to NUMA policies for the host physical memory which backs them.

Moreover, I think this lets you configure one independent of the other. In other words, you could use this option to:

  • configure a NUMA policy for all of guest memory while keeping a single emulated domain (which is mostly redundant because cpuset(1) can already provide that)
  • configure emulated domains while retaining the default domainset policy for all of them (useful for testing purposes).

Is that more or less correct?

Agreed, my phrasing was a bit ambiguous. Your summary is correct; I'll try to work those points into the manpage (I'm open to more suggestions as well).

Looks pretty good to me.

usr.sbin/bhyve/bhyve.8
275

We should also mention that this is amd64-only for now.

We could add NUMA affinity info to the generated DTB used by the arm64 and riscv targets, but I'm not sure if that's worth the effort; it'd be nice to instead build ACPI tables for them.

bnovkov marked 4 inline comments as done.

Address @markj 's comments and rework acpi.c changes:

  • acpi.c now tracks each 'vCPUid->domain' mapping
  • New mappings can be added using acpi_add_vcpu_affinity
  • Manpage fixes

I have reworked the acpi.c changes in a way that simplifies tracking cpu affinities (and allows us to get rid of libvmmapi cpu affinity interfaces).

Each 'vCPUid->domain' mapping is now stored as a separate entry and later converted into an ACPI CPU affinity structure while building the SRAT.
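
Roughly, the bookkeeping could look like the sketch below. The acpi_add_vcpu_affinity() name is taken from the changelog above, but its signature, the table-building helper, and the assumption that a vCPU's APIC ID equals its vCPU ID are mine; the SRAT Processor Local APIC Affinity layout is written out locally instead of using the ACPICA definitions.

#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_VCPUS       128

struct vcpu_affinity {
        int vcpuid;
        int domain;
};

static struct vcpu_affinity vcpu_affinities[MAX_VCPUS];
static int nvcpu_affinities;

/* Record one vCPU -> NUMA domain mapping (assumed interface). */
void
acpi_add_vcpu_affinity(int vcpuid, int domain)
{
        assert(nvcpu_affinities < MAX_VCPUS);
        vcpu_affinities[nvcpu_affinities].vcpuid = vcpuid;
        vcpu_affinities[nvcpu_affinities].domain = domain;
        nvcpu_affinities++;
}

/* SRAT Processor Local APIC Affinity entry (ACPI spec, subtable type 0). */
struct srat_cpu_affinity {
        uint8_t  type;                  /* 0 */
        uint8_t  length;                /* 16 */
        uint8_t  prox_domain_lo;
        uint8_t  apic_id;
        uint32_t flags;                 /* bit 0: enabled */
        uint8_t  sapic_eid;
        uint8_t  prox_domain_hi[3];
        uint32_t clock_domain;
} __attribute__((packed));

/* Convert each recorded mapping into an SRAT entry while building the table. */
static void
build_srat_cpu_entries(struct srat_cpu_affinity *entries)
{
        for (int i = 0; i < nvcpu_affinities; i++) {
                struct srat_cpu_affinity *e = &entries[i];

                memset(e, 0, sizeof(*e));
                e->type = 0;
                e->length = sizeof(*e);
                e->prox_domain_lo = vcpu_affinities[i].domain & 0xff;
                e->apic_id = vcpu_affinities[i].vcpuid; /* assumes APIC ID == vCPU id */
                e->flags = 1;                           /* enabled */
        }
}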

Have you tried booting a Linux guest with NUMA configured?

usr.sbin/bhyve/bhyve.8
279

Is "it" referring to the -n option? .Nm expands to "bhyve", so this is a bit confusing. A straw proposal:

.Fl n
Configure guest NUMA domains. This option applies only to the amd64 platform.
.Pp
The
.Fl n
option allows the guest physical address space to be partitioned into domains.
The layout of each domain is encoded in an ACPI table visible to the guest operating system.
The
.Fl n
option also allows the specification of a
.Xr domainset 9
memory allocation policy for the host memory backing a given NUMA domain.
...
308

Again, it's unclear what happens or should happen if the sum of the per-domain sizes doesn't equal the total amount of memory. We should describe what happens in that case.

Is it possible to omit the size of each domain and let bhyve split up available guest memory (roughly) evenly?

bnovkov marked 3 inline comments as done.

Address @markj's comments

Have you tried booting a Linux guest with NUMA configured?

Yes, Linux appears to detect the domains properly.
Here's the relevant dmesg output when booting a Debian 12 nocloud VM using -n id=0,size=4G,cpus=0-1 -n id=1,size=4G,cpus=2-3:

[[snip]]
[    0.015787] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0xbfffffff]
[    0.016208] ACPI: SRAT: Node 0 PXM 0 [mem 0xff000000-0xff000fff]
[    0.016630] ACPI: SRAT: Node 0 PXM 0 [mem 0xffc84000-0xffffffff]
[    0.017051] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
[    0.017484] ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x23fffffff]
[    0.017920] NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0x00000000-0x13fffffff]
[    0.018665] NODE_DATA(0) allocated [mem 0x13ffd5000-0x13fffffff]
[    0.019205] NODE_DATA(1) allocated [mem 0x23ffd5000-0x23fffffff]
[    0.019889] Zone ranges:
[    0.020069]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.020502]   DMA32    [mem 0x0000000001000000-0x00000000ffffffff]
[    0.020938]   Normal   [mem 0x0000000100000000-0x000000023fffffff]
[    0.021371]   Device   empty
[    0.021573] Movable zone start for each node
[    0.021875] Early memory node ranges
[    0.022125]   node   0: [mem 0x0000000000001000-0x000000000009ffff]
[    0.022564]   node   0: [mem 0x0000000000100000-0x00000000be242fff]
[    0.023010]   node   0: [mem 0x00000000be246000-0x00000000be7a2fff]
[    0.023448]   node   0: [mem 0x00000000be7a4000-0x00000000be920fff]
[    0.023889]   node   0: [mem 0x00000000beb1b000-0x00000000bfb9afff]
[    0.024327]   node   0: [mem 0x00000000bfbff000-0x00000000bff7bfff]
[    0.024768]   node   0: [mem 0x0000000100000000-0x000000013fffffff]
[    0.025207]   node   1: [mem 0x0000000140000000-0x000000023fffffff]
[    0.025647] Initmem setup node 0 [mem 0x0000000000001000-0x000000013fffffff]
[    0.026140] Initmem setup node 1 [mem 0x0000000140000000-0x000000023fffffff]
[[snip]]
usr.sbin/bhyve/bhyve.8
279

Yes, it was referring to the flag; I don't know how .Nm ended up there.

308

So currently bhyve will refuse to boot if the two values are different, but I can see that being tedious.

The mechanism you proposed is useful and I'll work it into the patch, but we still have to consider the case where both -m and individual domain sizes are passed.
I'd personally consider the sum of the individual domain sizes as the authoritative value in that case.
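
To make that intent concrete, here is a small sketch of the reconciliation under the rule described above (the sum of the per-domain sizes wins when both are given); the function name and warning text are placeholders and may not match what the reworked patch actually does.

#include <stdint.h>
#include <stdio.h>

static uint64_t
reconcile_guest_memsize(uint64_t m_flag_size, const uint64_t *domain_sizes,
    int ndomains)
{
        uint64_t total = 0;

        for (int i = 0; i < ndomains; i++)
                total += domain_sizes[i];

        if (ndomains == 0)
                return (m_flag_size);   /* no -n options were given */
        if (m_flag_size != 0 && m_flag_size != total)
                fprintf(stderr, "warning: -m size %ju differs from summed "
                    "NUMA domain sizes %ju; using %ju\n",
                    (uintmax_t)m_flag_size, (uintmax_t)total,
                    (uintmax_t)total);
        return (total);
}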

Rework and clarify interactions between the -m flag and individual domain memory sizes.

markj added inline comments.
usr.sbin/bhyve/bhyve.8
275
This revision is now accepted and ready to land. (Sun, Jul 27, 1:44 PM)