Page MenuHomeFreeBSD

bhyve(8): Add VM Generation Counter ACPI device
ClosedPublic

Authored by cem on Jan 14 2020, 6:24 AM.
Tags
None
Referenced Files
Unknown Object (File)
Jan 3 2024, 9:37 PM
Unknown Object (File)
Nov 24 2023, 8:14 AM
Unknown Object (File)
Nov 14 2023, 8:59 AM
Unknown Object (File)
Nov 11 2023, 3:15 PM
Unknown Object (File)
Nov 10 2023, 8:02 PM
Unknown Object (File)
Nov 7 2023, 9:24 PM
Unknown Object (File)
Oct 10 2023, 2:13 PM
Unknown Object (File)
Oct 9 2023, 6:44 PM

Details

Summary

Add partial implementatation of the 'Virtual Machine Generation ID' spec to
Bhyve. This subset of the spec provides a randomly generated GUID (at
bhyve start) in device memory, along with an ACPI device with _CID
VM_Gen_Counter and ADDR Method returning a Package pointing at that GUID.

This implementation does not yet execute an ACPI Notify operation on the
VM_Gen_Counter device when the ID changes. I don't think Bhyve can take
snapshots and resume from them, yet, though, so perhaps that's not a major
problem at this time. I have some work in progress in this direction but it
is not functional yet.

Suggested by: rpokala

Test Plan

The implemented portions seem to work. The vmgenc(4) driver attaches the
correct ACPI device and parses out the correct GUID. E.g.,

bhyve:
(gdb) x/16bx &vmgen_id
0x46c9a0 <vmgen_id>:    0x1a    0x79    0x7f    0xf9    0x0d    0x2c    0x16    0x1d
0x46c9a8 <vmgen_id+8>:  0x51    0x2e    0x9d    0x83    0x69    0x65    0xf0    0x36
Guest:
testvm# sysctl -x dev.vmgenc.0.guid
dev.vmgenc.0.guid: Format: Length:16 Dump:0x1a797ff90d2c161d512e9d836965f036

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 28650
Build 26676: arc lint + arc unit

Event Timeline

usr.sbin/bhyve/vmgenc.c
88–102

This ends up running before PCI assigns resource addresses, so we get the very first lowmem_limit address. I don't know any reason that would cause problems for PCI (it does not appear to), but just an FYI for reviewers.

I think it would be good to document Bhyve's physical memory layout and how registered mem/fallback ranges work in a single place; I had some trouble piecing it together across multiple files and libraries.

136–139

The VM Gen. ID spec does say the device should have a unique _HID of some kind but leaves the content and form unspecified. Here we have followed the HyperV convention of a readable English named with explicit version number suffix. I picked zero because this is only a partial implementation, but the number is arbitrary. Any sane VM Gen driver will match on _CID or _DDN.

I've defined _HID as a Method returning a string, rather than a bare named string, as a hack to work around a stupid naming guideline in the iasl compiler for _HID (but not _CID) names. The guideline clearly does not matter to guests — you could say HyperV is the reference implementation of this spec, and HyperV uses a non-canonical _HID. Every operating system seems to run just fine on Azure.

(Canonically, _HID is one of three forms: AAA#### (A is any uppercase ASCII letter, # is any hexadecimal number`, NNNN#### (N is uppercase ASCII or decimal digits, # as before), or ACPI#### (here, "ACPI" is the literal characters, # as before). Obviously this would be very limiting in communicating anything meaningful about the "hardware" implementation in the _HID.)

142–149

The spec doesn't say ADDR must be a method, just that it must evaluate to a Package of two integers; that means it could just be a named Package object.
So more simply, we could instead specify it as:

Name (ADDR, Package (0x02)
{
  0x%08x,
  0x%08x
})

That would be tolerable for our vmgenc(4) and any spec-compliant guest implementation, but I haven't audited operating systems explicitly (Linux would be easy to check; Windows might be impossible). Leaving it as a Method is also, IMO, harmless.

So, bhyve literally generates the ACPI source text, and then does system(iasl) as part of startup? "Neat!"

usr.sbin/bhyve/acpi.c
36

Since you're in here: "... and then compiling them ..."

usr.sbin/bhyve/vmgenc.c
84

Since this is going to be an address, shouldn't it be uint64_t?

A note for future work: with Windows, the ACPI tables are generated from within UEFI. You would need either a tag on the UUID so EFI could search for it (like it does with the SMBIOS tables), or use the fw i/f to allow UEFI to query the address of the UUID. I can help with that.

usr.sbin/bhyve/vmgenc.c
88–102

For PCI it does take away a 1GB-aligned region, and won't allow 512MB BARs to be used. This may not be that much of an issue, but I think it could be avoided by putting the UUID into high ROM space. There is 16MB available for this and EFI only uses 2MB max.

  • Implement GPE bits, discussed with jhb. (Unused because we don't have snapshots.)
  • Simplify ADDR.
cem marked 2 inline comments as done.Jan 15 2020, 6:49 AM

Thanks everybody for the quick feedback.

So, bhyve literally generates the ACPI source text, and then does system(iasl) as part of startup? "Neat!"

Yeah, I first learned of this horror when an ACPICA update broke Bhyve because the syntax the compiler recognized for FADT or similar changed. It's... not as crazy as it might seem. It might be preferable if the compiler was linked in instead of system()'d, but whatever.

A note for future work: with Windows, the ACPI tables are generated from within UEFI.

I don't understand how to parse that sentence (or how it would work). Is this just a function of our EFI stack being a borrowed blob from TianoCore EDK II?

You would need either a tag on the UUID so EFI could search for it (like it does with the SMBIOS tables), or use the fw i/f to allow UEFI to query the address of the UUID. I can help with that.

Neither of those are how the protocol is specified, though. Compliant guest implementations aren't going to scan memory like SMBIOS, etc. I'd love to support EFI eventually but don't understand the limitations here. I'd prefer to leave EFI support for a future revision, if that's alright with you. If necessary, this revision could make the device conditional on !efi.

Thanks!

usr.sbin/bhyve/acpi.c
36

That's far enough away from anything I'm touching I'm inclined to leave it be in this revision, but feel free to do a drive-by typo fix in the comment if you feel strongly about it :-).

usr.sbin/bhyve/vmgenc.c
84

No -- we want to allocate in the PCI hole, which is by definition below 4GB.

88–102

That's a good consideration, thanks.

Do we have any existing scheme for allocating ROM space, or is it just ad-hoc? Also, is the ROM region required to be mapped UC? The spec suggests the UUID should not share a page with anything requiring UC mapping, although it's unclear to me why that would possibly matter.

One other possibility would just be to allocate the UUID after the PCI buses in the PCI hole. Usually there is plenty of space there, unless a guest needs very large 32-bit BARs. But I like the suggestion to use ROM space for this instead. The important part is that the UUID doesn't end up on a page included as ordinary "Memory" in SMAP/e820.

Yeah, I first learned of this horror when an ACPICA update broke Bhyve because the syntax the compiler recognized for FADT or similar changed. It's... not as crazy as it might seem. It might be preferable if the compiler was linked in instead of system()'d, but whatever.

While the exec is not pretty nor fast, for the DSDT the alternative is to write an AML byte-code emitter that is guaranteed to be correct. Compiling via iasl at least guarantees that the byte code is as correct as the compiler.

I don't understand how to parse that sentence (or how it would work). Is this just a function of our EFI stack being a borrowed blob from TianoCore EDK II?

It's as it says. The bhyve/UEFI firmware image generates ACPI tables at runtime. The bhyve ones should be disabled (by not using the -A option), but they are ignored in any case since UEFI controls the RSDT/XSDT pointer values.

It's not a borrowed blob - the image is built from source at https://github.com/freebsd/uefi-edk2.

As to whether the ACPI tables should be generated in the hypervisor or guest f/w is a vexing question. Qemu now generates the DSDT in the hypervisor and pushes it through to their UEFI image via the fwcfg interface.

usr.sbin/bhyve/vmgenc.c
88–102

ROM space allocation is ad-hoc.

I suspect it is required to be mapped UC since at least on real h/w it's outside the coherency domain. But, as you mentioned this does seem a strange requirement and may not be enforced.

usr.sbin/bhyve/vmgenc.c
84

Fair enough. A comment to that effect would be nice. :-)

cem marked 4 inline comments as done.

Rebase on D24422 abstraction to locate the generation counter in BIOS memory.

Correct GPE0 block bit width, which acpica now warns about. The field is in
units of bits, and includes both the STS and EN bytes.

  • Trivial update to add new 'flag' parameter to bootrom_alloc() call.
This revision was not accepted when it landed; it landed in state Needs Review.Apr 15 2020, 2:00 AM
This revision was automatically updated to reflect the committed changes.
head/usr.sbin/bhyve/vmgenc.c
15 ↗(On Diff #70588)

You might want to remove NETAPP from the copyrights :)

cem marked an inline comment as done.Apr 15 2020, 5:56 AM
cem added inline comments.
head/usr.sbin/bhyve/vmgenc.c
15 ↗(On Diff #70588)

Ah, thanks. rS359959

It might be basically harmless when virtualizing Linux or FreeBSD as a guest, but Windows Server uses the VM-Generation-ID when performing a USN Rollback (https://docs.microsoft.com/en-us/windows-server/identity/ad-ds/introduction-to-active-directory-domain-services-ad-ds-virtualization-level-100).

In Proxmox (Debian-based virtualization platform), a VM-Generation-ID is generated and stored in the VM config file that is used to start kvm/qemu. You can generate a new random VM-Generation-ID from their web interface and that will tell Active Directory that it may have been restored to a previous point in time and that data needs to be 'replayed' into Directory Services. In my environment (and probably many others), not having a VM-Generation-ID is inconvenient, but having one that is generated randomly on every restart would be detrimental.