
Update the I/O MMU in bhyve when PCI devices are added and removed.
Closed, Public

Authored by jhb on Aug 27 2016, 1:34 AM.

Details

Summary

When the I/O MMU is active in bhyve, all PCI devices need valid entries
in the DMAR context tables. The I/O MMU code does a single enumeration
of the available PCI devices during initialization to add all existing
devices to a domain representing the host. The ppt(4) driver then moves
pass-through devices in and out of domains for virtual machines as needed.
However, when new PCI devices were added at runtime either via SR-IOV or
HotPlug, the I/O MMU tables were not updated.

This change adds a new set of EVENTHANDLER(9) events that are invoked when PCI
devices are added and deleted. The I/O MMU driver in bhyve installs
handlers for these events and uses them to add devices to and remove
devices from the "host" domain.

Test Plan
  • Fire up a VM passing through a PF to activate the I/O MMU, then create a VF. Verify via dmardump that a context entry for the VF is created with the patch applied but was not created before.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

jhb retitled this revision to Update the I/O MMU in bhyve when PCI devices are added and removed..
jhb updated this object.
jhb edited the test plan for this revision. (Show Details)
jhb added a reviewer: grehan.
  • Document the new eventhandlers.
  • Rebase.

Other than which domain to pass, I think this looks good.

sys/amd64/vmm/io/iommu.c
162 (On Diff #19804)

Is this always in the host_domain?

Adding kib@. I think even ACPI_DMAR would need something like this to cope with newly arriving devices. One question for Konstantin is if the PCI-specific bit is too specific. In particular, there are provisions in VT-d for having "fake" RIDs for certain devices (like HPETs, etc.) that can then have corresponding context table entries. Maybe it would be better if we passed along just the 'rid' instead of the device_t? Not sure?
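
For concreteness, the RID-flavored alternative might look roughly like this (hypothetical names, not something this patch adds):

#include <sys/types.h>
#include <sys/eventhandler.h>

/*
 * Pass a bare RID instead of a device_t, so non-PCI requesters
 * (HPETs, IO-APICs, ACPI name-space devices) could share the hook.
 */
typedef void (*dmar_rid_event_fn)(void *arg, uint16_t rid);
EVENTHANDLER_DECLARE(dmar_add_rid, dmar_rid_event_fn);
EVENTHANDLER_DECLARE(dmar_delete_rid, dmar_rid_event_fn);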

share/man/man9/pci.9
152 (On Diff #19956)

Not sure if there's a good way to mark this up as a typedef. The few other manpages that try to document eventhandlers follow this style. However, it's a bit confusing to have a typedef listed next to actual function prototypes.

sys/amd64/vmm/io/iommu.c
162 (On Diff #19804)

All PCI devices initially belong to the host, yes. If you are going to do PCI pass through, you need to assign the device to the guest's domain, but that happens later. For bhyve in particular it happens when the VM starts up and the ppt devices move themselves into the guest domain. That requires the 'ppt' driver to be attached to the device, though, which is well after this callback runs when a device is first added. OTOH, if you aren't going to use the device with PCI pass through, you want it to work on the host out of the box.
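
As a sketch of how the bhyve side might consume the callback (this assumes the host_domain pointer and the iommu_add_device()/iommu_remove_device() helpers already present in sys/amd64/vmm/io/iommu.c; the handler and registration names are illustrative):

static eventhandler_tag pci_add_tag, pci_delete_tag;

static void
iommu_pci_add(void *arg, device_t dev)
{
        /* A newly arrived device starts out owned by the host domain. */
        iommu_add_device(host_domain, pci_get_rid(dev));
}

static void
iommu_pci_delete(void *arg, device_t dev)
{
        iommu_remove_device(host_domain, pci_get_rid(dev));
}

static void
iommu_pci_register_events(void)
{
        pci_add_tag = EVENTHANDLER_REGISTER(pci_add_device,
            iommu_pci_add, NULL, EVENTHANDLER_PRI_ANY);
        pci_delete_tag = EVENTHANDLER_REGISTER(pci_delete_device,
            iommu_pci_delete, NULL, EVENTHANDLER_PRI_ANY);
}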

In D7667#160855, @jhb wrote:

Adding kib@. I think even ACPI_DMAR would need something like this to cope with newly arriving devices. One question for Konstantin is if the PCI-specific bit is too specific. In particular, there are provisions in VT-d for having "fake" RIDs for certain devices (like HPETs, etc.) that can then have corresponding context table entries. Maybe it would be better if we passed along just the 'rid' instead of the device_t? Not sure?

Yes, HPETs and IO-APICs are special and their RIDs are encoded in the ACPI tables. In practice there are chipset registers, filled by the BIOS, with the values used as the initiator ID in MSI transactions. But note that those RIDs are for interrupt remapping.

The newer VT-d spec defines so-called 'ACPI Name-space Devices', which have no PCIe structure; the device scope declaration in the DMAR table directly provides RIDs for DMA requests from such devices, similar to the interrupt requests of IO-APICs and HPETs. I believe that mobile Skylake and the latest Atom platforms already utilize that.

So passing the RID, or having different interfaces (PCIe/others) for the event handlers, is reasonable. I would prefer the second approach, where a RID is only passed in the case where there is no PCIe path, so as to centralize the pcie->rid algorithm. It is quite quirky by itself, and numerous bugs in PCIe/PCI(e) bridges add even more complications.

wblock added inline comments.
share/man/man9/pci.9
932 (On Diff #19956)

s/with while/while/

In D7667#160972, @kib wrote:
In D7667#160855, @jhb wrote:

Adding kib@. I think even ACPI_DMAR would need something like this to cope with newly arriving devices. One question for Konstantin is if the PCI-specific bit is too specific. In particular, there are provisions in VT-d for having "fake" RIDs for certain devices (like HPETs, etc.) that can then have corresponding context table entries. Maybe it would be better if we passed along just the 'rid' instead of the device_t? Not sure?

Yes, HPETs and IO-APICs are special and their RIDs are encoded in the ACPI tables. In practice there are chipset registers, filled by the BIOS, with the values used as the initiator ID in MSI transactions. But note that those RIDs are for interrupt remapping.

The newer VT-d spec defines so-called 'ACPI Name-space Devices', which have no PCIe structure; the device scope declaration in the DMAR table directly provides RIDs for DMA requests from such devices, similar to the interrupt requests of IO-APICs and HPETs. I believe that mobile Skylake and the latest Atom platforms already utilize that.

So passing the RID, or having different interfaces (PCIe/others) for the event handlers, is reasonable. I would prefer the second approach, where a RID is only passed in the case where there is no PCIe path, so as to centralize the pcie->rid algorithm. It is quite quirky by itself, and numerous bugs in PCIe/PCI(e) bridges add even more complications.

Ok, I can leave this interface as-is then. Note that in the case of the bhyve iommu code, we just use pci_get_rid() on each PCI device directly. Section 3.4.1 of the VT-d spec seems to imply that this is valid for any PCIe device. IIRC, the complication you dealt with was that the effective RID for devices behind a PCIe->PCI bridge (or PCIe->ISA, etc.) is the RID of the bridge's PCIe device, not always that of the device itself.
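
For reference, the RID of an ordinary PCIe function is just its bus/slot/function triple packed into 16 bits, which is what pci_get_rid() returns when no bridge quirks apply; the helper below only illustrates the encoding and is not part of the patch:

#include <sys/types.h>

/*
 * Requester ID layout for a plain PCIe function:
 *   bits 15..8  bus
 *   bits  7..3  slot (device)
 *   bits  2..0  function
 * Functions behind a PCIe->PCI(e) bridge may instead be tagged with the
 * bridge's RID, which is the quirky case the host-side code has to handle.
 */
static inline uint16_t
pcie_rid(uint8_t bus, uint8_t slot, uint8_t func)
{
        return (((uint16_t)bus << 8) | ((slot & 0x1f) << 3) | (func & 0x7));
}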

Eventually we should talk about what it would mean to have bhyve talk to the ACPI_DMAR driver. In particular, bhyve uses a model where there is a "host" domain that all devices belong to by default. Each new VM creates a private domain, and any pass-through devices use that private domain / EPT tables (right now bhyve doesn't reuse the EPT tables, but it should). I'm curious whether we could use "untranslated" entries for devices in the "host" domain rather than requiring an explicit domain. I think bhyve might be excluding the wired memory of guests from the host domain so that devices on the host can't DMA to guests, hence the complication of a "host" domain.

jhb edited edge metadata.
  • Fix brain-o found by Warren.
jhb marked an inline comment as done. Sep 2 2016, 6:56 PM
In D7667#161072, @jhb wrote:

Eventually we should talk about what it would mean to have bhyve talk to the ACPI_DMAR driver. In particular, bhyve uses a model where there is a "host" domain that all devices belong to by default. Each new VM creates a private domain, and any pass-through devices use that private domain / EPT tables (right now bhyve doesn't reuse the EPT tables, but it should). I'm curious whether we could use "untranslated" entries for devices in the "host" domain rather than requiring an explicit domain. I think bhyve might be excluding the wired memory of guests from the host domain so that devices on the host can't DMA to guests, hence the complication of a "host" domain.

Changing bhyve to use ACPI_DMAR was my goal. In particular, I split domains from contexts and wrote dmar_move_ctx_to_domain() for that to work; see r284869. Implementing the bhyve interface on top of that functionality should not be too complicated.

But my testing environment only had a Haswell machine + PCI bridge + lem(4)-class PCI card. It was all buggy: the BIOS wrongly routed INTx interrupts from under the PCIe/PCI bridge (off by one) and Intel refused to fix the BIOS. The PCIe/PCI bridge did not report itself as PCIe from the host side (this was fixed by jah in r279117). And the final point was that Intel PCI Ethernet adapters are buggy: they read well past the descriptor rings and buffers, which causes DMAR faults. I just gave up for some time.

For the host domain, if we mark devices participating in the domain as disabled for DMAR, using DMAR_DOMAIN_IDMAP is possible, and that would eliminate all concern about DMAR slowing down small transfers due to the high setup and (lesser) teardown cost.

imp added a reviewer: imp.
This revision is now accepted and ready to land. Sep 2 2016, 7:12 PM
jhb edited edge metadata.

Rebase.

This revision now requires review to proceed. Sep 6 2016, 7:42 PM
This revision was automatically updated to reflect the committed changes.