
Update the I/O MMU in bhyve when PCI devices are added and removed.
Closed, Public

Authored by jhb on Aug 27 2016, 1:34 AM.

Details

Summary

When the I/O MMU is active in bhyve, all PCI devices need valid entries
in the DMAR context tables. The I/O MMU code does a single enumeration
of the available PCI devices during initialization to add all existing
devices to a domain representing the host. The ppt(4) driver then moves
pass-through devices in and out of domains for virtual machines as needed.
However, when new PCI devices were added at runtime either via SR-IOV or
HotPlug, the I/O MMU tables were not updated.

This change adds a new set of EVENTHANDLER(9) events that are invoked when PCI
devices are added and deleted. The I/O MMU driver in bhyve installs
handlers for these events and uses them to add devices to and remove
devices from the "host" domain.

Test Plan
  • Fire up a VM passing through a PF to activate the I/O MMU, then create a VF. Verify via dmardump that a context entry for the VF is created with the patch applied but was not created before.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

jhb retitled this revision to Update the I/O MMU in bhyve when PCI devices are added and removed..
jhb updated this object.
jhb edited the test plan for this revision. (Show Details)
jhb added a reviewer: grehan.
  • Document the new eventhandlers.
  • Rebase.

Other than which domain to pass, I think this looks good.

sys/amd64/vmm/io/iommu.c
162 (On Diff #19804)

Is this always in the host_domain?

Adding kib@. I think even ACPI_DMAR would need something like this to cope with newly arriving devices. One question for Konstantin is if the PCI-specific bit is too specific. In particular, there are provisions in VT-d for having "fake" RIDs for certain devices (like HPETs, etc.) that can then have corresponding context table entries. Maybe it would be better if we passed along just the 'rid' instead of the device_t? Not sure?
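
For concreteness, the RID-flavored alternative might look roughly like this (hypothetical names, not something this patch adds):

#include <sys/types.h>
#include <sys/eventhandler.h>

/*
 * Pass a bare RID instead of a device_t, so non-PCI requesters
 * (HPETs, IO-APICs, ACPI name-space devices) could share the hook.
 */
typedef void (*dmar_rid_event_fn)(void *arg, uint16_t rid);
EVENTHANDLER_DECLARE(dmar_add_rid, dmar_rid_event_fn);
EVENTHANDLER_DECLARE(dmar_delete_rid, dmar_rid_event_fn);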

share/man/man9/pci.9
152 (On Diff #19956)

Not sure if there's a good way to mark this up as a typedef. The few other manpages that try to document eventhandlers follow this style. However, it's a bit confusing to have a typedef listed next to actual function prototypes.

sys/amd64/vmm/io/iommu.c
162 (On Diff #19804)

All PCI devices initially belong to the host, yes. If you are going to do PCI pass through, you need to assign the device to the guest's domain, but that happens later. For bhyve in particular it happens when the VM starts up and the ppt devices move themselves into the guest domain. That requires the 'ppt' driver to be attached to the device, though, which is well after this callback runs when a device is first added. OTOH, if you aren't going to use the device with PCI pass through, you want it to work on the host out of the box.
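
As a sketch of how the bhyve side might consume the callback (this assumes the host_domain pointer and the iommu_add_device()/iommu_remove_device() helpers already present in sys/amd64/vmm/io/iommu.c; the handler and registration names are illustrative):

static eventhandler_tag pci_add_tag, pci_delete_tag;

static void
iommu_pci_add(void *arg, device_t dev)
{
        /* A newly arrived device starts out owned by the host domain. */
        iommu_add_device(host_domain, pci_get_rid(dev));
}

static void
iommu_pci_delete(void *arg, device_t dev)
{
        iommu_remove_device(host_domain, pci_get_rid(dev));
}

static void
iommu_pci_register_events(void)
{
        pci_add_tag = EVENTHANDLER_REGISTER(pci_add_device,
            iommu_pci_add, NULL, EVENTHANDLER_PRI_ANY);
        pci_delete_tag = EVENTHANDLER_REGISTER(pci_delete_device,
            iommu_pci_delete, NULL, EVENTHANDLER_PRI_ANY);
}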

In D7667#160855, @jhb wrote:

Adding kib@. I think even ACPI_DMAR would need something like this to cope with newly arriving devices. One question for Konstantin is if the PCI-specific bit is too specific. In particular, there are provisions in VT-d for having "fake" RIDs for certain devices (like HPETs, etc.) that can then have corresponding context table entries. Maybe it would be better if we passed along just the 'rid' instead of the device_t? Not sure?

Yes, HPETs and IO-APICs are special and their RIDs are encoded in the ACPI tables. In practice there are chipset registers, filled by the BIOS, with the values used as the initiator ID in MSI transactions. But note that those RIDs are for interrupt remapping.

The newer VT-d spec defines so-called 'ACPI Name-space Devices', which have no PCIe structure; the device scope declaration in the DMAR table directly provides RIDs for DMA requests from such devices, similar to the interrupt requests of IO-APICs and HPETs. I believe that mobile Skylake and the latest Atom platforms already utilize that.

So passing the RID, or having different interfaces (PCIe/others) for the event handlers, is reasonable. I would prefer the second approach, where a RID is only passed in the case where there is no PCIe path, so as to centralize the pcie->rid algorithm. It is quite quirky by itself, and numerous bugs in PCIe/PCI(e) bridges add even more complications.

wblock added inline comments.
share/man/man9/pci.9
932 (On Diff #19956)

s/with while/while/

In D7667#160972, @kib wrote:
In D7667#160855, @jhb wrote:

Adding kib@. I think even ACPI_DMAR would need something like this to cope with newly arriving devices. One question for Konstantin is if the PCI-specific bit is too specific. In particular, there are provisions in VT-d for having "fake" RIDs for certain devices (like HPETs, etc.) that can then have corresponding context table entries. Maybe it would be better if we passed along just the 'rid' instead of the device_t? Not sure?

Yes, HPETs and IO-APICs are special and their RIDs are encoded in the ACPI tables. In practice there are chipset registers, filled by the BIOS, with the values used as the initiator ID in MSI transactions. But note that those RIDs are for interrupt remapping.

The newer VT-d spec defines so-called 'ACPI Name-space Devices', which have no PCIe structure; the device scope declaration in the DMAR table directly provides RIDs for DMA requests from such devices, similar to the interrupt requests of IO-APICs and HPETs. I believe that mobile Skylake and the latest Atom platforms already utilize that.

So passing the RID, or having different interfaces (PCIe/others) for the event handlers, is reasonable. I would prefer the second approach, where a RID is only passed in the case where there is no PCIe path, so as to centralize the pcie->rid algorithm. It is quite quirky by itself, and numerous bugs in PCIe/PCI(e) bridges add even more complications.

Ok, I can leave this interface as-is then. Note that in the case of the bhyve iommu code, we just use pci_get_rid() on each PCI device directly. Section 3.4.1 of the VT-d spec seems to imply that this is valid for any PCIe device. IIRC, the complication you dealt with was that the effective RID for devices behind a PCIe->PCI bridge (or PCIe->ISA, etc.) is the RID of the bridge's PCIe device, not always that of the device itself.
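
For reference, the RID of an ordinary PCIe function is just its bus/slot/function triple packed into 16 bits, which is what pci_get_rid() returns when no bridge quirks apply; the helper below only illustrates the encoding and is not part of the patch:

#include <sys/types.h>

/*
 * Requester ID layout for a plain PCIe function:
 *   bits 15..8  bus
 *   bits  7..3  slot (device)
 *   bits  2..0  function
 * Functions behind a PCIe->PCI(e) bridge may instead be tagged with the
 * bridge's RID, which is the quirky case the host-side code has to handle.
 */
static inline uint16_t
pcie_rid(uint8_t bus, uint8_t slot, uint8_t func)
{
        return (((uint16_t)bus << 8) | ((slot & 0x1f) << 3) | (func & 0x7));
}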

Eventually we should talk about what it would mean to have bhyve talk to the ACPI_DMAR driver. In particular, bhyve uses a model where there is a "host" domain that all devices belong to by default. Each new VM creates a private domain, and any pass-through devices use that private domain / EPT tables (right now bhyve doesn't reuse the EPT tables, but it should). I'm curious whether we could use "untranslated" entries for devices in the "host" domain rather than requiring an explicit domain. I think bhyve might be excluding the wired memory of guests from the host domain so that devices on the host can't DMA to guests, hence the complication of a "host" domain.

jhb edited edge metadata.
  • Fix brain-o found by Warren.
jhb marked an inline comment as done. Sep 2 2016, 6:56 PM
In D7667#161072, @jhb wrote:

Eventually we should talk about what it would mean to have bhyve talk to the ACPI_DMAR driver. In particular, bhyve uses a model where there is a "host" domain that all devices belong to by default. Each new VM creates a private domain, and any pass-through devices use that private domain / EPT tables (right now bhyve doesn't reuse the EPT tables, but it should). I'm curious whether we could use "untranslated" entries for devices in the "host" domain rather than requiring an explicit domain. I think bhyve might be excluding the wired memory of guests from the host domain so that devices on the host can't DMA to guests, hence the complication of a "host" domain.

Changing bhyve to use ACPI_DMAR was my goal. In particular, I split domains from contexts and wrote dmar_move_ctx_to_domain() for that to work; see r284869. Implementing the bhyve interface on top of that functionality should not be too complicated.

But my testing environment only had a Haswell machine + PCI bridge + lem(4)-class PCI card. It was all buggy: the BIOS wrongly routed INTx interrupts from under the PCIe/PCI bridge (off by one) and Intel refused to fix the BIOS. The PCIe/PCI bridge did not report itself as PCIe from the host side (this was fixed by jah in r279117). And the final point was that Intel PCI Ethernet adapters are buggy: they read well past the descriptor rings and buffers, which causes DMAR faults. I just gave up for some time.

For the host domain, if we mark devices participating in the domain as disabled for DMAR, using DMAR_DOMAIN_IDMAP is possible, and that would eliminate all concern about DMAR slowing down small transfers due to the high setup and (lesser) teardown cost.

imp added a reviewer: imp.
This revision is now accepted and ready to land. Sep 2 2016, 7:12 PM
jhb edited edge metadata.

Rebase.

This revision now requires review to proceed. Sep 6 2016, 7:42 PM
This revision was automatically updated to reflect the committed changes.