Page MenuHomeFreeBSD

Remap interrupts using VT-d hardware; allow 32bit x2apic apic_ids to be used for interrupt destinations.
ClosedPublic

Authored by kib on Feb 18 2015, 9:26 AM.

Details

Reviewers
grehan
neel
jhb
Summary

The patch utilizes VT-d interrupt remapping block (IR) to perform FSB messages translation. One consequence is that even despite io apics only take 8bit apic id, IR translation structures accept 32bit apic id, which allows x2apic mode to function properly.

For each msi and ioapic interrupt source, the iommu cookie is added, which is in fact index of the IRE (interrupt remap entry) in the IR table. Cookie is allocated at the source allocation time, and then used at the map time to fill both IRE and device registers. The MSI address/data registers and ioapic redirection registers are programmed with the special values which are recognized by IR and used to restore the IRE index, to find proper delivery mode and target. I have to map all MSI interrupts in the block when msi_map() is called.

KPI of IR is isolated into the x86/iommu/iommu_intrmap.h, to avoid bringing all dmar headers into interrupt code. The non-PCI(e) devices which generate message interrupts on FSB require special handling. The HPET FSB interrupts are remapped, while DMAR interrupts are not. I need to detect device class of the outstanders to correctly handle originators and mapping, this is why hpet and dmar device classes become global vars.

The problem with the patch is that interrupt source setup and dismantle code are done in the non-sleepable context. Sometimes the context even disallows blocking (why icu_lock is spinlock ?). Even if I manage to ensure that no memory allocation failure happens, I still need to flush interrupt entries cache in the IR hardware, which is done async and ideally waits for the interrupt. Instead, I busy-wait for queue to drain.

Another issue which I see is with intr_shuffle_irqs() is non-atomic. When bus_remap_intr() is called, the devices reload MSI/MSIX(-like) config registers, and must disable interrupts around writes, since more that one reg is written. This could be eliminated at least on amd64, since remapping effectively only changes IR table, MSI config is constant unless iommu interrupt cookie is changed. We could eliminate the calls to bus_remap_intr() at all, but this is currently not done.

Test Plan

The patch below was booted multiuser on desktop haswell-class machine, ivy laptop and sandybridge two-socket xeon, also on the Core2 two-socket/5400 box. The later machine has ahci controller which has MSI block of 16 interrupts, so MSI (without -X) was tested as well.

For review, please note that the patch naturally splits into changes to the interrupt code, and DMAR IR support. I would highly appreaciate a review of the code outside x86/iommu, and definitely will understand if you do not want to read iommy changes proper.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

kib updated this revision to Diff 3836.Feb 18 2015, 9:26 AM
kib retitled this revision from to Remap interrupts using VT-d hardware; allow 32bit x2apic apic_ids to be used for interrupt destinations..
kib updated this object.
kib edited the test plan for this revision. (Show Details)
kib added a reviewer: jhb.
kib set the repository for this revision to rS FreeBSD src repository.
kib added a subscriber: emaste.
jhb edited edge metadata.Feb 24 2015, 3:36 PM

To answer one question before I review: icu_lock is a spin lock so we can mask and EOI interrupts before scheduling the ithread in "filter" context. For that reason it has to be a spin lock.

jhb added a comment.Feb 24 2015, 3:59 PM

Hmm, without context I can't easily see if the iommu code is being called from the callbacks used by ithreads to mask and EOI sources. MSI doesn't really do either of those, but level-triggered I/O APIC interrupts would.

sys/dev/acpica/acpi_hpet.c
69 ↗(On Diff #3836)

You can actually just do 'devclass_find("hpet")' instead of exporting this as a global (and that is the more new-bus way to do this. I actually want to change DRIVER_MODULE to not require a devclass pointer as it's just a waste and never used 99% of the time. It is a holdover from pre-new-bus when devclasses were not managed as a dictionary.)

kib added a comment.Feb 24 2015, 4:57 PM
In D1892#7, @jhb wrote:

Hmm, without context I can't easily see if the iommu code is being called from the callbacks used by ithreads to mask and EOI sources. MSI doesn't really do either of those, but level-triggered I/O APIC interrupts would.

You mean the intr_event_create() args ?

error = intr_event_create(&isrc->is_event, isrc, 0, vector,

	    intr_disable_src, (mask_fn)isrc->is_pic->pic_enable_source,
	    (mask_fn)isrc->is_pic->pic_eoi_source, intr_assign_cpu, "irq%d:",
	    vector);

The ioapic_enable_source() and ioapic_disable_source() do not call the iommu, and I do not see why would they need to. The enable/disable is done by manipulating the IOART_INTMASK in the low register of the ioapic redirection table entry. IOMMU mapping is not changed. The only requirement for IOMMU is to have vectors to be programmed consistently in the low register and in the actual IRE, which is needed for EOI to propagate properly. This is probably not strictly needed in the EOI supression mode, when EOI is not passed from APIC to IOAPIC.

sys/dev/acpica/acpi_hpet.c
69 ↗(On Diff #3836)

Done this both for hpet and for dmar. This allows me to keep the dmar class static as well.

kib added a comment.Feb 24 2015, 4:58 PM

I did not uploaded new diff for devclass_find(), it is trivial change. I will do after more changes are done.

kib updated this revision to Diff 4110.Mar 5 2015, 12:51 PM
kib edited edge metadata.

This is the minor update to revive the process. The hpet_devclass was staticized back. I also added full content to diff to ease reading of msi.c and io_apic.c changes.

jhb added a comment.Mar 6 2015, 6:30 PM

In general I think this is fine. Perhaps we should push the icu_lock down inside the enable/disable_intr callbacks someday.

sys/x86/x86/msi.c
548

Would it help if we kept a linked list of an MSI "group"? The "first" MSI would have a SLIST_HEAD and the others would be linked in order perhaps? That would make this loop simpler and perhaps easier to read?

kib added a comment.Mar 6 2015, 6:54 PM
In D1892#12, @jhb wrote:

In general I think this is fine. Perhaps we should push the icu_lock down inside the enable/disable_intr callbacks someday.

The thing which worries me somewhat is the msi_lock drop in the msi_alloc() and msi_release() around calls to iommu. I convinced myself that this is fine, since e.g. in msi_alloc(), msi->msi_dev is already filled, so the source cannot be reused. Any other confict should be due to the driver bug. Similar note for io_apic.c and icu_lock, but there I believe the reasoning is even simpler, since icu_lock is always immediately dropped in the caller after re-acquisition.

The only non-completeness in this code I am aware of is that for io-apics which are enumerated as pcie devices, the requester id for interrupt messages is their pcie rid, but I currently do not establish the correspondence. I only found mention of pcie io-apics in the datasheets for really old P4-era chipsets, which do not have VT-d.

sys/x86/x86/msi.c
548

Sure, this would eliminate the loop over the whole set of msi interrupt sources.

On the other hand, this would trade CPU time for memory. The interrup setup is performed once usually, while +8 bytes per msi_intsrc are consumed for the whole duration of the device life. This is small amount, but is it worth that ?

jhb added a comment.Mar 6 2015, 7:36 PM

It's fine to leave the loop. I'm not exactly sure what you mean by the I/O APICs in PCI comment. Perhaps I should go look at the relevant spec first? (Is it the x2APIC spec or something else?)

kib added a comment.EditedMar 6 2015, 7:50 PM
In D1892#14, @jhb wrote:

I'm not exactly sure what you mean by the I/O APICs in PCI comment. Perhaps I should go look at the relevant spec first? (Is it the x2APIC spec or something else?)

From "Intel® Virtualization Technology for Directed I/O. Architecture Specification" rev 2.3 section 5.1.1 Identifying Origination of Interrupt Requests:

· Legacy pin interrupts

  • For devices that use legacy methods for interrupt routing (such as either through direct wiring to the I/OxAPIC input pins, or through INTx messages), the I/OxAPIC hardware generates the interrupt-request transaction. To identify the source of interrupt requests generated by I/OxAPICs, the interrupt-remapping hardware requires each I/OxAPIC in the platform (enumerated through the ACPI Multiple APIC Descriptor Tables (MADT)) to include a unique 16-bit source-id in its requests. BIOS reports the source-id for these I/OxAPICs via ACPI structures to system software. Refer to Section 8.3.1.1 for more details on I/OxAPIC identity reporting.

Section 8.3.1.1 Reporting Scope for I/OxAPICs:

For I/OxAPICs that are PCI-discoverable, the source-id for such I/OxAPICs (computed using the
above pseudocode from its Device Scope structure) must match its PCI requester-id effective at
the time of boot.

Basically, the interrupt remapping tables are indexed by the rid of the TLP initiator for the interrupt message. IOMMU code must know correct rid to fill the right table with the IRE (interrupt remap entry). Due to the citations above, code must know pcie bus/dev/fun for io-apics enumerated through the pcie bus.

For io-apics which are _not_ enumerable through pcie, there is secret register which is programmed by BIOS with the impossible rid, which does not conflict with any existing or hot-plugged devices. Its value is used as the rid in the interrupt messsages sent from io-apic to apic. This case is handled, this is the only relevant configuration for the modern (e.g. core2 and newer) hardware.

I cited the section numbers of the document so that you could quickly find the relevant text without searching the whole spec, but there is more context around the citations which might be interesting.

neel edited edge metadata.Mar 17 2015, 12:40 AM

Apologies for the delayed review. I have some comments inline.

sys/x86/iommu/intel_drv.c
817

Switch arguments to match the printf string.

sys/x86/iommu/intel_intrmap.c
85

Perhaps add a KASSERT(error != EOPNOTSUPP) since this errno value is interpreted specially by the caller.

133

(cookie & 0x8000) << 2

168

Perhaps add a KASSERT(error != EOPNOTSUPP) since this errno value is interpreted specially by the caller.

199

(idx & 0x8000) != 0 ? (1 << 11) : 0

328

I am a bit confused by the use of INTRCNT_COUNT to compute irte_cnt. I would have expected that NUM_IO_INTS would be used here instead.

Am I missing something?

sys/x86/iommu/intel_qi.c
289

start + cnt <= unit->irte_cnt

sys/x86/iommu/intel_reg.h
106

Typo: missing the closing ")"

kib added inline comments.Mar 17 2015, 11:39 AM
sys/x86/iommu/intel_intrmap.c
328

No, I agree that NUM_IO_INTS is enough.

kib updated this revision to Diff 4253.Mar 17 2015, 11:40 AM
kib edited edge metadata.

Neel, thank you, I made all changes you suggested.

neel accepted this revision.Mar 17 2015, 4:23 PM
neel edited edge metadata.
This revision is now accepted and ready to land.Mar 17 2015, 4:23 PM
kib closed this revision.Mar 19 2015, 1:59 PM