Page MenuHomeFreeBSD

io_apic: Don't route to APIC ID > 255
ClosedPublic

Authored by cperciva on Sat, Mar 14, 6:31 PM.
Tags
None
Referenced Files
F151028994: D55857.diff
Sun, Apr 5, 1:19 PM
Unknown Object (File)
Thu, Apr 2, 9:45 PM
Unknown Object (File)
Thu, Apr 2, 11:57 AM
Unknown Object (File)
Sat, Mar 28, 7:22 AM
Unknown Object (File)
Fri, Mar 27, 10:54 AM
Unknown Object (File)
Thu, Mar 26, 6:55 PM
Unknown Object (File)
Tue, Mar 24, 5:49 AM
Unknown Object (File)
Tue, Mar 24, 1:43 AM
Subscribers

Details

Summary

I/O APIC Redirection Table Entries use 8 bits encode the Destination ID.
Attempting to route an IRQ to a higher APIC ID would result in it being
silently routed to the value reduced modulo 256, causing a panic if the
IRQ fired since the receiving CPU would not expect that IRQ.

Instead, print an error message and mask the interrupt. Note that we
also check io_cpu in ioapic_enable_source and force the mask bit back
on despite the request to remove it.

Tested on: EC2 r8i.96xlarge
MFC after: 1 month
Sponsored by: Amazon

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 71404
Build 68287: arc lint + arc unit

Event Timeline

Add comment about keeping interrupt masked even when "enabling" it.

Did you tried enabling DMAR/AMD IOMMU and interrupt remapping on this machine?

In D55857#1278564, @kib wrote:

Did you tried enabling DMAR/AMD IOMMU and interrupt remapping on this machine?

This (virtual) machine doesn't have a proper IOMMU. See http://david.woodhou.se/15-bit-msi.pdf for context.

In D55857#1278564, @kib wrote:

Did you tried enabling DMAR/AMD IOMMU and interrupt remapping on this machine?

This (virtual) machine doesn't have a proper IOMMU. See http://david.woodhou.se/15-bit-msi.pdf for context.

sys/x86/x86/io_apic.c
272

I think kib's point here is that this will break machines that work with an IOMMU today. They will have io_cpu > IOAPIC_MAX_ID, but will have a valid low. Your test here is insufficient.

I think instead we need to add a new bool (or just another 1-bit field after io_masked) which is something like io_valid. It should be cleared to 0 in both your new case below, and in the IOMMU case in ioapic_program_intpin when remapping fails with an error other than EOPNOTSUPP. I would probably add an invalid: label at the bottom of ioapic_program_intpin` that clears this field and just sets the mask bit similar to what happens now for the IRQ_DISABLED case. Then enable_source can instead do this:

if (intpin->io_masked && intpio->io_valid)

I would also check the valid flag in ioapic_disable_source as well probably.

sys/x86/x86/io_apic.c
272

Oh, whoops. I forgot that the >255 check was only relevant when when we hit the IOMMU EOPNOTSUPP case. Yeah, sounds like we need a new io_valid flag, I'll get to work on that.

@jhb I think I've fixed this now, with a new io_valid field in D56006 being used to mark intpins as being invalid (and forcibly masked) due to having unsupported destinations.

sys/x86/x86/io_apic.c
374

I suggest to add some info which specific pin is problematic.

I still think that the error should somehow propagate up. With this series, the device is malfunctioning, with the only hint about the reason being the single cryptic line in the dmesg, which is not even attributed cleanly to the device.
It is better to fail the driver attach in this case than to offer such failure mode.

Print intpin when warning about unsupporting APIC IDs

In D55857#1283545, @kib wrote:

I still think that the error should somehow propagate up. With this series, the device is malfunctioning, with the only hint about the reason being the single cryptic line in the dmesg, which is not even attributed cleanly to the device.
It is better to fail the driver attach in this case than to offer such failure mode.

I agree in theory, but a lot of functions would need to change from (void) to (int) to do this. And there's some weird cases like what should ioapic_resume do if one of its ioapic_program_intpin calls fails. (Also, we need to program the interrupt to go nowhere rather than returning without programming it -- otherwise if an interrupt is being *re*programmed we get a panic when interrupts keep arriving at a CPU which no longer knows what to do with them.)

So... it's not an easy fix and definitely not an API change I can make quickly, and I don't want to hold up "better" just because we can't do "perfect" yet.

sys/x86/x86/io_apic.c
374

Done.

In D55857#1283545, @kib wrote:

I still think that the error should somehow propagate up. With this series, the device is malfunctioning, with the only hint about the reason being the single cryptic line in the dmesg, which is not even attributed cleanly to the device.
It is better to fail the driver attach in this case than to offer such failure mode.

I agree in theory, but a lot of functions would need to change from (void) to (int) to do this. And there's some weird cases like what should ioapic_resume do if one of its ioapic_program_intpin calls fails. (Also, we need to program the interrupt to go nowhere rather than returning without programming it -- otherwise if an interrupt is being *re*programmed we get a panic when interrupts keep arriving at a CPU which no longer knows what to do with them.)

So... it's not an easy fix and definitely not an API change I can make quickly, and I don't want to hold up "better" just because we can't do "perfect" yet.

Sometimes even a promise would do.

This revision is now accepted and ready to land.Sat, Mar 28, 5:41 AM
In D55857#1283609, @kib wrote:

So... it's not an easy fix and definitely not an API change I can make quickly, and I don't want to hold up "better" just because we can't do "perfect" yet.

Sometimes even a promise would do.

Fair enough. It's certainly something I'll try to find time to do later. I don't want to make any promises though, given how release engineering tends to take over my FreeBSD time (and with 15.1 coming up soon).

This revision was automatically updated to reflect the committed changes.