Page MenuHomeFreeBSD

Dynamically allocate IRQ ranges on x86.
ClosedPublic

Authored by jhb on Aug 23 2018, 7:43 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Mar 23, 9:22 AM
Unknown Object (File)
Fri, Mar 22, 7:47 AM
Unknown Object (File)
Jan 11 2024, 12:42 PM
Unknown Object (File)
Jan 5 2024, 3:47 PM
Unknown Object (File)
Dec 22 2023, 10:18 PM
Unknown Object (File)
Dec 11 2023, 7:23 AM
Unknown Object (File)
Nov 21 2023, 6:35 AM
Unknown Object (File)
Nov 21 2023, 3:59 AM
Subscribers

Details

Summary

Previously, x86 used static ranges of IRQ values for different types
of I/O interrupts. Interrupt pins on I/O APICs and 8259A PICs used
IRQ values from 0 to 254. MSI interrupts used a compile-time-defined
range starting at 256, and Xen event channels used a
compile-time-defined range after MSI. Some recent systems have more
than 255 I/O APIC interrupt pins which resulted in those IRQ values
overflowing into the MSI range triggering an assertion failure.

Replace statically assigned ranges with dynamic ranges. Do a single
pass computing the sizes of the IRQ ranges (PICs, MSI, Xen) to
determine the total number of IRQs required. Allocate the interrupt
source and interrupt count arrays dynamically once this pass has
completed. To minimize runtime complexity these arrays are only sized
once during bootup. The PIC range is determined by the PICs present
in the system. The MSI and Xen ranges continue to use a fixed size,
though this does make it possible to turn the MSI range size into a
tunable in the future.

As a result, various places are updated to use dynamic limits instead
of constants. In addition, the vmstat(8) utility has been taught to
understand that some kernels may treat 'intrcnt' and 'intrnames' as
pointers rather than arrays when extracting interrupt stats from a
crashdump. This is determined by the presence (vs absence) of a
global 'nintrcnt' symbol.

This change reverts r189404 which worked around a buggy BIOS which enumerated
an I/O APIC twice (using the same memory mapped address for both entries but
using an IRQ base of 256 for one entry and a valid IRQ base for the second
entry). Making the "base" of MSI IRQ values dynamic avoids the panic that
r189404 worked around, and there may now be valid I/O APICs with an IRQ base
above 256 which this workaround would incorrectly skip.

If in the future the issue reported in PR 130483 reoccurs, we will have to
add a pass over the I/O APIC entries in the MADT to detect duplicates using
the memory mapped address and use some strategy to choose the "correct"
one.

While here, reserve room in intrcnts for the Hyper-V counters.

PR: 229429, 130483

Test Plan
  • booted amd64 and i386 under bhyve with APIC enabled
  • booted i386 under bhyve with APIC disabled
  • verified vmstat -i works on both old and new kernels both live and using -M /dev/mem -N /path/to/kernel

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/x86/x86/msi.c
335 ↗(On Diff #47154)

If I were to remove this max() then MSI IRQs would start lower after the last APIC IRQ (e.g. 32 in a bhyve VM). I've kept the 256 minimum for now, but it probably would be fine to remove that minimum as a followup I think.

336 ↗(On Diff #47154)

We could make num_msi_ints a variable and a tunable now after this change (and that is probably worth doing as a followup change).

Added @royger who I forgot to add initially. I only tested bare metal kernels, I have not tested the XENHVM changes. Roger, would you be able to test this under Xen?

This panics quite early with the following trace:

/boot/kernel/kernel text=0x167caa8 data=0x1cdac8+0x79c1f0 syms=[0x8+0x1803c0+0x8+0x19cd3e]
/boot/entropy size=0x1000
Booting...
GDB: no debug ports present
KDB: debugger backends: ddb
KDB: current backend: ddb
---<<BOOT>>---
Copyright (c) 1992-2018 The FreeBSD Project.
Copyright (c) 1979, 1980, 1983, 1986, 1988, 1989, 1991, 1992, 1993, 1994
	The Regents of the University of California. All rights reserved.
FreeBSD is a registered trademark of The FreeBSD Foundation.
FreeBSD 12.0-ALPHA2 #0 d016733ea(master)-dirty: Thu Aug 23 13:55:07 UTC 2018
    root@freebsd:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
FreeBSD clang version 6.0.1 (tags/RELEASE_601/final 335540) (based on LLVM 6.0.1)
WARNING: WITNESS option enabled, expect reduced performance.
VT(vga): text 80x25
XEN: Hypervisor version 4.12 detected.
CPU: Intel(R) Xeon(R) CPU E3-1230 v6 @ 3.50GHz (3504.12-MHz K8-class CPU)
  Origin="GenuineIntel"  Id=0x906e9  Family=0x6  Model=0x9e  Stepping=9
  Features=0x1fc3fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT>
  Features2=0xfffa3203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,PCID,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x121<LAHF,ABM,Prefetch>
  Structured Extended Features=0x1c6fbb<FSGSBASE,TSCADJ,BMI1,HLE,AVX2,SMEP,BMI2,ERMS,INVPCID,RTM,NFPUSG,MPX,RDSEED,ADX,SMAP>
  Structured Extended Features3=0xc000000<IBPB,STIBP>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  AMD Extended Feature Extensions ID EBX=0x1000
Hypervisor: Origin = "XenVMMXenVMM"
real memory  = 16768827392 (15992 MB)
avail memory = 16202080256 (15451 MB)
Event timer "LAPIC" quality 100
ACPI APIC Table: <Xen HVM>
WARNING: L1 data cache covers fewer APIC IDs than a core (0 < 1)
WARNING: L2 data cache covers fewer APIC IDs than a core (0 < 1)
WARNING: L3 data cache covers fewer APIC IDs than a core (0 < 1)
FreeBSD/SMP: Multiprocessor System Detected: 8 CPUs
FreeBSD/SMP: 1 package(s) x 8 cache groups x 1 core(s)
random: unblocking device.
panic: Assertion intrcnt_index < nintrcnt failed at /usr/src/sys/x86/x86/intr_machdep.c:468
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff82696900
vpanic() at vpanic+0x1a3/frame 0xffffffff82696960
panic() at panic+0x43/frame 0xffffffff826969c0
intrcnt_add() at intrcnt_add+0xc1/frame 0xffffffff826969e0
xen_intr_init() at xen_intr_init+0xd0/frame 0xffffffff82696a50
mi_startup() at mi_startup+0x118/frame 0xffffffff82696a70
btext() at btext+0x2c
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db>
sys/x86/include/intr_machdep.h
54 ↗(On Diff #47154)

extra 'the' at the end of the line.

66 ↗(On Diff #47154)

AFAICT both should be unsigned int instead of int?

sys/x86/x86/intr_machdep.c
206 ↗(On Diff #47154)

Maybe I would add a KASSERT(smp_started == 0, ...)

The following diff on top of your patch seem to fix the issue:

diff --git a/sys/x86/xen/xen_intr.c b/sys/x86/xen/xen_intr.c
index 4b64ac36b774..c7328ad71e27 100644
--- a/sys/x86/xen/xen_intr.c
+++ b/sys/x86/xen/xen_intr.c
@@ -634,17 +634,13 @@ xen_intr_init(void *dummy __unused)
 	mtx_init(&xen_intr_isrc_lock, "xen-irq-lock", NULL, MTX_DEF);
 
 	/*
-	 * Register interrupt count manually as we aren't
-	 * guaranteed to see a call to xen_intr_assign_cpu()
-	 * before our first interrupt. Also set the per-cpu
-	 * mask of CPU#0 to enable all, since by default
-	 * all event channels are bound to CPU#0.
+	 * Set the per-cpu mask of CPU#0 to enable all, since by default all
+	 * event channels are bound to CPU#0.
 	 */
 	CPU_FOREACH(i) {
 		pcpu = DPCPU_ID_PTR(i, xen_intr_pcpu);
 		memset(pcpu->evtchn_enabled, i == 0 ? ~0 : 0,
 		    sizeof(pcpu->evtchn_enabled));
-		xen_intr_intrcnt_add(i);
 	}
 
 	for (i = 0; i < nitems(s->evtchn_mask); i++)
@@ -669,6 +665,23 @@ xen_intr_init(void *dummy __unused)
 }
 SYSINIT(xen_intr_init, SI_SUB_INTR, SI_ORDER_SECOND, xen_intr_init, NULL);
 
+static void
+xen_intrcnt_init(void *dummy __unused)
+{
+	unsigned int i;
+
+	if (!xen_domain())
+		return;
+
+	/*
+	 * Register interrupt count manually as we aren't guaranteed to see a
+	 * call to xen_intr_assign_cpu() before our first interrupt.
+	 */
+	CPU_FOREACH(i)
+		xen_intr_intrcnt_add(i);
+}
+SYSINIT(xen_intrcnt_init, SI_SUB_INTR, SI_ORDER_MIDDLE, xen_intrcnt_init, NULL);
+
 void
 xen_intr_alloc_irqs(void)
 {
@@ -883,7 +896,6 @@ xen_intr_assign_cpu(struct intsrc *base_isrc, u_int apic_id)
 
 	to_cpu = apic_cpuid(apic_id);
 	vcpu_id = pcpu_find(to_cpu)->pc_vcpu_id;
-	xen_intr_intrcnt_add(to_cpu);
 
 	mtx_lock(&xen_intr_isrc_lock);
 	isrc = (struct xenisrc *)base_isrc;
@@ -1284,9 +1296,6 @@ xen_intr_bind_virq(device_t dev, u_int virq, u_int cpu,
 	struct evtchn_bind_virq bind_virq = { .virq = virq, .vcpu = vcpu_id };
 	int error;
 
-	/* Ensure the target CPU is ready to handle evtchn interrupts. */
-	xen_intr_intrcnt_add(cpu);
-
 	isrc = NULL;
 	error = HYPERVISOR_event_channel_op(EVTCHNOP_bind_virq, &bind_virq);
 	if (error != 0) {
@@ -1349,9 +1358,6 @@ xen_intr_alloc_and_bind_ipi(u_int cpu, driver_filter_t filter,
 	char name[MAXCOMLEN + 1];
 	int error;
 
-	/* Ensure the target CPU is ready to handle evtchn interrupts. */
-	xen_intr_intrcnt_add(cpu);
-
 	isrc = NULL;
 	error = HYPERVISOR_event_channel_op(EVTCHNOP_bind_ipi, &bind_ipi);
 	if (error != 0) {
  • Use a SYSINIT to add Xen per-CPU interrupt counters.
jhb marked 2 inline comments as done.
  • Fix comment and sprinkle some 'u_int'.
cem added a subscriber: cem.

LGTM with the caveat that I am mostly unfamiliar with x86 interrupt gory details. I did not look closely at the Xen bits.

sys/sys/interrupt.h
157–159 ↗(On Diff #47154)

Why not bring all arch over to a pointer representation for consistency / fewer ifdefs?

sys/x86/x86/intr_machdep.c
174 ↗(On Diff #47154)

weak preference for sizeof(*interrupt_sources)

227 ↗(On Diff #47154)

might also be worth printing num_io_irqs

451–453 ↗(On Diff #47154)

Should this be < nintrcnt?

sys/x86/x86/msi.c
336 ↗(On Diff #47154)

Perhaps tunable + sanely inferred defaults scaled on something like memory size, maybe? (Similar to nbuf?) Later follow-up ok.

This revision is now accepted and ready to land.Aug 24 2018, 6:17 AM
jhb marked 2 inline comments as done.
  • Tweaks from Conrad.
This revision now requires review to proceed.Aug 24 2018, 7:37 AM
sys/sys/interrupt.h
157–159 ↗(On Diff #47154)

It's a fair bit of work and best to let each platform decide how they want to do this. The current layout is still not great in many ways (e.g. NUMA, and it would probably be nicer to do something different like an array of structures perhaps that might include additional info about an interrupt such as which CPU it is bound to.

sys/x86/x86/intr_machdep.c
206 ↗(On Diff #47154)

Even if CPUs were started it still wouldn't really matter for the purposes of this lock as long as SYSINITs are single-threaded (only thread0 runs them). The only time we walk the pic list not from thread0 is for suspend and resume (which is also single-threaded), which is why I think removing the lock altogether as a followup is probably sensible.

451–453 ↗(On Diff #47154)

If it was + 1 instead of +2. I need 2 slots in this case (1 for stray, 1 for actual), and if intrcnt_index + 2 == nintrcnt, that is ok as I have 2 slots left (intrcnt_index and intrcnt_index + 1).

sys/x86/x86/msi.c
336 ↗(On Diff #47154)

Probably scaled by mp_ncpus more than memory size if we come up with a formula. In theory mp_ncpus * 190 is the effective max, but that is probably a bit much to reserve. The 512 limit currently there is probably targeted at 8 or 16-way systems which would be something like 'mp_ncpus * 64' or 'mp_ncpus * 32'. OTOH, on a 256-way system you probably aren't assigning interrupt to all cores for all I/O devices either, so the existing number of 512 is perhaps best for now (but we can revisit that when making this a tunable as a followup change).

Tested with Xen on amd64, no issues. Thanks.

This revision is now accepted and ready to land.Aug 24 2018, 11:21 AM

Thanks!

sys/sys/interrupt.h
157–159 ↗(On Diff #47154)

I am not proposing any major change to existing architectures. Just renaming their existing intrnames[] to something else and having the common-type *intrnames point at it so that the MI symbol has common type.

sys/x86/x86/msi.c
336 ↗(On Diff #47154)

Sure -- scaling by CPUs makes sense (although RAM will probably also scale with CPUs). The scaling does not need to be linear. Even something dumb and stepwise would probably be totally adequate.

Did you tested this with the interrupt remapping enable ? Tunables hw.dmar.enable=1, hw.dmar.ir=1, hw.dmar.dma=0.

In D16861#359810, @kib wrote:

Did you tested this with the interrupt remapping enable ? Tunables hw.dmar.enable=1, hw.dmar.ir=1, hw.dmar.dma=0.

Hmm, I'll have to find some hardware to try that. I have only tested in bhyve VMs. It seems like that only affects the (CPU, IDT) tuples that IRQs are mapped to rather than the actual layout of IRQ cookie values, so I would be surprised if it made a difference.

In D16861#360774, @jhb wrote:
In D16861#359810, @kib wrote:

Did you tested this with the interrupt remapping enable ? Tunables hw.dmar.enable=1, hw.dmar.ir=1, hw.dmar.dma=0.

Hmm, I'll have to find some hardware to try that. I have only tested in bhyve VMs. It seems like that only affects the (CPU, IDT) tuples that IRQs are mapped to rather than the actual layout of IRQ cookie values, so I would be surprised if it made a difference.

Just checked myself.

This revision was automatically updated to reflect the committed changes.