
Use a dedicated per-CPU stack for machine check exceptions.
Closed, Public

Authored by jhb on Jan 17 2018, 11:24 PM.

Details

Summary

Similar to NMIs, machine check exceptions can fire at any time and are
not masked by IF. This means that machine checks can fire when the
kstack is too deep to hold a trap frame, or at critical sections in
trap handlers when a user %gs is used with a kernel %cs. Use the same
strategy used for NMIs of using a dedicated per-CPU stack configured
in IST 3. Store the CPU's pcpu pointer at the top of the stack so
that the machine check handler can reliably find the proper value for
%gs (also borrowed from NMIs).
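
A minimal user-space sketch of the stack layout described above, not the committed kernel code. All names here (MC_STACK_SIZE, struct pcpu_sketch, struct mc_pcpu_slot, mc_stack_setup, mc_stack_pcpu) are hypothetical and chosen for illustration. It assumes a slot holding the pcpu pointer is carved out of the highest addresses of the stack, and that the address just below that slot is what gets programmed into the IST entry, so the hardware frame is pushed directly beneath it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stack size for the sketch; the kernel uses its own constant. */
#define MC_STACK_SIZE 4096

/* Hypothetical stand-in for the kernel's per-CPU data structure. */
struct pcpu_sketch {
	int cpuid;
};

/*
 * Slot reserved at the very top of the dedicated stack. Two 64-bit words
 * keep the resulting stack pointer 16-byte aligned.
 */
struct mc_pcpu_slot {
	uint64_t pcpu;	/* pointer to this CPU's per-CPU data */
	uint64_t pad;	/* padding for alignment only */
};

/*
 * Store the per-CPU pointer in the top slot of the stack and return the
 * address that would be written to the IST entry. The stack buffer is
 * assumed to be 16-byte aligned.
 */
static uintptr_t
mc_stack_setup(unsigned char *stack, size_t size, struct pcpu_sketch *pc)
{
	struct mc_pcpu_slot *slot;

	slot = (struct mc_pcpu_slot *)(void *)(stack + size) - 1;
	slot->pcpu = (uint64_t)(uintptr_t)pc;
	slot->pad = 0;
	return ((uintptr_t)slot);
}

/*
 * Conceptually what the handler does: read the per-CPU pointer from the
 * slot that sits just above the frame pushed by the CPU.
 */
static struct pcpu_sketch *
mc_stack_pcpu(uintptr_t ist_top)
{
	const struct mc_pcpu_slot *slot;

	slot = (const struct mc_pcpu_slot *)ist_top;
	return ((struct pcpu_sketch *)(uintptr_t)slot->pcpu);
}
```

Because the slot lives at a fixed offset from the IST value, the handler can locate its per-CPU data without trusting whatever %gs base happened to be loaded when the exception fired.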

This should also fix a similar issue with PTI with a MC# occurring while
the CPU is executing on the trampoline stack.

While here, bypass trap() entirely and just call mca_intr(). This avoids
a bogus call to kdb_reenter() (there's no reason to try to reenter kdb if
a MC# is raised).

Test Plan
  • I booted a kernel both with and without PTI, so I didn't break anything very basic, but I don't have a way to provoke an MC# to test the MC# path.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

In pmap.c:pmap_pti_init(), add the ist 3 stacks to pti user mappings.

sys/amd64/amd64/exception.S
761 ↗(On Diff #38121)

I do not think that setting PCB_FULL_IRET and storing the bases into the pcb is strictly necessary there, because the handler is not going to switch context. But keep it if that makes it simpler to keep the code similar to the NMI handler.

Tested on AMD using amd_ecc_inject and injecting an uncorrectable ECC DRAM error:

sysctl hw.error_injection.dram_ecc.bit_mask=0x11
sysctl hw.error_injection.dram_ecc.inject=1

Got this:

DRAM ECC error injection support loaded
MCA: Bank 4, Status 0xbe082000b5080823
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 UNCOR PCC BUSLG Source WR Memory
MCA: Address 0x37292000
MCA: Misc 0xc01b0fff01000000
panic: Unrecoverable machine check exception
cpuid = 0
time = 1516280809
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff81eaf740
vpanic() at vpanic+0x19c/frame 0xffffffff81eaf7c0
panic() at panic+0x43/frame 0xffffffff81eaf820
mca_intr() at mca_intr+0x9b/frame 0xffffffff81eaf840
mchk_calltrap() at mchk_calltrap+0x8/frame 0xffffffff81eaf840
--- trap 0x1c, rip = 0xffffffff810bace6, rsp = 0xfffffe0021326a50, rbp = 0xfffffe0021326a50 ---
acpi_cpu_c1() at acpi_cpu_c1+0x6/frame 0xfffffe0021326a50
acpi_cpu_idle() at acpi_cpu_idle+0x2e6/frame 0xfffffe0021326aa0
cpu_idle_acpi() at cpu_idle_acpi+0x3f/frame 0xfffffe0021326ac0
cpu_idle() at cpu_idle+0x8f/frame 0xfffffe0021326ae0
sched_idletd() at sched_idletd+0xc2/frame 0xfffffe0021326bb0
fork_exit() at fork_exit+0x84/frame 0xfffffe0021326bf0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0021326bf0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 11 tid 100003 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
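
For readers who want to check the "UNCOR PCC BUSLG" line against the raw Status value in the report above, here is a small sketch that decodes the architectural flag bits of an MCi_STATUS register. The macro, struct, and function names are chosen for this sketch and are not taken from the kernel sources; the bit positions follow the x86 machine check architecture, and the bus-error check assumes the compound error code format 0000 1PPT RRRR IILL.

```c
#include <assert.h>
#include <stdint.h>

/* Architectural flag bits in an x86 MCi_STATUS register. */
#define MC_STATUS_VAL	(1ULL << 63)	/* register contents are valid */
#define MC_STATUS_OVER	(1ULL << 62)	/* an earlier error was overwritten */
#define MC_STATUS_UC	(1ULL << 61)	/* error was not corrected */
#define MC_STATUS_EN	(1ULL << 60)	/* error reporting was enabled */
#define MC_STATUS_MISCV	(1ULL << 59)	/* MCi_MISC holds valid data */
#define MC_STATUS_ADDRV	(1ULL << 58)	/* MCi_ADDR holds a valid address */
#define MC_STATUS_PCC	(1ULL << 57)	/* processor context corrupt */

struct mc_status_flags {
	int valid, overflow, uncorrected, enabled, miscv, addrv, pcc;
	uint16_t mca_error;	/* low 16 bits: MCA error code */
	int bus_error;		/* code matches 0000 1PPT RRRR IILL */
	unsigned level;		/* LL field: 0..2 = L0..L2, 3 = generic (LG) */
};

static struct mc_status_flags
mc_status_decode(uint64_t status)
{
	struct mc_status_flags f;

	f.valid = (status & MC_STATUS_VAL) != 0;
	f.overflow = (status & MC_STATUS_OVER) != 0;
	f.uncorrected = (status & MC_STATUS_UC) != 0;
	f.enabled = (status & MC_STATUS_EN) != 0;
	f.miscv = (status & MC_STATUS_MISCV) != 0;
	f.addrv = (status & MC_STATUS_ADDRV) != 0;
	f.pcc = (status & MC_STATUS_PCC) != 0;
	f.mca_error = (uint16_t)(status & 0xffff);
	f.bus_error = (f.mca_error & 0xf800) == 0x0800;
	f.level = f.mca_error & 0x3;
	return (f);
}
```

Feeding it 0xbe082000b5080823 gives an uncorrected error with PCC set, valid address and misc registers, and error code 0x0823, which is consistent with the decoded log lines.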

A note to future self: I also had to modify register 0x44 of pci0:0:24:3 to clear bits 2 and 21, SyncFloodOnDramUcEcc and SyncFloodOnAnyUcErr respectively.
Those bits were set by the BIOS, and if either of them is left set, the system simply freezes instead of delivering the MC#.
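
The register tweak in the note above amounts to a read-modify-write that clears two bits. A sketch of just the bit arithmetic follows; the macro and function names are made up for illustration, and only the bit numbers (2 and 21) and the bit names come from the note itself.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as given in the note; macro names are illustrative only. */
#define SYNC_FLOOD_ON_DRAM_UC_ECC	(1u << 2)
#define SYNC_FLOOD_ON_ANY_UC_ERR	(1u << 21)

/* Return the register value with both sync-flood bits cleared. */
static uint32_t
clear_sync_flood_bits(uint32_t reg)
{
	return (reg & ~(SYNC_FLOOD_ON_DRAM_UC_ECC | SYNC_FLOOD_ON_ANY_UC_ERR));
}
```

The combined mask is 0x00200004. On a live system the current value would presumably be read with something like pciconf -r and the result written back with pciconf -w, but the exact invocation used for this test is not recorded here.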

sys/amd64/amd64/exception.S
571 ↗(On Diff #38121)

I think there is a bug here? Namely, if KCR3 is ~0, then we jmp to 1 which assumes that %rdi is valid, but we don't load %rdi (set to curpcb) until a few instructions below. I think this 1: label should be on the 'movq PCPU(curpcb),%rdi'? (I noticed this in the copy I had made in mchk).

761 ↗(On Diff #38121)

I can remove it easily enough. Is the same true of NMIs? I don't think those can trigger a context switch and return either?

  • Add MC# stacks to mini-kernel map for PTI.
  • Adjust 1 label so we always load curpcb.
  • Don't save FS/GS bases for MC#.
  • Remove labels not used since r190620.

Another report with a slightly more relevant stack trace.
Turns out that there is another interesting bit, NbMcaToMstCpuEn (NB machine check errors to master CPU only).

MCA: Bank 4, Status 0xbe082000b5080823
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000004
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 UNCOR PCC BUSLG Source WR Memory
MCA: Address 0x37284000
MCA: Misc 0xc01b0fff01000000
panic: Unrecoverable machine check exception
cpuid = 0
time = 1516285616
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xffffffff81eaf740
vpanic() at vpanic+0x19c/frame 0xffffffff81eaf7c0
panic() at panic+0x43/frame 0xffffffff81eaf820
mca_intr() at mca_intr+0x9b/frame 0xffffffff81eaf840
mchk_calltrap() at mchk_calltrap+0x8/frame 0xffffffff81eaf840
--- trap 0x1c, rip = 0xffffffff80f60c91, rsp = 0xfffffe0029aa4840, rbp = 0xfffffe0029aa48a0 ---
acpi_pcib_read_config() at acpi_pcib_read_config+0x1/frame 0xfffffe0029aa48a0
sysctl_root_handler_locked() at sysctl_root_handler_locked+0x7b/frame 0xfffffe0029aa48e0
sysctl_root() at sysctl_root+0x20e/frame 0xfffffe0029aa4960
userland_sysctl() at userland_sysctl+0x199/frame 0xfffffe0029aa4a10
sys___sysctl() at sys___sysctl+0x5f/frame 0xfffffe0029aa4ac0
amd64_syscall() at amd64_syscall+0x79b/frame 0xfffffe0029aa4bf0
fast_syscall_common() at fast_syscall_common+0x100/frame 0x7fffffffd1e0
KDB: enter: panic
[ thread pid 690 tid 100174 ]

The stack trace is missing the sysctl_proc_inject frame. I guess that's because the MCE was delivered before acpi_pcib_read_config updated rbp.

Initially, I was surprised that the exception was not on an access to the memory region where the error was (being) injected.
But then I checked the manual and saw that MC is an abort-type exception.

In D13962#293078, @avg wrote:

Another report with a slightly more relevant stack trace.
Turns out that there is another interesting bit, NbMcaToMstCpuEn (NB machine check errors to master CPU only).

Interesting, Intel has added a new LME flag (I have an untested patch for it) that is somewhat similar. It sends fatal MC#'s to only one logical processor in a package instead of broadcasting it to all when enabled. Intel exposes it in the MCA_CFG MSR rather than in the PCI config registers though.

Thanks for the tests though! This is more testing than I usually get to do with MC# code. I don't suppose you have a way to arrange it to be triggered while in userland by chance? (That may be harder and/or require more work to allocate a wired page that is mapped into userland and then touch it from userland after returning from the sysctl to set the trigger or some such.)

sys/amd64/amd64/exception.S
571 ↗(On Diff #38121)

I agree.

761 ↗(On Diff #38121)

The NMI handler releases the in-NMI state and enables interrupts when the interrupted thread was executing in user mode, which allows context switching, so saving the bases is needed there.

jhb marked an inline comment as done. Jan 18 2018, 7:47 PM
This revision was not accepted when it landed; it landed in state Needs Review. Jan 18 2018, 11:50 PM
This revision was automatically updated to reflect the committed changes.