
amd/ucode: add a table of known patch level dependencies
Abandoned, Public

Authored by glebius on Thu, Sep 18, 8:40 PM.

Details

Summary

The newest microcode release from AMD contains patches that will fail to
load unless the already running microcode meets a certain version. When
we load microcode before the kernel, this means a panic. Augment
ucode_amd_find() with a table of known patch levels that have
dependencies. When ucode_subr.c is compiled into the kernel, never return
those patches as possible to load. When ucode_subr.c is compiled into
cpucontrol(8), disable this table, since late microcode update attempts
are protected by a trap handler and can be tried safely.
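
For illustration only, the table lookup described above could be sketched as follows. This is not the code from the diff: the struct, the function name, the runtime early_boot flag (the real change would more likely use a compile-time switch such as _KERNEL) and the patch levels in the table are all hypothetical placeholders. The actual change may also exclude listed patches outright rather than compare against the running level.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical dependency record: patch `level` only loads when the
 * running microcode is at least `min_running`.
 */
struct patch_dep {
	uint32_t level;
	uint32_t min_running;
};

/* Placeholder values for illustration only; not real AMD patch levels. */
static const struct patch_dep patch_deps[] = {
	{ .level = 0x01000020, .min_running = 0x01000010 },
};

/*
 * Return true if `candidate` may be offered for loading on top of
 * `running`.  With `early_boot` false (the cpucontrol(8) case) the
 * table is ignored, because a failed late update is caught by a trap
 * handler and is therefore safe to attempt.
 */
static bool
patch_loadable(uint32_t candidate, uint32_t running, bool early_boot)
{
	size_t i;

	if (!early_boot)
		return (true);
	for (i = 0; i < sizeof(patch_deps) / sizeof(patch_deps[0]); i++) {
		if (patch_deps[i].level == candidate &&
		    running < patch_deps[i].min_running)
			return (false);
	}
	return (true);
}
```

A caller selecting the best patch would skip any candidate for which patch_loadable() returns false and fall back to the next older matching patch, if any.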

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

glebius created this revision.
In D52605#1201632, @kib wrote:

What panic?

Basically, an unprotected wrmsr causes a general protection fault and we are done. Not really a panic(9).

The information on this particular patch level dependency was obtained from John Allen and Borislav Petkov of AMD. Note that patch 0x0aa00219 was never released publicly, so today the only solution is a BIOS update. They may come up with a chain update procedure later. Meanwhile, we need to stop bricking machines that run new EPYCs and whose owners decided to update the sysutils/cpu-microcode-amd port.

For reference, this is the crash you'd get immediately after loader:

!!!! X64 Exception Type - 0D(#GP - General Protection)  CPU Apic ID - 00000000 !!!!
ExceptionData - 0000000000000000
RIP  - FFFFFFFF809C5E46, CS  - 0000000000000038, RFLAGS - 0000000000010017
RAX  - 000000008242C000, RCX - 00000000C0010020, RDX - 00000000FFFFFFFF
RBX  - FFFFFFFF82229C80, RSP - FFFFFFFF810DBAD0, RBP - FFFFFFFF810DBB00
RSI  - 0000000000000000, RDI - FFFFFFFF8242C000
R8   - FFFFFFFF8242C000, R9  - FFFFFFFF810DBB30, R10 - FFFFFFFF81FB8000
R11  - 786F6E71FCFF6073, R12 - FFFFFFFF810DBB10, R13 - 000000000AA00215
[Wed Sep  3 19:14:53 2025]R14  - FFFFFFFF810DBB18, R15 - 0000000000000000
DS   - 0000000000000030, ES  - 0000000000000030, FS  - 0000000000000030
GS   - 0000000000000030, SS  - 0000000000000030
CR0  - 0000000080010011, CR2 - 0000000000000000, CR3 - 000000008F5AD000
CR4  - 0000000000000628, CR8 - 0000000000000000
DR0  - 0000000000000000, DR1 - 0000000000000000, DR2 - 0000000000000000
DR3  - 0000000000000000, DR6 - 00000000FFFF0FF0, DR7 - 0000000000000400
GDTR - 00000000A62F6000 0000000000000047, LDTR - 0000000000000000
IDTR - 000000009F538018 0000000000000FFF,   TR - 0000000000000000
FXSAVE_STATE - FFFFFFFF810DB730

FFFFFFFF809C5E46 points at wrmsr in the kernel text. The failed update can be safely reproduced later at runtime with cpucontrol(8).
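
As a reading aid for the dump above: wrmsr takes the MSR index from ECX and the 64-bit value from EDX:EAX. RCX holds 0xC0010020, AMD's microcode patch loader MSR, and the low halves of RDX and RAX reassemble to 0xFFFFFFFF8242C000, which matches RDI and R8 (presumably the kernel virtual address of the patch being loaded). A minimal sketch of that decoding, with hypothetical helper names:

```c
#include <stdint.h>

/* wrmsr writes EDX:EAX, so only the low 32 bits of RDX and RAX matter. */
static uint64_t
wrmsr_value(uint64_t rax, uint64_t rdx)
{
	return (((rdx & 0xffffffffULL) << 32) | (rax & 0xffffffffULL));
}

/* wrmsr selects the MSR by ECX, the low 32 bits of RCX. */
static uint32_t
wrmsr_index(uint64_t rcx)
{
	return ((uint32_t)rcx);
}
```

Feeding in RAX, RDX and RCX from the dump yields the value 0xFFFFFFFF8242C000 and the index 0xC0010020.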

Thanks for doing this.

For posterity, I'll repeat a few bits of information from conversations with you, Oliver, (and later John Allen and Borislav Petkov when I contacted them, asking for hints on why you were hitting problems).

  • The sysutils/cpu-microcode-amd port was tested before the latest update, but the issue is specific to certain CPUs (that I didn't have access to). For future updates, I'll coordinate with you first.

Meanwhile, we need to stop bricking machines that run new EPYCs and whose owners decided to update the sysutils/cpu-microcode-amd port.

  • What do you mean by "bricking" the machines? I know it's not ideal, but you should simply be able to reboot without the microcode update applied, and everything will be back to normal.
In D52605#1201655, @jrm wrote:
  • What do you mean by "bricking" the machines? I know it's not ideal, but you should simply be able to reboot without the microcode update applied, and everything will be back to normal.

Yes, "bricking" was too strong a word. Of course you can interrupt the loader and make it boot without the update.

Looks good to me. Thanks for fixing this!

This revision is now accepted and ready to land. Thu, Sep 18, 9:56 PM

I do not like the approach. The source of truth is the #GP from WRMSR, not a table that we would need to maintain in the source based on rumors or limited testing on the available models.

We need a wrmsr_safe() variant that operates in the early boot environment, before the IDT and curthread are initialized, AFAIU. I tried to prototype that, for the UEFI environment only, untested: D52607