MCA: add AMD Error Thresholding support
ClosedPublic

Authored by avg on Feb 15 2017, 2:34 PM.

Details

Summary

Currently the feature is implemented only for a subset of errors
reported via Bank 4. The subset includes only DRAM-related errors.

The new code builds upon and reuses the Intel CMC (Correctable MCE
Counters) support code. However, the AMD feature is quite different
and, unfortunately, much less regular.

For reference, please see the AMD BKDGs for families 10h - 16h.
Specifically, see MSR0000_0413 NB Machine Check Misc (Thresholding)
Register (MC4_MISC0).
http://developer.amd.com/resources/developer-guides-manuals/

Test Plan

Tested with a family 10h processor on a system where correctable
ECC errors occur semi-regularly, sometimes in bursts.

Here is an example of three consecutive MCE reports:

MCA: Bank 4, Status 0xdc544100e0080a13
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR OVER BUSLG Responder RD Memory
MCA: Address 0x216a6f1d0
MCA: Misc 0xc01b0fff01000000
MCA: Bank 4, Status 0xdc544100e0080a13
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR OVER BUSLG Responder RD Memory
MCA: Address 0x216be3040
MCA: Misc 0xc01b0fff01000000
MCA: Bank 4, Status 0xdc544100e0080a13
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR OVER BUSLG Responder RD Memory
MCA: Address 0x217b3c720
MCA: Misc 0xc01a0ffb01000000

The first two were triggered by the interrupt; the last one is
from the periodic polling. Note the values in Misc.
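
To make the difference in the Misc values easier to see, here is a minimal decoding sketch. It is not the code from this revision. The field positions are my reading of the MC4_MISC0 layout in the family 10h BKDG and should be checked against the BKDG for the family in question; the macro, struct, and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed MC4_MISC0 (MSR0000_0413) field layout; verify against the BKDG.
 * All names below are illustrative, not taken from the FreeBSD sources.
 */
#define MISC_VALID      (1ULL << 63)    /* counter presence valid */
#define MISC_CNTP       (1ULL << 62)    /* counter present */
#define MISC_LOCKED     (1ULL << 61)    /* register locked */
#define MISC_LVT_SHIFT  52              /* extended LVT offset, 4 bits */
#define MISC_LVT_MASK   (0xfULL << MISC_LVT_SHIFT)
#define MISC_CNTEN      (1ULL << 51)    /* counter enabled */
#define MISC_INT_SHIFT  49              /* interrupt type, 2 bits */
#define MISC_INT_MASK   (0x3ULL << MISC_INT_SHIFT)
#define MISC_OVERFLOW   (1ULL << 48)    /* counter overflowed */
#define MISC_CNT_SHIFT  32              /* error counter, 12 bits */
#define MISC_CNT_MASK   (0xfffULL << MISC_CNT_SHIFT)

struct misc_fields {
	bool		valid;
	bool		counter_present;
	bool		locked;
	unsigned	lvt_offset;
	bool		counter_enabled;
	unsigned	int_type;
	bool		overflow;
	unsigned	err_count;
};

/* Split a raw Misc value into the fields defined above. */
struct misc_fields
decode_misc(uint64_t misc)
{
	struct misc_fields f;

	f.valid = (misc & MISC_VALID) != 0;
	f.counter_present = (misc & MISC_CNTP) != 0;
	f.locked = (misc & MISC_LOCKED) != 0;
	f.lvt_offset = (unsigned)((misc & MISC_LVT_MASK) >> MISC_LVT_SHIFT);
	f.counter_enabled = (misc & MISC_CNTEN) != 0;
	f.int_type = (unsigned)((misc & MISC_INT_MASK) >> MISC_INT_SHIFT);
	f.overflow = (misc & MISC_OVERFLOW) != 0;
	f.err_count = (unsigned)((misc & MISC_CNT_MASK) >> MISC_CNT_SHIFT);
	return (f);
}
```

Under that assumed layout, 0xc01b0fff01000000 (the first two reports) decodes to overflow set with the counter at 0xfff, while 0xc01a0ffb01000000 (the third report) decodes to overflow clear with the counter at 0xffb.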

Diff Detail

Repository: rS FreeBSD src repository - subversion
Lint: Not Applicable
Unit Tests: Not Applicable

Event Timeline

avg retitled this revision to "MCA: add AMD Error Thresholding support".
avg updated this object.
avg edited the test plan for this revision.
avg added reviewers: jhb, kib, markj, marius.

Aside from the comment about resume, I think this looks fine.

sys/x86/x86/mca.c
1005 ↗(On Diff #25217)

I think this should be the lapic's job, but you need to ensure that lapic_resume() will DTRT. I think that's something to be fixed in the other review that added the lapic ELVT support.

For the normal LVT entries we keep an lvt[] array in each software lapic structure. I think you should do the same: have an elvt[] array with the relevant fields, have the enable_mca_lvt code modify the elvt[] array, and rely on lapic_setup() to actually program the ELVTs. That will then handle resume correctly.
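
A rough user-space sketch of the shadow-array pattern described above, to show why it handles resume. Every name here is hypothetical, the "hardware" is a plain array standing in for the real registers, and the mask bit position is an assumption; this is not the actual lapic code.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_ELVT	4		/* hypothetical number of ELVT entries */
#define ELVT_MASKED	0x00010000u	/* assumed mask bit, as in ordinary LVTs */

/* Stand-in for the hardware ELVT registers so the sketch can run anywhere. */
uint32_t fake_hw_elvt[NUM_ELVT];

struct elvt_entry {
	bool	active;		/* entry is in use */
	bool	masked;		/* entry should be programmed as masked */
	uint8_t	vector;		/* interrupt vector to deliver */
};

struct soft_lapic {
	struct elvt_entry elvt[NUM_ELVT];	/* desired state, kept in software */
};

/* Record the desired configuration only; the hardware is not touched here. */
int
soft_enable_mca_elvt(struct soft_lapic *la, unsigned idx, uint8_t vector)
{
	if (idx >= NUM_ELVT)
		return (-1);
	la->elvt[idx].active = true;
	la->elvt[idx].masked = false;
	la->elvt[idx].vector = vector;
	return (0);
}

/* Program every entry from the software state; used at boot and on resume. */
void
soft_lapic_setup(const struct soft_lapic *la)
{
	for (unsigned i = 0; i < NUM_ELVT; i++) {
		const struct elvt_entry *e = &la->elvt[i];
		uint32_t v = ELVT_MASKED;

		if (e->active)
			v = e->vector | (e->masked ? ELVT_MASKED : 0);
		fake_hw_elvt[i] = v;
	}
}

/* Simulate the register contents being lost across suspend. */
void
fake_suspend_reset(void)
{
	for (unsigned i = 0; i < NUM_ELVT; i++)
		fake_hw_elvt[i] = ELVT_MASKED;
}
```

Because the enable routine only edits the software copy, calling the setup routine again after the registers are reset restores the same configuration without the MCA code needing its own resume hook.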

jhb edited edge metadata.
This revision is now accepted and ready to land. (Feb 16 2017, 6:09 PM)
avg edited edge metadata.

rebase

This revision now requires review to proceed. (Feb 28 2017, 7:52 PM)
avg edited edge metadata.

enhance amd mce thresholding

  • better name for MC_MISC_AMDNB_OVERFLOW, no need to save one character
  • don't worry about lapic configuration on resume, it's handled by lapic itself
  • extract common code shared between initial setup and resume
jhb edited edge metadata.
This revision is now accepted and ready to land. (Feb 28 2017, 9:01 PM)
This revision was automatically updated to reflect the committed changes.