
add a module that provides support for DRAM ECC error injection on AMD CPUs
ClosedPublic

Authored by avg on Feb 27 2017, 10:56 AM.
Tags
None

Details

Summary

I imagine that the module would be useful only to a very limited number
of developers, so that's my excuse for not writing any documentation :)

Test Plan

Tested on my family 10h processor.
The following settings:

$ sysctl hw.error_injection.dram_ecc
hw.error_injection.dram_ecc.inject: 0
hw.error_injection.dram_ecc.bit_mask: 8
hw.error_injection.dram_ecc.word_mask: 16
hw.error_injection.dram_ecc.quadrant: 1

Produced the following error after triggering the injection:

MCA: Bank 4, Status 0x9c0c4000a7080823
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR BUSLG Source WR Memory
MCA: Address 0x3a7654610
MCA: Misc 0xc01b0fff01000000

Post-processed with mcelog:

MCE 5
CPU 0 4 northbridge TSC 676d7e07481ba [at 3211 Mhz 6 days 13:24:22 uptime (unreliable)]
MISC c01b0fff01000000 ADDR 3a7654610
TIME 1488192936 Mon Feb 27 12:55:36 2017
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = a718
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node origin, request didn't time out
             generic write mem transaction
             memory access, level generic'
STATUS 9c0c4000a7080823 MCGSTATUS 0
MCGCAP 106 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 16 Model 4
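
The mcelog lines above can be cross-checked by hand against the raw status word. The following is a minimal sketch, not part of the submitted module: the struct and function names are made up for illustration, and the syndrome split (upper byte in bits 31:24, lower byte in bits 54:47) is an assumption about the family 10h MC4_STATUS layout that happens to reproduce the syndrome mcelog printed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of interest from an MC4_STATUS value (illustrative names). */
struct mc4_status {
	bool	 valid;		/* bit 63: register holds a logged error */
	bool	 uncorrected;	/* bit 61: error was not corrected */
	bool	 miscv;		/* bit 59: MISC register is valid */
	bool	 addrv;		/* bit 58: ADDR register is valid */
	bool	 cecc;		/* bit 46: corrected ECC error */
	uint16_t syndrome;	/* assumed: [15:8] = bits 31:24, [7:0] = bits 54:47 */
	uint8_t	 ext_code;	/* bits 20:16: extended error code */
	uint16_t err_code;	/* bits 15:0: MCA error code */
	uint8_t	 rrrr;		/* bus error request type, bits 7:4 (2 = WR) */
	uint8_t	 ll;		/* bus error level, bits 1:0 (3 = LG) */
};

static void
decode_mc4_status(uint64_t st, struct mc4_status *out)
{

	assert(out != NULL);
	out->valid = (st >> 63) & 1;
	out->uncorrected = (st >> 61) & 1;
	out->miscv = (st >> 59) & 1;
	out->addrv = (st >> 58) & 1;
	out->cecc = (st >> 46) & 1;
	/* Reassemble the 16-bit syndrome from its two assumed halves. */
	out->syndrome = (uint16_t)((((st >> 24) & 0xff) << 8) |
	    ((st >> 47) & 0xff));
	out->ext_code = (st >> 16) & 0x1f;
	out->err_code = st & 0xffff;
	out->rrrr = (out->err_code >> 4) & 0xf;
	out->ll = out->err_code & 0x3;
}
```

Feeding it 0x9c0c4000a7080823 gives a syndrome of 0xa718 with the corrected-ECC and MISC-valid bits set, and a bus error code of 0x0823 with request type WR and level LG, which agrees with both the kernel's "COR BUSLG Source WR Memory" line and the mcelog decoding.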

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

avg retitled this revision to add a module that provides support for DRAM ECC error injection on AMD CPUs.
avg updated this object.
avg edited the test plan for this revision.
avg added reviewers: jhb, kib.
sys/dev/ecc_inject/ecc_inject.c
152 ↗(On Diff #25747)

Blank line is needed if no local vars are declared.

159 ↗(On Diff #25747)

Same.

175 ↗(On Diff #25747)

I am quite amazed by this approach, where you disable the caches and then allow other cores to continue with their business. For instance, I remember that the expected behaviour of lock; rmw instructions is only guaranteed when accessing WB memory. I suspect that allowing other CPUs to run while caches are disabled is too risky, and I do not see why you need that instead of parking the other cores for the duration of the operation.

212 ↗(On Diff #25747)

Don't you need to check the device class/id/vendor of nbdev to be sure that this really is the north bridge?

Also, what happens on multi-socket machines? Do they provide multiple north bridges?

sys/modules/ecc_inject/Makefile
5 ↗(On Diff #25747)

amd_ecc_inject or athlon_ecc_inject

sys/dev/ecc_inject/ecc_inject.c
152 ↗(On Diff #25747)

I would prefer to not follow that rule in my new code :-)

175 ↗(On Diff #25747)

Yeah, I think that it's dangerous too.
But I picked up the idea from Linux and the code seems to be written by people who at the time worked for AMD...

Let me reproduce a lengthy quote from the document:

The following can be used to trigger the injection:
• The memory address is not an explicit parameter of the error injection interface. Once the error injection registers D18F3xB8 and D18F3xBC are set, the next non-cached access of the appropriate type will trigger the mechanism and apply it to the accessed address. The access should be non-cached so that it is guaranteed to be seen by the memory controller. Possible methods to ensure a non-cached access include using the appropriate MTRR to set the memory type to UC or turning off caches. If it is important to know the address, then system activity must be quiesced so that the access can take place under careful software control. Once the error injection pattern is set in D18F3xB8 and D18F3xBC_x8:
  • Set either D18F3xBC_x8[EccWrReq] or D18F3xBC_x8[DramErrEn] to enable the triggering mechanism.
  • The next non-cached access of the appropriate type will trigger the mechanism and apply it to the accessed address.
• After the error is injected, the data must be referenced in order for the error detection to be triggered. The error address logged in MSR0000_0412 [NB Machine Check Address (MC4_ADDR)] will correspond to the cache-line quadrant that contains the error.

I guess if we block other CPUs and just sit waiting on the current CPU, then there is only a slim chance that any qualifying memory write access would happen within a reasonable time frame. That is, unless we allocate some memory (64 bytes should be sufficient) and explicitly write and read it on the current CPU. Then we probably won't need to mess with other CPUs at all.

Let me see how the controlled approach works. If there's no problem with it, then it should be much better than the current approach.

212 ↗(On Diff #25747)

I assumed that if we have an AMD processor from a supported family, then nothing else can use that bus/slot/function.
On a multi-socket machine there will be multiple NBs and multiple PCI devices representing them.
The current code can only work with the first socket. That's a limitation that can be removed later.

sys/modules/ecc_inject/Makefile
5 ↗(On Diff #25747)

Okay. I'd go with "amd".

enhance the module

  • allocate a page with the uncacheable PAT attribute and use it for injecting errors, instead of playing dangerously and disabling CPU caches by modifying CR0 on all CPUs
  • treat the value of inject sysctl as a count of errors to inject
  • add a parameter to configure a delay between errors when multiple are injected
  • rename the module to amd_ecc_inject
  • restore generation of required *_if.h headers

add amd_ecc_inject to modules/Makefile

stop writing to the injection area as soon as the injection is detected
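
The control flow described in these updates (arm the trigger, write to a dedicated uncached area, read it back, stop as soon as the error shows up, repeat for the requested count) can be sketched as a userland toy model. This is not the module's code: `struct fake_nb`, `fake_write`, `fake_read` and `inject_errors` are hypothetical stand-ins for the hardware and the driver, the `misses` field is a modelling choice to make the retry loop visible, and the configurable delay between errors is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the memory controller; every name here is hypothetical. */
struct fake_nb {
	bool	 armed;		/* models the injection trigger being set */
	unsigned misses;	/* armed writes that pass before one is hit */
	bool	 line_bad;	/* the test line holds an injected error */
	unsigned logged;	/* corrected errors "reported" so far */
};

/* A write consumes the armed trigger once no misses remain. */
static void
fake_write(struct fake_nb *nb, volatile uint64_t *line, uint64_t v)
{

	*line = v;
	if (!nb->armed)
		return;
	if (nb->misses > 0) {
		nb->misses--;
		return;
	}
	nb->armed = false;
	nb->line_bad = true;
}

/* Referencing the data is what makes the model "detect" the error. */
static uint64_t
fake_read(struct fake_nb *nb, volatile uint64_t *line)
{

	if (nb->line_bad) {
		nb->line_bad = false;
		nb->logged++;
	}
	return (*line);
}

/* Inject up to 'count' errors; return how many were actually detected. */
static unsigned
inject_errors(struct fake_nb *nb, volatile uint64_t *line, unsigned count,
    unsigned max_writes)
{
	unsigned before, done, i, w;

	assert(nb != NULL && line != NULL);
	done = 0;
	for (i = 0; i < count; i++) {
		before = nb->logged;
		nb->armed = true;
		for (w = 0; w < max_writes; w++) {
			fake_write(nb, line, w);
			(void)fake_read(nb, line);
			if (nb->logged != before) {
				/* Detected: stop touching the area. */
				done++;
				break;
			}
		}
		nb->armed = false;	/* disarm even if nothing was hit */
	}
	return (done);
}
```

Bounding the inner loop with `max_writes` is a design choice of this sketch so that a trigger that never fires cannot spin forever; whether the real module bounds its loop the same way is not stated in the review.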

This revision was automatically updated to reflect the committed changes.