I imagine that the module would be useful only to a very limited number
of developers, so that's my excuse for not writing any documentation :)
Details
Tested on my family 10h processor.
The following settings:
$ sysctl hw.error_injection.dram_ecc
hw.error_injection.dram_ecc.inject: 0
hw.error_injection.dram_ecc.bit_mask: 8
hw.error_injection.dram_ecc.word_mask: 16
hw.error_injection.dram_ecc.quadrant: 1
Produced the following error after triggering the injection:
MCA: Bank 4, Status 0x9c0c4000a7080823
MCA: Global Cap 0x0000000000000106, Status 0x0000000000000000
MCA: Vendor "AuthenticAMD", ID 0x100f43, APIC ID 0
MCA: CPU 0 COR BUSLG Source WR Memory
MCA: Address 0x3a7654610
MCA: Misc 0xc01b0fff01000000
Post-processed with mcelog:
MCE 5
CPU 0 4 northbridge TSC 676d7e07481ba [at 3211 Mhz 6 days 13:24:22 uptime (unreliable)]
MISC c01b0fff01000000 ADDR 3a7654610
TIME 1488192936 Mon Feb 27 12:55:36 2017
  Northbridge RAM Chipkill ECC error
  Chipkill ECC syndrome = a718
       bit46 = corrected ecc error
       bit59 = misc error valid
  bus error 'local node origin, request didn't time out
      generic write mem transaction
      memory access, level generic'
STATUS 9c0c4000a7080823 MCGSTATUS 0
MCGCAP 106 APICID 0 SOCKETID 0
CPUID Vendor AMD Family 16 Model 4
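For anyone who wants to check the decoding by hand, here is a minimal C sketch that pulls the same fields out of the raw Status value. The flag bits 63..57 are the generic MCA status bits. The corrected-ECC bit (46) and the split syndrome layout (low byte in bits 54:47, high byte in bits 31:24) are my reading of the family 10h northbridge status register, so treat them as assumptions. The macro and function names are made up for illustration and are not taken from the module or from mcelog.

```c
#include <stdint.h>

/* Generic MCi_STATUS flag bits. */
#define MC_STATUS_VAL   (1ULL << 63)  /* register contents are valid */
#define MC_STATUS_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MC_STATUS_UC    (1ULL << 61)  /* uncorrected error */
#define MC_STATUS_EN    (1ULL << 60)  /* error reporting was enabled */
#define MC_STATUS_MISCV (1ULL << 59)  /* MISC register is valid */
#define MC_STATUS_ADDRV (1ULL << 58)  /* ADDR register is valid */
#define MC_STATUS_PCC   (1ULL << 57)  /* processor context corrupt */

/* Assumption: AMD northbridge bank, bit 46 = corrected ECC error. */
#define NB_STATUS_CECC  (1ULL << 46)

/* Low 16 bits: the MCA error code (0x0823 in the report above). */
static uint16_t
mca_error_code(uint64_t status)
{
	return ((uint16_t)(status & 0xffff));
}

/*
 * Assumption: the 16-bit ECC syndrome is split across the register,
 * with syndrome[7:0] in bits 54:47 and syndrome[15:8] in bits 31:24.
 */
static uint16_t
nb_ecc_syndrome(uint64_t status)
{
	uint16_t lo = (uint16_t)((status >> 47) & 0xff);
	uint16_t hi = (uint16_t)((status >> 24) & 0xff);

	return ((uint16_t)((hi << 8) | lo));
}

/* Non-zero if the status describes a valid, corrected ECC error. */
static int
nb_is_corrected_ecc(uint64_t status)
{
	return ((status & MC_STATUS_VAL) != 0 &&
	    (status & MC_STATUS_UC) == 0 &&
	    (status & NB_STATUS_CECC) != 0);
}
```

Feeding it 0x9c0c4000a7080823 gives error code 0x0823 and syndrome 0xa718, and reports a corrected ECC error, which agrees with the mcelog output.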
Diff Detail
- Repository: rS FreeBSD src repository - subversion
- Lint: Lint Passed
- Unit: No Test Coverage
- Build Status: Buildable 7874 Build 8014: arc lint + arc unit
Event Timeline
sys/dev/ecc_inject/ecc_inject.c | ||
---|---|---|
152 ↗ | (On Diff #25747) | Blank line is needed if no local vars are declared. |
159 ↗ | (On Diff #25747) | Same. |
175 ↗ | (On Diff #25747) | I am quite amazed by this approach, where you disable the cache and then allow other cores to continue with their business. For instance, I remember that the expected behaviour of lock; rmw instructions is only guaranteed when accessing WB memory. I suspect that allowing other CPUs to run while caches are disabled is too risky, and I do not see why you need that instead of parking the other cores for the duration of the op. |
212 ↗ | (On Diff #25747) | Don't you need to check the device class/id/vendor of nbdev to be sure that this is really the north bridge? Also, what happens on multi-socket machines: do they provide multiple north bridges? |
sys/modules/ecc_inject/Makefile | ||
5 ↗ | (On Diff #25747) | amd_ecc_inject or athlon_ecc_inject |
sys/dev/ecc_inject/ecc_inject.c | ||
---|---|---|
152 ↗ | (On Diff #25747) | I would prefer to not follow that rule in my new code :-) |
175 ↗ | (On Diff #25747) | Yeah, I think that it's dangerous too. Let me reproduce a lengthy quote from the document:
I guess if we block other CPUs and just sit waiting on the current CPU, then there is only a slim chance that any qualifying memory write access would happen within a reasonable time frame. That is, unless we allocate some memory (64 bytes should be sufficient) and explicitly write and read it on the current CPU. Then we probably won't need to mess with other CPUs at all. Let me see how the controlled approach works. If there's no problem with it, then it should be much better than the current approach. |
212 ↗ | (On Diff #25747) | I assumed that if we have an AMD processor from a supported family, then nothing else can use that bus/slot/function. |
sys/modules/ecc_inject/Makefile | ||
5 ↗ | (On Diff #25747) | Okay. I'd go with "amd". |
enhance the module
- allocate a page with the uncacheable PAT attribute and use it for injecting errors, instead of playing dangerously and disabling CPU caches by modifying CR0 on all CPUs
- treat the value of inject sysctl as a count of errors to inject
- add a parameter to configure a delay between errors when multiple are injected
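
The control flow of the updated approach can be sketched in plain C as follows. This is only a userland illustration of the loop shape, not the module's code. The names touch_line, busy_delay_ms and inject_many are hypothetical. In the kernel the buffer would live in the uncacheable page and each pass would be preceded by arming the injection in the northbridge, neither of which this sketch does. The 64-byte size comes from the estimate in the inline discussion above.

```c
#include <stdint.h>
#include <time.h>

#define LINE_SIZE 64	/* bytes written and read per injected error */

/*
 * Write a pattern to the buffer and read it back on the current CPU.
 * Returns 1 if every byte read back matches, 0 otherwise.
 */
static int
touch_line(volatile uint8_t *buf, uint8_t pattern)
{
	int i, ok = 1;

	for (i = 0; i < LINE_SIZE; i++)
		buf[i] = pattern;
	for (i = 0; i < LINE_SIZE; i++)
		if (buf[i] != pattern)
			ok = 0;
	return (ok);
}

/* Portable busy-wait; a kernel module would use its own delay primitive. */
static void
busy_delay_ms(unsigned ms)
{
	clock_t end = clock() + (clock_t)ms * CLOCKS_PER_SEC / 1000;

	while (clock() < end)
		;
}

/*
 * Treat 'count' as the number of errors to inject and wait 'delay_ms'
 * between consecutive passes.  Returns the number of passes whose
 * readback matched.
 */
static unsigned
inject_many(volatile uint8_t *buf, unsigned count, unsigned delay_ms)
{
	unsigned i, good = 0;

	for (i = 0; i < count; i++) {
		/* Hypothetical: the real module would arm the injection here. */
		good += (unsigned)touch_line(buf, (uint8_t)(0xa5 + i));
		if (delay_ms != 0 && i + 1 < count)
			busy_delay_ms(delay_ms);
	}
	return (good);
}
```

With a count of 0 the loop does nothing, which matches the inject sysctl reading 0 when no injection is pending.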