On platforms like arm64, where APEI is the mechanism that delivers fatal ECC errors to the kernel, we do not currently panic on a fatal error. This can lead to what look like random panics. Fix this by always printing the error and panicking when a memory error is fatal.
Diff Detail
- Repository: rG FreeBSD src repository
- Lint: Lint Skipped
- Unit: Tests Skipped
Event Timeline
Comment Actions
There is already a panic in apei_ge_handler(), based on the total status severity. Do you find that insufficient?
Comment Actions
Yes, I see what you mean. I had not noticed that; I was focused on the mem handler. However, we have a box that has been crashing due to intermittent memory errors causing random kernel data corruption, and the problem has been obscured by our local patch that suppresses the prints (APEI used to be very chatty about correctable PCIe errors; I see you've added a way to mute correctables, so we should move to that).
Is it possible that the firmware could set ACPI_HEST_GEN_ERROR_FATAL in ged->ErrorSeverity but not in ges->ErrorSeverity?
Comment Actions
There is nothing to stop it, but I don't think it would be right. I don't remember the exact wording from the spec, but I think ges->ErrorSeverity should cover the worst of the ged->ErrorSeverity values.
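To make the case under discussion concrete, here is a minimal userland sketch, not the driver code. The `sketch_ges` and `sketch_ged` structs and all function names are hypothetical stand-ins for the status block and its data entries. The severity values are assumed to mirror ACPICA's `ACPI_HEST_GEN_ERROR_*` encodings, where the numeric order is not the severity order. The sketch treats a report as fatal if either the status block or any entry says so, and flags a status block that understates its worst entry.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed to mirror ACPICA's ACPI_HEST_GEN_ERROR_* encodings. */
enum sketch_sev {
	SEV_RECOVERABLE = 0,
	SEV_FATAL = 1,
	SEV_CORRECTED = 2,
	SEV_NONE = 3
};

/* Hypothetical stand-in for one generic error data entry (ged). */
struct sketch_ged {
	uint32_t ErrorSeverity;
};

/* Hypothetical stand-in for the generic error status block (ges). */
struct sketch_ges {
	uint32_t ErrorSeverity;
	size_t nentries;
	const struct sketch_ged *entries;
};

/* Map an encoded severity to an ordered rank; higher is worse. */
static int
sketch_sev_rank(uint32_t sev)
{
	switch (sev) {
	case SEV_NONE:
		return (0);
	case SEV_CORRECTED:
		return (1);
	case SEV_RECOVERABLE:
		return (2);
	default:
		/* SEV_FATAL, and conservatively any unknown value. */
		return (3);
	}
}

/* Rank of the worst severity found among the data entries. */
static int
sketch_worst_entry_rank(const struct sketch_ges *ges)
{
	int worst = 0;

	for (size_t i = 0; i < ges->nentries; i++) {
		int r = sketch_sev_rank(ges->entries[i].ErrorSeverity);

		if (r > worst)
			worst = r;
	}
	return (worst);
}

/* Fatal if either the status block or any entry reports fatal. */
static bool
sketch_is_fatal(const struct sketch_ges *ges)
{
	return (sketch_sev_rank(ges->ErrorSeverity) == 3 ||
	    sketch_worst_entry_rank(ges) == 3);
}

/* True when the status severity is milder than its worst entry. */
static bool
sketch_status_understated(const struct sketch_ges *ges)
{
	return (sketch_sev_rank(ges->ErrorSeverity) <
	    sketch_worst_entry_rank(ges));
}
```

If firmware always sets the status severity to cover the worst entry, `sketch_status_understated()` never returns true and checking the status block alone is enough. The per-entry check only matters for firmware that does not.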