
apei: panic on uncorrectable memory errors
Accepted, Public

Authored by gallatin on Jan 9 2024, 9:38 PM.

Details

Reviewers
mav
andrew
jhb
imp
Summary

On platforms such as arm64, where APEI is the mechanism that delivers fatal ECC errors to the kernel, we do not currently panic on a fatal error. This can lead to what look like random panics. Fix this by always printing the error and panicking when a memory error is fatal.
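The intended behavior can be illustrated with a small userland sketch. This is not the code from this revision: the `sk_` names, the struct layout, and the `log_corrected` knob are hypothetical, and the severity values are assumed to mirror the CPER encoding (0 recoverable, 1 fatal, 2 corrected, 3 none). It only shows the decision being proposed: a fatal memory error is never muted and always ends in a panic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical severity codes, assumed to mirror the CPER encoding. */
enum sk_severity {
	SK_SEV_RECOVERABLE = 0,
	SK_SEV_FATAL = 1,
	SK_SEV_CORRECTED = 2,
	SK_SEV_NONE = 3
};

/* Hypothetical stand-in for one memory error section. */
struct sk_mem_error {
	uint32_t severity;	/* per-section severity */
	uint64_t phys_addr;	/* faulting physical address */
};

/* What the handler should do with the section. */
enum sk_action {
	SK_ACT_IGNORE,		/* muted, nothing printed */
	SK_ACT_LOG,		/* print the error and continue */
	SK_ACT_PANIC		/* print the error, then panic */
};

/*
 * Decide how to handle a memory error.  Fatal errors bypass any
 * muting of prints so the cause of the crash is always visible.
 */
static enum sk_action
sk_mem_error_action(const struct sk_mem_error *e, bool log_corrected)
{
	assert(e != NULL);
	if (e->severity == SK_SEV_FATAL)
		return (SK_ACT_PANIC);
	if (e->severity == SK_SEV_CORRECTED && !log_corrected)
		return (SK_ACT_IGNORE);
	return (SK_ACT_LOG);
}
```

In a kernel handler, the `SK_ACT_PANIC` case would correspond to printing the error details and then calling `panic()`; the sketch returns an action instead so it can be exercised outside the kernel.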

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

gallatin created this revision.

Removed a hunk that was Netflix-specific.

This revision is now accepted and ready to land. (Jan 9 2024, 9:46 PM)

There is already a panic in apei_ge_handler(), based on the total status severity. Do you find that insufficient?

In D43385#989059, @mav wrote:

There is already a panic in apei_ge_handler(), based on the total status severity. Do you find that insufficient?

Yes, I see what you mean. I had not noticed that; I was focused on the mem handler. However, we have a box that has been crashing because intermittent memory errors cause random kernel data corruption, and the problem has been obscured by our local patch to nerf prints (because apei was so chatty about correctable PCIe errors in the past; I see you've added a way to mute correctables, so we should move to that).

Is it possible that the firmware could set ACPI_HEST_GEN_ERROR_FATAL in ged->ErrorSeverity but not in ges->ErrorSeverity?

There is nothing to stop it, but I don't think it would be right. I don't remember the exact wording from the spec, but I think ges->ErrorSeverity should cover the worst of the ged->ErrorSeverity values.
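The concern in this exchange can be made concrete with a defensive check that does not trust the block-level severity alone. The sketch below is a userland illustration, not the logic of apei_ge_handler(): the `ex_` names are hypothetical, and the severity values are assumed to mirror the CPER encoding (0 recoverable, 1 fatal, 2 corrected, 3 none). Because those numeric values are not ordered by badness, the sketch ranks them explicitly before taking the worst of the block severity and every section severity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical severity codes, assumed to mirror the CPER encoding. */
enum {
	EX_SEV_RECOVERABLE = 0,
	EX_SEV_FATAL = 1,
	EX_SEV_CORRECTED = 2,
	EX_SEV_NONE = 3
};

/* Map a severity code to a rank where a higher rank is worse. */
static int
ex_sev_rank(uint32_t sev)
{
	switch (sev) {
	case EX_SEV_FATAL:
		return (3);
	case EX_SEV_RECOVERABLE:
		return (2);
	case EX_SEV_CORRECTED:
		return (1);
	default:
		return (0);	/* none or unknown */
	}
}

/*
 * Return true if either the block-level severity (the ges level)
 * or any per-section severity (the ged level) is fatal.
 */
static bool
ex_should_panic(uint32_t block_sev, const uint32_t *section_sevs, size_t n)
{
	int worst = ex_sev_rank(block_sev);

	for (size_t i = 0; i < n; i++) {
		int r = ex_sev_rank(section_sevs[i]);

		if (r > worst)
			worst = r;
	}
	return (worst == ex_sev_rank(EX_SEV_FATAL));
}
```

With well-behaved firmware the per-section loop is redundant, since the block severity already covers the worst section. It only changes the outcome in the case asked about above, where a section reports fatal but the block status does not.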