Details

Reviewers

lprylli_netflix.com
glebius
markj
jhb

Commits

rGfda981d41491: x86: Allow MCA messages to be rate-limited
rG12a5143cfc26: x86: Allow MCA messages to be rate-limited
rG0d14826de7d9: x86: Allow MCA messages to be rate-limited

Summary

Always print the first 50 messages of each type. After that, optionally rate-limit the messages. This provides a way to limit the overhead of processing excessive messages without suppressing the first few of any type.

As part of this change, we are switching from direct printf() calls to collecting data in an sbuf(9). In POLLED mode (run from a task queue), we dynamically allocate the buffer. In the other modes (which are likely called from a hardware interrupt), we use a buffer allocated from the BSS segment and guarded by a lock. In normal operation, most calls to mca_log() should come from the POLLED mode, so there should be no contention for the new lock. If there is an interrupt storm which exceeds the capacity of the free list, there will be new contention for this lock; however, overall lock contention should still be lower than it was prior to e770e32aa3a0, when the mca_lock was held for the entirety of the mca_log() call.

This commit is partly based on a patch proposed by Loic Prylli <lprylli@netflix.com>.

Test Plan

Create fake events using the infrastructure described in D52942.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jtl created this revision.Oct 6 2025, 9:30 PM

Herald added a subscriber: imp. · View Herald TranscriptOct 6 2025, 9:30 PM

jtl requested review of this revision.Oct 6 2025, 9:30 PM

glebius accepted this revision.Oct 7 2025, 5:42 AM

This revision is now accepted and ready to land.Oct 7 2025, 5:42 AM

Fix logic in case (mode != polled && (mca_startup_done || (rec->mr_status & MC_STATUS_UC)))

This revision now requires review to proceed.Oct 7 2025, 12:56 PM

Remove unneeded check for mca_startup_done

jtl mentioned this in D12278: x86: support sending non-fatal MCA messages to syslog.Oct 7 2025, 2:38 PM

Not sure whether it needs to be addressed or not, but for completeness there is the possibility of some sequence of events occasionally emptying the mca_freelist, and then my understanding is that mca_record_entry() will work in "degraded mode" directly call mca_log() in an interrupt context (next to "MCA: Unable to allocate space for an event.\n"), and as long as that situation persists, the corresponding MCA printf() won't be ratechecked any more.

It could be a very unlikely theoretical case with the mca_postscan() changes, but with previous version of code, a couple of machines were able to trigger the "MCA: Unable to allocate space for an event" situation in netflix fleet.

In D52946#1210071, @lprylli_netflix.com wrote:

Not sure whether it needs to be addressed or not, but for completeness there is the possibility of some sequence of events occasionally emptying the mca_freelist, and then my understanding is that mca_record_entry() will work in "degraded mode" directly call mca_log() in an interrupt context (next to "MCA: Unable to allocate space for an event.\n"), and as long as that situation persists, the corresponding MCA printf() won't be ratechecked any more.

It could be a very unlikely theoretical case with the mca_postscan() changes, but with previous version of code, a couple of machines were able to trigger the "MCA: Unable to allocate space for an event" situation in netflix fleet.

I agree that we could theoretically hit that. (And, based on what you said, may even be likely to hit that in some small percentage of cases.) I think the question is whether it is worth closing that gap? My view is that the rate-limiting is a best-effort functionality which should "fail open" rather than "fail closed". Even if the limiting isn't perfect, it would still help some.

To close this robustly, we either need to accept message truncation, find a way to reliably buffer the whole message in a way that avoids allocations, or we would likely need to separate event-classification from event-printing (as your original patch did, but I wanted to avoid running duplicate logic).

@glebius : I don't love it, but... what do you think about a heap allocation of 512 bytes which we would guard through a spin lock and only use when we're in an interrupt context? Or, accepting message truncation?

Reworked patch to use a large-ish buffer allocated from BSS and protected by a lock if called from an interrupt context. As noted in the commit message, this will result in increased lock contention during an interrupt storm which exceeds the capacity of the free list; however, overall lock contention should still be lower than it was when mca_log() was called with the mca_lock held.

markj added inline comments.Oct 8 2025, 2:56 PM

sys/x86/x86/mca.c
132
133	Why not define the sysctl together with the handler?
171	If `val` is a count of microseconds, why are we setting `tv_sec` here?
494	Not directly part of your change, but, what does the condition `!tes_supported(rec->mr_mcg_cap) \|\| ((rec->mr_status & MC_STATUS_TES_STATUS) >> 53) != 0x2` mean? That's cryptic enough to deserve a comment IMO.
717	I think some brief comment explaining the intent here would be helpful.

jtl added inline comments.Oct 8 2025, 3:15 PM

sys/x86/x86/mca.c
132	This should say seconds, not microseconds. Thanks for catching it. (This also answers your question below.) A man page would also help here. :-)
133	It was a misguided attempt to keep like things together, and avoid defining a function in the middle of variables. However, looking again at the code, I think it will still be quite readable if I move this down to where the handler exists.
171	As above, I meant this to be seconds.
494	Git spelunking (ac64943ca8514382289e6054e8127726a7c40ea3) says: "Setting hw.mca.log_corrected to 0 will mute corrected errors logging except ones marked as reaching Yellow threshold by hardware." (And, indeed, these magic numbers match the magic numbers added in 3bdba24c74604b1bb27623cd8304476bbbed69d1.) Do you think I should add a comment here? Do a subsequent commit to change the magic numbers into defines? Or, ignore this for now?
717	Happy to add one. Are you okay with 50 being statically defined here? Or, do you think it should be either compile-time or run-time changeable? (Or, be a different static value?)

markj added inline comments.Oct 8 2025, 3:21 PM

sys/x86/x86/mca.c
132	A man page would be great. :)
494	I'm fine with leaving this alone for now. It'd be nice to make this clearer in a subsequent commit if you plan to keep working in this area.
717	I think having 50 be hardcoded is fine for now.

jtl marked 3 inline comments as done and an inline comment as not done.Oct 8 2025, 3:28 PM

jtl marked 4 inline comments as done.

jtl marked an inline comment as done.

jtl added inline comments.

sys/x86/x86/mca.c
132	Let me work on one separately from this review. I agree one is needed by now. :-)

updates to incorporate review feedback

Seems ok to me.

sys/x86/x86/mca.c
176	Perhaps use the logging lock to protect this update? And set CTLFLAG_MPSAFE in the sysctl flags below, otherwise this handler will be serialized by Giant.
731	This line looks like it might have incorrect whitespace.

This revision is now accepted and ready to land.Oct 8 2025, 3:40 PM

jtl added inline comments.Oct 8 2025, 3:57 PM

sys/x86/x86/mca.c
176	Thanks! I added MPSAFE in my local environment. (I thought we'd already switched to MPSAFE by default, but I see we haven't. :-) ) I purposely didn't lock the update. I didn't think it provided a lot of value since there isn't really a synchronization issue that I could see. Am I missing a synchronization corner case?