
x86 MCA: Fix a deadlock in MCA exception processing
Closed, Public

Authored by cem on Apr 28 2017, 5:30 PM.
Details

Summary

In exceptional circumstances, an MCA exception will trigger when the
freelist is exhausted. In such a case, no error will be logged on the
list and 'mca_count' will not be incremented.

Prior to this patch, all CPUs that received the exception would spin
forever.

With this change, the CPU that detects the error but finds the freelist
empty will proceed to panic the machine, ending the deadlock.

A follow-up to r260457.
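The changed control flow can be modeled in a few lines. The sketch below is a hypothetical, simplified model (the enum and function names are invented for illustration; the real logic lives in mca_intr() in sys/x86/x86/mca.c): "recoverable" is the verdict from scanning the MC banks, forced false when MCG_STATUS.RIPV is clear, and "count" is the number of errors this CPU found during its own scan.

```c
#include <stdbool.h>

/*
 * Hypothetical model of the decision at the end of the MC# handler
 * after this change.  Before the patch, a CPU that hit an
 * unrecoverable error with the freelist exhausted logged nothing and
 * bumped no counter, so every CPU spun forever waiting for the global
 * count to grow.  Now a CPU that found an error itself panics
 * directly, ending the deadlock.
 */
enum mca_action {
	MCA_RESUME,	/* recoverable: return from the exception */
	MCA_PANIC,	/* this CPU found the error: panic now */
	MCA_SPIN	/* found nothing: defer to another CPU's panic */
};

static enum mca_action
mca_intr_decision(bool recoverable, int count)
{
	if (recoverable)
		return (MCA_RESUME);
	/*
	 * Unrecoverable.  If this CPU found at least one error during
	 * its scan, panic with that record in hand; otherwise spin and
	 * let the CPU holding the record panic with full context.
	 */
	if (count > 0)
		return (MCA_PANIC);
	return (MCA_SPIN);
}
```

In this model, the deadlock fix is the MCA_PANIC arm: the detecting CPU no longer waits on a counter that nobody will ever increment.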


Event Timeline

sys/x86/x86/mca.c
656

I have a slight preference for 'recoverablep' as I feel that is what I most often see in our tree?

1181

I would just remove 'old_count' entirely and do 'while (count == 0) cpu_spinwait()'

sys/x86/x86/mca.c
656

No objection from me. I'll change it.

1181

Makes sense.

Rename to recoverablep.
Drop old_count entirely.

cem marked 4 inline comments as done. Apr 28 2017, 5:39 PM

Would be good to test this on known-broken hardware if possible.

sys/x86/x86/mca.c
1176

I suppose this comment needs updating slightly to say "only panic if this CPU logged an error" in the first sentence.

This revision is now accepted and ready to land. Apr 28 2017, 5:51 PM
cem edited edge metadata.

Update comment to reflect reality

This revision now requires review to proceed. Apr 28 2017, 5:58 PM
This revision is now accepted and ready to land. Apr 28 2017, 6:11 PM
This revision was automatically updated to reflect the committed changes.

It seems that now in mca_scan, !recoverable => count > 0 and count == 0 => recoverable. So now the only way we enter the infinite loop (if (!recoverable) while (count == 0) cpu_spinwait();) is via the !MCG_STATUS_RIPV condition in mca_intr. Am I understanding that right, is that the intention?

It seems to me we would only want to enter an infinite loop if we "know" that another thread is going to panic for us with more context or after logging something more, otherwise we should just panic now (with a helpful message). If we are expecting another thread to panic for us, it should probably not be from mca_scan_cpus, as that may deadlock with us at sched_bind since we might never relinquish the CPU.

In any case I think that if we are deliberately going to enter an infinite loop then for debugging purposes we might want to printf first, something like

if (count == 0) {
	printf("Deferring machine check panic from thread %d", ...);
	for (;;)
		cpu_spinwait();
}

I don't know whether the !MCG_STATUS_RIPV condition ever happens in reality though.

In D10536#218494, @rlibby_gmail.com wrote:

It seems that now in mca_scan, !recoverable => count > 0 and count == 0 => recoverable. So now the only way we enter the infinite loop if (!recoverable) while (count == 0) cpu_spinwait(); is the mca_intr !MCG_STATUS_RIPV condition. Am I understanding that right, is that the intention? [...]

The assumption is that if we are seeing a machine check with MCG_STATUS_RIPV cleared but we don't see any valid MC entries on this CPU, then there must be a valid MC entry on another CPU. Because it takes more work to record and log an MC entry, if we panic right away, we will panic before the other CPU finishes recording the MC entry. In real-world use cases what happens is that all CPUs in a multi-package Nehalem or later system get an MC# with RIPV cleared when an uncorrected ECC error occurs. Without the hang, the CPUs on the "remote" package don't see any valid error logged (as the ECC error is only logged in the uncore of the package containing the memory controller for the RAM where the error occurred). Those remote CPUs finish mca_scan sooner and panic before the CPUs on the package with the real error. So you end up with an "uncorrected machine check exception" panic, but you have no MCA records in your crash dump or logged to the console to figure out why. With the hang, the other CPUs get a chance to log at least one record so that you can get the details (failing DIMM, etc.) of the panic.

Also, mca_scan_cpus() does not panic. Only the MC# handler panics, so there is no danger of deadlock via sched_bind().
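The ordering argument above can be captured in a toy model (everything here is hypothetical and invented for illustration; it is not the kernel code): remote-package CPUs finish an empty scan first, and whether the crash dump contains any MCA record depends entirely on whether they spin long enough for the package that owns the failing memory controller to log its entry.

```c
#include <stdbool.h>

/*
 * Toy model of the panic-ordering race on a multi-package system.
 * Every CPU takes MC# with RIPV clear, but only the package whose
 * uncore owns the failing memory controller has a record to log.
 * Returns how many MCA records reach the crash dump before the first
 * panic fires.
 */
static int
records_before_panic(bool remote_cpus_spin)
{
	int logged = 0;

	/*
	 * Remote-package CPUs scan first and find nothing.  If they
	 * panic immediately, the dump contains zero records.
	 */
	if (!remote_cpus_spin)
		return (logged);

	/*
	 * With the deliberate spin, the CPU on the package with the
	 * real error records its MC entry before anyone panics.
	 */
	logged++;
	return (logged);
}
```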

In D10536#218835, @jhb wrote:

The assumption is that if we are seeing a machine check with MCG_STATUS_RIPV cleared but we don't see any valid MC entries on this CPU, then there must be a valid MC entry on another CPU. [...]

Okay, that all seems to make sense, but then one more question. That seems to imply that we are relying on mca_scan running in multiple specific places (CPUs or at least uncores) via trap/mca_intr/mca_scan.
But I don't see what is enforcing that we don't migrate CPUs between when we take the trap and when we get to the scan. So I wonder, is it possible we take the trap, migrate, miss observing anything during the scan, and then spin in the loop?

Even if possible, this would be a corner case (migration during a trap shouldn't be that common), but adding a print or something before the hang might make it more obvious to debug such an occurrence.

Also, mca_scan_cpus() does not panic. Only the MC# handler panics, so there is no danger of deadlock via sched_bind().

(Previous to this patch, it might have incremented the mca_count, which would have let us break out of the loop as previously written and reach the panic.)

In D10536#218864, @rlibby_gmail.com wrote:

Okay, that all seems to make sense, but then one more question. That seems to imply that we are relying on mca_scan running in multiple specific places (CPUs or at least uncores) via trap/mca_intr/mca_scan.

Well, we are assuming that if one CPU finds nothing valid to log, then at least one other CPU will find something valid to log when a MC# exception fires.

But I don't see what is enforcing that we don't migrate CPUs between when we take the trap and when we get to the scan. So I wonder, is it possible we take the trap, migrate, miss observing anything during the scan, and then spin in the loop?

The MC# IDT entry runs with interrupts disabled, so there is no migration. I've even thought about giving MC# its own stack since it is more like an NMI (it doesn't honor EFLAGS.IF, but instead has its own scheme of preventing nested machine checks via MCIP).

Also, mca_scan_cpus() does not panic. Only the MC# handler panics, so there is no danger of deadlock via sched_bind().

(Previous to this patch, it might have incremented the mca_count, which would have let us break out of the loop as previously written and reach the panic.)

No, mca_scan_cpus() does not call panic. Only mca_intr() calls panic. To be clear, we scan machine checks in 3 different places:

  1. A periodic timer is used to scan the machine check banks on older CPUs without CMCI, or for banks that don't support CMCI, so that we can detect corrected errors.
  2. Modern Intel and AMD CPUs support a special interrupt (CMCI) raised when a correctable error is logged to the machine check registers.
  3. For a fatal (uncorrectable) machine check, an MC# exception is raised.

mca_intr() is only used for (3) and is the only path that panics.
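The three scan paths can be sketched as a small dispatch model. This is a hypothetical illustration (the enum and helper are invented; in the real code the timer and CMCI paths run through the non-panicking scan routines, and only mca_intr(), the MC# exception handler, can panic):

```c
#include <stdbool.h>

/*
 * Hypothetical model of the three machine-check scan entry points
 * described above.
 */
enum mca_scan_source {
	MCA_SRC_TIMER,	/* periodic poll for corrected errors */
	MCA_SRC_CMCI,	/* corrected machine check interrupt */
	MCA_SRC_MCE	/* MC# exception: fatal/uncorrectable error */
};

static bool
mca_source_may_panic(enum mca_scan_source src)
{
	/* The polling and CMCI paths log records and return. */
	return (src == MCA_SRC_MCE);
}
```

This is why there is no sched_bind() deadlock concern: the paths that bind to CPUs never panic, and the path that panics never binds.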

In D10536#220382, @jhb wrote:

The MC# IDT entry runs with interrupts disabled, so there is no migration. I've even thought about giving MC# its own stack since it is more like an NMI (it doesn't honor EFLAGS.IF, but instead has its own scheme of preventing nested machine checks via MCIP).

That's what I was missing. Thanks!

No, mca_scan_cpus() does not call panic. Only mca_intr() calls panic.

Right. I didn't mean that mca_scan_cpus panicked directly, but that previous to this patch it could unblock another thread which was in the while (mca_count == old_count) loop, and that other thread would then proceed to panic.

To be clear, we scan machine checks in 3 different places: [...]

Thanks for laying this out.