Page MenuHomeFreeBSD

Handle broadcast NMIs on x86.
ClosedPublic

Authored by kib on Oct 14 2016, 1:47 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Oct 18, 3:02 PM
Unknown Object (File)
Sun, Oct 12, 8:10 PM
Unknown Object (File)
Sun, Oct 12, 6:28 PM
Unknown Object (File)
Tue, Oct 7, 10:12 AM
Unknown Object (File)
Sep 15 2025, 3:24 PM
Unknown Object (File)
Sep 15 2025, 1:10 PM
Unknown Object (File)
Sep 9 2025, 7:36 PM
Unknown Object (File)
Aug 18 2025, 4:35 PM
Subscribers

Details

Summary

Several sources of NMIs are configured for broadcast mode. In particular, all Intel machines I had access to, implement 'ipmitool power diag' as broadcast NMI. Also, chipset RAS subsystems report hardware errors as broadcast NMI as well, I noted this when I developed DMAR driver.

Kernel typical reaction to the hardware-generated NMI is entrance into the debugger. Since each core gets NMI delivered, each core tries to enter the debugger. Due to synchronous nature of interruption, often all cores (or, at least more than one) are in NMI handlers simultaneously. As result, IPI_STOP_HARD is never delivered to such other cores, since NMI is latched and unblocked only by IRET. We end up with hard-locked machine in this case even after 'power diag'.

The patch tries to handle the problem by introducing the pseudo-lock for simultaneous attempts to handle NMIs. If one core happens to enter NMI trap handler, other cores see it and simulate reception of the IPI_STOP_HARD. More, generic_stop_cpus() avoids sending IPI_STOP_HARD, relying on the nmi handler reporting the stop.

Since it is impossible to detect at runtime whether some stray NMI is broadcast or unicast, patch adds a knob for administrator (really developer) to configure debugging NMI handling mode.

Test Plan

The patch was written several years ago when I developed DMAR driver, and rotten since then. I revived it with the significant help from avg@, who tested both broadcast and unicast modes of delivery with his WIP code for watchdogs.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 5607
Build 5871: CI src buildJenkins

Event Timeline

kib retitled this revision from to Handle broadcast NMIs on x86..
kib updated this object.
kib edited the test plan for this revision. (Show Details)
kib added reviewers: avg, jhb.
kib set the repository for this revision to rS FreeBSD src repository - subversion.
kib added a subscriber: emaste.

I have tested this code and it works.
I will not object against this code getting committed.

But I would like to re-iterate my suggestion.
In my opinion, this kind of logic really belongs to stop_cpus_hard. That should be the proper layer for it.
Benefits:

  • the code would be able to work regardless of how many CPUs get an NMI and that would not require any configuration
  • far more corner cases (races) would be covered, e.g. one CPU panic-ing while another getting an NMI "at the same time"

I think that it would be possible, still, to coalesce kdb_enter-s from a broadcast NMI by adding some code to kdb_enter.
For example, each NMI-ed CPU that enters kdb_enter can set its bit in a want-to-enter-kdb-becase-of-nmi mask before calling stop_cpus_hard.
If the bit is still set after the call, then that CPU is a winner, it clears the mask and proceeds to the debugger.
If the bit is cleared, then we simply return because we know that the winner has already handled the NMI.

One last, tangential note is we should honor KDB_UNATTENDED / debugger_on_panic when handling an NMI.
A physical button is not the only source of the NMI, we could have an NMI watchdog for instance.

In D8249#171896, @avg wrote:

I have tested this code and it works.
I will not object against this code getting committed.

Thanks.

But I would like to re-iterate my suggestion.
In my opinion, this kind of logic really belongs to stop_cpus_hard. That should be the proper layer for it.

I do not see how would it be possible. In fact, when the patch started grow into generic_stop_cpus(), I considered extracting the patch-specific bits into the isolated x86-specific function. I decided not to because I finally disliked the code duplication more that I disliked the '#if X86' lines, but still:

Benefits:

  • the code would be able to work regardless of how many CPUs get an NMI and that would not require any configuration

How ? IMO either the NMI handler (as a whole, not just generic_stop_cpus()) expects broadcast NMI and does coalesce, as done in the patch, or NMI on each core are handled completely isolated. Say, CPU 0 and 2 get NMI simultaneously (in the sense of nmi_kdb_lock). Why should these NMIs get coalseced ? The cause may be perf counter on one core and machine check on another, or e.g. each core get some PCIe error reported in non-broadcast manner. How the generic NMI wrapper could infer that ?

  • far more corner cases (races) would be covered, e.g. one CPU panic-ing while another getting an NMI "at the same time"

I think that it would be possible, still, to coalesce kdb_enter-s from a broadcast NMI by adding some code to kdb_enter.

It is too late.

For example, each NMI-ed CPU that enters kdb_enter can set its bit in a want-to-enter-kdb-becase-of-nmi mask before calling stop_cpus_hard.
If the bit is still set after the call, then that CPU is a winner, it clears the mask and proceeds to the debugger.
If the bit is cleared, then we simply return because we know that the winner has already handled the NMI.

We cannot return, we must:

  • ensure that winner does not decide to use stop_cpus_hard (it needs sometime to call the function, in fully automatic setup)
  • wait until winner commands us what to do next.

One last, tangential note is we should honor KDB_UNATTENDED / debugger_on_panic when handling an NMI.

This is unrelated.

A physical button is not the only source of the NMI, we could have an NMI watchdog for instance.

Yes, unfortunately. This is why I added the knob to revert to the independent NMI handling per core.

In D8249#172142, @kib wrote:
In D8249#171896, @avg wrote:

But I would like to re-iterate my suggestion.
In my opinion, this kind of logic really belongs to stop_cpus_hard. That should be the proper layer for it.

I do not see how would it be possible. In fact, when the patch started grow into generic_stop_cpus(), I considered extracting the patch-specific bits into the isolated x86-specific function. I decided not to because I finally disliked the code duplication more that I disliked the '#if X86' lines, but still:

My idea is to separate stop_cpus_hard from stop_cpus and suspend_cpus.
In stop_cpus_hard a CPU that lost the race to become a master would voluntarily call cpustop_handler.
Thus, the loser(s) would wait until they are released by the winner and then they can compete again to become the next master.
So, each caller of stop_cpus_hard is guaranteed to become a master eventually and there can only be one master at a time.

Benefits:

  • the code would be able to work regardless of how many CPUs get an NMI and that would not require any configuration

How ? IMO either the NMI handler (as a whole, not just generic_stop_cpus()) expects broadcast NMI and does coalesce, as done in the patch, or NMI on each core are handled completely isolated. Say, CPU 0 and 2 get NMI simultaneously (in the sense of nmi_kdb_lock). Why should these NMIs get coalseced ? The cause may be perf counter on one core and machine check on another, or e.g. each core get some PCIe error reported in non-broadcast manner. How the generic NMI wrapper could infer that ?

The perf counter case as well as other similar cases would still be handled the same as now.
I am talking only about unexpected / unexplained NMIs that result in kdb_trap.

  • far more corner cases (races) would be covered, e.g. one CPU panic-ing while another getting an NMI "at the same time"

I think that it would be possible, still, to coalesce kdb_enter-s from a broadcast NMI by adding some code to kdb_enter.

It is too late.

I think that it isn't. I will try to explain the full algorithm that I have in mind.
BTW, I wrote kdb_enter, but I really meant kdb_trap.

For example, each NMI-ed CPU that enters kdb_enter can set its bit in a want-to-enter-kdb-becase-of-nmi mask before calling stop_cpus_hard.
If the bit is still set after the call, then that CPU is a winner, it clears the mask and proceeds to the debugger.
If the bit is cleared, then we simply return because we know that the winner has already handled the NMI.

We cannot return, we must:

  • ensure that winner does not decide to use stop_cpus_hard (it needs sometime to call the function, in fully automatic setup)
  • wait until winner commands us what to do next.

Well, let me try to draw the whole picture as I see it.

The general NMI handling remains the same as now.
stop_cpus_hard is changed in such a way that a loser calls cpustop_handler before trying to become a master again.
The loser can potentially enter cpustop_handler twice, the first time voluntarily and the second time to handle the NMI IPI from the master (unless it's already handling an earlier NMI). That should not be a problem.
The stop_cpus_hard change should ensure that there can not be more than one master at a time and that all other CPUs are parked in cpustop_handler.
So, this solves the deadlock part of the problem that your change solves.

kdb_trap is changed to have special handling for T_NMI: in that case it sets a bit in a new kdb_nmi cpuset.
Then, kdb_trap calls stop_cpus_hard just as it does now.
In the case of a broadcast NMI there will be the first winner that will proceed past the stop_cpus_hard call and all other CPUs will be parked in kdb_trap -> stop_cpus_hard -> cpustop_handler. The winner will understand that it is the first winner by seeing that its bit is still set in the kdb_nmi cpuset. The first winner will then clear kdb_nmi and will proceed further. When the debugger is exited the first winner will call restart_cpus_hard and that will release other CPUs. One of them will become the second winner
in stop_cpus_hard. The second winner will then see that its bit is cleared in kdb_nmi and, thus, it will understand that it is not a first winner and there is no need to enter the debugger again. The second winner will call restart_cpus_hard. All subsequent CPUs will do the same. Thus, the debugger will be entered only once.
Just in case, kdb_nmi is set and checked only for the T_NMI case. For all other traps kdb_trap works the same as now.
So, this coalesces an unexpected broadcast NMI to a single debugger invocation.

In the case that an unexpected NMI is delivered to single CPU only the situation is even simpler. That CPU calls kdb_trap which calls stop_cpus_hard, all other CPUs are stopped via an NMI IPI. There is no competition.

Finally, there could be an intermediate case. Let's say that there is no broadcast NMI, but two or more singular NMIs are delivered to two or more CPUs "at the same time".
In this case those NMIs will be coalesced to a single debugger invocation just as in the broadcast case following exactly the same logic.
In this situation it will be up to a human being (or other intelligent agent) interacting with the debugger to see that more than CPU got the unexpected NMIs.
Hopefully that will be easy to do if each affected CPU prints an informative message before calling stop_cpus_hard.

All of the above obviously depends on the fact that a CPU that is handling an NMI can not be interrupted with another NMI.
I am not sure if this rule is true for all supported platforms that have an equivalent of NMI.

@kib I think that now I perhaps understand what you said earlier. What I suggested would not reliably coalesce the broadcast NMI because the leader does not coordinate at all with other CPUs before sending them hard-stop NMIs. So, a CPU could see its bit in ipi_stop_nmi_pending and go directly to cpustop_handler. Once the leader quits the debugger, the other CPU would return from trap(), do iret and immediately get the second NMI. This time ipi_stop_nmi_pending is clear, so the CPU would proceed to kdb_trap.
Not fatal, but no coalescing either.

I looked at the Linux code for handling x86 NMIs and I have a couple of observations.
First, they seem to have a regular infrastructure for registering NMI handlers as opposed to our ad-hoc approach.
Second, they try to handle things like hitting a breakpoint within the NMI context.
Also, there is one trick that they do. They save the instruction pointer at the time when NMI is received. And if they get another NMI with the same instruction pointer, then they consider it to be a "back-to-back NMI" and in some cases they "swallow" it, that is, just ignore it. They admit, of course, that in some cases that can lead to a loss of a legit new NMI.

http://lxr.free-electrons.com/source/arch/x86/kernel/nmi.c

Another, tangentially related observation from the Linux land is that they seem to enabled LINT1 NMI only for BSP by default.
There is a kernel option that allows to to enable it for all processors or for none at all.
http://lxr.free-electrons.com/source/arch/x86/kernel/apic/apic.c#L1415

We seem to always enable it in all Local APICs.

In D8249#172369, @avg wrote:
In D8249#172142, @kib wrote:
In D8249#171896, @avg wrote:

But I would like to re-iterate my suggestion.
In my opinion, this kind of logic really belongs to stop_cpus_hard. That should be the proper layer for it.

I do not see how would it be possible. In fact, when the patch started grow into generic_stop_cpus(), I considered extracting the patch-specific bits into the isolated x86-specific function. I decided not to because I finally disliked the code duplication more that I disliked the '#if X86' lines, but still:

My idea is to separate stop_cpus_hard from stop_cpus and suspend_cpus.
In stop_cpus_hard a CPU that lost the race to become a master would voluntarily call cpustop_handler.
Thus, the loser(s) would wait until they are released by the winner and then they can compete again to become the next master.
So, each caller of stop_cpus_hard is guaranteed to become a master eventually and there can only be one master at a time.

Which is exactly opposite to what I want to do, at least in the situation where broadcast NMIs coalescing is enabled by the knob.

Benefits:

  • the code would be able to work regardless of how many CPUs get an NMI and that would not require any configuration

How ? IMO either the NMI handler (as a whole, not just generic_stop_cpus()) expects broadcast NMI and does coalesce, as done in the patch, or NMI on each core are handled completely isolated. Say, CPU 0 and 2 get NMI simultaneously (in the sense of nmi_kdb_lock). Why should these NMIs get coalseced ? The cause may be perf counter on one core and machine check on another, or e.g. each core get some PCIe error reported in non-broadcast manner. How the generic NMI wrapper could infer that ?

The perf counter case as well as other similar cases would still be handled the same as now.
I am talking only about unexpected / unexplained NMIs that result in kdb_trap.

  • far more corner cases (races) would be covered, e.g. one CPU panic-ing while another getting an NMI "at the same time"

I think that it would be possible, still, to coalesce kdb_enter-s from a broadcast NMI by adding some code to kdb_enter.

It is too late.

I think that it isn't. I will try to explain the full algorithm that I have in mind.
BTW, I wrote kdb_enter, but I really meant kdb_trap.

For example, each NMI-ed CPU that enters kdb_enter can set its bit in a want-to-enter-kdb-becase-of-nmi mask before calling stop_cpus_hard.
If the bit is still set after the call, then that CPU is a winner, it clears the mask and proceeds to the debugger.
If the bit is cleared, then we simply return because we know that the winner has already handled the NMI.

We cannot return, we must:

  • ensure that winner does not decide to use stop_cpus_hard (it needs sometime to call the function, in fully automatic setup)
  • wait until winner commands us what to do next.

Well, let me try to draw the whole picture as I see it.

The general NMI handling remains the same as now.
stop_cpus_hard is changed in such a way that a loser calls cpustop_handler before trying to become a master again.
The loser can potentially enter cpustop_handler twice, the first time voluntarily and the second time to handle the NMI IPI from the master (unless it's already handling an earlier NMI). That should not be a problem.
The stop_cpus_hard change should ensure that there can not be more than one master at a time and that all other CPUs are parked in cpustop_handler.
So, this solves the deadlock part of the problem that your change solves.

kdb_trap is changed to have special handling for T_NMI: in that case it sets a bit in a new kdb_nmi cpuset.
Then, kdb_trap calls stop_cpus_hard just as it does now.
In the case of a broadcast NMI there will be the first winner that will proceed past the stop_cpus_hard call and all other CPUs will be parked in kdb_trap -> stop_cpus_hard -> cpustop_handler. The winner will understand that it is the first winner by seeing that its bit is still set in the kdb_nmi cpuset. The first winner will then clear kdb_nmi and will proceed further. When the debugger is exited the first winner will call restart_cpus_hard and that will release other CPUs. One of them will become the second winner
in stop_cpus_hard. The second winner will then see that its bit is cleared in kdb_nmi and, thus, it will understand that it is not a first winner and there is no need to enter the debugger again. The second winner will call restart_cpus_hard. All subsequent CPUs will do the same. Thus, the debugger will be entered only once.
Just in case, kdb_nmi is set and checked only for the T_NMI case. For all other traps kdb_trap works the same as now.
So, this coalesces an unexpected broadcast NMI to a single debugger invocation.

In the case that an unexpected NMI is delivered to single CPU only the situation is even simpler. That CPU calls kdb_trap which calls stop_cpus_hard, all other CPUs are stopped via an NMI IPI. There is no competition.

Finally, there could be an intermediate case. Let's say that there is no broadcast NMI, but two or more singular NMIs are delivered to two or more CPUs "at the same time".
In this case those NMIs will be coalesced to a single debugger invocation just as in the broadcast case following exactly the same logic.

In this 'partial broadcast case', how would kdb_enter() or stop_cpu_hard() know that NMI must be sent to other cores ?
kdb must operate with other cores parked.

In this situation it will be up to a human being (or other intelligent agent) interacting with the debugger to see that more than CPU got the unexpected NMIs.
Hopefully that will be easy to do if each affected CPU prints an informative message before calling stop_cpus_hard.

All of the above obviously depends on the fact that a CPU that is handling an NMI can not be interrupted with another NMI.
I am not sure if this rule is true for all supported platforms that have an equivalent of NMI.

In fact, I cannot follow this description. If you will code that, I will withdraw my patch.

Otherwise, I think that what I have is useful for immediate benefit.

In D8249#172998, @avg wrote:

Another, tangentially related observation from the Linux land is that they seem to enabled LINT1 NMI only for BSP by default.
There is a kernel option that allows to to enable it for all processors or for none at all.
http://lxr.free-electrons.com/source/arch/x86/kernel/apic/apic.c#L1415

We seem to always enable it in all Local APICs.

Strange. I thought that NMI config for LINTs must execute what ACPI MADT local NMI sources entries direct the OS to do. The table may enable only one CPU' LINT, or it may enable all CPU's LINTs entries as configured for NMI. E.g. on all my machines Local APIC NMI is configured on LINT1 for all cpus.

In D8249#173352, @kib wrote:

Strange. I thought that NMI config for LINTs must execute what ACPI MADT local NMI sources entries direct the OS to do. The table may enable only one CPU' LINT, or it may enable all CPU's LINTs entries as configured for NMI. E.g. on all my machines Local APIC NMI is configured on LINT1 for all cpus.

Of all systems that I have access to (all AMD) none has NMI configuration in MADT.
So, I am talking about the default configuration:

/* Global defaults for local APIC LVT entries. */
static struct lvt lvts[APIC_LVT_MAX + 1] = {
        { 1, 1, 1, 1, APIC_LVT_DM_EXTINT, 0 },  /* LINT0: masked ExtINT */
        { 1, 1, 0, 1, APIC_LVT_DM_NMI, 0 },     /* LINT1: NMI */
        { 1, 1, 1, 1, APIC_LVT_DM_FIXED, APIC_TIMER_INT },      /* Timer */
        { 1, 1, 0, 1, APIC_LVT_DM_FIXED, APIC_ERROR_INT },      /* Error */
        { 1, 1, 1, 1, APIC_LVT_DM_NMI, 0 },     /* PMC */
        { 1, 1, 1, 1, APIC_LVT_DM_FIXED, APIC_THERMAL_INT },    /* Thermal */
        { 1, 1, 1, 1, APIC_LVT_DM_FIXED, APIC_CMC_INT },        /* CMCI */
};

@kib regarding D8249#173351:
In my proposal stop_cpus_hard always sends NMIs.

Regarding the code, I am working on it, but not sure when it will be ready.
As I've originally said, I have no objections against committing your change.

As another side note, I've finally found a root cause for my problem with getting NMIs when running Linux guests in bhyve: D8321.
I am mentioning this, because I remember talking to you about that problem earlier.
At that time I thought that I had problems with handling of a broadcast NMI. But it turns out that I was getting singular NMIs for every guest's vCPU.
So, in some cases I was getting just a single NMI, because I ran a single CPU guest.
In other cases I was getting multiple NMIs and they were so close in time that they appeared to be caused by the same source.

This revision was automatically updated to reflect the committed changes.
head/sys/x86/x86/cpu_machdep.c
572 ↗(On Diff #21656)

Sorry for a belated comment, but zero in this line is confusing.