
swi_sched() from NMI context
ClosedPublic

Authored by mav on Jul 21 2020, 6:12 PM.

Details

Summary

To handle events from ACPI Platform Error Interfaces I need a way to schedule processing from NMI context. This patch implements two mechanisms allowing swi_sched() to work in NMI context.

The first, simple one, just makes hardclock() periodically check the ih_need status of the SWI. It only requires minor machine-independent changes, but has up to 1/HZ latency, and so is planned mostly as a fallback.
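For illustration, a minimal C sketch of that polling idea follows; the atomic flag stands in for the SWI's ih_need status, and both function names are hypothetical, not the patch's actual code.

#include <stdatomic.h>

static atomic_int swi_pending;      /* stands in for the SWI's ih_need */

/* NMI context: only flips an atomic flag, takes no locks. */
static void
swi_mark_pending_from_nmi(void)
{
    atomic_store_explicit(&swi_pending, 1, memory_order_release);
}

/* Called from hardclock(), i.e. HZ times per second, which is where the
 * up-to-1/HZ scheduling latency of this fallback comes from. */
static void
hardclock_poll_swi(void)
{
    if (atomic_exchange_explicit(&swi_pending, 0, memory_order_acq_rel))
        ;   /* schedule the SWI thread here, e.g. via swi_sched() */
}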

The second, more complicated one, sends an IPI to the current CPU from the NMI context. To do that I've added new functions ipi_self() and ipi_self_from_nmi(), utilizing the xAPIC's ability to send an IPI to the current CPU with only one register write instead of the normal two. Along the way I've slightly cleaned up the LAPIC code, using single register writes where possible. I also propose removing the read-modify-write of the ICR register: I haven't found the motivation for it, added long ago, and I haven't found Linux doing it either.
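As a rough illustration of the "one register write instead of two": in xAPIC mode the Self destination shorthand in the ICR low dword makes the separate ICR high (destination) write unnecessary. The register offsets below follow the Intel SDM; the mapping and function name are assumptions for the sketch, not the patch's code.

#include <stdint.h>

#define LAPIC_ICR_LO        0x300       /* ICR low dword, xAPIC MMIO offset */
#define ICR_DEST_SELF       (1u << 18)  /* destination shorthand: self */
#define ICR_DELMODE_FIXED   (0u << 8)   /* fixed delivery mode */

static volatile uint32_t *lapic_base;   /* assumed mapping of the xAPIC page */

static void
xapic_ipi_self(uint32_t vector)
{
    /* Single store: shorthand "self" means no ICR high write is needed. */
    lapic_base[LAPIC_ICR_LO / 4] =
        ICR_DEST_SELF | ICR_DELMODE_FIXED | (vector & 0xff);
}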

Test Plan

The NMIs in question are very rare, but I've tested this with my APEI driver prototype. Tested in both xAPIC and x2APIC modes.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

mav requested review of this revision. Jul 21 2020, 6:12 PM

I strongly prefer that for x2APIC mode ipi_self() use the SELF IPI MSR.

As I said in the other message, when an xAPIC mode IPI is sent from NMI context, it must wait for the ICR Send Status to change from pending to idle. Otherwise, it breaks non-NMI senders.

Suppose that some CPU prepares to send the hard-stop NMI, and sets the ipi_stop_nmi_pending bit(s) for the destinations. Now, imagine that your async NMI comes in the meantime. How do you plan to determine that ipi_nmi_handler should not consume that NMI?

In D25754#570230, @kib wrote:

I strongly prefer that for x2APIC mode ipi_self() use the SELF IPI MSR.

I've looked at that, but haven't found a real explanation why it is better, other than that it has no additional flags. Do you know a reason why it is worth a separate code path?

As I said in the other message, when an xAPIC mode IPI is sent from NMI context, it must wait for the ICR Send Status to change from pending to idle. Otherwise, it breaks non-NMI senders.

Yes, I saw that. I am just not sure how to best do that without also doing it for the non-NMI case. Any ideas? And I haven't found that in Linux either.

Suppose that some CPU prepares to send the hard-stop NMI, and sets the ipi_stop_nmi_pending bit(s) for the destinations. Now, imagine that your async NMI comes in the meantime. How do you plan to determine that ipi_nmi_handler should not consume that NMI?

That is a good question I had from the beginning. It seems each mechanism using NMI has its own completely different status reporting mechanism. The best idea I have is to make each handler assume there are others, like with normal shared interrupts, panicking into the debugger only if none of the handlers consumed it.
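A minimal sketch of that "shared interrupt" style of NMI dispatch, with hypothetical handler names, might look like this:

#include <stdbool.h>
#include <stddef.h>

typedef bool (*nmi_handler_t)(void);

/* Stubs standing in for the real consumers; each returns true only if it
 * recognized the NMI source it owns (APEI/GHES status, stop-NMI bitmap, ...). */
static bool apei_nmi_handler(void) { return (false); }
static bool ipi_nmi_handler(void)  { return (false); }

static nmi_handler_t nmi_handlers[] = { apei_nmi_handler, ipi_nmi_handler };

static void
nmi_dispatch(void)
{
    bool handled = false;

    /* Give every registered handler a chance to claim the NMI. */
    for (size_t i = 0; i < sizeof(nmi_handlers) / sizeof(nmi_handlers[0]); i++)
        handled |= nmi_handlers[i]();
    if (!handled)
        ;   /* nobody claimed it: only now escalate to the debugger/panic */
}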

In D25754#570262, @mav wrote:
In D25754#570230, @kib wrote:

I strongly prefer that for x2APIC mode ipi_self() use the SELF IPI MSR.

I've looked at that, but haven't found a real explanation why it is better, other than that it has no additional flags. Do you know a reason why it is worth a separate code path?

As I said in the other message, when an xAPIC mode IPI is sent from NMI context, it must wait for the ICR Send Status to change from pending to idle. Otherwise, it breaks non-NMI senders.

Yes, I saw that. I am just not sure how to best do that without also doing it for the non-NMI case. Any ideas? And I haven't found that in Linux either.

I will follow up on these two points in one statement. I believe the high-level answer is that you need a separate function to send a self-IPI from NMI context in order to handle it correctly.

Intel states that an ICR write must not occur while the state is not idle, and doing what you do clearly breaks this requirement. I have no idea what Linux does, or what additional internal knowledge they got from someone.

For x2APIC it was clearly intended by Intel that the SELF IPI register is used as a shortcut: you do not need to construct the command, and things are generally easier, as is common for x2APIC (for instance, there are no idle/pending states). So I do not see a reason not to use it, especially since it does not interfere with top-level code fiddling with the ICR.
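For reference, a sketch of what the x2APIC shortcut amounts to, assuming the SELF IPI register at MSR 0x83F as described in the Intel SDM; the wrmsr helper and function name here are illustrative.

#include <stdint.h>

#define MSR_X2APIC_SELF_IPI 0x83f

static inline void
wrmsr(uint32_t msr, uint64_t val)
{
    __asm __volatile("wrmsr" : : "c" (msr),
        "a" ((uint32_t)val), "d" ((uint32_t)(val >> 32)));
}

static void
x2apic_ipi_self(uint32_t vector)
{
    /* One MSR write, no ICR command to build and no pending state to poll. */
    wrmsr(MSR_X2APIC_SELF_IPI, vector);
}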

Suppose that some CPU prepares to send the hard-stop NMI, and sets the ipi_stop_nmi_pending bit(s) for the destinations. Now, imagine that your async NMI comes in the meantime. How do you plan to determine that ipi_nmi_handler should not consume that NMI?

That is a good question I had from the beginning. It seems each mechanism using NMI has its own completely different status reporting mechanism. The best idea I have is to make each handler assume there are others, like with normal shared interrupts, panicking into the debugger only if none of the handlers consumed it.

I think so far all NMI causes were safe in this regard. Practically you never get an NMI for 'ISA errors'; I think the only hardware NMI sources we handle are NMIs from the BMC and PMC interrupts. If you start handling NMIs from the RAS hardware, I think it is important not to cause false positives and to make sure that other handlers are prepared for such races. It probably cannot be solved completely, but to solve it even partially all handlers must be gentle enough to accept the possibility of racing with another NMI source.

mav edited the summary of this revision.
mav edited the test plan for this revision.
  • Introduced separate ipi_self_from_nmi(), waiting for IPI completion before returning.
  • Added use of SELF IPI register in x2APIC mode.
In D25754#570274, @kib wrote:

Intel states that an ICR write must not occur while the state is not idle, and doing what you do clearly breaks this requirement.

Added wait in ipi_self_from_nmi().
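A sketch of what that wait amounts to in xAPIC mode, assuming the Delivery Status bit (bit 12) of the ICR low dword per the Intel SDM; the mapping and function name are illustrative. The point is that the NMI path leaves the ICR idle again before returning, so an interrupted non-NMI sender does not have its in-flight ICR state disturbed.

#include <stdint.h>

#define LAPIC_ICR_LO        0x300
#define ICR_DELIVERY_PEND   (1u << 12)  /* ICR bit 12: send pending */

static volatile uint32_t *lapic_base;   /* assumed mapping of the xAPIC page */

static void
xapic_icr_wait_idle(void)
{
    /* Spin until the self-IPI written to the ICR has actually been sent. */
    while (lapic_base[LAPIC_ICR_LO / 4] & ICR_DELIVERY_PEND)
        ;
}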

So I do not see a reason not to use it, especially since it does not interfere with top-level code fiddling with the ICR.

Added use of it.

If you start handling NMI from the RAS hardware.

Sure. I'll take another look. Though at least Supermicro boards are already sending those NMIs on PCIe errors, including hot-unplug; that is why I started all this in the first place, so whatever I do can hardly be worse. ;)

Could you please split this into two pieces, one for the LAPIC changes only, and the rest for SWI_FROMNMI?

In D25754#570638, @mav wrote:
In D25754#570274, @kib wrote:

If you start handling NMI from the RAS hardware.

Sure. I'll take another look. Though at least Supermicro boards are already sending those NMIs on PCIe errors, including hot-unplug; that is why I started all this in the first place, so whatever I do can hardly be worse. ;)

In fact the action on the AER UNCOR errors is configurable at the chipset level. I know, from first-hand experience, that many BIOSes set it to broadcast an NMI (or generate a broadcast NMI from the SMI handler), which is both stupid and risky. It is quite possible that there are some interfaces (I heard the abbreviation WMI) that may allow reconfiguring this.

sys/x86/x86/local_apic.c
298 ↗(On Diff #74796)

You do not need this mfence when interrupting yourself. It will not make anything ordered that was not ordered without it, on the same core.

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

This is not needed in x2APIC mode. There is no pending state.

In D25754#570661, @kib wrote:

In fact the action on the AER UNCOR errors is configurable at the chipset level. I know, from first-hand experience, that many BIOSes set it to broadcast an NMI (or generate a broadcast NMI from the SMI handler), which is both stupid and risky. It is quite possible that there are some interfaces (I heard the abbreviation WMI) that may allow reconfiguring this.

In all contexts where I've heard of or used WMI it was something proprietary. I've found nothing to disable the NMIs in the APEI or WHEA specs I am working from, at most how to handle them. The ACPI on my board does not allow handling AER natively, so AER errors are reported as NMIs after being logged into the BMC event log. Sure, I am able to mask the errors at the PCIe device level, but I don't feel good about hiding the errors.

sys/x86/x86/local_apic.c
298 ↗(On Diff #74796)

OK. I hadn't realized it was "self" when I copy-pasted. ;)

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

lapic_ipi_wait() does nothing on x2APIC. Would you prefer to see an explicit check here too?

Remove mfence() from lapic_write_self_ipi().

mav marked an inline comment as done. Jul 22 2020, 6:15 PM
sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

In fact, I am not sure that we want ipi_self() to take 'ipi' as a parameter instead of a vector. Look at what the IPIs that are not vectored provide: HARDCLOCK, HARD_STOP, etc. It does not make sense to send any of those to self. I thought for a second that IPI_AST does make sense, but then realized that it requires setting TDF_ASTPENDING to have any effect, which means that it cannot be done from NMI context.

Then ipi_self_from_nmi() (do we need ipi_self() at all?) would start as

if (x2apic_mode) {
    lapic_write_self_ipi(vector);
    return;
}
sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

I was just trying to extend the existing KPI without custom limitations. You may be right that it is overkill, but I don't care much about the performance of something that happens only once in a while.

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

I mean ipi_self() cannot be correctly used with an IPI parameter. It can only be utilized when you pass it plain, non-special vectors. Then it makes sense not to copy the existing interface, so as not to put wrong assumptions into the reader's head.

mav marked an inline comment as not done. Jul 22 2020, 7:39 PM
mav added inline comments.
sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

I am not getting why ipi_self() cannot be correctly used with an IPI parameter. All the code setting bitmaps uses atomics and should work from any context. If you are worried about the thread being migrated, then we already have ipi_all_but_self(), which already seems to quietly assume the same.

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

It is not ipi_self() as such, but the use of any of the IPIs (vs. vectored interrupts). You cannot HARD_STOP yourself, you cannot schedule an AST to yourself from NMI context because it requires td_lock, and so on.

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

I've added my IPI_SWI as a bitmapped IPI so as not to waste a vector. So it does require the ipi argument there. Do you think it makes sense to waste a vector on such a rare thing?

I still think unification is good. If you don't like bitmapped IPIs, then maybe we should remove them altogether, turning them into vectors? It would simplify the code much more than making this case different.
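For context, a sketch of what the bitmapped IPI multiplexing amounts to: several logical IPIs share one hardware vector, with a per-CPU bitmap set atomically by the sender and drained by the handler. The names below, including IPI_SWI_BIT, are illustrative rather than the actual FreeBSD definitions.

#include <stdatomic.h>
#include <stdint.h>

#define IPI_AST_BIT 0x1
#define IPI_SWI_BIT 0x2                     /* hypothetical bit for the new SWI IPI */

static _Atomic uint32_t cpu_ipi_pending;    /* one such word per CPU in reality */

static void
ipi_bitmapped_send(uint32_t bit)
{
    atomic_fetch_or(&cpu_ipi_pending, bit);
    /* ...then trigger the shared vector, e.g. via a self-IPI as sketched above... */
}

static void
ipi_bitmapped_handler(void)
{
    uint32_t pending = atomic_exchange(&cpu_ipi_pending, 0);

    if (pending & IPI_SWI_BIT)
        ;   /* kick off the SWI scheduling work */
    if (pending & IPI_AST_BIT)
        ;   /* handle the AST request */
}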

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

I think it is much easier to make it operate correctly with a distinguished vector. I would not use the term 'waste' for it.

That said, I still want the patch split into (at least) two parts, and let's finish the APIC bit first.

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

Calling lapic_write_self_ipi() directly from ipi_self_from_nmi() would be a layering violation of the apic_ops KPI used by Xen. I haven't looked at what Xen can actually do from NMI, but I bet this code would be completely wrong there.

So if we are going through lapic_ipi_vectored() anyway, then my code is just right; and if you want me to remove the bitmapped IPI handling from there just for the sake of removal, I've already made a version using a separate vector for IPI_SELF. It touches many more files to manage the vector, but if you wish -- I am fine with it.

sys/x86/x86/mp_x86.c
1440 ↗(On Diff #74796)

I do not believe that RAS features are relevant to the Xen guest environment, even for dom0.

I would prefer the dedicated vector variant. But still, please split.

This revision was not accepted when it landed; it landed in state "Needs Review". Jul 25 2020, 3:20 PM
This revision was automatically updated to reflect the committed changes.