Why do we need this at all? Why cannot you use e.g. stop_cpus() or stop_cpus_hard() or even smp_rendezvous() to do that? It can be done in MI, and definitely does not require new IPI vector.

In D51622#1179347, @kib wrote:

Why do we need this at all? Why cannot you use e.g. stop_cpus() or stop_cpus_hard() or even smp_rendezvous() to do that? It can be done in MI, and definitely does not require new IPI vector.

The purpose of this is to make the CPU go catatonic so that it requires a core reset to continue. The handler disables all interrupts before going catatonic, so it requires the LAPIC INIT sequence to restore. We don't want the CPU to execute any other instructions until reset because the memory may have been overwritten.

In D51622#1179405, @jhibbits wrote:

In D51622#1179347, @kib wrote:

Why do we need this at all? Why cannot you use e.g. stop_cpus() or stop_cpus_hard() or even smp_rendezvous() to do that? It can be done in MI, and definitely does not require new IPI vector.

The purpose of this is to make the CPU go catatonic so that it requires a core reset to continue. The handler disables all interrupts before going catatonic, so it requires the LAPIC INIT sequence to restore. We don't want the CPU to execute any other instructions until reset because the memory may have been overwritten.

Then explain this, at least as a comment in the code.

But what happens if e.g. BIOS broadcasts SMI to all cores? Wouldn't this break the core out of the halt state if the memory address is overwritten?

In D51622#1179429, @kib wrote:

In D51622#1179405, @jhibbits wrote:

In D51622#1179347, @kib wrote:

Why do we need this at all? Why cannot you use e.g. stop_cpus() or stop_cpus_hard() or even smp_rendezvous() to do that? It can be done in MI, and definitely does not require new IPI vector.

The purpose of this is to make the CPU go catatonic so that it requires a core reset to continue. The handler disables all interrupts before going catatonic, so it requires the LAPIC INIT sequence to restore. We don't want the CPU to execute any other instructions until reset because the memory may have been overwritten.

Then explain this, at least as a comment in the code.

But what happens if e.g. BIOS broadcasts SMI to all cores? Wouldn't this break the core out of the halt state if the memory address is overwritten?

I suspect that the right answer is to deliver the INIT IPI to put all other cores into the init state.

In D51622#1179431, @kib wrote:

In D51622#1179429, @kib wrote:

In D51622#1179405, @jhibbits wrote:

In D51622#1179347, @kib wrote:

Why do we need this at all? Why cannot you use e.g. stop_cpus() or stop_cpus_hard() or even smp_rendezvous() to do that? It can be done in MI, and definitely does not require new IPI vector.

The purpose of this is to make the CPU go catatonic so that it requires a core reset to continue. The handler disables all interrupts before going catatonic, so it requires the LAPIC INIT sequence to restore. We don't want the CPU to execute any other instructions until reset because the memory may have been overwritten.

Then explain this, at least as a comment in the code.

This is (crudely) explained in mp_x86.c, but I can expand on it.

But what happens if e.g. BIOS broadcasts SMI to all cores? Wouldn't this break the core out of the halt state if the memory address is overwritten?

I suspect that the right answer is to deliver the INIT IPI to put all other cores into the init state.

Do you mean that the right thing would be instead of an IPI, to send the INIT IPI for catatonia? I thought about that, but was confused reading the manual, because it looks like the BIOS can start initialization, so running code, with the INIT IPI, it doesn't put it into a wait state.

In D51622#1179439, @jhibbits wrote:

Do you mean that the right thing would be instead of an IPI, to send the INIT IPI for catatonia? I thought about that, but was confused reading the manual, because it looks like the BIOS can start initialization, so running code, with the INIT IPI, it doesn't put it into a wait state.

Yes, send INIT IPI.

Can you expand on 'BIOS can start initialization'? I do not understand this.

I should add that this is sort of how Linux does it as well (synchronize, disable interrupts, go catatonic and hope for the best)

In fact, I retract my proposal with the INIT IPI. If SMI is broadcasted, other CPUs would not enter the SMI handler, and the sending CPU most likely hang waiting for the reply.

So yes, the cli;hlt loop is the best, but it should be executed from the memory which is not overwritten during kexec.

In D51622#1179455, @kib wrote:

In fact, I retract my proposal with the INIT IPI. If SMI is broadcasted, other CPUs would not enter the SMI handler, and the sending CPU most likely hang waiting for the reply.

So yes, the cli;hlt loop is the best, but it should be executed from the memory which is not overwritten during kexec.

Hm, would it be better to add another kexec MD method for stopping all APs? Something like kexec_md_stop_aps() or such? If amd64 needs its catatonia memory to be safe then we need a (trivial) page table and page for this, so it would need to know the available memory. All this said, the new kernel could still overwrite the memory used for this, since it would be tacked on to the end of the image. The chance of that happening is pretty low.

The loop is only "Just in case" an interrupt comes in that's not disabled (not being an x86 guy, I don't know what's disabled or not with disable_intr() and lapic_disable()). If nothing can come in, then there's no need for the safe memory, since once it halts it's dead.

In D51622#1179456, @jhibbits wrote:

In D51622#1179455, @kib wrote:

In fact, I retract my proposal with the INIT IPI. If SMI is broadcasted, other CPUs would not enter the SMI handler, and the sending CPU most likely hang waiting for the reply.

So yes, the cli;hlt loop is the best, but it should be executed from the memory which is not overwritten during kexec.

Hm, would it be better to add another kexec MD method for stopping all APs? Something like kexec_md_stop_aps() or such? If amd64 needs its catatonia memory to be safe then we need a (trivial) page table and page for this, so it would need to know the available memory. All this said, the new kernel could still overwrite the memory used for this, since it would be tacked on to the end of the image. The chance of that happening is pretty low.

Might be.

The loop is only "Just in case" an interrupt comes in that's not disabled (not being an x86 guy, I don't know what's disabled or not with disable_intr() and lapic_disable()). If nothing can come in, then there's no need for the safe memory, since once it halts it's dead.

As I said below, I believe SMI is still possible even in 'cli;hlt' loop. On AMD there is a way to disable SMI and NMI, but I do not think the method to do that is usable for kexec, and there is also Intel without something equivalent.

In D51622#1179546, @kib wrote:

The loop is only "Just in case" an interrupt comes in that's not disabled (not being an x86 guy, I don't know what's disabled or not with disable_intr() and lapic_disable()). If nothing can come in, then there's no need for the safe memory, since once it halts it's dead.

As I said below, I believe SMI is still possible even in 'cli;hlt' loop. On AMD there is a way to disable SMI and NMI, but I do not think the method to do that is usable for kexec, and there is also Intel without something equivalent.

Is SMI enabled even when the LAPIC is disabled? I noticed that the lapic_disable() call was included in the kexec patch, when it belongs in this patch. If that's not sufficient, would the following be correct?

Make the IPI handler call into a function pointer, defaulting to a normal HLT loop. I think the precursor to this loop can remain common, and only the loop itself be overridden. Meaning, the interrupt and LAPIC disable stays in the general code, and only the last bits go into the callback.
Add to the kexec patch (either into the patch, or a subsequent commit) an override that installs the kexec page table, and includes the HLT loop into that, in the trampoline page(s), to keep it all together.

I know at this time only kexec uses this IPI, but I think there could be a future use for it as well, so I don't want to make it entirely kexec-specific.

About SMI, from Intel SDM vol 3 7.3.1 External Interrupts:

Note that several other pins on the processor can cause a processor interrupt to occur. However, these interrupts
are not handled by the interrupt and exception mechanism described in this chapter. These pins include the
RESET#, FLUSH#, STPCLK#, SMI#, R/S#, and INIT# pins. Whether they are included on a particular processor is
implementation dependent. Pin functions are described in the data books for the individual processors. The SMI#
pin is described in Chapter 33, “System Management Mode.”

I read this (and it aligns with the experience) that SMI is independent from the LAPIC configuration or enablement, as it should be.

I really become curious in which way firmware parks APs after the initial configuration. Coreboot might have something for this.

And indeed, there is very interesting https://github.com/coreboot/coreboot/blob/main/src/cpu/x86/lapic/lapic_cpu_stop.c that is used to park APs. From my reading, they send INIT IPI to itself, which, by the claim in the source file, is equivalent to asserting the INIT# pin.

In D51622#1216461, @kib wrote:

I really become curious in which way firmware parks APs after the initial configuration. Coreboot might have something for this.

And indeed, there is very interesting https://github.com/coreboot/coreboot/blob/main/src/cpu/x86/lapic/lapic_cpu_stop.c that is used to park APs. From my reading, they send INIT IPI to itself, which, by the claim in the source file, is equivalent to asserting the INIT# pin.

Thanks for digging into coreboot. I just took a look as well, and saw this change https://github.com/coreboot/coreboot/commit/ba8965c0395a295828911cb375e7122051e73962, which indicates that the INIT IPI is not always correct because it can land in an "in-between state".

I spent today looking at the Linux kernel spaghetti, and arch/x86/kernel/smpboot.c has it doing something similar to what I'm doing here, noting that it doesn't protect against stray NMIs or SMIs, but accepts the risk because it'll be past the point of no return anyway. Not saying that's the correct approach, I do want to find the best approach, but it's prior art on that "hope and pray it doesn't happen".

In D51622#1216820, @jhibbits wrote:

In D51622#1216461, @kib wrote:

I really become curious in which way firmware parks APs after the initial configuration. Coreboot might have something for this.

And indeed, there is very interesting https://github.com/coreboot/coreboot/blob/main/src/cpu/x86/lapic/lapic_cpu_stop.c that is used to park APs. From my reading, they send INIT IPI to itself, which, by the claim in the source file, is equivalent to asserting the INIT# pin.

Thanks for digging into coreboot. I just took a look as well, and saw this change https://github.com/coreboot/coreboot/commit/ba8965c0395a295828911cb375e7122051e73962, which indicates that the INIT IPI is not always correct because it can land in an "in-between state".

This was just one of the commits in the history. I demonstrated the current code. They decided to go with the INIT IPI finally, but tracking it in their repo is somewhat hard because there were a lot of seemingly unrelated changes that move stuff around or reverted unrelated changes.

I spent today looking at the Linux kernel spaghetti, and arch/x86/kernel/smpboot.c has it doing something similar to what I'm doing here, noting that it doesn't protect against stray NMIs or SMIs, but accepts the risk because it'll be past the point of no return anyway. Not saying that's the correct approach, I do want to find the best approach, but it's prior art on that "hope and pray it doesn't happen".

In fact, I think it is fine to commit this as is, and we might consider adding the self-init trick later. Might be it should be covered by a tunable to disable.

This revision is now accepted and ready to land.Thu, Oct 23, 10:28 AM