Details

Reviewers

imp
markj
jhb
kib
jmg
stevek

Commits

rG16db4c6fff45: amd64: Add kexec support

Summary

The biggest difference between this and arm64 kexec is that we can't
disable the MMU for amd64, we have to instead create a new "safe" page
table that the trampoline and "child" kernel can use. This requires a
lot more work to create identity mappings, etc.
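
To make the identity-mapping idea concrete, here is a minimal userland sketch, not the committed code. It fills one page-directory page with 2 MiB entries so that every virtual address in a 1 GiB region maps to the same physical address. The bit values mirror the amd64 PG_V, PG_RW and PG_PS definitions; the helper name and the X_-prefixed macro names are hypothetical, chosen only for this illustration.

```c
#include <stdint.h>

/* Page-table entry bits, mirroring amd64 PG_V, PG_RW and PG_PS. */
#define X_PG_V		0x001ULL	/* entry is valid */
#define X_PG_RW		0x002ULL	/* writable */
#define X_PG_PS		0x080ULL	/* large page (2 MiB at the PD level) */

#define X_NPDEPG	512		/* entries per page-directory page */
#define X_NBPDR		(1ULL << 21)	/* bytes mapped by one PD entry */

/*
 * Hypothetical helper: fill one page directory so that the 1 GiB region
 * starting at 'base' (assumed 1 GiB aligned) is identity mapped with
 * 2 MiB pages, i.e. virtual address == physical address.
 */
static void
identity_map_1g(uint64_t pd[X_NPDEPG], uint64_t base)
{
	for (int i = 0; i < X_NPDEPG; i++)
		pd[i] = (base + (uint64_t)i * X_NBPDR) |
		    X_PG_V | X_PG_RW | X_PG_PS;
}
```

A real implementation would also need the upper levels (PML4 and PDP entries pointing at this page) and would have to map the trampoline and the staged image, which this sketch leaves out.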

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhibbits requested review of this revision. Jul 29 2025, 6:56 PM

Harbormaster completed remote builds in B65836: Diff 159420. Jul 29 2025, 6:56 PM

jhibbits added a parent revision: D51622: amd64: Add cpu_stop() support to go UP after SMP. Jul 29 2025, 6:56 PM

jhibbits added a child revision: D51624: x86/intr: Handle case of disabling MSI after release.

I only looked briefly over the patch, so this is not a review, but a bunch of random trivial comments.

One global question I have: most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words, the reboot must migrate to AP if it has not already done so.

sys/amd64/amd64/kexec_support.c
34	systm.h should go first, param.h is not needed, other sys includes must be ordered alphabetically
85	Comment style is wrong
93	We call such things pdp/pde/pte in amd64 code.
sys/amd64/amd64/kexec_tramp.S
34	This must be removed.
58	?
74	Why is this needed?
75	And this?
82	Same question
sys/amd64/amd64/machdep.c
227 ↗	(On Diff #159420)	Why is this needed, and, even if it is needed, why does it appear in this patch?
sys/x86/x86/intr_machdep.c
262	Why is it not enough to do cli on all CPUs? Your stop code does that on all other cores.

jhibbits marked 4 inline comments as done. Aug 4 2025, 3:58 PM

jhibbits added inline comments.

sys/amd64/amd64/kexec_tramp.S
58	Oops, this was debug from initial development that I forgot to remove.
75	It's possible only one of the above is necessary. We ran into problems at reboot that looked cache-related, so I went into extreme paranoia. I'll test again, removing each in turn, to see if it still works correctly.
82	This one may not be necessary, since everything is being done on this single core. I had added the wbinvd early on before noticing the problem I was experiencing was actually related to the TLB (PG_G entries not being flushed, so causing fun chaos).
sys/amd64/amd64/machdep.c
227 ↗	(On Diff #159420)	I'll move this out. I had added it to minimize the loader.kboot diff, but it's certainly not necessary for this diff.
sys/x86/x86/intr_machdep.c
262	This disables the interrupt at the IO-APIC, in case the driver didn't do so in the shutdown handler. This avoids the "reserved" interrupt panics in the new kernel.
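
As an illustration of what "disabling the interrupt at the IO-APIC" amounts to, the following is a small userland model, not the code in this revision. It treats the low 32 bits of each I/O APIC redirection entry as a plain array element and sets the mask bit (bit 16) on every entry that is not already masked. The array representation, the helper name and the X_IOART_INTMASK macro name are assumptions made for the sketch; real hardware is programmed through the I/O APIC's indirect register window.

```c
#include <stddef.h>
#include <stdint.h>

/* Mask bit in the low 32 bits of an I/O APIC redirection entry (bit 16). */
#define X_IOART_INTMASK	0x00010000u

/*
 * Hypothetical model: 'rte_lo' holds the low words of 'n' redirection
 * entries.  Set the mask bit on each unmasked entry and return how many
 * entries were changed.
 */
static size_t
mask_all_sources(uint32_t *rte_lo, size_t n)
{
	size_t changed = 0;

	for (size_t i = 0; i < n; i++) {
		if ((rte_lo[i] & X_IOART_INTMASK) == 0) {
			rte_lo[i] |= X_IOART_INTMASK;
			changed++;
		}
	}
	return (changed);
}
```

A masked source cannot deliver an interrupt the new kernel has not set up a handler for, which is the failure mode described in the comment above.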

One global question I have: most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words, the reboot must migrate to AP if it has not already done so.

Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().

If you mean it should execute from an AP, is there a reason for that?

In D51623#1181462, @jhibbits wrote:

One global question I have: most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words, the reboot must migrate to AP if it has not already done so.

Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().

If you mean it should execute from an AP, is there a reason for that?

Yes, of course I mean 'kexec must occur on BSP'. Silly thinko.

In D51623#1181601, @kib wrote:

Yes, of course I mean 'kexec must occur on BSP'. Silly thinko.

I think the sched_bind(curthread, CPU_FIRST()) in kern_reboot() should suffice, then, but correct me if I'm wrong.

When we eventually add rescue support, to run from panic, that might need to change, but would likely be a different entry, too.

In D51623#1181602, @jhibbits wrote:

I think the sched_bind(curthread, CPU_FIRST()) in kern_reboot() should suffice, then, but correct me if I'm wrong.

When we eventually add rescue support, to run from panic, that might need to change, but would likely be a different entry, too.

In general, CPU_FIRST() is not guaranteed to be the BSP, but it happens to be on amd64. On x86 we define the BSP as cpuid == 0.

In D51623#1181603, @kib wrote:

In general, CPU_FIRST() is not guaranteed to be the BSP, but it happens to be on amd64. On x86 we define the BSP as cpuid == 0.

CPU_FIRST() seems to be the BSP on all platforms I've looked at (powerpc, amd64, arm64).

If CPU_FIRST() doesn't provide the BSP, what would? I don't see any cpu_bsp(), or similar. I recall @nwhitehorn doing some work on powerpc several years ago allowing BSP to be non-zero, but I don't recall how that panned out, and that may have only been in the platform side, which the MI side is unconcerned with, so keeps BSP as 0.

Address feedback.

Harbormaster completed remote builds in B66039: Diff 159911. Aug 6 2025, 9:03 PM

In D51623#1182449, @jhibbits wrote:

CPU_FIRST() seems to be the BSP on all platforms I've looked at (powerpc, amd64, arm64).

If CPU_FIRST() doesn't provide the BSP, what would? I don't see any cpu_bsp(), or similar. I recall @nwhitehorn doing some work on powerpc several years ago allowing BSP to be non-zero, but I don't recall how that panned out, and that may have only been in the platform side, which the MI side is unconcerned with, so keeps BSP as 0.

For amd64 we have the IS_BSP() macro; you might, for example, add an assertion such as MPASS(IS_BSP()) on the kexec MD path.

BTW, is there any code that makes sure that the kexec-ed image does not overwrite EFI runtime memory? If not, we must force-disable EFIRT in the kexec-ed kernel.

sys/amd64/amd64/kexec_support.c
147	This is pdpe (etc) as well.
153	These two lines should be a multi-line comment block.
188	This line assumes that NBPDP is 1G. You might use howmany() to avoid hard-coding that.
264	Comment style is wrong.
sys/amd64/amd64/kexec_tramp.S
39	Also note that interrupts must be disabled.
64	Use size suffixes consistently (see other comment)
75	I do not believe either of the instructions is needed. If their presence changes something, there is a bug somewhere else.
78	Same here.
80	Note that mov to %cr3 flushes TLB except PG_G entries, so the comment is somewhat misplaced.
82	Other instructions use size suffix, I would write this one as `andq` too.
sys/x86/x86/intr_machdep.c
262	At least add a comment. But shouldn't the reinit of the IOAPICs in the kexec-ed kernel prevent this problem? I think there is a bug somewhere, either in our init sequence or in kexec, if such a workaround is needed.
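
The howmany() suggestion for line 188 of kexec_support.c can be sketched as follows. This is a userland illustration under stated assumptions rather than the code in the diff: X_HOWMANY repeats the definition of howmany() from sys/param.h, X_NBPDP repeats the amd64 value of NBPDP (1 << 30), and the helper name pdp_entries_for() is hypothetical.

```c
#include <stdint.h>

/* Same arithmetic as howmany() in FreeBSD's sys/param.h. */
#define X_HOWMANY(x, y)	(((x) + ((y) - 1)) / (y))

/* Bytes mapped by one PDP entry on amd64 (1 << PDPSHIFT, PDPSHIFT == 30). */
#define X_NBPDP		(1ULL << 30)

/*
 * Hypothetical helper: number of PDP entries needed to cover 'size'
 * bytes starting at address 0, rounding up to whole entries.
 */
static uint64_t
pdp_entries_for(uint64_t size)
{
	return (X_HOWMANY(size, X_NBPDP));
}
```

Writing the count in terms of NBPDP keeps the rounding correct if the constant ever differs from 1 GiB, instead of baking a literal shift or divisor into the loop bound.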

Another question: you can read the description of the assumptions for hammer_time() and pmap initialization in the comment in amd64/machdep.c:1264, right above the amd64_loadaddr() helper. Note the last enumeration item about the memory block after the loaded stuff (the slop). How is it ensured that the kexec-ed kernel is given enough slop to run the early allocator without problems?

In D51623#1184952, @kib wrote:

BTW, is there any code that makes sure that the kexec-ed image does not overwrite EFI runtime memory? If not, we must force-disable EFIRT in the kexec-ed kernel.

How is that any different from a normal FreeBSD kernel? Or are you worried about Linux (in which case the EFIRT concern confuses me)? How would it overwrite that memory? It, like the normal FreeBSD kernel, has to honor it. I don't think it's any different from the kexec-from-Linux case, which passes the runtime memory in via the normal loader mechanisms. I think that jhibbits has preliminary hacks to my kboot code to do that from FreeBSD.

In D51623#1184965, @kib wrote:

Another question: you can read the description of the assumptions for hammer_time() and pmap initialization in the comment in amd64/machdep.c:1264, right above the amd64_loadaddr() helper. Note the last enumeration item about the memory block after the loaded stuff (the slop). How is it ensured that the kexec-ed kernel is given enough slop to run the early allocator without problems?

Same way that we do it for a Linux kexec: loader.kboot knows, just like loader.efi knows, and arranges for enough space.

jhibbits marked 5 inline comments as done. Aug 12 2025, 8:31 PM

jhibbits added inline comments.

sys/amd64/amd64/kexec_support.c
147	This whole thing was largely, and shamelessly, taken from the loader. I will update it to kernel style.
sys/amd64/amd64/kexec_tramp.S
80	I'll put a blank line between the movq above and the comment, because the comment is for the %cr4 twiddling, which flushes the whole TLB including global pages.
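
The distinction drawn here (a %cr3 reload leaves PG_G entries in the TLB, while a change to CR4.PGE invalidates every entry, global ones included) can be illustrated with a small value-level sketch. It assumes the "%cr4 twiddling" clears CR4.PGE, which the thread does not spell out; the macro and function names are hypothetical, and the privileged register write itself is omitted because it cannot run in userland.

```c
#include <stdint.h>

#define X_CR4_PGE	0x00000080ULL	/* CR4 bit 7: global pages enabled */

/*
 * Hypothetical helper: compute the value to load into %cr4 so that
 * everything is unchanged except that global pages are disabled.
 * Loading a value whose PGE bit differs from the current one is what
 * invalidates all TLB entries, including those marked PG_G.
 */
static uint64_t
cr4_pge_cleared(uint64_t cr4)
{
	return (cr4 & ~X_CR4_PGE);
}
```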

Address feedback. I hope I got it all now.

Harbormaster completed remote builds in B67424: Diff 163143. Sep 30 2025, 8:01 PM

It seems that you always build a 4-level intermediate page table. Wouldn't it blow up if the source kernel is running in LA57 mode? [The kernel always expects LA48 on start nonetheless.]

sys/amd64/amd64/kexec_support.c
35	sys/systm.h already provides sys/queue.h
66	So what does do_pte mean? From my reading of the code, do not fill ptes?
76	Generally, int is the wrong type for the result of pmap_XXX_index(). It should be vm_pindex_t.
137	We usually write this as `for (;;)`
199	Lines need to be properly wrapped
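
For readers unfamiliar with the index helpers mentioned in the comment on line 76, the sketch below shows how a virtual address splits into per-level table indices under 4-level (LA48) paging. It is a userland approximation, not the kernel's pmap code: the shift values match the amd64 PAGE_SHIFT, PDRSHIFT, PDPSHIFT and PML4SHIFT constants, uint64_t stands in for vm_pindex_t, and the x_-prefixed names are hypothetical.

```c
#include <stdint.h>

#define X_PAGE_SHIFT	12	/* 4 KiB pages */
#define X_PDRSHIFT	21	/* 2 MiB per PD entry */
#define X_PDPSHIFT	30	/* 1 GiB per PDP entry */
#define X_PML4SHIFT	39	/* 512 GiB per PML4 entry */
#define X_IDX_MASK	0x1ffULL	/* 512 entries per table level */

typedef uint64_t x_pindex_t;	/* stand-in for vm_pindex_t */

/* Index into the PML4 page for 'va'. */
static x_pindex_t
x_pml4e_index(uint64_t va)
{
	return ((va >> X_PML4SHIFT) & X_IDX_MASK);
}

/* Index into the page-directory-pointer page for 'va'. */
static x_pindex_t
x_pdpe_index(uint64_t va)
{
	return ((va >> X_PDPSHIFT) & X_IDX_MASK);
}

/* Index into the page-directory page for 'va'. */
static x_pindex_t
x_pde_index(uint64_t va)
{
	return ((va >> X_PDRSHIFT) & X_IDX_MASK);
}

/* Index into the page-table page for 'va'. */
static x_pindex_t
x_pte_index(uint64_t va)
{
	return ((va >> X_PAGE_SHIFT) & X_IDX_MASK);
}
```

Using an unsigned 64-bit index type avoids sign and width surprises when the index is later multiplied or compared against table sizes, which is the point of the review comment.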

In D51623#1209045, @kib wrote:

It seems that you always build a 4-level intermediate page table. Wouldn't it blow up if the source kernel is running in LA57 mode? [The kernel always expects LA48 on start nonetheless.]

You're right. I'll need to force it down to LA48 when entering the new page table. I'll have to check on this, as I'm not familiar with configuring LA57 mode.

sys/amd64/amd64/kexec_support.c
66	do_pte means to fill in PTEs. Line 103 does an early-continue on !do_pte, and the rest of the code after that is to fill in the PTEs. If `do_pte` is true then there's no guarantee that the backing store is contiguous.
76	Thanks, I'll fix that.
sys/x86/x86/intr_machdep.c
262	I may be wrong, but my reading of the reference is that the APIC can only be completely reset by a hardware reset, and a software reset doesn't clear pending interrupts.

In D51623#1209355, @jhibbits wrote:

In D51623#1209045, @kib wrote:

It seems that you always build a 4-level intermediate page table. Wouldn't it blow up if the source kernel is running in LA57 mode? [The kernel always expects LA48 on start nonetheless.]

You're right. I'll need to force it down to LA48 when entering the new page table. I'll have to check on this, as I'm not familiar with configuring LA57 mode.

Basically, the code needs to drop from long mode into protected mode with paging disabled in order to turn off the CR4.LA57 bit. Put another way, this would be the reverse of what la57_trampoline() does.

I might suggest, so as not to delay the commit even more, simply refusing kexec for now if we are in LA57. Somebody can then work out the missing code in the trampoline later.

sys/amd64/amd64/kexec_support.c
66	I mean, describe this in the source code comment.
sys/x86/x86/intr_machdep.c
262	Which APICs? LAPICs or IOAPICs? BTW, IOAPICs in modern times are often PCI devices, so there is a chance that they might be reset by some of the normal PCI methods (FLR, power reset, or even link re-training); see pci_reset_child().

In D51623#1209399, @kib wrote:

I might suggest, so as not to delay the commit even more, simply refusing kexec for now if we are in LA57. Somebody can then work out the missing code in the trampoline later.

Good idea, I'll guard on that.
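
A guard of the kind agreed on here might look like the sketch below. It is only an approximation of the idea, not the committed check: it takes the CR4 value as a parameter so it can run in userland, tests the LA57 bit (CR4 bit 12), and returns ENOTSUP. The function name, the choice of error code and the X_CR4_LA57 macro name are all assumptions.

```c
#include <errno.h>
#include <stdint.h>

#define X_CR4_LA57	0x00001000ULL	/* CR4 bit 12: 5-level paging active */

/*
 * Hypothetical pre-flight check: refuse kexec while the running kernel
 * uses 5-level paging, because the trampoline only builds a 4-level
 * table and cannot yet switch LA57 off.
 */
static int
kexec_paging_check(uint64_t cr4)
{
	if ((cr4 & X_CR4_LA57) != 0)
		return (ENOTSUP);
	return (0);
}
```

Failing early at load time keeps the unsupported case out of the trampoline, where an error could no longer be reported.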

sys/amd64/amd64/kexec_support.c
66	It's mentioned immediately above, but I'll wordsmith that. I'll also add a comment at the PTE creation point.
sys/x86/x86/intr_machdep.c
262	For the LAPIC, 10.4.7.2 (Page 10-10) says pending interrupts are held and require masking or handling by the CPU. However, 10.4.7.3 states that the post-INIT reset state is the same as the power-on reset state, modulo the APIC ID, so the problem we saw may be caused by the I/O APIC. The document I have access to right now doesn't cover I/O APIC state after reset, or how to reset the I/O APIC, so I'm mostly going empirically. I just double-checked the Linux source, and it does this to put the I/O APIC back into "legacy mode".

Address @kib's feedback.

Harbormaster completed remote builds in B67596: Diff 163663. Oct 6 2025, 9:16 PM

kib added inline comments. Oct 8 2025, 5:42 AM

sys/amd64/amd64/kexec_support.c
66	This is still not handled. In fact, the whole last sentence is too cryptic/does not explain anything.
153	Still not done.
sys/amd64/amd64/kexec_tramp.S
54	I do not quite follow this. The code overwrites the argument with the address of kexec_saved_image, and there is no way to restore the argument. Why, then, is the argument needed at all?
72	So mfence is still there, without explanation. I do not believe in magic.

Address @kib's feedback further. I couldn't reproduce the problem we solved with mfence, so I removed it.

Harbormaster completed remote builds in B67771: Diff 164163. Oct 14 2025, 7:33 PM

kib added inline comments. Fri, Oct 17, 7:08 AM

sys/amd64/amd64/kexec_tramp.S
54	This is still not handled, or at least not explained.
76	You removed mfence, but wbinvd is still there.

Address @kib's feedback. kexec_do_reboot() does not take an argument, and has not since the first commit, so remove the argument from the prototype. Removed the wbinvd from the trampoline.

Harbormaster completed remote builds in B67936: Diff 164597. Mon, Oct 20, 2:04 PM

kib added inline comments. Fri, Oct 24, 8:27 PM

sys/amd64/amd64/kexec_tramp.S
52	Why do you load the address of kexec_saved_image into %rdi and then copy it into %r9? If you change %rdi to %r9 in line 52 below, I do not see other uses of %rdi with that value.
56	What are these magic 24 and 16, 8 below? Please add a comment at least, if they cannot be symbolized.
66	decq already sets the %rflags, so I do not think you need explicit cmpq

Addressing today's notes should not change the correctness of the asm code, but it is very desirable.

This revision is now accepted and ready to land. Fri, Oct 24, 8:28 PM

Closed by commit rG16db4c6fff45: amd64: Add kexec support (authored by jhibbits). Mon, Oct 27, 2:35 PM

This revision was automatically updated to reflect the committed changes.

jhibbits added a commit: rG16db4c6fff45: amd64: Add kexec support.

ehem_freebsd_m5p.com added a subscriber: ehem_freebsd_m5p.com. Mon, Oct 27, 10:17 PM

ehem_freebsd_m5p.com added inline comments.

sys/x86/x86/intr_machdep.c
261	This condition can never be false. Any x86 interrupt controller that has actual interrupts must implement this function (i.e., the lapic pseudo-PIC omits it, but it doesn't handle actual interrupts).

ehem_freebsd_m5p.com added inline comments. Mon, Oct 27, 10:27 PM

sys/x86/x86/intr_machdep.c
262	Is `PIC_EOI` appropriate here? Unless a driver leaves an interrupt behind there shouldn't be a need and `PIC_NO_EOI` would be appropriate.

ehem_freebsd_m5p.com mentioned this in D47745: intr/x86: merge pic_{dis,en}able_source() call into pic_{dis,en}able_intr(). Tue, Oct 28, 12:24 AM

Diff 165139

sys/amd64/amd64/genassym.c

sys/amd64/amd64/kexec_support.c

sys/amd64/amd64/kexec_tramp.S

sys/amd64/include/kexec.h

sys/conf/files.amd64

sys/x86/include/intr_machdep.h

sys/x86/x86/intr_machdep.c
