Page MenuHomeFreeBSD

amd64: Add kexec support
Needs ReviewPublic

Authored by jhibbits on Jul 29 2025, 6:56 PM.
Tags
None
Referenced Files
F132500208: D51623.id164163.diff
Fri, Oct 17, 10:39 AM
F132469380: D51623.id164163.diff
Fri, Oct 17, 4:54 AM
Unknown Object (File)
Thu, Oct 16, 7:27 PM
Unknown Object (File)
Sun, Oct 12, 12:21 PM
Unknown Object (File)
Fri, Oct 10, 9:00 PM
Unknown Object (File)
Fri, Oct 10, 9:00 PM
Unknown Object (File)
Fri, Oct 10, 9:00 PM
Unknown Object (File)
Fri, Oct 10, 9:00 PM
Subscribers

Details

Summary

The biggest difference between this and arm64 kexec is that we can't
disable the MMU for amd64, we have to instead create a new "safe" page
table that the trampoline and "child" kernel can use. This requires a
lot more work to create identity mappings, etc.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 67771
Build 64654: arc lint + arc unit

Event Timeline

jhibbits created this revision.

I only looked briefly over the patch, so this is not a review, but a bunch of random trivial comments.

One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.

sys/amd64/amd64/kexec_support.c
34

systm.h should go first, param.h is not needed, other sys includes must be ordered alphabetically

85

Comment style is wrong

93

We call such things pdp/pde/pte in amd64 code.

sys/amd64/amd64/kexec_tramp.S
34

This must be removed.

58

?

74

Why is this needed?

75

And this?

82

Same question

sys/amd64/amd64/machdep.c
227 ↗(On Diff #159420)

Why is this needed, and why it appears in this patch even if needed?

sys/x86/x86/intr_machdep.c
262

Why it is not enough to do cli on all cpus? Your stop code does that on all other cores.

jhibbits added inline comments.
sys/amd64/amd64/kexec_tramp.S
58

Oops, this was debug from initial development that I forgot to remove.

75

It's possible only one of the above is necessary. We ran into problems at reboot that looked cache related, so I went extreme paranoia. I'll test again removing each, and see if it still works correctly.

82

This one may not be necessary, since everything is being done on this single core. I had added the wbinvd early on before noticing the problem I was experiencing was actually related to the TLB (PG_G entries not being flushed, so causing fun chaos).

sys/amd64/amd64/machdep.c
227 ↗(On Diff #159420)

I'll move this out, I had added it to minimize the loader.kboot diff, but it's certainly not necessary for this diff.

sys/x86/x86/intr_machdep.c
262

This disables the interrupt at the IO-APIC, in case the driver didn't do so in the shutdown handler. This avoids the "reserved" interrupt panics in the new kernel.

One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.

Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().

If you mean it should execute from an AP, is there a reason for that?

One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.

Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().

If you mean it should execute from an AP, is there a reason for that?

Yes, of course I mean 'kexec must occur on BSP'. Silly thinko.

In D51623#1181601, @kib wrote:

One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.

Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().

If you mean it should execute from an AP, is there a reason for that?

Yes, of course I mean 'kexec must occur on BSP'. Silly thinko.

I think the sched_bind(curthread, CPU_FIRST()) in kern_reboot() should suffice, then, but correct me if I'm wrong.

When we eventually add rescue support, to run from panic, that might need to change, but would likely be a different entry, too.

In D51623#1181601, @kib wrote:

One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.

Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().

If you mean it should execute from an AP, is there a reason for that?

Yes, of course I mean 'kexec must occur on BSP'. Silly thinko.

I think the sched_bind(curthread, CPU_FIRST()) in kern_reboot() should suffice, then, but correct me if I'm wrong.

When we eventually add rescue support, to run from panic, that might need to change, but would likely be a different entry, too.

Generally CPU_FIRST() is not BSP, but it happen on amd64. On x86 we define BSP as cpuid == 0.

In D51623#1181603, @kib wrote:
In D51623#1181601, @kib wrote:

One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.

Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().

If you mean it should execute from an AP, is there a reason for that?

Yes, of course I mean 'kexec must occur on BSP'. Silly thinko.

I think the sched_bind(curthread, CPU_FIRST()) in kern_reboot() should suffice, then, but correct me if I'm wrong.

When we eventually add rescue support, to run from panic, that might need to change, but would likely be a different entry, too.

Generally CPU_FIRST() is not BSP, but it happen on amd64. On x86 we define BSP as cpuid == 0.

CPU_FIRST() seems to be the BSP on all platforms I've looked at (powerpc, amd64, arm64).

If CPU_FIRST() doesn't provide the BSP, what would? I don't see any cpu_bsp(), or similar. I recall @nwhitehorn doing some work on powerpc several years ago allowing BSP to be non-zero, but I don't recall how that panned out, and that may have only been in the platform side, which the MI side is unconcerned with, so keeps BSP as 0.

CPU_FIRST() seems to be the BSP on all platforms I've looked at (powerpc, amd64, arm64).

If CPU_FIRST() doesn't provide the BSP, what would? I don't see any cpu_bsp(), or similar. I recall @nwhitehorn doing some work on powerpc several years ago allowing BSP to be non-zero, but I don't recall how that panned out, and that may have only been in the platform side, which the MI side is unconcerned with, so keeps BSP as 0.

For amd64 we have IS_BSP() macro, you might e.g. add an assert like MPASS(IS_BSP()) on the kexec MD path.

BTW, is there any code that makes sure that kexec-ed image does not override EFI runtime memory? If not, we must force-disable EFIRT in the execed kernel.

sys/amd64/amd64/kexec_support.c
147

This is pdpe (etc) as well.

153

These two lines should be multi-line comment block.

188

This line assumes that NBPDP is 1G. You might use howmany() to avoid hard-coding that.

264

Comment style is wrong.

sys/amd64/amd64/kexec_tramp.S
39

Also note that interrupts must be disabled.

64

Use size suffixes consistently (see other comment)

75

I do not believe either of the instructions are needed. If their presence changes something, there is a bug somewhere else.

78

Same there.

80

Note that mov to %cr3 flushes TLB except PG_G entries, so the comment is somewhat misplaced.

82

Other instructions use size suffix, I would write this one as andq too.

sys/x86/x86/intr_machdep.c
262

At least add a comment.

But, shouldn't the reinit of IOAPICs in the exec-ed kernel prevent this problem?
I think there is a bug somewhere, either in our init sequence, or in kexec, if such workaround is needed.

Another question: you can read the description about assumptions for hammer_time() and pmap initialization in the comment in amd64/machdep.c: 1264, right above the amd64_loadaddr() helper. Note the last enumeration item about the memory block after the loaded stuff (the slop). How it is ensured that kexec-ed kernel is provided enough slop to run the early allocator without problems?

In D51623#1184952, @kib wrote:

BTW, is there any code that makes sure that kexec-ed image does not override EFI runtime memory? If not, we must force-disable EFIRT in the execed kernel.

How is that any different than a normal FreeBSD kernel? Or are you worried about Linix (in which case the EFIRT concern confuses me).... How would it override that memory? It, like the normal FreeBSD kernel, has to honor it. I don't think it's any different than the kexec from Linux case which passes the memory run-time in via the normal loader mechanisms. I think that jhibbits has preliminary hacks to my kboot code to do that from FreeBSD.

In D51623#1184965, @kib wrote:

Another question: you can read the description about assumptions for hammer_time() and pmap initialization in the comment in amd64/machdep.c: 1264, right above the amd64_loadaddr() helper. Note the last enumeration item about the memory block after the loaded stuff (the slop). How it is ensured that kexec-ed kernel is provided enough slop to run the early allocator without problems?

Same way that we do it for a Linux kexec: loader.kboot knows, just like loader.efi knows, and arranges for enough space.

jhibbits added inline comments.
sys/amd64/amd64/kexec_support.c
147

This whole thing largely shamelessly taken from loader. Will update to kernel style.

sys/amd64/amd64/kexec_tramp.S
80

I'll put a blank line between the movq above and the comment, because the comment is for the %cr4 twiddling, which flushes the whole TLB including global pages.

Address feedback. I hope I got it all now.

It seems that you always build 4-level intermediate page table. Wouldn't it blow up if the source kernel is running in LA57 mode? [Kernel always expect LA48 on start nonetheless]

sys/amd64/amd64/kexec_support.c
35

sys/systm.h already provides sys/queue.h

66

So what does do_pte mean? From my reading of the code, do not fill ptes?

76

Generally int is the wrong type for result pf pmap_XXX_index(). It should be vm_pindex_t

137

We usually write this as for (;;)

199

Lines need to be properly wrapped

In D51623#1209045, @kib wrote:

It seems that you always build 4-level intermediate page table. Wouldn't it blow up if the source kernel is running in LA57 mode? [Kernel always expect LA48 on start nonetheless]

You're right. I'll need to force it down to LA48 when entering the new page table. I'll have to check on this, as I'm not familiar with configuring LA57 mode.

sys/amd64/amd64/kexec_support.c
66

do_pte means to fill in PTEs. Line 103 does an early-continue on !do_pte, and the rest of the code after that is to fill in the PTEs. If do_pte is true then there's no guarantee that the backing store is contiguous.

76

Thanks, I'll fix that.

sys/x86/x86/intr_machdep.c
262

I may be wrong, but my reading of the reference is that the APIC can only be completely reset by a hardware reset, and a software reset doesn't clear pending interrupts.

In D51623#1209045, @kib wrote:

It seems that you always build 4-level intermediate page table. Wouldn't it blow up if the source kernel is running in LA57 mode? [Kernel always expect LA48 on start nonetheless]

You're right. I'll need to force it down to LA48 when entering the new page table. I'll have to check on this, as I'm not familiar with configuring LA57 mode.

Basically the code needs to dive into the protected non-paged mode from long mode, to turn off the CR4.LA57 bit. Putting it other way, this would be a reverse to what la57_trampoline() does.

I might suggest, to not delay the commit even more, simply refuse kexec for now if we are in LA57. Then somebody would work out the missing code in the trampoline later.

sys/amd64/amd64/kexec_support.c
66

I mean, describe this in the source code comment.

sys/x86/x86/intr_machdep.c
262

Which APICs? LAPICs or IOAPICs?

BTW, IOAPICs in modern times are often pci devices, so there is chance that they might be reset by some of the normal pci methods (FLR or power reset, or even line re-training), see pci_reset_child().

In D51623#1209399, @kib wrote:

I might suggest, to not delay the commit even more, simply refuse kexec for now if we are in LA57. Then somebody would work out the missing code in the trampoline later.

Good idea, I'll guard on that.

sys/amd64/amd64/kexec_support.c
66

It's mentioned immediately above, but I'll wordsmith that. I'll also add a comment at the PTE creation point.

sys/x86/x86/intr_machdep.c
262

For the LAPIC, 10.4.7.2 (Page 10-10) says pending interrupts are held and require masking or handling by the CPU. Though, 10.4.7.3 states that post-INIT reset state is the same as power-on reset, modulo the APIC ID, so the problem we saw may be caused by IO APIC, and the document I have access to right now doesn't include I/O APIC state after reset, or how to reset the I/O APIC, so I'm mostly going empirically.

I just double-checked the Linux source, and it does this to put the I/O APIC back into "legacy mode"

sys/amd64/amd64/kexec_support.c
66

This is still not handled. In fact, the whole last sentence is too cryptic/does not explain anything.

153

Still not done.

sys/amd64/amd64/kexec_tramp.S
54

I do not quite follow this. Code overrides the argument with the address of the kexec_saved_image, and there is no way to restore the argument. Then why the argument is needed at all?

72

So mfence is still there, without explanation. I do not believe in magic.

Address @kib's feedback further. I didn't reproduce the problem we solved with mfence, so removed that.

sys/amd64/amd64/kexec_tramp.S
54

This is still not handled, or at least not explained.

75

You removed mfence, but wbinvd is still there.