The biggest difference between this and arm64 kexec is that we can't
disable the MMU for amd64, we have to instead create a new "safe" page
table that the trampoline and "child" kernel can use. This requires a
lot more work to create identity mappings, etc.
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 67771 Build 64654: arc lint + arc unit
Event Timeline
I only looked briefly over the patch, so this is not a review, but a bunch of random trivial comments.
One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
34 | systm.h should go first, param.h is not needed, other sys includes must be ordered alphabetically | |
85 | Comment style is wrong | |
93 | We call such things pdp/pde/pte in amd64 code. | |
sys/amd64/amd64/kexec_tramp.S | ||
34 | This must be removed. | |
58 | ? | |
74 | Why is this needed? | |
75 | And this? | |
82 | Same question | |
sys/amd64/amd64/machdep.c | ||
227 ↗ | (On Diff #159420) | Why is this needed, and why it appears in this patch even if needed? |
sys/x86/x86/intr_machdep.c | ||
262 | Why it is not enough to do cli on all cpus? Your stop code does that on all other cores. |
sys/amd64/amd64/kexec_tramp.S | ||
---|---|---|
58 | Oops, this was debug from initial development that I forgot to remove. | |
75 | It's possible only one of the above is necessary. We ran into problems at reboot that looked cache related, so I went extreme paranoia. I'll test again removing each, and see if it still works correctly. | |
82 | This one may not be necessary, since everything is being done on this single core. I had added the wbinvd early on before noticing the problem I was experiencing was actually related to the TLB (PG_G entries not being flushed, so causing fun chaos). | |
sys/amd64/amd64/machdep.c | ||
227 ↗ | (On Diff #159420) | I'll move this out, I had added it to minimize the loader.kboot diff, but it's certainly not necessary for this diff. |
sys/x86/x86/intr_machdep.c | ||
262 | This disables the interrupt at the IO-APIC, in case the driver didn't do so in the shutdown handler. This avoids the "reserved" interrupt panics in the new kernel. |
One global question I have, most likely you must execute the handoff and start executing the new kernel on AP, not BSPs. In other words. the reboot must migrate to AP if it not already did it.
Do you mean it needs to migrate to the BSP, and not run on the APs? kern_reboot() already binds to CPU_FIRST().
If you mean it should execute from an AP, is there a reason for that?
I think the sched_bind(curthread, CPU_FIRST()) in kern_reboot() should suffice, then, but correct me if I'm wrong.
When we eventually add rescue support, to run from panic, that might need to change, but would likely be a different entry, too.
Generally CPU_FIRST() is not BSP, but it happen on amd64. On x86 we define BSP as cpuid == 0.
CPU_FIRST() seems to be the BSP on all platforms I've looked at (powerpc, amd64, arm64).
If CPU_FIRST() doesn't provide the BSP, what would? I don't see any cpu_bsp(), or similar. I recall @nwhitehorn doing some work on powerpc several years ago allowing BSP to be non-zero, but I don't recall how that panned out, and that may have only been in the platform side, which the MI side is unconcerned with, so keeps BSP as 0.
CPU_FIRST() seems to be the BSP on all platforms I've looked at (powerpc, amd64, arm64).
If CPU_FIRST() doesn't provide the BSP, what would? I don't see any cpu_bsp(), or similar. I recall @nwhitehorn doing some work on powerpc several years ago allowing BSP to be non-zero, but I don't recall how that panned out, and that may have only been in the platform side, which the MI side is unconcerned with, so keeps BSP as 0.
For amd64 we have IS_BSP() macro, you might e.g. add an assert like MPASS(IS_BSP()) on the kexec MD path.
BTW, is there any code that makes sure that kexec-ed image does not override EFI runtime memory? If not, we must force-disable EFIRT in the execed kernel.
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
147 | This is pdpe (etc) as well. | |
153 | These two lines should be multi-line comment block. | |
188 | This line assumes that NBPDP is 1G. You might use howmany() to avoid hard-coding that. | |
264 | Comment style is wrong. | |
sys/amd64/amd64/kexec_tramp.S | ||
39 | Also note that interrupts must be disabled. | |
64 | Use size suffixes consistently (see other comment) | |
75 | I do not believe either of the instructions are needed. If their presence changes something, there is a bug somewhere else. | |
78 | Same there. | |
80 | Note that mov to %cr3 flushes TLB except PG_G entries, so the comment is somewhat misplaced. | |
82 | Other instructions use size suffix, I would write this one as andq too. | |
sys/x86/x86/intr_machdep.c | ||
262 | At least add a comment. But, shouldn't the reinit of IOAPICs in the exec-ed kernel prevent this problem? |
Another question: you can read the description about assumptions for hammer_time() and pmap initialization in the comment in amd64/machdep.c: 1264, right above the amd64_loadaddr() helper. Note the last enumeration item about the memory block after the loaded stuff (the slop). How it is ensured that kexec-ed kernel is provided enough slop to run the early allocator without problems?
How is that any different than a normal FreeBSD kernel? Or are you worried about Linix (in which case the EFIRT concern confuses me).... How would it override that memory? It, like the normal FreeBSD kernel, has to honor it. I don't think it's any different than the kexec from Linux case which passes the memory run-time in via the normal loader mechanisms. I think that jhibbits has preliminary hacks to my kboot code to do that from FreeBSD.
Same way that we do it for a Linux kexec: loader.kboot knows, just like loader.efi knows, and arranges for enough space.
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
147 | This whole thing largely shamelessly taken from loader. Will update to kernel style. | |
sys/amd64/amd64/kexec_tramp.S | ||
80 | I'll put a blank line between the movq above and the comment, because the comment is for the %cr4 twiddling, which flushes the whole TLB including global pages. |
It seems that you always build 4-level intermediate page table. Wouldn't it blow up if the source kernel is running in LA57 mode? [Kernel always expect LA48 on start nonetheless]
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
35 | sys/systm.h already provides sys/queue.h | |
66 | So what does do_pte mean? From my reading of the code, do not fill ptes? | |
76 | Generally int is the wrong type for result pf pmap_XXX_index(). It should be vm_pindex_t | |
137 | We usually write this as for (;;) | |
199 | Lines need to be properly wrapped |
You're right. I'll need to force it down to LA48 when entering the new page table. I'll have to check on this, as I'm not familiar with configuring LA57 mode.
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
66 | do_pte means to fill in PTEs. Line 103 does an early-continue on !do_pte, and the rest of the code after that is to fill in the PTEs. If do_pte is true then there's no guarantee that the backing store is contiguous. | |
76 | Thanks, I'll fix that. | |
sys/x86/x86/intr_machdep.c | ||
262 | I may be wrong, but my reading of the reference is that the APIC can only be completely reset by a hardware reset, and a software reset doesn't clear pending interrupts. |
Basically the code needs to dive into the protected non-paged mode from long mode, to turn off the CR4.LA57 bit. Putting it other way, this would be a reverse to what la57_trampoline() does.
I might suggest, to not delay the commit even more, simply refuse kexec for now if we are in LA57. Then somebody would work out the missing code in the trampoline later.
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
66 | I mean, describe this in the source code comment. | |
sys/x86/x86/intr_machdep.c | ||
262 | Which APICs? LAPICs or IOAPICs? BTW, IOAPICs in modern times are often pci devices, so there is chance that they might be reset by some of the normal pci methods (FLR or power reset, or even line re-training), see pci_reset_child(). |
Good idea, I'll guard on that.
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
66 | It's mentioned immediately above, but I'll wordsmith that. I'll also add a comment at the PTE creation point. | |
sys/x86/x86/intr_machdep.c | ||
262 | For the LAPIC, 10.4.7.2 (Page 10-10) says pending interrupts are held and require masking or handling by the CPU. Though, 10.4.7.3 states that post-INIT reset state is the same as power-on reset, modulo the APIC ID, so the problem we saw may be caused by IO APIC, and the document I have access to right now doesn't include I/O APIC state after reset, or how to reset the I/O APIC, so I'm mostly going empirically. I just double-checked the Linux source, and it does this to put the I/O APIC back into "legacy mode" |
sys/amd64/amd64/kexec_support.c | ||
---|---|---|
66 | This is still not handled. In fact, the whole last sentence is too cryptic/does not explain anything. | |
153 | Still not done. | |
sys/amd64/amd64/kexec_tramp.S | ||
54 | I do not quite follow this. Code overrides the argument with the address of the kexec_saved_image, and there is no way to restore the argument. Then why the argument is needed at all? | |
72 | So mfence is still there, without explanation. I do not believe in magic. |
Address @kib's feedback further. I didn't reproduce the problem we solved with mfence, so removed that.