Details

Reviewers

imp
markj
jhb
kib
jmg
stevek
brooks

Commits

rGe02c57ff374e: kern: Introduce kexec system feature (MI)

Summary

Introduce a new system call and reboot method to support booting a new
kernel directly from FreeBSD.

Linux has included a system call, kexec_load(), since 2005, which
permits booting a new kernel at reboot instead of requiring a full
reboot cycle through the BIOS/firmware. This change brings that same
system call to FreeBSD. Other changesets will add the MD components for
some of our architectures, with stubs for the rest until the MD
components have been written.

kexec_load() supports loading up to 16 memory segments (an arbitrary
limit). Each segment must lie within memory described by vm_phys_segs
(the vm.phys_segs sysctl) and must be contained in a single
vm.phys_segs entry; it cannot span adjacent entries.
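
The containment rule above can be sketched in userland C roughly as follows. This is only an illustration under stated assumptions: `struct phys_range`, `SKETCH_MAX_SEGMENTS`, `segment_count_ok()` and `segment_fits()` are hypothetical names, not the types or functions used in kern_kexec.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one vm.phys_segs entry. */
struct phys_range {
	uint64_t start;	/* first byte of the range */
	uint64_t end;	/* one past the last byte */
};

#define	SKETCH_MAX_SEGMENTS	16	/* the limit quoted in the summary */

/* True if the caller passed an acceptable number of segments. */
static bool
segment_count_ok(size_t nseg)
{
	return (nseg > 0 && nseg <= SKETCH_MAX_SEGMENTS);
}

/*
 * True if [mem, mem + memsz) lies entirely inside ONE range.
 * A segment that spans two adjacent ranges is rejected.
 */
static bool
segment_fits(const struct phys_range *ranges, size_t nranges,
    uint64_t mem, uint64_t memsz)
{
	uint64_t end;
	size_t i;

	assert(ranges != NULL || nranges == 0);
	if (memsz == 0 || mem + memsz < mem)	/* empty, or wraps around */
		return (false);
	end = mem + memsz;
	for (i = 0; i < nranges; i++) {
		if (mem >= ranges[i].start && end <= ranges[i].end)
			return (true);
	}
	return (false);
}
```

With two adjacent ranges [0x1000, 0x5000) and [0x5000, 0x9000), a segment at 0x2000 of size 0x1000 is accepted, while one at 0x4000 of size 0x2000 is rejected because it crosses the boundary between them.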

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhibbits created this revision.Jul 29 2025, 6:55 PM

Herald added a reviewer: brooks.Jul 29 2025, 6:55 PM

Herald added a subscriber: olce.

jhibbits requested review of this revision.Jul 29 2025, 6:55 PM

Harbormaster completed remote builds in B65832: Diff 159416.Jul 29 2025, 6:55 PM

jhibbits added a child revision: D51620: syscalls: Regen after adding kexec_load syscall.Jul 29 2025, 6:55 PM

kib added inline comments.Jul 30 2025, 3:25 PM

sys/kern/kern_kexec.c
31	Includes should be ordered alphabetically.
60	Why can this not be done in machine/kexec.h?
80	I fail to understand this. Also, I cannot match it against the code.
155	This is spelled __unreachable()
188	Is there any other use of kexec_mutex? If not, an atomic would be enough.
197	Can you explain what this code is _supposed_ to do?
219	Use EXTERROR()
251	The comments should be merged.
255	So why is it fine to ignore the tryxbusy failure?
sys/sys/kexec.h
16	What does 'aligned with Linux' mean? And why is it important?

jhibbits marked 3 inline comments as done.Jul 30 2025, 5:15 PM

jhibbits added inline comments.

sys/kern/kern_kexec.c
31	Ack.
60	It should be, it was an oversight. And it can probably even be removed altogether.
80	This was the original design. `listq` was removed from the vm_page structure after design, so it was adapted but the comment wasn't updated. The intent was to save only the first vm_page, and walk the listq from that page for N pages. Something like this is done in the arm64 MD bits.
188	I'll switch to an atomic. I thought about doing that as well.
197	Hm, this comment actually belongs below where that work is actually done (230-~274), I don't know how it ended up here.
255	Can tryxbusy fail if the pages are freshly allocated, wired, and in a new object?
sys/sys/kexec.h
16	The flag value matches Linux's value. Not important, just a note for where the value came from.
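
The suggestion at line 188 (replace a mutex that only guards a "busy" state with an atomic) can be illustrated with a small userland sketch. It uses C11 `<stdatomic.h>` as an assumption for portability; kernel code would use the atomic(9) primitives instead, and `kexec_busy_sketch`, `kexec_try_enter()` and `kexec_leave()` are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: true while one caller is inside the load path. */
static atomic_bool kexec_busy_sketch;

/* Try to become the single active caller; false if someone else is. */
static bool
kexec_try_enter(void)
{
	bool expected = false;

	/* Succeeds only if the flag was false, and sets it to true. */
	return (atomic_compare_exchange_strong(&kexec_busy_sketch,
	    &expected, true));
}

/* Release the flag; only valid for the caller that entered. */
static void
kexec_leave(void)
{
	assert(atomic_load(&kexec_busy_sketch));
	atomic_store(&kexec_busy_sketch, false);
}
```

A second caller that finds the flag set would fail immediately (for example with EBUSY) rather than sleep, which is why a single compare-and-set is enough when nothing else is protected by the lock.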

In the summary, please explain where we should end up after kexec_load().

sys/kern/kern_kexec.c
80	The big comment is 'I do not understand what this comment tries to explain'. I think the comment must be significantly rewritten, explaining what the memory configuration should be after the kexec_load() call.
201	I do not understand this either.

jhibbits marked 2 inline comments as done.Jul 31 2025, 3:18 AM

jhibbits added inline comments.

sys/kern/kern_kexec.c
201	In the code block starting at line 230: allocate all the pages needed to hold the entire image plus any MD pages. Walk the object's page list; if a page in the object lies in the target range, put it into the right position in the object. For instance, if the image has a single segment, which should be loaded to physical address 0x10000000, with a size of 16MB (so, ending at 0x11000000), and a page at index, say, 72 has a physical address of 0x10100000, then the page at index 0x100 will be swapped with the page at index 72, so that the page from index 72 goes to index 0x100, corresponding to the PA 0x10100000 (assuming 4k pages, of course). This way the page is at its "final" location, and is not at risk of being overwritten in the final copy phase.
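
The index arithmetic in that example can be checked with a short sketch. The names `SKETCH_PAGE_SIZE` and `final_index()` are hypothetical, and 4k pages are assumed as in the comment.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_PAGE_SIZE	4096u	/* 4k pages, as in the example */

/*
 * Staging index at which a page with physical address 'pa' is already
 * at its final location, for a segment whose target base is 'seg_base'.
 */
static uint64_t
final_index(uint64_t seg_base, uint64_t pa)
{
	assert(pa >= seg_base);
	return ((pa - seg_base) / SKETCH_PAGE_SIZE);
}
```

For the numbers above, `final_index(0x10000000, 0x10100000)` is 0x100000 / 0x1000 = 0x100, matching the index the page from slot 72 is swapped into.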

imp added inline comments.Jul 31 2025, 4:50 AM

sys/kern/kern_kexec.c
80	After kexec_load, the memory requested by the segments is in the PA requested and won't be used for anything else. The system continues as normal after that. Hmmm. Looks like the other choice was made: load them in an arbitrary address and then copy them at load time... the kexec interface does allow both choices. I agree with kib that this comment is missing a lot of context needed to understand it.
201	I got nothing but errors if I tried to load any page the Linux kernel is using, though... so this explanation is confusing to me.

kib added inline comments.Jul 31 2025, 8:25 AM

sys/kern/kern_kexec.c
201	I can read the code myself; your explanation transliterates the code into plain text, but does not make it more understandable. This 'put the page in the right position' business is not comprehensible. What is the right position? What if the page 'does not overlap with the target range', why is it fine to do nothing with it? If it is fine, why bother with the page that overlaps? Actually, the single sentence from imp gives the hint that there is a trampoline that would override and use the pages at their actual phys addresses when the current kernel can be killed. It probably explains more than the whole comment.
255	If I understand right what imp explained, you do not need this busy/vm_page_replace() code. vm_page_replace() is designed to be used from e.g. page fault code, where the pages renamed between objects can be mapped. At this place, the object is guaranteed to have a single reference, it cannot be mapped, so just do vm_radix_iter_remove()/vm_page_radix_insert() directly. It might make sense to add a helper to vm_page.c so that more page manipulation primitives are available for the helper (they are static in vm_page.c).

jhibbits added inline comments.Jul 31 2025, 12:29 PM

sys/kern/kern_kexec.c
80	After kexec_load() the image is staged in a region so that it can be (efficiently) copied to the final destination at reboot time. The system continues as normal, and if the RB_KEXEC flag is specified to reboot(2) then it will attempt to copy this image to the final location and execute it.
201	If a page overlaps that means it could be overwritten in the copy phase at reboot time. The purpose of sorting the pages is to avoid overwriting and requiring extra trampoline pages during the copy. If the PA of a page in the object (staging) does not overlap with the target PA ranges then there's no need to move it. I'll rework the comments to make it clearer of the end goal in addition to the technicals, to try to reduce the confusion.
255	Ah, okay, I think I understand. Thanks for the explanation, I'll work that into my update.
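
To make the "sorting" idea in the 201 thread concrete, here is a simplified userland model, not the committed code. It assumes a single segment, 4k pages, and distinct page-aligned addresses; `staged[]`, `in_target()` and `place_staged_pages()` are hypothetical names. Index i of the staging array is copied to `base + i * SKETCH_PAGE_SIZE` at reboot, so a staged page that itself lies in the target range is moved to the index that maps onto its own address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_PAGE_SIZE	4096u

/* True if 'pa' lies inside the target range of an 'npages'-page segment. */
static bool
in_target(uint64_t pa, uint64_t base, size_t npages)
{
	return (pa >= base &&
	    pa - base < (uint64_t)npages * SKETCH_PAGE_SIZE);
}

/*
 * staged[i] is the physical address of the page at staging index i.
 * After this runs, every staged page inside the target range sits at
 * the index whose copy destination is that same page.
 */
static void
place_staged_pages(uint64_t *staged, size_t npages, uint64_t base)
{
	uint64_t tmp;
	size_t i, want;

	assert(staged != NULL || npages == 0);
	for (i = 0; i < npages; i++) {
		/* Pages outside the target range cannot be clobbered. */
		while (in_target(staged[i], base, npages)) {
			want = (staged[i] - base) / SKETCH_PAGE_SIZE;
			if (want == i)
				break;	/* already at its final index */
			/* Swap the page into the slot that maps onto it. */
			tmp = staged[want];
			staged[want] = staged[i];
			staged[i] = tmp;
		}
	}
}
```

Each swap puts one more page at its final index, so the loop terminates. Pages whose addresses fall outside the target range are left where they are, which matches the point that such pages need no move.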

zlei added a subscriber: zlei.Aug 1 2025, 4:07 PM

jhibbits marked an inline comment as done.Aug 6 2025, 3:04 PM

jhibbits added inline comments.

sys/sys/kexec.h
12	My biggest question here, soliciting input from everyone: should `mem` and `memsz` be vm_paddr_t and vm_size_t instead of a pointer and size_t respectively? Linux uses unsigned long and size_t, our kboot uses void * and int.

imp added inline comments.Aug 6 2025, 3:11 PM

sys/sys/kexec.h
12	We likely should use vm_paddr_t. Not sure the value of vm_size_t, but it should likely be that or size_t. I used void * because it was simple and easy and I wanted to use %p in early debugging. So pretty weak reasons. So I'm agnostic: I don't care and will adapt the kboot stuff (since it really should match the Linux definitions better).
16	Linux defines the API, and it makes linux emulation easier if we follow the ABI, absent a good reason not to. There's no such reason here.

imp added inline comments.Aug 6 2025, 3:22 PM

sys/kern/kern_kexec.c
201	I'm not sure I understand this. When I did the LinuxBoot stuff, I could never overlap within my segments, nor could I overlap anything the kernel was currently using, though "currently using" was a bit of a fuzzy concept. Given the many layers of abstractions and obfuscation in Linux, I never tracked down if it moved other occupants of these pages out of the way, or if it just stashed the pages somewhere and copied them when it transferred control to its internal boot loader (inferno maybe, I forget).

jhibbits added inline comments.Aug 6 2025, 3:33 PM

sys/kern/kern_kexec.c
201	Linux stages just like we do, and it gathers pages in some way (I tried understanding it, but couldn't). It then uses a single trampoline page that doesn't overlap with the target range(s), which does the copy into the targets and jumps to the entry point, all in asm. So Linux allows overwriting anything, even the entire kernel in-place, whereas I chose not to (all our kernels are relocatable, so the only restrictions are on alignment, not on specific physical address).
sys/sys/kexec.h
12	I'll switch to vm_paddr_t and vm_size_t, then.

brooks added inline comments.Aug 6 2025, 4:43 PM

sys/kern/syscalls.master
3400	Per the header comment we prefer u_long to unsigned long. I know this follows linux and ultimately it's harmless, but it is pretty silly to use a 64-bit count of segments.
3401
3402	I really don't understand the way linux has started using `long` for flags. Are 64-bit only flags useful?

Address feedback. The change to vm_radix_insert, etc, is untested still (will
test in my VM shortly).

Harbormaster completed remote builds in B66037: Diff 159909.Aug 6 2025, 9:02 PM

kib added inline comments.Aug 6 2025, 11:10 PM

sys/sys/kexec.h
12	vm_paddr_t is the right (and perhaps the only possible) option there. I will mention platforms like i386 with PAE, arm with LPAE, or probably ppc 32bit on 64bit CPUs (not sure). From there, memsz arguably should be vm_paddr_t as well, although this makes it use the wrong units conceptually. But we do not have a dual vm_psize_t type.
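
The PAE/LPAE point can be illustrated with fixed-width stand-ins: on such platforms a pointer is 32 bits wide while a physical address may need more. The typedefs and `fits_in_pointer32()` below are hypothetical and only model the widths; they are not the kernel's definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins: a 32-bit pointer value and a wider physical address. */
typedef uint32_t ptr32_sketch_t;
typedef uint64_t paddr_sketch_t;

static_assert(sizeof(paddr_sketch_t) > sizeof(ptr32_sketch_t),
    "the physical address type must be wider than a 32-bit pointer");

/* True if 'pa' survives a round trip through a 32-bit pointer value. */
static bool
fits_in_pointer32(paddr_sketch_t pa)
{
	return ((paddr_sketch_t)(ptr32_sketch_t)pa == pa);
}
```

An address just below 4GB round-trips, but 0x100000000 does not, which is the case a `void *` field for `mem` could not represent on a 32-bit kernel with large physical memory.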

kib added inline comments.Aug 11 2025, 12:54 AM

sys/kern/syscalls.master
3397	The syscall must be disabled for compat32

jhibbits added inline comments.Aug 11 2025, 9:07 PM

sys/kern/syscalls.master
3397	Is there a reason it needs disabled for compat32? Also, how do I disable it in here for compat32? I tried NOCOMPAT32 and that failed. I could change the interface to use u_int instead of u_long, since I don't expect more than 32 flags, nor 4 billion segments (it's capped at 16 segments anyway), if those are the problem.

imp added inline comments.Aug 11 2025, 10:46 PM

sys/kern/syscalls.master
3397	So will you support going from 64-bit kernels to 32-bit kernels? And is this the right vector for that? Otherwise, the interface is too limited to boot a 64-bit kernel with a 32-bit loader if we have to load anything above 4GB. Easier and simpler to just disable it and not worry about supporting it.

kib added inline comments.Aug 12 2025, 12:42 AM

sys/kern/syscalls.master
3397	There are two reasons. First, the syscall ABI is not invariant across 64/32-bit due to its use of types. E.g. the layout of struct kexec_segment differs between 32 and 64-bit, so the array of segments simply has different offsets for its elements. In other words, the syscall args need conversion, which is not provided. Second, a 32-bit host would typically have different assumptions about the boot environment than a 64-bit host, which makes kexec from compat32 simply not feasible. E.g. i386 kexec would need to follow the BIOS boot protocol instead of UEFI. To disable the syscall for compat32, implement it for compat32 as a function returning ENOSYS. Something like this in sys/compat/freebsd32/freebsd32_misc.c: int freebsd32_kexec_load(struct thread *td, struct freebsd32_kexec_load_args *uap) { return (ENOSYS); }
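
The layout argument can be shown with a sketch that models pointer-sized fields using fixed-width integers. The two structs are hypothetical models of how a four-field segment descriptor would look to a 32-bit and a 64-bit process; the field names are illustrative and not copied from sys/sys/kexec.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view from a 32-bit process: 4-byte pointers and sizes. */
struct seg_abi32 {
	uint32_t buf;
	uint32_t bufsz;
	uint32_t mem;
	uint32_t memsz;
};

/* Hypothetical view from a 64-bit process: 8-byte pointers and sizes. */
struct seg_abi64 {
	uint64_t buf;
	uint64_t bufsz;
	uint64_t mem;
	uint64_t memsz;
};

static_assert(sizeof(struct seg_abi32) != sizeof(struct seg_abi64),
    "the two ABIs disagree on the element size");

/* Byte offset of element 'i' in an array of 'elem_size'-byte elements. */
static size_t
elem_offset(size_t elem_size, size_t i)
{
	return (elem_size * i);
}
```

In this model the second array element starts at byte 16 for the 32-bit layout but at byte 32 for the 64-bit one, so a 64-bit kernel reading a 32-bit caller's array without conversion would misinterpret every field.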

brooks added inline comments.Aug 12 2025, 8:53 AM

sys/kern/syscalls.master
3397	The easy way to disable this for compat32 is to add it to the `unimpl` list in sys/compat/freebsd32/syscalls.conf. I'm tempted to add a NOCOMPAT or NATIVEONLY tag to syscalls.master, but no need to wait for that. I can land improvements in this space along with compat/freebsd64.
3401	Corrected from my previous suggestion.

jhibbits marked 3 inline comments as done.Aug 25 2025, 5:13 PM

jhibbits added inline comments.

sys/kern/syscalls.master
3397	I'll have to remember to make changes to stand/defs.mk for powerpc* then, because loader is built as 32-bit always for powerpc targets. I know kboot is different, and doesn't have any constraints on it like firmware does, so just noting this for the future.

Address feedback. I think I got it all.

I tried testing the page swap in kexec_load(), but so far in my VM I haven't hit that condition.

Harbormaster completed remote builds in B66514: Diff 160952.Aug 25 2025, 5:15 PM

One comment, but looks good from a syscall perspective.

sys/compat/freebsd32/syscalls.conf
57	Please add a comment explaining kexec_load's status here. Possibly something like "makes little or no sense on 64-bit hardware"

kib added inline comments.Aug 27 2025, 1:09 AM

sys/kern/kern_kexec.c
155	unreachable() is redundant; panic() is declared as dead2
sys/kern/syscalls.master
3397	WRT CAPENABLED. Do you really intend to enable the syscall for contained processes?
sys/sys/kexec.h
3	A license should be added, either explicit or SPDX
43	This should be surrounded by BEGIN/END_DECLS
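
The remark at line 155 relies on the compiler knowing that a no-return function never comes back. A userland sketch of the same effect, using the C11 `_Noreturn` keyword as a stand-in for the kernel's `__dead2` annotation; `panic_sketch()` and `state_name()` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in for panic(): marked as never returning. */
static _Noreturn void
panic_sketch(const char *msg)
{
	assert(msg != NULL);
	fprintf(stderr, "panic: %s\n", msg);
	abort();
}

static const char *
state_name(int state)
{
	switch (state) {
	case 0:
		return ("idle");
	case 1:
		return ("loaded");
	default:
		panic_sketch("bad state");
		/*
		 * No unreachable marker is needed here: the compiler
		 * already knows control does not continue past the call.
		 */
	}
}
```

Because the default branch ends in a no-return call, the function needs neither a trailing return nor an explicit unreachable annotation.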

jhibbits marked 5 inline comments as done.Sep 8 2025, 2:08 PM

jhibbits added inline comments.

sys/kern/syscalls.master
3397	Good question. I think not, so will remove it. Was most likely a C&P from another syscall.

Address feedback from @kib and @brooks.

Harbormaster completed remote builds in B66887: Diff 161711.Sep 8 2025, 2:11 PM

kib accepted this revision.Sep 14 2025, 7:42 PM

This revision is now accepted and ready to land.Sep 14 2025, 7:42 PM

I didn't see anything in the implementation of the MI parts to give me heartburn.

Closed by commit rGe02c57ff374e: kern: Introduce kexec system feature (MI) (authored by jhibbits).Mon, Oct 27, 2:35 PM

This revision was automatically updated to reflect the committed changes.

jhibbits added a commit: rGe02c57ff374e: kern: Introduce kexec system feature (MI).

kern: Introduce kexec system feature (MI)
ClosedPublic

Revision Contents
Changeset List

Diff 165137

sys/compat/freebsd32/syscalls.conf

sys/conf/files

sys/kern/kern_kexec.c

sys/kern/syscalls.master

sys/sys/kexec.h

sys/sys/reboot.h

sys/sys/smp.h

sys/sys/syscallsubr.h
