kern: Introduce kexec system feature (MI)
Accepted, Public

Authored by jhibbits on Jul 29 2025, 6:55 PM.
Referenced Files
F132421877: D51619.diff
Thu, Oct 16, 7:26 PM
Details

Summary

Introduce a new system call and reboot method to support booting a new
kernel directly from FreeBSD.

Linux has included a system call, kexec_load(), since 2005, which
permits booting a new kernel at reboot instead of requiring a full
reboot cycle through the BIOS/firmware. This change brings that same
system call to FreeBSD. Other changesets will add the MD components for
some of our architectures, with stubs for the rest until the MD
components have been written.

kexec_load() supports loading up to an arbitrary limit of 16 memory
segments. These segments must be contained inside memory bounded in
vm_phys_segs (vm.phys_segs sysctl), and each segment must be contained
within a single vm.phys_segs segment; it cannot cross adjacent segments.
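The containment rule can be sketched as a small userland check. This is only an illustration under stated assumptions, not the code under review: `struct phys_seg`, `struct load_seg`, `KEXEC_MAX_SEGMENTS`, and `segments_ok()` are hypothetical names, and the physical ranges are a plain array rather than the kernel's vm_phys_segs.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KEXEC_MAX_SEGMENTS 16	/* hypothetical name for the limit of 16 */

/* Hypothetical stand-in for one vm_phys_segs entry: [start, end). */
struct phys_seg {
	uint64_t start;
	uint64_t end;
};

/* Hypothetical stand-in for one requested load segment. */
struct load_seg {
	uint64_t mem;	/* target physical address */
	uint64_t memsz;	/* size in bytes */
};

/* True if the segment lies entirely inside a single physical range. */
static bool
seg_in_one_range(const struct load_seg *ls, const struct phys_seg *ps,
    size_t nps)
{
	for (size_t i = 0; i < nps; i++) {
		if (ls->mem >= ps[i].start && ls->mem < ps[i].end &&
		    ls->memsz <= ps[i].end - ls->mem)
			return (true);
	}
	return (false);
}

/* Reject too many segments, or any segment that spans two ranges. */
static bool
segments_ok(const struct load_seg *ls, size_t nls,
    const struct phys_seg *ps, size_t nps)
{
	if (nls > KEXEC_MAX_SEGMENTS)
		return (false);
	for (size_t i = 0; i < nls; i++) {
		if (!seg_in_one_range(&ls[i], ps, nps))
			return (false);
	}
	return (true);
}
```

A segment that starts in one range and ends in the adjacent one is rejected even when the two ranges are physically contiguous, which is the behaviour the summary describes.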

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 66887
Build 63770: arc lint + arc unit

Event Timeline

sys/kern/kern_kexec.c
31

Includes should be ordered alphabetically.

60

Why can't this be done in machine/kexec.h?

80

I fail to understand this. Also I cannot match it against the code.

155

This is spelled __unreachable()

188

Is there any other use of kexec_mutex?
If no, an atomic would be enough.

197

Can you explain what this code is _supposed_ to do?

219

Use EXTERROR()

251

The comments should be merged.

255

So why is it fine to ignore the tryxbusy failure?

sys/sys/kexec.h
16

What does 'aligned with Linux' mean? And why is it important?

jhibbits added inline comments.
sys/kern/kern_kexec.c
31

Ack.

60

It should be, it was an oversight. And it can probably even be removed altogether.

80

This was the original design. listq was removed from the vm_page structure after design, so it was adapted but the comment wasn't updated.

The intent was to save only the first vm_page, and walk the listq from that page for N pages. Something like this is done in the arm64 MD bits.

188

I'll switch to an atomic. I thought about doing that as well.
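A minimal sketch of what such an atomic guard could look like, assuming the mutex only serialized concurrent loaders. It uses C11 `<stdatomic.h>` in userland rather than the kernel's atomic(9) primitives, and `kexec_loading`, `kexec_load_enter()`, and `kexec_load_exit()` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag: nonzero while a load is in progress. */
static atomic_int kexec_loading;

/* Try to become the only loader; returns false if one is already running. */
static bool
kexec_load_enter(void)
{
	int expected = 0;

	return (atomic_compare_exchange_strong(&kexec_loading, &expected, 1));
}

/* Release the guard taken by kexec_load_enter(). */
static void
kexec_load_exit(void)
{
	atomic_store(&kexec_loading, 0);
}
```

A caller that loses the race can simply fail with an error instead of sleeping, which is why a full mutex is not needed for this pattern.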

197

Hm, this comment actually belongs below where that work is actually done (230-~274), I don't know how it ended up here.

255

Can tryxbusy fail if the pages are freshly allocated, wired, and in a new object?

sys/sys/kexec.h
16

The flag value matches Linux's value. Not important, just a note for where the value came from.

In the summary, please explain where we should end up after kexec_load().

sys/kern/kern_kexec.c
80

About the big comment: I do not understand what it tries to explain.

I think the comment must be significantly rewritten explaining what should be the memory configuration from the kexec_load() call.

201

I do not understand this either.

jhibbits added inline comments.
sys/kern/kern_kexec.c
201

In the code block starting at line 230:

  • Allocate all the pages needed to hold the entire image plus any MD pages.
  • Walk the object's page list. If a page in the object lies in the target range, put it into the right position in the object. For instance, suppose the image has a single segment to be loaded at physical address 0x10000000 with a size of 16MB (so ending at 0x11000000). If the page at index 72 has a physical address of 0x10100000, it is swapped with the page at index 0x100, so that the page from index 72 ends up at index 0x100, which corresponds to PA 0x10100000 (assuming 4k pages, of course). This way the page is at its "final" location and is not at risk of being overwritten in the final copy phase.
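The index arithmetic in that example can be checked with a small userland model. This is a sketch under assumptions, not the code at line 230: the staging object is modeled as a plain array of physical addresses, and `target_index()`, `place_overlapping()`, and `PAGE_SIZE_4K` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096u	/* 4k pages, as in the example */

/* Index in the staging object that corresponds to physical address pa. */
static size_t
target_index(uint64_t seg_base, uint64_t pa)
{
	return ((size_t)((pa - seg_base) / PAGE_SIZE_4K));
}

/*
 * Move every staged page whose physical address falls inside the target
 * range to the index matching that address.  pages[i] holds the physical
 * address of the page at staging index i; addresses are assumed distinct.
 */
static void
place_overlapping(uint64_t *pages, size_t npages, uint64_t seg_base)
{
	uint64_t seg_end = seg_base + (uint64_t)npages * PAGE_SIZE_4K;

	for (size_t i = 0; i < npages; i++) {
		while (pages[i] >= seg_base && pages[i] < seg_end) {
			size_t want = target_index(seg_base, pages[i]);
			uint64_t tmp;

			if (want == i)
				break;	/* already in its final slot */
			tmp = pages[want];
			pages[want] = pages[i];
			pages[i] = tmp;
		}
	}
}
```

With a base of 0x10000000, the address 0x10100000 is 0x100000 bytes into the segment, and 0x100000 / 4096 = 0x100, matching the index given in the example.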

sys/kern/kern_kexec.c
80

After kexec_load, the memory requested by the segments is in the PA requested and won't be used for anything else. The system continues as normal after that.

Hmmm. Looks like the other choice was made: load them at an arbitrary address and then copy them at load time... the kexec interface does allow both choices.

I agree with kib that this comment is missing a lot of context needed to understand it.

201

I got nothing but errors if I tried to load any page the Linux kernel is using though... so this explanation is confusing to me.

sys/kern/kern_kexec.c
201

I can read the code myself, your explanation transliterates the code into plain text, but does not make it more understandable.

This 'put the page in the right position' business is not comprehensible. What is the right position? What if the page 'does not overlap with the target range'; why is it fine to do nothing with it? If it is fine, why bother with the page that overlaps?

Actually the single sentence from imp gives the hint that there is a trampoline that would overwrite and use the pages at their actual physical addresses once the current kernel can be killed. It probably explains more than the whole comment.

255

If I understand right what imp explained, you do not need this busy/vm_page_replace() code. vm_page_replace() is designed to be used from e.g. page fault code, where the pages renamed between objects can be mapped.

At this place, the object is guaranteed to have a single reference and cannot be mapped, so just do vm_radix_iter_remove()/vm_page_radix_insert() directly. It might make sense to add a helper to vm_page.c so that more page manipulation primitives are available to the helper (they are static in vm_page.c).

sys/kern/kern_kexec.c
80

After kexec_load() the image is staged in a region so that it can be (efficiently) copied to the final destination at reboot time. The system continues as normal, and if the RB_KEXEC flag is specified to reboot(2) then it will attempt to copy this image to the final location and execute it.
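The copy at reboot time can be pictured with a toy model. This is not the MD trampoline, only a sketch under assumptions: physical memory is an array with one int of "contents" per page frame, and `copy_to_final()` is a hypothetical name.

```c
#include <stddef.h>

/*
 * mem[] models physical memory, one int of "contents" per page frame.
 * staged[i] is the frame number currently holding image page i; the image
 * must end up in frames base .. base + npages - 1.
 */
static void
copy_to_final(int *mem, const size_t *staged, size_t npages, size_t base)
{
	for (size_t i = 0; i < npages; i++) {
		if (staged[i] == base + i)
			continue;	/* already at its final frame */
		mem[base + i] = mem[staged[i]];
	}
}
```

The loop is only safe when every staged frame that lies inside the destination range already sits at its own final index. Otherwise an earlier iteration could overwrite a staged page before it has been copied.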

201

If a page overlaps that means it could be overwritten in the copy phase at reboot time. The purpose of sorting the pages is to avoid overwriting and requiring extra trampoline pages during the copy. If the PA of a page in the object (staging) does not overlap with the target PA ranges then there's no need to move it.

I'll rework the comments to make the end goal clearer in addition to the technical details, to try to reduce the confusion.

255

Ah, okay, I think I understand. Thanks for the explanation, I'll work that into my update.

jhibbits added inline comments.
sys/sys/kexec.h
12

My biggest question here, soliciting input from everyone: should mem and memsz be vm_paddr_t and vm_size_t instead of a pointer and size_t respectively? Linux uses unsigned long and size_t, our kboot uses void * and int.

sys/sys/kexec.h
12

We likely should use vm_paddr_t. Not sure the value of vm_size_t, but it should likely be that or size_t.
I used void * because it was simple and easy and I wanted to use %p in early debugging. So pretty weak reasons.
So I'm agnostic: I don't care and will adapt the kboot stuff (since it really should match the Linux definitions better).

16

Linux defines the API, and it makes linux emulation easier if we follow the ABI, absent a good reason not to. There's no such reason here.

sys/kern/kern_kexec.c
201

I'm not sure I understand this. When I did the LinuxBoot stuff, I could never overlap within my segments, nor could I overlap anything the kernel was currently using, though "currently using" was a bit of a fuzzy concept. Given the many layers of abstractions and obfuscation in Linux, I never tracked down if it moved other occupants of these pages out of the way, or if it just stashed the pages somewhere and copied them when it transferred control to its internal boot loader (inferno maybe, I forget).

sys/kern/kern_kexec.c
201

Linux stages just like we do, and it gathers pages in some way (I tried understanding it, but couldn't). It then uses a single trampoline page that doesn't overlap with the target range(s), which does the copy into the targets and jumps to the entry point, all in asm. So Linux allows overwriting anything, even the entire kernel in-place, whereas I chose not to (all our kernels are relocatable, so the only restrictions are on alignment, not on specific physical address).

sys/sys/kexec.h
12

I'll switch to vm_paddr_t and vm_size_t, then.

sys/kern/syscalls.master
3390

Per the header comment we prefer u_long to unsigned long.

I know this follows Linux and ultimately it's harmless, but it's pretty silly to use a 64-bit count of segments.

3391
3392

I really don't understand the way linux has started using long for flags. Are 64-bit only flags useful?

Address feedback. The change to vm_radix_insert, etc, is untested still (will
test in my VM shortly).

sys/sys/kexec.h
12

vm_paddr_t is the right (and perhaps the only possible) option there. I will mention platforms like i386 with PAE, arm with LPAE, or probably ppc 32bit on 64bit CPUs (not sure).

From there, memsz arguably should be vm_paddr_t as well, although this makes it use the wrong units conceptually. But we do not have a dual vm_psize_t type.
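The PAE/LPAE concern can be illustrated with fixed-width types. This is a sketch under assumptions: `ptr32_t` and `paddr64_t` are hypothetical stand-ins for a 32-bit pointer and a wider physical address type, not the real vm_paddr_t definition.

```c
#include <stdint.h>

/* Hypothetical stand-ins for a 32-bit PAE platform's types. */
typedef uint32_t ptr32_t;	/* what a pointer can hold */
typedef uint64_t paddr64_t;	/* what a physical address can be */

/* Returns nonzero if pa survives a round trip through a 32-bit pointer. */
static int
fits_in_pointer(paddr64_t pa)
{
	return ((paddr64_t)(ptr32_t)pa == pa);
}
```

Any physical address at or above 4GB loses its upper bits in the pointer-sized type, so a pointer-typed mem field could not name such a target.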

sys/kern/syscalls.master
3387

The syscall must be disabled for compat32

sys/kern/syscalls.master
3387

Is there a reason it needs to be disabled for compat32? Also, how do I disable it in here for compat32? I tried NOCOMPAT32 and that failed.

I could change the interface to use u_int instead of u_long, since I don't expect more than 32 flags, nor 4 billion segments (it's capped at 16 segments anyway), if those are the problem.

sys/kern/syscalls.master
3387

So will you support going from 64-bit kernels to 32-bit kernels? And is this the right vector for that? Otherwise, the interface is too limited to boot a 64-bit kernel with a 32-bit loader if we have to load anything above 4GB. Easier and simpler to just disable it and not worry about supporting it.

sys/kern/syscalls.master
3387

There are two reasons:

  1. The syscall ABI is not invariant between 64-bit and 32-bit because of the types used. E.g. the layout of struct kexec_segment differs between 32-bit and 64-bit, so the array of segments has different offsets for its elements. In other words, the syscall args need conversion, which is not provided.
  2. Then, typically, a 32-bit host would have different assumptions about the boot environment than a 64-bit host, which makes kexec from compat32 simply not feasible. E.g. i386 kexec would need to follow the BIOS boot protocol instead of UEFI.
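The layout difference in the first point can be sketched as follows. The native struct is modeled on Linux's struct kexec_segment with pointer and size_t fields. `struct seg_native`, `struct seg32`, and `seg32_to_native()` are hypothetical names, and the 32-bit layout is an assumption about what a compat32 caller would pass.

```c
#include <stddef.h>
#include <stdint.h>

/* Native layout, modeled on Linux's struct kexec_segment. */
struct seg_native {
	const void *buf;
	size_t bufsz;
	const void *mem;
	size_t memsz;
};

/* Hypothetical layout a 32-bit process would pass for the same fields. */
struct seg32 {
	uint32_t buf;
	uint32_t bufsz;
	uint32_t mem;
	uint32_t memsz;
};

/* Convert one 32-bit entry into the native form. */
static void
seg32_to_native(const struct seg32 *in, struct seg_native *out)
{
	out->buf = (const void *)(uintptr_t)in->buf;
	out->bufsz = in->bufsz;
	out->mem = (const void *)(uintptr_t)in->mem;
	out->memsz = in->memsz;
}
```

On an LP64 kernel the native struct is 32 bytes while the 32-bit one is 16, so element i of the array sits at a different offset in each layout. A per-element conversion like this is what would have to be written for compat32.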

To disable the syscall for compat32, implement it for compat32 as a function returning ENOSYS. Something like this in sys/compat/freebsd32/freebsd32_misc.c:

int
freebsd32_kexec_load(struct thread *td, struct freebsd32_kexec_load_args *uap)
{
	return (ENOSYS);
}

sys/kern/syscalls.master
3387

The easy way to disable this for compat32 is to add it to the unimpl list in sys/compat/freebsd32/syscalls.conf. I'm tempted to add a NOCOMPAT or NATIVEONLY tag to syscalls.master, but no need to wait for that. I can land improvements in this space along with compat/freebsd64.

3391

Corrected from my previous suggestion.

jhibbits added inline comments.
sys/kern/syscalls.master
3387

I'll have to remember to make changes to stand/defs.mk for powerpc* then, because loader is built as 32-bit always for powerpc targets. I know kboot is different, and doesn't have any constraints on it like firmware does, so just noting this for the future.

Address feedback. I think I got it all.

I tried testing the page swap in kexec_load(), but so far in my VM I haven't hit that condition.

One comment, but looks good from a syscall perspective.

sys/compat/freebsd32/syscalls.conf
57

Please add a comment explaining kexec_load's status here. Possibly something like "makes little or no sense on 64-bit hardware".

sys/kern/kern_kexec.c
155

unreachable() is redundant; panic() is declared as __dead2.
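A userland sketch of why the marker is unnecessary, using C11 `_Noreturn` as an assumed analogue of the kernel's __dead2 annotation. `fake_panic()` and `segment_kind()` are hypothetical names.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for panic(9): declared as never returning. */
static _Noreturn void
fake_panic(const char *msg)
{
	fprintf(stderr, "panic: %s\n", msg);
	abort();
}

/* No return or unreachable marker is needed after fake_panic(). */
static int
segment_kind(int flag)
{
	switch (flag) {
	case 0:
		return (10);
	case 1:
		return (20);
	default:
		fake_panic("bad flag");
	}
}
```

Because the compiler knows the call cannot return, it does not treat the default case as falling off the end of the function.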

sys/kern/syscalls.master
3387

WRT CAPENABLED. Do you really intend to enable the syscall for contained processes?

sys/sys/kexec.h
3

A license should be added, either explicit or SPDX

43

This should be surrounded by __BEGIN_DECLS/__END_DECLS.

jhibbits added inline comments.
sys/kern/syscalls.master
3387

Good question. I think not, so will remove it. Was most likely a C&P from another syscall.

jhibbits marked an inline comment as done.

Address feedback from @kib and @brooks.

This revision is now accepted and ready to land. Sep 14 2025, 7:42 PM

I didn't see anything in the implementation of the MI parts to give me heartburn.