Details

Reviewers

Commits

rGd49b75d0d67f: Enforce that vdso does not contain runtime relocations
rGb0e0b4e7b379: Enforce that vdso does not consume too much from the shared page
rG8a4bd7f818a0: amd64 ia32 vdso: add unwind annotations to the signal trampoline
rG5b8918fac6fa: amd64 native vdso: add unwind annotations to the signal trampoline
rGeb029587481e: Add kern.elf{32,64}.vdso knobs to enable/disable vdso preloading
rG98c8b6252496: vdso for ia32 on amd64
rG290e05dde058: imgact_aout.c: some style
rGc210ef134792: rtld: load preloaded vdso
rG63fc4e820c86: rtld: extract header validation into new helper check_elf_headers()
rG01c77a436e35: Pass vdso address to userspace
rGab4524b3d7fb: amd64: wrap 64bit sigtramp into vdso
rG9b135747a326: kern.pre.mk: allow to use and customize elfdump during kernel build
rG9da5257e1c04: imgact_aout.c: We do not expect the aout support to be ported

Summary

Currently amd64 only

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

jrtc27 added inline comments.Nov 18 2021, 2:49 AM

sys/tools/amd64_vdso.sh
53	wc -c < elf-vdso.so.1 won't print the file name, if you care about having to have the awk
sys/tools/vdso_wrap.S
57	Can just make this a normal variable rather than the slightly gross "address is the size" approach that is mandated by objcopy (which doesn't work on RISC-V without -fPIE, and the same might be true for AArch64, due to using PC-relative addressing that assumes everything is +/- 1 GiB). Means the SV_DSO_SIG check when reading sv_szsigcode goes away and it can really be an `extern int _binary_elf_vdso_so_1_size`. Another benefit of the template asm approach that didn't occur to me.

jrtc27 added inline comments.Nov 18 2021, 2:52 AM

sys/kern/kern_sharedpage.c
324–338	I wouldn't be surprised if some of the Linuxulator VDSO code can then reuse this, but that's someone else's problem to figure out :)

kib added inline comments.Nov 18 2021, 3:09 AM

sys/tools/vdso_wrap.S
57	IMO the sigcodesz thing is gross, not the address as value. The later is the standard ELF way to get symbol value. I do not believe that PC-relative addressing is relevant there, all arches must be able to take address of something.

jrtc27 added inline comments.Nov 18 2021, 3:11 AM

sys/tools/vdso_wrap.S
57	The point is the code model explicitly assumes that code and data all fit into a 2GiB region, and thus a 32-bit PC-relative offset is sufficient to access any variable. Since the kernel is linked at a high address (but not high enough to be a small negative) and `_binary_elf_vdso_so_1_size` is between 0 and 2048, it is not within 2^32 bytes of PC, and thus you will get relocation out of range errors when linking on such architectures.

jrtc27 added inline comments.Nov 18 2021, 3:14 AM

sys/tools/vdso_wrap.S
57	(Weak symbols are special cased to always use the GOT on AArch64 when building with such a code model, so that weak undef symbols can resolve to 0 and not cause range errors)

kib added inline comments.Nov 18 2021, 3:22 AM

sys/tools/vdso_wrap.S
57	PC is irrelevant there. It is actually resolved at the static linking time, there is no code involved. FWIW, amd64 is also linked with 'kernel memory model', where code is assumed to be in [TOP-2G, TOP) region (TOP=0xfffff...ffff). Are you sure that amd64 and riscv kernel models assume that both text _and_ data are in 2G region? It would be too limiting, I expect that only text has this micro-optimization restriction.

jrtc27 added inline comments.Nov 18 2021, 3:34 AM

sys/tools/vdso_wrap.S
57	Oh good point, I wasn't thinking about the fact that it's in a static initialiser so just a normal R_32/64 relocation. Yes, that's the code model we compile with for AArch64 and RISC-V (linked at 0xffff000000000000 and 0xffffffc000000000 respectively, which are the lowest higher half addresses that are guaranteed to be supported by their MMUs), it covers both code and data and is the "normal" one that's not "assume everything is in the first 2 GiB". There is no kernel code model outside of x86. On AArch64 you get tiny (everything is within 1 MiB), small (everything is within 4 GiB, not sure why the GCC documentation sys 4, seems like it should be 2, but this is the default and the one I'm talking about) and large which is horribly inefficient (64-bit immediate materialisation everywhere). On RISC-V you get medlow (everything fits in a signed 32-bit integer) and medany (everything is within 2 GiB), there is currently no large (but -fPIC/-fPIE gets you a GOT which lets data grow beyond that, so long as text and the GOT can occupy a single 2 GiB region).

Add knobs to disable vdso prelinking.
Fix bug with missed vdso causing NULL deref.

lattera-gmail.com added a subscriber: lattera-gmail.com.Nov 25 2021, 12:27 PM

emaste added inline comments.Nov 30 2021, 4:34 PM

libexec/rtld-elf/rtld.c
2876	We could default stack_flags to `PF_R \| PF_W` for preload objs (and we could avoid `PT_GNU_STACK` in the VDSO if desired) (i.e., there's no point in introducing a new X-by-default stack case that requires `PT_GNU_STACK` to turn it off. If in the future we do need a RWX stack specified by a preload obj for some reason we can use `PT_GNU_STACK` to set it)
2904	Is there any reason we'd want to check this and fail if we find any? I suspect no.

kib marked 13 inline comments as done.Dec 1 2021, 12:27 AM

kib added inline comments.

libexec/rtld-elf/rtld.c
2876	In fact, it so happens that default stack mode for kpreload is 0. Right after I read your comment, I wanted to change it to either PF_R\|_W or PF_R\|_W\|_X, but now I think this is actually not bad. I added a comment to make this more explicit.
2904	Yes, we want to check this, and updated patch contains the check, but not there. But IMO doing it in rtld is too late and either check or the need for relocation bricks the system. I added the checks to the vdso build scripts.
sys/tools/amd64_vdso.sh
53	I prefer awk

Assert that there is no relocs in vdso, at the build stage.
Add some comments in load_kpreload(), about stack_mode and relocs.

So will we need to update qemu-bsd-user for this new signaling method?

In D32960#750447, @imp wrote:

So will we need to update qemu-bsd-user for this new signaling method?

There is no new signalling method, kernel still sets up trampoline, and deliver signal by pushing exactly the same sigframe, and jumping to the same signal trampoline code fragment.

What this change does, it the wrapping of the signal trampoline into real shared object. New rtld can handle both new and old kernels, for old rtld everything looks like an old kernel. If mapping the trampoline is the duty of qemu, then yes, eventually you would want to also provide dso instead of bare signal trampoline. To definitively answer your question, I need to know more how u-qemu handles signal delivery.

Also note that vdso is only implemented for amd64 kernel for now.

Fix compilation without COMPAT_43.

emaste added inline comments.Dec 2 2021, 6:38 PM

sys/conf/kern.pre.mk
49	Do we need to add elfdump to `_elftctools`

Add elfdump to _elftctools

emaste added inline comments.Dec 2 2021, 7:10 PM

sys/conf/vdso_amd64.ldscript

39–40

Wrong flags for PT_DYNAMIC and PT_GNU_EH_FRAME

Type           Offset             VirtAddr           PhysAddr
               FileSiz            MemSiz              Flg    Align
LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
               0x00000000000002e6 0x00000000000002e6  R E    0x10
DYNAMIC        0x0000000000000218 0x0000000000000218 0x0000000000000218
               0x00000000000000b0 0x00000000000000b0  R E    0x8
GNU_EH_FRAME   0x0000000000000194 0x0000000000000194 0x0000000000000194
               0x0000000000000014 0x0000000000000014  R E    0x4

kib added inline comments.Dec 2 2021, 7:15 PM

sys/conf/vdso_amd64.ldscript
39–40	Could you please clarify? What specifically is wrong?

emaste added inline comments.Dec 2 2021, 7:21 PM

sys/conf/vdso_amd64.ldscript
39–40	Oh, nvm, I am mistaken. Maybe worth a (reminder) comment about everything being in the same page.

kib marked 2 inline comments as done.Dec 2 2021, 7:27 PM

kib added inline comments.

sys/conf/vdso_amd64.ldscript
39–40	There is some discussion in the sys/tools/amd64_vdso.sh comment. Are you proposing to expand the text? It would become repeating the linker script.

emaste accepted this revision.Dec 2 2021, 8:39 PM

emaste added inline comments.

sys/amd64/amd64/sigtramp.S
47	I did not verify these
sys/amd64/ia32/ia32_sigtramp.S
51	did not verify these
sys/kern/imgact_elf.c
133	Any reason not to make this 1 everywhere (in the near future)?

This revision is now accepted and ready to land.Dec 2 2021, 8:39 PM

Have you verified that elfdump can be built as a bootstrap tool from non-FreeBSD hosts? You should be able to find out by pushing the commit to a GitHub fork of freebsd-src and letting the on-push action run, which will test both Linux and macOS hosts can build a kernel.

In D32960#751274, @jrtc27 wrote:

Have you verified that elfdump can be built as a bootstrap tool from non-FreeBSD hosts? You should be able to find out by pushing the commit to a GitHub fork of freebsd-src and letting the on-push action run, which will test both Linux and macOS hosts can build a kernel.

I included this change in some WIP that I pushed to GitHub. MacOS failed in bootstrap bmake which does not appear related. The Ubuntu Clang 9 and Clang 10 jobs were successful, although they hide build output so I cannot confirm that elfdump was indeed built/used.

--------------------------------------------------------------
>>> Kernel build for GENERIC started on Thu Dec  2 21:53:52 UTC 2021
--------------------------------------------------------------
===> GENERIC

--------------------------------------------------------------
>>> stage 1: configuring the kernel
--------------------------------------------------------------
Kernel build directory is /home/runner/work/freebsd/build/home/runner/work/freebsd/freebsd/amd64.amd64/sys/GENERIC
Don't forget to do ``make cleandepend && make depend''
0.11user 0.05system 0:00.16elapsed 101%CPU (0avgtext+0avgdata 47752maxresident)k
0inputs+8outputs (0major+21039minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 2.3: build tools
--------------------------------------------------------------
0.10user 0.04system 0:00.14elapsed 104%CPU (0avgtext+0avgdata 47548maxresident)k
0inputs+8outputs (0major+15217minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 3.1: building everything
--------------------------------------------------------------
linking kernel.full
ctfmerge -L VERSION -g -o kernel.full ...
      text      data       bss        dec         hex   filename
  23356772   1857165   4428672   29642609   0x1c44f71   kernel.full
713.10user 53.08system 6:40.14elapsed 191%CPU (0avgtext+0avgdata 816252maxresident)k
59752inputs+2278192outputs (44098major+15938065minor)pagefaults 0swaps
--------------------------------------------------------------
>>> Kernel build for GENERIC completed on Thu Dec  2 22:00:32 UTC 2021
--------------------------------------------------------------
>>> Kernel(s)  GENERIC built in 400 seconds, ncpu: 2, make -j2
--------------------------------------------------------------

In D32960#751276, @emaste wrote:

In D32960#751274, @jrtc27 wrote:

Have you verified that elfdump can be built as a bootstrap tool from non-FreeBSD hosts? You should be able to find out by pushing the commit to a GitHub fork of freebsd-src and letting the on-push action run, which will test both Linux and macOS hosts can build a kernel.

--------------------------------------------------------------
>>> Kernel build for GENERIC started on Thu Dec  2 21:53:52 UTC 2021
--------------------------------------------------------------
===> GENERIC

--------------------------------------------------------------
>>> stage 1: configuring the kernel
--------------------------------------------------------------
Kernel build directory is /home/runner/work/freebsd/build/home/runner/work/freebsd/freebsd/amd64.amd64/sys/GENERIC
Don't forget to do ``make cleandepend && make depend''
0.11user 0.05system 0:00.16elapsed 101%CPU (0avgtext+0avgdata 47752maxresident)k
0inputs+8outputs (0major+21039minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 2.3: build tools
--------------------------------------------------------------
0.10user 0.04system 0:00.14elapsed 104%CPU (0avgtext+0avgdata 47548maxresident)k
0inputs+8outputs (0major+15217minor)pagefaults 0swaps

--------------------------------------------------------------
>>> stage 3.1: building everything
--------------------------------------------------------------
linking kernel.full
ctfmerge -L VERSION -g -o kernel.full ...
      text      data       bss        dec         hex   filename
  23356772   1857165   4428672   29642609   0x1c44f71   kernel.full
713.10user 53.08system 6:40.14elapsed 191%CPU (0avgtext+0avgdata 816252maxresident)k
59752inputs+2278192outputs (44098major+15938065minor)pagefaults 0swaps
--------------------------------------------------------------
>>> Kernel build for GENERIC completed on Thu Dec  2 22:00:32 UTC 2021
--------------------------------------------------------------
>>> Kernel(s)  GENERIC built in 400 seconds, ncpu: 2, make -j2
--------------------------------------------------------------

Thanks. The macOS job on freebsd-src indeed looks slightly temperamental, has the occasional bmake test suite failure, not sure what's up with that. If it builds on Linux then it's probably fine and we can deal with a macOS failure post-commit on the off chance there's something special there.

kib marked 4 inline comments as done.Dec 2 2021, 10:40 PM

kib added inline comments.

sys/amd64/amd64/sigtramp.S
47	You mean, you did not checked the layout of the frame, or you did not verified that unwinding around signal frame work? First fact is more or less about matching register name and corresponding UC_XXX offset name. Second is tough, indeed. For instance, gdb, lldb, non-gnu libunwind match signal handler by comparing code sequence with the known one, so this change seems to be nop for them. On the other hand, in-tree unwinder from libgcc_s was able to catch an exception thrown from a signal handler context.
sys/kern/imgact_elf.c
133	I definitely do not have plans to work on it. It is up to the arch maintainer to do the necessary copy/paste and testing. In principle, I will prepare the arch patch if somebody going to test that.

Fixes for a.out and freebsd4 signal delivery.

This revision now requires review to proceed.Dec 3 2021, 10:56 AM

emaste accepted this revision.Dec 3 2021, 2:44 PM

emaste added inline comments.

sys/kern/imgact_aout.c
166	I guess this comment is not really applicable anymore (we're not going to add new arch a.out support)
sys/kern/imgact_elf.c
133	I just mean that for the kernel to provide `AT_KPRELOAD` `__elfN(vdso)` must be nonzero and `sv_vdso_base` must also be nonzero so there's no problem setting it to 1. I suppose there is some documentary value in having it report 0 to the user if not supported by the arch.

This revision is now accepted and ready to land.Dec 3 2021, 2:44 PM

emaste added a subscriber: jhb.Dec 3 2021, 2:46 PM

kib marked 2 inline comments as done.Dec 3 2021, 2:58 PM

kib added inline comments.

sys/kern/imgact_aout.c
166	It is same as it was in the time the line was written. I want something to plug if people add imgact_aout.c to their files.<arch>. If you suggest to change the message to something else, I do not object.
sys/kern/imgact_elf.c
133	Might be, it is useful to hide the sysctl at all from not supported arches. I will do this.

emaste added inline comments.Dec 3 2021, 2:58 PM

sys/tools/vdso_wrap.S
5	@jhb this was derived from a header carrying a ECATS sponsorship note. The This software was developed by SRI International and the University of Cambridge Computer Laboratory statement seems confusing in this context, should we remove that?

emaste added inline comments.Dec 3 2021, 3:01 PM

sys/kern/imgact_aout.c
166	Yeah just something I noticed in passing and not related to your work.
sys/kern/imgact_elf.c
133	OK, I think that is better.

Only expose kern.elfXX.vdso on amd64.

This revision now requires review to proceed.Dec 3 2021, 3:01 PM

jhb added inline comments.Dec 3 2021, 7:42 PM

sys/amd64/amd64/sigtramp.S
63	Are there CFI annotations for segment registers, fsbase, gsbase, and eflags? (Those are all present in ucontext_t as well and a recent change I have for GDB pulls all of those out of signal frames.)
sys/amd64/ia32/ia32_sigtramp.S
58	Similar question about segment registers, fsbase/gsbase (GDB uses those for TLS), and flags.
sys/tools/vdso_wrap.S
5	From crtbrand.S or the like? (Or maybe the firmware stub?) There's no useful shared content between those files and this. I would just give this a FF copyright from the get go.
37	No need for this if you aren't planning to merge to 12.

emaste added inline comments.Dec 3 2021, 7:45 PM

sys/tools/vdso_wrap.S
5	From the firmware stub

kib marked 8 inline comments as done.Dec 4 2021, 4:15 AM

kib added inline comments.

sys/amd64/amd64/sigtramp.S
63	All listed registers have DWARF register mapping number assigned to them in psABI, so there is a way to specify that in final object. But trying to actually do that resulted in the mess. For instance, i386 psABI specifies 93 as %fs.base, but binutils interpret it as %k0. Clang IAS does know about all segment names, but objdump from binutils report all encoded registers as invalid then. And so on. The only semi-consistent pseudo-registers seems to be fs/gs.base on x86_64, but even there clang managed to fail, it does not recognize the registers name, and I have to use numeric value. On the other hand, I am sure that %eflags are not useful for unwinders, and doubt that segment registers or fs/gsbase are useful. I think really doing that would require some iterations with toolchain maintainers, for now I commented out cfi instructions for segments/flags everywhere, and bases on 32bit.
sys/tools/vdso_wrap.S
5	The code was copied from firmware.S, and the only changes made were the variables rename. I did removed that copyright as suggested, but feel free to request a revert.

Add cfi offsets for amd64 fs/gs bases. Most other registers are also annotated but neutralized due to toolchain mess.
Change error message for a.out !ia32 arch.
Remove ECATS copyright from vdso_wrap.S.

Minor tweaks for 64bit vdso cfi annotations. Add a comment explaining toolchain mess.

kib added inline comments.Dec 4 2021, 3:30 PM

sys/amd64/amd64/sigtramp.S
69	Fixed locally.

jhb added inline comments.Dec 6 2021, 6:36 PM

sys/amd64/amd64/sigtramp.S
63	This is mostly for debuggers and when inspecting in core dumps. That said, GDB will keep using its own unwinder that will unwind all those registers ok even with out CFI (so long as we don't change the sigcode so the sigframe unwinder keeps matching). %eflags is possibly useful when debugging a core dump where a thread had caught a signal and you want to see the register state at the time of the faulting instruction, but that is definitely an obscure case. GDB uses fsbase/gsbase for TLS on amd64/i386, but those likely don't change between the signal handler and the signal frame so missing those is less important. My other interest is around using CFI for kernel exception frames as that removes the need for custom unwinders in kernel debuggers, but when I last tried that I found I couldn't annotate all of the registers (for some reason I couldn't annotate %eflags at the time. and probably it is what you ran into which is that there is a DWARF register number but IAS doesn't support it).