Currently amd64 only
Details
- Reviewers
emaste - Commits
- rGd49b75d0d67f: Enforce that vdso does not contain runtime relocations
rGb0e0b4e7b379: Enforce that vdso does not consume too much from the shared page
rG8a4bd7f818a0: amd64 ia32 vdso: add unwind annotations to the signal trampoline
rG5b8918fac6fa: amd64 native vdso: add unwind annotations to the signal trampoline
rGeb029587481e: Add kern.elf{32,64}.vdso knobs to enable/disable vdso preloading
rG98c8b6252496: vdso for ia32 on amd64
rG290e05dde058: imgact_aout.c: some style
rGc210ef134792: rtld: load preloaded vdso
rG63fc4e820c86: rtld: extract header validation into new helper check_elf_headers()
rG01c77a436e35: Pass vdso address to userspace
rGab4524b3d7fb: amd64: wrap 64bit sigtramp into vdso
rG9b135747a326: kern.pre.mk: allow to use and customize elfdump during kernel build
rG9da5257e1c04: imgact_aout.c: We do not expect the aout support to be ported
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
sys/tools/amd64_vdso.sh | ||
---|---|---|
52 | wc -c < elf-vdso.so.1 won't print the file name, if you care about having to have the awk | |
sys/tools/vdso_wrap.S | ||
56 | Can just make this a normal variable rather than the slightly gross "address is the size" approach that is mandated by objcopy (which doesn't work on RISC-V without -fPIE, and the same might be true for AArch64, due to using PC-relative addressing that assumes everything is +/- 1 GiB). Means the SV_DSO_SIG check when reading sv_szsigcode goes away and it can really be an extern int _binary_elf_vdso_so_1_size. Another benefit of the template asm approach that didn't occur to me. |
sys/kern/kern_sharedpage.c | ||
---|---|---|
324 | I wouldn't be surprised if some of the Linuxulator VDSO code can then reuse this, but that's someone else's problem to figure out :) |
sys/tools/vdso_wrap.S | ||
---|---|---|
56 | IMO the sigcodesz thing is gross, not the address as value. The later is the standard ELF way to get symbol value. I do not believe that PC-relative addressing is relevant there, all arches must be able to take address of something. |
sys/tools/vdso_wrap.S | ||
---|---|---|
56 | The point is the code model explicitly assumes that code and data all fit into a 2GiB region, and thus a 32-bit PC-relative offset is sufficient to access any variable. Since the kernel is linked at a high address (but not high enough to be a small negative) and _binary_elf_vdso_so_1_size is between 0 and 2048, it is not within 2^32 bytes of PC, and thus you will get relocation out of range errors when linking on such architectures. |
sys/tools/vdso_wrap.S | ||
---|---|---|
56 | (Weak symbols are special cased to always use the GOT on AArch64 when building with such a code model, so that weak undef symbols can resolve to 0 and not cause range errors) |
sys/tools/vdso_wrap.S | ||
---|---|---|
56 | PC is irrelevant there. It is actually resolved at the static linking time, there is no code involved. FWIW, amd64 is also linked with 'kernel memory model', where code is assumed to be in [TOP-2G, TOP) region (TOP=0xfffff...ffff). Are you sure that amd64 and riscv kernel models assume that both text _and_ data are in 2G region? It would be too limiting, I expect that only text has this micro-optimization restriction. |
sys/tools/vdso_wrap.S | ||
---|---|---|
56 | Oh good point, I wasn't thinking about the fact that it's in a static initialiser so just a normal R_32/64 relocation. Yes, that's the code model we compile with for AArch64 and RISC-V (linked at 0xffff000000000000 and 0xffffffc000000000 respectively, which are the lowest higher half addresses that are guaranteed to be supported by their MMUs), it covers both code and data and is the "normal" one that's not "assume everything is in the first 2 GiB". There is no kernel code model outside of x86. On AArch64 you get tiny (everything is within 1 MiB), small (everything is within 4 GiB, not sure why the GCC documentation sys 4, seems like it should be 2, but this is the default and the one I'm talking about) and large which is horribly inefficient (64-bit immediate materialisation everywhere). On RISC-V you get medlow (everything fits in a signed 32-bit integer) and medany (everything is within 2 GiB), there is currently no large (but -fPIC/-fPIE gets you a GOT which lets data grow beyond that, so long as text and the GOT can occupy a single 2 GiB region). |
libexec/rtld-elf/rtld.c | ||
---|---|---|
2875 | We could default stack_flags to PF_R | PF_W for preload objs (and we could avoid PT_GNU_STACK in the VDSO if desired) (i.e., there's no point in introducing a new X-by-default stack case that requires PT_GNU_STACK to turn it off. If in the future we do need a RWX stack specified by a preload obj for some reason we can use PT_GNU_STACK to set it) | |
2903 | Is there any reason we'd want to check this and fail if we find any? I suspect no. |
libexec/rtld-elf/rtld.c | ||
---|---|---|
2875 | In fact, it so happens that default stack mode for kpreload is 0. Right after I read your comment, I wanted to change it to either PF_R|_W or PF_R|_W|_X, but now I think this is actually not bad. I added a comment to make this more explicit. | |
2903 | Yes, we want to check this, and updated patch contains the check, but not there. But IMO doing it in rtld is too late and either check or the need for relocation bricks the system. I added the checks to the vdso build scripts. | |
sys/tools/amd64_vdso.sh | ||
52 | I prefer awk |
Assert that there is no relocs in vdso, at the build stage.
Add some comments in load_kpreload(), about stack_mode and relocs.
There is no new signalling method, kernel still sets up trampoline, and deliver signal by pushing exactly the same sigframe, and jumping to the same signal trampoline code fragment.
What this change does, it the wrapping of the signal trampoline into real shared object. New rtld can handle both new and old kernels, for old rtld everything looks like an old kernel. If mapping the trampoline is the duty of qemu, then yes, eventually you would want to also provide dso instead of bare signal trampoline. To definitively answer your question, I need to know more how u-qemu handles signal delivery.
Also note that vdso is only implemented for amd64 kernel for now.
sys/conf/kern.pre.mk | ||
---|---|---|
49 ↗ | (On Diff #99360) | Do we need to add elfdump to _elftctools |
sys/conf/vdso_amd64.ldscript | ||
---|---|---|
39–40 | Wrong flags for PT_DYNAMIC and PT_GNU_EH_FRAME Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align LOAD 0x0000000000000000 0x0000000000000000 0x0000000000000000 0x00000000000002e6 0x00000000000002e6 R E 0x10 DYNAMIC 0x0000000000000218 0x0000000000000218 0x0000000000000218 0x00000000000000b0 0x00000000000000b0 R E 0x8 GNU_EH_FRAME 0x0000000000000194 0x0000000000000194 0x0000000000000194 0x0000000000000014 0x0000000000000014 R E 0x4 |
sys/conf/vdso_amd64.ldscript | ||
---|---|---|
39–40 | Could you please clarify? What specifically is wrong? |
sys/conf/vdso_amd64.ldscript | ||
---|---|---|
39–40 | Oh, nvm, I am mistaken. Maybe worth a (reminder) comment about everything being in the same page. |
sys/conf/vdso_amd64.ldscript | ||
---|---|---|
39–40 | There is some discussion in the sys/tools/amd64_vdso.sh comment. Are you proposing to expand the text? It would become repeating the linker script. |
Have you verified that elfdump can be built as a bootstrap tool from non-FreeBSD hosts? You should be able to find out by pushing the commit to a GitHub fork of freebsd-src and letting the on-push action run, which will test both Linux and macOS hosts can build a kernel.
I included this change in some WIP that I pushed to GitHub. MacOS failed in bootstrap bmake which does not appear related. The Ubuntu Clang 9 and Clang 10 jobs were successful, although they hide build output so I cannot confirm that elfdump was indeed built/used.
-------------------------------------------------------------- >>> Kernel build for GENERIC started on Thu Dec 2 21:53:52 UTC 2021 -------------------------------------------------------------- ===> GENERIC -------------------------------------------------------------- >>> stage 1: configuring the kernel -------------------------------------------------------------- Kernel build directory is /home/runner/work/freebsd/build/home/runner/work/freebsd/freebsd/amd64.amd64/sys/GENERIC Don't forget to do ``make cleandepend && make depend'' 0.11user 0.05system 0:00.16elapsed 101%CPU (0avgtext+0avgdata 47752maxresident)k 0inputs+8outputs (0major+21039minor)pagefaults 0swaps -------------------------------------------------------------- >>> stage 2.3: build tools -------------------------------------------------------------- 0.10user 0.04system 0:00.14elapsed 104%CPU (0avgtext+0avgdata 47548maxresident)k 0inputs+8outputs (0major+15217minor)pagefaults 0swaps -------------------------------------------------------------- >>> stage 3.1: building everything -------------------------------------------------------------- linking kernel.full ctfmerge -L VERSION -g -o kernel.full ... text data bss dec hex filename 23356772 1857165 4428672 29642609 0x1c44f71 kernel.full 713.10user 53.08system 6:40.14elapsed 191%CPU (0avgtext+0avgdata 816252maxresident)k 59752inputs+2278192outputs (44098major+15938065minor)pagefaults 0swaps -------------------------------------------------------------- >>> Kernel build for GENERIC completed on Thu Dec 2 22:00:32 UTC 2021 -------------------------------------------------------------- >>> Kernel(s) GENERIC built in 400 seconds, ncpu: 2, make -j2 --------------------------------------------------------------
Thanks. The macOS job on freebsd-src indeed looks slightly temperamental, has the occasional bmake test suite failure, not sure what's up with that. If it builds on Linux then it's probably fine and we can deal with a macOS failure post-commit on the off chance there's something special there.
sys/amd64/amd64/sigtramp.S | ||
---|---|---|
47 | You mean, you did not checked the layout of the frame, or you did not verified that unwinding around signal frame work? First fact is more or less about matching register name and corresponding UC_XXX offset name. Second is tough, indeed. For instance, gdb, lldb, non-gnu libunwind match signal handler by comparing code sequence with the known one, so this change seems to be nop for them. On the other hand, in-tree unwinder from libgcc_s was able to catch an exception thrown from a signal handler context. | |
sys/kern/imgact_elf.c | ||
133 | I definitely do not have plans to work on it. It is up to the arch maintainer to do the necessary copy/paste and testing. In principle, I will prepare the arch patch if somebody going to test that. |
sys/kern/imgact_aout.c | ||
---|---|---|
166 | I guess this comment is not really applicable anymore (we're not going to add new arch a.out support) | |
sys/kern/imgact_elf.c | ||
133 | I just mean that for the kernel to provide AT_KPRELOAD __elfN(vdso) must be nonzero and sv_vdso_base must also be nonzero so there's no problem setting it to 1. I suppose there is some documentary value in having it report 0 to the user if not supported by the arch. |
sys/kern/imgact_aout.c | ||
---|---|---|
166 | It is same as it was in the time the line was written. I want something to plug if people add imgact_aout.c to their files.<arch>. If you suggest to change the message to something else, I do not object. | |
sys/kern/imgact_elf.c | ||
133 | Might be, it is useful to hide the sysctl at all from not supported arches. I will do this. |
sys/amd64/amd64/sigtramp.S | ||
---|---|---|
63 | Are there CFI annotations for segment registers, fsbase, gsbase, and eflags? (Those are all present in ucontext_t as well and a recent change I have for GDB pulls all of those out of signal frames.) | |
sys/amd64/ia32/ia32_sigtramp.S | ||
58 | Similar question about segment registers, fsbase/gsbase (GDB uses those for TLS), and flags. | |
sys/tools/vdso_wrap.S | ||
5 | From crtbrand.S or the like? (Or maybe the firmware stub?) There's no useful shared content between those files and this. I would just give this a FF copyright from the get go. | |
37 | No need for this if you aren't planning to merge to 12. |
sys/tools/vdso_wrap.S | ||
---|---|---|
5 | From the firmware stub |
sys/amd64/amd64/sigtramp.S | ||
---|---|---|
63 | All listed registers have DWARF register mapping number assigned to them in psABI, so there is a way to specify that in final object. But trying to actually do that resulted in the mess. For instance, i386 psABI specifies 93 as %fs.base, but binutils interpret it as %k0. Clang IAS does know about all segment names, but objdump from binutils report all encoded registers as invalid then. And so on. The only semi-consistent pseudo-registers seems to be fs/gs.base on x86_64, but even there clang managed to fail, it does not recognize the registers name, and I have to use numeric value. On the other hand, I am sure that %eflags are not useful for unwinders, and doubt that segment registers or fs/gsbase are useful. I think really doing that would require some iterations with toolchain maintainers, for now I commented out cfi instructions for segments/flags everywhere, and bases on 32bit. | |
sys/tools/vdso_wrap.S | ||
5 | The code was copied from firmware.S, and the only changes made were the variables rename. I did removed that copyright as suggested, but feel free to request a revert. |
Add cfi offsets for amd64 fs/gs bases. Most other registers are also annotated but neutralized due to toolchain mess.
Change error message for a.out !ia32 arch.
Remove ECATS copyright from vdso_wrap.S.
Minor tweaks for 64bit vdso cfi annotations. Add a comment explaining toolchain mess.
sys/amd64/amd64/sigtramp.S | ||
---|---|---|
69 | Fixed locally. |
sys/amd64/amd64/sigtramp.S | ||
---|---|---|
63 | This is mostly for debuggers and when inspecting in core dumps. That said, GDB will keep using its own unwinder that will unwind all those registers ok even with out CFI (so long as we don't change the sigcode so the sigframe unwinder keeps matching). %eflags is possibly useful when debugging a core dump where a thread had caught a signal and you want to see the register state at the time of the faulting instruction, but that is definitely an obscure case. GDB uses fsbase/gsbase for TLS on amd64/i386, but those likely don't change between the signal handler and the signal frame so missing those is less important. My other interest is around using CFI for kernel exception frames as that removes the need for custom unwinders in kernel debuggers, but when I last tried that I found I couldn't annotate all of the registers (for some reason I couldn't annotate %eflags at the time. and probably it is what you ran into which is that there is a DWARF register number but IAS doesn't support it). |