dtrace: implement kinst
AbandonedPublic
Actions

Authored by christos on Jul 15 2022, 4:57 PM.

Details

Reviewers

markj
gnn

Summary

kinst is a new provider that allows for instruction-level tracing in a given function

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

christos created this revision.Jul 15 2022, 4:57 PM

Herald added a reviewer: gnn. · View Herald TranscriptJul 15 2022, 4:57 PM

Herald added a subscriber: imp. · View Herald Transcript

christos requested review of this revision.Jul 15 2022, 4:57 PM

markj added inline comments.Jul 15 2022, 6:17 PM

sys/cddl/dev/kinst/kinst.c
98	I explained further in an email, but we need to synchronize access to this list. One CPU can be executing kinst_make_probe() while another is executing kinst_destroy(). We can deal with reader synchronization using an IPI barrier, but we need a mutex for writers.
98
147	I think this and the functions above can be defined as `static`?
285	What happens if two different dtrace(1) instances create the same probe? We'll have two probe structures with the same address in this list. But kinst_invop() will only find the first one.
398	So this made me think about the `regs` array. If you look at dtrace.h, there are two definitions: #define DIF_VAR_REGS 0x0001 /* registers array / #define DIF_VAR_UREGS 0x0002 / user registers array / The second one is the `uregs` (userspace registers) array: https://illumos.org/books/dtrace/chp-user.html#chp-user-uregs It's implemented here: https://github.com/freebsd/freebsd-src/blob/main/sys/cddl/contrib/opensolaris/uts/common/dtrace/dtrace.c#L3380 Clearly the implementors of dtrace intended to have a `regs` array as well, but it isn't implemented for some reason. But I think it will be useful to have access to the register file from kinst probes. Remember, kinst_invop() takes a pointer to a `struct trapframe` parameter; this is a saved copy of the CPU registers at the time that the breakpoint was executed. So if we can somehow make that available to the DIF_VAR_REGS handler, it'll be possible to access the CPU registers from a D script. DIF_VAR_UREGS is implemented by fetching `curthread->td_frame`. When a thread enters the kernel (e.g., via an interrupt or a system call), the kernel saves a pointer to the usermode registers in `td_frame`. I think the easiest way to implement DIF_VAR_REGS is to add a new field `struct trapframe td_bp_frame` to struct kdtrace_thread. In kinst_invop() (and fbt_invop() too), set `curthread->td_dtrace->td_bp_frame` to the address of the trapframe before calling dtrace_probe(). Then, implement DIF_VAR_REGS just like DIF_VAR_UREGS, except that the trapframe comes from a different field than `td_frame`. You could implement and commit this mechanism independent of kinst: just do it for FBT. Then just modify kinst_invop() to use it.
406	I think this won't work with multiple dtrace(1) processes that are creating and destroying kinst probes simultaneously. We'll need something more clever.
441	Looks like the error is silently ignored?
sys/cddl/dev/kinst/trampoline.c
72	So, right now we allocate the trampoline above KERNBASE so that near calls/jumps/etc. don't need to be totally rewritten; instead, we just need to modify their displacement. I think the constraint of having to allocate trampoline memory above KERNBASE is going to cause some pain down the road. Here's another possibility: if we convert near calls/jmps/etc. to absolute calls/jmps/etc., then the trampoline can live anywhere in kernel memory. Then the allocator can be simpler: instead of using the low-level vm_map_find()/vm_map_wire(), we can use `kmem_alloc(PAGE_SIZE, M_EXEC)` to allocate a page of executable memory for trampolines. Note that the jmp that we add to trampolines is currently a near jmp; we'd need to make that an absolute jmp as well. Could you please try to implement this? You'll need to find out how to encode a "far" jmp with an absolute address as an operand.
72	Sorry, typo: `kmem_malloc(PAGE_SIZE, M_EXEC)`.

markj added inline comments.Jul 15 2022, 7:09 PM

sys/cddl/contrib/opensolaris/uts/common/sys/dtrace.h
1345	This should be in kinst.h. This file defines ioctls for /dev/dtrace/dtrace, and they live in a different namespace. So something like #define KINSTIOC_MAKEPROBE _IOW('k', 1, dtrace_kinst_probedesc_t)

markj added inline comments.Jul 15 2022, 9:41 PM

cddl/contrib/opensolaris/lib/libdtrace/common/dt_provider.c
704	`pvp` may be NULL: # dtrace -n kins::amd64_syscall: Segmentation fault (core dumped)
sys/cddl/dev/kinst/extern.h
16	This code is amd64-only, no need for ifdefs.
sys/cddl/dev/kinst/kinst.c
112	No need for the ifdef, this code is amd64-only. (IMO it is simply not worth implementing i386 support, or at least, it's very very low priority.)
203	It occurs to me that we must never instrument `sti`. Similarly, `popf` must be disallowed (because it can potentially set PSL_I, which is what `sti` does). Otherwise, we can't use dtrace_sync() to create a barrier. Those instructions are rare, so I don't expect it to be impactful.
214	Here's a wrinkle that we missed: a `call` instruction pushes the address of the next instruction onto the stack, then jumps to the call target. Later, a `ret` will pop the stack and place the value in `%rip`. When a `call` is executed from a trampoline, the return address will be the next instruction in the trampoline, i.e., the jmp back to the original code. Suppose that a probe of a `call` instruction is disabled after the probe fires in a thread, but before the thread executes the corresponding `ret`. Then, when it does execute `ret`, it'll jump to the trampoline, which has been filled with breakpoints. Oops. I see two options for handling `call`: Emulate it in software. kinst_invop() can implement `call` by manipulating fields in the trapframe, and set things up so that the saved return address is the instruction following the original instruction, instead of the instruction following the copy in the trampoline. when populating the trampoline, break the `call` up into a sequence of `push` and `jmp` instructions which give the desired semantics. I think the first option is much easier to implement. Basically, it means that `call` instructions don't require a trampoline at all. BTW, the following script helped me narrow down the bug: # cat /tmp/kinst-test.sh #!/bin/sh -x #set -e for i in $(seq 0 2316); do echo $i dtrace -n "kinst::amd64_syscall:$i {} tick-25ms {exit(0);}" -x switchrate=50hz done It just tries to enable each probe in amd64_syscall() individually. Eventually I found that the kernel crashes after enabling kinst::amd64_syscall:300, which for me is `call *(%rcx)`.

christos marked 4 inline comments as done.Jul 16 2022, 1:58 PM

christos added inline comments.

sys/cddl/contrib/opensolaris/uts/common/sys/dtrace.h
1345	In order for this to be in kinst.h, we also have to include kinst.h in libdtrace. Do you think it's better to change the name but leave it in dtrace.h?
sys/cddl/dev/kinst/kinst.c
147	These functions are supposed to be architecture dependent, meaning they'll eventually go to kinst_isa.c.
441	Is it better to convert to change the function's return value to int and return the result of dtrace_register()? I tried it and it compiles normally. If not, could we just print an error message before the return?

christos marked 3 inline comments as done.Jul 16 2022, 3:37 PM

markj added inline comments.Jul 16 2022, 3:56 PM

sys/cddl/contrib/opensolaris/uts/common/sys/dtrace.h
1345	Provider-specific interfaces shouldn't live in dtrace.h. It's fine to include kinst.h in libdtrace.
sys/cddl/dev/kinst/kinst.c
441	I think we can just pass the error up, but I didn't check. kinst_modevent() should return an errno value if an error occurs.

markj added inline comments.Sep 6 2022, 11:00 PM

sys/cddl/dev/kinst/kinst.c
214	We've since fixed this, though it requires a fair bit of work in the instruction dissector and in kinst_invop(). So far I haven't found any other instructions where this problem exists.
406	In my branch I added a mutex to serialize all trampoline allocations and deallocations, and that fixes the problem.
sys/cddl/dev/kinst/trampoline.c
72	I tried to implement this, and it turned into a bit of a rabbit hole. It's easy enough to use absolute jumps in the trampoline, but we also rely on trampolines being above KERNBASE for the benefit of instructions using a RIP-relative addressing mode (i.e. Mod == 0, R/M == 5). In that mode, the displacement is limited to 32 bits, so we'd have to find some general way to rewrite instructions to not rely on that. I'm sure it's possible, but it'll require more complexity in the code which populates trampolines, and I'd rather just get something committed first. So, let's ignore this problem for now.