
dtrace: implement kinst

Authored by christos on Jul 15 2022, 4:57 PM.



kinst is a new DTrace provider that allows instruction-level tracing within a given function.

Diff Detail

rG FreeBSD src repository
Lint Skipped
Tests Skipped

Event Timeline


I explained further in an email, but we need to synchronize access to this list. One CPU can be executing kinst_make_probe() while another is executing kinst_destroy(). We can deal with reader synchronization using an IPI barrier, but we need a mutex for writers.


I think this and the functions above can be defined as static?


What happens if two different dtrace(1) instances create the same probe? We'll have two probe structures with the same address in this list. But kinst_invop() will only find the first one.
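To make the concern concrete, here is a minimal userspace sketch, not the kinst code itself. `struct probe`, `probe_lookup()` and `probe_count()` are hypothetical stand-ins for the probe list and the lookup done from the breakpoint handler. With two entries at the same address, a first-match walk only ever returns the first one.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for a kinst probe record. */
struct probe {
	uintptr_t	 addr;	/* address of the instrumented instruction */
	int		 id;	/* illustrative probe ID */
	struct probe	*next;
};

/* First-match lookup, as a handler walking the list would do it. */
static struct probe *
probe_lookup(struct probe *head, uintptr_t addr)
{
	for (struct probe *p = head; p != NULL; p = p->next)
		if (p->addr == addr)
			return (p);
	return (NULL);
}

/* Count every entry registered at addr, to expose duplicates. */
static int
probe_count(struct probe *head, uintptr_t addr)
{
	int n = 0;

	for (struct probe *p = head; p != NULL; p = p->next)
		if (p->addr == addr)
			n++;
	return (n);
}
```

If `probe_count()` can ever exceed 1 for an address, the second entry is unreachable through `probe_lookup()`, so either duplicates have to be rejected at creation time or the handler has to fire every match.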


So this made me think about the regs array. If you look at dtrace.h, there are two definitions:

#define DIF_VAR_REGS            0x0001  /* registers array */
#define DIF_VAR_UREGS          0x0002  /* user registers array */

The second one is the uregs (userspace registers) array:

It's implemented here:

Clearly the implementors of dtrace intended to have a regs array as well, but it isn't implemented for some reason. But I think it will be useful to have access to the register file from kinst probes. Remember, kinst_invop() takes a pointer to a struct trapframe parameter; this is a saved copy of the CPU registers at the time that the breakpoint was executed. So if we can somehow make that available to the DIF_VAR_REGS handler, it'll be possible to access the CPU registers from a D script.

DIF_VAR_UREGS is implemented by fetching curthread->td_frame. When a thread enters the kernel (e.g., via an interrupt or a system call), the kernel saves a pointer to the usermode registers in td_frame. I think the easiest way to implement DIF_VAR_REGS is to add a new field struct trapframe *td_bp_frame to struct kdtrace_thread. In kinst_invop() (and fbt_invop() too), set curthread->td_dtrace->td_bp_frame to the address of the trapframe before calling dtrace_probe().
Then, implement DIF_VAR_REGS just like DIF_VAR_UREGS, except that the trapframe comes from a different field than td_frame.
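A rough userspace mock of that idea follows. Only `td_frame`, `td_dtrace` and the proposed `td_bp_frame` come from the discussion above; the trimmed-down `struct trapframe`, the `MOCK_REG_*` indices and the `mock_*` functions are assumptions made up for illustration, not the real kernel interfaces.

```c
#include <stddef.h>
#include <stdint.h>

/* Mock with only two registers; the real amd64 trapframe is much larger. */
struct trapframe {
	uint64_t tf_rax;
	uint64_t tf_rip;
};

struct kdtrace_thread {
	struct trapframe *td_bp_frame;	/* field proposed in the review */
};

struct thread {
	struct trapframe	*td_frame;	/* saved usermode registers */
	struct kdtrace_thread	*td_dtrace;
};

enum { MOCK_REG_RAX, MOCK_REG_RIP };	/* hypothetical register indices */

static uint64_t
mock_getreg(const struct trapframe *tf, int reg)
{
	if (tf == NULL)
		return (0);
	switch (reg) {
	case MOCK_REG_RAX:
		return (tf->tf_rax);
	case MOCK_REG_RIP:
		return (tf->tf_rip);
	default:
		return (0);
	}
}

/* Analogue of the DIF_VAR_UREGS handler: reads from td_frame. */
static uint64_t
mock_uregs(const struct thread *td, int reg)
{
	return (mock_getreg(td->td_frame, reg));
}

/* Analogue of a DIF_VAR_REGS handler: same logic, different source frame. */
static uint64_t
mock_regs(const struct thread *td, int reg)
{
	return (mock_getreg(td->td_dtrace->td_bp_frame, reg));
}

/* What an invop handler would do just before calling dtrace_probe(). */
static void
mock_invop(struct thread *td, struct trapframe *frame)
{
	td->td_dtrace->td_bp_frame = frame;
}
```

The point of the sketch is that the two handlers share all of their register-decoding logic and differ only in which saved frame they read.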

You could implement and commit this mechanism independent of kinst: just do it for FBT. Then just modify kinst_invop() to use it.


I think this won't work with multiple dtrace(1) processes that are creating and destroying kinst probes simultaneously. We'll need something more clever.


Looks like the error is silently ignored?


So, right now we allocate the trampoline above KERNBASE so that near calls/jumps/etc. don't need to be totally rewritten; instead, we just need to modify their displacement. I think the constraint of having to allocate trampoline memory above KERNBASE is going to cause some pain down the road.

Here's another possibility: if we convert near calls/jmps/etc. to absolute calls/jmps/etc., then the trampoline can live anywhere in kernel memory. Then the allocator can be simpler: instead of using the low-level vm_map_find()/vm_map_wire(), we can use kmem_alloc(PAGE_SIZE, M_EXEC) to allocate a page of executable memory for trampolines.

Note that the jmp that we add to trampolines is currently a near jmp; we'd need to make that an absolute jmp as well.

Could you please try to implement this? You'll need to find out how to encode a "far" jmp with an absolute address as an operand.
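For reference, one common way to get an absolute 64-bit jump on amd64 is the indirect form `jmp *0(%rip)` (bytes `ff 25 00 00 00 00`) followed immediately by the 8-byte target address. The sketch below emits that 14-byte sequence; `emit_abs_jmp()` and `ABSJMP_LEN` are hypothetical names, and whether this is the encoding the patch ends up using is an assumption.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define	ABSJMP_LEN	14	/* 6 bytes of jmp *0(%rip) + 8-byte target */

/*
 * Write "jmp *0(%rip)" followed by the little-endian 64-bit target into
 * buf, which must have room for ABSJMP_LEN bytes.  Returns the number of
 * bytes written.
 */
static size_t
emit_abs_jmp(uint8_t *buf, uint64_t target)
{
	static const uint8_t opc[6] = { 0xff, 0x25, 0x00, 0x00, 0x00, 0x00 };

	memcpy(buf, opc, sizeof(opc));
	for (int i = 0; i < 8; i++)
		buf[sizeof(opc) + i] = (uint8_t)(target >> (8 * i));
	return (ABSJMP_LEN);
}
```

Because the target is read from memory right after the instruction, the sequence works regardless of how far the trampoline is from the original code, at the cost of 14 bytes instead of 5 for a near jmp.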


Sorry, typo: kmem_malloc(PAGE_SIZE, M_EXEC).


This should be in kinst.h. This file defines ioctls for /dev/dtrace/dtrace, and they live in a different namespace. So something like

#define KINSTIOC_MAKEPROBE _IOW('k', 1, dtrace_kinst_probedesc_t)

pvp may be NULL:

# dtrace -n kins::amd64_syscall:
Segmentation fault (core dumped)

This code is amd64-only, no need for ifdefs.


No need for the ifdef, this code is amd64-only. (IMO it is simply not worth implementing i386 support, or at least, it's very very low priority.)


It occurs to me that we must never instrument sti. Similarly, popf must be disallowed (because it can potentially set PSL_I, which is what sti does). Otherwise, we can't use dtrace_sync() to create a barrier. Those instructions are rare, so I don't expect it to be impactful.
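A check along these lines could be done on the first opcode byte: sti is `0xfb` and popf is `0x9d`. The sketch below is an assumption about how such a filter might look, not code from the patch; `is_prefix()` and `insn_may_set_psl_i()` are hypothetical names, and the prefix handling is deliberately simplistic.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Legacy and REX prefixes that may precede the opcode byte on amd64. */
static bool
is_prefix(uint8_t b)
{
	switch (b) {
	case 0x26: case 0x2e: case 0x36: case 0x3e:	/* segment overrides */
	case 0x64: case 0x65:				/* %fs, %gs overrides */
	case 0x66: case 0x67:				/* operand/address size */
	case 0xf0: case 0xf2: case 0xf3:		/* lock, repne, rep */
		return (true);
	default:
		return ((b & 0xf0) == 0x40);		/* REX prefixes */
	}
}

/* Return true if the instruction is sti (0xfb) or popf (0x9d). */
static bool
insn_may_set_psl_i(const uint8_t *insn, size_t len)
{
	size_t i = 0;

	while (i < len && is_prefix(insn[i]))
		i++;
	if (i == len)
		return (false);
	return (insn[i] == 0xfb || insn[i] == 0x9d);
}
```

Probe creation would then refuse any instruction for which this returns true.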


Here's a wrinkle that we missed: a call instruction pushes the address of the next instruction onto the stack, then jumps to the call target. Later, a ret will pop the stack and place the value in %rip. When a call is executed from a trampoline, the return address will be the next instruction in the trampoline, i.e., the jmp back to the original code.

Suppose that a probe of a call instruction is disabled after the probe fires in a thread, but before the thread executes the corresponding ret. Then, when it does execute ret, it'll jump to the trampoline, which has been filled with breakpoints. Oops.

I see two options for handling call:

  1. Emulate it in software. kinst_invop() can implement call by manipulating fields in the trapframe, and set things up so that the saved return address is the instruction following the original instruction, instead of the instruction following the copy in the trampoline.
  2. When populating the trampoline, break the call up into a sequence of push and jmp instructions which give the desired semantics.

I think the first option is much easier to implement. Basically, it means that call instructions don't require a trampoline at all.
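A minimal userspace sketch of the first option, assuming a mock `struct trapframe` with just `tf_rip` and `tf_rsp` and a hypothetical helper `emulate_call()`: push the address of the instruction after the original call onto the saved stack, then point the saved `%rip` at the call target. Decoding the target (for example from `*(%rcx)`) is assumed to have happened already.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Mock with only the fields this sketch touches. */
struct trapframe {
	uint64_t tf_rip;
	uint64_t tf_rsp;
};

/*
 * Emulate "call target" for an instruction of insn_len bytes located at
 * insn_addr in the original code, by editing the saved register state.
 */
static void
emulate_call(struct trapframe *tf, uint64_t insn_addr, size_t insn_len,
    uint64_t target)
{
	/* The return address is in the original code, not the trampoline. */
	uint64_t ret = insn_addr + insn_len;

	/* Push the return address onto the interrupted stack. */
	tf->tf_rsp -= sizeof(ret);
	memcpy((void *)(uintptr_t)tf->tf_rsp, &ret, sizeof(ret));

	/* Resume execution at the call target. */
	tf->tf_rip = target;
}
```

Since the pushed return address never refers to trampoline memory, a later ret stays valid even if the probe has been disabled in the meantime.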

BTW, the following script helped me narrow down the bug:

 # cat /tmp/
#!/bin/sh -x
#set -e
for i in $(seq 0 2316); do
    echo $i
    dtrace -n "kinst::amd64_syscall:$i {} tick-25ms {exit(0);}" -x switchrate=50hz
done

It just tries to enable each probe in amd64_syscall() individually. Eventually I found that the kernel crashes after enabling kinst::amd64_syscall:300, which for me is call *(%rcx).

christos added inline comments.

In order for this to be in kinst.h, we also have to include kinst.h in libdtrace.
Do you think it's better to change the name but leave it in dtrace.h?


These functions are supposed to be architecture-dependent, meaning they'll eventually go to kinst_isa.c.


Would it be better to change the function's return type to int and return the result of dtrace_register()? I tried it and it compiles normally. If not, could we just print an error message before the return?


Provider-specific interfaces shouldn't live in dtrace.h. It's fine to include kinst.h in libdtrace.


I think we can just pass the error up, but I didn't check. kinst_modevent() should return an errno value if an error occurs.


We've since fixed this, though it requires a fair bit of work in the instruction dissector and in kinst_invop(). So far I haven't found any other instructions where this problem exists.


In my branch I added a mutex to serialize all trampoline allocations and deallocations, and that fixes the problem.


I tried to implement this, and it turned into a bit of a rabbit hole. It's easy enough to use absolute jumps in the trampoline, but we also rely on trampolines being above KERNBASE for the benefit of instructions using a RIP-relative addressing mode (i.e. Mod == 0, R/M == 5). In that mode, the displacement is limited to 32 bits, so we'd have to find some general way to rewrite instructions to not rely on that. I'm sure it's possible, but it'll require more complexity in the code which populates trampolines, and I'd rather just get something committed first. So, let's ignore this problem for now.
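To illustrate the constraint, here is a small sketch under stated assumptions. `modrm_is_rip_relative()` and `rebase_disp32()` are hypothetical helpers, not functions from the patch. The first tests the ModR/M byte for Mod == 0, R/M == 5. The second recomputes the displacement for a copy of the instruction placed at a different address and reports failure when the result no longer fits in 32 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* In 64-bit mode, Mod == 0 and R/M == 5 selects RIP-relative addressing. */
static bool
modrm_is_rip_relative(uint8_t modrm)
{
	return ((modrm >> 6) == 0 && (modrm & 0x7) == 5);
}

/*
 * The instruction length is unchanged by the copy, so the operand address
 * stays the same if the displacement grows by (orig - tramp).  Returns
 * false if the adjusted displacement does not fit in a signed 32-bit field.
 */
static bool
rebase_disp32(uint64_t orig, uint64_t tramp, int32_t disp, int32_t *newdisp)
{
	/* Wrapping unsigned arithmetic, then reinterpret as signed. */
	int64_t d = (int64_t)(orig - tramp + (uint64_t)(int64_t)disp);

	if (d < INT32_MIN || d > INT32_MAX)
		return (false);
	*newdisp = (int32_t)d;
	return (true);
}
```

A trampoline within 2GB of the original code passes the range check; one allocated elsewhere in the kernel address space generally does not, which is why those instructions would need a more general rewrite.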