Page MenuHomeFreeBSD

truss: add option to print the active syscall(s) if no activity after N seconds.
Needs ReviewPublic

Authored by dnelson_1901_yahoo.com on Jun 15 2018, 8:37 PM.

Details

Reviewers
jhb
tuexen
mmokhi
Group Reviewers
manpages
Summary

Truss is often used while debugging "hung" processes, but if the target process really is not doing anything, then truss doesn't print anything either. This patch adds a 2-second timeout to the core loop, and if it fires, any threads waiting in a syscall will be printed.

Performance penalty: two setitimer() calls per traced syscall, plus the nested-loop search to determine how many pending syscalls there are. The pending syscall count could be aggregated into a summary variable and kept up-to-date by alloc_syscall() and free_syscall(), if needed.

Example output:

# simple trace of a sleep command
$ truss -d sleep 4
[...]
0.099993429 cap_fcntls_limit(2,CAP_FCNTL_GETFL)  ERR#78 'Function not implemented'
0.100991620 cap_enter()                          ERR#78 'Function not implemented'
0.102608113 sigaction(SIGINFO,{ 0x400cb0 SA_RESTART ss_t },{ SIG_DFL SA_RESTART ss_t }) = 0 (0x0)
2.105037303 nanosleep({ 4.000000000 },)          ...sleeping
4.105016531 nanosleep({ 4.000000000 },{ 4.000000000 }) = 0 (0x0)
4.106558773 sigprocmask(SIG_BLOCK,{ SIGHUP|SIGINT|SIGQUIT|SIGKILL|SIGPIPE|SIGALRM|SIGTERM|SIGURG|SIGSTOP|SIGTSTP|SIGCONT|SIGCHLD|SIGTTIN|SIGTTOU|SIGIO|SIGXCPU|SIGXFSZ|SIGVTALRM|SIGPROF|SIGWINCH|SIGINFO|SIGUSR1|SIGUSR2|SIGTHR },{ }) = 0 (0x0)
4.108041307 sigprocmask(SIG_SETMASK,{ },0x0)     = 0 (0x0)
[...]

# trace an active radius daemon, showing that a threaded process can be traced, and all pending syscalls are printed (_umtx_op and kevent in this case):
$ truss -dH -p $(pgrep radiusd)
101146: 2.000916627 kevent(4,0x0,0,,512,0x0)     ...sleeping
101146: 4.105742028 kevent(4,0x0,0,{ 10,EVFILT_READ,0x0,0,0x47,0x803668b08 },512,0x0) = 1 (0x1)
101146: 4.106004130 recvfrom(10,"\^A+\0G",4,MSG_PEEK,{ AF_INET 127.0.0.1:32998 },0x7fffffffe820) = 4 (0x4)
101146: 4.106193837 recvfrom(10,"\^A+\0G",4,MSG_PEEK,{ AF_INET 127.0.0.1:32998 },0x7fffffffe724) = 4 (0x4)
101146: 4.106297587 recvfrom(10,"\^A+\0G\^XQ\M^Z\M-.\M-$\M-M\M-="...,71,0,{ AF_INET 127.0.0.1:32998 },0x7fffffffe724) = 71 (0x47)
101146: 4.106523467 getsockname(10,{ AF_INET 0.0.0.0:1812 },0x7fffffffe720) = 0 (0x0)
101146: 4.106666032 _umtx_op(0x684d18,UMTX_OP_SEM2_WAKE,0x0,0x0,0x0) = 0 (0x0)
101682: 4.106912044 mmap(0x0,2097152,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34500247552 (0x808600000)
101682: 4.107070112 wait4(0,0x7fffdfffdeb8,WNOHANG,0x0) ERR#10 'No child processes'
101682: 4.107286309 mmap(0x0,4096,PROT_READ|PROT_WRITE,MAP_PRIVATE|MAP_ANON,-1,0x0) = 34366926848 (0x8006db000)
101682: 4.107526954 write(3,"Fri Jun 15 15:30:15 2018 : Info:"...,182) = 182 (0xb6)
101146: 4.440923540 kevent(4,0x0,0,{ },512,{ 0.333202000 }) = 0 (0x0)
101682: 6.441898153 _umtx_op(0x684d18,UMTX_OP_SEM2_WAIT,0x0,0x0,0x0) ...sleeping
101146: 6.441949765 kevent(4,0x0,0,,512,0x0)     ...sleeping
^C

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

kevans added subscribers: tuexen, jhb, kevans.

Adding @jhb and @tuexen as possible reviewers

I'd suggest make a new command-line switch for this functionality so we can:
1- set timeout dynamically
2- have the original behavior by not using the switch

I wonder if the wchan for the process can be printed easily instead of 'sleeping'

I'd suggest make a new command-line switch for this functionality so we can:
1- set timeout dynamically
2- have the original behavior by not using the switch

Adding a command-line option should be easy. I like the idea of it being enabled by default, though. I'll do some benchmarks to see what the performance hit actually is.

I wonder if the wchan for the process can be printed easily instead of 'sleeping'

Good idea.. I'll take a look at how ps/procstat/top do it. Luckily, this can be an arbitrarily expensive lookup, since it only gets printed when nothing else is happening anyway.

I would make the timeout configurable. I would be hesitant about enabling it by default as well though I don't feel too strongly about that.

While printing the wchan would be nice, you can't easily get that via ptrace(). You'd have to fetch the kinfo_proc structures which is a bit much. I might not say sleeping though, I might just not print any return value (which I think print_syscall() already doesn't do). Maybe replace "sleeping" with "<in progress>"

usr.bin/truss/setup.c
153

More reliable would be to have a global 'sig_atomic_t alarm_tripped' that this set to 1 and checking that in the loop below so that you really know that SIGALRM fired vs some other signal. Also, I think you can just spell out 'trap_alarm'

163

Blank line after the 'struct procinfo *p' line.

164

The prevailing style is to declare all the locals in one block at the start of the function, especially for a small function such as this.

168

A curious affect of this will be that it will print out incorrect times once the syscall actually completes. I would perhaps explicitly save the old value of 't->after' and restore it after calling print_syscall().

737–755

I would stop once you find a single thread in a syscall. It might be simplest to actually do this in a helper function:

static bool
set_alarm(struct trussing *info)
{
    struct procinfo *p;
    struct threadinfo *t;

    LIST_FOREACH(p, &info->proclist, entries) {
        LIST_FOREACH(t, &p->threadlist, entries) {
            if (t->in_syscall) {
                alarm(2);
                return (true);
            }
        }
    }
    return (false);
}

I would then use it like so;

alarm_set = set_alarm(info);

while (waitid(....) == -1) {
    if (errno == EINTR) {
        ...
        if (alarm_triggered) {
            print_sleeping_threads(info);
            alarm(0);
            alarm_set = false;
        }
     ...
}

if (alarm_set)
    alarm(0);
usr.bin/truss/syscalls.c
2621–2622

I would be inclined to leave this. The argument would be missing regardless of being in or out of a syscall.

dnelson_1901_yahoo.com added inline comments.
usr.bin/truss/setup.c
168

Everywhere else, 't->after' is updated before use, so I think computing an "interim" value before the syscall completes should be safe.

usr.bin/truss/syscalls.c
2621–2622

The difference here is that a NULL 's_args[i]' value during the syscall just means that it's an OUT argument with no useful value yet, not that something messed up and the argument couldn't be calculated. I've added an explicit check for 'sc->args[i].type & OUT' to make the intent clearer.

Updated to address comments, and added a -t option to allow for adjustment of the timeout.

usr.bin/truss/setup.c
65

Can probably merge these two declarations to declare both on a single line.

151

I would still add the underscore as most other functions in this file use an underscore between words.

168

Ah, ok.

usr.bin/truss/syscalls.c
2621–2622

I would still not print it as an empty string. Maybe "<unavailable>". I think the 't->in_syscall' check is sufficient without needing the OUT flag checked explicitly.

I like the functionality added and that the timeout is configurable via a command line option. However, I would prefer to have the default off... But I leave the decision up to you.

I like the functionality added and that the timeout is configurable via a command line option. However, I would prefer to have the default off... But I leave the decision up to you.

I think that everyone would want to know what a traced process is waiting for, especially after a few seconds of no output on the screen. A quick performance test (timing "cd /usr/src/sbin ; truss -f -o /dev/null make -j4") does show an extra 4% system cpu usage, but no increase in either user or real times. I was hoping for less impact, so I will update the patch to make it disabled by default. Were you worried about performance also, or did you have a different reason for not wanting it to be on by default?

usr.bin/truss/setup.c
151

Heh. I didn't even notice the underscore in your suggestion. I just saw the spelling change.

dnelson_1901_yahoo.com marked an inline comment as done.
dnelson_1901_yahoo.com retitled this revision from truss: print the active syscall(s) if no activity after 2 seconds. to truss: add option to print the active syscall(s) if no activity after N seconds..

More updates.