Allow stack(9) to capture stack of running process via stack_save_thread(). Have kern.proc.kstack use this.
AbandonedPublic
Actions

Authored by markj on Jul 1 2014, 9:53 PM.

Details

Reviewers

jhb
bdrewery

Summary

If a process is running sleep until hardclock can capture its stack.

TODO: non-amd64 support, stack.9 updates, API fixes as now there is both

stack_save_td() and stack_save_thread()

Sponsored by: EMC / Isilon Storage Division
Obtained from: Isilon OneFS

Test Plan

procstat(1) can be used.

P11 stack(9) running threads

1	Before:
2	# procstat -kka\|egrep "(idle:\|procstat)"
3	11 100003 idle idle: cpu0 <running>
4	11 100004 idle idle: cpu1 mi_switch+0x275 critical_exit+0x8b sched_idletd+0x1e8 fork_exit+0x84 fork_trampoline+0xe
5	5167 100172 procstat - <running>
6
7	After:
8	# procstat -kka\|egrep "(idle:\|procstat)"
9	11 100003 idle idle: cpu0 stack_hardclock+0xc2 hardclock_cnt+0x2d5 handleevents+0xbc timercb+0x208 lapic_handle_timer+0x9c Xtimerint+0x8c acpi_cpu_idle+0x13b cpu_idle_acpi+0x3f cpu_idle+0x93 sched_idletd+0x1e8 fork_exit+0x84 fork_trampoline+0xe
10	11 100004 idle idle: cpu1 mi_switch+0x275 critical_exit+0x8b sched_idletd+0x1e8 fork_exit+0x84 fork_trampoline+0xe
11	1141 100143 procstat - stack_save_thread+0x1e1 sysctl_kern_proc_kstack+0x21a sysctl_root+0x1bd userland_sysctl+0x19e sys___sysctl+0x74 amd64_syscall+0x2c3 Xfast_syscall+0xfb

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

bdrewery updated this revision to Diff 554.Jul 1 2014, 9:53 PM

bdrewery retitled this revision from to Allow stack(9) to capture stack of running process via stack_save_thread(). Have kern.proc.kstack use this..

bdrewery updated this object.

bdrewery edited the test plan for this revision. (Show Details)

bdrewery updated this object.Jul 1 2014, 9:55 PM

bdrewery edited the test plan for this revision. (Show Details)

bdrewery added a reviewer: jhb.

Hmm, another approach would be to use smp_rendezvous() with the CPU the thread is running on to capture the stack. This wouldn't introduce the weirdness with having to drop locks, etc. and avoid the race handling for the tid changing, etc. (It also wouldn't block but would grab the current stack frame instead of a stack frame at some point in the future.)

The in_panic change looks a bit interesting, but I'm curious if that is actually used. A userland process can't be scheduled to run to even invoke this sysctl if the system is panic'ing. Does Isilon use stack_save_thread() in other places?

Sorry, maybe I didn't investigate this enough, but I have to ask some questions:

Specified timeout is extremely short and the code assumes that ctx is removed from stack_hardclock_list (just zeroes it and retries). This seems risky.

In Linux there is a 'softlockup detector'. Once a cpu appears to be stuck there is an option to send NMIs to all cpus to collect all traces. Usefulness of such detector aside, such mechanism would allow people to collect traces even if given cpu is stuck with interrupt disabled. It also avoids the branch in hardclock and the problem with making sure stuff is no longer on the list.

In D323#5, @jhb wrote:

The in_panic change looks a bit interesting, but I'm curious if that is actually used. A userland process can't be scheduled to run to even invoke this sysctl if the system is panic'ing. Does Isilon use stack_save_thread() in other places?

Yeah, it is or was used at Isilon by some kernel component to gather stacks during panic (without having done any research as to what component, sorry). It's not useful for the ^T backtrace functionality. We could drop that piece for this review.

bdrewery added a reviewer: markj.Jul 9 2015, 5:45 PM

In D323#4632, @jhb wrote:

Hmm, another approach would be to use smp_rendezvous() with the CPU the thread is running on to capture the stack. This wouldn't introduce the weirdness with having to drop locks, etc. and avoid the race handling for the tid changing, etc. (It also wouldn't block but would grab the current stack frame instead of a stack frame at some point in the future.)

In principle, I agree with this proposal, it should be much more reliable and simpler than what is done in the patch.

The issue I see is that stack walker would need to walk over the interrupt frames, there is no support in stack_save() to detect and step over them. Still, the traces above seems to successfully work with the interrupts.

The in_panic change looks a bit interesting, but I'm curious if that is actually used. A userland process can't be scheduled to run to even invoke this sysctl if the system is panic'ing. Does Isilon use stack_save_thread() in other places?

cem added a subscriber: cem.Jul 16 2015, 1:56 PM

@markj has been working on converting this to use smp_rendezvous and cleaning up some other issues with it.

In D323#62895, @bdrewery wrote:

@markj has been working on converting this to use smp_rendezvous and cleaning up some other issues with it.

Yup, sorry, just been busy. I'll put it up soon.

One annoyance is that smp_rendezvous() can't be called with the target thread's lock held, else we may deadlock if the thread is spinning with interrupts disabled on its own thread lock. So the IPI handler needs to handle the case where the thread was migrated to a different CPU after we dropped the thread lock but before the interrupt was handled. I've fixed this by just detecting this case and retrying, but that doesn't seem ideal - any suggestions?

I thought about adding a new thread inhibitor for stack tracing, but that might be overkill.

In D323#63297, @markj wrote:

One annoyance is that smp_rendezvous() can't be called with the target thread's lock held, else we may deadlock if the thread is spinning with interrupts disabled on its own thread lock. So the IPI handler needs to handle the case where the thread was migrated to a different CPU after we dropped the thread lock but before the interrupt was handled. I've fixed this by just detecting this case and retrying, but that doesn't seem ideal - any suggestions?

thread_lock_flags_() reenables interrupts periodically while spinning, same as the normal spinlock. But indeed counters are set up such that the lock failure might be declared.

Did you considered checking for the stack trace request in thread_lock_flags() while spinning, and directly satisfying it from the locking function ?

In D323#63323, @kib wrote:

In D323#63297, @markj wrote:

One annoyance is that smp_rendezvous() can't be called with the target thread's lock held, else we may deadlock if the thread is spinning with interrupts disabled on its own thread lock. So the IPI handler needs to handle the case where the thread was migrated to a different CPU after we dropped the thread lock but before the interrupt was handled. I've fixed this by just detecting this case and retrying, but that doesn't seem ideal - any suggestions?

thread_lock_flags_() reenables interrupts periodically while spinning, same as the normal spinlock. But indeed counters are set up such that the lock failure might be declared.

If the thread already holds a different spinlock, thread_lock_flags_() will not reenable interrupts.

Did you considered checking for the stack trace request in thread_lock_flags() while spinning, and directly satisfying it from the locking function ?

I hadn't, but that could work. I'm not sure where the thread would store the stack in this case though. Perhaps have a TDF_STACK and a struct stack *td_stack in the struct thread. If one wishes to capture the stack of a running thread, they:

set td->td_stack = st, and set TDF_STACK in the target thread
drop the thread lock and call smp_rendezvous_cpus() to interrupt the target thread
if the target thread was interrupted, the handler acquires its thread lock, clears TDF_STACK, and captures the stack
otherwise the thread must have changed states, so it must have acquired its own thread lock at some point(?), at which point it stores its stack in td_stack and clears TDF_STACK. The rendezvous handler will see this and can just return.

Does that seem reasonable? Adding a new thread flag and member, and modifying thread_lock() just for this case seems like a somewhat intrusive solution to me, compared to adding a new thread inhibitor.

cem edited edge metadata.Jul 23 2015, 7:33 PM

cem removed a subscriber: cse_cem_gmail_com.

In D323#63341, @markj wrote:

If the thread already holds a different spinlock, thread_lock_flags_() will not reenable interrupts.

Which can be reasonably interpreted as the flaw in the approach. I start thinking that what should be done is to broadcast (?) NMI with some global variables pointing to thread and to struct stack to fill. NMI handler should detect the situation and fill the stack trace if current thread is equal to the stored pointer.

Did you considered checking for the stack trace request in thread_lock_flags() while spinning, and directly satisfying it from the locking function ?

I hadn't, but that could work. I'm not sure where the thread would store the stack in this case though. Perhaps have a TDF_STACK and a struct stack *td_stack in the struct thread. If one wishes to capture the stack of a running thread, they:

set td->td_stack = st, and set TDF_STACK in the target thread

drop the thread lock and call smp_rendezvous_cpus() to interrupt the target thread

if the target thread was interrupted, the handler acquires its thread lock, clears TDF_STACK, and captures the stack

otherwise the thread must have changed states, so it must have acquired its own thread lock at some point(?), at which point it stores its stack in td_stack and clears TDF_STACK. The rendezvous handler will see this and can just return.

Does that seem reasonable? Adding a new thread flag and member, and modifying thread_lock() just for this case seems like a somewhat intrusive solution to me, compared to adding a new thread inhibitor.

For this, I believe td_stack would be enough. Clear td_stack if the trace was filled. It would give the same effect as the td_flags flag. This is only IMHO.

In D323#63477, @kib wrote:

I start thinking that what should be done is to broadcast (?) NMI with some global variables pointing to thread and to struct stack to fill. NMI handler should detect the situation and fill the stack trace if current thread is equal to the stored pointer.

That seems like a morally correct approach if we think of this as an attempt to profile the target thread. I think a broadcast NMI could be avoided if the target thread lock is held throughout the operation, since we could then deliver an NMI to td_oncpu and expect the target thread to be running (perhaps spinning on its thread lock). There should probably also be some check to ensure that we are allowed to interrupt the target thread, perhaps with p_candebug().

The downside is that more MD code is required, and I'm not yet sure whether this can be done on all architectures. But, it should be doable on x86, with stubs for other architectures. I'll give this a try.

Did you considered checking for the stack trace request in thread_lock_flags() while spinning, and directly satisfying it from the locking function ?

...
Adding a new thread flag and member, and modifying thread_lock() just for this case seems like a somewhat intrusive solution to me, compared to adding a new thread inhibitor.

For this, I believe td_stack would be enough. Clear td_stack if the trace was filled. It would give the same effect as the td_flags flag. This is only IMHO.

Ah, I think you're right.

In case anyone's curious:

I implemented a stack-saver for amd64 using NMIs and modified kern.proc.kstack here: https://people.freebsd.org/~markj/patches/stack_nmi.diff

I then reimplemented stack-on-siginfo: https://people.freebsd.org/~markj/patches/siginfo_stack.diff

These appear to work, modulo some small bugs. But the second patch isn't correct: it resolves kernel symbols in tty_info(), which requires taking a sleepable lock with a tty mutex held. In Isilon's OS this is handled by resolving the symbols and printing to the tty in a taskqueue handler. But that seems excessively complicated: I think we should be able to resolve the stack without potentially sleeping (and I've hit this limitation before), so I'd like to try extending the linker KPI to accommodate this.

In D323#64235, @markj wrote:

In case anyone's curious:

I implemented a stack-saver for amd64 using NMIs and modified kern.proc.kstack here: https://people.freebsd.org/~markj/patches/stack_nmi.diff

This looks simple and clean enough.

Suppose that nmi_stack was set by CPU0 and nmi sent to CPU1, Assume that CPU2 got an NMI for any other reason meantime. Wouldn't the assert curthread == nmi_pending fire on CPU2 ? I do not think that the argument that the stack_save_td_running() locks nmi_lock is enough, e.g. NMI could come from some hardware.

Another thing which I cannot resist to point out is that the NULL write to nmi_pending must have release semantic, i.e. atomic_store_rel_ptr(&nmi_pending, NULL), and corresponding loop in stack_save_td_running() must use acquire load. This way, the stack trace writes in nmi handler are guaranteed to be visible in the calling cpu. I understand that the amd64 strictness makes this purely academic, but I believe that acq/rel use would highlight the intent and avoid the issue for relaxed arches when the code will be copy/pasted.