
PMC: Collect user call chains while in kernel space --- redux
Abandoned · Public

Authored by mmacy on May 7 2018, 5:35 AM.

Details

Reviewers
jhb
jeff
sbruno
Group Reviewers
manpages
Summary

D7350 (itself an update of D4227), reworked on top of D15155 and D15275.

Depends on D4227.
Depends on D15275.

Currently, when PMC is sampling call chains, it samples a user-space call chain when the counter fires while in user space and a kernel-space call chain when the counter fires while in kernel space. This is useful functionality and should not be discarded.

However, there are cases where a user really wants to know the user-space call chain every time the sample counter fires, even if the sample counter fires while in kernel mode. This change adds this capability, which is activated when the user specifies an optional parameter.
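For illustration only, a rough sketch of how a libpmc consumer might request this from user space. The flag names (PMC_F_CALLCHAIN, PMC_F_USERCALLCHAIN), the five-argument pmc_allocate() of this era, and the placeholder event name are assumptions about how the optional parameter is eventually spelled, not necessarily this draft's API:

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <pmc.h>

int
main(void)
{
	pmc_id_t pmcid;
	int fd;

	if (pmc_init() < 0)
		err(1, "pmc_init");
	/* Sampling PMCs log to a file; configure one first. */
	fd = open("samples.pmclog", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || pmc_configure_logfile(fd) < 0)
		err(1, "pmc_configure_logfile");
	/* "instructions" is a placeholder event specifier. */
	if (pmc_allocate("instructions", PMC_MODE_TS,
	    PMC_F_CALLCHAIN | PMC_F_USERCALLCHAIN, PMC_CPU_ANY, &pmcid) < 0)
		err(1, "pmc_allocate");
	if (pmc_set(pmcid, 65536) < 0)		/* sampling period */
		err(1, "pmc_set");
	if (pmc_attach(pmcid, getpid()) < 0)	/* attach to ourselves */
		err(1, "pmc_attach");
	if (pmc_start(pmcid) < 0)
		err(1, "pmc_start");
	/* ... run the workload, then pmc_stop() and pmc_release() ... */
	return (0);
}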

(To go a step further, we discussed at BSDCan the fact that a user might like to be able to see both the kernel and user call chains associated with each sampling interval. However, this patch does not go that far.)

Note: This patch is against stable/10 + D4227 (diff 1). Once D4227 is committed, I will regenerate the patch against CURRENT. Until that time, this is probably more of a draft than a serious proposal.
Test Plan

This has been tested within Juniper on a large-scale multi-threaded app. I also plan to run some simple pathological threaded apps (constantly starting/stopping threads, several threads with long-running tight loops) which I've used for basic PMC sanity-testing for previous changes.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 16618

Event Timeline

kib added inline comments.
sys/dev/hwpmc/hwpmc_mod.c
1748

critical_enter() is redundant since you take the spin lock a line above it. Also, you pin the thread before calling the function.

That said, what else, except the td_flags at line 1797, requires the thread lock in the locked region? If nothing, it is arguably better to lock the thread only around the flag or-ing.

reduce scope of thread lock in userret

critical_enter() is redundant since you take the spin lock a line above it. Also, you pin the thread before calling the function.

@kib I don't understand this comment. The thread is pinned in the NMI handler and then unpinned in hardclock context:

void
intr_irq_handler(struct trapframe *tf)
{
	struct trapframe * oldframe;
	struct thread * td;

	KASSERT(irq_root_filter != NULL, ("%s: no filter", __func__));

	VM_CNT_INC(v_intr);
	critical_enter();
	td = curthread;
	oldframe = td->td_intr_frame;
	td->td_intr_frame = tf;
	irq_root_filter(irq_root_arg);
	td->td_intr_frame = oldframe;
	critical_exit();
#ifdef HWPMC_HOOKS
	if (pmc_hook && TRAPF_USERMODE(tf) &&
	    (PCPU_GET(curthread)->td_pflags & TDP_CALLCHAIN))
		pmc_hook(PCPU_GET(curthread), PMC_FN_USER_CALLCHAIN, tf);
#endif
}

thread_userret runs much later, so the thread is no longer pinned. If that is not in fact the case, please tell me what the actual sequence of events is here. Thanks.

sys/dev/hwpmc/hwpmc_mod.c
2327

Why can td_pinned not be >= 1?

2330

Where is sched_pin()?

sys/dev/hwpmc/hwpmc_mod.c
2327

Good point. Thanks.

2330

pmc_process_interrupt -> pmc_add_sample -> pmc_post_callchain_callback() -> sched_pin

The one thing this changes, though, is that td_pinned will now need to be operated on with an atomic, because we can't mask out NMIs. Do you have a better suggestion? We need to pin the thread until we've collected the sample in hardclock.
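For context, a minimal sketch of the pin-until-hardclock scheme under discussion, assuming both pieces run on the interrupted thread; the helper names are illustrative and this is not the actual hwpmc code:

/*
 * Illustrative sketch only (hypothetical helpers): the NMI-context
 * sampler pins the interrupted thread so the deferred user-stack walk
 * happens on the same CPU, and the hardclock-context callback releases
 * the pin once the callchain has been captured.  The question above is
 * whether td_pinned can tolerate this when NMIs cannot be masked.
 */
static void
sample_request_user_callchain(void)
{
	struct thread *td;

	td = curthread;
	sched_pin();				/* bumps td->td_pinned */
	td->td_pflags |= TDP_CALLCHAIN;		/* remember why we pinned */
}

static void
sample_capture_user_callchain(struct trapframe *tf)
{
	struct thread *td;

	td = curthread;
	if (td->td_pflags & TDP_CALLCHAIN) {
		/* walk the user stack described by tf here */
		td->td_pflags &= ~TDP_CALLCHAIN;
		sched_unpin();
	}
}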

  • make sched_pin atomic to make it NMI-safe
  • remove invalid assert
  • resolve conflicts with recent HEAD changes.
sys/dev/hwpmc/hwpmc_mod.c
2330

Mmm, this is not going to work. Atomics cannot prevent a situation where the code checks for td_pinned == 0, then an NMI occurs and increases the pinned counter, then e.g. sched_pickcpu() continues and migrates the thread to another processor. In this situation, you end up gathering counters on another CPU.

Also, an atomic update is too strong; at least on amd64 the lock prefix is not needed. You can see how e.g. the pcb_flags update is performed, which needs to be atomic WRT interrupts on the same CPU.
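For reference, a minimal sketch of that pattern, assuming amd64: a single read-modify-write instruction on a memory operand cannot be split by an interrupt or NMI on the local CPU, so no lock prefix is needed. This is modeled on the pcb_flags helpers and is interrupt-atomic only, not SMP-safe:

/*
 * Interrupt-atomic (but not SMP-atomic) flag update: the single "orl"
 * instruction executes as one unit with respect to interrupts and NMIs
 * on the local CPU, so the lock prefix is unnecessary.
 */
static __inline void
set_flags_intr_atomic(volatile u_int *flagsp, u_int flags)
{

	__asm __volatile("orl %1,%0"
	    : "+m" (*flagsp)
	    : "ir" (flags)
	    : "cc");
}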

I believe that instead of pinning and postponing to hardclock, you should issue an AST. Issuing an AST is protected by the thread lock, and it will be correctly handled on return from kernel to usermode, even for NMI.
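A minimal sketch of the AST-request pattern being suggested, assuming the TDF_ASTPENDING mechanism of that era's kernel; which flag hwpmc would actually use, and from which context it is safe to set it, is exactly what the rest of this thread works out:

/*
 * Hedged sketch: ask for an AST so the user callchain is captured on
 * the way back to user mode.  TDF_ASTPENDING and the thread-lock
 * protocol are the pre-14 mechanism; TDP_CALLCHAIN is the same flag
 * tested by the intr_irq_handler() hook quoted earlier.
 */
	struct thread *td;

	td = curthread;
	td->td_pflags |= TDP_CALLCHAIN;	/* td_pflags is private to curthread */
	thread_lock(td);
	td->td_flags |= TDF_ASTPENDING;
	thread_unlock(td);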

sys/dev/hwpmc/hwpmc_mod.c
2330

Thanks. That sounds like a better route.

sys/dev/hwpmc/hwpmc_mod.c
2330

Is it possible to migrate before returning from the AST?

I'll update with the sched_pin change reverted.

sys/dev/hwpmc/hwpmc_mod.c
2330

Actually, the non-atomic sched_pin is safe for the same reason that setting td_pflags is: we only call it if the trap came from userland.

  • take atomics back out of sched_{pin, unpin}
sys/dev/hwpmc/hwpmc_mod.c
2330

@kib Ok. So I take that back. The whole point of this change is to allow us to collect user callchains when the trap came from the kernel. Thus naively calling sched_pin and setting td_pflags is unsafe. Would it be ok to set CALLCHAIN on pcb_flags instead of td_pflags?

sys/dev/hwpmc/hwpmc_mod.c
2330

Ok, after the discussion elsewhere, I tried to think some more about this.

Let me summarize the points:

  • collect the sampled counter values on NMI into pcpu storage
  • get the userspace call chain for the thread at the next safe point, but definitely before the next return to userspace.

The problem is that we cannot safely read user memory if the NMI interrupted kernel code, for many reasons. In other words, we must schedule an AST to gather the userspace call chain.

Now, the problem is that the thread might actually migrate before we reach the AST call, so that the pcpu data in the current slot is potentially irrelevant.

Is this correct?

If yes, then there are two things which should allow us to handle this:

  1. Have a dedicated atomically-storable (i.e. int or larger) field in struct thread which is assigned from the NMI handler, and test-and-clear the field in AST context (a sketch of this option follows the list). This of course means the record can be lost, because an NMI might come after we have finished with the AST and are on the final path to usermode.
  2. Make the pcpu storage for samples tagged by thread id. We cannot disable preemption from NMI, really. Or store into the thread instead of the pcpu area, is that possible? Then the AST handler would iterate over the pcpus and find its own data, or just access the thread data.
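A rough sketch of option 1, with a hypothetical td_pmc_pending field (a plain u_int) added to struct thread; atomic_readandclear_int() is from <machine/atomic.h>. As noted above, a record can still be lost if the NMI lands after the AST has already run on the final path to usermode:

/* NMI handler: a plain store is enough, an NMI cannot nest on itself. */
	curthread->td_pmc_pending = 1;
	/* ...stash the sampled counter values and the trapframe... */

/* AST context, before the return to user mode: */
	if (atomic_readandclear_int(&curthread->td_pmc_pending) != 0) {
		/* walk the user stack from the saved trapframe here */
	}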