
PMC: Collect user call chains while in kernel space --- redux
Abandoned · Public

Authored by mmacy on May 7 2018, 5:35 AM.

Details

Reviewers
jhb
jeff
sbruno
Group Reviewers
manpages
Summary

D7350 (itself an update of D4227), reworked on top of D15155 and D15275.

Depends on D4227.
Depends on D15275.

Currently, when PMC is sampling call chains, it samples a user-space call chain when the counter fires while in user space and a kernel-space call chain when the counter fires while in kernel space. This is useful functionality and should not be discarded.

However, there are cases where a user really wants to know the user-space call chain every time the sample counter fires, even if the sample counter fires while in kernel mode. This change adds this capability, which is activated when the user specifies an optional parameter.
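For illustration only, a rough sketch of how a libpmc consumer might request this from user space. The flag names (PMC_F_CALLCHAIN, PMC_F_USERCALLCHAIN), the five-argument pmc_allocate() of this era, and the placeholder event name are assumptions about how the optional parameter is eventually spelled, not necessarily this draft's API:

#include <sys/types.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <pmc.h>

int
main(void)
{
	pmc_id_t pmcid;
	int fd;

	if (pmc_init() < 0)
		err(1, "pmc_init");
	/* Sampling PMCs log to a file; configure one first. */
	fd = open("samples.pmclog", O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0 || pmc_configure_logfile(fd) < 0)
		err(1, "pmc_configure_logfile");
	/* "instructions" is a placeholder event specifier. */
	if (pmc_allocate("instructions", PMC_MODE_TS,
	    PMC_F_CALLCHAIN | PMC_F_USERCALLCHAIN, PMC_CPU_ANY, &pmcid) < 0)
		err(1, "pmc_allocate");
	if (pmc_set(pmcid, 65536) < 0)		/* sampling period */
		err(1, "pmc_set");
	if (pmc_attach(pmcid, getpid()) < 0)	/* attach to ourselves */
		err(1, "pmc_attach");
	if (pmc_start(pmcid) < 0)
		err(1, "pmc_start");
	/* ... run the workload, then pmc_stop() and pmc_release() ... */
	return (0);
}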

(To go a step further, we discussed at BSDCan the fact that a user might like to be able to see both the kernel and user call chains associated with each sampling interval. However, this patch does not go that far.)

Note: This patch is against stable/10 + D4227 (diff 1). Once D4227 is committed, I will regenerate the patch against CURRENT. Until that time, this is probably more of a draft than a serious proposal.
Test Plan

This has been tested within Juniper on a large-scale multi-threaded app. I also plan to run some simple pathological threaded apps (constantly starting/stopping threads, several threads with long-running tight loops) which I've used for basic PMC sanity-testing for previous changes.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 16618

Event Timeline

kib added inline comments.
sys/dev/hwpmc/hwpmc_mod.c
1748

critical_enter() is redundant since you take the spin lock a line above it. Also, you pin the thread before calling the function.

That said, what else, except the td_flags at line 1797, requires the thread lock in the locked region? If nothing, it is arguably better to lock the thread only around the flag or-ing.

reduce scope of thread lock in userret

critical_enter() is redundant since you take the spin lock a line above it. Also, you pin the thread before calling the function.

@kib I don't understand this comment. The thread is pinned in the NMI handler and then unpinned in hardclock context:

void
intr_irq_handler(struct trapframe *tf)
{
	struct trapframe * oldframe;
	struct thread * td;

	KASSERT(irq_root_filter != NULL, ("%s: no filter", __func__));

	VM_CNT_INC(v_intr);
	critical_enter();
	td = curthread;
	oldframe = td->td_intr_frame;
	td->td_intr_frame = tf;
	irq_root_filter(irq_root_arg);
	td->td_intr_frame = oldframe;
	critical_exit();
#ifdef HWPMC_HOOKS
	if (pmc_hook && TRAPF_USERMODE(tf) &&
	    (PCPU_GET(curthread)->td_pflags & TDP_CALLCHAIN))
		pmc_hook(PCPU_GET(curthread), PMC_FN_USER_CALLCHAIN, tf);
#endif
}

thread_userret runs much later, so the thread is no longer pinned. If that is not in fact the case, please tell me what the actual sequence of events is here. Thanks.

sys/dev/hwpmc/hwpmc_mod.c
2327

Why can td_pinned not be >= 1?

2330

Where is sched_pin()?

sys/dev/hwpmc/hwpmc_mod.c
2327

Good point. Thanks.

2330

pmc_process_interrupt -> pmc_add_sample -> pmc_post_callchain_callback() -> sched_pin

The one thing this changes, though, is that td_pinned will now need to be operated on with an atomic, because we can't mask out NMIs. Do you have a better suggestion? We need to pin the thread until we've collected the sample in hardclock.
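For context, a minimal sketch of the pin-until-hardclock scheme under discussion, assuming both pieces run on the interrupted thread; the helper names are illustrative and this is not the actual hwpmc code:

/*
 * Illustrative sketch only (hypothetical helpers): the NMI-context
 * sampler pins the interrupted thread so the deferred user-stack walk
 * happens on the same CPU, and the hardclock-context callback releases
 * the pin once the callchain has been captured.  The question above is
 * whether td_pinned can tolerate this when NMIs cannot be masked.
 */
static void
sample_request_user_callchain(void)
{
	struct thread *td;

	td = curthread;
	sched_pin();				/* bumps td->td_pinned */
	td->td_pflags |= TDP_CALLCHAIN;		/* remember why we pinned */
}

static void
sample_capture_user_callchain(struct trapframe *tf)
{
	struct thread *td;

	td = curthread;
	if (td->td_pflags & TDP_CALLCHAIN) {
		/* walk the user stack described by tf here */
		td->td_pflags &= ~TDP_CALLCHAIN;
		sched_unpin();
	}
}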

  • make sched_pin atomic to make it NMI-safe
  • remove invalid assert
  • resolve conflicts with recent HEAD changes.
sys/dev/hwpmc/hwpmc_mod.c
2330

Mmm, this is not going to work. Atomics cannot prevent a situation where the code checks for td_pinned == 0, then an NMI occurs and increases the pinned counter, then e.g. sched_pickcpu() continues and migrates the thread to another processor. In this situation, you end up gathering counters on another CPU.

Also, an atomic update is too strong; at least on amd64 the lock prefix is not needed. You can see how e.g. the pcb_flags update is performed, which needs to be atomic WRT interrupts on the same CPU.
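For reference, a minimal sketch of that pattern, assuming amd64: a single read-modify-write instruction on a memory operand cannot be split by an interrupt or NMI on the local CPU, so no lock prefix is needed. This is modeled on the pcb_flags helpers and is interrupt-atomic only, not SMP-safe:

/*
 * Interrupt-atomic (but not SMP-atomic) flag update: the single "orl"
 * instruction executes as one unit with respect to interrupts and NMIs
 * on the local CPU, so the lock prefix is unnecessary.
 */
static __inline void
set_flags_intr_atomic(volatile u_int *flagsp, u_int flags)
{

	__asm __volatile("orl %1,%0"
	    : "+m" (*flagsp)
	    : "ir" (flags)
	    : "cc");
}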

I believe that instead of pinning and postponing to hardclock, you should issue an AST. Issuing an AST is protected by the thread lock, and it will be correctly handled on return from kernel to usermode, even for NMI.
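A minimal sketch of the AST-request pattern being suggested, assuming the TDF_ASTPENDING mechanism of that era's kernel; which flag hwpmc would actually use, and from which context it is safe to set it, is exactly what the rest of this thread works out:

/*
 * Hedged sketch: ask for an AST so the user callchain is captured on
 * the way back to user mode.  TDF_ASTPENDING and the thread-lock
 * protocol are the pre-14 mechanism; TDP_CALLCHAIN is the same flag
 * tested by the intr_irq_handler() hook quoted earlier.
 */
	struct thread *td;

	td = curthread;
	td->td_pflags |= TDP_CALLCHAIN;	/* td_pflags is private to curthread */
	thread_lock(td);
	td->td_flags |= TDF_ASTPENDING;
	thread_unlock(td);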

sys/dev/hwpmc/hwpmc_mod.c
2330

Thanks. That sounds like a better route.

sys/dev/hwpmc/hwpmc_mod.c
2330

Is it possible to migrate before returning from the AST?

I'll update with the sched_pin change reverted.

sys/dev/hwpmc/hwpmc_mod.c
2330

Actually, the non-atomic sched_pin is safe for the same reason that setting td_pflags is: we only call it if the trap came from userland.

  • take atomics back out of sched_{pin, unpin}
sys/dev/hwpmc/hwpmc_mod.c
2330

@kib Ok. So I take that back. The whole point of this change is to allow us to collect user callchains when the trap came from the kernel. Thus naively calling sched_pin and setting td_pflags is unsafe. Would it be ok to set CALLCHAIN on pcb_flags instead of td_pflags?

sys/dev/hwpmc/hwpmc_mod.c
2330

Ok, after the discussion elsewhere, I tried to think some more about this.

Let me summarize the points:

  • collect the sampled counter values on NMI into pcpu storage
  • get the userspace call chain for the thread at the next safe point, but definitely before the next return to userspace.

The problem is that we cannot safely read user memory if the NMI interrupted kernel code, for many reasons. In other words, we must schedule an AST to gather the userspace call chain.

Now, the problem is that the thread might actually migrate before we reach the AST call, so that the pcpu data in the current slot is potentially irrelevant.

Is this correct?

If yes, then there are two things which should allow us to handle this:

  1. Have a dedicated atomically-storable (i.e. int or larger) field in struct thread which is assigned from the NMI handler, and test-and-clear the field in AST context (a sketch of this option follows the list). This of course means the record can be lost, because an NMI might come after we have finished with the AST and are on the final path to usermode.
  2. Make the pcpu storage for samples tagged by thread id. We cannot disable preemption from NMI, really. Or store into the thread instead of the pcpu area, is that possible? Then the AST handler would iterate over the pcpus and find its own data, or just access the thread data.
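A rough sketch of option 1, with a hypothetical td_pmc_pending field (a plain u_int) added to struct thread; atomic_readandclear_int() is from <machine/atomic.h>. As noted above, a record can still be lost if the NMI lands after the AST has already run on the final path to usermode:

/* NMI handler: a plain store is enough, an NMI cannot nest on itself. */
	curthread->td_pmc_pending = 1;
	/* ...stash the sampled counter values and the trapframe... */

/* AST context, before the return to user mode: */
	if (atomic_readandclear_int(&curthread->td_pmc_pending) != 0) {
		/* walk the user stack from the saved trapframe here */
	}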