
off cpu tracking
Needs Review · Public

Authored by mjg on Aug 9 2020, 4:47 AM.

Details

Reviewers
kib
markj
jeff
Summary

While there are probes which can be used to track all on/off cpu activity, there are two problems with them:

  • probing on everything is quite expensive
  • there is no general way to distinguish blocking because of contention/waiting on something in kernel vs waiting on user (poll, select, waitpid etc.)

The patch below remedies the problem, although I don't have a strong opinion on any of the names used.

The probes can be used with a script like this:

#pragma D option dynvarsize=32m

/*
 * The p_flag & 0x4 test filters out kernel threads.
 */

sched:::ksleep
/(curthread->td_proc->p_flag & 0x4) == 0/
{

        self->ts = timestamp;
        self->mesg = arg1;
}

sched:::kwakeup
/self->ts/
{
        @[stack(30), stringof(self->mesg)] = sum(timestamp - self->ts);
        self->ts = 0;
        self->mesg = 0;
}

dtrace:::END
{
        printa("%k\n%s\n%@d\n\n", @);
}
  1. rm offcpu ; dtrace -s kprof-offcpu.d -o offcpu
  2. cat offcpu | perl ~/FlameGraph/stackcollapse.pl | perl ~/FlameGraph/flamegraph.pl --color=io > /tmp/out-off.svg

https://people.freebsd.org/~mjg/fg/flix1-buildkernel-offcpu.svg
https://people.freebsd.org/~mjg/fg/flix1-buildworld-offcpu.svg

extra: -j 40 buildkernel over NFS:
https://people.freebsd.org/~mjg/fg/lynx4-builldkernel-nfs-offcpu.svg

If you hover over labels you will see a "samples" count, which in this case is total time in nanoseconds. So for the buildworld case that's 153 seconds total spent off cpu due to the vm object. Note that a lot of it overlapped, so there is no immediate way to tell how much real time it caused. Nonetheless, I find this very useful.


Event Timeline

mjg requested review of this revision. Aug 9 2020, 4:47 AM

Arguably user-sleeps are cv/msleep with PCATCH.

This does not catch rfork: rfppwait is doing cv_timedwait(&p2->p_pwait, &p2->p_mtx, hz); similarly there are spots in umtx and others.

In D26011#577013, @mjg wrote:

This does not catch rfork: rfppwait is doing cv_timedwait(&p2->p_pwait, &p2->p_mtx, hz); similarly there are spots in umtx and others.

For vfork() I agree, you would need some kind of cv_timedwait_user() interface indeed.

But I do not see a place in kern_umtx.c where we would use non-interruptible sleep for dropping off cpu which can be blamed on user. Busying the keychain is arguably the in-kernel contention.

So I had more of a look around; I think PCATCH is too big of a hammer for this one. For example vn_start_write comes with PCATCH most of the time, but arguably it should be accounted for. With my approach there is more work to do in the long run, as newly discovered user problems have to be annotated, but nothing will be missed.

vn_start_write() is interruptible. Initially, I think, Tor made it interruptible to allow users to get out of numerous problems that the suspension implementation had. Since then most of the issues were fixed, and I think that the removal of PCATCH there is long overdue.

On the other hand, it could be argued that VFS syscalls become interruptible only under a very exceptional condition, where the _user_ initiated suspension, so there is no significant non-compliance. Also, since it is the user who starts snapshots or unmounts, accounting that sleep time as requested by the user is fair.

So I still very much dislike PUSERW.

sys/kern/subr_sleepqueue.c
184

Why not put these under the sched provider? I think the proposed names are confusing. "kernel" is too generic, and we already have a "profile" provider.

mjg retitled this revision from [WIP] off cpu tracking to off cpu tracking.
  • rebase
  • fix several annotations
  • rename probes to sched::{sleepq,turnstile}:user{sleep,wakeup}

So back to PUSERW (or an equivalent), turns out PCATCH *is* for kernel wait, for example in pipelock (previously incorrectly patched to be ignored).

Modulo perhaps some tidy-ups, I consider the patch committable, with a note that there are user waits which are yet to be annotated. The code below is tinderbox-clean in that regard.

mjg edited the summary of this revision.
  • {u,k}{sleep,wakeup}
In D26011#610773, @mjg wrote:

So back to PUSERW (or an equivalent), turns out PCATCH *is* for kernel wait, for example in pipelock (previously incorrectly patched to be ignored).

Well pipelock() is weird, it seems to be sleepable only due to uiomove() under it. We never do it this way anywhere else. Maybe I should find time to actually implement my proposal to switch it to sx.

That said, IMO adding a new msleep flag to indicate that the PCATCH-ed sleep is billed to the kernel, and using it to mark exceptions in pipelock() and perhaps NFS, seems to cause less churn.

In D26011#611350, @kib wrote:
In D26011#610773, @mjg wrote:

So back to PUSERW (or an equivalent), turns out PCATCH *is* for kernel wait, for example in pipelock (previously incorrectly patched to be ignored).

Well pipelock() is weird, it seems to be sleepable only due to uiomove() under it. We never do it this way anywhere else. Maybe I should find time to actually implement my proposal to switch it to sx.

I'm having trouble seeing how you would do that without downsides. If you convert PIPE_MTX to an sx lock, it will serialize both ends of the pipe while handling page faults in uiomove(). If you convert the hand-rolled sleep in pipelock() to an sx, then pipelock() will be more expensive.

That said, IMO adding a new msleep flag to indicate that the PCATCH-ed sleep is billed to the kernel, and using it to mark exceptions in pipelock() and perhaps NFS, seems to cause less churn.

The current approach is a lot of churn but seems easier to maintain:

  • It's rare to add new USERWAIT points, I think, and more common to add ad-hoc synchronization. It's not just pipelock() that needs to be modified. The sleep in callout_drain() for example needs annotation because the caller is blocking while waiting for another thread to do work.
  • If a new USERWAIT point is added and the developer forgets to annotate, it will show up in profiles and can therefore easily be fixed.

But I don't understand why we need extra complexity and branching in the kernel for this. There are not that many USERWAIT points in the kernel. Why can't DTrace filter them out itself?

sys/kern/subr_sleepqueue.c
628

Why do we need all of these new probes? You can write

sched:::sleep
/args[0]->td_flags & 0x80/

to catch user sleeps. Add a constant to /usr/lib/dtrace/sched.d to avoid hard-coding 0x80. Or, add a new parameter to sched:::sleep to indicate what kind of sleep it is.
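As a sketch, such a constant could follow the inline/pragma style already used in /usr/lib/dtrace/sched.d; the name TDF_USERWAIT is the flag this patch introduces, while the binding version string below is a placeholder, not the real one:

```d
/*
 * Hypothetical addition to /usr/lib/dtrace/sched.d. The flag name
 * TDF_USERWAIT comes from this patch; the binding version string
 * is a placeholder.
 */
inline int TDF_USERWAIT = 0x80;
#pragma D binding "1.0" TDF_USERWAIT
```

A script could then use a predicate like /args[0]->td_flags & TDF_USERWAIT/ without hard-coding the value, and the constant would stay in sync if the flag's value ever changes.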

638

Similarly you can use sched:::on-cpu with a td_flags predicate instead of adding new probes here.

sys/sys/callout.h
64

Duplicated description.

sys/kern/subr_sleepqueue.c
628

I noted in the review description that the intent is to reduce probe effect: going off/on cpu happens all the time and only a fraction of it is of interest. The above filters it out without descending into dtrace. Apart from that, since the points are ts/sleepq aware, more information can be added to them later.

In D26011#611350, @kib wrote:
In D26011#610773, @mjg wrote:

So back to PUSERW (or an equivalent), turns out PCATCH *is* for kernel wait, for example in pipelock (previously incorrectly patched to be ignored).

Well pipelock() is weird, it seems to be sleepable only due to uiomove() under it. We never do it this way anywhere else. Maybe I should find time to actually implement my proposal to switch it to sx.

I'm having trouble seeing how you would do that without downsides. If you convert PIPE_MTX to an sx lock, it will serialize both ends of the pipe while handling page faults in uiomove(). If you convert the hand-rolled sleep in pipelock() to an sx, then pipelock() will be more expensive.

I noted there is also the angle of pipe_poll, which is *not* blocked by pipelock and is a source of contention. Should anyone convert this to sx, they should provide a scheme where polling is not negatively affected.

imo the long term thing to do here is to convert poll/select internals to kqueue which persists across multiple calls. dfly uses kqueue internally for this but registers/deregisters on each call, which has significant single-threaded overhead and does not alleviate any of the contention.

That said, IMO adding a new msleep flag to indicate that the PCATCH-ed sleep is billed to the kernel, and using it to mark exceptions in pipelock() and perhaps NFS, seems to cause less churn.

The current approach is a lot of churn but seems easier to maintain:

  • It's rare to add new USERWAIT points, I think, and more common to add ad-hoc synchronization. It's not just pipelock() that needs to be modified. The sleep in callout_drain() for example needs annotation because the caller is blocking while waiting for another thread to do work.
  • If a new USERWAIT point is added and the developer forgets to annotate, it will show up in profiles and can therefore easily be fixed.

So, regardless of pipelock, I consider PCATCH as too much of a hammer to implicate waiting on user.

The churn in the patch is not significant and is mostly mechanical. More importantly, to expand on your second point, it does not accidentally hide any sleeping. With manual annotation, at worst some points will get it when they should not have, but none will be hidden from the get-go.

But I don't understand why we need extra complexity and branching in the kernel for this. There are not that many USERWAIT points in the kernel. Why can't DTrace filter them out itself?

I noted elsewhere: filtering without descending into dtrace. Under high load it starts suffering from dynamic drops even with dynvarsize bumped.

sys/kern/subr_sleepqueue.c
628

I think you need to demonstrate that the probe effect is severe enough to skew the results. Running your threadspawn benchmark shows hundreds of thousands of context switches per second; with DTrace probes enabled as I suggested, I see less than a 1% reduction in throughput. sched:::sleep and sched:::on-cpu fire in slow paths by definition. I don't see why a sufficiently large dynvar space wouldn't be enough; if you're getting drops with a large dynvarsize then I suspect a bug in your script, in DTrace, or in the placement of sched probes.

I also tried a script which checks curthread->td_wmesg against a table of 15 common user sleep points, again with very low overhead.
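A minimal sketch of that kind of filter follows; the wmesg strings below are illustrative guesses, not the actual 15-entry table, and the accounting mirrors the script from the review summary:

```d
#pragma D option dynvarsize=32m

/*
 * Sketch only: the wait-message strings are assumptions, standing in
 * for whatever table of common user sleep points one cares about.
 */
dtrace:::BEGIN
{
        userwait["select"] = 1;
        userwait["wait"] = 1;
        userwait["pause"] = 1;
        userwait["sigwait"] = 1;
        userwait["ppoll"] = 1;
}

/* Record sleeps whose wmesg is not in the user-wait table. */
sched:::sleep
/curthread->td_wmesg != NULL &&
 userwait[stringof(curthread->td_wmesg)] == 0/
{
        self->ts = timestamp;
}

/* Bill the off-cpu time when the thread runs again. */
sched:::on-cpu
/self->ts/
{
        @[stack(30)] = sum(timestamp - self->ts);
        self->ts = 0;
}
```

The table lookup happens in the predicate, so non-matching sleeps are discarded before any clause body runs, which keeps the dynamic-variable pressure low.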

sys/kern/subr_sleepqueue.c
628

Note that in a stock kernel the benchmark still suffers significant problems both single- and multi-threaded, largely because of IPIs flying due to kstack page invalidation.

I have an experimental patchset which deals with that problem and more, with that in place single-threaded test looks like this:

[stock]
min:165130 max:165130 total:165130
min:165287 max:165287 total:165287
min:164162 max:164162 total:164162
[dtrace -n 'sched:::on-cpu /(curthread->td_flags & 0x80) == 0/ { @[stack()] = count(); }']
min:124408 max:124408 total:124408
min:124280 max:124280 total:124280
min:123964 max:123964 total:123964

Multi-threaded, this still suffers a lot of contention even with the patchset (to be fixed soon(tm)).

However, I indeed can't reproduce dynamic var drops with the current script. It is plausible the old one had a bug there.

Still, I think the above justifies not going into dtrace if possible.

Regardless, it may be that adding the td + lockname/mesg tuple to sched:::sleep should be done irrespective of this patch.

In D26011#611502, @mjg wrote:
In D26011#611350, @kib wrote:

That said, IMO adding a new msleep flag to indicate that the PCATCH-ed sleep is billed to the kernel, and using it to mark exceptions in pipelock() and perhaps NFS, seems to cause less churn.

The current approach is a lot of churn but seems easier to maintain:

  • It's rare to add new USERWAIT points, I think, and more common to add ad-hoc synchronization. It's not just pipelock() that needs to be modified. The sleep in callout_drain() for example needs annotation because the caller is blocking while waiting for another thread to do work.
  • If a new USERWAIT point is added and the developer forgets to annotate, it will show up in profiles and can therefore easily be fixed.

So, regardless of pipelock, I consider PCATCH as too much of a hammer to implicate waiting on user.

The churn in the patch is not significant and is mostly mechanical. More importantly, to expand on your second point, it does not accidentally hide any sleeping. With manual annotation, at worst some points will get it when they should not have, but none will be hidden from the get-go.

It's mechanical but it adds runtime cost and complexity just to make tracing a bit simpler. And it still does not differentiate between blocking on a resource vs. kernel threads waiting for work, so a script which uses the new probes still needs to filter out some data.

I'm not convinced that this machinery is worth it to reduce tracing overhead by several percent. It is easy to filter data in DTrace itself or do some post-processing. I don't think anyone has made any serious effort to try and optimize DTrace itself so I'm sure there are gains to be had there.

I do not object to adding more information to the kernel that can be used to filter when tracing, but I don't think this diff is adding anything fundamentally new.

sys/kern/subr_sleepqueue.c
628

That's expensive because it is tracing the stack most of the time. The predicate is wrong because TDF_USERWAIT is cleared before sched:::on-cpu fires. This represents a bug in the patch.

If I apply UMA patches to fix the kstack cache, for the single-threaded case I get:

[stock]
min: 120027 max: 120027 total: 120027
min: 120036 max: 120036 total: 120036
min: 120081 max: 120081 total: 120081
[dtrace -n 'sched:::on-cpu /(curthread->td_flags & 0x80) == 0/ { @[stack()] = count(); }']
min: 85787 max: 85787 total: 85787
min: 85983 max: 85983 total: 85983
min: 85812 max: 85812 total: 85812
[dtrace -n 'sched::sleepq:kwakeup {@[stack()] = count();}']
min: 104631 max: 104631 total: 104631
min: 104761 max: 104761 total: 104761
min: 104667 max: 104667 total: 104667
[dtrace -n 'sched:::sleep {self->t = 1;} sched:::on-cpu /self->t && (curthread->td_flags & 0x80) == 0/{ @[stack()] = count();}']
min: 97170 max: 97170 total: 97170
min: 97438 max: 97438 total: 97438
min: 97440 max: 97440 total: 97440
[dtrace -n 'sched:::sleep {self->t = 1;} sched:::on-cpu /self->t && (curthread->td_flags & 0x80) != 0/{ @[stack()] = count();}']
min: 102709 max: 102709 total: 102709
min: 102366 max: 102366 total: 102366
min: 102635 max: 102635 total: 102635
[https://people.freebsd.org/~markj/dtrace/off-cpu]
min: 103237 max: 103237 total: 103237
min: 103268 max: 103268 total: 103268
min: 103509 max: 103509 total: 103509

and this benchmark is an extreme case because every iteration causes the looping thread to context switch. I see a smaller probe effect for other will-it-scale benchmarks. Moreover I see a significant effect simply from enabling *any* SDT probes, presumably because it forces us to take slow paths in synchronization primitives.