Page MenuHomeFreeBSD

Greatly reduce interrupt latency caused by steal_idle
ClosedPublic

Authored by truckman on Aug 26 2017, 5:42 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 5, 8:49 AM
Unknown Object (File)
Thu, Dec 12, 6:22 PM
Unknown Object (File)
Wed, Dec 11, 10:39 AM
Unknown Object (File)
Dec 9 2024, 11:38 PM
Unknown Object (File)
Nov 21 2024, 6:43 AM
Unknown Object (File)
Nov 2 2024, 11:00 AM
Unknown Object (File)
Nov 2 2024, 11:00 AM
Unknown Object (File)
Nov 2 2024, 11:00 AM

Details

Summary

Reduce the interrupt latency caused by the steal_idle code
in tdq_idled() by removing the spinlock_enter()/spinlock_exit()
wrapper around it. Preemption events are pretty rare and can
be handled by detecting them and restarting the search.

Poll for recently assigned threads and switch to them before
calling tdq_lock_pair(). This happens frequently enough that
it is worth avoiding the aquisition of the extra lock which
will get immediately dropped.

If the cpu returned by sched_highest() no longer has a thead
to steal, restart the search from the beginning rather than
resuming the search at the same topology level. Things have
changed enough in the time that is has taken to get to this
point that the previous results are stale. The extra time
spent by restarting the search at the beginning does not
matter so much now that interrupts are not blocked, and
this is part of the idle thread ...

Straighten out the control flow in tdq_idled() so that it
is not quite so convoluted.

Fix an off-by-one error in the CPU_SEARCH_HIGHEST section
of cpu_search(). There are only transferable threads
when tdq_load > 1, since a CPU with a currently running
thread and an empty queue will have a load of 1.

Test Plan

Compare before and after performance

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

This revision is now accepted and ready to land.Aug 26 2017, 6:35 PM
cem added a subscriber: cem.

This is interesting. Do you have any benchmarks or observations about the performance impact? How did you arrive at this needing to be addressed?

I have another patch that was a 2-3% perf improvement at isilon that I have been meaning to backport that is somewhat at odds with this, although not entirely. It performs a search in the last thread to switch out rather than switching to the idle thread. This saves you a context switch into the idle thread and then back out again when stealing will be immediately productive. The code is relatively straightforward. It helps a lot when there are tons of context switches and lots of short running threads coming and going.

It is not structurally at odds with this patch. Both would work. But it defeats some of the intent because I don't believe you can release the spinlock section while you are in the middle of switching a sleeping thread out. It makes you deadlock prone and can increase the latency to release the preempted-while-trying-to-sleep thread. This code does run in more cases than the search during switchout would so it still helps but perhaps not as much as it could.

Basically this started as an investigation of system hangs, reboots, and oddball process failures on Ryzen as documented in this PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=219399. It turns out that there are actually multiple issues and this PR ended up tracking the hang/reboot issue, which was eventually fixed. The remaining problem(s) were moved to this PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=221029. After doing a number of experiments, I came to the conclusion that the majority of the problems were correlated to thread migration, and I discovered that disabling both kern.sched.balance and kern.sched.steal_idle got rid most or all of the problems. That led me to start hacking on the scheduler code to see if there was any pattern to the problematic thread migrations and I found that disabling the last loop iteration that looked at all cores alleviated the problems. Ryzen apparently has issues with its IRET instruction, so I wasn't sure if the problem was triggered by having interrupts disabled for such a long time or if the problem was due to migrating threads between the two CCXes. It also looked to me like there was an off-by-one bug in sched_highest(), so I asked about that in this email thread: https://docs.freebsd.org/cgi/getmsg.cgi?fetch=12902+0+/usr/local/www/mailindex/archive/2017/freebsd-arch/20170827.freebsd-arch (I don't know why my original message isn't in the archive).

Disabling interrupts for so long concerned me in the general case and the trend was going the wrong way. It's not too bad for simple CPUs with a small number of cores, but the Ryzen has 16 logical CPUs and three levels of hierarchy. Threadripper is going to make this worse since it has twice as many cores and another level of hierarchy. If you start pinning a lot of threads causing tdq_move() to frequently fail, then the runtime starts going up as N^2.

After removing spinlock_enter()/spinlock_exit() and adding the code to handle preemptions, I also spammed the code with tons of counters to see what was really happening under my test load of building a bunch of ports with poudriere, and then tweaked the code to optimize the most frequent paths. It turns out that being preempted here by a higher priority thread is pretty rare. The code leading up to tdq_lock_pair() might be getting interrupted, but I don't think it makes too much of a difference to the results. If an iteration fails to do the chosen CPU no longer having an available thread to steal, it is much more likely that the thread disappears during the time it takes to do the tdq_lock_pair() than the time it execute the code leading up to that point.

I think that the off-by-one problem could cause a thread to be stolen from an idle CPU if the thread was assigned to the CPU but the its idle thread had not yet noticed and switched to that new thread.

Your improvement sounds promising. To keep the time with interrupts disabled to a reasonable values would be to only do the first loop iteration in the context of the thread switching out. Basically just look at the first level of the hierarchy and switch to the idle thread if you can't find an available thread there or if tdq_move() fails and let the idle thread to a more complete search. The statistics that I gathered on my 2 SMT thread, 4 core, 2 CCX Ryzen indicate that you should be able to handle about 80% of the potential steals without switching to the idle thread.

I've currently running benchmarks. I'll post the results when I get them.

This is very interesting. To validate your notion that migrating before iret is the problem you could simply sched_pin() between interrupt and iret or change the SCHED_CAN_MIGRATE test to exclude ithreads. This should be simple to test. I will also reach out to some contacts at AMD.

I too am a little concerned about the time to search these increasingly complex trees. I have some patches that also reduce the number of searches in sched_pickcpu() by doing multiple searches simultaneously. I will try to bring that in soon too.

I have a thread ripper system myself that I am now using for freebsd development. Do you have a test case that you suggest for reproducing some of this? Can I do something simpler than building the ports tree? I saw no instability with buildworld. You are aware that some early zen models have an errata that you can RMA for? This was reported as compilers seg faulting on linux but may cause other problems.

AMD has publicly been very vague about this problem. About the only thing they've said is that there is some sort of "performance marginality" and that Threadripper is not affected. Since Threadripper uses the same die stepping as Ryzen, it looks like they have figured out how to screen for parts with this problem. There is a long thread on the AMD forum here: https://community.amd.com/thread/215773?start=1095&tstart=0.

I do suspect that I have one of the buggy CPUs, but I haven't gotten started on the RMA process yet.

I don't think that there is a problem with ithread migration. As a matter of fact the lack of any obvious problems with ithreads makes me suspect a tlb issue. Ithreads live in kernel memory space which is going to be the same everywhere. I'm wondering if a thread that has migrated to a core on the other CCX isn't always getting its tlb fully initialized before returning to userspace.

As a test case, you might try building lang/ghc in parallel with some other load so that there is some thread migration activity. I'm not sure it is the same problem, but ghc seems to be super sensitive and I pretty much always see SIGBUS failures on my Ryzen machine.

I don't think that there is a problem with ithread migration. As a matter of fact the lack of any obvious problems with ithreads makes me suspect a tlb issue. Ithreads live in kernel memory space which is going to be the same everywhere. I'm wondering if a thread that has migrated to a core on the other CCX isn't always getting its tlb fully initialized before returning to userspace.

We do not re-initialize (flush is the common term) thread's TLB before returning to userspace. We flush TLB when switching contexts. More precisely, there are two variants of it.

On older CPUs, without the PCID feature, reload of %cr3 (the page table base pointer) flushes all non-global entries from TLB of the CPU thread which reloaded %cr3. Kernel-mode TLB entries are typically global (PG_G). The reload most often occurs on context switch, see cpu_switch.S.

On newer Intel CPUs, where PCID feature is present, each TLB entry is additionally tagged with the address space ID, and on the switch we typically inform CPU about new address space ID, avoiding the flush.

One of the reason I asked you about the verbose dmesg some time ago was to see which TLB switching mechanism is used, and apparently Ryzens do not implement PCID. Although AMD CPUs do tag TLB entries with ASIDs for SVM for long time.

You could experiment by adding the the equivalent of invltlb_glob() right after %cr3 reload in cpu_switch. This should flush the whole TLB, including the kernel entries.

You already tried to disable superpage promotions, so other known workarounds like erratum383 should be not useful.

sys/kern/sched_ule.c
736 ↗(On Diff #32412)

I see a small increase in wall clock time due to this change and an increase in CPU %idle since there are fewer opportunities to steal a thread when looking for a CPU with a tdq_load > 2.

1002–1005 ↗(On Diff #32412)

What is the intent of this code? I think I've been interpreting it backwards. On SMT hardware, do we want to make it easier or harder to steal a thread from another crippled SMT thread on the same core vs. from another core? It is a lighter weight migration due to shared cache, but in a 50% load situation it will tend to cluster threads on the same cores vs. spreading them out across cores with only one SMT thread on each core active, where they might run faster due to less resource contention.

If this code was reversed, then the steal would happen on the first iteration if the load on the other SMT thread was very high. That same transfer could happen in the second iteraction if the load on that SMT thread was the highest of all the logical CPUs at that level.

I've been wondering why steals from the first level of the hierarchy were far outnumbering steals from the other two levels during my Ryzen testing ...

1048 ↗(On Diff #32412)

The threshold test here should be the inverse of the condition on line 736 above.

If steal->tdl_load == 1 is not matched here, then I sometimes see tdq_move() succeed when steal->tdl_load == 1 && steal->tdq_transferrable != 0, which looks like we are somehow poaching the only thread assigned to that CPU. I'm not sure how that can happen other than the other CPU being idle and has somehow not picked up the thread. The frequency of occurrence seems to be independent of the machdep.idle setting. Maybe it is busy handling an interrupt ...

sys/kern/sched_ule.c
1005 ↗(On Diff #32412)

thresh == 1 is strange in this version of the code (which does a tdq_load >= thresh test) because most opportunities will be filtered out by the tdq_transferrable check, but every once in a while both tdq_load and tdq_transferrable will both be 1 and the move will succeed.

I did some further investigation of the successful steals that happen when steal->tdq_load == 1 && steal->tdq_transferrable !=0. That happened 1557 times during a buildworld run. Of those, steal->tdq_cpu_idle was nonzero 1099 times, and steal->tdq_ipipending was nonzero 196 times. It doesn't look like the source CPU being busy handling an interrupt was the issue. On my hardware, cpu 0 handles the majority of interrupts by far, but the distribution these events is fairly evenly distributed across all of the CPUs.

sys/kern/sched_ule.c
736 ↗(On Diff #32412)

As I describe in another inline comment the current code allows to steal a single "just assigned" thread from a CPU and that may be useful in some cases.

736 ↗(On Diff #32412)

Also, if this change is to be made, then, in my opinion, the default steal threshold should be lowered to one.

1005 ↗(On Diff #32412)

I think that this code makes a small difference in a single very specific scenario.
If a thread is scheduled to run on an SMT sibling and it hasn't started running yet (the sibling's load and transferable are both one), then stealing the thread allows it to start running a little bit sooner. If not stolen the thread would get to run only after the sibling is woken up.
That's the only theory that I could come up for that code.

1022 ↗(On Diff #32412)

wouldn't just returning zero here do the same job (given the loop in sched_idletd) ?

Adding Alexander as he made some ULE changes relatively recently.

In D12130#251884, @jeff wrote:

I have another patch that was a 2-3% perf improvement at isilon that I have been meaning to backport that is somewhat at odds with this, although not entirely. It performs a search in the last thread to switch out rather than switching to the idle thread. This saves you a context switch into the idle thread and then back out again when stealing will be immediately productive. The code is relatively straightforward. It helps a lot when there are tons of context switches and lots of short running threads coming and going.

I would love to see that change land in the tree. It makes a lot of sense to elide a switch to the idle thread if we have an opportunity to steal a thread and switch to it directly.

sys/kern/sched_ule.c
736 ↗(On Diff #32412)

I plan to back out this change.

1005 ↗(On Diff #32412)

Only a tiny bit sooner. After adding a bunch of debug code to analyze what exactly is happening, I discovered that the victim SMT thread is almost always blocked on thread_lock() in critcal_exit() at the time the thread is actually stolen. It is awake and returning from the IPI that will trigger the preemption that will cause it to switch from the the idle thread to the new thread that it was assigned, but it can't proceed because the poaching thread has called tdq_lock_pair() and grabbed its lock.

1022 ↗(On Diff #32412)

Yes, I realized that and made that change in my next version of the patch.

In D12130#253523, @avg wrote:
In D12130#251884, @jeff wrote:

I have another patch that was a 2-3% perf improvement at isilon that I have been meaning to backport that is somewhat at odds with this, although not entirely. It performs a search in the last thread to switch out rather than switching to the idle thread. This saves you a context switch into the idle thread and then back out again when stealing will be immediately productive. The code is relatively straightforward. It helps a lot when there are tons of context switches and lots of short running threads coming and going.

I would love to see that change land in the tree. It makes a lot of sense to elide a switch to the idle thread if we have an opportunity to steal a thread and switch to it directly.

https://people.freebsd.org/~cem/0001-Improve-scheduler-performance-and-decision-making.-F.patch if you want to work on adopting it. I've committed the standalone subr_smp.c portion already.

In D12130#253523, @avg wrote:
In D12130#251884, @jeff wrote:

I have another patch that was a 2-3% perf improvement at isilon that I have been meaning to backport that is somewhat at odds with this, although not entirely. It performs a search in the last thread to switch out rather than switching to the idle thread. This saves you a context switch into the idle thread and then back out again when stealing will be immediately productive. The code is relatively straightforward. It helps a lot when there are tons of context switches and lots of short running threads coming and going.

I would love to see that change land in the tree. It makes a lot of sense to elide a switch to the idle thread if we have an opportunity to steal a thread and switch to it directly.

As would I. I think it could only do a quick search in order to keep the critical section from taking too long, so if it can't quickly steal a thread it should bail out and let the idle thread do a more complete search.

In D12130#253585, @cem wrote:
In D12130#253523, @avg wrote:
In D12130#251884, @jeff wrote:

I have another patch that was a 2-3% perf improvement at isilon that I have been meaning to backport that is somewhat at odds with this, although not entirely. It performs a search in the last thread to switch out rather than switching to the idle thread. This saves you a context switch into the idle thread and then back out again when stealing will be immediately productive. The code is relatively straightforward. It helps a lot when there are tons of context switches and lots of short running threads coming and going.

I would love to see that change land in the tree. It makes a lot of sense to elide a switch to the idle thread if we have an opportunity to steal a thread and switch to it directly.

https://people.freebsd.org/~cem/0001-Improve-scheduler-performance-and-decision-making.-F.patch if you want to work on adopting it. I've committed the standalone subr_smp.c portion already.

Wow, there's a lot in there. Looks like good stuff. I'd like to get my patch out of the way before diving into this. I think both sets of changes can coexist.

truckman edited the test plan for this revision. (Show Details)

Mark more members of struct tdq as volatile to prevent accesses
to them from being optimized out by the compiler. They are
written in one context and read in another, sometimes multiple
times in one function, and we want the reader to notice changes
if they happen.

Back out hgroup.cs_limit change in cpu_search() because it
causes an increase in wall clock time. Effectively steal_thresh
is being changed from 2 to 3, providing fewer opportunities to
steal a thread.

Use tdq_notify() instead of always just sending an IPI in
sched_balance_pair(). This requires that tdq_move() return the
thread moved instead of a count. Since tdq_move() only ever
moves one thread, update an outdated comment.

The code to set the steal threshold to 1 for HTT and SMT threads
was introduced in r176735 with no rationale given for why the
value 1 was chosen. In most cases if the load on a CPU is 1,
then transferable is 0 and sched_highest() will not return it.
The exception is if a thread has recently been assigned to an
idle CPU and the idle thread hasn't yet switched to it. In the
cases where we poach a thread from an idle CPU, it appears that
the victim is just barely losing the race to start that thread.
We are probably better off leaving that thread in place and
continuing the search at the next level of the hierarchy and
looking for a plumper target. At the time of r176735,
steal_thresh was automatically tuned to the log2(#cpus),
supposedly to prevent "excessive thrashing". This was changed
by mav in r239194 to just always leave steal_thresh at its
default value of 2, which improved performance. That still
seems to be true based on the performance regression that I
observed when I changed cpu_search(). Simplify this code by
always passing steal_threash to sched_highest(). Also change
sched_balance_group() to use steal_thresh instead of 1.

Preemption is not normally used when assigning threads to a a
CPU that is running its idle thread, so poll tdq_load more
frequently in tdq_idled() to reduce the latency in switching to
a newly assigned thread. When that is detected, just return
from tdq_idled() and let sched_idletd() do the switch.

In sched_idletd(), poll tdq_load one more time between the fence
and the cpu_idle() call.

Benchmark results with make -j16 buildworld:

Wall time:

x oldule16
+ newulev216
* 4bsd16
% newule16
+----------------------------------------------------------------------------+
|* *    **       *  + *x+                                        %    %%    %|
|                 |__AM_|                                                    |
|                 |__A__|                                                    |
||___A___|                                                                   |
|                                                                 |___A____| |
+----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4       1584.52       1589.72      1588.695     1587.9075     2.3265622
+   4       1584.95       1590.18       1587.62     1587.5925     2.2148194
No difference proven at 95.0% confidence
*   4       1572.82       1578.68      1576.065     1575.9075     2.7576243
Difference at 95.0% confidence
	-12 +/- 4.41434

-0.755712% +/- 0.277126%
(Student's t, pooled s = 2.55121)

%   4       1620.64       1629.28       1624.92       1624.94     3.5409603
Difference at 95.0% confidence

37.0325 +/- 5.18384
2.33216% +/- 0.328772%
(Student's t, pooled s = 2.99594)

4BSD is best here, the previous version of this patch was worst,
my latest patch and the current code are essentially tied.


User CPU time:

x oldule16
+ newulev216
* 4bsd16
% newule16
+----------------------------------------------------------------------------+
|     %                                                              ++ x*   |
|%    %%                                                             ++ x** *|
|                                                                       |A|  |
|                                                                    |A      |
|                                                                        |A| |
| |__AM_|                                                                    |
+----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4      20849.73      20865.24     20853.345     20855.415     7.3261654
+   4      20827.91      20836.35      20833.27       20832.7     3.7182702
Difference at 95.0% confidence

-22.715 +/- 10.0519
-0.108917% +/- 0.0481565%
(Student's t, pooled s = 5.8094)

  • 4 20857.03 20875.79 20861.185 20863.798 8.4778117 No difference proven at 95.0% confidence % 4 20376.5 20415.3 20410.51 20403.205 18.031612 Difference at 95.0% confidence

-452.21 +/- 23.8131
-2.16831% +/- 0.113834%
(Student's t, pooled s = 13.7625)

The previous version of this patch was best (less thread movement ->
better cache utilization?).  This version of my patch is slightly
better than the others.


System CPU time:

x oldule16
+ newulev216
* 4bsd16
% newule16
+----------------------------------------------------------------------------+
|* ** *                        %%% %                                x  x++ ++|
|                                                                  |MA_|     |
|                                                                        |A_||
||_A_|                                                                       |
|                              |A_|                                          |
+----------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4       2457.41        2467.9      2458.125       2460.39     5.0181338
+   4       2471.05       2481.17      2476.225     2476.1675     4.7725491
Difference at 95.0% confidence

15.7775 +/- 8.47303
0.64126% +/- 0.345539%
(Student's t, pooled s = 4.89688)

  • 4 2265.36 2278.31 2272.3 2272.0675 5.3177776 Difference at 95.0% confidence

-188.322 +/- 8.94582
-7.65417% +/- 0.35076%
(Student's t, pooled s = 5.17013)

%   4       2351.47       2362.11      2355.195     2355.9925     4.5881469
Difference at 95.0% confidence

-104.397 +/- 8.31915
-4.24313% +/- 0.330386%
(Student's t, pooled s = 4.80795)

4BSD is best, the previous version of my patch was next best.  The
current version of ULE is in third place and this version of my patch
is in forth place, with the latter two being fairly close.

The objective of this patch is to reduce interrupt latency without
causing any significant peformance regressions.

This revision now requires review to proceed.Sep 4 2017, 7:46 PM
In D12130#252053, @kib wrote:

I don't think that there is a problem with ithread migration. As a matter of fact the lack of any obvious problems with ithreads makes me suspect a tlb issue. Ithreads live in kernel memory space which is going to be the same everywhere. I'm wondering if a thread that has migrated to a core on the other CCX isn't always getting its tlb fully initialized before returning to userspace.

We do not re-initialize (flush is the common term) thread's TLB before returning to userspace. We flush TLB when switching contexts. More precisely, there are two variants of it.

On older CPUs, without the PCID feature, reload of %cr3 (the page table base pointer) flushes all non-global entries from TLB of the CPU thread which reloaded %cr3. Kernel-mode TLB entries are typically global (PG_G). The reload most often occurs on context switch, see cpu_switch.S.

On newer Intel CPUs, where PCID feature is present, each TLB entry is additionally tagged with the address space ID, and on the switch we typically inform CPU about new address space ID, avoiding the flush.

One of the reason I asked you about the verbose dmesg some time ago was to see which TLB switching mechanism is used, and apparently Ryzens do not implement PCID. Although AMD CPUs do tag TLB entries with ASIDs for SVM for long time.

You could experiment by adding the the equivalent of invltlb_glob() right after %cr3 reload in cpu_switch. This should flush the whole TLB, including the kernel entries.

You already tried to disable superpage promotions, so other known workarounds like erratum383 should be not useful.

I finally had a chance to try this and it didn't solve the problem.

Any feedback? I'd like to get this committed before I RMA my Ryzen CPU.

At high context switching rates on systems with many cores stealing code does create significant CPU load. So I can believe that the critical section can indeed be an issue, so this way may be good to go. I personally thought about limiting maximal stealing distance based on some statistic factors, such as context switch rates, etc, since we shouldn't probably touch every CPU caches on every context switch -- its a dead end with growing number of CPUs, but haven't got far on it.

sys/kern/sched_ule.c
842 ↗(On Diff #32645)

It's being awhile, but I am not sure this is a correct change. This code is not about idle stealing threads but about equalizing the load. So anything transferrable may worth to be transferred from time to time to balance cache usage and other shared resources.

sys/kern/sched_ule.c
842 ↗(On Diff #32645)

The only situation where this would make a difference is if the balancer wanted to move a thread from a CPU with a load of 1 to a CPU with a load of 0. The only time that a CPU with a load of 1 would have a transferrable thread is if it was previously idle and and that thread was just assigned to it, but it hadn't yet done a context switch from the idle thread to the newly assigned thread. It is likely that the thread was assigned by sched_add(), which used sched_pickcpu() to chose the optimal CPU to assign the thread to.

sys/kern/sched_ule.c
842 ↗(On Diff #32645)

The value 2 might be better than steal_thresh, but I don't think 1 is the correct value. I don't think it matters much because 2 is the the default value for steal_thresh, and nobody is likely to adjust it higher because that increases the wall clock time in all of the testing that I've done.

In D12130#254431, @mav wrote:

At high context switching rates on systems with many cores stealing code does create significant CPU load. So I can believe that the critical section can indeed be an issue, so this way may be good to go. I personally thought about limiting maximal stealing distance based on some statistic factors, such as context switch rates, etc, since we shouldn't probably touch every CPU caches on every context switch -- its a dead end with growing number of CPUs, but haven't got far on it.

If you frequently poll for threads being assigned and abort the search early it lessens the pain. That's what I've tried to do in with this patch. Also, when the system is heavily loaded, the search is likely to succeed early, so fewer cycles are wasted when the load is heavy. The worst case is when all cores are busy with a load of 1 and most searches fail due to a lack of threads that are eligible to steal.

Using cpuset to pin threads makes this a lot worse because all the work done to find a thread to steal may have to be frequently thrown away because the tdq_move() fails.

It would be nice if some of this stuff could be precomputed and cached instead of each CPU repeating the same sets of calculations, but that looks difficult ...

Incorporate trysteal improvement from Jeff Roberson @ Isilon

This version of the patch adds the trysteal improvment from the Isilon
version of sched_ule.c, which avoids the extra context switch to the
idle thread before stealing a thread, with these changes:

  • Unlock tdq earlier to reduce the time that sched_add() on another CPU is blocked when trying to add a new thread to this CPU's queue.
  • Poll tdq_load more frequently and abort the search early if an incoming thread is detected, to reduce the latency from when a new thread is assigned to this CPU and when we start running it.
  • Limit the time spent in the critial section with preemption blocked by using a configurable limit on the topological distance for the search for a thread to steal, and aborting the search if any problems are detected after tdq_lock_pair(). The idle thread can do a more complete search and it does not block preemption. If there is a thread available to steal, tdq_trysteal() will find it in the majority of cases.

Set the sched_highest() threshold in tdq_balance_group() to 2 instead
of steal_thresh.

An additional improvement that I investigated was switching directly
to the stolen thread instead of putting it on the run queue and then
using choosethread() to pull it back off. This did not seem to make
a measureable difference in the benchmark results.

For the following benchmark results, I switched from doing "make
-j16 buildworld" on the 12.0-CURRENT source to "make -j16 buildworld"
on the 8.4-RELEASE source in an 8.4-RELEASE jail. The 8.4 buildworld
is not nearly as dominated by long compiles of the clang source,
so there is more rapid process churn. Also, 8.4 buildworld is much faster
so I was able to collect more sample data. In addition, I disabled
WITNESS and INVARIANTS to be more representative of real-world
production performance.

Wall time - The current version of sched_ule.c and my previous patch are
tied, 4BSD is worst, and this version of my patch is best.

x oldule.16
+ newule-v2.16
* newule-v3.16
% 4bsd.16
+------------------------------------------------------------------------+
|                                              %                         |
|               *                              %                         |
|          **** *  +   x                       %  %     %           %    |
|        *x****** **   x xx +                  %  %%  % %   %  %    %    |
|*   ****************x***x*+*   ++  + +       %#% %%%%% % % %  %% %%%%  %|
|          |____M_A_______|                                              |
|          |_____M_A_______|                                             |
|       |_____A_____|                                                    |
|                                               |______M_A_______|       |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  32        384.19        391.18        385.32     385.69688     1.3944786
+  32        383.83        389.52        385.41     385.84625     1.5597141
No difference proven at 95.0% confidence
*  32        382.42        387.47        384.92     384.88969     1.0994412
Difference at 95.0% confidence
	-0.807188 +/- 0.627514
	-0.20928% +/- 0.162486%
	(Student's t, pooled s = 1.25566)
%  32        390.92        395.91        392.69     392.96812     1.5453978
Difference at 95.0% confidence
	7.27125 +/- 0.735569
	1.88522% +/- 0.192334%
	(Student's t, pooled s = 1.47187)

User time - The current version of sched_ule.c is best. Both the
previous version of my patch and this version of my patch are tied.
4BSD is worst.

x oldule.16
+ newule-v2.16
* newule-v3.16
% 4bsd.16
+------------------------------------------------------------------------+
|           *+*                                                      % % |
|        xx *+*      *                                       %  %    % % |
|    xx* *x *+*+* *+**** *+                             %   %%  % %% % % |
|x   *x**********************   +  +               %  %%%% %%% %%%%%%%%%%|
|      |_____MA_____|                                                    |
|          |____M_A______|                                               |
|          |_____A_____|                                                 |
|                                                         |_____AM____|  |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  32       2967.28        2978.7       2972.72     2973.1391     2.9544536
+  32       2969.11       2982.88       2974.07     2975.0219      3.288318
Difference at 95.0% confidence
	1.88281 +/- 1.56214
	0.0633274% +/- 0.0525567%
	(Student's t, pooled s = 3.12585)
*  32       2969.98       2979.35       2974.39     2974.4475     2.7102244
No difference proven at 95.0% confidence
%  32       2990.13       2999.61      2996.235     2995.9212     2.6384841
Difference at 95.0% confidence
	22.7822 +/- 1.39976
	0.766267% +/- 0.0472814%
	(Student's t, pooled s = 2.80093)

System time - 4BSD is best, followed by this version of my patch,
the current version of sched_ule.c, and the previous version of my
patch brings up the rear.

x oldule.16
+ newule-v2.16
* newule-v3.16
% 4bsd.16
+------------------------------------------------------------------------+
|        * %       % %%   % %*        +*xx * *x     +x   + x  +          |
|O     %%*%%  #OOOOO*O% O*O+O*O%x#@x**@O**%***O*x*xO**x*** *O+++  +++   +|
|                               |__________A__________|                  |
|                                     |___________A_M_________|          |
|               |____________MA____________|                             |
|          |__________M__A____________|                                  |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  32        487.47        494.81        492.11     492.11687     1.8286809
+  32        489.59        496.74        493.51     493.29313     1.9214921
Difference at 95.0% confidence
	1.17625 +/- 0.937361
	0.239018% +/- 0.190692%
	(Student's t, pooled s = 1.87566)
*  32        485.42        494.03        489.89      489.9725     2.0996528
Difference at 95.0% confidence
	-2.14438 +/- 0.983925
	-0.435745% +/- 0.199562%
	(Student's t, pooled s = 1.96883)
%  32        485.48        494.77        488.83     489.19438     2.1163449
Difference at 95.0% confidence
	-2.9225 +/- 0.988381
	-0.593863% +/- 0.200334%
	(Student's t, pooled s = 1.97775)

The changes in this update are:

Add the always_steal sysctl tuning knob from jeff@'s patch

Staticize trysteal_limit

Rebase the patch to more recent HEAD.

I've been running this patch and earlier versions for quite some time
on my package build machines with no obvious problems. The previous
benchmark results show that it performs better than the unpatched ULE
under this and parallel buildworld loads. I'd like to commit this
to HEAD soon so that it can get wider testing with the thought of doing
an MFC to stable/11 before 11.2-RELEASE.

There are some other changes in Jeff's patch that are unrelated to
thread stealing and should be evaluated on their own merits. I also
have some interesting ideas that appear promising, but they are also
unrelated to thread stealing, so I don't want to include them at this
time.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 23 2018, 12:13 AM
This revision was automatically updated to reflect the committed changes.