```
commit ea8294fe8a614129d815c533885d10c02e5f399e (HEAD -> uleshort233)
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date: Fri Mar 31 19:55:41 2023 +0000

ule: queue partial slice users on tdq_realtime and enable more preemption

This is a low-effort attempt at damage-controlling one of the bugs,
simple enough to be suitable for inclusion in the pending release.
It comes with its own woes which will be addressed in a more involved
patch down the road. It is not a panacea, but it is less problematic
than the unpatched state. A proper fix will require a significant
rework of runq usage, probably replacing the data structure altogether
(see below for another bug description).

The problem at hand: a thread going off CPU has to wait a full slice
to get back on if there is a CPU hog running. Should such a thread keep
going off CPU frequently, each time using only a small fraction of its
slice, it will struggle to get any work done as it waits a full slice
every time.

This is trivially reproducible by running a bunch of CPU hogs (one per
hardware thread) alongside make -j $(nproc) buildkernel. A sample timing
from an 8-core VM: ~7 minutes of total real time extends to over an
hour(!), even with the hogs niced to 20.

Another bug (which is not fixed here) is that the calendar queue does
not properly distribute CPU time between different priorities: for
example, a nice 0 hog vs a nice 20 hog gives them about 50:50. This
once more negatively affects scheduling for buildkernel vs hogs.

One more bug which needs to be mentioned is the general starvation
potential of the runq mechanism. In principle the calendar queue sorts
it out for tdq_timeshare (except for the above bug), but it remains
unaddressed for tdq_realtime, all while regular user threads can land
there as-is.

Work around the problem by:
1. queueing threads on tdq_realtime if they only used part of their slice
2. bumping the preemption threshold to PRI_MAX_TIMESHARE

Upsides: near-starvation of frequent off-CPU users is worked around.
Downsides: there is more starvation potential for CPU hogs, and the
entire ordeal negatively affects some workloads.

This in particular extends -j 8 buildkernel by about 0.7%.
The hogs get about the same CPU time for the duration.

Interestingly, a kernel with 4BSD takes slightly *less* total real time
to do the same build than stock ULE, all while not having the problem
fixed here. Put differently, with enough work the entire thing can go
faster than it does with stock ULE even without fixing the problem.
This will be sorted out for the next release.

Example:
x 4bsd.out
+ ule.out
* ule_patched.out
+--------------------------------------------------------------------------------+
| * |
| + ** |
|x +x x xx x x xx ++++ +++ * * ** **|
| |_________M_A________|___|______A__M________| |_____AM___||
+--------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 9 434.32 436.63 435.53 435.66667 0.73333144
+ 9 435.08 437.3 437 436.85111 0.68641176
Difference at 95.0% confidence
1.18444 +/- 0.709817
0.271869% +/- 0.163163%
(Student's t, pooled s = 0.710259)
* 9 437.96 438.92 438.72 438.61889 0.30608187
Difference at 95.0% confidence
2.95222 +/- 0.561549
0.677633% +/- 0.129638%
(Student's t, pooled s = 0.561899)

To illustrate the problem differently, a CPU hog was bound to a core
along with gzip. gzip was fed data from tar running on a different
core, and total real time was measured.

Like so:
cpuset -l 3 nice -n 20 ./cpuburner-prio 1
Then on another terminal:
time tar cf - /usr/src | cpuset -l 3 time gzip > /dev/null

ps - kern.sched.pick_short, the fix in this commit
pt - kern.sched.preempt_thresh

ps  pt    time
-   -      45.23   # baseline without the CPU hog
0   48    907.07   # 2005% of the baseline
0   224   864.24   # 1910% of the baseline
1   48    869.69   # 1922% of the baseline
1   224    61.46   #  136% of the baseline

Users who want to restore the previous behavior can put the following
into their /etc/sysctl.conf:

kern.sched.preempt_thresh=48
kern.sched.pick_short=0
```
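
To make the quoted two-part workaround concrete, here is a minimal user-space model of the decisions it describes. This is a sketch, not the actual sys/kern/sched_ule.c change: `SCHED_SLICE_TICKS`, `ticks_used`, `queue_for` and `should_preempt` are illustrative assumptions, while `pick_short`, `preempt_thresh`, `PRI_MAX_TIMESHARE` and the tdq_realtime/tdq_timeshare split come from the commit message above.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative constants. The real values live in sys/kern/sched_ule.c
 * and sys/sys/priority.h and differ in detail.
 */
#define SCHED_SLICE_TICKS  10   /* assumed length of a full timeshare slice */
#define PRI_MAX_TIMESHARE  223  /* lowest (worst) timeshare priority */

static bool pick_short = true;                  /* kern.sched.pick_short */
static int  preempt_thresh = PRI_MAX_TIMESHARE; /* kern.sched.preempt_thresh */

enum tdq_queue { TDQ_REALTIME, TDQ_TIMESHARE };

/*
 * Part 1 of the workaround: a thread that went off CPU after using only
 * part of its slice is queued on tdq_realtime, so it does not sit behind
 * a CPU hog on tdq_timeshare for a full slice before running again.
 */
static enum tdq_queue
queue_for(int ticks_used)
{
    if (pick_short && ticks_used < SCHED_SLICE_TICKS)
        return (TDQ_REALTIME);
    return (TDQ_TIMESHARE);
}

/*
 * Part 2 of the workaround: with the threshold bumped to
 * PRI_MAX_TIMESHARE, a waking thread preempts any running thread of
 * worse (numerically higher) timeshare priority instead of waiting.
 */
static bool
should_preempt(int waking_pri, int running_pri)
{
    return (waking_pri < running_pri && waking_pri <= preempt_thresh);
}

int
main(void)
{
    /* gzip-like thread: used 2 of 10 ticks before blocking on the pipe. */
    printf("short-slice thread -> %s queue\n",
        queue_for(2) == TDQ_REALTIME ? "realtime" : "timeshare");

    /* CPU hog: burned through its entire slice. */
    printf("full-slice hog     -> %s queue\n",
        queue_for(SCHED_SLICE_TICKS) == TDQ_REALTIME ? "realtime" : "timeshare");

    /* Waking thread at priority 120 vs a nice 20 hog near the bottom
     * of the timeshare range. */
    printf("preempt the hog?      %s\n",
        should_preempt(120, 223) ? "yes" : "no");
    return (0);
}
```

Under this model, with both knobs at the commit's new defaults a short-slice thread such as the gzip in the experiment lands on the realtime queue and can preempt the hog promptly, which is the behavior behind the 61.46-second (ps=1, pt=224) row in the table; the sysctl.conf lines quoted at the end undo both parts.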