```
commit ea8294fe8a614129d815c533885d10c02e5f399e (HEAD -> uleshort233)
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date: Fri Mar 31 19:55:41 2023 +0000

ule: queue partial slice users on tdq_realtime and enable more preemption

This is a low-effort attempt at damage-controlling one of the bugs,
simple enough to be suitable for inclusion in the pending release.
It comes with its own woes which will be addressed in a more involved
patch down the road. It is not a panacea, but it is less problematic
than the unpatched state. A proper fix will require a significant
rework of runq usage, probably replacing the data structure altogether
(see below for another bug description).

The problem at hand: a thread going off CPU has to wait a full slice
to get back on if there is a CPU hog running. Should such a thread keep
going off CPU frequently, each time using only a small fraction of its
slice, it will struggle to get any work done as it waits a full slice
every time.

This is trivially reproducible by running a bunch of CPU hogs (one per
hardware thread) alongside make -j $(nproc) buildkernel. A sample timing
from an 8-core VM: ~7 minutes of total real time extends to over an
hour(!), even with the hogs niced to 20.

Another bug (which is not fixed here) is that the calendar queue does
not properly distribute CPU time between different priorities: for
example, a nice 0 hog vs a nice 20 hog gives them about 50:50. This
once more negatively affects scheduling for buildkernel vs hogs.

One more bug which needs to be mentioned is the general starvation
potential of the runq mechanism. In principle the calendar queue sorts
it out for tdq_timeshare (except for the above bug), but it remains
unaddressed for tdq_realtime, all while regular user threads can land
there as-is.

Work around the problem by:
1. queueing threads on tdq_realtime if they only used part of their slice
2. bumping the preemption threshold to PRI_MAX_TIMESHARE

Upsides: near-starvation of frequent off-CPU users is worked around.
Downsides: there is more starvation potential for CPU hogs, and the
entire ordeal negatively affects some workloads.

This in particular extends -j 8 buildkernel by about 0.7%.
The hogs get about the same CPU time for the duration.

Interestingly, a kernel with 4BSD takes slightly *less* total real time
to do the same build than stock ULE, all while not having the problem
fixed here. Put differently, with enough work the entire thing can go
faster than it does with stock ULE even without fixing the problem.
This will be sorted out for the next release.

Example:
x 4bsd.out
+ ule.out
* ule_patched.out
+--------------------------------------------------------------------------------+
| * |
| + ** |
|x +x x xx x x xx ++++ +++ * * ** **|
| |_________M_A________|___|______A__M________| |_____AM___||
+--------------------------------------------------------------------------------+
N Min Max Median Avg Stddev
x 9 434.32 436.63 435.53 435.66667 0.73333144
+ 9 435.08 437.3 437 436.85111 0.68641176
Difference at 95.0% confidence
1.18444 +/- 0.709817
0.271869% +/- 0.163163%
(Student's t, pooled s = 0.710259)
* 9 437.96 438.92 438.72 438.61889 0.30608187
Difference at 95.0% confidence
2.95222 +/- 0.561549
0.677633% +/- 0.129638%
(Student's t, pooled s = 0.561899)

To illustrate the problem differently, a CPU hog was bound to a core
along with gzip. gzip was fed data from tar running on a different
core, and total real time was measured.

Like so:
cpuset -l 3 nice -n 20 ./cpuburner-prio 1
Then on another terminal:
time tar cf - /usr/src | cpuset -l 3 time gzip > /dev/null

ps - kern.sched.pick_short, the fix in this commit
pt - kern.sched.preempt_thresh

ps  pt    time
-   -      45.23   # baseline without the CPU hog
0   48    907.07   # 2005% of the baseline
0   224   864.24   # 1910% of the baseline
1   48    869.69   # 1922% of the baseline
1   224    61.46   #  136% of the baseline

Users who want to restore the previous behavior can put the following
into their /etc/sysctl.conf:

kern.sched.preempt_thresh=48
kern.sched.pick_short=0
```
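
To make the quoted two-part workaround concrete, here is a minimal user-space model of the decisions it describes. This is a sketch, not the actual sys/kern/sched_ule.c change: `SCHED_SLICE_TICKS`, `ticks_used`, `queue_for` and `should_preempt` are illustrative assumptions, while `pick_short`, `preempt_thresh`, `PRI_MAX_TIMESHARE` and the tdq_realtime/tdq_timeshare split come from the commit message above.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Illustrative constants. The real values live in sys/kern/sched_ule.c
 * and sys/sys/priority.h and differ in detail.
 */
#define SCHED_SLICE_TICKS  10   /* assumed length of a full timeshare slice */
#define PRI_MAX_TIMESHARE  223  /* lowest (worst) timeshare priority */

static bool pick_short = true;                  /* kern.sched.pick_short */
static int  preempt_thresh = PRI_MAX_TIMESHARE; /* kern.sched.preempt_thresh */

enum tdq_queue { TDQ_REALTIME, TDQ_TIMESHARE };

/*
 * Part 1 of the workaround: a thread that went off CPU after using only
 * part of its slice is queued on tdq_realtime, so it does not sit behind
 * a CPU hog on tdq_timeshare for a full slice before running again.
 */
static enum tdq_queue
queue_for(int ticks_used)
{
    if (pick_short && ticks_used < SCHED_SLICE_TICKS)
        return (TDQ_REALTIME);
    return (TDQ_TIMESHARE);
}

/*
 * Part 2 of the workaround: with the threshold bumped to
 * PRI_MAX_TIMESHARE, a waking thread preempts any running thread of
 * worse (numerically higher) timeshare priority instead of waiting.
 */
static bool
should_preempt(int waking_pri, int running_pri)
{
    return (waking_pri < running_pri && waking_pri <= preempt_thresh);
}

int
main(void)
{
    /* gzip-like thread: used 2 of 10 ticks before blocking on the pipe. */
    printf("short-slice thread -> %s queue\n",
        queue_for(2) == TDQ_REALTIME ? "realtime" : "timeshare");

    /* CPU hog: burned through its entire slice. */
    printf("full-slice hog     -> %s queue\n",
        queue_for(SCHED_SLICE_TICKS) == TDQ_REALTIME ? "realtime" : "timeshare");

    /* Waking thread at priority 120 vs a nice 20 hog near the bottom
     * of the timeshare range. */
    printf("preempt the hog?      %s\n",
        should_preempt(120, 223) ? "yes" : "no");
    return (0);
}
```

Under this model, with both knobs at the commit's new defaults a short-slice thread such as the gzip in the experiment lands on the realtime queue and can preempt the hog promptly, which is the behavior behind the 61.46-second (ps=1, pt=224) row in the table; the sysctl.conf lines quoted at the end undo both parts.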