commit ea8294fe8a614129d815c533885d10c02e5f399e
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date:   Fri Mar 31 19:55:41 2023 +0000

ule: queue partial slice users on tdq_realtime and enable more preemption

This is a low-effort attempt at damage controlling one of the bugs, simple enough to be suitable for inclusion in the pending release and trivial to turn off if needed. It comes with its own woes, which will be addressed in a more involved patch down the road. It is not a panacea, but it is less problematic than the unpatched state. A proper fix will require a significant rework of runq usage, probably replacing the data structure altogether (see below for another bug description).

The problem at hand: a thread going off CPU has to wait the full slice to get back on if there is a CPU hog running. Should a thread of this sort keep going off CPU frequently, each time using only a small fraction of its slice, it will struggle to get any work done because it waits a full slice every time. This is trivially reproducible by running a bunch of CPU hogs (one for each hw thread) alongside make -j $(nproc) buildkernel. A sample timing from an 8-core vm: ~7 minutes of total real time is extended to over 1h(!), even if the hogs are niced to 20.

Another bug (which is not fixed here) is that the calendar queue does not properly distribute CPU time between different priorities; for example, running a nice 0 hog vs a nice 20 hog gives them about a 50:50 split. This once more negatively affects scheduling for buildkernel vs hogs.

One more bug which needs to be mentioned is the general starvation potential of the runq mechanism. In principle the calendar queue sorts it out for tdq_timeshare (except for the above bug), but it remains unaddressed for tdq_realtime, all while regular user threads can land there as is.

Work around the problem by:
1. queueing threads on tdq_realtime if they only used part of their slice
2. bumping the preemption threshold to PRI_MAX_TIMESHARE

Upsides: near-starvation of threads that frequently go off CPU is worked around.

Downsides: there is more starvation potential for CPU hogs, and the entire ordeal negatively affects some workloads.

This in particular extends -j 8 buildkernel by about 0.7%. The hogs get about the same CPU time for the duration.

Interestingly, a kernel with 4BSD takes slightly *less* total real time to do the same build than stock ULE, all while not having the problem fixed here. Put differently, with enough work the entire thing can go faster than it does with stock ULE despite not having this fix. This will be sorted out for the next release.

Example:
x 4bsd.out
+ ule.out
* ule_patched.out
[ministat distribution plot elided; the three data sets barely overlap, with 4bsd fastest and patched ULE slowest]
    N           Min           Max        Median           Avg        Stddev
x   9        434.32        436.63        435.53     435.66667    0.73333144
+   9        435.08         437.3           437     436.85111    0.68641176
Difference at 95.0% confidence
        1.18444 +/- 0.709817
        0.271869% +/- 0.163163%
        (Student's t, pooled s = 0.710259)
*   9        437.96        438.92        438.72     438.61889    0.30608187
Difference at 95.0% confidence
        2.95222 +/- 0.561549
        0.677633% +/- 0.129638%
        (Student's t, pooled s = 0.561899)

To illustrate the problem differently, a CPU hog was bound to a core along with gzip. gzip was fed data from tar running on a different core and total real time was measured.
Like so:

cpuset -l 3 nice -n 20 ./cpuburner-prio 1

Then on another terminal:

time tar cf - /usr/src | cpuset -l 3 time gzip > /dev/null

ps - kern.sched.pick_short, the fix in the commit
pt - kern.sched.preempt_thresh

ps  pt    time
 -   -    45.23   # baseline without the cpu hog
 0   48  907.07   # 2005% of the baseline
 0  224  864.24   # 1910% of the baseline
 1   48  869.69   # 1922% of the baseline
 1  224   61.46   # 136% of the baseline

Users who want to restore the previous behavior can put the following into their /etc/sysctl.conf:

kern.sched.preempt_thresh=48
kern.sched.pick_short=0
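To make the workaround concrete, below is a rough, hypothetical C sketch of the queueing decision described in points 1 and 2 of the commit message; it is not the committed diff. The ts_usedslice field and the pick_short knob are taken from this review, while the surrounding structure of the runq-add path is heavily simplified.

/*
 * Hypothetical sketch of the workaround, NOT the committed diff.
 * ts_usedslice and the pick_short knob come from this review; the
 * surrounding structure is simplified for illustration.
 */
static int sched_pick_short = 1;	/* kern.sched.pick_short */

static void
tdq_runq_add_sketch(struct tdq *tdq, struct thread *td, int flags)
{
	struct td_sched *ts = td_get_sched(td);
	u_char pri = td->td_priority;

	/* Only the timesharing class is affected by the workaround. */
	if (pri < PRI_MIN_TIMESHARE || pri > PRI_MAX_TIMESHARE)
		return;		/* other classes are queued as before (elided) */

	if (sched_pick_short && ts->ts_usedslice < ts->ts_slice) {
		/*
		 * The thread went off CPU before exhausting its slice;
		 * queue it on the realtime runq so it does not have to
		 * wait out a full slice behind a CPU hog.
		 */
		ts->ts_runq = &tdq->tdq_realtime;
	} else {
		/* CPU hogs keep going through the calendar queue. */
		ts->ts_runq = &tdq->tdq_timeshare;
		/* (calendar-queue index computation elided) */
	}
	runq_add(ts->ts_runq, td, flags);
}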
Details

Reviewers: markj
Repository: rG FreeBSD src repository
Event Timeline
sys/kern/sched_ule.c:2311

Maybe a dumb question, but why do we need to get this again? I don't see why having been off CPU would invalidate ts.
sys/kern/sched_ule.c:2311

We don't; this is a leftover from experimentation that I did not remove. Will do it later.
I applied the patch on my desktop. If I build LLVM in the background with -j<ncpus> and watch Doom speedruns on youtube, I have to set pick_short=0 to avoid long freezes in the video. Audio is fine regardless. With pick_short=0 the video isn't perfectly smooth, but it's much better than the alternative. So I think the patch likely regresses certain workloads, keeping in mind that server processes may be scheduled as "interactive."
I also see much larger variance for certain operations when dnetc is running in the background. I tried running "git status" in a loop in a large repo and measuring the real time of each operation. The result is here: https://people.freebsd.org/~markj/ule/ministat.txt .
I didn't try to understand why this is happening, but in general we want to be wary of changes which have the potential to introduce latency spikes.
I think that raising the default preempt threshold is probably reasonable, but it should be done post 14.0-branch.
> This in particular extends -j 8 buildkernel by about 0.7%.
The patch contains two logically unrelated changes. Which one appears to be responsible for the slowdown?
> One more bug which needs to be mentioned is the general starvation potential of the runq mechanism. In principle the calendar queue sorts it out for tdq_timeshare (except for the above bug), but it remains unaddressed for tdq_realtime, all while regular user threads can land there as is.
What is the bug? A user thread which is starving other threads should be treated as non-interactive and get scheduled from the timesharing queues. By definition, a user thread has to sleep for significant amounts of time to be placed in td_realtime in the first place. After this patch that soft invariant is no longer true, which is one reason I don't think it's a good idea.
Furthermore, your comment above is incongruent with the patch: the problem is that threads in the timeshare queue can starve each other, and your solution is to sometimes schedule them from the realtime queue. Why aren't we fixing the problem by revisiting the design of the timeshare queues?
sys/kern/sched_ule.c:218

Now there's no need for the FULL_PREEMPTION case.

sys/kern/sched_ule.c:221

Wasn't there a known ZFS data corruption bug that arises when this setting enables preemption by user threads? I'm sure there are other bugs in that vein. Changing this right before a release seems like a bad idea.

sys/kern/sched_ule.c:498

then you get less code duplication.

sys/kern/sched_ule.c:2311

Suppose a user thread is running, and then a timer interrupt fires. The timer interrupt schedules a softclock thread to run callout handlers. That softclock thread runs at a very high priority, so it preempts the user thread, and after a brief period the user thread keeps running. But now ts_usedslice has been reset, so the next time the user thread goes off-CPU for any reason (including another softclock preemption), it'll be scheduled with realtime priority.

It doesn't make a lot of sense to me that preemption has this effect. Doesn't it allow CPU hogs to gain realtime priority scheduling? It looks like we should reset usedslice only when the thread goes to sleep.
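To make the concern concrete, here is a hypothetical sketch (not code from the patch) of the distinction being suggested: clear the used-slice accounting only when the thread voluntarily goes to sleep, not on every switch off CPU.

/*
 * Hypothetical sketch of the suggestion above, not code from the patch.
 * Only a voluntary sleep clears the used-slice accounting; an involuntary
 * preemption (e.g. by softclock) does not, so a CPU hog that merely got
 * preempted is not later queued as if it had used only part of its slice.
 */
static void
sched_usedslice_reset_sketch(struct thread *td, int flags)
{
	struct td_sched *ts = td_get_sched(td);

	if ((flags & SW_VOL) != 0 && TD_IS_SLEEPING(td)) {
		/* The thread really gave up the rest of its slice. */
		ts->ts_usedslice = 0;
	}
	/* SW_PREEMPT / involuntary switches leave ts_usedslice alone. */
}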
Please: is this a candidate for inclusion in 14.0-CURRENT?
For what it's worth, my experience with diff 2 (https://reviews.freebsd.org/D40045?id=121849) has been almost entirely positive.
Although I have not aimed to perform measurements, qualitatively I do find the overall desktop environment experience much better – particularly whilst, for example, pushing this circa 2013 notebook (HP EliteBook 8570p) unusually hard with make on the host (typically with poudriere-devel and ALLOW_MAKE_JOBS=yes); or (pictured below, 20th May) building CURRENT in a (non-patched) VirtualBox guest.
{F62242755}
My personal opinion is that it should not be committed, at least in its current form. The idea appears to be based on a suggestion from D15985, but the implementation is flawed such that it can boost the priority of threads when there is no reason to do so.
The diff contains two independent changes. One changes the default kern.sched.preempt_thresh value from 48 to 224. The other artificially boosts the scheduling priority of a preempted thread; this can be disabled by setting kern.sched.pick_short=0. It would be useful to know whether you see a benefit from having one change but not the other.
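For context, the threshold is consulted roughly as follows; this is a simplified paraphrase of sched_shouldpreempt() in sys/kern/sched_ule.c, with the remote-CPU handling elided and details possibly inexact. Lower numeric priority means more important, so the old default of 48 allowed only the most important kernel-level threads to preempt, while 224 lets essentially any more important thread, including ordinary timesharing threads, preempt.

/*
 * Simplified paraphrase of sched_shouldpreempt(); details elided.
 * "pri" is the priority of the newly runnable thread, "cpri" that of
 * the currently running one; lower values are more important.
 */
static int preempt_thresh = 224;	/* kern.sched.preempt_thresh */

static int
shouldpreempt_sketch(int pri, int cpri)
{
	if (pri >= cpri)
		return (0);	/* not more important: nothing to do */
	if (cpri >= PRI_MIN_IDLE)
		return (1);	/* always preempt the idle thread */
	if (preempt_thresh == 0)
		return (0);	/* preemption disabled */
	if (pri <= preempt_thresh)
		return (1);	/* important enough to cross the threshold */
	return (0);
}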
> … It would be useful to know whether you see a benefit from having one change but not the other.
With kern.sched.pick_short persistently 1:
- I tried, for a while, to sense a difference between 224 and 48 for kern.sched.preempt_thresh.
I'll continue testing with 48 but at the time of writing, I can't say that there's a noticeable difference.
The same was true for 224 versus 48 in months and years gone by, before D40045; I know that 224 was a popular recommendation for desktop use cases, but I never noticed anything worth writing about.
YMMV, HTH
Further testing by other people confirmed my worry that this change is too simplistic to fix the problem without running into other side effects, so I'm dropping it.