commit ea8294fe8a614129d815c533885d10c02e5f399e
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date:   Fri Mar 31 19:55:41 2023 +0000

ule: queue partial slice users on tdq_realtime and enable more preemption

This is a low-effort attempt at damage controlling one of the bugs, simple enough to be suitable for inclusion in the pending release and trivial to turn off if needed. It comes with its own woes, which will be addressed in a more involved patch down the road. It is not a panacea, but it is less problematic than the unpatched state. A proper fix will require a significant rework of runq usage, probably replacing the data structure altogether (see below for another bug description).

The problem at hand: a thread going off CPU has to wait the full slice to get back on if there is a CPU hog running. Should a thread of this sort keep going off CPU frequently, each time using only a small fraction of its slice, it will struggle to get any work done because it waits a full slice every time. This is trivially reproducible by running a bunch of CPU hogs (one for each hw thread) alongside make -j $(nproc) buildkernel. A sample timing from an 8-core vm: ~7 minutes of total real time is extended to over 1h(!), even if the hogs are niced to 20.

Another bug (which is not fixed here) is that the calendar queue does not properly distribute CPU time between different priorities; for example, running a nice 0 hog vs a nice 20 hog gives them about a 50:50 split. This once more negatively affects scheduling for buildkernel vs hogs.

One more bug which needs to be mentioned is the general starvation potential of the runq mechanism. In principle the calendar queue sorts it out for tdq_timeshare (except for the above bug), but it remains unaddressed for tdq_realtime, all while regular user threads can land there as is.

Work around the problem by:
1. queueing threads on tdq_realtime if they only used part of their slice
2. bumping the preemption threshold to PRI_MAX_TIMESHARE

Upsides: near-starvation of threads that frequently go off CPU is worked around.

Downsides: there is more starvation potential for CPU hogs, and the entire ordeal negatively affects some workloads.

This in particular extends -j 8 buildkernel by about 0.7%. The hogs get about the same CPU time for the duration.

Interestingly, a kernel with 4BSD takes slightly *less* total real time to do the same build than stock ULE, all while not having the problem fixed here. Put differently, with enough work the entire thing can go faster than it does with stock ULE despite not having this fix. This will be sorted out for the next release.

Example:
x 4bsd.out
+ ule.out
* ule_patched.out
[ministat distribution plot elided; the three data sets barely overlap, with 4bsd fastest and patched ULE slowest]
    N           Min           Max        Median           Avg        Stddev
x   9        434.32        436.63        435.53     435.66667    0.73333144
+   9        435.08         437.3           437     436.85111    0.68641176
Difference at 95.0% confidence
        1.18444 +/- 0.709817
        0.271869% +/- 0.163163%
        (Student's t, pooled s = 0.710259)
*   9        437.96        438.92        438.72     438.61889    0.30608187
Difference at 95.0% confidence
        2.95222 +/- 0.561549
        0.677633% +/- 0.129638%
        (Student's t, pooled s = 0.561899)

To illustrate the problem differently, a CPU hog was bound to a core along with gzip. gzip was fed data from tar running on a different core and total real time was measured.
Like so:

cpuset -l 3 nice -n 20 ./cpuburner-prio 1

Then on another terminal:

time tar cf - /usr/src | cpuset -l 3 time gzip > /dev/null

ps - kern.sched.pick_short, the fix in the commit
pt - kern.sched.preempt_thresh

ps  pt    time
 -   -    45.23   # baseline without the cpu hog
 0   48  907.07   # 2005% of the baseline
 0  224  864.24   # 1910% of the baseline
 1   48  869.69   # 1922% of the baseline
 1  224   61.46   # 136% of the baseline

Users who want to restore the previous behavior can put the following into their /etc/sysctl.conf:

kern.sched.preempt_thresh=48
kern.sched.pick_short=0
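To make the workaround concrete, below is a rough, hypothetical C sketch of the queueing decision described in points 1 and 2 of the commit message; it is not the committed diff. The ts_usedslice field and the pick_short knob are taken from this review, while the surrounding structure of the runq-add path is heavily simplified.

/*
 * Hypothetical sketch of the workaround, NOT the committed diff.
 * ts_usedslice and the pick_short knob come from this review; the
 * surrounding structure is simplified for illustration.
 */
static int sched_pick_short = 1;	/* kern.sched.pick_short */

static void
tdq_runq_add_sketch(struct tdq *tdq, struct thread *td, int flags)
{
	struct td_sched *ts = td_get_sched(td);
	u_char pri = td->td_priority;

	/* Only the timesharing class is affected by the workaround. */
	if (pri < PRI_MIN_TIMESHARE || pri > PRI_MAX_TIMESHARE)
		return;		/* other classes are queued as before (elided) */

	if (sched_pick_short && ts->ts_usedslice < ts->ts_slice) {
		/*
		 * The thread went off CPU before exhausting its slice;
		 * queue it on the realtime runq so it does not have to
		 * wait out a full slice behind a CPU hog.
		 */
		ts->ts_runq = &tdq->tdq_realtime;
	} else {
		/* CPU hogs keep going through the calendar queue. */
		ts->ts_runq = &tdq->tdq_timeshare;
		/* (calendar-queue index computation elided) */
	}
	runq_add(ts->ts_runq, td, flags);
}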
Details

Reviewers: markj
Repository: rG FreeBSD src repository
Event Timeline
sys/kern/sched_ule.c:2311

Maybe a dumb question, but why do we need to get this again? I don't see why having been off CPU would invalidate ts.
sys/kern/sched_ule.c:2311

We don't; this is a leftover from experimentation that I did not remove. Will do it later.
I applied the patch on my desktop. If I build LLVM in the background with -j<ncpus> and watch Doom speedruns on youtube, I have to set pick_short=0 to avoid long freezes in the video. Audio is fine regardless. With pick_short=0 the video isn't perfectly smooth, but it's much better than the alternative. So I think the patch likely regresses certain workloads, keeping in mind that server processes may be scheduled as "interactive."
I also see much larger variance for certain operations when dnetc is running in the background. I tried running "git status" in a loop in a large repo and measuring the real time of each operation. The result is here: https://people.freebsd.org/~markj/ule/ministat.txt .
I didn't try to understand why this is happening, but in general we want to be wary of changes which have the potential to introduce latency spikes.
I think that raising the default preempt threshold is probably reasonable, but it should be done post 14.0-branch.
> This in particular extends -j 8 buildkernel by about 0.7%.
The patch contains two logically unrelated changes. Which one appears to be responsible for the slowdown?
> One more bug which needs to be mentioned is the general starvation potential of the runq mechanism. In principle the calendar queue sorts it out for tdq_timeshare (except for the above bug), but it remains unaddressed for tdq_realtime, all while regular user threads can land there as is.
What is the bug? A user thread which is starving other threads should be treated as non-interactive and get scheduled from the timesharing queues. By definition, a user thread has to sleep for significant amounts of time to be placed in td_realtime in the first place. After this patch that soft invariant is no longer true, which is one reason I don't think it's a good idea.
Furthermore, your comment above is incongruent with the patch: the problem is that threads in the timeshare queue can starve each other, and your solution is to sometimes schedule them from the realtime queue. Why aren't we fixing the problem by revisiting the design of the timeshare queues?
sys/kern/sched_ule.c:218

Now there's no need for the FULL_PREEMPTION case.

sys/kern/sched_ule.c:221

Wasn't there a known ZFS data corruption bug that arises when this setting enables preemption by user threads? I'm sure there are other bugs in that vein. Changing this right before a release seems like a bad idea.

sys/kern/sched_ule.c:498

then you get less code duplication.

sys/kern/sched_ule.c:2311

Suppose a user thread is running, and then a timer interrupt fires. The timer interrupt schedules a softclock thread to run callout handlers. That softclock thread runs at a very high priority, so it preempts the user thread, and after a brief period the user thread keeps running. But now ts_usedslice has been reset, so the next time the user thread goes off-CPU for any reason (including another softclock preemption), it'll be scheduled with realtime priority.

It doesn't make a lot of sense to me that preemption has this effect. Doesn't it allow CPU hogs to gain realtime priority scheduling? It looks like we should reset usedslice only when the thread goes to sleep.
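To make the concern concrete, here is a hypothetical sketch (not code from the patch) of the distinction being suggested: clear the used-slice accounting only when the thread voluntarily goes to sleep, not on every switch off CPU.

/*
 * Hypothetical sketch of the suggestion above, not code from the patch.
 * Only a voluntary sleep clears the used-slice accounting; an involuntary
 * preemption (e.g. by softclock) does not, so a CPU hog that merely got
 * preempted is not later queued as if it had used only part of its slice.
 */
static void
sched_usedslice_reset_sketch(struct thread *td, int flags)
{
	struct td_sched *ts = td_get_sched(td);

	if ((flags & SW_VOL) != 0 && TD_IS_SLEEPING(td)) {
		/* The thread really gave up the rest of its slice. */
		ts->ts_usedslice = 0;
	}
	/* SW_PREEMPT / involuntary switches leave ts_usedslice alone. */
}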
Please: is this a candidate for inclusion in 14.0-CURRENT?
For what it's worth, my experience with diff 2 (https://reviews.freebsd.org/D40045?id=121849) has been almost entirely positive.
Although I have not aimed to perform measurements, qualitatively I do find the overall desktop environment experience much better – particularly whilst, for example, pushing this circa 2013 notebook (HP EliteBook 8570p) unusually hard with make on the host (typically with poudriere-devel and ALLOW_MAKE_JOBS=yes); or (pictured below, 20th May) building CURRENT in a (non-patched) VirtualBox guest.
{F62242755}
My personal opinion is that it should not be committed, at least in its current form. The idea appears to be based on a suggestion from D15985, but the implementation is flawed such that it can boost the priority of threads when there is no reason to do so.
The diff contains two independent changes. One changes the default kern.sched.preempt_thresh value from 48 to 224. The other artificially boosts the scheduling priority of a preempted thread; this can be disabled by setting kern.sched.pick_short=0. It would be useful to know whether you see a benefit from having one change but not the other.
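For context, the threshold is consulted roughly as follows; this is a simplified paraphrase of sched_shouldpreempt() in sys/kern/sched_ule.c, with the remote-CPU handling elided and details possibly inexact. Lower numeric priority means more important, so the old default of 48 allowed only the most important kernel-level threads to preempt, while 224 lets essentially any more important thread, including ordinary timesharing threads, preempt.

/*
 * Simplified paraphrase of sched_shouldpreempt(); details elided.
 * "pri" is the priority of the newly runnable thread, "cpri" that of
 * the currently running one; lower values are more important.
 */
static int preempt_thresh = 224;	/* kern.sched.preempt_thresh */

static int
shouldpreempt_sketch(int pri, int cpri)
{
	if (pri >= cpri)
		return (0);	/* not more important: nothing to do */
	if (cpri >= PRI_MIN_IDLE)
		return (1);	/* always preempt the idle thread */
	if (preempt_thresh == 0)
		return (0);	/* preemption disabled */
	if (pri <= preempt_thresh)
		return (1);	/* important enough to cross the threshold */
	return (0);
}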
> … It would be useful to know whether you see a benefit from having one change but not the other.
With kern.sched.pick_short persistently 1:
- I tried, for a while, to sense a difference between 224 and 48 for kern.sched.preempt_thresh.
I'll continue testing with 48 but at the time of writing, I can't say that there's a noticeable difference.
The same was true for 224 versus 48 in months and years gone by, before D40045; I know that 224 was a popular recommendation for desktop use cases, but I never noticed anything worth writing about.
YMMV, HTH
Further testing by other people confirmed my worry that this change is too simplistic to fix the problem without running into other side effects, so I'm dropping it.