
ule: queue partial slice users on tdq_realtime and enable more preemption
Abandoned · Public

Authored by mjg on May 10 2023, 2:10 PM.

Details

Reviewers
markj
Summary
commit ea8294fe8a614129d815c533885d10c02e5f399e
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date:   Fri Mar 31 19:55:41 2023 +0000

    ule: queue partial slice users on tdq_realtime and enable more preemption
    
    This is a low-effort attempt at damage controlling one of the bugs,
    simple enough to be suitable for inclusion in the pending release and
    which can be trivially turned off if needed.
    
    It comes with its own woes which will be addressed in a more involved
    patch down the road. It is not a panacea, but it is less problematic
    than the unpatched state. A proper fix will require a significant rework
    of runq usage, probably replacing the data structure altogether (see
    below for another bug description).
    
    The problem at hand: a thread going off CPU has to wait a full slice
    to get back on if a CPU hog is running. Should such a thread keep
    going off CPU frequently, each time using only a small fraction of
    its slice, it will struggle to get any work done, as it waits out a
    full slice every time.
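    
    For a rough illustration with made-up numbers: if the slice is 100 ms
    and the thread needs only 1 ms of CPU per wakeup, every millisecond of
    useful work costs roughly 100 ms of waiting behind the hog, so the
    thread gets on the order of 1% of the throughput it would see on an
    otherwise idle CPU.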
    
    This is trivially reproducible by running a bunch of CPU hogs (one per
    hardware thread) alongside make -j $(nproc) buildkernel. A sample
    timing from an 8-core vm: ~7 minutes of total real time is extended to
    over 1h(!), even if the hogs are niced to 20.
    
    Another bug (which is not fixed here) is that the calendar queue does
    not properly distribute CPU time between different priorities; for
    example, a nice 0 hog vs a nice 20 hog get about a 50:50 split. This
    once more negatively affects scheduling for buildkernel vs the hogs.
    
    One more bug which needs to be mentioned is the general starvation
    potential of the runq mechanism. In principle the calendar queue sorts
    it out for tdq_timeshare (except for the above bug), but it remains
    unaddressed for tdq_realtime, all while regular user threads can land
    there as is.
    
    Work around the problem by:
    1. queueing threads on tdq_realtime if they only used part of their slice
    2. bumping preemption threshold to PRI_MAX_TIMESHARE
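    
    To make point 1 above concrete, here is a minimal, self-contained toy
    model of the placement decision (this is not the actual diff; the
    struct, the pick_short flag and the slice_left field are illustrative
    stand-ins for the real sched_ule.c machinery):
    
    #include <stdbool.h>
    #include <stdio.h>
    
    /* Illustrative stand-ins, not the identifiers used in sched_ule.c. */
    struct toy_thread {
            int  slice_left;   /* unused part of the current slice, in ticks */
            bool timeshare;    /* regular (non-realtime-class) user thread */
    };
    
    static int pick_short = 1;          /* models kern.sched.pick_short */
    
    enum toy_queue { Q_REALTIME, Q_TIMESHARE };
    
    /* Which run queue does a thread re-entering the runnable state go to? */
    static enum toy_queue
    pick_queue(const struct toy_thread *td)
    {
            /*
             * The workaround: a timeshare thread that went off CPU with
             * part of its slice unused goes to the realtime queue, so it
             * can run again before a hog finishes a full slice.
             */
            if (pick_short && td->timeshare && td->slice_left > 0)
                    return (Q_REALTIME);
            return (Q_TIMESHARE);
    }
    
    int
    main(void)
    {
            struct toy_thread gzip = { .slice_left = 9, .timeshare = true };
            struct toy_thread hog  = { .slice_left = 0, .timeshare = true };
    
            printf("gzip -> %s\n", pick_queue(&gzip) == Q_REALTIME ?
                "tdq_realtime" : "tdq_timeshare");
            printf("hog  -> %s\n", pick_queue(&hog) == Q_REALTIME ?
                "tdq_realtime" : "tdq_timeshare");
            return (0);
    }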
    
    Upsides: near-starvation of frequent off-CPU users is worked around.
    Downsides: there is more starvation potential for CPU hogs, and the
    entire ordeal negatively affects some workloads.
    
    This in particular extends -j 8 buildkernel by about 0.7%.
    
    The hogs get about the same CPU time for the duration.
    
    Interestingly a kernel with 4BSD takes slightly *less* total real time
    to do the same build than stock ULE, all while not having the problem
    fixed here. Or to put it differently, with enough work the entire thing
    can go faster than it does with stock ULE despite not having the
    problem.  This will be sorted out for the next release.
    
    Example:
    x 4bsd.out
    + ule.out
    * ule_patched.out
    +--------------------------------------------------------------------------------+
    |                                                                           *    |
    |                                                  +                        **   |
    |x            +x   x xx  x        x     xx  ++++  +++          *     *      ** **|
    |           |_________M_A________|___|______A__M________|            |_____AM___||
    +--------------------------------------------------------------------------------+
        N           Min           Max        Median           Avg        Stddev
    x   9        434.32        436.63        435.53     435.66667    0.73333144
    +   9        435.08         437.3           437     436.85111    0.68641176
    Difference at 95.0% confidence
            1.18444 +/- 0.709817
            0.271869% +/- 0.163163%
            (Student's t, pooled s = 0.710259)
    *   9        437.96        438.92        438.72     438.61889    0.30608187
    Difference at 95.0% confidence
            2.95222 +/- 0.561549
            0.677633% +/- 0.129638%
            (Student's t, pooled s = 0.561899)
    
    To illustrate the problem differently, a CPU hog was bound to a core
    along with gzip. gzip was fed data from tar running on a different core
    and total real time was measured.
    
    Like so:
    cpuset -l 3 nice -n 20 ./cpuburner-prio 1
    
    Then on another terminal:
    time tar cf - /usr/src | cpuset -l 3 time gzip > /dev/null
    
    ps - kern.sched.pick_short, the fix in the commit
    pt - kern.sched.preempt_thresh
    
    ps pt   time
    -  -    45.23   # baseline without the cpu hog
    0  48   907.07  # 2005% of the baseline
    0  224  864.24  # 1910% of the baseline
    1  48   869.69  # 1922% of the baseline
    1  224  61.46   # 136% of the baseline
    
    Users which want to restore the previous behavior can put the following
    into their /etc/sysctl.conf:
    kern.sched.preempt_thresh=48
    kern.sched.pick_short=0

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

mjg requested review of this revision. May 10 2023, 2:10 PM
mjg edited the summary of this revision.
jrtc27 added inline comments.
sys/kern/sched_ule.c
2311

Maybe a dumb question, but why do we need to get this again? I don't see why having been off CPU would invalidate ts.

sys/kern/sched_ule.c
2311

We don't; this is leftover from experimentation that I did not remove. Will do it later.

mjg added a reviewer: markj.
  • fix build with KTR
  • whack noop assignment

I applied the patch on my desktop. If I build LLVM in the background with -j<ncpus> and watch Doom speedruns on youtube, I have to set pick_short=0 to avoid long freezes in the video. Audio is fine regardless. With pick_short=0 the video isn't perfectly smooth, but it's much better than the alternative. So I think the patch likely regresses certain workloads, keeping in mind that server processes may be scheduled as "interactive."

I also see much larger variance for certain operations when dnetc is running in the background. I tried running "git status" in a loop in a large repo and measuring the real time of each operation. The result is here: https://people.freebsd.org/~markj/ule/ministat.txt .
I didn't try to understand why this is happening, but in general we want to be wary of changes which have the potential to introduce latency spikes.

I think that raising the default preempt threshold is probably reasonable, but it should be done post 14.0-branch.

This in particular extends -j 8 buildkernel by about 0.7%.

The patch contains two logically unrelated changes. Which one appears to be responsible for the slowdown?

One more bug which needs to be mentioned is the general starvation potential of the runq mechanism. In principle the calendar queue sorts it out for tdq_timeshare (except for the above bug), but it remains unaddressed for tdq_realtime, all while regular user threads can land there as is.

What is the bug? A user thread which is starving other threads should be treated as non-interactive and get scheduled from the timesharing queues. By definition, a user thread has to sleep for significant amounts of time to be placed in tdq_realtime in the first place. After this patch that soft invariant is no longer true, which is one reason I don't think it's a good idea.

Furthermore, your comment above is incongruent with the patch: the problem is that threads in the timeshare queue can starve each other, and your solution is to sometimes schedule them from the realtime queue. Why aren't we fixing the problem by revisiting the design of the timeshare queues?
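
For context, here is a rough, self-contained sketch of the interactivity classification referred to above (the scoring formula and threshold below are simplified stand-ins, not the exact sched_interact_score() logic):

    #include <stdio.h>

    /*
     * Simplified stand-in: ULE compares accumulated voluntary sleep time
     * against run time, and only threads scoring below a threshold
     * (cf. kern.sched.interact) are treated as interactive and queued on
     * tdq_realtime.  The formula here is a rough approximation.
     */
    static int
    toy_interact_score(unsigned run_ticks, unsigned sleep_ticks)
    {
            if (run_ticks + sleep_ticks == 0)
                    return (0);
            /* 0 (sleeps all the time) .. 100 (runs all the time). */
            return ((int)(100 * run_ticks / (run_ticks + sleep_ticks)));
    }

    int
    main(void)
    {
            int threshold = 30;     /* models the interactivity cutoff */

            /* A shell blocked on input sleeps far more than it runs. */
            printf("shell interactive: %d\n",
                toy_interact_score(5, 995) <= threshold);
            /* A CPU hog never sleeps voluntarily; it stays non-interactive. */
            printf("hog interactive:   %d\n",
                toy_interact_score(1000, 0) <= threshold);
            return (0);
    }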

sys/kern/sched_ule.c
218

Now there's no need for the FULL_PREEMPTION case.

221

Wasn't there a known ZFS data corruption bug that arises when this setting enables preemption by user threads? I'm sure there are other bugs in that vein. Changing this right before a release seems like a bad idea.

498

then you get less code duplication.

2311

Suppose a user thread is running, and then a timer interrupt fires. The timer interrupt schedules a softclock thread to run callout handlers. That softclock thread runs at a very high priority, so it preempts the user thread, and after a brief period the user thread keeps running. But now ts_usedslice has been reset, so the next time the user thread goes off-CPU for any reason (including another softclock preemption), it'll be scheduled with realtime priority.

It doesn't make a lot of sense to me that preemption has this effect. Doesn't it allow CPU hogs to gain realtime priority scheduling? It looks like we should reset usedslice only when the thread goes to sleep.
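
A minimal, self-contained toy model of the two policies being contrasted here (the field and function names are illustrative, not the sched_ule.c identifiers):

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative stand-in for the per-thread slice accounting. */
    struct toy_sched {
            int used_slice;     /* portion of the current slice consumed */
    };

    /*
     * Behavior described above: the counter is cleared on every off-CPU
     * event, so an involuntary preemption (e.g. by softclock) also clears
     * it and a CPU hog can afterwards look like a "short slice" user.
     */
    static void
    switch_out_reset_always(struct toy_sched *ts, bool voluntary_sleep)
    {
            (void)voluntary_sleep;
            ts->used_slice = 0;
    }

    /*
     * Suggested alternative: only a voluntary sleep (blocking on I/O, a
     * lock, etc.) clears the counter; being preempted leaves it intact.
     */
    static void
    switch_out_reset_on_sleep(struct toy_sched *ts, bool voluntary_sleep)
    {
            if (voluntary_sleep)
                    ts->used_slice = 0;
    }

    int
    main(void)
    {
            struct toy_sched hog_a = { .used_slice = 40 };
            struct toy_sched hog_b = { .used_slice = 40 };

            /* Both hogs are preempted by softclock rather than sleeping. */
            switch_out_reset_always(&hog_a, false);
            switch_out_reset_on_sleep(&hog_b, false);

            printf("reset-always:   used_slice = %d (hog now looks short)\n",
                hog_a.used_slice);
            printf("reset-on-sleep: used_slice = %d (hog still looks like a hog)\n",
                hog_b.used_slice);
            return (0);
    }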

Please: is this a candidate for inclusion in 14.0-CURRENT?

… I think the patch likely regresses certain workloads, …

For what it's worth, my experience with diff 2 (https://reviews.freebsd.org/D40045?id=121849) has been almost entirely positive.

Although I have not aimed to perform measurements, qualitatively I do find the overall desktop environment experience much better – particularly whilst, for example, pushing this circa 2013 notebook (HP EliteBook 8570p) unusually hard with make on the host (typically with poudriere-devel and ALLOW_MAKE_JOBS=yes); or (pictured below, 20th May) building CURRENT in a (non-patched) VirtualBox guest.

{F62242755}

Please: is this a candidate for inclusion in 14.0-CURRENT?

My personal opinion is that it should not be committed, at least in its current form. The idea appears to be based on a suggestion from D15985, but the implementation is flawed such that it can boost the priority of threads when there is no reason to do so.

… For what it's worth, my experience with diff 2 (https://reviews.freebsd.org/D40045?id=121849) has been almost entirely positive. …

The diff contains two independent changes. One changes the default kern.sched.preempt_thresh value from 48 to 224. The other artificially boosts the scheduling priority of a preempted thread; this can be disabled by setting kern.sched.pick_short=0. It would be useful to know whether you see a benefit from having one change but not the other.

… It would be useful to know whether you see a benefit from having one change but not the other.

With kern.sched.pick_short persistently 1:

  • I tried, for a while, to sense a difference between 224 and 48 for kern.sched.preempt_thresh.

I'll continue testing with 48 but at the time of writing, I can't say that there's a noticeable difference.

The same for 224 versus 48 in months/years gone by, before D40045; I know that 224 was a popular recommendation for desktop use cases, but I never noticed anything worth writing about.

YMMV, HTH

Further testing by other people confirmed my worry that this change is too simplistic to fix the problem without running into other side effects, so I'm dropping it.