
runq/sched: Switch to 256 distinct queues and extend the timeshare range
Needs Review · Public

Authored by olce on May 27 2024, 10:36 PM.

Details

Reviewers
jhb
markj
mav
jeff
andrew
manu
Group Reviewers
cam
Summary

TL;DR:

  • Runqueues now have 256 distinct queues instead of 64, and distinct priorities (between 0 and 255) now map directly to distinct queues (no folding).
  • Make ULE use a single runqueue instead of one per selection policy.
  • More compact and faster queue selection code.
  • Make FreeBSD comply with the POSIX requirement of 32 distinct priority levels for the SCHED_FIFO and SCHED_RR scheduling policies (the initial motivation for this work).
  • Increase the internal priority range for timesharing processes.

Although not a primary goal of this work, preliminary benchmarking on
a Ryzen 9 3900X (12 cores/24 threads) indicates a close to 5% improvement
for 'make -j8 buildkernel'.


These changes were initially motivated by a bug: although we
apparently provide 32 distinct priority levels for the
SCHED_FIFO/SCHED_RR scheduling policies, the minimum mandated by
POSIX, in reality some of these levels are conflated. The publicly
visible levels are mapped to contiguous internal priority levels
(PRI_MIN_REALTIME to PRI_MAX_REALTIME). However, because the current
implementation of runqueues comprises only 64 queues, adjacent
priority levels (which differ only in their value modulo 4) are mapped
internally to the same queue, and threads with these priorities are
treated identically by the schedulers, breaking the POSIX requirement.
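A minimal sketch of the aliasing described above (the macro names with an _OLD suffix are illustrative; historically, runq.h folded 256 priorities into 64 queues at 4 priorities per queue):

```c
#include <assert.h>

/* Historical layout: 256 priorities folded into 64 queues. */
#define RQ_NQS_OLD 64
#define RQ_PPQ_OLD (256 / RQ_NQS_OLD)    /* 4 priorities per queue */

/* Old mapping: distinct priorities can land in the same queue. */
static int
queue_old(int pri)
{
    return (pri / RQ_PPQ_OLD);
}

/* New mapping: one queue per priority, no folding. */
static int
queue_new(int pri)
{
    return (pri);
}
```

Two adjacent internal realtime priorities, e.g. 48 and 49, satisfy queue_old(48) == queue_old(49) under the old scheme, so the scheduler could not distinguish them; with 256 queues, queue_new() keeps every level distinct.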

One could try to circumvent this problem by artificially spreading the
priorities so that two adjacent levels in the POSIX (and rtprio(2))
interfaces end up mapped to different queues, similarly to what ULE
currently does with priorities in the batch threads' selection range
(they are spread to occupy a full runqueue; ULE uses 3 distinct
runqueues: one for the realtime selection policy, which includes the
realtime threads but also the interactive ones; one for the timesharing
selection policy, covering batch threads; and one for the idle
selection policy). However, doing this is in practice impossible, since
runqueues do not have enough distinct queues anyway: with 32 queues
dedicated to realtime threads, only 32 others remain, which alone is
not enough for timesharing threads, nor for idle threads if, by symmetry
and for consistency, we also guarantee 32 distinct levels for them.
Besides, most people would probably agree that this would be an ugly hack.

We thus settled instead on this project (a sub-project of the
scheduling priorities revamp; please see the "rtprio(2) and POSIX(.1b)
Priorities, and Their FreeBSD Implementation: A Deep Dive (and Sweep)"
paper published at AsiaBSDCon 2024, which does not mention the former
as it was discovered after the article was written). It consists in
providing 256 distinct queues per runqueue, with a compact, efficient
runqueue implementation, while keeping the code easy to read. The
high-level runqueue mechanisms themselves stay unchanged.

Internal knowledge of the runqueue's fast queue search implementation
(through a bitmap) has been removed from the schedulers, and conversely
the common implementation of runqueues no longer knows any internals of
ULE (i.e., it is no longer aware of, and no longer updates, the
timesharing policy's circular queue's head). Efficient cooperation in
this matter now happens solely through a carefully chosen interface.
By contrast, each queue, as a list of threads, has not been hidden
behind an abstraction, as we saw no practical benefit to that in this
project.

All code performing a search for a non-empty queue in the status words
has ultimately been folded into a single "low-level" routine,
runq_findq(), which all other runqueue functions performing a search
eventually call. Besides a runqueue structure, it takes a queue range
to operate on (other queues are ignored) and a predicate, which is
used to select the proper thread in the first non-empty queue found,
or to tell runq_findq() to continue the search with the next
non-empty queue, and in all cases to pass relevant data back and forth
between the predicate and runq_findq()'s caller. In most cases, there
is no runtime penalty for this architecture, as most functions are
declared 'inline' and clang is able to optimize the resulting calls
efficiently.
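A simplified sketch of this architecture, under the assumption of 256 queues tracked by 64-bit status words; the real predicate signature and the thread-selection plumbing are richer than the boolean callback shown here:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RQ_NQS   256                      /* one queue per priority */
#define RQSW_BPW 64                       /* bits per status word */
#define RQ_NSW   (RQ_NQS / RQSW_BPW)      /* number of status words */

struct runq {
    uint64_t rq_status[RQ_NSW];           /* bit i set => queue i non-empty */
};

/*
 * Predicate: inspect non-empty queue 'q'; return true to stop the
 * search there, false to continue with the next non-empty queue.
 */
typedef bool rq_pred_t(struct runq *rq, int q, void *data);

/* Trivial predicate accepting the first non-empty queue. */
static bool
rq_accept_first(struct runq *rq, int q, void *data)
{
    (void)rq; (void)q; (void)data;
    return (true);
}

/* Find the first non-empty queue in [lvl_min, lvl_max] accepted by 'pred'. */
static int
runq_findq(struct runq *rq, int lvl_min, int lvl_max,
    rq_pred_t *pred, void *data)
{
    for (int w = lvl_min / RQSW_BPW; w <= lvl_max / RQSW_BPW; w++) {
        uint64_t word = rq->rq_status[w];

        /* Mask out bits below lvl_min in the first word. */
        if (w == lvl_min / RQSW_BPW)
            word &= ~0ULL << (lvl_min % RQSW_BPW);
        /* Mask out bits above lvl_max in the last word. */
        if (w == lvl_max / RQSW_BPW && lvl_max % RQSW_BPW != RQSW_BPW - 1)
            word &= (1ULL << (lvl_max % RQSW_BPW + 1)) - 1;

        while (word != 0) {
            /* Jump straight to the next set bit (RQSW_BSF()-style). */
            int q = w * RQSW_BPW + __builtin_ctzll(word);
            if (pred(rq, q, data))
                return (q);
            word &= word - 1;             /* clear the lowest set bit */
        }
    }
    return (-1);                          /* no matching queue in range */
}
```

Range-restricted callers (per-policy searches, steal candidates, etc.) then reduce to one call with the appropriate [lvl_min, lvl_max] and predicate.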

runq_findq() is more optimized than some of the slightly different
pre-existing variants it replaces: it always jumps immediately to the
next set bit through the use of RQSW_BSF() (similar to ffs()) instead
of looping around in some corner cases. ULE's timesharing selection
policy is implemented by calling runq_findq() twice to honor the
circular queue's head, taking care to provide two disjoint ranges for
the two calls (some of the previous variants would uselessly rescan
already-scanned queues).
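The two-call scheme over disjoint ranges can be illustrated as follows (a toy model: a boolean per queue stands in for the bitmap search, and TS_NQS and the helper names are illustrative):

```c
#include <stdbool.h>

#define TS_NQS 64   /* illustrative size of the timesharing band */

/* Toy range search: first non-empty queue in [lo, hi], or -1. */
static int
findq(const bool *nonempty, int lo, int hi)
{
    for (int q = lo; q <= hi; q++)
        if (nonempty[q])
            return (q);
    return (-1);
}

/*
 * Circular selection honoring the rotating head: scan [head, max]
 * first, then wrap around to [0, head - 1].  The two ranges are
 * disjoint, so no queue is ever scanned twice.
 */
static int
ts_pick(const bool *nonempty, int head)
{
    int q = findq(nonempty, head, TS_NQS - 1);

    if (q == -1 && head > 0)
        q = findq(nonempty, 0, head - 1);
    return (q);
}
```

In the actual implementation the two scans are the two runq_findq() calls mentioned above, each restricted to its half of the circular band.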

As every distinct priority now maps to a different queue, the two
ranges dedicated to kernel threads could be shrunk (by up to a factor
of 4) while preserving the invariant that higher-priority threads
always preempt lower-priority ones. This frees room for assigning more
levels to the timesharing selection policy, which should improve ULE's
handling of interactivity, and both 4BSD's and ULE's fair treatment of
timesharing processes.

Some quirks in ULE have been removed. First, there is no longer any
spreading of timesharing priorities to compensate for queue aliasing.
Second, ULE no longer avoids stealing the highest-priority thread from
some core (this quirk was noticed while refactoring the code; so far
we have not been able to devise a proper justification for it and
suspect it is an artefact of older versions of ULE).

These changes were originally developed as 25 separate commits. They
have been squashed into a single diff here to give an overview, and
have also been dispatched to 6 different, chained Phabricator
differential revisions in the hope of facilitating review:
https://reviews.freebsd.org/D45387
https://reviews.freebsd.org/D45388
https://reviews.freebsd.org/D45389
https://reviews.freebsd.org/D45390
https://reviews.freebsd.org/D45391
https://reviews.freebsd.org/D45392

Besides review, we call for much wider testing with production
workloads. As mentioned above, some preliminary benchmarking on
a single machine (Ryzen 9 3900X with 64GB of RAM) hints at performance
improvements for 'buildworld' (numbers are times in seconds):

    N           Min           Max        Median           Avg        Stddev
x  23           109           111           111     110.52174    0.59310931
+  21           104           107           105     105.28571     0.6436503
Difference at 95.0% confidence
        -5.23602 +/- 0.376224
        -4.73755% +/- 0.332716%
        (Student's t, pooled s = 0.617692)

However, scheduling decisions can have large ramifications, so it is
hard to predict the net outcome of such changes on specific workloads
and different machines. Ideally, we aim to ensure that no common or
important workload degrades noticeably with these changes.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 57946
Build 54834: arc lint + arc unit