FreeBSD

sched_ule: Recover previous nice and anti-starvation behaviors
Needs ReviewPublic

Authored by olce on Sep 6 2024, 1:51 PM.
Referenced Files
F118372653: firefox-1-stats.txt
F118372654: firefox-2-stats.txt
F118372621: firefox-2-stats.txt
F118372571: firefox-1-stats.txt
Details

Reviewers
jeff
mav
markj
jhb
Summary

The justification for this change is to avoid disturbing ULE's behavior
too much at this time. We acknowledge, however, that the effect of
"nice" values is extremely weak and will most probably change going
forward.

This tuning mostly recovers ULE's behavior from before the switch to
a single 256-queue runqueue and the widening of the timesharing
priority range.

After this change, in a series of tests involving two long-running
processes with varying nice values competing for the same CPU, the
CPU-time ratio obtained by the highest-priority process changes by at
most 1.15% and on average by 0.46% (absolute differences). In relative
terms, it changes by at most 2% and on average by 0.78%.

In order to preserve these ratios, as the number of priority levels
allotted to the timesharing policy has been raised from 136 to 168, we
keep the ratio of levels reserved for nice values to those reserved for
CPU usage by applying a factor of 5/4 (which is close to 168/136).
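
To make the arithmetic concrete, here is a minimal, illustrative sketch of the scaling; the macro and function names below are hypothetical and are not the actual sched_ule.c identifiers:

/*
 * Illustrative sketch only: TS_PRI_RANGE_OLD, TS_PRI_RANGE_NEW and
 * ts_nice_offset() are hypothetical names, not the real code.
 */
#define TS_PRI_RANGE_OLD   136   /* timesharing levels before the change */
#define TS_PRI_RANGE_NEW   168   /* timesharing levels after the change */

/* Scale the nice contribution so the nice/CPU-usage split is preserved. */
static inline int
ts_nice_offset(int nice)
{
        /* 5/4 = 1.25, close to TS_PRI_RANGE_NEW / TS_PRI_RANGE_OLD ~= 1.235. */
        return (nice * 5 / 4);
}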

Time-based advance of the timesharing circular queue's head is ULE's
main fairness and anti-starvation mechanism. The higher number of
timesharing queues is now compensated for by allowing a greater
increment of the head offset per tick. Because there are now 112 queue
levels dedicated to the timesharing policy, whereas there previously
were 64 (timesharing priorities were previously spread over a separate
64-queue runqueue), the circular queue's head must progress 7/4 times
faster.

While here, take 'cnt' into account as the number of elapsed ticks when
advancing the circular queue's head. This fix depends on the other code
changes enabling increments by more than one.
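
As a rough illustration of the two points above, here is a hedged sketch of advancing the head by a function of the elapsed ticks 'cnt', scaled by 7/4 (= 112/64); TS_NQUEUES, ts_ridx and ts_advance_head() are hypothetical names, not the actual sched_ule.c code:

/*
 * Illustrative sketch only.  Sweeping 112 timesharing levels 7/4 times
 * faster takes the same time a full circle over the previous 64 levels
 * used to take.
 */
#define TS_NQUEUES      112             /* timesharing levels after the change */

static unsigned int ts_ridx;            /* circular queue's head offset */

static void
ts_advance_head(int cnt)
{
        /*
         * Advance by 7/4 positions per elapsed tick.  A real
         * implementation would carry the fractional remainder instead
         * of truncating it as done here.
         */
        ts_ridx = (ts_ridx + cnt * 7 / 4) % TS_NQUEUES;
}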

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 59349
Build 56236: arc lint + arc unit

Event Timeline

olce requested review of this revision. Sep 6 2024, 1:51 PM

Some results of tests conducted with two threads whose processes have different nice values, forced onto the same core, reporting the %CPU ratio achieved by the higher-priority process over periods of more than 1 minute, as computed from ps output (mainly the etime and time fields); a worked example of how these ratios are derived follows the two lists below:

-CURRENT

  • 0 vs 20 : 54.37% (3:18.47 vs 2:46.55)
  • -2 vs 20 : 55.51% (1:52.43 vs 1:30.11)
  • -4 vs 20 : 57.62% (1:22.77 vs 1:00.87)
  • -6 vs 20 : 59.93% (7:17.71 vs 4:52.70)
  • -8 vs 20 : 61.54% (3:36.98 vs 2:15.60)
  • -10 vs 20 : 63.30% (28:56.20 vs 16:46.66)
  • -12 vs 20 : 65.35% (10:11.64 vs 5:24.35)
  • -14 vs 20 : 66.66% (3:12.28 vs 1:36.19)
  • -16 vs 20 : 66.70% (2:41.02 vs 1:20.38)
  • -20 vs 20 : 66.76% (3:21.77 vs 1:40.45)

With this series of patches:

  • 0 vs 20 : 54.93% (2:03.79 vs 1:41.56)
  • -2 vs 20 : 55.17% (25:08.27 vs 20:25.73)
  • -4 vs 20 : 56.47% (19:18.32 vs 14:53.06)
  • -6 vs 20 : 59.63% (9:13.80 vs 6:14.87)
  • -8 vs 20 : 61.76% (3:02.38 vs 1:52.93)
  • -10 vs 20 : 64.26% (31:40.02 vs 17:36.76)
  • -12 vs 20 : 65.37% (50:07.73 vs 26:33.36)
  • -20 vs 20 : 66.72% (5:45.09 vs 2:52.14)
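
For reference, and assuming the ratio is simply the first process's CPU time over the total of both (which is how the numbers above appear to be derived), the first -CURRENT entry works out as: 3:18.47 = 198.47 s and 2:46.55 = 166.55 s, so 198.47 / (198.47 + 166.55) ≈ 0.5437, i.e. the reported 54.37%.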

New round of multiple tests, to ensure the current anti-starvation and nice behaviors stay (approximately) the same after this series of patches.

Two processes competing for the same core with different nice values

Tests automated by scripts running both processes for 10 minutes on the same core, from which other (non-kernel) processes are excluded via cpuset, reporting the average %CPU of the first process over the whole experiment. The second process always has a nice value of 20. Tests under VirtualBox were conducted with hz set to 1000.
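
The actual tests were driven by shell scripts around cpuset(1), nice(1) and ps(1); purely as an illustration, a minimal C equivalent of one competing process could look like the sketch below (TEST_CORE and TEST_NICE are hypothetical parameters, and negative nice values require root):

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/time.h>
#include <sys/resource.h>

#include <err.h>

#define TEST_CORE       2       /* hypothetical: core reserved for the test */
#define TEST_NICE       -10     /* hypothetical: nice value under test */

int
main(void)
{
        cpuset_t mask;

        /* Pin the current process to a single core. */
        CPU_ZERO(&mask);
        CPU_SET(TEST_CORE, &mask);
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
            sizeof(mask), &mask) != 0)
                err(1, "cpuset_setaffinity");

        /* Set the nice value under test. */
        if (setpriority(PRIO_PROCESS, 0, TEST_NICE) != 0)
                err(1, "setpriority");

        /* Pure CPU-bound loop; %CPU is then read externally via ps(1). */
        for (;;)
                ;
}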

Process 1's nice | Before, Ryzen 9 3900X | After, Ryzen 9 3900X | Before, VirtualBox | After, VirtualBox
-20 | 66.71 | 66.71 | 66.71 | 66.72
-19 | 66.71 | 66.71 | 66.71 | 66.73
-18 | 66.70 | 66.69 | 66.70 | 66.69
-17 | 66.69 | 66.69 | 66.70 | 66.68
-16 | 66.68 | 66.68 | 66.68 | 66.68
-15 | 66.68 | 66.68 | 66.68 | 66.67
-14 | 66.68 | 66.67 | 66.66 | 66.65
-13 | 66.68 | 66.67 | 66.65 | 66.67
-12 | 65.26 | 65.86 | 65.28 | 65.79
-11 | 64.11 | 65.37 | 63.82 | 65.35
-10 | 63.66 | 64.43 | 63.65 | 64.05
 -9 | 62.19 | 63.69 | 62.19 | 63.70
 -8 | 62.13 | 61.76 | 62.13 | 61.78
 -7 | 61.23 | 60.90 | 61.24 | 60.90
 -6 | 59.16 | 60.19 | 59.16 | 60.19
 -5 | 57.88 | 59.21 | 57.94 | 59.19
 -4 | 57.02 | 57.52 | 57.00 | 57.53
 -3 | 55.69 | 56.73 | 55.72 | 56.65
 -2 | 54.97 | 55.65 | 54.92 | 55.69
 -1 | 53.68 | 54.72 | 53.78 | 54.80
  0 | 53.60 | 53.03 | 53.71 | 53.05
  1 | 52.78 | 52.48 | 52.80 | 52.44
  2 | 50.90 | 51.45 | 50.85 | 51.49
  3 | 50.06 | 50.95 | 50.10 | 50.80
  4 | 50.01 | 50.00 | 50.02 | 50.03
  5 | 50.00 | 50.01 | 50.04 | 50.02
  6 | 50.00 | 50.01 | 50.03 | 50.03
  7 | 50.00 | 50.00 | 50.01 | 50.01
  8 | 50.00 | 50.00 | 50.01 | 50.00
  9 | 50.00 | 50.01 | 50.02 | 50.03
 10 | 50.00 | 50.01 | 50.01 | 50.00
 11 | 50.01 | 50.00 | 50.03 | 50.00
 12 | 50.00 | 50.01 | 50.01 | 50.00
 13 | 50.00 | 50.00 | 50.00 | 50.02
 14 | 50.00 | 50.00 | 50.01 | 50.01
 15 | 50.00 | 50.00 | 50.02 | 50.00
 16 | 50.01 | 50.00 | 50.05 | 50.00
 17 | 50.00 | 50.00 | 50.00 | 50.02
 18 | 50.00 | 50.00 | 50.01 | 50.01
 19 | 50.00 | 50.01 | 50.01 | 50.00
 20 | 50.01 | 50.00 | 50.02 | 50.02

As one can see, relative performance plateaus at both ends (towards -20 and towards 20) and only really varies between -13 and 3 or 4. In that range, the percentages change slightly more gradually with the new implementation, but remain close to the original ones.

We have actually run these tests numerous times, and the results are very stable (only the last digit varies slightly, as can be seen in the numbers above).

Building world

Here, we measure the time it takes to build world except the toolchain (WITHOUT_CROSS_COMPILER=1 WITHOUT_TOOLCHAIN=1) for a recent stable/14 (a few local commits on top of e9fe7b08eb2e331) in a loop, with an intervening 'make cleanworld' between runs, using the LLVM 19.1.7 external toolchain, in single-user mode. Times are reported in centiseconds.

x Before
+ After
    N           Min           Max        Median           Avg        Stddev
x  15         35757         35915         35814     35821.067       42.1993
+  26         35693         35876       35735.5     35739.115     38.182013
Difference at 99.5% confidence
        -81.9513 +/- 42.6141
        -0.22878% +/- 0.118853%
        (Student's t, pooled s = 39.671)

We also did these experiments numerous times, in different configurations (different source trees, in parallel with other workloads). It appears that the results are sensitive to many parameters (temperature, layout of files in the filesystem, and possibly even an intervening firmware update). However, when kernels before and after this series are tested back to back without any other intervening activity, we always observed very close timings, as reported above (a negligible 0.23% reduction in execution time). Doing the experiments in single-user mode and redirecting the output to /dev/null considerably reduced the variance, allowing us to report differences with 99.5% confidence.

Without the change here (D46566), we observed more significant differences (up to ~5% improvement, at the expense of less fair scheduling), so seeing practically none here is another point increasing confidence that the fairness (anti-starvation) properties have been correctly preserved.

Netflix tests

The series was tested at Netflix in production for a week and compared with unpatched kernels, producing a number of raw graphs. Not knowing whether I can share them, I'll just report the conclusions: latency and CPU load remain essentially unchanged. We observe tiny improvements in average latency (~0.44%), and not so tiny ones in latency bounds (a reduction of ~18% in the maximum latency; but in the absence of histograms or variance data, this may just be an artifact). Given that the workloads are not exactly identical (load balancing), the observed latency and CPU differences are not deemed significant.

Web browsing simulation

We emulate a web browsing session through a synthetic benchmark. Please see the 'late' repository at https://github.com/OlCe2/late, and in particular the 'firefox.sh' script. A launch of 'firefox.sh' currently emulates only 5 processes (as these were the most significant in the traces we produced). Tests were launched with a calibration period of 10s for the work loop and a duration of 180s per test. Each test was launched 10 times, with 2 instances running in parallel, on a Ryzen 9 3900X (so 10 single-threaded processes for 12 actual cores and 24 hardware threads, with no manual pinning).

Statistics about each run, produced with ministat, are attached. There does not seem to be any clearly visible impact of the changes, as before/after variations are roughly in the range of the variations between the two tests launched in parallel, except perhaps a tiny improvement in latency (confirming this would require more statistics).

Conclusion

After these tests, it seems that the goals of this revision are fulfilled, and that the full patch series has no notable impact on the scheduler except for the part it is designed to improve (having a distinct queue level per idprio/rtprio priority level).

Statistics for the web browsing simulations:

For the two firefox.sh instances, before the changes:

And the two after the changes: