
sched_ule: Sanitize CPU's use and priority computations, and ticks storage
Needs Review, Public

Authored by olce on Sep 6 2024, 1:52 PM.
Details

Reviewers
jeff
mav
markj
jhb
Summary

Computation of %CPU in sched_pctcpu() was overly complicated and wrong
whenever the window was not maximal (10 seconds span; in practice this
is always the case, as the window oscillates between 10 and 11 seconds
for continuously running processes). It was also performed unshifted in
its first part, losing precision (up to 9% with SCHED_TICK_SECS set to
10), and applied an ineffective shift in its second part.
Conserve maximum precision by shifting only by the amount required to
attain FSHIFT before dividing. Apply classical rounding to nearest
instead of rounding down.
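
As an illustration of the "shift only as needed, then divide with rounding to nearest" idea, here is a minimal user-space sketch. It is not the code from this revision: the names ex_pctcpu(), EX_FSHIFT and EX_TICK_SHIFT are hypothetical, and the values 11 and 10 are assumptions chosen to resemble FSHIFT and the scheduler's tick shift.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative values only (assumptions of this sketch). */
#define EX_FSHIFT      11   /* fraction bits wanted in the result */
#define EX_TICK_SHIFT  10   /* fraction bits already carried by run ticks */

/*
 * Fixed-point CPU share.  'run_shifted' holds the run ticks pre-scaled by
 * 2^EX_TICK_SHIFT; 'window' is the unscaled window length in ticks.
 * Returns the share scaled by 2^EX_FSHIFT, rounded to nearest.
 */
static uint32_t
ex_pctcpu(uint32_t run_shifted, uint32_t window)
{
	uint64_t num;

	assert(window > 0);
	/* Shift only by the bits still missing to reach EX_FSHIFT. */
	num = (uint64_t)run_shifted << (EX_FSHIFT - EX_TICK_SHIFT);
	/* Adding half the divisor turns truncation into round-to-nearest. */
	return ((uint32_t)((num + window / 2) / window));
}
```

With these assumed values, 1 run tick over a 3-tick window yields 683 (the exact quotient is about 682.67), where plain truncation would yield 682.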

To generally avoid wraparound problems with tick fields in 'struct
td_sched' (as already happened once in sched_pctcpu_update()), make them
all unsigned, and ensure 'ticks' is always converted to some 'u_int'.
While here, fix SCHED_AFFINITY().
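
The reason unsigned tick fields sidestep wraparound can be sketched in a few lines of user-space C. This is only an illustration under stated assumptions: ex_ticks_elapsed() and ex_in_window() are hypothetical helpers, not functions from the kernel or from this revision, and the tick counter is modeled as a plain 'int' that wraps from INT_MAX to INT_MIN.

```c
#include <assert.h>
#include <limits.h>

/*
 * Elapsed ticks between two samples of a signed tick counter.
 * Converting to unsigned first makes the subtraction well defined
 * (modulo UINT_MAX + 1), whereas signed overflow is undefined in C.
 */
static unsigned int
ex_ticks_elapsed(int now, int then)
{
	return ((unsigned int)now - (unsigned int)then);
}

/* True if 'now' is still inside a window of 'len' ticks begun at 'start'. */
static int
ex_in_window(int now, int start, unsigned int len)
{
	/* Keep the window far below the counter period (sketch assumption). */
	assert(len <= INT_MAX);
	return (ex_ticks_elapsed(now, start) < len);
}
```

For example, a sample taken 6 ticks before the counter wraps and another taken 4 ticks after it still differ by 10.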

Rewrite sched_pctcpu_update() while keeping the existing formulas:

  • Close a gap in the cliff case: in theory, 'ts_ticks' can become greater than the window size if a running thread has not been accounted for too long (this cannot happen today because of sched_clock()).
  • Make the decay ratio explicit and configurable (SCHED_CPU_DECAY_NUMER, SCHED_CPU_DECAY_DENOM). Set it to the current value (10/11), which produces a 95% attenuation after about 32s. This eases experimenting with other values. Apply the ratio to shifted ticks for better precision, independently of the values chosen for SCHED_TICK_MAX/SCHED_TICK_SECS.
  • Remove the redundant SCHED_TICK_TARG. Compute SCHED_TICK_MAX from SCHED_TICK_SECS, the latter now really specifying the maximum size of the %CPU estimation window.
  • Ensure the computation is immune to a varying 'hz' (which cannot happen today), so that afterwards SCHED_TICK_RUN(ts) is mathematically guaranteed to be lower than SCHED_TICK_LENGTH(ts).
  • Thoroughly explain the current formula, and mention its main drawback (it is completely dependent on the frequency of calls to sched_pctcpu_update(), which currently manifests itself for sleeping threads).
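
To make the decay ratio concrete, the following user-space sketch applies a 10/11 ratio to shifted ticks and counts how many applications are needed to reach a given attenuation. The names EX_DECAY_NUMER, EX_DECAY_DENOM, ex_decay() and ex_steps_to_attenuate() are hypothetical stand-ins, and the link between step count and seconds assumes one decay step per second, which is an assumption of this sketch rather than something the summary states.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative stand-ins for the configurable ratio (10/11). */
#define EX_DECAY_NUMER  10
#define EX_DECAY_DENOM  11

/*
 * Apply one decay step to a shifted tick count.  Working on the shifted
 * value keeps the fractional part that an unshifted count would lose.
 */
static uint32_t
ex_decay(uint32_t shifted_ticks)
{
	return ((uint32_t)(((uint64_t)shifted_ticks * EX_DECAY_NUMER) /
	    EX_DECAY_DENOM));
}

/* Number of decay steps until at most 'remaining' of the value is left. */
static unsigned int
ex_steps_to_attenuate(double remaining)
{
	double factor = 1.0;
	unsigned int steps = 0;

	assert(remaining > 0.0 && remaining < 1.0);
	while (factor > remaining) {
		factor *= (double)EX_DECAY_NUMER / EX_DECAY_DENOM;
		steps++;
	}
	return (steps);
}
```

With a 10/11 ratio, ex_steps_to_attenuate(0.05) returns 32, which matches the roughly 32s figure quoted above under the one-step-per-second assumption. Decaying 5 ticks shifted by 10 bits gives 4654, that is about 4.545 ticks, whereas the unshifted integer computation would give 4.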

Rework sched_priority():

  • Ensure 'p_nice' is read only once, to be immune to a concurrent change.
  • Clearly show that the computed priority is the sum of 3 components. Make them all positive by shifting the starting priority and shifting the nice value in SCHED_PRI_NICE().
  • Compute the priority offset deriving from the %CPU with rounding to nearest.
  • Make the KASSERT() output much more informative, with details regarding the priority computation.
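
The shape of such a priority computation can be sketched as follows. This is a simplified user-space model, not the revision's sched_priority(): the structure ex_proc, the function ex_priority() and all EX_* constants are hypothetical, with values picked only for illustration. As in FreeBSD, a numerically larger priority is less favored.

```c
#include <assert.h>

/* Hypothetical values, chosen only for this sketch. */
#define EX_NICE_MIN   (-20)
#define EX_NICE_MAX   20
#define EX_PRI_BASE   100   /* starting priority, already shifted */
#define EX_CPU_RANGE  60    /* span of the %CPU-derived component */

struct ex_proc {
	volatile int nice;  /* may be changed concurrently */
};

/* Priority as the sum of three non-negative components. */
static int
ex_priority(const struct ex_proc *p, unsigned int run, unsigned int window)
{
	int nice, nice_off, cpu_off;

	assert(window > 0 && run <= window);
	/* Read the nice value exactly once; use only the local copy below. */
	nice = p->nice;
	assert(nice >= EX_NICE_MIN && nice <= EX_NICE_MAX);

	/* Shift nice so that the component is never negative. */
	nice_off = nice - EX_NICE_MIN;
	/* %CPU component, rounded to nearest instead of truncated. */
	cpu_off = (int)(((unsigned long long)EX_CPU_RANGE * run +
	    window / 2) / window);

	return (EX_PRI_BASE + cpu_off + nice_off);
}
```

Reading the nice value into a local first means the range check and the arithmetic see the same value even if another thread changes it in between.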

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 59350
Build 56237: arc lint + arc unit