sched_ule: Sanitize CPU's use and priority computations, and ticks storage

Description

The computation of %CPU in sched_pctcpu() was overly complicated. It was
wrong in the case of a non-maximal window (10 seconds span; this is always
the case in practice, as the window oscillates between 10 and 11 seconds
for continuously running processes). Its first part was performed
unshifted, essentially losing precision (up to 9% with SCHED_TICK_SECS
being 10), and its second part used an ineffective shift.
Conserve maximum precision by shifting only by the amount required to
attain FSHIFT before dividing. Apply classical rounding to nearest
instead of rounding down.
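
The shift-then-divide idea can be illustrated with a minimal sketch. It is
not the committed code: pctcpu_fixed() and TICK_SHIFT are hypothetical names,
and the values 11 and 10 are assumptions of this example. The run-tick count
is assumed to already carry TICK_SHIFT fractional bits, so it only needs
FSHIFT - TICK_SHIFT more bits before the division, and half the divisor is
added to round to nearest.

```c
#include <assert.h>
#include <stdint.h>

#define FSHIFT     11   /* assumed number of fixed-point fraction bits */
#define FSCALE     (1u << FSHIFT)
#define TICK_SHIFT 10   /* hypothetical: fraction bits already in run ticks */

/*
 * Return run/length as a fixed-point fraction with FSHIFT fraction bits.
 * 'run_shifted' is a tick count pre-multiplied by 2^TICK_SHIFT and
 * 'len_ticks' is the unshifted window length in ticks.
 */
static uint32_t
pctcpu_fixed(uint32_t run_shifted, uint32_t len_ticks)
{
	uint64_t num;

	assert(len_ticks > 0);
	/* Shift only by what is still missing to reach FSHIFT. */
	num = (uint64_t)run_shifted << (FSHIFT - TICK_SHIFT);
	/* Adding half the divisor rounds to nearest instead of down. */
	return ((uint32_t)((num + len_ticks / 2) / len_ticks));
}
```

With these assumed values, a thread that ran for 500 of 1000 ticks yields
FSCALE / 2, and 1 of 3 ticks yields 683 where truncation would give 682.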

To generally avoid wraparound problems with tick fields in 'struct
td_sched' (as already happened once in sched_pctcpu_update()), make them
all unsigned, and ensure 'ticks' is always converted to some 'u_int'.
While here, fix SCHED_AFFINITY().
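
Why unsigned storage helps can be shown with a small sketch. The helper
ticks_elapsed() is hypothetical and not part of the commit; it only
demonstrates that converting a signed tick counter to unsigned before
subtracting gives a well-defined elapsed count even when the counter has
wrapped, because unsigned arithmetic is modular.

```c
#include <assert.h>
#include <limits.h>

/*
 * Elapsed ticks between a stored unsigned timestamp and a signed global
 * counter (hypothetical helper).  The signed value is converted first, so
 * the subtraction is modular and never overflows.
 */
static unsigned int
ticks_elapsed(int now_ticks, unsigned int last)
{
	return ((unsigned int)now_ticks - last);
}
```

For example, a timestamp taken 5 ticks before the signed counter reaches
INT_MAX and a reading taken after it has wrapped to INT_MIN + 4 still
differ by exactly 10 ticks.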

Rewrite sched_pctcpu_update() while keeping the existing formulas:

  • Fix a hole in the cliff case: in theory, 'ts_ticks' can become greater than the window size if a running thread has gone unaccounted for too long (this cannot happen today because of sched_clock()).
  • Make the decay ratio explicit and configurable (SCHED_CPU_DECAY_NUMER, SCHED_CPU_DECAY_DENOM). Set it to the current value (10/11), which produces a 95% attenuation after about 32s. This eases experimenting with changing it. Apply the ratio on shifted ticks for better precision, independently of the chosen value for SCHED_TICK_MAX/SCHED_TICK_SECS.
  • Remove redundant SCHED_TICK_TARG. Compute SCHED_TICK_MAX from SCHED_TICK_SECS, the latter now really specifying the maximum size of the %CPU estimation window.
  • Ensure it is immune to varying 'hz' (which today can't happen), so that after computation SCHED_TICK_RUN(ts) is mathematically guaranteed to be lower than SCHED_TICK_LENGTH(ts).
  • Thoroughly explain the current formula, and mention its main drawback (it is completely dependent on the frequency of calls to sched_pctcpu_update(), which currently manifests itself for sleeping threads).
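
The 95% attenuation figure for a 10/11 ratio can be checked with a short
sketch. It assumes the ratio is applied once per second, which is an
assumption of this example rather than something the commit states, and
DECAY_NUMER, DECAY_DENOM and decay_steps_to() are hypothetical stand-ins,
not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define DECAY_NUMER 10  /* hypothetical stand-in for SCHED_CPU_DECAY_NUMER */
#define DECAY_DENOM 11  /* hypothetical stand-in for SCHED_CPU_DECAY_DENOM */

/*
 * Count how many applications of the decay ratio are needed before
 * 'start' drops to at most 'percent_left' percent of its initial value.
 * Integer arithmetic with truncation, as a kernel would use.
 */
static unsigned int
decay_steps_to(uint64_t start, unsigned int percent_left)
{
	uint64_t v = start;
	unsigned int n = 0;

	assert(start > 0 && percent_left < 100);
	while (v * 100 > start * percent_left) {
		v = v * DECAY_NUMER / DECAY_DENOM;
		n++;
	}
	return (n);
}
```

Starting from 1000000, this sketch needs 32 steps to fall to 5% or less,
matching (10/11)^32 ≈ 0.047, and 8 steps to fall to 50% or less.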

Rework sched_priority():

  • Ensure 'p_nice' is read only once, to be immune to a concurrent change.
  • Clearly show that the computed priority is the sum of 3 components. Make them all positive by shifting the starting priority and shifting the nice value in SCHED_PRI_NICE().
  • Compute the priority offset deriving from the %CPU with rounding to nearest.
  • Make the KASSERT() output much more informative, with details regarding the priority computation.
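
The structure described above, a priority built from three non-negative
components with a rounded %CPU offset, can be sketched as follows. All
constants and function names here (PRI_BASE, PRI_CPU_RANGE, NICE_MIN,
nice_offset(), cpu_offset(), priority()) are hypothetical and chosen only
for illustration; they are not the values or macros used in sched_ule.c.

```c
#include <assert.h>
#include <stdint.h>

#define PRI_BASE      100   /* hypothetical starting priority */
#define PRI_CPU_RANGE 40    /* hypothetical span of the %CPU component */
#define NICE_MIN      (-20) /* lowest nice value */

/* Shift nice so that the component is never negative (0..40). */
static int
nice_offset(int nice)
{
	return (nice - NICE_MIN);
}

/* Scale run/len into 0..PRI_CPU_RANGE, rounding to nearest. */
static int
cpu_offset(uint32_t run, uint32_t len)
{
	assert(len > 0 && run <= len);
	return ((int)(((uint64_t)run * PRI_CPU_RANGE + len / 2) / len));
}

/* The priority is the plain sum of three non-negative components. */
static int
priority(uint32_t run, uint32_t len, int nice)
{
	/* 'nice' is passed by value: the caller reads it exactly once. */
	return (PRI_BASE + cpu_offset(run, len) + nice_offset(nice));
}
```

With these made-up ranges, a thread that ran 2 of 3 ticks at nice 0 gets
100 + 27 + 20 = 147, where truncation would have produced a %CPU
component of 26.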

MFC after: 1 month
Event: Kitchener-Waterloo Hackathon 202506
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D46567

(cherry picked from commit a33225efb4bc2598e4caa1c1f7f715653e8b1fda)

Details