What are the problems with the existing accounting?
- sched_pctcpu() is based on hardclock tick counting, so it cannot reliably
account for time slices shorter than 1 ms (with the usual hardclock hz=1000).
Such short slices are very common for any application that does more than
pure 100% CPU-bound math, because it keeps waiting for external events. As
a result, none of the sched_pctcpu()-based stats can be treated as precise.
Because of this, the top(1) utility, for example, has its own cpu% calculator
(based on runtime diff math) instead of using the ki_pctcpu value from the
kernel. The only reliable way to calculate %cpu is to base it on cpu ticks,
not on any other ticks.
- The kern_racct.c logic for the "pcpu" stat was built on top of the already
imprecise sched_pctcpu(), with two more imprecise hacks added (a custom
formula for the first 3 seconds of process time, and add-and-decay logic to
aggregate the per-jail stat). All of this leads to results like 4000% jail
cpu usage on a single-core CPU (make buildkernel -j4 in a jail in my tests)
or, at the other extreme, 30% cpu usage when the cpu is really loaded at
100% (dd if=/dev/zero of=/dev/null).
- The current racct implementation makes wide use of the global RACCT lock,
which hurts scalability.
- It is not realtime. In fact, all cputime-related actions are only performed
on demand, once every 1 sec.
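To illustrate the first point, here is a minimal sketch of why tick sampling goes wrong for short time slices. It is a simplified userland model, not the kernel's sched_pctcpu() code: struct workload and both functions are hypothetical names invented for this example, and the model assumes a thread that strictly alternates between running and sleeping.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical workload: run for run_us, sleep for sleep_us, repeat.
 * phase_us says how far into that cycle the thread already is at time 0.
 */
struct workload {
	uint64_t run_us;
	uint64_t sleep_us;
	uint64_t phase_us;
};

/* On-cpu time accumulated during the first x microseconds of the cycle train. */
static uint64_t
ran_before(const struct workload *w, uint64_t x)
{
	uint64_t period = w->run_us + w->sleep_us;
	uint64_t rem = x % period;

	return ((x / period) * w->run_us + (rem < w->run_us ? rem : w->run_us));
}

/* Exact accounting: sum the time the thread really spent on the cpu. */
uint64_t
exact_runtime_us(const struct workload *w, uint64_t total_us)
{
	assert(w->run_us + w->sleep_us > 0);
	return (ran_before(w, w->phase_us + total_us) -
	    ran_before(w, w->phase_us));
}

/*
 * Tick sampling: look at the thread only at tick instants and charge a
 * whole tick whenever it happens to be on the cpu at that instant.
 */
uint64_t
sampled_runtime_us(const struct workload *w, uint64_t total_us,
    uint64_t tick_us)
{
	uint64_t period = w->run_us + w->sleep_us;
	uint64_t charged = 0;

	assert(period > 0 && tick_us > 0);
	for (uint64_t t = 0; t < total_us; t += tick_us) {
		if ((t + w->phase_us) % period < w->run_us)
			charged += tick_us;
	}
	return (charged);
}
```

With run_us=200, sleep_us=800 and a 1000 us tick, the exact runtime over one second is 200000 us (20%), while the sampled value is either 1000000 us (100%) or 0 depending only on phase_us. Real workloads are not this regular, but the same over- and under-charging is what makes the tick-based numbers unreliable.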
There is one possible problem with the proposed accounting system:
it may cause additional cpu load. I don't expect the effect to be
noticeable, but I haven't run precise tests for this yet. The two places
where the additional load may occur are statclock() and thread switching;
what is added there is a call to racct_rt_add_thread_runtime(), which I've
tried to make as light as possible.
What was done:
- struct racct now has additional "realtime" fields, with a spinlock for them:
r_runtime is total cpu ticks elapsed
r_rtpersec is quickly-averaged cpu ticks per last second
r_ltruntime, r_rtlastticks - helpers to calculate r_rtpersec
- there is a new proc spinlock, PROC_RTLOCK (p_rtmtx)
- struct thread now has additional fields, protected by PROC_RTLOCK(td_proc): td_rtpersec, td_ltruntime, td_rtlastticks
- (updated) using td->td_ucred without locks for realtime access to cred->racct structures.
td->td_ucred->cr_ruidinfo->ui_racct
td->td_ucred->cr_prison->pr_prison_racct->prr_racct
td->td_ucred->cr_loginclass->lc_racct
(but see bug #262633)
struct ucred itself is already immutable for the fields in question:
struct uidinfo *
struct prison *
struct loginclass *
- config variables:
RACCT_RT - enable cputime accounting
RACCT_RT_PCTCPU - also enable realtime cpu% calculation
- how it is all reported to userspace:
When racct is enabled, kinfo_proc->ki_pctcpu (both for procs and threads) is filled from here, not from sched_pctcpu()
jail_get can now report two additional variables: racct.rt.us and racct.rt.uspersec, both uint64_t, containing the total cputime consumed by the jail (us) and the quickly-averaged cputime per last second
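As a rough illustration of how a "per last second" value and a cpu% can be derived from a running total, here is a minimal sketch. It is not the code from this patch: struct rt_window, rt_window_add() and rt_percent() are hypothetical names, the one-second window update rule is an assumption, and everything is kept in microseconds for simplicity, whereas the patch's r_* fields count cpu ticks.

```c
#include <assert.h>
#include <stdint.h>

#define US_PER_SEC	UINT64_C(1000000)

/* Hypothetical accumulator, loosely mirroring the r_* fields described above. */
struct rt_window {
	uint64_t total_us;		/* total cputime consumed (cf. r_runtime) */
	uint64_t us_per_sec;		/* cputime per second, last window (cf. r_rtpersec) */
	uint64_t window_start_total_us;	/* total_us when the window began (cf. r_ltruntime) */
	uint64_t window_start_us;	/* timestamp when the window began (cf. r_rtlastticks) */
};

/*
 * Add delta_us of consumed cputime at time now_us. Once at least one
 * second has passed since the window began, recompute the per-second
 * rate and start a new window. (Assumed update rule, for illustration.)
 */
void
rt_window_add(struct rt_window *w, uint64_t delta_us, uint64_t now_us)
{
	uint64_t elapsed, used;

	assert(now_us >= w->window_start_us);
	w->total_us += delta_us;
	elapsed = now_us - w->window_start_us;
	if (elapsed >= US_PER_SEC) {
		used = w->total_us - w->window_start_total_us;
		/* used * US_PER_SEC / elapsed, split to avoid overflow. */
		w->us_per_sec = used / elapsed * US_PER_SEC +
		    used % elapsed * US_PER_SEC / elapsed;
		w->window_start_total_us = w->total_us;
		w->window_start_us = now_us;
	}
}

/* 1000000 us of cputime per second is 100%; several cpus can exceed that. */
uint64_t
rt_percent(uint64_t us_per_sec)
{
	return (us_per_sec / (US_PER_SEC / 100));
}
```

A userland consumer of racct.rt.uspersec could apply the same conversion as rt_percent(): a value of 600000 corresponds to 60%, and 2000000 to 200%, that is, two cpus fully busy.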
What may be done in the future:
Replace td_incruntime/ruxagg()/racct CPU and PCPU logic with this system.
Get rid of synchronous single-threaded 1-second step-by-step racctd()
and do all cputime accounting and related throttling asynchronously in
realtime as much as possible.
Global RACCT lock usage may be reduced.