
Allow realtime and precise accounting of cpu utilization for threads and racct-objects.
Needs Review, Public

Authored by firk_cantconnect.ru on Mar 22 2022, 5:36 PM.

Details

Reviewers
trasz
Summary

What are the problems with the existing accounting?

  1. sched_pctcpu(), based on hardclock tick counting, can't reliably account for time slices shorter than 1 ms (with the usual hardclock hz=1000). Such slices occur very often in any application that does more than 100% cpu-bound math, because it waits for external events. As a result, sched_pctcpu()-based stats can't be relied on as precise values. For example, the top(1) utility has its own cpu% calculator (based on runtime differences) instead of using the ki_pctcpu value from the kernel because of this. The only reliable way to calculate %cpu is to base it on cpu ticks, not on any other ticks.
  2. The kern_racct.c logic for the "pcpu" stat was built on top of the already imprecise sched_pctcpu(), with two more imprecise hacks added (a custom formula for the first 3 seconds of process time, and add-and-decay logic to aggregate the per-jail stat). All this leads to things like 4000% jail cpu usage on a single-core CPU (make buildkernel -j4 in a jail in my tests) or, the opposite, 30% cpu usage when the cpu is really loaded at 100% (dd if=/dev/zero of=/dev/null).
  3. The current racct implementation makes wide use of the global RACCT lock, which hurts scalability.
  4. It is not realtime. In fact, all cputime-related actions are done only on demand, every 1 sec.
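
The first problem can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not kernel code: struct workload, is_running(), sampled_pct() and precise_pct() are hypothetical names, and the thread is assumed to wake at a fixed offset after each clock tick with wake_us + run_us <= period_us.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: a thread wakes up every period_us microseconds,
 * wake_us after the start of the period, runs for run_us and sleeps again.
 */
struct workload {
	uint32_t period_us;	/* wakeup period */
	uint32_t wake_us;	/* wakeup offset inside the period */
	uint32_t run_us;	/* cpu time consumed per wakeup */
};

/* Is the modelled thread on cpu at absolute time t_us? */
static int
is_running(const struct workload *w, uint64_t t_us)
{
	uint32_t phase = (uint32_t)(t_us % w->period_us);

	return (phase >= w->wake_us && phase < w->wake_us + w->run_us);
}

/* cpu% in hundredths of a percent, sampling the thread once per tick_us. */
static uint32_t
sampled_pct(const struct workload *w, uint32_t tick_us, uint64_t total_us)
{
	uint64_t t, hits = 0, ticks = 0;

	assert(tick_us > 0 && total_us >= tick_us);
	for (t = 0; t < total_us; t += tick_us) {
		ticks++;
		hits += is_running(w, t);
	}
	return ((uint32_t)(hits * 10000 / ticks));
}

/* cpu% in hundredths of a percent, summing the time actually spent on cpu. */
static uint32_t
precise_pct(const struct workload *w, uint64_t total_us)
{
	uint64_t t, run = 0;

	assert(total_us > 0);
	for (t = 0; t < total_us; t++)
		run += is_running(w, t);
	return ((uint32_t)(run * 10000 / total_us));
}
```

In this model a thread that runs for 300 us out of every 1000 us, starting 50 us after each tick, is never seen by a 1000 us sampler, while the same thread starting exactly on the tick is charged 100%. Summing the real run time gives 30% in both cases.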

There is one possible problem with the proposed accounting system:

It may cause additional cpu load. I don't think the effect will be
noticeable, but I haven't done precise tests for this yet. The two places
where the additional load may occur are statclock() and thread switching.
What is added there is a call to racct_rt_add_thread_runtime(), which I've
tried to make as light as possible.

What was done:

  1. struct racct now has additional "realtime" fields, with a spinlock for them:
     r_runtime is the total cpu ticks elapsed
     r_rtpersec is the quickly-averaged cpu ticks per last second
     r_ltruntime, r_rtlastticks - helpers to calculate r_rtpersec
  2. there is a new proc spinlock PROC_RTLOCK (p_rtmtx)
  3. struct thread now has additional fields, protected by PROC_RTLOCK(td_proc): td_rtpersec, td_ltruntime, td_rtlastticks
  4. (updated) td->td_ucred is used without locks for realtime access to the cred->racct structures:

td->td_ucred->cr_ruidinfo->ui_racct
td->td_ucred->cr_prison->pr_prison_racct->prr_racct
td->td_ucred->cr_loginclass->lc_racct
(but see bug #262633)

struct ucred itself is already immutable for the fields in question:
struct uidinfo *
struct prison *
struct loginclass *

  5. config variables:
     RACCT_RT - enable cputime accounting
     RACCT_RT_PCTCPU - also enable realtime cpu% calculation
  6. how it is all reported to userspace:

    When racct is enabled, kinfo_proc->ki_pctcpu (both for procs and threads) is filled from here, not from sched_pctcpu()

    jail_get can now report two additional variables: racct.rt.us and racct.rt.uspersec, both uint64_t, containing the total cputime (us) consumed by the jail and the quickly-averaged cputime per last second
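
To make the role of the helper fields easier to picture, here is a userspace mock-up of how a per-second average could be maintained. The field names follow the list above, but the update rule, struct rt_acct, rt_add_runtime() and rt_pct() are assumptions made for illustration. They are not taken from the patch, and the sketch ignores locking and multiplication overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Mock-up of the realtime fields; only the field names come from the summary. */
struct rt_acct {
	uint64_t r_runtime;	/* total cpu ticks consumed */
	uint64_t r_rtpersec;	/* averaged cpu ticks per last second */
	uint64_t r_ltruntime;	/* assumed: r_runtime at the last average update */
	uint64_t r_rtlastticks;	/* assumed: timestamp of the last average update */
};

/*
 * Assumed update rule: always add the delta to the total, and once at least
 * one second worth of ticks has passed since the last update, recompute the
 * per-second rate from the runtime gained over that interval.
 */
static void
rt_add_runtime(struct rt_acct *a, uint64_t delta, uint64_t now,
    uint64_t tickrate)
{
	uint64_t elapsed;

	assert(tickrate > 0 && now >= a->r_rtlastticks);
	a->r_runtime += delta;
	elapsed = now - a->r_rtlastticks;
	if (elapsed >= tickrate) {
		a->r_rtpersec =
		    (a->r_runtime - a->r_ltruntime) * tickrate / elapsed;
		a->r_ltruntime = a->r_runtime;
		a->r_rtlastticks = now;
	}
}

/* Convert the averaged rate to cpu% in hundredths of a percent. */
static uint32_t
rt_pct(const struct rt_acct *a, uint64_t tickrate)
{
	assert(tickrate > 0);
	return ((uint32_t)(a->r_rtpersec * 10000 / tickrate));
}
```

With a tick rate of 1000000 per second, two additions of 250000 ticks within one second give r_rtpersec = 500000, which is 50.00%. A value such as racct.rt.uspersec could be converted to a percentage the same way, with 1000000 us as the divisor.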

What may be done in future:

Replace td_incruntime/ruxagg()/racct CPU and PCPU logic with this system.

Get rid of synchronous single-threaded 1-second step-by-step racctd()
and do all cputime accounting and related throttling asynchronously in
realtime as much as possible.

Global RACCT lock usage may be reduced.

Test Plan

I made a utility to show all affected/related values in one table.

https://firk.cantconnect.ru/projects/jtop/jtop.tgz

Command line (to be called on large display because of wide table, I'm using about 150x40):
./jtop +jail +proc +loop +esc +ext sort=rpct -kern

First table - the jail list (on a test machine there will most likely be only one test jail)

SUMCPU% - cpu% usage manually summed from kinfo_proc->ki_pctcpu values
ELAPSED, CPU% - elapsed seconds and cpu% usage reported by the new accounting system (-ext to turn off)
RA.CPU, RA.PCPU - elapsed seconds and cpu% usage reported by the existing rctl_get_racct()

Second table - the process list (there won't be many processes on a test system, but the list is limited to 20 anyway)

OLDCPU% - cpu% usage from sched_pctcpu() (was in kinfo_proc->ki_pctcpu, now moved to ki_sparelong[1] by a hack)
KICPU% - cpu% usage reported by the new accounting system in kinfo_proc->ki_pctcpu
RCPU% - cpu% usage counted manually by comparing previous and current elapsed time and dividing by loop interval
ELAPSED - elapsed seconds from kinfo_proc->ki_runtime
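
The RCPU% calculation described above comes down to one division. A minimal sketch, assuming both samples are runtimes in microseconds (as in ki_runtime) and that the loop interval is measured in microseconds too; rcpu_pct() is a hypothetical name and is not taken from jtop:

```c
#include <assert.h>
#include <stdint.h>

/*
 * cpu% in hundredths of a percent from two runtime samples taken
 * interval_us microseconds apart.
 */
static uint32_t
rcpu_pct(uint64_t prev_runtime_us, uint64_t cur_runtime_us,
    uint64_t interval_us)
{
	assert(interval_us > 0 && cur_runtime_us >= prev_runtime_us);
	return ((uint32_t)((cur_runtime_us - prev_runtime_us) * 10000 /
	    interval_us));
}
```

Because the real time between two samples is rarely exactly the nominal interval, a value computed this way can slightly exceed 100% for a fully cpu-bound process.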

On an unpatched system, the command line should be
./jtop +jail +proc +loop +esc -ext sort=rpct -kern
ELAPSED and CPU% in the jail list will be zero
OLDCPU% in the process list will be zero
KICPU% in the process list will come from sched_pctcpu()

My tests:


What I've seen on single-core VM shortly after new 'make buildworld'

JID	NAME		SUMCPU%	ELAPSED		CPU%	RA.CPU	RA.PCPU	HOSTNAME			IP4ADDRS	PATH
1	"1"             1.17%	  32.129	96.76%	56	11090%	x                               127.0.0.1	"/"

So, SUMCPU% is low because most of the cputime-consuming processes are short-lived and so are not visible here.
ELAPSED and CPU% look like true data.
RA.CPU is nearly 2x more than the real value for some reason.
RA.PCPU is 11090%, which is nonsense for a single-core system.


What I've seen after 20 seconds of 'dd if=/dev/zero of=/dev/null'

JID	NAME		SUMCPU%	ELAPSED		CPU%	RA.CPU	RA.PCPU	HOSTNAME			IP4ADDRS	PATH
2	"2"             99.75%	  20.251	99.78%	19	83%	x                               127.0.0.1	"/"
FL	PID	PPID	PGID	TPGID	SID	TSID	JID	OLDCPU%	KICPU%	RCPU%	ELAPSED		COMMAND
  JM   	40745	40200	40745	40745	39615	39615	2	88.96%	99.75%	101.17%	  20.219	"dd" 0

Here everything is nearly fine, but the sched_pctcpu()-sourced OLDCPU% for the process and RA.PCPU for the jail are not accurate.
After 10 seconds (that is, 10 seconds before this snapshot) it was worse.


Test program ("network daemon" waiting for external events and doing some idle task):

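#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <sys/types.h>
#include <sys/select.h>

/* usage: test-select [timeout_us [loops]]; link with -lm */
int main(int argc, char * * argv) {
  fd_set fds;
  struct timeval tv;
  int to, sl, j, k;

  to = (argc>=2)?atoi(argv[1]):1000; /* select() timeout, microseconds */
  sl = (argc>=3)?atoi(argv[2]):0;    /* busy-loop iterations per wakeup */
  if(to<0) to = 0;
  k = 0;
  for(;;) {
    /* wait for input on stdin or for the timeout, whichever comes first */
    FD_ZERO(&fds);
    FD_SET(0, &fds);
    tv.tv_sec = to/1000000;  /* keep tv_usec below one second */
    tv.tv_usec = to%1000000;
    select(1, &fds, NULL, NULL, &tv);
    /* the "idle task": a little cpu-bound work after each wakeup */
    for(j=0; j<sl; j++) k = (int)sqrt(12345+j+k);
  }
  return k;
}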

after about 1 minute of './test-select 1000 0' (kern.hz switched from VM-default 100 to 1000)

JID	NAME		SUMCPU%	ELAPSED		CPU%	RA.CPU	RA.PCPU	HOSTNAME			IP4ADDRS	PATH
1	"1"             2.49%	   2.017	2.50%	1	1%	x                               127.0.0.1	"/"
FL	PID	PPID	PGID	TPGID	SID	TSID	JID	OLDCPU%	KICPU%	RCPU%	ELAPSED		COMMAND
  JM   	95062	95017	95062	95017	93900	93900	1	1.46%	2.49%	2.50%	   1.974	"test-select" 0

sched_pctcpu() reports about half of the real cpu usage
(on another system it was 0.34% real and a flat 0.00% from sched_pctcpu(), but I can't reproduce that on the test VM)

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

firk_cantconnect.ru edited the test plan for this revision.
firk_cantconnect.ru edited the test plan for this revision.

I can't comment on time keeping, but I have some other stuff.

head/sys/kern/kern_racct.c
1437

given that this is only called for td == curthread, you can use td->td_ucred and avoid dereferencing p for creds. this also avoids adding extra locking to cred management code. note that the per-thread cred pointer gets updated on each kernel entry, so this can only become "inaccurate" if another thread races with setuid or something similar. but even then some of the time was spent on the td_ucred's terms (so to speak), so blindly adding it to possibly different creds in the proc is not fully accurate either. tl;dr drop the set/unset cred locking and use td_ucred here

head/sys/kern/kern_resource.c
886

use RACCT_ENABLED() instead

firk_cantconnect.ru added inline comments.
head/sys/kern/kern_racct.c
1437

Yes. While it was completely safe to call it for td != curthread, that
never actually happened. So I'm getting rid of the p->p_ucred usage. Using
td->td_ucred for now, but I'll look at td_realucred later (probably not in
this revision, because it depends on other code).

Thanks.

I'll prod someone time-related to have a look at the rest of the patch.

head/sys/sys/proc.h
630

this comment needs to be restored

Found a race between sys_exit() -> exit1() -> thread_exit() and sys_wait() -> proc_reap().
Moved PROC_SUNLOCK() slightly down in thread_exit() to protect the RACCT_RT calculations from proc_reap() destroying p->p_racct.

@trasz could you please say something about this?