Allow realtime and precise accounting of cpu utilization for threads and racct-objects.
Needs Review · Public

Authored by firk_cantconnect.ru on Mar 22 2022, 5:36 PM.

Details

Reviewers

Summary

What are the problems with the existing accounting?

sched_pctcpu(), based on hardclock tick counting, can't reliably account for
time slices shorter than 1ms (with the usual hardclock hz=1000), while such time
slices happen very often in any application that does more than 100% cpu-bound
math, because it waits for external events. As a result, sched_pctcpu()-based
stats can't be relied on for precision.
For example, the top(1) utility has its own cpu% calculator (based on runtime
diff math) instead of using the ki_pctcpu value from the kernel because of this.
The only reliable way to calculate %cpu is to base it on cpu ticks, not on any
other ticks.

kern_racct.c logic for the "pcpu" stat was built on top of the already imprecise
sched_pctcpu(), with two more imprecise hacks added (a custom formula for the
first 3 seconds of process time, and add-and-decay logic to aggregate the
per-jail stat).
All this leads to things like 4000% jail cpu usage on a single-core CPU
(make buildkernel -j4 in a jail in my tests) or, the opposite, 30% cpu usage
when the cpu is really loaded at 100% (dd if=/dev/zero of=/dev/null).

The current racct implementation makes wide use of the global RACCT lock, which
hurts scalability.

It is not realtime. In fact, all cputime-related actions are only done on demand,
every 1 sec.
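
To illustrate the first problem above (tick counting vs. cpu-tick accounting), here is a minimal userspace sketch. It is not code from the patch: the workload model, the struct and all function names are hypothetical. It models a thread that runs for a short slice once per period and compares what a periodic sampling tick sees with the exact runtime sum.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical workload model (an assumption, not from the patch):
 * every period_us the thread becomes runnable at offset start_us,
 * runs for run_us, then sleeps until the next period.
 */
struct workload {
	uint64_t period_us;	/* wakeup period */
	uint64_t start_us;	/* offset of the run inside the period */
	uint64_t run_us;	/* length of each run */
};

/* Is the thread on the cpu at time t_us? */
static int
running_at(const struct workload *w, uint64_t t_us)
{
	uint64_t phase = t_us % w->period_us;

	return (phase >= w->start_us && phase < w->start_us + w->run_us);
}

/*
 * Tick-based estimate: a tick fires at every multiple of tick_us and
 * charges the whole tick to the thread if it happens to be running.
 * Result is in 1/100 of a percent.
 */
static uint64_t
sampled_pct100(const struct workload *w, uint64_t tick_us, uint64_t total_us)
{
	uint64_t hits = 0, ticks = 0, t;

	assert(tick_us > 0 && total_us >= tick_us);
	for (t = 0; t < total_us; t += tick_us) {
		ticks++;
		hits += running_at(w, t);
	}
	return (hits * 10000 / ticks);
}

/*
 * Exact estimate: sum the real length of every run, which is what
 * switch-time accounting based on cpu ticks gives.
 * Result is in 1/100 of a percent.
 */
static uint64_t
exact_pct100(const struct workload *w, uint64_t total_us)
{
	assert(w->start_us + w->run_us <= w->period_us);
	assert(total_us > 0 && total_us % w->period_us == 0);
	return ((total_us / w->period_us) * w->run_us * 10000 / total_us);
}
```

With a 1000us tick and a thread that runs 250us out of every 1000us, the exact result is 25%, while the sampled result is 0% if the run starts 100us after each tick and 100% if it starts exactly on the tick. The sampled value depends on the phase of the wakeups, not on the real load.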

There is one possible problem with the proposed accounting system:

It may cause additional cpu load. I don't think this effect will be
noticeable, but I haven't done precise tests for this yet. The two places
where the additional load may occur are statclock() and thread switching;
the added work is a call to racct_rt_add_thread_runtime(), which I've tried to
make as light as possible.

What was done:

struct racct now has additional "realtime" fields, with a spinlock protecting them:
r_runtime - total cpu ticks elapsed
r_rtpersec - quickly-averaged cpu ticks per last second
r_ltruntime, r_rtlastticks - helpers to calculate r_rtpersec

there is a new proc spinlock, PROC_RTLOCK (p_rtmtx)

struct thread now has additional fields, protected by PROC_RTLOCK(td_proc): td_rtpersec, td_ltruntime, td_rtlastticks

(updated) td->td_ucred is used without locks for realtime access to the cred->racct structures:

td->td_ucred->cr_ruidinfo->ui_racct
td->td_ucred->cr_prison->pr_prison_racct->prr_racct
td->td_ucred->cr_loginclass->lc_racct
(but see bug #262633)

struct ucred itself is already immutable for the fields in question:
struct uidinfo *
struct prison *
struct loginclass *

config variables:
RACCT_RT - enable cputime accounting
RACCT_RT_PCTCPU - also enable realtime cpu% calculation
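
As a rough illustration of how the helper fields listed above (r_ltruntime, r_rtlastticks) could produce r_rtpersec, here is a userspace sketch. The field names come from the summary, but the struct name, the function and the update rule itself are assumptions made for illustration; the actual patch may average differently, and the spinlock that protects these fields is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Field names mirror the summary; everything else is hypothetical. */
struct rt_acct {
	uint64_t r_runtime;	/* total cpu ticks consumed */
	uint64_t r_rtpersec;	/* cpu ticks consumed per second, last window */
	uint64_t r_ltruntime;	/* r_runtime when the current window started */
	uint64_t r_rtlastticks;	/* timestamp (cpu ticks) of the window start */
};

/*
 * Charge `runtime` cpu ticks at time `now` (also in cpu ticks).
 * Once at least one second (tickrate ticks) has passed since the window
 * started, publish the rate for that window and start a new one.
 * No overflow protection: fine for windows of a few seconds only.
 */
static void
rt_acct_add(struct rt_acct *r, uint64_t now, uint64_t runtime,
    uint64_t tickrate)
{
	uint64_t elapsed;

	assert(tickrate > 0 && now >= r->r_rtlastticks);
	r->r_runtime += runtime;
	elapsed = now - r->r_rtlastticks;
	if (elapsed >= tickrate) {
		/* Scale to one second in case the window ran long. */
		r->r_rtpersec =
		    (r->r_runtime - r->r_ltruntime) * tickrate / elapsed;
		r->r_ltruntime = r->r_runtime;
		r->r_rtlastticks = now;
	}
}
```

The point of the sketch is that the per-second value is derived from the same cpu-tick total that is charged at every switch, so no separate sampling source is involved.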

how it is all reported to userspace:

When racct is enabled, kinfo_proc->ki_pctcpu (both for procs and threads) is filled from here, not from sched_pctcpu()

jail_get now can report two additional variables, racct.rt.us and racct.rt.uspersec, both uint64_t, containing the total cputime consumed by the jail (in us) and the quickly-averaged cputime per last second
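
For a consumer of these two values, the conversion to a seconds total and a cpu% is plain arithmetic. A small sketch follows; the function names are hypothetical and it is assumed that the values have already been fetched with jail_get.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a cputime total given in microseconds (racct.rt.us) as seconds
 * with millisecond precision, e.g. 1500000 -> "1.500".
 * Returns what snprintf() returns.
 */
static int
rt_format_seconds(char *buf, size_t len, uint64_t us)
{
	assert(buf != NULL && len > 0);
	return (snprintf(buf, len, "%llu.%03llu",
	    (unsigned long long)(us / 1000000),
	    (unsigned long long)(us % 1000000 / 1000)));
}

/*
 * Convert cputime per second in microseconds (racct.rt.uspersec) to cpu%
 * in 1/100 of a percent: 1000000 us of cpu per second is 100.00%.
 * The result may exceed 10000 for a jail that uses more than one cpu.
 */
static uint64_t
rt_pct100(uint64_t uspersec)
{
	return (uspersec / 100);
}
```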

What may be done in the future:

Replace td_incruntime/ruxagg()/racct CPU and PCPU logic with this system.

Get rid of synchronous single-threaded 1-second step-by-step racctd()
and do all cputime accounting and related throttling asynchronously in
realtime as much as possible.

Global RACCT lock usage may be reduced.

Test Plan

I made a utility to show all affected/related values in one table.

https://firk.cantconnect.ru/projects/jtop/jtop.tgz

Command line (run it on a large display because the table is wide; I'm using about 150x40):
./jtop +jail +proc +loop +esc +ext sort=rpct -kern

First table - jails list (on test machine, most likely there will be only one testing jail)

SUMCPU% - cpu% usage manually summed from kinfo_proc->ki_pctcpu values
ELAPSED, CPU% - elapsed seconds and cpu% usage reported by the new accounting system (-ext to turn off)
RA.CPU, RA.PCPU - elapsed seconds and cpu% usage reported by the existing rctl_get_racct()

Second table - process list (there will not be many on a testing system, but it is limited to 20 anyway)

OLDCPU% - cpu% usage from sched_pctcpu() (was in kinfo_proc->ki_pctcpu, now moved to ki_sparelong[1] by a hack)
KICPU% - cpu% usage reported by the new accounting system in kinfo_proc->ki_pctcpu
RCPU% - cpu% usage counted manually by comparing previous and current elapsed time and dividing by loop interval
ELAPSED - elapsed seconds from kinfo_proc->ki_runtime
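
The RCPU% column described above is the usual runtime-diff calculation. A minimal sketch of it, with a hypothetical function name and microsecond units assumed for all arguments:

```c
#include <assert.h>
#include <stdint.h>

/*
 * cpu% (in 1/100 of a percent) from two samples of a process's
 * accumulated runtime taken interval_us apart.
 */
static uint64_t
rcpu_pct100(uint64_t prev_runtime_us, uint64_t cur_runtime_us,
    uint64_t interval_us)
{
	assert(interval_us > 0 && cur_runtime_us >= prev_runtime_us);
	return ((cur_runtime_us - prev_runtime_us) * 10000 / interval_us);
}
```

If the nominal loop interval is used as the divisor while the real time between the two samples is slightly longer, the result can come out a little above 100% for a fully cpu-bound process.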

On an unpatched system, the command line should be
./jtop +jail +proc +loop +esc -ext sort=rpct -kern
ELAPSED and CPU% in jails list will be zero
OLDCPU% in process list will be zero
KICPU% in process list will be from sched_pctcpu()

My tests:

What I've seen on a single-core VM shortly after a new 'make buildworld'

JID	NAME		SUMCPU%	ELAPSED		CPU%	RA.CPU	RA.PCPU	HOSTNAME			IP4ADDRS	PATH
1	"1"             1.17%	  32.129	96.76%	56	11090%	x                               127.0.0.1	"/"

So, SUMCPU is low because most cputime-consuming processes are short-lived and so not visible here
ELAPSED and CPU% fields look like true data
RA.CPU is 2x more than the real value for some reason
RA.PCPU is 11090%, which is nonsense for a single-core system

What I've seen after 20 seconds of 'dd if=/dev/zero of=/dev/null'

JID	NAME		SUMCPU%	ELAPSED		CPU%	RA.CPU	RA.PCPU	HOSTNAME			IP4ADDRS	PATH
2	"2"             99.75%	  20.251	99.78%	19	83%	x                               127.0.0.1	"/"
FL	PID	PPID	PGID	TPGID	SID	TSID	JID	OLDCPU%	KICPU%	RCPU%	ELAPSED		COMMAND
  JM   	40745	40200	40745	40745	39615	39615	2	88.96%	99.75%	101.17%	  20.219	"dd" 0

Here everything is nearly fine, but the sched_pctcpu()-sourced OLDCPU% for the process and RA.PCPU for the jail are not accurate.
10 seconds into the run (so, 10 sec before this snapshot) it was worse.

Test program ("network daemon" waiting for external events and doing some idle task):

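#include <stdlib.h>
#include <unistd.h>
#include <math.h>
#include <sys/types.h>
#include <sys/select.h>

/*
 * Usage: test-select [timeout_us [spin_iterations]]
 * Sleeps in select() for timeout_us, then burns a little cpu, forever.
 * Link with -lm.
 */
int main(int argc, char **argv) {
  fd_set fds;
  struct timeval tv;
  int to, sl, j, k;

  to = (argc>=2)?atoi(argv[1]):1000;  /* select() timeout, microseconds */
  sl = (argc>=3)?atoi(argv[2]):0;     /* busy-loop iterations per wakeup */
  k = 0;
  for(;;) {
    /* select() may modify both fds and tv, so set them up every time */
    FD_ZERO(&fds);
    FD_SET(0, &fds);
    tv.tv_sec = to / 1000000;
    tv.tv_usec = to % 1000000;
    select(1, &fds, NULL, NULL, &tv);
    for(j=0; j<sl; j++) k = (int)sqrt(12345+j+k);
  }
  return k;
}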

after about 1 minute of './test-select 1000 0' (kern.hz switched from VM-default 100 to 1000)

JID	NAME		SUMCPU%	ELAPSED		CPU%	RA.CPU	RA.PCPU	HOSTNAME			IP4ADDRS	PATH
1	"1"             2.49%	   2.017	2.50%	1	1%	x                               127.0.0.1	"/"
FL	PID	PPID	PGID	TPGID	SID	TSID	JID	OLDCPU%	KICPU%	RCPU%	ELAPSED		COMMAND
  JM   	95062	95017	95062	95017	93900	93900	1	1.46%	2.49%	2.50%	   1.974	"test-select" 0

sched_pctcpu() reports about half of the real cpu usage
(on another system it was 0.34% real and a flat 0.00% from sched_pctcpu(), but I can't reproduce that on the test VM)

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

firk_cantconnect.ru created this revision. Mar 22 2022, 5:36 PM

Herald added a subscriber: imp. Mar 22 2022, 5:36 PM

firk_cantconnect.ru requested review of this revision. Mar 22 2022, 5:36 PM

firk_cantconnect.ru edited the summary of this revision. Mar 22 2022, 5:38 PM

firk_cantconnect.ru edited the test plan for this revision.

I can't comment on time keeping, but I have some other stuff.

head/sys/kern/kern_racct.c
1437	given that this is only called for td == curthread, you can use td->td_ucred and avoid dereferencing p for creds. this also avoids adding extra locking to cred management code. note that per-thread cred pointer gets updated on each kernel entry, so this can only become "inaccurate" if another thread races with setuid or something similar. but even then some of the time was done on the td_ucred's terms (so to speak), so blindly adding them to possibly different creds in the proc is not fully accurate either. tl;dr drop the set/unset cred locking and use td_ucred here
head/sys/kern/kern_resource.c
886	use RACCT_ENABLED() instead

firk_cantconnect.ru updated this revision to Diff 104134. Mar 23 2022, 11:19 PM

firk_cantconnect.ru marked 2 inline comments as done. Mar 23 2022, 11:22 PM

firk_cantconnect.ru added inline comments.

head/sys/kern/kern_racct.c
1437	Yes. It was completely safe to call it for td != curthread, but that never actually happened. So I'm getting rid of the p->p_ucred usage. Using td->td_ucred for now, but I'll look at td_realucred later (apparently not in this revision, because it depends on other code).

firk_cantconnect.ru edited the summary of this revision. Mar 23 2022, 11:24 PM

Thanks.

I'll prod someone time-related to have a look at the rest of the patch.

head/sys/sys/proc.h
630	this comment needs to be restored

firk_cantconnect.ru marked 2 inline comments as done. Mar 25 2022, 8:52 AM

firk_cantconnect.ru updated this revision to Diff 104182. Mar 25 2022, 8:56 AM

allanjude added a reviewer: trasz. Mar 25 2022, 1:17 PM

allanjude added a subscriber: phk.

afedorov added a subscriber: afedorov. Mar 25 2022, 2:14 PM

Found a race between sys_exit() -> exit1() -> thread_exit() and sys_wait() -> proc_reap().
Moved PROC_SUNLOCK() slightly down in thread_exit() to protect the RACCT_RT calculations from proc_reap() destroying p->p_racct.

crest_freebsd_rlwinm.de added a subscriber: crest_freebsd_rlwinm.de. Mar 30 2022, 12:12 PM

@trasz could you please say something about this?

Revision Contents
Changeset List

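Path	Size
head/sys/amd64/conf/GENERIC	2 lines
head/sys/conf/options	2 lines
head/sys/kern/…	5 lines
head/sys/kern/…	15 lines
head/sys/kern/…	24 lines
head/sys/kern/…	169 lines
head/sys/kern/…	8 lines
head/sys/kern/…	5 lines
head/sys/kern/…	28 lines
head/sys/sys/proc.h	10 lines
head/sys/sys/racct.h	24 lines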

Diff 104360

head/sys/amd64/conf/GENERIC

(82 lines not shown)
 options CAPABILITIES # Capsicum capabilities
 options MAC # TrustedBSD MAC Framework
 options KDTRACE_FRAME # Ensure frames are compiled in
 options KDTRACE_HOOKS # Kernel DTrace hooks
 options DDB_CTF # Kernel ELF linker loads CTF data
 options INCLUDE_CONFIG_FILE # Include this file in kernel
 options RACCT # Resource accounting framework
 options RACCT_DEFAULT_TO_DISABLED # Set kern.racct.enable=0 by default
+options RACCT_RT # Realtime cputime calc for all objects
+options RACCT_RT_PCTCPU # Also realtime %cpu calc for all objects
 options RCTL # Resource limits

 # Debugging support. Always need this:
 options KDB # Enable kernel debugger support.
 options KDB_TRACE # Print a stack trace for a panic.
 # For full debugger support use (turn off in stable branch):
 options BUF_TRACKING # Track buffer history
 options DDB # Support DDB.
(192 lines not shown)

head/sys/conf/options

(192 lines not shown)
 SDP_DEBUG opt_ofed.h
 IPOIB opt_ofed.h
 IPOIB_DEBUG opt_ofed.h
 IPOIB_CM opt_ofed.h

 # Resource Accounting
 RACCT opt_global.h
 RACCT_DEFAULT_TO_DISABLED opt_global.h
+RACCT_RT opt_global.h
+RACCT_RT_PCTCPU opt_global.h

 # Resource Limits
 RCTL opt_global.h

 # Random number generator(s)
 # Alternative RNG algorithm.
 RANDOM_FENESTRASX opt_global.h
 # With this, no entropy processor is loaded, but the entropy
(61 lines not shown)

head/sys/kern/kern_clock.c

(52 lines not shown)
 #include <sys/gtaskqueue.h>
 #include <sys/kdb.h>
 #include <sys/kernel.h>
 #include <sys/kthread.h>
 #include <sys/ktr.h>
 #include <sys/lock.h>
 #include <sys/mutex.h>
 #include <sys/proc.h>
+#include <sys/racct.h>
 #include <sys/resource.h>
 #include <sys/resourcevar.h>
 #include <sys/sched.h>
 #include <sys/sdt.h>
 #include <sys/signalvar.h>
 #include <sys/sleepqueue.h>
 #include <sys/smp.h>
 #include <vm/vm.h>
(384 lines not shown)
 	/*
 	 * Compute the amount of time during which the current
 	 * thread was running, and add that to its total so far.
 	 */
 	new_switchtime = cpu_ticks();
 	runtime = new_switchtime - PCPU_GET(switchtime);
 	td->td_runtime += runtime;
 	td->td_incruntime += runtime;
+#if defined(RACCT) && defined(RACCT_RT)
+	if (RACCT_ENABLED())
+		racct_rt_add_thread_runtime(td, runtime);
+#endif
 	PCPU_SET(switchtime, new_switchtime);

 	sched_clock(td, cnt);
 	thread_unlock(td);
 #ifdef HWPMC_HOOKS
 	if (td->td_intr_frame != NULL)
 		PMC_SOFT_CALL_TF( , , clock, stat, td->td_intr_frame);
 #endif
(113 lines not shown)

head/sys/kern/kern_jail.c

(192 lines not shown)
 	struct bool_flags *bf;
 	struct jailsys_flags *jsf;
 	struct prison *pr, *mypr;
 	struct vfsopt *opt;
 	struct vfsoptlist *opts;
 	char *errmsg, *name;
 	int drflags, error, errmsg_len, errmsg_pos, i, jid, len, pos;
 	unsigned f;
+	uint64_t r_us, r_uspersec;

 	if (flags & ~JAIL_GET_MASK)
 		return (EINVAL);

 	/* Get the parameter list. */
 	error = vfs_buildopts(optuio, &opts);
 	if (error)
 		return (error);
(194 lines not shown)
 		goto done;
 	error = vfs_setopt(opts, "osreldate", &pr->pr_osreldate,
 	    sizeof(pr->pr_osreldate));
 	if (error != 0 && error != ENOENT)
 		goto done;
 	error = vfs_setopts(opts, "osrelease", pr->pr_osrelease);
 	if (error != 0 && error != ENOENT)
 		goto done;
+#if defined(RACCT) && defined(RACCT_RT)
+	if (RACCT_ENABLED())
+		racct_rt_get_runtime(pr->pr_prison_racct->prr_racct, &r_us,
+		    &r_uspersec, NULL);
+	else
+#endif
+		r_us = r_uspersec = 0;
+	error = vfs_setopt(opts, "racct.rt.us", &r_us, sizeof(r_us));
+	if (error != 0 && error != ENOENT)
+		goto done;
+	error = vfs_setopt(opts, "racct.rt.uspersec", &r_uspersec,
+	    sizeof(r_uspersec));
+	if (error != 0 && error != ENOENT)
+		goto done;

 	/* Get the module parameters. */
 	mtx_unlock(&pr->pr_mtx);
 	drflags &= ~PD_LOCKED;
 	error = osd_jail_call(pr, PR_METHOD_GET, opts);
 	if (error)
 		goto done;
 	prison_deref(pr, drflags);
(192 lines not shown)