Details

Reviewers

Commits

rG2c3fb4cd5096: racct: Improve handling of the pcpu resource
rGc72188d85a79: racct: Improve handling of the pcpu resource

Summary

This is a draft revision. It adds a new method for calculating pctcpu. Rather than aggregating pctcpu of processes, each racct, including containers (jail, loginclass, user), calculates pctcpu based on the runtime and wallclock properties.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

cyril_freebsdfoundation.org requested review of this revision.Jun 23 2021, 4:12 PM

cyril_freebsdfoundation.org created this revision.

markj added inline comments.Jun 23 2021, 4:18 PM

sys/kern/kern_racct.c
1262	Oh, be careful here. Despite the indentation, this is now if (p->p_state == PRS_ZOMBIE) PROC_UNLOCK(p); which is not right.

This diff now calculates runtime correctly when processes exit. Also, timeval is used to calculate the elapsed time, rather than the wallclock value, which is meaningless for non-process raccts. However, pctcpu is not yet updated on process exit.

markj added inline comments.Jun 28 2021, 1:31 PM

sys/kern/kern_racct.c
585	I believe it is correct to perform the subtraction before calling cputick2usec(). It's a bit nicer to use only one call, especially since cputick2usec() contains a comparatively expensive division operation.
594	Have you tried removing this ifdef to re-enable the KASSERT? I'm curious to know if it trips.
1165	Why do you need this check?
sys/sys/racct.h
154	It would be worthwhile to add a few sentences to the comment above explaining these fields.

Addresses markj's comments. From testing, pctcpu can occasionally go slightly above the maximum cpu percentage, which I'm not yet sure if is a bug or expected. It can be clamped to the maximum cpu percentage if it is expected. It also is still not yet updated on proc_exit.

I did some testing and this gives much more reasonable numbers. As you note, with a buildkernel we typically get slightly less than 100*ncpu% utilization. One possible explanation was undersubscription because compiler threads will typically spend small amounts of time blocked waiting on I/O, either for data or filesystem metadata. (For an example of the latter case, any time a system call specifies a path /foo/bar/baz, the kernel has to resolve that to a vnode, which is the in-kernel filesystem-independent representation of a file. The kernel maintains a cache of recently and frequently accessed names, but upon a cache miss, the kernel has to traverse each component of the path to locate the on-disk inode, and this can require disk I/O.) Another partial explanation is that some amount of the kernel's work is done by dedicated kernel threads, and this time doesn't get accounted to the jail. For instance, when a disk completes some I/O request, the interrupt handler may schedule a kernel thread to do some cleanup work before waking up requesting thread.

I think I'm also happy with simply not updating %CPU when threads exit. In workloads where exits are frequent (builds are of course a good example), we'll just waste CPU cycles updating a counter for no good reason: exits will be frequent enough that precise updates are not useful, and we can't reasonably apply a %CPU limit to processes as they're exiting.

As a step towards getting this committed, I think you should start removing the #if 0'd code. I would still like to reread the computation of the average a few more times before committing, but I think this approach works well enough that we can start shedding the old code.

sys/kern/kern_racct.c
317–318	This check needs to be carried forward to avoid accounting idle thread time in the host. Really it can be done much more simply: if (TD_IS_IDLETHREAD(td)) continue; I'm not sure why it was written this way to begin with, probably just oversight. Idle threads live in a dedicated process (see idle_setup()). I think you can just change racctd to skip the process if it contains an idle thread. I would do it by adding code to the very beginning of racctd to store a pointer to `STAILQ_FIRST(&cpuhead)->pc_idlethread->td_proc` in a local variable. Then skip `p` if it's equal to that value.
1153	Assertions should come after local variable declarations. In general, local variable declarations always come first.
1161	A comment explaining this calculation would be warranted, I think.
sys/sys/racct.h
144–146	s/current/total/, I think. Or perhaps "accumulated".

Addresses style issues and removes some now-unused code. The code for calculating pcpu on process exit is still left in, albeit commented out in case we want to experiment with it any further. Finally, there were situations in the previous scheme where pcpu was explicitly set to 0. I believe this is no longer necessary as the runtime value should not increase, but should we carry over this behaviour to ensure that pcpu is exactly 0 in these cases?

markj added inline comments.Jul 6 2021, 3:04 PM

sys/kern/kern_racct.c
583	Comment style is either /* This comment fits on one line. / or / * Here is a somewhat longer comment, it needs to be broken up into two * lines. */
587	Extra newline after the closing brace.
594	Hmm, so `p_prev_runtime` is effectively unused now? Looks like we can just remove it.
930	Can this be removed?
1162	Same style nit here regarding comment style.
1178	The multiplication on the RHS can overflow if mp_ncpus > 4, all three terms are 32 bits wide. Casting `mp_ncpus` to uint64_t before the multiplication should solve that.

Addresses style issues. Also removes racct_propagate_pcpu, which was meant to update pcpu on process exit. Removes p_prev_runtime from the process struct, as it is no longer used. Also sets pcpu to zero on process exit (for zombie processes, carrying over behaviour from old pcpu scheme).