This is a draft revision. It adds a new method for calculating pctcpu. Rather than aggregating pctcpu of processes, each racct, including containers (jail, loginclass, user), calculates pctcpu based on the runtime and wallclock properties.
This diff now calculates runtime correctly when processes exit. Also, timeval is used to calculate the elapsed time, rather than the wallclock value, which is meaningless for non-process raccts. However, pctcpu is not yet updated on process exit.
I believe it is correct to perform the subtraction before calling cputick2usec(). It's a bit nicer to use only one call, especially since cputick2usec() contains a comparatively expensive division operation.
Have you tried removing this ifdef to re-enable the KASSERT? I'm curious to know if it trips.
Why do you need this check?
It would be worthwhile to add a few sentences to the comment above explaining these fields.
Addresses markj's comments. From testing, pctcpu can occasionally go slightly above the maximum cpu percentage, which I'm not yet sure if is a bug or expected. It can be clamped to the maximum cpu percentage if it is expected. It also is still not yet updated on proc_exit.
I did some testing and this gives much more reasonable numbers. As you note, with a buildkernel we typically get slightly less than 100*ncpu% utilization. One possible explanation was undersubscription because compiler threads will typically spend small amounts of time blocked waiting on I/O, either for data or filesystem metadata. (For an example of the latter case, any time a system call specifies a path /foo/bar/baz, the kernel has to resolve that to a vnode, which is the in-kernel filesystem-independent representation of a file. The kernel maintains a cache of recently and frequently accessed names, but upon a cache miss, the kernel has to traverse each component of the path to locate the on-disk inode, and this can require disk I/O.) Another partial explanation is that some amount of the kernel's work is done by dedicated kernel threads, and this time doesn't get accounted to the jail. For instance, when a disk completes some I/O request, the interrupt handler may schedule a kernel thread to do some cleanup work before waking up requesting thread.
I think I'm also happy with simply not updating %CPU when threads exit. In workloads where exits are frequent (builds are of course a good example), we'll just waste CPU cycles updating a counter for no good reason: exits will be frequent enough that precise updates are not useful, and we can't reasonably apply a %CPU limit to processes as they're exiting.
As a step towards getting this committed, I think you should start removing the #if 0'd code. I would still like to reread the computation of the average a few more times before committing, but I think this approach works well enough that we can start shedding the old code.
This check needs to be carried forward to avoid accounting idle thread time in the host. Really it can be done much more simply:
if (TD_IS_IDLETHREAD(td)) continue;
I'm not sure why it was written this way to begin with, probably just oversight.
Idle threads live in a dedicated process (see idle_setup()). I think you can just change racctd to skip the process if it contains an idle thread. I would do it by adding code to the very beginning of racctd to store a pointer to STAILQ_FIRST(&cpuhead)->pc_idlethread->td_proc in a local variable. Then skip p if it's equal to that value.
Assertions should come after local variable declarations. In general, local variable declarations always come first.
A comment explaining this calculation would be warranted, I think.
s/current/total/, I think. Or perhaps "accumulated".
Addresses style issues and removes some now-unused code. The code for calculating pcpu on process exit is still left in, albeit commented out in case we want to experiment with it any further. Finally, there were situations in the previous scheme where pcpu was explicitly set to 0. I believe this is no longer necessary as the runtime value should not increase, but should we carry over this behaviour to ensure that pcpu is exactly 0 in these cases?
Comment style is either
/* This comment fits on one line. */
/* * Here is a somewhat longer comment, it needs to be broken up into two * lines. */
Extra newline after the closing brace.
Hmm, so p_prev_runtime is effectively unused now? Looks like we can just remove it.
Can this be removed?
Same style nit here regarding comment style.
The multiplication on the RHS can overflow if mp_ncpus > 4, all three terms are 32 bits wide. Casting mp_ncpus to uint64_t before the multiplication should solve that.
Addresses style issues. Also removes racct_propagate_pcpu, which was meant to update pcpu on process exit. Removes p_prev_runtime from the process struct, as it is no longer used. Also sets pcpu to zero on process exit (for zombie processes, carrying over behaviour from old pcpu scheme).