Page MenuHomeFreeBSD

racct: New scheme for calculating pctcpu
Needs ReviewPublic

Authored by cyril_freebsdfoundation.org on Jun 23 2021, 4:12 PM.

Details

Reviewers
markj
Summary

This is a draft revision. It adds a new method for calculating pctcpu. Rather than aggregating pctcpu of processes, each racct, including containers (jail, loginclass, user), calculates pctcpu based on the runtime and wallclock properties.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

sys/kern/kern_racct.c
1264

Oh, be careful here. Despite the indentation, this is now

if (p->p_state == PRS_ZOMBIE)
    PROC_UNLOCK(p);

which is not right.

This diff now calculates runtime correctly when processes exit. Also, timeval is used to calculate the elapsed time, rather than the wallclock value, which is meaningless for non-process raccts. However, pctcpu is not yet updated on process exit.

sys/kern/kern_racct.c
592

I believe it is correct to perform the subtraction before calling cputick2usec(). It's a bit nicer to use only one call, especially since cputick2usec() contains a comparatively expensive division operation.

601

Have you tried removing this ifdef to re-enable the KASSERT? I'm curious to know if it trips.

1167

Why do you need this check?

sys/sys/racct.h
156

It would be worthwhile to add a few sentences to the comment above explaining these fields.

Addresses markj's comments. From testing, pctcpu can occasionally go slightly above the maximum cpu percentage, which I'm not yet sure if is a bug or expected. It can be clamped to the maximum cpu percentage if it is expected. It also is still not yet updated on proc_exit.

I did some testing and this gives much more reasonable numbers. As you note, with a buildkernel we typically get slightly less than 100*ncpu% utilization. One possible explanation was undersubscription because compiler threads will typically spend small amounts of time blocked waiting on I/O, either for data or filesystem metadata. (For an example of the latter case, any time a system call specifies a path /foo/bar/baz, the kernel has to resolve that to a vnode, which is the in-kernel filesystem-independent representation of a file. The kernel maintains a cache of recently and frequently accessed names, but upon a cache miss, the kernel has to traverse each component of the path to locate the on-disk inode, and this can require disk I/O.) Another partial explanation is that some amount of the kernel's work is done by dedicated kernel threads, and this time doesn't get accounted to the jail. For instance, when a disk completes some I/O request, the interrupt handler may schedule a kernel thread to do some cleanup work before waking up requesting thread.

I think I'm also happy with simply not updating %CPU when threads exit. In workloads where exits are frequent (builds are of course a good example), we'll just waste CPU cycles updating a counter for no good reason: exits will be frequent enough that precise updates are not useful, and we can't reasonably apply a %CPU limit to processes as they're exiting.

As a step towards getting this committed, I think you should start removing the #if 0'd code. I would still like to reread the computation of the average a few more times before committing, but I think this approach works well enough that we can start shedding the old code.

sys/kern/kern_racct.c
324–325

This check needs to be carried forward to avoid accounting idle thread time in the host. Really it can be done much more simply:

if (TD_IS_IDLETHREAD(td))
    continue;

I'm not sure why it was written this way to begin with, probably just oversight.

Idle threads live in a dedicated process (see idle_setup()). I think you can just change racctd to skip the process if it contains an idle thread. I would do it by adding code to the very beginning of racctd to store a pointer to STAILQ_FIRST(&cpuhead)->pc_idlethread->td_proc in a local variable. Then skip p if it's equal to that value.

1155

Assertions should come after local variable declarations. In general, local variable declarations always come first.

1163

A comment explaining this calculation would be warranted, I think.

sys/sys/racct.h
147–149

s/current/total/, I think. Or perhaps "accumulated".

Addresses style issues and removes some now-unused code. The code for calculating pcpu on process exit is still left in, albeit commented out in case we want to experiment with it any further. Finally, there were situations in the previous scheme where pcpu was explicitly set to 0. I believe this is no longer necessary as the runtime value should not increase, but should we carry over this behaviour to ensure that pcpu is exactly 0 in these cases?

sys/kern/kern_racct.c
590

Comment style is either

/* This comment fits on one line. */

or

/*
 * Here is a somewhat longer comment, it needs to be broken up into two
 * lines.
 */
594

Extra newline after the closing brace.

601

Hmm, so p_prev_runtime is effectively unused now? Looks like we can just remove it.

937

Can this be removed?

1164

Same style nit here regarding comment style.

1180

The multiplication on the RHS can overflow if mp_ncpus > 4, all three terms are 32 bits wide. Casting mp_ncpus to uint64_t before the multiplication should solve that.

Addresses style issues. Also removes racct_propagate_pcpu, which was meant to update pcpu on process exit. Removes p_prev_runtime from the process struct, as it is no longer used. Also sets pcpu to zero on process exit (for zombie processes, carrying over behaviour from old pcpu scheme).