The runtime value is multiplied by 1000000, but it is already in microseconds, resulting in a very large estimate.
Removing this extra factor of 1000000 should fix bug #235556.
Note that this estimate is only used for short-lived processes.
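For context, the expression being changed looks roughly like this (a userspace paraphrase; the identifiers are mine and the in-tree code may differ in detail):

#include <stdint.h>

/*
 * Sketch of the %CPU estimate computed at process exit.
 * runtime_us: the process's accumulated CPU time, in microseconds.
 * wall_us: wall-clock time since the process was created, in microseconds.
 */
static uint64_t
pcpu_exit_estimate(uint64_t runtime_us, uint64_t wall_us)
{

	if (wall_us == 0)
		return (0);
	return (runtime_us * 100 * 1000000 / wall_us);
}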
Event Timeline
The patch works like a charm! It fixes the case I described here (https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=235556#c2); the stats are now correct: https://www.bsdstore.ru/trash/racct.png
Tested on: FreeBSD 14.0-CURRENT #0 main-n247127-1976e079544-dirty
I would suggest noting in the description/commit log message that this estimate is used only for short-lived processes.
sys/kern/kern_racct.c:328
Style: missing parens around the return value.
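For reference, style(9) wants return values parenthesized, e.g.:

static int
f(int error)
{
	return (error);	/* not: return error; */
}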
I did some more testing, and now I think the change is wrong. In particular, note that the RACCT_PCPU resource has the RACCT_IN_MILLIONS attribute, which explains why the multiplication is there.
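That is, the stored RACCT_PCPU value is a percentage scaled by one million, so the factor the patch removes is doing scaling, not unit conversion. A quick units check (the helper name is mine, for illustration):

#include <stdint.h>

/*
 * Assuming the exit-time expression has the shape
 *
 *	runtime [us] * 100 * 1000000 / wallclock [us]
 *
 * the microseconds cancel, leaving percent scaled by 1000000 -- exactly
 * what a RACCT_IN_MILLIONS resource stores.
 */
static uint64_t
pct_to_in_millions(uint64_t pct)
{
	/* 100% CPU is stored as 100 * 1000000. */
	return (pct * 1000000);
}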
My test was to run a buildkernel in a jail, passing the number of CPUs to make(1)'s -j parameter. Since the build should keep every CPU busy, I'd expect to see a usage of roughly 3200 (100% per CPU) in this case. Without the change, it is much larger than that; with the change, it is too small. By the way, you can collect stats for the host by running rctl -u jail:0, in case you didn't already know that, so you don't have to actually create a jail.
Reading the code some more, I see a (the?) problem in racct_proc_exit(). There are two places where the PCPU resource is updated: periodically (once per second) by racctd, and when a process exits. In the latter case, we divide the total runtime of the process by the wall-clock time elapsed since the process was created, and convert the result to a percentage.

Consider what happens when a compiler process is created, does its work, and exits. Suppose it takes 0.1s to compile a file. Compilers are CPU-bound, so the process will be on a CPU for almost all of its lifetime, corresponding to 100% CPU. Now suppose we run 10 compiler instances back-to-back: won't that result in a reported usage of 1,000% CPU, even though only one CPU was ever busy? In other words, the estimate we use for short-lived processes doesn't make sense when they're CPU-bound.
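To make the arithmetic concrete, here's a small userspace simulation, assuming the exit-time formula sketched in the summary and that each exiting process's estimate is simply added into the jail's usage:

#include <stdint.h>
#include <stdio.h>

/* Exit-time %CPU estimate, scaled by 1000000 (shape as sketched above). */
static uint64_t
pcpu_exit_estimate(uint64_t runtime_us, uint64_t wall_us)
{
	return (wall_us == 0 ? 0 : runtime_us * 100 * 1000000 / wall_us);
}

int
main(void)
{
	uint64_t usage = 0;
	int i;

	/*
	 * Ten CPU-bound compiles run back-to-back, each alive for 0.1s
	 * and on-CPU for essentially all of it.
	 */
	for (i = 0; i < 10; i++)
		usage += pcpu_exit_estimate(100000, 100000);
	printf("reported: %ju%%\n", (uintmax_t)(usage / 1000000));
	return (0);
}

This prints "reported: 1000%" even though only one CPU was ever busy.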
The problem is that %CPU, as it's currently computed, isn't additive over short time periods.