Details
- Reviewers: olce
Diff Detail
- Repository: rG FreeBSD src repository
- Lint: Lint Passed
- Unit: No Test Coverage
- Build Status: Buildable 71015, Build 67898: arc lint + arc unit
Event Timeline
Have rebased to main. Thanks!
| sys/x86/cpufreq/hwpstate_amd.c | |
|---|---|
| 154 | NOT_READ is to flag if we have cached the request field. |
I tried this on 3 EPYC generations at Netflix. The most recent (AMD EPYC 8434P, AMD EPYC 9535) behaved as expected. There was a dev.hwpstate_amd node for each CPU, and a lot more freqs were exposed (from 3 to roughly a dozen).
The oldest EPYC we have is the EPYC 7502P, where it didn't change anything. We still have only 3 frequencies exposed. On this machine, we see just a single node from dev.hwpstate_amd:
dev.hwpstate_amd.0.freq_settings: 2500/2750 2200/2200 1500/1350
dev.hwpstate_amd.0.%iommu:
dev.hwpstate_amd.0.%parent: cpu0
dev.hwpstate_amd.0.%pnpinfo:
dev.hwpstate_amd.0.%location:
dev.hwpstate_amd.0.%driver: hwpstate_amd
dev.hwpstate_amd.0.%desc: Cool`n'Quiet 2.0
dev.hwpstate_amd.%parent:
That contrasts with the other machines, where we see something like this:
dev.hwpstate_amd.0.freq_settings: 400/-1 522/-1 644/-1 766/-1 888/-1 1010/-1 1132/-1 1255/-1 1377/-1 1499/-1 1621/-1 1743/-1 1865/-1 1987/-1 2110/-1 2232/-1 2354/-1 2476/-1 2598/-1 2720/-1 2843/-1 2965/-1 3087/-1
dev.hwpstate_amd.0.desired_performance: 33
dev.hwpstate_amd.0.maximum_performance: 255
dev.hwpstate_amd.0.minimum_performance: 33
dev.hwpstate_amd.0.epp: 0
dev.hwpstate_amd.0.%iommu:
dev.hwpstate_amd.0.%parent: cpu0
dev.hwpstate_amd.0.%pnpinfo:
dev.hwpstate_amd.0.%location:
dev.hwpstate_amd.0.%driver: hwpstate_amd
dev.hwpstate_amd.0.%desc: AMD Collaborative Processor Performance Control (CPPC)
I'm assuming it's a BIOS setting to enable hwpstate, but I don't see it anywhere.
More generally, my assumption is that powerd is in complete control here, as it would be without the hwpstate control (which is great, as it's exactly what I want). If somebody wanted to use EPP, I guess you'd poke at dev.hwpstate_amd.0.epp?
All in all, this looks great. It seems to do no harm on the older box, and exposes more control for powerd on the newer boxes!
I suspect that the EPYC 7002 series does not support CPPC, since it falls back to Cool`n'Quiet 2.0 instead of using AMD CPPC. The fallback only happens when CPUID does not advertise CPPC or the write to the CPPC enable MSR fails.
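The fallback rule described above can be sketched as a small decision function. This is only an illustration: the enum, the function name, and the two boolean inputs are hypothetical and are not taken from hwpstate_amd.c.

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not the driver's identifiers. */
enum hwpstate_mode {
	MODE_PSTATE,	/* legacy P-states, reported as Cool`n'Quiet 2.0 */
	MODE_CPPC	/* AMD CPPC */
};

/*
 * cpuid_has_cppc:      CPUID advertises CPPC support.
 * enable_msr_write_ok: the write to the CPPC enable MSR succeeded
 *                      (only meaningful when cpuid_has_cppc is true).
 */
static enum hwpstate_mode
choose_mode(bool cpuid_has_cppc, bool enable_msr_write_ok)
{
	if (!cpuid_has_cppc)
		return (MODE_PSTATE);	/* feature not advertised */
	if (!enable_msr_write_ok)
		return (MODE_PSTATE);	/* could not enable CPPC */
	return (MODE_CPPC);
}
```

Under this reading, the EPYC 7502P result would correspond to one of the two MODE_PSTATE branches, though which one is not established in this thread.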
> More generally, my assumption is that powerd is in complete control here, as it would be without the hwpstate control (which is great, as it's exactly what I want). If somebody wanted to use EPP, I guess you'd poke at dev.hwpstate_amd.0.epp?
Yes, people are still able to write the epp knob. They are different features: powerd adjusts CPU performance based on workload, while EPP expresses the user's preference.
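For anyone who wants to set EPP from a program rather than with sysctl(8), a minimal sketch follows. It assumes the epp node is an integer sysctl (the output earlier in this thread shows an integer value) and does not validate the EPP range, which is not stated here. The helper names are hypothetical; sysctlbyname(3) is the standard FreeBSD call, and writing the node is expected to require root.

```c
#include <stddef.h>
#include <stdio.h>
#ifdef __FreeBSD__
#include <sys/types.h>
#include <sys/sysctl.h>
#endif

/*
 * Build "dev.hwpstate_amd.<unit>.epp" into buf.
 * Returns 0 on success, -1 if buf is too small.
 */
static int
epp_sysctl_name(char *buf, size_t len, int unit)
{
	int n = snprintf(buf, len, "dev.hwpstate_amd.%d.epp", unit);

	return ((n < 0 || (size_t)n >= len) ? -1 : 0);
}

#ifdef __FreeBSD__
/*
 * Set EPP for one CPU. Assumption: the node is an int.
 * Returns 0 on success, -1 on error (errno set by sysctlbyname).
 */
static int
set_epp(int unit, int epp)
{
	char name[64];

	if (epp_sysctl_name(name, sizeof(name), unit) != 0)
		return (-1);
	return (sysctlbyname(name, NULL, NULL, &epp, sizeof(epp)));
}
#endif
```

Since there is one node per CPU on the CPPC machines, a caller would presumably loop over the units it cares about.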
> All in all, this looks great. It seems to do no harm on the older box, and exposes more control for powerd on the newer boxes!
Great! It is good to hear that it actually works for other people.
Cache and free acpi_cppc_ctx immediately, since we only use its read-only fields.
Also, use roundup to prevent duplicate frequencies.
On my Ryzen PRO 5650U, now showing 18 freq_settings per thread under CPPC. Still need to see how this affects performance on battery.
Hi,
As said previously, I'm opposed to this approach. So we have to discuss (for other people, this discussion has started through other channels as well, but I'll try to keep this public revision updated as much as possible).
Basically, you are exporting frequencies from CPPC and allowing them to be set through cpufreq(4) so that powerd(8) can automatically leverage that, which is exactly what I would really like to avoid.
Some of my reasoning is summarized here: https://lists.freebsd.org/archives/freebsd-hackers/2026-February/005774.html. Another reason is that we are not sure how to determine the performance level <-> frequency mapping reliably, and experimentally on some machines I've seen that we are not actually able to devise one (either because there is no such mapping there, contrary to what AMD's APM says, or our frequency evaluator is severely broken; I've given more evidence in another channel). Here, you're circumventing that specific problem by performing a computation based on frequencies reported by ACPI's _CPC object. First, we already have cases where this object does not exist, but the processor actually supports CPPC. Second, the ACPI CPPC spec is relatively cumbersome, and in some areas strangely in contradiction with Intel's and AMD's specs (e.g., enabling autonomous mode). This, and the fact that other OSes such as Linux have specific Intel and AMD drivers, hints at the ACPI CPPC spec being already in part obsoleted by Intel's and AMD's actual implementations on amd64 hardware.
Add to that that we would like a similar mechanism for Intel CPPC (why have something different?), and Intel explicitly says (and the ACPI spec too, by the way) that performance levels *do not map* to frequencies, so basically what you're doing here is not transposable to our Intel's CPPC driver, and thus is not generic enough (and I find it unlikely that AMD will continue to provide such a mapping if they plan to improve their CPPC support).
What I see as a generic future-proof direction is to leave hwpstate_amd(4) as is with respect to frequencies and instead change powerd(8) to use its new CPPC sysctl knobs directly. Then, we can add the same knobs to hwpstate_intel(4), and tweak powerd(8) to use them as for hwpstate_amd(4).
This also enables implementing different, and more complex, policies in powerd(8) directly (but we can still start small).
Ideally, we should integrate CPPC in cpufreq(4), but I don't see that as urgent/very compelling, since it seems the added benefits will be small. Basically, this would enable one to uniformly use the new cpufreq(4) CPPC control knobs instead of specifically those of hwpstate_intel(4) or hwpstate_amd(4) in powerd(8), but at the moment I think that's pretty much all.
> What I see as a generic future-proof direction is to leave hwpstate_amd(4) as is with respect to frequencies and instead change powerd(8) to use its new CPPC sysctl knobs directly. Then, we can add the same knobs to hwpstate_intel(4), and tweak powerd(8) to use them as for hwpstate_amd(4).
> This also enables implementing different, and more complex, policies in powerd(8) directly (but we can still start small).
I guess I wonder what you mean by this. I hope you don't just mean having powerd influence the hardware control loop by tweaking the epp sysctl. That would lead to 2 control loops trying to govern the same thing, which often ends in chaos. I hope you mean bypassing the hardware governor and stepping up/down the clock, whether or not those steps are labeled as frequencies or just as opaque steps.
> Ideally, we should integrate CPPC in cpufreq(4), but I don't see that as urgent/very compelling, since it seems the added benefits will be small. Basically, this would enable one to uniformly use the new cpufreq(4) CPPC control knobs instead of specifically those of hwpstate_intel(4) or hwpstate_amd(4) in powerd(8), but at the moment I think that's pretty much all.
Let's not let the perfect be the enemy of the good. What this patch does is a huge improvement in the status quo. I'm not opposed to teaching powerd how to interact with cppc, but I'd also rather have this go in as-is until that can happen. Just because Intel doesn't offer this level of control is no reason to restrict AMD users from having it.
Drew
I understand this concern, as already stated in: https://lists.freebsd.org/archives/freebsd-hackers/2026-February/005774.html. We intend nonetheless to have one policy that tunes only the EPP setting to play with at some point, but that is clearly not the priority, which is to have a simple one that turns off autonomous mode and just sets all min/max/desired performance settings to the same value.
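The simple policy mentioned above (min, max, and desired all set to one value) can be sketched as follows. The struct and function names are hypothetical and not taken from hwpstate_amd(4) or powerd(8); turning off autonomous mode and writing the request to the hardware are outside this sketch. The bounds in the usage note come from the sysctl output earlier in this thread (minimum_performance 33, maximum_performance 255).

```c
#include <stdint.h>

/* Hypothetical container for the four CPPC knobs discussed here. */
struct cppc_request {
	uint8_t min_perf;
	uint8_t max_perf;
	uint8_t desired_perf;
	uint8_t epp;
};

/* Clamp a requested abstract performance level to [lowest, highest]. */
static uint8_t
clamp_perf(unsigned int level, uint8_t lowest, uint8_t highest)
{
	if (level < lowest)
		return (lowest);
	if (level > highest)
		return (highest);
	return ((uint8_t)level);
}

/*
 * Build a request that pins min, max, and desired to the same level,
 * leaving EPP as given by the caller.
 */
static struct cppc_request
pinned_request(unsigned int level, uint8_t lowest, uint8_t highest,
    uint8_t epp)
{
	uint8_t p = clamp_perf(level, lowest, highest);
	struct cppc_request r = {
		.min_perf = p,
		.max_perf = p,
		.desired_perf = p,
		.epp = epp,
	};

	return (r);
}
```

For example, pinned_request(128, 33, 255, 0) yields min = max = desired = 128, and a level below 33 is raised to 33. The levels stay on the abstract performance scale; nothing here converts them to frequencies.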
> I hope you mean bypassing the hardware governor and stepping up/down the clock, whether or not those steps are labeled as frequencies or just as opaque steps.
As a corollary of what I said in https://lists.freebsd.org/archives/freebsd-current/2026-February/009918.html, you have to understand that there are no absolute guarantees that all hardware governors are turned off when using CPPC. However, by not using autonomous mode, you can at least remove the most significant one from the equation. It may be that, with current AMD processors, setting all of min/max/desired performance to the same value is enough to have stable performance, but even that remains to be thoroughly tested (Intel's documentation explicitly says that EPP has an influence even when autonomous mode is off; AMD's is silent on that topic; since they say in the APM that, so far, there's a mapping between performance levels and CPU frequencies, at least it's reasonable to hope for the best). With CPPC, there may be a price to pay in exchange for having more performance levels exposed.
Some habits may not go away easily, but continuing to think in terms of clock/frequency is a simplification of reality that will just obstruct your understanding of what CPPC is and could or could not do.
> > Ideally, we should integrate CPPC in cpufreq(4), but I don't see that as urgent/very compelling, since it seems the added benefits will be small. Basically, this would enable one to uniformly use the new cpufreq(4) CPPC control knobs instead of specifically those of hwpstate_intel(4) or hwpstate_amd(4) in powerd(8), but at the moment I think that's pretty much all.
> Let's not let the perfect be the enemy of the good. What this patch does is a huge improvement in the status quo. I'm not opposed to teaching powerd how to interact with cppc, but I'd also rather have this go in as-is until that can happen. Just because Intel doesn't offer this level of control is no reason to restrict AMD users from having it.
Our level of support for Intel CPPC has nothing to do with my rejection here (and aligning hwpstate_intel(4) with hwpstate_amd(4) is already in ShengYi's pipe and a matter of days). I'm rejecting this approach because, as I've developed multiple times already, it is wrong on multiple levels. First, frequencies are fake. Second, there is no way you can generally map 4 control knobs with the leeway allowed in any of the specifications into a one-dimensional scale (the frequencies) where increasing values mean increasing performance (which is the more performant/efficient between min = 128, max = 255, desired = 192 and EPP = 50, and min = 200, max = 200, desired = 200 and EPP = 50? well, you don't know; and between min = 0, max = 255, desired = 128 and EPP = 0 and min = 0, max = 255, desired = 255 and EPP = 100? you don't know either; and so on and so forth; and, on top of that, each answer will depend on a particular CPU model). Third, this approach is not future-proof, because AMD may as well stop mapping levels to frequency if they want to improve their CPPC support (we are not even sure how they are doing it now; some recent experiments of mine suggest that this claim might not even be true today). Fourth, this approach is not applicable to Intel, because they explicitly say that there is *no* mapping with frequencies, and that different balances are achieved via a combination of means. Finally, on this topic, let me quote the ACPI spec:
> In order to provide backward compatibility with existing tools that report processor performance as frequencies, the _CPC object can optionally provide processor frequency range values for use by the OS. If these frequency values are provided, the restrictions on _CPC information usage still remain: the OSPM must make no assumption about the exact meaning of the performance values presented by the platform, and all functional decisions and interaction with the platform still happen using the abstract performance scale. The frequency values are only contained in the _CPC object to allow the OS to present performance data in a simple frequency range, when frequency is not discoverable from the platform via another mechanism.
TL;DR: While appealing on the surface, the change here is a hack that will inevitably cause insoluble problems going forward.
Fortunately, the good news is that teaching powerd(8) about CPPC knobs will almost surely take less time than it took to prepare this revision. After discussion with ShengYi, we will be advancing that topic before and at AsiaBSDCon, so an acceptable solution should come very soon.