Page MenuHomeFreeBSD

hwpstate: add CPPC support for pstate driver on AMD
Needs ReviewPublic

Authored by aokblast on Mar 31 2025, 10:04 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Dec 20, 5:19 PM
Unknown Object (File)
Thu, Dec 18, 11:15 AM
Unknown Object (File)
Thu, Dec 18, 11:15 AM
Unknown Object (File)
Mon, Dec 15, 8:54 AM
Unknown Object (File)
Fri, Dec 12, 8:34 PM
Unknown Object (File)
Fri, Dec 12, 1:45 AM
F139169190: D49587_additional_patch_proposal.diff
Mon, Dec 8, 9:14 PM
Unknown Object (File)
Mon, Dec 8, 9:01 PM
Subscribers

Details

Reviewers
cem
markj
olce
khng
Summary

hwpstate: add CPPC support for pstate driver on AMD

Implement CPPC interface for AMD Pstate Driver.
This feature is only enabled when the CPUID shows it support CPPC.

The CPPC is implemneted by the following steps:

  1. Write MSR to enable it.
  2. Read capability registert which indicates binary value of levels

about lowest, best energy efficient, guarantee, and max performance.

  1. Write request register with epp in energy balanced mode. And let

CPU and firmware to enter autonomous mode.

Also, create a sysctl to allow userspace to change epp value

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 69421
Build 66304: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

move set_autonomous mode into attach

This revision is now accepted and ready to land.Jun 12 2025, 10:42 PM

I'd like the opportunity to review this (next week).

I'd like the opportunity to review this (next week).

Of course. Not hurry. I just want to make more people engage in:).

Added a bunch of inline comments. Most important things:

  • Support for PSTATE_CPPC_PKG_CTRL (in software, because contrary to Intel, there is no per-package MSR).
  • In sysctl_epp_select(), one of the guard tests seems wrong.

There are a few places I didn't review in depth, I may return to them beginning of next week.

sys/x86/cpufreq/hwpstate_amd.c
106–110

Could you rename the AMD_CPPC_ prefix to AMD_CPPC_CAPS_1_, just to make it apparent that these are applicable (only) to the CAPS 1 MSR?

111–114

Same here please, AMD_CPPC_ => AMD_CPPC_REQUEST_.

233–239

You're basically open-coding some AMD_CPPC_ macros (those to prefix with AMD_CPPC_CAPS_1_) here. Please use the macros instead.

242–244

That second read actually reads the same MSR. I think you should completely remove data2 and use data instead. This distinction made sense in the Intel case as there is a specific MSR targeted at the package (MSR_IA32_HWP_REQUEST_PKG) but there's no such thing in AMD processors.

Even if there is no specific hardware support for package control on AMD processors, I think that providing this functionality to users is beneficial, in that it simplifies a lot the configuration of the "baseline" case where all cores are set to the same value.

292

You probably meant PSTATE_CPPC here, as bailing out in the case of PSTATE_CPPC_PKG_CTRL does not seem to make sense.

298–302

This is in the wrong place. It must be moved between the lines val = ((r >> 24) & 0xFFULL) * 100 / 0xff; and if (val > 100) {.

310

It would be better to use AMD_CPPC_REQUEST_ENERGY_PERF_BITS here, even if that's more verbose. With it, we immediately know that the code is correct without having to check if the shift and mask are correct.

If you find that too verbose, factor out the formula into a macro.

310–311

Missing input/output at this point (see previous comment).

312–325

Generally, support for PSTATE_CPPC_PKG_CTRL is missing (as said above, I think this functionality is

We may want to mimick what happens in the Intel counterpart, which means that setting the knob for one core should affect all the other cores. If so, basically, when PSTATE_CPPC_PKG_CTRL is set, we have to loop over all CPUs for the setting part.

318–319

Same as above, it would be preferable to use a macro based only on AMD_CPPC_REQUEST_ENERGY_PERF_BITS and avoid the hardcoded 24.

489

Aren't we supposed to use MSR C001_0280 ("Performance Time Stamp Counter (CU_PTSC)") here? Or are we sure the regular TSC is just invariant if PSTATE_CPPC_PWR_CALC is set?

But maybe that's not worth bothering, as it appears most AMD processors do not support PSTATE_CPPC_PWR_CALC (i.e., only some of family 16h do, skimming various PPR manuals).

631–638

Similarly as above, I'd prefer macros here on the right-hand side of |=, and no hardcoded shifts (and proper mask applied to the new value, just in case, even though you aptly chose uint8_t as the type for the cppc_settings fields).

725–726

Since you (rightly) added an occurrence of a call to device_set_desc() earlier, you can remove this one (redundant).

aokblast marked 9 inline comments as done.

Fix olce recommand fixes

This revision now requires review to proceed.Jul 8 2025, 2:59 PM
sys/x86/cpufreq/hwpstate_amd.c
298–302

Mark tell me that this can be moved to a place before we lock a thread. Should we move it back?

The sysctl_handle_int() call doesn't need to be inside the sched_bind() section, it should be moved earlier.

310

I think we should remove the check as the value is always between 0~255. The value won't never exceed 100 when we scale down the value. 255 * 100 / 255 = 100

312–325

I will check it but needs some time because I see they have different capabilities register value for each CPU. Thus I think it is package level in default.

489

I think I should delete this function totally as you know that this function is accessible in very few models. However, I will leave the macro in specialreg.h alone to prevent being reused by AMD.

Expose the epp interface in only cpu 0 and modify all cpu value.

@olce I think exposing the epp setting interface only in only cpu0 makes sence and therefore I make some changes. @markj Please also help me review it when you have time, thanks! Also, I remember that you have told me that it is better to send ipi instead of binding the CPU, but I forget the url of your sample code. Could you please provide it in here?

@olce I think exposing the epp setting interface only in only cpu0 makes sence and therefore I make some changes.

Mmm... It's not that providing it only on CPU 0 does not make sense, but actually having a different knob for each CPU can be useful (admittedly in corner cases), and would be consistent with hwpstate_intel. But that's of course not a blocker, this can be added back later on.

Going to review the new version.

@markj Please also help me review it when you have time, thanks! Also, I remember that you have told me that it is better to send ipi instead of binding the CPU, but I forget the url of your sample code. Could you please provide it in here?

If Mark gives no answer in a few days, I'll suggest something.

@olce I think exposing the epp setting interface only in only cpu0 makes sence and therefore I make some changes.

Mmm... It's not that providing it only on CPU 0 does not make sense, but actually having a different knob for each CPU can be useful (admittedly in corner cases), and would be consistent with hwpstate_intel. But that's of course not a blocker, this can be added back later on.

Going to review the new version.

@markj Please also help me review it when you have time, thanks! Also, I remember that you have told me that it is better to send ipi instead of binding the CPU, but I forget the url of your sample code. Could you please provide it in here?

If Mark gives no answer in a few days, I'll suggest something.

Is it what you mean? Use the provided sysctl to control the package level or logical cpu level control for cppc but Implement package level control in software? If that is the case, it is just like the intel one now. But the intel one have a problem that when we chnage epp from core0, sc->req info from core1 and other are not changed. Which means the value won't be correct after we change the epp in core0 and we want to read the epp value in core1 afterward and I don't have any intel machine to varify it now.

Is it what you mean? Use the provided sysctl to control the package level or logical cpu level control for cppc but Implement package level control in software? If that is the case, it is just like the intel one now.

Yes, I'd like the same behavior as the Intel one, except that for AMD we have to emulate package level control (by just setting all CPUs to the same value).

But the intel one have a problem that when we chnage epp from core0, sc->req info from core1 and other are not changed. Which means the value won't be correct after we change the epp in core0 and we want to read the epp value in core1 afterward and I don't have any intel machine to varify it now.

I agree. In fact, in sysctl_epp_select(), there should be a systematic rdmsr_safe(MSR_IA32_HWP_REQUEST_PKG) on the hwp_pkg_ctrl_en case instead of relying on sc->req.

Which makes me notice that we only support package control on Intel if the hardware supports it. Hence, the other hwp_pkg_ctrl flag, which indicates hardware support, whereas hwp_pkg_ctrl_en is a synthesis of the previous and hwpstate_pkg_ctrl_enable (which is the user setting). We should probably emulate package control on hwpstate_pkg_ctrl_enable set and hwp_pkg_ctrl no set. But that's work for another day. :-)

I think having package control as the default is really what makes sense, so we should emulate it. On the other hand, once that is done, it's not difficult to actually allow per-CPU knobs (the simple case).

And to complete these musings, I also think that, unfortunately, the design for hwpstate_intel was too close to the hardware in the sense that you have to choose a mode of operation (package control or not) which somehow changes the meaning of the existing knobs. From the user's point of view, a more rational design would have been to have a general knob that applies to all CPUs, and per-CPU knobs (and no more a mode knob). If not all CPUs agree, the general knob would have a special value (e.g., -1), whereas if they agree it would have the agreement value. Writing into the general knob acts on all CPUs (which can be implemented by using hardware package control where available, else in software), whereas writing on a specific CPU's knob influences only that CPU (which means going out or entering hardware package control mode as appropriate). That would have made the implementation slightly more complex but the interface more palatable to the outside world. Maybe that ship has sailed, but perhaps breaking compat' for such knobs in 16 isn't a big deal, in which case I'd like to see it done (but not in this revision, as part of follow-up work).

@olce I think exposing the epp setting interface only in only cpu0 makes sence and therefore I make some changes. @markj Please also help me review it when you have time, thanks! Also, I remember that you have told me that it is better to send ipi instead of binding the CPU, but I forget the url of your sample code. Could you please provide it in here?

I think it was this? https://github.com/markjdb/freebsd/commit/c639cf2dc193e7979b886cc024e94e35cb5b88ab

So, the main question now is whether we keep struct hwpstate_cppc_setting (=> struct hwpstate_cppc_state) at all, see inline comments.

Reading/writing the different MSRs should be factored out (if we don't keep the structure, that seems not needed, but in the end will be when using smp_rendezvous_cpus()).

Yes, this is it. That's a good illustration. It needs a little bit more work with respect to concurrent accesses as now the fields of a specific CPU's softc are accessed on another CPU.

But I advise to look into that only when all the rest has been handled (and possibly as a separate commit), as non-trivial code changes may need to happen first.

Additional patch proposal (see inline comments):

sys/x86/cpufreq/hwpstate_amd.c
106–109

So we have AMD_CPPC_CAPS_1_*_PERF that are function-like macros and AMD_CPPC_REQUEST_*_BITS that are plain ones. I think both should be consistent.

I have a slight preference for the AMD_CPPC_REQUEST_*_BITS as it allows to spot problems more immediately.

Proposing a revamp using only *_BITS with an additional set of function-like macros (BITS_VALUE(), BITS_FROM_VALUE(), SET_BITS_VALUE()). SET_BITS_VALUE() is similar to SET_VALUES(), and replaces it.

See the attached patch.

106–114

There's a space between #define and the macro name on these lines, contrary to other existing lines and new MSR_AMD* ones where there's a TAB, so use a TAB here preferably (or switch all lines to a space).

137

I was actually wondering if we should keep it. Currently, this structure is only written to, and only some of its values are immediately used in amd_set_autonomous_hwp() and nowhere else after that. In particular, amdhwp_dump_sysctl_handler() never uses sc->req.

We could perhaps keep this structure, and use it in amdhwp_dump_sysctl_handler(), for the following reasons. For Intel processors, the capabilities can change at any moment due to changing load/thermal package conditions (in fact, not more than once per second, and only the guaranteed and most efficient ones), according to the Architecture Software Developers reference manual. For AMD, the Architecture's Programmer Manual says that the "guaranteed" performance corresponds to ideal conditions, which can be interpreted as "will not change", contrary to what Intel does. Assuming it's the case in practice, then saving the capabilities into the structure here makes sense and then saves doing rdmsr() in amdhwp_dump_sysctl_handler().

Preferring a more paranoid approach is also fine, but in this case this structure is in fact not needed and should be removed.

If keeping it, should be renamed to hwpstate_cppc_state to better reflect purpose, and req should be moved to it.

146

See inline comment above about struct hwpstate_cppc_setting (=> hwpstate_cppc_state) about adding req to this structure.

146

If you keep it (see previous comment).

202

Please see discussion above on struct hwpstate_cppc_setting (=> struct hwpstate_cppc_state) to see if this function needs changing to stop reading anything from the hardware and just reporting values stored in sc->cppc_state.

298

Style (I know hwpstate_intel does the same, and should gradually be fixed, but this one is new, so...).

There are also several occurences of if (ret) => if (ret != 0) below, not creating an inline comment for each.

335–336

If the wrmsr_safe() before the loop for the current CPU worked, we should assume that it is going to work on all others. So:

  • If an error nonetheless occurs here inside the loop, I would just ignore it (print some diagnostic, probably only once), and reset ret to 0.
  • That allows to finish the loop and set sc->req to what the current CPU was set.

Doing so would allow us to be informed that the seemingly impossible case actually occurred, and at least try to have most CPUs in line with the request, and also have sc->req set to the requested value, even if setting fails on some other CPU.
That said, I'd also move the sc->req after the loop earlier, just after the test that the initial wrmsr() succeeded.
I don't think that adding code to revert previous modifications on an error in the loop is necessary at this point.

You must set the local CPU's sc->req as well in the loop, else the other (per-CPU) "epp" knobs will report a wrong value.

619–622

See inline comment at top about whether to keep struct hwpstate_cppc_setting, and thus sc->cppc_settings.

630
662–663

I'd keep the original comment more or less. It's a bit obscure, and I haven't been able so far to verify what it says.

672–675

Would rather set it to something like "AMD CPPC" on PSTATE_CPPC. "Cool`n'Quiet 2.0" is... cool but quite confusing.

675

Either 0 has to be returned instead, or hwpstate_attach() must be modified to read the flags again into sc->flags (see DEVICE_PROBE(9) manual page).

First option seems fine.

A small clarification: Above, I've been talking both to possibly remove struct hwpstate_cppc_setting (=> struct hwpstate_cppc_state) and to move the req field to it, which is not apparently consistent. I was considering both separately. The consistent view is that req should be moved to struct hwpstate_cppc_state and keep the latter unconditionally but possible removing from it the high, guaranteed, efficient and low fields (depending on some choices described above). req has to be kept in all cases as it is needed in sysctl_epp_select() to return the current value (which here is probably the proper thing to do; in the debug knob it's debatable whether we actually want to re-read from the MSR).

  1. Fix indentation
  2. Remove hwpstate_cppc_settings structure
  3. Update sc->req in all cpu cores
  4. Fix sc->req style nits
  5. Consistent CAPS and REQUESTS macro

Thanks for the comments! Hope it looks better now.

sys/x86/cpufreq/hwpstate_amd.c
335–336

I found a problem that we cannot actually set sc->req to all cpu since each cpu core can have different low and high capabilities. Only the EPP is in range 0~255 for each core. So we have to rdmsr each req bits for all cpu and set bits then finally wrmsr it back.

Thanks for the comments! Hope it looks better now.

Yes! A few more inline comments and I think this should be OK to commit, we can have more steps afterwards (one of which would be to switch to smp_rendezvous_cpus(); another would be to better handle concurrency; and more (which are not necessary upon you to implement)).

sys/x86/cpufreq/hwpstate_amd.c
274

It's in fact apparently possible to get to the desired devices directly, through something like:

const devclass_t dc = devclass_find(HWP_AMD_CLASSNAME);

and in the loop:

const device_t dev = devclass_get_device(dc, i);

which is simpler and independent of the precise hierarchy of devices (acpi -> acpi_cpu -> hwpstate).

Also, I suggested HWP_AMD_CLASSNAME to factor out this device's class name, which should be changed to hwpstate_amd.

335–336

Re-reading the last line of my comment: "You must set the local CPU's sc->req as well in the loop, else the other (per-CPU) "epp" knobs will report a wrong value.", the use of "set the local CPU's sc->req" could be confusing. I meant the softc associated to the core whose MSR you're writing into, i.e., while looping over each CPU, we should update the corresponding softc.

Once using the right softc, as you already do in sysctl_epp_select_foreach_core(), you don't really need to read the MSR at all.

348

Consistently with another inline comment, please use HWPSTATE_AMD_CLASSNAME and define this macro to "hwpstate_amd" near the top of the file.

Also, please replace existing verbatim "hwpstate" with HWPSTATE_AMD_CLASSNAME (it seems that makes sense for all occurrences).

372

We must not short-circuit sysctl_epp_select_foreach_core() in this case (in case of changes to hwpstate_pkg_ctrl_enable).

Label hwpstate_pkg_ctrl (or whichever name you choose) should be to statement if (hwpstate_pkg_ctrl_enable).

sys/x86/cpufreq/hwpstate_amd.c
372

I currently assumes that CPPC is only controlled by hwpstate_amd module and thus all epp in any condition should be same for all cores. That is why I short-circuit in here.

sys/x86/cpufreq/hwpstate_amd.c
335–336

If we trusted the value of sc->req, we should be able to short-circuit in if r == sc->req in package control mode as the msr should not be modified by other modules.

If we don't trust it, we should re-read the sc->req everytime and apply epp vlaue only.

Use hwpstate_amd instead of hwpstate
Not short-circuiting for cpu

Feels much better like that. I have one more inline comment for a change in sysctl_epp_select_per_core(). With that, you can consider the changes reviewed!

sys/x86/cpufreq/hwpstate_amd.c
285–291

I'd trust the value in sc here (contrary to the debug knob).

(If you keep this code, please amend the error message so that it is different than that of wrsafe() below.)

335–336

If we trusted the value of sc->req, we should be able to short-circuit in if r == sc->req in package control mode as the msr should not be modified by other modules.

We already trust it (it's the value we report at beginning of sysctl_epp_select() via sysctl_handle_int()).

Yes, you could as well test for val == sc->req (after applying SET_BITS_VALUE(req, AMD_CPPC_REQUEST_ENERGY_PERF_BITS, val);) even if package control is enabled, but that's just nice to have and doesn't affect correctness. We'll rework this anyway when switching to smp_rendezvous_cpus().

372

Even if hwpstate_amd is in control, you can't assume that, as the user is free to toggle hwpstate_pkg_ctrl_enable at any time. The only guarantee we (have to) maintain is that each CPU's sc reflects its current setting, which is the case currently. There's no guarantee that all cores have the same EPP value, even if hwpstate_pkg_ctrl_enable is true (that boolean may have just been changed).

This revision is now accepted and ready to land.Sat, Dec 20, 5:57 PM
olce requested changes to this revision.Sat, Dec 20, 6:09 PM

There's a compile error (missing parenthesis, see inline comment).

sys/x86/cpufreq/hwpstate_amd.c
343
This revision now requires changes to proceed.Sat, Dec 20, 6:09 PM

Trust sc->req in set_epp and add () in KASSERT

sys/x86/cpufreq/hwpstate_amd.c
285–291

I think I will just remove this code segment. Thanks!

372

I see. But hwpstate_pkg_ctrl_enable is RDTUN currently. Which means we are unable to change it until next reboot. This is same as the intel's code.