Page MenuHomeFreeBSD

Summary: Add support for Intel Speed Shift
Needs ReviewPublic

Authored by scottph on Nov 18 2018, 1:09 AM.



Intel Speed Shift is Intel's technology to control frequency in hardware, with hints from software.

Test Plan

some performance test?

Diff Detail

Lint OK
No Unit Test Coverage
Build Status
Buildable 23188
Build 22231: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
kib added inline comments.Feb 15 2019, 1:28 PM

Please do not use implementation-defined namespace without a reason (I do not see one).


return (-1);


return (set.freq);


And there too.


Are error values from this function used for anything ? If not, why not return -1 ?


I think this call can sleep and lock some VM locks as well. It is not good to lock while the thread is bound. How hard is to restructure the code to read all MSRs first, then unbind, then do printfs ?

bwidawsk added inline comments.Feb 16 2019, 12:18 AM

I don't mind changing it, but it's static, so I'm not sure what you mean about adding implementation-defined namespace.


Mostly it was for debug to be able to differentiate error cases. I suppose someone debugging can always change it though, and I'll do -1

cem added inline comments.Feb 16 2019, 12:46 AM

The implementation-defined part is the __ prefix in the function name.

cem added inline comments.Feb 16 2019, 12:51 AM

It's mostly uncommon in FreeBSD to return negative error numbers and positive success values — that's more of a Linux (and imported Linux code) thing. Typically we'd return 0 or EFOO and set an output parameter for the level index. (Or sometimes, return the level index, or -1 as an out-of-range sentinel value.)

(Also, the spelling would be return (-ENOENT); the idea being that return() could be defined as a macro.)

cem added a comment.Feb 16 2019, 10:08 PM

Thanks for doing this, Ben! I'm sorry it took me so long to take a look and appreciate your patience.


basic -> basis, I think


Is this case possible? It looks like this is only called with freq_dev, and I don't see a case where freq_dev isn't attached before it is initialized. But I may be mistaken.


I don't see where set is initialized. Probably this is the result of refactoring CPUFREQ_DRV_GET into __get_frequency and you meant to save the result of __get_frequency and use it for this comparison?


I appreciate adding comments documenting these routines, by the way. Thank you.


Hm, the drawback of this approach seems to be we're always invoking CPUFREQ_DRV_TYPE() even for all drivers prior to this change that allow cached frequencies for a given level.

Maybe I'm imagining things and that is a negligible cost?

It seems like we could make an invariant that CPUFREQ_FLAG_UNCACHED => (curr_set->freq == CPUFREQ_VAL_UNKNOWN). Then we would only need to check for != VAL_UNKNOWN to bail early here (like the condition prior to this change). We would then only invoke DRV_TYPE() if the value was UNKNOWN, and only save freq if !UNCACHED.

Again, I might just be overcomplicating this for no reason; feel free to tell me I'm wrong.


style(9) nitpick:

 * foo

(Multi-line comment blocks begin with a blank line. One thing to keep in mind with the style(9) manual page is: "Many of the style rules are implicit in the examples.")


The curr_set pointer into sc->curr_level is still valid, but it is a little confusing to me that the sc->curr_level assignment above is what affects this condition. I would probably replace curr_set->freq here with sc->curr_level.total_set.freq myself. Obviously, this isn't introduced in this change, and maybe it is clear to others. Feel free to ignore.


Maybe assert the CF_MTX lock here for reader clarity. I only suggest this because the goto estimate above jumps over a region that unlocks and relocks this lock, which is a little error prone for future refactoring.


The grammar here is a bit funky


It's a bit uncommon to use static variables in functions; are we sure it's safe in this context? In particular, we hold the CF_MTX for this softc, but do we hold any global lock preventing a race?

(And I guess the flip side — is sizeof(sets) large enough to actually be a concern re: stack use? By my count, cf_setting is 40 bytes on amd64, and MAX_SETTINGS is 256. Yeah, ok, that's too large for the stack.)

Is there a reason to avoid malloc here (which was used prior to this refactoring)?


style(9) nit: error != 0 (above)


It seems like prior to this change, theoretically multiple cpu frequency drivers could attach to a single CPU / cpufreq driver instance — hence all of the enumeration over devs. Now the attached freq_dev will just be whichever driver registered last. In practice this may not be a problem, but either we should enforce that more strictly, or somehow handle it again.


As above — I don't see anything enforcing this invariant. I might be missing something.


Would it be possible to define this in a shared header instead of multiple extern references in .c files?


Oh, I see. This is the hack that is preventing multiple freq drivers from registering on intel.

Doesn't this break est on hardware too old for HWP support? And the kludge only works for Intel devices?


This deserves a note in UPDATING. It might be worth just honoring the old name on both Intel and AMD hardware; I don't really see a downside.


This doesn't match the extern uint32_t declarations of the variable.

Style(9) nit: spaces around = operator.


DEVMETHOD_END is canonical. I don't feel strongly about it, though.


The canonical way to do something like this would be with a DEVICE_PROBE(9) implementation. There is a manual page and also some documentation in kern/device_if.m.

All drivers that identify themselves as candidates for a given devclass are PROBEd to determine which driver should be attached. It is possible for multiple drivers to successfully PROBE a given device; the priority is used to select the best driver for a given device.

For Intel CPUs, you would have est probe a lower priority and hwpstate_intel probe a higher priority (if available), on the same device or pseudo-device.


In practice, I don't think this will ever sleep or allocate in the locked section; none of the formats are large enough to exceed the pre-allocated 1024 byte buffer. In that case, however, it would make sense not to have an automatic drain function and just explicitly SYSCTL_OUT at the end.


inline keyword is probably gratuitous (below as well).

The function is small enough (certainly for non-DEBUG builds) to be inlined automatically, and even if it is not, these routines are only invoked from an (administrative) sysctl anyway — there isn't going to be any significant performance impact either way.

It is fine to leave it if you prefer, though.


Usually I'd reject invalid values rather than clamping them to the valid range, unless it makes sense that the valid range may grow or shrink in the future; here it doesn't (IMO).


what does this number represent?


I'm not sure this check is robust; I don't understand why we would bail if an INFO_ONLY driver is also attached under the same parent? But not if a !INFO_ONLY driver is attached? The condition in est is the opposite — *allow* an INFO_ONLY sibling, but bail if there are existing !INFO_ONLY siblings.


style(9) nit: space before =. Ditto below


it doesn't make sense to me to size buf to 128 but hardcode 18 here.

usually i'd suggest snprintf(foo, sizeof(foo), ...) and then hardcode foo[N], if anything.

either way, there may be no need for this buf? you could just use device_get_nameunit. it would be "debug.intel_hwpstateN" instead of "debug.intel_hwp_debugN", but arguably that is better anyway.

I'm not sure that placing debug sysctls under "debug." instead of just using CTLFLAG_SKIP and placing them in the normal device_get_sysctl_tree location is canonical; it may be.


Is this a leftover debugging print? It doesn't seem like something that should go to console on every invocation of the GET method.

cem added a reviewer: cem.Feb 16 2019, 10:09 PM
cem removed a subscriber: cem.
bwidawsk updated this revision to Diff 55252.Mar 19 2019, 9:21 PM
bwidawsk marked 37 inline comments as done.

intel hwpstate v2

Fixed some bugs in v1, and tried to address all review feedback.
patches now live here:

  • cpufreq: Add sysctl to show which freq driver
  • cpufreq: Rename hwpstate to hwpstate_amd.
  • cpufreq: Remove unnecessary indirection to dev
  • cpufreq: Extract cf_get functionality
  • cpufreq: Extract level information
  • cpufreq: Clarifying naming
  • cpufreq: Allow cpufreq to act over individual cpus
  • cpufreq: Create an uncached flag
  • cpufreq: Add support for Intel Speed Shift
  • cpufreq: Clean up multi-device usage
  • cpufreq: Add some debug messages and comments
  • hwpstate_intel: Enable by default
Harbormaster completed remote builds in B23188: Diff 55252.
bwidawsk added inline comments.Mar 19 2019, 9:22 PM

I'm going to punt on this for now... TYPE should be a pretty cheap operation and I'm risk averse :-)


In an earlier revision, the free paths were getting a little messy, and static seemed like an easy solution. Realistically, nothing uses anywhere near 256, but I agree it's problematic. I will rework it to use malloc.


I think my reworks address this, but please re-review.


yeah... jhb@ recommended this way. I'm fine with whatever though, recommendations?


It shouldn't. In the hwpstate_intel code, if the hardware is too old, it sets the bool to false. Same for basically any unsupported hardware.


I do agree what's there now is a kludge, however, the way that the device tree looks today is that cpufreq is a child of cpu, and the controller driver (in this case hwpstate_intel) is also a child of cpu. Each controller driver has its own devclass. So that leaves a few options. The most obvious are:

  1. Make all controller drivers share the same devclass
  2. Make the controller driver be a child of the cpufreq driver.

I am/was in favor of #2, however it was a pretty large change and the quick and dirty way (which jhb@ recommended) seemed the most expedient to get the feature landed.


So I can punt until later?


The valid range can indeed change from platform to platform. Of course a user of this sysctl could clamp themselves, but it seemed easier to provide the service for them to keep the API a little more stable. Marking as done, but feel free to complain again and I will adjust.


It's a mask of features supported by the hardware. I don't yet do anything with the features anyway, so I will drop it.

cem added a comment.Mar 28 2019, 1:57 AM

Looks better, thanks. Some more feedback




Would this ever be the case for Intel HWP, or is this just allowing the possibility in general?


error = 0;


This seems a little gratuitous given it's just the De Morgan's transformation of the if error == 0 && .... UNCACHED expression just 19 lines earlier, and every path under that condition has an explicit goto past this condition.




we invoke DRV_TYPE on cf_drv_dev but device_get_nameunit on cf_dev. Is that intentional?


error != 0


I don't think the rework addresses the issue. We now confirm it's not NULL when registering, but don't check that it meets the stronger condition (== cf_drv_dev) until unregister.


Leave it be for now, I guess. This variable is a gross hack, hopefully it isn't used in additional places.


Sure, but what controls the order in which est_identify / intel_hwpstate_identify run such that the latter is guaranteed to run first?


Yeah, that's fair. #2 seems like a reasonable approach to me. I don't know enough about the drawbacks of #1 to dismiss it.


No, you should still initialize sb differently so it doesn't have a drain function.



In here, after finish and before delete, you would add:

if (ret == 0)
    ret = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb));

I'd avoid double underscore names here too


The logical percent range can change to something other than 0-100%? I'm confused.


Isn't this backwards? If we already attached intel hwpstate, we're using intel_speed_shift and it should remain true.

jhb added a comment.Apr 5 2019, 9:37 PM

BTW, this booted fine on my X1 today (I just pulled the git branch).

I think it's worth getting the current type of approach in for now, but eventually this cpufreq design needs to be rethought as it's a mess. Realistically you want something like this to happen for each CPU:

  • fetch info from the info-only driver
  • have a single device_t that is the cpu frequency device and the various cpufreq drivers just compete over that device_t node with probe priorities and moving their identify logic into the probe

I think the way I would do this is to kill the info-only ACPI perf driver by having the acpi_cpu.c driver for cpuX devices fetch the info from _CST or whatever it is and provide that as an ivar on the cpuX device itself. Then cpuX would add a single "cpufreq" driver and the various cpufreq drivers would all end up sharing that devclass I think. I think this would mostly kill off p4tcc as it's priority is such that it would never win (same with acpi_throttle), but that is probably ok. If we really cared about still supporting both systems that did both P-states and T-states then we would basically create two device nodes under the CPU (one for P-states and one for T-states) and let the drivers attach to the relevant one. However, I would probably not bother with that. It seems like too much work for little gain. Leaving the T-state drivers around would let them still be used for CPUs without P-states, we would just lose the ability to mix the two (which is probably a feature anyway).


The 'dev' -> 'cf_dev' rename adds a lot of noise (and is a bit unusual in that almost all (if not actually all) foo_attach, etc. routines in the tree use 'dev' as the variable name. It is minor, but I probably would have left it as 'dev'.


'dev' vs 'cf_drv_dev' would seem more consistent with existing code, though I see that this is always 'sc->cf_drv_dev' in the caller, so probably ok as-is.


To be fair, that is the existing line. For better or for worse a lot of the code in the tree uses 'if (error)' rather than 'if (error != 0)'


In HEAD we just added a SYSCTL_ADD_CONST_STRING that should let you get rid of this cast I think.


I think you can still get multiple things attached, and that you should just add an 'if' here that bails on a mismatch. However, then you could get spurious warnings from the cf_dev == NULL check above. I think though that you could just remove the check (note that the comment is now also stale) and just make the entire thing be something like:

/* If this is the active cpufreq child device, remove the control device as well. */
cf_dev = device_find_child(device_get_parent(dev), "cpufreq", -1);
if (cf_dev == NULL)
    return (0);
sc = device_get_softc(cf_dev);
if (sc->cf_drv_dev == dev)
    device_delete_child(device_get_parent(cf_dev), cf_dev);
return (0);

This is fine as-is. I would perhaps be inclined to invert this flag's sense to assume an actual "real" get method by default and falling back to using the previously set level on "dumb" drivers, something like 'CPUFREQ_FLAG_SETONLY' maybe. I don't think it's worth changing right now though.


I mostly can't think of a non-crappy header to put it in.


I have wondered if we don't want to effectively merge the two drivers in some way.


You don't actually have to rename this driver FWIW. It's fine to have multiple drivers with the same name.


I would probably just leave the name as-is as well FWIW.

cem added inline comments.Apr 5 2019, 10:17 PM

I can't think of an immediate downside to this idea (and like it better than the current approach). Ben, is it objectionable to you?


I don't object to naming different things differently. It can reduce confusion in reports / debugging.

jhb added inline comments.Apr 8 2019, 6:11 PM

The main downside is that suppose after this is in I do the cpufreq rework I sketched out in my overall comment last week. In that world, it's quite easy to have the 'speedshift' driver use a higher probe priority than est and have them be separate drivers. Given that they work quite differently I think this is cleaner long term. I had typed this comment above before my general comment (so it was "earlier" in my thinking).

scottl added a subscriber: scottl.Jun 12 2019, 7:55 PM

After Yamagi brought this work to my attention and told me that it was possible to apply this patch to FreeBSD 12.0 I created this diff from the git repo. I applied the patch to FreeBSD 12.0 on a T470s. With this patch the cores clock independently, power consumption is reduced (especially with low to medium load) and peak performance improved.

During short bursts e.g. loading in chromium all cores jump to the full boost clock (in my case 3.5GHz) or 100MHz below that for long enough to load bloated websites before falling back to a modest 0,9GHz to 1,1GHz. This improves responsiveness a lot over what is achievable with powerd++ or powerd. With powerd++ I have to limit max CPU frequency to 200MHz below the 2,8GHz base clock to avoid triggering ACPI thermal shutdown warnings under sustained load. With the hwpstate patch applied it takes mprime with 4 workers to trigger the same warning (make -sj4 buildworld isn't enough) , but the system remains stable. My system may be a special problem case because Lenovo saw fit to reduce the max CPU temperature to just 75C for the T470s.

I applied the patch to FreeBSD 12.0 on a T470s. ...

Great, thank you very much for testing!

I just cherry picked this into my tree to try on some servers. As a *USER*, I have the following comments:

  1. We need documentation. We cannot expect users to download the Intel SDM and look up the definition of Energy_Performance_Preference to know that 0 is the performance preference, and 100 is the energy efficiency preference.
  1. It would be super nice to be able to set the default epp for the entire system in one shot via a tunable, rather than having to replicate the same tunable 64 times in loader.conf or sysctl.conf or write a shell script to poke the values in. Something like this would simplify things for what I expect is 99% of use cases:


Oh, and in the documentation, it would be handy to note that BIOSes will sometimes disable this functionality and you may need to enable or disable certain common settings in the BIOS.

Works fine on the Google Pixelbook (i7-7Y75) (where the old EST does not work because there aren't any acpi_perf devices).


The extern definition and the check for this variable needs to be #if defined(__amd64__), otherwise there's a linking error on aarch64

cem added inline comments.Sun, Sep 15, 7:52 PM

The whole declaration should be removed. See discussion on the definition @ . (Just curious, why are you trying to build the intel_hwpstate in-review driver on aarch64?)


(I just have a single git fork of src where I apply every patch I need to master. Not building this *driver*, but acpi_perf was being built, and it was affected by this patch.)

scottph commandeered this revision.Fri, Oct 11, 10:37 PM
scottph added a reviewer: bwidawsk.