Intel Speed Shift is Intel's technology to control frequency in hardware, with hints from software.
Please do not use implementation-defined namespace without a reason (I do not see one).
And there too.
Are error values from this function used for anything ? If not, why not return -1 ?
I think this call can sleep and lock some VM locks as well. It is not good to lock while the thread is bound. How hard is to restructure the code to read all MSRs first, then unbind, then do printfs ?
It's mostly uncommon in FreeBSD to return negative error numbers and positive success values — that's more of a Linux (and imported Linux code) thing. Typically we'd return 0 or EFOO and set an output parameter for the level index. (Or sometimes, return the level index, or -1 as an out-of-range sentinel value.)
(Also, the spelling would be return (-ENOENT); the idea being that return() could be defined as a macro.)
Thanks for doing this, Ben! I'm sorry it took me so long to take a look and appreciate your patience.
basic -> basis, I think
Is this case possible? It looks like this is only called with freq_dev, and I don't see a case where freq_dev isn't attached before it is initialized. But I may be mistaken.
I don't see where set is initialized. Probably this is the result of refactoring CPUFREQ_DRV_GET into __get_frequency and you meant to save the result of __get_frequency and use it for this comparison?
I appreciate adding comments documenting these routines, by the way. Thank you.
Hm, the drawback of this approach seems to be we're always invoking CPUFREQ_DRV_TYPE() even for all drivers prior to this change that allow cached frequencies for a given level.
Maybe I'm imagining things and that is a negligible cost?
It seems like we could make an invariant that CPUFREQ_FLAG_UNCACHED => (curr_set->freq == CPUFREQ_VAL_UNKNOWN). Then we would only need to check for != VAL_UNKNOWN to bail early here (like the condition prior to this change). We would then only invoke DRV_TYPE() if the value was UNKNOWN, and only save freq if !UNCACHED.
Again, I might just be overcomplicating this for no reason; feel free to tell me I'm wrong.
/* * foo
(Multi-line comment blocks begin with a blank line. One thing to keep in mind with the style(9) manual page is: "Many of the style rules are implicit in the examples.")
The curr_set pointer into sc->curr_level is still valid, but it is a little confusing to me that the sc->curr_level assignment above is what affects this condition. I would probably replace curr_set->freq here with sc->curr_level.total_set.freq myself. Obviously, this isn't introduced in this change, and maybe it is clear to others. Feel free to ignore.
Maybe assert the CF_MTX lock here for reader clarity. I only suggest this because the goto estimate above jumps over a region that unlocks and relocks this lock, which is a little error prone for future refactoring.
The grammar here is a bit funky
It's a bit uncommon to use static variables in functions; are we sure it's safe in this context? In particular, we hold the CF_MTX for this softc, but do we hold any global lock preventing a race?
(And I guess the flip side — is sizeof(sets) large enough to actually be a concern re: stack use? By my count, cf_setting is 40 bytes on amd64, and MAX_SETTINGS is 256. Yeah, ok, that's too large for the stack.)
Is there a reason to avoid malloc here (which was used prior to this refactoring)?
style(9) nit: error != 0 (above)
It seems like prior to this change, theoretically multiple cpu frequency drivers could attach to a single CPU / cpufreq driver instance — hence all of the enumeration over devs. Now the attached freq_dev will just be whichever driver registered last. In practice this may not be a problem, but either we should enforce that more strictly, or somehow handle it again.
As above — I don't see anything enforcing this invariant. I might be missing something.
Would it be possible to define this in a shared header instead of multiple extern references in .c files?
Oh, I see. This is the hack that is preventing multiple freq drivers from registering on intel.
Doesn't this break est on hardware too old for HWP support? And the kludge only works for Intel devices?
This deserves a note in UPDATING. It might be worth just honoring the old name on both Intel and AMD hardware; I don't really see a downside.
This doesn't match the extern uint32_t declarations of the variable.
Style(9) nit: spaces around = operator.
DEVMETHOD_END is canonical. I don't feel strongly about it, though.
The canonical way to do something like this would be with a DEVICE_PROBE(9) implementation. There is a manual page and also some documentation in kern/device_if.m.
All drivers that identify themselves as candidates for a given devclass are PROBEd to determine which driver should be attached. It is possible for multiple drivers to successfully PROBE a given device; the priority is used to select the best driver for a given device.
For Intel CPUs, you would have est probe a lower priority and hwpstate_intel probe a higher priority (if available), on the same device or pseudo-device.
In practice, I don't think this will ever sleep or allocate in the locked section; none of the formats are large enough to exceed the pre-allocated 1024 byte buffer. In that case, however, it would make sense not to have an automatic drain function and just explicitly SYSCTL_OUT at the end.
inline keyword is probably gratuitous (below as well).
The function is small enough (certainly for non-DEBUG builds) to be inlined automatically, and even if it is not, these routines are only invoked from an (administrative) sysctl anyway — there isn't going to be any significant performance impact either way.
It is fine to leave it if you prefer, though.
Usually I'd reject invalid values rather than clamping them to the valid range, unless it makes sense that the valid range may grow or shrink in the future; here it doesn't (IMO).
what does this number represent?
I'm not sure this check is robust; I don't understand why we would bail if an INFO_ONLY driver is also attached under the same parent? But not if a !INFO_ONLY driver is attached? The condition in est is the opposite — *allow* an INFO_ONLY sibling, but bail if there are existing !INFO_ONLY siblings.
style(9) nit: space before =. Ditto below
it doesn't make sense to me to size buf to 128 but hardcode 18 here.
usually i'd suggest snprintf(foo, sizeof(foo), ...) and then hardcode foo[N], if anything.
either way, there may be no need for this buf? you could just use device_get_nameunit. it would be "debug.intel_hwpstateN" instead of "debug.intel_hwp_debugN", but arguably that is better anyway.
I'm not sure that placing debug sysctls under "debug." instead of just using CTLFLAG_SKIP and placing them in the normal device_get_sysctl_tree location is canonical; it may be.
Is this a leftover debugging print? It doesn't seem like something that should go to console on every invocation of the GET method.
intel hwpstate v2
Fixed some bugs in v1, and tried to address all review feedback.
patches now live here: https://github.com/bwidawsk/freebsd/tree/hwp2
- cpufreq: Add sysctl to show which freq driver
- cpufreq: Rename hwpstate to hwpstate_amd.
- cpufreq: Remove unnecessary indirection to dev
- cpufreq: Extract cf_get functionality
- cpufreq: Extract level information
- cpufreq: Clarifying naming
- cpufreq: Allow cpufreq to act over individual cpus
- cpufreq: Create an uncached flag
- cpufreq: Add support for Intel Speed Shift
- cpufreq: Clean up multi-device usage
- cpufreq: Add some debug messages and comments
- hwpstate_intel: Enable by default
I'm going to punt on this for now... TYPE should be a pretty cheap operation and I'm risk averse :-)
In an earlier revision, the free paths were getting a little messy, and static seemed like an easy solution. Realistically, nothing uses anywhere near 256, but I agree it's problematic. I will rework it to use malloc.
I think my reworks address this, but please re-review.
yeah... jhb@ recommended this way. I'm fine with whatever though, recommendations?
It shouldn't. In the hwpstate_intel code, if the hardware is too old, it sets the bool to false. Same for basically any unsupported hardware.
I do agree what's there now is a kludge, however, the way that the device tree looks today is that cpufreq is a child of cpu, and the controller driver (in this case hwpstate_intel) is also a child of cpu. Each controller driver has its own devclass. So that leaves a few options. The most obvious are:
I am/was in favor of #2, however it was a pretty large change and the quick and dirty way (which jhb@ recommended) seemed the most expedient to get the feature landed.
So I can punt until later?
The valid range can indeed change from platform to platform. Of course a user of this sysctl could clamp themselves, but it seemed easier to provide the service for them to keep the API a little more stable. Marking as done, but feel free to complain again and I will adjust.
It's a mask of features supported by the hardware. I don't yet do anything with the features anyway, so I will drop it.
Looks better, thanks. Some more feedback
Would this ever be the case for Intel HWP, or is this just allowing the possibility in general?
error = 0;
This seems a little gratuitous given it's just the De Morgan's transformation of the if error == 0 && .... UNCACHED expression just 19 lines earlier, and every path under that condition has an explicit goto past this condition.
we invoke DRV_TYPE on cf_drv_dev but device_get_nameunit on cf_dev. Is that intentional?
error != 0
I don't think the rework addresses the issue. We now confirm it's not NULL when registering, but don't check that it meets the stronger condition (== cf_drv_dev) until unregister.
Leave it be for now, I guess. This variable is a gross hack, hopefully it isn't used in additional places.
Sure, but what controls the order in which est_identify / intel_hwpstate_identify run such that the latter is guaranteed to run first?
Yeah, that's fair. #2 seems like a reasonable approach to me. I don't know enough about the drawbacks of #1 to dismiss it.
No, you should still initialize sb differently so it doesn't have a drain function.
sb = sbuf_new(NULL, NULL, 1024, SBUF_FIXEDLEN | SBUF_INCLUDENUL);
In here, after finish and before delete, you would add:
if (ret == 0) ret = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb));
I'd avoid double underscore names here too
The logical percent range can change to something other than 0-100%? I'm confused.
Isn't this backwards? If we already attached intel hwpstate, we're using intel_speed_shift and it should remain true.
BTW, this booted fine on my X1 today (I just pulled the git branch).
I think it's worth getting the current type of approach in for now, but eventually this cpufreq design needs to be rethought as it's a mess. Realistically you want something like this to happen for each CPU:
- fetch info from the info-only driver
- have a single device_t that is the cpu frequency device and the various cpufreq drivers just compete over that device_t node with probe priorities and moving their identify logic into the probe
I think the way I would do this is to kill the info-only ACPI perf driver by having the acpi_cpu.c driver for cpuX devices fetch the info from _CST or whatever it is and provide that as an ivar on the cpuX device itself. Then cpuX would add a single "cpufreq" driver and the various cpufreq drivers would all end up sharing that devclass I think. I think this would mostly kill off p4tcc as it's priority is such that it would never win (same with acpi_throttle), but that is probably ok. If we really cared about still supporting both systems that did both P-states and T-states then we would basically create two device nodes under the CPU (one for P-states and one for T-states) and let the drivers attach to the relevant one. However, I would probably not bother with that. It seems like too much work for little gain. Leaving the T-state drivers around would let them still be used for CPUs without P-states, we would just lose the ability to mix the two (which is probably a feature anyway).
The 'dev' -> 'cf_dev' rename adds a lot of noise (and is a bit unusual in that almost all (if not actually all) foo_attach, etc. routines in the tree use 'dev' as the variable name. It is minor, but I probably would have left it as 'dev'.
'dev' vs 'cf_drv_dev' would seem more consistent with existing code, though I see that this is always 'sc->cf_drv_dev' in the caller, so probably ok as-is.
To be fair, that is the existing line. For better or for worse a lot of the code in the tree uses 'if (error)' rather than 'if (error != 0)'
In HEAD we just added a SYSCTL_ADD_CONST_STRING that should let you get rid of this cast I think.
I think you can still get multiple things attached, and that you should just add an 'if' here that bails on a mismatch. However, then you could get spurious warnings from the cf_dev == NULL check above. I think though that you could just remove the check (note that the comment is now also stale) and just make the entire thing be something like:
/* If this is the active cpufreq child device, remove the control device as well. */ cf_dev = device_find_child(device_get_parent(dev), "cpufreq", -1); if (cf_dev == NULL) return (0); sc = device_get_softc(cf_dev); if (sc->cf_drv_dev == dev) device_delete_child(device_get_parent(cf_dev), cf_dev); return (0);
This is fine as-is. I would perhaps be inclined to invert this flag's sense to assume an actual "real" get method by default and falling back to using the previously set level on "dumb" drivers, something like 'CPUFREQ_FLAG_SETONLY' maybe. I don't think it's worth changing right now though.
I mostly can't think of a non-crappy header to put it in.
I have wondered if we don't want to effectively merge the two drivers in some way.
You don't actually have to rename this driver FWIW. It's fine to have multiple drivers with the same name.
I would probably just leave the name as-is as well FWIW.
The main downside is that suppose after this is in I do the cpufreq rework I sketched out in my overall comment last week. In that world, it's quite easy to have the 'speedshift' driver use a higher probe priority than est and have them be separate drivers. Given that they work quite differently I think this is cleaner long term. I had typed this comment above before my general comment (so it was "earlier" in my thinking).
After Yamagi brought this work to my attention and told me that it was possible to apply this patch to FreeBSD 12.0 I created this diff from the git repo. I applied the patch to FreeBSD 12.0 on a T470s. With this patch the cores clock independently, power consumption is reduced (especially with low to medium load) and peak performance improved.
During short bursts e.g. loading youtube.com in chromium all cores jump to the full boost clock (in my case 3.5GHz) or 100MHz below that for long enough to load bloated websites before falling back to a modest 0,9GHz to 1,1GHz. This improves responsiveness a lot over what is achievable with powerd++ or powerd. With powerd++ I have to limit max CPU frequency to 200MHz below the 2,8GHz base clock to avoid triggering ACPI thermal shutdown warnings under sustained load. With the hwpstate patch applied it takes mprime with 4 workers to trigger the same warning (make -sj4 buildworld isn't enough) , but the system remains stable. My system may be a special problem case because Lenovo saw fit to reduce the max CPU temperature to just 75C for the T470s.
I just cherry picked this into my tree to try on some servers. As a *USER*, I have the following comments:
- We need documentation. We cannot expect users to download the Intel SDM and look up the definition of Energy_Performance_Preference to know that 0 is the performance preference, and 100 is the energy efficiency preference.
- It would be super nice to be able to set the default epp for the entire system in one shot via a tunable, rather than having to replicate the same tunable 64 times in loader.conf or sysctl.conf or write a shell script to poke the values in. Something like this would simplify things for what I expect is 99% of use cases:
Works fine on the Google Pixelbook (i7-7Y75) (where the old EST does not work because there aren't any acpi_perf devices).
The extern definition and the check for this variable needs to be #if defined(__amd64__), otherwise there's a linking error on aarch64
(I just have a single git fork of src where I apply every patch I need to master. Not building this *driver*, but acpi_perf was being built, and it was affected by this patch.)