Intel Speed Shift is Intel's technology to control frequency in
hardware, with hints from software.
Submitted by: bwidawsk
I don't think the rework addresses the issue. We now confirm it's not NULL when registering, but don't check that it meets the stronger condition (== cf_drv_dev) until unregister.
Leave it be for now, I guess. This variable is a gross hack, hopefully it isn't used in additional places.
Sure, but what controls the order in which est_identify / intel_hwpstate_identify run such that the latter is guaranteed to run first?
Yeah, that's fair. #2 seems like a reasonable approach to me. I don't know enough about the drawbacks of #1 to dismiss it.
No, you should still initialize sb differently so it doesn't have a drain function.
sb = sbuf_new(NULL, NULL, 1024, SBUF_FIXEDLEN | SBUF_INCLUDENUL);
In here, after finish and before delete, you would add:
if (ret == 0) ret = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb));
I'd avoid double underscore names here too
The logical percent range can change to something other than 0-100%? I'm confused.
Isn't this backwards? If we already attached intel hwpstate, we're using intel_speed_shift and it should remain true.
BTW, this booted fine on my X1 today (I just pulled the git branch).
I think it's worth getting the current type of approach in for now, but eventually this cpufreq design needs to be rethought as it's a mess. Realistically you want something like this to happen for each CPU:
I think the way I would do this is to kill the info-only ACPI perf driver by having the acpi_cpu.c driver for cpuX devices fetch the info from _CST or whatever it is and provide that as an ivar on the cpuX device itself. Then cpuX would add a single "cpufreq" driver and the various cpufreq drivers would all end up sharing that devclass I think. I think this would mostly kill off p4tcc as it's priority is such that it would never win (same with acpi_throttle), but that is probably ok. If we really cared about still supporting both systems that did both P-states and T-states then we would basically create two device nodes under the CPU (one for P-states and one for T-states) and let the drivers attach to the relevant one. However, I would probably not bother with that. It seems like too much work for little gain. Leaving the T-state drivers around would let them still be used for CPUs without P-states, we would just lose the ability to mix the two (which is probably a feature anyway).
The 'dev' -> 'cf_dev' rename adds a lot of noise (and is a bit unusual in that almost all (if not actually all) foo_attach, etc. routines in the tree use 'dev' as the variable name. It is minor, but I probably would have left it as 'dev'.
'dev' vs 'cf_drv_dev' would seem more consistent with existing code, though I see that this is always 'sc->cf_drv_dev' in the caller, so probably ok as-is.
To be fair, that is the existing line. For better or for worse a lot of the code in the tree uses 'if (error)' rather than 'if (error != 0)'
In HEAD we just added a SYSCTL_ADD_CONST_STRING that should let you get rid of this cast I think.
I think you can still get multiple things attached, and that you should just add an 'if' here that bails on a mismatch. However, then you could get spurious warnings from the cf_dev == NULL check above. I think though that you could just remove the check (note that the comment is now also stale) and just make the entire thing be something like:
/* If this is the active cpufreq child device, remove the control device as well. */ cf_dev = device_find_child(device_get_parent(dev), "cpufreq", -1); if (cf_dev == NULL) return (0); sc = device_get_softc(cf_dev); if (sc->cf_drv_dev == dev) device_delete_child(device_get_parent(cf_dev), cf_dev); return (0);
This is fine as-is. I would perhaps be inclined to invert this flag's sense to assume an actual "real" get method by default and falling back to using the previously set level on "dumb" drivers, something like 'CPUFREQ_FLAG_SETONLY' maybe. I don't think it's worth changing right now though.
I mostly can't think of a non-crappy header to put it in.
I have wondered if we don't want to effectively merge the two drivers in some way.
You don't actually have to rename this driver FWIW. It's fine to have multiple drivers with the same name.
I would probably just leave the name as-is as well FWIW.
The main downside is that suppose after this is in I do the cpufreq rework I sketched out in my overall comment last week. In that world, it's quite easy to have the 'speedshift' driver use a higher probe priority than est and have them be separate drivers. Given that they work quite differently I think this is cleaner long term. I had typed this comment above before my general comment (so it was "earlier" in my thinking).
After Yamagi brought this work to my attention and told me that it was possible to apply this patch to FreeBSD 12.0 I created this diff from the git repo. I applied the patch to FreeBSD 12.0 on a T470s. With this patch the cores clock independently, power consumption is reduced (especially with low to medium load) and peak performance improved.
During short bursts e.g. loading youtube.com in chromium all cores jump to the full boost clock (in my case 3.5GHz) or 100MHz below that for long enough to load bloated websites before falling back to a modest 0,9GHz to 1,1GHz. This improves responsiveness a lot over what is achievable with powerd++ or powerd. With powerd++ I have to limit max CPU frequency to 200MHz below the 2,8GHz base clock to avoid triggering ACPI thermal shutdown warnings under sustained load. With the hwpstate patch applied it takes mprime with 4 workers to trigger the same warning (make -sj4 buildworld isn't enough) , but the system remains stable. My system may be a special problem case because Lenovo saw fit to reduce the max CPU temperature to just 75C for the T470s.
I just cherry picked this into my tree to try on some servers. As a *USER*, I have the following comments:
Works fine on the Google Pixelbook (i7-7Y75) (where the old EST does not work because there aren't any acpi_perf devices).
|53 ↗||(On Diff #55252)|
The extern definition and the check for this variable needs to be #if defined(__amd64__), otherwise there's a linking error on aarch64
|53 ↗||(On Diff #55252)|
(I just have a single git fork of src where I apply every patch I need to master. Not building this *driver*, but acpi_perf was being built, and it was affected by this patch.)
Scott, thanks for picking this up! I'm really glad to see this being driven towards commit. I also really appreciate the cf_dev -> dev revert; keeping name changes divorced from functional changes reduces the size of the patch and increases clarity of what has changed.
Some line comments below, but here are the highlights summarized:
Thanks again. This will be great to get into the tree.
|53–54 ↗||(On Diff #63421)|
I still object to this big kludge ( https://reviews.freebsd.org/D18028#inline-117423 )
mea culpa. Scott, thanks for doing it anyway :)
Would device_get_nameunit() make sense, or is it a singleton? Or is that just not useful information for this sysctl?
When and if you get a chance to do that rework (I don't think it's happened yet?), the drivers could be pulled apart again. It's one of those "the tree isn't written in stone, we can always change it later" plus "don't block working things on good ideas no one has time for" deals, IMO.
Given the core kernel refers to this variable (ACPI), this driver cannot be compiled out (even as a module). I mean, I've said before and will repeat that I think this mechanism should die anyway, but — as-is, it creates some additional limitations. (E.g., the "device cpufreq" line in the new manual page is entirely moot since an amd64 kernel without the cpufreq device won't link.)
This one is still open and easy to fix.
This is good but you missed the sbuf_new change that goes along with it above.
I'd still encourage just rejecting invalid percentages at the sysctl boundary. For this use, negative and >100 percentages do not make sense.
Likewise, thanks for the speedy reviews all.
I don't see any harm in including the unit number, and it might be informative if for some reasons cpus get initialized out of order. then you might notice that cpu2 has hwpstate_intel5 or whatever.
Does this seem like a reasonable way to intertwine the two identify routines?
Sorry missed this one in reading through the old comments. I think it's fixed properly now.
Thanks, this is kinda weird but I like it better than before.
One thing that doesn't seem to have been done yet but needs to be for this to work is a module dependency on intel_hwpstate by est (if we are retaining separate .ko klds). Otherwise the dynamic linker won't look for the now-public identify routine in the other driver (it only searches the explicit dependencies for external symbols). It would also just be fine to compile both into a composite kld, I think. Either way, there may be some missing build glue here.
Probably put this in a header (it can be a totally trivial intel_cpufreq_internal.h, for example) rather than declaring it in both drivers.
It's kinda funky, but tolerable IMO. If John has strong objections, he overrules me :-). Newbus doesn't really make this easy/clean on us.
I'd add a comment referring to est_identify, as well as explaining why we do this funky thing for now. And IMO, the preferred commented-out code style is #if 0 / foo / #endif, but I'm not sure if that's just personal preference or something we actually follow consistently in the tree.
This comment is now mildly stale.
Looks good to me, thanks!
Alternatively you could just read the 1-4 64-bit MSRs in the CPU-bound section, storing them on the stack, and use sbuf printing only after unbinding the CPU. That'd avoid malloc/free and also avoid possibly truncating output. I'm not requesting any change, it's just another valid approach.
The compiler will probably complain about the val < 0 check since val is unsigned. if (val > 100) ... alone is sufficient.
FWIW, you can just check the flag in the cached cpu_power_eax instead of cpuid(6) + regs after rS353712. cpu_power_foo represents the cpuid 6 leaf. (The name isn't especially creative; sorry.)
Fine to leave as-is, especially if you want to MFC this.
This is probably not necessary, or should be conditional on bootverbose.
Because the device isn't marked "quiet," the description set in intel_hwpstate_probe is automatically printed at boot time when the device is attached. E.g.,
hwpstate0: <Cool`n'Quiet 2.0> on cpu0
Where hwpstate is the driver, and "Cool`n'Quiet 2.0" is the string from device_set_desc() below.
BUS_PROBE_NOWILDCARD might be more appropriate for now. I don't think the current value will break anything, though, so feel free to leave as-is.
(If/when est and intel_hwpstate are separated and probe for the same newbus pseudo-device, BUS_PROBE_DEFAULT for this one and BUS_PROBE_LOW_PRIORITY for est might make sense.)
Ditto other comment about cpuid
The new patch doesn't have context, so phabricator doesn't know how to show inter-diff and us casual readers cannot expand adjacent code in the web interface. Would you mind uploading a diff with full context (i.e., diff -U99999 ... or using the arc utility)? Thanks!
works for me.
works for me.
Looks good to me, thanks!
Minor style nit: we prefer NULL for null pointer values.
I think you'll need to add a $FreeBSD$ line at the end of your header block, or the pre-commit hooks will block the commit. See other headers for example.