Page MenuHomeFreeBSD

Add support for Intel Speed Shift
ClosedPublic

Authored by scottph on Nov 18 2018, 1:09 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Dec 13, 2:02 AM
Unknown Object (File)
Sun, Dec 1, 2:19 AM
Unknown Object (File)
Fri, Nov 29, 11:51 PM
Unknown Object (File)
Tue, Nov 26, 7:40 PM
Unknown Object (File)
Mon, Nov 25, 3:08 PM
Unknown Object (File)
Sun, Nov 24, 11:01 PM
Unknown Object (File)
Sat, Nov 23, 10:40 AM
Unknown Object (File)
Fri, Nov 22, 5:23 AM

Details

Summary

Intel Speed Shift is Intel's technology to control frequency in
hardware, with hints from software.

Submitted by: bwidawsk

Test Plan

some performance test?

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/x86/cpufreq/hwpstate_intel.c
191 ↗(On Diff #55252)

I'd avoid double underscore names here too

234–236 ↗(On Diff #53817)

The logical percent range can change to something other than 0-100%? I'm confused.

BTW, this booted fine on my X1 today (I just pulled the git branch).

I think it's worth getting the current type of approach in for now, but eventually this cpufreq design needs to be rethought as it's a mess. Realistically you want something like this to happen for each CPU:

  • fetch info from the info-only driver
  • have a single device_t that is the cpu frequency device and the various cpufreq drivers just compete over that device_t node with probe priorities and moving their identify logic into the probe

I think the way I would do this is to kill the info-only ACPI perf driver by having the acpi_cpu.c driver for cpuX devices fetch the info from _CST or whatever it is and provide that as an ivar on the cpuX device itself. Then cpuX would add a single "cpufreq" driver and the various cpufreq drivers would all end up sharing that devclass I think. I think this would mostly kill off p4tcc as it's priority is such that it would never win (same with acpi_throttle), but that is probably ok. If we really cared about still supporting both systems that did both P-states and T-states then we would basically create two device nodes under the CPU (one for P-states and one for T-states) and let the drivers attach to the relevant one. However, I would probably not bother with that. It seems like too much work for little gain. Leaving the T-state drivers around would let them still be used for CPUs without P-states, we would just lose the ability to mix the two (which is probably a feature anyway).

sys/kern/kern_cpu.c
152 ↗(On Diff #55252)

The 'dev' -> 'cf_dev' rename adds a lot of noise (and is a bit unusual in that almost all (if not actually all) foo_attach, etc. routines in the tree use 'dev' as the variable name. It is minor, but I probably would have left it as 'dev'.

417 ↗(On Diff #55252)

'dev' vs 'cf_drv_dev' would seem more consistent with existing code, though I see that this is always 'sc->cf_drv_dev' in the caller, so probably ok as-is.

602 ↗(On Diff #55252)

To be fair, that is the existing line. For better or for worse a lot of the code in the tree uses 'if (error)' rather than 'if (error != 0)'

1075 ↗(On Diff #55252)

In HEAD we just added a SYSCTL_ADD_CONST_STRING that should let you get rid of this cast I think.

1118 ↗(On Diff #53817)

I think you can still get multiple things attached, and that you should just add an 'if' here that bails on a mismatch. However, then you could get spurious warnings from the cf_dev == NULL check above. I think though that you could just remove the check (note that the comment is now also stale) and just make the entire thing be something like:

/* If this is the active cpufreq child device, remove the control device as well. */
cf_dev = device_find_child(device_get_parent(dev), "cpufreq", -1);
if (cf_dev == NULL)
    return (0);
sc = device_get_softc(cf_dev);
if (sc->cf_drv_dev == dev)
    device_delete_child(device_get_parent(cf_dev), cf_dev);
return (0);
sys/sys/cpu.h
126 ↗(On Diff #55252)

This is fine as-is. I would perhaps be inclined to invert this flag's sense to assume an actual "real" get method by default and falling back to using the previously set level on "dumb" drivers, something like 'CPUFREQ_FLAG_SETONLY' maybe. I don't think it's worth changing right now though.

sys/x86/cpufreq/est.c
53 ↗(On Diff #53817)

I mostly can't think of a non-crappy header to put it in.

921–922 ↗(On Diff #53817)

I have wondered if we don't want to effectively merge the two drivers in some way.

sys/x86/cpufreq/hwpstate_amd.c
156 ↗(On Diff #55252)

You don't actually have to rename this driver FWIW. It's fine to have multiple drivers with the same name.

329–330 ↗(On Diff #53817)

I would probably just leave the name as-is as well FWIW.

sys/x86/cpufreq/est.c
921–922 ↗(On Diff #53817)

I can't think of an immediate downside to this idea (and like it better than the current approach). Ben, is it objectionable to you?

sys/x86/cpufreq/hwpstate_amd.c
156 ↗(On Diff #55252)

I don't object to naming different things differently. It can reduce confusion in reports / debugging.

sys/x86/cpufreq/est.c
921–922 ↗(On Diff #53817)

The main downside is that suppose after this is in I do the cpufreq rework I sketched out in my overall comment last week. In that world, it's quite easy to have the 'speedshift' driver use a higher probe priority than est and have them be separate drivers. Given that they work quite differently I think this is cleaner long term. I had typed this comment above before my general comment (so it was "earlier" in my thinking).

After Yamagi brought this work to my attention and told me that it was possible to apply this patch to FreeBSD 12.0 I created this diff from the git repo. I applied the patch to FreeBSD 12.0 on a T470s. With this patch the cores clock independently, power consumption is reduced (especially with low to medium load) and peak performance improved.

During short bursts e.g. loading youtube.com in chromium all cores jump to the full boost clock (in my case 3.5GHz) or 100MHz below that for long enough to load bloated websites before falling back to a modest 0,9GHz to 1,1GHz. This improves responsiveness a lot over what is achievable with powerd++ or powerd. With powerd++ I have to limit max CPU frequency to 200MHz below the 2,8GHz base clock to avoid triggering ACPI thermal shutdown warnings under sustained load. With the hwpstate patch applied it takes mprime with 4 workers to trigger the same warning (make -sj4 buildworld isn't enough) , but the system remains stable. My system may be a special problem case because Lenovo saw fit to reduce the max CPU temperature to just 75C for the T470s.

I applied the patch to FreeBSD 12.0 on a T470s. ...

Great, thank you very much for testing!

I just cherry picked this into my tree to try on some servers. As a *USER*, I have the following comments:

  1. We need documentation. We cannot expect users to download the Intel SDM and look up the definition of Energy_Performance_Preference to know that 0 is the performance preference, and 100 is the energy efficiency preference.
  1. It would be super nice to be able to set the default epp for the entire system in one shot via a tunable, rather than having to replicate the same tunable 64 times in loader.conf or sysctl.conf or write a shell script to poke the values in. Something like this would simplify things for what I expect is 99% of use cases:

machdep.intel_speed_shift.default_epp=X

Oh, and in the documentation, it would be handy to note that BIOSes will sometimes disable this functionality and you may need to enable or disable certain common settings in the BIOS.

Works fine on the Google Pixelbook (i7-7Y75) (where the old EST does not work because there aren't any acpi_perf devices).

sys/dev/acpica/acpi_perf.c
53 ↗(On Diff #55252)

The extern definition and the check for this variable needs to be #if defined(__amd64__), otherwise there's a linking error on aarch64

sys/dev/acpica/acpi_perf.c
53 ↗(On Diff #55252)

The whole declaration should be removed. See discussion on the definition @ https://reviews.freebsd.org/D18028#inline-117423 . (Just curious, why are you trying to build the intel_hwpstate in-review driver on aarch64?)

sys/dev/acpica/acpi_perf.c
53 ↗(On Diff #55252)

(I just have a single git fork of src where I apply every patch I need to master. Not building this *driver*, but acpi_perf was being built, and it was affected by this patch.)

scottph added a reviewer: bwidawsk.
scottph retitled this revision from Summary: Add support for Intel Speed Shift to Add support for Intel Speed Shift.
scottph edited the summary of this revision. (Show Details)
scottph set the repository for this revision to rS FreeBSD src repository - subversion.
  • update cpufreq.4
  • create hwpstate_intel.4
  • guard intel_speed_shift with amd64 || i386
  • undo dev => cf_dev
  • some cf_drv_dev => dev
  • some style fixes
  • SYSCTL_ADD_STRING => SYSCTL_ADD_CONST_STRING
  • SYSCTL_OUT thing

Not done:

  • jhb's suggestion to merge with est(4)
  • gallatin's suggestion to have a single sysctl node for broadcast setting epp
  • jhb's idea for a bigger cpufreq rework.
cem requested changes to this revision.Oct 18 2019, 12:07 AM

Scott, thanks for picking this up! I'm really glad to see this being driven towards commit. I also really appreciate the cf_dev -> dev revert; keeping name changes divorced from functional changes reduces the size of the patch and increases clarity of what has changed.

Some line comments below, but here are the highlights summarized:

  1. The intel_speed_shift global shared between two drivers and ACPI needs to be fixed (removed, replaced with something using newbus). I believe this can be done without major rework of cpufreq, but don't have a concrete proposal for you.
  2. The sysctl sbuf change in this latest version of the patch is slightly flawed; it should be easy to fix.
  3. I'd still like the percentage sysctl to reject rather than clamp invalid percentages.

Thanks again. This will be great to get into the tree.

sys/dev/acpica/acpi_perf.c
53–54 ↗(On Diff #63421)

I still object to this big kludge ( https://reviews.freebsd.org/D18028#inline-117423 )

sys/kern/kern_cpu.c
1071 ↗(On Diff #63421)

Would device_get_nameunit() make sense, or is it a singleton? Or is that just not useful information for this sysctl?

602 ↗(On Diff #55252)

mea culpa. Scott, thanks for doing it anyway :)

sys/x86/cpufreq/est.c
921–922 ↗(On Diff #53817)

When and if you get a chance to do that rework (I don't think it's happened yet?), the drivers could be pulled apart again. It's one of those "the tree isn't written in stone, we can always change it later" plus "don't block working things on good ideas no one has time for" deals, IMO.

sys/x86/cpufreq/hwpstate_intel.c
58 ↗(On Diff #63421)

Given the core kernel refers to this variable (ACPI), this driver cannot be compiled out (even as a module). I mean, I've said before and will repeat that I think this mechanism should die anyway, but — as-is, it creates some additional limitations. (E.g., the "device cpufreq" line in the new manual page is entirely moot since an amd64 kernel without the cpufreq device won't link.)

185–186 ↗(On Diff #63421)

This is good but you missed the sbuf_new change that goes along with it above.

236–239 ↗(On Diff #63421)

I'd still encourage just rejecting invalid percentages at the sysctl boundary. For this use, negative and >100 percentages do not make sense.

139 ↗(On Diff #53817)

This one is still open and easy to fix.

This revision now requires changes to proceed.Oct 18 2019, 12:07 AM
bcr added a subscriber: bcr.

OK from manpages. Thanks for working on this.

scottph edited the summary of this revision. (Show Details)
  • remove the machdep.intel_speed_shift tunable
  • call out to hwpstate_intel's identify from est's
  • show the cf_drv_dev's nameunit in dev.cpufreq.%d.freq_driver
  • initialize the debug sysctl's sbuf w/o a drain function
  • reject percentages outside [0, 100] in epp sysctl

Likewise, thanks for the speedy reviews all.

sys/kern/kern_cpu.c
1071 ↗(On Diff #63421)

I don't see any harm in including the unit number, and it might be informative if for some reasons cpus get initialized out of order. then you might notice that cpu2 has hwpstate_intel5 or whatever.

sys/x86/cpufreq/est.c
921–922 ↗(On Diff #53817)

Does this seem like a reasonable way to intertwine the two identify routines?

sys/x86/cpufreq/hwpstate_intel.c
139 ↗(On Diff #53817)

Sorry missed this one in reading through the old comments. I think it's fixed properly now.

Thanks, this is kinda weird but I like it better than before.

One thing that doesn't seem to have been done yet but needs to be for this to work is a module dependency on intel_hwpstate by est (if we are retaining separate .ko klds). Otherwise the dynamic linker won't look for the now-public identify routine in the other driver (it only searches the explicit dependencies for external symbols). It would also just be fine to compile both into a composite kld, I think. Either way, there may be some missing build glue here.

sys/x86/cpufreq/est.c
854 ↗(On Diff #63566)

Probably put this in a header (it can be a totally trivial intel_cpufreq_internal.h, for example) rather than declaring it in both drivers.

921–922 ↗(On Diff #53817)

It's kinda funky, but tolerable IMO. If John has strong objections, he overrules me :-). Newbus doesn't really make this easy/clean on us.

sys/x86/cpufreq/hwpstate_intel.c
68 ↗(On Diff #63566)

I'd add a comment referring to est_identify, as well as explaining why we do this funky thing for now. And IMO, the preferred commented-out code style is #if 0 / foo / #endif, but I'm not sure if that's just personal preference or something we actually follow consistently in the tree.

102–106 ↗(On Diff #63566)

This comment is now mildly stale.

232 ↗(On Diff #63566)

The compiler will probably complain about the val < 0 check since val is unsigned. if (val > 100) ... alone is sufficient.

285 ↗(On Diff #63566)

FWIW, you can just check the flag in the cached cpu_power_eax instead of cpuid(6) + regs[0] after rS353712. cpu_power_foo represents the cpuid 6 leaf. (The name isn't especially creative; sorry.)

Fine to leave as-is, especially if you want to MFC this.

292 ↗(On Diff #63566)

This is probably not necessary, or should be conditional on bootverbose.

Because the device isn't marked "quiet," the description set in intel_hwpstate_probe is automatically printed at boot time when the device is attached. E.g.,

hwpstate0: <Cool`n'Quiet 2.0> on cpu0

Where hwpstate is the driver, and "Cool`n'Quiet 2.0" is the string from device_set_desc() below.

300 ↗(On Diff #63566)

BUS_PROBE_NOWILDCARD might be more appropriate for now. I don't think the current value will break anything, though, so feel free to leave as-is.

(If/when est and intel_hwpstate are separated and probe for the same newbus pseudo-device, BUS_PROBE_DEFAULT for this one and BUS_PROBE_LOW_PRIORITY for est might make sense.)

382 ↗(On Diff #63566)

Ditto other comment about cpuid

139 ↗(On Diff #53817)

Looks good to me, thanks!

Alternatively you could just read the 1-4 64-bit MSRs in the CPU-bound section, storing them on the stack, and use sbuf printing only after unbinding the CPU. That'd avoid malloc/free and also avoid possibly truncating output. I'm not requesting any change, it's just another valid approach.

I would leave the identify routine uncommented in the hwpstate DEVMETHODs. It doesn't hurt to run it twice.

sys/x86/cpufreq/est.c
921–922 ↗(On Diff #53817)

This is fine for now.

scottph marked 2 inline comments as done.
scottph added inline comments.
sys/x86/cpufreq/hwpstate_intel.c
102–106 ↗(On Diff #63566)

we shouldn't be dependent on the ordering now, so i've taken away the order and the comment.

285 ↗(On Diff #63566)

I'd prefer to do that as a follow on change for ease of mfc.

139 ↗(On Diff #53817)

That seems like a reasonable idea but I left it as-is for now.

The new patch doesn't have context, so phabricator doesn't know how to show inter-diff and us casual readers cannot expand adjacent code in the web interface. Would you mind uploading a diff with full context (i.e., diff -U99999 ... or using the arc utility)? Thanks!

sys/x86/cpufreq/hwpstate_intel.c
285 ↗(On Diff #63566)

works for me.

139 ↗(On Diff #53817)

works for me.

scottph marked an inline comment as done.

ah sorry, here's the same patch, but with context restored.

ah sorry, here's the same patch, but with context restored.

Thanks!

Looks good to me, thanks!

sys/x86/cpufreq/hwpstate_intel.c
104 ↗(On Diff #63775)

Minor style nit: we prefer NULL for null pointer values.

102–106 ↗(On Diff #63566)

perfect

sys/x86/cpufreq/hwpstate_intel_internal.h
25–26 ↗(On Diff #63775)

I think you'll need to add a $FreeBSD$ line at the end of your header block, or the pre-commit hooks will block the commit. See other headers for example.

This revision is now accepted and ready to land.Oct 29 2019, 11:38 PM

Is it just me or did setting epp break in the latest version?

In D18028#493129, @greg_unrelenting.technology wrote:

Is it just me or did setting epp break in the latest version?

Actually, it gets broken by suspend/resuming once.
Before suspend, it's freely adjustable but with weird rounding:

% doas sysctl dev.hwpstate_intel.{0..3}.epp=50
dev.hwpstate_intel.0.epp: 89 -> 49
[…]

After a suspend/resume cycle, it's no longer adjustable:

% doas sysctl dev.hwpstate_intel.{0..3}.epp=90
dev.hwpstate_intel.0.epp: 50 -> 50

I've tried the patch on a recent current and added some printf-debug. The output of rdmsr_safe(MSR_IA32_HWP_REQUEST, &requested); changes after the suspend:
before zzz :
Read: 0x0000000033001701

after zzz:
Read: 0xFFFFFFFF82612000

This revision was automatically updated to reflect the committed changes.