Page MenuHomeFreeBSD

Add support for Intel Speed Shift
AcceptedPublic

Authored by scottph on Nov 18 2018, 1:09 AM.

Details

Reviewers
kib
jhb
manu
cem
bwidawsk
Group Reviewers
manpages
Summary

Intel Speed Shift is Intel's technology to control frequency in
hardware, with hints from software.

Submitted by: bwidawsk

Test Plan

some performance test?

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
cem added inline comments.Mar 28 2019, 1:57 AM
sys/kern/kern_cpu.c
1135

I don't think the rework addresses the issue. We now confirm it's not NULL when registering, but don't check that it meets the stronger condition (== cf_drv_dev) until unregister.

sys/x86/cpufreq/est.c
53

Leave it be for now, I guess. This variable is a gross hack, hopefully it isn't used in additional places.

922–923

Sure, but what controls the order in which est_identify / intel_hwpstate_identify run such that the latter is guaranteed to run first?

sys/x86/cpufreq/hwpstate_intel.c
108–114

Yeah, that's fair. #2 seems like a reasonable approach to me. I don't know enough about the drawbacks of #1 to dismiss it.

140

No, you should still initialize sb differently so it doesn't have a drain function.

sb = sbuf_new(NULL, NULL, 1024, SBUF_FIXEDLEN | SBUF_INCLUDENUL);

185–186

In here, after finish and before delete, you would add:

if (ret == 0)
    ret = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb));
192

I'd avoid double underscore names here too

235–237

The logical percent range can change to something other than 0-100%? I'm confused.

263–264

Isn't this backwards? If we already attached intel hwpstate, we're using intel_speed_shift and it should remain true.

jhb added a comment.Apr 5 2019, 9:37 PM

BTW, this booted fine on my X1 today (I just pulled the git branch).

I think it's worth getting the current type of approach in for now, but eventually this cpufreq design needs to be rethought as it's a mess. Realistically you want something like this to happen for each CPU:

  • fetch info from the info-only driver
  • have a single device_t that is the cpu frequency device and the various cpufreq drivers just compete over that device_t node with probe priorities and moving their identify logic into the probe

I think the way I would do this is to kill the info-only ACPI perf driver by having the acpi_cpu.c driver for cpuX devices fetch the info from _CST or whatever it is and provide that as an ivar on the cpuX device itself. Then cpuX would add a single "cpufreq" driver and the various cpufreq drivers would all end up sharing that devclass I think. I think this would mostly kill off p4tcc as it's priority is such that it would never win (same with acpi_throttle), but that is probably ok. If we really cared about still supporting both systems that did both P-states and T-states then we would basically create two device nodes under the CPU (one for P-states and one for T-states) and let the drivers attach to the relevant one. However, I would probably not bother with that. It seems like too much work for little gain. Leaving the T-state drivers around would let them still be used for CPUs without P-states, we would just lose the ability to mix the two (which is probably a feature anyway).

sys/kern/kern_cpu.c
152

The 'dev' -> 'cf_dev' rename adds a lot of noise (and is a bit unusual in that almost all (if not actually all) foo_attach, etc. routines in the tree use 'dev' as the variable name. It is minor, but I probably would have left it as 'dev'.

448

'dev' vs 'cf_drv_dev' would seem more consistent with existing code, though I see that this is always 'sc->cf_drv_dev' in the caller, so probably ok as-is.

662

To be fair, that is the existing line. For better or for worse a lot of the code in the tree uses 'if (error)' rather than 'if (error != 0)'

1072

In HEAD we just added a SYSCTL_ADD_CONST_STRING that should let you get rid of this cast I think.

1135

I think you can still get multiple things attached, and that you should just add an 'if' here that bails on a mismatch. However, then you could get spurious warnings from the cf_dev == NULL check above. I think though that you could just remove the check (note that the comment is now also stale) and just make the entire thing be something like:

/* If this is the active cpufreq child device, remove the control device as well. */
cf_dev = device_find_child(device_get_parent(dev), "cpufreq", -1);
if (cf_dev == NULL)
    return (0);
sc = device_get_softc(cf_dev);
if (sc->cf_drv_dev == dev)
    device_delete_child(device_get_parent(cf_dev), cf_dev);
return (0);
sys/sys/cpu.h
126

This is fine as-is. I would perhaps be inclined to invert this flag's sense to assume an actual "real" get method by default and falling back to using the previously set level on "dumb" drivers, something like 'CPUFREQ_FLAG_SETONLY' maybe. I don't think it's worth changing right now though.

sys/x86/cpufreq/est.c
53

I mostly can't think of a non-crappy header to put it in.

922–923

I have wondered if we don't want to effectively merge the two drivers in some way.

sys/x86/cpufreq/hwpstate_amd.c
156

You don't actually have to rename this driver FWIW. It's fine to have multiple drivers with the same name.

329–330

I would probably just leave the name as-is as well FWIW.

cem added inline comments.Apr 5 2019, 10:17 PM
sys/x86/cpufreq/est.c
922–923

I can't think of an immediate downside to this idea (and like it better than the current approach). Ben, is it objectionable to you?

sys/x86/cpufreq/hwpstate_amd.c
156

I don't object to naming different things differently. It can reduce confusion in reports / debugging.

jhb added inline comments.Apr 8 2019, 6:11 PM
sys/x86/cpufreq/est.c
922–923

The main downside is that suppose after this is in I do the cpufreq rework I sketched out in my overall comment last week. In that world, it's quite easy to have the 'speedshift' driver use a higher probe priority than est and have them be separate drivers. Given that they work quite differently I think this is cleaner long term. I had typed this comment above before my general comment (so it was "earlier" in my thinking).

scottl added a subscriber: scottl.Jun 12 2019, 7:55 PM

After Yamagi brought this work to my attention and told me that it was possible to apply this patch to FreeBSD 12.0 I created this diff from the git repo. I applied the patch to FreeBSD 12.0 on a T470s. With this patch the cores clock independently, power consumption is reduced (especially with low to medium load) and peak performance improved.

During short bursts e.g. loading youtube.com in chromium all cores jump to the full boost clock (in my case 3.5GHz) or 100MHz below that for long enough to load bloated websites before falling back to a modest 0,9GHz to 1,1GHz. This improves responsiveness a lot over what is achievable with powerd++ or powerd. With powerd++ I have to limit max CPU frequency to 200MHz below the 2,8GHz base clock to avoid triggering ACPI thermal shutdown warnings under sustained load. With the hwpstate patch applied it takes mprime with 4 workers to trigger the same warning (make -sj4 buildworld isn't enough) , but the system remains stable. My system may be a special problem case because Lenovo saw fit to reduce the max CPU temperature to just 75C for the T470s.

I applied the patch to FreeBSD 12.0 on a T470s. ...

Great, thank you very much for testing!

I just cherry picked this into my tree to try on some servers. As a *USER*, I have the following comments:

  1. We need documentation. We cannot expect users to download the Intel SDM and look up the definition of Energy_Performance_Preference to know that 0 is the performance preference, and 100 is the energy efficiency preference.
  1. It would be super nice to be able to set the default epp for the entire system in one shot via a tunable, rather than having to replicate the same tunable 64 times in loader.conf or sysctl.conf or write a shell script to poke the values in. Something like this would simplify things for what I expect is 99% of use cases:

machdep.intel_speed_shift.default_epp=X

Oh, and in the documentation, it would be handy to note that BIOSes will sometimes disable this functionality and you may need to enable or disable certain common settings in the BIOS.

Works fine on the Google Pixelbook (i7-7Y75) (where the old EST does not work because there aren't any acpi_perf devices).

sys/dev/acpica/acpi_perf.c
53 ↗(On Diff #55252)

The extern definition and the check for this variable needs to be #if defined(__amd64__), otherwise there's a linking error on aarch64

cem added inline comments.Sep 15 2019, 7:52 PM
sys/dev/acpica/acpi_perf.c
53 ↗(On Diff #55252)

The whole declaration should be removed. See discussion on the definition @ https://reviews.freebsd.org/D18028#inline-117423 . (Just curious, why are you trying to build the intel_hwpstate in-review driver on aarch64?)

sys/dev/acpica/acpi_perf.c
53 ↗(On Diff #55252)

(I just have a single git fork of src where I apply every patch I need to master. Not building this *driver*, but acpi_perf was being built, and it was affected by this patch.)

scottph commandeered this revision.Oct 11 2019, 10:37 PM
scottph added a reviewer: bwidawsk.
scottph updated this revision to Diff 63421.Oct 17 2019, 9:52 PM
scottph retitled this revision from Summary: Add support for Intel Speed Shift to Add support for Intel Speed Shift.
scottph edited the summary of this revision. (Show Details)
scottph set the repository for this revision to rS FreeBSD src repository.
  • update cpufreq.4
  • create hwpstate_intel.4
  • guard intel_speed_shift with amd64 || i386
  • undo dev => cf_dev
  • some cf_drv_dev => dev
  • some style fixes
  • SYSCTL_ADD_STRING => SYSCTL_ADD_CONST_STRING
  • SYSCTL_OUT thing

Not done:

  • jhb's suggestion to merge with est(4)
  • gallatin's suggestion to have a single sysctl node for broadcast setting epp
  • jhb's idea for a bigger cpufreq rework.
cem requested changes to this revision.Oct 18 2019, 12:07 AM

Scott, thanks for picking this up! I'm really glad to see this being driven towards commit. I also really appreciate the cf_dev -> dev revert; keeping name changes divorced from functional changes reduces the size of the patch and increases clarity of what has changed.

Some line comments below, but here are the highlights summarized:

  1. The intel_speed_shift global shared between two drivers and ACPI needs to be fixed (removed, replaced with something using newbus). I believe this can be done without major rework of cpufreq, but don't have a concrete proposal for you.
  2. The sysctl sbuf change in this latest version of the patch is slightly flawed; it should be easy to fix.
  3. I'd still like the percentage sysctl to reject rather than clamp invalid percentages.

Thanks again. This will be great to get into the tree.

sys/dev/acpica/acpi_perf.c
53–54 ↗(On Diff #63421)

I still object to this big kludge ( https://reviews.freebsd.org/D18028#inline-117423 )

sys/kern/kern_cpu.c
662

mea culpa. Scott, thanks for doing it anyway :)

1071

Would device_get_nameunit() make sense, or is it a singleton? Or is that just not useful information for this sysctl?

sys/x86/cpufreq/est.c
922–923

When and if you get a chance to do that rework (I don't think it's happened yet?), the drivers could be pulled apart again. It's one of those "the tree isn't written in stone, we can always change it later" plus "don't block working things on good ideas no one has time for" deals, IMO.

sys/x86/cpufreq/hwpstate_intel.c
59

Given the core kernel refers to this variable (ACPI), this driver cannot be compiled out (even as a module). I mean, I've said before and will repeat that I think this mechanism should die anyway, but — as-is, it creates some additional limitations. (E.g., the "device cpufreq" line in the new manual page is entirely moot since an amd64 kernel without the cpufreq device won't link.)

140

This one is still open and easy to fix.

186–187

This is good but you missed the sbuf_new change that goes along with it above.

237–240

I'd still encourage just rejecting invalid percentages at the sysctl boundary. For this use, negative and >100 percentages do not make sense.

This revision now requires changes to proceed.Oct 18 2019, 12:07 AM
bcr accepted this revision as: manpages.Oct 18 2019, 6:07 AM
bcr added a subscriber: bcr.

OK from manpages. Thanks for working on this.

scottph updated this revision to Diff 63566.Wed, Oct 23, 1:31 AM
scottph edited the summary of this revision. (Show Details)
  • remove the machdep.intel_speed_shift tunable
  • call out to hwpstate_intel's identify from est's
  • show the cf_drv_dev's nameunit in dev.cpufreq.%d.freq_driver
  • initialize the debug sysctl's sbuf w/o a drain function
  • reject percentages outside [0, 100] in epp sysctl
scottph marked 12 inline comments as done.Wed, Oct 23, 1:42 AM

Likewise, thanks for the speedy reviews all.

sys/kern/kern_cpu.c
1071

I don't see any harm in including the unit number, and it might be informative if for some reasons cpus get initialized out of order. then you might notice that cpu2 has hwpstate_intel5 or whatever.

sys/x86/cpufreq/est.c
922–923

Does this seem like a reasonable way to intertwine the two identify routines?

sys/x86/cpufreq/hwpstate_intel.c
140

Sorry missed this one in reading through the old comments. I think it's fixed properly now.

cem added a comment.Thu, Oct 24, 12:19 AM

Thanks, this is kinda weird but I like it better than before.

One thing that doesn't seem to have been done yet but needs to be for this to work is a module dependency on intel_hwpstate by est (if we are retaining separate .ko klds). Otherwise the dynamic linker won't look for the now-public identify routine in the other driver (it only searches the explicit dependencies for external symbols). It would also just be fine to compile both into a composite kld, I think. Either way, there may be some missing build glue here.

sys/x86/cpufreq/est.c
856

Probably put this in a header (it can be a totally trivial intel_cpufreq_internal.h, for example) rather than declaring it in both drivers.

922–923

It's kinda funky, but tolerable IMO. If John has strong objections, he overrules me :-). Newbus doesn't really make this easy/clean on us.

sys/x86/cpufreq/hwpstate_intel.c
69

I'd add a comment referring to est_identify, as well as explaining why we do this funky thing for now. And IMO, the preferred commented-out code style is #if 0 / foo / #endif, but I'm not sure if that's just personal preference or something we actually follow consistently in the tree.

103–107

This comment is now mildly stale.

140

Looks good to me, thanks!

Alternatively you could just read the 1-4 64-bit MSRs in the CPU-bound section, storing them on the stack, and use sbuf printing only after unbinding the CPU. That'd avoid malloc/free and also avoid possibly truncating output. I'm not requesting any change, it's just another valid approach.

233

The compiler will probably complain about the val < 0 check since val is unsigned. if (val > 100) ... alone is sufficient.

286

FWIW, you can just check the flag in the cached cpu_power_eax instead of cpuid(6) + regs[0] after rS353712. cpu_power_foo represents the cpuid 6 leaf. (The name isn't especially creative; sorry.)

Fine to leave as-is, especially if you want to MFC this.

293

This is probably not necessary, or should be conditional on bootverbose.

Because the device isn't marked "quiet," the description set in intel_hwpstate_probe is automatically printed at boot time when the device is attached. E.g.,

hwpstate0: <Cool`n'Quiet 2.0> on cpu0

Where hwpstate is the driver, and "Cool`n'Quiet 2.0" is the string from device_set_desc() below.

301

BUS_PROBE_NOWILDCARD might be more appropriate for now. I don't think the current value will break anything, though, so feel free to leave as-is.

(If/when est and intel_hwpstate are separated and probe for the same newbus pseudo-device, BUS_PROBE_DEFAULT for this one and BUS_PROBE_LOW_PRIORITY for est might make sense.)

383

Ditto other comment about cpuid

jhb added a comment.Thu, Oct 24, 6:00 PM

I would leave the identify routine uncommented in the hwpstate DEVMETHODs. It doesn't hurt to run it twice.

sys/x86/cpufreq/est.c
922–923

This is fine for now.

scottph updated this revision to Diff 63730.Mon, Oct 28, 6:03 PM
scottph marked 2 inline comments as done.
scottph marked 8 inline comments as done.Mon, Oct 28, 6:09 PM
scottph added inline comments.
sys/x86/cpufreq/hwpstate_intel.c
103–107

we shouldn't be dependent on the ordering now, so i've taken away the order and the comment.

140

That seems like a reasonable idea but I left it as-is for now.

286

I'd prefer to do that as a follow on change for ease of mfc.

cem added a comment.Tue, Oct 29, 2:01 AM

The new patch doesn't have context, so phabricator doesn't know how to show inter-diff and us casual readers cannot expand adjacent code in the web interface. Would you mind uploading a diff with full context (i.e., diff -U99999 ... or using the arc utility)? Thanks!

sys/x86/cpufreq/hwpstate_intel.c
140

works for me.

286

works for me.

scottph updated this revision to Diff 63775.Tue, Oct 29, 11:23 PM
scottph marked an inline comment as done.

ah sorry, here's the same patch, but with context restored.

cem added a comment.Tue, Oct 29, 11:29 PM

ah sorry, here's the same patch, but with context restored.

Thanks!

cem accepted this revision.Tue, Oct 29, 11:38 PM

Looks good to me, thanks!

sys/x86/cpufreq/hwpstate_intel.c
103–107

perfect

104

Minor style nit: we prefer NULL for null pointer values.

sys/x86/cpufreq/hwpstate_intel_internal.h
25–26

I think you'll need to add a $FreeBSD$ line at the end of your header block, or the pre-commit hooks will block the commit. See other headers for example.

This revision is now accepted and ready to land.Tue, Oct 29, 11:38 PM