Page MenuHomeFreeBSD

Summary: Add support for Intel Speed Shift
Needs ReviewPublic

Authored by bwidawsk on Nov 18 2018, 1:09 AM.

Details

Reviewers
kib
jhb
manu
cem
Summary

Intel Speed Shift is Intel's technology to control frequency in hardware, with hints from software.

Test Plan

some performance test?

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 23188
Build 22231: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
bwidawsk added a subscriber: cem.
bwidawsk retitled this revision from Summary: RFC: Add support for Intel Speed Shift to Summary: Add support for Intel Speed Shift.Feb 12 2019, 12:29 AM
bwidawsk edited the summary of this revision. (Show Details)
kib added inline comments.Feb 15 2019, 1:28 PM
sys/dev/acpica/acpi_perf.c
53

Use bool perhaps ?

149

Comparing with true is too baroque. If you use bool for the type, then if (intel_speed_shift) is good enough.

sys/kern/kern_cpu.c
417

Please do not use implementation-defined namespace without a reason (I do not see one).

422

return (-1);

424

return (set.freq);

429

And there too.

442

Are error values from this function used for anything ? If not, why not return -1 ?

sys/x86/cpufreq/hwpstate_intel.c
140

I think this call can sleep and lock some VM locks as well. It is not good to lock while the thread is bound. How hard is to restructure the code to read all MSRs first, then unbind, then do printfs ?

bwidawsk added inline comments.Feb 16 2019, 12:18 AM
sys/kern/kern_cpu.c
417

I don't mind changing it, but it's static, so I'm not sure what you mean about adding implementation-defined namespace.

442

Mostly it was for debug to be able to differentiate error cases. I suppose someone debugging can always change it though, and I'll do -1

cem added inline comments.Feb 16 2019, 12:46 AM
sys/kern/kern_cpu.c
417

The implementation-defined part is the __ prefix in the function name.

cem added inline comments.Feb 16 2019, 12:51 AM
sys/kern/kern_cpu.c
442

It's mostly uncommon in FreeBSD to return negative error numbers and positive success values — that's more of a Linux (and imported Linux code) thing. Typically we'd return 0 or EFOO and set an output parameter for the level index. (Or sometimes, return the level index, or -1 as an out-of-range sentinel value.)

(Also, the spelling would be return (-ENOENT); the idea being that return() could be defined as a macro.)

cem added a comment.Feb 16 2019, 10:08 PM

Thanks for doing this, Ben! I'm sorry it took me so long to take a look and appreciate your patience.

sys/kern/kern_cpu.c
149

basic -> basis, I think

434

Is this case possible? It looks like this is only called with freq_dev, and I don't see a case where freq_dev isn't attached before it is initialized. But I may be mistaken.

439

I don't see where set is initialized. Probably this is the result of refactoring CPUFREQ_DRV_GET into __get_frequency and you meant to save the result of __get_frequency and use it for this comparison?

446–449

I appreciate adding comments documenting these routines, by the way. Thank you.

468–484

Hm, the drawback of this approach seems to be we're always invoking CPUFREQ_DRV_TYPE() even for all drivers prior to this change that allow cached frequencies for a given level.

Maybe I'm imagining things and that is a negligible cost?

It seems like we could make an invariant that CPUFREQ_FLAG_UNCACHED => (curr_set->freq == CPUFREQ_VAL_UNKNOWN). Then we would only need to check for != VAL_UNKNOWN to bail early here (like the condition prior to this change). We would then only invoke DRV_TYPE() if the value was UNKNOWN, and only save freq if !UNCACHED.

Again, I might just be overcomplicating this for no reason; feel free to tell me I'm wrong.

488

style(9) nitpick:

/*
 * foo

(Multi-line comment blocks begin with a blank line. One thing to keep in mind with the style(9) manual page is: "Many of the style rules are implicit in the examples.")

524–527

The curr_set pointer into sc->curr_level is still valid, but it is a little confusing to me that the sc->curr_level assignment above is what affects this condition. I would probably replace curr_set->freq here with sc->curr_level.total_set.freq myself. Obviously, this isn't introduced in this change, and maybe it is clear to others. Feel free to ignore.

529

Maybe assert the CF_MTX lock here for reader clarity. I only suggest this because the goto estimate above jumps over a region that unlocks and relocks this lock, which is a little error prone for future refactoring.

564–565

The grammar here is a bit funky

571

It's a bit uncommon to use static variables in functions; are we sure it's safe in this context? In particular, we hold the CF_MTX for this softc, but do we hold any global lock preventing a race?

(And I guess the flip side — is sizeof(sets) large enough to actually be a concern re: stack use? By my count, cf_setting is 40 bytes on amd64, and MAX_SETTINGS is 256. Yeah, ok, that's too large for the stack.)

Is there a reason to avoid malloc here (which was used prior to this refactoring)?

586

style(9) nit: error != 0 (above)

1100

It seems like prior to this change, theoretically multiple cpu frequency drivers could attach to a single CPU / cpufreq driver instance — hence all of the enumeration over devs. Now the attached freq_dev will just be whichever driver registered last. In practice this may not be a problem, but either we should enforce that more strictly, or somehow handle it again.

1139

As above — I don't see anything enforcing this invariant. I might be missing something.

sys/x86/cpufreq/est.c
53

Would it be possible to define this in a shared header instead of multiple extern references in .c files?

921–922

Oh, I see. This is the hack that is preventing multiple freq drivers from registering on intel.

Doesn't this break est on hardware too old for HWP support? And the kludge only works for Intel devices?

sys/x86/cpufreq/hwpstate_amd.c
329–330

This deserves a note in UPDATING. It might be worth just honoring the old name on both Intel and AMD hardware; I don't really see a downside.

sys/x86/cpufreq/hwpstate_intel.c
59

This doesn't match the extern uint32_t declarations of the variable.

Style(9) nit: spaces around = operator.

82

DEVMETHOD_END is canonical. I don't feel strongly about it, though.

108–114

The canonical way to do something like this would be with a DEVICE_PROBE(9) implementation. There is a manual page and also some documentation in kern/device_if.m.

All drivers that identify themselves as candidates for a given devclass are PROBEd to determine which driver should be attached. It is possible for multiple drivers to successfully PROBE a given device; the priority is used to select the best driver for a given device.

For Intel CPUs, you would have est probe a lower priority and hwpstate_intel probe a higher priority (if available), on the same device or pseudo-device.

140

In practice, I don't think this will ever sleep or allocate in the locked section; none of the formats are large enough to exceed the pre-allocated 1024 byte buffer. In that case, however, it would make sense not to have an automatic drain function and just explicitly SYSCTL_OUT at the end.

191

inline keyword is probably gratuitous (below as well).

The function is small enough (certainly for non-DEBUG builds) to be inlined automatically, and even if it is not, these routines are only invoked from an (administrative) sysctl anyway — there isn't going to be any significant performance impact either way.

It is fine to leave it if you prefer, though.

235–237

Usually I'd reject invalid values rather than clamping them to the valid range, unless it makes sense that the valid range may grow or shrink in the future; here it doesn't (IMO).

297

what does this number represent?

311–316

I'm not sure this check is robust; I don't understand why we would bail if an INFO_ONLY driver is also attached under the same parent? But not if a !INFO_ONLY driver is attached? The condition in est is the opposite — *allow* an INFO_ONLY sibling, but bail if there are existing !INFO_ONLY siblings.

322

style(9) nit: space before =. Ditto below

446

it doesn't make sense to me to size buf to 128 but hardcode 18 here.

usually i'd suggest snprintf(foo, sizeof(foo), ...) and then hardcode foo[N], if anything.

either way, there may be no need for this buf? you could just use device_get_nameunit. it would be "debug.intel_hwpstateN" instead of "debug.intel_hwp_debugN", but arguably that is better anyway.

I'm not sure that placing debug sysctls under "debug." instead of just using CTLFLAG_SKIP and placing them in the normal device_get_sysctl_tree location is canonical; it may be.

488

Is this a leftover debugging print? It doesn't seem like something that should go to console on every invocation of the GET method.

cem added a reviewer: cem.Feb 16 2019, 10:09 PM
cem removed a subscriber: cem.
bwidawsk marked 37 inline comments as done.Mar 19 2019, 9:21 PM
bwidawsk updated this revision to Diff 55252.

intel hwpstate v2

Fixed some bugs in v1, and tried to address all review feedback.
patches now live here: https://github.com/bwidawsk/freebsd/tree/hwp2

  • cpufreq: Add sysctl to show which freq driver
  • cpufreq: Rename hwpstate to hwpstate_amd.
  • cpufreq: Remove unnecessary indirection to dev
  • cpufreq: Extract cf_get functionality
  • cpufreq: Extract level information
  • cpufreq: Clarifying naming
  • cpufreq: Allow cpufreq to act over individual cpus
  • cpufreq: Create an uncached flag
  • cpufreq: Add support for Intel Speed Shift
  • cpufreq: Clean up multi-device usage
  • cpufreq: Add some debug messages and comments
  • hwpstate_intel: Enable by default
Harbormaster completed remote builds in B23188: Diff 55252.
bwidawsk added inline comments.Mar 19 2019, 9:22 PM
sys/kern/kern_cpu.c
468–484

I'm going to punt on this for now... TYPE should be a pretty cheap operation and I'm risk averse :-)

571

In an earlier revision, the free paths were getting a little messy, and static seemed like an easy solution. Realistically, nothing uses anywhere near 256, but I agree it's problematic. I will rework it to use malloc.

1139

I think my reworks address this, but please re-review.

sys/x86/cpufreq/est.c
53

yeah... jhb@ recommended this way. I'm fine with whatever though, recommendations?

921–922

It shouldn't. In the hwpstate_intel code, if the hardware is too old, it sets the bool to false. Same for basically any unsupported hardware.

sys/x86/cpufreq/hwpstate_intel.c
108–114

I do agree what's there now is a kludge, however, the way that the device tree looks today is that cpufreq is a child of cpu, and the controller driver (in this case hwpstate_intel) is also a child of cpu. Each controller driver has its own devclass. So that leaves a few options. The most obvious are:

  1. Make all controller drivers share the same devclass
  2. Make the controller driver be a child of the cpufreq driver.

I am/was in favor of #2, however it was a pretty large change and the quick and dirty way (which jhb@ recommended) seemed the most expedient to get the feature landed.

140

So I can punt until later?

235–237

The valid range can indeed change from platform to platform. Of course a user of this sysctl could clamp themselves, but it seemed easier to provide the service for them to keep the API a little more stable. Marking as done, but feel free to complain again and I will adjust.

297

It's a mask of features supported by the hardware. I don't yet do anything with the features anyway, so I will drop it.

cem added a comment.Mar 28 2019, 1:57 AM

Looks better, thanks. Some more feedback

sys/kern/kern_cpu.c
468–484

+1

473–475

Would this ever be the case for Intel HWP, or is this just allowing the possibility in general?

484

error = 0;

488

This seems a little gratuitous given it's just the De Morgan's transformation of the if error == 0 && .... UNCACHED expression just 19 lines earlier, and every path under that condition has an explicit goto past this condition.

571

Thanks!

591

we invoke DRV_TYPE on cf_drv_dev but device_get_nameunit on cf_dev. Is that intentional?

602

error != 0

1139

I don't think the rework addresses the issue. We now confirm it's not NULL when registering, but don't check that it meets the stronger condition (== cf_drv_dev) until unregister.

sys/x86/cpufreq/est.c
53

Leave it be for now, I guess. This variable is a gross hack, hopefully it isn't used in additional places.

921–922

Sure, but what controls the order in which est_identify / intel_hwpstate_identify run such that the latter is guaranteed to run first?

sys/x86/cpufreq/hwpstate_intel.c
108–114

Yeah, that's fair. #2 seems like a reasonable approach to me. I don't know enough about the drawbacks of #1 to dismiss it.

140

No, you should still initialize sb differently so it doesn't have a drain function.

sb = sbuf_new(NULL, NULL, 1024, SBUF_FIXEDLEN | SBUF_INCLUDENUL);

184–185

In here, after finish and before delete, you would add:

if (ret == 0)
    ret = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb));
191

I'd avoid double underscore names here too

235–237

The logical percent range can change to something other than 0-100%? I'm confused.

262–263

Isn't this backwards? If we already attached intel hwpstate, we're using intel_speed_shift and it should remain true.

jhb added a comment.Apr 5 2019, 9:37 PM

BTW, this booted fine on my X1 today (I just pulled the git branch).

I think it's worth getting the current type of approach in for now, but eventually this cpufreq design needs to be rethought as it's a mess. Realistically you want something like this to happen for each CPU:

  • fetch info from the info-only driver
  • have a single device_t that is the cpu frequency device and the various cpufreq drivers just compete over that device_t node with probe priorities and moving their identify logic into the probe

I think the way I would do this is to kill the info-only ACPI perf driver by having the acpi_cpu.c driver for cpuX devices fetch the info from _CST or whatever it is and provide that as an ivar on the cpuX device itself. Then cpuX would add a single "cpufreq" driver and the various cpufreq drivers would all end up sharing that devclass I think. I think this would mostly kill off p4tcc as it's priority is such that it would never win (same with acpi_throttle), but that is probably ok. If we really cared about still supporting both systems that did both P-states and T-states then we would basically create two device nodes under the CPU (one for P-states and one for T-states) and let the drivers attach to the relevant one. However, I would probably not bother with that. It seems like too much work for little gain. Leaving the T-state drivers around would let them still be used for CPUs without P-states, we would just lose the ability to mix the two (which is probably a feature anyway).

sys/kern/kern_cpu.c
152

The 'dev' -> 'cf_dev' rename adds a lot of noise (and is a bit unusual in that almost all (if not actually all) foo_attach, etc. routines in the tree use 'dev' as the variable name. It is minor, but I probably would have left it as 'dev'.

417

'dev' vs 'cf_drv_dev' would seem more consistent with existing code, though I see that this is always 'sc->cf_drv_dev' in the caller, so probably ok as-is.

602

To be fair, that is the existing line. For better or for worse a lot of the code in the tree uses 'if (error)' rather than 'if (error != 0)'

1075

In HEAD we just added a SYSCTL_ADD_CONST_STRING that should let you get rid of this cast I think.

1139

I think you can still get multiple things attached, and that you should just add an 'if' here that bails on a mismatch. However, then you could get spurious warnings from the cf_dev == NULL check above. I think though that you could just remove the check (note that the comment is now also stale) and just make the entire thing be something like:

/* If this is the active cpufreq child device, remove the control device as well. */
cf_dev = device_find_child(device_get_parent(dev), "cpufreq", -1);
if (cf_dev == NULL)
    return (0);
sc = device_get_softc(cf_dev);
if (sc->cf_drv_dev == dev)
    device_delete_child(device_get_parent(cf_dev), cf_dev);
return (0);
sys/sys/cpu.h
126

This is fine as-is. I would perhaps be inclined to invert this flag's sense to assume an actual "real" get method by default and falling back to using the previously set level on "dumb" drivers, something like 'CPUFREQ_FLAG_SETONLY' maybe. I don't think it's worth changing right now though.

sys/x86/cpufreq/est.c
53

I mostly can't think of a non-crappy header to put it in.

921–922

I have wondered if we don't want to effectively merge the two drivers in some way.

sys/x86/cpufreq/hwpstate_amd.c
156

You don't actually have to rename this driver FWIW. It's fine to have multiple drivers with the same name.

329–330

I would probably just leave the name as-is as well FWIW.

cem added inline comments.Apr 5 2019, 10:17 PM
sys/x86/cpufreq/est.c
921–922

I can't think of an immediate downside to this idea (and like it better than the current approach). Ben, is it objectionable to you?

sys/x86/cpufreq/hwpstate_amd.c
156

I don't object to naming different things differently. It can reduce confusion in reports / debugging.

jhb added inline comments.Apr 8 2019, 6:11 PM
sys/x86/cpufreq/est.c
921–922

The main downside is that suppose after this is in I do the cpufreq rework I sketched out in my overall comment last week. In that world, it's quite easy to have the 'speedshift' driver use a higher probe priority than est and have them be separate drivers. Given that they work quite differently I think this is cleaner long term. I had typed this comment above before my general comment (so it was "earlier" in my thinking).

scottl added a subscriber: scottl.Jun 12 2019, 7:55 PM

After Yamagi brought this work to my attention and told me that it was possible to apply this patch to FreeBSD 12.0 I created this diff from the git repo. I applied the patch to FreeBSD 12.0 on a T470s. With this patch the cores clock independently, power consumption is reduced (especially with low to medium load) and peak performance improved.

During short bursts e.g. loading youtube.com in chromium all cores jump to the full boost clock (in my case 3.5GHz) or 100MHz below that for long enough to load bloated websites before falling back to a modest 0,9GHz to 1,1GHz. This improves responsiveness a lot over what is achievable with powerd++ or powerd. With powerd++ I have to limit max CPU frequency to 200MHz below the 2,8GHz base clock to avoid triggering ACPI thermal shutdown warnings under sustained load. With the hwpstate patch applied it takes mprime with 4 workers to trigger the same warning (make -sj4 buildworld isn't enough) , but the system remains stable. My system may be a special problem case because Lenovo saw fit to reduce the max CPU temperature to just 75C for the T470s.

I applied the patch to FreeBSD 12.0 on a T470s. ...

Great, thank you very much for testing!

I just cherry picked this into my tree to try on some servers. As a *USER*, I have the following comments:

  1. We need documentation. We cannot expect users to download the Intel SDM and look up the definition of Energy_Performance_Preference to know that 0 is the performance preference, and 100 is the energy efficiency preference.
  1. It would be super nice to be able to set the default epp for the entire system in one shot via a tunable, rather than having to replicate the same tunable 64 times in loader.conf or sysctl.conf or write a shell script to poke the values in. Something like this would simplify things for what I expect is 99% of use cases:

machdep.intel_speed_shift.default_epp=X

Oh, and in the documentation, it would be handy to note that BIOSes will sometimes disable this functionality and you may need to enable or disable certain common settings in the BIOS.