Page MenuHomeFreeBSD

Add intel_rapl driver
Needs ReviewPublic

Authored by thj on May 4 2026, 3:07 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jun 4, 4:00 AM
Unknown Object (File)
Tue, Jun 2, 9:36 PM
Unknown Object (File)
Sun, May 31, 10:47 PM
Unknown Object (File)
Sun, May 31, 1:46 AM
Unknown Object (File)
Fri, May 29, 10:23 PM
Unknown Object (File)
Fri, May 29, 8:52 AM
F158110150: D56791-olce.patch
Thu, May 28, 2:40 PM
Unknown Object (File)
Wed, May 27, 5:06 PM
Subscribers

Details

Summary

Running Average Power Limit (RAPL) is an Intel technology which allows reading
measurements of Core and System power usage a run time with a low
overhead.

RAPL can be used to instrument power usage for individual function
calls, but it is also useful at a lower granularity as a tool for
understanding system power usage.

RAPL can report core device power usage, uncore device power usage
(usually documented as the onboard graphics) and platform power usage
(everything attached to the core such as pcie devices).

RAPL can also report memory power usage.

RAPL can be used to set limits on power usage for the core or platform.
These limits can be used to restrict power consumption and set thermal
limits. Currently using limits is an outstanding item for this driver.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 73338
Build 70221: arc lint + arc unit

Event Timeline

thj requested review of this revision.May 4 2026, 3:07 PM

what's the earliest chipset this supports? (eg the Atom C2xxx / C3xxx ones?)

olce requested changes to this revision.Tue, May 12, 8:03 PM

As is, this does not compile. I've proposed fixes in inline comments.

I'll review the meat tomorrow.

sys/modules/intel_rapl/Makefile
2–12

I do not follow what is done here. To me, most lines look superfluous, and the Makefile cannot possibly work as is.

Proposing a simpler alternative that should work.

After a bit more research: This is missing a change in sys/modules/Makefile:

diff --git a/sys/modules/Makefile b/sys/modules/Makefile                                                                                                                                                                                                                                                                                                                            
index faedb856977c..cbbe6cfb7044 100644                                                                                                                                                                                                                                                                                                                                             
--- a/sys/modules/Makefile                                                                                                                                                                                                                                                                                                                                                          
+++ b/sys/modules/Makefile                                                                                                                                                                                                                                                                                                                                                          
@@ -175,6 +175,7 @@ SUBDIR=     \                                                                                                                                                                                                                                                                                                                                                   
        ${_igc} \                                                                                                                                                                                                                                                                                                                                                                   
        imgact_binmisc \                                                                                                                                                                                                                                                                                                                                                            
        ${_imx} \                                                                                                                                                                                                                                                                                                                                                                   
+       ${_intel_rapl} \                                                                                                                                                                                                                                                                                                                                                            
        ${_intelspi} \                                                                                                                                                                                                                                                                                                                                                              
        ${_io} \                                                                                                                                                                                                                                                                                                                                                                    
        ${_ioat} \                                                                                                                                                                                                                                                                                                                                                                  
@@ -768,6 +769,7 @@ _et=                et                                                                                                                                                                                                                                                                                                                                          
 _ftgpio=       ftgpio                                                                                                                                                                                                                                                                                                                                                              
 _ftwd=         ftwd                                                                                                                                                                                                                                                                                                                                                                
 _exca=         exca                                                                                                                                                                                                                                                                                                                                                                
+_intel_rapl=   intel_rapl                                                                                                                                                                                                                                                                                                                                                          
 _io=           io                                                                                                                                                                                                                                                                                                                                                                  
 _itwd=         itwd                                                                                                                                                                                                                                                                                                                                                                
 _ix=           ix

without which the module is not compiled at all. With this change and the inline diff in this comment, the module compiles correctly.

sys/x86/power/intel_rapl.c
478

Compiler error here.

This revision now requires changes to proceed.Tue, May 12, 8:03 PM

I also see that sysctl knob handlers are not necessarily executed on the specific CPU the driver represents, so readings are taken from a random core. I suggest to look at what we did in sys/x86/cpufreq/hwpstate_amd.c (smp_rendezvous_cpu() generally, and for simpler cases not requiring specific consistency, x86_msr_op()).

what's the earliest chipset this supports? (eg the Atom C2xxx / C3xxx ones?)

This supports CPUs, not chipsets. Yes, the distinction is sometimes blurry, but the registers used here are (supposed to be) CPUs' ones and are documented in Intel's SDM, contrary to what is used in D54882 (which is mostly undocumented, except if you consider Linux code as documentation, but still that is not "official").

Comments about C2xx in D56790 give a partial answer to your question for the hardware part. Yes, these are supposed to be supported. To which extent, I'm not sure. SDM vol 4 answers that only the TDP can be retrieved (MSR_PKG_POWER_INFO), and a limit can be set (MSR_PKG_POWER_LIMIT), but it does not mention MSR_PKG_ENERGY_STATUS which reports the actual energy consumption (every msec). On the other hand, SDM vol 3 says that MSR_PKG_ENERGY_STATUS is "non-optional" for the PKG level, so we could expect it to be present given that MSR_PKG_POWER_LIMIT (also "non-optional") is mentioned present in vol 4.

So a definitive answer here can only be brought by actual tests.

As for the software part (this driver), in its current state it does not seem to support these, as e.g. I don't see model 0x4D being listed.

what's the earliest chipset this supports? (eg the Atom C2xxx / C3xxx ones?)

The earliest hardware supported by the first version of this driver is 6th generation. I have used the models from the SDM vol4 table "models in 6th-13th gen plus some other assorted stuff" chapter.

thj marked an inline comment as done.Wed, May 13, 2:11 PM
In D56791#1305613, @thj wrote:

what's the earliest chipset this supports? (eg the Atom C2xxx / C3xxx ones?)

The earliest hardware supported by the first version of this driver is 6th generation. I have used the models from the SDM vol4 table "models in 6th-13th gen plus some other assorted stuff" chapter.

Olce convinced me and I'll generalise support a bit more so we can avoid the model checks

  • fix build
  • add support selecting atom c2000 processors and their weird register
  • rename driver and export names to 'intel_rapl'
  • fix modules build

So, on multiple-package machines, this driver will malfunction as described in one of the inline comments (and I wasn't able to find a quick way to disable it in this case). I guess that's tolerable as a first shot.

I will also provide a patch with suggested changes for inline comments.

Not mentioned in inline comments, nor in the provided patch, as there are several occurrences of each to fix:

  • For 64-bit integer printing, please use PRIx64 (et alter as appropriate) instead of lx in format strings. You'll have to include <machine/_inttypes.h> for this to work. As this is a x86 driver, compilation should currently fail on i386 (FWIW; I haven't checked myself); making it compile only on amd64 would not shock me.
  • Every rdmsr_safe() or wrmsr_safe() call returns an error that must be tested (else you're basically subsequently either operating on an uninitialized value, or the previous value of the variable when chaining multiple calls).

At a higher-level, but related to last point, I would have attempted to completely get rid of CPU model detection and just try to read/write every MSR with these *_safe() variant, and return some dummy value if they are not available (alternatively, test all of them on attach, and not exporting the corresponding sysctl knobs; besides being more work, I'm not completely sure if that would be very much ergonomic from the user's point of view). Anyway, the current state should be fine for a first iteration.

I haven't checked the values for the different CPU models. Perhaps that info could be mutualized, or already exists, somewhere else in the tree.

sys/x86/power/intel_rapl.c
48–66

A number of whitespace problems around here (space before TAB; see patch).

51–54

These are not necessary, as they are the same as pu, esu, tu. See also below.

138–159

Please keep the list in the same order as the MSRs in specialreg.h. There are also typos in there. See proposed list in patch.

Also, I'd use a slightly different prefix, such as RAPL_FM_ (for Flag and MSR).

161–176

(See amended lists in the patch.)

178–189

To be changed in accordance with the re-ordered list above.

191–203

I'd just drop the REASON_ prefix for the printed value for every flag here.

213

To avoid an infinite loop on a low hz (such as in VMs by default, although reading RAPL values in a VM is of dubious interest; but someone could force hz to, e.g., 100).

228–237

I think it would be great to apply another * 1000 to all of these. For, e.g., ESU, if using uJ, truncation of the default value of 15.3 uJ already leads to a 2% error (and that could be worse with a higher divisor). Moreover, that additional * 1000 does not lead to overflow later when multiplying with a 32-bit value (as all the fields are 64-bit).

258–273

See suppression of the dram_* fields in the softc.

276
290

Suggestion to use sbuf_new_for_sysctl(). This entails providing a local variable to store an actual struct sbuf and removing the explicit SYSCTL_OUT() call at the end of the function. Please see patch.

299

pJ is currently wrong, but good as the new target (see above).

Others are currently good, so will have to be changed if applying the * 1000 (see above).

(Though, these could differ from the unit used to store the multiplier, if printing fractional numbers through floating point or fixed point. I'm not sure indeed if there's a need to go below mW, uJ and uS.)

348–349

Same here, suggestion to use sbuf_new_for_sysctl() (see another inline comment above).

364–367

I'd just invert the if and bail out early from this function if reasons are not supported.

368
371–385

This function currently does nothing (except returning 1 always).

Please see proposed version in the patch (reads and updates sc->read_frequency, and re-schedules the reading task immediately if the frequency changed).

390–393

This is fine, but not enough (see below proposed changes to intel_rapl_attach()). Also, the comment tells only part of the story, for two reasons.

First, we actually do not want to attach a driver per CPU, but just only one (for a given package), so even if supporting multiple packages we would need/want to limit attachment (or maybe not, I'm not so sure).

Second, if there are multiple packages, the current driver will "malfunction" in the sense that it will report/set values randomly from one package or the other (and changing over time).

So the comment here should probably be amended to reflect all that (or some herald comment put elsewhere).

Ideally, as long as it does not support multiple packages, I'd like to prevent the driver from attaching at all in that case. I deep dived into the SMP topology code, and IIUC we currently do not detect multiple packages at all (so the topology is wrong on multi-socket machines), so unfortunately that is not immediately simple (that would require a few hours work, which I don't have right now).

438

Minor: I'd add here a check that we are the first driver, and bail out if we are not, as an additional protection, in order to allow not relying on the current ordering of subr_bus.c about identification and probe. YMMV. See patch.

458
479–482

Initially wanted to suggest changing the description and opening up the possibility to write to this value to clear the log, but then I discovered that you implemented a reset of the log after reading. Not too sure if that is the best way, but it looks certainly good enough for the time being.

502–503

I'd just trigger an immediate read here.

Patch with suggested changes:

olce added inline comments.
sys/x86/power/intel_rapl.c
354