Page MenuHomeFreeBSD

Add temperature sensor support for AMD Zen processors; two separate commits included this review.
ClosedPublic

Authored by cem on Sep 2 2017, 5:01 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 21, 4:08 AM
Unknown Object (File)
Sat, Nov 16, 7:49 AM
Unknown Object (File)
Sat, Nov 16, 4:57 AM
Unknown Object (File)
Thu, Nov 7, 8:25 PM
Unknown Object (File)
Mon, Nov 4, 5:02 AM
Unknown Object (File)
Sat, Nov 2, 2:35 PM
Unknown Object (File)
Oct 3 2024, 2:24 AM
Unknown Object (File)
Sep 18 2024, 3:47 PM

Details

Summary

Add smn(4) driver for AMD System Management Network

AMD Family 17h CPUs have an internal network used to communicate between
the host CPU and the PSP and SMU coprocessors. It exposes a simple
32-bit register space.

amdtemp(4): Add support for Family 17h temperature sensor

The sensor value is formatted similarly to previous models (same
bitfield sizes, same units), but must be read off of the internal
System Management Network (SMN) from the System Management Unit (SMU)
co-processor.

PR: 218264

Test Plan

$ sysctl dev.cpu.0
dev.cpu.0.temperature: 52.6C

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 11345
Build 11709: arc lint + arc unit

Event Timeline

mjoras requested changes to this revision.Sep 2 2017, 5:49 PM

Also missing the sys/modules/amdsmn directory, so doesn't work with buildkernel at the moment.

Anyway, seems to work for me on the same processor:

mjoras at icarium in ~
↪ sysctl dev.cpu.0.temperature
dev.cpu.0.temperature: 52.5C

Then after stress -c 32:

mjoras at icarium in ~
↪ sysctl dev.cpu.0.temperature
dev.cpu.0.temperature: 71.6C
sys/dev/amdsmn/amdsmn.c
126

Unused.

sys/dev/amdsmn/amdsmn.h
29

Probably a style violation, I don't see any other uses in the kernel (as much as I don't really like inclusion guards).

This revision now requires changes to proceed.Sep 2 2017, 5:49 PM
cem edited edge metadata.
cem marked an inline comment as done.
  • Remove unused variable; spammy printf.
  • Add forgotten amdsmn module directory.

Should the 20C temperature offset for the 1700X and 1800X be set automatically? When I first got my 1700X, the motherboard had the original BIOS which did not account for the offset. It thought the CPU temperature was 50+C at idle and the fan ran at close to full speed even though I've got a massive heat sink on the CPU. A subsequent BIOS upgrade compensated for the offset and told me that the idle temperature was in the low 30C range and the fan speed was much more reasonable at idle.

I haven't checked the temperature in the BIOS today, but this module says that my CPU is running at 53C at idle and 81C under load from a parallel compile.

I realize that is is possible to manually set the offset using a sysctl or tunable, but if someone downgrades their CPU from a 1x00X to a 1700 then their temperature readings will be 20C too low if they forget to remove the override.

Should the 20C temperature offset for the 1700X and 1800X be set automatically? When I first got my 1700X, the motherboard had the original BIOS which did not account for the offset. It thought the CPU temperature was 50+C at idle and the fan ran at close to full speed even though I've got a massive heat sink on the CPU. A subsequent BIOS upgrade compensated for the offset and told me that the idle temperature was in the low 30C range and the fan speed was much more reasonable at idle.

I haven't checked the temperature in the BIOS today, but this module says that my CPU is running at 53C at idle and 81C under load from a parallel compile.

I realize that is is possible to manually set the offset using a sysctl or tunable, but if someone downgrades their CPU from a 1x00X to a 1700 then their temperature readings will be 20C too low if they forget to remove the override.

Yeah, that is an unfortunate problem of AMD's temperature scale. On the threadripper part, the offset is 27°C. I don't know how smart we want this thing to be. Is there a good way of determining specific model numbers? For what it's worth, I believe amdtemp had exactly the same problem on previous CPU generations.

sys/dev/amdsmn/amdsmn.h
29

style(9) doesn't mention it; GCC and Clang have supported it for a long time.

LLVM, zstd, and other contrib code uses it.

In D12217#253379, @cem wrote:

Yeah, that is an unfortunate problem of AMD's temperature scale. On the threadripper part, the offset is 27°C. I don't know how smart we want this thing to be. Is there a good way of determining specific model numbers? For what it's worth, I believe amdtemp had exactly the same problem on previous CPU generations.

My thought is that this is outside the scope of this revision since it's an existing problem in amdtemp. It's not totally clear to me what the best way to do it since the different parts have different offsets, but it's something someone should probably explore. If we can reliably map the offsets for every known part in every generation it would be a nice convenience feature.

sys/dev/amdsmn/amdsmn.h
29

Like I said, I'm not personally against it, but it doesn't exist today in non-imported code AFAICT. Someone _may_ whine once it hits the list since it's not-standard, so I guess we shall see.

This revision is now accepted and ready to land.Sep 2 2017, 11:51 PM

Works on my Ryzen machine, so I'm no longer flying blind in terms of CPU temperature under load. Doing mental math or manually setting the temperature offset is fine for now.

It was a warm day in my office today. When I checked the CPU temperature in the BIOS, it started off at 36 C and slowly ramped up to 42 C and the fan speed ramped up with it. With FreeBSD booted and idle, the fan was quieter than in the BIOS, so the reading in the low 50's returned by amdtemp would seem to be consistent with the BIOS if I manually subtract the 20 C offset that AMD adds to the sensor value on my 1700X.

For 15h model 60-6f
D0F0x60 Miscellaneous Index
D0F0x64 Miscellaneous Index Data
...
D0F0xB8 SMU Index Address
D0F0xBC SMU Index Data

Is amdsmn required as additional device?

family = CPUID_TO_FAMILY(cpu_id);

What about systems with multiple and different CPUs?
That is why amdtemp reads:

cpuid = pci_read_config(dev, AMDTEMP_CPUID, 4);
share/man/man4/amdsmn.4
36

Perhaps a period is need at the end of the sentence.

I would also suggest adding a couple of words to describe the SMN if the available information on it wasn't so scarce.

sys/dev/amdtemp/amdtemp.c
585

Perhaps what I am suggesting is a premature optimization, but looking up the device on every query doesn't feel nice.

Also, if you are worried about amdsmn unloading, then the current code is not safe either.
We do not have any get/put mechanism for newbus device, so the driver can get detached concurrently with amdsmn_read(). Perhaps, there is a way to dynamically add a dependency of this module on amdsmn? Or maybe we have to hold some lock? Not sure...

What about systems with multiple and different CPUs?

Are there any that actually work?

This is potentially possible in server segment.
I this this is reason why amdtemp read CPUID via PCI and try to not use global cpu_id and local do_cpuid().
Linux gays in k10temp.c: k10temp_is_visible() uses only info that read from target device.

sys/dev/amdtemp/amdtemp.c
585

Is it possible to unload amdsmn while

MODULE_DEPEND(amdtemp, amdsmn, 1, 1, 1);

?

sys/dev/amdtemp/amdtemp.c
585

No, it's not possible. I overlooked that line.
On the other hand, it means that amdsmn module must be loaded on systems where it is not needed at all.
But that's not a big issue.

For 15h model 60-6f
D0F0x60 Miscellaneous Index
D0F0x64 Miscellaneous Index Data
...
D0F0xB8 SMU Index Address
D0F0xBC SMU Index Data

Is amdsmn required as additional device?

I don't understand the question.

family = CPUID_TO_FAMILY(cpu_id);

What about systems with multiple and different CPUs?
That is why amdtemp reads:

cpuid = pci_read_config(dev, AMDTEMP_CPUID, 4);

The offset at AMDTEMP_CPUID is not documented in the PPR for family 17h, as far as I can see, and reads as a bogus value 0xffffffff. So I don't think that works on Family 17h.

share/man/man4/amdsmn.4
36

This section is a title, not a description sentence. Convention is no period (just checked manual pages for amdtemp(4), tail(1), git-push(1), geom(4)).

I really don't know much more about SMN than is present in the DESCRIPTION section below. It's a 32-bit address 32-bit value register bus present in at least family 17h CPUs.

sys/dev/amdtemp/amdtemp.c
585

Perhaps what I am suggesting is a premature optimization, but looking up the device on every query doesn't feel nice.

Ok, we can cache it in the softc.

amdtemp:

  • Cache SMN device_t in softc
  • Express amdsmn dependency in files.{i386,amd64}
This revision now requires review to proceed.Sep 4 2017, 4:52 PM
cem marked 2 inline comments as done.Sep 4 2017, 4:55 PM
family = CPUID_TO_FAMILY(cpu_id);

What about systems with multiple and different CPUs?
That is why amdtemp reads:

cpuid = pci_read_config(dev, AMDTEMP_CPUID, 4);

That register does not appear to exist on Ryzen. It is not mentioned in the Processor Programming Reference (PPR) for AMD Family 17 Model 01h, Revision B1 Processors. When I previously tested your patch, I searched all of config space and didn't find any registers with contents that matched what what I expect to see there.

I don't believe that populating a multi-socket system with non-identical CPU chips is supported.

In D12217#253572, @cem wrote:

For 15h model 60-6f
D0F0x60 Miscellaneous Index
D0F0x64 Miscellaneous Index Data
...
D0F0xB8 SMU Index Address
D0F0xBC SMU Index Data

Is amdsmn required as additional device?

I don't understand the question.

I dont know where did you get info about Ryzen thermal interface, but for fam 15h 0x60 named "Miscellaneous Index" and I dont know about any renaming registers before.

Why amdsmn in new driver module?
(I dont think than any one want to do something with SMU or Miscellaneous registers)

The offset at AMDTEMP_CPUID is not documented in the PPR for family 17h, as far as I can see, and reads as a bogus value 0xffffffff. So I don't think that works on Family 17h.

PPR contains nothing, BKDG not published for 17h.

family = CPUID_TO_FAMILY(cpu_id);

What about systems with multiple and different CPUs?
That is why amdtemp reads:

cpuid = pci_read_config(dev, AMDTEMP_CPUID, 4);

That register does not appear to exist on Ryzen. It is not mentioned in the Processor Programming Reference (PPR) for AMD Family 17 Model 01h, Revision B1 Processors. When I previously tested your patch, I searched all of config space and didn't find any registers with contents that matched what what I expect to see there.

I don't believe that populating a multi-socket system with non-identical CPU chips is supported.

54945_PPR_Family_17h_Models_00h-0Fh.pdf ?
It is short doc, that not contain any internal PCI devices description.

Why amdsmn in new driver module?
(I dont think than any one want to do something with SMU or Miscellaneous registers)

As an abstraction. You don't think anything else may need to access SMN values?

The offset at AMDTEMP_CPUID is not documented in the PPR for family 17h, as far as I can see, and reads as a bogus value 0xffffffff. So I don't think that works on Family 17h.

PPR contains nothing, BKDG not published for 17h.

I have been told AMD does not intend to publish a BKDG for 17h. The PPR is it.

I dont know where did you get info about Ryzen thermal interface, but for fam 15h 0x60 named "Miscellaneous Index" and I dont know about any renaming registers before.

Why amdsmn in new driver module?
(I dont think than any one want to do something with SMU or Miscellaneous registers)

Why do you think that there is any direct relation between SMU and Miscellaneous registers in family 15h and SMN in family 17h?
SMN is a completely new thing in family 17h (or so it seems), the only similarity between SMU and SMN is "SM" :-)
Family 17h is a big step from families 10h - 16h and even between those families there were incompatible changes in PCI register definitions (I know that for sure about registers that describe DRAM configuration).

Given the lack of documentation we can resort -- again! -- to using Linux commits by AMD employees as a guidance.
For example, https://patchwork.kernel.org/patch/9432511/

In D12217#253620, @avg wrote:

I dont know where did you get info about Ryzen thermal interface, but for fam 15h 0x60 named "Miscellaneous Index" and I dont know about any renaming registers before.

Why amdsmn in new driver module?
(I dont think than any one want to do something with SMU or Miscellaneous registers)

Why do you think that there is any direction relation between SMU and Miscellaneous registers in family 15h and SMN in family 17h?
SMN is a completely new thing in family 17h (or so it seems), the only similarity between SMU and SMN is "SM" :-)
Family 17h is a big step from families 10h - 16h and even between those families there were incompatible changes in PCI register definitions (I know that for sure about registers that describe DRAM configuration).

Given the lack of documentation we can resort -- again! -- to using Linux commits by AMD employees as a guidance.
For example, https://patchwork.kernel.org/patch/9432511/

Indeed, I believe that commit was cem's impetus for this work. It makes sense to treat it as a device needing a driver, that's how AMD presents it and that's how AMD employees have implemented it elsewhere, so why not do the same?

In D12217#253620, @avg wrote:

Why do you think that there is any direct relation between SMU and Miscellaneous registers in family 15h and SMN in family 17h?
SMN is a completely new thing in family 17h (or so it seems), the only similarity between SMU and SMN is "SM" :-)
Family 17h is a big step from families 10h - 16h and even between those families there were incompatible changes in PCI register definitions (I know that for sure about registers that describe DRAM configuration).

Given the lack of documentation we can resort -- again! -- to using Linux commits by AMD employees as a guidance.
For example, https://patchwork.kernel.org/patch/9432511/

I think "System Management Unit" and "System Management Net" is very near.
AMD before drops many low level interfaces, like TSI...SMU, so without docs and knowledge of plans we get one more driver that used only by one other driver only for few CPUs.
At least without BKDG 17h it is lottery with small chances to success.
amdsmb, as I know, does not support many CPUs and no one fix it.

My point is: while only amdtemp uses amdsmn on few CPUs - no need to have amdsmn as module, let it be a part of amdtemp.

In D12217#253620, @avg wrote:

My point is: while only amdtemp uses amdsmn on few CPUs - no need to have amdsmn as module, let it be a part of amdtemp.

Why is that better? What is bad about the separate module?

My point is: while only amdtemp uses amdsmn on few CPUs - no need to have amdsmn as module, let it be a part of amdtemp.

Why is that better? What is bad about the separate module?

mtx_lock(&sc->smn_lock);
pci_write_config(parent, SMN_ADDR_REG, addr, 4);
*value = pci_read_config(parent, SMN_DATA_REG, 4);
mtx_unlock(&sc->smn_lock);

That is all what module do. Same for write.
8 lines.
One consumer - amdtemp.
No perspective to get more consumers.
Unknown perspective to be supported in hardware in future.
Module does not guarantee nothing with it lock, for example: pciconf -w hostb0@pci0:0:0:0 0x60 0x59800 && pciconf -r hostb0@pci0:0:0:0 0x64 can break any logic/sequence while accessing to SMN

That is all what module do. Same for write.
8 lines.
One consumer - amdtemp.
No perspective to get more consumers.
Unknown perspective to be supported in hardware in future.

So what?

Module does not guarantee nothing with it lock, for example: pciconf -w hostb0@pci0:0:0:0 0x60 0x59800 && pciconf -r hostb0@pci0:0:0:0 0x64 can break any logic/sequence while accessing to SMN

This is true for literally any PCI driver lock. pciconf is a special interface and only root can do this. It doesn't mean the lock is useless.

amdtemp(4): Attach to the correct pciid host bridge on family 17h

In D12217#253696, @cem wrote:

That is all what module do. Same for write.
8 lines.
One consumer - amdtemp.
No perspective to get more consumers.
Unknown perspective to be supported in hardware in future.

So what?

Now, no one needs an amdsmn.
If some module in future will use SMN then it will be easy to share this code into module.

Now, no one needs an amdsmn.
If some module in future will use SMN then it will be easy to share this code into module.

That's true, but what's so offensive about having another module? Why do we need to wait for abundant reasons to have logical code separation as opposed to copy paste code sharing?

Now, no one needs an amdsmn.
If some module in future will use SMN then it will be easy to share this code into module.

It's easy to keep it as a module, too. I haven't actually heard any reason from you not to.

Now, no one needs an amdsmn.
If some module in future will use SMN then it will be easy to share this code into module.

That's true, but what's so offensive about having another module? Why do we need to wait for abundant reasons to have logical code separation as opposed to copy paste code sharing?

Unneeded modules - crap. Less crap - good.
Linux developers add SMN to existing module.
SMN is not fundamental thing like PCI/USB/SATA/ACHI/...

If you want modules then we also need:

  • SMU for h15m6x - no other way to read temp
  • TSI for some more old, but it optional
  • thermaltrip - just because we can move it to module, no matter that it in all CPUs
  • HardwareTermalControl (HTC)...
  • ReportedTemperatureControl...

So amtdtemp will be in 7 modules.
And all this need only for simple thing: read temperatures from CPU.
It can be done by various command line tools, like:
echo $((0x`setpci -s 00:0.0 60.l=0x59800 && setpci -s 00:0.0 64.l`/2097152*5/4))
(linux) and same way but diffrent args for other CPUs.

Next we can separate to modules other drivers and get more than 1k modules, just for lulz.

Keep it simple!

@rozhuk.im-gmail.com
SMN can and will be used for far more things than just the temperature reading. A lot of registers has been moved behind SMN. Having SMN interface in a separate module is just a good software design.
Your arguments so far sound irrational.

Regarding your note on the Linux change -- yes, they added SMN to the existing driver, but that's because they already had a driver to provide structured access to various things in the AMD "northbridge" and that's because Linux tinkers with hardware insides far more than FreeBSD does.
This change is the step in the same direction, not away from it.
E.g. I wish I was smart enough to create AMD "southbridge" driver at the time when I was writing amdsbwd driver. Without that driver the dependencies between amdsbwd, intpm and the proposed apuled drivers are much poorer than they would be with it.

In D12217#253766, @avg wrote:

@rozhuk.im-gmail.com
SMN can and will be used for far more things than just the temperature reading. A lot of registers has been moved behind SMN. Having SMN interface in a separate module is just a good software design.
Your arguments so far sound irrational.

Regarding your note on the Linux change -- yes, they added SMN to the existing driver, but that's because they already had a driver to provide structured access to various things in the AMD "northbridge" and that's because Linux tinkers with hardware insides far more than FreeBSD does.
This change is the step in the same direction, not away from it.
E.g. I wish I was smart enough to create AMD "southbridge" driver at the time when I was writing amdsbwd driver. Without that driver the dependencies between amdsbwd, intpm and the proposed apuled drivers are much poorer than they would be with it.

Why to not create amdsb now?
And place there SMU, SMN and other AMD specific crap/staff.
And move some code from other modules.
This is a more reasonable approach.

This revision was automatically updated to reflect the committed changes.