Page MenuHomeFreeBSD

x86 tsc: use RDTSCP in preference of fence + RDTSC

Authored by kib on Jan 5 2021, 11:02 PM.

Diff Detail

R10 FreeBSD src repository
Lint Skipped
Unit Tests Skipped

Event Timeline

kib requested review of this revision.Jan 5 2021, 11:02 PM
kib created this revision.

I tested the kernel part of this on a Netflix server with an EPYC 7502P using TSC-low. It saves roughly 2% CPU and tsc disappears entirely from profiling output, making profiling output match that from our Xeons much more closely.

Note that the bulk of the tsc usage is coming from microuptime called via rack tcp, so I cannot comment on accuracy, only performance and lack of any problems large enough to be noticed in quick system wide testing.

This revision is now accepted and ready to land.Jan 6 2021, 3:18 PM

Switch to use LFENCE;RDTSC for Zen+.
Rework ifunc resolver to avoid code duplication.

This revision now requires review to proceed.Jan 6 2021, 4:57 PM

After talking with our AMD rep, the AMD advise for this is:

before Zen (fam < 0x17): MFENCE;RDTSC is preferred
Zen1 & Zen2 : There isn't much of a difference between RDTSCP and LFENCE;RDTSC
Zen3: RDTSCP will be slower than LFENCE;RDTSC

So it seems like, for simplicity, Zen should use LFENCE;RDTSC; and older AMD should use MFENCE;RDTSC

This revision is now accepted and ready to land.Jan 7 2021, 9:00 PM

I ran stress tests on two different hosts for an hour. No problems seen.


IMO it is a bit neater to dereference th in rdtsc_low() rather than its callers, like rdtscp_low().


This is rdtscp32().


Doesn't tc_priv need to be cast to 32 bits like in tsc_get_timecount_low()?

kib marked an inline comment as done.Jan 8 2021, 7:19 PM
kib added inline comments.

Well, except that we do not have such helper in cpufunc.h. I considered adding it, but decided that single-use does not make sense, additional argument for me was that this file contains enough asm already.

If you prefer rdtscp32(), I will add it.


Casting would make it into r-value from l-value, which does not allow to use m specifier. Inline asm is happy without casting as well, because movl %1,%%ecx explicitly informs assembler about operand size (both 'l' and target size).

I believe the cast in tsc_get_timecount_low() is not strictly needed, since there is no harm from loading whole 64bit of tc_priv into %rcx on amd64. But I do not want to touch it ATM.

Make rdtsc_low() and rdtscp_low() similar by both taking struct timehands *.

This revision now requires review to proceed.Jan 8 2021, 7:21 PM
markj added inline comments.

There would be 2 uses: here and in __vdso_gettc.c, where rdtscp32() is already defined locally. Sorry for not being more clear, the duplication is why I made the suggestion.

This revision is now accepted and ready to land.Jan 9 2021, 4:49 PM
kib marked 2 inline comments as done.

Add rdtscp32() to cpufunc.h.

This revision now requires review to proceed.Jan 9 2021, 9:18 PM