
Speed up tsc_calib
Abandoned · Public

Authored by cperciva on Oct 31 2021, 7:29 PM.

Details

Reviewers
imp
jhb
kib
markj
Summary

Prior to this commit, the TSC frequency was calibrated at boot time by
measuring the TSC before and after a one-second sleep. This was simple
and effective, but had the disadvantage of *requiring a one-second sleep*.

Rather than making two measurements of the TSC (before and after sleeping)
we now perform many measurements; and rather than simply subtracting the
starting count from the ending count, we calculate a best-fit regression
between the TSC and the reference clock. While we do this, we keep track
of an estimate of the uncertainty in the regression slope (i.e., the ratio
of clock speeds), and stop measuring when we believe the uncertainty is
less than 1 PPM.

Test Plan

On an EC2 m5.xlarge instance, where kvmclock is used as a reference clock,
the calibration typically takes 250-300 microseconds and yields a frequency
consistently accurate to within the target 1 PPM.

I will test this on my laptop (where HPET will be used as the reference clock)
shortly; an earlier version of this code took 2-3 ms on an earlier laptop.


Event Timeline

cperciva created this revision.

This patch applies on top of D32512.

sys/x86/x86/tsc.c
668

Do you also need to prevent context switches? Or does the enabling/disabling of interrupts below do the same trick? Is one better or worse than the other, given that this runs for such a short period of time?

727

Do we need to guard against n < 3 here?

728

A comment about this being a clever rearrangement of the uncertainty equation to avoid a division might be in order.

733

So we're breaking if we've been certain for at least half of the cycle? From above, the best case scenario is n == 5 and passes == 3. Is that enough?

738

Why not t / SBT_1US?

btw, generally I like this, just a couple of nits / questions.

I am somewhat surprised that this compiles; it could be considered a bug in clang, then. Did you try gcc?

The problem is that we compile the kernel with -mno-sse -msoft-float (at least), to stop the compiler, in its eternal wisdom, from doing anything with the FPU for the sake of 'optimizations'. Technically that should have resulted in calls to a soft-float library, which we do not have. Apparently it just emitted instructions.

For AESNI, the way to work around this is to put just the code that operates on XMM registers into a dedicated .c source file, and compile it with different compiler switches. You can see it in conf/files.x86; search for aesni.
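Illustratively, the files.x86 mechanism looks roughly like this (the file name and flag list here are an approximation for the sake of example; check the tree for the exact entry):

```
crypto/aesni/aesni_wrap.c	optional aesni \
	compile-with "${NORMAL_C} -mmmx -msse -msse4 -maes"
```

The per-file compile-with override restores the vector-instruction flags for just that translation unit, while the rest of the kernel keeps -mno-sse -msoft-float.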

sys/x86/x86/tsc.c
668

Neither of these is needed, nor is the sched_bind() below: FPU_KERN_NOCTX disables context switches (by entering a critical section).

sys/x86/x86/tsc.c
683

I suggest not disabling interrupts; we are already in a critical section.

Instead, maybe retry the calibration, e.g. 2 or 3 times, on "Statistical TSC calibration failed".

jrtc27 added inline comments.
sys/x86/x86/tsc.c
664

Having fpu_kern_enter/leave in the same function (or even translation unit, due to inlining) as where floating point is used is dodgy and fragile as the compiler is free to hoist things as it sees fit. Normally the floating point code lives in a separate translation unit with the callee using fpu_kern_enter/leave around the call to it.

In D32758#739421, @kib wrote:

I am somewhat surprised that this compiles; it could be considered a bug in clang, then. Did you try gcc?

Seems they're both the same and will use x87 opcodes. There's also -mno-80387, which the GCC manpage says is the same as far as I can tell, but does actually work how -msoft-float should and will emit libcalls (provided you also disable SSE etc). Clang agrees with GCC on this strange behaviour too, presumably deliberately for compatibility, even though it's stupid and dangerous.

sys/x86/x86/tsc.c
664

Yes, the code needs to live in a separate translation unit, but for a different reason, as discussed elsewhere in this review.

We do not allow any reordering around fpu_kern_enter(). It acts as a full compiler barrier, except in the fpu_kern_thread optimization case. At least that was the intent for x86; I am not sure about the arm64 implementation.

sys/x86/x86/tsc.c
664

It's just a function call even on x86; there's nothing stopping compilers from reordering as they desire, so long as they're not accessing memory that could be clobbered by the callee. In practice there's little reason for them to do so, but relying on that is incorrect and a ticking time bomb.

sys/x86/x86/tsc.c
664

No, it is not 'just a function call': there is a critical_enter() in the body, which does atomic_interrupt_fence(), equivalent to C11 atomic_signal_fence(memory_order_seq_cst).

sys/x86/x86/tsc.c
664

And where does that information appear in the prototype? It doesn't. It's just another prototype like any other function.

My point is that the compiler can load from constant pools and hoist anything that doesn't read from mutable state as early as it likes, there is nothing to stop that happening, and that will panic the kernel if it does. It's generally not profitable for the compiler to do in practice, but it is legal.

sys/x86/x86/tsc.c
664

It is not legal for the compiler to reorder against an unknown function call. Any compiler that does that, and is supposed to be useful in an MT environment, is broken.

FWIW, mtx_lock() is an opaque call (at least for modules in the _KERNEL + KLD_MODULE compilation environment). Or: pthread_mutex_lock() has no special wording in the C standard.

sys/x86/x86/tsc.c
664

That is why I said mutable state. If it's a constant pool value, or a constant synthesised inline purely via instructions, it *should* be impossible for reordering to be observable. Except when you lie to the compiler and disable access to instructions behind its back, which is on you, not the compiler.

sys/x86/x86/tsc.c
664

Here's an example I managed to come up with https://godbolt.org/z/WGWx1frMP1. Totally legal transformation. Panics the kernel.

664

https://godbolt.org/z/WGWx1frMP... not sure where the 1 came from

sys/x86/x86/tsc.c
664

(and still holds with -mno-sse -mno-mmx, just makes the assembly messier)

sys/x86/x86/tsc.c
664

Mmm, I think this example is in fact a demonstration of a compiler bug. Let's ignore the kernel environment entirely, i.e. assume that it is permissible to use the FPU anywhere.

Then using an instruction to convert a single-precision float (with truncation) into an integer is still invalid this way. The reason is that the SDM says:

If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised.

And the compiler does not control FPU exceptions; for instance, the ABI is silent about the exception mask. So such reordering could move the FPU code into a region with a different FPU-environment exception mask.

sys/x86/x86/tsc.c
664

Mmm, I think this example is in fact a demonstration of a compiler bug. Let's ignore the kernel environment entirely, i.e. assume that it is permissible to use the FPU anywhere.

Then using an instruction to convert a single-precision float (with truncation) into an integer is still invalid this way. The reason is that the SDM says:

As much as people like to think otherwise, C is not assembly. What the SDM says is irrelevant, what’s relevant is what the C standard says.

If a converted result is larger than the maximum signed doubleword integer, the floating-point invalid exception is raised.

And the compiler does not control FPU exceptions; for instance, the ABI is silent about the exception mask. So such reordering could move the FPU code into a region with a different FPU-environment exception mask.

The only values in a are 1.0, 2.0 and 3.0. If i is less than 0 or greater than 2 the program has undefined behaviour. So we can assume only those three values are being converted. And all of those can be represented exactly as ints. So no floating point exception, in the C language sense, should be possible, and thus the transformation is legal. There is no compiler bug here. Please stop clutching at straws and just accept that you’re wrong, it’s tiring and a waste of my time to refute.

sys/x86/x86/tsc.c
664

https://godbolt.org/z/4e1199vzd clearly demonstrates that the compilers do not pay attention to the content of the array.

sys/x86/x86/tsc.c
664

C99 7.6.1p2:

The FENV_ACCESS pragma provides a means to inform the implementation when a program might access the floating-point environment to test floating-point status flags or run under non-default floating-point control modes.184) The pragma shall occur either outside external declarations or preceding all explicit declarations and statements inside a compound statement. When outside external declarations, the pragma takes effect from its occurrence until another FENV_ACCESS pragma is encountered, or until the end of the translation unit. When inside a compound statement, the pragma takes effect from its occurrence until another FENV_ACCESS pragma is encountered (including within a nested compound statement), or until the end of the compound statement; at the end of a compound statement the state for the pragma is restored to its condition just before the compound statement. If this pragma is used in any other context, the behavior is undefined. If part of a program tests floating-point status flags, sets floating-point control modes, or runs under non-default mode settings, but was translated with the state for the FENV_ACCESS pragma ''off'', the behavior is undefined. The default state (''on'' or ''off'') for the pragma is implementation-defined. (When execution passes from a part of the program translated with FENV_ACCESS ''off'' to a part translated with FENV_ACCESS ''on'', the state of the floating-point status flags is unspecified and the floating-point control modes have their default settings.)

...

184) The purpose of the FENV_ACCESS pragma is to allow certain optimizations that could subvert flag tests and mode changes (e.g., global common subexpression elimination, code motion, and constant folding). In general, if the state of FENV_ACCESS is ''off'', the translator can assume that default modes are in effect and the flags are not tested.
sys/x86/x86/tsc.c
668

No need to prevent context switches in general (although as kib notes we may be doing that accidentally anyway). But if we bounced between CPUs we might have issues with unsynchronized TSCs.

683

Can you elaborate on why you think we should retry the calibration? I've never seen it fail (leaving aside when I deliberately triggered failures); if it does start failing for people, I'd like to have it give an approximate TSC calibration after one second and a warning which people might report for further investigation rather than quietly being slower.

727

No. For the case n = 1, {cva, va_t, va_tsc} are all equal to zero; for the case n = 2, the test is "zero > positive value".

728

Will do.

733

If we get five data points which are exactly along a straight line? Yeah, that's enough.

738

Because SBT_1US is 4294, and we need to divide by 4294.967296.

(Perhaps not an important distinction when we're talking about a printf, but I figured we might as well avoid unnecessary rounding errors.)

Tested on my laptop: Calibration takes 82-84 ms, and values are consistent to within 0.5 PPM. (Values reported: 211199{8997, 9041, 9205, 9610, 9936}.) According to ntpd, the clock is currently running 0.965 PPM slow.

Without this patch (taking 1 s to calibrate), the same system reports 21120{68384, 70638, 72514, 81268, 88430}, spanning a range from 30-40 PPM fast.

So this change provides a ~12x speedup and a ~20x increase in accuracy on my laptop.

Reworked as D33802, which has now been committed.