
x86: Implement deferred TSC calibration
ClosedPublic

Authored by markj on Oct 15 2021, 8:07 PM.

Details

Summary

There is no universal way to find the TSC frequency. Newer Intel CPUs
may report it via CPUID leaves 0x15 and 0x16. Sometimes it can be
obtained from the PLATFORM_INFO MSR as well, though we never use that.
On older platforms we derive the frequency using a DELAY(1000000) call,
which uses the 8254 PIT. On some newer platforms the 8254 is apparently
non-functional, leading to bogus calibration results. On such platforms
the TSC frequency must be available from CPUID. It is also possible to
disable calibration with a tunable, in which case we try to parse the
brand string if the TSC freq is not available from CPUID.

CPUID 0x15 provides an authoritative TSC frequency value, but even that
is not always available on new Intel platforms. CPUID 0x16 provides the
specified processor base frequency, which is not the same as the TSC
frequency. Empirically, it is close enough for early boot, but too far
off for timekeeping: on a Comet Lake NUC, CPUID 0x16 yields 1600MHz but
the TSC frequency is roughly 1608MHz, leading to frequent clock stepping.
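For scale, the relative error implied by those numbers (my arithmetic, not part of the original summary):

$$\frac{1608 - 1600}{1608} \approx 5 \times 10^{-3} \approx 5000\ \mathrm{ppm},$$

i.e. about 0.5%: fine for early boot, but far beyond both the 100-200ppm that timekeeping tolerates and ntpd's 500ppm slew limit.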

Thus we have a situation where we cannot calibrate using the PIT and
cannot obtain a precise frequency from CPUID (or MSRs). This change
seeks to address that by using the CPUID 0x16 value during early boot
and refining the calibration later once ACPI-based timecounters are
available. TSC frequency detection is thus split into two phases:

Early phase:


- On Intel platforms, query CPUID 0x15 and 0x16 and use that value
initially if available (a sketch of this lookup follows the list).
- Otherwise, get an estimate using the PIT, reducing the delay loop to
100ms from 1s. I can't see any reason to have such a long loop and it
does not significantly change the calculated frequency on systems that
I have access to.
- Continue to register the TSC as the CPU ticks provider early, even
though the frequency may be off. Otherwise any code executed during
boot that uses cpu_ticks() (e.g., context switching) gets tripped up
when the ticks provider changes.
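
As a rough illustration of the CPUID lookup mentioned above (a minimal sketch based on the leaf layouts in the Intel SDM, not the code in this diff; the helper name is made up):

```c
#include <sys/types.h>
#include <machine/cpufunc.h>	/* do_cpuid() */

/*
 * Sketch: derive an early TSC frequency estimate from CPUID.15H/16H.
 * Returns 0 if neither leaf yields a usable value.
 */
static uint64_t
tsc_freq_cpuid_sketch(u_int cpu_high)
{
	u_int regs[4];

	if (cpu_high >= 0x15) {
		/* EAX: ratio denominator, EBX: numerator, ECX: crystal Hz. */
		do_cpuid(0x15, regs);
		if (regs[0] != 0 && regs[1] != 0 && regs[2] != 0)
			return ((uint64_t)regs[2] * regs[1] / regs[0]);
	}
	if (cpu_high >= 0x16) {
		/* EAX: processor base frequency in MHz; approximate only. */
		do_cpuid(0x16, regs);
		if (regs[0] != 0)
			return ((uint64_t)regs[0] * 1000000);
	}
	return (0);
}
```

CPUID.15H gives the TSC frequency as crystal clock * EBX / EAX when all three fields are populated; CPUID.16H only reports the base frequency in MHz, hence the ~0.5% error above.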

Later phase:


- In SI_SUB_CLOCKS, once the timehands are initialized, sample the
current timecounter and TSC values, and sample them again later using
a callout. Use the frequency of the selected timecounter (should be
HPET or ACPI PM timer) to derive the TSC frequency (a sketch of this
calculation follows the list).
- Update the TSC timecounter, global tsc_freq and CPU ticker with the
new frequency and finally register the TSC as a timecounter.
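
The late-phase calculation boils down to dividing the elapsed TSC count by the elapsed time on a reference clock. A minimal sketch of that arithmetic (illustrative names, using sbintime_t as the reference, which is what a later revision of this patch adopted):

```c
#include <sys/types.h>
#include <sys/time.h>

/*
 * Sketch: compute the TSC frequency in Hz from TSC and sbintime_t samples
 * taken at the start and end of the calibration interval.  sbintime_t is
 * a 32.32 fixed-point count of seconds; sbttons() converts to nanoseconds.
 * The 64-bit product stays in range for intervals up to a few seconds at
 * multi-GHz TSC rates.
 */
static uint64_t
tsc_freq_from_samples(uint64_t tsc1, sbintime_t t1, uint64_t tsc2,
    sbintime_t t2)
{
	uint64_t dtsc, dns;

	dtsc = tsc2 - tsc1;
	dns = (uint64_t)sbttons(t2 - t1);
	return (dtsc * 1000000000ULL / dns);
}
```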

TODO:


- kib suggested extending the core timecounter code to kick off deferred
calibration as soon as a high-quality source is registered, rather
than waiting until SI_SUB_CLOCKS. Though, one additional complication
here is that we also perform multi-core TSC synchronization at some
point, and we should be careful to ensure that that doesn't happen
during calibration.
- I removed parsing of the CPUID brand string. Perhaps it should be
restored and enabled by a tunable.
- Some kind of sanity checks of the early and late calibration results
should be added, but I'm not certain how to do it.

Test Plan

I booted a kernel on several desktop and server systems, both Intel and
AMD. In all cases so far the late calibration seems to yield fairly precise
results, based on values recorded in /var/db/ntp/ntpd.drift.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Oct 15 2021, 8:07 PM
markj created this revision.

What's the frequency error you see in early boot?

Also, Colin has code in this area as well, might want to add him as a reviewer.

In D32512#733493, @imp wrote:

What's the frequency error you see in early boot?

From the PIT loop you mean? I can't remember offhand but the value was clearly wrong. I'll recheck later today.

In D32512#733493, @imp wrote:

What's the frequency error you see in early boot?

From the PIT loop you mean? I can't remember offhand but the value was clearly wrong. I'll recheck later today.

For either that or the CPUID leaf values. During early boot, the frequency error can be large (even a few percent, as much as maybe 10% would be fine for device initialization, etc), but once we start 'timekeeping' for real, it needs to be < 100-200ppm and ideally < 10-20ppm (though at this level of accuracy, thermal effects start to introduce a fair amount of noise).

In D32512#733514, @imp wrote:
In D32512#733493, @imp wrote:

What's the frequency error you see in early boot?

From the PIT loop you mean? I can't remember offhand but the value was clearly wrong. I'll recheck later today.

For either that or the CPUID leaf values. During early boot, the frequency error can be large (even a few percent, as much as maybe 10% would be fine for device initialization, etc), but once we start 'timekeeping' for real, it needs to be < 100-200ppm and ideally < 10-20ppm (though at this level of accuracy, thermal effects start to introduce a fair amount of noise).

This was mentioned in the review description, though admittedly it is too verbose, sorry: I get 1600MHz from CPUID and ~1608MHz from late calibration (and ntpd reports an error of 5-20ppm with the latter). I suspect 1600MHz is good enough for early boot, indeed.

In D32512#733514, @imp wrote:
In D32512#733493, @imp wrote:

What's the frequency error you see in early boot?

From the PIT loop you mean? I can't remember offhand but the value was clearly wrong. I'll recheck later today.

For either that or the CPUID leaf values. During early boot, the frequency error can be large (even a few percent, as much as maybe 10% would be fine for device initialization, etc), but once we start 'timekeeping' for real, it needs to be < 100-200ppm and ideally < 10-20ppm (though at this level of accuracy, thermal effects start to introduce a fair amount of noise).

This was mentioned in the review description, though admittedly it is too verbose, sorry: I get 1600MHz from CPUID and ~1608MHz from late calibration (and ntpd reports an error of 5-20ppm with the latter). I suspect 1600MHz is good enough for early boot, indeed.

Sorry I missed it in the description. Yes. 1600MHz vs 1608MHz is about 5000ppm, which is plenty good for early boot (likely by a factor of 10). It's terrible for later, though; the 5-20ppm ntpd error is 1608MHz vs whatever the frequency has drifted to later, and is in line with what I'd expect from a FreeBSD system. We've seen similar 'steps' when we tried to use the CPUID leaf values instead of calibration at Netflix. ntpd has a lot of trouble keeping up at 500ppm and often fails to do it leading to a lot of steps (though some of the systems we experimented on were worse than 500ppm).

In D32512#733516, @imp wrote:

We've seen similar 'steps' when we tried to use the CPUID leaf values instead of calibration at Netflix.

Why did you want to disable calibration?

ntpd has a lot of trouble keeping up at 500ppm and often fails to do it leading to a lot of steps (though some of the systems we experimented on were worse than 500ppm).

Indeed, that's exactly how I came to notice this problem. :) ntpd was stepping the clock back by 1-2s every 10m or so.

sys/x86/x86/tsc.c
143

I'd expand this comment to state that experience has shown the error can be about 1%, which is good enough for booting; the more accurate calibration later in boot takes care of timekeeping.

150

A comment about why only Intel uses these might not be amiss.

342

You stated above that 1000ms gave the same results as 100ms. Did you explore smaller values?

401

The error in elapsed time in the few hundred milliseconds it takes to get to the late calibration stage is sub-millisecond. Consider adding this information.

769

How'd you come up with a 1s timeout?

sys/x86/x86/tsc.c
342

I didn't. I suppose it could be shorter still, but maybe this is good enough for Colin's purposes? I'm also not sure if some/most/all cloud instances these days report TSC frequencies via CPUID, so it could be moot. Before the patch, we perform calibration unless a tunable is set to disable it.

401

I wonder if this is true in general though? We start late calibration after devices are probed and attached. This can probably take a substantial amount of time on some systems?

769

It is arbitrary: just some amount of time large enough that small errors (e.g., from the fact that the two counters are not read simultaneously) are negligible. I can't see much of a reason for making it longer or shorter, but maybe there's something I'm not considering.

I somehow dislike the mechanism for calibration callback. After the specified callout fired, it requires scheduling the taskqueue thread, and then it again requires context switch to get the thread running on the target cpu.

Could the less involved scheme work, where the direct callout is scheduled on the target CPU and stores TSC value into callback data structure? The calibration code could either pause or even busy loop waiting for the value to appear.

Also, you are using the current timecounter directly in the code that calculates the frequency. Why not use e.g. sbinuptime()? This would make the code less delicate, and in fact more correct in case tc_windup occurred between rdtsc executions.

sys/x86/x86/tsc.c
150

Indeed, and I do not see why AMD could not implement those leaves / benefit from the existing support. In practice, Intel and AMD try not to create contradictory CPUID reports.

338

Perhaps 'Early TSC frequency ...'

In D32512#733566, @kib wrote:

I somehow dislike the mechanism for calibration callback. After the specified callout fired, it requires scheduling the taskqueue thread, and then it again requires context switch to get the thread running on the target cpu.

Could the less involved scheme work, where the direct callout is scheduled on the target CPU and stores TSC value into callback data structure? The calibration code could either pause or even busy loop waiting for the value to appear.

Also, you are using the current timecounter directly in the code that calculates the frequency. Why not use e.g. sbinuptime()? This would make the code less delicate, and in fact more correct in case tc_windup occurred between rdtsc executions.

Replying to myself. If you would use normal time-reporting functions (sbinuptime()) then it does not matter how accurate the callout firing moment and the moment when you are reading the TSC are. In other words, my proposal (if workable) would result in fewer CPU cycles spent, but this is the only difference from the callout/taskqueue/sched_bind route to the second TSC reading.

Rather than having a callback after 1 second, can you leave the final synchronization happening in the SYSINIT? I realize this will slow things down (by preventing anything else from happening during that 1 second interval) but I will be able to speed up the calibration dramatically via repeated measurements.

val_packett.cool added inline comments.
sys/x86/x86/tsc.c
342

I've been using 100ms plus an "overhead" calculation imported from NetBSD: https://github.com/DankBSD/base/commit/5a0c863c90bfd83b4339069fa75c4f61979998b2

In D32512#733570, @kib wrote:
In D32512#733566, @kib wrote:

I somehow dislike the mechanism for calibration callback. After the specified callout fired, it requires scheduling the taskqueue thread, and then it again requires context switch to get the thread running on the target cpu.

Could the less involved scheme work, where the direct callout is scheduled on the target CPU and stores TSC value into callback data structure? The calibration code could either pause or even busy loop waiting for the value to appear.

Also, you are using the current timecounter directly in the code that calculates the frequency. Why not use e.g. sbinuptime()? This would make the code less delicate, and in fact more correct in case tc_windup occurred between rdtsc executions.

Replying to myself. If you would use normal time-reporting functions (sbinuptime()) then it does not matter how accurate the callout firing moment and the moment when you are reading the TSC are. In other words, my proposal (if workable) would result in fewer CPU cycles spent, but this is the only difference from the callout/taskqueue/sched_bind route to the second TSC reading.

Initially I used a direct callout to re-read the TSC and timecounter values and determine the TSC frequency, but tc_init() requires a sleepable context (it allocates a sysctl node). That's the only reason I used a taskqueue here. We could do the work in a callout and schedule a taskqueue to call tc_init(), but I don't really see why that's better.

I believe sbinuptime() could be used instead, yes. The accuracy of the callout does not matter even in the current diff, though, and I don't understand the comment about CPU cycles.

Rather than having a callback after 1 second, can you leave the final synchronization happening in the SYSINIT? I realize this will slow things down (by preventing anything else from happening during that 1 second interval) but I will be able to speed up the calibration dramatically via repeated measurements.

Sure, I can do this, and that would side-step some of the discussion above. Can you explain your approach in more detail (or provide a pointer to a patch)?

sys/x86/x86/tsc.c
150

Hmm, ok. I had noticed that AMD does not describe these CPUID functions in their manual. I would be a bit surprised if AMD implements them identically, given that they do not encode the TSC frequency directly. (And some other CPUID base functions documented in the SDM are clearly Intel-specific.)

In D32512#733570, @kib wrote:
In D32512#733566, @kib wrote:

I somehow dislike the mechanism for calibration callback. After the specified callout fired, it requires scheduling the taskqueue thread, and then it again requires context switch to get the thread running on the target cpu.

Could the less involved scheme work, where the direct callout is scheduled on the target CPU and stores TSC value into callback data structure? The calibration code could either pause or even busy loop waiting for the value to appear.

Also, you are using the current timecounter directly in the code that calculates the frequency. Why not use e.g. sbinuptime()? This would make the code less delicate, and in fact more correct in case tc_windup occurred between rdtsc executions.

Replying to myself. If you would use normal time-reporting functions (sbinuptime()) then it does not matter how accurate the callout firing moment and the moment when you are reading the TSC are. In other words, my proposal (if workable) would result in fewer CPU cycles spent, but this is the only difference from the callout/taskqueue/sched_bind route to the second TSC reading.

Initially I used a direct callout to re-read the TSC and timecounter values and determine the TSC frequency, but tc_init() requires a sleepable context (it allocates a sysctl node). That's the only reason I used a taskqueue here. We could do the work in a callout and schedule a taskqueue to call tc_init(), but I don't really see why that's better.

I believe sbinuptime() could be used instead, yes. The accuracy of the callout does not matter even in the current diff, though, and I don't understand the comment about CPU cycles.

What matters is not to postpone the processing too long after the callout fires. As I understand it, you put it at 1s, and this almost guarantees that a timecounter wraps. Correctly handling the wrap and the associated stepping of the timehands is better delegated to the kern_tc.c code.

Rather than having a callback after 1 second, can you leave the final synchronization happening in the SYSINIT? I realize this will slow things down (by preventing anything else from happening during that 1 second interval) but I will be able to speed up the calibration dramatically via repeated measurements.

Sure, I can do this, and that would side-step some of the discussion above. Can you explain your approach in more detail (or provide a pointer to a patch)?

I think it is two SYSINITs(), first reads TSC and current timestamp, second one rereads both and does the calculation.

Rather than having a callback after 1 second, can you leave the final synchronization happening in the SYSINIT? I realize this will slow things down (by preventing anything else from happening during that 1 second interval) but I will be able to speed up the calibration dramatically via repeated measurements.

Sure, I can do this, and that would side-step some of the discussion above. Can you explain your approach in more detail (or provide a pointer to a patch)?

Conceptually: rather than plotting two points on a graph, drawing a line between them, and calculating the slope, I was plotting a few thousand points on a graph and drawing the best-fit line through them. How long it takes will depend on how fast the underlying timer is, but I was getting < 1 ppm accuracy in ~10 ms.
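
For reference, the "best-fit line" is an ordinary least-squares fit of TSC readings against reference times, with the slope of the fitted line being the frequency. A sketch of that calculation (illustrative only, not the patch cperciva mentions; a kernel implementation would avoid floating point):

```c
#include <stddef.h>

/*
 * Sketch: estimate the TSC frequency as the least-squares slope of TSC
 * readings y[i] (ticks) against reference times x[i] (seconds).
 */
static double
tsc_freq_lsq(const double *x, const double *y, size_t n)
{
	double sx, sy, sxx, sxy;
	size_t i;

	sx = sy = sxx = sxy = 0.0;
	for (i = 0; i < n; i++) {
		sx += x[i];
		sy += y[i];
		sxx += x[i] * x[i];
		sxy += x[i] * y[i];
	}
	/* slope = cov(x, y) / var(x) */
	return ((n * sxy - sx * sy) / (n * sxx - sx * sx));
}
```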

Rather than having a callback after 1 second, can you leave the final synchronization happening in the SYSINIT? I realize this will slow things down (by preventing anything else from happening during that 1 second interval) but I will be able to speed up the calibration dramatically via repeated measurements.

Sure, I can do this, and that would side-step some of the discussion above. Can you explain your approach in more detail (or provide a pointer to a patch)?

Conceptually: rather than plotting two points on a graph, drawing a line between them, and calculating the slope, I was plotting a few thousand points on a graph and drawing the best-fit line through them. How long it takes will depend on how fast the underlying timer is, but I was getting < 1 ppm accuracy in ~10 ms.

I'd expect 100 points would be within 3ppm or less at 1ms....

Rather than having a callback after 1 second, can you leave the final synchronization happening in the SYSINIT? I realize this will slow things down (by preventing anything else from happening during that 1 second interval) but I will be able to speed up the calibration dramatically via repeated measurements.

Sure, I can do this, and that would side-step some of the discussion above. Can you explain your approach in more detail (or provide a pointer to a patch)?

Conceptually: rather than plotting two points on a graph, drawing a line between them, and calculating the slope, I was plotting a few thousand points on a graph and drawing the best-fit line through them. How long it takes will depend on how fast the underlying timer is, but I was getting < 1 ppm accuracy in ~10 ms.

Works for me! I reworked the diff to use sbinuptime() and a simple synchronous calibration.

  • Use sbinuptime() rather than reading a timecounter directly.
  • Get rid of the asynchronous mode, simply do a blocking DELAY().
kib added inline comments.
sys/x86/x86/tsc.c
151

I would remove this.

This revision is now accepted and ready to land. Oct 18 2021, 9:41 PM

One possible complication here: In lapic_calibrate_initcount we calibrate the local APIC timer against the TSC. This (currently) runs at SI_SUB_CPU / SI_ORDER_SECOND -- so with this patch, it's being calibrated against a not-incredibly-accurate TSC frequency.

I don't know how much the local APIC timer frequency matters (I don't even know what we use it for) but can we postpone its calibration until we have accurate reference clocks too?

We use the local APIC timer as the eventtimer to drive callout(9) on most modern x86 systems. That said, modern local APICs support a "TSC deadline" mode where you ask for an interrupt to fire when the TSC hits a desired value rather than the local APIC's timer to hit a given value. I'm not sure if we make use of that though? Looks like we do by default when it is available (search for lapic_timer_tsc_deadline in sys/x86/x86/local_apic.c). The timer frequency does matter though and in the cases when it is used it is probably worth postponing its calibration (or perhaps doing it twice as is done for TSC in this case?) (I'm not sure why we have lapic_calibrate_deadline() as a nop, it seems like that can just be removed?)
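
For context, TSC-deadline mode is armed with an absolute TSC value rather than a tick count derived from the separately calibrated local APIC timer frequency, which is why only tsc_freq matters for it. A hedged sketch of the arming step (illustrative, not the local_apic.c code; assumes MSR_TSC_DEADLINE names the IA32_TSC_DEADLINE MSR and that the LVT timer entry is already programmed for deadline mode):

```c
#include <sys/types.h>
#include <machine/cpufunc.h>	/* rdtsc(), wrmsr() */
#include <machine/specialreg.h>	/* MSR_TSC_DEADLINE (assumed name) */

extern uint64_t tsc_freq;

/* Sketch: request a timer interrupt roughly 'ns' nanoseconds from now. */
static void
lapic_arm_deadline_sketch(uint64_t ns)
{
	uint64_t ticks;

	ticks = ns * tsc_freq / 1000000000ULL;
	wrmsr(MSR_TSC_DEADLINE, rdtsc() + ticks);
}
```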

@mav since he worked on the local APIC eventtimer bits IIRC.

In D32512#735989, @jhb wrote:

We use the local APIC timer as the eventtimer to drive callout(9) on most modern x86 systems. That said, modern local APICs support a "TSC deadline" mode where you ask for an interrupt to fire when the TSC hits a desired value rather than the local APIC's timer to hit a given value. I'm not sure if we make use of that though? Looks like we do by default when it is available (search for lapic_timer_tsc_deadline in sys/x86/x86/local_apic.c). The timer frequency does matter though and in the cases when it is used it is probably worth postponing its calibration (or perhaps doing it twice as is done for TSC in this case?) (I'm not sure why we have lapic_calibrate_deadline() as a nop, it seems like that can just be removed?)

@mav since he worked on the local APIC eventtimer bits IIRC.

Lapic deadline mode uses calibrated tsc_freq directly. There is nothing to do special for it IMO (and this was a design decision when deadline mode was added: it can be used together with any timecounter, not requiring tsc timecounter). Non-deadline LAPIC requires calibration of FSB frequency, which is done against the current timecounter. It should be fine as well, since patched code switches to tsc timecounter only after tsc calibration.

Where I used the TSC directly is in measuring the xAPIC-mode max delay to wait for the IPI ready bit. I believe I mentioned this to Mark, probably not in the review directly. This one could be re-measured after the precise TSC calibration, but I am sure that the effect would be a nop: we never time out the IPI idle wait on a healthy system anyway.

In D32512#735996, @kib wrote:
In D32512#735989, @jhb wrote:

We use the local APIC timer as the eventtimer to drive callout(9) on most modern x86 systems. That said, modern local APICs support a "TSC deadline" mode where you ask for an interrupt to fire when the TSC hits a desired value rather than the local APIC's timer to hit a given value. I'm not sure if we make use of that though? Looks like we do by default when it is available (search for lapic_timer_tsc_deadline in sys/x86/x86/local_apic.c). The timer frequency does matter though and in the cases when it is used it is probably worth postponing its calibration (or perhaps doing it twice as is done for TSC in this case?) (I'm not sure why we have lapic_calibrate_deadline() as a nop, it seems like that can just be removed?)

@mav since he worked on the local APIC eventtimer bits IIRC.

Lapic deadline mode uses calibrated tsc_freq directly. There is nothing to do special for it IMO (and this was a design decision when deadline mode was added: it can be used together with any timecounter, not requiring tsc timecounter). Non-deadline LAPIC requires calibration of FSB frequency, which is done against the current timecounter. It should be fine as well, since patched code switches to tsc timecounter only after tsc calibration.

Where I used the TSC directly is in measuring the xAPIC-mode max delay to wait for the IPI ready bit. I believe I mentioned this to Mark, probably not in the review directly. This one could be re-measured after the precise TSC calibration, but I am sure that the effect would be a nop: we never time out the IPI idle wait on a healthy system anyway.

Am I understanding correctly that the local APIC timer frequency is irrelevant for anything we're doing? Why are we spending a second calibrating it then?

In D32512#735996, @kib wrote:
In D32512#735989, @jhb wrote:

We use the local APIC timer as the eventtimer to drive callout(9) on most modern x86 systems. That said, modern local APICs support a "TSC deadline" mode where you ask for an interrupt to fire when the TSC hits a desired value rather than the local APIC's timer to hit a given value. I'm not sure if we make use of that though? Looks like we do by default when it is available (search for lapic_timer_tsc_deadline in sys/x86/x86/local_apic.c). The timer frequency does matter though and in the cases when it is used it is probably worth postponing its calibration (or perhaps doing it twice as is done for TSC in this case?) (I'm not sure why we have lapic_calibrate_deadline() as a nop, it seems like that can just be removed?)

@mav since he worked on the local APIC eventtimer bits IIRC.

Lapic deadline mode uses calibrated tsc_freq directly. There is nothing to do special for it IMO (and this was a design decision when deadline mode was added: it can be used together with any timecounter, not requiring tsc timecounter). Non-deadline LAPIC requires calibration of FSB frequency, which is done against the current timecounter. It should be fine as well, since patched code switches to tsc timecounter only after tsc calibration.

Where I used the TSC directly is in measuring the xAPIC-mode max delay to wait for the IPI ready bit. I believe I mentioned this to Mark, probably not in the review directly. This one could be re-measured after the precise TSC calibration, but I am sure that the effect would be a nop: we never time out the IPI idle wait on a healthy system anyway.

Am I understanding correctly that the local APIC timer frequency is irrelevant for anything we're doing? Why are we spending a second calibrating it then?

There are two calibrations. One is for LAPIC non-deadline mode, where the FSB frequency does matter. Another is for the xAPIC IPI ready delay. Both are needed for the corresponding hardware configurations.

The only unneeded calibration is the FSB frequency for deadline mode. I am not sure about the ordering, i.e., do we know, when measuring the FSB frequency, whether we will use deadline mode? Perhaps this can be reshuffled so that this calibration is avoided if we have committed to LAT_MODE_DEADLINE over LAT_MODE_ONESHOT.

In D32512#736201, @kib wrote:

The only unneeded calibration is FSB freq for deadline mode. I am not sure about ordering, i.e. do we know whether we would use deadline, when measuring FSB frequency. Perhaps this can be re-shuffled around and then this calibration can be avoided if we committed to LAT_MODE_DEADLINE over LAT_MODE_ONESHOT.

I think we can reliably tell whether we are going to use LAT_MODE_DEADLINE over LAT_MODE_ONESHOT. But there is also LAT_MODE_PERIODIC, which depends on the calibration. It is pure legacy at this point, but formally still supported and can be enabled at run time with sysctl kern.eventtimer.periodic=1.

One more thought: if the original estimate ends up much lower than it should be, kern_clocksource.c, which schedules events based on the precise timecounter but fires them with the uncalibrated event timer, may generate extra timer interrupts during the second calibration process. Hopefully it won't be bad enough to cause interrupt storms.

sys/x86/x86/tsc.c
44

This is no longer needed.

735

Not a big deal, but using 4194304 ((1 << 32) / 1024) and 1024 you could avoid sbttons().
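
Spelling out that suggestion (my reading of it, not the committed code): sbintime_t is a 32.32 fixed-point count of seconds, so freq = dtsc * 2^32 / dsbt, and splitting 2^32 into 4194304 * 1024 keeps the intermediate product inside 64 bits:

```c
#include <sys/types.h>
#include <sys/time.h>

/*
 * Sketch: TSC frequency from a TSC delta and an sbintime_t delta without
 * converting to nanoseconds.  dsbt is seconds * 2^32, so
 * dtsc * 2^32 / dsbt == dtsc * 4194304 / (dsbt / 1024).
 */
static uint64_t
tsc_freq_sbt_sketch(uint64_t dtsc, sbintime_t dsbt)
{
	return (dtsc * 4194304ULL / ((uint64_t)dsbt >> 10));
}
```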

markj marked 6 inline comments as done.
  • Drop an unneeded include.
  • Avoid using sbttons(), do arithmetic with sbintime_ts.
  • Restore brand string parsing and tunable to skip early calibration.
  • Try to use CPUID.15H or 16H even on non-Intel systems.
  • Fix i386 build.
  • Add some comments.
This revision now requires review to proceed. Oct 26 2021, 3:30 PM
sys/x86/x86/tsc.c
330

You've added a return status, but you're not using it.

336

I am wondering what tsc_skip_calibration should really disable in the new context of two-stage calibration. If it just bypasses the PIT, then we could still calibrate and use the TSC as a timecounter later, even if we are too late to register it directly as the CPU ticker.

markj added a subscriber: jkim.
markj added inline comments.
sys/x86/x86/tsc.c
336

I'm not aware of any uses, but didn't want to perturb anything that relies on it as a workaround.

@jkim do you remember the motivation behind commit a4e4127f420352b211c3dfa8d16edf695bae7a51 ? Is there any need to keep this around as a chicken switch of some kind?

Check return status of tsc_freq_intel_brand()

In D32512#735996, @kib wrote:
In D32512#735989, @jhb wrote:

We use the local APIC timer as the eventtimer to drive callout(9) on most modern x86 systems. That said, modern local APICs support a "TSC deadline" mode where you ask for an interrupt to fire when the TSC hits a desired value rather than the local APIC's timer to hit a given value. I'm not sure if we make use of that though? Looks like we do by default when it is available (search for lapic_timer_tsc_deadline in sys/x86/x86/local_apic.c). The timer frequency does matter though and in the cases when it is used it is probably worth postponing its calibration (or perhaps doing it twice as is done for TSC in this case?) (I'm not sure why we have lapic_calibrate_deadline() as a nop, it seems like that can just be removed?)

@mav since he worked on the local APIC eventtimer bits IIRC.

Lapic deadline mode uses calibrated tsc_freq directly. There is nothing to do special for it IMO (and this was a design decision when deadline mode was added: it can be used together with any timecounter, not requiring tsc timecounter). Non-deadline LAPIC requires calibration of FSB frequency, which is done against the current timecounter. It should be fine as well, since patched code switches to tsc timecounter only after tsc calibration.

I was worried that the DELAY() used to calibrate the APIC timer would be imprecise if we are still using the TSC to implement DELAY at this point. If that is true then we want to defer calibrating the APIC timer until DELAY is more precise. (And it does look like we are using the TSC for DELAY() calls once tsc_freq is set to a non-zero value.)

Note that using the imprecise TSC during boot may impact other calibrations. For example, lapic_calibrate_initcount() should probably be adjusted to also take sbinuptime snapshots and use the delta of those to compute the frequency rather than using the result of the timer after the DELAY as the frequency directly. I'm not sure if there are other cases of DELAY used prior to SI_SUB_CLOCKS + 1 where callers assume DELAY is precise.

sys/x86/x86/tsc.c
336

@markj I vaguely remember some people complained it took forever for their Intel systems to boot. Unless it is too hard to maintain, I think we should keep it.

In D32512#737491, @jhb wrote:

I was worried that the DELAY() used to calibrate the APIC timer would be imprecise if we are still using the TSC to implement DELAY at this point. If that is true then we want to defer calibrating the APIC timer until DELAY is more precise. (And it does look like we are using the TSC for DELAY() calls once tsc_freq is set to a non-zero value.)

Yes, that is correct. But not setting tsc_freq yet isn't a good solution either, since without it the APIC timer would be calibrated against the (non-functional on some recent systems) i8254 timer.

If we need an accurate APIC timer on systems with broken i8254, we need to postpone it until at least after SI_SUB_CONFIGURE (which is when HPET and ACPI timers are discovered).

Note that using the imprecise TSC during boot may impact other calibrations. For example, lapic_calibrate_initcount() should probably be adjusted to also take sbinuptime snapshots and use the delta of those to compute the frequency rather than using the result of the timer after the DELAY as the frequency directly. I'm not sure if there are other cases of DELAY used prior to SI_SUB_CLOCKS + 1 where callers assume DELAY is precise.

We use DELAY while starting APs, but I'm guessing it won't matter if the 10 ms is 9 ms or 11 ms instead. Aside from that the other big user of DELAY is atkbdc reset/initialization, which happens even before TSC initialization (in hammer_time).

In D32512#737491, @jhb wrote:

I was worried that the DELAY() used to calibrate the APIC timer would be imprecise if we are still using the TSC to implement DELAY at this point. If that is true then we want to defer calibrating the APIC timer until DELAY is more precise. (And it does look like we are using the TSC for DELAY() calls once tsc_freq is set to a non-zero value.)

Yes, that is correct. But not setting tsc_freq yet isn't a good solution either, since without it the APIC timer would be calibrated against the (non-functional on some recent systems) i8254 timer.

If we need an accurate APIC timer on systems with broken i8254, we need to postpone it until at least after SI_SUB_CONFIGURE (which is when HPET and ACPI timers are discovered).

Oh, I'm not suggesting to not use the imprecise TSC for DELAY(), I think the fix is that we need to postpone and/or redo the APIC timer calibration until after we've redone the TSC calibration.

We should also probably fix the APIC timer calibration to use the time counter to be more precise as Mark's change here does for the TSC.

Note that using the imprecise TSC during boot may impact other calibrations. For example, lapic_calibrate_initcount() should probably be adjusted to also take sbinuptime snapshots and use the delta of those to compute the frequency rather than using the result of the timer after the DELAY as the frequency directly. I'm not sure if there are other cases of DELAY used prior to SI_SUB_CLOCKS + 1 where callers assume DELAY is precise.

We use DELAY while starting APs, but I'm guessing it won't matter if the 10 ms is 9 ms or 11 ms instead. Aside from that the other big user of DELAY is atkbdc reset/initialization, which happens even before TSC initialization (in hammer_time).

Yes, those are all fine, and it may very well be that it is only the DELAY for the APIC timer calibration which matters.

sys/x86/x86/tsc.c
342

Unfortunately many cloud instances lag quite a bit on hardware generations -- until a few months ago the newest widely used x86 systems in EC2 were the M5 family, which are "Intel(R) Xeon(R) Platinum 8124M" aka Skylake. They're reporting cpu_high = 13, so they can't get a frequency reported by the CPU here.

I'm inclined to shorten this, probably with the overhead calculation Greg mentions thrown in as well -- but I can look at that later.

sys/x86/x86/tsc.c
336

Ok, I restored it. So the tunable now only controls whether early calibration is performed. If not, and we fail to obtain an initial frequency from the brand string, use of the TSC is disabled and we will not perform late calibration. If we do obtain an initial frequency from the brand string then we will continue with late calibration, so I just amended the machdep.disable_tsc_calibration sysctl description.

Maybe it would be useful to have it disable second-stage calibration as well, but I will leave it as-is unless a need arises.

342

I incorporated the delay into the early calibration, but left it at 100ms for now.

  • Don't perform late calibration if the TSC is disabled, either administratively or because we failed to get an initial frequency during early boot.
  • Incorporate the 8254 read overhead factor when calculating the early frequency. Obtained from NetBSD.
  • Update the LAPIC eventtimer frequency if deadline mode is configured and the value of tsc_freq changes. Otherwise the eventtimer is left using the early frequency (obtained from, e.g., CPUID.16H, which may be fairly inaccurate).

Minor tweak to the comment above tsc_calib.

BTW I have a patch ready for speeding up tsc_calib via statistical analysis -- I'll be uploading it shortly.

sys/x86/x86/tsc.c
697

s/ACPI-based/other/

In particular, we'll probably end up calibrating against either HPET (which has quality 950) or on virtual machines pvclock (which is quality 975), both of which are discovered during the same 'configure2' SYSINIT as the ACPI timer.

  • Don't perform late calibration if the TSC is disabled, either administratively or because we failed to get an initial frequency during early boot.

I do not quite understand why. What prevents us from always doing the late calibration, regardless of the state of early calibration?

  • Incorporate the 8254 read overhead factor when calculating the early frequency. Obtained from NetBSD.
  • Update the LAPIC eventtimer frequency if deadline mode is configured and the value of tsc_freq changes. Otherwise the eventtimer is left using the early frequency (obtained from, e.g., CPUID.16H, which may be fairly inaccurate).
In D32512#739441, @kib wrote:
  • Don't perform late calibration if the TSC is disabled, either administratively or because we failed to get an initial frequency during early boot.

I do not quite understand why. What prevents us from always doing the late calibration, regardless of the state of early calibration?

Nothing really prevents it. Though, registering the CPU ticker later can have some weird effects, at least on CPU time accounting during boot. The main reasons are that I don't know the complete motivation for the machdep.disable_tsc_calibration tunable and don't want to perturb working setups, and I want to ensure that the machdep.disable_tsc tunable is honoured.

I'd like to commit this soon. Are there any objections?

To summarize the discussion with respect to LAPIC:

  • In xAPIC mode we use a potentially inaccurate TSC frequency to figure out the maximum IPI delay. Unless the inaccuracy is very large, I don't think this is a problem.
  • The LAPIC eventtimer uses the measured TSC frequency when in deadline mode. I made a change to that code to reload the frequency the next time a timer is armed.
This revision is now accepted and ready to land. Nov 4 2021, 1:10 AM
sys/x86/x86/local_apic.c
1020

This still doesn't handle the case that count_freq might be wrong though since the DELAY(1 second) might have been wrong due to the imprecise tsc_freq at the time of the DELAY? That is, when using the local APIC timer in LAT_MODE_PERIODIC or LAT_MODE_ONESHOT we depend on count_freq which is calibrated in lapic_calibrate_initcount(). Right now we do that with the (potentially) imprecise TSC from native_lapic_setup() (called via lapic_setup() from apic_setup_io() at SI_SUB_INTR, SI_SUB_THIRD).

My suggestion for fixing the local APIC would be to change lapic_setup() to not program the lapic timer yet during boot (just leave it in a "quiet" state) and instead to calibrate the local APIC timer in cpu_initclocks() as the first thing before calling cpu_initclocks_bsp(). The local APIC timer bits can then be configured on the BSP after doing the calibration. For the APs we would program the local APIC timer just before calling cpu_initclocks_ap().

sys/x86/x86/local_apic.c
1020

I want to get rid of the 1 second DELAY for APIC timer calibration anyway, like I'm doing for TSC in D32758. Can it be postponed until SI_SUB_CLOCKS + 1?

markj added inline comments.
sys/x86/x86/local_apic.c
1020

So to perform LAPIC calibration from cpu_initclocks(), we can't really use DELAY() since the late TSC frequency is not yet obtained. I think the best option is to move late TSC calibration to SI_SUB_CLOCKS - 1. This means that we can't use sbinuptime() since the timehands are not yet initialized, so we have to go back to reading the currently selected timecounter directly as was done in an earlier version of this patch.
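
Reading the selected timecounter directly looks roughly like the standard tc_delta() idiom from kern_tc.c (a sketch, not the code from this review):

```c
#include <sys/types.h>
#include <sys/timetc.h>

/*
 * Sketch: elapsed ticks of a timecounter read directly, bypassing the
 * timehands.  The counter may wrap, so mask the difference; the result is
 * only meaningful if less than one full counter period has elapsed.
 */
static u_int
tc_elapsed_sketch(struct timecounter *tc, u_int start)
{
	u_int now;

	now = tc->tc_get_timecount(tc) & tc->tc_counter_mask;
	return ((now - start) & tc->tc_counter_mask);
}
```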

I started working on this, but per some discussion on IRC I'll defer it to a separate review: systems where the early TSC frequency is very inaccurate (i.e., those where we get it from CPUID.16H) will be using TSC deadline mode anyway.

This revision was automatically updated to reflect the committed changes.