Paths

Table of Contentst

callout(9): When using the C_HARDCLOCK flag, ensure the target time is aligned to "tick_sbt"
Needs ReviewPublic
Actions

Authored by • hselasky on Dec 3 2022, 12:07 AM.

Details

Reviewers

mav
rrs

Summary

This reduces the overall number of hard timer IRQ's when precision higher
than "kern.hz" is not needed.

MFC after: 1 week
Sponsored by: NVIDIA Networking

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

• hselasky created this revision.Dec 3 2022, 12:07 AM

Herald added a subscriber: imp. · View Herald TranscriptDec 3 2022, 12:07 AM

• hselasky requested review of this revision.Dec 3 2022, 12:07 AM

Why do you think this is needed? All C_HARDCLOCK consumers are supposed to pass only periods multiple of tick_sbt, and hardclocktime above should be aligned to hardclock also. In what cases do you see misaligned values?

@mav:

I added some code to the kernel and found the remainder of the to_sbt for hardclock is never properly aligned:

diff --git a/sys/kern/kern_timeout.c b/sys/kern/kern_timeout.c
index 0ac0eca36da4..a6acff175d85 100644
--- a/sys/kern/kern_timeout.c
+++ b/sys/kern/kern_timeout.c
@@ -114,6 +114,9 @@ static int ncallout;
 SYSCTL_INT(_kern, OID_AUTO, ncallout, CTLFLAG_RDTUN | CTLFLAG_NOFETCH, &ncallout, 0,
     "Number of entries in callwheel and size of timeout() preallocation");
 
+static int remainder;
+SYSCTL_INT(_kern, OID_AUTO, remainder, CTLFLAG_RW, &remainder, 0, "");
+
 #ifdef RSS
 static int pin_default_swi = 1;
 static int pin_pcpu_swi = 1;
@@ -906,6 +909,11 @@ callout_when(sbintime_t sbt, sbintime_t precision, int flags,
                        to_sbt = sbinuptime();
                if ((flags & C_HARDCLOCK) == 0)
                        to_sbt += tick_sbt;
+
+               /* make sure the target time is really aligned to "tick_sbt" */
+               if ((to_sbt % tick_sbt) != 0)
+                       remainder = to_sbt % tick_sbt;
+               to_sbt -= to_sbt % tick_sbt;
        } else
                to_sbt = sbinuptime();
        if (SBT_MAX - to_sbt < sbt)

I think this has to do with:
a) Code adjusting time also updates tick_sbt ?
b) Before a timer is available we have some dummy timer implementation which assumes hz=1000, which modify the reference timeout sbt used :-(

--HPS

• hselasky mentioned this in D37321: tcp: use single locked callout per tcpcb for the TCP timers.Dec 6 2022, 10:43 AM

@mav : I see more. Maybe when setting/updating tick_sbt, there should be a callback to re-align the hardclocktime? What do you think?

In D37594#855163, @hselasky wrote:

Maybe when setting/updating tick_sbt, there should be a callback to re-align the hardclocktime? What do you think?

@hselasky Until you've started to try align everything to absolute tick_sbt multiples it should not matter. All C_HARDCLOCK callouts measure time from the last hardclocktime, stepping in tick_sbt increments, but without strictly defined initial offset. I have no objections for realigning hardclocktime on the tick rate change, but I am missing a motivation.

Add some more chunks.

Add more alignment checks.

I was thinking about removing 64bit divisions in the hot path and moving them into the first initialization, instead you've added more... We should not bother about alignment in callout_when() if hardclocktime is aligned. Time passed by caller must be aligned to tick_sbt already by him, otherwise he should not use C_HARDCLOCK.

In D37594#855238, @mav wrote:

I was thinking about removing 64bit divisions in the hot path and moving them into the first initialization, instead you've added more... We should not bother about alignment in callout_when() if hardclocktime is aligned. Time passed by caller must be aligned to tick_sbt already by him, otherwise he should not use C_HARDCLOCK.

Yes, I removed that, but the "sbt" (timeout) itself is not guaranteed to be aligned - right ?

sys/kern/kern_clocksource.c
321	@mav : Unless we align the sbinuptime() value here, the *next value will start sliding. Do you agree?

In D37594#855242, @hselasky wrote:

Yes, I removed that, but the "sbt" (timeout) itself is not guaranteed to be aligned - right ?

If it is not aligned, then we should not try to align it. Most of C_HARDCLOCK users should do tick_sbt *, so it should be aligned.

mav added inline comments.Dec 6 2022, 5:04 PM

sys/kern/kern_clocksource.c
321	Periodic timers may have arbitrary phase or even drift here or there relative to timecounter, we have no control over that. We can not align them in general case. Due to jitter it may theoretically happen that some interrupt period will cover two ticks, while the next -- none. Existing code But I don't care much about the periodic timers, since they are not widely used any more. Actually, I can't even find now where nexttick is used for periodic timers now, may be it could be scrapped, or used only for one-shot timers to keep its current programming to know when we need to reprogram it.

@mav : I know modulus is a bit expensive, but many of those computations, even for 64-bit, could easily be avoided by using 1024 ticks instead of 1000 ticks per second .... Though that's a completely different discussion.

• hselasky marked an inline comment as done.Dec 6 2022, 5:07 PM

• hselasky added inline comments.

sys/kern/kern_clocksource.c
321	My patch doesn't hurt if two ticks are skipped or more. It just ensure that the next tick is properly aligned as it should, and doesn't accumulate errors!

mav added inline comments.Dec 6 2022, 5:19 PM

sys/kern/kern_clocksource.c
321	Who said that "it should"? As I have told, periodic counters didn't promise us anything. Some may be able to accept first event time, but not all. As I have told, I've tried to find what would your change hurt, but haven't found where that value is used at all for periodic timers, that is why I said may be we could scrap it all together, leaving only for one-shot timers.

@mav : What about synchronous execution between CPUs. When you have one CPU it doesn't matter that much, but when you have multiple CPUs, it might make sense to execute the timers aligned to eachother, instead of sliding randomly?

In D37594#855759, @hselasky wrote:

@mav : What about synchronous execution between CPUs. When you have one CPU it doesn't matter that much, but when you have multiple CPUs, it might make sense to execute the timers aligned to eachother, instead of sliding randomly?

@hselasky As I have told above, our code tries to align periodic timers by providing eventtimers first event time, but it is not guarantied to be supported. I am not sure what are you saying or proposing.

In D37594#855761, @mav wrote:

In D37594#855759, @hselasky wrote:

@mav : What about synchronous execution between CPUs. When you have one CPU it doesn't matter that much, but when you have multiple CPUs, it might make sense to execute the timers aligned to eachother, instead of sliding randomly?

@hselasky As I have told above, our code tries to align periodic timers by providing eventtimers first event time, but it is not guarantied to be supported. I am not sure what are you saying or proposing.

Maybe I wasn't that clear. For SMP the value that holds the next hardclock event is per-CPU, and what prevents those per-CPU values from sliding between CPUs? During boot, the dummy timer will put "random" starting values in there? Or am I wrong?

• hselasky added inline comments.Dec 8 2022, 10:00 PM

sys/kern/kern_clocksource.c
321	Who said that "it should" ? Nobody I guess. So why not make kern.hz a power of two, and skip all those expensive modulus and divisions in the callout subsystem? Would there be any consequences? I guess not?

For SMP the value that holds the next hardclock event is per-CPU, and what prevents those per-CPU values from sliding between CPUs? During boot, the dummy timer will put "random" starting values in there? Or am I wrong?

IIRC nexthard is initialized to the same value on all CPUs and incremented by tick_sbt up to the current time. They may lag on idle CPUs, but should catch up back in sync as soon as CPU wake up. I don't see there any randomness. By the time cpu_initclocks_bsp() calls configtimer() to initialize next hard I see system should already calibrate TSC, so time should already be accurate.

nexttick may be random on a system with periodic timers, but as I have told above, I'd prefer it to be removed for periodic timers rather than fixed, unless I missed what it is used for.

Whats about "kern.hz a power of two", we never had that limitation, and introducing it now feels a bit late. Mentioned divisions for earlier CPUs were much more expensive than for modern, and still hz was never restricted. I'd prefer hz to be deprecated instead.

IIRC nexthard is initialized to the same value on all CPUs and incremented by tick_sbt up to the current time. They may lag on idle CPUs, but should catch up back in sync as soon as CPU wake up. I don't see there any randomness. By the time
cpu_initclocks_bsp() calls configtimer() to initialize next hard I see system should already calibrate TSC, so time should already be accurate.

I see that too, but dumping the values I see they are not aligned like they should.

Try this:

echo kern.hz=2000 >> /boot/loader.conf
reboot
enter kgdb and dump the nexthard value and check if it is divisible by tick_sbt as it should.

Whats about "kern.hz a power of two", we never had that limitation, and introducing it now feels a bit late. Mentioned divisions for earlier CPUs were much more expensive than for modern, and still hz was never restricted. I'd prefer hz to be deprecated instead.

It is not a good idea to deprecate "hz", because thread preemption depends on it! I.E. when a thread runs for too long, and other threads needs to be switched in.
Also epoch reclamation is coupled with the hz.

In D37594#856094, @hselasky wrote:

enter kgdb and dump the nexthard value and check if it is divisible by tick_sbt as it should.

nexthard is not expected to be divisible by tick_sbt, since configtimer() initializes it to sbinuptime() + timerperiod. If you see something above that, please share, save my time.

It is not a good idea to deprecate "hz", because thread preemption depends on it!

Scheduler's preemption in sched_clock() depends on statclock (stathz), not hardclock (hz). Though sched_ule indeed uses ticks for some stats. And I am only saying that we should stop using hz in all random applications, not depending on it and just needing sleep for some time, etc.