
tcp: use single locked callout per tcpcb for the TCP timers
Abandoned · Public

Authored by glebius on Nov 9 2022, 11:02 PM.

Details

Summary

Use only one callout structure per tcpcb, responsible for handling all
five TCP timeouts. Use the locked version of callout, of course. The
callout function tcp_timer_enter() chooses the soonest timer and executes
it with the lock held. Unless the timer reports that the tcpcb has been
freed, the callout is rescheduled for the next soonest timer, if there is any.

A single callout per tcpcb guarantees that on connection teardown we
can fully stop the callout and immediately free it, avoiding the use of
callout_async_drain(). That is impossible with multiple callouts per
structure. In turn this makes tcp_discardcb() the single entry
point for the tcpcb destructor, merging tcp_freecb() into the end of
the function.

While here, also remove the many lingering checks at the beginning of the
TCP timer functions. With a locked callout they are unnecessary.

While here, clean up unused parts of the timer KPI for pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows further
simplification of the TCP timers. TCPDEBUG is scheduled for removal.

Move the DTrace probes in the timers to the beginning of each function,
where the tcpcb is guaranteed to exist.
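For illustration, the timer-selection step described above can be sketched in userspace C. All names here (timer_next, SBT_NONE, TT_N, sbt_like_t) are illustrative stand-ins, not the actual kernel KPI: conceptually, tcp_timer_enter() picks the soonest armed deadline, runs its handler with the lock held, and then reschedules for the next soonest.

```c
#include <stdint.h>

typedef int64_t sbt_like_t;		/* stands in for sbintime_t */
#define	TT_N		5		/* five TCP timers, per the summary */
#define	SBT_NONE	INT64_MAX	/* hypothetical "timer not armed" marker */

/*
 * Hedged sketch: scan the five per-tcpcb deadlines and return the index
 * of the soonest armed one, or -1 if none is armed. The real callout
 * handler would run that timer's function and then rearm the single
 * callout for the next soonest deadline, if any.
 */
static int
timer_next(const sbt_like_t deadlines[TT_N])
{
	sbt_like_t best = SBT_NONE;
	int i, which = -1;

	for (i = 0; i < TT_N; i++) {
		if (deadlines[i] < best) {
			best = deadlines[i];
			which = i;
		}
	}
	return (which);
}
```

A disarmed slot simply holds the sentinel, so "no timer pending" falls out of the same scan that finds the minimum.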

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

Add a handler for callout_stop() failure and a comment explaining what's going on.

Some profiling on how often and how long the looping happens. With this patch:

@@ -897,8 +910,17 @@ tcp_timer_stop(struct tcpcb *tp)
                stopped = callout_stop(&tp->t_callout);
                MPASS(stopped == 0);
        } else while(__predict_false(callout_stop(&tp->t_callout) == 0)) {
+               loops++;
                INP_WUNLOCK(inp);
                kern_yield(PRI_UNCHANGED);
                INP_WLOCK(inp);
        }
+
+       if (__predict_false(loops > 0)) {
+               u_int i, l;
+
+               for (i = 0, l = loops; l != 0; l >>= 1)
+                       i++;
+               counter_u64_add(tcp_timer_stop_loops[i], 1);
+       }
 }
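The histogram update at the end of the diff places a loop count of N into bucket floor(log2(N)) + 1. A userspace sketch of that bucketing idiom (loop_bucket is a hypothetical name; the kernel code adds to the per-CPU counter_u64 array directly):

```c
/*
 * Hedged sketch of the log2 bucketing from the patch above: shift the
 * loop count right until it reaches zero, counting the shifts. A count
 * of N lands in bucket floor(log2(N)) + 1; a count of 0 maps to bucket 0.
 */
static unsigned
loop_bucket(unsigned loops)
{
	unsigned i, l;

	for (i = 0, l = loops; l != 0; l >>= 1)
		i++;
	return (i);
}
```

So in the sysctl output below, a nonzero value at index i means at least one stop event spun between 2^(i-1) and 2^i - 1 times.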

With HTTP/HTTPS traffic on the order of 20-40 Gbit/s it happens less than once per hour. However, a single event may loop for quite a long time :(

# sysctl net.inet.tcp.timer_stop_loops
net.inet.tcp.timer_stop_loops: 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
# uptime
 5:04PM  up  7:50, 2 users, load averages: 9.93, 10.22, 10.08

These results were obtained with an INVARIANTS+WITNESS kernel.

sys/netinet/tcp_timer.c
852

Shouldn't you assert that TDP_INCALLOUT is not set there?

If not, perhaps you need to use curthread_pflags_set()/restore().

856

Don't you need to execute all timers whose t_timers value is less than or equal to the current moment?

sys/sys/proc.h
560

I believe that e.g. TDP_INTCPCALLOUT (or similar) would be a better name for the flag.
Also, for comment in this style, please use netinet/tcp_timer.c.

  • Rename the flag.
  • Check absence of flag before setting.
glebius added inline comments.
sys/netinet/tcp_timer.c
856

If two timers alias to the same sbt, or if we race with the callout thread and provide an sbt argument from the past, the callout(9) system will self-correct and schedule for the next slot. It is at the very beginning of callout_cc_add().

Profiling data for a longer period. Note that the kernel is built with WITNESS; I believe this explains the very long spin times. In a day or two I will have larger profiling data for a group of machines, without WITNESS.

# sysctl net.inet.tcp.timer_stop_loops
net.inet.tcp.timer_stop_loops: 0 5 4 5 2 3 5 7 3 3 1 0 2 0 0 0 0 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0
# uptime
 5:12PM  up 3 days,  7:58, 2 users, load averages: 9.67, 9.64, 9.86
sys/netinet/tcp_timer.c
861

Hi,

It might be an idea to align sbinuptime() to tick_sbt before doing that addition. This way you avoid firing the timer more often than needed. Alternatively, increase kern.hz?

--HPS

910

This is wrong and might theoretically lead to livelocks. Why not teach the callout subsystem about SMR and NET-EPOCH?

sys/netinet/tcp_timer.c
856

Might be, but why go through all the innards of callout_reset(), in particular, take at least two spinlocks etc?

sys/netinet/tcp_timer.c
856

Might be, but why go through all the innards of callout_reset(), in particular, take at least two spinlocks etc?

The probability of two timers aliasing to the same sbt is very low. The cost of rechecking for an aliased timer every time would eat more CPU cycles than going through the callout_reset spinlocks once in a while.

P.S. We might rethink that if we reduce precision of TCP timers.

861

It might be an idea to align sbinuptime() to tick_sbt, before doing that addition. This way you avoid firing the timer more than needed. Instead increase kern.hz ?

Sorry, I can't understand. What do you mean by "align sbinuptime() to tick_sbt"?


glebius marked an inline comment as not done. Nov 25 2022, 5:13 AM

I got some profiling data running without WITNESS and with the patch that counts the loops. The machines were serving 10-100 Gbit/s, but as they are using RACK, they use TCP timers less often than the default stack does. Some machines didn't detect any looping, and most experienced very little:

c051:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c012:net.inet.tcp.timer_stop_loops: 0 2 1 1 1 2 3 3 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
c044:net.inet.tcp.timer_stop_loops: 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c067:net.inet.tcp.timer_stop_loops: 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c072:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c047:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c001:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 1 0 1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c455:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c009:net.inet.tcp.timer_stop_loops: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c014:net.inet.tcp.timer_stop_loops: 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c366:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c602:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c011:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c620:net.inet.tcp.timer_stop_loops: 0 0 0 1 2 1 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c007:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c004:net.inet.tcp.timer_stop_loops: 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c631:net.inet.tcp.timer_stop_loops: 0 0 0 1 0 2 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c649:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c655:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c411:net.inet.tcp.timer_stop_loops: 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c309:net.inet.tcp.timer_stop_loops: 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c024:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c005:net.inet.tcp.timer_stop_loops: 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c207:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The exception was machine c012, which reported a looping event on the order of 2^20 iterations (see the second line) and multiple shorter events. None of the machines reported any measurable impact on serving health.

I plan to push this change next week.

hselasky added a subscriber: mav.

Please look at my comments.

sys/netinet/tcp_timer.c
861

This code:

sbinuptime() + tick_sbt * ticks;

Should be written like:

sbintime_t sbt = sbinuptime();
sbt - (sbt % tick_sbt) + tick_sbt * ticks;

Please also use a variable name other than "ticks", because that is the name of a global variable!

Sorry, can't understand.

I think you don't see that the timer code will fire multiple timeouts if you don't align the absolute timeout value you pass to callout_reset_sbt().
The callout subsystem does not use periodic mode by default, and simply looks for the next timeout based on the sbt value you pass.

When you don't align the sbt value to tick_sbt, you risk that the "random" remainder will lead to "N * kern.hz" callout IRQs on the given CPU instead of being limited by "kern.hz".
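The alignment being suggested here can be sketched as plain integer arithmetic (align_to_tick is a hypothetical helper name; in the kernel the sbintime_t values are 64-bit fixed-point, but the rounding is the same):

```c
#include <stdint.h>

/*
 * Hedged sketch of the suggested alignment: round the current absolute
 * time down to a tick boundary before adding the relative timeout. Any
 * two callouts armed within the same tick then collapse onto one slot
 * instead of each getting its own "random" sub-tick deadline.
 */
static int64_t
align_to_tick(int64_t now_sbt, int64_t tick_sbt, int64_t nticks)
{
	return (now_sbt - (now_sbt % tick_sbt) + tick_sbt * nticks);
}
```

Without the `now_sbt % tick_sbt` subtraction, each caller's deadline inherits a different sub-tick remainder, which is the "N * kern.hz" IRQ-rate problem described above.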

@mav : Can you help explain this obvious thing?

This revision now requires changes to proceed. Dec 2 2022, 11:22 PM

@glebius : Consider time alignment a guarantee against an excess number of timer IRQs per second. If you want your own "hz" value in the TCP stack, that's OK.

For example, if you have 10000 TCP callouts on a CPU core, and some mechanism like delayed ACK kicks in at the same time, then instead of limiting the maximum theoretical timer IRQ rate to "kern.hz", you basically allow an IRQ rate limited only by the number of active TCP callouts on the given CPU core, because YES, the timer code is very simple and will worst-case fire the 10000 callouts one by one, even if the time difference is less than tick_sbt! As stupid as it sounds. That's why I recommend always aligning the absolute SBT time when using callouts in large amounts!

I see. The current code uses callout_reset_on(), which has C_HARDCLOCK in the macro, and that does this rounding down. Do I understand correctly that the "precision" argument to callout_reset_sbt_on() becomes pretty useless if we specify C_HARDCLOCK?

  • Use callout_when() to make sure that we are rounded down
  • Don't use local variable named "ticks"

There is currently a bug with C_HARDCLOCK, but we can fix that separately. Just use the flag for now.

See: https://reviews.freebsd.org/D37594

Do I understand correctly that the "precision" argument to callout_reset_sbt_on() becomes pretty useless if we specify C_HARDCLOCK?

It is not useless if the specific CPU is idle when the time of the event comes. The current patch, passing zero precision, would wake up the CPU exactly on that hardclock tick, whether it is needed for anything else or not. Maybe it could aggregate with some other hardclock callouts there, but more likely not. Even the old callout_reset_on() is not that strict, using the system-wide precision of 5%. If TCP can allow lower precision, then you should specify it. That allows the callout to be handled at some later time when the CPU finally wakes up. For that to work you should probably store the absolute precision values in addition to the absolute times and make tcp_timer_next() return both time and precision, in case different TCP timers have precisions that conflict (a formally later timer may have a smaller precision and so cannot wait out the long precision of the formally first timer; that is normally resolved by the callout subsystem).

I am not a big fan of C_HARDCLOCK any more. With fast TSC clocks everywhere, C_HARDCLOCK's ability to save on sbinuptime() gives much less benefit. It is still OK to use it if the area is really tied to hardclock in some way, but I would prefer the precision mechanism to be used when possible, instead or in addition, to achieve event aggregation.

In D37321#855218, @mav wrote:

Do I understand correctly that the "precision" argument to callout_reset_sbt_on() becomes pretty useless if we specify C_HARDCLOCK?

It is not useless if the specific CPU is idle when the time of the event comes. The current patch, passing zero precision, would wake up the CPU exactly on that hardclock tick, whether it is needed for anything else or not. Maybe it could aggregate with some other hardclock callouts there, but more likely not. Even the old callout_reset_on() is not that strict, using the system-wide precision of 5%. If TCP can allow lower precision, then you should specify it. That allows the callout to be handled at some later time when the CPU finally wakes up. For that to work you should probably store the absolute precision values in addition to the absolute times and make tcp_timer_next() return both time and precision, in case different TCP timers have precisions that conflict (a formally later timer may have a smaller precision and so cannot wait out the long precision of the formally first timer; that is normally resolved by the callout subsystem).

I am not a big fan of C_HARDCLOCK any more. With fast TSC clocks everywhere, C_HARDCLOCK's ability to save on sbinuptime() gives much less benefit. It is still OK to use it if the area is really tied to hardclock in some way, but I would prefer the precision mechanism to be used when possible, instead or in addition, to achieve event aggregation.

Thanks! My intent is not to change the precision of TCP callouts with this change. A couple of weeks ago I started a discussion with the TCP guys on what precision we require or allow for TCP. Your input here is valuable; I will bring it up at the next TCP meeting.

In D37321#855218, @mav wrote:

Even old callout_reset_on() is not that strict by using system-wide precision of 5%.

I don't see that. In sys/callout.h:

#define callout_reset_on(c, to_ticks, fn, arg, cpu)                     \  
    callout_reset_sbt_on((c), tick_sbt * (to_ticks), 0, (fn), (arg),    \
        (cpu), C_HARDCLOCK)
In D37321#855218, @mav wrote:

Even old callout_reset_on() is not that strict by using system-wide precision of 5%.

I don't see that. In sys/callout.h:

See tc_precexp in callout_when(). You lost it when you dropped the precision argument returned by callout_when().

In D37321#855262, @mav wrote:
In D37321#855218, @mav wrote:

Even old callout_reset_on() is not that strict by using system-wide precision of 5%.

I don't see that. In sys/callout.h:

See tc_precexp in callout_when(). You lost it when you dropped the precision argument returned by callout_when().

Oh, thanks! I got another question then. Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

Oh, thanks! I got another question then. Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

I see, it depends on the ticks...

Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

callout_when() returns the precision value as the larger of two: a percentage of the relative time (either system-wide or passed explicitly with C_PREL() via flags), or the absolute value explicitly passed to it. Once you switch from relative to absolute times with callout_when(), you have no relative time, and so you can only use the absolute precision returned by it as-is. What I think you should do to properly aggregate the events with each other and with the rest of the system is to make tcp_timer_next() return the absolute time of the first event (t1) as the time, and min(time + precision) over all events minus t1 as the precision.

In D37321#855299, @mav wrote:

Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

callout_when() returns the precision value as the larger of two: a percentage of the relative time (either system-wide or passed explicitly with C_PREL() via flags), or the absolute value explicitly passed to it. Once you switch from relative to absolute times with callout_when(), you have no relative time, and so you can only use the absolute precision returned by it as-is. What I think you should do to properly aggregate the events with each other and with the rest of the system is to make tcp_timer_next() return the absolute time of the first event (t1) as the time, and min(time + precision) over all events minus t1 as the precision.

For this change I'm going to take a conservative approach and just store the precision returned by callout_when(). Then I will discuss with the TCP guys what we actually want. In our last discussion we touched on that briefly, and they noted that our TCP timing is not as precise as on Linux and that this (could be) a problem (sometimes). IMHO, most likely we want different precision for different timers. Anyway, this will happen in a different changeset.

For this change I'm going to take a conservative approach and just store the precision returned by callout_when(). Then I will discuss with the TCP guys what we actually want. In our last discussion we touched on that briefly, and they noted that our TCP timing is not as precise as on Linux and that this (could be) a problem (sometimes). IMHO, most likely we want different precision for different timers. Anyway, this will happen in a different changeset.

I am not insisting on introducing different precisions in this patch. But when looking for the first event you can't just look at its time and then pass its precision, since that may cause the conflict I specified earlier. You should implement the math I described in the last message. When there were several different callouts, that math was done by the callout subsystem.
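The aggregation rule mav describes can be sketched as follows (the function name and array layout are hypothetical; the kernel stores per-timer absolute times and precisions in the tcpcb):

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Hedged sketch of the suggested math: report the soonest absolute
 * deadline t1 as the callout time, and min(time_i + prec_i) - t1 as the
 * precision. This way no timer's hard upper bound (time + precision)
 * is violated, while the callout subsystem still has room to aggregate.
 * Precision is clamped at zero in case a later timer's upper bound
 * precedes the soonest deadline.
 */
static void
aggregate_deadlines(const int64_t *times, const int64_t *precs, size_t n,
    int64_t *when, int64_t *prec)
{
	int64_t t1 = times[0], ub = times[0] + precs[0];
	size_t i;

	for (i = 1; i < n; i++) {
		if (times[i] < t1)
			t1 = times[i];
		if (times[i] + precs[i] < ub)
			ub = times[i] + precs[i];
	}
	*when = t1;
	*prec = (ub > t1) ? ub - t1 : 0;
}
```

For example, a sloppy timer at t=100 with precision 50 and a strict timer at t=105 with precision 2 must fire by 107, so the reported precision shrinks to 7 even though the soonest deadline belongs to the sloppy timer.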

And it will work even better when you are able to pass more accurate precision information to callout_when(). ;)

  • Save precisions from callout_when() and use them when scheduling.
  • Provide precision calculation in tcp_timer_next() per Alexander's suggestion.
  • Zero t_precisions[which] in tcp_timer_enter(), by mav@