Details

Reviewers

• hselasky
kib
mav
glebius

Group Reviewers

transport

Commits

rG446ccdd08e2a: tcp: use single locked callout per tcpcb for the TCP timers

Summary

Use only one callout structure per tcpcb that is responsible for handling
all five TCP timeouts. Use locked version of callout, of course. The
callout function tcp_timer_enter() chooses soonest timer and executes it
with lock held. Unless the timer reports that the tcpcb has been freed,
the callout is rescheduled for next soonest timer, if there is any.

The single callout per tcpcb guarantees that on connection teardown we
can fully stop the callout and immediately free it, avoiding use of
callout_async_drain(). That is impossible with multiple callout per
a structure. In its turn this makes tcp_discardcb() the single entry
point for tcpcb destructor, merging the tcp_freecb() to the end of
the function.

While here, also remove lots of lingering checks in the beginning of
TCP timer functions. With a locked callout they are unnecessary.

While here, clean unused parts of timer KPI for the pluggable TCP stacks.

While here, remove TCPDEBUG from tcp_timer.c, as this allows for more
simplification of TCP timers. The TCPDEBUG is scheduled for removal.

Move the DTrace probes in timers to the beginning of a function, where
a tcpcb is always existing.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 48619
Build 45505: arc lint + arc unit

Event Timeline

glebius created this revision.Nov 9 2022, 11:02 PM

Herald added 1 blocking reviewer(s): transport. · View Herald TranscriptNov 9 2022, 11:02 PM

Herald added subscribers: melifaro, imp. · View Herald Transcript

glebius requested review of this revision.Nov 9 2022, 11:02 PM

glebius added a parent revision: D37126: inpcb: allow to provide protocol specific pcb size.

Harbormaster completed remote builds in B48249: Diff 112821.Nov 9 2022, 11:02 PM

glebius mentioned this in D37261: tcp: use locked callouts for the TCP timers.Nov 9 2022, 11:02 PM

gbe added a subscriber: gbe.Nov 10 2022, 6:19 AM

Add a handler for callout_stop() failure and a comment explaining what's going on.

Harbormaster completed remote builds in B48423: Diff 113275.Nov 18 2022, 4:56 PM

glebius added a reviewer: kib.Nov 18 2022, 4:57 PM

Some profiling on how often and how long the looping happens. With this patch:

@@ -897,8 +910,17 @@ tcp_timer_stop(struct tcpcb *tp)
                stopped = callout_stop(&tp->t_callout);
                MPASS(stopped == 0);
        } else while(__predict_false(callout_stop(&tp->t_callout) == 0)) {
+               loops++;
                INP_WUNLOCK(inp);
                kern_yield(PRI_UNCHANGED);
                INP_WLOCK(inp);
        }
+
+       if (__predict_false(loops > 0)) {
+               u_int i, l;
+
+               for (i = 0, l = loops; l != 0; l >>= 1)
+                       i++;
+               counter_u64_add(tcp_timer_stop_loops[i], 1);
+       }
 }

And http/https traffic in order of 20-40 Gbit/s it happens less than 1 time per hour. However, it may loop pretty long time :(

# sysctl net.inet.tcp.timer_stop_loops
net.inet.tcp.timer_stop_loops: 0 1 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
# uptime
 5:04PM  up  7:50, 2 users, load averages: 9.93, 10.22, 10.08

This results were obtained with INVARIANTS+WITNESS kernel.

kib added inline comments.Nov 19 2022, 10:45 PM

sys/netinet/tcp_timer.c
846	Shouldn't you assert that TDP_INCALLOUT is not set there? If not, perhaps you need to curthread_pflags_set()/restore().
850	Don't you need to execute all timers which t_timers value is less or equal to the current moment?
sys/sys/proc.h
560	I believe that e.g. TDP_INTCPCALLOUT (or similar) would be a better name for the flag. Also, for comment in this style, please use netinet/tcp_timer.c.

Rename the flag.
Check absence of flag before setting.

Harbormaster completed remote builds in B48441: Diff 113382.Nov 21 2022, 5:09 PM

glebius marked 2 inline comments as done.Nov 21 2022, 5:11 PM

glebius added inline comments.

sys/netinet/tcp_timer.c
850	If two timers alias to the same sbt, or if we race with callout thread, providing sbt argument from the past, the callout(9) system will self correct and schedule to next slot. It is in the very beginning of callout_cc_add().

Profiling data for longer period. Note that the kernel is with WITNESS, I believe this explains very long spin times. In a day or two I will have larger profiling data for a group of machines and without WITNESS.

# sysctl net.inet.tcp.timer_stop_loops
net.inet.tcp.timer_stop_loops: 0 5 4 5 2 3 5 7 3 3 1 0 2 0 0 0 0 1 2 1 0 0 0 0 0 0 0 0 0 0 0 0
# uptime
 5:12PM  up 3 days,  7:58, 2 users, load averages: 9.67, 9.64, 9.86

• hselasky added inline comments.Nov 21 2022, 8:23 PM

sys/netinet/tcp_timer.c
855	Hi, It might be an idea to align sbinuptime() to tick_sbt, before doing that addition. This way you avoid firing the timer more than needed. Instead increase kern.hz ? --HPS
903	This is wrong and might theoretically lead to live locks. Why not learn the callout subsystem about SMR and NET-EPOCH?

emaste added a subscriber: emaste.Nov 22 2022, 3:06 PM

kib added inline comments.Nov 22 2022, 9:45 PM

sys/netinet/tcp_timer.c
850	Might be, but why go through all the innards of callout_reset(), in particular, take at least two spinlocks etc?

glebius added inline comments.Nov 25 2022, 5:03 AM

sys/netinet/tcp_timer.c
850	Might be, but why go through all the innards of callout_reset(), in particular, take at least two spinlocks etc? The probability of two timers aliasing to the same sbt is very low. The cost of rechecking for aliased timer every time would eat more CPU cycles than going through callout_reset spinlocks once in a while. P.S. We might rethink that if we reduce precision of TCP timers.
855	It might be an idea to align sbinuptime() to tick_sbt, before doing that addition. This way you avoid firing the timer more than needed. Instead increase kern.hz ? Sorry, can't understand. What do you mean with "align sbinuptime() to tick_sbt"?
855	Hi, It might be an idea to align sbinuptime() to tick_sbt, before doing that addition. This way you avoid firing the timer more than needed. Instead increase kern.hz ? --HPS

I got some profiling data when running without WITNESS and with patch that counts the loops. The machines were serving 10 - 100 Gbit/s, but as they are using RACK, they use TCP timers not as often as the default stack does. Some machines didn't detect any looping but most experienced very little looping:

c051:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c012:net.inet.tcp.timer_stop_loops: 0 2 1 1 1 2 3 3 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
c044:net.inet.tcp.timer_stop_loops: 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c067:net.inet.tcp.timer_stop_loops: 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c072:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c047:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c001:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 1 0 1 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c455:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c009:net.inet.tcp.timer_stop_loops: 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c014:net.inet.tcp.timer_stop_loops: 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c366:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c602:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c011:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c620:net.inet.tcp.timer_stop_loops: 0 0 0 1 2 1 0 2 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c007:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c004:net.inet.tcp.timer_stop_loops: 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c631:net.inet.tcp.timer_stop_loops: 0 0 0 1 0 2 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c649:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c655:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c411:net.inet.tcp.timer_stop_loops: 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c309:net.inet.tcp.timer_stop_loops: 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c024:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c005:net.inet.tcp.timer_stop_loops: 0 1 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
c207:net.inet.tcp.timer_stop_loops: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Exception was machine c012, which reported a looping event with order of 2^20 iterations (see second line) and multiple shorter events. None of the machines reported any measurable problem with serving health.

kib accepted this revision.Dec 2 2022, 5:21 PM

I plan to push this change next week.

Please look at my comments.

sys/netinet/tcp_timer.c
855	This code: sbinuptime() + tick_sbt * ticks; Should be written like: sbt_t sbt = sbinuptime(); sbt - (sbt % tick_sbt) + tick_sbt * ticks; Please also use another variable name than "ticks", because that is the name of a global variable! Sorry, can't understand. I think you don't see that the timer code will fire multiple timeouts if you don't align the absolute timeout value you pass to callout_reset_sbt(). The callout subsystem is not using periodic mode by default, and simply looks for the next timeout, based on the sbt value you pass. When you don't align the sbt value to tick_sbt, you risk the the "random" remainder, will lead to "N * kern.hz" callout IRQ's on the given CPU instead of only limited by "kern.hz". @mav : Can you help explain this obvious thing?

This revision now requires changes to proceed.Dec 2 2022, 11:22 PM

• hselasky added a reviewer: mav.Dec 2 2022, 11:22 PM

@glebius : Consider time alignment as a guarantee against excess number of timer IRQ's per second. If you want an own "hz" value in the TCP stack, that's OK.

For example if you have 10000 TCP callouts on a CPU core, and some mechanism like delayed ACK kicks in at the same time, then instead of limiting the maximum theoretical timer IRQ rate to "kern.hz", you basically allow a IRQ rate limited by the number of active TCP callouts on the given CPU core, because YES, the timer code is very simple and will worst-case fire the 100000 callouts one by one, even if the time difference is less than tick_sbt ! As stupid as it sounds. That's why I recommend, always align the absolute SBT time, when using callouts in large amounts!

I see. The current code uses callout_reset_on() which has C_HARDCLOCK in the macro, that does this rounding down. Do I understand it correct that the "precision" argument to callout_reset_sbt_on() becomes pretty useless if we specify C_HARDLOCK?

Use callout_when() to make sure that we are rounded down
Don't use local variable named "ticks"

Harbormaster completed remote builds in B48619: Diff 113864.Dec 6 2022, 12:27 AM

There is currently a bug with C_HARDCLOCK , but we can fix that separately. Just use the flag for now.

See: https://reviews.freebsd.org/D37594

In D37321#855118, @glebius wrote:

Do I understand it correct that the "precision" argument to callout_reset_sbt_on() becomes pretty useless if we specify C_HARDLOCK?

It is not useless if specific CPU is idle when the time of the event comes. The current patch passing zero precision would wake up CPU exactly on that hardclock tick, whenever it is needed for anything else or not. May be it could aggregate with some other hardclock callouts there, but more likely not. Even old callout_reset_on() is not that strict by using system-wide precision of 5%. If TCP can allow lower precision, then you should specify it. It allows the callout to be handled at some later time when CPU finally wake up. For that to work you should probably store the absolute precision values in addition to absolute times and make tcp_timer_next() return both time and precision, if different TCP timers may have too different precisions that may conflict (formally later timer may have smaller precision and so can not wait for long precision of the formally first timer, that is normally resolved by callouts subsystem).

I am not a big fan of C_HARDCLOCK any more. With fast TSC clocks everywhere C_HARDCLOCK's ability to save on sbinuptime() gives much less benefits. It is still OK to use it, if the area is really tied to hardclock in some way, but I would prefer the precision mechanism to be used when possible instead or in addition to reach events aggregation.

In D37321#855218, @mav wrote:

In D37321#855118, @glebius wrote:

Do I understand it correct that the "precision" argument to callout_reset_sbt_on() becomes pretty useless if we specify C_HARDLOCK?

It is not useless if specific CPU is idle when the time of the event comes. The current patch passing zero precision would wake up CPU exactly on that hardclock tick, whenever it is needed for anything else or not. May be it could aggregate with some other hardclock callouts there, but more likely not. Even old callout_reset_on() is not that strict by using system-wide precision of 5%. If TCP can allow lower precision, then you should specify it. It allows the callout to be handled at some later time when CPU finally wake up. For that to work you should probably store the absolute precision values in addition to absolute times and make tcp_timer_next() return both time and precision, if different TCP timers may have too different precisions that may conflict (formally later timer may have smaller precision and so can not wait for long precision of the formally first timer, that is normally resolved by callouts subsystem).

I am not a big fan of C_HARDCLOCK any more. With fast TSC clocks everywhere C_HARDCLOCK's ability to save on sbinuptime() gives much less benefits. It is still OK to use it, if the area is really tied to hardclock in some way, but I would prefer the precision mechanism to be used when possible instead or in addition to reach events aggregation.

Thanks! My intent is not to change the precision of TCP callouts with this change. Couple weeks ago I started discussion with the TCP guys on what precision do we require or do we allow for TCP. Your input here is valuable, I will reiterate on next TCP meeting.

In D37321#855218, @mav wrote:

Even old callout_reset_on() is not that strict by using system-wide precision of 5%.

I don't see that. In sys/callout.h:

#define callout_reset_on(c, to_ticks, fn, arg, cpu)                     \  
    callout_reset_sbt_on((c), tick_sbt * (to_ticks), 0, (fn), (arg),    \
        (cpu), C_HARDCLOCK)

In D37321#855260, @glebius wrote:

In D37321#855218, @mav wrote:

Even old callout_reset_on() is not that strict by using system-wide precision of 5%.

I don't see that. In sys/callout.h:

See tc_precexp in callout_when(). You've lost it when you dropped precision argument returned by callout_when().

In D37321#855262, @mav wrote:

In D37321#855260, @glebius wrote:

In D37321#855218, @mav wrote:

Even old callout_reset_on() is not that strict by using system-wide precision of 5%.

I don't see that. In sys/callout.h:

See tc_precexp in callout_when(). You've lost it when you dropped precision argument returned by callout_when().

Oh, thanks! I got another question then. Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

In D37321#855280, @glebius wrote:

Oh, thanks! I got another question then. Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

I see, it depends on the ticks...

In D37321#855280, @glebius wrote:

Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

callout_when() returns precision value as the biggest of two: percent of relative time (either system-wide or passed explicitly with C_PREL() via flags), or explicitly passed to it absolute value. Once with callout_when() you switch from relative to absolute times, you have no relative time and so you can only use absolute precision returned by it as-is. What I think you should do to properly aggregate the events with each other and with other system, is to make tcp_timer_next() return absolute time of the first event (t1) as time and min(time+precision) of all events minus t1 as precision.

In D37321#855299, @mav wrote:

In D37321#855280, @glebius wrote:

Is this returned precision going to be the same for any ticks argument or not? For example I call callout_when for TT_DELACK and store the result, but then I see that TT_PERSIST is going to be run earlier. Can I use precision returned by callout_when() for TT_DELACK to schedule TT_PERSIST?

callout_when() returns precision value as the biggest of two: percent of relative time (either system-wide or passed explicitly with C_PREL() via flags), or explicitly passed to it absolute value. Once with callout_when() you switch from relative to absolute times, you have no relative time and so you can only use absolute precision returned by it as-is. What I think you should do to properly aggregate the events with each other and with other system, is to make tcp_timer_next() return absolute time of the first event (t1) as time and min(time+precision) of all events minus t1 as precision.

For this change I'm going to take a conservative approach and just store the precision returned by callout_when(). Then I will discuss with TCP guys what we actually want. On our last discussion we touched that briefly and they noticed that our TCP timing is not as precise as it is on Linux and this (could be) a problem (sometimes). IMHO, most likely we want to have different precision for different timers. Anyway, this to happen in a different changeset.

In D37321#855300, @glebius wrote:

For this change I'm going to take a conservative approach and just store the precision returned by callout_when(). Then I will discuss with TCP guys what we actually want. On our last discussion we touched that briefly and they noticed that our TCP timing is not as precise as it is on Linux and this (could be) a problem (sometimes). IMHO, most likely we want to have different precision for different timers. Anyway, this to happen in a different changeset.

I am not insisting on introducing different precisions in this patch. But when looking for the first event you can't just look on its time and then pass its precision, since it may cause conflict I specified earlier. You should implement the math I specified in the last message. While it was several different callouts, the math was done by the callout subsystem.

And it will work even better when you'll be able to pass more accurate precision information to callout_when(). ;)

Save precisions from callout_when() and use them when scheduling.

Harbormaster completed remote builds in B48631: Diff 113889.Dec 6 2022, 6:45 PM

Provide precision calculation in tcp_timer_next() per Alexander's suggestion.

Harbormaster completed remote builds in B48633: Diff 113892.Dec 6 2022, 7:36 PM

Zero t_precisions[which] in tcp_timer_enter(), by mav@

Harbormaster completed remote builds in B48634: Diff 113895.Dec 6 2022, 7:54 PM

Committed 446ccdd08e2a9f704f6348cd7f679e59183b99b3

emaste added a commit: rG446ccdd08e2a: tcp: use single locked callout per tcpcb for the TCP timers.Dec 14 2022, 4:40 AM

tcp: use single locked callout per tcpcb for the TCP timers
AbandonedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 113864

sys/netinet/tcp_stacks/bbr.c

sys/netinet/tcp_stacks/rack.c

sys/netinet/tcp_subr.c

sys/netinet/tcp_timer.h

sys/netinet/tcp_timer.c

sys/netinet/tcp_usrreq.c

sys/netinet/tcp_var.h

sys/sys/proc.h

tcp: use single locked callout per tcpcb for the TCP timersAbandonedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 113864

sys/netinet/tcp_stacks/bbr.c

sys/netinet/tcp_stacks/rack.c

sys/netinet/tcp_subr.c

sys/netinet/tcp_timer.h

sys/netinet/tcp_timer.c

sys/netinet/tcp_usrreq.c

sys/netinet/tcp_var.h

sys/sys/proc.h

tcp: use single locked callout per tcpcb for the TCP timers
AbandonedPublic
Actions

Revision Contents
Changeset List