
Update Rack and BBR to the latest from NF fixing LRO issues along the way.
ClosedPublic

Authored by rrs on Apr 29 2021, 1:48 PM.

Details

Summary

This update brings both rack and bbr into sync with the NF tree. There have been
many advancements in the rack stack (pacing, fixed-rate pacing, performance improvements, etc.), as
well as some fixes common to both bbr and rack. Both rack and bbr have been broken in
head since the LRO changes went in; this fixes that as well, so that LRO and bbr/rack now work
nicely together, and for rack the compressed-ack feature adds a good amount of performance.

Test Plan

Bring up the stack and of course make sure it works.


Event Timeline

rrs requested review of this revision.Apr 29 2021, 1:48 PM

Thanks! This patch fixes the panic I hit with BBR each time my system initiated a TCP session:

Unread portion of the kernel message buffer:
stack pointer	        = 0x28:0xfffffe028aa5b7d0
frame pointer	        = 0x28:0xfffffe028aa5b850
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 0 (if_io_tqg_54)
trap number		= 12
panic: page fault
cpuid = 54
time = 1619714192
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe028aa5b480
vpanic() at vpanic+0x181/frame 0xfffffe028aa5b4d0
panic() at panic+0x43/frame 0xfffffe028aa5b530
trap_fatal() at trap_fatal+0x387/frame 0xfffffe028aa5b590
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe028aa5b5f0
trap() at trap+0x27d/frame 0xfffffe028aa5b700
calltrap() at calltrap+0x8/frame 0xfffffe028aa5b700
--- trap 0xc, rip = 0xffffffff82bc42e5, rsp = 0xfffffe028aa5b7d0, rbp = 0xfffffe028aa5b850 ---
ctf_process_inbound_raw() at ctf_process_inbound_raw+0x345/frame 0xfffffe028aa5b850
ctf_do_queued_segments() at ctf_do_queued_segments+0x3e/frame 0xfffffe028aa5b870
tcp_lro_flush() at tcp_lro_flush+0xb6a/frame 0xfffffe028aa5b910
tcp_lro_flush_all() at tcp_lro_flush_all+0x18b/frame 0xfffffe028aa5b970
iflib_rxeof() at iflib_rxeof+0xc82/frame 0xfffffe028aa5ba80
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe028aa5bac0
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe028aa5bb40
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xac/frame 0xfffffe028aa5bb70
fork_exit() at fork_exit+0x7d/frame 0xfffffe028aa5bbb0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe028aa5bbb0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 5h34m24s
Dumping 7988 out of 261972 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80bf511a in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80bf55a0 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80bf53a3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8108d2a7 in trap_fatal (frame=0xfffffe028aa5b710, 
    eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff8108d2ff in trap_pfault (frame=frame@entry=0xfffffe028aa5b710, 
    usermode=false, signo=<optimized out>, signo@entry=0x0, 
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff8108c95d in trap (frame=0xfffffe028aa5b710)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  tcp_fields_to_host (th=0x0) at /usr/src/sys/netinet/tcp_var.h:1104
#10 ctf_process_inbound_raw (tp=0xfffffe01d20044d8, 
    so=so@entry=0xfffff8309820b3b0, m=0xfffff8209879c300, has_pkt=0)
    at /usr/src/sys/modules/tcp/bbr/../../../netinet/tcp_stacks/rack_bbr_common.c:396
#11 0xffffffff82bc449e in ctf_do_queued_segments (so=0xfffff8309820b3b0, 
    tp=0xfffffe01d20044d8, have_pkt=-1736850688)
    at /usr/src/sys/modules/tcp/bbr/../../../netinet/tcp_stacks/rack_bbr_common.c:457
#12 0xffffffff80dbf67a in tcp_lro_flush_tcphpts (lc=0xfffff801028d8930, 
    le=<optimized out>) at /usr/src/sys/netinet/tcp_lro.c:1316
#13 tcp_lro_flush (lc=lc@entry=0xfffff801028d8930, le=0xfffffe01d35f0658)
    at /usr/src/sys/netinet/tcp_lro.c:1333
#14 0xffffffff80dbf96b in tcp_lro_rx_done (lc=0xfffff801028d8930)
    at /usr/src/sys/netinet/tcp_lro.c:557
#15 tcp_lro_flush_all (lc=lc@entry=0xfffff801028d8930)
    at /usr/src/sys/netinet/tcp_lro.c:1484
#16 0xffffffff80d2ab12 in iflib_rxeof (rxq=<optimized out>, 
    rxq@entry=0xfffff801028d8900, budget=<optimized out>)
    at /usr/src/sys/net/iflib.c:3057
#17 0xffffffff80d25362 in _task_fn_rx (context=0xfffff801028d8900)
    at /usr/src/sys/net/iflib.c:3989
#18 0xffffffff80c42b0d in gtaskqueue_run_locked (
    queue=queue@entry=0xfffff801026ef200)
    at /usr/src/sys/kern/subr_gtaskqueue.c:371
#19 0xffffffff80c427ac in gtaskqueue_thread_loop (arg=<optimized out>, 
    arg@entry=0xfffffe01d33c0518) at /usr/src/sys/kern/subr_gtaskqueue.c:547
#20 0xffffffff80bb244d in fork_exit (
    callout=0xffffffff80c42700 <gtaskqueue_thread_loop>, arg=<optimized out>, 
    frame=<optimized out>) at /usr/src/sys/kern/kern_fork.c:1078
#21 <signal handler called>
kbowling added inline comments.
sys/netinet/tcp_ratelimit.h
46

should it be here?

And yes, Oliver, this is because when the LRO code went in I forgot to update rack_bbr_common.c. I am glad
it fixes the issue. Neither Michael nor I could recreate it, since we were running on VMs and did not
realize VMware Fusion does not get LRO enabled. Drew has given me a patch for that, so I should put
it in... at least on my local box ;)

sys/netinet/tcp_ratelimit.h
46

I am not understanding your comment. Maybe you are referring to the comment "Temporary"... and yeah,
that should probably not be there, i.e. the comment should say "Total number of connections using this rate".
I will fix that.

When I put it in, I thought it was just a nice temporary aid for debugging. But I have found, while
working with Drew on hardware pacing, that it is really important to have this bit of information while you
are testing, and having a sysctl to get it is really useful.

rscheff added inline comments.
sys/netinet/cc/cc_newreno.c
225

Why did you remove this RFC2861 related bound of ssthresh?

278

So, newreno no longer adjusts ssthresh and cwnd on RTO? That doesn't sound right...

sys/netinet/tcp_input.c
528

make this uint8_t flags a uint16_t flags, to cater for the ACE bits?

562

Add the th_x2 << 8 | th_flags to cover all 12 potential TCP flags? Or shall I with the AccECN patch eventually?

sys/netinet/tcp_sack.c
164

SEQ_LEQ exists

sys/netinet/cc/cc_newreno.c
225

The standard does not call for this. The standard calls for
ssthresh to be infinity at start. Yes, you can set it to less, but I find
this to be incorrect, and it does lead to issues we have seen
in production (low throughput due to getting into CA too soon).

278

Hmm, I did not take this out, but I see it somehow is not
in our tree. This must be some sort of merge damage; I'm going
to have to see where in the NF tree this came from...

sys/netinet/tcp_input.c
528

Sure we can do that.

562

Let's do that in your patch, assuming I get uint16_t in place for iptos

sys/netinet/tcp_sack.c
164

I did not realize someone finally added that. I will change the code to use it :)

rscheff added inline comments.
sys/netinet/cc/cc_newreno.c
225

Don't quite understand this. This is the after-idle function (thus it has no bearing on the initial ssthresh at connection establishment). Also, since this is mid-session, ssthresh should NOT be infinity here.

RFC2861 suggests to set ssthresh to the maximum of the former ssthresh, or 3/4 of the former cwnd. So taking this out may leave ssthresh at a smaller, former value - but never restrict it as you suggest.

If you have observations that increasing ssthresh in after-idle is counter-productive (e.g. running into extensive burst loss, RTOs, and lost retransmissions), that may be valuable to know @ IETF.

sys/netinet/tcp_sack.c
164

was apparently always available in the FreeBSD tree, ever since @rgrimes added tcp_seq.h - but yes, not all of these macros exist in all tcp stacks. :)

https://reviews.freebsd.org/R10:df8bae1de4b67ccf57f4afebd4e2bf258c38910d#change-G7qyRTbZw4J4

rrs marked an inline comment as not done.May 3 2021, 6:51 PM
rrs added inline comments.
sys/netinet/cc/cc_newreno.c
278

Ok, I see what's going on here. Looks like rack does these sets
along with a bunch of other stuff, so somehow in our tree
the case got removed when it should not have been. Since we
don't use anything but rack or BBR, it was never noticed :-)

I will fix it in the next patch update with the rest of the fixes

rrs marked an inline comment as not done.

Address some of Richards comments

sys/netinet/cc/cc_newreno.c
225

Ahh, I thought you were pointing out the initialization code. Actually I never
changed this; you must remember this is what is in the NF tree. I would
have to look historically at where or what zapped those lines. I will restore
them. It could have been merge damage or??

We would never have noticed this, since there is a bug in the way being idle
is calculated in FreeBSD: basically it has never worked for servers, only
for clients. I may have fixed that in head, can't remember, but because that
is the way the NF stack has run for years, I think we keep the option
to do idle-down removed :)

Geesh, the comments got too silly

sys/netinet/tcp_input.c
528

Remember, if you play with the flags: since they are coming from rack_bbr_common.c, you
may find different upper bits set that correspond to tcp_lro.c bits in the compressed acks.

sys/netinet/cc/cc_newreno.c
225

The Idle detection for both directions got addressed with D25016, after you hinted me on this :)

241

This will miscalculate the mss for variable-length options (e.g. bidir traffic with SACK loss recovery in both directions going on, or when TSopt is used).

sys/netinet/cc/cc_newreno.c
241

I wonder if this is a sync-update we missed, since I only worked on the beta stuff in here and a couple of odds and ends. I
do like using tcp_maxseg(), but even then it will only deal with constant options, i.e. in the case of
a loss it will not allow bytes for that... which is as it should be.

sys/netinet/cc/cc_newreno.c
241

Actually, I think the idea here is to have an mss to set the cwnd to. And in this
case you should probably be using just the fixed option sizes, not including
whether a sack is pending. The reason is that:

a) We don't count options in the cwnd/rwnd, i.e. sendwin = min(cwnd, rwnd), and we
compare that to (snd_max - snd_una), which never has any option space in it.

b) We want to drive the sending of whole-mss pieces. We don't want to have
14 bytes or some such left over.

Yes, an initial send, if sack is on and has data, would be smaller than an mss. But
the cwnd is being set, in theory, for other sends. If you account for what's in the
current sack blocks then you are shorting yourself and causing small sends into
the network.

Really what is needed here is not tcp_maxseg() but what is in ctf_fixed_maxseg().
What I probably need to do here is promote that into tcp_subr.c as tcp_fixed_maxseg()
and then use that.

This addresses Richard's comments about mss = tcp_maxseg(), which itself is actually
incorrect. Instead we need the fixed maximum segment size, i.e. not including all the
dynamic things like sack, since we are setting a cwnd that will survive long after
we are out of recovery.

This revision is now accepted and ready to land.May 5 2021, 5:50 PM

If you think long is better, then just go ahead.

sys/netinet/tcp_ratelimit.h
46–48

Why do you change it from uint64_t to long?

sys/netinet/tcp_ratelimit.h
46–48

uint64_t breaks on 32-bit platforms. This code was yanked after the
ratelimit changes went in, by Hans, to make all platforms compile. Using
a long is a compromise: it makes 32-bit platforms "work", but they
will be limited to, say, 16Gbps (0x7fffffff x 8), since this is in bytes per second.

The other limit is in "using", but 2 billion connections using a rate is a bit
more than any 32-bit platform would handle. And for the counter of
enobufs, well, that should not wrap too often, and it's only used to indicate
we need more pacing time.

So I think it's an OK compromise, though it will be better when the 32-bit
platforms all go away (and I am saying that as someone who has a stack
of 1U 32-bit i386 machines) ;-)

sys/netinet/tcp_ratelimit.h
46–48

Then I'm fine with the changes. Go ahead and commit them.

One last comment from Michael Tuexen: use %zu in printf.

This revision now requires review to proceed.May 6 2021, 3:20 PM
This revision is now accepted and ready to land.May 6 2021, 3:21 PM
lstewart added inline comments.
sys/netinet/cc/cc_newreno.c
253

I noticed this during some unrelated code review: I think the flags test should be (nreno->newreno_flags & CC_NEWRENO_BETA_ECN_ENABLED).

355

Ditto