
Update Rack and BBR to the latest from NF fixing LRO issues along the way.
Closed, Public

Authored by rrs on Apr 29 2021, 1:48 PM.

Details

Summary

This update brings both rack and bbr into sync with the NF tree. There have been
many advancements in the rack stack (pacing, fixed-rate pacing, performance improvements, etc.),
as well as some fixes common to bbr and rack. Both rack and bbr have been broken in
head since the LRO changes went in; this fixes that as well, so that LRO and bbr/rack now work
nicely together, and for rack the compressed-ack feature adds a good amount of performance.

Test Plan

Bring up the stack and of course make sure it works.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

rrs requested review of this revision. Apr 29 2021, 1:48 PM

Thanks! This patch fixes the panic I was hitting with BBR each time my system initiated a TCP session:

Unread portion of the kernel message buffer:
stack pointer	        = 0x28:0xfffffe028aa5b7d0
frame pointer	        = 0x28:0xfffffe028aa5b850
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 0 (if_io_tqg_54)
trap number		= 12
panic: page fault
cpuid = 54
time = 1619714192
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe028aa5b480
vpanic() at vpanic+0x181/frame 0xfffffe028aa5b4d0
panic() at panic+0x43/frame 0xfffffe028aa5b530
trap_fatal() at trap_fatal+0x387/frame 0xfffffe028aa5b590
trap_pfault() at trap_pfault+0x4f/frame 0xfffffe028aa5b5f0
trap() at trap+0x27d/frame 0xfffffe028aa5b700
calltrap() at calltrap+0x8/frame 0xfffffe028aa5b700
--- trap 0xc, rip = 0xffffffff82bc42e5, rsp = 0xfffffe028aa5b7d0, rbp = 0xfffffe028aa5b850 ---
ctf_process_inbound_raw() at ctf_process_inbound_raw+0x345/frame 0xfffffe028aa5b850
ctf_do_queued_segments() at ctf_do_queued_segments+0x3e/frame 0xfffffe028aa5b870
tcp_lro_flush() at tcp_lro_flush+0xb6a/frame 0xfffffe028aa5b910
tcp_lro_flush_all() at tcp_lro_flush_all+0x18b/frame 0xfffffe028aa5b970
iflib_rxeof() at iflib_rxeof+0xc82/frame 0xfffffe028aa5ba80
_task_fn_rx() at _task_fn_rx+0x72/frame 0xfffffe028aa5bac0
gtaskqueue_run_locked() at gtaskqueue_run_locked+0x15d/frame 0xfffffe028aa5bb40
gtaskqueue_thread_loop() at gtaskqueue_thread_loop+0xac/frame 0xfffffe028aa5bb70
fork_exit() at fork_exit+0x7d/frame 0xfffffe028aa5bbb0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe028aa5bbb0
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
Uptime: 5h34m24s
Dumping 7988 out of 261972 MB:..1%..11%..21%..31%..41%..51%..61%..71%..81%..91%

__curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
55		__asm("movq %%gs:%P1,%0" : "=r" (td) : "n" (offsetof(struct pcpu,
(kgdb) #0  __curthread () at /usr/src/sys/amd64/include/pcpu_aux.h:55
#1  doadump (textdump=textdump@entry=1)
    at /usr/src/sys/kern/kern_shutdown.c:399
#2  0xffffffff80bf511a in kern_reboot (howto=260)
    at /usr/src/sys/kern/kern_shutdown.c:486
#3  0xffffffff80bf55a0 in vpanic (fmt=<optimized out>, ap=<optimized out>)
    at /usr/src/sys/kern/kern_shutdown.c:919
#4  0xffffffff80bf53a3 in panic (fmt=<unavailable>)
    at /usr/src/sys/kern/kern_shutdown.c:843
#5  0xffffffff8108d2a7 in trap_fatal (frame=0xfffffe028aa5b710, 
    eva=<optimized out>) at /usr/src/sys/amd64/amd64/trap.c:915
#6  0xffffffff8108d2ff in trap_pfault (frame=frame@entry=0xfffffe028aa5b710, 
    usermode=false, signo=<optimized out>, signo@entry=0x0, 
    ucode=<optimized out>, ucode@entry=0x0)
    at /usr/src/sys/amd64/amd64/trap.c:732
#7  0xffffffff8108c95d in trap (frame=0xfffffe028aa5b710)
    at /usr/src/sys/amd64/amd64/trap.c:398
#8  <signal handler called>
#9  tcp_fields_to_host (th=0x0) at /usr/src/sys/netinet/tcp_var.h:1104
#10 ctf_process_inbound_raw (tp=0xfffffe01d20044d8, 
    so=so@entry=0xfffff8309820b3b0, m=0xfffff8209879c300, has_pkt=0)
    at /usr/src/sys/modules/tcp/bbr/../../../netinet/tcp_stacks/rack_bbr_common.c:396
#11 0xffffffff82bc449e in ctf_do_queued_segments (so=0xfffff8309820b3b0, 
    tp=0xfffffe01d20044d8, have_pkt=-1736850688)
    at /usr/src/sys/modules/tcp/bbr/../../../netinet/tcp_stacks/rack_bbr_common.c:457
#12 0xffffffff80dbf67a in tcp_lro_flush_tcphpts (lc=0xfffff801028d8930, 
    le=<optimized out>) at /usr/src/sys/netinet/tcp_lro.c:1316
#13 tcp_lro_flush (lc=lc@entry=0xfffff801028d8930, le=0xfffffe01d35f0658)
    at /usr/src/sys/netinet/tcp_lro.c:1333
#14 0xffffffff80dbf96b in tcp_lro_rx_done (lc=0xfffff801028d8930)
    at /usr/src/sys/netinet/tcp_lro.c:557
#15 tcp_lro_flush_all (lc=lc@entry=0xfffff801028d8930)
    at /usr/src/sys/netinet/tcp_lro.c:1484
#16 0xffffffff80d2ab12 in iflib_rxeof (rxq=<optimized out>, 
    rxq@entry=0xfffff801028d8900, budget=<optimized out>)
    at /usr/src/sys/net/iflib.c:3057
#17 0xffffffff80d25362 in _task_fn_rx (context=0xfffff801028d8900)
    at /usr/src/sys/net/iflib.c:3989
#18 0xffffffff80c42b0d in gtaskqueue_run_locked (
    queue=queue@entry=0xfffff801026ef200)
    at /usr/src/sys/kern/subr_gtaskqueue.c:371
#19 0xffffffff80c427ac in gtaskqueue_thread_loop (arg=<optimized out>, 
    arg@entry=0xfffffe01d33c0518) at /usr/src/sys/kern/subr_gtaskqueue.c:547
#20 0xffffffff80bb244d in fork_exit (
    callout=0xffffffff80c42700 <gtaskqueue_thread_loop>, arg=<optimized out>, 
    frame=<optimized out>) at /usr/src/sys/kern/kern_fork.c:1078
#21 <signal handler called>
kbowling added inline comments.
sys/netinet/tcp_ratelimit.h
47

should it be here?

And yes, Oliver, this is because when the LRO code went in I forgot to update rack_bbr_common.c. I am glad
it fixes the issue. Neither Michael nor I could recreate it, since we were running on VMs and did not
realize VMware Fusion does not get LRO enabled. Drew has given me a patch for that, so I should put
it in... at least on my local box ;)

sys/netinet/tcp_ratelimit.h
47

I am not understanding your comment. Maybe you are referring to the comment "Temporary"... and yeah,
that should probably not be there, i.e. the comment should say "Total number of connections using this rate".
I will fix that.

When I put it in, I thought it was just a nice temporary aid for debugging. But I have found, when
working with Drew on hardware pacing, that it is really important to have this bit of information while you
are testing, and having a sysctl to get it is really good.

rscheff added inline comments.
sys/netinet/cc/cc_newreno.c
225

Why did you remove this RFC2861 related bound of ssthresh?

278

So, newreno no longer adjusts ssthresh and cwnd on RTO? That doesn't sound right...

sys/netinet/tcp_input.c
528

make this uint8_t flags a uint16_t flags, to cater for the ACE bits?

563

Add (th_x2 << 8) | th_flags to cover all 12 potential TCP flags? Or shall I do that with the AccECN patch eventually?
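
For illustration, a minimal self-contained sketch of this widened-flags idea (the function name and the main() harness are illustrative, not the tcp_input.c change):

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch only: fold the four reserved th_x2 bits above the eight classic
 * th_flags bits so all 12 potential TCP flag bits (including the ACE bits)
 * fit in one 16-bit value.
 */
static uint16_t
tcp_flags12_sketch(uint8_t th_x2, uint8_t th_flags)
{
    return (uint16_t)(((th_x2 & 0x0f) << 8) | th_flags);
}

int
main(void)
{
    /* one reserved/ACE bit in th_x2 plus ACK (0x10) in th_flags */
    printf("flags = 0x%03x\n", tcp_flags12_sketch(0x01, 0x10));
    return (0);
}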

sys/netinet/tcp_sack.c
165

SEQ_LEQ exists

sys/netinet/cc/cc_newreno.c
225

The standard does not call for this. The standard calls for
ssthresh to be infinity at start. Yes you can set it to less, but I find
this to be incorrect and it does lead to issues we have seen
in production (low throughput due to getting into CA too soon).

278

Hmm I did not take this out.. but I see it somehow is not
in our tree.. This must be some sort of merge damage.. going
to have to see where in the NF tree this came from...

sys/netinet/tcp_input.c
528

Sure we can do that.

563

Let's do that in your patch, assuming I get uint16_t in place for iptos.

sys/netinet/tcp_sack.c
165

I did not realize someone finally added that.. I will change the code to that :)

rscheff added inline comments.
sys/netinet/cc/cc_newreno.c
225

Don't quite understand this. This is the after-idle function (thus it has no bearing on the initial ssthresh at connection establishment; also, since this is mid-session, ssthresh should NOT be infinity here).

RFC2861 suggests setting ssthresh to the maximum of the former ssthresh, or 3/4 of the former cwnd. So taking this out may leave ssthresh at a smaller, former value - but never restrict it as you suggest.

If you have observations that increasing ssthresh in after-idle is counter-productive (e.g. running into extensive burst loss, RTOs, and lost retransmissions), that may be valuable to know @ IETF.
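
For reference, a small self-contained sketch of that RFC 2861 rule (the names and values are stand-ins, not the cc_newreno.c code):

#include <stdint.h>
#include <stdio.h>

/*
 * Sketch only: on restart after idle, ssthresh is set to the maximum of its
 * former value and 3/4 of the former cwnd, so the bound can only raise
 * ssthresh, never lower it.
 */
static void
after_idle_ssthresh_rfc2861(uint32_t *ssthresh, uint32_t cwnd)
{
    uint32_t three_quarters_cwnd = (3 * cwnd) / 4;

    if (*ssthresh < three_quarters_cwnd)
        *ssthresh = three_quarters_cwnd;    /* max(ssthresh, 3*cwnd/4) */
}

int
main(void)
{
    uint32_t ssthresh = 20000, cwnd = 64000;

    after_idle_ssthresh_rfc2861(&ssthresh, cwnd);
    printf("ssthresh after idle: %u\n", ssthresh);  /* 48000 = 3/4 of 64000 */
    return (0);
}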

sys/netinet/tcp_sack.c
165

It was apparently always available in the FreeBSD tree, ever since @rgrimes added tcp_seq.h - but yes, not all these macros exist in all TCP stacks. :)

https://reviews.freebsd.org/R10:df8bae1de4b67ccf57f4afebd4e2bf258c38910d#change-G7qyRTbZw4J4

rrs marked an inline comment as not done. May 3 2021, 6:51 PM
rrs added inline comments.
sys/netinet/cc/cc_newreno.c
278

Ok, I see what's going on here. It looks like rack does these sets
along with a bunch of other stuff. So somehow in our tree
the case got removed, when it should not have been. Since we
don't use anything but rack or BBR it was never noticed :-)

I will fix it in the next patch update with the rest of the fixes

rrs marked an inline comment as not done.

Address some of Richard's comments

sys/netinet/cc/cc_newreno.c
225

Ahh, I thought you were pointing out the initialization code. Actually I never
changed this... you must remember this is what is in the NF tree. I would
have to look historically at where or what zapped those lines... I will restore
them. It could have been merge damage or??

We would never have noticed this, since there is a bug in the way being idle
is calculated in FreeBSD... basically it has never worked for servers, only
for clients. I may have fixed that in head, can't remember... but because that
is the way the NF stack has run for years I think we keep the option
to do idle-down removed :)

gesh get the comments too silly

sys/netinet/tcp_input.c
528

Remember, if you play with the flags: since they are coming from rack_bbr_common.c, you
may find different upper bits set that correspond to tcp_lro.c bits in the compressed acks.

sys/netinet/cc/cc_newreno.c
225

The idle detection for both directions got addressed with D25016, after you hinted at this to me :)

241

This will miscalculate the mss for variable-length options (e.g. bidir traffic with SACK loss recovery in both directions going on, or when TSopt is used).

sys/netinet/cc/cc_newreno.c
241

I wonder if this is a sync-update we missed, since I only had work on the beta stuff in here and a couple of odds and ends. I
do like using tcp_maxseg(), but even then it will only deal with constant options, i.e. in the case of
a loss it will not allow bytes for that... which is as it should be.

sys/netinet/cc/cc_newreno.c
241

Actually I think the idea here is to have an mss to set the cwnd to, and in that
case you should probably use just the fixed option sizes, not including
whether a SACK is pending. The reasons are:

a) We don't count options in the cwnd/rwnd, i.e. sendwin = min(cwnd, rwnd), and we
compare that to (snd_max - snd_una), which never has any option space in it.

b) We want to drive the sending of whole-mss pieces; we don't want
14 bytes or some such left over.

Yes, an initial send, if SACK is on and carries data, would be smaller than an mss. But
the cwnd is being set, in theory, for other sends. If you account for what's in the
current SACK blocks then you are shorting yourself and causing small sends in
the network.

Really what is needed here is not tcp_maxseg() but what is in ctf_fixed_maxseg().
What I probably need to do is promote that into tcp_subr.c as tcp_fixed_maxseg()
and then use that.

This addresses Richard's comments about mss = tcp_maxseg(), which itself is actually
incorrect. Instead we need the fixed max segment size, i.e. not including all the
dynamic things like SACK, since we are setting a cwnd that will survive long after
we are out of recovery.
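
As a rough illustration of the "fixed" segment size idea (a stand-alone sketch, not ctf_fixed_maxseg() itself; the 12-byte figure is the padded timestamp option length):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Sketch only: subtract only options that appear on every segment (here the
 * timestamp option); variable-length SACK blocks are deliberately ignored so
 * cwnd stays a multiple of a stable segment size.
 */
#define TS_OPTION_PADDED_LEN    12  /* timestamp option, padded to 12 bytes */

static uint32_t
fixed_maxseg_sketch(uint32_t t_maxseg, bool ts_enabled)
{
    uint32_t mss = t_maxseg;

    if (ts_enabled)
        mss -= TS_OPTION_PADDED_LEN;

    /* no adjustment for pending SACK blocks: those vary per packet */
    return (mss);
}

int
main(void)
{
    /* e.g. a 1460-byte MSS with timestamps on yields 1448-byte segments */
    printf("fixed mss = %u\n", fixed_maxseg_sketch(1460, true));
    return (0);
}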

This revision is now accepted and ready to land. May 5 2021, 5:50 PM

If you think long is better, then just go ahead.

sys/netinet/tcp_ratelimit.h
46

Why do you change it from uint64_t to long?

sys/netinet/tcp_ratelimit.h
46

uint64_t breaks 32-bit platforms. This code was yanked after the
ratelimit changes went in by Hans to make all platforms compile. Using
a long is a compromise: it makes it so 32-bit platforms "work", but they
will be limited to, say, 16Gbps (0x7fffffff x 8) since this is in bytes per second.

The other limit is on the "using" count, but 2 billion connections using a rate is a bit
more than any 32-bit platform would handle. And as for the ENOBUFS counter,
well, that should not wrap too often, and it's only used to indicate
we need more pacing time.

So I think it's an OK compromise, though it will be better when the 32-bit
platforms all go away (and I am saying that as someone who has a stack
of 1U 32-bit i386 machines) ;-)
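
A quick back-of-the-envelope check of that ceiling (stand-alone illustration only):

#include <stdio.h>

/*
 * On an ILP32 platform the largest positive long is 0x7fffffff; since the
 * rate is stored in bytes per second, that works out to roughly 17 Gbit/s
 * (the "say 16Gbps" ballpark above).
 */
int
main(void)
{
    long max_bytes_per_sec = 0x7fffffffL;
    unsigned long long bits_per_sec =
        (unsigned long long)max_bytes_per_sec * 8;

    printf("32-bit ceiling: %llu bit/s (~%.1f Gbit/s)\n",
        bits_per_sec, bits_per_sec / 1e9);
    return (0);
}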

sys/netinet/tcp_ratelimit.h
46

Then I'm fine with the changes. Go ahead and commit them.

One last comment from Michael Tuexen: use %zu in printf.
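
For example (illustrative only), the suggested conversion specifier in use:

#include <stddef.h>
#include <stdio.h>

/* %zu is the standard printf conversion for size_t values. */
int
main(void)
{
    size_t count = sizeof(long);

    /* works unchanged on 32-bit and 64-bit platforms, no cast needed */
    printf("sizeof(long) = %zu\n", count);
    return (0);
}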

This revision now requires review to proceed. May 6 2021, 3:20 PM
This revision is now accepted and ready to land. May 6 2021, 3:21 PM
lstewart added inline comments.
sys/netinet/cc/cc_newreno.c
253

I noticed this during some unrelated code review: I think the flags test should be (nreno->newreno_flags & CC_NEWRENO_BETA_ECN_ENABLED).

355

Ditto
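
For illustration, a self-contained sketch of the suggested test (the flag value and struct are stand-ins; only the bitwise-AND pattern against newreno_flags is the point):

#include <stdint.h>
#include <stdio.h>

#define CC_NEWRENO_BETA_ECN_ENABLED 0x0002  /* stand-in value, not the header's */

struct newreno_sketch {
    uint32_t newreno_flags;
};

int
main(void)
{
    struct newreno_sketch storage = {
        .newreno_flags = CC_NEWRENO_BETA_ECN_ENABLED
    };
    struct newreno_sketch *nreno = &storage;

    /* the suggested check: test the flag bit, not some other condition */
    if (nreno->newreno_flags & CC_NEWRENO_BETA_ECN_ENABLED)
        printf("ECN-specific beta is in effect\n");
    return (0);
}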