Page MenuHomeFreeBSD

Decompose TCP INP_INFO lock to increase short-lived connections scalability
ClosedPublic

Authored by jch on May 20 2015, 12:59 PM.
Tags
None
Referenced Files
F83361244: D2599.id7609.diff
Thu, May 9, 11:08 AM
Unknown Object (File)
Thu, May 9, 5:11 AM
Unknown Object (File)
Thu, May 9, 5:11 AM
Unknown Object (File)
Thu, May 9, 5:11 AM
Unknown Object (File)
Thu, May 9, 5:11 AM
Unknown Object (File)
Thu, May 9, 5:11 AM
Unknown Object (File)
Thu, May 9, 5:03 AM
Unknown Object (File)
Thu, May 9, 5:03 AM

Details

Summary

Decompose TCP INP_INFO lock to increase short-lived TCP connections scalability:

  • The existing TCP INP_INFO lock continues to protect the global inpcb list stability during full list traversal (e.g. tcp_pcblist()).
  • A new INP_LIST lock protects inpcb list actual modifications (inp allocation and free) and inpcb counters.

It allows to use TCP INP_INFO_RLOCK lock in critical paths (e.g. tcp_input())
and use INP_INFO_WLOCK only in occasional operations that walk all connections.

The maximum of number of TCP connection (setup and teardown) per connection
is increased from 60k/sec to 150k/sec.

Some notes:

Test Plan

This patch has no known issues (so far) with GENERIC configuration, however due to
the large impacted code base it requires testing with not default options:

This patch has been tested with:

  • IPv4/v6
  • FLOWTABLE
  • PCBGROUP/RSS
  • TCP timer per CPU

It has been smoked testing with:

  • VIMAGE v4/v6
  • PCBGROUP without RSS
  • TOE (on Chelsio card)

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jch retitled this revision from to Decompose TCP INP_INFO lock to increase short-lived connections scalability.
jch updated this object.
jch edited the test plan for this revision. (Show Details)
jch added reviewers: adrian, gnn, jhb.

I've tested this with RSS+PCBGROUPS. Works fine.

sys/netinet/in_pcb.c
1257 ↗(On Diff #5509)

Should we eventually use this model for other protocols? (Not saying we have to do that now, but if we think we should it might be worth documenting as a TODO item)

sys/netinet/in_pcb.h
134 ↗(On Diff #5509)

Old bug: UDP has protocol PCB's now as well, so this should possibly say "TCP and UDP" (though UDP doesn't have connections)

138 ↗(On Diff #5509)

Old bug: the "when modified" shouldn't have commas around it. This might be more clearly worded as:

A few fields, such as the global connection lists, require both the inp_lock and global pcblist lock to be write-locked while modified.

Still a bit of a mouthful. :-/

139 ↗(On Diff #5509)

Old bug: this is a long sentence. I would probably split it up by s/lock, which/lock. This/

142 ↗(On Diff #5509)

So this last sentence is missing a word (should be "these fields _requires_ that write" or some such), but it also duplicates the "A few fields" sentence less awkwardly.

Hmm, I think I would reword much of this as so perhaps starting with the "A few fields" sentence:

A few fields are protected by multiple locks as indicated in the locking notes below.  For
these fields, all of the listed locks must be write-locked for any modifications.  However,
these fields can be safely read while any one of the listed locks are read-locked.  This
model can permit greater concurrency for read operations.  For example, connections can
be looked up while only holding a read lock on the global pcblist lock.  This is important
for performance when attempting to find the connection for a packet given its IP and port
tuple.

One noteworthy exception is that the global pcbinfo lock follows a different set of rules in
relation to the inp_list field.  Rather than being write-locked for modifications and
read-locked for list iterations, it must be read-locked during modifications and write-locked
during list iterations.  This ensures that the relatively rare global list iterations safely
walk a stable snapshot of connections while allowing more common list modifications
to safely grab the pcblist lock just while adding or removing a connection from the global
list.
sys/netinet/tcp_input.c
1153 ↗(On Diff #5509)

I think this comment might be helpful in cxgb_listen.c and t4_listen.c as well.

sys/netinet/tcp_subr.c
1252 ↗(On Diff #5509)

Do you need both locks to read ipi_gecnt and ipi_count? Would just the pcblist lock suffice?

1362 ↗(On Diff #5509)

Same question here.

[tcp-scale]: Apply jhb's review comments on code comments.

sys/netinet/in_pcb.c
1257 ↗(On Diff #5509)

This model is indeed applicable to other protocols. It is a good idea, I haved added the TODO item.

sys/netinet/in_pcb.h
134 ↗(On Diff #5509)

Your are right, fixed.

138 ↗(On Diff #5509)

Fixed as part of below comment change.

139 ↗(On Diff #5509)

Fixed as part of below comment change.

142 ↗(On Diff #5509)

I much prefer your version, added. Thanks

[tcp-scale]: Fix jhb's comment on syncache_expand() comments.

sys/netinet/tcp_input.c
1153 ↗(On Diff #5739)

Nice catch, fixed.

[tcp-scale]: Fix jhb's comment on ipi_gencnt/ipi_count access:

sys/netinet/tcp_subr.c
1252 ↗(On Diff #5740)

Great catch. To read ipi_gecnt and ipi_count you need only INP_LIST lock. I fixed the in_pcb.h documentation as well.

1362 ↗(On Diff #5740)

Fixed as part as above comment fix.

Fixed all @jhb comments (so far). Thanks for your time.

sys/netinet/in_pcb.h
177 ↗(On Diff #5741)

Perhaps s/If currently/Currently/ here.

sys/netinet/tcp_timewait.c
680 ↗(On Diff #5741)

Hmm, switching this to a read lock here (really in tcp_tw_start()) will not provide the deadlock protection between locking two inpcbs at once.

Perhaps this is ok as the first inp is always one going into time-wait and the second inp is one that is already in time-wait (so there is a defined order)? As long as this is the only place we acquire two inp locks at once and we never acquire a tw inpcb lock before a non-tw inpcb then we should be safe. However, if that is true then the comments both here and in tcp_twstart() need updating. I'm not sure if it is true though. Sorry I didn't catch this on the first look.

sys/netinet/in_pcb.h
177 ↗(On Diff #5741)

Right, fixed.

sys/netinet/tcp_timewait.c
680 ↗(On Diff #5741)

You are totally right. And there are actually two places were two inp locks are acquired at once:

#1 In tcp_input() - if syncache_expand() succeed - two inp locks are held: The inp of the listen socket + the inp of the newly created socket.
#2 In tcp_tw_2msl_scan() called from tcp_twstart() (here)

I checked that #1 and #2 cannot collide, thus it is safe to use only INP_INFO read lock here. However:

  • A comment update is required (Thanks, good catch)
  • In tcp_tw_2msl_scan() it is not a requirement to hold two inp locks at same time. I am going to check if we can simply use the below classical pattern to avoid that:

#1 in tcp_twstart() takes a reference on the inp lock transitioning to TIME_WAIT state
#2 Release its inp lock
#3 Call tcp_tw_2msl_stop()
#4 Take back the inp lock and check inp status with in_pcbrele_wlocked()

If the change is clear enough I will submit it in a separate review. The goal being to have only place were two inp locks are held at same time, it is cleaner).

sys/netinet/tcp_timewait.c
680 ↗(On Diff #5741)

I think this probably sounds fine though I'd want to see the patch I think.

I guess the race is tcp_input() handling another packet for this connection while the inp lock is dropped. If you are careful to avoid updating the state (right now the state is set to TIME_WAIT before tcp_tw_2msl_stop() is called, so you'd have to move that) then I think you can get
duplicate calls to twstart() if there are dup FINs, so you'd need to just bail if you get a tw for
reuse but the inp already has a tw.

The other option is to mark the inp as in timewait but then handle the race that a SYN might come in and call tcp_twcheck() (so twcheck would need to handle the case that it has TIMEWAIT set, but there is no actual tw yet, possibly arranging for the twstart() to immediately delete the inp when it relocks the inp?

Add D2763 as dependency
Rebased change on r284151

Rebased on svn path=/head/; revision=284266

bapt added a subscriber: bapt.

@mat could you possibly test this patch, you know you are always guilty when dealing with dns

[tcp-scale]: Use INP_INFO_RLOCK in tcp_timer_discard()
Improve INP_INFO_LOCK assertions in cxgb/cxgbe tom

Comment improvement from jhb

Hi guys, below a quick update:

  • This patch idea has been discussed during BSDCan 2015
  • I am testing TCP Offloading on Chelsio card to check the changes in TOE locking code
  • Limelight - i.e. Jason / @nitroboost-gmail.com - is testing this patch on 10, and no issue so far

Next tests:

  • @lawrence proposed to test this patch in Netflix QA
  • @np proposed to test this patch with TOE in Chelsio QA

Once I got the (hopefully good) results of these tests I will propose this patch to HEAD.

sys/netinet/tcp_timewait.c
680 ↗(On Diff #6166)

Hi @jhb,

I did the change and if it works well the patch is quite complex missing the initial goal of being cleaner. I would propose to leave the current INP locks as is, and I am going to thoroughly explain why it is ok with the current implementation.

A (real) cleaner solution would be that in tcp_twstart():

tw = uma_zalloc(V_tcptw_zone, M_NOWAIT);

always return an allocated tcptw because we took care to release in advance the oldest tcptw structs as soon as we get close to the V_tcptw_zone limit. And this would indeed makes the code cleaner.

Ooops, sorry, I was trying to remove myself from the subscribers here :-/

jch reclaimed this revision.EditedJun 19 2015, 12:23 PM
In D2599#55428, @mat wrote:

Ooops, sorry, I was trying to remove myself from the subscribers here :-/

No problem, just go in "Edit Revision" and remove yourself from subscribers and save. :)

Rebase on top of r285351.

[tcp-scale]: Add comment proposed by jhb about having two inps locked at
same time without the exclusive INP_INFO lock.

Adding jhb's suggested comment in syncache_socket().

No need anymore to upgrade to INP_INFO_RLOCK/INP_WLOCK
state in tcp_timer_rexmt(), we are already in this state.

[tcp-scale]: Rebase on HEAD r286066

As this change is quite stable and I have addressed all the review comments, I plan to push it by the end of this week. As usual please scream if you have something more to add. Moreover comments are still welcomed here even after this change being pushed. Thanks all for your time.

This revision was automatically updated to reflect the committed changes.