Page MenuHomeFreeBSD

Load balance sockets with new SO_REUSEPORT_LB option
ClosedPublic

Authored by johalun0_gmail.com on May 31 2017, 7:06 AM.

Details

Summary

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket).

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
rwatson requested changes to this revision.Apr 23 2018, 5:23 PM

There is a surprising lack of locking assertions in the substantial new chunks of code proposed for in_pcb.c and in6_pcb.c. Could you add proper locking assertions so that we can better understand the locking protocol? Please also add a comment above new data-structure definitions to make clear what the locking requirements are. Please document the per-field locking requirements in in_pcb.h.

Could you explain why the lock order implied by soinherit() (between so and so_inh) deadlock free?

Could you upload a version of the patch with full context so that surrounding code can be inspected properly -- especially with respect to locking?

Could you make revisions to the patch to better implement style(9) -- e.g., return has parens, spaces between "if" and "(condition)", tabs + 4-space continuing-line indents? Also, declare all locals at the tops of functions. Remove unnecessary vertical whitespace. Line wrap at 77 or 78 characters.

Could you comment on the ABI/API/KBI/KPI implications of changing the type of so_options?

This revision now requires changes to proceed.Apr 23 2018, 5:23 PM

There is a surprising lack of locking assertions in the substantial new chunks of code proposed for in_pcb.c and in6_pcb.c. Could you add proper locking assertions so that we can better understand the locking protocol? Please also add a comment above new data-structure definitions to make clear what the locking requirements are. Please document the per-field locking requirements in in_pcb.h.

This last sentence should have asked for per-field locking requirements in struct in_pcbinfo, both for consistency with existing fields, and also for the benefits of readers.

This revision was not accepted when it landed; it landed in state Needs Revision.Apr 23 2018, 7:51 PM
This revision was automatically updated to reflect the committed changes.

Changes:

  • Fix style
  • Add lock assertions
  • Acquire INP_HASH_RLOCK in_pcblookup_lbgroup_last()
  • Acquire INP_WLOCK before calling soinherit()

Thanks for the feedback!

There is a surprising lack of locking assertions in the substantial new chunks of code proposed for in_pcb.c and in6_pcb.c. Could you add proper locking assertions so that we can better understand the locking protocol? Please also add a comment above new data-structure definitions to make clear what the locking requirements are. Please document the per-field locking requirements in in_pcb.h.

Improved in the latest version.

Could you explain why the lock order implied by soinherit() (between so and so_inh) deadlock free?

Hmm.. No. Anyone else with confidence here?

Could you upload a version of the patch with full context so that surrounding code can be inspected properly -- especially with respect to locking?

https://github.com/johalun/freebsd/tree/d11003/ <- is this what you mean?

Could you make revisions to the patch to better implement style(9) -- e.g., return has parens, spaces between "if" and "(condition)", tabs + 4-space continuing-line indents? Also, declare all locals at the tops of functions. Remove unnecessary vertical whitespace. Line wrap at 77 or 78 characters.

Yes, there was a lot of misses here. Hopefully I got them all now.

Could you comment on the ABI/API/KBI/KPI implications of changing the type of so_options?

Should be minimal implications. Of course, userland needs to be updated to use the new option. setsockopt() already takes an int as argument so no problem there.

sbruno reopened this revision.Apr 24 2018, 7:56 PM
sbruno added a subscriber: sbruno.

Reopen at the request of the submitter.

This has been reverted from head at svn r332967.

Could you explain why the lock order implied by soinherit() (between so and so_inh) deadlock free?

Hmm.. No. Anyone else with confidence here?

If no one can answer this question, then this patch cannot be committed. It is essential that we know that there is a locking protocol in place that prevents deadlock, as socket locking is a very subtle thing, and we've spent almost two decades getting the current model into a workable and well-documented shape. Part of the reason I'm pushing on this is that in your original patch, there's little or no use of the programmatic techniques we use to ensure correctness: proper lock documentation, assertions, etc. This gives me a low confidence that this work is correct. Better assertions and documentation will substantially increase that confidence -- in particular, in arguing about deadlock.

Could you upload a version of the patch with full context so that surrounding code can be inspected properly -- especially with respect to locking?

https://github.com/johalun/freebsd/tree/d11003/ <- is this what you mean?

I'd like you to upload a version of your patch to this review request that includes the full context so that I can click "show context" around your diff in phabricator and comment on it all together. I'm afraid I've forgotten the name of the command-line tool for interacting with phabricator, but my experience is that that is the best way to get this to happen. I assume it's also possible using git diff in some form, but I've never been lucky in getting it to work :-).

Could you make revisions to the patch to better implement style(9) -- e.g., return has parens, spaces between "if" and "(condition)", tabs + 4-space continuing-line indents? Also, declare all locals at the tops of functions. Remove unnecessary vertical whitespace. Line wrap at 77 or 78 characters.

Yes, there was a lot of misses here. Hopefully I got them all now.

Could you comment on the ABI/API/KBI/KPI implications of changing the type of so_options?

Should be minimal implications. Of course, userland needs to be updated to use the new option. setsockopt() already takes an int as argument so no problem there.

... plus presumably requiring netstat/sockstat/etc to be recompiled, and also changing the binary interface for kernel modules. Are there kernel modules in the base system that may make assumptions about the size of so_options? How about in the ports tree?

Full context for all files.

Full context for all files.

Thanks, will take a look at the updated version.

Add entry in man page

Mostly style and comment issues in this pass, although there are a few other queries -- e.g., relating to so and inh_so lock order.

I will likely have further comments on the contents, but wanted to get these out quickly since they were easy to report.

sys/kern/uipc_socket.c
1064 ↗(On Diff #41882)

Comment should be a full sentence.

1086 ↗(On Diff #41882)

This is a second socket-lock acquisition in a row. How can we convince ourselves that this is safe?

1136 ↗(On Diff #41882)

Missing space after "if".

1140 ↗(On Diff #41882)

Missing space after "if".

1141 ↗(On Diff #41882)

Missing space after "if".

sys/netinet/in_pcb.c
229 ↗(On Diff #41882)

Should there be a lock assertion here?

239 ↗(On Diff #41882)

This vertical whitespace is probably not needed.

246 ↗(On Diff #41882)

Should there be a lock assertion here?

257 ↗(On Diff #41882)

Should there be a lock assertion here?

269 ↗(On Diff #41882)

Can probably drop this blank line here.

271 ↗(On Diff #41882)

Can probably drop this blank line here.

276 ↗(On Diff #41882)

Add a "." at the end -- style(9) would like comments to be complete sentences.

279 ↗(On Diff #41882)

In other in_..._inshash() functions, the pcbinfo argument is implied by the 'inp' argument. Do we need it to be explicit here?

299 ↗(On Diff #41882)

Comment should be a full sentence.

310 ↗(On Diff #41882)

Comment should be a full sentence.

322 ↗(On Diff #41882)

This vertical whitespace is probably not needed.

333 ↗(On Diff #41882)

Comment should be a full sentence.

341 ↗(On Diff #41882)

This variable would ideally be declared at the top of the function, in style(9).

351 ↗(On Diff #41882)

Comment should be a full sentence.

367 ↗(On Diff #41882)

Here, as above, can pcbinfo be implied by inp?

390 ↗(On Diff #41882)

Comment should be a full sentence.

393 ↗(On Diff #41882)

Comment should be a full sentence.

400 ↗(On Diff #41882)

Comment should be a full sentence.

408 ↗(On Diff #41882)

This structure is visually confusing. Could we break some of this out into another function, or use another control-flow structure? Currently it's hard to see how the return fits.

784 ↗(On Diff #41882)

Comment should be a full sentence.

1901 ↗(On Diff #41882)

Should these assertions be before the if() clause above?

1925 ↗(On Diff #41882)

This variable declaration would ideally be above.

2272 ↗(On Diff #41882)

Comment should be a full sentence.

2524 ↗(On Diff #41882)

Comment should be a full sentence.

2530 ↗(On Diff #41882)

Use a C-style comment, not C++, for a block comment. Comment should be a full sentence.

2531 ↗(On Diff #41882)

style(9) requires () around return arguments.

2662 ↗(On Diff #41882)

Use a C-style comment.

sys/netinet/in_pcb.h
550 ↗(On Diff #41882)

It might be useful for this comment to mention something about il_inp[] behaviour?

563 ↗(On Diff #41882)

What are the locking properties of il_inpsiz, il_inpcnt, and il_inp?

677 ↗(On Diff #41882)

I have no personal objection to this design choice, which is consistent with other hashes here, but someday we should think about something better for all of these hashes!

sys/netinet/tcp_subr.c
1960 ↗(On Diff #41882)

I'm cautious about naming a variable "listen" -- historically, this would have shadowed a global symbol (listen()). I wonder if another name might be a better idea.

1972 ↗(On Diff #41882)

The line wrapping here is funky. Make it consistently 78 characters wide.

2000 ↗(On Diff #41882)

This vertical whitespace is not required.

sys/netinet6/in6_pcb.c
176 ↗(On Diff #41882)

Use a C-style comment. Comment should be a full sentence.

904 ↗(On Diff #41882)

style(9) prefers that these variables be declared above rather than in a block here.

1172 ↗(On Diff #41882)

Comment should be a full sentence.

johalun0_gmail.com marked 38 inline comments as done.Apr 27 2018, 2:10 PM
johalun0_gmail.com added inline comments.
sys/kern/uipc_socket.c
1086 ↗(On Diff #41882)

By simply re-order the code a bit we can unlock the first one before we lock the other.

sys/netinet/in_pcb.c
229 ↗(On Diff #41882)

in_pcblbgroup_{alloc,free,resize,reorder} are only called from in_pcb{ins,rem}lbgrouphash where locks are asserted. To assert locks again in these functions we'd need to pass the inpcb as argument. What do you recommend?

johalun0_gmail.com marked an inline comment as done.

Improving locking
Refactor
Fix style

Added comment - unsure about locking in one place...

sys/netinet/tcp_subr.c
2003 ↗(On Diff #41915)

How about this locking? Is it needed for access to the socket and if so, is it safe to lock another inpcb here?

@rwatson If you have time, I would appreciate if you could take a look at the updates.

jtl added inline comments.May 7 2018, 3:30 PM
sys/netinet/tcp_subr.c
2003 ↗(On Diff #41915)

The INP lock is needed to check inp_socket and ensure it doesn't change while you are messing with the socket. (Actually, it could probably be a read lock.)

Its safe to lock the inpcb first, and then lock that inpcb's socket while you hold the INP lock. If you want to lock two inpcbs, you will need to define a deadlock-free ordering. (See, for example, the way sched_ule.c does this in sched_switch_migrate().)

  • Omit the inheritance part for now so that we can get this committed. Need more time and eyes to figure out a safe way to solve the locking issues. (this is anyway not a critical function but more a nice-to-have thing)
  • Add leading zeros to the rest of the SO_* options to indicate that they are now 32 bit.
sbruno accepted this revision.May 16 2018, 7:39 PM

Is everyone on this review fine with me committing this? With the inheritance code removed, I believe there are no further objections or questions.

jtl added a comment.May 21 2018, 9:09 PM

Is everyone on this review fine with me committing this? With the inheritance code removed, I believe there are no further objections or questions.

Have you addressed all of @rwatson's comments?

I've been meaning to review this, but have been delayed. I would appreciate a few extra days to look it over.

Have you addressed all of @rwatson's comments?

Yes, it's all been addressed including a couple of extra things like adding to man pages.

In D11003#327485, @jtl wrote:

Is everyone on this review fine with me committing this? With the inheritance code removed, I believe there are no further objections or questions.

Have you addressed all of @rwatson's comments?

I've been meaning to review this, but have been delayed. I would appreciate a few extra days to look it over.

@jtl I'd like to commit this at some point this week, but I also want your feedback on it. Should we hold off until next week to drop this into head?

jtl added a comment.May 24 2018, 12:43 AM
In D11003#327485, @jtl wrote:

Is everyone on this review fine with me committing this? With the inheritance code removed, I believe there are no further objections or questions.

I've been meaning to review this, but have been delayed. I would appreciate a few extra days to look it over.

@jtl I'd like to commit this at some point this week, but I also want your feedback on it. Should we hold off until next week to drop this into head?

If I haven't reviewed it by this time Friday (i.e. in the next 48 hours), I think it is only fair that I not hold you up anymore.

I'm not sure if @rwatson is done reviewing it, or if others (@glebius, for example) want you to wait for them to review it.

sbruno requested changes to this revision.May 29 2018, 11:14 PM

@johalun0_gmail.com Can this be regenerated against head when you get to it in the morning?

This revision now requires changes to proceed.May 29 2018, 11:14 PM

Updated diff against today's kernel.

pluknet added inline comments.
sys/netinet/tcp_subr.c
1985 ↗(On Diff #43129)

tcp_subr.c diff looks now unrelated, that is, only whitespace change here

johalun0_gmail.com marked an inline comment as done.May 30 2018, 10:30 AM
johalun0_gmail.com added inline comments.
sys/netinet/tcp_subr.c
1985 ↗(On Diff #43129)

Good catch! Will fix that.

johalun0_gmail.com marked an inline comment as done.

Remove whitespace change.

sbruno accepted this revision.May 31 2018, 6:31 PM

Pending no further review comments, this will go in the tree on Monday, May 31. That way folks can yell at me when we all get together in Ottawa.

This revision was not accepted when it landed; it landed in state Needs Review.Jun 6 2018, 3:46 PM
This revision was automatically updated to reflect the committed changes.