Load balance sockets with new SO_REUSEPORT_LB option
Needs ReviewPublic

Authored by johalun0_gmail.com on May 31 2017, 7:06 AM.

Details

Summary

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket).

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
swills added a subscriber: swills.Jun 5 2017, 4:24 PM
hiren added a subscriber: hiren.

I am not too sure on details but can this be done with PCBGROUPS/RSS? https://github.com/erikarn/freebsd-rss/ has some examples.

Adding Adrian and Ermal if they have any comments.

adrian edited edge metadata.Jun 6 2017, 12:37 AM

i'm okay with this (as long as it doesn't break RSS/PCBGROUPS.) This is fine for people wanting non-RSS load balancing of things (eg people writing loopback servics) without doing it in userland.

Duplicate a couple of SO_REUSEPORT checks and do the same for SO_REUSEPORT_LB in 'in_pcb_bindsetup.c'.
Let SO_REUSEPORT_LB set SO_REUSEADDR to fix the "address in use" error earlier mentioned. Testing confirmed that the earlier "possible bug" is now fixed.

'XXX' marks a few locations that need attention/confirmation.

I think you took a pretty old version of the code. I fixed the hash stability issue in a later version; you probably want to check that. And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

Would implementing https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272 solve this?

And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

Would implementing https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272 solve this?

Yep.

And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

Would implementing https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272 solve this?

Yep.

Cool. Thanks for the feedback!

socket inheritance
Allow socket inheritance between sockets with SO_REUSEPORT_LB option.
Inspired by https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272

The diff currently contains debug output code that will be removed later. Testing this feature is kind of difficult but I did as follow:

  • Launch two processes with 4 threads each on server machine that listen on the same port with SO_REUSEPORT_LB option.
  • Launch a client program on client machine that does 2000 instant connections to the server machine.
  • CTRL+C break one of the processes on the server machine to see if any sockets are transferred to the other server process.

Usually there are 2-4 "comp"-sockets on some listening socket that is target for inheritance. Not yet have I've seen any "incomp" sockets being inherited.
I added a temporary flag in the socket struct to be able to track inherited sockets.
By printing out the PID I can see that the killed server process' socket is calling inherit subroutine. Later I can see that the other server process' is doing sosend(), sofree(), etc on the inherited sockets. So far looking good but I'm not at all confident with this limited testing that the code is bug free.

There are a few places marked XXX that need extra attention and confirmation.

syncache
On DragonFlyBSD special care is taken to transfer the syncache to the new socket. However, I was not able to find any reference between the syncache and the sockets. It seems syncache is kept separate from the sockets so no action is required for that. Is this correct understanding?

gnn added a reviewer: glebius.Jul 13 2017, 4:08 PM

@sepherosa_gmail.com does this look good to you now?

In D11003#242017, @kevin.bowling_kev009.com wrote:

@sepherosa_gmail.com does this look good to you now?

Yep :)

In D11003#242017, @kevin.bowling_kev009.com wrote:

@sepherosa_gmail.com does this look good to you now?

Yep :)

Great! I will clean up some debug code and update the diff.

It is possible to separate white space changes from the rest of changes? Do you have this in git or are you using just modified subversion workdir?

Remove debug printing and clean up whitespace diffs. Applies cleanly to master branch as of 20170915.
Branch with this patch can be found here:
https://github.com/johalun/freebsd/tree/master-socket-loadbalancer

Fix build error.

I believe we need to update cddl/lib/libdtrace/tcp.d with this change also:

@@ -211,12 +192,12 @@
        tcps_rport =            p == NULL ? 0 : ntohs(p->t_inpcb->inp_inc.inc_ie.ie_fport);
        tcps_laddr =            p == NULL ? 0 :
            p->t_inpcb->inp_vflag == INP_IPV4 ?
-          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.id46_addr.ia46_addr4.s_addr) :
-          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.id6_addr);
+          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.ie46_local.ia46_addr4.s_addr) :
+          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.ie6_local);
        tcps_raddr =            p == NULL ? 0 :
            p->t_inpcb->inp_vflag == INP_IPV4 ?
-          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.id46_addr.ia46_addr4.s_addr) :
-          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.id6_addr);
+          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.ie46_foreign.ia46_addr4.s_addr) :
+          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.ie6_foreign);
        tcps_state =            p == NULL ? -1 : p->t_state;
        tcps_iss =              p == NULL ? 0  : p->iss;
        tcps_irs =              p == NULL ? 0  : p->irs;

Update libdtrace with changes to struct field names.

Herald added 1 blocking reviewer(s): gnn. · View Herald TranscriptSep 26 2017, 12:15 PM

I believe we need to update cddl/lib/libdtrace/tcp.d with this change also:

Good catch! Updated now.

johalun0_gmail.com updated this revision to Diff 33740.EditedOct 6 2017, 8:34 AM
  • Add support for IPv6 (but not IPv4-mapped IPv6 addresses)
  • Change to M_NOWAIT in lbgroup malloc to solve non-sleepable lock issue
  • Removed some old printfs.

@glebius Look ok to you?

  • Minor fixes for IPv6
  • Fixed a bug that caused non-wildcard IP address sockets to not function properly
johalun0_gmail.com edited the summary of this revision. (Show Details)Fri, Apr 20, 8:18 AM

Update to apply cleanly to HEAD as of 20180420.

kbowling accepted this revision.Fri, Apr 20, 7:45 PM
rwatson requested changes to this revision.Mon, Apr 23, 5:23 PM
rwatson added a subscriber: rwatson.

There is a surprising lack of locking assertions in the substantial new chunks of code proposed for in_pcb.c and in6_pcb.c. Could you add proper locking assertions so that we can better understand the locking protocol? Please also add a comment above new data-structure definitions to make clear what the locking requirements are. Please document the per-field locking requirements in in_pcb.h.

Could you explain why the lock order implied by soinherit() (between so and so_inh) deadlock free?

Could you upload a version of the patch with full context so that surrounding code can be inspected properly -- especially with respect to locking?

Could you make revisions to the patch to better implement style(9) -- e.g., return has parens, spaces between "if" and "(condition)", tabs + 4-space continuing-line indents? Also, declare all locals at the tops of functions. Remove unnecessary vertical whitespace. Line wrap at 77 or 78 characters.

Could you comment on the ABI/API/KBI/KPI implications of changing the type of so_options?

This revision now requires changes to proceed.Mon, Apr 23, 5:23 PM

There is a surprising lack of locking assertions in the substantial new chunks of code proposed for in_pcb.c and in6_pcb.c. Could you add proper locking assertions so that we can better understand the locking protocol? Please also add a comment above new data-structure definitions to make clear what the locking requirements are. Please document the per-field locking requirements in in_pcb.h.

This last sentence should have asked for per-field locking requirements in struct in_pcbinfo, both for consistency with existing fields, and also for the benefits of readers.

This revision was not accepted when it landed; it landed in state Needs Revision.Mon, Apr 23, 7:51 PM
This revision was automatically updated to reflect the committed changes.

Changes:

  • Fix style
  • Add lock assertions
  • Acquire INP_HASH_RLOCK in_pcblookup_lbgroup_last()
  • Acquire INP_WLOCK before calling soinherit()

Thanks for the feedback!

There is a surprising lack of locking assertions in the substantial new chunks of code proposed for in_pcb.c and in6_pcb.c. Could you add proper locking assertions so that we can better understand the locking protocol? Please also add a comment above new data-structure definitions to make clear what the locking requirements are. Please document the per-field locking requirements in in_pcb.h.

Improved in the latest version.

Could you explain why the lock order implied by soinherit() (between so and so_inh) deadlock free?

Hmm.. No. Anyone else with confidence here?

Could you upload a version of the patch with full context so that surrounding code can be inspected properly -- especially with respect to locking?

https://github.com/johalun/freebsd/tree/d11003/ <- is this what you mean?

Could you make revisions to the patch to better implement style(9) -- e.g., return has parens, spaces between "if" and "(condition)", tabs + 4-space continuing-line indents? Also, declare all locals at the tops of functions. Remove unnecessary vertical whitespace. Line wrap at 77 or 78 characters.

Yes, there was a lot of misses here. Hopefully I got them all now.

Could you comment on the ABI/API/KBI/KPI implications of changing the type of so_options?

Should be minimal implications. Of course, userland needs to be updated to use the new option. setsockopt() already takes an int as argument so no problem there.

sbruno reopened this revision.Tue, Apr 24, 7:56 PM
sbruno added a subscriber: sbruno.

Reopen at the request of the submitter.

This has been reverted from head at svn r332967.