Load balance sockets with new SO_REUSEPORT_LB option
Needs ReviewPublic

Authored by johalun0_gmail.com on May 31 2017, 7:06 AM.

Details

Summary

This patch adds a new socket option, SO_REUSEPORT_LB, which allow multiple programs or threads to bind to the same port and incoming connections will be load balanced using a hash function.

Most of the code was copied from a similar patch for DragonflyBSD.

However, in DragonflyBSD, load balancing is a global on/off setting and can not be set per socket. This patch allows for simultaneous use of both the current SO_REUSEPORT and the new SO_REUSEPORT_LB options on the same system.

Required changes to structures
Globally change so_options from 16 to 32 bit value to allow for more options.
Add hashtable in pcbinfo to hold all SO_REUSEPORT_LB sockets.

Limitations
As DragonflyBSD, a load balance group is limited to 256 pcbs (256 programs or threads sharing the same socket).

Discussion
If we want to avoid adding more options and changing so_options to a 32 bit value, can we use optvar argument to pass a flag (value other than 0 or 1) for load balancing or is that strictly used for setting the options on/off?

This is merely a draft, no effort has been made to test or implement for ipv6. Maybe someone more familiar with the IP stack can shed some light on what is needed for ipv6?

Possible bugs
Even if I close the sockets properly at exit, when I restart the server test program bind() fails with “Address already in use” and I have to wait for it to timeout.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
swills added a subscriber: swills.Jun 5 2017, 4:24 PM
hiren added a subscriber: hiren.

I am not too sure on details but can this be done with PCBGROUPS/RSS? https://github.com/erikarn/freebsd-rss/ has some examples.

Adding Adrian and Ermal if they have any comments.

i'm okay with this (as long as it doesn't break RSS/PCBGROUPS.) This is fine for people wanting non-RSS load balancing of things (eg people writing loopback servics) without doing it in userland.

Duplicate a couple of SO_REUSEPORT checks and do the same for SO_REUSEPORT_LB in 'in_pcb_bindsetup.c'.
Let SO_REUSEPORT_LB set SO_REUSEADDR to fix the "address in use" error earlier mentioned. Testing confirmed that the earlier "possible bug" is now fixed.

'XXX' marks a few locations that need attention/confirmation.

I think you took a pretty old version of the code. I fixed the hash stability issue in a later version; you probably want to check that. And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

Would implementing https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272 solve this?

And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

Would implementing https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272 solve this?

Yep.

And your implementation actually suffers the same issue as Linux: if any of the listen socket is closed, the sockets on the closed listen socket's accept queue are dropped; but well, that's may be acceptable, since I don't think Linux even fixed that.

Would implementing https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272 solve this?

Yep.

Cool. Thanks for the feedback!

socket inheritance
Allow socket inheritance between sockets with SO_REUSEPORT_LB option.
Inspired by https://github.com/DragonFlyBSD/DragonFlyBSD/commit/02ad2f0b874fb0a45eb69750219f79f5e8982272

The diff currently contains debug output code that will be removed later. Testing this feature is kind of difficult but I did as follow:

  • Launch two processes with 4 threads each on server machine that listen on the same port with SO_REUSEPORT_LB option.
  • Launch a client program on client machine that does 2000 instant connections to the server machine.
  • CTRL+C break one of the processes on the server machine to see if any sockets are transferred to the other server process.

Usually there are 2-4 "comp"-sockets on some listening socket that is target for inheritance. Not yet have I've seen any "incomp" sockets being inherited.
I added a temporary flag in the socket struct to be able to track inherited sockets.
By printing out the PID I can see that the killed server process' socket is calling inherit subroutine. Later I can see that the other server process' is doing sosend(), sofree(), etc on the inherited sockets. So far looking good but I'm not at all confident with this limited testing that the code is bug free.

There are a few places marked XXX that need extra attention and confirmation.

syncache
On DragonFlyBSD special care is taken to transfer the syncache to the new socket. However, I was not able to find any reference between the syncache and the sockets. It seems syncache is kept separate from the sockets so no action is required for that. Is this correct understanding?

gnn added a reviewer: glebius.Jul 13 2017, 4:08 PM

@sepherosa_gmail.com does this look good to you now?

Yep :)

Great! I will clean up some debug code and update the diff.

It is possible to separate white space changes from the rest of changes? Do you have this in git or are you using just modified subversion workdir?

Remove debug printing and clean up whitespace diffs. Applies cleanly to master branch as of 20170915.
Branch with this patch can be found here:
https://github.com/johalun/freebsd/tree/master-socket-loadbalancer

Fix build error.

I believe we need to update cddl/lib/libdtrace/tcp.d with this change also:

@@ -211,12 +192,12 @@
        tcps_rport =            p == NULL ? 0 : ntohs(p->t_inpcb->inp_inc.inc_ie.ie_fport);
        tcps_laddr =            p == NULL ? 0 :
            p->t_inpcb->inp_vflag == INP_IPV4 ?
-          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.id46_addr.ia46_addr4.s_addr) :
-          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.id6_addr);
+          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.ie46_local.ia46_addr4.s_addr) :
+          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependladdr.ie6_local);
        tcps_raddr =            p == NULL ? 0 :
            p->t_inpcb->inp_vflag == INP_IPV4 ?
-          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.id46_addr.ia46_addr4.s_addr) :
-          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.id6_addr);
+          inet_ntoa(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.ie46_foreign.ia46_addr4.s_addr) :
+          inet_ntoa6(&p->t_inpcb->inp_inc.inc_ie.ie_dependfaddr.ie6_foreign);
        tcps_state =            p == NULL ? -1 : p->t_state;
        tcps_iss =              p == NULL ? 0  : p->iss;
        tcps_irs =              p == NULL ? 0  : p->irs;

Update libdtrace with changes to struct field names.

Herald added 1 blocking reviewer(s): gnn. · View Herald TranscriptSep 26 2017, 12:15 PM

I believe we need to update cddl/lib/libdtrace/tcp.d with this change also:

Good catch! Updated now.

johalun0_gmail.com updated this revision to Diff 33740.EditedOct 6 2017, 8:34 AM
  • Add support for IPv6 (but not IPv4-mapped IPv6 addresses)
  • Change to M_NOWAIT in lbgroup malloc to solve non-sleepable lock issue
  • Removed some old printfs.

@glebius Look ok to you?

  • Minor fixes for IPv6
  • Fixed a bug that caused non-wildcard IP address sockets to not function properly