Page MenuHomeFreeBSD

Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain

Authored by gallatin on Sep 13 2019, 2:24 PM.
Referenced Files
Unknown Object (File)
Mon, May 13, 4:18 PM
Unknown Object (File)
Sat, May 11, 4:07 PM
Unknown Object (File)
Fri, May 10, 11:24 PM
Unknown Object (File)
Wed, May 1, 12:55 AM
Unknown Object (File)
Mon, Apr 29, 5:11 PM
Unknown Object (File)
Fri, Apr 26, 6:49 AM
Unknown Object (File)
Thu, Apr 25, 2:12 PM
Unknown Object (File)
Thu, Apr 25, 6:11 AM
"Like" token, awarded by mm.



In order to efficiently serve traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller.

This was implemented at the TCP level because: 1) It deals with established listen sockets and 2) Because I cannot test anything other than TCP.

Diff Detail

rS FreeBSD src repository - subversion
Lint Not Applicable
Tests Not Applicable

Event Timeline

bz requested changes to this revision.Sep 13 2019, 3:16 PM
bz added inline comments.
37 ↗(On Diff #62033)

Please do not forget to update .Dd before comitting.

343 ↗(On Diff #62033)

Very long lines. Please "man-lint" using the usual tools (or by hand).

151 ↗(On Diff #62033)

il_numa_domain is a u_char (uint8_t), here (and possibly elsewhere) we are using an int? Is there a reason the types don't agree? I always found these things highly confusing in other old code when I had to deal with that. I understand that in other places the -1/-2 (TCP_REUSPORT_LB_NUMA_*) arguments are used outside the valid range.

379 ↗(On Diff #62033)

in_pcblgroup_alloc is a static function. Is there any reason not to pass numa_domain in to it as well given the function initializes other fields as well?

448 ↗(On Diff #62033)

Neither of those two need to be initialized here; there are possible return() calls before this, so should be moved down. Would probably also help to have a more consisten use of pcbinfo vs. inp->inp_pcbinfo in the code below.

467 ↗(On Diff #62033)

extra blank line.

474 ↗(On Diff #62033)

If I am not mistaken err = 0; here and then there'll only be two goto calls, which I am fine with as writing it that way saves a level of indentation. The early err assignment saves a vertical block.

479 ↗(On Diff #62033)

Comments start with upper case and end in punctuation.

2027 ↗(On Diff #62033)

Could go on one line.

2062 ↗(On Diff #62033)

One one-line block has {} the other doesn't.

2069 ↗(On Diff #62033)

This is just another || condition if the rest of the logic is right that it is local_wild both times?

570 ↗(On Diff #62033)

In general I'd prefer it spelt uint8_t in networking, also matches what's used in the mbuf pkthdr if I am not misremembering.

190 ↗(On Diff #62033)

Can this be sorted into the numerical space?

920 ↗(On Diff #62033)

See netinet comment.

This revision now requires changes to proceed.Sep 13 2019, 3:16 PM
gallatin marked 9 inline comments as done.

Address bz's concerns:

  • Shorten lines in man page
  • Make numa_domain consistently uint8_t
  • simplify logic in places
  • fix some style errors
343 ↗(On Diff #62033)

I have no idea what man-lint is. I was using mandoc -T lint share/man/man4/tcp.4 and those long lines did not add to the existing 46 warnings. I have shortened the lines, but would appreciate a pointer to the proper tool.

151 ↗(On Diff #62033)

Changed to uint8_t

474 ↗(On Diff #62033)

Yes, re-reading it, it is cleaner to reverse the logic.

2069 ↗(On Diff #62033)

Indeed, this can be simplified a bit.

570 ↗(On Diff #62033)

I'm guilty of determining convention by looking at the line above :)

190 ↗(On Diff #62033)

I was trying to sort it into a place where it would not be jarring and stick out due to the length of the name, but I see your point.

Comment on the proper man page checker to use.

343 ↗(On Diff #62033)

You can use the textproc/igor port package in addition to "mandoc -Tlint". The igor tool will tell you to wrap a line after a sentence stop (among other things).

  • Fixed line wrap issues pointed out by igor
343 ↗(On Diff #62033)

Awesome! Thank you! I wish we had something like this for style(9). That would be almost as good as clang-format..

Approved by manpages. It would indeed help to have this for style.
I think someone was working on making that work with clang-format (maybe Ed Maste, not sure).

This isn't worth much consternation but I do wonder if we can try to limit the proliferation of these sockopts. For instance, if you generalize the existing _LB to take a generic integer arg, would that and cpuset achieve the same effect (and generalize to however people want to use those numbered groups?)

This isn't worth much consternation but I do wonder if we can try to limit the proliferation of these sockopts. For instance, if you generalize the existing _LB to take a generic integer arg, would that and cpuset achieve the same effect (and generalize to however people want to use those numbered groups?)

I hear you. The problem I have is that nginx creates the listen socket in the parent, and then forks the workers.

I had initially thought I could just make the existing _LB do the right thing on numa by just using the current domain as new listen sockets are created. But sadly, that does not work for us, so I came up with this generic thing that modifies existing listen sockets. This way, when a worker inherits a listen socket that's setup by the parent, its easy to tell the kernel to filter by the current numa domain.

Hello, any chance of this patch and D21648 getting committed before 13.0 freeze is in effect? I tested both changes against the latest HEAD (with some minor clean-up) and the functionality seems to work.

This generally looks ok to me, though inpcb binding is not something I've looked at recently. Just a few minor nits.

467 ↗(On Diff #80915)

Suggestion: you could drop this else since you return in the if.

485 ↗(On Diff #80915)

Blank lines before comments is typical in the tree.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 19 2020, 10:05 PM
This revision was automatically updated to reflect the committed changes.