Page MenuHomeFreeBSD

Filter TCP connections to SO_REUSEPORT_LB listen sockets by NUMA domain
Needs ReviewPublic

Authored by gallatin on Sep 13 2019, 2:24 PM.

Details

Summary

In order to efficiently serve traffic on a NUMA machine, one must avoid as many NUMA domain crossings as possible. With SO_REUSEPORT_LB, a number of workers can share a listen socket. However, even if a worker sets affinity to a core or set of cores on a NUMA domain, it will receive connections associated with all NUMA domains in the system. This will lead to cross-domain traffic when the server writes to the socket or calls sendfile(), and memory is allocated on the server's local NUMA node, but transmitted on the NUMA node associated with the TCP connection. Similarly, when the server reads from the socket, he will likely be reading memory allocated on the NUMA domain associated with the TCP connection.

This change provides a new socket ioctl, TCP_REUSPORT_LB_NUMA. A server can now tell the kernel to filter traffic so that only incoming connections associated with the desired NUMA domain are given to the server. (Of course, in the case where there are no servers sharing the listen socket on some domain, then as a fallback, traffic will be hashed as normal to all servers sharing the listen socket regardless of domain). This allows a server to deal only with traffic that is local to its NUMA domain, and avoids cross-domain traffic in most cases.

This patch, and a corresponding small patch to nginx to use TCP_REUSPORT_LB_NUMA allows us to serve 190Gb/s of kTLS encrypted https media content from dual-socket Xeons with only 13% (as measured by pcm.x) cross domain traffic on the memory controller.

This was implemented at the TCP level because: 1) It deals with established listen sockets and 2) Because I cannot test anything other than TCP.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 26477

Event Timeline

gallatin created this revision.Sep 13 2019, 2:24 PM
bz requested changes to this revision.Sep 13 2019, 3:16 PM
bz added inline comments.
share/man/man4/tcp.4
37

Please do not forget to update .Dd before comitting.

343

Very long lines. Please "man-lint" using the usual tools (or by hand).

sys/netinet/in_pcb.c
151–152

il_numa_domain is a u_char (uint8_t), here (and possibly elsewhere) we are using an int? Is there a reason the types don't agree? I always found these things highly confusing in other old code when I had to deal with that. I understand that in other places the -1/-2 (TCP_REUSPORT_LB_NUMA_*) arguments are used outside the valid range.

381

in_pcblgroup_alloc is a static function. Is there any reason not to pass numa_domain in to it as well given the function initializes other fields as well?

449

Neither of those two need to be initialized here; there are possible return() calls before this, so should be moved down. Would probably also help to have a more consisten use of pcbinfo vs. inp->inp_pcbinfo in the code below.

468

extra blank line.

475

If I am not mistaken err = 0; here and then there'll only be two goto calls, which I am fine with as writing it that way saves a level of indentation. The early err assignment saves a vertical block.

480

Comments start with upper case and end in punctuation.

2024

Could go on one line.

2058

One one-line block has {} the other doesn't.

2069

This is just another || condition if the rest of the logic is right that it is local_wild both times?

sys/netinet/in_pcb.h
570

In general I'd prefer it spelt uint8_t in networking, also matches what's used in the mbuf pkthdr if I am not misremembering.

sys/netinet/tcp.h
191

Can this be sorted into the numerical space?

sys/netinet6/in6_pcb.c
924

See netinet comment.

This revision now requires changes to proceed.Sep 13 2019, 3:16 PM
gallatin updated this revision to Diff 62047.Sep 13 2019, 6:00 PM
gallatin marked 9 inline comments as done.

Address bz's concerns:

  • Shorten lines in man page
  • Make numa_domain consistently uint8_t
  • simplify logic in places
  • fix some style errors
gallatin added inline comments.Sep 13 2019, 6:00 PM
share/man/man4/tcp.4
343

I have no idea what man-lint is. I was using mandoc -T lint share/man/man4/tcp.4 and those long lines did not add to the existing 46 warnings. I have shortened the lines, but would appreciate a pointer to the proper tool.

sys/netinet/in_pcb.c
151–152

Changed to uint8_t

475

Yes, re-reading it, it is cleaner to reverse the logic.

2069

Indeed, this can be simplified a bit.

sys/netinet/in_pcb.h
570

I'm guilty of determining convention by looking at the line above :)

sys/netinet/tcp.h
191

I was trying to sort it into a place where it would not be jarring and stick out due to the length of the name, but I see your point.

bcr added a subscriber: bcr.Sep 13 2019, 6:11 PM

Comment on the proper man page checker to use.

share/man/man4/tcp.4
343

You can use the textproc/igor port package in addition to "mandoc -Tlint". The igor tool will tell you to wrap a line after a sentence stop (among other things).

gallatin updated this revision to Diff 62053.Sep 13 2019, 6:52 PM
  • Fixed line wrap issues pointed out by igor
gallatin added inline comments.Sep 13 2019, 6:53 PM
share/man/man4/tcp.4
343

Awesome! Thank you! I wish we had something like this for style(9). That would be almost as good as clang-format..

bcr accepted this revision as: manpages.Sep 13 2019, 7:15 PM

Approved by manpages. It would indeed help to have this for style.
I think someone was working on making that work with clang-format (maybe Ed Maste, not sure).

This isn't worth much consternation but I do wonder if we can try to limit the proliferation of these sockopts. For instance, if you generalize the existing _LB to take a generic integer arg, would that and cpuset achieve the same effect (and generalize to however people want to use those numbered groups?)

This isn't worth much consternation but I do wonder if we can try to limit the proliferation of these sockopts. For instance, if you generalize the existing _LB to take a generic integer arg, would that and cpuset achieve the same effect (and generalize to however people want to use those numbered groups?)

I hear you. The problem I have is that nginx creates the listen socket in the parent, and then forks the workers.

I had initially thought I could just make the existing _LB do the right thing on numa by just using the current domain as new listen sockets are created. But sadly, that does not work for us, so I came up with this generic thing that modifies existing listen sockets. This way, when a worker inherits a listen socket that's setup by the parent, its easy to tell the kernel to filter by the current numa domain.

swills added a subscriber: swills.Sep 17 2019, 6:16 PM