
socket: Implement SO_SPLICE
ClosedPublic

Authored by markj on Aug 22 2024, 2:22 PM.

Details

Summary

This is a feature which allows one to splice two TCP sockets together
such that data which arrives on one socket is automatically pushed into
the send buffer of the spliced socket. This can be used to make TCP
proxying more efficient as it eliminates the need to copy data into and
out of userspace.

The interface is copied from OpenBSD, and this implementation aims to be
compatible. Splicing is enabled by setting the SO_SPLICE socket option.
When spliced, data that arrives on the receive buffer is automatically
forwarded to the other socket. In particular, splicing is a
unidirectional operation; to splice a socket pair in both directions,
SO_SPLICE needs to be applied to both sockets. More concretely, when
setting the option one passes the following struct:

    struct splice {
            int fd;
            off_t max;
            struct timeval idle;
    };

where "fd" refers to the socket to which the first socket is to be
spliced, and two setsockopt(SO_SPLICE) calls are required to set up a
bi-directional splice.
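
As a rough illustration of the two-call setup described above, here is a minimal userspace sketch. It assumes the installed header spells the fields `sp_fd`, `sp_max` and `sp_idle` as OpenBSD does (the summary abbreviates them as fd, max and idle), and that a zeroed `sp_max` and `sp_idle` mean "no limit". The helper names `splice_one_way()` and `splice_both_ways()` are hypothetical and are not part of this patch.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>

/*
 * Forward everything received on `src` to `dst`.
 * Returns 0 on success, -1 with errno set on failure.
 * ASSUMPTION: sp_fd/sp_max/sp_idle are the field names in <sys/socket.h>.
 */
int
splice_one_way(int src, int dst)
{
#ifdef SO_SPLICE
	struct splice sp;

	memset(&sp, 0, sizeof(sp));	/* zero max/idle: assumed "no limit" */
	sp.sp_fd = dst;			/* the sink socket */
	return (setsockopt(src, SOL_SOCKET, SO_SPLICE, &sp, sizeof(sp)));
#else
	(void)src;
	(void)dst;
	errno = ENOPROTOOPT;		/* this platform has no SO_SPLICE */
	return (-1);
#endif
}

/* A splice is unidirectional, so a full proxy needs one call per direction. */
int
splice_both_ways(int a, int b)
{
	if (splice_one_way(a, b) == -1)
		return (-1);
	return (splice_one_way(b, a));
}
```

A proxy would typically call `splice_both_ways(client_fd, upstream_fd)` once both TCP connections are established, then wait for the sockets to be unspliced or closed.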

select(), poll() and kevent() do not return when data arrives in the
receive buffer of a spliced socket, as such data is expected to be
removed automatically once space is available in the corresponding send
buffer. Userspace can perform I/O on spliced sockets, but it will be
unpredictably interleaved with splice I/O.

A splice can be configured to unsplice once a certain number of bytes
have been transmitted, or after a given time period. Once unspliced,
the socket behaves normally from userspace's perspective. The number of
bytes transmitted via the splice can be retrieved using
getsockopt(SO_SPLICE); this works after unsplicing as well, up until the
socket is closed or spliced again. Userspace can also manually trigger
unsplicing by splicing to -1.
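
The limits, the byte counter and manual unsplicing could be driven from userspace roughly as follows. This is a sketch under assumptions: the `sp_`-prefixed field names are borrowed from OpenBSD's header, getsockopt(SO_SPLICE) is assumed to return the count as an `off_t` as it does there, and the helper names are hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <errno.h>
#include <string.h>

/*
 * Splice `src` into `dst`, asking the kernel to unsplice after `max_bytes`
 * bytes or once the idle timeout `idle_secs` expires (0 is assumed to
 * mean "no limit" for both).
 */
int
splice_limited(int src, int dst, off_t max_bytes, long idle_secs)
{
#ifdef SO_SPLICE
	struct splice sp;

	memset(&sp, 0, sizeof(sp));
	sp.sp_fd = dst;
	sp.sp_max = max_bytes;
	sp.sp_idle.tv_sec = idle_secs;
	return (setsockopt(src, SOL_SOCKET, SO_SPLICE, &sp, sizeof(sp)));
#else
	(void)src; (void)dst; (void)max_bytes; (void)idle_secs;
	errno = ENOPROTOOPT;		/* this platform has no SO_SPLICE */
	return (-1);
#endif
}

/* Manually unsplice `src` by splicing it to descriptor -1. */
int
unsplice(int src)
{
	return (splice_limited(src, -1, 0, 0));
}

/* Store the number of bytes forwarded from `src` so far in *out. */
int
splice_bytes(int src, off_t *out)
{
#ifdef SO_SPLICE
	socklen_t len = sizeof(*out);

	return (getsockopt(src, SOL_SOCKET, SO_SPLICE, out, &len));
#else
	(void)src; (void)out;
	errno = ENOPROTOOPT;
	return (-1);
#endif
}
```

For example, a proxy that has parsed a Content-Length could call `splice_limited(client, upstream, body_len, 30)` and regain control of the socket once the body has been forwarded.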

Splicing work is handled by dedicated threads, similar to KTLS. A
worker thread is assigned at splice creation time. At some point it
would be nice to have a direct dispatch mode, wherein the thread which
places data into a receive buffer is also responsible for pushing it
into the sink, but this requires tighter integration with the protocol
stack in order to avoid reentrancy problems.

Currently, sowakeup() and related functions will signal the worker
thread assigned to a spliced socket. so_splice_xfer() does the hard
work of moving data between socket buffers.

Co-authored by: gallatin
Sponsored by: Klara, Inc.
Sponsored by: Stormshield
Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 59145
Build 56032: arc lint + arc unit

Event Timeline

markj requested review of this revision. Aug 22 2024, 2:22 PM

I will post some benchmark numbers soon. We wrote a small TCP proxy to test this, and used it to optionally splice iperf3 sessions. In general SO_SPLICE gives a substantial throughput improvement and CPU utilization reduction versus userspace proxying, so should be useful for various proxy applications. For instance, relayd can run as a transparent proxy and makes use of SO_SPLICE on OpenBSD, and I'd like to enable the use of SO_SPLICE in the FreeBSD port.

Note that this implementation only supports TCP for now; if anyone would find UDP support useful, please say so.

0mp added inline comments.
lib/libsys/getsockopt.2
675

It might be nice to add a version number here. Mentioning when the implementation was added to FreeBSD would also be cool.

markj marked an inline comment as done.
  • Remove a stray debug print.
  • Add some version numbers to the man page.

I will post some benchmark numbers soon.

I tried a test with two Ampere Altra systems directly connected by Mellanox ConnectX-4 Lx 25Gbps NICs. One runs a simple TCP proxy that accepts one or more connections and creates corresponding connections to a target address, passing all data between them. I ran an iperf3 client and server on the other machine, such that traffic is looped through the proxy and then sent back over the same link.

When the proxy copies all data in userspace, a pair of TCP streams reaches about 12Gbps, evenly divided per stream, and CPU utilization is at about 5%. When the proxy instead splices the connections together using SO_SPLICE, we get about 19Gbps, again evenly divided, and CPU utilization is at 3%. In particular, we get lower CPU utilization even though mlx5 is pushing ~50% more data.
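
For comparison, the userspace baseline in this kind of test boils down to a copy loop along the following lines. This is only a sketch of the general technique, not the proxy used for the numbers above; `relay_copy()` and its buffer size are hypothetical. Every byte crosses the kernel/user boundary twice, which is the cost SO_SPLICE avoids.

```c
#include <sys/types.h>
#include <errno.h>
#include <unistd.h>

/*
 * Copy bytes from `from` to `to` until EOF on `from`.
 * Returns the total number of bytes copied, or -1 with errno set.
 */
ssize_t
relay_copy(int from, int to)
{
	char buf[16384];
	ssize_t total = 0;

	for (;;) {
		ssize_t n = read(from, buf, sizeof(buf));

		if (n == 0)
			return (total);		/* peer closed: done */
		if (n == -1) {
			if (errno == EINTR)
				continue;	/* interrupted: retry */
			return (-1);
		}
		/* write() may be short, so loop until the chunk is out. */
		for (ssize_t off = 0; off < n; ) {
			ssize_t w = write(to, buf + off, (size_t)(n - off));

			if (w == -1) {
				if (errno == EINTR)
					continue;
				return (-1);
			}
			off += w;
		}
		total += n;
	}
}
```

A real proxy would run one such loop per direction, or multiplex many connections with kevent() and non-blocking sockets.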

Unrelatedly, the relayd port will be updated to make use of this feature, as it does on OpenBSD; relayd provides various TCP/HTTP/TLS proxying functionality.

brooks added inline comments.
sys/kern/uipc_socket.c
3943

This unfortunately wants COMPAT_FREEBSD32 handling (I almost suggested padding after the fd, but struct timeval needs handling regardless.)

markj marked an inline comment as done.
  • Log the splice structure with ktrace.
  • Add 32-bit compat handling.
sys/kern/uipc_socket.c
223

Hmm, this is probably not quite right, I think off_t is permitted to have 4 byte alignment on 32-bit platforms, but not on 64-bit platforms.

  • Fix the splice32 compat definition. I verified that the regression tests now pass when compiled for a 32-bit target.
arrowd added inline comments.
lib/libsys/getsockopt.2
575

What will happen if s1 is spliced with s2 and s2 is spliced with s1?

lib/libsys/getsockopt.2
575

Then data that arrives on s1 is transmitted to s2, and data received by s2 is transmitted to s1. This is the normal use-case.

I think you are maybe asking about what happens if there is a cycle in the set of spliced sockets? Extending your example, this could arise if s1 is connected to s0, and s2 is connected to s3, and s0 and s3 are spliced together in both directions. Or more simply if s1 is spliced to itself.

The implementation doesn't try to handle this. Data will pass through endlessly, much as it would if userspace were copying data in a loop. I believe this is ok so long as the splicing process remains killable. Once the process is killed, the associated connections will be closed, and the splice worker thread will detect this and automatically tear down the splice.

sys/kern/uipc_socket.c
223

This isn't right either as off_t is 64-bit aligned on non-i386 32-bit platforms. I think keeping off_t and making the structure packed on amd64 will do the job.
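
To make the alignment concern concrete, the two candidate layouts can be compared in userspace with `offsetof`. The types below are hypothetical stand-ins written for illustration and are not the definitions from the patch; `_Alignas(8)` models an ABI where the 64-bit off_t is 8-byte aligned, while the packed variant models the i386 ABI, where it only needs 4-byte alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit timeval stand-in. */
struct timeval32_demo {
	int32_t tv_sec;
	int32_t tv_usec;
};

/* i386-style layout: the 64-bit offset follows the fd directly. */
struct splice32_packed_demo {
	int32_t fd;
	int64_t max;
	struct timeval32_demo idle;
} __attribute__((packed));

/* Layout where the 64-bit offset is 8-byte aligned: 4 bytes of padding. */
struct splice32_aligned_demo {
	int32_t fd;
	_Alignas(8) int64_t max;
	struct timeval32_demo idle;
};

_Static_assert(offsetof(struct splice32_packed_demo, max) == 4,
    "packed: max follows fd with no padding");
_Static_assert(offsetof(struct splice32_aligned_demo, max) == 8,
    "aligned: max is padded out to an 8-byte boundary");
_Static_assert(sizeof(struct splice32_packed_demo) == 20,
    "packed: 4 + 8 + 8 bytes");
_Static_assert(sizeof(struct splice32_aligned_demo) == 24,
    "aligned: 4 + 4 (pad) + 8 + 8 bytes");
```

Since the two layouts disagree on both the offset of `max` and the total size, a single unconditional compat definition cannot serve both kinds of 32-bit process, which is why the packing is made conditional on the host architecture.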

3944

if (SV_CURPROC_FLAG(SV_ILP32)) { is used above.

markj marked 2 inline comments as done.

Address Brooks' comments:

  • Fix the compat32 shim for non-x86 32-bit platforms.
  • Be consistent in how we test for a 32-bit process.

Interface bits all look good. I've not deeply reviewed the bit around moving data.

There are a couple remaining cases of double new lines.

This revision is now accepted and ready to land. Aug 30 2024, 6:16 PM

This passes basic sanity testing at Netflix. Sorry for the delayed approval; we had a few integration issues between this and a local Netflix feature that made it look like splice was not working. It only just now became obvious that the cause was our local feature, and how to fix it.

  • Remove unused code which set up CPU and NUMA domain pinning for splice worker threads.
  • Lower the priority of splice worker threads to PUSER. These threads effectively proxy user process work and so shouldn't get too high a scheduling priority, given that it's possible to create loops among spliced sockets.
This revision now requires review to proceed. Sep 9 2024, 1:51 PM
This revision is now accepted and ready to land. Sep 9 2024, 4:07 PM
This revision was automatically updated to reflect the committed changes.