Page MenuHomeFreeBSD

Add a new external mbuf type that holds multiple unmapped pages.
ClosedPublic

Authored by jhb on Jun 11 2019, 9:44 PM.

Details

Summary

Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. It is a requirement for
Netflix's in-kernel TLS, and provides a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf type (EXT_PGS), the ext_buf pointer now
points to a struct mbuf_ext_pgs structure instead of a data buffer.
This structure contains an array of physical addresses (this reduces
cache misses compared to an earlier version that stored an array of
vm_page_t pointers). It also stores additional fields needed for
in-kernel TLS such as TLS header and trailer data that are currently
unused. To more easily detect these mbufs, the M_NOMAP flag is set in
m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys(), to make things like BPF
safe.

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain mapped mbufs before
computing the checksum.

Add support for using unmapped mbufs to hold data written on a socket
via sendfile(2). This can be enabled at runtime via the
kern.ipc.mb_use_ext_pgs sysctl.

Enable IFCAP_NOMAP for a vlan interface if it is supported by the
underlying trunk device.

Add support for IFCAP_NOMAP to cxgbe(4). Since cxgbe(4) uses sglist
instead of bus_dma, this required updates to the code that generates
scatter/gather lists for packets. Also, unmapped mbufs are always
sent via DMA and never as immediate data in the payload of a work
request.

Apply similar logic from sbcompress to pending data in the socket
buffer once it is marked ready via sbready. Normally sbcompress
merges small mbufs to reduce the length of mbuf chains in the socket
buffer. However, sbcompress cannot do this for mbufs marked
M_NOTREADY. sbcompress_ready is now called from sbready when mbufs
are marked ready to merge small mbuf chains once the data is available
to copy.

Submitted by: gallatin (earlier version)
Sponsored by: Netflix

Test Plan
  • Netflix has run an earlier version of this in production for a long time.
  • tested via nginx + sendfile as well as netperf -t TCP_SENDFILE

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jhb created this revision.Jun 11 2019, 9:44 PM
Herald added a subscriber: ae. · View Herald Transcript
jhb added a comment.Jun 11 2019, 9:48 PM

In terms of commits I will break this up into a few commits, but I think it's best to have it all available for design review, etc. Also, there are some whitespace, etc. fixes in here that I will merge upstream and then rebase this on top of, but design review stuff can still happen now.

Candidate commits:

  • base unmapped mbufs
  • support for unmapped mbufs with sendfile
  • sbready_compress and supporting changes
  • IFCAP_NOMAP for vlan(4)
  • IFCAP_NOMAP for cxgbe(4)

What happens if a firewall is enabled and an unmapped mbuf is passed through pfil(9)?
I suspect that, if a pfil hook is hit, we'd also have to copy it in, just like when a checksum needs to be updated.

jhb added a comment.Jun 12 2019, 4:58 PM

What happens if a firewall is enabled and an unmapped mbuf is passed through pfil(9)?
I suspect that, if a pfil hook is hit, we'd also have to copy it in, just like when a checksum needs to be updated.

Only if the firewall needs to read/write actual packet data. Protocol headers (TCP, IP, etc.) are always stored in a normal mbuf at the start of a packet's mbuf chain. Unmapped mbufs only hold payload data that is stored in a socket buffer, so most of the filters I can think of off the top of my head as well as things like NAT should only operate on the normal mbuf holding the headers.

One thing I need to amend the commit log to include is that for EXT_PGS mbufs, the m_data field is no longer a direct pointer, but an offset into the buffer described by the backing 'struct mbuf_ext_pgs'. Thus, code that tries to use m_data directly will quickly fault which is at least somewhat useful as an assertion to quickly find something that doesn't work. I pondered adding an explicit assertion to mtod(), but Drew noted that the value of m_data means we will panic anyway. I did add an assertion to m_pullup() if it tries to pull up data from an unmapped mbuf.

In D20616#445618, @jhb wrote:

What happens if a firewall is enabled and an unmapped mbuf is passed through pfil(9)?
I suspect that, if a pfil hook is hit, we'd also have to copy it in, just like when a checksum needs to be updated.

Only if the firewall needs to read/write actual packet data. Protocol headers (TCP, IP, etc.) are always stored in a normal mbuf at the start of a packet's mbuf chain. Unmapped mbufs only hold payload data that is stored in a socket buffer, so most of the filters I can think of off the top of my head as well as things like NAT should only operate on the normal mbuf holding the headers.

Okay, thanks. That should indeed just work. The 'pf_check_proto_cksum()' flow, assuming there's no hardware assist, might break. I suspect that hardware which uses unmapped mbufs is always going to have checksum offload, so that's probably not an issue either.

jhb added a comment.Jun 12 2019, 5:28 PM
In D20616#445618, @jhb wrote:

What happens if a firewall is enabled and an unmapped mbuf is passed through pfil(9)?
I suspect that, if a pfil hook is hit, we'd also have to copy it in, just like when a checksum needs to be updated.

Only if the firewall needs to read/write actual packet data. Protocol headers (TCP, IP, etc.) are always stored in a normal mbuf at the start of a packet's mbuf chain. Unmapped mbufs only hold payload data that is stored in a socket buffer, so most of the filters I can think of off the top of my head as well as things like NAT should only operate on the normal mbuf holding the headers.

Okay, thanks. That should indeed just work. The 'pf_check_proto_cksum()' flow, assuming there's no hardware assist, might break. I suspect that hardware which uses unmapped mbufs is always going to have checksum offload, so that's probably not an issue either.

Is this routine used for transmit or only for receive? In the current patch, unmapped mbufs are only used for transmit. The comments in pf_check_proto_cksum imply it might only apply to receive in which case it should be fine as-is.

ae added a comment.Jun 12 2019, 6:08 PM

Only if the firewall needs to read/write actual packet data. Protocol headers (TCP, IP, etc.) are always stored in a normal mbuf at the start of a packet's mbuf chain. Unmapped mbufs only hold payload data that is stored in a socket buffer, so most of the filters I can think of off the top of my head as well as things like NAT should only operate on the normal mbuf holding the headers.

Okay, thanks. That should indeed just work. The 'pf_check_proto_cksum()' flow, assuming there's no hardware assist, might break. I suspect that hardware which uses unmapped mbufs is always going to have checksum offload, so that's probably not an issue either.

libalias based NAT is probably another candidate to look at or test.

In D20616#445627, @jhb wrote:

Is this routine used for transmit or only for receive? In the current patch, unmapped mbufs are only used for transmit. The comments in pf_check_proto_cksum imply it might only apply to receive in which case it should be fine as-is.

I believe it can be used in either direction.

emaste added a subscriber: emaste.Jun 18 2019, 8:40 PM
gallatin added inline comments.
sys/kern/kern_mbuf.c
898 ↗(On Diff #58545)

I just hit a panic on this kassert in the netflix kernel. The mbuf looked like the following. It was basically a tiny 192 byte ending chunk of a giant 73K ext_pgs mbuf. (see below).

I think that we can & should safely remove this check for NULL here.

(kgdb) p m->m_len
$5 = 0xc0
(kgdb) p m->m_data
$6 = (caddr_t) 0x11f40 <error: Cannot access memory at address 0x11f40>
(kgdb) p/d 0x11f40
$7 = 73536

(kgdb) p *(struct mbuf_ext_pgs *)0xfffff80637918000
$10 = {

npgs = 0x12,
nrdy = 0x0,
first_pg_off = 0x0,
last_pg_len = 0x1000,

<...>
(kgdb) p/d 0x12*4096
$11 = 73728
(kgdb) p/d 73728 - 0xc0
$12 = 73536

jhb marked an inline comment as done.Jun 27 2019, 9:33 PM
jhb added inline comments.
sys/kern/kern_mbuf.c
898 ↗(On Diff #58545)

Hmm, this needs more changes then. Namely, we need to set m_temp.m_data and possibly m_temp.m_len up so that the m_copydata will copy the correct bytes of data.

Ah, we do copy all that in the memcpy, I just need to fix the comment above the memcpy.

jhb updated this revision to Diff 59118.Jun 27 2019, 9:47 PM
jhb marked an inline comment as done.
  • Remove incorrect assertion.
jhb added a comment.Jun 27 2019, 10:27 PM
In D20616#445627, @jhb wrote:

Is this routine used for transmit or only for receive? In the current patch, unmapped mbufs are only used for transmit. The comments in pf_check_proto_cksum imply it might only apply to receive in which case it should be fine as-is.

I believe it can be used in either direction.

Hmm. As far as I can tell, pf_check_proto_cksum() is only called from pf_return() (and is only called with IPPROTO_TCP at that, so the existing function has dead code for ICMP, UDP, etc.) when replying to a dropped packet with a RST. (pf_return calls pf_send_tcp() with a non-NULL first argument, but pf_send_tcp() doesn't use this first argument ('replyto') either which is weird).

So, after reading some more, the purpose of pf_check_proto_cksum seems to be to avoid sending TCP RST's in response to packets with bad checksums. In order to have this see an unmapped mbuf, you would have to have pf(4) send a RST back to the host when the host tried to send an outgoing packet. That seems rather crazy for a firewall to do. No outgoing packet ever has a valid checksum at the stage that a firewall sees it as ip_output adds checksums after invoking pfil hooks. In addition, pf_check_proto_cksum() invokes bad checksum counters for a received packet on a mismatch (e.g. tcps_rcvbadsum) so I really think it can only possibly be used on an RX packet.

gallatin accepted this revision.Jun 28 2019, 12:39 PM
hselasky accepted this revision.Jun 28 2019, 1:09 PM
hselasky added inline comments.
sys/sys/mbuf.h
325 ↗(On Diff #59118)

Where does 152 and 156 come from?

384 ↗(On Diff #59118)

Extra ()'s:

((ext_pgs)) -> (ext_pgs)

gallatin added inline comments.Jun 28 2019, 1:26 PM
sys/sys/mbuf.h
325 ↗(On Diff #59118)

It is the amount of space left over when subtracting the size of the rest of the structure from 256 bytes (so as to make a nicely sized structure for the uma allocator).

rrs accepted this revision.Jun 28 2019, 3:43 PM
This revision is now accepted and ready to land.Jun 28 2019, 3:43 PM
jhb added inline comments.Jun 28 2019, 4:25 PM
sys/sys/mbuf.h
325 ↗(On Diff #59118)

And to be clear, the static assertion for the size will catch if this is wrong.

384 ↗(On Diff #59118)

This is a normal safety measure for C pre-processor macro definitions.

This revision was automatically updated to reflect the committed changes.