Unmapped mbufs allow sendfile to carry multiple pages of data in a
single mbuf, without mapping those pages. They are a requirement for
Netflix's in-kernel TLS, and provide a 5-10% CPU savings on heavy web
serving workloads when used by sendfile, due to effectively
compressing socket buffers by an order of magnitude, and hence
reducing cache misses.

For this new external mbuf type (EXT_PGS), the ext_buf pointer now
points to a struct mbuf_ext_pgs instead of a data buffer.
This structure contains an array of physical addresses (this reduces
cache misses compared to an earlier version that stored an array of
vm_page_t pointers). It also stores additional fields needed for
in-kernel TLS such as TLS header and trailer data that are currently
unused. To more easily detect these mbufs, the M_NOMAP flag is set in
m_flags in addition to M_EXT.

Various functions like m_copydata() have been updated to safely access
packet contents (using uiomove_fromphys()), to make things like BPF
safe.
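
As a rough illustration of the layout and access pattern described
above, the following userland C sketch models an EXT_PGS-style buffer
and a copy routine that walks it one page segment at a time. It is not
the kernel code: every toy_ name, the flag values, the page size, and
the field names are hypothetical, and ordinary pointers stand in for
physical addresses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096u	/* hypothetical page size */
#define TOY_MAX_PGS   4		/* hypothetical cap on pages per buffer */

/* Hypothetical flag bits; the real M_EXT and M_NOMAP values differ. */
#define TOY_M_EXT   0x1u
#define TOY_M_NOMAP 0x2u

/* Toy stand-in for struct mbuf_ext_pgs: an array of page addresses. */
struct toy_ext_pgs {
	uintptr_t pa[TOY_MAX_PGS];	/* pointers stand in for physical addresses */
	int	  npgs;			/* number of valid entries in pa[] */
	size_t	  first_off;		/* payload offset within the first page */
	size_t	  len;			/* total payload length in bytes */
};

struct toy_mbuf {
	unsigned	    flags;
	struct toy_ext_pgs *ext_pgs;	/* plays the role of ext_buf */
};

/* An unmapped buffer is detected by the flag bits alone. */
static int
toy_is_unmapped(const struct toy_mbuf *m)
{
	unsigned want = TOY_M_EXT | TOY_M_NOMAP;

	return ((m->flags & want) == want);
}

/* Copy len payload bytes starting at offset off, page by page. */
static void
toy_copydata(const struct toy_mbuf *m, size_t off, size_t len, char *dst)
{
	const struct toy_ext_pgs *pgs = m->ext_pgs;
	size_t pos;

	assert(toy_is_unmapped(m));
	assert(off + len <= pgs->len);

	pos = pgs->first_off + off;	/* position across the page array */
	while (len > 0) {
		size_t idx = pos / TOY_PAGE_SIZE;
		size_t pgoff = pos % TOY_PAGE_SIZE;
		size_t n = TOY_PAGE_SIZE - pgoff;

		if (n > len)
			n = len;
		assert(idx < (size_t)pgs->npgs);
		memcpy(dst, (const char *)pgs->pa[idx] + pgoff, n);
		dst += n;
		pos += n;
		len -= n;
	}
}
```

In the kernel the per-page copy cannot be a plain memcpy because the
pages are not mapped; that is the role uiomove_fromphys() plays in the
real functions.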

NIC drivers advertise support for unmapped mbufs on transmit via a new
IFCAP_NOMAP capability. This capability can be toggled via the new
'nomap' and '-nomap' ifconfig(8) commands. For NIC drivers that only
transmit packet contents via DMA and use bus_dma, adding the
capability to if_capabilities and if_capenable should be all that is
required.
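
The distinction between advertising and enabling the capability can be
sketched in userland C as follows. This is only a model: the toy_
structure, the bit value, and the helper names are hypothetical and do
not match the kernel's ifnet definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_IFCAP_NOMAP 0x1u	/* hypothetical bit value */

struct toy_ifnet {
	unsigned capabilities;	/* what the driver supports */
	unsigned capenable;	/* what is currently switched on */
};

/* What a DMA-only driver would do at attach time in this model. */
static void
toy_attach(struct toy_ifnet *ifp)
{
	assert(ifp != NULL);
	ifp->capabilities |= TOY_IFCAP_NOMAP;
	ifp->capenable |= TOY_IFCAP_NOMAP;
}

/*
 * Model of toggling the capability at runtime.
 * Returns 0 on success, -1 if the interface does not advertise it.
 */
static int
toy_set_nomap(struct toy_ifnet *ifp, int enable)
{
	assert(ifp != NULL);
	if ((ifp->capabilities & TOY_IFCAP_NOMAP) == 0)
		return (-1);
	if (enable)
		ifp->capenable |= TOY_IFCAP_NOMAP;
	else
		ifp->capenable &= ~TOY_IFCAP_NOMAP;
	return (0);
}
```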

If a NIC does not support unmapped mbufs, they are converted to a
chain of mapped mbufs (using sf_bufs to provide the mapping) in
ip_output or ip6_output. If an unmapped mbuf requires software
checksums, it is also converted to a chain of mapped mbufs before
computing the checksum.
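
The two conversion conditions above amount to a small decision, sketched
here in userland C. The structures, flag values, and the toy_must_map()
helper are hypothetical and exist only to illustrate the rule; the real
checks live in the IP output paths.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_IFCAP_NOMAP 0x1u	/* hypothetical capability bit */
#define TOY_M_NOMAP     0x2u	/* hypothetical mbuf flag bit */

struct toy_ifnet {
	unsigned capenable;	/* enabled capabilities */
};

struct toy_pkt {
	unsigned flags;
	bool	 needs_sw_csum;	/* checksum must be computed in software */
};

/* Decide whether an outgoing packet must first be converted to mapped buffers. */
static bool
toy_must_map(const struct toy_ifnet *ifp, const struct toy_pkt *p)
{
	assert(ifp != NULL && p != NULL);

	if ((p->flags & TOY_M_NOMAP) == 0)
		return (false);		/* already mapped, nothing to do */
	if ((ifp->capenable & TOY_IFCAP_NOMAP) == 0)
		return (true);		/* NIC cannot take unmapped data */
	return (p->needs_sw_csum);	/* software checksum reads the payload */
}
```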

Add support for using unmapped mbufs to hold data written on a socket
via sendfile(2). This can be enabled at runtime via the
kern.ipc.mb_use_ext_pgs sysctl.

Enable IFCAP_NOMAP for a vlan interface if it is supported by the
underlying trunk device.

Add support for IFCAP_NOMAP to cxgbe(4). Since cxgbe(4) uses sglist
instead of bus_dma, this required updates to the code that generates
scatter/gather lists for packets. Also, unmapped mbufs are always
sent via DMA and never as immediate data in the payload of a work
request.

Apply logic similar to sbcompress to pending data in the socket
buffer once it is marked ready via sbready. Normally sbcompress
merges small mbufs to reduce the length of mbuf chains in the socket
buffer. However, sbcompress cannot do this for mbufs marked
M_NOTREADY. sbcompress_ready is now called from sbready, when mbufs
are marked ready, to merge small mbufs once the data is available
to copy.
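
The idea of deferring the merge until buffers become ready can be
sketched in userland C as below. This is a simplified model, not the
kernel routine: the toy_ names, the fixed buffer capacity, and the
"merge whenever it fits" rule are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define TOY_BUF_CAP  64		/* hypothetical per-buffer capacity */
#define TOY_NOTREADY 0x1u	/* hypothetical stand-in for M_NOTREADY */

struct toy_buf {
	struct toy_buf *next;
	unsigned	flags;
	size_t		len;
	char		data[TOY_BUF_CAP];
};

/* Allocate a buffer holding the bytes of s (without the terminator). */
static struct toy_buf *
toy_buf_new(const char *s, unsigned flags)
{
	struct toy_buf *b = calloc(1, sizeof(*b));

	assert(b != NULL);
	assert(strlen(s) <= TOY_BUF_CAP);
	b->len = strlen(s);
	memcpy(b->data, s, b->len);
	b->flags = flags;
	return (b);
}

/*
 * Mark the first count buffers ready, then merge each ready buffer
 * into its ready predecessor while the combined data fits.
 */
static void
toy_ready_and_compress(struct toy_buf *head, int count)
{
	struct toy_buf *m = head;

	for (int i = 0; i < count && m != NULL; i++, m = m->next)
		m->flags &= ~TOY_NOTREADY;

	m = head;
	while (m != NULL && m->next != NULL) {
		struct toy_buf *n = m->next;

		/* Data that is not ready yet cannot be copied; stop here. */
		if ((m->flags & TOY_NOTREADY) != 0 ||
		    (n->flags & TOY_NOTREADY) != 0)
			break;
		if (m->len + n->len <= TOY_BUF_CAP) {
			memcpy(m->data + m->len, n->data, n->len);
			m->len += n->len;
			m->next = n->next;
			free(n);
		} else {
			m = n;
		}
	}
}
```

Calling toy_ready_and_compress() on a chain of three not-ready buffers
with a count of 2 merges the first two and leaves the third untouched
until a later call marks it ready.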

Submitted by: gallatin (earlier version)
Sponsored by: Netflix