KTLS uses the embedded header and trailer fields of unmapped mbufs. This can lead to "silly" buffer lengths: when using AES-GCM, an mbuf chain will create a scatter/gather list with the following segment lengths (ignoring offsets in the pages for simplicity):
13..4096..4096..4096..4096..16..13..4096..4096..4096..4096..16..13..4096..4096..4096..16..13..4096..4096..4096.......
Notice how we have a 16-byte trailer followed by a 13-byte header at each boundary between adjacent TLS records.
For software KTLS we typically wind up with a pattern where several TLS records are encrypted and made ready at once. When these records are made ready, we can coalesce these silly buffers in sbready_compress by copying the 13-byte TLS header of the next record onto the end of the 16-byte TLS trailer of the current record. After doing so, we have:
13..4096..4096..4096..4096..29..4096..4096..4096..4096..29..4096..4096..4096..29..4096..4096..4096.......
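As an illustration only, the coalescing step can be modeled in userspace C. This is a sketch, not the sbready_compress code: struct rec, coalesce(), sg_lengths(), and the 64-byte trailer capacity are hypothetical names and values chosen for the example, and the real code operates on unmapped mbufs rather than a flat array.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_LEN 4096

/* Hypothetical stand-in for one TLS record held in an unmapped mbuf. */
struct rec {
	unsigned char hdr[16];   /* embedded TLS header bytes */
	size_t hdr_len;          /* 13 for the AES-GCM case above */
	size_t npages;           /* number of full payload pages */
	unsigned char trail[64]; /* embedded trailer; capacity is assumed */
	size_t trail_len;        /* 16 for the AES-GCM case above */
};

/*
 * Move the header of each record onto the end of the previous
 * record's trailer, when it fits in the trailer's spare room.
 */
static void
coalesce(struct rec *r, size_t n)
{
	for (size_t i = 0; i + 1 < n; i++) {
		struct rec *cur = &r[i];
		struct rec *next = &r[i + 1];

		if (next->hdr_len == 0 ||
		    cur->trail_len + next->hdr_len > sizeof(cur->trail))
			continue;
		memcpy(cur->trail + cur->trail_len, next->hdr, next->hdr_len);
		cur->trail_len += next->hdr_len;
		next->hdr_len = 0;
	}
}

/* Emit the scatter/gather segment lengths, skipping empty segments. */
static size_t
sg_lengths(const struct rec *r, size_t n, size_t *out, size_t cap)
{
	size_t k = 0;

	for (size_t i = 0; i < n; i++) {
		assert(k + r[i].npages + 2 <= cap);
		if (r[i].hdr_len != 0)
			out[k++] = r[i].hdr_len;
		for (size_t p = 0; p < r[i].npages; p++)
			out[k++] = PAGE_LEN;
		if (r[i].trail_len != 0)
			out[k++] = r[i].trail_len;
	}
	return (k);
}
```

In this model each coalesced record boundary removes one scatter/gather segment, replacing the 16-byte and 13-byte entries with a single 29-byte entry.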
This marginally increases PCIe bus efficiency. We've seen an almost 1 Gb/s increase in peak throughput on Broadwell-based Xeons running a 100% software TLS workload with Mellanox ConnectX-4 NICs.
Note that this change is ifdef'ed for KTLS, as KTLS is currently the only user of the hdr/trailer feature of unmapped mbufs. Peeking into them is expensive, since the ext_pgs struct lives in separately allocated memory and may be cold in cache.
This optimization is not applicable to HW ("NIC") TLS, as that depends on having the entire TLS record described by a single unmapped mbuf, so we cannot shift parts of the record between mbufs for HW TLS.