[This is too hacky to commit and I've only tested it with TLS 1.2 and
AES-GCM so far. I'm wondering whether it's worth pursuing this
further.]

While profiling software KTLS on an arm64 server I noticed that driver
ithreads spend a lot of time freeing anonymous pages. In particular,
for each frame we allocate a set of 4KB pages to store the encrypted
payload before sending it to the NIC. A larger frame size (i.e., a
larger ktls_maxlen value) is beneficial because it reduces the number
of mbufs we have to allocate and free in the transmit path, so the
idea is to go further and allocate contiguous runs of pages to store
output data, using a UMA zone to cache such runs.

Concretely, the patch maintains a cache of physically contiguous runs
of pages for use as output buffers when software encryption is
configured and in-place encryption is not possible. This makes
allocation and free cheaper since in the common case we avoid touching
the vm_page structures for the buffer, and fewer calls into UMA are
needed.
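Roughly, this can be done with a UMA cache zone whose import and
release callbacks allocate and free whole contiguous runs. Here is a
minimal sketch of what I have in mind; the names (ktls_buffer_zone,
ktls_buffer_import, ktls_buffer_release) are illustrative and not
necessarily what the final patch will use:

/*
 * Sketch: a cache zone whose items are wired, physically contiguous
 * buffers of ktls_maxlen bytes, addressed via the direct map.
 */
static uma_zone_t ktls_buffer_zone;

static int
ktls_buffer_import(void *arg, void **store, int count, int domain,
    int flags)
{
	vm_page_t m;
	int i;

	for (i = 0; i < count; i++) {
		/* Allocate a wired run of pages with no VM object. */
		m = vm_page_alloc_contig_domain(NULL, 0, domain,
		    VM_ALLOC_NOOBJ | VM_ALLOC_NODUMP | VM_ALLOC_WIRED,
		    atop(ktls_maxlen), 0, ~(vm_paddr_t)0, PAGE_SIZE, 0,
		    VM_MEMATTR_DEFAULT);
		if (m == NULL)
			break;
		/* Hand out the direct-map address of the run. */
		store[i] = (void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
	}
	return (i);
}

static void
ktls_buffer_release(void *arg __unused, void **store, int count)
{
	vm_page_t m;
	int i, j;

	for (i = 0; i < count; i++) {
		m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)store[i]));
		for (j = 0; j < atop(ktls_maxlen); j++) {
			(void)vm_page_unwire_noq(&m[j]);
			vm_page_free(&m[j]);
		}
	}
}

/* At initialization time: */
ktls_buffer_zone = uma_zcache_create("ktls_buffers",
    roundup2(ktls_maxlen, PAGE_SIZE), NULL, NULL, NULL, NULL,
    ktls_buffer_import, ktls_buffer_release, NULL,
    UMA_ZONE_FIRSTTOUCH);

With this arrangement a cached allocation is just a pointer pop from a
per-CPU bucket; the vm_page structures are touched only on import and
release.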
It is possible that we will not be able to allocate these buffers if
physical memory is fragmented. To avoid frequently calling into the
physical memory allocator in that scenario, the patch rate-limits
allocation attempts after a failure; in the failure case we fall back
to the old behaviour of allocating a page at a time.
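The rate limit itself can be as simple as recording the time of the
last failed attempt and not retrying for a second. A sketch,
continuing the one above (the lastallocfail field and the helper name
are illustrative):

/*
 * Sketch of the rate limit: after a failed contiguous allocation,
 * skip the zone for one second and let the caller fall back to
 * allocating a page at a time.
 */
static void *
ktls_buffer_alloc(struct ktls_wq *wq)
{
	void *buf;

	if ((u_int)(ticks - wq->lastallocfail) < hz)
		return (NULL);
	buf = uma_zalloc(ktls_buffer_zone, M_NOWAIT);
	if (buf == NULL)
		wq->lastallocfail = ticks;
	return (buf);
}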
The main downsides are:
- Memory overcommit. We allocate and cache runs of length ktls_maxlen,
so for smaller frames we will leave some memory unused. My feeling is
that in a typical KTLS/sendfile workload a large majority of transmits
will be of the maximum length, however.
- This scheme basically assumes that we will always be allocating and
freeing KTLS buffers in the same NUMA domain. If that's not true,
we'll get a lot of cross-domain frees, which can quickly blow up the
size of the cache. We could mitigate this by adding a limit to the
zone size, as we do for the per-CPU page caches; see the sketch after
this list. Allocations will also be expensive if we miss in the UMA
caches, because vm_page_alloc_contig() has to lock the per-domain free
queues. I expect that in a properly configured system this will not
be a problem.
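For the zone size limit mentioned above, something like
uma_zone_set_maxcache() on the zone from the earlier sketch ought to
work; the constant here is arbitrary:

/*
 * Sketch: cap the number of cached buffers so that cross-domain
 * frees cannot grow the cache without bound.
 */
uma_zone_set_maxcache(ktls_buffer_zone, 8 * mp_ncpus);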
The upsides are:
- Less time spent in the page allocator, and we don't have to unwire
pages when freeing mbufs.
- Less fragmentation and better TLB efficiency, since output buffers
are physically contiguous.
- Shorter SG lists for DMA to the NIC.

N.B.: if we could stash a pointer in the mbuf for a mapping of the
buffer, we could get rid of the ugly rate-limiting mechanism. In fact,
we could probably get rid of the new zone and just use malloc().
Multi-page mallocs will go through the reservation system and thus
have a good chance of being physically contiguous.
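For comparison, the malloc()-based variant would reduce to something
like this (M_KTLS_BUF is a hypothetical malloc type):

/* Hypothetical malloc()-based variant; M_KTLS_BUF is illustrative. */
static MALLOC_DEFINE(M_KTLS_BUF, "ktlsbuf", "KTLS output buffers");

static void *
ktls_buffer_alloc_malloc(void)
{
	/* A multi-page allocation, likely physically contiguous. */
	return (malloc(ktls_maxlen, M_KTLS_BUF, M_NOWAIT));
}

static void
ktls_buffer_free_malloc(void *buf)
{
	free(buf, M_KTLS_BUF);
}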
I don't have good CPU profiling tools on this particular platform, so
it's hard to say how much of an improvement this gives, but I do see
fewer samples from mlx5 transmit completions with this change.
arm64 supports 16KB and 64KB base page sizes, and using a larger page
size is, I think, a much better optimization for KTLS/sendfile than
this one. But I'm curious whether this change is useful on amd64 as
well.