[This is too hacky to commit and I've only tested it with TLS 1.2 and
AES-GCM so far. I'm wondering whether it's worth pursuing this
further.]

While profiling software KTLS on an arm64 server I noticed that driver
ithreads spend a lot of time freeing anonymous pages. In particular,
for each frame we allocate a set of 4KB pages to store the encrypted
payload before sending it to the NIC. A larger frame size (i.e., a
larger ktls_maxlen value) is beneficial because it reduces the number
of mbufs we have to allocate and free in the transmit path, so the
idea is to go further and allocate contiguous runs of pages to store
output data, using a UMA zone to cache such runs.

Concretely, the patch maintains a cache of physically contiguous runs
of pages for use as output buffers when software encryption is
configured and in-place encryption is not possible. This makes
allocation and free cheaper since in the common case we avoid touching
the vm_page structures for the buffer, and fewer calls into UMA are
needed.
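Roughly, this can be done with a UMA cache zone whose import and
release callbacks allocate and free whole contiguous runs. Here is a
minimal sketch of what I have in mind; the names (ktls_buffer_zone,
ktls_buffer_import, ktls_buffer_release) are illustrative and not
necessarily what the final patch will use:

/*
 * Sketch: a cache zone whose items are wired, physically contiguous
 * buffers of ktls_maxlen bytes, addressed via the direct map.
 */
static uma_zone_t ktls_buffer_zone;

static int
ktls_buffer_import(void *arg, void **store, int count, int domain,
    int flags)
{
	vm_page_t m;
	int i;

	for (i = 0; i < count; i++) {
		/* Allocate a wired run of pages with no VM object. */
		m = vm_page_alloc_contig_domain(NULL, 0, domain,
		    VM_ALLOC_NOOBJ | VM_ALLOC_NODUMP | VM_ALLOC_WIRED,
		    atop(ktls_maxlen), 0, ~(vm_paddr_t)0, PAGE_SIZE, 0,
		    VM_MEMATTR_DEFAULT);
		if (m == NULL)
			break;
		/* Hand out the direct-map address of the run. */
		store[i] = (void *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(m));
	}
	return (i);
}

static void
ktls_buffer_release(void *arg __unused, void **store, int count)
{
	vm_page_t m;
	int i, j;

	for (i = 0; i < count; i++) {
		m = PHYS_TO_VM_PAGE(DMAP_TO_PHYS((vm_offset_t)store[i]));
		for (j = 0; j < atop(ktls_maxlen); j++) {
			(void)vm_page_unwire_noq(&m[j]);
			vm_page_free(&m[j]);
		}
	}
}

/* At initialization time: */
ktls_buffer_zone = uma_zcache_create("ktls_buffers",
    roundup2(ktls_maxlen, PAGE_SIZE), NULL, NULL, NULL, NULL,
    ktls_buffer_import, ktls_buffer_release, NULL,
    UMA_ZONE_FIRSTTOUCH);

With this arrangement a cached allocation is just a pointer pop from a
per-CPU bucket; the vm_page structures are touched only on import and
release.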
It is possible that we will not be able to allocate these buffers if
physical memory is fragmented. To avoid frequently calling into the
physical memory allocator in that scenario, the patch rate-limits
allocation attempts after a failure; in the failure case we fall back
to the old behaviour of allocating a page at a time.
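The rate limit itself can be as simple as recording the time of the
last failed attempt and not retrying for a second. A sketch,
continuing the one above (the lastallocfail field and the helper name
are illustrative):

/*
 * Sketch of the rate limit: after a failed contiguous allocation,
 * skip the zone for one second and let the caller fall back to
 * allocating a page at a time.
 */
static void *
ktls_buffer_alloc(struct ktls_wq *wq)
{
	void *buf;

	if ((u_int)(ticks - wq->lastallocfail) < hz)
		return (NULL);
	buf = uma_zalloc(ktls_buffer_zone, M_NOWAIT);
	if (buf == NULL)
		wq->lastallocfail = ticks;
	return (buf);
}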
The main downsides are:
- Memory overcommit. We allocate and cache runs of length ktls_maxlen,
so for smaller frames we will leave some memory unused. My feeling is
that in a typical KTLS/sendfile workload a large majority of transmits
will be of the maximum length, however.
- This scheme basically assumes that we will always be allocating and
freeing KTLS buffers in the same NUMA domain. If that's not true,
we'll get a lot of cross-domain frees, which can quickly blow up the
size of the cache. We could mitigate this by adding a limit to the
zone size, as we do for the per-CPU page caches; see the sketch after
this list. Allocations will also be expensive if we miss in the UMA
caches, because vm_page_alloc_contig() has to lock the per-domain free
queues. I expect that in a properly configured system this will not
be a problem.
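For the zone size limit mentioned above, something like
uma_zone_set_maxcache() on the zone from the earlier sketch ought to
work; the constant here is arbitrary:

/*
 * Sketch: cap the number of cached buffers so that cross-domain
 * frees cannot grow the cache without bound.
 */
uma_zone_set_maxcache(ktls_buffer_zone, 8 * mp_ncpus);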
The upsides are:
- Less time spent in the page allocator, and we don't have to unwire
pages when freeing mbufs.
- Less fragmentation and better TLB efficiency, since output buffers
are physically contiguous.
- Shorter SG lists for DMA to the NIC.

N.B.: if we could stash a pointer in the mbuf for a mapping of the
buffer, we could get rid of the ugly rate-limiting mechanism. In fact,
we could probably get rid of the new zone and just use malloc().
Multi-page mallocs will go through the reservation system and thus
have a good chance of being physically contiguous.
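For comparison, the malloc()-based variant would reduce to something
like this (M_KTLS_BUF is a hypothetical malloc type):

/* Hypothetical malloc()-based variant; M_KTLS_BUF is illustrative. */
static MALLOC_DEFINE(M_KTLS_BUF, "ktlsbuf", "KTLS output buffers");

static void *
ktls_buffer_alloc_malloc(void)
{
	/* A multi-page allocation, likely physically contiguous. */
	return (malloc(ktls_maxlen, M_KTLS_BUF, M_NOWAIT));
}

static void
ktls_buffer_free_malloc(void *buf)
{
	free(buf, M_KTLS_BUF);
}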
I don't have good CPU profiling tools on this particular platform, so
it's hard to say how much of an improvement this gives, but I do see
fewer samples from mlx5 transmit completions with this change.
arm64 supports 16KB and 64KB base page sizes, and using a larger page
size is, I think, a much better optimization for KTLS/sendfile than
this one. But I'm curious whether this change is useful on amd64 as
well.