
ktls: Use large pages for output when encrypting in software
ClosedPublic

Authored by markj on Feb 9 2021, 5:01 PM.

Details

Summary

Maintain a cache of physically contiguous runs of pages for
use as output buffers when software encryption is configured
and in-place encryption is not possible. This makes allocation
and free cheaper since in the common case we avoid touching
the vm_page structures for the buffer, and fewer calls into
UMA are needed.

It is possible that we will not be able to allocate these
buffers if physical memory is fragmented. To avoid frequently
calling into the physical memory allocator in this scenario,
rate-limit allocation attempts after a failure. In the failure
case we fall back to the old behaviour of allocating a page at
a time.
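As a rough userland model of the rate-limited fallback described above (all names here are hypothetical, not the kernel's; `try_contig_alloc()` stands in for the physical memory allocator):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

/*
 * Model of the rate-limiting scheme: after a failed attempt to get a
 * physically contiguous buffer, skip further attempts until "wakeup_at"
 * has passed.  Callers fall back to page-at-a-time allocation whenever
 * buffer_alloc() returns NULL.
 */
struct alloc_state {
	time_t	wakeup_at;	/* 0 means "contig allocation allowed" */
};

/* Stand-in for vm_page_alloc_contig(); flip to exercise both paths. */
static bool contig_alloc_works = true;

static void *
try_contig_alloc(size_t size)
{
	return (contig_alloc_works ? malloc(size) : NULL);
}

/*
 * Return a contiguous buffer, or NULL to tell the caller to fall back
 * to allocating one page at a time.
 */
static void *
buffer_alloc(struct alloc_state *st, size_t size, time_t now)
{
	void *buf;

	if (st->wakeup_at != 0) {
		if (now < st->wakeup_at)
			return (NULL);		/* still rate-limited */
		st->wakeup_at = 0;
	}
	buf = try_contig_alloc(size);
	if (buf == NULL)
		st->wakeup_at = now + 1;	/* back off for one second */
	return (buf);
}
```

In the kernel the state would be per-KTLS-thread, so no synchronization is needed on the timestamp.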

N.B.: if we could stash a pointer in the mbuf for a mapping
of the buffer, we could get rid of the ugly rate-limiting mechanism. In
fact, we could probably get rid of the new zone and just
use malloc(). Multi-page mallocs will go through the reservation
system and thus have a good chance of being physically contiguous
as well.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

markj requested review of this revision. Feb 9 2021, 5:01 PM

This is pretty exciting. I can try this on amd64 and let you know how much it helps. It won't be until next week, as I have a few things going on this week, and it looks like I may need to change the ktls_isa-l_crypto-kmod to catch up to having iniovcnt != outiovcnt.

Is there any reason to think that it won't work with TLS 1.3?

> This is pretty exciting. I can try this on amd64 and let you know how much it helps.

Thanks! I expect the improvement to be modest but that'll help inform whether this is worth doing.

> It won't be until next week, as I have a few things going on this week, and it looks like I may need to change the ktls_isa-l_crypto-kmod to catch up to having iniovcnt != outiovcnt.

I guess that's not in the tree? That would hopefully be a straightforward conversion.

> Is there any reason to think that it won't work with TLS 1.3?

I don't think so, I just haven't tested it yet. I'll try to do that before next week.

>> This is pretty exciting. I can try this on amd64 and let you know how much it helps.

> Thanks! I expect the improvement to be modest but that'll help inform whether this is worth doing.

>> It won't be until next week, as I have a few things going on this week, and it looks like I may need to change the ktls_isa-l_crypto-kmod to catch up to having iniovcnt != outiovcnt.

> I guess that's not in the tree? That would hopefully be a straightforward conversion.

It's in the ports tree:
https://github.com/freebsd/freebsd-ports/tree/master/security/ktls_isa-l_crypto-kmod/files

That reminds me: you'll probably want to bump KTLS_API_VERSION as part of this patch.

Drew

>>> This is pretty exciting. I can try this on amd64 and let you know how much it helps.

>> Thanks! I expect the improvement to be modest but that'll help inform whether this is worth doing.

>>> It won't be until next week, as I have a few things going on this week, and it looks like I may need to change the ktls_isa-l_crypto-kmod to catch up to having iniovcnt != outiovcnt.

>> I guess that's not in the tree? That would hopefully be a straightforward conversion.

> It's in the ports tree:
> https://github.com/freebsd/freebsd-ports/tree/master/security/ktls_isa-l_crypto-kmod/files

Thanks, I'll try writing a patch.

> That reminds me: you'll probably want to bump KTLS_API_VERSION as part of this patch.

Done.

Bump the KTLS API version.

This is nothing short of amazing. It cuts CPU use on an original Netflix 100G server (16 core Broadwell Xeon 2697A v4) from ~62% to ~52% at roughly 92Gb/s of a 100% ktls Netflix video serving workload.

sys/kern/uipc_ktls.c
2166

Would it make sense to test for m_epg_npgs > 2 in order to avoid using a large page for a small amount of data?

sys/kern/uipc_ktls.c
2166

So I think it would be better to always use the cache zone, i.e., we should go in the opposite direction. Suppose there's some severe memory shortage that causes the cache to be drained, and after that memory is too fragmented for contig allocations to succeed. Then, for each request we will miss in the cache, try to allocate a contig run and fail, and fall back to page-by-page allocations. So each KTLS thread will be acquiring multiple global locks per request with nothing to show for it. Instead, ktls_buffer_import() should allocate and map a non-contiguous run of pages if vm_page_alloc_contig_domain() fails.

Do you think the extra memory usage for small requests is significant?

Another possibility might be to stop using the cache zone entirely if allocations start failing, but IMO it is nicer to keep the complexity in ktls_buffer_import().

sys/kern/uipc_ktls.c
2166

You're talking about memory exhaustion causing zones to be drained. I'm concerned about more efficient use of memory in general, assuming there is something else besides this workload that could make effective use of the remaining memory.

For the Netflix workload (large media files), I think the risk of wasting memory due to small requests is minimal. However, I could imagine other people using KTLS on a more generic web workload and typically sending files which are smaller than 16K. For example, loading a few random pages in Firefox with the Developer->Network thing running, I see lots (> 50%) of requests for files smaller than 16K.

Would "right sizing" allocations and avoiding the 16K zone for small allocations be less efficient than what we have today?

sys/kern/uipc_ktls.c
2166

I had it in my head that these issues are related, but indeed, they're orthogonal. I agree now that we should fall back to page-by-page allocation for small requests.
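The threshold agreed on above can be sketched as a small userland model; the page-count computation and the `use_cache_zone` name are illustrative (the real code keys off m_epg_npgs):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE	4096	/* illustrative page size */

/* Number of pages spanned by a record of the given length. */
static int
npages(size_t len)
{
	return ((len + MODEL_PAGE_SIZE - 1) / MODEL_PAGE_SIZE);
}

/*
 * Only reach for a contiguous 16K buffer when the record spans more
 * than two pages; smaller records fall back to page-by-page
 * allocation so memory is not wasted on short requests.
 */
static bool
use_cache_zone(size_t len)
{
	return (npages(len) > 2);
}
```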

  • Add a chicken switch.
  • Rate-limit allocation attempts to once per second after a failure, per KTLS thread.

Upload the correct version of the diff.

sys/kern/uipc_ktls.c
2166

I ended up adding a primitive rate-limiting mechanism. We can't allocate KVA and do page-by-page allocation in ktls_buffer_import() very easily because the mapping is not saved in the mbuf, so ktls_free_mext_contig() doesn't know which buffer to free.

It occurred to me while I was updating the revision that we could stash the mapping in the first vm_page structure, by adding a field to the plinks union. That has the disadvantage of requiring a PHYS_TO_VM_PAGE() call in ktls_free_mext_contig() but that's not too bad. That can be done in a separate diff though.

I've tested this overnight on a Netflix server running at 90G.

This works well for us, is stable, and is a huge performance win.

This revision is now accepted and ready to land. Mar 3 2021, 2:35 PM