Add support for optional separate output buffers to in-kernel crypto.
ClosedPublic

Authored by jhb on Apr 22 2020, 10:28 PM.

Details

Reviewers

gallatin
cem
jtl
jmg

Group Reviewers

manpages

Commits

rS361804: Use separate output buffers for OCF requests in KTLS.
rS361485: Support separate output buffers for aesni(4).
rS361484: Support separate output buffers in ccr(4).
rS361483: Add a sysctl knob to use separate output buffers for /dev/crypto.
rS361482: Export the _kern_crypto sysctl node from crypto.c.
rS361481: Add support for optional separate output buffers to in-kernel crypto.

Summary

Some crypto consumers such as GELI and KTLS for file-backed sendfile
need to store their output in a separate buffer from the input.
Currently these consumers copy the contents of the input buffer into
the output buffer and queue an in-place crypto operation on the output
buffer. Using a separate output buffer avoids this copy.

Create a new 'struct crypto_buffer' describing a crypto buffer containing a type and type-specific fields. crp_ilen is gone; instead, buffers that use a flat kernel buffer have a cb_buf_len field for their length. The length of other buffer types is inferred from the backing store (e.g. uio_resid for a uio). Requests now have two such structures: crp_buf for the input buffer, and crp_obuf for the output buffer.

Consumers now use helper functions (crypto_use_*, e.g. crypto_use_mbuf()) to configure the input buffer. If an output buffer is not configured, the request still modifies the input buffer in-place. A consumer uses a second set of helper functions (crypto_use_output_*) to configure an output buffer.
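As a rough illustration of the two-buffer layout and the helper pattern described above, here is a userland model in C. It is a sketch only: every identifier prefixed with model_ or MODEL_ is hypothetical, only the flat-buffer case is modelled, and the real kernel structures and crypto_use_*() signatures in the patch may differ. The cb_buf_len, crp_buf and crp_obuf field names are taken from the summary.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userland model only; not the kernel definitions from the patch. */
enum model_buf_type { MODEL_BUF_NONE = 0, MODEL_BUF_CONTIG };

struct model_crypto_buffer {
	enum model_buf_type cb_type;	/* which backing store is in use */
	char	*cb_buf;		/* flat buffer (CONTIG only) */
	size_t	 cb_buf_len;		/* length of the flat buffer */
};

struct model_cryptop {
	struct model_crypto_buffer crp_buf;	/* input buffer */
	struct model_crypto_buffer crp_obuf;	/* optional output buffer */
};

/* Configure the input buffer, in the spirit of crypto_use_*(). */
static void
model_use_buf(struct model_cryptop *crp, void *buf, size_t len)
{
	assert(buf != NULL);
	crp->crp_buf.cb_type = MODEL_BUF_CONTIG;
	crp->crp_buf.cb_buf = buf;
	crp->crp_buf.cb_buf_len = len;
}

/* Configure the output buffer, in the spirit of crypto_use_output_*(). */
static void
model_use_output_buf(struct model_cryptop *crp, void *buf, size_t len)
{
	assert(buf != NULL);
	crp->crp_obuf.cb_type = MODEL_BUF_CONTIG;
	crp->crp_obuf.cb_buf = buf;
	crp->crp_obuf.cb_buf_len = len;
}

/* A request has a separate output buffer only if one was configured. */
static int
model_has_output(const struct model_cryptop *crp)
{
	return (crp->crp_obuf.cb_type != MODEL_BUF_NONE);
}

/* Buffer that results land in: crp_obuf if set, else in-place in crp_buf. */
static struct model_crypto_buffer *
model_dest(struct model_cryptop *crp)
{
	return (model_has_output(crp) ? &crp->crp_obuf : &crp->crp_buf);
}
```

In this model a zeroed request with only model_use_buf() applied behaves as an in-place request, which mirrors the fallback described above.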

Consumers must request support for separate output buffers when creating a crypto session via the CSP_F_SEPARATE_OUTPUT flag and are only permitted to queue a request with a separate output buffer on sessions with this flag set. Existing drivers already reject sessions with unknown flags, so this permits drivers to be modified to support this extension without requiring all drivers to change.

Several data-related functions now have matching versions that operate on an explicit buffer (e.g. crypto_apply_buf, crypto_contiguous_subsegment_buf, bus_dma_load_crp_buf).

Most of the existing data-related functions operate on the input buffer. However crypto_copyback always writes to the output buffer if a request uses a separate output buffer.

For the regions in input/output buffers, the following conventions are followed:
- AAD and IV are always present in input only and their fields are offsets into the input buffer.
- payload is always present in both buffers. If a request uses a separate output buffer, it must set a new crp_payload_start_output field to the offset of the payload in the output buffer.
- digest is in the input buffer for verify operations, and in the output buffer for compute operations. crp_digest_start is relative to the appropriate buffer.

Add a crypto buffer cursor abstraction. This is a more general form of some bits in the cryptosoft driver that tried to always use uio's. However, compared to the original code, this avoids rewalking the uio iovec array for requests with multiple vectors. It also avoids allocating an iovec array for mbufs and populating it by instead walking the mbuf chain directly.
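To show why a cursor avoids rewalking the iovec array, here is a minimal userland sketch over struct iovec. The model_cursor type and its functions are hypothetical and cover only the iovec case; the cursor in the patch also handles other buffer types and its interface may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* The cursor remembers its position, so each call resumes where the
 * previous one stopped instead of scanning from the first vector. */
struct model_cursor {
	const struct iovec *cc_iov;	/* current vector */
	int	cc_iovcnt;		/* vectors left, including current */
	size_t	cc_offset;		/* offset into current vector */
};

static void
model_cursor_init(struct model_cursor *cc, const struct iovec *iov,
    int iovcnt)
{
	assert(cc != NULL && iov != NULL);
	cc->cc_iov = iov;
	cc->cc_iovcnt = iovcnt;
	cc->cc_offset = 0;
}

/* Copy up to len bytes from the cursor into dst and advance the cursor.
 * Returns the number of bytes copied, which is short only at the end. */
static size_t
model_cursor_copydata(struct model_cursor *cc, void *dst, size_t len)
{
	char *out = dst;
	size_t done = 0;

	while (done < len && cc->cc_iovcnt > 0) {
		size_t avail = cc->cc_iov->iov_len - cc->cc_offset;
		size_t todo = (len - done < avail) ? len - done : avail;

		memcpy(out + done,
		    (const char *)cc->cc_iov->iov_base + cc->cc_offset, todo);
		done += todo;
		cc->cc_offset += todo;
		if (cc->cc_offset == cc->cc_iov->iov_len) {
			/* Current vector exhausted; step to the next one. */
			cc->cc_iov++;
			cc->cc_iovcnt--;
			cc->cc_offset = 0;
		}
	}
	return (done);
}
```

A driver processing a request block by block would call model_cursor_copydata() repeatedly; the total work is linear in the data size regardless of how many vectors the buffer is split into.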

Update the cryptosoft(4) driver to support separate output buffers making use of the cursor abstraction.

Export _kern_crypto node.

Add a sysctl knob to use separate output buffers for /dev/crypto.

This is a testing aid to permit testing a driver's support of
separate output buffers via cryptocheck.

Support separate output buffers in ccr(4).

Use separate output buffers in KTLS.

For now this has a knob to force the old behavior to permit easy comparison.

Support separate output buffers for aesni(4).

Test Plan

tested cryptocheck both with/without separate output buffers against cryptosoft, aesni0, and ccr
tested KTLS both with/without separate output buffers against cryptosoft, aesni0, and ccr

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhb created this revision. · Apr 22 2020, 10:28 PM

Herald added subscribers: melifaro, ae. · Apr 22 2020, 10:28 PM

Harbormaster completed remote builds in B30678: Diff 70900. · Apr 22 2020, 10:28 PM

I would not do this all as one commit. I have a series that is probably about what I would commit here: https://github.com/freebsd/freebsd/compare/master...bsdjhb:ocf_output_buffer

I have not yet done documentation updates. If we are happy with the design, I will add those before committing.

The motivator for this was improving performance of KTLS with OCF. While I don't have screaming results, I do see some improvements by removing the memcpy. My test was the one I usually use for KTLS which is to have nginx running on the KTLS server and have a client box fire up 64 instances of openssl s_time all fetching the same 1GB file from nginx. The tests were using the default AES-GCM cipher. Both boxes are single package 4x2 Haswell boxes with a T6, and cc0 of both boxes cabled back-to-back. My measurements are not super precise, but are ballpark-ish by watching vmstat 1 output (to see the range of idle vs non-idle CPU usage) as well as using an internal tool at Chelsio that reported the RX bandwidth on the receiver once a second.

Configuration         Bandwidth  System CPU usage
--------------------  ---------  ----------------
Userland TLS          29-30Gbps  100%
ISA-L KTLS            50-51Gbps  100%
inplace cryptosoft    3.01Gbps   100%
separate cryptosoft   3.19Gbps   100%
inplace aesni0        38-39Gbps  97%
separate aesni0       40Gbps     97%
inplace ccr0          45Gbps     75-80%
separate ccr0         45-47Gbps  60-70%

The ccr numbers are a bit uneven depending on which ports you use. These numbers are for the default setup of using both crypto engines on the T6, but in some tests I got a bit more throughput at the same CPU usage by only using the port for cc1 (which was idle during the tests). However, the drop in CPU usage of around 10-15% was consistent. For aesni0, cutting out the memcpy can matter less since the data is likely still in cache from the memcpy when the crypto operation runs. Also, aesni0 is still doing another copy internally into a buffer in the KTLS case, as aesni0 insists on only operating on a single virtually contiguous segment and resorts to a malloc + copy if that isn't true. I can try modifying aesni to use the crypto cursor abstraction to walk the buffer, which would let it avoid that copy. If I parameterize aesni_wrap.c and use some #ifdef'ry I can probably have it also do the ISA-L thing of using movnt when there is a separate output buffer.

Looked mostly at ktls

This revision is now accepted and ready to land. · Apr 27 2020, 1:56 PM

gbe added a subscriber: gbe. · May 1 2020, 1:24 PM