Add support for KTLS RX via software decryption.
ClosedPublic
Actions

Authored by jhb on Apr 29 2020, 11:05 PM.

Details

Reviewers

Group Reviewers

transport
manpages

Commits

rS363464: Add support for KTLS RX via software decryption.

Summary

Allow TLS records to be decrypted in the kernel after being received
by a NIC. At a high level this is somewhat similar to software KTLS
for the transmit path except in reverse. Protocols enqueue mbufs
containing encrypted TLS records (or portions of records) into the tail
of a socket buffer and the KTLS layer decrypts those records before
returning them to userland applications. However, there is an important
difference:

In the transmit case, the socket buffer is always a single "record" holding a chain of mbufs. Not-yet-encrypted mbufs are marked not ready (M_NOTREADY) and released to protocols for transmit by marking mbufs ready once their data is encrypted.

In the receive case, incoming (encrypted) data appended to the socket buffer is still a single stream of data from the protocol, but decrypted TLS records are stored as separate records in the socket buffer and read individually via recvmsg().

Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seemed to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message). Also, such mbufs would
also need to be "pinned" in some way while they are being decrypted
such that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.

As such, I settled on the following solution:

Socket buffers now contain an additional chain of mbufs (sb_mtls, sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by the protocol layer. These mbufs are still marked M_NOTREADY, but soreceive*() generally don't know about them (except that they will block waiting for data to be decrypted for a blocking read).

Each time a new mbuf is appended to this TLS mbuf chain, the socket buffer peeks at the TLS record header at the head of the chain to determine the encrypted record's length. If enough data is queued for the TLS record, the socket is placed on a per-CPU TLS workqueue (reusing the existing KTLS workqueues and worker threads).

The worker thread loops over the TLS mbuf chain decrypting records until it runs out of data. Each record is detached from the TLS mbuf chain while it is being decrypted to keep the mbufs "pinned". However, a new sb_dtlscc field tracks the character count of the detached record and sbcut()/sbdrop() is updated to account for the detached record. After the record is decrypted, the worker thread first checks to see if sbcut() dropped the record. If so, it is freed (can happen when a socket is closed with pending data). Otherwise, the header and trailer are stripped from the original mbufs, a control message is created holding the decrypted TLS header, and the decrypted TLS record is appended to the "normal" socket buffer chain.

(Side note: the SBCHECK() infrastucture was very useful as I was
able to add assertions there about the TLS chain that caught several
bugs during development.)

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhb requested review of this revision.Apr 29 2020, 11:05 PM

jhb created this revision.

Harbormaster completed remote builds in B30803: Diff 71173.Apr 29 2020, 11:05 PM

jhb added a child revision: D24452: Initial support for kernel offload of TLS receive..Apr 29 2020, 11:07 PM

crest_bultmann.eu added a subscriber: crest_bultmann.eu.Apr 30 2020, 11:25 AM

emaste added a subscriber: emaste.Apr 30 2020, 1:49 PM

I haven't looked at the patch in detail (and I'm far from an expert in this
area). However, the overall approach sounds fine to me and I have only
found the one little problem (leak of malloc'd "iov" in ktls_decrypt())
during testing sofar. (See inline comment above.)

Thanks for doing this John.

sys/kern/uipc_ktls.c
1867 ↗	(On Diff #71173)	There is a leak of "iov", since it never gets free'd. I fixed it by adding a free(iov, M_KTLS); here. Alternately, it might be better to make "iov" and "iov_cap" static, so you avoid doing malloc/free every time the function is called. (In other words, just let it grow to the largest iov size needed and then let it stay allocated.)

I just made another inline comment. It looks like SOCKBUF_LOCK(sb) gets called
twice in the if (error) { } block?

sys/kern/uipc_ktls.c
1794 ↗	(On Diff #71173)	Should SOCKBUF_LOCK(sb); be called both here and above at line#1770?

rmacklem added inline comments.May 5 2020, 4:52 AM

sys/kern/uipc_ktls.c
1794 ↗	(On Diff #71173)	My mistake. I now see sorwakeup_locked(so) unlocks the socket buffer, so you need to lock it again.

rmacklem added inline comments.May 5 2020, 12:15 PM

sys/kern/uipc_ktls.c
1593 ↗	(On Diff #71173)	I can't see anything broken in the following code to allocate the mbuf, but dropping/acquiring the SOCKBUF_LOCK(sb) does seem to risk introducing a race. Personally, I'd be tempted to pre-allocate the mbuf in ktls_decrypt() when the SOCKBUF_LOCK() isn't held and then pass it in as value/result (struct mbuf **). If it is needed here, just set the argument NULL. Then ktls_decrypt() does an m_free() on it, if non-NULL on return. (If you'd like, I can code this up and pass a patch along.) Just a suggestion, rick
1747 ↗	(On Diff #71173)	You could avoid doing the malloc/free of iov for some cases by allocating a small iov on the stack and only doing malloc/free when iov_count > size_of_the_one_on_the_stack. Again, just a suggestion, rick
1867 ↗	(On Diff #71173)	In case it wasn't clear, to make "iov" and "iov_cap" static, they would need to be in a worker thread specific structure passed in as an additional argument. Probably not worth the extra code complexity, I think? (Allocating a small iov on the stack as suggested above, is an easier way to avoid some of the malloc/free calls, I think?)

gbe added a subscriber: gbe.May 5 2020, 12:28 PM

Comments inline. Mostly questions.

sys/kern/uipc_ktls.c
1593 ↗	(On Diff #71173)	Just fyi, I've been testing with a printf in here and, not surprising, it has never happened. I think the above code is safe, given that there is only one thread calling ktls_decrypt() for any given socket at a time. That is how I currently understand the code?
1609 ↗	(On Diff #71173)	I'm guessing "sb->sb_mtls != top" only happens when the socket is closed/shut down?
1867 ↗	(On Diff #71173)	Just fyi, I am using a simple patch that allocates a small "iov" on the stack and only malloc's when a large iov is needed. I've passed the patch along to jhb@ and he can decide what is best.
2122 ↗	(On Diff #71173)	Since NFS traffic is bi-directional, do you think having two worker threads for each socket (one for tx and the other rx) would help w.r.t. performance? I might try coding this up and seeing if I can measure any difference.

jhb marked 9 inline comments as done.Jun 25 2020, 9:12 PM

jhb added inline comments.

sys/kern/uipc_ktls.c
1593 ↗	(On Diff #71173)	Yes, it is the single threading of ktls_decrypt() that keeps this safe. Other threads can append more mbufs to the chain but that doesn't matter while allocating. Other threads can also close the socket which will purge imbues from this change. The `sb_mtls != top` will be true in that case in which case we punt and retry.
1609 ↗	(On Diff #71173)	Yes, that is the only way mbufs can be removed from `sb_mtls` other than `ktls_decrypt` which is single-threaded.
1747 ↗	(On Diff #71173)	What I ended up doing is changing the sw_decrypt interface for KTLS to pass down the mbuf chain directly instead of an iov. This avoids needing to create iovecs at all as the only SW backend for KTLS right now is the OCF backend which can handle mbuf chains directly. I will upload this version shortly, but it removes all of the iovec handling here.
1867 ↗	(On Diff #71173)	This is all resolved in the new version by ditching the `iov` stuff entirely.
2122 ↗	(On Diff #71173)	So i'm not sure if it would help or not. When doing software crypto I think you are probably already CPU bound and so the overhead of extra context switches probably hurts performance. One thing I haven't yet done with KTLS for either direction is explore using async callbacks when crypto is performed by a co-processor.

Rebase.
Update for separate output buffer changes.
Use separate AAD support to remove duplicate iovec.
Pass the mbuf chain down to the decryption backends.
Compile.

Harbormaster completed remote builds in B31966: Diff 73667.Jun 25 2020, 9:14 PM

rmacklem added inline comments.Jun 26 2020, 3:48 AM

sys/kern/uipc_ktls.c
2079 ↗	(On Diff #73667)	The case I was thinking about was a client, where there is typically only one NFSv4 mount point on a system with more than one core. However, the only time when there is concurrent bi-directional traffic is for read-aheads and write-behinds. I realized that, for these, most of the data is moving in one direction, so having a second thread (on a different cpu) for the other direction probably wouldn't help? I also found coding it trickier than I thought it would be, so I never got it done. I might still try it someday, but I agree with you that it is not likely to improve performance. Thanks for doing this, rick