
Add support for KTLS RX via software decryption.

Authored by jhb on Apr 29 2020, 11:05 PM.



Allow TLS records to be decrypted in the kernel after being received
by a NIC. At a high level this is somewhat similar to software KTLS
for the transmit path except in reverse. Protocols enqueue mbufs
containing encrypted TLS records (or portions of records) into the tail
of a socket buffer and the KTLS layer decrypts those records before
returning them to userland applications. However, there is an important
difference between the two paths:

  • In the transmit case, the socket buffer is always a single "record" holding a chain of mbufs. Not-yet-encrypted mbufs are marked not ready (M_NOTREADY) and released to protocols for transmit by marking mbufs ready once their data is encrypted.
  • In the receive case, incoming (encrypted) data appended to the socket buffer is still a single stream of data from the protocol, but decrypted TLS records are stored as separate records in the socket buffer and read individually via recvmsg().

Initially I tried to make this work by marking incoming mbufs as
M_NOTREADY, but there didn't seem to be a non-gross way to deal with
picking a portion of the mbuf chain and turning it into a new record
in the socket buffer after decrypting the TLS record it contained
(along with prepending a control message). Such mbufs would also
need to be "pinned" in some way while they are being decrypted
so that a concurrent sbcut() wouldn't free them out from under the
thread performing decryption.

As such, I settled on the following solution:

  • Socket buffers now contain an additional chain of mbufs (sb_mtls, sb_mtlstail, and sb_tlscc) containing encrypted mbufs appended by the protocol layer. These mbufs are still marked M_NOTREADY, but soreceive*() generally don't know about them (except that they will block waiting for data to be decrypted for a blocking read).
  • Each time a new mbuf is appended to this TLS mbuf chain, the socket buffer peeks at the TLS record header at the head of the chain to determine the encrypted record's length. If enough data is queued for the TLS record, the socket is placed on a per-CPU TLS workqueue (reusing the existing KTLS workqueues and worker threads).
  • The worker thread loops over the TLS mbuf chain decrypting records until it runs out of data. Each record is detached from the TLS mbuf chain while it is being decrypted to keep the mbufs "pinned". However, a new sb_dtlscc field tracks the character count of the detached record and sbcut()/sbdrop() is updated to account for the detached record. After the record is decrypted, the worker thread first checks to see if sbcut() dropped the record. If so, it is freed (can happen when a socket is closed with pending data). Otherwise, the header and trailer are stripped from the original mbufs, a control message is created holding the decrypted TLS header, and the decrypted TLS record is appended to the "normal" socket buffer chain.
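To make the header peek in the second bullet concrete, here is a minimal userland sketch. The function name and the flat-buffer interface are my own for illustration (the real code walks an mbuf chain); the 5-byte record header layout, content type, protocol version, and a big-endian payload length, comes from the TLS record format:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TLS_HEADER_LEN 5   /* content type (1) + version (2) + length (2) */

/*
 * Given up to 'avail' bytes from the head of the encrypted chain,
 * return the total length of the TLS record (header plus payload),
 * or 0 if not even a full header has been queued yet.
 */
static size_t
tls_record_total_len(const uint8_t *buf, size_t avail)
{
	uint16_t payload_len;

	if (avail < TLS_HEADER_LEN)
		return (0);
	/* Payload length is a 16-bit big-endian field at offset 3. */
	payload_len = (uint16_t)((buf[3] << 8) | buf[4]);
	return ((size_t)TLS_HEADER_LEN + payload_len);
}
```

Once sb_tlscc reaches the value this computes, the socket is eligible for dispatch to a KTLS workqueue.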

(Side note: the SBCHECK() infrastructure was very useful as I was
able to add assertions there about the TLS chain that caught several
bugs during development.)

Diff Detail

rS FreeBSD src repository - subversion

Event Timeline

jhb requested review of this revision. Apr 29 2020, 11:05 PM
jhb created this revision.

I haven't looked at the patch in detail (and I'm far from an expert in this
area). However, the overall approach sounds fine to me and I have only
found the one little problem (leak of malloc'd "iov" in ktls_decrypt())
during testing so far. (See inline comment above.)

Thanks for doing this John.

1867 ↗(On Diff #71173)

There is a leak of "iov", since it never gets free'd.
I fixed it by adding a

free(iov, M_KTLS);


Alternately, it might be better to make "iov" and "iov_cap" static, so you avoid
doing malloc/free every time the function is called. (In other words, just let it grow to
the largest iov size needed and then let it stay allocated.)
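The grow-only idea suggested here can be sketched in userland C as follows. The names iov_reserve and the file-scope variables are hypothetical; as noted later in the thread, in the kernel this state would have to live in a per-worker-thread structure rather than true statics:

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/uio.h>

/*
 * Grow-only iovec array: reallocated only when a record needs more
 * entries than any previous one, and otherwise reused across calls,
 * so the common case performs no allocation at all.
 */
static struct iovec *iov;
static int iov_cap;

static struct iovec *
iov_reserve(int iov_count)
{
	struct iovec *niov;

	if (iov_count > iov_cap) {
		niov = realloc(iov, iov_count * sizeof(*niov));
		if (niov == NULL)
			return (NULL);	/* old array still valid */
		iov = niov;
		iov_cap = iov_count;
	}
	return (iov);
}
```

The trade-off is that the high-water-mark allocation stays around for the life of the worker thread.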

I just made another inline comment. It looks like SOCKBUF_LOCK(sb) gets called
twice in the if (error) { } block?

1794 ↗(On Diff #71173)

Should SOCKBUF_LOCK(sb); be called both here and above at line#1770?

1794 ↗(On Diff #71173)

My mistake. I now see sorwakeup_locked(so) unlocks
the socket buffer, so you need to lock it again.

1593 ↗(On Diff #71173)

I can't see anything broken in the following code to
allocate the mbuf, but dropping/acquiring the
SOCKBUF_LOCK(sb) does seem to risk introducing
a race.

Personally, I'd be tempted to pre-allocate the mbuf
in ktls_decrypt() when the SOCKBUF_LOCK() isn't held
and then pass it in as value/result (struct mbuf **).
If it is needed here, just set the argument NULL.
Then ktls_decrypt() does an m_free() on it, if non-NULL on return.
(If you'd like, I can code this up and pass a patch along.)

Just a suggestion, rick
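The pre-allocation pattern rick suggests can be sketched in userland C, with malloc() standing in for mbuf allocation and the function names invented for illustration: the caller allocates before taking the lock, the callee consumes the buffer via a value/result pointer only if it turns out to be needed, and the caller frees whatever is left over.

```c
#include <assert.h>
#include <stdlib.h>

/*
 * Consume the pre-allocated buffer only when a control message is
 * actually needed; ownership transfers by clearing *bufp.
 */
static int
append_control(void **bufp, int need_control)
{
	if (!need_control)
		return (0);
	if (*bufp == NULL)
		return (-1);	/* would have had to allocate under the lock */
	free(*bufp);		/* stand-in for linking it into the sockbuf */
	*bufp = NULL;		/* ownership transferred */
	return (0);
}

static int
decrypt_one(int need_control)
{
	void *buf;
	int error;

	buf = malloc(64);	/* pre-allocate with no locks held */
	/* ... lock, decrypt, possibly consume buf ... */
	error = append_control(&buf, need_control);
	/* ... unlock ... */
	free(buf);		/* no-op if append_control() took it */
	return (error);
}
```

This avoids dropping and re-acquiring SOCKBUF_LOCK() around the allocation, at the cost of one possibly wasted allocation per record.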

1747 ↗(On Diff #71173)

You could avoid doing the malloc/free of iov for some cases
by allocating a small iov on the stack and only doing malloc/free
when iov_count > size_of_the_one_on_the_stack.

Again, just a suggestion, rick
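The stack-allocation idea looks roughly like this userland sketch (the function name, the dummy fill, and the on-stack array size are all hypothetical):

```c
#include <stdlib.h>
#include <sys/uio.h>

#define IOV_ONSTACK 8	/* hypothetical size; tune to the common case */

/* Returns the number of iovec entries prepared, or -1 on failure. */
static int
process_record(int iov_count)
{
	struct iovec iov_stack[IOV_ONSTACK];
	struct iovec *iov;
	int i;

	/* Fall back to malloc only for unusually fragmented records. */
	if (iov_count > IOV_ONSTACK) {
		iov = malloc(iov_count * sizeof(*iov));
		if (iov == NULL)
			return (-1);
	} else
		iov = iov_stack;

	for (i = 0; i < iov_count; i++) {
		iov[i].iov_base = NULL;	/* dummy fill for the sketch */
		iov[i].iov_len = 0;
	}

	/* ... hand iov/iov_count to the decryption backend ... */

	if (iov != iov_stack)
		free(iov);
	return (iov_count);
}
```

(As it turned out, jhb removed the iovec construction entirely in a later revision, so this became moot; see below in the thread.)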

1867 ↗(On Diff #71173)

In case it wasn't clear, to make "iov" and "iov_cap" static,
they would need to be in a worker thread specific structure
passed in as an additional argument.

Probably not worth the extra code complexity, I think?
(Allocating a small iov on the stack as suggested above,
is an easier way to avoid some of the malloc/free calls,
I think?)

Comments inline. Mostly questions.

1593 ↗(On Diff #71173)

Just fyi, I've been testing with a printf in here and, not surprisingly,
it has never happened.
I think the above code is safe, given that there is only one thread
calling ktls_decrypt() for any given socket at a time.
That is how I currently understand the code?

1609 ↗(On Diff #71173)

I'm guessing "sb->sb_mtls != top" only happens
when the socket is closed/shut down?

1867 ↗(On Diff #71173)

Just fyi, I am using a simple patch that allocates a
small "iov" on the stack and only malloc's when a
large iov is needed.
I've passed the patch along to jhb@ and he can decide
what is best.

2122 ↗(On Diff #71173)

Since NFS traffic is bi-directional, do you think having two
worker threads for each socket (one for tx and the other rx)
would help w.r.t. performance?

I might try coding this up and seeing if I can measure any

jhb marked 9 inline comments as done. Jun 25 2020, 9:12 PM
jhb added inline comments.
1593 ↗(On Diff #71173)

Yes, it is the single threading of ktls_decrypt() that keeps this safe. Other threads can append more mbufs to the chain, but that doesn't matter while allocating. Other threads can also close the socket, which will purge mbufs from this chain; sb_mtls != top will be true in that case, and we punt and retry.

1609 ↗(On Diff #71173)

Yes, that is the only way mbufs can be removed from sb_mtls other than ktls_decrypt which is single-threaded.

1747 ↗(On Diff #71173)

What I ended up doing is changing the sw_decrypt interface for KTLS to pass down the mbuf chain directly instead of an iov. This avoids needing to create iovecs at all as the only SW backend for KTLS right now is the OCF backend which can handle mbuf chains directly. I will upload this version shortly, but it removes all of the iovec handling here.

1867 ↗(On Diff #71173)

This is all resolved in the new version by ditching the iov stuff entirely.

2122 ↗(On Diff #71173)

So I'm not sure if it would help or not. When doing software crypto, I think you are probably already CPU bound, so the overhead of extra context switches probably hurts performance.

One thing I haven't yet done with KTLS for either direction is explore using async callbacks when crypto is performed by a co-processor.

jhb marked 4 inline comments as done.
  • Rebase.
  • Update for separate output buffer changes.
  • Use separate AAD support to remove duplicate iovec.
  • Pass the mbuf chain down to the decryption backends.
  • Compile.
2079 ↗(On Diff #73667)

The case I was thinking about was a client, where there is typically only one
NFSv4 mount point on a system with more than one core.
However, the only time when there is concurrent bi-directional traffic is
for read-aheads and write-behinds.
I realized that, for these, most of the data is moving in one direction,
so having a second thread (on a different cpu) for the other direction
probably wouldn't help?
I also found coding it trickier than I thought it would be, so I never got it done.

I might still try it someday, but I agree with you that it is not likely to improve
performance.
Thanks for doing this, rick

  • Rebase.
  • Update for no-malloc.
  • Use ktls_ocf_dispatch.
  • Tidy to match other process methods.
  • More nomalloc fixes.
  • Yet more nomalloc.
This revision was not accepted when it landed; it landed in state Needs Review. Jul 23 2020, 11:48 PM
This revision was automatically updated to reflect the committed changes.