Details

Reviewers

gallatin
jhb
kib
markj
• hselasky

Commits

rG43bcc15a4270: ktls: Refer send tag pointer once.
rGfe8c78f0d202: ktls: Add full support for TLS RX offloading via network interface.
rGf0fca6461804: ktls: Refer send tag pointer once.
rGcb27627968e3: mlx5en(4): Set the leaf network interface field in the mbuf packet header.
rG892eded5b8b0: vlan(4): Add support for allocating TLS receive tags.
rG1967e313795f: lagg(4): Add support for allocating TLS receive tags.
rG9e2cce7e6a87: Implement a function to get the next TCP- and TLS- receive sequence number.
rGc8f2c290e40d: Add definitions for TLS receive tags using the existing send tag infrastructure.

Summary

Basic TLS RX offloading uses the "csum_flags" field in the mbuf packet
header to figure out if an incoming mbuf has been fully offloaded or
not. This information follows the packet stream via the LRO engine, IP
stack and finally to the TCP stack. The TCP stack preserves the mbuf
packet header also when re-assembling packets after packet loss. When
the mbuf goes into the socket buffer the packet header is demoted and
the offload information is transferred to "m_flags" . Later on a
worker thread will analyze the mbuf flags and decide if the mbufs
making up a TLS record indicate a fully-, partially- or not decrypted
TLS record. Based on these three cases the worker thread will either
pass the packet on as-is or recrypt the decrypted bits, if any, and
decrypt the packet as usual.

During packet loss the kernel TLS code will call back into the network
driver using the send tag, informing about the TCP starting sequence
number of every TLS record that is not fully decrypted by the network
interface. The network interface then stores this information in a
compressed table and starts asking the hardware if it has found a
valid TLS header in the TCP data payload. If the hardware has found a
valid TLS header and the referred TLS header is at a valid TCP
sequence number according to the TCP sequence numbers provided by the
kernel TLS code, the network driver then informs the hardware that it
can resume decryption.

Care has been taken to not merge encrypted and decrypted mbuf chains,
in the LRO engine and when appending mbufs to the socket buffer.

The mbuf's leaf network interface pointer is used to figure out from
which network interface the offloading rule should be allocated. Also
this pointer is used to track route changes.

Currently mbuf send tags are used in both transmit and receive
direction, due to convenience, but may get a new name in the future to
better reflect their usage.

MFC after: 1 week
Sponsored by: NVIDIA Networking

Test Plan

Patch is on top of "main":
https://github.com/freebsd/freebsd.git

Run iperf from:
https://github.com/Mellanox/iperf_ssl

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

jhb added inline comments.Jan 5 2022, 11:07 PM

sys/kern/uipc_ktls.c
2354 ↗	(On Diff #100361)	Maybe make this ST_DECRYPTED explicitly for readability?
2356 ↗	(On Diff #100361)	I think you can axe the XXX comments and instead add a larger block comment here: /* * NIC TLS is only supported for AEAD ciphersuites which used a fixed sized trailer. */
2458 ↗	(On Diff #100361)	Is state != -1 still the right check?
sys/kern/uipc_sockbuf.c
938 ↗	(On Diff #100361)	I would maybe simplify this a bit to avoid duplication: if (n->m_flags & M_PKTHDR) { if ((n->m_pkthdr.csum_flags & CSUM_TLS_MASK) == CSUM_TLS_DECRYPTED) flags = M_DECRYPTED; else flags = 0; m_demote_pkthdr(n); } n->m_flags &= ~M_DEMOTEFLAGS; n->m_flags \|= flags \| M_NOTREADY;
sys/net/if_var.h
250 ↗	(On Diff #100361)	This might be more accurately called the TLS sequence number than TLS record number (and probably name the member tls_seqno or tls_seq_number).
sys/opencrypto/ktls_ocf.c
119 ↗	(On Diff #100361)	Does the Mellanox NIC support hardware NIC TLS for ChaCha20-Poly1305?
623 ↗	(On Diff #100361)	As I said earlier, this is way more complicated than you need. I would make each kthread pre-allocate a buffer sized to one TLS record and just memset it to 0 and use AES-CTR to encrypt the zeroed buffer, then xor the relevant portion of the mbuf chain with the resulting buffer. I would probably not even bother with OCF for this but just use the 'enc_xform_aes_ctr' directly to do this (especially once the multi-block hooks review I have lands). You might even be able to re-encrypt the mbufs in place in some cases without needing the temporary buffer. As it is you are doing all the GMAC computations just to throw them away.

• hselasky added inline comments.Jan 7 2022, 1:59 PM

sys/kern/uipc_ktls.c
2152 ↗	(On Diff #97872)	Hi John, The TLS record length is not encrypted. When you have a chain of TLS records up to the present time, you only need to confirm one past TLS record started at a given TCP sequence number. The all the subsequent ones should be fine. This works fine up to 4 GBytes of data. v hardware stores the TCP sequence number of a previous TLS record. [ TLS RECORD ..... ] [ TLS RECORD ..... ] [ TLS RECORD ..... ] [ TLS RECORD ..... * ] * The last received TCP sequence number is not of relevance The software keeps track of the past TLS record starts and asks the HW what TCP sequence number it wants to confirm. Then if the TCP sequence number the SW gets exist in its locally kept database, then it tells hardware back, that this TCP sequence number is valid, and if the HW still refers the same TCP sequence number, then HW decryption starts immediately at the next IP packet received. Was that clear?

• hselasky marked 6 inline comments as done.Jan 7 2022, 2:28 PM

• hselasky added inline comments.

sys/kern/uipc_ktls.c
2458 ↗	(On Diff #100361)	Switched to using define.
sys/kern/uipc_sockbuf.c
938 ↗	(On Diff #100361)	"M_DEMOTEFLAGS" is the flags you want to keep, not clear. Else your example is OK!
sys/opencrypto/ktls_ocf.c
119 ↗	(On Diff #100361)	No.
623 ↗	(On Diff #100361)	OK, I'll try to re-do this part.

Rebased patch.

Addressed some comments from John Baldwin.

Diff reduction.

• hselasky marked an inline comment as done.Jan 7 2022, 3:15 PM

• hselasky added inline comments.

sys/kern/uipc_ktls.c
1631 ↗	(On Diff #97872)	I'll check what options are possible.
sys/opencrypto/ktls_ocf.c
623 ↗	(On Diff #100361)	@jhb : From what I can see the current encrypt functions are tightly bound to single / EXT PG mbufs. Can you help decouple the current decrypt functions from EXT PG mbufs only?

Hi John,

Did you see my replies on the TLS RX offload review?

Regarding the current encrypt functions, the problem is that they only work on a special single mbuf type. Right now, we dequeue the mbufs from the socket buffer, and it saves data copying to allocate an identical mbuf chain, for encryption, that to shuffle all data into a single new mbuf.

Unencrypted chain consists of three mbufs:

  (ENC)           (DEC)          (ENC)
m0->m_next  -> m1->m_next -> m2->m_next  -> NULL

Encrypted chain consists of three mbufs aswell (with ENC and DEC swapped due to XOR, execpt for trailer):

  (DEC)           (ENC)          (DEC)
n0->m_next  -> n1->m_next -> n2->m_next  -> NULL


Reconstructing the encrypted chain for final decryption now can be done mbuf by mbuf:

  (ENC)           (ENC)          (ENC)
m0->m_next  -> n1->m_next -> m2->m_next  -> NULL

m1,n0,n2 are freed. Actually the final decryption step can be avoided by splitting the HASHTAG check and the decryption step!

I can change the re-crypt functions into using the encrypt function signature for regular mbuf chains, to avoid a new function prototype.

Or should we update the existing encrypt function to also support regular mbuf chains?

--HPS

Implement native single-pass recrypt function in the open crypto framework.

This avoids having to pass mixed state mbufs twice through the crypto framework.

Implement recrypt functions for the open crypto framework, OCF.

When processing mixed encrypted and decrypted data, it is required
to retrieve both the fully encrypted and the decrypted version of
the data. The fully encrypted version is used to verify the
authentication tag, and the decrypted version is the output.

Currently only TLS v1.2 and TLS v1.3 supported.

Fix minor compilation issue.

Fix one more compilation issue.

Rework the re-crypt support. The low level APIs in the crypto framework can apparently only do full encryption and full decryption :-( So use that for now.

In D32356#765375, @hselasky wrote:

Rework the re-crypt support. The low level APIs in the crypto framework can apparently only do full encryption and full decryption :-( So use that for now.

Both AES-GCM and ChaCha20-Poly1305 perform the MAC (hash) on the _encrypted_ data. You have to re-encrypt the data so you can compute the hash vs just decrypting the data. It looks like you are still doing AES-GCM instead of plain AES-CTR when trying to re-encrypt the data so have to copy the tag around, etc. Perhaps it will be simpler if you can push your branch somewhere and I will generate a patch against it to show you what I've been suggesting (use AES-CTR, not AES-GCM on a block of zeroes and then doing an explicit xor of that block against the mbuf chain.

Take @jhb 's suggestion to encrypt a zero'ed mbuf and then XOR.

Simplify patches in ktls_ocf.c for handling re-crypted packets.

Test OK with CX-6 DX and TLS v1.3 and v1.2 .

@jhb: I noticed in the AESNI crypto implementation that it might call malloc() when using the output buffer feature ... and this should be avoided when we already allocated a buffer.

@jhb : Ping.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 26 2022, 12:42 PM

Closed by commit rGc8f2c290e40d: Add definitions for TLS receive tags using the existing send tag infrastructure. (authored by • hselasky). · Explain Why

This revision was automatically updated to reflect the committed changes.

• hselasky added a commit: rGc8f2c290e40d: Add definitions for TLS receive tags using the existing send tag infrastructure..

• hselasky added a commit: rG9e2cce7e6a87: Implement a function to get the next TCP- and TLS- receive sequence number..

Rebase patch.

• hselasky reopened this revision.Jan 26 2022, 1:45 PM

Rebase.

melifaro added inline comments.Feb 10 2022, 8:50 AM

sys/kern/uipc_ktls.c
1009 ↗	(On Diff #96410)	ping? :-)

• hselasky added inline comments.Feb 10 2022, 10:23 AM

sys/kern/uipc_ktls.c
1009 ↗	(On Diff #96410)	You suggest putting the following piece into an EPOCH section? nh = inp->inp_route.ro_nh; if (nh == NULL) { INP_RUNLOCK(inp); return (ENXIO); } ifp = nh->nh_ifp; What about the NULL check? Is it safe to skip then?

@melifaro @jhb : ping

One other structural thing I see is that this still assumes the outbound route path matches the inbound path (using the route to allocate the tag and changing ktls_output_eagain to reset both sessions on a TX failure). But as Drew noted that doesn't work in his setup where the RX and TX can be over different ports in a lagg since the remove end of the lagg can use whatever algorithm it wants to distribute the RX traffic. Instead, we need to store the "leaf" ifp in a new field in m_pkthdr or the like and pass that up through into the socket buffer. At the point of m_demote when we remove the packet header you would want to check for ifp mismatches like we do for output in ip_output_send. Perhaps that can be done as a second round, but then we will just have to revert the ktls_output_eagain() change so I'd rather avoid changing that API just to have to change it back later.

sys/opencrypto/ktls_ocf.c
623 ↗	(On Diff #100361)	If you use enc_xform they are not. Also, I still think you should have each ktls worker thread just malloc() a single 16KB block during startup and pass a pointer down to the decrypt worker routine instead having to do malloc/frees. If you use enc_xform you can bypass OCF entirely and generally not have to touch anything at all in ktls_ocf.c. If you do use OCF you need to allocate a completely different session since it is doing AES-CTR, not AES-GCM. However, you could also just create a single session for this in uipc_ktls.c and use per-op keys instead of a session key. But the idea is that back in ktls_decrypt in the MIXED case you would simply do the encrypt zeros + XOR and then fall through to the "ENCRYPTED" case that calls the normal callback here. But in particular, you shouldn't be trying to reuse the existing routines here to encrypt, you should be writing completely new code to encrypt the block of zeroes. You want to just do the encryption (AES-CTR for AES-GCM) without the MAC. If it's a single span that is block aligned at the start (but doesn't have to be at the end) you can even construct the request (if you use OCF) to just encrypt (or decrypt, in AES-CTR it's all the same) that one span of the mbuf chain directly without even using the 16KB block. (This optimization is probably only worth doing if you know it's always a single span at the end, which I don't see how it would be otherwise unless the NIC decrypts the packet _after_ a drop that is still within the known TLS header length which would be awfully weird since you know in that case you have to undo the work anyway). In fact, I still don't understand how the case can ever be anything but that the you have decrypted the start of a TLS record and had to stop somewhere in the middle due to a drop, and the rest of the TLS record is then still encrypted (as presumably the NIC just stops decrypting everything else in the stream once a drop happens (TCP sequence mismatch) until you get back in sync again after fixing up a record). In that case you can use an AES-CTR request on the first part of the TLS record (it will always start at offset 0 in terms of the encryption) and a length just of the decrypted bits without the need for the block of zeroes.

melifaro added inline comments.Feb 24 2022, 7:49 AM

sys/kern/uipc_ktls.c
1009 ↗	(On Diff #96410)	You suggest putting the following piece into an EPOCH section? I'd say wider than that - all code block which uses `ifp`, which is entire function. What about the NULL check? Is it safe to skip then? IIRC it is possible that routing change happened, we invalidated the cache in the datapath, but didn't get the result, ending with NULL there. Normally the socket would die after that, but to avoid the implicit dependency on the datapath logic, I'd keep it (probably making with `__predict_false`).

In D32356#778142, @jhb wrote:

One other structural thing I see is that this still assumes the outbound route path matches the inbound path (using the route to allocate the tag and changing ktls_output_eagain to reset both sessions on a TX failure). But as Drew noted that doesn't work in his setup where the RX and TX can be over different ports in a lagg since the remove end of the lagg can use whatever algorithm it wants to distribute the RX traffic. Instead, we need to store the "leaf" ifp in a new field in m_pkthdr or the like and pass that up through into the socket buffer. At the point of m_demote when we remove the packet header you would want to check for ifp mismatches like we do for output in ip_output_send. Perhaps that can be done as a second round, but then we will just have to revert the ktls_output_eagain() change so I'd rather avoid changing that API just to have to change it back later.

Keeping track of the leaf ifnet for incoming packets will most likely require mbuf changes. (Discussed with @gallatin) The current patch works fine for VLAN and failover LAGG and regular IP traffic. Can I push the current approach AS-IS, then commit to making those additional changes for LAGG?

--HPS

I think pushing it and fixing the lagg issue after its in the tree is probably the best path forward.

Do you have this pushed to a public branch somewhere (e.g. on GitHub?) It might be easiest to show you what I am saying about how to handle the crypto for the mixed case if I can generate a patch relative to your branch.

@jhb : No. The current patch is for -current / main. Do you want me to create such a git repository, or can we use your existing freebsd fork / branch?

In D32356#785045, @hselasky wrote:

@jhb : No. The current patch is for -current / main. Do you want me to create such a git repository, or can we use your existing freebsd fork / branch?

Hmm, if you have your own gitlab or GitHub account it is probably easiest to just push your branch there. I don't know that we have permit per-user branches on git.freebsd.org itself currently. Generally the way GH works is that you can't push to my own fork, but if you fork FreeBSD into your GH space you can push your branch there.

Rebase patch after @jhb latest crypto additions.

glebius added inline comments.May 4 2022, 7:10 PM

sys/sys/ktls.h
206 ↗	(On Diff #105340)	Just a suggestion. Making the return value a enum will improve readability of debugging sessions and remove extraneous "default" clause from switch statement.

Implement crypto state as enum (as suggested by Gleb)
Remove an unused variable
Rebased patch.

• hselasky marked an inline comment as done.May 6 2022, 1:24 PM

• hselasky added inline comments.May 6 2022, 1:28 PM

sys/kern/uipc_ktls.c
1617 ↗	(On Diff #105739)	jhb@ : May I commit this change separately? It basically makes sure we don't see different values in the assert and the "mst = " read below.

jhb added inline comments.May 12 2022, 10:02 PM

sys/kern/uipc_ktls.c
1617 ↗	(On Diff #105739)	I think this is a nop since the caller holds the INP_WLOCK here? It's fine to merge as it is a bit more readable, but I don't think it is a functional change due to the lock.