Details

Reviewers

jhb
markj
rrs
glebius
hselasky

Group Reviewers

transport

Commits

rGc0e4090e3d43: ktls: Accurately track if ifnet ktls is enabled

Summary

When we implemented NIC kTLS state, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate NIC kTLS. This flag meant that now, or in the past, NIC kTLS was active on a socket. Later, I added code to switch TLS sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET, this meant that even long after a TLS session was switched to software, we'd be doing math to calculate the rxmt rate in tcp_account_for_send(), and making potentially spurious calls into ktls_disable_ifnet().

This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcpcb, and because TCP is the most frequent accessor of this state, it made sense to move this from the socket buffer to the tcpcb. Because we now need reliable access to the tcpcb, we take a ref on the inp when creating a tx ktls session. This actually seems to simplify a few things.

While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends.

This patch reduces spurious calls to ktls_disable_ifnet() by 95% or so in our environment.

Note that the new flags in the tcpcb fit into an alignment hole, and do not increase the size of the struct.
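To illustrate the summary above, here is a minimal userland sketch of gating the retransmit accounting on a per-connection flag. It is not the committed kernel code: `struct conn_model`, `account_for_send()`, `request_sw_fallback()`, and both thresholds are hypothetical stand-ins for the tcpcb flag, tcp_account_for_send(), and ktls_disable_ifnet().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-connection state kept in the tcpcb. */
struct conn_model {
	bool nic_ktls_xmit;	/* true only while NIC kTLS is really active */
	uint64_t sent_bytes;
	uint64_t rexmit_bytes;
	unsigned disable_calls;	/* counts calls to the fallback request */
};

/* Assumed thresholds, chosen only for illustration. */
#define	MIN_SAMPLE_BYTES	1500
#define	REXMIT_PCT_LIMIT	5

/* Stand-in for asking the TLS layer to switch the session to software. */
static void
request_sw_fallback(struct conn_model *c)
{
	c->disable_calls++;
}

/*
 * Per-send accounting. When the flag is clear, no rate math is done
 * and no fallback request can be issued.
 */
static void
account_for_send(struct conn_model *c, uint32_t len, bool is_rexmit)
{
	if (!c->nic_ktls_xmit)
		return;
	c->sent_bytes += len;
	if (is_rexmit)
		c->rexmit_bytes += len;
	if (c->sent_bytes >= MIN_SAMPLE_BYTES &&
	    c->rexmit_bytes * 100 > c->sent_bytes * REXMIT_PCT_LIMIT)
		request_sw_fallback(c);
}
```

With a "set, never clear" flag, the early return would never be taken after the switch to software, which is the source of the spurious calls the summary describes.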

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

gallatin created this revision. Feb 4 2023, 1:53 AM

Herald added a reviewer: transport. Feb 4 2023, 1:53 AM

Herald added subscribers: melifaro, imp.

gallatin requested review of this revision. Feb 4 2023, 1:53 AM

gallatin edited the summary of this revision.

gallatin added a reviewer: hselasky.

I'll read through this on Monday!

--HPS

rrs accepted this revision. Feb 6 2023, 3:40 PM

This revision is now accepted and ready to land. Feb 6 2023, 3:40 PM

markj added inline comments. Feb 6 2023, 4:49 PM

sys/kern/uipc_ktls.c
1929	It looks theoretically possible to come here with the inpcb read lock held. For instance, ip6_output() -> ip6_output_send() -> ktls_free(). ip6_output() may be called with the read lock held. So unless there's some invariant which guarantees that the current thread doesn't hold the read lock when releasing the last ktls ref, it's possible to deadlock here. Below, I think you can release the inpcb ref with just a read lock, so that's easy to solve, but I'm not sure what to do here.

gallatin added inline comments. Feb 6 2023, 8:35 PM

sys/kern/uipc_ktls.c
1929	Hmm, I'd asked around about this, and was told that the rlock should not be held across ip_output(). Do you know who might hold it, or is this theoretical?

hselasky added inline comments. Feb 6 2023, 8:38 PM

sys/kern/uipc_ktls.c
1929	It might have to do with interaction with EPOCH.

markj added inline comments. Feb 6 2023, 8:58 PM

sys/kern/uipc_ktls.c
1929	Maybe it's possible via:

1. tcp_input_with_port() receives a SYN and looks up a matching PCB with the read lock held
2. something causes it to go to the dropwithreset label
3. tcp_dropwithreset() calls tcp_respond(), which calls ip_output()

I don't claim that this can actually happen, and this particular case probably doesn't make sense? But it's hard to tell, and ip_output() might not be the only place where we drop a ktls reference with an inpcb read lock held.

Run ktls_destroy() if we are called by a thread holding an rlock. There is no way to know if the rlock held by the thread is the inpcb lock, but just assume it is for safety.

My tests (and the counter I added) indicate that this never happens in our (Netflix) workload.

This revision now requires review to proceed. Feb 7 2023, 4:25 PM

markj added inline comments. Feb 7 2023, 4:59 PM

sys/kern/uipc_ktls.c
1840	Alternately, you could try to grab the inp write lock here (at least in the `tls->tx` case), and fall back to dispatching a task if that doesn't succeed. That's more future-proof:

1. if ktls_free() starts getting called with some other read lock held, we'll otherwise be dispatching tasks for no reason
2. if someone changes the inp lock type, checking `td_rw_rlocks` will silently stop working

I think it's harmless to hold the inp write lock for the duration of ktls_destroy()?

gallatin marked 3 inline comments as done. Feb 7 2023, 8:57 PM

gallatin added inline comments.

sys/kern/uipc_ktls.c
1840	Thanks, I'll try that. It was not immediately clear to me from the man page that rw_try_wlock() would fail if the current thread held the rlock. But if you say that's how it behaves, I'll take your word :)

Update to hold the wlock in ktls_destroy for transmit ktls, as suggested by Mark.

markj accepted this revision. Feb 8 2023, 2:01 PM

This revision is now accepted and ready to land. Feb 8 2023, 2:01 PM

Updated patch to restore and document the td_rw_rlocks hack to detect if we might hold an rlock on the inp. Without this, we end up scheduling a taskqueue in about 18% of cases due to other threads holding locks.

Discussed with markj on Slack, and he reviewed it there.

This revision now requires review to proceed. Feb 9 2023, 5:52 PM

I'm trying to remember the edge case for why the flag wasn't cleared when moving to a software session. I'm not sure this doesn't open whatever race that was back up. I think the problem might have been that when you switch to SW TLS you might still have existing mbufs in the socket buffer that were framed with the NIC TLS session. We don't go back and try to do software encryption of those mbufs, and even though you've changed the TLS session for "new" requests to send data in the future, those previously existing mbufs still have the old snd_tag and need NIC TLS behavior in TCP. I think you've now broken that case again. In theory the right answer is to check the mbufs you are planning to send to see if they have a send tag and do the split in tcp_m based on that, but that means walking the mbuf chain I think all the time, and the global flag in the sockbuf seemed less expensive.

In D38380#875352, @jhb wrote:

I'm trying to remember the edge case for why the flag wasn't cleared when moving to a software session. I'm not sure this doesn't open whatever race that was back up. I think the problem might have been that when you switch to SW TLS you might still have existing mbufs in the socket buffer that were framed with the NIC TLS session. We don't go back and try to do software encryption of those mbufs, and even though you've changed the TLS session for "new" requests to send data in the future, those previously existing mbufs still have the old snd_tag and need NIC TLS behavior in TCP. I think you've now broken that case again. In theory the right answer is to check the mbufs you are planning to send to see if they have a send tag and do the split in tcp_m based on that, but that means walking the mbuf chain I think all the time, and the global flag in the sockbuf seemed less expensive.

When you switch to SW TLS, the existing mbufs that were framed with NIC TLS still reference the NIC TLS session. The t_nic_ktls_xmit flag remains set on the tcpcb until the NIC TLS session is released, meaning that TCP has moved past all those mbufs.
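The lifetime described in this reply can be modeled with a small reference-counting sketch. It assumes, for illustration only, that the socket and every NIC-framed mbuf each hold one reference on the NIC session; `struct xmit_conn`, `struct nic_session_model`, and the three helpers are hypothetical names, not the kernel's.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the tcpcb flag t_nic_ktls_xmit. */
struct xmit_conn {
	bool nic_ktls_xmit;
};

/* Hypothetical NIC TLS session with a plain reference count. */
struct nic_session_model {
	struct xmit_conn *conn;
	unsigned refs;
};

/* Creating the session sets the flag; the socket holds the first ref. */
static void
nic_session_init(struct nic_session_model *s, struct xmit_conn *c)
{
	s->conn = c;
	s->refs = 1;
	c->nic_ktls_xmit = true;
}

/* Each mbuf framed for the NIC takes a reference. */
static void
nic_session_hold(struct nic_session_model *s)
{
	s->refs++;
}

/* The flag clears only when the last reference goes away. */
static void
nic_session_rele(struct nic_session_model *s)
{
	if (--s->refs == 0)
		s->conn->nic_ktls_xmit = false;
}
```

In this model, switching the socket to software TLS drops only the socket's reference, so the flag stays set until the queued NIC-framed mbufs have been freed.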

jhb added inline comments. Feb 9 2023, 6:22 PM

sys/kern/uipc_ktls.c
1924	Drew pointed me to this on Slack and this does fix the edge case I described previously. Also, setting/clearing this state with the INP lock held is what makes it safe for TCP output to check. The previous flag was a "set never clear" to avoid races with TCP output as well but the INP lock closes those.
sys/netinet/tcp_var.h
212	If we ever support rekeying in the future this may have to expand to a refcount rather than a simple bool as you might have two active NIC TLS sessions on a connection (one for the old key and one for the new key), but that can be addressed then. For now we don't currently support rekeying.
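As a rough sketch of the refcount idea raised for rekeying, the bool could become a counter of active NIC TLS sessions. Nothing like this is in the patch; `struct tcpcb_model`, its field, and the helpers below are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical model: a count of NIC TLS sessions instead of a bool. */
struct tcpcb_model {
	unsigned nic_ktls_sessions;
};

/* A new NIC TLS session (for example, one using a new key) is attached. */
static void
nic_session_attach(struct tcpcb_model *tp)
{
	tp->nic_ktls_sessions++;
}

/* A NIC TLS session is released once TCP is done with its mbufs. */
static void
nic_session_release(struct tcpcb_model *tp)
{
	if (tp->nic_ktls_sessions > 0)
		tp->nic_ktls_sessions--;
}

/* The check the output path would make in place of reading the bool. */
static bool
nic_ktls_active(const struct tcpcb_model *tp)
{
	return (tp->nic_ktls_sessions != 0);
}
```

With two sessions attached during a rekey, releasing the old-key session leaves the connection marked as NIC kTLS until the new-key session is also released.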

In D38380#875354, @gallatin wrote:

When you switch to SW TLS, the existing mbufs that were framed with NIC TLS still reference the NIC TLS session. The t_nic_ktls_xmit flag remains set on the tcpcb until the NIC TLS session is released, meaning that TCP has moved past all those mbufs.

As you pointed out in slack: Once we support re-keying, this will need to change to a reference count, since there could be multiple nic ktls sessions active in that case.

jhb accepted this revision. Feb 9 2023, 6:26 PM

jhb added inline comments.

sys/kern/uipc_ktls.c
1481
1852–1854
3252	FWIW, normal style in the kernel is to have a blank line before comments like this
sys/sys/ktls.h
204	Just to make it clear this state isn't for TLS 1.0 only. You could even maybe move this up below the `bool sequential_records` so that TLS 1.0 state stays at the end.

This revision is now accepted and ready to land. Feb 9 2023, 6:26 PM

gallatin added inline comments. Feb 9 2023, 6:57 PM

sys/sys/ktls.h
204	Ah, the intent was to keep infrequently used things out of the main cachelines for this struct. I didn't mean to imply that it was related to tls 1.0

Closed by commit rGc0e4090e3d43: ktls: Accurately track if ifnet ktls is enabled (authored by gallatin). Apr 9 2023, 9:14 PM

This revision was automatically updated to reflect the committed changes.

gallatin added a commit: rGc0e4090e3d43: ktls: Accurately track if ifnet ktls is enabled.

ktls: Avoid spurious calls to ktls_disable_ifnet()

Diff 120037

sys/kern/uipc_ktls.c

sys/netinet/tcp_output.c

sys/netinet/tcp_ratelimit.c

sys/netinet/tcp_stacks/bbr.c

sys/netinet/tcp_stacks/rack.c

sys/netinet/tcp_var.h

sys/sys/ktls.h

sys/sys/sockbuf.h