Page MenuHomeFreeBSD

ktls: Avoid spurious calls to ktls_disable_ifnet()
ClosedPublic

Authored by gallatin on Feb 4 2023, 1:53 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 5:57 PM
Unknown Object (File)
Sat, Jan 4, 3:13 AM
Unknown Object (File)
Dec 5 2024, 7:26 AM
Unknown Object (File)
Nov 23 2024, 2:18 AM
Unknown Object (File)
Nov 18 2024, 7:02 PM
Unknown Object (File)
Nov 17 2024, 7:03 PM
Unknown Object (File)
Nov 17 2024, 6:37 PM
Unknown Object (File)
Nov 17 2024, 6:37 PM
Subscribers

Details

Summary

When we implemented NIC kTLS state, we set a flag in the tx socket buffer (SB_TLS_IFNET) to indicate NIC kTLS. This flag meant that now, or in the past, NIC kTLS was active on a socket. Later, I added code to switch TLS sessions to software in the case of lossy TCP connections that have a high retransmit rate. Because TCP was using SB_TLS_IFNET, this meant that even long after a TLS session was switched to software, we'd be doing math to calculate the rxmt rate in tcp_account_for_send(), and making potentially spurious calls into ktls_disable_ifnet().

This patch carefully tracks whether or not ifnet ktls is still enabled on a TCP connection. Because the inp is now embedded in the tcbcb, and because tcp is the most frequent accessor of this state, it made sense to move this from the socket buffer to the tcpcb. Because we now need reliable access to the tcbcb, we take a ref on the inp when creating a tx ktls session.. this actually seems to simplify a few things.

While here, I noticed that rack/bbr were incorrectly implementing tfb_hwtls_change(), and applying the change to all pending sends, when it should apply only to future sends.

This patch reduces spurious calls to ktls_disable_ifnet() by 95% or so in our environment.

Note that the new flags in the tcpcb fit into an alignment hole, and do not increase the size of the struct.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

gallatin edited the summary of this revision. (Show Details)
gallatin added a reviewer: hselasky.

I'll read through this on Monday!

--HPS

This revision is now accepted and ready to land.Feb 6 2023, 3:40 PM
sys/kern/uipc_ktls.c
1929

It looks theoretically possible to come here with the inpcb read lock held. For instance, ip6_output() -> ip6_output_send() -> ktls_free(). ip6_output() may be called with the read lock held. So unless there's some invariant which guarantees that the current thread doesn't hold the read lock when releasing the last ktls ref, it's possible to deadlock here.

Below, I think you can release the inpcb ref with just a read lock, so that's easy to solve, but I'm not sure what to do here.

sys/kern/uipc_ktls.c
1929

Hmm.. I'd asked around about this, and was told that the rlock should not be held across ip output(). Do you know who might hold it, or is this theoretical?

sys/kern/uipc_ktls.c
1929

It might have to do with interaction with EPOCH .

sys/kern/uipc_ktls.c
1929

Maybe it's possible via:

  • tcp_input_with_port() receives a SYN and looks up a matching PCB with the read lock held
  • something causes it to go to the dropwithreset label
  • tcp_dropwithreset() calls tcp_respond() calls ip_output()

I don't claim that this can actually happen, and this particular case probably doesn't make sense? But it's hard to tell, and ip_output() might not be the only place where we drop a ktls reference with a inpcb read lock held.

Run ktls_destroy() if we are called by a thread holding an rlock. There is no way to know if the rlock held by the thread is the inpcb lock, but just assume it is for safety.

My tests (and the counter I added) indicates that this never happens in our (Netflix) workload.

This revision now requires review to proceed.Feb 7 2023, 4:25 PM
sys/kern/uipc_ktls.c
1840

Alternately, you could try to grab the inp write lock here (at least in the tls->tx case), and fall back to dispatching a task if that doesn't succeed. That's more future-proof:

  • if ktls_free() starts getting called with some other read lock held, we'll otherwise be dispatching tasks for no reason,
  • if someone changes the inp lock type, checking td_rw_rlocks will silently stop working.

I think it's harmless to hold the inp write lock for the duration of ktls_destroy()?

gallatin added inline comments.
sys/kern/uipc_ktls.c
1840

Thanks, I'll try that.

It was not immediately clear to me from the man page that rw_try_wlock() would fail if the current thread held the rlock. But if you say that's how it behaves, I'll take your word :)

Update to hold the wlock in ktls_destroy for transmit ktls, as suggested by Mark.

This revision is now accepted and ready to land.Feb 8 2023, 2:01 PM

Updated patch to restore and document the td_rw_rlocks hack to detect if we might hold an rlock on the inp. Without this, we end up scheduling a taskqueue in about 18% of cases due to other threads holding locks.

Discussed with Markj on slack, and he reviewed it there..

This revision now requires review to proceed.Feb 9 2023, 5:52 PM

I'm trying to remember the edge case for why the flag wasn't cleared when moving to a software session. I'm not sure this doesn't open whatever race that was back up. I think the problem might have been that when you switch to SW TLS you might still have existing mbufs in the socket buffer that were framed with the NIC TLS session. We don't go back and try to do software encryption of those mbufs, and even though you've changed the TLS session for "new" requests to send data in the future, those previously existing mbufs still have the old snd_tag and need NIC TLS behavior in TCP. I think you've now broken that case again. In theory the right answer is to check the mbufs you are planning to send to see if they have a send tag and do the split in tcp_m based on that, but that means walking the mbuf chain I think all the time, and the global flag in the sockbuf seemed less expensive.

In D38380#875352, @jhb wrote:

I'm trying to remember the edge case for why the flag wasn't cleared when moving to a software session. I'm not sure this doesn't open whatever race that was back up. I think the problem might have been that when you switch to SW TLS you might still have existing mbufs in the socket buffer that were framed with the NIC TLS session. We don't go back and try to do software encryption of those mbufs, and even though you've changed the TLS session for "new" requests to send data in the future, those previously existing mbufs still have the old snd_tag and need NIC TLS behavior in TCP. I think you've now broken that case again. In theory the right answer is to check the mbufs you are planning to send to see if they have a send tag and do the split in tcp_m based on that, but that means walking the mbuf chain I think all the time, and the global flag in the sockbuf seemed less expensive.

When you switch to SW TLS, the existing mbufs that were framed with NIC TLS still reference the NIC TLS session. The t_nic_ktls_xmit flag remains set on the tcbcb until the NIC TLS session is released, meaning that TCP has moved past all those mbufs,

sys/kern/uipc_ktls.c
1924

Drew pointed me to this on Slack and this does fix the edge case I described previously. Also, setting/clearing this state with the INP lock held is what makes it safe for TCP output to check. The previous flag was a "set never clear" to avoid races with TCP output as well but the INP lock closes those.

sys/netinet/tcp_var.h
212

If we ever support rekeying in the future this may have to expand to a refcount rather than a simple bool as you might have two active NIC TLS sessions on a connection (one for the old key and one for the new key), but that can be addressed then. For now we don't currently support rekeying.

In D38380#875352, @jhb wrote:

I'm trying to remember the edge case for why the flag wasn't cleared when moving to a software session. I'm not sure this doesn't open whatever race that was back up. I think the problem might have been that when you switch to SW TLS you might still have existing mbufs in the socket buffer that were framed with the NIC TLS session. We don't go back and try to do software encryption of those mbufs, and even though you've changed the TLS session for "new" requests to send data in the future, those previously existing mbufs still have the old snd_tag and need NIC TLS behavior in TCP. I think you've now broken that case again. In theory the right answer is to check the mbufs you are planning to send to see if they have a send tag and do the split in tcp_m based on that, but that means walking the mbuf chain I think all the time, and the global flag in the sockbuf seemed less expensive.

When you switch to SW TLS, the existing mbufs that were framed with NIC TLS still reference the NIC TLS session. The t_nic_ktls_xmit flag remains set on the tcbcb until the NIC TLS session is released, meaning that TCP has moved past all those mbufs,

As you pointed out in slack: Once we support re-keying, this will need to change to a reference count, since there could be multiple nic ktls sessions active in that case.

jhb added inline comments.
sys/kern/uipc_ktls.c
1481
1852–1854
3252

FWIW, normal style in the kernel is to have a blank line before comments like this

sys/sys/ktls.h
204

Just to make it clear this state isn't for TLS 1.0 only. You could even maybe move this up below the bool sequential_records so that TLS 1.0 state stays at the end.

This revision is now accepted and ready to land.Feb 9 2023, 6:26 PM
sys/sys/ktls.h
204

Ah, the intent was to keep infrequently used things out of the main cachelines for this struct. I didn't mean to imply that it was related to tls 1.0