Page MenuHomeFreeBSD

ktls: automatically disable inline hw (aka ifnet) ktls offload when TCP rexmits are high
ClosedPublic

Authored by gallatin on Jun 25 2021, 9:46 PM.

Details

Summary

Inline hardware TLS offload is efficient in most NIC implementations only when packets are transmitted in order. When packets are transmitted in order, the NIC can track encryption state between transmits. However, when packets are transmitted out of order, as they are during a TCP re-transmit, then the NIC needs to re-establish encryption state by re-DMA'ing the entire TLS record up to and including the segment that TCP is re-transmitting. The worse case is when a TCP segment falls at the end of a full-sized 16KB TLS record. In that case, we're DMA'ing over 10x as much data as we're sending in order to properly encrypt the re-transmitted bytes. If this happens too often, this can easily overwhelm the PCIe bus and starve the NIC for PCIe bandwidth.

This change adds a tunable & sysctl , kern.ipc.tls.ifnet_max_rexmit. This specifies the maximum percentage bytes that can be re-transmitted on a TCP connection before we automatically disable inline hw (aka ifnet) ktls offload for that connection and switch it to software ktls. This dramatically reduces output drops and increases bandwidth on Netflix servers using ifnet ktls offload with Mellanox CX6-DX NICs, and allows us to get much closer to the link bandwidth.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Moved the sorele() to after the call into tfb_hwtls_change(), as the sorele() may end up calling pru_detach() in sofree. This could release TCP state, causing an invalid tcpcb, leading to a panic when trying to call into the tcp function block.

sys/kern/uipc_ktls.c
124

IMHO the name could be more descriptive, like ktls_hw_ifnet_max_rexmit_pct. That might be a bit long, but I'd at least add _pct.

2225

Is it possible for the sock ref to be leaked here?

2245

Above we handle the case so == NULL but here we dereference so unconditionally.

2257

It might be worth adding an explanation of exactly why it's beneficial to switch.

2270

SOCK_LOCK will assert if the socket lock is already held.

Where are you draining the ktls_disable_ifnet_help() function?

sys/kern/uipc_ktls.c
2203

Extra space here.

2240

Empty line here can be removed.

If you are not draining, then probably should protect all code with the network EPOCH at least.

Where are you draining the ktls_disable_ifnet_help() function?

I'm not. No draining is done for the reset tag task either. This is part of the core kernel, and no effort has been made to drain things at shutdown.

Where are you draining the ktls_disable_ifnet_help() function?

I'm not. No draining is done for the reset tag task either. This is part of the core kernel, and no effort has been made to drain things at shutdown.

What prevents the CPU from working on a freed INP?

Incorporated review feedback

gallatin added inline comments.
sys/kern/uipc_ktls.c
2225

Good catch. I've re-structured things to fix that.

2270

Thanks; removed

If you are not draining, then probably should protect all code with the network EPOCH at least.

Done

Remove epoch until i'm certain its needed, it introduces a lot of complexity and at least ktls_set_tx_mode() does not require epoch (as it sleeps)

Hi Drew,

I don't understand how you keep the asynchronous callback in sync with the state of the INP. The INP could have been freed and allocated by another TCP TLS connection by the time it gets execution time!

At some point you need to drain this callback, like we do with callout timers!

--HPS

Answering my own question: Is the in_pcbref() supposed to prevent the structure for going away? Then it looks good.

sys/netinet/tcp_var.h
1160

Is signed with unsigned comparison OK here? Or should ktls_ifnet_max_rexmit_pct be made unsigned aswell?

sys/kern/uipc_ktls.c
124

Probably want to make this unsigned.

sys/netinet/tcp_var.h
1159

I think you should have a check for division by zero??

Made ktls_ifnet_max_rexmit_pct unsigned as requested by Hans

gallatin added inline comments.
sys/netinet/tcp_var.h
1159

I don't think thats needed. The denominator is tp->t_snd_rxt_bytes + tp->t_sndbytes; and this is only called when rxt bytes is updated, so I don't see how the sum could be zero.

Answering my own question: Is the in_pcbref() supposed to prevent the structure for going away? Then it looks good.

Yes, that's what the ref is for.

Looks ok to me, my comments are mostly style nits.

sys/kern/uipc_ktls.c
2205
2222

The pcb is only detached when the socket is about to be freed, but the socket reference taken earlier prevents that. So I think this should assert inp_socket != NULL instead. Otherwise it looks like the sock ref is leaked, but it isn't, unless I'm missing something.

2267
2291
2298
sys/netinet/tcp_var.h
44

opt_* includes should be sorted before the others.

1157

Extra newline.

  • Address Mark's review feedback
  • Re-check inp flags before calling into the tcp function block, as ktls_set_tx_mode() drops the inp_wlock, which can result in the connection being closed. I had a panic on a test machine due to not checking the flags last night.
gallatin added inline comments.
sys/kern/uipc_ktls.c
2222

Yes, I see what you mean. I've replaced the check with an assert. This also simplifies the error handling, as a separate label is no longer needed.

sys/kern/uipc_ktls.c
124

Missing "static" here?

gallatin added inline comments.
sys/kern/uipc_ktls.c
124

No, this need to be global. tcp_account_for_send() accesses it. This is why it is in sys/ktls.h

This revision is now accepted and ready to land.Jul 6 2021, 1:04 PM
This revision was automatically updated to reflect the committed changes.