tcp: bypass TSO when CWR bit is to be sent
Needs Review, Public

Authored by rscheff on Dec 22 2023, 11:42 AM.

Details

Summary

Support for RFC3168 ECN by various hardware and drivers is
a very mixed bag.

Some hardware TSO will properly mask out the CWR bit (for 3168 ECN)
on all but the first segment.

Some hardware may not mask the CWR bit at all, while other
documents indicate it may remain set on the initial and middle
packets, but not the last.

Further, some drivers expect the hardware to indicate if "proper"
ECN support exists for TSO and will discard the transmission
entirely, when encountering the CWR bit but no hardware TSO+ECN
support.

To add to the complexity, the upcoming AccECN change does NOT
require any specific support to mask out certain header flags
between the first, middle or last packet in a TSO chain. It thus works
with currently broken TSO (where the flags are simply copied over),
but does not work with the few hardware implementations currently
working properly (masking the CWR flag in all but the first packet).

In order to deal with all these different behaviours in a sensible
manner, bypassing TSO entirely when CWR is encountered appears
to be the only viable option for now.
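
A minimal sketch of the bypass idea, not the actual change to tcp_output.c: the function name `plan_transmit` and the (flags, length) model are hypothetical, and it assumes the RFC3168 case where CWR is only needed on one segment. A send with CWR set is split into one MSS-sized segment carrying CWR and a remainder that may still be handed to TSO.

```python
# TCP header flag bits (standard values).
TH_FIN, TH_PSH, TH_ACK, TH_CWR = 0x01, 0x08, 0x10, 0x80

def plan_transmit(flags, length, mss):
    """Return the (flags, length) units handed to the driver.

    Hypothetical model: a unit longer than mss is a TSO request,
    a unit of at most mss bytes is a plain single segment.
    """
    if length <= mss or not (flags & TH_CWR):
        # Nothing to split: either a single segment or no CWR pending.
        return [(flags, length)]
    # Send the CWR-flagged segment on its own; FIN/PSH belong to the end.
    first = flags & ~(TH_FIN | TH_PSH)
    # The remainder no longer carries CWR and may use TSO again.
    rest = flags & ~TH_CWR
    return [(first, mss), (rest, length - mss)]

print(plan_transmit(TH_ACK | TH_CWR | TH_PSH, 4000, 1000))
```

With these assumptions, only the first MSS takes the slow path; everything after it is an ordinary TSO request without CWR.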

MFC after: 2 weeks

Test Plan

On TSO hardware, validate that after receiving an ECE from the
receiver (e.g. setting the IP ECN CE codepoint on the forward
direction, for example by using dummynet with an ecn-enabled
queue), only a single CWR-flagged packet is received.

In virtualized environments, make sure the receiver does NOT
support LRO / TSO, to observe individual packets no larger
than the MTU.
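
The check described in this test plan could be automated along these lines. This is only a sketch: `check_capture` is a hypothetical helper, and it assumes the receiver-side capture has already been reduced to (tcp_flags, ip_length) tuples for the packets following one ECE.

```python
TH_CWR = 0x80  # CWR bit in the TCP flags

def check_capture(packets, mtu):
    """True if exactly one packet carries CWR and none exceeds the MTU.

    packets: iterable of (tcp_flags, ip_length) tuples (assumed input format).
    """
    packets = list(packets)
    cwr_count = sum(1 for flags, _ in packets if flags & TH_CWR)
    oversized = [length for _, length in packets if length > mtu]
    return cwr_count == 1 and not oversized

# One CWR-flagged packet followed by plain ACK / PSH|ACK packets.
print(check_capture([(0x90, 1500), (0x10, 1500), (0x18, 700)], 1500))
```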

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 55110
Build 51999: arc lint + arc unit

Event Timeline

Let me do some testing first. What is in specifications is sometimes not in hardware...

I expect the bitmasks (on Intel chipsets) to be fixed for RFC3168 by now. But for AccECN this doesn't help - there, the CWR flag needs to be sent on every packet, always.

IMHO there are these possibilities:

  • proper RFC3168 ECN support with TSO -> breaks AccECN with TSO (reasonably likely)
  • no ECN support with TSO ->
    a) silent discard of TSO, extremely poor performance because of continuous RTOs (one example)
    b) all packets have CWR set, support for AccECN, but not RFC3168 ECN (most likely)
  • incorrect RFC3168 ECN support -> a random subset of packets in a TSO chain will have the CWR flag set correctly or incorrectly (unknown)
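
To make the possibilities above concrete, here is a toy model of what the receiver would see per segment. The behaviour names ("rfc3168", "copy", "discard") and the function names are hypothetical labels for this sketch; FIN/PSH handling and the "incorrect support" case are left out.

```python
TH_ACK, TH_CWR = 0x10, 0x80

def tso_segment_flags(flags, nsegs, behaviour):
    """Per-segment TCP flags for one TSO request under a given hardware behaviour."""
    if behaviour == "discard":
        # Driver drops the whole request when CWR is set without TSO+ECN support.
        return [] if flags & TH_CWR else [flags] * nsegs
    if behaviour == "rfc3168":
        # CWR kept on the first segment only.
        return [flags if i == 0 else flags & ~TH_CWR for i in range(nsegs)]
    if behaviour == "copy":
        # Flags copied unmodified to every segment.
        return [flags] * nsegs
    raise ValueError(behaviour)

def cwr_segments(segs):
    """Number of segments that still carry CWR."""
    return sum(1 for f in segs if f & TH_CWR)

for b in ("rfc3168", "copy", "discard"):
    segs = tso_segment_flags(TH_ACK | TH_CWR, 4, b)
    print(b, len(segs), cwr_segments(segs))
```

In this model, "rfc3168" yields one CWR segment (fine for classic ECN, wrong for AccECN), "copy" yields CWR on all segments (fine for AccECN, wrong for classic ECN), and "discard" sends nothing at all.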

I would like to test what the Intel NICs actually do and if we actually can change the masks. Depending on that, we might want to make the behaviour of the TCP stack switchable.

gallatin requested changes to this revision. Dec 26 2023, 1:17 AM

I'd prefer you add a feature flag so that NICs which do properly support CWR are able to use TSO, and avoid being pessimized by this.

This revision now requires changes to proceed. Dec 26 2023, 1:17 AM

Properly handling CWR is part of the NDIS spec... though the spec is broken, and says that "If the CWR bit in the TCP header of the large TCP packet is set, the miniport driver must set this bit in the TCP header of the first packet that it creates from the large TCP packet. The miniport driver may choose to set this bit in the TCP header of the last packet that it creates from the large TCP packet, although this is less desirable." [https://learn.microsoft.com/en-us/windows-hardware/drivers/network/offloading-the-segmentation-of-large-tcp-packets]

So, aside from the obnoxious potential for setting CWR on the last packet in a TSO, it seems like all hardware that passes the MS NDIS cert. tests should be capable of working properly. I recall that when I worked for a NIC vendor (Myricom), we had unit tests for this in the hardware/driver regression tests we used on NIC designs. They were added specifically because we once failed the NDIS tests under a Verilog simulator for this reason.

At any rate, it seems like any NIC that passes NDIS cert tests should be able to handle CWR (as long as it does not set CWR on the last segment). So I'd really prefer we have a feature flag for CWR support, rather than forcing all CWR sends to the slow path.

I agree with gallatin@: we should be able to configure whether we prefer classical ECN or AccECN and, if possible, avoid using the slow path. It seems that you can configure the behaviour for Intel cards. That is what I'm currently checking...

Properly handling CWR is part of the NDIS spec...

That may be; the virtio and vmxnet3 drivers then don't meet that NDIS test: one discards TSO mbufs when the CWR bit is set and the host doesn't signal CWR support; the other doesn't clear the CWR bit at all (like legacy hardware without ECN capabilities).

At any rate, it seems like any NIC that passes NDIS cert tests should be able to handle CWR (as long as it does not set CWR on the last segment). So I'd really prefer we have a feature flag for CWR support, rather than forcing all CWR sends to the slow path.

The point is, with PRR (proportional rate reduction) you ARE already using the slow path, as only one additional packet will be eligible for sending for every two packets ACKed. Sure, with ACK compression / ACK thinning, one incoming ACK may still acknowledge more than 2 packets, and TSO may have slightly more data to send.

But the other problem is not solved: only RFC3168 (which people are moving away from) requires the CWR bit to be set once, on either (ideally) the first or the last packet of a TSO chain. With DCTCP or TCP Prague, using Accurate ECN, entire TSO chains should retain the CWR bit as set in the initial header. Thus RFC3168 capabilities actively interfere, and they are ideally disabled by the hardware NIC driver. TSO is much more critical in the AccECN case than in the RFC3168 case, where, due to the congestion window reduction, you are already doing some slow-path calculations, or dealing out packets one by one when performing PRR (which nowadays both RACK and the base stack do).

Ideally, the hardware would be able to signal whether it supports RFC3168 CWR flag clearing or keeps the ECN flags unmodified, per TCP session. However, that is very unlikely. More likely, the hardware / driver signals whether it mucks around with the CWR bit (NDIS compatibility). If so, the stack can choose whether it relies on this (when the session is in RFC3168 ECN mode) or deals with a set CWR individually (AccECN mode). Or, when the hardware keeps the CWR bit set (like vmxnet3 does), the stack can perform AccECN sessions entirely with TSO, while splitting up RFC3168 sessions.

In short - this capability to bypass TSO depending on the hardware capabilities and TCP mode (3168 / AccECN) will still be required, since I don't think it is realistic that hardware will allow the selection between these modes on a per-session basis (I'm not aware of any chunk of mbuf metadata where such signalling could be passed down to the TSO engine of the NIC).

And yes, ideally such an additional capability signalling - bidirectionally - by the NIC / driver could be implemented (similar to what virtio tries to do, but without proper handling ultimately by the TCP stack when failing, see D43167). That's why I added everyone I could think of who could help in getting this problem (two dramatically different ways of having to deal with CWR soon) addressed.

sys/netinet/tcp_output.c, line 920

Note: This is where a decision about slow path or fast path needs to be taken. If the session is in RFC3168 ECN mode, and the hardware is properly (TM) supporting RFC3168 CWR, TSO may remain active.

If the session is in AccECN mode, and the hardware does NOT support RFC3168 CWR (or the driver has reprogrammed the mask fields to not clear the CWR bits after the first segment), then TSO could remain active.

In the inverse cases (legacy hardware or driver initializing mask bits to not clear CWR), the first segment of RFC3168 would need to follow the slow path, and subsequent segments can then be run fast.

Or current NDIS CWR compliant drivers would have to send entire lengths (possibly many MB, unlike in the 3168 case) via the slow path when AccECN is active (DCTCP, TCP Prague).
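
The decision described in this inline comment can be written down as a small truth table. The enum and function names below are hypothetical and do not exist in the FreeBSD stack; the sketch assumes the stack somehow knows how the NIC treats CWR during TSO.

```python
from enum import Enum

class EcnMode(Enum):
    NONE = 0
    RFC3168 = 1
    ACCECN = 2

class TsoCwr(Enum):
    CLEARS_AFTER_FIRST = 0  # "proper" RFC3168 / NDIS-style masking
    COPIES = 1              # flags copied unmodified to all segments
    UNKNOWN = 2             # no information, be conservative

def can_use_tso(mode, hw, cwr_set):
    """Decide whether a send with the given header may stay on the TSO fast path."""
    if not cwr_set:
        return True                          # CWR not involved, no conflict
    if mode is EcnMode.RFC3168:
        return hw is TsoCwr.CLEARS_AFTER_FIRST
    if mode is EcnMode.ACCECN:
        return hw is TsoCwr.COPIES
    return False                             # CWR without a known mode: slow path

print(can_use_tso(EcnMode.ACCECN, TsoCwr.CLEARS_AFTER_FIRST, True))
```

Every combination that returns False here corresponds to one of the slow-path cases listed above.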

Let us discuss hardware drivers and virtual drivers separately. For hardware drivers we need to have a look at the hardware capabilities and APIs they are providing. I hope that it is possible to tune them. For virtual network drivers we also need to have a look at the host OS part.

This is the point. We want to have the PSH and FIN flag only on the last segment in any case. For the CWR flag we want different behaviors:

  • For classical ECN, we want the CWR flag only on the first segment.
  • For AccECN we want the CWR flag untouched

What I think we could do on systems whose NICs allow the behavior to be configured:

  • Allow the TCP stack to be configured to negotiate classical ECN and not AccECN and allow the NIC to be configured to perform the appropriate action.
  • Allow the TCP stack to be configured to negotiate AccECN and not classical ECN and allow the NIC to be configured to perform the appropriate action.

In both cases we should be on the fast path.

Just some data points for the masks (default values as measured; except for a typo, these correspond to the values given in the datasheets):

Chip    TCP_flg_first_seg   TCP_Flg_mid_seg   TCP_Flg_lst_seg
i211    0xFF6               0xF76             0xF7F
82576   0xFF6               0xF76             0xF7F
82599   0xFF6               0xFF6             0xF7F

So all mask the FIN and PSH flags for all segments but the last one. All mask the CWR flag on the last segment and keep it on the first segment. The i211 and 82576 mask the CWR flag on middle segments, but the 82599 seems to keep it.
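
The reading of the masks above can be reproduced with a few lines. The sketch assumes that the 12 mask bits follow the TCP header flag layout (FIN = bit 0 up to AE = bit 8) and that a cleared mask bit removes the flag from that segment; the helper name `cleared_flags` is made up for this example.

```python
# TCP header flag bits, lowest bit first.
FLAG_BITS = {
    "FIN": 0x001, "SYN": 0x002, "RST": 0x004, "PSH": 0x008,
    "ACK": 0x010, "URG": 0x020, "ECE": 0x040, "CWR": 0x080, "AE": 0x100,
}

# Default (first, middle, last) segment masks as reported in the table.
MASKS = {
    "i211":  (0xFF6, 0xF76, 0xF7F),
    "82576": (0xFF6, 0xF76, 0xF7F),
    "82599": (0xFF6, 0xFF6, 0xF7F),
}

def cleared_flags(mask):
    """Names of the TCP flags that a mask removes from a segment."""
    return [name for name, bit in FLAG_BITS.items() if not mask & bit]

for chip, (first, mid, last) in MASKS.items():
    print(chip, cleared_flags(first), cleared_flags(mid), cleared_flags(last))
```

Under these assumptions, 0xFF6 removes FIN and PSH, 0xF76 removes FIN, PSH and CWR, and 0xF7F removes only CWR.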

My next steps:

  • Make the values changeable via the sysctl-interface.
  • Test whether the cards actually honor the values set via the sysctl-interface.

Assuming both steps work out as expected, we can configure whether the card is operating in the classical ECN mode or in the AccECN mode.

But you may want to support RFC3168 and AccECN simultaneously; so in whichever case is not optimized locally (depending on which driver is used), sending segments with CWR via the slow path seems the only alternative.

However, the main observation remains that taking a slow-path hit in the RFC3168 case is much less impactful: the CWR means the congestion window was reduced (sending is supposed to slow down). With modern mechanisms like proportional rate reduction, it's unlikely that long TSO chains even get sent. (This would be a nice test to do: while not actually sending CWR, track when snd_cwnd is reduced, and how often, immediately following that, TSO is actually active with more than 2 segments.)

With AccECN, the ACE counter may remain at a value of 2, 3, 6 or 7 for long periods of time, where the proper use of TSO is much more critical than in the ephemeral RFC3168 CWR case...
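
Why exactly the values 2, 3, 6 and 7: the 3-bit ACE counter is carried in the AE, CWR and ECE header bits, with AE as the highest bit, so CWR is the middle bit. A short sketch (the helper name `ace_to_flags` is hypothetical):

```python
TH_ECE, TH_CWR, TH_AE = 0x040, 0x080, 0x100

def ace_to_flags(ace):
    """Map a 3-bit ACE counter value onto the AE/CWR/ECE header bits."""
    ace &= 0x7
    flags = 0
    if ace & 0x1:
        flags |= TH_ECE   # lowest counter bit
    if ace & 0x2:
        flags |= TH_CWR   # middle counter bit
    if ace & 0x4:
        flags |= TH_AE    # highest counter bit
    return flags

# Counter values for which every segment must carry CWR.
cwr_values = [a for a in range(8) if ace_to_flags(a) & TH_CWR]
print(cwr_values)
```

Half of all counter values have the CWR bit set, which is why a CWR-triggered TSO bypass would hit AccECN sessions far more often than RFC3168 ones.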

Thus I strongly think that generally removing "TSO ECN" capabilities (clearing of the CWR bit) is the more appropriate path when having control over both the TCP stack and drivers/hardware. The impact of splitting up an RFC3168 CWR TSO chunk into one single packet and the remainder of non-CWR segments

Maybe I was not clear on what I suggest. I suggest to make the behavior of the TCP stack configurable. So if I configure the system to only perform AccECN and I know the NIC does the right thing, I can allow TSO to happen for segments with CWR set...

  • add system-wide tuneable if NIC drivers support TSO with RFC3168 CWR support

OK, I'm sorry, I was not aware of AccECN and its desired behavior of setting CWR on all segments.

In any case, a system wide tunable is probably not the correct approach. We would want to have the driver set a bit to advertise which ECN modes it supports. If there is hardware that supports both, and the driver cannot determine easily from the packet itself if normal ECN or AccECN should be used, we probably also need to hint to the driver which ECN mode should be used.

I'd like to see some input from NIC vendors here, so I'm glad @np is on the review.

In any case, a system wide tunable is probably not the correct approach. We would want to have the driver set a bit to advertise which ECN modes it supports.

That would be ideal; with 3168 ECN mode being what is currently done with NDIS TSO ECN compatibility (masking CWR on all but one segment), and the other mode being no special treatment (pre-TSO-ECN behavior).

But would it be feasible to expect the driver to swap these modes on a session by session basis?

If there is hardware that supports both, and the driver cannot determine easily from the packet itself if normal ECN or AccECN should be used, we probably also need to hint to the driver which ECN mode should be used.

Yes; there is no per-packet state kept to differentiate between these modes.

As long as the AE bit (highest bit of the 3-bit ACE counter, 9th bit of the 12 TCP header flags) is in use (e.g. during the 3WHS, and initially, as these counters start at 5 or 6, not at 0 or 1) it could be determined (and stored for subsequent packets belonging to the same session). However, with multihomed hosts and ECMP rerouting, such a switch could conceivably happen when it's not obvious by the new port's NIC...

I'd like to see some input from NIC vendors here, so I'm glad @np is on the review.

Looking forward to input after the holiday season. In my naive understanding, I would think it easiest to set up the NIC in one way or the other, and let the TCP stack know what to expect from the NIC - thus the system-wide loadtime tunable. That way the stack can efficiently send the flavor supported by the TSO hardware directly, and deal with the other flavor in the stack.

Maybe of interest: Some of the folks working with TCP Prague on Linux are proposing a non-fallback 3WHS. While the AccECN negotiation was carefully designed to allow a classic RFC3168 ECN server to interact with an AccECN (and 3168 ECN) client, in certain environments they don't want to fall back to 3168 ECN at all.

The "normal" AccECN SYN has AE, CWR and ECE all set, and is responded to in one of 4 ways, depending on the IP ECN bits observed by the receiver for the SYN. A server not aware of AccECN will see the SYN,CWR,ECE flags, which is an RFC3168 handshake, and respond accordingly.
What they seem to propose is to perform an AccECN handshake with SYN,AE - which the spec says is to be treated like the normal AccECN handshake. A server not aware of AccECN will only perceive the pure SYN and perform a non-ECN response...
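
The two SYN variants described above can be compared with a small model of how each kind of server classifies the request. This only encodes what is stated in this comment; the function names are hypothetical and the real negotiation rules in the AccECN specification cover more flag combinations.

```python
TH_SYN, TH_ECE, TH_CWR, TH_AE = 0x002, 0x040, 0x080, 0x100

def legacy_server_view(syn_flags):
    """Classification by a server that does not know the AE bit."""
    visible = syn_flags & ~TH_AE                  # AE is ignored
    if visible & (TH_ECE | TH_CWR) == (TH_ECE | TH_CWR):
        return "rfc3168"                          # looks like a classic ECN setup SYN
    return "non-ecn"

def accecn_server_view(syn_flags):
    """Classification by an AccECN-aware server (simplified)."""
    if syn_flags & TH_AE:
        return "accecn"                           # both SYN variants request AccECN
    return legacy_server_view(syn_flags)

normal_syn = TH_SYN | TH_AE | TH_CWR | TH_ECE     # fallback to RFC3168 possible
no_fallback_syn = TH_SYN | TH_AE                  # proposed non-fallback variant

for syn in (normal_syn, no_fallback_syn):
    print(accecn_server_view(syn), legacy_server_view(syn))
```

In this model both SYNs negotiate AccECN with an AccECN server, but against a legacy server the first falls back to RFC3168 ECN and the second to no ECN at all.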

In any case, TSO ECN compatibility as per the NDIS specs does interfere with the ACE counter (and the hardware drivers / LRO have to account for that too)... I don't know what exactly they are doing on the TSO TX path though. As my knowledge of the mbuf is not that intricate: would there be any place for meta-information to be passed down and up the stack, to inform TSO and get info from LRO?

I'm not familiar with AccECN either. I did look into the documentation for cxgbe hardware and it seems to do normal ECN only with stateless TSO. The behavior is as follows:
a) The TOS field is copied as is to all segments. So if ECN is in use then ECT/CE bits are copied to all segments.
b) If FIN/PSH are set in tcp flags then they will be set in the flags for the last segment only.
c) If CWR is set in the TCP hdr it is set in the hdr of the first segment only. The ECE bit is copied to all segments.
d) If there are IP options they are copied into each segment unaltered. This means if TCP timestamp option is in use the chip will use the same timestamp in all the segments for the TSO.

Thanks Navdeep!

Is there an r/w mask register available in the cxgbe hardware, similar to what Intel NICs offer, to modify specifically the CWR behavior?

@tuexen suggested to have a global variable controlling the behavior of the TCP stack, depending on the hardware drivers in use (e.g. one set of drivers for RFC3168 ECN support, clearing CWR on all but the first segment, and another set of drivers where CWR is kept as-is on all segments by TSO).

No, there is no such configuration knob available.

I pinged Nvidia/Mellanox last week, and I'm still waiting to hear back to see if they can support AccECN in their NICs.

So the cxgbe behavior seems to be correct for classical ECN, but not for AccECN. Thanks for providing the information.