head/sys/netinet/cc/cc_newreno.c
240	This is a functional change which appears to be unintentional given the r367007 commit log message? I also think we need to have a broader conversation about the use of connection-specific maximum segment size in the stack, especially for congestion avoidance/control algorithm purposes. I'm not convinced it is correct to use tcp_maxseg() for congestion window/algorithm calculations.

Hi Lawrence,

incrementing by t_maxseg (w/o TSopt) vs. tcp_maxseg() (w/ TSopt/MPTCP) does not lead to a unexpected transmission of 2 segments, when only a single segment would be expected (tcp_output prefers to send full segments but considers option overhead).

Incrementing by tcp_maxseg() is sightly less aggressive during CA, not?

rscheff added a commit: rS367752: MFC r367007: tcp: move cwnd and ssthresh updates into cc modules.Nov 17 2020, 8:04 AM

Allo!

Apologies for the delayed reaction...

In D26807#604494, @rscheff wrote:

incrementing by t_maxseg (w/o TSopt) vs. tcp_maxseg() (w/ TSopt/MPTCP) does not lead to a unexpected transmission of 2 segments, when only a single segment would be expected (tcp_output prefers to send full segments but considers option overhead).

Incrementing by tcp_maxseg() is sightly less aggressive during CA, not?

First up, it seems buggy to me that the sender side accounts for received SACK blocks in the "effective" MSS computed by tcp_maxseg() for outbound segments. (Not your bug, but your change amplifies the effect of the bug).

I'll try ground the remainder of my thoughts with my interpretation of the relevant "first principles" and practical considerations...

The congestion window maintained at the sender attempts to track an estimate of pipe capacity, but does so without knowledge of whether the capacity constraint applies at a particular TCP/IP layer or includes/excludes certain sources of non-application-payload overhead. TCPs typically take the shortcut of approximating pipe capacity purely in terms of TCP segment payload, but most IP paths I'm aware of typically track bits in terms of IP-payload i.e. it is likely more correct for most if not all paths to consider the congestion window as representing pipe capacity inclusive of all TCP segment overhead (fixed header and options). Doing so ensures that for a given congestion window, connections with different TCP segment sizes account for their sent TCP-layer bits the same. Also worth noting that paths often account based on number of packets in addition to or instead of bits i.e. tracking congestion window in terms of both bits and number of segments is potentially of value.

So, in a high level sense, accounting for TCP option bits in a given send window worth of data is both justified and arguably more correct. However, using tcp_maxseg() in congestion control accounting effectively moves us further away from this and towards accounting purely based on payload bits.

I also have a concern that tcp_maxseg() pessimises calculations for connections which use TCP options that do not appear on every segment (SACK being the most common case). It's a rounding error and unlikely to be of concern for sufficiently large send windows, but its a different story at the small window end of things, and even more so for connections using a smaller than typical MSS.

In D26807#610449, @lstewart wrote:

In D26807#604494, @rscheff wrote:

incrementing by t_maxseg (w/o TSopt) vs. tcp_maxseg() (w/ TSopt/MPTCP) does not lead to a unexpected transmission of 2 segments, when only a single segment would be expected (tcp_output prefers to send full segments but considers option overhead).

Incrementing by tcp_maxseg() is sightly less aggressive during CA, not?

First up, it seems buggy to me that the sender side accounts for received SACK blocks in the "effective" MSS computed by tcp_maxseg() for outbound segments. (Not your bug, but your change amplifies the effect of the bug).

Not quite sure I get your point here; are you saying the tcp_maxseg() is erraneously including received SACK blocks fom the previous ACK? Maybe you can send an annotated code snippet per direct mail?

I'll try ground the remainder of my thoughts with my interpretation of the relevant "first principles" and practical considerations...

The congestion window maintained at the sender attempts to track an estimate of pipe capacity, but does so without knowledge of whether the capacity constraint applies at a particular TCP/IP layer or includes/excludes certain sources of non-application-payload overhead. TCPs typically take the shortcut of approximating pipe capacity purely in terms of TCP segment payload, but most IP paths I'm aware of typically track bits in terms of IP-payload i.e. it is likely more correct for most if not all paths to consider the congestion window as representing pipe capacity inclusive of all TCP segment overhead (fixed header and options). Doing so ensures that for a given congestion window, connections with different TCP segment sizes account for their sent TCP-layer bits the same. Also worth noting that paths often account based on number of packets in addition to or instead of bits i.e. tracking congestion window in terms of both bits and number of segments is potentially of value.

So, in a high level sense, accounting for TCP option bits in a given send window worth of data is both justified and arguably more correct. However, using tcp_maxseg() in congestion control accounting effectively moves us further away from this and towards accounting purely based on payload bits.

I agree with you on the first principles approach. We have had this kind of discussion @IETF during reECN (ConEx), and that many don't apprecitate this subtle difference was a major hurdle in the process to get tracktion to AccECN (why there is a higher fidelity bytes-based signal to begin with, and additionally, shall it include vaious header bytes or not).
However, such a discussion is certainly well beyond the scope of this Diff (which I've committed already).

The concrete issue addressed here is, that during loss recovery, incrementing cwnd by t_maxseg, when TSopt is in use, leaves 12 payload bytes "unused" in a transmission opportunity, which may be sent out right away. The error of transmitting two segments (with much more header- and potentially ethernet padding bytes, than 12 bytes of payload certainly makes it more attractive to NOT incorrectly included these header option bytes.

Revamping cwnd to include various headers, some of which are not known to the sender at transmission time, is certainly beyond the scope of a simply bugfix. (E.g. TCP over IP over GRE over IPsec over VxLAN over MPLS may be a valid chain of headers in the middle of a path, where congestion occurs. Most of these technologies try to be as transparent to an endhost as possible, though...)

I also have a concern that tcp_maxseg() pessimises calculations for connections which use TCP options that do not appear on every segment (SACK being the most common case). It's a rounding error and unlikely to be of concern for sufficiently large send windows, but its a different story at the small window end of things, and even more so for connections using a smaller than typical MSS.

IIRC, tcp_maxseg() should take into account those options, which would be present "currently". As you note, SACK options would be the only one such TCP option, that FreeBSD would be using. However, once we tackle this issue or other options which are only infrequently present (e.g. MP-TCP options, AccECN option, AckCC (?) ) we can address this aspect then.

In D26807#610464, @rscheff wrote:

In D26807#610449, @lstewart wrote:

In D26807#604494, @rscheff wrote:

incrementing by t_maxseg (w/o TSopt) vs. tcp_maxseg() (w/ TSopt/MPTCP) does not lead to a unexpected transmission of 2 segments, when only a single segment would be expected (tcp_output prefers to send full segments but considers option overhead).

Incrementing by tcp_maxseg() is sightly less aggressive during CA, not?

First up, it seems buggy to me that the sender side accounts for received SACK blocks in the "effective" MSS computed by tcp_maxseg() for outbound segments. (Not your bug, but your change amplifies the effect of the bug).

Not quite sure I get your point here; are you saying the tcp_maxseg() is erraneously including received SACK blocks fom the previous ACK? Maybe you can send an annotated code snippet per direct mail?

It keys off tp->rcv_numsacks per the implementation in tcp_subr.c. I'm pretty sure this is an unintentional bug in the implementation of tcp_maxseg().

I'll try ground the remainder of my thoughts with my interpretation of the relevant "first principles" and practical considerations...

The congestion window maintained at the sender attempts to track an estimate of pipe capacity, but does so without knowledge of whether the capacity constraint applies at a particular TCP/IP layer or includes/excludes certain sources of non-application-payload overhead. TCPs typically take the shortcut of approximating pipe capacity purely in terms of TCP segment payload, but most IP paths I'm aware of typically track bits in terms of IP-payload i.e. it is likely more correct for most if not all paths to consider the congestion window as representing pipe capacity inclusive of all TCP segment overhead (fixed header and options). Doing so ensures that for a given congestion window, connections with different TCP segment sizes account for their sent TCP-layer bits the same. Also worth noting that paths often account based on number of packets in addition to or instead of bits i.e. tracking congestion window in terms of both bits and number of segments is potentially of value.

So, in a high level sense, accounting for TCP option bits in a given send window worth of data is both justified and arguably more correct. However, using tcp_maxseg() in congestion control accounting effectively moves us further away from this and towards accounting purely based on payload bits.

I agree with you on the first principles approach. We have had this kind of discussion @IETF during reECN (ConEx), and that many don't apprecitate this subtle difference was a major hurdle in the process to get tracktion to AccECN (why there is a higher fidelity bytes-based signal to begin with, and additionally, shall it include vaious header bytes or not).
However, such a discussion is certainly well beyond the scope of this Diff (which I've committed already).

Agreed, the broader discussion is not specific to this changeset (probably should move it to freebsd-transport@) but I wanted to touch on some of my thinking to provide some background for my concerns.

What is directly relevant to this changeset is the subtle but potentially impactful change to newreno's cwnd accounting by switching to using tcp_maxseg() in newreno_cong_signal(), which I flagged previously because it is a functional change which also introduces inconsistency within the module given that t_maxseg is still used elsewhere. FYI, I opted to exclude this changeset from our internal Netflix tree until we can either quantify the effects or figure out a different way of addressing the underlying problem being addressed, given my concern that this may pessimise connections with small send windows. That's what prompted me to circle back to this review to better understand the underlying issue being addressed by the change and to publicly flag my concern that there is in fact a subtle functional change being introduced here.

The concrete issue addressed here is, that during loss recovery, incrementing cwnd by t_maxseg, when TSopt is in use, leaves 12 payload bytes "unused" in a transmission opportunity, which may be sent out right away. The error of transmitting two segments (with much more header- and potentially ethernet padding bytes, than 12 bytes of payload certainly makes it more attractive to NOT incorrectly included these header option bytes.

Ah, I see. My initial reaction to such an issue is that we should consider addressing by tuning the runt transmission avoidance checks in tcp_output() rather than marginally gaming cwnd, which may still not prevent a runt transmission if, for example, the effective MSS varied over the preceding window of data for some reason.

Revamping cwnd to include various headers, some of which are not known to the sender at transmission time, is certainly beyond the scope of a simply bugfix. (E.g. TCP over IP over GRE over IPsec over VxLAN over MPLS may be a valid chain of headers in the middle of a path, where congestion occurs. Most of these technologies try to be as transparent to an endhost as possible, though...)

Sure, IMO the only reasonable thing TCP can do is fully account for its bits to most accurately derive its pipe capacity estimate.

I also have a concern that tcp_maxseg() pessimises calculations for connections which use TCP options that do not appear on every segment (SACK being the most common case). It's a rounding error and unlikely to be of concern for sufficiently large send windows, but its a different story at the small window end of things, and even more so for connections using a smaller than typical MSS.

IIRC, tcp_maxseg() should take into account those options, which would be present "currently". As you note, SACK options would be the only one such TCP option, that FreeBSD would be using. However, once we tackle this issue or other options which are only infrequently present (e.g. MP-TCP options, AccECN option, AckCC (?) ) we can address this aspect then.

It attempts to do so, but crudely. It doesn't track the actual option-related bits per send, doesn't keep a rolling tally per send window or over any other timescale, nor are any retrospective adjustments made if the cumulative effect of slightly incorrect tcp_maxseg()-based adjustments begins to add up.

The above discussion about tcp_maxseg() has been closed in the subjected email thread: https://lists.freebsd.org/pipermail/freebsd-transport/2020-November/000355.html
No further action needed.