Page MenuHomeFreeBSD

ignore ICMP need frag with equal or larger MTU offer
Needs ReviewPublic

Authored by glebius on Sep 4 2015, 11:25 AM.

Details

Summary

This patch was submitted by Richard Russo <russor whatsapp.com>.

The problem is that some bad gateway can respond with ICMP need frag suggesting the same MTU we already use. This has been seen in the wild. Theoretically there could be response with larger MTU. Our stack reacts on such ICMP resending the whole TCP window again, for each ICMP received. I believe this isn't a best behaviour. Richard suggest to ignore such ICMPs. His patch actually modifies tp_maxseg, it just suppresses resend of the window. This covers properly case of equal MTU offer. I suppose, that to cover case of growing offer, we should also add protection against increasing tp_maxseg.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 557
Build 557: arc lint + arc unit

Event Timeline

glebius retitled this revision from to ignore ICMP need frag with equal or larger MTU offer.Sep 4 2015, 11:25 AM
glebius updated this object.
glebius edited the test plan for this revision. (Show Details)
glebius set the repository for this revision to rS FreeBSD src repository.
glebius updated this revision to Diff 8484.

To clarify: this isn't the final patch to review & commit. This is request for discussion of a proper behaviour, and once we settle on it, I will produce a patch.

Actually this is covered by RFC 1191.

This means that the TCP layer must be able to recognize when a
 Datagram Too Big notification actually decreases the PMTU that
 it has already used to send a datagram on the given
 connection, and should ignore any other notifications.
pkelsey edited edge metadata.Sep 4 2015, 3:53 PM

We shouldn't be taking any action at all based on a NEEDFRAG message that suggests (or from which we deduce) a new mtu that is the same or larger than the current one. I think the clearest expression of this idea is to only call tcp_mtudisc() from tcp_ctlinput() when the new mtu could decrease tp->t_maxseg (and remove the comment that tcp_mtudisc() 'does the right thing').

stas added a subscriber: stas.Sep 4 2015, 7:11 PM
stas added a comment.EditedSep 4 2015, 7:17 PM

Another thing we discovered is that FreeBSD actually used to handle this case correctly up until r182851 where the relevant check was dropped by accident during the mtu handling refactoring. It used to check for the received MTU size and return early:

if (tp->t_maxopd <= mss)
    return (inp);
wollman added a subscriber: wollman.Sep 6 2015, 8:15 PM

Agreed that the current behavior is a bug and we should just fix it. I think pkelsey is right as far as the correct fix.

Note that pkelsey@ suggests to check against t_maxseg, and stas@ notes that before there was a check against t_maxopd. Here we come to a long historical confusion with t_maxseg vs t_maxopd. Which I still can't understand to its end.

Originally t_maxopd was introduced with T/TCP. Before we successfully lived without it. Later, when T/TCP was removed, t_maxopd remained and now some code uses t_maxopd, and some uses t_maxseg. IMHO, this confusion needs to be fixed.

I don't think we should be taking any

Note that pkelsey@ suggests to check against t_maxseg, and stas@ notes that before there was a check against t_maxopd. Here we come to a long historical confusion with t_maxseg vs t_maxopd. Which I still can't understand to its end.
Originally t_maxopd was introduced with T/TCP. Before we successfully lived without it. Later, when T/TCP was removed, t_maxopd remained and now some code uses t_maxopd, and some uses t_maxseg. IMHO, this confusion needs to be fixed.

Actually, I was not saying we check against t_maxseg - I chose my words very carefully there in order to avoid sorting out at that moment t_maxseg vs t_maxopd :) All I was saying that the check should only pass when t_maxseg could be decreased.

Since my attempt to avoid digging into failed, I have looked into it. Here is my summary:

t_maxopd is the maximum amount of TCP data and IP and TCP options that can fit in one segment (that is, mtu less minimum-sized TCP and IP headers). t_maxopd is used correctly in tcp_output() to determine the amount of tcp data to submit to the IP layer, so I do not think it is a T/TCP specific concept.

t_maxseg is supposed to be the maximum amount of TCP data that can be put in one segment, but it currently assumes there are no IP options and assumes the only possible per-segment TCP option is the TCP timestamp option (what about SACK and SIGNATURE?). Based on these assumptions, so far t_maxseg seems a bit suspicious.

Issues/questions:

  1. In tcp_input.c, the DELAY_ACK(tp, tlen) macro checks for LRO on a received segment by comparing tlen (the number of TCP stream bytes received) to t_maxopd. Shouldn't this check be against t_maxseg?
  1. At tcp_output.c:877, the calculation of max_len in the TSO path is ignoring IP option length. I'm not sure this is correct, particularly as just below TSO can wind up disabled and the TCP data len set to this max_len value.
  1. Uses of V_tcp_minmss and V_tcp_mssdflt ignore the possibility of IP and per-segment TCP options (this can result, for example, in having a smaller minimum mss in use than V_tcp_minmss due to the presence of options).
  1. Setting the TCP_MAXSEG option sets t_maxseg but does not also adjust t_maxopd.

Patrick,

glad that you are looking into that in parallel with me. I also found some of the important points that you listed, and missed some others.

In general, per RFC1122 section 4.2.2.6, we should take the MSS offered by peer and reduce it by tcp options length and ip options length. Or, if it is not offered MSS but MTU product, then it should be simply MTU - minimal protocol header (40 for IPv4). This is what t_maxopd is now.

Since options can be different for different segments, there is no reason to store a half calculated value in tcpcb. So, right now tp_maxseg is usually an incorrect value, but since we don't use it in crucial code, we are safe. Right now tcp_output() uses t_maxopd, and it calculates tcpoptlen and ipoptlen per packet, which is correct.

Historically, t_maxseg was the variable, but with all the T/TCP permutations, t_maxopd slowly took its place. I suggest to fix this, keeping only one variable and store pure unfixed MSS in it. And name it t_maxseg, to keep the historical BSD name.

Right now I'm testing patch and will post it in separate review soon.

So, t_maxseg is always equal to (t_maxopd - length of timestamp option). From my observations right now t_maxseg is used only in a code that is error safe: congestion window calculations, slow start threshold calculations, statistics. All this code can recover from a slightly incorrect value of t_maxseg.

It could be that for some of the calculations using the reduced MSS is even better, couldn't it? It doesn't make the value correct, however. I most cases we use timestamps, and the guessing of (MSS - options) is correct. But if SACK comes into play, the value is no longer correct. Not mentioning IPSEC or other heavy stuff.

In original BSD stack, the bare MSS was used to calculate cwnd or ssthresh. NetBSD continues to do so.

I probably should open new review for this discussion.

glebius edited edge metadata.Sep 17 2015, 4:42 PM
glebius updated this revision to Diff 8820.

Looks like I'm burying too deep into maxseg/maxopd stuff, so decided
first to fix this bug, and then go with code tudying.

I suggest to do the check on offered MTU as early as possible, and if
it doesn't reduce, then avoid any processing, including caching.

Patrick, can you please review?

Richard, can you please test in your environment? To apply the patch
you first need to merge cosmetical r287874 from head.

I've been running this patch (manually applied to 9.3-RELEASE) on one of our machines since Thursday, and all looks well. Prior to this issue, we were seeing retransmit events about once a day, so 4 days without an event looks successful.

kbowling added a subscriber: kbowling.
melifaro accepted this revision.Apr 5 2018, 10:34 PM
This revision is now accepted and ready to land.Apr 5 2018, 10:34 PM
This revision now requires review to proceed.Apr 5 2018, 10:34 PM
jtl added a subscriber: jtl.Apr 6 2018, 12:15 AM

This was already committed in rS288412.

@glebius, can you close the revision?

Thanks!