Page MenuHomeFreeBSD

Merge PLPMTU blackhole detection from xnu.
ClosedPublic

Authored by sbruno on Jul 31 2014, 2:09 AM.
Tags
None
Referenced Files
F103632532: D506.diff
Wed, Nov 27, 10:06 AM
F103622573: D506.id1158.diff
Wed, Nov 27, 6:34 AM
F103584230: D506.id1690.diff
Tue, Nov 26, 6:44 PM
F103568873: D506.id887.diff
Tue, Nov 26, 3:13 PM
Unknown Object (File)
Mon, Nov 25, 2:52 PM
Unknown Object (File)
Thu, Nov 21, 7:05 AM
Unknown Object (File)
Sat, Nov 16, 10:12 AM
Unknown Object (File)
Sat, Nov 16, 4:21 AM
Subscribers

Details

Reviewers
rpaulo
bz
Group Reviewers
network
Summary

If the network stack time out fires on an xmit, this code will shift the mss downward to the blackhole mss configured by the user. If that fails, shift the mss to the min mss compile time values.

This works around networks that block ICMP NF/DF responses and allows us to shrink the mss size to work in these conditions. This is the first step in implementing a working PLPMTUD design that can allow the network stack to adjust mss upwards for connections.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
In D506#60, @gnn wrote:

I gather that y'all retested with that last change and saw the MTU move correctly back up?

No. As long as the route is in the host cache of the router(not the test host), it never moves back up after a test reset.

gnn edited edge metadata.

OK, based on an IRC conversation with sbruno I now get the "decouple" Yes. go and commit this and make sure that a separate bugzilla issue is opened for the MTU increase problem.

sbruno removed subscribers: bz, glebius.
bz requested changes to this revision.Sep 16 2014, 5:18 PM
bz added a reviewer: bz.

Ok. Just because I was away for too long doesn't mean we should continue to do other silly things;-)

This clearly currently does not pass a universe build on i386 or amd64 even without any INET/INET6 changes requested for at least VNET failing.

Can we please just make these changes now on new code as well so I I'll have less work to do on the next full kernel audit?

TIA

sys/netinet/tcp_timer.c
144

This needs virtualisation

145

This is #ifdef INET

147

This needs virtualisation

150

This is #ifdef INET6

153

This needs virtualisation

555

I think we used to usually we write this as:

isipv6 = (tp->t_inpcb->inp_vflag & INP_IPV6) ? 1 : 0;

but I am probably wrong. Having V4 constants under #ifdef INET6 seems wrong to me however. There is otherwise also a possible problem with v4mapped v6 sockets I would think if my brain would work.

Also you want to probably do this under the INP_WLOCK() though not necessarily needed but the inp_vflag locking strategy is "protected by inp lock". Also you do not need to do the assignment work until you actually need it, which is only in one if/else case way further down. I'd suggest moving the evaluation down there to line ~685.

682

With the changes above, hiding the values and the sysctls per-address family properly this will no longer compile for all 3 cases (IPv4-only, IPv6-only, Dual-Stack).

I know the 6 lines of #ifdef #endif #ifdef #endif #ifdef #endif are a pain but so be it.

689

This will fail compiling on an VNET kernel as it lacks the V_

689

The fact that this version will work is because myself probably missed an #ifdef at one point for a reason unclear currently without doing a full source inspection but the symbol should not be assumed to be there for IPv6-only kernels.

699

style; else goes on same line

sys/netinet/tcp_var.h
205

You are still decrementing "TBD" spares on head, which you should not.

This revision now requires changes to proceed.Sep 16 2014, 5:18 PM
sbruno edited edge metadata.

Address review commentary:

tcp_timer.c

  • Virtualization, lines 137, 143, 146, 699
  • ifdef INET/INET6. lines 138, 143, 692
  • isipv6 change and move under INP_WLOCK(). line 563
  • } else { style. line 709

tcp_var.h:

  • revert allocation of t_spare[] elements.

Pending make universe builds at this time. I will post back with results.

sbruno edited edge metadata.

After coffee, realize that the complexity of the INET/INET6 ifdefs was out
of control, simplify and waste a few lines of text to show what code blocks
are executed in full.

I think this makes it way more readable.

Update a whitespace nit at line 700

rpaulo added a reviewer: rpaulo.
rpaulo added a subscriber: rpaulo.

Other than the two remarks below, this is good to go.

sys/netinet/tcp_timer.c
153

This is too low. The value of 1200 has been field tested by Apple for years. I'm not sure why you picked the minimum IPv4 MSS as that will degrade performance a lot.

659

Probably too many parenthesis.

sys/netinet/tcp_timer.c
153

This value was chosen based on real world testing of connections to other networks by LimeLight CDN folks. These are not arbitrary values. I cannot say where Apple got their values from, but the numbers I propose are based on testing in a live network.

As for degrading performance, this should only be firing in the case where an Needs Frag ICMP message is being blocked. We are providing the lowest value measured as a fall back as we only get one shot to deliver traffic.

659

Eh? All the parenthesis are paired and they make it obvious where values are being compared. I don't see "too many" here. But I'm open to being wrong.

sys/netinet/tcp_timer.c
153

Oh, yes, I forgot this implementation is missing an important part of blackhole detection: TCP retransmissions. N TCP retransmission should lower the MSS to work around broken/misconfigured firewalls/VPNs. These firewalls often just drop the packet when the MSS is too high (they can't forward/encapsulate their headers). This affects both Juniper and Cisco VPN concentrators when they are misconfigured.

In any case, have you tested this with a higher tcp_pmtud_black_hole_mss?

659

It's a matter of style and it's up to you.

679

What intermediary value? The values you've chosen are exactly the same as tcp.mssdflt and tcp.v6mssdftl, so I don't understand the purpose of this code.

rpaulo requested changes to this revision.Sep 18 2014, 5:39 PM
rpaulo edited edge metadata.
This revision now requires changes to proceed.Sep 18 2014, 5:39 PM
sys/netinet/tcp_timer.c
153

In the lab, yes. I did tests from 1400 -> 500 with active blocking of ICMP throughout the test lab. Seems to be happy and does the right thing. The delay is obnoxious as its waiting for failures to occur before stepping the mss down.

679

This line appears to be a cut-n-paste from the XNU version. It should read something like "set MSS to configured blackhole tuneable to retransmit"

I will eliminate the fancy tri-state conditionals as they are useless now that I've cleared up the INET vsINET6 defines a bit.

sbruno edited edge metadata.

Address review bits:

  • line 702 - comment is not clear, try to make it so.

Remove tri-state conditional alltogether, it was not doing anything
anymore as the outer conditional for if(isipv6) is now handling
that logic.

sys/netinet/tcp_timer.c
153

Right, I don't doubt it works, what I was complaining about is the amount of performance loss when going from a 1448 MSS (or similar) to 536.

In D506#88, @rpaulo wrote:

Right, I don't doubt it works, what I was complaining about is the amount of performance loss when going from a 1448 MSS (or similar) to 536.

Yeah, its pretty brutal without the ability to increase the MSS/MTU via a real PLPMTUD implementation. I'll be doing that in Q4 after MeetBSD according to my $work schedule.

sys/netinet/tcp_timer.c
680

I still don't understand this code. The "blackhole value" is exactly the same as the "default" value.

The code reads:
if (maxopd > 536)

maxopd = 536;

else

maxopd = 536;

Because V_tcp_pmtud_black_hole_mss == V_tcp_mssdflt, right?

sys/netinet/tcp_timer.c
680

Yep, no argument there. True in both the ipv6 and ipv4 case because of the default values that are being chosen for the tuneables. This will not be the case if the users choose different values based on their networks. So I think the code is still correct.

In D506#91, @sbruno wrote:

Yep, no argument there. True in both the ipv6 and ipv4 case because of the default values that are being chosen for the tuneables. This will not be the case if the users choose different values based on their networks. So I think the code is still correct.

Then I don't understand the choice of default values if this is a no-op by default. I'm fine with it if you plan to fix it in the future.

In D506#92, @rpaulo wrote:
In D506#91, @sbruno wrote:

Yep, no argument there. True in both the ipv6 and ipv4 case because of the default values that are being chosen for the tuneables. This will not be the case if the users choose different values based on their networks. So I think the code is still correct.

Then I don't understand the choice of default values if this is a no-op by default. I'm fine with it if you plan to fix it in the future.

I'm trying to think if there's a case where the user enables the blackhole functionality and the value of the blackhole mss would NOT be less than the default mss. I think that this code might be more of a safety check. If the user sets a blackhole value that its way out of bounds in the large direction, this would catch it and set it to the default.

In D506#93, @sbruno wrote:
In D506#92, @rpaulo wrote:
In D506#91, @sbruno wrote:

Yep, no argument there. True in both the ipv6 and ipv4 case because of the default values that are being chosen for the tuneables. This will not be the case if the users choose different values based on their networks. So I think the code is still correct.

Then I don't understand the choice of default values if this is a no-op by default. I'm fine with it if you plan to fix it in the future.

I'm trying to think if there's a case where the user enables the blackhole functionality and the value of the blackhole mss would NOT be less than the default mss. I think that this code might be more of a safety check. If the user sets a blackhole value that its way out of bounds in the large direction, this would catch it and set it to the default.

To me, that's how it was supposed to be in the first place. First drop to the blackhole MSS and then drop to the minimum MSS. The blackhole MSS could be 1200.

In D506#94, @rpaulo wrote:
In D506#93, @sbruno wrote:
In D506#92, @rpaulo wrote:
In D506#91, @sbruno wrote:

Yep, no argument there. True in both the ipv6 and ipv4 case because of the default values that are being chosen for the tuneables. This will not be the case if the users choose different values based on their networks. So I think the code is still correct.

Then I don't understand the choice of default values if this is a no-op by default. I'm fine with it if you plan to fix it in the future.

I'm trying to think if there's a case where the user enables the blackhole functionality and the value of the blackhole mss would NOT be less than the default mss. I think that this code might be more of a safety check. If the user sets a blackhole value that its way out of bounds in the large direction, this would catch it and set it to the default.

To me, that's how it was supposed to be in the first place. First drop to the blackhole MSS and then drop to the minimum MSS. The blackhole MSS could be 1200.

If I read the original revisions in XNu, there's no way that could have happened as the code clear the PMTU flag on the first pass. It would never fall back down the blackhole detection code path to try and use the minmss value.

Perhaps that's a bug and we should try and take two attempts (set blackhole mss to something not the minimum like 1200) and try to let the code be more robust?

In D506#95, @sbruno wrote:
In D506#94, @rpaulo wrote:
In D506#93, @sbruno wrote:
In D506#92, @rpaulo wrote:
In D506#91, @sbruno wrote:

Yep, no argument there. True in both the ipv6 and ipv4 case because of the default values that are being chosen for the tuneables. This will not be the case if the users choose different values based on their networks. So I think the code is still correct.

Then I don't understand the choice of default values if this is a no-op by default. I'm fine with it if you plan to fix it in the future.

I'm trying to think if there's a case where the user enables the blackhole functionality and the value of the blackhole mss would NOT be less than the default mss. I think that this code might be more of a safety check. If the user sets a blackhole value that its way out of bounds in the large direction, this would catch it and set it to the default.

To me, that's how it was supposed to be in the first place. First drop to the blackhole MSS and then drop to the minimum MSS. The blackhole MSS could be 1200.

If I read the original revisions in XNu, there's no way that could have happened as the code clear the PMTU flag on the first pass. It would never fall back down the blackhole detection code path to try and use the minmss value.

Perhaps that's a bug and we should try and take two attempts (set blackhole mss to something not the minimum like 1200) and try to let the code be more robust?

I wasn't specifically talking about xnu, but yes, I think two attempts would be better. Feel free to do it later :-)

rpaulo edited edge metadata.
sbruno edited edge metadata.

Sneakily add two attempt to downshift mss via moving the clearing
of PLPMTU_PMTUD only if the attempt with blackhole mss fails.

Adjust default blackhole mss to 1200 as the default min mss for
ipv4 (TCP_MSS) is already 536.

sbruno removed a subscriber: rpaulo.
rpaulo edited edge metadata.
rpaulo added inline comments.
sys/netinet/tcp_timer.c
678

"or *to* the default"

sbruno edited edge metadata.

Address comments:

  • sys/netinet/tcp_timer.c add a "to" in the comments

Add 3 new sysctls to the code so that one can monitor the
amount of invocations that this code is getting and see if
it still is failing.

Add some 80 column wraps to the new code.

In D506#104, @sbruno wrote:

Address comments:

  • sys/netinet/tcp_timer.c add a "to" in the comments

Add 3 new sysctls to the code so that one can monitor the
amount of invocations that this code is getting and see if
it still is failing.

Add some 80 column wraps to the new code.

This means that this code adds 6 new sysctl values:

net.inet.tcp.pmtud_blackhole_detection: 0
net.inet.tcp.pmtud_blackhole_activated: 0
net.inet.tcp.pmtud_blackhole_min_activated: 0
net.inet.tcp.pmtud_blackhole_failed: 0
net.inet.tcp.pmtud_blackhole_mss: 1200
net.inet.tcp.v6pmtud_blackhole_mss: 1220

Grrr ... make the new READ ONLY sysctl values, really be READ ONLY

Address offline comments from hiren to make the intent of the
two seperate counters to be two seperate attempts to downshift
the MSS on this connections.

Ok, I have not tried to compile (well incremental seemed fine) it yet but this patch includes the style changes, the VNET fixes, and vertical space reduced INET6/INET block in the middle with less code duplication.

https://people.freebsd.org/~bz/20140920-01-D506.diff

Please do not lose the whitespace changes!

It for now does NOT address the idea of making t_pmtu_flags a t_flags2 variable to be less wasteful and better set for the future.

sys/netinet/tcp_timer.c
141

No need to "declare" as in "extern" this in the .c file for a file-local implementation.

Dito further down.

142

per style(9) it is #define<tab>FOO<tab>BAR you are lacking the <tab> after #define

Dito further down.

144

Once this is static we can use rely on BSS and do not have to initialise.

Dito further down.

145

As per glebius and hopefully one of my comments (on irc or here) use the CTL flag and not the special macro anymore.

Dito further down.

552

This should really only be #ifdef INET6 and probably be before TCPDEBUG which also initialises something after declarations are done; however we can even move it down into the block as it's not used outside anywhere else; we do not (usually) do this but I like it and you already do it anyway.

658

There should be no need to initialise optlen as in either if or else block it is initialised and initialised before first use, so clang and gcc should be happy.

666

typo

668

comments end in punctuation + more below

731

I am sorry but that's a lot of code duplication and vertical space.

sys/netinet/tcp_var.h
204

I wonder if this should be t_flags2 as t_flags will run out after one more bit and we only need 3.

287

#define<tab>FOO<tab>BAR<space>/* Blah. */ will work best.

Sean, do you want to upload the updated patch, or what's the plan to proceed here?

In D506#112, @bz wrote:

Sean, do you want to upload the updated patch, or what's the plan to proceed here?

I will be updating the patch. Hopefully before you have the physical ability to strangle me in person.

Accept patch from bz https://people.freebsd.org/~bz/20140920-01-D506.diff

Change t_pmtud_flags to be t_flags2 after many requests from other people.

make universe pending, build on amd64 works.

make universe passed:


>>> make universe started on Mon Oct 6 20:26:39 PDT 2014

arm started on Mon Oct 6 20:26:39 PDT 2014
amd64 started on Mon Oct 6 20:26:39 PDT 2014
pc98 started on Mon Oct 6 20:26:39 PDT 2014
mips started on Mon Oct 6 20:26:39 PDT 2014
i386 started on Mon Oct 6 20:26:39 PDT 2014
sparc64 started on Mon Oct 6 20:26:39 PDT 2014
powerpc started on Mon Oct 6 20:26:39 PDT 2014
arm.arm buildworld started on Mon Oct 6 20:26:39 PDT 2014
arm.armeb buildworld started on Mon Oct 6 20:26:39 PDT 2014
amd64.amd64 buildworld started on Mon Oct 6 20:26:39 PDT 2014
arm.armv6 buildworld started on Mon Oct 6 20:26:39 PDT 2014
arm.armv6hf buildworld started on Mon Oct 6 20:26:39 PDT 2014
pc98.i386 buildworld started on Mon Oct 6 20:26:39 PDT 2014
mips.mipsel buildworld started on Mon Oct 6 20:26:39 PDT 2014
mips.mips buildworld started on Mon Oct 6 20:26:39 PDT 2014
arm.armeb buildworld completed on Mon Oct 6 21:55:23 PDT 2014
mips.mips buildworld completed on Mon Oct 6 22:00:03 PDT 2014
mips.mipsel buildworld completed on Mon Oct 6 22:00:22 PDT 2014
mips.mips64el buildworld started on Mon Oct 6 22:01:05 PDT 2014
mips.mips64 buildworld started on Mon Oct 6 22:01:36 PDT 2014
mips.mipsn32 buildworld started on Mon Oct 6 22:01:45 PDT 2014
mips.mipsn32 buildworld completed on Mon Oct 6 23:35:20 PDT 2014
mips.mips64el buildworld completed on Mon Oct 6 23:35:28 PDT 2014
mips.mips64 buildworld completed on Mon Oct 6 23:35:57 PDT 2014
i386.i386 buildworld started on Mon Oct 6 23:36:39 PDT 2014
sparc64.sparc64 buildworld started on Mon Oct 6 23:39:14 PDT 2014
powerpc.powerpc buildworld started on Mon Oct 6 23:45:56 PDT 2014
arm.armv6 buildworld completed on Tue Oct 7 00:09:40 PDT 2014
arm.arm buildworld completed on Tue Oct 7 00:09:47 PDT 2014
arm.armv6hf buildworld completed on Tue Oct 7 00:14:40 PDT 2014
powerpc.powerpc64 buildworld started on Tue Oct 7 00:15:46 PDT 2014
pc98.i386 buildworld completed on Tue Oct 7 00:26:38 PDT 2014
amd64.amd64 buildworld completed on Tue Oct 7 01:08:50 PDT 2014
sparc64.sparc64 buildworld completed on Tue Oct 7 01:27:53 PDT 2014
pc98 completed on Tue Oct 7 01:42:43 PDT 2014
sparc64 completed on Tue Oct 7 02:24:54 PDT 2014
i386.i386 buildworld completed on Tue Oct 7 03:33:31 PDT 2014
amd64 completed on Tue Oct 7 04:04:43 PDT 2014
powerpc.powerpc buildworld completed on Tue Oct 7 04:32:53 PDT 2014
mips completed on Tue Oct 7 04:35:26 PDT 2014
arm completed on Tue Oct 7 04:43:14 PDT 2014
powerpc.powerpc64 buildworld completed on Tue Oct 7 05:19:41 PDT 2014
i386 completed on Tue Oct 7 05:26:22 PDT 2014
powerpc completed on Tue Oct 7 05:43:23 PDT 2014


make universe completed on Tue Oct 7 05:43:23 PDT 2014

(started Mon Oct  6 20:26:39 PDT 2014)
bz edited edge metadata.

The raw diff looks ok to me. Phabricator diffs are useless at this stage:(

This revision is now accepted and ready to land.Oct 7 2014, 5:43 PM
sbruno removed a reviewer: lstewart.

Committed at svn R272720