Page MenuHomeFreeBSD

Fix tryforward so that it can return ICMP errors to the correct origin, even in the face of IPFW and NAT.
Needs ReviewPublic

Authored by gnn on Feb 18 2016, 6:53 PM.

Details

Reviewers
vangyzen
Test Plan

Run through the VANILLA/pktgen-fullspread and IPFW/pktgen-ipfw-dropone

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 2551
Build 2568: arc lint + arc unit

Event Timeline

gnn updated this revision to Diff 13435.Feb 18 2016, 6:53 PM
gnn retitled this revision from to Fix tryforward so that it can return ICMP errors to the correct origin, even in the face of IPFW and NAT..
gnn updated this object.
gnn edited the test plan for this revision. (Show Details)
garga added a subscriber: garga.Feb 18 2016, 7:06 PM
vangyzen added inline comments.
sys/netinet/ip_fastfwd.c
428
mcopy = NULL;

to prevent a double-free in the "consumed" label.

437
mcopy = NULL;

to prevent a double-free in the "consumed" label.

468
mcopy = NULL;

to prevent a double-free in the "consumed" label.

gnn marked 3 inline comments as done.Feb 18 2016, 11:41 PM

Thanks.

cy added a subscriber: cy.Feb 19 2016, 12:05 AM

I tried this on my gateway running 10.2-STABLE but it keeps crashing, meaning I am losing all connectivity including IPMI. On the IPMI console I do not see any panic. In the logs I do not see anything unusual.

I backported this to 10.2-RELEASE and it runs ok. Mistakenly I was running the 10.2-STABLE kernel on a 10.2-RELEASE userland.

ae added a subscriber: olivier.Feb 19 2016, 9:09 AM
ae added a subscriber: ae.Feb 19 2016, 9:12 AM

Olivier, can you compare forwarding performance of 10.3 with and without this patch?

gnn added a comment.Feb 19 2016, 4:34 PM

What I found in my setup was that this reduced performance by 5% but that is a pull back from a 25% increase. Can you also compare this against plain forwarding on 10.2? The comparisons that matter are:

pre-tryforward vs. old fastforward vs. tryforward now

Note that fastforward shows this bug so the speed loss, though mildly annoying, is required to prevent the bug that was reported.

With the patch backported to 10.2-RELEASE I got a kernel panic today.
I will try to get a dump.

FreeBSD/amd64 (xxxxx) (ttyu0)

login: panic: sbflush_internal: cc 0 || mb 0xfffff801786a0100 || mbcnt 256
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff809442a0 at kdb_backtrace+0x60
#1 0xffffffff80907a06 at vpanic+0x126
#2 0xffffffff809078d3 at panic+0x43
#3 0xffffffff8097caab at sbflush_internal+0x7b
#4 0xffffffff8097cb72 at sbdestroy+0x12
#5 0xffffffff8097eb23 at sofree+0x183
#6 0xffffffff8097f086 at soclose+0x376
#7 0xffffffff808bc979 at _fdrop+0x29
#8 0xffffffff808bf21e at closef+0x21e
#9 0xffffffff808bee39 at fdescfree+0x4f9
#10 0xffffffff808cb739 at exit1+0x569
#11 0xffffffff808cb1ce at sys_sys_exit+0xe
#12 0xffffffff80cd2287 at amd64_syscall+0x357
#13 0xffffffff80cb796b at Xfast_syscall+0xfb
Uptime: 3m58s

gnn added a comment.Feb 19 2016, 5:18 PM

I'm not sure how that trace could be related to this but if I can get a core dump it might help.

Ok, still on this Atom-4core (bench on i5-8-core Chelsio is still on-going), here are the difference:

x 10.2 forwarding (pps)
+ 10.2 fastforwarding (pps)
* 10-stable-with D5330 forwarding (pps)
+--------------------------------------------------------------------------+
|                                                  *                       |
|                                                  *                       |
|                                                * *                       |
|xx   x                                         ** *               +++ +   |
|xx   x                                         ** *               +++ ++ +|
||MA_|                                                                     |
|                                                                  |_MA_|  |
|                                               |_A|                       |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10        381295      405347.5      384301.5     389779.85     10183.979
+  10        701671        736395      713076.5      714775.6     11127.311
Difference at 95.0% confidence
        324996 +/- 10021.8
        83.3793% +/- 2.57114%
        (Student's t, pooled s = 10666.1)
*  10        608019        624612        617817     617255.35     6681.9961
Difference at 95.0% confidence
        227476 +/- 8092.6
        58.36% +/- 2.0762%
        (Student's t, pooled s = 8612.85)

We should ask melifaro about the impact on projects/routing.

On a different hardware setup (Intel Xeon E5-2650, Chelsio 10-Gigabit T540-CR, NIC configured for using 4 queues in place of 8) I didn't have the same result at all !

First there is no visible impact with or without this D5330 patch:

x 10-stable r295802 tryforward (pps)
+ 10-stable r295802 with D5330 tryforward (pps)
+--------------------------------------------------------------------------+
|                                         x          x                 +   |
|x   x                       +  +  +    ++xxx x      x*           +    +  +|
|                  |__________________A____M______________|                |
|                                 |_____________M__A_________________|     |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10     2108683.5       2553645     2460830.8     2420369.1     160855.67
+  10     2343852.5       2718812     2497405.5     2529979.1     149422.27
No difference proven at 95.0% confidence

A 10-stable with D5330 tryforward is still faster than a 10.2 fastforward on this hardware setup:

x 10.2 fastforwarding (pps)
+ 10-stable r295802 with D5330 tryforward (pps)
+--------------------------------------------------------------------------+
|                                                                      +   |
|x      x  x     x   x    +   +  +   ++ x      x  x x+x          +     +  +|
|         |___________________AM___________________|                       |
|                              |______________M___A__________________|     |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10       2147676       2563598     2378921.8     2375603.1     160971.63
+  10     2343852.5       2718812     2497405.5     2529979.1     149422.27
Difference at 95.0% confidence
        154376 +/- 145923
        6.49839% +/- 6.14258%
        (Student's t, pooled s = 155304)

Then comparing 10.2 forwarding, 10.2 fastforwarding, and 10-stable-with D5330 tryforward is useless:

x 10.2 forwarding (pps)
+ 10.2 fastforwarding (pps)
* 10-stable r295802 with D5330 tryforward (pps)
+--------------------------------------------------------------------------+
|   x                                                          *   +     * |
|xx x x xxx xx                                      + +++ +*** *+ ++*  * **|
|  |___A___|                                                               |
|                                                     |_____AM_____|       |
|                                                            |___M_A____|  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  10        863031       1163347     1016572.8     1014444.7     103384.17
+  10       2147676       2563598     2378921.8     2375603.1     160971.63
Difference at 95.0% confidence
        1.36116e+06 +/- 127106
        134.178% +/- 12.5297%
        (Student's t, pooled s = 135278)
*  10     2343852.5       2718812     2497405.5     2529979.1     149422.27
Difference at 95.0% confidence
        1.51553e+06 +/- 120721
        149.395% +/- 11.9002%
        (Student's t, pooled s = 128482)
gnn added a comment.Feb 20 2016, 3:39 PM

I wonder if you can get some pmc results on this code to see what's causing the difference? I suspect caching issues, but that's just a guess. I'm going to move forward with a change to how this all works in HEAD, following the suggestion in the comment. that will take a while to work out because icmp_error() likes to free mbufs and if we have one on the stack that will not go well.

Here are the pmc stats files (gprof output) on the atom 4 cores:

I beleive what you are looking is on this extract:
without the patch:

             2739.00    20199.50   11095/11095       ip_input [8]
[9]     68.1 2739.00    20199.50   11095         ip_tryforward [9]
             1045.00    11134.29    5903/5903        ether_output [10]
              252.00     5141.23    2955/2955        ip_findroute [12]
              249.00     1401.00    1401/1401        in_localip [21]
              447.41      171.56     478/1500        __mtx_lock_sleep [19]
              345.00        0.00     345/699         bzero [34]
               13.00        0.00      13/34          in_rtalloc_ign [70]

With the patch:

             3044.00    20772.95   10986/10986       ip_input [8]
[9]     70.4 3044.00    20772.95   10986         ip_tryforward [9]
             1325.00    10408.61    5168/5168        ether_output [10]
              227.00     3890.40    2327/2327        ip_findroute [13]
              259.00     1189.00    1189/1189        in_localip [24]
              533.53      557.72     622/1610        uma_zalloc_arg <cycle 2> [20]
               83.87      550.67     384/902         m_freem [22]
              160.00      430.00     430/430         m_copydata [40]
              321.88      135.70     345/970         __mtx_lock_sleep [26]
              240.67      119.34     181/182         m_dup_pkthdr [44]
              335.00        0.00     335/652         bzero [37]
                5.56        0.00       5/9           in_rtalloc_ign [80]

Here are the pmc stats files (gprof output) on the Intel E5 8 cores:

ae added a comment.Feb 21 2016, 3:22 PM
In D5330#114389, @gnn wrote:

following the suggestion in the comment. that will take a while to work out because icmp_error() likes to free mbufs and if we have one on the stack that will not go well.

We can copy IP header + 8 bytes into on-stack buffer and then create mbuf only when this needed. But probably the difference will be very low.

g_amanakis_yahoo.com added a comment.EditedFeb 21 2016, 3:25 PM

On a vanilla 10.2-STABLE with the latest available kernel and applied patch I get this (ipfw+nat):

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...

Unread portion of the kernel message buffer:
panic: sbsndptr: sockbuf 0xfffff80011cbd6f8 and mbuf 0xfffff80064806d00 clashing
cpuid = 1
KDB: stack backtrace:
#0 0xffffffff8094b5d0 at kdb_backtrace+0x60
#1 0xffffffff8090de26 at vpanic+0x126
#2 0xffffffff8090dcf3 at panic+0x43
#3 0xffffffff80987710 at sbsndmbuf+0
#4 0xffffffff80a56ffc at tcp_output+0xe8c
#5 0xffffffff80a65048 at tcp_usr_send+0x3a8
#6 0xffffffff80989816 at sosend_generic+0x476
#7 0xffffffff80989a9c at sosend+0x3c
#8 0xffffffff8096d9d9 at soo_write+0x49
#9 0xffffffff80965787 at dofilewrite+0x87
#10 0xffffffff809654b8 at kern_writev+0x68
#11 0xffffffff80965443 at sys_write+0x63
#12 0xffffffff80cd994f at amd64_syscall+0x40f
#13 0xffffffff80cbebbb at Xfast_syscall+0xfb
Uptime: 1m57s
Dumping 427 out of 8134 MB:..4%..12%..23%..34%..42%..53%..64%..72%..83%..94%

Reading symbols from /boot/kernel/zfs.ko.symbols...done.
Loaded symbols for /boot/kernel/zfs.ko.symbols
Reading symbols from /boot/kernel/opensolaris.ko.symbols...done.
Loaded symbols for /boot/kernel/opensolaris.ko.symbols
Reading symbols from /boot/kernel/geom_eli.ko.symbols...done.
Loaded symbols for /boot/kernel/geom_eli.ko.symbols
Reading symbols from /boot/kernel/if_tap.ko.symbols...done.
Loaded symbols for /boot/kernel/if_tap.ko.symbols
Reading symbols from /boot/kernel/if_bridge.ko.symbols...done.
Loaded symbols for /boot/kernel/if_bridge.ko.symbols
Reading symbols from /boot/kernel/bridgestp.ko.symbols...done.
Loaded symbols for /boot/kernel/bridgestp.ko.symbols
Reading symbols from /boot/kernel/umass.ko.symbols...done.
Loaded symbols for /boot/kernel/umass.ko.symbols
Reading symbols from /boot/kernel/usb.ko.symbols...done.
Loaded symbols for /boot/kernel/usb.ko.symbols
Reading symbols from /boot/kernel/aio.ko.symbols...done.
Loaded symbols for /boot/kernel/aio.ko.symbols
Reading symbols from /boot/kernel/coretemp.ko.symbols...done.
Loaded symbols for /boot/kernel/coretemp.ko.symbols
Reading symbols from /boot/kernel/vmm.ko.symbols...done.
Loaded symbols for /boot/kernel/vmm.ko.symbols
Reading symbols from /boot/kernel/ehci.ko.symbols...done.
Loaded symbols for /boot/kernel/ehci.ko.symbols
Reading symbols from /boot/kernel/xhci.ko.symbols...done.
Loaded symbols for /boot/kernel/xhci.ko.symbols
Reading symbols from /boot/kernel/ipmi.ko.symbols...done.
Loaded symbols for /boot/kernel/ipmi.ko.symbols
Reading symbols from /boot/kernel/smbus.ko.symbols...done.
Loaded symbols for /boot/kernel/smbus.ko.symbols
Reading symbols from /boot/kernel/aesni.ko.symbols...done.
Loaded symbols for /boot/kernel/aesni.ko.symbols
Reading symbols from /boot/kernel/nmdm.ko.symbols...done.
Loaded symbols for /boot/kernel/nmdm.ko.symbols
Reading symbols from /boot/kernel/ums.ko.symbols...done.
Loaded symbols for /boot/kernel/ums.ko.symbols
Reading symbols from /boot/kernel/ukbd.ko.symbols...done.
Loaded symbols for /boot/kernel/ukbd.ko.symbols
Reading symbols from /boot/kernel/ulpt.ko.symbols...done.
Loaded symbols for /boot/kernel/ulpt.ko.symbols
Reading symbols from /boot/kernel/ipfw.ko.symbols...done.
Loaded symbols for /boot/kernel/ipfw.ko.symbols
Reading symbols from /boot/kernel/dummynet.ko.symbols...done.
Loaded symbols for /boot/kernel/dummynet.ko.symbols
Reading symbols from /boot/kernel/ipfw_nat.ko.symbols...done.
Loaded symbols for /boot/kernel/ipfw_nat.ko.symbols
Reading symbols from /boot/kernel/libalias.ko.symbols...done.
Loaded symbols for /boot/kernel/libalias.ko.symbols
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
219		__asm("movq %%gs:%1,%0" : "=r" (td)
(kgdb) backtrace
#0  doadump (textdump=<value optimized out>) at pcpu.h:219
#1  0xffffffff8090da82 in kern_reboot (howto=260)
    at /rex/src/freebsd/10/sys/kern/kern_shutdown.c:486
#2  0xffffffff8090de65 in vpanic (fmt=<value optimized out>, 
    ap=<value optimized out>)
    at /rex/src/freebsd/10/sys/kern/kern_shutdown.c:889
#3  0xffffffff8090dcf3 in panic (fmt=0x0)
    at /rex/src/freebsd/10/sys/kern/kern_shutdown.c:818
#4  0xffffffff80987710 in sbsndptr (sb=<value optimized out>, 
    off=<value optimized out>, len=<value optimized out>, 
    moff=<value optimized out>)
    at /rex/src/freebsd/10/sys/kern/uipc_sockbuf.c:1011
#5  0xffffffff80a56ffc in tcp_output (tp=0xfffff800643d0408)
    at /rex/src/freebsd/10/sys/netinet/tcp_output.c:1027
#6  0xffffffff80a65048 in tcp_usr_send (so=<value optimized out>, 
    flags=<value optimized out>, m=<value optimized out>, 
    nam=<value optimized out>, control=<value optimized out>, 
    td=<value optimized out>)
    at /rex/src/freebsd/10/sys/netinet/tcp_usrreq.c:918
#7  0xffffffff80989816 in sosend_generic (so=0xfffff80011cbd570, addr=0x0, 
    uio=0xfffffe023885a960, top=<value optimized out>, 
    control=<value optimized out>, flags=<value optimized out>, 
    td=0xffffffffffffffff) at /rex/src/freebsd/10/sys/kern/uipc_socket.c:1283
---Type <return> to continue, or q <return> to quit--- 
#8  0xffffffff80989a9c in sosend (so=0x0, addr=0x0, uio=0x0, top=0x0, 
    control=0x0, flags=0, td=0xfffff800645d7960)
    at /rex/src/freebsd/10/sys/kern/uipc_socket.c:1327
#9  0xffffffff8096d9d9 in soo_write (fp=<value optimized out>, 
    uio=0xfffffe023885a960, active_cred=<value optimized out>, 
    flags=<value optimized out>, td=<value optimized out>)
    at /rex/src/freebsd/10/sys/kern/sys_socket.c:103
#10 0xffffffff80965787 in dofilewrite (td=0xfffff800645d7960, fd=3, 
    fp=0xfffff80011c034b0, auio=0xfffffe023885a960, 
    offset=<value optimized out>, flags=0) at file.h:305
#11 0xffffffff809654b8 in kern_writev (td=0xfffff800645d7960, fd=3, 
    auio=0xfffffe023885a960) at /rex/src/freebsd/10/sys/kern/sys_generic.c:481
#12 0xffffffff80965443 in sys_write (td=<value optimized out>, 
    uap=<value optimized out>)
    at /rex/src/freebsd/10/sys/kern/sys_generic.c:396
#13 0xffffffff80cd994f in amd64_syscall (td=0xfffff800645d7960, traced=0)
    at subr_syscall.c:141
#14 0xffffffff80cbebbb in Xfast_syscall ()
    at /rex/src/freebsd/10/sys/amd64/amd64/exception.S:396
#15 0x0000000803492e7a in ?? ()
Previous frame inner to this frame (corrupt stack?)
Current language:  auto; currently minimal
ae added a comment.Feb 21 2016, 4:42 PM

On a vanilla 10.2-STABLE with the latest available kernel and applied patch I get this (ipfw+nat):

GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
Unread portion of the kernel message buffer:
panic: sbsndptr: sockbuf 0xfffff80011cbd6f8 and mbuf 0xfffff80064806d00 clashing

I think it is unrelated problem, similar to this one:
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=148807

@ae
I think you are right. I am currently trying with "hw.em.enable_msix=0" and it seems to work ok!

Regarding my previous comment, I was running 10.2-RELEASE and had not turned fastforwarding on.
The bug still persists. I think there might be a problem with the fastforwarding routine as this does not happen when fastforwarding is off.
Steps to reproduce:

  1. WAN-interface: route change default -mtu 1196
  2. Browse some pages with a client machine
  3. Try to access the gateway via ssh on the LAN-interface.
ae added subscribers: erj, sbruno.Feb 22 2016, 10:47 AM

I am trying to find where the problem lies but without success so far.
However if I add the following right after line 311 ("passin :") the bug dissappears:

passin:
         dchg = (odest.s_addr != ip->ip_dst.s_addr);
         if (m->m_flags & M_IP_NEXTHOP) {
                 dchg = (m_tag_find(m, PACKET_TAG_IPFORWARD, NULL) != NULL);
                 if (dchg != 0) {
                         /*
                          * Directly ship the packet on.  This allows
                          * forwarding packets originally destined to us
                          * to some other directly connected host.
                          */
                         ip_forward(m, 1);
                         return NULL;
                 }
         }
         ip_forward(m, dchg);
         return NULL;

This way ip_forward() and ip_output() take over the outgoing path.

There are also some other minor discrepancies with the fastforwarding code:
e.g. before the PFIL hooks for outgoing packets the following has to be done:

odest.s_addr = dest.s_addr = ip->ip_dst.s_addr;
g_amanakis_yahoo.com added a comment.EditedFeb 27 2016, 4:01 AM

After a lot of testing I found the culprit:
removal of the lines 515-516 resolves my bug.

515  if (mcopy)
516                m_freem(mcopy);

the question is why?

sys/netinet/ip_fastfwd.c
515–516

Everytime consumed is called, mcopy has already been freed. This check seems harmless but causes a bug with unspecific errors resulting in kernel panics in case the route mtu is set to 576-1196.

Possible patch:

        else {
                counter_u64_add(ro.ro_rt->rt_pksent, 1);
                IPSTAT_INC(ips_forward);
                IPSTAT_INC(ips_fastforward);
        }
        m_freem(mcopy);
        RTFREE(ro.ro_rt);
        return NULL;
consumed:
        RTFREE(ro.ro_rt);
        return NULL;
drop:
        if (m)
                m_freem(m);
        if (mcopy)
                m_freem(mcopy);
        if (ro.ro_rt)
                RTFREE(ro.ro_rt);
        return NULL;
}
sys/netinet/ip_fastfwd.c
361

odest.s_addr = dest.s_addr = ip->ip_dst.s_addr;

Another issue not addressed by fastforwarding is the change of the routing table through pfil (i.e. setfib in ipfw). After the outgoing pfil hook there should be a check to see if the routing table has changed. If it has ip_findroute() should be called to determine the new route.