
Re-add route caching for TCP
Closed, Public

Authored by karels on Nov 28 2015, 10:37 PM.

Details

Summary

FreeBSD previously provided route caching for TCP (and UDP). Re-add
route caching for TCP, with some improvements. In particular, invalidate
the route cache if a new route is added, which might be a better match.
The cache is automatically invalidated if the old route is deleted.
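For context, the core of the mechanism can be sketched as below. This is an illustrative sketch only: the inp_rt_cookie and inp_route inpcb fields appear in this diff (see the build output quoted later in this review), but the helper function and the generation accessor are hypothetical names, not the literal patch contents.

/*
 * Hedged sketch of per-connection route caching with generation-based
 * invalidation.  inp_rt_cookie and inp_route are fields this diff adds
 * to struct inpcb; inp_cached_route() and rt_tables_get_gen() are
 * illustrative names.  Assumes ro->ro_dst was filled in by the caller.
 */
static struct rtentry *
inp_cached_route(struct inpcb *inp, u_int fibnum)
{
	struct route *ro = &inp->inp_route;
	rt_gen_t gen = rt_tables_get_gen(fibnum, AF_INET);

	/* Reuse the cached route only if it is still up and no route
	 * change has bumped the generation since it was cached. */
	if (ro->ro_rt != NULL && (ro->ro_rt->rt_flags & RTF_UP) != 0 &&
	    inp->inp_rt_cookie == gen)
		return (ro->ro_rt);

	/* A route was added or deleted: drop the stale reference and
	 * repeat the lookup, refreshing the cookie. */
	if (ro->ro_rt != NULL) {
		RTFREE(ro->ro_rt);
		ro->ro_rt = NULL;
	}
	in_rtalloc_ign(ro, 0, fibnum);
	if (ro->ro_rt != NULL)
		inp->inp_rt_cookie = gen;
	return (ro->ro_rt);
}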

Test Plan

The IPv4 code originated in McAfee Firewall Enterprise, where it has been
in use for some years. The ARP code was reworked for FreeBSD 11. The IPv6
routing and NDP code is similar. The code has been tested with route changes,
using a debugger to verify that it operates as expected.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

karels retitled this revision from to Re-add route caching for TCP.
karels updated this object.
karels edited the test plan for this revision. (Show Details)
karels added reviewers: gnn, melifaro, bz.

I've had a patch to do this floating around for about 5 years (hence the otherwise-vestigial rnh_gen). I hope you're successful in pushing in something that provides this functionality. I brought it up with melifaro@ about a year ago, and he said that he had something better going in shortly.

Looking forward to continuing improvements.

In D4306#90594, @kmacy wrote:

I've had a patch to do this floating around for about 5 years (hence the otherwise-vestigial rnh_gen). I hope you're successful in pushing in something that provides this functionality. I brought it up with melifaro@ about a year ago, and he said that he had something better going in shortly.

Looking forward to continuing improvements.

Thanks, Kip. I was surprised to find rnh_gen already there. I'm happy to have something better, but this is (IMO) the least we can do.

melifaro requested changes to this revision. Edited Nov 29 2015, 6:28 AM
melifaro edited edge metadata.

I believe this is the wrong way of doing things.

(The reasoning for re-implementing route caching was not specified, so I'll assume the default: "routing is slow". Please correct me if I'm wrong.)

First of all, this looks like the wrong approach in general, because it adds hacks in other layers instead of making lltable/routing fast (which is perfectly possible).
Second, it creates another bunch of layering violations that make it much harder to hack on the routing/lltable code (think about multipath, changing radix to something different, changing rte, having media without arp/nd, etc.).

Please take a look at https://wiki.freebsd.org/ProjectsRoutingProposal for more detailed reasoning.
Some bits of it (mostly the lltable changes) have already landed in head.

This revision now requires changes to proceed. Nov 29 2015, 6:28 AM

Thanks for the pointer to your wiki page; I wasn't aware of it before now. It sounds like you have some interesting work in progress. However, I don't agree with your conclusion that caching the route and lle is "wrong" or a "hack". The interface to route and lle validation could be encapsulated, but the ip*output functions need to look up routes and extract information from them, and it will always be faster to cache the value than to repeat even a "fast" lookup.

Btw, I disagree with your statement that radix details must be known by the users of routing. The route structure did not change when routing was changed from a hash table to a radix tree.

Okay, let me be a bit more specific. I'm against taking internal structures from the code doing lookups/prepends, locking them somehow, and pretending that it is "caching". Caching in general might not be that bad, but I'd like to see the difference between the lookup and the cached approach. I did some tests a year ago on how much the radix actually slows things down: I customised the lookup function to always return the same route, and I didn't see any significant increase in performance on the projects/routing branch (which probably means we have other, more significant problems, like NUMA, or batching mbufs, or ...). Also note that it should be possible to change the actual lookup to something other than radix (say, an array/hash optimised for the typical "1 interface, 1 [default] route" case).
Also, theoretically you can cache/reference the nexthop structure returned by one of the fib_* lookup functions, but again, I'm not sure it is worth the effort.

The route structure did change. The leading "struct radix_node rt_nodes[2]" (occupying 96 bytes) is what I'm talking about.
Every direct rtentry user has to know both that the offset to the real data is 2x rt_nodes and the actual rt_nodes internals, in order to extract the rt_key sockaddr.

Thanks, I think I understand your position better now. But you are objecting mostly to the current API and data structures, not to using them. What I mean is that "struct route" *is* the existing mechanism to get and hold a reference to an rtentry. The rtentry is the existing entity that contains the exported information, along with "private" management information, and it includes a reference count and a bit (RTF_UP) to report invalidation of the route to the remaining references. (And the route structure did not change for radix; it was the rtentry that changed.) So, while you apparently want to replace the API and implementation, we should take full advantage of what's there now, unless and until something new arrives in head. The same points apply to the llentry.
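(For reference, the existing structures under discussion, simplified from net/route.h of that era; the layout is abridged and the comments are editorial, so treat this as a sketch rather than the authoritative definitions:)

struct route {
	struct rtentry *ro_rt;		/* cached route entry, if any */
	struct llentry *ro_lle;		/* cached L2 entry, if any */
	/* ... prepend/flags fields elided ... */
	struct sockaddr ro_dst;		/* destination that was looked up */
};

struct rtentry {
	struct radix_node rt_nodes[2];	/* the radix internals melifaro objects to */
	/* ... */
	int rt_flags;			/* RTF_UP is cleared when the route is deleted */
	/* ... */
	long rt_refcnt;			/* references held by callers/cachers */
	/* ... */
};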

I have thought about such things as multipath. That will clearly require changes in the implementation, and probably the API. Some users of the API will want a reference to a single path that does not mix up their packets, and some will want to know about the multiple paths. I don't think the "caching" change significantly changes the difficulty of implementing these other things; the callers (like ip*output) will have to change anyway. You also mentioned media without ARP (well, they won't use the lle unless they use it for circuit lookup, etc.) and changing from radix to something else (which may or may not require changes to the API).

My changes resulted in a substantial performance gain for my application (a proxy-based firewall with some tricks to avoid copying). I'm afraid I don't remember the number, and the workload is not relevant to most people, other than that TCP was a significant part of the application. I think we should benefit from the existing infrastructure (don't call it caching if you don't want to) while other things are being designed and implemented.

The existing poor performance is due in large part to doing expensive operations on a per-packet basis. That is why TSO, even software TSO (speaking of hacks), yields such a large improvement. This approach is much better.

I'd also like to point out that this is about hosts communicating via TCP, not FreeBSD acting as a router. As such, this change, or something quite similar, ought to go into the tree irrespective of any potential reworking of the routing system to increase FreeBSD's speed as a router.

Please note that the fastest route lookup is the one you never do. You'll never get faster than that.

Netflix uses the flowtable to address routing lookup overhead, rather than per-connection caching, which is obviously cheaper, particularly for their long-lived sessions. This is a bit ironic, in that in the process of maintaining the flowtable, the functionality it was actually intended to support (stateful ECMP) has effectively been removed.

I've posted my variation on inpcb route caching to D4364. I believe it's been approved in the past.

To mike-karels:
What I'm saying is that "take full advantage of what's there now" actually means "make things harder for those who want to fix routing".
For the routing work you need to hide llentries/rtes, and your changes expose them again.
Passing 'struct rtentry' is not used in the hot path, and there are no ro_lle consumers other than the af_inet flowtable for ethernet.
The last bit of change needed to start the projects/routing merge to HEAD is in D4102, and it is exactly about eliminating ro_lle.

About multipath: I have also thought about it; that's why projects/routing has native and opaque support for it for most users.
I do think that caching rtes/lles will change the complexity significantly.

With the projects/routing API changes it is possible to implement, for example:
  • fast tunnels (where the fib lookup returns next-hop data containing the _full_ header prepend for the tunnelled interface and the "real" interface);
  • L2 multipath (e.g. the route lookup selects the "proper" L2 interface under lagg without the need to call lagg_transmit());
  • MPLS support, without significant data-path changes.

Passing bare "struct rte" and "struct lle" pointers makes these things much harder. You're going to "fix" TCP, but what about UDP, raw IP, and routing?

Btw, it would be great to see the numbers, so we can gauge the potential benefits of doing this.

To gnn:
projects/routing is not about "improving forwarding"; it is about improving any rtable/lltable lookup, regardless of whether it is for TCP, raw IP, or forwarding. This change optimises TCP-only workloads while keeping other things slow and making them harder to change in general.

To kmacy:
Yes, avoiding the route lookup is of course faster. The question is "how much benefit can we get from avoiding it, and is it worth doing".
Also, I'm not against caching at all; I'm against the proposed caching implementation.

kbowling added a project: transport.
kbowling added a subscriber: kbowling.

I do not believe that this, or Matt's, change should make doing the right thing with the routing system any more difficult than it already is. I think it's time we pushed either this change or the associated change into the tree. The other routing work is longer term.

The only thing that stops the projects/routing work from landing is D4102, which conflicts with this particular patch.

karels edited edge metadata.

This revision does several things:

  • updates to current HEAD, including L2 header precomputation
  • removes L2 caching for now, pending additional discussions
  • adds UDP v4/v6 route caching, at bz's suggestion
  • fills in a few missing bits for completeness
gnn edited edge metadata.

D4102 has already gone in. It's time to push this in as well.

karels edited edge metadata.

Update to match head; fix UDP/IPv4 locking.
Move definitions to match their movement in the header files.

gnn added a subscriber: olivier.

I've added Olivier as a reviewer because I'm hoping he'll test this patch and give us some updated performance numbers.

I've synced my -head source to r296935, then applied this patch without problem, but buildworld failed:

--- net.o ---
In file included from /usr/local/BSDRP/TESTING/FreeBSD/src/lib/libstand/net.c:51:
/usr/obj/TESTING.amd64/usr/local/BSDRP/TESTING/FreeBSD/src/tmp/usr/include/netinet/in_pcb.h:243:2: error: unknown type name 'rt_gen_t'
        rt_gen_t        inp_rt_cookie;  /* generation for route entry */
        ^
/usr/obj/TESTING.amd64/usr/local/BSDRP/TESTING/FreeBSD/src/tmp/usr/include/netinet/in_pcb.h:245:16: error: field has incomplete type 'struct route'
                struct route inpu_route;
                             ^
/usr/obj/TESTING.amd64/usr/local/BSDRP/TESTING/FreeBSD/src/tmp/usr/include/netinet/in_pcb.h:245:10: note: forward declaration of 'struct route'
                struct route inpu_route;
                       ^
--- all_subdir_lib/libwrap ---

(...)

--- udp.o ---
In file included from /usr/local/BSDRP/TESTING/FreeBSD/src/lib/libstand/udp.c:51:
/usr/obj/TESTING.amd64/usr/local/BSDRP/TESTING/FreeBSD/src/tmp/usr/include/netinet/in_pcb.h:243:2: error: unknown type name 'rt_gen_t'
        rt_gen_t        inp_rt_cookie;  /* generation for route entry */
        ^
/usr/obj/TESTING.amd64/usr/local/BSDRP/TESTING/FreeBSD/src/tmp/usr/include/netinet/in_pcb.h:245:16: error: field has incomplete type 'struct route'
                struct route inpu_route;
                             ^
/usr/obj/TESTING.amd64/usr/local/BSDRP/TESTING/FreeBSD/src/tmp/usr/include/netinet/in_pcb.h:245:10: note: forward declaration of 'struct route'
                struct route inpu_route;
                       ^
--- all_subdir_lib/libcuse ---

Argh, I see what I did. I ended up with the wrong version of route.h in /usr/include.

karels edited edge metadata.

Fix the user-level include of in_pcb.h:
rt_gen_t was inside an #ifdef _KERNEL.
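(Illustratively, the shape of that fix; the exact definition of rt_gen_t shown here is an assumption, so treat this as a sketch of the header change rather than the literal diff:)

/* Before: the typedef was only visible to kernel code, so a userland
 * include of netinet/in_pcb.h hit "unknown type name 'rt_gen_t'". */
#ifdef _KERNEL
typedef volatile u_int rt_gen_t;	/* route-table generation count */
/* ... other kernel-only declarations ... */
#endif

/* After: the typedef sits outside the kernel-only block, so struct
 * inpcb, which userland can see, may use it. */
typedef volatile u_int rt_gen_t;
#ifdef _KERNEL
/* ... other kernel-only declarations ... */
#endif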

In my standard forwarding scenario (2 static routes), I saw neither improvement nor regression with this patch (about -1% with pf or ipfw):

This change does not affect forwarding, just TCP and UDP termination. It makes the most difference when TSO is not enabled.

In D4306#121368, @mike-karels.net wrote:

This change does not affect forwarding, just TCP and UDP termination. It makes the most difference when TSO is not enabled.

I'm not used to running end-host TCP/UDP benchmarks: do you have a TCP/UDP benchmark tool/method/scenario that I can use?
Because, if I understand your changes correctly, measuring the benefit of this patch means running a TCP benchmark and, during the run, sending updates to the routing table that change a next hop currently used to reach the remote TCP end-point. Right?
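(For illustration, a minimal no-offload end-host run along the lines gnn uses below; the interface name and address are examples, not part of the patch or its test plan:)

# On both hosts: disable offloads so the per-packet output path is
# actually exercised (em0 is an example interface).
ifconfig em0 -tso -lro -txcsum -rxcsum
sysctl net.inet.tcp.tso=0

# Receiver:
iperf -s

# Sender: several 30-second runs, results fed to ministat(1).
iperf -c 192.0.2.1 -t 30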

I'll give this a benchmark in the Sentex lab.

Having applied this to HEAD, I ran the following test:

https://github.com/gvnn3/netperf/tree/master/VANILLA/Tests/iperf-twohost-nooffload

which turns off the TCP offload features, and got the following results:

ministat noroutecache.txt routecache.txt
x noroutecache.txt
+ routecache.txt
(ministat's ASCII distribution plot elided; the "+" samples cluster well above the "x" samples)

    N           Min           Max        Median           Avg        Stddev
x   7          3.48          4.23          3.52          3.81    0.38686776
+   7          4.58          7.37          6.55     6.2857143     1.0756836
Difference at 95.0% confidence
        2.47571 +/- 0.94147
        64.9794% +/- 24.7105%
        (Student's t, pooled s = 0.80832)

I am therefore going to commit this code.

gnn edited edge metadata.
This revision was automatically updated to reflect the committed changes.