Problems with routing subsystem
The initial routing KPI was introduced back in 1980. It was a nice generic approach back then, as no one knew how the protocols would evolve. It has been enormously successful, as it has survived for 20+ years.
However, this KPI does not protect subsystem internals from outside users. The data structures of the lookup algorithm and details of its implementation, such as the usage of sockaddrs, have to be known by all of the subsystem users. Transport protocols, packet filters and event drivers know the exact layout of struct rtentry, with all its locking, radix data structures and counters. As a result, making changes is hard, leading to compromises and piling up technical debt. The IPv6 lookup scope embedding is an example of such a case.
Internally, the lookup algorithm is deeply embedded in the subsystem, making it hard to provide custom, more performant algorithms for different use cases. For example, the existing radix tree is highly cache-inefficient for large numbers of routes. It is also not optimally suited for sparse IPv6 tree lookups.
The goal is to bring scalable multipath routing, enabled by default.
As the change is rather large, another goal of this change is to provide a new, clean, explicitly-defined routing KPI, better suited for existing and upcoming features, while keeping the implementation details hidden. Userland compatibility is also a requirement: no userland utilities or routing daemon changes are required for the existing functionality to work.
10,000-foot view
This patch introduces the concept of nexthops: objects containing the information necessary to perform the packet output decision. The output interface, MTU, flags and gateway address go there. In most cases, these objects will serve the same role that struct rtentry currently serves. Typically there will be low tens of such objects even for a router with multiple BGP full views, as these objects are shared between routing entries.
The change also introduces the concept of nexthop groups. These groups are basically arrays of nexthop pointers, optimized for fast lookup with respect to the relative nexthop weights. This part is the workhorse of the efficient multipath routing implementation.
With these changes, the lookup algorithm result is a pointer to either a nexthop group or a nexthop. All dataplane lookup functions return a pointer to the nexthop object, keeping nexthop group details inside the routing subsystem.
There is a good presentation on the recent introduction of similar functionality in Linux: "Routes with Nexthop Objects", Linux Plumbers Conference.
All dataplane lookup functions take a flowid parameter used to choose the appropriate nexthop; it is up to the caller to provide this flowid.
Forwarding with multipath is currently based on the NIC-generated m->m_pkthdr.flowid.
Locally-originated packets are trickier. Currently, no flowid is calculated for locally-originated connections. Additionally, in many cases determining the local address is done by performing a routing lookup, which introduces a catch-22-like situation. To deal with it, a hash is computed over the remote address and port. The code uses the existing Toeplitz hash implementation to generate flowids. As a performance optimization, the hash is calculated only if there is at least one multipath route in the system.
Inbound connections: multipath relies on the RSS kernel option to generate the flowid and store it in inp->inp_flowid.
No userland changes are required for this patch. All userland utilities and routing daemon should continue to work as is.
The nexthop and nexthop group data structures are split into immutable dataplane parts (struct nhop_object, struct nhgrp_object) and control plane parts (struct nhop_priv, struct nhgrp_priv). The refcount(9) KPI is used to track references, backed by epoch(9) deferred reclamation.
This allows avoiding nexthop refcounting in the dataplane code (except for the route caching part).
struct rtentry is also backed by epoch(9), simplifying the control plane code by avoiding refcounting.
- Proper scoped implementation for IPv6 and IPv4 (RFC ..)
- rtsock extensions to support nexthops in routing daemons
- Custom lookup algorithms for IPv4/IPv6
The datapath now has a per-AF set of lookup functions, which explicitly accept an address and scope instead of sockaddrs.
Old KPI:

    rt = rtalloc1_fib(dst_sa, 0, 0, fibnum);

New KPI:

    nh = fib4_lookup_nh_ptr(fibnum, addr, scope, NHR_NONE, m->m_pkthdr.flowid);
    nh = fib6_lookup_nh_ptr(fibnum, &addr, scope, NHR_NODE, m->m_pkthdr.flowid);
With multipath support, the notification mechanism has to be extended, as a single replace operation may result in multiple single-path changes that have to be propagated to each routing change subscriber (including rtsock). For example, a routing daemon may want to change the nexthop group for a certain prefix, resulting in tens of individual paths being added and withdrawn from the external observer's point of view.
Most struct rtentry field accesses are done via getter functions, allowing the exact structure layout to be hidden in the routing core.
- No individual per-rtentry traffic counters; they have moved to per-nexthop counters.
- No missmsg functionality (a failure to perform a route lookup leading to an rt_missmsg message). This old functionality was implemented under the assumption that routing software would calculate and install routes on kernel request. Currently there is no demand for such a feature.
- Dependent object refcounting. Currently an rtentry only references its ifa. To be on the safe side and to better deal with departing interfaces, each nexthop references the ifa, the transmit interface and the source address interface. With route caching this can potentially lead to increased memory consumption (an old PCB references a nhop, which references stale interfaces/ifas). However, in reality we either have a system with lots of PCBs but a small number of interfaces/nhops, or a system with a large number of interfaces/nhops but a small number of PCBs.
- Performance might be worse in the use case with lots of short-lived TCP connections sharing the same nexthop. Previously, splitting the route (into 0/1 and 128/1) helped. With this change, the workaround will require more effort, as the contention will be at the nhop level.
- Multipath requires flowid generation for outbound connections, which adds some CPU cycles to each connection setup or sendto() call.