
Implementation of scalable multipath & nexthop objects introduction.

Authored by melifaro on Mar 21 2020, 11:02 AM.


NOTE: Patch is against r358990.
NOTE: TESTING: build (amd64) kernel with the patch, try adding multiple routes for the same prefix.


Problems with routing subsystem

The initial routing KPI was introduced back in 1980. It was a nice generic approach back then, as no one knew how the protocols would evolve. It has been enormously successful, as it was able to survive for 20+ years.

However, this KPI does not try to protect subsystem internals from outside users. Data structures of the lookup algorithm and details of its implementation, such as the usage of sockaddrs, have to be known by all of the subsystem users. Transport protocols, packet filters and event drivers know the exact layout of struct rtentry with all its locking, radix data structures and counters. As a result, making changes is hard, leading to compromises and piling up technical debt. IPv6 lookup scope embedding is an example of such a case.

Internally, the lookup algorithm is deeply embedded in the subsystem, making it hard to have custom, more performant algorithms for different use cases. For example, the existing radix is highly cache-inefficient for a large number of routes. It is also not optimally suited for sparse IPv6 tree lookups.


The goal is to bring scalable multipath routing, enabled by default.
As the change is rather large, another goal is to provide a new, clean, explicitly defined routing KPI, better suited for the existing and upcoming features, while keeping the implementation details hidden. Userland compatibility is also a requirement: no userland utility or routing daemon changes are required for the existing functionality to work.

10000 feet view

This patch introduces the concept of nexthops: objects containing the information necessary for performing the packet output decision. Output interface, MTU, flags and gateway address go there. In most cases, these objects will serve the same role that struct rtentry currently serves. Typically there will be low tens of such objects on a router even with multiple BGP full views, as these objects are shared between routing entries.
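
The sharing described above can be pictured with a small userland sketch. It is not the patch's code: struct nh_sketch, nh_get() and the fixed-size table are hypothetical names invented for illustration, and the real kernel objects carry more state. The point is only that routes with identical output parameters end up pointing at one object.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical nexthop record holding the fields the text lists. */
struct nh_sketch {
	uint32_t ifindex;	/* output interface */
	uint32_t mtu;
	uint32_t flags;
	uint32_t gw;		/* IPv4 gateway address, host byte order */
	uint32_t refs;		/* number of routes sharing this nexthop */
};

#define	NH_MAX	64
static struct nh_sketch nh_table[NH_MAX];
static size_t nh_count;

/*
 * Return the existing nexthop with identical parameters and take a
 * reference on it, or create a new one with a single reference.
 */
static struct nh_sketch *
nh_get(uint32_t ifindex, uint32_t mtu, uint32_t flags, uint32_t gw)
{
	struct nh_sketch *nh;

	for (size_t i = 0; i < nh_count; i++) {
		nh = &nh_table[i];
		if (nh->ifindex == ifindex && nh->mtu == mtu &&
		    nh->flags == flags && nh->gw == gw) {
			nh->refs++;
			return (nh);
		}
	}
	assert(nh_count < NH_MAX);	/* the sketch uses a fixed-size table */
	nh = &nh_table[nh_count++];
	*nh = (struct nh_sketch){ ifindex, mtu, flags, gw, 1 };
	return (nh);
}
```

With this shape, a full BGP view that resolves to a handful of gateways produces a handful of nh_sketch entries, each with a large refs count, rather than one record per prefix.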

The change also introduces the concept of nexthop groups. These groups are basically arrays of nexthop pointers optimized for fast lookup w.r.t. relative nexthop weights. This part is the workhorse of the efficient multipath routing implementation.
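
One way to get a constant-time weighted choice out of a plain array is to repeat each nexthop in proportion to its weight and index the array with the flow hash. The sketch below assumes that layout; the patch may compile groups differently, and every name here (struct wnh, struct grp_sketch, grp_compile(), grp_select(), GRP_MAX_SLOTS) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	GRP_MAX_SLOTS	64

struct wnh {			/* one weighted path (hypothetical) */
	int		nh_id;
	uint32_t	weight;
};

struct grp_sketch {		/* compiled group: flat slot array */
	int	slots[GRP_MAX_SLOTS];
	size_t	nslots;
};

static uint32_t
gcd32(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;
		a = b;
		b = t;
	}
	return (a);
}

/* Expand weights into slots; returns 0 on success, -1 if they do not fit. */
static int
grp_compile(struct grp_sketch *g, const struct wnh *nh, size_t n)
{
	uint32_t d = 0;
	uint64_t total = 0;

	if (n == 0)
		return (-1);
	for (size_t i = 0; i < n; i++) {
		if (nh[i].weight == 0)
			return (-1);
		d = gcd32(d, nh[i].weight);
	}
	/* Dividing by the common divisor keeps the array small. */
	for (size_t i = 0; i < n; i++)
		total += nh[i].weight / d;
	if (total > GRP_MAX_SLOTS)
		return (-1);
	g->nslots = 0;
	for (size_t i = 0; i < n; i++)
		for (uint32_t k = 0; k < nh[i].weight / d; k++)
			g->slots[g->nslots++] = nh[i].nh_id;
	return (0);
}

/* Dataplane side: a single modulo and an array load. */
static int
grp_select(const struct grp_sketch *g, uint32_t flowid)
{
	assert(g->nslots > 0);
	return (g->slots[flowid % g->nslots]);
}
```

For weights 30 and 10 the compiled array has four slots, three of which point at the first nexthop, so uniformly distributed flowids split the traffic 3:1.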

With these changes, the result of the lookup algorithm is a pointer to either a nexthop group or a nexthop. All dataplane lookup functions return a pointer to the nexthop object, leaving nexthop group details inside the routing subsystem.
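
A rough illustration of that boundary, under the assumption that the lookup result carries an explicit tag: the routing core sees "nexthop or group", while callers only ever receive a nexthop. The types and the resolve_sketch() helper are hypothetical and do not mirror the layout used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct nhop_sketch {		/* what dataplane callers get back */
	int	ifindex;
};

struct nhgrp_sketch {		/* stays inside the routing core */
	size_t				 nslots;
	const struct nhop_sketch	*slots[8];
};

struct lookup_result {		/* what the lookup algorithm returns */
	int	is_group;
	union {
		const struct nhop_sketch	*nh;
		const struct nhgrp_sketch	*grp;
	} u;
};

/* Collapse the lookup result to a single nexthop using the flowid. */
static const struct nhop_sketch *
resolve_sketch(const struct lookup_result *r, uint32_t flowid)
{
	if (!r->is_group)
		return (r->u.nh);
	assert(r->u.grp->nslots > 0);
	return (r->u.grp->slots[flowid % r->u.grp->nslots]);
}
```

Callers such as transport protocols would only see the return value of resolve_sketch(), so the group representation can change without touching them.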

There is a good presentation on the recent introduction of similar functionality in Linux: Routes with Nexthop Objects - Linux Plumbers Conference.


NOTE: multipath functionality is controlled by ROUTE_MPATH kernel option (added to GENERIC amd64 in this patch).

All dataplane lookup functions have a flowid parameter to choose the appropriate nexthop. It is up to the caller to provide this flowid.

Forwarding with multipath is currently based on NIC-generated m->m_pkthdr.flowid.

Locally-originated packets are trickier. Currently, flowid is NOT calculated for any locally-originated connections. Additionally, in many cases the local address is determined by performing a routing lookup, which introduces a catch-22 situation. To deal with it, the hash is computed over the remote address and port. The code uses the existing Toeplitz hash implementation to generate flowids. As a performance optimization, the hash is calculated only if there has been at least one multipath route in the system.

Inbound connections. Multipath relies on the RSS kernel option to generate and store the flowid in inp->inp_flowid.
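
For the locally-originated case, the flowid calculation can be sketched in userland as a Toeplitz hash over the remote address and port, skipped entirely while no multipath route exists. This is an illustration only: the key below is the commonly published sample RSS key, used here purely as example data, and mpath_routes_sketch, toeplitz_sketch() and calc_flowid_sketch() are hypothetical names rather than the kernel's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Example 40-byte key; any key of at least 4 bytes works for the sketch. */
static const uint8_t rss_key_sketch[40] = {
	0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
	0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
	0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
	0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
	0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
};

/* Hypothetical counter of multipath routes currently installed. */
static unsigned int mpath_routes_sketch;

/*
 * Toeplitz hash: for every set input bit, XOR in the 32-bit window of
 * the key that starts at that bit position.
 */
static uint32_t
toeplitz_sketch(const uint8_t *key, size_t keylen,
    const uint8_t *data, size_t datalen)
{
	uint32_t hash = 0, v;

	assert(keylen >= 4);
	v = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | (uint32_t)key[3];
	for (size_t i = 0; i < datalen; i++) {
		for (int b = 0; b < 8; b++) {
			if (data[i] & (0x80 >> b))
				hash ^= v;
			/* Slide the window one key bit to the right. */
			v <<= 1;
			if (i + 4 < keylen && (key[i + 4] & (0x80 >> b)))
				v |= 1;
		}
	}
	return (hash);
}

/* Hash remote IPv4 address and port; skip the work without multipath. */
static uint32_t
calc_flowid_sketch(uint32_t raddr, uint16_t rport)
{
	uint8_t buf[6];

	if (mpath_routes_sketch == 0)
		return (0);
	buf[0] = (uint8_t)(raddr >> 24);
	buf[1] = (uint8_t)(raddr >> 16);
	buf[2] = (uint8_t)(raddr >> 8);
	buf[3] = (uint8_t)raddr;
	buf[4] = (uint8_t)(rport >> 8);
	buf[5] = (uint8_t)rport;
	return (toeplitz_sketch(rss_key_sketch, sizeof(rss_key_sketch),
	    buf, sizeof(buf)));
}
```

Because only the remote endpoint is hashed, the flowid is available before the local address has been chosen, which is what breaks the circular dependency on the routing lookup.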

Backward compatibility

No userland changes are required for this patch. All userland utilities and routing daemons should continue to work as is.

Locking model

Nexthop and nexthop group data structures are split into immutable dataplane parts (struct nhop_object, struct nhgrp_object) and control plane parts (struct nhop_priv, struct nhgrp_priv). The refcount(9) KPI is used to track references, backed by epoch(9) deferred reclamation.
This allows avoiding nexthop refcounting in the dataplane code (except for the route caching part).
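
The split and the deferred free can be mimicked in a single-threaded userland sketch. It assumes C11 atomics in place of refcount(9) and a plain pending list in place of an epoch(9) callback queue; the struct fields and the *_sketch function names are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

struct nhop_priv_sketch;

struct nhop_object_sketch {	/* dataplane part: written once at creation */
	uint32_t		 ifindex;
	uint32_t		 mtu;
	struct nhop_priv_sketch	*priv;
};

struct nhop_priv_sketch {	/* control plane part */
	atomic_uint			 refcnt;
	struct nhop_object_sketch	*nh;
	struct nhop_priv_sketch		*next_free;
};

/* Stand-in for the epoch callback queue (not thread-safe in this sketch). */
static struct nhop_priv_sketch *pending_free;

static struct nhop_object_sketch *
nhop_alloc_sketch(uint32_t ifindex, uint32_t mtu)
{
	struct nhop_object_sketch *nh = calloc(1, sizeof(*nh));
	struct nhop_priv_sketch *priv = calloc(1, sizeof(*priv));

	assert(nh != NULL && priv != NULL);
	nh->ifindex = ifindex;
	nh->mtu = mtu;
	nh->priv = priv;
	priv->nh = nh;
	atomic_init(&priv->refcnt, 1);
	return (nh);
}

static void
nhop_ref_sketch(struct nhop_object_sketch *nh)
{
	atomic_fetch_add(&nh->priv->refcnt, 1);
}

/* Dropping the last reference only queues the object for reclamation. */
static void
nhop_unref_sketch(struct nhop_object_sketch *nh)
{
	struct nhop_priv_sketch *priv = nh->priv;

	if (atomic_fetch_sub(&priv->refcnt, 1) == 1) {
		priv->next_free = pending_free;
		pending_free = priv;
	}
}

/* Models the point where no dataplane reader can still hold a pointer. */
static size_t
reclaim_pending_sketch(void)
{
	size_t n = 0;

	while (pending_free != NULL) {
		struct nhop_priv_sketch *priv = pending_free;

		pending_free = priv->next_free;
		free(priv->nh);
		free(priv);
		n++;
	}
	return (n);
}
```

Between the final nhop_unref_sketch() and reclaim_pending_sketch() the dataplane part stays readable, which is the property that lets lookups use the object without taking a reference of their own.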

struct rtentry is also backed by epoch(9) to simplify the control plane code by avoiding refcounting.

Next steps

  • Proper scoped implementation for IPv6 and IPv4 (RFC ..)
  • rtsock extensions to support nexthops in routing daemons
  • Custom lookup algorithms for IPv4/IPv6

Implementation details


The datapath now has a per-AF set of lookup functions, which explicitly accept an address and scope instead of sockaddrs.

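/* Old KPI: sockaddr-based lookup returning a struct rtentry. */
rt = rtalloc1_fib(dst_sa, 0, 0, fibnum);
/* New KPI: per-AF lookups taking address, scope and flowid, returning a nexthop. */
nh = fib4_lookup_nh_ptr(fibnum, addr, scope, NHR_NONE, m->m_pkthdr.flowid);
nh = fib6_lookup_nh_ptr(fibnum, &addr, scope, NHR_NONE, m->m_pkthdr.flowid);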

Control plane
With multipath support, the notification mechanism has to be extended, as a single replace operation may result in multiple single-path changes which have to be propagated to each routing change subscriber (including rtsock). For example, a routing daemon may want to change the nhop group for a certain prefix, resulting in tens of individual paths being added and withdrawn for the external observer.
Most struct rtentry field accesses are done via getter functions, which allows hiding the exact structure layout in the routing core.
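
To make the fan-out concrete, here is a sketch that turns one group replacement into the per-path events an observer would see. It assumes paths are identified by integer ids and that withdrawals are reported before additions; the names (enum path_ev, struct path_change, diff_groups_sketch()) are hypothetical, and the actual notification format in the patch may differ.

```c
#include <assert.h>
#include <stddef.h>

enum path_ev { PATH_DEL, PATH_ADD };

struct path_change {
	enum path_ev	ev;
	int		nh_id;
};

static int
contains(const int *set, size_t n, int id)
{
	for (size_t i = 0; i < n; i++)
		if (set[i] == id)
			return (1);
	return (0);
}

/*
 * Expand a single "replace old group with new group" operation into
 * individual path changes. Returns the number of events written.
 */
static size_t
diff_groups_sketch(const int *old, size_t nold, const int *cur, size_t ncur,
    struct path_change *out, size_t outcap)
{
	size_t n = 0;

	/* Paths present only in the old group are withdrawn. */
	for (size_t i = 0; i < nold; i++) {
		if (!contains(cur, ncur, old[i])) {
			assert(n < outcap);
			out[n++] = (struct path_change){ PATH_DEL, old[i] };
		}
	}
	/* Paths present only in the new group are added. */
	for (size_t i = 0; i < ncur; i++) {
		if (!contains(old, nold, cur[i])) {
			assert(n < outcap);
			out[n++] = (struct path_change){ PATH_ADD, cur[i] };
		}
	}
	return (n);
}
```

Replacing group {1, 2, 3} with {2, 3, 4, 5} yields three events (one withdrawal, two additions), each of which would be delivered to every subscriber.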


  • No individual per-rtentry traffic counters; they moved to per-nhop.
  • No missmsg functionality (failure to perform a route lookup leading to an rt_missmsg message). This old functionality was implemented under the assumption that routing software will calculate and install routes on kernel request. Currently there is no demand for such a feature.
  • Dependent object refcounting. Currently rte only references ifa. To be on the safer side and better deal with departing interfaces, each nexthop references the ifa, the transmit interface and the source address interface. With route caching this can potentially lead to increased memory consumption (an old pcb references an nhop, which references stale interfaces/ifas). However, in reality we either have a system with lots of PCBs but a small number of interfaces/nhops, or a system with a large number of interfaces/nhops but a small number of PCBs.

Performance considerations

  • Performance might be worse in the use case with lots of short-lived TCP connections sharing the same nexthop. Previously, splitting the route (0/1 and 128/1) helped. With this change, the workaround will require more effort, as the contention will be at the nhop level.
  • Multipath requires flowid generation for outbound connections, which adds some CPU cycles to each connection setup or sendto() call.

Test Plan

To enable automated functional testing, tests/sys/net/routing has been created. Currently there are 23 tests, with 10-15 more to come.
Forwarding and output tests have been added to the tree (the latter is still under review in D24138).
To enable unit testing, modules/tests/routing has been created. Currently there are only 3 tests, with ~20 more to come.

13:08 [1] m@current s kyua test -k /usr/tests/sys/net/routing/Kyuafile
test_rtsock_l3:rtm_add_v4_gu_ifa_ordered_success  ->  passed  [0.009s]
test_rtsock_l3:rtm_add_v4_gw_direct_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v4_temporal1_success  ->  passed  [0.017s]
test_rtsock_l3:rtm_add_v6_gu_gw_gu_direct_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v6_gu_ifa_hostroute_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v6_gu_ifa_ordered_success  ->  passed  [0.007s]
test_rtsock_l3:rtm_add_v6_gu_ifa_prefixroute_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v6_temporal1_success  ->  passed  [0.014s]
test_rtsock_l3:rtm_del_v4_gu_ifa_prefixroute_success  ->  passed  [0.008s]
test_rtsock_l3:rtm_del_v4_prefix_nogw_success  ->  passed  [0.010s]
test_rtsock_l3:rtm_del_v6_gu_ifa_hostroute_success  ->  passed  [0.010s]
test_rtsock_l3:rtm_del_v6_gu_ifa_prefixroute_success  ->  passed  [0.011s]
test_rtsock_l3:rtm_del_v6_gu_prefix_nogw_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_get_v4_empty_dst_failure  ->  passed  [0.002s]
test_rtsock_l3:rtm_get_v4_exact_success  ->  passed  [0.007s]
test_rtsock_l3:rtm_get_v4_hostbits_failure  ->  passed  [0.008s]
test_rtsock_l3:rtm_get_v4_lpm_success  ->  passed  [0.008s]
test_rtsock_lladdr:rtm_add_v4_gu_lle_success  ->  passed  [0.006s]
test_rtsock_lladdr:rtm_add_v6_gu_lle_success  ->  passed  [0.005s]
test_rtsock_lladdr:rtm_add_v6_ll_lle_success  ->  passed  [0.004s]
test_rtsock_lladdr:rtm_del_v4_gu_lle_success  ->  passed  [0.006s]
test_rtsock_lladdr:rtm_del_v6_gu_lle_success  ->  passed  [0.004s]
test_rtsock_lladdr:rtm_del_v6_ll_lle_success  ->  passed  [0.004s]

forward6:fwd_ip6_gu_icmp_gw_gu_fast_success  ->  passed  [3.226s]
forward6:fwd_ip6_gu_icmp_gw_gu_slow_success  ->  passed  [2.919s]
forward6:fwd_ip6_gu_icmp_gw_ll_fast_success  ->  passed  [3.023s]
forward6:fwd_ip6_gu_icmp_gw_ll_slow_success  ->  passed  [2.800s]
forward6:fwd_ip6_gu_icmp_iface_fast_success  ->  passed  [2.598s]
forward6:fwd_ip6_gu_icmp_iface_slow_success  ->  passed  [2.784s]
output6:output6_raw_flowid_mpath_success  ->  passed  [2.321s]
output6:output6_raw_success  ->  passed  [2.012s]
output6:output6_tcp_flowid_mpath_success  ->  passed  [2.558s]
output6:output6_tcp_setup_success  ->  passed  [1.981s]
output6:output6_udp_flowid_mpath_success  ->  passed  [4.790s]
output6:output6_udp_setup_success  ->  passed  [3.052s]

forward:fwd_ip_icmp_gw_fast_success  ->  passed  [1.179s]
forward:fwd_ip_icmp_gw_slow_success  ->  passed  [1.149s]
forward:fwd_ip_icmp_iface_fast_success  ->  passed  [1.145s]
forward:fwd_ip_icmp_iface_slow_success  ->  passed  [1.154s]
output:output_raw_flowid_mpath_success  ->  passed  [0.452s]
output:output_raw_success  ->  passed  [0.070s]
output:output_tcp_flowid_mpath_success  ->  passed  [1.784s]
output:output_tcp_setup_success  ->  passed  [0.141s]
output:output_udp_flowid_mpath_success  ->  passed  [9.037s]
output:output_udp_setup_success  ->  passed  [1.319s]
Loaded /usr/obj/usr/home/melifaro/free/head/amd64.amd64/sys/modules/tests/routing/routing_test.ko, id=8
13:06 [1] m@current for s in `sysctl -N kern.test.routing`; do s sysctl $s=1 ; done
kern.test.routing.route_ctl.test_add_route_pinned_success: 0 -> 0
kern.test.routing.route_ctl.test_add_route_exist_fail: 0 -> 0
kern.test.routing.route_ctl.test_add_route_plain_add_success: 0 -> 0


running item 2: test_add_route_pinned_success
done running item 2: test_add_route_pinned_success - ret 0
running item 1: test_add_route_exist_fail
done running item 1: test_add_route_exist_fail - ret 0
running item 0: test_add_route_plain_add_success
done running item 0: test_add_route_plain_add_success - ret 0

Diff Detail

rS FreeBSD src repository - subversion
Lint Warnings
Excuse: .
Warning  sys/dev/cxgbe/tom/t4_connect.c:279  SPELL1  Possible Spelling Mistake
Warning  sys/netinet/tcp_offload.c:79  SPELL1  Possible Spelling Mistake
Warning  sys/netinet/tcp_offload.c:80  SPELL1  Possible Spelling Mistake
Warning  sys/netinet/tcp_offload.c:82  SPELL1  Possible Spelling Mistake
Warning  sys/netinet/tcp_offload.c:83  SPELL1  Possible Spelling Mistake
Warning  sys/netinet/tcp_offload.c:99  SPELL1  Possible Spelling Mistake
Warning  sys/netinet/toecore.c:80  SPELL1  Possible Spelling Mistake
No Unit Test Coverage
Build Status
Buildable 30037
Build 27848: arc lint + arc unit

Event Timeline

melifaro edited the test plan for this revision.

Abandoning in favor of more specific revisions.
Stage 1 (nexthop objects support) has landed in D24232 and D24340.
Stage 2 (nexthop groups) is planned to land in D26449 and D26523.