Implementation of scalable multipath & nexthop objects introduction.
Abandoned · Public

Authored by melifaro on Mar 21 2020, 11:02 AM.

Details

Reviewers

bz
ae
glebius
olivier

Group Reviewers

manpages
transport
network

Summary

NOTE: Patch is against r358990.

NOTE: TESTING: build (amd64) kernel with the patch, try adding multiple routes for the same prefix.

Overview

Problems with routing subsystem

The initial routing KPI was introduced back in 1980. It was a nice generic approach back then, as no one knew how the protocols would evolve. It has been enormously successful, as it was able to survive for 20+ years.

However, this KPI does not try to protect subsystem internals from outside users. The data structures of the lookup algorithm and details of its implementation, such as the usage of sockaddrs, have to be known by all of the subsystem users. Transport protocols, packet filters and event drivers know the exact layout of struct rtentry with all its locking, radix data structures and counters. As a result, making changes is hard, leading to compromises and piling up technical debt. IPv6 lookup scope embedding is an example of such a case.

Internally, the lookup algorithm is deeply embedded in the subsystem, making it hard to plug in custom, more performant algorithms for different use cases. For example, the existing radix is highly cache-inefficient for a large number of routes. It is also not optimally suited for sparse IPv6 tree lookups.

Goals

The goal is to bring scalable multipath routing, enabled by default.
As the change is rather large, another goal is to provide a new, clean, explicitly defined routing KPI, better suited for the existing and upcoming features, while keeping the implementation details hidden. Userland compatibility is also a requirement: no userland utility or routing daemon changes are required for the existing functionality to work.

10,000-foot view

This patch introduces the concept of nexthops: objects containing the information necessary for performing the packet output decision. The output interface, MTU, flags and gateway address go there. In most cases, these objects will serve the same role that struct rtentry currently serves. Typically there will be low tens of such objects on a router even with multiple BGP full-views, as these objects are shared between routing entries.

The change also introduces the concept of nexthop groups. These groups are basically arrays of nexthop pointers optimized for fast lookup with respect to relative nexthop weights. This part is the workhorse of the efficient multipath routing implementation.
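
As an illustration of the idea only (not the code from this patch), a weighted group can be sketched as a flat slot array in which each nexthop appears a number of times proportional to its weight, so that selection is a single modulo and array read. The struct layouts and the names nhgrp_build(), nhgrp_select() and nhgrp_free() below are hypothetical, as is the choice to reduce weights by their GCD.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified nexthop: only an id and a relative weight. */
struct nhop {
	int		id;
	uint32_t	weight;
};

/* Hypothetical group: a flat array of nexthop pointers ("slots"). */
struct nhgrp {
	size_t		  nslots;
	const struct nhop **slots;
};

static uint32_t
gcd32(uint32_t a, uint32_t b)
{
	while (b != 0) {
		uint32_t t = a % b;

		a = b;
		b = t;
	}
	return (a);
}

/*
 * Build a group in which each nexthop occupies a number of slots
 * proportional to its weight.  Weights are divided by their GCD to
 * keep the array small.  Returns NULL on allocation failure.
 */
static struct nhgrp *
nhgrp_build(const struct nhop *nhops, size_t count)
{
	struct nhgrp *grp;
	uint32_t g = 0;
	size_t i, j, k = 0, total = 0;

	assert(count > 0);
	for (i = 0; i < count; i++) {
		assert(nhops[i].weight > 0);
		g = gcd32(g, nhops[i].weight);
	}
	for (i = 0; i < count; i++)
		total += nhops[i].weight / g;

	grp = malloc(sizeof(*grp));
	if (grp == NULL)
		return (NULL);
	grp->slots = malloc(total * sizeof(*grp->slots));
	if (grp->slots == NULL) {
		free(grp);
		return (NULL);
	}
	grp->nslots = total;
	for (i = 0; i < count; i++)
		for (j = 0; j < nhops[i].weight / g; j++)
			grp->slots[k++] = &nhops[i];
	return (grp);
}

/* Dataplane side: one modulo and one array read per packet. */
static const struct nhop *
nhgrp_select(const struct nhgrp *grp, uint32_t flowid)
{
	return (grp->slots[flowid % grp->nslots]);
}

static void
nhgrp_free(struct nhgrp *grp)
{
	free(grp->slots);
	free(grp);
}
```

With weights 10 and 30 the sketch builds four slots, and the same flowid always maps to the same nexthop, which keeps a flow on one path. The trade-off of this layout is memory: weights with a small GCD produce a large array.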

With these changes, the lookup algorithm result is a pointer to either a nexthop group or a nexthop. All dataplane lookup functions return a pointer to the nexthop object, keeping nexthop group details inside the routing subsystem.
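
One way such a two-kind result could be resolved is sketched below. It assumes, purely for illustration, that both object types begin with a common header carrying a flag that marks groups; the header type, the NHF_MULTIPATH value, the field names and nh_resolve() are hypothetical and may not match the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	NHF_MULTIPATH	0x0001	/* assumed flag: the object is a group */

/* Hypothetical common header shared by both object kinds. */
struct nh_hdr {
	uint16_t	flags;
};

struct nhop_object {
	struct nh_hdr	hdr;		/* NHF_MULTIPATH clear */
	uint16_t	nh_mtu;
};

struct nhgrp_object {
	struct nh_hdr	hdr;		/* NHF_MULTIPATH set */
	uint8_t		nhg_size;	/* number of used slots */
	const struct nhop_object *nhg_nhops[8];
};

/*
 * Turn a raw lookup result into a plain nexthop.  Callers never see
 * the group: it is resolved here using the flowid.
 */
static const struct nhop_object *
nh_resolve(const struct nh_hdr *res, uint32_t flowid)
{
	const struct nhgrp_object *grp;

	if (res == NULL)
		return (NULL);
	if ((res->flags & NHF_MULTIPATH) != 0) {
		/* The header is the first member, so this cast is valid. */
		grp = (const struct nhgrp_object *)res;
		assert(grp->nhg_size > 0);
		return (grp->nhg_nhops[flowid % grp->nhg_size]);
	}
	return ((const struct nhop_object *)res);
}
```

The point of the design is that protocol code only ever handles one object type, so group internals can change without touching the callers.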

There is a good presentation on the recent introduction of similar functionality in Linux: Routes with Nexthop Objects - Linux Plumbers Conference.

Multipath

NOTE: multipath functionality is controlled by ROUTE_MPATH kernel option (added to GENERIC amd64 in this patch).

All dataplane lookup functions take a flowid parameter, which is used to choose the appropriate nexthop. It is up to the caller to provide this flowid.

Forwarding with multipath is currently based on NIC-generated m->m_pkthdr.flowid.

Locally-originated packets are trickier. Currently, flowid is NOT calculated for any locally-originated connections. Additionally, in many cases the local address is determined by performing a routing lookup, which creates a catch-22 situation: the flowid cannot depend on an address that the lookup has yet to choose. To deal with it, the hash is computed over the remote address and port only. The code uses the existing Toeplitz hash implementation to generate flowids. As a performance optimization, the hash is calculated only if there has been at least one multipath route in the system.
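
The scheme described above can be sketched in userland C as follows. This is a generic bit-by-bit Toeplitz hash, not the kernel's implementation; the function names, the mpath_routes_present variable and the 6-byte IPv4 input layout (remote address followed by remote port, both in network byte order) are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical: set by the control plane once a multipath route exists. */
static int mpath_routes_present;

/*
 * Toeplitz hash: for every set bit of the input (MSB first), XOR in
 * the 32-bit window of the key that starts at the same bit offset.
 * The key must be at least len + 4 bytes long.
 */
static uint32_t
toeplitz_hash(const uint8_t *key, size_t keylen, const uint8_t *data,
    size_t len)
{
	uint32_t hash = 0, window;
	size_t i;
	int bit;

	assert(keylen >= len + 4);
	window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];
	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			/* Slide the window one bit along the key. */
			window <<= 1;
			if (key[i + 4] & (1u << bit))
				window |= 1;
		}
	}
	return (hash);
}

/*
 * Flowid for a locally-originated IPv4 flow, derived from the remote
 * address and port only.  Returns 0 while no multipath route exists,
 * so single-path systems skip the hashing cost.
 */
static uint32_t
flowid_for_remote(const uint8_t *key, size_t keylen,
    const uint8_t faddr[4], const uint8_t fport[2])
{
	uint8_t buf[6];

	if (!mpath_routes_present)
		return (0);
	memcpy(buf, faddr, 4);
	memcpy(buf + 4, fport, 2);
	return (toeplitz_hash(key, keylen, buf, sizeof(buf)));
}
```

Because only the remote side is hashed, all connections from this host to the same remote address and port get the same flowid, and therefore the same path, regardless of which local address is eventually selected.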

Inbound connections: multipath relies on the RSS kernel option to generate and store the flowid in inp->inp_flowid.

Backward compatibility

No userland changes are required for this patch. All userland utilities and routing daemons should continue to work as is.

Locking model

Nexthop and nexthop group data structures are split into immutable dataplane parts (struct nhop_object, struct nhgrp_object) and control plane parts (struct nhop_priv, struct nhgrp_priv). The refcount(9) KPI is used to track references, backed by epoch(9) deferred reclamation.
This avoids nexthop refcounting in the dataplane code (except for the route caching part).

struct rtentry is also backed by epoch(9) to simplify the control plane code by avoiding refcounting.
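
The split between an immutable dataplane object and a refcounted control plane object can be illustrated with the userland sketch below. It uses C11 atomics in place of refcount(9) and a plain deferred-free list in place of epoch(9). The field names and all functions are hypothetical, and the list is assumed to be touched by a single control plane thread only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

struct nhop_priv;

/* Dataplane part: written once at creation, read-only afterwards. */
struct nhop_object {
	uint16_t	  nh_mtu;
	int		  nh_ifindex;	/* stand-in for the output interface */
	struct nhop_priv *nh_priv;
};

/* Control plane part: mutable bookkeeping, never touched per packet. */
struct nhop_priv {
	atomic_uint	    refcnt;
	struct nhop_object *nh;
	struct nhop_priv   *defer_next;	/* linkage on the deferred list */
};

/* Objects awaiting reclamation (single control plane thread assumed). */
static struct nhop_priv *deferred_head;

static struct nhop_object *
nhop_create(uint16_t mtu, int ifindex)
{
	struct nhop_object *nh = malloc(sizeof(*nh));
	struct nhop_priv *priv = malloc(sizeof(*priv));

	if (nh == NULL || priv == NULL) {
		free(nh);
		free(priv);
		return (NULL);
	}
	nh->nh_mtu = mtu;
	nh->nh_ifindex = ifindex;
	nh->nh_priv = priv;
	atomic_init(&priv->refcnt, 1);
	priv->nh = nh;
	priv->defer_next = NULL;
	return (nh);
}

static void
nhop_ref(struct nhop_object *nh)
{
	atomic_fetch_add(&nh->nh_priv->refcnt, 1);
}

/*
 * Drop a reference.  The last reference does not free the object: it
 * is queued instead, because lockless readers may still hold the
 * pointer.  Returns true if the object was queued.
 */
static bool
nhop_unref(struct nhop_object *nh)
{
	struct nhop_priv *priv = nh->nh_priv;

	if (atomic_fetch_sub(&priv->refcnt, 1) != 1)
		return (false);
	priv->defer_next = deferred_head;
	deferred_head = priv;
	return (true);
}

/*
 * Stand-in for the deferred callback: to be run only after every
 * reader that could have seen the queued objects has finished.
 * Returns the number of objects freed.
 */
static int
nhop_reclaim_deferred(void)
{
	int n = 0;

	while (deferred_head != NULL) {
		struct nhop_priv *priv = deferred_head;

		deferred_head = priv->defer_next;
		free(priv->nh);
		free(priv);
		n++;
	}
	return (n);
}
```

In this model a packet-forwarding reader never calls nhop_ref(): it only has to finish using the pointer before the deferred reclamation runs. Only long-lived holders, such as a cached route, take a reference.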

Next steps

Proper scoped implementation for IPv6 and IPv4 (RFC ..)
rtsock extensions to support nexthops in routing daemons
Custom lookup algorithms for IPv4/IPv6

Implementation details

KPI

The datapath now has a per-AF set of lookup functions, which explicitly accept an address and scope instead of sockaddrs.

old

rt = rtalloc1_fib(dst_sa, 0, 0, fibnum);

new

nh = fib4_lookup_nh_ptr(fibnum, addr, scope, NHR_NONE, m->m_pkthdr.flowid);
nh = fib6_lookup_nh_ptr(fibnum, &addr, scope, NHR_NONE, m->m_pkthdr.flowid);

Control plane

With multipath support, the notification mechanism has to be extended, as a single replace operation may result in multiple single-path changes which have to be propagated to each routing change subscriber (including rtsock). For example, a routing daemon may want to change the nhop group for a certain prefix, resulting in tens of individual paths being added and withdrawn from the point of view of an external observer.
Most struct rtentry field accesses are done via getter functions, which hides the exact structure layout inside the routing core.
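
A minimal sketch of how one group replace could be decomposed into per-path notifications is shown below. Nexthops are represented by plain integer ids, and the event names, callback type and nhgrp_replace_notify() are all hypothetical; the ordering (withdrawals before additions) is a choice made for this example, not something the patch is known to guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-path events delivered to each subscriber. */
enum rib_event { RIB_PATH_DEL, RIB_PATH_ADD };

typedef void (*rib_notify_cb)(enum rib_event ev, int nhop_id, void *arg);

static bool
has_id(const int *ids, size_t n, int id)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (ids[i] == id)
			return (true);
	return (false);
}

/*
 * Decompose an old -> new group replace into single-path events:
 * one DEL for each path that disappears, then one ADD for each path
 * that is new.  Unchanged paths generate nothing.  Returns the
 * number of events delivered.
 */
static size_t
nhgrp_replace_notify(const int *old_ids, size_t nold, const int *new_ids,
    size_t nnew, rib_notify_cb cb, void *arg)
{
	size_t i, nevents = 0;

	for (i = 0; i < nold; i++) {
		if (!has_id(new_ids, nnew, old_ids[i])) {
			cb(RIB_PATH_DEL, old_ids[i], arg);
			nevents++;
		}
	}
	for (i = 0; i < nnew; i++) {
		if (!has_id(old_ids, nold, new_ids[i])) {
			cb(RIB_PATH_ADD, new_ids[i], arg);
			nevents++;
		}
	}
	return (nevents);
}

/* Example subscriber that only counts what it is told. */
struct event_log {
	size_t	nadds;
	size_t	ndels;
};

static void
count_event(enum rib_event ev, int nhop_id, void *arg)
{
	struct event_log *log = arg;

	(void)nhop_id;
	if (ev == RIB_PATH_ADD)
		log->nadds++;
	else
		log->ndels++;
}
```

Replacing the group {1, 2, 3} with {2, 3, 4, 5} delivers one withdrawal and two additions to the subscriber, even though the routing table saw a single replace operation.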

Changes

No individual per-rtentry traffic counters; they have moved to per-nhop.
No missmsg functionality (a failed route lookup leading to an rt_missmsg message). This old functionality was implemented under the assumption that routing software would calculate and install routes on kernel request. Currently there is no demand for such a feature.
Dependent object refcounting. Currently an rte only references the ifa. To be on the safer side and to deal better with departing interfaces, each nexthop references the ifa, the transmit interface and the source address interface. With route caching this can potentially lead to increased memory consumption (an old PCB references a nhop, which references stale interfaces/ifas). However, in practice we have either a system with lots of PCBs but a small number of interfaces/nhops, or a system with a large number of interfaces/nhops but a small number of PCBs.

Performance considerations

Performance might be worse in the use case with lots of short-lived TCP connections sharing the same nexthop. Previously, splitting the route (0/1 and 128/1) helped. With this change, the workaround will require more effort, as the contention will be at the nhop level.
Multipath requires flowid generation for outbound connections, which adds some CPU cycles to each connection setup or sendto() call.

Test Plan

To enable automated functional testing, tests/sys/net/routing has been created. Currently there are 23 tests, with 10-15 more to come.
Forwarding and output tests have been added to the tree (the latter are still under review in D24138).
To enable unit testing, modules/tests/routing has been created. Currently there are only 3 tests, with ~20 more to come.

13:08 [1] m@current s kyua test -k /usr/tests/sys/net/routing/Kyuafile
test_rtsock_l3:rtm_add_v4_gu_ifa_ordered_success  ->  passed  [0.009s]
test_rtsock_l3:rtm_add_v4_gw_direct_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v4_temporal1_success  ->  passed  [0.017s]
test_rtsock_l3:rtm_add_v6_gu_gw_gu_direct_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v6_gu_ifa_hostroute_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v6_gu_ifa_ordered_success  ->  passed  [0.007s]
test_rtsock_l3:rtm_add_v6_gu_ifa_prefixroute_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_add_v6_temporal1_success  ->  passed  [0.014s]
test_rtsock_l3:rtm_del_v4_gu_ifa_prefixroute_success  ->  passed  [0.008s]
test_rtsock_l3:rtm_del_v4_prefix_nogw_success  ->  passed  [0.010s]
test_rtsock_l3:rtm_del_v6_gu_ifa_hostroute_success  ->  passed  [0.010s]
test_rtsock_l3:rtm_del_v6_gu_ifa_prefixroute_success  ->  passed  [0.011s]
test_rtsock_l3:rtm_del_v6_gu_prefix_nogw_success  ->  passed  [0.006s]
test_rtsock_l3:rtm_get_v4_empty_dst_failure  ->  passed  [0.002s]
test_rtsock_l3:rtm_get_v4_exact_success  ->  passed  [0.007s]
test_rtsock_l3:rtm_get_v4_hostbits_failure  ->  passed  [0.008s]
test_rtsock_l3:rtm_get_v4_lpm_success  ->  passed  [0.008s]
test_rtsock_lladdr:rtm_add_v4_gu_lle_success  ->  passed  [0.006s]
test_rtsock_lladdr:rtm_add_v6_gu_lle_success  ->  passed  [0.005s]
test_rtsock_lladdr:rtm_add_v6_ll_lle_success  ->  passed  [0.004s]
test_rtsock_lladdr:rtm_del_v4_gu_lle_success  ->  passed  [0.006s]
test_rtsock_lladdr:rtm_del_v6_gu_lle_success  ->  passed  [0.004s]
test_rtsock_lladdr:rtm_del_v6_ll_lle_success  ->  passed  [0.004s]

forward6:fwd_ip6_gu_icmp_gw_gu_fast_success  ->  passed  [3.226s]
forward6:fwd_ip6_gu_icmp_gw_gu_slow_success  ->  passed  [2.919s]
forward6:fwd_ip6_gu_icmp_gw_ll_fast_success  ->  passed  [3.023s]
forward6:fwd_ip6_gu_icmp_gw_ll_slow_success  ->  passed  [2.800s]
forward6:fwd_ip6_gu_icmp_iface_fast_success  ->  passed  [2.598s]
forward6:fwd_ip6_gu_icmp_iface_slow_success  ->  passed  [2.784s]
output6:output6_raw_flowid_mpath_success  ->  passed  [2.321s]
output6:output6_raw_success  ->  passed  [2.012s]
output6:output6_tcp_flowid_mpath_success  ->  passed  [2.558s]
output6:output6_tcp_setup_success  ->  passed  [1.981s]
output6:output6_udp_flowid_mpath_success  ->  passed  [4.790s]
output6:output6_udp_setup_success  ->  passed  [3.052s]

forward:fwd_ip_icmp_gw_fast_success  ->  passed  [1.179s]
forward:fwd_ip_icmp_gw_slow_success  ->  passed  [1.149s]
forward:fwd_ip_icmp_iface_fast_success  ->  passed  [1.145s]
forward:fwd_ip_icmp_iface_slow_success  ->  passed  [1.154s]
output:output_raw_flowid_mpath_success  ->  passed  [0.452s]
output:output_raw_success  ->  passed  [0.070s]
output:output_tcp_flowid_mpath_success  ->  passed  [1.784s]
output:output_tcp_setup_success  ->  passed  [0.141s]
output:output_udp_flowid_mpath_success  ->  passed  [9.037s]
output:output_udp_setup_success  ->  passed  [1.319s]

Loaded /usr/obj/usr/home/melifaro/free/head/amd64.amd64/sys/modules/tests/routing/routing_test.ko, id=8
13:06 [1] m@current for s in `sysctl -N kern.test.routing`; do s sysctl $s=1 ; done
kern.test.routing.route_ctl.test_add_route_pinned_success: 0 -> 0
kern.test.routing.route_ctl.test_add_route_exist_fail: 0 -> 0
kern.test.routing.route_ctl.test_add_route_plain_add_success: 0 -> 0

dmesg

running item 2: test_add_route_pinned_success
done running item 2: test_add_route_pinned_success - ret 0
running item 1: test_add_route_exist_fail
done running item 1: test_add_route_exist_fail - ret 0
running item 0: test_add_route_plain_add_success
done running item 0: test_add_route_plain_add_success - ret 0

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Warnings

Severity	Location	Code	Message
Warning	sys/dev/cxgbe/tom/t4_connect.c:279	SPELL1	Possible Spelling Mistake
Warning	sys/netinet/tcp_offload.c:79	SPELL1	Possible Spelling Mistake
Warning	sys/netinet/tcp_offload.c:80	SPELL1	Possible Spelling Mistake
Warning	sys/netinet/tcp_offload.c:82	SPELL1	Possible Spelling Mistake
Warning	sys/netinet/tcp_offload.c:83	SPELL1	Possible Spelling Mistake
Warning	sys/netinet/tcp_offload.c:99	SPELL1	Possible Spelling Mistake
Warning	sys/netinet/toecore.c:80	SPELL1	Possible Spelling Mistake

Unit

No Test Coverage

Build Status

Buildable 30037
Build 27848: arc lint + arc unit

Event Timeline

melifaro created this revision. · Mar 21 2020, 11:02 AM

Herald added a reviewer: manpages. · Mar 21 2020, 11:02 AM

Herald added a reviewer: transport.

Herald added subscribers: bz, farrokhi, ae, imp.

Harbormaster completed remote builds in B30037: Diff 69744. · Mar 21 2020, 11:02 AM

melifaro edited the summary of this revision. · Mar 21 2020, 11:10 AM

melifaro edited the test plan for this revision.

melifaro edited the summary of this revision. · Mar 21 2020, 11:13 AM

melifaro added reviewers: network, bz, ae, glebius, olivier.

swills added a subscriber: swills. · Mar 27 2020, 1:13 PM

melifaro mentioned this in D24232: Stage 1: Introduce nexhop objects and new routing kpi. · Mar 30 2020, 10:15 PM

mateusz_serveraptor.com added a subscriber: mateusz_serveraptor.com. · Apr 6 2020, 9:18 PM

melifaro mentioned this in D24340: Convert route caching to nexthop caching. · Apr 8 2020, 10:22 PM

melifaro mentioned this in rS359823: Introduce nexthop objects and new routing KPI. · Apr 12 2020, 2:30 PM

marius.h_lden.org added a subscriber: marius.h_lden.org. · Apr 21 2020, 2:31 PM

melifaro mentioned this in D24604: Pass nhop in ifa_rtrequest(). · Apr 28 2020, 8:03 AM

melifaro mentioned this in rS360475: Add nhop to the ifa_rtrequest() callback. · Apr 29 2020, 7:29 PM

melifaro mentioned this in D24680: remove unused flowid argument from rib_lookup_info(). · May 3 2020, 5:39 PM

melifaro mentioned this in D24870: Move <add|del|change>_route to route_ctl.c. · May 17 2020, 10:12 AM

melifaro mentioned this in rS361421: Move <add|del|change>_route() functions to route_ctl.c in preparation of. · May 23 2020, 7:07 PM

melifaro mentioned this in D25192: Add rib_action() and make rtsock use it. · Jun 8 2020, 8:06 PM

melifaro mentioned this in rS362007: Switch rtsock code to using newly-create rib_action() KPI call. · Jun 10 2020, 7:46 AM

melifaro mentioned this in D26449: Stage 2: Introduce scalable route multipath. · Sep 17 2020, 11:04 PM

Abandoning in favor of more specific revisions.
Stage 1 (nexthop objects support) has landed in D24232 and D24340.
Stage 2 (nexthop groups) is planned to land in D26449 and D26523.