Stage 2: Introduce scalable route multipath
Needs ReviewPublic

Authored by melifaro on Tue, Sep 15, 11:15 PM.


Group Reviewers

This is the second part of the routing subsystem changes initially described in D24141.
The first part of these changes landed in D24232.
Support for outbound hashing is in D26523.

NOTE: I'm going to land this diff on Saturday, October 3 unless I get any objections.


  • Follows the same implementation as nexthop objects and reuses parts of the infrastructure.
  • Nexthop groups are stored in the resizable hash and are indexed in the same way nexthops are.
  • Similarly, there is a private part (containing pointers to nexthops and relative weights) and a dataplane-visible part, consisting of an array of nexthops to choose from.
  • Max nhgrp width is set to 64 (no hard limitation here; it can be bumped further).
  • Lazy initialisation: allocate index/hash memory at the first attempt to add a multipath route.
  • rt_nhop can now point to either a nexthop or a nexthop group (distinguished by the NHF_MULTIPATH flag).
  • All dataplane functions handle nexthop selection internally.
  • Routing table notifications can now notify on switching between nexthop groups; rib_decompose_notification() has been created to decompose such notifications into a set of old "simple" add/del/change operations.
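
The split between the private part (nexthops plus relative weights) and the dataplane-visible array can be illustrated with a small userland sketch. Everything below (struct wnhop, struct nhgrp, nhgrp_compile(), nhgrp_select(), NHGRP_MAX_WIDTH) is a hypothetical stand-in, not the code in this diff. It assumes the simplest possible scheme, where a nexthop with weight w occupies w slots and the flow id picks a slot by modulo; the actual patch may normalise weights differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NHGRP_MAX_WIDTH 64	/* hypothetical cap, mirroring the max width above */

/* Control-plane view: one entry per nexthop with its relative weight. */
struct wnhop {
	int		nh_id;	/* stand-in for a nexthop pointer */
	uint32_t	weight;	/* relative weight, must be >= 1 */
};

/* Dataplane view: a flat array of slots to choose from. */
struct nhgrp {
	int	slots[NHGRP_MAX_WIDTH];
	size_t	nslots;
};

/*
 * Expand weighted nexthops into slots: weight w yields w slots.
 * Returns 0 on success, -1 if the input is empty, a weight is zero,
 * or the expanded group would exceed NHGRP_MAX_WIDTH.
 */
static int
nhgrp_compile(struct nhgrp *dst, const struct wnhop *src, size_t count)
{
	size_t i, total = 0;
	uint32_t w;

	for (i = 0; i < count; i++) {
		if (src[i].weight == 0 ||
		    src[i].weight > NHGRP_MAX_WIDTH - total)
			return (-1);
		total += src[i].weight;
	}
	if (total == 0)
		return (-1);

	dst->nslots = 0;
	for (i = 0; i < count; i++)
		for (w = 0; w < src[i].weight; w++)
			dst->slots[dst->nslots++] = src[i].nh_id;
	assert(dst->nslots == total);
	return (0);
}

/* Pick a nexthop for a flow; the same flow id always maps to the same slot. */
static int
nhgrp_select(const struct nhgrp *grp, uint32_t flowid)
{
	return (grp->slots[flowid % grp->nslots]);
}
```

With weights 3 and 1, the compiled group has four slots, so roughly three quarters of uniformly distributed flow ids land on the first nexthop.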

User-visible changes

  • Backward compatible: all non-multipath functionality continues to work as is
  • All routes come with weights; the default is 1, the max is 16M (2^24).
  • route delete <prefix> will work for non-multipath prefixes; specifying the gateway is required for multipath ones.


  • Relies on mbuf flowid for both transit and outbound traffic.
  • For locally-originated connections we don't currently calculate inp_flowid. As a performance optimization, start calculating these hashes only after inserting the first multipath route (V_hash_outbound).

Rollout stages

  1. Commit with net.route.multipath=0 by default even with ROUTE_MPATH
  2. Enable ROUTE_MPATH in amd64 GENERIC.
  3. Turn on net.route.multipath=1 by default

Test Plan

divert:ipdivert_ip_input_local_success  ->  skipped: ipdivert module is not loaded  [0.027s]
divert:ipdivert_ip_output_remote_success  ->  skipped: ipdivert module is not loaded  [0.021s]
fibs:fibs_ifroutes1_success  ->  passed  [0.073s]
forward:fwd_ip_icmp_gw_fast_success  ->  passed  [1.241s]
forward:fwd_ip_icmp_gw_slow_success  ->  passed  [1.105s]
forward:fwd_ip_icmp_iface_fast_success  ->  passed  [1.051s]
forward:fwd_ip_icmp_iface_slow_success  ->  passed  [1.075s]
ip_reass_test:ip_reass__large_fragment  ->  passed  [0.014s]
ip_reass_test:ip_reass__multiple_last_fragments  ->  passed  [0.030s]
ip_reass_test:ip_reass__zero_length_fragment  ->  passed  [0.015s]
lpm:lpm_test1_success  ->  passed  [0.151s]
lpm:lpm_test2_success  ->  passed  [0.174s]
output:output_raw_flowid_mpath_success  ->  passed  [0.426s]
output:output_raw_success  ->  passed  [0.082s]
output:output_tcp_flowid_mpath_success  ->  passed  [1.850s]
output:output_tcp_setup_success  ->  passed  [0.154s]
output:output_udp_flowid_mpath_success  ->  passed  [9.339s]
output:output_udp_setup_success  ->  passed  [1.221s]
redirect:valid_redirect  ->  passed  [1.114s]
so_reuseport_lb_test:basic_ipv4  ->  passed  [1.566s]
so_reuseport_lb_test:basic_ipv6  ->  passed  [0.903s]
socket_afinet:socket_afinet  ->  passed  [0.002s]
socket_afinet:socket_afinet_bind_ok  ->  passed  [0.002s]
socket_afinet:socket_afinet_bind_zero  ->  passed  [0.002s]

fibs6:fibs6_ifroutes1_success  ->  passed  [1.576s]
forward6:fwd_ip6_gu_icmp_gw_gu_fast_success  ->  passed  [2.318s]
forward6:fwd_ip6_gu_icmp_gw_gu_slow_success  ->  passed  [2.706s]
forward6:fwd_ip6_gu_icmp_gw_ll_fast_success  ->  passed  [3.020s]
forward6:fwd_ip6_gu_icmp_gw_ll_slow_success  ->  passed  [3.046s]
forward6:fwd_ip6_gu_icmp_iface_fast_success  ->  passed  [2.620s]
forward6:fwd_ip6_gu_icmp_iface_slow_success  ->  passed  [2.677s]
lpm6:lpm6_test1_success  ->  passed  [1.890s]
lpm6:lpm6_test2_success  ->  passed  [1.990s]
mld:mldraw01  ->  passed  [4.191s]
output6:output6_raw_flowid_mpath_success  ->  failed: Balancing failure: 1: 37 2: 4  [2.348s]
output6:output6_raw_success  ->  passed  [1.814s]
output6:output6_tcp_flowid_mpath_success  ->  passed  [2.239s]
output6:output6_tcp_setup_success  ->  passed  [2.089s]
output6:output6_udp_flowid_mpath_success  ->  passed  [4.973s]
output6:output6_udp_setup_success  ->  passed  [3.244s]
redirect:valid_redirect  ->  passed  [2.363s]
scapyi386:scapyi386  ->  passed  [4.396s]

test_rtsock_l3:rtm_add_v4_gu_ifa_ordered_success  ->  passed  [0.118s]
test_rtsock_l3:rtm_add_v4_gw_direct_success  ->  passed  [0.118s]
test_rtsock_l3:rtm_add_v4_no_rtf_host_failure  ->  passed  [0.125s]
test_rtsock_l3:rtm_add_v4_temporal1_success  ->  passed  [0.131s]
test_rtsock_l3:rtm_add_v6_gu_gw_gu_direct_success  ->  passed  [0.151s]
test_rtsock_l3:rtm_add_v6_gu_ifa_hostroute_success  ->  passed  [0.124s]
test_rtsock_l3:rtm_add_v6_gu_ifa_ordered_success  ->  passed  [0.132s]
test_rtsock_l3:rtm_add_v6_gu_ifa_prefixroute_success  ->  passed  [0.122s]
test_rtsock_l3:rtm_add_v6_temporal1_success  ->  passed  [0.124s]
test_rtsock_l3:rtm_change_v4_gw_success  ->  passed  [0.126s]
test_rtsock_l3:rtm_change_v4_mtu_success  ->  passed  [0.126s]
test_rtsock_l3:rtm_change_v6_gw_success  ->  passed  [0.130s]
test_rtsock_l3:rtm_change_v6_mtu_success  ->  passed  [0.122s]
test_rtsock_l3:rtm_del_v4_gu_ifa_prefixroute_success  ->  passed  [0.132s]
test_rtsock_l3:rtm_del_v4_prefix_nogw_success  ->  passed  [0.136s]
test_rtsock_l3:rtm_del_v6_gu_ifa_hostroute_success  ->  passed  [0.121s]
test_rtsock_l3:rtm_del_v6_gu_ifa_prefixroute_success  ->  passed  [0.127s]
test_rtsock_l3:rtm_del_v6_gu_prefix_nogw_success  ->  passed  [0.125s]
test_rtsock_l3:rtm_get_v4_empty_dst_failure  ->  passed  [0.003s]
test_rtsock_l3:rtm_get_v4_exact_success  ->  passed  [0.122s]
test_rtsock_l3:rtm_get_v4_hostbits_failure  ->  passed  [0.119s]
test_rtsock_l3:rtm_get_v4_lpm_success  ->  passed  [0.121s]
test_rtsock_lladdr:rtm_add_v4_gu_lle_success  ->  passed  [0.130s]
test_rtsock_lladdr:rtm_add_v6_gu_lle_success  ->  passed  [0.138s]
test_rtsock_lladdr:rtm_add_v6_ll_lle_success  ->  passed  [0.119s]
test_rtsock_lladdr:rtm_del_v4_gu_lle_success  ->  passed  [0.121s]
test_rtsock_lladdr:rtm_del_v6_gu_lle_success  ->  passed  [0.120s]
test_rtsock_lladdr:rtm_del_v6_ll_lle_success  ->  passed  [0.118s]

Diff Detail

Lint OK
No Unit Test Coverage
Build Status
Buildable 33738
Build 30967: arc lint + arc unit

Event Timeline

melifaro created this revision.Tue, Sep 15, 11:15 PM
melifaro requested review of this revision.Tue, Sep 15, 11:15 PM
melifaro updated this revision to Diff 77092.Wed, Sep 16, 8:22 AM

Add outbound hashing & other fixes.

melifaro updated this revision to Diff 77161.Thu, Sep 17, 10:34 PM

Fix refcount leaks.

melifaro retitled this revision from Introduce scalable multipath to Stage 2: Introduce scalable route multipath.Thu, Sep 17, 11:04 PM
melifaro edited the summary of this revision.
melifaro edited the test plan for this revision.
melifaro added reviewers: network, glebius, ae, olivier.

Heavy patch, looks promising overall.


So in the case of ! ROUTE_MPATH, the call to uma_zfree is suppressed now?

Looks like the review contains a patch to the file usr.bin/netstat/nhgrp.c, which doesn't exist on -head.

Yeah, that's weird. The raw patch is different from the one I have in git.
I'll update the patch shortly by cleaning up some stuff in nhgrp.c; hopefully this will be enough for Phabricator to stop thinking it's a copied file.

melifaro updated this revision to Diff 77196.Fri, Sep 18, 7:23 PM

Update nhgrp.c

olivier added inline comments.Fri, Sep 18, 7:48 PM

I have a build error here:

--- all_subdir_usr.bin/netstat ---
/usr/src/usr.bin/netstat/nhgrp.c:193:11: error: use of undeclared identifier 'NET_RT_NHGROUPS'
        mib[4] = NET_RT_NHGROUPS;

melifaro added inline comments.Sat, Sep 19, 8:53 AM

Not exactly. We have error set to EEXIST above, which makes this condition true.

melifaro updated this revision to Diff 77209.Sat, Sep 19, 8:54 AM

Update nhgrp kernel <> userland interface.
Fix corner cases in next hop group compilations.

melifaro updated this revision to Diff 77248.Sun, Sep 20, 9:53 AM

Fix rib_walk_del() multipath route handling.
Reword some comments.

melifaro added inline comments.Sun, Sep 20, 9:54 AM

Should be fixed now.

Tested on my ECMP lab, and the flow-id load-balancing seems to be working great.
Would you like me to test using bird and FRR too?

melifaro updated this revision to Diff 77321.Mon, Sep 21, 8:56 PM

Update to the latest HEAD to remove already-committed parts.

As far as I understand, bird doesn't have the bits to support multipath with rtsock; some amount of work in krt_send_route() has to be done to make it happen.
I haven't checked the FRR code yet; it may need some updates as well.

Btw, if you by any chance have cycles/capability/willingness to test locally-originated traffic (performance-wise), that would be awesome :-)

melifaro updated this revision to Diff 77371.Tue, Sep 22, 7:37 PM

Split out outbound hashing part.
Rebase to latest HEAD.

melifaro updated this revision to Diff 77372.Tue, Sep 22, 7:40 PM

Add forgotten netstat header.

melifaro edited the summary of this revision.Tue, Sep 22, 8:00 PM

Tested with net/frr7 compiled with MULTIPATH option, and it is working great:

[root@router1]~# vtysh

Hello, this is FRRouting (version 7.4).
Copyright 1996-2005 Kunihiro Ishiguro, et al.

router1# sh ip route
Routing entry for
  Known via "ospf", distance 110, metric 20, best
  Last update 00:06:39 ago
  *, via igb1, weight 1
  *, via igb2, weight 1

route1# sh ipv6 route 2001:db8:24::/64
Routing entry for 2001:db8:24::/64
  Known via "ospf6", distance 110, metric 20, best
  Last update 00:02:42 ago
  * fe80::20d:b9ff:fe45:7ad5, via igb1, weight 1
  * fe80::20d:b9ff:fe45:7ad6, via igb2, weight 1

router1# exit

[root@router1]~# netstat -rn4 | grep
         UG1        igb2
         UG1        igb1

[root@router1]~# netstat -rn6 | grep 2001:db8:24::/64
2001:db8:24::/64                  fe80::20d:b9ff:fe45:7ad6%igb2 UG1        igb2
2001:db8:24::/64                  fe80::20d:b9ff:fe45:7ad5%igb1 UG1        igb1

glebius added inline comments.Wed, Sep 23, 9:07 PM

Looks like three descriptions in comments need to be fixed.


One malloc could succeed and the other fail. Suggested fix: call free() on both before returning.
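
A minimal userland sketch of the suggested pattern, assuming two buffers allocated together. The names struct nhgrp_bufs and nhgrp_bufs_alloc() are hypothetical and do not come from the diff; the kernel malloc(9)/free(9) calls take additional arguments (malloc type, flags), but the cleanup logic is the same.

```c
#include <stdlib.h>

/* Hypothetical pair of buffers that must be allocated together. */
struct nhgrp_bufs {
	void	*index;
	void	*hash;
};

/*
 * Allocate both buffers or neither.
 * Returns 0 on success, -1 if either allocation fails.
 */
static int
nhgrp_bufs_alloc(struct nhgrp_bufs *b, size_t index_sz, size_t hash_sz)
{
	b->index = malloc(index_sz);
	b->hash = malloc(hash_sz);
	if (b->index == NULL || b->hash == NULL) {
		/* free(NULL) is a no-op, so freeing both is always safe. */
		free(b->index);
		free(b->hash);
		b->index = NULL;
		b->hash = NULL;
		return (-1);
	}
	return (0);
}
```

Freeing both pointers unconditionally avoids having to track which of the two allocations succeeded.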


The new style suggests using the _Static_assert() compiler builtin.
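
For illustration only, here is how a compile-time check with C11 _Static_assert might look. The struct layout, the field types, and NHGRP_MAX_WIDTH are assumptions made for this sketch and are not taken from the patch.

```c
#include <stdint.h>

#define NHGRP_MAX_WIDTH 64	/* hypothetical, mirrors the width cap in the summary */

/* Hypothetical dataplane header with an 8-bit size field. */
struct nhgrp_hdr {
	uint8_t		nhg_flags;
	uint8_t		nhg_size;
	uint16_t	nhg_pad;
};

/* Both checks are evaluated at compile time and generate no code. */
_Static_assert(NHGRP_MAX_WIDTH <= UINT8_MAX,
    "max group width must fit into nhg_size");
_Static_assert(sizeof(struct nhgrp_hdr) == 4,
    "unexpected padding in struct nhgrp_hdr");
```

If someone later bumps the width past what the size field can hold, the build fails with the given message instead of silently truncating at run time.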


Extra newline.


Maybe add a KASSERT to ensure we don't overflow dst->nhg_size?
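
One possible reading of this suggestion, sketched in userland: check the bound before writing into the group's array. SKETCH_ASSERT is a stand-in for the kernel's KASSERT(cond, ("message")) macro, and struct nhgrp_dp, NHGRP_DP_CAP, and nhgrp_append() are hypothetical names, not code from the diff.

```c
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

#define NHGRP_DP_CAP 8	/* hypothetical capacity of the dataplane array */

/* Userland stand-in for KASSERT: print the message and abort on failure. */
#define SKETCH_ASSERT(cond, msg) do {					\
	if (!(cond)) {							\
		fprintf(stderr, "assertion failed: %s\n", (msg));	\
		abort();						\
	}								\
} while (0)

/* Hypothetical dataplane group: a count plus a fixed-size array. */
struct nhgrp_dp {
	size_t	nhg_size;
	int	nhops[NHGRP_DP_CAP];
};

/* Append one nexthop id, asserting that the array is not already full. */
static void
nhgrp_append(struct nhgrp_dp *dst, int nh_id)
{
	SKETCH_ASSERT(dst->nhg_size < NHGRP_DP_CAP,
	    "nexthop group overflow");
	dst->nhops[dst->nhg_size++] = nh_id;
}
```

In the kernel, KASSERT is compiled in only with INVARIANTS, so a check like this costs nothing in production builds.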