Switch route caching to nexthop caching, based on the nexthop objects provided by D24232 .
The reasoning for the change and the high-level architecture are provided in D24141 .
Nexthops are separate data structures containing all the information necessary to perform packet forwarding, such as the gateway, interface and MTU. Nexthops are shared among routes, providing more pre-computed cache-efficient data while requiring less memory.
Splitting the LPM code from the attached data solves multiple long-standing problems in the routing layer, drastically reduces the coupling with other parts of the stack and allows faster lookup algorithms to be introduced transparently.
Route caching was (re)introduced to avoid (slow) routing lookups, allowing for better scaling. It works by acquiring an rtentry reference, which is protected by a per-rtentry mutex. When the routing table changes (detected by comparing the rtable generation id) or the link goes down, the cache record is withdrawn.
In terms of locking/performance, the typical datapath scenario looks like the following:
- lock the rtentry mutex
- ++rt refcount
- unlock the rtentry mutex
This change merely replaces the rtentry with the actual forwarding nexthop as the cached object; the change itself is mostly mechanical.
With the nexthop-based approach, the rt lock/unlock sequence is replaced by an nhop refcount_acquire():
The rest of the mechanics, such as cache cleanup on routing table change, remains the same.
The best use case for route caching is the scenario with long-lived TCP sessions and a rarely changing routing table. When conditions become suboptimal (short-lived TCP connections, routing table churn), the cache adds the penalty of locking/unlocking the relevant rtentries.
As nexthops are shared across many routes, scenarios with a large number of short-lived connections may see a higher penalty from contention on a particular nexthop.
To be more specific, let's look at an example:
Suppose a node has customers behind prefixes A (50% of TCP sessions) and B (50% of TCP sessions), and the routing table looks like the following:
A/24 -> gateway1 (nhop1)
B/24 -> gateway1 (nhop1)
With the current code, 50% of TCP connections will reference the rtentry for A and 50% the rtentry for B.
With the proposed change, 100% of TCP connections will reference nhop1.
At high connection rates (100k+/sec) this may result in lower performance due to cacheline contention on the nhop1 refcounter.
There are multiple ways (or combinations of ways) of dealing with this:
- pcpu refcounts (percpu_ref in Linux) (OS-wide optimisation)
- nexthop "sharding", similar to the current trick of splitting a route prefix into multiple prefixes to reduce contention (routing subsystem hack)
- introducing faster lookup algorithms such as DXR, reducing the need for route caching (routing subsystem next steps)
- introducing heuristics, potentially backed by a socket option, to avoid route caching for short-lived connections
Depending on the importance of this particular scenario and the actual impact, one or more steps from the list above can be taken to address it.