Page MenuHomeFreeBSD

routing: Replace unreachable nhops in nhgrp
Needs ReviewPublic

Authored by pouria on Mon, Jun 1, 7:39 PM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jun 23, 6:12 AM
Unknown Object (File)
Mon, Jun 22, 11:37 AM
Unknown Object (File)
Thu, Jun 18, 9:02 PM
Unknown Object (File)
Wed, Jun 17, 10:53 PM
Unknown Object (File)
Wed, Jun 17, 10:43 PM
Unknown Object (File)
Wed, Jun 17, 6:59 AM
Unknown Object (File)
Tue, Jun 16, 10:01 PM
Unknown Object (File)
Tue, Jun 16, 10:01 PM

Details

Reviewers
melifaro
glebius
markj
zlei
emaste
Group Reviewers
network
Summary

If a nhop gets an interface event, revalidate the nhops and
immediately try to recompile existing nexthop groups by
replacing unreachable nexthops with reachable ones.
If none are available, recompile them back to
their normal position in nexthop group slots.

Relnotes: yes

FINALLY TRUE ECMP WITHOUT EXTRA ALLOCATION!
in comparison to other OSes, we don't simply remove unreachable routes.
Now instead we replace them with reachable one.

We also don't want to walk through RIB to filter out ECMP ones.
We directly recompile nhgrps.

Test Plan
[root@ftsr1] [~] # netstat -On4W
Nexthop groups data

Internet:
GrpIdx  NhIdx     Weight   Slots           Gateway     Netif  Refcnt
8         ------- ------- ------- ----------------- ---------       2
4       1       1      172.23.1.254    vtnet0
7       1       1       192.168.9.1    vtnet1
[root@ftsr1] [~] # ifconfig vtnet1 down
[root@ftsr1] [~] # netstat -On4W
Nexthop groups data

Internet:
GrpIdx  NhIdx     Weight   Slots           Gateway     Netif  Refcnt
8         ------- ------- ------- ----------------- ---------       2
4       1       2      172.23.1.254    vtnet0
7       1       0       192.168.9.1    vtnet1
[root@ftsr1] [~] # ifconfig vtnet1 up
[root@ftsr1] [~] # netstat -On4W
Nexthop groups data

Internet:
GrpIdx  NhIdx     Weight   Slots           Gateway     Netif  Refcnt
8         ------- ------- ------- ----------------- ---------       2
4       1       1      172.23.1.254    vtnet0
7       1       1       192.168.9.1    vtnet1

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 73832
Build 70715: arc lint + arc unit

Event Timeline

pouria requested review of this revision.Mon, Jun 1, 7:39 PM

Resolve conflict with parent revision.

The new function is called _locked, but what actually makes it locked? AFAIU, the main race protector here is the fact that ifnet event handlers are serialized in the scope of a single ifnet. But what happens if two nhops_ifnet_state_changed (on different ifnets) run in parallel?

I also tested this patch on mercat5+6 on our netperf cluster.
mercat5 is connected back to back to mercat6 via two separate interfaces (3fff:3::, 3fff:4::)
mercat5:

ifconfig lo0 inet6 3fff:5::1
route -n6 add -net 3fff:6::/64 -gateway 3fff:3::b
route -n6 add -net 3fff:6::/64 -gateway 3fff:4::b

mercat6:

ifconfig lo0 inet6 3fff:6::1
route -n6 add -net 3fff:5::/64 -gateway 3fff:3::a
route -n6 add -net 3fff:5::/64 -gateway 3fff:4::a

test - terminal 1:

ifconfig cc0 down
# wait and switch
ifconfig cc0 up
ifconfig cc1 down

terminal 2:

ping -S 3fff:5::1 3fff:6::1
# Without packet loss!

Also, checked for regression via iperf3:

for i in `jot 22 1`; do iperf3 -6c 3fff:5::1 -p5201-5206 -P6 -t 25 -B 3fff:6::1 -u -b 10g -J | jq '.end.sum_received.bits_per_second / 1e6'; done
% ministat D57389-n286384-9025be72b305.txt main-n286382-22c1f5d0ec21.txt
x D57389-n286384-9025be72b305.txt
+ main-n286382-22c1f5d0ec21.txt
+----------------------------------------------------------------------------------------------------+
|                                                     +              +  x      x                     |
|xx+                  +            +  x  x    x +++   ++*x*  * x x   ++ * ++   x    +   x   x   x x +|
|                                  ||____________________MA___A_M______________|________|            |
+----------------------------------------------------------------------------------------------------+
N           Min           Max        Median           Avg        Stddev
x  20     6012.2105     6319.0092     6211.8506     6204.4123     84.320817
+  20      6018.908     6325.1225     6190.7419     6192.0741     67.733869
No difference proven at 95.0% confidence

The new function is called _locked, but what actually makes it locked? AFAIU, the main race protector here is the fact that ifnet event handlers are serialized in the scope of a single ifnet. But what happens if two nhops_ifnet_state_changed (on different ifnets) run in parallel?

nhops_iter_start will lock the nh_control with NHOPS_RLOCK(ctl) and nhops_iter_stop unlocks it.
Both nhop and nhgrp usually protect themselves with that lock.
NHOPS_RLOCK uses rwlock and it's also sleepable.

My initial approach was to call nhgrp_recompile_locked after nhops_iter_stop and use a write lock inside it instead.
But after your hint on atomic(9), I retraced all the operations during this patch, and all of its work was aligned, therefore, I believe it's also atomic.

Quoted from atomic(9):

On all architectures supported by FreeBSD, ordinary loads and stores of
integers in cache-coherent memory are inherently atomic if the integer is
naturally aligned and its size does not exceed the processor's word size.

After that, I suffix its function name with _locked and then I call nhgrp_recompile_locked before nhops_iter_stop to avoid relocking it.

To be safe with compile_nhgrp, I can use rw_try_upgrade in nhgrp_recompile_locked to make it serialize, and simply return if it can't upgrade.
What do you think?

Is it possible to add necessary lock assertions? That will clarify how exactly it is locked.

The only issue I see is that two (or more) nhops_ifnet_state_changed() execute in parallel and the one that would compile an nhgrp last would win the race and at the end we will observe dst->nhops[] the way that the last invocation set it. Since NH_IS_VALID is set/read without atomics the last invocation may not observe INVALID bit set by the previous invocation.

Such event of two interfaces going down at the same time is pretty possible, e.g. two vlan(4) interfaces would go down together if are on the same trunk.

I don't remember what is the level of serialization of the ifnet link events. Could they truly execute in parallel or not? AFAIR, they are processed by a taskqueue, that today is single threaded. Should we rely on that?

Address @glebius comment.
Do not rely on NHOP_RLOCK for nhgrp_recompile and use NHOP_WLOCK.
This should protect compile_nhgrp operation and and allows us to safely sleep
on the nh_control lock during recompile.

@glebius: You're right, we may change how the taskqueue works in the future.
Delaying nhgrp_recompile by sleeping on the lock is completely safe.
In comparison to the number of routes or nhops, we don't have many nhgrp,
so the lock isn't held for long.

I'm technically reverting to my initial patch for this revision :)

sys/net/route/nhgrp_ctl.c
1112

Question: what does require us to be in the net epoch here, given that we already have the write lock?