Apply metric to the actual list of nexthops in nhgrp.
Previously, we only supported nexthop weight for ECMP routes.
The upper 8 bit of weight was reserved for metric or administrative
distance purposes.
Now we filter any nexthops with lower metric in the actual nexthop
group.
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped - Build Status
Buildable 72079 Build 68962: arc lint + arc unit
Event Timeline
Here is how it works:
[root@ftsr1] [~] # route -n6 add -net 3fff::/64 -gateway 2a01:e140:10:10::6 -weight 33554435
change net 3fff::/64: gateway 2a01:e140:10:10::6 fib 0
[root@ftsr1] [~] # route -n6 add -net 3fff::1/64 -gateway 2a01:e140:10:10::1 -weight 2
add net 3fff::1/64: gateway 2a01:e140:10:10::1 fib 0
[root@ftsr1] [~] # route -n6 get 3fff::/64
route to: 3fff::
destination: 3fff::
mask: ffff:ffff:ffff:ffff::
gateway: 2a01:e140:10:10::6
fib: 0
interface: vtnet0
flags: <UP,GATEWAY,DONE,STATIC>
recvpipe sendpipe ssthresh rtt,msec mtu metric weight expire
0 0 0 0 1500 2 3 0
[root@ftsr1] [~] # route -on6 get 3fff::/64
route to: 3fff::
destination: 3fff::
mask: ffff:ffff:ffff:ffff::
fib: 0
flags: <UP,GATEWAY,DONE,STATIC>
nhops: 2
via gw 2a01:e140:10:10::1 iface vtnet0 metric 0 weight 2 mtu 1500 table inet6.0
via gw 2a01:e140:10:10::6 iface vtnet0 metric 2 weight 3 mtu 1500
[root@ftsr1] [~] # netstat -rn6W | grep 3fff
3fff::/64 2a01:e140:10:10::1 UGS 0 1500 vtnet0
3fff::/64 2a01:e140:10:10::6 UGS 0 1500 vtnet0
[root@ftsr1] [~] # netstat -On6W
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
11 ------- ------- ------- --------------------------------------- --------- 2
9 2 0 2a01:e140:10:10::1 vtnet0
1033554435 64 2a01:e140:10:10::6 vtnet0Of course, we should add a separate -metric argument to route(8) later.
| sys/net/route.h | ||
|---|---|---|
| 107–108 | Here is another hint for me to understand what was the original @melifaro design. | |
| sys/net/route/nhgrp_ctl.c | ||
| 147–148 | Changes here only affect the actual nhgrp size in alloc_nhgrp(). | |
| 252 | This is not necessary. | |
| sys/net/route/route_ctl.c | ||
| 199 | Why not use a separate value for metric? | |
Also this revision will NOT change expected routing behavior (POLA) since we never allow users to set weight more than 24-bit.
Now we don't, but we use it for metric.
I intentionally used the term metric instead of administrative distance. You know the difference.
before this patch, adding nexthop never results in change to the main or first nhop of the nexthop group.
Before this patch:
[pouria@cornelia] [/usr/src] [mpath_user|✔] % mdo route -n6 add -host 3fff::1 -gateway ::4 -weight 16777218
add host 3fff::1: gateway ::4
[pouria@cornelia] [/usr/src] [mpath_user|✔] % netstat -On6W
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
14 ------- ------- ------- --------------------------------------- --------- 2
7 1 0 ::1 lo0
8 2 0 ::2 lo0
11 16777218 32 ::4 lo0
12 16777216 32 ::3 lo0
[pouria@cornelia] [/usr/src] [mpath_user|✔] % echo '2^24' | bc
16777216
[pouria@cornelia] [/usr/src] [mpath_user|✔] % echo '(2^24)*2' | bc
33554432
[pouria@cornelia] [/usr/src] [mpath_user|✔] % mdo route -n6 add -host 3fff::1 -gateway ::5 -weight 33554434
add host 3fff::1: gateway ::5
[pouria@cornelia] [/usr/src] [mpath_user|✔] % route -n6 get 3fff::1
route to: 3fff::1
destination: 3fff::1
gateway: ::4
fib: 0
interface: lo0
flags: <UP,GATEWAY,HOST,DONE,STATIC>
recvpipe sendpipe ssthresh rtt,msec mtu weight expire
0 0 0 0 16384 0 0
[pouria@cornelia] [/usr/src] [mpath_user|✔] % netstat -On6W
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
15 ------- ------- ------- --------------------------------------- --------- 2
7 1 0 ::1 lo0
8 2 0 ::2 lo0
11 16777218 16 ::4 lo0
12 16777216 15 ::3 lo0
13 33554434 33 ::5 lo0You may wonder how I set the weight to more than 24-bit before this patch?
The answer is we didn't validate maximum weight value from netlink.
However, we didn't document any of the weight system. So we're safe from breaking the userland.
After this patch:
[root@ftsr1] [~] # route -n6 add -net 3fff::/64 -gateway ::4 -weight 16777218
add net 3fff::/64: gateway ::4 fib 0
[root@ftsr1] [~] # netstat -On6W
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
14 ------- ------- ------- --------------------------------------- --------- 2
9 2 0 ::1 lo0
10 2 0 ::2 lo0
1116777218 32 ::4 lo0
1216777216 32 ::3 lo0
[root@ftsr1] [~] # route -n6 add -net 3fff::/64 -gateway ::5 -weight 33554433
add net 3fff::/64: gateway ::5 fib 0
[root@ftsr1] [~] # netstat -On6W
Nexthop groups data
Internet6:
GrpIdx NhIdx Weight Slots Gateway Netif Refcnt
15 ------- ------- ------- --------------------------------------- --------- 2
9 2 0 ::1 lo0
10 2 0 ::2 lo0
1116777218 0 ::4 lo0
1216777216 0 ::3 lo0
1333554433 64 ::5 lo0Compare the nexthop slots of the nexthop group, before and after this patch.
| sys/net/route/route_helpers.c | ||
|---|---|---|
| 297 ↗ | (On Diff #175200) | With new structure, this initialization must be changed. |
| sys/net/route.h | ||
|---|---|---|
| 92–95 | Is this structure user visible? If so, better reduce rmx_filler[] down to 1 and use that space. The comment on T/TCP should be cleared while modifying this line. | |
| sys/net/route.h | ||
|---|---|---|
| 91–94 | If you keep order of existing members as before, the ABI should be compatible with old programs. | |
Refactor metric implementation.
In our routing stack implementation, metric is the attribute
of the nexthop, not the route itself.
Because we can't have two different route with same destination,
But, we can have a route with multiple nexthops.
Therefore, it doesn't make sense to compare weight of multiple
nexthops with different metrics.
By definition, we shouldn't have multiple of same nexthops with
different weight, but we can have multiple nexthops with different
metrics.
Apply metric to the actual list of nexthops in nhgrp.
However, store metric in nhop_object.
Previously, we only supported nexthop weight for ECMP routes.
Remove the upper 8-bit reservation of weight and reuse spare
data of nhop object for metric.
| sys/net/route/route_ctl.c | ||
|---|---|---|
| 199 | This is no longer true. | |
| sys/net/route/nhgrp_ctl.c | ||
|---|---|---|
| 138 | Higher than what? | |
| 165 | This comment is a bit confusing, it looks like the check is just filtering out nhops with metrics not equal to the one that was passed in. | |
| 271 | Again, this comment is not very helpful and kind of confusing. | |
| 515 | Is it impossible to have num_nhops == 0? | |
| 526 | It's better to avoid this style of i++; /* increment i */ comment. | |
Simplify set_metric logic by specifying 0 as RT_WILDCARD_METRIC macro and overwrite wildcard (0) metric to default metric (1).
This simple change allow users to delete all the routes without specifying metric in both rtsock and metric and also preserve POLA by not allowing users to add/change route metric to wildcard metric (0) by overwriting it to default metric.
Merge D57007.
Address @markj comment.
markj asked: How do you decide whether to add a field to nhop_object vs. the other (nhop_priv)?
I reevaluate my decisions and decided to test whether if worth it to reuse spare member of nhop_object which is used for dataplane or should I place it under nhop_priv?
I had two options, my initial thought was to place it inside the lookup section (NH_PRIV_END_CMP) of nhop_priv.
But since it was cache aligned, I initially decided to place metric inside the nhop_object.
But I didn't think of having to touch the cmp_priv at that time. So its cost of metric comparison by double dereferencing was not visible to me.
Now with D57007, I decided to test moving metric under nhop_priv without placing it inside the lookup section.
I decided to test that in production at NL-IX.
Here was the results:
% # When nh_metric was placed in nhop_object. % mdo vmstat -m | grep nh nhops 38 21248 324367067 64,128,256,512,1024,2048,4096,8192 % mdo netstat -rn4 | wc -l 1053225 % mdo netstat -rn6 | wc -l 288723 % mdo netstat -On6 | wc -l 37 % mdo netstat -on6 | wc -l 432 % # After moving it into nhop_priv (FreeBSD 16.0-CURRENT (GENERIC-NODEBUG) #3 mpath-n285993-1a086c816b5a): % mdo vmstat -m | grep nh nhops 38 21248 202332 64,128,256,512,1024,2048,4096,8192 % mdo netstat -rn4 | wc -l 1053251 % mdo netstat -rn6 | wc -l 288986 % mdo netstat -On6 | wc -l 40 % mdo netstat -on6 | wc -l 431
Superising to me, memory allocation is not changed.
Well, after seeing the results, I checked the size of nhop_priv again and found out the overall structure of it is not aligned to allocation boundaries.
So it is completely safe to move it under nhop_priv and it's better to do so because metric is a control-plane variable and
not belongs to nhop_object.
Thank you so much @markj for asking me that question.
IPv6 default route will be added using rib_default_route from route_helpers.
Ensure it sets the RT_DEFAULT_METRIC.