Draft revision to enable the data-plane routing IPv4 packets via IPv6 next-hops. The control-plane such as FRR already supports RFC 5549.
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
- Lint Skipped 
- Unit
- Tests Skipped 
Event Timeline
Glad to get your response !
I'm interested with scalable network, and recently exploring the CLOS network architecture described by RFC 7938. It need large number of IPs and configuration
 and BGP unnumbered seems to rescue (RFC 5549).
Thanks your effort on FreeBSD's routing component, it is not too hard to implement such feature in the dataplane :)
I spent some time thinking about the potential implementation option some time ago. Let me try to summarise my thoughts below.
I'd really prefer to avoid adding additional complexity to the datapath. I'm really the proponent of making the fast path as fast / as simple as possible, with the obvious tradeoff of moving all the complexity to the control plane.
Reasonable.
I'm not expert in hardware routers, but it seems they behave simple / fast on datapath, the switch core, and leave complexity to the control plane. Though software routers does not compete with hardware routers, it is still valuable to improve the performance of datapath in regions such as cloud compute and fusion compute. PS. I've got noticed that some vendors have products with conception of router-on-nic. Not sure whether FreeBSD currently supports them or not.
I see 2 problems that need to be solved in order to integrate this smoothly:
- Pass packet family to the if_output() routine somehow
- Have a proper prepend header.
The potential rough edges:
- Source address selection
For the source address selection, if the outbound interface does not have corresponding address, routing IPv4 packets via IPv6 next-hop and the outbound interface does not have IPv4 global routable address or vice versa, I think they can be treated same as unnumbered interfaces, then we can borrow rules from current RFCs.
For IPv6 it is RFC 6724, and for IPv4 I think it is RFC 1122 section 3.3.4.3 and RFC 1812.
- inpcb LLE caching
If I remember correctly, there is LLE caching in route object, and inpcb caches the route.
- slow path - sending packets when there is no LLE entry
Current implementation when there is no LLE entry sending packets is blocked, I wonder if we can re-queue them.
It may be also worth mentioning that, depending on the implementation approach, certain optimisations may be worth considering.
For example, it is not too hard to couple nexthops with the relevant lle, allowing nexthop to store pre-calculated prepend as a pointer and pass the prepend header to the if_output(), optimizing performance for all gateway-based routes.
Good idea !
In that case, the only scenario that needs to be addressed is the "slow path" one.
In a bit more details:
For (1) we actually have at least 2 scenarios - IPv4 over IPv6 and vice versa.
So, one of the problems we need to solve is getting the "right" family inside the if_output(). I'm afraid, that using mbuf flags to address both of the combinations will make the code more complex and add notable amount of branches to the fast path.
I agree. This is a draft, and I did not want to touch other components such as pfil when I started working on it. If there're enough interests on this feature, then I'd like to elaborate on it.
I have ever considered add one more parameter af to if_output() directly, as the original design of if_output() presume that the gateway address family is same as the packets.
What if we leverage some spare fields in the struct route header instead?. We can pre-fill in this for the inbcb_route and update callers like ip_output_send() to include on-stack struct route if it was not passed before. Additionally, we can consider updating the KBI for if_output() and require a shortened version of struct route, w/o ro_dst, to make it family-agnostic and reduce on-stack usage.
It is good to reduce on-stack usage, but I think it needs profiling.
To implement this feature, currently only if_output need address family of the packets. Then we endup these three means:
- Obtains af from mbuf.
It works but not performant.
- Updating mbuf and add af member.
The af redundant and is performant, but since mbuf is passed crossing layers, need discuss further.
- Updating KBI for if_output() and add af parameter.
This assumes the caller knows exactly the address family. On-stack usage seems increase.
If if_output() always need address family, then it makes no difference. The performance of some wrapper interface such as if_vlan might be affected.
- The solution you proposed above.
It works theoretically. We also need to distinguish the route on-stack from the cached ones, inp_route e.g.
For (2) - proper prepend header - I was thinking of either
(a) having a "stub" LLes thank can be looked up via arpresolve() with a "proper" prepend
I'm not catching this. What is a "stub" LLEs ?
or (b) not using PCB caching if encap family differs from packet family, leveraging to-be-added nexthop ptr to the prepend data.
Also not catching this. What is the encap family ?
What do you think?
English is not my native tongue, hopefully I expressed clearly :)
This version looks pretty neat, please see some comments inline.
| sys/contrib/dpdk_rte_lpm/dpdk_lpm.c | ||
|---|---|---|
| 144 ↗ | (On Diff #90724) | Does it matter? The only thing that should be relevant for the lookup algos is the nexthop index. | 
| sys/dev/iicbus/if_ic.c | ||
| 374 ↗ | (On Diff #90724) | worth having as a macro? something like hdr = RO_GET_FAMILY(ro, dst) | 
| sys/net/route/route_ctl.c | ||
| 609–610 | Probably worth moving to a separate function to improve readability | |
| sys/netinet/in_fib_dxr.c | ||
| 341 ↗ | (On Diff #90724) | This code doesn't care about nexthop internals other than nhop index. | 
| sys/netinet/ip_output.c | ||
| 556 | Given rt_update_ro_flags() is static, it's probably worth just updating its signature to include nh parameter | |
| sys/netpfil/ipfw/ip_fw_table_algo.c | ||
| 3917 ↗ | (On Diff #90724) | IIRC we don't care about the nexthop here at all | 
ether_resolve_addr() will consume an mbuf and add it to the sending queue of the particular lle we're looking to resolve.
Once lle is resolved we iterate through this queue and re-send these using if_output() (end of arp_check_update_lle() for IPv4). Here we need to distingush between IPv4 or IPv6 packets somehow - either by inspecting IP version field or having a separate queues for IPv4/IPv6. Personally I'd experiment with the former approach.
For (1) we actually have at least 2 scenarios - IPv4 over IPv6 and vice versa.
So, one of the problems we need to solve is getting the "right" family inside the if_output(). I'm afraid, that using mbuf flags to address both of the combinations will make the code more complex and add notable amount of branches to the fast path.I agree. This is a draft, and I did not want to touch other components such as pfil when I started working on it. If there're enough interests on this feature, then I'd like to elaborate on it.
I have ever considered add one more parameter af to if_output() directly, as the original design of if_output() presume that the gateway address family is same as the packets.What if we leverage some spare fields in the struct route header instead?. We can pre-fill in this for the inbcb_route and update callers like ip_output_send() to include on-stack struct route if it was not passed before. Additionally, we can consider updating the KBI for if_output() and require a shortened version of struct route, w/o ro_dst, to make it family-agnostic and reduce on-stack usage.
It is good to reduce on-stack usage, but I think it needs profiling.
To implement this feature, currently only if_output need address family of the packets. Then we endup these three means:
- Obtains af from mbuf.
It works but not performant.
- Updating mbuf and add af member.
The af redundant and is performant, but since mbuf is passed crossing layers, need discuss further.
- Updating KBI for if_output() and add af parameter.
This assumes the caller knows exactly the address family. On-stack usage seems increase.
If if_output() always need address family, then it makes no difference. The performance of some wrapper interface such as if_vlan might be affected.
- The solution you proposed above.
It works theoretically. We also need to distinguish the route on-stack from the cached ones, inp_route e.g.
Always having struct route pointer in ip_output() seem to simplify things and I do like how this looks in the diff.
For (2) - proper prepend header - I was thinking of either
(a) having a "stub" LLes thank can be looked up via arpresolve() with a "proper" prependI'm not catching this. What is a "stub" LLEs ?
Sorry, I should have described it in a bit more detailed fashion.
Basically, my idea was to have a separate LLEs for a combination of (ip, upper_layer_family), so it can store the correct prepend and be cacheable by the PCB layer.
or (b) not using PCB caching if encap family differs from packet family, leveraging to-be-added nexthop ptr to the prepend data.
Also not catching this. What is the encap family ?
Different wording, sorry. Packet family - upper layer family (e.g. IPv6 for IPv6 packet), encap family - family of the gw we look in the LLE table.
What do you think?
English is not my native tongue, hopefully I expressed clearly :)
| sys/netinet/ip_output.c | ||
|---|---|---|
| 584 | Currently for IPv4 stack the source address selection is simple, but for some cases such as unnumbered interface, interface has only IPv6 addresses or IPv4 link-local addresses eg., it does not work greatly. It is known that ip_output and icmp_reflect are affected. As the issue exists before this feature, I'm planning to fix it in a separate diff. | |
| sys/net/route/route_ctl.c | ||
|---|---|---|
| 110 | For the feature routing IPv6 packets via IPv4 next-hops, it does not require too much effort. If it is useful in some case, I think we can put it in a separate diff. | |
| sys/net/if_ethersubr.c | ||
|---|---|---|
| 382 | It is a HACK here to fix the link layer type. It is better that the lle cache has correct type. | |
LGTM, I guess the biggest remaining piece now is lle handling, especially sending queued LLE packets upon successful resolution.
| sys/netinet6/nd6.c | ||
|---|---|---|
| 2219 ↗ | (On Diff #91340) | I'd avoid having > 7 arguments here, especially given we don't need to pass anything except the upper layer family. Maybe just dedicating 1 byte of is_gw to pass the family would work. | 
| 2388–2389 ↗ | (On Diff #91340) | I'd rather exit here with something like: | 
| 2489 ↗ | (On Diff #91340) | I'd just store encap family, so it can be made unified across IPv4/IPv6. | 
I have some WIP patch on making nd6_resolve() return LLEs with proper encap. Hope to publish it later this week.
I've added D31379 with the lltable support for IPvX over IPvY.
Basically, the diff creates per-family "child" lle entries attached to the main lle entry. The purpose of each child entry is to have an object with a proper encap, so it can be referenced just like standard lle.
I've updated the aforementioned D31379 to reflect the committed parts.
If you could update this review to use the new functionality (e.g. nd6_resolve() returning lle with the proper encap) , that would be awesome.
| sys/net/route/route_ctl.c | ||
|---|---|---|
| 110 | let's remove V6 over V4 for now. | |
| 122 | Could you also add feature(3) knob here so the userland can check if the support for the functionality exists? | |
| 593 | Mind renaming to something like check_gateway_family() to match verb_description pattern for other static functions? | |
| sys/netgraph/netflow/netflow.c | ||
| 367 ↗ | (On Diff #92020) | Probably worth explicitly stating we're leaving an empty gateway here for IPv6 nexthops. | 
| sys/dev/cxgbe/tom/t4_listen.c | ||
|---|---|---|
| 1119–1122 ↗ | (On Diff #92020) | |
.
| sys/net/if_tuntap.c | ||
|---|---|---|
| 1406 ↗ | (On Diff #92020) | In case we intend to support route like route add x.x.x.x -inet6 yy:: where the yy:: is IPv6 address of the peer of P-t-P interface, we still need it. Otherwise bpf will consume wrong address family. | 
| sys/netgraph/ng_iface.c | ||
| 374 ↗ | (On Diff #92020) | I think it is the same as above of sys/net/if_tuntap.c | 
| sys/net/if_infiniband.c | ||
|---|---|---|
| 395 | Not needed, we should pass proper family to ‘infiniband_resolve_addr()’ | |
| sys/net/route/route_ctl.c | ||
|---|---|---|
| 122 | Something like FEATURE(ipv4_rfc5549_support, "Route IPv4 packets via IPv6 nexthops"); | |
It should be the other way round :-) e.g. D31379 is a pre-requisite.
Could you try to apply it first, build on top and test?
You also need to add a bit of family wrapping logic inside fill_nhop_from_info(), so we get a proper family for the nexthop.
This is a quite large change codebase. I'm not sure I can do it well.
I would appreciate it if you could do it.
| sys/net/if_infiniband.c | ||
|---|---|---|
| 379 | Maybe move the "int af = ..." down here, if this is the only place it is used. | |