Page MenuHomeFreeBSD

Route IPv4 packets via IPv6 next-hops
ClosedPublic

Authored by zlei on May 22 2021, 9:38 AM.
Referenced Files
F80188042: D30398.id93963.diff
Thu, Mar 28, 11:29 PM
Unknown Object (File)
Feb 7 2024, 2:25 PM
Unknown Object (File)
Feb 4 2024, 12:43 AM
Unknown Object (File)
Jan 25 2024, 3:50 AM
Unknown Object (File)
Jan 25 2024, 3:50 AM
Unknown Object (File)
Jan 25 2024, 3:45 AM
Unknown Object (File)
Jan 25 2024, 3:45 AM
Unknown Object (File)
Jan 25 2024, 3:45 AM

Details

Summary

Draft revision to enable the data-plane routing IPv4 packets via IPv6 next-hops. The control-plane such as FRR already supports RFC 5549.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/netinet/ip_output.c
551

nh_ifa gets selected during the route addition. I don't remember the code there but it should reject a route if no ifa is found (and we should be able to find the loopback address). It also should represent IPv4 ifa. We don't need these checks in the datapath, probably worth just enforcing the control plane code.

First of all, thanks for working on this! This is the important feature we need to have in base, preferably turned on by default.
I'd love to land this.

Glad to get your response !
I'm interested with scalable network, and recently exploring the CLOS network architecture described by RFC 7938. It need large number of IPs and configuration
and BGP unnumbered seems to rescue (RFC 5549).
Thanks your effort on FreeBSD's routing component, it is not too hard to implement such feature in the dataplane :)

I spent some time thinking about the potential implementation option some time ago. Let me try to summarise my thoughts below.

I'd really prefer to avoid adding additional complexity to the datapath. I'm really the proponent of making the fast path as fast / as simple as possible, with the obvious tradeoff of moving all the complexity to the control plane.

Reasonable.
I'm not expert in hardware routers, but it seems they behave simple / fast on datapath, the switch core, and leave complexity to the control plane. Though software routers does not compete with hardware routers, it is still valuable to improve the performance of datapath in regions such as cloud compute and fusion compute. PS. I've got noticed that some vendors have products with conception of router-on-nic. Not sure whether FreeBSD currently supports them or not.

I see 2 problems that need to be solved in order to integrate this smoothly:

  1. Pass packet family to the if_output() routine somehow
  2. Have a proper prepend header.

The potential rough edges:

  • Source address selection

For the source address selection, if the outbound interface does not have corresponding address, routing IPv4 packets via IPv6 next-hop and the outbound interface does not have IPv4 global routable address or vice versa, I think they can be treated same as unnumbered interfaces, then we can borrow rules from current RFCs.
For IPv6 it is RFC 6724, and for IPv4 I think it is RFC 1122 section 3.3.4.3 and RFC 1812.

  • inpcb LLE caching

If I remember correctly, there is LLE caching in route object, and inpcb caches the route.

  • slow path - sending packets when there is no LLE entry

Current implementation when there is no LLE entry sending packets is blocked, I wonder if we can re-queue them.

It may be also worth mentioning that, depending on the implementation approach, certain optimisations may be worth considering.
For example, it is not too hard to couple nexthops with the relevant lle, allowing nexthop to store pre-calculated prepend as a pointer and pass the prepend header to the if_output(), optimizing performance for all gateway-based routes.

Good idea !

In that case, the only scenario that needs to be addressed is the "slow path" one.

In a bit more details:

For (1) we actually have at least 2 scenarios - IPv4 over IPv6 and vice versa.
So, one of the problems we need to solve is getting the "right" family inside the if_output(). I'm afraid, that using mbuf flags to address both of the combinations will make the code more complex and add notable amount of branches to the fast path.

I agree. This is a draft, and I did not want to touch other components such as pfil when I started working on it. If there're enough interests on this feature, then I'd like to elaborate on it.
I have ever considered add one more parameter af to if_output() directly, as the original design of if_output() presume that the gateway address family is same as the packets.

What if we leverage some spare fields in the struct route header instead?. We can pre-fill in this for the inbcb_route and update callers like ip_output_send() to include on-stack struct route if it was not passed before. Additionally, we can consider updating the KBI for if_output() and require a shortened version of struct route, w/o ro_dst, to make it family-agnostic and reduce on-stack usage.

It is good to reduce on-stack usage, but I think it needs profiling.
To implement this feature, currently only if_output need address family of the packets. Then we endup these three means:

  1. Obtains af from mbuf.

It works but not performant.

  1. Updating mbuf and add af member.

The af redundant and is performant, but since mbuf is passed crossing layers, need discuss further.

  1. Updating KBI for if_output() and add af parameter.

This assumes the caller knows exactly the address family. On-stack usage seems increase.
If if_output() always need address family, then it makes no difference. The performance of some wrapper interface such as if_vlan might be affected.

  1. The solution you proposed above.

It works theoretically. We also need to distinguish the route on-stack from the cached ones, inp_route e.g.

For (2) - proper prepend header - I was thinking of either
(a) having a "stub" LLes thank can be looked up via arpresolve() with a "proper" prepend

I'm not catching this. What is a "stub" LLEs ?

or (b) not using PCB caching if encap family differs from packet family, leveraging to-be-added nexthop ptr to the prepend data.

Also not catching this. What is the encap family ?

What do you think?

English is not my native tongue, hopefully I expressed clearly :)

Reuse route.ro_dst to get the address family of outbound / forwarded packets.

This version looks pretty neat, please see some comments inline.

sys/contrib/dpdk_rte_lpm/dpdk_lpm.c
144

Does it matter? The only thing that should be relevant for the lookup algos is the nexthop index.

sys/dev/iicbus/if_ic.c
374

worth having as a macro?

something like hdr = RO_GET_FAMILY(ro, dst)

sys/net/route/route_ctl.c
630

Probably worth moving to a separate function to improve readability

sys/netinet/in_fib_dxr.c
341

This code doesn't care about nexthop internals other than nhop index.

sys/netinet/ip_output.c
537

Given rt_update_ro_flags() is static, it's probably worth just updating its signature to include nh parameter

sys/netpfil/ipfw/ip_fw_table_algo.c
3917

IIRC we don't care about the nexthop here at all

In D30398#683204, @zlei.huang_gmail.com wrote:
  • inpcb LLE caching

If I remember correctly, there is LLE caching in route object, and inpcb caches the route.

  • slow path - sending packets when there is no LLE entry

Current implementation when there is no LLE entry sending packets is blocked, I wonder if we can re-queue them.

ether_resolve_addr() will consume an mbuf and add it to the sending queue of the particular lle we're looking to resolve.
Once lle is resolved we iterate through this queue and re-send these using if_output() (end of arp_check_update_lle() for IPv4). Here we need to distingush between IPv4 or IPv6 packets somehow - either by inspecting IP version field or having a separate queues for IPv4/IPv6. Personally I'd experiment with the former approach.

For (1) we actually have at least 2 scenarios - IPv4 over IPv6 and vice versa.
So, one of the problems we need to solve is getting the "right" family inside the if_output(). I'm afraid, that using mbuf flags to address both of the combinations will make the code more complex and add notable amount of branches to the fast path.

I agree. This is a draft, and I did not want to touch other components such as pfil when I started working on it. If there're enough interests on this feature, then I'd like to elaborate on it.
I have ever considered add one more parameter af to if_output() directly, as the original design of if_output() presume that the gateway address family is same as the packets.

What if we leverage some spare fields in the struct route header instead?. We can pre-fill in this for the inbcb_route and update callers like ip_output_send() to include on-stack struct route if it was not passed before. Additionally, we can consider updating the KBI for if_output() and require a shortened version of struct route, w/o ro_dst, to make it family-agnostic and reduce on-stack usage.

It is good to reduce on-stack usage, but I think it needs profiling.
To implement this feature, currently only if_output need address family of the packets. Then we endup these three means:

  1. Obtains af from mbuf.

It works but not performant.

  1. Updating mbuf and add af member.

The af redundant and is performant, but since mbuf is passed crossing layers, need discuss further.

  1. Updating KBI for if_output() and add af parameter.

This assumes the caller knows exactly the address family. On-stack usage seems increase.
If if_output() always need address family, then it makes no difference. The performance of some wrapper interface such as if_vlan might be affected.

  1. The solution you proposed above.

It works theoretically. We also need to distinguish the route on-stack from the cached ones, inp_route e.g.

Always having struct route pointer in ip_output() seem to simplify things and I do like how this looks in the diff.

For (2) - proper prepend header - I was thinking of either
(a) having a "stub" LLes thank can be looked up via arpresolve() with a "proper" prepend

I'm not catching this. What is a "stub" LLEs ?

Sorry, I should have described it in a bit more detailed fashion.
Basically, my idea was to have a separate LLEs for a combination of (ip, upper_layer_family), so it can store the correct prepend and be cacheable by the PCB layer.

or (b) not using PCB caching if encap family differs from packet family, leveraging to-be-added nexthop ptr to the prepend data.

Also not catching this. What is the encap family ?

Different wording, sorry. Packet family - upper layer family (e.g. IPv6 for IPv6 packet), encap family - family of the gw we look in the LLE table.

What do you think?

English is not my native tongue, hopefully I expressed clearly :)

zlei marked 4 inline comments as done.Jun 15 2021, 7:18 AM
zlei added inline comments.
sys/netinet/ip_output.c
551

Currently for IPv4 stack the source address selection is simple, but for some cases such as unnumbered interface, interface has only IPv6 addresses or IPv4 link-local addresses eg., it does not work greatly. It is known that ip_output and icmp_reflect are affected.

As the issue exists before this feature, I'm planning to fix it in a separate diff.

sys/net/route/route_ctl.c
110

For the feature routing IPv6 packets via IPv4 next-hops, it does not require too much effort. If it is useful in some case, I think we can put it in a separate diff.
I'm not expert on this. I'll appreciate if someone would share the use-cases of routing IPv6 packets via IPv4 next-hops.

sys/net/if_ethersubr.c
378

It is a HACK here to fix the link layer type. It is better that the lle cache has correct type.
I'm still investing on it.

LGTM, I guess the biggest remaining piece now is lle handling, especially sending queued LLE packets upon successful resolution.

LGTM, I guess the biggest remaining piece now is lle handling, especially sending queued LLE packets upon successful resolution.

Done! The solution looks ugly although.

LGTM, I guess the biggest remaining piece now is lle handling, especially sending queued LLE packets upon successful resolution.

sys/netinet6/nd6.c
2219 ↗(On Diff #91340)

How about we reuse is_gw and add some flags like 'ENCAP_IPV4' ?

2489 ↗(On Diff #91340)

Maybe we can stash only the encap family

sys/ofed/drivers/infiniband/core/ib_addr.c
409

Might be better if nd6_resolve() return with correct ether type ?

sys/netinet6/nd6.c
2219 ↗(On Diff #91340)

I'd avoid having > 7 arguments here, especially given we don't need to pass anything except the upper layer family. Maybe just dedicating 1 byte of is_gw to pass the family would work.

2388–2389 ↗(On Diff #91340)

I'd rather exit here with something like:

2489 ↗(On Diff #91340)

I'd just store encap family, so it can be made unified across IPv4/IPv6.
We can perfectly allocate structure on-stack, as there are no performance requirements.

I have some WIP patch on making nd6_resolve() return LLEs with proper encap. Hope to publish it later this week.

Rework sending queued LLE packets.

sys/net/if_tuntap.c
1408

Do we need it?

sys/netgraph/ng_iface.c
376

Do we need it?

I've added D31379 with the lltable support for IPvX over IPvY.
Basically, the diff creates per-family "child" lle entries attached to the main lle entry. The purpose of each child entry is to have an object with a proper encap, so it can be referenced just like standard lle.

I've updated the aforementioned D31379 to reflect the committed parts.
If you could update this review to use the new functionality (e.g. nd6_resolve() returning lle with the proper encap) , that would be awesome.

sys/net/route/route_ctl.c
110

let's remove V6 over V4 for now.

122

Could you also add feature(3) knob here so the userland can check if the support for the functionality exists?

614

Mind renaming to something like check_gateway_family() to match verb_description pattern for other static functions?
Also: if you have cycles it would be good to spin up a separate diff, targetting moving the existing checks to a separate function. I can land it before this diff, thus simplifying this one.

sys/netgraph/netflow/netflow.c
367

Probably worth explicitly stating we're leaving an empty gateway here for IPv6 nexthops.

sys/dev/cxgbe/tom/t4_listen.c
1126–1129

I've updated the aforementioned D31379 to reflect the committed parts.
If you could update this review to use the new functionality (e.g. nd6_resolve() returning lle with the proper encap) , that would be awesome.

I'll manage it in a few days :)

Rebased on latest main branch

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

zlei marked 3 inline comments as done.Aug 18 2021, 4:47 AM
zlei added inline comments.
sys/net/if_tuntap.c
1408

I'll test P-t-P devices and report later.

sys/net/route/route_ctl.c
122

I'm new to this feature(3) knob. Can you guide me please ?

sys/ofed/drivers/infiniband/core/ib_addr.c
409

This is in D31379 IIUC.

zlei marked 2 inline comments as done.Aug 18 2021, 7:38 AM

.

sys/net/if_tuntap.c
1408

In case we intend to support route like route add x.x.x.x -inet6 yy:: where the yy:: is IPv6 address of the peer of P-t-P interface, we still need it. Otherwise bpf will consume wrong address family.

sys/netgraph/ng_iface.c
376

I think it is the same as above of sys/net/if_tuntap.c

sys/net/if_infiniband.c
391

Not needed, we should pass proper family to ‘infiniband_resolve_addr()’

sys/net/route/route_ctl.c
122

Something like FEATURE(ipv4_rfc5549_support, "Route IPv4 packets via IPv6 nexthops");
You can check sysctl kern.features for the features that currently exist.

sys/net/if_ethersubr.c
238–241
376

Not needed anymore.

In D30398#712270, @zlei.huang_gmail.com wrote:

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

It should be the other way round :-) e.g. D31379 is a pre-requisite.
Could you try to apply it first, build on top and test?

zlei marked an inline comment as done.

Cleaned up some comments.

zlei marked an inline comment as done.

Added feature(3) knob.

In D30398#712270, @zlei.huang_gmail.com wrote:

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

It should be the other way round :-) e.g. D31379 is a pre-requisite.
Could you try to apply it first, build on top and test?

OK, I'll test and report later.

You also need to add a bit of family wrapping logic inside fill_nhop_from_info(), so we get a proper family for the nexthop.

Also: will you write a commit message, or do you prefer me doing it?

In D30398#712270, @zlei.huang_gmail.com wrote:

@melifaro Sorry for late response ;)

I removed the LLE part, it should be easy to apply D31379 .

It should be the other way round :-) e.g. D31379 is a pre-requisite.
Could you try to apply it first, build on top and test?

So far so good :)

Pass correct AF to nd6_resolve()

You also need to add a bit of family wrapping logic inside fill_nhop_from_info(), so we get a proper family for the nexthop.

I'll look at it.

Also: will you write a commit message, or do you prefer me doing it?

This is a quite large change codebase. I'm not sure I can do it well.
I would appreciate it if you could do it.

This revision is now accepted and ready to land.Aug 20 2021, 9:39 PM
sys/net/if_infiniband.c
375

Maybe move the "int af = ..." down here, if this is the only place it is used.

Rebased on latest main branch.
Moved down RO_GET_FAMILY()

This revision now requires review to proceed.Aug 22 2021, 3:47 PM
This revision was not accepted when it landed; it landed in state Needs Review.Aug 22 2021, 10:58 PM
This revision was automatically updated to reflect the committed changes.