MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking
Details
Test OK
Server side:
- kldload mlx5en mlx5ib ipoib
- ifconfig ib0 inet6 1::2/64 up
- iperf -V -s
Client side:
- kldload mlx5en mlx5ib ipoib
- ifconfig ib0 inet6 1::1/64 up
- opensm -d mlx5_0 -B
- iperf -V -i 1 -t 10 -c 1::2
Diff Detail
- Repository: rS FreeBSD src repository - subversion
- Lint: Lint Skipped
- Unit: Tests Skipped
- Build Status: Buildable 35620
Event Timeline
Why ?
RFC 4391, section 8 (IPv6 Stateless Autoconfiguration), is quite clear IMO. There are link-local IPoIB addresses, with a recommended format.
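For readers following along, here is a minimal sketch of how a link-local address is put together from a 64-bit interface identifier: the fe80::/64 prefix followed by the identifier. The function name and the example identifier value are hypothetical, chosen only for illustration. Which identifier an IPoIB port should use (for example one derived from the port GUID), and whether its universal/local bit needs adjusting, is deliberately left out here; RFC 4391 section 8 is the authority on that.

```python
import ipaddress


def link_local_from_interface_id(interface_id: int) -> ipaddress.IPv6Address:
    """Combine the fe80::/64 prefix with a 64-bit interface identifier.

    The interface identifier is taken as given; deriving it from a port
    GUID is outside the scope of this sketch.
    """
    if not 0 <= interface_id < 2**64:
        raise ValueError("interface identifier must fit in 64 bits")
    prefix = int(ipaddress.IPv6Address("fe80::"))
    return ipaddress.IPv6Address(prefix | interface_id)


# Hypothetical 64-bit identifier, not taken from any real adapter.
example_id = 0x0002C90300A1B2C3
print(link_local_from_interface_id(example_id))
```

The low 64 bits of the result are exactly the identifier passed in, so two ports with different identifiers always get different link-local addresses.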
The issue might be somewhere else, but I see IPv6 neighbour solicitation being broken. Basically, the link-local address is not generated according to the specification you mentioned, so the IB switch doesn't recognize it and doesn't distribute it as it should. Ping6 works (ignore the duplicate responses). See below:
1 0.000000 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
2 1.024207 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
3 1.000161 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
4 1.199860 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
5 1.000025 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
6 0.999998 1::1 ff02::1:ff00:2 ICMPv6 102 Neighbor Solicitation for 1::2
7 9.683235 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=0, hop limit=64 (no response found!)
8 0.000154 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=0, hop limit=64 (reply in 9)
9 0.000600 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=0, hop limit=64 (request in 8)
10 0.000060 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=0, hop limit=64
11 1.015989 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=1, hop limit=64 (no response found!)
12 0.000074 1::1 1::2 ICMPv6 70 Echo (ping) request id=0x22e9, seq=1, hop limit=64 (reply in 13)
13 0.000310 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=1, hop limit=64 (request in 12)
14 0.000046 1::2 1::1 ICMPv6 70 Echo (ping) reply id=0x22e9, seq=1, hop limit=64
I do not understand the latter claim either.
NS is distributed to the members of specific multicast groups, which all hosts must join. It is not distributed based on the local address. Also, IB switches should not deal with IPoIB addresses at all; IPoIB is something like L4, out of their scope.
Also note that the format of local addresses in the RFC is recommended but not required. But I believe that LL is required, so you cannot disable it on a whim.
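To make the multicast point concrete, here is a short sketch of how the solicited-node multicast group for a target address is derived under RFC 4291: the fixed prefix ff02::1:ff00:0/104 with the low 24 bits of the target appended. The function name is hypothetical. The result for 1::2 matches the destination seen in the capture above.

```python
import ipaddress


def solicited_node_multicast(addr: str) -> ipaddress.IPv6Address:
    """Return the solicited-node multicast address for a unicast address.

    Only the low 24 bits of the target are used, so the group does not
    depend on how the rest of the address was generated.
    """
    target = int(ipaddress.IPv6Address(addr))
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | (target & 0xFFFFFF))


print(solicited_node_multicast("1::2"))  # ff02::1:ff00:2
```

A host that configures 1::2 is expected to join this group, which is why the Neighbor Solicitation for 1::2 is sent there rather than to any link-local unicast address.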
> IB switches should not deal with IPoIB addresses
IB switches have no concept of L2. From what I know, there may only be one unique IP address per port, which the switch needs to learn.
I will investigate a bit more before concluding this is the right way to go.
--HPS
This patch is wrong. It is multicast, not link-local IPv6, which is the issue. My bad. I'll update this patch.
sys/net/if_infiniband.c | ||
---|---|---|
253 | This seems weird to me as well; I would assume that non-directed ICMPv6 traffic has M_MCAST set, so the clause below would handle that? Why is ICMPv6 special at all compared to TCP or UDP? While link-layer address resolution works at the ICMPv6 level, it is not a separate protocol like ARP for IPv4 but sits above IPv6. |
sys/net/if_infiniband.c | ||
---|---|---|
253 | The answer might be that multicast handling in IPoIB is broken. I need to investigate this. |
sys/net/if_infiniband.c | ||
---|---|---|
252 | Given there is no BCAST in IPv6, I wonder what this means? |
sys/net/if_infiniband.c | ||
---|---|---|
252 | I just wanted the cases to be symmetric, but like you say, there is no BCAST in IPv6, so now the packet is simply dropped. |
sys/net/if_infiniband.c | ||
---|---|---|
184–185 | Sorry, side note: would it be possible to move the ARP/ND encap logic to the if_requestencap callback, so that it is consistent with ether_output() and the datapath code is simpler? |
sys/net/if_infiniband.c | ||
---|---|---|
184–185 | Yes. Weren't you supposed to do that? ;-) |