
Add support for IPoIB lagg devices in FreeBSD
ClosedPublic

Authored by hselasky on Aug 31 2020, 6:36 PM.

Details

Summary

Add support for IPoIB lagg devices in FreeBSD.

MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking

Test Plan
# Machine A
kldload mlx5en mlx5ib ipoib if_lagg
ifconfig ib0 up
ifconfig ib1 up
ifconfig lagg0 create laggtype infiniband
ifconfig lagg0 laggproto failover laggport ib0 laggport ib1 1.1.1.1 netmask 255.255.255.0
opensm -d mlx5_0 -B

# Machine B
kldload mlx5en mlx5ib ipoib if_lagg
ifconfig ib0 up
ifconfig ib1 up
ifconfig lagg0 create laggtype infiniband
ifconfig lagg0 laggproto failover laggport ib0 laggport ib1 1.1.1.2 netmask 255.255.255.0
ping 1.1.1.1

# To test failover, use ifconfig to down/up the master link and verify that traffic moves from one link to the other

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline


Anmol is no longer with Panasas. We remain interested in this feature, but we no longer have anyone with the cycles to participate in the review.

Implement full support for failover.

Generate proper lladdr change event.

Fix for double free of mbuf.

Need to set VNET before the lladdr event.

D26733: Allow IPoIB to work with multiple FIBs

Looks like a really nice neat incorporation of IB into lagg. Please see some comments inline :-)
Also it would be nice to briefly discuss the naming part :-)

sys/kern/uipc_mbufhash.c
220 ↗(On Diff #78093)

I'm wondering if it would make sense to try to make the L3+ hash calculation common to ethernet and IB; at first glance, the offset is the only difference. What do you think?

sys/net/if_infiniband.c
189 ↗(On Diff #78093)

Maybe it's worth trying to pre-calculate the lle headers by providing an if_requestencap callback, similar to ether_requestencap(), to simplify the dataplane part a bit.

sys/net/if_lagg.h
209 ↗(On Diff #78093)

Probably worth adding a comment about what this lock protects? :-)

Factor hashing functions as suggested by @ae .

Add comment for LAGG's mutex, as suggested by @ae.

sys/net/if_infiniband.c
189 ↗(On Diff #78093)

Could we do this as a second step? I'm trying to limit the scope of this patchset.

Thank you for addressing the comments!
Sorry, got a couple more, but generally it's looking great :-)
The only real question/concern I have is the name. I'm a bit afraid that we're saying that "bond" is an IB port-channel and "lagg" is an ethernet port-channel. It's a bit confusing, given one cannot distinguish which is which by name and the fact that Linux folks are used to "bond" as the ethernet port-channel. If we really want to have the different names (to keep some applications happy?), would it be possible to have a name other than "bond"?

sys/net/if_infiniband.c
508 ↗(On Diff #78497)

Btw, why do we actually attach it as ethernet and not DLT_INFINIBAND ?

189 ↗(On Diff #78093)

Yep, sure!

sys/net/if_lagg.c
1238 ↗(On Diff #78497)

Sorry, it's not particularly clear to me from the text why, if "nodes have a fixed mac address", we need to "regularly update". Also, it's a bit unclear whether this update could be triggered by some other event, instead of being checked once per second. Would it be possible to elaborate?

> The only real question/concern I have is the name. I'm a bit afraid that we're saying that "bond" is an IB port-channel and "lagg" is an ethernet port-channel. It's a bit confusing, given one cannot distinguish which is which by name and the fact that Linux folks are used to "bond" as the ethernet port-channel. If we really want to have the different names (to keep some applications happy?), would it be possible to have a name other than "bond"?

I agree with @melifaro ; Link Aggregation Group covers Ethernet (failover, lacp, loadbalance, roundrobin); why can't it also cover InfiniBand failover as part of the same?

sys/net/if_types.h
245 ↗(On Diff #78497)

The header for this enum mentions

http://www.iana.org/assignments/smi-numbers

This definition conflicts with that, so shouldn't Mellanox / NVIDIA register a new type value for aggregated IB, or shouldn't this be done with the non-IANA values?

Or at least have this annotated as being a known conflict?

> The only real question/concern I have is the name. I'm a bit afraid that we're saying that "bond" is an IB port-channel and "lagg" is an ethernet port-channel. It's a bit confusing, given one cannot distinguish which is which by name and the fact that Linux folks are used to "bond" as the ethernet port-channel. If we really want to have the different names (to keep some applications happy?), would it be possible to have a name other than "bond"?

> I agree with @melifaro ; Link Aggregation Group covers Ethernet (failover, lacp, loadbalance, roundrobin); why can't it also cover InfiniBand failover as part of the same?

Hi,

It is not possible to re-use lagg&lt;N&gt; for infiniband, because we set the type of the network device when it is created: lagg&lt;N&gt; is created as ethernet and bond&lt;N&gt; is created as infiniband.

Please suggest a better name than "bond", likely four letters.

Else we end up having to do hacks, like in the initial patch.

--HPS

sys/net/if_infiniband.c
508 ↗(On Diff #78497)

Because DLT_INFINIBAND needs the full infiniband header, which we would need to construct, because it is not part of the received or transmitted mbufs. Stripping the data down to ethernet is simple enough and allows tcpdump to see the IP packets too.

> The only real question/concern I have is the name. I'm a bit afraid that we're saying that "bond" is an IB port-channel and "lagg" is an ethernet port-channel. It's a bit confusing, given one cannot distinguish which is which by name and the fact that Linux folks are used to "bond" as the ethernet port-channel. If we really want to have the different names (to keep some applications happy?), would it be possible to have a name other than "bond"?

> I agree with @melifaro ; Link Aggregation Group covers Ethernet (failover, lacp, loadbalance, roundrobin); why can't it also cover InfiniBand failover as part of the same?

> Hi,
>
> It is not possible to re-use lagg&lt;N&gt; for infiniband, because we set the type of the network device when it is created. lagg&lt;N&gt; are created like ethernet and bond&lt;N&gt; are created like infiniband.

Let's try to work it backwards?
I'd start with an assumption that from user POV we should be able to use the same interface for the same type of job, similarly to bond in Linux.
I see multiple options here:

  1. make lagg auto-change type based on the first interface attached. That would involve quite a lot of mechanics with notifications when changing between eth/ib.
  2. require lagg creation to specify laggtype in the creation params, assuming ethernet by default. ifconfig(8) needs to be updated (changes similar to vxlan_create() in sbin/ifconfig/ifvxlan.c).

(1) has the benefit that it won't require any userland changes when merging, compared to (2); however, it comes at the cost of greater kernel complexity.
What is your feeling on requiring users to rebuild ifconfig(8)? Is it an absolute no-go?

> Please suggest a better name than "bond", likely four letters.

I'd really prefer not to go down the path of inventing a name for every permutation of an interface property :-)
However, if none of the proposals above look good: "laggib0", "laib0"?

> Else we end up having to do hacks, like in the initial patch.

I tried to find the approach in the first published revision but failed to do so :-(

> --HPS


sys/net/if_infiniband.c
508 ↗(On Diff #78497)

Yep. Looking at the code for the input path, we're stripping the IB header and constructing an ethernet header anyway in infiniband_bpf_mtap(), which is a bit odd :-)

> Hi,
>
> It is not possible to re-use lagg&lt;N&gt; for infiniband, because we set the type of the network device when it is created. lagg&lt;N&gt; are created like ethernet and bond&lt;N&gt; are created like infiniband.
>
> Let's try to work it backwards?
> I'd start with an assumption that from the user POV we should be able to use the same interface for the same type of job, similarly to bond in Linux.
> I see multiple options here:
>
>   1. make lagg auto-change type based on the first interface attached. That would involve quite a lot of mechanics with notifications when changing between eth/ib.
>   2. require lagg creation to specify laggtype in the creation params, assuming ethernet by default. ifconfig(8) needs to be updated (changes similar to vxlan_create() in sbin/ifconfig/ifvxlan.c).
>
> (1) has the benefit that it won't require any userland changes when merging, compared to (2); however, it comes at the cost of greater kernel complexity.
> What is your feeling on requiring users to rebuild ifconfig(8)? Is it an absolute no-go?

I like (2): assume Ethernet LAGG by default, but allow override to InfiniBand LAGG.

> Please suggest a better name than "bond", likely four letters.

iblagg seems not-terrible to me. Or iblag, if six chars is too long.

@melifaro

> require lagg creation to specify laggtype in the creation params, assuming ethernet by default. ifconfig(8) needs to be updated (changes similar to vxlan_create() in sbin/ifconfig/ifvxlan.c).

I'll look into this. Doing it this way would remove the need for a separate device name! The default would be ethernet then.

--HPS

hselasky edited the test plan for this revision. (Show Details)

Implement suggestion from @melifaro.

sys/net/if_lagg.c
1238 ↗(On Diff #78497)

The MAC address in IPoIB comes from the so-called GID, which is hardcoded per port. It is not possible for two different ports to use the same MAC.

It might be that we could piggy-back on some link-up/down events, but I don't see any problem with this timer. It is low bandwidth.

The timer also provides a guarantee that the iflladdr events don't happen too frequently.

melifaro added inline comments.
sys/net/if_lagg.c
1238 ↗(On Diff #78497)

Yep, now it is clear :-)
Would it be possible for you to update the comment to reflect the link up/down part?

sys/net/if_lagg.h
99 ↗(On Diff #78536)

Nit: maybe it's worth adding a few more spare fields, just in case? E.g. making it 16 or 32 bytes?

This revision is now accepted and ready to land. Oct 21 2020, 7:51 PM
hselasky marked 3 inline comments as done.
hselasky retitled this revision from Add support for IPoIB bonding in FreeBSD to Add support for IPoIB lagg devices in FreeBSD.
hselasky edited the summary of this revision. (Show Details)

Implement suggestions from @melifaro .

This revision now requires review to proceed. Oct 22 2020, 8:13 AM

Allow if_infiniband to be unloaded.

This revision was not accepted when it landed; it landed in state Needs Review. Oct 22 2020, 9:10 AM
This revision was automatically updated to reflect the committed changes.

It appears that tests/sys/net/if_lagg_test.sh has been failing since this change: https://ci.freebsd.org/job/FreeBSD-main-amd64-test/16921/ is the first build where the test fails.

> It appears that tests/sys/net/if_lagg_test.sh has been failing since this change: https://ci.freebsd.org/job/FreeBSD-main-amd64-test/16921/ is the first build where the test fails.

Is the test failure only related to the LORs, or to something else? This change does not affect the code paths mentioned in the LOR, from what I can see.

--HPS

The test appears to grep for "lagg_" in the LOR messages and there are two that match:

Lock order reversal between "in_multi_sx"(sx) and "if_lagg sx"(sx)!
Lock order "in_multi_sx"(sx) -> "if_lagg sx"(sx) first seen at:
#0 0xffffffff80c7bfcd at witness_checkorder+0x46d
#1 0xffffffff80c17b17 at _sx_xlock+0x67
#2 0xffffffff826d1bf0 at lagg_ioctl+0xe0
#3 0xffffffff80d3026d at if_addmulti+0x3fd
#4 0xffffffff80dbfbdd at in_joingroup_locked+0x27d
#5 0xffffffff80dbf932 at in_joingroup+0x42
#6 0xffffffff80dba825 at in_control+0xa25
#7 0xffffffff80d30a38 at ifioctl+0x3d8
#8 0xffffffff80c824d9 at kern_ioctl+0x289
#9 0xffffffff80c8219a at sys_ioctl+0x12a
#10 0xffffffff810bd629 at amd64_syscall+0x749
#11 0xffffffff8109010e at fast_syscall_common+0xf8

and

Lock order reversal between "in_control"(sx) and "if_lagg sx"(sx)!
Lock order "in_control"(sx) -> "if_lagg sx"(sx) first seen at:
#0 0xffffffff80c7bfcd at witness_checkorder+0x46d
#1 0xffffffff80c17b17 at _sx_xlock+0x67
#2 0xffffffff826d19ee at lagg_init+0x2e
#3 0xffffffff80d36c1e at ether_ioctl+0x1be
#4 0xffffffff826d20e1 at lagg_ioctl+0x5d1
#5 0xffffffff80dba7dd at in_control+0x9dd
#6 0xffffffff80d30a38 at ifioctl+0x3d8
#7 0xffffffff80c824d9 at kern_ioctl+0x289
#8 0xffffffff80c8219a at sys_ioctl+0x12a
#9 0xffffffff810bd629 at amd64_syscall+0x749
#10 0xffffffff8109010e at fast_syscall_common+0xf8

So I'm not sure why it would start failing after this change.

This looks like a generic issue to me.