
carp: support unicast
Closed / Public

Authored by kp on Mar 7 2023, 10:47 AM.

Details

Summary

Allow users to configure the address to send carp messages to. This
allows carp to be used in unicast mode, which is useful in certain
virtual configurations (e.g. AWS, VMWare ESXi, ...)

Sponsored by: Rubicon Communications, LLC ("Netgate")

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 50481
Build 47372: arc lint + arc unit

Event Timeline

kp requested review of this revision. Mar 7 2023, 10:47 AM

I'm a little torn on how to handle the extension in the interface to userspace. I've added a new ioctl for it, but we could also extend the struct (and then teach the existing ioctl to cope with two sizes of structure), or we could convert the whole thing to using netlink, to make future extensions easier.
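To make the "two sizes of structure" option concrete, here is a minimal stand-alone sketch. It is not code from this patch: `demo_req_v1`, `demo_req_v2` and `demo_import_req()` are hypothetical names, and the layouts are invented for illustration. It only shows the general idea of a handler that accepts either the old or the extended layout and zero-fills whatever the caller did not supply.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical original request layout. */
struct demo_req_v1 {
	int32_t	vhid;
	int32_t	advbase;
};

/* Hypothetical extended layout: v1 is a strict prefix of v2. */
struct demo_req_v2 {
	int32_t	vhid;
	int32_t	advbase;
	uint8_t	peer[4];	/* new member, e.g. an IPv4 peer address */
};

/*
 * Accept a request in either layout. Members missing from the old
 * layout are left zeroed; any other size is rejected.
 */
static int
demo_import_req(const void *src, size_t len, struct demo_req_v2 *dst)
{
	assert(src != NULL && dst != NULL);
	if (len != sizeof(struct demo_req_v1) &&
	    len != sizeof(struct demo_req_v2))
		return (EINVAL);
	memset(dst, 0, sizeof(*dst));
	memcpy(dst, src, len);
	return (0);
}
```

Every later extension would add one more accepted size to a handler like this, which is the extensibility concern raised in the replies.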

Thank you for working on this!
The unicast option is a good addition indeed.
My only concern is that we're adding global carp-specific ioctls that don't look easy to extend afterwards. I'd prefer to create an easily extensible API, such as netlink.
I'm happy to work with you on (or implement myself) a variation of this patch that uses netlink as the control mechanism.

I'm happy to work with you on (or implement myself) a variation of this patch that uses netlink as the control mechanism.

I'm going to use this as an opportunity to get to know netlink a bit more, so I'll take a stab at a first version myself, and then we'll go from there.
It probably makes sense to extend the existing code to use netlink and then rebase these patches (now using netlink) on top.

In D38940#886690, @kp wrote:

I'm going to use this as an opportunity to get to know netlink a bit more, so I'll take a stab at a first version myself, and then we'll go from there.

The first thing I'm running into is "The library does not currently offer any wrappers for writing netlink messages.". I can compose it directly, but if I understand things correctly that means we're not doing the TLV thing, and we're back to painting ourselves into a corner w.r.t. extensibility.
I can push onwards for now, but I think that should be addressed before we commit to a conversion to netlink.

I'm also not sure I understand the distinction between fields and attributes. Oh. Wait, are fields always there (i.e. the things we can easily compose directly with a struct before the snl_send() call), and attributes are optional? So for carp it'd probably make sense to have at least 'ifname' be a field, and things like vhid be attributes (so we can optionally filter on vhid)?

In D38940#886729, @kp wrote:
In D38940#886690, @kp wrote:

I'm going to use this as an opportunity to get to know netlink a bit more, so I'll take a stab at a first version myself, and then we'll go from there.

The first thing I'm running into is "The library does not currently offer any wrappers for writing netlink messages.". I can compose it directly, but if I understand things correctly that means we're not doing the TLV thing, and we're back to painting ourselves into a corner w.r.t. extensibility.

I'll have it working for route(8) conversion - will publish writers diff either later today or tomorrow.

I can push onwards for now, but I think that should be addressed before we commit to a conversion to netlink.

Sure, I don't want anyone to write raw message composing code.

I'm also not sure I understand the distinction between fields and attributes. Oh. Wait, are fields always there (i.e. the things we can easily compose directly with a struct before the snl_send() call), and attributes are optional? So for carp it'd probably make sense to have at least 'ifname' be a field, and things like vhid be attributes (so we can optionally filter on vhid)?

Yes, fields are mandatory (in the sense that they need to be present on the wire), while attributes are not _mandatory_ per protocol, but the handler can of course require a certain number of them to be present.
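The field/attribute split described above can be illustrated with a small stand-alone sketch. It deliberately does not use the real netlink or snl(3) API: `demo_hdr`, `demo_attr`, `DEMO_ATTR_VHID` and both helpers are hypothetical names. The fixed header plays the role of the always-present fields, and the type-length-value records after it play the role of optional attributes that a reader can skip when it does not know the type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed header: these fields are always on the wire. */
struct demo_hdr {
	char	ifname[16];
};

/* Hypothetical attribute header; the payload follows, padded to 4 bytes. */
struct demo_attr {
	uint16_t	len;	/* header + payload, excluding padding */
	uint16_t	type;
};

#define	DEMO_ALIGN(n)	(((size_t)(n) + 3) & ~(size_t)3)
#define	DEMO_ATTR_VHID	1	/* hypothetical attribute type */

/* Append one attribute at offset off; returns the new offset, 0 if no room. */
static size_t
demo_add_attr(uint8_t *buf, size_t cap, size_t off, uint16_t type,
    const void *data, uint16_t dlen)
{
	struct demo_attr a = {
		.len = (uint16_t)(sizeof(a) + dlen),
		.type = type,
	};
	size_t need = DEMO_ALIGN(sizeof(a) + dlen);

	assert(buf != NULL && data != NULL);
	if (off > cap || cap - off < need)
		return (0);
	memset(buf + off, 0, need);
	memcpy(buf + off, &a, sizeof(a));
	memcpy(buf + off + sizeof(a), data, dlen);
	return (off + need);
}

/* Find the first attribute of the given type, skipping all other types. */
static const uint8_t *
demo_find_attr(const uint8_t *buf, size_t end, size_t off, uint16_t type,
    uint16_t *dlen)
{
	struct demo_attr a;

	while (off + sizeof(a) <= end) {
		memcpy(&a, buf + off, sizeof(a));
		if (a.len < sizeof(a) || off + a.len > end)
			return (NULL);	/* malformed attribute */
		if (a.type == type) {
			*dlen = (uint16_t)(a.len - sizeof(a));
			return (buf + off + sizeof(a));
		}
		off += DEMO_ALIGN(a.len);
	}
	return (NULL);
}
```

A caller would place a `struct demo_hdr` at the start of the buffer, append attributes starting at offset `sizeof(struct demo_hdr)`, and look up only the types it understands. New attribute types can then be added without changing the fixed header.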

For the network interfaces API, netlink has a concept of sharing custom per-interface data in the RTM_NEWLINK messages. Nested attributes IFLA_LINKINFO / IFLA_INFO_DATA are used for that.
I was thinking of using a variation of the cloners to provide this additional data: https://cgit.freebsd.org/src/tree/sys/netlink/route/iface_drivers.c#n207 . This is a creation example for vlan, along with the modification / dump callbacks.

Published the writer code in D38947 (will update with the documentation tomorrow).

This allows carp to be used in unicast mode, which is useful in certain virtual configurations (e.g. AWS, VMWare ESXi, ...)

I use carp in my production environment (ESXi), and I have to enable promisc mode for virtual ports so that multicast frames can forward between VMs.

I am not familiar with AWS, so I suppose AWS does not provide a mechanism like ESXi's. If AWS were willing to provide one (configuration of promisc mode), then this feature would not be that useful.

After digging into D38941 I'm even more confused about the design of unicast mode.
Is it two boxes sharing the same carp group but exchanging CARP messages across routers? That seems to be an anycast setup. For anycast we have BGP, so this looks like overkill for the CARP protocol.

And what about a carp group with three or more boxes? Would we then need peers instead of a single peer?

In D38940#887150, @zlei wrote:

This allows carp to be used in unicast mode, which is useful in certain virtual configurations (e.g. AWS, VMWare ESXi, ...)

I use carp in my production environment (ESXi), and I have to enable promisc mode for virtual ports so that multicast frames can forward between VMs.

I am not familiar with AWS, so I suppose AWS does not provide a mechanism like ESXi's. If AWS were willing to provide one (configuration of promisc mode), then this feature would not be that useful.

It doesn't. Yesterday I discovered that the current version of the patch doesn't work in AWS, because it changes the source MAC address to a multicast address. I've got a fix for that (i.e. don't change the MAC address in unicast mode) that allows it to work in AWS. Even in promiscuous mode the multicast traffic never arrives, so we do need this at least for AWS, and it would be useful for ESXi to avoid needing to run interfaces in promiscuous mode.

After digging into D38941 I'm even more confused about the design of unicast mode.
Is it two boxes sharing the same carp group but exchanging CARP messages across routers? That seems to be an anycast setup. For anycast we have BGP, so this looks like overkill for the CARP protocol.

That's the intention, yes. The idea is to allow an elastic IP (in AWS parlance) to be failed over between different instances, without making assumptions about the virtual network layout.

And what about a carp group with three or more boxes? Would we then need peers instead of a single peer?

Interesting point. I hadn't considered that, mostly because this mirrors the functionality in pfsync, where unicast mode is also limited to two peers. My inclination is to start off with the smaller change (i.e. supporting two boxes in the carp group) before we worry about other scenarios.

In D38940#887195, @kp wrote:
In D38940#887150, @zlei wrote:

This allows carp to be used in unicast mode, which is useful in certain virtual configurations (e.g. AWS, VMWare ESXi, ...)

I use carp in my production environment (ESXi), and I have to enable promisc mode for virtual ports so that multicast frames can forward between VMs.

I am not familiar with AWS, so I suppose AWS does not provide a mechanism like ESXi's. If AWS were willing to provide one (configuration of promisc mode), then this feature would not be that useful.

It doesn't. Yesterday I discovered that the current version of the patch doesn't work in AWS, because it changes the source MAC address to a multicast address. I've got a fix for that (i.e. don't change the MAC address in unicast mode) that allows it to work in AWS.

I think AWS and ESXi by default block those forged packets with the CARP (VRRP) multicast source MAC address.

Even in promiscuous mode the multicast traffic never arrives, so we do need this at least for AWS, and it would be useful for ESXi to avoid needing to run interfaces in promiscuous mode.

That is a little tricky. I see that carp_multicast_setup() calls in_joingroup() / in6_joingroup(), so the interface will join the CARP multicast group and promiscuous mode is not needed. Am I missing something?

After digging into D38941 I'm even more confused about the design of unicast mode.
Is it two boxes sharing the same carp group but exchanging CARP messages across routers? That seems to be an anycast setup. For anycast we have BGP, so this looks like overkill for the CARP protocol.

That's the intention, yes. The idea is to allow an elastic IP (in AWS parlance) to be failed over between different instances, without making assumptions about the virtual network layout.

That may be worked around with CARP over VXLAN and devd, or with net/ucarp.

And what about a carp group with three or more boxes? Would we then need peers instead of a single peer?

Interesting point. I hadn't considered that, mostly because this mirrors the functionality in pfsync, where unicast mode is also limited to two peers. My inclination is to start off with the smaller change (i.e. supporting two boxes in the carp group) before we worry about other scenarios.

In D38940#887365, @zlei wrote:
In D38940#887195, @kp wrote:
In D38940#887150, @zlei wrote:

This allows carp to be used in unicast mode, which is useful in certain virtual configurations (e.g. AWS, VMWare ESXi, ...)

I use carp in my production environment (ESXi), and I have to enable promisc mode for virtual ports so that multicast frames can forward between VMs.

I am not familiar with AWS, so I suppose AWS does not provide a mechanism like ESXi's. If AWS were willing to provide one (configuration of promisc mode), then this feature would not be that useful.

It doesn't. Yesterday I discovered that the current version of the patch doesn't work in AWS, because it changes the source MAC address to a multicast address. I've got a fix for that (i.e. don't change the MAC address in unicast mode) that allows it to work in AWS.

I think AWS and ESXi by default block those forged packets with the CARP (VRRP) multicast source MAC address.

I wouldn't describe the traffic as forged. (And I was wrong, it's not a multicast MAC address, but a "virtual router MAC address"). The VRRP standard at least explicitly requires that address to be used. CARP isn't formally standardised (at least that I could find), but it behaves very like VRRP.
In any event, we need to do things slightly differently to make them work in cloud setups.

Even in promiscuous mode the multicast traffic never arrives, so we do need this at least for AWS, and it would be useful for ESXi to avoid needing to run interfaces in promiscuous mode.

That is a little tricky. I see that carp_multicast_setup() calls in_joingroup() / in6_joingroup(), so the interface will join the CARP multicast group and promiscuous mode is not needed. Am I missing something?

Cloud networks tend to be a bit special. At least in the first tests the multicast traffic simply didn't make it between VMs (in AWS). I don't currently have access to an ESXi instance, but I understood from your comments that it needs workarounds for multicast. Again something we'd like to avoid having to deal with.

After digging into D38941 I'm even more confused about the design of unicast mode.
Is it two boxes sharing the same carp group but exchanging CARP messages across routers? That seems to be an anycast setup. For anycast we have BGP, so this looks like overkill for the CARP protocol.

That's the intention, yes. The idea is to allow an elastic IP (in AWS parlance) to be failed over between different instances, without making assumptions about the virtual network layout.

That may be worked around with CARP over VXLAN and devd, or with net/ucarp.

I'm insufficiently familiar with VXLAN to say if it'll accept multicast traffic or not, but in any event, we'd also like to keep the carp code as similar as possible between the physical hardware and cloud instance case.

In D38940#887375, @kp wrote:
In D38940#887365, @zlei wrote:
In D38940#887195, @kp wrote:
In D38940#887150, @zlei wrote:

This allows carp to be used in unicast mode, which is useful in certain virtual configurations (e.g. AWS, VMWare ESXi, ...)

I use carp in my production environment (ESXi), and I have to enable promisc mode for virtual ports so that multicast frames can forward between VMs.

I am not familiar with AWS, so I suppose AWS does not provide a mechanism like ESXi's. If AWS were willing to provide one (configuration of promisc mode), then this feature would not be that useful.

It doesn't. Yesterday I discovered that the current version of the patch doesn't work in AWS, because it changes the source MAC address to a multicast address. I've got a fix for that (i.e. don't change the MAC address in unicast mode) that allows it to work in AWS.

I think AWS and ESXi by default block those forged packets with the CARP (VRRP) multicast source MAC address.

I wouldn't describe the traffic as forged. (And I was wrong, it's not a multicast MAC address, but a "virtual router MAC address"). The VRRP standard at least explicitly requires that address to be used. CARP isn't formally standardised (at least that I could find), but it behaves very like VRRP.

Sorry, I did not mean that they (packets with a CARP (VRRP) multicast source MAC address) are forged, but that AWS / ESXi might treat them as forged [1].

In any event, we need to do things slightly differently to make them work in cloud setups.

Even in promiscuous mode the multicast traffic never arrives, so we do need this at least for AWS, and it would be useful for ESXi to avoid needing to run interfaces in promiscuous mode.

That is a little tricky. I see that carp_multicast_setup() calls in_joingroup() / in6_joingroup(), so the interface will join the CARP multicast group and promiscuous mode is not needed. Am I missing something?

Cloud networks tend to be a bit special. At least in the first tests the multicast traffic simply didn't make it between VMs (in AWS). I don't currently have access to an ESXi instance, but I understood from your comments that it needs workarounds for multicast. Again something we'd like to avoid having to deal with.

After digging into D38941 I'm even more confused about the design of unicast mode.
Is it two boxes sharing the same carp group but exchanging CARP messages across routers? That seems to be an anycast setup. For anycast we have BGP, so this looks like overkill for the CARP protocol.

That's the intention, yes. The idea is to allow an elastic IP (in AWS parlance) to be failed over between different instances, without making assumptions about the virtual network layout.

That may be worked around with CARP over VXLAN and devd, or with net/ucarp.

I'm insufficiently familiar with VXLAN to say if it'll accept multicast traffic or not,

I can confirm it will.

but in any event, we'd also like to keep the carp code as similar as possible between the physical hardware and cloud instance case.

I believe that's a good approach, though I cannot tell which one is better: a workaround, or more features to support non-standard-compliant devices.

[1] https://docs.vmware.com/en/VMware-vSphere/7.0/com.vmware.vsphere.security.doc/GUID-7DC6486F-5400-44DF-8A62-6273798A2F80.html

sys/netinet/ip_carp.c
1932

This might be too aggressive. I think EINVAL or ENOTSUP is sufficient.

I'm happy to work with you on (or implement myself) a variation of this patch that uses netlink as the control mechanism.

I've got a first draft of netlink for carp here: D39048

Rebase on top of netlink-ified carp

Generally, LGTM, please see some comments (mostly related to the IP address handling) inline.

sys/netinet/ip_carp.c
1648
2284

The traditional approach in NETLINK_ROUTE is to have a single attribute for the address and distinguish between IPv4 and IPv6 by its length (see parse_rta_ip()). Maybe it's worth applying here as well.
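The "single attribute, distinguished by length" idea can be sketched in userland as follows. This is not parse_rta_ip() itself: `demo_addr` and `demo_parse_addr()` are hypothetical names, and the sketch only shows how a 4-byte payload would be taken as IPv4 and a 16-byte payload as IPv6.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parsed form of an address attribute. */
struct demo_addr {
	sa_family_t	family;
	union {
		struct in_addr	v4;
		struct in6_addr	v6;
	} u;
};

/* Infer the address family from the payload length alone. */
static int
demo_parse_addr(const void *payload, size_t len, struct demo_addr *out)
{
	assert(payload != NULL && out != NULL);
	memset(out, 0, sizeof(*out));
	switch (len) {
	case sizeof(struct in_addr):
		out->family = AF_INET;
		memcpy(&out->u.v4, payload, len);
		return (0);
	case sizeof(struct in6_addr):
		out->family = AF_INET6;
		memcpy(&out->u.v6, payload, len);
		return (0);
	default:
		return (EINVAL);	/* neither 4 nor 16 bytes */
	}
}
```

This works when a message carries exactly one address. It is less natural when a single object may need one IPv4 and one IPv6 address at the same time.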

2374

Q: any reason not to use struct in_addr ?

sys/netlink/netlink_message_parser.c
352

Any reason not to use parse_rta_ip[6] ?

sys/netlink/netlink_message_parser.h
35

Shouldn't be needed here.

sys/netlink/netlink_message_writer.h
34

I'd prefer not to include it here and instead just use forward declarations for struct in_addr and struct in6_addr, with a static size of 4/16 for attribute writing.
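A minimal sketch of that suggestion, with hypothetical helper names (`demo_put_bytes()`, `demo_put_in_addr()`, `demo_put_in6_addr()`) rather than the real writer API: the header-like part only forward-declares the address structures, so it never needs their definitions, and the copy sizes are the fixed 4 and 16 bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Forward declarations only: <netinet/in.h> is not needed here. */
struct in_addr;
struct in6_addr;

/* Copy len bytes into buf; returns bytes written, 0 if buf is too small. */
static size_t
demo_put_bytes(uint8_t *buf, size_t cap, const void *data, size_t len)
{
	assert(buf != NULL && data != NULL);
	if (cap < len)
		return (0);
	memcpy(buf, data, len);
	return (len);
}

/* An IPv4 address is always 4 bytes, so sizeof(struct in_addr) is not needed. */
static inline size_t
demo_put_in_addr(uint8_t *buf, size_t cap, const struct in_addr *addr)
{
	return (demo_put_bytes(buf, cap, addr, 4));
}

/* An IPv6 address is always 16 bytes. */
static inline size_t
demo_put_in6_addr(uint8_t *buf, size_t cap, const struct in6_addr *addr)
{
	return (demo_put_bytes(buf, cap, addr, 16));
}
```

Only callers that actually construct addresses would then include <netinet/in.h>.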

sys/netlink/netlink_snl.h
519

netlink_snl is meant to be generic & network-independent.
I'd suggest either putting those to netlink_snl_route.h or using existing ones from there.

kp marked 3 inline comments as done. Mar 18 2023, 11:34 AM
kp added inline comments.
sys/netinet/ip_carp.c
2284

Carp is a little special, in that it uses IPv6 multicast to deal with IPv6 addresses, and v4 for v4. You can't provide failover for a v4 address over v6.

To make things even more fun, we don't actually know which one it is when we're setting up the vhid. And indeed, it's possible to have a single vhid provide redundancy for both an IPv4 and an IPv6 address. We'd have to allow for two addresses, and enforce that one of them must be v4 and one v6, and that seems like it'd make things more complex for no real gain.
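The point about one vhid covering both families can be sketched as a per-vhid record with one destination per address family. This is an illustration, not the patch's data structure: `demo_carp_peers` and the helpers are hypothetical, and the default groups 224.0.0.18 and ff02::12 are the well-known VRRP/CARP multicast groups, assumed here rather than quoted from the diff.

```c
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <assert.h>
#include <string.h>

/* Hypothetical per-vhid destinations: one per address family. */
struct demo_carp_peers {
	struct in_addr	peer4;
	struct in6_addr	peer6;
};

/* Start in multicast mode for both families. Returns 0 on success. */
static int
demo_peers_default(struct demo_carp_peers *p)
{
	assert(p != NULL);
	memset(p, 0, sizeof(*p));
	if (inet_pton(AF_INET, "224.0.0.18", &p->peer4) != 1)
		return (-1);
	if (inet_pton(AF_INET6, "ff02::12", &p->peer6) != 1)
		return (-1);
	return (0);
}

/* Unicast mode for IPv4 means the configured peer is not a multicast group. */
static int
demo_peer4_is_unicast(const struct demo_carp_peers *p)
{
	return (!IN_MULTICAST(ntohl(p->peer4.s_addr)));
}

/* Same check for the IPv6 destination. */
static int
demo_peer6_is_unicast(const struct demo_carp_peers *p)
{
	return (!IN6_IS_ADDR_MULTICAST(&p->peer6));
}
```

With separate members, configuring an IPv4 peer leaves the IPv6 destination untouched, and no rule of the form "exactly one v4 and one v6 address" has to be enforced on a shared attribute.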

2374

No, other than laziness. I'll see about changing that so it's closer to the v6 case.

sys/netlink/netlink_message_parser.c
352

I don't want to deal with struct sockaddr.

sys/netlink/netlink_message_parser.h
35

Thanks for catching that.

I knew when I was writing it that I didn't want to add that to generic netlink headers, but it was the most straightforward path to getting something to work, and I didn't immediately see where this would belong.

sys/netlink/netlink_snl.h
519

And that's the suggestion I was looking for when I took the easy way forward. I'll move it to netlink_snl_route.h.

sys/netinet/ip_carp.c
1596

We shouldn't get here, as the check above disallows anything except AF_INET / AF_INET6.

2284

I got it, makes sense.

kp marked 2 inline comments as done.

Review remarks

Thank you! LGTM for the netlink side. Left some minor comments I missed earlier.

The only remaining bit is the <netinet/in.h> header in netlink_message_writer.h; I'd love to avoid having that.

sys/netinet/ip_carp.c
737
1649

Nit: is this some magic number worth defining in the header?

1936–1938

Nit: maybe it's better to use C11 initializers instead of doing a dedicated memset later?
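A minimal sketch of the two styles being compared, using a hypothetical `demo_carp_req` structure rather than the one in ip_carp.c. Designated initializers (available since C99) zero every member that is not named, so the separate memset() is unnecessary for the members themselves.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request structure, for illustration only. */
struct demo_carp_req {
	int	vhid;
	int	advbase;
	int	advskew;
	uint8_t	key[20];
};

/* Style 1: declare, memset, then assign the members that matter. */
static struct demo_carp_req
demo_req_memset(int vhid)
{
	struct demo_carp_req r;

	assert(vhid >= 1 && vhid <= 255);
	memset(&r, 0, sizeof(r));
	r.vhid = vhid;
	r.advbase = 1;
	return (r);
}

/* Style 2: designated initializers; unnamed members are zero-initialized. */
static struct demo_carp_req
demo_req_init(int vhid)
{
	struct demo_carp_req r = {
		.vhid = vhid,
		.advbase = 1,
	};

	assert(vhid >= 1 && vhid <= 255);
	return (r);
}
```

One caveat: an initializer is not guaranteed to clear padding bytes between members, so a structure that is later copied out to userspace byte-for-byte may still warrant an explicit memset().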

sys/netlink/netlink_snl.h
47

Shouldn't be needed here

This revision is now accepted and ready to land. Mar 18 2023, 4:07 PM
kp marked an inline comment as done. Mar 20 2023, 10:49 AM
kp added inline comments.
sys/netinet/ip_carp.c
737

I'm only moving this code around. It's a good improvement though, so I'll do that for the entire file in a separate commit.

1649

Possibly. Constants for IPv6 addresses are a bit annoying, which is probably why there wasn't one here before these changes either.

This revision now requires review to proceed. Mar 20 2023, 10:49 AM
This revision is now accepted and ready to land. Mar 20 2023, 11:02 AM
This revision was automatically updated to reflect the committed changes.