
Teach lagg to use rate-limited send tags for lacp
Abandoned (Public)

Authored by gallatin on Feb 19 2020, 2:30 AM.

Details

Reviewers
hselasky
jhb
scottl
np
rrs
Group Reviewers
manpages
Summary

When NICs are overwhelmed with egress traffic (and return ENOBUFS when attempting to enqueue traffic), lacp metadata may be dropped or severely delayed, causing lagg link flapping.

Mellanox NICs allocate per-send-tag transmit queues, so if we allocate a rate-limited send tag for lacp, lacp gets to avoid the congested TX queues and send its metadata on an uncongested queue. This allows lacp metadata to be delivered to the link partner, avoiding link flaps when under very high load.

This approach was suggested by Mellanox. I'm not at all sure it is a good idea. My concern is that for NICs like Chelsio, which support rate limiting but do not support allocating a separate queue for each rate-limited send tag, this may actually delay lacp metadata even further if it is prioritized behind non-rate-limited traffic.
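To make the failure mode concrete, here is a small userspace C sketch of the idea (illustrative only, not kernel code; `txq`, `txq_enqueue`, and the ring sizes are invented): when the shared transmit ring is full, enqueue fails with ENOBUFS and the LACPDU is dropped, while a dedicated, uncongested queue still accepts it.

```c
#include <assert.h>
#include <errno.h>

/* Toy model of a fixed-size transmit ring. */
struct txq {
	int used;	/* descriptors currently in flight */
	int size;	/* ring capacity */
};

/*
 * Mimics a driver transmit path: returns 0 on success and
 * ENOBUFS when the ring has no free descriptors.
 */
static int
txq_enqueue(struct txq *q)
{
	if (q->used >= q->size)
		return (ENOBUFS);
	q->used++;
	return (0);
}
```

Under heavy egress load the shared ring stays full, so every LACPDU enqueue hits the ENOBUFS path; a per-send-tag queue carrying only lacp traffic essentially never does.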

Test Plan

Overload a web server and observe link flapping with mlx5_en-based NICs while netstat shows output drops. Enable the rate-limited queue via ifconfig lagg0 use_rl and observe a stable lagg0 link.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 29485

Event Timeline

sys/net/ieee8023ad_lacp.c
546

M_WAITOK probably will work here.

558

I suspect you can use M_WAITOK here.

sys/net/ieee8023ad_lacp.h
232

Why not just inline the tag? It is pretty small already, and that saves a memory allocation.

Uppercase NIC; the rest of the man page looks good.

sbin/ifconfig/ifconfig.8
2555

s/nic/NIC/

@np How do you think this would work on Chelsio NICs? If it works as expected, or can be made to work, this would be great for lacp. I would love to add this to carp as well; I was only able to get QoS on the switch to protect the other side of the connection, with nothing to protect outgoing packets.

Sorry to go off topic with the carp bits.

np requested changes to this revision. Feb 19 2020, 8:41 PM

This doesn't look right to me.

Rate limiting a connection doesn't have much to do with tx queue
selection except on mlx5_en but this patch adds this assumption to lagg.
Even in theory this can work only with kernels with RATELIMIT (GENERIC
doesn't have it) and then only with the two NICs that support RATELIMIT
(cxgbe and mlx5_en). In practice this might fix the problem for mlx5_en
but will make it worse for cxgbe.

Scott added rsrv_noflowq to cxgbe in r261558 for this very reason. That
is driver-specific too but at least the change is entirely within the
driver and doesn't pollute sys/net. Can't something like that be done here?

This revision now requires changes to proceed. Feb 19 2020, 8:41 PM

Nick, just set hw.cxgbe.rsrv_noflowq and all tx traffic that doesn't have a flowid (LACP, ARP, etc.)
will be treated as special by cxgbe and will be transmitted over a reserved tx queue.
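For reference, this is a boot-time setting; a minimal loader.conf fragment might look like the following (the tunable name here follows the rsrv_noflowq knob from r261558 mentioned above, and the exact spelling should be checked against the driver):

```
# /boot/loader.conf -- reserve a tx queue for traffic without a flowid
hw.cxgbe.rsrv_noflowq="1"
```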

In D23752#521981, @np wrote:

This doesn't look right to me.

Rate limiting a connection doesn't have much to do with tx queue
selection except on mlx5_en but this patch adds this assumption to lagg.
Even in theory this can work only with kernels with RATELIMIT (GENERIC
doesn't have it) and then only with the two NICs that support RATELIMIT
(cxgbe and mlx5_en). In practice this might fix the problem for mlx5_en
but will make it worse for cxgbe.

Scott added rsrv_noflowq to cxgbe in r261558 for this very reason. That
is driver-specific too but at least the change is entirely within the
driver and doesn't pollute sys/net. Can't something like that be done here?

Yes, that's what I requested from Mellanox, and I would vastly prefer that.

@np: Can the Chelsio NIC allocate one queue for each ratelimited connection?

It is better to use send tags for this, because the sysctl for this feature is non-standard.

@np: Can the Chelsio NIC allocate one queue for each ratelimited connection?

Sure, but why would it if it can get by without having to? I find it quite wasteful to allocate one tx queue for every connection. Each queue needs physically contiguous memory for the descriptor ring, some more for the bookkeeping in software, and some hw resources. A fixed number of tx queues is all that's needed for any number of rate limited connections with cxgbe so the driver just allocates them at init and doesn't have to do anything else. The whole point of having these special out of order queues is to avoid a queue per connection.

It is better to use send tags for this, because the sysctl for this feature is non-standard.

And the proposed patch is also non-standard. It works only with custom kernels (options RATELIMIT) and only for one driver but still manages to pollute sys/net in the process. If you're not doing rate limiting then don't misuse the RATELIMIT tag. Either come up with a new tag or find an existing field in the mbuf to use for this.
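The fixed-pool approach np describes can be sketched in a few lines (again illustrative userspace C, not cxgbe code; the names and pool size are invented): any number of rate-limited flows map onto a small, fixed set of special tx queues allocated once at init, so no per-flow ring memory is ever needed.

```c
#include <assert.h>
#include <stdint.h>

/* Fixed pool of special tx queues, allocated once at attach time. */
#define N_RL_TXQ 8

/*
 * Map a rate-limited flow onto one of the fixed queues.  Two flows
 * may share a queue, but the queue count never grows with the
 * number of connections.
 */
static unsigned int
rl_txq_for_flow(uint32_t flowid)
{
	return (flowid % N_RL_TXQ);
}
```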

In D23752#522064, @np wrote:

@np: Can the Chelsio NIC allocate one queue for each ratelimited connection?

Sure, but why would it if it can get by without having to? I find it quite wasteful to allocate one tx queue for every connection. Each queue needs physically contiguous memory for the descriptor ring, some more for the bookkeeping in software, and some hw resources. A fixed number of tx queues is all that's needed for any number of rate limited connections with cxgbe so the driver just allocates them at init and doesn't have to do anything else. The whole point of having these special out of order queues is to avoid a queue per connection.

Sure, you need more memory, but then one TX queue overflowing doesn't disturb the other ones!

It is better to use send tags for this, because the sysctl for this feature is non-standard.

And the proposed patch is also non-standard. It works only with custom kernels (options RATELIMIT) and only for one driver but still manages to pollute sys/net in the process. If you're not doing rate limiting then don't misuse the RATELIMIT tag. Either come up with a new tag or find an existing field in the mbuf to use for this.

If a Chelsio NIC receives an mbuf w/o a valid flow ID on a ratelimit send tag, will that packet go on the special send-queue or not?

--HPS

Abandoning this in favor of moving selection of an uncongested queue for lacp into the mlx5 driver.