Page MenuHomeFreeBSD

Implement kernel support for hardware rate limited sockets

Authored by allanjude on Sep 17 2015, 12:55 PM.



Implement kernel support for hardware rate limited sockets.

  • Add RATELIMIT kernel configuration keyword which must be set to

enable the new functionality.

  • Add support for hardware driven, Receive Side Scaling, RSS aware, rate

limited sendqueues and expose the functionality through the already
established SO_MAX_PACING_RATE setsockopt(). The API support rates in
the range from 1 to 4Gbytes/s which are suitable for regular TCP and
UDP streams. The setsockopt(2) manual page has been updated.

  • Add rate limit function callback API to "struct ifnet" which supports

the following operations: if_snd_tag_alloc(), if_snd_tag_modify(),
if_snd_tag_query() and if_snd_tag_free().

  • Add support to ifconfig to view, set and clear the IFCAP_TXRTLMT

flag, which tells if a network driver supports rate limiting or not.

  • This patch also adds support for rate limiting through VLAN and LAGG

intermediate network devices.

  • How rate limiting works:
  1. The userspace application calls setsockopt() after accepting or

making a new connection to set the rate which is then stored in the
socket structure in the kernel. Later on when packets are transmitted
a check is made in the transmit path for rate changes. A rate change
implies a non-blocking ifp->if_snd_tag_alloc() call will be made to the
destination network interface, which then sets up a custom sendqueue
with the given rate limitation parameter. A "struct m_snd_tag" pointer is
returned which serves as a "snd_tag" hint in the m_pkthdr for the
subsequently transmitted mbufs.

  1. When the network driver sees the "m->m_pkthdr.snd_tag" different

from NULL, it will move the packets into a designated rate limited sendqueue
given by the snd_tag pointer. It is up to the individual drivers how the rate
limited traffic will be rate limited.

  1. Route changes are detected by the NIC drivers in the ifp->if_transmit()

routine when the ifnet pointer in the incoming snd_tag mismatches the
one of the network interface. The network adapter frees the mbuf and
returns EAGAIN which causes the ip_output() to release and clear the send
tag. Upon next ip_output() a new "snd_tag" will be tried allocated.

  1. When the PCB is detached the custom sendqueue will be released by a

non-blocking ifp->if_snd_tag_free() call to the currently bound network

Sponsored by: Mellanox Technologies

Diff Detail

rS FreeBSD src repository
Lint Skipped
Unit Tests Skipped
Build Status
Buildable 6856

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
hselasky updated this revision to Diff 22327.Nov 18 2016, 1:54 PM
hselasky edited edge metadata.

Try to integrate @np changes. Please give some comments if this patch is going in the right direction.

Thank you!

gallatin added inline comments.Nov 18 2016, 2:29 PM

I wonder if it would be possible to un-staticize these functions for using h/w rate limiting internally in TCP.

We'd probably also want an API to query the available hardware pacing rates on a given inp.


Should you clear INP_RATE_LIMIT_CHANGED at this point?


What about adding a callback to this routine to allow the TCP stack to be notified of a rate-limit change (so that it may decide to move a connection to a software pacer, for example).


I thought we were going to turn the rcvif into a union, and make it a pointer to the snd_tag on xmit?

hselasky added inline comments.Nov 19 2016, 4:19 PM

Yes, we can do that, though I'm a little worried that someone will complain about nameless unions. What do you think?

hselasky added inline comments.Nov 29 2016, 10:44 AM

I'll make these functions global. Might also make debugging easier.


Yes, you're right. I'll fix.


I think there will be a locking issue, LOR, if you try that. But possibly a polling function to poll the status of the current mbuf send tag is appropriate.

hselasky updated this revision to Diff 22597.Nov 29 2016, 3:23 PM
hselasky updated this object.
hselasky edited edge metadata.

Implement suggestions from Drew.

Add preliminary support for LAGG and VLAN.

hselasky updated this revision to Diff 22617.Nov 30 2016, 12:27 PM
hselasky edited edge metadata.

Minor API refactor.

Declare one parameter structure for each send tag operation. This allows lagg to be implemented without knowing the tag type.

hselasky updated this revision to Diff 22627.Nov 30 2016, 7:21 PM
hselasky edited edge metadata.

Add missing NULL checks for inp pointer.

gallatin added inline comments.Nov 30 2016, 7:37 PM

How do you handle already allocated tags when the underlying ifp fails?

Eg, the way that lagg is supposed to work is that if a link fails, lagg is supposed to stop using the failed link. Maybe I'm missing something, but I don't see a mechanism to shoot down these tags when a link does down, or LACP otherwise fails.


I think it would be helpful to have an (optional) callback in the inpcb here.

The way we (netflix) intend to use this is not via the socket ioclts, but via direct interactions with the TCP stack. If a connection that was previously paced becomes unpaced, we need to know about it so that it can be transitioned to a software pacer.


It would be nice to have a counter here to keep track of howmany ratelimit escapes there were.


Maybe rename this "upgraded" & move the INP_WLOCKED() check directly into the if() -- the existing code is a bit hard to understand (at least for me). On my first read, I thought INP_DOWNGRADE() was unreachable because wlocked would have to be true (which is incorrect)

hselasky added inline comments.Nov 30 2016, 8:04 PM

Hi Drew,

When a mbuf is delivered to its final network interface, the network interface driver needs to check if "ifp" is equal to the ifp pointer inside the m_snd_tag. If the network interface pointers are equal we know the packet is on the right interface and continue, else we return EAGAIN and then the ip_output()/ip6_output() will free the mst and allocate a new one on the new destination ifp. This way we avoid superfluous checks in the fast path for lagg.


hselasky added inline comments.Nov 30 2016, 8:10 PM

There has been implemented a function to query the current rate. Is that sufficient?



gallatin added inline comments.Nov 30 2016, 8:18 PM

Ah, that makes sense. Without seeing the driver side matching this version, it is hard to know these things :)


I think we'd prefer a callback

hselasky updated this revision to Diff 22641.Dec 1 2016, 3:18 PM
hselasky edited edge metadata.
  • Update patch as per some suggestions from Drew.
  • Implement missing IFCAP flags handling in LAGG and VLAN
  • Use if_capenable instead of if_capabilities when checking for ratelimit capability.
hselasky updated this revision to Diff 22861.Dec 13 2016, 10:40 AM
hselasky updated this object.
hselasky edited edge metadata.

Refcount the correct network interface in case of VLAN and LAGG.
Updated commit message.

hselasky updated this revision to Diff 22863.Dec 13 2016, 10:47 AM
hselasky edited edge metadata.

IFC + fix a comment while at it.

wblock added inline comments.Dec 16 2016, 7:16 PM

This "should be" is kind of ambiguous. Maybe it is meant to imply that the limit can be set but might not be a hard limit? Otherwise, how about:

instruct the socket and underlying network adapter layers to limit the
transfer rate to the given unsigned 32-bit value in bytes per second.
hselasky updated this revision to Diff 23568.Jan 3 2017, 11:47 AM
hselasky edited edge metadata.


scottl accepted this revision.Jan 16 2017, 7:37 PM
scottl added a reviewer: scottl.
scottl added a subscriber: scottl.

There has been no dissension on this review since early November, and HPS has done a lot to update the code to address the comments from then. Given the lack of new input, I recommend that this code be committed.

gnn requested changes to this revision.Jan 16 2017, 7:49 PM
gnn edited edge metadata.

There are a few things in here that seem to be both important and unresolved. I'll push np@ and gallatin@ to chime in because they have been the most frequent commenters. If this is not unstuck by Wednesday then let's get it in. But I hate having to "fix things in post."

This revision now requires changes to proceed.Jan 16 2017, 7:49 PM

Fixing things post-commit is much better than forcing the work to languish. This review has been plagued by poor participation for far too long, and it's reflecting poorly on the FreeBSD community. I'm ok with giving people until Wednesday, but after that it goes in, love it or hate it.

emaste added a subscriber: emaste.Jan 17 2017, 1:18 AM
hselasky marked 32 inline comments as done.Jan 17 2017, 9:07 AM

Mark a bunch of comments as "done". A patch update implementing the suggestions will follow.

hselasky updated this revision to Diff 24104.Jan 17 2017, 9:13 AM
hselasky edited edge metadata.

Implemented proposals from @wblock and @gallatin .

Try to hide some more done comments.

gallatin accepted this revision.Jan 17 2017, 2:03 PM
gallatin edited edge metadata.
adrian accepted this revision.Jan 17 2017, 6:28 PM
adrian edited edge metadata.
hselasky updated this revision to Diff 24154.Jan 18 2017, 12:15 PM
hselasky updated this object.
hselasky edited edge metadata.


  • Resolve code conflicts with the recently committed SO_TS_CLOCK feature.
  • Remove "#ifdef RATELIMIT" from setsockopt() handler as per suggestion from @gallatin .
hselasky marked an inline comment as done.Jan 18 2017, 12:17 PM

Mark action from @gallatin as done.

hselasky accepted this revision.Jan 18 2017, 1:43 PM
hselasky abandoned this revision.
hselasky added a reviewer: hselasky.

Committed to FreeBSD 12-current:

You can accept and then close the review rather than abandoning it. Abandon is really for changes that are not going to be committed.

@emaste There was no close option in my dropdown list, even after accept.

emaste commandeered this revision.Jan 18 2017, 3:12 PM
emaste reclaimed this revision.
emaste accepted this revision.
emaste abandoned this revision.
emaste edited edge metadata.
emaste added a reviewer: emaste.

There was no close option in my dropdown list, even after accept.

Indeed, that seems like a bug.

allanjude commandeered this revision.Jan 18 2017, 3:29 PM
allanjude reclaimed this revision.
allanjude accepted this revision.
allanjude accepted this revision.
allanjude abandoned this revision.
allanjude edited edge metadata.
allanjude added a reviewer: allanjude.