Page MenuHomeFreeBSD

tcp: Congestion control cleanup.

Authored by rrs on Oct 27 2021, 1:54 PM.



This patch does a bit of cleanup on TCP congestion control modules. There were some rather
interesting surprises that one could get i.e. where you use a socket option to change
from one CC (say cc_cubic) to another CC (say cc_vegas) and you could in theory get
a memory failure and end up on cc_newreno. This is not what one would expect. The
new code fixes this by requiring a cc_data_sz() function so we can malloc with M_WAITOK
and pass in to the init function preallocated memory. The CC init is expected in this
case *not* to fail but if it does and a module does break the
"no fail with memory given" contract we do fall back to the CC that was in place at the time.

This also fixes up a set of common newreno utilities that can be shared amongst other
CC modules instead of the other CC modules reaching into newreno and executing
what they think is a "common and understood" function. Lets put these functions in
cc.c and that way we have a common place that is easily findable by future developers or
bug fixers. This also allows newreno to evolve and grow support for its features i.e. ABE
and HYSTART++ without having to dance through hoops for other CC modules, instead
both newreno and the other modules just call into the common functions if they desire
that behavior or roll there own if that makes more sense.

Note: This commit changes the kernel configuration!! If you are not using GENERIC in
some form you must add a CC module option (one of CC_NEWRENO, CC_VEGAS, CC_CUBIC,
CC_CDG, CC_CHD, CC_DCTCP, CC_HTCP, CC_HD). You can have more than one defined
as well if you desire. Note that if you create a kernel configuration that does not
define a congestion control module and includes INET or INET6 the kernel compile will
break. Also you need to define a default, generic adds 'options CC_DEFAULT=\"newreno\"
but you can specify any string that represents the name of the CC module (same names
that show up in the CC module list under If you fail to add the
options CC_DEFAULT in your kernel configuration the kernel build will also break.

Test Plan

Make sure we can switch CC modules under memory pressure and make sure
it succeeds. I may even want to violate the contract and have a CC module that
fails with passed in memory with non INVARIANT kernel and validate that the origin
CC module remains in place.

Diff Detail

R10 FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

rrs requested review of this revision.Oct 27 2021, 1:54 PM

Cleanup a few typos in the header comments as well as lets get all
modules putting the common functions in there structure not in
the cc_mod_init() by setting a variable now that its in cc.c

Fix some bugs I found in the CC switching code. Next I need to test the failure and fallback with and without invariant

Just figured out another bit of broken-ness in the existing code. When we switch
CC modules, and we are established we need to run (if we have one) the conn_init()
function otherwise things might not go so well for the CC :)

Also let's take out the insistence that the init not fail. The fallback works and
its no longer going to new_reno.. but properly to whatever you were last on.

Fix up to sync up with the way we protect for unload. We have to hold the
lock through the time. This means a fun double lookup and possibly
restart if the size changed.

Another fun POLA violation fixed. When unloading a module make it so that:

a) If a module is default in any Vnet, fail with EBUSY.
b) If a connection is using the CC module, attempt to move it to its Vnet default.
c) If <b> fails (no memory) then fail the unload with the return err from the cc_init call (NOMEM most likely)

This makes it so at a minimum you get back to the default instead of randomly ending
up with newreno.

We could probably replace the whole walk all tcb's with a refcount and unload failure until
we have no-one using the module (its a bit strange having your CC module switched underneath you).
We could then do something like the stacks do that prevent new connections from being set ont
the removing module. If nothing else we should mention it in the manual i.e. if the module is
removed you might find your connection also moved to the Vnet default.

Remove the spacing/emacs gifts.

This takes some of Lawrences private comments into account. We allow a CC without a size function to
exist and be loaded if it does not have a cb_init function. If there is a cb_init then we fail the load and
print a warning to the console as to why. I will need to next update the man page to talk about various
issues when loading and unloading modules.

Now let's update the manual pages to reflect the current state of affairs. At this
point I think I am done, so reviewers please have a look and give feedback!

Thanks for the patch. Sorry for bothering you, but there some very small man page nits, that could be addressed.


New sentence, new line.


New sentence, new line.


New sentence, new line.


New sentence, new line.


New sentence, new line.


New sentence, new line.


New sentence, new line.


New sentence, new line.


Thanks for the review, I am not sure what you mean by your comment..

Do you want the non formatted sentence to have a nl i.e. like this:
Unloading a congestion control module will fail if it is used as a
default by any Vnet. When unloading a module, the Vnet default is
used to switch a connection to an alternate congestion control.
Unloading a congestion control module will fail if it is used as a
default by any Vnet.
When unloading a module, the Vnet default is
used to switch a connection to an alternate congestion control.
Or are you asking me to put in some format instruction in here like:

Unloading a congestion control module will fail if it is used as a
default by any Vnet.
When unloading a module, the Vnet default is
used to switch a connection to an alternate congestion control.

or some other change?

I want to address your comment properly and I am really not that familiar with the man formatting. I
am sure you must want me to put in some formatting code since just the first change will do
nothing for the output...

Please let me know what you want and I will make the changes gladly :)

With help from Michael, I now know what gbe's comments mean. Fix all
sentences to have a nl.

rrs marked 8 inline comments as done.Nov 1 2021, 7:11 PM

LGTM for the man page part.

@rrs thank for incorporating the suggested changes!

Fix a LOR as well as finish making newreno non-special. Basically we
now don't prohibit it from unloading/being a module. And we harness
the first cc module loaded/in the kernel as the default.

First pass feedback


I don't want to lose the ability to (forcibly) unload a module which is in use as a stack-configured default or by active connections. I'm ok with refusing to unload when in use if a normal kldunload is called, but kldunload -f should be a workable option and leave the system in a sane/functional state.


Changing this hook's signature should be avoided if at all possible because it's a documented KPI and present across multiple stable releases i.e. better to extend struct cc_var if there's a need to pass in extra things, or create a new hook.

I'm also not a fan of:

  • All CC algos sharing a malloc type
  • cc_data_sz() exposing module-internal information that a module may want to be able to vary based on runtime configuration settings
  • Passing NULL in for ptr which means module state can be stored in memory which the module allocated or which was allocated externally to it.

I'd like to propose some tweaks that I think would address my concerns:

  • Remove cc_data_sz() hook
  • Remove the ptr argument to cb_init()
  • Define a new CCF_MWAITOK flag in cc.h for use in struct cc_var's flags member
  • Individual algo cb_init() functions which want to allocate memory for algo state pass M_WAITOK or M_NOWAIT to malloc based on the CCF_MWAITOK flag setting passed in ccv
  • In the situation where cb_init() is being called from a non-sleepable context and malloc(<size>, M_NOWAIT) fails, the ENOMEM return errno can be used to fail, spin or trigger a "try again from a sleepable context" path along the lines of the logic you've added in tcp_usrreq.c

I'm not a fan of this approach. Routines which provide newreno like behaviour belong in the newreno module, or at the very least should retain their newreno naming rather than "common" which is ambiguous and too generic.

The approach I'd prefer to see taken is for the newreno CC module to revert to being a stateless CC algo implementation which other modules can continue to call into to execute newreno baseline functionality, and which provides the FreeBSD default algorithm in a minimal footprint way. The ABE + Hystart components which require state would then be excised into a shim module which provides a thin wrapper around cc_newreno with the extra optional functionality included. That gets us the best of both worlds to my mind.


With the proposed changes, this block is sufficiently large now that I'd suggest carving everything between this line and the INP_WUNLOCK() out into a separate function e.g. cc_switch_algo() in cc.c or similar.


Suggest using ESRCH to remain consistent with other CC(9) API functions c.f. cc_default_algo()

We will talk on the telchat today at 10/1am your time.. a bit late.


I believe a -f will still work but you will likely crash the system. I have seen
this with the loadable stacks.. Since you leave things stranded with invalid methods.

I do *not* think it is unreasonable to insist that you have enough brains to move the default to
some other CC before you unload a module. This is just common sense.

As to tcp_ccalgounload() failing, which it can (if the new default can't get memory), being
able to retry is I think acceptable. Michael has also suggested a user space utility like tcpdrop that
does this i.e. moves the CC algorithm of anyone using "cc" to some other. That might be another way
to do this.

This is mainly a developer thing and I really think it is not unreasonable to let you do it twice or three times
so the system has M_NOWAIT memory available. You will get to unload your module. And actually in general
its going to work.. only if the system is under heavy load and memory is low is it not going to work..


Hmm first KPI's do change! We are working in head not in stable... so it is completely allowed to change a KPI. And
in fact in this case it is good that does since then cb_init won't silently break if you have a non-published algorithm
that does not accept the void *ptr, it tells those that are not paying attention that they need to "fix" their algorithm. So
in this case I think its a "good" thing.

I don't see any CC module that has variable mallocs anywhere by the way and it seems very unlikely to
me that structures will vary. And I do suspect it is highly unlikely...

As to your proposal I am fine with it if you can explain in a bit more detail how the cb_init() is going to be able to get a M_WAITOK?

The premise right now is I am calling in tcp_usrreq.c malloc before getting a lock, once you call into the cb_init() the
method may be digging into the tp-> which means it needs to have the INP_WLOCK() which it does now. And I think
that is *far* more likely than having variable length mallocs. Basically if you want cb_init() to have the INP_WLOCK(), which
it always has had, then you will never be able to use M_WAITOK. This *could* be a far greater issue than revealing how
many bytes the algorithm uses.

Looking through the cb_inits of the CC algorithms that are public I don't see a case where anything is done but
malloc() and init things so my concern may be an unfounded worry as well, but I think its just as valid as your concern about shallow
and deep memory allocation in a cb_init and not wanting to expose the block of memory you need's size.

As to having a global CC memory type I don't see that as a problem, if there is a leak your going to pretty well
know which one it is since it will be apparent in the CC algorithm you are most likely using :)


I strongly disagree with this approach. It makes "newreno" special and it should not be special and it should not
have any special properties. It should be able to malloc like any other module and it *should not* be referenced by
other CC modules.

I have no problem with naming the generic functions common_newreno_cc_<blah>()

but I have great difficulty with any CC module reaching in and running some other CC modules methods. This
to me is a glaring issue!!


Sure thats a reasonable thing to do ;)



rrs marked an inline comment as not done.

Address the points Lawrence makes that I agree with.

rrs marked 2 inline comments as not done.Nov 4 2021, 11:41 AM

One other thing if you can provide more details on your M_WAITOK scheme.. I would like you to
make sure there are no surprises i.e. if we silently change things under the covers (no INP_WLOCK)
to me that would be a surprise if the KPI remains constant. So if I have a module outside of FreeBSD
making that change would cause a silent failure if I was assuming I had the INP_WLOCK() and then
all of a sudden I did not....

Per the transport call discussion I have updated things to:

  1. be able to add an option for every CC you want to build in the config
  2. be able to add a CC_DEFAULT (its required actually or compile will fail) to specify the default cc
  3. All generic's get CC_NEWRENO and CC_DEFAULT=\"newreno\"

Also forgot to mention, per the call I also bumped the module rev number
and added the lock asserts we discussed.

So I think this pretty well does it at least with what we all discussed and agreed too.. Let me know
if I missed something :)

Just two small man page nits.


It would be good the name the macro .Pp to be constistend with the rest of the man page.


.Pp like above.

This update gets it so we blow up if we don't have a CC module in the kernel of one sort or another.

Update the man page for mod_cc to include a new kernel configuration section listing the new
options and such.

Address Peter Lei's comments sent to me in email:

  1. typo in NOTES (VEGA not VEGAS)
  2. add a comment in the register function that the first becomes default until CC_DEFAULT is seen
  3. move the unlock in tcp_congestion() right before the malloc.
This revision is now accepted and ready to land.Nov 10 2021, 11:53 AM
This revision was automatically updated to reflect the committed changes.