diff --git a/UPDATING b/UPDATING
--- a/UPDATING
+++ b/UPDATING
@@ -27,6 +27,19 @@
 world, or to merely disable the most expensive debugging functionality
 at runtime, run "ln -s 'abort:false,junk:false' /etc/malloc.conf".)
 
+20211110:
+	Commit xxxxxx changed the TCP congestion control framework so
+	that any of the included congestion control modules can be
+	the single module built into the kernel.  Previously newreno
+	was automatically built in through direct reference.  As of
+	this commit you are required to declare at least one congestion
+	control module (e.g. 'options CC_NEWRENO') and to also declare a
+	default using the CC_DEFAULT option (e.g. options CC_DEFAULT=\"newreno\").
+	The GENERIC configuration includes CC_NEWRENO and defines newreno
+	as the default.  If networking is included but no congestion
+	control option is built into the kernel, or if no default is
+	declared, the kernel compile will fail.
+
 20211106:
 	Commit f0c9847a6c47 changed the arguments for VOP_ALLOCATE.
 	The NFS modules must be rebuilt from sources and any out
diff --git a/share/man/man4/cc_newreno.4 b/share/man/man4/cc_newreno.4
--- a/share/man/man4/cc_newreno.4
+++ b/share/man/man4/cc_newreno.4
@@ -75,7 +75,33 @@
 .Va net.inet.tcp.cc.abe=1
 per: cwnd = (cwnd * CC_NEWRENO_BETA_ECN) / 100.
 Default is 80.
+.It Va CC_NEWRENO_ENABLE_HYSTART
+will enable or disable the application of Hystart++.
+The current implementation allows the values 0, 1, 2 and 3.
+A value of 0 (the default) disables the use of Hystart++.
+Setting the value to 1 enables Hystart++.
+Setting the value to 2 enables Hystart++ and additionally, on exit from
+Hystart++'s CSS, sets the cwnd to the value at which the increase in RTT
+first began, as well as setting ssthresh to the flight at send at the
+time CSS is exited.
+Setting a value of 3 sets the cwnd as for 2, but instead causes ssthresh
+to be set to the average of the lowest fas rtt (the value cwnd is
+set to) and the fas value at exit of CSS.
+.Pp
+Note that currently the only way to enable
+hystart++ is to enable it via this socket option.
+When enabling it, a value of 1 will enable precise internet-draft behavior
+(subject to any MIB variable settings); the other settings (2 and 3) are
+experimental.
 .El
+.Pp
+Note that hystart++ requires that the TCP stack be able to call into the
+congestion controller with both the
+.Va newround
+function as well as the
+.Va rttsample
+function.
+Currently the only TCP stack that provides this feedback to the
+congestion controller is rack.
 .Sh MIB Variables
 The algorithm exposes these variables in the
 .Va net.inet.tcp.cc.newreno
@@ -94,6 +120,32 @@
 .Va net.inet.tcp.cc.abe=1
 per: cwnd = (cwnd * beta_ecn) / 100.
 Default is 80.
+.It Va hystartplusplus.bblogs
+This boolean controls whether black box logging is done for hystart++ events.
+If set to zero (the default) no logging is performed.
+If set to one then black box logs will be generated on all hystart++ events.
+.It Va hystartplusplus.css_rounds
+This value controls the number of rounds that CSS runs for.
+The default value of 5 matches the current internet-draft.
+.It Va hystartplusplus.css_growth_div
+This value controls the divisor applied to slowstart during CSS.
+The default value of 4 matches the current internet-draft.
+.It Va hystartplusplus.n_rttsamples
+This value controls how many rtt samples must be collected in each round for
+hystart++ to be active.
+The default value of 8 matches the current internet-draft.
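The CC_NEWRENO_* variables documented above are set per-connection through the module's ctl_output hook, which is reached from userland via the TCP_CCALGOOPT socket option. Below is a minimal sketch of enabling Hystart++ that way; it assumes the struct cc_newreno_opts layout and the CC_NEWRENO_ENABLE_HYSTART constant from <netinet/cc/cc_newreno.h>, so verify the field names against the installed header:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <netinet/cc/cc_newreno.h>
#include <string.h>

/* Enable Hystart++ (value 1: precise internet-draft behavior) on socket s. */
static int
enable_hystart(int s)
{
	struct cc_newreno_opts opt;

	memset(&opt, 0, sizeof(opt));
	opt.name = CC_NEWRENO_ENABLE_HYSTART;	/* knob documented above */
	opt.val = 1;	/* 0 = off, 1 = draft behavior, 2/3 = experimental */
	return (setsockopt(s, IPPROTO_TCP, TCP_CCALGOOPT, &opt, sizeof(opt)));
}
```

The socket must already be attached to newreno (for example via TCP_CONGESTION) for cc_newreno's ctl_output to receive the option.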
+.It Va hystartplusplus.maxrtt_thresh
+This value controls the maximum rtt variance clamp when considering if CSS is needed.
+The default value of 16000 (in microseconds) matches the current internet-draft.
+For further explanation please see the internet-draft.
+.It Va hystartplusplus.minrtt_thresh
+This value controls the minimum rtt variance clamp when considering if CSS is needed.
+The default value of 4000 (in microseconds) matches the current internet-draft.
+For further explanation please see the internet-draft.
+.It Va hystartplusplus.lowcwnd
+This value controls the lowest congestion window that the TCP
+stack must reach before hystart++ engages.
+The default value of 16 matches the current internet-draft.
 .El
 .Sh SEE ALSO
 .Xr cc_cdg 4 ,
diff --git a/share/man/man4/mod_cc.4 b/share/man/man4/mod_cc.4
--- a/share/man/man4/mod_cc.4
+++ b/share/man/man4/mod_cc.4
@@ -67,6 +67,16 @@
 for details).
 Callers must pass a pointer to an algorithm specific data, and specify
 its size.
+.Pp
+Unloading a congestion control module will fail if it is used as the
+default by any vnet.
+When a module is unloaded, each connection still using it is switched
+to the vnet default congestion control.
+Note that the new congestion control module may fail to initialize its
+internal memory; if so, the module unload will fail.
+Because such a failure is caused by a transient memory shortage while
+the new CC module allocates its state, retrying the unload will often
+succeed.
 .Sh MIB Variables
 The framework exposes the following variables in the
 .Va net.inet.tcp.cc
@@ -93,6 +103,44 @@
 If non-zero, apply standard beta instead of ABE-beta during
 ECN-signalled congestion recovery episodes if loss also needs to be
 repaired.
 .El
+.Pp
+Each congestion control module may also expose other MIB variables
+to control its behaviour.
+.Sh Kernel Configuration
+All of the available congestion control modules may also be built
+into the kernel via kernel configuration options.
+A kernel configuration is required to have at least one congestion control
+algorithm built in via kernel option, and to specify a system default.
+Compilation of the kernel will fail if these two conditions are not met.
+.Sh Kernel Configuration Options
+The framework exposes the following kernel configuration options.
+.Bl -tag -width ".Va CC_NEWRENO"
+.It Va CC_NEWRENO
+This directive builds the newreno congestion control algorithm into the
+kernel and is included in GENERIC by default.
+.It Va CC_CUBIC
+This directive builds the cubic congestion control algorithm into the kernel.
+.It Va CC_VEGAS
+This directive builds the vegas congestion control algorithm into the kernel;
+note that this algorithm also requires the TCP_HHOOK option.
+.It Va CC_CDG
+This directive builds the cdg congestion control algorithm into the kernel;
+note that this algorithm also requires the TCP_HHOOK option.
+.It Va CC_DCTCP
+This directive builds the dctcp congestion control algorithm into the kernel.
+.It Va CC_HD
+This directive builds the hd congestion control algorithm into the kernel;
+note that this algorithm also requires the TCP_HHOOK option.
+.It Va CC_CHD
+This directive builds the chd congestion control algorithm into the kernel;
+note that this algorithm also requires the TCP_HHOOK option.
+.It Va CC_HTCP
+This directive builds the htcp congestion control algorithm into the kernel.
+.It Va CC_DEFAULT
+This directive specifies the string that names the system default algorithm;
+the GENERIC kernel defaults this to newreno.
+.El
 .Sh SEE ALSO
 .Xr cc_cdg 4 ,
 .Xr cc_chd 4 ,
@@ -103,6 +151,8 @@
 .Xr cc_newreno 4 ,
 .Xr cc_vegas 4 ,
 .Xr tcp 4 ,
+.Xr config 5 ,
+.Xr config 8 ,
 .Xr mod_cc 9
 .Sh ACKNOWLEDGEMENTS
 Development and testing of this software were made possible in part by grants
diff --git a/share/man/man9/mod_cc.9 b/share/man/man9/mod_cc.9
--- a/share/man/man9/mod_cc.9
+++ b/share/man/man9/mod_cc.9
@@ -68,7 +68,8 @@
 	char	name[TCP_CA_NAME_MAX];
 	int	(*mod_init) (void);
 	int	(*mod_destroy) (void);
-	int	(*cb_init) (struct cc_var *ccv);
+	size_t	(*cc_data_sz)(void);
+	int	(*cb_init) (struct cc_var *ccv, void *ptr);
 	void	(*cb_destroy) (struct cc_var *ccv);
 	void	(*conn_init) (struct cc_var *ccv);
 	void	(*ack_received) (struct cc_var *ccv, uint16_t type);
@@ -76,6 +77,8 @@
 	void	(*post_recovery) (struct cc_var *ccv);
 	void	(*after_idle) (struct cc_var *ccv);
 	int	(*ctl_output)(struct cc_var *, struct sockopt *, void *);
+	void	(*rttsample)(struct cc_var *, uint32_t, uint32_t, uint32_t);
+	void	(*newround)(struct cc_var *, uint32_t);
 };
 .Ed
 .Pp
@@ -104,6 +107,17 @@
 The return value is currently ignored.
 .Pp
 The
+.Va cc_data_sz
+function is called by the socket option code to get the size of
+data that the
+.Va cb_init
+function needs.
+The socket option code then preallocates the module's memory so that the
+.Va cb_init
+function will not fail (the socket option code uses M_WAITOK with
+no locks held to do this).
+.Pp
+The
 .Va cb_init
 function is called when a TCP control block
 .Vt struct tcpcb
@@ -114,6 +128,9 @@
 .Va cb_init
 will cause the connection set up to be aborted, terminating the
 connection as a result.
+Note that the ptr argument passed to the function should be checked;
+if it is non-NULL, it is preallocated memory that the cb_init function
+must use instead of calling malloc itself.
 .Pp
 The
 .Va cb_destroy
@@ -182,6 +199,30 @@
 pointer to algorithm specific argument.
 .Pp
 The
+.Va rttsample
+function is called to pass round trip time information to the
+congestion controller.
+The additional arguments to the function include the microsecond RTT
+that is being noted, the number of times that the data being
+acknowledged was retransmitted, as well as the flightsize at send.
+For transports that do not track flightsize at send, this variable
+will be the current cwnd at the time of the call.
+.Pp
+The
+.Va newround
+function is called each time a new round trip time begins.
+The monotonically increasing round number is also passed to the
+congestion controller.
+This can be used for various purposes by the congestion controller
+(e.g., Hystart++).
+.Pp
+Note that currently not all TCP stacks call the
+.Va rttsample
+and
+.Va newround
+functions, so dependence on these functions is also
+dependent upon which TCP stack is in use.
+.Pp
+The
 .Fn DECLARE_CC_MODULE
 macro provides a convenient wrapper around the
 .Xr DECLARE_MODULE 9
@@ -203,8 +244,23 @@
 .Vt struct cc_algo ,
 but are only required to set the name field, and optionally any of the
 function pointers.
+Note that if a module defines the
+.Va cb_init
+function it must also define a
+.Va cc_data_sz
+function.
+This is because when switching from one congestion control
+module to another the socket option code will preallocate memory for the
+.Va cb_init
+function.
+If no memory is allocated by the module's
+.Va cb_init
+then the
+.Va cc_data_sz
+function should return 0.
+.Pp
 The stack will skip calling any function pointer which is NULL, so there is no
-requirement to implement any of the function pointers.
+requirement to implement any of the function pointers (with the exception of
+the cb_init <-> cc_data_sz dependency noted above).
 Using the C99 designated initialiser feature to set fields is encouraged.
 .Pp
 Each function pointer which deals with congestion control state is passed a
@@ -222,6 +278,8 @@
 		struct tcpcb		*tcp;
 		struct sctp_nets	*sctp;
 	} ccvc;
+	uint16_t	nsegs;
+	uint8_t		labc;
 };
 .Ed
 .Pp
@@ -305,6 +363,19 @@
 by the value of the congestion window.
 Algorithms should use the absence of this flag being set to avoid
 accumulating a large difference between the congestion window and send
 window.
+.Pp
+The
+.Va nsegs
+variable is used to pass in how much compression was done by the local
+LRO system.
+For example, if LRO compressed three in-order acknowledgements into
+one acknowledgement, the variable would be set to three.
+.Pp
+The
+.Va labc
+variable is used in conjunction with the CCF_USE_LOCAL_ABC flag
+to override the labc value the congestion controller will use
+for this particular acknowledgement.
 .Sh SEE ALSO
 .Xr cc_cdg 4 ,
 .Xr cc_chd 4 ,
diff --git a/sys/amd64/conf/GENERIC b/sys/amd64/conf/GENERIC
--- a/sys/amd64/conf/GENERIC
+++ b/sys/amd64/conf/GENERIC
@@ -30,6 +30,8 @@
 options 	VIMAGE			# Subsystem virtualization, e.g. VNET
 options 	INET			# InterNETworking
 options 	INET6			# IPv6 communications protocols
+options 	CC_NEWRENO		# include newreno congestion control
+options 	CC_DEFAULT=\"newreno\"	# define our default CC module; it must be compiled in
 options 	IPSEC_SUPPORT		# Allow kldload of ipsec and tcpmd5
 options 	ROUTE_MPATH		# Multipath routing support
 options 	FIB_ALGO		# Modular fib lookups
diff --git a/sys/arm/conf/std.armv6 b/sys/arm/conf/std.armv6
--- a/sys/arm/conf/std.armv6
+++ b/sys/arm/conf/std.armv6
@@ -8,6 +8,8 @@
 options 	VIMAGE			# Subsystem virtualization, e.g. VNET
 options 	INET			# InterNETworking
 options 	INET6			# IPv6 communications protocols
+options 	CC_NEWRENO		# include newreno congestion control
+options 	CC_DEFAULT=\"newreno\"	# define our default CC module; it must be compiled in
 options 	TCP_HHOOK		# hhook(9) framework for TCP
 device		crypto			# core crypto support
 options 	IPSEC_SUPPORT		# Allow kldload of ipsec and tcpmd5
diff --git a/sys/arm/conf/std.armv7 b/sys/arm/conf/std.armv7
--- a/sys/arm/conf/std.armv7
+++ b/sys/arm/conf/std.armv7
@@ -8,6 +8,8 @@
 options 	VIMAGE			# Subsystem virtualization, e.g. VNET
 options 	INET			# InterNETworking
 options 	INET6			# IPv6 communications protocols
+options 	CC_NEWRENO		# include newreno congestion control
+options 	CC_DEFAULT=\"newreno\"	# define our default CC module; it must be compiled in
 options 	TCP_HHOOK		# hhook(9) framework for TCP
 device		crypto			# core crypto support
 options 	IPSEC_SUPPORT		# Allow kldload of ipsec and tcpmd5
diff --git a/sys/arm64/conf/std.arm64 b/sys/arm64/conf/std.arm64
--- a/sys/arm64/conf/std.arm64
+++ b/sys/arm64/conf/std.arm64
@@ -11,6 +11,8 @@
 options 	VIMAGE			# Subsystem virtualization, e.g. VNET
 options 	INET			# InterNETworking
 options 	INET6			# IPv6 communications protocols
+options 	CC_NEWRENO		# include newreno congestion control
+options 	CC_DEFAULT=\"newreno\"	# define our default CC module; it must be compiled in
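To make the cb_init/cc_data_sz contract and the new rttsample/newround hooks documented in mod_cc.9 above concrete, here is a minimal sketch of a module following the pattern the in-tree modules use after this change. The module name "foo" and all of its identifiers are hypothetical, invented purely for illustration:

```c
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/malloc.h>
#include <sys/module.h>

#include <netinet/cc/cc.h>
#include <netinet/cc/cc_module.h>

struct foo_state {
	uint32_t rounds;	/* example per-connection state */
};

/* Report how much per-connection memory cb_init needs. */
static size_t
foo_data_sz(void)
{
	return (sizeof(struct foo_state));
}

static int
foo_cb_init(struct cc_var *ccv, void *ptr)
{
	struct foo_state *fs;

	if (ptr == NULL) {
		/* Connection setup path: we must allocate, and may fail. */
		fs = malloc(sizeof(struct foo_state), M_CC_MEM, M_NOWAIT);
		if (fs == NULL)
			return (ENOMEM);
	} else
		fs = ptr;	/* Socket option path: memory was preallocated. */
	fs->rounds = 0;
	ccv->cc_data = fs;
	return (0);
}

static void
foo_cb_destroy(struct cc_var *ccv)
{
	free(ccv->cc_data, M_CC_MEM);
}

/* Optional round/RTT feedback; only some TCP stacks (e.g. rack) call these. */
static void
foo_newround(struct cc_var *ccv, uint32_t round_cnt)
{
	((struct foo_state *)ccv->cc_data)->rounds = round_cnt;
}

static void
foo_rttsample(struct cc_var *ccv, uint32_t usec_rtt, uint32_t rxtcnt,
    uint32_t fas)
{
	/* Consume the RTT sample, retransmit count and flight at send. */
}

struct cc_algo foo_cc_algo = {
	.name = "foo",
	.cb_init = foo_cb_init,		/* defining cb_init requires cc_data_sz */
	.cc_data_sz = foo_data_sz,
	.cb_destroy = foo_cb_destroy,
	.newround = foo_newround,
	.rttsample = foo_rttsample,
	.after_idle = newreno_cc_after_idle,		/* shared NewReno helpers */
	.post_recovery = newreno_cc_post_recovery,
};

DECLARE_CC_MODULE(foo, &foo_cc_algo);
MODULE_VERSION(foo, 1);
```

Because cb_init must use the preallocated pointer when one is handed in, the socket option code can guarantee (via M_WAITOK with no locks held) that switching a live connection to this module never fails for lack of memory.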
 options 	IPSEC_SUPPORT		# Allow kldload of ipsec and tcpmd5
 options 	ROUTE_MPATH		# Multipath routing support
 options 	FIB_ALGO		# Modular fib lookups
diff --git a/sys/conf/NOTES b/sys/conf/NOTES
--- a/sys/conf/NOTES
+++ b/sys/conf/NOTES
@@ -646,7 +646,26 @@
 #
 options 	INET			#Internet communications protocols
 options 	INET6			#IPv6 communications protocols
-
+#
+# Note that if you include INET and/or INET6, you *must* define at
+# least one of the congestion control options, or the kernel compile
+# will fail.  GENERIC defines 'options CC_NEWRENO'.  You must also
+# specify a default, or the kernel compile will likewise fail.  The
+# string in CC_DEFAULT is the name of the CC module as it would
+# appear in the sysctl for setting the default.  GENERIC defines
+# newreno as the default, as shown below.
+#
+options 	CC_CDG
+options 	CC_CHD
+options 	CC_CUBIC
+options 	CC_DCTCP
+options 	CC_HD
+options 	CC_HTCP
+options 	CC_NEWRENO
+options 	CC_VEGAS
+options 	CC_DEFAULT=\"newreno\"
 options 	RATELIMIT		# TX rate limiting support
 options 	ROUTETABLES=2		# allocated fibs up to 65536. default is 1.
diff --git a/sys/conf/files b/sys/conf/files
--- a/sys/conf/files
+++ b/sys/conf/files
@@ -4351,8 +4351,20 @@
 netinet/ip_output.c		optional inet
 netinet/ip_reass.c		optional inet
 netinet/raw_ip.c		optional inet | inet6
-netinet/cc/cc.c			optional inet | inet6
-netinet/cc/cc_newreno.c		optional inet | inet6
+netinet/cc/cc.c			optional cc_newreno inet | cc_vegas inet | \
+	cc_htcp inet | cc_hd inet | cc_dctcp inet | cc_cubic inet | \
+	cc_chd inet | cc_cdg inet | cc_newreno inet6 | cc_vegas inet6 | \
+	cc_htcp inet6 | cc_hd inet6 | cc_dctcp inet6 | cc_cubic inet6 | \
+	cc_chd inet6 | cc_cdg inet6
+netinet/cc/cc_cdg.c		optional inet cc_cdg tcp_hhook
+netinet/cc/cc_chd.c		optional inet cc_chd tcp_hhook
+netinet/cc/cc_cubic.c		optional inet cc_cubic | inet6 cc_cubic
+netinet/cc/cc_dctcp.c		optional inet cc_dctcp | inet6 cc_dctcp
+netinet/cc/cc_hd.c		optional inet cc_hd tcp_hhook
+netinet/cc/cc_htcp.c		optional inet cc_htcp | inet6 cc_htcp
+netinet/cc/cc_newreno.c		optional inet cc_newreno | inet6 cc_newreno
+netinet/cc/cc_vegas.c		optional inet cc_vegas tcp_hhook
+netinet/khelp/h_ertt.c		optional inet tcp_hhook
 netinet/sctp_asconf.c		optional inet sctp | inet6 sctp
 netinet/sctp_auth.c		optional inet sctp | inet6 sctp
 netinet/sctp_bsd_addr.c		optional inet sctp | inet6 sctp
diff --git a/sys/conf/options b/sys/conf/options
--- a/sys/conf/options
+++ b/sys/conf/options
@@ -81,6 +81,15 @@
 CALLOUT_PROFILING
 CAPABILITIES	opt_capsicum.h
 CAPABILITY_MODE	opt_capsicum.h
+CC_CDG	opt_global.h
+CC_CHD	opt_global.h
+CC_CUBIC	opt_global.h
+CC_DEFAULT	opt_cc.h
+CC_DCTCP	opt_global.h
+CC_HD	opt_global.h
+CC_HTCP	opt_global.h
+CC_NEWRENO	opt_global.h
+CC_VEGAS	opt_global.h
 COMPAT_43	opt_global.h
 COMPAT_43TTY	opt_global.h
 COMPAT_FREEBSD4	opt_global.h
diff --git a/sys/i386/conf/GENERIC b/sys/i386/conf/GENERIC
--- a/sys/i386/conf/GENERIC
+++ b/sys/i386/conf/GENERIC
@@ -31,6 +31,8 @@
 options 	VIMAGE			# Subsystem virtualization, e.g. VNET
 options 	INET			# InterNETworking
 options 	INET6			# IPv6 communications protocols
+options 	CC_NEWRENO		# include newreno congestion control
+options 	CC_DEFAULT=\"newreno\"	# define our default CC module; it must be compiled in
 options 	IPSEC_SUPPORT		# Allow kldload of ipsec and tcpmd5
 options 	ROUTE_MPATH		# Multipath routing support
 options 	TCP_HHOOK		# hhook(9) framework for TCP
diff --git a/sys/modules/cc/Makefile b/sys/modules/cc/Makefile
--- a/sys/modules/cc/Makefile
+++ b/sys/modules/cc/Makefile
@@ -1,6 +1,7 @@
 # $FreeBSD$
 
-SUBDIR=	cc_cubic \
+SUBDIR=	cc_newreno \
+	cc_cubic \
 	cc_dctcp \
 	cc_htcp
diff --git a/sys/modules/cc/cc_newreno/Makefile b/sys/modules/cc/cc_newreno/Makefile
new file mode 100644
--- /dev/null
+++ b/sys/modules/cc/cc_newreno/Makefile
@@ -0,0 +1,7 @@
+# $FreeBSD$
+
+.PATH: ${SRCTOP}/sys/netinet/cc
+KMOD=	cc_newreno
+SRCS=	cc_newreno.c
+
+.include <bsd.kmod.mk>
diff --git a/sys/netinet/cc/cc.h b/sys/netinet/cc/cc.h
--- a/sys/netinet/cc/cc.h
+++ b/sys/netinet/cc/cc.h
@@ -53,10 +53,11 @@
 
 #ifdef _KERNEL
 
+MALLOC_DECLARE(M_CC_MEM);
+
 /* Global CC vars. */
 extern STAILQ_HEAD(cc_head, cc_algo) cc_list;
 extern const int tcprexmtthresh;
-extern struct cc_algo newreno_cc_algo;
 
 /* Per-netstack bits. */
 VNET_DECLARE(struct cc_algo *, default_cc_ptr);
@@ -139,8 +140,19 @@
 	/* Cleanup global module state on kldunload. */
 	int	(*mod_destroy)(void);
 
-	/* Init CC state for a new control block. */
-	int	(*cb_init)(struct cc_var *ccv);
+	/* Return the size of the void pointer the CC needs for state. */
+	size_t	(*cc_data_sz)(void);
+
+	/*
+	 * Init CC state for a new control block.  The CC
+	 * module may be passed a NULL ptr indicating that
+	 * it must allocate the memory.  If it is passed a
+	 * non-NULL pointer, that is memory pre-allocated by
+	 * the caller, and cb_init is expected to use it.
+	 * cb_init is not expected to fail when memory is
+	 * passed in, and none of the currently defined
+	 * modules do.
+	 */
+	int	(*cb_init)(struct cc_var *ccv, void *ptr);
 
 	/* Cleanup CC state for a terminating control block. */
 	void	(*cb_destroy)(struct cc_var *ccv);
@@ -176,8 +188,11 @@
 	int	(*ctl_output)(struct cc_var *, struct sockopt *, void *);
 
 	STAILQ_ENTRY (cc_algo) entries;
+	uint8_t	flags;
 };
 
+#define	CC_MODULE_BEING_REMOVED	0x01	/* The module is being removed. */
+
 /* Macro to obtain the CC algo's struct ptr. */
 #define	CC_ALGO(tp)	((tp)->cc_algo)
 
@@ -185,7 +200,7 @@
 #define	CC_DATA(tp)	((tp)->ccv->cc_data)
 
 /* Macro to obtain the system default CC algo's struct ptr. */
-#define	CC_DEFAULT()	V_default_cc_ptr
+#define	CC_DEFAULT_ALGO()	V_default_cc_ptr
 
 extern struct rwlock cc_list_lock;
 #define	CC_LIST_LOCK_INIT()	rw_init(&cc_list_lock, "cc_list")
@@ -198,5 +213,16 @@
 
 #define	CC_ALGOOPT_LIMIT	2048
 
+/*
+ * These routines give NewReno behavior to the caller; they require
+ * no state and can be used by any other CC module that wishes to
+ * use NewReno-type behaviour (along with anything else it may add
+ * on, pre or post call).
+ */
+void newreno_cc_post_recovery(struct cc_var *);
+void newreno_cc_after_idle(struct cc_var *);
+void newreno_cc_cong_signal(struct cc_var *, uint32_t);
+void newreno_cc_ack_received(struct cc_var *, uint16_t);
+
 #endif /* _KERNEL */
 #endif /* _NETINET_CC_CC_H_ */
diff --git a/sys/netinet/cc/cc.c b/sys/netinet/cc/cc.c
--- a/sys/netinet/cc/cc.c
+++ b/sys/netinet/cc/cc.c
@@ -50,7 +50,7 @@
 #include
 __FBSDID("$FreeBSD$");
 
-
+#include
 #include
 #include
 #include
@@ -70,11 +70,15 @@
 #include
 #include
 #include
+#include
 #include
+#include
+#include
 #include
-
 #include
 
+MALLOC_DEFINE(M_CC_MEM, "CC Mem", "Congestion Control State memory");
+
 /*
  * List of available cc algorithms on the current system.  First element
  * is used as the system default CC algorithm.
 */
@@ -84,7 +88,10 @@
 /* Protects the cc_list TAILQ.
 */
struct rwlock cc_list_lock;

-VNET_DEFINE(struct cc_algo *, default_cc_ptr) = &newreno_cc_algo;
+VNET_DEFINE(struct cc_algo *, default_cc_ptr) = NULL;
+
+VNET_DEFINE(uint32_t, newreno_beta) = 50;
+#define	V_newreno_beta	VNET(newreno_beta)

 /*
  * Sysctl handler to show and change the default CC algorithm.
  */
@@ -98,7 +105,10 @@

 	/* Get the current default: */
 	CC_LIST_RLOCK();
-	strlcpy(default_cc, CC_DEFAULT()->name, sizeof(default_cc));
+	if (CC_DEFAULT_ALGO() != NULL)
+		strlcpy(default_cc, CC_DEFAULT_ALGO()->name, sizeof(default_cc));
+	else
+		memset(default_cc, 0, TCP_CA_NAME_MAX);
 	CC_LIST_RUNLOCK();

 	error = sysctl_handle_string(oidp, default_cc, sizeof(default_cc), req);
@@ -108,7 +118,6 @@
 		goto done;

 	error = ESRCH;
-
 	/* Find algo with specified name and set it to default. */
 	CC_LIST_RLOCK();
 	STAILQ_FOREACH(funcs, &cc_list, entries) {
@@ -141,7 +150,9 @@
 		nalgos++;
 	}
 	CC_LIST_RUNLOCK();
-
+	if (nalgos == 0) {
+		return (ENOENT);
+	}
 	s = sbuf_new(NULL, NULL, nalgos * TCP_CA_NAME_MAX, SBUF_FIXEDLEN);

 	if (s == NULL)
@@ -176,12 +187,13 @@
 }

 /*
- * Reset the default CC algo to NewReno for any netstack which is using the algo
- * that is about to go away as its default.
+ * Return the number of vnets that are using the CC module
+ * proposed for removal, remove_cc, as their default.
 */
-static void
-cc_checkreset_default(struct cc_algo *remove_cc)
+static int
+cc_check_default(struct cc_algo *remove_cc)
 {
+	int cnt = 0;
 	VNET_ITERATOR_DECL(vnet_iter);

 	CC_LIST_LOCK_ASSERT();
@@ -189,12 +201,16 @@
 	VNET_LIST_RLOCK_NOSLEEP();
 	VNET_FOREACH(vnet_iter) {
 		CURVNET_SET(vnet_iter);
-		if (strncmp(CC_DEFAULT()->name, remove_cc->name,
-		    TCP_CA_NAME_MAX) == 0)
-			V_default_cc_ptr = &newreno_cc_algo;
+		if ((CC_DEFAULT_ALGO() != NULL) &&
+		    strncmp(CC_DEFAULT_ALGO()->name,
+		    remove_cc->name,
+		    TCP_CA_NAME_MAX) == 0) {
+			cnt++;
+		}
 		CURVNET_RESTORE();
 	}
 	VNET_LIST_RUNLOCK_NOSLEEP();
+	return (cnt);
 }

 /*
@@ -218,31 +234,36 @@

 	err = ENOENT;

-	/* Never allow newreno to be deregistered. */
-	if (&newreno_cc_algo == remove_cc)
-		return (EPERM);
-
 	/* Remove algo from cc_list so that new connections can't use it. */
 	CC_LIST_WLOCK();
 	STAILQ_FOREACH_SAFE(funcs, &cc_list, entries, tmpfuncs) {
 		if (funcs == remove_cc) {
-			cc_checkreset_default(remove_cc);
-			STAILQ_REMOVE(&cc_list, funcs, cc_algo, entries);
-			err = 0;
+			if (cc_check_default(remove_cc)) {
+				err = EBUSY;
+				break;
+			}
+			/* Set a temporary flag to stop new connections from selecting it. */
+			funcs->flags |= CC_MODULE_BEING_REMOVED;
+			break;
+		}
+	}
+	CC_LIST_WUNLOCK();
+	err = tcp_ccalgounload(remove_cc);
+	/*
+	 * Now go back through and either remove the temp flag
+	 * or pull the registration.
+	 */
+	CC_LIST_WLOCK();
+	STAILQ_FOREACH_SAFE(funcs, &cc_list, entries, tmpfuncs) {
+		if (funcs == remove_cc) {
+			if (err == 0)
+				STAILQ_REMOVE(&cc_list, funcs, cc_algo, entries);
+			else
+				funcs->flags &= ~CC_MODULE_BEING_REMOVED;
 			break;
 		}
 	}
 	CC_LIST_WUNLOCK();
-
-	if (!err)
-		/*
-		 * XXXLAS:
-		 * - We may need to handle non-zero return values in future.
-		 * - If we add CC framework support for protocols other than
-		 *   TCP, we may want a more generic way to handle this step.
-		 */
-	tcp_ccalgounload(remove_cc);
-
	return (err);
 }

@@ -263,19 +284,218 @@
	 */
	CC_LIST_WLOCK();
	STAILQ_FOREACH(funcs, &cc_list, entries) {
-		if (funcs == add_cc || strncmp(funcs->name, add_cc->name,
-		    TCP_CA_NAME_MAX) == 0)
+		if (funcs == add_cc ||
+		    strncmp(funcs->name, add_cc->name,
+		    TCP_CA_NAME_MAX) == 0) {
 			err = EEXIST;
+			break;
+		}
 	}
-
-	if (!err)
+	/*
+	 * The first loaded congestion control module becomes
+	 * the default until we find the one matching the
+	 * "CC_DEFAULT" defined in the kernel config (if we do).
+	 */
+	if (!err) {
 		STAILQ_INSERT_TAIL(&cc_list, add_cc, entries);
-
+		if (strcmp(add_cc->name, CC_DEFAULT) == 0) {
+			V_default_cc_ptr = add_cc;
+		} else if (V_default_cc_ptr == NULL) {
+			V_default_cc_ptr = add_cc;
+		}
+	}
 	CC_LIST_WUNLOCK();

 	return (err);
 }

+/*
+ * Perform any necessary tasks before we exit congestion recovery.
+ */
+void
+newreno_cc_post_recovery(struct cc_var *ccv)
+{
+	int pipe;
+
+	if (IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+		/*
+		 * Fast recovery will conclude after returning from this
+		 * function.  Window inflation should have left us with
+		 * approximately snd_ssthresh outstanding data.  But in case we
+		 * would be inclined to send a burst, better to do it via the
+		 * slow start mechanism.
+		 *
+		 * XXXLAS: Find a way to do this without needing curack
+		 */
+		if (V_tcp_do_newsack)
+			pipe = tcp_compute_pipe(ccv->ccvc.tcp);
+		else
+			pipe = CCV(ccv, snd_max) - ccv->curack;
+		if (pipe < CCV(ccv, snd_ssthresh))
+			/*
+			 * Ensure that cwnd does not collapse to 1 MSS under
+			 * adverse conditions.  Implements RFC6582.
+			 */
+			CCV(ccv, snd_cwnd) = max(pipe, CCV(ccv, t_maxseg)) +
+			    CCV(ccv, t_maxseg);
+		else
+			CCV(ccv, snd_cwnd) = CCV(ccv, snd_ssthresh);
+	}
+}
+
+void
+newreno_cc_after_idle(struct cc_var *ccv)
+{
+	uint32_t rw;
+	/*
+	 * If we've been idle for more than one retransmit timeout the old
+	 * congestion window is no longer current and we have to reduce it to
+	 * the restart window before we can transmit again.
+	 *
+	 * The restart window is the initial window or the last CWND, whichever
+	 * is smaller.
+	 *
+	 * This is done to prevent us from flooding the path with a full CWND at
+	 * wirespeed, overloading router and switch buffers along the way.
+	 *
+	 * See RFC5681 Section 4.1. "Restarting Idle Connections".
+	 *
+	 * In addition, per RFC2861 Section 2, the ssthresh is set to the
+	 * maximum of the former ssthresh or 3/4 of the old cwnd, to
+	 * not exit slow-start prematurely.
+	 */
+	rw = tcp_compute_initwnd(tcp_maxseg(ccv->ccvc.tcp));
+
+	CCV(ccv, snd_ssthresh) = max(CCV(ccv, snd_ssthresh),
+	    CCV(ccv, snd_cwnd)-(CCV(ccv, snd_cwnd)>>2));
+
+	CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd));
+}
+
+/*
+ * Perform any necessary tasks before we enter congestion recovery.
+ */
+void
+newreno_cc_cong_signal(struct cc_var *ccv, uint32_t type)
+{
+	uint32_t cwin, factor;
+	u_int mss;
+
+	cwin = CCV(ccv, snd_cwnd);
+	mss = tcp_fixed_maxseg(ccv->ccvc.tcp);
+	/*
+	 * Other TCP congestion controls that use this common function
+	 * have their own private cc_data; this shared code therefore
+	 * never touches cc_data and always applies the system default
+	 * beta.
+	 */
+	factor = V_newreno_beta;
+
+	/* Catch algos which mistakenly leak private signal types.
+	 */
+	KASSERT((type & CC_SIGPRIVMASK) == 0,
+	    ("%s: congestion signal type 0x%08x is private\n", __func__, type));
+
+	cwin = max(((uint64_t)cwin * (uint64_t)factor) / (100ULL * (uint64_t)mss),
+	    2) * mss;
+
+	switch (type) {
+	case CC_NDUPACK:
+		if (!IN_FASTRECOVERY(CCV(ccv, t_flags))) {
+			if (!IN_CONGRECOVERY(CCV(ccv, t_flags)))
+				CCV(ccv, snd_ssthresh) = cwin;
+			ENTER_RECOVERY(CCV(ccv, t_flags));
+		}
+		break;
+	case CC_ECN:
+		if (!IN_CONGRECOVERY(CCV(ccv, t_flags))) {
+			CCV(ccv, snd_ssthresh) = cwin;
+			CCV(ccv, snd_cwnd) = cwin;
+			ENTER_CONGRECOVERY(CCV(ccv, t_flags));
+		}
+		break;
+	case CC_RTO:
+		CCV(ccv, snd_ssthresh) = max(min(CCV(ccv, snd_wnd),
+		    CCV(ccv, snd_cwnd)) / 2 / mss,
+		    2) * mss;
+		CCV(ccv, snd_cwnd) = mss;
+		break;
+	}
+}
+
+void
+newreno_cc_ack_received(struct cc_var *ccv, uint16_t type)
+{
+	if (type == CC_ACK && !IN_RECOVERY(CCV(ccv, t_flags)) &&
+	    (ccv->flags & CCF_CWND_LIMITED)) {
+		u_int cw = CCV(ccv, snd_cwnd);
+		u_int incr = CCV(ccv, t_maxseg);
+
+		/*
+		 * Regular in-order ACK, open the congestion window.
+		 * Method depends on which congestion control state we're
+		 * in (slow start or cong avoid) and if ABC (RFC 3465) is
+		 * enabled.
+		 *
+		 * slow start: cwnd <= ssthresh
+		 * cong avoid: cwnd > ssthresh
+		 *
+		 * slow start and ABC (RFC 3465):
+		 *   Grow cwnd exponentially by the amount of data
+		 *   ACKed capping the max increment per ACK to
+		 *   (abc_l_var * maxseg) bytes.
+		 *
+		 * slow start without ABC (RFC 5681):
+		 *   Grow cwnd exponentially by maxseg per ACK.
+		 *
+		 * cong avoid and ABC (RFC 3465):
+		 *   Grow cwnd linearly by maxseg per RTT for each
+		 *   cwnd worth of ACKed data.
+		 *
+		 * cong avoid without ABC (RFC 5681):
+		 *   Grow cwnd linearly by approximately maxseg per RTT using
+		 *   maxseg^2 / cwnd per ACK as the increment.
+		 *   If cwnd > maxseg^2, fix the cwnd increment at 1 byte to
+		 *   avoid capping cwnd.
+		 */
+		if (cw > CCV(ccv, snd_ssthresh)) {
+			if (V_tcp_do_rfc3465) {
+				if (ccv->flags & CCF_ABC_SENTAWND)
+					ccv->flags &= ~CCF_ABC_SENTAWND;
+				else
+					incr = 0;
+			} else
+				incr = max((incr * incr / cw), 1);
+		} else if (V_tcp_do_rfc3465) {
+			/*
+			 * In slow-start with ABC enabled and no RTO in sight?
+			 * (Must not use abc_l_var > 1 if slow starting after
+			 * an RTO.  On RTO, snd_nxt = snd_una, so the
+			 * snd_nxt == snd_max check is sufficient to
+			 * handle this).
+			 *
+			 * XXXLAS: Find a way to signal SS after RTO that
+			 * doesn't rely on tcpcb vars.
+			 */
+			uint16_t abc_val;
+
+			if (ccv->flags & CCF_USE_LOCAL_ABC)
+				abc_val = ccv->labc;
+			else
+				abc_val = V_tcp_abc_l_var;
+			if (CCV(ccv, snd_nxt) == CCV(ccv, snd_max))
+				incr = min(ccv->bytes_this_ack,
+				    ccv->nsegs * abc_val *
+				    CCV(ccv, t_maxseg));
+			else
+				incr = min(ccv->bytes_this_ack, CCV(ccv, t_maxseg));
+
+		}
+		/* ABC is on by default, so incr equals 0 frequently. */
+		if (incr > 0)
+			CCV(ccv, snd_cwnd) = min(cw + incr,
+			    TCP_MAXWIN << CCV(ccv, snd_scale));
+	}
+}
+
 /*
  * Handles kld related events.  Returns 0 on success, non-zero on failure.
  */
@@ -290,6 +510,15 @@
 
 	switch(event_type) {
 	case MOD_LOAD:
+		if ((algo->cc_data_sz == NULL) && (algo->cb_init != NULL)) {
+			/*
+			 * A module must have a cc_data_sz function
+			 * even if it has no data; in that case it
+			 * should simply return 0.
+			 */
+			printf("Module load failed: it lacks a cc_data_sz() function but has a cb_init()!\n");
+			err = EINVAL;
+			break;
+		}
 		if (algo->mod_init != NULL)
 			err = algo->mod_init();
 		if (!err)
diff --git a/sys/netinet/cc/cc_cdg.c b/sys/netinet/cc/cc_cdg.c
--- a/sys/netinet/cc/cc_cdg.c
+++ b/sys/netinet/cc/cc_cdg.c
@@ -67,6 +67,10 @@
 
 #include
 
+#include
+#include
+
+#include
 #include
 #include
 #include
@@ -197,10 +201,6 @@
 	32531,32533,32535,32537,32538,32540,32542,32544,32545,32547};
 
 static uma_zone_t qdiffsample_zone;
-
-static MALLOC_DEFINE(M_CDG, "cdg data",
-    "Per connection data required for the CDG congestion control algorithm");
-
 static int ertt_id;
 
 VNET_DEFINE_STATIC(uint32_t, cdg_alpha_inc);
@@ -222,10 +222,11 @@
 static int cdg_mod_init(void);
 static int cdg_mod_destroy(void);
 static void cdg_conn_init(struct cc_var *ccv);
-static int cdg_cb_init(struct cc_var *ccv);
+static int cdg_cb_init(struct cc_var *ccv, void *ptr);
 static void cdg_cb_destroy(struct cc_var *ccv);
 static void cdg_cong_signal(struct cc_var *ccv, uint32_t signal_type);
 static void cdg_ack_received(struct cc_var *ccv, uint16_t ack_type);
+static size_t cdg_data_sz(void);
 
 struct cc_algo cdg_cc_algo = {
 	.name = "cdg",
@@ -235,7 +236,10 @@
 	.cb_init = cdg_cb_init,
 	.conn_init = cdg_conn_init,
 	.cong_signal = cdg_cong_signal,
-	.mod_destroy = cdg_mod_destroy
+	.mod_destroy = cdg_mod_destroy,
+	.cc_data_sz = cdg_data_sz,
+	.post_recovery = newreno_cc_post_recovery,
+	.after_idle = newreno_cc_after_idle,
 };
 
 /* Vnet created and being initialised. */
@@ -271,10 +275,6 @@
 		CURVNET_RESTORE();
 	}
 	VNET_LIST_RUNLOCK();
-
-	cdg_cc_algo.post_recovery = newreno_cc_algo.post_recovery;
-	cdg_cc_algo.after_idle = newreno_cc_algo.after_idle;
-
 	return (0);
 }
 
@@ -286,15 +286,25 @@
 	return (0);
 }
 
+static size_t
+cdg_data_sz(void)
+{
+	return (sizeof(struct cdg));
+}
+
 static int
-cdg_cb_init(struct cc_var *ccv)
+cdg_cb_init(struct cc_var *ccv, void *ptr)
 {
 	struct cdg *cdg_data;
 
-	cdg_data = malloc(sizeof(struct cdg), M_CDG, M_NOWAIT);
-	if (cdg_data == NULL)
-		return (ENOMEM);
-
+	INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb);
+	if (ptr == NULL) {
+		cdg_data = malloc(sizeof(struct cdg), M_CC_MEM, M_NOWAIT);
+		if (cdg_data == NULL)
+			return (ENOMEM);
+	} else {
+		cdg_data = ptr;
+	}
 	cdg_data->shadow_w = 0;
 	cdg_data->max_qtrend = 0;
 	cdg_data->min_qtrend = 0;
@@ -350,7 +360,7 @@
 		qds = qds_n;
 	}
 
-	free(ccv->cc_data, M_CDG);
+	free(ccv->cc_data, M_CC_MEM);
 }
 
 static int
@@ -484,7 +494,7 @@
 		ENTER_RECOVERY(CCV(ccv, t_flags));
 		break;
 	default:
-		newreno_cc_algo.cong_signal(ccv, signal_type);
+		newreno_cc_cong_signal(ccv, signal_type);
 		break;
 	}
 }
@@ -714,5 +724,5 @@
     "the window backoff for loss based CC compatibility");
 
 DECLARE_CC_MODULE(cdg, &cdg_cc_algo);
-MODULE_VERSION(cdg, 1);
+MODULE_VERSION(cdg, 2);
 MODULE_DEPEND(cdg, ertt, 1, 1, 1);
diff --git a/sys/netinet/cc/cc_chd.c b/sys/netinet/cc/cc_chd.c
--- a/sys/netinet/cc/cc_chd.c
+++ b/sys/netinet/cc/cc_chd.c
@@ -69,6 +69,10 @@
 
 #include
 
+#include
+#include
+
+#include
 #include
 #include
 #include
@@ -89,10 +93,11 @@
 
 static void	chd_ack_received(struct cc_var *ccv, uint16_t ack_type);
 static void	chd_cb_destroy(struct cc_var *ccv);
-static int	chd_cb_init(struct cc_var *ccv);
+static int	chd_cb_init(struct cc_var *ccv, void *ptr);
 static void	chd_cong_signal(struct cc_var *ccv, uint32_t signal_type);
 static void	chd_conn_init(struct cc_var *ccv);
 static int	chd_mod_init(void);
+static size_t	chd_data_sz(void);
 
 struct chd {
 	/*
@@ -126,8 +131,6 @@
 #define	V_chd_loss_fair	VNET(chd_loss_fair)
 #define	V_chd_use_max
VNET(chd_use_max) -static MALLOC_DEFINE(M_CHD, "chd data", - "Per connection data required for the CHD congestion control algorithm"); struct cc_algo chd_cc_algo = { .name = "chd", @@ -136,7 +139,10 @@ .cb_init = chd_cb_init, .cong_signal = chd_cong_signal, .conn_init = chd_conn_init, - .mod_init = chd_mod_init + .mod_init = chd_mod_init, + .cc_data_sz = chd_data_sz, + .after_idle = newreno_cc_after_idle, + .post_recovery = newreno_cc_post_recovery, }; static __inline void @@ -304,18 +310,27 @@ static void chd_cb_destroy(struct cc_var *ccv) { + free(ccv->cc_data, M_CC_MEM); +} - free(ccv->cc_data, M_CHD); +size_t +chd_data_sz(void) +{ + return (sizeof(struct chd)); } static int -chd_cb_init(struct cc_var *ccv) +chd_cb_init(struct cc_var *ccv, void *ptr) { struct chd *chd_data; - chd_data = malloc(sizeof(struct chd), M_CHD, M_NOWAIT); - if (chd_data == NULL) - return (ENOMEM); + INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb); + if (ptr == NULL) { + chd_data = malloc(sizeof(struct chd), M_CC_MEM, M_NOWAIT); + if (chd_data == NULL) + return (ENOMEM); + } else + chd_data = ptr; chd_data->shadow_w = 0; ccv->cc_data = chd_data; @@ -374,7 +389,7 @@ break; default: - newreno_cc_algo.cong_signal(ccv, signal_type); + newreno_cc_cong_signal(ccv, signal_type); } } @@ -403,10 +418,6 @@ printf("%s: h_ertt module not found\n", __func__); return (ENOENT); } - - chd_cc_algo.after_idle = newreno_cc_algo.after_idle; - chd_cc_algo.post_recovery = newreno_cc_algo.post_recovery; - return (0); } @@ -493,5 +504,5 @@ "as the basic delay measurement for the algorithm."); DECLARE_CC_MODULE(chd, &chd_cc_algo); -MODULE_VERSION(chd, 1); +MODULE_VERSION(chd, 2); MODULE_DEPEND(chd, ertt, 1, 1, 1); diff --git a/sys/netinet/cc/cc_cubic.c b/sys/netinet/cc/cc_cubic.c --- a/sys/netinet/cc/cc_cubic.c +++ b/sys/netinet/cc/cc_cubic.c @@ -62,6 +62,10 @@ #include +#include +#include + +#include #include #include #include @@ -72,7 +76,7 @@ static void cubic_ack_received(struct cc_var *ccv, uint16_t type); static void cubic_cb_destroy(struct cc_var *ccv); -static int cubic_cb_init(struct cc_var *ccv); +static int cubic_cb_init(struct cc_var *ccv, void *ptr); static void cubic_cong_signal(struct cc_var *ccv, uint32_t type); static void cubic_conn_init(struct cc_var *ccv); static int cubic_mod_init(void); @@ -80,6 +84,7 @@ static void cubic_record_rtt(struct cc_var *ccv); static void cubic_ssthresh_update(struct cc_var *ccv, uint32_t maxseg); static void cubic_after_idle(struct cc_var *ccv); +static size_t cubic_data_sz(void); struct cubic { /* Cubic K in fixed point form with CUBIC_SHIFT worth of precision. 
*/ @@ -114,9 +119,6 @@ int t_last_cong_prev; }; -static MALLOC_DEFINE(M_CUBIC, "cubic data", - "Per connection data required for the CUBIC congestion control algorithm"); - struct cc_algo cubic_cc_algo = { .name = "cubic", .ack_received = cubic_ack_received, @@ -127,6 +129,7 @@ .mod_init = cubic_mod_init, .post_recovery = cubic_post_recovery, .after_idle = cubic_after_idle, + .cc_data_sz = cubic_data_sz }; static void @@ -149,7 +152,7 @@ if (CCV(ccv, snd_cwnd) <= CCV(ccv, snd_ssthresh) || cubic_data->min_rtt_ticks == TCPTV_SRTTBASE) { cubic_data->flags |= CUBICFLAG_IN_SLOWSTART; - newreno_cc_algo.ack_received(ccv, type); + newreno_cc_ack_received(ccv, type); } else { if ((cubic_data->flags & CUBICFLAG_RTO_EVENT) && (cubic_data->flags & CUBICFLAG_IN_SLOWSTART)) { @@ -243,25 +246,34 @@ cubic_data->max_cwnd = ulmax(cubic_data->max_cwnd, CCV(ccv, snd_cwnd)); cubic_data->K = cubic_k(cubic_data->max_cwnd / CCV(ccv, t_maxseg)); - newreno_cc_algo.after_idle(ccv); + newreno_cc_after_idle(ccv); cubic_data->t_last_cong = ticks; } static void cubic_cb_destroy(struct cc_var *ccv) { - free(ccv->cc_data, M_CUBIC); + free(ccv->cc_data, M_CC_MEM); +} + +static size_t +cubic_data_sz(void) +{ + return (sizeof(struct cubic)); } static int -cubic_cb_init(struct cc_var *ccv) +cubic_cb_init(struct cc_var *ccv, void *ptr) { struct cubic *cubic_data; - cubic_data = malloc(sizeof(struct cubic), M_CUBIC, M_NOWAIT|M_ZERO); - - if (cubic_data == NULL) - return (ENOMEM); + INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb); + if (ptr == NULL) { + cubic_data = malloc(sizeof(struct cubic), M_CC_MEM, M_NOWAIT|M_ZERO); + if (cubic_data == NULL) + return (ENOMEM); + } else + cubic_data = ptr; /* Init some key variables with sensible defaults. */ cubic_data->t_last_cong = ticks; @@ -484,4 +496,4 @@ } DECLARE_CC_MODULE(cubic, &cubic_cc_algo); -MODULE_VERSION(cubic, 1); +MODULE_VERSION(cubic, 2); diff --git a/sys/netinet/cc/cc_dctcp.c b/sys/netinet/cc/cc_dctcp.c --- a/sys/netinet/cc/cc_dctcp.c +++ b/sys/netinet/cc/cc_dctcp.c @@ -50,6 +50,10 @@ #include +#include +#include + +#include #include #include #include @@ -76,18 +80,16 @@ uint32_t num_cong_events; /* # of congestion events */ }; -static MALLOC_DEFINE(M_dctcp, "dctcp data", - "Per connection data required for the dctcp algorithm"); - static void dctcp_ack_received(struct cc_var *ccv, uint16_t type); static void dctcp_after_idle(struct cc_var *ccv); static void dctcp_cb_destroy(struct cc_var *ccv); -static int dctcp_cb_init(struct cc_var *ccv); +static int dctcp_cb_init(struct cc_var *ccv, void *ptr); static void dctcp_cong_signal(struct cc_var *ccv, uint32_t type); static void dctcp_conn_init(struct cc_var *ccv); static void dctcp_post_recovery(struct cc_var *ccv); static void dctcp_ecnpkt_handler(struct cc_var *ccv); static void dctcp_update_alpha(struct cc_var *ccv); +static size_t dctcp_data_sz(void); struct cc_algo dctcp_cc_algo = { .name = "dctcp", @@ -99,6 +101,7 @@ .post_recovery = dctcp_post_recovery, .ecnpkt_handler = dctcp_ecnpkt_handler, .after_idle = dctcp_after_idle, + .cc_data_sz = dctcp_data_sz, }; static void @@ -117,10 +120,10 @@ */ if (IN_CONGRECOVERY(CCV(ccv, t_flags))) { EXIT_CONGRECOVERY(CCV(ccv, t_flags)); - newreno_cc_algo.ack_received(ccv, type); + newreno_cc_ack_received(ccv, type); ENTER_CONGRECOVERY(CCV(ccv, t_flags)); } else - newreno_cc_algo.ack_received(ccv, type); + newreno_cc_ack_received(ccv, type); if (type == CC_DUPACK) bytes_acked = min(ccv->bytes_this_ack, CCV(ccv, t_maxseg)); @@ -158,7 +161,13 @@ SEQ_GT(ccv->curack, dctcp_data->save_sndnxt)) 
dctcp_update_alpha(ccv); } else - newreno_cc_algo.ack_received(ccv, type); + newreno_cc_ack_received(ccv, type); +} + +static size_t +dctcp_data_sz(void) +{ + return (sizeof(struct dctcp)); } static void @@ -179,25 +188,27 @@ dctcp_data->num_cong_events = 0; } - newreno_cc_algo.after_idle(ccv); + newreno_cc_after_idle(ccv); } static void dctcp_cb_destroy(struct cc_var *ccv) { - free(ccv->cc_data, M_dctcp); + free(ccv->cc_data, M_CC_MEM); } static int -dctcp_cb_init(struct cc_var *ccv) +dctcp_cb_init(struct cc_var *ccv, void *ptr) { struct dctcp *dctcp_data; - dctcp_data = malloc(sizeof(struct dctcp), M_dctcp, M_NOWAIT|M_ZERO); - - if (dctcp_data == NULL) - return (ENOMEM); - + INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb); + if (ptr == NULL) { + dctcp_data = malloc(sizeof(struct dctcp), M_CC_MEM, M_NOWAIT|M_ZERO); + if (dctcp_data == NULL) + return (ENOMEM); + } else + dctcp_data = ptr; /* Initialize some key variables with sensible defaults. */ dctcp_data->bytes_ecn = 0; dctcp_data->bytes_total = 0; @@ -292,7 +303,7 @@ break; } } else - newreno_cc_algo.cong_signal(ccv, type); + newreno_cc_cong_signal(ccv, type); } static void @@ -312,7 +323,7 @@ static void dctcp_post_recovery(struct cc_var *ccv) { - newreno_cc_algo.post_recovery(ccv); + newreno_cc_post_recovery(ccv); if (CCV(ccv, t_flags2) & TF2_ECN_PERMIT) dctcp_update_alpha(ccv); @@ -468,4 +479,4 @@ "half CWND reduction after the first slow start"); DECLARE_CC_MODULE(dctcp, &dctcp_cc_algo); -MODULE_VERSION(dctcp, 1); +MODULE_VERSION(dctcp, 2); diff --git a/sys/netinet/cc/cc_hd.c b/sys/netinet/cc/cc_hd.c --- a/sys/netinet/cc/cc_hd.c +++ b/sys/netinet/cc/cc_hd.c @@ -84,6 +84,7 @@ static void hd_ack_received(struct cc_var *ccv, uint16_t ack_type); static int hd_mod_init(void); +static size_t hd_data_sz(void); static int ertt_id; @@ -97,9 +98,19 @@ struct cc_algo hd_cc_algo = { .name = "hd", .ack_received = hd_ack_received, - .mod_init = hd_mod_init + .mod_init = hd_mod_init, + .cc_data_sz = hd_data_sz, + .after_idle = newreno_cc_after_idle, + .cong_signal = newreno_cc_cong_signal, + .post_recovery = newreno_cc_post_recovery, }; +static size_t +hd_data_sz(void) +{ + return (0); +} + /* * Hamilton backoff function. Returns 1 if we should backoff or 0 otherwise. */ @@ -150,14 +161,14 @@ * half cwnd and behave like an ECN (ie * not a packet loss). */ - newreno_cc_algo.cong_signal(ccv, + newreno_cc_cong_signal(ccv, CC_ECN); return; } } } } - newreno_cc_algo.ack_received(ccv, ack_type); /* As for NewReno. 
*/ + newreno_cc_ack_received(ccv, ack_type); } static int @@ -169,11 +180,6 @@ printf("%s: h_ertt module not found\n", __func__); return (ENOENT); } - - hd_cc_algo.after_idle = newreno_cc_algo.after_idle; - hd_cc_algo.cong_signal = newreno_cc_algo.cong_signal; - hd_cc_algo.post_recovery = newreno_cc_algo.post_recovery; - return (0); } @@ -251,5 +257,5 @@ "minimum queueing delay threshold (qmin) in ticks"); DECLARE_CC_MODULE(hd, &hd_cc_algo); -MODULE_VERSION(hd, 1); +MODULE_VERSION(hd, 2); MODULE_DEPEND(hd, ertt, 1, 1, 1); diff --git a/sys/netinet/cc/cc_htcp.c b/sys/netinet/cc/cc_htcp.c --- a/sys/netinet/cc/cc_htcp.c +++ b/sys/netinet/cc/cc_htcp.c @@ -64,6 +64,10 @@ #include +#include +#include + +#include #include #include #include @@ -137,7 +141,7 @@ static void htcp_ack_received(struct cc_var *ccv, uint16_t type); static void htcp_cb_destroy(struct cc_var *ccv); -static int htcp_cb_init(struct cc_var *ccv); +static int htcp_cb_init(struct cc_var *ccv, void *ptr); static void htcp_cong_signal(struct cc_var *ccv, uint32_t type); static int htcp_mod_init(void); static void htcp_post_recovery(struct cc_var *ccv); @@ -145,6 +149,7 @@ static void htcp_recalc_beta(struct cc_var *ccv); static void htcp_record_rtt(struct cc_var *ccv); static void htcp_ssthresh_update(struct cc_var *ccv); +static size_t htcp_data_sz(void); struct htcp { /* cwnd before entering cong recovery. */ @@ -175,9 +180,6 @@ #define V_htcp_adaptive_backoff VNET(htcp_adaptive_backoff) #define V_htcp_rtt_scaling VNET(htcp_rtt_scaling) -static MALLOC_DEFINE(M_HTCP, "htcp data", - "Per connection data required for the HTCP congestion control algorithm"); - struct cc_algo htcp_cc_algo = { .name = "htcp", .ack_received = htcp_ack_received, @@ -186,6 +188,8 @@ .cong_signal = htcp_cong_signal, .mod_init = htcp_mod_init, .post_recovery = htcp_post_recovery, + .cc_data_sz = htcp_data_sz, + .after_idle = newreno_cc_after_idle, }; static void @@ -214,7 +218,7 @@ */ if (htcp_data->alpha == 1 || CCV(ccv, snd_cwnd) <= CCV(ccv, snd_ssthresh)) - newreno_cc_algo.ack_received(ccv, type); + newreno_cc_ack_received(ccv, type); else { if (V_tcp_do_rfc3465) { /* Increment cwnd by alpha segments. */ @@ -238,18 +242,27 @@ static void htcp_cb_destroy(struct cc_var *ccv) { - free(ccv->cc_data, M_HTCP); + free(ccv->cc_data, M_CC_MEM); +} + +static size_t +htcp_data_sz(void) +{ + return(sizeof(struct htcp)); } static int -htcp_cb_init(struct cc_var *ccv) +htcp_cb_init(struct cc_var *ccv, void *ptr) { struct htcp *htcp_data; - htcp_data = malloc(sizeof(struct htcp), M_HTCP, M_NOWAIT); - - if (htcp_data == NULL) - return (ENOMEM); + INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb); + if (ptr == NULL) { + htcp_data = malloc(sizeof(struct htcp), M_CC_MEM, M_NOWAIT); + if (htcp_data == NULL) + return (ENOMEM); + } else + htcp_data = ptr; /* Init some key variables with sensible defaults. */ htcp_data->alpha = HTCP_INIT_ALPHA; @@ -333,16 +346,12 @@ static int htcp_mod_init(void) { - - htcp_cc_algo.after_idle = newreno_cc_algo.after_idle; - /* * HTCP_RTT_REF is defined in ms, and t_srtt in the tcpcb is stored in * units of TCP_RTT_SCALE*hz. Scale HTCP_RTT_REF to be in the same units * as t_srtt. 
*/ htcp_rtt_ref = (HTCP_RTT_REF * TCP_RTT_SCALE * hz) / 1000; - return (0); } @@ -535,4 +544,4 @@ "enable H-TCP RTT scaling"); DECLARE_CC_MODULE(htcp, &htcp_cc_algo); -MODULE_VERSION(htcp, 1); +MODULE_VERSION(htcp, 2); diff --git a/sys/netinet/cc/cc_newreno.c b/sys/netinet/cc/cc_newreno.c --- a/sys/netinet/cc/cc_newreno.c +++ b/sys/netinet/cc/cc_newreno.c @@ -71,6 +71,10 @@ #include +#include +#include + +#include #include #include #include @@ -82,22 +86,20 @@ #include #include -static MALLOC_DEFINE(M_NEWRENO, "newreno data", - "newreno beta values"); - static void newreno_cb_destroy(struct cc_var *ccv); static void newreno_ack_received(struct cc_var *ccv, uint16_t type); static void newreno_after_idle(struct cc_var *ccv); static void newreno_cong_signal(struct cc_var *ccv, uint32_t type); -static void newreno_post_recovery(struct cc_var *ccv); static int newreno_ctl_output(struct cc_var *ccv, struct sockopt *sopt, void *buf); static void newreno_newround(struct cc_var *ccv, uint32_t round_cnt); static void newreno_rttsample(struct cc_var *ccv, uint32_t usec_rtt, uint32_t rxtcnt, uint32_t fas); -static int newreno_cb_init(struct cc_var *ccv); +static int newreno_cb_init(struct cc_var *ccv, void *); +static size_t newreno_data_sz(void); -VNET_DEFINE(uint32_t, newreno_beta) = 50; -VNET_DEFINE(uint32_t, newreno_beta_ecn) = 80; + +VNET_DECLARE(uint32_t, newreno_beta); #define V_newreno_beta VNET(newreno_beta) +VNET_DEFINE(uint32_t, newreno_beta_ecn) = 80; #define V_newreno_beta_ecn VNET(newreno_beta_ecn) struct cc_algo newreno_cc_algo = { @@ -106,11 +108,12 @@ .ack_received = newreno_ack_received, .after_idle = newreno_after_idle, .cong_signal = newreno_cong_signal, - .post_recovery = newreno_post_recovery, + .post_recovery = newreno_cc_post_recovery, .ctl_output = newreno_ctl_output, .newround = newreno_newround, .rttsample = newreno_rttsample, .cb_init = newreno_cb_init, + .cc_data_sz = newreno_data_sz, }; static uint32_t hystart_lowcwnd = 16; @@ -167,14 +170,24 @@ } } +static size_t +newreno_data_sz(void) +{ + return (sizeof(struct newreno)); +} + static int -newreno_cb_init(struct cc_var *ccv) +newreno_cb_init(struct cc_var *ccv, void *ptr) { struct newreno *nreno; - ccv->cc_data = malloc(sizeof(struct newreno), M_NEWRENO, M_NOWAIT); - if (ccv->cc_data == NULL) - return (ENOMEM); + INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb); + if (ptr == NULL) { + ccv->cc_data = malloc(sizeof(struct newreno), M_CC_MEM, M_NOWAIT); + if (ccv->cc_data == NULL) + return (ENOMEM); + } else + ccv->cc_data = ptr; nreno = (struct newreno *)ccv->cc_data; /* NB: nreno is not zeroed, so initialise all fields. */ nreno->beta = V_newreno_beta; @@ -201,7 +214,7 @@ static void newreno_cb_destroy(struct cc_var *ccv) { - free(ccv->cc_data, M_NEWRENO); + free(ccv->cc_data, M_CC_MEM); } static void @@ -209,13 +222,7 @@ { struct newreno *nreno; - /* - * Other TCP congestion controls use newreno_ack_received(), but - * with their own private cc_data. Make sure the cc_data is used - * correctly. - */ - nreno = (CC_ALGO(ccv->ccvc.tcp) == &newreno_cc_algo) ? ccv->cc_data : NULL; - + nreno = ccv->cc_data; if (type == CC_ACK && !IN_RECOVERY(CCV(ccv, t_flags)) && (ccv->flags & CCF_CWND_LIMITED)) { u_int cw = CCV(ccv, snd_cwnd); @@ -249,8 +256,7 @@ * avoid capping cwnd. */ if (cw > CCV(ccv, snd_ssthresh)) { - if ((nreno != NULL) && - (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS)) { + if (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS) { /* * We have slipped into CA with * CSS active. Deactivate all. 
@@ -284,8 +290,7 @@ abc_val = ccv->labc; else abc_val = V_tcp_abc_l_var; - if ((nreno != NULL) && - (nreno->newreno_flags & CC_NEWRENO_HYSTART_ALLOWED) && + if ((nreno->newreno_flags & CC_NEWRENO_HYSTART_ALLOWED) && (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) && ((nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS) == 0)) { /* @@ -323,8 +328,7 @@ incr = min(ccv->bytes_this_ack, CCV(ccv, t_maxseg)); /* Only if Hystart is enabled will the flag get set */ - if ((nreno != NULL) && - (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS)) { + if (nreno->newreno_flags & CC_NEWRENO_HYSTART_IN_CSS) { incr /= hystart_css_growth_div; newreno_log_hystart_event(ccv, nreno, 3, incr); } @@ -340,39 +344,10 @@ newreno_after_idle(struct cc_var *ccv) { struct newreno *nreno; - uint32_t rw; - - /* - * Other TCP congestion controls use newreno_after_idle(), but - * with their own private cc_data. Make sure the cc_data is used - * correctly. - */ - nreno = (CC_ALGO(ccv->ccvc.tcp) == &newreno_cc_algo) ? ccv->cc_data : NULL; - /* - * If we've been idle for more than one retransmit timeout the old - * congestion window is no longer current and we have to reduce it to - * the restart window before we can transmit again. - * - * The restart window is the initial window or the last CWND, whichever - * is smaller. - * - * This is done to prevent us from flooding the path with a full CWND at - * wirespeed, overloading router and switch buffers along the way. - * - * See RFC5681 Section 4.1. "Restarting Idle Connections". - * - * In addition, per RFC2861 Section 2, the ssthresh is set to the - * maximum of the former ssthresh or 3/4 of the old cwnd, to - * not exit slow-start prematurely. - */ - rw = tcp_compute_initwnd(tcp_maxseg(ccv->ccvc.tcp)); - - CCV(ccv, snd_ssthresh) = max(CCV(ccv, snd_ssthresh), - CCV(ccv, snd_cwnd)-(CCV(ccv, snd_cwnd)>>2)); - CCV(ccv, snd_cwnd) = min(rw, CCV(ccv, snd_cwnd)); - if ((nreno != NULL) && - (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) == 0) { + nreno = ccv->cc_data; + newreno_cc_after_idle(ccv); + if ((nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) == 0) { if (CCV(ccv, snd_cwnd) <= (hystart_lowcwnd * tcp_fixed_maxseg(ccv->ccvc.tcp))) { /* * Re-enable hystart if our cwnd has fallen below @@ -396,12 +371,7 @@ cwin = CCV(ccv, snd_cwnd); mss = tcp_fixed_maxseg(ccv->ccvc.tcp); - /* - * Other TCP congestion controls use newreno_cong_signal(), but - * with their own private cc_data. Make sure the cc_data is used - * correctly. - */ - nreno = (CC_ALGO(ccv->ccvc.tcp) == &newreno_cc_algo) ? ccv->cc_data : NULL; + nreno = ccv->cc_data; beta = (nreno == NULL) ? V_newreno_beta : nreno->beta;; beta_ecn = (nreno == NULL) ? V_newreno_beta_ecn : nreno->beta_ecn; /* @@ -426,8 +396,7 @@ switch (type) { case CC_NDUPACK: - if ((nreno != NULL) && - (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED)) { + if (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) { /* Make sure the flags are all off we had a loss */ nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_ENABLED; nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_IN_CSS; @@ -445,8 +414,7 @@ } break; case CC_ECN: - if ((nreno != NULL) && - (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED)) { + if (nreno->newreno_flags & CC_NEWRENO_HYSTART_ENABLED) { /* Make sure the flags are all off we had a loss */ nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_ENABLED; nreno->newreno_flags &= ~CC_NEWRENO_HYSTART_IN_CSS; @@ -466,41 +434,6 @@ } } -/* - * Perform any necessary tasks before we exit congestion recovery. 
- */ -static void -newreno_post_recovery(struct cc_var *ccv) -{ - int pipe; - - if (IN_FASTRECOVERY(CCV(ccv, t_flags))) { - /* - * Fast recovery will conclude after returning from this - * function. Window inflation should have left us with - * approximately snd_ssthresh outstanding data. But in case we - * would be inclined to send a burst, better to do it via the - * slow start mechanism. - * - * XXXLAS: Find a way to do this without needing curack - */ - if (V_tcp_do_newsack) - pipe = tcp_compute_pipe(ccv->ccvc.tcp); - else - pipe = CCV(ccv, snd_max) - ccv->curack; - - if (pipe < CCV(ccv, snd_ssthresh)) - /* - * Ensure that cwnd does not collapse to 1 MSS under - * adverse conditons. Implements RFC6582 - */ - CCV(ccv, snd_cwnd) = max(pipe, CCV(ccv, t_maxseg)) + - CCV(ccv, t_maxseg); - else - CCV(ccv, snd_cwnd) = CCV(ccv, snd_ssthresh); - } -} - static int newreno_ctl_output(struct cc_var *ccv, struct sockopt *sopt, void *buf) { @@ -723,4 +656,4 @@ DECLARE_CC_MODULE(newreno, &newreno_cc_algo); -MODULE_VERSION(newreno, 1); +MODULE_VERSION(newreno, 2); diff --git a/sys/netinet/cc/cc_vegas.c b/sys/netinet/cc/cc_vegas.c --- a/sys/netinet/cc/cc_vegas.c +++ b/sys/netinet/cc/cc_vegas.c @@ -71,6 +71,10 @@ #include +#include +#include + +#include #include #include #include @@ -87,10 +91,11 @@ static void vegas_ack_received(struct cc_var *ccv, uint16_t ack_type); static void vegas_cb_destroy(struct cc_var *ccv); -static int vegas_cb_init(struct cc_var *ccv); +static int vegas_cb_init(struct cc_var *ccv, void *ptr); static void vegas_cong_signal(struct cc_var *ccv, uint32_t signal_type); static void vegas_conn_init(struct cc_var *ccv); static int vegas_mod_init(void); +static size_t vegas_data_sz(void); struct vegas { int slow_start_toggle; @@ -103,9 +108,6 @@ #define V_vegas_alpha VNET(vegas_alpha) #define V_vegas_beta VNET(vegas_beta) -static MALLOC_DEFINE(M_VEGAS, "vegas data", - "Per connection data required for the Vegas congestion control algorithm"); - struct cc_algo vegas_cc_algo = { .name = "vegas", .ack_received = vegas_ack_received, @@ -113,7 +115,10 @@ .cb_init = vegas_cb_init, .cong_signal = vegas_cong_signal, .conn_init = vegas_conn_init, - .mod_init = vegas_mod_init + .mod_init = vegas_mod_init, + .cc_data_sz = vegas_data_sz, + .after_idle = newreno_cc_after_idle, + .post_recovery = newreno_cc_post_recovery, }; /* @@ -162,24 +167,33 @@ } if (vegas_data->slow_start_toggle) - newreno_cc_algo.ack_received(ccv, ack_type); + newreno_cc_ack_received(ccv, ack_type); } static void vegas_cb_destroy(struct cc_var *ccv) { - free(ccv->cc_data, M_VEGAS); + free(ccv->cc_data, M_CC_MEM); +} + +static size_t +vegas_data_sz(void) +{ + return (sizeof(struct vegas)); } static int -vegas_cb_init(struct cc_var *ccv) +vegas_cb_init(struct cc_var *ccv, void *ptr) { struct vegas *vegas_data; - vegas_data = malloc(sizeof(struct vegas), M_VEGAS, M_NOWAIT); - - if (vegas_data == NULL) - return (ENOMEM); + INP_WLOCK_ASSERT(ccv->ccvc.tcp->t_inpcb); + if (ptr == NULL) { + vegas_data = malloc(sizeof(struct vegas), M_CC_MEM, M_NOWAIT); + if (vegas_data == NULL) + return (ENOMEM); + } else + vegas_data = ptr; vegas_data->slow_start_toggle = 1; ccv->cc_data = vegas_data; @@ -216,7 +230,7 @@ break; default: - newreno_cc_algo.cong_signal(ccv, signal_type); + newreno_cc_cong_signal(ccv, signal_type); } if (IN_RECOVERY(CCV(ccv, t_flags)) && !presignalrecov) @@ -236,16 +250,11 @@ static int vegas_mod_init(void) { - ertt_id = khelp_get_id("ertt"); if (ertt_id <= 0) { printf("%s: h_ertt module not found\n", __func__); 
 		return (ENOENT);
 	}
-
-	vegas_cc_algo.after_idle = newreno_cc_algo.after_idle;
-	vegas_cc_algo.post_recovery = newreno_cc_algo.post_recovery;
-
 	return (0);
 }
 
@@ -301,5 +310,5 @@
     "vegas beta, specified as number of \"buffers\" (0 < alpha < beta)");
 
 DECLARE_CC_MODULE(vegas, &vegas_cc_algo);
-MODULE_VERSION(vegas, 1);
+MODULE_VERSION(vegas, 2);
 MODULE_DEPEND(vegas, ertt, 1, 1, 1);
diff --git a/sys/netinet/tcp_subr.c b/sys/netinet/tcp_subr.c
--- a/sys/netinet/tcp_subr.c
+++ b/sys/netinet/tcp_subr.c
@@ -2137,8 +2137,9 @@
 	 */
 	CC_LIST_RLOCK();
 	KASSERT(!STAILQ_EMPTY(&cc_list), ("cc_list is empty!"));
-	CC_ALGO(tp) = CC_DEFAULT();
+	CC_ALGO(tp) = CC_DEFAULT_ALGO();
 	CC_LIST_RUNLOCK();
+
 	/*
 	 * The tcpcb will hold a reference on its inpcb until tcp_discardcb()
 	 * is called.
@@ -2147,7 +2148,7 @@
 	tp->t_inpcb = inp;
 
 	if (CC_ALGO(tp)->cb_init != NULL)
-		if (CC_ALGO(tp)->cb_init(tp->ccv) > 0) {
+		if (CC_ALGO(tp)->cb_init(tp->ccv, NULL) > 0) {
 			if (tp->t_fb->tfb_tcp_fb_fini)
 				(*tp->t_fb->tfb_tcp_fb_fini)(tp, 1);
 			in_pcbrele_wlocked(inp);
@@ -2240,25 +2241,23 @@
 }
 
 /*
- * Switch the congestion control algorithm back to NewReno for any active
- * control blocks using an algorithm which is about to go away.
- * This ensures the CC framework can allow the unload to proceed without leaving
- * any dangling pointers which would trigger a panic.
- * Returning non-zero would inform the CC framework that something went wrong
- * and it would be unsafe to allow the unload to proceed. However, there is no
- * way for this to occur with this implementation so we always return zero.
+ * Switch the congestion control algorithm back to the vnet default for any
+ * active control blocks using an algorithm which is about to go away.
+ * If the default algorithm has a cb_init function and that function fails
+ * (no memory), this operation fails and the unload will not succeed.
+ *
 */
 int
 tcp_ccalgounload(struct cc_algo *unload_algo)
 {
-	struct cc_algo *tmpalgo;
+	struct cc_algo *oldalgo, *newalgo;
 	struct inpcb *inp;
 	struct tcpcb *tp;
 	VNET_ITERATOR_DECL(vnet_iter);
 
 	/*
 	 * Check all active control blocks across all network stacks and change
-	 * any that are using "unload_algo" back to NewReno. If "unload_algo"
+	 * any that are using "unload_algo" back to the vnet default. If "unload_algo"
 	 * requires cleanup code to be run, call it.
 	 */
 	VNET_LIST_RLOCK();
@@ -2272,6 +2271,7 @@
 		 * therefore don't enter the loop below until the connection
 		 * list has stabilised.
 		 */
+		newalgo = CC_DEFAULT_ALGO();
 		CK_LIST_FOREACH(inp, &V_tcb, inp_list) {
 			INP_WLOCK(inp);
 			/* Important to skip tcptw structs. */
@@ -2280,24 +2280,48 @@
 				/*
 				 * By holding INP_WLOCK here, we are assured
 				 * that the connection is not currently
-				 * executing inside the CC module's functions
-				 * i.e. it is safe to make the switch back to
-				 * NewReno.
+				 * executing inside the CC module's functions.
+				 * We attempt to switch to the vnet default;
+				 * if the init fails, we fail the whole
+				 * operation and the module unload will fail.
 				 */
 				if (CC_ALGO(tp) == unload_algo) {
-					tmpalgo = CC_ALGO(tp);
-					if (tmpalgo->cb_destroy != NULL)
-						tmpalgo->cb_destroy(tp->ccv);
-					CC_DATA(tp) = NULL;
-					/*
-					 * NewReno may allocate memory on
-					 * demand for certain stateful
-					 * configuration as needed, but is
-					 * coded to never fail on memory
-					 * allocation failure so it is a safe
-					 * fallback.
diff --git a/sys/netinet/tcp_subr.c b/sys/netinet/tcp_subr.c
--- a/sys/netinet/tcp_subr.c
+++ b/sys/netinet/tcp_subr.c
@@ -2137,8 +2137,9 @@
	 */
	CC_LIST_RLOCK();
	KASSERT(!STAILQ_EMPTY(&cc_list), ("cc_list is empty!"));
-	CC_ALGO(tp) = CC_DEFAULT();
+	CC_ALGO(tp) = CC_DEFAULT_ALGO();
	CC_LIST_RUNLOCK();
+
	/*
	 * The tcpcb will hold a reference on its inpcb until tcp_discardcb()
	 * is called.
@@ -2147,7 +2148,7 @@
	tp->t_inpcb = inp;
 
	if (CC_ALGO(tp)->cb_init != NULL)
-		if (CC_ALGO(tp)->cb_init(tp->ccv) > 0) {
+		if (CC_ALGO(tp)->cb_init(tp->ccv, NULL) > 0) {
			if (tp->t_fb->tfb_tcp_fb_fini)
				(*tp->t_fb->tfb_tcp_fb_fini)(tp, 1);
			in_pcbrele_wlocked(inp);
@@ -2240,25 +2241,23 @@
 }
 
 /*
- * Switch the congestion control algorithm back to NewReno for any active
- * control blocks using an algorithm which is about to go away.
- * This ensures the CC framework can allow the unload to proceed without leaving
- * any dangling pointers which would trigger a panic.
- * Returning non-zero would inform the CC framework that something went wrong
- * and it would be unsafe to allow the unload to proceed. However, there is no
- * way for this to occur with this implementation so we always return zero.
+ * Switch the congestion control algorithm back to the vnet default for any
+ * active control blocks using an algorithm which is about to go away. If the
+ * replacement algorithm has a cb_init function and that function fails (out
+ * of memory), the operation fails and the unload will not proceed.
+ *
 */
 int
 tcp_ccalgounload(struct cc_algo *unload_algo)
 {
-	struct cc_algo *tmpalgo;
+	struct cc_algo *oldalgo, *newalgo;
	struct inpcb *inp;
	struct tcpcb *tp;
	VNET_ITERATOR_DECL(vnet_iter);
 
	/*
	 * Check all active control blocks across all network stacks and change
-	 * any that are using "unload_algo" back to NewReno. If "unload_algo"
+	 * any that are using "unload_algo" back to its default. If "unload_algo"
	 * requires cleanup code to be run, call it.
	 */
	VNET_LIST_RLOCK();
@@ -2272,6 +2271,7 @@
		 * therefore don't enter the loop below until the connection
		 * list has stabilised.
		 */
+		newalgo = CC_DEFAULT_ALGO();
		CK_LIST_FOREACH(inp, &V_tcb, inp_list) {
			INP_WLOCK(inp);
			/* Important to skip tcptw structs. */
@@ -2280,24 +2280,48 @@
				/*
				 * By holding INP_WLOCK here, we are assured
				 * that the connection is not currently
-				 * executing inside the CC module's functions
-				 * i.e. it is safe to make the switch back to
-				 * NewReno.
+				 * executing inside the CC module's functions.
+				 * We attempt to switch to the vnet default;
+				 * if the init fails, the whole operation
+				 * fails and the module unload is aborted.
				 */
				if (CC_ALGO(tp) == unload_algo) {
-					tmpalgo = CC_ALGO(tp);
-					if (tmpalgo->cb_destroy != NULL)
-						tmpalgo->cb_destroy(tp->ccv);
-					CC_DATA(tp) = NULL;
-					/*
-					 * NewReno may allocate memory on
-					 * demand for certain stateful
-					 * configuration as needed, but is
-					 * coded to never fail on memory
-					 * allocation failure so it is a safe
-					 * fallback.
-					 */
-					CC_ALGO(tp) = &newreno_cc_algo;
+					struct cc_var cc_mem;
+					int err;
+
+					oldalgo = CC_ALGO(tp);
+					memset(&cc_mem, 0, sizeof(cc_mem));
+					cc_mem.ccvc.tcp = tp;
+					if (newalgo->cb_init == NULL) {
+						/*
+						 * No cb_init, so we can skip
+						 * the failure handling below.
+						 */
+						CC_DATA(tp) = NULL;
+						goto proceed;
+					}
+					err = (newalgo->cb_init)(&cc_mem, NULL);
+					if (err) {
+						/*
+						 * Presumably out of memory;
+						 * the caller must try again.
+						 */
+						INP_WUNLOCK(inp);
+						INP_INFO_WUNLOCK(&V_tcbinfo);
+						CURVNET_RESTORE();
+						VNET_LIST_RUNLOCK();
+						return (err);
+					}
+proceed:
+					if (oldalgo->cb_destroy != NULL)
+						oldalgo->cb_destroy(tp->ccv);
+					CC_ALGO(tp) = newalgo;
+					memcpy(tp->ccv, &cc_mem, sizeof(struct cc_var));
+					if (TCPS_HAVEESTABLISHED(tp->t_state) &&
+					    (CC_ALGO(tp)->conn_init != NULL)) {
+						/* Run the connection init for the new CC. */
+						CC_ALGO(tp)->conn_init(tp->ccv);
+					}
				}
			}
			INP_WUNLOCK(inp);
@@ -2306,7 +2330,6 @@
		CURVNET_RESTORE();
	}
	VNET_LIST_RUNLOCK();
-
	return (0);
 }
 
diff --git a/sys/netinet/tcp_usrreq.c b/sys/netinet/tcp_usrreq.c
--- a/sys/netinet/tcp_usrreq.c
+++ b/sys/netinet/tcp_usrreq.c
@@ -2007,6 +2007,115 @@
 }
 #endif
 
+extern struct cc_algo newreno_cc_algo;
+
+static int
+tcp_congestion(struct socket *so, struct sockopt *sopt, struct inpcb *inp, struct tcpcb *tp)
+{
+	struct cc_algo *algo;
+	void *ptr = NULL;
+	struct cc_var cc_mem;
+	char buf[TCP_CA_NAME_MAX];
+	size_t mem_sz;
+	int error;
+
+	INP_WUNLOCK(inp);
+	error = sooptcopyin(sopt, buf, TCP_CA_NAME_MAX - 1, 1);
+	if (error)
+		return(error);
+	buf[sopt->sopt_valsize] = '\0';
+	CC_LIST_RLOCK();
+	STAILQ_FOREACH(algo, &cc_list, entries)
+		if (strncmp(buf, algo->name,
+			TCP_CA_NAME_MAX) == 0) {
+			if (algo->flags & CC_MODULE_BEING_REMOVED) {
+				/* We can't "see" modules being unloaded. */
+				continue;
+			}
+			break;
+		}
+	if (algo == NULL) {
+		CC_LIST_RUNLOCK();
+		return(ESRCH);
+	}
+do_over:
+	if (algo->cb_init != NULL) {
+		/* Preallocate the memory the CC module needs. */
+		mem_sz = (*algo->cc_data_sz)();
+		if (mem_sz == 0) {
+			goto no_mem_needed;
+		}
+		CC_LIST_RUNLOCK();
+		ptr = malloc(mem_sz, M_CC_MEM, M_WAITOK);
+		CC_LIST_RLOCK();
+		STAILQ_FOREACH(algo, &cc_list, entries)
+			if (strncmp(buf, algo->name,
+				TCP_CA_NAME_MAX) == 0)
+				break;
+		if (algo == NULL) {
+			if (ptr)
+				free(ptr, M_CC_MEM);
+			CC_LIST_RUNLOCK();
+			return(ESRCH);
+		}
+	} else {
+no_mem_needed:
+		mem_sz = 0;
+		ptr = NULL;
+	}
+	/*
+	 * Make sure everything is clean and zeroed, and
+	 * reacquire the inp lock.
+	 */
+	memset(&cc_mem, 0, sizeof(cc_mem));
+	if (mem_sz != (*algo->cc_data_sz)()) {
+		if (ptr)
+			free(ptr, M_CC_MEM);
+		goto do_over;
+	}
+	if (ptr) {
+		memset(ptr, 0, mem_sz);
+		INP_WLOCK_RECHECK_CLEANUP(inp, free(ptr, M_CC_MEM));
+	} else
+		INP_WLOCK_RECHECK(inp);
+	CC_LIST_RUNLOCK();
+	cc_mem.ccvc.tcp = tp;
+	/*
+	 * We once again hold a write lock over the tcb so it's
+	 * safe to do these things without ordering concerns.
+	 * Note that we initialize into stack memory here.
+	 */
+	if (algo->cb_init != NULL)
+		error = algo->cb_init(&cc_mem, ptr);
+	else
+		error = 0;
+	/*
+	 * A CC algorithm, when handed its preallocated
+	 * memory, should not fail; in theory we could
+	 * KASSERT here.
+	 */
+	if (error == 0) {
+		/*
+		 * Success: move the connection to the new CC
+		 * module by copying in the cc_mem we built,
+		 * after calling the old module's cleanup
+		 * (if any).
+		 */
+		if (CC_ALGO(tp)->cb_destroy != NULL)
+			CC_ALGO(tp)->cb_destroy(tp->ccv);
+		memcpy(tp->ccv, &cc_mem, sizeof(struct cc_var));
+		tp->cc_algo = algo;
+		/* Has this connection gotten past conn_init? */
+		if (TCPS_HAVEESTABLISHED(tp->t_state) && (CC_ALGO(tp)->conn_init != NULL)) {
+			/* Run the connection init for the new CC. */
+			CC_ALGO(tp)->conn_init(tp->ccv);
+		}
+	} else if (ptr)
+		free(ptr, M_CC_MEM);
+	INP_WUNLOCK(inp);
+	return (error);
+}
+
 int
 tcp_default_ctloutput(struct socket *so, struct sockopt *sopt, struct inpcb *inp, struct tcpcb *tp)
 {
@@ -2016,7 +2125,6 @@
 #ifdef KERN_TLS
	struct tls_enable tls;
 #endif
-	struct cc_algo *algo;
	char	*pbuf, buf[TCP_LOG_ID_LEN];
 #ifdef STATS
	struct statsblob *sbp;
@@ -2223,46 +2331,7 @@
		break;
 
	case TCP_CONGESTION:
-		INP_WUNLOCK(inp);
-		error = sooptcopyin(sopt, buf, TCP_CA_NAME_MAX - 1, 1);
-		if (error)
-			break;
-		buf[sopt->sopt_valsize] = '\0';
-		INP_WLOCK_RECHECK(inp);
-		CC_LIST_RLOCK();
-		STAILQ_FOREACH(algo, &cc_list, entries)
-			if (strncmp(buf, algo->name,
-			    TCP_CA_NAME_MAX) == 0)
-				break;
-		CC_LIST_RUNLOCK();
-		if (algo == NULL) {
-			INP_WUNLOCK(inp);
-			error = EINVAL;
-			break;
-		}
-		/*
-		 * We hold a write lock over the tcb so it's safe to
-		 * do these things without ordering concerns.
-		 */
-		if (CC_ALGO(tp)->cb_destroy != NULL)
-			CC_ALGO(tp)->cb_destroy(tp->ccv);
-		CC_DATA(tp) = NULL;
-		CC_ALGO(tp) = algo;
-		/*
-		 * If something goes pear shaped initialising the new
-		 * algo, fall back to newreno (which does not
-		 * require initialisation).
-		 */
-		if (algo->cb_init != NULL &&
-		    algo->cb_init(tp->ccv) != 0) {
-			CC_ALGO(tp) = &newreno_cc_algo;
-			/*
-			 * The only reason init should fail is
-			 * because of malloc.
-			 */
-			error = ENOMEM;
-		}
-		INP_WUNLOCK(inp);
+		error = tcp_congestion(so, sopt, inp, tp);
		break;
 
	case TCP_REUSPORT_LB_NUMA:
diff --git a/sys/powerpc/conf/GENERIC b/sys/powerpc/conf/GENERIC
--- a/sys/powerpc/conf/GENERIC
+++ b/sys/powerpc/conf/GENERIC
@@ -38,6 +38,8 @@
 options 	VIMAGE			# Subsystem virtualization, e.g. VNET
 options 	INET			#InterNETworking
 options 	INET6			#IPv6 communications protocols
+options 	CC_NEWRENO		# include newreno congestion control
+options 	CC_DEFAULT=\"newreno\"	# define the default CC module; it must be compiled in
 options 	IPSEC_SUPPORT		# Allow kldload of ipsec and tcpmd5
 options 	TCP_HHOOK		# hhook(9) framework for TCP
 options 	TCP_RFC7413		# TCP Fast Open
diff --git a/sys/riscv/conf/GENERIC b/sys/riscv/conf/GENERIC
--- a/sys/riscv/conf/GENERIC
+++ b/sys/riscv/conf/GENERIC
@@ -29,6 +29,8 @@
 options 	VIMAGE			# Subsystem virtualization, e.g. VNET
 options 	INET			# InterNETworking
 options 	INET6			# IPv6 communications protocols
+options 	CC_NEWRENO		# include newreno congestion control
+options 	CC_DEFAULT=\"newreno\"	# define the default CC module; it must be compiled in
 options 	TCP_HHOOK		# hhook(9) framework for TCP
 options 	IPSEC_SUPPORT		# Allow kldload of ipsec and tcpmd5
 options 	ROUTE_MPATH		# Multipath routing support
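The GENERIC changes above illustrate the new requirement that every kernel configuration build in at least one CC module and name a default; the build fails otherwise. As an illustration only (not part of this patch), a custom configuration could instead default to cubic, provided that module is also compiled in:

options 	CC_NEWRENO		# newreno built in as well
options 	CC_CUBIC		# hypothetical choice: build in cubic
options 	CC_DEFAULT=\"cubic\"	# the default must name a built-in module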