Index: share/man/man4/tcp.4
===================================================================
--- share/man/man4/tcp.4
+++ share/man/man4/tcp.4
@@ -34,7 +34,7 @@
 .\" From: @(#)tcp.4 8.1 (Berkeley) 6/5/93
 .\" $FreeBSD$
 .\"
-.Dd July 14, 2022
+.Dd July 18, 2022
 .Dt TCP 4
 .Os
 .Sh NAME
@@ -422,64 +422,6 @@
 .Xr sysctl 3
 MIB.
 .Bl -tag -width ".Va v6pmtud_blackhole_mss"
-.It Va rfc1323
-Implement the window scaling and timestamp options of RFC 1323/RFC 7323
-(default is true).
-.It Va tolerate_missing_ts
-Tolerate the missing of timestamps (RFC 1323/RFC 7323) for
-.Tn TCP
-segments belonging to
-.Tn TCP
-connections for which support of
-.Tn TCP
-timestamps has been negotiated.
-As of June 2021, several TCP stacks are known to violate RFC 7323, including
-modern widely deployed ones.
-Therefore the default is 1, i.e., the missing of timestamps is tolerated.
-.It Va mssdflt
-The default value used for the maximum segment size
-.Pq Dq MSS
-when no advice to the contrary is received from MSS negotiation.
-.It Va sendspace
-Maximum
-.Tn TCP
-send window.
-.It Va recvspace
-Maximum
-.Tn TCP
-receive window.
-.It Va log_in_vain
-Log any connection attempts to ports where there is not a socket
-accepting connections.
-The value of 1 limits the logging to
-.Tn SYN
-(connection establishment) packets only.
-That of 2 results in any
-.Tn TCP
-packets to closed ports being logged.
-Any value unlisted above disables the logging
-(default is 0, i.e., the logging is disabled).
-.It Va msl
-The Maximum Segment Lifetime, in milliseconds, for a packet.
-.It Va keepinit
-Timeout, in milliseconds, for new, non-established
-.Tn TCP
-connections.
-The default is 75000 msec.
-.It Va keepidle
-Amount of time, in milliseconds, that the connection must be idle
-before keepalive probes (if enabled) are sent.
-The default is 7200000 msec (2 hours).
-.It Va keepintvl
-The interval, in milliseconds, between keepalive probes sent to remote
-machines, when no response is received on a
-.Va keepidle
-probe.
-The default is 75000 msec.
-.It Va keepcnt
-Number of probes sent, with no response, before a connection
-is dropped.
-The default is 8 packets.
 .It Va always_keepalive
 Assume that
 .Dv SO_KEEPALIVE
@@ -488,115 +430,15 @@
 connections, the kernel will
 periodically send a packet to the remote host to verify the connection
 is still up.
-.It Va icmp_may_rst
-Certain
-.Tn ICMP
-unreachable messages may abort connections in
-.Tn SYN-SENT
-state.
-.It Va do_tcpdrain
-Flush packets in the
-.Tn TCP
-reassembly queue if the system is low on mbufs.
 .It Va blackhole
 If enabled, disable sending of RST when a connection is attempted
 to a port where there is not a socket accepting connections.
 See
 .Xr blackhole 4 .
-.It Va delayed_ack
-Delay ACK to try and piggyback it onto a data packet.
 .It Va delacktime
 Maximum amount of time, in milliseconds, before a delayed ACK is sent.
-.It Va path_mtu_discovery
-Enable Path MTU Discovery.
-.It Va tcbhashsize
-Size of the
-.Tn TCP
-control-block hash table
-(read-only).
-This may be tuned using the kernel option
-.Dv TCBHASHSIZE
-or by setting
-.Va net.inet.tcp.tcbhashsize
-in the
-.Xr loader 8 .
-.It Va pcbcount
-Number of active process control blocks
-(read-only).
-.It Va syncookies
-Determines whether or not
-.Tn SYN
-cookies should be generated for outbound
-.Tn SYN-ACK
-packets.
-.Tn SYN
-cookies are a great help during
-.Tn SYN
-flood attacks, and are enabled by default.
-(See
-.Xr syncookies 4 . )
-.It Va isn_reseed_interval
-The interval (in seconds) specifying how often the secret data used in
-RFC 1948 initial sequence number calculations should be reseeded.
-By default, this variable is set to zero, indicating that
-no reseeding will occur.
-Reseeding should not be necessary, and will break
-.Dv TIME_WAIT
-recycling for a few minutes.
-.It Va reass.cursegments
-The current total number of segments present in all reassembly queues.
-.It Va reass.maxsegments
-The maximum limit on the total number of segments across all reassembly
-queues.
-The limit can be adjusted as a tunable.
-.It Va reass.maxqueuelen
-The maximum number of segments allowed in each reassembly queue.
-By default, the system chooses a limit based on each TCP connection's
-receive buffer size and maximum segment size (MSS).
-The actual limit applied to a session's reassembly queue will be the lower of
-the system-calculated automatic limit and the user-specified
-.Va reass.maxqueuelen
-limit.
-.It Va rexmit_initial , rexmit_min , rexmit_slop
-Adjust the retransmit timer calculation for
-.Tn TCP .
-The slop is
-typically added to the raw calculation to take into account
-occasional variances that the
-.Tn SRTT
-(smoothed round-trip time)
-is unable to accommodate, while the minimum specifies an
-absolute minimum.
-While a number of
-.Tn TCP
-RFCs suggest a 1
-second minimum, these RFCs tend to focus on streaming behavior,
-and fail to deal with the fact that a 1 second minimum has severe
-detrimental effects over lossy interactive connections, such
-as a 802.11b wireless link, and over very fast but lossy
-connections for those cases not covered by the fast retransmit
-code.
-For this reason, we use 200ms of slop and a near-0
-minimum, which gives us an effective minimum of 200ms (similar to
-.Tn Linux ) .
-The initial value is used before an RTT measurement has been performed.
-.It Va initcwnd_segments
-Enable the ability to specify initial congestion window in number of segments.
-The default value is 10 as suggested by RFC 6928.
-Changing the value on fly would not affect connections using congestion window
-from the hostcache.
-Caution:
-This regulates the burst of packets allowed to be sent in the first RTT.
-The value should be relative to the link capacity.
-Start with small values for lower-capacity links.
-Large bursts can cause buffer overruns and packet drops if routers have small
-buffers or the link is experiencing congestion.
-.It Va newcwd
-Enable the New Congestion Window Validation mechanism as described in RFC 7661.
-This gently reduces the congestion window during periods, where TCP is
-application limited and the network bandwidth is not utilized completely.
-That prevents self-inflicted packet losses once the application starts to
-transmit data at a higher speed.
+.It Va delayed_ack
+Delay ACK to try and piggyback it onto a data packet.
 .It Va do_lrd
 Enable Lost Retransmission Detection for SACK-enabled sessions, disabled by
 default.
@@ -617,76 +459,10 @@
 Helpful when a misconfigured token bucket traffic policer causes persistent
 high losses leading to RTO, but reduces PRR effectiveness in more common
 settings (default is false).
-.It Va rfc6675_pipe
-Deprecated and superseded by
-.Va sack.revised
-.It Va rfc3042
-Enable the Limited Transmit algorithm as described in RFC 3042.
-It helps avoid timeouts on lossy links and also when the congestion window
-is small, as happens on short transfers.
-.It Va rfc3390
-Enable support for RFC 3390, which allows for a variable-sized
-starting congestion window on new connections, depending on the
-maximum segment size.
-This helps throughput in general, but
-particularly affects short transfers and high-bandwidth large
-propagation-delay connections.
-.It Va sack.enable
-Enable support for RFC 2018, TCP Selective Acknowledgment option,
-which allows the receiver to inform the sender about all successfully
-arrived segments, allowing the sender to retransmit the missing segments
-only.
-.It Va sack.revised
-Enables three updated mechanisms from RFC6675 (default is true).
-Calculate the bytes in flight using the algorithm described in RFC 6675, and
-is also an improvement when Proportional Rate Reduction is enabled.
-Next, Rescue Retransmission helps timely loss recovery, when the trailing segments
-of a transmission are lost, while no additional data is ready to be sent.
-In case a partial ACK without a SACK block is received during SACK loss
-recovery, the trailing segment is immediately resent, rather than waiting
-for a Retransmission timeout.
-Finally, SACK loss recovery is also engaged, once two segments plus one byte are
-SACKed - even if no traditional duplicate ACKs were observed.
-.It Va sack.maxholes
-Maximum number of SACK holes per connection.
-Defaults to 128.
-.It Va sack.globalmaxholes
-Maximum number of SACK holes per system, across all connections.
-Defaults to 65536.
-.It Va maxtcptw
-When a TCP connection enters the
-.Dv TIME_WAIT
-state, its associated socket structure is freed, since it is of
-negligible size and use, and a new structure is allocated to contain a
-minimal amount of information necessary for sustaining a connection in
-this state, called the compressed TCP TIME_WAIT state.
-Since this structure is smaller than a socket structure, it can save
-a significant amount of system memory.
-The
-.Va net.inet.tcp.maxtcptw
-MIB variable controls the maximum number of these structures allocated.
-By default, it is initialized to
-.Va kern.ipc.maxsockets
-/ 5.
-.It Va nolocaltimewait
-Suppress creating of compressed TCP TIME_WAIT states for connections in
-which both endpoints are local.
-.It Va fast_finwait2_recycle
-Recycle
-.Tn TCP
-.Dv FIN_WAIT_2
-connections faster when the socket is marked as
-.Dv SBS_CANTRCVMORE
-(no user process has the socket open, data received on
-the socket cannot be read).
-The timeout used here is
-.Va finwait2_timeout .
-.It Va finwait2_timeout
-Timeout to use for fast recycling of
+.It Va do_tcpdrain
+Flush packets in the
 .Tn TCP
-.Dv FIN_WAIT_2
-connections.
-Defaults to 60 seconds.
+reassembly queue if the system is low on mbufs.
 .It Va ecn.enable
 Enable support for TCP Explicit Congestion Notification (ECN).
 ECN allows a TCP sender to reduce the transmission rate in order to
@@ -707,40 +483,20 @@
 specific connection.
 This is needed to help with connection establishment
 when a broken firewall is in the network path.
-.It Va pmtud_blackhole_detection
-Enable automatic path MTU blackhole detection.
-In case of retransmits of MSS sized segments,
-the OS will lower the MSS to check if it's an MTU problem.
-If the current MSS is greater than the configured value to try
-.Po Va net.inet.tcp.pmtud_blackhole_mss
-and
-.Va net.inet.tcp.v6pmtud_blackhole_mss
-.Pc ,
-it will be set to this value, otherwise,
-the MSS will be set to the default values
-.Po Va net.inet.tcp.mssdflt
-and
-.Va net.inet.tcp.v6mssdflt
-.Pc .
-Settings:
-.Bl -tag -compact
-.It 0
-Disable path MTU blackhole detection.
-.It 1
-Enable path MTU blackhole detection for IPv4 and IPv6.
-.It 2
-Enable path MTU blackhole detection only for IPv4.
-.It 3
-Enable path MTU blackhole detection only for IPv6.
-.El
-.It Va pmtud_blackhole_mss
-MSS to try for IPv4 if PMTU blackhole detection is turned on.
-.It Va v6pmtud_blackhole_mss
-MSS to try for IPv6 if PMTU blackhole detection is turned on.
-.It Va fastopen.acceptany
-When non-zero, all client-supplied TFO cookies will be considered to be valid.
-The default is 0.
-.It Va fastopen.autokey
+.It Va fast_finwait2_recycle
+Recycle
+.Tn TCP
+.Dv FIN_WAIT_2
+connections faster when the socket is marked as
+.Dv SBS_CANTRCVMORE
+(no user process has the socket open, data received on
+the socket cannot be read).
+The timeout used here is
+.Va finwait2_timeout .
+.It Va fastopen.acceptany
+When non-zero, all client-supplied TFO cookies will be considered to be valid.
+The default is 0.
+.It Va fastopen.autokey
 When this and
 .Va net.inet.tcp.fastopen.server_enable
 are non-zero, a new key will be automatically generated after this specified
@@ -823,75 +579,182 @@
 Install a new pre-shared key by writing
 .Va net.inet.tcp.fastopen.keylen
 bytes to this sysctl.
-.It Va hostcache.enable
+.It Va finwait2_timeout
+Timeout to use for fast recycling of
+.Tn TCP
+.Dv FIN_WAIT_2
+connections
+.Pq Va fast_finwait2_recycle .
+Defaults to 60 seconds.
+.It Va functions_available
+List of available TCP function blocks (TCP stacks).
+.It Va functions_default
+The default TCP function block (TCP stack).
+.It Va functions_inherit_listen_socket_stack
+Determines whether to inherit listen socket's TCP stack or use the current
+system default TCP stack, as defined by
+.Va functions_default .
+Default is true.
+.It Va hostcache
 The TCP host cache is used to cache connection details and metrics to
 improve future performance of connections between the same hosts.
 At the completion of a TCP connection, a host will cache information
 for the connection for some defined period of time.
+There are a number of
+.Va hostcache
+variables under this node.
+See
+.Va hostcache.enable .
+.It Va hostcache.bucketlimit
+The maximum number of entries for the same hash.
+Defaults to 30.
+.It Va hostcache.cachelimit
+Overall entry limit for hostcache.
+Defaults to
+.Va hashsize
+*
+.Va bucketlimit .
+.It Va hostcache.count
+The current number of entries in the host cache.
+.It Va hostcache.enable
+Enable/disable the host cache:
 .Bl -tag -compact
 .It 0
 Disable the host cache.
 .It 1
 Enable the host cache. (default)
 .El
-.It Va hostcache.purgenow
-Immediately purge all entries once set to any value.
-Setting this to 2 will also reseed the hash salt.
-.It Va hostcache.purge
-Expire all entires on next pruning of host cache entries.
-Any non-zero setting will be reset to zero, once the pruge
-is running.
-.Bl -tag -compact
-.It 0
-Do not purge all entries when pruning the host cache. (default)
-.It 1
-Purge all entries when doing the next pruning.
-.It 2
-Purge all entries, and also reseed the hash salt.
-.El
-.It Va hostcache.prune
-Time in seconds between pruning expired host cache entries.
-Defaults to 300 (5 minutes).
 .It Va hostcache.expire
 Time in seconds, how long a entry should be kept in the
 host cache since last accessed.
 Defaults to 3600 (1 hour).
-.It Va hostcache.count
-The current number of entries in the host cache.
-.It Va hostcache.bucketlimit
-The maximum number of entries for the same hash.
-Defaults to 30.
 .It Va hostcache.hashsize
 Size of TCP hostcache hashtable.
 This number has to be a power of two, or will be rejected.
 Defaults to 512.
-.It Va hostcache.cachelimit
-Overall entry limit for hostcache.
-Defaults to hashsize * bucketlimit.
 .It Va hostcache.histo
 Provide a Histogram of the hostcache hash utilization.
 .It Va hostcache.list
 Provide a complete list of all current entries in the host cache.
-.It Va functions_available
-List of available TCP function blocks (TCP stacks).
-.It Va functions_default
-The default TCP function block (TCP stack).
-.It Va functions_inherit_listen_socket_stack
-Determines whether to inherit listen socket's tcp stack or use the current
-system default tcp stack, as defined by
-.Va functions_default .
-Default is true.
+.It Va hostcache.prune
+Time in seconds between pruning expired host cache entries.
+Defaults to 300 (5 minutes).
+.It Va hostcache.purge
+Expire all entries on next pruning of host cache entries.
+Any non-zero setting will be reset to zero, once the purge
+is running.
+.Bl -tag -compact
+.It 0
+Do not purge all entries when pruning the host cache (default).
+.It 1
+Purge all entries when doing the next pruning.
+.It 2
+Purge all entries and also reseed the hash salt.
+.El
+.It Va hostcache.purgenow
+Immediately purge all entries once set to any value.
+Setting this to 2 will also reseed the hash salt.
+.It Va icmp_may_rst
+Certain
+.Tn ICMP
+unreachable messages may abort connections in
+.Tn SYN-SENT
+state.
+.It Va initcwnd_segments
+Enable the ability to specify initial congestion window in number of segments.
+The default value is 10 as suggested by RFC 6928.
+Changing the value on the fly would not affect connections
+using congestion window from the hostcache.
+Caution:
+This regulates the burst of packets allowed to be sent in the first RTT.
+The value should be relative to the link capacity.
+Start with small values for lower-capacity links.
+Large bursts can cause buffer overruns and packet drops if routers have small
+buffers or the link is experiencing congestion.
 .It Va insecure_rst
 Use criteria defined in RFC793 instead of RFC5961 for accepting RST segments.
 Default is false.
 .It Va insecure_syn
 Use criteria defined in RFC793 instead of RFC5961 for accepting SYN segments.
 Default is false.
-.It Va ts_offset_per_conn
-When initializing the TCP timestamps, use a per connection offset instead of a
-per host pair offset.
-Default is to use per connection offsets as recommended in RFC 7323.
+.It Va isn_reseed_interval
+The interval (in seconds) specifying how often the secret data used in
+RFC 1948 initial sequence number calculations should be reseeded.
+By default, this variable is set to zero, indicating that
+no reseeding will occur.
+Reseeding should not be necessary, and will break
+.Dv TIME_WAIT
+recycling for a few minutes.
+.It Va keepcnt
+Number of keepalive probes sent, with no response, before a connection
+is dropped.
+The default is 8 packets.
+.It Va keepidle
+Amount of time, in milliseconds, that the connection must be idle
+before sending keepalive probes (if enabled).
+The default is 7200000 msec (7.2M msec, 2 hours).
+.It Va keepinit
+Timeout, in milliseconds, for new, non-established
+.Tn TCP
+connections.
+The default is 75000 msec (75K msec, 75 sec).
+.It Va keepintvl
+The interval, in milliseconds, between keepalive probes sent to remote
+machines, when no response is received on a
+.Va keepidle
+probe.
+The default is 75000 msec (75K msec, 75 sec).
+.It Va log_in_vain
+Log any connection attempts to ports where there is not a socket
+accepting connections.
+The value of 1 limits the logging to
+.Tn SYN
+(connection establishment) packets only.
+A value of 2 results in any
+.Tn TCP
+packets to closed ports being logged.
+Any value not listed above disables the logging
+(default is 0, i.e., the logging is disabled).
+.It Va maxtcptw
+When a TCP connection enters the
+.Dv TIME_WAIT
+state, its associated socket structure is freed, since it is of
+negligible size and use, and a new structure is allocated to contain a
+minimal amount of information necessary for sustaining a connection in
+this state, called the compressed TCP
+.Dv TIME_WAIT
+state.
+Since this structure is smaller than a socket structure, it can save
+a significant amount of system memory.
+The
+.Va net.inet.tcp.maxtcptw
+MIB variable controls the maximum number of these structures allocated.
+By default, it is initialized to
+.Va kern.ipc.maxsockets
+/ 5.
+.It Va msl
+The Maximum Segment Lifetime, in milliseconds, for a packet.
+.It Va mssdflt
+The default value used for the maximum segment size
+.Pq Dq MSS
+when no advice to the contrary is received from MSS negotiation.
+.It Va newcwd
+Enable the New Congestion Window Validation mechanism as described in RFC 7661.
+This gently reduces the congestion window during periods, where TCP is
+application limited and the network bandwidth is not utilized completely.
+That prevents self-inflicted packet losses once the application starts to
+transmit data at a higher speed.
+.It Va nolocaltimewait
+Suppress creation of compressed TCP
+.Dv TIME_WAIT
+states for connections in
+which both endpoints are local.
+.It Va path_mtu_discovery
+Enable Path MTU Discovery.
+.It Va pcbcount
+Number of active process control blocks
+(read-only).
 .It Va perconn_stats_enable
 Controls the default collection of statistics for all connections using the
 .Xr stats 3
@@ -903,16 +766,170 @@
 template sampling rates when
 .Xr stats 3
 sampling is enabled.
-.It Va udp_tunneling_port
-The local UDP encapsulation port.
-A value of 0 indicates that UDP encapsulation is disabled.
-The default is 0.
+.It Va pmtud_blackhole_detection
+Enable automatic path MTU blackhole detection.
+In case of retransmits of MSS sized segments,
+the OS will lower the MSS to check if it's an MTU problem.
+If the current MSS is greater than the configured value to try
+.Po Va net.inet.tcp.pmtud_blackhole_mss
+and
+.Va net.inet.tcp.v6pmtud_blackhole_mss
+.Pc ,
+it will be set to this value, otherwise,
+the MSS will be set to the default values
+.Po Va net.inet.tcp.mssdflt
+and
+.Va net.inet.tcp.v6mssdflt
+.Pc .
+Settings:
+.Bl -tag -compact
+.It 0
+Disable path MTU blackhole detection.
+.It 1
+Enable path MTU blackhole detection for IPv4 and IPv6.
+.It 2
+Enable path MTU blackhole detection only for IPv4.
+.It 3
+Enable path MTU blackhole detection only for IPv6.
+.El
+.It Va pmtud_blackhole_mss
+MSS to try for IPv4 if PMTU blackhole detection is turned on.
+.It Va reass.cursegments
+The current total number of segments present in all reassembly queues.
+.It Va reass.maxqueuelen
+The maximum number of segments allowed in each reassembly queue.
+By default, the system chooses a limit based on each TCP connection's
+receive buffer size and maximum segment size (MSS).
+The actual limit applied to a session's reassembly queue will be the lower of
+the system-calculated automatic limit and the user-specified
+.Va reass.maxqueuelen
+limit.
+.It Va reass.maxsegments
+The maximum limit on the total number of segments across all reassembly
+queues.
+The limit can be adjusted as a tunable.
+.It Va recvspace
+Maximum
+.Tn TCP
+receive window.
+.It Va rexmit_initial , rexmit_min , rexmit_slop
+Adjust the retransmit timer calculation for
+.Tn TCP .
+The slop is
+typically added to the raw calculation to take into account
+occasional variances that the
+.Tn SRTT
+(smoothed round-trip time)
+is unable to accommodate, while the minimum specifies an
+absolute minimum.
+While a number of
+.Tn TCP
+RFCs suggest a 1
+second minimum, these RFCs tend to focus on streaming behavior,
+and fail to deal with the fact that a 1 second minimum has severe
+detrimental effects over lossy interactive connections, such
+as a 802.11b wireless link, and over very fast but lossy
+connections for those cases not covered by the fast retransmit
+code.
+For this reason, we use 200ms of slop and a near-0
+minimum, which gives us an effective minimum of 200ms (similar to
+.Tn Linux ) .
+The initial value is used before an RTT measurement has been performed.
+.It Va rfc1323
+Implement the window scaling and timestamp options of RFC 1323/RFC 7323
+(default is true).
+.It Va rfc3042
+Enable the Limited Transmit algorithm as described in RFC 3042.
+It helps avoid timeouts on lossy links and also when the congestion window
+is small, as happens on short transfers.
+.It Va rfc3390
+Enable support for RFC 3390, which allows for a variable-sized
+starting congestion window on new connections, depending on the
+maximum segment size.
+This helps throughput in general, but
+particularly affects short transfers and high-bandwidth large
+propagation-delay connections.
+.It Va rfc6675_pipe
+Deprecated and superseded by
+.Va sack.revised .
+.It Va sack.enable
+Enable support for RFC 2018, TCP Selective Acknowledgment option,
+which allows the receiver to inform the sender about all successfully
+arrived segments, allowing the sender to retransmit the missing segments
+only.
+.It Va sack.globalmaxholes
+Maximum number of SACK holes per system, across all connections.
+Defaults to 65536.
+.It Va sack.maxholes
+Maximum number of SACK holes per connection.
+Defaults to 128.
+.It Va sack.revised
+Enables three updated mechanisms from RFC 6675 (default is true).
+Calculate the bytes in flight using the algorithm described in RFC 6675, and
+is also an improvement when Proportional Rate Reduction is enabled.
+Next, Rescue Retransmission helps timely loss recovery, when the trailing segments
+of a transmission are lost, while no additional data is ready to be sent.
+In case a partial ACK without a SACK block is received during SACK loss
+recovery, the trailing segment is immediately resent, rather than waiting
+for a Retransmission timeout.
+Finally, SACK loss recovery is also engaged, once two segments plus one byte are
+SACKed - even if no traditional duplicate ACKs were observed.
+.It Va sendspace
+Maximum
+.Tn TCP
+send window.
+.It Va syncookies
+Determines whether or not
+.Tn SYN
+cookies should be generated for outbound
+.Tn SYN-ACK
+packets.
+.Tn SYN
+cookies are a great help during
+.Tn SYN
+flood attacks, and are enabled by default.
+(See
+.Xr syncookies 4 . )
+.It Va tcbhashsize
+Size of the
+.Tn TCP
+control-block hash table
+(read-only).
+This is tuned using the kernel option
+.Dv TCBHASHSIZE
+or by setting
+.Va net.inet.tcp.tcbhashsize
+in the
+.Xr loader 8 .
+.It Va tolerate_missing_ts
+Tolerate the missing of timestamps (RFC 1323/RFC 7323) for
+.Tn TCP
+segments belonging to
+.Tn TCP
+connections for which support of
+.Tn TCP
+timestamps has been negotiated.
+As of June 2021, several TCP stacks are known to violate RFC 7323, including
+modern widely deployed ones.
+Therefore the default is 1, i.e., the missing of timestamps is tolerated.
+.It Va ts_offset_per_conn
+When initializing the TCP timestamps, use a per connection offset instead of a
+per host pair offset.
+Default is to use per connection offsets as recommended in RFC 7323.
 .It Va udp_tunneling_overhead
 The overhead taken into account when using UDP encapsulation.
 Since MSS clamping by middleboxes will most likely not work, values larger than
 8 (the size of the UDP header) are also supported.
 Supported values are between 8 and 1024.
 The default is 8.
+.It Va udp_tunneling_port
+The local UDP encapsulation port.
+A value of 0 indicates that UDP encapsulation is disabled.
+The default is 0.
+.It Va v6pmtud_blackhole_mss
+MSS to try for IPv6 if PMTU blackhole detection is turned on.
+See
+.Va pmtud_blackhole_detection .
 .El
 .Sh ERRORS
 A socket operation may fail with one of the following errors returned: