Add a limit on how long TCP data may sit at the head of the output queue waiting to be acknowledged by the remote side.
While data sits in the output queue, it occupies socket-buffer space. When the remote side holds a connection in persist for an extended period, or consumes data slowly, this data can back up in the socket buffers. Furthermore, the data may even become stale.
User-space processes can handle part of this themselves through their own idle timers and similar mechanisms. However, a user-space process does not have as much visibility into what is happening in the TCP stack as the kernel does. Additionally, once a user-space process has transmitted its data and closed the connection, it loses the ability to monitor the connection at all and must rely on the kernel to manage it.
This feature is not covered by the TCP specification, although it is hinted at by the "user timeout" described in RFC 793. It is also equivalent to what a user-space application could choose to do on its own through more expensive user-space monitoring. Finally, it is an important capability for managing the buffer space used on a server.
The user-space API changes are:
- New sysctl (net.inet.tcp.maxunacktime), which provides a default value for this feature.
- New socket option (TCP_MAXUNACKTIME), which lets an application set it on a per-socket basis. (If set on the listen socket, connections accepted through the listen socket will inherit the setting.)
The feature is disabled by default.
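For illustration only, a minimal user-space sketch of applying the per-socket limit with the new option. The helper name is hypothetical, error handling is abbreviated, and the unit of the value (seconds vs. milliseconds) is an assumption that should be checked against the header/man page on the target system:

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <err.h>

/*
 * Hypothetical helper: apply TCP_MAXUNACKTIME to a socket.  A value of 0
 * leaves the feature disabled (the default).  Setting it on a listen
 * socket is inherited by connections accepted through that socket.  The
 * unit of "limit" is an assumption to verify on the target system.
 */
void
set_maxunacktime(int fd, int limit)
{
	if (setsockopt(fd, IPPROTO_TCP, TCP_MAXUNACKTIME,
	    &limit, sizeof(limit)) == -1)
		warn("setsockopt(TCP_MAXUNACKTIME)");
}

The system-wide default can likewise be adjusted through the sysctl, e.g. "sysctl net.inet.tcp.maxunacktime=<value>".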
The mechanism is fairly simple:
- Record the time we add new data to the socket buffer or transmit new data on an idle connection.
- Update that time when the remote peer's cumulative ACK advances by more than one byte. (This avoids counting ACKed persist probes.)
- When the persist or retransmit timer notices that the data at the head of the output queue has not been acknowledged within maxunacktime, it resets the connection.
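A rough C sketch of those three steps, with hypothetical names and greatly simplified state; the real tracking lives in the TCP control block and in the timer and input paths, not in standalone helpers like these:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Illustrative stand-in for the relevant control-block fields. */
struct tcpcb_sketch {
	uint32_t t_maxunacktime;	/* limit; 0 = feature disabled (unit illustrative) */
	time_t	 t_acktime;		/* when head-of-queue data started waiting; 0 = none */
	size_t	 sb_cc;			/* bytes still sitting in the output queue */
};

/* Step 1: stamp the time when new data is added to the socket buffer or
 * first transmitted on an otherwise idle connection. */
void
maxunack_stamp(struct tcpcb_sketch *tp)
{
	if (tp->t_acktime == 0)
		tp->t_acktime = time(NULL);
}

/* Step 2: refresh the stamp only when the peer's cumulative ACK advances
 * by more than one byte, so ACKed persist probes are not counted. */
void
maxunack_ack(struct tcpcb_sketch *tp, uint32_t bytes_acked)
{
	if (bytes_acked > 1)
		tp->t_acktime = (tp->sb_cc > 0) ? time(NULL) : 0;
}

/* Step 3: consulted from the persist and retransmit timers; returns true
 * when the head-of-queue data has gone unacknowledged longer than the
 * limit, in which case the caller resets the connection. */
bool
maxunack_expired(const struct tcpcb_sketch *tp)
{
	return (tp->t_maxunacktime != 0 && tp->t_acktime != 0 &&
	    time(NULL) - tp->t_acktime >= (time_t)tp->t_maxunacktime);
}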