r307319 reduced the number of times the code rescheduled the keepalive timer. Rather than rescheduling the keepalive timer each time the code receives a packet, it lets the keepalive timer fire and reschedule itself. On a busy connection, this means the code should reschedule the keepalive timer approximately once every TP_KEEPIDLE ticks.
However, this change made things slightly worse for persistently-idle connections. In these cases, we can see this sequence of events:
1. Keepalive timer set at session open.
2. Remote side sends an ACK, increasing tp->t_rcvtime by X milliseconds.
3. Keepalive timer fires. Because the connection has been idle for X milliseconds less than the idle time, the timer reschedules itself for X milliseconds in the future.
4. Keepalive timer fires again. It sends a keepalive probe.
5. Go to step 2.
(Of course, the exact behavior depends on many factors, such as the configured idle time, configured interval, and how often the other side is doing keepalive probes. But this seems to be the worst-case scenario.)
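To make the sequence above concrete, here is a minimal standalone model of the timer decision. It is an illustrative sketch, not the kernel code: the names ka_result and keep_timer_fire are hypothetical, and all times are plain milliseconds rather than ticks.

```c
#include <stdbool.h>

/* Hypothetical model, not kernel code. All times are in milliseconds. */
struct ka_result {
	int  next_delay_ms;	/* delay until the timer fires again */
	bool send_probe;	/* true if a keepalive probe is sent now */
};

/*
 * Model of the behavior described above: if the connection has not yet
 * been idle for the full idle time, reschedule for the remainder;
 * otherwise send a probe and wait one keepalive interval.
 */
static struct ka_result
keep_timer_fire(int now_ms, int rcvtime_ms, int keepidle_ms, int keepintvl_ms)
{
	struct ka_result r;
	int idle_ms = now_ms - rcvtime_ms;

	if (idle_ms < keepidle_ms) {
		/* Timer fired early: short reschedule, no probe. */
		r.next_delay_ms = keepidle_ms - idle_ms;
		r.send_probe = false;
	} else {
		r.next_delay_ms = keepintvl_ms;
		r.send_probe = true;
	}
	return (r);
}
```

With an assumed idle time of 7200000 ms and an ACK received 50 ms after session open, the first call at 7200000 ms yields a 50 ms reschedule with no probe, and only the second call at 7200050 ms sends the probe. That is two timer firings per probe.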
When you have a system with millions of persistently-idle connections, this can pose a real problem for the system, especially if the connections get somewhat synchronized.
This change does two things:
First, it moves the idle-time check into the block of code that runs when keepalives are enabled for a session. If keepalives aren't enabled, there is no reason to check whether the keepalive timer is early.
Second, it adds an allowed variance for the idle timer. If the keepalive timer finds that the idle timer has not quite expired, it will check if the remaining time until expiry is within the allowed variance. If so, it will send the first keepalive probe early. This avoids rescheduling the keepalive timer for a short delay.
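The two changes can be sketched together as a single decision function. This is a simplified illustration under stated assumptions, not the committed code: keep_should_probe and its parameters are hypothetical names, times are in milliseconds rather than ticks, and the variance value is whatever the caller passes in.

```c
#include <stdbool.h>

/*
 * Hypothetical helper, not kernel code. Returns true when the timer
 * handler should send a keepalive probe now.
 */
static bool
keep_should_probe(bool keepalive_enabled, int idle_ms, int keepidle_ms,
    int variance_ms)
{
	int remaining_ms;

	/* Change 1: no idle-time check at all when keepalives are off. */
	if (!keepalive_enabled)
		return (false);

	/*
	 * Change 2: probe early if the idle timer is within the allowed
	 * variance of expiring, instead of rescheduling for a short delay.
	 * A remaining time of zero or less means it has already expired.
	 */
	remaining_ms = keepidle_ms - idle_ms;
	return (remaining_ms <= variance_ms);
}
```

For example, with an assumed variance of 1000 ms, a connection that is 50 ms short of the idle time gets its first probe immediately, while one that is 200000 ms short would still be rescheduled.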