Page MenuHomeFreeBSD

Further reduce keepalive timer rescheduling
Needs ReviewPublic

Authored by jtl on Feb 15 2018, 7:43 PM.

Details

Reviewers
rrs
glebius
tuexen
Group Reviewers
transport
Summary

r307319 reduced the number of times the code rescheduled the keepalive timer. Rather than rescheduling the keepalive timer each time the code receives a packet, it lets the keepalive timer fire and reschedule itself. On a busy connection, this means the code should reschedule the keepalive timer approximately once each TP_KEEPIDLE ticks.

However, this change made things slightly worse for persistently-idle connections. In these cases, we can see this sequence of events:

  1. Keepalive timer set at session open.
  2. Remote side sends an ACK, increasing tp->t_rcvtime by X milliseconds.
  3. Keepalive timer fires. Because it is X milliseconds short of the idle time, it reschedules the keepalive timer for X milliseconds in the future.
  4. Keepalive timer fires again. It sends a keepalive probe.
  5. goto #2.

(Of course, the exact behavior depends on many factors, such as the configured idle time, configured interval, and how often the other side is doing keepalive probes. But, this seems to be the worst-case scneario.)

When you have a system with millions of persistently-idle connections, this can pose a real problem for the system, especially if the connections get somewhat synchronized.

This change does two things:

First, it moves the idle-time check into the block of code that runs when keepalives are enabled for a session. If keepalives aren't enabled, there is no reason to check whether the keepalive timer is early.

Second, it adds an allowed variance for the idle timer. If the keepalive timer finds that the idle timer has not quite expired, it will check if the remaining time until expiry is within the allowed variance. If so, it will send the first keepalive probe early. This avoids rescheduling the keepalive timer for a short delay.

Test Plan

Use dtrace and tcpdump to verify that the code is behaving as expected. (And, having done that, it appears to be behaving as expected.)

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 15070
Build 15165: arc lint + arc unit

Event Timeline

jtl created this revision.Feb 15 2018, 7:43 PM
bz added a subscriber: bz.Sep 26 2019, 7:44 PM

This may sound like a pain but as you already say in your message, this is two changes .. First .. Second ..
Can you split them up into such? First should be really easy to review and second should then be straight forward as well by itself.

Also I was initially wondering why you used 2% rather than say 2 * rtt or something. 2% can be rather big for really long timeouts? I guess the question is: do the X ms grow with the time delta or are they rather constant?

jtl added a comment.Sep 26 2019, 8:27 PM
In D14387#476034, @bz wrote:

This may sound like a pain but as you already say in your message, this is two changes .. First .. Second ..
Can you split them up into such? First should be really easy to review and second should then be straight forward as well by itself.

I guess so. Normally, I'm agreeable to these requests, but this is one place where it seems to make more sense to just merge the changes as the end product will look quite different than the initial code.

Also I was initially wondering why you used 2% rather than say 2 * rtt or something. 2% can be rather big for really long timeouts? I guess the question is: do the X ms grow with the time delta or are they rather constant?

My general reasoning for using a percent of the keepalive interval was to be more precise as the keepalive interval was lower. My assumption is that the length of a keepalive interval is inversely proportional to the user's sensitivity to detecting dropped sessions.

In other words, if someone uses a really large RTT (e.g. 24 hours), are they really going to be upset if it varies by 2% (28.8 minutes)? OTOH, someone with a really small (e.g. 5-second) keepalive interval may well care if the interval is off by 2 * RTT, assuming RTT is > 100 ms.

In the completely idle-session case, the delta between when the callout happens and when the keepalive interval expires is generally between 1 and 2 RTTs. However, RTT measurements for mostly-idle sessions may or may not be accurate. So, I'm not sure an RTT-based approach will always produce accurate results.

tuexen added inline comments.Sep 29 2019, 11:41 AM
sys/netinet/tcp_timer.c
141

I'm wondering if it would be better to use a SYSCTL_PROC where a function is used to enforce that tcp_keepslop is between 0 and 100. At least in some places we do input validation for sysctl-variables.

477

Line 477 is a duplicate of line 463. So I think line 477 can be removed.