Further reduce keepalive timer rescheduling
Needs ReviewPublic
Actions

Authored by jtl on Feb 15 2018, 7:43 PM.

Details

Reviewers

rrs
glebius
tuexen

Group Reviewers

transport

Summary

r307319 reduced the number of times the code rescheduled the keepalive timer. Rather than rescheduling the keepalive timer each time the code receives a packet, it lets the keepalive timer fire and reschedule itself. On a busy connection, this means the code should reschedule the keepalive timer approximately once each TP_KEEPIDLE ticks.

However, this change made things slightly worse for persistently-idle connections. In these cases, we can see this sequence of events:

Keepalive timer set at session open.
Remote side sends an ACK, increasing tp->t_rcvtime by X milliseconds.
Keepalive timer fires. Because it is X milliseconds short of the idle time, it reschedules the keepalive timer for X milliseconds in the future.
Keepalive timer fires again. It sends a keepalive probe.
goto #2.

(Of course, the exact behavior depends on many factors, such as the configured idle time, configured interval, and how often the other side is doing keepalive probes. But, this seems to be the worst-case scneario.)

When you have a system with millions of persistently-idle connections, this can pose a real problem for the system, especially if the connections get somewhat synchronized.

This change does two things:

First, it moves the idle-time check into the block of code that runs when keepalives are enabled for a session. If keepalives aren't enabled, there is no reason to check whether the keepalive timer is early.

Second, it adds an allowed variance for the idle timer. If the keepalive timer finds that the idle timer has not quite expired, it will check if the remaining time until expiry is within the allowed variance. If so, it will send the first keepalive probe early. This avoids rescheduling the keepalive timer for a short delay.

Test Plan

Use dtrace and tcpdump to verify that the code is behaving as expected. (And, having done that, it appears to be behaving as expected.)

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 15070
Build 15165: arc lint + arc unit

Event Timeline

jtl created this revision.Feb 15 2018, 7:43 PM

Herald added a reviewer: transport. · View Herald TranscriptFeb 15 2018, 7:43 PM

Herald added a subscriber: imp. · View Herald Transcript

Harbormaster completed remote builds in B15070: Diff 39349.Feb 15 2018, 7:43 PM

kbowling added a subscriber: kbowling.Feb 17 2018, 11:45 AM

This may sound like a pain but as you already say in your message, this is two changes .. First .. Second ..
Can you split them up into such? First should be really easy to review and second should then be straight forward as well by itself.

Also I was initially wondering why you used 2% rather than say 2 * rtt or something. 2% can be rather big for really long timeouts? I guess the question is: do the X ms grow with the time delta or are they rather constant?

In D14387#476034, @bz wrote:

This may sound like a pain but as you already say in your message, this is two changes .. First .. Second ..
Can you split them up into such? First should be really easy to review and second should then be straight forward as well by itself.

I guess so. Normally, I'm agreeable to these requests, but this is one place where it seems to make more sense to just merge the changes as the end product will look quite different than the initial code.

Also I was initially wondering why you used 2% rather than say 2 * rtt or something. 2% can be rather big for really long timeouts? I guess the question is: do the X ms grow with the time delta or are they rather constant?

My general reasoning for using a percent of the keepalive interval was to be more precise as the keepalive interval was lower. My assumption is that the length of a keepalive interval is inversely proportional to the user's sensitivity to detecting dropped sessions.

In other words, if someone uses a really large RTT (e.g. 24 hours), are they really going to be upset if it varies by 2% (28.8 minutes)? OTOH, someone with a really small (e.g. 5-second) keepalive interval may well care if the interval is off by 2 * RTT, assuming RTT is > 100 ms.

In the completely idle-session case, the delta between when the callout happens and when the keepalive interval expires is generally between 1 and 2 RTTs. However, RTT measurements for mostly-idle sessions may or may not be accurate. So, I'm not sure an RTT-based approach will always produce accurate results.

tuexen added a reviewer: tuexen.Sep 29 2019, 10:48 AM

tuexen added inline comments.Sep 29 2019, 11:41 AM

sys/netinet/tcp_timer.c
141	I'm wondering if it would be better to use a `SYSCTL_PROC` where a function is used to enforce that tcp_keepslop is between 0 and 100. At least in some places we do input validation for `sysctl`-variables.
477	Line 477 is a duplicate of line 463. So I think line 477 can be removed.