If the connection was persistent and receiving-only, several (12) sporadic device insufficient buffers would cause the connection be dropped prematurely: Upon ENOBUFS in tcp_output() for an ACK, retransmission timer is started. No one will stop this retransmission timer for receiving- only connection, so the retransmission timer promises to expire and t_rxtshift is promised to be increased. And t_rxtshift will not be reset to 0, since no RTT measurement will be done for receiving-only connection. If this receiving-only connection lived long enough (e.g. >350sec, given the RTO starts from 230ms), and it suffered 12 sporadic device insufficient buffers, i.e. t_rxtshift >= 12, this receiving-only connection would be dropped prematurely by the retransmission timer. We now assert that for data segments, SYNs or FINs either rexmit or persist timer was wired upon ENOBUFS. And don't set rexmit timer for other cases, i.e. ENOBUFS upon ACKs.
Details
Diff Detail
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
Doesn't tp->t_rxtshift get updated on a successful send? If not, I think that is what we should be fixing.
Personally, I think a connection should drop if we aren't able to send any ACKs for 350 seconds.
sys/netinet/tcp_output.c | ||
---|---|---|
1558 | You already own a lock on tp. No need to do atomic operations. |
Of course, tp->t_rxtshift doesn't get updated on a successful send. That's not the way this works. :-)
Now that I've had more coffee...
This probably amounts to the generic problem of:
- When we send data, we should be setting the retransmit timer to make sure we retransmit.
- When we are not sending data, we should be setting a different timer (DELACK, perhaps?) to make sure we send an ACK, when able.
The key feature that makes the retransmit timer inappropriate for an ACK-only case is that it is only stopped when we receive input; however, in the ACK-only case, we really want to stop it as soon as we transmit a successful ACK.
Of course, we could just drop the ACK and everything would "just work". But, it probably is still a good idea to try to re-transmit the ACK.
The key feature that makes the retransmit timer inappropriate for an ACK-only case is that it is only stopped when we receive input; however, in the ACK-only case, we really want to stop it as soon as we transmit a successful ACK.
Indeed. I guess we want to treat internal insufficient memory error with retransmit timer remedy. One would also argue that do you really want to go on when you failed to respond (with the ACK) for these many times. Don't you have bigger problems by now?
Of course, we could just drop the ACK and everything would "just work". But, it probably is still a good idea to try to re-transmit the ACK.
I am not opposed to the suggested patch but its just...weird. (Also if its not obvious, I don't have a better solution to present. :-))
I believe that the original code is wrong, and the change is not sufficient
to fix it. The retransmit timer should be running if and only if we have
sent data into the receive window and are awaiting an ACK. The persist
timer should be running if and only if the retransmit timer is not running,
and we have data to send that do not reasonably fit in the window. If we
get an ENOBUFS when sending data, we will already be running the retransmit
timer. If we drop an ACK on ENOBUFS, either we will receive more data and
attempt another ACK, or the sender will time out and resend data. Either
will get the connection started again. I believe lines 1552-1554 should
simply be deleted.
I think something like this code will get you closer to what you want:
Index: sys/netinet/tcp_output.c =================================================================== --- sys/netinet/tcp_output.c (revision 298090) +++ sys/netinet/tcp_output.c (working copy) @@ -1545,10 +1545,16 @@ tp->t_softerror = error; return (error); case ENOBUFS: - if (!tcp_timer_active(tp, TT_REXMT) && - !tcp_timer_active(tp, TT_PERSIST)) - tcp_timer_activate(tp, TT_REXMT, tp->t_rxtcur); - tp->snd_cwnd = tp->t_maxseg; + if (len > 0 || (flags & (TH_SYN|TH_FIN))) { + if (!tcp_timer_active(tp, TT_REXMT) && + !tcp_timer_active(tp, TT_PERSIST)) + tcp_timer_activate(tp, TT_REXMT, + tp->t_rxtcur); + tp->snd_cwnd = tp->t_maxseg; + } else if (!tcp_timer_active(tp, TT_DELACK) && + (flags & TH_RST) == 0) + tcp_timer_activate(tp, TT_DELACK, + tcp_delacktime); return (0); case EMSGSIZE: /*
I think this code gets you closer to what you want.
I agree that the present code seems wrong. Using the retransmit timers for ACKs seems inappropriate. But, I am generally concerned about how the system responds when loaded. I'm not sure its a good thing to add more work to a system that is already overloaded. Things will automatically fix themselves in the presence of lost ACKs. The reason to retransmit ACKs is simply to prevent unnecessary retransmissions and the latency that comes with waiting for the retransmission to timeout. But, on an overloaded system, that may or may not be a good thing.
Good point, but see below.
If we drop an ACK on ENOBUFS, either we will receive more data and
attempt another ACK, or the sender will time out and resend data. Either
will get the connection started again.
True, but it may take time if we have to wait for a retransmission time out. That isn't necessarily a bad thing.
I believe lines 1552-1554 should
simply be deleted.
On the surface, that seems like a reasonable alternative. However, its interesting to see why this code was originally added. It looks like it was added by jlemon in r61179. Here's what the commit message says:
When attempting to transmit a packet, if the system fails to allocate
a mbuf, it may return without setting any timers. If no more data is
scheduled to be transmitted (this was a FIN) the system will sit in
LAST_ACK state forever.Thus, when mbuf allocation fails, set the retransmit timer if neither
the retransmit or persist timer is already pending.
Even back then, the retransmit and/or persist timers should have been running by this point in the code. It would be interesting to know if jlemon proved this code fixed the problem, or it was just speculated. We fixed a similar-sounding problem last year, and the code is different now, so these particular lines may truly no longer be necessary. But, I am a little hesitant to remove this code without knowing more about its origin.
Just my 2c.
But setting retransmission timer on an ACK seems... wrong.
I believe lines 1552-1554 should
simply be deleted.On the surface, that seems like a reasonable alternative. However, its interesting to see why this code was originally added. It looks like it was added by jlemon in r61179. Here's what the commit message says:
When attempting to transmit a packet, if the system fails to allocate
a mbuf, it may return without setting any timers. If no more data is
scheduled to be transmitted (this was a FIN) the system will sit in
LAST_ACK state forever.Thus, when mbuf allocation fails, set the retransmit timer if neither
the retransmit or persist timer is already pending.Even back then, the retransmit and/or persist timers should have been running by this point in the code. It would be interesting to know if jlemon proved this code fixed the problem, or it was just speculated. We fixed a similar-sounding problem last year, and the code is different now, so these particular lines may truly no longer be necessary. But, I am a little hesitant to remove this code without knowing more about its origin.
Just my 2c.
I guess this boils down to how you want to "deal" with this (possibly) transient memory error.
I'd like to go with @mike-karels.net's suggestion as when we reach to this point in code, I don't see how timers couldn't have been set already. And I doubt we can get any more info from that old commit :/
But because this is (sigh..) tcp and if we want to be really cautious, I like @jtl 's patch of handling this for 'acks' with delack. (In that patch, I am not sure why we want to drop cwnd down to 1mss only in case of data and not for acks. I think we should do that for both?)
(fwiw, other BSDs don't have this special handling which aligns with what Mike is proposing.)
Setting a retransmission timer on an ACK makes no sense; I don't think tcp_output will send an ACK on a retransmission timeout.
Setting timers in the ENOBUFS case is at best a partial fix. If the ACK is lost locally, we know; if it is lost elsewhere, we don't. We need timers in any case.
Setting a local timer for the ACK case is no better for latency than having the other end run a retransmit timer..
If there is a problem with setting the retransmit timer for a FIN, let's fix that. Otherwise, I stand by my recommendation of deleting the code to set a timer.
It's t_shift >= 12 killing the connection, since no one actually stops the timer/performs RTT update for ACK if ENOBUFS happens on a receiving only connections. So intermittent (not continuous) failures to transmit ACK due to ENOBUFS will kill a long living receiving only connection eventually.
As I said, it's _not_ continuous loosing of ACK due to ENOBUFS, but _intermittent_ (e.g. you lost 12 ACKs due to ENOBUFS in 2~3 hours) losing ACKs due to ENOBUFS is sufficient to kill a a long living receiving only connection.
Of course, we could just drop the ACK and everything would "just work". But, it probably is still a good idea to try to re-transmit the ACK.
I am not opposed to the suggested patch but its just...weird. (Also if its not obvious, I don't have a better solution to present. :-))
I agree with Mike's proposal (although FYI, I do belive tcp_output() will send an ACK on RTO). TCP ACKs are intentionally unreliable by design and setting the retransmit timer there is nonsense - either there is a bug elsewhere which needs to be fixed, or it is trying to paper over local ACK loss in a dubious manner. The ENOBUFS case should also become a thing of the past when the back pressure work goes in any way. For the immediate change, perhaps replacing with a macro that expands to a KASSERT to double check the appropriate conditions for the retransmit or persist timers being set would be a good idea. The macro should be used elsewhere in tcp_output() and tcp_intput() as well but that can be done in follow up commit(s).
Let's keep this moving along. Mike isn't (yet) a committer but if someone can commit this once everyone agrees that would be great.
FWIW, I agree with deleting the ENOBUFs special-case. If we haven't already set the right timers by here, we have another bug which needs to be fixed.
... but add a macro to check that the rexmit/persist timer is armed if appropriate! Should be added higher up though so that it is checked before all return statements in the vicinity.
btw, I think the line to set the snd_cwnd should remain for now, until something replaces it. ENOBUFS signals local congestion.
I thought that had been fixed ages ago... oops. It should be calling cc_cong_signal() with a new congestion type. Just leave that line as is for the moment though as Mike says.
Fixed? i.e. doing something other than setting cwnd to 1 seg?
It should be calling cc_cong_signal() with a new congestion type.
Hum... tcp_quench() used to be there which essentially had this 1 line to set cwnd to 1 seg.
Is there any (RFC) guidance for what to do in this situation?
Yes, but turns out it was a discussion I had privately with a colleague who never got around to creating the patch we discussed.
It should be calling cc_cong_signal() with a new congestion type.
Hum... tcp_quench() used to be there which essentially had this 1 line to set cwnd to 1 seg.
Is there any (RFC) guidance for what to do in this situation?
No, and it's an implementation detail that RFCs have no real business being concerned with either. Setting cwnd==maxseg is completely inappropriate though and I would argue that whatever reasoning was used to justify the original choice is as wrong today as it was back then. At any rate, that's something to follow up separately.
Well, resetting the cwnd can be unfair for the connections encounters the ENOBUFS, since the underlying hardware can be flooded by other connections. So I'd remove that cwnd resetting upon ENOBUFS. We can open another discussion after fixing this issue first.
I disagree; congestion is congestion, not "congestion for everyone but me". I'd prefer to leave the cwnd change until it is replaced by something more modern.
We probably can leave the cwnd resetting to later rexmt timeout or possible later fast retransmit (I think fast retransmit could kick in under some cases, if ENOBUFS happened); instead of resetting the cwnd immediately upon ENOBUFS.
sys/netinet/tcp_output.c | ||
---|---|---|
1551 | In my opinion, this does not need to be a panic. A KASSERT() should be sufficient. Also, this is not the re-usable macro which Lawrence suggested. |
sys/netinet/tcp_output.c | ||
---|---|---|
1551 | It's what I am testing w/, since I need non-INVARIANT kernel to generate enough traffic to make ENOBUFS happen. We can change it into KASSERT, however, I am wondering should we just use "if (__predict_false(...)) panic" here? Missing timeout here can obviously causing issues for data/SYN/FIN. I will wrap this code segment into a macro (change it into KASSERT is fine w/ me) after more testing. I think the code segment is fine now; we just need to inverse the logic, if we choose to use KASSERT. |
Please leave the manipulation of cwnd as is so as to avoid conflating two different changes. The manipulation of cwnd on local drop has nothing to do with the subject of this particular change.
Also, the patch we're reviewing here should be the commit candidate i.e. what will actually land in the tree. I understand that you need to test with something different than the commit candidate, but that should just be noted in the review description. We all need to see the final code as there are many subtleties here that require close attention from as many sets of eyes as possible.
Yep, I am not going to delete the cwnd reset in this patch.
Also, the patch we're reviewing here should be the commit candidate i.e. what will actually land in the tree. I understand that you need to test with something different than the commit candidate, but that should just be noted in the review description. We all need to see the final code as there are many subtleties here that require close attention from as many sets of eyes as possible.
This is what I am testing now. Since this path is not on a hot code path and we obviously don't want to have timers unset for data/FIN/SYN, can we just use the "if() panic" here?
And another thing is abstract it as a macro. Maybe we just leave the macro abstraction to the next step?
So, if both of following is true:
- We agree to use "if() panic" here, instead of KASSERT
- Do the macro abstraction in the next step
Then the current patch is exactly what I want to put into the tree.
errr, not sure if we're on the same page or not, but I am in strong agreement with Mike. Please leave the line "tp->snd_cwnd = tp->t_maxseg;" where it is and untouched i.e. don't delete or change it.
Also, the patch we're reviewing here should be the commit candidate i.e. what will actually land in the tree. I understand that you need to test with something different than the commit candidate, but that should just be noted in the review description. We all need to see the final code as there are many subtleties here that require close attention from as many sets of eyes as possible.
This is what I am testing now. Since this path is not on a hot code path and we obviously don't want to have timers unset for data/FIN/SYN, can we just use the "if() panic" here?
No. This isn't a condition we should bring the whole machine down for. A rate limited log message would be appropriate so the admin knows there's a potential for hung connections, but I don't insist on this.
And another thing is abstract it as a macro. Maybe we just leave the macro abstraction to the next step?
I don't see any good reason not to define and use the macro as part of this patch. It can then be used in follow up commits to check the other return points from the input and output processing paths.
OK, I see, then I will post a new patch then, but after finishing testing the current one (as I said, I need to use non-INVARIANT kernel to trigger it).
Thanks for the clarification.