Aug 27 2015
Aug 24 2015
This change looks good to me.
For the record, below are a few commits related to this change:
Aug 18 2015
As I wrote before, this race condition was driven by an issue in callout_stop() that is now fixed in rS286880: callout_stop() should return 0 (fail) when the callout is currently (D3078). I will revert rS284245 as it is no longer needed.
Rebase on top of r286874.
Updating D3078: callout_stop() should return 0 when the callout is currently being serviced and
indeed unstoppable.
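The return-value rule named in that revision title can be illustrated with a small userspace model. This is only a sketch: `model_callout`, `model_state`, and `model_callout_stop()` are hypothetical names invented for illustration, not the kernel's `struct callout` or its real `callout_stop()` implementation, and the model ignores locking entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's struct callout. */
enum model_state {
	MODEL_IDLE,		/* not scheduled */
	MODEL_PENDING,		/* scheduled, handler not started yet */
	MODEL_SERVICING		/* handler is executing right now */
};

struct model_callout {
	enum model_state state;
};

/*
 * Returns 1 only when a pending callout was removed before its handler
 * started. Returns 0 when there was nothing to stop, or when the handler
 * is already being serviced and therefore cannot be stopped.
 */
static int
model_callout_stop(struct model_callout *c)
{
	assert(c != NULL);
	if (c->state == MODEL_PENDING) {
		c->state = MODEL_IDLE;
		return (1);
	}
	return (0);
}
```

With this rule, a caller that sees 0 cannot assume the handler will never touch its argument again, which is the property the TCP timer teardown code relies on.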
Aug 8 2015
Aug 3 2015
Follow jhb's idea: Use 'not_running' instead of 'running'.
Updating D3078: callout_stop() should return 0 when the callout is currently being serviced and
indeed unstoppable.
Aug 2 2015
In D3078#66062, @jhb wrote: This must be a recent regression? The old code definitely checked for this case. For example, in stable/9:
if (!(c->c_flags & CALLOUT_PENDING)) { ...
Jul 31 2015
I will push this change by the end of next week, so if you need more time please scream. As usual, comments are more than welcome even after the commit. Thanks.
Jul 30 2015
As this change is quite stable and I have addressed all the review comments, I plan to push it by the end of this week. As usual, please scream if you have something more to add. Comments are still welcome here even after this change has been pushed. Thanks all for your time.
[tcp-scale]: Rebase on HEAD r286066
Jul 14 2015
No need anymore to upgrade to INP_INFO_RLOCK/INP_WLOCK
state in tcp_timer_rexmt(), we are already in this state.
Adding jhb's suggested comment in syncache_socket().
[tcp-scale]: Add comment proposed by jhb about having two inps locked at
same time without the exclusive INP_INFO lock.
Rebase on top of r285351.
Sorry @rrs, you are the latest person to have made big changes to callout, so I picked you first for this review. Tell me whether or not you have time for it.
Jun 29 2015
I reviewed this patch as part of:
Jun 19 2015
In D2599#55428, @mat wrote: Ooops, sorry, I was trying to remove myself from the subscribers here :-/
Jun 18 2015
In D2079#49517, @lstewart wrote: Leaving aside D2599 for the moment (which looks like good work and I will indeed take a look at it in detail - please include me on reviews for any TCP related work. I don't always get time to give them attention in the review window, but being aware of the work is very useful), I'm still not clear why tcp_drop(), and therefore the timers which call it, need the info lock in the new world order (in fact, I think my confusion also applies to the old world order. I was thinking that taking the reference on the inpcb in tcp_newtcpcb() means you now control when the inpcb can be GCed with respect to the timers executing which should allow simplification of the locking in the timers. It may even be the case that the reference you hold is irrelevant to the following thoughts...)
Jun 17 2015
Hi guys, below is a quick update:
Jun 13 2015
Comment improvement from jhb
The fact that callout_stop() can return 1 (i.e. callout successfully stopped) when this exact callout is just about to be run can be seen as a bug (/feature). Marc proposed a fix to me for this callout bug (/feature) and will ask @rrs if it deserves to be fixed (/documented). Thanks again for your input, review, and testing.
The race condition introduced with this change has been fixed as part of D2763: Fix a callout race condition introduced in TCP timers callouts with r281599. in HEAD and STABLE-10.
Jun 12 2015
[tcp-scale]: Use INP_INFO_RLOCK in tcp_timer_discard()
Improve INP_INFO_LOCK assertions in cxgb/cxgbe tom
Jun 11 2015
Rebased on svn path=/head/; revision=284266
Patch pushed in both HEAD and 10-STABLE. And it is not too late for comments on this review, it is never too late for improvements. Thanks all for your time.
Jun 10 2015
In D2763#52995, @nitroboost-gmail.com wrote: So far this is looking solid for us. Both with defaults and lowered keep alives on the same traffic patterns that caused the cores prior. Running with net.inet.tcp.per_cpu_timers = 1
Add D2763 as dependency
Rebased change on r284151
Jun 9 2015
Jun 8 2015
Here
Just for the record, below is how I got details on this issue:
In D2079#52025, @jch wrote: In D2079#52021, @jch wrote: In D2079#51973, @lstewart wrote: Yes, lowering the keepalive timer was how I was triggering this more quickly during investigation as with our default it took days at high load to trigger. I've also analysed a core dump with the tp in t_state 0, so it's not specific to TIMEWAIT either. I think I might know what's going on but will hopefully confirm my findings later today.
Interesting. On my side I finally reproduced your exact issue:
panic: tcp_timer_keep: tp 0xfffff804210fc418 tp->t_inpcb == NULL
I just added debugging code to get a better view of the context (see below). It appears that:
- The TCP keep-alive timer was running
- callout_stop(TT_KEEP) returned successfully
- As no TCP callouts were apparently running, tcp_discardcb() decided to free the tcpcb directly
- Crash, because a TT_KEEP callout was in fact still running and was called afterward
I am digging into this scenario...
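The sequence above comes down to freeing the control block while a timer handler is still in flight. One way to picture a safe teardown is to defer the free until the last in-flight handler finishes. The sketch below is a single-threaded userspace model under that assumption: `model_tcpcb`, `model_discard()`, and `model_timer_done()` are hypothetical names, and this is not the code committed in D2763 or the real tcp_discardcb(). Real kernel code would also need the appropriate locks around these counters.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a TCP control block; not struct tcpcb. */
struct model_tcpcb {
	int timers_in_flight;	/* handlers started but not yet finished */
	int discard_requested;	/* teardown has begun */
};

/*
 * Begin teardown. Returns 1 if the block was freed immediately,
 * 0 if the free was deferred because a handler is still running.
 */
static int
model_discard(struct model_tcpcb *tp)
{
	assert(tp != NULL);
	tp->discard_requested = 1;
	if (tp->timers_in_flight > 0)
		return (0);	/* the last handler will free it */
	free(tp);
	return (1);
}

/*
 * Called by a handler when it finishes. Returns 1 if this call freed
 * the block (last handler out after teardown began), 0 otherwise.
 */
static int
model_timer_done(struct model_tcpcb *tp)
{
	assert(tp != NULL && tp->timers_in_flight > 0);
	tp->timers_in_flight--;
	if (tp->discard_requested && tp->timers_in_flight == 0) {
		free(tp);
		return (1);
	}
	return (0);
}
```

The crash described above corresponds to taking the immediate-free path while `timers_in_flight` is wrongly believed to be zero, which is what a stop call reporting success for a running callout leads to.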
Jun 5 2015
In D2079#52021, @jch wrote: In D2079#51973, @lstewart wrote: Yes, lowering the keepalive timer was how I was triggering this more quickly during investigation as with our default it took days at high load to trigger. I've also analysed a core dump with the tp in t_state 0, so it's not specific to TIMEWAIT either. I think I might know what's going on but will hopefully confirm my findings later today.
Interesting. On my side I finally reproduced your exact issue:
panic: tcp_timer_keep: tp 0xfffff804210fc418 tp->t_inpcb == NULL
I just added debugging code to get a better view of the context (see below). It appears that:
- The TCP keep-alive timer was running
- callout_stop(TT_KEEP) returned successfully
- As no TCP callouts were apparently running, tcp_discardcb() decided to free the tcpcb directly
- Crash, because a TT_KEEP callout was in fact still running and was called afterward
I am digging into this scenario...
In D2079#51973, @lstewart wrote: Yes, lowering the keepalive timer was how I was triggering this more quickly during investigation as with our default it took days at high load to trigger. I've also analysed a core dump with the tp in t_state 0, so it's not specific to TIMEWAIT either. I think I might know what's going on but will hopefully confirm my findings later today.
Jun 4 2015
I might have found a way to reproduce this issue: Set the TCP keep-alive timers very low:
Jun 3 2015
In D2079#51293, @lstewart wrote: Randall accidentally misspoke. We're seeing tcp_timer_keep() fire with a tp in TIMEWAIT and t_inpcb==NULL. The rest of the tp looks sane indicating it hasn't been GCed. I'm still trying to understand how this is possible as the code looks correct to me, but I'm continuing to dig...
Jun 1 2015
In D2079#50233, @rrs wrote: We don't use TOE (we use LRO though). The panics we have are the persist timer. Lawrence
has an idea though and is investigating that. Maybe he can turn something up.
The problem of course is it happens in production after hours of running at full load.. so
no we have not done an INVARIANT run. Lets see what Lawrence turns up. We will
probably hold off merging this to our next release until we can get the issue resolved ;-)
May 29 2015
May 28 2015
In D2079#50069, @rrs wrote: We have put these changes into our NF caches, and we now are seeing
crashes that all relate to the removal of
    if (inp == NULL) {
        // count race
        return
    }
We have several crashes under load with this, so it appears there
is some un-thought out issue with this.
I believe we will have to at least put the inp == NULL check back in
for our purposes, but someone may want to take a look at this
and see why its happening..
(note we don't get the kassert since we don't have INVARIANT compiled
in we just get a crash in the inp lock :-o
Fixed all @jhb comments (so far). Thanks for your time.
[tcp-scale]: Fix jhb's comment on ipi_gencnt/ipi_count access:
[tcp-scale]: Fix jhb's comment on syncache_expand() comments.
[tcp-scale]: Apply jhb's review comments on code comments.
May 26 2015
In D2079#49517, @lstewart wrote: ...
Let's talk through tcp_timer_persist() which calls tcp_drop(). First point - as I understand things, given the ref taken in tcp_newtcpcb(), we know that any call to in_pcbrele*() from functions called by the timer will not GC the inpcb. So we call: tp = tcp_drop(tp, ETIMEDOUT); ...
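The reference-counting argument in that first point can be pictured with a tiny userspace model. This is a sketch only: `model_inpcb`, `model_ref()`, and `model_rele()` are hypothetical names, not the kernel's struct inpcb or the in_pcbref()/in_pcbrele*() functions, and the model has no locking. It shows only that a release cannot free the object while another reference is still held.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a reference-counted inpcb. */
struct model_inpcb {
	unsigned int refcount;
};

/* Allocate with one reference owned by the creator. */
static struct model_inpcb *
model_inpcb_alloc(void)
{
	struct model_inpcb *inp = calloc(1, sizeof(*inp));

	if (inp != NULL)
		inp->refcount = 1;
	return (inp);
}

/* Take an extra reference. */
static void
model_ref(struct model_inpcb *inp)
{
	assert(inp != NULL && inp->refcount > 0);
	inp->refcount++;
}

/*
 * Drop one reference. Returns 1 if this was the last reference and the
 * object was freed, 0 if other references keep it alive.
 */
static int
model_rele(struct model_inpcb *inp)
{
	assert(inp != NULL && inp->refcount > 0);
	if (--inp->refcount == 0) {
		free(inp);
		return (1);
	}
	return (0);
}
```

In this model, a release performed from a timer path returns 0 as long as the extra reference (the one the quoted comment attributes to tcp_newtcpcb()) is still held, so the object stays valid for the rest of the handler.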
May 22 2015
In D2079#48600, @lstewart wrote: Sorry for coming very late to the party and I realise you've already committed the changes, but thought I'd ask my question here so that all the context relating to this work is in one place...
In the new world order with your changes, I'm a little unclear about the need for the INP_INFO_LOCK in any of the TCP timer code. Can you please comment on if the lock is needed or not, and if it is, help me understand why?
May 20 2015
May 15 2015
Change MFC-ed in stable/10 here rS282968: MFC r279821:. Closing this revision.
Apr 16 2015
I believe D2079: Fix TCP timers use-after-free old race conditions fixes the same use-after-free race condition as this patch. The differences are:
- It uses the old callout API only (no callout_drain_async())
- It does not use inp_lock to protect callouts to avoid the INP_INFO_WLOCK/INP_WLOCK LOR management burden
Review closed with rS281599: Fix an old and well-documented use-after-free race condition in commit.
Expand tcp_timer_stop() comment based on jhb review
Apr 15 2015
Apr 13 2015
Rebase patch on top of r281483
Apr 8 2015
Apr 2 2015
Rebase on top of r280990
Fixed with revision rS280990: Provide better debugging information in tcp_timer_activate() and.
Mar 31 2015
Fix bz's comment on sys/netinet/tcp_timer.c (7 and final step)
Just a minor change to improve legacy code at the same time as the D2079: Fix TCP timers use-after-free old race conditions review.