Details

Reviewers

jhb
jch
lstewart
gnn
bz
adrian
rwatson
brahma.gdb_gmail.com

Summary

Fix TCP timers use-after-free race conditions:

Add a reference of tcpcb to its inpcb
If needed, defer tcpcb deletion until TCP timers callouts have finished

Notes:

This patch changes the ABI of struct tcpcb_mem, thus it changes struct tcpcb
This is a fix for an old race condition found by Robert Watson (but without calling callout_drain())
Its follows Robert Watson's advices to fix it
In FreeBSD 10.x size of struct tcpcb_mem is 1024 bytes, in -CURRENT it is 1040, this patch add 8 bytes for a total of 1048 bytes

Test Plan

Tested with net.inet.tcp.per_cpu_timers set to values 0 and 1.

Tested on top of recent callout changes introduced by @rrs a.k.a.:

Tested with our TCP QA tests.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

No Lint Coverage

Unit

No Test Coverage

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

jch added inline comments.Mar 30 2015, 9:24 PM

sys/netinet/tcp_subr.c
807	Right, fixed.
810	I would even prefer: tp->t_inpcb = in_pcbref(inp); But that's a change for later. Fixed.
938–940	Right, much better. Fixed.
1026	Indeed, fixed.
1031	Good proposition, fixed.
1078	All `"tcp_timer_discard:"` replaced by `"%s:...", __func__`, fixed.
1088	Right, fixed.
sys/netinet/tcp_timer.c
46	refcount is actually not needed, removing it.
273	Right, I didn't use the `%s, __func__` pattern in first place because previous `tcp_timer.c` developers never used it: `$ grep __func__ sys/netinet/tcp_timer.c` `$` but at same time, I don't see why not using it now. Fixed.
287	Good proposition, fixed.
312	Fixed.
329	Fixed.
383	Fixed.
399	Fixed.
478	More fixed.
493–497	More fixed.
559	Hi, fixed.
576	Fixed (again).
856	Right, fixed.
890	Right, fixed.
919	Good idea. The same comment applies to the above and older `tcp_timer_active()` function, I will fix it in a separate patch. Fixed.

Rebase patch proposition on r280904

jch added a parent revision: D2179: Provide better debugging information in tcp_timer_activate() and tcp_timer_active().Mar 31 2015, 12:26 PM

Fix bz's comment on sys/netinet/tcp_timer.c (7 and final step)

I rebased this patch proposition on top of rS280904, especially to test this patch on top of recent callout changes introduced by @rrs with:

And all our TCP tests passed (good).

Thus this patch is ready for more in depth functionality/logic review (I believe from @rrs and @bz).

sys/netinet/tcp_timer.c
810	D2154: Use appropriate timeout_t* instead of void* in tcp_timer_activate() has been accepted and pushed in rS280904: Use appropriate timeout_t* instead of void* in tcp_timer_activate(), and I have rebased the current patch proposition on top of it.

Rebase on top of r280990

Legacy code has been improved according to review comments with D2179 and D2154. Patch ready for functionality pass.

adrian accepted this revision.Apr 2 2015, 9:29 PM

adrian edited edge metadata.

jch edited the test plan for this revision. (Show Details)Apr 8 2015, 7:04 AM

jch edited edge metadata.

I plan to propose this change to my mentor (@jhb) by the end of this week, thus @rrs and @bz please scream if you would like more time to study the functional side. (I was indeed expecting more questions/comments on this part).

Anyway, thanks all for your time reviewing this change.

Rebase patch on top of r281483

You can mark me as reviewed by, not just approved by for this as well.

sys/netinet/tcp_timer.c
928	I re-confused myself doing a final review. Perhaps expand this comment to explicitly note that the callout currently running will see callout_pending() false and return without doing anything after it gets the INP_WLOCK()?

jch added inline comments.Apr 15 2015, 2:47 PM

sys/netinet/tcp_timer.c
928	Good idea, I will expand this comment describing how it uses the classical callout_pending() test pattern.

Expand tcp_timer_stop() comment based on jhb review

sys/netinet/tcp_timer.c
928	Comment expanded.

Review closed with rS281599: Fix an old and well-documented use-after-free race condition in commit.

jhb accepted this revision.Apr 16 2015, 1:21 PM

jhb edited edge metadata.

jch accepted this revision.Apr 16 2015, 1:50 PM

jch added a reviewer: jch.

Post-commit accept, just so that this can be closed, stupid PB. Did not review the last revision for technical correctness anymore but was happy to see previous comments being addressed :-)

This revision is now accepted and ready to land.Apr 16 2015, 2:43 PM

jch closed this revision.Apr 16 2015, 2:49 PM

Sorry for coming very late to the party and I realise you've already committed the changes, but thought I'd ask my question here so that all the context relating to this work is in one place...

In the new world order with your changes, I'm a little unclear about the need for the INP_INFO_LOCK in any of the TCP timer code. Can you please comment on if the lock is needed or not, and if it is, help me understand why?

In D2079#48600, @lstewart wrote:

Sorry for coming very late to the party and I realise you've already committed the changes, but thought I'd ask my question here so that all the context relating to this work is in one place...

In the new world order with your changes, I'm a little unclear about the need for the INP_INFO_LOCK in any of the TCP timer code. Can you please comment on if the lock is needed or not, and if it is, help me understand why?

Hi @lstewart, no problem, it is never too late for comments. However you question on when INP_INFO lock is required is tricky:

Short answer: In an ideal world INP_INFO lock should protect only:

inp global counters
inp global list (that includes inp allocation (i.e. insertion in the list) and inp destruction (i.e. removal from the list)
inp global list stability when walking the full list

In tcp timer context, that means INP_INFO lock is required if the inp can be destroyed. For example as tcp_timer_delack() can't ever destroy its inp it is not required, and as tcp_timer_keep() can destroy it (via tcp_drop()) it is required.

Longer answer:

Checking that inp is never ever destroyed is a necessary and not sufficient condition to remove a INP_INFO lock. As INP_INFO is a global lock for the whole TCP stack, over the history it was used for other purposes. Below, examples of INP_INFO lock extra usages we found and fixed so far:

Thus removing a INP_INFO needs extra careful checks to see what are side effects of the current protection and if it is ok to remove them.

Note:
You might be interested by our current proposal to decompose the INP_INFO lock D2599: Decompose TCP INP_INFO lock to increase short-lived connections scalability even further. With this change applied, the only remaining duty of INP_INFO is:

insure inp global list stability when walking the full list

My 2 cents,

I'll prefix this by saying I'm not well versed in the finer points of PCBs and associated locking, and the locking guide in in_pcb.h is somewhat unclear on a few things to my mind. Apologies if this is all super obvious to others.

Leaving aside D2599 for the moment (which looks like good work and I will indeed take a look at it in detail - please include me on reviews for any TCP related work. I don't always get time to give them attention in the review window, but being aware of the work is very useful), I'm still not clear why tcp_drop(), and therefore the timers which call it, need the info lock in the new world order (in fact, I think my confusion also applies to the old world order. I was thinking that taking the reference on the inpcb in tcp_newtcpcb() means you now control when the inpcb can be GCed with respect to the timers executing which should allow simplification of the locking in the timers. It may even be the case that the reference you hold is irrelevant to the following thoughts...)

Let's talk through tcp_timer_persist() which calls tcp_drop(). First point - as I understand things, given the ref taken in tcp_newtcpcb(), we know that any call to in_pcbrele*() from functions called by the timer will not GC the inpcb. So we call:

tp = tcp_drop(tp, ETIMEDOUT);

which looks like:

struct tcpcb *
tcp_drop(struct tcpcb *tp, int errno)
{
        struct socket *so = tp->t_inpcb->inp_socket;

        INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
        INP_WLOCK_ASSERT(tp->t_inpcb);

        if (TCPS_HAVERCVDSYN(tp->t_state)) {
                tcp_state_change(tp, TCPS_CLOSED);
                (void) tcp_output(tp);
                TCPSTAT_INC(tcps_drops);
        } else
                TCPSTAT_INC(tcps_conndrops);
        if (errno == ETIMEDOUT && tp->t_softerror)
                errno = tp->t_softerror;
        so->so_error = errno;
        return (tcp_close(tp));
}

It asserts the INFO lock, but I don't see any aspect of the code which needs the lock, so I believe it's asserting on behalf of tcp_close() which looks like:

`
struct tcpcb *
tcp_close(struct tcpcb *tp)
{
        struct inpcb *inp = tp->t_inpcb;
        struct socket *so;

        INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
        INP_WLOCK_ASSERT(inp);

#ifdef TCP_OFFLOAD
        if (tp->t_state == TCPS_LISTEN)
                tcp_offload_listen_stop(tp);
#endif
        in_pcbdrop(inp);
        TCPSTAT_INC(tcps_closed);
        KASSERT(inp->inp_socket != NULL, ("tcp_close: inp_socket NULL"));
        so = inp->inp_socket;
        soisdisconnected(so);
        if (inp->inp_flags & INP_SOCKREF) {
                KASSERT(so->so_state & SS_PROTOREF,
                    ("tcp_close: !SS_PROTOREF"));
                inp->inp_flags &= ~INP_SOCKREF;
                INP_WUNLOCK(inp);
                ACCEPT_LOCK();
                SOCK_LOCK(so);
                so->so_state &= ~SS_PROTOREF;
                sofree(so);
                return (NULL);
        }
        return (tp);
}

tcp_close() also asserts the INFO lock, but I don't see any aspect of the code which needs the lock, so it seems to be asserting purely on behalf of in_pcbdrop(), which looks like:

void
in_pcbdrop(struct inpcb *inp)
{

        INP_WLOCK_ASSERT(inp);

        /*
         * XXXRW: Possibly we should protect the setting of INP_DROPPED with
         * the hash lock...?
         */
        inp->inp_flags |= INP_DROPPED;
        if (inp->inp_flags & INP_INHASHLIST) {
                struct inpcbport *phd = inp->inp_phd;

                INP_HASH_WLOCK(inp->inp_pcbinfo);
                LIST_REMOVE(inp, inp_hash);
                LIST_REMOVE(inp, inp_portlist);
                if (LIST_FIRST(&phd->phd_pcblist) == NULL) {
                        LIST_REMOVE(phd, phd_hash);
                        free(phd, M_PCB);
                }
                INP_HASH_WUNLOCK(inp->inp_pcbinfo);
                inp->inp_flags &= ~INP_INHASHLIST;
#ifdef PCBGROUP
                in_pcbgroup_remove(inp);
#endif
        }
}

First point of interest - the INFO lock is not asserted. Should it be?

Let's look at the locking guide in in_pcb.h:

/*-
 * struct inpcb captures the network layer state for TCP, UDP, and raw IPv4
 * and IPv6 sockets.  In the case of TCP, further per-connection state is
 * hung off of inp_ppcb most of the time.  Almost all fields of struct inpcb
 * are static after creation or protected by a per-inpcb rwlock, inp_lock.  A
 * few fields also require the global pcbinfo lock for the inpcb to be held,
 * when modified, such as the global connection lists and hashes, as well as
 * binding information (which affects which hash a connection is on).  This
 * model means that connections can be looked up without holding the
 * per-connection lock, which is important for performance when attempting to
 * find the connection for a packet given its IP and port tuple.  Writing to
 * these fields that write locks be held on both the inpcb and global locks.
 *
 * Key:
 * (c) - Constant after initialization
 * (g) - Protected by the pcbgroup lock
 * (i) - Protected by the inpcb lock
 * (p) - Protected by the pcbinfo lock for the inpcb
 * (s) - Protected by another subsystem's locks
 * (x) - Undefined locking
 *
 * A few other notes:
 *
 * When a read lock is held, stability of the field is guaranteed; to write
 * to a field, a write lock must generally be held.
 *
 * netinet/netinet6-layer code should not assume that the inp_socket pointer
 * is safe to dereference without inp_lock being held, even for protocols
 * other than TCP (where the inpcb persists during TIMEWAIT even after the
 * socket has been freed), or there may be close(2)-related races.
 *
 * The inp_vflag field is overloaded, and would otherwise ideally be (c).
 */
struct inpcb {
<...>
LIST_ENTRY(inpcb) inp_hash;     /* (i/p) hash list */
LIST_ENTRY(inpcb) inp_portlist; /* (i/p) */
struct  inpcbport *inp_phd;     /* (i/p) head of this list */
struct  inpcbinfo *inp_pcbinfo; /* (c) PCB list info */
<...>
}


/*-
 * Global data structure for each high-level protocol (UDP, TCP, ...) in both
 * IPv4 and IPv6.  Holds inpcb lists and information for managing them.
 *
 * Each pcbinfo is protected by two locks: ipi_lock and ipi_hash_lock,
 * the former covering mutable global fields (such as the global pcb list),
 * and the latter covering the hashed lookup tables.  The lock order is:
 *
 *    ipi_lock (before) inpcb locks (before) {ipi_hash_lock, pcbgroup locks}
 *
 * Locking key:
 *
 * (c) Constant or nearly constant after initialisation
 * (g) Locked by ipi_lock
 * (h) Read using either ipi_hash_lock or inpcb lock; write requires both
 * (p) Protected by one or more pcbgroup locks
 * (x) Synchronisation properties poorly defined
 */
struct inpcbinfo {
<...>
}

Note the inpcbinfo locking guide indicates that either the INFO or HASH locks may be required depending on which field is being accessed/manipulated, which seems to stand in contradiction with the inpcb locking guide.

Based on the locking guide, fields touched in in_pcbdrop() and a look at other code in in_pcb.c, I'm wondering if the locking guide is out of date, and if the INPCB + HASH locks are actually the only 2 locks required for in_pcbdrop() given the fields it is touching.

All that being said, I may well be missing something and would appreciate some help understanding the subtleties here.

Hi @lsteward,

In D2079#49517, @lstewart wrote:
...
Let's talk through tcp_timer_persist() which calls tcp_drop(). First point - as I understand things, given the ref taken in tcp_newtcpcb(), we know that any call to in_pcbrele*() from functions called by the timer will not GC the inpcb. So we call:
tp = tcp_drop(tp, ETIMEDOUT);
...

Thanks for your detailed comment.

First, you are right INP_INFO lock is not required by in_pcbdrop() but instead by in_pcbfree() (and in_pcbremlists() which is called only from in_pcbfree()). Call stack from tcp_timer_persist() to in_pcbfree() is indeed far from being obvious:

tcp_timer_keep()
tcp_drop()
tcp_close()
sofree()
tcp_usr_detach() (via pr->pr_usrreqs->pru_detach() in sofree())
tcp_detach()
in_pcbfree()
in_pcbremlists()

Second, you are still right in_pcbdrop() only requires INP_HASH and INP locks, I have updated that as part of D2599 (See sys/netinet/in_pcb.h around line 167 in D2599).

Third, I will reply about inp reference counting in a separate comment as this part in also quite complex.

In D2079#49598, @jch wrote:
Thanks for your detailed comment.

First, you are right INP_INFO lock is not required by in_pcbdrop() but instead by in_pcbfree() (and in_pcbremlists() which is called only from in_pcbfree()). Call stack from tcp_timer_persist() to in_pcbfree() is indeed far from being obvious:
tcp_timer_keep()
tcp_drop()
tcp_close()
sofree()
tcp_usr_detach() (via pr->pr_usrreqs->pru_detach() in sofree())
tcp_detach()
in_pcbfree()
in_pcbremlists()

Derp, thanks for pointing this out. For some reason in my mind I thought detach was not called synchronously through this code path and so thought in_pcbdrop() was the only candidate that might need the INFO lock.

Second, you are still right in_pcbdrop() only requires INP_HASH and INP locks, I have updated that as part of D2599 (See sys/netinet/in_pcb.h around line 167 in D2599).

Great.

Third, I will reply about inp reference counting in a separate comment as this part in also quite complex.

Ok, thanks. Will be very helpful to better understand.

florent_guiliani.fr added a subscriber: florent_guiliani.fr.May 28 2015, 1:08 PM

Hi all:

We have put these changes into our NF caches, and we now are seeing
crashes that all relate to the removal of

if (inp == NULL) {

// count race
return

}

We have several crashes under load with this, so it appears there
is some un-thought out issue with this.

I believe we will have to at least put the inp == NULL check back in
for our purposes, but someone may want to take a look at this
and see why its happening..

(note we don't get the kassert since we don't have INVARIANT compiled
in we just get a crash in the inp lock :-o

Hi @rrs,

In D2079#50069, @rrs wrote:
We have put these changes into our NF caches, and we now are seeing
crashes that all relate to the removal of

if (inp == NULL) {
// count race
return
}

We have several crashes under load with this, so it appears there
is some un-thought out issue with this.

I believe we will have to at least put the inp == NULL check back in
for our purposes, but someone may want to take a look at this
and see why its happening..

(note we don't get the kassert since we don't have INVARIANT compiled
in we just get a crash in the inp lock :-o

Interesting, thanks for sharing. Having inp == tp->t_inpcb == NULL means that uma_zfree(V_tcpcb_zone, tp) has been called and then tp is used (the use-after-free race condition). Anyway I would like to reproduce the issue:

Could you reproduce this issue with INVARIANTS? (I guess not) I am asking because other kassert can fail before this one /and/ then trigger inp == NULL.
Are you using TCP Offload Engine? (I guess yes)
If you have servers without TOE support or with TOE disabled, did you also witness this crash on these servers?

I am asking because I am inclined to forget the "other" TCP stack, the TOE one. And glancing at tcp_timer.c is interesting enough:

void
tcp_timer_activate(struct tcpcb *tp, uint32_t timer_type, u_int delta)
{
	struct callout *t_callout;
	void *f_callout;
	struct inpcb *inp = tp->t_inpcb;
	int cpu = INP_CPU(inp);

#ifdef TCP_OFFLOAD
	if (tp->t_flags & TF_TOE)
		return;
#endif
...

It might be a red herring, but I am going to check how TCP timer actually works in TOE context. My 2 cents.

We don't use TOE (we use LRO though). The panic's we have are the persist timer. Lawrence
has an idea though and is investigating that. Maybe he can turn something up.

The problem of course is it happens in production after hours of running at full load.. so
no we have not done an INVARIANT run. Lets see what Lawrence turns up. We will
probably hold off merging this to our next release until we can get the issue resolved ;-)

Hi @rrs,

In D2079#50233, @rrs wrote:

We don't use TOE (we use LRO though). The panic's we have are the persist timer. Lawrence
has an idea though and is investigating that. Maybe he can turn something up.

The problem of course is it happens in production after hours of running at full load.. so
no we have not done an INVARIANT run. Lets see what Lawrence turns up. We will
probably hold off merging this to our next release until we can get the issue resolved ;-)

Thanks for your inputs. Just as sanity checks I tested the persist timer roughly the same way than you:

I disabled INVARIANTS
I tuned our TCP QA test tools to produce a bunch of persist timer timeouts

Then, even with ~80k tcp_timer_persist() timeout called per second, I was not able to reproduce it (see below my Dtrace script results after 60 secs). Of course, it does not mean that there are no issues, it just means it is not an issue straightforward tissue o reproduce. I will let Lawrence - I guess @lstewart - to provide more details before going too far. Just few open notes, if you made custom changes in the area.

This patch heavily relies on current callout implementation, thus check any callout changes (if any)
Always use the TCP timer functions - i.e. tcp_timer_activate(), tcp_timer_stop(), etc. - instead of a direct access to the callout like callout_reset_on(tt_persist, delta, tcp_timer_persist, tp, cpu)

My 2 cents.

# cat persist.d 
fbt::tcp_setpersist:entry
{
    @tcp_setpersist[execname, stack()] = count();
}

fbt::tcp_timer_persist:entry
{
    @tcp_timer_persist[execname, stack()] = count();
}

END
{
    printf("\ntcp_setpersist:\n");
    printa(@tcp_setpersist);

    printf("\ntcp_timer_persist:\n");
    printa(@tcp_timer_persist);
}
# dtrace -s persist.d 
dtrace: script 'persist.d' matched 3 probes
^C
CPU     ID                    FUNCTION:NAME
 11      2                             :END 
tcp_setpersist:

  kernel                                            
              kernel`tcp_output+0x2004
              kernel`tcp_do_segment+0x268f
              kernel`tcp_input+0xd17
              kernel`ip_input+0xa9
              kernel`netisr_dispatch_src+0x61
              kernel`ether_demux+0x159
              kernel`ether_nh_input+0x37d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`vlan_input+0x138
              kernel`ether_demux+0x94
              kernel`ether_nh_input+0x37d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x747
              kernel`ixgbe_handle_que+0xc8
              kernel`taskqueue_run_locked+0x139
              kernel`taskqueue_thread_loop+0xc8
              kernel`fork_exit+0x9a
              kernel`0xffffffff80e213fe
           448565                       
  intr                                              
              kernel`tcp_output+0x2004
              kernel`tcp_do_segment+0x268f
              kernel`tcp_input+0xd17
              kernel`ip_input+0xa9
              kernel`netisr_dispatch_src+0x61
              kernel`ether_demux+0x159
              kernel`ether_nh_input+0x37d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`vlan_input+0x138
              kernel`ether_demux+0x94
              kernel`ether_nh_input+0x37d
              kernel`netisr_dispatch_src+0x61
              kernel`ether_input+0x26
              kernel`ixgbe_rxeof+0x747
              kernel`ixgbe_msix_que+0xc9
              kernel`intr_event_execute_handlers+0x1d8
              kernel`ithread_loop+0x9c
              kernel`fork_exit+0x9a
              kernel`0xffffffff80e213fe
          1589087
  intr                                              
              kernel`tcp_timer_persist+0x1dc
              kernel`softclock_call_cc+0x19c
              kernel`softclock+0x94
              kernel`intr_event_execute_handlers+0x1d8
              kernel`ithread_loop+0x9c
              kernel`fork_exit+0x9a
              kernel`0xffffffff80e213fe
          4666713

tcp_timer_persist:

  intr                                              
              kernel`softclock_call_cc+0x19c
              kernel`softclock+0x94
              kernel`intr_event_execute_handlers+0x1d8
              kernel`ithread_loop+0x9c
              kernel`fork_exit+0x9a
              kernel`0xffffffff80e213fe
          4891511

My 2 cents.

Randall accidentally misspoke. We're seeing tcp_timer_keep() fire with a tp in TIMEWAIT and t_inpcb==NULL. The rest of the tp looks sane indicating it hasn't been GCed. I'm still trying to understand how this is possible as the code looks correct to me, but I'm continuing to dig...

Hi @lstewart,

In D2079#51293, @lstewart wrote:

Randall accidentally misspoke. We're seeing tcp_timer_keep() fire with a tp in TIMEWAIT and t_inpcb==NULL. The rest of the tp looks sane indicating it hasn't been GCed. I'm still trying to understand how this is possible as the code looks correct to me, but I'm continuing to dig...

No problem, it was a good occasion to add a specific persist timer test in our QA tools. I am going to stress tcp_timer_keep() now to see of I can reproduce it, Thanks for the details.

pfg mentioned this in D2725: Add extra INVARIANTS check for ctor/dtor.Jun 3 2015, 8:00 PM

Hi @lsteward,

I might have found a way to reproduce this issue: Set the TCP keep-alive timers very low:

# sysctl -a | grep keep                                                                                                                          
vfs.nfs.nfs_keep_dirty_on_error: 0
net.inet.tcp.keepidle: 7200000
net.inet.tcp.keepintvl: 75000
net.inet.tcp.keepinit: 75000
net.inet.tcp.keepcnt: 8
net.inet.tcp.always_keepalive: 1
# sysctl net.inet.tcp.keepidle=10
net.inet.tcp.keepidle: 7200000 -> 10
# sysctl net.inet.tcp.keepidle=10
net.inet.tcp.keepidle: 10 -> 10
# sysctl net.inet.tcp.keepintvl=50
net.inet.tcp.keepintvl: 75000 -> 50
# sysctl net.inet.tcp.keepinit=10
net.inet.tcp.keepinit: 75000 -> 10

Then I got a kernel panic that looks similar to our, let me reproduce this panic several time to confirm that. Moreover, if tomorrow I did not find and fix the root cause of this panic I will revert this change to avoid too many reports during the investigation time. Thanks for your inputs.

Yes, lowering the keepalive timer was how I was triggering this more quickly during investigation as with our default it took days at high load to trigger. I've also analysed a core dump with the tp in t_state 0, so it's not specific to TIMEWAIT either. I think I might know what's going on but will hopefully confirm my findings later today.

hiren added a subscriber: hiren.Jun 5 2015, 2:07 AM

Hi @lstewart,

In D2079#51973, @lstewart wrote:

Yes, lowering the keepalive timer was how I was triggering this more quickly during investigation as with our default it took days at high load to trigger. I've also analysed a core dump with the tp in t_state 0, so it's not specific to TIMEWAIT either. I think I might know what's going on but will hopefully confirm my findings later today.

Interesting. On my side I finally reproduce your exact issue:

panic: tcp_timer_keep: tp 0xfffff804210fc418 tp->t_inpcb == NULL

Just I added debugging code to get a better context view (see below). And it appears that:

TCP keep-alive time was running
callout_stop(TT_KEEP) returned successfully
As no TCP callouts were apparently running tcp_discardcb() decided to directly free the tcpcb
Crash because a TT_KEEP callout was indeed still running and called afterward

I am digging this scenario. Below the debug patch I used:

diff --git a/sys/netinet/tcp_subr.c b/sys/netinet/tcp_subr.c
index d75221c..2eb81a4 100644
--- a/sys/netinet/tcp_subr.c
+++ b/sys/netinet/tcp_subr.c
@@ -1021,12 +1021,16 @@ tcp_discardcb(struct tcpcb *tp)
        CC_ALGO(tp) = NULL;
        inp->inp_ppcb = NULL;
        if ((tp->t_timers->tt_flags & TT_MASK) == 0) {
+               tp->t_timers->tt_flags |= TT_DIRECT_FREE;
                /* We own the last reference on tcpcb, let's free it. */
                tp->t_inpcb = NULL;
+               tp->t_timers->tt_flags |= TT_DIRECT_FREED;
                uma_zfree(V_tcpcb_zone, tp);
                released = in_pcbrele_wlocked(inp);
                KASSERT(!released, ("%s: inp %p should not have been released "
                        "here", __func__, inp));
+       } else {
+               tp->t_timers->tt_flags |= TT_DEFER_FREE;
        }
 }
 
@@ -1084,6 +1088,7 @@ tcp_timer_discard(struct tcpcb *tp, uint32_t timer_type)
        if ((tp->t_timers->tt_flags & TT_MASK) == 0) {
                /* We own the last reference on this tcpcb, let's free it. */
                tp->t_inpcb = NULL;
+               tp->t_timers->tt_flags |= TT_DEFER_FREED;
                uma_zfree(V_tcpcb_zone, tp);
                if (in_pcbrele_wlocked(inp)) {
                        INP_INFO_WUNLOCK(&V_tcbinfo);
diff --git a/sys/netinet/tcp_timer.c b/sys/netinet/tcp_timer.c
index a64e403..e45d05b 100644
--- a/sys/netinet/tcp_timer.c
+++ b/sys/netinet/tcp_timer.c
@@ -811,6 +811,8 @@ tcp_timer_activate(struct tcpcb *tp, uint32_t timer_type, u_int delta)
        struct inpcb *inp = tp->t_inpcb;
        int cpu = inp_to_cpuid(inp);
 
+       INP_WLOCK_ASSERT(tp->t_inpcb);
+
 #ifdef TCP_OFFLOAD
        if (tp->t_flags & TF_TOE)
                return;
@@ -892,6 +894,8 @@ tcp_timer_stop(struct tcpcb *tp, uint32_t timer_type)
        struct callout *t_callout;
        timeout_t *f_callout;
 
+       INP_WLOCK_ASSERT(tp->t_inpcb);
+
        tp->t_timers->tt_flags |= TT_STOPPED;
 
        switch (timer_type) {
@@ -920,9 +924,18 @@ tcp_timer_stop(struct tcpcb *tp, uint32_t timer_type)
                }
 
        if (tp->t_timers->tt_flags & timer_type) {
+               if (timer_type == TT_KEEP) {
+                       tp->t_timers->tt_flags |= TT_STOP_KEEP_RUN;
+               }
                if (callout_stop(t_callout)) {
+                       if (timer_type == TT_KEEP) {
+                               tp->t_timers->tt_flags |= TT_STOP_KEEP_OK;
+                       }
                        tp->t_timers->tt_flags &= ~timer_type;
                } else {
+                       if (timer_type == TT_KEEP) {
+                               tp->t_timers->tt_flags |= TT_STOP_KEEP_KO;
+                       }
                        /*
                         * Can't stop the callout, defer tcpcb actual deletion
                         * to the last tcp timer discard callout.
@@ -935,6 +948,10 @@ tcp_timer_stop(struct tcpcb *tp, uint32_t timer_type)
                         */
                        callout_reset(t_callout, 1, f_callout, tp);
                }
+       } else {
+               if (timer_type == TT_KEEP) {
+                       tp->t_timers->tt_flags |= TT_STOP_KEEP_NOT_RUN;
+               }
        }
 }
 
diff --git a/sys/netinet/tcp_timer.h b/sys/netinet/tcp_timer.h
index 02a7782..35a8363 100644
--- a/sys/netinet/tcp_timer.h
+++ b/sys/netinet/tcp_timer.h
@@ -159,6 +159,14 @@ struct tcp_timer {
 #define TT_KEEP                0x0008
 #define TT_2MSL                0x0010
 #define TT_MASK                (TT_DELACK|TT_REXMT|TT_PERSIST|TT_KEEP|TT_2MSL)
+#define TT_STOP_KEEP_OK        0x0020
+#define TT_STOP_KEEP_KO        0x0040
+#define TT_STOP_KEEP_RUN       0x0080
+#define TT_STOP_KEEP_NOT_RUN   0x0100
+#define TT_DIRECT_FREE 0x0200
+#define TT_DEFER_FREE  0x0400
+#define TT_DIRECT_FREED        0x0800
+#define TT_DEFER_FREED 0x1000
 
 #define TT_STOPPED     0x00010000

In D2079#52021, @jch wrote:
In D2079#51973, @lstewart wrote:

Yes, lowering the keepalive timer was how I was triggering this more quickly during investigation as with our default it took days at high load to trigger. I've also analysed a core dump with the tp in t_state 0, so it's not specific to TIMEWAIT either. I think I might know what's going on but will hopefully confirm my findings later today.

Interesting. On my side I finally reproduce your exact issue:
panic: tcp_timer_keep: tp 0xfffff804210fc418 tp->t_inpcb == NULL
Just I added debugging code to get a better context view (see below). And it appears that:

TCP keep-alive time was running

callout_stop(TT_KEEP) returned successfully

As no TCP callouts were apparently running tcp_discardcb() decided to directly free the tcpcb

Crash because a TT_KEEP callout was indeed still running and called afterward

I am digging this scenario...

Investigation summary so far: After callout_stop(TT_KEEP) returned successfully something somehow restarts the TT_KEEP callout and there are only two places were callout are added:

callout_reset_sbt_on(), and
softclock_call_cc() in callout migration case

As callout_reset_sbt_on() case should not happen here to TT_STOPPED flag usage, I am debugging the softclock_call_cc() in callout migration case.

It sounds like you guys are well into this, but FWIW we at Limelight are seeing the same and would be happy to sacrifice some machines to the debugging altar should it be needed. Everything has been TIME_WAIT so far, but I've only analyzed a handful of cores. The rest of the CB is still populated.

(kgdb) print tp->t_inpcb
$4 = (struct inpcb *) 0x0
(kgdb) print tp->t_state
$3 = 10

j-nitrology.com added a comment.Jun 6 2015, 7:54 PM

This comment was removed by j-nitrology.com.

Hi,

In D2079#52025, @jch wrote:
In D2079#52021, @jch wrote:
In D2079#51973, @lstewart wrote:

Yes, lowering the keepalive timer was how I was triggering this more quickly during investigation as with our default it took days at high load to trigger. I've also analysed a core dump with the tp in t_state 0, so it's not specific to TIMEWAIT either. I think I might know what's going on but will hopefully confirm my findings later today.

Interesting. On my side I finally reproduce your exact issue:
panic: tcp_timer_keep: tp 0xfffff804210fc418 tp->t_inpcb == NULL
Just I added debugging code to get a better context view (see below). And it appears that:

TCP keep-alive time was running

callout_stop(TT_KEEP) returned successfully

As no TCP callouts were apparently running tcp_discardcb() decided to directly free the tcpcb

Crash because a TT_KEEP callout was indeed still running and called afterward

I am digging this scenario...

I might have found it: The scenario seems to be:

callout_reset_on(TT_KEEP) failed as the TT_KEEP callout is about to run _but_ successfully schedules a new TT_KEEP callout
callout_stop(TT_KEEP) successfully stops the new TT_KEEP callout _but_ does not stop the about to run TT_KEEP callout
tcpcb is destroyed as callout_stop(TT_KEEP) return success
The about to run TT_KEEP callout runs
panic()

Short conclusion:

If a call to callout_reset_on(TT_XXX) fails, do not trust the return value of next call to callout_stop(TT_XXX)

I am testing the corresponding fix.

Just for the record, below how I got details on this issue:

#1. Apply a debug patch below to avoid crash and get a convenient entry point on TT_KEEP callout:

diff --git a/sys/netinet/tcp_timer.c b/sys/netinet/tcp_timer.c
index a64e403..dcc748e 100644
--- a/sys/netinet/tcp_timer.c
+++ b/sys/netinet/tcp_timer.c
@@ -270,6 +270,10 @@ tcp_timer_delack(void *xtp)
        CURVNET_SET(tp->t_vnet);

        inp = tp->t_inpcb;
+       if (inp == NULL) {
+               CURVNET_RESTORE();
+               return;
+       }
        KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp));
        INP_WLOCK(inp);
        if (callout_pending(&tp->t_timers->tt_delack) ||
@@ -307,8 +311,14 @@ tcp_timer_2msl(void *xtp)

        ostate = tp->t_state;
 #endif
+
        INP_INFO_WLOCK(&V_tcbinfo);
        inp = tp->t_inpcb;
+       if (inp == NULL) {
+               INP_INFO_WUNLOCK(&V_tcbinfo);
+               CURVNET_RESTORE();
+               return;
+       }
        KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp));
        INP_WLOCK(inp);
        tcp_free_sackholes(tp);
@@ -370,16 +380,30 @@ void
 tcp_timer_keep(void *xtp)
 {
        struct tcpcb *tp = xtp;
-       struct tcptemp *t_template;
        struct inpcb *inp;
+
        CURVNET_SET(tp->t_vnet);
+
+       INP_INFO_WLOCK(&V_tcbinfo);
+       inp = tp->t_inpcb;
+       tcp_timer_keepcb(tp, inp);
+}
+
+void
+tcp_timer_keepcb(struct tcpcb *tp, struct inpcb *inp) {
+
+       struct tcptemp *t_template;
 #ifdef TCPDEBUG
        int ostate;

        ostate = tp->t_state;
 #endif
-       INP_INFO_WLOCK(&V_tcbinfo);
-       inp = tp->t_inpcb;
+       INP_INFO_WLOCK_ASSERT(&V_tcbinfo);
+       if (inp == NULL) {
+               INP_INFO_WUNLOCK(&V_tcbinfo);
+               CURVNET_RESTORE();
+               return;
+       }
        KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp));
        INP_WLOCK(inp);
        if (callout_pending(&tp->t_timers->tt_keep) ||
@@ -475,6 +499,11 @@ tcp_timer_persist(void *xtp)
 #endif
        INP_INFO_WLOCK(&V_tcbinfo);
        inp = tp->t_inpcb;
+       if (inp == NULL) {
+               INP_INFO_WUNLOCK(&V_tcbinfo);
+               CURVNET_RESTORE();
+               return;
+       }
        KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp));
        INP_WLOCK(inp);
        if (callout_pending(&tp->t_timers->tt_persist) ||
@@ -556,6 +585,11 @@ tcp_timer_rexmt(void * xtp)

        INP_INFO_RLOCK(&V_tcbinfo);
        inp = tp->t_inpcb;
+       if (inp == NULL) {
+               INP_INFO_RUNLOCK(&V_tcbinfo);
+               CURVNET_RESTORE();
+               return;
+       }
        KASSERT(inp != NULL, ("%s: tp %p tp->t_inpcb == NULL", __func__, tp));
        INP_WLOCK(inp);
        if (callout_pending(&tp->t_timers->tt_rexmt) ||
@@ -811,6 +845,8 @@ tcp_timer_activate(struct tcpcb *tp, uint32_t timer_type, u_int delta)
        struct inpcb *inp = tp->t_inpcb;
        int cpu = inp_to_cpuid(inp);

+       INP_WLOCK_ASSERT(tp->t_inpcb);
+
 #ifdef TCP_OFFLOAD
        if (tp->t_flags & TF_TOE)
                return;
@@ -892,6 +928,8 @@ tcp_timer_stop(struct tcpcb *tp, uint32_t timer_type)
        struct callout *t_callout;
        timeout_t *f_callout;

+       INP_WLOCK_ASSERT(tp->t_inpcb);
+
        tp->t_timers->tt_flags |= TT_STOPPED;

        switch (timer_type) {
diff --git a/sys/netinet/tcp_timer.h b/sys/netinet/tcp_timer.h
index 02a7782..ed8fd15 100644
--- a/sys/netinet/tcp_timer.h
+++ b/sys/netinet/tcp_timer.h
@@ -189,6 +189,7 @@ void        tcp_timer_2msl(void *xtp);
 struct tcptw *
        tcp_tw_2msl_scan(int reuse);    /* XXX temporary? */
 void   tcp_timer_keep(void *xtp);
+void   tcp_timer_keepcb(struct tcpcb *tp, struct inpcb *inp);
 void   tcp_timer_persist(void *xtp);
 void   tcp_timer_rexmt(void *xtp);
 void   tcp_timer_delack(void *xtp);

#2. Run a Dtrace script that catches the issue and how we get there:

fbt::callout_reset_sbt_on:entry
{
  printf("%d cpu %d callout %p c_flags %x c_iflags %x c_cpu %d",
        timestamp,
        cpu,
        args[0],
        args[0]->c_flags,
        args[0]->c_iflags,
        args[0]->c_cpu);
}

fbt::tcp_timer_stop:entry
/args[1] == 0x08/
{
  printf("%d cpu %d tp %p tt_flags %x TT_KEEP callout %p c_flags %x c_iflags %x c_cpu %d",
         timestamp,
         cpu,
         args[0],
         args[0]->t_timers->tt_flags,
         &args[0]->t_timers->tt_keep,
         args[0]->t_timers->tt_keep.c_flags,
         args[0]->t_timers->tt_keep.c_iflags,
         args[0]->t_timers->tt_keep.c_cpu);
  self->callout = &args[0]->t_timers->tt_keep;
}

fbt::_callout_stop_safe:return
/self->callout/
{
  printf("%d cpu %d TT_KEEP callout %p _callout_stop_safe+%x return %d",
         timestamp,
         cpu,
         self->callout,
         args[0],
         args[1]);
  self->callout = 0;
}

fbt::tcp_timer_keepcb:entry
/args[1] == 0/
{
  printf("%d cpu %d race on tp %p tt_flags %x TT_KEEP callout %p c_flags %x c_iflags %x c_cpu %d",
         timestamp,
         cpu,
         args[0],
         args[0]->t_timers->tt_flags,
         &args[0]->t_timers->tt_keep,
         args[0]->t_timers->tt_keep.c_flags,
         args[0]->t_timers->tt_keep.c_iflags,
         args[0]->t_timers->tt_keep.c_cpu);
}

$ sudo dtrace -b 32m -x switchrate=10hz -s timer-dbg.d dtrace.dump

#3. Check "race on to" in dtrace dump and previous related callout calls

1  44354       callout_reset_sbt_on:entry 15538995557825 cpu 1 callout fffff803b2eb1728 c_flags 2 c_iflags 90 c_cpu 0
1  42861             tcp_timer_stop:entry 15538995565750 cpu 1 tp fffff803b2eb1418 tt_flags 12008 TT_KEEP callout fffff803b2eb1728 c_flags 2 c_iflags 14 c_cpu 0
1  26707        _callout_stop_safe:return 15538995567784 cpu 1 TT_KEEP callout fffff803b2eb1728 _callout_stop_safe+394 return 1
5  35931           tcp_timer_keepcb:entry 15538995609752 cpu 5 race on tp fffff803b2eb1418 tt_flags 12000 TT_KEEP callout fffff803b2eb1728 c_flags 0 c_iflags 10 c_cpu 0

#4. Check what happen in callout_reset_sbt_on() when c_iflags is set to 0x90, i.e. CALLOUT_PROCESSED and CALLOUT_RETURNUNLOCKED _and not_ CALLOUT_PENDING.

Hi @lstewart, @nitroboost-gmail.com, @hiren,

Here

tcp-timer.patch5 KBDownload

my current patch to fix this issue, I am currently not able to reproduce it on HEAD with this patch applied. Let me know how if it works for you. If it works well, I will create a review with this patch and test it also on stable/10.

Thanks again for your inputs.

jch mentioned this in D2763: Fix a callout race condition introduced in TCP timers callouts with r281599..Jun 9 2015, 3:57 AM

jch added a child revision: D2763: Fix a callout race condition introduced in TCP timers callouts with r281599..Jun 9 2015, 3:58 AM

The race condition introduced with this change has been fixed as part of D2763: Fix a callout race condition introduced in TCP timers callouts with r281599. in HEAD and STABLE-10.

Hi @lstewart,

In D2079#49517, @lstewart wrote:

Leaving aside D2599 for the moment (which looks like good work and I will indeed take a look at it in detail - please include me on reviews for any TCP related work. I don't always get time to give them attention in the review window, but being aware of the work is very useful), I'm still not clear why tcp_drop(), and therefore the timers which call it, need the info lock in the new world order (in fact, I think my confusion also applies to the old world order. I was thinking that taking the reference on the inpcb in tcp_newtcpcb() means you now control when the inpcb can be GCed with respect to the timers executing which should allow simplification of the locking in the timers. It may even be the case that the reference you hold is irrelevant to the following thoughts...)

In D2079#49710, @lstewart wrote:

In D2079#49598, @jch wrote:

Third, I will reply about inp reference counting in a separate comment as this part in also quite complex.

Ok, thanks. Will be very helpful to better understand.

I had finally time to write something on inpcb reference counting:

In short, I see two kinds of reference counting usages for inpcb:

The short reference counting usage
The "subsystems" reference counting

The short usage is straightforward: in_pcbref()/in_pcbrele() are called in pair in the same function. Example:

void use_inp() {
  INP_WLOCK(inp);
  in_pcbref(inp);
  INP_WUNLOCK(inp);

  /* Use the inp as the memory stability is guaranteed. Like in tcp_pcblist().
     Or acquire a lock that require to be taken before INP lock. Like in tcp_timer_rexmt(). */

  INP_WLOCK(inp);
  if (in_pcbrele_wlocked(inp)) {
    /* inp has been destroyed */
    return;
  }
  /* inp still alive */
  INP_WUNLOCK(inp);
}

Nothing to add here.

The subsystem usage: An inp reference is taken when the inpcb enters the subsystem and removed when it leaves the subsystem. Examples of TCP stack subsystems:

TCP Timer/tcpcb: Reference holder: tcpcb. Reference Lifetime: Lifetime of the tcpcb/callouts i.e. between tcp_newtcpcb() and tcp_discardcb()/tcp_timer_discard()

Timewait: Reference holder: tcptw. Reference Lifetime: Lifetime of the tcptw i.e. between tcp_twstart() and tcp_tw_2msl_stop()

The socket subsystem: Reference lifetime: Between tcp_attach() and tcp_detach()

TCP Offload: Reference Lifetime: As long as the inp is handled by the TCP Offload stack

The TCP hash tables: Reference lifetime: While inp is in the TCP hash table between in_pcbinshash*() and in_pcbdrop()/in_pcbremlists()

The "main" system that actually creates/destroy the inp: Lifetime: Lifetime of the inp. Between in_pcballoc() and in_pcbfree()

That makes the corresponding code subtle to read is that:

It is not an automatic refcounting like C++ smart pointers or Java garbage collect
Not all subsystems are using in_pcbref() for inp reference counting but flags instead (see below)
You have to take in account of the current TCP stack rules:

As tcpcb and tcptw are protected by the INP lock, tcpcb and tcptw are always deleted after the corresponding inp
in_pcbfree() can be called only after the socket is detached from its inp
Socket does not use in_pcbref() but uses SS_PROTOREF/INP_SOCKREF flags
TCP hash tables do not use in_pcbref() but use INP_INHASHLIST/INP_DROPPED flags
Etc...

In conclusion deleting a inp/tcpcb/tcptw/socket is not straightforward. Code clarity could be improved (as always) by making the TCP destruction rules clearer, and using more in_pcbref() (Socket and TCP hash tables, I am looking at you). But this kind of code housekeeping is difficult/risky, thus we do it only while changing the corresponding subsystems (see our recent changes on timewait and tcp timers). If I have time, I will draw the complete TCP structure deletion state machine and see where are the low-hanging fruits for cleaning up the inp reference counting.

My 2 cents.

jch mentioned this in D3521: Implement callout_drain_async().Sep 15 2015, 3:10 PM

Dear FreeBSD Team,

Currently  we are facing below crash on tcp_pcblist related on 9.2 kernel base.

 Fatal trap 12: page fault while in kernel mode

cpuid = 17; apic id = 15
fault virtual address = 0xa00000278
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80554ac1
stack pointer = 0x28:0xffffff8fe2469280
frame pointer = 0x28:0xffffff8fe24692a0
code segment = base 0x0, limit 0xfffff, type 0x1b

= DPL 0, pres 1, long 1, def32 0, gran 1

processor eflags = interrupt enabled, resume, IOPL = 0
current process = 9707 (netstat)
trap number = 12
panic: page fault
cpuid = 17
panic: page fault
KDB: stack backtrace:

(kgdb) bt
#0 doadump (textdump=1) at ../../../kern/kern_shutdown.c:278
#1 0xffffffff80556504 in kern_reboot (howto=260) at ../../../kern/kern_shutdown.c:461
#2 0xffffffff80556b81 in panic (fmt=0x4 <Address 0x4 out of bounds>) at ../../../kern/kern_shutdown.c:659
#3 0xffffffff807597a0 in trap_fatal (frame=0xc, eva=<value optimized out>) at ../../../amd64/amd64/trap.c:879
#4 0xffffffff80759b0d in trap_pfault (frame=0xffffff8fe24691d0, usermode=0) at ../../../amd64/amd64/trap.c:795
#5 0xffffffff80759ecb in trap (frame=0xffffff8fe24691d0) at ../../../amd64/amd64/trap.c:463
#6 0xffffffff807424ef in calltrap () at ../../../amd64/amd64/exception.S:232
#7 0xffffffff80554ac1 in _rw_rlock (rw=0xa00000260, file=0x0, line=0) at ../../../kern/kern_rwlock.c:352
#8 0xffffffff806679d9 in tcp_pcblist (oidp=<value optimized out>, arg1=<value optimized out>, arg2=<value optimized out>, req=0xffffff8fe2469870) at ../../../netinet/tcp_subr.c:1269
#9 0xffffffff80560aaa in sysctl_root (oidp=<value optimized out>, arg1=0x0, arg2=0, req=0xffffff8fe2469870) at ../../../kern/kern_sysctl.c:1493
#10 0xffffffff80560d95 in userland_sysctl (td=<value optimized out>, name=0xffffff8fe2469930, namelen=4, old=<value optimized out>, oldlenp=0x0, inkernel=<value optimized out>, new=0x0, newlen=0,

retval=0xffffff8fe2469998, flags=0) at ../../../kern/kern_sysctl.c:1603

#11 0xffffffff805612aa in sys___sysctl (td=0xfffffe0b2954f920, uap=0xffffff8fe2469a70) at ../../../kern/kern_sysctl.c:1529
#12 0xffffffff807590b0 in amd64_syscall (td=0xfffffe0b2954f920, traced=0) at subr_syscall.c:135
#13 0xffffffff807427d7 in Xfast_syscall () at ../../../amd64/amd64/exception.S:391
#14 0x0000000801397f6c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) list *0xffffffff80554ac1
0xffffffff80554ac1 is in _rw_rlock (../../../kern/kern_rwlock.c:351).
346 ../../../kern/kern_rwlock.c: No such file or directory.
in ../../../kern/kern_rwlock.c
(kgdb) f 7
#7 0xffffffff80554ac1 in _rw_rlock (rw=0xa00000260, file=0x0, line=0) at ../../../kern/kern_rwlock.c:352
352 in ../../../kern/kern_rwlock.c
(kgdb) i lo
ts = <value optimized out>
owner = <value optimized out>
spintries = 0
i = 0
v = 0
(kgdb) print *(struct rwlock *)0xa00000260
Cannot access memory at address 0xa00000260

Kindly provide corresponding patch or root cause about the problem with proposed fix.

This revision is now accepted and ready to land.Nov 1 2017, 12:12 AM

Dear FreeBSD Team,

Currently we are facing below crash on tcp_pcblist related on 9.2 kernel base.

Fatal trap 12: page fault while in kernel mode
cpuid = 17; apic id = 15
fault virtual address = 0xa00000278
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80554ac1
stack pointer = 0x28:0xffffff8fe2469280
frame pointer = 0x28:0xffffff8fe24692a0
code segment = base 0x0, limit 0xfffff, type 0x1b

DPL 0, pres 1, long 1, def32 0, gran 1

processor eflags = interrupt enabled, resume, IOPL = 0
current process = 9707 (netstat)
trap number = 12
panic: page fault
cpuid = 17
panic: page fault
KDB: stack backtrace:

retval=0xffffff8fe2469998, flags=0) at ../../../kern/kern_sysctl.c:1603
#11 0xffffffff805612aa in sys___sysctl (td=0xfffffe0b2954f920, uap=0xffffff8fe2469a70) at ../../../kern/kern_sysctl.c:1529
#12 0xffffffff807590b0 in amd64_syscall (td=0xfffffe0b2954f920, traced=0) at subr_syscall.c:135
#13 0xffffffff807427d7 in Xfast_syscall () at ../../../amd64/amd64/exception.S:391
#14 0x0000000801397f6c in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb) list *0xffffffff80554ac1
0xffffffff80554ac1 is in _rw_rlock (../../../kern/kern_rwlock.c:351).
346 ../../../kern/kern_rwlock.c: No such file or directory.
in ../../../kern/kern_rwlock.c
(kgdb) f 7
#7 0xffffffff80554ac1 in _rw_rlock (rw=0xa00000260, file=0x0, line=0) at ../../../kern/kern_rwlock.c:352
352 in ../../../kern/kern_rwlock.c
(kgdb) i lo
ts = <value optimized out>
owner = <value optimized out>
spintries = 0
i = 0
v = 0
(kgdb) print *(struct rwlock *)0xa00000260
Cannot access memory at address 0xa00000260

Kindly provide corresponding patch or root cause about the problem with proposed fix.

This revision now requires changes to proceed.Nov 29 2017, 10:04 AM

Fix TCP timers use-after-free old race conditions
Needs RevisionPublic
Actions

Details

Diff Detail

Event Timeline

DPL 0, pres 1, long 1, def32 0, gran 1

Revision Contents
Changeset List

Diff 4845

sys/netinet/tcp_subr.c

sys/netinet/tcp_timer.h

sys/netinet/tcp_timer.c

sys/netinet/tcp_var.h

Fix TCP timers use-after-free old race conditionsNeeds RevisionPublicActions

Details

Diff Detail

Event Timeline

DPL 0, pres 1, long 1, def32 0, gran 1

Revision ContentsChangeset List

Diff 4845

sys/netinet/tcp_subr.c

sys/netinet/tcp_timer.h

sys/netinet/tcp_timer.c

sys/netinet/tcp_var.h

Fix TCP timers use-after-free old race conditions
Needs RevisionPublic
Actions

Revision Contents
Changeset List