
if: Protect V_ifnet in vnet_if_return()
ClosedPublic

Authored by kp on Thu, Nov 19, 1:54 PM.

Details

Summary

When we terminate a vnet (i.e. jail) we move interfaces back to their
home vnet. We need to protect our access to the V_ifnet CK_LIST.

We could enter NET_EPOCH, but if_detach_internal() (called from
if_vmove()) waits for net epoch callback completion. That's not possible
from NET_EPOCH. Instead, we take the IFNET_WLOCK, build a list of the
interfaces that need to move and, once we've released the lock, move
them back to their home vnet.

We cannot hold the IFNET_WLOCK() during if_vmove(), because that results
in a LOR between ifnet_sx, in_multi_sx and iflib ctx lock.

Separate out moving the ifp into or out of V_ifnet, so we can hold the
lock as we do the list manipulation, but do not hold it as we
if_vmove().

MFC after: 2 weeks
Sponsored by: Modirum MDPay
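
The two-phase approach described in the summary can be sketched roughly as follows (a simplified sketch, not the committed diff; the list type and loop condition are approximations, and later comments in this review refine the approach):

static void
vnet_if_return(const void *unused __unused)
{
	struct ifnet *ifp, *nifp;
	struct ifnet_head ifnets;	/* private list, only this thread sees it */

	CK_STAILQ_INIT(&ifnets);

	/* Phase 1: under IFNET_WLOCK, unlink interfaces that must go home. */
	IFNET_WLOCK();
	CK_STAILQ_FOREACH_SAFE(ifp, &V_ifnet, if_link, nifp) {
		if (ifp->if_home_vnet != ifp->if_vnet) {
			CK_STAILQ_REMOVE(&V_ifnet, ifp, ifnet, if_link);
			CK_STAILQ_INSERT_TAIL(&ifnets, ifp, if_link);
		}
	}
	IFNET_WUNLOCK();

	/*
	 * Phase 2: the lock is dropped, so if_vmove() may sleep and wait
	 * for epoch callback completion without triggering the LOR.
	 */
	CK_STAILQ_FOREACH_SAFE(ifp, &ifnets, if_link, nifp)
		if_vmove(ifp, ifp->if_home_vnet);
}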

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

kp requested review of this revision. Thu, Nov 19, 1:54 PM

I'm not sure that this is actually safe, unfortunately... we're now holding an rwlock and may need to sleep to acquire an sx in, e.g., me_reassign() (via if_vmove -> ifp->if_reassign -> me_reassign) if someone created a cloner and moved it into the now-dying vnet.

if_vmove() probably actually needs to assert that it won't be recursing on the ifnet locks here.

I'm not sure that this is actually safe, unfortunately... we're now holding an rwlock and may need to sleep to acquire an sx in, e.g., me_reassign() (via if_vmove -> ifp->if_reassign -> me_reassign) if someone created a cloner and moved it into the now-dying vnet.

if_vmove() probably actually needs to assert that it won't be recursing on the ifnet locks here.

I'm not sure I follow.
The ifnet_sxlock is recursive, so that one's safe to take again. I removed the rwlock from IFNET_WLOCK, because it's actually pointless (D27278). It should be safe for me_reassign() to acquire its own me_ioctl_sx either way.

In D27279#609340, @kp wrote:

I'm not sure that this is actually safe, unfortunately... we're now holding an rwlock and may need to sleep to acquire an sx in, e.g., me_reassign() (via if_vmove -> ifp->if_reassign -> me_reassign) if someone created a cloner and moved it into the now-dying vnet.

if_vmove() probably actually needs to assert that it won't be recursing on the ifnet locks here.

I'm not sure I follow.
The ifnet_sxlock is recursive, so that one's safe to take again. I removed the rwlock from IFNET_WLOCK, because it's actually pointless (D27278). It should be safe for me_reassign() to acquire its own me_ioctl_sx either way.

Ah, ok; D27278 is the missing context here. It would be helpful to include linked reviews like that in the stack via "Edit Related Revisions" -- I otherwise missed it since I'm not on that one.

allanjude added inline comments.
sys/net/if.c
480 ↗(On Diff #79755)

is this supposed to be 'x'lock? It would make sense if describing the set of locks, but you mention the write lock specifically

sys/net/if.c
480 ↗(On Diff #79755)

Yeah, that needs to be read/write lock.

With that aside, this looks good at first blush... I'll take a closer look when I get some time tonight.

can't say how correct this is, but it'll be great to have it fixed!

sys/net/if.c
475 ↗(On Diff #79762)

first line of multiline comments should start on the line after the opening /*,
so:

/*
 * We need to protect ...
 */

I'm still trying to track down what seems to be scribbling on freed memory when I run the pf:names test, but I note that it's also complaining about LOR ifnet_sx (vnet_if_return, though it points to the last line that dropped it?) -> in_multi_sx (in_ifdetach)

I'm still trying to track down what seems to be scribbling on freed memory when I run the pf:names test, but I note that it's also complaining about LOR ifnet_sx (vnet_if_return, though it points to the last line that dropped it?) -> in_multi_sx (in_ifdetach)

I can't reproduce the memory issue you're reporting. I'd be very interested in looking at that.

Is this the LOR you mean?

lock order reversal:
 1st 0xffffffff81d9e218 ifnet_sx (ifnet_sx, sx) @ /usr/src/sys/net/if.c:1608
 2nd 0xffffffff81da0348 in_multi_sx (in_multi_sx, sx) @ /usr/src/sys/netinet/in.c:1045
lock order ifnet_sx -> in_multi_sx attempted at:
#0 0xffffffff80c543fc at witness_checkorder+0xdcc
#1 0xffffffff80befa37 at _sx_xlock+0x67
#2 0xffffffff80d87f64 at in_ifdetach+0x24
#3 0xffffffff80cfc3aa at if_detach_internal+0x32a
#4 0xffffffff80d023bc at if_vmove+0x3c
#5 0xffffffff80d02370 at vnet_if_return+0x80
#6 0xffffffff80d33ab0 at vnet_destroy+0x140
#7 0xffffffff80ba789e at prison_deref+0x28e
#8 0xffffffff80ba8e00 at sys_jail_remove+0x290
#9 0xffffffff8103415e at amd64_syscall+0x12e
#10 0xffffffff810070ce at fast_syscall_common+0xf8

It's not clear to me where the other order is (i.e. where we take the multicast lock first, and then the ifnet lock).

I'm still trying to track down what seems to be scribbling on freed memory when I run the pf:names test, but I note that it's also complaining about LOR ifnet_sx (vnet_if_return, though it points to the last line that dropped it?) -> in_multi_sx (in_ifdetach)

I kept running the test differently and with other activity, and finally found the culprit:

(kgdb) bt
...
#13 0xffffffff81025c24 in trap (frame=0xfffffe00ac2f77b0) at /usr/src/sys/amd64/amd64/trap.c:212
#14 <signal handler called>
#15 0xffffffff80cdc785 in strncmp (s1=<optimized out>, s2=<optimized out>, n=<optimized out>)
    at /usr/src/sys/libkern/strncmp.c:44
#16 0xffffffff80cf1f89 in ifunit_ref (name=0xfffffe00ac2f7a10 "epair0a") at /usr/src/sys/net/if.c:2324
#17 0xffffffff80cf41c4 in ifioctl (so=0xfffff801e3fb2000, cmd=3224398136, data=0xfffffe00ac2f7a10 "epair0a",
    td=0xfffffe00a3b65e00) at /usr/src/sys/net/if.c:2998
(kgdb) frame 16
#16 0xffffffff80cf1f89 in ifunit_ref (name=0xfffffe00ac2f7a10 "epair0a") at /usr/src/sys/net/if.c:2324
2324                    if (strncmp(name, ifp->if_xname, IFNAMSIZ) == 0 &&
(kgdb) print ifp
$1 = (struct ifnet *) 0xdeadc0dedeadc0de

This is the classic race of ioctl handler vs. interface detachment.

Remove now pointless do {} while(0) wrappers. While here also remove a stale comment.

This is the classic race of ioctl handler vs. interface detachment.

Hmm. That does indicate there's still something wrong in the interface removal handling. We're in NET_EPOCH and should not see a freed ifp there. The normal flow for interface removal should now be to remove it from the list and schedule a NET_EPOCH_CALL() to actually free it. Clearly there's some code path that doesn't respect this.

I do think that's an independent issue, but we should try to fix that one too. Any hints you have on reproduction would be helpful.
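
The removal flow described here looks roughly like this (a sketch; NET_EPOCH_CALL() and the epoch context field follow the pattern in sys/net/if.c, but details are simplified):

/* Unlink under the lock; concurrent epoch readers may still hold ifp. */
IFNET_WLOCK();
CK_STAILQ_REMOVE(&V_ifnet, ifp, ifnet, if_link);
IFNET_WUNLOCK();

/*
 * Defer the actual free until all in-flight NET_EPOCH sections have
 * drained; only then is it safe to release the memory.
 */
NET_EPOCH_CALL(if_destroy, &ifp->if_epoch_ctx);

Any code path that frees an ifnet without going through the deferred call would produce exactly the use-after-free seen in the backtrace above.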

In D27279#609748, @kp wrote:

This is the classic race of ioctl handler vs. interface detachment.

Hmm. That does indicate there's still something wrong in the interface removal handling. We're in NET_EPOCH and should not see a freed ifp there. The normal flow for interface removal should now be to remove it from the list and schedule a NET_EPOCH_CALL() to actually free it. Clearly there's some code path that doesn't respect this.

I do think that's an independent issue, but we should try to fix that one too. Any hints you have on reproduction would be helpful.

kldload pf; for i in $(seq 1 25); do kyua test -k /usr/tests/sys/netpfil/pf/Kyuafile names; done -- I consistently hit it within 5 iterations of this loop on a mostly idle system, usually within two or three. Normally it's after we've scribbled over something within the now-dead interface and I've triggered the panic in mtrash_ctor() w/ INVARIANTS on, but in the above backtrace it actually trapped instead since I had a little more activity going on.

kldload pf; for i in $(seq 1 25); do kyua test -k /usr/tests/sys/netpfil/pf/Kyuafile names; done -- I consistently hit it within 5 iterations of this loop on a mostly idle system, usually within two or three. Normally it's after we've scribbled over something within the now-dead interface and I've triggered the panic in mtrash_ctor() w/ INVARIANTS on, but in the above backtrace it actually trapped instead since I had a little more activity going on.

Strange. No panic here, and I've run through the full loop 5+ times now.

In D27279#609742, @kp wrote:

I'm still trying to track down what seems to be scribbling on freed memory when I run the pf:names test, but I note that it's also complaining about LOR ifnet_sx (vnet_if_return, though it points to the last line that dropped it?) -> in_multi_sx (in_ifdetach)

I can't reproduce the memory issue you're reporting. I'd be very interested in looking at that.

Is this the LOR you mean?

lock order reversal:
 1st 0xffffffff81d9e218 ifnet_sx (ifnet_sx, sx) @ /usr/src/sys/net/if.c:1608
 2nd 0xffffffff81da0348 in_multi_sx (in_multi_sx, sx) @ /usr/src/sys/netinet/in.c:1045
lock order ifnet_sx -> in_multi_sx attempted at:
#0 0xffffffff80c543fc at witness_checkorder+0xdcc
#1 0xffffffff80befa37 at _sx_xlock+0x67
#2 0xffffffff80d87f64 at in_ifdetach+0x24
#3 0xffffffff80cfc3aa at if_detach_internal+0x32a
#4 0xffffffff80d023bc at if_vmove+0x3c
#5 0xffffffff80d02370 at vnet_if_return+0x80
#6 0xffffffff80d33ab0 at vnet_destroy+0x140
#7 0xffffffff80ba789e at prison_deref+0x28e
#8 0xffffffff80ba8e00 at sys_jail_remove+0x290
#9 0xffffffff8103415e at amd64_syscall+0x12e
#10 0xffffffff810070ce at fast_syscall_common+0xf8

Yup!

It's not clear to me where the other order is (i.e. where we take the multicast lock first, and then the ifnet lock).

Same... and WITNESS didn't record where the initial order happened, so it's not giving anything useful here.

It's not clear to me where the other order is (i.e. where we take the multicast lock first, and then the ifnet lock).

Same... and WITNESS didn't record where the initial order happened, so it's not giving anything useful here.

For extra fun: I don't see a panic with this patch, but still get the LOR warning. I don't see how that is even possible. How can we LOR on those two locks if we never hold the in_multi_sx lock when we take the ifnet_sxlock?

diff --git a/sys/net/if_var.h b/sys/net/if_var.h
index beb9596895e..2bfc211dd6d 100644
--- a/sys/net/if_var.h
+++ b/sys/net/if_var.h
@@ -594,12 +594,21 @@ struct ifmultiaddr {
 };

 extern struct sx ifnet_sxlock;
+extern struct sx in_multi_sx;
+
+#define        IFNET_WLOCK()           do { \
+       sx_xlock(&ifnet_sxlock); \
+       MPASS(! sx_xlocked(&in_multi_sx));  \
+       } while (0)

-#define        IFNET_WLOCK()           sx_xlock(&ifnet_sxlock)
 #define        IFNET_WUNLOCK()         sx_xunlock(&ifnet_sxlock)
 #define        IFNET_RLOCK_ASSERT()    sx_assert(&ifnet_sxlock, SA_SLOCKED)
 #define        IFNET_WLOCK_ASSERT()    sx_assert(&ifnet_sxlock, SA_XLOCKED)
-#define        IFNET_RLOCK()           sx_slock(&ifnet_sxlock)
+#define        IFNET_RLOCK()           do { \
+       sx_slock(&ifnet_sxlock); \
+       MPASS(! sx_xlocked(&in_multi_sx));  \
+       } while (0)
+
 #define        IFNET_RUNLOCK()         sx_sunlock(&ifnet_sxlock)

 /*

I think we've somehow landed in bizarro world. :-)

Perhaps time to make sure entire kernel & all modules are built from scratch, without reusing old obj files, and without ccache -- just to make sure it's a clean slate?

Perhaps time to make sure entire kernel & all modules are built from scratch, without reusing old obj files, and without ccache -- just to make sure it's a clean slate?

I nuked my /usr/obj and rebuilt everything, but that didn't make any difference.

I also thought for a moment that there was a hardcoded locking order in witness. There are (string) references to in_multi_sx in kern/subr_witness.c, but removing those didn't make any difference.

The more I look at this the more it looks like it's either flat out a witness bug, or witness not correctly expressing what's going on.

As mentioned above I've run the scenario that produces the LOR warning with assertions in IFNET_(R|X)LOCK to ensure we don't hold the in_multi_sx lock.
I've even gone so far as to make witness log all new observed lock orders, and it never sees in_multi_sx and then ifnet_sx. It still complains about it though. So if anyone's got experience with witness strangeness I'd quite like a hint now.

Try putting entries in the witness_order_list_entry in kern/subr_witness.c specifying the ifnet_sx -> in_multi_sx ordering as correct. That should get better information. Likely the issue is a cycle of length larger than 2 (e.g. in_multi_sx before mystery lock, mystery lock before ifnet_sx, ifnet_sx before in_multi_sx).
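
Such order hints in kern/subr_witness.c take roughly this shape (a sketch; the surrounding entries in order_lists[] are omitted, and placement among them matters):

static struct witness_order_list_entry order_lists[] = {
	/* ... existing entries ... */
	/*
	 * Bless the ifnet_sx -> in_multi_sx ordering so WITNESS reports
	 * the next edge in the cycle instead of this one.
	 */
	{ "ifnet_sx", &lock_class_sx },
	{ "in_multi_sx", &lock_class_sx },
	{ NULL, NULL },
	/* ... */
};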

Try putting entries in the witness_order_list_entry in kern/subr_witness.c specifying the ifnet_sx -> in_multi_sx ordering as correct. That should get better information. Likely the issue is a cycle of length larger than 2 (e.g. in_multi_sx before mystery lock, mystery lock before ifnet_sx, ifnet_sx before in_multi_sx).

Thanks! That seems to explain it.
Adding ifnet_sx -> in_multi_sx as correct gets us a warning about in_multi_sx -> iflib ctx lock, and adding that one finally gets us a warning about iflib ctx lock -> in_multi_sx. So we've got three locks involved here.

We see the ifnet_sx -> in_multi_sx order when we tear down a jail and need to detach interfaces from the jail (vnet_if_return()) and thus tear down the multicast bits (in in_ifdetach()).

We see the iflib ctx lock -> ifnet_sx order during interface attach, where we add a group to the interface in ether_ifattach() -> if_attach_internal():

lock order reversal:
 1st 0xfffff80003dcad78 iflib ctx lock (iflib ctx lock, sx) @ /usr/src/sys/net/iflib.c:4646
 2nd 0xffffffff81d9dad8 ifnet_sx (ifnet_sx, sx) @ /usr/src/sys/net/if.c:1474
lock order iflib ctx lock -> ifnet_sx attempted at:
#0 0xffffffff80c543c2 at witness_checkorder+0xdf2
#1 0xffffffff80bef9d7 at _sx_xlock+0x67
#2 0xffffffff80cfc7bf at if_addgroup+0x5f
#3 0xffffffff80cfb7ca at if_attach_internal+0x9a
#4 0xffffffff80d05839 at ether_ifattach+0x29
#5 0xffffffff80d1330e at iflib_device_register+0x10ae
#6 0xffffffff80d17605 at iflib_device_attach+0xb5
#7 0xffffffff80c1feea at device_attach+0x3ca
#8 0xffffffff80c1fa90 at device_probe_and_attach+0x70
#9 0xffffffff80c210f8 at bus_generic_attach+0x18
#10 0xffffffff80846b10 at pci_attach+0xe0
#11 0xffffffff80f04a09 at acpi_pci_attach+0x19
#12 0xffffffff80c1feea at device_attach+0x3ca
#13 0xffffffff80c1fa90 at device_probe_and_attach+0x70
#14 0xffffffff80c210f8 at bus_generic_attach+0x18
#15 0xffffffff80f08800 at acpi_pcib_pci_attach+0xa0
#16 0xffffffff80c1feea at device_attach+0x3ca
#17 0xffffffff80c1fa90 at device_probe_and_attach+0x70

The final order, in_multi_sx -> iflib ctx lock happens when we join a multicast group:

lock order reversal:
 1st 0xffffffff81d9fc08 in_multi_sx (in_multi_sx, sx) @ /usr/src/sys/netinet/in_mcast.c:1212
 2nd 0xfffff80005841d78 iflib ctx lock (iflib ctx lock, sx) @ /usr/src/sys/net/iflib.c:4219
lock order in_multi_sx -> iflib ctx lock attempted at:
#0 0xffffffff80c543c2 at witness_checkorder+0xdf2
#1 0xffffffff80bef9d7 at _sx_xlock+0x67
#2 0xffffffff80d1ef0f at iflib_if_ioctl+0x8f
#3 0xffffffff80cffa5d at if_addmulti+0x3fd
#4 0xffffffff80d8bacd at in_joingroup_locked+0x27d
#5 0xffffffff80d8b822 at in_joingroup+0x42
#6 0xffffffff80d86ba0 at in_control+0xa50
#7 0xffffffff80d00238 at ifioctl+0x3e8
#8 0xffffffff80c59879 at kern_ioctl+0x289
#9 0xffffffff80c5953a at sys_ioctl+0x12a
#10 0xffffffff8103315e at amd64_syscall+0x12e
#11 0xffffffff810060de at fast_syscall_common+0xf8

Now that we know why it happens, we can try to work out a way to break this locking loop. At first glance I don't see an obvious place to do so, though.

sys/net/if.c
483 ↗(On Diff #79803)

Would it be possible to do it a bit differently?

  1. Introduce if_link_ifnet() and if_unlink_ifnet(), responsible for linking/unlinking an ifp from V_ifnet and maintaining the ifnet counters. The former can be factored out of if_attach_internal() and the latter out of if_detach_internal().
  2. Assume that callers of if_detach_internal() and if_vmove() have to unlink the ifnet manually. As there are only two callers, that's not a big deal.
  3. Here, as the first step after acquiring IFNET_WLOCK(), allocate an array of ifnet pointers of size curvnet->vnet_ifcnt.
  4. Instead of calling if_vmove(), call if_unlink_ifnet() and save the ifnet pointer into the array from (3).
  5. After dropping IFNET_WLOCK, iterate through the array and call if_vmove() for each ifnet.

This will solve LORs and avoid having another instance of recursed locks.
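
Steps (1) and (2) might look roughly like this (a sketch; the names follow the suggestion above, and the list operations are assumed):

/* Both helpers assume the caller holds IFNET_WLOCK. */
static void
if_link_ifnet(struct ifnet *ifp)
{
	IFNET_WLOCK_ASSERT();
	CK_STAILQ_INSERT_TAIL(&V_ifnet, ifp, if_link);
	curvnet->vnet_ifcnt++;
}

static void
if_unlink_ifnet(struct ifnet *ifp)
{
	IFNET_WLOCK_ASSERT();
	CK_STAILQ_REMOVE(&V_ifnet, ifp, ifnet, if_link);
	curvnet->vnet_ifcnt--;
}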

sys/net/if.c
483 ↗(On Diff #79803)

That's a good notion, yes.

It doesn't fix the epair-related panic, sadly. It's worth keeping in mind though, because we need something to protect the V_ifnet list access (and we don't have anything at all right now).

I'm experimenting with a similar approach, where if_vmove() now assumes it will be called with the IFNET_WLOCK held, and if_detach_internal releases it before in_ifdetach().

Alternatively, I wonder if a 'delete interface' lock would fix this issue as well.

sys/net/if.c
483 ↗(On Diff #79803)

Doing as this suggests, combined with an explicit "destroy the interface" lock, as below appears to fix the epair/vnet cleanup panic as well:

diff --git a/sys/net/if.c b/sys/net/if.c
index 481a3c23ef8..a5415b6e9a1 100644
--- a/sys/net/if.c
+++ b/sys/net/if.c
@@ -314,6 +314,9 @@ VNET_DEFINE(struct ifnet **, ifindex_table);
 struct sx ifnet_sxlock;
 SX_SYSINIT_FLAGS(ifnet_sx, &ifnet_sxlock, "ifnet_sx", SX_RECURSE);

+struct sx ifnet_del_sxlock;
+SX_SYSINIT(ifnet_del, &ifnet_del_sxlock, "ifnet_del_sx");
+
 /*
  * The allocation of network interfaces is a rather non-atomic affair; we
  * need to select an index before we are ready to expose the interface for
@@ -536,9 +539,11 @@ vnet_if_return(const void *unused __unused)
        }
        IFNET_WUNLOCK();

+       sx_xlock(&ifnet_del_sxlock);
        CK_STAILQ_FOREACH_SAFE(ifp, &ifnets, if_link, nifp) {
                if_vmove(ifp, ifp->if_home_vnet);
        }
+       sx_xunlock(&ifnet_del_sxlock);
 }
 VNET_SYSUNINIT(vnet_if_return, SI_SUB_VNET_DONE, SI_ORDER_ANY,
     vnet_if_return, NULL);
@@ -2996,8 +3001,11 @@ ifioctl(struct socket *so, u_long cmd, caddr_t data, struct thread *td)
                goto out_noref;
        case SIOCIFDESTROY:
                error = priv_check(td, PRIV_NET_IFDESTROY);
-               if (error == 0)
+               if (error == 0) {
+                       sx_xlock(&ifnet_del_sxlock);
                        error = if_clone_destroy(ifr->ifr_name);
+                       sx_xunlock(&ifnet_del_sxlock);
+               }
                goto out_noref;

        case SIOCIFGCLONERS:

That serialises interface destruction, of course, but that seems like a small price to pay for not panicking half the time you destroy an epair and a vnet.

I'll try to clean up this change and push a separate review for this cleanup lock thing.

sys/net/if.c
483 ↗(On Diff #79803)

Yay! It's awesome we have a solution for the panic.

Though, we still need some sort of safety protection for ifnet list traversal in vnet_if_return().
Is there any chance we can use existing IFNET_WLOCK in SIOCIFDESTROY so we don't overcomplicate the locking model by introducing a new lock?

sys/net/if.c
483 ↗(On Diff #79803)

Unfortunately that doesn't appear to be possible. Using IFNET_WLOCK to protect SIOCIFDESTROY produces the LOR we're trying to avoid:

login: lock order reversal:
 1st 0xffffffff81d9dad8 ifnet_sx (ifnet_sx, sx) @ /usr/src/sys/net/if.c:1625
 2nd 0xffffffff81d9fc08 in_multi_sx (in_multi_sx, sx) @ /usr/src/sys/netinet/in.c:1045
lock order ifnet_sx -> in_multi_sx attempted at:
#0 0xffffffff80c4838c at witness_checkorder+0xdcc
#1 0xffffffff80be3a47 at _sx_xlock+0x67
#2 0xffffffff80d7bf14 at in_ifdetach+0x24
#3 0xffffffff80cf03da at if_detach_internal+0x1ea
#4 0xffffffff80cf0031 at if_detach+0x51
#5 0xffffffff82b88d58 at epair_clone_destroy+0x98
#6 0xffffffff80cf759a at if_clone_destroyif+0x19a
#7 0xffffffff80cf7381 at if_clone_destroy+0x1f1
#8 0xffffffff80cf4237 at ifioctl+0x357
#9 0xffffffff80c4d829 at kern_ioctl+0x289
#10 0xffffffff80c4d4ea at sys_ioctl+0x12a
#11 0xffffffff8102715e at amd64_syscall+0x12e
#12 0xffffffff80ffa0ae at fast_syscall_common+0xf8

We do indeed also need to protect the V_ifnet list in vnet_if_return(). Your suggested approach will work for that. I'll split this patch into two: one to address the V_ifnet list protection in vnet_if_return() and one to introduce the new ifnet_del_sxlock to fix the epair/vnet cleanup panic.

Do not hold IFNET_WLOCK() during if_vmove()

This means this change no longer fixes the observed vnet/epair destruction
panic, but it does fix the lack of protection for V_ifnet in vnet_if_return().

kp retitled this revision from if: Acquire ifnet lock when returning interfaces to their home vnet to if: Protect V_ifnet in vnet_if_return(). Tue, Nov 24, 5:31 PM
kp edited the summary of this revision. (Show Details)
sys/net/if.c
485 ↗(On Diff #79938)

Nit: given that both vmove and the return value are, in fact, booleans, maybe it's worth declaring them as bool explicitly?

534 ↗(On Diff #79938)

It would be better if we could avoid changing if_link.

Other callers that may be iterating V_ifnet at the same time could end up switching over to the local ifnets list, because we're changing the if_link of the ifp entry.

  • Use bool rather than int
  • Avoid reusing if_link
kp marked an inline comment as done. Tue, Nov 24, 9:56 PM
kp added inline comments.
sys/net/if.c
534 ↗(On Diff #79938)

That's a good point.

I assumed this was safe because the CK_LISTs are safe to modify with concurrent readers and one writer, but that's only the case if you don't add the element to a different list as you pointed out.

Thank you for addressing the comments!
Looks good, added another minor comment inside.

sys/net/if.c
511 ↗(On Diff #79960)

Sorry, this can potentially be a bit problematic: since we don't protect the read of vnet_ifcnt here, there may be more interfaces by the time we get to the IFNET_WLOCK part, which could lead to stack memory corruption.

Also, we don't know how big vnet_ifcnt will be. I'd rather be cautious and avoid allocating an unknown amount of space on the stack, as the stack size is limited.

Given that IFNET_WLOCK is an sx lock, maybe it's worth allocating the memory explicitly there, avoiding the problems with vnet_ifcnt potentially changing.
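
That suggestion might be sketched like this (hedged; M_IFNET, the unlink helper, and the loop condition are assumptions based on the discussion above):

struct ifnet *ifp, *nifp, **pending;
int i, found = 0;

IFNET_WLOCK();
/* vnet_ifcnt is now stable, and sx locks may sleep, so M_WAITOK is fine. */
pending = malloc(sizeof(struct ifnet *) * curvnet->vnet_ifcnt,
    M_IFNET, M_WAITOK | M_ZERO);
CK_STAILQ_FOREACH_SAFE(ifp, &V_ifnet, if_link, nifp) {
	if (ifp->if_home_vnet != ifp->if_vnet) {
		if_unlink_ifnet(ifp);
		pending[found++] = ifp;
	}
}
IFNET_WUNLOCK();

/* Lock dropped: if_vmove() may now sleep without the LOR. */
for (i = 0; i < found; i++)
	if_vmove(pending[i], pending[i]->if_home_vnet);

free(pending, M_IFNET);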

Dynamically allocate the pending array

This revision is now accepted and ready to land. Tue, Nov 24, 10:32 PM
This revision was automatically updated to reflect the committed changes.