Details

Reviewers

rgrimes
ae
kp
glebius
jhb
avg
bz

Group Reviewers

network

Summary

This pushes us towards a mode of operation similar to other teardown routines, where we're guaranteed to be free of clashing operations after if_detach/ether_ifdetach returns. Various operations (e.g. ioctl handling, vnet moves in the future) will busy/unbusy the ifnet to block detaching until they've completed.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 30472

Event Timeline

kevans created this revision.Dec 5 2019, 2:56 PM

Herald added subscribers: ae, imp. · View Herald TranscriptDec 5 2019, 2:56 PM

Harbormaster completed remote builds in B27981: Diff 65265.Dec 5 2019, 2:56 PM

kevans mentioned this in D20468: if_vether, ported from OpenBSD.Dec 5 2019, 4:03 PM

Specifically tagging @ae and @kp as well, as they've written or maintained some of the drivers that could benefit from a callback to do this deferred cleanup of softc.

Explicitly adding @glebius as well, as he's well-versed in ifnet lifetime issues.

I might of used tunif_freed inplace of tunifdtor, but thats just me wanting grep to find these :-)

This revision is now accepted and ready to land.Dec 5 2019, 7:34 PM

We once discussed this issue with John back when I was working on projects/ifnet branch in subversion. Of course now with epoch in place and if_detach() waiting on all callbacks the situation is very different to what we had in 2015.

I'm trying to find discussion in archives, meanwhile adding John.

I don't like this change. The problem comes from the fact that we are allowing to run administrative detach in parallel with any other ifioctl(). I believe there are much more issues here except this one. This is just a can of worms as long as we keep these two things to run in parallel. Back when resolving all possible races in carp(4) I just have created a single sx lock that serialized any possible CARP reconfigurations. And this isn't a performance critical path - nobody suffers from this serialization. I'm inclined that we need same approach at interface wide level. Not necessarily a global sx, a per-interface sx should work. And at the point where ifioctl() today does ifunit_ref() it should lock the sx. Same thing should if_clone_destroy() do instead of taking reference.

In D22691#496569, @glebius wrote:

I don't like this change. The problem comes from the fact that we are allowing to run administrative detach in parallel with any other ifioctl(). I believe there are much more issues here except this one. This is just a can of worms as long as we keep these two things to run in parallel. Back when resolving all possible races in carp(4) I just have created a single sx lock that serialized any possible CARP reconfigurations. And this isn't a performance critical path - nobody suffers from this serialization. I'm inclined that we need same approach at interface wide level. Not necessarily a global sx, a per-interface sx should work. And at the point where ifioctl() today does ifunit_ref() it should lock the sx. Same thing should if_clone_destroy() do instead of taking reference.

Sure- I had considered this approach as well, but didn't immediately see the downside in letting the in-process ioctl continue and just freeing softc at the end. I can see where that'd be less than ideal, though.

One thing I am unsure about is how to handle the possibility that we've already entered the SIOCIFDESTROY path and taken the lock when another thread or two pops in and tries to execute another ioctl but waits on the sx in ifhwioctl. Presumably the destroy path would have to refrain from destroying the sx because the other patch could have taken it in the small window between releasing it and destroying it, right?

ae added a reviewer: avg.Dec 6 2019, 6:50 AM

My view is that if_detach() (and ether_ifdetach which calls it) needs to be a blocking wait that ensures that any in-process ioctls have finished and exited the ifnet's routines as well as forbidding new ones before it returns. This matches the pattern of other teardown routines like bus_teardown_intr, callout_drain, taskqueue_drain, etc. Many drivers in the tree assume these semantics (except a few that still call ifdetach "late").

kevans updated this revision to Diff 65512.Dec 11 2019, 10:31 PM

kevans retitled this revision from ifnet: add callback for post-free cleanup to ifnet: drain+halt ioctl/vmove operations on detach.

kevans edited the summary of this revision. (Show Details)

This revision now requires review to proceed.Dec 11 2019, 10:31 PM

Harbormaster completed remote builds in B28087: Diff 65512.Dec 11 2019, 10:31 PM

Adding bz since I started touching some vimage parts.

jhb added inline comments.Dec 11 2019, 11:01 PM

sys/net/if.c
1324	You don't have to use the refcount API if you are doing all operations under a lock. OTOH, I think a better fix is to avoid the lock for the release unless you need it. I would perhaps call this a "busy" count to match terminology we use for other parts of the kernel and add little wrappers if_busy / if_unbusy that just call refcount_acquire and refcount_release. You can drop the CV and instead use refcount_sleep() in if_detach() to wait for the reference count to drop to zero. You would still need the lock to gate checking the IFF_DETACHING flag before doing if_busy, but that's about it. The vnet moving thing though makes this more complicated. I'd prefer that we solve that by using a suitable busy/unbusy of the ifnet instead of adding the explicit check for destroying the vnet. Fore example, maybe vnet destroy busies all the ifp's and then unbusies them once they are moved if that works?

This seems to break the build:

In file included from /usr/src/contrib/ntp/libntp/systime.c:10:
In file included from /usr/src/contrib/ntp/include/ntp.h:13:
In file included from /usr/src/contrib/ntp/include/ntp_stdlib.h:14:
In file included from /usr/src/contrib/ntp/include/ntp_net.h:19:
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/if_var.h:431:12: error: field has incomplete type 'struct cv'
        struct cv if_mgmt_cv;
                  ^
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/if_var.h:431:9: note: forward declaration of 'struct cv'
        struct cv if_mgmt_cv;
               ^
--- all_subdir_usr.bin ---
yacc: 1 shift/reduce conflict, 16 reduce/reduce conflicts.

I've not yet looked any deeper.

In D22691#498397, @kp wrote:

This seems to break the build:

In file included from /usr/src/contrib/ntp/libntp/systime.c:10:
In file included from /usr/src/contrib/ntp/include/ntp.h:13:
In file included from /usr/src/contrib/ntp/include/ntp_stdlib.h:14:
In file included from /usr/src/contrib/ntp/include/ntp_net.h:19:
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/if_var.h:431:12: error: field has incomplete type 'struct cv'
        struct cv if_mgmt_cv;
                  ^
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/net/if_var.h:431:9: note: forward declaration of 'struct cv'
        struct cv if_mgmt_cv;
               ^
--- all_subdir_usr.bin ---
yacc: 1 shift/reduce conflict, 16 reduce/reduce conflicts.

I've not yet looked any deeper.

Whoops- failed to check/build if any userland stuff included if_var.h... only buildkernel+test. Will remedy in the next iteration.

Still haven't looked at it, but it seems to fix PR 238870. No more panics (in a couple of runs of) pf:names / pf:synproxy tests, so as far as I'm concerned this cannot get merged quickly enough.

kevans added inline comments.Dec 12 2019, 4:28 AM

sys/net/if.c
1324	The semantics of if_vmove need to change a little bit, too, if we go that route... and that's probably actually a good thing. Right now, if_vmove isn't mutually exclusive with ioctl handlers as well, but we should probably be treating that more like a temporary detach; drain ioctl handlers similarly and prevent if_vmove() on an interface that's already trying to move. Then, vnet_destroy busies all of its interfaces that are getting re-homed and flags them as IFF_DETACHED to prevent any vnet motion before we get a chance to move them back to their home later on in the shutdown procedure. If done properly, I think this also closes the remainder of the vnet race that I describe later in if_detach_internal where shutdown can commence in the middle of if_detach. But +1 for dropping the cv, removing the additional dependency of the header (pointed out by kp) and reducing the weight (because I'd still like to try and produce something MFC'able to stable/12).

kevans added inline comments.Dec 12 2019, 6:17 AM

sys/net/if.c
1324	Perhaps if_vmove draining ioctl handlers is unnecessary, but it should at least block any other if_vmove and further ioctls until the info has stabilized in the new vnet.

This is now stacked on D22780 and D22783, which separate out some of the other vnet ordering issues.

kevans added parent revisions: D22780: if_vmove: return proper error status, D22783: if_detach_internal: defer reading of vnet_shutdown until needed.Dec 12 2019, 6:21 PM

Harbormaster completed remote builds in B28106: Diff 65557.Dec 12 2019, 6:21 PM

kp mentioned this in D22780: if_vmove: return proper error status.Dec 12 2019, 7:12 PM

With all three of these patches I now see this panic running the pf tests:

panic: hashdestroy: hashtbl 0xfffff80045423000 not empty (malloc type ifaddr)
cpuid = 5
time = 1576251489
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0061f20830
vpanic() at vpanic+0x17e/frame 0xfffffe0061f20890
panic() at panic+0x43/frame 0xfffffe0061f208f0
hashdestroy() at hashdestroy+0x54/frame 0xfffffe0061f20900
vnet_destroy() at vnet_destroy+0x123/frame 0xfffffe0061f20930
prison_deref() at prison_deref+0x29d/frame 0xfffffe0061f20970
sys_jail_remove() at sys_jail_remove+0x290/frame 0xfffffe0061f209c0
amd64_syscall() at amd64_syscall+0x2d6/frame 0xfffffe0061f20af0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0061f20af0
--- syscall (508, FreeBSD ELF64, sys_jail_remove), rip = 0x80031bfaa, rsp = 0x7fffffffe848, rbp = 0x7fffffffe8d0 ---
KDB: enter: panic
[ thread pid 1407 tid 100482 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1082616(%rip)
db>

In D22691#499064, @kp wrote:

With all three of these patches I now see this panic running the pf tests:

panic: hashdestroy: hashtbl 0xfffff80045423000 not empty (malloc type ifaddr)
cpuid = 5
time = 1576251489
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0061f20830
vpanic() at vpanic+0x17e/frame 0xfffffe0061f20890
panic() at panic+0x43/frame 0xfffffe0061f208f0
hashdestroy() at hashdestroy+0x54/frame 0xfffffe0061f20900
vnet_destroy() at vnet_destroy+0x123/frame 0xfffffe0061f20930
prison_deref() at prison_deref+0x29d/frame 0xfffffe0061f20970
sys_jail_remove() at sys_jail_remove+0x290/frame 0xfffffe0061f209c0
amd64_syscall() at amd64_syscall+0x2d6/frame 0xfffffe0061f20af0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe0061f20af0
--- syscall (508, FreeBSD ELF64, sys_jail_remove), rip = 0x80031bfaa, rsp = 0x7fffffffe848, rbp = 0x7fffffffe8d0 ---
KDB: enter: panic
[ thread pid 1407 tid 100482 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1082616(%rip)
db>

I suspect this would be fine with if ((ifp->if_flags & IFF_VMOVING) != 0) patched out in the SYSUNINIT releasing interfaces back to their home vnet. I'm slightly uncomfortable with how some of this works -- this teardown happens at SI_SUB_PROTO_DOMAIN, but still we have some interfaces that were created in this vnet that will be torn down later at SI_SUB_PROTO_IF. Intuitively, I would think we'd want to make sure all interfaces are moved out or destroyed (I "broke" the former, but I think the latter's still an issue) before we get to that point -- in which case the answer would instead be to bump SYSUNINITs that are detaching the other cloned interfaces to SI_SUB_PROTO_END or earlier in the process, maybe convert the vnet ifcnt to refcount(9) API and stick a refcount_wait in this bad boy before we get too SI_SUB_DOMAIN to be sure that all interfaces have vacated.

I haven't looked at how those interfaces operate, though. In hindsight it's probably fine and we should force the move+detach done by patching out the VMOVING check.

Force removal of ifnet from a vnet, even if another thread's come along to destroy it -- we need to ensure it's gone before we proceed with sysinits or else we're just racing if_detach_internal vs. vnet sysuninits.

Harbormaster completed remote builds in B28331: Diff 66011.Dec 27 2019, 1:31 AM

I'm going to rebase this tomorrow and remove the vnet parts, focusing on just ioctl problems for the moment. I suspect vnet should integrate with the busy mechanism in some fashion, but there's still other work going on in that area.

Herald added a subscriber: melifaro. · View Herald TranscriptApr 13 2020, 4:42 AM

Rebase; remove VIMAGE-related changes, just focus on ioctl/detach for now. VIMAGE should get factored in at some point so that ioctls complete before an ifnet changes vnet, but this can be done later.

Harbormaster completed remote builds in B30472: Diff 70519.Apr 13 2020, 9:14 PM

Ping

zarychtam_plan-b.pwste.edu.pl added a subscriber: zarychtam_plan-b.pwste.edu.pl.May 14 2020, 6:45 AM

Looks sane to me (and passes regression tests), but I'm not the expert on this part of the stack,

This revision is now accepted and ready to land.May 25 2020, 4:03 PM

I'd like to consider committing this in the next couple of weeks, in case anyone else has plans to put eyes on it.

I intend to commit this by the middle of the next week, given that it's received no further feedback.

The resulting if_busy() / if_unbusy() / if_busy_wait() semantics looks very similar to sx locking. I'm wondering why not simply use sx locks as @glebius suggested?

In D22691#569148, @melifaro wrote:

The resulting if_busy() / if_unbusy() / if_busy_wait() semantics looks very similar to sx locking. I'm wondering why not simply use sx locks as @glebius suggested?

I think, through the course of this, I became convinced that sx usage complicates some possible future busy'ers that would no longer be able to hold a mutex across calls here. Plus, we'd just end up implementing refcount+sx+cv anyways for more complexity and little gain.

In D22691#569217, @kevans wrote:

In D22691#569148, @melifaro wrote:

The resulting if_busy() / if_unbusy() / if_busy_wait() semantics looks very similar to sx locking. I'm wondering why not simply use sx locks as @glebius suggested?

I think, through the course of this, I became convinced that sx usage complicates some possible future busy'ers that would no longer be able to hold a mutex across calls here.

I don't understand why can't we hold mutexes across calls. If we lock sx near the top of the stack, then we can take any mutexes later.

Plus, we'd just end up implementing refcount+sx+cv anyways for more complexity and little gain.

Well, we aren't implementing anything here, just using sx. Rather, the suggested combo of mutex+blockcount looks more like a new primitive implementation.

sys/net/if.c
2473	I always assumed flags are protected by IFNET_WLOCK.

In D22691#570099, @glebius wrote:

In D22691#569217, @kevans wrote:

In D22691#569148, @melifaro wrote:

The resulting if_busy() / if_unbusy() / if_busy_wait() semantics looks very similar to sx locking. I'm wondering why not simply use sx locks as @glebius suggested?

I think, through the course of this, I became convinced that sx usage complicates some possible future busy'ers that would no longer be able to hold a mutex across calls here.

I don't understand why can't we hold mutexes across calls. If we lock sx near the top of the stack, then we can take any mutexes later.

This isn't always possible, e.g., https://svnweb.freebsd.org/base/head/sys/netinet6/ip6_mroute.c?view=markup#l660 -> X_ip6_mrouter_done will hold the mrouter6_mtx across the call to if_detach; perhaps it could drop it a little bit earlier, but I don't know that we want to expose the busy lock/sx to if_detach callers or any future callers that end up busy'ing stuff... this feels dirty.

Plus, we'd just end up implementing refcount+sx+cv anyways for more complexity and little gain.

Well, we aren't implementing anything here, just using sx. Rather, the suggested combo of mutex+blockcount looks more like a new primitive implementation.

There was a reason I didn't think we could hold the shared lock across ifhwioctl (now ifhwioctl_internal) and use just an sx, but I'll need to look over it again because that's long-since been paged out of my brain.

This is no longer worth the continuous stress from absent reviewers and failure (mostly on my part) to communicate... someone else is free to pick something like this up and do the later work to fix the follow-up vnet issue, I'll keep this to myself.

markj added a subscriber: markj.Jul 23 2020, 6:40 PM

Hi,

I have a simpler, but similar patch at:
https://bz-attachments.freebsd.org/attachment.cgi?id=221506

Is this version more acceptable?

--HPS

I (still) really really dislike the idea of adding more boilerplate on basically every driver to workaround a nearly-universal problem.

kevans added a comment.Jan 12 2021, 10:22 PM

This comment was removed by kevans.

@kevans

It is too late to drain everyting inside if_detach() ?

<Drain needs to happen here before optinal mii-bus is detached!>
detach mii-bus
if_detach()
if_free()

We need a new API for this :-(

• hselasky mentioned this in D28136: Make sure user-space calls are drained when detaching network interfaces..Jan 13 2021, 2:57 PM

ifnet: drain+halt ioctl operations on detach
AbandonedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 70519

sys/net/if.h

sys/net/if.c

sys/net/if_tuntap.c

sys/net/if_var.h

ifnet: drain+halt ioctl operations on detachAbandonedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 70519

sys/net/if.h

sys/net/if.c

sys/net/if_tuntap.c

sys/net/if_var.h

ifnet: drain+halt ioctl operations on detach
AbandonedPublic
Actions

Revision Contents
Changeset List