Page MenuHomeFreeBSD

Make sure user-space calls are drained when detaching network interfaces.
Needs ReviewPublic

Authored by hselasky on Jan 13 2021, 2:57 PM.

Details

Summary

Make sure user-space calls are drained when detaching network
interfaces.

This covers ifioctl(9), setsockopt(2) and getsockopt(2) which may call directly
into the network drivers.

Else use-after-free may happen.

MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking

Test Plan

Depends on D30376 .

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

sys/net/if.c
796

We could set a bit in the "ifp" to prevent new slow_refs while draining. Right now that is not an issue.

Do you have a test case? Or an idea of how this can most likely be triggered? I'd love to have a test for this issue as well.

sys/net/if.c
807

That if seems pointless. Am I missing something?

@kp

Run ifconfig ue0 in a loop on a USB interface which has got a mii interface attached to it.

Then unplug the USB interface - panic.

--HPS

I'm sorry I've not yet had the time to try to write a test case for this.

It's important enough that we should probably push it in for 13 anyway.

Can you clarify if_slow_unref() though? I do not understand the point of the if there.

sys/net/if.c
807

We could wakeup the pause() from there. But yes, the if() is just there to prevent the compiler from complaining about unused return code.

So lets' spell that (void)refcount_release()? That's the common idiom for 'ignore the return value here'.

I'd actually expect tests/sys/net/if_clone_tests to trigger this, but I've not had any luck with it so far.

sys/net/if.c
769

Could you clarify where this function is supposed to be used?

sys/net/if.c
769

It is supposed to be used by the network driver itself.

sys/net/if.c
769

And there is no way around this, unfortunately.

If anything this must be blockcount(9) and not refcount(9).

But why do we need two refcounts at all?

@kib

man blockcount
No manual entry for blockcount

blockcount is not MFC-able

But why do we need two refcounts at all?

Because the miibus device that hangs off the ifnet device, does not use the refcount in struct ifnet, and newbus is not refcounted too. So clear use-after-free can happen. The only solution I see, is a proper drain, I.E. stop all calls to mii-bus code, _before_ detaching the ethernet part.

@kib

man blockcount
No manual entry for blockcount

And what?

blockcount is not MFC-able

I do not understand this statement either. Basically you are claiming that we cannot have a progress in the system infrastructure.

But why do we need two refcounts at all?

Because the miibus device that hangs off the ifnet device, does not use the refcount in struct ifnet, and newbus is not refcounted too. So clear use-after-free can happen. The only solution I see, is a proper drain, I.E. stop all calls to mii-bus code, _before_ detaching the ethernet part.

You sentence does not explain why do we need separate refcount.

@kib: Let me start from scratch:

Typical network attach looks like this:

  1. if_alloc() (struct ifnet *)
  2. mii_attach() (device_t)
  3. ether_attach()

Typical network detach looks like this:

  1. device_delete_child() - mii bus.
  2. ether_ifdetach()
  3. if_free()

The refcount in struct ifnet only protects "struct ifnet", but mii-bus uses a device_t and associated softc, which have no [refcount] connection to "struct ifnet".
That means we need to drain all clients of MII-bus before step 1), because device_delete_child has no refcounting system and will free the softc which the mii bus is using!

My proposal and patch is very simple. Stop and drain all if_ioctl() calls, at the beginning of the network device teardown.

Do you see?

--HPS

Added examples of new API usage - requested by kib@ .

It looks like the idea here is to increment a reference during ioctl, drop it at the end of an ioctl, and never increment a reference if it decrements to zero. Then, when destroying, you can simply drop the final reference and then sleep-loop until the reference drops to zero before you move on freeing things.

Rather than the sleep-loop + reference counting, couldn't you accomplish this a bit more easily using an sx, and this way you wouldn't have to have the sleep-loop?

Specifically:

if_slow_ref --> sx_try_slock()
if_slow_unref --> sx_sunlock()
if_slow_drain --> sx_xlock() [which never gets unlocked]

Since we use try_slock for ref, we can return false with the same acquire_if_not_zero semantics as exist now. And by doing a xlock on drain, we ensure we're all finished with ioctls at that point, and then atomically block out future ones.

Hi,

Jason, we could possibly use a SX lock too, but I need to check it a bit. SX locks have a WITNESS object when debuggin is enabled and needs to be destroyed.
In your example, you need to point out where to xunlock aswell, because I'm not sure if it is a good idea to destroy locked mutexes.

--HPS

Jason, we could possibly use a SX lock too, but I need to check it a bit. SX locks have a WITNESS object when debuggin is enabled and needs to be destroyed.
In your example, you need to point out where to xunlock aswell, because I'm not sure if it is a good idea to destroy locked mutexes.

If you're planning to destroy the sx, then it means that there's a time at which nothing else will possibly access it, in which case just unlock it before you destroy it.

Ping - any more comments or ideas here?

--HPS

Ping - any more comments or ideas here?

My comment and idea was already written and AFAICT you didn't have additional questions about it. I was waiting to see an implementation before commenting further.

Do you have any comments about my suggestion? Or objections to it?

If the same thread does not do xlock and xunlock, then it may not work. What do you think, because the curthread pointer is typically stored as the owner pointer in the SX lock? Or am I wrong?

--HPS

I'm also thinking we could use a sleepable lock, SRCU, from the LinuxKPI. Would be more efficient!

--HPS

I just wrote up a workaround of calling WITNESS_DESTROY and INIT manually like the vfs code which would work but was hackish, and then realized....

sx_init_flags(..., ..., SX_NOWITNESS);

Bam! Looks like witness-less sx usage is already supported! So that should make things much easier.

I don't think we want LinuxKPI stuff in core networking code, right?

sx is core fbsd data structure and maps well. The SX_NOWITNESS flag should accomplish what we need.

void
_sx_xunlock(struct sx *sx, const char *file, int line)
{

        KASSERT(sx->sx_lock != SX_LOCK_DESTROYED,
            ("sx_xunlock() of destroyed sx @ %s:%d", file, line));
        _sx_assert(sx, SA_XLOCKED, file, line);
        WITNESS_UNLOCK(&sx->lock_object, LOP_EXCLUSIVE, file, line);
        LOCK_LOG_LOCK("XUNLOCK", &sx->lock_object, 0, sx->sx_recurse, file,
            line);
#if LOCK_DEBUG > 0
        _sx_xunlock_hard(sx, (uintptr_t)curthread, file, line);
#else
        __sx_xunlock(sx, curthread, file, line);
#endif
        TD_LOCKS_DEC(curthread);
}

If you look at _sx_xunlock() you see there is an assert that same thread locks and unlocks.

I think that may cause problems.

--HPS

We have EPOCH() in the kernel bug it is not sleepable. if_ioctl() can sleep in some cases!

If you look at _sx_xunlock() you see there is an assert that same thread locks and unlocks.

It seems you're right. Short of mangling the innards of sx, my suggestion might indeed be a dead end.

Indeed mapping this to be epoch would be trivial. But it's not sleepable. Maybe you could use ck_epoch_begin/ck_epoch_end/ck_epoch_synchronize_wait directly? Though at that point, maybe you're right, SRCU is better and more fleshed out.

Maybe it could be under ifdef COMPAT_LINUXKPI , because the LINUXKPI is part of GENERIC now.

Else the current solution also works. Can also push what I have and then work on suggestions for improving it.

--HPS

Another approach would be to add SX_NOASSERTS (in addition to SX_NOWITNESS), which would skip those KASSERTs and such.

Alternativly have a flag you set under the SX lock and then you don't need the lock locked to protect the callback.

Sure, that approach works fine too, and amounts to the same thing you'd arrive at with SRCU or epoch.

Probably the path of least resistance here; I like that best.

if_slow_ref --> sx_slock(); return draining == 0;
if_slow_unref --> sx_sunlock();
if_slow_drain --> sx_xlock(); draining = 1; sx_xunlock();

hselasky edited the summary of this revision. (Show Details)

Does it look better now?

Fix empty line that sneakt in.

Seems clean and simple to me. Nice improvement over the last revision. Let's merge this.

This revision is now accepted and ready to land.May 18 2021, 4:09 PM

We need to tweak this a little bit. There appears to be some sleepable mutex after non-sleepable mutex issues here.

I'm thinking to push this down a bit to only cover the syscall from userspace.

--HPS

hselasky edited the summary of this revision. (Show Details)
This revision now requires review to proceed.May 19 2021, 11:37 AM

We need to tweak this a little bit. There appears to be some sleepable mutex after non-sleepable mutex issues here.

You mean some kernel drivers disable interrupts and then call an ioctl? It sounds like that's behavior that should be fixed, rather than trying to bandaid over that by only making this apply to userspace calls. We want this to apply to kernelspace calls too, since the same issue exists in all contexts.

We need to tweak this a little bit. There appears to be some sleepable mutex after non-sleepable mutex issues here.

You mean some kernel drivers disable interrupts and then call an ioctl? It sounds like that's behavior that should be fixed, rather than trying to bandaid over that by only making this apply to userspace calls. We want this to apply to kernelspace calls too, since the same issue exists in all contexts.

No.

Some kernel drivers call ifhwioctl() from under a mutex. Just grep for it in sys/

The kernel calls, from the network stack, are protected by the network epoch I believe.

--HPS

Some kernel drivers call ifhwioctl() from under a mutex. Just grep for it in sys/

Ahh, so in that case, the rule is, ioctls in general must not sleep. I wonder how uniformly that's applied. Or if it's not uniform, if there's some docs on which ones sleep and which ones do not. Either way, sounds brittle.

I think the root issue is that the drivers are doing detach in the wrong order. You should ether_ifdetach() before you delete the mii children. In general in a driver you need to remove outside references (things like cdevs, ifnets, etc.) before you start tearing down device state itself. While that change is a bit larger and means fixing various drivers, it is the the correct approach. For example, when I did the first round of Giant removal on various 10/100 drivers I was careful to always ether_ifdetach() before stopping callouts or tearing down interrupt handlers for this reason. An approach like this may serve as a workaround, but it is really just papering over buggy driver detach routines.

(While it is true that bus drivers generally want to delete child devices as the first step in their detach methods, mii children of NICs are not a "generic" bus device like a PCI-PCI bridge, but instead mii is really just a pseudo-bus to permit sharing different phy drivers and the MAC + minibus + phy device_t's should be thought of as a single entity in terms of detach/attach. It is certainly true that you can get into some foot shooting if you use devctl to manually detach a phy driver, but the solution to users who do that is Don't Do That(tm).)

In D28136#681788, @jhb wrote:

I think the root issue is that the drivers are doing detach in the wrong order. You should ether_ifdetach() before you delete the mii children. In general in a driver you need to remove outside references (things like cdevs, ifnets, etc.) before you start tearing down device state itself. While that change is a bit larger and means fixing various drivers, it is the the correct approach. For example, when I did the first round of Giant removal on various 10/100 drivers I was careful to always ether_ifdetach() before stopping callouts or tearing down interrupt handlers for this reason. An approach like this may serve as a workaround, but it is really just papering over buggy driver detach routines.

(While it is true that bus drivers generally want to delete child devices as the first step in their detach methods, mii children of NICs are not a "generic" bus device like a PCI-PCI bridge, but instead mii is really just a pseudo-bus to permit sharing different phy drivers and the MAC + minibus + phy device_t's should be thought of as a single entity in terms of detach/attach. It is certainly true that you can get into some foot shooting if you use devctl to manually detach a phy driver, but the solution to users who do that is Don't Do That(tm).)

Can you explain how pending if_ioctl() calls will be flushed? You are right that doing detach first will unlink the ifnet structure, and prevents future ifioctl calls, but there may be concurrent calls still ...

--HPS

hselasky retitled this revision from Fix for use after free when detaching network interface to Make sure user-space calls are drained when detaching network interfaces..
hselasky edited the summary of this revision. (Show Details)
hselasky edited the test plan for this revision. (Show Details)

Take advice from @jhb, to be commited separately and reduce this patch.

Protect setsockopt() and getsockopt() aswell. They also call into kernel-space network drivers.

Can we move forward with this?

Make some exceptions for holding the sleepable epoch. Especially for ifconfig commands that create, move or destroy ifnets.