Page MenuHomeFreeBSD

Widen ifnet_detach_sxlock coverage
ClosedPublic

Authored by kp on Feb 8 2021, 10:07 AM.
Tags
None
Referenced Files
F114564098: D28530.id83527.diff
Mon, Apr 14, 12:33 AM
F114558488: D28530.id83707.diff
Sun, Apr 13, 11:33 PM
F114549747: D28530.id83569.diff
Sun, Apr 13, 9:51 PM
Unknown Object (File)
Thu, Apr 10, 7:08 PM
Unknown Object (File)
Sun, Apr 6, 3:40 PM
Unknown Object (File)
Wed, Mar 26, 2:50 AM
Unknown Object (File)
Mar 4 2025, 3:19 PM
Unknown Object (File)
Feb 25 2025, 5:42 AM

Details

Summary

Widen the ifnet_detach_sxlock to cover the entire vnet sysuninit code.
This ensures that we can't end up having the vnet_sysuninit free the UDP
pcb while the detach code is running and trying to purge the UDP pcb.

MFC after: 1 week

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

kp requested review of this revision.Feb 8 2021, 10:07 AM

Fix shared->exclusive lock order

This revision was not accepted when it landed; it landed in state Needs Review.Feb 11 2021, 8:06 PM
This revision was automatically updated to reflect the committed changes.
markj added inline comments.
sys/net/vnet.c
288

This locking can trigger a deadlock: some vnet sysuninits will deregister eventhandlers, particularly from ifnet_arrival_event and ifnet_departure_event. In order to deregister an eventhandler, we have to remove it and wait for threads executing the event to drain. But, some handlers try to lock ifnet_detach_sxlock, so if they block on it, the thread destroying the VNET won't be able to make progress.

This very occasionally causes hangs when I run the test suite with parallelism enabled.

It feels wrong to acquire this lock here; individual SYSINITs can acquire it if they really need it. Do you remember the details of what goes wrong without this? I will try testing with the locking removed.

sys/net/vnet.c
288

I'm afraid I don't remember details.

I assume (based on the commit message) that the worry was that udp_destroy() was destroying the V_udbinfo inpcbinfo while if_detach_internal() runs in_pcbpurgeif0(V_udbinfo) (via in_ifdetach()).

Perhaps the correct fix is to not take the ifnet_detach_sxlock here, but to have udp_destory() take the IN_MULTI_LOCK(). That covers the in_pcbpurgeif0() in in_ifdetach()

zlei added inline comments.
sys/net/vnet.c
288

This locking can trigger a deadlock: some vnet sysuninits will deregister eventhandlers, particularly from ifnet_arrival_event and ifnet_departure_event. In order to deregister an eventhandler, we have to remove it and wait for threads executing the event to drain.

I bet vnet_destroy() does not need to acquire ifnet_detach_sxlock.

For a vnet, typically,

  1. The cloned interfaces will be destroyed via if_clone_detach() which will invoke if_detach() and acquire ifnet_detach_sxlock.
  2. The vmove loaned interfaces will be returned via vnet_if_return() which invokes if_vmove() under the lock ifnet_detach_sxlock.

But, some handlers try to lock ifnet_detach_sxlock, so if they block on it, the thread destroying the VNET won't be able to make progress.

Can you elaborate which handlers try to lock ifnet_detach_sxlock ?

This very occasionally causes hangs when I run the test suite with parallelism enabled.

It feels wrong to acquire this lock here; individual SYSINITs can acquire it if they really need it. Do you remember the details of what goes wrong without this? I will try testing with the locking removed.

sys/net/vnet.c
288

Can you elaborate which handlers try to lock ifnet_detach_sxlock ?

Actually, now I can't find one. :( I had a VM deadlocked with a thread holding the ifnet_detach_sxlock:

jail_remove taskq   mi_switch+0x26b sleepq_switch+0x18e _sleep+0x4dd _eventhandler_deregister+0x2c1 ipf_event_dereg+0x93 vnet_ipf_uninit+0xf7 vnet_destroy+0x233 prison_deref+0x1180 taskqueue_run_locked+0x3b9 taskqueue_thread_loop+0x138 fork_exit+0xa3 fork_trampoline+0xe

And I presumed that some thread running event handlers is blocked waiting for the detach lock, but I can't see it in procstat output. Nor can I spot it by code inspection. Maybe it's rtnl_handle_dellink(), but I don't know how that gets invoked. Unfortunately I don't have a vmcore to look at.

I still feel it's wrong to acquire the detach lock here; I was able to run the test suite to completion without it.