Page MenuHomeFreeBSD

Fix for double free when deleting entries from epoch managed lists
Needs ReviewPublic

Authored by hselasky on Oct 17 2018, 5:33 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Mar 7, 1:09 PM
Unknown Object (File)
Thu, Mar 7, 11:40 AM
Unknown Object (File)
Jan 31 2024, 11:57 AM
Unknown Object (File)
Jan 9 2024, 12:39 AM
Unknown Object (File)
Dec 20 2023, 4:34 AM
Unknown Object (File)
Aug 13 2023, 12:13 AM
Unknown Object (File)
Aug 13 2023, 12:13 AM
Unknown Object (File)
Jun 30 2023, 3:02 AM

Details

Reviewers
mmacy
slavash
kib
bz
mjg
Group Reviewers
network
Summary

In order to avoid races, epoch_wait() must be called between removing an interface address entry and before calling ifa_free(). Else concurrent threads may still reference the old interface address, IFA, still visible and this can lead up to a double free situation. To detect this situation let ifa_ref() use the function refcount_acquire_if_not_zero() to detect when the IFA is still visible, but can no longer be referenced, because it is scheduled for free. This way both iterating and freeing IFA's in an epoch section becomes possible and this also simplifies the code in question.

Test Plan
#!/usr/local/bin/bash
# Script to reproduce hang

for i in {1..255}
do
(ifconfig mce0.$i create; 
ifconfig mce0.$i inet 224.0.0.$i;
ifconfig mce0.$i destroy) &
done

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 20287

Event Timeline

Again, please upoad the context.

hselasky retitled this revision from Calling epoch_wait() under VLAN_XLOCK() leads to deadlock to Fix for deadlock and race deleting entries for epoch managed lists.
hselasky edited the summary of this revision. (Show Details)
hselasky edited the test plan for this revision. (Show Details)
hselasky added a reviewer: bz.

Updated patch with full context.

Make sure the return value of ifa_ref() is checked.

hselasky edited the summary of this revision. (Show Details)
hselasky added a reviewer: mjg.
hselasky added a subscriber: mjg.

Use already existing refcount_acquire_if_not_zero(). Suggested by @mjg .

hselasky retitled this revision from Fix for deadlock and race deleting entries for epoch managed lists to Fix for double free when deleting entries from epoch managed lists.
hselasky edited the summary of this revision. (Show Details)

@hselasky have you had pho@ test this change?

No, @pho has not tested this change.

No, @pho has not tested this change.

Nice test scenario!

So the original problem was a hang?

So these:
panic: starting DAD on non-tentative address 0xfffff8010c311000
https://people.freebsd.org/~pho/stress/log/epoch.txt

Fatal trap 9: general protection fault while in kernel mode
https://people.freebsd.org/~pho/stress/log/epoch-2.txt
are unrelated?

I'll proceed with testing your patch.

Hi @pho,

The initial problem was a deadlock which was fixed by https://svnweb.freebsd.org/changeset/base/339588 and after that I experienced some panics and started looking at the code.

--HPS

So these:
panic: starting DAD on non-tentative address 0xfffff8010c311000
https://people.freebsd.org/~pho/stress/log/epoch.txt

Might fix this one.

Fatal trap 9: general protection fault while in kernel mode
https://people.freebsd.org/~pho/stress/log/epoch-2.txt

There are probably missing checks for IFF_DYING before accessing the ifp. I started making a patch, but realized the problem was deeper. See: https://reviews.freebsd.org/D17617

--HPS

Removed unrelated patches from if_vlan.c .

Removed unrelated patches from if_vlan.c .

../../../netinet6/in6.c:1460:2: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result]
        ifa_ref(&ia->ia_ifa);

@pho: Must have been some recent patches. I'll update my patch shortly.

Update patches for latest FreeBSD 13-current.

Update patches for latest FreeBSD 13-current.

The fault is still there: https://people.freebsd.org/~pho/stress/log/epoch2.txt

@pho : Is the other fault you referred still there?

@pho : Is the other fault you referred still there?

The page fault seen in epoch-2.txt is the only problem I see now. I also this on a pristine HEAD.
The page fault is very easy to get, so it may shadow for any other problems.

I don't see any panics due to this, so it is not a big issue.

This patch likely needs to be rebased before we can continue.

It is still an issue with HEAD:

:
:
if_delmulti_locked: detaching ifnet instance 0xfffff803d7598000
if_delmulti_locked: detaching ifnet instance 0xfffff803d7598000
if_delmulti_locked: detaching ifnet instance 0xfffff803bc652800
if_delmulti_locked: detaching ifnet instance 0xfffff803bc652800
if_delmulti_locked: detaching ifnet instance 0xfffff803bc652800
if_delmulti_locked: detaching ifnet instance 0xfffff803bc652800
if_delmulti_locked: detaching ifnet instance 0xfffff803bc652800
if_delmulti_locked: detaching ifnet instance 0xfffff803bc652800


Fatal trap 9: general protection fault while in kernel mode
cpuid = 10; apic id = 0a
instruction pointer     = 0x20:0xffffffff80cf1ce5
stack pointer           = 0x28:0xfffffe00aa7e76b0
frame pointer           = 0x28:0xfffffe00aa7e7700
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 5492 (ifconfig)
trap number             = 9
panic: general protection fault
cpuid = 10
time = 1567437407
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00aa7e73c0
vpanic() at vpanic+0x19d/frame 0xfffffe00aa7e7410
panic() at panic+0x43/frame 0xfffffe00aa7e7470
trap_fatal() at trap_fatal+0x39c/frame 0xfffffe00aa7e74d0
trap() at trap+0x6a/frame 0xfffffe00aa7e75e0
calltrap() at calltrap+0x8/frame 0xfffffe00aa7e75e0
--- trap 0x9, rip = 0xffffffff80cf1ce5, rsp = 0xfffffe00aa7e76b0, rbp = 0xfffffe00aa7e7700 ---
vlan_ioctl() at vlan_ioctl+0x105/frame 0xfffffe00aa7e7700
ifhwioctl() at ifhwioctl+0x2ad/frame 0xfffffe00aa7e7780
ifioctl() at ifioctl+0x52d/frame 0xfffffe00aa7e7850
kern_ioctl() at kern_ioctl+0x295/frame 0xfffffe00aa7e78b0
sys_ioctl() at sys_ioctl+0x15d/frame 0xfffffe00aa7e7980
amd64_syscall() at amd64_syscall+0x2d4/frame 0xfffffe00aa7e7ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00aa7e7ab0
--- syscall (54, FreeBSD ELF64, sys_ioctl), rip = 0x80047f29a, rsp = 0x7fffffffe3e8, rbp = 0x7fffffffe4b0 ---
KDB: enter: panic
[ thread pid 5492 tid 100148 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db> x/s version
version:        FreeBSD 13.0-CURRENT #0 r351673M: Mon Sep  2 01:26:53 CEST 2019\012    pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO\012
db>

I agree that if_ref() and ifa_ref() should do refcount_acquire_if_not_zero(). Haven't reviewed the whole change, though. Is the reproducer still producing the panic with fresh -current?

I agree that if_ref() and ifa_ref() should do refcount_acquire_if_not_zero(). Haven't reviewed the whole change, though. Is the reproducer still producing the panic with fresh -current?

I have not seen any panics with the reproducer on:

FreeBSD mercat1.netperf.freebsd.org 14.0-CURRENT FreeBSD 14.0-CURRENT #1 main-n251492-d2de68811a80b-dirty: Thu Dec  9 17:05:30 CET 2021     pho@mercat1.netperf.freebsd.org:/usr/src/sys/amd64/compile/PHO  amd64