Page MenuHomeFreeBSD

unionfs: remove LK_UPGRADE if falling back to the standard lock
ClosedPublic

Authored by jah on Mar 26 2023, 2:59 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, May 12, 2:24 PM
Unknown Object (File)
Sat, May 11, 8:08 AM
Unknown Object (File)
Sat, May 11, 8:08 AM
Unknown Object (File)
Sat, May 11, 4:35 AM
Unknown Object (File)
Sat, May 11, 4:35 AM
Unknown Object (File)
Mon, May 6, 10:58 PM
Unknown Object (File)
Sun, Apr 28, 4:45 PM
Unknown Object (File)
Sat, Apr 27, 4:15 AM
Subscribers

Details

Summary

The LK_UPGRADE operation may have temporarily dropped the upper or
lower vnode's lock. If the unionfs vnode was reclaimed during that
window, its lock field will be reset to no longer point at the
upper/lower vnode lock, so the lock operation will use the standard
lock stored in v_lock. Remove LK_UPGRADE from the flags in this case
to avoid a lockmgr assertion, as this lock has not been previously
owned by the calling thread.

Reported by: pho

Remove an impossible condition from unionfs_lock()

We hold the vnode interlock, so vnode private data cannot suddenly
become NULL.

Remove unionfs_islocked()

The implementation is racy; if the unionfs vnode is not in fact
locked, vnode private data may be concurrently altered or freed.
Instead, simply rely upon the standard implementation to query the
v_vnlock field, which is type-stable and will reflect the correct
lower/upper vnode configuration for the unionfs node.

unionfs_mkdir(): handle dvp reclamation

The underlying VOP_MKDIR() implementation may temporarily drop the
parent directory vnode's lock. If the vnode is reclaimed during that
window, the unionfs vnode will effectively become unlocked because
the its v_vnlock field will be reset. To uphold the locking
requirements of VOP_MKDIR() and to avoid triggering various VFS
assertions, explicitly re-lock the unionfs vnode before returning
in this case.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jah requested review of this revision.Mar 26 2023, 2:59 AM

For VOP_MKDIR() change. Please note that almost any VOP modifying metadata could drop the vnode lock. The list is like VOP_CREAT(), VOP_LINK(), VOP_REMOVE(), VOP_WHITEOUT(), VOP_RMDIR(), VOP_MAKEINODE(), VOP_RENAME().

In D39272#894131, @kib wrote:

For VOP_MKDIR() change. Please note that almost any VOP modifying metadata could drop the vnode lock. The list is like VOP_CREAT(), VOP_LINK(), VOP_REMOVE(), VOP_WHITEOUT(), VOP_RMDIR(), VOP_MAKEINODE(), VOP_RENAME().

Thanks, I suspected there were other places I would need to do something similar. VOP_MKDIR() just happens to be the only one that's caused problems in stress testing (so far).
I'd like to try addressing these other cases in a separate later change. I have a few other unionfs fixes and a proposal for removing VV_CROSSLOCK that I'd like to do before that. I also want to consider whether these relocking cases can be handled in a more generic way, perhaps something like a simpler version of null_bypass().

This revision is now accepted and ready to land.Mar 27 2023, 10:04 PM

I ran into this problem:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82753417
stack pointer	        = 0x0:0xfffffe01438a8a80
frame pointer	        = 0x0:0xfffffe01438a8aa0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 2382 (find)
rdi: fffffe015b22d700 rsi:            82000 rdx: fffffe01438a8ad8
rcx:                1  r8:              246  r9:            40000
rax:                0 rbx: fffffe015b22d700 rbp: fffffe01438a8aa0
r10:                1 r11:                0 r12:            80000
r13: fffffe01599802c0 r14: fffffe01438a8ad8 r15:            82000
trap number		= 12
panic: page fault
cpuid = 2
time = 1679981682
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01438a8840
vpanic() at vpanic+0x152/frame 0xfffffe01438a8890
panic() at panic+0x43/frame 0xfffffe01438a88f0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe01438a8950
trap_pfault() at trap_pfault+0xab/frame 0xfffffe01438a89b0
calltrap() at calltrap+0x8/frame 0xfffffe01438a89b0
--- trap 0xc, rip = 0xffffffff82753417, rsp = 0xfffffe01438a8a80, rbp = 0xfffffe01438a8aa0 ---
unionfs_root() at unionfs_root+0x17/frame 0xfffffe01438a8aa0
vfs_lookup() at vfs_lookup+0x92a/frame 0xfffffe01438a8b40
namei() at namei+0x340/frame 0xfffffe01438a8bc0
kern_statat() at kern_statat+0x12f/frame 0xfffffe01438a8d00
sys_fstatat() at sys_fstatat+0x2f/frame 0xfffffe01438a8e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe01438a8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01438a8f30
--- syscall (552, FreeBSD ELF64, fstatat), rip = 0x1e9fa9c98, rbp = 0x1e9fa1989db0 ---

https://people.freebsd.org/~pho/stress/log/log0429.txt

PS
The BIOS and IPMI firmware was just updated on this test box, but the page fault seems legit to me?

In D39272#894669, @pho wrote:

I ran into this problem:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82753417
stack pointer	        = 0x0:0xfffffe01438a8a80
frame pointer	        = 0x0:0xfffffe01438a8aa0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 2382 (find)
rdi: fffffe015b22d700 rsi:            82000 rdx: fffffe01438a8ad8
rcx:                1  r8:              246  r9:            40000
rax:                0 rbx: fffffe015b22d700 rbp: fffffe01438a8aa0
r10:                1 r11:                0 r12:            80000
r13: fffffe01599802c0 r14: fffffe01438a8ad8 r15:            82000
trap number		= 12
panic: page fault
cpuid = 2
time = 1679981682
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01438a8840
vpanic() at vpanic+0x152/frame 0xfffffe01438a8890
panic() at panic+0x43/frame 0xfffffe01438a88f0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe01438a8950
trap_pfault() at trap_pfault+0xab/frame 0xfffffe01438a89b0
calltrap() at calltrap+0x8/frame 0xfffffe01438a89b0
--- trap 0xc, rip = 0xffffffff82753417, rsp = 0xfffffe01438a8a80, rbp = 0xfffffe01438a8aa0 ---
unionfs_root() at unionfs_root+0x17/frame 0xfffffe01438a8aa0
vfs_lookup() at vfs_lookup+0x92a/frame 0xfffffe01438a8b40
namei() at namei+0x340/frame 0xfffffe01438a8bc0
kern_statat() at kern_statat+0x12f/frame 0xfffffe01438a8d00
sys_fstatat() at sys_fstatat+0x2f/frame 0xfffffe01438a8e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe01438a8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01438a8f30
--- syscall (552, FreeBSD ELF64, fstatat), rip = 0x1e9fa9c98, rbp = 0x1e9fa1989db0 ---

https://people.freebsd.org/~pho/stress/log/log0429.txt

PS
The BIOS and IPMI firmware was just updated on this test box, but the page fault seems legit to me?

Hmm. I'm probably missing something, but on initial investigation the panic doesn't seem to make sense.
It really looks as though unionfs_root() is seeing a partially constructed mount object: mp has the unionfs ops vector, but mnt_data is NULL (thus the page fault) and the stat object's fsid is 0.
This is consistent with the mount object state that would exist partway through unionfs_domount(), or if unionfs_domount() failed due to failure of unionfs_nodeget() or vfs_register_upper_from_vp(). There is another thread that appears to be partway through unionfs_domount(), and mp's busy count (mnt_lockref) of 2 is consistent with these 2 threads.
What doesn't make sense is how vfs_lookup() could observe a mount object in this state in the first place; vfs_domount_first() doesn't set coveredvp->v_mountedhere until after successful completion of VFS_MOUNT(), and it does so with coveredvp locked exclusive to avoid racing vfs_lookup().

Some things to note:
--Besides a couple of added comments, this is the same patch you tested successfully at the end of January.
--Since then, I have found a couple of bugs in the cleanup logic in unionfs_domount() and unionfs_nodeget(), which I'll post in a separate review after this one, but I don't think they would explain this behavior.
--There does appear to be a third thread calling dounmount() and blocked on an FFS lock, but it's unclear if that has any impact on the crash.

In D39272#894822, @jah wrote:
In D39272#894669, @pho wrote:

I ran into this problem:

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address	= 0x10
fault code		= supervisor read data, page not present
instruction pointer	= 0x20:0xffffffff82753417
stack pointer	        = 0x0:0xfffffe01438a8a80
frame pointer	        = 0x0:0xfffffe01438a8aa0
code segment		= base 0x0, limit 0xfffff, type 0x1b
			= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags	= interrupt enabled, resume, IOPL = 0
current process		= 2382 (find)
rdi: fffffe015b22d700 rsi:            82000 rdx: fffffe01438a8ad8
rcx:                1  r8:              246  r9:            40000
rax:                0 rbx: fffffe015b22d700 rbp: fffffe01438a8aa0
r10:                1 r11:                0 r12:            80000
r13: fffffe01599802c0 r14: fffffe01438a8ad8 r15:            82000
trap number		= 12
panic: page fault
cpuid = 2
time = 1679981682
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01438a8840
vpanic() at vpanic+0x152/frame 0xfffffe01438a8890
panic() at panic+0x43/frame 0xfffffe01438a88f0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe01438a8950
trap_pfault() at trap_pfault+0xab/frame 0xfffffe01438a89b0
calltrap() at calltrap+0x8/frame 0xfffffe01438a89b0
--- trap 0xc, rip = 0xffffffff82753417, rsp = 0xfffffe01438a8a80, rbp = 0xfffffe01438a8aa0 ---
unionfs_root() at unionfs_root+0x17/frame 0xfffffe01438a8aa0
vfs_lookup() at vfs_lookup+0x92a/frame 0xfffffe01438a8b40
namei() at namei+0x340/frame 0xfffffe01438a8bc0
kern_statat() at kern_statat+0x12f/frame 0xfffffe01438a8d00
sys_fstatat() at sys_fstatat+0x2f/frame 0xfffffe01438a8e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe01438a8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01438a8f30
--- syscall (552, FreeBSD ELF64, fstatat), rip = 0x1e9fa9c98, rbp = 0x1e9fa1989db0 ---

https://people.freebsd.org/~pho/stress/log/log0429.txt

PS
The BIOS and IPMI firmware was just updated on this test box, but the page fault seems legit to me?

Hmm. I'm probably missing something, but on initial investigation the panic doesn't seem to make sense.
It really looks as though unionfs_root() is seeing a partially constructed mount object: mp has the unionfs ops vector, but mnt_data is NULL (thus the page fault) and the stat object's fsid is 0.
This is consistent with the mount object state that would exist partway through unionfs_domount(), or if unionfs_domount() failed due to failure of unionfs_nodeget() or vfs_register_upper_from_vp(). There is another thread that appears to be partway through unionfs_domount(), and mp's busy count (mnt_lockref) of 2 is consistent with these 2 threads.
What doesn't make sense is how vfs_lookup() could observe a mount object in this state in the first place; vfs_domount_first() doesn't set coveredvp->v_mountedhere until after successful completion of VFS_MOUNT(), and it does so with coveredvp locked exclusive to avoid racing vfs_lookup().

Some things to note:
--Besides a couple of added comments, this is the same patch you tested successfully at the end of January.
--Since then, I have found a couple of bugs in the cleanup logic in unionfs_domount() and unionfs_nodeget(), which I'll post in a separate review after this one, but I don't think they would explain this behavior.
--There does appear to be a third thread calling dounmount() and blocked on an FFS lock, but it's unclear if that has any impact on the crash.

Thinking about this more, I suspect I might know what's going on here: it's a bug in the VV_CROSSLOCK logic I added to vfs_lookup().
If the covered vnode is held shared (which it will be for most filesystems), but the filesystem be walked into requires exclusive lookups (which unionfs still does), the VV_CROSSLOCK logic will need to upgrade the lock. This can involve transiently dropping the lock, and in that window the target mount may be unmounted.

The fix should be simple, but IMO another good argument for trying to get rid of VV_CROSSLOCK.

I've been trying to recreate the issue for the last 8 hours. I'll let the test run for a day or two.

  • vfs_lookup(): re-check v_mountedhere on lock upgrade
This revision now requires review to proceed.Mar 28 2023, 6:14 PM

I added a counter to the fix in vfs_lookup.c The problem test triggered this three time in a three hour test.
This was followed by all of the unions tests for seven hours.
I ran a mix of stress2 tests for two days without seeing any (related) problems.

This revision is now accepted and ready to land.Apr 10 2023, 2:53 PM