Page MenuHomeFreeBSD

Switch to use shared vnode locks for text files during image activation.
ClosedPublic

Authored by kib on Apr 16 2019, 1:22 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 8:01 AM
Unknown Object (File)
Sat, Jan 11, 7:09 PM
Unknown Object (File)
Wed, Jan 8, 7:15 AM
Unknown Object (File)
Tue, Jan 7, 12:28 AM
Unknown Object (File)
Dec 23 2024, 6:56 AM
Unknown Object (File)
Dec 23 2024, 6:56 AM
Unknown Object (File)
Dec 23 2024, 6:29 AM
Unknown Object (File)
Dec 23 2024, 3:02 AM

Details

Summary

kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition.

The change removes VV_TEXT, replacing it with the condition v_writecount < -1, and put v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal. Now, the operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access, atomic check for writeability and increment of v_writecount is performed by the VOP.

nullfs bypasses v_writecount to the lower vnode always.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 23980

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

I gave it a spin with poudriere and it panics:

panic: Assertion error_ == 0 failed at /tank/users/mjg/src/freebsd/sys/kern/kern_exec.c:623
cpuid = 59
time = 1555604110
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe056d49a650
vpanic() at vpanic+0x19d/frame 0xfffffe056d49a6a0
panic() at panic+0x43/frame 0xfffffe056d49a700
kern_execve() at kern_execve+0x12f3/frame 0xfffffe056d49aa50
sys_execve() at sys_execve+0x4c/frame 0xfffffe056d49aad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe056d49abf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe056d49abf0

  • syscall (59, FreeBSD ELF64, sys_execve), rip = 0x8003d517a, rsp = 0x7fffffffe388, rbp = 0x7fffffffe4d0 ---

KDB: enter: panic
[ thread pid 12060 tid 101862 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

This is the unset in interpreted case and is probably a race where 2 different threads are unsetting the flag.

I was working on shared exec myself few years back. I think both writers and execs have to be explicitly counted. Instead of using a dedicated flag to differentiate between them (or separate counts), in Linux they have a hack where negative value signifies execs.

In D19923#428953, @mjg wrote:

I gave it a spin with poudriere and it panics:

panic: Assertion error_ == 0 failed at /tank/users/mjg/src/freebsd/sys/kern/kern_exec.c:623
cpuid = 59
time = 1555604110
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe056d49a650
vpanic() at vpanic+0x19d/frame 0xfffffe056d49a6a0
panic() at panic+0x43/frame 0xfffffe056d49a700
kern_execve() at kern_execve+0x12f3/frame 0xfffffe056d49aa50
sys_execve() at sys_execve+0x4c/frame 0xfffffe056d49aad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe056d49abf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe056d49abf0

  • syscall (59, FreeBSD ELF64, sys_execve), rip = 0x8003d517a, rsp = 0x7fffffffe388, rbp = 0x7fffffffe4d0 ---

KDB: enter: panic
[ thread pid 12060 tid 101862 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

This is the unset in interpreted case and is probably a race where 2 different threads are unsetting the flag.

I was working on shared exec myself few years back. I think both writers and execs have to be explicitly counted. Instead of using a dedicated flag to differentiate between them (or separate counts), in Linux they have a hack where negative value signifies execs.

The execs are counted, but it is more complicated than just count number of execs. We clear VV_TEXT when the last mapping of the vnode is destroyed. The reason is that it is impossible to define when the file is no longer serving as text. Normal executable maps the text file twice, one time for code segment, one time for CoW data. So if I start counting execs in v_writecount, I do not know when to decrement. More, imagine a weird but legitimate program that executes, mmap some dso and then unmaps the main binary.

In fact, I believe that your backtrace is due to plain bug: VOP_UNSET_TEXT() should be called only when caller actually set v_writecount to -1. If it was already -1, then the clearing should be left to the other thread. Since we are exiting with error when unsetting -1, it should be fine: if other thread cleared the flag earlier than we. it must fail the exec anyway.

In principle, we can mark the mappings created by the image activator, and increment v_writecount on unmapping of each segment, if this simple solution does not work.

On kern_execve() error path, only call VOP_UNSET_TEXT() when it was set by current thread.

A version with refcounted text references. vm_map_entry keeps such reference on the lowest object in the shadow chain, which must be vnode or tmpfs swap.

This crashes for me early on:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 12
fault virtual address   = 0x8
fault code              = supervisor read data  , page not present
instruction pointer     = 0x20:0xffffffff808abef3
stack pointer           = 0x28:0xfffffe05670e0930
frame pointer           = 0x28:0xfffffe05670e0960
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1782 (sh)
trap number             = 12
panic: page fault
cpuid = 8
time = 1556050578
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe05670e05f0
vpanic() at vpanic+0x19d/frame 0xfffffe05670e0640
panic() at panic+0x43/frame 0xfffffe05670e06a0
trap_fatal() at trap_fatal+0x394/frame 0xfffffe05670e0700
trap_pfault() at trap_pfault+0x62/frame 0xfffffe05670e0750
trap() at trap+0x2b4/frame 0xfffffe05670e0860
calltrap() at calltrap+0x8/frame 0xfffffe05670e0860
--- trap 0xc, rip = 0xffffffff808abef3, rsp = 0xfffffe05670e0930, rbp = 0xfffffe05670e0960 ---
vm_map_entry_inc_vnode_text() at vm_map_entry_inc_vnode_text+0xd3/frame 0xfffffe05670e0960
vmspace_fork() at vmspace_fork+0x9cf/frame 0xfffffe05670e09c0
fork1() at fork1+0x4c3/frame 0xfffffe05670e0a80
sys_fork() at sys_fork+0x4c/frame 0xfffffe05670e0ad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe05670e0bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe05670e0bf0
--- syscall (2, FreeBSD ELF64, sys_fork), rip = 0x8003ed88a, rsp = 0x7fffffffe2d8, rbp = 0x7fffffffe310 ---
KDB: enter: panic
[ thread pid 1782 tid 101613 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
In D19923#430524, @mjg wrote:

This crashes for me early on:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 12
fault virtual address   = 0x8
fault code              = supervisor read data  , page not present
instruction pointer     = 0x20:0xffffffff808abef3
stack pointer           = 0x28:0xfffffe05670e0930
frame pointer           = 0x28:0xfffffe05670e0960
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1782 (sh)
trap number             = 12
panic: page fault
cpuid = 8
time = 1556050578
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe05670e05f0
vpanic() at vpanic+0x19d/frame 0xfffffe05670e0640
panic() at panic+0x43/frame 0xfffffe05670e06a0
trap_fatal() at trap_fatal+0x394/frame 0xfffffe05670e0700
trap_pfault() at trap_pfault+0x62/frame 0xfffffe05670e0750
trap() at trap+0x2b4/frame 0xfffffe05670e0860
calltrap() at calltrap+0x8/frame 0xfffffe05670e0860
--- trap 0xc, rip = 0xffffffff808abef3, rsp = 0xfffffe05670e0930, rbp = 0xfffffe05670e0960 ---
vm_map_entry_inc_vnode_text() at vm_map_entry_inc_vnode_text+0xd3/frame 0xfffffe05670e0960

Can you show me the line number for this address ?

What is 'early on' ? During init/initial shell startup, during the startup into multiuser, or some specific action ?

vmspace_fork() at vmspace_fork+0x9cf/frame 0xfffffe05670e09c0

And this line too, please.

fork1() at fork1+0x4c3/frame 0xfffffe05670e0a80
sys_fork() at sys_fork+0x4c/frame 0xfffffe05670e0ad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe05670e0bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe05670e0bf0

  • syscall (2, FreeBSD ELF64, sys_fork), rip = 0x8003ed88a, rsp = 0x7fffffffe2d8, rbp = 0x7fffffffe310 ---

KDB: enter: panic
[ thread pid 1782 tid 101613 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

I asked pho to help with debugging/testing, but he reported something different so far.

Sorry, I meant early when running poudriere. vp is NULL in vm_map_entry_inc_vnode_text. Can be trivially reproduced by having the shell fork when backed by tmpfs, in particular unpacking world, 'chroot . sh' + 'ls' do the trick.

In D19923#430751, @mjg wrote:

Sorry, I meant early when running poudriere. vp is NULL in vm_map_entry_inc_vnode_text. Can be trivially reproduced by having the shell fork when backed by tmpfs, in particular unpacking world, 'chroot . sh' + 'ls' do the trick.

Strange, I cannot reproduce it. might be there should be something more involved. I mounted tmpfs, then copied /bin/sh and /bin/ls there, then started copied shell, and from it I did ./ls. Anyway, I am uploading updated patch with assertion about the object type in inc_vnode_text(), and with two fixes already done after Peter' testing. I expect much more would be reported by Peter.

When forking wired process, do not copy the MAP_ENTRY_VN_EXEC flag, since map entry does not inherit from vnode.
Do not call VOP_UNSET_TEXT() on return from kern_execve() when SET was not successful.
Add assert about the object type in inc_vnode_text().

Extended the patch with:

+       if (!(((object->flags & OBJ_TMPFS) == 0 &&
+           object->type == OBJT_VNODE) ||
+           ((object->flags & OBJ_TMPFS) != 0 &&
+           object->type == OBJT_SWAP)))
+               printf("obj %p flags %x type %d\n", object, object->flags, object->type);

obj 0xfffff88143effd00 flags 1000 type 0
panic: Assertion ((object->flags & OBJ_TMPFS) == 0 && object->type == OBJT_VNODE) || ((object->flags & OBJ_TMPFS) != 0 && object->type == OBJT_SWAP) failed at /tank/users/mjg/src/freebsd/sys/vm/vm_map.c:535

The machine in question is pig1, you can easily get access.

In D19923#430772, @mjg wrote:

Extended the patch with:

+       if (!(((object->flags & OBJ_TMPFS) == 0 &&
+           object->type == OBJT_VNODE) ||
+           ((object->flags & OBJ_TMPFS) != 0 &&
+           object->type == OBJT_SWAP)))
+               printf("obj %p flags %x type %d\n", object, object->flags, object->type);

obj 0xfffff88143effd00 flags 1000 type 0
panic: Assertion ((object->flags & OBJ_TMPFS) == 0 && object->type == OBJT_VNODE) || ((object->flags & OBJ_TMPFS) != 0 && object->type == OBJT_SWAP) failed at /tank/users/mjg/src/freebsd/sys/vm/vm_map.c:535

The machine in question is pig1, you can easily get access.

Apparently it was a bug in tmpfs, we could collapse tmpfs swap object.

Do not collapse NOSPLIT objects.
Only unset text on interpreter if it was set by us.
Assert that we set text on the right vnode.

Collection of all fixes for now.

I like the idea, and can't spot anything wrong with the patch itself.

sys/kern/imgact_elf.c
943

This looks like something that should be committed separately; same with similar chunks below.

sys/sys/fcntl.h
149 ↗(On Diff #56662)

Unrelated change?

This revision is now accepted and ready to land.Apr 26 2019, 4:55 PM
sys/kern/imgact_elf.c
943

This could be, and I am now not convinced that I want such change. The reason is the M_NOWAIT is allowed to drain free queue deeper. For now I included them to see how the change pass over the Peter stress2. Then I will decide.

sys/sys/fcntl.h
149 ↗(On Diff #56662)

Very much related. The error handling for VOP_OPEN/O_EXLOCK failures was significantly rewritten to properly handle v_writecount on error and this flag is no longer needed. Look at the change in vn_closefile().

sys/sys/fcntl.h
149 ↗(On Diff #56662)

I might be missing something, but there's no other occurrence of O_TTY_INIT in the diff?

sys/sys/fcntl.h
149 ↗(On Diff #56662)

Sure, there is no. FOPENFAILED reused open-only bit to not deplete limited flags namespace.

I have run the full stress2 test set on amd64 using r346698+3cb8bc20f88(cpuctl).
No problems seen.

sys/sys/fcntl.h
149 ↗(On Diff #56662)

Ah, okay. Commit the removal of the old #define separately, perhaps?

sys/sys/fcntl.h
149 ↗(On Diff #56662)

How can this be done ? The define is needed right now, and it is not needed after the revision is landed. Are you proposing to commit everything but the define removal ? This is silly.

sys/compat/linux/linux_misc.c
468

Doesn't this require a write lock on the map?

kib marked an inline comment as done.Apr 28 2019, 5:39 PM
kib added inline comments.
sys/compat/linux/linux_misc.c
468

Fixed on my branch, I will post the update after some other fixes are done.

sys/sys/fcntl.h
149 ↗(On Diff #56662)

Sigh, never mind, it's just me getting #define arguments backwards - I thought you were removing O_TTY_OPEN and not FOPENFAILED. Sorry for the noise.

There is a dangling reference to VV_TEXT in a comment in rtld.c.

sys/kern/imgact_elf.c
452

A cast to void is better for explicitly ignoring errors, to avoid warnings from static analyzers.

943

I don't understand the claim about M_NOWAIT.

sys/kern/kern_exec.c
614–615

This comment is stale.

1744

The comment isn't quite right since that flag no longer exists.

sys/kern/vfs_vnops.c
446

This is a dead store.

sys/sys/vnode.h
833

I think _CHECKED is a more common suffix for this type of wrapper, e.g., pmap_pcid_alloc_checked(), vm_page_replace_checked().

sys/vm/vm_fault.c
1673

This is always a no-op since vm_map_entry_set_vnode_text() does nothing if MAP_ENTRY_VN_EXEC is not set.

sys/vm/vnode_pager.c
337

Typo in "object".

kib marked 9 inline comments as done.Apr 29 2019, 5:44 PM
kib added inline comments.
sys/kern/imgact_elf.c
452

It actually was changed more substantially in intermediate version, but now becomes a stray change. Reverted.

943

I forgot that I changed it in r243040. Weird.

sys/kern/vfs_vnops.c
446

Yes, I already removed it in my branch, to reduce the diff.

sys/vm/vm_fault.c
1673

Yes, all is needed there only the flag clearing. New entry cannot cary text reference since new object does not inherit from old one.

kib marked 4 inline comments as done.

Handle Markj' comments. Mainly,

  • VV_TEXT removed from comments
  • _SUCCEED -> _CHECKED
  • removed dead code in vm_fault_copy_entry.
This revision now requires review to proceed.Apr 29 2019, 5:45 PM

Testing of r346804+db5a624a416(cpuctl) is complete and no problems seen.

sys/compat/linux/linux_misc.c
470

Why not just set textset = false? We have already acquired a text ref by this point.

sys/fs/nfsclient/nfs_clbio.c
1642

Shouldn't it be <= -1 or VOP_IS_TEXT(vp)?

sys/kern/kern_exec.c
466

Did you mean to write "its vnode"?

sys/vm/vm_object.c
1730

Can you explain why this is needed? Is it intended to catch tmpfs objects?

sys/vm/vnode_pager.c
344

Does this do the right thing for nullfs vnodes?

kib marked 5 inline comments as done.May 4 2019, 9:23 PM
kib added inline comments.
sys/fs/nfsclient/nfs_clbio.c
1642

Vnode is potentially unlocked there, when called from the io thread. See the commented assert above. So VOP_IS_TEXT() does not work, and I do not want to remove the locked requirement from it.

Changed to <= -1.

sys/vm/vm_object.c
1730

Yes, this is in fact already existing bug demonstrated by the patch. It destroys content of the tmpfs or posix shared memory backing object by copying all pages to the shadow.

sys/vm/vnode_pager.c
344

This occurs on the vnode reclamation, and vgonel() calls vfs_notify_upper(VFS_NOTIFY_UPPER_RECLAIM). So at the time that we zero lower vnode v_writecount, upper (nullfs) vnode is already reclaimed.

Note that vp there is always lower vnode, since object->handle is lowervp and e.g. vm_mmap_vnode() specifically re-evaluates vp to get to the lowervp. VOP_SET_TEXT() on nullfs passes the op to lowervp.

kib marked 3 inline comments as done.

Fix (racy) check for text vnode in NFS.
Simplify linux a.out loader by reusing text reference.
Add missed word in the comment.

markj added inline comments.
sys/vm/vm_object.c
1730

I see. Isn't (flags & OBJ_NOSPLIT) == 0 a slightly weird predicate for deciding whether to collapse the backing object? Semantically it seems unrelated. I believe the handle != NULL condition is meant to exclude non-anonymous objects, but tmpfs and shm (including SYSV shm?) do not specify a handle, and I am not aware of any in-tree code that creates a swap object with a handle.

This revision is now accepted and ready to land.May 4 2019, 10:54 PM
kib marked an inline comment as done.May 4 2019, 11:05 PM
kib added inline comments.
sys/vm/vm_object.c
1730

It is a weird name but right flag. It means that the content of the object must not be lost. Handles for swap objects cause unneeded lookup, and not too long time ago it was under Giant. All current users of non-anon swap objects have other means to track their objects, and do not need a lookup by handle.

kib marked an inline comment as done.May 4 2019, 11:17 PM
This revision was automatically updated to reflect the committed changes.

What remains here has already been committed, but with LK_SHARED passed to vn_lock(). This can be closed.

This revision is now accepted and ready to land.May 18 2019, 8:05 PM