Details

Reviewers

trasz
mjg
markj
alc
kib

Commits

rS347151: Switch to use shared vnode locks for text files during image activation.
rS347150: Do not collapse objects with OBJ_NOSPLIT backing swap object.
rS347148: imgact_elf: do not relock the text vnode if possible.

Summary

kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition.

The change removes VV_TEXT, replacing it with the condition v_writecount < -1, and put v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal. Now, the operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access, atomic check for writeability and increment of v_writecount is performed by the VOP.

nullfs bypasses v_writecount to the lower vnode always.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 23869

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Herald added a subscriber: imp. · View Herald TranscriptApr 16 2019, 1:22 PM

Harbormaster completed remote builds in B23694: Diff 56245.Apr 16 2019, 1:22 PM

I gave it a spin with poudriere and it panics:

panic: Assertion error_ == 0 failed at /tank/users/mjg/src/freebsd/sys/kern/kern_exec.c:623
cpuid = 59
time = 1555604110
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe056d49a650
vpanic() at vpanic+0x19d/frame 0xfffffe056d49a6a0
panic() at panic+0x43/frame 0xfffffe056d49a700
kern_execve() at kern_execve+0x12f3/frame 0xfffffe056d49aa50
sys_execve() at sys_execve+0x4c/frame 0xfffffe056d49aad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe056d49abf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe056d49abf0

syscall (59, FreeBSD ELF64, sys_execve), rip = 0x8003d517a, rsp = 0x7fffffffe388, rbp = 0x7fffffffe4d0 ---

KDB: enter: panic
[ thread pid 12060 tid 101862 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

This is the unset in interpreted case and is probably a race where 2 different threads are unsetting the flag.

I was working on shared exec myself few years back. I think both writers and execs have to be explicitly counted. Instead of using a dedicated flag to differentiate between them (or separate counts), in Linux they have a hack where negative value signifies execs.

In D19923#428953, @mjg wrote:

I gave it a spin with poudriere and it panics:

panic: Assertion error_ == 0 failed at /tank/users/mjg/src/freebsd/sys/kern/kern_exec.c:623
cpuid = 59
time = 1555604110
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe056d49a650
vpanic() at vpanic+0x19d/frame 0xfffffe056d49a6a0
panic() at panic+0x43/frame 0xfffffe056d49a700
kern_execve() at kern_execve+0x12f3/frame 0xfffffe056d49aa50
sys_execve() at sys_execve+0x4c/frame 0xfffffe056d49aad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe056d49abf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe056d49abf0

syscall (59, FreeBSD ELF64, sys_execve), rip = 0x8003d517a, rsp = 0x7fffffffe388, rbp = 0x7fffffffe4d0 ---

KDB: enter: panic
[ thread pid 12060 tid 101862 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

This is the unset in interpreted case and is probably a race where 2 different threads are unsetting the flag.

I was working on shared exec myself few years back. I think both writers and execs have to be explicitly counted. Instead of using a dedicated flag to differentiate between them (or separate counts), in Linux they have a hack where negative value signifies execs.

The execs are counted, but it is more complicated than just count number of execs. We clear VV_TEXT when the last mapping of the vnode is destroyed. The reason is that it is impossible to define when the file is no longer serving as text. Normal executable maps the text file twice, one time for code segment, one time for CoW data. So if I start counting execs in v_writecount, I do not know when to decrement. More, imagine a weird but legitimate program that executes, mmap some dso and then unmaps the main binary.

In fact, I believe that your backtrace is due to plain bug: VOP_UNSET_TEXT() should be called only when caller actually set v_writecount to -1. If it was already -1, then the clearing should be left to the other thread. Since we are exiting with error when unsetting -1, it should be fine: if other thread cleared the flag earlier than we. it must fail the exec anyway.

In principle, we can mark the mappings created by the image activator, and increment v_writecount on unmapping of each segment, if this simple solution does not work.

On kern_execve() error path, only call VOP_UNSET_TEXT() when it was set by current thread.

Harbormaster completed remote builds in B23751: Diff 56361.Apr 18 2019, 5:42 PM

A version with refcounted text references. vm_map_entry keeps such reference on the lowest object in the shadow chain, which must be vnode or tmpfs swap.

Harbormaster completed remote builds in B23752: Diff 56363.Apr 18 2019, 7:24 PM

kib added a subscriber: pho.Apr 23 2019, 6:24 PM

This crashes for me early on:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 12
fault virtual address   = 0x8
fault code              = supervisor read data  , page not present
instruction pointer     = 0x20:0xffffffff808abef3
stack pointer           = 0x28:0xfffffe05670e0930
frame pointer           = 0x28:0xfffffe05670e0960
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1782 (sh)
trap number             = 12
panic: page fault
cpuid = 8
time = 1556050578
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe05670e05f0
vpanic() at vpanic+0x19d/frame 0xfffffe05670e0640
panic() at panic+0x43/frame 0xfffffe05670e06a0
trap_fatal() at trap_fatal+0x394/frame 0xfffffe05670e0700
trap_pfault() at trap_pfault+0x62/frame 0xfffffe05670e0750
trap() at trap+0x2b4/frame 0xfffffe05670e0860
calltrap() at calltrap+0x8/frame 0xfffffe05670e0860
--- trap 0xc, rip = 0xffffffff808abef3, rsp = 0xfffffe05670e0930, rbp = 0xfffffe05670e0960 ---
vm_map_entry_inc_vnode_text() at vm_map_entry_inc_vnode_text+0xd3/frame 0xfffffe05670e0960
vmspace_fork() at vmspace_fork+0x9cf/frame 0xfffffe05670e09c0
fork1() at fork1+0x4c3/frame 0xfffffe05670e0a80
sys_fork() at sys_fork+0x4c/frame 0xfffffe05670e0ad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe05670e0bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe05670e0bf0
--- syscall (2, FreeBSD ELF64, sys_fork), rip = 0x8003ed88a, rsp = 0x7fffffffe2d8, rbp = 0x7fffffffe310 ---
KDB: enter: panic
[ thread pid 1782 tid 101613 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why

In D19923#430524, @mjg wrote:

This crashes for me early on:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 12
fault virtual address   = 0x8
fault code              = supervisor read data  , page not present
instruction pointer     = 0x20:0xffffffff808abef3
stack pointer           = 0x28:0xfffffe05670e0930
frame pointer           = 0x28:0xfffffe05670e0960
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1782 (sh)
trap number             = 12
panic: page fault
cpuid = 8
time = 1556050578
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe05670e05f0
vpanic() at vpanic+0x19d/frame 0xfffffe05670e0640
panic() at panic+0x43/frame 0xfffffe05670e06a0
trap_fatal() at trap_fatal+0x394/frame 0xfffffe05670e0700
trap_pfault() at trap_pfault+0x62/frame 0xfffffe05670e0750
trap() at trap+0x2b4/frame 0xfffffe05670e0860
calltrap() at calltrap+0x8/frame 0xfffffe05670e0860
--- trap 0xc, rip = 0xffffffff808abef3, rsp = 0xfffffe05670e0930, rbp = 0xfffffe05670e0960 ---
vm_map_entry_inc_vnode_text() at vm_map_entry_inc_vnode_text+0xd3/frame 0xfffffe05670e0960

Can you show me the line number for this address ?

What is 'early on' ? During init/initial shell startup, during the startup into multiuser, or some specific action ?

vmspace_fork() at vmspace_fork+0x9cf/frame 0xfffffe05670e09c0

And this line too, please.

fork1() at fork1+0x4c3/frame 0xfffffe05670e0a80
sys_fork() at sys_fork+0x4c/frame 0xfffffe05670e0ad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe05670e0bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe05670e0bf0

syscall (2, FreeBSD ELF64, sys_fork), rip = 0x8003ed88a, rsp = 0x7fffffffe2d8, rbp = 0x7fffffffe310 ---

KDB: enter: panic
[ thread pid 1782 tid 101613 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

I asked pho to help with debugging/testing, but he reported something different so far.

Sorry, I meant early when running poudriere. vp is NULL in vm_map_entry_inc_vnode_text. Can be trivially reproduced by having the shell fork when backed by tmpfs, in particular unpacking world, 'chroot . sh' + 'ls' do the trick.

In D19923#430751, @mjg wrote:

Sorry, I meant early when running poudriere. vp is NULL in vm_map_entry_inc_vnode_text. Can be trivially reproduced by having the shell fork when backed by tmpfs, in particular unpacking world, 'chroot . sh' + 'ls' do the trick.

Strange, I cannot reproduce it. might be there should be something more involved. I mounted tmpfs, then copied /bin/sh and /bin/ls there, then started copied shell, and from it I did ./ls. Anyway, I am uploading updated patch with assertion about the object type in inc_vnode_text(), and with two fixes already done after Peter' testing. I expect much more would be reported by Peter.

When forking wired process, do not copy the MAP_ENTRY_VN_EXEC flag, since map entry does not inherit from vnode.
Do not call VOP_UNSET_TEXT() on return from kern_execve() when SET was not successful.
Add assert about the object type in inc_vnode_text().

Harbormaster completed remote builds in B23868: Diff 56580.Apr 24 2019, 11:55 AM

Extended the patch with:

+       if (!(((object->flags & OBJ_TMPFS) == 0 &&
+           object->type == OBJT_VNODE) ||
+           ((object->flags & OBJ_TMPFS) != 0 &&
+           object->type == OBJT_SWAP)))
+               printf("obj %p flags %x type %d\n", object, object->flags, object->type);

obj 0xfffff88143effd00 flags 1000 type 0
panic: Assertion ((object->flags & OBJ_TMPFS) == 0 && object->type == OBJT_VNODE) || ((object->flags & OBJ_TMPFS) != 0 && object->type == OBJT_SWAP) failed at /tank/users/mjg/src/freebsd/sys/vm/vm_map.c:535

The machine in question is pig1, you can easily get access.

In D19923#430772, @mjg wrote:
Extended the patch with:
+       if (!(((object->flags & OBJ_TMPFS) == 0 &&
+           object->type == OBJT_VNODE) ||
+           ((object->flags & OBJ_TMPFS) != 0 &&
+           object->type == OBJT_SWAP)))
+               printf("obj %p flags %x type %d\n", object, object->flags, object->type);
obj 0xfffff88143effd00 flags 1000 type 0
panic: Assertion ((object->flags & OBJ_TMPFS) == 0 && object->type == OBJT_VNODE) || ((object->flags & OBJ_TMPFS) != 0 && object->type == OBJT_SWAP) failed at /tank/users/mjg/src/freebsd/sys/vm/vm_map.c:535

The machine in question is pig1, you can easily get access.

Apparently it was a bug in tmpfs, we could collapse tmpfs swap object.

Do not collapse NOSPLIT objects.
Only unset text on interpreter if it was set by us.
Assert that we set text on the right vnode.

Harbormaster completed remote builds in B23869: Diff 56586.Apr 24 2019, 2:17 PM

This works fine

Collection of all fixes for now.

Harbormaster completed remote builds in B23904: Diff 56662.Apr 25 2019, 7:21 PM

kib edited the summary of this revision. (Show Details)Apr 26 2019, 1:01 PM

I like the idea, and can't spot anything wrong with the patch itself.

sys/kern/imgact_elf.c
943	This looks like something that should be committed separately; same with similar chunks below.
sys/sys/fcntl.h
149	Unrelated change?

This revision is now accepted and ready to land.Apr 26 2019, 4:55 PM

kib added inline comments.Apr 26 2019, 5:13 PM

sys/kern/imgact_elf.c
943	This could be, and I am now not convinced that I want such change. The reason is the M_NOWAIT is allowed to drain free queue deeper. For now I included them to see how the change pass over the Peter stress2. Then I will decide.
sys/sys/fcntl.h
149	Very much related. The error handling for VOP_OPEN/O_EXLOCK failures was significantly rewritten to properly handle v_writecount on error and this flag is no longer needed. Look at the change in vn_closefile().

trasz added inline comments.Apr 26 2019, 6:10 PM

sys/sys/fcntl.h
149	I might be missing something, but there's no other occurrence of O_TTY_INIT in the diff?

kib added inline comments.Apr 26 2019, 6:25 PM

sys/sys/fcntl.h
149	Sure, there is no. FOPENFAILED reused open-only bit to not deplete limited flags namespace.

I have run the full stress2 test set on amd64 using r346698+3cb8bc20f88(cpuctl).
No problems seen.

kib added reviewers: markj, alc.Apr 27 2019, 5:45 PM

trasz added inline comments.Apr 28 2019, 1:09 PM

sys/sys/fcntl.h
149	Ah, okay. Commit the removal of the old #define separately, perhaps?

kib added inline comments.Apr 28 2019, 1:16 PM

sys/sys/fcntl.h
149	How can this be done ? The define is needed right now, and it is not needed after the revision is landed. Are you proposing to commit everything but the define removal ? This is silly.

alc added inline comments.Apr 28 2019, 4:38 PM

sys/compat/linux/linux_misc.c
462	Doesn't this require a write lock on the map?

kib marked an inline comment as done.Apr 28 2019, 5:39 PM

kib added inline comments.

sys/compat/linux/linux_misc.c
462	Fixed on my branch, I will post the update after some other fixes are done.

trasz added inline comments.Apr 28 2019, 8:33 PM

sys/sys/fcntl.h
149	Sigh, never mind, it's just me getting #define arguments backwards - I thought you were removing O_TTY_OPEN and not FOPENFAILED. Sorry for the noise.

There is a dangling reference to VV_TEXT in a comment in rtld.c.

sys/kern/imgact_elf.c
452	A cast to void is better for explicitly ignoring errors, to avoid warnings from static analyzers.
943	I don't understand the claim about M_NOWAIT.
sys/kern/kern_exec.c
615	This comment is stale.
1743	The comment isn't quite right since that flag no longer exists.
sys/kern/vfs_vnops.c
431	This is a dead store.
sys/sys/vnode.h
832	I think _CHECKED is a more common suffix for this type of wrapper, e.g., pmap_pcid_alloc_checked(), vm_page_replace_checked().
sys/vm/vm_fault.c
1673 ↗	(On Diff #56662)	This is always a no-op since vm_map_entry_set_vnode_text() does nothing if MAP_ENTRY_VN_EXEC is not set.
sys/vm/vnode_pager.c
337	Typo in "object".

kib marked 9 inline comments as done.Apr 29 2019, 5:44 PM

kib added inline comments.

sys/kern/imgact_elf.c
452	It actually was changed more substantially in intermediate version, but now becomes a stray change. Reverted.
943	I forgot that I changed it in r243040. Weird.
sys/kern/vfs_vnops.c
431	Yes, I already removed it in my branch, to reduce the diff.
sys/vm/vm_fault.c
1673 ↗	(On Diff #56662)	Yes, all is needed there only the flag clearing. New entry cannot cary text reference since new object does not inherit from old one.

Handle Markj' comments. Mainly,

VV_TEXT removed from comments
_SUCCEED -> _CHECKED
removed dead code in vm_fault_copy_entry.

This revision now requires review to proceed.Apr 29 2019, 5:45 PM

Herald added a subscriber: bdragon. · View Herald TranscriptApr 29 2019, 5:45 PM

Harbormaster completed remote builds in B23980: Diff 56819.Apr 29 2019, 5:45 PM

emaste added a subscriber: emaste.Apr 29 2019, 8:51 PM

Testing of r346804+db5a624a416(cpuctl) is complete and no problems seen.

markj added inline comments.May 4 2019, 8:43 PM

sys/compat/linux/linux_misc.c
464	Why not just set textset = false? We have already acquired a text ref by this point.
sys/fs/nfsclient/nfs_clbio.c
1642	Shouldn't it be `<= -1` or `VOP_IS_TEXT(vp)`?
sys/kern/kern_exec.c
466	Did you mean to write "its vnode"?
sys/vm/vm_object.c
1750	Can you explain why this is needed? Is it intended to catch tmpfs objects?
sys/vm/vnode_pager.c
344	Does this do the right thing for nullfs vnodes?

kib marked 5 inline comments as done.May 4 2019, 9:23 PM

kib added inline comments.

sys/fs/nfsclient/nfs_clbio.c
1642	Vnode is potentially unlocked there, when called from the io thread. See the commented assert above. So VOP_IS_TEXT() does not work, and I do not want to remove the locked requirement from it. Changed to <= -1.
sys/vm/vm_object.c
1750	Yes, this is in fact already existing bug demonstrated by the patch. It destroys content of the tmpfs or posix shared memory backing object by copying all pages to the shadow.
sys/vm/vnode_pager.c
344	This occurs on the vnode reclamation, and vgonel() calls vfs_notify_upper(VFS_NOTIFY_UPPER_RECLAIM). So at the time that we zero lower vnode v_writecount, upper (nullfs) vnode is already reclaimed. Note that vp there is always lower vnode, since object->handle is lowervp and e.g. vm_mmap_vnode() specifically re-evaluates vp to get to the lowervp. VOP_SET_TEXT() on nullfs passes the op to lowervp.

Fix (racy) check for text vnode in NFS.
Simplify linux a.out loader by reusing text reference.
Add missed word in the comment.

Harbormaster completed remote builds in B24104: Diff 57048.May 4 2019, 9:25 PM

markj accepted this revision.May 4 2019, 10:54 PM

markj added inline comments.

sys/vm/vm_object.c
1750	I see. Isn't (flags & OBJ_NOSPLIT) == 0 a slightly weird predicate for deciding whether to collapse the backing object? Semantically it seems unrelated. I believe the handle != NULL condition is meant to exclude non-anonymous objects, but tmpfs and shm (including SYSV shm?) do not specify a handle, and I am not aware of any in-tree code that creates a swap object with a handle.

This revision is now accepted and ready to land.May 4 2019, 10:54 PM

kib marked an inline comment as done.May 4 2019, 11:05 PM

kib added inline comments.

sys/vm/vm_object.c
1750	It is a weird name but right flag. It means that the content of the object must not be lost. Handles for swap objects cause unneeded lookup, and not too long time ago it was under Giant. All current users of non-anon swap objects have other means to track their objects, and do not need a lookup by handle.

kib marked an inline comment as done.May 4 2019, 11:17 PM

Closed by commit rS347148: imgact_elf: do not relock the text vnode if possible. (authored by kib). · Explain WhyMay 5 2019, 11:04 AM

This revision was automatically updated to reflect the committed changes.

kib reopened this revision.May 5 2019, 11:09 AM

What remains here has already been committed, but with LK_SHARED passed to vn_lock(). This can be closed.

kib accepted this revision.May 18 2019, 8:05 PM

This revision is now accepted and ready to land.May 18 2019, 8:05 PM

kib closed this revision.May 18 2019, 8:05 PM

Switch to use shared vnode locks for text files during image activation.
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 56586

sys/compat/linux/linux_misc.c

sys/fs/nfsclient/nfs_clbio.c

sys/fs/nfsclient/nfs_clvnops.c

sys/fs/nullfs/null_vnops.c

sys/fs/unionfs/union_subr.c

sys/kern/imgact_aout.c

sys/kern/imgact_elf.c

sys/kern/kern_exec.c

sys/kern/vfs_default.c

sys/kern/vfs_subr.c

sys/kern/vfs_vnops.c

sys/kern/vnode_if.src

sys/sys/fcntl.h

sys/sys/imgact.h

sys/sys/vnode.h

sys/ufs/ufs/ufs_extattr.c

sys/vm/vm_map.h

sys/vm/vm_map.c

sys/vm/vm_mmap.c

sys/vm/vm_object.c

sys/vm/vnode_pager.c

Switch to use shared vnode locks for text files during image activation.ClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 56586

sys/compat/linux/linux_misc.c

sys/fs/nfsclient/nfs_clbio.c

sys/fs/nfsclient/nfs_clvnops.c

sys/fs/nullfs/null_vnops.c

sys/fs/unionfs/union_subr.c

sys/kern/imgact_aout.c

sys/kern/imgact_elf.c

sys/kern/kern_exec.c

sys/kern/vfs_default.c

sys/kern/vfs_subr.c

sys/kern/vfs_vnops.c

sys/kern/vnode_if.src

sys/sys/fcntl.h

sys/sys/imgact.h

sys/sys/vnode.h

sys/ufs/ufs/ufs_extattr.c

sys/vm/vm_map.h

sys/vm/vm_map.c

sys/vm/vm_mmap.c

sys/vm/vm_object.c

sys/vm/vnode_pager.c

Switch to use shared vnode locks for text files during image activation.
ClosedPublic
Actions

Revision Contents
Changeset List