Page MenuHomeFreeBSD

Switch to use shared vnode locks for text files during image activation.
ClosedPublic

Authored by kib on Apr 16 2019, 1:22 PM.

Details

Summary

kern_execve() locks text vnode exclusive to be able to set and clear VV_TEXT flag. VV_TEXT is mutually exclusive with the v_writecount > 0 condition.

The change removes VV_TEXT, replacing it with the condition v_writecount < -1, and put v_writecount under the vnode interlock. Each text reference decrements v_writecount. To clear the text reference, it is recorded in the vm_map_entry backed by the text file as MAP_ENTRY_VN_TEXT flag, and v_writecount is incremented on the map entry removal. Now, the operations like VOP_ADD_WRITECOUNT() and VOP_SET_TEXT() check that v_writecount does not contradict the desired change. vn_writecheck() is now racy and its use was eliminated everywhere except access, atomic check for writeability and increment of v_writecount is performed by the VOP.

nullfs bypasses v_writecount to the lower vnode always.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 23904

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
mjg added a comment.EditedApr 18 2019, 4:28 PM

I gave it a spin with poudriere and it panics:

panic: Assertion error_ == 0 failed at /tank/users/mjg/src/freebsd/sys/kern/kern_exec.c:623
cpuid = 59
time = 1555604110
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe056d49a650
vpanic() at vpanic+0x19d/frame 0xfffffe056d49a6a0
panic() at panic+0x43/frame 0xfffffe056d49a700
kern_execve() at kern_execve+0x12f3/frame 0xfffffe056d49aa50
sys_execve() at sys_execve+0x4c/frame 0xfffffe056d49aad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe056d49abf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe056d49abf0

  • syscall (59, FreeBSD ELF64, sys_execve), rip = 0x8003d517a, rsp = 0x7fffffffe388, rbp = 0x7fffffffe4d0 ---

KDB: enter: panic
[ thread pid 12060 tid 101862 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

This is the unset in interpreted case and is probably a race where 2 different threads are unsetting the flag.

I was working on shared exec myself few years back. I think both writers and execs have to be explicitly counted. Instead of using a dedicated flag to differentiate between them (or separate counts), in Linux they have a hack where negative value signifies execs.

kib added a comment.Apr 18 2019, 5:39 PM
In D19923#428953, @mjg wrote:

I gave it a spin with poudriere and it panics:

panic: Assertion error_ == 0 failed at /tank/users/mjg/src/freebsd/sys/kern/kern_exec.c:623
cpuid = 59
time = 1555604110
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe056d49a650
vpanic() at vpanic+0x19d/frame 0xfffffe056d49a6a0
panic() at panic+0x43/frame 0xfffffe056d49a700
kern_execve() at kern_execve+0x12f3/frame 0xfffffe056d49aa50
sys_execve() at sys_execve+0x4c/frame 0xfffffe056d49aad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe056d49abf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe056d49abf0

  • syscall (59, FreeBSD ELF64, sys_execve), rip = 0x8003d517a, rsp = 0x7fffffffe388, rbp = 0x7fffffffe4d0 ---

KDB: enter: panic
[ thread pid 12060 tid 101862 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

This is the unset in interpreted case and is probably a race where 2 different threads are unsetting the flag.
I was working on shared exec myself few years back. I think both writers and execs have to be explicitly counted. Instead of using a dedicated flag to differentiate between them (or separate counts), in Linux they have a hack where negative value signifies execs.

The execs are counted, but it is more complicated than just count number of execs. We clear VV_TEXT when the last mapping of the vnode is destroyed. The reason is that it is impossible to define when the file is no longer serving as text. Normal executable maps the text file twice, one time for code segment, one time for CoW data. So if I start counting execs in v_writecount, I do not know when to decrement. More, imagine a weird but legitimate program that executes, mmap some dso and then unmaps the main binary.

In fact, I believe that your backtrace is due to plain bug: VOP_UNSET_TEXT() should be called only when caller actually set v_writecount to -1. If it was already -1, then the clearing should be left to the other thread. Since we are exiting with error when unsetting -1, it should be fine: if other thread cleared the flag earlier than we. it must fail the exec anyway.

In principle, we can mark the mappings created by the image activator, and increment v_writecount on unmapping of each segment, if this simple solution does not work.

kib updated this revision to Diff 56361.Apr 18 2019, 5:42 PM

On kern_execve() error path, only call VOP_UNSET_TEXT() when it was set by current thread.

kib updated this revision to Diff 56363.Apr 18 2019, 7:24 PM

A version with refcounted text references. vm_map_entry keeps such reference on the lowest object in the shadow chain, which must be vnode or tmpfs swap.

kib added a subscriber: pho.Apr 23 2019, 6:24 PM
mjg added a comment.Apr 23 2019, 8:18 PM

This crashes for me early on:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 12
fault virtual address   = 0x8
fault code              = supervisor read data  , page not present
instruction pointer     = 0x20:0xffffffff808abef3
stack pointer           = 0x28:0xfffffe05670e0930
frame pointer           = 0x28:0xfffffe05670e0960
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1782 (sh)
trap number             = 12
panic: page fault
cpuid = 8
time = 1556050578
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe05670e05f0
vpanic() at vpanic+0x19d/frame 0xfffffe05670e0640
panic() at panic+0x43/frame 0xfffffe05670e06a0
trap_fatal() at trap_fatal+0x394/frame 0xfffffe05670e0700
trap_pfault() at trap_pfault+0x62/frame 0xfffffe05670e0750
trap() at trap+0x2b4/frame 0xfffffe05670e0860
calltrap() at calltrap+0x8/frame 0xfffffe05670e0860
--- trap 0xc, rip = 0xffffffff808abef3, rsp = 0xfffffe05670e0930, rbp = 0xfffffe05670e0960 ---
vm_map_entry_inc_vnode_text() at vm_map_entry_inc_vnode_text+0xd3/frame 0xfffffe05670e0960
vmspace_fork() at vmspace_fork+0x9cf/frame 0xfffffe05670e09c0
fork1() at fork1+0x4c3/frame 0xfffffe05670e0a80
sys_fork() at sys_fork+0x4c/frame 0xfffffe05670e0ad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe05670e0bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe05670e0bf0
--- syscall (2, FreeBSD ELF64, sys_fork), rip = 0x8003ed88a, rsp = 0x7fffffffe2d8, rbp = 0x7fffffffe310 ---
KDB: enter: panic
[ thread pid 1782 tid 101613 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
kib added a comment.Apr 24 2019, 10:36 AM
In D19923#430524, @mjg wrote:

This crashes for me early on:

Fatal trap 12: page fault while in kernel mode
cpuid = 8; apic id = 12
fault virtual address   = 0x8
fault code              = supervisor read data  , page not present
instruction pointer     = 0x20:0xffffffff808abef3
stack pointer           = 0x28:0xfffffe05670e0930
frame pointer           = 0x28:0xfffffe05670e0960
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 1782 (sh)
trap number             = 12
panic: page fault
cpuid = 8
time = 1556050578
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe05670e05f0
vpanic() at vpanic+0x19d/frame 0xfffffe05670e0640
panic() at panic+0x43/frame 0xfffffe05670e06a0
trap_fatal() at trap_fatal+0x394/frame 0xfffffe05670e0700
trap_pfault() at trap_pfault+0x62/frame 0xfffffe05670e0750
trap() at trap+0x2b4/frame 0xfffffe05670e0860
calltrap() at calltrap+0x8/frame 0xfffffe05670e0860
--- trap 0xc, rip = 0xffffffff808abef3, rsp = 0xfffffe05670e0930, rbp = 0xfffffe05670e0960 ---
vm_map_entry_inc_vnode_text() at vm_map_entry_inc_vnode_text+0xd3/frame 0xfffffe05670e0960

Can you show me the line number for this address ?

What is 'early on' ? During init/initial shell startup, during the startup into multiuser, or some specific action ?

vmspace_fork() at vmspace_fork+0x9cf/frame 0xfffffe05670e09c0

And this line too, please.

fork1() at fork1+0x4c3/frame 0xfffffe05670e0a80
sys_fork() at sys_fork+0x4c/frame 0xfffffe05670e0ad0
amd64_syscall() at amd64_syscall+0x276/frame 0xfffffe05670e0bf0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe05670e0bf0

  • syscall (2, FreeBSD ELF64, sys_fork), rip = 0x8003ed88a, rsp = 0x7fffffffe2d8, rbp = 0x7fffffffe310 ---

KDB: enter: panic
[ thread pid 1782 tid 101613 ]
Stopped at kdb_enter+0x3b: movq $0,kdb_why

I asked pho to help with debugging/testing, but he reported something different so far.

mjg added a comment.EditedApr 24 2019, 11:38 AM

Sorry, I meant early when running poudriere. vp is NULL in vm_map_entry_inc_vnode_text. Can be trivially reproduced by having the shell fork when backed by tmpfs, in particular unpacking world, 'chroot . sh' + 'ls' do the trick.

kib added a comment.Apr 24 2019, 11:52 AM
In D19923#430751, @mjg wrote:

Sorry, I meant early when running poudriere. vp is NULL in vm_map_entry_inc_vnode_text. Can be trivially reproduced by having the shell fork when backed by tmpfs, in particular unpacking world, 'chroot . sh' + 'ls' do the trick.

Strange, I cannot reproduce it. might be there should be something more involved. I mounted tmpfs, then copied /bin/sh and /bin/ls there, then started copied shell, and from it I did ./ls. Anyway, I am uploading updated patch with assertion about the object type in inc_vnode_text(), and with two fixes already done after Peter' testing. I expect much more would be reported by Peter.

kib updated this revision to Diff 56580.Apr 24 2019, 11:55 AM

When forking wired process, do not copy the MAP_ENTRY_VN_EXEC flag, since map entry does not inherit from vnode.
Do not call VOP_UNSET_TEXT() on return from kern_execve() when SET was not successful.
Add assert about the object type in inc_vnode_text().

mjg added a comment.Apr 24 2019, 12:30 PM

Extended the patch with:

+       if (!(((object->flags & OBJ_TMPFS) == 0 &&
+           object->type == OBJT_VNODE) ||
+           ((object->flags & OBJ_TMPFS) != 0 &&
+           object->type == OBJT_SWAP)))
+               printf("obj %p flags %x type %d\n", object, object->flags, object->type);

obj 0xfffff88143effd00 flags 1000 type 0
panic: Assertion ((object->flags & OBJ_TMPFS) == 0 && object->type == OBJT_VNODE) || ((object->flags & OBJ_TMPFS) != 0 && object->type == OBJT_SWAP) failed at /tank/users/mjg/src/freebsd/sys/vm/vm_map.c:535

The machine in question is pig1, you can easily get access.

kib added a comment.Apr 24 2019, 2:15 PM
In D19923#430772, @mjg wrote:

Extended the patch with:

+       if (!(((object->flags & OBJ_TMPFS) == 0 &&
+           object->type == OBJT_VNODE) ||
+           ((object->flags & OBJ_TMPFS) != 0 &&
+           object->type == OBJT_SWAP)))
+               printf("obj %p flags %x type %d\n", object, object->flags, object->type);

obj 0xfffff88143effd00 flags 1000 type 0
panic: Assertion ((object->flags & OBJ_TMPFS) == 0 && object->type == OBJT_VNODE) || ((object->flags & OBJ_TMPFS) != 0 && object->type == OBJT_SWAP) failed at /tank/users/mjg/src/freebsd/sys/vm/vm_map.c:535

The machine in question is pig1, you can easily get access.

Apparently it was a bug in tmpfs, we could collapse tmpfs swap object.

kib updated this revision to Diff 56586.Apr 24 2019, 2:17 PM

Do not collapse NOSPLIT objects.
Only unset text on interpreter if it was set by us.
Assert that we set text on the right vnode.

mjg added a comment.Apr 24 2019, 4:30 PM

This works fine

kib updated this revision to Diff 56662.Apr 25 2019, 7:21 PM

Collection of all fixes for now.

kib edited the summary of this revision. (Show Details)Apr 26 2019, 1:01 PM
trasz accepted this revision.Apr 26 2019, 4:55 PM

I like the idea, and can't spot anything wrong with the patch itself.

sys/kern/imgact_elf.c
943

This looks like something that should be committed separately; same with similar chunks below.

sys/sys/fcntl.h
149

Unrelated change?

This revision is now accepted and ready to land.Apr 26 2019, 4:55 PM
kib added inline comments.Apr 26 2019, 5:13 PM
sys/kern/imgact_elf.c
943

This could be, and I am now not convinced that I want such change. The reason is the M_NOWAIT is allowed to drain free queue deeper. For now I included them to see how the change pass over the Peter stress2. Then I will decide.

sys/sys/fcntl.h
149

Very much related. The error handling for VOP_OPEN/O_EXLOCK failures was significantly rewritten to properly handle v_writecount on error and this flag is no longer needed. Look at the change in vn_closefile().

trasz added inline comments.Apr 26 2019, 6:10 PM
sys/sys/fcntl.h
149

I might be missing something, but there's no other occurrence of O_TTY_INIT in the diff?

kib added inline comments.Apr 26 2019, 6:25 PM
sys/sys/fcntl.h
149

Sure, there is no. FOPENFAILED reused open-only bit to not deplete limited flags namespace.

pho added a comment.Apr 27 2019, 5:08 PM

I have run the full stress2 test set on amd64 using r346698+3cb8bc20f88(cpuctl).
No problems seen.

kib added reviewers: markj, alc.Apr 27 2019, 5:45 PM
trasz added inline comments.Apr 28 2019, 1:09 PM
sys/sys/fcntl.h
149

Ah, okay. Commit the removal of the old #define separately, perhaps?

kib added inline comments.Apr 28 2019, 1:16 PM
sys/sys/fcntl.h
149

How can this be done ? The define is needed right now, and it is not needed after the revision is landed. Are you proposing to commit everything but the define removal ? This is silly.

alc added inline comments.Apr 28 2019, 4:38 PM
sys/compat/linux/linux_misc.c
468

Doesn't this require a write lock on the map?

kib marked an inline comment as done.Apr 28 2019, 5:39 PM
kib added inline comments.
sys/compat/linux/linux_misc.c
468

Fixed on my branch, I will post the update after some other fixes are done.

trasz added inline comments.Apr 28 2019, 8:33 PM
sys/sys/fcntl.h
149

Sigh, never mind, it's just me getting #define arguments backwards - I thought you were removing O_TTY_OPEN and not FOPENFAILED. Sorry for the noise.

markj added a comment.Apr 29 2019, 3:35 PM

There is a dangling reference to VV_TEXT in a comment in rtld.c.

sys/kern/imgact_elf.c
452

A cast to void is better for explicitly ignoring errors, to avoid warnings from static analyzers.

943

I don't understand the claim about M_NOWAIT.

sys/kern/kern_exec.c
615

This comment is stale.

1743

The comment isn't quite right since that flag no longer exists.

sys/kern/vfs_vnops.c
446

This is a dead store.

sys/sys/vnode.h
833

I think _CHECKED is a more common suffix for this type of wrapper, e.g., pmap_pcid_alloc_checked(), vm_page_replace_checked().

sys/vm/vm_fault.c
1673

This is always a no-op since vm_map_entry_set_vnode_text() does nothing if MAP_ENTRY_VN_EXEC is not set.

sys/vm/vnode_pager.c
337

Typo in "object".

kib marked 9 inline comments as done.Apr 29 2019, 5:44 PM
kib added inline comments.
sys/kern/imgact_elf.c
452

It actually was changed more substantially in intermediate version, but now becomes a stray change. Reverted.

943

I forgot that I changed it in r243040. Weird.

sys/kern/vfs_vnops.c
446

Yes, I already removed it in my branch, to reduce the diff.

sys/vm/vm_fault.c
1673

Yes, all is needed there only the flag clearing. New entry cannot cary text reference since new object does not inherit from old one.

kib marked 4 inline comments as done.Apr 29 2019, 5:45 PM
kib updated this revision to Diff 56819.

Handle Markj' comments. Mainly,

  • VV_TEXT removed from comments
  • _SUCCEED -> _CHECKED
  • removed dead code in vm_fault_copy_entry.
This revision now requires review to proceed.Apr 29 2019, 5:45 PM
emaste added a subscriber: emaste.Apr 29 2019, 8:51 PM
pho added a comment.Apr 30 2019, 2:07 PM

Testing of r346804+db5a624a416(cpuctl) is complete and no problems seen.

markj added inline comments.May 4 2019, 8:43 PM
sys/compat/linux/linux_misc.c
470

Why not just set textset = false? We have already acquired a text ref by this point.

sys/fs/nfsclient/nfs_clbio.c
1642

Shouldn't it be <= -1 or VOP_IS_TEXT(vp)?

sys/kern/kern_exec.c
466

Did you mean to write "its vnode"?

sys/vm/vm_object.c
1730

Can you explain why this is needed? Is it intended to catch tmpfs objects?

sys/vm/vnode_pager.c
344

Does this do the right thing for nullfs vnodes?

kib marked 5 inline comments as done.May 4 2019, 9:23 PM
kib added inline comments.
sys/fs/nfsclient/nfs_clbio.c
1642

Vnode is potentially unlocked there, when called from the io thread. See the commented assert above. So VOP_IS_TEXT() does not work, and I do not want to remove the locked requirement from it.

Changed to <= -1.

sys/vm/vm_object.c
1730

Yes, this is in fact already existing bug demonstrated by the patch. It destroys content of the tmpfs or posix shared memory backing object by copying all pages to the shadow.

sys/vm/vnode_pager.c
344

This occurs on the vnode reclamation, and vgonel() calls vfs_notify_upper(VFS_NOTIFY_UPPER_RECLAIM). So at the time that we zero lower vnode v_writecount, upper (nullfs) vnode is already reclaimed.

Note that vp there is always lower vnode, since object->handle is lowervp and e.g. vm_mmap_vnode() specifically re-evaluates vp to get to the lowervp. VOP_SET_TEXT() on nullfs passes the op to lowervp.

kib marked 3 inline comments as done.May 4 2019, 9:25 PM
kib updated this revision to Diff 57048.

Fix (racy) check for text vnode in NFS.
Simplify linux a.out loader by reusing text reference.
Add missed word in the comment.

markj accepted this revision.May 4 2019, 10:54 PM
markj added inline comments.
sys/vm/vm_object.c
1730

I see. Isn't (flags & OBJ_NOSPLIT) == 0 a slightly weird predicate for deciding whether to collapse the backing object? Semantically it seems unrelated. I believe the handle != NULL condition is meant to exclude non-anonymous objects, but tmpfs and shm (including SYSV shm?) do not specify a handle, and I am not aware of any in-tree code that creates a swap object with a handle.

This revision is now accepted and ready to land.May 4 2019, 10:54 PM
kib marked an inline comment as done.May 4 2019, 11:05 PM
kib added inline comments.
sys/vm/vm_object.c
1730

It is a weird name but right flag. It means that the content of the object must not be lost. Handles for swap objects cause unneeded lookup, and not too long time ago it was under Giant. All current users of non-anon swap objects have other means to track their objects, and do not need a lookup by handle.

kib marked an inline comment as done.May 4 2019, 11:17 PM
This revision was automatically updated to reflect the committed changes.
kib reopened this revision.May 5 2019, 11:09 AM
alc added a comment.May 18 2019, 7:51 PM

What remains here has already been committed, but with LK_SHARED passed to vn_lock(). This can be closed.

kib accepted this revision.May 18 2019, 8:05 PM
This revision is now accepted and ready to land.May 18 2019, 8:05 PM
kib closed this revision.May 18 2019, 8:05 PM