Details

Reviewers

jeff
kib

Summary

The code is both highly experimental and suffers from fixable limitations, but is already very useful as it is. Modulo more testing I consider the patch below committable.

This lookup only modifies the terminal vnode, meaning no intermediate path components ever get dirtied and it scales perfectly as long as the target node is not shared with anyone.

There are 3 parts to it which I'm putting together for an easier overview, they wont be committed in one go:

introduction of sequence counters for vnodes
fast path lookup in the namecache
conversion of tmpfs

I have separate patches for both ufs and zfs.

Key note is that although there are various limitations to it, it does fully work for most lookups (e.g., from stat) and in case of trouble supports continuing with the locked variant. Most notably not all filesystems support the new permission checking -- upon crossing to such a fs resolving continues the old way.

Design is described in the comment above cache_fplookup.

Notes: atomicity of traversal from one entry to another is provided with sequence counters. This ran into significant problems e.g., with rename routines which end up relocking after doing some work w.r.t. rename. For this reason an easy solution of end-to-end coverage for any exclusive locking does not work out. Instead I split this into 2 counters, one which tracks how many pending modifications are there and another to denote something is being done for the lookup.

The following ops modify the count: rename, setattr (anything), mount, unmount + some fs-specific operations. I believe this provides a complete list of anything which can interfere with the lookup.

In particular the stock lookup guarantees nobody changes permissions or adds/removes entries as we leapfrog to the next vnode. That is, it is impossible to find a vnode and check against stale permissions. This is maintained.

Note neither lookup guarantees anything about the state of the complete visited chain.

Performance:

incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total

tinderbox -j 104 NO_CLEAN WITHOUT_CTF on tmpfs:
before: 2975.76s user 4573.04s system 6429% cpu 1:57.40 total https://people.freebsd.org/~mjg/fg/flix1-tinderbox-inc.svg
after: 3083.57s user 1618.61s system 4867% cpu 1:36.61 total https://people.freebsd.org/~mjg/fg/flix1-tinderbox-inc-smr.svg

that is, almost all remaining contention is in the vm

microbenchmarks:
concurrent access(2) to the same file -- scales better, but bottlenecks on dirtying the terminal vnode
concurrent access(2) to different files in the same directory -- no writes to shared areas, thus no cacheline ping pong

Limitations and what's going to happen with them are as follows:

MAC vnode hooks depend on shared locks, thus relevant MAC hooks will be checked at runtime upfront and if any are present the lookup will abort. I don't intend to fix this.
AUDIT has the same problem and will also be checked for. I don't intend to fix this. This can be made more granular so that lookup only aborts if logging would take place.
dot dot lookups -- will be added later
symlinks -- will be added later
capability mode -- unclear what to do with it. it depends on explicitly tracking all visited .. vnodes. the code may get simpler with smr.
dirfd != AT_FDCWD -- trivial, will add later
anything which wants the parent -- D23917

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

mjg created this revision.Mar 2 2020, 2:33 AM

Herald added subscribers: jonathan, imp. · View Herald TranscriptMar 2 2020, 2:33 AM

Harbormaster completed remote builds in B29713: Diff 69044.Mar 2 2020, 2:33 AM

mjg edited the summary of this revision. (Show Details)Mar 2 2020, 2:34 AM

mjg edited the summary of this revision. (Show Details)Mar 2 2020, 2:36 AM

mjg edited the summary of this revision. (Show Details)

mjg added inline comments.Mar 2 2020, 2:42 AM

sys/sys/vnode.h
71	This is a wart from failed rename experiment, will remove later.

mjg edited the summary of this revision. (Show Details)Mar 2 2020, 2:51 AM

fix fixups
handle SAVENAME

assorted bugfixes and cleanups

Harbormaster completed remote builds in B29768: Diff 69143.Mar 4 2020, 12:08 AM

assorted fixups

Modulo bugs which will come out in more testing I consider this + the follow up 2 patches to be good enough to commit.

Even with all these limitations it already provides a net win in buildkernel et al.

assorted

jeff added inline comments.Mar 6 2020, 5:44 AM

sys/kern/vfs_cache.c
2887	I assume you want FILE:LINE to propagate all the way through.
sys/kern/vfs_subr.c
5515	relookup is an artifact of the bad api. It is unavoidable given our current locking scheme.

convert some inlines to macros to get better assertion output
hide vfs_smr access behind dedicated vfs_* macros
assert smr in fplookup pre/post routines
misc comment fixes

So In general the naming scheme is:
vn_* or v* in older code for vnode specific routines.
vop_ for routines specific to ops.
vfs_ for global or mount.

Not to get too pedantic but vfs_seq_* would might more sense as vn_seq_.

that's fair, I'll rename. While here the question is if this should be "smr lookup" instead of "fast path lookup". Then in particular the mount point flag would be MNTK_SMRLOOKUP and so on.

As you probably noticed smr accessors are not used here. I think they are a little problematic to integrate right now and imo can wait given the assertions already in place.

I don't mind adding accessors after the fact. It will help us continue to make them more natural and less obtrusive if we can make this work with them.

I have been calling my vm routines grab*_unlocked() which I don't like either. I think of smr naming as something like rw or sx. So I might hesitate to name a system after the thing used to synchronize it.

cache_fast_lookup() is not so cumbersome.

Is there an implementation of the vop somewhere that I'm missing?

rebase on top of updated D23913
rename vfs_seqc to vn_seqc
add back tmpfs port

add fallback to the slow path

assorted cleanups
fix a ref leak on fallback error

tested by pho

mjg edited the summary of this revision. (Show Details)Jul 3 2020, 8:07 PM

emaste added a subscriber: emaste.Jul 6 2020, 1:40 PM

So I think the first bit to extract from this patch and review is the vnode modification count, ie, v_seqc/v_seqc_users.

I want to look, at least, can it be piggy-backed on vn_start_write(). Basically you are tracking mostly the same stuff, except that some things like regular writes are not important.

I extracted the patch into D25573 . This review will only serve as a high-level view of the patchset.

mjg edited the summary of this revision. (Show Details)Jul 6 2020, 6:44 PM

mjg edited the summary of this revision. (Show Details)

cem added a subscriber: cem.Jul 20 2020, 7:23 PM

The patchset mostly landed in revisions r363516 .. r363523

vfs: fully smr-protected path lookup
AbandonedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 74018

sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c

sys/fs/tmpfs/tmpfs.h

sys/fs/tmpfs/tmpfs_subr.c

sys/fs/tmpfs/tmpfs_vfsops.c

sys/fs/tmpfs/tmpfs_vnops.h

sys/fs/tmpfs/tmpfs_vnops.c

sys/kern/kern_descrip.c

sys/kern/vfs_cache.c

sys/kern/vfs_lookup.c

sys/kern/vfs_mount.c

sys/kern/vfs_subr.c

sys/kern/vnode_if.src

sys/security/mac/mac_framework.h

sys/sys/_seqc.h

sys/sys/filedesc.h

sys/sys/mount.h

sys/sys/namei.h

sys/sys/seqc.h

sys/sys/vnode.h

vfs: fully smr-protected path lookupAbandonedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 74018

sys/cddl/compat/opensolaris/kern/opensolaris_vfs.c

sys/fs/tmpfs/tmpfs.h

sys/fs/tmpfs/tmpfs_subr.c

sys/fs/tmpfs/tmpfs_vfsops.c

sys/fs/tmpfs/tmpfs_vnops.h

sys/fs/tmpfs/tmpfs_vnops.c

sys/kern/kern_descrip.c

sys/kern/vfs_cache.c

sys/kern/vfs_lookup.c

sys/kern/vfs_mount.c

sys/kern/vfs_subr.c

sys/kern/vnode_if.src

sys/security/mac/mac_framework.h

sys/sys/_seqc.h

sys/sys/filedesc.h

sys/sys/mount.h

sys/sys/namei.h

sys/sys/seqc.h

sys/sys/vnode.h

vfs: fully smr-protected path lookup
AbandonedPublic
Actions

Revision Contents
Changeset List