Page MenuHomeFreeBSD

vfs: fully smr-protected path lookup
AbandonedPublic

Authored by mjg on Mar 2 2020, 2:33 AM.

Details

Reviewers
jeff
kib
Summary

The code is both highly experimental and suffers from fixable limitations, but is already very useful as it is. Modulo more testing I consider the patch below committable.

This lookup only modifies the terminal vnode, meaning no intermediate path components ever get dirtied and it scales perfectly as long as the target node is not shared with anyone.

There are 3 parts to it which I'm putting together for an easier overview, they wont be committed in one go:

  • introduction of sequence counters for vnodes
  • fast path lookup in the namecache
  • conversion of tmpfs

I have separate patches for both ufs and zfs.

Key note is that although there are various limitations to it, it does fully work for most lookups (e.g., from stat) and in case of trouble supports continuing with the locked variant. Most notably not all filesystems support the new permission checking -- upon crossing to such a fs resolving continues the old way.

Design is described in the comment above cache_fplookup.

Notes: atomicity of traversal from one entry to another is provided with sequence counters. This ran into significant problems e.g., with rename routines which end up relocking after doing some work w.r.t. rename. For this reason an easy solution of end-to-end coverage for any exclusive locking does not work out. Instead I split this into 2 counters, one which tracks how many pending modifications are there and another to denote something is being done for the lookup.

The following ops modify the count: rename, setattr (anything), mount, unmount + some fs-specific operations. I believe this provides a complete list of anything which can interfere with the lookup.

In particular the stock lookup guarantees nobody changes permissions or adds/removes entries as we leapfrog to the next vnode. That is, it is impossible to find a vnode and check against stale permissions. This is maintained.

Note neither lookup guarantees anything about the state of the complete visited chain.

Performance:

incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total

tinderbox -j 104 NO_CLEAN WITHOUT_CTF on tmpfs:
before: 2975.76s user 4573.04s system 6429% cpu 1:57.40 total https://people.freebsd.org/~mjg/fg/flix1-tinderbox-inc.svg
after: 3083.57s user 1618.61s system 4867% cpu 1:36.61 total https://people.freebsd.org/~mjg/fg/flix1-tinderbox-inc-smr.svg

that is, almost all remaining contention is in the vm

microbenchmarks:
concurrent access(2) to the same file -- scales better, but bottlenecks on dirtying the terminal vnode
concurrent access(2) to different files in the same directory -- no writes to shared areas, thus no cacheline ping pong

Limitations and what's going to happen with them are as follows:

  • MAC vnode hooks depend on shared locks, thus relevant MAC hooks will be checked at runtime upfront and if any are present the lookup will abort. I don't intend to fix this.
  • AUDIT has the same problem and will also be checked for. I don't intend to fix this. This can be made more granular so that lookup only aborts if logging would take place.
  • dot dot lookups -- will be added later
  • symlinks -- will be added later
  • capability mode -- unclear what to do with it. it depends on explicitly tracking all visited .. vnodes. the code may get simpler with smr.
  • dirfd != AT_FDCWD -- trivial, will add later
  • anything which wants the parent -- D23917

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

mjg edited the summary of this revision. (Show Details)
sys/sys/vnode.h
70

This is a wart from failed rename experiment, will remove later.

  • fix fixups
  • handle SAVENAME
  • assorted bugfixes and cleanups
  • assorted fixups

Modulo bugs which will come out in more testing I consider this + the follow up 2 patches to be good enough to commit.

Even with all these limitations it already provides a net win in buildkernel et al.

sys/kern/vfs_cache.c
2887

I assume you want FILE:LINE to propagate all the way through.

sys/kern/vfs_subr.c
5515

relookup is an artifact of the bad api. It is unavoidable given our current locking scheme.

  • convert some inlines to macros to get better assertion output
  • hide vfs_smr access behind dedicated vfs_* macros
  • assert smr in fplookup pre/post routines
  • misc comment fixes

So In general the naming scheme is:
vn_* or v* in older code for vnode specific routines.
vop_ for routines specific to ops.
vfs_ for global or mount.

Not to get too pedantic but vfs_seq_* would might more sense as vn_seq_.

that's fair, I'll rename. While here the question is if this should be "smr lookup" instead of "fast path lookup". Then in particular the mount point flag would be MNTK_SMRLOOKUP and so on.

As you probably noticed smr accessors are not used here. I think they are a little problematic to integrate right now and imo can wait given the assertions already in place.

I don't mind adding accessors after the fact. It will help us continue to make them more natural and less obtrusive if we can make this work with them.

I have been calling my vm routines grab*_unlocked() which I don't like either. I think of smr naming as something like rw or sx. So I might hesitate to name a system after the thing used to synchronize it.

cache_fast_lookup() is not so cumbersome.

Is there an implementation of the vop somewhere that I'm missing?

  • rebase on top of updated D23913
  • rename vfs_seqc to vn_seqc
  • add back tmpfs port
mjg retitled this revision from [WIP] vfs: fully smr-protected path lookup to vfs: fully smr-protected path lookup.
mjg edited the summary of this revision. (Show Details)
  • add fallback to the slow path
  • assorted cleanups
  • fix a ref leak on fallback error

So I think the first bit to extract from this patch and review is the vnode modification count, ie, v_seqc/v_seqc_users.

I want to look, at least, can it be piggy-backed on vn_start_write(). Basically you are tracking mostly the same stuff, except that some things like regular writes are not important.

I extracted the patch into D25573 . This review will only serve as a high-level view of the patchset.

mjg edited the summary of this revision. (Show Details)

The patchset mostly landed in revisions r363516 .. r363523