Page MenuHomeFreeBSD

[WIP] vfs: fully smr-protected path lookup
Needs ReviewPublic

Authored by mjg on Mar 2 2020, 2:33 AM.



The code is both highly experimental and limited, but is the minimum which makes sense to show.

This lookup only modifies the terminal vnode, meaning no intermediate path components ever get dirtied and it scales perfectly as long as the target node is not shared with anyone.

Note there are cleanups to be made as well and the patch is not yet in comittable shape (even modulo bugs) and I don't intend to commit it without an immediate followup to close limitations (see below).

There are 3 parts to it which I'm putting together for an easier overview, they wont be committed in one go:

  • introduction of sequence counters for vnodes
  • fast path lookup in the namecache
  • conversion of tmpfs

I have a separate patch for ufs and zfs will follow.

Design is described in the comment above cache_fplookup.

Notes: atomicity of traversal from one entry to another is provided with sequence counters. This ran into significant problems e.g., with rename routines which end up relocking after doing some work w.r.t. rename. For this reason an easy solution of end-to-end coverage for any exclusive locking does not work out. Instead I split this into 2 counters, one which tracks how many pending modifications are there and another to denote something is being done for the lookup.

The following ops modify the count: rename, setattr (anything), mount, unmount + some fs-specific operations. I believe this provides a complete list of anything which can interfere with the lookup.

In particular the stock lookup guarantees nobody changes permissions or adds/removes entries as we leapfrog to the next vnode. That is, it is impossible to find a vnode and check against stale permissions. This is maintained.

Note neither lookup guarantees anything about the state of the complete visited chain.

Microbenchmark wins are drastic of course and I'm not going to show them, for an actual workload here is an incremental -j 104 bzImage:

before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total

Limitations and what's going to happen with them are as follows:

  • MAC vnode hooks depend on shared locks, thus relevant MAC hooks will be checked at runtime upfront and if any are present the lookup will abort. I don't intend to fix this.
  • AUDIT has the same problem and will also be checked for. I don't intend to fix this. This can be made more granular so that lookup only aborts if logging would take place.
  • dot dot lookups -- will be added later
  • symlinks -- will be added later
  • capability mode -- unclear what to do with it. it depends on explicitly tracking all visited .. vnodes. the code may get simpler with smr.
  • dirfd != AT_FDCWD -- trivial, will add later
  • anything which wants the parent -- not added yet only to simplify the review
  • aborting starts the current lookup from scratch every time. I have code to change it, but it's not a big deal and will make most sense to implement after the above get plugged

Diff Detail

Lint Skipped
Unit Tests Skipped

Event Timeline

mjg created this revision.Mar 2 2020, 2:33 AM
mjg edited the summary of this revision. (Show Details)Mar 2 2020, 2:34 AM
mjg edited the summary of this revision. (Show Details)Mar 2 2020, 2:36 AM
mjg edited the summary of this revision. (Show Details)
mjg added inline comments.Mar 2 2020, 2:42 AM

This is a wart from failed rename experiment, will remove later.

mjg edited the summary of this revision. (Show Details)Mar 2 2020, 2:51 AM
mjg updated this revision to Diff 69049.Mar 2 2020, 5:54 AM
  • fix fixups
  • handle SAVENAME
mjg updated this revision to Diff 69143.Mar 4 2020, 12:08 AM
  • assorted bugfixes and cleanups
mjg updated this revision to Diff 69198.Mar 5 2020, 3:42 AM
  • assorted fixups

Modulo bugs which will come out in more testing I consider this + the follow up 2 patches to be good enough to commit.

Even with all these limitations it already provides a net win in buildkernel et al.

mjg updated this revision to Diff 69205.Mar 5 2020, 7:26 AM
  • assorted
jeff added inline comments.Mar 6 2020, 5:44 AM

I assume you want FILE:LINE to propagate all the way through.


relookup is an artifact of the bad api. It is unavoidable given our current locking scheme.

mjg updated this revision to Diff 69252.Mar 6 2020, 6:27 AM
  • convert some inlines to macros to get better assertion output
  • hide vfs_smr access behind dedicated vfs_* macros
  • assert smr in fplookup pre/post routines
  • misc comment fixes
jeff added a comment.Mar 6 2020, 6:31 AM

So In general the naming scheme is:
vn_* or v* in older code for vnode specific routines.
vop_ for routines specific to ops.
vfs_ for global or mount.

Not to get too pedantic but vfs_seq_* would might more sense as vn_seq_.

mjg added a comment.EditedMar 6 2020, 6:35 AM

that's fair, I'll rename. While here the question is if this should be "smr lookup" instead of "fast path lookup". Then in particular the mount point flag would be MNTK_SMRLOOKUP and so on.

As you probably noticed smr accessors are not used here. I think they are a little problematic to integrate right now and imo can wait given the assertions already in place.

jeff added a comment.Mar 6 2020, 7:06 AM

I don't mind adding accessors after the fact. It will help us continue to make them more natural and less obtrusive if we can make this work with them.

I have been calling my vm routines grab*_unlocked() which I don't like either. I think of smr naming as something like rw or sx. So I might hesitate to name a system after the thing used to synchronize it.

cache_fast_lookup() is not so cumbersome.

Is there an implementation of the vop somewhere that I'm missing?

mjg updated this revision to Diff 69255.Mar 6 2020, 7:39 AM
  • rebase on top of updated D23913
  • rename vfs_seqc to vn_seqc
  • add back tmpfs port