The code is highly experimental and suffers from fixable limitations, but is already very useful as it is. Modulo more testing, I consider the patch below committable.
This lookup modifies only the terminal vnode, meaning no intermediate path components ever get dirtied, and it scales perfectly as long as the target vnode is not shared with anyone.
There are 3 parts to it which I'm putting together for an easier overview; they won't be committed in one go:
- introduction of sequence counters for vnodes
- fast path lookup in the namecache
- conversion of tmpfs
I have separate patches for both ufs and zfs.
The key note is that, although there are various limitations to it, it fully works for most lookups (e.g., from stat) and, in case of trouble, supports continuing with the locked variant. Most notably, not all filesystems support the new permission checking -- upon crossing into such a filesystem, resolution continues the old way.
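To illustrate the shape of it, here is a minimal sketch of the control flow. The names (lookup_lockless, lookup_locked, the status values) are made up for this example; only the structure -- try the lockless fast path, bail out and redo with the locked variant on any trouble -- reflects what is described above (the actual fast path is cache_fplookup):

/*
 * Placeholder declarations only; bodies elided.  In the patch the fast
 * path is cache_fplookup and the fallback is the pre-existing locked
 * lookup.
 */
enum fpl_status {
	FPL_HANDLED,	/* fast path fully resolved the lookup */
	FPL_ABORTED	/* unsupported fs, MAC/AUDIT hooks, a race, ... */
};

enum fpl_status lookup_lockless(const char *path, int *errorp);
int lookup_locked(const char *path);

static int
lookup(const char *path)
{
	int error;

	/*
	 * Attempt the lockless fast path first.  Anything it cannot
	 * handle makes it abort, and resolution simply continues with
	 * the old locked variant.
	 */
	if (lookup_lockless(path, &error) == FPL_HANDLED)
		return (error);
	return (lookup_locked(path));
}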
Design is described in the comment above cache_fplookup.
Notes: atomicity of traversal from one entry to the next is provided with sequence counters. This ran into significant problems, e.g., with rename routines which end up relocking after doing some of the work, so the easy solution of end-to-end coverage for any exclusive locking does not work out. Instead I split this into 2 counters: one which tracks how many modifications are pending, and another which denotes that something relevant to the lookup is in progress (see the sketch below these notes).
The following operations modify the count: rename, setattr (anything), mount, unmount, plus some filesystem-specific operations. I believe this covers everything which can interfere with the lookup.
In particular, the stock lookup guarantees that nobody changes permissions or adds/removes entries as we leapfrog to the next vnode; that is, it is impossible to find a vnode and check it against stale permissions. This guarantee is maintained.
Note that neither lookup guarantees anything about the state of the complete visited chain.
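For illustration, here is a rough userland model of the 2-counter scheme using C11 atomics. All names are made up for the example, writers are assumed to be serialized against each other (the vnode lock takes care of that in the kernel), and memory ordering is mostly left at the coarse defaults -- the real code has to be more careful:

#include <stdatomic.h>
#include <stdbool.h>

struct vnode_seq {
	atomic_uint seqc;	/* odd while any modification is in flight */
	atomic_uint writers;	/* how many modifications are pending */
};

/* Entered by rename, setattr, mount, unmount, etc. before modifying the vnode. */
static void
vnode_seq_write_begin(struct vnode_seq *vs)
{
	if (atomic_fetch_add(&vs->writers, 1) == 0)
		atomic_fetch_add(&vs->seqc, 1);	/* goes odd: lookup must not trust this vnode */
}

static void
vnode_seq_write_end(struct vnode_seq *vs)
{
	if (atomic_fetch_sub(&vs->writers, 1) == 1)
		atomic_fetch_add(&vs->seqc, 1);	/* back to even: quiescent again */
}

/* Lockless lookup side: snapshot before reading, validate after. */
static unsigned
vnode_seq_read(struct vnode_seq *vs)
{
	return (atomic_load(&vs->seqc));
}

static bool
vnode_seq_validate(struct vnode_seq *vs, unsigned snap)
{
	atomic_thread_fence(memory_order_acquire);
	/* Fail if a modification was in flight at the snapshot or started since. */
	return ((snap & 1) == 0 && atomic_load(&vs->seqc) == snap);
}

The fast path snapshots the counter of the current vnode, finds the next entry in the namecache, snapshots the next vnode and re-validates the previous one before leapfrogging; any validation failure aborts to the locked lookup.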
Performance:
incremental -j 104 bzImage on tmpfs:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total
tinderbox -j 104 NO_CLEAN WITHOUT_CTF on tmpfs:
before: 2975.76s user 4573.04s system 6429% cpu 1:57.40 total https://people.freebsd.org/~mjg/fg/flix1-tinderbox-inc.svg
after: 3083.57s user 1618.61s system 4867% cpu 1:36.61 total https://people.freebsd.org/~mjg/fg/flix1-tinderbox-inc-smr.svg
That is, almost all remaining contention is in the vm.
microbenchmarks:
- concurrent access(2) to the same file -- scales better, but bottlenecks on dirtying the terminal vnode
- concurrent access(2) to different files in the same directory -- no writes to shared areas, thus no cacheline ping-pong
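For reference, the microbenchmarks boil down to something along these lines (an illustrative harness, not the one actually used): fork a bunch of workers, each hammering access(2) on its path, and run the whole thing under time(1). Pass every worker the same file for the shared-vnode case, or give each one its own file in the same directory for the other case.

#include <sys/wait.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
	long nproc, iters, i, j;

	if (argc < 2) {
		fprintf(stderr, "usage: %s path [nproc] [iters]\n", argv[0]);
		return (1);
	}
	nproc = argc > 2 ? atol(argv[2]) : 104;
	iters = argc > 3 ? atol(argv[3]) : 1000000;

	for (i = 0; i < nproc; i++) {
		if (fork() == 0) {
			/* each worker just calls access(2) in a loop */
			for (j = 0; j < iters; j++)
				(void)access(argv[1], F_OK);
			_exit(0);
		}
	}
	while (wait(NULL) > 0)
		;
	return (0);
}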
The limitations, and what is going to happen with them, are as follows:
- MAC vnode hooks depend on shared locks, so the relevant MAC hooks will be checked upfront at runtime and, if any are present, the lookup will abort. I don't intend to fix this.
- AUDIT has the same problem and will also be checked for. I don't intend to fix this, although it could be made more granular so that the lookup only aborts if logging would actually take place.
- dot dot lookups -- will be added later
- symlinks -- will be added later
- capability mode -- unclear what to do with it. It depends on explicitly tracking all visited .. vnodes; the code may get simpler with SMR.
- dirfd != AT_FDCWD -- trivial, will add later
- anything which wants the parent -- D23917