The code is both highly experimental and limited, but it is the minimum that makes sense to show.
This lookup only modifies the terminal vnode, meaning no intermediate path components ever get dirtied and it scales perfectly as long as the target node is not shared with anyone.
Note there are cleanups to be made as well and the patch is not yet in committable shape (even modulo bugs). I don't intend to commit it without an immediate followup to close limitations (see below).
There are 3 parts to it, which I'm putting together for an easier overview; they won't be committed in one go:
- introduction of sequence counters for vnodes
- fast path lookup in the namecache
- conversion of tmpfs
I have a separate patch for ufs and zfs will follow.
Design is described in the comment above cache_fplookup.
Notes: atomicity of traversal from one entry to another is provided by sequence counters. This ran into significant problems, e.g. with rename routines, which end up relocking after doing part of their work. For this reason the easy solution of end-to-end coverage of any exclusive locking does not work out. Instead I split this into 2 counters: one which tracks how many modifications are pending and another which denotes, for the lookup's benefit, that something is being done.
The following ops modify the count: rename, setattr (anything), mount, unmount + some fs-specific operations. I believe this provides a complete list of anything which can interfere with the lookup.
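To illustrate the two-counter split, here is a minimal userspace sketch. It is not the code from the patch: the names (node_seq, mod_begin, mod_end, read_begin, read_valid) are made up for illustration, it uses C11 atomics instead of kernel primitives, and it assumes the callers of mod_begin/mod_end are serialized by some external lock. The idea shown is that the sequence stays odd for as long as at least one modification is pending, so an operation which drops and retakes its locks midway still keeps the lookup out.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-vnode state; not the patch's layout. */
struct node_seq {
	atomic_uint seq;	/* odd while any modification is pending */
	unsigned users;		/* pending modifiers; callers are serialized */
};

void
node_seq_init(struct node_seq *ns)
{
	atomic_init(&ns->seq, 0);
	ns->users = 0;
}

/* The first pending modifier flips the sequence to odd. */
void
mod_begin(struct node_seq *ns)
{
	if (ns->users++ == 0)
		atomic_fetch_add_explicit(&ns->seq, 1, memory_order_acq_rel);
}

/* The last modifier to finish flips the sequence back to even. */
void
mod_end(struct node_seq *ns)
{
	if (--ns->users == 0)
		atomic_fetch_add_explicit(&ns->seq, 1, memory_order_release);
}

/* Lookup side: take a snapshot; false means "modification pending, bail". */
bool
read_begin(struct node_seq *ns, unsigned *snap)
{
	*snap = atomic_load_explicit(&ns->seq, memory_order_acquire);
	return ((*snap & 1) == 0);
}

/* Lookup side: true if nothing started modifying since the snapshot. */
bool
read_valid(struct node_seq *ns, unsigned snap)
{
	atomic_thread_fence(memory_order_acquire);
	return (atomic_load_explicit(&ns->seq, memory_order_relaxed) == snap);
}
```

With this shape a rename-like operation can call mod_begin once up front, relock as many times as it needs to, and call mod_end when it is completely done. Nested modifiers only bump the pending count.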
In particular the stock lookup guarantees nobody changes permissions or adds/removes entries as we leapfrog to the next vnode. That is, it is impossible to find a vnode and check against stale permissions. This is maintained.
Note neither lookup guarantees anything about the state of the complete visited chain.
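The hand-off from one vnode to the next can be sketched roughly as follows. This is a toy model under stated assumptions, not the patch: toy_node and toy_fplookup are hypothetical, each directory has at most one child, permission checks are omitted, and the plain reads of child/child_name would need proper atomic loads and safe memory reclamation in real code. It only shows the ordering: snapshot the child, then re-validate the parent before moving on.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy node with a single child; not the kernel's vnode. */
struct toy_node {
	atomic_uint seq;		/* odd while a modification is pending */
	const char *child_name;
	struct toy_node *child;
};

/*
 * Walk "names" starting at "dir" without taking any locks. Returns the
 * terminal node, or NULL if the walk has to be aborted and redone by the
 * regular locked lookup (a missing name is also reported as NULL to keep
 * the sketch short).
 */
struct toy_node *
toy_fplookup(struct toy_node *dir, const char *const *names, size_t n)
{
	struct toy_node *child;
	unsigned dir_seq, child_seq;
	size_t i;

	dir_seq = atomic_load_explicit(&dir->seq, memory_order_acquire);
	if (dir_seq & 1)
		return (NULL);
	for (i = 0; i < n; i++) {
		child = dir->child;
		if (child == NULL || strcmp(dir->child_name, names[i]) != 0)
			return (NULL);
		/* Snapshot the child before letting go of the parent. */
		child_seq = atomic_load_explicit(&child->seq,
		    memory_order_acquire);
		if (child_seq & 1)
			return (NULL);
		/* Hand-off: the parent must be unchanged since its snapshot. */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load_explicit(&dir->seq, memory_order_relaxed) !=
		    dir_seq)
			return (NULL);
		dir = child;
		dir_seq = child_seq;
	}
	return (dir);
}
```

Only the current parent/child pair is validated at each step. Nothing is asserted about nodes visited earlier, which matches the note that neither lookup guarantees anything about the state of the complete visited chain.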
Microbenchmark wins are drastic, of course, so I'm not going to show them. For an actual workload, here is an incremental -j 104 bzImage build:
before: 142.96s user 1025.63s system 4924% cpu 23.731 total
after: 147.36s user 313.40s system 3216% cpu 14.326 total
Limitations and what's going to happen with them are as follows:
- MAC vnode hooks depend on shared locks, thus relevant MAC hooks will be checked at runtime upfront and if any are present the lookup will abort. I don't intend to fix this.
- AUDIT has the same problem and will also be checked for. I don't intend to fix this. This can be made more granular so that lookup only aborts if logging would take place.
- dot dot lookups -- will be added later
- symlinks -- will be added later
- capability mode -- unclear what to do with it. It depends on explicitly tracking all visited .. vnodes. The code may get simpler with smr.
- dirfd != AT_FDCWD -- trivial, will add later
- anything which wants the parent -- not added yet only to simplify the review
- aborting starts the current lookup from scratch every time. I have code to change it, but it's not a big deal and will make most sense to implement after the above get plugged