Page MenuHomeFreeBSD

(lookup 1) vfs: add sequence counters to vnodes
ClosedPublic

Authored by mjg on Jul 6 2020, 6:23 PM.

Details

Summary

The intent is to allow for vnode inspection where the result is validated with sequence counters, thus allowing to elide taking any locks. Lockless lookup being the primary beneficiary.

Anything pertaining to a lookup -- vnode's parent, filesystems mounted on top or any permissions -- result in modification of the counter. One giant hole is lack of support for rename, which given the current API contract comes from filesystems wishing to utilize the feature. As explained in vop_rename_pre, they re-lookup and can end up operating on different vnodes than passed in.

2 other approaches:

  • integration with the vnode lock -- does not work since vnodes get notoriously relocked mid-VOPs
  • integration with wirte suspension -- would require special casing for rename, but would also require special casing on close and actual writes to the vnode, none of which do anything invalidate the lookup. the only real overlap is in setfacl etc., but that can be easily covered with pre/post hooks.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

mjg requested review of this revision.Jul 6 2020, 6:23 PM

I obviously miss something about big picture in this patch. Why a lot of cases are not marked with vn_seqc_write_begin() ? Just for example, shouldn't mkdir or rmdir cause the dvp to become blocked for lockless lookup ? While you do mark acl vops.

Lockless lookup vs creation has the same races as regular one, but timings may be different.

Locked lookup vs mkdir will share lock dvp, which will either delay mkdir (and no entry will be found) OR wait for said mkdir (in which case the entry will be found). Lockless does not wait nor hold it off, but it can find it or fail to do it all the same.

Similarly for rmdir, locked lookup can will either delay the removal (and will find the entry) OR wait for it (in which case no entry will be found). Lockless ends up with the same behaviour, although at worst it can return the vnode *after* rmdir was performed. The vnode itself is fine and I don't think there is a problem here, nonetheless this behavior can be removed by adding seqc modification to *vp* (not *dvp*). I don't see a reason to refrain from doing it, so I'll add it later.

dvp itself does not have to have the counter modified in any of these cases. Adding modification anyway would be harmful in the following scenario: say someone keeps adding/removing files from foo/ all while someone else keeps performing lookups on files which exist in the same directory. With patch as it is these lookups are not affected, but with changing the count every time an entry is added or removed would keep interfering.

All forms of permission changes modify the counter to maintain the invariant that when looking up foo/bar, successful jump to bar will only be done the right permissions tested. Specifically, locked variant for foo/bar guarantees nothing chagnes between foo and bar for the duration.

Consider the following race: foo is empty and has mode 755

thread A: chmod 700 foo; echo test > foo/bar
thread B: stat foo/bar

Not actual commands, just syscall-wise. In locked lookup thread B will either fail to find bar or will get permission denied. thread A at worst will find itself delayed.
Lockless lookup can perm check against 755, get delayed by whatever (interrupts, vcpu taken off) and then find bar. With the chmod above altering the count, this scenario is detected and taken care of.

In short, counter modifications for setacl & friends are there maintain all the semantics. Modifications to target vnode (not directory vnode) can be added for everything which adds/removes it, but it does not alter correctness. Modifications to directory vnode would not alter anything but would cause spurious failures in concurrent lookups in the same directory. Note they are tolerated for rename for simplicity.

Imagine we have a file f in the structure of directories A/B/C/f. Suppose we are looking up A/B/B/f (note double-B there) in thread 1, while another thread 2 does 'mv A/B/C A/; mv A/B A/C/; mv A/C/f A/C/B/f' (basically rolling B and C around). I claim that with locked lookup thread 1 would never succeed. What prevents it in case of lock-less lookup ?

  • add target vnode sequence counter modification around rmdir and remove

With this in place lockless lookup will stop returning the vnode before either VOP_REMOVE or VOP_RMDIR do anything.

Cases you mentioned perform rename, which for filesystems supporting lockless lookups modifies the counter for *all* affected vnodes (which is a little overzealous). Modifications are made before/after actual rename and consequently lookup will only succeed if rename has not done anyting yet. For example see ufs_rename in D25579 . This is pushed into filesystems because they can relookup affected vnodes.

I am trying to make somewhat different point, rename/rmdir etc are just illustrations.

I think that for initial version of this work, semantic of lockless lookup must be same or stricter than of the normal locked lookup, in the sense that some (perhaps valid) cases should decline to process in caution. Later we can inspect each of them one-by-one and make decisions whether it is safe to loosen the rules. But for now IMO it is not acceptable to mix this work and changing of the VFS guarantees in regard to name visibility and races.

After that point is made, I remind that we already have all places were VFS enters into fs modification state, with vn_start_write/vn_start_secondary_write. More, this mechanism explicitly takes the vnode which is going to be modified. vn_seqc_write_begin() should be added to vn_start_write, perhaps with a flag to avoid seqc_write for silly cases like VOP_WRITE(). This would be correct almost outright, except some cases like you found in ZFS that require additional handling, or mount/unmount.

  • modify the counter for both directory and target vnode (if applicable) for all ops which create or remove one

With the updated patch the races are the same -- that is, if the fast path is racing anyone it will fallback to the current lookup.

For write suspension, I noted there is only a partial overlap. Modifying the counter for dvp also serves against integrating the two -- only one vnode is passed, making the pre/post hooks a much more natural place (if possible -- it's not for rename).

sys/kern/vfs_subr.c
4106

Does it make sense to end seqc_write for reclaimed vnode ?
IMO it should be not, to avoid any possibility for it to participate in a new lookup.

You might need to indeed end the seqc_write in the vnode destructor.

5711

Assert that the vnode in seqc_write ?

sys/kern/vnode_if.src
104

whiteouts must seqc_write-protect dvp.

628

Shouldn't labeling also set seqc_write ? I remember you mentioned that MAC-enabled for the vnode should disable fast lookup.

sys/sys/seqc.h
115

So why not to make seqc_write_begin() call seqc_sleepable_write_begin() ?

And please split addition of _seqc.h, and seqc.h changes, separate commits.

  • assert seqc in reclaim post
  • keep seqc until the vnode gets free

The seqc header stuff is only included for reference, I have not decided yet which way to factor it.

MAC for lookup is only a factor if there any lookup hooks and their presence disables lockless operation entirely -- we have no idea what the hooks may want to do (and they require a locked vnode). Consequently I don't think setlabel is of significance.

This revision is now accepted and ready to land.Jul 9 2020, 11:17 AM
sys/kern/vfs_subr.c
5640

I think that bloking l/l for vp (not tdvp) is not needed.

This revision was automatically updated to reflect the committed changes.