Page MenuHomeFreeBSD

(lookup 3) vfs: lockless lookup
ClosedPublic

Authored by mjg on Jul 6 2020, 7:31 PM.

Details

Summary

Inner workings are explained in the comment above cache_fplookup.

Capabilities and fd-relative lookups are not supported and will result in immediate fallback to regular code.

Symlinks, ".." in the path, mount points without support for lockless lookup and mismatched counters will result in an attempt to get a reference to the directory vnode and continue in regular lookup. If this fails, the entire operation is aborted and regular lookup starts from scratch. However, care is taken that data is not copied again from userspace.

Test Plan

The entire patchset (prior to minor cleanups) was tested by pho.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

mjg requested review of this revision.Jul 6 2020, 7:31 PM
mjg edited the test plan for this revision. (Show Details)
mjg edited the summary of this revision. (Show Details)
sys/kern/vfs_cache.c
2857

So this is exact copy on namei_cleanup_cnp. Why ?

2868

And this is namei_handle_root(), but different for your use case by the lack of vrefact(). Can you reuse the code ?

2885

Why creating yet another structure, instead of saving all nameidata ?

3027

!= 0

3068

Can you just write return (vp->v_type == VLNK); ? At least for now.

3584

'You can grep' sounds frivolous.

All places that may modify data affecting lockless lookup are enclosed in vn_seqc_write_begin()/end() braces.

  • != 0 in cache_can_fplookup
  • shorter cache_fplookup_vnode_supported
  • different comment in cache_fplookup_impl
sys/kern/vfs_cache.c
2857

See below.

2868

This does not do strictrelative nor beneath stuff either. Like path parsing below, this is a candidate to deduplicate after more features show up. I don't see much point trying to dedup the current state. In particular it is unclear to me how deduping of ".." tracking for beneath/capability lookups is going to work out.

2885

The struct is enormous (232 bytes) and has to be checkpointed for every path component in case we have to fallback to regular lookup.

mjg marked 2 inline comments as done.Jul 11 2020, 4:12 PM
mjg marked an inline comment as done.
  • further comment tweaks
  • further comment changes
  • stop using vref_smr
  • rebase
  • various tweaks

Overall I do not see anything principally wrong with this patch, for me it looks fine algorithmic. I hope it become less of re-implementation and more integrated with existing lookup() code in future updates.

sys/kern/vfs_cache.c
3098

This looks too rude. Why not act on partial result by fetching dp from the dedicated place ?
ni_startdir only reset on symlink before, I think it costs nothing to keep it as is.

3181

Why __predict_false ?

Side note, this code seems to __predict-intensive overall.

3187

This is weird. You load nc_flag (even with atomic) and then cache_ncp_invalid() reloads nc_flag and checks.

3196

What is NCF_HOTNEGATIVE ?

3233

!= 0

sys/kern/vfs_lookup.c
534

May be explain that lockless refers to the vnode locking.

  • further style and comment tweaks
  • catch up recent hot neg changes

Better integration between the two lookups is the mid-term plan, it will also result in mostly deduplicated parsing etc.

sys/kern/vfs_cache.c
3098

Not sure I follow. The lookup loop in namei starts with dp = ndp->ni_startdir, making the assignment below pretty natural in my opinion. Note partial returns go straight to the loop.

3181

We expect to have a cache hit of sorts for vast majority of calls. Similarly, we don't expect to find an invalidated entry for vast majority of lookups and so on. In comparison there is no predict for whether the entry is positive or negative.

In general all predicts are there for corner cases which are only handled for correctness and which we expect to encounter sparingly.

3187

Unclear what the complaint is here. The code ensures that either all loads from ncp are performed before the entry is invalidated (making sure we got a good snapshot. On CPUs which reorder loads this in particular could fetch nc_flag first and nc_vp later. The acq fence + nc_flag reload ensures it's all fine regardless of how we ended up reading from ncp.

sys/kern/vfs_cache.c
3187

I do not see why not apply the idiomatic way of using the acquire: you load tvp, then fence, then load nc_flag, and the check for invalid is performed on the value loaded by that read of nc_flag. This would require inlining cache_ncp_invalid() but this is the only thing that could be said against.

Your code 'read nc_vp, read nc_flag, acq, re-read nc_flag' is strange (I am not saying wrong) because for the fence to have effect, the read must observe the rel-fenced write. There the value read before fence is used, potentially (but I am not sure) providing inconsistent state.

Instead of trying to convince myself and future readers that this is innocent, it is much simpler and obviously correct to just use one load.

sys/kern/vfs_cache.c
3187

Concerns about previous state aside, the code got reorganized because of the rebase.

Since NCP_HOTNEGATIVE flag got removed, the state has to be checked by:

negstate = NCP2NEGSTATE(ncp);
if ((negstate->neg_flag & NEG_HOT) == 0) ...

which in the current patch is then followed by the fence + re-read. This leaves positive entry case with your concern, but I don't think this can be addressed without dubious reordering of the above -- we don't know what the entry is without first reading nc_flag. On the other hand I want to maintain the invariant that the entry ceased to be accessed past the invalidation check.

sys/kern/vfs_cache.c
3187

And I still do not see why not to reorg it by putting fence_acq() before load of nc_flag, and then use direct check for NCF_INVALID instead of cache_ncp_invalid(), in the current patch structure.

And related question, why it is fine to check NCF_NEGATIVE and do anything before checking for NCF_INVALID.

sys/kern/vfs_cache.c
3187

As noted above, we don't know if the entry is negative without testing nc_flag. But it is negative, it has to be tested for NEG_HOT flag stored in a struct union'ed with nc_vp. With your proposal the code would have to look like this:

nc_flag = ...;
fence_acq();
if (nc_flag & NCP_NEGATIVE) {
           negstate = NCP2NEGSTATE(ncp);

... which reads negstate after we got the nc_flag snapshot to test.

The invariant is that ncp only shows up in the hash once fully constructed and the state is fine as long as the entry did not turn out to be invalidated.

Then the code has equivalent guarantees to locking the relevant chain, reading everything, unlocking which is de facto what locked lookup is doing. There is a slight difference that we don't ref the vnode inside, but that's taken care of with seqc + smr.

kib added inline comments.
sys/kern/vfs_cache.c
3187

I do not see why cannot you read n_un before the fence, but ok.

This revision is now accepted and ready to land.Jul 22 2020, 7:33 PM
This revision was automatically updated to reflect the committed changes.