Details

Reviewers

kib
jeff
markj

Commits

rS363519: vfs: lockless lookup

Summary

Inner workings are explained in the comment above cache_fplookup.

Capabilities and fd-relative lookups are not supported and will result in immediate fallback to regular code.

Symlinks, ".." in the path, mount points without support for lockless lookup and mismatched counters will result in an attempt to get a reference to the directory vnode and continue in regular lookup. If this fails, the entire operation is aborted and regular lookup starts from scratch. However, care is taken that data is not copied again from userspace.

Test Plan

The entire patchset (prior to minor cleanups) was tested by pho.

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

mjg created this revision.Jul 6 2020, 7:31 PM

Herald added subscribers: jonathan, imp. · View Herald TranscriptJul 6 2020, 7:31 PM

mjg requested review of this revision.Jul 6 2020, 7:31 PM

Harbormaster completed remote builds in B32164: Diff 74130.Jul 6 2020, 7:31 PM

mjg edited the summary of this revision. (Show Details)Jul 6 2020, 8:08 PM

mjg edited the test plan for this revision. (Show Details)

mjg edited the summary of this revision. (Show Details)Jul 6 2020, 8:13 PM

mckusick mentioned this in D25579: (lookup 4) ufs: add support for lockless lookup.Jul 6 2020, 11:02 PM

mjg updated this revision to Diff 74229.Jul 9 2020, 12:14 PM

mjg updated this revision to Diff 74231.Jul 9 2020, 12:29 PM

mjg edited the summary of this revision. (Show Details)

kib added inline comments.Jul 11 2020, 3:48 PM

sys/kern/vfs_cache.c
2858	So this is exact copy on namei_cleanup_cnp. Why ?
2869	And this is namei_handle_root(), but different for your use case by the lack of vrefact(). Can you reuse the code ?
2886	Why creating yet another structure, instead of saving all nameidata ?
3028	!= 0
3069	Can you just write `return (vp->v_type == VLNK);` ? At least for now.
3585	'You can grep' sounds frivolous. `All places that may modify data affecting lockless lookup are enclosed in vn_seqc_write_begin()/end() braces.`

!= 0 in cache_can_fplookup
shorter cache_fplookup_vnode_supported
different comment in cache_fplookup_impl

mjg added inline comments.Jul 11 2020, 4:11 PM

sys/kern/vfs_cache.c
2858	See below.
2869	This does not do strictrelative nor beneath stuff either. Like path parsing below, this is a candidate to deduplicate after more features show up. I don't see much point trying to dedup the current state. In particular it is unclear to me how deduping of ".." tracking for beneath/capability lookups is going to work out.
2886	The struct is enormous (232 bytes) and has to be checkpointed for every path component in case we have to fallback to regular lookup.

mjg marked 2 inline comments as done.Jul 11 2020, 4:12 PM

mjg marked an inline comment as done.

further comment tweaks

Harbormaster completed remote builds in B32268: Diff 74347.Jul 12 2020, 9:24 AM

further comment changes
stop using vref_smr

rebase
various tweaks

Overall I do not see anything principally wrong with this patch, for me it looks fine algorithmic. I hope it become less of re-implementation and more integrated with existing lookup() code in future updates.

sys/kern/vfs_cache.c
3099	This looks too rude. Why not act on partial result by fetching dp from the dedicated place ? ni_startdir only reset on symlink before, I think it costs nothing to keep it as is.
3182	Why __predict_false ? Side note, this code seems to __predict-intensive overall.
3188	This is weird. You load nc_flag (even with atomic) and then cache_ncp_invalid() reloads nc_flag and checks.
3197	What is NCF_HOTNEGATIVE ?
3234	!= 0
sys/kern/vfs_lookup.c
535	May be explain that lockless refers to the vnode locking.

further style and comment tweaks
catch up recent hot neg changes

Better integration between the two lookups is the mid-term plan, it will also result in mostly deduplicated parsing etc.

sys/kern/vfs_cache.c
3099	Not sure I follow. The lookup loop in namei starts with dp = ndp->ni_startdir, making the assignment below pretty natural in my opinion. Note partial returns go straight to the loop.
3182	We expect to have a cache hit of sorts for vast majority of calls. Similarly, we don't expect to find an invalidated entry for vast majority of lookups and so on. In comparison there is no predict for whether the entry is positive or negative. In general all predicts are there for corner cases which are only handled for correctness and which we expect to encounter sparingly.
3188	Unclear what the complaint is here. The code ensures that either all loads from ncp are performed before the entry is invalidated (making sure we got a good snapshot. On CPUs which reorder loads this in particular could fetch nc_flag first and nc_vp later. The acq fence + nc_flag reload ensures it's all fine regardless of how we ended up reading from ncp.

kib added inline comments.Jul 17 2020, 10:37 AM

sys/kern/vfs_cache.c
3188	I do not see why not apply the idiomatic way of using the acquire: you load tvp, then fence, then load nc_flag, and the check for invalid is performed on the value loaded by that read of nc_flag. This would require inlining cache_ncp_invalid() but this is the only thing that could be said against. Your code 'read nc_vp, read nc_flag, acq, re-read nc_flag' is strange (I am not saying wrong) because for the fence to have effect, the read must observe the rel-fenced write. There the value read before fence is used, potentially (but I am not sure) providing inconsistent state. Instead of trying to convince myself and future readers that this is innocent, it is much simpler and obviously correct to just use one load.

mjg added inline comments.Jul 17 2020, 11:05 AM

sys/kern/vfs_cache.c
3188	Concerns about previous state aside, the code got reorganized because of the rebase. Since NCP_HOTNEGATIVE flag got removed, the state has to be checked by: negstate = NCP2NEGSTATE(ncp); if ((negstate->neg_flag & NEG_HOT) == 0) ... which in the current patch is then followed by the fence + re-read. This leaves positive entry case with your concern, but I don't think this can be addressed without dubious reordering of the above -- we don't know what the entry is without first reading nc_flag. On the other hand I want to maintain the invariant that the entry ceased to be accessed past the invalidation check.

kib added inline comments.Jul 17 2020, 2:22 PM

sys/kern/vfs_cache.c
3188	And I still do not see why not to reorg it by putting fence_acq() before load of nc_flag, and then use direct check for NCF_INVALID instead of cache_ncp_invalid(), in the current patch structure. And related question, why it is fine to check NCF_NEGATIVE and do anything before checking for NCF_INVALID.

mjg added inline comments.Jul 20 2020, 9:50 PM

sys/kern/vfs_cache.c
3188	As noted above, we don't know if the entry is negative without testing nc_flag. But it is negative, it has to be tested for NEG_HOT flag stored in a struct union'ed with nc_vp. With your proposal the code would have to look like this: nc_flag = ...; fence_acq(); if (nc_flag & NCP_NEGATIVE) { negstate = NCP2NEGSTATE(ncp); ... which reads negstate after we got the nc_flag snapshot to test. The invariant is that ncp only shows up in the hash once fully constructed and the state is fine as long as the entry did not turn out to be invalidated. Then the code has equivalent guarantees to locking the relevant chain, reading everything, unlocking which is de facto what locked lookup is doing. There is a slight difference that we don't ref the vnode inside, but that's taken care of with seqc + smr.

kib accepted this revision.Jul 22 2020, 7:33 PM

kib added inline comments.

sys/kern/vfs_cache.c
3188	I do not see why cannot you read n_un before the fence, but ok.

This revision is now accepted and ready to land.Jul 22 2020, 7:33 PM

Closed by commit rS363519: vfs: lockless lookup (authored by mjg). · Explain WhyJul 25 2020, 10:37 AM

This revision was automatically updated to reflect the committed changes.

mjg added a commit: rS363519: vfs: lockless lookup.

(lookup 3) vfs: lockless lookup
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 74570

sys/kern/vfs_cache.c

sys/kern/vfs_lookup.c

sys/sys/namei.h

(lookup 3) vfs: lockless lookupClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 74570

sys/kern/vfs_cache.c

sys/kern/vfs_lookup.c

sys/sys/namei.h

(lookup 3) vfs: lockless lookup
ClosedPublic
Actions

Revision Contents
Changeset List