
vfs: introduce basic smr-protected path lookup
Closed, Public

Authored by mjg on Mar 1 2020, 11:55 PM.

Details

Summary

The following was separated from a patchset which performs a fully SMR-protected path lookup that only dirties the terminal vnode. This patch only stops taking bucket locks for forward lookup, which already shows a significant speedup during concurrent stat calls, even at smaller scales.

The patch below will be split in two, with the namecache handled separately. It introduces vnode and namecache entry protection with SMR.

The crux of the change is vhold_smr. An argument could be made that a flag should be set while dooming the vnode, but postponing this makes the code simpler, is harmless, and opens up the possibility of safely doing a fetchadd + backpedaling from it in vhold_smr.

It's unclear to me what should be done here with SMR accessors. It may be workable to make the hash use them, but it looks like some form of _raw accessors (if any) would be prudent here. Right now the code just asserts. I think the way to go for now is to hide all this behind dedicated macros which happen to be mere atomic loads/stores.

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

  • hide vfs_smr behind macros
  • pretty print hold count flags

I think, pending testing by pho, this is committable.

mjg retitled this revision from vfs: introduce smr-protected path lookup to vfs: introduce basic smr-protected path lookup. Mar 6 2020, 7:33 AM

Would you not prefer to use an invalid value rather than a bit to hold the free state? I.e., VHOLD_NO_SMR would be something like VHOLD_DEAD 0xffffffff. Some of your asserts would have to be adjusted. The first ref should swap dead with 1.

sys/kern/vfs_subr.c
3450 ↗(On Diff #69043)

I had to solve this same problem for page busy. I did it slightly differently, but it is related. This is VPB_FREED.

The notion is, the owner that is freeing has destroyed the page identity, and sets busy to VPB_FREED. The trybusy on the lockless side may succeed right before the owner sets FREED. The lockless lookup revalidates the page identity and unlocks if it races. The unlock is special because it handles FREED. We know that if we successfully acquire it, it can not be freed.

668 ↗(On Diff #69254)

If you have ordering issues in later patches you can just manually smr_create() first thing and pass it around everywhere even here.

sys/kern/vfs_subr.c
3450 ↗(On Diff #69043)

I noted what I think is an equivalent in the description.

  1. the code as implemented is imo very easy to understand.
  2. it maintains an assertable invariant that non-SMR consumers never legally see the flag (modulo vdropl, of course).
  3. it keeps open the possibility of still doing a fetchadd.

To elaborate on the last point, when profiling things like the -j 104 bzImage incremental build I kept seeing vhold_smr in the profile (despite it only being done on the terminal vnode). Note its performance is hindered by the fact that this is a cmpset loop in the first place. After the dust settles we can look at changing it into a fetchadd, but this is only possible if fetchadding cannot legally alter any flags.
The basic idea is that if we fetchadd and find the flag set, we can pretend we never saw the object in the first place. This means that after vdrop wins the cmpset race it will still free the vnode, whose counter may get modified after the call to uma_zfree_smr. This is a little hairy, and I have ideas for how to avoid touching the hold count in the common case in the first place (if we can avoid it).

668 ↗(On Diff #69254)

I know, but I did not find a good place to do it yet. It's plausible I'll be able to reorder the problematic consumer.

Another option is that perhaps there should be a well-defined "allocate global smrs here" place so that other work does not run into equivalent woes.

Tested by pho. He ran into a ZFS hang which seems unrelated: https://people.freebsd.org/~pho/stress/log/mjguzik026.txt (threads stuck waiting on I/O completion while the thread which should deliver it just sleeps in its main loop).

ping? I want to move forward with the general scalable lookup. Regardless of what happens next, something along the lines of what's proposed here is needed, and it already comes with a benefit (removal of rwlocks in the namecache in the common case).

Where is VHOLD_NO_SMR set?

In vdropl:

if (!atomic_cmpset_int(&vp->v_holdcnt, 0, VHOLD_NO_SMR)) {

  • handle negative entries
  • remove overzealous assert in vdropl

I think this is fine on principle.

I do not see why you cannot use VIRF_DOOMED instead of VHOLD_NO_SMR. This would simplify the patch significantly, and I think it is fine to fail the cached lookup if you find any doomed vnode on the path.

I don't see it. Relying on dooming would require moving the VIRF_DOOMED flag into hold count to maintain atomicity vs vhold_smr. It would be the same patch + more work.

I asked pho to test the updated version.

In D23913#563019, @mjg wrote:

I don't see it. Relying on dooming would require moving the VIRF_DOOMED flag into hold count to maintain atomicity vs vhold_smr. It would be the same patch + more work.

I asked pho to test the updated version.

I mean get the transient reference, check for DOOMED, drop the reference if DOOMED and return failure from cached lookup.

I see two problems there:

  • setting DOOMED and checking the hold count on free should be ordered (arguably they already are)
  • a transient reference might either break some asserts or cause a missed free.

I think only the missed free is serious enough to be a counter-argument, but it might be fixable with less work than making the hold count a flagged counter.

I ran all of the stress2 tests on r362642 with D23913.73738.diff.
No problems seen.

In D23913#563450, @pho wrote:

I ran all of the stress2 tests on r362642 with D23913.73738.diff.
No problems seen.

thank you for testing

In D23913#563051, @kib wrote:
In D23913#563019, @mjg wrote:

I don't see it. Relying on dooming would require moving the VIRF_DOOMED flag into hold count to maintain atomicity vs vhold_smr. It would be the same patch + more work.

I asked pho to test the updated version.

I mean get the transient reference, check for DOOMED, drop the reference if DOOMED and return failure from cached lookup.

I see two problems there:

  • setting DOOMED and checking the hold count on free should be ordered (arguably they already are)
  • a transient reference might either break some asserts or cause a missed free.

I think only the missed free is serious enough to be a counter-argument, but it might be fixable with less work than making the hold count a flagged counter.

Well, to expand on it: the race between freeing the vnode and vhold_smr is the crux here. I tried not to have the flag in the count, but then there is no way to know when the vnode is no longer reachable. With the change as proposed we maintain the invariant that a vnode handed over to uma_zfree will never get any new references (even transient). Consequently I think this is the most debuggable approach in the long run.

In D23913#563590, @mjg wrote:
In D23913#563051, @kib wrote:
In D23913#563019, @mjg wrote:

I don't see it. Relying on dooming would require moving the VIRF_DOOMED flag into hold count to maintain atomicity vs vhold_smr. It would be the same patch + more work.

I asked pho to test the updated version.

I mean get the transient reference, check for DOOMED, drop the reference if DOOMED and return failure from cached lookup.

I see two problems there:

  • setting DOOMED and checking the hold count on free should be ordered (arguably they already are)
  • a transient reference might either break some asserts or cause a missed free.

I think only the missed free is serious enough to be a counter-argument, but it might be fixable with less work than making the hold count a flagged counter.

Well, to expand on it: the race between freeing the vnode and vhold_smr is the crux here. I tried not to have the flag in the count, but then there is no way to know when the vnode is no longer reachable. With the change as proposed we maintain the invariant that a vnode handed over to uma_zfree will never get any new references (even transient). Consequently I think this is the most debuggable approach in the long run.

On the other hand, the hold count is no longer a reference counter; it is a bitfield structure (basically, remember your objections against saturation handling in refcount.h?). At least you should add checks that the hold count does not overflow into the flags, as a followup.

This revision is now accepted and ready to land. Jul 1 2020, 12:23 AM
This revision was automatically updated to reflect the committed changes.