[wip] vfs: defer vdrop if mnt_listmtx is contended
Abandoned · Public

Authored by mjg on Dec 5 2019, 8:29 PM.

Details

Reviewers
jeff
kib
Summary

Currently regular vnodes move between the active and free lists very frequently, e.g. when stat(2)-ing them. This causes significant contention in workloads which access many files in a short timespan. A long-term fix would probably involve removing the active list in its current form in the first place. In the meantime I propose trylocking the list lock and, in case of failure, setting a special flag (VI_DEFERRED) instead of releasing the hold count, delegating it to the syncer to return the vnode to the free list. This both eliminates the contention in my tests and maintains quasi-LRU ordering, especially when contention is low (or non-existent).
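A rough sketch of the deferred path described above, to make the mechanism concrete. This is an illustration rather than the actual diff: the helper name vdropl_defer is made up, the list manipulation is abbreviated, the mp == NULL and doomed-vnode cases are omitted, and the transient release/re-acquire of the hold count is a simplification (the patch itself keeps the count). VI_DEFERRED is the flag proposed here; everything else follows stock vfs_subr.c naming.

    /* Sketch only: last-reference vdrop with deferral when mnt_listmtx is busy. */
    static void
    vdropl_defer(struct vnode *vp)
    {
        struct mount *mp;

        ASSERT_VI_LOCKED(vp, __func__);
        if (!refcount_release(&vp->v_holdcnt)) {
            /* Not the last hold reference; nothing to requeue. */
            VI_UNLOCK(vp);
            return;
        }
        mp = vp->v_mount;
        if (!mtx_trylock(&mp->mnt_listmtx)) {
            /*
             * The list lock is contended.  Keep a hold reference, mark the
             * vnode and let the syncer move it to the free list (and perform
             * the final vdrop) on its next pass.
             */
            refcount_acquire(&vp->v_holdcnt);
            vp->v_iflag |= VI_DEFERRED;
            VI_UNLOCK(vp);
            return;
        }
        /* Uncontended: move the vnode from the active list to the free list. */
        if ((vp->v_iflag & VI_ACTIVE) != 0) {
            vp->v_iflag &= ~VI_ACTIVE;
            TAILQ_REMOVE(&mp->mnt_activevnodelist, vp, v_actfreelist);
            mp->mnt_activevnodelistsize--;
        }
        /* ... insert onto the per-mount free list ... */
        mtx_unlock(&mp->mnt_listmtx);
        VI_UNLOCK(vp);
    }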

Sample results when doing an incremental Linux kernel build on tmpfs (on a kernel with some other patches) -- make -s -j 104 bzImage:

before: 118.52s user 3772.92s system 7404% cpu 52.558 total
after: 151.06s user 1080.23s system 5031% cpu 24.473 total

Notes:

  • The code opts not to try to defer if inactive processing was requested. This is because there is no clear indicator whether a vnode is unlinked. The VV_NOSYNC flag is used only by ufs. It is trivial to add to zfs; I don't know about tmpfs.
  • If memory serves, ufs may have a known LoR when used with SU, which makes me wonder if the behavior should be opt-in (should this start running into deadlocks).
  • I'm not happy with the need to scan the active list. struct mount can get a new flag field which would denote that there is at least one deferred vnode, so the syncer could refrain from scanning otherwise (see the sketch below). Another note is that the namecache holds many directory vnodes; perhaps a marker can be inserted to separate them from the rest, so the scan for deferred items could start there.
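For illustration, the per-mount hint from the last note could look roughly like this; MNTK_VDEFERRED is a made-up flag name and locking details (mount interlock vs. vnode interlock ordering) are glossed over:

    /* When deferring (mnt_listmtx contended), also record a per-mount hint: */
    vp->v_iflag |= VI_DEFERRED;
    if ((mp->mnt_kern_flag & MNTK_VDEFERRED) == 0) {
        MNT_ILOCK(mp);
        mp->mnt_kern_flag |= MNTK_VDEFERRED;
        MNT_IUNLOCK(mp);
    }

    /* In the syncer, before walking mnt_activevnodelist: */
    if ((mp->mnt_kern_flag & MNTK_VDEFERRED) == 0)
        return;                 /* nothing was deferred; skip the scan */
    MNT_ILOCK(mp);
    mp->mnt_kern_flag &= ~MNTK_VDEFERRED;
    MNT_IUNLOCK(mp);
    /* ... scan the active list for VI_DEFERRED vnodes and vdrop them ... */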
Test Plan

I'm in the process of testing this with zfs and tmpfs; it works fine so far. I will ask pho to test later.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 27988

Event Timeline

mjg retitled this revision from [wip] vfs: defer vdrop if mp_listmtx is contended to [wip] vfs: defer vdrop if mnt_listmtx is contended. Dec 5 2019, 8:35 PM
mjg edited the test plan for this revision.
sys/kern/vfs_subr.c
3307

So a general question I have is: why don't we simply always do this processing lazily? Just leave it on the active list, with no flag or anything. During the active scan, if the holdcnt is zero, move it to the free list. All of this trylock and flag handling is a lot of complexity to save the cost of scanning and requeuing later, when it can be done more efficiently in bulk.

You can look at being on the active queue as having a single activation count. If you want to optimize the LRU, you can add an atomic flag for vnodes that are 'reactivated'. That way vnodes with a single ref are moved out of active, but those that are regularly reactivated never leave.
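For reference, a minimal sketch of that lazy scan, assuming an invented VI_REACTIVATED flag stands in for the 'reactivated' hint and with the free-list insertion abbreviated:

    /* Syncer-side pass over the active list, done in bulk. */
    struct vnode *vp, *nvp;

    mtx_lock(&mp->mnt_listmtx);
    TAILQ_FOREACH_SAFE(vp, &mp->mnt_activevnodelist, v_actfreelist, nvp) {
        /* Lock order is interlock before mnt_listmtx, hence the trylock. */
        if (!VI_TRYLOCK(vp))
            continue;
        if (vp->v_holdcnt > 0 || (vp->v_iflag & VI_REACTIVATED) != 0) {
            /* Recently used: clear the hint and keep it on the active list. */
            vp->v_iflag &= ~VI_REACTIVATED;
            VI_UNLOCK(vp);
            continue;
        }
        /* Idle: move it to the free list. */
        vp->v_iflag &= ~VI_ACTIVE;
        TAILQ_REMOVE(&mp->mnt_activevnodelist, vp, v_actfreelist);
        mp->mnt_activevnodelistsize--;
        /* ... insert onto the per-mount free list ... */
        VI_UNLOCK(vp);
    }
    mtx_unlock(&mp->mnt_listmtx);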

3330

Since this is bool it can/should be !refcount_release

4437–4438

Shouldn't these be inverted? We're now skipping the work if these conditions are true. We should only skip it if they are false.

sys/kern/vfs_subr.c
3307

I see I failed to mention in the original text that the issue at hand is a significant bottleneck on modest contemporary machines. Thus I wanted the simplest approach which is eligible for MFC to stable/12.

I did note that I don't consider this to be a good long-term fix.

The proposed modification of letting more vnodes stay on the active list is what I considered originally. That is, the syncer would check for a "used" bit -- if present, it would clear it and requeue the vnode at the end; if not present, it would return the vnode to the free list. It's not hard to extend this code to do that, but I don't think it's a worthwhile addition given the temporary nature of this patch.

I think a hard requirement for a long-run solution is NUMA-awareness. I don't have it completely thought out, but a rough sketch involves removing the global free list and the per-mount point active lists. Instead, all aging state would be managed per-domain (maybe with clock or a different algorithm).
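For what it's worth, the per-domain state could be as small as something like this (all names invented, the actual aging algorithm left unspecified):

    /* One aging structure per NUMA domain instead of per mount point. */
    struct vnode_domain {
        struct mtx          vd_lock;        /* protects vd_lru */
        TAILQ_HEAD(, vnode) vd_lru;         /* clock/LRU of inactive vnodes */
        u_long              vd_count;
    } __aligned(CACHE_LINE_SIZE);

    static struct vnode_domain vnode_domains[MAXMEMDOM];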

3330

This was copied from vdrop; I can adjust both.

4437–4438

This is the original condition but inverted (note the "!").

sys/kern/vfs_subr.c
3307

So I took a stab at making this second-chance-y, and that made me even more confident it's not the right approach for anything remotely more involved than the patch as proposed here.

Key for me is that some filesystems sometimes want to traverse a list of vnodes which can have outstanding I/O to perform. Currently there is no dedicated list of the sort; instead the "active" list is filtered. IMO we should add a new list to the mount point/vnode relationship and add vnodes to it when it is likely they will match this criterion. It should be easy to be strict about addition (opens with O_RDWR and similar), while we can be very lax about removing them from the list -- possibly in the very form used in this patch for the active list.
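As an illustration of the addition side only -- every name below is invented and locking/ordering details are glossed over; the removal side could reuse the trylock/defer pattern from this patch:

    /* Somewhere in the open path, once write access has been granted: */
    if ((fmode & FWRITE) != 0) {
        VI_LOCK(vp);
        if ((vp->v_iflag & VI_ONDIRTYLIST) == 0) {
            vp->v_iflag |= VI_ONDIRTYLIST;
            mtx_lock(&mp->mnt_listmtx);
            TAILQ_INSERT_TAIL(&mp->mnt_dirtyvnodelist, vp, v_dirtylist);
            mtx_unlock(&mp->mnt_listmtx);
        }
        VI_UNLOCK(vp);
    }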

With these out of the way we can have completely unrelated linkage used for aging. That in particular can be made per-domain instead of per-filesystem (or we can start by just keeping it global). It would be easy to make it into a variant of clock or whatever else proves adequate.

An additional feature of said list can be handling of deferred inactive processing. vdrop sees VI_OWEINACT and leaves a 0-held vnode on the active list, which I consider to be a total hack. With the list at hand we can easily keep the ref and mark the vnode with a flag meaning the syncer is responsible for inactive and vdrop. Finally, there may be places which want to drop the usecount but can't afford to risk inactive processing; they could also use this facility (vput_async?).
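A very rough sketch of what such a vput_async() could look like; all of it is invented and the usecount/holdcnt pairing, doomed vnodes and error cases are glossed over:

    void
    vput_async(struct vnode *vp)
    {

        ASSERT_VOP_LOCKED(vp, "vput_async");
        VOP_UNLOCK(vp, 0);
        VI_LOCK(vp);
        if (!refcount_release(&vp->v_usecount) ||
            (vp->v_iflag & VI_DOINGINACT) != 0) {
            /* Users remain or inactive is already in progress. */
            vdropl(vp);
            return;
        }
        /*
         * Last use reference and inactive processing is owed: keep the hold
         * reference and mark the vnode so the syncer performs vinactive()
         * and the final vdrop() later.
         */
        vp->v_iflag |= VI_OWEINACT | VI_DEFERRED;
        VI_UNLOCK(vp);
    }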

Opinions?

sys/kern/vfs_subr.c
4430–4434

This could all be placed in the iterator instead of using msync for something unrelated. You would still need to guarantee that someone iterated at some point though.

Alternatively, if you implemented my idea of the active iterator grabbing a hold count, this happens as a side effect.

4437–4438

This is super hard to read. You should invert the individual tests rather than the whole block.

sys/kern/vfs_subr.c
4430–4434

This would change the semantics of the loop, making it more problematic to MFC, so I don't think it's a good idea.

4437–4438

The tests are the same so that it's easy to see it's the same stuff. Instead of inverting like this I can compare with "== 0", or better yet move all the conditions into an inline ("vfs_want_msync"?) and call it in both places.
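Something along these lines, presumably; the helper name comes from the comment above, while the exact set of conditions shown is only illustrative of the vfs_msync()-style test, not lifted from the diff:

    static bool
    vfs_want_msync(struct vnode *vp, int flags)
    {
        vm_object_t obj;

        obj = vp->v_object;
        return (obj != NULL && (obj->flags & OBJ_MIGHTBEDIRTY) != 0 &&
            (flags == MNT_WAIT || VOP_ISLOCKED(vp) == 0));
    }

    /* Both call sites then read identically: */
    if (vfs_want_msync(vp, flags)) {
        /* ... perform the msync work / defer the vdrop ... */
    }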