
vfs: add VOP_NEED_INACTIVE
ClosedPublic

Authored by mjg on Aug 22 2019, 7:29 PM.

Details

Summary

Currently one of the biggest bottlenecks in VFS is the mandatory exclusive locking performed during 1->0 transitions of the usecount. It frequently contends with shared lookups from the namecache. Most of the time vinactive processing has nothing to do anyway, which makes the exclusive lock a waste. The count could perhaps be cached so that the transition happens less frequently, but inactive processing should be taken care of regardless.

The patch below adds a new VOP that lets filesystems decide whether they want inactive processing, and adds tmpfs as a sample consumer.

I don't have strong opinions about the function itself. Right now it expects to be called with the interlock held, the vnode held, and the usecount dropped to 0 (and the vnode locked or not depending on the mode). Requiring the interlock may be too limiting for some filesystems, although it's definitely not a problem for tmpfs and is easily hackable for ZFS. Alternatively, this can be changed so that if dropping the usecount to 0 fails and the interlock is needed, the routine gets called to do all the work without the interlock held.
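
To make the proposed contract concrete, here is a minimal sketch of what a tmpfs-style consumer could look like. The function name, the omitted headers, and the exact conditions checked are illustrative only; a non-zero return is assumed to mean "inactive processing is wanted", which is the convention settled on later in this review.

/*
 * Hypothetical filesystem-side handler, modeled on the tmpfs sample
 * consumer.  Assumed to be called with the vnode interlock held and the
 * usecount already at 0; returning non-zero asks the caller to proceed
 * with full inactive processing.
 */
static int
example_need_inactive(struct vop_need_inactive_args *ap)
{
	struct vnode *vp;
	struct tmpfs_node *node;

	vp = ap->a_vp;
	node = VP_TO_TMPFS_NODE(vp);
	ASSERT_VI_LOCKED(vp, __func__);

	/* Unlinked nodes must be cleaned up, so inactive is required. */
	if (node->tn_links == 0)
		return (1);

	/*
	 * A real consumer also has to account for dirty pages that need
	 * flushing before the vnode leaves the active list; see the
	 * discussion (and the helper sketch) further down in the thread.
	 */
	return (0);
}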

Note that the awk magic does not enforce the interlock just yet; I will fix that once the protocol is agreed on.

I'm posting this form of the patch mostly to prove this is indeed the main problem. Performance-wise this takes care of the majority of contention for poudriere -j 104. Here are lock wait times after 500 minutes of building:

Before:

557563641706 (lockmgr:tmpfs)
123628486430 (sleep mutex:vm page)
104729843164 (rw:pmap pv list)
62875264471 (rw:vm object)
55746977102 (sx:vm map (user))
29644578671 (sleep mutex:vnode interlock)
17785800434 (sleep mutex:ncvn)
14703794563 (sx:proctree)
8305983621 (rw:ncbuc)
8019279135 (sleep mutex:process lock)

After:

94637557942 (rw:pmap pv list)
91465952963 (sleep mutex:vm page)
61032156416 (rw:vm object)
46309603301 (lockmgr:tmpfs)
45532512098 (sx:vm map (user))
14171493881 (sleep mutex:vnode interlock)
11766102343 (sx:proctree)
9849521242 (sleep mutex:ncvn)
8708503633 (sleep mutex:process lock)
8578206534 (sleep mutex:sleep mtxpool)

That is, tmpfs goes from dominating as a bottleneck to being just one of several problems.

With this taken care of, it opens up the possibility of other work, such as adaptive spinning for lockmgr, which can further reduce contention.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

Delegating so much of basic VFS operation to a VOP does not look pretty.

Why not add some VOP, e.g. VOP_NEED_INACT(), which would tell VFS whether inactivation is desired? Since the call to VOP_NEED_INACT() would happen from vputx(9) under the shared vnode lock, it should be good enough. I think that VOP_NEED_INACT() could be the place that sets the VI_OWEINACT flag when needed.

  • reimplement as VOP_NEED_INACTIVE + other changes

There are calls from vrele which come with the vnode unlocked, and it would be nice to avoid locking it just to check for inactive. Races with VM objects being stolen away should not be a problem here; worst case, the vnode can always get locked.

Apart from that, note there is a check in vinactive which may or may not be fine to delegate to the filesystem (or perhaps give it a helper to call?).

So, how about the updated patch? It keeps the interlock held across the call, which I'm not completely fond of, but it should be fine.

So why not set VI_OWEINACT in the VOP instead of using the return value (which is better kept for errors)?

I suggest adding asserts that the vnode interlock is owned in pre/post methods for the VOP.

  • add assertions for the interlock
  • add OWEINACT to the vnode in the target routine

I wanted to avoid touching the vnode unnecessarily. This can still return an error normally and take a pointer to do_inact as the second argument.

Either way, I don't know what the error is supposed to signify here or how it can be handled. If the answer is to just proceed with inactive, then there is no difference from vputx's point of view.

sys/kern/vfs_subr.c:2901 (On Diff #61272)

Why not push this condition into stdneed_inactive()?

sys/kern/vfs_subr.c:2901 (On Diff #61272)

Or rather, I do not understand the presence of this test at all, since stdneed_inactive() always sets VI_OWEINACT.

sys/kern/vfs_subr.c:2901 (On Diff #61272)

Then who is supposed to perform this check? vinactive performs the check and calls vm_object_page_clean(); that's outside of whatever happens in VOP_INACTIVE. Adding the check here preserves the behavior (of not requiring NEED_INACTIVE to look at it).

I don't have a strong opinion one way or the other; I just want to push optional inactive processing through.

sys/kern/vfs_subr.c:2901 (On Diff #61272)

This check, and the action that it gates, are required when moving the vnode from the active to the inactive list, because the syncer only flushes pages for active vnodes. Without the check, dirty pages of the inactivated vnode could remain unsynced for an arbitrarily long time, causing user data loss on crash or power failure.

In the current patch, vinactive() is always called, except for tmpfs vnodes. If a filesystem starts providing its own implementation of NEED_INACTIVE(), it should handle that as well (though, e.g., if the filesystem is read-only, there is no need).
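
As an illustration of the check being discussed, here is a rough sketch of such a page-flush test as a standalone helper. It is a guess at the shape of what the later-mentioned vn_need_pageq_flush helper provides; the VV_NOSYNC handling and the OBJ_MIGHTBEDIRTY flag are assumptions and may not match the committed code exactly.

/*
 * Does this vnode have pages that must be flushed before it goes
 * inactive?  Assumed to be called with the vnode interlock held.
 */
static int
example_need_pageq_flush(struct vnode *vp)
{
	struct vm_object *obj;

	ASSERT_VI_LOCKED(vp, __func__);
	obj = vp->v_object;
	/*
	 * A flush is needed only for a possibly-dirty backing object on
	 * a vnode that is not marked "no sync".
	 */
	return (obj != NULL && (vp->v_vflag & VV_NOSYNC) == 0 &&
	    (obj->flags & OBJ_MIGHTBEDIRTY) != 0);
}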

mjg retitled this revision from "[wip] vfs: add VOP_VPUTX" to "vfs: add VOP_NEED_INACTIVE".

Following the IRC discussion:

  • make the return value indicate whether inactive processing is needed (see the sketch after this list)
  • provide vn_need_pageq_flush, a helper for filesystems
  • use the default routine in unionfs and nullfs
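
To show how this protocol pays off at the call site, here is a simplified, hypothetical sketch of the 1->0 path. It is not the committed vputx() logic; the locking details and the vinactive()/VOP_UNLOCK() signatures are assumptions for the era of this patch.

/*
 * Hypothetical drop of the last use reference: with the interlock
 * held, ask the filesystem whether inactive processing is wanted and
 * skip the exclusive vnode lock entirely when it is not.
 */
static void
example_drop_last_use(struct vnode *vp)
{
	ASSERT_VI_LOCKED(vp, __func__);

	if (VOP_NEED_INACTIVE(vp) == 0) {
		/* Fast path: drop the hold count and the interlock. */
		vdropl(vp);
		return;
	}
	vp->v_iflag |= VI_OWEINACT;
	/* Slow path: take the exclusive lock and run inactive. */
	if (vn_lock(vp, LK_EXCLUSIVE | LK_INTERLOCK) == 0) {
		VI_LOCK(vp);
		if ((vp->v_iflag & VI_OWEINACT) != 0)
			vinactive(vp, curthread);
		VI_UNLOCK(vp);
		VOP_UNLOCK(vp, 0);
	}
	vdrop(vp);
}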

Modulo perhaps some nits, I think this is in committable form (after it gets tested). I'll add the ZFS and nullfs routines and ask pho.

This revision is now accepted and ready to land. Aug 27 2019, 3:52 PM
  • fix a VI_DOOMED race found by pho

Now the patch passes stress2.

This revision now requires review to proceed. Aug 28 2019, 7:11 PM

The nullfs bit will be handled in a different review.

This revision is now accepted and ready to land. Aug 28 2019, 8:23 PM
This revision was automatically updated to reflect the committed changes.