
vfs: add VOP_NEED_INACTIVE
ClosedPublic

Authored by mjg on Thu, Aug 22, 7:29 PM.

Details

Summary

Currently one of the biggest bottlenecks in the VFS is the mandatory exclusive vnode locking that happens during 1->0 transitions of the usecount. It frequently contends with shared-locked lookups from the namecache. Most of the time vinactive processing has nothing to do anyway, which makes the exclusive lock a waste. There are ways to perhaps cache the count so that the transition happens less frequently, but inactive processing should be taken care of regardless.

The patch below adds a new VOP to let filesystems decide whether they want inactive processing, and adds tmpfs as a sample consumer.

I don't have strong opinions about the function itself. Right now it expects to be called with the interlock held, the vnode held, and the usecount dropped to 0 (and the vnode locked or not, depending on the mode). Requiring the interlock may be too limiting for some filesystems, although it is definitely not a problem for tmpfs and is easily hackable for zfs. Alternatively, this can be changed so that if dropping the usecount to 0 fails and the interlock is needed, the routine gets called to do all the work without the interlock held.
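
To make the calling contract concrete, here is a minimal sketch of what a filesystem-side routine might look like under it; the filesystem ("xxxfs") and its node fields are made up for illustration, and this is not the actual diff:

    /*
     * Hypothetical sketch only: a per-filesystem check entered with the
     * vnode held, its usecount at 0, and the vnode interlock held by the
     * caller.  "xxxfs" and its fields are illustrative.
     */
    static int
    xxxfs_need_inactive(struct vnode *vp)
    {
            struct xxxfs_node *node;

            ASSERT_VI_LOCKED(vp, __func__);
            node = vp->v_data;
            /* Unlinked nodes still need inactive to free their backing store. */
            if (node->xn_links == 0)
                    return (1);
            /* Otherwise there is nothing for inactive processing to do. */
            return (0);
    }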

Note that the awk magic does not enforce the interlock just yet; I will fix that later once the protocol is agreed on.

I'm posting this form of the patch mostly to prove this is indeed the main problem. Performance-wise this takes care of the majority of contention for poudriere -j 104. Here are wait times after 500 minutes of building:

Before:

557563641706 (lockmgr:tmpfs)
123628486430 (sleep mutex:vm page)
104729843164 (rw:pmap pv list)
62875264471 (rw:vm object)
55746977102 (sx:vm map (user))
29644578671 (sleep mutex:vnode interlock)
17785800434 (sleep mutex:ncvn)
14703794563 (sx:proctree)
8305983621 (rw:ncbuc)
8019279135 (sleep mutex:process lock)

After:

94637557942 (rw:pmap pv list)
91465952963 (sleep mutex:vm page)
61032156416 (rw:vm object)
46309603301 (lockmgr:tmpfs)
45532512098 (sx:vm map (user))
14171493881 (sleep mutex:vnode interlock)
11766102343 (sx:proctree)
9849521242 (sleep mutex:ncvn)
8708503633 (sleep mutex:process lock)
8578206534 (sleep mutex:sleep mtxpool)

I.e., tmpfs goes from dominating as a bottleneck to being just one of several problems.

With this taken care of, it opens up the possibility of doing other things, such as adaptive spinning for lockmgr, which can further reduce contention.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mjg created this revision. Thu, Aug 22, 7:29 PM
mjg edited the summary of this revision. Thu, Aug 22, 7:31 PM
kib added a comment. Sat, Aug 24, 8:51 PM

Delegating so much of the basic VFS operation to a VOP does not look pretty.

Why not add some VOP, e.g. VOP_NEED_INACT(), which would tell the VFS whether inactivation is desired? Since the call to VOP_NEED_INACT() would happen from vputx(9) under the shared vnode lock, it should be good enough. I think that VOP_NEED_INACT() could be the place that sets the VI_OWEINACT flag when needed.
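
A minimal sketch of that alternative (hypothetical names, not the committed code): the filesystem records the answer itself while the caller holds the vnode lock at least shared.

    /*
     * Hypothetical sketch of the suggested VOP_NEED_INACT(): called from
     * vputx() with the vnode lock held (shared is enough); the filesystem
     * decides and marks the vnode itself.  xxxfs_wants_inactive() stands
     * in for whatever fs-specific check applies.
     */
    static int
    xxxfs_need_inact(struct vnode *vp)
    {
            if (xxxfs_wants_inactive(vp)) {
                    VI_LOCK(vp);
                    vp->v_iflag |= VI_OWEINACT;
                    VI_UNLOCK(vp);
            }
            return (0);
    }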

mjg updated this revision to Diff 61244. Sat, Aug 24, 10:47 PM
  • reimplement as VOP_NEED_INACTIVE + other changes

There are calls from vrele which come with the vnode unlocked, and it would be nice to avoid locking it just to check for inactive. Races with vm objects being stolen away should not be a problem here, but in the worst case the vnode can always get locked.

Apart from that, note there is a check in vinactive which may or may not be fine to delegate to the filesystem (or perhaps give it a helper to call?).

So how about the updated patch? It keeps the interlock held across the call, which I'm not completely fond of, but it should be fine.
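
As a rough sketch of the caller-side shape being described (hypothetical helper name; the exact way the answer is passed back is what the following comments discuss):

    /*
     * Hypothetical sketch of the vputx()/vrele() side: the usecount has
     * just dropped to 0 and the interlock stays held across the call, so
     * the filesystem sees a stable vnode while it decides.
     */
    static void
    vputx_final_sketch(struct vnode *vp)
    {
            ASSERT_VI_LOCKED(vp, __func__);
            MPASS(vp->v_usecount == 0);

            if (VOP_NEED_INACTIVE(vp) == 0) {
                    /* Nothing to do: skip the exclusive lock and vinactive(). */
                    VI_UNLOCK(vp);
                    return;
            }
            vp->v_iflag |= VI_OWEINACT;
            VI_UNLOCK(vp);
            /* ... continue with the usual exclusive-lock + vinactive() path. */
    }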

kib added a comment (edited). Sun, Aug 25, 11:55 AM

So why not set VI_OWEINACT in the VOP instead of using the return value (which is better kept for errors)?

I suggest adding asserts that the vnode interlock is owned in pre/post methods for the VOP.
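
A minimal sketch of such assertions, assuming they get wired up as the usual vop_*_pre/vop_*_post hooks (the hook names and wiring are assumptions):

    /*
     * Hypothetical pre/post hooks asserting the interlock protocol for
     * VOP_NEED_INACTIVE; how they are hooked into vnode_if.src is assumed.
     */
    void
    vop_need_inactive_pre(void *ap)
    {
            struct vop_need_inactive_args *a = ap;

            ASSERT_VI_LOCKED(a->a_vp, "VOP_NEED_INACTIVE");
    }

    void
    vop_need_inactive_post(void *ap, int rc __unused)
    {
            struct vop_need_inactive_args *a = ap;

            ASSERT_VI_LOCKED(a->a_vp, "VOP_NEED_INACTIVE");
    }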

mjg updated this revision to Diff 61272. Sun, Aug 25, 10:40 PM
  • add assertions for the interlock
  • add OWEINACT to the vnode in the target routine

I wanted to avoid touching the vnode unnecessarily. This can still return an error normally and just take the address of a do_inact flag as the second argument.

Either way, I don't know what the error is supposed to signify here or how it could be handled. If the answer is to just proceed with inactive, then there is no difference from vputx's point of view.

kib added inline comments. Mon, Aug 26, 11:59 AM
sys/kern/vfs_subr.c
2901 (On Diff #61272)

Why not push this condition into stdneed_inactive()?

kib added inline comments. Mon, Aug 26, 12:03 PM
sys/kern/vfs_subr.c
2901 (On Diff #61272)

Or rather, I do not understand the presence of this test at all, since stdneed_inactive() always sets VI_OWEINACT.

mjg added inline comments. Mon, Aug 26, 9:44 PM
sys/kern/vfs_subr.c
2901 (On Diff #61272)

Then who is supposed to perform this check? vinactive performs the check and calls vm_object_page_clean(); that's outside of whatever happens in VOP_INACTIVE. Adding the check here preserves the behavior (of not requiring NEED_INACTIVE to look at it).

I don't have a strong opinion one way or the other, just want to push optional inactive processing through.

kib added inline comments. Tue, Aug 27, 1:32 PM
sys/kern/vfs_subr.c
2901 (On Diff #61272)

This check, and the action that it gates, are required when moving the vnode from the active to the inactive list, because the syncer only flushes pages for active vnodes. Without the check, dirty pages of the inactivated vnode could stay unsynced for an arbitrarily long time, causing user data loss on a crash or power failure.

In the current patch, vinactive() is always called, except for tmpfs vnodes. If a filesystem starts providing its own implementation of NEED_INACTIVE(), it should handle that as well (though e.g. if the filesystem is read-only, there is no need).
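
For reference, the gated action looks roughly like the following (a sketch modeled on what vinactive() currently does, not a quote of the patch; the helper name is made up):

    /*
     * Sketch of the check and the action it gates, modeled on existing
     * vinactive() behavior: if the vnode's VM object may hold dirty pages,
     * start cleaning them so they are not stranded once the vnode stops
     * being visited by the syncer.
     */
    static void
    flush_dirty_pages_sketch(struct vnode *vp)
    {
            vm_object_t obj;

            obj = vp->v_object;
            if (obj != NULL && (obj->flags & OBJ_MIGHTBEDIRTY) != 0) {
                    VM_OBJECT_WLOCK(obj);
                    vm_object_page_clean(obj, 0, 0, 0);
                    VM_OBJECT_WUNLOCK(obj);
            }
    }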

mjg updated this revision to Diff 61357 (edited). Tue, Aug 27, 3:20 PM
mjg retitled this revision from [wip] vfs: add VOP_VPUTX to vfs: add VOP_NEED_INACTIVE.

Following the IRC discussion:

  • make the return value indicate whether inactive is needed
  • provide vn_need_pageq_flush, a helper for filesystems
  • use the default routine in unionfs and nullfs

Modulo perhaps some nits, I think this is in committable form (after it gets tested). I'll add zfs and nullfs routines and ask pho.
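
Under this final protocol, a filesystem routine might combine its own state with the new helper roughly as follows (hypothetical filesystem; a sketch rather than the committed tmpfs code):

    /*
     * Hypothetical VOP_NEED_INACTIVE implementation under the final
     * protocol: return nonzero if VOP_INACTIVE should run.  "xxxfs" and
     * its fields are made up; vn_need_pageq_flush() is the helper
     * mentioned above.
     */
    static int
    xxxfs_need_inactive(struct vop_need_inactive_args *ap)
    {
            struct vnode *vp = ap->a_vp;
            struct xxxfs_node *node = vp->v_data;

            /* Unlinked nodes must still go through inactive to be reclaimed. */
            if (node->xn_links == 0)
                    return (1);
            /* Dirty pages must be flushed before the vnode leaves the active list. */
            return (vn_need_pageq_flush(vp));
    }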

kib accepted this revision. Tue, Aug 27, 3:52 PM
This revision is now accepted and ready to land. Tue, Aug 27, 3:52 PM
mjg updated this revision to Diff 61416 (edited). Wed, Aug 28, 7:11 PM
  • fix a VI_DOOMED race found by pho

Now the patch passes stress2.

This revision now requires review to proceed. Wed, Aug 28, 7:11 PM
mjg added a comment. Wed, Aug 28, 7:11 PM

The nullfs bit will be handled in a different review.

kib accepted this revision. Wed, Aug 28, 8:23 PM
This revision is now accepted and ready to land. Wed, Aug 28, 8:23 PM
This revision was automatically updated to reflect the committed changes.