Currently one of the biggest bottlenecks in VFS is the mandatory exclusive vnode locking that happens on 1->0 transitions of usecount. It frequently contends with shared-locked lookups from the namecache. Most of the time inactive processing has nothing to do anyway, which makes the exclusive lock a waste. There are ways to perhaps cache the count so the transition happens less frequently, but inactive processing has to be handled regardless.
The patch below adds a new VOP to let filesystems decide whether they want inactive processing, and converts tmpfs as a sample consumer.
I don't have strong opinions about the function itself. Right now it expects to be called with the interlock held, the vnode held, and usecount dropped to 0 (with the vnode locked or not depending on the mode). Requiring the interlock may be too limiting for some filesystems, although it's definitely not a problem for tmpfs and is easily doable for ZFS. Alternatively, this could be changed so that if dropping the usecount to 0 fails and the interlock is needed, the routine gets called to do all the work without the interlock held.
Note that the awk magic does not enforce the interlock assertion just yet; I'll fix that once the protocol is agreed on.
I'm posting this form of the patch mostly to demonstrate that this is indeed the main problem. Performance-wise it takes care of the majority of the contention for poudriere -j 104. Here are wait times after 500 minutes of building:
Before:
557563641706 (lockmgr:tmpfs)
123628486430 (sleep mutex:vm page)
104729843164 (rw:pmap pv list)
62875264471 (rw:vm object)
55746977102 (sx:vm map (user))
29644578671 (sleep mutex:vnode interlock)
17785800434 (sleep mutex:ncvn)
14703794563 (sx:proctree)
8305983621 (rw:ncbuc)
8019279135 (sleep mutex:process lock)
After:
94637557942 (rw:pmap pv list)
91465952963 (sleep mutex:vm page)
61032156416 (rw:vm object)
46309603301 (lockmgr:tmpfs)
45532512098 (sx:vm map (user))
14171493881 (sleep mutex:vnode interlock)
11766102343 (sx:proctree)
9849521242 (sleep mutex:ncvn)
8708503633 (sleep mutex:process lock)
8578206534 (sleep mutex:sleep mtxpool)
i.e., tmpfs goes from dominating the profile as a bottleneck to being just one of several problems.
With this taken care of, it opens up the possibility of other work, such as adaptive spinning for lockmgr, which can further reduce contention.