Currently one of the biggest bottlenecks in VFS is the mandatory exclusive vnode locking that happens on 1->0 transitions of usecount. It frequently contends with shared-locked lookups from the namecache. Most of the time inactive processing has nothing to do anyway, which makes the exclusive lock a waste. There are ways to perhaps cache the count so the transition happens less frequently, but inactive processing has to be handled regardless.
The patch below adds a new VOP to let filesystems decide whether they want inactive processing, and converts tmpfs as a sample consumer.
I don't have strong opinions about the function itself. Right now it expects to be called with the interlock held, the vnode held, and usecount dropped to 0 (with the vnode locked or not depending on the mode). Requiring the interlock may be too limiting for some filesystems, although it's definitely not a problem for tmpfs and is easily doable for ZFS. Alternatively, this could be changed so that if dropping the usecount to 0 fails and the interlock is needed, the routine gets called to do all the work without the interlock held.
Note that the awk magic does not enforce the interlock assertion just yet; I'll fix that once the protocol is agreed on.
I'm posting this form of the patch mostly to demonstrate that this is indeed the main problem. Performance-wise it takes care of the majority of the contention for poudriere -j 104. Here are wait times after 500 minutes of building:
Before:
557563641706 (lockmgr:tmpfs)
123628486430 (sleep mutex:vm page)
104729843164 (rw:pmap pv list)
62875264471 (rw:vm object)
55746977102 (sx:vm map (user))
29644578671 (sleep mutex:vnode interlock)
17785800434 (sleep mutex:ncvn)
14703794563 (sx:proctree)
8305983621 (rw:ncbuc)
8019279135 (sleep mutex:process lock)
After:
94637557942 (rw:pmap pv list)
91465952963 (sleep mutex:vm page)
61032156416 (rw:vm object)
46309603301 (lockmgr:tmpfs)
45532512098 (sx:vm map (user))
14171493881 (sleep mutex:vnode interlock)
11766102343 (sx:proctree)
9849521242 (sleep mutex:ncvn)
8708503633 (sleep mutex:process lock)
8578206534 (sleep mutex:sleep mtxpool)
i.e., tmpfs goes from dominating the profile as a bottleneck to being just one of several problems.
With this taken care of, it opens up the possibility of other work, such as adaptive spinning for lockmgr, which can further reduce contention.