vfs: make getnewvnode try M_NOWAIT before doing harder work
AbandonedPublic
Actions

Authored by mjg on Jan 9 2020, 2:38 PM.

Details

Reviewers

kib
jeff

Summary

The patch can be split, but it is a little bit iffy to do it.

The current logic to handle numvnodes is partially duplicated and different depending on whether getnewvnode itself or the _reserv variant is called. Unify everything into vn_alloc.

numvnodes does not guarantee UMA has memory to allocate a new vnode, meaning that users of the current _reserv routine can still block much later (and possibly in a way which prevents some vnodes from getting freed). Since the only consumer passes the value of 1 and it never nests reservations, change the code to simply prealloc a vnode.

Finally, the vnode limit is already not strictly maintained (and it arguably should not be). The current behavior is that if immediate numvnodes bump fails, the code may try to either recycle a free vnode or wait for vnlru to free something up.

vnlru_free_locked makes only one attempt which may fail -- it can return without freeing up anything and numvnodes is incremented anyway
getnewvnode_wait has a sleep up to 1s and another vnlru_free_locked call followed by once more incrementing numvnodes regardless of the outcome

Especially 2 is highly problematic in that it inserts stalls when the total count is close to the limit, even when UMA has memory to accommodate the request.

Even as it is vnlru_proc is expected to bring total counts back to the preconfigured ranges.

Also note that since the routine was guaranteed to not fail for years, making it able to fail now becomes quite problematic.

Thus until this gets further reworked, I think adding a uma_zalloc(..., M_NOWAIT) call before messing with numvnodes is the right thing to do. The total vnode count is already indirectly restricted by a sum of several factors. The new added alloc adds more slop for vnlru proc to sort out but also eliminates avoidable stalls, other than that sticks to the current behavior.

This also eliminates vnlru_list acquire form this codepath during poudriere -j 104 and enables both numvnodes and freevnodes to be moved to a per-cpu scheme.

tl;dr this really needs to be reworked further but the above patch mostly moves the problem out of the way.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 28571

Event Timeline

mjg created this revision.Jan 9 2020, 2:38 PM

Herald added subscribers: delphij, imp. · View Herald TranscriptJan 9 2020, 2:38 PM

Harbormaster completed remote builds in B28571: Diff 66554.Jan 9 2020, 2:38 PM

mjg edited the summary of this revision. (Show Details)Jan 9 2020, 2:38 PM

mjg edited the summary of this revision. (Show Details)Jan 9 2020, 2:40 PM

mjg edited the summary of this revision. (Show Details)Jan 9 2020, 2:54 PM

mjg edited the summary of this revision. (Show Details)

mjg edited the summary of this revision. (Show Details)Jan 9 2020, 2:57 PM

mjg edited the summary of this revision. (Show Details)Jan 9 2020, 2:59 PM

You can split out and commit some parts of this, like removal of the argument to getnewvnode_reserve() and switch to td_vp_reserved.

sys/kern/vfs_subr.c
1473	I disagree with this, completely. The vnode limit is not to enforce some arbitrary limits on vnodes, it is there to allow kernel to avoid KVA exhaustion. The problem comes mostly from the secondary allocations for vnodes. For instance, UFS vnode instantiation means that we also allocate vm_object, inode, dinode, (often) ext attr area, and probably more. From what I saw, for ZFS it is even more severe. Too high values for desiredvnodes killed 32bit machines, see e.g. 295971, your change makes the limit not effective at all.

mjg added inline comments.Jan 12 2020, 1:08 AM

sys/kern/vfs_subr.c
1473	This is part of why this is NOWAIT. Should there be any RAM shortage this is expected to fail at which point we go back to the old way.

For now abandaon this in favor of D23140

kib added inline comments.Jan 12 2020, 8:57 AM

sys/kern/vfs_subr.c
1473	Would the vnodes be the only memory consumer, then yes, it mostly worked. But they are not, and there is a very significant allocations done after each vnode allocation, as I mentioned above. Both factors make overriding the limit very undesirable, and actually fatal on KVA-starved arches (regardless the amount of memory they have). Also, the existing override for the vnode count limit is only done for the situation when the request comes from suspender. It is not completely true, the code to check for suspension was added before the suspension owner field was added to struct mount. This is because blocking suspender there causes deadlock, and it could be improved by checking that curthread == mp->mnt_susp_owner.