
kern.maxvnodes should be a tunable, and probably shouldn't be run-time modifiable
ClosedPublic

Authored by wollman on Apr 9 2015, 5:27 PM.

Details

Summary

The kernel variable "desiredvnodes", which can be run-time modified through kern.maxvnodes MIB variable, is used to size a number of hash tables early in the boot process. Changing this value at run time doesn't change the size of the hash tables, which results in poor performance. As a first step, make desiredvnodes be a tunable, so that it's actually possible to get the hash tables sized correctly. (Dynamically resizing the hash tables is left as a possible future enhancement.)

Test Plan

This is part of my file server benchmarking campaign and I will be using this to quantify the effect of increasing vnode limits on various workloads.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

wollman retitled this revision to kern.maxvnodes should be a tunable, and probably shouldn't be run-time modifiable. Apr 9 2015, 5:27 PM
wollman updated this object.
wollman edited the test plan for this revision.
wollman added reviewers: mav, kib, mckusick.
wollman set the repository for this revision to rS FreeBSD src repository.
wollman updated this revision to Diff 4760.
wollman updated this revision to Diff 4761.

Don't overwrite the value specified in the tunable when we initialize desiredvnodes.

mav edited edge metadata. Apr 9 2015, 7:54 PM

I see no harm in making it a tunable; that is good. But I don't like the idea of making it read-only. I believe loader-only tunables are too inconvenient and should be avoided when possible.
As for the practical implementation: right now the tunable can only increase the automatically computed value, not decrease it. That is odd.

In D2265#6, @mav wrote:

I see no harm in making it a tunable; that is good. But I don't like the idea of making it read-only. I believe loader-only tunables are too inconvenient and should be avoided when possible.

Can you think of a use case when you would want to change the number of vnodes *without* resizing the hash tables? I'm struggling to think of any situation in which that would be sensible.

As for the practical implementation: right now the tunable can only increase the automatically computed value, not decrease it. That is odd.

Hmmm, yes, I could have written that differently (and almost did). I'll revise after some testing of this version.

mav added a comment. Apr 9 2015, 10:08 PM
In D2265#7, @wollman wrote:
In D2265#6, @mav wrote:

I see no harm in making it a tunable; that is good. But I don't like the idea of making it read-only. I believe loader-only tunables are too inconvenient and should be avoided when possible.

Can you think of a use case when you would want to change the number of vnodes *without* resizing the hash tables? I'm struggling to think of any situation in which that would be sensible.

I just hate it when I have to reboot the system to do some tuning. I think a read-write sysctl that is somewhat suboptimal is better than a read-only one; making the sysctl read-only would be a step in the wrong direction.

In D2265#8, @mav wrote:

I just hate it when I have to reboot the system to do some tuning. I think a read-write sysctl that is somewhat suboptimal is better than a read-only one; making the sysctl read-only would be a step in the wrong direction.

You can't meaningfully tune the system with it read-write because that doesn't resize the hash tables, and I'm not going to be implementing that -- I'm not sure if it's even practical to resize on a running system. So I could leave it read-write, but you'd still want to reboot to actually do anything useful (other than demonstrate that not resizing the hash tables makes your system slower, of course). All this patch does is make it *possible* to do it at boot time so the hash tables get sized correctly for the configured vnode limit.

mav added a comment. Apr 9 2015, 10:21 PM
In D2265#9, @wollman wrote:

You can't meaningfully tune the system with it read-write because that doesn't resize the hash tables.

Maybe I don't understand something (or even everything), since this is not my area of the kernel, but on my system I have this sysctl set in /etc/sysctl.conf. From your words it should be totally pointless, but in my experience it is not. IIRC it allows the system to cache more files, which is useful when doing large source searches. I have no idea which hashes you are talking about, and I can easily believe that an underestimated hash size makes things slower than expected. But since in my case the difference between the default value and the manually set one is only a few times, I doubt the performance difference is major enough compared to the inability to cache all the files in use.

wollman updated this object. Apr 9 2015, 11:03 PM
wollman edited edge metadata.
wollman updated this revision to Diff 4773.

This version actually works, and restores the run-time modifiability. The way I've implemented it preserves the compiled-in limit of MAXVNODES_MAX at boot time only; probably it should be bypassed or made into a louder warning when set as a tunable, or else enforced at run time as well. (It's not clear whether MAXVNODES_MAX is intended to be a true limit or just an additional sanity check for the auto-tuning implementation.)

kib edited edge metadata. Apr 10 2015, 2:34 AM

The idiomatic use of TUNABLE_INT_FETCH is to call it after the auto-sizing calculation is done. If the tunable is not passed to the kernel, the TUNABLE_XXX_FETCH() facility keeps the variable intact. Please look at other places in the kernel source to see how it is typically used.

The hash that is not resized is most likely vfs_hash (vfs_hash.c). It is fairly harmless not to adjust the hash size after desiredvnodes changes at runtime. Typically the value is increased 2-5 times over the original, so the hash mask would either stay the same or double if the hash were resized; even if the mask change were larger, it would still be fine. The reason somebody increases desiredvnodes is to cache more vnodes and improve performance. If performance degrades instead, the tuner will draw the right conclusion, perhaps with the help of documentation, which is where your finding really belongs.

In D2265#14, @kostikbel wrote:

The hash which is not resized is the vfs_hash (vfs_hash.c), most likely. It is very much innocent to not adjust the hash size after desiredvnodes change at runtime.

Actually, the more important one (I think) is the name cache hash (vfs_cache.c). There are some pretty dramatic impacts on performance which I hope to be able to say more about soon (after completing several series of benchmarks), although I can't say for certain where the particular performance impacts are coming from (the sort of testing I'm doing at this stage doesn't identify that).

kib added a comment. Apr 10 2015, 10:00 AM
In D2265#15, @wollman wrote:

Actually, the more important one (I think) is the name cache hash (vfs_cache.c). There are some pretty dramatic impacts on performance which I hope to be able to say more about soon (after completing several series of benchmarks), although I can't say for certain where the particular performance impacts are coming from (the sort of testing I'm doing at this stage doesn't identify that).

What I said applies equally to the namecache sizing. Note that vfs.ncsizefactor is also writable at runtime, and changing it does not resize the hash either.

See this image for a demonstration of what a difference correctly sized hash tables makes. (You can ignore the red pluses, that's just the default sizing for everything.) This is the metadata-intensive SWBUILD workload of the SPEC SFS 2014 benchmark; I'm collecting data for the other workloads presently.

mckusick edited edge metadata. Apr 13 2015, 5:47 AM
mckusick requested changes to this revision.

I am happy that Garrett is spending some time measuring filesystem
performance. It has not been studied in some time, and it appears
that performance has dropped off since it peaked in FreeBSD 7 or 8.

I believe that being able to change the number of desired vnodes is
something that should be allowed on the fly. It should be possible
to increase it when it is clear that more files should be cached and
decreased when warranted by memory pressure.

As noted, the hash tables need to match the expected number of vnodes.
The attached is my attempt at resizing the hash tables when the number
of vnodes is changed. I have not yet had the opportunity to test this
code, so it is proof of concept only. I hope that kib can take a
look at it to ensure that my locking assumptions are correct. And I
hope that Garrett can ensure that it works and is appropriately
performant.

I am incapable of figuring out how to add diffs to this comment, so the
best I seem able to do is to attach a file with my diffs to this comment.

This revision now requires changes to proceed. Apr 13 2015, 5:47 AM
kib added a comment. Apr 13 2015, 6:27 AM
In D2265#18, @mckusick wrote:

I am happy that Garrett is spending some time measuring filesystem
performance. It has not been studied in some time, and it appears
that performance has dropped off since it peaked in FreeBSD 7 or 8.
I believe that being able to change the number of desired vnodes is
something that should be allowed on the fly. It should be possible
to increase it when it is clear that more files should be cached and
decreased when warranted by memory pressure.
As noted, the hash tables need to match the expected number of vnodes.
The attached is my attempt at resizing the hash tables when the number
of vnodes is changed. I have not yet had the opportunity to test this
code, so it is proof of concept only. I hope that kib can take a
look at it to ensure that my locking assumptions are correct. And I
hope that Garrett can ensure that it works and is appropriately
performant.

The vfs_hash.c changes look fine.

For vfs_cache.c, I suggest not returning EAGAIN from the debugging sysctls.
Also, I think nchash and the hash bucket heads must be evaluated and used under the lock.
It is somewhat surprising that the cache, with all its read->write lock upgrades and
retry logic, is safe with respect to a namecache hash resize.

I am incapable of figuring out how to add diffs to this comment, so the
best I seem able to do is to attach a file with my diffs to this comment.

I think the right way would be to create a new revision and put a link to the revision there.

wollman abandoned this revision. Sep 6 2015, 7:51 PM

This is now overtaken by events in mckusick's commit r287497.

Closed by commit rS288079: MFC of 281677: (authored by mckusick).
This revision was automatically updated to reflect the committed changes.