
vfs: vntblinit(): Rework comments on the computation of 'kern.maxvnodes'
ClosedPublic

Authored by olce on Mon, May 12, 1:18 PM.

Details

Summary

Remove the parts that describe what is already in the code formula. Add
hints about which of 'physvnodes'/'virtvnodes' takes precedence, and
when 'desiredvnodes' becomes smaller than 'maxfiles'. These were both
computed analytically and verified experimentally.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

olce requested review of this revision. Mon, May 12, 1:18 PM

Originally from 1d7fe4b5159cb66a28e94771c207f822ddaa8a9e which also has a comment about revisiting MAXVNODES_MAX once physical memory exceeds 512GB.

What is the point of making a comment that repeats the code in an even more verbose way? This would make the comment inconsistent with the actual code on the next minor change, whatever it may be.

The main point of the comment is that there are two concurrent limiting scales for the total number of vnodes the kernel can cache. One is the physical memory, relevant for machines without KVA pressure. The other is the amount of KVA we can dedicate to caches while leaving the other subsystems room to breathe. IMO the original text does this right, maybe with too many specifics. Your patch reverses that, making the main idea harder to grasp while verbosely repeating the code.

98,304 is indeed magical; I suspect it came from some situation, as the minimum sustainable count of vnodes for some load.

In D50310#1147813, @kib wrote:

What is the point of making a comment that repeats the code in an even more verbose way?

The point of the new message is to clearly spell out in plain English what the formulae mean in more "physical" terms and to be extra clear about the computation (in particular, the 98,304 cutoff is not on the final number of vnodes, and the stated growth rate is in fact not the global one, both because the previous comments do not take the maxproc contribution into account). The computation here is not overly complex, but I still think the text has value. This is a prerequisite either for revising it (in a very minor way for now, for an alternative where the number of vnodes is raised instead) or simply for quickly evaluating where it stands asymptotically with respect to maxfiles. In particular, I think it is important to state which formula actually applies and when.

This would make the comment inconsistent with the actual code on the next minor change, whatever it may be.

There are no more numbers in the new description than in the previous one, except for the last paragraph (see below).

While I agree with your concern in general, for a system-wide setting like this, I certainly don't want anyone to revise these numbers without proper review and updates of the surrounding comments.

The main point of the comment is that there are two concurrent limiting scales for the total number of vnodes the kernel can cache. One is the physical memory, relevant for machines without KVA pressure. The other is the amount of KVA we can dedicate to caches while leaving the other subsystems room to breathe. IMO the original text does this right, maybe with too many specifics. Your patch reverses that, making the main idea harder to grasp while verbosely repeating the code.

I disagree, both on the previous comment being clearer about the physical/KVA distinction (only the "kernel's heap size" is mentioned in both versions, and both state "limit") and on the new formulation being harder to grasp (the very physical/KVA distinction is isolated in the first paragraph!).

That said, I don't claim the text is perfect, and will see if I can remove some redundancy while keeping the logical paragraphs.

98,304 is indeed magical; I suspect it came from some situation, as the minimum sustainable count of vnodes for some load.

That's the kind of thing that should be reasonably explained in comments. I can guess why the formula is piecewise affine with a reduced growth rate towards infinity, but the precise cutoff value is harder to second-guess. The corresponding memory cutoff for 98,304 is ~1536MB; this should also be indicated in the comments.
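
For reference, here is a standalone sketch of the computation being discussed (my paraphrase of vntblinit() in sys/kern/vfs_subr.c; the structure sizes and the maxproc value are illustrative placeholders, and only the shape of the formula matters here):

#include <stdint.h>
#include <stdio.h>

#define	PAGE_SIZE	4096UL
#define	pgtok(x)	((x) * (PAGE_SIZE / 1024UL))	/* pages -> KiB */

/* Placeholder for sizeof(struct vm_object) + sizeof(struct vnode). */
#define	OBJ_PLUS_VNODE_SIZE	512UL

static uint64_t
min64(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

static uint64_t
desiredvnodes_sketch(uint64_t v_page_count, uint64_t vm_kmem_size,
    uint64_t maxproc)
{
	uint64_t physvnodes, virtvnodes;

	/*
	 * Physical-memory scale: 4 vnodes per 64KiB of RAM up to the
	 * 98304 * 16 KiB (~1536MB) cutoff, then 1 vnode per 64KiB,
	 * plus a maxproc contribution.
	 */
	physvnodes = maxproc + pgtok(v_page_count) / 64 +
	    3 * min64(98304UL * 16, pgtok(v_page_count)) / 64;
	/*
	 * Kernel-heap scale: vnodes and vm objects together may use at
	 * most 1/10 of the kernel's heap.
	 */
	virtvnodes = vm_kmem_size / (10 * OBJ_PLUS_VNODE_SIZE);
	/* The smaller of the two wins. */
	return (min64(physvnodes, virtvnodes));
}

int
main(void)
{
	uint64_t mem = 16ULL << 30;	/* A 16GB machine, for example. */

	printf("%ju\n", (uintmax_t)desiredvnodes_sketch(mem / PAGE_SIZE,
	    mem, 6164));
	return (0);
}

(The real function additionally clamps the result to MAXVNODES_MAX.)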

In D50310#1147813, @kib wrote:

What is the point of making a comment that repeats the code in an even more verbose way?

The point of the new message is to clearly spell out in plain English what the formulae mean in more "physical" terms and to be extra clear about the computation (in particular, the 98,304 cutoff is not on the final number of vnodes, and the stated growth rate is in fact not the global one, both because the previous comments do not take the maxproc contribution into account). The computation here is not overly complex, but I still think the text has value. This is a prerequisite either for revising it (in a very minor way for now, for an alternative where the number of vnodes is raised instead) or simply for quickly evaluating where it stands asymptotically with respect to maxfiles. In particular, I think it is important to state which formula actually applies and when.

This would make the comment inconsistent with the actual code on the next minor change, whatever it may be.

There are no more numbers in the new description than in the previous one, except for the last paragraph (see below).

While I agree with your concern in general, for a system-wide setting like this, I certainly don't want anyone to revise these numbers without proper review and updates of the surrounding comments.

Having a verbose rewrite of the formula into the text does not help in reviewing the code.

The main point of the comment is that there are two concurrent limiting scales for the total number of vnodes the kernel can cache. One is the physical memory, relevant for machines without KVA pressure. The other is the amount of KVA we can dedicate to caches while leaving the other subsystems room to breathe. IMO the original text does this right, maybe with too many specifics. Your patch reverses that, making the main idea harder to grasp while verbosely repeating the code.

I disagree, both on the previous comment being clearer about the physical/KVA distinction (only the "kernel's heap size" is mentioned in both versions, and both state "limit") and on the new formulation being harder to grasp (the very physical/KVA distinction is isolated in the first paragraph!).

And I disagree with you: it is very clear what the kernel heap size refers to.
I do not see a point in repeating code in comments.

That said, I don't claim the text is perfect, and will see if I can remove some redundancy while keeping the logical paragraphs.

98,304 is indeed magical; I suspect it came from some situation, as the minimum sustainable count of vnodes for some load.

That's the kind of thing that should be reasonably explained in comments. I can guess why the formula is piecewise affine with a reduced growth rate towards infinity, but the precise cutoff value is harder to second-guess. The corresponding memory cutoff for 98,304 is ~1536MB; this should also be indicated in the comments.

Right, and your new long text similarly does not explain the reasons for the inflection points in the formula; it just repeats it with more words.

In D50310#1147892, @kib wrote:

Having a verbose rewrite of the formula into the text does not help in reviewing the code.

It helps because it makes it clearer what the intentions of whoever changes the code are, and we can then check whether the actual code matches them.

And I disagree with you: it is very clear what the kernel heap size refers to.

That's not my main point here, which is that the same terminology is used before and after the changes. But we should prefer KVA; heap size commonly means an entirely different thing in CS.

I do not see a point in repeating code in comments.

This is not a mere repeat, but it might be too verbose as is, yes.

Right, and your new long text similarly does not explain the reasons for the inflection points in the formula; it just repeats it with more words.

Are you really saying that I should try to provide a thorough explanation for code that *you* committed? And bde@ is unfortunately not there anymore to provide any explanation or hint.

Again, though you don't seem to be listening: saying there is an inflection point and giving its value allows quickly comparing that formula to other ones (maxfiles in this case).

In D50310#1147892, @kib wrote:

Having a verbose rewrite of the formula into the text does not help in reviewing the code.

It helps because it makes it clearer what the intentions of whoever changes the code are, and we can then check whether the actual code matches them.

Exactly the reverse. Your comment repeats the code and does not say anything about a possible intent.

And I disagree with you: it is very clear what the kernel heap size refers to.

That's not my main point here, which is that the same terminology is used before and after the changes. But we should prefer KVA; heap size commonly means an entirely different thing in CS.

I do not see a point in repeating code in comments.

This is not a mere repeat, but it might be too verbose as is, yes.

Right, and your new long text similarly does not explain the reasons for the inflection points in the formula; it just repeats it with more words.

Are you really saying that I should try to provide a thorough explanation for code that *you* committed? And bde@ is unfortunately not there anymore to provide any explanation or hint.

Then why touch the comment?

Again, though you don't seem to be listening: saying there is an inflection point and giving its value allows quickly comparing that formula to other ones (maxfiles in this case).

I am listening, and I hear the looped recording.

Let me restate two points about this and similar reviews:

  1. There is no sense in writing comments that repeat the code in prose.
  2. Limits are not the tools to fix either apps or deadlocks. They are an instrument for admins to enforce usage policies.
This comment was removed by olce.
In D50310#1147949, @kib wrote:

Exactly the reverse. Your comment repeats the code and does not say anything about a possible intent.

You're confusing the working intent and the reason for this mechanism. I'm talking about the intent in "physical" terms (how many units per unit of memory). The varying growth rate is an intent. It can be read from the formula, but I doubt it is obvious to a lot of people. As for the reasons, I never said I'd second-guess them or write them down.

Then why touch the comment?

I've already explained that a few times; please go read my earlier comments.

I am listening, and I hear the looped recording.

Well, you hear the loop that you force me to issue in response to yours.

Let me restate two points about this and similar reviews:

  1. There is no sense in writing comments that repeat the code in prose.

I agree, but that's not what I'm trying to do here. Also, I said I'd remove redundancy. I've also prepared a version that removes the parts that are more or less obvious from the formula, although I disapprove of doing that. It will be uploaded shortly.

  2. Limits are not the tools to fix either apps or deadlocks. They are an instrument for admins to enforce usage policies.

I also agree, but you're missing the point (again), which is that this kind of mitigation is the only tool we have to avoid a deadlock in the very short term, and it doesn't have drawbacks. The alternatives that really fix the deadlocks in depth are much longer-term. And the thing is, people are hitting this issue now (and have been for a while).

So, no, you're not listening enough.

olce retitled this revision from vfs: vntblinit(): Expand comments on the computation of 'kern.maxvnodes' to vfs: vntblinit(): Rework comments on the computation of 'kern.maxvnodes'.
olce edited the summary of this revision.
  • Remove the text about what the formulae do (although I disapprove of it).
  • Add the cutoff at which kern.maxfiles becomes greater than kern.maxvnodes.

Originally from 1d7fe4b5159cb66a28e94771c207f822ddaa8a9e which also has a comment about revisiting MAXVNODES_MAX once physical memory exceeds 512GB.

8Mi vnodes already seems a lot. Asymptotically, maxvnodes grows by ~16 files per MB (or 1 file per 64kB). Given the current size of files, that seems conservative, although I did not do serious statistics. The current formula reaches 8Mi with ~501GB (SI units). We could remove the in-tree value for MAXVNODES_MAX (but perhaps keep the possibility of setting it in a kernel config file), and/or make it a tunable.
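
As a rough back-of-the-envelope check of those figures (my own arithmetic, ignoring the maxproc term and the steeper slope below the ~1536MB cutoff): 8 Mi vnodes / (16 vnodes per MiB) = 512 GiB ≈ 550 GB (SI); the maxproc contribution and the extra vnodes granted below the cutoff then pull the point where the cap is reached down toward the ~501GB figure above.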

sys/kern/vfs_subr.c
761

In my opinion, changing "the kernel's heap size" to "KVA size" is a step backwards. It is not the KVA size.

sys/kern/vfs_subr.c
761

My recollection is that the vm_object, vnode and NFS node structures are small enough that UMA is going to allocate them from single-page slabs, and so on 64-bit architectures with a direct map they will not be allocated from kmem, i.e. the kernel's heap. Given that, limiting their number based on the kernel's heap size is really a historical artifact.

sys/kern/vfs_subr.c
761

It is definitely not an artifact on KVA-limited arches, and we still have arm and ppc32 even if we agree to ignore i386. Also, ppc might not have a DMAP on 64-bit in some configurations, unless I mis-remember.

A side note: judging by the history of the Linux DMAP on i386, I expect we will come to consider the 64-bit DMAP an architectural mistake in the near future. It is already pressing to have full 5-level paging for the kernel on large amd64 machines, since the existing DMAP limits the amount of usable RAM.

So if we are talking about changing (bumping) this limit, we should probably just drop the kmem part of the formula on DMAP machines, or machines where uma has small_alloc.

This revision is now accepted and ready to land. Tue, May 13, 5:55 PM
sys/kern/vfs_subr.c
761

The comment about UMA is generally true, but UMA will dynamically use multi-page slabs to reduce memory waste due to fragmentation. IIRC this happens below 90% efficiency, so sufficiently small objects will indeed be accessed via the direct map.

olce marked 3 inline comments as done. Tue, May 13, 8:55 PM
In D50310#1148443, @kib wrote:

So if we are talking about changing (bumping) this limit, we should probably just drop the kmem part of the formula on DMAP machines, or machines where uma has small_alloc.

We could (although I'm not sure why for small_alloc with no DMAP => some powerpc). In any case, currently virtvnodes does not matter in practice for any 64-bit architecture for machines with more than ~1.5GB (that's the added comment).

sys/kern/vfs_subr.c
761

It is not the KVA size.

I took the words of kib@ about KVA too quickly. I already knew that vm_kmem_size is just equal to the page count after startup, except for arm and powerpc where a divisor of 3 is applied. I'll change that back.

I'm admittedly lacking knowledge in that area, and it is quite late here, but I don't see how this computation relates to KVA, at least directly? I thought the latter was a fixed amount per architecture, independent of the actual physical memory or number of pages.

My recollection is that the vm_object, vnode and NFS node structures are small enough that UMA is going to allocate them from single-page slabs, and so on 64-bit architectures with a direct map they will not be allocated from kmem, i.e. the kernel's heap.

I also thought that, after vm_page_startup(), all not-early-allocated physical pages are assigned to the VM, so basically vm_cnt.v_page_count (= vm_kmem_size) equals all available memory (minus the early allocations), with all subsystems doing memory allocations (like UMA, malloc(9)) subsequently obtaining backing store from these VM pages. Once allocated, admittedly, pages have to be mapped if there's no direct map. Is this VA allocation what you call the kmem? If so, then I still find the term kernel's heap for it very misleading, as this is VA allocation and not memory allocation, and, as said above, I don't see the connection with how vm_kmem_size (supposed to be the "kernel's heap size") is computed.

I'm most probably missing something or mixing things up. Could you please educate me?

sys/kern/vfs_subr.c
761

vm_cnt.v_page_count cannot be equal to vm_kmem_size: one is in pages, the other is in bytes.

What you are calling 'VM pages' is the kernel memory, which necessarily consists of two resources: one is the virtual address in KVA, the other is the physical memory page backing the VA.

With direct map and small objects that fit into a page (i.e. small and aligned), KVA is preallocated by DMAP. So only the physical backing memory needs to be provided when doing the allocation. In other words, we are not limited by the KVA, only by the available physical memory. This is the case for all VFS and VM objects, like files and vnodes/inodes/vm_objects, on amd64 and arm64.

On architectures without a direct map, KVA is usually in short supply, which is the reason there is no direct map. Then we must not only avoid exhausting physical pages by allocating all of them to UMA and kernel malloc, but also limit the KVA usage by uma, to leave space for other things that need to live in KVA.

This is mostly the purpose of uma_set_limit(vm_kmem_size). kmem is now a somewhat vapor-like thing, mostly a synonym for all allocations done through UMA, including malloc.

i386 is the obvious, but probably the most convoluted, example there. Assuming PAE + a 4/4G UVA/KVA split, we have a KVA of 4G, but physical memory configurations can vary from 32M to 64G. On a 64G machine we could spend a lot of phys pages on malloc, but since kernel code assumes that all allocated memory is directly accessible all the time, we have to limit uma's use of KVA. 32M machines are already taken care of by the phys memory limit portion of the vm_kmem_size calculation.

Without the 4/4 split, KVA is usually configured to 1G, leaving 3G of UVA. Then vm_kmem_size must be even less. The same would apply to a hypothetical arm board with 4G of RAM (I believe the armv6 pmap cannot do LPAE, so we cannot see more than 4G there), where the kernel normally gets less than 1G of KVA.

  • Change "KVA size" to "kernel's heap size" (and a few more details).
This revision now requires review to proceed. Wed, May 14, 5:45 PM
sys/kern/vfs_subr.c
761

vm_cnt.v_page_count cannot be equal to vm_kmem_size: one is in pages, the other is in bytes.

Numerically, indeed, but they measure the exact same quantity, which is what I had in mind.

What you are calling 'VM pages' is the kernel memory, which necessarily consists of two resources: one is the virtual address in KVA, the other is the physical memory page backing the VA.

I'm just referring to VM-managed physical pages, which are supposed to back all allocations, including userland ones, aren't they?

With direct map and small objects that fit into a page (i.e. small and aligned), KVA is preallocated by DMAP.

Doesn't the DMAP cover all physical memory (provided it is lower than 4TB with the current constants)? Any kind of object could be accessed via the DMAP, couldn't it?

This is mostly the purpose of uma_set_limit(vm_kmem_size). kmem is now a somewhat vapor-like thing, mostly a synonym for all allocations done through UMA, including malloc.

But then, vm_kmem_size should be related to the amount of VA, not the physical memory size. It seems VM_KMEM_SIZE_MAX is the piece I was missing, as it bounds mem_size when too big.

(snip)

Thanks for all the info (I was aware of the UVA/KVA split for i386, but not of some details).

olce marked 2 inline comments as done. Wed, May 14, 5:46 PM
sys/kern/vfs_subr.c
761

vm_cnt.v_page_count cannot be equal to vm_kmem_size: one is in pages, the other is in bytes.

Numerically, indeed, but they measure the exact same quantity, which is what I had in mind.

At the first stage of the calculation of vm_kmem_size, yes. But then the kmem size is clipped by two formulas: first by the scaling factor (e.g. arm allows kmem to consume at most 1/3 of the total phys memory), and then by the hard-coded platform VM_KMEM_SIZE_MAX.

For instance, on amd64 the scaling factor is 1, allowing all phys memory to be used to back allocations, but then it is limited to 3/5 of the total KVA by VM_KMEM_SIZE_MAX. The latter might seem to be a nop as well, but it is really not a nop on LA57, at least until the KVA is converted to real 5-level page tables.
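
In code form, the two-step clipping described here looks roughly like the following standalone sketch (my paraphrase, not the literal kmeminit() code; boot-time tunables and per-platform overrides are omitted, and the constants are illustrative amd64-like values; KVA_SIZE in particular is just a placeholder):

#include <stdint.h>
#include <stdio.h>

#define	PAGE_SIZE		4096UL
#define	VM_KMEM_SIZE_SCALE	1UL		/* amd64: start from all of phys memory */
#define	KVA_SIZE		(8ULL << 40)	/* placeholder for the kernel map size */
#define	VM_KMEM_SIZE_MAX	(KVA_SIZE / 5 * 3)	/* cap at 3/5 of the KVA */

static uint64_t
vm_kmem_size_sketch(uint64_t v_page_count)
{
	uint64_t vm_kmem_size;

	/* First clip: scale down from physical memory (e.g. /3 on arm). */
	vm_kmem_size = v_page_count * PAGE_SIZE / VM_KMEM_SIZE_SCALE;
	/* Second clip: hard per-platform cap derived from the KVA size. */
	if (vm_kmem_size > VM_KMEM_SIZE_MAX)
		vm_kmem_size = VM_KMEM_SIZE_MAX;
	return (vm_kmem_size);
}

int
main(void)
{
	uint64_t mem = 32ULL << 30;	/* A 32GB machine, for example. */

	printf("%ju\n", (uintmax_t)vm_kmem_size_sketch(mem / PAGE_SIZE));
	return (0);
}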

What you are calling 'VM pages' is the kernel memory, which necessarily consists of two resources: one is the virtual address in KVA, the other is the physical memory page backing the VA.

I'm just referring to VM-managed physical pages, which are supposed to back all allocations, including userland ones, aren't they?

Then this is not accurate, to the point of misunderstanding, because page frames, AKA virtual addresses, are also a resource that might be scarce, as was shown.

With direct map and small objects that fit into a page (i.e. small and aligned), KVA is preallocated by DMAP.

Doesn't the DMAP cover all physical memory (provided it is lower than 4TB with the current constants)? Any kind of object could be accessed via the DMAP, couldn't it?

No. Only if the object is small and aligned to fit into a page. General uma-allocated objects are not backed by physically contiguous memory. Uma small_alloc indeed utilizes the DMAP mapping if it can.

This is mostly the purpose of uma_set_limit(vm_kmem_size). kmem is now a somewhat vapor-like thing, mostly a synonym for all allocations done through UMA, including malloc.

But then, vm_kmem_size should be related to the amount of VA, not the physical memory size. It seems VM_KMEM_SIZE_MAX is the piece I was missing, as it bounds mem_size when too big.

It must be limited by both. Consider arm and a small board with, say, 128M of physical memory. We might have around 786M of KVA total and can dedicate, say, 512M of KVA (virtual addresses) to uma. Then, if we did not also limit by the physical memory, the kernel could eat all physical pages for allocations, since they fit into the size of the allowed KVA, leaving userspace with no pages. The end result would be constant thrashing of swap under even minimal load.

(snip)

Thanks for all the info (I was aware of the UVA/KVA split for i386, but not of some details).

olce added inline comments.
sys/kern/vfs_subr.c
761

At the first stage of the calculation of vm_kmem_size, yes. But then the kmem size is clipped by two formulas: first by the scaling factor (e.g. arm allows kmem to consume at most 1/3 of the total phys memory), and then by the hard-coded platform VM_KMEM_SIZE_MAX.

Right. I was writing while thinking, and figured this out only near the end of my previous comment.

For instance, on amd64 the scaling factor is 1, allowing all phys memory to be used to back allocations, but then it is limited to 3/5 of the total KVA by VM_KMEM_SIZE_MAX. The latter might seem to be a nop as well, but it is really not a nop on LA57, at least until the KVA is converted to real 5-level page tables.

Yes, I have been looking that up, and understand the deal with LA57.

No. Only if the object is small and aligned to fit into a page. General uma-allocated objects are not backed by physically contiguous memory. Uma small_alloc indeed utilizes the DMAP mapping if it can.

Right. Anything could be accessed through the DMAP, but indeed for non-contiguous allocations that would be completely silly, as that would require a separate non-hardware mapping (so let's just have and use a real hardware mapping).

It must be limited by both. (snip)

Yes, I had figured that out as well, just after writing the paragraph you responded to, which I should have amended.

I now seem to understand all that has been said here. Thanks for your help.

This revision was not accepted when it landed; it landed in state Needs Review. Thu, May 15, 2:28 PM
This revision was automatically updated to reflect the committed changes.
olce marked an inline comment as done.