
kern.maxfiles: Clamp it to 'kern.maxvnodes'
Abandoned · Public

Authored by olce on May 2 2025, 8:32 PM.

Details

Reviewers
kib
markj
Summary

If 'kern.maxfilesperproc' is greater than 'kern.maxvnodes' (and
'kern.maxfiles' is also), one process can eat up all available vnodes in
the system, bringing it to a halt. In fact, multiple processes can have
the same effect as long as 'kern.maxfiles' is greater than
'kern.maxvnodes'.

In practice, I observed such a thing happening with KDE's Baloo on
a machine with 64GB of RAM, where the automatic tuning sets
'kern.maxfiles' higher than 'kern.maxvnodes'. This has apparently also
occurred to other people recently (see, e.g.,
https://lists.freebsd.org/archives/freebsd-hackers/2025-March/004372.html).

A thorough analysis of the current automatic tuning shows that the
default for 'kern.maxfiles' exceeds that for 'kern.maxvnodes' when the
physical memory exceeds ~5136MB.

In advance of revising the automatic tuning, but also to support manually
setting these two parameters while avoiding this not-immediately-obvious
foot-shooting, make sure that 'kern.maxfiles' can never exceed
'kern.maxvnodes'. In fact, make sure 'kern.maxvnodes' is always larger by
some leeway (default: 1000, sysctl knob 'debug.vnodes_to_files_leeway').
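
For illustration, here is a minimal sketch of the intended invariant. The helper name and the exact hook points (loader tunables, sysctl handlers) are assumptions made for this sketch, not the actual diff; 'desiredvnodes' and 'maxfiles' are the kernel variables behind 'kern.maxvnodes' and 'kern.maxfiles'.

    /* Hypothetical sketch: keep kern.maxfiles at least 'leeway' below kern.maxvnodes. */
    static u_long vnodes_to_files_leeway = 1000;    /* debug.vnodes_to_files_leeway */

    static void
    clamp_maxfiles_to_maxvnodes(void)
    {
            u_long limit;

            limit = (desiredvnodes > vnodes_to_files_leeway) ?
                desiredvnodes - vnodes_to_files_leeway : 1;
            if ((u_long)maxfiles > limit)
                    maxfiles = (int)limit;
    }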

Diff Detail

Repository
rG FreeBSD src repository
Lint: Skipped
Unit Tests: Skipped
Build Status: Buildable 63845
Build 60729: arc lint + arc unit

Event Timeline

olce requested review of this revision. May 2 2025, 8:32 PM

I object. If you have some problem with some app, fix the app, up to the level of making the app self-limit by changing the resource.
Not to mention that files are not only vnodes, and also a single vnode can be referenced by many files. But global limits are not the facility to tune the system for the safety of a single app.

In D50125#1143471, @kib wrote:

I object. If you have some problem with some app, fix the app, up to the level of making the app self-limit by changing the resource.
Not to mention that files are not only vnodes, and also a single vnode can be referenced by many files. But global limits are not the facility to tune the system for the safety of a single app.

I tend to agree. I don't think this change really fixes the problem either: even with maxfiles larger than maxvnodes you're not safe, since multiple processes may open the same vnode, which counts multiple times against maxfiles but only once against maxvnodes. If your leeway is 1000, and you have 333 processes that hold /dev/null open over fds 0, 1, 2, then any process can still exhaust the limit.

When the limit is exhausted, how does the system's memory usage look? Should we instead consider relaxing the maxvnode limit somehow? If one process is holding a gigantic number of vnodes open, should we consider making it a candidate for an out-of-vnodes kill?

In D50125#1143471, @kib wrote:

I object. If you have some problem with some app, fix the app, up to the level of making the app self-limit by changing the resource.

If an app misbehaves, that's (usually) the app's problem.
If an app deadlocks a system while misbehaving, that's (usually) the system's problem.

What makes you think that Baloo misbehaves? I have not checked its code, but I can imagine reasons for its behavior, and there's nothing wrong with them. Because we do not support en masse file monitoring, it probably uses a wrapper that uses kqueue, which may simply lead to creating one file descriptor per file on the disk. Maybe the wrapper (and/or Baloo itself) just creates file descriptors until it encounters an error, which never comes (problems 1 & 2, see below). Maybe Baloo/the wrapper is even "system friendly" in the sense of not even trying to create more files than what ulimit reports (which in the end is what kern.maxfilesperproc is set to), but this system-provided limit is misleading (problem 2, see below).
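
For reference, a minimal userland sketch (not Baloo's or the wrapper's actual code) of kqueue-based file monitoring; the point is that every watched file pins one open file descriptor, and therefore one active vnode, for the lifetime of the watch:

    #include <sys/types.h>
    #include <sys/event.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Register a vnode watch on 'path'; the returned fd must stay open. */
    static int
    watch_file(int kq, const char *path)
    {
            struct kevent kev;
            int fd;

            fd = open(path, O_RDONLY);      /* one fd, one active vnode */
            if (fd == -1)
                    return (-1);
            EV_SET(&kev, fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
                NOTE_WRITE | NOTE_DELETE | NOTE_RENAME, 0, NULL);
            if (kevent(kq, &kev, 1, NULL, 0, NULL) == -1) {
                    close(fd);
                    return (-1);
            }
            return (fd);
    }

Calling something like this for every file of a large tree walk is all it takes to run into the limits discussed here.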

More importantly, whether Baloo misbehaves does not really matter in the end. The system just puts itself in a deadlock, and unless you happen to have an already-running monitoring process that can kill other processes (such as top(1)), there is no way to recover. It's very easy to write a malicious process that halts a machine (just a simple tree walk opening all it can, provided you have enough files, or add kqueue() to the equation). And there's no provision to stop a runaway/buggy process. I don't see how this could be acceptable.

The system gets in this dead-end because of the following fundamental problems:

  1. getnewvnode() never returns an error, blocking until some vnode is available (check against kern.maxvnodes), which may require waiting indefinitely for some process to stop using some of them.
  2. open() and other primitives only limit requests based on the number of file descriptors actually used (check against kern.maxfiles and kern.maxfilesperproc), which currently has no relation with whether there are actually vnodes to back the new files.

These are not things we are going to be able to fix overnight, and certainly not for 14.3. So my goal here is not to tackle them directly, but rather to find a simple enough fix that avoids the deadlock, *in time for 14.3* (hopefully, we can get back to the fundamental problems later, with more time to tackle them).

And this is what is proposed here: We enforce that kern.maxfiles is always below kern.maxvnodes, so that the deadlock reported above can never happen. That's simple to understand, very logical, and does not really have drawbacks, as explained below.

Not to mention that files are not only vnodes, and also a single vnode can be referenced by many files.

True in theory, but does not seem to matter that much here because:

  1. File descriptors usually do not point to the same vnodes. The greatest sources of duplication are probably stdin/stdout/stderr and pwd, leading to a very small number of duplicated open file descriptors compared to the system limits on current machines, and these are limited by kern.maxproc anyway. IIRC, I have not observed more than 4x duplication on mostly quiescent machines with databases, background processes, and servers, with ~500 processes and kern.openfiles around ~10000, with kern.maxfiles automatically tuned to ~2.1M and kern.maxvnodes to ~1.2M (64GB machines).
  2. For all practical purposes, current limits are anyway probably beyond most actual needs. kern.maxfiles crosses kern.maxvnodes for machines with ~5GB, after which kern.maxfiles grows linearly and faster than kern.maxvnodes (approx. twice as fast; that's why this problem is not seen on machines without much RAM). At this crosspoint, they have a value of ~164k. For ~16GB, kern.maxfiles is at ~524k and kern.maxvnodes at ~351k. Do you have any workload with that many file descriptors opened simultaneously running on that memory size? And we could easily double the number of vnodes, not overflowing the existing vnode clamp in the automatic tuning, which after this change would clamp kern.maxfiles at ~700k (so more than ~524k). Not talking about bigger machines where these numbers are likely too high to matter. And if they do for some workloads, they should be hand-tuned (and probably already are, but might need bumping kern.maxvnodes after this change).

But global limits are not the facility to tune the system for the safety of a single app.

I'm talking about the system here, which fails in practice and is theoretically unsound. What is reported here can bring a whole machine to a halt relatively quickly and without any privileges. And, as shown, there may be easy measures to prevent this from happening at all, which is what I would like you to consider.

I tend to agree. I don't think this change really fixes the problem either: even with maxfiles larger than maxvnodes you're not safe, since multiple processes may open the same vnode, which counts multiple times against maxfiles but only once against maxvnodes.

I think you're seeing it backwards. Ensuring that kern.maxfiles is lower than kern.maxvnodes does fix the problem (modulo the leeway, see below), because then no process is able to continue to open file descriptors referencing not-yet opened files until kern.maxvnodes is exhausted, which is necessary to reach the deadlock point.

If your leeway is 1000, and you have 333 processes that hold /dev/null open over fds 0, 1, 2, then any process can still exhaust the limit.

No, because the limit we are talking about is the vnode limit, not the file descriptor limits (kern.maxfiles/kern.maxfilesperproc). In your example, it's the opposite: All vnodes are shared, so that limit is never reached and there can be no deadlocks; the file descriptor limits still apply though (without causing any problem).

I added the leeway because there may be vnodes created in the kernel without any corresponding file descriptor (disks? GEOM? audit? drm? others?), which can break the invariant that there are always fewer active vnodes than there are file descriptors in use, which the scheme above relies on. I didn't count the VFS cache as it can relinquish its vnodes on vnode shortage.

I'm not sure how much leeway is necessary, though, or if there is something there that can jeopardize the whole scheme. I'm open to input/ideas.

When the limit is exhausted, how does the system's memory usage look? Should we instead consider relaxing the maxvnode limit somehow? If one process is holding a gigantic number of vnodes open, should we consider making it a candidate for an out-of-vnodes kill?

In terms of memory, nothing special. We could have an OOV killer, yes. And probably some tunable (kern.maxvnodesperproc) that ensures you'd need at least, let's say, 2 or 3 runaway processes to exhaust all vnodes. These do not seem particularly complex, but likely too risky for 14.3.
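
To make the kern.maxvnodesperproc idea concrete, here is a purely hypothetical sketch (none of these names or fields exist in the tree) of a check at the point where a process takes a new, distinct vnode reference through a file descriptor:

    #include <sys/param.h>
    #include <sys/errno.h>
    #include <sys/proc.h>

    static u_long maxvnodesperproc;         /* hypothetical kern.maxvnodesperproc */

    static int
    proc_vnode_acquire(struct proc *p)
    {
            /* p_vnodecount is a hypothetical per-process count of distinct vnodes. */
            if (p->p_vnodecount >= maxvnodesperproc)
                    return (EMFILE);        /* or mark p as an out-of-vnodes kill candidate */
            p->p_vnodecount++;
            return (0);
    }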

What do you (@kib and you) think?

I tend to agree. I don't think this change really fixes the problem either: even with maxfiles larger than maxvnodes you're not safe, since multiple processes may open the same vnode, which counts multiple times against maxfiles but only once against maxvnodes.

I think you're seeing it backwards. Ensuring that kern.maxfiles is lower than kern.maxvnodes does fix the problem (modulo the leeway, see below), because then no process is able to continue to open file descriptors referencing not-yet opened files until kern.maxvnodes is exhausted, which is necessary to reach the deadlock point.

If your leeway is 1000, and you have 333 processes that hold /dev/null open over fds 0, 1, 2, then any process can still exhaust the limit.

No, because the limit we are talking about is the vnode limit, not the file descriptor limits (kern.maxfiles/kern.maxfilesperproc). In your example, it's the opposite: All vnodes are shared, so that limit is never reached and there can be no deadlocks; the file descriptor limits still apply though (without causing any problem).

I see now, sorry.

I added the leeway because there may be vnodes created in the kernel without any corresponding file descriptor (disks? GEOM? audit? drm? others?), which can break the invariant that there are always fewer active vnodes than there are file descriptors in use, which the scheme above relies on. I didn't count the VFS cache as it can relinquish its vnodes on vnode shortage.

So the NFS server in particular doesn't break this mechanism? Certainly a number of other subsystems will hold a vnode's usecount > 0 without having an open file handle: mdconfig -f vnode, ktrace, swapon, ...

I'm not sure how much leeway is necessary, though, or if there is something there that can jeopardize the whole scheme. I'm open to input/ideas.

When the limit is exhausted, how does the system's memory usage look? Should we instead consider relaxing the maxvnode limit somehow? If one process is holding a gigantic number of vnodes open, should we consider making it a candidate for an out-of-vnodes kill?

In terms of memory, nothing special. We could have an OOV killer, yes. And probably some tunable (kern.maxvnodesperproc) that ensures you'd need at least, let's say, 2 or 3 runaway processes to exhaust all vnodes. These do not seem particularly complex, but likely too risky for 14.3.

Thinking about it some more, from my POV the bug is that we are applying a resource limit (maxvnodes) when there is no actual shortage (of memory).

getnewvnode()/vn_alloc() seems to try very hard to avoid exceeding the desiredvnodes limit. It will even do direct reclaim and try to claim a vnode from the free list (this can have unpredictable latency that is quite large in the worst case, and we purposefully avoid this kind of mechanism for page reclamation for that reason) rather than exceed the limit. Meanwhile we have a dedicated thread whose sole responsibility is to reclaim unused vnodes.

This is not a small effort, but I would prefer to fix this by making maxvnodes a soft limit. That is, getnewvnode() should always allocate a vnode unless it's really unable to because of a memory shortage (or because we're getting "close" to a memory shortage by some metric). Let the vnlru thread handle the hard work of maintaining the vnode cache size, and keep the allocation path simple. Going further, if vnode allocation outpaces the vnlru thread in practice, then we can spin up additional threads to handle the load. It may also potentially be useful to use a PID controller (in subr_pidctrl.c) to regulate the vnlru thread. Even further, the vnode cache should react to pressure from the rest of the kernel and be able to dynamically shrink the cache in proportion to its total memory usage (the total number of vnodes plus auxiliary structures like v_pollinfo and namecache entries).

This scheme is much more similar to how the pagedaemon works today, and I think would go a lot further in avoiding situations where the system falls off a cliff because this somewhat-arbitrary cache limit is exceeded.
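
For concreteness, a rough sketch of the PID-controller idea, modeled on how vm_pageout.c drives the subr_pidctrl.c interface (pidctrl_init()/pidctrl_daemon()); the gains, the setpoint choice, and the vnlru_recycle() helper are made-up placeholders, not a worked-out design:

    #include <sys/param.h>
    #include <sys/pidctrl.h>

    static struct pidctrl vnlru_pid;

    static void
    vnlru_pid_init(void)
    {
            /* Setpoint: keep ~5% of desiredvnodes as free slack; the gains are placeholders. */
            pidctrl_init(&vnlru_pid, hz, desiredvnodes / 20,
                desiredvnodes / 10, 1, 10, 1);
    }

    static void
    vnlru_pid_scan(void)
    {
            int slots, target;

            /* Free slots left under the (soft) limit; 0 once we are over it. */
            slots = (numvnodes >= desiredvnodes) ? 0 :
                (int)(desiredvnodes - numvnodes);
            target = pidctrl_daemon(&vnlru_pid, slots);
            if (target > 0)
                    vnlru_recycle(target);  /* hypothetical reclaim helper */
    }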

It will even do direct reclaim and try to claim a vnode from the free list (this can have unpredictable latency that is quite large in the worst case, and we purposefully avoid this kind of mechanism for page reclamation for that reason) rather than exceed the limit.

This is not just a hypothetical problem, BTW. Quite a few years ago we had to hack vn_alloc() to stop doing this in order to provide reasonable latency in a NAS appliance.

Thinking about it some more, from my POV the bug is that we are applying a resource limit (maxvnodes) when there is no actual shortage (of memory).

Yes.

getnewvnode()/vn_alloc() seems to try very hard to avoid exceeding the desiredvnodes limit. It will even do direct reclaim and try to claim a vnode from the free list (this can have unpredictable latency that is quite large in the worst case, and we purposefully avoid this kind of mechanism for page reclamation for that reason) rather than exceed the limit. Meanwhile we have a dedicated thread whose sole responsibility is to reclaim unused vnodes.

This is not a small effort, but I would prefer to fix this by making maxvnodes a soft limit. That is, getnewvnode() should always allocate a vnode unless it's really unable to because of a memory shortage (or because we're getting "close" to a memory shortage by some metric). Let the vnlru thread handle the hard work of maintaining the vnode cache size, and keep the allocation path simple. Going further, if vnode allocation outpaces the vnlru thread in practice, then we can spin up additional threads to handle the load. It may also potentially be useful to use a PID controller (in subr_pidctrl.c) to regulate the vnlru thread. Even further, the vnode cache should react to pressure from the rest of the kernel and be able to dynamically shrink the cache in proportion to its total memory usage (the total number of vnodes plus auxiliary structures like v_pollinfo and namecache entries).

There are two big issues there:

  1. getnewvnode() is not supposed to fail. It is claimed that this assumption is a bug in the filesystems, but I now think it is not. It should be similar to the pv allocator in pmap, always giving the resource on request.
  2. vnodes must not exhaust physical memory; this is already more or less handled by limits + the page daemon. But they must also not exhaust KVA, and this is currently not handled at all except by the maxvnodes limit. I would say that i386 PAE is useful just to keep this aspect of the FreeBSD architecture straight.

This scheme is much more similar to how the pagedaemon works today, and I think would go a lot further in avoiding situations where the system falls off a cliff because this somewhat-arbitrary cache limit is exceeded.

In D50125#1147862, @kib wrote:

Thinking about it some more, from my POV the bug is that we are applying a resource limit (maxvnodes) when there is no actual shortage (of memory).

Yes.

getnewvnode()/vn_alloc() seems to try very hard to avoid exceeding the desiredvnodes limit. It will even do direct reclaim and try to claim a vnode from the free list (this can have unpredictable latency that is quite large in the worst case, and we purposefully avoid this kind of mechanism for page reclamation for that reason) rather than exceed the limit. Meanwhile we have a dedicated thread whose sole responsibility is to reclaim unused vnodes.

This is not a small effort, but I would prefer to fix this by making maxvnodes a soft limit. That is, getnewvnode() should always allocate a vnode unless it's really unable to because of a memory shortage (or because we're getting "close" to a memory shortage by some metric). Let the vnlru thread handle the hard work of maintaining the vnode cache size, and keep the allocation path simple. Going further, if vnode allocation outpaces the vnlru thread in practice, then we can spin up additional threads to handle the load. It may also potentially be useful to use a PID controller (in subr_pidctrl.c) to regulate the vnlru thread. Even further, the vnode cache should react to pressure from the rest of the kernel and be able to dynamically shrink the cache in proportion to its total memory usage (the total number of vnodes plus auxiliary structures like v_pollinfo and namecache entries).

There are two big issues there:

  1. getnewvnode() is not supposed to fail. It is claimed that this assumption is a bug in the filesystems, but I now think it is not. It should be similar to the pv allocator in pmap, always giving the resource on request.

Well, vn_alloc() is already allowed to sleep; it allocates vnodes with M_WAITOK. It can sleep until a vnode is available. The PV allocator operates under stronger constraints, since many pmap operations may not sleep.

  2. vnodes must not exhaust physical memory; this is already more or less handled by limits + the page daemon. But they must also not exhaust KVA, and this is currently not handled at all except by the maxvnodes limit. I would say that i386 PAE is useful just to keep this aspect of the FreeBSD architecture straight.

I think it's possible to have an optional hard limit on the number of vnodes. vnlru should work hard to maintain some "healthy" soft upper limit on used vnodes, but can also maintain a hard limit. To extend the analogy with the page daemon, we have some free page threshold below which the page daemon will run, but page allocations will succeed until the free page count hits some low watermark (depending on VM_ALLOC_{NORMAL,SYSTEM,INTERRUPT}).

With a hard limit, we can bound the KVA usage of vnodes. On systems where this matters, maxfiles perhaps does a better job of restricting userspace's ability to deadlock the system.
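
A hypothetical illustration of that split (none of these names exist in the tree): crossing the soft limit only wakes vnlru, and only the hard limit, sized from the available KVA, can make an allocation sleep or fail:

    #include <sys/types.h>

    static u_long vnodes_soft_limit;        /* "healthy" cache size, vnlru's target */
    static u_long vnodes_hard_limit;        /* KVA-derived ceiling */

    static bool
    vn_alloc_allowed(u_long nvnodes)
    {
            if (nvnodes >= vnodes_hard_limit)
                    return (false);         /* the only point where a caller sleeps or fails */
            if (nvnodes >= vnodes_soft_limit)
                    vnlru_kick();           /* hypothetical: wake the reclaim thread(s) */
            return (true);
    }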

So the NFS server in particular doesn't break this mechanism? Certainly a number of other subsystems will hold a vnode's usecount > 0 without having an open file handle: mdconfig -f vnode, ktrace, swapon, ...

For mdconfig and swapon, AFAIK, the number of such vnodes is fairly limited. ktrace I don't know. I'm actually the most worried about nullfs. NFS has a similar problem. So, yes, these are definitely dents in the mitigation presented here. I don't think it means we should not apply it though.

Thinking about it some more, from my POV the bug is that we are applying a resource limit (maxvnodes) when there is no actual shortage (of memory).

Although I agree with the last part, I don't think that's the actual source of the bug. Even if we actually allow creating more vnodes because there is enough memory, we can eventually exhaust all memory and trigger an OOM, which probably won't kill the right process as vnode memory is not accounted to processes. Eventually, though, we can hope that the process consuming too many vnodes will get killed as it tries to grab more of them, but we may have had to kill every other process in the meantime. Not sure an operator can recover from such a state, in which case this wouldn't be much different than a deadlock in practice.

getnewvnode()/vn_alloc() seems to try very hard to avoid exceeding the desiredvnodes limit. It will even do direct reclaim and try to claim a vnode from the free list (this can have unpredictable latency that is quite large in the worst case, and we purposefully avoid this kind of mechanism for page reclamation for that reason) rather than exceed the limit. Meanwhile we have a dedicated thread whose sole responsibility is to reclaim unused vnodes.

Yes, it is (a bit far) on my todo list to revamp this completely. In particular, with the current vnode shared list structure, direct reclaiming is a bad idea that leads both to contention and over-shooting the objectives in vnlru_proc().

This is not a small effort, but I would prefer to fix this by making maxvnodes a soft limit. That is, getnewvnode() should always allocate a vnode unless it's really unable to because of a memory shortage (or because we're getting "close" to a memory shortage by some metric). Let the vnlru thread handle the hard work of maintaining the vnode cache size, and keep the allocation path simple. Going further, if vnode allocation outpaces the vnlru thread in practice, then we can spin up additional threads to handle the load. It may also potentially be useful to use a PID controller (in subr_pidctrl.c) to regulate the vnlru thread. Even further, the vnode cache should react to pressure from the rest of the kernel and be able to dynamically shrink the cache in proportion to its total memory usage (the total number of vnodes plus auxiliary structures like v_pollinfo and namecache entries).

This scheme is much more similar to how the pagedaemon works today, and I think would go a lot further in avoiding situations where the system falls off a cliff because this somewhat-arbitrary cache limit is exceeded.

I completely agree; indeed, I've been thinking about using a PID controller. It's just that the point here is to have some mitigation in 14.3.

This comment was removed by olce.

So the NFS server in particular doesn't break this mechanism? Certainly a number of other subsystems will hold a vnode's usecount > 0 without having an open file handle: mdconfig -f vnode, ktrace, swapon, ...

For mdconfig and swapon, AFAIK, the number of such vnodes is fairly limited. ktrace I don't know. I'm actually the most worried about nullfs. NFS has a similar problem. So, yes, these are definitely dents in the mitigation presented here. I don't think it means we should not apply it though.

nullfs in cached mode only multiplies the vnode counter. When the upper vnode, which does not have a use count, is reclaimed, the lower vnode is reclaimed.
Not sure what the NFS concerns are. I do not see any issue with the client, and the server is limited in vnode use (as in v_usecount) by the total number of nfsd threads multiplied by some small number, like 3.

Thinking about it some more, from my POV the bug is that we are applying a resource limit (maxvnodes) when there is no actual shortage (of memory).

Although I agree with the last part, I don't think that's the actual source of the bug. Even if we actually allow creating more vnodes because there is enough memory, we can eventually exhaust all memory and trigger an OOM, which probably won't kill the right process as vnode memory is not accounted to processes. Eventually, though, we can hope that the process consuming too many vnodes will get killed as it tries to grab more of them, but we may have had to kill every other process in the meantime. Not sure an operator can recover from such a state, in which case this wouldn't be much different than a deadlock in practice.

Processes cannot 'consume' vnodes.

getnewvnode()/vn_alloc() seems to try very hard to avoid exceeding the desiredvnodes limit. It will even do direct reclaim and try to claim a vnode from the free list (this can have unpredictable latency that is quite large in the worst case, and we purposefully avoid this kind of mechanism for page reclamation for that reason) rather than exceed the limit. Meanwhile we have a dedicated thread whose sole responsibility is to reclaim unused vnodes.

Yes, it is (a bit far) on my todo list to revamp this completely. In particular, with the current vnode shared list structure, direct reclaiming is a bad idea that leads both to contention and over-shooting the objectives in vnlru_proc().

This is not a small effort, but I would prefer to fix this by making maxvnodes a soft limit. That is, getnewvnode() should always allocate a vnode unless it's really unable to because of a memory shortage (or because we're getting "close" to a memory shortage by some metric). Let the vnlru thread handle the hard work of maintaining the vnode cache size, and keep the allocation path simple.

Being able to use the user thread context to help the system thread cope with load is beneficial. For instance, a lot of deadlocks in the buffer cache were fixed by allowing the allocating thread to do some work for the buffer daemon. Compare the amount of tuning required for the page daemon with the almost non-existent issues for the buffer cache.

The only silly thing in vnlru is the 1-sec pause; it indeed might be smarter. Otherwise, allowing user threads to help in the vnode reclamation is good. We are then not limited to the throughput of a single thread to flush dirtiness, inactivate, and reclaim cached data.

Going further, if vnode allocation outpaces the vnlru thread in practice, then we can spin up additional threads to handle the load. It may also potentially be useful to use a PID controller (in subr_pidctrl.c) to regulate the vnlru thread. Even further, the vnode cache should react to pressure from the rest of the kernel and be able to dynamically shrink the cache in proportion to its total memory usage (the total number of vnodes plus auxiliary structures like v_pollinfo and namecache entries).

This scheme is much more similar to how the pagedaemon works today, and I think would go a lot further in avoiding situations where the system falls off a cliff because this somewhat-arbitrary cache limit is exceeded.

I completely agree; indeed, I've been thinking about using a PID controller. It's just that the point here is to have some mitigation in 14.3.

No. Please stop proposing to mis-use the internal vnode cache organization as a feature to fix bugs in some apps.

In D50125#1147862, @kib wrote:

There are two big issues there:

  1. getnewvnode() is not supposed to fail. It is claimed that this assumption is a bug in the filesystems, but I now think it is not. It should be similar to the pv allocator in pmap, always giving the resource on request.

I talked about this issue in an earlier comment. This is only an issue in conjunction with other properties, however. If we have an OOV killer, or rely on the OOM one with some accounting for vnodes (which in itself requires a bit of attention since vnodes are shared across processes; we could simply account every different vnode in a process to it, i.e., de-duplicating only the FDs inside a single process), that would not be a problem per se.

  2. vnodes must not exhaust physical memory; this is already more or less handled by limits + the page daemon. But they must also not exhaust KVA, and this is currently not handled at all except by the maxvnodes limit. I would say that i386 PAE is useful just to keep this aspect of the FreeBSD architecture straight.

Generally speaking, reasoning in terms of limits may be only part of the solution. Even if this is far in the future, we may consider systematically making reserves for all kernel resources, to be used by privileged processes (or kernel internal mechanisms, e.g., to ensure I/O to swap in all circumstances of memory shortage). Limits can be raised and often do not reflect that, underneath, we sometimes end up competing for the same resources. E.g.: A limit for vnodes may not be enough to prevent KVA exhaustion by vnodes once a lot of the KVA is already used for something else. That's probably a bad example, as I guess it's not really a problem for arches with big KVA, which soon will be all arches once 32-bit is out, but hopefully you can see the idea. For physical memory, we have OOM, which is quite imperfect in practice (we are not accounting a process' kernel objects' memory to it IIRC; the OOM killer more often than not doesn't kill the expected process).

In D50125#1147887, @kib wrote:

Being able to use the user thread context to help the system thread cope with load is beneficial. For instance, a lot of deadlocks in the buffer cache were fixed by allowing the allocating thread to do some work for the buffer daemon. Compare the amount of tuning required for the page daemon with the almost non-existent issues for the buffer cache.

The buffer cache is simpler because it can use the page cache as a 2nd-level cache, so the penalty of eviction is relatively low.

The only silly thing in vnlru is the 1-sec pause; it indeed might be smarter. Otherwise, allowing user threads to help in the vnode reclamation is good. We are then not limited to the throughput of a single thread to flush dirtiness, inactivate, and reclaim cached data.

It is a design mistake IMO. getnewvnode() is in the critical path for many workloads; having unbounded latency there is a major flaw. If the throughput of a single thread is limiting, then we should be able to have more than one vnlru thread, exactly like we do for the buffer cache and the page cache.

In D50125#1147887, @kib wrote:

nullfs in cached mode only multiplies the vnode counter.

This is what I'm referring to, and that matters because it basically defeats the mitigation if you use nullfs a lot.

When the upper vnode, which does not have a use count, is reclaimed, the lower vnode is reclaimed.

Yes, but every open file has a use count.

Not sure what the NFS concerns are. I do not see any issue with the client, and the server is limited in vnode use (as in v_usecount) by the total number of nfsd threads multiplied by some small number, like 3.

Maybe NFS is OK then.

Processes cannot 'consume' vnodes.

Of course they can and do. That vnodes are shared doesn't change that fact. A plausible metric here is the number of (distinct) vnodes referenced by a process, whose memory consumption we should account to that process (and the sum of memory accounted to all processes will sometimes be greater than the total memory in the system, and that's not a problem for the purpose of selecting a target during OOM; that doesn't prevent us from having another count, of memory that is really expected to be freed on a candidate process kill, and taking it into account).

The only silly thing in vnlru is the 1-sec pause; it indeed might be smarter. Otherwise, allowing user threads to help in the vnode reclamation is good. We are then not limited to the throughput of a single thread to flush dirtiness, inactivate, and reclaim cached data.

We may have several threads. But all of this is moot until we stop using the single common list, moving instead to, e.g., one list per vnode "domain" (as done with the page daemon; except here the vnode "domains" would be arbitrary), or something else, which will inevitably degrade LRU but hopefully in a somewhat controlled way.

And there are lots of other silly things in vnlru.

The main point is to have a clear policy, which then all vnlru processes/threads could apply without interference. Direct reclamation has the drawback of unbounded latency. How is this handled in the buffer cache?

No. Please stop proposing to mis-use the internal vnode cache organization as a feature to fix bugs in some apps.

maxvnodes is not about cache organization, and the bug is not an application bug, though it is triggered by an application. You're drawing wrong conclusions from wrong premises, and repeating them is not going to make them true. This is unproductive, so could you please stop that?

The chosen alternative is D50314.