
vfs: vn_alloc(): Stop always calling vn_alloc_hard() and direct reclaiming
ClosedPublic

Authored by olce on Tue, May 13, 9:09 PM.
Tags
None
Referenced Files
F118013365: D50338.id155414.diff
Sun, May 25, 6:05 AM
Unknown Object (File)
Sat, May 24, 8:13 AM
Unknown Object (File)
Sat, May 24, 3:13 AM
Unknown Object (File)
Fri, May 23, 11:20 PM
Unknown Object (File)
Fri, May 23, 12:21 PM
Unknown Object (File)
Sun, May 18, 9:52 PM
Unknown Object (File)
Sat, May 17, 12:06 AM
Unknown Object (File)
Fri, May 16, 10:14 PM

Details

Summary

Commit ab05a1cf321aca0f intended to revert commit 8733bc277a383cf5
("vfs: don't provoke recycling non-free vnodes without a good reason"),
but due to intervening changes in commit 054f45e026d898bd ("vfs: further
speed up continuous free vnode recycle"), it also had to revert part of
it. In particular, while removing the whole 'if (vn_alloc_cyclecount !=
0)' block, it inadvertently removed the code block that resets
'vn_alloc_cyclecount' to 0 and skips direct vnode reclamation (done
further below through vnlru_free_locked_direct()), which had been
outside the 'if' before the intervening commit.

Removing this block instead of reinstating it in practice causes
'vn_alloc_cyclecount' to (almost) never be 0, making vn_alloc() always
call vn_alloc_hard(), which takes the 'vnode_list_mtx' mutex. In other
words, this disables the fast path. [The reverted commit, which
introduced the 'if (vn_alloc_cyclecount != 0)' test guarding this block,
actually never executed it, because it also had a bug whereby
'vn_alloc_cyclecount' would always stay at 0, hiding the block's
usefulness.]

Additionally, not skipping direct vnode reclamation even when there are
fewer vnodes than 'kern.maxvnodes' not only causes unnecessary
contention but also plain livelocks, as vnlru_free_locked_direct() does
not itself check whether there actually are "free" (not referenced)
vnodes to be deallocated, and blindly walks the whole vnode list until
it finds one (which it may never do, or may only find a few near the
end). As the fast path was disabled, all threads in the system would
soon be competing for the vnode list lock, outpacing the vnlru process,
which could never actually recycle vnodes in a more aggressive manner
(i.e., even if they have a non-zero hold count). And we could get into
this situation all the more easily as each vnode allocation was reducing
the count of "free" vnodes, even when entirely new vnodes could have
been allocated instead. This part was mitigated by the vnlru process
(before the tipping point described above), which explains why the
mechanism would not always livelock.

Not skipping direct vnode reclamation was arguably a bug introduced by
the intervening commit (054f45e026d898bd), but it was mitigated by
vn_alloc_hard() not being called in the fast path. The revert commit,
by disabling the fast path, made it significantly worse (to the point of
causing a few livelocks a week in some of my workloads).

Restore the reset of 'vn_alloc_cyclecount' to 0 and skip direct
reclamation when the current number of vnodes is below the
'kern.maxvnodes' limit, indicating we can start allocating a new 'struct
vnode' right away. While here, fix the comparison with the limit when
'bumped' is true.

Fixes: ab05a1cf321aca0f (revert of "vfs: don't provoke recycling non-free vnodes without a good reason")
Fixes: 054f45e026d898bd ("vfs: further speed up continuous free vnode recycle")

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 64150
Build 61034: arc lint + arc unit

Event Timeline

olce requested review of this revision. Tue, May 13, 9:09 PM

This is intended to fix the livelocks I'm experiencing on -STABLE, probably since ab05a1cf321aca0f, although I didn't bisect to that granularity. What is certain is that I had no such problems with stable/14 close to 14.2 and have been having them since January; they were quite infrequent at first but now occur about twice per week.

I have a long list of vnlru problems, and it's on my TODO list to revamp it (but probably not in the near future) if nobody beats me to it. In particular, code analysis suggests there are other potential sources of contention and livelocks.

That said, the goal here is to have a minimal fix to the actual problems I've been experiencing so that it can be a candidate for inclusion in 14.3. Testing has been minimal so far; I intend to run full stress tests tomorrow in order to confirm I cannot reproduce the livelocks.

This revision is now accepted and ready to land. Wed, May 14, 12:36 AM

I pointed out this problem to Kirk and Colin back when the change was made.
I had to do the same kind of fix locally (for Panzura) because of live-locks as well.
But I have one small difference, not sure if important, which I'll describe in an inline comment.

sys/kern/vfs_subr.c
1999

In my tree I also have

rnumvnodes = atomic_load_long(&numvnodes);

just before this line.
Not sure if that's actually needed given that the previous block uses rnumvnodes without refreshing it.
But 054f45e026d898bd had such line.

olce marked an inline comment as done.
olce edited the summary of this revision. (Show Details)
  • Reload 'numvnodes' into 'rnumvnodes' after taking the lock.
  • Fix the limit comparison on 'bumped'.
This revision now requires review to proceed. Wed, May 14, 1:14 PM
sys/kern/vfs_subr.c
1999

Yes, I saw it and hesitated on whether to restore it. Initially, I concluded it was not worth it, as 'numvnodes' can be updated without the lock held, and I felt that, in general, the logic must be consistent between vn_alloc() and here.

Reconsidering it now, I'm convinced not restoring it is a plain bug, for the following reasons:

  1. vn_alloc() passes 0 as rnumvnodes in the first vn_alloc_hard() call, clearly assuming it won't be used when bumped is false.
  2. Between the load of rnumvnodes and its use in the test, we are locking vnode_list_mtx which, if not immediately obtained, signals that the list is being changed, which may considerably modify the number of existing vnodes. (One could imagine a micro-optimization here: use a trylock and skip the reload when it succeeds immediately; but that would require vn_alloc() to always read the number, and I feel the vnode recycling code already has too many micro-optimization mechanisms distracting us from soundness.)
  3. There is no real assumption of consistency between vn_alloc() and vn_alloc_hard() beyond the case of bumped true and the special code treating it before taking the lock.

So going to restore it. Thanks for making me re-examine that point.

This revision is now accepted and ready to land. Wed, May 14, 1:24 PM
In D50338#1148702, @avg wrote:

I pointed out this problem to Kirk and Colin back when the change was made.
I had to do the same kind of fix locally (for Panzura) because of live-locks as well.

I should probably have contacted you when I saw the revert commit, as even if I hadn't suspected there was a problem with it and hadn't experienced deadlocks yet, I was interested in the vnode cache and had analyzed quite a bunch of it at that time already.

That said, it would have been nice if Kirk and Colin had put the revert under review (I suspect they didn't in part because they were running out of time) so that your "objection" could be logged. Were there any specific problems preventing you from submitting your local fix, or was it just lack of time?

Anyway, we are now at risk of shipping 14.3 with this annoying bug (although I don't know its frequency in general; all I can say is that, for my use case, it's becoming super annoying). I hope the fix here will be included in 14.3. I should have finished my tests in a couple of hours (now with the reload of numvnodes) and will report here.

In D50338#1148702, @avg wrote:

I pointed out this problem to Kirk and Colin back when the change was made.
I had to do the same kind of fix locally (for Panzura) because of live-locks as well.

I should probably have contacted you when I saw the revert commit, as even if I hadn't suspected there was a problem with it and hadn't experienced deadlocks yet, I was interested in the vnode cache and had analyzed quite a bunch of it at that time already.

That said, it would have been nice if Kirk and Colin had put the revert under review (I suspect they didn't in part because they were running out of time) so that your "objection" could be logged. Were there any specific problems preventing you from submitting your local fix, or was it just lack of time?

Anyway, we are now at risk of shipping 14.3 with this annoying bug (although I don't know its frequency in general; all I can say is that, for my use case, it's becoming super annoying). I hope the fix here will be included in 14.3. I should have finished my tests in a couple of hours (now with the reload of numvnodes) and will report here.

It was mostly lack of time plus lack of confidence of my understanding the code.

In D50338#1148866, @avg wrote:

It was mostly lack of time plus lack of confidence of my understanding the code.

I see, that's fine.

Hopefully we'll have better information flows next time (and later a completely revamped vnlru code :-p).

I confirm that I cannot reproduce the livelocks with the patch, whereas I can very quickly without it by setting a low kern.maxvnodes/vfs.vnode.param.limit. Scenarios are just buildworld/buildkernel and multiple find(1) runs over millions of files. Even better, when the vnode limit is reached, I no longer see the recent pathological behavior where vnlru is triggered and drastically reduces the number of cached vnodes, effectively thrashing a large part of the cache.

Sorry for being out of the loop here, I just got back from a trip to the arctic where unsurprisingly there was little to no Internet.

The rush without review was indeed to get it into the last release of 13. The reversion did mitigate the live-lock benchmark that we had so it did improve some workloads. That said I am glad that you have drilled down and come up with a more complete / better fix. This code has proven to be problematic ever since I first wrote getnewvnode() back in the Berkeley BSD days. And as you noted can still be improved.

Sorry for being out of the loop here, I just got back from a trip to the arctic where unsurprisingly there was little to no Internet.

No problem.

The rush without review was indeed to get it into the last release of 13. The reversion did mitigate the live-lock benchmark that we had so it did improve some workloads. That said I am glad that you have drilled down and come up with a more complete / better fix.

Yes, I think given the variety of workloads we can have, in addition to more testing, analyzing in full the mechanisms and trying to predict what can go wrong improves robustness/quality and diminishes the chances of mistakes.

If you see anything unclear or suspect in the change or in connection with it, please let me know.

If there is no report of strange or bad behavior and no justified objection by Tuesday, I'll MFC it then and ask for a merge to releng/14.3 (re@ has been duly informed already, and the current plan is, AIUI, to merge this in BETA4, again unless we have reasons/reports saying otherwise).

With that aim in mind, I've tried to present here a minimal patch fixing the most pressing issue, refraining from fixing or modifying any other mechanisms.

This code has proven to be problematic ever since I first wrote getnewvnode() back in the Berkeley BSD days. And as you noted can still be improved.

Part of the improvements I can see are tied to the fact that we have machines with large numbers of cores today, which I suspect just did not exist in the Berkeley days (but I don't know much about this period of history, especially hardware-wise).

That said, there are also logical improvements to make independently of parallelism. For example, I find the vn_alloc_cyclecount/vstir mechanism quite strange, even if I think I understand why it exists. It looks like we would generally be better served by a PID controller than by fixed, somewhat arbitrary thresholds and mechanisms such as vstir. Also, we do not maintain a reserve of immediately allocatable vnodes for when maxvnodes is reached, and we really should (both for latency and to improve scaling). "freevnodes" do not really fill that role since, even if they have been inactivated and their referenced buffers/pages written to disk, they still need some work to be recycled, and the current mechanisms may also need time to locate them.

This is just to give a taste of the things I'm envisioning. I'll be working on them in the background, as I'm going to focus more on other projects in the short term, but expect to draft something in the coming months (perhaps before the release of 15.0, but I'm really not sure at this point).