
Add some accounting to the per-domain full bucket caches.
Closed, Public

Authored by markj on Aug 10 2018, 9:02 PM.

Details

Summary

Keep track of the number of items in each cache and use the existing UMA
callout to maintain a working set size estimate for each cache. This is
based on section 3.7 of Bonwick's vmem paper. The stored estimate is a
decaying average over the last 100s of activity. The WSS
estimate is currently unused, but my aim is to use it to make
uma_reclaim() less aggressive in the common case.
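
For illustration, a minimal sketch of the periodic update, using the field names that appear in this review; the function name and the exact weights are placeholders, not necessarily what gets committed:

/*
 * Hypothetical sketch only.  The working set over the last interval is
 * the swing in the item count; it is folded into a decaying average and
 * a new measurement interval is started at the current count.
 */
static void
zone_domain_update_wss(uma_zone_domain_t zdom)
{
        long wss;

        MPASS(zdom->uzd_imax >= zdom->uzd_imin);
        wss = zdom->uzd_imax - zdom->uzd_imin;

        /* Decaying average; with a 20s period this spans roughly 100s. */
        zdom->uzd_wss = (wss + 4 * zdom->uzd_wss) / 5;

        /* Begin a new interval at the current item count. */
        zdom->uzd_imax = zdom->uzd_imin = zdom->uzd_nitems;
}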

The item count (uzd_nitems) has a couple of applications; it will be
useful in D7538, and could also be used to restrict the bucket cache
size of certain zones. In particular, it would be useful for the
vm_page.c page cache zones, which are currently configured with
UMA_ZONE_NOBUCKETCACHE.
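
As a purely hypothetical illustration of that last point (not part of this patch), a freed full bucket could be checked against a per-zone cap before being cached. The uz_bkt_max field and the cache_put_bucket() helper below are invented for the sketch; only uzd_nitems comes from this change, and the other names are merely meant to suggest existing UMA code:

static void
zone_free_bucket_capped(uma_zone_t zone, uma_zone_domain_t zdom,
    uma_bucket_t bucket, void *udata)
{
        /* Over the cap: discard the bucket instead of caching it. */
        if (zdom->uzd_nitems + bucket->ub_cnt > zone->uz_bkt_max) {
                bucket_free(zone, bucket, udata);
                return;
        }
        /* Otherwise cache the full bucket and account for its items. */
        cache_put_bucket(zone, zdom, bucket);
}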

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

markj created this revision. Aug 10 2018, 9:02 PM
markj added a reviewer: jeff. Aug 10 2018, 9:04 PM
markj added reviewers: alc, kib. Aug 13 2018, 7:55 PM
alc added inline comments. Aug 15 2018, 2:24 AM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

Unless you think overflow is likely, it would be better to write this as:

= (4 * wss + zdom->uzd_wss) / 5;
alc added inline comments. Aug 15 2018, 2:27 AM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

P.S. I'm also surprised by the extremely rapid decay. Was there a rationale?

alc added inline comments. Aug 15 2018, 3:53 AM
sys/vm/uma_core.c
2560 ↗(On Diff #46530)

Suppose two processors fail here, one after the other, with the second failing before the first has succeeded in allocating a fresh, filled bucket. Won't this inflate the difference between uzd_imax and uzd_imin? If a processor gets to the comment "See if we lost the race or were migrated. ...", and it lost the race, should it increase uzd_imin to reflect the unneeded/unused bucket?

markj added inline comments. Aug 15 2018, 9:03 PM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

There isn't much of a rationale for the specific weights here. I was thinking about the long (20s) update period relative to the lowmem period of 10s. I do agree that the estimate decays perhaps too quickly, and would consider giving more weight to the tail. I haven't been able to come up with workloads where the difference is measurable, though.

2560 ↗(On Diff #46530)

Did you mean to write "threads" instead of "processors"? If so, I agree.

markj marked an inline comment as done. Aug 16 2018, 9:31 PM
markj updated this revision to Diff 46803.
  • Simplify the wss update calculation.
  • Give more weight to the tail of the estimate.
jeff added inline comments. Aug 19 2018, 7:18 PM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

Bonwick observed in his paper that, in his tests, flushing everything performed about as well as flushing less, so it seems he didn't spend a lot of effort tuning it.

I think a minute or so is probably a reasonable decay time. We could import and use an actual EMA algorithm, although the scheduler doesn't use anything more complicated than this either.

462 ↗(On Diff #46803)

lock asserts please

471 ↗(On Diff #46803)

Shouldn't this be done where we actually allocate the bucket?

alc added inline comments. Aug 19 2018, 7:44 PM
sys/vm/uma_core.c
2560 ↗(On Diff #46530)

Yes.

markj marked 5 inline comments as done. Aug 20 2018, 5:15 AM
markj updated this revision to Diff 46957.
  • Add lock assertions.
  • Fix accounting in the case where we miss on the full bucket cache. Only bump imax after successfully refilling the per-CPU alloc cache.
markj updated this revision to Diff 46992. Aug 20 2018, 6:40 PM

Fix INVARIANTS build.

alc added inline comments. Aug 25 2018, 5:33 PM
sys/vm/uma_core.c
891–898 ↗(On Diff #46992)

Should you reset the imin and imax counts here? After this drain operation, for a time, every "try fetch" is going to add to an imax value that is still based on the pre-drain state. And, when you eventually recycle a bucket, the imin value will get updated to something close to zero. So, there will be an exaggerated difference between imin and imax, which will inflate the moving average.

2601 ↗(On Diff #46992)

Testing domain != UMA_ANYDOMAIN here would avoid a memory dereference and be no less clear.

alc added a comment. Aug 26 2018, 4:35 PM

Aside from my question about whether imin and imax should be reset, I think that this change is correct and complete.

markj marked an inline comment as done. Aug 27 2018, 2:37 PM
markj added inline comments.
sys/vm/uma_core.c
891–898 ↗(On Diff #46992)

Wouldn't resetting imin = imax = nitems effectively just throw away the history accumulated since the last update to wss? That could deflate the moving average.

Could we, instead of freeing nitems - wss items when trimming, free MIN(imin, nitems - wss) items, and subtract the number of reclaimed items from both imin and imax so that all three of imin, imax and nitems are shifted by the same amount? Suppose imax - imin < wss. Then nitems - imin < wss, so nitems - wss < imin and we'll get the same reclamation target as before. Now suppose imax - imin >= wss, i.e., the WSS estimate for the cache will grow upon its next update. If we free only imin items in this case, we're ignoring the history of the WSS estimate before the most recent update, but I don't think that that's unreasonable: with the current patch we're ignoring the WSS estimate for the current interval, which is undesirable if the average of the estimate is growing. Moreover, if imax - imin is close to wss, then nitems - wss will be close to imin, so the WSS estimate would have to be growing by a large amount for the proposed change to have a significant effect.
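
A rough sketch of that proposal, just to make the bookkeeping concrete; zone_trim_domain() and zone_reclaim_items() are hypothetical names, and locking is omitted:

static void
zone_trim_domain(uma_zone_t zone, uma_zone_domain_t zdom)
{
        long excess, target;

        /* Items beyond the working set estimate are candidates to free. */
        excess = (long)zdom->uzd_nitems - (long)zdom->uzd_wss;
        if (excess <= 0)
                return;

        /* Never free more than imin, preserving the interval's history. */
        target = MIN((long)zdom->uzd_imin, excess);

        /* Freeing items drops uzd_nitems; shift imin and imax to match. */
        zone_reclaim_items(zone, zdom, target);
        zdom->uzd_imin -= target;
        zdom->uzd_imax -= target;
}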

markj added a comment. Aug 27 2018, 2:39 PM
In D16666#360462, @alc wrote:

Aside from my question about whether imin and imax should be reset, I think that this change is correct and complete.

FWIW, I'm planning on waiting until head thaws before committing this.

markj updated this revision to Diff 47344. Aug 27 2018, 2:56 PM
  • Apply some review feedback.
alc added inline comments. Aug 30 2018, 5:39 PM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

There isn't much of a rationale for the specific weights here. I was thinking about the long (20s) update period relative to the lowmem period of 10s. I do agree that the estimate decays perhaps too quickly, and would consider giving more weight to the tail. I haven't been able to come up with workloads where the difference is measurable, though.

I would suggest saying these things explicitly in the comment at the head of the function.

891–898 ↗(On Diff #46992)

Wouldn't resetting imin = imax = nitems effectively just throw away the history accumulated since the last update to wss? That could deflate the moving average.

Briefly, yes. However, given the long time period between wss updates, I would expect them to reconverge before the next wss update.

Essentially, my earlier comment pointed out a calculation error that can arise in one direction, resulting in a too large wss estimate. And, your followup points out an error that can arise in the other direction, leading to a too small wss estimate, if we reset the counts. I think that the difference between these two calculation errors is that the first will persist until the end of the time period and the second will likely disappear before the end of the time period.

Could we, instead of freeing nitems - wss items when trimming, free MIN(imin, nitems - wss) items, and subtract the number of reclaimed items from both imin and imax so that all three of imin, imax and nitems are shifted by the same amount?

Yes, I like this much better. Do it. In particular, when the wss is growing, I think that ignoring the historical average is sound practice. Once you've done this, I would probably change the moving average calculation to be a more conventional (wss + 4 * zdom->uzd_wss) / 5.
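
For reference, the decay behavior of the two weightings under discussion, assuming the 20s update period mentioned above (illustrative only, not the committed constants):

uzd_wss = (4 * wss + uzd_wss) / 5;
uzd_wss = (wss + 4 * uzd_wss) / 5;

With the first form the previous average contributes only 1/5 after a single update, so history is effectively gone within one period. With the second, history decays by a factor of 4/5 per update, halving roughly every three updates (about a minute at a 20s period), so the average reflects roughly the last 100s of activity, matching the summary's description.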

emaste added a subscriber: emaste. Oct 4 2018, 12:30 AM

My current patchset for UMA has something similar. I also have zone_put_bucket(), although named differently, to keep an accounting of all of a zone's items held in buckets. So this change is in line with my changes. At Netflix we also run the vm_page_cache zone with a certain bucket limit, instead of no buckets at all.

markj added a comment. Oct 30 2018, 8:31 PM

My current patchset for UMA has something similar. I also have zone_put_bucket(), although named differently, to keep an accounting of all of a zone's items held in buckets. So this change is in line with my changes. At Netflix we also run the vm_page_cache zone with a certain bucket limit, instead of no buckets at all.

Ok. I plan to get back to this soon; I just deferred it during the code freeze. Do you have any objection to committing my version?

markj marked 2 inline comments as done. Oct 31 2018, 8:51 PM
markj added inline comments.
sys/vm/uma_core.c
891–898 ↗(On Diff #46992)

(This discussion really applies to D16667 and will be handled there.)

markj updated this revision to Diff 49844. Oct 31 2018, 8:52 PM
  • Address feedback.
This revision was not accepted when it landed; it landed in state Needs Review. Nov 13 2018, 7:45 PM
This revision was automatically updated to reflect the committed changes.