
Add some accounting to the per-domain full bucket caches.
Closed, Public

Authored by markj on Aug 10 2018, 9:02 PM.

Details

Summary

Keep track of the number of items in each cache and use the existing UMA
callout to maintain a working set size estimate for each cache. This is
based on section 3.7 of Bonwick's vmem paper. The stored estimate is a
decaying average over the last 100s of activity. The WSS
estimate is currently unused, but my aim is to use it to make
uma_reclaim() less aggressive in the common case.
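
For illustration, a minimal sketch of the periodic update, using the field names that appear in this review; the function name and the exact weights are placeholders, not necessarily what gets committed:

/*
 * Hypothetical sketch only.  The working set over the last interval is
 * the swing in the item count; it is folded into a decaying average and
 * a new measurement interval is started at the current count.
 */
static void
zone_domain_update_wss(uma_zone_domain_t zdom)
{
        long wss;

        MPASS(zdom->uzd_imax >= zdom->uzd_imin);
        wss = zdom->uzd_imax - zdom->uzd_imin;

        /* Decaying average; with a 20s period this spans roughly 100s. */
        zdom->uzd_wss = (wss + 4 * zdom->uzd_wss) / 5;

        /* Begin a new interval at the current item count. */
        zdom->uzd_imax = zdom->uzd_imin = zdom->uzd_nitems;
}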

The item count (uzd_nitems) has a couple of applications; it will be
useful in D7538, and could also be used to restrict the bucket cache
size of certain zones. In particular, it would be useful for the
vm_page.c page cache zones, which are currently configured with
UMA_ZONE_NOBUCKETCACHE.
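
As a purely hypothetical illustration of that last point (not part of this patch), a freed full bucket could be checked against a per-zone cap before being cached. The uz_bkt_max field and the cache_put_bucket() helper below are invented for the sketch; only uzd_nitems comes from this change, and the other names are merely meant to suggest existing UMA code:

static void
zone_free_bucket_capped(uma_zone_t zone, uma_zone_domain_t zdom,
    uma_bucket_t bucket, void *udata)
{
        /* Over the cap: discard the bucket instead of caching it. */
        if (zdom->uzd_nitems + bucket->ub_cnt > zone->uz_bkt_max) {
                bucket_free(zone, bucket, udata);
                return;
        }
        /* Otherwise cache the full bucket and account for its items. */
        cache_put_bucket(zone, zdom, bucket);
}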

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

markj created this revision. Aug 10 2018, 9:02 PM
markj added a reviewer: jeff. Aug 10 2018, 9:04 PM
markj added reviewers: alc, kib. Aug 13 2018, 7:55 PM
alc added inline comments. Aug 15 2018, 2:24 AM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

Unless you think overflow is likely, it would be better to write this as:

= (4 * wss + zdom->uzd_wss) / 5;
alc added inline comments. Aug 15 2018, 2:27 AM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

P.S. I'm also surprised by the extremely rapid decay. Was there a rationale?

alc added inline comments. Aug 15 2018, 3:53 AM
sys/vm/uma_core.c
2560 ↗(On Diff #46530)

Suppose two processors fail here, one after the other, with the second failing before the first has succeeded in allocating a fresh, filled bucket. Won't this inflate the difference between uzd_imax and uzd_imin? If a processor gets to the comment "See if we lost the race or were migrated. ...", and it lost the race, should it increase uzd_imin to reflect the unneeded/unused bucket?

markj added inline comments. Aug 15 2018, 9:03 PM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

There isn't much of a rationale for the specific weights here. I was thinking about the long (20s) update period relative to the lowmem period of 10s. I do agree that the estimate decays perhaps too quickly, and would consider giving more weight to the tail. I haven't been able to come up with workloads where the difference is measurable, though.

2560 ↗(On Diff #46530)

Did you mean to write "threads" instead of "processors"? If so, I agree.

markj marked an inline comment as done. Aug 16 2018, 9:31 PM
markj updated this revision to Diff 46803.
  • Simplify the wss update calculation.
  • Give more weight to the tail of the estimate.
jeff added inline comments. Aug 19 2018, 7:18 PM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

Bonwick observed in his paper that, in his tests, flushing everything performed about as well as flushing less, so it seems he didn't spend a lot of effort tuning it.

I think a minute or so is probably a reasonable decay time. We could import and use an actual EMA algorithm, although the scheduler doesn't use anything more complicated than this either.

462 ↗(On Diff #46803)

lock asserts please

471 ↗(On Diff #46803)

Shouldn't this be done where we actually allocate the bucket?

alc added inline comments. Aug 19 2018, 7:44 PM
sys/vm/uma_core.c
2560 ↗(On Diff #46530)

Yes.

markj marked 5 inline comments as done. Aug 20 2018, 5:15 AM
markj updated this revision to Diff 46957.
  • Add lock assertions.
  • Fix accounting in the case where we miss on the full bucket cache. Only bump imax after successfully refilling the per-CPU alloc cache.
markj updated this revision to Diff 46992. Aug 20 2018, 6:40 PM

Fix INVARIANTS build.

alc added inline comments. Aug 25 2018, 5:33 PM
sys/vm/uma_core.c
891–898 ↗(On Diff #46992)

Should you reset the imin and imax counts here? After this drain operation, for a time, every "try fetch" is going to add to an imax value that is still based on the pre-drain state. And, when you eventually recycle a bucket, the imin value will get updated to something close to zero. So, there will be an exaggerated difference between imin and imax, which will inflate the moving average.

2601 ↗(On Diff #46992)

Testing domain != UMA_ANYDOMAIN here would avoid a memory dereference and be no less clear.

alc added a comment. Aug 26 2018, 4:35 PM

Aside from my question about whether imin and imax should be reset, I think that this change is correct and complete.

markj marked an inline comment as done. Aug 27 2018, 2:37 PM
markj added inline comments.
sys/vm/uma_core.c
891–898 ↗(On Diff #46992)

Wouldn't resetting imin = imax = nitems effectively just throw away the history accumulated since the last update to wss? That could deflate the moving average.

Could we, instead of freeing nitems - wss items when trimming, free MIN(imin, nitems - wss) items, and subtract the number of reclaimed items from both imin and imax so that all three of imin, imax and nitems are shifted by the same amount? Suppose imax - imin < wss. Then nitems - imin < wss, so nitems - wss < imin and we'll get the same reclamation target as before. Now suppose imax - imin >= wss, i.e., the WSS estimate for the cache will grow upon its next update. If we free only imin items in this case, we're ignoring the history of the WSS estimate before the most recent update, but I don't think that that's unreasonable: with the current patch we're ignoring the WSS estimate for the current interval, which is undesirable if the average of the estimate is growing. Moreover, if imax - imin is close to wss, then nitems - wss will be close to imin, so the WSS estimate would have to be growing by a large amount for the proposed change to have a significant effect.
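
A rough sketch of that proposal, just to make the bookkeeping concrete; zone_trim_domain() and zone_reclaim_items() are hypothetical names, and locking is omitted:

static void
zone_trim_domain(uma_zone_t zone, uma_zone_domain_t zdom)
{
        long excess, target;

        /* Items beyond the working set estimate are candidates to free. */
        excess = (long)zdom->uzd_nitems - (long)zdom->uzd_wss;
        if (excess <= 0)
                return;

        /* Never free more than imin, preserving the interval's history. */
        target = MIN((long)zdom->uzd_imin, excess);

        /* Freeing items drops uzd_nitems; shift imin and imax to match. */
        zone_reclaim_items(zone, zdom, target);
        zdom->uzd_imin -= target;
        zdom->uzd_imax -= target;
}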

markj added a comment. Aug 27 2018, 2:39 PM
In D16666#360462, @alc wrote:

Aside from my question about whether imin and imax should be reset, I think that this change is correct and complete.

FWIW, I'm planning on waiting until head thaws before committing this.

markj updated this revision to Diff 47344. Aug 27 2018, 2:56 PM
  • Apply some review feedback.
alc added inline comments. Aug 30 2018, 5:39 PM
sys/vm/uma_core.c
543 ↗(On Diff #46530)

There isn't much of a rationale for the specific weights here. I was thinking about the long (20s) update period relative to the lowmem period of 10s. I do agree that the estimate decays perhaps too quickly, and would consider giving more weight to the tail. I haven't been able to come up with workloads where the difference is measurable, though.

I would suggest saying these things explicitly in the comment at the head of the function.

891–898 ↗(On Diff #46992)

Wouldn't resetting imin = imax = nitems effectively just throw away the history accumulated since the last update to wss? That could deflate the moving average.

Briefly, yes. However, given the long time period between wss updates, I would expect them to reconverge before the next wss update.

Essentially, my earlier comment pointed out a calculation error that can arise in one direction, resulting in a too large wss estimate. And, your followup points out an error that can arise in the other direction, leading to a too small wss estimate, if we reset the counts. I think that the difference between these two calculation errors is that the first will persist until the end of the time period and the second will likely disappear before the end of the time period.

Could we, instead of freeing nitems - wss items when trimming, free MIN(imin, nitems - wss) items, and subtract the number of reclaimed items from both imin and imax so that all three of imin, imax and nitems are shifted by the same amount?

Yes, I like this much better. Do it. In particular, when the wss is growing, I think that ignoring the historical average is sound practice. Once you've done this, I would probably change the moving average calculation to be a more conventional (wss + 4 * zdom->uzd_wss) / 5.
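
For reference, the decay behavior of the two weightings under discussion, assuming the 20s update period mentioned above (illustrative only, not the committed constants):

uzd_wss = (4 * wss + uzd_wss) / 5;
uzd_wss = (wss + 4 * uzd_wss) / 5;

With the first form the previous average contributes only 1/5 after a single update, so history is effectively gone within one period. With the second, history decays by a factor of 4/5 per update, halving roughly every three updates (about a minute at a 20s period), so the average reflects roughly the last 100s of activity, matching the summary's description.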

emaste added a subscriber: emaste. Oct 4 2018, 12:30 AM

My current patchset for UMA has something similar. I also have zone_put_bucket(), although named differently, to keep an accounting of all of a zone's items held in buckets. So this change is in line with my changes. At Netflix we also run the vm_page_cache zone with a certain bucket limit, instead of no buckets at all.

markj added a comment. Oct 30 2018, 8:31 PM

My current patchset for UMA has something similar. I also have zone_put_bucket(), although named differently, to keep an accounting of all of a zone's items held in buckets. So this change is in line with my changes. At Netflix we also run the vm_page_cache zone with a certain bucket limit, instead of no buckets at all.

Ok. I plan to get back to this soon; I just deferred it during the code freeze. Do you have any objection to committing my version?

markj marked 2 inline comments as done. Oct 31 2018, 8:51 PM
markj added inline comments.
sys/vm/uma_core.c
891–898 ↗(On Diff #46992)

(This discussion really applies to D16667 and will be handled there.)

markj updated this revision to Diff 49844. Oct 31 2018, 8:52 PM
  • Address feedback.
This revision was not accepted when it landed; it landed in state Needs Review. Nov 13 2018, 7:45 PM
This revision was automatically updated to reflect the committed changes.