Page MenuHomeFreeBSD

Add some accounting to the per-domain full bucket caches.
ClosedPublic

Authored by markj on Aug 10 2018, 9:02 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 19, 9:36 PM
Unknown Object (File)
Dec 12 2024, 10:18 PM
Unknown Object (File)
Dec 11 2024, 12:25 PM
Unknown Object (File)
Dec 7 2024, 8:02 AM
Unknown Object (File)
Nov 28 2024, 10:15 PM
Unknown Object (File)
Nov 26 2024, 10:36 AM
Unknown Object (File)
Nov 25 2024, 5:58 AM
Unknown Object (File)
Nov 24 2024, 5:04 AM
Subscribers

Details

Summary

Keep track of the number of items in each cache and use the existing UMA
callout to maintain a working set size estimate for each cache. This is
based on section 3.7 of Bonwick's vmem paper. The stored estimate is a
decaying average over the last 100s of activity. The WSS
estimate is currently unused, but my aim is to use it to make
uma_reclaim() less aggressive in the common case.

The item count (uzd_nitems) has a couple of applications; it will be
useful in D7538, and could also be used to restrict the bucket cache
size of certain zones. In particular, it would be useful for the
vm_page.c page cache zones, which are currently configured with
UMA_ZONE_NOBUCKETCACHE.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 20544
Build 19969: arc lint + arc unit

Event Timeline

sys/vm/uma_core.c
550

Unless you think overflow is likely, it would be better to write this as:

= (4 * wss + zdom->uzd_wss) / 5;
sys/vm/uma_core.c
550

P.S. I'm also surprised by the extremely rapid decay. Was there a rationale?

sys/vm/uma_core.c
2579

Suppose two processors fail at this one after the other, specifically, where the second fails before the first has succeeded in allocating a fresh, filled bucket. Won't this inflate the difference between uzd_imax and uzd_imin? If a processor gets to the comment "See if we lost the race or were migrated. ...", and it lost the race, should it increase uzd_imin to reflect the unneeded/unused bucket?

sys/vm/uma_core.c
550

There isn't much of a rationale for the specific weights here. I was thinking about the long (20s) update period relative to the lowmem period of 10s. I do agree that the estimate decays perhaps too quickly, and would consider giving more weight to the tail. I haven't been able to come up with workloads where the difference is measurable, though.

2579

Did you mean to write "threads" instead of "processors"? If so, I agree.

markj marked an inline comment as done.
  • Simplify the wss update calculation.
  • Give more weight to the tail of the estimate.
sys/vm/uma_core.c
465

lock asserts please

474

Shouldn't this be done where we actually allocate the bucket?

550

Bonwick observed in his paper that flushing everything gave similar performance in his tests to flushing less. So it seems that he didn't spend a lot of effort tuning it.

I think a minute or so is probably reasonable for decay. We could import and use an actual EMA algorithm although the scheduler doesn't use anything more complicated than this either.

sys/vm/uma_core.c
2579

Yes.

markj marked 5 inline comments as done.
  • Add lock assertions.
  • Fix accounting in the case where we miss on the full bucket cache. Only bump imax after successfully refilling the per-CPU alloc cache.
sys/vm/uma_core.c
899–906

Should you reset the imin and imax counts here? After this drain operation, for a time, every "try fetch" is going to add to an imax value that is still based on the pre-drain state. And, when you eventually recycle a bucket, the imin value will get updated to something close to zero. So, there will be an exaggerated difference between imin and imax, which will inflate the moving average.

2615–2616

Testing domain != UMA_ANYDOMAIN here would avoid a memory dereference and be no less clear.

Aside from my question about whether imin and imax should be reset, I think that this is change is correct and complete.

markj added inline comments.
sys/vm/uma_core.c
899–906

Wouldn't resetting imin = imax = nitems effectively just throw away the history accumulated since the last update to wss? That could deflate the moving average.

Could we, instead of freeing nitems - wss items when trimming, free MIN(imin, nitems - wss) items, and subtract the number of reclaimed items from both imin and imax so that all three of imin, imax and nitems are shifted by the same amount? Suppose imax - imin < wss. Then nitems - imin < wss, so nitems - wss < imin and we'll get the same reclamation target as before. Now suppose imax - imin >= wss, i.e., the WSS estimate for the cache will grow upon its next update. If we free only imin items in this case, we're ignoring the history of the WSS estimate before the most recent update, but I don't think that that's unreasonable: with the current patch we're ignoring the WSS estimate for the current interval, which is undesirable if the average of the estimate is growing. Moreover, if imax - imin is close to wss, then nitems - wss will be close to imin, so the WSS estimate would have to be growing by a large amount for the proposed change to have a significant effect.

In D16666#360462, @alc wrote:

Aside from my question about whether imin and imax should be reset, I think that this is change is correct and complete.

FWIW, I'm planning on waiting until head thaws before committing this.

  • Apply some review feedback.
sys/vm/uma_core.c
550

.There isn't much of a rationale for the specific weights here. I was thinking about the long (20s) update period relative to the lowmem period of 10s. I do agree that the estimate decays perhaps too quickly, and would consider giving more weight to the tail. I haven't been able to come up with workloads where the difference is measurable, though.

I would suggest saying these things explicitly in the comment at the head of the function.

899–906

Wouldn't resetting imin = imax = nitems effectively just throw away the history accumulated since the last update to wss? That could deflate the moving average.

Briefly, yes. However, given the long time period between wss updates, I would expect them to reconverge before the next wss update.

Essentially, my earlier comment pointed out a calculation error that can arise in one direction, resulting in a too large wss estimate. And, your followup points out an error that can arise in the other direction, leading to a too small wss estimate, if we reset the counts. I think that the difference between these two calculation errors is that the first will persist until the end of the time period and the second will likely disappear before the end of the time period.

Could we, instead of freeing nitems - wss items when trimming, free MIN(imin, nitems - wss) items, and subtract the number of reclaimed items from both imin and imax so that all three of imin, imax and nitems are shifted by the same amount?

Yes, I like this much better. Do it. In particular, when the wss is growing, I think that ignoring the historical average is sound practice. Once you've done this, I would probably change the moving average calculation to be a more conventional (wss + 4 * zdom->uzd_wss) / 5.

My current patchset for UMA has something similar. I also have zone_put_bucket() although named differently, to keep accounting of all items for a zone in buckets. So this change goes in line with my changes. At Netflix we also run the vm_page_cache zone with certain bucket limit, instead of no buckets at all.

My current patchset for UMA has something similar. I also have zone_put_bucket() although named differently, to keep accounting of all items for a zone in buckets. So this change goes in line with my changes. At Netflix we also run the vm_page_cache zone with certain bucket limit, instead of no buckets at all.

Ok. I plan to get back to this soon, I just deferred it during the code freeze. Do you have any objection to committing my version?

markj added inline comments.
sys/vm/uma_core.c
899–906

(This discussion really applies to D16667 and will be handled there.)

This revision was not accepted when it landed; it landed in state Needs Review.Nov 13 2018, 7:45 PM
This revision was automatically updated to reflect the committed changes.