
Extend uma_reclaim() to permit different reclamation targets.
Closed · Public

Authored by markj on Aug 10 2018, 9:03 PM.

Details

Summary

Currently the page daemon will invoke uma_reclaim() once every 10s under
memory pressure. This causes the entire bucket cache of each zone to be
drained. UMA also exports zone_drain() to accomplish the same thing for
individual zones.

The current policy of draining each zone can result in a thundering
herd effect in busy zones, where multiple threads take bucket cache
misses following a drain. It is also simply too aggressive when
uma_reclaim() is called regularly in the steady state of a workload.

The change adds a "request" parameter to uma_reclaim() to let the caller
indicate the desired level of reclamation. I've tried to use the
following terminology consistently:

  • "reclaim": free some unspecified number of items from a cache,
  • "drain": reclaim all items from a cache,
  • "trim": reclaim items in excess of the working-set size estimate.

The diff makes the following changes:

  • Rename zone_drain() to uma_zreclaim(), to keep naming consistent.
  • Add a "req" parameter to uma_reclaim() and uma_zreclaim(). Use that parameter to completely determine the behaviour of uma_reclaim(). Callers can trim the bucket cache, drain the bucket cache, or drain the bucket cache and per-CPU caches.
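The three reclamation levels above can be modelled with a small userland sketch. This is illustrative only: the enum constants, struct fields, and helper below are hypothetical stand-ins for the kernel's actual definitions, chosen to show how a single "req" parameter selects how much of a zone's caches gets freed.

```c
#include <assert.h>

/* Hypothetical request levels mirroring trim/drain/drain-CPU. */
enum reclaim_req {
	RECLAIM_TRIM,      /* free items in excess of the WSS estimate */
	RECLAIM_DRAIN,     /* free the entire bucket cache */
	RECLAIM_DRAIN_CPU, /* additionally drain the per-CPU caches */
};

/* Simplified model of a zone's cached-item counts. */
struct zone_model {
	int nitems; /* items in the zone's bucket cache */
	int wss;    /* working-set size estimate */
	int percpu; /* items in per-CPU buckets */
};

/* Return how many items a given request would free from the zone. */
static int
reclaim_count(const struct zone_model *z, enum reclaim_req req)
{
	switch (req) {
	case RECLAIM_TRIM:
		return (z->nitems > z->wss ? z->nitems - z->wss : 0);
	case RECLAIM_DRAIN:
		return (z->nitems);
	case RECLAIM_DRAIN_CPU:
		return (z->nitems + z->percpu);
	}
	return (0);
}
```

The point of the design is that callers under mild, periodic pressure can request a trim and leave the working set intact, while severe shortages can still force a full drain.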
Test Plan

Peter stress-tested the patch.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

markj created this revision. Aug 10 2018, 9:03 PM
markj added a reviewer: jeff. Aug 10 2018, 9:04 PM
markj added reviewers: alc, kib. Aug 13 2018, 7:55 PM
jeff added inline comments. Aug 19 2018, 7:08 PM
sys/kern/kern_mbuf.c
683 ↗(On Diff #46531)

Shouldn't this be trim, not drain? We need to free some memory but not all. Possibly we need to guarantee that trim does something even if wss is reached but otherwise this is more expensive than it needs to be.

There is also a stray * in the comment above that you could remove while you're here.

927 ↗(On Diff #46531)

Here again, do we need to drain the whole thing or just free some memory?

sys/kern/vfs_subr.c
1207–1208 ↗(On Diff #46531)

I'm confused about this. Why are we reclaiming from all zones after we reclaim some vnodes? Presumably a bunch of zones related to vfs now have some free memory. This should be a TRIM at most and now that we have more regular grooming I think it could be omitted entirely.

sys/vm/uma_core.c
887 ↗(On Diff #46531)

I missed the review for the working-set size estimator. Where is that?

markj added inline comments. Aug 20 2018, 5:08 AM
sys/kern/kern_mbuf.c
683 ↗(On Diff #46531)

Isn't mbuf cluster zone exhaustion a fairly exceptional situation? In my experience it usually indicates an mbuf leak. In that case mbuf packet allocations will likely be failing too, and even a trim requires the cluster zone lock, so it is an expensive operation regardless. Will fix the comment.

sys/kern/vfs_subr.c
1207–1208 ↗(On Diff #46531)

Hmm, it was added somewhat recently, in r291244. I don't know that we can assume regular grooming will be happening concurrently with this?

I agree that trimming is probably sufficient, especially in the absence of memory pressure.

sys/vm/uma_core.c
887 ↗(On Diff #46531)

D16666, just for the record, but you already commented on it.

jeff added inline comments. Aug 20 2018, 7:15 AM
sys/kern/kern_mbuf.c
683 ↗(On Diff #46531)

The boot defaults for clusters are relatively low, and if you have a lot of interfaces it will trip more often because so much is sitting in receive queues. At Isilon the zone sizes were tuned up substantially from the defaults, so the only time it happened there was with a leak.

It's not a regression to leave it this way but I would possibly appreciate a comment about it.

sys/kern/vfs_subr.c
1207–1208 ↗(On Diff #46531)

I find this somewhat concerning. Perhaps we can get kib to comment. Again it's not your regression to fix but this really looks like overkill to me.

kib added inline comments. Aug 20 2018, 3:20 PM
sys/kern/vfs_subr.c
1207–1208 ↗(On Diff #46531)

If we exceeded the vnode limit, and then flushed a number of vnodes back to under it, we certainly could have overuse of the related zones. I do not remember whether that uma_reclaim() call was added by the original bde patch, or whether it was added after pho's tests demonstrated the need.

I do not object to a less aggressive reclaim (TRIM in the patch's terminology).

markj added inline comments. Aug 20 2018, 3:22 PM
sys/kern/kern_mbuf.c
683 ↗(On Diff #46531)

Isilon just removed the limits entirely. OneFS also has patches to allow sockbuf limits to be overridden.

markj updated this revision to Diff 46982. Aug 20 2018, 3:30 PM
  • Trim UMA caches after shrinking the vnode cache.
markj added inline comments. Aug 20 2018, 3:31 PM
sys/kern/vfs_subr.c
1207–1208 ↗(On Diff #46531)

I will ask Peter to test this version, which just trims the caches.

kib added inline comments. Aug 20 2018, 4:11 PM
sys/kern/vfs_subr.c
1207–1208 ↗(On Diff #46531)

AFAIR the problems were with ZFS. UFS was relatively ok.

markj added inline comments. Aug 20 2018, 4:17 PM
sys/kern/vfs_subr.c
1207–1208 ↗(On Diff #46531)

That makes sense given ZFS' heavy usage of UMA for large buffers. FWIW, earlier versions of this patch were tested by some ZFS users who reported seeing stalls when uma_reclaim() was called regularly by the page daemon.

markj updated this revision to Diff 47343. Aug 27 2018, 2:56 PM
  • Use the right type for the reclamation target.
emaste added a subscriber: emaste. Oct 4 2018, 12:30 AM
markj updated this revision to Diff 51659. Dec 6 2018, 4:30 PM

Rebase.

markj planned changes to this revision. Dec 6 2018, 4:30 PM
markj updated this revision to Diff 52087. Dec 16 2018, 11:42 PM

When trimming, choose the reclamation target by taking the minimum of
the WSS and nitems - imin, as discussed in D16666. This means that we
will ignore the historical average if the WSS estimate will grow at the
end of the current interval.

Adjust the imin/imax bounds, also as discussed in D16666.

Use a TAILQ instead of a LIST for bucket caches. This lets us reclaim the
least recently used items instead of the most recently used items as we did
in the previous version of the patch.
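The reason a TAILQ suits LRU reclamation can be sketched with the standard sys/queue.h macros. This is a simplified userland model, not the kernel's bucket cache: full buckets are appended at the tail, so the head always holds the least recently used bucket, which a reclaim can remove first.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Minimal stand-in for a zone's full-bucket cache. */
struct bucket {
	TAILQ_ENTRY(bucket) link;
	int id;
};
TAILQ_HEAD(bucketlist, bucket);

/* Cache a full bucket: most recently used buckets go to the tail. */
static void
cache_put(struct bucketlist *bl, struct bucket *b)
{
	TAILQ_INSERT_TAIL(bl, b, link);
}

/* Reclaim from the head, i.e. the least recently used bucket. */
static struct bucket *
cache_reclaim_lru(struct bucketlist *bl)
{
	struct bucket *b;

	b = TAILQ_FIRST(bl);
	if (b != NULL)
		TAILQ_REMOVE(bl, b, link);
	return (b);
}
```

With a singly-anchored LIST, cheap insertion and removal are only possible at the head, which is why the previous version of the patch ended up reclaiming the most recently used items.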

Rename uma_drain_lock to uma_reclaim_lock for consistency.
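The trim-target selection described in this update reduces to a one-line computation. The sketch below is a hypothetical userland model of that rule: "imin" is the interval's minimum cached-item count and "wss" the historical working-set average, so nitems - imin is the portion of the cache that went unused this interval.

```c
#include <assert.h>

/*
 * Choose how many items to reclaim when trimming: the minimum of the
 * historical WSS average and nitems - imin.  When nitems - imin is the
 * smaller value, the WSS estimate is about to grow, so the historical
 * average is ignored and only the unused excess is reclaimed.
 */
static int
trim_target(int nitems, int imin, int wss)
{
	int unused;

	unused = nitems - imin;
	return (unused < wss ? unused : wss);
}
```

Example: with 100 cached items, an interval minimum of 60, and a WSS average of 50, only the 40 unused items are candidates for reclamation.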

markj updated this revision to Diff 52171. Dec 19 2018, 6:02 PM

A couple of minor changes:

  • ensure the bucket list is initialized
  • when reclaiming, adjust imin and imax at the same time as nitems
markj edited the test plan for this revision. Dec 19 2018, 6:05 PM
markj updated this revision to Diff 52845. Jan 15 2019, 1:49 AM
  • Update libmemstat after converting the bucket list to a tailq.
markj updated this revision to Diff 52846. Jan 15 2019, 2:17 AM

Rebase.

markj updated this revision to Diff 59891. Jul 18 2019, 7:43 PM

Rebase.

jeff accepted this revision. Aug 19 2019, 9:30 PM
This revision is now accepted and ready to land. Aug 19 2019, 9:30 PM
This revision was automatically updated to reflect the committed changes.