
Extend uma_reclaim() to permit different reclamation targets.
ClosedPublic

Authored by markj on Aug 10 2018, 9:03 PM.

Details

Summary

Currently the page daemon will invoke uma_reclaim() once every 10s under
memory pressure. This causes the entire bucket cache of each zone to be
drained. UMA also exports zone_drain() to accomplish the same thing for
individual zones.

The current policy of draining each zone can result in a thundering
herd effect in busy zones when multiple threads take a bucket cache
miss following a drain. It is also just too aggressive when
uma_reclaim() gets called regularly in the steady state of a workload.

The change adds a "request" parameter to uma_reclaim() to let the caller
indicate the desired level of reclamation. I've tried to use the
following terminology consistently:

  • "reclaim" means to free some unspecified number of items from a cache,
  • "drain" is to reclaim all items from a cache,
  • "trim" reclaims items in excess of the working set size estimate.

The diff makes the following changes:

  • Rename zone_drain() to uma_zreclaim(), to keep naming consistent.
  • Add a "req" parameter to uma_reclaim() and uma_zreclaim(). Use that parameter to completely determine the behaviour of uma_reclaim(). Callers can trim the bucket cache, drain the bucket cache, or drain the bucket cache and per-CPU caches.
Test Plan

Peter stress-tested the patch.

Diff Detail

Lint: Passed
Unit: No Test Coverage
Build Status: Buildable 21601 · Build 20901 (arc lint + arc unit)

Event Timeline

sys/kern/kern_mbuf.c
689

Shouldn't this be trim, not drain? We need to free some memory, but not all of it. Possibly we need to guarantee that trim does something even if the WSS has been reached, but otherwise this is more expensive than it needs to be.

There is also a stray * in the comment above that you could remove while you're here.

933

Here again, do we need to drain the whole thing or just free some memory?

sys/kern/vfs_subr.c
1224–1225

I'm confused about this. Why are we reclaiming from all zones after we reclaim some vnodes? Presumably a bunch of zones related to vfs now have some free memory. This should be a TRIM at most, and now that we have more regular grooming I think it could be omitted entirely.

sys/vm/uma_core.c
912–928

I missed the review for the working-set size estimation. Where is that?

sys/kern/kern_mbuf.c
689

Isn't mbuf cluster zone exhaustion a fairly exceptional situation? In my experience it usually indicates an mbuf leak. In that case mbuf packet allocations will likely be failing too, and even a trim will require the cluster zone lock, so it's an expensive operation regardless. Will fix the comment.

sys/kern/vfs_subr.c
1224–1225

Hmm, it was added somewhat recently, in r291244. I don't know that we can assume regular grooming is going to be happening concurrently with this?

I agree that trimming is probably sufficient, especially in the absence of memory pressure.

sys/vm/uma_core.c
912–928

D16666, just for the record, but you already commented on it.

sys/kern/kern_mbuf.c
689

The boot defaults for clusters are relatively low, and sometimes, if you have a lot of interfaces, it will trip more often because so much is sitting in receive queues. At Isilon the zone sizes were tuned up substantially from the defaults, so the only time it happened there was with a leak.

It's not a regression to leave it this way, but I would possibly appreciate a comment about it.

sys/kern/vfs_subr.c
1224–1225

I find this somewhat concerning. Perhaps we can get kib to comment. Again, it's not your regression to fix, but this really looks like overkill to me.

sys/kern/vfs_subr.c
1224–1225

If we exceeded the vnode limit and then flushed a number of vnodes to get back under it, we certainly could have overuse of the related zones. I do not remember whether that uma_reclaim() call was added by the original bde patch, or whether it was added after pho's tests demonstrated the need.

I do not object to a less significant reclaim (TRIM in the patch terminology).

sys/kern/kern_mbuf.c
689

Isilon just removed the limits entirely. OneFS also has patches to allow sockbuf limits to be overridden.

  • Trim UMA caches after shrinking the vnode cache.
sys/kern/vfs_subr.c
1224–1225

I will ask Peter to test this version, which just trims the caches.
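
For illustration, the trimmed call site might look like the following. The request constant follows the naming used in the summary, and the exact placement in vfs_subr.c is not shown here.

    /*
     * Illustrative only: after shrinking the vnode cache, ask UMA to trim
     * its caches back toward the working-set estimates rather than
     * draining them completely.
     */
    uma_reclaim(UMA_RECLAIM_TRIM);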

sys/kern/vfs_subr.c
1224–1225

AFAIR the problems were with ZFS. UFS was relatively ok.

sys/kern/vfs_subr.c
1224–1225

That makes sense given ZFS' heavy usage of UMA for large buffers. FWIW, earlier versions of this patch were tested by some ZFS users who reported seeing stalls when uma_reclaim() was called regularly by the page daemon.

  • Use the right type for the reclamation target.
markj planned changes to this revision. Dec 6 2018, 4:30 PM

When trimming, choose the reclamation target by taking the minimum of
the WSS and nitems - imin, as discussed in D16666. This means that we
will ignore the historical average if the WSS estimate will grow at the
end of the current interval.
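
As a literal sketch of that rule (the variable names are placeholders rather than the actual uma_core.c fields, and the patch may express the computation differently):

    /*
     * Illustrative only: per-domain trim target chosen as described above,
     * i.e. the minimum of the WSS estimate and nitems - imin.
     * MIN() is the usual <sys/param.h> macro.
     */
    static long
    trim_target(long nitems, long imin, long wss)
    {
            return (MIN(wss, nitems - imin));
    }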

Adjust the imin/imax bounds, also as discussed in D16666.

Use a TAILQ instead of a LIST for bucket caches. This lets us reclaim the
least recently used items instead of the most recently used items as we did
in the previous version of the patch.
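
As a rough illustration of the ordering this provides (generic sys/queue.h usage; the structure and field names below are not the actual uma_core.c definitions):

    #include <sys/queue.h>

    /* Illustrative bucket-cache ordering sketch, not the real UMA structures. */
    struct bucket {
            TAILQ_ENTRY(bucket) link;
            int cnt;
    };
    TAILQ_HEAD(bucketlist, bucket);

    /* Freed buckets go to the head, so the tail holds the least recently used. */
    static void
    cache_bucket(struct bucketlist *list, struct bucket *b)
    {
            TAILQ_INSERT_HEAD(list, b, link);
    }

    /* Reclaim from the tail: the coldest buckets are freed first. */
    static struct bucket *
    reclaim_bucket(struct bucketlist *list)
    {
            struct bucket *b;

            b = TAILQ_LAST(list, bucketlist);
            if (b != NULL)
                    TAILQ_REMOVE(list, b, link);
            return (b);
    }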

Rename uma_drain_lock to uma_reclaim_lock for consistency.

A few minor changes:

  • Ensure the bucket list is initialized.
  • When reclaiming, adjust imin and imax at the same time as nitems.
  • Update libmemstat after converting the bucket list to a tailq.
This revision is now accepted and ready to land. Aug 19 2019, 9:30 PM