
Extend uma_reclaim() to permit different reclamation targets.
ClosedPublic

Authored by markj on Aug 10 2018, 9:03 PM.

Details

Summary

Currently the page daemon will invoke uma_reclaim() once every 10s under
memory pressure. This causes the entire bucket cache of each zone to be
drained. UMA also exports zone_drain() to accomplish the same thing for
individual zones.

The current policy of draining each zone can result in a thundering
herd effect in busy zones when multiple threads take a bucket cache
miss following a drain. It is also just too aggressive when
uma_reclaim() gets called regularly in the steady state of a workload.

The change adds a "request" parameter to uma_reclaim() to let the caller
indicate the desired level of reclamation. I've tried to use the
following terminology consistently:

  • "reclaim" means to free some unspecified number of items from a cache,
  • "drain" is to reclaim all items from a cache,
  • "trim" reclaims items in excess of the working set size estimate.

The diff makes the following changes:

  • Rename zone_drain() to uma_zreclaim(), to keep naming consistent.
  • Add a "req" parameter to uma_reclaim() and uma_zreclaim(). Use that parameter to completely determine the behaviour of uma_reclaim(). Callers can trim the bucket cache, drain the bucket cache, or drain the bucket cache and per-CPU caches.
Test Plan

Peter stress-tested the patch.

Diff Detail

Lint: Passed
Unit: No Test Coverage
Build Status: Buildable 21601 · Build 20901 (arc lint + arc unit)

Event Timeline

sys/kern/kern_mbuf.c
689

Shouldn't this be trim, not drain? We need to free some memory, but not all of it. Possibly we need to guarantee that trim does something even if the WSS has been reached, but otherwise this is more expensive than it needs to be.

There is also a stray * in the comment above that you could remove while you're here.

933

Here again, do we need to drain the whole thing or just free some memory?

sys/kern/vfs_subr.c
1224–1225

I'm confused about this. Why are we reclaiming from all zones after we reclaim some vnodes? Presumably a bunch of zones related to vfs now have some free memory. This should be a TRIM at most, and now that we have more regular grooming I think it could be omitted entirely.

sys/vm/uma_core.c
912–928

I missed the review for the working-set size estimation. Where is that?

sys/kern/kern_mbuf.c
689

Isn't mbuf cluster zone exhaustion a fairly exceptional situation? In my experience it usually indicates an mbuf leak. In that case mbuf packet allocations will likely be failing too, and even a trim will require the cluster zone lock, so it's an expensive operation regardless. Will fix the comment.

sys/kern/vfs_subr.c
1224–1225

Hmm, it was added somewhat recently, in r291244. I don't know that we can assume regular grooming is going to be happening concurrently with this?

I agree that trimming is probably sufficient, especially in the absence of memory pressure.

sys/vm/uma_core.c
912–928

D16666, just for the record, but you already commented on it.

sys/kern/kern_mbuf.c
689

The boot defaults for clusters are relatively low, and sometimes, if you have a lot of interfaces, it will trip more often because so much is sitting in receive queues. At Isilon the zone sizes were tuned up substantially from the defaults, so the only time it happened there was with a leak.

It's not a regression to leave it this way, but I would possibly appreciate a comment about it.

sys/kern/vfs_subr.c
1224–1225

I find this somewhat concerning. Perhaps we can get kib to comment. Again, it's not your regression to fix, but this really looks like overkill to me.

sys/kern/vfs_subr.c
1224–1225

If we exceeded the vnode limit and then flushed a number of vnodes to get back under it, we certainly could have overuse of the related zones. I do not remember whether that uma_reclaim() call was added by the original bde patch, or whether it was added after pho's tests demonstrated the need.

I do not object to a less significant reclaim (TRIM in the patch terminology).

sys/kern/kern_mbuf.c
689

Isilon just removed the limits entirely. OneFS also has patches to allow sockbuf limits to be overridden.

  • Trim UMA caches after shrinking the vnode cache.
sys/kern/vfs_subr.c
1224–1225

I will ask Peter to test this version, which just trims the caches.
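
For illustration, the trimmed call site might look like the following. The request constant follows the naming used in the summary, and the exact placement in vfs_subr.c is not shown here.

    /*
     * Illustrative only: after shrinking the vnode cache, ask UMA to trim
     * its caches back toward the working-set estimates rather than
     * draining them completely.
     */
    uma_reclaim(UMA_RECLAIM_TRIM);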

sys/kern/vfs_subr.c
1224–1225

AFAIR the problems were with ZFS. UFS was relatively ok.

sys/kern/vfs_subr.c
1224–1225

That makes sense given ZFS' heavy usage of UMA for large buffers. FWIW, earlier versions of this patch were tested by some ZFS users who reported seeing stalls when uma_reclaim() was called regularly by the page daemon.

  • Use the right type for the reclamation target.
markj planned changes to this revision. Dec 6 2018, 4:30 PM

When trimming, choose the reclamation target by taking the minimum of
the WSS and nitems - imin, as discussed in D16666. This means that we
will ignore the historical average if the WSS estimate will grow at the
end of the current interval.
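
As a literal sketch of that rule (the variable names are placeholders rather than the actual uma_core.c fields, and the patch may express the computation differently):

    /*
     * Illustrative only: per-domain trim target chosen as described above,
     * i.e. the minimum of the WSS estimate and nitems - imin.
     * MIN() is the usual <sys/param.h> macro.
     */
    static long
    trim_target(long nitems, long imin, long wss)
    {
            return (MIN(wss, nitems - imin));
    }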

Adjust the imin/imax bounds, also as discussed in D16666.

Use a TAILQ instead of a LIST for bucket caches. This lets us reclaim the
least recently used items instead of the most recently used items as we did
in the previous version of the patch.
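
As a rough illustration of the ordering this provides (generic sys/queue.h usage; the structure and field names below are not the actual uma_core.c definitions):

    #include <sys/queue.h>

    /* Illustrative bucket-cache ordering sketch, not the real UMA structures. */
    struct bucket {
            TAILQ_ENTRY(bucket) link;
            int cnt;
    };
    TAILQ_HEAD(bucketlist, bucket);

    /* Freed buckets go to the head, so the tail holds the least recently used. */
    static void
    cache_bucket(struct bucketlist *list, struct bucket *b)
    {
            TAILQ_INSERT_HEAD(list, b, link);
    }

    /* Reclaim from the tail: the coldest buckets are freed first. */
    static struct bucket *
    reclaim_bucket(struct bucketlist *list)
    {
            struct bucket *b;

            b = TAILQ_LAST(list, bucketlist);
            if (b != NULL)
                    TAILQ_REMOVE(list, b, link);
            return (b);
    }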

Rename uma_drain_lock to uma_reclaim_lock for consistency.

A few minor changes:

  • Ensure the bucket list is initialized.
  • When reclaiming, adjust imin and imax at the same time as nitems.
  • Update libmemstat after converting the bucket list to a tailq.
This revision is now accepted and ready to land. Aug 19 2019, 9:30 PM