Revise the page cache size policy.
Closed, Public

Authored by markj on Nov 15 2019, 11:45 PM.

Details

Reviewers

alc
kib
jeff
dougm
gallatin
glebius

Group Reviewers

manpages

Commits

rS355002: Revise the page cache size policy.

Summary

In r353734 I limited the use of the page caches to systems with a
relatively large amount of RAM per CPU. This was to mitigate reports
of systems failing to keep up with memory pressure in cases where they
had been able to do so before the direct free pool cache was added.

This change modifies some attributes of the cache zones and re-enables
them for basically all systems. I believe that it is preferable to
always enable the caches rather than use some heuristic based on the
amount of RAM and number of CPUs: it makes the kernel more consistent
and predictable across different systems.

The change modifies uma_zone_set_maxcache(), which was introduced
specifically for the page cache zones. Rather than using it to limit
only the bucket cache, have it set uz_count_max to provide an upper
bound on the per-CPU cache size that is consistent with the number of
items requested. Remove its return value since it has no use.

Enable the page cache zones unconditionally, and limit them to 0.1% of
the domain's pages. The limit can be overridden by the pgcache_zone_max
tunable as before.
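
The 0.1% default and the tunable override described above can be sketched as a small user-space C function. This is a minimal illustration, not the committed kernel code: the helper name pgcache_limit() and the convention that a tunable value of 0 means "not set" are assumptions made for the example.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): compute the per-zone
 * page cache limit for one domain.
 *
 * domain_pages: number of pages in the domain.
 * tunable_max:  value of the pgcache_zone_max tunable; 0 is assumed
 *               here to mean "not set".
 */
static uint64_t
pgcache_limit(uint64_t domain_pages, uint64_t tunable_max)
{
	/* An explicit tunable value overrides the default. */
	if (tunable_max != 0)
		return (tunable_max);

	/* Default: 0.1% of the domain's pages (integer division). */
	return (domain_pages / 1000);
}
```

For a domain with 1GB of RAM and 4KB pages (262144 pages), pgcache_limit(262144, 0) yields 262 pages.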

Change the item size parameter passed to uma_zcache_create() to the
correct size, and stop setting UMA_ZONE_MAXBUCKET. This allows the page
cache buckets to be adaptively sized, like the rest of UMA's caches.
This also causes the initial bucket size to be small, so only systems
which benefit from large caches will get them. In my testing the bucket
size will ramp up very quickly under load, whereas most workloads are
not going to benefit from using the maximum bucket size for the direct
free pool cache.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Event Timeline

markj created this revision. Nov 15 2019, 11:45 PM

Harbormaster completed remote builds in B27552: Diff 64409. Nov 15 2019, 11:45 PM

markj added reviewers: alc, kib, jeff, dougm, gallatin, glebius. Nov 15 2019, 11:48 PM

jeff added inline comments. Nov 16 2019, 11:52 PM

sys/vm/vm_glue.c
477	(On Diff #64409)	On systems with more than 128 hardware threads this will disable per-CPU buckets. You would need at least 3 * mp_ncpus to avoid disabling buckets with the code in bucket_zone_max(). You'd be more likely to want to disable buckets on low-memory, low-core-count systems where the lock contention won't be as much of a problem. I'd rather just leave it at MINBUCKET.
sys/vm/vm_page.c
240	(On Diff #64409)	This one I feel a bit better about. If we look at a hypothetical 256-thread system, we need 3 * 256 * 1000 pages to have a single entry in a bucket. That's about 3GB of memory per bucket entry. So with 256GB per domain it's 85 entries per bucket. Really I think this limit is low and it could be 2-4 times that. Missing the bucket cache on large machines basically craters performance. Netflix has to force 256 entries per bucket to keep throughput up. A sysctl here might be nice. It would actually be interesting to make a per-zone sysctl right in UMA so you don't need to repeat this for every zone that wants to update limits. We could export more stats there than vmstat -z does as well.

markj added inline comments. Nov 18 2019, 9:42 PM

sys/vm/vm_glue.c
477	(On Diff #64409)	I don't quite follow. With 128 threads the cache size will be 512.
sys/vm/vm_page.c
240	(On Diff #64409)	Well, cache zones will only ever have two buckets per CPU currently. So with 1GB of RAM per domain per CPU we get a max bucket size of 128. With most of Netflix's current configurations they will get a max bucket size of 256 automatically using this diff. The limit could probably be increased, but note also that we have two free pools and this limit is per-pool, so it's really 0.2%, which is close to the free_severe threshold. We do already have a tunable to override this for the pgcache zones, and Netflix uses it. I like the idea of having a per-zone sysctl tree, but the naming would be kind of awkward since zone names need not be unique and can contain spaces and ".". You'd have to have something like vm.uma.zones.0.name, vm.uma.zones.0.limit, vm.uma.zones.1.name, vm.uma.zones.1.limit, ...
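
The bucket-size arithmetic in the two comments above can be checked with a short user-space C sketch. It is not the code in bucket_zone_max(): the function name, the 4KB page size, and the table of nominal bucket sizes are all assumptions chosen for illustration. The idea is to pick the largest bucket size whose total across all CPUs still fits within 0.1% of the domain's pages.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed 4KB pages */

/* Assumed nominal bucket sizes; not copied from uma_core.c. */
static const int sketch_bucket_sizes[] = { 4, 8, 16, 32, 64, 128, 256 };

/*
 * Hypothetical helper: return the largest bucket size such that
 * size * buckets_per_cpu * ncpus does not exceed 0.1% of the pages in
 * a domain of domain_bytes bytes.  Returns 0 if even the smallest
 * bucket does not fit, i.e. per-CPU buckets would be disabled.
 */
static int
sketch_max_bucket(uint64_t domain_bytes, int ncpus, int buckets_per_cpu)
{
	uint64_t limit = domain_bytes / SKETCH_PAGE_SIZE / 1000;
	size_t n = sizeof(sketch_bucket_sizes) / sizeof(sketch_bucket_sizes[0]);
	int best = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t need = (uint64_t)sketch_bucket_sizes[i] *
		    (uint64_t)buckets_per_cpu * (uint64_t)ncpus;
		if (need > limit)
			break;
		best = sketch_bucket_sizes[i];
	}
	return (best);
}
```

Under these assumptions, 1GB of RAM per CPU with two buckets per CPU gives 128, matching the figure quoted above. With three buckets per CPU on a 256-thread, 256GB domain, the raw quotient is in the mid-80s, which this table rounds down to 64.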

jeff accepted this revision. Nov 18 2019, 9:49 PM

jeff added inline comments.

sys/vm/vm_glue.c
477	(On Diff #64409)	I misread max as min. Mea culpa.
sys/vm/vm_page.c
240	(On Diff #64409)	I'd rather enforce non-duplicate naming and then use substitutions to make sysctl-appropriate names. Or you could simply detect duplicate names when the zone is added and append a number to it for sysctl purposes. You could still put the original name in the subtree. I do see that it's properly counting buckets for cache zones so that helps.
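
The naming scheme suggested in the last comment (substitute awkward characters, then append a number to duplicates) could look roughly like the following user-space C sketch. Nothing here is from UMA or the sysctl code: the function name, the choice of "_" as the substitute character, and the "_N" suffix format are all hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: derive a sysctl-friendly node name from a zone
 * name.  Every character that is not alphanumeric (spaces, ".", etc.)
 * is replaced with '_'.  If dup_index is greater than zero, meaning
 * the caller has detected an earlier zone with the same name, "_N" is
 * appended.  The result is truncated to fit dstlen bytes.
 */
static void
sketch_zone_sysctl_name(const char *zone, int dup_index, char *dst,
    size_t dstlen)
{
	size_t i;

	if (dstlen == 0)
		return;
	for (i = 0; zone[i] != '\0' && i + 1 < dstlen; i++)
		dst[i] = isalnum((unsigned char)zone[i]) ? zone[i] : '_';
	dst[i] = '\0';
	if (dup_index > 0)
		(void)snprintf(dst + i, dstlen - i, "_%d", dup_index);
}
```

With this sketch, a zone named "vm pgcache" would map to "vm_pgcache", and a second zone with the same name would map to "vm_pgcache_1". The original, unmodified name could still be exported as a string inside the subtree, as suggested above.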