
Increase the default vm.max_user_wired value.
Closed, Public

Authored by markj on Sep 14 2020, 8:18 PM.
Details

Summary

Since r347532 (merged to stable/12) we only count user-wired pages
towards the system limit. However, we now also treat pages wired by
hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with
large amounts of RAM tends to fail due to the low limit. I've seen a
number of reports of this with both bhyve and virtualbox.

I propose increasing the default value. The point of the limit is to
provide a seatbelt to ensure that the system can reclaim pages, not to
impose some policy on the use of wired memory. Now that kernel-wired
pages are not counted against the limit, I believe it is reasonable to
increase the default value (and merge the change to 12.2) so that large-memory VMs just work by default.

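For context, the limit being raised here is exposed as the vm.max_user_wired sysctl. The sketch below shows one way a userspace program might query it via sysctlbyname(3); it is illustrative only, and the width handling is deliberately defensive rather than an assertion about the counter's exact type (page counters were being converted from u_int to u_long around this time, per the inline discussion below).

```c
#include <sys/types.h>
#include <sys/sysctl.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
	/* Buffer wide enough whether the kernel exports 32 or 64 bits. */
	unsigned long val = 0;
	size_t len = sizeof(val);

	if (sysctlbyname("vm.max_user_wired", &val, &len, NULL, 0) != 0) {
		perror("sysctlbyname(vm.max_user_wired)");
		return (1);
	}
	if (len == sizeof(unsigned int)) {
		/* Handle a 32-bit counter without assuming endianness. */
		unsigned int v32;
		memcpy(&v32, &val, sizeof(v32));
		printf("vm.max_user_wired: %u pages\n", v32);
	} else {
		printf("vm.max_user_wired: %lu pages\n", val);
	}
	return (0);
}
```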

Event Timeline

markj requested review of this revision. Sep 14 2020, 8:18 PM
markj created this revision.
markj added reviewers: alc, kib, dougm.
This revision is now accepted and ready to land. Sep 14 2020, 9:05 PM
sys/vm/vm_pageout.c:2337

I don't know what the overflow risk here is, but freecount - freecount / 5 won't overflow in cases where 4 * freecount / 5 will.

markj added inline comments.
sys/vm/vm_pageout.c:2337

freecount is a count of pages, so with a page size of 4096 bytes a 32-bit counter covers up to 2^32 - 1 pages, or ~2^44 bytes = 16 TB. The product 4 * freecount would then overflow once freecount represents 4 TB, which is not an especially large amount of RAM these days.

A few weeks ago I started converting page counters to u_long for this reason but haven't finished yet; I will go back to it. In the meantime I think we can just change freecount to u_long to avoid the problem.
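To make the overflow concern concrete, here is a small standalone sketch (the variable name freecount is taken from the discussion above, but the values and the program itself are illustrative, not the code under review) comparing the two formulations and the effect of widening to u_long:

```c
#include <stdio.h>

int
main(void)
{
	/* Illustrative value: 1.5 * 2^30 free 4 KB pages (~6 TB of RAM). */
	unsigned int freecount = 1610612736U;

	/* 4 * freecount wraps a 32-bit unsigned int before the division. */
	unsigned int bad = 4 * freecount / 5;

	/* Subtracting a fifth never exceeds freecount, so it cannot wrap. */
	unsigned int good = freecount - freecount / 5;

	/* Widening the counter (as the updated diff does) also keeps the
	 * multiply-first form safe on LP64 platforms. */
	unsigned long wide_ok = 4 * (unsigned long)freecount / 5;

	printf("4 * freecount / 5         = %u (wrapped)\n", bad);
	printf("freecount - freecount / 5 = %u\n", good);
	printf("4 * (u_long)freecount / 5 = %lu\n", wide_ok);
	return (0);
}
```

The two exact results differ by at most one page due to integer truncation, so the subtraction form and the widened counter are effectively interchangeable here; the widened counter was the fix chosen in the updated diff.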

markj marked an inline comment as done.

Widen freecount.

This revision now requires review to proceed. Sep 15 2020, 1:12 PM
This revision was not accepted when it landed; it landed in state Needs Review. Sep 17 2020, 4:49 PM
This revision was automatically updated to reflect the committed changes.

Suppose I'm a ZFS user and I start a bhyve VM with a large guest-physical memory and the -S option. My impression is that it is not unusual for the ARC to consume greater than 20% of physical memory.

In D26424#588887, @alc wrote:

> Suppose I'm a ZFS user and I start a bhyve VM with a large guest-physical memory and the -S option. My impression is that it is not unusual for the ARC to consume greater than 20% of physical memory.

That's true, but the ARC will shrink in response to memory pressure, at least in principle. In particular, it will attempt to shrink while the global free page count is below the free target.

> That's true, but the ARC will shrink in response to memory pressure, at least in principle. In particular, it will attempt to shrink while the global free page count is below the free target.

Has that been tested? :-)

Is anyone using the -S option for any reason besides device pass through?

In D26424#588907, @alc wrote:

> Has that been tested? :-)

It's been a while since I've dug into ARC low-memory handling, but I do occasionally use virtualbox to run a Windows VM (with >50% of RAM allocated to it) on a ZFS system, and I see that the ARC shrinks promptly when the VM starts. It was actually this setup that motivated r355003 and a few related revisions last year: the virtualbox kernel module allocates a large number of wired pages with high allocation priority during VM initialization, and startup fails if an allocation failure occurs (i.e., there is no vm_wait() call), so it's a useful test of the virtual memory system's ability to keep up with memory pressure. In those tests the ARC would always shrink to a small fraction of the total RAM.

The low-memory handling in the ARC should be reviewed now that OpenZFS has been imported, but my feeling is that the scenario you described shouldn't be especially problematic with the new default.

> Is anyone using the -S option for any reason besides device pass through?

I'm not sure. I've only seen it used when passthrough is in use.

I tried a test where a postgres database of size 1.5*RAM is being accessed by pgbench, so that the ARC consumes most of the system's memory. Then I started a bhyve VM with -S, giving it 75% of the system's RAM. The ARC kept shrinking until the VM had started. A few observations:

  • Wired VM initialization is surprisingly slow even when all of the system's pages are free. I suspect this is because vm_map_wire() ends up calling vm_fault() on every single 4KB page (see the illustrative sketch at the end of this review).
  • The system becomes partially unresponsive when destroying a large wired VM. I can run commands from a shell, but programs like top(1) block on a mutex in a sysctl handler for several seconds. Not sure yet what's going on there.
  • The ARC grows back very quickly once the VM is shut down.
  • In one iteration of the test I got an OOM kill. I believe uma_reclaim() and lowmem handlers provide no feedback to the OOM logic, which is a bug.

> The system becomes partially unresponsive when destroying a large wired VM. I can run commands from a shell, but programs like top(1) block on a mutex in a sysctl handler for several seconds. Not sure yet what's going on there.

This seems to be because VMM memory segments are unmapped and destroyed from a destroy_dev() callback, and such callbacks always run from the Giant-protected taskqueue_swi_giant.
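To illustrate the per-page wiring cost mentioned in the first bullet above, here is a heavily simplified, hypothetical sketch of the kind of loop that wiring a guest-physical range implies. It is based only on the discussion here, not on the actual vm_map_wire() code, and fault_and_wire() is a made-up stand-in for the real per-page work:

```c
#include <stdio.h>

#define PAGE_SIZE	4096UL
#define GIB		(1024UL * 1024UL * 1024UL)

static unsigned long faults;

/* Stand-in for the real per-page fault-and-wire step. */
static int
fault_and_wire(unsigned long va)
{
	(void)va;
	faults++;
	return (0);
}

int
main(void)
{
	unsigned long guest_mem = 64 * GIB;	/* Hypothetical 64 GiB guest. */
	unsigned long va;

	/* One call per 4 KB page, even when every page is already free. */
	for (va = 0; va < guest_mem; va += PAGE_SIZE)
		if (fault_and_wire(va) != 0)
			return (1);

	printf("%lu GiB guest => %lu per-page wire operations\n",
	    guest_mem / GIB, faults);
	return (0);
}
```

For a 64 GiB guest this works out to roughly 16.8 million individual per-page operations, which is consistent with wired VM startup being slow even when all of the system's pages are free.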