
Increase the default vm.max_user_wired value.
ClosedPublic

Authored by markj on Sep 14 2020, 8:18 PM.

Details

Summary

Since r347532 (merged to stable/12) we only count user-wired pages
towards the system limit. However, we now also treat pages wired by
hypervisors (bhyve and virtualbox) as user-wired, so starting VMs with
large amounts of RAM tends to fail due to the low limit. I've seen a
number of reports of this with both bhyve and virtualbox.

I propose increasing the default value. The point of the limit is to
provide a seatbelt to ensure that the system can reclaim pages, not to
impose some policy on the use of wired memory. Now that kernel-wired
pages are not counted against the limit, I believe it is reasonable to
increase the default value (and merge the change to 12.2) so that large
memory VMs just work by default.
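
For context, the limit being raised here is exposed to userspace as the vm.max_user_wired sysctl, counted in pages. The small program below is an editorial sketch, not part of this change: it reads the value with sysctlbyname(3) and prints it in pages and megabytes. The union is only there because the tunable's integer width has differed across branches, which is an assumption to verify on a given system.

#include <sys/types.h>
#include <sys/sysctl.h>

#include <err.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	/* The kernel reports the tunable's actual width via the length. */
	union {
		u_int	ui;
		u_long	ul;
	} val;
	size_t len = sizeof(val);
	u_long limit;
	long pagesize = sysconf(_SC_PAGESIZE);

	if (sysctlbyname("vm.max_user_wired", &val, &len, NULL, 0) != 0)
		err(1, "sysctlbyname(vm.max_user_wired)");
	limit = (len == sizeof(u_int)) ? val.ui : val.ul;

	printf("vm.max_user_wired: %lu pages (~%lu MB)\n",
	    limit, limit * (u_long)pagesize >> 20);
	return (0);
}

The same value can of course be read with sysctl(8); the program just makes the page-count units explicit.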

Diff Detail
Lint: Lint Passed
Unit: No Test Coverage
Build Status: Buildable 33565, Build 30819: arc lint + arc unit

Event Timeline

markj requested review of this revision. Sep 14 2020, 8:18 PM
markj created this revision.
markj added reviewers: alc, kib, dougm.
This revision is now accepted and ready to land. Sep 14 2020, 9:05 PM
sys/vm/vm_pageout.c:2337

I don't know what the overflow risk here is, but freecount - freecount / 5 won't overflow in cases where 4 * freecount / 5 will.

markj added inline comments.
sys/vm/vm_pageout.c:2337

freecount is a count of pages, so with a page size of 4096 bytes a 32-bit counter can represent up to 2^32 - 1 pages, or ~2^44 bytes = 16 TB. But 4 * freecount wraps once freecount exceeds 2^30 pages, i.e., once it represents about 4 TB, which is not an especially large amount of RAM these days.

A few weeks ago I started converting page counters to u_long for this reason but haven't finished yet; I will go back to it. In the meantime I think we can just change freecount to u_long to avoid the problem.
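
To make the arithmetic concrete, here is a small standalone sketch; it is not the kernel code from the diff, and the page counts are purely illustrative.

#include <sys/types.h>

#include <stdio.h>

int
main(void)
{
	u_int freecount = 1200000000;	/* ~4.5 TB worth of 4 KB pages */
	u_long widened = freecount;	/* the proposed fix: widen to u_long */

	/* Wraps: 4 * freecount exceeds UINT_MAX, so the quotient is garbage. */
	printf("4 * freecount / 5         = %u\n", 4 * freecount / 5);

	/* Safe: the intermediate result never exceeds freecount itself. */
	printf("freecount - freecount / 5 = %u\n", freecount - freecount / 5);

	/* Also safe on LP64 targets, where u_long is 64 bits wide. */
	printf("4 * widened / 5           = %lu\n", 4 * widened / 5);
	return (0);
}

On amd64 the first line prints 101006540 instead of the expected 960000000, while the other two both print 960000000.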

markj marked an inline comment as done.

Widen freecount.

This revision now requires review to proceed. Sep 15 2020, 1:12 PM
This revision was not accepted when it landed; it landed in state Needs Review. Sep 17 2020, 4:49 PM
This revision was automatically updated to reflect the committed changes.

Suppose I'm a ZFS user and I start a bhyve VM with a large guest-physical memory and the -S option. My impression is that it is not unusual for the ARC to consume greater than 20% of physical memory.

In D26424#588887, @alc wrote:

Suppose I'm a ZFS user and I start a bhyve VM with a large guest-physical memory and the -S option. My impression is that it is not unusual for the ARC to consume greater than 20% of physical memory.

That's true, but the ARC will shrink in response to memory pressure, at least in principle. In particular, it will attempt to shrink while the global free page count is below the free target.
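
In rough code terms, the policy described here is: while the free page count is below the free target, the ARC keeps reporting a deficit and shrinking. The sketch below is hypothetical, with made-up names; the real logic lives in the ARC's lowmem/available-memory handling, not in this diff.

#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical sketch: while the global free page count sits below the
 * free target, report a shortfall (in bytes) so the ARC keeps
 * shrinking.  Names and structure do not match the real arc.c code.
 */
static int64_t
arc_reclaim_deficit(uint64_t free_pages, uint64_t free_target, uint64_t pagesz)
{
	if (free_pages >= free_target)
		return (0);		/* no pressure: the ARC may grow */
	return ((int64_t)(free_target - free_pages) * (int64_t)pagesz);
}

int
main(void)
{
	/* Example: 100000 pages free against a 400000-page free target. */
	printf("deficit: %lld bytes\n",
	    (long long)arc_reclaim_deficit(100000, 400000, 4096));
	return (0);
}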

In D26424#588887, @alc wrote:

Suppose I'm a ZFS user and I start a bhyve VM with a large guest-physical memory and the -S option. My impression is that it is not unusual for the ARC to consume greater than 20% of physical memory.

That's true, but the ARC will shrink in response to memory pressure, at least in principle. In particular, it will attempt to shrink while the global free page count is below the free target.

Has that been tested? :-)

Is anyone using the -S option for any reason besides device pass through?

In D26424#588907, @alc wrote:
Has that been tested? :-)

It's been a while since I've dug into ARC low memory handling, but I do occasionally use virtualbox to run a Windows VM (with >50% of RAM allocated to it) on a ZFS system, and I see that the ARC shrinks promptly when the VM starts. It was actually this setup that motivated r355003 and a few related revisions last year: the virtualbox kernel module allocates a large number of wired pages with high allocation priority during VM initialization, and startup fails if an allocation failure occurs (i.e., there is no vm_wait() call), so it's a useful test of the VM system's ability to keep up with memory pressure. In those tests the ARC would always shrink to a small fraction of total RAM.

The low memory handling in the ARC should be reviewed now that OpenZFS has been imported, but my feeling is that the scenario you described shouldn't be especially problematic with the new default.

Is anyone using the -S option for any reason besides device pass through?

I'm not sure. I've only seen it used when passthrough is in use.

I tried a test where a postgres database of size 1.5*RAM is accessed by pgbench, so that the ARC consumes most of the system's memory. Then I started a bhyve VM with -S, giving it 75% of the system's RAM. The ARC ended up shrinking as the VM started. A few observations:

  • Wired VM initialization is surprisingly slow even when all of the system's pages are free. I suspect this is because vm_map_wire() ends up calling vm_fault() on every single 4KB page.
  • The system becomes partially unresponsive when destroying a large wired VM. I can run commands from a shell, but programs like top(1) block on a mutex in a sysctl handler for several seconds. Not sure yet what's going on there.
  • The ARC grows very quickly once the VM is shut down.
  • In one iteration of the test I got an OOM kill. I believe uma_reclaim() and lowmem handlers provide no feedback to the OOM logic, which is a bug.
  • The system becomes partially unresponsive when destroying a large wired VM. I can run commands from a shell, but programs like top(1) block on a mutex in a sysctl handler for several seconds. Not sure yet what's going on there.

This seems to be because VMM memory segments are unmapped and destroyed from a destroy_dev() callback, and such callbacks always run from the Giant-protected taskqueue_swi_giant.