Differential D22088

Wait until swap space becomes full before starting to kill processes.
Changes Planned, Public

Authored by ota_j.email.ne.jp on Oct 19 2019, 4:22 AM.

Details

Reviewers

markj
alc
kib
dougm

Summary

vm_pageout isn't aware of the swap pager's status.
As a result, when a large number of pages are suddenly consumed, the number
of active pages suddenly becomes too low. Because the swap pager has not
detected the need to page out, vm_pageout starts killing processes
even though a large amount of swap space remains.

One of the easiest ways to reproduce OOM killing processes
while a large amount of swap space remains is to 'dd' onto tmpfs.
I was able to reproduce the situation very frequently with the commands below.
In my tests, processes were killed with 10GB of swap space remaining.
% mount -t tmpfs tmp /mnt/tmpfs
% dd if=/dev/zero of=/mnt/tmpfs/1 bs=1M

swap_pager_full is set when there are no swap devices or when no free swap
space can be found. Let the OOM killer wait until we have really run out of
swap space.

PR 241048

This change alters the current OOM killing behavior. As of now,
OOM killing starts if the system cannot produce free pages
fast enough, regardless of how much swap space is available.
So it is not fully deterministic when OOM kicks in.
With this change, OOM will wait until swap space is fully consumed.
This may result in heavier thrashing at the peak of swap space usage, because
OOM killing does not kick in.
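
The behavior change described above can be sketched as a small userspace model. This is not the kernel code from the diff: the struct, its fields, and both function names are hypothetical, and `swap_full` only stands in for the condition `swap_pager_full != 0`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical snapshot of what the page daemon would look at. */
struct oom_inputs {
	bool pageout_stalled;	/* page daemon could not free enough pages */
	bool swap_full;		/* stands in for swap_pager_full != 0 */
};

/* Behavior described as current: a stalled page daemon is enough. */
bool
oom_should_kill_current(const struct oom_inputs *in)
{
	assert(in != NULL);
	return (in->pageout_stalled);
}

/* Behavior proposed here: also require that swap is exhausted. */
bool
oom_should_kill_proposed(const struct oom_inputs *in)
{
	assert(in != NULL);
	return (in->pageout_stalled && in->swap_full);
}
```

With a stalled page daemon and free swap remaining, the first predicate is true and the second is false, which is the case where thrashing can continue without any process being killed.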

Test Plan

Start 'top' to monitor swap space usage.
Run 'dd' onto tmpfs to rapidly consume physical memory.
Observe 'out of swap' kills only after swap space has filled up.
% mount -t tmpfs tmp /mnt/tmpfs
% dd if=/dev/zero of=/mnt/tmpfs/1 bs=1M

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 27122
Build 25401: arc lint + arc unit

Event Timeline

ota_j.email.ne.jp created this revision. Oct 19 2019, 4:22 AM

Herald added a subscriber: imp. Oct 19 2019, 4:22 AM

Harbormaster completed remote builds in B27122: Diff 63470. Oct 19 2019, 4:22 AM

swap_pager_full == 0 cannot be a criterion for enabling OOM. Kills start only when the pagedaemon is unable to produce free pages, and this might happen for many reasons other than exhausted swap.

Can you provide a complete scenario where OOM was triggered too early, in your opinion? Preferably, it would be a shell script, together with a description of how to configure the machine, i.e. how much RAM/swap is needed.

swap_pager_full != 0 was indeed too aggressive. I'm wondering if we can feed in swap space usage so that we delay OOM election at low usage and push for OOM at high usage.
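
One way to picture the idea of feeding swap usage into the decision is the sketch below. It is only an illustration, not code from this revision: the function name, the `OOM_EXTRA_FACTOR` constant, and the notion of a "base number of failed pageout passes" as the input are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tuning knob: at empty swap, wait up to 4x longer. */
#define OOM_EXTRA_FACTOR 4u

/*
 * Return how many consecutive failed pageout passes to tolerate before
 * electing an OOM victim. More free swap means a higher threshold
 * (later OOM); full swap or no swap at all yields base_passes.
 * Swap sizes are in arbitrary but consistent units, e.g. pages.
 */
uint32_t
oom_pass_threshold(uint32_t base_passes, uint64_t swap_used,
    uint64_t swap_total)
{
	uint64_t divisor, extra, free_pct;

	assert(swap_used <= swap_total);
	if (swap_total == 0)
		return (base_passes);

	/* Coarse free percentage; rounding the divisor up avoids /0. */
	divisor = (swap_total + 99) / 100;
	free_pct = (swap_total - swap_used) / divisor;

	extra = (uint64_t)base_passes * OOM_EXTRA_FACTOR * free_pct / 100;
	return (base_passes + (uint32_t)extra);
}
```

With a base of 12 passes and 1000 units of swap, this model tolerates 60 passes when swap is empty, 36 when it is half used, and 12 when it is full. Because the threshold never becomes unbounded, OOM can still fire when the page daemon stalls for reasons other than exhausted swap.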

After further testing, I noticed that using ZFS as swap space reproduces the problem very consistently.
You can use mdconfig on ZFS or a zvol. Both cause rapid OOM killing.
Whether compression was on or off didn't matter.
I had both a swap partition and ZFS swap. That combination can also trigger the same behavior.

Once you have zfs/zvol as swap, mount tmpfs and dd onto it.
$ zfs create -V 10G zsys/swap
$ zfs set compression=off zsys/swap
$ swapon /dev/zvol/zsys/swap
$ mount -t tmpfs tmp /mnt/tmp
$ dd if=/dev/zero of=/mnt/tmp/1 bs=1MB count=3000
Although I allocated 10GB of swap space, OOM kicks in at 10MB to 20MB of swap space usage.

I ran the above commands in Parallels on a Mac with 1.5GB of physical memory allocated to the VM.
I have a couple-of-days-old CURRENT with GENERIC-NODEBUG.
This Mac uses an SSD, and FreeBSD in the VM can easily swap out more than 100MB per second with a regular swap device.
With ZFS/ZVOL swap space, OOM kicks in immediately under dd.
I tested several more times, and each test killed processes at 20MB of swap space usage.
When I run make -C /usr/src buildworld -j 30 and slowly trigger swap-out during the llvm compilation, OOM doesn't kick in, even with the ZFS/ZVOL swap configuration, until swap space is fully consumed.

One thing I really like about testing this kind of case in a VM is that I can roll back to a pre-crash point with a snapshot, so I can test multiple times easily and quickly and also undo crashes - I can go back to normal without a reboot or fsck. This is off topic, though.

In D22088#482770, @ota_j.email.ne.jp wrote:

After further testing, I noticed that using ZFS as swap space reproduces the problem very consistently.

ZFS and zvols are unusable as swap, so your observations are not valid. The problem with ZFS is that any write requires memory allocation, perhaps quite a large allocation. This means that trying to swap out anything results in even more memory pressure than not swapping out. It is known that using ZFS for swap causes deadlocks for that reason, and your experiments suggest that it causes OOM, which is not surprising.

Until that behavior of ZFS is fixed, it cannot be used as an argument about brokenness of either the VM or, specifically, the OOM detection code. OTOH, if you can construct a case where OOM is triggered using only a high allocation rate on tmpfs, I would be very interested in the reproducer.

I see. I had only added zfs/zvol swap space temporarily from the command line within the last couple of weeks, for other reasons. I do see a few too-early OOM kills on a few machines, although they are very rare and sporadic, so it hasn't been easy to find a trigger.

Root on ZFS is increasingly popular, and I wonder how people set up their swap space. I'd then like to update the swapon man page to discourage zfs/zvol swap and mention its shortcomings.

Suspend for now.