Collection of fixes for OOM handling of some corner cases.
Needs ReviewPublic

Authored by kib on Dec 28 2017, 8:44 PM.

Details

Summary

The patch contains three somewhat related fixes, all dealing with either OOM inability to resolve deadlock or OOM not triggering when it should.

  1. In addition to pagedaemon initiating OOM, also do it from the vm_fault(). Namely, if the thread waits for a free page to satisfy page fault some preconfigured amount of time, trigger OOM. These triggers are rate-limited, due to a usual case of several threads of the same multi-threaded process to enter fault handler simultaneously. The faults from pagedaemon threads participate in the calculation of OOM rate, but are not under the limit. This part works around the issue which I have on some machines.
  1. Larry McVoy reported a load where large process was swapped out and then selected as OOM victim. Since process must be made runnable to exit, and since we are low on memory thus not swapping in processes, OOM kill appear to be not effective. Handle the case by making swapper to check for killed processes if low on memory and swap them in. Similarly with vm_fault() allocating the page to handle fault of killed process with SYSTEM priority, kernel stacks of the killed process swapped in with SYSTEM priority, to avoid swapper blocking.
  1. Peter Jeremy reported a load where single anonymous object consumed almost all memory on the large system. Swapout code executes the iteration over the corresponding object page queue for long time, owning the map and object locks. This blocked pagedaemon which tries to lock the object, and blocked other threads in the process in vm_fault waiting for the map lock. Handle the issue by terminating the deactivation loop if we executed too long and by yielding at the top level in vm_daemon. Also, change the map lock mode in vm_swapout_map_deactivate_pages() to read.

Patches 1 and 2 were discussed with Mark. Peter tested patches 2 and 3, I used previous version of the patch 1 on my machines.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped
kib created this revision.Dec 28 2017, 8:44 PM
alc added inline comments.Dec 28 2017, 10:17 PM
sys/vm/vm_swapout.c
273

This and the below unlock changes are obviously correct and an improvement. Go ahead and commit them.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 28 2017, 10:56 PM
This revision was automatically updated to reflect the committed changes.
kib reopened this revision.Dec 28 2017, 11:00 PM
kib updated this revision to Diff 37164.Dec 28 2017, 11:02 PM

Update patch after r327316.

alc added a comment.Dec 28 2017, 11:03 PM

Does swapout_procs() actually need an exclusive lock map either?

alc added a comment.Dec 28 2017, 11:11 PM

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead the reactivation of pages that were only accessed by the swapped out process?

alc added inline comments.Dec 28 2017, 11:22 PM
sys/vm/vm_swapout.c
755

Go ahead and commit this change to the comment. It seems unrelated to the rest anyway.

kib added a comment.Dec 28 2017, 11:48 PM
In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive lock map either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries nor it makes decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead the reactivation of pages that were only accessed by the swapped out process?

I am not sure. swapout() occurs when the process was idle for long enough time. Mist likely, pagedaemon already processed the pages and there is no lingering reference bits in the pte. If pagedaemon did not scanned that pages yet, we are definitely not low on memory and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 28 2017, 11:50 PM
Closed by commit rS327319: Clean up the comment. (authored by kib, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.
kib reopened this revision.Dec 28 2017, 11:51 PM
kib updated this revision to Diff 37168.Dec 28 2017, 11:54 PM

Update the patch after r327319.

alc added a comment.Dec 29 2017, 3:07 AM
In D13671#285730, @kib wrote:
In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive lock map either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries nor it makes decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead the reactivation of pages that were only accessed by the swapped out process?

I am not sure. swapout() occurs when the process was idle for long enough time. Mist likely, pagedaemon already processed the pages and there is no lingering reference bits in the pte. If pagedaemon did not scanned that pages yet, we are definitely not low on memory and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.

Remove the vm map lock acquisition and release from swapout_procs() then.

alc added inline comments.Dec 29 2017, 3:18 AM
sys/vm/vm_swapout.c
568

"j" is out of order. Include "i" here too?

(Whatever you decide to do about "i", go ahead and commit this.)

659

Merge the "*pp" definition into this one?

661

Go ahead and commit this with the other variable definition style fixes.

kib updated this revision to Diff 37194.Dec 29 2017, 8:40 PM

Rebase the patch after recent vm_swapout.c commits.

alc added inline comments.Jan 1 2018, 6:03 PM
sys/vm/vm_swapout.c
211–212

I'm okay with this change and ...

527

... this change. Go ahead and commit them.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 1 2018, 7:27 PM
Closed by commit rS327468: Do not let vm_daemon run unbounded. (authored by kib, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.
kib reopened this revision.Jan 1 2018, 7:28 PM
kib updated this revision to Diff 37365.Jan 1 2018, 7:31 PM

Re-merge after r327468.

alc added inline comments.Jan 3 2018, 6:51 PM
sys/vm/vm_swapout.c
692

You are multiplying by "hz" twice, here and in pagedaemon_wait().

kib updated this revision to Diff 37471.Jan 3 2018, 7:04 PM

Remove double *hz.

alc added a comment.Jan 3 2018, 7:13 PM

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Index: vm/vm_swapout.c
===================================================================
--- vm/vm_swapout.c     (revision 327509)
+++ vm/vm_swapout.c     (working copy)
@@ -546,7 +546,7 @@ vm_thread_swapout(struct thread *td)
                        panic("vm_thread_swapout: kstack already missing?");
                vm_page_dirty(m);
                vm_page_lock(m);
-               vm_page_unwire(m, PQ_INACTIVE);
+               vm_page_unwire(m, PQ_LAUNDRY);
                vm_page_unlock(m);
        }
        VM_OBJECT_WUNLOCK(ksobj);

The whole point of swapping out the process is to make its pages more quickly available for other uses.

markj added a comment.Jan 3 2018, 7:16 PM
In D13671#287790, @kib wrote:

Remove double *hz.

Shouldn't the "timo" parameters to vm_wait*() and pagedaemon_wait() be in ticks? That would be more consistent with other _sleep() wrappers, and allow more granularity for hypothetical future consumers of those interfaces.

In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Seems reasonable to me.

sys/vm/vm_page.c
3017

"until free pages are available for allocation, or the requested timeout has elapsed."

kib added a comment.Jan 3 2018, 7:56 PM
In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

I think this is fine.

kib updated this revision to Diff 37479.Jan 3 2018, 8:03 PM
kib marked an inline comment as done.

Makr vm_wait() taking timeout in ticks.
Adjust comment.

swills added a subscriber: swills.Jan 14 2018, 1:42 AM
kib updated this revision to Diff 41435.Apr 13 2018, 4:57 PM

Rebase after all VM changes.