Page MenuHomeFreeBSD

Collection of fixes for OOM handling of some corner cases.
Needs ReviewPublic

Authored by kib on Dec 28 2017, 8:44 PM.

Details

Summary

The patch contains three somewhat related fixes, all dealing with either OOM inability to resolve deadlock or OOM not triggering when it should.

  1. In addition to pagedaemon initiating OOM, also do it from the vm_fault(). Namely, if the thread waits for a free page to satisfy page fault some preconfigured amount of time, trigger OOM. These triggers are rate-limited, due to a usual case of several threads of the same multi-threaded process to enter fault handler simultaneously. The faults from pagedaemon threads participate in the calculation of OOM rate, but are not under the limit. This part works around the issue which I have on some machines.
  1. Larry McVoy reported a load where large process was swapped out and then selected as OOM victim. Since process must be made runnable to exit, and since we are low on memory thus not swapping in processes, OOM kill appear to be not effective. Handle the case by making swapper to check for killed processes if low on memory and swap them in. Similarly with vm_fault() allocating the page to handle fault of killed process with SYSTEM priority, kernel stacks of the killed process swapped in with SYSTEM priority, to avoid swapper blocking.
  1. Peter Jeremy reported a load where single anonymous object consumed almost all memory on the large system. Swapout code executes the iteration over the corresponding object page queue for long time, owning the map and object locks. This blocked pagedaemon which tries to lock the object, and blocked other threads in the process in vm_fault waiting for the map lock. Handle the issue by terminating the deactivation loop if we executed too long and by yielding at the top level in vm_daemon. Also, change the map lock mode in vm_swapout_map_deactivate_pages() to read.

Patches 1 and 2 were discussed with Mark. Peter tested patches 2 and 3, I used previous version of the patch 1 on my machines.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 20295

Event Timeline

kib created this revision.Dec 28 2017, 8:44 PM
alc added inline comments.Dec 28 2017, 10:17 PM
sys/vm/vm_swapout.c
273 ↗(On Diff #37156)

This and the below unlock changes are obviously correct and an improvement. Go ahead and commit them.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 28 2017, 10:56 PM
This revision was automatically updated to reflect the committed changes.
kib reopened this revision.Dec 28 2017, 11:00 PM
kib updated this revision to Diff 37164.Dec 28 2017, 11:02 PM

Update patch after r327316.

alc added a comment.Dec 28 2017, 11:03 PM

Does swapout_procs() actually need an exclusive lock map either?

alc added a comment.Dec 28 2017, 11:11 PM

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead the reactivation of pages that were only accessed by the swapped out process?

alc added inline comments.Dec 28 2017, 11:22 PM
sys/vm/vm_swapout.c
756 ↗(On Diff #37164)

Go ahead and commit this change to the comment. It seems unrelated to the rest anyway.

kib added a comment.Dec 28 2017, 11:48 PM
In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive lock map either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries nor it makes decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead the reactivation of pages that were only accessed by the swapped out process?

I am not sure. swapout() occurs when the process was idle for long enough time. Mist likely, pagedaemon already processed the pages and there is no lingering reference bits in the pte. If pagedaemon did not scanned that pages yet, we are definitely not low on memory and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 28 2017, 11:50 PM
Closed by commit rS327319: Clean up the comment. (authored by kib, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.
kib reopened this revision.Dec 28 2017, 11:51 PM
kib updated this revision to Diff 37168.Dec 28 2017, 11:54 PM

Update the patch after r327319.

alc added a comment.Dec 29 2017, 3:07 AM
In D13671#285730, @kib wrote:
In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive lock map either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries nor it makes decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead the reactivation of pages that were only accessed by the swapped out process?

I am not sure. swapout() occurs when the process was idle for long enough time. Mist likely, pagedaemon already processed the pages and there is no lingering reference bits in the pte. If pagedaemon did not scanned that pages yet, we are definitely not low on memory and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.

Remove the vm map lock acquisition and release from swapout_procs() then.

alc added inline comments.Dec 29 2017, 3:18 AM
sys/vm/vm_swapout.c
568 ↗(On Diff #37168)

"j" is out of order. Include "i" here too?

(Whatever you decide to do about "i", go ahead and commit this.)

659 ↗(On Diff #37168)

Merge the "*pp" definition into this one?

662 ↗(On Diff #37168)

Go ahead and commit this with the other variable definition style fixes.

kib updated this revision to Diff 37194.Dec 29 2017, 8:40 PM

Rebase the patch after recent vm_swapout.c commits.

alc added inline comments.Jan 1 2018, 6:03 PM
sys/vm/vm_swapout.c
211–212 ↗(On Diff #37194)

I'm okay with this change and ...

527 ↗(On Diff #37194)

... this change. Go ahead and commit them.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 1 2018, 7:27 PM
Closed by commit rS327468: Do not let vm_daemon run unbounded. (authored by kib, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.
kib reopened this revision.Jan 1 2018, 7:28 PM
kib updated this revision to Diff 37365.Jan 1 2018, 7:31 PM

Re-merge after r327468.

alc added inline comments.Jan 3 2018, 6:51 PM
sys/vm/vm_swapout.c
692 ↗(On Diff #37365)

You are multiplying by "hz" twice, here and in pagedaemon_wait().

kib updated this revision to Diff 37471.Jan 3 2018, 7:04 PM

Remove double *hz.

alc added a comment.Jan 3 2018, 7:13 PM

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Index: vm/vm_swapout.c
===================================================================
--- vm/vm_swapout.c     (revision 327509)
+++ vm/vm_swapout.c     (working copy)
@@ -546,7 +546,7 @@ vm_thread_swapout(struct thread *td)
                        panic("vm_thread_swapout: kstack already missing?");
                vm_page_dirty(m);
                vm_page_lock(m);
-               vm_page_unwire(m, PQ_INACTIVE);
+               vm_page_unwire(m, PQ_LAUNDRY);
                vm_page_unlock(m);
        }
        VM_OBJECT_WUNLOCK(ksobj);

The whole point of swapping out the process is to make its pages more quickly available for other uses.

markj added a comment.Jan 3 2018, 7:16 PM
In D13671#287790, @kib wrote:

Remove double *hz.

Shouldn't the "timo" parameters to vm_wait*() and pagedaemon_wait() be in ticks? That would be more consistent with other _sleep() wrappers, and allow more granularity for hypothetical future consumers of those interfaces.

In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Seems reasonable to me.

sys/vm/vm_page.c
3064

"until free pages are available for allocation, or the requested timeout has elapsed."

kib added a comment.Jan 3 2018, 7:56 PM
In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

I think this is fine.

kib updated this revision to Diff 37479.Jan 3 2018, 8:03 PM
kib marked an inline comment as done.

Makr vm_wait() taking timeout in ticks.
Adjust comment.

swills added a subscriber: swills.Jan 14 2018, 1:42 AM
kib updated this revision to Diff 41435.Apr 13 2018, 4:57 PM

Rebase after all VM changes.

alc added inline comments.Jul 27 2018, 5:23 PM
sys/vm/vm_swapout.c
664 ↗(On Diff #41435)

If a swapped out process is killed, shouldn't we try to dispose of it regardless of whether vm_page_count_min() is true?

kib updated this revision to Diff 45920.Jul 27 2018, 6:18 PM
kib marked an inline comment as done.

Fault a killed process in unconditionally.

alc added inline comments.Jul 28 2018, 7:04 PM
sys/vm/vm_swapout.c
660 ↗(On Diff #45920)

Are you sure that you wanted to delete this vm_wait_min()? Doing so allows for swapping in processes in the later loop.

691–698 ↗(On Diff #45920)

I'd like to suggest a different way of structuring this. In essence, move the P_WKILLED check into the below, existing FOREACH_PROC_IN_SYSTEM(p) loop. In more detail, I'm suggesting:

loop:
        min_flag = vm_page_count_min();
...
        FOREACH_PROC_IN_SYSTEM(p) {
                PROC_LOCK(p);
                if (p->p_state == PRS_NEW ||
                    p->p_flag & (P_SWAPPINGOUT | P_SWAPPINGIN | P_INMEM)) {
                        PROC_UNLOCK(p);
                        continue;
                }
                if (p->p_state == PRS_NORMAL &&
                    (p->p_flag & P_WKILLED) != 0) {
                        ...
                }
                if (min_flag) {
                        PROC_UNLOCK(p);
                        continue;
                }
                ...

I see the point of not swapping in a process when vm_page_count_min(), but I'm not sure that I see the point of implementing the delay with vm_wait_min(). The below tsleep() would suffice.

kib added inline comments.Jul 28 2018, 7:54 PM
sys/vm/vm_swapout.c
660 ↗(On Diff #45920)

For the loop which brings in WKILLED processes, I definitely wanted to remove the wait.

In fact, I considered moving the faultin() logic for the WKILLED processes to the separate thread, to not wait for the faulting the stack of other processes, or the "swapin" sleep. With your proposal, I can add a racy wakeup to avoid tsleep().

kib updated this revision to Diff 45961.Jul 28 2018, 7:55 PM

Move WKILLED handling into the other allproc loop.

alc added inline comments.Jul 28 2018, 8:40 PM
sys/vm/vm_swapout.c
677 ↗(On Diff #45961)

Is there a reason that we don't maintain a count of swapped out processes so that we can skip this loop when that count is zero?

alc added a comment.Jul 28 2018, 8:47 PM

Could we please move the bits on killing swapped out processes to their own separate review?

kib updated this revision to Diff 46296.Aug 4 2018, 8:50 PM

The last bit of the patch, OOM on too long wait for a page from vm_fault().

kib updated this revision to Diff 47789.Sep 7 2018, 12:15 PM

rebase

alc added inline comments.Oct 19 2018, 5:03 PM
sys/amd64/amd64/pmap.c
2801 ↗(On Diff #47789)

I don't see any callers that actually use the timeout with vm_wait().

kib updated this revision to Diff 49319.Oct 19 2018, 7:51 PM

Remove no longer needed addition of timo argument to vm_wait().

alc added inline comments.Sun, Nov 18, 8:57 PM
sys/vm/vm_fault.c
829–830

I want to ask a high-level question. I see the point of adding a timeout to vm_waitpfault(). Otherwise, the calling thread will sleep until the current memory shortage is resolved, and killing this process may be the preferred way to address that shortage. However, I want to ask why simply introducing the timeout doesn't suffice? Won't the existing OOM code identify this process as problematic and attempt to kill it? And, if all of the faulting, sleeping threads eventually wake up, won't the existing code in vm_fault() for handling faults by killed processes allow for the process to be terminated?

kib marked an inline comment as done.Sun, Nov 18, 10:37 PM
kib added inline comments.
sys/vm/vm_fault.c
829–830

Theoretically OOM P_KILLED check should be enough, but practically it was not in my situation which prompted me to write the patch.

Sometimes pageadaemon can make very small (units of pages) progress sometimes, which is enough for the OOM killer to reset the oom sequence, but not enough for the system to make real progress. Basically, any random vm_page_free() sabotages OOM, In my case I had a bug introduced into the build system which caused many instances of the parallel make to consume a lot of anon memory. There were several dozen of processes each eating several GBs, all non-killable. Machine can sit several hours in this state until I hit reset.

Making the timeout for paging allocation allowed it to recover on its own. Simply introducing the timeout is not enough for the reason stated above, because OOM really did not killed anything.

alc added inline comments.Mon, Nov 19, 5:28 PM
sys/vm/vm_page.c
3065–3066

Given that this function is defined to have only one caller, vm_fault(), couldn't we pass PCATCH to the msleep() call?

kib marked an inline comment as done.Mon, Nov 19, 6:11 PM
kib added inline comments.
sys/vm/vm_page.c
3065–3066

So what would be the semantic ? We allow the sleep interruption with PCATCH, but the signal cannot be delivered right now so it is queued for the process. I do not see a way to interrupt the page fault handler, we either should fail it or restart. So the signal sent would just cause the earlier restart of the fault handler loop ?

IMO timeout is enough for that.