
Collection of fixes for OOM handling of some corner cases.
ClosedPublic

Authored by kib on Dec 28 2017, 8:44 PM.

Details

Summary

The patch contains three somewhat related fixes, all dealing either with the OOM killer's inability to resolve a deadlock or with OOM not triggering when it should.

  1. In addition to the pagedaemon initiating OOM, also do it from vm_fault(). Namely, if a thread waits for a free page to satisfy a page fault for longer than a preconfigured amount of time, trigger OOM. These triggers are rate-limited, since it is common for several threads of the same multi-threaded process to enter the fault handler simultaneously. Faults from pagedaemon threads participate in the calculation of the OOM rate but are not subject to the limit. This part works around an issue I see on some machines.
  1. Larry McVoy reported a load where a large process was swapped out and then selected as the OOM victim. Since a process must be made runnable to exit, and since we are low on memory and thus not swapping in processes, the OOM kill appears to be ineffective. Handle the case by making the swapper check for killed processes when low on memory and swap them in. Similarly, vm_fault() allocates the page to handle a fault of a killed process with SYSTEM priority, and kernel stacks of the killed process are swapped in with SYSTEM priority, to avoid blocking the swapper.
  1. Peter Jeremy reported a load where a single anonymous object consumed almost all memory on a large system. The swapout code iterates over the corresponding object's page queue for a long time while owning the map and object locks. This blocked the pagedaemon, which tries to lock the object, and blocked other threads of the process in vm_fault() waiting for the map lock. Handle the issue by terminating the deactivation loop if it has executed for too long, and by yielding at the top level in vm_daemon. Also, change the map lock mode in vm_swapout_map_deactivate_pages() to read.

Patches 1 and 2 were discussed with Mark. Peter tested patches 2 and 3; I used a previous version of patch 1 on my machines.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

This revision was not accepted when it landed; it landed in state Needs Review. Dec 28 2017, 10:56 PM
This revision was automatically updated to reflect the committed changes.

Update patch after r327316.

Does swapout_procs() actually need an exclusive map lock either?

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead to the reactivation of pages that were only accessed by the swapped out process?

sys/vm/vm_swapout.c
756 ↗(On Diff #37164)

Go ahead and commit this change to the comment. It seems unrelated to the rest anyway.

In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive map lock either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries, nor does it make decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead to the reactivation of pages that were only accessed by the swapped out process?

I am not sure. swapout() occurs when the process has been idle for long enough. Most likely, the pagedaemon has already processed the pages and there are no lingering reference bits in the PTEs. If the pagedaemon has not scanned those pages yet, we are definitely not low on memory, and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.

This revision was not accepted when it landed; it landed in state Needs Review. Dec 28 2017, 11:50 PM
Closed by commit rS327319: Clean up the comment. (authored by kib).
This revision was automatically updated to reflect the committed changes.

Update the patch after r327319.

In D13671#285730, @kib wrote:
In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive map lock either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries, nor does it make decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped out process potentially lead to the reactivation of pages that were only accessed by the swapped out process?

I am not sure. swapout() occurs when the process has been idle for long enough. Most likely, the pagedaemon has already processed the pages and there are no lingering reference bits in the PTEs. If the pagedaemon has not scanned those pages yet, we are definitely not low on memory, and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.

Remove the vm map lock acquisition and release from swapout_procs() then.

sys/vm/vm_swapout.c
568 ↗(On Diff #37168)

"j" is out of order. Include "i" here too?

(Whatever you decide to do about "i", go ahead and commit this.)

659 ↗(On Diff #37168)

Merge the "*pp" definition into this one?

662 ↗(On Diff #37168)

Go ahead and commit this with the other variable definition style fixes.

Rebase the patch after recent vm_swapout.c commits.

sys/vm/vm_swapout.c
211–212 ↗(On Diff #37194)

I'm okay with this change and ...

527 ↗(On Diff #37194)

... this change. Go ahead and commit them.

This revision was not accepted when it landed; it landed in state Needs Review. Jan 1 2018, 7:27 PM
This revision was automatically updated to reflect the committed changes.

Re-merge after r327468.

sys/vm/vm_swapout.c
692 ↗(On Diff #37365)

You are multiplying by "hz" twice, here and in pagedaemon_wait().

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Index: vm/vm_swapout.c
===================================================================
--- vm/vm_swapout.c     (revision 327509)
+++ vm/vm_swapout.c     (working copy)
@@ -546,7 +546,7 @@ vm_thread_swapout(struct thread *td)
                        panic("vm_thread_swapout: kstack already missing?");
                vm_page_dirty(m);
                vm_page_lock(m);
-               vm_page_unwire(m, PQ_INACTIVE);
+               vm_page_unwire(m, PQ_LAUNDRY);
                vm_page_unlock(m);
        }
        VM_OBJECT_WUNLOCK(ksobj);

The whole point of swapping out the process is to make its pages more quickly available for other uses.

In D13671#287790, @kib wrote:

Remove double *hz.

Shouldn't the "timo" parameters to vm_wait*() and pagedaemon_wait() be in ticks? That would be more consistent with other _sleep() wrappers, and allow more granularity for hypothetical future consumers of those interfaces.

In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Seems reasonable to me.

sys/vm/vm_page.c
2714 ↗(On Diff #37365)

"until free pages are available for allocation, or the requested timeout has elapsed."

In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

I think this is fine.

kib marked an inline comment as done.

Make vm_wait() take the timeout in ticks.
Adjust the comment.

Rebase after all VM changes.

sys/vm/vm_swapout.c
664 ↗(On Diff #41435)

If a swapped out process is killed, shouldn't we try to dispose of it regardless of whether vm_page_count_min() is true?

kib marked an inline comment as done.

Fault a killed process in unconditionally.

sys/vm/vm_swapout.c
660 ↗(On Diff #45920)

Are you sure that you wanted to delete this vm_wait_min()? Doing so allows for swapping in processes in the later loop.

691–698 ↗(On Diff #45920)

I'd like to suggest a different way of structuring this. In essence, move the P_WKILLED check into the below, existing FOREACH_PROC_IN_SYSTEM(p) loop. In more detail, I'm suggesting:

loop:
        min_flag = vm_page_count_min();
...
        FOREACH_PROC_IN_SYSTEM(p) {
                PROC_LOCK(p);
                if (p->p_state == PRS_NEW ||
                    p->p_flag & (P_SWAPPINGOUT | P_SWAPPINGIN | P_INMEM)) {
                        PROC_UNLOCK(p);
                        continue;
                }
                if (p->p_state == PRS_NORMAL &&
                    (p->p_flag & P_WKILLED) != 0) {
                        ...
                }
                if (min_flag) {
                        PROC_UNLOCK(p);
                        continue;
                }
                ...

I see the point of not swapping in a process when vm_page_count_min(), but I'm not sure that I see the point of implementing the delay with vm_wait_min(). The below tsleep() would suffice.

sys/vm/vm_swapout.c
660 ↗(On Diff #45920)

For the loop which brings in WKILLED processes, I definitely wanted to remove the wait.

In fact, I considered moving the faultin() logic for the WKILLED processes to a separate thread, so as not to wait on faulting in the stacks of other processes, or on the "swapin" sleep. With your proposal, I can add a racy wakeup to avoid the tsleep().

Move WKILLED handling into the other allproc loop.

sys/vm/vm_swapout.c
677 ↗(On Diff #45961)

Is there a reason that we don't maintain a count of swapped out processes so that we can skip this loop when that count is zero?

Could we please move the bits on killing swapped out processes to their own separate review?

The last bit of the patch: OOM on too long a wait for a page from vm_fault().

sys/amd64/amd64/pmap.c
2801 ↗(On Diff #47789)

I don't see any callers that actually use the timeout with vm_wait().

Remove no longer needed addition of timo argument to vm_wait().

sys/vm/vm_fault.c
829–830 ↗(On Diff #49319)

I want to ask a high-level question. I see the point of adding a timeout to vm_waitpfault(). Otherwise, the calling thread will sleep until the current memory shortage is resolved, and killing this process may be the preferred way to address that shortage. However, I want to ask why simply introducing the timeout doesn't suffice? Won't the existing OOM code identify this process as problematic and attempt to kill it? And, if all of the faulting, sleeping threads eventually wake up, won't the existing code in vm_fault() for handling faults by killed processes allow for the process to be terminated?

kib marked an inline comment as done. Nov 18 2018, 10:37 PM
kib added inline comments.
sys/vm/vm_fault.c
829–830 ↗(On Diff #49319)

Theoretically the OOM P_KILLED check should be enough, but in practice it was not, in the situation which prompted me to write the patch.

Sometimes the pagedaemon can make very small progress (units of pages), which is enough for the OOM killer to reset the OOM sequence, but not enough for the system to make real progress. Basically, any random vm_page_free() sabotages OOM. In my case I had a bug introduced into the build system which caused many instances of parallel make to consume a lot of anonymous memory. There were several dozen processes, each eating several GBs, all non-killable. The machine can sit several hours in this state until I hit reset.

Adding the timeout to the page allocation allowed the machine to recover on its own. Simply introducing the timeout is not enough, for the reason stated above: the OOM killer really did not kill anything.

sys/vm/vm_page.c
3065–3066 ↗(On Diff #49319)

Given that this function is defined to have only one caller, vm_fault(), couldn't we pass PCATCH to the msleep() call?

kib marked an inline comment as done. Nov 19 2018, 6:11 PM
kib added inline comments.
sys/vm/vm_page.c
3065–3066 ↗(On Diff #49319)

So what would be the semantics? We allow the sleep interruption with PCATCH, but the signal cannot be delivered right now, so it is queued for the process. I do not see a way to interrupt the page fault handler; we either should fail it or restart. So the signal sent would just cause an earlier restart of the fault handler loop?

IMO timeout is enough for that.

Update the patch to the latest merge, trying to revive the review process.

sys/vm/vm_pageout.c
1759 ↗(On Diff #60729)

Suppose the page daemon performs the first OOM kill since boot. ratelim_count is set to 0. Suppose a thread attempts a PF OOM kill within vm_oom_pf_secs after the first kill. It will succeed after setting ratelim_count = 1, so multiple OOM kills within the window are permitted. Shouldn't ratelim_count be reset to 1, not 0?

sys/vm/vm_pageout.c
1759 ↗(On Diff #60729)

I believe vm_oom_ratelim_count is a leftover from an earlier, more complicated scheme. I do not see any use for it now: I should always remember the time of the last OOM run in vm_oom_ratelim_last, and for VM_OOM_MEM_PF, compare now with _last.

Eliminate vm_oom_ratelim_count.

I think this is ok. Just some thoughts below.

In head, if a process sleeps for a long time in vm_waitpfault(), the page daemon can deactivate and reclaim most of the process' mapped pages. If it reclaims enough pages to wake up the process, the process will run and immediately fault a number of times, adding pressure to the pool of free pages. So I can see why the OOM killer would fail to make progress: with many large processes we will just reclaim pages from one process to satisfy PF allocations for another, and this can go on indefinitely.

With your patch this is less likely to happen since a starved process can eventually trigger OOM kills, but in principle it can still be a problem if the page daemon reclaims user-mapped pages quickly enough. We can tune the sysctls you added, but I wonder if there is a way to recognize this pattern in general.

sys/vm/vm_fault.c
141 ↗(On Diff #60783)

"Number of page allocation attempts before the page fault handler triggers OOM handling"?

146 ↗(On Diff #60783)

"Number of seconds to wait for free pages before retrying the page fault handler"?

This revision is now accepted and ready to land. Aug 15 2019, 6:01 PM

I think this is ok. Just some thoughts below.

In head, if a process sleeps for a long time in vm_waitpfault(), the page daemon can deactivate and reclaim most of the process' mapped pages. If it reclaims enough pages to wake up the process, the process will run and immediately fault a number of times, adding pressure to the pool of free pages. So I can see why the OOM killer would fail to make progress: with many large processes we will just reclaim pages from one process to satisfy PF allocations for another, and this can go on indefinitely.

With your patch this is less likely to happen since a starved process can eventually trigger OOM kills, but in principle it can still be a problem if the page daemon reclaims user-mapped pages quickly enough. We can tune the sysctls you added, but I wonder if there is a way to recognize this pattern in general.

For me it first happened when the machine had no swap configured; then, after I added swap, the swap space was completely exhausted and the situation repeated, AFAIR.

kib added a subscriber: pho.

Add sysctls descriptions, slightly edited from the wording provided by Mark.

This revision now requires review to proceed. Aug 15 2019, 7:02 PM

Peter, could you please test the patch? The only interesting scenarios are OOMs. It should allow the system to unwind from low-memory situations much quicker, and sometimes it should get out of problems that previously caused a hang.

This revision was not accepted when it landed; it landed in state Needs Review. Aug 16 2019, 9:44 AM
This revision was automatically updated to reflect the committed changes.