
Collection of fixes for OOM handling of some corner cases.
Closed, Public

Authored by kib on Dec 28 2017, 8:44 PM.

Details

Summary

The patch contains three somewhat related fixes, all dealing either with OOM's inability to resolve a deadlock or with OOM not triggering when it should.

  1. In addition to the pagedaemon initiating OOM, also do it from vm_fault(). Namely, if a thread has waited for a free page to satisfy a page fault for some preconfigured amount of time, trigger OOM. These triggers are rate-limited, because in the usual case several threads of the same multi-threaded process enter the fault handler simultaneously. Faults from pagedaemon threads participate in the calculation of the OOM rate, but are not subject to the limit. This part works around an issue I see on some machines (a sketch follows this list).
  2. Larry McVoy reported a load where a large process was swapped out and then selected as the OOM victim. Since a process must be made runnable to exit, and since we are low on memory and thus not swapping processes in, the OOM kill turned out to be ineffective. Handle the case by making the swapper check for killed processes when low on memory and swap them in. Similarly to vm_fault() allocating pages with SYSTEM priority to handle a fault of a killed process, the kernel stacks of the killed process are swapped in with SYSTEM priority, to avoid blocking the swapper.
  3. Peter Jeremy reported a load where a single anonymous object consumed almost all memory on a large system. The swapout code iterated over the corresponding object's page queue for a long time while owning the map and object locks. This blocked the pagedaemon, which tries to lock the object, and blocked other threads of the process in vm_fault() waiting for the map lock. Handle the issue by terminating the deactivation loop if it has executed for too long and by yielding at the top level in vm_daemon. Also, change the map lock mode in vm_swapout_map_deactivate_pages() to read.
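To make fix 1 concrete, a minimal sketch of the shape of the change (hypothetical code, not the committed diff: the counter fs.oom and the tunable names and defaults are illustrative; vm_waitpfault() with a tick timeout and VM_OOM_MEM_PF are discussed later in this review):

static int vm_pfault_oom_attempts = 3;	/* timed waits before OOM */
static int vm_pfault_oom_wait = 10;	/* seconds per wait */

	/* In vm_fault(), after a page allocation fails: */
	if (fs.oom < vm_pfault_oom_attempts) {
		fs.oom++;
		/* Sleep until a free page appears or the timeout expires. */
		vm_waitpfault(vm_pfault_oom_wait * hz);
		goto RetryFault;
	}
	/* Waited too long; trigger the rate-limited page-fault OOM. */
	vm_pageout_oom(VM_OOM_MEM_PF);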

Patches 1 and 2 were discussed with Mark. Peter tested patches 2 and 3; I used a previous version of patch 1 on my machines.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

This revision was not accepted when it landed; it landed in state Needs Review. Dec 28 2017, 10:56 PM
This revision was automatically updated to reflect the committed changes.

Update patch after r327316.

Does swapout_procs() actually need an exclusive map lock either?

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped-out process potentially lead to the reactivation of pages that were only accessed by the swapped-out process?

sys/vm/vm_swapout.c
755

Go ahead and commit this change to the comment. It seems unrelated to the rest anyway.

In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive map lock either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries, nor does it make decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped-out process potentially lead to the reactivation of pages that were only accessed by the swapped-out process?

I am not sure. swapout() occurs when the process has been idle for a long enough time. Most likely, the pagedaemon has already processed the pages and there are no lingering reference bits in the PTEs. If the pagedaemon has not scanned those pages yet, we are definitely not low on memory, and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.
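For illustration, the suggested pass might look roughly like this (a hypothetical sketch, not code from this review; the helper name swapout_advise_dontneed() is invented, and the entry walk uses the vm_map linkage of that era):

static void
swapout_advise_dontneed(struct vmspace *vm)
{
	vm_map_t map;
	vm_map_entry_t entry;

	map = &vm->vm_map;
	vm_map_lock_read(map);		/* pmap_advise() needs the map lock */
	for (entry = map->header.next; entry != &map->header;
	    entry = entry->next) {
		if ((entry->eflags & MAP_ENTRY_IS_SUB_MAP) == 0)
			pmap_advise(map->pmap, entry->start, entry->end,
			    MADV_DONTNEED);
	}
	vm_map_unlock_read(map);
}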

This revision was not accepted when it landed; it landed in state Needs Review. Dec 28 2017, 11:50 PM
Closed by commit rS327319: Clean up the comment. (authored by kib).
This revision was automatically updated to reflect the committed changes.

Update the patch after r327319.

In D13671#285730, @kib wrote:
In D13671#285719, @alc wrote:

Does swapout_procs() actually need an exclusive map lock either?

I think that swapout_procs() in fact does not need the map lock at all. It does not currently access the map entries, nor does it make decisions based on the residency count or swap usage.

In D13671#285720, @alc wrote:

Shouldn't swapout() do a pmap_advise(DONTNEED) on every map entry? Otherwise, won't lingering reference bits from the pmap of the swapped-out process potentially lead to the reactivation of pages that were only accessed by the swapped-out process?

I am not sure. swapout() occurs when the process has been idle for a long enough time. Most likely, the pagedaemon has already processed the pages and there are no lingering reference bits in the PTEs. If the pagedaemon has not scanned those pages yet, we are definitely not low on memory, and reactivation maintains fairness.

If we start doing pmap_advise(), the map lock will be needed.

Remove the vm map lock acquisition and release from swapout_procs() then.

sys/vm/vm_swapout.c
568

"j" is out of order. Include "i" here too?

(Whatever you decide to do about "i", go ahead and commit this.)

659

Merge the "*pp" definition into this one?

661

Go ahead and commit this with the other variable definition style fixes.

Rebase the patch after recent vm_swapout.c commits.

sys/vm/vm_swapout.c
211–212

I'm okay with this change and ...

527

... this change. Go ahead and commit them.

This revision was not accepted when it landed; it landed in state Needs Review. Jan 1 2018, 7:27 PM
This revision was automatically updated to reflect the committed changes.

Re-merge after r327468.

sys/vm/vm_swapout.c
692

You are multiplying by "hz" twice, here and in pagedaemon_wait().
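Schematically, the bug being pointed out (names illustrative, not the actual diff):

	/* Caller, already converting seconds to ticks: */
	pagedaemon_wait(timeout_secs * hz);

/* Callee, multiplying by hz a second time: */
static void
pagedaemon_wait(int timo)
{
	...
	/* BUG: timo is already in ticks, so "timo * hz" sleeps hz
	   times longer than intended. */
	tsleep(chan, PVM, "pdwait", timo * hz);
}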

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Index: vm/vm_swapout.c
===================================================================
--- vm/vm_swapout.c     (revision 327509)
+++ vm/vm_swapout.c     (working copy)
@@ -546,7 +546,7 @@ vm_thread_swapout(struct thread *td)
                        panic("vm_thread_swapout: kstack already missing?");
                vm_page_dirty(m);
                vm_page_lock(m);
-               vm_page_unwire(m, PQ_INACTIVE);
+               vm_page_unwire(m, PQ_LAUNDRY);
                vm_page_unlock(m);
        }
        VM_OBJECT_WUNLOCK(ksobj);

The whole point of swapping out the process is to make its pages more quickly available for other uses.

In D13671#287790, @kib wrote:

Remove double *hz.

Shouldn't the "timo" parameters to vm_wait*() and pagedaemon_wait() be in ticks? That would be more consistent with other _sleep() wrappers, and allow more granularity for hypothetical future consumers of those interfaces.

In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

Seems reasonable to me.

sys/vm/vm_page.c
3017

"until free pages are available for allocation, or the requested timeout has elapsed."

In D13671#287794, @alc wrote:

When we unmap a thread's stack, I think that the pages should go directly to the laundry.

I think this is fine.

kib marked an inline comment as done.

Make vm_wait() take the timeout in ticks.
Adjust comment.

Rebase after all VM changes.

sys/vm/vm_swapout.c
664

If a swapped out process is killed, shouldn't we try to dispose of it regardless of whether vm_page_count_min() is true?

kib marked an inline comment as done.

Fault a killed process in unconditionally.

sys/vm/vm_swapout.c
653

Are you sure that you wanted to delete this vm_wait_min()? Doing so allows for swapping in processes in the later loop.

683–690

I'd like to suggest a different way of structuring this. In essence, move the P_WKILLED check into the below, existing FOREACH_PROC_IN_SYSTEM(p) loop. In more detail, I'm suggesting:

loop:
        min_flag = vm_page_count_min();
...
        FOREACH_PROC_IN_SYSTEM(p) {
                PROC_LOCK(p);
                if (p->p_state == PRS_NEW ||
                    p->p_flag & (P_SWAPPINGOUT | P_SWAPPINGIN | P_INMEM)) {
                        PROC_UNLOCK(p);
                        continue;
                }
                if (p->p_state == PRS_NORMAL &&
                    (p->p_flag & P_WKILLED) != 0) {
                        ...
                }
                if (min_flag) {
                        PROC_UNLOCK(p);
                        continue;
                }
                ...

I see the point of not swapping in a process when vm_page_count_min(), but I'm not sure that I see the point of implementing the delay with vm_wait_min(). The below tsleep() would suffice.

sys/vm/vm_swapout.c
653

For the loop which brings in WKILLED processes, I definitely wanted to remove the wait.

In fact, I considered moving the faultin() logic for the WKILLED processes to a separate thread, to avoid waiting on faulting in the stacks of other processes, or on the "swapin" sleep. With your proposal, I can add a racy wakeup to avoid the tsleep().

Move WKILLED handling into the other allproc loop.

sys/vm/vm_swapout.c
699

Is there a reason that we don't maintain a count of swapped out processes so that we can skip this loop when that count is zero?

Could we please move the bits on killing swapped out processes to their own separate review?

The last bit of the patch: OOM on a too-long wait for a page in vm_fault().

sys/amd64/amd64/pmap.c
2752

I don't see any callers that actually use the timeout with vm_wait().

Remove the no-longer-needed addition of the timo argument to vm_wait().

sys/vm/vm_fault.c
802–803

I want to ask a high-level question. I see the point of adding a timeout to vm_waitpfault(). Otherwise, the calling thread will sleep until the current memory shortage is resolved, and killing this process may be the preferred way to address that shortage. However, I want to ask why simply introducing the timeout doesn't suffice? Won't the existing OOM code identify this process as problematic and attempt to kill it? And, if all of the faulting, sleeping threads eventually wake up, won't the existing code in vm_fault() for handling faults by killed processes allow for the process to be terminated?

kib marked an inline comment as done. Nov 18 2018, 10:37 PM
kib added inline comments.
sys/vm/vm_fault.c
802–803

Theoretically, the OOM P_KILLED check should be enough, but in practice it was not, in the situation which prompted me to write the patch.

Sometimes the pagedaemon can make very small (units of pages) progress, which is enough for the OOM killer to reset the OOM sequence, but not enough for the system to make real progress. Basically, any random vm_page_free() sabotages OOM. In my case, a bug introduced into the build system caused many instances of parallel make to consume a lot of anon memory. There were several dozen processes, each eating several GBs, all non-killable. The machine could sit in this state for several hours until I hit reset.

Adding the timeout to the paging allocation allowed the machine to recover on its own. Simply introducing the timeout alone is not enough, for the reason stated above: the existing OOM really did not kill anything.
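The shape of the page daemon's OOM trigger makes this failure mode visible (a schematic, not the actual vm_pageout.c code; vm_pageout_oom_seq is the existing vm.pageout_oom_seq knob):

	if (pages_freed > 0)
		oom_seq = 0;	/* any progress, however tiny, resets it */
	else if (++oom_seq >= vm_pageout_oom_seq)
		vm_pageout_oom(VM_OOM_MEM);	/* fires only after many
						   fruitless scans in a row */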

sys/vm/vm_page.c
3018–3019

Given that this function is defined to have only one caller, vm_fault(), couldn't we pass PCATCH to the msleep() call?

kib marked an inline comment as done. Nov 19 2018, 6:11 PM
kib added inline comments.
sys/vm/vm_page.c
3018–3019

So what would be the semantics? We allow sleep interruption with PCATCH, but the signal cannot be delivered right away, so it is queued for the process. I do not see a way to interrupt the page fault handler; we either have to fail it or restart it. So the sent signal would just cause an earlier restart of the fault handler loop?

IMO the timeout is enough for that.

Update the patch to the latest merge, trying to revive the review process.

sys/vm/vm_pageout.c
1695

Suppose the page daemon performs the first OOM kill since boot. ratelim_count is set to 0. Suppose a thread attempts a PF OOM kill within vm_oom_pf_secs after the first kill. It will succeed after setting ratelim_count = 1, so multiple OOM kills within the window are permitted. Shouldn't ratelim_count be reset to 1, not 0?

sys/vm/vm_pageout.c
1695

I believe vm_oom_ratelim_count is a leftover from an earlier, more complicated scheme. I do not see any use for it now: I should always remember the time of the last OOM run in vm_oom_ratelim_last, and for VM_OOM_MEM_PF, compare now with _last.
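A sketch of that simplification (hypothetical fragment in vm_pageout_oom(); vm_oom_ratelim_last and vm_oom_pf_secs as named above, with time_uptime being the kernel's seconds-since-boot clock):

	time_t now;

	now = time_uptime;
	if (shortage == VM_OOM_MEM_PF &&
	    now < vm_oom_ratelim_last + vm_oom_pf_secs)
		return;			/* too soon after the previous run */
	vm_oom_ratelim_last = now;	/* remember every OOM run */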

Eliminate vm_oom_ratelim_count.

I think this is ok. Just some thoughts below.

In head, if a process sleeps for a long time in vm_waitpfault(), the page daemon can deactivate and reclaim most of the process' mapped pages. If it reclaims enough pages to wake up the process, the process will run and immediately fault a number of times, adding pressure to the pool of free pages. So I can see why the OOM killer would fail to make progress: with many large processes we will just reclaim pages from one process to satisfy PF allocations for another, and this can go on indefinitely.

With your patch this is less likely to happen since a starved process can eventually trigger OOM kills, but in principle it can still be a problem if the page daemon reclaims user-mapped pages quickly enough. We can tune the sysctls you added, but I wonder if there is a way to recognize this pattern in general.

sys/vm/vm_fault.c
140

"Number of page allocation attempts before the page fault handler triggers OOM handling"?

145

"Number of seconds to wait for free pages before retrying the page fault handler"?

This revision is now accepted and ready to land. Aug 15 2019, 6:01 PM

I think this is ok. Just some thoughts below.

In head, if a process sleeps for a long time in vm_waitpfault(), the page daemon can deactivate and reclaim most of the process' mapped pages. If it reclaims enough pages to wake up the process, the process will run and immediately fault a number of times, adding pressure to the pool of free pages. So I can see why the OOM killer would fail to make progress: with many large processes we will just reclaim pages from one process to satisfy PF allocations for another, and this can go on indefinitely.

With your patch this is less likely to happen since a starved process can eventually trigger OOM kills, but in principle it can still be a problem if the page daemon reclaims user-mapped pages quickly enough. We can tune the sysctls you added, but I wonder if there is a way to recognize this pattern in general.

For me it happened first when the machine had no swap configured; then, after I added swap, the swap space was completely exhausted and the situation repeated, AFAIR.

kib added a subscriber: pho.

Add sysctl descriptions, slightly edited from the wording provided by Mark.

This revision now requires review to proceed. Aug 15 2019, 7:02 PM

Peter, could you please test the patch? The only interesting scenarios are OOMs. It should allow the system to unwind from low-memory situations much more quickly, and sometimes it should get out of problems that previously caused a hang.

This revision was not accepted when it landed; it landed in state Needs Review. Aug 16 2019, 9:44 AM
This revision was automatically updated to reflect the committed changes.