
Bump pageout_oom_seq.
Needs Review · Public

Authored by markj on Aug 10 2018, 4:05 PM.

Details

Reviewers
alc
kib
Summary

When an inactive queue scan fails to meet its target, we sleep
uninterruptibly. Before r308474 we slept for half a second between
scans, the idea being to give laundering some time to add clean pages
to the inactive queue. Now we sleep for 0.1s. The default oom_seq
value predates the new sleep period and can result in spurious OOM
kills when swapping to slow media. In the case that motivated this
change, a user was swapping to a USB flash drive while also using it
as the target of a buildworld. Average write latency can easily spike
to over 1s when such drives are overloaded.

The value of 100 effectively gives the kernel 10s to launder pages
before invoking the OOM killer.

Test Plan

The tester reported that the OOM kills he was observing went away.
(In fact, they were replaced by panics triggered by I/O failures from
the drive.)
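For anyone who wants to trial the proposed value on a live system before the default changes, the knob is exposed as a run-time sysctl. A sketch, assuming the sysctl name vm.pageout_oom_seq as in the tree at the time (double-check the name on your branch):

```shell
# Inspect the current threshold (default 12 before this change).
sysctl vm.pageout_oom_seq

# Raise it at runtime to the proposed value.
sysctl vm.pageout_oom_seq=100

# Persist the setting across reboots.
echo 'vm.pageout_oom_seq=100' >> /etc/sysctl.conf
```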


Event Timeline

markj created this revision. Aug 10 2018, 4:05 PM
markj edited the test plan for this revision. Aug 10 2018, 4:08 PM
markj added reviewers: alc, kib.
kib added a comment. Aug 10 2018, 4:49 PM

I do not object to any tuning of oom_seq, but I note that such a change needs Peter's validation. The value of oom_seq is purely empirical; it was a compromise between my tests on a 64M (or 32M) QEMU i386 instance and Peter's tests on 32G+ machines. Overly large values caused OOM to never trigger at all, since long enough runs of the pagedaemon were able to free 1-2 pages even though the system was not able to make any real progress.

In other words, the question is, does OOM still trigger with the bump?

markj added a comment. Aug 10 2018, 5:17 PM
In D16659#354077, @kib wrote:

I do not object to any tuning of oom_seq, but I note that such a change needs Peter's validation. The value of oom_seq is purely empirical; it was a compromise between my tests on a 64M (or 32M) QEMU i386 instance and Peter's tests on 32G+ machines. Overly large values caused OOM to never trigger at all, since long enough runs of the pagedaemon were able to free 1-2 pages even though the system was not able to make any real progress.

Hmm. This might be related to the number of active queue scans required to move pages from PQ_ACTIVE to PQ_INACTIVE. If the oom_seq value is large, the likelihood that we will deactivate and then reclaim a clean page is higher, so the system will appear to be making progress.

In other words, the question is, does OOM still trigger with the bump?

We saw one occurrence even with the value bumped to 120, but usually the system just panicked when I/Os to the flash drive started failing. I'll admit that the scenario is a bit extreme, but the current default value of 12 means that we may trigger OOM kills if the system is unable to make progress for only 1.2s. This must be too low.

alc added a comment. Aug 11 2018, 7:19 PM
In D16659#354077, @kib wrote:

I do not object to any tuning of oom_seq, but I note that such a change needs Peter's validation. The value of oom_seq is purely empirical; it was a compromise between my tests on a 64M (or 32M) QEMU i386 instance and Peter's tests on 32G+ machines. Overly large values caused OOM to never trigger at all, since long enough runs of the pagedaemon were able to free 1-2 pages even though the system was not able to make any real progress.

Hmm. This might be related to the number of active queue scans required to move pages from PQ_ACTIVE to PQ_INACTIVE. If the oom_seq value is large, the likelihood that we will deactivate and then reclaim a clean page is higher, so the system will appear to be making progress.

Maybe we should resurrect the option to swap out a runnable process. That would move a bunch of pages from the active queue to the inactive queue. However, most of the pages would likely require laundering.

In other words, the question is, does OOM still trigger with the bump?

We saw one occurrence even with the value bumped to 120, but usually the system just panicked when I/Os to the flash drive started failing. I'll admit that the scenario is a bit extreme, but the current default value of 12 means that we may trigger OOM kills if the system is unable to make progress for only 1.2s. This must be too low.