
vm_pageout: Scan inactive dirty pages less aggressively
Needs Review · Public

Authored by markj on Mon, Jan 6, 7:16 PM.

Details

Reviewers
alc
kib
Summary

Consider a database workload where the bulk of RAM is used for a
fixed-size file-backed cache. Any leftover pages are used for
filesystem caching or anonymous memory. In particular, there is little
memory pressure and the inactive queue is scanned rarely.

Once in a while, the free page count dips a bit below the setpoint,
triggering an inactive queue scan. Since almost all of the memory there
is used by the database cache, the scan encounters only referenced
and/or dirty pages, moving them to the active and laundry queues. In
particular, it ends up completely depleting the inactive queue, even for
a small, non-urgent free page shortage.
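
To make the failure mode concrete, here is a toy model of the scan loop as
described above (all types and helpers are invented for the example; this is
not the FreeBSD code):

```c
/*
 * Toy model of the scan behaviour described above, not the FreeBSD
 * implementation.  All types and helpers here are invented for the example.
 */
#include <stdbool.h>
#include <stddef.h>

struct page {
	struct page	*next;
	bool		 referenced;
	bool		 dirty;
};

static int
inactive_scan(struct page **inactq, int shortage)
{
	struct page *m;
	int freed = 0;

	/*
	 * The scan stops only when the shortage is met or the queue is
	 * empty.  If the queue holds nothing but referenced and/or dirty
	 * pages, nothing is ever freed, so the whole queue is drained in a
	 * single pass, reactivating or laundering every page visited.
	 */
	while (freed < shortage && (m = *inactq) != NULL) {
		*inactq = m->next;
		if (m->referenced) {
			/* Requeue to the active queue (not shown). */
		} else if (m->dirty) {
			/* Requeue to the laundry queue (not shown). */
		} else {
			/* Clean and unreferenced: reclaim it. */
			freed++;
		}
	}
	return (freed);
}
```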

This scan might process many gigabytes worth of pages in one go,
triggering VM object lock contention (on the DB cache file's VM object)
and consuming CPU, which can cause application latency spikes.

Having observed this behaviour, my conclusion is that we should abort
scanning once we've encountered many dirty pages without meeting the
shortage. In general we've tried to make the page daemon control loops
avoid large bursts of work, and if a scan fails to turn up clean pages,
there's not much use in moving everything to the laundry queue at once.

Modify the inactive scan to abort early if we encounter enough dirty
pages without meeting the shortage. If the shortage hasn't been met,
this will trigger shortfall laundering, wherein the laundry thread
will clean as many pages as needed to meet the instantaneous shortfall.
Laundered pages will be placed near the head of the inactive queue, so
will be immediately visible to the page daemon during its next scan of
the inactive queue.
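
As a rough sketch of the shape of this check (the weight name and the exact
comparison are assumptions drawn from the description, not the actual patch):

```c
/*
 * Rough sketch of the early-abort check only.  The weight name and the
 * exact comparison are assumptions based on the description above, not the
 * actual patch.
 */
#include <stdbool.h>

static bool
abort_inactive_scan(long dirty_seen, long remaining_shortage,
    long inact_weight)
{
	/* A weight of 0 disables the early abort, i.e. the old behaviour. */
	if (inact_weight == 0)
		return (false);
	/*
	 * Once the number of dirty pages pushed toward the laundry reaches
	 * some multiple of the remaining shortage, stop the scan; shortfall
	 * laundering will clean just enough pages to cover the deficit.
	 */
	return (dirty_seen >= remaining_shortage * inact_weight);
}
```

With a check of this shape, a larger weight tolerates more dirty pages per
scan, which lines up with the later notes about lowering the weight from 2
to 1 and clamping it in the sysctl handler.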

Since this causes pages to move to the laundry queue more slowly, allow
clustering with inactive pages. I can't see much downside to this in
any case.
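
For illustration, a toy eligibility check capturing the intent (the queue
constants and the allow_inactive knob are invented; this is not the
vm_pageout_cluster() code):

```c
/*
 * Toy eligibility check showing the intent of the clustering change.  The
 * queue constants and the allow_inactive knob are invented for the example;
 * this is not the vm_pageout_cluster() code.
 */
#include <stdbool.h>

enum queue { Q_NONE, Q_ACTIVE, Q_INACTIVE, Q_LAUNDRY };

static bool
cluster_eligible(enum queue q, bool dirty, bool busy, bool allow_inactive)
{
	if (!dirty || busy)
		return (false);
	/*
	 * If only laundry pages may be pulled into the write cluster, and
	 * pages now reach the laundry queue more slowly, clusters shrink;
	 * letting neighbouring dirty inactive pages ride along avoids that.
	 */
	return (q == Q_LAUNDRY || (allow_inactive && q == Q_INACTIVE));
}
```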

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 61563
Build 58447: arc lint + arc unit

Event Timeline

markj requested review of this revision. Mon, Jan 6, 7:16 PM

This scan might process many gigabytes worth of pages in one go,
triggering VM object lock contention (on the DB cache file's VM object)
and consuming CPU, which can cause application latency spikes.

I meant to note that this is exacerbated by the page daemon being multithreaded on high-core-count systems: in this case we had 5 threads all processing the inactive queue over several seconds.

As a side note, I think the pages-per-second (PPS) calculation in vm_pageout_inactive_dispatch() also doesn't work well in this scenario: it counts the number of pages freed, not the number of pages scanned, so a queue full of dirty and/or referenced pages will produce a low PPS score, which makes it more likely that we'll dispatch multiple threads during a shortage.
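
For comparison, a toy version of such a metric (not the code in
vm_pageout_inactive_dispatch()):

```c
/*
 * Toy version of a pages-per-second metric, for illustration only; this is
 * not the code in vm_pageout_inactive_dispatch().
 */
static unsigned long
pages_per_second(unsigned long pages, unsigned long elapsed_seconds)
{
	return (elapsed_seconds > 0 ? pages / elapsed_seconds : pages);
}
```

Feeding it pages freed versus pages scanned for the same hypothetical
one-second scan, say 500 freed out of 100,000 scanned, gives scores of 500
versus 100,000; the freed-based number looks like a stalled page daemon even
though the scan itself is running flat out.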

Permit the inactive weight to have a value of 0, which effectively
restores the old behaviour.

Clamp the weights in the sysctl handler to make a multiplication overflow
less likely.
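
A minimal sketch of what a clamping handler could look like, assuming the
standard sysctl_handle_int() pattern (the name and the bound are
placeholders, not the actual patch):

```c
/*
 * Minimal sketch of a clamping sysctl handler, assuming the standard
 * sysctl_handle_int() pattern.  The function name and the clamp bound are
 * placeholders; this is not the actual patch.
 */
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

static int
sysctl_pageout_weight(SYSCTL_HANDLER_ARGS)
{
	u_int val;
	int error;

	val = *(u_int *)arg1;
	error = sysctl_handle_int(oidp, &val, 0, req);
	if (error != 0 || req->newptr == NULL)
		return (error);
	/*
	 * Bound the weight so that a later (weight * shortage) product,
	 * like the one in the abort check, cannot easily overflow.
	 */
	if (val > 65536)
		val = 65536;
	*(u_int *)arg1 = val;
	return (0);
}
```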

Set the inactive weight to 1 instead of 2. In my testing, we are still moving
pages to the laundry quite aggressively (see below), so we don't need the extra
multiplier.

Avoid incrementing oom_seq if there's no instantaneous shortage. Otherwise
it's possible to get spurious OOM kills after an acute page shortage: after the
shortage is resolved, the PID controller will still have positive output for a
period of time and thus will scan the queue. If the inactive queue is full of
dirty pages, the OOM controller will infer that the page daemon is failing to
make progress, but if the shortage has already been resolved, this is wrong.
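
One plausible shape for this guard, sketched with invented names (not the
actual patch):

```c
/*
 * One plausible shape for the oom_seq guard, with invented names; not the
 * actual patch.
 */
struct pd_state {
	int	oom_seq;	/* consecutive scans with no progress */
};

static void
record_scan_result(struct pd_state *st, int pages_freed,
    int instantaneous_shortage)
{
	/*
	 * Only treat a scan as evidence that the page daemon is stuck if it
	 * freed nothing while a real free-page shortage still exists.  A
	 * scan driven purely by residual PID-controller output, after the
	 * shortage has already been resolved, should not push the system
	 * toward an OOM kill.
	 */
	if (pages_freed == 0 && instantaneous_shortage > 0)
		st->oom_seq++;
	else
		st->oom_seq = 0;
}
```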

This problem is not new but is easier to trigger now that we move pages to the
laundry less aggressively.