Page MenuHomeFreeBSD

PQ_LAUNDRY
ClosedPublic

Authored by alc on Oct 20 2016, 5:44 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jan 9, 12:31 PM
Unknown Object (File)
Mon, Jan 6, 9:00 PM
Unknown Object (File)
Mon, Jan 6, 3:02 PM
Unknown Object (File)
Wed, Jan 1, 2:00 AM
Unknown Object (File)
Mon, Dec 30, 9:26 AM
Unknown Object (File)
Mon, Dec 30, 5:01 AM
Unknown Object (File)
Sun, Dec 29, 11:58 AM
Unknown Object (File)
Fri, Dec 27, 6:23 AM

Details

Summary

Introduce a new page queue, PQ_LAUNDRY, for unreferenced, i.e., inactive, dirty pages and a new thread for laundering the pages on this queue. In essence, this change decouples page laundering and reclamation. For example, one effect of this decoupling is that the legacy page daemon thread(s) will no longer block because laundering anonymous pages consumes all of the available pbufs for writing to swap. Instead, they are able to continue with page reclamation. This eliminates the need for dubious low-memory deadlock avoidance hacks, specifically, the vm_page_try_to_cache() calls in I/O completion handlers.

The laundry thread sleeps while waiting from a request from the pagedaemon(s). A request is raised by setting vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes (background laundering), and to quickly make up for a shortage of free and clean inactive pages (shortfall). When a background laundering is requested, the laundry thread computes the number of pagedaemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target.

A shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, the inactive scan period.

A laundry request can be latched while another is currently being serviced. A shortfall request will immediately preempt a background laundering.

The change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). vm_cnt.v_reactivated now represents the number of inactive or laundry pages that are returned to the active queue on account of a reference.

Test Plan

For testing shortfall, a modified sysbench which uses NCPU threads to write large NOSYNC-mapped files. This effectively forces all of system memory through the laundry queue; in testing on a system with 16GB of RAM, the laundry thread is able to write pages at 200-250MB/s to an SSD backing a UFS filesystem.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

alc retitled this revision from to PQ_LAUNDRY.
alc updated this object.
alc edited the test plan for this revision. (Show Details)
alc added reviewers: kib, markj.
markj added a subscriber: pho.
sys/vm/vm_pageout.c
238–240 ↗(On Diff #21549)

Kostik, what does Bruce say about proper SYSCTL style? Is the CTLFLAG_RW supposed to be on the first line and the indentation of the continuation lines four spaces? (Currently, the new SYSCTL's are following the style of the existing ones in this file.)

sys/vm/swap_pager.c
1641 ↗(On Diff #21549)

This comment is confusing IMO. Due to the backing store free, the page is marked dirty, and not queued. But since the page is not worth keeping in memory (it was in swap after all), it should go into laundry.

sys/vm/vm_pageout.c
238–240 ↗(On Diff #21549)

I am only sure about the 4-spaces indent and the new line before description. I used line break after the CTLFLAG* flags, but indeed Bruce might said that line break should be used before.

477 ↗(On Diff #21549)

What code is supposed to re-queue the page after flush ? I see that the patch added a call to vm_page_deactivate_noreuse() for write completion in the swap pager, but I failed to find something that would re-queue pages for the local vnode pager.

alc marked an inline comment as done.
alc added inline comments.
sys/vm/swap_pager.c
1641 ↗(On Diff #21549)

Done.

alc marked an inline comment as done.

It occurred to me that the behaviour of madvise(MADV_DONTNEED) for dirty pages is less than ideal on this branch. Right now, we always requeue such pages at the tail of the inactive queue, meaning that we may move them out of the laundry queue. That seems incorrect.

I propose modifying vm_page_advise() so that we call vm_page_launder() on dirty pages, rather than _vm_page_deactivate().

In D8302#173090, @markj wrote:

It occurred to me that the behaviour of madvise(MADV_DONTNEED) for dirty pages is less than ideal on this branch. Right now, we always requeue such pages at the tail of the inactive queue, meaning that we may move them out of the laundry queue. That seems incorrect.

I propose modifying vm_page_advise() so that we call vm_page_launder() on dirty pages, rather than _vm_page_deactivate().

Yes, do it.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

I think this is a valid problem in that there is no general mechanism to requeue the pages. The buffer cache takes care of this for some filesystems it seems, but I don't see how it would work for ZFS.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Thank you for pointing out the buffer cache involvement there, indeed, for filesystems using buffer cache it happens automatically. But IMO relying of the pager behaviour is wrong, and ZFS should not know such peculiarities of VM.

I would expect that e.g. completion code for vm_pageout_flush() take care of the re-queue, esp. because it already handles it for some error cases.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Even for filesystems which use the buffer cache, I don't see anything that automatically unwires the buf's pages once the write completes. vnode_pager_generic_putpages() specifies IO_VMIO, which is translated to B_RELBUF by ext2 and UFS, but it seems that this should really be enforced by generic code.

Unfortunately, it doesn't seem possible for vm_pageout_flush() or even the vnode pager to specify a completion handler - VOP_PUTPAGES provides no mechanism to do so. Am I missing something?

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

We do not need to force unwire. It is enough for pages to be queued laterm when the buffer is recycled. Buffer cache size is limited to the fixed amount, so the count of pages participating in the VMIO buffers and not visible to the page daemon is limited. OTOH, pages that not queued because they were missed are effectively unswappable until the owning object is destroyed.

VOP_PUTPAGES() is synchronous, more, the typical operation of the vnode pager marks the page clean before the buffer write is initiated. It is, so to say, migrate the dirtyness from the pages to buffer. I mean that vm_pageout_flush() could re-queue the pages after the pager returned.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Forcing an unwire is not strictly necessary, but in the case of laundering, the pages have gone through LRU and are eligible for reclamation. It seems strange to let them exert pressure on the bufspace and go through the buffer cache's own LRU. I think UFS' current behaviour of specifying B_RELBUF is correct.

VOP_PUTPAGES is not synchronous by default - one needs to specify VM_PAGER_PUT_SYNC, and this is only done when v_free_count < v_pageout_free_min. Even if VOP_PUTPAGES allowed one to specify an iodone handler like VOP_GETPAGES_ASYNC does, I don't see a way to implement it as a generic vnode method. vop_stdputpages() currently just calls VOP_WRITE, which also doesn't provide notifications for async writes.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

If you want to modify the bufcache behaviour WRT unwiring of the laundered pages, then vfs_vmio_unwire() looks like a proper place. It already tried to free pages or affect their LRU position on unwire in several cases, so one more case is not too outstanding.

I tried to express that VOP_PUTPAGES() is synchronous from the VM PoV: the page is marked clean outright, even before the write is scheduled somehow in the io subsystem. It is io level which records the need of performing write, and e.g. for clustering allowed (async putpages in VM terms), the dirty buffer may sit on the dirty queue until buffer or syncer daemons care about it. But from the VM look, the page is clean after the successful return from vm_pager_putpages(), and sometimes even earlier. So vm_pageout_fault() can do whatever re-queuing attempts it finds suitable, after the pager call.

I have run stress2 testes on i386 and amd64.
I ran a buildkernel on both i386 and amd64 with 256MB RAM / UP.
Buildworld was run with various small RAM configurations.
No problems seen.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Do we use VM_PAGER_PEND anywhere besides the swap pager?

To Kostik's point, I think that I agree. We should remove the vm_page_dequeue() calls from vm_pageout_cluster() and instead call vm_page_deactivate_noreuse() in vm_pageout_flush() when the pager returns VM_PAGER_OK.

However, there is one catch. We shouldn't automatically call vm_page_deactivate_noreuse() when vm_pageout_flush() is called by msync(), or in general any caller besides the laundry thread. I think it would suffice to test whether the page is in the laundry queue, and only call vm_page_deactivate_noreuse() if it is. That way, we would also handle the case where msync() is performed on a page in the laundry queue.

Turning to Mark's point, we ought to tell the buffer cache to immediately release the buffer and perform vm_page_deactivate_noreuse() on the pages. However, I don't think that any of the existing flags that vm_pageout_flush() can pass to vm_pager_put_pages() accomplishes the latter. Am I wrong?

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Ok, I understand the suggestion now. I think that queuing the page using vm_page_deactivate_noreuse() if it was on the laundry queue is a reasonable policy, and we can use the B_NOREUSE flag to effect this in the buffer cache.

It does indeed seem like we need to add a new VM_PAGER_PUT_* flag to signal our intent to VOP_PUTPAGES, and it also needs to be plumbed through VOP_WRITE somehow for the generic PUTPAGES implementation.

If we add a new VM_PAGER_PUT_* flag, then we actually don't need to test whether the page is in the laundry queue: vm_pageout_flush() takes the pager flags as a parameter, so we can just set the flag in vm_pageout_cluster() and use that to determine where to queue. That way, msync and so on will be unaffected.

VM_PAGER_PEND only appears to be set in the swap pager.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

That way, msync and so on will be unaffected.

We might have a page in the laundry queue that is actually laundered by the msync(2) call. In that case, we would want the page to be moved to the inactive queue. If the page has been referenced while in the laundry queue, there shouldn't be a problem with having used vm_page_deactivate_noreuse() on the page because vm_pageout_scan() will see the reference and not reclaim the page.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Here is the proposed patch:

Index: vm/vm_pageout.c
===================================================================
--- vm/vm_pageout.c     (revision 307753)
+++ vm/vm_pageout.c     (working copy)
@@ -405,7 +405,6 @@ vm_pageout_cluster(vm_page_t m)
         */
        vm_page_assert_unbusied(m);
        KASSERT(m->hold_count == 0, ("page %p is held", m));
-       vm_page_dequeue(m);
        vm_page_unlock(m);
 
        mc[vm_pageout_page_count] = pb = ps = m;
@@ -448,7 +447,6 @@ more:
                        ib = 0;
                        break;
                }
-               vm_page_dequeue(p);
                vm_page_unlock(p);
                mc[--page_base] = pb = p;
                ++pageout_count;
@@ -474,7 +472,6 @@ more:
                        vm_page_unlock(p);
                        break;
                }
-               vm_page_dequeue(p);
                vm_page_unlock(p);
                mc[page_base + pageout_count] = ps = p;
                ++pageout_count;
@@ -550,6 +547,10 @@ vm_pageout_flush(vm_page_t *mc, int count, int fla
                    ("vm_pageout_flush: page %p is not write protected", mt));
                switch (pageout_status[i]) {
                case VM_PAGER_OK:
+                       vm_page_lock(mt);
+                       if (vm_page_in_laundry(mt))
+                               vm_page_deactivate_noreuse(mt);
+                       vm_page_unlock(mt);
                case VM_PAGER_PEND:
                        numpagedout++;
                        break;
sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Looks fine.

sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

Seems right to me. I can work on the corresponding buffer cache change, but that's probably not a prerequisite to merging PQ_LAUNDRY?

alc edited edge metadata.
alc marked an inline comment as done.Oct 24 2016, 5:32 PM
alc added inline comments.
sys/vm/vm_pageout.c
477 ↗(On Diff #21549)

No, I don't think it's a prerequisite.

sys/sys/vmmeter.h
100 ↗(On Diff #21657)

This ought to have a more accurate description. Suggestions?

markj added inline comments.
sys/sys/vmmeter.h
100 ↗(On Diff #21657)

"pages eligible for laundering"?

Revise the description of v_laundry_count.

alc marked an inline comment as done.Oct 25 2016, 4:18 AM
sys/vm/vm_pageout.c
1978 ↗(On Diff #21661)

Should the "2" here be "VM_INACT_SCAN_INTERVAL"?

sys/vm/vm_pageout.c
1978 ↗(On Diff #21661)

Yes. That name doesn't really make sense though - it's a rate.

How about:

diff --git a/sys/vm/vm_pageout.c b/sys/vm/vm_pageout.c
index d09dccb..c996797 100644
--- a/sys/vm/vm_pageout.c
+++ b/sys/vm/vm_pageout.c
@@ -155,11 +155,9 @@ static struct kproc_desc vm_kp = {
 SYSINIT(vmdaemon, SI_SUB_KTHREAD_VM, SI_ORDER_FIRST, kproc_start, &vm_kp);
 #endif

-/* Sleep intervals for pagedaemon threads, in subdivisions of one second. */
-#define        VM_LAUNDER_INTERVAL     10
-#define        VM_INACT_SCAN_INTERVAL  2
-
-#define        VM_LAUNDER_RATE         (VM_LAUNDER_INTERVAL / VM_INACT_SCAN_INTERVAL)
+/* Pagedaemon activity rates, in subdivisions of one second. */
+#define        VM_LAUNDER_RATE         10
+#define        VM_INACT_SCAN_RATE      2

 int vm_pageout_deficit;                /* Estimated number of pages deficit */
 u_int vm_pageout_wakeup_thresh;
@@ -1149,7 +1147,7 @@ vm_pageout_laundry_worker(void *arg)
                 */
                if (shortfall > 0) {
                        in_shortfall = true;
-                       shortfall_cycle = VM_LAUNDER_RATE;
+                       shortfall_cycle = VM_LAUNDER_RATE / VM_INACT_SCAN_RATE;
                        target = shortfall;
                } else if (!in_shortfall)
                        goto trybackground;
@@ -1211,7 +1209,7 @@ trybackground:
                                target = 0;
                        }
                        launder = vm_background_launder_rate * PAGE_SIZE / 1024;
-                       launder /= VM_LAUNDER_INTERVAL;
+                       launder /= VM_LAUNDER_RATE;
                        if (launder > target)
                                launder = target;
                }
@@ -1225,7 +1223,7 @@ dolaundry:
                         */
                        target -= min(vm_pageout_launder(domain, launder,
                            in_shortfall), target);
-                       pause("laundp", hz / VM_LAUNDER_INTERVAL);
+                       pause("laundp", hz / VM_LAUNDER_RATE);
                }

                /*
@@ -2001,7 +1999,7 @@ vm_pageout_worker(void *arg)
                         */
                        mtx_unlock(&vm_page_queue_free_mtx);
                        if (pass >= 1)
-                               pause("psleep", hz / 2);
+                               pause("psleep", hz / VM_INACT_SCAN_RATE);
                        pass++;
                } else {
                        /*
sys/vm/vm_pageout.c
1978 ↗(On Diff #21661)

I agree. Commit your proposed change.

alc marked 3 inline comments as done.

Style only

sys/vm/vm_pageout.c
563 ↗(On Diff #21736)

I don't think that we ever consciously chose between vm_page_deactivate() and vm_page_deactivate_noreuse() here. Using vm_page_deactivate() will preserve the contents of this "failed page" from reclamation for a little longer. Is there actually a reason to prefer that?

sys/vm/vm_pageout.c
563 ↗(On Diff #21736)

I can't see any good reason either way. It looks like the failure modes that lead to VM_PAGER_BAD are transient (e.g. vnode is being reclaimed) and will lead to the page being freed by another mechanism.

sys/vm/vm_pageout.c
563 ↗(On Diff #21736)

After sleeping on the question and your response, I have a slight preference for using the _noreuse option. I'm also going to condition the _noreuse call on whether the page is in the laundry, like we did for the OK and PEND cases. This way, msync() pages will remain in their current queue, unless they were in the laundry.

Tweak vm_pageout_flush()'s handling of the VM_PAGER_BAD case.

alc marked 2 inline comments as done.Oct 28 2016, 4:42 PM

I went through the entire diff again yesterday. I don't have any other questions or planned changes.

In D8302#174196, @alc wrote:

I went through the entire diff again yesterday. I don't have any other questions or planned changes.

Cool! I don't have anything further to add at the moment either.

In D8302#174211, @markj wrote:
In D8302#174196, @alc wrote:

I went through the entire diff again yesterday. I don't have any other questions or planned changes.

Cool! I don't have anything further to add at the moment either.

Could you please rerun your stress test from the test plan so that the caveat about "this test result was from a few months ago" can be removed? (I'm viewing the summary as a draft of the commit message.)

Peter,

Could you please boot a PQ_LAUNDRY kernel with VM_NUMA_ALLOC enabled and put enough memory stress on it to trigger page laundering? I just want to double check that we haven't introduced any regressions in the current NUMA support.

In D8302#174630, @alc wrote:
In D8302#174211, @markj wrote:
In D8302#174196, @alc wrote:

I went through the entire diff again yesterday. I don't have any other questions or planned changes.

Cool! I don't have anything further to add at the moment either.

Could you please rerun your stress test from the test plan so that the caveat about "this test result was from a few months ago" can be removed? (I'm viewing the summary as a draft of the commit message.)

Sure, will do tonight.

In D8302#174631, @alc wrote:

Peter,

Could you please boot a PQ_LAUNDRY kernel with VM_NUMA_ALLOC enabled and put enough memory stress on it to trigger page laundering? I just want to double check that we haven't introduced any regressions in the current NUMA support.

Sure.
18:29:26 vm.stats.vm.v_laundry_count: 0
18:30:22 vm.stats.vm.v_laundry_count: 2437688
18:30:38 vm.stats.vm.v_laundry_count: 2448018
^C
$ uname -a
FreeBSD t2.osted.lan 12.0-CURRENT FreeBSD 12.0-CURRENT #0 r308137: Mon Oct 31 18:19:24 CET 2016 pho@t2.osted.lan:/var/tmp/PQ_LAUNDRY/sys/amd64/compile/MAXMEMDOM amd64
$ sysctl vm.ndomains
vm.ndomains: 2

In D8302#174638, @markj wrote:
In D8302#174630, @alc wrote:
In D8302#174211, @markj wrote:
In D8302#174196, @alc wrote:

I went through the entire diff again yesterday. I don't have any other questions or planned changes.

Cool! I don't have anything further to add at the moment either.

Could you please rerun your stress test from the test plan so that the caveat about "this test result was from a few months ago" can be removed? (I'm viewing the summary as a draft of the commit message.)

Sure, will do tonight.

I ran the test overnight on HEAD and haven't hit any problems, so I just removed that reference.

markj edited edge metadata.
This revision is now accepted and ready to land.Nov 2 2016, 8:14 PM
kib edited edge metadata.
markj added a subscriber: jhb.
This revision was automatically updated to reflect the committed changes.