
swapoff to release multiple swap blocks at a time

Authored by on Dec 14 2017, 6:16 AM.




Improve swp_pager_force_pagein() such that "swapoff" reads multiple blocks at a time.

I used vm_thread_swapin() as a reference on how to grab pages.

Call swap_pager_haspage() to find additional contiguous VM pages, then call vm_page_grab() to prepare multiple vm_page_t structures.
swap_pager_getpages() calls vm_page_alloc() for rahead, but it appears we need to call vm_page_grab() instead for page-in purposes.
We find up to SWB_NPAGES (32 on i386) pages with swap_pager_haspage().
While SWAP_META_PAGES (8 on i386) is less than SWB_NPAGES, we skip "sb" blocks accordingly.

It appears VM_PAGE_BITS_ALL is set for pages that have already been brought back from swap devices. I find runs of pages without VM_PAGE_BITS_ALL to determine npages for the getpages call.

Test Plan

case 1:
% make buildworld -j 300
While the above is using swap space and there is sufficient free space to take one device offline,
% swapoff <swap-device>
case 2:
% mount -t tmpfs tmpfs /mnt/tmp
% dd if=/dev/zero of=/mnt/tmp/bigfile bs=1M count=10000
While the above is using swap space and there is sufficient free space to take one device offline,
% swapoff <swap-device>

Performance Measurement:
Objective: Measure how much faster "swapoff" becomes in the best case.
Setting: FreeBSD 12.0-STABLE on VMWare with 2GB of memory and 2GB swap space on SSD.
% mount -t tmpfs tmpfs /mnt/tmp
% dd if=/dev/zero of=/mnt/tmp/1GB bs=1M count=1024
% dd if=/dev/zero of=/mnt/tmp/out bs=1M count=1900
% rm /mnt/tmp/out
% time swapoff /dev/ada0s1b

  1. Create "1GB" file and swap it out with "out".
  2. Delete "out" file so that "swapoff" will swap back in "1GB" file.
  3. Measure time and compare.

Result: Wall-clock time was about 6 times faster (47.50 s vs. 7.93 s in the runs below). The patched version also performed about 18 times fewer page-in operations, each bringing in about 18 pages compared to 1 page at a time without the patch.

> old.txt <

unpatched# sysctl -a | grep v_swap | grep in
vm.stats.vm.v_swappgsin: 1
vm.stats.vm.v_swapin: 1
unpatched# time swapoff /dev/ada0s1b
0.000u 11.351s 0:47.50 23.8% 20+357k 0+0io 0pf+0w
unpatched# sysctl -a | grep v_swap | grep in
vm.stats.vm.v_swappgsin: 263858
vm.stats.vm.v_swapin: 263858

> new.txt <

patched# sysctl -a | grep v_swap | grep in
vm.stats.vm.v_swappgsin: 1
vm.stats.vm.v_swapin: 1
patched# time swapoff /dev/ada0s1b
0.000u 0.765s 0:07.93 9.5% 15+272k 4+0io 2pf+0w
patched# sysctl -a | grep v_swap | grep in
vm.stats.vm.v_swappgsin: 266696
vm.stats.vm.v_swapin: 14596

Diff Detail

rS FreeBSD src repository - subversion

Event Timeline


This is updated to use a while loop that depends on the outer for loop.
The new code looks simpler.

Basically, the idea is to advance the array index based on the number of page-ins completed. With the new code, a single page-in operation can pull back more pages than the d[] array covers.


The new variable here is 'i', and it must be 0 when the code reaches this point.

In order for this 'i' to be greater than 0, we must have paged in more than we thought we paged out, based on our sb->d[].

This function is only called by the "swapoff" command to detach a swap device.

Should I add something like KASSERT( i > 0, "we paged-in more than we had paged-out" )?


If 'npages' is set within a KASSERT expression, then it won't be set if assertions aren't turned on, and the program will be broken.

I suggested earlier that you just return '1' in this case; you haven't commented on that.


If you invoked this as
npages = vm_page_grab_pages(object, pindex, VM_ALLOC_NORMAL | VM_ALLOC_NOWAIT, ma, SWB_NPAGES);
would it work the same? Would it let you drop the call to swap_pager_haspage?


If you know that i must be 0, then you should add KASSERT(i == 0 ...).

573 ↗(On Diff #57821)

The changes to this file may be good or bad, but they don't seem related to the purpose of this change, so they should be part of a different patch.

593 ↗(On Diff #57821)

If the value of 'a' matters here, then that value can't be set in a KASSERT expression, since if asserts are turned off, 'a' won't be set.

I will revert and fix KASSERTs.
I will also test removing swap_pager_haspage() call.


I somehow thought KASSERT would be on at all times and that it would be easier to move the evaluation inside. That's why I saw an uninitialized-variable error on some code trees but not all of them.

I will fix these tomorrow.


This vm_page_grab_pages() call corresponds to the "m = vm_page_grab(object, pindex, VM_ALLOC_NORMAL);" call at old line 1676. I do not think NOWAIT would be a good idea if one runs "swapoff" on one of many swap devices during a heavy thrashing/paging period.

I recall that these two functions, vm_page_grab* and swap_pager_getpages, need to be called as a pair.

I think I removed one of them and it resulted in a panic. That was about a year ago, when I made the first change. I will see if I can remove the swap_pager_haspage call.

573 ↗(On Diff #57821)

I referenced this function to find out how to page in multiple pages in a single call. I copied the KASSERT from here.

593 ↗(On Diff #57821)

So, we'd better do
KASSERT(rv, ("%s: missing page %p", func, ma[i]));
here, too, as rv is boolean_t?

Reverted the KASSERT.
Also experimented without calling swap_pager_haspage().
The result was an instant reboot.


This line seems to choose to use the idea I suggested, which you said didn't work.


The name 'error' is confusing. How about 'haspage'?
Otherwise, you seem to be saying that you expect an error and the program should terminate if there is no error.


Is the value of npages returned from vm_page_grab_pages always going to match the value of npages passed in?

This crashed.

You reported that earlier, and then I pointed out to you that you posted an update with the crashing code in place and the working code hidden by an "#if 0".

So I don't know what new information you are providing here.

Reverted a few experiments back to working code.

If you can post one more update that changes 'error' to a more sensible name, and adds assertions for i==0 and for vm_page_grab_pages returning the same npages value that was passed in to it, then I'll approve it and, when one of my mentors approves it, I'll commit it. But you don't need to post any more versions that you know are broken.


The swap_pager_haspage() call here is to find how many EXTRA adjacent pages are available behind the current one. The count is passed back to the caller via the out parameter (npages here), and the return code indicates whether the call was successful.

At this line of code, we know pindex was paged out, and we know we need to take at least one page back from the device. So, after iterations of thinking and discussion, I think our best action is npages = 1 and continue.

I originally added KASSERT because the piece I used as a reference had KASSERT. At this point, I think we'd better drop it.


Okay, I will add KASSERT.


The npages passed in is the maximum. It may return a lower number.

On the other hand, I had a few extra condition checks in place to see how it would behave. This call here always returned the same number.

I'm satisfied with the way this code looks. It will take approval from someone more senior on the project before I can check it in. You might want to provide some demonstration of performance improvement - that kind of demonstration is necessary for approval from some folks.


Add a space between "if" and "(".

This revision is now accepted and ready to land. May 25 2019, 6:11 PM

Fix style.

Thank you very much for your quick and precise review comments.
Especially, loops look much smaller and easy to follow.

This is hardware dependent, but I saw the "swapoff" command become about 10 times faster.
I can try getting some numbers next week.
Most of the time, swapoff runs at shutdown, so stable servers won't have
a chance to enjoy the speedup, though ;)

This revision now requires review to proceed. May 25 2019, 6:37 PM

Why is this call needed?


To be honest, I don't know. My objective was to call swap_pager_getpages() with more than 1 page to page in faster. I kept the rest of the operations the same.

The existing code performs the equivalent of swp_pager_force_dirty() if (m->valid == VM_PAGE_BITS_ALL). Otherwise, swp_pager_force_launder() is called for each vm_page_t m.


Sorry, I misread the code; I thought that only one page was dirtied. The previous pages in the run are dirtied with the call to swp_pager_force_launder().

The reason for dirtying is that clean pages in swap objects are either zero or have a copy in swap, by definition of clean. So if we are removing the swap, we must ensure that the user content is not going away.

I updated test case section with performance measurement.

This code reduces the number of page-in calls as well as I/O operations, resulting in a "swapoff" about 6 times faster in the best-case scenario.


The "else" clause should be reconsidered. swap_pager_haspage(object, pindex)
should never return FALSE, because the caller has determined that there is, in fact, an allocated swap block at pindex of object. And, in fact, if ever it could have returned FALSE, npages should then be set to zero.

More generally, I was surprised to see swap_pager_haspage() called here. The caller to this function already has found the sb structure from which npages will be computed, so I was surprised that it wasn't computed by the caller and passed in.


There is a space after ma[i] that should be deleted.


Indeed, "the caller has determined that there is, in fact, an allocated swap block at pindex of object" is true, and thus the old code didn't call swap_pager_haspage() but rather called vm_page_grab() to prepare the vm_page_t. I added this call to also find how many EXTRA contiguous pages follow the one we know about, so that we can call swap_pager_getpages() for more than one page.

It sounds like we need to add a panic() if swap_pager_haspage() returns false, instead. Or should I KASSERT?

swap_pager_swapoff calls swp_pager_force_pagein calls swap_pager_haspage calls swp_pager_meta_ctl.

swap_pager_swapoff has sb, the swap block, when it calls swp_pager_force_pagein, but doesn't pass it. Then swp_pager_force_pagein calls swap_pager_haspage, and that calls swp_pager_meta_ctl, which does a tree lookup to get back the swap block so that it can return sb->d[pindex % SWAP_META_PAGES]. The lookup happens over and over.

You could write a function that could count the pages, the way swap_pager_haspage does, but without calling swp_pager_meta_ctl to lookup the swap block, because you already have it.

Or, if you passed sb in as an argument to swp_pager_force_pagein, you could replace the swap_pager_haspage call with a loop like the one inside swap_pager_haspage that does the counting, but without needing swp_pager_meta_ctl.

That's what I understand Alan to mean from his comments.

Resolve conflicts, fix style, and split out a function.

Doug, thank you for elaboration.
Let's see how that approach works this time.
When I started and tried this, it wasn't as easy, so I decided to take advantage of existing functions.

Created swp_pager_find_continuous instead of calling swap_pager_haspage().

Have you looked at these functions:

They are used to update a range of pages to be freed, and when the next page added can't extend the range, then the accumulated range is flushed. That seems like what you need, except that when a range is flushed for you, the call is swp_pager_force_pagein instead of swp_pager_freeswapspace. You just keep extending ranges, with breaks triggering flushes, and then you have to do a final function call at the end for the last range.

I ask because I'm concerned to see SWAP_PCTRIE_LOOKUP_GE in swp_pager_find_continuous. That seems to duplicate the SWAP_PCTRIE_LOOKUP_GE call in swap_pager_swapoff_object. Could an approach like the one used for freeing swap page ranges make this simpler?

I don't think swp_pager_find_continuous() was a good idea/implementation. I will revert it shortly.

I looked more into the swp_pager_meta_ctl() calls. Indeed, it is called from all over the place, and multiple times, to find one entry in the swblk structure.

I had overlooked that vm_pager_page_unswapped() removes entries from the swap pctrie. So, I think the loop and how it skips blocks in swap_pager_swapoff_object() are wrong: it skips too many blocks. I need to test the code again.

Working on smaller change sets and releases is easier to review and test, and is my preference. I'd like to split this review into a few pieces, such as swp_pager_force_dirty()/swp_pager_force_launder(), and create separate reviews.


The vm_pager_page_unswapped() call chain calls SWAP_PCTRIE_REMOVE() to
remove swap blocks that become empty.
The number of paged-in pages still remains correct;
we need to figure out the offset on the next working block.

I also created D20545 to separate non-functional changes out of this review.

Can you update this patch, please, at least so that it can be applied? Part of it was put into a separate patch and has already been put in head.

Doug, thank you for your very prompt responses. I looked into this more after kib's concern about the locks and Alan's comment. I think they are right and this code has issues. Indeed, the swp_pager_meta_ctl() calls are very expensive. I am also experiencing a few other issues and trying to figure them out. For example, I now use VMware on a Mac for this testing, compared to the physical machines I used when I originally made this change; VMware stalls at high memory utilization and heavy host I/O. I will also be traveling shortly.

In short, I need more time on this.

Pick up Doug's patch for searching multiple blocks and split the code check-in merges.

This looks for contiguous pages up to SWAP_META_PAGES instead of SWB_NPAGES.
SWAP_META_PAGES is smaller (at least on i386) but the loop is simpler.

In addition, I added swp_pager_meta_free() to release multiple blocks with a
single call instead of npages calls to vm_pager_page_unswapped().

I also subtract 2 from the blist_fill() size to exclude the first 2 blocks skipped.
I think these extra 2 blocks cause "swapoff" to hang forever.
I started testing with multiple swap devices and the system frequently stops responding.
I have had different changes for this offset and I haven't tested this version.


It seems like this line is now part of a different patch being reviewed elsewhere. In that case, please remove it from here.

I think that this is getting close to being commit-able.


Move this line back before the call to vm_page_get_pages().


I don't see the need for "i = 0" here.


n_blks must not exceed MAXPHYS / PAGE_SIZE.


Deep inside of swp_pager_force_pagein() we will release the object lock, allowing concurrent activity to potentially deallocate sb. So, subsequent lines should not use sb. Instead, the loop should be terminated early and the trie lookup repeated on the old "pi" value.