
prototype for trimming
Needs Review · Public

Authored by dougm on Jul 6 2019, 8:17 AM.

Details

Reviewers
alc
markj
Summary

Sketch of swap trimming with two cursors. I know it needs to wait for trimming to complete in a separate thread. It's just an outline.

Test Plan

The final product isn't here to test yet. What is here is a patch that should affect nothing but performance, and only slightly, while logging information to help figure out how to tune the code that will trim freed swap space. So any sort of test that exercises plenty of swap space is good. What is needed from it is the log entries generated by this line:

CTR4(KTR_SPARE5, "swapdev_trim: cursor %p alloc %d trimmer %p trimsize %d",
    (void *)sp->sw_cursor, alloc_npages, (void *)sp->sw_trimmer, nblks);

Whenever you have the time to exercise this code, Peter, please do.


Event Timeline

dougm created this revision. Jul 6 2019, 8:17 AM
alc added a comment. Jul 6 2019, 4:27 PM

I would really like to see a complete prototype before we start committing piecemeal changes. Largely, because it is not clear to me that a "two cursors"-based approach is the right design. Alternatively, when trim is enabled, we could introduce a "coalescing queue" in front of the blist_free() calls. In other words, when trim is enabled, where blist_free() is currently called, we would simply enqueue the block numbers. When the number and/or size of the entries in the coalescing queue crosses some threshold, we would wake up a helper thread that issues trim operations for some of the entries and calls blist_free() on the corresponding blocks after the trim operation has completed. Situations where the number of free blocks in the blist falls below a threshold, may also require special handling, possibly even skipping the trim operation. The unknown with this alternative approach is how large the coalescing queue would be, or really how much memory we would allow it to consume, in order for it to be effective.

alc added a comment. Jul 6 2019, 5:33 PM

To be clear, I am not arguing that the alternative design is the better one. Only that there are plausible alternatives, and that we shouldn't start making incremental changes until we have evidence for following one approach or the other.

dougm updated this revision to Diff 59478. Jul 6 2019, 6:17 PM
dougm retitled this revision from "move cursor from blist to swdev" to "prototype for trimming".
dougm edited the summary of this revision. (Show Details)
dougm edited the test plan for this revision. (Show Details)
alc added inline comments. Jul 6 2019, 7:38 PM
sys/vm/swap_pager.c
739–747

Take a look at ffs_blkfree_sendtrim() in ufs/ffs/ffs_alloc.c

dougm updated this revision to Diff 59482. Jul 6 2019, 8:55 PM

Something less obviously blocking, though I'm sure some flags are set wrong and the wait for an available write resource (nsw_wcount_async) is unacceptable. There must be a queue somewhere I could stick that task in, waiting for writing to be okay again.

alc added a comment. Jul 7 2019, 2:23 AM

Something less obviously blocking, though I'm sure some flags are set wrong and the wait for an available write resource (nsw_wcount_async) is unacceptable. There must be a queue somewhere I could stick that task in, waiting for writing to be okay again.

Please read "man 9 locking" and "man 9 sleep".

alc added inline comments. Jul 7 2019, 2:29 AM
sys/vm/swap_pager.c
749–750

I think that you are familiar with pthread_cond_wait(). This is similar; in particular, the mutex must be held by the caller. And that mutex is being used to synchronize access to nsw_wcount_async.

dougm updated this revision to Diff 59489. Jul 7 2019, 3:43 AM

Fix some obvious deadlocks.

alc added inline comments. Jul 7 2019, 5:46 PM
sys/vm/swap_pager.c
739

This is problematic because the cursors are 64-bit variables while npages is a 32-bit variable.

769

You can't perform an M_WAITOK allocation when a mutex is held. A mutex is held when this function is called by swap_pager_reserve().

dougm updated this revision to Diff 59516. Jul 7 2019, 6:40 PM

In swap_pager_reserve, do all the allocation before acquiring the object lock. If all the allocation can't be done, leave the object alone.

Fix a 32- vs. 64-bit problem.

alc added inline comments. Jul 7 2019, 8:18 PM
sys/vm/swap_pager.c
970

If swap space is fragmented, a call that allocates a moderate amount of swap space will overflow the stack and crash the kernel.

dougm updated this revision to Diff 59523. Jul 7 2019, 8:28 PM

In swap_pager_reserve, release the write lock when not modifying the object.

dougm updated this revision to Diff 59558. Jul 9 2019, 3:06 AM
dougm edited the test plan for this revision. (Show Details)
dougm removed a reviewer: kib.
dougm added a subscriber: pho.

Add logging. Don't really try to trim anything.

pho added a comment. Jul 9 2019, 6:26 AM

I'll try again and avoid a panic, but in the meantime here's some trace: https://people.freebsd.org/~pho/stress/log/dougm047.txt

dougm updated this revision to Diff 59561. Jul 9 2019, 6:59 AM

Correct the last log field.

dougm updated this revision to Diff 59658. Jul 11 2019, 9:51 PM

Don't try to trim exactly the swap device we just allocated from; just look for one that needs trimming.
Tweak the parameters to make the max trim size bigger, and make the trim zone smaller.

Do keep logging trimming events, but also try to actually trim something.

Compiles and boots.

pho added a comment. Jul 13 2019, 11:46 AM

With D20863.59658.diff I got:

20190713 12:51:07 all (40/652): crossmp3.sh
panic: non-NULL bp->data in g_io_request(cmd=3)
cpuid = 2
time = 1563015377
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0006f15290
vpanic() at vpanic+0x19d/frame 0xfffffe0006f152e0
panic() at panic+0x43/frame 0xfffffe0006f15340
g_io_request() at g_io_request+0x4a0/frame 0xfffffe0006f15390
swp_pager_trimswapspace() at swp_pager_trimswapspace+0x24e/frame 0xfffffe0006f153e0
swap_pager_putpages() at swap_pager_putpages+0x10b/frame 0xfffffe0006f15480
vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe0006f15580
vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe0006f157f0
vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe0006f15a70

https://people.freebsd.org/~pho/stress/log/dougm049.txt

dougm updated this revision to Diff 59728. Jul 13 2019, 6:02 PM

g_io_request requires that b_data be NULL before BIO_DELETE. Make it so.

pho added a comment. Jul 14 2019, 7:35 AM

With D20863.59728.diff I got:

20190714 08:32:23 all (7/10): swap5.sh
panic: non-NULL bp->data in g_io_request(cmd=3)
cpuid = 4
time = 1563086020
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c07fd260
vpanic() at vpanic+0x19d/frame 0xfffffe00c07fd2b0
panic() at panic+0x43/frame 0xfffffe00c07fd310
g_io_request() at g_io_request+0x4a0/frame 0xfffffe00c07fd360
swapgeom_strategy() at swapgeom_strategy+0x26c/frame 0xfffffe00c07fd3a0
swp_pager_strategy() at swp_pager_strategy+0x10d/frame 0xfffffe00c07fd3d0
swp_pager_trimswapspace() at swp_pager_trimswapspace+0x5f/frame 0xfffffe00c07fd3f0
swap_pager_putpages() at swap_pager_putpages+0x187/frame 0xfffffe00c07fd480
vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe00c07fd580
vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe00c07fd7f0
vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe00c07fda70

https://people.freebsd.org/~pho/stress/log/dougm050.txt

dougm added a comment. Jul 14 2019, 8:23 AM

It would be unfair to Peter to keep guessing solutions for this problem and letting him try them out. Does anyone listening know what I've got wrong here?

markj added a comment. Jul 14 2019, 5:23 PM

It would be unfair to Peter to keep guessing solutions for this problem and letting him try them out. Does anyone listening know what I've got wrong here?

I believe swp_pager_strategy() needs to grow some awareness of BIO_DELETEs. In particular it currently assumes that it's always handling a BIO_WRITE and either maps the page array (if the backing device cannot handle unmapped pages) or sets a special marker value for b_data indicating that the pages are unmapped. For BIO_DELETEs we don't want to do either of these things and should just pass the BIO to the device's strategy method.

dougm updated this revision to Diff 59740. Jul 14 2019, 5:37 PM

Bypass swp_pager_strategy() for the BIO_DELETE case and call sp->sw_strategy(bp, sp) directly.

alc added inline comments. Jul 14 2019, 8:52 PM
sys/vm/swap_pager.c
953

sw_dev_mtx isn't being released.

dougm added inline comments. Jul 14 2019, 8:57 PM
sys/vm/swap_pager.c
953

swapdev_trim releases it if it returns a non-NULL bp.

sw_strategy -> swapgeom_strategy acquires and releases it.

alc added a comment. Jul 14 2019, 9:31 PM

I think that swapgeom_strategy() should be made to handle a BIO_DELETE just like a BIO_WRITE.

dougm updated this revision to Diff 59745. Jul 14 2019, 10:27 PM

Make swapgeom_strategy() handle a BIO_DELETE just like a BIO_WRITE.

pho added a comment. Jul 15 2019, 7:45 AM

Some KTR output for a short test of D20863.59745.diff:
https://people.freebsd.org/~pho/stress/log/dougm051.txt

pho added a comment. Jul 17 2019, 1:26 PM

I have completed a full stress2 test with D20863.59745.diff on amd64 without seeing any problems.

markj added a comment. Jul 19 2019, 4:29 PM

I'll give this patch a try on a system running "poudriere bulk", which tends to use most of my system's swap.

markj added a comment. Jul 20 2019, 6:07 PM

Here's the result so far from the test: https://reviews.freebsd.org/P277

It's been running a poudriere build of all ports for about 24 hours. The system is somewhat constrained on memory (roughly 2GB per core) so it's swapping quite a lot.

alc added a comment (edited). Jul 21 2019, 2:44 AM

Mark, I discussed this patch with Doug today and recommended that he modify swapdev_trim() so that it never sleeps and instead returns NULL. In particular, I don't think that we want the laundry thread sleeping because of a trim operation encountering a shortage of free memory. Thoughts?

alc added a comment. Jul 21 2019, 2:59 AM

Here's the result so far from the test: https://reviews.freebsd.org/P277
It's been running a poudriere build of all ports for about 24 hours. The system is somewhat constrained on memory (roughly 2GB per core) so it's swapping quite a lot.

I'm a bit concerned by the frequency of trim operations on a single page.

dougm updated this revision to Diff 59975. Jul 21 2019, 4:56 AM

Don't let the trimmer sleep waiting for resources. Don't invoke the trimmer while holding an object lock. Do let the trimmer launch as many delete ops as it wants, subject to resource constraints. Don't worry about trimming too little.

pho added a comment. Jul 21 2019, 8:58 AM

With D20863.59975.diff I got:

20190721 09:32:47 all (4/11): stealer.sh
swap_pager: out of swap space
swp_pager_getswapspace(10): failed
panic: lock (sleep mutex) swapdev not locked @ vm/swap_pager.c:867
cpuid = 1
time = 1563694424
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000891f280
vpanic() at vpanic+0x19d/frame 0xfffffe000891f2d0
panic() at panic+0x43/frame 0xfffffe000891f330
witness_unlock() at witness_unlock+0xea/frame 0xfffffe000891f370
__mtx_unlock_flags() at __mtx_unlock_flags+0x4d/frame 0xfffffe000891f3a0
swp_pager_trimswapspace() at swp_pager_trimswapspace+0xd5/frame 0xfffffe000891f400
swap_pager_putpages() at swap_pager_putpages+0x56c/frame 0xfffffe000891f4a0
vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe000891f580
vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe000891f7f0
vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe000891fa70
fork_exit() at fork_exit+0x84/frame 0xfffffe000891fab0

https://people.freebsd.org/~pho/stress/log/dougm052.txt

dougm updated this revision to Diff 59987. Jul 21 2019, 3:45 PM

If trimming fails after blocks are allocated, free the blocks and reset the trimming cursor.

pho added a comment. Jul 21 2019, 5:43 PM

I still see the panic with D20863.59987.diff:

20190721 18:39:00 all (2/11): sort.sh
panic: lock (sleep mutex) swapdev not locked @ vm/swap_pager.c:867
cpuid = 1
time = 1563727491
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0006f2b270
vpanic() at vpanic+0x19d/frame 0xfffffe0006f2b2c0
panic() at panic+0x43/frame 0xfffffe0006f2b320
witness_unlock() at witness_unlock+0xea/frame 0xfffffe0006f2b360
__mtx_unlock_flags() at __mtx_unlock_flags+0x4d/frame 0xfffffe0006f2b390
swp_pager_trimswapspace() at swp_pager_trimswapspace+0xeb/frame 0xfffffe0006f2b3e0
swap_pager_putpages() at swap_pager_putpages+0x56c/frame 0xfffffe0006f2b480
vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe0006f2b580
vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe0006f2b7f0
vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe0006f2ba70
fork_exit() at fork_exit+0x84/frame 0xfffffe0006f2bab0

https://people.freebsd.org/~pho/stress/log/dougm053.txt

The test that triggered this one is quite small:

[ `sysctl -n vm.swap_total` -eq 0 ] && exit 0

for i in `jot 6`; do
        sort /dev/zero &
        pids="$pids $!"
done
start=`date '+%s'`
while [ $((`date '+%s'` - start)) -lt 180 ]; do
        sleep 10
        pgrep -q sort || break
done
while pgrep -q sort; do
        kill -9 $pids > /dev/null 2>&1
        sleep 1
done
wait
dougm updated this revision to Diff 59996. Jul 21 2019, 7:38 PM

Back out the changes that would launch more than one trim request from a single call to swp_pager_trimswapspace.

dougm updated this revision to Diff 59997. Jul 21 2019, 7:44 PM

Stop updating blk before it might be freed.

pho added a comment. Jul 22 2019, 8:15 AM

I ran tests on a GENERICish kernel for 12 hours without any issue; however, a kernel build with "options KTR" does not compile:

cc -c -O2 -pipe -fno-strict-aliasing  -g -nostdinc  -I. -I../../.. -I../../../contrib/ck/include -I../../../contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -MD  -MF.depend.swap_pager.o -MTswap_pager.o -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include -fdebug-prefix-map=./x86=/usr/src/sys/x86/include -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding -fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wno-error-tautological-compare -Wno-error-empty-body -Wno-error-parentheses-equality -Wno-error-unused-function -Wno-error-pointer-sign -Wno-error-shift-negative-value -Wno-address-of-packed-member  -mno-aes -mno-avx  -std=iso9899:1999 -Werror  ../../../vm/swap_pager.c                                                                               
../../../vm/swap_pager.c:880:14: error: use of undeclared identifier 'cursor'
             (void*)cursor, (void*)blk, npages, (void*)sp->sw_trimmer);
                    ^
1 error generated.
dougm updated this revision to Diff 60019. Jul 22 2019, 2:48 PM

Fix broken log entry.

alc added a comment. Jul 27 2019, 7:56 PM

I suggest that you make a separate patch to tidy up swap_pager_putpages(). Specifically, I would do the following things: (1) convert the panic() at the function's start into a KASSERT(), (2) convert sync into a bool-type variable, (3) make the variable definitions style(9) compliant, and (4) move the calls to swp_pager_init_freerange() and swp_pager_freeswapspace() out from under the object lock.

dougm updated this revision to Diff 60217. Jul 28 2019, 7:39 PM

Update after checking in overlapping changes.

alc added inline comments. Jul 28 2019, 8:54 PM
sys/vm/swap_pager.c
976

This actually looks like a bug fix. Specifically, it addresses an unlikely, but hypothetically possible, swap space leak. Please create a separate change for this fix.

sys/vm/swap_pager.h
75–76

Mark, should we introduce a SW_TRIM flag here that will be set by a new syscall?

pho added a comment. Jul 30 2019, 5:43 AM

Ran tests for another 23 hours on D20863.60217.diff. No problems seen.

markj added a comment. Jul 31 2019, 3:14 PM
In D20863#455860, @alc wrote:

Mark, I discussed this patch with Doug today and recommended that he modify swapdev_trim() so that it never sleeps and instead returns NULL. In particular, I don't think that we want the laundry thread sleeping because of a trim operation encountering a shortage of free memory. Thoughts?

Sorry, I missed this. Yes, I think that's correct. We occasionally see deadlocks in CAM in Peter's tests, where we rely on an M_WAITOK allocation succeeding in the I/O path.

I have been wondering lately if it would be reasonable to let the laundry thread maintain a chunk of reserved memory to handle these cases, but I think we should avoid introducing new ones whenever possible.

In D20863#455861, @alc wrote:

Here's the result so far from the test: https://reviews.freebsd.org/P277
It's been running a poudriere build of all ports for about 24 hours. The system is somewhat constrained on memory (roughly 2GB per core) so it's swapping quite a lot.

I'm a bit concerned by the frequency of trim operations on a single page.

FWIW, I have started testing this on my workstation as well. Its use of swap is more varied, so I'm interested to see if we get a similar result.

sys/vm/swap_pager.h
75–76

Yes, I think so. The features I'd like to see added are TRIM support, and support for prioritized swap devices. I imagined modifying the system call to look like this: swapon(const char *device, int flags, int priority). We'd be able to request TRIM support by setting a flag.

alc added a comment. Aug 1 2019, 5:50 AM
In D20863#455860, @alc wrote:

Mark, I discussed this patch with Doug today and recommended that he modify swapdev_trim() so that it never sleeps and instead returns NULL. In particular, I don't think that we want the laundry thread sleeping because of a trim operation encountering a shortage of free memory. Thoughts?

Sorry, I missed this. Yes, I think that's correct. We occasionally see deadlocks in CAM in Peter's tests, where we rely on an M_WAITOK allocation succeeding in the I/O path.
I have been wondering lately if it would be reasonable to let the laundry thread maintain a chunk of reserved memory to handle these cases, but I think we should avoid introducing new ones whenever possible.

Once upon a time, I proposed to kib@ that geom allocations for the page daemon (now the laundry thread) should be specifying M_USE_RESERVE. This may not be perfect but I think that it would be an improvement.

alc added inline comments. Aug 1 2019, 5:56 AM
sys/vm/swap_pager.h
75–76

Should we be modifying the existing syscall or adding a new one?

markj added a comment. Aug 1 2019, 11:03 PM
In D20863#455861, @alc wrote:

Here's the result so far from the test: https://reviews.freebsd.org/P277
It's been running a poudriere build of all ports for about 24 hours. The system is somewhat constrained on memory (roughly 2GB per core) so it's swapping quite a lot.

I'm a bit concerned by the frequency of trim operations on a single page.

FWIW, I have started testing this on my workstation as well. Its use of swap is more varied, so I'm interested to see if we get a similar result.

Here are some results from my workstation: https://reviews.freebsd.org/P284

In D20863#458992, @alc wrote:
In D20863#455860, @alc wrote:

Mark, I discussed this patch with Doug today and recommended that he modify swapdev_trim() so that it never sleeps and instead returns NULL. In particular, I don't think that we want the laundry thread sleeping because of a trim operation encountering a shortage of free memory. Thoughts?

Sorry, I missed this. Yes, I think that's correct. We occasionally see deadlocks in CAM in Peter's tests, where we rely on an M_WAITOK allocation succeeding in the I/O path.
I have been wondering lately if it would be reasonable to let the laundry thread maintain a chunk of reserved memory to handle these cases, but I think we should avoid introducing new ones whenever possible.

Once upon a time, I proposed to kib@ that geom allocations for the page daemon (now the laundry thread) should be specifying M_USE_RESERVE. This may not be perfect but I think that it would be an improvement.

What do you think of having UMA satisfy allocations from a cache attached to the laundry thread structure? We could go through GEOM and CAM to tag allocations that might result in blocking, but there are other cases where this might be more tricky or onerous. For example, if one wants to swap to a ZFS zvol, or to an md(4) malloc device that compresses sectors (something my GSOC student is working on).

sys/vm/swap_pager.h
75–76

I think we should add a new one and rename the old one to freebsd12_swapon. libc.so provides a swapon@@FBSD_1.0 symbol; it should be straightforward to implement compatibility for old applications.

markj added a comment. Aug 3 2019, 2:29 AM
In D20863#455861, @alc wrote:

Here's the result so far from the test: https://reviews.freebsd.org/P277
It's been running a poudriere build of all ports for about 24 hours. The system is somewhat constrained on memory (roughly 2GB per core) so it's swapping quite a lot.

I'm a bit concerned by the frequency of trim operations on a single page.

FWIW, I have started testing this on my workstation as well. Its use of swap is more varied, so I'm interested to see if we get a similar result.

Here are some results from my workstation: https://reviews.freebsd.org/P284

Another dump, after an extra day of uptime: https://reviews.freebsd.org/P285

dougm updated this revision to Diff 60414. Aug 3 2019, 3:06 AM

Update log messages to distinguish cases where trimming is aborted before it gets started. Currently, a stretch of log entries like:

	 25403 swapdev_trim: cursor 0x1a18e6 blk 0x1e18db size    22 trimmer 0x1e18f1
	 25402 swapdev_trim: cursor 0x1a18e4 blk 0x1e18db size    22 trimmer 0x1e18f1
	 25401 swapdev_trim: cursor 0x1a18e2 blk 0x1e18db size    22 trimmer 0x1e18f1
	 25400 swapdev_trim: cursor 0x1a18e1 blk 0x1e18db size    22 trimmer 0x1e18f1
	 25399 swapdev_trim: cursor 0x1a18df blk 0x1e18db size    22 trimmer 0x1e18f1
	 25398 swapdev_trim: cursor 0x1a18de blk 0x1e18db size    22 trimmer 0x1e18f1
	 25397 swapdev_trim: cursor 0x1a18dc blk 0x1e18db size    22 trimmer 0x1e18f1
	 25396 swapdev_trim: cursor 0x1a1744 blk 0x1e18db size    22 trimmer 0x1e18f1
	 25395 swapdev_trim: cursor 0x1a1478 blk 0x1e1724 size    18 trimmer 0x1e1736
	 25394 swapdev_trim: cursor 0x1a0dcc blk 0x1e1464 size     5 trimmer 0x1e1469

makes it look like the trimmer is 'stuck', trimming the same range over and over, when in fact it is stuck waiting for either a memory allocation to succeed or a writer count to become nonzero. Add something to the end of the log entry to distinguish these cases from an actual trim-start.

pho added a comment. Aug 3 2019, 1:03 PM

I ran tests on D20863.60414.diff for 6 hours and got this KTR log.

dougm updated this revision to Diff 60489. Aug 6 2019, 5:18 AM

Without changing the behavior of this patch, add a SW_TRIM flag and a priority field to swdevt. Pass those from sys_swapon through kern_swapon (which does the work formerly done by sys_swapon) to swaponsomething, which creates the swap object.

dougm added a comment. Aug 6 2019, 6:10 AM

Without changing the behavior of this patch, add a SW_TRIM flag and a priority field to swdevt. Pass those from sys_swapon through kern_swapon (which does the work formerly done by sys_swapon) to swaponsomething, which creates the swap object.

Mark -
I discussed the meaning of "priority" with Alan. Suppose that there are 4 swap devices A, B, C, D with priorities 2, 1, 1, 0. My understanding, after that discussion, is that a getswapspace call for 16 blocks would behave as follows.

  1. Look for 16 blocks on A.
  2. Look for 8 blocks on A; if successful, look for 8 more blocks on A.
  3. Look for 4, then 2, then 1 block on A.
  4. Look for 16 blocks on B, then on C.
  5. Look for 8 blocks on B, if successful, look for 8 more blocks on C.
  6. Round robin between B and C with smaller and smaller requests.
  7. If the demand for blocks is not satisfied, try D.

Is that your understanding? Do we need to preserve some state to tell us which of B and C is the next level 1 swap device to be asked for blocks?

markj added a comment. Aug 6 2019, 3:53 PM

Without changing the behavior of this patch, add a SW_TRIM flag and a priority field to swdevt. Pass those from sys_swapon to kern_swapon (which does the work formerly done by sys_swapon) to swaponsomething, which creates the swap object.

Mark -
I discussed the meaning of "priority" with Alan. Suppose that there are 4 swap devices A, B, C, D with priorities 2, 1, 1, 0. My understanding, after that discussion, is that a getswapspace call for 16 blocks would behave as follows.

  1. Look for 16 blocks on A.
  2. Look for 8 blocks on A; if successful, look for 8 more blocks on A.
  3. Look for 4, then 2, then 1 block on A.
  4. Look for 16 blocks on B, then on C.
  5. Look for 8 blocks on B, if successful, look for 8 more blocks on C.
  6. Round robin between B and C with smaller and smaller requests.
  7. If the demand for blocks is not satisfied, try D.

Is that your understanding? Do we need to preserve some state to tell us which of B and C is the next level 1 swap device to be asked for blocks?

I have not thought about it too carefully yet, but that seems like a reasonable strategy to me. I do think we should try to ensure precise round-robin among swap devices of equal priority since that would reduce the amount of write bandwidth used per device during heavy paging.

It occurs to me though that there's another consideration here. In general I would expect higher-priority swap devices to be smaller than lower priority swap devices; we might have a three-level setup with a compressed memory "disk", a small SSD partition, and a large partition on a slow HDD. If we always attempt to allocate blocks from the highest-priority swap device first we might get a sort of priority inversion over time: cold pages which are written and rarely or never referenced (e.g., the first stack page of a long-lived application) would be paged out first, filling the highest-priority device(s). Pages which are regularly referenced might end up getting paged out to a slower (lower priority) swap device, so they suffer higher latency for paging operations despite incurring such operations more frequently.

We might therefore want some primitive notion of per-page swap priority (perhaps using spare bits in the act_count field of struct vm_page). A policy might be to assign a low priority at page allocation time, and page out low-priority pages to low-priority swap device(s). Then, let a page-in raise the page's priority so that a subsequent pageout of the page targets a higher-priority swap device than last time. (I think such a priority could also be useful in determining whether to page out in the first place: if the laundry thread is frequently encountering high-priority dirty pages, the system is possibly thrashing and might do well to exert more pressure on large in-kernel consumers of memory, like the ZFS ARC.)

Should I work on adding a new swapon system call so as to make testing a priority parameter easier? We don't have to settle on a final system call interface yet.

alc added a comment. Aug 6 2019, 10:07 PM

Mark,

Before we get into the design of the priority mechanism, I'd like to discuss some of the test results and what I think that we should do in response to them.

The logs are showing a lot of aborted trims due to nsw_wcount_async being zero. Typically, there are a large number of retries on the same set of blocks, until we finally succeed in allocating a buf to perform the trim. However, occasionally I've seen this pattern actually end in an abort due to the set of blocks now being too close to the allocation cursor, and so that set never gets trimmed.

I'm not surprised by this behavior. The laundry thread is going to perform almost back-to-back buf allocations in putpages. And, it is going to finish its work before any of the pageout operations complete. So, unless we attempt the trim early in the sequence of back-to-back calls to putpages, and its many buf allocations, I expect that the trimming code is going to be unable to allocate a buf for trimming until the next time that the laundry thread wants to pageout.

To address this, what if we recycled the buf for potential trimming at the end of swp_pager_async_iodone()? And, if there is no need to trim, we free the buf. Is there any reason that we can't perform a swap_pager_strategy() call from the execution context in which we perform swp_pager_async_iodone()?

markj added a comment. Aug 8 2019, 2:54 AM
In D20863#460122, @alc wrote:

Mark,
Before we get into the design of the priority mechanism, I'd like to discuss some of the test results and what I think that we should do in response to them.
The logs are showing a lot of aborted trims due to nsw_wcount_async being zero. Typically, there are a large number of retries on the same set of blocks, until we finally succeed in allocating a buf to perform the trim. However, occasionally I've seen this pattern actually end in an abort due to the set of blocks now being too close to the allocation cursor, and so that set never gets trimmed.
I'm not surprised by this behavior. The laundry thread is going to perform almost back-to-back buf allocations in putpages. And, it is going to finish its work before any of the pageout operations complete. So, unless we attempt the trim early in the sequence of back-to-back calls to putpages, and its many buf allocations, I expect that the trimming code is going to be unable to allocate a buf for trimming until the next time that the laundry thread wants to pageout.
To address this, what if we recycled the buf for potential trimming at the end of swp_pager_async_iodone()? And, if there is no need to trim, we free the buf. Is there any reason that we can't perform a swap_pager_strategy() call from the execution context in which we perform swp_pager_async_iodone()?

That context will be a CAM doneq thread or the GEOM up thread. There is certainly a constraint there in that we cannot sleep, but I don't think that's required for TRIM if we are reusing the buf. I can't think of any problems that would arise, but I also don't know of any precedent for doing this.

Is there a reason we can't or shouldn't simply provide a separate limit for TRIM bufs?