Sketch of swap trimming with two cursors. I know it needs to wait for trimming to complete in a separate thread. It's just an outline.
Details
The final product isn't here to test. What is here is a patch that shouldn't affect anything but performance, a little, but that should log something to help figure out how to tune the code that will trim freed swap space. So, any sort of test that exercises plenty of swap space is good. What is needed from it is the log entries generated from this line:
CTR4(KTR_SPARE5, "swapdev_trim: cursor %p alloc %d trimmer %p trimsize %d",
(void*)sp->sw_cursor, alloc_npages, (void*)sp->sw_trimmer, nblks);
Whenever you have the time to exercise this code, Peter, please do.
Diff Detail
- Lint
- Lint Skipped 
- Unit
- Tests Skipped 
Event Timeline
I would really like to see a complete prototype before we start committing piecemeal changes. Largely, because it is not clear to me that a "two cursors"-based approach is the right design. Alternatively, when trim is enabled, we could introduce a "coalescing queue" in front of the blist_free() calls. In other words, when trim is enabled, where blist_free() is currently called, we would simply enqueue the block numbers. When the number and/or size of the entries in the coalescing queue crosses some threshold, we would wake up a helper thread that issues trim operations for some of the entries and calls blist_free() on the corresponding blocks after the trim operation has completed. Situations where the number of free blocks in the blist falls below a threshold, may also require special handling, possibly even skipping the trim operation. The unknown with this alternative approach is how large the coalescing queue would be, or really how much memory we would allow it to consume, in order for it to be effective.
To be clear, I am not arguing that the alternative design is the better one. Only that there are plausible alternatives, and that we shouldn't start making incremental changes until we have evidence for following one approach or the other.
| sys/vm/swap_pager.c | ||
|---|---|---|
| 739–747 | Take a look at ffs_blkfree_sendtrim() in ufs/ffs/ffs_alloc.c | |
Something less obviously blocking, though I'm sure some flags are set wrong and the wait for an available write resource (nsw_wcount_async) is unacceptible. There must a queue somewhere I could stick that task in, waiting for writing to be okay again.
| sys/vm/swap_pager.c | ||
|---|---|---|
| 749–750 | I think that you are familiar with a pthread_cond_wait(). This is similar, in particular, the mutex must be held by the caller. And that mutex is being used to synchronize access to nsw_wcount_async. | |
In swap_pager_reserve, do all the allocation before acquiring the object lock. If all the allocation can't be done, leave the object alone.
Fix a 32- vs 64- bit problem.
| sys/vm/swap_pager.c | ||
|---|---|---|
| 1030 | If swap space is fragmented, a call that allocates a moderate amount of swap space will overflow the stack and crash the kernel. | |
I'll try again and avoid a panic, but in the meantime here's some trace: https://people.freebsd.org/~pho/stress/log/dougm047.txt
Don't try to trim exactly the swap device we just allocated from; just look for one that needs trimming.
Tweak the parameters to make the max trim size bigger, and make the trim zone smaller.
Do keep logging trimming events, but also try to actually trim something.
Compiles and boots.
With D20863.59658.diff I got:
20190713 12:51:07 all (40/652): crossmp3.sh panic: non-NULL bp->data in g_io_request(cmd=3) cpuid = 2 time = 1563015377 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0006f15290 vpanic() at vpanic+0x19d/frame 0xfffffe0006f152e0 panic() at panic+0x43/frame 0xfffffe0006f15340 g_io_request() at g_io_request+0x4a0/frame 0xfffffe0006f15390 swp_pager_trimswapspace() at swp_pager_trimswapspace+0x24e/frame 0xfffffe0006f153e0 swap_pager_putpages() at swap_pager_putpages+0x10b/frame 0xfffffe0006f15480 vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe0006f15580 vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe0006f157f0 vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe0006f15a70
With D20863.59728.diff I got:
20190714 08:32:23 all (7/10): swap5.sh panic: non-NULL bp->data in g_io_request(cmd=3) cpuid = 4 time = 1563086020 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c07fd260 vpanic() at vpanic+0x19d/frame 0xfffffe00c07fd2b0 panic() at panic+0x43/frame 0xfffffe00c07fd310 g_io_request() at g_io_request+0x4a0/frame 0xfffffe00c07fd360 swapgeom_strategy() at swapgeom_strategy+0x26c/frame 0xfffffe00c07fd3a0 swp_pager_strategy() at swp_pager_strategy+0x10d/frame 0xfffffe00c07fd3d0 swp_pager_trimswapspace() at swp_pager_trimswapspace+0x5f/frame 0xfffffe00c07fd3f0 swap_pager_putpages() at swap_pager_putpages+0x187/frame 0xfffffe00c07fd480 vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe00c07fd580 vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe00c07fd7f0 vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe00c07fda70
It would be unfair to Peter to keep guessing solutions for this problem and letting him try them out. Does anyone listening know what I've got wrong here?
I believe swp_pager_strategy() needs to grow some awareness of BIO_DELETEs. In particular it currently assumes that it's always handling a BIO_WRITE and either maps the page array (if the backing device cannot handle unmapped pages) or sets a special marker value for b_data indicating that the pages are unmapped. For BIO_DELETEs we don't want to do either of these things and should just pass the BIO to the device's strategy method.
Bypass swp_pager_strategy for the BIO_DELETE case and just go to sp->sw_strategy(bp, sp);
| sys/vm/swap_pager.c | ||
|---|---|---|
| 1013 | sw_dev_mtx isn't being released. | |
| sys/vm/swap_pager.c | ||
|---|---|---|
| 1013 | swapdev_trim releases it if it returns a non-NULL bp. sw_strategy -> swapgeom_strategy acquires and releases it. | |
I think that swapgeom_strategy() should be made to handle a BIO_DELETE just like a BIO_WRITE.
Some KTR output for a short test of D20863.59745.diff:
https://people.freebsd.org/~pho/stress/log/dougm051.txt
I have completed a full stress2 test with D20863.59745.diff on amd64 without seeing any problems.
I'll give this patch a try on a system running "poudriere bulk", which tends to use most of my system's swap.
Here's the result so far from the test: https://reviews.freebsd.org/P277
It's been running a poudriere build of all ports for about 24 hours. The system is somewhat constrained on memory (roughly 2GB per core) so it's swapping quite a lot.
Mark, I discussed this patch with Doug today and recommended that he modify swapdev_trim() so that it never sleeps and instead returns NULL. In particular, I don't think that we want the laundry thread sleeping because of a trim operation encountering a shortage of free memory. Thoughts?
Don't let the trimmer sleep waiting for a resources. Don't invoke the trimmer while holding an object lock. Do let the trimmer launch as many delete ops as it wants, subject to resource constraints. Don't worry about trimming too little.
With D20863.59975.diff I got:
20190721 09:32:47 all (4/11): stealer.sh swap_pager: out of swap space swp_pager_getswapspace(10): failed panic: lock (sleep mutex) swapdev not locked @ vm/swap_pager.c:867 cpuid = 1 time = 1563694424 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe000891f280 vpanic() at vpanic+0x19d/frame 0xfffffe000891f2d0 panic() at panic+0x43/frame 0xfffffe000891f330 witness_unlock() at witness_unlock+0xea/frame 0xfffffe000891f370 __mtx_unlock_flags() at __mtx_unlock_flags+0x4d/frame 0xfffffe000891f3a0 swp_pager_trimswapspace() at swp_pager_trimswapspace+0xd5/frame 0xfffffe000891f400 swap_pager_putpages() at swap_pager_putpages+0x56c/frame 0xfffffe000891f4a0 vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe000891f580 vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe000891f7f0 vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe000891fa70 fork_exit() at fork_exit+0x84/frame 0xfffffe000891fab0
If trimming fails after blocks are allocated, free the blocks and reset the trimming cursor.
I still see the panic with D20863.59987.diff:
20190721 18:39:00 all (2/11): sort.sh panic: lock (sleep mutex) swapdev not locked @ vm/swap_pager.c:867 cpuid = 1 time = 1563727491 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0006f2b270 vpanic() at vpanic+0x19d/frame 0xfffffe0006f2b2c0 panic() at panic+0x43/frame 0xfffffe0006f2b320 witness_unlock() at witness_unlock+0xea/frame 0xfffffe0006f2b360 __mtx_unlock_flags() at __mtx_unlock_flags+0x4d/frame 0xfffffe0006f2b390 swp_pager_trimswapspace() at swp_pager_trimswapspace+0xeb/frame 0xfffffe0006f2b3e0 swap_pager_putpages() at swap_pager_putpages+0x56c/frame 0xfffffe0006f2b480 vm_pageout_flush() at vm_pageout_flush+0xff/frame 0xfffffe0006f2b580 vm_pageout_cluster() at vm_pageout_cluster+0x3e6/frame 0xfffffe0006f2b7f0 vm_pageout_laundry_worker() at vm_pageout_laundry_worker+0x2f6/frame 0xfffffe0006f2ba70 fork_exit() at fork_exit+0x84/frame 0xfffffe0006f2bab0
https://people.freebsd.org/~pho/stress/log/dougm053.txt
The test that triggered this one is quite small:
[ `sysctl -n vm.swap_total` -eq 0 ] && exit 0
for i in `jot 6`; do
        sort /dev/zero &
        pids="$pids $!"
done
start=`date '+%s'`
while [ $((`date '+%s'` - start)) -lt 180 ]; do
        sleep 10
        pgrep -q sort || break
done
while pgrep -q sort; do
        kill -9 $pids > /dev/null 2>&1
        sleep 1
done
waitBack out the changes that would launch more that one trim request from a single call to swp_pager_trimswapspace.
I ran tests on a GENERICish kernel for 12 hours without any issue, however a kernel build with "options KTR" does not compile:
cc -c -O2 -pipe -fno-strict-aliasing  -g -nostdinc  -I. -I../../.. -I../../../contrib/ck/include -I../../../contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h    -fno-omit-frame-pointer -mno-omit-leaf-frame-pointer -MD  -MF.depend.swap_pager.o -MTswap_pager.o -fdebug-prefix-map=./machine=/usr/src/sys/amd64/include -fdebug-prefix-map=./x86=/usr/src/sys/x86/include -mcmodel=kernel -mno-red-zone -mno-mmx -mno-sse -msoft-float  -fno-asynchronous-unwind-tables -ffreestanding -fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wno-error-tautological-compare -Wno-error-empty-body -Wno-error-parentheses-equality -Wno-error-unused-function -Wno-error-pointer-sign -Wno-error-shift-negative-value -Wno-address-of-packed-member  -mno-aes -mno-avx  -std=iso9899:1999 -Werror  ../../../vm/swap_pager.c                                                                               
../../../vm/swap_pager.c:880:14: error: use of undeclared identifier 'cursor'
             (void*)cursor, (void*)blk, npages, (void*)sp->sw_trimmer);
                    ^
1 error generated.I suggest that you make a separate patch to tidy up swap_pager_putpages(). Specifically, I would do the following things: (1) convert the panic() at the function's start into a KASSERT(), (2) convert sync into a bool-type variable, (3) make the variable definitions style(9) compliant, and (4) move the calls to swp_pager_init_freerange() and swp_pager_freeswapspace() out from under the object lock.
Sorry, I missed this. Yes, I think that's correct. We occasionally see deadlocks in CAM in Peter's tests, where we rely on a M_WAITOK allocation succeeding in the I/O path.
I have been wondering lately if it would be reasonable to let the laundry thread maintain a chunk of reserved memory to handle these cases, but I think we should avoid introducing new ones whenever possible.
FWIW, I have started testing this on my workstation as well. Its use of swap is more varied, so I'm interested to see if we get a similar result.
| sys/vm/swap_pager.h | ||
|---|---|---|
| 75–76 | Yes, I think so. The features I'd like to see added are TRIM support, and support for prioritized swap devices. I imagined modifying the system call to look like this: swapon(const char *device, int flags, int priority). We'd be able to request TRIM support by setting a flag. | |
Once upon a time, I proposed to kib@ that geom allocations for the page daemon (now the laundry thread) should be specifying M_USE_RESERVE. This may not be perfect but I think that it would be an improvement.
| sys/vm/swap_pager.h | ||
|---|---|---|
| 75–76 | Should we be modifying the existing syscall or adding a new one? | |
FWIW, I have started testing this on my workstation as well. Its use of swap is more varied, so I'm interested to see if we get a similar result.
Here are some results from my workstation: https://reviews.freebsd.org/P284
What do you think of having UMA satisfy allocations from a cache attached to the laundry thread structure? We could go through GEOM and CAM to tag allocations that might result in blocking, but there are other cases where this might be more tricky or onerous. For example, if one wants to swap to a ZFS zvol, or to an md(4) malloc device that compresses sectors (something my GSOC student is working on).
| sys/vm/swap_pager.h | ||
|---|---|---|
| 75–76 | I think we should add a new one and rename the old one to freebsd12_swapon. libc.so provides a swapon@@FBSD_1.0 symbol; it should be straightforward to implement compatibility for old applications. | |
Update log messages to distinguish cases where trimming is aborted before it gets started. Currently, a stretch of log entries like:
25403 swapdev_trim: cursor 0x1a18e6 blk 0x1e18db size 22 trimmer 0x1e18f1 25402 swapdev_trim: cursor 0x1a18e4 blk 0x1e18db size 22 trimmer 0x1e18f1 25401 swapdev_trim: cursor 0x1a18e2 blk 0x1e18db size 22 trimmer 0x1e18f1 25400 swapdev_trim: cursor 0x1a18e1 blk 0x1e18db size 22 trimmer 0x1e18f1 25399 swapdev_trim: cursor 0x1a18df blk 0x1e18db size 22 trimmer 0x1e18f1 25398 swapdev_trim: cursor 0x1a18de blk 0x1e18db size 22 trimmer 0x1e18f1 25397 swapdev_trim: cursor 0x1a18dc blk 0x1e18db size 22 trimmer 0x1e18f1 25396 swapdev_trim: cursor 0x1a1744 blk 0x1e18db size 22 trimmer 0x1e18f1 25395 swapdev_trim: cursor 0x1a1478 blk 0x1e1724 size 18 trimmer 0x1e1736 25394 swapdev_trim: cursor 0x1a0dcc blk 0x1e1464 size 5 trimmer 0x1e1469
makes it look like the trimmer is 'stuck', trimming the same range over and over, when it fact it is stuck, waiting for either a memory allocation to work or a writer count to be nonzero. Add something to the end of the log entry to distinguish these cases from an actual trim-start.
Without changing the behavior of this patch, add a SW_TRIM flag and a priority field to swdevt. Pass those from sys_swapon to kern_swapon (which does the work formerly done by sys_swapon) to swaponsomething, which creates the swap object.
Mark -
I discussed the meaning of "priority" with Alan.  Suppose the that there are 4 swap devices A, B, C, D with priorities 2,1,1,0.  My understanding, after that discussion, is that a getswapspace call for 16 blocks would behave as follows.
- Look for 16 blocks on A.
- Look for 8 blocks on A; if successful, look for 8 more blocks on A.
- Look for 4, then 2, then 1 block on A.
- Look for 16 blocks on B, then on C.
- Look for 8 blocks on B, if successful, look for 8 more blocks on C.
- Round robin between B and C with smaller and smaller requests.
- If the demand for blocks is not satisfied, try D.
Is that your understanding? Do we need to preserve some state to tell us which of B and C is the next level 1 swap device to be asked for blocks?
I have not thought about it too carefully yet, but that seems like a reasonable strategy to me. I do think we should try to ensure precise round-robin among swap devices of equal priority since that would reduce the amount of write bandwidth used per device during heavy paging.
It occurs to me though that there's another consideration here. In general I would expect higher-priority swap devices to be smaller than lower priority swap devices; we might have a three-level setup with a compressed memory "disk", a small SSD partition, and a large partition on a slow HDD. If we always attempt to allocate blocks from the highest-priority swap device first we might get a sort of priority inversion over time: cold pages which are written and rarely or never referenced (e.g., the first stack page of a long-lived application) would be paged out first, filling the highest-priority device(s). Pages which are regularly referenced might end up getting paged out to a slower (lower priority) swap device, so they suffer higher latency for paging operations despite incurring such operations more frequently.
We might therefore want some primitive notion of per-page swap priority (perhaps using spare bits in the act_count field of struct vm_page). A policy might be to assign a low priority at page allocation time, and page out low-priority pages to low-priority swap device(s). Then, let a page-in raise the page's priority so that a subsequent pageout of the page targets a higher-priority swap device than last time. (I think such a priority could also be useful in determining whether to page out in the first place: if the laundry thread is frequently encountering high-priority dirty pages, the system is possibly thrashing and might do well to exert more pressure on large in-kernel consumers of memory, like the ZFS ARC.)
Should I work on adding a new swapon system call so as to making testing a priority parameter easier? We don't have to settle on a final system call interface yet.
Mark,
Before we get into the design of the priority mechanism, I'd like to discuss some of the test results and what I think that we should do in response to them.
The logs are showing a lot of aborted trims due to nsw_wcount_async being zero. Typically, there are a large number of retries on the same set of blocks, until we finally succeed in allocating a buf to perform the trim. However, occasionally I've this pattern actually end in an abort due to the set of blocks now being too close to the allocation cursor, and so that set never gets trimmed.
I'm not surprised by this behavior. The laundry thread is going to perform almost back-to-back buf allocations in putpages. And, it is going to finish its work before any of the pageout operations complete. So, unless we attempt the trim early in the sequence of back-to-back calls to putpages, and its many buf allocations, I expect that the trimming code is going to be unable to allocate a buf for trimming until the next time that the laundry thread wants to pageout.
To address this, what if we recycled the buf for potential trimming at the end of swp_pager_async_iodone()? And, if there is no need to trim, we free the buf. Is there any reason that we can't perform a swap_pager_strategy() call from the execution context in which we perform swp_pager_async_iodone()?
That context will be a CAM doneq thread or the GEOM up thread. There is certainly a constraint there in that we cannot sleep, but I don't think that's required from TRIM if we are reusing the buf. I can't think of any problems that would arise, but I also don't know of any precedent for doing this.
Is there a reason we can't or shouldn't simply provide a separate limit for TRIM bufs?