Page MenuHomeFreeBSD

Let swp_pager_getswapspace allocate bigger blocks
ClosedPublic

Authored by dougm on Apr 22 2019, 4:52 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Dec 17, 1:23 AM
Unknown Object (File)
Mon, Dec 16, 10:26 AM
Unknown Object (File)
Mon, Dec 16, 3:55 AM
Unknown Object (File)
Thu, Dec 12, 9:45 PM
Unknown Object (File)
Wed, Dec 11, 7:05 PM
Unknown Object (File)
Sat, Dec 7, 3:05 PM
Unknown Object (File)
Sat, Dec 7, 8:45 AM
Unknown Object (File)
Fri, Dec 6, 4:40 AM
Subscribers

Details

Summary

A new parameter to blist_alloc specifies an upper bound on the size of the allocation request, so that the blocks allocated are from the next set of free blocks big enough to satisfy the minimum requirements of the request, and the number of blocks allocated are as many as possible, up to the specified maximum. The implementation of swp_pager_getswapspace uses this parameter to ask for a number of blocks between the new halved request size and the previous failed request size. Thus a request for 32 blocks may fail, but instead of getting only 16 blocks instead, the caller asks for 16 to 31 next, and might get 19 or 27, which is closer to what they originally wanted.

I expect this to lead to bigger block allocations and less block fragmentation, at least in some cases.

Test Plan

Peter Holm has tested this patch with a test designed to have many first allocatation requests fail. The result seems to be that the number of successful blist_alloc calls falls a little, and the number of unsuccessful blist_alloc calls falls dramatically. I don't know the details of his test, but since this change should reduce fragmentation, it could just be that having less of the free blocks appearing in 1- and 2- block pieces has a beneficial effect for his test.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Finally found the problem. Duh. Fixed it, trying everything now.

D20001.56791.diff:

Yeah!

FreeBSD 13.0-CURRENT (PHO) #14 r346790M: Sun Apr 28 21:12:11 CEST 2019
root@t2:~ # sh
# /tmp/swp.sh 
Device          1K-blocks     Used    Avail Capacity
/dev/da0p4       67108864        0 67108864     0%
/dev/da0p4       67108864  1292524 65816340     2%
/dev/da0p4       67108864  3709428 63399436     6%
/dev/da0p4       67108864  7375256 59733608    11%
/dev/da0p4       67108864  9751136 57357728    15%
:

I'll run some more tests ...

I messed up the non-inline version of bitpos in all the previous flailing about; I don't think it should affect ongoing tests, but I need to correct it.

D20001.56794.diff survived a 6 hour test. Note that a full test will take about 48 hours.

It would be helpful to have statistics on how this change affects the number of calls to blist_alloc, since that it one of its alleged benefits. The following patch to the unmodified source counts alloc successes and failures. It would be helpful to compare these statistics for the code with and without this change, on a test that produces a nontrivial number of alloc failures.

diff --git a/sys/kern/subr_blist.c b/sys/kern/subr_blist.c
index 3f8f1771d66..6a1ca51ae12 100644

  • a/sys/kern/subr_blist.c

+++ b/sys/kern/subr_blist.c
@@ -278,6 +278,13 @@ blist_destroy(blist_t bl)

free(bl, M_SWAP);

}

+#include <sys/sysctl.h>
+static SYSCTL_NODE(_debug, OID_AUTO, counters, CTLFLAG_RD, 0, "");
+
+static long alloc_success, alloc_failure;
+SYSCTL_ULONG(_debug_counters, OID_AUTO, alloc_success, CTLFLAG_RD, &alloc_success, 0, "");
+SYSCTL_ULONG(_debug_counters, OID_AUTO, alloc_failure, CTLFLAG_RD, &alloc_failure, 0, "");
+
/*

  • blist_alloc() - reserve space in the block bitmap. Return the base
  • of a contiguous region or SWAPBLK_NONE if space could

@@ -305,9 +312,12 @@ blist_alloc(blist_t bl, daddr_t count)

bl->bl_cursor = blk + count;
if (bl->bl_cursor == bl->bl_blocks)
        bl->bl_cursor = 0;

+ alloc_success++;

return (blk);
  • } else if (bl->bl_cursor == 0)

+ } else if (bl->bl_cursor == 0) {
+ alloc_failure++;

return (SWAPBLK_NONE);

+ }

        bl->bl_cursor = 0;
}

}

I'll see what I can do tomorrow. Could you possibly upload the diff?

A patch for counting blist_alloc successes and failures.

Here's the difference:

root@t2:/tmp # ministat -C 2 -w 72 alloc.HEAD.data alloc.D20001.56794.data
x alloc.HEAD.data
+ alloc.D20001.56794.data
+------------------------------------------------------------------------+
|+               +              xx     *+  +               x            x|
|         |_________________A|_________M______|A_________________|       |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5         51530         56751         52422       53484.4     2331.5118
+   5         47491         52992         52412         51021     2392.2399
No difference proven at 95.0% confidence

Here are the details:

root@t2:/tmp # cat alloc.HEAD.txt 
$ sh /tmp/alloc.sh 
FreeBSD t2.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r346898M: Tue Apr 30 14:01:30 CEST 2019     pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO  amd64
+++ /usr/src/sys/kern/subr_blist.c      (working copy)
debug.counters.alloc_success: 0
14:16:07 51614 success, 0 failures
14:20:42 56751 success, 0 failures
14:26:00 55105 success, 0 failures
14:31:25 51530 success, 0 failures
14:36:32 52422 success, 0 failures
debug.counters.alloc_success: 267422
$ 
root@t2:/tmp # 

root@t2:/tmp # cat alloc.D20001.56794.diff.txt 
$ /tmp/alloc.sh 
FreeBSD t2.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #3 r346898M: Tue Apr 30 12:27:20 CEST 2019     pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO  amd64
+++ /usr/src/sys/kern/subr_blist.c      (working copy)
+++ /usr/src/sys/sys/blist.h    (working copy)
+++ /usr/src/sys/vm/swap_pager.c        (working copy)
debug.counters.alloc_success: 0
13:32:25 52412 success, 0 failures
13:38:00 52992 success, 0 failures
13:43:00 52621 success, 0 failures
13:49:07 47491 success, 0 failures
13:54:10 49589 success, 0 failures
debug.counters.alloc_success: 255105
$ 
root@t2:/tmp #

The impact of this change is only felt when there allocation failures and retries.

While working on a new test scenario, which ran fine on HEAD:

FreeBSD t2.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #1 r346898M: Tue Apr 30 16:33:30 CEST 2019     pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO  amd64
+++ /usr/src/sys/kern/subr_blist.c      (working copy)
swapoff: removing /dev/da0p4 as swap device
Device          1K-blocks     Used    Avail Capacity
/dev/md10         4194304        0  4194304     0%
debug.counters.alloc_failure: 0
debug.counters.alloc_success: 0
18:50:25 42120 success, 3014622 failures
18:53:35 44420 success, 10817148 failures
18:56:44 39221 success, 1356175 failures
18:59:53 39428 success, 886460 failures
19:03:03 40925 success, 3850027 failures
debug.counters.alloc_failure: 19924432
debug.counters.alloc_success: 206114
swapon: adding /dev/da0p4 as swap device
$

I got this panic with D20001.56794.diff:

panic: freeing free block
cpuid = 11
time = 1556647166
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00cafdf100
vpanic() at vpanic+0x265/frame 0xfffffe00cafdf1c0
doadump() at doadump/frame 0xfffffe00cafdf220
blst_leaf_free() at blst_leaf_free+0x52/frame 0xfffffe00cafdf250
blst_meta_free() at blst_meta_free+0x44/frame 0xfffffe00cafdf2b0
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00cafdf310
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00cafdf370
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00cafdf3d0
blist_free() at blist_free+0x66/frame 0xfffffe00cafdf400
swp_pager_freeswapspace() at swp_pager_freeswapspace+0xb0/frame 0xfffffe00cafdf430
swp_pager_update_freerange() at swp_pager_update_freerange+0x52/frame 0xfffffe00cafdf460
swp_pager_meta_free_all() at swp_pager_meta_free_all+0xdc/frame 0xfffffe00cafdf4a0
swap_pager_dealloc() at swap_pager_dealloc+0x20d/frame 0xfffffe00cafdf4d0
vm_object_terminate() at vm_object_terminate+0x2a5/frame 0xfffffe00cafdf520
vm_object_deallocate() at vm_object_deallocate+0x83b/frame 0xfffffe00cafdf590
vm_map_entry_deallocate() at vm_map_entry_deallocate+0x2f/frame 0xfffffe00cafdf5c0
vm_map_process_deferred() at vm_map_process_deferred+0x106/frame 0xfffffe00cafdf600
_vm_map_unlock() at _vm_map_unlock+0x7a/frame 0xfffffe00cafdf630
vm_map_remove() at vm_map_remove+0xbe/frame 0xfffffe00cafdf670
vmspace_dofree() at vmspace_dofree+0x3f/frame 0xfffffe00cafdf6a0
vmspace_exit() at vmspace_exit+0x154/frame 0xfffffe00cafdf6e0
exit1() at exit1+0xaf9/frame 0xfffffe00cafdf920
sys_sys_exit() at sys_sys_exit+0x26/frame 0xfffffe00cafdf950

https://people.freebsd.org/~pho/stress/log/dougm027.txt

Add more information to the panic messages we generate. Add new panic messages to identify problems with different leaf allocation cases.

swp_pager_getswapspace(32): failed
panic: freeing free block: fffc0, size 22, mask 1
cpuid = 21
time = 1556698660
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2c/frame 0xfffffe00cc3493b0
kdb_backtrace() at kdb_backtrace+0x53/frame 0xfffffe00cc349460
vpanic() at vpanic+0x19d/frame 0xfffffe00cc3494b0
panic() at panic+0x43/frame 0xfffffe00cc349510
blst_leaf_free() at blst_leaf_free+0x69/frame 0xfffffe00cc349540
blst_meta_free() at blst_meta_free+0x44/frame 0xfffffe00cc3495a0
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00cc349600
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00cc349660
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00cc3496c0
blist_free() at blist_free+0x7e/frame 0xfffffe00cc3496f0
swp_pager_freeswapspace() at swp_pager_freeswapspace+0xb0/frame 0xfffffe00cc349720
swp_pager_meta_free_all() at swp_pager_meta_free_all+0x129/frame 0xfffffe00cc349760
swap_pager_dealloc() at swap_pager_dealloc+0x20d/frame 0xfffffe00cc349790
vm_object_terminate() at vm_object_terminate+0x27b/frame 0xfffffe00cc3497e0
vm_object_deallocate() at vm_object_deallocate+0x412/frame 0xfffffe00cc349840
vm_map_process_deferred() at vm_map_process_deferred+0x79/frame 0xfffffe00cc349860
vm_map_remove() at vm_map_remove+0xc6/frame 0xfffffe00cc349890
vmspace_exit() at vmspace_exit+0xd3/frame 0xfffffe00cc3498d0
exit1() at exit1+0x5ad/frame 0xfffffe00cc349940
sys_sys_exit() at sys_sys_exit+0xd/frame 0xfffffe00cc349950
syscallenter() at syscallenter+0x496/frame 0xfffffe00cc349a00
amd64_syscall() at amd64_syscall+0x4d/frame 0xfffffe00cc349ab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00cc349ab0
--- syscall (1, FreeBSD ELF64, sys_sys_exit), rip = 0x8003ab8da, rsp = 0x7fffffffe4d8, rbp = 0x7fffffffe500 ---
KDB: enter: panic
[ thread pid 4575 tid 100639 ]
Stopped at      breakpoint+0x5: popq    %rbp
db> x/s version
version:        FreeBSD 13.0-CURRENT #2 r346986M: Wed May  1 09:39:42 CEST 2019\012    pho@t1.osted.lan:/usr/src/sys/amd64/compile/PHO\012
db>

Fix bitpos again. Shortcut the zero case in next_leaf_alloc.

This looks promising. I'll add the counters and get some stats.

The summary:

$ ministat -C 2 -w 72 HEAD.dat D20001.56931.dat 
x HEAD.dat
+ D20001.56931.dat
+------------------------------------------------------------------------+
|+       +             x          x+                x    + +         x  x|
|     |_____________________|___A__M______________A_M______|___________| |
+------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5         39425         42441         41233       41088.8     1320.1103
+   5         38104         41668         40202         40021     1638.8319
No difference proven at 95.0% confidence
$ 

The details:

$ ls -l HEAD.log D20001.56931.log 
-rw-r--r--  1 root  wheel  828  2 maj  09:52 D20001.56931.log
-rw-r--r--  1 root  wheel  740  2 maj  10:27 HEAD.log
$ cat HEAD.log D20001.56931.log 
FreeBSD t2.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #3 r346994M: Thu May  2 09:55:55 CEST 2019     pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO  amd64
+++ /usr/src/sys/kern/subr_blist.c      (working copy)
swapoff: removing /dev/da0p4 as swap device
Device          1K-blocks     Used    Avail Capacity
/dev/md10         4194304        0  4194304     0%
debug.counters.alloc_failure: 0
debug.counters.alloc_success: 0
10:14:24 42441 success, 8539836 failures
10:17:33 42254 success, 2760273 failures
10:20:42 40091 success, 2245127 failures
10:23:52 39425 success, 934125 failures
10:27:01 41233 success, 9613029 failures
debug.counters.alloc_failure: 24092390
debug.counters.alloc_success: 205444
swapon: adding /dev/da0p4 as swap device
FreeBSD t2.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #2 r346994M: Thu May  2 09:30:53 CEST 2019     pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO  amd64
+++ /usr/src/sys/kern/subr_blist.c      (working copy)
+++ /usr/src/sys/sys/blist.h    (working copy)
+++ /usr/src/sys/vm/swap_pager.c        (working copy)
swapoff: removing /dev/da0p4 as swap device
Device          1K-blocks     Used    Avail Capacity
/dev/md10         4194304        0  4194304     0%
debug.counters.alloc_failure: 0
debug.counters.alloc_success: 0
09:39:32 41535 success, 391291 failures
09:42:41 41668 success, 1302308 failures
09:45:51 38596 success, 162214 failures
09:49:00 38104 success, 177041 failures
09:52:10 40202 success, 527782 failures
debug.counters.alloc_failure: 2560636
debug.counters.alloc_success: 200105
swapon: adding /dev/da0p4 as swap device
$ 

Remove debugging panic checks from blst_leaf_alloc.

I'm a little perplexed by the test results, as they show a very small decrease in successful allocations and a much more significant reduction in allocation failures. I don't know what to make of that.

I'll redo the test with 56965 once I free up my test box.

Modify the stats-gathering patch to track the average allocation request size.

While waiting for free H/W I tried to run the tests on a box running i386:

$ make subr_blist.o
cc -c -O -pipe  -g -nostdinc  -I. -I../../.. -I../../../contrib/ck/include -I../../../contrib/libfdt -D_KERNEL -DHAVE_KERNEL_OPTION_HEADERS -include opt_global.h    -MD  -MF.depend.subr_blist.o -MTsubr_blist.o -fdebug-prefix-map=./machine=/usr/src/sys/i386/include -fdebug-prefix-map=./x86=/usr/src/sys/x86/include -mno-mmx -mno-sse -msoft-float -ffreestanding -fwrapv -fstack-protector -gdwarf-2 -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wcast-qual -Wundef -Wno-pointer-sign -D__printf__=__freebsd_kprintf__ -Wmissing-include-dirs -fdiagnostics-show-option -Wno-unknown-pragmas -Wno-error-tautological-compare -Wno-error-empty-body -Wno-error-parentheses-equality -Wno-error-unused-function -Wno-error-pointer-sign -Wno-error-shift-negative-value -Wno-address-of-packed-member  -mno-aes -mno-avx  -std=iso9899:1999 -Werror  ../../../kern/subr_blist.c
../../../kern/subr_blist.c:310:12: error: incompatible pointer to integer conversion assigning to 'long' from 'int *'
      [-Werror,-Wint-conversion]
        count_err += count - avg_count;
                  ^  ~~~~~~~~~~~~~~~~~
../../../kern/subr_blist.c:352:7: error: format specifies type 'unsigned long' but the argument has type 'daddr_t'
      (aka 'long long') [-Werror,-Wformat]
                    blkno, (int)count, bl->bl_blocks);
                    ^~~~~
../../../kern/subr_blist.c:352:26: error: format specifies type 'long' but the argument has type 'daddr_t'
      (aka 'long long') [-Werror,-Wformat]
                    blkno, (int)count, bl->bl_blocks);
                                       ^~~~~~~~~~~~~
../../../kern/subr_blist.c:370:7: error: format specifies type 'unsigned long' but the argument has type 'daddr_t'
      (aka 'long long') [-Werror,-Wformat]
                    blkno, (int)count, bl->bl_blocks);
                    ^~~~~
../../../kern/subr_blist.c:370:26: error: format specifies type 'long' but the argument has type 'daddr_t'
      (aka 'long long') [-Werror,-Wformat]
                    blkno, (int)count, bl->bl_blocks);
                                       ^~~~~~~~~~~~~
../../../kern/subr_blist.c:879:7: error: format specifies type 'unsigned long' but the argument has type 'daddr_t'
      (aka 'long long') [-Werror,-Wformat]
                    blk, count, scan->bm_bitmap & mask);
                    ^~~
../../../kern/subr_blist.c:879:19: error: format specifies type 'unsigned long' but the argument has type
      'unsigned long long' [-Werror,-Wformat]
                    blk, count, scan->bm_bitmap & mask);
                                ^~~~~~~~~~~~~~~~~~~~~~
7 errors generated.
*** Error code 1

Here's the i386 stats:

$ ministat -C2  HEAD.dat D20001.56965.dat             
x HEAD.dat
+ D20001.56965.dat
+--------------------------------------------------------------------------------------------------+
|            ++ +  xx x      +                                +                                   x|
||____|_________M___M______A________A___________|______________________|                           |
+--------------------------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5          9215         14004          9322         10249     2100.5705
+   5          8864         11853          9074        9715.2     1254.1649
No difference proven at 95.0% confidence
$ cat HEAD.log 
FreeBSD x4.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r347052M: Fri May  3 11:36:51 CEST 2019     pho@x4.osted.lan:/usr/src/sys/i386/compile/PHO  i386
+++ /usr/src/sys/kern/subr_blist.c      (working copy)
swapoff: removing /dev/ada0s1b as swap device
Device          1K-blocks     Used    Avail Capacity
/dev/md10         1048576        0  1048576     0%
debug.counters.avg_count: 0
debug.counters.alloc_failure: 0
debug.counters.alloc_success: 0
14:06:47 14004 success, 5215333 failures
14:09:53 9275 success, 3238962 failures
14:13:00 9215 success, 3907939 failures
14:16:07 9322 success, 3586657 failures
14:19:14 9429 success, 3546284 failures
debug.counters.avg_count: 10
debug.counters.alloc_failure: 19495175
debug.counters.alloc_success: 51245
swapon: adding /dev/ada0s1b as swap device
$ cat D20001.56965.log
FreeBSD x4.osted.lan 13.0-CURRENT FreeBSD 13.0-CURRENT #1 r347052M: Fri May  3 14:47:08 CEST 2019     pho@x4.osted.lan:/usr/src/sys/i386/compile/PHO  i386
+++ /usr/src/sys/kern/subr_blist.c      (working copy)
+++ /usr/src/sys/sys/blist.h    (working copy)
+++ /usr/src/sys/vm/swap_pager.c        (working copy)
swapoff: removing /dev/ada0s1b as swap device
Device          1K-blocks     Used    Avail Capacity
/dev/md10         1048576        0  1048576     0%
debug.counters.avg_count: 0
debug.counters.alloc_failure: 0
debug.counters.alloc_success: 0
14:55:07 11853 success, 193965 failures
14:58:13 9827 success, 455838 failures
15:01:20 8864 success, 409578 failures
15:04:27 8958 success, 459011 failures
15:07:34 9074 success, 486567 failures
debug.counters.avg_count: 16
debug.counters.alloc_failure: 2004959
debug.counters.alloc_success: 48576
swapon: adding /dev/ada0s1b as swap device
$ 

Cast parameters in some panic messages to long long.

The patch that adds debug_counter statistics is written to be applied to the unmodified code. The patch under review here changes the type of the parameter 'count' from a value type to a pointer type, so the line

count_err += count - avg_count;

should be modified to

count_err += *count - avg_count;

to apply compute the statistics properly for the modified code. I apologize for not making this clear before.

The patch that adds debug_counter statistics is written to be applied to the unmodified code. The patch under review here changes the type of the parameter 'count' from a value type to a pointer type, so the line

count_err += count - avg_count;

should be modified to

count_err += *count - avg_count;

to apply compute the statistics properly for the modified code. I apologize for not making this clear before.

Ah, no problem. It was easy to fix.

dougm edited the test plan for this revision. (Show Details)
dougm added reviewers: kib, markj.

Much of my patch seems to have disappeared in the last upload. Restoring it, I hope.

sys/kern/subr_blist.c
196 ↗(On Diff #57013)

Is it no longer true that only one bit is set in mask ? Can the mask be zero ?

291 ↗(On Diff #57013)

Should we switch to KASSERT ?

336 ↗(On Diff #57013)

Again, why this and all other checks are not KASSERTs ?

sys/vm/swap_pager.c
734 ↗(On Diff #57013)

goto to line 769 would be cleaner IMO.

905 ↗(On Diff #57013)

Remove initialization from declaration.

910 ↗(On Diff #57013)

I must admit that this is very unusual way to write loop condition. Am I right that only starts is incremented ? Then why not use size > 0 as control and decrement it ?

Or at least keep the function parameter start constant and change a local variable.

Adapt reviewer suggestions.

sys/kern/subr_blist.c
196 ↗(On Diff #57013)

Some calls to bitpos will be made with more than one bit set in mask. No calls should be made with mask==0. If a call to bitpos was made with mask==0, the result would be -1 for the HAVE_INLINE_FFSLL + long long case, and 4*sizeof(mask)-1 otherwise.

291 ↗(On Diff #57013)

I'm happy to do so, if you want. I was just sticking with the style of the current code by using panic.

336 ↗(On Diff #57013)

I'll change them.

sys/vm/swap_pager.c
734 ↗(On Diff #57013)

I can make !TAILQ_EMPTY a condition of the loop. Or, if someone happens to know that this function will never be called with an empty list, I can just remove it.

905 ↗(On Diff #57013)

Okay.

910 ↗(On Diff #57013)

Only starts is incremented. And i, in the inner loop.

I found the previous code, with 4 different values updated every iteration, too complex. But I can find something else to change instead.

Correct a KASSERT condition. Replace a ummin() with a min().

With D20001.57040.diff I got:

Setting hostuuid: 2bde2bde-f4e2-e111-aab2-001e6756b69b.
Setting hostid: 0x0035ff86.
Starting ddb.
WARNING: reducing swap size to maximum of 65536MB per unit
panic: freeing free block: 2, size 62, mask 0
cpuid = 9
time = 1556993704
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00c62bf320
vpanic() at vpanic+0x19d/frame 0xfffffe00c62bf370
panic() at panic+0x43/frame 0xfffffe00c62bf3d0
blst_leaf_free() at blst_leaf_free+0x7c/frame 0xfffffe00c62bf400
blst_meta_free() at blst_meta_free+0x44/frame 0xfffffe00c62bf460
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00c62bf4c0
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00c62bf520
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00c62bf580
blst_meta_free() at blst_meta_free+0x158/frame 0xfffffe00c62bf5e0
blist_free() at blist_free+0xa1/frame 0xfffffe00c62bf610
swaponsomething() at swaponsomething+0x180/frame 0xfffffe00c62bf6e0
swapongeom_locked() at swapongeom_locked+0x217/frame 0xfffffe00c62bf760
swapongeom() at swapongeom+0x92/frame 0xfffffe00c62bf790
sys_swapon() at sys_swapon+0x157/frame 0xfffffe00c62bf990
amd64_syscall() at amd64_syscall+0x291/frame 0xfffffe00c62bfab0
fast_syscall_common() at fast_syscall_common+0x101/frame 0xfffffe00c62bfab0
--- syscall (85, FreeBSD ELF64, sys_swapon), rip = 0x8002f776a, rsp = 0x7fffffffeb88, rbp = 0x7fffffffeba0 ---
KDB: enter: panic
[ thread pid 75 tid 100230 ]
Stopped at      kdb_enter+0x3b: movq    $0,kdb_why
db> x/s version
version:        FreeBSD 13.0-CURRENT #5 r347055M: Sat May  4 20:12:05 CEST 2019\012    pho@t2.osted.lan:/usr/src/sys/amd64/compile/PHO\012
db>

Correct another KASSERT logic error.

It's a lot easier to review if there's more context around the patch hunks. Could you use -U99999 when creating diffs to upload to phabricator?

svn diff -x -U20 update after recent commit introduced conflicts.

Re-style the loop in blist_alloc to make it look loopier.

Cancel gratitutious stpidity.

Get rid of a few x&-x expressions made unnecessary by the changes to bitpos.

kern/subr_blist.c
637 ↗(On Diff #57115)

This won't matter in practice, but min()'s parameters are unsigned, so I'd use imin() instead if only to avoid warnings from static analyzers and the like.

743 ↗(On Diff #57115)

This seems a bit inconsistent to me. If we find a run of count1 blocks in the current leaf, we're satisfied (though we will greedily take up to maxcount blocks if the run is large enough). But I think there's a special case: suppose count1 = 1, maxcount > 1 and mask is 0x8000000000000000, i.e., only the last block is free. Then (-mask & ~mask) == 0 and we will try to allocate from the next leaf even though the current leaf can satisfy the request. Is there something I'm missing?

dougm marked an inline comment as done.

min() -> imin()

kern/subr_blist.c
743 ↗(On Diff #57115)

I think you mean *count, not count1.

I intend for blist_alloc to return the next block that begins a run of at least *count blocks, and to allocate as many blocks as possible there, up to a limit of maxcount. Nothing in there about leaf boundaries. The caller shouldn't know about leaf boundaries.

In the best case, if *count is 1 and maxcount is 2 (because a blist_alloc call with *count==3 just failed) and the next free block is the last one of its leaf, and if we return that block address with *count still set to 1, somebody is going to call blist_alloc again with *count == 2, and we'll have to look at the next leaf for that. We won't find 2 free blocks at the beginning of the next leaf (else the alloc for 3 would have succeeded), but we might find 1, but now we can't use it because *count==2. So we keep searching for two free blocks. If we had only taken both the blocks, spanning the leaf boundary, we'd be looking for only a single block. I expect a search for one block would be quicker than a search for two blocks.

In a worse case, swap_pager_reserve wants a huge number of blocks and through terrible luck, there are no 64 block free ranges and a huge number of 63 block free ranges that go from block 32 of one leaf to block 30 of the next leaf. Those requests for [32,63] blocks could respect leaf boundaries and provide a bunch of 32 block ranges (and leave 31 block gaps) or they could provide a bunch - about half as many - 63 block ranges.

If blist_alloc is called with *count != maxcount, then that's because an alloc call for maxcount+1 bytes failed. So I intend to find the first gap of size at least *count and completely fill it. The change you suggest would have be leave a fragment that I don't want to leave.

kern/subr_blist.c
209 ↗(On Diff #57164)

I suppose such thing is better done with the _Generic() expression.

224 ↗(On Diff #57164)

Perhaps extract the change to bitpos(), both sizeof(int) optimization, and removal of restriction that it works only with the word with one bit set, to preparational diff ?

733 ↗(On Diff #57164)

What is (-mask & !mask) ?

kern/subr_blist.c
209 ↗(On Diff #57164)

I am not familiar with that expression. Please point me to an example, or an explanation.

224 ↗(On Diff #57164)

I will generate lots of preparational diffs, if someone will approve them.

733 ↗(On Diff #57164)

If mask is an int with the last k bits cleared (but not the last k+1 bits), then (-mask & ~mask) is mask with all but the last k bits flipped.

For usage here, the first 1 bit of (-mask & ~mask) corresponds to the 0 bit of mask that ends the first clump of 1 bits of mask.

kern/subr_blist.c
209 ↗(On Diff #57164)

https://en.cppreference.com/w/c/language/generic

It would be like this (a minimal example that I compiled)

typedef	unsigned long long u_daddr_t;

int generic_bitpos(u_daddr_t mask);

int
bitpos(u_daddr_t mask)
{

	return (_Generic(mask,
#ifdef HAVE_INLINE__FFSLL
	    long long: ffsll(mask) - 1,
#endif
#ifdef HAVE_INLINE_FFS
	    int: ffs(mask) - 1,
#endif
	    default: generic_bitpos(mask)
	));
}
224 ↗(On Diff #57164)

Sure, why not. Just do not forget to abandon cumulative reviews so that we see what is going on.

kern/subr_blist.c
733 ↗(On Diff #57164)

I believe this absolutely requires an explanatory comment.

Update after recent changes to overlapping code.

Accounting for bitpos checkin.

Catching up with what's committed already. Other than the changes of D20228, I don't know what else I can extract from this patch to make it easier to review.

dougm edited the summary of this revision. (Show Details)

Update after swap_pager.c changes.

vm/swap_pager.c
745 ↗(On Diff #57311)

Why do you need this reduction of the max allowed number of blocks ?

vm/swap_pager.c
745 ↗(On Diff #57311)

Occasionally, mpages keeps blist_leaf_alloc from looking at the next leaf when all the rest of the blocks of this leaf are free and reach the mpages=maxcount limit at the end of the leaf. If maxpages were bigger, we'd look into the next leaf. Here, we are reducing mpages to avoid looking into the next leaf when we know there can't be a free block at the beginning of the next leaf, else blist_alloc would have succeeded in the previous round.

So the reduction isn't necessary for correctness, just slightly useful for efficiency.

This revision is now accepted and ready to land.May 11 2019, 4:02 PM
This revision was automatically updated to reflect the committed changes.