Page MenuHomeFreeBSD

[libkvm] kvm_geswapinfo() to stop subtracting vm.dmmax size from swap device sizes.
AcceptedPublic

Authored by ota_j.email.ne.jp on Jun 17 2019, 6:07 PM.

Details

Summary

PR 238670 - https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=238670

vm.dmmax size is nothing to do with each of swap device sizes.

Removal of vm.dmmax resulted stale static functions and thus removed together to avoid non-referenced static function call error from compiler.

Test Plan

Create 1 MB swapfile for mdconfig and swapon.

dd if=/dev/zero of=1MB bs=1M count=1

1+0 records in
1+0 records out
1048576 bytes transferred in 0.035823 secs (29271150 bytes/sec)

mdconfig -a -t vnode -f 1MB

md0

  1. swapon /dev/md0
  2. swapctl -l

Device: 1024-blocks Used:
/dev/ada0s1b 2097144 0
/dev/md0 1016 0

  1. mount -t tmpfs tmpfs /mnt/tmp
  2. dd if=/dev/zero of=/mnt/tmp/fill bs=1M count=2000

2000+0 records in
2000+0 records out
2097152000 bytes transferred in 36.267636 secs (57824337 bytes/sec)

swapctl -l

Device: 1024-blocks Used:
/dev/ada0s1b 2097144 144220
/dev/md0 1016 1016

systat -swap

Disk 1K-blocks Used /0% /10 /20 /30 /40 /50 /60 /70 /80 /90 /100
ada0s1b 2097016 144216 XXXX
md0 1016 1016 XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XX

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 24901
Build 23631: arc lint + arc unit

Event Timeline

ota_j.email.ne.jp edited the summary of this revision. (Show Details)Jun 17 2019, 6:11 PM
ota_j.email.ne.jp edited the test plan for this revision. (Show Details)
ota_j.email.ne.jp added a reviewer: dougm.
ota_j.email.ne.jp edited the summary of this revision. (Show Details)
dougm added a reviewer: alc.Jun 17 2019, 8:50 PM
linimon resigned from this revision.Jun 27 2019, 1:36 AM

This is not an area I understand well.

kib accepted this revision.Sep 27 2019, 1:54 PM
This revision is now accepted and ready to land.Sep 27 2019, 1:54 PM
markj accepted this revision.Sep 27 2019, 2:04 PM
alc added a comment.Sep 28 2019, 6:10 PM

I believe that there is a sound reason for not reporting the unmodified size of the device. Specifically, the swap pager can't use the first, if I recall correctly, 8KB of a swap device because it might contain a disk label. However, I can't explain why we use dmmax to calculate the modified size.

But sysctl describes totally different:

% sysctl -d vm.dmmax
vm.dmmax: Maximum size of a swap block in pages
% sysctl vm.dmmax
vm.dmmax: 32

After this removal, dmmax sysctl will be totally eliminated (I just realized).

% grep -R dmmax /usr/src/*
/usr/src/lib/libkvm/kvm_getswapinfo.c: { .n_name = "_dmmax" }, /* maximum size of a swap block */
/usr/src/lib/libkvm/kvm_getswapinfo.c:static int dmmax;
/usr/src/lib/libkvm/kvm_getswapinfo.c: ttl = swinfo.sw_nblks - dmmax;
/usr/src/lib/libkvm/kvm_getswapinfo.c: if (!GETSYSCTL(kd, "vm.dmmax", dmmax))
/usr/src/lib/libkvm/kvm_getswapinfo.c: ttl = xsd.xsw_nblks - dmmax;
/usr/src/lib/libkvm/kvm_getswapinfo.c: _kvm_err(kd, kd->program, "unable to find dmmax");
/usr/src/lib/libkvm/kvm_getswapinfo.c: KGET(NL_DMMAX, &dmmax);
/usr/src/sys/vm/swap_pager.c:SYSCTL_INT(_vm, OID_AUTO, dmmax, CTLFLAG_RD, &nsw_cluster_max, 0,
/usr/src/tools/tools/sysdoc/tunables.mdoc:vm.dmmax

kib added a comment.Sep 29 2019, 2:05 PM
In D20674#476657, @alc wrote:

I believe that there is a sound reason for not reporting the unmodified size of the device. Specifically, the swap pager can't use the first, if I recall correctly, 8KB of a swap device because it might contain a disk label. However, I can't explain why we use dmmax to calculate the modified size.

I think reporting the total size minus reserved blocks (for any definition of the block size, correct or not), is more confusing to users than reporting raw size and used. At worst, user would observe that some amount of swap volume is not filled, instead of seeing mismatch between e.g. geom size and vm size for the swap volume.

That said, I suspect that we no longer need to avoid initial blocks. At least not for the label preservation.

alc added a comment.Sep 29 2019, 4:20 PM
In D20674#476783, @kib wrote:
In D20674#476657, @alc wrote:

I believe that there is a sound reason for not reporting the unmodified size of the device. Specifically, the swap pager can't use the first, if I recall correctly, 8KB of a swap device because it might contain a disk label. However, I can't explain why we use dmmax to calculate the modified size.

I think reporting the total size minus reserved blocks (for any definition of the block size, correct or not), is more confusing to users than reporting raw size and used. At worst, user would observe that some amount of swap volume is not filled, instead of seeing mismatch between e.g. geom size and vm size for the swap volume.

I'm sympathetic to that argument. In any case, the current code is wrong, in that, it subtracts too much.

That said, I suspect that we no longer need to avoid initial blocks. At least not for the label preservation.

I'm afraid that is not true. I found out the hard way that it is needed. I lost a disk label, swap area, and file system when I first enabled "trimonce". However, the configuration of that storage device is usual. The first partition is a swap area, not a file system.

kib added a comment.Sep 29 2019, 4:32 PM
In D20674#476856, @alc wrote:
In D20674#476783, @kib wrote:

That said, I suspect that we no longer need to avoid initial blocks. At least not for the label preservation.

I'm afraid that is not true. I found out the hard way that it is needed. I lost a disk label, swap area, and file system when I first enabled "trimonce". However, the configuration of that storage device is usual. The first partition is a swap area, not a file system.

Was it for bsd label or gpt partitions ? Even then, if for bsd you did not used c, it should be fine.

imp added a comment.Sep 29 2019, 4:34 PM
In D20674#476856, @alc wrote:
In D20674#476783, @kib wrote:
In D20674#476657, @alc wrote:

I believe that there is a sound reason for not reporting the unmodified size of the device. Specifically, the swap pager can't use the first, if I recall correctly, 8KB of a swap device because it might contain a disk label. However, I can't explain why we use dmmax to calculate the modified size.

I think reporting the total size minus reserved blocks (for any definition of the block size, correct or not), is more confusing to users than reporting raw size and used. At worst, user would observe that some amount of swap volume is not filled, instead of seeing mismatch between e.g. geom size and vm size for the swap volume.

I'm sympathetic to that argument. In any case, the current code is wrong, in that, it subtracts too much.

That said, I suspect that we no longer need to avoid initial blocks. At least not for the label preservation.

I'm afraid that is not true. I found out the hard way that it is needed. I lost a disk label, swap area, and file system when I first enabled "trimonce". However, the configuration of that storage device is usual. The first partition is a swap area, not a file system.

We can't stop skipping. For the bsd label, we have an area at the front of the first partition that is both part of the partition as well as being part of the metadata and/or boot blocks. If we don't skip, we will overwrite this data as alc found out. Normally, these days, this isn't done, but since we have no way of asking the lower layers how much should be reserved, we need to skip. It might give a performance boost if we skip to the reported stripe size, since NAND devices use this to report optimal LBA alignment. I haven't checked lately to see if we do that and/or try to set the swap out block size to that, clustering pages if needed. For things like SD cards, it can be a huge win. For SATA SSD it's usually a moderate win. Nvme is more variable since enterprise drives don't really care, while super cheap QLC drives care rather a lot.

alc added a comment.Sep 29 2019, 5:05 PM
In D20674#476860, @kib wrote:
In D20674#476856, @alc wrote:
In D20674#476783, @kib wrote:

That said, I suspect that we no longer need to avoid initial blocks. At least not for the label preservation.

I'm afraid that is not true. I found out the hard way that it is needed. I lost a disk label, swap area, and file system when I first enabled "trimonce". However, the configuration of that storage device is usual. The first partition is a swap area, not a file system.

Was it for bsd label or gpt partitions ? Even then, if for bsd you did not used c, it should be fine.

It was a bsd label, and I did not use 'c'.

alc added a comment.Sep 29 2019, 5:16 PM
In D20674#476861, @imp wrote:
In D20674#476856, @alc wrote:
In D20674#476783, @kib wrote:
In D20674#476657, @alc wrote:

I believe that there is a sound reason for not reporting the unmodified size of the device. Specifically, the swap pager can't use the first, if I recall correctly, 8KB of a swap device because it might contain a disk label. However, I can't explain why we use dmmax to calculate the modified size.

I think reporting the total size minus reserved blocks (for any definition of the block size, correct or not), is more confusing to users than reporting raw size and used. At worst, user would observe that some amount of swap volume is not filled, instead of seeing mismatch between e.g. geom size and vm size for the swap volume.

I'm sympathetic to that argument. In any case, the current code is wrong, in that, it subtracts too much.

That said, I suspect that we no longer need to avoid initial blocks. At least not for the label preservation.

I'm afraid that is not true. I found out the hard way that it is needed. I lost a disk label, swap area, and file system when I first enabled "trimonce". However, the configuration of that storage device is usual. The first partition is a swap area, not a file system.

We can't stop skipping. For the bsd label, we have an area at the front of the first partition that is both part of the partition as well as being part of the metadata and/or boot blocks. If we don't skip, we will overwrite this data as alc found out. Normally, these days, this isn't done, but since we have no way of asking the lower layers how much should be reserved, we need to skip. It might give a performance boost if we skip to the reported stripe size, since NAND devices use this to report optimal LBA alignment. I haven't checked lately to see if we do that and/or try to set the swap out block size to that, clustering pages if needed. For things like SD cards, it can be a huge win. For SATA SSD it's usually a moderate win. Nvme is more variable since enterprise drives don't really care, while super cheap QLC drives care rather a lot.

We do clustering, but that clustering is in no way related to the device's stripe size. What is a typical stripe size?

kib added a comment.Sep 29 2019, 8:05 PM
In D20674#476867, @alc wrote:

I see, even gpart, by default, if not explicitly managed to specify the start offset, creates the slice 'a' at the start of the volume. OTOH I agree that it is too late to fix it, anyway most people migrated or should migrate to GPT.

v2% sudo mdconfig -a -t swap -s 32m
md0
v2% sudo gpart create -s bsd md0
md0 created
v2% sudo gpart show md0
=>    0  65536  md0  BSD  (32M)
      0  65536       - free -  (32M)

v2% sudo gpart add -t freebsd-swap -i 0
gpart: Invalid number of arguments.
v2% sudo gpart add -t freebsd-swap -i 0 md0
md0a added
v2% gpart show md0
=>    0  65536  md0  BSD  (32M)
      0  65536    1  freebsd-swap  (32M)