
growfs script: add swap partition as well as growing root
Needs Review · Public

Authored by karels on Tue, Nov 22, 7:02 PM.

Details

Summary

Add the ability to create a swap partition in the course of growing
the root file system on first boot, enabled by default. The default
rules are: add swap if the disk is at least 15 GB (decimal) and the
existing root is less than 40% of the disk. The default size is 10%
of the disk, but not more than double the memory size.

The default behavior can be overridden by setting growfs_swap_size in
/etc/rc.conf or in the kernel environment, with kenv taking priority.
A value of 0 inhibits the addition of swap, an empty value specifies
the default, and other values indicate a swap size in sectors.
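The sizing rules above can be sketched as a small sh function. The function name, argument order, and unit conventions here are illustrative, not the script's actual code; only the thresholds (15 GB, 40%, 10%, 2x memory) come from the summary.

```shell
# Illustrative sketch of the default sizing rules; not the script's code.
# All sizes are in bytes, using decimal GB (1 GB = 1000000000).
default_swap_size() {
	disksize=$1	# total disk size in bytes
	rootsize=$2	# current root partition size in bytes
	memsize=$3	# physical memory in bytes

	# No swap unless the disk is at least 15 GB (decimal) ...
	if [ "$disksize" -lt 15000000000 ]; then
		echo 0; return
	fi
	# ... and the existing root is less than 40% of the disk.
	if [ $((rootsize * 100)) -ge $((disksize * 40)) ]; then
		echo 0; return
	fi
	# Default size: 10% of the disk, capped at double the memory size.
	swap=$((disksize / 10))
	if [ "$swap" -gt $((memsize * 2)) ]; then
		swap=$((memsize * 2))
	fi
	echo "$swap"
}

# Example: 20 GB disk, 5 GB root, 1 GB RAM.
default_swap_size 20000000000 5000000000 1000000000	# prints 2000000000
```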

Addition of swap is also inhibited if a swap partition is found in
the output of the sysctl kern.geom.conftxt before the current root
partition, usually meaning that there is another disk present.
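A minimal sketch of that check, run against a canned sample rather than the live sysctl output; the field layout shown is typical of kern.geom.conftxt PART lines, but the real script's parsing is more involved.

```shell
# Illustrative only: look for a freebsd-swap partition that appears
# before the root partition in (sample) kern.geom.conftxt output.
conftxt='2 PART da0p1 524288 512 i 1 o 20480 ty freebsd-boot
2 PART da0p2 1073741824 512 i 2 o 544768 ty freebsd-swap
2 PART da0p3 25769803776 512 i 3 o 1074286592 ty freebsd-ufs'

swap_first=$(echo "$conftxt" | awk -v root=da0p3 '
	$2 == "PART" && $3 == root { exit }	# stop at the root partition
	/ty freebsd-swap/ { found = 1 }		# swap seen before root
	END { print (found ? "yes" : "no") }')
echo "swap before root: $swap_first"
```

On a live system the input would come from `sysctl -n kern.geom.conftxt` instead of the here-string.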

The root partition is read-only when growfs runs, so /etc/fstab
cannot be modified. That step is handled by a new growfs_fstab script,
added in a separate commit. The script sets "growfs_swap_added=1" in
kenv to indicate that this should be done.

There is optional verbose output meant for debugging; it can only be
enabled by modifying the script (in two places, for sh and awk).
This should be removed before release, after testing on -current.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 48503
Build 45389: arc lint + arc unit

Event Timeline

karels created this revision.

"other values indicate a swap size in sectors": so from media to media, with different sector sizes, the same figure means differing sizes?

Is the script used with the likes of:

QUOTE
Disk images may be downloaded from the following URL (or any of the
FreeBSD Project mirrors):

https://download.freebsd.org/snapshots/VM-IMAGES/

Images are available in the following disk image formats:

~ RAW
~ QCOW2 (qemu)
~ VMDK (qemu, VirtualBox, VMWare)
~ VHD (qemu, xen)

The partition layout is:

~ 512k - freebsd-boot GPT partition type (bootfs GPT label)
~ 1GB  - freebsd-swap GPT partition type (swapfs GPT label)
~ 24GB - freebsd-ufs GPT partition type (rootfs GPT label)

END QUOTE

If yes, has the handling been tested? The above already has a 1GB freebsd-swap before the freebsd-ufs.

What if someone had a similar example that instead had such a freebsd-swap after the freebsd-ufs?

This sounds like a good idea in principle. A couple of things come to mind:

  1. This breaks the scenario of "VM is initially booted with a 20 GB disk; at a later time, the disk is expanded to 30 GB and /etc/rc.d/growfs is run manually", since the root partition will no longer be at the end of the disk. I guess theoretically we could delete the swap partition and create a new one at the end of the disk?
  2. In EC2 I have code for automatically using ephemeral disks for swap space; this has higher performance than using the root disk (network disk vs. local disk). But not all EC2 instance types have ephemeral disks. I'm not sure what the ideal interaction between these two features would be.

". . . and the existing root is less than 40% of the disk" What if partitions/slices other than root are taking up space as well? Do you need to use a sufficiently large region of available free space instead?

It may be that the script is intended to work in more contexts than just the FreeBSD-built images, and is meant to preserve various partitions/slices that may be around. That might mean only growing into the free space that happens to be directly after the root partition.

Imagine someone deleting a partition just after the root partition on media that they had been using, and then having the root partition grown despite other partitions still being present.

Warning: I'm not an expert in the intended range of uses of the long-standing script. So I may have wandered too far here.

"other values indicate a swap size in sectors": so from media to media, with different sector sizes, the same figure means differing sizes?

Sector size is (nearly?) always 512. Using sectors ensures that the value is a multiple of the sector size. Although I suppose I could just divide and round down.
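Dividing and rounding down, as suggested, is a one-liner in sh integer arithmetic (the function name is illustrative):

```shell
# Round a byte count down to a multiple of the sector size.
round_to_sector() {
	bytes=$1; secsize=${2:-512}
	echo $(( bytes / secsize * secsize ))
}

round_to_sector 1000000		# prints 999936 with 512-byte sectors
```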

Is the script used with the likes of:

[VM-IMAGES with pre-existing swap]

I don't know, I'll have to investigate. I had assumed that it was not used there.

If yes, has the handling been tested? The above already has a 1GB freebsd-swap before the freebsd-ufs.

Not yet, but I'll investigate. "before" is relative to the kern.geom.conftxt output, which is not always in the expected order.

What if someone had a similar example that instead had such a freebsd-swap after the freebsd-ufs?

The easiest thing would be to put growfs_swap_size="0" in the default rc.conf. Or, if it made sense, to change the default to "off", and enable in arm images, etc.

This sounds like a good idea in principle. A couple of things come to mind:

  1. This breaks the scenario of "VM is initially booted with a 20 GB disk; at a later time, the disk is expanded to 30 GB and /etc/rc.d/growfs is run manually", since the root partition will no longer be at the end of the disk. I guess theoretically we could delete the swap partition and create a new one at the end of the disk?

Yes, deleting the automatic swap partition would allow this to work. This is only a problem if the initial root partition was smaller than the 20 GB disk and growfs was enabled.

  2. In EC2 I have code for automatically using ephemeral disks for swap space; this has higher performance than using the root disk (network disk vs. local disk). But not all EC2 instance types have ephemeral disks. I'm not sure what the ideal interaction between these two features would be.

It probably makes the most sense to disable the growfs swap addition in that case. These scripts don't really handle that situation correctly, where swap could be prioritized. But the growfs swap will not be added if there is swap in the fstab already.

". . . and the existing root is less than 40% of the disk" What if partitions/slices other than root are taking up space as well? Do you need to use a sufficiently large region of available free space instead?

This is designed for use with our existing images, where root is the only substantial partition, and the free space immediately follows it.

It may be that the script is intended to work in more contexts than just the FreeBSD-built images, and is meant to preserve various partitions/slices that may be around. That might mean only growing into the free space that happens to be directly after the root partition.

That is true in the current growfs as well.

Imagine someone deleting a partition just after the root partition on media that they had been using, and then having the root partition grown despite other partitions still being present.

If they deleted the partition after first boot, growfs won't run automatically.

Warning: I'm not an expert in the intended range of uses of the long-standing script. So I may have wandered too far here.

It really isn't a general-purpose tool, it only needs to handle the images as we ship them. Some of the options may be useful for other embedded systems, e.g. setting the swap size where the target is known hardware.

About the VM images with pre-existing swap partitions: the swap partition is already listed in /etc/fstab, so I'll just check for that. Although we might want to consider switching them to use this mechanism... But that wouldn't provide any swap if the disk is too small.
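That fstab check might look something like this sketch, here against a canned fstab rather than the live /etc/fstab; the test keys on the fstype column, and the exact test in the script may differ.

```shell
# Sample fstab in the layout shipped with the standard VM images.
fstab='/dev/gpt/rootfs / ufs rw 1 1
/dev/gpt/swapfs none swap sw 0 0'

# Field 3 of each fstab line is the fstype; look for "swap".
if echo "$fstab" | awk '$3 == "swap" { found = 1 } END { exit !found }'; then
	have_swap=yes
else
	have_swap=no
fi
echo "swap in fstab: $have_swap"
```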

Change units of growfs_swap_size to bytes; skip swap if swap is in fstab

It probably makes the most sense to disable the growfs swap addition in that case. These scripts don't really handle that situation correctly, where swap could be prioritized. But the growfs swap will not be added if there is swap in the fstab already.

Thinking about this a bit more... EC2 instances can change their instance types, so an instance might have ephemeral disks (with swap space allocated on them) on some boots and not others. I'm inclined to say that the best solution here is "always allocate the swap partition but don't enable it if we have other swap already". Unfortunately "already" in this case happens after growfs, so we would need to create the swap partition in growfs and then conditionally enable it later...

Thinking about this a bit more... EC2 instances can change their instance types, so an instance might have ephemeral disks (with swap space allocated on them) on some boots and not others. I'm inclined to say that the best solution here is "always allocate the swap partition but don't enable it if we have other swap already". Unfortunately "already" in this case happens after growfs, so we would need to create the swap partition in growfs and then conditionally enable it later...

How do you handle the fstab in that case? I see that my EC2 instance has neither swap nor an fstab entry. The code that I just added will omit swap if it is already included in the fstab; that handles the VM images from the standard build, which already have a swap partition listed in the fstab. I'd be reluctant to add a second swap partition, which would end up on the same device, especially if it would never be enabled by default.

I can think of a couple of possibilities: it would be possible to set growfs_swap_size to a size in bytes in the EC2 rc.conf, which would force creation of a swap device of that size. Or, I could add a reserved value (e.g. 1 byte) that would force the creation using the default sizing rules. It would be added to the fstab, but as "/dev/label/growfs_swap", so it would be easy to recognize and manipulate.

If it is useful to add a way to force swap to be added using the default size, I would probably change the values for growfs_swap_size to symbolic ones, e.g. AUTO, NONE, or ALWAYS (or a value in bytes).
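A sketch of how those symbolic values might be parsed. The names AUTO/NONE/ALWAYS are the ones floated above; the function name, the return tokens, and the accepted spellings are all illustrative.

```shell
# Illustrative parser for the proposed growfs_swap_size values:
# AUTO / NONE / ALWAYS (case-insensitive) or a size in bytes.
parse_swap_size() {
	case "$1" in
	""|[Aa][Uu][Tt][Oo])		echo auto ;;	# default sizing rules
	0|[Nn][Oo][Nn][Ee])		echo none ;;	# inhibit swap addition
	[Aa][Ll][Ww][Aa][Yy][Ss])	echo force ;;	# add even if swap exists
	*[!0-9]*)			echo error ;;	# reject other non-numeric
	*)				echo "$1" ;;	# explicit size in bytes
	esac
}

parse_swap_size AUTO		# prints auto
parse_swap_size 4294967296	# prints 4294967296
```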

Any thoughts on whether it would be useful to force addition of swap even if it is already present? I thought it would be easy, but it takes a bit of work to find the new swap partition if there was already one present.

Any thoughts on whether it would be useful to force addition of swap even if it is already present? I thought it would be easy, but it takes a bit of work to find the new swap partition if there was already one present.

My take: Already present -> operating outside the limited remit of this script... don't add it.

I'd avoid creating ANOTHER swap partition, though. If you want a 'force' it should mean "and throw away whatever swap you found there to create the one you'd have created were that swap partition not ever there" or some such.

In D37462#853345, @imp wrote:

I'd avoid creating ANOTHER swap partition, though. If you want a 'force' it should mean "and throw away whatever swap you found there to create the one you'd have created were that swap partition not ever there" or some such.

There are two different situations though, a swap partition on the root/install disk, or on another disk. I have the latter situation if my USB SSD is plugged in. I don't want to modify other disks, that would violate POLA. I realized, though, that supplying a swap size already implements "force", and can label the wrong partition as "growfs_swap" currently. I'll need to fix that one way or the other.

@cperciva what would work best on EC2?

@cperciva what would work best on EC2?

Sorry, I was at AWS re:Invent so I wasn't able to pay attention to this review for the past few days.

I think the best for EC2 is
(a) unconditionally *create* swap space on >15 GB disks, but
(b) only *use* that swap space if -- at a later stage in the boot process -- we don't have any swap configured.

This way if a system is booted on an EC2 instance with ephemeral disks it will use those, but if it's moved over to an instance without ephemeral disks later it will still have *some* swap space.

Another question: Do you want to respect vm.swap_maxpages? Or just allocate the swap space and let the kernel potentially issue a warning about the device being too large and not being fully utilized?

@cperciva what would work best on EC2?

I think the best for EC2 is
(a) unconditionally *create* swap space on >15 GB disks, but
(b) only *use* that swap space if -- at a later stage in the boot process -- we don't have any swap configured.

Two questions (at least): is there a canonical way to test for EC2 for this purpose? And would ephemeral disks have swap partitions, or entries for swap in /etc/fstab, at the time growfs runs? I can easily create and label the partition, but skip the fstab and swapon steps if on EC2. Maybe testing ec2_ephemeralswap_enable would be appropriate here?

This way if a system is booted on an EC2 instance with ephemeral disks it will use those, but if it's moved over to an instance without ephemeral disks later it will still have *some* swap space.

Another question: Do you want to respect vm.swap_maxpages? Or just allocate the swap space and let the kernel potentially issue a warning about the device being too large and not being fully utilized?

Good question. I had assumed that a limit of 2*physmem would be small enough; I've been thinking mostly about embedded systems with <= 8 GB. Computing this limit would probably be appropriate. Looks like the warning is at vm.swap_maxpages/2. 2*physmem is probably excessive for larger systems too.
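A sketch of that computation, using an illustrative vm.swap_maxpages value rather than a live query (on a real system it would come from `sysctl -n vm.swap_maxpages`): since the warning starts at roughly vm.swap_maxpages/2, cap the swap size at half the max pages times the page size.

```shell
# Illustrative values, not taken from a real system.
swap_maxpages=16441600	# would be: sysctl -n vm.swap_maxpages
pagesize=4096		# would be: sysctl -n hw.pagesize

# Warning threshold is about vm.swap_maxpages/2, so cap swap there.
maxswap=$(( swap_maxpages / 2 * pagesize ))
echo "$maxswap"		# prints 33672396800 (about 31 GiB)
```

The 2*physmem cap would then become min(2*physmem, maxswap).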

Another question: Do you want to respect vm.swap_maxpages? Or just allocate the swap space and let the kernel potentially issue a warning about the device being too large and not being fully utilized?

My memory is that warnings about potential mistuning start to be produced when swapon crosses roughly vm.swap_maxpages/2 (total across active swap files). There is also some variation in the figure over time as the OS is upgraded, even if other things are not varied, as I remember. Leaving some margin can be appropriate.

Looking at "man 8 loader_simp" and its kern.maxswzone material:

Note that swap metadata can be fragmented, which means that
the system can run out of space before it reaches the
theoretical limit.  Therefore, care should be taken to not
configure more swap than approximately half of the
theoretical maximum.

Unfortunately, some other details about kern.maxswzone need not apply to various architectures. (I'm guessing it presumes amd64.) The:

If no value is provided, the system
allocates enough memory to handle an amount of swap that
corresponds to eight times the amount of physical memory
present in the system.

is suspect, for example. On aarch64 I get warnings starting somewhere between 3.5 and 3.8 times the RAM present. On armv7, it starts somewhere between 1.8 and 2.0 times the RAM present, as I remember.

Another thing about kern.maxswzone that may not be obvious, if I understand it right: increasing kern.maxswzone trades off against other kernel memory use instead of only increasing the amount of swap space that can be supported. The normal adjustment (rare) is to decrease it to allow more of other types of kernel memory use. See "man 8 loader_simp".