
release: Add support for creating ZFS-based images
Closed, Public

Authored by markj on Mar 3 2022, 4:55 PM.

Details

Summary

The change extends vmimage.subr to handle a new parameter, VMFS, which
must be either "ufs" or "zfs". In the latter case, rather than using
makefs(8) to create the filesystem image, we use md(4) to create a zpool
with a single disk vdev, mounted at $DESTDIR, and install the FreeBSD
distribution there. The zpool has the autoexpand property set so that
growfs works as it does with UFS.
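
Roughly, the ZFS path looks like the following (a simplified sketch only; names such as ${VMBASE}, ${DESTDIR} and ${tmppool} follow vmimage.subr conventions, and the exact options used in the patch may differ):

  # Attach the raw image file as a memory disk and build the pool on it.
  mddev=$(mdconfig -a -t vnode -f ${VMBASE})
  # Create the pool under a randomized temporary name (-t) so it cannot
  # collide with a pool on the build host; autoexpand lets the image grow
  # to the full disk size on first boot.
  zpool create -o autoexpand=on -O mountpoint=none \
      -R ${DESTDIR} -t ${tmppool} zroot /dev/${mddev}
  # ... create the bsdinstall-style datasets and install the distribution ...
  zpool export ${tmppool}
  mdconfig -d -u ${mddev}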

This has a couple of caveats. First, ZFS images cannot be built in a
jail, as the kernel forbids creating zpools when the creating process
is jailed. Second, the image will contain random GUIDs for the pool and
vdev, so images are not completely reproducible. I'm working on
extending makefs(8) to create the filesystem image using libzpool,
which, if successful, will fix both problems; in the meantime this
solution works well and provides an often-requested feature without
much added complexity.

The dataset layout is the same as that used by bsdinstall and "poudriere
image". There is a distinct lack of error handling in the release build
scripts, so I haven't tried to handle errors in my extensions, but that
would be a good next step IMO.
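
For reference, that layout amounts to roughly the following (abbreviated from usr.sbin/bsdinstall/scripts/zfsboot; the exact dataset and property lists in this patch may differ slightly):

  zfs create -o mountpoint=none                           ${tmppool}/ROOT
  zfs create -o mountpoint=/                              ${tmppool}/ROOT/default
  zfs create -o mountpoint=/tmp -o exec=on -o setuid=off  ${tmppool}/tmp
  zfs create -o mountpoint=/usr -o canmount=off           ${tmppool}/usr
  zfs create -o mountpoint=/var -o canmount=off           ${tmppool}/var
  zfs create -o exec=off -o setuid=off                    ${tmppool}/var/log
  # ... plus usr/home, usr/ports, usr/src, var/audit, var/crash, var/mail, var/tmp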

Test Plan

I was able to use this to create amd64 and arm64 ZFS-based EC2 AMIs, and I
verified that they boot successfully. I also booted an amd64 ZFS image using
bhyve. I will test GCE shortly. Help testing other cloud providers would be
appreciated.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Mar 3 2022, 4:55 PM

I took a different approach from D23334. As far as I can see, none of the cloud configuration bits (e.g., ec2.conf) are filesystem-specific, so it seemed reasonable to avoid creating a separate ec2-zfs.conf as suggested there. It might be useful eventually to let cloudware bits override the default filesystem layout.

gjb added inline comments.
release/tools/vmimage.subr
54

I am uncomfortable with the pool being named 'zroot', as that conflicts with the name of the pool on the build machines, even if the 'tmppool' name is random.

release/tools/vmimage.subr
54

This scenario is exactly what -t is for: handling such conflicts. All of my testing was done on a system that already has a pool called zroot.

markj@nuc> zpool list
NAME                SIZE  ALLOC   FREE  CKPOINT  EXPANDSZ   FRAG    CAP  DEDUP    HEALTH  ALTROOT
release.187959379  4.50G  3.93M  4.50G        -         -     0%     0%  1.00x    ONLINE  /usr/obj/usr/home/markj/src/freebsd/dev/projects/release/arm64.aarch64/release/cw-ec2
zroot               912G   327G   585G        -         -    27%    35%  1.00x    ONLINE  -
delphij added inline comments.
release/tools/vmimage.subr
61

exec=on is the default; did you mean exec=off? If the default value is to be used, exec=on should be removed.

62

In this setup, /usr and /var are actually using /'s dataset (these ZFS datasets exist mainly for inheritance purposes). Is this intentional (maybe to make sure that the boot environment contains both /usr and /var)?

BTW, personally I'd create a container dataset with mountpoint=/ and canmount=off for both usr and var instead of setting mountpoint in two datasets. Something like:

sysroot -o mountpoint=/ -o canmount=off
sysroot/usr -o canmount=off
sysroot/usr/home
...

This gives a clear separation between the "system" datasets and the user's datasets, and not setting mountpoint makes it easier to do zfs send -R | zfs receive, but these are just personal preferences.
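
Rendered as commands, purely for illustration (dataset names hypothetical; the children inherit their mountpoints from the container):

  zfs create -o mountpoint=/ -o canmount=off  ${pool}/sysroot
  zfs create -o canmount=off                  ${pool}/sysroot/usr
  zfs create                                  ${pool}/sysroot/usr/home   # mounts at /usr/home via inheritance
  zfs create -o canmount=off                  ${pool}/sysroot/var
  zfs create -o exec=off -o setuid=off        ${pool}/sysroot/var/log    # mounts at /var/log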

67

[OPTIONAL] Personally I'd set setuid=off and possibly also exec=off for the whole /var as the default feature set instead of setting it on individual file systems.

Please give me a few days to review the rest of this. I would like to expand this beyond cloud images, i.e., the files populated in the VM-IMAGES directory on the download mirrors, which will require adjustments to Makefile.mirrors.

release/tools/vmimage.subr
54

You are correct, my apologies. I misread the usage.

release/tools/vmimage.subr
61

This was copied from bsdinstall (see usr.sbin/bsdinstall/scripts/zfsboot), which also specifies exec=on for some reason. I'll try removing it. In general I'd expect that /tmp and /var/tmp need to permit execution of files: it's not uncommon for something to write a generated script there and run it.

62

I believe it's intentional, yes.

Per my other comments I'm a bit reluctant to deviate from the layout we use elsewhere (the fact that it's now hard-coded in three places is a problem on its own...). Perhaps we want a standalone script in the base system which creates datasets for a FreeBSD installation in a standard way?

67

I think I agree, but I'm not sure it's a good idea to deviate from what bsdinstall and poudriere do. Perhaps we should change the defaults there too? I suspect that at least var/tmp needs exec=on, not sure about var/mail.

In D34426#780015, @gjb wrote:

Please give me a few days to review the rest of this. I would like to expand this beyond cloud images, i.e., the files populated in the VM-IMAGES directory on the download mirrors, which will require adjustments to Makefile.mirrors.

To be clear, the context in which I mean this is providing both UFS and ZFS images, and staging both appropriately (the main work would be adding a 'for' loop, but I need to look closer).

This is a nice addition!

The only "problem" I've seen (using poudriere-image-zfs) is if, for example, you want to create an image from a system running 14-CURRENT, to a system that will run say 12.3-RELEASE, the difference in zpool versions is problematic.

I don't know how VM images are built, but this will require building from compatible zpool versions.

In D34426#780141, @jlduran_gmail.com wrote:

The only "problem" I've seen (using poudriere-image-zfs) is if, for example, you want to create an image from a system running 14-CURRENT, to a system that will run say 12.3-RELEASE, the difference in zpool versions is problematic.

I don't know how VM images are built, but this will require building from compatible zpool versions.

That's a good point. I think this is less of a problem here since the release(7) scripts are part of the src tree, so initially one would only be able to build 14-CURRENT ZFS images. It could be a problem in the longer term. For official snapshot and release images, it'll depend on the version of FreeBSD used by the releng team; I'm not sure what FreeBSD version the builders run.

I'm not too familiar with the history of zpool versions - isn't the problem really with feature flags? I suppose we'd need to identify the set of pool features supported by a reasonable range of FreeBSD releases, and ensure that only those features are enabled. A firstboot rc.d script could be used to upgrade the pool.

In D34426#780372, @markj wrote:
I'm not too familiar with the history of zpool versions - isn't the problem really with feature flags? I suppose we'd need to identify the set of pool features supported by a reasonable range of FreeBSD releases, and ensure that only those features are enabled. A firstboot rc.d script could be used to upgrade the pool.

Yes, that could be one solution for releases on OpenZFS (I guess >= stable/13), not sure on stable/12. Nevertheless, it was something just to keep in mind.

Ping, any other comments? The pool version issue is worth addressing in some way but isn't a blocker IMO. Ideally we could start publishing ZFS snapshots for 14.0 as quickly as possible to start collecting feedback and bug reports ahead of any official support for these images.

Apologies. I have been busy (both at work and personal life).

I will follow up on this before the end of the week.

I'm worried about failure cases with this code (since they are more impactful than in the UFS case). I'm also worried about what happens if two people run this at the same time (say, a native and a cross build of arm64 get to the point of generating these things at the same time). I didn't see any careful attempt to make the zpool names or mount points unique, but I may have overlooked something.

release/tools/ec2.conf
24

growfs expands ZFS now? I think rewording this comment requires more thought.

release/tools/vmimage.subr
52

what happens if this fails?

54

Where is this cleaned up if something goes terribly wrong?

Another thought just came to me: what happens when the ZFS version on the builder differs from the target version? Out of conservatism, I try to only upgrade the builders when necessary (such as a new syscall being added, etc.).

In D34426#783437, @imp wrote:

I'm worried about failure cases with this code (since they are more impactful than in the UFS case).

Error handling definitely needs to be improved. I'll work on this if there are no objections to this review and we can start making ZFS snapshots of -CURRENT available for testing.

I'm also worried about what happens if two people run this at the same time (say, a native and a cross build of arm64 get to the point of generating these things at the same time). I didn't see any careful attempt to make the zpool names or mount points unique, but I may have overlooked something.

The temporary pool name is randomized: tmppool=release.$(jot -r 1 1000000000). The mount point is not, but that's a problem for UFS builds too AFAICT. I believe it does contain the target arch, so building amd64 and arm64 at the same time isn't a problem at least in principle.

In D34426#783441, @gjb wrote:

Another thought just came to me: what happens when the ZFS version on the builder differs from the target version? Out of conservatism, I try to only upgrade the builders when necessary (such as a new syscall being added, etc.).

I believe the right solution there is to create the pool with the "legacy" feature set, and provide an rc.d script which can enable features during firstboot. Some more specific guidance along these lines would be appreciated.
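
A minimal sketch of what such a firstboot script could look like (hypothetical script and variable names, not part of this patch; zpool upgrade enables every feature flag supported by the running kernel):

  #!/bin/sh
  #
  # PROVIDE: zpoolupgrade
  # REQUIRE: zfs
  # KEYWORD: firstboot

  . /etc/rc.subr

  name="zpoolupgrade"
  rcvar="zpoolupgrade_enable"
  start_cmd="zpoolupgrade_start"

  zpoolupgrade_start()
  {
      # The pool in the image is created with only legacy features enabled;
      # turn on the rest once the target system is up.
      zpool upgrade zroot
  }

  load_rc_config $name
  run_rc_command "$1"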

release/tools/ec2.conf
24

The growfs rc.d script does, yes, though this patch also sets the autoexpand pool property for good measure.

release/tools/vmimage.subr
52

See my comment in the review description about error handling. I think a separate pass through these scripts is needed to fix up error handling; I hit a lot of issues due to this while figuring out how to use the release scripts, even before I started making changes. I think we should start by setting -e and running through all the possible options to see if anything breaks.

There is a cleanup handler that runs upon SIGINT etc. but it also gets called normally as part of mk-vmimage.sh. For that reason I can't destroy the pool there.

Use makefs -t zfs instead of zpool(8)/zfs(8) to create a pool and datasets.

This eliminates a number of problems with the old approach:

  • images will be reproducible
  • no need to import any pools on the host
  • no special privileges are needed

This depends on D35248 and D35336, not committed yet.
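
For reference, the invocation ends up looking roughly like this (a sketch only; the dataset list and option values are abbreviated, see the diff for the authoritative version):

  makefs -t zfs -s ${VMSIZE} \
      -o poolname=zroot -o bootfs=zroot/ROOT/default -o rootpath=/ \
      -o fs=zroot\;mountpoint=none \
      -o fs=zroot/ROOT\;mountpoint=none \
      -o fs=zroot/ROOT/default\;mountpoint=/ \
      ${VMBASE} ${DESTDIR}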

With respect to pool versions, makefs -t zfs creates pools with version 5000 (i.e., the last version) and all feature flags turned off (because makefs doesn't implement any of them). I don't believe we have anything that will automatically enable features on first boot, but that could easily be added.

Ping? I think the switch to using makefs -t zfs addressed the major concerns that were raised. I've been using this patch to do some testing with both amd64 and arm64 and it's been reliable so far. I'd really like for us to start releasing ZFS-based VM image snapshots from "main", alongside UFS images, so that we can get more testing. It'll take a while to build confidence in makefs -t zfs, and having snapshot images conveniently available to play with will be very helpful.

release/scripts/mk-vmimage.sh
84

One (final, I think) nit I noticed, unless I missed it: if ${VMFS} is empty, we should default it to UFS instead of outright falling back to usage().

Changes LGTM, a handful of rather minor notes inline. AFAICT this should not change anything for existing UFS images, so should be safe from a regression POV and will allow ZFS snapshot testing. Hopefully releng will sign off on this shortly.

release/scripts/mk-vmimage.sh
37

Incidentally, it seems this ("${@}") does not produce a useful usage message.

release/tools/ec2.conf
26–27

Aside, curious why it is an extra 1GB?

release/tools/vmimage.subr
102

Aside, I wonder if there's a better way to deal with this.

187

We could explicitly check for "${VMFS}" = zfs here and have an error else case. (Noticed as I have some crazy thoughts about creating FAT images.)

246

I was a bit confused by this at first, before finding that it is only used for x86 below. Maybe X86GPTBOOTFILE?

This revision is now accepted and ready to land. Oct 28 2022, 3:20 PM
gjb added inline comments.
release/scripts/mk-vmimage.sh
84

Apologies, I noticed VMFS is set to UFS by default in the Makefile.vm. Sorry for the noise.

release/scripts/mk-vmimage.sh
84

Oh, our Phab updates crossed paths.

I wondered about this too and was going to suggest that VMFS default to UFS, but then noticed that the other variables (VMFORMAT etc.) all need to be set and have no defaults, so this follows that pattern. I guess mk-vmimage.sh is "no user serviceable parts inside".

release/tools/vmimage.subr
102

That is, dealing with ZFS vs disk_ident, not about setting variables in the release scripts etc.

release/scripts/mk-vmimage.sh
84

More or less correct: mk-vmimage.sh has no user serviceable parts. Theoretically, everything is (or should be) passed to it through Makefile.vm.

markj added inline comments.
release/scripts/mk-vmimage.sh
37

I think you're right. The release bits generally don't do any error checking either, so if something fails (e.g., I typoed something in the make -C release invocation or forgot a flag or forgot to run the command as root) early on, the error gets lost in the command output and you have to scroll through trying to figure out what went wrong.

release/tools/vmimage.subr
102

Do you know if the underlying problem is documented somewhere? Is it just that we want to make sure we have a consistent provider name for vdevs?

187

Yes, I'll add that.

246

Yeah, I'll change that.

release/tools/ec2.conf
26–27

Not totally certain, but it looks like bsdec2-image-upload posts the image size in multiples of GB, so if the filesystem size is a multiple of 1GB then we have to round up to accommodate other partitions. In other words, I think it's a constraint of EC2/EBS?

Address review feedback, document VMFS in release.7

This revision now requires review to proceed. Oct 28 2022, 8:04 PM
This revision is now accepted and ready to land. Oct 28 2022, 8:18 PM