Page MenuHomeFreeBSD

bhyve: snapshot capsicum support
AbandonedPublic

Authored by gusev.vitaliy_gmail.com on Jun 10 2022, 11:50 PM.
Tags
Referenced Files
F88198548: D35454.id118120.diff
Fri, Jul 12, 9:51 AM
Unknown Object (File)
Tue, Jul 9, 8:07 PM
Unknown Object (File)
Sun, Jul 7, 2:55 AM
Unknown Object (File)
Wed, Jul 3, 5:22 PM
Unknown Object (File)
Mon, Jul 1, 10:27 AM
Unknown Object (File)
Wed, Jun 26, 2:19 PM
Unknown Object (File)
Wed, Jun 26, 2:18 PM
Unknown Object (File)
Wed, Jun 26, 2:04 PM

Details

Reviewers
jhb
markj
corvink
Group Reviewers
bhyve
Summary

Snapshots are placed to a directory obtained from --checkpoint/--snapshot argument.
If only name is described, snapshot is placed to the bhyvectl's current directory.

Example:

bhyvectl --suspend=/disk/dump/x1--vm=ubuntu

Snapshot 'x1' will be stored to the '/disk/dump' directory.

Sponsored by vStack.


vm destroy context needs more efforts, so ignore that error
until VM_DESTROY ioctl is implemented.

Test Plan

Start VM, suspend and then resume.

Start VM, do checkpoint several times. Verify no errors occurred.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

I'd rather see the snapshot functionality write it's data to a single file. A file descriptor for said file would be passed in from the nvlist via an IPC command.

In other words, bhyvectl would open the snapshot file and pass that file descriptor over the socket to bhyve.

In D35454#804127, @rew wrote:

I'd rather see the snapshot functionality write it's data to a single file. A file descriptor for said file would be passed in from the nvlist via an IPC command.

In other words, bhyvectl would open the snapshot file and pass that file descriptor over the socket to bhyve.

Good point. I have one more proposal. Instead of allowing bhyve to do all things at once (dump memory, kernel pages, etc), I would suggest to introduce several small commands that bhyve can perform:

  • Pause vCPUs
  • Dump VM memory to the passed fd.
  • Dump kernel&device states and config to the passed fd.
  • Unpause vCPUs

In other words, bhyvectl will send several commands to get VM memory file, kernel&device&conf file and 'bhyve' will handle it.

That approach will help in implementing Live Migration, that will imply addition commands:

  • Dump changed VM pages to a passed fd
  • Apply changed VM pages from passed fd.

@rew What do you think about idea implementing "bricks/handlers" in bhyve and full logic in bhyvectl ?

not sold on it

My perspective about save/restore is that the limitations/file format issues described in the original commit message, 483d953a86a2507355f8287c5107dc827a0ff516, need to be resolved/discussed.

Until those issues are addressed, the save/restore feature (and any feature that builds off of it) will not make it into main.

While capsicum support is useful towards obtaining that goal, I'm not convinced that save/restore needs multiple fd's to do what it needs - and I'd prefer to not let that design bleed into other userland tools (bhyvectl).

In D35454#806691, @rew wrote:

not sold on it

My perspective about save/restore is that the limitations/file format issues described in the original commit message, 483d953a86a2507355f8287c5107dc827a0ff516, need to be resolved/discussed.

Until those issues are addressed, the save/restore feature (and any feature that builds off of it) will not make it into main.

So I see the two major issues there :

  1. versioning
  2. file format.

While it is under discussing how to save binary data, I would create review that saves entire VM config to a .meta file with config's format, and all fields from .meta are saved in config's format. Example:

kern_structs.device.vioapic.size=388
kern_structs.device.vioapic.file_offset=7844

etc.

This will allow to resume VM just provide this command: bhyve -r $snapshot_name
Also it will protect against different command lines in suspended VM and resumed.

While capsicum support is useful towards obtaining that goal, I'm not convinced that save/restore needs multiple fd's to do what it needs - and I'd prefer to not let that design bleed into other userland tools (bhyvectl).

@rew bhyvectl doesn't need pass multiple fd's. It can pass one fd at time or pass bitmask of commands. If use one fd for all steps (save all things - ram, kernel, config to one file) - it will require completly rework saving format.

If use one fd for all steps (save all things - ram, kernel, config to one file) - it will require completly rework saving format.

Yes, I'm suggesting that the saving/file format will need an overhaul before the save/restore feature can be turned on by default.

From my point of view, what we have in the tree right now is a working prototype which needs to be iterated on. And given that the file format is likely to change, the usage/requirement of these fd's might also change - which is why I'm not convinced that these fd's are worth messing with right now.

Let me make it clear though...this is my current line of thinking and I'm not requesting or blocking any changes here.

I haven't fully reviewed in detail, but it seems that this is an incremental step forward, and it makes sense to go ahead with this change [

I haven't fully reviewed in detail, but it seems that this is an incremental step forward, and it makes sense to go ahead with this change [

You want to set the directory a snapshot file is written to via an environment variable?

And if you want to change that directory, have to restart the bhyve process?

gusev.vitaliy_gmail.com edited the summary of this revision. (Show Details)

Updated to the latest main branch. Open directory at bhyvectl side and pass fd to bhyve so that it can use openat() under capsicum.

I still hold the opinion that the snapshot data should be stored in a single file and the file descriptor for that file should be passed to bhyve.

The snapshot file format should be (more or less) in its final state before we start polluting this nvlist with potentially obsolete file descriptors.

In other words, these changes are putting the cart before the horse.

corvink added inline comments.
usr.sbin/bhyve/bhyverun.c
1441

I'd split this patch into multiple ones. E.g. this change is unrelated to the fd part. So, it should be a seperate commits and can be merged early.

@rew Could you take a look?

@corvink Created separate review https://reviews.freebsd.org/D38836

usr.sbin/bhyve/bhyverun.c
1441

You could move more parts into seperate commits. I know phab is not a good tool for multiple small commits and it's some additional work. However, it speeds up merging because uncritical parts can be merged earlier.

lib/libvmmapi/vmmapi.c
1757

This could be a seperate commit.

usr.sbin/bhyve/bhyverun.c
1565–1575

Create a seperate commit and explain why the move is required.

1589–1590

This is unrelated to capsicum and should be a seperate commit.

usr.sbin/bhyve/block_if.c
1012

Unrelated too.

usr.sbin/bhyve/snapshot.c
1560–1566

Is this part unrelated to your fd part? If yes, you could split this as well.

gusev.vitaliy_gmail.com marked an inline comment as done.

Correct 'fd' check:

-       return (fd > 0 ? fd : -errno);
+       return (fd >= 0 ? fd : -errno);

@corvink As you suggested I split this patch to the several reviews:

libvmm: Add missed ioctl to vm_ioctl_cmds
https://reviews.freebsd.org/D38866

bhyve: [snapshot] Do not flush readonly device at blockif_pause()
https://reviews.freebsd.org/D38855

bhyve: [snapshot] Add cap limits for ipc socket
https://reviews.freebsd.org/D38856

bhyve: Exit with EX_OSERR if init checkpoint or restore time failed
https://reviews.freebsd.org/D38872

bhyve: Init checkpoint before caph_enter()
https://reviews.freebsd.org/D38857

bhyve: Use directory fd with checkpoint
https://reviews.freebsd.org/D38858

bhyve: Enable Capsicum for snapshots
https://reviews.freebsd.org/D38860

FYI if you use just the D# notation Phabricator will add markup to show which reviews are open/closed e.g.

D38866 libvmm: Add missed ioctl to vm_ioctl_cmds
D38855 bhyve: [snapshot] Do not flush readonly device at blockif_pause()
D38856 bhyve: [snapshot] Add cap limits for ipc socket
D38872 bhyve: Exit with EX_OSERR if init checkpoint or restore time failed
D38857 bhyve: Init checkpoint before caph_enter()
D38858 bhyve: Use directory fd with checkpoint
D38860 bhyve: Enable Capsicum for snapshots

FYI if you use just the D# notation Phabricator will add markup to show which reviews are open/closed e.g.

D38866 libvmm: Add missed ioctl to vm_ioctl_cmds
D38855 bhyve: [snapshot] Do not flush readonly device at blockif_pause()
D38856 bhyve: [snapshot] Add cap limits for ipc socket
D38872 bhyve: Exit with EX_OSERR if init checkpoint or restore time failed
D38857 bhyve: Init checkpoint before caph_enter()
D38858 bhyve: Use directory fd with checkpoint
D38860 bhyve: Enable Capsicum for snapshots

Cool! Thanks!

Close as whole patch series D38886 is committed.