Page MenuHomeFreeBSD

bhyve - Snapshot Save and Restore
Needs ReviewPublic

Authored by darius.mihaim_gmail.com on Mar 7 2019, 4:56 PM.

Details

Reviewers
None
Group Reviewers
bhyve
Summary

Virtual machine save and restore - currently only supports virtual machine suspend and resume due to a lack of support for disk device snapshots. The implementation is available on
GitHub.

The feature is functional for Intel processors, for certain devices. The test matrix is available on the project's wiki.

The project has not been tested on AMD processors, but Linux guests seem to freeze at least in some cases (e.g., the virtual machine is restored after the host is rebooted), so support is limited.

An overview of the snapshot mechanism is at: https://docs.google.com/document/d/1xxMM2T51giU0t8yTFWmhsBLprJ3hMvtxpUU8rkSkM40

An per-field analysis of the implementation is at: https://docs.google.com/spreadsheets/d/13wDSSeCs-vgtGAnKVGJtTNbAaom36w9BfnC_7hEaZ98

Known issues:

  • only one device of a type should be used
  • FreeBSD and Linux guests may be frozen for a few seconds or report disk errors with AHCI block devices (data does not seem to be lost)
  • e1000 on Linux will sometimes not work after a restore
Test Plan

For all guests testing has been done as (specific devices may differ, as noted below):

  • constantly ping a remote server
  • start copying a large file
  • suspend the virtual machine
  • [optionally] reboot the host
  • restore the virtual machine
  • all functionality should continue as expected

FreeBSD guest

  • use virtio-net / e1000, ahci and frame buffer with xhci

Linux guest

  • use virtio-net, ahci and frame buffer with xhci

Windows guest

  • use virtio-net / e1000, ahci and frame buffer with xhci

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
novel added a comment.Apr 8 2019, 5:41 PM

This is a very useful feature, thanks for working on that!

However, it seems it could be more user-friendly.

Few observations:

  1. How do I know status of save/restore? If I do bhyvectl --vm=... --suspend=..., it seems to just exit with 0. Is there any way to figure out if the suspend attempt was successful or monitor progress? Maybe it would be good to have blocking mode for this operation (i.e. bhyvectl doesn't exit until it's finished).
  2. Same for --checkpoint.
  3. Is there a way to list all available snapshots (with creation time)?
  4. It looks like it's terminating the bhyve process even if the running VM configuration doesn't support snapshotting, which is pretty unfortunate (also, it would be good if snapshotting errors could be displayed by bhyvectl, not bhyve stdout).
usr.sbin/bhyve/pci_emul.c
2039

I get these messages for devices that I don't actually have in a VM, is that expected behavior? Also, why is that using dos-style (\r\n) line endings?

This is a very useful feature, thanks for working on that!

Thank you very much for testing it!

However, it seems it could be more user-friendly.

I agree, but we have been focusing on functionality for the most part.

Few observations:

  1. How do I know status of save/restore? If I do bhyvectl --vm=... --suspend=..., it seems to just exit with 0. Is there any way to figure out if the suspend attempt was successful or monitor progress? Maybe it would be good to have blocking mode for this operation (i.e. bhyvectl doesn't exit until it's finished).

Could be done, but for now bhyvectl simply writes to the bhyve's checkpoint UNIX socket, and no data is sent back. Simply reading from the socket in bhvyectl to make it wait for a status message from bhyve may suffice. For progress, an extension of this where bhyve sends the current progress is also possible. A very large part of the time (almost all of it, for VMs that have more than the minimal amount of RAM required by the OS) is spent to save memory, so the progress may be saved_memory / total_memory.

  1. Same for --checkpoint.

Checkpoints are currently redirected to suspends, due to a lack of support for snapshots for virtual disks. After disk support is added, proper checkpoints will also be added, but they should be created immediately. That means that bhyvectl should only get/show the fail/success status without blocking (too long).

  1. Is there a way to list all available snapshots (with creation time)?

Currently none. Closest you can get to that is listing the creation time of the snapshot files.

  1. It looks like it's terminating the bhyve process even if the running VM configuration doesn't support snapshotting, which is pretty unfortunate (also, it would be good if snapshotting errors could be displayed by bhyvectl, not bhyve stdout).

I'm not entirely sure how/why it is killed. Please add the output from bhyve, or feel free to E-mail me.

usr.sbin/bhyve/pci_emul.c
2039

It's because the devices are in a static array in snapshot.c - that's also why only one device of each type is allowed. When we'll add support for dynamically registering devices, these messages shouldn't appear anymore.

The \r\n line endings are put there because the output from bhyve was going to the next line (new line was inserted), but continued from where the last line ended (no carriage return). We didn't find why this happens, and went with the \r\n newlines instead. No real reason to keep them if the output behaves, though.

Add minor fixes to snapshot logic:

  • correct variable used as parameter for 'sizeof' in vatpic.c (the size was correct before, by coincidence)
  • remove a comment that became irrelevant in the past in vhpet.c
  • fix vLAPIC timer reset logic in the off-chance the timer would be periodic and a value of 0 is saved for the counter
krion added a subscriber: krion.May 22 2019, 8:37 PM

Remove vLAPIC fields from snapshot procedure, if they are not required after a restore / can be computed from other fields.
Save vLAPIC's LAPIC page as part of vLAPIC snapshot procedure, instead of a separate data structure.
Rename variables / functions / macros used to snapshot guest to host memory mappings to make them more intuitive.

Remove debug printfs from pci_ahci.c

darius.mihaim_gmail.com edited the summary of this revision. (Show Details)Jun 10 2019, 6:58 PM
jhb added a subscriber: jhb.Jun 18 2019, 4:18 PM

So I think having this be under an optional #ifdef for now is probably best, especially since it doesn't yet work with capsicum. I can add a new WITH_BHYVE_SNAPSHOT src.conf option. I can help with adding the #ifdef or you can do that. You would do something like this in bhyve's Makefile:

.if ${MK_BHYVE_SNAPSHOT} != "no"
SRCS+= snapshot.c

LIBS+= ucl xo
CFLAGS+= -DVMM_SNAPSHOT

# other settings needed
.endif

For the kernel makefile we use a new kernel option. For now we can just add it to opt_global.h by adding a line like this to sys/conf/options:

VMM_SNAPSHOT  opt_global.h

Then you can use #ifdef VMM_SNAPSHOT in the kernel sources.

Aside from that, there are probably several things to work on once this is in the tree, but the biggest one is that we probably need to pause vCPUs first before we start pausing devices since pausing the devices first can lead to device timeouts as we discussed previously. The second biggest thing is probably to replace the ioctls to lock/unlock the vCPUs by instead explicitly kicking the vCPU threads out to userland. For now you can just reuse the "debug" vmexit that the gdb stub uses assuming that we won't suspend while the debug stub is active. Eventually these two things will have to become aware of each other.

lib/libvmmapi/vmmapi.c
532

This probably triggers warnings. Pointer math on void * isn't defined in standard C. GNU C allows it as an extension. Char * would be better. It seems like 'offaddr' should also be vm_paddr_t, not off_t. FreeBSD's style also doesn't do spaces after casts, but that's minor.

lib/libvmmapi/vmmapi.h
273

This is normally spelled 'packed' in FreeBSD and would be added to the end of the struct normally. However, I don't think you need packed here since a char array is not going to have padding on any architectures FreeBSD supports.

sys/amd64/include/vmm.h
135

Ideally this isn't really part of the API. We'd just want some kind of opaque "CPU context" perhaps that is VMCS on Intel, VMCB on AMD. This can be on the list for things to refine in tree.

sys/amd64/include/vmm_dev.h
34

These are a lot of nested includes for this header. Certainly <sys/param.h> shouldn't be a nested include. I wonder if we can get rid of all of these nested includes? It's fine to make certain things in the header conditional as the existing headers do with _SYS_CPUSET_H_ if you need to.

sys/amd64/vmm/vmm.c
1785

Why is this check here?

2947

Not to change right now, but for the one-liner wrapper functions that are only used once like this (vm_snapshot_lapic) I would probably just inline the call to vlapic_snapshot() directly in this switch instead.

sys/amd64/vmm/vmm_dev.c
805–811

Missing a 'break' here now.

1203

You can't call make_dev* while holding locks. There is a race with setting si_drv1 that can be fixed by using make_dev_s instead of make_dev_p, but that would be a separate change from save/restore I think,

sys/amd64/vmm/vmm_snapshot.c
10

For local variables, we can just use 'op' without leading underscores. FreeBSD generally only uses leading underscores for global symbols that are not part of the public API.

28

Maybe use __DEVOLATILE instead:

void *_data;

_data = __DEVOLATILE(void *, data);

Maybe to avoid the leading underscore you could call the function parameter 'datav' and the local variable in the function 'data'. Alternatively, if there are not many volatile pointers being passed to this function, the volatile could just be removed and __DEVOLATILE could be used in callers in the places where it is needed.

usr.sbin/bhyve/Makefile
96

Are the libucl headers not installed?

usr.sbin/bhyve/pci_emul.c
2039

When you are using stdio for the console, it leaves the console in raw mode so \n is only a newline and not a full CR+LF combo. Other printfs in bhyve only use \n though since stdio isn't always used as the console.

usr.sbin/bhyve/snapshot.h
11

Why does this need packed? If it isn't stored in a file, we should let the compiler choose the optimal layout.

I agree with @jhb that this absolutely should be placed behind #ifdef guards for now.

sys/amd64/vmm/amd/svm.c
2418–2435

Why aren't the x86 generic registers being saved/restored by logic shared by both SVM and VMX?

2437–2444

Why are you saving host state?

2456

The eptgen is specific to the state of the host machine and probably shouldn't be saved.

2458–2459

The ASID is potentially specific to the state of the host machine (it is on SmartOS bhyve at least) and shouldn't be saved.

2592–2599

As bhyve lacks AVIC support today, these should all be zero at snapshot time. Unused portions should probably be skipped.

2606–2608

I'm not sure it makes sense to record the ASID as part of snapshot state, as that's specific to the host for tracking vCPUs as they go on/off CPU

sys/amd64/vmm/amd/vmcb.h
213

Rather than requiring forward definitions here and in the other headers (vlapic, vhpet, etc) for struct vm_snapshot_meta, why not include machine/vmm_snapshot.h? It's required for the function implementation anyways.

sys/amd64/vmm/intel/vmx.c
3860–3876

With descriptors being common across the two CPU types, why not use a general mechanism for saving them?

3896–3898

Why not save these MSRs in a general fasion? (Same goes for EFER too)

3902–3903

Saving/restoring control state like this with no validation seems dangerous. It's probably safest to allow the host to reconstruct that from present (and requested) features?

3908–3933

In SmartOS bhyve, we added a CPU-agnostic method for setting the tsc offset in vmm.c, which then VMX/SVM consume in order to do the CPU-specific settings. It might make sense to do something like that here.

sys/amd64/vmm/io/vioapic.c
513–514

The write operators for these register tables enforce certain invariants about their contents. The restore process likely needs to do the same enforcement. Additionally, I believe this is skipping TMR updates for the CPUs.

sys/amd64/vmm/io/vlapic.c
1659–1663

Why was this accessor added?

sys/amd64/vmm/vmm.c
810–814

Why is this here with no consumer?

sys/amd64/vmm/vmm_dev.c
484–490

I'm uncomfortable about providing ioctl interfaces to these locking operations. All of the other vCPU lock consumers will take the needed lock(s), perform their action in kernel space, and the release the lock(s) prior to returning to userspace. These ioctls break that model. Furthermore, since the save/restore operation is driven from userspace, where the vCPU run threads are operating, there's no reason why the quiescence cannot start there.

  • rework vCPU pausing and resuming mechanism
  • rewrite VM suspend procedure - do not resume VM and devices before exit
  • add missing break clause in vmm_dev switch-case
  • various minor coding style fixes pointed by @jhb
  • remove unnecessary critical section nesting level check
darius.mihaim_gmail.com marked 3 inline comments as done.Jun 26 2019, 10:21 AM

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

lib/libvmmapi/vmmapi.c
532

No warnings were generated at compile time, but using char * makes more sense since baseaddr is also char *

sys/amd64/include/vmm.h
135

Removed it; no longer used.

sys/amd64/include/vmm_dev.h
34

Removed all includes from this header. Replaced 'struct vm_snapshot_req' with 'struct vm_snapshot_meta' since the former had only one member of the latter type.

sys/amd64/vmm/vmm.c
1785

Removed it; was forgotten from a previous implementation.

2947

Removed the one-line wrapper functions.

sys/amd64/vmm/vmm_dev.c
1203

Removed the comment, since it was not related to save/restore.

sys/amd64/vmm/vmm_snapshot.c
10

Renamed the variable.

28

Used __DEVOLATILE here to be safe and keep the compatibility (same parameters and return type) with the functions used in userspace.

usr.sbin/bhyve/Makefile
96

Build fails if this line is removed - #include <ucl.h> is unable to find the header.

usr.sbin/bhyve/pci_emul.c
2039

Will keep it it in mind, thank you. We can remove the CR characters for coding style sometime down the line.

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

Please delay the #ifdef guards, as per the discussion in the 2019/06/20 bhyve call we may of found an alternative, using a project branch and rebasing the git to that.

usr.sbin/bhyve/Makefile
96

Perhaps because this is a private library?
/usr/include/private/ucl/ucl.h

mihai added a subscriber: mihai.Jun 26 2019, 12:39 PM

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

Please delay the #ifdef guards, as per the discussion in the 2019/06/20 bhyve call we may of found an alternative, using a project branch and rebasing the git to that.

We are following the roadmap discuss with John Baldwin and all of you two calls ago.

Last time you took the decision of using a branch instead of ifdef but this approach is not different than the current one. We will end up with a code that is not up-to-date. We rather keep the current github repo than creating a branch.

So the opinion of UPB team + Matthew is that we need to merge the code into HEAD. If not, it is better to keep it in the current github repo.

There is a tremendous amount of work that had been done and I need to get merged.

Remove some irrelevant comments and unused functions

sys/amd64/vmm/amd/svm.c
2418–2435

The data structures that hold them differ. Registers on Intel are also saved like this. Using generics to read and write them may be in order.

2437–2444

It's similar to the Intel implementation. We are working on removing these registers from that implementation. We will do the same here when we fix it there, but we cannot test it on AMD currently.

sys/amd64/vmm/amd/vmcb.h
213

Forward declarations were seemed like the preferred implementation. Including headers may be problematic, especially if those include other headers.

sys/amd64/vmm/intel/vmx.c
3860–3876

Simply meant to keep code for all data from vmcs bundled in one function - may be useful to move to a generic function to avoid duplication.

3896–3898

Same comment as the one for other registers.

3908–3933

We will look into it, but it's not a priority for now. Thanks for pointing it out.

sys/amd64/vmm/io/vioapic.c
513–514

Added this to the analysis document under restore checks with TODO.

sys/amd64/vmm/io/vlapic.c
1659–1663

Removed it; was a leftover from an older implementation.

sys/amd64/vmm/vmm.c
810–814

Removed it; was a leftover from an older implementation.

sys/amd64/vmm/vmm_dev.c
484–490

Reworked the VM pause to block the threads in userspace.

jhb added a comment.Jun 27 2019, 5:38 PM

Even though I wasn't on the call, I agree with Mihai that a project branch isn't a good approach. It's no better than having it in a git branch, and arguably worse as svn merge not as nice as a git rebase for a patch like this.

However, the reason to have it in the tree is to allow it to be updated as part of other API changes, etc. and to allow others to iterate on it. I already have several thoughts in my head on further changes that we could make, but I'd rather doing that based on an #ifdef'd version.

sys/amd64/vmm/amd/vmcb.h
213

I strongly disagree. Practically everywhere else, shared types declarations are pulled in via headers. What about this situation makes it so complicated that doing so isn't feasible?

cem added inline comments.Jul 3 2019, 7:45 AM
sys/amd64/vmm/amd/vmcb.h
213

Forward declarations are used extensively in FreeBSD when the header only needs to reference the pointer type. I guess I don’t understand why you strongly disagree with using them appropriately.

  • kick the guest threads out of vmrun when starting the pausing vCPUs.
  • remove host registers from intel/vmx.c snapshot process (AMD registers have not been removed yet).
lib/libvmmapi/vmmapi.c
739–746

Have you quantified the overhead added by these mutex/condvar operations for exit-heavy workloads? VM exits which go all the way to userspace are relatively expensive, so the relative delta might be small, but it would be useful to know.

It might make sense to move this logic from libvmmapi into the bhyve process itself, where all of the other userspace vCPU state tracking is going on. If contention turned out to be high when lots of userspace exits are occurring (since there's only the one lock today) it would be possible, armed with knowledge of the number of active vCPU threads, to do something like a per-vCPU lock and pause state. Going that far might only make sense if the vCPU threads happen to contend in such a way.

1656–1657

Error logging to stderr is left up to the library consumer (read: the bhyve binary) today. It probably makes sense to continue that.

Here and the two functions below could just be sure to leave the error state (meta->dev_name or errno) in place to be logged by the consumer, should they so choose.

usr.sbin/bhyve/bhyverun.c
171

What is this for?

176

It looks like mevent.c has an extern declaration which refers to this which doesn't reflect the const change made here. It should be kept in sync?

1333

Given the logic in spinup_vcpu() the UNRESTRICTED_GUEST capability is now a necessity for all bhyve users. If that is your intent, it probably needs to be properly documented and communicated to the user base.

usr.sbin/bhyve/block_if.c
972–981

Is there reason to believe that it's not necessary to allow the restoring host to query these parameters itself when bringing up the blockdev there? It could be that aspects of the device (particularly with respect to its capabilities) may change.

usr.sbin/bhyve/virtio.c
869–882

If you're taking a snapshot while the vCPUs are paused, won't copying of the virtqueues be redundant, as they'll be picked up in the copy of guest RAM?

  • integrate updated vCPU pause mechanism from @jhb that uses the VM debug components.
  • separate BSP vCPU initialization from other vCPUs (do not initialize all vCPUs before starting the VM when not restoring; do not automatically grant UNRESTRICTED_GUEST capabilities to BSP)
  • fix various coding errors, such as removing unused macros, using proper buffer size for snprintf and coding style
lib/libvmmapi/vmmapi.c
739–746

The vCPU pause mechanism has been changed thanks to @jhb. It is now relying on the debug functionality and should no longer cause heavy contention on mutexes.

1656–1657

Will update this at a later time. The messages from the entire lib will have to be moved where functions are used.

usr.sbin/bhyve/bhyverun.c
171

Artifact from an older implementation. The code has since been refactored.

176

I have added the 'const' qualifier to the mevent.c variable. The vmname should not be modified at runtime, so it makes more sense to keep it as a constant. On a side note, since the name of the VM is also kept in libvmmapi's 'struct vmctx' , the name does not need to be kept as a global; a getter function may be better suited.

1333

I have changed the initialization loop, and moved the BSP vCPU initialization outside it. Through it, the BSP vCPU will not have the UNRESTRICTED_GUEST capability set, and the other vCPUs will be initialized only when restoring the VM.

Since the changes added by @jhb keep track of the activated vCPUs, it may be worth keeping track of those when suspending/resuming, but I do not believe there will be any cases where this will actually be important, since they will be activated by the guest very early on.

usr.sbin/bhyve/block_if.c
972–981

If the information can be easily queried, the mechanism can be changed. However, the restore mechanism will likely become more complex, and relying on other saved data, such as guest memory, may not improve security as it is saved in a similar fashion (copied in/from a file on disk).

usr.sbin/bhyve/virtio.c
869–882

The GUEST2HOST_ADDR macros do not save memory, but rather guest memory mappings. The pointer parameter is converted to a guest address that can be used to locate the virtual queues in the new (restored) VM's address space.