Page MenuHomeFreeBSD

bhyve - Snapshot Save and Restore
AbandonedPublic

Authored by jhb on Mar 7 2019, 4:56 PM.

Details

Summary

Virtual machine save and restore - currently only supports virtual machine suspend and resume due to a lack of support for disk device snapshots. The implementation is available on
GitHub.

The feature is functional for Intel processors, for certain devices. The test matrix is available on the project's wiki.

The project has not been tested on AMD processors, but Linux guests seem to freeze at least in some cases (e.g., the virtual machine is restored after the host is rebooted), so support is limited.

An overview of the snapshot mechanism is at: https://docs.google.com/document/d/1xxMM2T51giU0t8yTFWmhsBLprJ3hMvtxpUU8rkSkM40

An per-field analysis of the implementation is at: https://docs.google.com/spreadsheets/d/13wDSSeCs-vgtGAnKVGJtTNbAaom36w9BfnC_7hEaZ98

Known issues:

  • only one device of a type should be used
  • FreeBSD and Linux guests may be frozen for a few seconds (data does not seem to be lost)
  • e1000 on Linux will sometimes not work after a restore

Functionality marked as experimental using the BHYVE_SNAPSHOT flag, thanks to @jhb

Inline with upstream refs: git: e96b4170d024f41e9f178394a77006f39aaf6610; SVN: r353039

Test Plan

For all guests testing has been done as (specific devices may differ, as noted below):

  • constantly ping a remote server
  • start copying a large file
  • suspend the virtual machine
  • [optionally] reboot the host
  • restore the virtual machine
  • all functionality should continue as expected

FreeBSD guest

  • use virtio-net / e1000, ahci and frame buffer with xhci

Linux guest

  • use virtio-net, ahci and frame buffer with xhci

Windows guest

  • use virtio-net / e1000, ahci and frame buffer with xhci

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
lib/libvmmapi/vmmapi.c
463

This probably triggers warnings. Pointer math on void * isn't defined in standard C. GNU C allows it as an extension. Char * would be better. It seems like 'offaddr' should also be vm_paddr_t, not off_t. FreeBSD's style also doesn't do spaces after casts, but that's minor.

lib/libvmmapi/vmmapi.h
260

This is normally spelled 'packed' in FreeBSD and would be added to the end of the struct normally. However, I don't think you need packed here since a char array is not going to have padding on any architectures FreeBSD supports.

sys/amd64/include/vmm.h
134

Ideally this isn't really part of the API. We'd just want some kind of opaque "CPU context" perhaps that is VMCS on Intel, VMCB on AMD. This can be on the list for things to refine in tree.

sys/amd64/include/vmm_dev.h
33

These are a lot of nested includes for this header. Certainly <sys/param.h> shouldn't be a nested include. I wonder if we can get rid of all of these nested includes? It's fine to make certain things in the header conditional as the existing headers do with _SYS_CPUSET_H_ if you need to.

sys/amd64/vmm/vmm.c
1717

Why is this check here?

2877

Not to change right now, but for the one-liner wrapper functions that are only used once like this (vm_snapshot_lapic) I would probably just inline the call to vlapic_snapshot() directly in this switch instead.

sys/amd64/vmm/vmm_dev.c
775

Missing a 'break' here now.

1011

You can't call make_dev* while holding locks. There is a race with setting si_drv1 that can be fixed by using make_dev_s instead of make_dev_p, but that would be a separate change from save/restore I think,

sys/amd64/vmm/vmm_snapshot.c
9

For local variables, we can just use 'op' without leading underscores. FreeBSD generally only uses leading underscores for global symbols that are not part of the public API.

27

Maybe use __DEVOLATILE instead:

void *_data;

_data = __DEVOLATILE(void *, data);

Maybe to avoid the leading underscore you could call the function parameter 'datav' and the local variable in the function 'data'. Alternatively, if there are not many volatile pointers being passed to this function, the volatile could just be removed and __DEVOLATILE could be used in callers in the places where it is needed.

usr.sbin/bhyve/Makefile
95

Are the libucl headers not installed?

usr.sbin/bhyve/pci_emul.c
2038

When you are using stdio for the console, it leaves the console in raw mode so \n is only a newline and not a full CR+LF combo. Other printfs in bhyve only use \n though since stdio isn't always used as the console.

usr.sbin/bhyve/snapshot.h
10

Why does this need packed? If it isn't stored in a file, we should let the compiler choose the optimal layout.

I agree with @jhb that this absolutely should be placed behind #ifdef guards for now.

sys/amd64/vmm/amd/svm.c
2417–2434

Why aren't the x86 generic registers being saved/restored by logic shared by both SVM and VMX?

2436–2443

Why are you saving host state?

2455

The eptgen is specific to the state of the host machine and probably shouldn't be saved.

2457–2458

The ASID is potentially specific to the state of the host machine (it is on SmartOS bhyve at least) and shouldn't be saved.

2591–2598

As bhyve lacks AVIC support today, these should all be zero at snapshot time. Unused portions should probably be skipped.

2605–2607

I'm not sure it makes sense to record the ASID as part of snapshot state, as that's specific to the host for tracking vCPUs as they go on/off CPU

sys/amd64/vmm/amd/vmcb.h
212

Rather than requiring forward definitions here and in the other headers (vlapic, vhpet, etc) for struct vm_snapshot_meta, why not include machine/vmm_snapshot.h? It's required for the function implementation anyways.

sys/amd64/vmm/intel/vmx.c
3853–3869

With descriptors being common across the two CPU types, why not use a general mechanism for saving them?

3889–3891

Why not save these MSRs in a general fasion? (Same goes for EFER too)

3895–3896

Saving/restoring control state like this with no validation seems dangerous. It's probably safest to allow the host to reconstruct that from present (and requested) features?

3901–3926

In SmartOS bhyve, we added a CPU-agnostic method for setting the tsc offset in vmm.c, which then VMX/SVM consume in order to do the CPU-specific settings. It might make sense to do something like that here.

sys/amd64/vmm/io/vioapic.c
512–513

The write operators for these register tables enforce certain invariants about their contents. The restore process likely needs to do the same enforcement. Additionally, I believe this is skipping TMR updates for the CPUs.

sys/amd64/vmm/io/vlapic.c
1658–1662

Why was this accessor added?

sys/amd64/vmm/vmm.c
811–815

Why is this here with no consumer?

sys/amd64/vmm/vmm_dev.c
464–470

I'm uncomfortable about providing ioctl interfaces to these locking operations. All of the other vCPU lock consumers will take the needed lock(s), perform their action in kernel space, and the release the lock(s) prior to returning to userspace. These ioctls break that model. Furthermore, since the save/restore operation is driven from userspace, where the vCPU run threads are operating, there's no reason why the quiescence cannot start there.

  • rework vCPU pausing and resuming mechanism
  • rewrite VM suspend procedure - do not resume VM and devices before exit
  • add missing break clause in vmm_dev switch-case
  • various minor coding style fixes pointed by @jhb
  • remove unnecessary critical section nesting level check

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

lib/libvmmapi/vmmapi.c
463

No warnings were generated at compile time, but using char * makes more sense since baseaddr is also char *

sys/amd64/include/vmm.h
134

Removed it; no longer used.

sys/amd64/include/vmm_dev.h
33

Removed all includes from this header. Replaced 'struct vm_snapshot_req' with 'struct vm_snapshot_meta' since the former had only one member of the latter type.

sys/amd64/vmm/vmm.c
1717

Removed it; was forgotten from a previous implementation.

2877

Removed the one-line wrapper functions.

sys/amd64/vmm/vmm_dev.c
1011

Removed the comment, since it was not related to save/restore.

sys/amd64/vmm/vmm_snapshot.c
9

Renamed the variable.

27

Used __DEVOLATILE here to be safe and keep the compatibility (same parameters and return type) with the functions used in userspace.

usr.sbin/bhyve/Makefile
95

Build fails if this line is removed - #include <ucl.h> is unable to find the header.

usr.sbin/bhyve/pci_emul.c
2038

Will keep it it in mind, thank you. We can remove the CR characters for coding style sometime down the line.

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

Please delay the #ifdef guards, as per the discussion in the 2019/06/20 bhyve call we may of found an alternative, using a project branch and rebasing the git to that.

usr.sbin/bhyve/Makefile
95

Perhaps because this is a private library?
/usr/include/private/ucl/ucl.h

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

Please delay the #ifdef guards, as per the discussion in the 2019/06/20 bhyve call we may of found an alternative, using a project branch and rebasing the git to that.

We are following the roadmap discuss with John Baldwin and all of you two calls ago.

Last time you took the decision of using a branch instead of ifdef but this approach is not different than the current one. We will end up with a code that is not up-to-date. We rather keep the current github repo than creating a branch.

So the opinion of UPB team + Matthew is that we need to merge the code into HEAD. If not, it is better to keep it in the current github repo.

There is a tremendous amount of work that had been done and I need to get merged.

Remove some irrelevant comments and unused functions

sys/amd64/vmm/amd/svm.c
2417–2434

The data structures that hold them differ. Registers on Intel are also saved like this. Using generics to read and write them may be in order.

2436–2443

It's similar to the Intel implementation. We are working on removing these registers from that implementation. We will do the same here when we fix it there, but we cannot test it on AMD currently.

sys/amd64/vmm/amd/vmcb.h
212

Forward declarations were seemed like the preferred implementation. Including headers may be problematic, especially if those include other headers.

sys/amd64/vmm/intel/vmx.c
3853–3869

Simply meant to keep code for all data from vmcs bundled in one function - may be useful to move to a generic function to avoid duplication.

3889–3891

Same comment as the one for other registers.

3901–3926

We will look into it, but it's not a priority for now. Thanks for pointing it out.

sys/amd64/vmm/io/vioapic.c
512–513

Added this to the analysis document under restore checks with TODO.

sys/amd64/vmm/io/vlapic.c
1658–1662

Removed it; was a leftover from an older implementation.

sys/amd64/vmm/vmm.c
811–815

Removed it; was a leftover from an older implementation.

sys/amd64/vmm/vmm_dev.c
464–470

Reworked the VM pause to block the threads in userspace.

Even though I wasn't on the call, I agree with Mihai that a project branch isn't a good approach. It's no better than having it in a git branch, and arguably worse as svn merge not as nice as a git rebase for a patch like this.

However, the reason to have it in the tree is to allow it to be updated as part of other API changes, etc. and to allow others to iterate on it. I already have several thoughts in my head on further changes that we could make, but I'd rather doing that based on an #ifdef'd version.

sys/amd64/vmm/amd/vmcb.h
212

I strongly disagree. Practically everywhere else, shared types declarations are pulled in via headers. What about this situation makes it so complicated that doing so isn't feasible?

sys/amd64/vmm/amd/vmcb.h
212

Forward declarations are used extensively in FreeBSD when the header only needs to reference the pointer type. I guess I don’t understand why you strongly disagree with using them appropriately.

  • kick the guest threads out of vmrun when starting the pausing vCPUs.
  • remove host registers from intel/vmx.c snapshot process (AMD registers have not been removed yet).
lib/libvmmapi/vmmapi.c
670–677

Have you quantified the overhead added by these mutex/condvar operations for exit-heavy workloads? VM exits which go all the way to userspace are relatively expensive, so the relative delta might be small, but it would be useful to know.

It might make sense to move this logic from libvmmapi into the bhyve process itself, where all of the other userspace vCPU state tracking is going on. If contention turned out to be high when lots of userspace exits are occurring (since there's only the one lock today) it would be possible, armed with knowledge of the number of active vCPU threads, to do something like a per-vCPU lock and pause state. Going that far might only make sense if the vCPU threads happen to contend in such a way.

1587–1588

Error logging to stderr is left up to the library consumer (read: the bhyve binary) today. It probably makes sense to continue that.

Here and the two functions below could just be sure to leave the error state (meta->dev_name or errno) in place to be logged by the consumer, should they so choose.

usr.sbin/bhyve/bhyverun.c
170

What is this for?

174

It looks like mevent.c has an extern declaration which refers to this which doesn't reflect the const change made here. It should be kept in sync?

1345

Given the logic in spinup_vcpu() the UNRESTRICTED_GUEST capability is now a necessity for all bhyve users. If that is your intent, it probably needs to be properly documented and communicated to the user base.

usr.sbin/bhyve/block_if.c
969–978

Is there reason to believe that it's not necessary to allow the restoring host to query these parameters itself when bringing up the blockdev there? It could be that aspects of the device (particularly with respect to its capabilities) may change.

usr.sbin/bhyve/virtio.c
868–881

If you're taking a snapshot while the vCPUs are paused, won't copying of the virtqueues be redundant, as they'll be picked up in the copy of guest RAM?

  • integrate updated vCPU pause mechanism from @jhb that uses the VM debug components.
  • separate BSP vCPU initialization from other vCPUs (do not initialize all vCPUs before starting the VM when not restoring; do not automatically grant UNRESTRICTED_GUEST capabilities to BSP)
  • fix various coding errors, such as removing unused macros, using proper buffer size for snprintf and coding style
lib/libvmmapi/vmmapi.c
670–677

The vCPU pause mechanism has been changed thanks to @jhb. It is now relying on the debug functionality and should no longer cause heavy contention on mutexes.

1587–1588

Will update this at a later time. The messages from the entire lib will have to be moved where functions are used.

usr.sbin/bhyve/bhyverun.c
170

Artifact from an older implementation. The code has since been refactored.

174

I have added the 'const' qualifier to the mevent.c variable. The vmname should not be modified at runtime, so it makes more sense to keep it as a constant. On a side note, since the name of the VM is also kept in libvmmapi's 'struct vmctx' , the name does not need to be kept as a global; a getter function may be better suited.

1345

I have changed the initialization loop, and moved the BSP vCPU initialization outside it. Through it, the BSP vCPU will not have the UNRESTRICTED_GUEST capability set, and the other vCPUs will be initialized only when restoring the VM.

Since the changes added by @jhb keep track of the activated vCPUs, it may be worth keeping track of those when suspending/resuming, but I do not believe there will be any cases where this will actually be important, since they will be activated by the guest very early on.

usr.sbin/bhyve/block_if.c
969–978

If the information can be easily queried, the mechanism can be changed. However, the restore mechanism will likely become more complex, and relying on other saved data, such as guest memory, may not improve security as it is saved in a similar fashion (copied in/from a file on disk).

usr.sbin/bhyve/virtio.c
868–881

The GUEST2HOST_ADDR macros do not save memory, but rather guest memory mappings. The pointer parameter is converted to a guest address that can be used to locate the virtual queues in the new (restored) VM's address space.

darius.mihaim_gmail.com edited the summary of this revision. (Show Details)
  • Added compilation flag(s) to use as experimental functionality (BHYVE_SNAPSHOT) thanks to @jhb
  • Rebase with freebsd/freebsd updated master branch, on 4th of October 2019. Refs: git: e96b4170d024f41e9f178394a77006f39aaf6610 SVN: r353039

Add license and copyrights to snapshot specific files (snapshot.[ch], vmm_snapshot[ch]).

Rebase the code with SVN R359133 / git 217fa09bf639260f4fe7c9415d8f42b141637d51.

With the rebase we have added / fixed the following:

  • remove a field from the virtual AHCI device. The field caused issues only on a bare-metal test server, as far as we could tell, but not on desktop systems we tried (that's why we didn't catch it earlier).
  • fixed length parameter in snprintf call used to write the name of the UNIX socket for the checkpoint.
  • add a spinner-style progress bar to show that the memory snapshot process is still running.
jhb abandoned this revision.
jhb added a reviewer: darius.mihaim_gmail.com.