Page MenuHomeFreeBSD

bhyve - Snapshot Save and Restore
Needs ReviewPublic

Authored by darius.mihaim_gmail.com on Mar 7 2019, 4:56 PM.

Details

Reviewers
None
Group Reviewers
bhyve
Summary

Virtual machine save and restore - currently only supports virtual machine suspend and resume due to a lack of support for disk device snapshots. The implementation is available on
GitHub.

The feature is functional for Intel processors, for certain devices. The test matrix is available on the project's wiki.

The project has not been tested on AMD processors, but Linux guests seem to freeze at least in some cases (e.g., the virtual machine is restored after the host is rebooted), so support is limited.

An overview of the snapshot mechanism is at: https://docs.google.com/document/d/1xxMM2T51giU0t8yTFWmhsBLprJ3hMvtxpUU8rkSkM40

An per-field analysis of the implementation is at: https://docs.google.com/spreadsheets/d/13wDSSeCs-vgtGAnKVGJtTNbAaom36w9BfnC_7hEaZ98

Known issues:

  • only one device of a type should be used
  • FreeBSD and Linux guests may be frozen for a few seconds or report disk errors with AHCI block devices (data does not seem to be lost)
  • e1000 on Linux will sometimes not work after a restore
Test Plan

For all guests testing has been done as (specific devices may differ, as noted below):

  • constantly ping a remote server
  • start copying a large file
  • suspend the virtual machine
  • [optionally] reboot the host
  • restore the virtual machine
  • all functionality should continue as expected

FreeBSD guest

  • use virtio-net / e1000, ahci and frame buffer with xhci

Linux guest

  • use virtio-net, ahci and frame buffer with xhci

Windows guest

  • use virtio-net / e1000, ahci and frame buffer with xhci

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
dab added inline comments.Mar 8 2019, 4:57 AM
sys/amd64/vmm/intel/vmcs.c
541–545

See comments in function above.

araujo added a subscriber: araujo.Mar 8 2019, 7:42 AM
lib/libvmmapi/vmmapi.c
261–265

I'm not particularly fond of return (func(params));. Would it be more inline with the coding style rewriting it like this?

sys/amd64/include/vmm_snapshot.h
38–40

There is a function called vm_get_snapshot_size - not that you mention it, I realize that the information in the comment should have mentioned "snapshot size" - that computes the size (or length) of the snapshot data

42

It was left there to point that there wasn't a need for an explicit field. I may remove this line and the comment above if people agree that they shouldn't really be there.

sys/amd64/vmm/intel/vmcs.c
481–482

The decision was made to keep things symmetrical on all branches of the if (similar to breaks for all cases of a switch-case), and also as a precaution in case more code is added after the if.

485

Same as the previous case.

513–517

Same observations.

541–545

Same observations.

darius.mihaim_gmail.com edited the test plan for this revision. (Show Details)Mar 8 2019, 10:20 AM
dab added inline comments.Mar 8 2019, 1:17 PM
lib/libvmmapi/vmmapi.c
261–265

I saw other places in this code where return (ioctl(...)) was done, so it wouldn't be out of character with the rest of the code. That being said, I don't think style(9) makes any mention of this one way or the other. I like the simplified code, but it is totally up to you.

sys/amd64/include/vmm_snapshot.h
38–40

Ah, missed that. Thanks.

42

Maybe just mention vm_get_snapshot_size() in the comment.

sys/amd64/vmm/intel/vmcs.c
481–482

I prefer the simpler (and less verbose) code, but can see your point.

emaste added inline comments.Mar 8 2019, 2:33 PM
lib/libvmmapi/vmmapi.c
141–142

Agreed the printf is probably unnecessary, but if we want to keep it, should probably be warnx instead.

261–265

For something very simple like this I would go with return(function(...)), but the current one is fine. However, style(9) would omit the extra blank line before the return.

286

style(9) wants

/*
 * Multiline comments
 * to look like this.
 */
1634

style discourages // comments

lib/libvmmapi/vmmapi.h
131

We should prefer size_t (and bool) where appropriate in new code - of course this is relatively rare in FreeBSD today because much of the code predates these types.

sys/amd64/include/vmm.h
701

Some comments in this change /* Start with a capital letter and end with a period. */ and /* some are all lowercase without a period */. Style(9) wants the former.

cem added inline comments.Mar 8 2019, 3:56 PM
lib/libvmmapi/vmmapi.c
261–265

style(9) mentions this explicitly: "Values in return statements should be enclosed in parentheses" (as well as by example).

emaste added inline comments.Mar 8 2019, 5:15 PM
lib/libvmmapi/vmmapi.c
261–265

It's not a question of parens vs not, it's

err = foo()

return (err);

vs

return(foo());
novel added a subscriber: novel.Mar 9 2019, 5:51 AM

Fix issues pointed by inline comments

darius.mihaim_gmail.com marked 16 inline comments as done.Mar 10 2019, 10:52 AM
darius.mihaim_gmail.com added inline comments.
lib/libvmmapi/vmmapi.c
141–142

Removed the print for now and replaced it with a goto err as suggested. May be useful to have the VM simply not support suspend/resume, instead of failing to create it when the device is not created.

261–265

Went with return (ioctl(...)); in the end, since the function is simple enough.

297

Was forgotten there as a sanity check. I have removed it.

darius.mihaim_gmail.com marked an inline comment as done.Mar 15 2019, 9:16 AM
darius.mihaim_gmail.com updated this revision to Diff 55094.
  • remove guest memory mapping device that is currently unused. The device was used to make snapshots faster by creating a separate Copy-on-Write mapping of the guest memory. This approach is not useful for guest suspends, and we may be able to avoid exporting it through a device even for snapshots (will be investigating that approach).
  • remove some unused code

Remove some unused code and fix style of comments

novel added a comment.Apr 8 2019, 5:41 PM

This is a very useful feature, thanks for working on that!

However, it seems it could be more user-friendly.

Few observations:

  1. How do I know status of save/restore? If I do bhyvectl --vm=... --suspend=..., it seems to just exit with 0. Is there any way to figure out if the suspend attempt was successful or monitor progress? Maybe it would be good to have blocking mode for this operation (i.e. bhyvectl doesn't exit until it's finished).
  2. Same for --checkpoint.
  3. Is there a way to list all available snapshots (with creation time)?
  4. It looks like it's terminating the bhyve process even if the running VM configuration doesn't support snapshotting, which is pretty unfortunate (also, it would be good if snapshotting errors could be displayed by bhyvectl, not bhyve stdout).
usr.sbin/bhyve/pci_emul.c
2039

I get these messages for devices that I don't actually have in a VM, is that expected behavior? Also, why is that using dos-style (\r\n) line endings?

This is a very useful feature, thanks for working on that!

Thank you very much for testing it!

However, it seems it could be more user-friendly.

I agree, but we have been focusing on functionality for the most part.

Few observations:

  1. How do I know status of save/restore? If I do bhyvectl --vm=... --suspend=..., it seems to just exit with 0. Is there any way to figure out if the suspend attempt was successful or monitor progress? Maybe it would be good to have blocking mode for this operation (i.e. bhyvectl doesn't exit until it's finished).

Could be done, but for now bhyvectl simply writes to the bhyve's checkpoint UNIX socket, and no data is sent back. Simply reading from the socket in bhvyectl to make it wait for a status message from bhyve may suffice. For progress, an extension of this where bhyve sends the current progress is also possible. A very large part of the time (almost all of it, for VMs that have more than the minimal amount of RAM required by the OS) is spent to save memory, so the progress may be saved_memory / total_memory.

  1. Same for --checkpoint.

Checkpoints are currently redirected to suspends, due to a lack of support for snapshots for virtual disks. After disk support is added, proper checkpoints will also be added, but they should be created immediately. That means that bhyvectl should only get/show the fail/success status without blocking (too long).

  1. Is there a way to list all available snapshots (with creation time)?

Currently none. Closest you can get to that is listing the creation time of the snapshot files.

  1. It looks like it's terminating the bhyve process even if the running VM configuration doesn't support snapshotting, which is pretty unfortunate (also, it would be good if snapshotting errors could be displayed by bhyvectl, not bhyve stdout).

I'm not entirely sure how/why it is killed. Please add the output from bhyve, or feel free to E-mail me.

usr.sbin/bhyve/pci_emul.c
2039

It's because the devices are in a static array in snapshot.c - that's also why only one device of each type is allowed. When we'll add support for dynamically registering devices, these messages shouldn't appear anymore.

The \r\n line endings are put there because the output from bhyve was going to the next line (new line was inserted), but continued from where the last line ended (no carriage return). We didn't find why this happens, and went with the \r\n newlines instead. No real reason to keep them if the output behaves, though.

Add minor fixes to snapshot logic:

  • correct variable used as parameter for 'sizeof' in vatpic.c (the size was correct before, by coincidence)
  • remove a comment that became irrelevant in the past in vhpet.c
  • fix vLAPIC timer reset logic in the off-chance the timer would be periodic and a value of 0 is saved for the counter
krion added a subscriber: krion.May 22 2019, 8:37 PM

Remove vLAPIC fields from snapshot procedure, if they are not required after a restore / can be computed from other fields.
Save vLAPIC's LAPIC page as part of vLAPIC snapshot procedure, instead of a separate data structure.
Rename variables / functions / macros used to snapshot guest to host memory mappings to make them more intuitive.

Remove debug printfs from pci_ahci.c

darius.mihaim_gmail.com edited the summary of this revision. (Show Details)Mon, Jun 10, 6:58 PM
jhb added a subscriber: jhb.Tue, Jun 18, 4:18 PM

So I think having this be under an optional #ifdef for now is probably best, especially since it doesn't yet work with capsicum. I can add a new WITH_BHYVE_SNAPSHOT src.conf option. I can help with adding the #ifdef or you can do that. You would do something like this in bhyve's Makefile:

.if ${MK_BHYVE_SNAPSHOT} != "no"
SRCS+= snapshot.c

LIBS+= ucl xo
CFLAGS+= -DVMM_SNAPSHOT

# other settings needed
.endif

For the kernel makefile we use a new kernel option. For now we can just add it to opt_global.h by adding a line like this to sys/conf/options:

VMM_SNAPSHOT  opt_global.h

Then you can use #ifdef VMM_SNAPSHOT in the kernel sources.

Aside from that, there are probably several things to work on once this is in the tree, but the biggest one is that we probably need to pause vCPUs first before we start pausing devices since pausing the devices first can lead to device timeouts as we discussed previously. The second biggest thing is probably to replace the ioctls to lock/unlock the vCPUs by instead explicitly kicking the vCPU threads out to userland. For now you can just reuse the "debug" vmexit that the gdb stub uses assuming that we won't suspend while the debug stub is active. Eventually these two things will have to become aware of each other.

lib/libvmmapi/vmmapi.c
484

This probably triggers warnings. Pointer math on void * isn't defined in standard C. GNU C allows it as an extension. Char * would be better. It seems like 'offaddr' should also be vm_paddr_t, not off_t. FreeBSD's style also doesn't do spaces after casts, but that's minor.

lib/libvmmapi/vmmapi.h
263

This is normally spelled 'packed' in FreeBSD and would be added to the end of the struct normally. However, I don't think you need packed here since a char array is not going to have padding on any architectures FreeBSD supports.

sys/amd64/include/vmm.h
135

Ideally this isn't really part of the API. We'd just want some kind of opaque "CPU context" perhaps that is VMCS on Intel, VMCB on AMD. This can be on the list for things to refine in tree.

sys/amd64/include/vmm_dev.h
34

These are a lot of nested includes for this header. Certainly <sys/param.h> shouldn't be a nested include. I wonder if we can get rid of all of these nested includes? It's fine to make certain things in the header conditional as the existing headers do with _SYS_CPUSET_H_ if you need to.

sys/amd64/vmm/vmm.c
1720

Why is this check here?

2880

Not to change right now, but for the one-liner wrapper functions that are only used once like this (vm_snapshot_lapic) I would probably just inline the call to vlapic_snapshot() directly in this switch instead.

sys/amd64/vmm/vmm_dev.c
775

Missing a 'break' here now.

1012

You can't call make_dev* while holding locks. There is a race with setting si_drv1 that can be fixed by using make_dev_s instead of make_dev_p, but that would be a separate change from save/restore I think,

sys/amd64/vmm/vmm_snapshot.c
10

For local variables, we can just use 'op' without leading underscores. FreeBSD generally only uses leading underscores for global symbols that are not part of the public API.

28

Maybe use __DEVOLATILE instead:

void *_data;

_data = __DEVOLATILE(void *, data);

Maybe to avoid the leading underscore you could call the function parameter 'datav' and the local variable in the function 'data'. Alternatively, if there are not many volatile pointers being passed to this function, the volatile could just be removed and __DEVOLATILE could be used in callers in the places where it is needed.

usr.sbin/bhyve/Makefile
96

Are the libucl headers not installed?

usr.sbin/bhyve/pci_emul.c
2039

When you are using stdio for the console, it leaves the console in raw mode so \n is only a newline and not a full CR+LF combo. Other printfs in bhyve only use \n though since stdio isn't always used as the console.

usr.sbin/bhyve/snapshot.h
11

Why does this need packed? If it isn't stored in a file, we should let the compiler choose the optimal layout.

I agree with @jhb that this absolutely should be placed behind #ifdef guards for now.

sys/amd64/vmm/amd/svm.c
2418–2435

Why aren't the x86 generic registers being saved/restored by logic shared by both SVM and VMX?

2437–2444

Why are you saving host state?

2456

The eptgen is specific to the state of the host machine and probably shouldn't be saved.

2458–2459

The ASID is potentially specific to the state of the host machine (it is on SmartOS bhyve at least) and shouldn't be saved.

2592–2599

As bhyve lacks AVIC support today, these should all be zero at snapshot time. Unused portions should probably be skipped.

2606–2608

I'm not sure it makes sense to record the ASID as part of snapshot state, as that's specific to the host for tracking vCPUs as they go on/off CPU

sys/amd64/vmm/amd/vmcb.h
213

Rather than requiring forward definitions here and in the other headers (vlapic, vhpet, etc) for struct vm_snapshot_meta, why not include machine/vmm_snapshot.h? It's required for the function implementation anyways.

sys/amd64/vmm/intel/vmx.c
3862–3878

With descriptors being common across the two CPU types, why not use a general mechanism for saving them?

3898–3900

Why not save these MSRs in a general fasion? (Same goes for EFER too)

3904–3905

Saving/restoring control state like this with no validation seems dangerous. It's probably safest to allow the host to reconstruct that from present (and requested) features?

3910–3935

In SmartOS bhyve, we added a CPU-agnostic method for setting the tsc offset in vmm.c, which then VMX/SVM consume in order to do the CPU-specific settings. It might make sense to do something like that here.

sys/amd64/vmm/io/vioapic.c
513–514

The write operators for these register tables enforce certain invariants about their contents. The restore process likely needs to do the same enforcement. Additionally, I believe this is skipping TMR updates for the CPUs.

sys/amd64/vmm/io/vlapic.c
1659–1663

Why was this accessor added?

sys/amd64/vmm/vmm.c
809–813

Why is this here with no consumer?

sys/amd64/vmm/vmm_dev.c
465–471

I'm uncomfortable about providing ioctl interfaces to these locking operations. All of the other vCPU lock consumers will take the needed lock(s), perform their action in kernel space, and the release the lock(s) prior to returning to userspace. These ioctls break that model. Furthermore, since the save/restore operation is driven from userspace, where the vCPU run threads are operating, there's no reason why the quiescence cannot start there.

  • rework vCPU pausing and resuming mechanism
  • rewrite VM suspend procedure - do not resume VM and devices before exit
  • add missing break clause in vmm_dev switch-case
  • various minor coding style fixes pointed by @jhb
  • remove unnecessary critical section nesting level check
darius.mihaim_gmail.com marked 3 inline comments as done.Wed, Jun 26, 10:21 AM

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

lib/libvmmapi/vmmapi.c
484

No warnings were generated at compile time, but using char * makes more sense since baseaddr is also char *

sys/amd64/include/vmm.h
135

Removed it; no longer used.

sys/amd64/include/vmm_dev.h
34

Removed all includes from this header. Replaced 'struct vm_snapshot_req' with 'struct vm_snapshot_meta' since the former had only one member of the latter type.

sys/amd64/vmm/vmm.c
1720

Removed it; was forgotten from a previous implementation.

2880

Removed the one-line wrapper functions.

sys/amd64/vmm/vmm_dev.c
1012

Removed the comment, since it was not related to save/restore.

sys/amd64/vmm/vmm_snapshot.c
10

Renamed the variable.

28

Used __DEVOLATILE here to be safe and keep the compatibility (same parameters and return type) with the functions used in userspace.

usr.sbin/bhyve/Makefile
96

Build fails if this line is removed - #include <ucl.h> is unable to find the header.

usr.sbin/bhyve/pci_emul.c
2039

Will keep it it in mind, thank you. We can remove the CR characters for coding style sometime down the line.

I have applied the feedback @jhb gave, except for the #ifdef guards. Will work on it after rebasing with a newer master.

Please delay the #ifdef guards, as per the discussion in the 2019/06/20 bhyve call we may of found an alternative, using a project branch and rebasing the git to that.

usr.sbin/bhyve/Makefile
96

Perhaps because this is a private library?
/usr/include/private/ucl/ucl.h