Page MenuHomeFreeBSD

bhyve: multiple devices support for suspend/checkpoint/resume
ClosedPublic

Authored by gusev.vitaliy_gmail.com on Jun 24 2022, 7:57 PM.
Tags
Referenced Files
Unknown Object (File)
Jan 19 2024, 7:50 AM
Unknown Object (File)
Dec 20 2023, 7:04 AM
Unknown Object (File)
Dec 10 2023, 7:21 PM
Unknown Object (File)
Nov 25 2023, 10:48 PM
Unknown Object (File)
Nov 25 2023, 3:54 PM
Unknown Object (File)
Nov 25 2023, 3:02 AM
Unknown Object (File)
Nov 22 2023, 10:49 AM
Unknown Object (File)
Nov 21 2023, 7:29 PM

Details

Summary

Current snapshot implementation doesn't support multiple devices with similar type. For example, two virtio-blk or two CD-ROM-s, etc.

So the following configuration cannot be restored.

bhyve -s 3:0,virtio-blk,disk.img  -s 4,virtio-blk,disk2.img

In some cases it is restored silently, but doesn't work. In some cases it gets failed during restore stage.

This patch fixes that.

Idea is: using unique name as identifier. For PCI devices it would be, for example, virtio-blk-pci-$bus-$slot-$func
Changes:

  • Use unique name to identify device.
  • Add simple register_snapshot_dev() function, for any type of device.
  • Changed some function names and defs related to "user devices" and renamed to "devices"
  • Renamed "struct" section to "kern_structs"
  • Removed pci_find_slotted_dev function
  • Added printing error number if pause/snapshot/resume fail

Sponsored by vStack.

Test Plan

Patch has been working well for more than year.

Command line for example (VM has two virtio-blk disks):

bhyve -c sockets=1,cores=2 -m 1024M -H -A -P -S -s 0:0,hostbridge -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd -s 3:0,virtio-blk,disk0,sectorsize=4096 -s 4:0,virtio-net,tap0 -s 5,ahci-cd,/disk/iso/ubuntu-22.04.1-live-server-amd64.iso  -s 3:1,virtio-blk,disk2,sectorsize=4096 -s 7,fbuf,tcp=10.2.0.9:5900,w=1024,h=768 -s 30,xhci,tablet -s 31:0,lpc ubuntu22

Messages during suspend:

pci_pause: not implemented for: lpc-pci-0-31-0
pci_pause: not implemented for: xhci-pci-0-30-0
pci_pause: not implemented for: fbuf-pci-0-7-0
pci_pause: not implemented for: hostbridge-pci-0-0-0
[1024.000MiB / 1024.000MiB] |###################################################################################################|

Messages during resume:

fbuf frame buffer base: 0x3315b4600000 [sz 16777216]
Pausing pci devs...
pci_pause: not implemented for: lpc-pci-0-31-0
pci_pause: not implemented for: xhci-pci-0-30-0
pci_pause: not implemented for: fbuf-pci-0-7-0
pci_pause: not implemented for: hostbridge-pci-0-0-0
Restoring vm mem...
[1024.000MiB / 1024.000MiB] |###################################################################################################|
Restoring pci devs...
Restoring kernel structs...
Resuming pci devs...
pci_resume: not implemented for: lpc-pci-0-31-0
pci_resume: not implemented for: xhci-pci-0-30-0
pci_resume: not implemented for: fbuf-pci-0-7-0
pci_resume: not implemented for: hostbridge-pci-0-0-0

Suspend of devices in .meta file:

{
  "basic metadata": {
    "ncpus": 2,
    "vmname": "ubuntu22",
    "memsize": 1073741824,
    "memflags": 2
  },
  "kern_structs": [
    {
      "device": "vhpet",
      "size": 284,
      "file_offset": 0
    },
    {
      "device": "vm",
      "size": 400,
      "file_offset": 284
    },
    {
      "device": "vioapic",
      "size": 388,
      "file_offset": 684
    },
    {
      "device": "vlapic",
      "size": 8338,
      "file_offset": 1072
    },
    {
      "device": "vmcx",
      "size": 1200,
      "file_offset": 9410
    },
    {
      "device": "vatpit",
      "size": 172,
      "file_offset": 10610
    },
    {
      "device": "vatpic",
      "size": 118,
      "file_offset": 10782
    },
    {
      "device": "vpmtmr",
      "size": 4,
      "file_offset": 10900
    },
    {
      "device": "vrtc",
      "size": 140,
      "file_offset": 10904
    }
  ],
  "devices": [
    {
      "device": "lpc-pci-0-31-0",
      "size": 620,
      "file_offset": 11044
    },
    {
      "device": "xhci-pci-0-30-0",
      "size": 2077,
      "file_offset": 11664
    },
    {
      "device": "fbuf-pci-0-7-0",
      "size": 16777668,
      "file_offset": 13741
    },
    {
      "device": "ahci-pci-0-5-0",
      "size": 1531,
      "file_offset": 16791409
    },
    {
      "device": "virtio-net-pci-0-4-0",
      "size": 66197,
      "file_offset": 16792940
    },
    {
      "device": "virtio-blk-pci-0-3-1",
      "size": 8835,
      "file_offset": 16859137
    },
    {
      "device": "virtio-blk-pci-0-3-0",
      "size": 8835,
      "file_offset": 16867972
    },
    {
      "device": "hostbridge-pci-0-0-0",
      "size": 452,
      "file_offset": 16876807
    },
    {
      "device": "atkbdc",
      "size": 141,
      "file_offset": 16877259
    }
  ]
}

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

gusev.vitaliy_gmail.com edited the summary of this revision. (Show Details)
gusev.vitaliy_gmail.com edited the test plan for this revision. (Show Details)

@jhb Could you review this patch, please? If possible it should be committed, since w/o patch, suspend/resume doesn't work with multi devices.

@rew This is stopper for following up nvlist changes. Please review and possibly approve. Thanks.

@rew This is stopper for following up nvlist changes. Please review and possibly approve. Thanks.

How is this a stopper for the file format changes?

Offhand, it looks like this should be done after the file format changes - if I'm mistaken, let me know.

To make this clear, I'm not trying to slow the process down. However, I don't want to introduce churn that will need to be undone/resorted when we finalize the file format as that will require more work in the end.

This is why I've been saying the most important thing for the snapshot feature is to finalize the file format. Until that work is done, I see it as counter-productive to extend or build on top of what we have in the tree right now.

So, how do you know this code will not need to be revisited after the file format changes? And in what way does the existing code prevent the file format work from being done?

In D35590#885880, @rew wrote:

How is this a stopper for the file format changes?

This is how it is done in our Internal tree. Reordering will create a lot of conflicts and will require a lot of efforts for testing and of course will introduce bugs. We want each patchwork works fine without regression. Don't it?

Offhand, it looks like this should be done after the file format changes - if I'm mistaken, let me know.

in terms of between features and bugs, current review is bug fix, but format changes - is feature. Bugfix is most important than implementing feature. This review should be done before format review (nvlist).

To make this clear, I'm not trying to slow the process down. However, I don't want to introduce churn that will need to be undone/resorted when we finalize the file format as that will require more work in the end.

Ok. Welcome and thanks for your efforts for reviewing and ideas.

This is why I've been saying the most important thing for the snapshot feature is to finalize the file format. Until that work is done, I see it as counter-productive to extend or build on top of what we have in the tree right now.

I guess you think that I will upstream this changes and forgot about format work. But I am going to create nvlist (including format change) as soon as that review committed.

So, how do you know this code will not need to be revisited after the file format changes? And in what way does the existing code prevent the file format work from being done?

This review simplifies snapshot/restore code for devices . All reviews: this review and nvlist review change the same file snapshot.c and this review makes creating nvlist patchwork (incl. format change) easier.

If you need I split this review into small reviews, I can do it.

I guess you think that I will upstream this changes and forgot about format work. But I am going to create nvlist (including format change) as soon as that review committed.

Before doing the work, I strongly encourage you to write an email to the virtualization mailing list with the proposed file format.

This review simplifies snapshot/restore code for devices . All reviews: this review and nvlist review change the same file snapshot.c and this review makes creating nvlist patchwork (incl. format change) easier.

If I understand it correctly, these changes break backwards compatibility with the previous format. But, judging by the interest of the snapshot feature, I doubt anyone cares.

I imagine the meta file will become obsolete in the feature since the bhyve configuration nvlist could probably be leveraged to some degree instead - but there's talk about that changing as well.

With that said, I believe subsequent changes to the meta file will make these changes obsolete and it's questionable if the cost of developer cycles is worth a proper review.

In general, I have no objection against these changes but, I'm not likely to commit them either.

If you need I split this review into small reviews, I can do it.

Only the file format is controversal. So, if you split this review into smaller ones which are unrelated to the file format change, we can make some progress.

Only the file format is controversal. So, if you split this review into smaller ones which are unrelated to the file format change, we can make some progress.

Will split into smaller patches. Format is not problem since current .meta format is flexible and is not "stable". Even all code are under #ifdef BHYVE_SNAPSHOT so it is like "Experimental" and we should care about only stability and features and speedup development .

This review is bugfix. it is improssible to fix this issue without changing current .meta representation. Also note, that coming nvlist changes with new format (only file per snapshot) will remove .meta at all.

In D35590#886517, @rew wrote:

I guess you think that I will upstream this changes and forgot about format work. But I am going to create nvlist (including format change) as soon as that review committed.

Before doing the work, I strongly encourage you to write an email to the virtualization mailing list with the proposed file format.

Yes, will do and will add reviewers to CC. Thanks!

This review simplifies snapshot/restore code for devices . All reviews: this review and nvlist review change the same file snapshot.c and this review makes creating nvlist patchwork (incl. format change) easier.

If I understand it correctly, these changes break backwards compatibility with the previous format. But, judging by the interest of the snapshot feature, I doubt anyone cares.

Agree. Especially for the fact suspend/resume was broken since November 2022 and nobody notified about that.

I imagine the meta file will become obsolete in the feature since the bhyve configuration nvlist could probably be leveraged to some degree instead - but there's talk about that changing as well.

It will be removed since all data-s will be in one file (as do Qemu, VMware and others).

With that said, I believe subsequent changes to the meta file will make these changes obsolete and it's questionable if the cost of developer cycles is worth a proper review.

The main idea of this review - introduce struct snapshot_dev and it will be used by the following nvlist changes.

In general, I have no objection against these changes but, I'm not likely to commit them either.

Good. Let's go further and make Suspend/Resume more robust, flexible, debugged, supporting more devices!

The main idea of this review - introduce struct snapshot_dev and it will be used by the following nvlist changes.

Seems like it'd make more sense to generate this meta information from the in-core configuration at the time the snapshot is taken.

Have you considered this?

In D35590#886738, @rew wrote:

The main idea of this review - introduce struct snapshot_dev and it will be used by the following nvlist changes.

Seems like it'd make more sense to generate this meta information from the in-core configuration at the time the snapshot is taken.

Have you considered this?

Did you mean using "config" subsystem? There is no container that keeps information about all devices. PCI devices has its own container, Keyboard is as its own, etc. What is why register_snapshot_dev() was introduced. It adds callback that is called at the time snapshot is taken to generate meta information about device. Probably this is wtat you suggested.

Did you mean using "config" subsystem? There is no container that keeps information about all devices. PCI devices has its own container, Keyboard is as its own, etc. What is why register_snapshot_dev() was introduced. It adds callback that is called at the time snapshot is taken to generate meta information about device. Probably this is wtat you suggested.

If I dump the config tree when taking a snapshot, here is the (trimmed) ouput:

memory.wired=true
memory.size=4G
x86.strictmsr=true
x86.vmexit_on_hlt=true
x86.vmexit_on_pause=true
acpi_tables=true
cpus=2
lpc.com1.path=stdio
lpc.bootrom=/usr/local/share/uefi-firmware/BHYVE_UEFI.fd
pci.0.0.0.device=hostbridge
pci.0.1.0.device=lpc
pci.0.3.0.device=virtio-blk
pci.0.3.0.path=/dev/md0
gdb.address=127.0.0.1
gdb.port=6668
name=uefi

For example, in vm_checkpoint()/vm_snapshot_user_devs(), instead of looping through a cached device list, you'd call into the pci subsystem that would traverse the in-core configuration, resolve the device path, and snapshot it.

@rew Of course if assume that only PCI and Keyboard devices are used, introducing

In D35590#887378, @rew wrote:

Did you mean using "config" subsystem? There is no container that keeps information about all devices. PCI devices has its own container, Keyboard is as its own, etc. What is why register_snapshot_dev() was introduced. It adds callback that is called at the time snapshot is taken to generate meta information about device. Probably this is wtat you suggested.

If I dump the config tree when taking a snapshot, here is the (trimmed) ouput:

memory.wired=true
memory.size=4G
x86.strictmsr=true
x86.vmexit_on_hlt=true
x86.vmexit_on_pause=true
acpi_tables=true
cpus=2
lpc.com1.path=stdio
lpc.bootrom=/usr/local/share/uefi-firmware/BHYVE_UEFI.fd
pci.0.0.0.device=hostbridge
pci.0.1.0.device=lpc
pci.0.3.0.device=virtio-blk
pci.0.3.0.path=/dev/md0
gdb.address=127.0.0.1
gdb.port=6668
name=uefi

For example, in vm_checkpoint()/vm_snapshot_user_devs(), instead of looping through a cached device list, you'd call into the pci subsystem that would traverse the in-core configuration, resolve the device path, and snapshot it.

PCI subsystem is not only subsystem that should be snapshot-ed. There is at least one more - "Keyboard". The patch unifies snapshot function call for all devices by introducing list of snapshotted devices. Of course it can be achieved like this:

vm_save_devices(struct vmctx *ctx, int data_fd,  xo_handle_t *xop)
{
       save_pci_devices(meta, data_fd, xop)
       save_keyboard(meta, data_fd, xop);
}

snapshot_pci_devices()
{
    // foreach all pci devices
    // or walk through nvlist config. 
}

That will work, but this is not easier than current review's approach. Additionally restore path becomes very easy because it is already known list of devices and their callbacks and callback's argument. Look at vm_restore_devices().

That will work, but this is not easier than current review's approach.

This counter-argument lacks any technical substance.

Your approach is flawed in, at least, these two ways:

  • requires maintaining an additional list of cached devices; a list that can easily be generated dynamically
  • duplicates information that is stored in the configuration

Also, this is why it's important to discuss design changes before doing the work - to make sure everyone agrees on the same approach.

Additionally restore path becomes very easy because it is already known list of devices and their callbacks and callback's argument. Look at vm_restore_devices().

Better to save the in-core configuration in the snapshot file so one could restore with a command that looks like:

bhyve -r $snapshot
In D35590#887928, @rew wrote:

That will work, but this is not easier than current review's approach.

This counter-argument lacks any technical substance.

I do not argue, just want to make the better solution. My purpose is - get another opinions and re-think about current implementation.

What I am doing now - try to compare all proposals and current implementation "having list of devices".

Your approach is flawed in, at least, these two ways:

I don't purely like terminology "your approach", because it could be considered like I want to push the solution despite on possible lacks.

  • requires maintaining an additional list of cached devices; a list that can easily be generated dynamically

Config doesn't have all information about devices. At least Keyboard.

Config can be used to get all PCI devices or it can be achieved by walking through all PCI slots, and then directly call Keyboard Snapshot , but in terms of programming it doesn't look right - it cannot be widen very well, if some another type of device (non PCI) is added.

Additionally, Look at current implementation of atkbdc_init()

atkbdc_init() {

    sc = calloc(1, sizeof(struct atkbdc_softc));

    #ifdef BHYVE_SNAPSHOT
    static struct atkbdc_softc *atkbdc_sc = NULL;
    #endif

And to get it snapshot-ed upper layer should call directly atkbdc_snapshot(), i.e. be aware about this device. Using "Config" approach doesn't solve this bad code.

Instead, Using register snapshot device allows do not care about devices, its types, internals, etc.

  • duplicates information that is stored in the configuration

It is small enough and doesn't eats a lot of memory at all. Instead, it allows quickly get all devices snapshot-ed and restored with minimum CPU usage.

And as I mentioned, restore path currently is not easy, "Config" approach doesn't make it easier.

Also, this is why it's important to discuss design changes before doing the work - to make sure everyone agrees on the same approach.

Additionally restore path becomes very easy because it is already known list of devices and their callbacks and callback's argument. Look at vm_restore_devices().

Better to save the in-core configuration in the snapshot file so one could restore with a command that looks like:

bhyve -r $snapshot

Yes, it is better. But I feel that you want me to solve all problems at once. I would go step by step. In our internal tree we have all what you described: one file per snapshot, in-core configuration, etc. But it should be reviewed and upstreamed with several reviews, not in one. So this review doesn't want to solve all THE problems what current Snapshot/Restore code has. Let's move forward and all useful things will be added :)

Let's move forward and all useful things will be added :)

sounds good

Beside some nits, it looks fine for me.

Note: I don't have a strong feeling about renaming some functions etc. However, I would like to avoid unnecessary renames to keep the change cleaner.

usr.sbin/bhyve/bhyverun.c
1501

Is this rename necessary?

usr.sbin/bhyve/pci_emul.c
993

IMHO, we should split this change into a separate commit.

2346

What's the advantage of a void pointer instead of a struct pci_devinst pointer? I'd prefer struct pci_devinst.

usr.sbin/bhyve/pci_emul.h
83

Unrelated newline change.

usr.sbin/bhyve/snapshot.c
122

IMHO, we should split such changes into a separate commit.

961

Is this rename necessary?

usr.sbin/bhyve/bhyverun.c
1501

I think yes, because:

  1. Using "*user*" is confusing.
  2. Implementation of bhyve devices (that would pause) don't have any "user" mention. So either implementation of bhyve devices also should have "user" word or this function shouldn't have "user".
  3. Following up (nvlist and others) changes will have only call to save kernel data. Kernel data would be as whole object.
usr.sbin/bhyve/pci_emul.c
993

Will do. Thanks!

2346

It is because static struct snapshot_ops (snapshot_cb) is general for all type of devices. Not only PCI. For instance, look at
register_snapshot_dev("atkbdc", &atkbdc_snapshot_ops, sc);

usr.sbin/bhyve/pci_emul.h
83

Will fix. Thanks!

usr.sbin/bhyve/snapshot.c
122

Will do.

961

Yes because it does "saving" and not save/restore. Using "snapshot" implies a function does "save and restore".

LGTM. One general note:
I know that phabricator is not a good tool to create small commits. However, I prefer small commits as they are much easier to review and therefore can speed up merging. You've already created a list of changes in your summary. Please try to create a single commit for each of these changes.

  • Use unique name to identify device.
  • Add simple register_snapshot_dev() function, for any type of device.
  • Changed some function names and defs related to "user devices" and renamed to "devices"
  • Renamed "struct" section to "kern_structs"
  • Removed pci_find_slotted_dev function
  • Added printing error number if pause/snapshot/resume fail

Btw: If you have a public git repo, you can just send me the link to a git branch where this patch is splitted into smaller ones and I'm going to commit it.

usr.sbin/bhyve/pci_emul.c
2346

Got it.

usr.sbin/bhyve/snapshot.c
175

Missing check for NULL.

@gusev.vitaliy_gmail.com As said, please split this commit. You can also send me a link to a personal git repo where I can pull the changes from.

This revision is now accepted and ready to land.May 2 2023, 7:59 AM

@gusev.vitaliy_gmail.com As said, please split this commit. You can also send me a link to a personal git repo where I can pull the changes from.

Yes, will do it soon ~ day.

@corvink I've split this patch with several patches. Note, according to @rew comments about adding and usage "register_snapshot_dev", I replaced that code with pci_next() function to walk all available PCI devices and save/restore them. 'atkbdc` device is saved/restored separately in this case.

Reviews:

D40104 bhyve: [snapshot] Rename 'struct' key with 'kern_struct'
D40105 bhyve: [snapshot] Simplify restore kernel structs
D40106 bhyve: [snapshot] Rename 'user_dev' with 'devices'
D40107 bhyve: pci_devinst.pi_name should contain bus, slot, func
D40108 bhyve: [snapshot] Add .pe_snapshot method for PCI 'hostbridge'
D40109 bhyve: [snapshot] Use pci_next() to save/restore pci devices