Page MenuHomeFreeBSD

bhyve - snapshot capsicum integration[Part 1]
Needs ReviewPublic

Authored by ionut.mihalache1506_gmail.com on Mar 14 2022, 1:41 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jul 14, 8:36 AM
Unknown Object (File)
Tue, Jul 2, 1:42 AM
Unknown Object (File)
Sun, Jun 30, 11:53 AM
Unknown Object (File)
Sun, Jun 30, 8:30 AM
Unknown Object (File)
Sun, Jun 30, 8:07 AM
Unknown Object (File)
Thu, Jun 27, 7:09 PM
Unknown Object (File)
Wed, Jun 26, 11:16 PM
Unknown Object (File)
Tue, Jun 25, 3:21 AM

Details

Reviewers
rew
Group Reviewers
bhyve
capsicum
Summary

Integrating bhyve snapshot mechanism with capsicum was deactivated in order for the snapshot mechanism to be used.
The problems during the snapshot mechanism were caused by the use of some sockets that did not have the necessary permissions, such as
the socket used to communicate between bhyve and bhyvectl or the file descriptors for the disks.

In order to increase the protection of the operations done during the snapshot process I added the necessary restrictions for the used sockets and file descriptors and I replaced the open system call with openat.

For some of the system calls(such as sysctlbyname) I used casper service in order to use these calls for a more secure running environment.

Test Plan

Start the guest vm with: bhyve -c 1 -m 256M -H -A -P -s 0:0,hostbridge -s 1:0,lpc -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd -s 2:0,virtio-net,tap0 -s 3:0,ahci-hd,main.img -l com1,stdio -t <path_to_the_directory_to_save_snapshot_files> freebsd_guest

For testing the functionality open two processes at the same time on the guest vm:

  • one copy process of a 500MB file
  • one ping process

While the processes mentioned above are running:

  • suspend the guest with the command: bhyvectl --suspend file.ckp --vm=freebsd_guestand
  • restore its state with the command: bhyve -c 1 -m 256M -H -A -P -s 0:0,hostbridge -s 1:0,lpc -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd -s 2:0,virtio-net,tap0 -s 3:0,ahci-hd,main.img -l com1,stdio -r file.ckp freebsd_guest.

After the restore command, both processes should finish successfully and the guest can be used without errors.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
emaste added inline comments.
Makefile.inc1
2782 ↗(On Diff #103942)

curious why we need casper here

usr.sbin/bhyve/Makefile
95

@oshogbo why is this commented out today?

123–124

When this works w/ Capsicum we should just remove the comment and CFLAGS

usr.sbin/bhyve/bhyverun.c
1127

extra debugging to be removed

1554–1555

strange whitespace here

usr.sbin/bhyve/Makefile
95

The snapshot/checkpoint code depends on the open syscalls.
IRC John was working on improving this - https://reviews.freebsd.org/D30471

usr.sbin/bhyve/Makefile
95

Right but I mean this one is not in a MK_BHYVE_SNAPSHOT block

usr.sbin/bhyve/Makefile
95

Oh. I didn't notice the MK_BHYVE_SNAPSHOT block till you tell me.
Yea it should be decided there then.

rew requested changes to this revision.Mar 17 2022, 5:47 PM
rew added a subscriber: rew.

You have broken up https://reviews.freebsd.org/D30471 into multiple reviews without addressing the issues I raised in the original review.

This revision now requires changes to proceed.Mar 17 2022, 5:47 PM
In D34547#783705, @rew wrote:

You have broken up https://reviews.freebsd.org/D30471 into multiple reviews without addressing the issues I raised in the original review.

You are correct about the issues that you mentioned in the original ticket. The split into multiple reviews it should be easier to address all the problems, this being the reason I did the split without solving them. I am working on a solution for the mentions issues. Your example regarding using /dev/md0 will not work with the current implementation. The parent directory of the first disk was used to provide a testing environment without adding a restriction to the path of the state files. The files are created after the guest is booted and the capability mode is on, so the use of the open system call is not possible anymore at that point, that is why I need a file descriptor for a parent directory. The solution that I am working will add a way to specify the checkpoint directory at boot time.

The solution that I am working will add a way to specify the checkpoint directory at boot time.

To be clear, we should wait for further updates to the revision?

usr.sbin/bhyve/snapshot.c
1484

The brace looks like it's in the wrong place. The subsequent revisions modify this section some more, and finally the close() is removed again in the last revision, so this is just noise.

All revisions in this stack need to be updated and rebased.

For example, in https://reviews.freebsd.org/D34586, get_ckp_path() is added and then gets removed in https://reviews.freebsd.org/D34587.

The solution that I am working will add a way to specify the checkpoint directory at boot time.

To be clear, we should wait for further updates to the revision?

I did apply the fix that I mentioned in the ticket from here https://reviews.freebsd.org/D34587. With the way that I did the split I changed the third ticket because it is the one that has all the changes ready for testing. If you consider that it would be make the review easier I will apply the changes for this ticket as well.

In D34547#787009, @rew wrote:

All revisions in this stack need to be updated and rebased.

For example, in https://reviews.freebsd.org/D34586, get_ckp_path() is added and then gets removed in https://reviews.freebsd.org/D34587.

This happened because of the way the I did the split, https://reviews.freebsd.org/D34587 having the final changes for testing. I will apply the changes that I did for this ticket as well, and I think after the update, https://reviews.freebsd.org/D34586 will become obsolete due to how it was initially created.

I updated the revision to match the changes applied in https://reviews.freebsd.org/D34587 as well to solve an issue when the first disk path is /dev/md0. Now the snapshot directory can be specified at boot time using -t option.
I also applied the code refactorization to match part 3(https://reviews.freebsd.org/D34587) of the stack review.

ionut.mihalache1506_gmail.com edited the summary of this revision. (Show Details)
ionut.mihalache1506_gmail.com edited the test plan for this revision. (Show Details)

I did one more update to the proposed implementation. Initially I did a split in order to make the review process easier. After applying the recommended changes parts 2 and 3 of the stack are not needed anymore because the code turned out to be enough for one ticket. This last update contains all the changes for the capsicum integration and I will archive parts 2 and 3. If you have any others changes that you consider that should be applied please use only this ticket. I am eager to see what is your opinion after all the changes were applied.

I propose to simplify code and fix could be:

  1. Create "snapshots" dir, open it and create snapshots under this directory. I.e. do not add "-t" option.
  2. Add VM_SNAPSHOT_REQ to vm_get_ioctls().
  3. Move init_checkpoint_thread() before cap_enter()

That should be enough and makes not so much code.

usr.sbin/bhyve/snapshot.c
1395

I see that casper that you added is really needed only for destroying vm environment via sysctl. But what if leave it as I and do not destroy vm env? This behaviour could be similar as after "halt" in VM.

usr.sbin/bhyve/snapshot.c
1395

*leave it as is and do not destroy vm env on suspend*

usr.sbin/bhyve/Makefile
93–94

The comment can be deleted.

usr.sbin/bhyve/bhyverun.c
38

This should be further down, together with other userspace headers.

1306

Please document this in the bhyve(8) man page.

1562

This should be called from init_snapshot(), not from the top-level bhyve code. It only initializes state for the snapshot code.

usr.sbin/bhyve/snapshot.c
362–364

Missing error checking.

1367
1395

This is the wrong time to be creating a casper service. All services should be started at initialization time, after which the "casper capability" should be closed.

1417

This doesn't need to be a macro, please make it a regular subroutine.

1520
1689
1702

The casper channel should be closed after initialization is complete. Otherwise Capsicum is not going to accomplish anything, since malicious code can always request new capabilities via new casper services.

I applied most of the changes recommended by @markj. I did not add the documentation for the "-t" yet. @gusev.vitaliy_gmail.com i did add VM_SNAPSHOT_REQ to vm_get_ioctls() and I did move init_checkpoint_thread() before cap_enter() in bhyverun.c; I did not implement the first recommendation yet considering the feedback from @markj and I don't know exactly if I should remove or not the "-t <checkpoint_dir_path>" approach. What do you think of this current patch version?

What about receiving the file descriptors from the nvlist?

The sender (e.g., bhyvectl) opens the file descriptors and adds them to the nvlist; then bhyve gets the descriptors from the nvlist and writes to it. I think that would let us to skip out on adding the -t flag and ckp_path altogether..?

Part of the problem with that is, three file descriptors would need to be sent across. And considering the snapshot data model isn't set in stone..

it'd be nice if the snapshot feature only used one file descriptor to do its work.

thoughts?

usr.sbin/bhyve/snapshot.c
1367

EPRINTLN() is preferred for new code, which handles raw mode; see bhyve/debug.h

In D34547#791959, @rew wrote:

What about receiving the file descriptors from the nvlist?

The sender (e.g., bhyvectl) opens the file descriptors and adds them to the nvlist; then bhyve gets the descriptors from the nvlist and writes to it. I think that would let us to skip out on adding the -t flag and ckp_path altogether..?

Part of the problem with that is, three file descriptors would need to be sent across. And considering the snapshot data model isn't set in stone..

it'd be nice if the snapshot feature only used one file descriptor to do its work.

thoughts?

I think getting the file descriptors from the nvlist it is a good approach but considering it may add some more code and as you said the snapshot data model does not have yet a definitive version maybe the approach would make more sense in a new ticket. For using one file descriptor to save all the information, https://reviews.freebsd.org/D29262 would be a start to reduce the number of used files.

Add EPRINTLN to handle raw mode when printing the error messages for the created functions.

Please look at comments. A lot of code look excessive. Removing vm destroy can eliminate of adding and changing many files.

usr.sbin/bhyve/snapshot.c
171

why 'capsysctl' is required?

266

I guess all under ifdef is not really required.

278

This also is not required.

314

This is not required ?

350

Why not openat()? And this function needs changes at all if consider capsicum support?

1359

This function should return value and vm_checkpoint() should check that value.

1389

I suggest remove this function and vm-destroy usage. Leave vm state as is on suspend.

1395

Sorry, if user does "halt" inside VM, VM context is not destroyed when bhyve process ends.

1549

Why cdir_fd is closed here? It should persist cross "checkpoint" or "suspend" calls. And using it under "ifdef WITHOUT_CAPSICUM" is also wrong.

1610

Why all below under ifdef is required? I guess it is excessive.

1699

Original code works with capsicum w/o any changing, so this function is not required.

@jhb In original code (bhyverun.c) I see this:

^361da7386423 bhyverun.c                (Peter Grehan         2016-05-27  901)  case VM_SUSPEND_POWEROFF: 
621b5090487de usr.sbin/bhyve/bhyverun.c (John Baldwin         2019-06-26  902)      if (get_config_bool_default("destroy_on_poweroff", false))
0a1016f9e8a84 usr.sbin/bhyve/bhyverun.c (Pawel Biernacki      2020-06-25  903)          vm_destroy(ctx);
^361da7386423 bhyverun.c                (Peter Grehan         2016-05-27  904)      exit(1);

I suppose code also doesn't work with enabled capsicum. So before going further for this review, probably VM_DESTROY ioctl should be implemented. I can implement it and open review.

usr.sbin/bhyve/snapshot.c
266

Why?

usr.sbin/bhyve/snapshot.c
266

Because it should work w/o that code. What problem does that code solve?

@jhb In original code (bhyverun.c) I see this:

^361da7386423 bhyverun.c                (Peter Grehan         2016-05-27  901)  case VM_SUSPEND_POWEROFF: 
621b5090487de usr.sbin/bhyve/bhyverun.c (John Baldwin         2019-06-26  902)      if (get_config_bool_default("destroy_on_poweroff", false))
0a1016f9e8a84 usr.sbin/bhyve/bhyverun.c (Pawel Biernacki      2020-06-25  903)          vm_destroy(ctx);
^361da7386423 bhyverun.c                (Peter Grehan         2016-05-27  904)      exit(1);

I suppose code also doesn't work with enabled capsicum. So before going further for this review, probably VM_DESTROY ioctl should be implemented. I can implement it and open review.

This is already partially done in D31325 . How about we complete those changes and get them committed? D31325 itself is ok, I believe, but D31326 needs some more work to handle reboots. There, I believe it's sufficient to make use of the "persistent" flag.

usr.sbin/bhyve/snapshot.c
266

Do these sockets not remain open in the running bhyve process? The intent is to limit the privileges of any fds that remain open.

usr.sbin/bhyve/snapshot.c
266

For "checkpoint" all fds are closed or unmapped. If consider "restore" currently they are not closed after finished restore, but I believe it is a bug.

All fd-s should be unmapped and closed because all data are copied to vmm or kernel and are not required after successful restore.

And note, name "socket" also is wrong here - all fd-s are not socket.

dy partially done in D31325 . How about we complete those changes and get them committed? D31325 itself is ok, I believe, but D31326 needs some more work to handle reboots. There, I believe it's sufficient to make use of the "persistent" flag.

I created simple review (https://reviews.freebsd.org/D34954) as replacement of D31326 and if it is ok, I will fix reboots and probably halts. Or halts should leave vmm unchanged ?

usr.sbin/bhyve/bhyverun.c
1565

Here should be added exit/err on fail. I guess having checkpoint thread is critical or, if not, it should rely on new introduces "-t" parameter: if "-t" is specified, checkpoint is activated.

usr.sbin/bhyve/snapshot.c
1438

BTW, why do permission masks (0700) have execute bit here?

usr.sbin/bhyve/snapshot.c
1422

BTW, usage 0 is wrong here, cleanup path also should be corrected with (fd_checkpoint >= 0) condition.

1467

Potential meta_fd leak here. It should be closed on fdopen() failure.

usr.sbin/bhyve/snapshot.c
266

I see that restore path also has cleanup in main(), so all fd-s should be closed after "init" stage. And additionally I believe that intention of caph_rights_limit() is not limit, but allowing and by default all open file descriptor is strictly limited.

main():
#ifdef BHYVE_SNAPSHOT

if (restore_file != NULL)
    destroy_restore_state(&rstate);
lib/libvmmapi/vmmapi.c
1722

It also needs VM_RESTORE_TIME, w/o it vm_restore_time() will fail.

usr.sbin/bhyve/snapshot.c
350

The changes would make sense if the cap_enter is called before the restore is actually done and openat should be used in that case because would not work, but considering the 'restore' files are opened and closed before cap_enter the function should remain unchanged.

1389

Can you give more details about how exactly it would be better to approach the usage of vm_destroy? If I only remove function call then resources will not be freed and the guest will be stopped due to the exit call afterwards and a new guest can't be opened or restored until a vm_destroy call is requested through bhyvectl and if I don't use exit as well, wouldn't this remove the "suspend" functionality and allow only "checkpoint" functionality?

usr.sbin/bhyve/snapshot.c
266

By default a fd starts with all rights (see cap_rights_limit man page). It is still reasonable to limit rights to only those necessary, but not as important as for fds that remain open in a process involving untrusted data.

Name could be limit_vmmem_rights instead of _socket

I renamed the limiting functions for the file descriptor as @emaste suggested.

I looked into what @gusev.vitaliy_gmail.com proped regarding the removal of vm_destroy(or local_vm_destroy) and I need some more information regarding the better solution. If I delete the function call, the resources of the guest will not be freed and the guest will be stopped due to the exit call afterwards and a new guest can't be opened or restored until a vm_destroy call is requested through bhyvectl. If I remove the exit call as well wouldn't this just remove the "suspend" operation and transform it into "checkpoint"?