Page MenuHomeFreeBSD

bhyve nvme: Inform guests of namespace resize
ClosedPublic

Authored by chuck on Nov 12 2021, 12:43 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Dec 10, 2:32 PM
Unknown Object (File)
Thu, Dec 5, 6:08 PM
Unknown Object (File)
Tue, Dec 3, 1:08 PM
Unknown Object (File)
Sat, Nov 23, 6:17 PM
Unknown Object (File)
Fri, Nov 22, 6:55 PM
Unknown Object (File)
Thu, Nov 21, 6:52 PM
Unknown Object (File)
Thu, Nov 21, 9:25 AM
Unknown Object (File)
Thu, Nov 21, 5:57 AM

Details

Summary

Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

wanpengqian_gmail.com added inline comments.
usr.sbin/bhyve/pci_nvme.c
299

Can we have an array here? we can avoid calloc later. also avoid check NULL in pci_nvme_resized()

654

do we need to check the result value.

The nvmecontrol of FreeBSD guest can output the changed size. but gpart still reports the original size. how can we let kernel reload this change? nvme reset seems no a fix.

The nvmecontrol of FreeBSD guest can output the changed size. but gpart still reports the original size. how can we let kernel reload this change? nvme reset seems no a fix.

rewrite the partition table?

what disk diskinfo ndaX or nvdX say?

In D32953#744004, @imp wrote:

The nvmecontrol of FreeBSD guest can output the changed size. but gpart still reports the original size. how can we let kernel reload this change? nvme reset seems no a fix.

rewrite the partition table?

what disk diskinfo ndaX or nvdX say?

root@bsd16:~ # nvmecontrol devlist
 nvme0: bhyve-NVMe
    nvme0ns1 (120GB)
 nvme1: bhyve-NVMe
    nvme1ns1 (120GB)
 nvme2: bhyve-NVMe
    nvme2ns1 (100TB)
root@bsd16:~ # diskinfo nvd2
nvd2	512	21474836480	41943040	0	0

When guest boots, nvd2 is 20GB.

Also try debian linux.

here is result

root@deb15:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     QNVMe001             bhyve-NVMe                               1          85.90  GB /  85.90  GB    512   B +  0 B   1.0     
/dev/nvme1n1     QNVMe002             bhyve-NVMe                               1         109.95  TB / 109.95  TB    512   B +  0 B   1.0     
root@deb15:~# fdisk /dev/nvme1n1 

Welcome to fdisk (util-linux 2.36.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0xd42a43b0.

Command (m for help): p
Disk /dev/nvme1n1: 40 GiB, 42949672960 bytes, 83886080 sectors
Disk model: bhyve-NVMe                              
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xd42a43b0

fdisk cannot report new size ether.

Hmmm, I have patches, I think, that propigate the new size of a namespace as well as a dozen other issues with changing things around dynamically... I'll dust them off and put them up for review and link them here.

In D32953#744020, @imp wrote:

Hmmm, I have patches, I think, that propigate the new size of a namespace as well as a dozen other issues with changing things around dynamically... I'll dust them off and put them up for review and link them here.

I took a quick look yesterday, and my guess was nvd (and perhaps nda) needed to call disk_resize()

In D32953#744020, @imp wrote:

Hmmm, I have patches, I think, that propagate the new size of a namespace as well as a dozen other issues with changing things around dynamically... I'll dust them off and put them up for review and link them here.

I took a quick look yesterday, and my guess was nvd (and perhaps nda) needed to call disk_resize()

Yea, they need to be notified when the size changes, though, and only the nvme driver knows that.
There's a nvme_sim_ns_change callback that's currently registered for nda. This is the same as nvd_new_disk,
but I'm not sure that nvd_new_disk handles the change case well.
I'll have to thread this through. My current ns enhancement work is incomplete and doesn't handle this deatail.

https://reviews.freebsd.org/D32963 is the current state of things, but I don't see any calls to disk_resize() in it.
scsi and ata disks work,but nvme doesn't. I've not worked through all the details of that review, btw, since it
is a work in progress that had other things take priority over and I've not returned to it

Convert to using struct nvme_ns_list

Neat! I will defer to Chuck and Warner on reviewing the NVMe specific behavior, but I think it looks fine (and I'm happy the resize callback works well for this without needing refactoring).

usr.sbin/bhyve/pci_nvme.c
3005

Is the NSID always 1?

Yea, they need to be notified when the size changes, though, and only the nvme driver knows that.
There's a nvme_sim_ns_change callback that's currently registered for nda. This is the same as nvd_new_disk,
but I'm not sure that nvd_new_disk handles the change case well.
I'll have to thread this through. My current ns enhancement work is incomplete and doesn't handle this deatail.

https://reviews.freebsd.org/D32963 is the current state of things, but I don't see any calls to disk_resize() in it.
scsi and ata disks work,but nvme doesn't. I've not worked through all the details of that review, btw, since it
is a work in progress that had other things take priority over and I've not returned to it

after some dig, I found nvd driver will not response disk resize, I use nda to have a try. when host resize the disk, guest panic.

panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro
cpuid = 5
time = 1637052864
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00
vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50
panic() at panic+0x43/frame 0xfffffe00e21e7bb0
sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00
_sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0
pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0
nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70
nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0
nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0
nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e
ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100098 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1276e2e(%rip)
usr.sbin/bhyve/pci_nvme.c
3005

Yes, NSID will always be 0x1 as the emulation only supports a single namespace.

So if I take these patches, how do I change the namespace size in bhyve? I can fix nvd and nda

Apologies, I should have put something about how to use this.

  1. Create a file to back the NVMe device (e.g., truncate -s 1G /vms/nvme-aen/disk1.img)
  2. Start the VM
  3. Increase the size of the file backing the NVMe device (e.g., truncate -s +1G /vms/nvme-aen/disk1.img)

After increasing the file size, you should seem something like the following on the console:
nvme0: async event occurred (type 0x2, info 0x00, page 0x04)

In D32953#745698, @imp wrote:

So if I take these patches, how do I change the namespace size in bhyve? I can fix nvd and nda

I have made a quick patch for nvd, after nvme reconstruct the namespace, nvd will call disk_resize().
please take a look at D33028

panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro
cpuid = 5
time = 1637052864
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00
vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50
panic() at panic+0x43/frame 0xfffffe00e21e7bb0
sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00
_sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0
pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0
nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70
nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0
nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0
nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e
ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100098 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1276e2e(%rip)

nvme_ctrlr_async_event_log_page_cb() is called by nvme_qpair_process_completions()
which is a interrupt thread, this thread mark as sleeping prohibited. if we query device and update namesapce here, it will cause a kernel panic.
because query device namespace wll call nvme_complete_poll(),which has a pause_sbt()

panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro
cpuid = 5
time = 1637052864
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00
vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50
panic() at panic+0x43/frame 0xfffffe00e21e7bb0
sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00
_sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0
pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0
nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70
nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0
nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0
nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e
ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100098 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1276e2e(%rip)

nvme_ctrlr_async_event_log_page_cb() is called by nvme_qpair_process_completions()
which is a interrupt thread, this thread mark as sleeping prohibited. if we query device and update namesapce here, it will cause a kernel panic.
because query device namespace wll call nvme_complete_poll(),which has a pause_sbt()

Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.

Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.

Here is my initial thinking, create a thread which listening kevents, when namespace changed, nvme_ctrlr_async_event_log_page_cb() trigger a event, this thread will query nvme device and update the corresponding namespace data. after that, invoke consumer to call disk_resize();

Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.

Here is my initial thinking, create a thread which listening kevents, when namespace changed, nvme_ctrlr_async_event_log_page_cb() trigger a event, this thread will query nvme device and update the corresponding namespace data. after that, invoke consumer to call disk_resize();

I don't think we need the namespace info before interrupts are enabled, so I don't think we need a kthread for this. At most, we may need a workqueue to run job...

I don't think we need the namespace info before interrupts are enabled, so I don't think we need a kthread for this. At most, we may need a workqueue to run job...

Yes, you are right.

following your comments, I have figured out how to make it works.

  1. add an extra thread when controller starts.
  2. add a new async_event_queue to controller structure.

namespace changed async event trigger -> push into async_event_queue -> this extra thread query device namesapce and update namespace data -> notify consumer to disk_resize()

quick test shows this flow works. is it the right direction?

currently only namesapce follow this path.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 14 2021, 11:31 PM
This revision was automatically updated to reflect the committed changes.

I'm in the process of porting this change over to illumos, and it looks like we should also be setting the appropriate bit in the oaes field in nvme_controller_data.

At present, nvme identify does not indicate the capability:

root@bhyvetest:~# nvmeadm -v identify nvme0
nvme0: Identify Controller
  Controller Capabilities and Features
    Model:                                  bhyve-NVMe
    Serial:                                 NVME-4-0
...
    Optional Asynchronous Events Supported
      Namespace Attribute Notices:          unsupported
      Firmware Activation Notices:          unsupported
      Asynchronous Namespace Access Change Notices: unsupported
      Predictable Latency Event Aggregation: unsupported
      LBA Status Information Notices:       unsupported

Here's the patch I'm testing in illumos:

diff
+++ b/usr/src/cmd/bhyve/pci_nvme.c
@@ -514,6 +514,11 @@ pci_nvme_init_ctrldata(struct pci_nvme_softc *sc)
        cd->ver = 0x00010300;

        cd->oacs = 1 << NVME_CTRLR_DATA_OACS_FORMAT_SHIFT;
+#define        NVME_CTRLR_DATA_OAES_NSCHANGE_SHIFT     8
+       cd->oaes = 1 << NVME_CTRLR_DATA_OAES_NSCHANGE_SHIFT;
        cd->acl = 2;
        cd->aerl = 4;

I'm in the process of porting this change over to illumos, and it looks like we should also be setting the appropriate bit in the oaes field in nvme_controller_data.

At present, nvme identify does not indicate the capability:

Thanks for catching this. Your change will work just fine, but I'm tempted to add a helper wrapper like the below. Thoughts?

diff --git a/usr.sbin/bhyve/pci_nvme.c b/usr.sbin/bhyve/pci_nvme.c
index 8cd3a949f916..5caaa9de5c5a 100644
--- a/usr.sbin/bhyve/pci_nvme.c
+++ b/usr.sbin/bhyve/pci_nvme.c
@@ -280,6 +280,9 @@ typedef enum {
        PCI_NVME_AE_INFO_MAX,
 } pci_nvme_async_info;

+#define PCI_NVME_AE_SHIFT      8
+#define PCI_NVME_AE_MASK(event)        (1 << (event + PCI_NVME_AE_SHIFT))
+
 /* Asynchronous Event Notifications */
 struct pci_nvme_aen {
        pci_nvme_async_type atype;
@@ -536,6 +539,7 @@ pci_nvme_init_ctrldata(struct pci_nvme_softc *sc)

        cd->cntrltype = NVME_CNTRLTYPE_IO;
        cd->oacs = 1 << NVME_CTRLR_DATA_OACS_FORMAT_SHIFT;
+       cd->oaes = PCI_NVME_AE_MASK(PCI_NVME_AE_INFO_NS_ATTR_CHANGED);
        cd->acl = 2;
        cd->aerl = 4;

@@ -950,8 +954,7 @@ pci_nvme_aen_process(struct pci_nvme_softc *sc)
                                status = NVME_SC_INTERNAL_DEVICE_ERROR;
                                break;
                        }
-                       mask >>= 8;
-                       if (((1 << aen->event_data) & mask) == 0)
+                       if ((PCI_NVME_AE_MASK(aen->event_data) & mask) == 0)
                                continue;
                        switch (aen->event_data) {
                        case PCI_NVME_AE_INFO_NS_ATTR_CHANGED:

Thanks for catching this. Your change will work just fine, but I'm tempted to add a helper wrapper like the below. Thoughts?

I think the whole set of these (OAES) definitions should probably end up in sys/dev/nvme/nvme.h anyway, like the NVME_CTRLR_DATA_OACS_* ones that are already there. That would just make it analogous to the existing oacs assignment. Linking it to the pci_nvme_async_info in bhyve may make sense too though.

I think the whole set of these (OAES) definitions should probably end up in sys/dev/nvme/nvme.h anyway, like the NVME_CTRLR_DATA_OACS_* ones that are already there.

Review is up for nvme.h additions (D34300) and associated bhyve changes (D34331)