Page MenuHomeFreeBSD

bhyve nvme: Inform guests of namespace resize
ClosedPublic

Authored by chuck on Nov 12 2021, 12:43 AM.

Details

Summary

Register a "block resize" callback to be notified of changes to the
backing storage for the Namespace. Use this to generate an Asynchronous
Event Notification, Namespace Attributes Changed when the guest OS
provides an Asynchronous Event Request.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

wanpengqian_gmail.com added inline comments.
usr.sbin/bhyve/pci_nvme.c
298

Can we have an array here? we can avoid calloc later. also avoid check NULL in pci_nvme_resized()

653

do we need to check the result value.

The nvmecontrol of FreeBSD guest can output the changed size. but gpart still reports the original size. how can we let kernel reload this change? nvme reset seems no a fix.

The nvmecontrol of FreeBSD guest can output the changed size. but gpart still reports the original size. how can we let kernel reload this change? nvme reset seems no a fix.

rewrite the partition table?

what disk diskinfo ndaX or nvdX say?

In D32953#744004, @imp wrote:

The nvmecontrol of FreeBSD guest can output the changed size. but gpart still reports the original size. how can we let kernel reload this change? nvme reset seems no a fix.

rewrite the partition table?

what disk diskinfo ndaX or nvdX say?

root@bsd16:~ # nvmecontrol devlist
 nvme0: bhyve-NVMe
    nvme0ns1 (120GB)
 nvme1: bhyve-NVMe
    nvme1ns1 (120GB)
 nvme2: bhyve-NVMe
    nvme2ns1 (100TB)
root@bsd16:~ # diskinfo nvd2
nvd2	512	21474836480	41943040	0	0

When guest boots, nvd2 is 20GB.

Also try debian linux.

here is result

root@deb15:~# nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     QNVMe001             bhyve-NVMe                               1          85.90  GB /  85.90  GB    512   B +  0 B   1.0     
/dev/nvme1n1     QNVMe002             bhyve-NVMe                               1         109.95  TB / 109.95  TB    512   B +  0 B   1.0     
root@deb15:~# fdisk /dev/nvme1n1 

Welcome to fdisk (util-linux 2.36.1).
Changes will remain in memory only, until you decide to write them.
Be careful before using the write command.

Device does not contain a recognized partition table.
Created a new DOS disklabel with disk identifier 0xd42a43b0.

Command (m for help): p
Disk /dev/nvme1n1: 40 GiB, 42949672960 bytes, 83886080 sectors
Disk model: bhyve-NVMe                              
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xd42a43b0

fdisk cannot report new size ether.

Hmmm, I have patches, I think, that propigate the new size of a namespace as well as a dozen other issues with changing things around dynamically... I'll dust them off and put them up for review and link them here.

In D32953#744020, @imp wrote:

Hmmm, I have patches, I think, that propigate the new size of a namespace as well as a dozen other issues with changing things around dynamically... I'll dust them off and put them up for review and link them here.

I took a quick look yesterday, and my guess was nvd (and perhaps nda) needed to call disk_resize()

In D32953#744020, @imp wrote:

Hmmm, I have patches, I think, that propagate the new size of a namespace as well as a dozen other issues with changing things around dynamically... I'll dust them off and put them up for review and link them here.

I took a quick look yesterday, and my guess was nvd (and perhaps nda) needed to call disk_resize()

Yea, they need to be notified when the size changes, though, and only the nvme driver knows that.
There's a nvme_sim_ns_change callback that's currently registered for nda. This is the same as nvd_new_disk,
but I'm not sure that nvd_new_disk handles the change case well.
I'll have to thread this through. My current ns enhancement work is incomplete and doesn't handle this deatail.

https://reviews.freebsd.org/D32963 is the current state of things, but I don't see any calls to disk_resize() in it.
scsi and ata disks work,but nvme doesn't. I've not worked through all the details of that review, btw, since it
is a work in progress that had other things take priority over and I've not returned to it

Convert to using struct nvme_ns_list

Neat! I will defer to Chuck and Warner on reviewing the NVMe specific behavior, but I think it looks fine (and I'm happy the resize callback works well for this without needing refactoring).

usr.sbin/bhyve/pci_nvme.c
3004

Is the NSID always 1?

Yea, they need to be notified when the size changes, though, and only the nvme driver knows that.
There's a nvme_sim_ns_change callback that's currently registered for nda. This is the same as nvd_new_disk,
but I'm not sure that nvd_new_disk handles the change case well.
I'll have to thread this through. My current ns enhancement work is incomplete and doesn't handle this deatail.

https://reviews.freebsd.org/D32963 is the current state of things, but I don't see any calls to disk_resize() in it.
scsi and ata disks work,but nvme doesn't. I've not worked through all the details of that review, btw, since it
is a work in progress that had other things take priority over and I've not returned to it

after some dig, I found nvd driver will not response disk resize, I use nda to have a try. when host resize the disk, guest panic.

panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro
cpuid = 5
time = 1637052864
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00
vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50
panic() at panic+0x43/frame 0xfffffe00e21e7bb0
sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00
_sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0
pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0
nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70
nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0
nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0
nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e
ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100098 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1276e2e(%rip)
usr.sbin/bhyve/pci_nvme.c
3004

Yes, NSID will always be 0x1 as the emulation only supports a single namespace.

So if I take these patches, how do I change the namespace size in bhyve? I can fix nvd and nda

Apologies, I should have put something about how to use this.

  1. Create a file to back the NVMe device (e.g., truncate -s 1G /vms/nvme-aen/disk1.img)
  2. Start the VM
  3. Increase the size of the file backing the NVMe device (e.g., truncate -s +1G /vms/nvme-aen/disk1.img)

After increasing the file size, you should seem something like the following on the console:
nvme0: async event occurred (type 0x2, info 0x00, page 0x04)

In D32953#745698, @imp wrote:

So if I take these patches, how do I change the namespace size in bhyve? I can fix nvd and nda

I have made a quick patch for nvd, after nvme reconstruct the namespace, nvd will call disk_resize().
please take a look at D33028

panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro
cpuid = 5
time = 1637052864
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00
vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50
panic() at panic+0x43/frame 0xfffffe00e21e7bb0
sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00
_sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0
pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0
nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70
nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0
nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0
nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e
ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100098 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1276e2e(%rip)

nvme_ctrlr_async_event_log_page_cb() is called by nvme_qpair_process_completions()
which is a interrupt thread, this thread mark as sleeping prohibited. if we query device and update namesapce here, it will cause a kernel panic.
because query device namespace wll call nvme_complete_poll(),which has a pause_sbt()

panic: sleepq_add: td 0xfffffe00e228b020 to sleep on wchan 0xffffffff81684b85 with sleeping pro
cpuid = 5
time = 1637052864
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe00e21e7b00
vpanic() at vpanic+0x181/frame 0xfffffe00e21e7b50
panic() at panic+0x43/frame 0xfffffe00e21e7bb0
sleepq_add() at sleepq_add+0x359/frame 0xfffffe00e21e7c00
_sleep() at _sleep+0x1f5/frame 0xfffffe00e21e7ca0
pause_sbt() at pause_sbt+0xfe/frame 0xfffffe00e21e7cd0
nvme_ns_construct() at nvme_ns_construct+0xd3/frame 0xfffffe00e21e7d70
nvme_notify_ns() at nvme_notify_ns+0x50/frame 0xfffffe00e21e7da0
nvme_ctrlr_async_event_log_page_cb() at nvme_ctrlr_async_event_log_page_cb+0xf6/frame 0xfffffe0
nvme_qpair_complete_tracker() at nvme_qpair_complete_tracker+0x20f/frame 0xfffffe00e21e7e10
nvme_qpair_process_completions() at nvme_qpair_process_completions+0x19f/frame 0xfffffe00e21e7e
ithread_loop() at ithread_loop+0x279/frame 0xfffffe00e21e7ef0
fork_exit() at fork_exit+0x80/frame 0xfffffe00e21e7f30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe00e21e7f30
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---
KDB: enter: panic
[ thread pid 12 tid 100098 ]
Stopped at      kdb_enter+0x37: movq    $0,0x1276e2e(%rip)

nvme_ctrlr_async_event_log_page_cb() is called by nvme_qpair_process_completions()
which is a interrupt thread, this thread mark as sleeping prohibited. if we query device and update namesapce here, it will cause a kernel panic.
because query device namespace wll call nvme_complete_poll(),which has a pause_sbt()

Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.

Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.

Here is my initial thinking, create a thread which listening kevents, when namespace changed, nvme_ctrlr_async_event_log_page_cb() trigger a event, this thread will query nvme device and update the corresponding namespace data. after that, invoke consumer to call disk_resize();

Yes. Normally we only query namespaces while we're starting up (hence the poll complete). We
should let this be event driven instead.

Here is my initial thinking, create a thread which listening kevents, when namespace changed, nvme_ctrlr_async_event_log_page_cb() trigger a event, this thread will query nvme device and update the corresponding namespace data. after that, invoke consumer to call disk_resize();

I don't think we need the namespace info before interrupts are enabled, so I don't think we need a kthread for this. At most, we may need a workqueue to run job...

I don't think we need the namespace info before interrupts are enabled, so I don't think we need a kthread for this. At most, we may need a workqueue to run job...

Yes, you are right.

following your comments, I have figured out how to make it works.

  1. add an extra thread when controller starts.
  2. add a new async_event_queue to controller structure.

namespace changed async event trigger -> push into async_event_queue -> this extra thread query device namesapce and update namespace data -> notify consumer to disk_resize()

quick test shows this flow works. is it the right direction?

currently only namesapce follow this path.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 14 2021, 11:31 PM
This revision was automatically updated to reflect the committed changes.