This is an updated version of the bhyve NVMe emulation by Leon Dang.
It is based on the original GSoC version by Shunsuke Mie but has been
heavily modified in performance, functionality and guest support.
Details
Install/boot tested with FreeBSD, UEFI, Windows 8.1, Windows Server 2k12r2 and 2k16,
Ubuntu 17.10/16.10, and Centos 7.4
The diff to bhyve/uefi to support NVMe is at https://people.freebsd.org/~grehan/bhyve_uefi_nvme.diff , with a pre-built UEFI binary at https://people.freebsd.org/~grehan/BHYVE_NVMe.fd.xz
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
588 ↗ | (On Diff #38436) | Invalid namespace or format is returned only if the passed in NSID == 0xffffffff (or 0xfffffffe for CNS=0x10). Section 5.15 (NVM Express 1.3) says, |
795 ↗ | (On Diff #38436) | Since the implementation is *not* aborting the command (which is totally fine), bit 0 of DW0 should be set to indicate the command was not aborted. Bit 0 cleared means the command was aborted. See section 5.1.1 of the 1.3 spec. |
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
228 ↗ | (On Diff #38436) | Shouldn't the rest of the serial number be filled in with a uniquifier? |
250 ↗ | (On Diff #38436) | It's probably worth noting that this is the actual FreeBSD (Foundation) OUI. |
256 ↗ | (On Diff #38436) | The reserved bytes are... reserved. What do you mean by "Version" here? Why are you skipping the first two bytes? |
272 ↗ | (On Diff #38436) | Why not just cd->wctemp = 0x0157? |
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
256 ↗ | (On Diff #38436) | stable/11 doesn't have ver. Will need to merge in nvme.h or use raw offsets to struct nvme_controller_data to keep it compatible with 11 and 12. |
272 ↗ | (On Diff #38436) | stable/11 missing wctemp |
588 ↗ | (On Diff #38436) | Will verify for the CNSs. Although, CNS 0x11 "If the specified namespace is an invalid NSID then the controller shall fail the command with a status code of Invalid Namespace or Format" |
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
256 ↗ | (On Diff #38436) | Don't worry about stable/11 - it's fine to use the defs that are in current, and the MFC can fix as needed (which could be to backport the needed defs from nvme.h) |
Update from Leon:
Added in serial number option/generation.
Comment on FreeBSD OUI
Used the ver and wctemp struct fields
Fix error in completion status
Set cdw0 bit 0
Minor white-space issues
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
2 ↗ | (On Diff #38343) | Would you kindly add a SPDX-License-Identifier: BSD-2-Clause-FreeBSD license tag in the committed version? |
OK from manpages, but don't forget to bump the .Dd once you commit it since this is a content change.
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
63 ↗ | (On Diff #38343) | style(9) this goes at the top of the list, kernel include files come first. |
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
588 ↗ | (On Diff #38436) | Sure, but with this implementation, NSID=0x1 is valid and the code doesn't check. It isn't the end of the world as it doesn't support namespace management, but it probably belongs with the other "unsupported identify command" cases. |
I'm taking this revision by Peter's request. I will update it soon with the latest patch.
- Update this review with the latest patch that @grehan sent me before BSDCan 2018.
- Fixed some style(9). (myself)
- Improved the man page. (myself)
Here is the latest round of tests: https://people.freebsd.org/~araujo/bhyve_nvme/nvme.txt
Tested with guest OS: FreeBSD Head, Linux Fedora fc27, Ubuntu 18.04, OpenSuse 15.0, Windows Server 2016 Datacenter.
Tested with all accepted device paths: Real nvme, zdev and also with ram.
Tested on: AMD Ryzen Threadripper 1950X 16-Core Processor and Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz.
I would like to thanks everybody that worked on this implementation.
Cool, thanks for the quick reply.
Approved from manpages. Don't forget to bump the .Dd to the date of the commit.
Generally quite good. A few questions about some code that doesn't quite look right.
usr.sbin/bhyve/bhyve.8 | ||
---|---|---|
447 ↗ | (On Diff #44522) | Is this MB (power of 10) or MiB (power of 2)? |
usr.sbin/bhyve/pci_nvme.c | ||
6 ↗ | (On Diff #44522) | Not a big deal, but you might want to poll Shunsuke and Leon to see if we can drop this. |
329 ↗ | (On Diff #44522) | what does this mean? |
381 ↗ | (On Diff #44522) | 0 is special, a comment here would be useful. |
382 ↗ | (On Diff #44522) | What makes zero special? Why doesn't it need to be reset? |
387 ↗ | (On Diff #44522) | typo for compl_queues? |
394–395 ↗ | (On Diff #44522) | Shouldn't these be in a similar loop ... or |
403 ↗ | (On Diff #44522) | ... or maybe you should allocate both of these in the else for the prior if and either allocate all, or reinit all, but not this weird combo that happens to work. |
404 ↗ | (On Diff #44522) | So we have max_queues and num_cqueues. There seems to be an inconsistency here. If it is really max_queues, then you can combine things. if it isn't and the numbers can be different, you'll need to have separate loops. Also, can num_cqueues change? If so, then this code doesn't look right. |
419 ↗ | (On Diff #44522) | would an assert that sc->submit_queues != NULL and sc->compl_queues != NULL be useful here? |
451 ↗ | (On Diff #44522) | Do you need to undo vm_map_gpa effects here? |
516 ↗ | (On Diff #44522) | Same here: undo effects of vm_map_gpa? |
587 ↗ | (On Diff #44522) | Since you don't support the firmware update process, I'm not sure you should map this page. |
659–676 ↗ | (On Diff #44522) | Should you save the values to GET_FEATURE returns the right value? |
729 ↗ | (On Diff #44522) | magic |
usr.sbin/bhyve/bhyve.8 | ||
---|---|---|
447 ↗ | (On Diff #44522) | It is MiB (power of 2). I will fix the man page. |
usr.sbin/bhyve/pci_nvme.c | ||
6 ↗ | (On Diff #44522) | I'm 100% sure they won't mind remove that. I will ping them. |
329 ↗ | (On Diff #44522) | Warning Composite Temperature Time: The value 0x0157 probably comes from HW vendors, Ref: https://www.jetone.com.tw/uploadfiles/327/datasheet/Phison/new-8.7/phison_m2_2280_s3.pdf https://gzhls.at/blob/ldb/3/e/c/c/9e383766a5338cda0c014374aa49001d82cf.pdf |
381 ↗ | (On Diff #44522) | 0 is not special, actually it was used in this way to avoid set the queues twice. I will fix this loop, I can declare the "i" outside of this loop and remove that if statement. |
382 ↗ | (On Diff #44522) | Explained above. |
387 ↗ | (On Diff #44522) | Yes, it was a typo.. I'm fixing it. |
403 ↗ | (On Diff #44522) | As soon as I remove that if statement all of these sets will be inside the same loop statement. |
@imp I will answer your other comments next week, I'm moving to a new house and I need to shutdown my office now, that is why I submitted the few things I was already answering..
- Fix manpage s/MB/MiB/g.
- Remove 'All rights reserved.'.
- Add a comment to explain why we need the variable 'i'.
Update with Leon's comment, Leon is one of the authors of this patch.
usr.sbin/bhyve/pci_nvme.c | ||
---|---|---|
329 ↗ | (On Diff #44522) | Leon's comment: NVMe spec specifically states for any implementation >= v1.2 the value must not be 0, and "It is recommended that implementations report a value of 0157h in this field." So pretty much all vendors do the same. |
381 ↗ | (On Diff #44522) | Leon's comment: The Admin Submission Queue is at index 0. It must not be changed at reset otherwise the emulation will be out of sync with the guest, only at complete controller reinitialization. |
403 ↗ | (On Diff #44522) | Leon's comment: The allocation is done once because the guest goes through different boot phases and each step causes a resizing of the queues; I deliberately did this so that we don't have to go through those reallocation cycles. |
587 ↗ | (On Diff #44522) | Leon's comment: Whilst we don't support firmware updates, the guest does query for the firmware slot information; even if it does not intend to use it. |
659–676 ↗ | (On Diff #44522) | Leon's comment: They might set, but they never seem to GET. |
729 ↗ | (On Diff #44522) | Leon's comment: The constant is not defined in the NVMe headers, hence put here. Either update the nvme headers if it's not there or define in isolation. |