Page MenuHomeFreeBSD

NVMe emulation for bhyve (improved)
ClosedPublic

Authored by araujo on Jan 23 2018, 6:03 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Nov 20, 7:03 AM
Unknown Object (File)
Tue, Nov 19, 8:41 AM
Unknown Object (File)
Tue, Nov 19, 4:23 AM
Unknown Object (File)
Sun, Nov 17, 12:19 PM
Unknown Object (File)
Wed, Nov 13, 11:31 AM
Unknown Object (File)
Tue, Nov 12, 11:38 PM
Unknown Object (File)
Mon, Nov 11, 10:42 PM
Unknown Object (File)
Mon, Nov 11, 2:18 PM

Details

Summary

This is an updated version of the bhyve NVMe emulation by Leon Dang.
It is based on the original GSoC version by Shunsuke Mie but has been
heavily modified in performance, functionality and guest support.

Test Plan

Install/boot tested with FreeBSD, UEFI, Windows 8.1, Windows Server 2k12r2 and 2k16,
Ubuntu 17.10/16.10, and Centos 7.4

The diff to bhyve/uefi to support NVMe is at https://people.freebsd.org/~grehan/bhyve_uefi_nvme.diff , with a pre-built UEFI binary at https://people.freebsd.org/~grehan/BHYVE_NVMe.fd.xz

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 17661
Build 17466: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
chuck requested changes to this revision.Jan 25 2018, 11:26 PM
chuck added a subscriber: chuck.
chuck added inline comments.
usr.sbin/bhyve/pci_nvme.c
589

Invalid namespace or format is returned only if the passed in NSID == 0xffffffff (or 0xfffffffe for CNS=0x10). Section 5.15 (NVM Express 1.3) says,
"If a controller does not support a CNS value then it shall abort the command with a status of Invalid Field in Command."

796

Since the implementation is *not* aborting the command (which is totally fine), bit 0 of DW0 should be set to indicate the command was not aborted. Bit 0 cleared means the command was aborted. See section 5.1.1 of the 1.3 spec.

This revision now requires changes to proceed.Jan 25 2018, 11:26 PM
rpokala added inline comments.
usr.sbin/bhyve/pci_nvme.c
229

Shouldn't the rest of the serial number be filled in with a uniquifier?

251

It's probably worth noting that this is the actual FreeBSD (Foundation) OUI.

257

The reserved bytes are... reserved. What do you mean by "Version" here? Why are you skipping the first two bytes?

273

Why not just cd->wctemp = 0x0157?

leon_digitalmsx.com added inline comments.
usr.sbin/bhyve/pci_nvme.c
257

stable/11 doesn't have ver. Will need to merge in nvme.h or use raw offsets to struct nvme_controller_data to keep it compatible with 11 and 12.

273

stable/11 missing wctemp

589

Will verify for the CNSs. Although, CNS 0x11 "If the specified namespace is an invalid NSID then the controller shall fail the command with a status code of Invalid Namespace or Format"

usr.sbin/bhyve/pci_nvme.c
257

Don't worry about stable/11 - it's fine to use the defs that are in current, and the MFC can fix as needed (which could be to backport the needed defs from nvme.h)

Update from Leon:

Added in serial number option/generation.
Comment on FreeBSD OUI
Used the ver and wctemp struct fields
Fix error in completion status
Set cdw0 bit 0
Minor white-space issues

usr.sbin/bhyve/pci_nvme.c
3

Would you kindly add a SPDX-License-Identifier: BSD-2-Clause-FreeBSD license tag in the committed version?

bcr added a subscriber: bcr.

OK from manpages, but don't forget to bump the .Dd once you commit it since this is a content change.

rgrimes added inline comments.
usr.sbin/bhyve/pci_nvme.c
64

style(9) this goes at the top of the list, kernel include files come first.

chuck added inline comments.
usr.sbin/bhyve/pci_nvme.c
589

Sure, but with this implementation, NSID=0x1 is valid and the code doesn't check. It isn't the end of the world as it doesn't support namespace management, but it probably belongs with the other "unsupported identify command" cases.

This revision is now accepted and ready to land.Mar 9 2018, 12:31 AM
araujo added a reviewer: grehan.

I'm taking this revision by Peter's request. I will update it soon with the latest patch.

  • Update this review with the latest patch that @grehan sent me before BSDCan 2018.
  • Fixed some style(9). (myself)
  • Improved the man page. (myself)

Here is the latest round of tests: https://people.freebsd.org/~araujo/bhyve_nvme/nvme.txt

Tested with guest OS: FreeBSD Head, Linux Fedora fc27, Ubuntu 18.04, OpenSuse 15.0, Windows Server 2016 Datacenter.
Tested with all accepted device paths: Real nvme, zdev and also with ram.
Tested on: AMD Ryzen Threadripper 1950X 16-Core Processor and Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz.

I would like to thanks everybody that worked on this implementation.

This revision now requires review to proceed.Jun 27 2018, 1:31 PM

Two textual suggestions for the man page.

usr.sbin/bhyve/bhyve.8
439

Nvme -> NVMe

453

io -> I/O

This comment was removed by bcr.

Cool, thanks for the quick reply.
Approved from manpages. Don't forget to bump the .Dd to the date of the commit.

This revision is now accepted and ready to land.Jun 27 2018, 1:58 PM

Generally quite good. A few questions about some code that doesn't quite look right.

usr.sbin/bhyve/bhyve.8
447

Is this MB (power of 10) or MiB (power of 2)?

usr.sbin/bhyve/pci_nvme.c
7

Not a big deal, but you might want to poll Shunsuke and Leon to see if we can drop this.

330

what does this mean?

382

0 is special, a comment here would be useful.

383

What makes zero special? Why doesn't it need to be reset?

388

typo for compl_queues?

395–396

Shouldn't these be in a similar loop ... or

404

... or maybe you should allocate both of these in the else for the prior if and either allocate all, or reinit all, but not this weird combo that happens to work.

405

So we have max_queues and num_cqueues. There seems to be an inconsistency here. If it is really max_queues, then you can combine things. if it isn't and the numbers can be different, you'll need to have separate loops. Also, can num_cqueues change? If so, then this code doesn't look right.

420

would an assert that sc->submit_queues != NULL and sc->compl_queues != NULL be useful here?

452

Do you need to undo vm_map_gpa effects here?

517

Same here: undo effects of vm_map_gpa?

588

Since you don't support the firmware update process, I'm not sure you should map this page.
But if you do, you reported a firmware version before, so you must have at least one slot and that slot must report the firmware.

660–677

Should you save the values to GET_FEATURE returns the right value?

730

magic

usr.sbin/bhyve/bhyve.8
447

It is MiB (power of 2). I will fix the man page.

usr.sbin/bhyve/pci_nvme.c
7

I'm 100% sure they won't mind remove that. I will ping them.

330

Warning Composite Temperature Time:
Contains the amount of time in minutes that the controller is operational and the Composite Temperature is greater than or equal to the Warning Composite Temperature Threshold (WCTEMP) field and less than the Critical Composite Temperature Threshold (CCTEMP) field in the Identify Controller data structure. Ref: http://www.nvmexpress.org/wp-content/uploads/NVM_Express_1_2b_Gold_20160603.pdf

The value 0x0157 probably comes from HW vendors, Ref:
https://www.glynshop.com/erp/owweb/Daten/Datenblaetter/DRAM_FLASH_MEDIEN/17_M.2/02%20Xmore/PCIe/MLC/Xmore%20quality%20M22280xxxGXQC8E017Z%20v1.7.pdf

https://www.jetone.com.tw/uploadfiles/327/datasheet/Phison/new-8.7/phison_m2_2280_s3.pdf

https://gzhls.at/blob/ldb/3/e/c/c/9e383766a5338cda0c014374aa49001d82cf.pdf

382

0 is not special, actually it was used in this way to avoid set the queues twice. I will fix this loop, I can declare the "i" outside of this loop and remove that if statement.

383

Explained above.

388

Yes, it was a typo.. I'm fixing it.

404

As soon as I remove that if statement all of these sets will be inside the same loop statement.

@imp I will answer your other comments next week, I'm moving to a new house and I need to shutdown my office now, that is why I submitted the few things I was already answering..

  • Fix manpage s/MB/MiB/g.
  • Remove 'All rights reserved.'.
  • Add a comment to explain why we need the variable 'i'.
This revision now requires review to proceed.Jul 3 2018, 4:40 AM

Update with Leon's comment, Leon is one of the authors of this patch.

usr.sbin/bhyve/pci_nvme.c
330

Leon's comment: NVMe spec specifically states for any implementation >= v1.2 the value must not be 0, and "It is recommended that implementations report a value of 0157h in this field." So pretty much all vendors do the same.

382

Leon's comment: The Admin Submission Queue is at index 0. It must not be changed at reset otherwise the emulation will be out of sync with the guest, only at complete controller reinitialization.

404

Leon's comment: The allocation is done once because the guest goes through different boot phases and each step causes a resizing of the queues; I deliberately did this so that we don't have to go through those reallocation cycles.

588

Leon's comment: Whilst we don't support firmware updates, the guest does query for the firmware slot information; even if it does not intend to use it.

660–677

Leon's comment: They might set, but they never seem to GET.

730

Leon's comment: The constant is not defined in the NVMe headers, hence put here. Either update the nvme headers if it's not there or define in isolation.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 5 2018, 3:34 AM
This revision was automatically updated to reflect the committed changes.