Page MenuHomeFreeBSD

NVMe emulation for bhyve (improved)

Authored by araujo on Jan 23 2018, 6:03 AM.



This is an updated version of the bhyve NVMe emulation by Leon Dang.
It is based on the original GSoC version by Shunsuke Mie but has been
heavily modified in performance, functionality and guest support.

Test Plan

Install/boot tested with FreeBSD, UEFI, Windows 8.1, Windows Server 2k12r2 and 2k16,
Ubuntu 17.10/16.10, and Centos 7.4

The diff to bhyve/uefi to support NVMe is at , with a pre-built UEFI binary at

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Thank you so much for your support.

chuck requested changes to this revision.Jan 25 2018, 11:26 PM
chuck added a subscriber: chuck.
chuck added inline comments.
588 ↗(On Diff #38436)

Invalid namespace or format is returned only if the passed in NSID == 0xffffffff (or 0xfffffffe for CNS=0x10). Section 5.15 (NVM Express 1.3) says,
"If a controller does not support a CNS value then it shall abort the command with a status of Invalid Field in Command."

795 ↗(On Diff #38436)

Since the implementation is *not* aborting the command (which is totally fine), bit 0 of DW0 should be set to indicate the command was not aborted. Bit 0 cleared means the command was aborted. See section 5.1.1 of the 1.3 spec.

This revision now requires changes to proceed.Jan 25 2018, 11:26 PM
rpokala added inline comments.
228 ↗(On Diff #38436)

Shouldn't the rest of the serial number be filled in with a uniquifier?

250 ↗(On Diff #38436)

It's probably worth noting that this is the actual FreeBSD (Foundation) OUI.

256 ↗(On Diff #38436)

The reserved bytes are... reserved. What do you mean by "Version" here? Why are you skipping the first two bytes?

272 ↗(On Diff #38436)

Why not just cd->wctemp = 0x0157? added inline comments.
256 ↗(On Diff #38436)

stable/11 doesn't have ver. Will need to merge in nvme.h or use raw offsets to struct nvme_controller_data to keep it compatible with 11 and 12.

272 ↗(On Diff #38436)

stable/11 missing wctemp

588 ↗(On Diff #38436)

Will verify for the CNSs. Although, CNS 0x11 "If the specified namespace is an invalid NSID then the controller shall fail the command with a status code of Invalid Namespace or Format"

grehan added inline comments.Jan 26 2018, 3:12 AM
256 ↗(On Diff #38436)

Don't worry about stable/11 - it's fine to use the defs that are in current, and the MFC can fix as needed (which could be to backport the needed defs from nvme.h)

grehan updated this revision to Diff 38485.Jan 26 2018, 4:06 PM

Update from Leon:

Added in serial number option/generation.
Comment on FreeBSD OUI
Used the ver and wctemp struct fields
Fix error in completion status
Set cdw0 bit 0
Minor white-space issues

emaste added a subscriber: emaste.Jan 29 2018, 2:46 AM
emaste added inline comments.Jan 29 2018, 2:49 AM
2 ↗(On Diff #38343)

Would you kindly add a SPDX-License-Identifier: BSD-2-Clause-FreeBSD license tag in the committed version?

kmacy added a subscriber: kmacy.Jan 29 2018, 3:13 PM
bcr accepted this revision.Mar 6 2018, 12:00 AM
bcr added a subscriber: bcr.

OK from manpages, but don't forget to bump the .Dd once you commit it since this is a content change.

rgrimes resigned from this revision.Mar 6 2018, 12:48 AM
rgrimes added inline comments.
63 ↗(On Diff #38343)

style(9) this goes at the top of the list, kernel include files come first.

chuck accepted this revision.Mar 9 2018, 12:31 AM
chuck added inline comments.
588 ↗(On Diff #38436)

Sure, but with this implementation, NSID=0x1 is valid and the code doesn't check. It isn't the end of the world as it doesn't support namespace management, but it probably belongs with the other "unsupported identify command" cases.

This revision is now accepted and ready to land.Mar 9 2018, 12:31 AM
araujo commandeered this revision.

I'm taking this revision by Peter's request. I will update it soon with the latest patch.

araujo updated this revision to Diff 44520.Jun 27 2018, 1:31 PM
  • Update this review with the latest patch that @grehan sent me before BSDCan 2018.
  • Fixed some style(9). (myself)
  • Improved the man page. (myself)

Here is the latest round of tests:

Tested with guest OS: FreeBSD Head, Linux Fedora fc27, Ubuntu 18.04, OpenSuse 15.0, Windows Server 2016 Datacenter.
Tested with all accepted device paths: Real nvme, zdev and also with ram.
Tested on: AMD Ryzen Threadripper 1950X 16-Core Processor and Intel(R) Xeon(R) CPU E5-2609 v2 @ 2.50GHz.

I would like to thanks everybody that worked on this implementation.

This revision now requires review to proceed.Jun 27 2018, 1:31 PM
araujo marked 2 inline comments as done.Jun 27 2018, 1:38 PM
bcr added a comment.Jun 27 2018, 1:53 PM

Two textual suggestions for the man page.

439 ↗(On Diff #44520)

Nvme -> NVMe

453 ↗(On Diff #44520)

io -> I/O

bcr added a comment.Jun 27 2018, 1:53 PM
This comment was removed by bcr.
araujo updated this revision to Diff 44521.Jun 27 2018, 1:56 PM

Address @bcr review. Thank you!

araujo updated this revision to Diff 44522.Jun 27 2018, 1:58 PM

@bcr Now it is fixed :).

bcr accepted this revision.Jun 27 2018, 1:58 PM

Cool, thanks for the quick reply.
Approved from manpages. Don't forget to bump the .Dd to the date of the commit.

This revision is now accepted and ready to land.Jun 27 2018, 1:58 PM
araujo marked 2 inline comments as done.Jun 27 2018, 1:59 PM
imp added a comment.Jun 27 2018, 3:35 PM

Generally quite good. A few questions about some code that doesn't quite look right.

447 ↗(On Diff #44522)

Is this MB (power of 10) or MiB (power of 2)?

6 ↗(On Diff #44522)

Not a big deal, but you might want to poll Shunsuke and Leon to see if we can drop this.

329 ↗(On Diff #44522)

what does this mean?

381 ↗(On Diff #44522)

0 is special, a comment here would be useful.

382 ↗(On Diff #44522)

What makes zero special? Why doesn't it need to be reset?

387 ↗(On Diff #44522)

typo for compl_queues?

394–395 ↗(On Diff #44522)

Shouldn't these be in a similar loop ... or

403 ↗(On Diff #44522)

... or maybe you should allocate both of these in the else for the prior if and either allocate all, or reinit all, but not this weird combo that happens to work.

404 ↗(On Diff #44522)

So we have max_queues and num_cqueues. There seems to be an inconsistency here. If it is really max_queues, then you can combine things. if it isn't and the numbers can be different, you'll need to have separate loops. Also, can num_cqueues change? If so, then this code doesn't look right.

419 ↗(On Diff #44522)

would an assert that sc->submit_queues != NULL and sc->compl_queues != NULL be useful here?

451 ↗(On Diff #44522)

Do you need to undo vm_map_gpa effects here?

516 ↗(On Diff #44522)

Same here: undo effects of vm_map_gpa?

587 ↗(On Diff #44522)

Since you don't support the firmware update process, I'm not sure you should map this page.
But if you do, you reported a firmware version before, so you must have at least one slot and that slot must report the firmware.

659–676 ↗(On Diff #44522)

Should you save the values to GET_FEATURE returns the right value?

729 ↗(On Diff #44522)


araujo added inline comments.Jun 29 2018, 3:27 AM
447 ↗(On Diff #44522)

It is MiB (power of 2). I will fix the man page.

6 ↗(On Diff #44522)

I'm 100% sure they won't mind remove that. I will ping them.

329 ↗(On Diff #44522)

Warning Composite Temperature Time:
Contains the amount of time in minutes that the controller is operational and the Composite Temperature is greater than or equal to the Warning Composite Temperature Threshold (WCTEMP) field and less than the Critical Composite Temperature Threshold (CCTEMP) field in the Identify Controller data structure. Ref:

The value 0x0157 probably comes from HW vendors, Ref:

381 ↗(On Diff #44522)

0 is not special, actually it was used in this way to avoid set the queues twice. I will fix this loop, I can declare the "i" outside of this loop and remove that if statement.

382 ↗(On Diff #44522)

Explained above.

387 ↗(On Diff #44522)

Yes, it was a typo.. I'm fixing it.

403 ↗(On Diff #44522)

As soon as I remove that if statement all of these sets will be inside the same loop statement.

@imp I will answer your other comments next week, I'm moving to a new house and I need to shutdown my office now, that is why I submitted the few things I was already answering..

araujo updated this revision to Diff 44790.Jul 3 2018, 4:40 AM
  • Fix manpage s/MB/MiB/g.
  • Remove 'All rights reserved.'.
  • Add a comment to explain why we need the variable 'i'.
This revision now requires review to proceed.Jul 3 2018, 4:40 AM
araujo marked 4 inline comments as done.Jul 3 2018, 4:45 AM

Update with Leon's comment, Leon is one of the authors of this patch.

329 ↗(On Diff #44522)

Leon's comment: NVMe spec specifically states for any implementation >= v1.2 the value must not be 0, and "It is recommended that implementations report a value of 0157h in this field." So pretty much all vendors do the same.

381 ↗(On Diff #44522)

Leon's comment: The Admin Submission Queue is at index 0. It must not be changed at reset otherwise the emulation will be out of sync with the guest, only at complete controller reinitialization.

403 ↗(On Diff #44522)

Leon's comment: The allocation is done once because the guest goes through different boot phases and each step causes a resizing of the queues; I deliberately did this so that we don't have to go through those reallocation cycles.

587 ↗(On Diff #44522)

Leon's comment: Whilst we don't support firmware updates, the guest does query for the firmware slot information; even if it does not intend to use it.

659–676 ↗(On Diff #44522)

Leon's comment: They might set, but they never seem to GET.

729 ↗(On Diff #44522)

Leon's comment: The constant is not defined in the NVMe headers, hence put here. Either update the nvme headers if it's not there or define in isolation.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 5 2018, 3:34 AM
This revision was automatically updated to reflect the committed changes.