
nvmf: The in-kernel NVMe over Fabrics host

Authored by jhb on Apr 9 2024, 11:03 PM.



This is the client (initiator in SCSI terms) for NVMe over Fabrics.
Userland is responsible for creating a set of queue pairs and then
handing them off via an ioctl to this driver, e.g. via the 'connect'
command from nvmecontrol(8). An nvmeX new-bus device is created
at the top-level to represent the remote controller similar to PCI
nvmeX devices for PCI-express controllers.

As with nvme(4), namespace devices named /dev/nvmeXnsY are created and
pass through commands can be submitted to either the namespace devices
or the controller device. For example, 'nvmecontrol identify nvmeX'
works for a remote Fabrics controller the same as for a PCI-express
controller.

nvmf exports remote namespaces via nda(4) devices using the new NVMF
CAM transport. nvmf does not support nvd(4), only nda(4).

Sponsored by: Chelsio Communications

Diff Detail

rG FreeBSD src repository
Lint Not Applicable
Tests Not Applicable

Event Timeline

jhb requested review of this revision. Apr 9 2024, 11:03 PM
jhb created this revision.

So it knows its parent's ivars? Isn't that a bit naughty? What's the thinking for doing this unorthodox thing here?


How does unload / load work here? If the ivars have had their cdata shorn off of them, then it's not here to use now and could be NULL (though below you assume it's not null).


We can add resid at the end of the current nvmeio structure, I think. We laid things out such that we can do this. libcam uses get_ccb, so we have plenty of space there. The kernel uses either get_ccb() or a custom UMA allocator, which will automatically grow in size.
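
As a rough illustration of why appending is the safe option, the sketch below compares a "before" and "after" layout. Both structs and the helper `demo_layout_compatible()` are hypothetical stand-ins and do not reflect the real nvmeio CCB; the only point is that a field added at the end leaves the offsets of the existing fields alone.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout; NOT the real nvmeio CCB. */
struct demo_io_v1 {
	uint32_t flags;
	uint32_t dxfer_len;
	void	*data_ptr;
};

/* Hypothetical "after" layout: resid is appended, nothing is reordered. */
struct demo_io_v2 {
	uint32_t flags;
	uint32_t dxfer_len;
	void	*data_ptr;
	uint32_t resid;		/* bytes that were not transferred */
};

/* Returns 1 if every pre-existing field kept its offset. */
static int
demo_layout_compatible(void)
{
	return (offsetof(struct demo_io_v1, flags) ==
	    offsetof(struct demo_io_v2, flags) &&
	    offsetof(struct demo_io_v1, dxfer_len) ==
	    offsetof(struct demo_io_v2, dxfer_len) &&
	    offsetof(struct demo_io_v1, data_ptr) ==
	    offsetof(struct demo_io_v2, data_ptr));
}
```

Whether the larger structure still fits in every existing allocation is a separate question, which is what the remarks about get_ccb and the UMA allocator address.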

As for what it should do, I'm unsure wrt retry or not. I suspect that the logical thing to do is to retry the uncompleted I/O some number of times. The upper layers may or may not know how to cope well. SCSI used to do this years ago, and still has code, but IIRC (and I may not) none of the modern SIMs will return a partial I/O, so that code path is, at best, lightly tested. But the buffer cache, at least, will only 'validate' the I/O that completes, so that will cause it to retry. I'm not sure what ZFS will do, though, so I don't know how important it is to retry. Swapper likely does something weird, but is much less important.
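
To make the "retry the uncompleted I/O some number of times" idea concrete, here is a minimal userland sketch. It is not code from nvmf, nda, or CAM: the callback type `xfer_fn`, the helper `xfer_with_retry()`, and the toy transport `capped_xfer()` are all hypothetical names, and the model assumes each attempt reports a residual (bytes left untransferred).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical callback: tries to transfer len bytes starting at offset
 * off and returns the residual (bytes NOT transferred); 0 means done.
 */
typedef size_t (*xfer_fn)(void *arg, size_t off, size_t len);

/*
 * Issue the transfer, then retry only the unfinished tail, allowing up
 * to max_retries extra attempts.  Returns the final residual.
 */
static size_t
xfer_with_retry(xfer_fn xfer, void *arg, size_t len, int max_retries)
{
	size_t off = 0;
	size_t resid = len;
	int attempt;

	for (attempt = 0; attempt <= max_retries && resid > 0; attempt++) {
		size_t left = xfer(arg, off, resid);

		assert(left <= resid);
		off += resid - left;	/* skip the part that completed */
		resid = left;
	}
	return (resid);
}

/* Toy transport for illustration: moves at most *arg bytes per attempt. */
static size_t
capped_xfer(void *arg, size_t off, size_t len)
{
	size_t cap = *(size_t *)arg;

	(void)off;
	return (len > cap ? len - cap : 0);
}
```

With a 4096-byte cap, an 8192-byte request finishes after one retry, while a 16384-byte request limited to one retry still reports 8192 bytes outstanding, which the caller would then have to surface as an error or a short transfer.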

We should also consider moving some of the retry logic up into nvd / nda from nvme right now, but that's only tangentially related.


Not its parent's ivars (the parent is root0), just its own. I need a way to pass the parameters from the ioctl that adds a new device down to the attach routine. This device is kind of special as it isn't a hardware thing enumerated by a bus, but it is a software device created by an ioctl. I could add a single ivar with accessors that is a pointer to the ioctl parameters perhaps, but that extra layer of indirection seemed a bit silly.


Devices are only instantiated by an ioctl that sets the ivars before calling device_attach. The ioctl handler cleans up if the attach fails, but it only cleans up ivars->cdata if it hasn't been "claimed" in the attach routine. Here the attach routine takes ownership of cdata and is now responsible for freeing it if attach later fails or during detach.

So, nothing happens during kldload except that the /dev/nvmf device is created that can accept ioctls to create devices. During kldunload any active devices are detached which will free any cdata in the softc. The ivars are only "live" during the ioctl handler and are cleared back to NULL after device_attach concludes:

static int
nvmf_handoff_host(struct nvmf_handoff_host *hh)
{
	struct nvmf_ivars ivars;
	device_t dev;
	int error;

	error = nvmf_init_ivars(&ivars, hh);
	if (error != 0)
		return (error);

	dev = device_add_child(root_bus, "nvme", -1);
	if (dev == NULL) {
		error = ENXIO;
		goto out;
	}

	/* The ivars are only live for the duration of the attach. */
	device_set_ivars(dev, &ivars);
	error = device_probe_and_attach(dev);
	device_set_ivars(dev, NULL);
	if (error != 0)
		device_delete_child(root_bus, dev);

out:
	/* Cleanup of ivars state not claimed by attach is elided here. */
	return (error);
}

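The claim-on-attach ownership rule described above can be sketched in plain userland C. This is only an illustration under assumed names: `demo_ivars`, `demo_softc`, `demo_attach()`, `demo_detach()`, and `demo_free_ivars()` are hypothetical stand-ins, not the driver's actual types or functions. The attach routine takes ownership by moving the pointer and clearing the source, so the creator's cleanup frees cdata only when it was never claimed.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's ivars and softc. */
struct demo_ivars {
	void *cdata;		/* owned by the creator until claimed */
};

struct demo_softc {
	void *cdata;		/* owned by the device once claimed */
};

/* Attach claims cdata by moving the pointer and clearing the source. */
static int
demo_attach(struct demo_softc *sc, struct demo_ivars *ivars, int fail_early)
{
	if (fail_early)
		return (-1);	/* not claimed; the creator still owns cdata */

	sc->cdata = ivars->cdata;
	ivars->cdata = NULL;	/* mark as claimed */
	return (0);
}

/* Detach (or a late attach failure) frees whatever the softc owns. */
static void
demo_detach(struct demo_softc *sc)
{
	free(sc->cdata);
	sc->cdata = NULL;
}

/* Creator-side cleanup: frees cdata only if attach did not claim it. */
static void
demo_free_ivars(struct demo_ivars *ivars)
{
	free(ivars->cdata);	/* free(NULL) is a no-op */
	ivars->cdata = NULL;
}
```

Because the claimed pointer is cleared in the ivars, the creator can call its cleanup unconditionally on both the success and failure paths without risking a double free.
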
This comment is largely inspired by talking with you about this on IRC before. TBH, partial completions should be rare. It means there was data corruption on the wire that caused a PDU digest to mismatch so a data transfer for a successful operation failed. Retrying the remainder probably is sensible since the actual operation succeeded (we only get data if the operation succeeds). However, that logic does indeed belong up in nda and this layer doesn't try to retry at all.



Switch to SPDX-only license blocks for C files

This revision was not accepted when it landed; it landed in state Needs Review. Fri, May 3, 12:16 AM
This revision was automatically updated to reflect the committed changes.