Details

Reviewers

imp
chuck

Commits

rGa1eda74167b5: nvmf: The in-kernel NVMe over Fabrics host

Summary

This is the client (initiator in SCSI terms) for NVMe over Fabrics.
Userland is responsible for creating a set of queue pairs and then
handing them off via an ioctl to this driver, e.g. via the 'connect'
command from nvmecontrol(8). An nvmeX new-bus device is created
at the top-level to represent the remote controller similar to PCI
nvmeX devices for PCI-express controllers.

As with nvme(4), namespace devices named /dev/nvmeXnsY are created and
pass through commands can be submitted to either the namespace devices
or the controller device. For example, 'nvmecontrol identify nvmeX'
works for a remote Fabrics controller the same as for a PCI-express
controller.

nvmf exports remote namespaces via nda(4) devices using the new NVMF
CAM transport. nvmf does not support nvd(4), only nda(4).

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jhb requested review of this revision.Apr 9 2024, 11:03 PM

jhb created this revision.

Harbormaster completed remote builds in B57005: Diff 136810.Apr 9 2024, 11:03 PM

jhb added a parent revision: D44713: cam: Add a XPORT_NVMF for NVMe over Fabrics sims.Apr 9 2024, 11:03 PM

jhb added a child revision: D44715: nvmecontrol: New commands to support Fabrics hosts.

jhb added a reviewer: chuck.Apr 9 2024, 11:07 PM

imp added inline comments.Apr 15 2024, 4:49 PM

sys/dev/nvmf/host/nvmf.c
450	So it knows its parent's ivars? Isn't that a bit naught? What's the thinking for doing this unorthodox thing here?
467	How does unload / load work here? If the ivars have had their cdata shorn off of them, then it's not here to use now and could be NULL (though below you assume it's not null).
sys/dev/nvmf/host/nvmf_sim.c
90	We can add resid at the end of the current nvmio structure, I think. We laid out things such that we can do this. libcam uses get_ccb, so we have plenty of space there. The kernel uses either get_ccb() or a custom UMA allocator, which will automatically grow in size. As for what it should do, I'm unsure wrt retry or not. I suspect that the logical thing to do is to retry some number of times the uncompleted I/O. The upper layers may or may not know how to cope well. SCSI used to do this years ago, and still has code, but IIRC (and I may not) none of the modern SIMs will return a partial I/O so that code path is, at best lightly tested. But the buffer cache, at least, will only 'validate' the I/O that completes, so that will cause it to retry. I'm not sure what ZFS will do, though, so I don't know how important it is to retry. Swapper likely does something weird, but is of much less important. We should also consider movinng some of the retry logic up into nvd / nda from nvme right now, but that's only tangentially releated.

jhb added inline comments.Apr 16 2024, 7:05 PM

sys/dev/nvmf/host/nvmf.c
450	Not its parent's ivars (the parent is root0), just its own. I need a way to pass the parameters from the ioctl that adds a new device down to the attach routine. This device is kind of special as it isn't a hardware thing enumerated by a bus, but it is a software device created by an ioctl. I could add a single ivar with accessors that is a pointer to the ioctl parameters perhaps, but that extra layer of indirection seemed a bit silly.
467	Devices are only instantiated by an ioctl that sets the ivars before calling device_attach. The ioctl handler cleans up if the attach fails, but it only cleans up `ivars->cdata` if it hasn't been "claimed" in the attach routine. Here the attach routine takes ownership of `cdata` and is now responsible for freeing it if attach later fails or during detach. So, nothing happens during kldload except that the /dev/nvmf device is created that can accept ioctls to create devices. During kldunload any active devices are detached which will free any `cdata` in the softc. The `ivars` are only "live" during the ioctl handler and are cleared back to NULL after `device_attach` concludes: static int nvmf_handoff_host(struct nvmf_handoff_host *hh) { struct nvmf_ivars ivars; device_t dev; int error; error = nvmf_init_ivars(&ivars, hh); if (error != 0) return (error); bus_topo_lock(); dev = device_add_child(root_bus, "nvme", -1); if (dev == NULL) { bus_topo_unlock(); error = ENXIO; goto out; } device_set_ivars(dev, &ivars); error = device_probe_and_attach(dev); device_set_ivars(dev, NULL); if (error != 0) device_delete_child(root_bus, dev); bus_topo_unlock(); out: nvmf_free_ivars(&ivars); return (error); }
sys/dev/nvmf/host/nvmf_sim.c
90	This comment is largely inspired by talking with you about this on IRC before. TBH, partial completions should be rare. It means there was data corruption on the wire that caused a PDU digest to mismatch so a data transfer for a successful operation failed. Retrying the remainder probably is sensible since the actual operation succeeded (we only get data if the operation succeeds). However, that logic does indeed belong up in `nda` and this layer doesn't try to retry at all.

Add NVMF_DISCONNECT_HOST and NVMF_DISCONNECT_ALL ioctls for /dev/nvmf

Harbormaster completed remote builds in B57512: Diff 137978.May 1 2024, 8:59 PM

Add NVMF_DISCONNECT_HOST and NVMF_DISCONNECT_ALL ioctls for /dev/nvmf

Harbormaster completed remote builds in B57516: Diff 137982.May 1 2024, 9:11 PM

Switch to SPDX-only license blocks for C files

Harbormaster completed remote builds in B57530: Diff 137996.May 1 2024, 10:51 PM

This revision was not accepted when it landed; it landed in state Needs Review.May 3 2024, 12:16 AM

Closed by commit rGa1eda74167b5: nvmf: The in-kernel NVMe over Fabrics host (authored by jhb). · Explain Why

This revision was automatically updated to reflect the committed changes.

jhb added a commit: rGa1eda74167b5: nvmf: The in-kernel NVMe over Fabrics host.

nvmf: The in-kernel NVMe over Fabrics host
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 138064

share/man/man4/Makefile

share/man/man4/nvmf.4

sys/conf/NOTES

sys/conf/files

sys/dev/nvmf/host/nvmf.c

sys/dev/nvmf/host/nvmf_aer.c

sys/dev/nvmf/host/nvmf_cmd.c

sys/dev/nvmf/host/nvmf_ctldev.c

sys/dev/nvmf/host/nvmf_ns.c

sys/dev/nvmf/host/nvmf_qpair.c

sys/dev/nvmf/host/nvmf_sim.c

sys/dev/nvmf/host/nvmf_var.h

sys/modules/nvmf/Makefile

sys/modules/nvmf/nvmf/Makefile

nvmf: The in-kernel NVMe over Fabrics hostClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 138064

share/man/man4/Makefile

share/man/man4/nvmf.4

sys/conf/NOTES

sys/conf/files

sys/dev/nvmf/host/nvmf.c

sys/dev/nvmf/host/nvmf_aer.c

sys/dev/nvmf/host/nvmf_cmd.c

sys/dev/nvmf/host/nvmf_ctldev.c

sys/dev/nvmf/host/nvmf_ns.c

sys/dev/nvmf/host/nvmf_qpair.c

sys/dev/nvmf/host/nvmf_sim.c

sys/dev/nvmf/host/nvmf_var.h

sys/modules/nvmf/Makefile

sys/modules/nvmf/nvmf/Makefile

nvmf: The in-kernel NVMe over Fabrics host
ClosedPublic
Actions

Revision Contents
Changeset List