Page MenuHomeFreeBSD

Add big-endian support to NVMe
ClosedPublic

Authored by mst_semihalf.com on Jan 15 2018, 4:05 PM.
Tags
None
Referenced Files
F105943777: D13916.diff
Sun, Dec 22, 10:12 PM
Unknown Object (File)
Sun, Dec 8, 11:54 PM
Unknown Object (File)
Nov 19 2024, 1:33 AM
Unknown Object (File)
Nov 14 2024, 7:16 PM
Unknown Object (File)
Nov 11 2024, 7:09 PM
Unknown Object (File)
Nov 11 2024, 6:48 PM
Unknown Object (File)
Nov 10 2024, 2:21 AM
Unknown Object (File)
Nov 9 2024, 1:42 PM
Subscribers

Details

Summary

Remove bitfields from defined structures as they are not portable.
Instead use shift and mask macros in the driver and nvmecontrol application.

NVMe is now working on powerpc64 host.

Buildworld on PPC64 tested, universe still building.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

imp requested changes to this revision.Jan 15 2018, 4:15 PM

Shouldn't all this endian flipping be done in bus space?

These changes are massive and many seem a bit gratuitous, or seem to be done in a way that's requires more extra code to do the thing.

Two examples:

In SCSI and ATA, for example, we put the identify and inquiry data into host order early so all the downstream code can just use them normally. That would eliminate a almost all of conversions that you're currently doing inline in the code. And it would fit better with CAM. I don't want nda to be the odd-man out knowing it has to do the endian conversions.

There's a number of macros used to create opcodes, but then we do htole16 with them. Why not put that in there?

There's others, but I honestly stopped reviewing after seeing these things because I think this approach is not one that can scale. There's out of tree patches to support name space, provisioning hints and other formatting stuff that could easily come in and people might not notice they won't work on big endian right away.

I know a ton of work went into this, and I feel bad for not liking it, but I think we'll have real issues for years to come if we go down this path.

sbin/nvmecontrol/Makefile
10 ↗(On Diff #37972)

No. You can't do this in the build. What are you trying to accomplish that #include in the affected file won't do?

This revision now requires changes to proceed.Jan 15 2018, 4:15 PM
sbin/nvmecontrol/devlist.c
59 ↗(On Diff #37972)

So why not convert nsdata->flbas to native before doing this shifting and masking? Yet yo do down below when you access the lbaf. Is that a bug or something clever?

sbin/nvmecontrol/firmware.c
128 ↗(On Diff #37972)

This is but one of many macros that must have htole16 to work. Why not roll it into NVME_CMD_SET_OPC?

160 ↗(On Diff #37972)

And why not assume that NVME_STATUS_GET_SCT is passed in little endian rather than host order?

198 ↗(On Diff #37972)

getopt returns an in, so ch should be an int, not a signed char.

I agree with Warner's comments -- if there is any other way to do the endian conversions at all, we should.

sys/conf/files.powerpc
60 ↗(On Diff #37972)

Why are these in files.powerpc instead of just files? The driver isn't machine-dependent.

It's done that way because I needed to maintain a strict convention of where byte-swapping was performed to minimize the chance of mistakes (such as forgetting to swap or double-swapping a field by mistake as it's passed through various callbacks). Otherwise it was hard to make sense of the code, as there are a lot of fields which need swapping and they are often transferred throughout components.

I went with the convention of swapping at the latest possible time i.e. when the value is used (for arithmetic, comparison, etc.), leaving device memory structures (completion/command buffers) intact. This guarantees that only the data which is actually used gets swapped. The downside is that conversion has to be done each time a value is used in a statement. I realize it has made the code overly verbose with all the letoh/htole's, so I'll try to hide some of them in headers and swap some or all the fields in-place in the buffers.

I actually started out with the opposite approach of swapping all fields at the earliest possible time (just after reading the buffer from device) as you propose, but then the code started to look awkward as so many fields had to be swapped in one place. I thought about flipping only the fields that are actually used later in the code, but that could create problems in the future if someone decides to use a previously unused field and forgets to add it to the swapping routine. Most people testing NVMe use little-endian machines so such bugs could accumulate without being discovered. I was also concerned about possible performance penalty due to swapping a lot of data in one place (including possibly unused fields).

If you prefer the alternative convention of swapping at the earliest time i.e. on read from device, I'll try to rework the patch to do that, possibly by adding macros to nvme.h such as nvme_completion_bswap(cpl) and the like. The awkward part is that all fields will have to be swapped one by one - can't use a loop as the fields have different sizes, e.g.:

"nvme_ctrldata_bswap(data)":

data.vid = htole16(data.vid);
data.ssvid = htole16(data.ssvid);
data.ctrlr_id = htole32(data.ctrlr_id);
data.ver = htole64(data.ver);
data.rtd3r = htole32(data.rtd3r);
(similarly for 20 remaining non-uint8 fields of nvme_controller_data)

Again, I feel like the only other reasonable option is to immediately swap all fields of a structure once it has been read from device as shown above, without skipping any, as swapping selectively would require each following contributor to remember to update the swapping routine when more fields are used. I'll try this approach and submit an update to the patch. There might be a better option which I haven't thought of so any input would be welcome.

ATA seems to be a simpler case - AFAIK it relies more on registers so swapping is mostly done by bus_space. Here values from registers are swapped by bus_space accessors as well (nvme_mmio_write(), etc).
As for nda, it should probably use accessors to get what it needs without doing conversions on its own so I'll change that.

sbin/nvmecontrol/Makefile
10 ↗(On Diff #37972)

I think I was having compilation issues at one point and thought I needed that line, I'll remove it.

sbin/nvmecontrol/devlist.c
59 ↗(On Diff #37972)

flbas is uin8_t so it does not need to be swapped. The entry of lbaf[] array needs to be swapped as it's 32-bit and that's what the le32toh below does. If you're saying we should byte-swap the entire lbaf[] array beforehand, then it can be done but would look a bit awkward - a separate function (not macro) would need to be created to swap all namespace_data fields including a for loop swapping lbaf[]. Alternatively, the routine which gets namespace_data from device could check 'flbas' and byte-swap only the relevant lbaf[] entry but I was apprehensive about doing such selective swapping (see my main comment).

sbin/nvmecontrol/firmware.c
128 ↗(On Diff #37972)

I only made get/set macros for a few most frequently used fields and left this htole here with the others for consistency but of course I can move it into the macro to make the lines shorter.

160 ↗(On Diff #37972)

I'll change that.

198 ↗(On Diff #37972)

Correct, I'll change it to int. I had a compilation error here as char is unsigned on PPC64.

sys/conf/files.powerpc
60 ↗(On Diff #37972)

That's right, I'll move it into global files.

As suggested, I have reworked the patch to perform endianess conversions all at once and as early as possible. The changes concern data which is read from device. I've also implemented other suggestions from the review.

Other changes:

  • Fix off-by-one errors in existing NVMe code (skipping last temperature sensor, wrong interpretation of ELPE value)
  • Add missing byteswap in dsm_range, also check for NULL after calling malloc
  • Swap bytes in all error log pages instead of just the first one

Have more to review.

sbin/Makefile.powerpc64
3–4 ↗(On Diff #39387)

These two likely can be committed separately, if at all. gpart should cover all use cases, no?

sys/powerpc/conf/GENERIC64
117 ↗(On Diff #39387)

I'd recommend just starting with nda.

sbin/Makefile.powerpc64
3–4 ↗(On Diff #39387)

I think we want those, but I'll move it to a separate patch.

sys/powerpc/conf/GENERIC64
117 ↗(On Diff #39387)

meaning commit 'device nvme' first and nvd in another patch?

I think we talked out all issues here.
I'll split this to separate patches and commit to the HEAD. Please let me know if you have any objections.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 22 2018, 1:33 PM
Closed by commit rS329824: NVMe: Add big-endian support (authored by wma). · Explain Why
This revision was automatically updated to reflect the committed changes.