Page MenuHomeFreeBSD

Limit stripesize reported from nvd(4) for to geom(8) to 4K
ClosedPublic

Authored by smh on Dec 8 2015, 9:59 PM.

Details

Summary

Intel NVMe controllers have a slow path for I/Os that span a 128KB stripe boundary but ZFS limits ashift, which is derived from d_stripesize, to 13 (8KB) so we limit the stripesize reported to geom(8) to 4KB.

This may result in a small number of additional I/Os to require splitting in nvme(4), however the NVMe I/O path is very efficient so these additional I/Os will cause very minimal (if any) difference in performance or CPU utilisation.

Test Plan

Create a zfs(8) pool on nvd(4) and run zpool status to check that no preformance warning is produced.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

smh retitled this revision from to Limit stripesize reported from nvd(4) for to geom(8) to 4K.
smh updated this object.
smh edited the test plan for this revision. (Show Details)
sys/dev/nvd/nvd.c
293 ↗(On Diff #10940)

Any chance you can make this a quirk? Or something nvme_ns_get_stripesize?
I'd hate to duplicate this in my NVMe CAM attachment.

Expose stripesize limit via kern.nvd.max_stripesize sysctl

smh marked an inline comment as done.Dec 8 2015, 11:58 PM
smh added inline comments.
sys/dev/nvd/nvd.c
293 ↗(On Diff #10940)

The setting of stripesize is already a quirk, although a hardcoded one see: nvme_ns.c:489.

Jim's idea of passing this up to geom was to allow the upper layers to perform alignment and hence obtain optimum performance.

I've made the limit into a sysctl but kept the default to 4KB to ensure max compatibility, that ok?

sys/dev/nvd/nvd.c
310 ↗(On Diff #10947)

No. It's not OK. I don't want to replicate this logic in CAM attached nda I'm working on. Why can't we just fix it in nvme_ns.c?

smh marked an inline comment as done.Dec 9 2015, 4:26 PM
smh added inline comments.
sys/dev/nvd/nvd.c
310 ↗(On Diff #10947)

rgr I miss-understood your original comment. I'll look to move it, thanks for the update.

sys/dev/nvd/nvd.c
310 ↗(On Diff #10947)

Thanks Steve. nvme_ns.c needs a more robust quirking mechanism, but this is good for now.

sys/dev/nvd/nvd.c
310 ↗(On Diff #10947)

Putting this in nvme_ns.c is OK, but we should change the name of nvme_ns_get_stripesize() then (since it is no longer reporting the stripe size quick verbatim).

How about nvme_ns_get_optimal_sector_size()? When stripe size == 0, this just returns nvme_ns_get_sector_size(). Otherwise it returns min(stripesize, 4KB).

sys/dev/nvd/nvd.c
310 ↗(On Diff #10947)

And I meant to add that then this gets used for d_sectorsize, and we don't set d_stripesize at all.

Move max check to nvme exposing as nvme_ns_get_optimal_sector_size.

smh marked 5 inline comments as done.Dec 10 2015, 5:42 PM
smh added inline comments.
sys/dev/nvd/nvd.c
291 ↗(On Diff #11044)

While d_sectorsize is a bit miss-used here I don't want to prevent things from getting the true sectorsize so I elected to still set d_stripesize just to nvme_ns_get_optimal_sector_size instead.

jimharris edited edge metadata.
This revision is now accepted and ready to land.Dec 10 2015, 11:18 PM
imp added a reviewer: imp.

Not sure I'm happy this is a global in nvme.c, but it's better there than nvd.
As multiple NVMe drives become common this may need to change, but it is OK for now.

In D4446#94747, @imp wrote:

Not sure I'm happy this is a global in nvme.c, but it's better there than nvd.
As multiple NVMe drives become common this may need to change, but it is OK for now.

As a capping limit its good to have a global but as you say it would nice to also per ctrl ones.

We could use the same method I used for the CAM sort_io_queue r248922 where for the ctrl level ones -1 indicates use global max.

Something for another day though, thanks for the reviews.

This revision was automatically updated to reflect the committed changes.