RFC: Disk I/O priority support
Needs Review · Public

Authored by mav on Oct 22 2020, 9:22 PM.

Details

Reviewers
scottl
imp
mmacy
trasz
mckusick
cem
pjd
slm
Group Reviewers
cam
Summary

For years, SCSI in most of its modern flavors has allowed passing a relative command priority (0 = default, 1 = highest, 15 = lowest) from initiator to target. SATA also allows two levels of priority (normal and high). I am not yet aware of any SCSI disks that support priorities (I haven't looked very hard), but some SATA HDDs and SSDs I have report it as supported, and at least for some WD REDs I am able to measure its effects.

In this patch I've added a priority field to the CAM SCSI initiator and target command structures (this should not break the ABI, thanks to alignment), and added support for it to isp(4) as target and initiator, and to mps(4), mpr(4) and iscsi(4) as initiator. For SATA drives, high priority is just a single bit in the NCQ READ and WRITE commands, so the change is minimal. To GEOM BIO I've added just a single LOWPRIO flag so as not to break the KBI, but I can extend it to a separate field if people think we may benefit from passing more priorities.
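
As a rough illustration of the "single bit" remark, the sketch below sets the high-priority encoding in the COUNT field of an NCQ READ/WRITE FPDMA QUEUED command. It assumes the ACS layout in which PRIO occupies bits 15:14 of COUNT (00b normal, 10b high), so normal and high differ only in bit 15. The macro and function names are hypothetical and are not taken from the patch.

```c
#include <stdint.h>

/* Assumed ACS layout: PRIO occupies bits 15:14 of the NCQ COUNT field. */
#define NCQ_PRIO_SHIFT	14
#define NCQ_PRIO_MASK	(0x3u << NCQ_PRIO_SHIFT)
#define NCQ_PRIO_NORMAL	0x0u	/* 00b */
#define NCQ_PRIO_HIGH	0x2u	/* 10b: differs from normal only in bit 15 */

/*
 * Hypothetical helper (not from the patch): return the COUNT field with
 * the PRIO bits replaced, leaving the remaining bits (e.g. the NCQ tag)
 * untouched.
 */
static inline uint16_t
ncq_count_set_prio(uint16_t count, int high)
{
	uint16_t prio = high ? NCQ_PRIO_HIGH : NCQ_PRIO_NORMAL;

	return (uint16_t)((count & ~NCQ_PRIO_MASK) | (prio << NCQ_PRIO_SHIFT));
}
```

A driver would call something like this while building the command, passing a non-zero `high` only for requests it wants the drive to favor.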

Test Plan

After marking all background ZFS I/Os as low-priority, I am able to see some latency reduction when benchmarking short random reads while a mostly sequential pool scrub is running. If the concept is OK, we may test it with more varied patterns. The scrub pressure was just the issue I was looking at.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Unit Tests Skipped
Build Status
Buildable 34480

Event Timeline

mav requested review of this revision. Oct 22 2020, 9:22 PM

Interesting. I have patches to iosched that mark metadata requests and top-queue them, but don't try to prioritize in the drive. They don't handle writes, though (we don't need them, but it's one of the reasons I've not committed)... they give a modest boost to open latency, but not as much as the async open chuck is working on.

I also keep track of them, but haven't pushed the priority bit down into the drives. We run at a low queue depth already...

This is interesting...

It would be trivial to request high priority for synchronous writes in bwrite() and if desired synchronous reads in bread(). That would have effects for several filesystems.

UFS does very few synchronous writes (when running with soft updates), so likely would not see much effect if only synchronous writes were given high priority.
It would be interesting to see if requesting high priority for synchronous reads would have any measurable (and useful) effect.

I could see a use for at least three levels of priority: low priority (default) for asynchronous I/O, mid-level priority for synchronous reads, high priority for synchronous writes.
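
The three-level split suggested above could be sketched roughly as follows. This is only an illustration of the classification rule; the enum and function names are hypothetical and do not correspond to anything in the patch or in the buffer cache code.

```c
#include <stdbool.h>

/* Hypothetical priority classes; the names are illustrative only. */
enum io_prio {
	IO_PRIO_LOW = 0,	/* default: asynchronous I/O */
	IO_PRIO_MID = 1,	/* synchronous reads */
	IO_PRIO_HIGH = 2	/* synchronous writes */
};

/* Classify a request by direction and by whether the caller waits on it. */
static inline enum io_prio
io_prio_classify(bool is_write, bool is_sync)
{
	if (!is_sync)
		return (IO_PRIO_LOW);
	return (is_write ? IO_PRIO_HIGH : IO_PRIO_MID);
}
```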

I've been able to measure a small effect from having two levels for normal and metadata reads. I didn't try anything for synchronous writes. I added it mostly to instrument metadata reads to test some metadata read mitigation code we were trying out.

I'm thinking that we may need to have a couple of flags to have more than low/high...

I'm happy to share my code too...

More experiments with SATA WD REDs show that priorities there are more like absolute ones with a deadline. On a WD20EFRX-68E under a heavy random workload I see that low-priority requests in the presence of high-priority ones are all delayed for about a second, while on a WD80EFZX-68U they are all delayed for about 5 seconds. Such a big difference makes me think it is unusable for differentiating sync vs async requests, but it should still be good for differentiating read/write vs scrub/initialization/etc. Unfortunately I still haven't found any capable SAS drive to check there, but considering that SATL directly maps one onto the other, I suppose they should have the same (absolute) semantics.

I think that the intention of the feature from the manufacturers is for background sync and scrub workloads, not filesystem consistency operations. Regarding SAS, tags have always been the mechanism for setting priority and creating barriers. Ordered tags, head-of-queue tags, etc. mpr/mps support these, but they're largely unused because BIO_ORDERED was removed from FreeBSD. I'm not aware of SAS adopting the same SATA priority scheme.

Priority works on top of tags, affecting only commands tagged as SIMPLE. The ORDERED and HEAD tags still have their function, as their fencing semantics are mandatory, while priority is a softer hint for a scheduler.
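
To make the relationship concrete, here is a minimal sketch of the rule that the priority hint only applies to SIMPLE-tagged commands. The enum and helper are hypothetical; it assumes the 4-bit priority range mentioned in the summary (0 = default, 1 = highest, 15 = lowest) and simply reports "default" for any other tag type.

```c
#include <stdint.h>

/* Hypothetical tag types, mirroring the SIMPLE/ORDERED/HEAD distinction. */
enum task_attr {
	TASK_SIMPLE,
	TASK_ORDERED,
	TASK_HEAD_OF_QUEUE
};

/*
 * Return the priority a scheduler would consider for a command.
 * Only SIMPLE commands carry the hint; ORDERED and HEAD commands keep
 * their mandatory ordering semantics, so 0 (default) is returned for them.
 */
static inline uint8_t
effective_priority(enum task_attr attr, uint8_t prio)
{
	return ((attr == TASK_SIMPLE) ? (uint8_t)(prio & 0x0f) : 0);
}
```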

mav edited the summary of this revision.
mav edited the test plan for this revision.

I've tried the opposite approach of adding a LOWPRIO flag instead, using it only for background operations in a few places, and marking BIOs without it as high-priority in ATA/SCSI. But while testing it I noticed that disk random IOPS drop to almost non-NCQ levels on a mix of different priorities, and I am measuring the same on both WD and HGST. I don't understand what is going on there; maybe I am missing something, but that is an unacceptable trade-off to me. I've uploaded my present patch in case somebody wishes to play with it, but I probably won't commit it in this state.
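
For readers following along, the inverted mapping described in this comment amounts to something like the sketch below: requests carrying the low-priority flag go out at the default priority, and everything else is marked high. The flag name and value are hypothetical placeholders, not the ones used in the uploaded patch; the SCSI values follow the summary (0 = default, 1 = highest).

```c
#include <stdint.h>

/* Hypothetical flag bit standing in for the LOWPRIO BIO flag. */
#define EX_BIO_LOWPRIO	0x01u

#define SCSI_PRIO_DEFAULT	0	/* no priority requested */
#define SCSI_PRIO_HIGHEST	1

/* Background (LOWPRIO) requests stay at default; all others become high. */
static inline uint8_t
bio_flags_to_scsi_prio(unsigned int bio_flags)
{
	return ((bio_flags & EX_BIO_LOWPRIO) ?
	    SCSI_PRIO_DEFAULT : SCSI_PRIO_HIGHEST);
}
```

With this mapping almost all foreground traffic is sent as high priority, which is the mix where the IOPS drop described in this comment was observed.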

Just for information, I've also experimented with isochronous NCQ priority (AKA NCQ streaming). I hoped that setting a large timeout would reduce the request priority. But at least on WD Red I see no priority effects until the timeout is reached, and I see a priority increase (again with the IOPS problem) when it is. It is good to see that the feature really works, but unfortunately I see no use for it in this shape. I see plenty of use cases for low priority (which SATA/SAS drives don't provide), but not really for high priority (which they do provide, but not very efficiently). NVMe seems to have a usable priority concept and some devices support it; I'm just not sure how important priority is for already fast NVMe devices.

Have we reached any conclusions about whether to do any of the ideas suggested in this phabricator thread?

If there is some use for high priority, then it works for SATA and it is simple, but since it is an absolute priority, the difference between normal and high is too big to use it without a very good reason. For low priority though, which would be useful for background operations even with absolute priorities, I haven't found a working implementation so far, except potentially NVMe. I am hoping to get some comments from ${HDD vendor} about it.