
add BIO_NORETRY flag, implement support in ata_da, use in ZFS vdev_geom
AbandonedPublic

Authored by avg on Nov 24 2017, 10:28 AM.

Details

Summary

The flag is intended to signal that the bio is "fail-fast" and that the lower
storage layers should not try too hard to make it succeed.
This flag can be useful if the bio issuer implements some sort of
redundancy scheme or retry algorithm (or both).

The first user of the new flag is ZFS. In fact, the flag is inspired
by ZIO_FLAG_TRYHARD in ZFS and its handling by vdev_disk.
I think that GEOM classes like g_mirror and g_raid can also benefit from
the flag.

ata_da (ada) is the first storage driver to honour the flag.
It should not be hard to add support for it to other CAM-based drivers
such as scsi_da.
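
To illustrate the intended use, here is a rough sketch of the two sides of the
change. It is simplified and is not the actual diff; BIO_NORETRY is the flag
proposed here, everything else is existing FreeBSD API, and both fragments
assume the surrounding context of a vdev_geom-style issuer and an
adastart()-style driver routine respectively.

    /* Issuer side (e.g. ZFS vdev_geom): mark the bio as fail-fast. */
    struct bio *bp;

    bp = g_alloc_bio();
    bp->bio_cmd = BIO_READ;
    bp->bio_offset = offset;
    bp->bio_length = size;
    bp->bio_data = data;
    bp->bio_flags |= BIO_NORETRY;           /* the proposed flag */
    bp->bio_done = vdev_geom_io_intr;
    g_io_request(bp, cp);

    /*
     * Driver side (sketch of the ata_da change, inside an adastart()-like
     * routine where bp and start_ccb already exist): zero CAM retries for
     * bios that carry the flag.
     */
    int retries;

    retries = (bp->bio_flags & BIO_NORETRY) ? 0 : ada_retry_count;
    cam_fill_ataio(&start_ccb->ataio, retries, adadone,
        (bp->bio_cmd == BIO_READ) ? CAM_DIR_IN : CAM_DIR_OUT,
        0, bp->bio_data, bp->bio_bcount, ada_default_timeout * 1000);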

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 13001
Build 13255: arc lint + arc unit

Event Timeline

I like the part about BIO_NORETRY. Just wonder why it is not implemented for SCSI devices too.

But I don't like the part about short ATA timeouts, which may cause false timeouts if the drive spindle was stopped. IIRC I saw disks spinning up for about 10 seconds.

In D13224#275371, @mav wrote:

I like the part about BIO_NORETRY. Just wonder why it is not implemented for SCSI devices too.

I want to share the work :-)

But I don't like the part about short ATA timeouts, which may cause false timeouts if the drive spindle was stopped. IIRC I saw disks spinning up for about 10 seconds.

Yeah, I wasn't sure about the timeout part either...

I also got a modification-after-free when testing in bhyve with a too short timeout.
Apparently the emulated ahci controller wrote into a buffer after ahci had aborted the request on timeout.
I am not sure whether ahci reset the controller before releasing the request, or whether it is a bug in the emulated controller that lets it complete an outstanding request that was started before the reset.

Back to the topic, do you think that it is safer to use the same timeout as for normal requests?

Thanks!

In D13224#275372, @avg wrote:

Back to the topic, do you think that it is safer to use the same timeout as for normal requests?

Yes, I think so.

imp requested changes to this revision. Nov 24 2017, 1:00 PM

I think this is a terrible idea, at least as implemented. It suffers several problems.

First, it conflates the notion of 'fast fail' with the notion of 'don't retry'. Those are two very different notions.
Next, it violates the contract that EIO means 'this I/O won't succeed because it's been through the storage-appropriate recovery algorithm', which is mostly true in the tree now. This muddies the waters at a time when we're trying to clear this up. Some other error code must be returned up the stack.

TRYHARD seems to mean 'do more than you normally would' when present, rather than 'give it some half-assed attempt' when absent.

It's basically trying to create multiple classes of I/O with different quality of service and pour concrete over this inadequately expressive API.

I don't like it.

sys/cam/ata/ata_da.c
2350

I think this is a really bad idea. You're putting policy here that doesn't belong. How do we know this is the right thing to do? I don't like it at all.

sys/geom/geom_io.c
212

Why not both clone and duplicate? That's incredibly short-sighted. The comment here was not updated either.

This revision now requires changes to proceed. Nov 24 2017, 1:00 PM
sys/cam/ata/ata_da.c
2350

As already discussed with mav I will remove the timeout change.

sys/geom/geom_io.c
212

I was going to wait until clone and duplicate are converged, https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=223838
Also, losing this flag is not as bad as losing BIO_ORDERED, which duplicate already does.
But if you insist, I will do it now.
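
For reference, propagating the flag in both paths could look roughly like the
sketch below; the mask and the helper are made up for illustration, this is
not the actual sys/geom/geom_io.c code:

    #define G_BIO_INHERITED_FLAGS   (BIO_ORDERED | BIO_NORETRY)

    /* Called from both g_clone_bio() and g_duplicate_bio(). */
    static inline void
    g_inherit_bio_flags(struct bio *child, const struct bio *parent)
    {
            child->bio_flags |= parent->bio_flags & G_BIO_INHERITED_FLAGS;
    }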

In D13224#275374, @imp wrote:

I think this is a terrible idea, at least as implemented. It suffers several problems.

First, it conflates the notion of 'fast fail' with the notion of 'don't retry'. Those are two very different notions.

Retrying takes time, so in my opinion there is a direct correlation between failing fast and (not) retrying.
If this is mostly about the flag name, I can change it, no problem.

Next, it violates the contract that EIO means 'this I/O won't succeed because it's been through the storage-appropriate recovery algorithm', which is mostly true in the tree now.

I do not see how it violates that contract.
"Appropriate recovery algorithm" is too vague.
And still, why shouldn't some code be able to ask for no retries please?
It's that code that sets the flag and it's that code that deals with the EIO, all other layers should just pass it along.

Or maybe I didn't get exactly what you meant.

This muddies the waters at a time when we're trying to clear this up. Some other error code must be returned up the stack.

Please explain, this is too unspecific.

TRYHARD seems to mean 'do more than you normally would' when present, rather than 'give it some half-assed attempt' when absent.

TRYHARD is an internal ZFS flag; they ascribed that meaning to it and it is, perhaps, a misnomer.
But by default they want fast, half-assed attempts, and only when they retry (at the ZFS level) or know that they have exhausted all of their redundant copies do they use this flag as a last resort.

This is somewhat analogous to the advice that ZFS should not be used on top of hardware RAID controllers and that HBAs should be preferred.
ZFS does not want / need too much magic happening below it; it already does all the magic (multiple copies, retries, etc).
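
For context, the illumos vdev_disk logic referred to above looks roughly like
this (quoted from memory, so treat it as approximate rather than verbatim):

    /* vdev_disk_io_start(): the fast, "half-assed" attempt is the default... */
    int flags = (zio->io_type == ZIO_TYPE_READ) ? B_READ : B_WRITE;

    /* ...and is only dropped for retries and "try hard" I/Os. */
    if (!(zio->io_flags & (ZIO_FLAG_IO_RETRY | ZIO_FLAG_TRYHARD)))
            flags |= B_FAILFAST;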

It's basically trying to create multiple classes of I/O with different quality of service and pour concrete over this inadequately expressive API.

The API is expressive enough for me and ZFS :-)
Without the flag - do what you normally do.
With the flag - trust me, I am ZFS, I know how to handle an error.

There is a clear dichotomy between zero retries (zero, nada, none) and whatever retrying is normally done.
If anyone would need a finer grained control, then let's discuss it when that happens.

I don't like it.

I hope that I'll be able to change your opinion.
I don't hope to convince you to start liking it, but at least, to stop objecting to it so strongly.

I've worked on the SD failfast code in the past, and it's implemented a bit differently, but IMHO the idea is very sound and really helps a lot when you have bad disks one way or the other.

When we issue the I/O in the SD driver with failfast set (B_FAILFAST), it is issued as normal. When the I/O comes back with an error (which in our case is set in mpt_sas; not sure about BSD), failfast becomes "armed". If we then see another timeout/error on a disk that has failfast "armed", it fails that I/O but also any outstanding I/O, on the assumption that all the other I/Os would fail as well.

When the queue was empty again, failfast would be cleared so that we don't take out the whole drive (which is what FMA is for). We tested this extensively with a SANBlaze, where we could have the drive fail in all the shapes and forms one could think of. It prevented a single bad drive from slowing down the whole pool, with FMA turned off.

Ideally, however, FMA would have taken out the drive before all this was needed, but drives... well... they do some weird stuff, don't they =) So in general it was there to catch drives in the act that would, one way or the other, not trip the SERD engine within the diagnosis modules of FMA. ZFS would fault the vdev eventually.

Failfast also helped a lot during zpool import, in particular during failovers, as FS timeouts from Linux clients were a real problem. In any case, perhaps this implementation is not 100% there yet, but I'd recommend such a feature. Recently the Linux folks added fail-fast support as well, although not for ZFS, but that has different reasons.

If people are interested, I can look up the original design of failfast in Solaris and share it.
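
In the meantime, here is a rough sketch of the armed/disarmed behaviour
described above; all names are made up and this is not the actual Solaris SD
driver code, just the shape of the state machine:

    #include <stdbool.h>

    struct ff_io;                           /* one outstanding fail-fast I/O */

    /* Made-up helpers standing in for the driver's real completion paths. */
    void    ff_io_complete(struct ff_io *io);
    void    ff_io_fail(struct ff_io *io, int error);
    void    ff_fail_all_queued(int error);

    static bool ff_armed;                   /* set after the first error */
    static int  ff_outstanding;             /* fail-fast I/Os still queued */

    static void
    ff_io_done(struct ff_io *io, int error)
    {
            ff_outstanding--;
            if (error != 0) {
                    ff_io_fail(io, error);
                    if (ff_armed) {
                            /* Another error while armed: assume the rest fail too. */
                            ff_fail_all_queued(error);
                    } else {
                            ff_armed = true;        /* arm on the first error */
                    }
            } else {
                    ff_io_complete(io);
            }
            if (ff_outstanding == 0)
                    ff_armed = false;               /* queue drained: disarm */
    }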

BIO_NORETRY has nothing to do with timeouts

After reading the links and ing.gila_gmail.com's description, it looks like FAILFAST and NORETRY are two different things. FAILFAST will fail all other I/O if one I/O fails. NORETRY will not retry the I/O when the drive tells us to retry it. The ZFS code expects the former semantics, not the NORETRY semantics.

In D13224#275493, @imp wrote:

After reading the links and ing.gila_gmail.com's description, it looks like FAILFAST and NORETRY are two different things.

Yes, I am coming to the same conclusion.

FAILFAST will fail all other I/O if one I/O fails. NORETRY will not retry the I/O when the drive tells us to retry it. The ZFS code expects the former semantics, not the NORETRY semantics.

I do not see where exactly ZFS expects that, in the sense that it really relies on FAILFAST semantics and would become broken with NORETRY semantics.
I think that it would work just fine with NORETRY.

@ing.gila_gmail.com thank you very much for the explanation! It's both interesting and useful. Much appreciated.

Closing this review as I now see many shortcomings in it.
First, 'NO RETRY' is indeed too specific a name and too specific an instruction compared to 'FAIL FAST'.
Second, suppressing retries could be a [barely] good enough first approximation for a certain class of storage devices (spinning HDDs), but for other device types it might be rather wrong.
Third, the naive implementation didn't take into account things like frozen CAM queues, etc.
There are also the other problems pointed out by the reviewers.
So, finally, this change wouldn't fly anyway given the opposition.