
cam/da: Allow read-retry to be disabled
AbandonedPublic

Authored by imp on Feb 4 2025, 6:38 PM.
Tags
None
Subscribers
None

Details

Reviewers
mav
ken
Group Reviewers
cam
Summary

These days, drives try very hard internally to complete read commands,
so retrying is usually futile. Add the ability to turn retries off for
reads to improve read latency on failing media. This leaves the retries
for other commands in place, which can also be good for WRITE errors. So
we have three configurations. Default: we do 5 retries for reads or
writes (or anything else). With this turned on: 5 retries for writes,
but none for READs. With retries set to 0: no retries for anything.
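The three configurations can be modeled in a few lines of standalone C. This is only a sketch of the policy described above, not the code in the diff: `struct retry_cfg`, `retries_for()`, and the `io_kind` enum are hypothetical names chosen for illustration, and the count of 5 is taken from the summary text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; not identifiers from the diff. */
enum io_kind { IO_READ, IO_WRITE, IO_OTHER };

struct retry_cfg {
	int	retry_count;	/* retries for any command (5 per the summary) */
	bool	no_read_retry;	/* models the proposed "no read retry" knob */
};

/* Return how many retries a command of the given kind would get. */
static int
retries_for(const struct retry_cfg *cfg, enum io_kind kind)
{
	if (kind == IO_READ && cfg->no_read_retry)
		return (0);
	return (cfg->retry_count);
}

/* Walk through the three configurations listed in the summary. */
static void
retry_cfg_examples(void)
{
	struct retry_cfg dflt = { 5, false };	/* default */
	struct retry_cfg nord = { 5, true };	/* read retry disabled */
	struct retry_cfg none = { 0, false };	/* retries set to 0 */

	assert(retries_for(&dflt, IO_READ) == 5);
	assert(retries_for(&nord, IO_READ) == 0);
	assert(retries_for(&nord, IO_WRITE) == 5);
	assert(retries_for(&none, IO_OTHER) == 0);
}
```

Keeping the knob separate from the retry count means writes and other commands are unaffected when read retries are disabled.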

Normally, this would be enabled by default. However, it is not the right
choice 100% of the time. Sometimes people's data is so precious that
they want retries by default. Also, some HBAs improperly consume a retry
in congested situations (because the alternative is to never fail in
some fairly rare congestion cases with expanders). But for the most
part, turning it on greatly reduces latency in recovery situations.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 62253
Build 59137: arc lint + arc unit

Event Timeline

Note: There's no good interface for this. I was thinking of having a global tunable kern.cam.da.no_read_retry and a similar per-drive sysctl/tunable.

This isn't quite right, so I'm going to abandon it.

First, we should do this only in daerror, in response to MEDIUM ERROR sense keys. This is easy to fix, though. Some read errors we still want to retry, just not MEDIUM ERRORs.
Second, a meta-analysis of our error logs that I did after I coded this up shows that 99% of the read-error asc/ascq codes we encounter at Netflix on SAS/SATL disks are already tagged as SS_FATAL, so this has much less effect than I'd thought. The remainder are weird in ways that need deeper analysis than I could extract from the data.
There may still be value in doing this, but until I can show up with data, I'm going to discard this change.
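For reference, the narrower rule from the first point (skip the retry only for reads that fail with a MEDIUM ERROR sense key) might look roughly like the standalone sketch below. It is not the daerror code: `read_retry_allowed()` and the `SKETCH_KEY_*` macros are hypothetical names, and only the numeric sense key values (0x03 for MEDIUM ERROR, 0x0B for ABORTED COMMAND) come from the SCSI specification.

```c
#include <assert.h>
#include <stdbool.h>

/* SCSI sense key values; the macro names are local to this sketch. */
#define	SKETCH_KEY_MEDIUM_ERROR		0x03
#define	SKETCH_KEY_ABORTED_COMMAND	0x0B

/*
 * Hypothetical helper: decide whether a failed command may be retried.
 * Only reads that failed with MEDIUM ERROR are denied a retry, and only
 * when the (assumed) no_read_retry knob is set.
 */
static bool
read_retry_allowed(bool is_read, int sense_key, bool no_read_retry)
{
	if (is_read && no_read_retry &&
	    sense_key == SKETCH_KEY_MEDIUM_ERROR)
		return (false);
	return (true);
}

/* A few cases showing which errors keep their retries. */
static void
read_retry_examples(void)
{
	/* Read + MEDIUM ERROR + knob set: no retry. */
	assert(!read_retry_allowed(true, SKETCH_KEY_MEDIUM_ERROR, true));
	/* Other read errors are still retried. */
	assert(read_retry_allowed(true, SKETCH_KEY_ABORTED_COMMAND, true));
	/* Writes are unaffected by the knob. */
	assert(read_retry_allowed(false, SKETCH_KEY_MEDIUM_ERROR, true));
}
```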