
zfsd: fault disks that generate too many I/O delay events
ClosedPublic

Authored by asomers on Nov 28 2023, 11:07 PM.

Details

Summary

If ZFS reports that a disk had at least 8 I/O operations over 60s that
were each delayed by at least 30s (implying a queue depth > 5 or I/O
aggregation, obviously), fault that disk. Disks that respond this
slowly can degrade the entire system's performance.

MFC after: 2 weeks
Sponsored by: Axcient
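
As a rough illustration of the criterion above, here is a minimal sliding-window sketch. The class and member names are hypothetical; this is not zfsd's actual implementation, which lives in its CaseFile machinery.

```cpp
// Hypothetical sketch of the delay-fault criterion; illustrative only,
// not zfsd's actual code or API.
#include <cstddef>
#include <ctime>
#include <deque>

class DelayWindow
{
	static constexpr time_t	WINDOW_SECS = 60;	// look-back window
	static constexpr std::size_t FAULT_COUNT = 8;	// delayed I/Os to fault
	std::deque<time_t>	m_events;		// timestamps of delay events

public:
	// Record one "I/O delayed by >= 30s" event and report whether the
	// disk has now crossed the fault threshold: at least 8 such events
	// within any 60s window.
	bool ShouldFault(time_t now)
	{
		m_events.push_back(now);
		// Age out events older than the 60s window.
		while (!m_events.empty() &&
		    now - m_events.front() > WINDOW_SECS)
			m_events.pop_front();
		return (m_events.size() >= FAULT_COUNT);
	}
};
```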

Test Plan

Unit and integration tests added. The change has been run in production for 4 months, where it faults about 0.2% of HDDs annually.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

delphij added inline comments.
cddl/usr.sbin/zfsd/case_file.h:239

Should these be converted to const ints instead of being an enum?

This revision is now accepted and ready to land. Nov 29 2023, 6:47 AM
asomers added inline comments.
cddl/usr.sbin/zfsd/case_file.h:239

Good question. They certainly could be. I suspect that it was @gibbs who initially created them as enums. BTW, in a future commit I'm planning to make these customizable via vdev properties, as suggested by @allanjude . I'll convert them to const int at that time.
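
For readers following along, the difference under discussion looks roughly like this. The constant names and values below are made up for illustration, not the actual contents of case_file.h.

```cpp
// Hypothetical names and values; not the actual case_file.h contents.

// Enum-hack style, common in older C++ code:
enum {
	DELAY_EVENT_COUNT = 8,		// delayed I/Os before faulting
	DELAY_WINDOW_SECS = 60,		// sliding-window length
};

// The const-int alternative delphij suggests; typed, scoped, and easy
// to replace later with a value read from a vdev property:
static constexpr int kDelayEventCount = 8;
static constexpr int kDelayWindowSecs = 60;
```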

imp added a comment.

I'd have liked to see these thresholds be configurable, for people who want to fault things more or less aggressively.
But as is, it's fine; it's a nice-to-have, not something absolutely required.

In D42825#976834, @imp wrote:

I'd have liked to see these thresholds be configurable, for people who want to fault things more or less aggressively.
But as is, it's fine; it's a nice-to-have, not something absolutely required.

Yes, definitely. I'm going to make them configurable next, using Allan Jude's suggested method.

imp added a comment.

FYI: One of the things I have planned for after the first of the year is to start publishing, from the CAM I/O scheduler, latencies that exceed some threshold, either static (> 30s) or dynamic (> 5 MAD from the median). I'd planned on using this stream to accumulate, in real time, a notion of an unresponsive disk, and to use that to inform our system's continued use of the disk. Interesting to see parallel work.
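
For context, the dynamic threshold mentioned here (more than 5 MADs above the median) could be computed along these lines. This is a hand-rolled sketch with hypothetical function names, not the CAM I/O scheduler's actual code.

```cpp
// Hypothetical MAD-based outlier test; not CAM iosched code.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

static double
median(std::vector<double> v)
{
	if (v.empty())
		return (0.0);
	std::sort(v.begin(), v.end());
	std::size_t n = v.size();
	return (n % 2 ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2.0);
}

// True if 'latency' exceeds the median of recent samples by more than
// five median absolute deviations (MADs).
static bool
latency_is_outlier(const std::vector<double> &recent, double latency)
{
	if (recent.empty())
		return (false);
	double med = median(recent);
	std::vector<double> dev;
	dev.reserve(recent.size());
	for (double x : recent)
		dev.push_back(std::fabs(x - med));
	double mad = median(dev);
	return (latency > med + 5.0 * mad);
}
```

A MAD-based cutoff has the nice property of being robust to the outliers it is trying to detect, unlike a mean/standard-deviation cutoff, which the slow I/Os themselves would inflate.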