rate limit ctl_process_done warning messages
Needs Review · Public

Authored by asomers on Apr 25 2024, 8:23 PM.

Details

Reviewers

mav
jhb

Group Reviewers

cam

Summary

If a CTL I/O operation takes too long (default 90s) a warning will be
printed to the system console. But that has a cost, and the extra load
can cause other operations to slow down. malloc_large seems to be
particularly affected. This can lead to a positive feedback doom loop:
more CTL warnings means a slower system, which means more CTL warnings.

Break the feedback cycle by rate-limiting these messages to one per
second.

MFC after: 2 weeks
Sponsored by: Axcient
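
For readers unfamiliar with the technique, the "one per second" limit described above can be sketched in a few lines of userland C. This is an illustration only, not the code in this diff: `struct rate_limiter` and `rl_allow` are hypothetical names, and the caller supplies the clock so the logic is easy to exercise. In the kernel, ratecheck(9) and ppsratecheck(9) exist for this purpose; nothing here says which approach the diff takes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical limiter state; not a CTL or kernel structure. */
struct rate_limiter {
	bool	 primed;	/* true once the first event has been allowed */
	time_t	 last;		/* time of the most recently allowed event */
	unsigned suppressed;	/* events dropped so far; caller may reset */
};

/*
 * Return true if an event at time `now` may be reported, allowing at
 * most one event per `interval_sec` seconds.  Dropped events are
 * counted so a later message can say how many were suppressed.
 */
static bool
rl_allow(struct rate_limiter *rl, time_t now, time_t interval_sec)
{
	assert(rl != NULL);

	if (rl->primed && now - rl->last < interval_sec) {
		rl->suppressed++;
		return (false);
	}
	rl->primed = true;
	rl->last = now;
	return (true);
}
```

A caller would wrap the warning in `if (rl_allow(&rl, time(NULL), 1))`. Because the state is a single global-style object, this has the property raised in the discussion below: only the first slow command in each interval is reported.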

Test Plan

manually tested using gnop to artificially delay I/O

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 57375
Build 54263: arc lint + arc unit

Event Timeline

asomers created this revision. Apr 25 2024, 8:23 PM

Herald added a reviewer: cam. Apr 25 2024, 8:23 PM

Herald added a subscriber: imp.

asomers requested review of this revision. Apr 25 2024, 8:23 PM

Harbormaster completed remote builds in B57375: Diff 137692. Apr 25 2024, 8:23 PM

I wonder what your queue depth is, that one message per request per 90 seconds would cause a noticeable storm. Also, per-system limiting makes the output not very useful: since it selects the first message out of many, it says little about LUNs, ports, commands, etc., only that something is wrong. Thinking even more widely, I find messages printed on actual completion not very useful: if there are not just delays but something is really wrong, the commands may never complete, and so the messages may never get printed. I wonder if instead removing all this, and once per second checking the OOA queues for stuck requests and printing some digests, would be more useful.

In D44961#1025278, @mav wrote:

I wonder what your queue depth is, that one message per request per 90 seconds would cause a noticeable storm. Also, per-system limiting makes the output not very useful: since it selects the first message out of many, it says little about LUNs, ports, commands, etc., only that something is wrong. Thinking even more widely, I find messages printed on actual completion not very useful: if there are not just delays but something is really wrong, the commands may never complete, and so the messages may never get printed. I wonder if instead removing all this, and once per second checking the OOA queues for stuck requests and printing some digests, would be more useful.

I don't know what the queue depth per target was, but the server in question had about 700 targets at the time. So an average queue depth per target of as little as 7 could've caused the storm we saw. What is an "OOA queue"?

In D44961#1025280, @asomers wrote:

What is an "OOA queue"?

Order of arrival. CTL has a per-LUN list of all requests currently in progress, for purposes of conflict resolution, etc. You can see them with ctladm dumpooa. Checking the OOA queues would actually allow us to separate the requests causing delays from the innocent victims blocked by them, which is quite a problem for debugging IIRC.
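
To make the suggestion concrete, here is a rough userland sketch of what a periodic scan with a digest might look like. It does not use CTL's real data structures: `struct req`, `struct digest`, and `scan_stuck` are hypothetical, and a plain array stands in for a per-LUN OOA queue. The `blocked` flag is an assumed stand-in for however a request would be marked as waiting on an earlier one.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical in-flight request; not CTL's actual I/O type. */
struct req {
	unsigned lun;		/* LUN the request was issued to */
	time_t	 start;		/* time the request arrived */
	int	 blocked;	/* nonzero if waiting on an earlier request */
};

/* Summary of one scan, small enough to print as a single line. */
struct digest {
	size_t	stuck;		/* requests at least `threshold` seconds old */
	size_t	blockers;	/* stuck requests not blocked by another */
	time_t	oldest_age;	/* age in seconds of the oldest stuck request */
};

/*
 * Walk the queue once and summarize requests older than `threshold`.
 * Intended to be called periodically (e.g. once per second) rather
 * than on completion, so requests that never complete are still seen.
 */
static struct digest
scan_stuck(const struct req *q, size_t n, time_t now, time_t threshold)
{
	struct digest d = { 0, 0, 0 };

	assert(q != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		time_t age = now - q[i].start;

		if (age < threshold)
			continue;
		d.stuck++;
		if (!q[i].blocked)
			d.blockers++;
		if (age > d.oldest_age)
			d.oldest_age = age;
	}
	return (d);
}
```

Counting `blockers` separately from `stuck` is the point of the exercise: it distinguishes the requests that are themselves slow from those merely queued behind them.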

Revision Contents
Changeset List

Path: sys/cam/ctl/ctl.c
Size: 59 lines

Diff 137692
