rate limit ctl_process_done warning messages
Needs Review, Public

Authored by asomers on Thu, Apr 25, 8:23 PM.
Tags
None
Referenced Files
F83135585: D44961.diff
Mon, May 6, 7:42 PM
Details

Reviewers
mav
jhb
Group Reviewers
cam
Summary

If a CTL I/O operation takes too long (default 90s) a warning will be
printed to the system console. But that has a cost, and the extra load
can cause other operations to slow down. malloc_large seems to be
particularly affected. This can lead to a positive feedback doom loop:
more CTL warnings means a slower system, which means more CTL warnings.

Break the feedback cycle by rate-limiting these messages to one per
second.
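
A minimal userland sketch of the once-per-second limiting described above. The names `msg_limiter` and `msg_limiter_allow` are hypothetical and are not taken from the diff. In the kernel, ratecheck(9) or ppsratecheck(9) would be the natural building blocks, and the actual patch may well be structured differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical limiter state; not taken from the actual diff. */
struct msg_limiter {
	int64_t  last_sec;    /* time at which a message was last allowed */
	uint64_t suppressed;  /* messages dropped since the last allowed one */
	bool     primed;      /* false until the first message is allowed */
};

/*
 * Return true if a message may be printed at time now_sec (in seconds).
 * At most one message is allowed per second.  When a message is allowed,
 * *dropped receives the number of messages suppressed since the previous
 * allowed one, so that the caller can report what was skipped.
 */
static bool
msg_limiter_allow(struct msg_limiter *ml, int64_t now_sec, uint64_t *dropped)
{
	if (ml->primed && now_sec - ml->last_sec < 1) {
		ml->suppressed++;
		return (false);
	}
	*dropped = ml->suppressed;
	ml->suppressed = 0;
	ml->last_sec = now_sec;
	ml->primed = true;
	return (true);
}
```

Counting the suppressed messages is a design choice of this sketch only; it lets the one message that does get printed say how many were skipped.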

MFC after: 2 weeks
Sponsored by: Axcient

Test Plan

manually tested using gnop to artificially delay I/O

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 57375
Build 54263: arc lint + arc unit

Event Timeline

I wonder what your queue depth is, so that one message per request per 90 seconds would cause a noticeable storm. Also, per-system limiting makes the output not very useful: because it picks the first message out of many, it says little about LUNs, ports, commands, etc., only that something is wrong. Thinking even wider, I find messages printed on actual completion not very useful, since if the problem is not just a delay but something is really wrong, the commands may never complete and so the messages may never get printed. I wonder if instead removing all this and once per second checking OOA queues for stuck requests and printing some digests would be more useful.

In D44961#1025278, @mav wrote:

I wonder what your queue depth is, so that one message per request per 90 seconds would cause a noticeable storm. Also, per-system limiting makes the output not very useful: because it picks the first message out of many, it says little about LUNs, ports, commands, etc., only that something is wrong. Thinking even wider, I find messages printed on actual completion not very useful, since if the problem is not just a delay but something is really wrong, the commands may never complete and so the messages may never get printed. I wonder if instead removing all this and once per second checking OOA queues for stuck requests and printing some digests would be more useful.

I don't know what the queue depth per target was, but the server in question had about 700 targets at the time. So an average queue depth per target of as little as 7 could've caused the storm we saw. What is an "OOA queue"?
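
Rough arithmetic behind that estimate, as a sketch. It assumes every outstanding request exceeds the 90 s threshold and that completions are spread evenly over time; both are assumptions, not measurements, and the helper name `est_warnings_per_sec` is hypothetical.

```c
/*
 * Estimated warning rate if every in-flight request is late: each request
 * produces one warning per timeout period, so the rate is the number of
 * outstanding requests divided by the timeout.
 */
static double
est_warnings_per_sec(unsigned targets, unsigned depth_per_target,
    unsigned timeout_sec)
{
	return ((double)targets * depth_per_target / timeout_sec);
}
```

With 700 targets, a queue depth of 7, and a 90 s timeout, this gives 4900 outstanding requests and roughly 54 warnings per second.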

What is an "OOA queue"?

Order of arrival. CTL has a per-LUN list of all requests currently in progress, used for conflict resolution, etc. You can see them with ctladm dumpooa. Checking OOA queues would actually allow us to separate the requests actually causing delays from the innocent victims blocked by them, which is quite a problem for debugging IIRC.
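
A rough sketch of what such a periodic scan could compute for one queue. The types `ooa_req` and `ooa_digest` and the function `ooa_scan` are hypothetical, simplified stand-ins and not CTL's real structures; in particular, the single `blocked` flag is only an assumption about how culprits and victims might be told apart.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified request record; not CTL's real structure. */
struct ooa_req {
	int64_t start_sec;  /* arrival time, in seconds */
	bool    blocked;    /* waiting on an earlier request in the queue */
};

/* Per-queue summary that a once-per-second scan might print. */
struct ooa_digest {
	size_t  stuck_running;  /* late and not blocked: likely culprits */
	size_t  stuck_blocked;  /* late but blocked: likely victims */
	int64_t oldest_age;     /* age of the oldest late request, 0 if none */
};

/*
 * Walk one queue and summarize every request that has been outstanding
 * for at least limit_sec seconds.
 */
static struct ooa_digest
ooa_scan(const struct ooa_req *q, size_t n, int64_t now_sec,
    int64_t limit_sec)
{
	struct ooa_digest d = { 0, 0, 0 };

	for (size_t i = 0; i < n; i++) {
		int64_t age = now_sec - q[i].start_sec;

		if (age < limit_sec)
			continue;
		if (q[i].blocked)
			d.stuck_blocked++;
		else
			d.stuck_running++;
		if (age > d.oldest_age)
			d.oldest_age = age;
	}
	return (d);
}
```

Because the scan runs on a timer rather than on completion, it would also report requests that never complete.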