Page MenuHomeFreeBSD

aacraid: log DMA failures
Needs ReviewPublic

Authored by luporl on Jun 17 2021, 12:50 PM.
Tags
None
Referenced Files
F81585198: D30799.id96021.diff
Thu, Apr 18, 12:47 PM
Unknown Object (File)
Sat, Apr 13, 4:07 AM
Unknown Object (File)
Feb 27 2024, 9:15 AM
Unknown Object (File)
Jan 23 2024, 7:11 AM
Unknown Object (File)
Jan 22 2024, 11:32 AM
Unknown Object (File)
Dec 23 2023, 1:08 AM
Unknown Object (File)
Dec 14 2023, 10:30 PM
Unknown Object (File)
Nov 22 2023, 11:42 AM

Details

Summary

Log an error message when bus_dmamap_load() fails to load the
buffer to DMA area and invokes the callback function with an
error.

This makes it much easier to figure out what is going wrong and why a read/write command failed.

It is hard to abort the I/O operation at aacraid DMA callback and propagate the error to upper layers, but at least we can log the error and help whoever is trying to debug an issue.

Test Plan

Tested on a Talos II machine with an aacraid controller.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 39947
Build 36836: arc lint + arc unit

Event Timeline

sys/dev/aacraid/aacraid.c
1302

Does this need a rate limiter?

Address review's comments

luporl added inline comments.
sys/dev/aacraid/aacraid.c
1302

Yes, it could generate a lot of messages.
Limiting it to 1 message in each 10 seconds (mostly because I prefer it to be displayed more often than not and this error end up compromising the system in a short time).

@imp I've added the rate limiter, do you think this revision is ok now?

scottl requested changes to this revision.Jul 23 2021, 2:42 PM

I've put rate limiters into other drivers and have been unhappy with the results. Whether it's once a second, 10 seconds, 100 seconds, or whatever, an unattended system will wind up with thousands of messages in its logs that obfuscate more important messages. I think a better tactic is to print a message once, and maintain a sysctl with a counter.

This revision now requires changes to proceed.Jul 23 2021, 2:42 PM

Address reviewer's comments:

  • log DMA load buffer error message only once for each device
  • add a per device error counter, exported via sysctl (inspired by mpr code)
sys/dev/aacraid/aacraid.c
227–229

it looks like this was never used before

254

does it make sense to put this under dev.aacraid instead? I think the sysctl setup is simplified by doing so

sys/dev/aacraid/aacraid.c
254

Correct, it should be dev.aacraid.

luporl added inline comments.
sys/dev/aacraid/aacraid.c
254

Ok, it's now always added to dev.aacraid and setup is skipped if for some reason sysctl_ctx or sysctl_tree is NULL.

luporl added inline comments.
sys/dev/aacraid/aacraid.c
227–229

It's really not used. Do you think it's better to remove it, or just ignore it?

sys/dev/aacraid/aacraid.c
227–229

I would delete it in a separate commit, either before or after this change.

Is this change good to go now?

sys/dev/aacraid/aacraid.c
227–229

Ok, I'll delete it after this change.

markj added inline comments.
sys/dev/aacraid/aacraid.c
1310

Isn't it possible that a bogus command will trigger controller firmware bugs and command timeouts? IMHO it is better to abort the command at this stage, though I'm not sure yet how to do that in this driver.

sys/dev/aacraid/aacraid.c
1310

Yes, I think it's possible. I agree that it would be better to abort the command here. That was my plan, until I realized this wasn't a simple task. I guess we would need to replicate part of what is being done in the interrupt handler, to clean up and wake up who is waiting on this command to complete. But I'm afraid I won't have cycles available to implement proper command abort.

In any case, we already allow bogus commands with current code and this change at least warns about DMA errors.