ena: Add completion descriptor corruption check
Needs Review · Public

Authored by osamaabb_amazon.com on Aug 20 2024, 8:45 AM.
Tags
None
Details

Reviewers
cperciva
Summary

Add a check of the MBZ (Must Be Zero) fields in the
incoming Tx and Rx completion descriptors in order to
identify corrupted descriptors.

Approved by: cperciva
MFC after: 2 weeks
Sponsored by: Amazon, Inc.
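
For readers unfamiliar with the technique, the check described in the summary could look roughly like the sketch below. This is not the code from the diff: the descriptor layout, the mask value, and every `demo_` name are assumptions made for illustration, and the real ENA descriptor fields and reset handling differ.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical completion descriptor; NOT the real ENA layout. */
struct demo_cdesc {
	uint32_t status;	/* assumed: bits 0-23 carry data, bits 24-31 are MBZ */
	uint16_t length;
	uint16_t req_id;
};

/* Assumed mask covering the Must Be Zero bits of the status word. */
#define	DEMO_CDESC_STATUS_MBZ_MASK	0xFF000000u

/* Returns 0 if every MBZ bit is clear, EFAULT if any is set. */
static int
demo_cdesc_check_mbz(const struct demo_cdesc *desc)
{
	if ((desc->status & DEMO_CDESC_STATUS_MBZ_MASK) != 0)
		return (EFAULT);
	return (0);
}

/*
 * Hypothetical caller: on a corrupted descriptor, request a device reset
 * instead of panicking, and tell the caller not to process the descriptor.
 */
static bool
demo_cdesc_is_usable(const struct demo_cdesc *desc, bool *reset_requested)
{
	if (demo_cdesc_check_mbz(desc) != 0) {
		*reset_requested = true;
		return (false);
	}
	return (true);
}
```

The check costs one mask-and-compare per descriptor. A set MBZ bit cannot come from a well-behaved device, so it is treated as evidence of corruption and reported as an error rather than ignored.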

Diff Detail

Repository
rG FreeBSD src repository
Lint
No Lint Coverage
Unit
No Test Coverage
Build Status
Buildable 59066
Build 55953: arc lint + arc unit

Event Timeline

Is returning an error the right response here? My initial reaction is that this should be a kernel panic, but maybe it's easier to track down such faults if the system keeps running?

I feel like a kernel panic is a bit of overkill here. For these cases we currently reset the driver with the ENA_REGS_RESET_RX_DESCRIPTOR_MALFORMED reset reason.
We aim for recovery to maintain network availability.

Ok, your call. I think there's tension here between availability and integrity -- my general position is "if we detected that one thing is corrupted, who knows how much undetected corruption has happened, so we should panic and reboot into a clean state rather than trusting anything about the currently running system". But obviously it's a tradeoff and you know the context of the system better than I do.