
cam: Don't log invalid cdb errors
Needs Review, Public

Authored by imp on Oct 25 2024, 9:27 PM.

Details

Reviewers
mav
ken
Group Reviewers
cam
Summary

These errors can happen in the normal course of operation, especially
when drives aren't quite standard enough: programs like smartctl can't
know that the commands they are sending are incorrect. smartctl copes
correctly with the error, and logging it gives us no insight into drive
health, so skip it.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 60197
Build 57081: arc lint + arc unit

Event Timeline

imp requested review of this revision. Oct 25 2024, 9:27 PM

The logging of all the SCSI errors was to help us (Netflix) retire disks vs rehab them. This has proven to be a little harder than I thought when Scott did this work originally.

We've come to discover, though, that we have a non-compliant disk. The HGST He8 SAS does not properly support fetching the list of supported log pages by asking for log page 0, subpage 0xff. So we get lots of spam like

system=CAM subsystem=periph type=error device=pass10 serial="XXXXXXXXX" cam_status="0x4cc" scsi_status=2 scsi_sense="72 05 24 00" CDB="4d 00 40 ff 00 00 00 3e fc 00 "  timestamp=1729878619.012464

in our log files. smartctl can't know that this isn't working, but it does respond correctly by ignoring the error (there are comments about how some HGST drives do this in error). There's no benefit from logging this data, except maybe to find this 'bug', which turns out to be unfixable. These drives are on their way out, but we have 150 machines in the field generating about 100k messages like this a day: we pull other SMART data from each drive a couple of times an hour, and the impressive power of multiplication does the rest. smartctl doesn't remember this from poll to poll and always tries to get the supported log pages (unconditionally), so there's no good way to avoid this short of a quirk in the smartmon code to skip that request for older HGST drives.
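For readers decoding the log line above by hand, here is a minimal standalone sketch of what the logged bytes mean. It is not the CAM code from this revision; the struct, macro, and function names are hypothetical, and it assumes descriptor-format sense data (response code 0x72 or 0x73), where bytes 1 to 3 carry the sense key, ASC, and ASCQ. In the logged sample, sense key 0x05 is ILLEGAL REQUEST and ASC/ASCQ 0x24/0x00 is INVALID FIELD IN CDB, while the CDB is a LOG SENSE (opcode 0x4d) for page 0x00, subpage 0xff.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values from the SCSI standards; the macro names are illustrative only. */
#define SENSE_KEY_ILLEGAL_REQUEST	0x05
#define ASC_INVALID_FIELD_IN_CDB	0x24
#define OPCODE_LOG_SENSE		0x4d

/* Hypothetical container for the three fields of interest. */
struct sense_summary {
	uint8_t key;
	uint8_t asc;
	uint8_t ascq;
};

/*
 * Extract key/ASC/ASCQ from descriptor-format sense data.
 * Returns false for short buffers or fixed-format sense, which
 * this sketch deliberately does not handle.
 */
static bool
summarize_descriptor_sense(const uint8_t *sense, size_t len,
    struct sense_summary *out)
{
	uint8_t code;

	assert(sense != NULL && out != NULL);
	if (len < 4)
		return (false);
	code = sense[0] & 0x7f;		/* response code, bits 6..0 */
	if (code != 0x72 && code != 0x73)
		return (false);
	out->key = sense[1] & 0x0f;	/* sense key, bits 3..0 */
	out->asc = sense[2];
	out->ascq = sense[3];
	return (true);
}

/* True when a LOG SENSE CDB asks for page 0x00, subpage 0xff. */
static bool
is_log_sense_supported_subpages(const uint8_t *cdb, size_t len)
{
	assert(cdb != NULL);
	if (len < 10 || cdb[0] != OPCODE_LOG_SENSE)
		return (false);
	/* Byte 2: PC in bits 7..6, page code in bits 5..0; byte 3: subpage. */
	return ((cdb[2] & 0x3f) == 0x00 && cdb[3] == 0xff);
}
```

Feeding in the sense bytes `72 05 24 00` and the CDB from the log line, both helpers should identify the sample as an invalid-field-in-CDB rejection of the supported-subpages request.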

I added logging of passthru commands to catch other issues related to timeouts, so I'd rather make this adjustment than back that out to avoid this error.

I'd personally want to keep these messages with bootverbose. I can imagine it might be handy to see them at times.

I've never used it myself, so I don't have a strong opinion, but as a next step somebody will want to block reservation conflicts, then something else, and again and again...
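The bootverbose suggestion above could look roughly like the following. This is only a sketch under stated assumptions, not the code in this diff: the type and function names are hypothetical, and the `verbose` parameter stands in for the kernel's `bootverbose` flag so that the example compiles on its own.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the real change lives in the CAM error path. */
enum errlog_decision {
	ERRLOG_SKIP,
	ERRLOG_EMIT
};

struct sense_triplet {
	uint8_t key;	/* sense key */
	uint8_t asc;	/* additional sense code */
	uint8_t ascq;	/* additional sense code qualifier */
};

/*
 * Suppress ILLEGAL REQUEST / INVALID FIELD IN CDB (05h/24h/00h)
 * unless verbose logging is on; every other error is still logged.
 */
static enum errlog_decision
decide_error_log(const struct sense_triplet *st, bool verbose)
{
	bool invalid_cdb;

	assert(st != NULL);
	invalid_cdb = (st->key == 0x05 && st->asc == 0x24 &&
	    st->ascq == 0x00);
	if (invalid_cdb && !verbose)
		return (ERRLOG_SKIP);
	return (ERRLOG_EMIT);
}
```

Keeping the test in a single predicate like this would also give one obvious place to discuss any further exclusions, rather than scattering them through the logging code.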