cam: Don't log invalid cdb errors
Needs Review · Public

Authored by imp on Oct 25 2024, 9:27 PM.

Details

Reviewers

mav
ken

Group Reviewers

cam

Summary

These errors can happen in the normal course of operation, especially
when drives aren't quite standard enough, so programs like smartctl
can't know the commands they are sending are incorrect. smartctl copes
correctly with them, and logging them won't give us insight into drive
health, so skip them.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 60197
Build 57081: arc lint + arc unit

Event Timeline

imp created this revision. Oct 25 2024, 9:27 PM

Herald added a reviewer: cam. Oct 25 2024, 9:27 PM

imp requested review of this revision. Oct 25 2024, 9:27 PM

Harbormaster completed remote builds in B60197: Diff 145469. Oct 25 2024, 9:27 PM

imp added reviewers: mav, ken. Oct 25 2024, 9:30 PM

The logging of all the SCSI errors was to help us (Netflix) retire disks vs rehab them. This has proven to be a little harder than I thought when Scott did this work originally.

We've come to discover, though, that we have a non-compliant disk. The HGST He8 SAS does not properly support getting the list of supported log pages by asking for log page 0, subpage 0xff. So we get lots of spam like

system=CAM subsystem=periph type=error device=pass10 serial="XXXXXXXXX" cam_status="0x4cc" scsi_status=2 scsi_sense="72 05 24 00" CDB="4d 00 40 ff 00 00 00 3e fc 00 "  timestamp=1729878619.012464

in our log files. smartctl can't know that this isn't working, but it does respond correctly by ignoring the error (there are comments about how some HGST drives do this in error). There's no benefit from logging this data, except maybe to find this 'bug', which turns out to be unfixable. These drives are on their way out, but we have 150 machines in the field generating about 100k messages like this a day, because the failing request is repeated every time we pull other SMART data from the drive a couple of times an hour, and the impressive power of multiplication does the rest. smartctl doesn't remember this from poll to poll and always tries to get the supported log pages (unconditionally), so there's no good way to avoid this, short of a quirk in the smartmon code to avoid doing that for older HGST drives.
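For readers who don't have the SCSI tables memorized, the log line above can be decoded by hand. The following is a minimal sketch, not code from the diff: the helper and struct names (parse_hex_bytes, decode_log_sense, decode_desc_sense and friends) are made up for illustration, and it assumes the standard LOG SENSE (opcode 0x4d) CDB layout and descriptor-format sense data (response code 0x72 or 0x73).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse space-separated hex bytes ("72 05 24 00"). */
static size_t
parse_hex_bytes(const char *s, uint8_t *buf, size_t max)
{
    size_t n = 0;
    char *end;

    while (n < max) {
        unsigned long v = strtoul(s, &end, 16);
        if (end == s)
            break;              /* nothing left to convert */
        buf[n++] = (uint8_t)v;
        s = end;
    }
    return (n);
}

/* Hypothetical struct: the LOG SENSE fields of interest. */
struct log_sense_fields {
    uint8_t  page_control;      /* byte 2, bits 7-6 */
    uint8_t  page_code;         /* byte 2, bits 5-0 */
    uint8_t  subpage;           /* byte 3 */
    uint16_t alloc_len;         /* bytes 7-8, big endian */
};

/* Returns 0 on success, -1 if this is not a 10-byte LOG SENSE CDB. */
static int
decode_log_sense(const uint8_t *cdb, size_t len, struct log_sense_fields *out)
{
    if (len < 10 || cdb[0] != 0x4d)
        return (-1);
    out->page_control = cdb[2] >> 6;
    out->page_code = cdb[2] & 0x3f;
    out->subpage = cdb[3];
    out->alloc_len = (uint16_t)((cdb[7] << 8) | cdb[8]);
    return (0);
}

/* Hypothetical struct: the first four bytes of descriptor-format sense. */
struct sense_fields {
    uint8_t key;                /* byte 1, low nibble */
    uint8_t asc;                /* byte 2 */
    uint8_t ascq;               /* byte 3 */
};

/* Returns 0 on success, -1 if the data is not descriptor-format sense. */
static int
decode_desc_sense(const uint8_t *s, size_t len, struct sense_fields *out)
{
    uint8_t rc;

    if (len < 4)
        return (-1);
    rc = s[0] & 0x7f;
    if (rc != 0x72 && rc != 0x73)
        return (-1);
    out->key = s[1] & 0x0f;
    out->asc = s[2];
    out->ascq = s[3];
    return (0);
}
```

Fed the CDB and scsi_sense strings from the message above, this yields page code 0x00 with subpage 0xff (the "supported log pages and subpages" request) and sense key 0x05 (ILLEGAL REQUEST) with ASC/ASCQ 0x24/0x00 (invalid field in CDB), which is the error this revision is about.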

I added logging of passthru commands to catch other issues related to timeouts, so I'd rather make this adjustment than back that out to avoid this error.

I'd personally want to keep these messages under bootverbose. I can imagine it might be handy to see them at times.

I've never used it myself, so I don't have a strong opinion, but as a next step somebody will want to block reservation conflicts, then something else, and again and again...
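To make the bootverbose suggestion concrete, here is a rough sketch of the policy being discussed. It is not the code in the diff: sketch_should_log and the SKETCH_* constants are hypothetical names, the verbose argument stands in for the kernel's bootverbose flag, and it assumes that only ILLEGAL REQUEST with ASC/ASCQ 0x24/0x00 (the case in the log line above) counts as an "invalid cdb" error.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constants for this sketch only. */
#define SKETCH_KEY_ILLEGAL_REQUEST   0x05
#define SKETCH_ASC_INVALID_CDB_FIELD 0x24
#define SKETCH_ASCQ_INVALID_CDB_FIELD 0x00

/*
 * Decide whether an error with the given sense data should be logged.
 * Invalid-field-in-CDB errors are suppressed unless verbose is set;
 * every other error is always logged.
 */
static bool
sketch_should_log(uint8_t sense_key, uint8_t asc, uint8_t ascq, bool verbose)
{
    bool invalid_cdb;

    invalid_cdb = sense_key == SKETCH_KEY_ILLEGAL_REQUEST &&
        asc == SKETCH_ASC_INVALID_CDB_FIELD &&
        ascq == SKETCH_ASCQ_INVALID_CDB_FIELD;

    if (invalid_cdb && !verbose)
        return (false);
    return (true);
}
```

Keeping the decision in one predicate like this would also give the "and again and again" cases (reservation conflicts and so on) a single place to live if they ever get added, though whether that is desirable is exactly the open question here.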

Revision Contents

Path: sys/cam/cam_periph.c
Size: 20 lines

Diff 145469
