Page MenuHomeFreeBSD

Fix bugs in recovery path and improve cm tracking
ClosedPublic

Authored by imp on Jun 27 2019, 10:28 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Dec 4, 2:08 AM
Unknown Object (File)
Tue, Nov 26, 2:44 PM
Unknown Object (File)
Oct 31 2024, 8:42 PM
Unknown Object (File)
Oct 2 2024, 2:53 PM
Unknown Object (File)
Sep 20 2024, 9:48 PM
Unknown Object (File)
Sep 20 2024, 9:48 PM
Unknown Object (File)
Sep 20 2024, 9:48 PM
Unknown Object (File)
Sep 20 2024, 9:47 PM
Subscribers
None

Details

Summary

Eliminate the TIMEDOUT state. This state really conveyed two different
concepts: I timed out during recovery (and my command got put on the
recovery queue), and I timed out diring discovery (which doesn't).
Separate those two concepts into two flags. Use the TIMEDOUT flag to
fail requests as timed out. Use the on queue flag to remove them from
the queue.

In mps_intr_locked for MPI2_RPY_DESCRIPT_FLAGS_ADDRESS_REPLY message
type, when completing commands, ignore the ones that are not in state
INQUEUE. They were already completed as part of the recovery
process. When we complete them twice, we wind up with entries on the
free queue that are marked as busy, trigging asserts.

Add the state we came from when a bad state assert happens.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There's three fixes here: one is to remove TIMEDOUT as a state and moves it to a couple of bits
one is to only remove the cm from the deferred list when its been inserted
and one is to fix the race that I called out separately.

sys/dev/mps/mps.c
2484–2500 ↗(On Diff #59128)

This is, I think the source of the corruption.
Sometimes cm_state here is MPS_CM_STATE_FREE, which I think causes problems because that creates 'busy' cm's in the free list, which hits a panic.

Here's the boot log from a machine that now survives.

mps0: <Avago Technologies (LSI) SAS2008> port 0xe000-0xe0ff mem 0xfb900000-0xfb903fff,0xfb880000-0xfb8bffff irq 26 at device 0.0 on pci1
mps0: Firmware: 15.00.00.00, Driver: 21.02.00.00-fbsd
mps0: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
...
mps1: <Avago Technologies (LSI) SAS2008> port 0xd000-0xd0ff mem 0xfb300000-0xfb303fff,0xfb280000-0xfb2bffff irq 34 at device 0.0 on pci3
mps1: Firmware: 15.00.00.00, Driver: 21.02.00.00-fbsd
mps1: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
...
mps2: <Avago Technologies (LSI) SAS2008> port 0xc000-0xc0ff mem 0xfad00000-0xfad03fff,0xfac80000-0xfacbffff irq 40 at device 0.0 on pci4
mps2: Firmware: 15.00.00.00, Driver: 21.02.00.00-fbsd
mps2: IOCCapabilities: 1285c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,HostDisc>
...
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe03150916c0 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091818 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091970 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091ac8 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091c20 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe03150922d8 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315092430 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315092588 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe03150926e0 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315092838 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_add_device: sending Target Reset for stuck SATA identify command (cm = 0xfffffe03150916c0)
mps2: Sending target reset to target 0
Free state for ADDRESS_REPLY status, ignoring
mps2: mpssas_action_scsiio: Freezing devq for target ID 0
(probe0:mps2:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:mps2:0:0:0): CAM status: CAM subsystem is busy
(probe0:mps2:0:0:0): Retrying command, 3 more tries remain
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
Free state for ADDRESS_REPLY status, ignoring
mps2: target reset status 0x0 code 0x0 count 9
mps2: Finished reset recovery for target 0
mps2: Unfreezing devq for target ID 0
mps2: Controller reported scsi ioc terminated tgt 0 SMID 269 loginfo 31111000
(probe0:mps2:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:mps2:0:0:0): CAM status: CCB request completed with an error
(probe0:mps2:0:0:0): Retrying command, 2 more tries remain
run_interrupt_driven_hooks: still waiting after 120 seconds for xpt_config
mps2: Controller reported scsi ioc terminated tgt 0 SMID 270 loginfo 31111000
(probe0:mps2:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:mps2:0:0:0): CAM status: CCB request completed with an error
(probe0:mps2:0:0:0): Retrying command, 1 more tries remain
run_interrupt_driven_hooks: still waiting after 180 seconds for xpt_config
mps2: Controller reported scsi ioc terminated tgt 0 SMID 271 loginfo 31111000
(probe0:mps2:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:mps2:0:0:0): CAM status: CCB request completed with an error
(probe0:mps2:0:0:0): Retrying command, 0 more tries remain
mps2: Controller reported scsi ioc terminated tgt 0 SMID 272 loginfo 31111000
(probe0:mps2:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:mps2:0:0:0): CAM status: CCB request completed with an error
(probe0:mps2:0:0:0): Error 5, Retries exhausted
da2 at mps0 bus 0 scbus7 target 1 lun 0
da2: <ATA HGST HUS726060AL T1E2> Fixed Direct Access SPC-4 SCSI device
da2: Serial Number AR31001EV1AVUC
da2: 600.000MB/s transfers
da2: Command Queueing enabled
da2: 5723166MB (11721045168 512 byte sectors)
da1 at mps0 bus 0 scbus7 target 0 lun 0
da1: <ATA HGST HUS726060AL T1E2> Fixed Direct Access SPC-4 SCSI device
da1: Serial Number AR31001EV1A8YC
da1: 600.000MB/s transfers
da1: Command Queueing enabled
da1: 5723166MB (11721045168 512 byte sectors)
...

So we get 10 timeouts and have 9 cm's in the free state. We have to do the Target Reset after these timeouts. Unsure why the mismatch, unless the commands complete 'eventually' because we didn't properly cancel them and that's what's causing the issue? Unsure.

Here's a slightly better debug run, with printing of the cm address:

mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe03150916c0 sc 0xfffffe0069f13000
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091818 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091970 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091ac8 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315091c20 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe03150922d8 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315092430 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
run_interrupt_driven_hooks: still waiting after 60 seconds for xpt_config
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315092588 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe03150926e0 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_ata_id_timeout ATA ID command timeout cm 0xfffffe0315092838 sc 0xfffffe0069f13000 [****]
ata_id_timeout: state is 3
mps2: Request for SATA PASSTHROUGH page completed with error 60
mps2: Sleeping 3 seconds after SATA ID error to wait for spinup
mps2: mpssas_add_device: sending Target Reset for stuck SATA identify command (cm = 0xfffffe03150916c0)
mps2: Sending target reset to target 0
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe03150916c0
mps2: mpssas_action_scsiio: Freezing devq for target ID 0
(probe0:mps2:0:0:0): INQUIRY. CDB: 12 00 00 00 24 00
(probe0:mps2:0:0:0): CAM status: CAM subsystem is busy
(probe0:mps2:0:0:0): Retrying command, 3 more tries remain
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe0315092838
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe03150926e0
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe0315092588
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe0315092430
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe03150922d8
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe0315091c20
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe0315091ac8
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe0315091970
mps2: Bad state for ADDRESS_REPLY status, ignoring state 0 cm 0xfffffe0315091818
mps2: target reset status 0x0 code 0x0 count 9
mps2: Finished reset recovery for target 0
mps2: Unfreezing devq for target ID 0

All the Bad States CMs are marked with [****] in the retry output.

imp retitled this revision from Grab bag of "fixes" for mps issue to Fix bugs in recovery path and improve cm tracking.Jun 28 2019, 3:55 PM
imp edited the summary of this revision. (Show Details)
imp added a reviewer: scottl.
sys/dev/mps/mps.c
2453 ↗(On Diff #59160)

This case doesn't need the protection because cm->cm_state isn't set to BUSY.

sys/dev/mps/mps_sas.c
1605 ↗(On Diff #59160)

before we used the TIMEDOUT state as a proxy for this condition.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 8 2019, 8:20 PM
This revision was automatically updated to reflect the committed changes.