mpr: fix freeze / release mismatch in timeout code
Closed · Public

Authored by imp on Nov 18 2021, 6:53 PM.

Details

Reviewers

slm
mav
scottl
ken

Commits

rGde8bb30885a4: mpr: fix freeze / release mismatch in timeout code
rGa8837c77efd0: mpr: fix freeze / release mismatch in timeout code

Summary

So, if we're processing a timeout, and we've sent an ABORT to the firmware
for that timeout, but not yet received the response from the firmware, AND
we get another timeout, we queue the timeout and freeze the queue. However,
when we've finally processed them all, we only release the queue once. This
causes all I/O to halt as the devq remains frozen forever.

Instead, only freeze the queue when we start the process (e.g., when we set
INRESET on the target). The single freeze then pairs with the single release
that happens when all the timed-out I/Os have finished ABORTing.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

imp created this revision. Nov 18 2021, 6:53 PM

Herald added a reviewer: slm. Nov 18 2021, 6:53 PM

imp requested review of this revision. Nov 18 2021, 6:53 PM

Harbormaster completed remote builds in B42872: Diff 98704. Nov 18 2021, 6:53 PM

darkfiberiru_gmail.com added a subscriber: darkfiberiru_gmail.com. Nov 18 2021, 6:56 PM

imp retitled this revision from "better freeze / release" to "mpr: fix freeze / release mismatch in timeout code". Nov 18 2021, 7:10 PM

imp edited the summary of this revision.

imp added reviewers: mav, scottl, ken.

I believe that this fixes 9781c28c6d63cfa8438d1aa31f512a6b217a6b2b because that's where this xpt_freeze_devq was introduced.

Warner, would you please point out exactly where you see an asymmetry? As I see it, the queue is frozen for every task management request sent to the device inside mprsas_prepare_for_tm() and should be released for each one inside mprsas_free_tm(). Is there a case where those two calls don't match?

PS: And I don't like pulling debug messages into the level enabled by default.

Ah. I think I see. mprsas_logical_unit_reset_complete() and mprsas_abort_complete() reuse the tm by calling mprsas_send_abort() again for another timed-out command without freeing the tm. Your patch may make sense.

This revision is now accepted and ready to land. Nov 20 2021, 12:24 AM

Yes. I think you have it.

schedule cm1 and cm2 at the same time.
timeout cm1
send abort cm1
devq freeze
timeout cm2 (put on queue)
abort cm1 finishes
send abort cm2
devq freeze
abort cm2 finishes
devq release

is the sequence I captured in the wild, sometimes with as many as 4 cm timing out at the same time.
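
The imbalance in the sequence above can be reproduced with a small stand-alone model. This is only a sketch, not the driver code: `model_target`, `model_send_abort()`, `model_recovery_done()` and `leftover_freezes()` are hypothetical names invented for illustration, the counter merely stands in for the devq freeze count, and the `in_reset` flag stands in for INRESET on the target. It assumes, as in the captured trace, that the aborts are sent one after another and the queue is released once at the end.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one target; not the real mpr(4) structures. */
struct model_target {
	int  freeze_count;	/* stands in for the devq freeze count */
	bool in_reset;		/* stands in for INRESET on the target */
};

enum freeze_policy {
	FREEZE_PER_ABORT,		/* old behavior: freeze on every abort sent */
	FREEZE_ON_RECOVERY_START	/* patched behavior: freeze only once */
};

/* Model of sending an ABORT for one timed-out command. */
static void
model_send_abort(struct model_target *t, enum freeze_policy policy)
{
	if (policy == FREEZE_PER_ABORT || !t->in_reset)
		t->freeze_count++;
	t->in_reset = true;
}

/* Model of the single release once every queued abort has completed. */
static void
model_recovery_done(struct model_target *t)
{
	t->in_reset = false;
	if (t->freeze_count > 0)
		t->freeze_count--;
}

/*
 * Replay "ntimeouts" commands timing out during one recovery and return
 * the freeze count left behind. Anything above zero means the queue
 * stays frozen and I/O halts.
 */
int
leftover_freezes(int ntimeouts, enum freeze_policy policy)
{
	struct model_target t = { 0, false };

	for (int i = 0; i < ntimeouts; i++)
		model_send_abort(&t, policy);
	if (ntimeouts > 0)
		model_recovery_done(&t);
	return (t.freeze_count);
}
```

In this model, `leftover_freezes(2, FREEZE_PER_ABORT)` returns 1, which matches the two freezes and one release in the trace, while `leftover_freezes(2, FREEZE_ON_RECOVERY_START)` returns 0.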

If that's the sequence you see, then all we need to do is chat about the xinfo -> info changes.
I'd like all the freeze / release messages to be at least under recovery, but I could do xinfo | recovery instead.
That helps trace out what happens here.
Comments on that move?

Closed by commit rGa8837c77efd0: mpr: fix freeze / release mismatch in timeout code (authored by imp). Nov 21 2021, 4:00 PM

This revision was automatically updated to reflect the committed changes.

imp added a commit: rGa8837c77efd0: mpr: fix freeze / release mismatch in timeout code.

imp added a commit: rGde8bb30885a4: mpr: fix freeze / release mismatch in timeout code. Dec 6 2021, 3:58 PM