mpr: fix freeze / release mismatch in timeout code
ClosedPublic

Authored by imp on Thu, Nov 18, 6:53 PM.

Details

Summary

If we're processing a timeout and have sent an ABORT to the firmware for it,
but have not yet received the firmware's response, and another timeout
arrives, we queue that timeout and freeze the queue again. However, once
we've finally processed them all, we release the queue only once. This
causes all I/O to halt, because the devq remains frozen forever.

Instead, only freeze the queue when we start the process (e.g., when we set
INRESET on the target). This allows the single release to unfreeze the queue
once all the timed-out I/Os have finished ABORTing.
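
The idea in the summary can be sketched as a small standalone model. This is not the driver code or the actual diff: `struct target`, `on_timeout()` and `on_abort_complete()` are hypothetical names, and `freeze_count` merely stands in for the devq freeze count. It only illustrates freezing once when recovery starts and releasing once when the last ABORT finishes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the driver's per-target state and its devq. */
struct target {
	bool in_reset;     /* models the INRESET flag on the target */
	int  pending_tm;   /* timed-out commands still awaiting an ABORT reply */
	int  freeze_count; /* models the devq freeze count; 0 means I/O flows */
};

/* A command timed out: freeze only when recovery starts. */
static void
on_timeout(struct target *t)
{
	if (!t->in_reset) {
		t->in_reset = true;
		t->freeze_count++;
	}
	t->pending_tm++;
}

/* An ABORT completed: release once, when the last one has finished. */
static void
on_abort_complete(struct target *t)
{
	assert(t->pending_tm > 0);
	if (--t->pending_tm == 0) {
		t->in_reset = false;
		t->freeze_count--;
	}
}
```

With two overlapping timeouts, the model's freeze count goes to 1, stays at 1 after the first ABORT completes, and returns to 0 after the second.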

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

imp requested review of this revision. Thu, Nov 18, 6:53 PM
imp retitled this revision from better freeze / release to mpr: fix freeze / release mismatch in timeout code. Thu, Nov 18, 7:10 PM
imp edited the summary of this revision.
imp added reviewers: mav, scottl, ken.

I believe that this fixes 9781c28c6d63cfa8438d1aa31f512a6b217a6b2b because that's where this xpt_freeze_devq was introduced.

Warner, would you please point out exactly where you see an asymmetry? As I see it, the queue is frozen for every task management request sent to the device inside mprsas_prepare_for_tm() and should be released for each one inside mprsas_free_tm(). Is there a case where those two calls don't match?

PS: And I don't like pulling debug messages into the level enabled by default.

Ah. I think I see. mprsas_logical_unit_reset_complete() and mprsas_abort_complete() reuse the tm by calling mprsas_send_abort() another time for another timed-out command without freeing the tm. Your patch may make sense.

This revision is now accepted and ready to land. Sat, Nov 20, 12:24 AM

Yes. I think you have it.

schedule cm1 and cm2 at the same time.
timeout cm1
send abort cm1
devq freeze
timeout cm2 (put on queue)
abort cm1 finishes
send abort cm2
devq freeze
abort cm2 finishes
devq release

is the sequence I captured in the wild, sometimes with as many as 4 cms at the same time.
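
As a rough illustration of why this sequence leaves the queue stuck, the freeze and release steps can be replayed against a plain counter. This is a hypothetical model, not the driver: `struct devq`, `devq_freeze()` and `devq_release()` are made-up names standing in for the real CAM freeze and release calls.

```c
#include <assert.h>

/* Hypothetical model: one counter standing in for the devq freeze count. */
struct devq {
	int freeze_count;
};

static void devq_freeze(struct devq *q)  { q->freeze_count++; }
static void devq_release(struct devq *q) { q->freeze_count--; }

/* Replays the captured cm1/cm2 sequence and returns the final freeze count. */
static int
replay_captured_sequence(void)
{
	struct devq q = { 0 };

	/* timeout cm1, send abort cm1 */
	devq_freeze(&q);
	/* timeout cm2: put on the queue, nothing sent yet */
	/* abort cm1 finishes, send abort cm2 */
	devq_freeze(&q);
	/* abort cm2 finishes */
	devq_release(&q);

	return (q.freeze_count);
}
```

Two freezes against one release leave the model's count at 1 rather than 0, which corresponds to a devq that never resumes I/O.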

If that's the sequence you see, then all we need to do is chat about the xinfo -> info changes.
I'd like all the freeze / release messages to be at least under recovery, but I could do xinfo | recovery instead.
That helps trace out what happens here.
Any comments on that change?