Page MenuHomeFreeBSD

Fix use after free panic and state transitions in mps(4) and mpr(4)
ClosedPublic

Authored by ken on Jun 26 2020, 8:53 PM.

Details

Summary
Fix mpr(4) and mps(4) state transitions and a use-after-free panic.
 
 When the mpr(4) and mps(4) drivers probe a SATA device, they issue
 an ATA Identify command (via mp{s,r}sas_get_sata_identify()) before
 the target is fully setup in the driver.  The drivers wait for
 completion of the identify command, and have a 5 second timeout.
 If the timeout fires, the command is marked with the
 SATA_ID_TIMEOUT flag so it can be freed later.
 
 That is where the use-after-free problem comes in.  Once the
 ATA Identify times out, the driver sends a target reset, and then
 frees any identify commands that have timed out.  But, once the
 target reset completes, commands that were queued to the drive are
 returned to the driver by the controller.
 
 At that point, the driver (in mp{s,r}_intr_locked()) looks up the
 command descriptor for that particular SMID, marks it CM_STATE_BUSY
 and sends it on for completion handling.
 
 The problem at this stage is that the command has already been
 freed, and put on the free queue, so its state is CM_STATE_FREE.
 If INVARIANTS are turned on, we get a panic as soon as this command
 is allocated, because its state is no longer CM_STATE_FREE, but
 rather CM_STATE_BUSY.
 
 So, the solution is to not free ATA Identify commands that get
 stuck until they actually return from the controller.  Hopefully
 this works correctly on older firmware versions.  If not, it could
 result in commands hanging around indefinitely.  But, the
 alternative is a use-after-free panic or assertion (in the
 INVARIANTS case).
 
 This also tightens up the state transitions between
 CM_STATE_FREE, CM_STATE_BUSY and CM_STATE_INQUEUE, so that the
 state transitions happen once, and we have assertions to make sure
 that commands are in the correct state before transitioning to the
 next state.
 Also, for each state assertion, we print out the current state of
 the command if it is incorrect.
 
 mp{s,r}.c:      Add a new sysctl variable, dump_reqs_alltypes,
                 that controls the behavior of the dump_reqs sysctl.
                 If dump_reqs_alltypes is non-zero, it will dump
                 all commands, not just the commands that are in the
                 CM_STATE_INQUEUE state.  (You can see the commands
                 that are in the queue by using mp{s,r}util debug
                 dumpreqs.)
 
                 Make sure that the INQUEUE -> BUSY state transition
                 happens in one place, the mp{s,r}_complete_command
                 routine.        
 
 mp{s,r}_sas.c:  Make sure we print the current command type in
                 command state assertions.
 
 mp{s,r}_sas_lsi.c:
                 Add a new completion handler,
                 mp{s,r}sas_ata_id_complete.  This completion
                 handler will free data allocated for an ATA
                 Identify command and free the command structure.
 
                 In mp{s,r}_ata_id_timeout, do not set the command
                 state to CM_STATE_BUSY.  The command is still in
                 queue in the controller.  Since we were blocking
                 waiting for this command to complete, there was
                 no completion handler previously.  Set the
                 completion handler, so that whenever the command
                 does come back, it will get freed properly.
 
                 Do not free ATA Identify commands that have timed
                 out in mp{s,r}sas_add_device().  Wait for them
                 to actually come back from the controller.
 
 mp{s,r}var.h:   Add a dump_reqs_alltypes variable for the new
                 dump_reqs_alltypes sysctl.
 
                 Make sure we print the current state for state
                 transition asserts.
Test Plan

To best test this, you need a machine with a SAS expander, and SATA drives attached to the expander. The more drives the better.

Build a kernel with INVARIANTS, and then:

You will be cycling the PHYs for the drives in question on the expander. You can script it, for example:

for i in 10 11 12 13 14 15 16 17 18 19; do while true; do camcontrol smppc ses0 -p $i -o disable; sleep 10; camcontrol smpc ses0 -p $i -o linkreset; sleep 10; done & done

If you run for 24 hours without an assert, you can probably call this fixed.

Once you're done with the test, kill all of the jobs, and then reset the PHYs so that the drives come back. For example:

for i in 10 11 12 13 14 15 16 17 18 19; do camcontrol smppc ses0 -p $i -o linkreset; done

Once all of the drives have come back successfully (make sure that they all come back!), make sure there are not any stuck commands.

To do that:

sysctl dev.mps.0.dump_reqs_alltypes=1
(Use "mpr" for 12Gb controllers, and do that for each controller instance number, 0, 1, 2, etc.)

Then, using either mpsutil or mprutil (6Gb or 12Gb), do this:

mpsutil debug dumpreqs |grep SMID |awk '{if ($4 != 0){print $0}}'

That will show commands that are not free. If you have I/O active, you may see commands that are busy (state 1) and in queue (state 2), but they should be transient. If they're stuck, then we may have a bug.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

ken requested review of this revision.Jun 26 2020, 8:53 PM

I only looked at mpr, but this looks good to me. It's good you have a way to recreate it. I have random machines failing with this (about ~0.001%/day), but can't find one to recreate it. I like this approach, and is similar to the one I took with other commands and target reset. thanks for fixing.

This revision is now accepted and ready to land.Jun 30 2020, 4:33 PM

I'm deploying this to one or two of the machines that we see panics from this every few days.

I've had good luck running this for 2 or 3 firmware cycles now at Netflix... We're down 3/4 in panics, and at least part of the remaining 1/4 appear to be due to a small mismerge of two files so they were out of sync with the rest of the state machine....

what are the plans to commit this?

sys/dev/mpr/mpr.c
2345

Actually, I spoke too soon. I hadn't completely audited the diffs from head and came to a false 'this was a due to a mismerge' conclusion.

Turns out there's a case this assert catches that's legit (though maybe the place to fix it is in the caller).

So, if we can't map in the chain frames properly in mp[rs]_data_cb, then we call mp[rs]_command_complete before we've had a chance to call mp[rs]_enqueue_request which sets the state to INQUEUE. So, since the state is still BUSY, we panic here every time we have a resource shortage that leads to no chains being available. It's rare (1 maybe 2 a week in the Netflix fleet of thousands), but it can happen.

Since this is the only other caller than mp[rs]_inter_locked, maybe the best thing to do is to set the state to INQUEUE in the *data_cb routine in the out of chain frames error path instead...