Page MenuHomeFreeBSD

cam: don't lock while handling an AC_UNIT_ATTENTION (da)
ClosedPublic

Authored by rew on Dec 23 2020, 7:40 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jan 7, 10:58 PM
Unknown Object (File)
Sun, Jan 5, 9:33 AM
Unknown Object (File)
Dec 5 2024, 5:00 PM
Unknown Object (File)
Dec 3 2024, 11:41 AM
Unknown Object (File)
Oct 21 2024, 4:26 PM
Unknown Object (File)
Sep 25 2024, 11:33 AM
Unknown Object (File)
Sep 24 2024, 9:23 AM
Unknown Object (File)
Sep 24 2024, 5:15 AM
Subscribers
None

Details

Summary

Don't take the device_mtx lock in daasync() when handling an
AC_UNIT_ATTENTION. Instead, assert the lock is held before modifying the
periph's softc flags.

The device_mtx lock is taken in xptdevicetraverse() before daasync()
is eventually called in xpt_async_bcast().

What I found was that even though the periphs are different
("da0" and "probe0", in my case), they are attached to the same device and
share the same device_mtx.

remnants of PR: 240917, 226510, 226578

I'll add the following for completeness, but not sure if phabricator is the
right place for it..

This patch fixes the panic:

panic: _mtx_lock_sleep: recursed on non-recursive mutex CAM device lock @ /usr/src/sys/cam/scsi/scsi_d
                                                                            
cpuid = 0                                                                   
time = 1607456465                                                           
KDB: stack backtrace:                                                       
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0001d2b750
vpanic() at vanic+0x181/frame 0xfffffe0001d2b7a0                            
panic() at panic+0x43/frame 0xfffffe0001d2b800                              
__mtx_lock_sleep() at __mtx_lock_sleep+0x47a/frame 0xfffffe0001d2b890       
__mtx_lock_flags() at __mtx_lock_flags+0xe5/frame 0xfffffe0001d2b8e0           
daasync() at daasync+0x4dc/frame 0xfffffe0001d2b930                          
xpt_async_process_dev() at xpt_async_process_dev+0x152/frame 0xfffffe0001d2b980
xptdevicetraverse() at xptdevicetraverse+0xa5/frame 0xfffffe0001d2b9d0       
xpttargettraverse() at xpttargettraverse+0x6b/frame 0xfffffe0001d2ba10       
xpt_async_process() at xpt_async_process+0x2ed/frame 0xfffffe0001d2bb20     
xpt_done_process() at xpt_done_process+0x3a3/frame 0xfffffe0001d2bb60        
xpt_done_td() at xpt_done_td+0xf5/frame 0xfffffe0001d2bbb0                  
fork_exit() at fork_exit+0x80/frame 0xfffffe0001d2bbf0                         
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0001d2bbf0           
--- trap 0, rip = 0, rsp = 0, rbp = 0 ---                                   
KDB: enter: panic                                                                       
[ thread pid 4 tid 100026 ]                                                 
Stopped at      kdb_enter+0x37: movq    $0,0x10aa136(%rip)                  
db>

To reproduce, setup a LUN backed by a zvol, (in ctl.conf):

portal-group pg1 {                                                           
    discovery-auth-group no-authentication                                  
    listen 0.0.0.0                                                           
}                                                                            
                                                                            
target iqn.2012-18.com.example:target155 {                                  
    auth-group no-authentication                                            
    portal-group pg1                                                        
                                                                            
    lun 0 {                                                                 
        path /dev/zvol/zroot/testvol                                        
        size 100M                                                           
    }                                                                       
}

Start ctld:

% service ctld start

Turn on the camsim port twice, the second time will trigger panic:

% ctladm port -p 0 -o on                                                    
% ctladm port -p 0 -o on

output =>

(probe0:camsim0:0:0:0): INQUIRY data has changed                            
-- panic triggered --

Another way, by modifying LUN size (panic triggered by the rescan):

% ctladm modify -b block -l 0 -s 99M                                        
% camcontrol rescan all

output =>

(probe0:camsim0:0:0:0): Capacity data has changed                           
-- panic triggered --
Test Plan

repeat above commands without triggering a panic

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 35639
Build 32530: arc lint + arc unit

Event Timeline

rew requested review of this revision.Dec 23 2020, 7:40 AM

Let's tag in Scott as well, because CAM

ACK, I'll look at it over the weekend.

I believe this is probably right. The comment deleted was one I added as I was auditing things...
I didn't have good test case to check.

This revision is now accepted and ready to land.Jun 13 2021, 7:49 PM