
Newly added features and bug fixes in latest Microchip SmartPQI driver.
Needs Review · Public

Authored by papani.srikanth_microchip.com on Apr 5 2021, 2:33 AM.

Details

Summary

It includes:

1) Newly added TMF (task management function) feature.
2) Added new Huawei and Inspur PCI IDs.
3) Fixed smartpqi driver hangs in a ZFS pool (zpool) while running on FreeBSD 12.1.
4) Fixed the kernel flooding dmesg when ioctls are issued while the controller is offline.
5) Avoided unnecessary host memory allocation for rcb SG buffers.
6) Fixed race conditions while accessing the internal rcb structure.
7) Fixed logical volumes exposing two different names to the OS, caused by system memory being overwritten with stale DMA data.
8) Fixed dynamic unloading of the smartpqi driver.
9) Added a device_shutdown callback instead of the deprecated shutdown_final kernel event in the smartpqi driver.
10) Fixed an OS crash during physical drive hot removal under heavy I/O.
11) Fixed an OS crash during controller lockup/offline under heavy I/O.
12) Fixed Coverity issues in the smartpqi driver.
13) Fixed a system crash while creating and deleting logical volumes in a continuous loop.
14) Fixed the volume size not being exposed to the OS when the volume expands.
15) Added HC3 PCI IDs.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 38318
Build 35207: arc lint + arc unit

Event Timeline

With this patch applied, it fails much faster for me, completely locking up ZFS, usually after some minutes of light load (e.g. I was running git gc when this happened):

[ERROR]::[4:655.0][CPU 0][pqisrc_heartbeat_timer_handler][178]:controller is offline
(da1:smartpqi0:0:65:0): WRITE(10). CDB: 2a 00 05 82 4f e0 00 07 e8 00
(da2:smartpqi0:0:66:0): WRITE(10). CDB: 2a 00 05 82 7d 98 00 00 58 00
(da1:smartpqi0:0:65:0): CAM status: Unable to abort CCB request
(da1:smartpqi0:0:65:0): Error 5, Unretryable error
(da1:smartpqi0:0:65:0): WRITE(10). CDB: 2a 00 05 82 67 98 00 07 e8 00
(da1:smartpqi0:0:65:0): CAM status: Unable to abort CCB request
(da1:smartpqi0:0:65:0): Error 5, Unretryable error
...
da0 at smartpqi0 bus 0 scbus0 target 64 lun 0
da0: <ATA WDC WD40PURZ-85A 0A80>  s/n WD-WX32D7088CCV detached
(da1:smartpqi0:0:65:0): Error 5, Unretryable error
(da1:smartpqi0:0:65:0): WRITE(10). CDB: 2a 00 05 82 38 28 00 07 e8 00
(da1:smartpqi0:0:65:0): CAM status: Unable to abort CCB request
(da1:smartpqi0:0:65:0): Error 5, Unretryable error
da1 at smartpqi0 bus 0 scbus0 target 65 lun 0
da1: <ATA WDC WD40PURZ-85A 0A80>  s/n WD-WX42D70CHZS7 detached
(da2:smartpqi0:0:66:0): CAM status: Unable to abort CCB request
(da2:smartpqi0:0:66:0): Error 5, Unretryable error
(da2:smartpqi0:0:66:0): WRITE(10). CDB: 2a 00 05 82 75 b0 00 07 e8 00
(da2:smartpqi0:0:66:0): CAM status: Unable to abort CCB request
(da2:smartpqi0:0:66:0): Error 5, Unretryable error
da2 at smartpqi0 bus 0 scbus0 target 66 lun 0
da2: <ATA WDC WD40PURZ-85A 0A80>  s/n WD-WXC2D90D7YAX detached
da3 at smartpqi0 bus 0 scbus0 target 67 lun 0
da3: <ATA WDC WD40PURZ-85A 0A80>  s/n WD-WX12DB0N8F4X detached
ses0 at smartpqi0 bus 0 scbus0 target 68 lun 0
ses0: <Adaptec Smart Adapter 3.53>  s/n 7A4263EAB3E     detached
pass5 at smartpqi0 bus 0 scbus0 target 1088 lun 1
pass5: <Adaptec 1100-8i 3.53>  s/n 7A4263EAB3E     detached
(ses0:smartpqi0:0:68:0): Periph destroyed
(pass5:smartpqi0:0:1088:1): Periph destroyed
Solaris: WARNING: Pool 'data' has encountered an uncorrectable I/O failure and has been suspended.

Apr 16 11:36:45 sun ZFS[45956]: catastrophic pool I/O failure, zpool=data

Now every FS access is stuck. I was able to save the kernel dump using NMI (in case we can get something interesting from it).

This same system works just fine using the same HBA, same disks, same cabling under illumos (using our internal smartpqi driver), transferring TBs worth of data, so I don't expect this to be a hardware issue.

HBA is 1100-8i, disks are 4x WDC WD40PURZ SATA3 HDDs, connected using breakout cable.

It's a pretty big patch, with too many changes to review easily. I barely looked through 1/8th of it. It would be good if you at least separated formatting changes from substantive ones.

sys/dev/smartpqi/smartpqi_cam.c
272

Why have you decided to terminate those strings? It is allowed, but not required by the standard.

606

style(9) says: "Values in return statements should be enclosed in parentheses." Plus this is inconsistent with other parts of the code.

In D29584#668490, @mav wrote:

It's a pretty big patch, with too many changes to review easily. I barely looked through 1/8th of it. It would be good if you at least separated formatting changes from substantive ones.

That's why I've not looked more deeply at it. OTOH, since this is basically vendor code, I'm inclined to not worry so much about the formatting and similar issues, but they do make it hard to review whether or not new bugs are sneaking in with all the bug fixes and formatting changes. It's tough to know how to balance these issues. Though the reports of lockups here are concerning enough that I think some response to them is needed before proceeding.

In D29584#668508, @imp wrote:
[...]

Yeah, I saw the lockup report came in earlier today. Unfortunately Srikanth is in a different time zone so he will take a look at it as soon as he can. I do understand the code changes got a bit out of hand as we are trying to sync up to our latest code base. We'll see if we can make the patches more palatable. Any suggestions are always appreciated.

[quoting the lockup report above]

Thanks for testing the changes. The lockup issue is new to us. Could you please provide the ZFS pool reproduction steps so we can replicate the setup in our lab for testing?

[quoting the lockup report and the reply above]

Pool configuration is simple -- a raidz of 4 SATA disks connected using a breakout cable, i.e.:

# for i in $(seq 0 3); do gpart create -s gpt da$i; gpart add -t freebsd-zfs da$i; done
da0 created
da0p1 added
da1 created
da1p1 added
da2 created
da2p1 added
da3 created
da3p1 added
# zpool create -O atime=off -O compression=zstd data raidz da0p1 da1p1 da2p1 da3p1
# zpool status data
  pool: data
 state: ONLINE
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            da0p1   ONLINE       0     0     0
            da1p1   ONLINE       0     0     0
            da2p1   ONLINE       0     0     0
            da3p1   ONLINE       0     0     0

errors: No known data errors

I don't know what exactly triggers the issue; usually it's some simple operation like git clone, git pull, git gc and so on.

FWIW, applying this patch via git apply D29584.diff results in a number of trailing whitespace complaints:

/home/tuffli/D29584.diff:70: trailing whitespace.
/home/tuffli/D29584.diff:100: trailing whitespace.
/home/tuffli/D29584.diff:1126: trailing whitespace.
/home/tuffli/D29584.diff:1850: trailing whitespace.
/home/tuffli/D29584.diff:1905: trailing whitespace.
/home/tuffli/D29584.diff:2113: trailing whitespace.
/home/tuffli/D29584.diff:3307: trailing whitespace.
/home/tuffli/D29584.diff:4254: trailing whitespace.
/home/tuffli/D29584.diff:4326: trailing whitespace.
/home/tuffli/D29584.diff:5418: trailing whitespace.
/home/tuffli/D29584.diff:6860: trailing whitespace.
• Added lockup code info in the driver.

Retry the I/Os if there is a lack of DMA resources, instead of deferring them.

[quoting the trailing whitespace report above]

Hi Chuck,
This code is for the FreeBSD 14.0 main branch; you can try it out for FreeBSD 13.0 testing.

[quoting the lockup report, the reply, and the pool configuration above]

Can you please take this patch for testing?