Like the previous generations, Samsung 860 EVO SSDs have a broken NCQ TRIM command and report 512 byte sectors while being optimized for 4k.
Details
Diff Detail
- Repository: rS FreeBSD src repository - subversion
- Lint: Lint Skipped
- Unit: Tests Skipped
Event Timeline
How were you able to determine this?
The latest version of Linux doesn't have this quirk.
It appears I was mistaken, and I'm inclined to suspect the AMD SB950 AHCI SATA controller is the problem. In that case, I can remove the NCQ_TRIM_BROKEN quirk from this diff, but it would still be nice to have the 4K quirk entry. Should I create a new diff for that or is it OK to simply update this one?
Explanation
While replacing the Samsung 850 EVO SSDs in my system with Samsung 860 EVO SSDs, I started to notice error messages from the kernel about the new drives.
"Uncorrectable parity/CRC error" (among other errors, screenshot attached) appeared during ZFS resilvering.
Only the new drives were causing errors. I tested with different cables, swapped ports, etc.
The system has two AHCI SATA controllers:
- AMD SB7x0/SB8x0/SB9x0 AHCI SATA controller (specifically, SB950)
- ASMedia ASM1062 AHCI SATA controller
I see errors on the 860 drives (but not the 850 drives) when they are attached to the AMD SB950 controller. When I attach the same drives to the ASMedia controller and run a scrub, there are no errors. The ASMedia controller has its own quirk to disable NCQ, and I assumed that was masking the same broken NCQ TRIM command that affected the 830, 840, 845, and 850.
I added loader tunables to test the quirks added by this diff:
kern.cam.ada.0.quirks=0x3
kern.cam.ada.1.quirks=0x3
I confirmed the tunables took effect by observing that sysctl kern.cam.ada.*.delete_method changed from NCQ_DSM_TRIM to DSM_TRIM for the new drives after a reboot.
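For readers unfamiliar with the tunable, here is a minimal sketch of how the 0x3 value breaks down into the two quirks discussed in this review. The bit assignments (0x01 for 4K, 0x02 for NCQ_TRIM_BROKEN) are an assumption based on the ada_quirks enum in sys/cam/ata/ata_da.c and should be checked against the tree in use; decode_quirks is a hypothetical helper, not part of FreeBSD.

```python
# Assumed bit values, taken from memory of the ada_quirks enum in
# sys/cam/ata/ata_da.c. Verify them against your source tree.
ADA_QUIRKS = {
    0x01: "4K",
    0x02: "NCQ_TRIM_BROKEN",
}


def decode_quirks(value):
    """Return the names of the known quirk bits that are set in value."""
    names = [name for bit, name in sorted(ADA_QUIRKS.items()) if value & bit]
    # Report any bits this sketch does not know about instead of hiding them.
    unknown = value & ~sum(ADA_QUIRKS)
    if unknown:
        names.append(f"unknown(0x{unknown:x})")
    return names


if __name__ == "__main__":
    print(decode_quirks(0x3))
```

Under those assumed bit values, 0x3 selects both quirks at once, which matches the "4K, NCQ_TRIM_BROKEN" configuration tested here.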
Since then I have not seen another "Uncorrectable parity/CRC error" message. The "Command timeout" and "ATA Status Error" messages still appear during a scrub, but I have found reports of the same errors on Linux and on Windows with the same pairing of 860 EVO drives and the AMD SB950 controller. In those reports the same controller worked with the older SSDs, and the same SSDs worked with different controllers (or perhaps different drivers).
At this point I decided to add the quirk and submit this diff.
However, since you pointed out that Linux doesn't have this quirk, I have recognized the assumption I made and am now performing more thorough testing.
After removing the quirks (reverting to a stock configuration), I have been able to reproduce the "Uncorrectable parity/CRC error" messages with an 860 and the SB950 by running benchmarks/fio using an example config with the following changes:
--- /usr/local/share/examples/fio/ssd-test.fio 2018-07-23 07:55:32.000000000 -0700
+++ ssd-test.fio 2018-09-14 12:41:22.060666000 -0700
@@ -12,12 +12,11 @@
 #
 [global]
 bs=4k
-ioengine=libaio
 iodepth=4
 size=10g
 direct=1
 runtime=60
-directory=/mount-point-of-ssd
+directory=/tmpzfs
 filename=ssd.test.file
 
 [seq-read]
/tmpzfs is the mountpoint of a pool I created on an 860.
I get the following errors during the course of the test:
Sep 14 12:44:09 fx-freebsd kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 2e ef 7f 40 10 00 00 01 00 00
Sep 14 12:44:09 fx-freebsd kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
Sep 14 12:44:09 fx-freebsd kernel: (ada1:ahcich5:0:0:0): Retrying command
Sep 14 12:44:09 fx-freebsd kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 2e f0 7f 40 10 00 00 01 00 00
Sep 14 12:44:09 fx-freebsd kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
Sep 14 12:44:09 fx-freebsd kernel: (ada1:ahcich5:0:0:0): Retrying command
[...snipped out more of the same for brevity...]
Sep 14 12:44:28 fx-freebsd kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 e9 d9 1e 40 11 00 00 01 00 00
Sep 14 12:44:28 fx-freebsd kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
Sep 14 12:44:28 fx-freebsd kernel: (ada1:ahcich5:0:0:0): Retrying command
Sep 14 12:44:28 fx-freebsd kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 e9 da 1e 40 11 00 00 01 00 00
Sep 14 12:44:28 fx-freebsd kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
Sep 14 12:44:28 fx-freebsd kernel: (ada1:ahcich5:0:0:0): Retrying command
I added the quirk to loader.conf, rebooted, confirmed the change took effect, and ran the same test with the same drive in the same port. I see the same errors:
[...]
Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 fa c6 29 40 16 00 00 01 00 00
Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): Retrying command
Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 fa c7 29 40 16 00 00 01 00 00
Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): Retrying command
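When comparing runs like these, it can help to tally the CAM status lines per device rather than eyeball the log. The following is a rough sketch only: the regular expression is written against the line format shown in the excerpts above, and tally_cam_errors is a hypothetical helper name, not an existing tool.

```python
import re
from collections import Counter

# Matches kernel log lines of the form seen in this review, for example:
#   ... kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error
CAM_STATUS = re.compile(
    r"\((?P<dev>\w+):(?P<chan>\w+):[\d:]+\): CAM status: (?P<status>.+)$"
)


def tally_cam_errors(lines):
    """Count CAM status messages, keyed by (device, channel, status text)."""
    counts = Counter()
    for line in lines:
        match = CAM_STATUS.search(line.rstrip())
        if match:
            counts[(match["dev"], match["chan"], match["status"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): WRITE_FPDMA_QUEUED. ACB: 61 00 fa c6 29 40 16 00 00 01 00 00",
        "Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): CAM status: Uncorrectable parity/CRC error",
        "Sep 14 13:10:30 fx-freebsd kernel: (ada1:ahcich5:0:0:0): Retrying command",
    ]
    for key, count in tally_cam_errors(sample).items():
        print(key, count)
```

Feeding it the contents of /var/log/messages from a run with and without the quirk would give comparable counts for the two configurations.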
This quirk doesn't appear to be necessary (or even to help) for the 860 EVO drives after all. In hindsight, it didn't make any sense to test TRIM by doing a scrub. Since this new test seems reliable, I ran it in several other configurations for good measure:
Drive | Drive Quirks | Platform | Controller | Controller Quirks | Errors |
860 EVO | None | Intel | Intel Lynx Point AHCI SATA | None | None |
860 EVO | None | Intel | ASMedia ASM1062 AHCI SATA | NOCCS, NOAUX | None |
860 EVO | None | AMD | ASMedia ASM1062 AHCI SATA | NOCCS, NOAUX | None |
860 EVO | None | AMD | AMD SB950 AHCI SATA | ATI_PMP_BUG, 1MSI | Uncorrectable parity/CRC error, ATA Status Error, Command timeout |
860 EVO | 4K, NCQ_TRIM_BROKEN | AMD | AMD SB950 AHCI SATA | ATI_PMP_BUG, 1MSI | Uncorrectable parity/CRC error, ATA Status Error, Command timeout |
850 EVO | 4K, NCQ_TRIM_BROKEN | AMD | AMD SB950 AHCI SATA | ATI_PMP_BUG, 1MSI | None |
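To make the pattern in the table explicit, the results can be grouped by drive and controller to see which combinations ever produced errors. This is only an illustrative sketch: the rows are transcribed from the table above with controller names shortened, and outcomes_by is a hypothetical helper.

```python
from collections import defaultdict

# Field positions within each row of RUNS.
DRIVE, QUIRKS, PLATFORM, CONTROLLER = range(4)

# (drive, drive quirks, platform, controller, errors seen), from the table above.
RUNS = [
    ("860 EVO", "None", "Intel", "Intel Lynx Point", False),
    ("860 EVO", "None", "Intel", "ASMedia ASM1062", False),
    ("860 EVO", "None", "AMD", "ASMedia ASM1062", False),
    ("860 EVO", "None", "AMD", "AMD SB950", True),
    ("860 EVO", "4K, NCQ_TRIM_BROKEN", "AMD", "AMD SB950", True),
    ("850 EVO", "4K, NCQ_TRIM_BROKEN", "AMD", "AMD SB950", False),
]


def outcomes_by(runs, *fields):
    """Map each combination of the chosen fields to the set of outcomes seen."""
    seen = defaultdict(set)
    for run in runs:
        key = tuple(run[i] for i in fields)
        seen[key].add(run[-1])
    return dict(seen)


if __name__ == "__main__":
    for key, outcome in outcomes_by(RUNS, DRIVE, CONTROLLER).items():
        if outcome == {True}:
            label = "errors"
        elif outcome == {False}:
            label = "clean"
        else:
            label = "mixed"
        print(key, label)
```

Grouped this way, the 860 EVO on the SB950 is the only combination with errors, and it fails with and without the drive quirks, which is consistent with the conclusion that the quirk does not help.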