HomeFreeBSD

Fix dRAID self-healing short columns

Description

Fix dRAID self-healing short columns

When dRAID performs a normal read operation only the data columns
in the raid map are read from disk. This is enough information to
calculate the checksum, verify it, and return the needed data to the
application. It's only in the event of a checksum failure that the
additional parity and any empty columns must be read since they are
required for parity reconstruction.

Reading these additional columns is handled by vdev_raidz_read_all()
which calls vdev_draid_map_alloc_empty() to expand the raid_map_t
and submit IOs for the missing columns. This all works correctly,
but it fails to account for any "short" columns. These are data
columns which are padded with a empty skip sector at the end.
Since that empty sector is not needed for a normal read it's not
read when columns is first read from disk. However, like the parity
and empty columns the skip sector is needed to perform reconstruction.

The fix is to mark any "short" columns as never being read by clearing
the rc_tried flag when expanding the raid_map_t. This will cause
the entire column to re-read from disk in the event of a checksum
failure allowing the self-healing functionality to repair the block.

Note that this only effects the self-healing feature because when
scrubbing a pool the parity, data, and empty columns are all read
initially to verify their contents. Furthermore, only blocks which
contain "short" columns would be effected, and only when the memory
backing the skip sector wasn't already zeroed out.

This change extends the existing redundancy_raidz.ksh test case to
verify self-healing (as well as resilver and scrub). Then applies
the same test case to dRAID with a slightly modified version of
the test script called redundancy_draid.ksh. The unused variable
combrec was also removed from both test cases.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #12010

Details

Provenance
Brian Behlendorf <behlendorf1@llnl.gov>Authored on May 8 2021, 3:57 PM
Parents
rGb1dd6351bb16: Replace ZoL with OpenZFS where applicable
Branches
Unknown
Tags
Unknown

Event Timeline

Brian Behlendorf <behlendorf1@llnl.gov> committed rG2085a5f992e6: Fix dRAID self-healing short columns (authored by Brian Behlendorf <behlendorf1@llnl.gov>).May 10 2021, 7:18 PM