diff --git a/man/man7/zpool-features.7 b/man/man7/zpool-features.7 index efe9e833996a..2b7dcb63829c 100644 --- a/man/man7/zpool-features.7 +++ b/man/man7/zpool-features.7 @@ -1,953 +1,952 @@ .\" .\" Copyright (c) 2012, 2018 by Delphix. All rights reserved. .\" Copyright (c) 2013 by Saso Kiselkov. All rights reserved. .\" Copyright (c) 2014, Joyent, Inc. All rights reserved. .\" The contents of this file are subject to the terms of the Common Development .\" and Distribution License (the "License"). You may not use this file except .\" in compliance with the License. You can obtain a copy of the license at .\" usr/src/OPENSOLARIS.LICENSE or https://opensource.org/licenses/CDDL-1.0. .\" .\" See the License for the specific language governing permissions and .\" limitations under the License. When distributing Covered Code, include this .\" CDDL HEADER in each file and include the License file at .\" usr/src/OPENSOLARIS.LICENSE. If applicable, add the following below this .\" CDDL HEADER, with the fields enclosed by brackets "[]" replaced with your .\" own identifying information: .\" Portions Copyright [yyyy] [name of copyright owner] .\" Copyright (c) 2019, Klara Inc. .\" Copyright (c) 2019, Allan Jude .\" Copyright (c) 2021, Colm Buckley .\" .Dd June 23, 2022 .Dt ZPOOL-FEATURES 7 .Os . .Sh NAME .Nm zpool-features .Nd description of ZFS pool features . .Sh DESCRIPTION ZFS pool on-disk format versions are specified via .Dq features which replace the old on-disk format numbers .Pq the last supported on-disk format number is 28 . To enable a feature on a pool use the .Nm zpool Cm upgrade , or set the .Sy feature Ns @ Ns Ar feature-name property to .Sy enabled . Please also see the .Sx Compatibility feature sets section for information on how sets of features may be enabled together. .Pp The pool format does not affect file system version compatibility or the ability to send file systems between pools. .Pp Since most features can be enabled independently of each other, the on-disk format of the pool is specified by the set of all features marked as .Sy active on the pool. If the pool was created by another software version this set may include unsupported features. . .Ss Identifying features Every feature has a GUID of the form .Ar com.example : Ns Ar feature-name . The reversed DNS name ensures that the feature's GUID is unique across all ZFS implementations. When unsupported features are encountered on a pool they will be identified by their GUIDs. Refer to the documentation for the ZFS implementation that created the pool for information about those features. .Pp Each supported feature also has a short name. By convention a feature's short name is the portion of its GUID which follows the .Sq \&: .Po i.e. .Ar com.example : Ns Ar feature-name would have the short name .Ar feature-name .Pc , however a feature's short name may differ across ZFS implementations if following the convention would result in name conflicts. . .Ss Feature states Features can be in one of three states: .Bl -tag -width "disabled" .It Sy active This feature's on-disk format changes are in effect on the pool. Support for this feature is required to import the pool in read-write mode. If this feature is not read-only compatible, support is also required to import the pool in read-only mode .Pq see Sx Read-only compatibility . .It Sy enabled An administrator has marked this feature as enabled on the pool, but the feature's on-disk format changes have not been made yet. The pool can still be imported by software that does not support this feature, but changes may be made to the on-disk format at any time which will move the feature to the .Sy active state. Some features may support returning to the .Sy enabled state after becoming .Sy active . See feature-specific documentation for details. .It Sy disabled This feature's on-disk format changes have not been made and will not be made unless an administrator moves the feature to the .Sy enabled state. Features cannot be disabled once they have been enabled. .El .Pp The state of supported features is exposed through pool properties of the form .Sy feature Ns @ Ns Ar short-name . . .Ss Read-only compatibility Some features may make on-disk format changes that do not interfere with other software's ability to read from the pool. These features are referred to as .Dq read-only compatible . If all unsupported features on a pool are read-only compatible, the pool can be imported in read-only mode by setting the .Sy readonly property during import .Po see .Xr zpool-import 8 for details on importing pools .Pc . . .Ss Unsupported features For each unsupported feature enabled on an imported pool, a pool property named .Sy unsupported Ns @ Ns Ar feature-name will indicate why the import was allowed despite the unsupported feature. Possible values for this property are: .Bl -tag -width "readonly" .It Sy inactive The feature is in the .Sy enabled state and therefore the pool's on-disk format is still compatible with software that does not support this feature. .It Sy readonly The feature is read-only compatible and the pool has been imported in read-only mode. .El . .Ss Feature dependencies Some features depend on other features being enabled in order to function. Enabling a feature will automatically enable any features it depends on. . .Ss Compatibility feature sets It is sometimes necessary for a pool to maintain compatibility with a specific on-disk format, by enabling and disabling particular features. The .Sy compatibility feature facilitates this by allowing feature sets to be read from text files. When set to .Sy off .Pq the default , compatibility feature sets are disabled .Pq i.e. all features are enabled ; when set to .Sy legacy , no features are enabled. When set to a comma-separated list of filenames .Po each filename may either be an absolute path, or relative to .Pa /etc/zfs/compatibility.d or .Pa /usr/share/zfs/compatibility.d .Pc , the lists of requested features are read from those files, separated by whitespace and/or commas. Only features present in all files are enabled. .Pp Simple sanity checks are applied to the files: they must be between 1 B and 16 KiB in size, and must end with a newline character. .Pp The requested features are applied when a pool is created using .Nm zpool Cm create Fl o Sy compatibility Ns = Ns Ar … and controls which features are enabled when using .Nm zpool Cm upgrade . .Nm zpool Cm status will not show a warning about disabled features which are not part of the requested feature set. .Pp The special value .Sy legacy prevents any features from being enabled, either via .Nm zpool Cm upgrade or .Nm zpool Cm set Sy feature Ns @ Ns Ar feature-name Ns = Ns Sy enabled . This setting also prevents pools from being upgraded to newer on-disk versions. This is a safety measure to prevent new features from being accidentally enabled, breaking compatibility. .Pp By convention, compatibility files in .Pa /usr/share/zfs/compatibility.d are provided by the distribution, and include feature sets supported by important versions of popular distributions, and feature sets commonly supported at the start of each year. Compatibility files in .Pa /etc/zfs/compatibility.d , if present, will take precedence over files with the same name in .Pa /usr/share/zfs/compatibility.d . .Pp If an unrecognized feature is found in these files, an error message will be shown. If the unrecognized feature is in a file in .Pa /etc/zfs/compatibility.d , this is treated as an error and processing will stop. If the unrecognized feature is under .Pa /usr/share/zfs/compatibility.d , this is treated as a warning and processing will continue. This difference is to allow distributions to include features which might not be recognized by the currently-installed binaries. .Pp Compatibility files may include comments: any text from .Sq # to the end of the line is ignored. .Pp .Sy Example : .Bd -literal -compact -offset 4n .No example# Nm cat Pa /usr/share/zfs/compatibility.d/grub2 # Features which are supported by GRUB2 async_destroy bookmarks embedded_data empty_bpobj enabled_txg extensible_dataset filesystem_limits hole_birth large_blocks lz4_compress spacemap_histogram .No example# Nm zpool Cm create Fl o Sy compatibility Ns = Ns Ar grub2 Ar bootpool Ar vdev .Ed .Pp See .Xr zpool-create 8 and .Xr zpool-upgrade 8 for more information on how these commands are affected by feature sets. . .de feature .It Sy \\$2 .Bl -tag -compact -width "READ-ONLY COMPATIBLE" .It GUID .Sy \\$1:\\$2 .if !"\\$4"" \{\ .It DEPENDENCIES \fB\\$4\fP\c .if !"\\$5"" , \fB\\$5\fP\c .if !"\\$6"" , \fB\\$6\fP\c .if !"\\$7"" , \fB\\$7\fP\c .if !"\\$8"" , \fB\\$8\fP\c .if !"\\$9"" , \fB\\$9\fP\c .\} .It READ-ONLY COMPATIBLE \\$3 .El .Pp .. . .ds instant-never \ .No This feature becomes Sy active No as soon as it is enabled \ and will never return to being Sy enabled . . .ds remount-upgrade \ .No Each filesystem will be upgraded automatically when remounted, \ or when a new file is created under that filesystem. \ The upgrade can also be triggered on filesystems via \ Nm zfs Cm set Sy version Ns = Ns Sy current Ar fs . \ No The upgrade process runs in the background and may take a while to complete \ for filesystems containing large amounts of files . . .de checksum-spiel When the .Sy \\$1 feature is set to .Sy enabled , the administrator can turn on the .Sy \\$1 checksum on any dataset using .Nm zfs Cm set Sy checksum Ns = Ns Sy \\$1 Ar dset .Po see Xr zfs-set 8 Pc . This feature becomes .Sy active once a .Sy checksum property has been set to .Sy \\$1 , and will return to being .Sy enabled once all filesystems that have ever had their checksum set to .Sy \\$1 are destroyed. .. . .Sh FEATURES The following features are supported on this system: .Bl -tag -width Ds .feature org.zfsonlinux allocation_classes yes This feature enables support for separate allocation classes. .Pp This feature becomes .Sy active when a dedicated allocation class vdev .Pq dedup or special is created with the .Nm zpool Cm create No or Nm zpool Cm add No commands . With device removal, it can be returned to the .Sy enabled state if all the dedicated allocation class vdevs are removed. . .feature com.delphix async_destroy yes Destroying a file system requires traversing all of its data in order to return its used space to the pool. Without .Sy async_destroy , the file system is not fully removed until all space has been reclaimed. If the destroy operation is interrupted by a reboot or power outage, the next attempt to open the pool will need to complete the destroy operation synchronously. .Pp When .Sy async_destroy is enabled, the file system's data will be reclaimed by a background process, allowing the destroy operation to complete without traversing the entire file system. The background process is able to resume interrupted destroys after the pool has been opened, eliminating the need to finish interrupted destroys as part of the open operation. The amount of space remaining to be reclaimed by the background process is available through the .Sy freeing property. .Pp This feature is only .Sy active while .Sy freeing is non-zero. . .feature org.openzfs blake3 no extensible_dataset This feature enables the use of the BLAKE3 hash algorithm for checksum and dedup. BLAKE3 is a secure hash algorithm focused on high performance. .Pp .checksum-spiel blake3 . .feature com.fudosecurity block_cloning yes When this feature is enabled ZFS will use block cloning for operations like .Fn copy_file_range 2 . Block cloning allows to create multiple references to a single block. It is much faster than copying the data (as the actual data is neither read nor written) and takes no additional space. Blocks can be cloned across datasets under some conditions (like disabled encryption and equal .Nm recordsize ) . .Pp This feature becomes .Sy active when first block is cloned. When the last cloned block is freed, it goes back to the enabled state. .feature com.delphix bookmarks yes extensible_dataset This feature enables use of the .Nm zfs Cm bookmark command. .Pp This feature is .Sy active while any bookmarks exist in the pool. All bookmarks in the pool can be listed by running .Nm zfs Cm list Fl t Sy bookmark Fl r Ar poolname . . .feature com.datto bookmark_v2 no bookmark extensible_dataset This feature enables the creation and management of larger bookmarks which are needed for other features in ZFS. .Pp This feature becomes .Sy active when a v2 bookmark is created and will be returned to the .Sy enabled state when all v2 bookmarks are destroyed. . .feature com.delphix bookmark_written no bookmark extensible_dataset bookmark_v2 This feature enables additional bookmark accounting fields, enabling the .Sy written Ns # Ns Ar bookmark property .Pq space written since a bookmark and estimates of send stream sizes for incrementals from bookmarks. .Pp This feature becomes .Sy active when a bookmark is created and will be returned to the .Sy enabled state when all bookmarks with these fields are destroyed. . .feature org.openzfs device_rebuild yes This feature enables the ability for the .Nm zpool Cm attach and .Nm zpool Cm replace commands to perform sequential reconstruction .Pq instead of healing reconstruction when resilvering. .Pp Sequential reconstruction resilvers a device in LBA order without immediately verifying the checksums. Once complete, a scrub is started, which then verifies the checksums. This approach allows full redundancy to be restored to the pool in the minimum amount of time. This two-phase approach will take longer than a healing resilver when the time to verify the checksums is included. However, unless there is additional pool damage, no checksum errors should be reported by the scrub. This feature is incompatible with raidz configurations. . This feature becomes .Sy active while a sequential resilver is in progress, and returns to .Sy enabled when the resilver completes. . .feature com.delphix device_removal no This feature enables the .Nm zpool Cm remove command to remove top-level vdevs, evacuating them to reduce the total size of the pool. .Pp This feature becomes .Sy active when the .Nm zpool Cm remove command is used on a top-level vdev, and will never return to being .Sy enabled . . .feature org.openzfs draid no This feature enables use of the .Sy draid vdev type. dRAID is a variant of RAID-Z which provides integrated distributed hot spares that allow faster resilvering while retaining the benefits of RAID-Z. Data, parity, and spare space are organized in redundancy groups and distributed evenly over all of the devices. .Pp This feature becomes .Sy active when creating a pool which uses the .Sy draid vdev type, or when adding a new .Sy draid vdev to an existing pool. . .feature org.illumos edonr no extensible_dataset This feature enables the use of the Edon-R hash algorithm for checksum, including for nopwrite .Po if compression is also enabled, an overwrite of a block whose checksum matches the data being written will be ignored .Pc . In an abundance of caution, Edon-R requires verification when used with dedup: .Nm zfs Cm set Sy dedup Ns = Ns Sy edonr , Ns Sy verify .Po see Xr zfs-set 8 Pc . .Pp Edon-R is a very high-performance hash algorithm that was part of the NIST SHA-3 competition. It provides extremely high hash performance .Pq over 350% faster than SHA-256 , but was not selected because of its unsuitability as a general purpose secure hash algorithm. This implementation utilizes the new salted checksumming functionality in ZFS, which means that the checksum is pre-seeded with a secret 256-bit random key .Pq stored on the pool before being fed the data block to be checksummed. Thus the produced checksums are unique to a given pool, preventing hash collision attacks on systems with dedup. .Pp .checksum-spiel edonr . .feature com.delphix embedded_data no This feature improves the performance and compression ratio of highly-compressible blocks. Blocks whose contents can compress to 112 bytes or smaller can take advantage of this feature. .Pp When this feature is enabled, the contents of highly-compressible blocks are stored in the block .Dq pointer itself .Po a misnomer in this case, as it contains the compressed data, rather than a pointer to its location on disk .Pc . Thus the space of the block .Pq one sector, typically 512 B or 4 KiB is saved, and no additional I/O is needed to read and write the data block. . \*[instant-never] . .feature com.delphix empty_bpobj yes This feature increases the performance of creating and using a large number of snapshots of a single filesystem or volume, and also reduces the disk space required. .Pp When there are many snapshots, each snapshot uses many Block Pointer Objects .Pq bpobjs to track blocks associated with that snapshot. However, in common use cases, most of these bpobjs are empty. This feature allows us to create each bpobj on-demand, thus eliminating the empty bpobjs. .Pp This feature is .Sy active while there are any filesystems, volumes, or snapshots which were created after enabling this feature. . .feature com.delphix enabled_txg yes Once this feature is enabled, ZFS records the transaction group number in which new features are enabled. This has no user-visible impact, but other features may depend on this feature. .Pp This feature becomes .Sy active as soon as it is enabled and will never return to being .Sy enabled . . .feature com.datto encryption no bookmark_v2 extensible_dataset This feature enables the creation and management of natively encrypted datasets. .Pp This feature becomes .Sy active when an encrypted dataset is created and will be returned to the .Sy enabled state when all datasets that use this feature are destroyed. . .feature com.delphix extensible_dataset no This feature allows more flexible use of internal ZFS data structures, and exists for other features to depend on. .Pp This feature will be .Sy active when the first dependent feature uses it, and will be returned to the .Sy enabled state when all datasets that use this feature are destroyed. . .feature com.joyent filesystem_limits yes extensible_dataset This feature enables filesystem and snapshot limits. These limits can be used to control how many filesystems and/or snapshots can be created at the point in the tree on which the limits are set. .Pp This feature is .Sy active once either of the limit properties has been set on a dataset and will never return to being .Sy enabled . . .feature com.delphix head_errlog no This feature enables the upgraded version of errlog, which required an on-disk error log format change. Now the error log of each head dataset is stored separately in the zap object and keyed by the head id. -In case of encrypted filesystems with unloaded keys or unmounted encrypted -filesystems we are unable to check their snapshots or clones for errors and -these will not be reported. -In this case no filenames will be reported either. With this feature enabled, every dataset affected by an error block is listed in the output of .Nm zpool Cm status . +In case of encrypted filesystems with unloaded keys we are unable to check +their snapshots or clones for errors and these will not be reported. +An "access denied" error will be reported. .Pp \*[instant-never] . .feature com.delphix hole_birth no enabled_txg This feature has/had bugs, the result of which is that, if you do a .Nm zfs Cm send Fl i .Pq or Fl R , No since it uses Fl i from an affected dataset, the receiving party will not see any checksum or other errors, but the resulting destination snapshot will not match the source. Its use by .Nm zfs Cm send Fl i has been disabled by default .Po see .Sy send_holes_without_birth_time in .Xr zfs 4 .Pc . .Pp This feature improves performance of incremental sends .Pq Nm zfs Cm send Fl i and receives for objects with many holes. The most common case of hole-filled objects is zvols. .Pp An incremental send stream from snapshot .Sy A No to snapshot Sy B contains information about every block that changed between .Sy A No and Sy B . Blocks which did not change between those snapshots can be identified and omitted from the stream using a piece of metadata called the .Dq block birth time , but birth times are not recorded for holes .Pq blocks filled only with zeroes . Since holes created after .Sy A No cannot be distinguished from holes created before Sy A , information about every hole in the entire filesystem or zvol is included in the send stream. .Pp For workloads where holes are rare this is not a problem. However, when incrementally replicating filesystems or zvols with many holes .Pq for example a zvol formatted with another filesystem a lot of time will be spent sending and receiving unnecessary information about holes that already exist on the receiving side. .Pp Once the .Sy hole_birth feature has been enabled the block birth times of all new holes will be recorded. Incremental sends between snapshots created after this feature is enabled will use this new metadata to avoid sending information about holes that already exist on the receiving side. .Pp \*[instant-never] . .feature org.open-zfs large_blocks no extensible_dataset This feature allows the record size on a dataset to be set larger than 128 KiB. .Pp This feature becomes .Sy active once a dataset contains a file with a block size larger than 128 KiB, and will return to being .Sy enabled once all filesystems that have ever had their recordsize larger than 128 KiB are destroyed. . .feature org.zfsonlinux large_dnode no extensible_dataset This feature allows the size of dnodes in a dataset to be set larger than 512 B. . This feature becomes .Sy active once a dataset contains an object with a dnode larger than 512 B, which occurs as a result of setting the .Sy dnodesize dataset property to a value other than .Sy legacy . The feature will return to being .Sy enabled once all filesystems that have ever contained a dnode larger than 512 B are destroyed. Large dnodes allow more data to be stored in the bonus buffer, thus potentially improving performance by avoiding the use of spill blocks. . .feature com.delphix livelist yes This feature allows clones to be deleted faster than the traditional method when a large number of random/sparse writes have been made to the clone. All blocks allocated and freed after a clone is created are tracked by the the clone's livelist which is referenced during the deletion of the clone. The feature is activated when a clone is created and remains .Sy active until all clones have been destroyed. . .feature com.delphix log_spacemap yes com.delphix:spacemap_v2 This feature improves performance for heavily-fragmented pools, especially when workloads are heavy in random-writes. It does so by logging all the metaslab changes on a single spacemap every TXG instead of scattering multiple writes to all the metaslab spacemaps. .Pp \*[instant-never] . .feature org.illumos lz4_compress no .Sy lz4 is a high-performance real-time compression algorithm that features significantly faster compression and decompression as well as a higher compression ratio than the older .Sy lzjb compression. Typically, .Sy lz4 compression is approximately 50% faster on compressible data and 200% faster on incompressible data than .Sy lzjb . It is also approximately 80% faster on decompression, while giving approximately a 10% better compression ratio. .Pp When the .Sy lz4_compress feature is set to .Sy enabled , the administrator can turn on .Sy lz4 compression on any dataset on the pool using the .Xr zfs-set 8 command. All newly written metadata will be compressed with the .Sy lz4 algorithm. .Pp \*[instant-never] . .feature com.joyent multi_vdev_crash_dump no This feature allows a dump device to be configured with a pool comprised of multiple vdevs. Those vdevs may be arranged in any mirrored or raidz configuration. .Pp When the .Sy multi_vdev_crash_dump feature is set to .Sy enabled , the administrator can use .Xr dumpadm 8 to configure a dump device on a pool comprised of multiple vdevs. .Pp Under .Fx and Linux this feature is unused, but registered for compatibility. New pools created on these systems will have the feature .Sy enabled but will never transition to .Sy active , as this functionality is not required for crash dump support. Existing pools where this feature is .Sy active can be imported. . .feature com.delphix obsolete_counts yes device_removal This feature is an enhancement of .Sy device_removal , which will over time reduce the memory used to track removed devices. When indirect blocks are freed or remapped, we note that their part of the indirect mapping is .Dq obsolete – no longer needed. .Pp This feature becomes .Sy active when the .Nm zpool Cm remove command is used on a top-level vdev, and will never return to being .Sy enabled . . .feature org.zfsonlinux project_quota yes extensible_dataset This feature allows administrators to account the spaces and objects usage information against the project identifier .Pq ID . .Pp The project ID is an object-based attribute. When upgrading an existing filesystem, objects without a project ID will be assigned a zero project ID. When this feature is enabled, newly created objects inherit their parent directories' project ID if the parent's inherit flag is set .Pq via Nm chattr Sy [+-]P No or Nm zfs Cm project Fl s Ns | Ns Fl C . Otherwise, the new object's project ID will be zero. An object's project ID can be changed at any time by the owner .Pq or privileged user via .Nm chattr Fl p Ar prjid or .Nm zfs Cm project Fl p Ar prjid . .Pp This feature will become .Sy active as soon as it is enabled and will never return to being .Sy disabled . \*[remount-upgrade] . .feature com.delphix redaction_bookmarks no bookmarks extensible_dataset This feature enables the use of redacted .Nm zfs Cm send Ns s , which create redaction bookmarks storing the list of blocks redacted by the send that created them. For more information about redacted sends, see .Xr zfs-send 8 . . .feature com.delphix redacted_datasets no extensible_dataset This feature enables the receiving of redacted .Nm zfs Cm send streams, which create redacted datasets when received. These datasets are missing some of their blocks, and so cannot be safely mounted, and their contents cannot be safely read. For more information about redacted receives, see .Xr zfs-send 8 . . .feature com.datto resilver_defer yes This feature allows ZFS to postpone new resilvers if an existing one is already in progress. Without this feature, any new resilvers will cause the currently running one to be immediately restarted from the beginning. .Pp This feature becomes .Sy active once a resilver has been deferred, and returns to being .Sy enabled when the deferred resilver begins. . .feature org.illumos sha512 no extensible_dataset This feature enables the use of the SHA-512/256 truncated hash algorithm .Pq FIPS 180-4 for checksum and dedup. The native 64-bit arithmetic of SHA-512 provides an approximate 50% performance boost over SHA-256 on 64-bit hardware and is thus a good minimum-change replacement candidate for systems where hash performance is important, but these systems cannot for whatever reason utilize the faster .Sy skein No and Sy edonr algorithms. .Pp .checksum-spiel sha512 . .feature org.illumos skein no extensible_dataset This feature enables the use of the Skein hash algorithm for checksum and dedup. Skein is a high-performance secure hash algorithm that was a finalist in the NIST SHA-3 competition. It provides a very high security margin and high performance on 64-bit hardware .Pq 80% faster than SHA-256 . This implementation also utilizes the new salted checksumming functionality in ZFS, which means that the checksum is pre-seeded with a secret 256-bit random key .Pq stored on the pool before being fed the data block to be checksummed. Thus the produced checksums are unique to a given pool, preventing hash collision attacks on systems with dedup. .Pp .checksum-spiel skein . .feature com.delphix spacemap_histogram yes This features allows ZFS to maintain more information about how free space is organized within the pool. If this feature is .Sy enabled , it will be activated when a new space map object is created, or an existing space map is upgraded to the new format, and never returns back to being .Sy enabled . . .feature com.delphix spacemap_v2 yes This feature enables the use of the new space map encoding which consists of two words .Pq instead of one whenever it is advantageous. The new encoding allows space maps to represent large regions of space more efficiently on-disk while also increasing their maximum addressable offset. .Pp This feature becomes .Sy active once it is .Sy enabled , and never returns back to being .Sy enabled . . .feature org.zfsonlinux userobj_accounting yes extensible_dataset This feature allows administrators to account the object usage information by user and group. .Pp \*[instant-never] \*[remount-upgrade] . .feature com.klarasystems vdev_zaps_v2 no This feature creates a ZAP object for the root vdev. .Pp This feature becomes active after the next .Nm zpool Cm import or .Nm zpool reguid . . Properties can be retrieved or set on the root vdev using .Nm zpool Cm get and .Nm zpool Cm set with .Sy root as the vdev name which is an alias for .Sy root-0 . .feature org.openzfs zilsaxattr yes extensible_dataset This feature enables .Sy xattr Ns = Ns Sy sa extended attribute logging in the ZIL. If enabled, extended attribute changes .Pq both Sy xattrdir Ns = Ns Sy dir No and Sy xattr Ns = Ns Sy sa are guaranteed to be durable if either the dataset had .Sy sync Ns = Ns Sy always set at the time the changes were made, or .Xr sync 2 is called on the dataset after the changes were made. .Pp This feature becomes .Sy active when a ZIL is created for at least one dataset and will be returned to the .Sy enabled state when it is destroyed for all datasets that use this feature. . .feature com.delphix zpool_checkpoint yes This feature enables the .Nm zpool Cm checkpoint command that can checkpoint the state of the pool at the time it was issued and later rewind back to it or discard it. .Pp This feature becomes .Sy active when the .Nm zpool Cm checkpoint command is used to checkpoint the pool. The feature will only return back to being .Sy enabled when the pool is rewound or the checkpoint has been discarded. . .feature org.freebsd zstd_compress no extensible_dataset .Sy zstd is a high-performance compression algorithm that features a combination of high compression ratios and high speed. Compared to .Sy gzip , .Sy zstd offers slightly better compression at much higher speeds. Compared to .Sy lz4 , .Sy zstd offers much better compression while being only modestly slower. Typically, .Sy zstd compression speed ranges from 250 to 500 MB/s per thread and decompression speed is over 1 GB/s per thread. .Pp When the .Sy zstd feature is set to .Sy enabled , the administrator can turn on .Sy zstd compression of any dataset using .Nm zfs Cm set Sy compress Ns = Ns Sy zstd Ar dset .Po see Xr zfs-set 8 Pc . This feature becomes .Sy active once a .Sy compress property has been set to .Sy zstd , and will return to being .Sy enabled once all filesystems that have ever had their .Sy compress property set to .Sy zstd are destroyed. .El . .Sh SEE ALSO .Xr zfs 8 , .Xr zpool 8 diff --git a/module/zfs/spa_errlog.c b/module/zfs/spa_errlog.c index e0604c4a84af..44950a769d3b 100644 --- a/module/zfs/spa_errlog.c +++ b/module/zfs/spa_errlog.c @@ -1,1372 +1,1360 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or https://opensource.org/licenses/CDDL-1.0. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2013, 2014, Delphix. All rights reserved. * Copyright (c) 2019 Datto Inc. * Copyright (c) 2021, 2022, George Amanakis. All rights reserved. */ /* * Routines to manage the on-disk persistent error log. * * Each pool stores a log of all logical data errors seen during normal * operation. This is actually the union of two distinct logs: the last log, * and the current log. All errors seen are logged to the current log. When a * scrub completes, the current log becomes the last log, the last log is thrown * out, and the current log is reinitialized. This way, if an error is somehow * corrected, a new scrub will show that it no longer exists, and will be * deleted from the log when the scrub completes. * * The log is stored using a ZAP object whose key is a string form of the * zbookmark_phys tuple (objset, object, level, blkid), and whose contents is an * optional 'objset:object' human-readable string describing the data. When an * error is first logged, this string will be empty, indicating that no name is * known. This prevents us from having to issue a potentially large amount of * I/O to discover the object name during an error path. Instead, we do the * calculation when the data is requested, storing the result so future queries * will be faster. * * If the head_errlog feature is enabled, a different on-disk format is used. * The error log of each head dataset is stored separately in the zap object * and keyed by the head id. This enables listing every dataset affected in * userland. In order to be able to track whether an error block has been * modified or added to snapshots since it was marked as an error, a new tuple * is introduced: zbookmark_err_phys_t. It allows the storage of the birth * transaction group of an error block on-disk. The birth transaction group is * used by check_filesystem() to assess whether this block was freed, * re-written or added to a snapshot since its marking as an error. * * This log is then shipped into an nvlist where the key is the dataset name and * the value is the object name. Userland is then responsible for uniquifying * this list and displaying it to the user. */ #include #include #include #include #include #include #include #include #include #define NAME_MAX_LEN 64 typedef struct clones { uint64_t clone_ds; list_node_t node; } clones_t; /* * spa_upgrade_errlog_limit : A zfs module parameter that controls the number * of on-disk error log entries that will be converted to the new * format when enabling head_errlog. Defaults to 0 which converts * all log entries. */ static uint_t spa_upgrade_errlog_limit = 0; /* * Convert a bookmark to a string. */ static void bookmark_to_name(zbookmark_phys_t *zb, char *buf, size_t len) { (void) snprintf(buf, len, "%llx:%llx:%llx:%llx", (u_longlong_t)zb->zb_objset, (u_longlong_t)zb->zb_object, (u_longlong_t)zb->zb_level, (u_longlong_t)zb->zb_blkid); } /* * Convert an err_phys to a string. */ static void errphys_to_name(zbookmark_err_phys_t *zep, char *buf, size_t len) { (void) snprintf(buf, len, "%llx:%llx:%llx:%llx", (u_longlong_t)zep->zb_object, (u_longlong_t)zep->zb_level, (u_longlong_t)zep->zb_blkid, (u_longlong_t)zep->zb_birth); } /* * Convert a string to a err_phys. */ static void name_to_errphys(char *buf, zbookmark_err_phys_t *zep) { zep->zb_object = zfs_strtonum(buf, &buf); ASSERT(*buf == ':'); zep->zb_level = (int)zfs_strtonum(buf + 1, &buf); ASSERT(*buf == ':'); zep->zb_blkid = zfs_strtonum(buf + 1, &buf); ASSERT(*buf == ':'); zep->zb_birth = zfs_strtonum(buf + 1, &buf); ASSERT(*buf == '\0'); } /* * Convert a string to a bookmark. */ static void name_to_bookmark(char *buf, zbookmark_phys_t *zb) { zb->zb_objset = zfs_strtonum(buf, &buf); ASSERT(*buf == ':'); zb->zb_object = zfs_strtonum(buf + 1, &buf); ASSERT(*buf == ':'); zb->zb_level = (int)zfs_strtonum(buf + 1, &buf); ASSERT(*buf == ':'); zb->zb_blkid = zfs_strtonum(buf + 1, &buf); ASSERT(*buf == '\0'); } #ifdef _KERNEL static void zep_to_zb(uint64_t dataset, zbookmark_err_phys_t *zep, zbookmark_phys_t *zb) { zb->zb_objset = dataset; zb->zb_object = zep->zb_object; zb->zb_level = zep->zb_level; zb->zb_blkid = zep->zb_blkid; } #endif static void name_to_object(char *buf, uint64_t *obj) { *obj = zfs_strtonum(buf, &buf); ASSERT(*buf == '\0'); } /* * Retrieve the head filesystem. */ static int get_head_ds(spa_t *spa, uint64_t dsobj, uint64_t *head_ds) { dsl_dataset_t *ds; - int error = dsl_dataset_hold_obj(spa->spa_dsl_pool, - dsobj, FTAG, &ds); + int error = dsl_dataset_hold_obj_flags(spa->spa_dsl_pool, + dsobj, DS_HOLD_FLAG_DECRYPT, FTAG, &ds); if (error != 0) return (error); ASSERT(head_ds); *head_ds = dsl_dir_phys(ds->ds_dir)->dd_head_dataset_obj; - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); return (error); } /* * Log an uncorrectable error to the persistent error log. We add it to the * spa's list of pending errors. The changes are actually synced out to disk * during spa_errlog_sync(). */ void spa_log_error(spa_t *spa, const zbookmark_phys_t *zb, const uint64_t *birth) { spa_error_entry_t search; spa_error_entry_t *new; avl_tree_t *tree; avl_index_t where; /* * If we are trying to import a pool, ignore any errors, as we won't be * writing to the pool any time soon. */ if (spa_load_state(spa) == SPA_LOAD_TRYIMPORT) return; mutex_enter(&spa->spa_errlist_lock); /* * If we have had a request to rotate the log, log it to the next list * instead of the current one. */ if (spa->spa_scrub_active || spa->spa_scrub_finished) tree = &spa->spa_errlist_scrub; else tree = &spa->spa_errlist_last; search.se_bookmark = *zb; if (avl_find(tree, &search, &where) != NULL) { mutex_exit(&spa->spa_errlist_lock); return; } new = kmem_zalloc(sizeof (spa_error_entry_t), KM_SLEEP); new->se_bookmark = *zb; /* * If the head_errlog feature is enabled, store the birth txg now. In * case the file is deleted before spa_errlog_sync() runs, we will not * be able to retrieve the birth txg. */ if (spa_feature_is_enabled(spa, SPA_FEATURE_HEAD_ERRLOG)) { new->se_zep.zb_object = zb->zb_object; new->se_zep.zb_level = zb->zb_level; new->se_zep.zb_blkid = zb->zb_blkid; /* * birth may end up being NULL, e.g. in zio_done(). We * will handle this in process_error_block(). */ if (birth != NULL) new->se_zep.zb_birth = *birth; } avl_insert(tree, new, where); mutex_exit(&spa->spa_errlist_lock); } #ifdef _KERNEL static int find_birth_txg(dsl_dataset_t *ds, zbookmark_err_phys_t *zep, uint64_t *birth_txg) { objset_t *os; int error = dmu_objset_from_ds(ds, &os); if (error != 0) return (error); dnode_t *dn; blkptr_t bp; error = dnode_hold(os, zep->zb_object, FTAG, &dn); if (error != 0) return (error); rw_enter(&dn->dn_struct_rwlock, RW_READER); error = dbuf_dnode_findbp(dn, zep->zb_level, zep->zb_blkid, &bp, NULL, NULL); if (error == 0 && BP_IS_HOLE(&bp)) error = SET_ERROR(ENOENT); *birth_txg = bp.blk_birth; rw_exit(&dn->dn_struct_rwlock); dnode_rele(dn, FTAG); return (error); } /* * Copy the bookmark to the end of the user-space buffer which starts at * uaddr and has *count unused entries, and decrement *count by 1. */ static int copyout_entry(const zbookmark_phys_t *zb, void *uaddr, uint64_t *count) { if (*count == 0) return (SET_ERROR(ENOMEM)); *count -= 1; if (copyout(zb, (char *)uaddr + (*count) * sizeof (zbookmark_phys_t), sizeof (zbookmark_phys_t)) != 0) return (SET_ERROR(EFAULT)); return (0); } /* * Each time the error block is referenced by a snapshot or clone, add a * zbookmark_phys_t entry to the userspace array at uaddr. The array is * filled from the back and the in-out parameter *count is modified to be the * number of unused entries at the beginning of the array. */ static int check_filesystem(spa_t *spa, uint64_t head_ds, zbookmark_err_phys_t *zep, void *uaddr, uint64_t *count, list_t *clones_list) { dsl_dataset_t *ds; dsl_pool_t *dp = spa->spa_dsl_pool; - int error = dsl_dataset_hold_obj(dp, head_ds, FTAG, &ds); + int error = dsl_dataset_hold_obj_flags(dp, head_ds, + DS_HOLD_FLAG_DECRYPT, FTAG, &ds); if (error != 0) return (error); uint64_t latest_txg; uint64_t txg_to_consider = spa->spa_syncing_txg; boolean_t check_snapshot = B_TRUE; error = find_birth_txg(ds, zep, &latest_txg); - /* - * If the filesystem is encrypted and the key is not loaded - * or the encrypted filesystem is not mounted the error will be EACCES. - * In that case report an error in the head filesystem and return. - */ - if (error == EACCES) { - dsl_dataset_rele(ds, FTAG); - zbookmark_phys_t zb; - zep_to_zb(head_ds, zep, &zb); - error = copyout_entry(&zb, uaddr, count); - if (error != 0) { - dsl_dataset_rele(ds, FTAG); - return (error); - } - return (0); - } - /* * If find_birth_txg() errors out otherwise, let txg_to_consider be * equal to the spa's syncing txg: if check_filesystem() errors out * then affected snapshots or clones will not be checked. */ if (error == 0 && zep->zb_birth == latest_txg) { /* Block neither free nor rewritten. */ zbookmark_phys_t zb; zep_to_zb(head_ds, zep, &zb); error = copyout_entry(&zb, uaddr, count); if (error != 0) { - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); return (error); } check_snapshot = B_FALSE; } else if (error == 0) { txg_to_consider = latest_txg; } /* * Retrieve the number of snapshots if the dataset is not a snapshot. */ uint64_t snap_count = 0; if (dsl_dataset_phys(ds)->ds_snapnames_zapobj != 0) { error = zap_count(spa->spa_meta_objset, dsl_dataset_phys(ds)->ds_snapnames_zapobj, &snap_count); if (error != 0) { - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); return (error); } } if (snap_count == 0) { /* Filesystem without snapshots. */ - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); return (0); } uint64_t *snap_obj_array = kmem_zalloc(snap_count * sizeof (uint64_t), KM_SLEEP); int aff_snap_count = 0; uint64_t snap_obj = dsl_dataset_phys(ds)->ds_prev_snap_obj; uint64_t snap_obj_txg = dsl_dataset_phys(ds)->ds_prev_snap_txg; uint64_t zap_clone = dsl_dir_phys(ds->ds_dir)->dd_clones; - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); /* Check only snapshots created from this file system. */ while (snap_obj != 0 && zep->zb_birth < snap_obj_txg && snap_obj_txg <= txg_to_consider) { - error = dsl_dataset_hold_obj(dp, snap_obj, FTAG, &ds); + error = dsl_dataset_hold_obj_flags(dp, snap_obj, + DS_HOLD_FLAG_DECRYPT, FTAG, &ds); if (error != 0) goto out; if (dsl_dir_phys(ds->ds_dir)->dd_head_dataset_obj != head_ds) { snap_obj = dsl_dataset_phys(ds)->ds_prev_snap_obj; snap_obj_txg = dsl_dataset_phys(ds)->ds_prev_snap_txg; - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); continue; } boolean_t affected = B_TRUE; if (check_snapshot) { uint64_t blk_txg; error = find_birth_txg(ds, zep, &blk_txg); affected = (error == 0 && zep->zb_birth == blk_txg); } /* Report errors in snapshots. */ if (affected) { snap_obj_array[aff_snap_count] = snap_obj; aff_snap_count++; zbookmark_phys_t zb; zep_to_zb(snap_obj, zep, &zb); error = copyout_entry(&zb, uaddr, count); if (error != 0) { - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, + FTAG); goto out; } } snap_obj = dsl_dataset_phys(ds)->ds_prev_snap_obj; snap_obj_txg = dsl_dataset_phys(ds)->ds_prev_snap_txg; - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); } if (zap_clone == 0 || aff_snap_count == 0) return (0); /* Check clones. */ zap_cursor_t *zc; zap_attribute_t *za; zc = kmem_zalloc(sizeof (zap_cursor_t), KM_SLEEP); za = kmem_zalloc(sizeof (zap_attribute_t), KM_SLEEP); for (zap_cursor_init(zc, spa->spa_meta_objset, zap_clone); zap_cursor_retrieve(zc, za) == 0; zap_cursor_advance(zc)) { dsl_dataset_t *clone; - error = dsl_dataset_hold_obj(dp, za->za_first_integer, - FTAG, &clone); + error = dsl_dataset_hold_obj_flags(dp, za->za_first_integer, + DS_HOLD_FLAG_DECRYPT, FTAG, &clone); if (error != 0) break; /* * Only clones whose origins were affected could also * have affected snapshots. */ boolean_t found = B_FALSE; for (int i = 0; i < snap_count; i++) { if (dsl_dir_phys(clone->ds_dir)->dd_origin_obj == snap_obj_array[i]) found = B_TRUE; } - dsl_dataset_rele(clone, FTAG); + dsl_dataset_rele_flags(clone, DS_HOLD_FLAG_DECRYPT, FTAG); if (!found) continue; clones_t *ct = kmem_zalloc(sizeof (*ct), KM_SLEEP); ct->clone_ds = za->za_first_integer; list_insert_tail(clones_list, ct); } zap_cursor_fini(zc); kmem_free(za, sizeof (*za)); kmem_free(zc, sizeof (*zc)); out: kmem_free(snap_obj_array, sizeof (*snap_obj_array)); return (error); } static int find_top_affected_fs(spa_t *spa, uint64_t head_ds, zbookmark_err_phys_t *zep, uint64_t *top_affected_fs) { uint64_t oldest_dsobj; int error = dsl_dataset_oldest_snapshot(spa, head_ds, zep->zb_birth, &oldest_dsobj); if (error != 0) return (error); dsl_dataset_t *ds; - error = dsl_dataset_hold_obj(spa->spa_dsl_pool, oldest_dsobj, - FTAG, &ds); + error = dsl_dataset_hold_obj_flags(spa->spa_dsl_pool, oldest_dsobj, + DS_HOLD_FLAG_DECRYPT, FTAG, &ds); if (error != 0) return (error); *top_affected_fs = dsl_dir_phys(ds->ds_dir)->dd_head_dataset_obj; - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); return (0); } static int process_error_block(spa_t *spa, uint64_t head_ds, zbookmark_err_phys_t *zep, void *uaddr, uint64_t *count) { /* * If zb_birth == 0 or head_ds == 0 it means we failed to retrieve the * birth txg or the head filesystem of the block pointer. This may * happen e.g. when an encrypted filesystem is not mounted or when * the key is not loaded. In this case do not proceed to * check_filesystem(), instead do the accounting here. */ if (zep->zb_birth == 0 || head_ds == 0) { zbookmark_phys_t zb; zep_to_zb(head_ds, zep, &zb); int error = copyout_entry(&zb, uaddr, count); if (error != 0) { return (error); } return (0); } uint64_t top_affected_fs; int error = find_top_affected_fs(spa, head_ds, zep, &top_affected_fs); if (error == 0) { clones_t *ct; list_t clones_list; list_create(&clones_list, sizeof (clones_t), offsetof(clones_t, node)); error = check_filesystem(spa, top_affected_fs, zep, uaddr, count, &clones_list); while ((ct = list_remove_head(&clones_list)) != NULL) { error = check_filesystem(spa, ct->clone_ds, zep, uaddr, count, &clones_list); kmem_free(ct, sizeof (*ct)); if (error) { while (!list_is_empty(&clones_list)) { ct = list_remove_head(&clones_list); kmem_free(ct, sizeof (*ct)); } break; } } list_destroy(&clones_list); } return (error); } #endif /* * If a healed bookmark matches an entry in the error log we stash it in a tree * so that we can later remove the related log entries in sync context. */ static void spa_add_healed_error(spa_t *spa, uint64_t obj, zbookmark_phys_t *healed_zb) { char name[NAME_MAX_LEN]; if (obj == 0) return; bookmark_to_name(healed_zb, name, sizeof (name)); mutex_enter(&spa->spa_errlog_lock); if (zap_contains(spa->spa_meta_objset, obj, name) == 0) { /* * Found an error matching healed zb, add zb to our * tree of healed errors */ avl_tree_t *tree = &spa->spa_errlist_healed; spa_error_entry_t search; spa_error_entry_t *new; avl_index_t where; search.se_bookmark = *healed_zb; mutex_enter(&spa->spa_errlist_lock); if (avl_find(tree, &search, &where) != NULL) { mutex_exit(&spa->spa_errlist_lock); mutex_exit(&spa->spa_errlog_lock); return; } new = kmem_zalloc(sizeof (spa_error_entry_t), KM_SLEEP); new->se_bookmark = *healed_zb; avl_insert(tree, new, where); mutex_exit(&spa->spa_errlist_lock); } mutex_exit(&spa->spa_errlog_lock); } /* * If this error exists in the given tree remove it. */ static void remove_error_from_list(spa_t *spa, avl_tree_t *t, const zbookmark_phys_t *zb) { spa_error_entry_t search, *found; avl_index_t where; mutex_enter(&spa->spa_errlist_lock); search.se_bookmark = *zb; if ((found = avl_find(t, &search, &where)) != NULL) { avl_remove(t, found); kmem_free(found, sizeof (spa_error_entry_t)); } mutex_exit(&spa->spa_errlist_lock); } /* * Removes all of the recv healed errors from both on-disk error logs */ static void spa_remove_healed_errors(spa_t *spa, avl_tree_t *s, avl_tree_t *l, dmu_tx_t *tx) { char name[NAME_MAX_LEN]; spa_error_entry_t *se; void *cookie = NULL; ASSERT(MUTEX_HELD(&spa->spa_errlog_lock)); while ((se = avl_destroy_nodes(&spa->spa_errlist_healed, &cookie)) != NULL) { remove_error_from_list(spa, s, &se->se_bookmark); remove_error_from_list(spa, l, &se->se_bookmark); bookmark_to_name(&se->se_bookmark, name, sizeof (name)); kmem_free(se, sizeof (spa_error_entry_t)); (void) zap_remove(spa->spa_meta_objset, spa->spa_errlog_last, name, tx); (void) zap_remove(spa->spa_meta_objset, spa->spa_errlog_scrub, name, tx); } } /* * Stash away healed bookmarks to remove them from the on-disk error logs * later in spa_remove_healed_errors(). */ void spa_remove_error(spa_t *spa, zbookmark_phys_t *zb) { char name[NAME_MAX_LEN]; bookmark_to_name(zb, name, sizeof (name)); spa_add_healed_error(spa, spa->spa_errlog_last, zb); spa_add_healed_error(spa, spa->spa_errlog_scrub, zb); } static uint64_t approx_errlog_size_impl(spa_t *spa, uint64_t spa_err_obj) { if (spa_err_obj == 0) return (0); uint64_t total = 0; zap_cursor_t zc; zap_attribute_t za; for (zap_cursor_init(&zc, spa->spa_meta_objset, spa_err_obj); zap_cursor_retrieve(&zc, &za) == 0; zap_cursor_advance(&zc)) { uint64_t count; if (zap_count(spa->spa_meta_objset, za.za_first_integer, &count) == 0) total += count; } zap_cursor_fini(&zc); return (total); } /* * Return the approximate number of errors currently in the error log. This * will be nonzero if there are some errors, but otherwise it may be more * or less than the number of entries returned by spa_get_errlog(). */ uint64_t spa_approx_errlog_size(spa_t *spa) { uint64_t total = 0; if (!spa_feature_is_enabled(spa, SPA_FEATURE_HEAD_ERRLOG)) { mutex_enter(&spa->spa_errlog_lock); uint64_t count; if (spa->spa_errlog_scrub != 0 && zap_count(spa->spa_meta_objset, spa->spa_errlog_scrub, &count) == 0) total += count; if (spa->spa_errlog_last != 0 && !spa->spa_scrub_finished && zap_count(spa->spa_meta_objset, spa->spa_errlog_last, &count) == 0) total += count; mutex_exit(&spa->spa_errlog_lock); } else { mutex_enter(&spa->spa_errlog_lock); total += approx_errlog_size_impl(spa, spa->spa_errlog_last); total += approx_errlog_size_impl(spa, spa->spa_errlog_scrub); mutex_exit(&spa->spa_errlog_lock); } mutex_enter(&spa->spa_errlist_lock); total += avl_numnodes(&spa->spa_errlist_last); total += avl_numnodes(&spa->spa_errlist_scrub); mutex_exit(&spa->spa_errlist_lock); return (total); } /* * This function sweeps through an on-disk error log and stores all bookmarks * as error bookmarks in a new ZAP object. At the end we discard the old one, * and spa_update_errlog() will set the spa's on-disk error log to new ZAP * object. */ static void sync_upgrade_errlog(spa_t *spa, uint64_t spa_err_obj, uint64_t *newobj, dmu_tx_t *tx) { zap_cursor_t zc; zap_attribute_t za; zbookmark_phys_t zb; uint64_t count; *newobj = zap_create(spa->spa_meta_objset, DMU_OT_ERROR_LOG, DMU_OT_NONE, 0, tx); /* * If we cannnot perform the upgrade we should clear the old on-disk * error logs. */ if (zap_count(spa->spa_meta_objset, spa_err_obj, &count) != 0) { VERIFY0(dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)); return; } for (zap_cursor_init(&zc, spa->spa_meta_objset, spa_err_obj); zap_cursor_retrieve(&zc, &za) == 0; zap_cursor_advance(&zc)) { if (spa_upgrade_errlog_limit != 0 && zc.zc_cd == spa_upgrade_errlog_limit) break; name_to_bookmark(za.za_name, &zb); zbookmark_err_phys_t zep; zep.zb_object = zb.zb_object; zep.zb_level = zb.zb_level; zep.zb_blkid = zb.zb_blkid; zep.zb_birth = 0; /* * In case of an error we should simply continue instead of * returning prematurely. See the next comment. */ uint64_t head_ds; dsl_pool_t *dp = spa->spa_dsl_pool; dsl_dataset_t *ds; objset_t *os; - int error = dsl_dataset_hold_obj(dp, zb.zb_objset, FTAG, &ds); + int error = dsl_dataset_hold_obj_flags(dp, zb.zb_objset, + DS_HOLD_FLAG_DECRYPT, FTAG, &ds); if (error != 0) continue; head_ds = dsl_dir_phys(ds->ds_dir)->dd_head_dataset_obj; /* * The objset and the dnode are required for getting the block * pointer, which is used to determine if BP_IS_HOLE(). If * getting the objset or the dnode fails, do not create a * zap entry (presuming we know the dataset) as this may create * spurious errors that we cannot ever resolve. If an error is * truly persistent, it should re-appear after a scan. */ if (dmu_objset_from_ds(ds, &os) != 0) { - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); continue; } dnode_t *dn; blkptr_t bp; if (dnode_hold(os, zep.zb_object, FTAG, &dn) != 0) { - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); continue; } rw_enter(&dn->dn_struct_rwlock, RW_READER); error = dbuf_dnode_findbp(dn, zep.zb_level, zep.zb_blkid, &bp, NULL, NULL); if (error == EACCES) error = 0; else if (!error) zep.zb_birth = bp.blk_birth; rw_exit(&dn->dn_struct_rwlock); dnode_rele(dn, FTAG); - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); if (error != 0 || BP_IS_HOLE(&bp)) continue; uint64_t err_obj; error = zap_lookup_int_key(spa->spa_meta_objset, *newobj, head_ds, &err_obj); if (error == ENOENT) { err_obj = zap_create(spa->spa_meta_objset, DMU_OT_ERROR_LOG, DMU_OT_NONE, 0, tx); (void) zap_update_int_key(spa->spa_meta_objset, *newobj, head_ds, err_obj, tx); } char buf[64]; errphys_to_name(&zep, buf, sizeof (buf)); const char *name = ""; (void) zap_update(spa->spa_meta_objset, err_obj, buf, 1, strlen(name) + 1, name, tx); } zap_cursor_fini(&zc); VERIFY0(dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)); } void spa_upgrade_errlog(spa_t *spa, dmu_tx_t *tx) { uint64_t newobj = 0; mutex_enter(&spa->spa_errlog_lock); if (spa->spa_errlog_last != 0) { sync_upgrade_errlog(spa, spa->spa_errlog_last, &newobj, tx); spa->spa_errlog_last = newobj; } if (spa->spa_errlog_scrub != 0) { sync_upgrade_errlog(spa, spa->spa_errlog_scrub, &newobj, tx); spa->spa_errlog_scrub = newobj; } mutex_exit(&spa->spa_errlog_lock); } #ifdef _KERNEL /* * If an error block is shared by two datasets it will be counted twice. */ static int process_error_log(spa_t *spa, uint64_t obj, void *uaddr, uint64_t *count) { if (obj == 0) return (0); zap_cursor_t *zc; zap_attribute_t *za; zc = kmem_zalloc(sizeof (zap_cursor_t), KM_SLEEP); za = kmem_zalloc(sizeof (zap_attribute_t), KM_SLEEP); if (!spa_feature_is_enabled(spa, SPA_FEATURE_HEAD_ERRLOG)) { for (zap_cursor_init(zc, spa->spa_meta_objset, obj); zap_cursor_retrieve(zc, za) == 0; zap_cursor_advance(zc)) { if (*count == 0) { zap_cursor_fini(zc); kmem_free(zc, sizeof (*zc)); kmem_free(za, sizeof (*za)); return (SET_ERROR(ENOMEM)); } zbookmark_phys_t zb; name_to_bookmark(za->za_name, &zb); int error = copyout_entry(&zb, uaddr, count); if (error != 0) { zap_cursor_fini(zc); kmem_free(zc, sizeof (*zc)); kmem_free(za, sizeof (*za)); return (error); } } zap_cursor_fini(zc); kmem_free(zc, sizeof (*zc)); kmem_free(za, sizeof (*za)); return (0); } for (zap_cursor_init(zc, spa->spa_meta_objset, obj); zap_cursor_retrieve(zc, za) == 0; zap_cursor_advance(zc)) { zap_cursor_t *head_ds_cursor; zap_attribute_t *head_ds_attr; head_ds_cursor = kmem_zalloc(sizeof (zap_cursor_t), KM_SLEEP); head_ds_attr = kmem_zalloc(sizeof (zap_attribute_t), KM_SLEEP); uint64_t head_ds_err_obj = za->za_first_integer; uint64_t head_ds; name_to_object(za->za_name, &head_ds); for (zap_cursor_init(head_ds_cursor, spa->spa_meta_objset, head_ds_err_obj); zap_cursor_retrieve(head_ds_cursor, head_ds_attr) == 0; zap_cursor_advance(head_ds_cursor)) { zbookmark_err_phys_t head_ds_block; name_to_errphys(head_ds_attr->za_name, &head_ds_block); int error = process_error_block(spa, head_ds, &head_ds_block, uaddr, count); if (error != 0) { zap_cursor_fini(head_ds_cursor); kmem_free(head_ds_cursor, sizeof (*head_ds_cursor)); kmem_free(head_ds_attr, sizeof (*head_ds_attr)); zap_cursor_fini(zc); kmem_free(za, sizeof (*za)); kmem_free(zc, sizeof (*zc)); return (error); } } zap_cursor_fini(head_ds_cursor); kmem_free(head_ds_cursor, sizeof (*head_ds_cursor)); kmem_free(head_ds_attr, sizeof (*head_ds_attr)); } zap_cursor_fini(zc); kmem_free(za, sizeof (*za)); kmem_free(zc, sizeof (*zc)); return (0); } static int process_error_list(spa_t *spa, avl_tree_t *list, void *uaddr, uint64_t *count) { spa_error_entry_t *se; if (!spa_feature_is_enabled(spa, SPA_FEATURE_HEAD_ERRLOG)) { for (se = avl_first(list); se != NULL; se = AVL_NEXT(list, se)) { int error = copyout_entry(&se->se_bookmark, uaddr, count); if (error != 0) { return (error); } } return (0); } for (se = avl_first(list); se != NULL; se = AVL_NEXT(list, se)) { uint64_t head_ds = 0; int error = get_head_ds(spa, se->se_bookmark.zb_objset, &head_ds); /* * If get_head_ds() errors out, set the head filesystem * to the filesystem stored in the bookmark of the * error block. */ if (error != 0) head_ds = se->se_bookmark.zb_objset; error = process_error_block(spa, head_ds, &se->se_zep, uaddr, count); if (error != 0) return (error); } return (0); } #endif /* * Copy all known errors to userland as an array of bookmarks. This is * actually a union of the on-disk last log and current log, as well as any * pending error requests. * * Because the act of reading the on-disk log could cause errors to be * generated, we have two separate locks: one for the error log and one for the * in-core error lists. We only need the error list lock to log and error, so * we grab the error log lock while we read the on-disk logs, and only pick up * the error list lock when we are finished. */ int spa_get_errlog(spa_t *spa, void *uaddr, uint64_t *count) { int ret = 0; #ifdef _KERNEL /* * The pool config lock is needed to hold a dataset_t via (among other * places) process_error_list() -> process_error_block()-> * find_top_affected_fs(), and lock ordering requires that we get it * before the spa_errlog_lock. */ dsl_pool_config_enter(spa->spa_dsl_pool, FTAG); mutex_enter(&spa->spa_errlog_lock); ret = process_error_log(spa, spa->spa_errlog_scrub, uaddr, count); if (!ret && !spa->spa_scrub_finished) ret = process_error_log(spa, spa->spa_errlog_last, uaddr, count); mutex_enter(&spa->spa_errlist_lock); if (!ret) ret = process_error_list(spa, &spa->spa_errlist_scrub, uaddr, count); if (!ret) ret = process_error_list(spa, &spa->spa_errlist_last, uaddr, count); mutex_exit(&spa->spa_errlist_lock); mutex_exit(&spa->spa_errlog_lock); dsl_pool_config_exit(spa->spa_dsl_pool, FTAG); #else (void) spa, (void) uaddr, (void) count; #endif return (ret); } /* * Called when a scrub completes. This simply set a bit which tells which AVL * tree to add new errors. spa_errlog_sync() is responsible for actually * syncing the changes to the underlying objects. */ void spa_errlog_rotate(spa_t *spa) { mutex_enter(&spa->spa_errlist_lock); spa->spa_scrub_finished = B_TRUE; mutex_exit(&spa->spa_errlist_lock); } /* * Discard any pending errors from the spa_t. Called when unloading a faulted * pool, as the errors encountered during the open cannot be synced to disk. */ void spa_errlog_drain(spa_t *spa) { spa_error_entry_t *se; void *cookie; mutex_enter(&spa->spa_errlist_lock); cookie = NULL; while ((se = avl_destroy_nodes(&spa->spa_errlist_last, &cookie)) != NULL) kmem_free(se, sizeof (spa_error_entry_t)); cookie = NULL; while ((se = avl_destroy_nodes(&spa->spa_errlist_scrub, &cookie)) != NULL) kmem_free(se, sizeof (spa_error_entry_t)); mutex_exit(&spa->spa_errlist_lock); } /* * Process a list of errors into the current on-disk log. */ void sync_error_list(spa_t *spa, avl_tree_t *t, uint64_t *obj, dmu_tx_t *tx) { spa_error_entry_t *se; char buf[NAME_MAX_LEN]; void *cookie; if (avl_numnodes(t) == 0) return; /* create log if necessary */ if (*obj == 0) *obj = zap_create(spa->spa_meta_objset, DMU_OT_ERROR_LOG, DMU_OT_NONE, 0, tx); /* add errors to the current log */ if (!spa_feature_is_enabled(spa, SPA_FEATURE_HEAD_ERRLOG)) { for (se = avl_first(t); se != NULL; se = AVL_NEXT(t, se)) { bookmark_to_name(&se->se_bookmark, buf, sizeof (buf)); const char *name = se->se_name ? se->se_name : ""; (void) zap_update(spa->spa_meta_objset, *obj, buf, 1, strlen(name) + 1, name, tx); } } else { for (se = avl_first(t); se != NULL; se = AVL_NEXT(t, se)) { zbookmark_err_phys_t zep; zep.zb_object = se->se_zep.zb_object; zep.zb_level = se->se_zep.zb_level; zep.zb_blkid = se->se_zep.zb_blkid; zep.zb_birth = se->se_zep.zb_birth; uint64_t head_ds = 0; int error = get_head_ds(spa, se->se_bookmark.zb_objset, &head_ds); /* * If get_head_ds() errors out, set the head filesystem * to the filesystem stored in the bookmark of the * error block. */ if (error != 0) head_ds = se->se_bookmark.zb_objset; uint64_t err_obj; error = zap_lookup_int_key(spa->spa_meta_objset, *obj, head_ds, &err_obj); if (error == ENOENT) { err_obj = zap_create(spa->spa_meta_objset, DMU_OT_ERROR_LOG, DMU_OT_NONE, 0, tx); (void) zap_update_int_key(spa->spa_meta_objset, *obj, head_ds, err_obj, tx); } errphys_to_name(&zep, buf, sizeof (buf)); const char *name = se->se_name ? se->se_name : ""; (void) zap_update(spa->spa_meta_objset, err_obj, buf, 1, strlen(name) + 1, name, tx); } } /* purge the error list */ cookie = NULL; while ((se = avl_destroy_nodes(t, &cookie)) != NULL) kmem_free(se, sizeof (spa_error_entry_t)); } static void delete_errlog(spa_t *spa, uint64_t spa_err_obj, dmu_tx_t *tx) { if (spa_feature_is_enabled(spa, SPA_FEATURE_HEAD_ERRLOG)) { zap_cursor_t zc; zap_attribute_t za; for (zap_cursor_init(&zc, spa->spa_meta_objset, spa_err_obj); zap_cursor_retrieve(&zc, &za) == 0; zap_cursor_advance(&zc)) { VERIFY0(dmu_object_free(spa->spa_meta_objset, za.za_first_integer, tx)); } zap_cursor_fini(&zc); } VERIFY0(dmu_object_free(spa->spa_meta_objset, spa_err_obj, tx)); } /* * Sync the error log out to disk. This is a little tricky because the act of * writing the error log requires the spa_errlist_lock. So, we need to lock the * error lists, take a copy of the lists, and then reinitialize them. Then, we * drop the error list lock and take the error log lock, at which point we * do the errlog processing. Then, if we encounter an I/O error during this * process, we can successfully add the error to the list. Note that this will * result in the perpetual recycling of errors, but it is an unlikely situation * and not a performance critical operation. */ void spa_errlog_sync(spa_t *spa, uint64_t txg) { dmu_tx_t *tx; avl_tree_t scrub, last; int scrub_finished; mutex_enter(&spa->spa_errlist_lock); /* * Bail out early under normal circumstances. */ if (avl_numnodes(&spa->spa_errlist_scrub) == 0 && avl_numnodes(&spa->spa_errlist_last) == 0 && avl_numnodes(&spa->spa_errlist_healed) == 0 && !spa->spa_scrub_finished) { mutex_exit(&spa->spa_errlist_lock); return; } spa_get_errlists(spa, &last, &scrub); scrub_finished = spa->spa_scrub_finished; spa->spa_scrub_finished = B_FALSE; mutex_exit(&spa->spa_errlist_lock); /* * The pool config lock is needed to hold a dataset_t via * sync_error_list() -> get_head_ds(), and lock ordering * requires that we get it before the spa_errlog_lock. */ dsl_pool_config_enter(spa->spa_dsl_pool, FTAG); mutex_enter(&spa->spa_errlog_lock); tx = dmu_tx_create_assigned(spa->spa_dsl_pool, txg); /* * Remove healed errors from errors. */ spa_remove_healed_errors(spa, &last, &scrub, tx); /* * Sync out the current list of errors. */ sync_error_list(spa, &last, &spa->spa_errlog_last, tx); /* * Rotate the log if necessary. */ if (scrub_finished) { if (spa->spa_errlog_last != 0) delete_errlog(spa, spa->spa_errlog_last, tx); spa->spa_errlog_last = spa->spa_errlog_scrub; spa->spa_errlog_scrub = 0; sync_error_list(spa, &scrub, &spa->spa_errlog_last, tx); } /* * Sync out any pending scrub errors. */ sync_error_list(spa, &scrub, &spa->spa_errlog_scrub, tx); /* * Update the MOS to reflect the new values. */ (void) zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ERRLOG_LAST, sizeof (uint64_t), 1, &spa->spa_errlog_last, tx); (void) zap_update(spa->spa_meta_objset, DMU_POOL_DIRECTORY_OBJECT, DMU_POOL_ERRLOG_SCRUB, sizeof (uint64_t), 1, &spa->spa_errlog_scrub, tx); dmu_tx_commit(tx); mutex_exit(&spa->spa_errlog_lock); dsl_pool_config_exit(spa->spa_dsl_pool, FTAG); } static void delete_dataset_errlog(spa_t *spa, uint64_t spa_err_obj, uint64_t ds, dmu_tx_t *tx) { if (spa_err_obj == 0) return; zap_cursor_t zc; zap_attribute_t za; for (zap_cursor_init(&zc, spa->spa_meta_objset, spa_err_obj); zap_cursor_retrieve(&zc, &za) == 0; zap_cursor_advance(&zc)) { uint64_t head_ds; name_to_object(za.za_name, &head_ds); if (head_ds == ds) { (void) zap_remove(spa->spa_meta_objset, spa_err_obj, za.za_name, tx); VERIFY0(dmu_object_free(spa->spa_meta_objset, za.za_first_integer, tx)); break; } } zap_cursor_fini(&zc); } void spa_delete_dataset_errlog(spa_t *spa, uint64_t ds, dmu_tx_t *tx) { mutex_enter(&spa->spa_errlog_lock); delete_dataset_errlog(spa, spa->spa_errlog_scrub, ds, tx); delete_dataset_errlog(spa, spa->spa_errlog_last, ds, tx); mutex_exit(&spa->spa_errlog_lock); } static int find_txg_ancestor_snapshot(spa_t *spa, uint64_t new_head, uint64_t old_head, uint64_t *txg) { dsl_dataset_t *ds; dsl_pool_t *dp = spa->spa_dsl_pool; - int error = dsl_dataset_hold_obj(dp, old_head, FTAG, &ds); + int error = dsl_dataset_hold_obj_flags(dp, old_head, + DS_HOLD_FLAG_DECRYPT, FTAG, &ds); if (error != 0) return (error); uint64_t prev_obj = dsl_dataset_phys(ds)->ds_prev_snap_obj; uint64_t prev_obj_txg = dsl_dataset_phys(ds)->ds_prev_snap_txg; while (prev_obj != 0) { - dsl_dataset_rele(ds, FTAG); - if ((error = dsl_dataset_hold_obj(dp, prev_obj, - FTAG, &ds)) == 0 && + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); + if ((error = dsl_dataset_hold_obj_flags(dp, prev_obj, + DS_HOLD_FLAG_DECRYPT, FTAG, &ds)) == 0 && dsl_dir_phys(ds->ds_dir)->dd_head_dataset_obj == new_head) break; if (error != 0) return (error); prev_obj_txg = dsl_dataset_phys(ds)->ds_prev_snap_txg; prev_obj = dsl_dataset_phys(ds)->ds_prev_snap_obj; } - dsl_dataset_rele(ds, FTAG); + dsl_dataset_rele_flags(ds, DS_HOLD_FLAG_DECRYPT, FTAG); ASSERT(prev_obj != 0); *txg = prev_obj_txg; return (0); } static void swap_errlog(spa_t *spa, uint64_t spa_err_obj, uint64_t new_head, uint64_t old_head, dmu_tx_t *tx) { if (spa_err_obj == 0) return; uint64_t old_head_errlog; int error = zap_lookup_int_key(spa->spa_meta_objset, spa_err_obj, old_head, &old_head_errlog); /* If no error log, then there is nothing to do. */ if (error != 0) return; uint64_t txg; error = find_txg_ancestor_snapshot(spa, new_head, old_head, &txg); if (error != 0) return; /* * Create an error log if the file system being promoted does not * already have one. */ uint64_t new_head_errlog; error = zap_lookup_int_key(spa->spa_meta_objset, spa_err_obj, new_head, &new_head_errlog); if (error != 0) { new_head_errlog = zap_create(spa->spa_meta_objset, DMU_OT_ERROR_LOG, DMU_OT_NONE, 0, tx); (void) zap_update_int_key(spa->spa_meta_objset, spa_err_obj, new_head, new_head_errlog, tx); } zap_cursor_t zc; zap_attribute_t za; zbookmark_err_phys_t err_block; for (zap_cursor_init(&zc, spa->spa_meta_objset, old_head_errlog); zap_cursor_retrieve(&zc, &za) == 0; zap_cursor_advance(&zc)) { const char *name = ""; name_to_errphys(za.za_name, &err_block); if (err_block.zb_birth < txg) { (void) zap_update(spa->spa_meta_objset, new_head_errlog, za.za_name, 1, strlen(name) + 1, name, tx); (void) zap_remove(spa->spa_meta_objset, old_head_errlog, za.za_name, tx); } } zap_cursor_fini(&zc); } void spa_swap_errlog(spa_t *spa, uint64_t new_head_ds, uint64_t old_head_ds, dmu_tx_t *tx) { mutex_enter(&spa->spa_errlog_lock); swap_errlog(spa, spa->spa_errlog_scrub, new_head_ds, old_head_ds, tx); swap_errlog(spa, spa->spa_errlog_last, new_head_ds, old_head_ds, tx); mutex_exit(&spa->spa_errlog_lock); } #if defined(_KERNEL) /* error handling */ EXPORT_SYMBOL(spa_log_error); EXPORT_SYMBOL(spa_approx_errlog_size); EXPORT_SYMBOL(spa_get_errlog); EXPORT_SYMBOL(spa_errlog_rotate); EXPORT_SYMBOL(spa_errlog_drain); EXPORT_SYMBOL(spa_errlog_sync); EXPORT_SYMBOL(spa_get_errlists); EXPORT_SYMBOL(spa_delete_dataset_errlog); EXPORT_SYMBOL(spa_swap_errlog); EXPORT_SYMBOL(sync_error_list); EXPORT_SYMBOL(spa_upgrade_errlog); #endif /* BEGIN CSTYLED */ ZFS_MODULE_PARAM(zfs_spa, spa_, upgrade_errlog_limit, UINT, ZMOD_RW, "Limit the number of errors which will be upgraded to the new " "on-disk error log when enabling head_errlog"); /* END CSTYLED */ diff --git a/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_corrective.ksh b/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_corrective.ksh index 9ebde1cd9d32..261fc5eed8cb 100755 --- a/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_corrective.ksh +++ b/tests/zfs-tests/tests/functional/cli_root/zfs_receive/zfs_receive_corrective.ksh @@ -1,192 +1,209 @@ #!/bin/ksh -p # # CDDL HEADER START # # This file and its contents are supplied under the terms of the # Common Development and Distribution License ("CDDL"), version 1.0. # You may only use this file in accordance with the terms of version # 1.0 of the CDDL. # # A full copy of the text of the CDDL should have accompanied this # source. A copy of the CDDL is also available via the Internet at # http://www.illumos.org/license/CDDL. # # CDDL HEADER END # # # Copyright (c) 2019 Datto, Inc. All rights reserved. # Copyright (c) 2022 Axcient. # . $STF_SUITE/include/libtest.shlib # # DESCRIPTION: # OpenZFS should be able to heal data using corrective recv # # STRATEGY: # 0. Create a file, checksum the file to be corrupted then compare it's checksum # with the one obtained after healing under different testing scenarios: # 1. Test healing (aka corrective) recv from a full send file # 2. Test healing recv (aka heal recv) from an incremental send file # 3. Test healing recv when compression on-disk is off but source was compressed # 4. Test heal recv when compression on-disk is on but source was uncompressed # 5. Test heal recv when compression doesn't match between send file and on-disk # 6. Test healing recv of an encrypted dataset using an unencrypted send file # 7. Test healing recv (on an encrypted dataset) using a raw send file # 8. Test healing when specifying destination filesystem only (no snapshot) # 9. Test incremental recv aftear healing recv # verify_runnable "both" DISK=${DISKS%% *} backup=$TEST_BASE_DIR/backup raw_backup=$TEST_BASE_DIR/raw_backup ibackup=$TEST_BASE_DIR/ibackup unc_backup=$TEST_BASE_DIR/unc_backup function cleanup { log_must rm -f $backup $raw_backup $ibackup $unc_backup poolexists $TESTPOOL && destroy_pool $TESTPOOL log_must zpool create -f $TESTPOOL $DISK } function test_corrective_recv { log_must zpool scrub -w $TESTPOOL log_must zpool status -v $TESTPOOL log_must eval "zpool status -v $TESTPOOL | \ grep \"Permanent errors have been detected\"" # make sure we will read the corruption from disk by flushing the ARC log_must zinject -a log_must eval "zfs recv -c $1 < $2" log_must zpool scrub -w $TESTPOOL log_must zpool status -v $TESTPOOL log_mustnot eval "zpool status -v $TESTPOOL | \ grep \"Permanent errors have been detected\"" typeset cksum=$(md5digest $file) [[ "$cksum" == "$checksum" ]] || \ log_fail "Checksums differ ($cksum != $checksum)" } log_onexit cleanup log_assert "ZFS corrective receive should be able to heal data corruption" typeset passphrase="password" typeset file="/$TESTPOOL/$TESTFS1/$TESTFILE0" log_must eval "poolexists $TESTPOOL && destroy_pool $TESTPOOL" log_must zpool create -f $TESTPOOL $DISK log_must eval "echo $passphrase > /$TESTPOOL/pwd" log_must zfs create -o primarycache=none \ -o atime=off -o compression=lz4 $TESTPOOL/$TESTFS1 log_must dd if=/dev/urandom of=$file bs=1024 count=1024 oflag=sync log_must eval "echo 'aaaaaaaa' >> "$file typeset checksum=$(md5digest $file) log_must zfs snapshot $TESTPOOL/$TESTFS1@snap1 # create full send file log_must eval "zfs send $TESTPOOL/$TESTFS1@snap1 > $backup" log_must dd if=/dev/urandom of=$file"1" bs=1024 count=1024 oflag=sync log_must eval "echo 'bbbbbbbb' >> "$file"1" log_must zfs snapshot $TESTPOOL/$TESTFS1@snap2 # create incremental send file log_must eval "zfs send -i $TESTPOOL/$TESTFS1@snap1 \ $TESTPOOL/$TESTFS1@snap2 > $ibackup" corrupt_blocks_at_level $file 0 # test healing recv from a full send file test_corrective_recv $TESTPOOL/$TESTFS1@snap1 $backup corrupt_blocks_at_level $file"1" 0 # test healing recv from an incremental send file test_corrective_recv $TESTPOOL/$TESTFS1@snap2 $ibackup # create new uncompressed dataset using our send file log_must eval "zfs recv -o compression=off -o primarycache=none \ $TESTPOOL/$TESTFS2 < $backup" typeset compr=$(get_prop compression $TESTPOOL/$TESTFS2) [[ "$compr" == "off" ]] || \ log_fail "Unexpected compression $compr in recved dataset" corrupt_blocks_at_level "/$TESTPOOL/$TESTFS2/$TESTFILE0" 0 # test healing recv when compression on-disk is off but source was compressed test_corrective_recv "$TESTPOOL/$TESTFS2@snap1" $backup # create a full sendfile from an uncompressed source log_must eval "zfs send $TESTPOOL/$TESTFS2@snap1 > $unc_backup" log_must eval "zfs recv -o compression=gzip -o primarycache=none \ $TESTPOOL/testfs3 < $unc_backup" typeset compr=$(get_prop compression $TESTPOOL/testfs3) [[ "$compr" == "gzip" ]] || \ log_fail "Unexpected compression $compr in recved dataset" corrupt_blocks_at_level "/$TESTPOOL/testfs3/$TESTFILE0" 0 # test healing recv when compression on-disk is on but source was uncompressed test_corrective_recv "$TESTPOOL/testfs3@snap1" $unc_backup # create new compressed dataset using our send file log_must eval "zfs recv -o compression=gzip -o primarycache=none \ $TESTPOOL/testfs4 < $backup" typeset compr=$(get_prop compression $TESTPOOL/testfs4) [[ "$compr" == "gzip" ]] || \ log_fail "Unexpected compression $compr in recved dataset" corrupt_blocks_at_level "/$TESTPOOL/testfs4/$TESTFILE0" 0 # test healing recv when compression doesn't match between send file and on-disk test_corrective_recv "$TESTPOOL/testfs4@snap1" $backup # create new encrypted (and compressed) dataset using our send file log_must eval "zfs recv -o encryption=aes-256-ccm -o keyformat=passphrase \ -o keylocation=file:///$TESTPOOL/pwd -o primarycache=none \ $TESTPOOL/testfs5 < $backup" typeset encr=$(get_prop encryption $TESTPOOL/testfs5) [[ "$encr" == "aes-256-ccm" ]] || \ log_fail "Unexpected encryption $encr in recved dataset" log_must eval "zfs send --raw $TESTPOOL/testfs5@snap1 > $raw_backup" log_must eval "zfs send $TESTPOOL/testfs5@snap1 > $backup" corrupt_blocks_at_level "/$TESTPOOL/testfs5/$TESTFILE0" 0 # test healing recv of an encrypted dataset using an unencrypted send file test_corrective_recv "$TESTPOOL/testfs5@snap1" $backup corrupt_blocks_at_level "/$TESTPOOL/testfs5/$TESTFILE0" 0 log_must zfs unmount $TESTPOOL/testfs5 log_must zfs unload-key $TESTPOOL/testfs5 # test healing recv (on an encrypted dataset) using a raw send file -test_corrective_recv "$TESTPOOL/testfs5@snap1" $raw_backup +# This is a special case since with unloaded keys we cannot report errors +# in the filesystem. +log_must zpool scrub -w $TESTPOOL +log_must zpool status -v $TESTPOOL +log_mustnot eval "zpool status -v $TESTPOOL | \ + grep \"permission denied\"" +# make sure we will read the corruption from disk by flushing the ARC +log_must zinject -a +log_must eval "zfs recv -c $TESTPOOL/testfs5@snap1 < $raw_backup" + +log_must zpool scrub -w $TESTPOOL +log_must zpool status -v $TESTPOOL +log_mustnot eval "zpool status -v $TESTPOOL | \ + grep \"Permanent errors have been detected\"" +typeset cksum=$(md5digest $file) +[[ "$cksum" == "$checksum" ]] || \ + log_fail "Checksums differ ($cksum != $checksum)" + # non raw send file healing an encrypted dataset with an unloaded key will fail log_mustnot eval "zfs recv -c $TESTPOOL/testfs5@snap1 < $backup" log_must zfs rollback -r $TESTPOOL/$TESTFS1@snap1 corrupt_blocks_at_level $file 0 # test healing when specifying destination filesystem only (no snapshot) test_corrective_recv $TESTPOOL/$TESTFS1 $backup # test incremental recv aftear healing recv log_must eval "zfs recv $TESTPOOL/$TESTFS1 < $ibackup" # test that healing recv can not be combined with incompatible recv options log_mustnot eval "zfs recv -h -c $TESTPOOL/$TESTFS1@snap1 < $backup" log_mustnot eval "zfs recv -F -c $TESTPOOL/$TESTFS1@snap1 < $backup" log_mustnot eval "zfs recv -s -c $TESTPOOL/$TESTFS1@snap1 < $backup" log_mustnot eval "zfs recv -u -c $TESTPOOL/$TESTFS1@snap1 < $backup" log_mustnot eval "zfs recv -d -c $TESTPOOL/$TESTFS1@snap1 < $backup" log_mustnot eval "zfs recv -e -c $TESTPOOL/$TESTFS1@snap1 < $backup" # ensure healing recv doesn't work when snap GUIDS don't match log_mustnot eval "zfs recv -c $TESTPOOL/testfs5@snap2 < $backup" log_mustnot eval "zfs recv -c $TESTPOOL/testfs5 < $backup" # test that healing recv doesn't work on non-existing snapshots log_mustnot eval "zfs recv -c $TESTPOOL/$TESTFS1@missing < $backup" log_pass "OpenZFS corrective recv works for data healing" diff --git a/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_005_pos.ksh b/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_005_pos.ksh index 04cd1892380d..ec4c67fb42f5 100755 --- a/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_005_pos.ksh +++ b/tests/zfs-tests/tests/functional/cli_root/zpool_status/zpool_status_005_pos.ksh @@ -1,90 +1,90 @@ #!/bin/ksh -p # # CDDL HEADER START # # The contents of this file are subject to the terms of the # Common Development and Distribution License (the "License"). # You may not use this file except in compliance with the License. # # You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE # or https://opensource.org/licenses/CDDL-1.0. # See the License for the specific language governing permissions # and limitations under the License. # # When distributing Covered Code, include this CDDL HEADER in each # file and include the License file at usr/src/OPENSOLARIS.LICENSE. # If applicable, add the following below this CDDL HEADER, with the # fields enclosed by brackets "[]" replaced with your own identifying # information: Portions Copyright [yyyy] [name of copyright owner] # # CDDL HEADER END # # # Copyright (c) 2022 George Amanakis. All rights reserved. # # # DESCRIPTION: # Verify correct output with 'zpool status -v' after corrupting a file # # STRATEGY: -# 1. Create a pool, an ancrypted filesystem and a file +# 1. Create a pool, an encrypted filesystem and a file # 2. zinject checksum errors # 3. Unmount the filesystem and unload the key # 4. Scrub the pool # 5. Verify we report that errors were detected but we do not report # the filename since the key is not loaded. # 6. Load the key and mount the encrypted fs. # 7. Verify we report errors in the pool in 'zpool status -v' . $STF_SUITE/include/libtest.shlib verify_runnable "both" DISK=${DISKS%% *} function cleanup { log_must zinject -c all destroy_pool $TESTPOOL2 rm -f $TESTDIR/vdev_a } log_assert "Verify reporting errors with unloaded keys works" log_onexit cleanup typeset passphrase="password" typeset file="/$TESTPOOL2/$TESTFS1/$TESTFILE0" truncate -s $MINVDEVSIZE $TESTDIR/vdev_a log_must zpool create -f -o feature@head_errlog=enabled $TESTPOOL2 $TESTDIR/vdev_a log_must eval "echo $passphrase > /$TESTPOOL2/pwd" log_must zfs create -o encryption=aes-256-ccm -o keyformat=passphrase \ -o keylocation=file:///$TESTPOOL2/pwd -o primarycache=none \ $TESTPOOL2/$TESTFS1 log_must dd if=/dev/urandom of=$file bs=1024 count=1024 oflag=sync log_must eval "echo 'aaaaaaaa' >> "$file corrupt_blocks_at_level $file 0 log_must zfs umount $TESTPOOL2/$TESTFS1 log_must zfs unload-key -a log_must zpool sync $TESTPOOL2 log_must zpool scrub $TESTPOOL2 log_must zpool wait -t scrub $TESTPOOL2 log_must zpool status -v $TESTPOOL2 -log_must eval "zpool status -v $TESTPOOL2 | \ - grep \"Permanent errors have been detected\"" +log_mustnot eval "zpool status -v $TESTPOOL2 | \ + grep \"permission denied\"" log_mustnot eval "zpool status -v $TESTPOOL2 | grep '$file'" log_must eval "cat /$TESTPOOL2/pwd | zfs load-key $TESTPOOL2/$TESTFS1" log_must zfs mount $TESTPOOL2/$TESTFS1 log_must zpool status -v $TESTPOOL2 log_must eval "zpool status -v $TESTPOOL2 | \ grep \"Permanent errors have been detected\"" log_must eval "zpool status -v $TESTPOOL2 | grep '$file'" log_pass "Verify reporting errors with unloaded keys works"