Index: vendor-sys/openzfs/dist/.github/CONTRIBUTING.md =================================================================== --- vendor-sys/openzfs/dist/.github/CONTRIBUTING.md (nonexistent) +++ vendor-sys/openzfs/dist/.github/CONTRIBUTING.md (revision 366348) @@ -0,0 +1,345 @@ +# Contributing to OpenZFS +

+ OpenZFS Logo +

+ +*First of all, thank you for taking the time to contribute!* + +By using the following guidelines, you can help us make OpenZFS even better. + +## Table Of Contents +[What should I know before I get +started?](#what-should-i-know-before-i-get-started) + + * [Get ZFS](#get-zfs) + * [Debug ZFS](#debug-zfs) + * [Where can I ask for help?](#where-can-I-ask-for-help) + +[How Can I Contribute?](#how-can-i-contribute) + + * [Reporting Bugs](#reporting-bugs) + * [Suggesting Enhancements](#suggesting-enhancements) + * [Pull Requests](#pull-requests) + * [Testing](#testing) + +[Style Guides](#style-guides) + + * [Coding Conventions](#coding-conventions) + * [Commit Message Formats](#commit-message-formats) + * [New Changes](#new-changes) + * [OpenZFS Patch Ports](#openzfs-patch-ports) + * [Coverity Defect Fixes](#coverity-defect-fixes) + * [Signed Off By](#signed-off-by) + +Helpful resources + + * [OpenZFS Documentation](https://openzfs.github.io/openzfs-docs/) + * [OpenZFS Developer Resources](http://open-zfs.org/wiki/Developer_resources) + * [Git and GitHub for beginners](https://openzfs.github.io/openzfs-docs/Developer%20Resources/Git%20and%20GitHub%20for%20beginners.html) + +## What should I know before I get started? + +### Get ZFS +You can build zfs packages by following [these +instructions](https://openzfs.github.io/openzfs-docs/Developer%20Resources/Building%20ZFS.html), +or install stable packages from [your distribution's +repository](https://openzfs.github.io/openzfs-docs/Getting%20Started/index.html). + +### Debug ZFS +A variety of methods and tools are available to aid ZFS developers. +It's strongly recommended that when developing a patch the `--enable-debug` +configure option should be set. This will enable additional correctness +checks and all the ASSERTs to help quickly catch potential issues. + +In addition, there are numerous utilities and debugging files which +provide visibility into the inner workings of ZFS. The most useful +of these tools are discussed in detail on the [Troubleshooting +page](https://openzfs.github.io/openzfs-docs/Basic%20Concepts/Troubleshooting.html). + +### Where can I ask for help? +The [zfs-discuss mailing +list](https://openzfs.github.io/openzfs-docs/Project%20and%20Community/Mailing%20Lists.html) +or IRC are the best places to ask for help. Please do not file +support requests on the GitHub issue tracker. + +## How Can I Contribute? + +### Reporting Bugs +*Please* contact us via the [zfs-discuss mailing +list](https://openzfs.github.io/openzfs-docs/Project%20and%20Community/Mailing%20Lists.html) +or IRC if you aren't certain that you are experiencing a bug. + +If you run into an issue, please search our [issue +tracker](https://github.com/openzfs/zfs/issues) *first* to ensure the +issue hasn't been reported before. Open a new issue only if you haven't +found anything similar to your issue. + +You can open a new issue and search existing issues using the public [issue +tracker](https://github.com/openzfs/zfs/issues). + +#### When opening a new issue, please include the following information at the top of the issue: +* What distribution (with version) you are using. +* The spl and zfs versions you are using, installation method (repository +or manual compilation). +* Describe the issue you are experiencing. +* Describe how to reproduce the issue. +* Including any warning/errors/backtraces from the system logs. + +When a new issue is opened, it is not uncommon for developers to request +additional information. + +In general, the more detail you share about a problem the quicker a +developer can resolve it. For example, providing a simple test case is always +exceptionally helpful. + +Be prepared to work with the developers investigating your issue. Your +assistance is crucial in providing a quick solution. They may ask for +information like: + +* Your pool configuration as reported by `zdb` or `zpool status`. +* Your hardware configuration, such as + * Number of CPUs. + * Amount of memory. + * Whether your system has ECC memory. + * Whether it is running under a VMM/Hypervisor. + * Kernel version. + * Values of the spl/zfs module parameters. +* Stack traces which may be logged to `dmesg`. + +### Suggesting Enhancements +OpenZFS is a widely deployed production filesystem which is under active +development. The team's primary focus is on fixing known issues, improving +performance, and adding compelling new features. + +You can view the list of proposed features +by filtering the issue tracker by the ["Type: Feature" +label](https://github.com/openzfs/zfs/issues?q=is%3Aopen+is%3Aissue+label%3A%22Type%3A+Feature%22). +If you have an idea for a feature first check this list. If your idea already +appears then add a +1 to the top most comment, this helps us gauge interest +in that feature. + +Otherwise, open a new issue and describe your proposed feature. Why is this +feature needed? What problem does it solve? + +### Pull Requests + +#### General + +* All pull requests must be based on the current master branch and apply +without conflicts. +* Please attempt to limit pull requests to a single commit which resolves +one specific issue. +* Make sure your commit messages are in the correct format. See the +[Commit Message Formats](#commit-message-formats) section for more information. +* When updating a pull request squash multiple commits by performing a +[rebase](https://git-scm.com/docs/git-rebase) (squash). +* For large pull requests consider structuring your changes as a stack of +logically independent patches which build on each other. This makes large +changes easier to review and approve which speeds up the merging process. +* Try to keep pull requests simple. Simple code with comments is much easier +to review and approve. +* All proposed changes must be approved by an OpenZFS organization member. +* If you have an idea you'd like to discuss or which requires additional testing, consider opening it as a draft pull request. +Once everything is in good shape and the details have been worked out you can remove its draft status. +Any required reviews can then be finalized and the pull request merged. + +#### Tests and Benchmarks +* Every pull request will by tested by the buildbot on multiple platforms by running the [zfs-tests.sh and zloop.sh]( +https://openzfs.github.io/openzfs-docs/Developer%20Resources/Building%20ZFS.html#running-zloop-sh-and-zfs-tests-sh) test suites. +* To verify your changes conform to the [style guidelines]( +https://github.com/openzfs/zfs/blob/master/.github/CONTRIBUTING.md#style-guides +), please run `make checkstyle` and resolve any warnings. +* Static code analysis of each pull request is performed by the buildbot; run `make lint` to check your changes. +* Test cases should be provided when appropriate. +This includes making sure new features have adequate code coverage. +* If your pull request improves performance, please include some benchmarks. +* The pull request must pass all required [ZFS +Buildbot](http://build.zfsonlinux.org/) builders before +being accepted. If you are experiencing intermittent TEST +builder failures, you may be experiencing a [test suite +issue](https://github.com/openzfs/zfs/issues?q=is%3Aissue+is%3Aopen+label%3A%22Type%3A+Test+Suite%22). +There are also various [buildbot options](https://openzfs.github.io/openzfs-docs/Developer%20Resources/Buildbot%20Options.html) +to control how changes are tested. + +### Testing +All help is appreciated! If you're in a position to run the latest code +consider helping us by reporting any functional problems, performance +regressions or other suspected issues. By running the latest code to a wide +range of realistic workloads, configurations and architectures we're better +able quickly identify and resolve potential issues. + +Users can also run the [ZFS Test +Suite](https://github.com/openzfs/zfs/tree/master/tests) on their systems +to verify ZFS is behaving as intended. + +## Style Guides + +### Repository Structure + +OpenZFS uses a standardised branching structure. +- The "development and main branch", is the branch all development should be based on. +- "Release branches" contain the latest released code for said version. +- "Staging branches" contain selected commits prior to being released. + +**Branch Names:** +- Development and Main branch: `master` +- Release branches: `zfs-$VERSION-release` +- Staging branches: `zfs-$VERSION-staging` + +`$VERSION` should be replaced with the `major.minor` version number. +_(This is the version number without the `.patch` version at the end)_ + +### Coding Conventions +We currently use [C Style and Coding Standards for +SunOS](http://www.cis.upenn.edu/%7Elee/06cse480/data/cstyle.ms.pdf) as our +coding convention. + +This repository has an `.editorconfig` file. If your editor [supports +editorconfig](https://editorconfig.org/#download), it will +automatically respect most of this project's whitespace preferences. + +Additionally, Git can help warn on whitespace problems as well: + +``` +git config --local core.whitespace trailing-space,space-before-tab,indent-with-non-tab,-tab-in-indent +``` + +### Commit Message Formats +#### New Changes +Commit messages for new changes must meet the following guidelines: +* In 72 characters or less, provide a summary of the change as the +first line in the commit message. +* A body which provides a description of the change. If necessary, +please summarize important information such as why the proposed +approach was chosen or a brief description of the bug you are resolving. +Each line of the body must be 72 characters or less. +* The last line must be a `Signed-off-by:` tag. See the +[Signed Off By](#signed-off-by) section for more information. + +An example commit message for new changes is provided below. + +``` +This line is a brief summary of your change + +Please provide at least a couple sentences describing the +change. If necessary, please summarize decisions such as +why the proposed approach was chosen or what bug you are +attempting to solve. + +Signed-off-by: Contributor +``` + +#### OpenZFS Patch Ports +If you are porting OpenZFS patches, the commit message must meet +the following guidelines: +* The first line must be the summary line from the most important OpenZFS commit being ported. +It must begin with `OpenZFS dddd, dddd - ` where `dddd` are OpenZFS issue numbers. +* Provides a `Authored by:` line to attribute each patch for each original author. +* Provides the `Reviewed by:` and `Approved by:` lines from each original +OpenZFS commit. +* Provides a `Ported-by:` line with the developer's name followed by +their email for each OpenZFS commit. +* Provides a `OpenZFS-issue:` line with link for each original illumos +issue. +* Provides a `OpenZFS-commit:` line with link for each original OpenZFS commit. +* If necessary, provide some porting notes to describe any deviations from +the original OpenZFS commits. + +An example OpenZFS patch port commit message for a single patch is provided +below. +``` +OpenZFS 1234 - Summary from the original OpenZFS commit + +Authored by: Original Author +Reviewed by: Reviewer One +Reviewed by: Reviewer Two +Approved by: Approver One +Ported-by: ZFS Contributor + +Provide some porting notes here if necessary. + +OpenZFS-issue: https://www.illumos.org/issues/1234 +OpenZFS-commit: https://github.com/openzfs/openzfs/commit/abcd1234 +``` + +If necessary, multiple OpenZFS patches can be combined in a single port. +This is useful when you are porting a new patch and its subsequent bug +fixes. An example commit message is provided below. +``` +OpenZFS 1234, 5678 - Summary of most important OpenZFS commit + +1234 Summary from original OpenZFS commit for 1234 + +Authored by: Original Author +Reviewed by: Reviewer Two +Approved by: Approver One +Ported-by: ZFS Contributor + +Provide some porting notes here for 1234 if necessary. + +OpenZFS-issue: https://www.illumos.org/issues/1234 +OpenZFS-commit: https://github.com/openzfs/openzfs/commit/abcd1234 + +5678 Summary from original OpenZFS commit for 5678 + +Authored by: Original Author2 +Reviewed by: Reviewer One +Approved by: Approver Two +Ported-by: ZFS Contributor + +Provide some porting notes here for 5678 if necessary. + +OpenZFS-issue: https://www.illumos.org/issues/5678 +OpenZFS-commit: https://github.com/openzfs/openzfs/commit/efgh5678 +``` + +#### Coverity Defect Fixes +If you are submitting a fix to a +[Coverity defect](https://scan.coverity.com/projects/zfsonlinux-zfs), +the commit message should meet the following guidelines: +* Provides a subject line in the format of +`Fix coverity defects: CID dddd, dddd...` where `dddd` represents +each CID fixed by the commit. +* Provides a body which lists each Coverity defect and how it was corrected. +* The last line must be a `Signed-off-by:` tag. See the +[Signed Off By](#signed-off-by) section for more information. + +An example Coverity defect fix commit message is provided below. +``` +Fix coverity defects: CID 12345, 67890 + +CID 12345: Logically dead code (DEADCODE) + +Removed the if(var != 0) block because the condition could never be +satisfied. + +CID 67890: Resource Leak (RESOURCE_LEAK) + +Ensure free is called after allocating memory in function(). + +Signed-off-by: Contributor +``` + +#### Signed Off By +A line tagged as `Signed-off-by:` must contain the developer's +name followed by their email. This is the developer's certification +that they have the right to submit the patch for inclusion into +the code base and indicates agreement to the [Developer's Certificate +of Origin](https://www.kernel.org/doc/html/latest/process/submitting-patches.html#sign-your-work-the-developer-s-certificate-of-origin). +Code without a proper signoff cannot be merged. + +Git can append the `Signed-off-by` line to your commit messages. Simply +provide the `-s` or `--signoff` option when performing a `git commit`. +For more information about writing commit messages, visit [How to Write +a Git Commit Message](https://chris.beams.io/posts/git-commit/). + +#### Co-authored By +If someone else had part in your pull request, please add the following to the commit: +`Co-authored-by: Name ` +This is useful if their authorship was lost during squashing, rebasing, etc., +but may be used in any situation where there are co-authors. + +The email address used here should be the same as on the GitHub profile of said user. +If said user does not have their email address public, please use the following instead: +`Co-authored-by: Name <[username]@users.noreply.github.com>` Index: vendor-sys/openzfs/dist/cmd/zfs/zfs_main.c =================================================================== --- vendor-sys/openzfs/dist/cmd/zfs/zfs_main.c (revision 366347) +++ vendor-sys/openzfs/dist/cmd/zfs/zfs_main.c (revision 366348) @@ -1,8638 +1,8640 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2011, 2020 by Delphix. All rights reserved. * Copyright 2012 Milan Jurik. All rights reserved. * Copyright (c) 2012, Joyent, Inc. All rights reserved. * Copyright (c) 2013 Steven Hartland. All rights reserved. * Copyright 2016 Igor Kozhukhov . * Copyright 2016 Nexenta Systems, Inc. * Copyright (c) 2019 Datto Inc. * Copyright (c) 2019, loli10K * Copyright 2019 Joyent, Inc. * Copyright (c) 2019, 2020 by Christian Schwarz. All rights reserved. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef HAVE_IDMAP #include #include #endif /* HAVE_IDMAP */ #include "zfs_iter.h" #include "zfs_util.h" #include "zfs_comutil.h" #include "libzfs_impl.h" #include "zfs_projectutil.h" libzfs_handle_t *g_zfs; static FILE *mnttab_file; static char history_str[HIS_MAX_RECORD_LEN]; static boolean_t log_history = B_TRUE; static int zfs_do_clone(int argc, char **argv); static int zfs_do_create(int argc, char **argv); static int zfs_do_destroy(int argc, char **argv); static int zfs_do_get(int argc, char **argv); static int zfs_do_inherit(int argc, char **argv); static int zfs_do_list(int argc, char **argv); static int zfs_do_mount(int argc, char **argv); static int zfs_do_rename(int argc, char **argv); static int zfs_do_rollback(int argc, char **argv); static int zfs_do_set(int argc, char **argv); static int zfs_do_upgrade(int argc, char **argv); static int zfs_do_snapshot(int argc, char **argv); static int zfs_do_unmount(int argc, char **argv); static int zfs_do_share(int argc, char **argv); static int zfs_do_unshare(int argc, char **argv); static int zfs_do_send(int argc, char **argv); static int zfs_do_receive(int argc, char **argv); static int zfs_do_promote(int argc, char **argv); static int zfs_do_userspace(int argc, char **argv); static int zfs_do_allow(int argc, char **argv); static int zfs_do_unallow(int argc, char **argv); static int zfs_do_hold(int argc, char **argv); static int zfs_do_holds(int argc, char **argv); static int zfs_do_release(int argc, char **argv); static int zfs_do_diff(int argc, char **argv); static int zfs_do_bookmark(int argc, char **argv); static int zfs_do_channel_program(int argc, char **argv); static int zfs_do_load_key(int argc, char **argv); static int zfs_do_unload_key(int argc, char **argv); static int zfs_do_change_key(int argc, char **argv); static int zfs_do_project(int argc, char **argv); static int zfs_do_version(int argc, char **argv); static int zfs_do_redact(int argc, char **argv); static int zfs_do_wait(int argc, char **argv); #ifdef __FreeBSD__ static int zfs_do_jail(int argc, char **argv); static int zfs_do_unjail(int argc, char **argv); #endif /* * Enable a reasonable set of defaults for libumem debugging on DEBUG builds. */ #ifdef DEBUG const char * _umem_debug_init(void) { return ("default,verbose"); /* $UMEM_DEBUG setting */ } const char * _umem_logging_init(void) { return ("fail,contents"); /* $UMEM_LOGGING setting */ } #endif typedef enum { HELP_CLONE, HELP_CREATE, HELP_DESTROY, HELP_GET, HELP_INHERIT, HELP_UPGRADE, HELP_LIST, HELP_MOUNT, HELP_PROMOTE, HELP_RECEIVE, HELP_RENAME, HELP_ROLLBACK, HELP_SEND, HELP_SET, HELP_SHARE, HELP_SNAPSHOT, HELP_UNMOUNT, HELP_UNSHARE, HELP_ALLOW, HELP_UNALLOW, HELP_USERSPACE, HELP_GROUPSPACE, HELP_PROJECTSPACE, HELP_PROJECT, HELP_HOLD, HELP_HOLDS, HELP_RELEASE, HELP_DIFF, HELP_BOOKMARK, HELP_CHANNEL_PROGRAM, HELP_LOAD_KEY, HELP_UNLOAD_KEY, HELP_CHANGE_KEY, HELP_VERSION, HELP_REDACT, HELP_JAIL, HELP_UNJAIL, HELP_WAIT, } zfs_help_t; typedef struct zfs_command { const char *name; int (*func)(int argc, char **argv); zfs_help_t usage; } zfs_command_t; /* * Master command table. Each ZFS command has a name, associated function, and * usage message. The usage messages need to be internationalized, so we have * to have a function to return the usage message based on a command index. * * These commands are organized according to how they are displayed in the usage * message. An empty command (one with a NULL name) indicates an empty line in * the generic usage message. */ static zfs_command_t command_table[] = { { "version", zfs_do_version, HELP_VERSION }, { NULL }, { "create", zfs_do_create, HELP_CREATE }, { "destroy", zfs_do_destroy, HELP_DESTROY }, { NULL }, { "snapshot", zfs_do_snapshot, HELP_SNAPSHOT }, { "rollback", zfs_do_rollback, HELP_ROLLBACK }, { "clone", zfs_do_clone, HELP_CLONE }, { "promote", zfs_do_promote, HELP_PROMOTE }, { "rename", zfs_do_rename, HELP_RENAME }, { "bookmark", zfs_do_bookmark, HELP_BOOKMARK }, { "program", zfs_do_channel_program, HELP_CHANNEL_PROGRAM }, { NULL }, { "list", zfs_do_list, HELP_LIST }, { NULL }, { "set", zfs_do_set, HELP_SET }, { "get", zfs_do_get, HELP_GET }, { "inherit", zfs_do_inherit, HELP_INHERIT }, { "upgrade", zfs_do_upgrade, HELP_UPGRADE }, { NULL }, { "userspace", zfs_do_userspace, HELP_USERSPACE }, { "groupspace", zfs_do_userspace, HELP_GROUPSPACE }, { "projectspace", zfs_do_userspace, HELP_PROJECTSPACE }, { NULL }, { "project", zfs_do_project, HELP_PROJECT }, { NULL }, { "mount", zfs_do_mount, HELP_MOUNT }, { "unmount", zfs_do_unmount, HELP_UNMOUNT }, { "share", zfs_do_share, HELP_SHARE }, { "unshare", zfs_do_unshare, HELP_UNSHARE }, { NULL }, { "send", zfs_do_send, HELP_SEND }, { "receive", zfs_do_receive, HELP_RECEIVE }, { NULL }, { "allow", zfs_do_allow, HELP_ALLOW }, { NULL }, { "unallow", zfs_do_unallow, HELP_UNALLOW }, { NULL }, { "hold", zfs_do_hold, HELP_HOLD }, { "holds", zfs_do_holds, HELP_HOLDS }, { "release", zfs_do_release, HELP_RELEASE }, { "diff", zfs_do_diff, HELP_DIFF }, { "load-key", zfs_do_load_key, HELP_LOAD_KEY }, { "unload-key", zfs_do_unload_key, HELP_UNLOAD_KEY }, { "change-key", zfs_do_change_key, HELP_CHANGE_KEY }, { "redact", zfs_do_redact, HELP_REDACT }, { "wait", zfs_do_wait, HELP_WAIT }, #ifdef __FreeBSD__ { "jail", zfs_do_jail, HELP_JAIL }, { "unjail", zfs_do_unjail, HELP_UNJAIL }, #endif }; #define NCOMMAND (sizeof (command_table) / sizeof (command_table[0])) zfs_command_t *current_command; static const char * get_usage(zfs_help_t idx) { switch (idx) { case HELP_CLONE: return (gettext("\tclone [-p] [-o property=value] ... " " \n")); case HELP_CREATE: return (gettext("\tcreate [-Pnpv] [-o property=value] ... " "\n" "\tcreate [-Pnpsv] [-b blocksize] [-o property=value] ... " "-V \n")); case HELP_DESTROY: return (gettext("\tdestroy [-fnpRrv] \n" "\tdestroy [-dnpRrv] " "@[%][,...]\n" "\tdestroy #\n")); case HELP_GET: return (gettext("\tget [-rHp] [-d max] " "[-o \"all\" | field[,...]]\n" "\t [-t type[,...]] [-s source[,...]]\n" "\t <\"all\" | property[,...]> " "[filesystem|volume|snapshot|bookmark] ...\n")); case HELP_INHERIT: return (gettext("\tinherit [-rS] " " ...\n")); case HELP_UPGRADE: return (gettext("\tupgrade [-v]\n" "\tupgrade [-r] [-V version] <-a | filesystem ...>\n")); case HELP_LIST: return (gettext("\tlist [-Hp] [-r|-d max] [-o property[,...]] " "[-s property]...\n\t [-S property]... [-t type[,...]] " "[filesystem|volume|snapshot] ...\n")); case HELP_MOUNT: return (gettext("\tmount\n" "\tmount [-flvO] [-o opts] <-a | filesystem>\n")); case HELP_PROMOTE: return (gettext("\tpromote \n")); case HELP_RECEIVE: return (gettext("\treceive [-vMnsFhu] " "[-o =] ... [-x ] ...\n" "\t \n" "\treceive [-vMnsFhu] [-o =] ... " "[-x ] ... \n" "\t [-d | -e] \n" "\treceive -A \n")); case HELP_RENAME: return (gettext("\trename [-f] " "\n" "\trename -p [-f] \n" "\trename -u [-f] \n" "\trename -r \n")); case HELP_ROLLBACK: return (gettext("\trollback [-rRf] \n")); case HELP_SEND: return (gettext("\tsend [-DnPpRvLecwhb] [-[i|I] snapshot] " "\n" "\tsend [-nvPLecw] [-i snapshot|bookmark] " "\n" "\tsend [-DnPpvLec] [-i bookmark|snapshot] " "--redact \n" "\tsend [-nvPe] -t \n" "\tsend [-Pnv] --saved filesystem\n")); case HELP_SET: return (gettext("\tset ... " " ...\n")); case HELP_SHARE: return (gettext("\tshare [-l] <-a [nfs|smb] | filesystem>\n")); case HELP_SNAPSHOT: return (gettext("\tsnapshot [-r] [-o property=value] ... " "@ ...\n")); case HELP_UNMOUNT: return (gettext("\tunmount [-fu] " "<-a | filesystem|mountpoint>\n")); case HELP_UNSHARE: return (gettext("\tunshare " "<-a [nfs|smb] | filesystem|mountpoint>\n")); case HELP_ALLOW: return (gettext("\tallow \n" "\tallow [-ldug] " "<\"everyone\"|user|group>[,...] [,...]\n" "\t \n" "\tallow [-ld] -e [,...] " "\n" "\tallow -c [,...] \n" "\tallow -s @setname [,...] " "\n")); case HELP_UNALLOW: return (gettext("\tunallow [-rldug] " "<\"everyone\"|user|group>[,...]\n" "\t [[,...]] \n" "\tunallow [-rld] -e [[,...]] " "\n" "\tunallow [-r] -c [[,...]] " "\n" "\tunallow [-r] -s @setname [[,...]] " "\n")); case HELP_USERSPACE: return (gettext("\tuserspace [-Hinp] [-o field[,...]] " "[-s field] ...\n" "\t [-S field] ... [-t type[,...]] " - "\n")); + "\n")); case HELP_GROUPSPACE: return (gettext("\tgroupspace [-Hinp] [-o field[,...]] " "[-s field] ...\n" "\t [-S field] ... [-t type[,...]] " - "\n")); + "\n")); case HELP_PROJECTSPACE: return (gettext("\tprojectspace [-Hp] [-o field[,...]] " "[-s field] ... \n" - "\t [-S field] ... \n")); + "\t [-S field] ... \n")); case HELP_PROJECT: return (gettext("\tproject [-d|-r] \n" "\tproject -c [-0] [-d|-r] [-p id] \n" "\tproject -C [-k] [-r] \n" "\tproject [-p id] [-r] [-s] \n")); case HELP_HOLD: return (gettext("\thold [-r] ...\n")); case HELP_HOLDS: return (gettext("\tholds [-rH] ...\n")); case HELP_RELEASE: return (gettext("\trelease [-r] ...\n")); case HELP_DIFF: return (gettext("\tdiff [-FHt] " "[snapshot|filesystem]\n")); case HELP_BOOKMARK: return (gettext("\tbookmark " "\n")); case HELP_CHANNEL_PROGRAM: return (gettext("\tprogram [-jn] [-t ] " "[-m ]\n" "\t [lua args...]\n")); case HELP_LOAD_KEY: return (gettext("\tload-key [-rn] [-L ] " "<-a | filesystem|volume>\n")); case HELP_UNLOAD_KEY: return (gettext("\tunload-key [-r] " "<-a | filesystem|volume>\n")); case HELP_CHANGE_KEY: return (gettext("\tchange-key [-l] [-o keyformat=]\n" "\t [-o keylocation=] [-o pbkdf2iters=]\n" "\t \n" "\tchange-key -i [-l] \n")); case HELP_VERSION: return (gettext("\tversion\n")); case HELP_REDACT: return (gettext("\tredact " " ...\n")); case HELP_JAIL: return (gettext("\tjail \n")); case HELP_UNJAIL: return (gettext("\tunjail \n")); case HELP_WAIT: return (gettext("\twait [-t ] \n")); } abort(); /* NOTREACHED */ } void nomem(void) { (void) fprintf(stderr, gettext("internal error: out of memory\n")); exit(1); } /* * Utility function to guarantee malloc() success. */ void * safe_malloc(size_t size) { void *data; if ((data = calloc(1, size)) == NULL) nomem(); return (data); } static void * safe_realloc(void *data, size_t size) { void *newp; if ((newp = realloc(data, size)) == NULL) { free(data); nomem(); } return (newp); } static char * safe_strdup(char *str) { char *dupstr = strdup(str); if (dupstr == NULL) nomem(); return (dupstr); } /* * Callback routine that will print out information for each of * the properties. */ static int usage_prop_cb(int prop, void *cb) { FILE *fp = cb; (void) fprintf(fp, "\t%-15s ", zfs_prop_to_name(prop)); if (zfs_prop_readonly(prop)) (void) fprintf(fp, " NO "); else (void) fprintf(fp, "YES "); if (zfs_prop_inheritable(prop)) (void) fprintf(fp, " YES "); else (void) fprintf(fp, " NO "); if (zfs_prop_values(prop) == NULL) (void) fprintf(fp, "-\n"); else (void) fprintf(fp, "%s\n", zfs_prop_values(prop)); return (ZPROP_CONT); } /* * Display usage message. If we're inside a command, display only the usage for * that command. Otherwise, iterate over the entire command table and display * a complete usage message. */ static void usage(boolean_t requested) { int i; boolean_t show_properties = B_FALSE; FILE *fp = requested ? stdout : stderr; if (current_command == NULL) { (void) fprintf(fp, gettext("usage: zfs command args ...\n")); (void) fprintf(fp, gettext("where 'command' is one of the following:\n\n")); for (i = 0; i < NCOMMAND; i++) { if (command_table[i].name == NULL) (void) fprintf(fp, "\n"); else (void) fprintf(fp, "%s", get_usage(command_table[i].usage)); } (void) fprintf(fp, gettext("\nEach dataset is of the form: " "pool/[dataset/]*dataset[@name]\n")); } else { (void) fprintf(fp, gettext("usage:\n")); (void) fprintf(fp, "%s", get_usage(current_command->usage)); } if (current_command != NULL && (strcmp(current_command->name, "set") == 0 || strcmp(current_command->name, "get") == 0 || strcmp(current_command->name, "inherit") == 0 || strcmp(current_command->name, "list") == 0)) show_properties = B_TRUE; if (show_properties) { (void) fprintf(fp, gettext("\nThe following properties are supported:\n")); (void) fprintf(fp, "\n\t%-14s %s %s %s\n\n", "PROPERTY", "EDIT", "INHERIT", "VALUES"); /* Iterate over all properties */ (void) zprop_iter(usage_prop_cb, fp, B_FALSE, B_TRUE, ZFS_TYPE_DATASET); (void) fprintf(fp, "\t%-15s ", "userused@..."); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, "\t%-15s ", "groupused@..."); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, "\t%-15s ", "projectused@..."); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, "\t%-15s ", "userobjused@..."); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, "\t%-15s ", "groupobjused@..."); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, "\t%-15s ", "projectobjused@..."); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, "\t%-15s ", "userquota@..."); (void) fprintf(fp, "YES NO | none\n"); (void) fprintf(fp, "\t%-15s ", "groupquota@..."); (void) fprintf(fp, "YES NO | none\n"); (void) fprintf(fp, "\t%-15s ", "projectquota@..."); (void) fprintf(fp, "YES NO | none\n"); (void) fprintf(fp, "\t%-15s ", "userobjquota@..."); (void) fprintf(fp, "YES NO | none\n"); (void) fprintf(fp, "\t%-15s ", "groupobjquota@..."); (void) fprintf(fp, "YES NO | none\n"); (void) fprintf(fp, "\t%-15s ", "projectobjquota@..."); (void) fprintf(fp, "YES NO | none\n"); (void) fprintf(fp, "\t%-15s ", "written@"); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, "\t%-15s ", "written#"); (void) fprintf(fp, " NO NO \n"); (void) fprintf(fp, gettext("\nSizes are specified in bytes " "with standard units such as K, M, G, etc.\n")); (void) fprintf(fp, gettext("\nUser-defined properties can " "be specified by using a name containing a colon (:).\n")); (void) fprintf(fp, gettext("\nThe {user|group|project}" "[obj]{used|quota}@ properties must be appended with\n" "a user|group|project specifier of one of these forms:\n" " POSIX name (eg: \"matt\")\n" " POSIX id (eg: \"126829\")\n" " SMB name@domain (eg: \"matt@sun\")\n" " SMB SID (eg: \"S-1-234-567-89\")\n")); } else { (void) fprintf(fp, gettext("\nFor the property list, run: %s\n"), "zfs set|get"); (void) fprintf(fp, gettext("\nFor the delegated permission list, run: %s\n"), "zfs allow|unallow"); } /* * See comments at end of main(). */ if (getenv("ZFS_ABORT") != NULL) { (void) printf("dumping core by request\n"); abort(); } exit(requested ? 0 : 2); } /* * Take a property=value argument string and add it to the given nvlist. * Modifies the argument inplace. */ static boolean_t parseprop(nvlist_t *props, char *propname) { char *propval; if ((propval = strchr(propname, '=')) == NULL) { (void) fprintf(stderr, gettext("missing " "'=' for property=value argument\n")); return (B_FALSE); } *propval = '\0'; propval++; if (nvlist_exists(props, propname)) { (void) fprintf(stderr, gettext("property '%s' " "specified multiple times\n"), propname); return (B_FALSE); } if (nvlist_add_string(props, propname, propval) != 0) nomem(); return (B_TRUE); } /* * Take a property name argument and add it to the given nvlist. * Modifies the argument inplace. */ static boolean_t parsepropname(nvlist_t *props, char *propname) { if (strchr(propname, '=') != NULL) { (void) fprintf(stderr, gettext("invalid character " "'=' in property argument\n")); return (B_FALSE); } if (nvlist_exists(props, propname)) { (void) fprintf(stderr, gettext("property '%s' " "specified multiple times\n"), propname); return (B_FALSE); } if (nvlist_add_boolean(props, propname) != 0) nomem(); return (B_TRUE); } static int parse_depth(char *opt, int *flags) { char *tmp; int depth; depth = (int)strtol(opt, &tmp, 0); if (*tmp) { (void) fprintf(stderr, gettext("%s is not an integer\n"), optarg); usage(B_FALSE); } if (depth < 0) { (void) fprintf(stderr, gettext("Depth can not be negative.\n")); usage(B_FALSE); } *flags |= (ZFS_ITER_DEPTH_LIMIT|ZFS_ITER_RECURSE); return (depth); } #define PROGRESS_DELAY 2 /* seconds */ static char *pt_reverse = "\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b"; static time_t pt_begin; static char *pt_header = NULL; static boolean_t pt_shown; static void start_progress_timer(void) { pt_begin = time(NULL) + PROGRESS_DELAY; pt_shown = B_FALSE; } static void set_progress_header(char *header) { assert(pt_header == NULL); pt_header = safe_strdup(header); if (pt_shown) { (void) printf("%s: ", header); (void) fflush(stdout); } } static void update_progress(char *update) { if (!pt_shown && time(NULL) > pt_begin) { int len = strlen(update); (void) printf("%s: %s%*.*s", pt_header, update, len, len, pt_reverse); (void) fflush(stdout); pt_shown = B_TRUE; } else if (pt_shown) { int len = strlen(update); (void) printf("%s%*.*s", update, len, len, pt_reverse); (void) fflush(stdout); } } static void finish_progress(char *done) { if (pt_shown) { (void) printf("%s\n", done); (void) fflush(stdout); } free(pt_header); pt_header = NULL; } static int zfs_mount_and_share(libzfs_handle_t *hdl, const char *dataset, zfs_type_t type) { zfs_handle_t *zhp = NULL; int ret = 0; zhp = zfs_open(hdl, dataset, type); if (zhp == NULL) return (1); /* * Volumes may neither be mounted or shared. Potentially in the * future filesystems detected on these volumes could be mounted. */ if (zfs_get_type(zhp) == ZFS_TYPE_VOLUME) { zfs_close(zhp); return (0); } /* * Mount and/or share the new filesystem as appropriate. We provide a * verbose error message to let the user know that their filesystem was * in fact created, even if we failed to mount or share it. * * If the user doesn't want the dataset automatically mounted, then * skip the mount/share step */ if (zfs_prop_valid_for_type(ZFS_PROP_CANMOUNT, type, B_FALSE) && zfs_prop_get_int(zhp, ZFS_PROP_CANMOUNT) == ZFS_CANMOUNT_ON) { if (zfs_mount_delegation_check()) { (void) fprintf(stderr, gettext("filesystem " "successfully created, but it may only be " "mounted by root\n")); ret = 1; } else if (zfs_mount(zhp, NULL, 0) != 0) { (void) fprintf(stderr, gettext("filesystem " "successfully created, but not mounted\n")); ret = 1; } else if (zfs_share(zhp) != 0) { (void) fprintf(stderr, gettext("filesystem " "successfully created, but not shared\n")); ret = 1; } zfs_commit_all_shares(); } zfs_close(zhp); return (ret); } /* * zfs clone [-p] [-o prop=value] ... * * Given an existing dataset, create a writable copy whose initial contents * are the same as the source. The newly created dataset maintains a * dependency on the original; the original cannot be destroyed so long as * the clone exists. * * The '-p' flag creates all the non-existing ancestors of the target first. */ static int zfs_do_clone(int argc, char **argv) { zfs_handle_t *zhp = NULL; boolean_t parents = B_FALSE; nvlist_t *props; int ret = 0; int c; if (nvlist_alloc(&props, NV_UNIQUE_NAME, 0) != 0) nomem(); /* check options */ while ((c = getopt(argc, argv, "o:p")) != -1) { switch (c) { case 'o': if (!parseprop(props, optarg)) { nvlist_free(props); return (1); } break; case 'p': parents = B_TRUE; break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); goto usage; } } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 1) { (void) fprintf(stderr, gettext("missing source dataset " "argument\n")); goto usage; } if (argc < 2) { (void) fprintf(stderr, gettext("missing target dataset " "argument\n")); goto usage; } if (argc > 2) { (void) fprintf(stderr, gettext("too many arguments\n")); goto usage; } /* open the source dataset */ if ((zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_SNAPSHOT)) == NULL) { nvlist_free(props); return (1); } if (parents && zfs_name_valid(argv[1], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME)) { /* * Now create the ancestors of the target dataset. If the * target already exists and '-p' option was used we should not * complain. */ if (zfs_dataset_exists(g_zfs, argv[1], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME)) { zfs_close(zhp); nvlist_free(props); return (0); } if (zfs_create_ancestors(g_zfs, argv[1]) != 0) { zfs_close(zhp); nvlist_free(props); return (1); } } /* pass to libzfs */ ret = zfs_clone(zhp, argv[1], props); /* create the mountpoint if necessary */ if (ret == 0) { if (log_history) { (void) zpool_log_history(g_zfs, history_str); log_history = B_FALSE; } ret = zfs_mount_and_share(g_zfs, argv[1], ZFS_TYPE_DATASET); } zfs_close(zhp); nvlist_free(props); return (!!ret); usage: ASSERT3P(zhp, ==, NULL); nvlist_free(props); usage(B_FALSE); return (-1); } /* * zfs create [-Pnpv] [-o prop=value] ... fs * zfs create [-Pnpsv] [-b blocksize] [-o prop=value] ... -V vol size * * Create a new dataset. This command can be used to create filesystems * and volumes. Snapshot creation is handled by 'zfs snapshot'. * For volumes, the user must specify a size to be used. * * The '-s' flag applies only to volumes, and indicates that we should not try * to set the reservation for this volume. By default we set a reservation * equal to the size for any volume. For pools with SPA_VERSION >= * SPA_VERSION_REFRESERVATION, we set a refreservation instead. * * The '-p' flag creates all the non-existing ancestors of the target first. * * The '-n' flag is no-op (dry run) mode. This will perform a user-space sanity * check of arguments and properties, but does not check for permissions, * available space, etc. * * The '-v' flag is for verbose output. * * The '-P' flag is used for parseable output. It implies '-v'. */ static int zfs_do_create(int argc, char **argv) { zfs_type_t type = ZFS_TYPE_FILESYSTEM; zpool_handle_t *zpool_handle = NULL; nvlist_t *real_props = NULL; uint64_t volsize = 0; int c; boolean_t noreserve = B_FALSE; boolean_t bflag = B_FALSE; boolean_t parents = B_FALSE; boolean_t dryrun = B_FALSE; boolean_t verbose = B_FALSE; boolean_t parseable = B_FALSE; int ret = 1; nvlist_t *props; uint64_t intval; if (nvlist_alloc(&props, NV_UNIQUE_NAME, 0) != 0) nomem(); /* check options */ while ((c = getopt(argc, argv, ":PV:b:nso:pv")) != -1) { switch (c) { case 'V': type = ZFS_TYPE_VOLUME; if (zfs_nicestrtonum(g_zfs, optarg, &intval) != 0) { (void) fprintf(stderr, gettext("bad volume " "size '%s': %s\n"), optarg, libzfs_error_description(g_zfs)); goto error; } if (nvlist_add_uint64(props, zfs_prop_to_name(ZFS_PROP_VOLSIZE), intval) != 0) nomem(); volsize = intval; break; case 'P': verbose = B_TRUE; parseable = B_TRUE; break; case 'p': parents = B_TRUE; break; case 'b': bflag = B_TRUE; if (zfs_nicestrtonum(g_zfs, optarg, &intval) != 0) { (void) fprintf(stderr, gettext("bad volume " "block size '%s': %s\n"), optarg, libzfs_error_description(g_zfs)); goto error; } if (nvlist_add_uint64(props, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), intval) != 0) nomem(); break; case 'n': dryrun = B_TRUE; break; case 'o': if (!parseprop(props, optarg)) goto error; break; case 's': noreserve = B_TRUE; break; case 'v': verbose = B_TRUE; break; case ':': (void) fprintf(stderr, gettext("missing size " "argument\n")); goto badusage; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); goto badusage; } } if ((bflag || noreserve) && type != ZFS_TYPE_VOLUME) { (void) fprintf(stderr, gettext("'-s' and '-b' can only be " "used when creating a volume\n")); goto badusage; } argc -= optind; argv += optind; /* check number of arguments */ if (argc == 0) { (void) fprintf(stderr, gettext("missing %s argument\n"), zfs_type_to_name(type)); goto badusage; } if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); goto badusage; } if (dryrun || (type == ZFS_TYPE_VOLUME && !noreserve)) { char msg[ZFS_MAX_DATASET_NAME_LEN * 2]; char *p; if ((p = strchr(argv[0], '/')) != NULL) *p = '\0'; zpool_handle = zpool_open(g_zfs, argv[0]); if (p != NULL) *p = '/'; if (zpool_handle == NULL) goto error; (void) snprintf(msg, sizeof (msg), dryrun ? gettext("cannot verify '%s'") : gettext("cannot create '%s'"), argv[0]); if (props && (real_props = zfs_valid_proplist(g_zfs, type, props, 0, NULL, zpool_handle, B_TRUE, msg)) == NULL) { zpool_close(zpool_handle); goto error; } } /* * if volsize is not a multiple of volblocksize, round it up to the * nearest multiple of the volblocksize */ if (type == ZFS_TYPE_VOLUME) { uint64_t volblocksize; if (nvlist_lookup_uint64(props, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), &volblocksize) != 0) volblocksize = ZVOL_DEFAULT_BLOCKSIZE; if (volsize % volblocksize) { volsize = P2ROUNDUP_TYPED(volsize, volblocksize, uint64_t); if (nvlist_add_uint64(props, zfs_prop_to_name(ZFS_PROP_VOLSIZE), volsize) != 0) { nvlist_free(props); nomem(); } } } if (type == ZFS_TYPE_VOLUME && !noreserve) { uint64_t spa_version; zfs_prop_t resv_prop; char *strval; spa_version = zpool_get_prop_int(zpool_handle, ZPOOL_PROP_VERSION, NULL); if (spa_version >= SPA_VERSION_REFRESERVATION) resv_prop = ZFS_PROP_REFRESERVATION; else resv_prop = ZFS_PROP_RESERVATION; volsize = zvol_volsize_to_reservation(zpool_handle, volsize, real_props); if (nvlist_lookup_string(props, zfs_prop_to_name(resv_prop), &strval) != 0) { if (nvlist_add_uint64(props, zfs_prop_to_name(resv_prop), volsize) != 0) { nvlist_free(props); nomem(); } } } if (zpool_handle != NULL) { zpool_close(zpool_handle); nvlist_free(real_props); } if (parents && zfs_name_valid(argv[0], type)) { /* * Now create the ancestors of target dataset. If the target * already exists and '-p' option was used we should not * complain. */ if (zfs_dataset_exists(g_zfs, argv[0], type)) { ret = 0; goto error; } if (verbose) { (void) printf(parseable ? "create_ancestors\t%s\n" : dryrun ? "would create ancestors of %s\n" : "create ancestors of %s\n", argv[0]); } if (!dryrun) { if (zfs_create_ancestors(g_zfs, argv[0]) != 0) { goto error; } } } if (verbose) { nvpair_t *nvp = NULL; (void) printf(parseable ? "create\t%s\n" : dryrun ? "would create %s\n" : "create %s\n", argv[0]); while ((nvp = nvlist_next_nvpair(props, nvp)) != NULL) { uint64_t uval; char *sval; switch (nvpair_type(nvp)) { case DATA_TYPE_UINT64: VERIFY0(nvpair_value_uint64(nvp, &uval)); (void) printf(parseable ? "property\t%s\t%llu\n" : "\t%s=%llu\n", nvpair_name(nvp), (u_longlong_t)uval); break; case DATA_TYPE_STRING: VERIFY0(nvpair_value_string(nvp, &sval)); (void) printf(parseable ? "property\t%s\t%s\n" : "\t%s=%s\n", nvpair_name(nvp), sval); break; default: (void) fprintf(stderr, "property '%s' " "has illegal type %d\n", nvpair_name(nvp), nvpair_type(nvp)); abort(); } } } if (dryrun) { ret = 0; goto error; } /* pass to libzfs */ if (zfs_create(g_zfs, argv[0], type, props) != 0) goto error; if (log_history) { (void) zpool_log_history(g_zfs, history_str); log_history = B_FALSE; } ret = zfs_mount_and_share(g_zfs, argv[0], ZFS_TYPE_DATASET); error: nvlist_free(props); return (ret); badusage: nvlist_free(props); usage(B_FALSE); return (2); } /* * zfs destroy [-rRf] * zfs destroy [-rRd] * * -r Recursively destroy all children * -R Recursively destroy all dependents, including clones * -f Force unmounting of any dependents * -d If we can't destroy now, mark for deferred destruction * * Destroys the given dataset. By default, it will unmount any filesystems, * and refuse to destroy a dataset that has any dependents. A dependent can * either be a child, or a clone of a child. */ typedef struct destroy_cbdata { boolean_t cb_first; boolean_t cb_force; boolean_t cb_recurse; boolean_t cb_error; boolean_t cb_doclones; zfs_handle_t *cb_target; boolean_t cb_defer_destroy; boolean_t cb_verbose; boolean_t cb_parsable; boolean_t cb_dryrun; nvlist_t *cb_nvl; nvlist_t *cb_batchedsnaps; /* first snap in contiguous run */ char *cb_firstsnap; /* previous snap in contiguous run */ char *cb_prevsnap; int64_t cb_snapused; char *cb_snapspec; char *cb_bookmark; uint64_t cb_snap_count; } destroy_cbdata_t; /* * Check for any dependents based on the '-r' or '-R' flags. */ static int destroy_check_dependent(zfs_handle_t *zhp, void *data) { destroy_cbdata_t *cbp = data; const char *tname = zfs_get_name(cbp->cb_target); const char *name = zfs_get_name(zhp); if (strncmp(tname, name, strlen(tname)) == 0 && (name[strlen(tname)] == '/' || name[strlen(tname)] == '@')) { /* * This is a direct descendant, not a clone somewhere else in * the hierarchy. */ if (cbp->cb_recurse) goto out; if (cbp->cb_first) { (void) fprintf(stderr, gettext("cannot destroy '%s': " "%s has children\n"), zfs_get_name(cbp->cb_target), zfs_type_to_name(zfs_get_type(cbp->cb_target))); (void) fprintf(stderr, gettext("use '-r' to destroy " "the following datasets:\n")); cbp->cb_first = B_FALSE; cbp->cb_error = B_TRUE; } (void) fprintf(stderr, "%s\n", zfs_get_name(zhp)); } else { /* * This is a clone. We only want to report this if the '-r' * wasn't specified, or the target is a snapshot. */ if (!cbp->cb_recurse && zfs_get_type(cbp->cb_target) != ZFS_TYPE_SNAPSHOT) goto out; if (cbp->cb_first) { (void) fprintf(stderr, gettext("cannot destroy '%s': " "%s has dependent clones\n"), zfs_get_name(cbp->cb_target), zfs_type_to_name(zfs_get_type(cbp->cb_target))); (void) fprintf(stderr, gettext("use '-R' to destroy " "the following datasets:\n")); cbp->cb_first = B_FALSE; cbp->cb_error = B_TRUE; cbp->cb_dryrun = B_TRUE; } (void) fprintf(stderr, "%s\n", zfs_get_name(zhp)); } out: zfs_close(zhp); return (0); } static int destroy_batched(destroy_cbdata_t *cb) { int error = zfs_destroy_snaps_nvl(g_zfs, cb->cb_batchedsnaps, B_FALSE); fnvlist_free(cb->cb_batchedsnaps); cb->cb_batchedsnaps = fnvlist_alloc(); return (error); } static int destroy_callback(zfs_handle_t *zhp, void *data) { destroy_cbdata_t *cb = data; const char *name = zfs_get_name(zhp); int error; if (cb->cb_verbose) { if (cb->cb_parsable) { (void) printf("destroy\t%s\n", name); } else if (cb->cb_dryrun) { (void) printf(gettext("would destroy %s\n"), name); } else { (void) printf(gettext("will destroy %s\n"), name); } } /* * Ignore pools (which we've already flagged as an error before getting * here). */ if (strchr(zfs_get_name(zhp), '/') == NULL && zfs_get_type(zhp) == ZFS_TYPE_FILESYSTEM) { zfs_close(zhp); return (0); } if (cb->cb_dryrun) { zfs_close(zhp); return (0); } /* * We batch up all contiguous snapshots (even of different * filesystems) and destroy them with one ioctl. We can't * simply do all snap deletions and then all fs deletions, * because we must delete a clone before its origin. */ if (zfs_get_type(zhp) == ZFS_TYPE_SNAPSHOT) { cb->cb_snap_count++; fnvlist_add_boolean(cb->cb_batchedsnaps, name); if (cb->cb_snap_count % 10 == 0 && cb->cb_defer_destroy) error = destroy_batched(cb); } else { error = destroy_batched(cb); if (error != 0 || zfs_unmount(zhp, NULL, cb->cb_force ? MS_FORCE : 0) != 0 || zfs_destroy(zhp, cb->cb_defer_destroy) != 0) { zfs_close(zhp); /* * When performing a recursive destroy we ignore errors * so that the recursive destroy could continue * destroying past problem datasets */ if (cb->cb_recurse) { cb->cb_error = B_TRUE; return (0); } return (-1); } } zfs_close(zhp); return (0); } static int destroy_print_cb(zfs_handle_t *zhp, void *arg) { destroy_cbdata_t *cb = arg; const char *name = zfs_get_name(zhp); int err = 0; if (nvlist_exists(cb->cb_nvl, name)) { if (cb->cb_firstsnap == NULL) cb->cb_firstsnap = strdup(name); if (cb->cb_prevsnap != NULL) free(cb->cb_prevsnap); /* this snap continues the current range */ cb->cb_prevsnap = strdup(name); if (cb->cb_firstsnap == NULL || cb->cb_prevsnap == NULL) nomem(); if (cb->cb_verbose) { if (cb->cb_parsable) { (void) printf("destroy\t%s\n", name); } else if (cb->cb_dryrun) { (void) printf(gettext("would destroy %s\n"), name); } else { (void) printf(gettext("will destroy %s\n"), name); } } } else if (cb->cb_firstsnap != NULL) { /* end of this range */ uint64_t used = 0; err = lzc_snaprange_space(cb->cb_firstsnap, cb->cb_prevsnap, &used); cb->cb_snapused += used; free(cb->cb_firstsnap); cb->cb_firstsnap = NULL; free(cb->cb_prevsnap); cb->cb_prevsnap = NULL; } zfs_close(zhp); return (err); } static int destroy_print_snapshots(zfs_handle_t *fs_zhp, destroy_cbdata_t *cb) { int err; assert(cb->cb_firstsnap == NULL); assert(cb->cb_prevsnap == NULL); err = zfs_iter_snapshots_sorted(fs_zhp, destroy_print_cb, cb, 0, 0); if (cb->cb_firstsnap != NULL) { uint64_t used = 0; if (err == 0) { err = lzc_snaprange_space(cb->cb_firstsnap, cb->cb_prevsnap, &used); } cb->cb_snapused += used; free(cb->cb_firstsnap); cb->cb_firstsnap = NULL; free(cb->cb_prevsnap); cb->cb_prevsnap = NULL; } return (err); } static int snapshot_to_nvl_cb(zfs_handle_t *zhp, void *arg) { destroy_cbdata_t *cb = arg; int err = 0; /* Check for clones. */ if (!cb->cb_doclones && !cb->cb_defer_destroy) { cb->cb_target = zhp; cb->cb_first = B_TRUE; err = zfs_iter_dependents(zhp, B_TRUE, destroy_check_dependent, cb); } if (err == 0) { if (nvlist_add_boolean(cb->cb_nvl, zfs_get_name(zhp))) nomem(); } zfs_close(zhp); return (err); } static int gather_snapshots(zfs_handle_t *zhp, void *arg) { destroy_cbdata_t *cb = arg; int err = 0; err = zfs_iter_snapspec(zhp, cb->cb_snapspec, snapshot_to_nvl_cb, cb); if (err == ENOENT) err = 0; if (err != 0) goto out; if (cb->cb_verbose) { err = destroy_print_snapshots(zhp, cb); if (err != 0) goto out; } if (cb->cb_recurse) err = zfs_iter_filesystems(zhp, gather_snapshots, cb); out: zfs_close(zhp); return (err); } static int destroy_clones(destroy_cbdata_t *cb) { nvpair_t *pair; for (pair = nvlist_next_nvpair(cb->cb_nvl, NULL); pair != NULL; pair = nvlist_next_nvpair(cb->cb_nvl, pair)) { zfs_handle_t *zhp = zfs_open(g_zfs, nvpair_name(pair), ZFS_TYPE_SNAPSHOT); if (zhp != NULL) { boolean_t defer = cb->cb_defer_destroy; int err; /* * We can't defer destroy non-snapshots, so set it to * false while destroying the clones. */ cb->cb_defer_destroy = B_FALSE; err = zfs_iter_dependents(zhp, B_FALSE, destroy_callback, cb); cb->cb_defer_destroy = defer; zfs_close(zhp); if (err != 0) return (err); } } return (0); } static int zfs_do_destroy(int argc, char **argv) { destroy_cbdata_t cb = { 0 }; int rv = 0; int err = 0; int c; zfs_handle_t *zhp = NULL; char *at, *pound; zfs_type_t type = ZFS_TYPE_DATASET; /* check options */ while ((c = getopt(argc, argv, "vpndfrR")) != -1) { switch (c) { case 'v': cb.cb_verbose = B_TRUE; break; case 'p': cb.cb_verbose = B_TRUE; cb.cb_parsable = B_TRUE; break; case 'n': cb.cb_dryrun = B_TRUE; break; case 'd': cb.cb_defer_destroy = B_TRUE; type = ZFS_TYPE_SNAPSHOT; break; case 'f': cb.cb_force = B_TRUE; break; case 'r': cb.cb_recurse = B_TRUE; break; case 'R': cb.cb_recurse = B_TRUE; cb.cb_doclones = B_TRUE; break; case '?': default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* check number of arguments */ if (argc == 0) { (void) fprintf(stderr, gettext("missing dataset argument\n")); usage(B_FALSE); } if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } at = strchr(argv[0], '@'); pound = strchr(argv[0], '#'); if (at != NULL) { /* Build the list of snaps to destroy in cb_nvl. */ cb.cb_nvl = fnvlist_alloc(); *at = '\0'; zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) { nvlist_free(cb.cb_nvl); return (1); } cb.cb_snapspec = at + 1; if (gather_snapshots(zfs_handle_dup(zhp), &cb) != 0 || cb.cb_error) { rv = 1; goto out; } if (nvlist_empty(cb.cb_nvl)) { (void) fprintf(stderr, gettext("could not find any " "snapshots to destroy; check snapshot names.\n")); rv = 1; goto out; } if (cb.cb_verbose) { char buf[16]; zfs_nicebytes(cb.cb_snapused, buf, sizeof (buf)); if (cb.cb_parsable) { (void) printf("reclaim\t%llu\n", (u_longlong_t)cb.cb_snapused); } else if (cb.cb_dryrun) { (void) printf(gettext("would reclaim %s\n"), buf); } else { (void) printf(gettext("will reclaim %s\n"), buf); } } if (!cb.cb_dryrun) { if (cb.cb_doclones) { cb.cb_batchedsnaps = fnvlist_alloc(); err = destroy_clones(&cb); if (err == 0) { err = zfs_destroy_snaps_nvl(g_zfs, cb.cb_batchedsnaps, B_FALSE); } if (err != 0) { rv = 1; goto out; } } if (err == 0) { err = zfs_destroy_snaps_nvl(g_zfs, cb.cb_nvl, cb.cb_defer_destroy); } } if (err != 0) rv = 1; } else if (pound != NULL) { int err; nvlist_t *nvl; if (cb.cb_dryrun) { (void) fprintf(stderr, "dryrun is not supported with bookmark\n"); return (-1); } if (cb.cb_defer_destroy) { (void) fprintf(stderr, "defer destroy is not supported with bookmark\n"); return (-1); } if (cb.cb_recurse) { (void) fprintf(stderr, "recursive is not supported with bookmark\n"); return (-1); } /* * Unfortunately, zfs_bookmark() doesn't honor the * casesensitivity setting. However, we can't simply * remove this check, because lzc_destroy_bookmarks() * ignores non-existent bookmarks, so this is necessary * to get a proper error message. */ if (!zfs_bookmark_exists(argv[0])) { (void) fprintf(stderr, gettext("bookmark '%s' " "does not exist.\n"), argv[0]); return (1); } nvl = fnvlist_alloc(); fnvlist_add_boolean(nvl, argv[0]); err = lzc_destroy_bookmarks(nvl, NULL); if (err != 0) { (void) zfs_standard_error(g_zfs, err, "cannot destroy bookmark"); } nvlist_free(nvl); return (err); } else { /* Open the given dataset */ if ((zhp = zfs_open(g_zfs, argv[0], type)) == NULL) return (1); cb.cb_target = zhp; /* * Perform an explicit check for pools before going any further. */ if (!cb.cb_recurse && strchr(zfs_get_name(zhp), '/') == NULL && zfs_get_type(zhp) == ZFS_TYPE_FILESYSTEM) { (void) fprintf(stderr, gettext("cannot destroy '%s': " "operation does not apply to pools\n"), zfs_get_name(zhp)); (void) fprintf(stderr, gettext("use 'zfs destroy -r " "%s' to destroy all datasets in the pool\n"), zfs_get_name(zhp)); (void) fprintf(stderr, gettext("use 'zpool destroy %s' " "to destroy the pool itself\n"), zfs_get_name(zhp)); rv = 1; goto out; } /* * Check for any dependents and/or clones. */ cb.cb_first = B_TRUE; if (!cb.cb_doclones && zfs_iter_dependents(zhp, B_TRUE, destroy_check_dependent, &cb) != 0) { rv = 1; goto out; } if (cb.cb_error) { rv = 1; goto out; } cb.cb_batchedsnaps = fnvlist_alloc(); if (zfs_iter_dependents(zhp, B_FALSE, destroy_callback, &cb) != 0) { rv = 1; goto out; } /* * Do the real thing. The callback will close the * handle regardless of whether it succeeds or not. */ err = destroy_callback(zhp, &cb); zhp = NULL; if (err == 0) { err = zfs_destroy_snaps_nvl(g_zfs, cb.cb_batchedsnaps, cb.cb_defer_destroy); } if (err != 0 || cb.cb_error == B_TRUE) rv = 1; } out: fnvlist_free(cb.cb_batchedsnaps); fnvlist_free(cb.cb_nvl); if (zhp != NULL) zfs_close(zhp); return (rv); } static boolean_t is_recvd_column(zprop_get_cbdata_t *cbp) { int i; zfs_get_column_t col; for (i = 0; i < ZFS_GET_NCOLS && (col = cbp->cb_columns[i]) != GET_COL_NONE; i++) if (col == GET_COL_RECVD) return (B_TRUE); return (B_FALSE); } /* * zfs get [-rHp] [-o all | field[,field]...] [-s source[,source]...] * < all | property[,property]... > < fs | snap | vol > ... * * -r recurse over any child datasets * -H scripted mode. Headers are stripped, and fields are separated * by tabs instead of spaces. * -o Set of fields to display. One of "name,property,value, * received,source". Default is "name,property,value,source". * "all" is an alias for all five. * -s Set of sources to allow. One of * "local,default,inherited,received,temporary,none". Default is * all six. * -p Display values in parsable (literal) format. * * Prints properties for the given datasets. The user can control which * columns to display as well as which property types to allow. */ /* * Invoked to display the properties for a single dataset. */ static int get_callback(zfs_handle_t *zhp, void *data) { char buf[ZFS_MAXPROPLEN]; char rbuf[ZFS_MAXPROPLEN]; zprop_source_t sourcetype; char source[ZFS_MAX_DATASET_NAME_LEN]; zprop_get_cbdata_t *cbp = data; nvlist_t *user_props = zfs_get_user_props(zhp); zprop_list_t *pl = cbp->cb_proplist; nvlist_t *propval; char *strval; char *sourceval; boolean_t received = is_recvd_column(cbp); for (; pl != NULL; pl = pl->pl_next) { char *recvdval = NULL; /* * Skip the special fake placeholder. This will also skip over * the name property when 'all' is specified. */ if (pl->pl_prop == ZFS_PROP_NAME && pl == cbp->cb_proplist) continue; if (pl->pl_prop != ZPROP_INVAL) { if (zfs_prop_get(zhp, pl->pl_prop, buf, sizeof (buf), &sourcetype, source, sizeof (source), cbp->cb_literal) != 0) { if (pl->pl_all) continue; if (!zfs_prop_valid_for_type(pl->pl_prop, ZFS_TYPE_DATASET, B_FALSE)) { (void) fprintf(stderr, gettext("No such property '%s'\n"), zfs_prop_to_name(pl->pl_prop)); continue; } sourcetype = ZPROP_SRC_NONE; (void) strlcpy(buf, "-", sizeof (buf)); } if (received && (zfs_prop_get_recvd(zhp, zfs_prop_to_name(pl->pl_prop), rbuf, sizeof (rbuf), cbp->cb_literal) == 0)) recvdval = rbuf; zprop_print_one_property(zfs_get_name(zhp), cbp, zfs_prop_to_name(pl->pl_prop), buf, sourcetype, source, recvdval); } else if (zfs_prop_userquota(pl->pl_user_prop)) { sourcetype = ZPROP_SRC_LOCAL; if (zfs_prop_get_userquota(zhp, pl->pl_user_prop, buf, sizeof (buf), cbp->cb_literal) != 0) { sourcetype = ZPROP_SRC_NONE; (void) strlcpy(buf, "-", sizeof (buf)); } zprop_print_one_property(zfs_get_name(zhp), cbp, pl->pl_user_prop, buf, sourcetype, source, NULL); } else if (zfs_prop_written(pl->pl_user_prop)) { sourcetype = ZPROP_SRC_LOCAL; if (zfs_prop_get_written(zhp, pl->pl_user_prop, buf, sizeof (buf), cbp->cb_literal) != 0) { sourcetype = ZPROP_SRC_NONE; (void) strlcpy(buf, "-", sizeof (buf)); } zprop_print_one_property(zfs_get_name(zhp), cbp, pl->pl_user_prop, buf, sourcetype, source, NULL); } else { if (nvlist_lookup_nvlist(user_props, pl->pl_user_prop, &propval) != 0) { if (pl->pl_all) continue; sourcetype = ZPROP_SRC_NONE; strval = "-"; } else { verify(nvlist_lookup_string(propval, ZPROP_VALUE, &strval) == 0); verify(nvlist_lookup_string(propval, ZPROP_SOURCE, &sourceval) == 0); if (strcmp(sourceval, zfs_get_name(zhp)) == 0) { sourcetype = ZPROP_SRC_LOCAL; } else if (strcmp(sourceval, ZPROP_SOURCE_VAL_RECVD) == 0) { sourcetype = ZPROP_SRC_RECEIVED; } else { sourcetype = ZPROP_SRC_INHERITED; (void) strlcpy(source, sourceval, sizeof (source)); } } if (received && (zfs_prop_get_recvd(zhp, pl->pl_user_prop, rbuf, sizeof (rbuf), cbp->cb_literal) == 0)) recvdval = rbuf; zprop_print_one_property(zfs_get_name(zhp), cbp, pl->pl_user_prop, strval, sourcetype, source, recvdval); } } return (0); } static int zfs_do_get(int argc, char **argv) { zprop_get_cbdata_t cb = { 0 }; int i, c, flags = ZFS_ITER_ARGS_CAN_BE_PATHS; int types = ZFS_TYPE_DATASET | ZFS_TYPE_BOOKMARK; char *value, *fields; int ret = 0; int limit = 0; zprop_list_t fake_name = { 0 }; /* * Set up default columns and sources. */ cb.cb_sources = ZPROP_SRC_ALL; cb.cb_columns[0] = GET_COL_NAME; cb.cb_columns[1] = GET_COL_PROPERTY; cb.cb_columns[2] = GET_COL_VALUE; cb.cb_columns[3] = GET_COL_SOURCE; cb.cb_type = ZFS_TYPE_DATASET; /* check options */ while ((c = getopt(argc, argv, ":d:o:s:rt:Hp")) != -1) { switch (c) { case 'p': cb.cb_literal = B_TRUE; break; case 'd': limit = parse_depth(optarg, &flags); break; case 'r': flags |= ZFS_ITER_RECURSE; break; case 'H': cb.cb_scripted = B_TRUE; break; case ':': (void) fprintf(stderr, gettext("missing argument for " "'%c' option\n"), optopt); usage(B_FALSE); break; case 'o': /* * Process the set of columns to display. We zero out * the structure to give us a blank slate. */ bzero(&cb.cb_columns, sizeof (cb.cb_columns)); i = 0; while (*optarg != '\0') { static char *col_subopts[] = { "name", "property", "value", "received", "source", "all", NULL }; if (i == ZFS_GET_NCOLS) { (void) fprintf(stderr, gettext("too " "many fields given to -o " "option\n")); usage(B_FALSE); } switch (getsubopt(&optarg, col_subopts, &value)) { case 0: cb.cb_columns[i++] = GET_COL_NAME; break; case 1: cb.cb_columns[i++] = GET_COL_PROPERTY; break; case 2: cb.cb_columns[i++] = GET_COL_VALUE; break; case 3: cb.cb_columns[i++] = GET_COL_RECVD; flags |= ZFS_ITER_RECVD_PROPS; break; case 4: cb.cb_columns[i++] = GET_COL_SOURCE; break; case 5: if (i > 0) { (void) fprintf(stderr, gettext("\"all\" conflicts " "with specific fields " "given to -o option\n")); usage(B_FALSE); } cb.cb_columns[0] = GET_COL_NAME; cb.cb_columns[1] = GET_COL_PROPERTY; cb.cb_columns[2] = GET_COL_VALUE; cb.cb_columns[3] = GET_COL_RECVD; cb.cb_columns[4] = GET_COL_SOURCE; flags |= ZFS_ITER_RECVD_PROPS; i = ZFS_GET_NCOLS; break; default: (void) fprintf(stderr, gettext("invalid column name " "'%s'\n"), value); usage(B_FALSE); } } break; case 's': cb.cb_sources = 0; while (*optarg != '\0') { static char *source_subopts[] = { "local", "default", "inherited", "received", "temporary", "none", NULL }; switch (getsubopt(&optarg, source_subopts, &value)) { case 0: cb.cb_sources |= ZPROP_SRC_LOCAL; break; case 1: cb.cb_sources |= ZPROP_SRC_DEFAULT; break; case 2: cb.cb_sources |= ZPROP_SRC_INHERITED; break; case 3: cb.cb_sources |= ZPROP_SRC_RECEIVED; break; case 4: cb.cb_sources |= ZPROP_SRC_TEMPORARY; break; case 5: cb.cb_sources |= ZPROP_SRC_NONE; break; default: (void) fprintf(stderr, gettext("invalid source " "'%s'\n"), value); usage(B_FALSE); } } break; case 't': types = 0; flags &= ~ZFS_ITER_PROP_LISTSNAPS; while (*optarg != '\0') { static char *type_subopts[] = { "filesystem", "volume", "snapshot", "snap", "bookmark", "all", NULL }; switch (getsubopt(&optarg, type_subopts, &value)) { case 0: types |= ZFS_TYPE_FILESYSTEM; break; case 1: types |= ZFS_TYPE_VOLUME; break; case 2: case 3: types |= ZFS_TYPE_SNAPSHOT; break; case 4: types |= ZFS_TYPE_BOOKMARK; break; case 5: types = ZFS_TYPE_DATASET | ZFS_TYPE_BOOKMARK; break; default: (void) fprintf(stderr, gettext("invalid type '%s'\n"), value); usage(B_FALSE); } } break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; if (argc < 1) { (void) fprintf(stderr, gettext("missing property " "argument\n")); usage(B_FALSE); } fields = argv[0]; /* * Handle users who want to get all snapshots or bookmarks * of a dataset (ex. 'zfs get -t snapshot refer '). */ if ((types == ZFS_TYPE_SNAPSHOT || types == ZFS_TYPE_BOOKMARK) && argc > 1 && (flags & ZFS_ITER_RECURSE) == 0 && limit == 0) { flags |= (ZFS_ITER_DEPTH_LIMIT | ZFS_ITER_RECURSE); limit = 1; } if (zprop_get_list(g_zfs, fields, &cb.cb_proplist, ZFS_TYPE_DATASET) != 0) usage(B_FALSE); argc--; argv++; /* * As part of zfs_expand_proplist(), we keep track of the maximum column * width for each property. For the 'NAME' (and 'SOURCE') columns, we * need to know the maximum name length. However, the user likely did * not specify 'name' as one of the properties to fetch, so we need to * make sure we always include at least this property for * print_get_headers() to work properly. */ if (cb.cb_proplist != NULL) { fake_name.pl_prop = ZFS_PROP_NAME; fake_name.pl_width = strlen(gettext("NAME")); fake_name.pl_next = cb.cb_proplist; cb.cb_proplist = &fake_name; } cb.cb_first = B_TRUE; /* run for each object */ ret = zfs_for_each(argc, argv, flags, types, NULL, &cb.cb_proplist, limit, get_callback, &cb); if (cb.cb_proplist == &fake_name) zprop_free_list(fake_name.pl_next); else zprop_free_list(cb.cb_proplist); return (ret); } /* * inherit [-rS] ... * * -r Recurse over all children * -S Revert to received value, if any * * For each dataset specified on the command line, inherit the given property * from its parent. Inheriting a property at the pool level will cause it to * use the default value. The '-r' flag will recurse over all children, and is * useful for setting a property on a hierarchy-wide basis, regardless of any * local modifications for each dataset. */ typedef struct inherit_cbdata { const char *cb_propname; boolean_t cb_received; } inherit_cbdata_t; static int inherit_recurse_cb(zfs_handle_t *zhp, void *data) { inherit_cbdata_t *cb = data; zfs_prop_t prop = zfs_name_to_prop(cb->cb_propname); /* * If we're doing it recursively, then ignore properties that * are not valid for this type of dataset. */ if (prop != ZPROP_INVAL && !zfs_prop_valid_for_type(prop, zfs_get_type(zhp), B_FALSE)) return (0); return (zfs_prop_inherit(zhp, cb->cb_propname, cb->cb_received) != 0); } static int inherit_cb(zfs_handle_t *zhp, void *data) { inherit_cbdata_t *cb = data; return (zfs_prop_inherit(zhp, cb->cb_propname, cb->cb_received) != 0); } static int zfs_do_inherit(int argc, char **argv) { int c; zfs_prop_t prop; inherit_cbdata_t cb = { 0 }; char *propname; int ret = 0; int flags = 0; boolean_t received = B_FALSE; /* check options */ while ((c = getopt(argc, argv, "rS")) != -1) { switch (c) { case 'r': flags |= ZFS_ITER_RECURSE; break; case 'S': received = B_TRUE; break; case '?': default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 1) { (void) fprintf(stderr, gettext("missing property argument\n")); usage(B_FALSE); } if (argc < 2) { (void) fprintf(stderr, gettext("missing dataset argument\n")); usage(B_FALSE); } propname = argv[0]; argc--; argv++; if ((prop = zfs_name_to_prop(propname)) != ZPROP_INVAL) { if (zfs_prop_readonly(prop)) { (void) fprintf(stderr, gettext( "%s property is read-only\n"), propname); return (1); } if (!zfs_prop_inheritable(prop) && !received) { (void) fprintf(stderr, gettext("'%s' property cannot " "be inherited\n"), propname); if (prop == ZFS_PROP_QUOTA || prop == ZFS_PROP_RESERVATION || prop == ZFS_PROP_REFQUOTA || prop == ZFS_PROP_REFRESERVATION) { (void) fprintf(stderr, gettext("use 'zfs set " "%s=none' to clear\n"), propname); (void) fprintf(stderr, gettext("use 'zfs " "inherit -S %s' to revert to received " "value\n"), propname); } return (1); } if (received && (prop == ZFS_PROP_VOLSIZE || prop == ZFS_PROP_VERSION)) { (void) fprintf(stderr, gettext("'%s' property cannot " "be reverted to a received value\n"), propname); return (1); } } else if (!zfs_prop_user(propname)) { (void) fprintf(stderr, gettext("invalid property '%s'\n"), propname); usage(B_FALSE); } cb.cb_propname = propname; cb.cb_received = received; if (flags & ZFS_ITER_RECURSE) { ret = zfs_for_each(argc, argv, flags, ZFS_TYPE_DATASET, NULL, NULL, 0, inherit_recurse_cb, &cb); } else { ret = zfs_for_each(argc, argv, flags, ZFS_TYPE_DATASET, NULL, NULL, 0, inherit_cb, &cb); } return (ret); } typedef struct upgrade_cbdata { uint64_t cb_numupgraded; uint64_t cb_numsamegraded; uint64_t cb_numfailed; uint64_t cb_version; boolean_t cb_newer; boolean_t cb_foundone; char cb_lastfs[ZFS_MAX_DATASET_NAME_LEN]; } upgrade_cbdata_t; static int same_pool(zfs_handle_t *zhp, const char *name) { int len1 = strcspn(name, "/@"); const char *zhname = zfs_get_name(zhp); int len2 = strcspn(zhname, "/@"); if (len1 != len2) return (B_FALSE); return (strncmp(name, zhname, len1) == 0); } static int upgrade_list_callback(zfs_handle_t *zhp, void *data) { upgrade_cbdata_t *cb = data; int version = zfs_prop_get_int(zhp, ZFS_PROP_VERSION); /* list if it's old/new */ if ((!cb->cb_newer && version < ZPL_VERSION) || (cb->cb_newer && version > ZPL_VERSION)) { char *str; if (cb->cb_newer) { str = gettext("The following filesystems are " "formatted using a newer software version and\n" "cannot be accessed on the current system.\n\n"); } else { str = gettext("The following filesystems are " "out of date, and can be upgraded. After being\n" "upgraded, these filesystems (and any 'zfs send' " "streams generated from\n" "subsequent snapshots) will no longer be " "accessible by older software versions.\n\n"); } if (!cb->cb_foundone) { (void) puts(str); (void) printf(gettext("VER FILESYSTEM\n")); (void) printf(gettext("--- ------------\n")); cb->cb_foundone = B_TRUE; } (void) printf("%2u %s\n", version, zfs_get_name(zhp)); } return (0); } static int upgrade_set_callback(zfs_handle_t *zhp, void *data) { upgrade_cbdata_t *cb = data; int version = zfs_prop_get_int(zhp, ZFS_PROP_VERSION); int needed_spa_version; int spa_version; if (zfs_spa_version(zhp, &spa_version) < 0) return (-1); needed_spa_version = zfs_spa_version_map(cb->cb_version); if (needed_spa_version < 0) return (-1); if (spa_version < needed_spa_version) { /* can't upgrade */ (void) printf(gettext("%s: can not be " "upgraded; the pool version needs to first " "be upgraded\nto version %d\n\n"), zfs_get_name(zhp), needed_spa_version); cb->cb_numfailed++; return (0); } /* upgrade */ if (version < cb->cb_version) { char verstr[16]; (void) snprintf(verstr, sizeof (verstr), "%llu", (u_longlong_t)cb->cb_version); if (cb->cb_lastfs[0] && !same_pool(zhp, cb->cb_lastfs)) { /* * If they did "zfs upgrade -a", then we could * be doing ioctls to different pools. We need * to log this history once to each pool, and bypass * the normal history logging that happens in main(). */ (void) zpool_log_history(g_zfs, history_str); log_history = B_FALSE; } if (zfs_prop_set(zhp, "version", verstr) == 0) cb->cb_numupgraded++; else cb->cb_numfailed++; (void) strcpy(cb->cb_lastfs, zfs_get_name(zhp)); } else if (version > cb->cb_version) { /* can't downgrade */ (void) printf(gettext("%s: can not be downgraded; " "it is already at version %u\n"), zfs_get_name(zhp), version); cb->cb_numfailed++; } else { cb->cb_numsamegraded++; } return (0); } /* * zfs upgrade * zfs upgrade -v * zfs upgrade [-r] [-V ] <-a | filesystem> */ static int zfs_do_upgrade(int argc, char **argv) { boolean_t all = B_FALSE; boolean_t showversions = B_FALSE; int ret = 0; upgrade_cbdata_t cb = { 0 }; int c; int flags = ZFS_ITER_ARGS_CAN_BE_PATHS; /* check options */ while ((c = getopt(argc, argv, "rvV:a")) != -1) { switch (c) { case 'r': flags |= ZFS_ITER_RECURSE; break; case 'v': showversions = B_TRUE; break; case 'V': if (zfs_prop_string_to_index(ZFS_PROP_VERSION, optarg, &cb.cb_version) != 0) { (void) fprintf(stderr, gettext("invalid version %s\n"), optarg); usage(B_FALSE); } break; case 'a': all = B_TRUE; break; case '?': default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; if ((!all && !argc) && ((flags & ZFS_ITER_RECURSE) | cb.cb_version)) usage(B_FALSE); if (showversions && (flags & ZFS_ITER_RECURSE || all || cb.cb_version || argc)) usage(B_FALSE); if ((all || argc) && (showversions)) usage(B_FALSE); if (all && argc) usage(B_FALSE); if (showversions) { /* Show info on available versions. */ (void) printf(gettext("The following filesystem versions are " "supported:\n\n")); (void) printf(gettext("VER DESCRIPTION\n")); (void) printf("--- -----------------------------------------" "---------------\n"); (void) printf(gettext(" 1 Initial ZFS filesystem version\n")); (void) printf(gettext(" 2 Enhanced directory entries\n")); (void) printf(gettext(" 3 Case insensitive and filesystem " "user identifier (FUID)\n")); (void) printf(gettext(" 4 userquota, groupquota " "properties\n")); (void) printf(gettext(" 5 System attributes\n")); (void) printf(gettext("\nFor more information on a particular " "version, including supported releases,\n")); (void) printf("see the ZFS Administration Guide.\n\n"); ret = 0; } else if (argc || all) { /* Upgrade filesystems */ if (cb.cb_version == 0) cb.cb_version = ZPL_VERSION; ret = zfs_for_each(argc, argv, flags, ZFS_TYPE_FILESYSTEM, NULL, NULL, 0, upgrade_set_callback, &cb); (void) printf(gettext("%llu filesystems upgraded\n"), (u_longlong_t)cb.cb_numupgraded); if (cb.cb_numsamegraded) { (void) printf(gettext("%llu filesystems already at " "this version\n"), (u_longlong_t)cb.cb_numsamegraded); } if (cb.cb_numfailed != 0) ret = 1; } else { /* List old-version filesystems */ boolean_t found; (void) printf(gettext("This system is currently running " "ZFS filesystem version %llu.\n\n"), ZPL_VERSION); flags |= ZFS_ITER_RECURSE; ret = zfs_for_each(0, NULL, flags, ZFS_TYPE_FILESYSTEM, NULL, NULL, 0, upgrade_list_callback, &cb); found = cb.cb_foundone; cb.cb_foundone = B_FALSE; cb.cb_newer = B_TRUE; ret = zfs_for_each(0, NULL, flags, ZFS_TYPE_FILESYSTEM, NULL, NULL, 0, upgrade_list_callback, &cb); if (!cb.cb_foundone && !found) { (void) printf(gettext("All filesystems are " "formatted with the current version.\n")); } } return (ret); } /* * zfs userspace [-Hinp] [-o field[,...]] [-s field [-s field]...] - * [-S field [-S field]...] [-t type[,...]] filesystem | snapshot + * [-S field [-S field]...] [-t type[,...]] + * filesystem | snapshot | path * zfs groupspace [-Hinp] [-o field[,...]] [-s field [-s field]...] - * [-S field [-S field]...] [-t type[,...]] filesystem | snapshot + * [-S field [-S field]...] [-t type[,...]] + * filesystem | snapshot | path * zfs projectspace [-Hp] [-o field[,...]] [-s field [-s field]...] - * [-S field [-S field]...] filesystem | snapshot + * [-S field [-S field]...] filesystem | snapshot | path * * -H Scripted mode; elide headers and separate columns by tabs. * -i Translate SID to POSIX ID. * -n Print numeric ID instead of user/group name. * -o Control which fields to display. * -p Use exact (parsable) numeric output. * -s Specify sort columns, descending order. * -S Specify sort columns, ascending order. * -t Control which object types to display. * * Displays space consumed by, and quotas on, each user in the specified * filesystem or snapshot. */ /* us_field_types, us_field_hdr and us_field_names should be kept in sync */ enum us_field_types { USFIELD_TYPE, USFIELD_NAME, USFIELD_USED, USFIELD_QUOTA, USFIELD_OBJUSED, USFIELD_OBJQUOTA }; static char *us_field_hdr[] = { "TYPE", "NAME", "USED", "QUOTA", "OBJUSED", "OBJQUOTA" }; static char *us_field_names[] = { "type", "name", "used", "quota", "objused", "objquota" }; #define USFIELD_LAST (sizeof (us_field_names) / sizeof (char *)) #define USTYPE_PSX_GRP (1 << 0) #define USTYPE_PSX_USR (1 << 1) #define USTYPE_SMB_GRP (1 << 2) #define USTYPE_SMB_USR (1 << 3) #define USTYPE_PROJ (1 << 4) #define USTYPE_ALL \ (USTYPE_PSX_GRP | USTYPE_PSX_USR | USTYPE_SMB_GRP | USTYPE_SMB_USR | \ USTYPE_PROJ) static int us_type_bits[] = { USTYPE_PSX_GRP, USTYPE_PSX_USR, USTYPE_SMB_GRP, USTYPE_SMB_USR, USTYPE_ALL }; static char *us_type_names[] = { "posixgroup", "posixuser", "smbgroup", "smbuser", "all" }; typedef struct us_node { nvlist_t *usn_nvl; uu_avl_node_t usn_avlnode; uu_list_node_t usn_listnode; } us_node_t; typedef struct us_cbdata { nvlist_t **cb_nvlp; uu_avl_pool_t *cb_avl_pool; uu_avl_t *cb_avl; boolean_t cb_numname; boolean_t cb_nicenum; boolean_t cb_sid2posix; zfs_userquota_prop_t cb_prop; zfs_sort_column_t *cb_sortcol; size_t cb_width[USFIELD_LAST]; } us_cbdata_t; static boolean_t us_populated = B_FALSE; typedef struct { zfs_sort_column_t *si_sortcol; boolean_t si_numname; } us_sort_info_t; static int us_field_index(char *field) { int i; for (i = 0; i < USFIELD_LAST; i++) { if (strcmp(field, us_field_names[i]) == 0) return (i); } return (-1); } static int us_compare(const void *larg, const void *rarg, void *unused) { const us_node_t *l = larg; const us_node_t *r = rarg; us_sort_info_t *si = (us_sort_info_t *)unused; zfs_sort_column_t *sortcol = si->si_sortcol; boolean_t numname = si->si_numname; nvlist_t *lnvl = l->usn_nvl; nvlist_t *rnvl = r->usn_nvl; int rc = 0; boolean_t lvb, rvb; for (; sortcol != NULL; sortcol = sortcol->sc_next) { char *lvstr = ""; char *rvstr = ""; uint32_t lv32 = 0; uint32_t rv32 = 0; uint64_t lv64 = 0; uint64_t rv64 = 0; zfs_prop_t prop = sortcol->sc_prop; const char *propname = NULL; boolean_t reverse = sortcol->sc_reverse; switch (prop) { case ZFS_PROP_TYPE: propname = "type"; (void) nvlist_lookup_uint32(lnvl, propname, &lv32); (void) nvlist_lookup_uint32(rnvl, propname, &rv32); if (rv32 != lv32) rc = (rv32 < lv32) ? 1 : -1; break; case ZFS_PROP_NAME: propname = "name"; if (numname) { compare_nums: (void) nvlist_lookup_uint64(lnvl, propname, &lv64); (void) nvlist_lookup_uint64(rnvl, propname, &rv64); if (rv64 != lv64) rc = (rv64 < lv64) ? 1 : -1; } else { if ((nvlist_lookup_string(lnvl, propname, &lvstr) == ENOENT) || (nvlist_lookup_string(rnvl, propname, &rvstr) == ENOENT)) { goto compare_nums; } rc = strcmp(lvstr, rvstr); } break; case ZFS_PROP_USED: case ZFS_PROP_QUOTA: if (!us_populated) break; if (prop == ZFS_PROP_USED) propname = "used"; else propname = "quota"; (void) nvlist_lookup_uint64(lnvl, propname, &lv64); (void) nvlist_lookup_uint64(rnvl, propname, &rv64); if (rv64 != lv64) rc = (rv64 < lv64) ? 1 : -1; break; default: break; } if (rc != 0) { if (rc < 0) return (reverse ? 1 : -1); else return (reverse ? -1 : 1); } } /* * If entries still seem to be the same, check if they are of the same * type (smbentity is added only if we are doing SID to POSIX ID * translation where we can have duplicate type/name combinations). */ if (nvlist_lookup_boolean_value(lnvl, "smbentity", &lvb) == 0 && nvlist_lookup_boolean_value(rnvl, "smbentity", &rvb) == 0 && lvb != rvb) return (lvb < rvb ? -1 : 1); return (0); } static boolean_t zfs_prop_is_user(unsigned p) { return (p == ZFS_PROP_USERUSED || p == ZFS_PROP_USERQUOTA || p == ZFS_PROP_USEROBJUSED || p == ZFS_PROP_USEROBJQUOTA); } static boolean_t zfs_prop_is_group(unsigned p) { return (p == ZFS_PROP_GROUPUSED || p == ZFS_PROP_GROUPQUOTA || p == ZFS_PROP_GROUPOBJUSED || p == ZFS_PROP_GROUPOBJQUOTA); } static boolean_t zfs_prop_is_project(unsigned p) { return (p == ZFS_PROP_PROJECTUSED || p == ZFS_PROP_PROJECTQUOTA || p == ZFS_PROP_PROJECTOBJUSED || p == ZFS_PROP_PROJECTOBJQUOTA); } static inline const char * us_type2str(unsigned field_type) { switch (field_type) { case USTYPE_PSX_USR: return ("POSIX User"); case USTYPE_PSX_GRP: return ("POSIX Group"); case USTYPE_SMB_USR: return ("SMB User"); case USTYPE_SMB_GRP: return ("SMB Group"); case USTYPE_PROJ: return ("Project"); default: return ("Undefined"); } } static int userspace_cb(void *arg, const char *domain, uid_t rid, uint64_t space) { us_cbdata_t *cb = (us_cbdata_t *)arg; zfs_userquota_prop_t prop = cb->cb_prop; char *name = NULL; char *propname; char sizebuf[32]; us_node_t *node; uu_avl_pool_t *avl_pool = cb->cb_avl_pool; uu_avl_t *avl = cb->cb_avl; uu_avl_index_t idx; nvlist_t *props; us_node_t *n; zfs_sort_column_t *sortcol = cb->cb_sortcol; unsigned type = 0; const char *typestr; size_t namelen; size_t typelen; size_t sizelen; int typeidx, nameidx, sizeidx; us_sort_info_t sortinfo = { sortcol, cb->cb_numname }; boolean_t smbentity = B_FALSE; if (nvlist_alloc(&props, NV_UNIQUE_NAME, 0) != 0) nomem(); node = safe_malloc(sizeof (us_node_t)); uu_avl_node_init(node, &node->usn_avlnode, avl_pool); node->usn_nvl = props; if (domain != NULL && domain[0] != '\0') { #ifdef HAVE_IDMAP /* SMB */ char sid[MAXNAMELEN + 32]; uid_t id; uint64_t classes; int err; directory_error_t e; smbentity = B_TRUE; (void) snprintf(sid, sizeof (sid), "%s-%u", domain, rid); if (prop == ZFS_PROP_GROUPUSED || prop == ZFS_PROP_GROUPQUOTA) { type = USTYPE_SMB_GRP; err = sid_to_id(sid, B_FALSE, &id); } else { type = USTYPE_SMB_USR; err = sid_to_id(sid, B_TRUE, &id); } if (err == 0) { rid = id; if (!cb->cb_sid2posix) { e = directory_name_from_sid(NULL, sid, &name, &classes); if (e != NULL) directory_error_free(e); if (name == NULL) name = sid; } } #else nvlist_free(props); free(node); return (-1); #endif /* HAVE_IDMAP */ } if (cb->cb_sid2posix || domain == NULL || domain[0] == '\0') { /* POSIX or -i */ if (zfs_prop_is_group(prop)) { type = USTYPE_PSX_GRP; if (!cb->cb_numname) { struct group *g; if ((g = getgrgid(rid)) != NULL) name = g->gr_name; } } else if (zfs_prop_is_user(prop)) { type = USTYPE_PSX_USR; if (!cb->cb_numname) { struct passwd *p; if ((p = getpwuid(rid)) != NULL) name = p->pw_name; } } else { type = USTYPE_PROJ; } } /* * Make sure that the type/name combination is unique when doing * SID to POSIX ID translation (hence changing the type from SMB to * POSIX). */ if (cb->cb_sid2posix && nvlist_add_boolean_value(props, "smbentity", smbentity) != 0) nomem(); /* Calculate/update width of TYPE field */ typestr = us_type2str(type); typelen = strlen(gettext(typestr)); typeidx = us_field_index("type"); if (typelen > cb->cb_width[typeidx]) cb->cb_width[typeidx] = typelen; if (nvlist_add_uint32(props, "type", type) != 0) nomem(); /* Calculate/update width of NAME field */ if ((cb->cb_numname && cb->cb_sid2posix) || name == NULL) { if (nvlist_add_uint64(props, "name", rid) != 0) nomem(); namelen = snprintf(NULL, 0, "%u", rid); } else { if (nvlist_add_string(props, "name", name) != 0) nomem(); namelen = strlen(name); } nameidx = us_field_index("name"); if (nameidx >= 0 && namelen > cb->cb_width[nameidx]) cb->cb_width[nameidx] = namelen; /* * Check if this type/name combination is in the list and update it; * otherwise add new node to the list. */ if ((n = uu_avl_find(avl, node, &sortinfo, &idx)) == NULL) { uu_avl_insert(avl, node, idx); } else { nvlist_free(props); free(node); node = n; props = node->usn_nvl; } /* Calculate/update width of USED/QUOTA fields */ if (cb->cb_nicenum) { if (prop == ZFS_PROP_USERUSED || prop == ZFS_PROP_GROUPUSED || prop == ZFS_PROP_USERQUOTA || prop == ZFS_PROP_GROUPQUOTA || prop == ZFS_PROP_PROJECTUSED || prop == ZFS_PROP_PROJECTQUOTA) { zfs_nicebytes(space, sizebuf, sizeof (sizebuf)); } else { zfs_nicenum(space, sizebuf, sizeof (sizebuf)); } } else { (void) snprintf(sizebuf, sizeof (sizebuf), "%llu", (u_longlong_t)space); } sizelen = strlen(sizebuf); if (prop == ZFS_PROP_USERUSED || prop == ZFS_PROP_GROUPUSED || prop == ZFS_PROP_PROJECTUSED) { propname = "used"; if (!nvlist_exists(props, "quota")) (void) nvlist_add_uint64(props, "quota", 0); } else if (prop == ZFS_PROP_USERQUOTA || prop == ZFS_PROP_GROUPQUOTA || prop == ZFS_PROP_PROJECTQUOTA) { propname = "quota"; if (!nvlist_exists(props, "used")) (void) nvlist_add_uint64(props, "used", 0); } else if (prop == ZFS_PROP_USEROBJUSED || prop == ZFS_PROP_GROUPOBJUSED || prop == ZFS_PROP_PROJECTOBJUSED) { propname = "objused"; if (!nvlist_exists(props, "objquota")) (void) nvlist_add_uint64(props, "objquota", 0); } else if (prop == ZFS_PROP_USEROBJQUOTA || prop == ZFS_PROP_GROUPOBJQUOTA || prop == ZFS_PROP_PROJECTOBJQUOTA) { propname = "objquota"; if (!nvlist_exists(props, "objused")) (void) nvlist_add_uint64(props, "objused", 0); } else { return (-1); } sizeidx = us_field_index(propname); if (sizeidx >= 0 && sizelen > cb->cb_width[sizeidx]) cb->cb_width[sizeidx] = sizelen; if (nvlist_add_uint64(props, propname, space) != 0) nomem(); return (0); } static void print_us_node(boolean_t scripted, boolean_t parsable, int *fields, int types, size_t *width, us_node_t *node) { nvlist_t *nvl = node->usn_nvl; char valstr[MAXNAMELEN]; boolean_t first = B_TRUE; int cfield = 0; int field; uint32_t ustype; /* Check type */ (void) nvlist_lookup_uint32(nvl, "type", &ustype); if (!(ustype & types)) return; while ((field = fields[cfield]) != USFIELD_LAST) { nvpair_t *nvp = NULL; data_type_t type; uint32_t val32; uint64_t val64; char *strval = "-"; while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) { if (strcmp(nvpair_name(nvp), us_field_names[field]) == 0) break; } type = nvp == NULL ? DATA_TYPE_UNKNOWN : nvpair_type(nvp); switch (type) { case DATA_TYPE_UINT32: (void) nvpair_value_uint32(nvp, &val32); break; case DATA_TYPE_UINT64: (void) nvpair_value_uint64(nvp, &val64); break; case DATA_TYPE_STRING: (void) nvpair_value_string(nvp, &strval); break; case DATA_TYPE_UNKNOWN: break; default: (void) fprintf(stderr, "invalid data type\n"); } switch (field) { case USFIELD_TYPE: if (type == DATA_TYPE_UINT32) strval = (char *)us_type2str(val32); break; case USFIELD_NAME: if (type == DATA_TYPE_UINT64) { (void) sprintf(valstr, "%llu", (u_longlong_t)val64); strval = valstr; } break; case USFIELD_USED: case USFIELD_QUOTA: if (type == DATA_TYPE_UINT64) { if (parsable) { (void) sprintf(valstr, "%llu", (u_longlong_t)val64); strval = valstr; } else if (field == USFIELD_QUOTA && val64 == 0) { strval = "none"; } else { zfs_nicebytes(val64, valstr, sizeof (valstr)); strval = valstr; } } break; case USFIELD_OBJUSED: case USFIELD_OBJQUOTA: if (type == DATA_TYPE_UINT64) { if (parsable) { (void) sprintf(valstr, "%llu", (u_longlong_t)val64); strval = valstr; } else if (field == USFIELD_OBJQUOTA && val64 == 0) { strval = "none"; } else { zfs_nicenum(val64, valstr, sizeof (valstr)); strval = valstr; } } break; } if (!first) { if (scripted) (void) printf("\t"); else (void) printf(" "); } if (scripted) (void) printf("%s", strval); else if (field == USFIELD_TYPE || field == USFIELD_NAME) (void) printf("%-*s", (int)width[field], strval); else (void) printf("%*s", (int)width[field], strval); first = B_FALSE; cfield++; } (void) printf("\n"); } static void print_us(boolean_t scripted, boolean_t parsable, int *fields, int types, size_t *width, boolean_t rmnode, uu_avl_t *avl) { us_node_t *node; const char *col; int cfield = 0; int field; if (!scripted) { boolean_t first = B_TRUE; while ((field = fields[cfield]) != USFIELD_LAST) { col = gettext(us_field_hdr[field]); if (field == USFIELD_TYPE || field == USFIELD_NAME) { (void) printf(first ? "%-*s" : " %-*s", (int)width[field], col); } else { (void) printf(first ? "%*s" : " %*s", (int)width[field], col); } first = B_FALSE; cfield++; } (void) printf("\n"); } for (node = uu_avl_first(avl); node; node = uu_avl_next(avl, node)) { print_us_node(scripted, parsable, fields, types, width, node); if (rmnode) nvlist_free(node->usn_nvl); } } static int zfs_do_userspace(int argc, char **argv) { zfs_handle_t *zhp; zfs_userquota_prop_t p; uu_avl_pool_t *avl_pool; uu_avl_t *avl_tree; uu_avl_walk_t *walk; char *delim; char deffields[] = "type,name,used,quota,objused,objquota"; char *ofield = NULL; char *tfield = NULL; int cfield = 0; int fields[256]; int i; boolean_t scripted = B_FALSE; boolean_t prtnum = B_FALSE; boolean_t parsable = B_FALSE; boolean_t sid2posix = B_FALSE; int ret = 0; int c; zfs_sort_column_t *sortcol = NULL; int types = USTYPE_PSX_USR | USTYPE_SMB_USR; us_cbdata_t cb; us_node_t *node; us_node_t *rmnode; uu_list_pool_t *listpool; uu_list_t *list; uu_avl_index_t idx = 0; uu_list_index_t idx2 = 0; if (argc < 2) usage(B_FALSE); if (strcmp(argv[0], "groupspace") == 0) { /* Toggle default group types */ types = USTYPE_PSX_GRP | USTYPE_SMB_GRP; } else if (strcmp(argv[0], "projectspace") == 0) { types = USTYPE_PROJ; prtnum = B_TRUE; } while ((c = getopt(argc, argv, "nHpo:s:S:t:i")) != -1) { switch (c) { case 'n': if (types == USTYPE_PROJ) { (void) fprintf(stderr, gettext("invalid option 'n'\n")); usage(B_FALSE); } prtnum = B_TRUE; break; case 'H': scripted = B_TRUE; break; case 'p': parsable = B_TRUE; break; case 'o': ofield = optarg; break; case 's': case 'S': if (zfs_add_sort_column(&sortcol, optarg, c == 's' ? B_FALSE : B_TRUE) != 0) { (void) fprintf(stderr, gettext("invalid field '%s'\n"), optarg); usage(B_FALSE); } break; case 't': if (types == USTYPE_PROJ) { (void) fprintf(stderr, gettext("invalid option 't'\n")); usage(B_FALSE); } tfield = optarg; break; case 'i': if (types == USTYPE_PROJ) { (void) fprintf(stderr, gettext("invalid option 'i'\n")); usage(B_FALSE); } sid2posix = B_TRUE; break; case ':': (void) fprintf(stderr, gettext("missing argument for " "'%c' option\n"), optopt); usage(B_FALSE); break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; if (argc < 1) { (void) fprintf(stderr, gettext("missing dataset name\n")); usage(B_FALSE); } if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } /* Use default output fields if not specified using -o */ if (ofield == NULL) ofield = deffields; do { if ((delim = strchr(ofield, ',')) != NULL) *delim = '\0'; if ((fields[cfield++] = us_field_index(ofield)) == -1) { (void) fprintf(stderr, gettext("invalid type '%s' " "for -o option\n"), ofield); return (-1); } if (delim != NULL) ofield = delim + 1; } while (delim != NULL); fields[cfield] = USFIELD_LAST; /* Override output types (-t option) */ if (tfield != NULL) { types = 0; do { boolean_t found = B_FALSE; if ((delim = strchr(tfield, ',')) != NULL) *delim = '\0'; for (i = 0; i < sizeof (us_type_bits) / sizeof (int); i++) { if (strcmp(tfield, us_type_names[i]) == 0) { found = B_TRUE; types |= us_type_bits[i]; break; } } if (!found) { (void) fprintf(stderr, gettext("invalid type " "'%s' for -t option\n"), tfield); return (-1); } if (delim != NULL) tfield = delim + 1; } while (delim != NULL); } - if ((zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM | + if ((zhp = zfs_path_to_zhandle(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_SNAPSHOT)) == NULL) return (1); if (zhp->zfs_head_type != ZFS_TYPE_FILESYSTEM) { (void) fprintf(stderr, gettext("operation is only applicable " "to filesystems and their snapshots\n")); zfs_close(zhp); return (1); } if ((avl_pool = uu_avl_pool_create("us_avl_pool", sizeof (us_node_t), offsetof(us_node_t, usn_avlnode), us_compare, UU_DEFAULT)) == NULL) nomem(); if ((avl_tree = uu_avl_create(avl_pool, NULL, UU_DEFAULT)) == NULL) nomem(); /* Always add default sorting columns */ (void) zfs_add_sort_column(&sortcol, "type", B_FALSE); (void) zfs_add_sort_column(&sortcol, "name", B_FALSE); cb.cb_sortcol = sortcol; cb.cb_numname = prtnum; cb.cb_nicenum = !parsable; cb.cb_avl_pool = avl_pool; cb.cb_avl = avl_tree; cb.cb_sid2posix = sid2posix; for (i = 0; i < USFIELD_LAST; i++) cb.cb_width[i] = strlen(gettext(us_field_hdr[i])); for (p = 0; p < ZFS_NUM_USERQUOTA_PROPS; p++) { if ((zfs_prop_is_user(p) && !(types & (USTYPE_PSX_USR | USTYPE_SMB_USR))) || (zfs_prop_is_group(p) && !(types & (USTYPE_PSX_GRP | USTYPE_SMB_GRP))) || (zfs_prop_is_project(p) && types != USTYPE_PROJ)) continue; cb.cb_prop = p; if ((ret = zfs_userspace(zhp, p, userspace_cb, &cb)) != 0) { zfs_close(zhp); return (ret); } } zfs_close(zhp); /* Sort the list */ if ((node = uu_avl_first(avl_tree)) == NULL) return (0); us_populated = B_TRUE; listpool = uu_list_pool_create("tmplist", sizeof (us_node_t), offsetof(us_node_t, usn_listnode), NULL, UU_DEFAULT); list = uu_list_create(listpool, NULL, UU_DEFAULT); uu_list_node_init(node, &node->usn_listnode, listpool); while (node != NULL) { rmnode = node; node = uu_avl_next(avl_tree, node); uu_avl_remove(avl_tree, rmnode); if (uu_list_find(list, rmnode, NULL, &idx2) == NULL) uu_list_insert(list, rmnode, idx2); } for (node = uu_list_first(list); node != NULL; node = uu_list_next(list, node)) { us_sort_info_t sortinfo = { sortcol, cb.cb_numname }; if (uu_avl_find(avl_tree, node, &sortinfo, &idx) == NULL) uu_avl_insert(avl_tree, node, idx); } uu_list_destroy(list); uu_list_pool_destroy(listpool); /* Print and free node nvlist memory */ print_us(scripted, parsable, fields, types, cb.cb_width, B_TRUE, cb.cb_avl); zfs_free_sort_columns(sortcol); /* Clean up the AVL tree */ if ((walk = uu_avl_walk_start(cb.cb_avl, UU_WALK_ROBUST)) == NULL) nomem(); while ((node = uu_avl_walk_next(walk)) != NULL) { uu_avl_remove(cb.cb_avl, node); free(node); } uu_avl_walk_end(walk); uu_avl_destroy(avl_tree); uu_avl_pool_destroy(avl_pool); return (ret); } /* * list [-Hp][-r|-d max] [-o property[,...]] [-s property] ... [-S property] * [-t type[,...]] [filesystem|volume|snapshot] ... * * -H Scripted mode; elide headers and separate columns by tabs * -p Display values in parsable (literal) format. * -r Recurse over all children * -d Limit recursion by depth. * -o Control which fields to display. * -s Specify sort columns, descending order. * -S Specify sort columns, ascending order. * -t Control which object types to display. * * When given no arguments, list all filesystems in the system. * Otherwise, list the specified datasets, optionally recursing down them if * '-r' is specified. */ typedef struct list_cbdata { boolean_t cb_first; boolean_t cb_literal; boolean_t cb_scripted; zprop_list_t *cb_proplist; } list_cbdata_t; /* * Given a list of columns to display, output appropriate headers for each one. */ static void print_header(list_cbdata_t *cb) { zprop_list_t *pl = cb->cb_proplist; char headerbuf[ZFS_MAXPROPLEN]; const char *header; int i; boolean_t first = B_TRUE; boolean_t right_justify; for (; pl != NULL; pl = pl->pl_next) { if (!first) { (void) printf(" "); } else { first = B_FALSE; } right_justify = B_FALSE; if (pl->pl_prop != ZPROP_INVAL) { header = zfs_prop_column_name(pl->pl_prop); right_justify = zfs_prop_align_right(pl->pl_prop); } else { for (i = 0; pl->pl_user_prop[i] != '\0'; i++) headerbuf[i] = toupper(pl->pl_user_prop[i]); headerbuf[i] = '\0'; header = headerbuf; } if (pl->pl_next == NULL && !right_justify) (void) printf("%s", header); else if (right_justify) (void) printf("%*s", (int)pl->pl_width, header); else (void) printf("%-*s", (int)pl->pl_width, header); } (void) printf("\n"); } /* * Given a dataset and a list of fields, print out all the properties according * to the described layout. */ static void print_dataset(zfs_handle_t *zhp, list_cbdata_t *cb) { zprop_list_t *pl = cb->cb_proplist; boolean_t first = B_TRUE; char property[ZFS_MAXPROPLEN]; nvlist_t *userprops = zfs_get_user_props(zhp); nvlist_t *propval; char *propstr; boolean_t right_justify; for (; pl != NULL; pl = pl->pl_next) { if (!first) { if (cb->cb_scripted) (void) printf("\t"); else (void) printf(" "); } else { first = B_FALSE; } if (pl->pl_prop == ZFS_PROP_NAME) { (void) strlcpy(property, zfs_get_name(zhp), sizeof (property)); propstr = property; right_justify = zfs_prop_align_right(pl->pl_prop); } else if (pl->pl_prop != ZPROP_INVAL) { if (zfs_prop_get(zhp, pl->pl_prop, property, sizeof (property), NULL, NULL, 0, cb->cb_literal) != 0) propstr = "-"; else propstr = property; right_justify = zfs_prop_align_right(pl->pl_prop); } else if (zfs_prop_userquota(pl->pl_user_prop)) { if (zfs_prop_get_userquota(zhp, pl->pl_user_prop, property, sizeof (property), cb->cb_literal) != 0) propstr = "-"; else propstr = property; right_justify = B_TRUE; } else if (zfs_prop_written(pl->pl_user_prop)) { if (zfs_prop_get_written(zhp, pl->pl_user_prop, property, sizeof (property), cb->cb_literal) != 0) propstr = "-"; else propstr = property; right_justify = B_TRUE; } else { if (nvlist_lookup_nvlist(userprops, pl->pl_user_prop, &propval) != 0) propstr = "-"; else verify(nvlist_lookup_string(propval, ZPROP_VALUE, &propstr) == 0); right_justify = B_FALSE; } /* * If this is being called in scripted mode, or if this is the * last column and it is left-justified, don't include a width * format specifier. */ if (cb->cb_scripted || (pl->pl_next == NULL && !right_justify)) (void) printf("%s", propstr); else if (right_justify) (void) printf("%*s", (int)pl->pl_width, propstr); else (void) printf("%-*s", (int)pl->pl_width, propstr); } (void) printf("\n"); } /* * Generic callback function to list a dataset or snapshot. */ static int list_callback(zfs_handle_t *zhp, void *data) { list_cbdata_t *cbp = data; if (cbp->cb_first) { if (!cbp->cb_scripted) print_header(cbp); cbp->cb_first = B_FALSE; } print_dataset(zhp, cbp); return (0); } static int zfs_do_list(int argc, char **argv) { int c; static char default_fields[] = "name,used,available,referenced,mountpoint"; int types = ZFS_TYPE_DATASET; boolean_t types_specified = B_FALSE; char *fields = NULL; list_cbdata_t cb = { 0 }; char *value; int limit = 0; int ret = 0; zfs_sort_column_t *sortcol = NULL; int flags = ZFS_ITER_PROP_LISTSNAPS | ZFS_ITER_ARGS_CAN_BE_PATHS; /* check options */ while ((c = getopt(argc, argv, "HS:d:o:prs:t:")) != -1) { switch (c) { case 'o': fields = optarg; break; case 'p': cb.cb_literal = B_TRUE; flags |= ZFS_ITER_LITERAL_PROPS; break; case 'd': limit = parse_depth(optarg, &flags); break; case 'r': flags |= ZFS_ITER_RECURSE; break; case 'H': cb.cb_scripted = B_TRUE; break; case 's': if (zfs_add_sort_column(&sortcol, optarg, B_FALSE) != 0) { (void) fprintf(stderr, gettext("invalid property '%s'\n"), optarg); usage(B_FALSE); } break; case 'S': if (zfs_add_sort_column(&sortcol, optarg, B_TRUE) != 0) { (void) fprintf(stderr, gettext("invalid property '%s'\n"), optarg); usage(B_FALSE); } break; case 't': types = 0; types_specified = B_TRUE; flags &= ~ZFS_ITER_PROP_LISTSNAPS; while (*optarg != '\0') { static char *type_subopts[] = { "filesystem", "volume", "snapshot", "snap", "bookmark", "all", NULL }; switch (getsubopt(&optarg, type_subopts, &value)) { case 0: types |= ZFS_TYPE_FILESYSTEM; break; case 1: types |= ZFS_TYPE_VOLUME; break; case 2: case 3: types |= ZFS_TYPE_SNAPSHOT; break; case 4: types |= ZFS_TYPE_BOOKMARK; break; case 5: types = ZFS_TYPE_DATASET | ZFS_TYPE_BOOKMARK; break; default: (void) fprintf(stderr, gettext("invalid type '%s'\n"), value); usage(B_FALSE); } } break; case ':': (void) fprintf(stderr, gettext("missing argument for " "'%c' option\n"), optopt); usage(B_FALSE); break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; if (fields == NULL) fields = default_fields; /* * If we are only going to list snapshot names and sort by name, * then we can use faster version. */ if (strcmp(fields, "name") == 0 && zfs_sort_only_by_name(sortcol)) flags |= ZFS_ITER_SIMPLE; /* * If "-o space" and no types were specified, don't display snapshots. */ if (strcmp(fields, "space") == 0 && types_specified == B_FALSE) types &= ~ZFS_TYPE_SNAPSHOT; /* * Handle users who want to list all snapshots or bookmarks * of the current dataset (ex. 'zfs list -t snapshot '). */ if ((types == ZFS_TYPE_SNAPSHOT || types == ZFS_TYPE_BOOKMARK) && argc > 0 && (flags & ZFS_ITER_RECURSE) == 0 && limit == 0) { flags |= (ZFS_ITER_DEPTH_LIMIT | ZFS_ITER_RECURSE); limit = 1; } /* * If the user specifies '-o all', the zprop_get_list() doesn't * normally include the name of the dataset. For 'zfs list', we always * want this property to be first. */ if (zprop_get_list(g_zfs, fields, &cb.cb_proplist, ZFS_TYPE_DATASET) != 0) usage(B_FALSE); cb.cb_first = B_TRUE; ret = zfs_for_each(argc, argv, flags, types, sortcol, &cb.cb_proplist, limit, list_callback, &cb); zprop_free_list(cb.cb_proplist); zfs_free_sort_columns(sortcol); if (ret == 0 && cb.cb_first && !cb.cb_scripted) (void) fprintf(stderr, gettext("no datasets available\n")); return (ret); } /* * zfs rename [-fu] * zfs rename [-f] -p * zfs rename [-u] -r * * Renames the given dataset to another of the same type. * * The '-p' flag creates all the non-existing ancestors of the target first. * The '-u' flag prevents file systems from being remounted during rename. */ /* ARGSUSED */ static int zfs_do_rename(int argc, char **argv) { zfs_handle_t *zhp; renameflags_t flags = { 0 }; int c; int ret = 0; int types; boolean_t parents = B_FALSE; /* check options */ while ((c = getopt(argc, argv, "pruf")) != -1) { switch (c) { case 'p': parents = B_TRUE; break; case 'r': flags.recursive = B_TRUE; break; case 'u': flags.nounmount = B_TRUE; break; case 'f': flags.forceunmount = B_TRUE; break; case '?': default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 1) { (void) fprintf(stderr, gettext("missing source dataset " "argument\n")); usage(B_FALSE); } if (argc < 2) { (void) fprintf(stderr, gettext("missing target dataset " "argument\n")); usage(B_FALSE); } if (argc > 2) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } if (flags.recursive && parents) { (void) fprintf(stderr, gettext("-p and -r options are mutually " "exclusive\n")); usage(B_FALSE); } if (flags.nounmount && parents) { (void) fprintf(stderr, gettext("-u and -p options are mutually " "exclusive\n")); usage(B_FALSE); } if (flags.recursive && strchr(argv[0], '@') == 0) { (void) fprintf(stderr, gettext("source dataset for recursive " "rename must be a snapshot\n")); usage(B_FALSE); } if (flags.nounmount) types = ZFS_TYPE_FILESYSTEM; else if (parents) types = ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME; else types = ZFS_TYPE_DATASET; if ((zhp = zfs_open(g_zfs, argv[0], types)) == NULL) return (1); /* If we were asked and the name looks good, try to create ancestors. */ if (parents && zfs_name_valid(argv[1], zfs_get_type(zhp)) && zfs_create_ancestors(g_zfs, argv[1]) != 0) { zfs_close(zhp); return (1); } ret = (zfs_rename(zhp, argv[1], flags) != 0); zfs_close(zhp); return (ret); } /* * zfs promote * * Promotes the given clone fs to be the parent */ /* ARGSUSED */ static int zfs_do_promote(int argc, char **argv) { zfs_handle_t *zhp; int ret = 0; /* check options */ if (argc > 1 && argv[1][0] == '-') { (void) fprintf(stderr, gettext("invalid option '%c'\n"), argv[1][1]); usage(B_FALSE); } /* check number of arguments */ if (argc < 2) { (void) fprintf(stderr, gettext("missing clone filesystem" " argument\n")); usage(B_FALSE); } if (argc > 2) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } zhp = zfs_open(g_zfs, argv[1], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) return (1); ret = (zfs_promote(zhp) != 0); zfs_close(zhp); return (ret); } static int zfs_do_redact(int argc, char **argv) { char *snap = NULL; char *bookname = NULL; char **rsnaps = NULL; int numrsnaps = 0; argv++; argc--; if (argc < 3) { (void) fprintf(stderr, gettext("too few arguments\n")); usage(B_FALSE); } snap = argv[0]; bookname = argv[1]; rsnaps = argv + 2; numrsnaps = argc - 2; nvlist_t *rsnapnv = fnvlist_alloc(); for (int i = 0; i < numrsnaps; i++) { fnvlist_add_boolean(rsnapnv, rsnaps[i]); } int err = lzc_redact(snap, bookname, rsnapnv); fnvlist_free(rsnapnv); switch (err) { case 0: break; case ENOENT: (void) fprintf(stderr, gettext("provided snapshot %s does not exist\n"), snap); break; case EEXIST: (void) fprintf(stderr, gettext("specified redaction bookmark " "(%s) provided already exists\n"), bookname); break; case ENAMETOOLONG: (void) fprintf(stderr, gettext("provided bookmark name cannot " "be used, final name would be too long\n")); break; case E2BIG: (void) fprintf(stderr, gettext("too many redaction snapshots " "specified\n")); break; case EINVAL: if (strchr(bookname, '#') != NULL) (void) fprintf(stderr, gettext( "redaction bookmark name must not contain '#'\n")); else (void) fprintf(stderr, gettext( "redaction snapshot must be descendent of " "snapshot being redacted\n")); break; case EALREADY: (void) fprintf(stderr, gettext("attempted to redact redacted " "dataset or with respect to redacted dataset\n")); break; case ENOTSUP: (void) fprintf(stderr, gettext("redaction bookmarks feature " "not enabled\n")); break; case EXDEV: (void) fprintf(stderr, gettext("potentially invalid redaction " "snapshot; full dataset names required\n")); break; default: (void) fprintf(stderr, gettext("internal error: %s\n"), strerror(errno)); } return (err); } /* * zfs rollback [-rRf] * * -r Delete any intervening snapshots before doing rollback * -R Delete any snapshots and their clones * -f ignored for backwards compatibility * * Given a filesystem, rollback to a specific snapshot, discarding any changes * since then and making it the active dataset. If more recent snapshots exist, * the command will complain unless the '-r' flag is given. */ typedef struct rollback_cbdata { uint64_t cb_create; uint8_t cb_younger_ds_printed; boolean_t cb_first; int cb_doclones; char *cb_target; int cb_error; boolean_t cb_recurse; } rollback_cbdata_t; static int rollback_check_dependent(zfs_handle_t *zhp, void *data) { rollback_cbdata_t *cbp = data; if (cbp->cb_first && cbp->cb_recurse) { (void) fprintf(stderr, gettext("cannot rollback to " "'%s': clones of previous snapshots exist\n"), cbp->cb_target); (void) fprintf(stderr, gettext("use '-R' to " "force deletion of the following clones and " "dependents:\n")); cbp->cb_first = 0; cbp->cb_error = 1; } (void) fprintf(stderr, "%s\n", zfs_get_name(zhp)); zfs_close(zhp); return (0); } /* * Report some snapshots/bookmarks more recent than the one specified. * Used when '-r' is not specified. We reuse this same callback for the * snapshot dependents - if 'cb_dependent' is set, then this is a * dependent and we should report it without checking the transaction group. */ static int rollback_check(zfs_handle_t *zhp, void *data) { rollback_cbdata_t *cbp = data; /* * Max number of younger snapshots and/or bookmarks to display before * we stop the iteration. */ const uint8_t max_younger = 32; if (cbp->cb_doclones) { zfs_close(zhp); return (0); } if (zfs_prop_get_int(zhp, ZFS_PROP_CREATETXG) > cbp->cb_create) { if (cbp->cb_first && !cbp->cb_recurse) { (void) fprintf(stderr, gettext("cannot " "rollback to '%s': more recent snapshots " "or bookmarks exist\n"), cbp->cb_target); (void) fprintf(stderr, gettext("use '-r' to " "force deletion of the following " "snapshots and bookmarks:\n")); cbp->cb_first = 0; cbp->cb_error = 1; } if (cbp->cb_recurse) { if (zfs_iter_dependents(zhp, B_TRUE, rollback_check_dependent, cbp) != 0) { zfs_close(zhp); return (-1); } } else { (void) fprintf(stderr, "%s\n", zfs_get_name(zhp)); cbp->cb_younger_ds_printed++; } } zfs_close(zhp); if (cbp->cb_younger_ds_printed == max_younger) { /* * This non-recursive rollback is going to fail due to the * presence of snapshots and/or bookmarks that are younger than * the rollback target. * We printed some of the offending objects, now we stop * zfs_iter_snapshot/bookmark iteration so we can fail fast and * avoid iterating over the rest of the younger objects */ (void) fprintf(stderr, gettext("Output limited to %d " "snapshots/bookmarks\n"), max_younger); return (-1); } return (0); } static int zfs_do_rollback(int argc, char **argv) { int ret = 0; int c; boolean_t force = B_FALSE; rollback_cbdata_t cb = { 0 }; zfs_handle_t *zhp, *snap; char parentname[ZFS_MAX_DATASET_NAME_LEN]; char *delim; uint64_t min_txg = 0; /* check options */ while ((c = getopt(argc, argv, "rRf")) != -1) { switch (c) { case 'r': cb.cb_recurse = 1; break; case 'R': cb.cb_recurse = 1; cb.cb_doclones = 1; break; case 'f': force = B_TRUE; break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 1) { (void) fprintf(stderr, gettext("missing dataset argument\n")); usage(B_FALSE); } if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } /* open the snapshot */ if ((snap = zfs_open(g_zfs, argv[0], ZFS_TYPE_SNAPSHOT)) == NULL) return (1); /* open the parent dataset */ (void) strlcpy(parentname, argv[0], sizeof (parentname)); verify((delim = strrchr(parentname, '@')) != NULL); *delim = '\0'; if ((zhp = zfs_open(g_zfs, parentname, ZFS_TYPE_DATASET)) == NULL) { zfs_close(snap); return (1); } /* * Check for more recent snapshots and/or clones based on the presence * of '-r' and '-R'. */ cb.cb_target = argv[0]; cb.cb_create = zfs_prop_get_int(snap, ZFS_PROP_CREATETXG); cb.cb_first = B_TRUE; cb.cb_error = 0; if (cb.cb_create > 0) min_txg = cb.cb_create; if ((ret = zfs_iter_snapshots(zhp, B_FALSE, rollback_check, &cb, min_txg, 0)) != 0) goto out; if ((ret = zfs_iter_bookmarks(zhp, rollback_check, &cb)) != 0) goto out; if ((ret = cb.cb_error) != 0) goto out; /* * Rollback parent to the given snapshot. */ ret = zfs_rollback(zhp, snap, force); out: zfs_close(snap); zfs_close(zhp); if (ret == 0) return (0); else return (1); } /* * zfs set property=value ... { fs | snap | vol } ... * * Sets the given properties for all datasets specified on the command line. */ static int set_callback(zfs_handle_t *zhp, void *data) { nvlist_t *props = data; if (zfs_prop_set_list(zhp, props) != 0) { switch (libzfs_errno(g_zfs)) { case EZFS_MOUNTFAILED: (void) fprintf(stderr, gettext("property may be set " "but unable to remount filesystem\n")); break; case EZFS_SHARENFSFAILED: (void) fprintf(stderr, gettext("property may be set " "but unable to reshare filesystem\n")); break; } return (1); } return (0); } static int zfs_do_set(int argc, char **argv) { nvlist_t *props = NULL; int ds_start = -1; /* argv idx of first dataset arg */ int ret = 0; int i; /* check for options */ if (argc > 1 && argv[1][0] == '-') { (void) fprintf(stderr, gettext("invalid option '%c'\n"), argv[1][1]); usage(B_FALSE); } /* check number of arguments */ if (argc < 2) { (void) fprintf(stderr, gettext("missing arguments\n")); usage(B_FALSE); } if (argc < 3) { if (strchr(argv[1], '=') == NULL) { (void) fprintf(stderr, gettext("missing property=value " "argument(s)\n")); } else { (void) fprintf(stderr, gettext("missing dataset " "name(s)\n")); } usage(B_FALSE); } /* validate argument order: prop=val args followed by dataset args */ for (i = 1; i < argc; i++) { if (strchr(argv[i], '=') != NULL) { if (ds_start > 0) { /* out-of-order prop=val argument */ (void) fprintf(stderr, gettext("invalid " "argument order\n")); usage(B_FALSE); } } else if (ds_start < 0) { ds_start = i; } } if (ds_start < 0) { (void) fprintf(stderr, gettext("missing dataset name(s)\n")); usage(B_FALSE); } /* Populate a list of property settings */ if (nvlist_alloc(&props, NV_UNIQUE_NAME, 0) != 0) nomem(); for (i = 1; i < ds_start; i++) { if (!parseprop(props, argv[i])) { ret = -1; goto error; } } ret = zfs_for_each(argc - ds_start, argv + ds_start, 0, ZFS_TYPE_DATASET, NULL, NULL, 0, set_callback, props); error: nvlist_free(props); return (ret); } typedef struct snap_cbdata { nvlist_t *sd_nvl; boolean_t sd_recursive; const char *sd_snapname; } snap_cbdata_t; static int zfs_snapshot_cb(zfs_handle_t *zhp, void *arg) { snap_cbdata_t *sd = arg; char *name; int rv = 0; int error; if (sd->sd_recursive && zfs_prop_get_int(zhp, ZFS_PROP_INCONSISTENT) != 0) { zfs_close(zhp); return (0); } error = asprintf(&name, "%s@%s", zfs_get_name(zhp), sd->sd_snapname); if (error == -1) nomem(); fnvlist_add_boolean(sd->sd_nvl, name); free(name); if (sd->sd_recursive) rv = zfs_iter_filesystems(zhp, zfs_snapshot_cb, sd); zfs_close(zhp); return (rv); } /* * zfs snapshot [-r] [-o prop=value] ... * * Creates a snapshot with the given name. While functionally equivalent to * 'zfs create', it is a separate command to differentiate intent. */ static int zfs_do_snapshot(int argc, char **argv) { int ret = 0; int c; nvlist_t *props; snap_cbdata_t sd = { 0 }; boolean_t multiple_snaps = B_FALSE; if (nvlist_alloc(&props, NV_UNIQUE_NAME, 0) != 0) nomem(); if (nvlist_alloc(&sd.sd_nvl, NV_UNIQUE_NAME, 0) != 0) nomem(); /* check options */ while ((c = getopt(argc, argv, "ro:")) != -1) { switch (c) { case 'o': if (!parseprop(props, optarg)) { nvlist_free(sd.sd_nvl); nvlist_free(props); return (1); } break; case 'r': sd.sd_recursive = B_TRUE; multiple_snaps = B_TRUE; break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); goto usage; } } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 1) { (void) fprintf(stderr, gettext("missing snapshot argument\n")); goto usage; } if (argc > 1) multiple_snaps = B_TRUE; for (; argc > 0; argc--, argv++) { char *atp; zfs_handle_t *zhp; atp = strchr(argv[0], '@'); if (atp == NULL) goto usage; *atp = '\0'; sd.sd_snapname = atp + 1; zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) goto usage; if (zfs_snapshot_cb(zhp, &sd) != 0) goto usage; } ret = zfs_snapshot_nvl(g_zfs, sd.sd_nvl, props); nvlist_free(sd.sd_nvl); nvlist_free(props); if (ret != 0 && multiple_snaps) (void) fprintf(stderr, gettext("no snapshots were created\n")); return (ret != 0); usage: nvlist_free(sd.sd_nvl); nvlist_free(props); usage(B_FALSE); return (-1); } /* * Send a backup stream to stdout. */ static int zfs_do_send(int argc, char **argv) { char *fromname = NULL; char *toname = NULL; char *resume_token = NULL; char *cp; zfs_handle_t *zhp; sendflags_t flags = { 0 }; int c, err; nvlist_t *dbgnv = NULL; char *redactbook = NULL; struct option long_options[] = { {"replicate", no_argument, NULL, 'R'}, {"redact", required_argument, NULL, 'd'}, {"props", no_argument, NULL, 'p'}, {"parsable", no_argument, NULL, 'P'}, {"dedup", no_argument, NULL, 'D'}, {"verbose", no_argument, NULL, 'v'}, {"dryrun", no_argument, NULL, 'n'}, {"large-block", no_argument, NULL, 'L'}, {"embed", no_argument, NULL, 'e'}, {"resume", required_argument, NULL, 't'}, {"compressed", no_argument, NULL, 'c'}, {"raw", no_argument, NULL, 'w'}, {"backup", no_argument, NULL, 'b'}, {"holds", no_argument, NULL, 'h'}, {"saved", no_argument, NULL, 'S'}, {0, 0, 0, 0} }; /* check options */ while ((c = getopt_long(argc, argv, ":i:I:RDpvnPLeht:cwbd:S", long_options, NULL)) != -1) { switch (c) { case 'i': if (fromname) usage(B_FALSE); fromname = optarg; break; case 'I': if (fromname) usage(B_FALSE); fromname = optarg; flags.doall = B_TRUE; break; case 'R': flags.replicate = B_TRUE; break; case 'd': redactbook = optarg; break; case 'p': flags.props = B_TRUE; break; case 'b': flags.backup = B_TRUE; break; case 'h': flags.holds = B_TRUE; break; case 'P': flags.parsable = B_TRUE; break; case 'v': flags.verbosity++; flags.progress = B_TRUE; break; case 'D': (void) fprintf(stderr, gettext("WARNING: deduplicated send is no " "longer supported. A regular,\n" "non-deduplicated stream will be generated.\n\n")); break; case 'n': flags.dryrun = B_TRUE; break; case 'L': flags.largeblock = B_TRUE; break; case 'e': flags.embed_data = B_TRUE; break; case 't': resume_token = optarg; break; case 'c': flags.compress = B_TRUE; break; case 'w': flags.raw = B_TRUE; flags.compress = B_TRUE; flags.embed_data = B_TRUE; flags.largeblock = B_TRUE; break; case 'S': flags.saved = B_TRUE; break; case ':': /* * If a parameter was not passed, optopt contains the * value that would normally lead us into the * appropriate case statement. If it's > 256, then this * must be a longopt and we should look at argv to get * the string. Otherwise it's just the character, so we * should use it directly. */ if (optopt <= UINT8_MAX) { (void) fprintf(stderr, gettext("missing argument for '%c' " "option\n"), optopt); } else { (void) fprintf(stderr, gettext("missing argument for '%s' " "option\n"), argv[optind - 1]); } usage(B_FALSE); break; case '?': /*FALLTHROUGH*/ default: /* * If an invalid flag was passed, optopt contains the * character if it was a short flag, or 0 if it was a * longopt. */ if (optopt != 0) { (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); } else { (void) fprintf(stderr, gettext("invalid option '%s'\n"), argv[optind - 1]); } usage(B_FALSE); } } if (flags.parsable && flags.verbosity == 0) flags.verbosity = 1; argc -= optind; argv += optind; if (resume_token != NULL) { if (fromname != NULL || flags.replicate || flags.props || flags.backup || flags.holds || flags.saved || redactbook != NULL) { (void) fprintf(stderr, gettext("invalid flags combined with -t\n")); usage(B_FALSE); } if (argc > 0) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } } else { if (argc < 1) { (void) fprintf(stderr, gettext("missing snapshot argument\n")); usage(B_FALSE); } if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } } if (flags.saved) { if (fromname != NULL || flags.replicate || flags.props || flags.doall || flags.backup || flags.holds || flags.largeblock || flags.embed_data || flags.compress || flags.raw || redactbook != NULL) { (void) fprintf(stderr, gettext("incompatible flags " "combined with saved send flag\n")); usage(B_FALSE); } if (strchr(argv[0], '@') != NULL) { (void) fprintf(stderr, gettext("saved send must " "specify the dataset with partially-received " "state\n")); usage(B_FALSE); } } if (flags.raw && redactbook != NULL) { (void) fprintf(stderr, gettext("Error: raw sends may not be redacted.\n")); return (1); } if (!flags.dryrun && isatty(STDOUT_FILENO)) { (void) fprintf(stderr, gettext("Error: Stream can not be written to a terminal.\n" "You must redirect standard output.\n")); return (1); } if (flags.saved) { zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_DATASET); if (zhp == NULL) return (1); err = zfs_send_saved(zhp, &flags, STDOUT_FILENO, resume_token); zfs_close(zhp); return (err != 0); } else if (resume_token != NULL) { return (zfs_send_resume(g_zfs, &flags, STDOUT_FILENO, resume_token)); } /* * For everything except -R and -I, use the new, cleaner code path. */ if (!(flags.replicate || flags.doall)) { char frombuf[ZFS_MAX_DATASET_NAME_LEN]; if (fromname != NULL && (strchr(fromname, '#') == NULL && strchr(fromname, '@') == NULL)) { /* * Neither bookmark or snapshot was specified. Print a * warning, and assume snapshot. */ (void) fprintf(stderr, "Warning: incremental source " "didn't specify type, assuming snapshot. Use '@' " "or '#' prefix to avoid ambiguity.\n"); (void) snprintf(frombuf, sizeof (frombuf), "@%s", fromname); fromname = frombuf; } if (fromname != NULL && (fromname[0] == '#' || fromname[0] == '@')) { /* * Incremental source name begins with # or @. * Default to same fs as target. */ char tmpbuf[ZFS_MAX_DATASET_NAME_LEN]; (void) strlcpy(tmpbuf, fromname, sizeof (tmpbuf)); (void) strlcpy(frombuf, argv[0], sizeof (frombuf)); cp = strchr(frombuf, '@'); if (cp != NULL) *cp = '\0'; (void) strlcat(frombuf, tmpbuf, sizeof (frombuf)); fromname = frombuf; } zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_DATASET); if (zhp == NULL) return (1); err = zfs_send_one(zhp, fromname, STDOUT_FILENO, &flags, redactbook); zfs_close(zhp); return (err != 0); } if (fromname != NULL && strchr(fromname, '#')) { (void) fprintf(stderr, gettext("Error: multiple snapshots cannot be " "sent from a bookmark.\n")); return (1); } if (redactbook != NULL) { (void) fprintf(stderr, gettext("Error: multiple snapshots " "cannot be sent redacted.\n")); return (1); } if ((cp = strchr(argv[0], '@')) == NULL) { (void) fprintf(stderr, gettext("Error: " "Unsupported flag with filesystem or bookmark.\n")); return (1); } *cp = '\0'; toname = cp + 1; zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) return (1); /* * If they specified the full path to the snapshot, chop off * everything except the short name of the snapshot, but special * case if they specify the origin. */ if (fromname && (cp = strchr(fromname, '@')) != NULL) { char origin[ZFS_MAX_DATASET_NAME_LEN]; zprop_source_t src; (void) zfs_prop_get(zhp, ZFS_PROP_ORIGIN, origin, sizeof (origin), &src, NULL, 0, B_FALSE); if (strcmp(origin, fromname) == 0) { fromname = NULL; flags.fromorigin = B_TRUE; } else { *cp = '\0'; if (cp != fromname && strcmp(argv[0], fromname)) { (void) fprintf(stderr, gettext("incremental source must be " "in same filesystem\n")); usage(B_FALSE); } fromname = cp + 1; if (strchr(fromname, '@') || strchr(fromname, '/')) { (void) fprintf(stderr, gettext("invalid incremental source\n")); usage(B_FALSE); } } } if (flags.replicate && fromname == NULL) flags.doall = B_TRUE; err = zfs_send(zhp, fromname, toname, &flags, STDOUT_FILENO, NULL, 0, flags.verbosity >= 3 ? &dbgnv : NULL); if (flags.verbosity >= 3 && dbgnv != NULL) { /* * dump_nvlist prints to stdout, but that's been * redirected to a file. Make it print to stderr * instead. */ (void) dup2(STDERR_FILENO, STDOUT_FILENO); dump_nvlist(dbgnv, 0); nvlist_free(dbgnv); } zfs_close(zhp); return (err != 0); } /* * Restore a backup stream from stdin. */ static int zfs_do_receive(int argc, char **argv) { int c, err = 0; recvflags_t flags = { 0 }; boolean_t abort_resumable = B_FALSE; nvlist_t *props; if (nvlist_alloc(&props, NV_UNIQUE_NAME, 0) != 0) nomem(); /* check options */ while ((c = getopt(argc, argv, ":o:x:dehMnuvFsA")) != -1) { switch (c) { case 'o': if (!parseprop(props, optarg)) { nvlist_free(props); usage(B_FALSE); } break; case 'x': if (!parsepropname(props, optarg)) { nvlist_free(props); usage(B_FALSE); } break; case 'd': if (flags.istail) { (void) fprintf(stderr, gettext("invalid option " "combination: -d and -e are mutually " "exclusive\n")); usage(B_FALSE); } flags.isprefix = B_TRUE; break; case 'e': if (flags.isprefix) { (void) fprintf(stderr, gettext("invalid option " "combination: -d and -e are mutually " "exclusive\n")); usage(B_FALSE); } flags.istail = B_TRUE; break; case 'h': flags.skipholds = B_TRUE; break; case 'M': flags.forceunmount = B_TRUE; break; case 'n': flags.dryrun = B_TRUE; break; case 'u': flags.nomount = B_TRUE; break; case 'v': flags.verbose = B_TRUE; break; case 's': flags.resumable = B_TRUE; break; case 'F': flags.force = B_TRUE; break; case 'A': abort_resumable = B_TRUE; break; case ':': (void) fprintf(stderr, gettext("missing argument for " "'%c' option\n"), optopt); usage(B_FALSE); break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* zfs recv -e (use "tail" name) implies -d (remove dataset "head") */ if (flags.istail) flags.isprefix = B_TRUE; /* check number of arguments */ if (argc < 1) { (void) fprintf(stderr, gettext("missing snapshot argument\n")); usage(B_FALSE); } if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } if (abort_resumable) { if (flags.isprefix || flags.istail || flags.dryrun || flags.resumable || flags.nomount) { (void) fprintf(stderr, gettext("invalid option\n")); usage(B_FALSE); } char namebuf[ZFS_MAX_DATASET_NAME_LEN]; (void) snprintf(namebuf, sizeof (namebuf), "%s/%%recv", argv[0]); if (zfs_dataset_exists(g_zfs, namebuf, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME)) { zfs_handle_t *zhp = zfs_open(g_zfs, namebuf, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) { nvlist_free(props); return (1); } err = zfs_destroy(zhp, B_FALSE); zfs_close(zhp); } else { zfs_handle_t *zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) usage(B_FALSE); if (!zfs_prop_get_int(zhp, ZFS_PROP_INCONSISTENT) || zfs_prop_get(zhp, ZFS_PROP_RECEIVE_RESUME_TOKEN, NULL, 0, NULL, NULL, 0, B_TRUE) == -1) { (void) fprintf(stderr, gettext("'%s' does not have any " "resumable receive state to abort\n"), argv[0]); nvlist_free(props); zfs_close(zhp); return (1); } err = zfs_destroy(zhp, B_FALSE); zfs_close(zhp); } nvlist_free(props); return (err != 0); } if (isatty(STDIN_FILENO)) { (void) fprintf(stderr, gettext("Error: Backup stream can not be read " "from a terminal.\n" "You must redirect standard input.\n")); nvlist_free(props); return (1); } err = zfs_receive(g_zfs, argv[0], props, &flags, STDIN_FILENO, NULL); nvlist_free(props); return (err != 0); } /* * allow/unallow stuff */ /* copied from zfs/sys/dsl_deleg.h */ #define ZFS_DELEG_PERM_CREATE "create" #define ZFS_DELEG_PERM_DESTROY "destroy" #define ZFS_DELEG_PERM_SNAPSHOT "snapshot" #define ZFS_DELEG_PERM_ROLLBACK "rollback" #define ZFS_DELEG_PERM_CLONE "clone" #define ZFS_DELEG_PERM_PROMOTE "promote" #define ZFS_DELEG_PERM_RENAME "rename" #define ZFS_DELEG_PERM_MOUNT "mount" #define ZFS_DELEG_PERM_SHARE "share" #define ZFS_DELEG_PERM_SEND "send" #define ZFS_DELEG_PERM_RECEIVE "receive" #define ZFS_DELEG_PERM_ALLOW "allow" #define ZFS_DELEG_PERM_USERPROP "userprop" #define ZFS_DELEG_PERM_VSCAN "vscan" /* ??? */ #define ZFS_DELEG_PERM_USERQUOTA "userquota" #define ZFS_DELEG_PERM_GROUPQUOTA "groupquota" #define ZFS_DELEG_PERM_USERUSED "userused" #define ZFS_DELEG_PERM_GROUPUSED "groupused" #define ZFS_DELEG_PERM_USEROBJQUOTA "userobjquota" #define ZFS_DELEG_PERM_GROUPOBJQUOTA "groupobjquota" #define ZFS_DELEG_PERM_USEROBJUSED "userobjused" #define ZFS_DELEG_PERM_GROUPOBJUSED "groupobjused" #define ZFS_DELEG_PERM_HOLD "hold" #define ZFS_DELEG_PERM_RELEASE "release" #define ZFS_DELEG_PERM_DIFF "diff" #define ZFS_DELEG_PERM_BOOKMARK "bookmark" #define ZFS_DELEG_PERM_LOAD_KEY "load-key" #define ZFS_DELEG_PERM_CHANGE_KEY "change-key" #define ZFS_DELEG_PERM_PROJECTUSED "projectused" #define ZFS_DELEG_PERM_PROJECTQUOTA "projectquota" #define ZFS_DELEG_PERM_PROJECTOBJUSED "projectobjused" #define ZFS_DELEG_PERM_PROJECTOBJQUOTA "projectobjquota" #define ZFS_NUM_DELEG_NOTES ZFS_DELEG_NOTE_NONE static zfs_deleg_perm_tab_t zfs_deleg_perm_tbl[] = { { ZFS_DELEG_PERM_ALLOW, ZFS_DELEG_NOTE_ALLOW }, { ZFS_DELEG_PERM_CLONE, ZFS_DELEG_NOTE_CLONE }, { ZFS_DELEG_PERM_CREATE, ZFS_DELEG_NOTE_CREATE }, { ZFS_DELEG_PERM_DESTROY, ZFS_DELEG_NOTE_DESTROY }, { ZFS_DELEG_PERM_DIFF, ZFS_DELEG_NOTE_DIFF}, { ZFS_DELEG_PERM_HOLD, ZFS_DELEG_NOTE_HOLD }, { ZFS_DELEG_PERM_MOUNT, ZFS_DELEG_NOTE_MOUNT }, { ZFS_DELEG_PERM_PROMOTE, ZFS_DELEG_NOTE_PROMOTE }, { ZFS_DELEG_PERM_RECEIVE, ZFS_DELEG_NOTE_RECEIVE }, { ZFS_DELEG_PERM_RELEASE, ZFS_DELEG_NOTE_RELEASE }, { ZFS_DELEG_PERM_RENAME, ZFS_DELEG_NOTE_RENAME }, { ZFS_DELEG_PERM_ROLLBACK, ZFS_DELEG_NOTE_ROLLBACK }, { ZFS_DELEG_PERM_SEND, ZFS_DELEG_NOTE_SEND }, { ZFS_DELEG_PERM_SHARE, ZFS_DELEG_NOTE_SHARE }, { ZFS_DELEG_PERM_SNAPSHOT, ZFS_DELEG_NOTE_SNAPSHOT }, { ZFS_DELEG_PERM_BOOKMARK, ZFS_DELEG_NOTE_BOOKMARK }, { ZFS_DELEG_PERM_LOAD_KEY, ZFS_DELEG_NOTE_LOAD_KEY }, { ZFS_DELEG_PERM_CHANGE_KEY, ZFS_DELEG_NOTE_CHANGE_KEY }, { ZFS_DELEG_PERM_GROUPQUOTA, ZFS_DELEG_NOTE_GROUPQUOTA }, { ZFS_DELEG_PERM_GROUPUSED, ZFS_DELEG_NOTE_GROUPUSED }, { ZFS_DELEG_PERM_USERPROP, ZFS_DELEG_NOTE_USERPROP }, { ZFS_DELEG_PERM_USERQUOTA, ZFS_DELEG_NOTE_USERQUOTA }, { ZFS_DELEG_PERM_USERUSED, ZFS_DELEG_NOTE_USERUSED }, { ZFS_DELEG_PERM_USEROBJQUOTA, ZFS_DELEG_NOTE_USEROBJQUOTA }, { ZFS_DELEG_PERM_USEROBJUSED, ZFS_DELEG_NOTE_USEROBJUSED }, { ZFS_DELEG_PERM_GROUPOBJQUOTA, ZFS_DELEG_NOTE_GROUPOBJQUOTA }, { ZFS_DELEG_PERM_GROUPOBJUSED, ZFS_DELEG_NOTE_GROUPOBJUSED }, { ZFS_DELEG_PERM_PROJECTUSED, ZFS_DELEG_NOTE_PROJECTUSED }, { ZFS_DELEG_PERM_PROJECTQUOTA, ZFS_DELEG_NOTE_PROJECTQUOTA }, { ZFS_DELEG_PERM_PROJECTOBJUSED, ZFS_DELEG_NOTE_PROJECTOBJUSED }, { ZFS_DELEG_PERM_PROJECTOBJQUOTA, ZFS_DELEG_NOTE_PROJECTOBJQUOTA }, { NULL, ZFS_DELEG_NOTE_NONE } }; /* permission structure */ typedef struct deleg_perm { zfs_deleg_who_type_t dp_who_type; const char *dp_name; boolean_t dp_local; boolean_t dp_descend; } deleg_perm_t; /* */ typedef struct deleg_perm_node { deleg_perm_t dpn_perm; uu_avl_node_t dpn_avl_node; } deleg_perm_node_t; typedef struct fs_perm fs_perm_t; /* permissions set */ typedef struct who_perm { zfs_deleg_who_type_t who_type; const char *who_name; /* id */ char who_ug_name[256]; /* user/group name */ fs_perm_t *who_fsperm; /* uplink */ uu_avl_t *who_deleg_perm_avl; /* permissions */ } who_perm_t; /* */ typedef struct who_perm_node { who_perm_t who_perm; uu_avl_node_t who_avl_node; } who_perm_node_t; typedef struct fs_perm_set fs_perm_set_t; /* fs permissions */ struct fs_perm { const char *fsp_name; uu_avl_t *fsp_sc_avl; /* sets,create */ uu_avl_t *fsp_uge_avl; /* user,group,everyone */ fs_perm_set_t *fsp_set; /* uplink */ }; /* */ typedef struct fs_perm_node { fs_perm_t fspn_fsperm; uu_avl_t *fspn_avl; uu_list_node_t fspn_list_node; } fs_perm_node_t; /* top level structure */ struct fs_perm_set { uu_list_pool_t *fsps_list_pool; uu_list_t *fsps_list; /* list of fs_perms */ uu_avl_pool_t *fsps_named_set_avl_pool; uu_avl_pool_t *fsps_who_perm_avl_pool; uu_avl_pool_t *fsps_deleg_perm_avl_pool; }; static inline const char * deleg_perm_type(zfs_deleg_note_t note) { /* subcommands */ switch (note) { /* SUBCOMMANDS */ /* OTHER */ case ZFS_DELEG_NOTE_GROUPQUOTA: case ZFS_DELEG_NOTE_GROUPUSED: case ZFS_DELEG_NOTE_USERPROP: case ZFS_DELEG_NOTE_USERQUOTA: case ZFS_DELEG_NOTE_USERUSED: case ZFS_DELEG_NOTE_USEROBJQUOTA: case ZFS_DELEG_NOTE_USEROBJUSED: case ZFS_DELEG_NOTE_GROUPOBJQUOTA: case ZFS_DELEG_NOTE_GROUPOBJUSED: case ZFS_DELEG_NOTE_PROJECTUSED: case ZFS_DELEG_NOTE_PROJECTQUOTA: case ZFS_DELEG_NOTE_PROJECTOBJUSED: case ZFS_DELEG_NOTE_PROJECTOBJQUOTA: /* other */ return (gettext("other")); default: return (gettext("subcommand")); } } static int who_type2weight(zfs_deleg_who_type_t who_type) { int res; switch (who_type) { case ZFS_DELEG_NAMED_SET_SETS: case ZFS_DELEG_NAMED_SET: res = 0; break; case ZFS_DELEG_CREATE_SETS: case ZFS_DELEG_CREATE: res = 1; break; case ZFS_DELEG_USER_SETS: case ZFS_DELEG_USER: res = 2; break; case ZFS_DELEG_GROUP_SETS: case ZFS_DELEG_GROUP: res = 3; break; case ZFS_DELEG_EVERYONE_SETS: case ZFS_DELEG_EVERYONE: res = 4; break; default: res = -1; } return (res); } /* ARGSUSED */ static int who_perm_compare(const void *larg, const void *rarg, void *unused) { const who_perm_node_t *l = larg; const who_perm_node_t *r = rarg; zfs_deleg_who_type_t ltype = l->who_perm.who_type; zfs_deleg_who_type_t rtype = r->who_perm.who_type; int lweight = who_type2weight(ltype); int rweight = who_type2weight(rtype); int res = lweight - rweight; if (res == 0) res = strncmp(l->who_perm.who_name, r->who_perm.who_name, ZFS_MAX_DELEG_NAME-1); if (res == 0) return (0); if (res > 0) return (1); else return (-1); } /* ARGSUSED */ static int deleg_perm_compare(const void *larg, const void *rarg, void *unused) { const deleg_perm_node_t *l = larg; const deleg_perm_node_t *r = rarg; int res = strncmp(l->dpn_perm.dp_name, r->dpn_perm.dp_name, ZFS_MAX_DELEG_NAME-1); if (res == 0) return (0); if (res > 0) return (1); else return (-1); } static inline void fs_perm_set_init(fs_perm_set_t *fspset) { bzero(fspset, sizeof (fs_perm_set_t)); if ((fspset->fsps_list_pool = uu_list_pool_create("fsps_list_pool", sizeof (fs_perm_node_t), offsetof(fs_perm_node_t, fspn_list_node), NULL, UU_DEFAULT)) == NULL) nomem(); if ((fspset->fsps_list = uu_list_create(fspset->fsps_list_pool, NULL, UU_DEFAULT)) == NULL) nomem(); if ((fspset->fsps_named_set_avl_pool = uu_avl_pool_create( "named_set_avl_pool", sizeof (who_perm_node_t), offsetof( who_perm_node_t, who_avl_node), who_perm_compare, UU_DEFAULT)) == NULL) nomem(); if ((fspset->fsps_who_perm_avl_pool = uu_avl_pool_create( "who_perm_avl_pool", sizeof (who_perm_node_t), offsetof( who_perm_node_t, who_avl_node), who_perm_compare, UU_DEFAULT)) == NULL) nomem(); if ((fspset->fsps_deleg_perm_avl_pool = uu_avl_pool_create( "deleg_perm_avl_pool", sizeof (deleg_perm_node_t), offsetof( deleg_perm_node_t, dpn_avl_node), deleg_perm_compare, UU_DEFAULT)) == NULL) nomem(); } static inline void fs_perm_fini(fs_perm_t *); static inline void who_perm_fini(who_perm_t *); static inline void fs_perm_set_fini(fs_perm_set_t *fspset) { fs_perm_node_t *node = uu_list_first(fspset->fsps_list); while (node != NULL) { fs_perm_node_t *next_node = uu_list_next(fspset->fsps_list, node); fs_perm_t *fsperm = &node->fspn_fsperm; fs_perm_fini(fsperm); uu_list_remove(fspset->fsps_list, node); free(node); node = next_node; } uu_avl_pool_destroy(fspset->fsps_named_set_avl_pool); uu_avl_pool_destroy(fspset->fsps_who_perm_avl_pool); uu_avl_pool_destroy(fspset->fsps_deleg_perm_avl_pool); } static inline void deleg_perm_init(deleg_perm_t *deleg_perm, zfs_deleg_who_type_t type, const char *name) { deleg_perm->dp_who_type = type; deleg_perm->dp_name = name; } static inline void who_perm_init(who_perm_t *who_perm, fs_perm_t *fsperm, zfs_deleg_who_type_t type, const char *name) { uu_avl_pool_t *pool; pool = fsperm->fsp_set->fsps_deleg_perm_avl_pool; bzero(who_perm, sizeof (who_perm_t)); if ((who_perm->who_deleg_perm_avl = uu_avl_create(pool, NULL, UU_DEFAULT)) == NULL) nomem(); who_perm->who_type = type; who_perm->who_name = name; who_perm->who_fsperm = fsperm; } static inline void who_perm_fini(who_perm_t *who_perm) { deleg_perm_node_t *node = uu_avl_first(who_perm->who_deleg_perm_avl); while (node != NULL) { deleg_perm_node_t *next_node = uu_avl_next(who_perm->who_deleg_perm_avl, node); uu_avl_remove(who_perm->who_deleg_perm_avl, node); free(node); node = next_node; } uu_avl_destroy(who_perm->who_deleg_perm_avl); } static inline void fs_perm_init(fs_perm_t *fsperm, fs_perm_set_t *fspset, const char *fsname) { uu_avl_pool_t *nset_pool = fspset->fsps_named_set_avl_pool; uu_avl_pool_t *who_pool = fspset->fsps_who_perm_avl_pool; bzero(fsperm, sizeof (fs_perm_t)); if ((fsperm->fsp_sc_avl = uu_avl_create(nset_pool, NULL, UU_DEFAULT)) == NULL) nomem(); if ((fsperm->fsp_uge_avl = uu_avl_create(who_pool, NULL, UU_DEFAULT)) == NULL) nomem(); fsperm->fsp_set = fspset; fsperm->fsp_name = fsname; } static inline void fs_perm_fini(fs_perm_t *fsperm) { who_perm_node_t *node = uu_avl_first(fsperm->fsp_sc_avl); while (node != NULL) { who_perm_node_t *next_node = uu_avl_next(fsperm->fsp_sc_avl, node); who_perm_t *who_perm = &node->who_perm; who_perm_fini(who_perm); uu_avl_remove(fsperm->fsp_sc_avl, node); free(node); node = next_node; } node = uu_avl_first(fsperm->fsp_uge_avl); while (node != NULL) { who_perm_node_t *next_node = uu_avl_next(fsperm->fsp_uge_avl, node); who_perm_t *who_perm = &node->who_perm; who_perm_fini(who_perm); uu_avl_remove(fsperm->fsp_uge_avl, node); free(node); node = next_node; } uu_avl_destroy(fsperm->fsp_sc_avl); uu_avl_destroy(fsperm->fsp_uge_avl); } static void set_deleg_perm_node(uu_avl_t *avl, deleg_perm_node_t *node, zfs_deleg_who_type_t who_type, const char *name, char locality) { uu_avl_index_t idx = 0; deleg_perm_node_t *found_node = NULL; deleg_perm_t *deleg_perm = &node->dpn_perm; deleg_perm_init(deleg_perm, who_type, name); if ((found_node = uu_avl_find(avl, node, NULL, &idx)) == NULL) uu_avl_insert(avl, node, idx); else { node = found_node; deleg_perm = &node->dpn_perm; } switch (locality) { case ZFS_DELEG_LOCAL: deleg_perm->dp_local = B_TRUE; break; case ZFS_DELEG_DESCENDENT: deleg_perm->dp_descend = B_TRUE; break; case ZFS_DELEG_NA: break; default: assert(B_FALSE); /* invalid locality */ } } static inline int parse_who_perm(who_perm_t *who_perm, nvlist_t *nvl, char locality) { nvpair_t *nvp = NULL; fs_perm_set_t *fspset = who_perm->who_fsperm->fsp_set; uu_avl_t *avl = who_perm->who_deleg_perm_avl; zfs_deleg_who_type_t who_type = who_perm->who_type; while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) { const char *name = nvpair_name(nvp); data_type_t type = nvpair_type(nvp); uu_avl_pool_t *avl_pool = fspset->fsps_deleg_perm_avl_pool; deleg_perm_node_t *node = safe_malloc(sizeof (deleg_perm_node_t)); VERIFY(type == DATA_TYPE_BOOLEAN); uu_avl_node_init(node, &node->dpn_avl_node, avl_pool); set_deleg_perm_node(avl, node, who_type, name, locality); } return (0); } static inline int parse_fs_perm(fs_perm_t *fsperm, nvlist_t *nvl) { nvpair_t *nvp = NULL; fs_perm_set_t *fspset = fsperm->fsp_set; while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) { nvlist_t *nvl2 = NULL; const char *name = nvpair_name(nvp); uu_avl_t *avl = NULL; uu_avl_pool_t *avl_pool = NULL; zfs_deleg_who_type_t perm_type = name[0]; char perm_locality = name[1]; const char *perm_name = name + 3; who_perm_t *who_perm = NULL; assert('$' == name[2]); if (nvpair_value_nvlist(nvp, &nvl2) != 0) return (-1); switch (perm_type) { case ZFS_DELEG_CREATE: case ZFS_DELEG_CREATE_SETS: case ZFS_DELEG_NAMED_SET: case ZFS_DELEG_NAMED_SET_SETS: avl_pool = fspset->fsps_named_set_avl_pool; avl = fsperm->fsp_sc_avl; break; case ZFS_DELEG_USER: case ZFS_DELEG_USER_SETS: case ZFS_DELEG_GROUP: case ZFS_DELEG_GROUP_SETS: case ZFS_DELEG_EVERYONE: case ZFS_DELEG_EVERYONE_SETS: avl_pool = fspset->fsps_who_perm_avl_pool; avl = fsperm->fsp_uge_avl; break; default: assert(!"unhandled zfs_deleg_who_type_t"); } who_perm_node_t *found_node = NULL; who_perm_node_t *node = safe_malloc( sizeof (who_perm_node_t)); who_perm = &node->who_perm; uu_avl_index_t idx = 0; uu_avl_node_init(node, &node->who_avl_node, avl_pool); who_perm_init(who_perm, fsperm, perm_type, perm_name); if ((found_node = uu_avl_find(avl, node, NULL, &idx)) == NULL) { if (avl == fsperm->fsp_uge_avl) { uid_t rid = 0; struct passwd *p = NULL; struct group *g = NULL; const char *nice_name = NULL; switch (perm_type) { case ZFS_DELEG_USER_SETS: case ZFS_DELEG_USER: rid = atoi(perm_name); p = getpwuid(rid); if (p) nice_name = p->pw_name; break; case ZFS_DELEG_GROUP_SETS: case ZFS_DELEG_GROUP: rid = atoi(perm_name); g = getgrgid(rid); if (g) nice_name = g->gr_name; break; default: break; } if (nice_name != NULL) { (void) strlcpy( node->who_perm.who_ug_name, nice_name, 256); } else { /* User or group unknown */ (void) snprintf( node->who_perm.who_ug_name, sizeof (node->who_perm.who_ug_name), "(unknown: %d)", rid); } } uu_avl_insert(avl, node, idx); } else { node = found_node; who_perm = &node->who_perm; } assert(who_perm != NULL); (void) parse_who_perm(who_perm, nvl2, perm_locality); } return (0); } static inline int parse_fs_perm_set(fs_perm_set_t *fspset, nvlist_t *nvl) { nvpair_t *nvp = NULL; uu_avl_index_t idx = 0; while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) { nvlist_t *nvl2 = NULL; const char *fsname = nvpair_name(nvp); data_type_t type = nvpair_type(nvp); fs_perm_t *fsperm = NULL; fs_perm_node_t *node = safe_malloc(sizeof (fs_perm_node_t)); if (node == NULL) nomem(); fsperm = &node->fspn_fsperm; VERIFY(DATA_TYPE_NVLIST == type); uu_list_node_init(node, &node->fspn_list_node, fspset->fsps_list_pool); idx = uu_list_numnodes(fspset->fsps_list); fs_perm_init(fsperm, fspset, fsname); if (nvpair_value_nvlist(nvp, &nvl2) != 0) return (-1); (void) parse_fs_perm(fsperm, nvl2); uu_list_insert(fspset->fsps_list, node, idx); } return (0); } static inline const char * deleg_perm_comment(zfs_deleg_note_t note) { const char *str = ""; /* subcommands */ switch (note) { /* SUBCOMMANDS */ case ZFS_DELEG_NOTE_ALLOW: str = gettext("Must also have the permission that is being" "\n\t\t\t\tallowed"); break; case ZFS_DELEG_NOTE_CLONE: str = gettext("Must also have the 'create' ability and 'mount'" "\n\t\t\t\tability in the origin file system"); break; case ZFS_DELEG_NOTE_CREATE: str = gettext("Must also have the 'mount' ability"); break; case ZFS_DELEG_NOTE_DESTROY: str = gettext("Must also have the 'mount' ability"); break; case ZFS_DELEG_NOTE_DIFF: str = gettext("Allows lookup of paths within a dataset;" "\n\t\t\t\tgiven an object number. Ordinary users need this" "\n\t\t\t\tin order to use zfs diff"); break; case ZFS_DELEG_NOTE_HOLD: str = gettext("Allows adding a user hold to a snapshot"); break; case ZFS_DELEG_NOTE_MOUNT: str = gettext("Allows mount/umount of ZFS datasets"); break; case ZFS_DELEG_NOTE_PROMOTE: str = gettext("Must also have the 'mount'\n\t\t\t\tand" " 'promote' ability in the origin file system"); break; case ZFS_DELEG_NOTE_RECEIVE: str = gettext("Must also have the 'mount' and 'create'" " ability"); break; case ZFS_DELEG_NOTE_RELEASE: str = gettext("Allows releasing a user hold which\n\t\t\t\t" "might destroy the snapshot"); break; case ZFS_DELEG_NOTE_RENAME: str = gettext("Must also have the 'mount' and 'create'" "\n\t\t\t\tability in the new parent"); break; case ZFS_DELEG_NOTE_ROLLBACK: str = gettext(""); break; case ZFS_DELEG_NOTE_SEND: str = gettext(""); break; case ZFS_DELEG_NOTE_SHARE: str = gettext("Allows sharing file systems over NFS or SMB" "\n\t\t\t\tprotocols"); break; case ZFS_DELEG_NOTE_SNAPSHOT: str = gettext(""); break; case ZFS_DELEG_NOTE_LOAD_KEY: str = gettext("Allows loading or unloading an encryption key"); break; case ZFS_DELEG_NOTE_CHANGE_KEY: str = gettext("Allows changing or adding an encryption key"); break; /* * case ZFS_DELEG_NOTE_VSCAN: * str = gettext(""); * break; */ /* OTHER */ case ZFS_DELEG_NOTE_GROUPQUOTA: str = gettext("Allows accessing any groupquota@... property"); break; case ZFS_DELEG_NOTE_GROUPUSED: str = gettext("Allows reading any groupused@... property"); break; case ZFS_DELEG_NOTE_USERPROP: str = gettext("Allows changing any user property"); break; case ZFS_DELEG_NOTE_USERQUOTA: str = gettext("Allows accessing any userquota@... property"); break; case ZFS_DELEG_NOTE_USERUSED: str = gettext("Allows reading any userused@... property"); break; case ZFS_DELEG_NOTE_USEROBJQUOTA: str = gettext("Allows accessing any userobjquota@... property"); break; case ZFS_DELEG_NOTE_GROUPOBJQUOTA: str = gettext("Allows accessing any \n\t\t\t\t" "groupobjquota@... property"); break; case ZFS_DELEG_NOTE_GROUPOBJUSED: str = gettext("Allows reading any groupobjused@... property"); break; case ZFS_DELEG_NOTE_USEROBJUSED: str = gettext("Allows reading any userobjused@... property"); break; case ZFS_DELEG_NOTE_PROJECTQUOTA: str = gettext("Allows accessing any projectquota@... property"); break; case ZFS_DELEG_NOTE_PROJECTOBJQUOTA: str = gettext("Allows accessing any \n\t\t\t\t" "projectobjquota@... property"); break; case ZFS_DELEG_NOTE_PROJECTUSED: str = gettext("Allows reading any projectused@... property"); break; case ZFS_DELEG_NOTE_PROJECTOBJUSED: str = gettext("Allows accessing any \n\t\t\t\t" "projectobjused@... property"); break; /* other */ default: str = ""; } return (str); } struct allow_opts { boolean_t local; boolean_t descend; boolean_t user; boolean_t group; boolean_t everyone; boolean_t create; boolean_t set; boolean_t recursive; /* unallow only */ boolean_t prt_usage; boolean_t prt_perms; char *who; char *perms; const char *dataset; }; static inline int prop_cmp(const void *a, const void *b) { const char *str1 = *(const char **)a; const char *str2 = *(const char **)b; return (strcmp(str1, str2)); } static void allow_usage(boolean_t un, boolean_t requested, const char *msg) { const char *opt_desc[] = { "-h", gettext("show this help message and exit"), "-l", gettext("set permission locally"), "-d", gettext("set permission for descents"), "-u", gettext("set permission for user"), "-g", gettext("set permission for group"), "-e", gettext("set permission for everyone"), "-c", gettext("set create time permission"), "-s", gettext("define permission set"), /* unallow only */ "-r", gettext("remove permissions recursively"), }; size_t unallow_size = sizeof (opt_desc) / sizeof (char *); size_t allow_size = unallow_size - 2; const char *props[ZFS_NUM_PROPS]; int i; size_t count = 0; FILE *fp = requested ? stdout : stderr; zprop_desc_t *pdtbl = zfs_prop_get_table(); const char *fmt = gettext("%-16s %-14s\t%s\n"); (void) fprintf(fp, gettext("Usage: %s\n"), get_usage(un ? HELP_UNALLOW : HELP_ALLOW)); (void) fprintf(fp, gettext("Options:\n")); for (i = 0; i < (un ? unallow_size : allow_size); i += 2) { const char *opt = opt_desc[i]; const char *optdsc = opt_desc[i + 1]; (void) fprintf(fp, gettext(" %-10s %s\n"), opt, optdsc); } (void) fprintf(fp, gettext("\nThe following permissions are " "supported:\n\n")); (void) fprintf(fp, fmt, gettext("NAME"), gettext("TYPE"), gettext("NOTES")); for (i = 0; i < ZFS_NUM_DELEG_NOTES; i++) { const char *perm_name = zfs_deleg_perm_tbl[i].z_perm; zfs_deleg_note_t perm_note = zfs_deleg_perm_tbl[i].z_note; const char *perm_type = deleg_perm_type(perm_note); const char *perm_comment = deleg_perm_comment(perm_note); (void) fprintf(fp, fmt, perm_name, perm_type, perm_comment); } for (i = 0; i < ZFS_NUM_PROPS; i++) { zprop_desc_t *pd = &pdtbl[i]; if (pd->pd_visible != B_TRUE) continue; if (pd->pd_attr == PROP_READONLY) continue; props[count++] = pd->pd_name; } props[count] = NULL; qsort(props, count, sizeof (char *), prop_cmp); for (i = 0; i < count; i++) (void) fprintf(fp, fmt, props[i], gettext("property"), ""); if (msg != NULL) (void) fprintf(fp, gettext("\nzfs: error: %s"), msg); exit(requested ? 0 : 2); } static inline const char * munge_args(int argc, char **argv, boolean_t un, size_t expected_argc, char **permsp) { if (un && argc == expected_argc - 1) *permsp = NULL; else if (argc == expected_argc) *permsp = argv[argc - 2]; else allow_usage(un, B_FALSE, gettext("wrong number of parameters\n")); return (argv[argc - 1]); } static void parse_allow_args(int argc, char **argv, boolean_t un, struct allow_opts *opts) { int uge_sum = opts->user + opts->group + opts->everyone; int csuge_sum = opts->create + opts->set + uge_sum; int ldcsuge_sum = csuge_sum + opts->local + opts->descend; int all_sum = un ? ldcsuge_sum + opts->recursive : ldcsuge_sum; if (uge_sum > 1) allow_usage(un, B_FALSE, gettext("-u, -g, and -e are mutually exclusive\n")); if (opts->prt_usage) { if (argc == 0 && all_sum == 0) allow_usage(un, B_TRUE, NULL); else usage(B_FALSE); } if (opts->set) { if (csuge_sum > 1) allow_usage(un, B_FALSE, gettext("invalid options combined with -s\n")); opts->dataset = munge_args(argc, argv, un, 3, &opts->perms); if (argv[0][0] != '@') allow_usage(un, B_FALSE, gettext("invalid set name: missing '@' prefix\n")); opts->who = argv[0]; } else if (opts->create) { if (ldcsuge_sum > 1) allow_usage(un, B_FALSE, gettext("invalid options combined with -c\n")); opts->dataset = munge_args(argc, argv, un, 2, &opts->perms); } else if (opts->everyone) { if (csuge_sum > 1) allow_usage(un, B_FALSE, gettext("invalid options combined with -e\n")); opts->dataset = munge_args(argc, argv, un, 2, &opts->perms); } else if (uge_sum == 0 && argc > 0 && strcmp(argv[0], "everyone") == 0) { opts->everyone = B_TRUE; argc--; argv++; opts->dataset = munge_args(argc, argv, un, 2, &opts->perms); } else if (argc == 1 && !un) { opts->prt_perms = B_TRUE; opts->dataset = argv[argc-1]; } else { opts->dataset = munge_args(argc, argv, un, 3, &opts->perms); opts->who = argv[0]; } if (!opts->local && !opts->descend) { opts->local = B_TRUE; opts->descend = B_TRUE; } } static void store_allow_perm(zfs_deleg_who_type_t type, boolean_t local, boolean_t descend, const char *who, char *perms, nvlist_t *top_nvl) { int i; char ld[2] = { '\0', '\0' }; char who_buf[MAXNAMELEN + 32]; char base_type = '\0'; char set_type = '\0'; nvlist_t *base_nvl = NULL; nvlist_t *set_nvl = NULL; nvlist_t *nvl; if (nvlist_alloc(&base_nvl, NV_UNIQUE_NAME, 0) != 0) nomem(); if (nvlist_alloc(&set_nvl, NV_UNIQUE_NAME, 0) != 0) nomem(); switch (type) { case ZFS_DELEG_NAMED_SET_SETS: case ZFS_DELEG_NAMED_SET: set_type = ZFS_DELEG_NAMED_SET_SETS; base_type = ZFS_DELEG_NAMED_SET; ld[0] = ZFS_DELEG_NA; break; case ZFS_DELEG_CREATE_SETS: case ZFS_DELEG_CREATE: set_type = ZFS_DELEG_CREATE_SETS; base_type = ZFS_DELEG_CREATE; ld[0] = ZFS_DELEG_NA; break; case ZFS_DELEG_USER_SETS: case ZFS_DELEG_USER: set_type = ZFS_DELEG_USER_SETS; base_type = ZFS_DELEG_USER; if (local) ld[0] = ZFS_DELEG_LOCAL; if (descend) ld[1] = ZFS_DELEG_DESCENDENT; break; case ZFS_DELEG_GROUP_SETS: case ZFS_DELEG_GROUP: set_type = ZFS_DELEG_GROUP_SETS; base_type = ZFS_DELEG_GROUP; if (local) ld[0] = ZFS_DELEG_LOCAL; if (descend) ld[1] = ZFS_DELEG_DESCENDENT; break; case ZFS_DELEG_EVERYONE_SETS: case ZFS_DELEG_EVERYONE: set_type = ZFS_DELEG_EVERYONE_SETS; base_type = ZFS_DELEG_EVERYONE; if (local) ld[0] = ZFS_DELEG_LOCAL; if (descend) ld[1] = ZFS_DELEG_DESCENDENT; break; default: assert(set_type != '\0' && base_type != '\0'); } if (perms != NULL) { char *curr = perms; char *end = curr + strlen(perms); while (curr < end) { char *delim = strchr(curr, ','); if (delim == NULL) delim = end; else *delim = '\0'; if (curr[0] == '@') nvl = set_nvl; else nvl = base_nvl; (void) nvlist_add_boolean(nvl, curr); if (delim != end) *delim = ','; curr = delim + 1; } for (i = 0; i < 2; i++) { char locality = ld[i]; if (locality == 0) continue; if (!nvlist_empty(base_nvl)) { if (who != NULL) (void) snprintf(who_buf, sizeof (who_buf), "%c%c$%s", base_type, locality, who); else (void) snprintf(who_buf, sizeof (who_buf), "%c%c$", base_type, locality); (void) nvlist_add_nvlist(top_nvl, who_buf, base_nvl); } if (!nvlist_empty(set_nvl)) { if (who != NULL) (void) snprintf(who_buf, sizeof (who_buf), "%c%c$%s", set_type, locality, who); else (void) snprintf(who_buf, sizeof (who_buf), "%c%c$", set_type, locality); (void) nvlist_add_nvlist(top_nvl, who_buf, set_nvl); } } } else { for (i = 0; i < 2; i++) { char locality = ld[i]; if (locality == 0) continue; if (who != NULL) (void) snprintf(who_buf, sizeof (who_buf), "%c%c$%s", base_type, locality, who); else (void) snprintf(who_buf, sizeof (who_buf), "%c%c$", base_type, locality); (void) nvlist_add_boolean(top_nvl, who_buf); if (who != NULL) (void) snprintf(who_buf, sizeof (who_buf), "%c%c$%s", set_type, locality, who); else (void) snprintf(who_buf, sizeof (who_buf), "%c%c$", set_type, locality); (void) nvlist_add_boolean(top_nvl, who_buf); } } } static int construct_fsacl_list(boolean_t un, struct allow_opts *opts, nvlist_t **nvlp) { if (nvlist_alloc(nvlp, NV_UNIQUE_NAME, 0) != 0) nomem(); if (opts->set) { store_allow_perm(ZFS_DELEG_NAMED_SET, opts->local, opts->descend, opts->who, opts->perms, *nvlp); } else if (opts->create) { store_allow_perm(ZFS_DELEG_CREATE, opts->local, opts->descend, NULL, opts->perms, *nvlp); } else if (opts->everyone) { store_allow_perm(ZFS_DELEG_EVERYONE, opts->local, opts->descend, NULL, opts->perms, *nvlp); } else { char *curr = opts->who; char *end = curr + strlen(curr); while (curr < end) { const char *who; zfs_deleg_who_type_t who_type = ZFS_DELEG_WHO_UNKNOWN; char *endch; char *delim = strchr(curr, ','); char errbuf[256]; char id[64]; struct passwd *p = NULL; struct group *g = NULL; uid_t rid; if (delim == NULL) delim = end; else *delim = '\0'; rid = (uid_t)strtol(curr, &endch, 0); if (opts->user) { who_type = ZFS_DELEG_USER; if (*endch != '\0') p = getpwnam(curr); else p = getpwuid(rid); if (p != NULL) rid = p->pw_uid; else if (*endch != '\0') { (void) snprintf(errbuf, 256, gettext( "invalid user %s\n"), curr); allow_usage(un, B_TRUE, errbuf); } } else if (opts->group) { who_type = ZFS_DELEG_GROUP; if (*endch != '\0') g = getgrnam(curr); else g = getgrgid(rid); if (g != NULL) rid = g->gr_gid; else if (*endch != '\0') { (void) snprintf(errbuf, 256, gettext( "invalid group %s\n"), curr); allow_usage(un, B_TRUE, errbuf); } } else { if (*endch != '\0') { p = getpwnam(curr); } else { p = getpwuid(rid); } if (p == NULL) { if (*endch != '\0') { g = getgrnam(curr); } else { g = getgrgid(rid); } } if (p != NULL) { who_type = ZFS_DELEG_USER; rid = p->pw_uid; } else if (g != NULL) { who_type = ZFS_DELEG_GROUP; rid = g->gr_gid; } else { (void) snprintf(errbuf, 256, gettext( "invalid user/group %s\n"), curr); allow_usage(un, B_TRUE, errbuf); } } (void) sprintf(id, "%u", rid); who = id; store_allow_perm(who_type, opts->local, opts->descend, who, opts->perms, *nvlp); curr = delim + 1; } } return (0); } static void print_set_creat_perms(uu_avl_t *who_avl) { const char *sc_title[] = { gettext("Permission sets:\n"), gettext("Create time permissions:\n"), NULL }; who_perm_node_t *who_node = NULL; int prev_weight = -1; for (who_node = uu_avl_first(who_avl); who_node != NULL; who_node = uu_avl_next(who_avl, who_node)) { uu_avl_t *avl = who_node->who_perm.who_deleg_perm_avl; zfs_deleg_who_type_t who_type = who_node->who_perm.who_type; const char *who_name = who_node->who_perm.who_name; int weight = who_type2weight(who_type); boolean_t first = B_TRUE; deleg_perm_node_t *deleg_node; if (prev_weight != weight) { (void) printf("%s", sc_title[weight]); prev_weight = weight; } if (who_name == NULL || strnlen(who_name, 1) == 0) (void) printf("\t"); else (void) printf("\t%s ", who_name); for (deleg_node = uu_avl_first(avl); deleg_node != NULL; deleg_node = uu_avl_next(avl, deleg_node)) { if (first) { (void) printf("%s", deleg_node->dpn_perm.dp_name); first = B_FALSE; } else (void) printf(",%s", deleg_node->dpn_perm.dp_name); } (void) printf("\n"); } } static void print_uge_deleg_perms(uu_avl_t *who_avl, boolean_t local, boolean_t descend, const char *title) { who_perm_node_t *who_node = NULL; boolean_t prt_title = B_TRUE; uu_avl_walk_t *walk; if ((walk = uu_avl_walk_start(who_avl, UU_WALK_ROBUST)) == NULL) nomem(); while ((who_node = uu_avl_walk_next(walk)) != NULL) { const char *who_name = who_node->who_perm.who_name; const char *nice_who_name = who_node->who_perm.who_ug_name; uu_avl_t *avl = who_node->who_perm.who_deleg_perm_avl; zfs_deleg_who_type_t who_type = who_node->who_perm.who_type; char delim = ' '; deleg_perm_node_t *deleg_node; boolean_t prt_who = B_TRUE; for (deleg_node = uu_avl_first(avl); deleg_node != NULL; deleg_node = uu_avl_next(avl, deleg_node)) { if (local != deleg_node->dpn_perm.dp_local || descend != deleg_node->dpn_perm.dp_descend) continue; if (prt_who) { const char *who = NULL; if (prt_title) { prt_title = B_FALSE; (void) printf("%s", title); } switch (who_type) { case ZFS_DELEG_USER_SETS: case ZFS_DELEG_USER: who = gettext("user"); if (nice_who_name) who_name = nice_who_name; break; case ZFS_DELEG_GROUP_SETS: case ZFS_DELEG_GROUP: who = gettext("group"); if (nice_who_name) who_name = nice_who_name; break; case ZFS_DELEG_EVERYONE_SETS: case ZFS_DELEG_EVERYONE: who = gettext("everyone"); who_name = NULL; break; default: assert(who != NULL); } prt_who = B_FALSE; if (who_name == NULL) (void) printf("\t%s", who); else (void) printf("\t%s %s", who, who_name); } (void) printf("%c%s", delim, deleg_node->dpn_perm.dp_name); delim = ','; } if (!prt_who) (void) printf("\n"); } uu_avl_walk_end(walk); } static void print_fs_perms(fs_perm_set_t *fspset) { fs_perm_node_t *node = NULL; char buf[MAXNAMELEN + 32]; const char *dsname = buf; for (node = uu_list_first(fspset->fsps_list); node != NULL; node = uu_list_next(fspset->fsps_list, node)) { uu_avl_t *sc_avl = node->fspn_fsperm.fsp_sc_avl; uu_avl_t *uge_avl = node->fspn_fsperm.fsp_uge_avl; int left = 0; (void) snprintf(buf, sizeof (buf), gettext("---- Permissions on %s "), node->fspn_fsperm.fsp_name); (void) printf("%s", dsname); left = 70 - strlen(buf); while (left-- > 0) (void) printf("-"); (void) printf("\n"); print_set_creat_perms(sc_avl); print_uge_deleg_perms(uge_avl, B_TRUE, B_FALSE, gettext("Local permissions:\n")); print_uge_deleg_perms(uge_avl, B_FALSE, B_TRUE, gettext("Descendent permissions:\n")); print_uge_deleg_perms(uge_avl, B_TRUE, B_TRUE, gettext("Local+Descendent permissions:\n")); } } static fs_perm_set_t fs_perm_set = { NULL, NULL, NULL, NULL }; struct deleg_perms { boolean_t un; nvlist_t *nvl; }; static int set_deleg_perms(zfs_handle_t *zhp, void *data) { struct deleg_perms *perms = (struct deleg_perms *)data; zfs_type_t zfs_type = zfs_get_type(zhp); if (zfs_type != ZFS_TYPE_FILESYSTEM && zfs_type != ZFS_TYPE_VOLUME) return (0); return (zfs_set_fsacl(zhp, perms->un, perms->nvl)); } static int zfs_do_allow_unallow_impl(int argc, char **argv, boolean_t un) { zfs_handle_t *zhp; nvlist_t *perm_nvl = NULL; nvlist_t *update_perm_nvl = NULL; int error = 1; int c; struct allow_opts opts = { 0 }; const char *optstr = un ? "ldugecsrh" : "ldugecsh"; /* check opts */ while ((c = getopt(argc, argv, optstr)) != -1) { switch (c) { case 'l': opts.local = B_TRUE; break; case 'd': opts.descend = B_TRUE; break; case 'u': opts.user = B_TRUE; break; case 'g': opts.group = B_TRUE; break; case 'e': opts.everyone = B_TRUE; break; case 's': opts.set = B_TRUE; break; case 'c': opts.create = B_TRUE; break; case 'r': opts.recursive = B_TRUE; break; case ':': (void) fprintf(stderr, gettext("missing argument for " "'%c' option\n"), optopt); usage(B_FALSE); break; case 'h': opts.prt_usage = B_TRUE; break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* check arguments */ parse_allow_args(argc, argv, un, &opts); /* try to open the dataset */ if ((zhp = zfs_open(g_zfs, opts.dataset, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME)) == NULL) { (void) fprintf(stderr, "Failed to open dataset: %s\n", opts.dataset); return (-1); } if (zfs_get_fsacl(zhp, &perm_nvl) != 0) goto cleanup2; fs_perm_set_init(&fs_perm_set); if (parse_fs_perm_set(&fs_perm_set, perm_nvl) != 0) { (void) fprintf(stderr, "Failed to parse fsacl permissions\n"); goto cleanup1; } if (opts.prt_perms) print_fs_perms(&fs_perm_set); else { (void) construct_fsacl_list(un, &opts, &update_perm_nvl); if (zfs_set_fsacl(zhp, un, update_perm_nvl) != 0) goto cleanup0; if (un && opts.recursive) { struct deleg_perms data = { un, update_perm_nvl }; if (zfs_iter_filesystems(zhp, set_deleg_perms, &data) != 0) goto cleanup0; } } error = 0; cleanup0: nvlist_free(perm_nvl); nvlist_free(update_perm_nvl); cleanup1: fs_perm_set_fini(&fs_perm_set); cleanup2: zfs_close(zhp); return (error); } static int zfs_do_allow(int argc, char **argv) { return (zfs_do_allow_unallow_impl(argc, argv, B_FALSE)); } static int zfs_do_unallow(int argc, char **argv) { return (zfs_do_allow_unallow_impl(argc, argv, B_TRUE)); } static int zfs_do_hold_rele_impl(int argc, char **argv, boolean_t holding) { int errors = 0; int i; const char *tag; boolean_t recursive = B_FALSE; const char *opts = holding ? "rt" : "r"; int c; /* check options */ while ((c = getopt(argc, argv, opts)) != -1) { switch (c) { case 'r': recursive = B_TRUE; break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 2) usage(B_FALSE); tag = argv[0]; --argc; ++argv; if (holding && tag[0] == '.') { /* tags starting with '.' are reserved for libzfs */ (void) fprintf(stderr, gettext("tag may not start with '.'\n")); usage(B_FALSE); } for (i = 0; i < argc; ++i) { zfs_handle_t *zhp; char parent[ZFS_MAX_DATASET_NAME_LEN]; const char *delim; char *path = argv[i]; delim = strchr(path, '@'); if (delim == NULL) { (void) fprintf(stderr, gettext("'%s' is not a snapshot\n"), path); ++errors; continue; } (void) strncpy(parent, path, delim - path); parent[delim - path] = '\0'; zhp = zfs_open(g_zfs, parent, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) { ++errors; continue; } if (holding) { if (zfs_hold(zhp, delim+1, tag, recursive, -1) != 0) ++errors; } else { if (zfs_release(zhp, delim+1, tag, recursive) != 0) ++errors; } zfs_close(zhp); } return (errors != 0); } /* * zfs hold [-r] [-t] ... * * -r Recursively hold * * Apply a user-hold with the given tag to the list of snapshots. */ static int zfs_do_hold(int argc, char **argv) { return (zfs_do_hold_rele_impl(argc, argv, B_TRUE)); } /* * zfs release [-r] ... * * -r Recursively release * * Release a user-hold with the given tag from the list of snapshots. */ static int zfs_do_release(int argc, char **argv) { return (zfs_do_hold_rele_impl(argc, argv, B_FALSE)); } typedef struct holds_cbdata { boolean_t cb_recursive; const char *cb_snapname; nvlist_t **cb_nvlp; size_t cb_max_namelen; size_t cb_max_taglen; } holds_cbdata_t; #define STRFTIME_FMT_STR "%a %b %e %H:%M %Y" #define DATETIME_BUF_LEN (32) /* * */ static void print_holds(boolean_t scripted, int nwidth, int tagwidth, nvlist_t *nvl) { int i; nvpair_t *nvp = NULL; char *hdr_cols[] = { "NAME", "TAG", "TIMESTAMP" }; const char *col; if (!scripted) { for (i = 0; i < 3; i++) { col = gettext(hdr_cols[i]); if (i < 2) (void) printf("%-*s ", i ? tagwidth : nwidth, col); else (void) printf("%s\n", col); } } while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) { char *zname = nvpair_name(nvp); nvlist_t *nvl2; nvpair_t *nvp2 = NULL; (void) nvpair_value_nvlist(nvp, &nvl2); while ((nvp2 = nvlist_next_nvpair(nvl2, nvp2)) != NULL) { char tsbuf[DATETIME_BUF_LEN]; char *tagname = nvpair_name(nvp2); uint64_t val = 0; time_t time; struct tm t; (void) nvpair_value_uint64(nvp2, &val); time = (time_t)val; (void) localtime_r(&time, &t); (void) strftime(tsbuf, DATETIME_BUF_LEN, gettext(STRFTIME_FMT_STR), &t); if (scripted) { (void) printf("%s\t%s\t%s\n", zname, tagname, tsbuf); } else { (void) printf("%-*s %-*s %s\n", nwidth, zname, tagwidth, tagname, tsbuf); } } } } /* * Generic callback function to list a dataset or snapshot. */ static int holds_callback(zfs_handle_t *zhp, void *data) { holds_cbdata_t *cbp = data; nvlist_t *top_nvl = *cbp->cb_nvlp; nvlist_t *nvl = NULL; nvpair_t *nvp = NULL; const char *zname = zfs_get_name(zhp); size_t znamelen = strlen(zname); if (cbp->cb_recursive) { const char *snapname; char *delim = strchr(zname, '@'); if (delim == NULL) return (0); snapname = delim + 1; if (strcmp(cbp->cb_snapname, snapname)) return (0); } if (zfs_get_holds(zhp, &nvl) != 0) return (-1); if (znamelen > cbp->cb_max_namelen) cbp->cb_max_namelen = znamelen; while ((nvp = nvlist_next_nvpair(nvl, nvp)) != NULL) { const char *tag = nvpair_name(nvp); size_t taglen = strlen(tag); if (taglen > cbp->cb_max_taglen) cbp->cb_max_taglen = taglen; } return (nvlist_add_nvlist(top_nvl, zname, nvl)); } /* * zfs holds [-rH] ... * * -r Lists holds that are set on the named snapshots recursively. * -H Scripted mode; elide headers and separate columns by tabs. */ static int zfs_do_holds(int argc, char **argv) { int errors = 0; int c; int i; boolean_t scripted = B_FALSE; boolean_t recursive = B_FALSE; const char *opts = "rH"; nvlist_t *nvl; int types = ZFS_TYPE_SNAPSHOT; holds_cbdata_t cb = { 0 }; int limit = 0; int ret = 0; int flags = 0; /* check options */ while ((c = getopt(argc, argv, opts)) != -1) { switch (c) { case 'r': recursive = B_TRUE; break; case 'H': scripted = B_TRUE; break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } if (recursive) { types |= ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME; flags |= ZFS_ITER_RECURSE; } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 1) usage(B_FALSE); if (nvlist_alloc(&nvl, NV_UNIQUE_NAME, 0) != 0) nomem(); for (i = 0; i < argc; ++i) { char *snapshot = argv[i]; const char *delim; const char *snapname; delim = strchr(snapshot, '@'); if (delim == NULL) { (void) fprintf(stderr, gettext("'%s' is not a snapshot\n"), snapshot); ++errors; continue; } snapname = delim + 1; if (recursive) snapshot[delim - snapshot] = '\0'; cb.cb_recursive = recursive; cb.cb_snapname = snapname; cb.cb_nvlp = &nvl; /* * 1. collect holds data, set format options */ ret = zfs_for_each(argc, argv, flags, types, NULL, NULL, limit, holds_callback, &cb); if (ret != 0) ++errors; } /* * 2. print holds data */ print_holds(scripted, cb.cb_max_namelen, cb.cb_max_taglen, nvl); if (nvlist_empty(nvl)) (void) fprintf(stderr, gettext("no datasets available\n")); nvlist_free(nvl); return (0 != errors); } #define CHECK_SPINNER 30 #define SPINNER_TIME 3 /* seconds */ #define MOUNT_TIME 1 /* seconds */ typedef struct get_all_state { boolean_t ga_verbose; get_all_cb_t *ga_cbp; } get_all_state_t; static int get_one_dataset(zfs_handle_t *zhp, void *data) { static char *spin[] = { "-", "\\", "|", "/" }; static int spinval = 0; static int spincheck = 0; static time_t last_spin_time = (time_t)0; get_all_state_t *state = data; zfs_type_t type = zfs_get_type(zhp); if (state->ga_verbose) { if (--spincheck < 0) { time_t now = time(NULL); if (last_spin_time + SPINNER_TIME < now) { update_progress(spin[spinval++ % 4]); last_spin_time = now; } spincheck = CHECK_SPINNER; } } /* * Iterate over any nested datasets. */ if (zfs_iter_filesystems(zhp, get_one_dataset, data) != 0) { zfs_close(zhp); return (1); } /* * Skip any datasets whose type does not match. */ if ((type & ZFS_TYPE_FILESYSTEM) == 0) { zfs_close(zhp); return (0); } libzfs_add_handle(state->ga_cbp, zhp); assert(state->ga_cbp->cb_used <= state->ga_cbp->cb_alloc); return (0); } static void get_all_datasets(get_all_cb_t *cbp, boolean_t verbose) { get_all_state_t state = { .ga_verbose = verbose, .ga_cbp = cbp }; if (verbose) set_progress_header(gettext("Reading ZFS config")); (void) zfs_iter_root(g_zfs, get_one_dataset, &state); if (verbose) finish_progress(gettext("done.")); } /* * Generic callback for sharing or mounting filesystems. Because the code is so * similar, we have a common function with an extra parameter to determine which * mode we are using. */ typedef enum { OP_SHARE, OP_MOUNT } share_mount_op_t; typedef struct share_mount_state { share_mount_op_t sm_op; boolean_t sm_verbose; int sm_flags; char *sm_options; char *sm_proto; /* only valid for OP_SHARE */ pthread_mutex_t sm_lock; /* protects the remaining fields */ uint_t sm_total; /* number of filesystems to process */ uint_t sm_done; /* number of filesystems processed */ int sm_status; /* -1 if any of the share/mount operations failed */ } share_mount_state_t; /* * Share or mount a dataset. */ static int share_mount_one(zfs_handle_t *zhp, int op, int flags, char *protocol, boolean_t explicit, const char *options) { char mountpoint[ZFS_MAXPROPLEN]; char shareopts[ZFS_MAXPROPLEN]; char smbshareopts[ZFS_MAXPROPLEN]; const char *cmdname = op == OP_SHARE ? "share" : "mount"; struct mnttab mnt; uint64_t zoned, canmount; boolean_t shared_nfs, shared_smb; assert(zfs_get_type(zhp) & ZFS_TYPE_FILESYSTEM); /* * Check to make sure we can mount/share this dataset. If we * are in the global zone and the filesystem is exported to a * local zone, or if we are in a local zone and the * filesystem is not exported, then it is an error. */ zoned = zfs_prop_get_int(zhp, ZFS_PROP_ZONED); if (zoned && getzoneid() == GLOBAL_ZONEID) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': " "dataset is exported to a local zone\n"), cmdname, zfs_get_name(zhp)); return (1); } else if (!zoned && getzoneid() != GLOBAL_ZONEID) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': " "permission denied\n"), cmdname, zfs_get_name(zhp)); return (1); } /* * Ignore any filesystems which don't apply to us. This * includes those with a legacy mountpoint, or those with * legacy share options. */ verify(zfs_prop_get(zhp, ZFS_PROP_MOUNTPOINT, mountpoint, sizeof (mountpoint), NULL, NULL, 0, B_FALSE) == 0); verify(zfs_prop_get(zhp, ZFS_PROP_SHARENFS, shareopts, sizeof (shareopts), NULL, NULL, 0, B_FALSE) == 0); verify(zfs_prop_get(zhp, ZFS_PROP_SHARESMB, smbshareopts, sizeof (smbshareopts), NULL, NULL, 0, B_FALSE) == 0); if (op == OP_SHARE && strcmp(shareopts, "off") == 0 && strcmp(smbshareopts, "off") == 0) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot share '%s': " "legacy share\n"), zfs_get_name(zhp)); (void) fprintf(stderr, gettext("use share(1M) to " "share this filesystem, or set " "sharenfs property on\n")); return (1); } /* * We cannot share or mount legacy filesystems. If the * shareopts is non-legacy but the mountpoint is legacy, we * treat it as a legacy share. */ if (strcmp(mountpoint, "legacy") == 0) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': " "legacy mountpoint\n"), cmdname, zfs_get_name(zhp)); (void) fprintf(stderr, gettext("use %s(1M) to " "%s this filesystem\n"), cmdname, cmdname); return (1); } if (strcmp(mountpoint, "none") == 0) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': no " "mountpoint set\n"), cmdname, zfs_get_name(zhp)); return (1); } /* * canmount explicit outcome * on no pass through * on yes pass through * off no return 0 * off yes display error, return 1 * noauto no return 0 * noauto yes pass through */ canmount = zfs_prop_get_int(zhp, ZFS_PROP_CANMOUNT); if (canmount == ZFS_CANMOUNT_OFF) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': " "'canmount' property is set to 'off'\n"), cmdname, zfs_get_name(zhp)); return (1); } else if (canmount == ZFS_CANMOUNT_NOAUTO && !explicit) { /* * When performing a 'zfs mount -a', we skip any mounts for * datasets that have 'noauto' set. Sharing a dataset with * 'noauto' set is only allowed if it's mounted. */ if (op == OP_MOUNT) return (0); if (op == OP_SHARE && !zfs_is_mounted(zhp, NULL)) { /* also purge it from existing exports */ zfs_unshareall_bypath(zhp, mountpoint); return (0); } } /* * If this filesystem is encrypted and does not have * a loaded key, we can not mount it. */ if ((flags & MS_CRYPT) == 0 && zfs_prop_get_int(zhp, ZFS_PROP_ENCRYPTION) != ZIO_CRYPT_OFF && zfs_prop_get_int(zhp, ZFS_PROP_KEYSTATUS) == ZFS_KEYSTATUS_UNAVAILABLE) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': " "encryption key not loaded\n"), cmdname, zfs_get_name(zhp)); return (1); } /* * If this filesystem is inconsistent and has a receive resume * token, we can not mount it. */ if (zfs_prop_get_int(zhp, ZFS_PROP_INCONSISTENT) && zfs_prop_get(zhp, ZFS_PROP_RECEIVE_RESUME_TOKEN, NULL, 0, NULL, NULL, 0, B_TRUE) == 0) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': " "Contains partially-completed state from " "\"zfs receive -s\", which can be resumed with " "\"zfs send -t\"\n"), cmdname, zfs_get_name(zhp)); return (1); } if (zfs_prop_get_int(zhp, ZFS_PROP_REDACTED) && !(flags & MS_FORCE)) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot %s '%s': " "Dataset is not complete, was created by receiving " "a redacted zfs send stream.\n"), cmdname, zfs_get_name(zhp)); return (1); } /* * At this point, we have verified that the mountpoint and/or * shareopts are appropriate for auto management. If the * filesystem is already mounted or shared, return (failing * for explicit requests); otherwise mount or share the * filesystem. */ switch (op) { case OP_SHARE: shared_nfs = zfs_is_shared_nfs(zhp, NULL); shared_smb = zfs_is_shared_smb(zhp, NULL); if ((shared_nfs && shared_smb) || (shared_nfs && strcmp(shareopts, "on") == 0 && strcmp(smbshareopts, "off") == 0) || (shared_smb && strcmp(smbshareopts, "on") == 0 && strcmp(shareopts, "off") == 0)) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot share " "'%s': filesystem already shared\n"), zfs_get_name(zhp)); return (1); } if (!zfs_is_mounted(zhp, NULL) && zfs_mount(zhp, NULL, flags) != 0) return (1); if (protocol == NULL) { if (zfs_shareall(zhp) != 0) return (1); } else if (strcmp(protocol, "nfs") == 0) { if (zfs_share_nfs(zhp)) return (1); } else if (strcmp(protocol, "smb") == 0) { if (zfs_share_smb(zhp)) return (1); } else { (void) fprintf(stderr, gettext("cannot share " "'%s': invalid share type '%s' " "specified\n"), zfs_get_name(zhp), protocol); return (1); } break; case OP_MOUNT: if (options == NULL) mnt.mnt_mntopts = ""; else mnt.mnt_mntopts = (char *)options; if (!hasmntopt(&mnt, MNTOPT_REMOUNT) && zfs_is_mounted(zhp, NULL)) { if (!explicit) return (0); (void) fprintf(stderr, gettext("cannot mount " "'%s': filesystem already mounted\n"), zfs_get_name(zhp)); return (1); } if (zfs_mount(zhp, options, flags) != 0) return (1); break; } return (0); } /* * Reports progress in the form "(current/total)". Not thread-safe. */ static void report_mount_progress(int current, int total) { static time_t last_progress_time = 0; time_t now = time(NULL); char info[32]; /* report 1..n instead of 0..n-1 */ ++current; /* display header if we're here for the first time */ if (current == 1) { set_progress_header(gettext("Mounting ZFS filesystems")); } else if (current != total && last_progress_time + MOUNT_TIME >= now) { /* too soon to report again */ return; } last_progress_time = now; (void) sprintf(info, "(%d/%d)", current, total); if (current == total) finish_progress(info); else update_progress(info); } /* * zfs_foreach_mountpoint() callback that mounts or shares one filesystem and * updates the progress meter. */ static int share_mount_one_cb(zfs_handle_t *zhp, void *arg) { share_mount_state_t *sms = arg; int ret; ret = share_mount_one(zhp, sms->sm_op, sms->sm_flags, sms->sm_proto, B_FALSE, sms->sm_options); pthread_mutex_lock(&sms->sm_lock); if (ret != 0) sms->sm_status = ret; sms->sm_done++; if (sms->sm_verbose) report_mount_progress(sms->sm_done, sms->sm_total); pthread_mutex_unlock(&sms->sm_lock); return (ret); } static void append_options(char *mntopts, char *newopts) { int len = strlen(mntopts); /* original length plus new string to append plus 1 for the comma */ if (len + 1 + strlen(newopts) >= MNT_LINE_MAX) { (void) fprintf(stderr, gettext("the opts argument for " "'%s' option is too long (more than %d chars)\n"), "-o", MNT_LINE_MAX); usage(B_FALSE); } if (*mntopts) mntopts[len++] = ','; (void) strcpy(&mntopts[len], newopts); } static int share_mount(int op, int argc, char **argv) { int do_all = 0; boolean_t verbose = B_FALSE; int c, ret = 0; char *options = NULL; int flags = 0; /* check options */ while ((c = getopt(argc, argv, op == OP_MOUNT ? ":alvo:Of" : "al")) != -1) { switch (c) { case 'a': do_all = 1; break; case 'v': verbose = B_TRUE; break; case 'l': flags |= MS_CRYPT; break; case 'o': if (*optarg == '\0') { (void) fprintf(stderr, gettext("empty mount " "options (-o) specified\n")); usage(B_FALSE); } if (options == NULL) options = safe_malloc(MNT_LINE_MAX + 1); /* option validation is done later */ append_options(options, optarg); break; case 'O': flags |= MS_OVERLAY; break; case 'f': flags |= MS_FORCE; break; case ':': (void) fprintf(stderr, gettext("missing argument for " "'%c' option\n"), optopt); usage(B_FALSE); break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; /* check number of arguments */ if (do_all) { char *protocol = NULL; if (op == OP_SHARE && argc > 0) { if (strcmp(argv[0], "nfs") != 0 && strcmp(argv[0], "smb") != 0) { (void) fprintf(stderr, gettext("share type " "must be 'nfs' or 'smb'\n")); usage(B_FALSE); } protocol = argv[0]; argc--; argv++; } if (argc != 0) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } start_progress_timer(); get_all_cb_t cb = { 0 }; get_all_datasets(&cb, verbose); if (cb.cb_used == 0) { if (options != NULL) free(options); return (0); } share_mount_state_t share_mount_state = { 0 }; share_mount_state.sm_op = op; share_mount_state.sm_verbose = verbose; share_mount_state.sm_flags = flags; share_mount_state.sm_options = options; share_mount_state.sm_proto = protocol; share_mount_state.sm_total = cb.cb_used; pthread_mutex_init(&share_mount_state.sm_lock, NULL); /* * libshare isn't mt-safe, so only do the operation in parallel * if we're mounting. Additionally, the key-loading option must * be serialized so that we can prompt the user for their keys * in a consistent manner. */ zfs_foreach_mountpoint(g_zfs, cb.cb_handles, cb.cb_used, share_mount_one_cb, &share_mount_state, op == OP_MOUNT && !(flags & MS_CRYPT)); zfs_commit_all_shares(); ret = share_mount_state.sm_status; for (int i = 0; i < cb.cb_used; i++) zfs_close(cb.cb_handles[i]); free(cb.cb_handles); } else if (argc == 0) { struct mnttab entry; if ((op == OP_SHARE) || (options != NULL)) { (void) fprintf(stderr, gettext("missing filesystem " "argument (specify -a for all)\n")); usage(B_FALSE); } /* * When mount is given no arguments, go through * /proc/self/mounts and display any active ZFS mounts. * We hide any snapshots, since they are controlled * automatically. */ /* Reopen MNTTAB to prevent reading stale data from open file */ if (freopen(MNTTAB, "r", mnttab_file) == NULL) { if (options != NULL) free(options); return (ENOENT); } while (getmntent(mnttab_file, &entry) == 0) { if (strcmp(entry.mnt_fstype, MNTTYPE_ZFS) != 0 || strchr(entry.mnt_special, '@') != NULL) continue; (void) printf("%-30s %s\n", entry.mnt_special, entry.mnt_mountp); } } else { zfs_handle_t *zhp; if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } if ((zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM)) == NULL) { ret = 1; } else { ret = share_mount_one(zhp, op, flags, NULL, B_TRUE, options); zfs_commit_all_shares(); zfs_close(zhp); } } if (options != NULL) free(options); return (ret); } /* * zfs mount -a [nfs] * zfs mount filesystem * * Mount all filesystems, or mount the given filesystem. */ static int zfs_do_mount(int argc, char **argv) { return (share_mount(OP_MOUNT, argc, argv)); } /* * zfs share -a [nfs | smb] * zfs share filesystem * * Share all filesystems, or share the given filesystem. */ static int zfs_do_share(int argc, char **argv) { return (share_mount(OP_SHARE, argc, argv)); } typedef struct unshare_unmount_node { zfs_handle_t *un_zhp; char *un_mountp; uu_avl_node_t un_avlnode; } unshare_unmount_node_t; /* ARGSUSED */ static int unshare_unmount_compare(const void *larg, const void *rarg, void *unused) { const unshare_unmount_node_t *l = larg; const unshare_unmount_node_t *r = rarg; return (strcmp(l->un_mountp, r->un_mountp)); } /* * Convenience routine used by zfs_do_umount() and manual_unmount(). Given an * absolute path, find the entry /proc/self/mounts, verify that it's a * ZFS filesystem, and unmount it appropriately. */ static int unshare_unmount_path(int op, char *path, int flags, boolean_t is_manual) { zfs_handle_t *zhp; int ret = 0; struct stat64 statbuf; struct extmnttab entry; const char *cmdname = (op == OP_SHARE) ? "unshare" : "unmount"; ino_t path_inode; /* * Search for the given (major,minor) pair in the mount table. */ /* Reopen MNTTAB to prevent reading stale data from open file */ if (freopen(MNTTAB, "r", mnttab_file) == NULL) return (ENOENT); if (getextmntent(path, &entry, &statbuf) != 0) { if (op == OP_SHARE) { (void) fprintf(stderr, gettext("cannot %s '%s': not " "currently mounted\n"), cmdname, path); return (1); } (void) fprintf(stderr, gettext("warning: %s not in" "/proc/self/mounts\n"), path); if ((ret = umount2(path, flags)) != 0) (void) fprintf(stderr, gettext("%s: %s\n"), path, strerror(errno)); return (ret != 0); } path_inode = statbuf.st_ino; if (strcmp(entry.mnt_fstype, MNTTYPE_ZFS) != 0) { (void) fprintf(stderr, gettext("cannot %s '%s': not a ZFS " "filesystem\n"), cmdname, path); return (1); } if ((zhp = zfs_open(g_zfs, entry.mnt_special, ZFS_TYPE_FILESYSTEM)) == NULL) return (1); ret = 1; if (stat64(entry.mnt_mountp, &statbuf) != 0) { (void) fprintf(stderr, gettext("cannot %s '%s': %s\n"), cmdname, path, strerror(errno)); goto out; } else if (statbuf.st_ino != path_inode) { (void) fprintf(stderr, gettext("cannot " "%s '%s': not a mountpoint\n"), cmdname, path); goto out; } if (op == OP_SHARE) { char nfs_mnt_prop[ZFS_MAXPROPLEN]; char smbshare_prop[ZFS_MAXPROPLEN]; verify(zfs_prop_get(zhp, ZFS_PROP_SHARENFS, nfs_mnt_prop, sizeof (nfs_mnt_prop), NULL, NULL, 0, B_FALSE) == 0); verify(zfs_prop_get(zhp, ZFS_PROP_SHARESMB, smbshare_prop, sizeof (smbshare_prop), NULL, NULL, 0, B_FALSE) == 0); if (strcmp(nfs_mnt_prop, "off") == 0 && strcmp(smbshare_prop, "off") == 0) { (void) fprintf(stderr, gettext("cannot unshare " "'%s': legacy share\n"), path); (void) fprintf(stderr, gettext("use exportfs(8) " "or smbcontrol(1) to unshare this filesystem\n")); } else if (!zfs_is_shared(zhp)) { (void) fprintf(stderr, gettext("cannot unshare '%s': " "not currently shared\n"), path); } else { ret = zfs_unshareall_bypath(zhp, path); zfs_commit_all_shares(); } } else { char mtpt_prop[ZFS_MAXPROPLEN]; verify(zfs_prop_get(zhp, ZFS_PROP_MOUNTPOINT, mtpt_prop, sizeof (mtpt_prop), NULL, NULL, 0, B_FALSE) == 0); if (is_manual) { ret = zfs_unmount(zhp, NULL, flags); } else if (strcmp(mtpt_prop, "legacy") == 0) { (void) fprintf(stderr, gettext("cannot unmount " "'%s': legacy mountpoint\n"), zfs_get_name(zhp)); (void) fprintf(stderr, gettext("use umount(8) " "to unmount this filesystem\n")); } else { ret = zfs_unmountall(zhp, flags); } } out: zfs_close(zhp); return (ret != 0); } /* * Generic callback for unsharing or unmounting a filesystem. */ static int unshare_unmount(int op, int argc, char **argv) { int do_all = 0; int flags = 0; int ret = 0; int c; zfs_handle_t *zhp; char nfs_mnt_prop[ZFS_MAXPROPLEN]; char sharesmb[ZFS_MAXPROPLEN]; /* check options */ while ((c = getopt(argc, argv, op == OP_SHARE ? ":a" : "afu")) != -1) { switch (c) { case 'a': do_all = 1; break; case 'f': flags |= MS_FORCE; break; case 'u': flags |= MS_CRYPT; break; case ':': (void) fprintf(stderr, gettext("missing argument for " "'%c' option\n"), optopt); usage(B_FALSE); break; case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; if (do_all) { /* * We could make use of zfs_for_each() to walk all datasets in * the system, but this would be very inefficient, especially * since we would have to linearly search /proc/self/mounts for * each one. Instead, do one pass through /proc/self/mounts * looking for zfs entries and call zfs_unmount() for each one. * * Things get a little tricky if the administrator has created * mountpoints beneath other ZFS filesystems. In this case, we * have to unmount the deepest filesystems first. To accomplish * this, we place all the mountpoints in an AVL tree sorted by * the special type (dataset name), and walk the result in * reverse to make sure to get any snapshots first. */ struct mnttab entry; uu_avl_pool_t *pool; uu_avl_t *tree = NULL; unshare_unmount_node_t *node; uu_avl_index_t idx; uu_avl_walk_t *walk; char *protocol = NULL; if (op == OP_SHARE && argc > 0) { if (strcmp(argv[0], "nfs") != 0 && strcmp(argv[0], "smb") != 0) { (void) fprintf(stderr, gettext("share type " "must be 'nfs' or 'smb'\n")); usage(B_FALSE); } protocol = argv[0]; argc--; argv++; } if (argc != 0) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } if (((pool = uu_avl_pool_create("unmount_pool", sizeof (unshare_unmount_node_t), offsetof(unshare_unmount_node_t, un_avlnode), unshare_unmount_compare, UU_DEFAULT)) == NULL) || ((tree = uu_avl_create(pool, NULL, UU_DEFAULT)) == NULL)) nomem(); /* Reopen MNTTAB to prevent reading stale data from open file */ if (freopen(MNTTAB, "r", mnttab_file) == NULL) return (ENOENT); while (getmntent(mnttab_file, &entry) == 0) { /* ignore non-ZFS entries */ if (strcmp(entry.mnt_fstype, MNTTYPE_ZFS) != 0) continue; /* ignore snapshots */ if (strchr(entry.mnt_special, '@') != NULL) continue; if ((zhp = zfs_open(g_zfs, entry.mnt_special, ZFS_TYPE_FILESYSTEM)) == NULL) { ret = 1; continue; } /* * Ignore datasets that are excluded/restricted by * parent pool name. */ if (zpool_skip_pool(zfs_get_pool_name(zhp))) { zfs_close(zhp); continue; } switch (op) { case OP_SHARE: verify(zfs_prop_get(zhp, ZFS_PROP_SHARENFS, nfs_mnt_prop, sizeof (nfs_mnt_prop), NULL, NULL, 0, B_FALSE) == 0); if (strcmp(nfs_mnt_prop, "off") != 0) break; verify(zfs_prop_get(zhp, ZFS_PROP_SHARESMB, nfs_mnt_prop, sizeof (nfs_mnt_prop), NULL, NULL, 0, B_FALSE) == 0); if (strcmp(nfs_mnt_prop, "off") == 0) continue; break; case OP_MOUNT: /* Ignore legacy mounts */ verify(zfs_prop_get(zhp, ZFS_PROP_MOUNTPOINT, nfs_mnt_prop, sizeof (nfs_mnt_prop), NULL, NULL, 0, B_FALSE) == 0); if (strcmp(nfs_mnt_prop, "legacy") == 0) continue; /* Ignore canmount=noauto mounts */ if (zfs_prop_get_int(zhp, ZFS_PROP_CANMOUNT) == ZFS_CANMOUNT_NOAUTO) continue; default: break; } node = safe_malloc(sizeof (unshare_unmount_node_t)); node->un_zhp = zhp; node->un_mountp = safe_strdup(entry.mnt_mountp); uu_avl_node_init(node, &node->un_avlnode, pool); if (uu_avl_find(tree, node, NULL, &idx) == NULL) { uu_avl_insert(tree, node, idx); } else { zfs_close(node->un_zhp); free(node->un_mountp); free(node); } } /* * Walk the AVL tree in reverse, unmounting each filesystem and * removing it from the AVL tree in the process. */ if ((walk = uu_avl_walk_start(tree, UU_WALK_REVERSE | UU_WALK_ROBUST)) == NULL) nomem(); while ((node = uu_avl_walk_next(walk)) != NULL) { const char *mntarg = NULL; uu_avl_remove(tree, node); switch (op) { case OP_SHARE: if (zfs_unshareall_bytype(node->un_zhp, node->un_mountp, protocol) != 0) ret = 1; break; case OP_MOUNT: if (zfs_unmount(node->un_zhp, mntarg, flags) != 0) ret = 1; break; } zfs_close(node->un_zhp); free(node->un_mountp); free(node); } if (op == OP_SHARE) zfs_commit_shares(protocol); uu_avl_walk_end(walk); uu_avl_destroy(tree); uu_avl_pool_destroy(pool); } else { if (argc != 1) { if (argc == 0) (void) fprintf(stderr, gettext("missing filesystem argument\n")); else (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } /* * We have an argument, but it may be a full path or a ZFS * filesystem. Pass full paths off to unmount_path() (shared by * manual_unmount), otherwise open the filesystem and pass to * zfs_unmount(). */ if (argv[0][0] == '/') return (unshare_unmount_path(op, argv[0], flags, B_FALSE)); if ((zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM)) == NULL) return (1); verify(zfs_prop_get(zhp, op == OP_SHARE ? ZFS_PROP_SHARENFS : ZFS_PROP_MOUNTPOINT, nfs_mnt_prop, sizeof (nfs_mnt_prop), NULL, NULL, 0, B_FALSE) == 0); switch (op) { case OP_SHARE: verify(zfs_prop_get(zhp, ZFS_PROP_SHARENFS, nfs_mnt_prop, sizeof (nfs_mnt_prop), NULL, NULL, 0, B_FALSE) == 0); verify(zfs_prop_get(zhp, ZFS_PROP_SHARESMB, sharesmb, sizeof (sharesmb), NULL, NULL, 0, B_FALSE) == 0); if (strcmp(nfs_mnt_prop, "off") == 0 && strcmp(sharesmb, "off") == 0) { (void) fprintf(stderr, gettext("cannot " "unshare '%s': legacy share\n"), zfs_get_name(zhp)); (void) fprintf(stderr, gettext("use " "unshare(1M) to unshare this " "filesystem\n")); ret = 1; } else if (!zfs_is_shared(zhp)) { (void) fprintf(stderr, gettext("cannot " "unshare '%s': not currently " "shared\n"), zfs_get_name(zhp)); ret = 1; } else if (zfs_unshareall(zhp) != 0) { ret = 1; } break; case OP_MOUNT: if (strcmp(nfs_mnt_prop, "legacy") == 0) { (void) fprintf(stderr, gettext("cannot " "unmount '%s': legacy " "mountpoint\n"), zfs_get_name(zhp)); (void) fprintf(stderr, gettext("use " "umount(1M) to unmount this " "filesystem\n")); ret = 1; } else if (!zfs_is_mounted(zhp, NULL)) { (void) fprintf(stderr, gettext("cannot " "unmount '%s': not currently " "mounted\n"), zfs_get_name(zhp)); ret = 1; } else if (zfs_unmountall(zhp, flags) != 0) { ret = 1; } break; } zfs_close(zhp); } return (ret); } /* * zfs unmount [-fu] -a * zfs unmount [-fu] filesystem * * Unmount all filesystems, or a specific ZFS filesystem. */ static int zfs_do_unmount(int argc, char **argv) { return (unshare_unmount(OP_MOUNT, argc, argv)); } /* * zfs unshare -a * zfs unshare filesystem * * Unshare all filesystems, or a specific ZFS filesystem. */ static int zfs_do_unshare(int argc, char **argv) { return (unshare_unmount(OP_SHARE, argc, argv)); } static int find_command_idx(char *command, int *idx) { int i; for (i = 0; i < NCOMMAND; i++) { if (command_table[i].name == NULL) continue; if (strcmp(command, command_table[i].name) == 0) { *idx = i; return (0); } } return (1); } static int zfs_do_diff(int argc, char **argv) { zfs_handle_t *zhp; int flags = 0; char *tosnap = NULL; char *fromsnap = NULL; char *atp, *copy; int err = 0; int c; struct sigaction sa; while ((c = getopt(argc, argv, "FHt")) != -1) { switch (c) { case 'F': flags |= ZFS_DIFF_CLASSIFY; break; case 'H': flags |= ZFS_DIFF_PARSEABLE; break; case 't': flags |= ZFS_DIFF_TIMESTAMP; break; default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; if (argc < 1) { (void) fprintf(stderr, gettext("must provide at least one snapshot name\n")); usage(B_FALSE); } if (argc > 2) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } fromsnap = argv[0]; tosnap = (argc == 2) ? argv[1] : NULL; copy = NULL; if (*fromsnap != '@') copy = strdup(fromsnap); else if (tosnap) copy = strdup(tosnap); if (copy == NULL) usage(B_FALSE); if ((atp = strchr(copy, '@')) != NULL) *atp = '\0'; if ((zhp = zfs_open(g_zfs, copy, ZFS_TYPE_FILESYSTEM)) == NULL) { free(copy); return (1); } free(copy); /* * Ignore SIGPIPE so that the library can give us * information on any failure */ if (sigemptyset(&sa.sa_mask) == -1) { err = errno; goto out; } sa.sa_flags = 0; sa.sa_handler = SIG_IGN; if (sigaction(SIGPIPE, &sa, NULL) == -1) { err = errno; goto out; } err = zfs_show_diffs(zhp, STDOUT_FILENO, fromsnap, tosnap, flags); out: zfs_close(zhp); return (err != 0); } /* * zfs bookmark | * * Creates a bookmark with the given name from the source snapshot * or creates a copy of an existing source bookmark. */ static int zfs_do_bookmark(int argc, char **argv) { char *source, *bookname; char expbuf[ZFS_MAX_DATASET_NAME_LEN]; int source_type; nvlist_t *nvl; int ret = 0; int c; /* check options */ while ((c = getopt(argc, argv, "")) != -1) { switch (c) { case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); goto usage; } } argc -= optind; argv += optind; /* check number of arguments */ if (argc < 1) { (void) fprintf(stderr, gettext("missing source argument\n")); goto usage; } if (argc < 2) { (void) fprintf(stderr, gettext("missing bookmark argument\n")); goto usage; } source = argv[0]; bookname = argv[1]; if (strchr(source, '@') == NULL && strchr(source, '#') == NULL) { (void) fprintf(stderr, gettext("invalid source name '%s': " "must contain a '@' or '#'\n"), source); goto usage; } if (strchr(bookname, '#') == NULL) { (void) fprintf(stderr, gettext("invalid bookmark name '%s': " "must contain a '#'\n"), bookname); goto usage; } /* * expand source or bookname to full path: * one of them may be specified as short name */ { char **expand; char *source_short, *bookname_short; source_short = strpbrk(source, "@#"); bookname_short = strpbrk(bookname, "#"); if (source_short == source && bookname_short == bookname) { (void) fprintf(stderr, gettext( "either source or bookmark must be specified as " "full dataset paths")); goto usage; } else if (source_short != source && bookname_short != bookname) { expand = NULL; } else if (source_short != source) { strlcpy(expbuf, source, sizeof (expbuf)); expand = &bookname; } else if (bookname_short != bookname) { strlcpy(expbuf, bookname, sizeof (expbuf)); expand = &source; } else { abort(); } if (expand != NULL) { *strpbrk(expbuf, "@#") = '\0'; /* dataset name in buf */ (void) strlcat(expbuf, *expand, sizeof (expbuf)); *expand = expbuf; } } /* determine source type */ switch (*strpbrk(source, "@#")) { case '@': source_type = ZFS_TYPE_SNAPSHOT; break; case '#': source_type = ZFS_TYPE_BOOKMARK; break; default: abort(); } /* test the source exists */ zfs_handle_t *zhp; zhp = zfs_open(g_zfs, source, source_type); if (zhp == NULL) goto usage; zfs_close(zhp); nvl = fnvlist_alloc(); fnvlist_add_string(nvl, bookname, source); ret = lzc_bookmark(nvl, NULL); fnvlist_free(nvl); if (ret != 0) { const char *err_msg = NULL; char errbuf[1024]; (void) snprintf(errbuf, sizeof (errbuf), dgettext(TEXT_DOMAIN, "cannot create bookmark '%s'"), bookname); switch (ret) { case EXDEV: err_msg = "bookmark is in a different pool"; break; case ZFS_ERR_BOOKMARK_SOURCE_NOT_ANCESTOR: err_msg = "source is not an ancestor of the " "new bookmark's dataset"; break; case EEXIST: err_msg = "bookmark exists"; break; case EINVAL: err_msg = "invalid argument"; break; case ENOTSUP: err_msg = "bookmark feature not enabled"; break; case ENOSPC: err_msg = "out of space"; break; case ENOENT: err_msg = "dataset does not exist"; break; default: (void) zfs_standard_error(g_zfs, ret, errbuf); break; } if (err_msg != NULL) { (void) fprintf(stderr, "%s: %s\n", errbuf, dgettext(TEXT_DOMAIN, err_msg)); } } return (ret != 0); usage: usage(B_FALSE); return (-1); } static int zfs_do_channel_program(int argc, char **argv) { int ret, fd, c; char *progbuf, *filename, *poolname; size_t progsize, progread; nvlist_t *outnvl = NULL; uint64_t instrlimit = ZCP_DEFAULT_INSTRLIMIT; uint64_t memlimit = ZCP_DEFAULT_MEMLIMIT; boolean_t sync_flag = B_TRUE, json_output = B_FALSE; zpool_handle_t *zhp; /* check options */ while ((c = getopt(argc, argv, "nt:m:j")) != -1) { switch (c) { case 't': case 'm': { uint64_t arg; char *endp; errno = 0; arg = strtoull(optarg, &endp, 0); if (errno != 0 || *endp != '\0') { (void) fprintf(stderr, gettext( "invalid argument " "'%s': expected integer\n"), optarg); goto usage; } if (c == 't') { instrlimit = arg; } else { ASSERT3U(c, ==, 'm'); memlimit = arg; } break; } case 'n': { sync_flag = B_FALSE; break; } case 'j': { json_output = B_TRUE; break; } case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); goto usage; } } argc -= optind; argv += optind; if (argc < 2) { (void) fprintf(stderr, gettext("invalid number of arguments\n")); goto usage; } poolname = argv[0]; filename = argv[1]; if (strcmp(filename, "-") == 0) { fd = 0; filename = "standard input"; } else if ((fd = open(filename, O_RDONLY)) < 0) { (void) fprintf(stderr, gettext("cannot open '%s': %s\n"), filename, strerror(errno)); return (1); } if ((zhp = zpool_open(g_zfs, poolname)) == NULL) { (void) fprintf(stderr, gettext("cannot open pool '%s'\n"), poolname); if (fd != 0) (void) close(fd); return (1); } zpool_close(zhp); /* * Read in the channel program, expanding the program buffer as * necessary. */ progread = 0; progsize = 1024; progbuf = safe_malloc(progsize); do { ret = read(fd, progbuf + progread, progsize - progread); progread += ret; if (progread == progsize && ret > 0) { progsize *= 2; progbuf = safe_realloc(progbuf, progsize); } } while (ret > 0); if (fd != 0) (void) close(fd); if (ret < 0) { free(progbuf); (void) fprintf(stderr, gettext("cannot read '%s': %s\n"), filename, strerror(errno)); return (1); } progbuf[progread] = '\0'; /* * Any remaining arguments are passed as arguments to the lua script as * a string array: * { * "argv" -> [ "arg 1", ... "arg n" ], * } */ nvlist_t *argnvl = fnvlist_alloc(); fnvlist_add_string_array(argnvl, ZCP_ARG_CLIARGV, argv + 2, argc - 2); if (sync_flag) { ret = lzc_channel_program(poolname, progbuf, instrlimit, memlimit, argnvl, &outnvl); } else { ret = lzc_channel_program_nosync(poolname, progbuf, instrlimit, memlimit, argnvl, &outnvl); } if (ret != 0) { /* * On error, report the error message handed back by lua if one * exists. Otherwise, generate an appropriate error message, * falling back on strerror() for an unexpected return code. */ char *errstring = NULL; const char *msg = gettext("Channel program execution failed"); uint64_t instructions = 0; if (outnvl != NULL && nvlist_exists(outnvl, ZCP_RET_ERROR)) { (void) nvlist_lookup_string(outnvl, ZCP_RET_ERROR, &errstring); if (errstring == NULL) errstring = strerror(ret); if (ret == ETIME) { (void) nvlist_lookup_uint64(outnvl, ZCP_ARG_INSTRLIMIT, &instructions); } } else { switch (ret) { case EINVAL: errstring = "Invalid instruction or memory limit."; break; case ENOMEM: errstring = "Return value too large."; break; case ENOSPC: errstring = "Memory limit exhausted."; break; case ETIME: errstring = "Timed out."; break; case EPERM: errstring = "Permission denied. Channel " "programs must be run as root."; break; default: (void) zfs_standard_error(g_zfs, ret, msg); } } if (errstring != NULL) (void) fprintf(stderr, "%s:\n%s\n", msg, errstring); if (ret == ETIME && instructions != 0) (void) fprintf(stderr, gettext("%llu Lua instructions\n"), (u_longlong_t)instructions); } else { if (json_output) { (void) nvlist_print_json(stdout, outnvl); } else if (nvlist_empty(outnvl)) { (void) fprintf(stdout, gettext("Channel program fully " "executed and did not produce output.\n")); } else { (void) fprintf(stdout, gettext("Channel program fully " "executed and produced output:\n")); dump_nvlist(outnvl, 4); } } free(progbuf); fnvlist_free(outnvl); fnvlist_free(argnvl); return (ret != 0); usage: usage(B_FALSE); return (-1); } typedef struct loadkey_cbdata { boolean_t cb_loadkey; boolean_t cb_recursive; boolean_t cb_noop; char *cb_keylocation; uint64_t cb_numfailed; uint64_t cb_numattempted; } loadkey_cbdata_t; static int load_key_callback(zfs_handle_t *zhp, void *data) { int ret; boolean_t is_encroot; loadkey_cbdata_t *cb = data; uint64_t keystatus = zfs_prop_get_int(zhp, ZFS_PROP_KEYSTATUS); /* * If we are working recursively, we want to skip loading / unloading * keys for non-encryption roots and datasets whose keys are already * in the desired end-state. */ if (cb->cb_recursive) { ret = zfs_crypto_get_encryption_root(zhp, &is_encroot, NULL); if (ret != 0) return (ret); if (!is_encroot) return (0); if ((cb->cb_loadkey && keystatus == ZFS_KEYSTATUS_AVAILABLE) || (!cb->cb_loadkey && keystatus == ZFS_KEYSTATUS_UNAVAILABLE)) return (0); } cb->cb_numattempted++; if (cb->cb_loadkey) ret = zfs_crypto_load_key(zhp, cb->cb_noop, cb->cb_keylocation); else ret = zfs_crypto_unload_key(zhp); if (ret != 0) { cb->cb_numfailed++; return (ret); } return (0); } static int load_unload_keys(int argc, char **argv, boolean_t loadkey) { int c, ret = 0, flags = 0; boolean_t do_all = B_FALSE; loadkey_cbdata_t cb = { 0 }; cb.cb_loadkey = loadkey; while ((c = getopt(argc, argv, "anrL:")) != -1) { /* noop and alternate keylocations only apply to zfs load-key */ if (loadkey) { switch (c) { case 'n': cb.cb_noop = B_TRUE; continue; case 'L': cb.cb_keylocation = optarg; continue; default: break; } } switch (c) { case 'a': do_all = B_TRUE; cb.cb_recursive = B_TRUE; break; case 'r': flags |= ZFS_ITER_RECURSE; cb.cb_recursive = B_TRUE; break; default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argc -= optind; argv += optind; if (!do_all && argc == 0) { (void) fprintf(stderr, gettext("Missing dataset argument or -a option\n")); usage(B_FALSE); } if (do_all && argc != 0) { (void) fprintf(stderr, gettext("Cannot specify dataset with -a option\n")); usage(B_FALSE); } if (cb.cb_recursive && cb.cb_keylocation != NULL && strcmp(cb.cb_keylocation, "prompt") != 0) { (void) fprintf(stderr, gettext("alternate keylocation may only " "be 'prompt' with -r or -a\n")); usage(B_FALSE); } ret = zfs_for_each(argc, argv, flags, ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME, NULL, NULL, 0, load_key_callback, &cb); if (cb.cb_noop || (cb.cb_recursive && cb.cb_numattempted != 0)) { (void) printf(gettext("%llu / %llu key(s) successfully %s\n"), (u_longlong_t)(cb.cb_numattempted - cb.cb_numfailed), (u_longlong_t)cb.cb_numattempted, loadkey ? (cb.cb_noop ? "verified" : "loaded") : "unloaded"); } if (cb.cb_numfailed != 0) ret = -1; return (ret); } static int zfs_do_load_key(int argc, char **argv) { return (load_unload_keys(argc, argv, B_TRUE)); } static int zfs_do_unload_key(int argc, char **argv) { return (load_unload_keys(argc, argv, B_FALSE)); } static int zfs_do_change_key(int argc, char **argv) { int c, ret; uint64_t keystatus; boolean_t loadkey = B_FALSE, inheritkey = B_FALSE; zfs_handle_t *zhp = NULL; nvlist_t *props = fnvlist_alloc(); while ((c = getopt(argc, argv, "lio:")) != -1) { switch (c) { case 'l': loadkey = B_TRUE; break; case 'i': inheritkey = B_TRUE; break; case 'o': if (!parseprop(props, optarg)) { nvlist_free(props); return (1); } break; default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } if (inheritkey && !nvlist_empty(props)) { (void) fprintf(stderr, gettext("Properties not allowed for inheriting\n")); usage(B_FALSE); } argc -= optind; argv += optind; if (argc < 1) { (void) fprintf(stderr, gettext("Missing dataset argument\n")); usage(B_FALSE); } if (argc > 1) { (void) fprintf(stderr, gettext("Too many arguments\n")); usage(B_FALSE); } zhp = zfs_open(g_zfs, argv[argc - 1], ZFS_TYPE_FILESYSTEM | ZFS_TYPE_VOLUME); if (zhp == NULL) usage(B_FALSE); if (loadkey) { keystatus = zfs_prop_get_int(zhp, ZFS_PROP_KEYSTATUS); if (keystatus != ZFS_KEYSTATUS_AVAILABLE) { ret = zfs_crypto_load_key(zhp, B_FALSE, NULL); if (ret != 0) { nvlist_free(props); zfs_close(zhp); return (-1); } } /* refresh the properties so the new keystatus is visible */ zfs_refresh_properties(zhp); } ret = zfs_crypto_rewrap(zhp, props, inheritkey); if (ret != 0) { nvlist_free(props); zfs_close(zhp); return (-1); } nvlist_free(props); zfs_close(zhp); return (0); } /* * 1) zfs project [-d|-r] * List project ID and inherit flag of file(s) or directories. * -d: List the directory itself, not its children. * -r: List subdirectories recursively. * * 2) zfs project -C [-k] [-r] * Clear project inherit flag and/or ID on the file(s) or directories. * -k: Keep the project ID unchanged. If not specified, the project ID * will be reset as zero. * -r: Clear on subdirectories recursively. * * 3) zfs project -c [-0] [-d|-r] [-p id] * Check project ID and inherit flag on the file(s) or directories, * report the outliers. * -0: Print file name followed by a NUL instead of newline. * -d: Check the directory itself, not its children. * -p: Specify the referenced ID for comparing with the target file(s) * or directories' project IDs. If not specified, the target (top) * directory's project ID will be used as the referenced one. * -r: Check subdirectories recursively. * * 4) zfs project [-p id] [-r] [-s] * Set project ID and/or inherit flag on the file(s) or directories. * -p: Set the project ID as the given id. * -r: Set on subdirectories recursively. If not specify "-p" option, * it will use top-level directory's project ID as the given id, * then set both project ID and inherit flag on all descendants * of the top-level directory. * -s: Set project inherit flag. */ static int zfs_do_project(int argc, char **argv) { zfs_project_control_t zpc = { .zpc_expected_projid = ZFS_INVALID_PROJID, .zpc_op = ZFS_PROJECT_OP_DEFAULT, .zpc_dironly = B_FALSE, .zpc_keep_projid = B_FALSE, .zpc_newline = B_TRUE, .zpc_recursive = B_FALSE, .zpc_set_flag = B_FALSE, }; int ret = 0, c; if (argc < 2) usage(B_FALSE); while ((c = getopt(argc, argv, "0Ccdkp:rs")) != -1) { switch (c) { case '0': zpc.zpc_newline = B_FALSE; break; case 'C': if (zpc.zpc_op != ZFS_PROJECT_OP_DEFAULT) { (void) fprintf(stderr, gettext("cannot " "specify '-C' '-c' '-s' together\n")); usage(B_FALSE); } zpc.zpc_op = ZFS_PROJECT_OP_CLEAR; break; case 'c': if (zpc.zpc_op != ZFS_PROJECT_OP_DEFAULT) { (void) fprintf(stderr, gettext("cannot " "specify '-C' '-c' '-s' together\n")); usage(B_FALSE); } zpc.zpc_op = ZFS_PROJECT_OP_CHECK; break; case 'd': zpc.zpc_dironly = B_TRUE; /* overwrite "-r" option */ zpc.zpc_recursive = B_FALSE; break; case 'k': zpc.zpc_keep_projid = B_TRUE; break; case 'p': { char *endptr; errno = 0; zpc.zpc_expected_projid = strtoull(optarg, &endptr, 0); if (errno != 0 || *endptr != '\0') { (void) fprintf(stderr, gettext("project ID must be less than " "%u\n"), UINT32_MAX); usage(B_FALSE); } if (zpc.zpc_expected_projid >= UINT32_MAX) { (void) fprintf(stderr, gettext("invalid project ID\n")); usage(B_FALSE); } break; } case 'r': zpc.zpc_recursive = B_TRUE; /* overwrite "-d" option */ zpc.zpc_dironly = B_FALSE; break; case 's': if (zpc.zpc_op != ZFS_PROJECT_OP_DEFAULT) { (void) fprintf(stderr, gettext("cannot " "specify '-C' '-c' '-s' together\n")); usage(B_FALSE); } zpc.zpc_set_flag = B_TRUE; zpc.zpc_op = ZFS_PROJECT_OP_SET; break; default: (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } if (zpc.zpc_op == ZFS_PROJECT_OP_DEFAULT) { if (zpc.zpc_expected_projid != ZFS_INVALID_PROJID) zpc.zpc_op = ZFS_PROJECT_OP_SET; else zpc.zpc_op = ZFS_PROJECT_OP_LIST; } switch (zpc.zpc_op) { case ZFS_PROJECT_OP_LIST: if (zpc.zpc_keep_projid) { (void) fprintf(stderr, gettext("'-k' is only valid together with '-C'\n")); usage(B_FALSE); } if (!zpc.zpc_newline) { (void) fprintf(stderr, gettext("'-0' is only valid together with '-c'\n")); usage(B_FALSE); } break; case ZFS_PROJECT_OP_CHECK: if (zpc.zpc_keep_projid) { (void) fprintf(stderr, gettext("'-k' is only valid together with '-C'\n")); usage(B_FALSE); } break; case ZFS_PROJECT_OP_CLEAR: if (zpc.zpc_dironly) { (void) fprintf(stderr, gettext("'-d' is useless together with '-C'\n")); usage(B_FALSE); } if (!zpc.zpc_newline) { (void) fprintf(stderr, gettext("'-0' is only valid together with '-c'\n")); usage(B_FALSE); } if (zpc.zpc_expected_projid != ZFS_INVALID_PROJID) { (void) fprintf(stderr, gettext("'-p' is useless together with '-C'\n")); usage(B_FALSE); } break; case ZFS_PROJECT_OP_SET: if (zpc.zpc_dironly) { (void) fprintf(stderr, gettext("'-d' is useless for set project ID and/or " "inherit flag\n")); usage(B_FALSE); } if (zpc.zpc_keep_projid) { (void) fprintf(stderr, gettext("'-k' is only valid together with '-C'\n")); usage(B_FALSE); } if (!zpc.zpc_newline) { (void) fprintf(stderr, gettext("'-0' is only valid together with '-c'\n")); usage(B_FALSE); } break; default: ASSERT(0); break; } argv += optind; argc -= optind; if (argc == 0) { (void) fprintf(stderr, gettext("missing file or directory target(s)\n")); usage(B_FALSE); } for (int i = 0; i < argc; i++) { int err; err = zfs_project_handle(argv[i], &zpc); if (err && !ret) ret = err; } return (ret); } static int zfs_do_wait(int argc, char **argv) { boolean_t enabled[ZFS_WAIT_NUM_ACTIVITIES]; int error, i; char c; /* By default, wait for all types of activity. */ for (i = 0; i < ZFS_WAIT_NUM_ACTIVITIES; i++) enabled[i] = B_TRUE; while ((c = getopt(argc, argv, "t:")) != -1) { switch (c) { case 't': { static char *col_subopts[] = { "deleteq", NULL }; char *value; /* Reset activities array */ bzero(&enabled, sizeof (enabled)); while (*optarg != '\0') { int activity = getsubopt(&optarg, col_subopts, &value); if (activity < 0) { (void) fprintf(stderr, gettext("invalid activity '%s'\n"), value); usage(B_FALSE); } enabled[activity] = B_TRUE; } break; } case '?': (void) fprintf(stderr, gettext("invalid option '%c'\n"), optopt); usage(B_FALSE); } } argv += optind; argc -= optind; if (argc < 1) { (void) fprintf(stderr, gettext("missing 'filesystem' " "argument\n")); usage(B_FALSE); } if (argc > 1) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } zfs_handle_t *zhp = zfs_open(g_zfs, argv[0], ZFS_TYPE_FILESYSTEM); if (zhp == NULL) return (1); for (;;) { boolean_t missing = B_FALSE; boolean_t any_waited = B_FALSE; for (int i = 0; i < ZFS_WAIT_NUM_ACTIVITIES; i++) { boolean_t waited; if (!enabled[i]) continue; error = zfs_wait_status(zhp, i, &missing, &waited); if (error != 0 || missing) break; any_waited = (any_waited || waited); } if (error != 0 || missing || !any_waited) break; } zfs_close(zhp); return (error); } /* * Display version message */ static int zfs_do_version(int argc, char **argv) { if (zfs_version_print() == -1) return (1); return (0); } int main(int argc, char **argv) { int ret = 0; int i = 0; char *cmdname; char **newargv; (void) setlocale(LC_ALL, ""); (void) setlocale(LC_NUMERIC, "C"); (void) textdomain(TEXT_DOMAIN); opterr = 0; /* * Make sure the user has specified some command. */ if (argc < 2) { (void) fprintf(stderr, gettext("missing command\n")); usage(B_FALSE); } cmdname = argv[1]; /* * The 'umount' command is an alias for 'unmount' */ if (strcmp(cmdname, "umount") == 0) cmdname = "unmount"; /* * The 'recv' command is an alias for 'receive' */ if (strcmp(cmdname, "recv") == 0) cmdname = "receive"; /* * The 'snap' command is an alias for 'snapshot' */ if (strcmp(cmdname, "snap") == 0) cmdname = "snapshot"; /* * Special case '-?' */ if ((strcmp(cmdname, "-?") == 0) || (strcmp(cmdname, "--help") == 0)) usage(B_TRUE); /* * Special case '-V|--version' */ if ((strcmp(cmdname, "-V") == 0) || (strcmp(cmdname, "--version") == 0)) return (zfs_do_version(argc, argv)); if ((g_zfs = libzfs_init()) == NULL) { (void) fprintf(stderr, "%s\n", libzfs_error_init(errno)); return (1); } mnttab_file = g_zfs->libzfs_mnttab; zfs_save_arguments(argc, argv, history_str, sizeof (history_str)); libzfs_print_on_error(g_zfs, B_TRUE); /* * Many commands modify input strings for string parsing reasons. * We create a copy to protect the original argv. */ newargv = malloc((argc + 1) * sizeof (newargv[0])); for (i = 0; i < argc; i++) newargv[i] = strdup(argv[i]); newargv[argc] = NULL; /* * Run the appropriate command. */ libzfs_mnttab_cache(g_zfs, B_TRUE); if (find_command_idx(cmdname, &i) == 0) { current_command = &command_table[i]; ret = command_table[i].func(argc - 1, newargv + 1); } else if (strchr(cmdname, '=') != NULL) { verify(find_command_idx("set", &i) == 0); current_command = &command_table[i]; ret = command_table[i].func(argc, newargv); } else { (void) fprintf(stderr, gettext("unrecognized " "command '%s'\n"), cmdname); usage(B_FALSE); ret = 1; } for (i = 0; i < argc; i++) free(newargv[i]); free(newargv); if (ret == 0 && log_history) (void) zpool_log_history(g_zfs, history_str); libzfs_fini(g_zfs); /* * The 'ZFS_ABORT' environment variable causes us to dump core on exit * for the purposes of running ::findleaks. */ if (getenv("ZFS_ABORT") != NULL) { (void) printf("dumping core by request\n"); abort(); } return (ret); } #ifdef __FreeBSD__ #include #include /* * Attach/detach the given dataset to/from the given jail */ /* ARGSUSED */ static int zfs_do_jail_impl(int argc, char **argv, boolean_t attach) { zfs_handle_t *zhp; int jailid, ret; /* check number of arguments */ if (argc < 3) { (void) fprintf(stderr, gettext("missing argument(s)\n")); usage(B_FALSE); } if (argc > 3) { (void) fprintf(stderr, gettext("too many arguments\n")); usage(B_FALSE); } jailid = jail_getid(argv[1]); if (jailid < 0) { (void) fprintf(stderr, gettext("invalid jail id or name\n")); usage(B_FALSE); } zhp = zfs_open(g_zfs, argv[2], ZFS_TYPE_FILESYSTEM); if (zhp == NULL) return (1); ret = (zfs_jail(zhp, jailid, attach) != 0); zfs_close(zhp); return (ret); } /* * zfs jail jailid filesystem * * Attach the given dataset to the given jail */ /* ARGSUSED */ static int zfs_do_jail(int argc, char **argv) { return (zfs_do_jail_impl(argc, argv, B_TRUE)); } /* * zfs unjail jailid filesystem * * Detach the given dataset from the given jail */ /* ARGSUSED */ static int zfs_do_unjail(int argc, char **argv) { return (zfs_do_jail_impl(argc, argv, B_FALSE)); } #endif Index: vendor-sys/openzfs/dist/config/kernel-config-defined.m4 =================================================================== --- vendor-sys/openzfs/dist/config/kernel-config-defined.m4 (revision 366347) +++ vendor-sys/openzfs/dist/config/kernel-config-defined.m4 (revision 366348) @@ -1,183 +1,183 @@ dnl # dnl # Certain kernel build options are not supported. These must be dnl # detected at configure time and cause a build failure. Otherwise dnl # modules may be successfully built that behave incorrectly. dnl # AC_DEFUN([ZFS_AC_KERNEL_CONFIG_DEFINED], [ AS_IF([test "x$cross_compiling" != xyes], [ AC_RUN_IFELSE([ AC_LANG_PROGRAM([ #include "$LINUX/include/linux/license.h" ], [ return !license_is_gpl_compatible( "$ZFS_META_LICENSE"); ]) ], [ AC_DEFINE([ZFS_IS_GPL_COMPATIBLE], [1], [Define to 1 if GPL-only symbols can be used]) ], [ ]) ]) ZFS_AC_KERNEL_SRC_CONFIG_THREAD_SIZE ZFS_AC_KERNEL_SRC_CONFIG_DEBUG_LOCK_ALLOC ZFS_AC_KERNEL_SRC_CONFIG_TRIM_UNUSED_KSYMS ZFS_AC_KERNEL_SRC_CONFIG_ZLIB_INFLATE ZFS_AC_KERNEL_SRC_CONFIG_ZLIB_DEFLATE AC_MSG_CHECKING([for kernel config option compatibility]) ZFS_LINUX_TEST_COMPILE_ALL([config]) AC_MSG_RESULT([done]) ZFS_AC_KERNEL_CONFIG_THREAD_SIZE ZFS_AC_KERNEL_CONFIG_DEBUG_LOCK_ALLOC ZFS_AC_KERNEL_CONFIG_TRIM_UNUSED_KSYMS ZFS_AC_KERNEL_CONFIG_ZLIB_INFLATE ZFS_AC_KERNEL_CONFIG_ZLIB_DEFLATE ]) dnl # dnl # Check configured THREAD_SIZE dnl # dnl # The stack size will vary by architecture, but as of Linux 3.15 on x86_64 dnl # the default thread stack size was increased to 16K from 8K. Therefore, dnl # on newer kernels and some architectures stack usage optimizations can be dnl # conditionally applied to improve performance without negatively impacting dnl # stability. dnl # AC_DEFUN([ZFS_AC_KERNEL_SRC_CONFIG_THREAD_SIZE], [ ZFS_LINUX_TEST_SRC([config_thread_size], [ #include ],[ #if (THREAD_SIZE < 16384) #error "THREAD_SIZE is less than 16K" #endif ]) ]) AC_DEFUN([ZFS_AC_KERNEL_CONFIG_THREAD_SIZE], [ AC_MSG_CHECKING([whether kernel was built with 16K or larger stacks]) ZFS_LINUX_TEST_RESULT([config_thread_size], [ AC_MSG_RESULT([yes]) AC_DEFINE(HAVE_LARGE_STACKS, 1, [kernel has large stacks]) ],[ AC_MSG_RESULT([no]) ]) ]) dnl # dnl # Check CONFIG_DEBUG_LOCK_ALLOC dnl # dnl # This is typically only set for debug kernels because it comes with dnl # a performance penalty. However, when it is set it maps the non-GPL dnl # symbol mutex_lock() to the GPL-only mutex_lock_nested() symbol. dnl # This will cause a failure at link time which we'd rather know about dnl # at compile time. dnl # dnl # Since we plan to pursue making mutex_lock_nested() a non-GPL symbol dnl # with the upstream community we add a check to detect this case. dnl # AC_DEFUN([ZFS_AC_KERNEL_SRC_CONFIG_DEBUG_LOCK_ALLOC], [ ZFS_LINUX_TEST_SRC([config_debug_lock_alloc], [ #include ],[ struct mutex lock; mutex_init(&lock); mutex_lock(&lock); mutex_unlock(&lock); ], [], [$ZFS_META_LICENSE]) ]) AC_DEFUN([ZFS_AC_KERNEL_CONFIG_DEBUG_LOCK_ALLOC], [ AC_MSG_CHECKING([whether mutex_lock() is GPL-only]) - ZFS_LINUX_TEST_RESULT([config_debug_lock_alloc], [ + ZFS_LINUX_TEST_RESULT([config_debug_lock_alloc_license], [ AC_MSG_RESULT(no) ],[ AC_MSG_RESULT(yes) AC_MSG_ERROR([ *** Kernel built with CONFIG_DEBUG_LOCK_ALLOC which is incompatible *** with the CDDL license and will prevent the module linking stage *** from succeeding. You must rebuild your kernel without this *** option enabled.]) ]) ]) dnl # dnl # Check CONFIG_TRIM_UNUSED_KSYMS dnl # dnl # Verify the kernel has CONFIG_TRIM_UNUSED_KSYMS disabled. dnl # AC_DEFUN([ZFS_AC_KERNEL_SRC_CONFIG_TRIM_UNUSED_KSYMS], [ ZFS_LINUX_TEST_SRC([config_trim_unusued_ksyms], [ #if defined(CONFIG_TRIM_UNUSED_KSYMS) #error CONFIG_TRIM_UNUSED_KSYMS not defined #endif ],[]) ]) AC_DEFUN([ZFS_AC_KERNEL_CONFIG_TRIM_UNUSED_KSYMS], [ AC_MSG_CHECKING([whether CONFIG_TRIM_UNUSED_KSYM is disabled]) ZFS_LINUX_TEST_RESULT([config_trim_unusued_ksyms], [ AC_MSG_RESULT([yes]) ],[ AC_MSG_RESULT([no]) AS_IF([test "x$enable_linux_builtin" != xyes], [ AC_MSG_ERROR([ *** This kernel has unused symbols trimming enabled, please disable. *** Rebuild the kernel with CONFIG_TRIM_UNUSED_KSYMS=n set.]) ]) ]) ]) dnl # dnl # Check CONFIG_ZLIB_INFLATE dnl # dnl # Verify the kernel has CONFIG_ZLIB_INFLATE support enabled. dnl # AC_DEFUN([ZFS_AC_KERNEL_SRC_CONFIG_ZLIB_INFLATE], [ ZFS_LINUX_TEST_SRC([config_zlib_inflate], [ #if !defined(CONFIG_ZLIB_INFLATE) && \ !defined(CONFIG_ZLIB_INFLATE_MODULE) #error CONFIG_ZLIB_INFLATE not defined #endif ],[]) ]) AC_DEFUN([ZFS_AC_KERNEL_CONFIG_ZLIB_INFLATE], [ AC_MSG_CHECKING([whether CONFIG_ZLIB_INFLATE is defined]) ZFS_LINUX_TEST_RESULT([config_zlib_inflate], [ AC_MSG_RESULT([yes]) ],[ AC_MSG_RESULT([no]) AC_MSG_ERROR([ *** This kernel does not include the required zlib inflate support. *** Rebuild the kernel with CONFIG_ZLIB_INFLATE=y|m set.]) ]) ]) dnl # dnl # Check CONFIG_ZLIB_DEFLATE dnl # dnl # Verify the kernel has CONFIG_ZLIB_DEFLATE support enabled. dnl # AC_DEFUN([ZFS_AC_KERNEL_SRC_CONFIG_ZLIB_DEFLATE], [ ZFS_LINUX_TEST_SRC([config_zlib_deflate], [ #if !defined(CONFIG_ZLIB_DEFLATE) && \ !defined(CONFIG_ZLIB_DEFLATE_MODULE) #error CONFIG_ZLIB_DEFLATE not defined #endif ],[]) ]) AC_DEFUN([ZFS_AC_KERNEL_CONFIG_ZLIB_DEFLATE], [ AC_MSG_CHECKING([whether CONFIG_ZLIB_DEFLATE is defined]) ZFS_LINUX_TEST_RESULT([config_zlib_deflate], [ AC_MSG_RESULT([yes]) ],[ AC_MSG_RESULT([no]) AC_MSG_ERROR([ *** This kernel does not include the required zlib deflate support. *** Rebuild the kernel with CONFIG_ZLIB_DEFLATE=y|m set.]) ]) ]) Index: vendor-sys/openzfs/dist/config/kernel-objtool.m4 =================================================================== --- vendor-sys/openzfs/dist/config/kernel-objtool.m4 (revision 366347) +++ vendor-sys/openzfs/dist/config/kernel-objtool.m4 (revision 366348) @@ -1,45 +1,46 @@ dnl # dnl # Check for objtool support. dnl # AC_DEFUN([ZFS_AC_KERNEL_SRC_OBJTOOL], [ dnl # 4.6 API for compile-time stack validation ZFS_LINUX_TEST_SRC([objtool], [ #undef __ASSEMBLY__ + #include #include ],[ #if !defined(FRAME_BEGIN) - CTASSERT(1); + #error "FRAME_BEGIN is not defined" #endif ]) dnl # 4.6 API added STACK_FRAME_NON_STANDARD macro ZFS_LINUX_TEST_SRC([stack_frame_non_standard], [ #include ],[ #if !defined(STACK_FRAME_NON_STANDARD) - CTASSERT(1); + #error "STACK_FRAME_NON_STANDARD is not defined." #endif ]) ]) AC_DEFUN([ZFS_AC_KERNEL_OBJTOOL], [ AC_MSG_CHECKING( [whether compile-time stack validation (objtool) is available]) ZFS_LINUX_TEST_RESULT([objtool], [ AC_MSG_RESULT(yes) AC_DEFINE(HAVE_KERNEL_OBJTOOL, 1, [kernel does stack verification]) AC_MSG_CHECKING([whether STACK_FRAME_NON_STANDARD is defined]) ZFS_LINUX_TEST_RESULT([stack_frame_non_standard], [ AC_MSG_RESULT(yes) AC_DEFINE(HAVE_STACK_FRAME_NON_STANDARD, 1, [STACK_FRAME_NON_STANDARD is defined]) ],[ AC_MSG_RESULT(no) ]) ],[ AC_MSG_RESULT(no) ]) ]) Index: vendor-sys/openzfs/dist/configure.ac =================================================================== --- vendor-sys/openzfs/dist/configure.ac (revision 366347) +++ vendor-sys/openzfs/dist/configure.ac (revision 366348) @@ -1,411 +1,412 @@ /* * This file is part of the ZFS Linux port. * * Copyright (c) 2009 Lawrence Livermore National Security, LLC. * Produced at Lawrence Livermore National Laboratory * Written by: * Brian Behlendorf , * Herb Wartens , * Jim Garlick * LLNL-CODE-403049 * * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License, Version 1.0 only * (the "License"). You may not use this file except in compliance * with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ AC_INIT(m4_esyscmd(grep ^Name: META | cut -d ':' -f 2 | tr -d ' \n'), m4_esyscmd(grep ^Version: META | cut -d ':' -f 2 | tr -d ' \n')) AC_LANG(C) ZFS_AC_META AC_CONFIG_AUX_DIR([config]) AC_CONFIG_MACRO_DIR([config]) AC_CANONICAL_SYSTEM AM_MAINTAINER_MODE m4_ifdef([AM_SILENT_RULES], [AM_SILENT_RULES([yes])]) AM_INIT_AUTOMAKE([subdir-objects]) AC_CONFIG_HEADERS([zfs_config.h], [ (mv zfs_config.h zfs_config.h.tmp && awk -f ${ac_srcdir}/config/config.awk zfs_config.h.tmp >zfs_config.h && rm zfs_config.h.tmp) || exit 1]) AC_PROG_INSTALL AC_PROG_CC AC_PROG_LIBTOOL PKG_PROG_PKG_CONFIG AM_PROG_AS AM_PROG_CC_C_O AX_CODE_COVERAGE _AM_PROG_TAR(pax) ZFS_AC_LICENSE ZFS_AC_CONFIG ZFS_AC_PACKAGE ZFS_AC_DEBUG ZFS_AC_DEBUGINFO ZFS_AC_DEBUG_KMEM ZFS_AC_DEBUG_KMEM_TRACKING AC_CONFIG_FILES([ Makefile cmd/Makefile cmd/arc_summary/Makefile cmd/arcstat/Makefile cmd/dbufstat/Makefile cmd/fsck_zfs/Makefile cmd/mount_zfs/Makefile cmd/raidz_test/Makefile cmd/vdev_id/Makefile cmd/zdb/Makefile cmd/zed/Makefile cmd/zed/zed.d/Makefile cmd/zfs/Makefile cmd/zfs_ids_to_path/Makefile cmd/zgenhostid/Makefile cmd/zhack/Makefile cmd/zinject/Makefile cmd/zpool/Makefile cmd/zstream/Makefile cmd/zstreamdump/Makefile cmd/ztest/Makefile cmd/zvol_id/Makefile cmd/zvol_wait/Makefile contrib/Makefile contrib/bash_completion.d/Makefile contrib/bpftrace/Makefile contrib/dracut/02zfsexpandknowledge/Makefile contrib/dracut/90zfs/Makefile contrib/dracut/Makefile contrib/initramfs/Makefile contrib/initramfs/conf.d/Makefile contrib/initramfs/conf-hooks.d/Makefile contrib/initramfs/hooks/Makefile contrib/initramfs/scripts/Makefile contrib/initramfs/scripts/local-top/Makefile contrib/pam_zfs_key/Makefile contrib/pyzfs/Makefile contrib/pyzfs/setup.py contrib/zcp/Makefile etc/Makefile etc/default/Makefile etc/init.d/Makefile etc/modules-load.d/Makefile etc/sudoers.d/Makefile etc/systemd/Makefile etc/systemd/system-generators/Makefile etc/systemd/system/Makefile etc/zfs/Makefile include/Makefile include/os/Makefile include/os/freebsd/Makefile include/os/freebsd/linux/Makefile include/os/freebsd/spl/Makefile include/os/freebsd/spl/acl/Makefile include/os/freebsd/spl/rpc/Makefile include/os/freebsd/spl/sys/Makefile include/os/freebsd/zfs/Makefile include/os/freebsd/zfs/sys/Makefile include/os/linux/Makefile include/os/linux/kernel/Makefile include/os/linux/kernel/linux/Makefile include/os/linux/spl/Makefile include/os/linux/spl/rpc/Makefile include/os/linux/spl/sys/Makefile include/os/linux/zfs/Makefile include/os/linux/zfs/sys/Makefile include/sys/Makefile include/sys/crypto/Makefile include/sys/fm/Makefile include/sys/fm/fs/Makefile include/sys/fs/Makefile include/sys/lua/Makefile include/sys/sysevent/Makefile include/sys/zstd/Makefile lib/Makefile lib/libavl/Makefile lib/libefi/Makefile lib/libicp/Makefile lib/libnvpair/Makefile lib/libshare/Makefile lib/libspl/Makefile lib/libspl/include/Makefile lib/libspl/include/ia32/Makefile lib/libspl/include/ia32/sys/Makefile lib/libspl/include/os/Makefile lib/libspl/include/os/freebsd/Makefile lib/libspl/include/os/freebsd/sys/Makefile lib/libspl/include/os/linux/Makefile lib/libspl/include/os/linux/sys/Makefile lib/libspl/include/rpc/Makefile lib/libspl/include/sys/Makefile lib/libspl/include/sys/dktp/Makefile lib/libspl/include/util/Makefile lib/libtpool/Makefile lib/libunicode/Makefile lib/libuutil/Makefile lib/libzfs/Makefile lib/libzfs/libzfs.pc lib/libzfsbootenv/Makefile lib/libzfsbootenv/libzfsbootenv.pc lib/libzfs_core/Makefile lib/libzfs_core/libzfs_core.pc lib/libzpool/Makefile lib/libzstd/Makefile lib/libzutil/Makefile man/Makefile man/man1/Makefile man/man5/Makefile man/man8/Makefile module/Kbuild module/Makefile module/avl/Makefile module/icp/Makefile module/lua/Makefile module/nvpair/Makefile module/os/linux/spl/Makefile module/os/linux/zfs/Makefile module/spl/Makefile module/unicode/Makefile module/zcommon/Makefile module/zfs/Makefile module/zstd/Makefile rpm/Makefile rpm/generic/Makefile rpm/generic/zfs-dkms.spec rpm/generic/zfs-kmod.spec rpm/generic/zfs.spec rpm/redhat/Makefile rpm/redhat/zfs-dkms.spec rpm/redhat/zfs-kmod.spec rpm/redhat/zfs.spec scripts/Makefile tests/Makefile tests/runfiles/Makefile tests/test-runner/Makefile tests/test-runner/bin/Makefile tests/test-runner/include/Makefile tests/test-runner/man/Makefile tests/zfs-tests/Makefile tests/zfs-tests/callbacks/Makefile tests/zfs-tests/cmd/Makefile + tests/zfs-tests/cmd/badsend/Makefile tests/zfs-tests/cmd/btree_test/Makefile tests/zfs-tests/cmd/chg_usr_exec/Makefile tests/zfs-tests/cmd/devname2devid/Makefile tests/zfs-tests/cmd/dir_rd_update/Makefile tests/zfs-tests/cmd/file_check/Makefile tests/zfs-tests/cmd/file_trunc/Makefile tests/zfs-tests/cmd/file_write/Makefile tests/zfs-tests/cmd/get_diff/Makefile tests/zfs-tests/cmd/largest_file/Makefile tests/zfs-tests/cmd/libzfs_input_check/Makefile tests/zfs-tests/cmd/mkbusy/Makefile tests/zfs-tests/cmd/mkfile/Makefile tests/zfs-tests/cmd/mkfiles/Makefile tests/zfs-tests/cmd/mktree/Makefile tests/zfs-tests/cmd/mmap_exec/Makefile tests/zfs-tests/cmd/mmap_libaio/Makefile tests/zfs-tests/cmd/mmapwrite/Makefile tests/zfs-tests/cmd/nvlist_to_lua/Makefile tests/zfs-tests/cmd/randfree_file/Makefile tests/zfs-tests/cmd/randwritecomp/Makefile tests/zfs-tests/cmd/readmmap/Makefile tests/zfs-tests/cmd/rename_dir/Makefile tests/zfs-tests/cmd/rm_lnkcnt_zero_file/Makefile tests/zfs-tests/cmd/stride_dd/Makefile tests/zfs-tests/cmd/threadsappend/Makefile tests/zfs-tests/cmd/user_ns_exec/Makefile tests/zfs-tests/cmd/xattrtest/Makefile tests/zfs-tests/include/Makefile tests/zfs-tests/tests/Makefile tests/zfs-tests/tests/functional/Makefile tests/zfs-tests/tests/functional/acl/Makefile tests/zfs-tests/tests/functional/acl/posix/Makefile tests/zfs-tests/tests/functional/alloc_class/Makefile tests/zfs-tests/tests/functional/arc/Makefile tests/zfs-tests/tests/functional/atime/Makefile tests/zfs-tests/tests/functional/bootfs/Makefile tests/zfs-tests/tests/functional/btree/Makefile tests/zfs-tests/tests/functional/cache/Makefile tests/zfs-tests/tests/functional/cachefile/Makefile tests/zfs-tests/tests/functional/casenorm/Makefile tests/zfs-tests/tests/functional/channel_program/Makefile tests/zfs-tests/tests/functional/channel_program/lua_core/Makefile tests/zfs-tests/tests/functional/channel_program/synctask_core/Makefile tests/zfs-tests/tests/functional/chattr/Makefile tests/zfs-tests/tests/functional/checksum/Makefile tests/zfs-tests/tests/functional/clean_mirror/Makefile tests/zfs-tests/tests/functional/cli_root/Makefile tests/zfs-tests/tests/functional/cli_root/zdb/Makefile tests/zfs-tests/tests/functional/cli_root/zfs/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_bookmark/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_change-key/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_clone/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_copies/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_create/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_destroy/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_diff/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_get/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_ids_to_path/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_inherit/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_jail/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_load-key/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_mount/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_program/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_promote/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_property/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_receive/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_rename/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_reservation/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_rollback/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_send/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_set/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_share/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_snapshot/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_sysfs/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_unload-key/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_unmount/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_unshare/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_upgrade/Makefile tests/zfs-tests/tests/functional/cli_root/zfs_wait/Makefile tests/zfs-tests/tests/functional/cli_root/zpool/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_add/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_attach/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_clear/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_create/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_destroy/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_detach/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_events/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_expand/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_export/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_get/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_history/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_import/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_import/blockfiles/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_initialize/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_labelclear/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_offline/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_online/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_remove/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_reopen/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_replace/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_resilver/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_scrub/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_set/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_split/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_status/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_sync/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_trim/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_upgrade/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_upgrade/blockfiles/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_wait/Makefile tests/zfs-tests/tests/functional/cli_root/zpool_wait/scan/Makefile tests/zfs-tests/tests/functional/cli_user/Makefile tests/zfs-tests/tests/functional/cli_user/misc/Makefile tests/zfs-tests/tests/functional/cli_user/zfs_list/Makefile tests/zfs-tests/tests/functional/cli_user/zpool_iostat/Makefile tests/zfs-tests/tests/functional/cli_user/zpool_list/Makefile tests/zfs-tests/tests/functional/cli_user/zpool_status/Makefile tests/zfs-tests/tests/functional/compression/Makefile tests/zfs-tests/tests/functional/cp_files/Makefile tests/zfs-tests/tests/functional/ctime/Makefile tests/zfs-tests/tests/functional/deadman/Makefile tests/zfs-tests/tests/functional/delegate/Makefile tests/zfs-tests/tests/functional/devices/Makefile tests/zfs-tests/tests/functional/events/Makefile tests/zfs-tests/tests/functional/exec/Makefile tests/zfs-tests/tests/functional/fallocate/Makefile tests/zfs-tests/tests/functional/fault/Makefile tests/zfs-tests/tests/functional/features/Makefile tests/zfs-tests/tests/functional/features/async_destroy/Makefile tests/zfs-tests/tests/functional/features/large_dnode/Makefile tests/zfs-tests/tests/functional/grow/Makefile tests/zfs-tests/tests/functional/history/Makefile tests/zfs-tests/tests/functional/hkdf/Makefile tests/zfs-tests/tests/functional/inheritance/Makefile tests/zfs-tests/tests/functional/inuse/Makefile tests/zfs-tests/tests/functional/io/Makefile tests/zfs-tests/tests/functional/large_files/Makefile tests/zfs-tests/tests/functional/largest_pool/Makefile tests/zfs-tests/tests/functional/libzfs/Makefile tests/zfs-tests/tests/functional/limits/Makefile tests/zfs-tests/tests/functional/link_count/Makefile tests/zfs-tests/tests/functional/log_spacemap/Makefile tests/zfs-tests/tests/functional/migration/Makefile tests/zfs-tests/tests/functional/mmap/Makefile tests/zfs-tests/tests/functional/mmp/Makefile tests/zfs-tests/tests/functional/mount/Makefile tests/zfs-tests/tests/functional/mv_files/Makefile tests/zfs-tests/tests/functional/nestedfs/Makefile tests/zfs-tests/tests/functional/no_space/Makefile tests/zfs-tests/tests/functional/nopwrite/Makefile tests/zfs-tests/tests/functional/online_offline/Makefile tests/zfs-tests/tests/functional/pam/Makefile tests/zfs-tests/tests/functional/persist_l2arc/Makefile tests/zfs-tests/tests/functional/pool_checkpoint/Makefile tests/zfs-tests/tests/functional/pool_names/Makefile tests/zfs-tests/tests/functional/poolversion/Makefile tests/zfs-tests/tests/functional/privilege/Makefile tests/zfs-tests/tests/functional/procfs/Makefile tests/zfs-tests/tests/functional/projectquota/Makefile tests/zfs-tests/tests/functional/pyzfs/Makefile tests/zfs-tests/tests/functional/quota/Makefile tests/zfs-tests/tests/functional/raidz/Makefile tests/zfs-tests/tests/functional/redacted_send/Makefile tests/zfs-tests/tests/functional/redundancy/Makefile tests/zfs-tests/tests/functional/refquota/Makefile tests/zfs-tests/tests/functional/refreserv/Makefile tests/zfs-tests/tests/functional/removal/Makefile tests/zfs-tests/tests/functional/rename_dirs/Makefile tests/zfs-tests/tests/functional/replacement/Makefile tests/zfs-tests/tests/functional/reservation/Makefile tests/zfs-tests/tests/functional/rootpool/Makefile tests/zfs-tests/tests/functional/rsend/Makefile tests/zfs-tests/tests/functional/scrub_mirror/Makefile tests/zfs-tests/tests/functional/slog/Makefile tests/zfs-tests/tests/functional/snapshot/Makefile tests/zfs-tests/tests/functional/snapused/Makefile tests/zfs-tests/tests/functional/sparse/Makefile tests/zfs-tests/tests/functional/suid/Makefile tests/zfs-tests/tests/functional/threadsappend/Makefile tests/zfs-tests/tests/functional/tmpfile/Makefile tests/zfs-tests/tests/functional/trim/Makefile tests/zfs-tests/tests/functional/truncate/Makefile tests/zfs-tests/tests/functional/upgrade/Makefile tests/zfs-tests/tests/functional/user_namespace/Makefile tests/zfs-tests/tests/functional/userquota/Makefile tests/zfs-tests/tests/functional/vdev_zaps/Makefile tests/zfs-tests/tests/functional/write_dirs/Makefile tests/zfs-tests/tests/functional/xattr/Makefile tests/zfs-tests/tests/functional/zvol/Makefile tests/zfs-tests/tests/functional/zvol/zvol_ENOSPC/Makefile tests/zfs-tests/tests/functional/zvol/zvol_cli/Makefile tests/zfs-tests/tests/functional/zvol/zvol_misc/Makefile tests/zfs-tests/tests/functional/zvol/zvol_swap/Makefile tests/zfs-tests/tests/perf/Makefile tests/zfs-tests/tests/perf/fio/Makefile tests/zfs-tests/tests/perf/regression/Makefile tests/zfs-tests/tests/perf/scripts/Makefile tests/zfs-tests/tests/stress/Makefile udev/Makefile udev/rules.d/Makefile zfs.release ]) AC_OUTPUT Index: vendor-sys/openzfs/dist/contrib/initramfs/scripts/zfs =================================================================== --- vendor-sys/openzfs/dist/contrib/initramfs/scripts/zfs (revision 366347) +++ vendor-sys/openzfs/dist/contrib/initramfs/scripts/zfs (revision 366348) @@ -1,1011 +1,1004 @@ # ZFS boot stub for initramfs-tools. # # In the initramfs environment, the /init script sources this stub to # override the default functions in the /scripts/local script. # # Enable this by passing boot=zfs on the kernel command line. # # Source the common functions . /etc/zfs/zfs-functions # Start interactive shell. # Use debian's panic() if defined, because it allows to prevent shell access # by setting panic in cmdline (e.g. panic=0 or panic=15). # See "4.5 Disable root prompt on the initramfs" of Securing Debian Manual: # https://www.debian.org/doc/manuals/securing-debian-howto/ch4.en.html shell() { - if type panic > /dev/null 2>&1; then - panic $@ + if command -v panic > /dev/null 2>&1; then + panic else /bin/sh fi } # This runs any scripts that should run before we start importing # pools and mounting any filesystems. pre_mountroot() { - if type run_scripts > /dev/null 2>&1 && \ - [ -f "/scripts/local-top" -o -d "/scripts/local-top" ] + if command -v run_scripts > /dev/null 2>&1 then - [ "$quiet" != "y" ] && \ - zfs_log_begin_msg "Running /scripts/local-top" - run_scripts /scripts/local-top - [ "$quiet" != "y" ] && zfs_log_end_msg - fi + if [ -f "/scripts/local-top" ] || [ -d "/scripts/local-top" ] + then + [ "$quiet" != "y" ] && \ + zfs_log_begin_msg "Running /scripts/local-top" + run_scripts /scripts/local-top + [ "$quiet" != "y" ] && zfs_log_end_msg + fi - if type run_scripts > /dev/null 2>&1 && \ - [ -f "/scripts/local-premount" -o -d "/scripts/local-premount" ] - then - [ "$quiet" != "y" ] && \ - zfs_log_begin_msg "Running /scripts/local-premount" - run_scripts /scripts/local-premount - [ "$quiet" != "y" ] && zfs_log_end_msg + if [ -f "/scripts/local-premount" ] || [ -d "/scripts/local-premount" ] + then + [ "$quiet" != "y" ] && \ + zfs_log_begin_msg "Running /scripts/local-premount" + run_scripts /scripts/local-premount + [ "$quiet" != "y" ] && zfs_log_end_msg + fi fi } # If plymouth is available, hide the splash image. disable_plymouth() { if [ -x /bin/plymouth ] && /bin/plymouth --ping then /bin/plymouth hide-splash >/dev/null 2>&1 fi } # Get a ZFS filesystem property value. get_fs_value() { - local fs="$1" - local value=$2 + fs="$1" + value=$2 - "${ZFS}" get -H -ovalue $value "$fs" 2> /dev/null + "${ZFS}" get -H -ovalue "$value" "$fs" 2> /dev/null } # Find the 'bootfs' property on pool $1. # If the property does not contain '/', then ignore this # pool by exporting it again. find_rootfs() { - local pool="$1" + pool="$1" # If 'POOL_IMPORTED' isn't set, no pool imported and therefore # we won't be able to find a root fs. [ -z "${POOL_IMPORTED}" ] && return 1 # If it's already specified, just keep it mounted and exit # User (kernel command line) must be correct. [ -n "${ZFS_BOOTFS}" ] && return 0 # Not set, try to find it in the 'bootfs' property of the pool. # NOTE: zpool does not support 'get -H -ovalue bootfs'... ZFS_BOOTFS=$("${ZPOOL}" list -H -obootfs "$pool") # Make sure it's not '-' and that it starts with /. if [ "${ZFS_BOOTFS}" != "-" ] && \ - $(get_fs_value "${ZFS_BOOTFS}" mountpoint | grep -q '^/$') + get_fs_value "${ZFS_BOOTFS}" mountpoint | grep -q '^/$' then # Keep it mounted POOL_IMPORTED=1 return 0 fi # Not boot fs here, export it and later try again.. "${ZPOOL}" export "$pool" POOL_IMPORTED="" return 1 } # Support function to get a list of all pools, separated with ';' find_pools() { - local CMD="$*" - local pools pool + CMD="$*" pools=$($CMD 2> /dev/null | \ grep -E "pool:|^[a-zA-Z0-9]" | \ sed 's@.*: @@' | \ - while read pool; do \ - echo -n "$pool;" + while read -r pool; do \ + printf "%s" "$pool;" done) echo "${pools%%;}" # Return without the last ';'. } # Get a list of all available pools get_pools() { - local available_pools npools - if [ -n "${ZFS_POOL_IMPORT}" ]; then echo "$ZFS_POOL_IMPORT" return 0 fi # Get the base list of available pools. available_pools=$(find_pools "$ZPOOL" import) # Just in case - seen it happen (that a pool isn't visible/found # with a simple "zpool import" but only when using the "-d" # option or setting ZPOOL_IMPORT_PATH). if [ -d "/dev/disk/by-id" ] then npools=$(find_pools "$ZPOOL" import -d /dev/disk/by-id) if [ -n "$npools" ] then # Because we have found extra pool(s) here, which wasn't # found 'normally', we need to force USE_DISK_BY_ID to # make sure we're able to actually import it/them later. USE_DISK_BY_ID='yes' if [ -n "$available_pools" ] then # Filter out duplicates (pools found with the simple # "zpool import" but which is also found with the # "zpool import -d ..."). npools=$(echo "$npools" | sed "s,$available_pools,,") # Add the list to the existing list of # available pools available_pools="$available_pools;$npools" else available_pools="$npools" fi fi fi # Filter out any exceptions... if [ -n "$ZFS_POOL_EXCEPTIONS" ] then - local found="" - local apools="" - local pool exception + found="" + apools="" OLD_IFS="$IFS" ; IFS=";" for pool in $available_pools do for exception in $ZFS_POOL_EXCEPTIONS do [ "$pool" = "$exception" ] && continue 2 found="$pool" done if [ -n "$found" ] then if [ -n "$apools" ] then apools="$apools;$pool" else apools="$pool" fi fi done IFS="$OLD_IFS" available_pools="$apools" fi # Return list of available pools. echo "$available_pools" } # Import given pool $1 import_pool() { - local pool="$1" - local dirs dir + pool="$1" # Verify that the pool isn't already imported # Make as sure as we can to not require '-f' to import. "${ZPOOL}" get name,guid -o value -H 2>/dev/null | grep -Fxq "$pool" && return 0 # For backwards compatibility, make sure that ZPOOL_IMPORT_PATH is set # to something we can use later with the real import(s). We want to # make sure we find all by* dirs, BUT by-vdev should be first (if it # exists). - if [ -n "$USE_DISK_BY_ID" -a -z "$ZPOOL_IMPORT_PATH" ] + if [ -n "$USE_DISK_BY_ID" ] && [ -z "$ZPOOL_IMPORT_PATH" ] then dirs="$(for dir in $(echo /dev/disk/by-*) do # Ignore by-vdev here - we want it first! echo "$dir" | grep -q /by-vdev && continue [ ! -d "$dir" ] && continue - echo -n "$dir:" + printf "%s" "$dir:" done | sed 's,:$,,g')" if [ -d "/dev/disk/by-vdev" ] then # Add by-vdev at the beginning. ZPOOL_IMPORT_PATH="/dev/disk/by-vdev:" fi # ... and /dev at the very end, just for good measure. ZPOOL_IMPORT_PATH="$ZPOOL_IMPORT_PATH$dirs:/dev" fi # Needs to be exported for "zpool" to catch it. [ -n "$ZPOOL_IMPORT_PATH" ] && export ZPOOL_IMPORT_PATH [ "$quiet" != "y" ] && zfs_log_begin_msg \ "Importing pool '${pool}' using defaults" ZFS_CMD="${ZPOOL} import -N ${ZPOOL_FORCE} ${ZPOOL_IMPORT_OPTS}" ZFS_STDERR="$($ZFS_CMD "$pool" 2>&1)" ZFS_ERROR="$?" if [ "${ZFS_ERROR}" != 0 ] then [ "$quiet" != "y" ] && zfs_log_failure_msg "${ZFS_ERROR}" if [ -f "${ZPOOL_CACHE}" ] then [ "$quiet" != "y" ] && zfs_log_begin_msg \ "Importing pool '${pool}' using cachefile." ZFS_CMD="${ZPOOL} import -c ${ZPOOL_CACHE} -N ${ZPOOL_FORCE} ${ZPOOL_IMPORT_OPTS}" ZFS_STDERR="$($ZFS_CMD "$pool" 2>&1)" ZFS_ERROR="$?" fi if [ "${ZFS_ERROR}" != 0 ] then [ "$quiet" != "y" ] && zfs_log_failure_msg "${ZFS_ERROR}" disable_plymouth echo "" echo "Command: ${ZFS_CMD} '$pool'" echo "Message: $ZFS_STDERR" echo "Error: $ZFS_ERROR" echo "" echo "Failed to import pool '$pool'." echo "Manually import the pool and exit." shell fi fi [ "$quiet" != "y" ] && zfs_log_end_msg POOL_IMPORTED=1 return 0 } # Load ZFS modules # Loading a module in a initrd require a slightly different approach, # with more logging etc. load_module_initrd() { - if [ "$ZFS_INITRD_PRE_MOUNTROOT_SLEEP" > 0 ] + if [ "$ZFS_INITRD_PRE_MOUNTROOT_SLEEP" -gt 0 ] 2>/dev/null then if [ "$quiet" != "y" ]; then zfs_log_begin_msg "Sleeping for" \ "$ZFS_INITRD_PRE_MOUNTROOT_SLEEP seconds..." fi sleep "$ZFS_INITRD_PRE_MOUNTROOT_SLEEP" [ "$quiet" != "y" ] && zfs_log_end_msg fi # Wait for all of the /dev/{hd,sd}[a-z] device nodes to appear. - if type wait_for_udev > /dev/null 2>&1 ; then + if command -v wait_for_udev > /dev/null 2>&1 ; then wait_for_udev 10 - elif type wait_for_dev > /dev/null 2>&1 ; then + elif command -v wait_for_dev > /dev/null 2>&1 ; then wait_for_dev fi # zpool import refuse to import without a valid /proc/self/mounts [ ! -f /proc/self/mounts ] && mount proc /proc # Load the module load_module "zfs" || return 1 - if [ "$ZFS_INITRD_POST_MODPROBE_SLEEP" > 0 ] + if [ "$ZFS_INITRD_POST_MODPROBE_SLEEP" -gt 0 ] 2>/dev/null then if [ "$quiet" != "y" ]; then zfs_log_begin_msg "Sleeping for" \ "$ZFS_INITRD_POST_MODPROBE_SLEEP seconds..." fi sleep "$ZFS_INITRD_POST_MODPROBE_SLEEP" [ "$quiet" != "y" ] && zfs_log_end_msg fi return 0 } # Mount a given filesystem mount_fs() { - local fs="$1" - local mountpoint + fs="$1" # Check that the filesystem exists - "${ZFS}" list -oname -tfilesystem -H "${fs}" > /dev/null 2>&1 - [ "$?" -ne 0 ] && return 1 + "${ZFS}" list -oname -tfilesystem -H "${fs}" > /dev/null 2>&1 || return 1 # Skip filesystems with canmount=off. The root fs should not have # canmount=off, but ignore it for backwards compatibility just in case. if [ "$fs" != "${ZFS_BOOTFS}" ] then canmount=$(get_fs_value "$fs" canmount) [ "$canmount" = "off" ] && return 0 fi # Need the _original_ datasets mountpoint! mountpoint=$(get_fs_value "$fs" mountpoint) - if [ "$mountpoint" = "legacy" -o "$mountpoint" = "none" ]; then + if [ "$mountpoint" = "legacy" ] || [ "$mountpoint" = "none" ]; then # Can't use the mountpoint property. Might be one of our # clones. Check the 'org.zol:mountpoint' property set in # clone_snap() if that's usable. mountpoint=$(get_fs_value "$fs" org.zol:mountpoint) - if [ "$mountpoint" = "legacy" -o \ - "$mountpoint" = "none" -o \ - "$mountpoint" = "-" ] + if [ "$mountpoint" = "legacy" ] || + [ "$mountpoint" = "none" ] || + [ "$mountpoint" = "-" ] then if [ "$fs" != "${ZFS_BOOTFS}" ]; then # We don't have a proper mountpoint and this # isn't the root fs. return 0 else # Last hail-mary: Hope 'rootmnt' is set! mountpoint="" fi fi if [ "$mountpoint" = "legacy" ]; then ZFS_CMD="mount -t zfs" else # If it's not a legacy filesystem, it can only be a # native one... ZFS_CMD="mount -o zfsutil -t zfs" fi else ZFS_CMD="mount -o zfsutil -t zfs" fi # Possibly decrypt a filesystem using native encryption. decrypt_fs "$fs" [ "$quiet" != "y" ] && \ zfs_log_begin_msg "Mounting '${fs}' on '${rootmnt}/${mountpoint}'" [ -n "${ZFS_DEBUG}" ] && \ zfs_log_begin_msg "CMD: '$ZFS_CMD ${fs} ${rootmnt}/${mountpoint}'" ZFS_STDERR=$(${ZFS_CMD} "${fs}" "${rootmnt}/${mountpoint}" 2>&1) ZFS_ERROR=$? if [ "${ZFS_ERROR}" != 0 ] then [ "$quiet" != "y" ] && zfs_log_failure_msg "${ZFS_ERROR}" disable_plymouth echo "" echo "Command: ${ZFS_CMD} ${fs} ${rootmnt}/${mountpoint}" echo "Message: $ZFS_STDERR" echo "Error: $ZFS_ERROR" echo "" echo "Failed to mount ${fs} on ${rootmnt}/${mountpoint}." echo "Manually mount the filesystem and exit." shell else [ "$quiet" != "y" ] && zfs_log_end_msg fi return 0 } # Unlock a ZFS native encrypted filesystem. decrypt_fs() { - local fs="$1" - + fs="$1" + # If pool encryption is active and the zfs command understands '-o encryption' - if [ "$(zpool list -H -o feature@encryption $(echo "${fs}" | awk -F\/ '{print $1}'))" = 'active' ]; then + if [ "$(zpool list -H -o feature@encryption "$(echo "${fs}" | awk -F/ '{print $1}')")" = 'active' ]; then # Determine dataset that holds key for root dataset ENCRYPTIONROOT="$(get_fs_value "${fs}" encryptionroot)" KEYLOCATION="$(get_fs_value "${ENCRYPTIONROOT}" keylocation)" echo "${ENCRYPTIONROOT}" > /run/zfs_fs_name # If root dataset is encrypted... if ! [ "${ENCRYPTIONROOT}" = "-" ]; then KEYSTATUS="$(get_fs_value "${ENCRYPTIONROOT}" keystatus)" # Continue only if the key needs to be loaded [ "$KEYSTATUS" = "unavailable" ] || return 0 TRY_COUNT=3 # If key is stored in a file, do not prompt if ! [ "${KEYLOCATION}" = "prompt" ]; then $ZFS load-key "${ENCRYPTIONROOT}" # Prompt with plymouth, if active elif [ -e /bin/plymouth ] && /bin/plymouth --ping 2>/dev/null; then echo "plymouth" > /run/zfs_console_askpwd_cmd while [ $TRY_COUNT -gt 0 ]; do plymouth ask-for-password --prompt "Encrypted ZFS password for ${ENCRYPTIONROOT}" | \ $ZFS load-key "${ENCRYPTIONROOT}" && break TRY_COUNT=$((TRY_COUNT - 1)) done - # Prompt with systemd, if active + # Prompt with systemd, if active elif [ -e /run/systemd/system ]; then echo "systemd-ask-password" > /run/zfs_console_askpwd_cmd while [ $TRY_COUNT -gt 0 ]; do systemd-ask-password "Encrypted ZFS password for ${ENCRYPTIONROOT}" --no-tty | \ $ZFS load-key "${ENCRYPTIONROOT}" && break TRY_COUNT=$((TRY_COUNT - 1)) done # Prompt with ZFS tty, otherwise else # Temporarily setting "printk" to "7" allows the prompt to appear even when the "quiet" kernel option has been used echo "load-key" > /run/zfs_console_askpwd_cmd storeprintk="$(awk '{print $1}' /proc/sys/kernel/printk)" echo 7 > /proc/sys/kernel/printk $ZFS load-key "${ENCRYPTIONROOT}" echo "$storeprintk" > /proc/sys/kernel/printk fi fi fi return 0 } # Destroy a given filesystem. destroy_fs() { - local fs="$1" + fs="$1" [ "$quiet" != "y" ] && \ zfs_log_begin_msg "Destroying '$fs'" ZFS_CMD="${ZFS} destroy $fs" ZFS_STDERR="$(${ZFS_CMD} 2>&1)" ZFS_ERROR="$?" if [ "${ZFS_ERROR}" != 0 ] then [ "$quiet" != "y" ] && zfs_log_failure_msg "${ZFS_ERROR}" disable_plymouth echo "" echo "Command: $ZFS_CMD" echo "Message: $ZFS_STDERR" echo "Error: $ZFS_ERROR" echo "" echo "Failed to destroy '$fs'. Please make sure that '$fs' is not available." echo "Hint: Try: zfs destroy -Rfn $fs" echo "If this dryrun looks good, then remove the 'n' from '-Rfn' and try again." shell else [ "$quiet" != "y" ] && zfs_log_end_msg fi return 0 } # Clone snapshot $1 to destination filesystem $2 # Set 'canmount=noauto' and 'mountpoint=none' so that we get to keep # manual control over it's mounting (i.e., make sure it's not automatically # mounted with a 'zfs mount -a' in the init/systemd scripts). clone_snap() { - local snap="$1" - local destfs="$2" - local mountpoint="$3" + snap="$1" + destfs="$2" + mountpoint="$3" [ "$quiet" != "y" ] && zfs_log_begin_msg "Cloning '$snap' to '$destfs'" # Clone the snapshot into a dataset we can boot from # + We don't want this filesystem to be automatically mounted, we # want control over this here and nowhere else. # + We don't need any mountpoint set for the same reason. # We use the 'org.zol:mountpoint' property to remember the mountpoint. ZFS_CMD="${ZFS} clone -o canmount=noauto -o mountpoint=none" ZFS_CMD="${ZFS_CMD} -o org.zol:mountpoint=${mountpoint}" ZFS_CMD="${ZFS_CMD} $snap $destfs" ZFS_STDERR="$(${ZFS_CMD} 2>&1)" ZFS_ERROR="$?" if [ "${ZFS_ERROR}" != 0 ] then [ "$quiet" != "y" ] && zfs_log_failure_msg "${ZFS_ERROR}" disable_plymouth echo "" echo "Command: $ZFS_CMD" echo "Message: $ZFS_STDERR" echo "Error: $ZFS_ERROR" echo "" echo "Failed to clone snapshot." echo "Make sure that the any problems are corrected and then make sure" echo "that the dataset '$destfs' exists and is bootable." shell else [ "$quiet" != "y" ] && zfs_log_end_msg fi return 0 } # Rollback a given snapshot. rollback_snap() { - local snap="$1" + snap="$1" [ "$quiet" != "y" ] && zfs_log_begin_msg "Rollback $snap" ZFS_CMD="${ZFS} rollback -Rf $snap" ZFS_STDERR="$(${ZFS_CMD} 2>&1)" ZFS_ERROR="$?" if [ "${ZFS_ERROR}" != 0 ] then [ "$quiet" != "y" ] && zfs_log_failure_msg "${ZFS_ERROR}" disable_plymouth echo "" echo "Command: $ZFS_CMD" echo "Message: $ZFS_STDERR" echo "Error: $ZFS_ERROR" echo "" echo "Failed to rollback snapshot." shell else [ "$quiet" != "y" ] && zfs_log_end_msg fi return 0 } # Get a list of snapshots, give them as a numbered list # to the user to choose from. ask_user_snap() { - local fs="$1" - local i=1 - local SNAP snapnr snap debug + fs="$1" + i=1 # We need to temporarily disable debugging. Set 'debug' so we # remember to enabled it again. if [ -n "${ZFS_DEBUG}" ]; then unset ZFS_DEBUG set +x debug=1 fi # Because we need the resulting snapshot, which is sent on # stdout to the caller, we use stderr for our questions. echo "What snapshot do you want to boot from?" > /dev/stderr - while read snap; do + while read -r snap; do echo " $i: ${snap}" > /dev/stderr - eval `echo SNAP_$i=$snap` + eval "$(echo SNAP_$i=$snap)" i=$((i + 1)) done < /dev/stderr - read snapnr + echo "%s" " Snap nr [1-$((i-1))]? " > /dev/stderr + read -r snapnr # Re-enable debugging. if [ -n "${debug}" ]; then ZFS_DEBUG=1 set -x fi - echo "$(eval echo "$"SNAP_$snapnr)" + echo "$(eval echo '$SNAP_'$snapnr)" } setup_snapshot_booting() { - local snap="$1" - local s destfs subfs mountpoint retval=0 filesystems fs + snap="$1" + retval=0 - # Make sure that the snapshot specified actually exist. - if [ ! $(get_fs_value "${snap}" type) ] + # Make sure that the snapshot specified actually exists. + if [ ! "$(get_fs_value "${snap}" type)" ] then # Snapshot does not exist (...@ ?) # ask the user for a snapshot to use. snap="$(ask_user_snap "${snap%%@*}")" fi # Separate the full snapshot ('$snap') into it's filesystem and # snapshot names. Would have been nice with a split() function.. rootfs="${snap%%@*}" snapname="${snap##*@}" ZFS_BOOTFS="${rootfs}_${snapname}" if ! grep -qiE '(^|[^\\](\\\\)* )(rollback)=(on|yes|1)( |$)' /proc/cmdline then # If the destination dataset for the clone # already exists, destroy it. Recursively - if [ $(get_fs_value "${rootfs}_${snapname}" type) ]; then + if [ "$(get_fs_value "${rootfs}_${snapname}" type)" ]; then filesystems=$("${ZFS}" list -oname -tfilesystem -H \ -r -Sname "${ZFS_BOOTFS}") for fs in $filesystems; do destroy_fs "${fs}" done fi fi # Get all snapshots, recursively (might need to clone /usr, /var etc # as well). for s in $("${ZFS}" list -H -oname -tsnapshot -r "${rootfs}" | \ grep "${snapname}") do if grep -qiE '(^|[^\\](\\\\)* )(rollback)=(on|yes|1)( |$)' /proc/cmdline then # Rollback snapshot rollback_snap "$s" || retval=$((retval + 1)) else # Setup a destination filesystem name. # Ex: Called with 'rpool/ROOT/debian@snap2' # rpool/ROOT/debian@snap2 => rpool/ROOT/debian_snap2 # rpool/ROOT/debian/boot@snap2 => rpool/ROOT/debian_snap2/boot # rpool/ROOT/debian/usr@snap2 => rpool/ROOT/debian_snap2/usr # rpool/ROOT/debian/var@snap2 => rpool/ROOT/debian_snap2/var subfs="${s##$rootfs}" subfs="${subfs%%@$snapname}" destfs="${rootfs}_${snapname}" # base fs. [ -n "$subfs" ] && destfs="${destfs}$subfs" # + sub fs. # Get the mountpoint of the filesystem, to be used # with clone_snap(). If legacy or none, then use # the sub fs value. mountpoint=$(get_fs_value "${s%%@*}" mountpoint) - if [ "$mountpoint" = "legacy" -o \ - "$mountpoint" = "none" ] + if [ "$mountpoint" = "legacy" ] || \ + [ "$mountpoint" = "none" ] then if [ -n "${subfs}" ]; then mountpoint="${subfs}" else mountpoint="/" fi fi # Clone the snapshot into its own # filesystem clone_snap "$s" "${destfs}" "${mountpoint}" || \ retval=$((retval + 1)) fi done # If we haven't return yet, we have a problem... return "${retval}" } # ================================================================ # This is the main function. mountroot() { - local snaporig snapsub destfs pool POOLS - # ---------------------------------------------------------------- # I N I T I A L S E T U P # ------------ # Run the pre-mount scripts from /scripts/local-top. pre_mountroot # ------------ # Source the default setup variables. [ -r '/etc/default/zfs' ] && . /etc/default/zfs # ------------ # Support debug option if grep -qiE '(^|[^\\](\\\\)* )(zfs_debug|zfs\.debug|zfsdebug)=(on|yes|1)( |$)' /proc/cmdline then ZFS_DEBUG=1 mkdir /var/log #exec 2> /var/log/boot.debug set -x fi # ------------ # Load ZFS module etc. if ! load_module_initrd; then disable_plymouth echo "" echo "Failed to load ZFS modules." echo "Manually load the modules and exit." shell fi # ------------ # Look for the cache file (if any). [ ! -f ${ZPOOL_CACHE} ] && unset ZPOOL_CACHE # ------------ # Compatibility: 'ROOT' is for Debian GNU/Linux (etc), # 'root' is for Redhat/Fedora (etc), # 'REAL_ROOT' is for Gentoo if [ -z "$ROOT" ] then [ -n "$root" ] && ROOT=${root} [ -n "$REAL_ROOT" ] && ROOT=${REAL_ROOT} fi # ------------ # Where to mount the root fs in the initrd - set outside this script # Compatibility: 'rootmnt' is for Debian GNU/Linux (etc), # 'NEWROOT' is for RedHat/Fedora (etc), # 'NEW_ROOT' is for Gentoo if [ -z "$rootmnt" ] then [ -n "$NEWROOT" ] && rootmnt=${NEWROOT} [ -n "$NEW_ROOT" ] && rootmnt=${NEW_ROOT} fi # ------------ # No longer set in the defaults file, but it could have been set in # get_pools() in some circumstances. If it's something, but not 'yes', # it's no good to us. - [ -n "$USE_DISK_BY_ID" -a "$USE_DISK_BY_ID" != 'yes' ] && \ + [ -n "$USE_DISK_BY_ID" ] && [ "$USE_DISK_BY_ID" != 'yes' ] && \ unset USE_DISK_BY_ID # ---------------------------------------------------------------- # P A R S E C O M M A N D L I N E O P T I O N S # This part is the really ugly part - there's so many options and permutations # 'out there', and if we should make this the 'primary' source for ZFS initrd # scripting, we need/should support them all. # # Supports the following kernel command line argument combinations # (in this order - first match win): # # rpool= (tries to finds bootfs automatically) # bootfs=/ (uses this for rpool - first part) # rpool= bootfs=/ # -B zfs-bootfs=/ (uses this for rpool - first part) # rpool=rpool (default if none of the above is used) # root=/ (uses this for rpool - first part) # root=ZFS=/ (uses this for rpool - first part, without 'ZFS=') # root=zfs:AUTO (tries to detect both pool and rootfs # root=zfs:/ (uses this for rpool - first part, without 'zfs:') # # Option could also be # Option could also be # ------------ # Support force option # In addition, setting one of zfs_force, zfs.force or zfsforce to # 'yes', 'on' or '1' will make sure we force import the pool. # This should (almost) never be needed, but it's here for # completeness. ZPOOL_FORCE="" if grep -qiE '(^|[^\\](\\\\)* )(zfs_force|zfs\.force|zfsforce)=(on|yes|1)( |$)' /proc/cmdline then ZPOOL_FORCE="-f" fi # ------------ # Look for 'rpool' and 'bootfs' parameter [ -n "$rpool" ] && ZFS_RPOOL="${rpool#rpool=}" [ -n "$bootfs" ] && ZFS_BOOTFS="${bootfs#bootfs=}" # ------------ # If we have 'ROOT' (see above), but not 'ZFS_BOOTFS', then use # 'ROOT' - [ -n "$ROOT" -a -z "${ZFS_BOOTFS}" ] && ZFS_BOOTFS="$ROOT" + [ -n "$ROOT" ] && [ -z "${ZFS_BOOTFS}" ] && ZFS_BOOTFS="$ROOT" # ------------ # Check for the `-B zfs-bootfs=%s/%u,...` kind of parameter. # NOTE: Only use the pool name and dataset. The rest is not - # supported by ZoL (whatever it's for). + # supported by OpenZFS (whatever it's for). if [ -z "$ZFS_RPOOL" ] then # The ${zfs-bootfs} variable is set at the kernel command # line, usually by GRUB, but it cannot be referenced here # directly because bourne variable names cannot contain a # hyphen. # # Reassign the variable by dumping the environment and # stripping the zfs-bootfs= prefix. Let the shell handle # quoting through the eval command. eval ZFS_RPOOL=$(set | sed -n -e 's,^zfs-bootfs=,,p') fi # ------------ # No root fs or pool specified - do auto detect. - if [ -z "$ZFS_RPOOL" -a -z "${ZFS_BOOTFS}" ] + if [ -z "$ZFS_RPOOL" ] && [ -z "${ZFS_BOOTFS}" ] then # Do auto detect. Do this by 'cheating' - set 'root=zfs:AUTO' # which will be caught later - ROOT=zfs:AUTO + ROOT='zfs:AUTO' fi # ---------------------------------------------------------------- # F I N D A N D I M P O R T C O R R E C T P O O L # ------------ if [ "$ROOT" = "zfs:AUTO" ] then # Try to detect both pool and root fs. [ "$quiet" != "y" ] && \ zfs_log_begin_msg "Attempting to import additional pools." # Get a list of pools available for import if [ -n "$ZFS_RPOOL" ] then # We've specified a pool - check only that POOLS=$ZFS_RPOOL else POOLS=$(get_pools) fi OLD_IFS="$IFS" ; IFS=";" for pool in $POOLS do [ -z "$pool" ] && continue import_pool "$pool" find_rootfs "$pool" done IFS="$OLD_IFS" [ "$quiet" != "y" ] && zfs_log_end_msg $ZFS_ERROR else # No auto - use value from the command line option. # Strip 'zfs:' and 'ZFS='. ZFS_BOOTFS="${ROOT#*[:=]}" # Strip everything after the first slash. ZFS_RPOOL="${ZFS_BOOTFS%%/*}" fi # Import the pool (if not already done so in the AUTO check above). - if [ -n "$ZFS_RPOOL" -a -z "${POOL_IMPORTED}" ] + if [ -n "$ZFS_RPOOL" ] && [ -z "${POOL_IMPORTED}" ] then [ "$quiet" != "y" ] && \ zfs_log_begin_msg "Importing ZFS root pool '$ZFS_RPOOL'" import_pool "${ZFS_RPOOL}" find_rootfs "${ZFS_RPOOL}" [ "$quiet" != "y" ] && zfs_log_end_msg fi if [ -z "${POOL_IMPORTED}" ] then # No pool imported, this is serious! disable_plymouth echo "" echo "Command: $ZFS_CMD" echo "Message: $ZFS_STDERR" echo "Error: $ZFS_ERROR" echo "" echo "No pool imported. Manually import the root pool" echo "at the command prompt and then exit." echo "Hint: Try: zpool import -R ${rootmnt} -N ${ZFS_RPOOL}" shell fi # In case the pool was specified as guid, resolve guid to name pool="$("${ZPOOL}" get name,guid -o name,value -H | \ awk -v pool="${ZFS_RPOOL}" '$2 == pool { print $1 }')" if [ -n "$pool" ]; then # If $ZFS_BOOTFS contains guid, replace the guid portion with $pool ZFS_BOOTFS=$(echo "$ZFS_BOOTFS" | \ sed -e "s/$("${ZPOOL}" get guid -o value "$pool" -H)/$pool/g") ZFS_RPOOL="${pool}" fi # Set the no-op scheduler on the disks containing the vdevs of # the root pool. For single-queue devices, this scheduler is # "noop", for multi-queue devices, it is "none". # ZFS already does this for wholedisk vdevs (for all pools), so this # is only important for partitions. "${ZPOOL}" status -L "${ZFS_RPOOL}" 2> /dev/null | awk '/^\t / && !/(mirror|raidz)/ { dev=$1; sub(/[0-9]+$/, "", dev); print dev }' | while read -r i do SCHEDULER=/sys/block/$i/queue/scheduler if [ -e "${SCHEDULER}" ] then # Query to see what schedulers are available case "$(cat "${SCHEDULER}")" in *noop*) echo noop > "${SCHEDULER}" ;; *none*) echo none > "${SCHEDULER}" ;; esac fi done # ---------------------------------------------------------------- # P R E P A R E R O O T F I L E S Y S T E M if [ -n "${ZFS_BOOTFS}" ] then # Booting from a snapshot? # Will overwrite the ZFS_BOOTFS variable like so: # rpool/ROOT/debian@snap2 => rpool/ROOT/debian_snap2 echo "${ZFS_BOOTFS}" | grep -q '@' && \ setup_snapshot_booting "${ZFS_BOOTFS}" fi if [ -z "${ZFS_BOOTFS}" ] then # Still nothing! Let the user sort this out. disable_plymouth echo "" echo "Error: Unknown root filesystem - no 'bootfs' pool property and" echo " not specified on the kernel command line." echo "" echo "Manually mount the root filesystem on $rootmnt and then exit." echo "Hint: Try: mount -o zfsutil -t zfs ${ZFS_RPOOL-rpool}/ROOT/system $rootmnt" shell fi # ---------------------------------------------------------------- # M O U N T F I L E S Y S T E M S # * Ideally, the root filesystem would be mounted like this: # # zpool import -R "$rootmnt" -N "$ZFS_RPOOL" # zfs mount -o mountpoint=/ "${ZFS_BOOTFS}" # # but the MOUNTPOINT prefix is preserved on descendent filesystem # after the pivot into the regular root, which later breaks things # like `zfs mount -a` and the /proc/self/mounts refresh. # # * Mount additional filesystems required # Such as /usr, /var, /usr/local etc. # NOTE: Mounted in the order specified in the # ZFS_INITRD_ADDITIONAL_DATASETS variable so take care! # Go through the complete list (recursively) of all filesystems below # the real root dataset filesystems=$("${ZFS}" list -oname -tfilesystem -H -r "${ZFS_BOOTFS}") for fs in $filesystems $ZFS_INITRD_ADDITIONAL_DATASETS do mount_fs "$fs" done touch /run/zfs_unlock_complete if [ -e /run/zfs_unlock_complete_notify ]; then - read zfs_unlock_complete_notify < /run/zfs_unlock_complete_notify + read -r zfs_unlock_complete_notify < /run/zfs_unlock_complete_notify fi # ------------ # Debugging information if [ -n "${ZFS_DEBUG}" ] then #exec 2>&1- echo "DEBUG: imported pools:" "${ZPOOL}" list -H echo echo "DEBUG: mounted ZFS filesystems:" mount | grep zfs echo echo "=> waiting for ENTER before continuing because of 'zfsdebug=1'. " - echo -n " 'c' for shell, 'r' for reboot, 'ENTER' to continue. " - read b + printf "%s" " 'c' for shell, 'r' for reboot, 'ENTER' to continue. " + read -r b [ "$b" = "c" ] && /bin/sh [ "$b" = "r" ] && reboot -f set +x fi # ------------ # Run local bottom script - if type run_scripts > /dev/null 2>&1 && \ - [ -f "/scripts/local-bottom" -o -d "/scripts/local-bottom" ] + if command -v run_scripts > /dev/null 2>&1 then - [ "$quiet" != "y" ] && \ - zfs_log_begin_msg "Running /scripts/local-bottom" - run_scripts /scripts/local-bottom - [ "$quiet" != "y" ] && zfs_log_end_msg + if [ -f "/scripts/local-bottom" ] || [ -d "/scripts/local-bottom" ] + then + [ "$quiet" != "y" ] && \ + zfs_log_begin_msg "Running /scripts/local-bottom" + run_scripts /scripts/local-bottom + [ "$quiet" != "y" ] && zfs_log_end_msg + fi fi } Index: vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-cryptohash.diff =================================================================== --- vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-cryptohash.diff (nonexistent) +++ vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-cryptohash.diff (revision 366348) @@ -0,0 +1,17 @@ +cryptohash.h was dropped and merged with crypto/sha.sh in 5.8 kernel. Details in: +https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=228c4f265c6eb60eaa4ed0edb3bf7c113173576c + +--- +diff --git a/quickassist/utilities/osal/src/linux/kernel_space/OsalCryptoInterface.c b/quickassist/utilities/osal/src/linux/kernel_space/OsalCryptoInterface.c +index 4c389da..e602377 100644 +--- a/quickassist/utilities/osal/src/linux/kernel_space/OsalCryptoInterface.c ++++ b/quickassist/utilities/osal/src/linux/kernel_space/OsalCryptoInterface.c +@@ -66,7 +66,7 @@ + + #include "Osal.h" + #include +-#include ++#include + #include + #if (LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,29)) + #include Index: vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-pci_aer.diff =================================================================== --- vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-pci_aer.diff (nonexistent) +++ vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-pci_aer.diff (revision 366348) @@ -0,0 +1,20 @@ +In kernel 5.7 the pci_cleanup_aer_uncorrect_error_status() function was +renamed with the following commit: + +git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=894020fdd88c1e9a74c60b67c0f19f1c7696ba2f + +This simply updates the function call with the proper name (pci_aer_clear_nonfatal_status()). + +--- +diff --git a/quickassist/qat/drivers/crypto/qat/qat_common/adf_aer.c b/quickassist/qat/drivers/crypto/qat/qat_common/adf_aer.c +index a6ce6df..545bb79 100644 +--- a/quickassist/qat/drivers/crypto/qat/qat_common/adf_aer.c ++++ b/quickassist/qat/drivers/crypto/qat/qat_common/adf_aer.c +@@ -304,7 +304,7 @@ static pci_ers_result_t adf_slot_reset(struct pci_dev *pdev) + pr_err("QAT: Can't find acceleration device\n"); + return PCI_ERS_RESULT_DISCONNECT; + } +- pci_cleanup_aer_uncorrect_error_status(pdev); ++ pci_aer_clear_nonfatal_status(pdev); + if (adf_dev_aer_schedule_reset(accel_dev, ADF_DEV_RESET_SYNC)) + return PCI_ERS_RESULT_DISCONNECT; Index: vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-timespec.diff =================================================================== --- vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-timespec.diff (nonexistent) +++ vendor-sys/openzfs/dist/contrib/intel_qat/patch/0001-timespec.diff (revision 366348) @@ -0,0 +1,35 @@ +This patch attempts to expose timespec and getnstimeofday which were +explicitly hidden in the 5.6 kernel with the introduction of the +following commits: + +git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c766d1472c70d25ad475cf56042af1652e792b23 +git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=412c53a680a97cb1ae2c0ab60230e193bee86387 + +Code received from users@dpdk.org, issue tracked under QATE-59888. + +--- +diff --git a/quickassist/lookaside/access_layer/src/sample_code/performance/framework/linux/kernel_space/cpa_sample_code_utils.c b/quickassist/lookaside/access_layer/src/sample_code/performance/framework/linux/kernel_space/cpa_sample_code_utils.c +index 4639834..523e376 100644 +--- a/quickassist/lookaside/access_layer/src/sample_code/performance/framework/linux/kernel_space/cpa_sample_code_utils.c ++++ b/quickassist/lookaside/access_layer/src/sample_code/performance/framework/linux/kernel_space/cpa_sample_code_utils.c +@@ -107,6 +107,8 @@ atomic_t arrived; + extern struct device perf_device; + #endif + ++#define timespec timespec64 ++#define getnstimeofday ktime_get_real_ts64 + + /* Define a number for timeout */ + #define SAMPLE_CODE_MAX_LONG (0x7FFFFFFF) +diff --git a/quickassist/qat/compat/qat_compat.h b/quickassist/qat/compat/qat_compat.h +index 2a02eaf..3515092 100644 +--- a/quickassist/qat/compat/qat_compat.h ++++ b/quickassist/qat/compat/qat_compat.h +@@ -466,4 +466,7 @@ static inline void pci_ignore_hotplug(struct pci_dev *dev) + #if (RHEL_RELEASE_CODE && RHEL_RELEASE_VERSION(7, 3) <= RHEL_RELEASE_CODE) + #define QAT_KPT_CAP_DISCOVERY + #endif ++ ++#define timespec timespec64 ++#define getnstimeofday ktime_get_real_ts64 + #endif /* _QAT_COMPAT_H_ */ Index: vendor-sys/openzfs/dist/contrib/intel_qat/patch/LICENSE =================================================================== --- vendor-sys/openzfs/dist/contrib/intel_qat/patch/LICENSE (nonexistent) +++ vendor-sys/openzfs/dist/contrib/intel_qat/patch/LICENSE (revision 366348) @@ -0,0 +1,30 @@ +BSD LICENSE + +Copyright (c) Intel Corporation. +All rights reserved. + +Redistribution and use in source and binary forms, with or without +modification, are permitted provided that the following conditions +are met: + + * Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + * Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in + the documentation and/or other materials provided with the + distribution. + * Neither the name of Intel Corporation nor the names of its + contributors may be used to endorse or promote products derived + from this software without specific prior written permission. + +THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS +"AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT +LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR +A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT +OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, +SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT +LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, +DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY +THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT +(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE +OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Index: vendor-sys/openzfs/dist/contrib/intel_qat/readme.md =================================================================== --- vendor-sys/openzfs/dist/contrib/intel_qat/readme.md (nonexistent) +++ vendor-sys/openzfs/dist/contrib/intel_qat/readme.md (revision 366348) @@ -0,0 +1,27 @@ +# Intel_QAT easy install script + +This contrib contains community compatibility patches to get Intel QAT working on the following kernel versions: +- 5.6 +- 5.7 +- 5.8 + +These patches are based on the following Intel QAT version: +[1.7.l.4.10.0-00014](https://01.org/sites/default/files/downloads/qat1.7.l.4.10.0-00014.tar.gz) + +When using QAT with above kernels versions, the following patches needs to be applied using: +patch -p1 < _$PATCH_ +_Where $PATCH refers to the path of the patch in question_ + +### 5.6 +/patch/0001-timespec.diff + +### 5.7 +/patch/0001-pci_aer.diff + +### 5.8 +/patch/0001-cryptohash.diff + + +_Patches are supplied by [Storage Performance Development Kit (SPDK)](https://github.com/spdk/spdk)_ + + Index: vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/kstat.h =================================================================== --- vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/kstat.h (revision 366347) +++ vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/kstat.h (revision 366348) @@ -1,207 +1,225 @@ /* * Copyright (C) 2007-2010 Lawrence Livermore National Security, LLC. * Copyright (C) 2007 The Regents of the University of California. * Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). * Written by Brian Behlendorf . * UCRL-CODE-235197 * * This file is part of the SPL, Solaris Porting Layer. * For details, see . * * The SPL is free software; you can redistribute it and/or modify it * under the terms of the GNU General Public License as published by the * Free Software Foundation; either version 2 of the License, or (at your * option) any later version. * * The SPL is distributed in the hope that it will be useful, but WITHOUT * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or * FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License * for more details. * * You should have received a copy of the GNU General Public License along * with the SPL. If not, see . */ #ifndef _SPL_KSTAT_H #define _SPL_KSTAT_H #include #include struct list_head {}; #include #include #define KSTAT_STRLEN 255 #define KSTAT_RAW_MAX (128*1024) /* * For reference valid classes are: * disk, tape, net, controller, vm, kvm, hat, streams, kstat, misc */ #define KSTAT_TYPE_RAW 0 /* can be anything; ks_ndata >= 1 */ #define KSTAT_TYPE_NAMED 1 /* name/value pair; ks_ndata >= 1 */ #define KSTAT_TYPE_INTR 2 /* interrupt stats; ks_ndata == 1 */ #define KSTAT_TYPE_IO 3 /* I/O stats; ks_ndata == 1 */ #define KSTAT_TYPE_TIMER 4 /* event timer; ks_ndata >= 1 */ #define KSTAT_NUM_TYPES 5 #define KSTAT_DATA_CHAR 0 #define KSTAT_DATA_INT32 1 #define KSTAT_DATA_UINT32 2 #define KSTAT_DATA_INT64 3 #define KSTAT_DATA_UINT64 4 #define KSTAT_DATA_LONG 5 #define KSTAT_DATA_ULONG 6 #define KSTAT_DATA_STRING 7 #define KSTAT_NUM_DATAS 8 #define KSTAT_INTR_HARD 0 #define KSTAT_INTR_SOFT 1 #define KSTAT_INTR_WATCHDOG 2 #define KSTAT_INTR_SPURIOUS 3 #define KSTAT_INTR_MULTSVC 4 #define KSTAT_NUM_INTRS 5 #define KSTAT_FLAG_VIRTUAL 0x01 #define KSTAT_FLAG_VAR_SIZE 0x02 #define KSTAT_FLAG_WRITABLE 0x04 #define KSTAT_FLAG_PERSISTENT 0x08 #define KSTAT_FLAG_DORMANT 0x10 #define KSTAT_FLAG_INVALID 0x20 #define KSTAT_FLAG_LONGSTRINGS 0x40 #define KSTAT_FLAG_NO_HEADERS 0x80 #define KS_MAGIC 0x9d9d9d9d /* Dynamic updates */ #define KSTAT_READ 0 #define KSTAT_WRITE 1 struct kstat_s; typedef struct kstat_s kstat_t; typedef int kid_t; /* unique kstat id */ typedef int kstat_update_t(struct kstat_s *, int); /* dynamic update cb */ +struct seq_file { + char *sf_buf; + size_t sf_size; +}; + +void seq_printf(struct seq_file *m, const char *fmt, ...); + + typedef struct kstat_module { char ksm_name[KSTAT_STRLEN+1]; /* module name */ struct list_head ksm_module_list; /* module linkage */ struct list_head ksm_kstat_list; /* list of kstat entries */ struct proc_dir_entry *ksm_proc; /* proc entry */ } kstat_module_t; typedef struct kstat_raw_ops { int (*headers)(char *buf, size_t size); + int (*seq_headers)(struct seq_file *); int (*data)(char *buf, size_t size, void *data); void *(*addr)(kstat_t *ksp, loff_t index); } kstat_raw_ops_t; struct kstat_s { int ks_magic; /* magic value */ kid_t ks_kid; /* unique kstat ID */ hrtime_t ks_crtime; /* creation time */ hrtime_t ks_snaptime; /* last access time */ char ks_module[KSTAT_STRLEN+1]; /* provider module name */ int ks_instance; /* provider module instance */ char ks_name[KSTAT_STRLEN+1]; /* kstat name */ char ks_class[KSTAT_STRLEN+1]; /* kstat class */ uchar_t ks_type; /* kstat data type */ uchar_t ks_flags; /* kstat flags */ void *ks_data; /* kstat type-specific data */ uint_t ks_ndata; /* # of data records */ size_t ks_data_size; /* size of kstat data section */ kstat_update_t *ks_update; /* dynamic updates */ void *ks_private; /* private data */ + void *ks_private1; /* private data */ kmutex_t ks_private_lock; /* kstat private data lock */ kmutex_t *ks_lock; /* kstat data lock */ struct list_head ks_list; /* kstat linkage */ kstat_module_t *ks_owner; /* kstat module linkage */ kstat_raw_ops_t ks_raw_ops; /* ops table for raw type */ char *ks_raw_buf; /* buf used for raw ops */ size_t ks_raw_bufsize; /* size of raw ops buffer */ struct sysctl_ctx_list ks_sysctl_ctx; struct sysctl_oid *ks_sysctl_root; }; typedef struct kstat_named_s { char name[KSTAT_STRLEN]; /* name of counter */ uchar_t data_type; /* data type */ union { char c[16]; /* 128-bit int */ int32_t i32; /* 32-bit signed int */ uint32_t ui32; /* 32-bit unsigned int */ int64_t i64; /* 64-bit signed int */ uint64_t ui64; /* 64-bit unsigned int */ long l; /* native signed long */ ulong_t ul; /* native unsigned long */ struct { union { char *ptr; /* NULL-term string */ char __pad[8]; /* 64-bit padding */ } addr; uint32_t len; /* # bytes for strlen + '\0' */ } string; } value; } kstat_named_t; #define KSTAT_NAMED_STR_PTR(knptr) ((knptr)->value.string.addr.ptr) #define KSTAT_NAMED_STR_BUFLEN(knptr) ((knptr)->value.string.len) typedef struct kstat_intr { uint_t intrs[KSTAT_NUM_INTRS]; } kstat_intr_t; typedef struct kstat_io { u_longlong_t nread; /* number of bytes read */ u_longlong_t nwritten; /* number of bytes written */ uint_t reads; /* number of read operations */ uint_t writes; /* number of write operations */ hrtime_t wtime; /* cumulative wait (pre-service) time */ hrtime_t wlentime; /* cumulative wait len*time product */ hrtime_t wlastupdate; /* last time wait queue changed */ hrtime_t rtime; /* cumulative run (service) time */ hrtime_t rlentime; /* cumulative run length*time product */ hrtime_t rlastupdate; /* last time run queue changed */ uint_t wcnt; /* count of elements in wait state */ uint_t rcnt; /* count of elements in run state */ } kstat_io_t; typedef struct kstat_timer { char name[KSTAT_STRLEN+1]; /* event name */ u_longlong_t num_events; /* number of events */ hrtime_t elapsed_time; /* cumulative elapsed time */ hrtime_t min_time; /* shortest event duration */ hrtime_t max_time; /* longest event duration */ hrtime_t start_time; /* previous event start time */ hrtime_t stop_time; /* previous event stop time */ } kstat_timer_t; int spl_kstat_init(void); void spl_kstat_fini(void); extern void __kstat_set_raw_ops(kstat_t *ksp, int (*headers)(char *buf, size_t size), int (*data)(char *buf, size_t size, void *data), void* (*addr)(kstat_t *ksp, loff_t index)); +extern void __kstat_set_seq_raw_ops(kstat_t *ksp, + int (*headers)(struct seq_file *), + int (*data)(char *buf, size_t size, void *data), + void* (*addr)(kstat_t *ksp, loff_t index)); + + extern kstat_t *__kstat_create(const char *ks_module, int ks_instance, const char *ks_name, const char *ks_class, uchar_t ks_type, uint_t ks_ndata, uchar_t ks_flags); extern void __kstat_install(kstat_t *ksp); extern void __kstat_delete(kstat_t *ksp); extern void kstat_waitq_enter(kstat_io_t *); extern void kstat_waitq_exit(kstat_io_t *); extern void kstat_runq_enter(kstat_io_t *); extern void kstat_runq_exit(kstat_io_t *); +#define kstat_set_seq_raw_ops(k, h, d, a) \ + __kstat_set_seq_raw_ops(k, h, d, a) #define kstat_set_raw_ops(k, h, d, a) \ __kstat_set_raw_ops(k, h, d, a) #define kstat_create(m, i, n, c, t, s, f) \ __kstat_create(m, i, n, c, t, s, f) #define kstat_install(k) __kstat_install(k) #define kstat_delete(k) __kstat_delete(k) #endif /* _SPL_KSTAT_H */ Index: vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/procfs_list.h =================================================================== --- vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/procfs_list.h (revision 366347) +++ vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/procfs_list.h (revision 366348) @@ -1,64 +1,67 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2018 by Delphix. All rights reserved. */ #ifndef _SPL_PROCFS_LIST_H #define _SPL_PROCFS_LIST_H #include #include /* * procfs list manipulation */ -struct seq_file { }; -void seq_printf(struct seq_file *m, const char *fmt, ...); - -typedef struct procfs_list { +typedef struct procfs_list procfs_list_t; +struct procfs_list { void *pl_private; + void *pl_next_data; kmutex_t pl_lock; list_t pl_list; uint64_t pl_next_id; + int (*pl_show)(struct seq_file *f, void *p); + int (*pl_show_header)(struct seq_file *f); + int (*pl_clear)(procfs_list_t *procfs_list); size_t pl_node_offset; -} procfs_list_t; +}; typedef struct procfs_list_node { list_node_t pln_link; uint64_t pln_id; } procfs_list_node_t; void procfs_list_install(const char *module, + const char *submodule, const char *name, mode_t mode, procfs_list_t *procfs_list, int (*show)(struct seq_file *f, void *p), int (*show_header)(struct seq_file *f), int (*clear)(procfs_list_t *procfs_list), size_t procfs_list_node_off); void procfs_list_uninstall(procfs_list_t *procfs_list); void procfs_list_destroy(procfs_list_t *procfs_list); void procfs_list_add(procfs_list_t *procfs_list, void *p); #endif /* _SPL_PROCFS_LIST_H */ Index: vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/simd_x86.h =================================================================== --- vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/simd_x86.h (revision 366347) +++ vendor-sys/openzfs/dist/include/os/freebsd/spl/sys/simd_x86.h (revision 366348) @@ -1,300 +1,300 @@ /* * Copyright (c) 2020 iXsystems, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * $FreeBSD$ */ #include #include #include +#include #ifdef __i386__ #include #else #include #endif #include #include #define kfpu_init() (0) #define kfpu_fini() do {} while (0) #define kfpu_allowed() 1 #define kfpu_initialize(tsk) do {} while (0) -#define kfpu_begin() { \ - critical_enter(); \ - fpu_kern_enter(curthread, NULL, FPU_KERN_NOCTX); \ +#define kfpu_begin() { \ + if (__predict_false(!is_fpu_kern_thread(0))) \ + fpu_kern_enter(curthread, NULL, FPU_KERN_NOCTX);\ } -#define kfpu_end() \ - { \ - fpu_kern_leave(curthread, NULL); \ - critical_exit(); \ - } +#define kfpu_end() { \ + if (__predict_false(curpcb->pcb_flags & PCB_FPUNOSAVE)) \ + fpu_kern_leave(curthread, NULL); \ +} /* * Check if OS supports AVX and AVX2 by checking XCR0 * Only call this function if CPUID indicates that AVX feature is * supported by the CPU, otherwise it might be an illegal instruction. */ static inline uint64_t xgetbv(uint32_t index) { uint32_t eax, edx; /* xgetbv - instruction byte code */ __asm__ __volatile__(".byte 0x0f; .byte 0x01; .byte 0xd0" : "=a" (eax), "=d" (edx) : "c" (index)); return ((((uint64_t)edx)<<32) | (uint64_t)eax); } /* * Detect register set support */ static inline boolean_t __simd_state_enabled(const uint64_t state) { boolean_t has_osxsave; uint64_t xcr0; has_osxsave = !!(cpu_feature2 & CPUID2_OSXSAVE); if (!has_osxsave) return (B_FALSE); xcr0 = xgetbv(0); return ((xcr0 & state) == state); } #define _XSTATE_SSE_AVX (0x2 | 0x4) #define _XSTATE_AVX512 (0xE0 | _XSTATE_SSE_AVX) #define __ymm_enabled() __simd_state_enabled(_XSTATE_SSE_AVX) #define __zmm_enabled() __simd_state_enabled(_XSTATE_AVX512) /* * Check if SSE instruction set is available */ static inline boolean_t zfs_sse_available(void) { return (!!(cpu_feature & CPUID_SSE)); } /* * Check if SSE2 instruction set is available */ static inline boolean_t zfs_sse2_available(void) { return (!!(cpu_feature & CPUID_SSE2)); } /* * Check if SSE3 instruction set is available */ static inline boolean_t zfs_sse3_available(void) { return (!!(cpu_feature2 & CPUID2_SSE3)); } /* * Check if SSSE3 instruction set is available */ static inline boolean_t zfs_ssse3_available(void) { return (!!(cpu_feature2 & CPUID2_SSSE3)); } /* * Check if SSE4.1 instruction set is available */ static inline boolean_t zfs_sse4_1_available(void) { return (!!(cpu_feature2 & CPUID2_SSE41)); } /* * Check if SSE4.2 instruction set is available */ static inline boolean_t zfs_sse4_2_available(void) { return (!!(cpu_feature2 & CPUID2_SSE42)); } /* * Check if AVX instruction set is available */ static inline boolean_t zfs_avx_available(void) { boolean_t has_avx; has_avx = !!(cpu_feature2 & CPUID2_AVX); return (has_avx && __ymm_enabled()); } /* * Check if AVX2 instruction set is available */ static inline boolean_t zfs_avx2_available(void) { boolean_t has_avx2; has_avx2 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX2); return (has_avx2 && __ymm_enabled()); } /* * AVX-512 family of instruction sets: * * AVX512F Foundation * AVX512CD Conflict Detection Instructions * AVX512ER Exponential and Reciprocal Instructions * AVX512PF Prefetch Instructions * * AVX512BW Byte and Word Instructions * AVX512DQ Double-word and Quadword Instructions * AVX512VL Vector Length Extensions * * AVX512IFMA Integer Fused Multiply Add (Not supported by kernel 4.4) * AVX512VBMI Vector Byte Manipulation Instructions */ /* Check if AVX512F instruction set is available */ static inline boolean_t zfs_avx512f_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512CD instruction set is available */ static inline boolean_t zfs_avx512cd_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F) && !!(cpu_stdext_feature & CPUID_STDEXT_AVX512CD); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512ER instruction set is available */ static inline boolean_t zfs_avx512er_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F) && !!(cpu_stdext_feature & CPUID_STDEXT_AVX512CD); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512PF instruction set is available */ static inline boolean_t zfs_avx512pf_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F) && !!(cpu_stdext_feature & CPUID_STDEXT_AVX512PF); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512BW instruction set is available */ static inline boolean_t zfs_avx512bw_available(void) { boolean_t has_avx512 = B_FALSE; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512BW); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512DQ instruction set is available */ static inline boolean_t zfs_avx512dq_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F) && !!(cpu_stdext_feature & CPUID_STDEXT_AVX512DQ); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512VL instruction set is available */ static inline boolean_t zfs_avx512vl_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F) && !!(cpu_stdext_feature & CPUID_STDEXT_AVX512VL); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512IFMA instruction set is available */ static inline boolean_t zfs_avx512ifma_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F) && !!(cpu_stdext_feature & CPUID_STDEXT_AVX512IFMA); return (has_avx512 && __zmm_enabled()); } /* Check if AVX512VBMI instruction set is available */ static inline boolean_t zfs_avx512vbmi_available(void) { boolean_t has_avx512; has_avx512 = !!(cpu_stdext_feature & CPUID_STDEXT_AVX512F) && !!(cpu_stdext_feature & CPUID_STDEXT_BMI1); return (has_avx512 && __zmm_enabled()); } Index: vendor-sys/openzfs/dist/include/os/linux/spl/sys/procfs_list.h =================================================================== --- vendor-sys/openzfs/dist/include/os/linux/spl/sys/procfs_list.h (revision 366347) +++ vendor-sys/openzfs/dist/include/os/linux/spl/sys/procfs_list.h (revision 366348) @@ -1,72 +1,73 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2018 by Delphix. All rights reserved. */ #ifndef _SPL_PROCFS_LIST_H #define _SPL_PROCFS_LIST_H #include #include #include #include typedef struct procfs_list procfs_list_t; struct procfs_list { /* Accessed only by user of a procfs_list */ void *pl_private; /* * Accessed both by user of a procfs_list and by procfs_list * implementation */ kmutex_t pl_lock; list_t pl_list; /* Accessed only by procfs_list implementation */ uint64_t pl_next_id; int (*pl_show)(struct seq_file *f, void *p); int (*pl_show_header)(struct seq_file *f); int (*pl_clear)(procfs_list_t *procfs_list); size_t pl_node_offset; kstat_proc_entry_t pl_kstat_entry; }; typedef struct procfs_list_node { list_node_t pln_link; uint64_t pln_id; } procfs_list_node_t; void procfs_list_install(const char *module, + const char *submodule, const char *name, mode_t mode, procfs_list_t *procfs_list, int (*show)(struct seq_file *f, void *p), int (*show_header)(struct seq_file *f), int (*clear)(procfs_list_t *procfs_list), size_t procfs_list_node_off); void procfs_list_uninstall(procfs_list_t *procfs_list); void procfs_list_destroy(procfs_list_t *procfs_list); void procfs_list_add(procfs_list_t *procfs_list, void *p); #endif /* _SPL_PROCFS_LIST_H */ Index: vendor-sys/openzfs/dist/include/sys/frame.h =================================================================== --- vendor-sys/openzfs/dist/include/sys/frame.h (revision 366347) +++ vendor-sys/openzfs/dist/include/sys/frame.h (revision 366348) @@ -1,36 +1,37 @@ /* * CDDL HEADER START * * This file and its contents are supplied under the terms of the * Common Development and Distribution License ("CDDL"), version 1.0. * You may only use this file in accordance with the terms of version * 1.0 of the CDDL. * * A full copy of the text of the CDDL should have accompanied this * source. A copy of the CDDL is also available via the Internet at * http://www.illumos.org/license/CDDL. * * CDDL HEADER END */ /* * Copyright (C) 2017 by Lawrence Livermore National Security, LLC. */ #ifndef _SYS_FRAME_H #define _SYS_FRAME_H #ifdef __cplusplus extern "C" { #endif -#if defined(__KERNEL__) && defined(HAVE_STACK_FRAME_NON_STANDARD) +#if defined(__KERNEL__) && defined(HAVE_KERNEL_OBJTOOL) && \ + defined(HAVE_STACK_FRAME_NON_STANDARD) #include #else #define STACK_FRAME_NON_STANDARD(func) #endif #ifdef __cplusplus } #endif #endif /* _SYS_FRAME_H */ Index: vendor-sys/openzfs/dist/include/sys/lua/luaconf.h =================================================================== --- vendor-sys/openzfs/dist/include/sys/lua/luaconf.h (revision 366347) +++ vendor-sys/openzfs/dist/include/sys/lua/luaconf.h (revision 366348) @@ -1,562 +1,558 @@ /* BEGIN CSTYLED */ /* ** $Id: luaconf.h,v 1.176.1.2 2013/11/21 17:26:16 roberto Exp $ ** Configuration file for Lua ** See Copyright Notice in lua.h */ #ifndef lconfig_h #define lconfig_h #include extern ssize_t lcompat_sprintf(char *, size_t size, const char *, ...); extern int64_t lcompat_strtoll(const char *, char **); extern int64_t lcompat_pow(int64_t, int64_t); extern int lcompat_hashnum(int64_t); /* ** ================================================================== ** Search for "@@" to find all configurable definitions. ** =================================================================== */ /* @@ LUA_ANSI controls the use of non-ansi features. ** CHANGE it (define it) if you want Lua to avoid the use of any ** non-ansi feature or library. */ #if !defined(LUA_ANSI) && defined(__STRICT_ANSI__) #define LUA_ANSI #endif #if !defined(LUA_ANSI) && defined(_WIN32) && !defined(_WIN32_WCE) #define LUA_WIN /* enable goodies for regular Windows platforms */ #endif #if defined(LUA_WIN) #define LUA_DL_DLL #define LUA_USE_AFORMAT /* assume 'printf' handles 'aA' specifiers */ #endif #if defined(LUA_USE_LINUX) #define LUA_USE_POSIX #define LUA_USE_DLOPEN /* needs an extra library: -ldl */ #define LUA_USE_READLINE /* needs some extra libraries */ #define LUA_USE_STRTODHEX /* assume 'strtod' handles hex formats */ #define LUA_USE_AFORMAT /* assume 'printf' handles 'aA' specifiers */ #define LUA_USE_LONGLONG /* assume support for long long */ #endif #if defined(LUA_USE_MACOSX) #define LUA_USE_POSIX #define LUA_USE_DLOPEN /* does not need -ldl */ #define LUA_USE_READLINE /* needs an extra library: -lreadline */ #define LUA_USE_STRTODHEX /* assume 'strtod' handles hex formats */ #define LUA_USE_AFORMAT /* assume 'printf' handles 'aA' specifiers */ #define LUA_USE_LONGLONG /* assume support for long long */ #endif /* @@ LUA_USE_POSIX includes all functionality listed as X/Open System @* Interfaces Extension (XSI). ** CHANGE it (define it) if your system is XSI compatible. */ #if defined(LUA_USE_POSIX) #define LUA_USE_MKSTEMP #define LUA_USE_ISATTY #define LUA_USE_POPEN #define LUA_USE_ULONGJMP #define LUA_USE_GMTIME_R #endif /* @@ LUA_PATH_DEFAULT is the default path that Lua uses to look for @* Lua libraries. @@ LUA_CPATH_DEFAULT is the default path that Lua uses to look for @* C libraries. ** CHANGE them if your machine has a non-conventional directory ** hierarchy or if you want to install your libraries in ** non-conventional directories. */ #if defined(_WIN32) /* { */ /* ** In Windows, any exclamation mark ('!') in the path is replaced by the ** path of the directory of the executable file of the current process. */ #define LUA_LDIR "!\\lua\\" #define LUA_CDIR "!\\" #define LUA_PATH_DEFAULT \ LUA_LDIR"?.lua;" LUA_LDIR"?\\init.lua;" \ LUA_CDIR"?.lua;" LUA_CDIR"?\\init.lua;" ".\\?.lua" #define LUA_CPATH_DEFAULT \ LUA_CDIR"?.dll;" LUA_CDIR"loadall.dll;" ".\\?.dll" #else /* }{ */ #define LUA_VDIR LUA_VERSION_MAJOR "." LUA_VERSION_MINOR "/" #define LUA_ROOT "/usr/local/" #define LUA_LDIR LUA_ROOT "share/lua/" LUA_VDIR #define LUA_CDIR LUA_ROOT "lib/lua/" LUA_VDIR #define LUA_PATH_DEFAULT \ LUA_LDIR"?.lua;" LUA_LDIR"?/init.lua;" \ LUA_CDIR"?.lua;" LUA_CDIR"?/init.lua;" "./?.lua" #define LUA_CPATH_DEFAULT \ LUA_CDIR"?.so;" LUA_CDIR"loadall.so;" "./?.so" #endif /* } */ /* @@ LUA_DIRSEP is the directory separator (for submodules). ** CHANGE it if your machine does not use "/" as the directory separator ** and is not Windows. (On Windows Lua automatically uses "\".) */ #if defined(_WIN32) #define LUA_DIRSEP "\\" #else #define LUA_DIRSEP "/" #endif /* @@ LUA_ENV is the name of the variable that holds the current @@ environment, used to access global names. ** CHANGE it if you do not like this name. */ #define LUA_ENV "_ENV" /* @@ LUA_API is a mark for all core API functions. @@ LUALIB_API is a mark for all auxiliary library functions. @@ LUAMOD_API is a mark for all standard library opening functions. ** CHANGE them if you need to define those functions in some special way. ** For instance, if you want to create one Windows DLL with the core and ** the libraries, you may want to use the following definition (define ** LUA_BUILD_AS_DLL to get it). */ #if defined(LUA_BUILD_AS_DLL) /* { */ #if defined(LUA_CORE) || defined(LUA_LIB) /* { */ #define LUA_API __declspec(dllexport) #else /* }{ */ #define LUA_API __declspec(dllimport) #endif /* } */ #else /* }{ */ #define LUA_API extern #endif /* } */ /* more often than not the libs go together with the core */ #define LUALIB_API LUA_API #define LUAMOD_API LUALIB_API /* @@ LUAI_FUNC is a mark for all extern functions that are not to be @* exported to outside modules. @@ LUAI_DDEF and LUAI_DDEC are marks for all extern (const) variables @* that are not to be exported to outside modules (LUAI_DDEF for @* definitions and LUAI_DDEC for declarations). ** CHANGE them if you need to mark them in some special way. Elf/gcc ** (versions 3.2 and later) mark them as "hidden" to optimize access ** when Lua is compiled as a shared library. Not all elf targets support ** this attribute. Unfortunately, gcc does not offer a way to check ** whether the target offers that support, and those without support ** give a warning about it. To avoid these warnings, change to the ** default definition. */ #if defined(__GNUC__) && ((__GNUC__*100 + __GNUC_MINOR__) >= 302) && \ defined(__ELF__) /* { */ #define LUAI_FUNC __attribute__((visibility("hidden"))) extern #define LUAI_DDEC LUAI_FUNC #define LUAI_DDEF /* empty */ #else /* }{ */ #define LUAI_FUNC extern #define LUAI_DDEC extern #define LUAI_DDEF /* empty */ #endif /* } */ /* @@ LUA_QL describes how error messages quote program elements. ** CHANGE it if you want a different appearance. */ #define LUA_QL(x) "'" x "'" #define LUA_QS LUA_QL("%s") /* @@ LUA_IDSIZE gives the maximum size for the description of the source @* of a function in debug information. ** CHANGE it if you want a different size. */ #define LUA_IDSIZE 60 /* @@ luai_writestringerror defines how to print error messages. ** (A format string with one argument is enough for Lua...) */ #ifdef _KERNEL #define luai_writestringerror(s,p) \ (zfs_dbgmsg((s), (p))) #else #define luai_writestringerror(s,p) \ (fprintf(stderr, (s), (p)), fflush(stderr)) #endif /* @@ LUAI_MAXSHORTLEN is the maximum length for short strings, that is, ** strings that are internalized. (Cannot be smaller than reserved words ** or tags for metamethods, as these strings must be internalized; ** #("function") = 8, #("__newindex") = 10.) */ #define LUAI_MAXSHORTLEN 40 /* ** {================================================================== ** Compatibility with previous versions ** =================================================================== */ /* @@ LUA_COMPAT_ALL controls all compatibility options. ** You can define it to get all options, or change specific options ** to fit your specific needs. */ #if defined(LUA_COMPAT_ALL) /* { */ /* @@ LUA_COMPAT_UNPACK controls the presence of global 'unpack'. ** You can replace it with 'table.unpack'. */ #define LUA_COMPAT_UNPACK /* @@ LUA_COMPAT_LOADERS controls the presence of table 'package.loaders'. ** You can replace it with 'package.searchers'. */ #define LUA_COMPAT_LOADERS /* @@ macro 'lua_cpcall' emulates deprecated function lua_cpcall. ** You can call your C function directly (with light C functions). */ #define lua_cpcall(L,f,u) \ (lua_pushcfunction(L, (f)), \ lua_pushlightuserdata(L,(u)), \ lua_pcall(L,1,0,0)) /* @@ LUA_COMPAT_LOG10 defines the function 'log10' in the math library. ** You can rewrite 'log10(x)' as 'log(x, 10)'. */ #define LUA_COMPAT_LOG10 /* @@ LUA_COMPAT_LOADSTRING defines the function 'loadstring' in the base ** library. You can rewrite 'loadstring(s)' as 'load(s)'. */ #define LUA_COMPAT_LOADSTRING /* @@ LUA_COMPAT_MAXN defines the function 'maxn' in the table library. */ #define LUA_COMPAT_MAXN /* @@ The following macros supply trivial compatibility for some ** changes in the API. The macros themselves document how to ** change your code to avoid using them. */ #define lua_strlen(L,i) lua_rawlen(L, (i)) #define lua_objlen(L,i) lua_rawlen(L, (i)) #define lua_equal(L,idx1,idx2) lua_compare(L,(idx1),(idx2),LUA_OPEQ) #define lua_lessthan(L,idx1,idx2) lua_compare(L,(idx1),(idx2),LUA_OPLT) /* @@ LUA_COMPAT_MODULE controls compatibility with previous ** module functions 'module' (Lua) and 'luaL_register' (C). */ #define LUA_COMPAT_MODULE #endif /* } */ /* }================================================================== */ #undef INT_MAX #define INT_MAX 2147483647L /* @@ LUAI_BITSINT defines the number of bits in an int. ** CHANGE here if Lua cannot automatically detect the number of bits of ** your machine. Probably you do not need to change this. */ /* avoid overflows in comparison */ #if INT_MAX-20 < 32760 /* { */ #define LUAI_BITSINT 16 #elif INT_MAX > 2147483640L /* }{ */ /* int has at least 32 bits */ #define LUAI_BITSINT 32 #else /* }{ */ #error "you must define LUA_BITSINT with number of bits in an integer" #endif /* } */ /* @@ LUA_INT32 is a signed integer with exactly 32 bits. @@ LUAI_UMEM is an unsigned integer big enough to count the total @* memory used by Lua. @@ LUAI_MEM is a signed integer big enough to count the total memory @* used by Lua. ** CHANGE here if for some weird reason the default definitions are not ** good enough for your machine. Probably you do not need to change ** this. */ #if LUAI_BITSINT >= 32 /* { */ #define LUA_INT32 int #define LUAI_UMEM size_t #define LUAI_MEM ptrdiff_t #else /* }{ */ /* 16-bit ints */ #define LUA_INT32 long #define LUAI_UMEM unsigned long #define LUAI_MEM long #endif /* } */ /* @@ LUAI_MAXSTACK limits the size of the Lua stack. ** CHANGE it if you need a different limit. This limit is arbitrary; ** its only purpose is to stop Lua from consuming unlimited stack ** space (and to reserve some numbers for pseudo-indices). */ #if LUAI_BITSINT >= 32 #define LUAI_MAXSTACK 1000000 #else #define LUAI_MAXSTACK 15000 #endif /* reserve some space for error handling */ #define LUAI_FIRSTPSEUDOIDX (-LUAI_MAXSTACK - 1000) /* @@ LUAL_BUFFERSIZE is the buffer size used by the lauxlib buffer system. ** CHANGE it if it uses too much C-stack space. */ -#ifdef __linux__ #define LUAL_BUFFERSIZE 512 -#else -#define LUAL_BUFFERSIZE 1024 -#endif /* ** {================================================================== @@ LUA_NUMBER is the type of numbers in Lua. ** CHANGE the following definitions only if you want to build Lua ** with a number type different from double. You may also need to ** change lua_number2int & lua_number2integer. ** =================================================================== */ #define LUA_NUMBER int64_t /* @@ LUAI_UACNUMBER is the result of an 'usual argument conversion' @* over a number. */ #define LUAI_UACNUMBER int64_t /* @@ LUA_NUMBER_SCAN is the format for reading numbers. @@ LUA_NUMBER_FMT is the format for writing numbers. @@ lua_number2str converts a number to a string. @@ LUAI_MAXNUMBER2STR is maximum size of previous conversion. */ #ifndef PRId64 #define PRId64 "lld" #endif #define LUAI_MAXNUMBER2STR 32 /* 16 digits, sign, point, and \0 */ #define LUA_NUMBER_FMT "%" PRId64 #define lua_number2str(s,n) \ lcompat_sprintf((s), LUAI_MAXNUMBER2STR, LUA_NUMBER_FMT, (n)) /* @@ l_mathop allows the addition of an 'l' or 'f' to all math operations */ #define l_mathop(x) (x ## l) /* @@ lua_str2number converts a decimal numeric string to a number. @@ lua_strx2number converts an hexadecimal numeric string to a number. ** In C99, 'strtod' does both conversions. C89, however, has no function ** to convert floating hexadecimal strings to numbers. For these ** systems, you can leave 'lua_strx2number' undefined and Lua will ** provide its own implementation. */ #define lua_str2number(s,p) lcompat_strtoll((s), (p)) #if defined(LUA_USE_STRTODHEX) #define lua_strx2number(s,p) lcompat_strtoll((s), (p)) #endif /* @@ The luai_num* macros define the primitive operations over numbers. */ /* the following operations need the math library */ #if defined(lobject_c) || defined(lvm_c) #define luai_nummod(L,a,b) ((a) % (b)) #define luai_numpow(L,a,b) (lcompat_pow((a),(b))) #endif /* these are quite standard operations */ #if defined(LUA_CORE) #define luai_numadd(L,a,b) ((a)+(b)) #define luai_numsub(L,a,b) ((a)-(b)) #define luai_nummul(L,a,b) ((a)*(b)) #define luai_numdiv(L,a,b) ((a)/(b)) #define luai_numunm(L,a) (-(a)) #define luai_numeq(a,b) ((a)==(b)) #define luai_numlt(L,a,b) ((a)<(b)) #define luai_numle(L,a,b) ((a)<=(b)) #define luai_numisnan(L,a) (!luai_numeq((a), (a))) #endif /* @@ LUA_INTEGER is the integral type used by lua_pushinteger/lua_tointeger. ** CHANGE that if ptrdiff_t is not adequate on your machine. (On most ** machines, ptrdiff_t gives a good choice between int or long.) */ #define LUA_INTEGER ptrdiff_t /* @@ LUA_UNSIGNED is the integral type used by lua_pushunsigned/lua_tounsigned. ** It must have at least 32 bits. */ #define LUA_UNSIGNED uint64_t /* ** Some tricks with doubles */ #if defined(LUA_NUMBER_DOUBLE) && !defined(LUA_ANSI) /* { */ /* ** The next definitions activate some tricks to speed up the ** conversion from doubles to integer types, mainly to LUA_UNSIGNED. ** @@ LUA_MSASMTRICK uses Microsoft assembler to avoid clashes with a ** DirectX idiosyncrasy. ** @@ LUA_IEEE754TRICK uses a trick that should work on any machine ** using IEEE754 with a 32-bit integer type. ** @@ LUA_IEEELL extends the trick to LUA_INTEGER; should only be ** defined when LUA_INTEGER is a 32-bit integer. ** @@ LUA_IEEEENDIAN is the endianness of doubles in your machine ** (0 for little endian, 1 for big endian); if not defined, Lua will ** check it dynamically for LUA_IEEE754TRICK (but not for LUA_NANTRICK). ** @@ LUA_NANTRICK controls the use of a trick to pack all types into ** a single double value, using NaN values to represent non-number ** values. The trick only works on 32-bit machines (ints and pointers ** are 32-bit values) with numbers represented as IEEE 754-2008 doubles ** with conventional endianness (12345678 or 87654321), in CPUs that do ** not produce signaling NaN values (all NaNs are quiet). */ /* Microsoft compiler on a Pentium (32 bit) ? */ #if defined(LUA_WIN) && defined(_MSC_VER) && defined(_M_IX86) /* { */ #define LUA_MSASMTRICK #define LUA_IEEEENDIAN 0 #define LUA_NANTRICK /* pentium 32 bits? */ #elif defined(__i386__) || defined(__i386) || defined(__X86__) /* }{ */ #define LUA_IEEE754TRICK #define LUA_IEEELL #define LUA_IEEEENDIAN 0 #define LUA_NANTRICK /* pentium 64 bits? */ #elif defined(__x86_64) /* }{ */ #define LUA_IEEE754TRICK #define LUA_IEEEENDIAN 0 #elif defined(__POWERPC__) || defined(__ppc__) /* }{ */ #define LUA_IEEE754TRICK #define LUA_IEEEENDIAN 1 #else /* }{ */ /* assume IEEE754 and a 32-bit integer type */ #define LUA_IEEE754TRICK #endif /* } */ #endif /* } */ /* }================================================================== */ /* =================================================================== */ /* ** Local configuration. You can use this space to add your redefinitions ** without modifying the main part of the file. */ #define getlocaledecpoint() ('.') #ifndef abs #define abs(x) (((x) < 0) ? -(x) : (x)) #endif #if !defined(UCHAR_MAX) #define UCHAR_MAX (0xff) #endif #endif /* END CSTYLED */ Index: vendor-sys/openzfs/dist/include/sys/zfs_context.h =================================================================== --- vendor-sys/openzfs/dist/include/sys/zfs_context.h (revision 366347) +++ vendor-sys/openzfs/dist/include/sys/zfs_context.h (revision 366348) @@ -1,767 +1,768 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright 2011 Nexenta Systems, Inc. All rights reserved. * Copyright (c) 2012, 2018 by Delphix. All rights reserved. * Copyright (c) 2012, Joyent, Inc. All rights reserved. */ #ifndef _SYS_ZFS_CONTEXT_H #define _SYS_ZFS_CONTEXT_H #ifdef __cplusplus extern "C" { #endif #ifdef __KERNEL__ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #else /* _KERNEL */ #define _SYS_MUTEX_H #define _SYS_RWLOCK_H #define _SYS_CONDVAR_H #define _SYS_VNODE_H #define _SYS_VFS_H #define _SYS_SUNDDI_H #define _SYS_CALLB_H #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* * Stack */ #define noinline __attribute__((noinline)) #define likely(x) __builtin_expect((x), 1) #define unlikely(x) __builtin_expect((x), 0) /* * Debugging */ /* * Note that we are not using the debugging levels. */ #define CE_CONT 0 /* continuation */ #define CE_NOTE 1 /* notice */ #define CE_WARN 2 /* warning */ #define CE_PANIC 3 /* panic */ #define CE_IGNORE 4 /* print nothing */ /* * ZFS debugging */ extern void dprintf_setup(int *argc, char **argv); extern void cmn_err(int, const char *, ...); extern void vcmn_err(int, const char *, va_list); extern void panic(const char *, ...) __NORETURN; extern void vpanic(const char *, va_list) __NORETURN; #define fm_panic panic extern int aok; /* * DTrace SDT probes have different signatures in userland than they do in * the kernel. If they're being used in kernel code, re-define them out of * existence for their counterparts in libzpool. * * Here's an example of how to use the set-error probes in userland: * zfs$target:::set-error /arg0 == EBUSY/ {stack();} * * Here's an example of how to use DTRACE_PROBE probes in userland: * If there is a probe declared as follows: * DTRACE_PROBE2(zfs__probe_name, uint64_t, blkid, dnode_t *, dn); * Then you can use it as follows: * zfs$target:::probe2 /copyinstr(arg0) == "zfs__probe_name"/ * {printf("%u %p\n", arg1, arg2);} */ #ifdef DTRACE_PROBE #undef DTRACE_PROBE #endif /* DTRACE_PROBE */ #define DTRACE_PROBE(a) #ifdef DTRACE_PROBE1 #undef DTRACE_PROBE1 #endif /* DTRACE_PROBE1 */ #define DTRACE_PROBE1(a, b, c) #ifdef DTRACE_PROBE2 #undef DTRACE_PROBE2 #endif /* DTRACE_PROBE2 */ #define DTRACE_PROBE2(a, b, c, d, e) #ifdef DTRACE_PROBE3 #undef DTRACE_PROBE3 #endif /* DTRACE_PROBE3 */ #define DTRACE_PROBE3(a, b, c, d, e, f, g) #ifdef DTRACE_PROBE4 #undef DTRACE_PROBE4 #endif /* DTRACE_PROBE4 */ #define DTRACE_PROBE4(a, b, c, d, e, f, g, h, i) /* * Tunables. */ typedef struct zfs_kernel_param { const char *name; /* unused stub */ } zfs_kernel_param_t; #define ZFS_MODULE_PARAM(scope_prefix, name_prefix, name, type, perm, desc) #define ZFS_MODULE_PARAM_ARGS void #define ZFS_MODULE_PARAM_CALL(scope_prefix, name_prefix, name, setfunc, \ getfunc, perm, desc) /* * Threads. */ typedef pthread_t kthread_t; #define TS_RUN 0x00000002 #define TS_JOINABLE 0x00000004 #define curthread ((void *)(uintptr_t)pthread_self()) #define kpreempt(x) yield() #define getcomm() "unknown" #define thread_create_named(name, stk, stksize, func, arg, len, \ pp, state, pri) \ zk_thread_create(func, arg, stksize, state) #define thread_create(stk, stksize, func, arg, len, pp, state, pri) \ zk_thread_create(func, arg, stksize, state) #define thread_exit() pthread_exit(NULL) #define thread_join(t) pthread_join((pthread_t)(t), NULL) #define newproc(f, a, cid, pri, ctp, pid) (ENOSYS) /* in libzpool, p0 exists only to have its address taken */ typedef struct proc { uintptr_t this_is_never_used_dont_dereference_it; } proc_t; extern struct proc p0; #define curproc (&p0) #define PS_NONE -1 extern kthread_t *zk_thread_create(void (*func)(void *), void *arg, size_t stksize, int state); #define issig(why) (FALSE) #define ISSIG(thr, why) (FALSE) #define kpreempt_disable() ((void)0) #define kpreempt_enable() ((void)0) #define cond_resched() sched_yield() /* * Mutexes */ typedef struct kmutex { pthread_mutex_t m_lock; pthread_t m_owner; } kmutex_t; #define MUTEX_DEFAULT 0 #define MUTEX_NOLOCKDEP MUTEX_DEFAULT #define MUTEX_HELD(mp) pthread_equal((mp)->m_owner, pthread_self()) #define MUTEX_NOT_HELD(mp) !MUTEX_HELD(mp) extern void mutex_init(kmutex_t *mp, char *name, int type, void *cookie); extern void mutex_destroy(kmutex_t *mp); extern void mutex_enter(kmutex_t *mp); extern void mutex_exit(kmutex_t *mp); extern int mutex_tryenter(kmutex_t *mp); #define NESTED_SINGLE 1 #define mutex_enter_nested(mp, class) mutex_enter(mp) /* * RW locks */ typedef struct krwlock { pthread_rwlock_t rw_lock; pthread_t rw_owner; uint_t rw_readers; } krwlock_t; typedef int krw_t; #define RW_READER 0 #define RW_WRITER 1 #define RW_DEFAULT RW_READER #define RW_NOLOCKDEP RW_READER #define RW_READ_HELD(rw) ((rw)->rw_readers > 0) #define RW_WRITE_HELD(rw) pthread_equal((rw)->rw_owner, pthread_self()) #define RW_LOCK_HELD(rw) (RW_READ_HELD(rw) || RW_WRITE_HELD(rw)) extern void rw_init(krwlock_t *rwlp, char *name, int type, void *arg); extern void rw_destroy(krwlock_t *rwlp); extern void rw_enter(krwlock_t *rwlp, krw_t rw); extern int rw_tryenter(krwlock_t *rwlp, krw_t rw); extern int rw_tryupgrade(krwlock_t *rwlp); extern void rw_exit(krwlock_t *rwlp); #define rw_downgrade(rwlp) do { } while (0) /* * Credentials */ extern uid_t crgetuid(cred_t *cr); extern uid_t crgetruid(cred_t *cr); extern gid_t crgetgid(cred_t *cr); extern int crgetngroups(cred_t *cr); extern gid_t *crgetgroups(cred_t *cr); /* * Condition variables */ typedef pthread_cond_t kcondvar_t; #define CV_DEFAULT 0 #define CALLOUT_FLAG_ABSOLUTE 0x2 extern void cv_init(kcondvar_t *cv, char *name, int type, void *arg); extern void cv_destroy(kcondvar_t *cv); extern void cv_wait(kcondvar_t *cv, kmutex_t *mp); extern int cv_wait_sig(kcondvar_t *cv, kmutex_t *mp); extern int cv_timedwait(kcondvar_t *cv, kmutex_t *mp, clock_t abstime); extern int cv_timedwait_hires(kcondvar_t *cvp, kmutex_t *mp, hrtime_t tim, hrtime_t res, int flag); extern void cv_signal(kcondvar_t *cv); extern void cv_broadcast(kcondvar_t *cv); #define cv_timedwait_io(cv, mp, at) cv_timedwait(cv, mp, at) #define cv_timedwait_idle(cv, mp, at) cv_timedwait(cv, mp, at) #define cv_timedwait_sig(cv, mp, at) cv_timedwait(cv, mp, at) #define cv_wait_io(cv, mp) cv_wait(cv, mp) #define cv_wait_idle(cv, mp) cv_wait(cv, mp) #define cv_wait_io_sig(cv, mp) cv_wait_sig(cv, mp) #define cv_timedwait_sig_hires(cv, mp, t, r, f) \ cv_timedwait_hires(cv, mp, t, r, f) #define cv_timedwait_idle_hires(cv, mp, t, r, f) \ cv_timedwait_hires(cv, mp, t, r, f) /* * Thread-specific data */ #define tsd_get(k) pthread_getspecific(k) #define tsd_set(k, v) pthread_setspecific(k, v) #define tsd_create(kp, d) pthread_key_create((pthread_key_t *)kp, d) #define tsd_destroy(kp) /* nothing */ #ifdef __FreeBSD__ typedef off_t loff_t; #endif /* * kstat creation, installation and deletion */ extern kstat_t *kstat_create(const char *, int, const char *, const char *, uchar_t, ulong_t, uchar_t); extern void kstat_install(kstat_t *); extern void kstat_delete(kstat_t *); extern void kstat_waitq_enter(kstat_io_t *); extern void kstat_waitq_exit(kstat_io_t *); extern void kstat_runq_enter(kstat_io_t *); extern void kstat_runq_exit(kstat_io_t *); extern void kstat_waitq_to_runq(kstat_io_t *); extern void kstat_runq_back_to_waitq(kstat_io_t *); extern void kstat_set_raw_ops(kstat_t *ksp, int (*headers)(char *buf, size_t size), int (*data)(char *buf, size_t size, void *data), void *(*addr)(kstat_t *ksp, loff_t index)); /* * procfs list manipulation */ typedef struct procfs_list { void *pl_private; kmutex_t pl_lock; list_t pl_list; uint64_t pl_next_id; size_t pl_node_offset; } procfs_list_t; #ifndef __cplusplus struct seq_file { }; void seq_printf(struct seq_file *m, const char *fmt, ...); typedef struct procfs_list_node { list_node_t pln_link; uint64_t pln_id; } procfs_list_node_t; void procfs_list_install(const char *module, + const char *submodule, const char *name, mode_t mode, procfs_list_t *procfs_list, int (*show)(struct seq_file *f, void *p), int (*show_header)(struct seq_file *f), int (*clear)(procfs_list_t *procfs_list), size_t procfs_list_node_off); void procfs_list_uninstall(procfs_list_t *procfs_list); void procfs_list_destroy(procfs_list_t *procfs_list); void procfs_list_add(procfs_list_t *procfs_list, void *p); #endif /* * Kernel memory */ #define KM_SLEEP UMEM_NOFAIL #define KM_PUSHPAGE KM_SLEEP #define KM_NOSLEEP UMEM_DEFAULT #define KM_NORMALPRI 0 /* not needed with UMEM_DEFAULT */ #define KMC_NODEBUG UMC_NODEBUG #define KMC_KVMEM 0x0 #define kmem_alloc(_s, _f) umem_alloc(_s, _f) #define kmem_zalloc(_s, _f) umem_zalloc(_s, _f) #define kmem_free(_b, _s) umem_free(_b, _s) #define vmem_alloc(_s, _f) kmem_alloc(_s, _f) #define vmem_zalloc(_s, _f) kmem_zalloc(_s, _f) #define vmem_free(_b, _s) kmem_free(_b, _s) #define kmem_cache_create(_a, _b, _c, _d, _e, _f, _g, _h, _i) \ umem_cache_create(_a, _b, _c, _d, _e, _f, _g, _h, _i) #define kmem_cache_destroy(_c) umem_cache_destroy(_c) #define kmem_cache_alloc(_c, _f) umem_cache_alloc(_c, _f) #define kmem_cache_free(_c, _b) umem_cache_free(_c, _b) #define kmem_debugging() 0 #define kmem_cache_reap_now(_c) umem_cache_reap_now(_c); #define kmem_cache_set_move(_c, _cb) /* nothing */ #define POINTER_INVALIDATE(_pp) /* nothing */ #define POINTER_IS_VALID(_p) 0 typedef umem_cache_t kmem_cache_t; typedef enum kmem_cbrc { KMEM_CBRC_YES, KMEM_CBRC_NO, KMEM_CBRC_LATER, KMEM_CBRC_DONT_NEED, KMEM_CBRC_DONT_KNOW } kmem_cbrc_t; /* * Task queues */ #define TASKQ_NAMELEN 31 typedef uintptr_t taskqid_t; typedef void (task_func_t)(void *); typedef struct taskq_ent { struct taskq_ent *tqent_next; struct taskq_ent *tqent_prev; task_func_t *tqent_func; void *tqent_arg; uintptr_t tqent_flags; } taskq_ent_t; typedef struct taskq { char tq_name[TASKQ_NAMELEN + 1]; kmutex_t tq_lock; krwlock_t tq_threadlock; kcondvar_t tq_dispatch_cv; kcondvar_t tq_wait_cv; kthread_t **tq_threadlist; int tq_flags; int tq_active; int tq_nthreads; int tq_nalloc; int tq_minalloc; int tq_maxalloc; kcondvar_t tq_maxalloc_cv; int tq_maxalloc_wait; taskq_ent_t *tq_freelist; taskq_ent_t tq_task; } taskq_t; #define TQENT_FLAG_PREALLOC 0x1 /* taskq_dispatch_ent used */ #define TASKQ_PREPOPULATE 0x0001 #define TASKQ_CPR_SAFE 0x0002 /* Use CPR safe protocol */ #define TASKQ_DYNAMIC 0x0004 /* Use dynamic thread scheduling */ #define TASKQ_THREADS_CPU_PCT 0x0008 /* Scale # threads by # cpus */ #define TASKQ_DC_BATCH 0x0010 /* Mark threads as batch */ #define TQ_SLEEP KM_SLEEP /* Can block for memory */ #define TQ_NOSLEEP KM_NOSLEEP /* cannot block for memory; may fail */ #define TQ_NOQUEUE 0x02 /* Do not enqueue if can't dispatch */ #define TQ_FRONT 0x08 /* Queue in front */ #define TASKQID_INVALID ((taskqid_t)0) extern taskq_t *system_taskq; extern taskq_t *system_delay_taskq; extern taskq_t *taskq_create(const char *, int, pri_t, int, int, uint_t); #define taskq_create_proc(a, b, c, d, e, p, f) \ (taskq_create(a, b, c, d, e, f)) #define taskq_create_sysdc(a, b, d, e, p, dc, f) \ (taskq_create(a, b, maxclsyspri, d, e, f)) extern taskqid_t taskq_dispatch(taskq_t *, task_func_t, void *, uint_t); extern taskqid_t taskq_dispatch_delay(taskq_t *, task_func_t, void *, uint_t, clock_t); extern void taskq_dispatch_ent(taskq_t *, task_func_t, void *, uint_t, taskq_ent_t *); extern int taskq_empty_ent(taskq_ent_t *); extern void taskq_init_ent(taskq_ent_t *); extern void taskq_destroy(taskq_t *); extern void taskq_wait(taskq_t *); extern void taskq_wait_id(taskq_t *, taskqid_t); extern void taskq_wait_outstanding(taskq_t *, taskqid_t); extern int taskq_member(taskq_t *, kthread_t *); extern taskq_t *taskq_of_curthread(void); extern int taskq_cancel_id(taskq_t *, taskqid_t); extern void system_taskq_init(void); extern void system_taskq_fini(void); #define XVA_MAPSIZE 3 #define XVA_MAGIC 0x78766174 extern char *vn_dumpdir; #define AV_SCANSTAMP_SZ 32 /* length of anti-virus scanstamp */ typedef struct xoptattr { inode_timespec_t xoa_createtime; /* Create time of file */ uint8_t xoa_archive; uint8_t xoa_system; uint8_t xoa_readonly; uint8_t xoa_hidden; uint8_t xoa_nounlink; uint8_t xoa_immutable; uint8_t xoa_appendonly; uint8_t xoa_nodump; uint8_t xoa_settable; uint8_t xoa_opaque; uint8_t xoa_av_quarantined; uint8_t xoa_av_modified; uint8_t xoa_av_scanstamp[AV_SCANSTAMP_SZ]; uint8_t xoa_reparse; uint8_t xoa_offline; uint8_t xoa_sparse; } xoptattr_t; typedef struct vattr { uint_t va_mask; /* bit-mask of attributes */ u_offset_t va_size; /* file size in bytes */ } vattr_t; typedef struct xvattr { vattr_t xva_vattr; /* Embedded vattr structure */ uint32_t xva_magic; /* Magic Number */ uint32_t xva_mapsize; /* Size of attr bitmap (32-bit words) */ uint32_t *xva_rtnattrmapp; /* Ptr to xva_rtnattrmap[] */ uint32_t xva_reqattrmap[XVA_MAPSIZE]; /* Requested attrs */ uint32_t xva_rtnattrmap[XVA_MAPSIZE]; /* Returned attrs */ xoptattr_t xva_xoptattrs; /* Optional attributes */ } xvattr_t; typedef struct vsecattr { uint_t vsa_mask; /* See below */ int vsa_aclcnt; /* ACL entry count */ void *vsa_aclentp; /* pointer to ACL entries */ int vsa_dfaclcnt; /* default ACL entry count */ void *vsa_dfaclentp; /* pointer to default ACL entries */ size_t vsa_aclentsz; /* ACE size in bytes of vsa_aclentp */ } vsecattr_t; #define AT_MODE 0x00002 #define AT_UID 0x00004 #define AT_GID 0x00008 #define AT_FSID 0x00010 #define AT_NODEID 0x00020 #define AT_NLINK 0x00040 #define AT_SIZE 0x00080 #define AT_ATIME 0x00100 #define AT_MTIME 0x00200 #define AT_CTIME 0x00400 #define AT_RDEV 0x00800 #define AT_BLKSIZE 0x01000 #define AT_NBLOCKS 0x02000 #define AT_SEQ 0x08000 #define AT_XVATTR 0x10000 #define CRCREAT 0 #define F_FREESP 11 #define FIGNORECASE 0x80000 /* request case-insensitive lookups */ /* * Random stuff */ #define ddi_get_lbolt() (gethrtime() >> 23) #define ddi_get_lbolt64() (gethrtime() >> 23) #define hz 119 /* frequency when using gethrtime() >> 23 for lbolt */ #define ddi_time_before(a, b) (a < b) #define ddi_time_after(a, b) ddi_time_before(b, a) #define ddi_time_before_eq(a, b) (!ddi_time_after(a, b)) #define ddi_time_after_eq(a, b) ddi_time_before_eq(b, a) #define ddi_time_before64(a, b) (a < b) #define ddi_time_after64(a, b) ddi_time_before64(b, a) #define ddi_time_before_eq64(a, b) (!ddi_time_after64(a, b)) #define ddi_time_after_eq64(a, b) ddi_time_before_eq64(b, a) extern void delay(clock_t ticks); #define SEC_TO_TICK(sec) ((sec) * hz) #define MSEC_TO_TICK(msec) (howmany((hrtime_t)(msec) * hz, MILLISEC)) #define USEC_TO_TICK(usec) (howmany((hrtime_t)(usec) * hz, MICROSEC)) #define NSEC_TO_TICK(nsec) (howmany((hrtime_t)(nsec) * hz, NANOSEC)) #define max_ncpus 64 #define boot_ncpus (sysconf(_SC_NPROCESSORS_ONLN)) /* * Process priorities as defined by setpriority(2) and getpriority(2). */ #define minclsyspri 19 #define maxclsyspri -20 #define defclsyspri 0 #define CPU_SEQID ((uintptr_t)pthread_self() & (max_ncpus - 1)) #define kcred NULL #define CRED() NULL #define ptob(x) ((x) * PAGESIZE) #define NN_DIVISOR_1000 (1U << 0) #define NN_NUMBUF_SZ (6) extern uint64_t physmem; extern char *random_path; extern char *urandom_path; extern int highbit64(uint64_t i); extern int lowbit64(uint64_t i); extern int random_get_bytes(uint8_t *ptr, size_t len); extern int random_get_pseudo_bytes(uint8_t *ptr, size_t len); extern void kernel_init(int mode); extern void kernel_fini(void); extern void random_init(void); extern void random_fini(void); struct spa; extern void show_pool_stats(struct spa *); extern int set_global_var(char *arg); typedef struct callb_cpr { kmutex_t *cc_lockp; } callb_cpr_t; #define CALLB_CPR_INIT(cp, lockp, func, name) { \ (cp)->cc_lockp = lockp; \ } #define CALLB_CPR_SAFE_BEGIN(cp) { \ ASSERT(MUTEX_HELD((cp)->cc_lockp)); \ } #define CALLB_CPR_SAFE_END(cp, lockp) { \ ASSERT(MUTEX_HELD((cp)->cc_lockp)); \ } #define CALLB_CPR_EXIT(cp) { \ ASSERT(MUTEX_HELD((cp)->cc_lockp)); \ mutex_exit((cp)->cc_lockp); \ } #define zone_dataset_visible(x, y) (1) #define INGLOBALZONE(z) (1) extern uint32_t zone_get_hostid(void *zonep); extern char *kmem_vasprintf(const char *fmt, va_list adx); extern char *kmem_asprintf(const char *fmt, ...); #define kmem_strfree(str) kmem_free((str), strlen(str) + 1) #define kmem_strdup(s) strdup(s) /* * Hostname information */ extern char hw_serial[]; /* for userland-emulated hostid access */ extern int ddi_strtoul(const char *str, char **nptr, int base, unsigned long *result); extern int ddi_strtoull(const char *str, char **nptr, int base, u_longlong_t *result); typedef struct utsname utsname_t; extern utsname_t *utsname(void); /* ZFS Boot Related stuff. */ struct _buf { intptr_t _fd; }; struct bootstat { uint64_t st_size; }; typedef struct ace_object { uid_t a_who; uint32_t a_access_mask; uint16_t a_flags; uint16_t a_type; uint8_t a_obj_type[16]; uint8_t a_inherit_obj_type[16]; } ace_object_t; #define ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE 0x05 #define ACE_ACCESS_DENIED_OBJECT_ACE_TYPE 0x06 #define ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE 0x07 #define ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE 0x08 extern int zfs_secpolicy_snapshot_perms(const char *name, cred_t *cr); extern int zfs_secpolicy_rename_perms(const char *from, const char *to, cred_t *cr); extern int zfs_secpolicy_destroy_perms(const char *name, cred_t *cr); extern int secpolicy_zfs(const cred_t *cr); extern int secpolicy_zfs_proc(const cred_t *cr, proc_t *proc); extern zoneid_t getzoneid(void); /* SID stuff */ typedef struct ksiddomain { uint_t kd_ref; uint_t kd_len; char *kd_name; } ksiddomain_t; ksiddomain_t *ksid_lookupdomain(const char *); void ksiddomain_rele(ksiddomain_t *); #define DDI_SLEEP KM_SLEEP #define ddi_log_sysevent(_a, _b, _c, _d, _e, _f, _g) \ sysevent_post_event(_c, _d, _b, "libzpool", _e, _f) #define zfs_sleep_until(wakeup) \ do { \ hrtime_t delta = wakeup - gethrtime(); \ struct timespec ts; \ ts.tv_sec = delta / NANOSEC; \ ts.tv_nsec = delta % NANOSEC; \ (void) nanosleep(&ts, NULL); \ } while (0) typedef int fstrans_cookie_t; extern fstrans_cookie_t spl_fstrans_mark(void); extern void spl_fstrans_unmark(fstrans_cookie_t); extern int __spl_pf_fstrans_check(void); extern int kmem_cache_reap_active(void); #define ____cacheline_aligned /* * Kernel modules */ #define __init #define __exit #endif /* _KERNEL */ #ifdef __cplusplus }; #endif #endif /* _SYS_ZFS_CONTEXT_H */ Index: vendor-sys/openzfs/dist/include/sys/zstd/zstd.h =================================================================== --- vendor-sys/openzfs/dist/include/sys/zstd/zstd.h (revision 366347) +++ vendor-sys/openzfs/dist/include/sys/zstd/zstd.h (revision 366348) @@ -1,98 +1,99 @@ /* * BSD 3-Clause New License (https://spdx.org/licenses/BSD-3-Clause.html) * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright notice, * this list of conditions and the following disclaimer in the documentation * and/or other materials provided with the distribution. * * 3. Neither the name of the copyright holder nor the names of its * contributors may be used to endorse or promote products derived from this * software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ /* * Copyright (c) 2016-2018, Klara Inc. * Copyright (c) 2016-2018, Allan Jude * Copyright (c) 2018-2020, Sebastian Gottschall * Copyright (c) 2019-2020, Michael Niewöhner * Copyright (c) 2020, The FreeBSD Foundation [1] * * [1] Portions of this software were developed by Allan Jude * under sponsorship from the FreeBSD Foundation. */ #ifndef _ZFS_ZSTD_H #define _ZFS_ZSTD_H #ifdef __cplusplus extern "C" { #endif /* * ZSTD block header * NOTE: all fields in this header are in big endian order. */ typedef struct zfs_zstd_header { /* Compressed size of data */ uint32_t c_len; /* * Version and compression level * We use a union to be able to big endian encode a single 32 bit * unsigned integer, but still access the individual bitmasked * components easily. */ union { uint32_t raw_version_level; struct { uint32_t version : 24; uint8_t level; }; }; char data[]; } zfs_zstdhdr_t; /* * kstat helper macros */ #define ZSTDSTAT(stat) (zstd_stats.stat.value.ui64) #define ZSTDSTAT_INCR(stat, val) \ atomic_add_64(&zstd_stats.stat.value.ui64, (val)) #define ZSTDSTAT_BUMP(stat) ZSTDSTAT_INCR(stat, 1) /* (de)init for user space / kernel emulation */ int zstd_init(void); void zstd_fini(void); size_t zfs_zstd_compress(void *s_start, void *d_start, size_t s_len, size_t d_len, int level); int zfs_zstd_get_level(void *s_start, size_t s_len, uint8_t *level); int zfs_zstd_decompress_level(void *s_start, void *d_start, size_t s_len, size_t d_len, uint8_t *level); int zfs_zstd_decompress(void *s_start, void *d_start, size_t s_len, size_t d_len, int n); +void zfs_zstd_cache_reap_now(void); #ifdef __cplusplus } #endif #endif /* _ZFS_ZSTD_H */ Index: vendor-sys/openzfs/dist/lib/libshare/os/freebsd/nfs.c =================================================================== --- vendor-sys/openzfs/dist/lib/libshare/os/freebsd/nfs.c (revision 366347) +++ vendor-sys/openzfs/dist/lib/libshare/os/freebsd/nfs.c (revision 366348) @@ -1,448 +1,458 @@ /* * Copyright (c) 2007 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * Copyright (c) 2020 by Delphix. All rights reserved. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include "libzfs_impl.h" #include "libshare_impl.h" #include "nfs.h" #define _PATH_MOUNTDPID "/var/run/mountd.pid" #define FILE_HEADER "# !!! DO NOT EDIT THIS FILE MANUALLY !!!\n\n" #define OPTSSIZE 1024 #define MAXLINESIZE (PATH_MAX + OPTSSIZE) #define ZFS_EXPORTS_FILE "/etc/zfs/exports" #define ZFS_EXPORTS_LOCK ZFS_EXPORTS_FILE".lock" static sa_fstype_t *nfs_fstype; static int nfs_lock_fd = -1; /* * The nfs_exports_[lock|unlock] is used to guard against conconcurrent * updates to the exports file. Each protocol is responsible for * providing the necessary locking to ensure consistency. */ static int nfs_exports_lock(void) { nfs_lock_fd = open(ZFS_EXPORTS_LOCK, O_RDWR | O_CREAT, 0600); if (nfs_lock_fd == -1) { fprintf(stderr, "failed to lock %s: %s\n", ZFS_EXPORTS_LOCK, strerror(errno)); return (errno); } if (flock(nfs_lock_fd, LOCK_EX) != 0) { fprintf(stderr, "failed to lock %s: %s\n", ZFS_EXPORTS_LOCK, strerror(errno)); return (errno); } return (0); } static void nfs_exports_unlock(void) { verify(nfs_lock_fd > 0); if (flock(nfs_lock_fd, LOCK_UN) != 0) { fprintf(stderr, "failed to unlock %s: %s\n", ZFS_EXPORTS_LOCK, strerror(errno)); } close(nfs_lock_fd); nfs_lock_fd = -1; } /* * Read one line from a file. Skip comments, empty lines and a line with a * mountpoint specified in the 'skip' argument. * * NOTE: This function returns a static buffer and thus is not thread-safe. */ static char * zgetline(FILE *fd, const char *skip) { static char line[MAXLINESIZE]; size_t len, skiplen = 0; char *s, last; if (skip != NULL) skiplen = strlen(skip); for (;;) { s = fgets(line, sizeof (line), fd); if (s == NULL) return (NULL); /* Skip empty lines and comments. */ if (line[0] == '\n' || line[0] == '#') continue; len = strlen(line); if (line[len - 1] == '\n') line[len - 1] = '\0'; last = line[skiplen]; /* Skip the given mountpoint. */ if (skip != NULL && strncmp(skip, line, skiplen) == 0 && (last == '\t' || last == ' ' || last == '\0')) { continue; } break; } return (line); } /* * This function translate options to a format acceptable by exports(5), eg. * * -ro -network=192.168.0.0 -mask=255.255.255.0 -maproot=0 \ * zfs.freebsd.org 69.147.83.54 * * Accepted input formats: * * ro,network=192.168.0.0,mask=255.255.255.0,maproot=0,zfs.freebsd.org * ro network=192.168.0.0 mask=255.255.255.0 maproot=0 zfs.freebsd.org * -ro,-network=192.168.0.0,-mask=255.255.255.0,-maproot=0,zfs.freebsd.org * -ro -network=192.168.0.0 -mask=255.255.255.0 -maproot=0 \ * zfs.freebsd.org * * Recognized keywords: * * ro, maproot, mapall, mask, network, sec, alldirs, public, webnfs, * index, quiet * * NOTE: This function returns a static buffer and thus is not thread-safe. */ static char * translate_opts(const char *shareopts) { static const char *known_opts[] = { "ro", "maproot", "mapall", "mask", "network", "sec", "alldirs", "public", "webnfs", "index", "quiet", NULL }; static char newopts[OPTSSIZE]; char oldopts[OPTSSIZE]; char *o, *s = NULL; unsigned int i; size_t len; strlcpy(oldopts, shareopts, sizeof (oldopts)); newopts[0] = '\0'; s = oldopts; while ((o = strsep(&s, "-, ")) != NULL) { if (o[0] == '\0') continue; for (i = 0; known_opts[i] != NULL; i++) { len = strlen(known_opts[i]); if (strncmp(known_opts[i], o, len) == 0 && (o[len] == '\0' || o[len] == '=')) { strlcat(newopts, "-", sizeof (newopts)); break; } } strlcat(newopts, o, sizeof (newopts)); strlcat(newopts, " ", sizeof (newopts)); } return (newopts); } static char * nfs_init_tmpfile(void) { char *tmpfile = NULL; if (asprintf(&tmpfile, "%s%s", ZFS_EXPORTS_FILE, ".XXXXXXXX") == -1) { fprintf(stderr, "Unable to allocate buffer for temporary " "file name\n"); return (NULL); } int fd = mkstemp(tmpfile); if (fd == -1) { fprintf(stderr, "Unable to create temporary file: %s", strerror(errno)); free(tmpfile); return (NULL); } close(fd); return (tmpfile); } static int nfs_fini_tmpfile(char *tmpfile) { if (rename(tmpfile, ZFS_EXPORTS_FILE) == -1) { fprintf(stderr, "Unable to rename %s: %s\n", tmpfile, strerror(errno)); unlink(tmpfile); free(tmpfile); return (SA_SYSTEM_ERR); } free(tmpfile); return (SA_OK); } /* * This function copies all entries from the exports file to "filename", * omitting any entries for the specified mountpoint. */ static int nfs_copy_entries(char *filename, const char *mountpoint) { int error = SA_OK; char *line; - /* - * If the file doesn't exist then there is nothing more - * we need to do. - */ FILE *oldfp = fopen(ZFS_EXPORTS_FILE, "r"); - if (oldfp == NULL) - return (SA_OK); - FILE *newfp = fopen(filename, "w+"); + if (newfp == NULL) { + fprintf(stderr, "failed to open %s file: %s", filename, + strerror(errno)); + fclose(oldfp); + return (SA_SYSTEM_ERR); + } fputs(FILE_HEADER, newfp); - while ((line = zgetline(oldfp, mountpoint)) != NULL) - fprintf(newfp, "%s\n", line); - if (ferror(oldfp) != 0) { - error = ferror(oldfp); + + /* + * The ZFS_EXPORTS_FILE may not exist yet. If that's the + * case then just write out the new file. + */ + if (oldfp != NULL) { + while ((line = zgetline(oldfp, mountpoint)) != NULL) + fprintf(newfp, "%s\n", line); + if (ferror(oldfp) != 0) { + error = ferror(oldfp); + } + if (fclose(oldfp) != 0) { + fprintf(stderr, "Unable to close file %s: %s\n", + filename, strerror(errno)); + error = error != 0 ? error : SA_SYSTEM_ERR; + } } + if (error == 0 && ferror(newfp) != 0) { error = ferror(newfp); } if (fclose(newfp) != 0) { fprintf(stderr, "Unable to close file %s: %s\n", filename, strerror(errno)); error = error != 0 ? error : SA_SYSTEM_ERR; } - fclose(oldfp); - return (error); } static int nfs_enable_share(sa_share_impl_t impl_share) { char *filename = NULL; int error; if ((filename = nfs_init_tmpfile()) == NULL) return (SA_SYSTEM_ERR); error = nfs_exports_lock(); if (error != 0) { unlink(filename); free(filename); return (error); } error = nfs_copy_entries(filename, impl_share->sa_mountpoint); if (error != SA_OK) { unlink(filename); free(filename); nfs_exports_unlock(); return (error); } FILE *fp = fopen(filename, "a+"); if (fp == NULL) { fprintf(stderr, "failed to open %s file: %s", filename, strerror(errno)); unlink(filename); free(filename); nfs_exports_unlock(); return (SA_SYSTEM_ERR); } char *shareopts = FSINFO(impl_share, nfs_fstype)->shareopts; if (strcmp(shareopts, "on") == 0) shareopts = ""; if (fprintf(fp, "%s\t%s\n", impl_share->sa_mountpoint, translate_opts(shareopts)) < 0) { fprintf(stderr, "failed to write to %s\n", filename); fclose(fp); unlink(filename); free(filename); nfs_exports_unlock(); return (SA_SYSTEM_ERR); } if (fclose(fp) != 0) { fprintf(stderr, "Unable to close file %s: %s\n", filename, strerror(errno)); unlink(filename); free(filename); nfs_exports_unlock(); return (SA_SYSTEM_ERR); } error = nfs_fini_tmpfile(filename); nfs_exports_unlock(); return (error); } static int nfs_disable_share(sa_share_impl_t impl_share) { int error; char *filename = NULL; if ((filename = nfs_init_tmpfile()) == NULL) return (SA_SYSTEM_ERR); error = nfs_exports_lock(); if (error != 0) { unlink(filename); free(filename); return (error); } error = nfs_copy_entries(filename, impl_share->sa_mountpoint); if (error != SA_OK) { unlink(filename); free(filename); nfs_exports_unlock(); return (error); } error = nfs_fini_tmpfile(filename); nfs_exports_unlock(); return (error); } /* * NOTE: This function returns a static buffer and thus is not thread-safe. */ static boolean_t nfs_is_shared(sa_share_impl_t impl_share) { static char line[MAXLINESIZE]; char *s, last; size_t len; char *mntpoint = impl_share->sa_mountpoint; size_t mntlen = strlen(mntpoint); FILE *fp = fopen(ZFS_EXPORTS_FILE, "r"); if (fp == NULL) return (B_FALSE); for (;;) { s = fgets(line, sizeof (line), fp); if (s == NULL) return (B_FALSE); /* Skip empty lines and comments. */ if (line[0] == '\n' || line[0] == '#') continue; len = strlen(line); if (line[len - 1] == '\n') line[len - 1] = '\0'; last = line[mntlen]; /* Skip the given mountpoint. */ if (strncmp(mntpoint, line, mntlen) == 0 && (last == '\t' || last == ' ' || last == '\0')) { fclose(fp); return (B_TRUE); } } fclose(fp); return (B_FALSE); } static int nfs_validate_shareopts(const char *shareopts) { return (SA_OK); } static int nfs_update_shareopts(sa_share_impl_t impl_share, const char *shareopts) { FSINFO(impl_share, nfs_fstype)->shareopts = (char *)shareopts; return (SA_OK); } static void nfs_clear_shareopts(sa_share_impl_t impl_share) { FSINFO(impl_share, nfs_fstype)->shareopts = NULL; } /* * Commit the shares by restarting mountd. */ static int nfs_commit_shares(void) { struct pidfh *pfh; pid_t mountdpid; pfh = pidfile_open(_PATH_MOUNTDPID, 0600, &mountdpid); if (pfh != NULL) { /* Mountd is not running. */ pidfile_remove(pfh); return (SA_OK); } if (errno != EEXIST) { /* Cannot open pidfile for some reason. */ return (SA_SYSTEM_ERR); } /* We have mountd(8) PID in mountdpid variable. */ kill(mountdpid, SIGHUP); return (SA_OK); } static const sa_share_ops_t nfs_shareops = { .enable_share = nfs_enable_share, .disable_share = nfs_disable_share, .is_shared = nfs_is_shared, .validate_shareopts = nfs_validate_shareopts, .update_shareopts = nfs_update_shareopts, .clear_shareopts = nfs_clear_shareopts, .commit_shares = nfs_commit_shares, }; /* * Initializes the NFS functionality of libshare. */ void libshare_nfs_init(void) { nfs_fstype = register_fstype("nfs", &nfs_shareops); } Index: vendor-sys/openzfs/dist/lib/libshare/os/linux/nfs.c =================================================================== --- vendor-sys/openzfs/dist/lib/libshare/os/linux/nfs.c (revision 366347) +++ vendor-sys/openzfs/dist/lib/libshare/os/linux/nfs.c (revision 366348) @@ -1,713 +1,724 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2002, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2011 Gunnar Beutner * Copyright (c) 2012 Cyril Plisko. All rights reserved. * Copyright (c) 2019, 2020 by Delphix. All rights reserved. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include "libshare_impl.h" #include "nfs.h" #define FILE_HEADER "# !!! DO NOT EDIT THIS FILE MANUALLY !!!\n\n" #define ZFS_EXPORTS_DIR "/etc/exports.d" #define ZFS_EXPORTS_FILE ZFS_EXPORTS_DIR"/zfs.exports" #define ZFS_EXPORTS_LOCK ZFS_EXPORTS_FILE".lock" static sa_fstype_t *nfs_fstype; typedef int (*nfs_shareopt_callback_t)(const char *opt, const char *value, void *cookie); typedef int (*nfs_host_callback_t)(const char *sharepath, const char *filename, const char *host, const char *security, const char *access, void *cookie); static int nfs_lock_fd = -1; /* * The nfs_exports_[lock|unlock] is used to guard against conconcurrent * updates to the exports file. Each protocol is responsible for * providing the necessary locking to ensure consistency. */ static int nfs_exports_lock(void) { nfs_lock_fd = open(ZFS_EXPORTS_LOCK, O_RDWR | O_CREAT, 0600); if (nfs_lock_fd == -1) { fprintf(stderr, "failed to lock %s: %s\n", ZFS_EXPORTS_LOCK, strerror(errno)); return (errno); } if (flock(nfs_lock_fd, LOCK_EX) != 0) { fprintf(stderr, "failed to lock %s: %s\n", ZFS_EXPORTS_LOCK, strerror(errno)); return (errno); } return (0); } static void nfs_exports_unlock(void) { verify(nfs_lock_fd > 0); if (flock(nfs_lock_fd, LOCK_UN) != 0) { fprintf(stderr, "failed to unlock %s: %s\n", ZFS_EXPORTS_LOCK, strerror(errno)); } close(nfs_lock_fd); nfs_lock_fd = -1; } /* * Invokes the specified callback function for each Solaris share option * listed in the specified string. */ static int foreach_nfs_shareopt(const char *shareopts, nfs_shareopt_callback_t callback, void *cookie) { char *shareopts_dup, *opt, *cur, *value; int was_nul, error; if (shareopts == NULL) return (SA_OK); if (strcmp(shareopts, "on") == 0) shareopts = "rw,crossmnt"; shareopts_dup = strdup(shareopts); if (shareopts_dup == NULL) return (SA_NO_MEMORY); opt = shareopts_dup; was_nul = 0; while (1) { cur = opt; while (*cur != ',' && *cur != '\0') cur++; if (*cur == '\0') was_nul = 1; *cur = '\0'; if (cur > opt) { value = strchr(opt, '='); if (value != NULL) { *value = '\0'; value++; } error = callback(opt, value, cookie); if (error != SA_OK) { free(shareopts_dup); return (error); } } opt = cur + 1; if (was_nul) break; } free(shareopts_dup); return (SA_OK); } typedef struct nfs_host_cookie_s { nfs_host_callback_t callback; const char *sharepath; void *cookie; const char *filename; const char *security; } nfs_host_cookie_t; /* * Helper function for foreach_nfs_host. This function checks whether the * current share option is a host specification and invokes a callback * function with information about the host. */ static int foreach_nfs_host_cb(const char *opt, const char *value, void *pcookie) { int error; const char *access; char *host_dup, *host, *next; nfs_host_cookie_t *udata = (nfs_host_cookie_t *)pcookie; #ifdef DEBUG fprintf(stderr, "foreach_nfs_host_cb: key=%s, value=%s\n", opt, value); #endif if (strcmp(opt, "sec") == 0) udata->security = value; if (strcmp(opt, "rw") == 0 || strcmp(opt, "ro") == 0) { if (value == NULL) value = "*"; access = opt; host_dup = strdup(value); if (host_dup == NULL) return (SA_NO_MEMORY); host = host_dup; do { next = strchr(host, ':'); if (next != NULL) { *next = '\0'; next++; } error = udata->callback(udata->filename, udata->sharepath, host, udata->security, access, udata->cookie); if (error != SA_OK) { free(host_dup); return (error); } host = next; } while (host != NULL); free(host_dup); } return (SA_OK); } /* * Invokes a callback function for all NFS hosts that are set for a share. */ static int foreach_nfs_host(sa_share_impl_t impl_share, char *filename, nfs_host_callback_t callback, void *cookie) { nfs_host_cookie_t udata; char *shareopts; udata.callback = callback; udata.sharepath = impl_share->sa_mountpoint; udata.cookie = cookie; udata.filename = filename; udata.security = "sys"; shareopts = FSINFO(impl_share, nfs_fstype)->shareopts; return (foreach_nfs_shareopt(shareopts, foreach_nfs_host_cb, &udata)); } /* * Converts a Solaris NFS host specification to its Linux equivalent. */ static int get_linux_hostspec(const char *solaris_hostspec, char **plinux_hostspec) { /* * For now we just support CIDR masks (e.g. @192.168.0.0/16) and host * wildcards (e.g. *.example.org). */ if (solaris_hostspec[0] == '@') { /* * Solaris host specifier, e.g. @192.168.0.0/16; we just need * to skip the @ in this case */ *plinux_hostspec = strdup(solaris_hostspec + 1); } else { *plinux_hostspec = strdup(solaris_hostspec); } if (*plinux_hostspec == NULL) { return (SA_NO_MEMORY); } return (SA_OK); } /* * Adds a Linux share option to an array of NFS options. */ static int add_linux_shareopt(char **plinux_opts, const char *key, const char *value) { size_t len = 0; char *new_linux_opts; if (*plinux_opts != NULL) len = strlen(*plinux_opts); new_linux_opts = realloc(*plinux_opts, len + 1 + strlen(key) + (value ? 1 + strlen(value) : 0) + 1); if (new_linux_opts == NULL) return (SA_NO_MEMORY); new_linux_opts[len] = '\0'; if (len > 0) strcat(new_linux_opts, ","); strcat(new_linux_opts, key); if (value != NULL) { strcat(new_linux_opts, "="); strcat(new_linux_opts, value); } *plinux_opts = new_linux_opts; return (SA_OK); } /* * Validates and converts a single Solaris share option to its Linux * equivalent. */ static int get_linux_shareopts_cb(const char *key, const char *value, void *cookie) { char **plinux_opts = (char **)cookie; /* host-specific options, these are taken care of elsewhere */ if (strcmp(key, "ro") == 0 || strcmp(key, "rw") == 0 || strcmp(key, "sec") == 0) return (SA_OK); if (strcmp(key, "anon") == 0) key = "anonuid"; if (strcmp(key, "root_mapping") == 0) { (void) add_linux_shareopt(plinux_opts, "root_squash", NULL); key = "anonuid"; } if (strcmp(key, "nosub") == 0) key = "subtree_check"; if (strcmp(key, "insecure") != 0 && strcmp(key, "secure") != 0 && strcmp(key, "async") != 0 && strcmp(key, "sync") != 0 && strcmp(key, "no_wdelay") != 0 && strcmp(key, "wdelay") != 0 && strcmp(key, "nohide") != 0 && strcmp(key, "hide") != 0 && strcmp(key, "crossmnt") != 0 && strcmp(key, "no_subtree_check") != 0 && strcmp(key, "subtree_check") != 0 && strcmp(key, "insecure_locks") != 0 && strcmp(key, "secure_locks") != 0 && strcmp(key, "no_auth_nlm") != 0 && strcmp(key, "auth_nlm") != 0 && strcmp(key, "no_acl") != 0 && strcmp(key, "mountpoint") != 0 && strcmp(key, "mp") != 0 && strcmp(key, "fsuid") != 0 && strcmp(key, "refer") != 0 && strcmp(key, "replicas") != 0 && strcmp(key, "root_squash") != 0 && strcmp(key, "no_root_squash") != 0 && strcmp(key, "all_squash") != 0 && strcmp(key, "no_all_squash") != 0 && strcmp(key, "fsid") != 0 && strcmp(key, "anonuid") != 0 && strcmp(key, "anongid") != 0) { return (SA_SYNTAX_ERR); } (void) add_linux_shareopt(plinux_opts, key, value); return (SA_OK); } /* * Takes a string containing Solaris share options (e.g. "sync,no_acl") and * converts them to a NULL-terminated array of Linux NFS options. */ static int get_linux_shareopts(const char *shareopts, char **plinux_opts) { int error; assert(plinux_opts != NULL); *plinux_opts = NULL; /* no_subtree_check - Default as of nfs-utils v1.1.0 */ (void) add_linux_shareopt(plinux_opts, "no_subtree_check", NULL); /* mountpoint - Restrict exports to ZFS mountpoints */ (void) add_linux_shareopt(plinux_opts, "mountpoint", NULL); error = foreach_nfs_shareopt(shareopts, get_linux_shareopts_cb, plinux_opts); if (error != SA_OK) { free(*plinux_opts); *plinux_opts = NULL; } return (error); } static char * nfs_init_tmpfile(void) { char *tmpfile = NULL; + struct stat sb; + if (stat(ZFS_EXPORTS_DIR, &sb) < 0 && + mkdir(ZFS_EXPORTS_DIR, 0755) < 0) { + fprintf(stderr, "failed to create %s: %s\n", + ZFS_EXPORTS_DIR, strerror(errno)); + return (NULL); + } + if (asprintf(&tmpfile, "%s%s", ZFS_EXPORTS_FILE, ".XXXXXXXX") == -1) { fprintf(stderr, "Unable to allocate temporary file\n"); return (NULL); } int fd = mkstemp(tmpfile); if (fd == -1) { fprintf(stderr, "Unable to create temporary file: %s", strerror(errno)); free(tmpfile); return (NULL); } close(fd); return (tmpfile); } static int nfs_fini_tmpfile(char *tmpfile) { if (rename(tmpfile, ZFS_EXPORTS_FILE) == -1) { fprintf(stderr, "Unable to rename %s: %s\n", tmpfile, strerror(errno)); unlink(tmpfile); free(tmpfile); return (SA_SYSTEM_ERR); } free(tmpfile); return (SA_OK); } /* * This function populates an entry into /etc/exports.d/zfs.exports. * This file is consumed by the linux nfs server so that zfs shares are * automatically exported upon boot or whenever the nfs server restarts. */ static int nfs_add_entry(const char *filename, const char *sharepath, const char *host, const char *security, const char *access_opts, void *pcookie) { int error; char *linuxhost; const char *linux_opts = (const char *)pcookie; error = get_linux_hostspec(host, &linuxhost); if (error != SA_OK) return (error); if (linux_opts == NULL) linux_opts = ""; FILE *fp = fopen(filename, "a+"); if (fp == NULL) { fprintf(stderr, "failed to open %s file: %s", filename, strerror(errno)); free(linuxhost); return (SA_SYSTEM_ERR); } if (fprintf(fp, "%s %s(sec=%s,%s,%s)\n", sharepath, linuxhost, security, access_opts, linux_opts) < 0) { fprintf(stderr, "failed to write to %s\n", filename); free(linuxhost); fclose(fp); return (SA_SYSTEM_ERR); } free(linuxhost); if (fclose(fp) != 0) { fprintf(stderr, "Unable to close file %s: %s\n", filename, strerror(errno)); return (SA_SYSTEM_ERR); } return (SA_OK); } /* * This function copies all entries from the exports file to "filename", * omitting any entries for the specified mountpoint. */ static int nfs_copy_entries(char *filename, const char *mountpoint) { char *buf = NULL; size_t buflen = 0; int error = SA_OK; - /* - * If the file doesn't exist then there is nothing more - * we need to do. - */ FILE *oldfp = fopen(ZFS_EXPORTS_FILE, "r"); - if (oldfp == NULL) - return (SA_OK); - FILE *newfp = fopen(filename, "w+"); + if (newfp == NULL) { + fprintf(stderr, "failed to open %s file: %s", filename, + strerror(errno)); + fclose(oldfp); + return (SA_SYSTEM_ERR); + } fputs(FILE_HEADER, newfp); - while ((getline(&buf, &buflen, oldfp)) != -1) { - char *space = NULL; - if (buf[0] == '\n' || buf[0] == '#') - continue; + /* + * The ZFS_EXPORTS_FILE may not exist yet. If that's the + * case then just write out the new file. + */ + if (oldfp != NULL) { + while (getline(&buf, &buflen, oldfp) != -1) { + char *space = NULL; - if ((space = strchr(buf, ' ')) != NULL) { - int mountpoint_len = strlen(mountpoint); - - if (space - buf == mountpoint_len && - strncmp(mountpoint, buf, mountpoint_len) == 0) { + if (buf[0] == '\n' || buf[0] == '#') continue; + + if ((space = strchr(buf, ' ')) != NULL) { + int mountpoint_len = strlen(mountpoint); + + if (space - buf == mountpoint_len && + strncmp(mountpoint, buf, + mountpoint_len) == 0) { + continue; + } } + fputs(buf, newfp); } - fputs(buf, newfp); - } - if (oldfp != NULL && ferror(oldfp) != 0) { - error = ferror(oldfp); + if (ferror(oldfp) != 0) { + error = ferror(oldfp); + } + if (fclose(oldfp) != 0) { + fprintf(stderr, "Unable to close file %s: %s\n", + filename, strerror(errno)); + error = error != 0 ? error : SA_SYSTEM_ERR; + } } + if (error == 0 && ferror(newfp) != 0) { error = ferror(newfp); } free(buf); if (fclose(newfp) != 0) { fprintf(stderr, "Unable to close file %s: %s\n", filename, strerror(errno)); error = error != 0 ? error : SA_SYSTEM_ERR; } - fclose(oldfp); - return (error); } /* * Enables NFS sharing for the specified share. */ static int nfs_enable_share(sa_share_impl_t impl_share) { char *shareopts, *linux_opts; char *filename = NULL; int error; if ((filename = nfs_init_tmpfile()) == NULL) return (SA_SYSTEM_ERR); error = nfs_exports_lock(); if (error != 0) { unlink(filename); free(filename); return (error); } error = nfs_copy_entries(filename, impl_share->sa_mountpoint); if (error != SA_OK) { unlink(filename); free(filename); nfs_exports_unlock(); return (error); } shareopts = FSINFO(impl_share, nfs_fstype)->shareopts; error = get_linux_shareopts(shareopts, &linux_opts); if (error != SA_OK) { unlink(filename); free(filename); nfs_exports_unlock(); return (error); } error = foreach_nfs_host(impl_share, filename, nfs_add_entry, linux_opts); free(linux_opts); if (error == 0) { error = nfs_fini_tmpfile(filename); } else { unlink(filename); free(filename); } nfs_exports_unlock(); return (error); } /* * Disables NFS sharing for the specified share. */ static int nfs_disable_share(sa_share_impl_t impl_share) { int error; char *filename = NULL; if ((filename = nfs_init_tmpfile()) == NULL) return (SA_SYSTEM_ERR); error = nfs_exports_lock(); if (error != 0) { unlink(filename); free(filename); return (error); } error = nfs_copy_entries(filename, impl_share->sa_mountpoint); if (error != SA_OK) { unlink(filename); free(filename); nfs_exports_unlock(); return (error); } error = nfs_fini_tmpfile(filename); nfs_exports_unlock(); return (error); } static boolean_t nfs_is_shared(sa_share_impl_t impl_share) { size_t buflen = 0; char *buf = NULL; FILE *fp = fopen(ZFS_EXPORTS_FILE, "r"); if (fp == NULL) { return (B_FALSE); } while ((getline(&buf, &buflen, fp)) != -1) { char *space = NULL; if ((space = strchr(buf, ' ')) != NULL) { int mountpoint_len = strlen(impl_share->sa_mountpoint); if (space - buf == mountpoint_len && strncmp(impl_share->sa_mountpoint, buf, mountpoint_len) == 0) { fclose(fp); free(buf); return (B_TRUE); } } } free(buf); fclose(fp); return (B_FALSE); } /* * Checks whether the specified NFS share options are syntactically correct. */ static int nfs_validate_shareopts(const char *shareopts) { char *linux_opts; int error; error = get_linux_shareopts(shareopts, &linux_opts); if (error != SA_OK) return (error); free(linux_opts); return (SA_OK); } static int nfs_update_shareopts(sa_share_impl_t impl_share, const char *shareopts) { FSINFO(impl_share, nfs_fstype)->shareopts = (char *)shareopts; return (SA_OK); } /* * Clears a share's NFS options. Used by libshare to * clean up shares that are about to be free()'d. */ static void nfs_clear_shareopts(sa_share_impl_t impl_share) { FSINFO(impl_share, nfs_fstype)->shareopts = NULL; } static int nfs_commit_shares(void) { char *argv[] = { "/usr/sbin/exportfs", "-ra", NULL }; return (libzfs_run_process(argv[0], argv, 0)); } static const sa_share_ops_t nfs_shareops = { .enable_share = nfs_enable_share, .disable_share = nfs_disable_share, .is_shared = nfs_is_shared, .validate_shareopts = nfs_validate_shareopts, .update_shareopts = nfs_update_shareopts, .clear_shareopts = nfs_clear_shareopts, .commit_shares = nfs_commit_shares, }; /* * Initializes the NFS functionality of libshare. */ void libshare_nfs_init(void) { - struct stat sb; - nfs_fstype = register_fstype("nfs", &nfs_shareops); - - if (stat(ZFS_EXPORTS_DIR, &sb) < 0 && - mkdir(ZFS_EXPORTS_DIR, 0755) < 0) { - fprintf(stderr, "failed to create %s: %s\n", - ZFS_EXPORTS_DIR, strerror(errno)); - } } Index: vendor-sys/openzfs/dist/lib/libzpool/kernel.c =================================================================== --- vendor-sys/openzfs/dist/lib/libzpool/kernel.c (revision 366347) +++ vendor-sys/openzfs/dist/lib/libzpool/kernel.c (revision 366348) @@ -1,1416 +1,1417 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2012, 2018 by Delphix. All rights reserved. * Copyright (c) 2016 Actifio, Inc. All rights reserved. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* * Emulation of kernel services in userland. */ uint64_t physmem; char hw_serial[HW_HOSTID_LEN]; struct utsname hw_utsname; /* If set, all blocks read will be copied to the specified directory. */ char *vn_dumpdir = NULL; /* this only exists to have its address taken */ struct proc p0; /* * ========================================================================= * threads * ========================================================================= * * TS_STACK_MIN is dictated by the minimum allowed pthread stack size. While * TS_STACK_MAX is somewhat arbitrary, it was selected to be large enough for * the expected stack depth while small enough to avoid exhausting address * space with high thread counts. */ #define TS_STACK_MIN MAX(PTHREAD_STACK_MIN, 32768) #define TS_STACK_MAX (256 * 1024) /*ARGSUSED*/ kthread_t * zk_thread_create(void (*func)(void *), void *arg, size_t stksize, int state) { pthread_attr_t attr; pthread_t tid; char *stkstr; int detachstate = PTHREAD_CREATE_DETACHED; VERIFY0(pthread_attr_init(&attr)); if (state & TS_JOINABLE) detachstate = PTHREAD_CREATE_JOINABLE; VERIFY0(pthread_attr_setdetachstate(&attr, detachstate)); /* * We allow the default stack size in user space to be specified by * setting the ZFS_STACK_SIZE environment variable. This allows us * the convenience of observing and debugging stack overruns in * user space. Explicitly specified stack sizes will be honored. * The usage of ZFS_STACK_SIZE is discussed further in the * ENVIRONMENT VARIABLES sections of the ztest(1) man page. */ if (stksize == 0) { stkstr = getenv("ZFS_STACK_SIZE"); if (stkstr == NULL) stksize = TS_STACK_MAX; else stksize = MAX(atoi(stkstr), TS_STACK_MIN); } VERIFY3S(stksize, >, 0); stksize = P2ROUNDUP(MAX(stksize, TS_STACK_MIN), PAGESIZE); /* * If this ever fails, it may be because the stack size is not a * multiple of system page size. */ VERIFY0(pthread_attr_setstacksize(&attr, stksize)); VERIFY0(pthread_attr_setguardsize(&attr, PAGESIZE)); VERIFY0(pthread_create(&tid, &attr, (void *(*)(void *))func, arg)); VERIFY0(pthread_attr_destroy(&attr)); return ((void *)(uintptr_t)tid); } /* * ========================================================================= * kstats * ========================================================================= */ /*ARGSUSED*/ kstat_t * kstat_create(const char *module, int instance, const char *name, const char *class, uchar_t type, ulong_t ndata, uchar_t ks_flag) { return (NULL); } /*ARGSUSED*/ void kstat_install(kstat_t *ksp) {} /*ARGSUSED*/ void kstat_delete(kstat_t *ksp) {} /*ARGSUSED*/ void kstat_waitq_enter(kstat_io_t *kiop) {} /*ARGSUSED*/ void kstat_waitq_exit(kstat_io_t *kiop) {} /*ARGSUSED*/ void kstat_runq_enter(kstat_io_t *kiop) {} /*ARGSUSED*/ void kstat_runq_exit(kstat_io_t *kiop) {} /*ARGSUSED*/ void kstat_waitq_to_runq(kstat_io_t *kiop) {} /*ARGSUSED*/ void kstat_runq_back_to_waitq(kstat_io_t *kiop) {} void kstat_set_raw_ops(kstat_t *ksp, int (*headers)(char *buf, size_t size), int (*data)(char *buf, size_t size, void *data), void *(*addr)(kstat_t *ksp, loff_t index)) {} /* * ========================================================================= * mutexes * ========================================================================= */ void mutex_init(kmutex_t *mp, char *name, int type, void *cookie) { VERIFY0(pthread_mutex_init(&mp->m_lock, NULL)); memset(&mp->m_owner, 0, sizeof (pthread_t)); } void mutex_destroy(kmutex_t *mp) { VERIFY0(pthread_mutex_destroy(&mp->m_lock)); } void mutex_enter(kmutex_t *mp) { VERIFY0(pthread_mutex_lock(&mp->m_lock)); mp->m_owner = pthread_self(); } int mutex_tryenter(kmutex_t *mp) { int error; error = pthread_mutex_trylock(&mp->m_lock); if (error == 0) { mp->m_owner = pthread_self(); return (1); } else { VERIFY3S(error, ==, EBUSY); return (0); } } void mutex_exit(kmutex_t *mp) { memset(&mp->m_owner, 0, sizeof (pthread_t)); VERIFY0(pthread_mutex_unlock(&mp->m_lock)); } /* * ========================================================================= * rwlocks * ========================================================================= */ void rw_init(krwlock_t *rwlp, char *name, int type, void *arg) { VERIFY0(pthread_rwlock_init(&rwlp->rw_lock, NULL)); rwlp->rw_readers = 0; rwlp->rw_owner = 0; } void rw_destroy(krwlock_t *rwlp) { VERIFY0(pthread_rwlock_destroy(&rwlp->rw_lock)); } void rw_enter(krwlock_t *rwlp, krw_t rw) { if (rw == RW_READER) { VERIFY0(pthread_rwlock_rdlock(&rwlp->rw_lock)); atomic_inc_uint(&rwlp->rw_readers); } else { VERIFY0(pthread_rwlock_wrlock(&rwlp->rw_lock)); rwlp->rw_owner = pthread_self(); } } void rw_exit(krwlock_t *rwlp) { if (RW_READ_HELD(rwlp)) atomic_dec_uint(&rwlp->rw_readers); else rwlp->rw_owner = 0; VERIFY0(pthread_rwlock_unlock(&rwlp->rw_lock)); } int rw_tryenter(krwlock_t *rwlp, krw_t rw) { int error; if (rw == RW_READER) error = pthread_rwlock_tryrdlock(&rwlp->rw_lock); else error = pthread_rwlock_trywrlock(&rwlp->rw_lock); if (error == 0) { if (rw == RW_READER) atomic_inc_uint(&rwlp->rw_readers); else rwlp->rw_owner = pthread_self(); return (1); } VERIFY3S(error, ==, EBUSY); return (0); } /* ARGSUSED */ uint32_t zone_get_hostid(void *zonep) { /* * We're emulating the system's hostid in userland. */ return (strtoul(hw_serial, NULL, 10)); } int rw_tryupgrade(krwlock_t *rwlp) { return (0); } /* * ========================================================================= * condition variables * ========================================================================= */ void cv_init(kcondvar_t *cv, char *name, int type, void *arg) { VERIFY0(pthread_cond_init(cv, NULL)); } void cv_destroy(kcondvar_t *cv) { VERIFY0(pthread_cond_destroy(cv)); } void cv_wait(kcondvar_t *cv, kmutex_t *mp) { memset(&mp->m_owner, 0, sizeof (pthread_t)); VERIFY0(pthread_cond_wait(cv, &mp->m_lock)); mp->m_owner = pthread_self(); } int cv_wait_sig(kcondvar_t *cv, kmutex_t *mp) { cv_wait(cv, mp); return (1); } int cv_timedwait(kcondvar_t *cv, kmutex_t *mp, clock_t abstime) { int error; struct timeval tv; struct timespec ts; clock_t delta; delta = abstime - ddi_get_lbolt(); if (delta <= 0) return (-1); VERIFY(gettimeofday(&tv, NULL) == 0); ts.tv_sec = tv.tv_sec + delta / hz; ts.tv_nsec = tv.tv_usec * NSEC_PER_USEC + (delta % hz) * (NANOSEC / hz); if (ts.tv_nsec >= NANOSEC) { ts.tv_sec++; ts.tv_nsec -= NANOSEC; } memset(&mp->m_owner, 0, sizeof (pthread_t)); error = pthread_cond_timedwait(cv, &mp->m_lock, &ts); mp->m_owner = pthread_self(); if (error == ETIMEDOUT) return (-1); VERIFY0(error); return (1); } /*ARGSUSED*/ int cv_timedwait_hires(kcondvar_t *cv, kmutex_t *mp, hrtime_t tim, hrtime_t res, int flag) { int error; struct timeval tv; struct timespec ts; hrtime_t delta; ASSERT(flag == 0 || flag == CALLOUT_FLAG_ABSOLUTE); delta = tim; if (flag & CALLOUT_FLAG_ABSOLUTE) delta -= gethrtime(); if (delta <= 0) return (-1); VERIFY0(gettimeofday(&tv, NULL)); ts.tv_sec = tv.tv_sec + delta / NANOSEC; ts.tv_nsec = tv.tv_usec * NSEC_PER_USEC + (delta % NANOSEC); if (ts.tv_nsec >= NANOSEC) { ts.tv_sec++; ts.tv_nsec -= NANOSEC; } memset(&mp->m_owner, 0, sizeof (pthread_t)); error = pthread_cond_timedwait(cv, &mp->m_lock, &ts); mp->m_owner = pthread_self(); if (error == ETIMEDOUT) return (-1); VERIFY0(error); return (1); } void cv_signal(kcondvar_t *cv) { VERIFY0(pthread_cond_signal(cv)); } void cv_broadcast(kcondvar_t *cv) { VERIFY0(pthread_cond_broadcast(cv)); } /* * ========================================================================= * procfs list * ========================================================================= */ void seq_printf(struct seq_file *m, const char *fmt, ...) {} void procfs_list_install(const char *module, + const char *submodule, const char *name, mode_t mode, procfs_list_t *procfs_list, int (*show)(struct seq_file *f, void *p), int (*show_header)(struct seq_file *f), int (*clear)(procfs_list_t *procfs_list), size_t procfs_list_node_off) { mutex_init(&procfs_list->pl_lock, NULL, MUTEX_DEFAULT, NULL); list_create(&procfs_list->pl_list, procfs_list_node_off + sizeof (procfs_list_node_t), procfs_list_node_off + offsetof(procfs_list_node_t, pln_link)); procfs_list->pl_next_id = 1; procfs_list->pl_node_offset = procfs_list_node_off; } void procfs_list_uninstall(procfs_list_t *procfs_list) {} void procfs_list_destroy(procfs_list_t *procfs_list) { ASSERT(list_is_empty(&procfs_list->pl_list)); list_destroy(&procfs_list->pl_list); mutex_destroy(&procfs_list->pl_lock); } #define NODE_ID(procfs_list, obj) \ (((procfs_list_node_t *)(((char *)obj) + \ (procfs_list)->pl_node_offset))->pln_id) void procfs_list_add(procfs_list_t *procfs_list, void *p) { ASSERT(MUTEX_HELD(&procfs_list->pl_lock)); NODE_ID(procfs_list, p) = procfs_list->pl_next_id++; list_insert_tail(&procfs_list->pl_list, p); } /* * ========================================================================= * vnode operations * ========================================================================= */ /* * ========================================================================= * Figure out which debugging statements to print * ========================================================================= */ static char *dprintf_string; static int dprintf_print_all; int dprintf_find_string(const char *string) { char *tmp_str = dprintf_string; int len = strlen(string); /* * Find out if this is a string we want to print. * String format: file1.c,function_name1,file2.c,file3.c */ while (tmp_str != NULL) { if (strncmp(tmp_str, string, len) == 0 && (tmp_str[len] == ',' || tmp_str[len] == '\0')) return (1); tmp_str = strchr(tmp_str, ','); if (tmp_str != NULL) tmp_str++; /* Get rid of , */ } return (0); } void dprintf_setup(int *argc, char **argv) { int i, j; /* * Debugging can be specified two ways: by setting the * environment variable ZFS_DEBUG, or by including a * "debug=..." argument on the command line. The command * line setting overrides the environment variable. */ for (i = 1; i < *argc; i++) { int len = strlen("debug="); /* First look for a command line argument */ if (strncmp("debug=", argv[i], len) == 0) { dprintf_string = argv[i] + len; /* Remove from args */ for (j = i; j < *argc; j++) argv[j] = argv[j+1]; argv[j] = NULL; (*argc)--; } } if (dprintf_string == NULL) { /* Look for ZFS_DEBUG environment variable */ dprintf_string = getenv("ZFS_DEBUG"); } /* * Are we just turning on all debugging? */ if (dprintf_find_string("on")) dprintf_print_all = 1; if (dprintf_string != NULL) zfs_flags |= ZFS_DEBUG_DPRINTF; } /* * ========================================================================= * debug printfs * ========================================================================= */ void __dprintf(boolean_t dprint, const char *file, const char *func, int line, const char *fmt, ...) { const char *newfile; va_list adx; /* * Get rid of annoying "../common/" prefix to filename. */ newfile = strrchr(file, '/'); if (newfile != NULL) { newfile = newfile + 1; /* Get rid of leading / */ } else { newfile = file; } if (dprint) { /* dprintf messages are printed immediately */ if (!dprintf_print_all && !dprintf_find_string(newfile) && !dprintf_find_string(func)) return; /* Print out just the function name if requested */ flockfile(stdout); if (dprintf_find_string("pid")) (void) printf("%d ", getpid()); if (dprintf_find_string("tid")) (void) printf("%ju ", (uintmax_t)(uintptr_t)pthread_self()); if (dprintf_find_string("cpu")) (void) printf("%u ", getcpuid()); if (dprintf_find_string("time")) (void) printf("%llu ", gethrtime()); if (dprintf_find_string("long")) (void) printf("%s, line %d: ", newfile, line); (void) printf("dprintf: %s: ", func); va_start(adx, fmt); (void) vprintf(fmt, adx); va_end(adx); funlockfile(stdout); } else { /* zfs_dbgmsg is logged for dumping later */ size_t size; char *buf; int i; size = 1024; buf = umem_alloc(size, UMEM_NOFAIL); i = snprintf(buf, size, "%s:%d:%s(): ", newfile, line, func); if (i < size) { va_start(adx, fmt); (void) vsnprintf(buf + i, size - i, fmt, adx); va_end(adx); } __zfs_dbgmsg(buf); umem_free(buf, size); } } /* * ========================================================================= * cmn_err() and panic() * ========================================================================= */ static char ce_prefix[CE_IGNORE][10] = { "", "NOTICE: ", "WARNING: ", "" }; static char ce_suffix[CE_IGNORE][2] = { "", "\n", "\n", "" }; void vpanic(const char *fmt, va_list adx) { (void) fprintf(stderr, "error: "); (void) vfprintf(stderr, fmt, adx); (void) fprintf(stderr, "\n"); abort(); /* think of it as a "user-level crash dump" */ } void panic(const char *fmt, ...) { va_list adx; va_start(adx, fmt); vpanic(fmt, adx); va_end(adx); } void vcmn_err(int ce, const char *fmt, va_list adx) { if (ce == CE_PANIC) vpanic(fmt, adx); if (ce != CE_NOTE) { /* suppress noise in userland stress testing */ (void) fprintf(stderr, "%s", ce_prefix[ce]); (void) vfprintf(stderr, fmt, adx); (void) fprintf(stderr, "%s", ce_suffix[ce]); } } /*PRINTFLIKE2*/ void cmn_err(int ce, const char *fmt, ...) { va_list adx; va_start(adx, fmt); vcmn_err(ce, fmt, adx); va_end(adx); } /* * ========================================================================= * misc routines * ========================================================================= */ void delay(clock_t ticks) { (void) poll(0, 0, ticks * (1000 / hz)); } /* * Find highest one bit set. * Returns bit number + 1 of highest bit that is set, otherwise returns 0. * The __builtin_clzll() function is supported by both GCC and Clang. */ int highbit64(uint64_t i) { if (i == 0) return (0); return (NBBY * sizeof (uint64_t) - __builtin_clzll(i)); } /* * Find lowest one bit set. * Returns bit number + 1 of lowest bit that is set, otherwise returns 0. * The __builtin_ffsll() function is supported by both GCC and Clang. */ int lowbit64(uint64_t i) { if (i == 0) return (0); return (__builtin_ffsll(i)); } char *random_path = "/dev/random"; char *urandom_path = "/dev/urandom"; static int random_fd = -1, urandom_fd = -1; void random_init(void) { VERIFY((random_fd = open(random_path, O_RDONLY)) != -1); VERIFY((urandom_fd = open(urandom_path, O_RDONLY)) != -1); } void random_fini(void) { close(random_fd); close(urandom_fd); random_fd = -1; urandom_fd = -1; } static int random_get_bytes_common(uint8_t *ptr, size_t len, int fd) { size_t resid = len; ssize_t bytes; ASSERT(fd != -1); while (resid != 0) { bytes = read(fd, ptr, resid); ASSERT3S(bytes, >=, 0); ptr += bytes; resid -= bytes; } return (0); } int random_get_bytes(uint8_t *ptr, size_t len) { return (random_get_bytes_common(ptr, len, random_fd)); } int random_get_pseudo_bytes(uint8_t *ptr, size_t len) { return (random_get_bytes_common(ptr, len, urandom_fd)); } int ddi_strtoul(const char *hw_serial, char **nptr, int base, unsigned long *result) { char *end; *result = strtoul(hw_serial, &end, base); if (*result == 0) return (errno); return (0); } int ddi_strtoull(const char *str, char **nptr, int base, u_longlong_t *result) { char *end; *result = strtoull(str, &end, base); if (*result == 0) return (errno); return (0); } utsname_t * utsname(void) { return (&hw_utsname); } /* * ========================================================================= * kernel emulation setup & teardown * ========================================================================= */ static int umem_out_of_memory(void) { char errmsg[] = "out of memory -- generating core dump\n"; (void) fprintf(stderr, "%s", errmsg); abort(); return (0); } void kernel_init(int mode) { extern uint_t rrw_tsd_key; umem_nofail_callback(umem_out_of_memory); physmem = sysconf(_SC_PHYS_PAGES); dprintf("physmem = %llu pages (%.2f GB)\n", physmem, (double)physmem * sysconf(_SC_PAGE_SIZE) / (1ULL << 30)); (void) snprintf(hw_serial, sizeof (hw_serial), "%ld", (mode & SPA_MODE_WRITE) ? get_system_hostid() : 0); random_init(); VERIFY0(uname(&hw_utsname)); system_taskq_init(); icp_init(); zstd_init(); spa_init((spa_mode_t)mode); fletcher_4_init(); tsd_create(&rrw_tsd_key, rrw_tsd_destroy); } void kernel_fini(void) { fletcher_4_fini(); spa_fini(); zstd_fini(); icp_fini(); system_taskq_fini(); random_fini(); } uid_t crgetuid(cred_t *cr) { return (0); } uid_t crgetruid(cred_t *cr) { return (0); } gid_t crgetgid(cred_t *cr) { return (0); } int crgetngroups(cred_t *cr) { return (0); } gid_t * crgetgroups(cred_t *cr) { return (NULL); } int zfs_secpolicy_snapshot_perms(const char *name, cred_t *cr) { return (0); } int zfs_secpolicy_rename_perms(const char *from, const char *to, cred_t *cr) { return (0); } int zfs_secpolicy_destroy_perms(const char *name, cred_t *cr) { return (0); } int secpolicy_zfs(const cred_t *cr) { return (0); } int secpolicy_zfs_proc(const cred_t *cr, proc_t *proc) { return (0); } ksiddomain_t * ksid_lookupdomain(const char *dom) { ksiddomain_t *kd; kd = umem_zalloc(sizeof (ksiddomain_t), UMEM_NOFAIL); kd->kd_name = spa_strdup(dom); return (kd); } void ksiddomain_rele(ksiddomain_t *ksid) { spa_strfree(ksid->kd_name); umem_free(ksid, sizeof (ksiddomain_t)); } char * kmem_vasprintf(const char *fmt, va_list adx) { char *buf = NULL; va_list adx_copy; va_copy(adx_copy, adx); VERIFY(vasprintf(&buf, fmt, adx_copy) != -1); va_end(adx_copy); return (buf); } char * kmem_asprintf(const char *fmt, ...) { char *buf = NULL; va_list adx; va_start(adx, fmt); VERIFY(vasprintf(&buf, fmt, adx) != -1); va_end(adx); return (buf); } /* ARGSUSED */ int zfs_onexit_fd_hold(int fd, minor_t *minorp) { *minorp = 0; return (0); } /* ARGSUSED */ void zfs_onexit_fd_rele(int fd) { } /* ARGSUSED */ int zfs_onexit_add_cb(minor_t minor, void (*func)(void *), void *data, uint64_t *action_handle) { return (0); } fstrans_cookie_t spl_fstrans_mark(void) { return ((fstrans_cookie_t)0); } void spl_fstrans_unmark(fstrans_cookie_t cookie) { } int __spl_pf_fstrans_check(void) { return (0); } int kmem_cache_reap_active(void) { return (0); } void *zvol_tag = "zvol_tag"; void zvol_create_minor(const char *name) { } void zvol_create_minors_recursive(const char *name) { } void zvol_remove_minors(spa_t *spa, const char *name, boolean_t async) { } void zvol_rename_minors(spa_t *spa, const char *oldname, const char *newname, boolean_t async) { } /* * Open file * * path - fully qualified path to file * flags - file attributes O_READ / O_WRITE / O_EXCL * fpp - pointer to return file pointer * * Returns 0 on success underlying error on failure. */ int zfs_file_open(const char *path, int flags, int mode, zfs_file_t **fpp) { int fd = -1; int dump_fd = -1; int err; int old_umask = 0; zfs_file_t *fp; struct stat64 st; if (!(flags & O_CREAT) && stat64(path, &st) == -1) return (errno); if (!(flags & O_CREAT) && S_ISBLK(st.st_mode)) flags |= O_DIRECT; if (flags & O_CREAT) old_umask = umask(0); fd = open64(path, flags, mode); if (fd == -1) return (errno); if (flags & O_CREAT) (void) umask(old_umask); if (vn_dumpdir != NULL) { char *dumppath = umem_zalloc(MAXPATHLEN, UMEM_NOFAIL); char *inpath = basename((char *)(uintptr_t)path); (void) snprintf(dumppath, MAXPATHLEN, "%s/%s", vn_dumpdir, inpath); dump_fd = open64(dumppath, O_CREAT | O_WRONLY, 0666); umem_free(dumppath, MAXPATHLEN); if (dump_fd == -1) { err = errno; close(fd); return (err); } } else { dump_fd = -1; } (void) fcntl(fd, F_SETFD, FD_CLOEXEC); fp = umem_zalloc(sizeof (zfs_file_t), UMEM_NOFAIL); fp->f_fd = fd; fp->f_dump_fd = dump_fd; *fpp = fp; return (0); } void zfs_file_close(zfs_file_t *fp) { close(fp->f_fd); if (fp->f_dump_fd != -1) close(fp->f_dump_fd); umem_free(fp, sizeof (zfs_file_t)); } /* * Stateful write - use os internal file pointer to determine where to * write and update on successful completion. * * fp - pointer to file (pipe, socket, etc) to write to * buf - buffer to write * count - # of bytes to write * resid - pointer to count of unwritten bytes (if short write) * * Returns 0 on success errno on failure. */ int zfs_file_write(zfs_file_t *fp, const void *buf, size_t count, ssize_t *resid) { ssize_t rc; rc = write(fp->f_fd, buf, count); if (rc < 0) return (errno); if (resid) { *resid = count - rc; } else if (rc != count) { return (EIO); } return (0); } /* * Stateless write - os internal file pointer is not updated. * * fp - pointer to file (pipe, socket, etc) to write to * buf - buffer to write * count - # of bytes to write * off - file offset to write to (only valid for seekable types) * resid - pointer to count of unwritten bytes * * Returns 0 on success errno on failure. */ int zfs_file_pwrite(zfs_file_t *fp, const void *buf, size_t count, loff_t pos, ssize_t *resid) { ssize_t rc, split, done; int sectors; /* * To simulate partial disk writes, we split writes into two * system calls so that the process can be killed in between. * This is used by ztest to simulate realistic failure modes. */ sectors = count >> SPA_MINBLOCKSHIFT; split = (sectors > 0 ? rand() % sectors : 0) << SPA_MINBLOCKSHIFT; rc = pwrite64(fp->f_fd, buf, split, pos); if (rc != -1) { done = rc; rc = pwrite64(fp->f_fd, (char *)buf + split, count - split, pos + split); } #ifdef __linux__ if (rc == -1 && errno == EINVAL) { /* * Under Linux, this most likely means an alignment issue * (memory or disk) due to O_DIRECT, so we abort() in order * to catch the offender. */ abort(); } #endif if (rc < 0) return (errno); done += rc; if (resid) { *resid = count - done; } else if (done != count) { return (EIO); } return (0); } /* * Stateful read - use os internal file pointer to determine where to * read and update on successful completion. * * fp - pointer to file (pipe, socket, etc) to read from * buf - buffer to write * count - # of bytes to read * resid - pointer to count of unread bytes (if short read) * * Returns 0 on success errno on failure. */ int zfs_file_read(zfs_file_t *fp, void *buf, size_t count, ssize_t *resid) { int rc; rc = read(fp->f_fd, buf, count); if (rc < 0) return (errno); if (resid) { *resid = count - rc; } else if (rc != count) { return (EIO); } return (0); } /* * Stateless read - os internal file pointer is not updated. * * fp - pointer to file (pipe, socket, etc) to read from * buf - buffer to write * count - # of bytes to write * off - file offset to read from (only valid for seekable types) * resid - pointer to count of unwritten bytes (if short write) * * Returns 0 on success errno on failure. */ int zfs_file_pread(zfs_file_t *fp, void *buf, size_t count, loff_t off, ssize_t *resid) { ssize_t rc; rc = pread64(fp->f_fd, buf, count, off); if (rc < 0) { #ifdef __linux__ /* * Under Linux, this most likely means an alignment issue * (memory or disk) due to O_DIRECT, so we abort() in order to * catch the offender. */ if (errno == EINVAL) abort(); #endif return (errno); } if (fp->f_dump_fd != -1) { int status; status = pwrite64(fp->f_dump_fd, buf, rc, off); ASSERT(status != -1); } if (resid) { *resid = count - rc; } else if (rc != count) { return (EIO); } return (0); } /* * lseek - set / get file pointer * * fp - pointer to file (pipe, socket, etc) to read from * offp - value to seek to, returns current value plus passed offset * whence - see man pages for standard lseek whence values * * Returns 0 on success errno on failure (ESPIPE for non seekable types) */ int zfs_file_seek(zfs_file_t *fp, loff_t *offp, int whence) { loff_t rc; rc = lseek(fp->f_fd, *offp, whence); if (rc < 0) return (errno); *offp = rc; return (0); } /* * Get file attributes * * filp - file pointer * zfattr - pointer to file attr structure * * Currently only used for fetching size and file mode * * Returns 0 on success or error code of underlying getattr call on failure. */ int zfs_file_getattr(zfs_file_t *fp, zfs_file_attr_t *zfattr) { struct stat64 st; if (fstat64_blk(fp->f_fd, &st) == -1) return (errno); zfattr->zfa_size = st.st_size; zfattr->zfa_mode = st.st_mode; return (0); } /* * Sync file to disk * * filp - file pointer * flags - O_SYNC and or O_DSYNC * * Returns 0 on success or error code of underlying sync call on failure. */ int zfs_file_fsync(zfs_file_t *fp, int flags) { int rc; rc = fsync(fp->f_fd); if (rc < 0) return (errno); return (0); } /* * fallocate - allocate or free space on disk * * fp - file pointer * mode (non-standard options for hole punching etc) * offset - offset to start allocating or freeing from * len - length to free / allocate * * OPTIONAL */ int zfs_file_fallocate(zfs_file_t *fp, int mode, loff_t offset, loff_t len) { #ifdef __linux__ return (fallocate(fp->f_fd, mode, offset, len)); #else return (EOPNOTSUPP); #endif } /* * Request current file pointer offset * * fp - pointer to file * * Returns current file offset. */ loff_t zfs_file_off(zfs_file_t *fp) { return (lseek(fp->f_fd, SEEK_CUR, 0)); } /* * unlink file * * path - fully qualified file path * * Returns 0 on success. * * OPTIONAL */ int zfs_file_unlink(const char *path) { return (remove(path)); } /* * Get reference to file pointer * * fd - input file descriptor * fpp - pointer to file pointer * * Returns 0 on success EBADF on failure. * Unsupported in user space. */ int zfs_file_get(int fd, zfs_file_t **fpp) { abort(); return (EOPNOTSUPP); } /* * Drop reference to file pointer * * fd - input file descriptor * * Unsupported in user space. */ void zfs_file_put(int fd) { abort(); } void zfsvfs_update_fromname(const char *oldname, const char *newname) { } Index: vendor-sys/openzfs/dist/man/man8/zfs-userspace.8 =================================================================== --- vendor-sys/openzfs/dist/man/man8/zfs-userspace.8 (revision 366347) +++ vendor-sys/openzfs/dist/man/man8/zfs-userspace.8 (revision 366348) @@ -1,186 +1,187 @@ .\" .\" CDDL HEADER START .\" .\" The contents of this file are subject to the terms of the .\" Common Development and Distribution License (the "License"). .\" You may not use this file except in compliance with the License. .\" .\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE .\" or http://www.opensolaris.org/os/licensing. .\" See the License for the specific language governing permissions .\" and limitations under the License. .\" .\" When distributing Covered Code, include this CDDL HEADER in each .\" file and include the License file at usr/src/OPENSOLARIS.LICENSE. .\" If applicable, add the following below this CDDL HEADER, with the .\" fields enclosed by brackets "[]" replaced with your own identifying .\" information: Portions Copyright [yyyy] [name of copyright owner] .\" .\" CDDL HEADER END .\" .\" .\" Copyright (c) 2009 Sun Microsystems, Inc. All Rights Reserved. .\" Copyright 2011 Joshua M. Clulow .\" Copyright (c) 2011, 2019 by Delphix. All rights reserved. .\" Copyright (c) 2013 by Saso Kiselkov. All rights reserved. .\" Copyright (c) 2014, Joyent, Inc. All rights reserved. .\" Copyright (c) 2014 by Adam Stevko. All rights reserved. .\" Copyright (c) 2014 Integros [integros.com] .\" Copyright 2019 Richard Laager. All rights reserved. .\" Copyright 2018 Nexenta Systems, Inc. .\" Copyright 2019 Joyent, Inc. .\" .Dd June 30, 2019 .Dt ZFS-USERSPACE 8 .Os .Sh NAME .Nm zfs Ns Pf - Cm userspace .Nd Displays space consumed by, and quotas on, each user or group in the specified filesystem or snapshot. .Sh SYNOPSIS .Nm .Cm userspace .Op Fl Hinp .Oo Fl o Ar field Ns Oo , Ns Ar field Oc Ns ... Oc .Oo Fl s Ar field Oc Ns ... .Oo Fl S Ar field Oc Ns ... .Oo Fl t Ar type Ns Oo , Ns Ar type Oc Ns ... Oc -.Ar filesystem Ns | Ns Ar snapshot +.Ar filesystem Ns | Ns Ar snapshot Ns | Ns Ar path .Nm .Cm groupspace .Op Fl Hinp .Oo Fl o Ar field Ns Oo , Ns Ar field Oc Ns ... Oc .Oo Fl s Ar field Oc Ns ... .Oo Fl S Ar field Oc Ns ... .Oo Fl t Ar type Ns Oo , Ns Ar type Oc Ns ... Oc -.Ar filesystem Ns | Ns Ar snapshot +.Ar filesystem Ns | Ns Ar snapshot Ns | Ns Ar path .Nm .Cm projectspace .Op Fl Hp .Oo Fl o Ar field Ns Oo , Ns Ar field Oc Ns ... Oc .Oo Fl s Ar field Oc Ns ... .Oo Fl S Ar field Oc Ns ... -.Ar filesystem Ns | Ns Ar snapshot +.Ar filesystem Ns | Ns Ar snapshot Ns | Ns Ar path .Sh DESCRIPTION .Bl -tag -width "" .It Xo .Nm .Cm userspace .Op Fl Hinp .Oo Fl o Ar field Ns Oo , Ns Ar field Oc Ns ... Oc .Oo Fl s Ar field Oc Ns ... .Oo Fl S Ar field Oc Ns ... .Oo Fl t Ar type Ns Oo , Ns Ar type Oc Ns ... Oc -.Ar filesystem Ns | Ns Ar snapshot +.Ar filesystem Ns | Ns Ar snapshot Ns | Ns Ar path .Xc -Displays space consumed by, and quotas on, each user in the specified filesystem -or snapshot. +Displays space consumed by, and quotas on, each user in the specified filesystem, +snapshot, or path. +If a path is given, the filesystem that contains that path will be used. This corresponds to the .Sy userused@ Ns Em user , .Sy userobjused@ Ns Em user , .Sy userquota@ Ns Em user, and .Sy userobjquota@ Ns Em user properties. .Bl -tag -width "-H" .It Fl H Do not print headers, use tab-delimited output. .It Fl S Ar field Sort by this field in reverse order. See .Fl s . .It Fl i Translate SID to POSIX ID. The POSIX ID may be ephemeral if no mapping exists. Normal POSIX interfaces .Po for example, .Xr stat 2 , .Nm ls Fl l .Pc perform this translation, so the .Fl i option allows the output from .Nm zfs Cm userspace to be compared directly with those utilities. However, .Fl i may lead to confusion if some files were created by an SMB user before a SMB-to-POSIX name mapping was established. In such a case, some files will be owned by the SMB entity and some by the POSIX entity. However, the .Fl i option will report that the POSIX entity has the total usage and quota for both. .It Fl n Print numeric ID instead of user/group name. .It Fl o Ar field Ns Oo , Ns Ar field Oc Ns ... Display only the specified fields from the following set: .Sy type , .Sy name , .Sy used , .Sy quota . The default is to display all fields. .It Fl p Use exact .Pq parsable numeric output. .It Fl s Ar field Sort output by this field. The .Fl s and .Fl S flags may be specified multiple times to sort first by one field, then by another. The default is .Fl s Sy type Fl s Sy name . .It Fl t Ar type Ns Oo , Ns Ar type Oc Ns ... Print only the specified types from the following set: .Sy all , .Sy posixuser , .Sy smbuser , .Sy posixgroup , .Sy smbgroup . The default is .Fl t Sy posixuser Ns \&, Ns Sy smbuser . The default can be changed to include group types. .El .It Xo .Nm .Cm groupspace .Op Fl Hinp .Oo Fl o Ar field Ns Oo , Ns Ar field Oc Ns ... Oc .Oo Fl s Ar field Oc Ns ... .Oo Fl S Ar field Oc Ns ... .Oo Fl t Ar type Ns Oo , Ns Ar type Oc Ns ... Oc .Ar filesystem Ns | Ns Ar snapshot .Xc Displays space consumed by, and quotas on, each group in the specified filesystem or snapshot. This subcommand is identical to .Cm userspace , except that the default types to display are .Fl t Sy posixgroup Ns \&, Ns Sy smbgroup . .It Xo .Nm .Cm projectspace .Op Fl Hp .Oo Fl o Ar field Ns Oo , Ns Ar field Oc Ns ... Oc .Oo Fl s Ar field Oc Ns ... .Oo Fl S Ar field Oc Ns ... -.Ar filesystem Ns | Ns Ar snapshot +.Ar filesystem Ns | Ns Ar snapshot Ns | Ns Ar path .Xc Displays space consumed by, and quotas on, each project in the specified filesystem or snapshot. This subcommand is identical to .Cm userspace , except that the project identifier is numeral, not name. So need neither the option .Sy -i for SID to POSIX ID nor .Sy -n for numeric ID, nor .Sy -t for types. .El .Sh SEE ALSO .Xr zfs-set 8 , .Xr zfsprops 8 Index: vendor-sys/openzfs/dist/man/man8/zpool-remove.8 =================================================================== --- vendor-sys/openzfs/dist/man/man8/zpool-remove.8 (revision 366347) +++ vendor-sys/openzfs/dist/man/man8/zpool-remove.8 (revision 366348) @@ -1,109 +1,111 @@ .\" .\" CDDL HEADER START .\" .\" The contents of this file are subject to the terms of the .\" Common Development and Distribution License (the "License"). .\" You may not use this file except in compliance with the License. .\" .\" You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE .\" or http://www.opensolaris.org/os/licensing. .\" See the License for the specific language governing permissions .\" and limitations under the License. .\" .\" When distributing Covered Code, include this CDDL HEADER in each .\" file and include the License file at usr/src/OPENSOLARIS.LICENSE. .\" If applicable, add the following below this CDDL HEADER, with the .\" fields enclosed by brackets "[]" replaced with your own identifying .\" information: Portions Copyright [yyyy] [name of copyright owner] .\" .\" CDDL HEADER END .\" .\" .\" Copyright (c) 2007, Sun Microsystems, Inc. All Rights Reserved. .\" Copyright (c) 2012, 2018 by Delphix. All rights reserved. .\" Copyright (c) 2012 Cyril Plisko. All Rights Reserved. .\" Copyright (c) 2017 Datto Inc. .\" Copyright (c) 2018 George Melikov. All Rights Reserved. .\" Copyright 2017 Nexenta Systems, Inc. .\" Copyright (c) 2017 Open-E, Inc. All Rights Reserved. .\" .Dd August 9, 2019 .Dt ZPOOL-REMOVE 8 .Os .Sh NAME .Nm zpool Ns Pf - Cm remove .Nd Remove a device from a ZFS storage pool .Sh SYNOPSIS .Nm .Cm remove .Op Fl npw .Ar pool Ar device Ns ... .Nm .Cm remove .Fl s .Ar pool .Sh DESCRIPTION .Bl -tag -width Ds .It Xo .Nm .Cm remove .Op Fl npw .Ar pool Ar device Ns ... .Xc Removes the specified device from the pool. This command supports removing hot spare, cache, log, and both mirrored and non-redundant primary top-level vdevs, including dedup and special vdevs. When the primary pool storage includes a top-level raidz vdev only hot spare, cache, and log devices can be removed. +Note that keys for all encrypted datasets must be loaded for top-level vdevs +to be removed. .sp Removing a top-level vdev reduces the total amount of space in the storage pool. The specified device will be evacuated by copying all allocated space from it to the other devices in the pool. In this case, the .Nm zpool Cm remove command initiates the removal and returns, while the evacuation continues in the background. The removal progress can be monitored with .Nm zpool Cm status . If an IO error is encountered during the removal process it will be cancelled. The .Sy device_removal feature flag must be enabled to remove a top-level vdev, see .Xr zpool-features 5 . .Pp A mirrored top-level device (log or data) can be removed by specifying the top-level mirror for the same. Non-log devices or data devices that are part of a mirrored configuration can be removed using the .Nm zpool Cm detach command. .Bl -tag -width Ds .It Fl n Do not actually perform the removal ("no-op"). Instead, print the estimated amount of memory that will be used by the mapping table after the removal completes. This is nonzero only for top-level vdevs. .El .Bl -tag -width Ds .It Fl p Used in conjunction with the .Fl n flag, displays numbers as parsable (exact) values. .It Fl w Waits until the removal has completed before returning. .El .It Xo .Nm .Cm remove .Fl s .Ar pool .Xc Stops and cancels an in-progress removal of a top-level vdev. .El .Sh SEE ALSO .Xr zpool-add 8 , .Xr zpool-detach 8 , .Xr zpool-offline 8 , .Xr zpool-labelclear 8 , .Xr zpool-replace 8 , .Xr zpool-split 8 Index: vendor-sys/openzfs/dist/module/lua/llimits.h =================================================================== --- vendor-sys/openzfs/dist/module/lua/llimits.h (revision 366347) +++ vendor-sys/openzfs/dist/module/lua/llimits.h (revision 366348) @@ -1,323 +1,314 @@ /* BEGIN CSTYLED */ /* ** $Id: llimits.h,v 1.103.1.1 2013/04/12 18:48:47 roberto Exp $ ** Limits, basic types, and some other `installation-dependent' definitions ** See Copyright Notice in lua.h */ #ifndef llimits_h #define llimits_h #include typedef unsigned LUA_INT32 lu_int32; typedef LUAI_UMEM lu_mem; typedef LUAI_MEM l_mem; /* chars used as small naturals (so that `char' is reserved for characters) */ typedef unsigned char lu_byte; #define MAX_SIZET ((size_t)(~(size_t)0)-2) #define MAX_LUMEM ((lu_mem)(~(lu_mem)0)-2) #define MAX_LMEM ((l_mem) ((MAX_LUMEM >> 1) - 2)) #define MAX_INT (INT_MAX-2) /* maximum value of an int (-2 for safety) */ /* ** conversion of pointer to integer ** this is for hashing only; there is no problem if the integer ** cannot hold the whole pointer value */ #define IntPoint(p) ((unsigned int)(lu_mem)(p)) /* type to ensure maximum alignment */ #if !defined(LUAI_USER_ALIGNMENT_T) #define LUAI_USER_ALIGNMENT_T union { double u; void *s; long l; } #endif typedef LUAI_USER_ALIGNMENT_T L_Umaxalign; /* result of a `usual argument conversion' over lua_Number */ typedef LUAI_UACNUMBER l_uacNumber; /* internal assertions for in-house debugging */ #if defined(lua_assert) #define check_exp(c,e) (lua_assert(c), (e)) /* to avoid problems with conditions too long */ #define lua_longassert(c) { if (!(c)) lua_assert(0); } #else #define lua_assert(c) ((void)0) #define check_exp(c,e) (e) #define lua_longassert(c) ((void)0) #endif /* ** assertion for checking API calls */ #if !defined(luai_apicheck) #if defined(LUA_USE_APICHECK) #include #define luai_apicheck(L,e) assert(e) #else #define luai_apicheck(L,e) lua_assert(e) #endif #endif #define api_check(l,e,msg) luai_apicheck(l,(e) && msg) #if !defined(UNUSED) #define UNUSED(x) ((void)(x)) /* to avoid warnings */ #endif #define cast(t, exp) ((t)(exp)) #define cast_byte(i) cast(lu_byte, (i)) #define cast_num(i) cast(lua_Number, (i)) #define cast_int(i) cast(int, (i)) #define cast_uchar(i) cast(unsigned char, (i)) /* ** non-return type ** ** Suppress noreturn attribute in kernel builds to avoid objtool check warnings */ #if defined(__GNUC__) && !defined(_KERNEL) #define l_noret void __attribute__((noreturn)) #elif defined(_MSC_VER) #define l_noret void __declspec(noreturn) #else #define l_noret void #endif /* ** maximum depth for nested C calls and syntactical nested non-terminals ** in a program. (Value must fit in an unsigned short int.) ** ** Note: On amd64 platform, the limit has been measured to be 45. We set ** the maximum lower to give a margin for changing the amount of stack ** used by various functions involved in parsing and executing code. */ #if !defined(LUAI_MAXCCALLS) #define LUAI_MAXCCALLS 20 #endif /* * Minimum amount of available stack space (in bytes) to make a C call. With * gsub() recursion, the stack space between each luaD_call() is 1256 bytes. */ -#if defined(__FreeBSD__) -/* - * FreeBSD needs a few extra bytes in unoptimized debug builds to avoid a - * double-fault handling the error when the max call depth is exceeded just - * before the C stack runs out. 64 bytes seems to do the trick. - */ -#define LUAI_MINCSTACK 4160 -#else #define LUAI_MINCSTACK 4096 -#endif /* ** maximum number of upvalues in a closure (both C and Lua). (Value ** must fit in an unsigned char.) */ #define MAXUPVAL UCHAR_MAX /* ** type for virtual-machine instructions ** must be an unsigned with (at least) 4 bytes (see details in lopcodes.h) */ typedef lu_int32 Instruction; /* maximum stack for a Lua function */ #define MAXSTACK 250 /* minimum size for the string table (must be power of 2) */ #if !defined(MINSTRTABSIZE) #define MINSTRTABSIZE 32 #endif /* minimum size for string buffer */ #if !defined(LUA_MINBUFFER) #define LUA_MINBUFFER 32 #endif #if !defined(lua_lock) #define lua_lock(L) ((void) 0) #define lua_unlock(L) ((void) 0) #endif #if !defined(luai_threadyield) #define luai_threadyield(L) {lua_unlock(L); lua_lock(L);} #endif /* ** these macros allow user-specific actions on threads when you defined ** LUAI_EXTRASPACE and need to do something extra when a thread is ** created/deleted/resumed/yielded. */ #if !defined(luai_userstateopen) #define luai_userstateopen(L) ((void)L) #endif #if !defined(luai_userstateclose) #define luai_userstateclose(L) ((void)L) #endif #if !defined(luai_userstatethread) #define luai_userstatethread(L,L1) ((void)L) #endif #if !defined(luai_userstatefree) #define luai_userstatefree(L,L1) ((void)L) #endif #if !defined(luai_userstateresume) #define luai_userstateresume(L,n) ((void)L) #endif #if !defined(luai_userstateyield) #define luai_userstateyield(L,n) ((void)L) #endif /* ** lua_number2int is a macro to convert lua_Number to int. ** lua_number2integer is a macro to convert lua_Number to lua_Integer. ** lua_number2unsigned is a macro to convert a lua_Number to a lua_Unsigned. ** lua_unsigned2number is a macro to convert a lua_Unsigned to a lua_Number. ** luai_hashnum is a macro to hash a lua_Number value into an integer. ** The hash must be deterministic and give reasonable values for ** both small and large values (outside the range of integers). */ #if defined(MS_ASMTRICK) || defined(LUA_MSASMTRICK) /* { */ /* trick with Microsoft assembler for X86 */ #define lua_number2int(i,n) __asm {__asm fld n __asm fistp i} #define lua_number2integer(i,n) lua_number2int(i, n) #define lua_number2unsigned(i,n) \ {__int64 l; __asm {__asm fld n __asm fistp l} i = (unsigned int)l;} #elif defined(LUA_IEEE754TRICK) /* }{ */ /* the next trick should work on any machine using IEEE754 with a 32-bit int type */ union luai_Cast { double l_d; LUA_INT32 l_p[2]; }; #if !defined(LUA_IEEEENDIAN) /* { */ #define LUAI_EXTRAIEEE \ static const union luai_Cast ieeeendian = {-(33.0 + 6755399441055744.0)}; #define LUA_IEEEENDIANLOC (ieeeendian.l_p[1] == 33) #else #define LUA_IEEEENDIANLOC LUA_IEEEENDIAN #define LUAI_EXTRAIEEE /* empty */ #endif /* } */ #define lua_number2int32(i,n,t) \ { LUAI_EXTRAIEEE \ volatile union luai_Cast u; u.l_d = (n) + 6755399441055744.0; \ (i) = (t)u.l_p[LUA_IEEEENDIANLOC]; } #define luai_hashnum(i,n) \ { volatile union luai_Cast u; u.l_d = (n) + 1.0; /* avoid -0 */ \ (i) = u.l_p[0]; (i) += u.l_p[1]; } /* add double bits for his hash */ #define lua_number2int(i,n) lua_number2int32(i, n, int) #define lua_number2unsigned(i,n) lua_number2int32(i, n, lua_Unsigned) /* the trick can be expanded to lua_Integer when it is a 32-bit value */ #if defined(LUA_IEEELL) #define lua_number2integer(i,n) lua_number2int32(i, n, lua_Integer) #endif #endif /* } */ /* the following definitions always work, but may be slow */ #if !defined(lua_number2int) #define lua_number2int(i,n) ((i)=(int)(n)) #endif #if !defined(lua_number2integer) #define lua_number2integer(i,n) ((i)=(lua_Integer)(n)) #endif #if !defined(lua_number2unsigned) /* { */ /* the following definition assures proper modulo behavior */ #if defined(LUA_NUMBER_DOUBLE) || defined(LUA_NUMBER_FLOAT) #include #define SUPUNSIGNED ((lua_Number)(~(lua_Unsigned)0) + 1) #define lua_number2unsigned(i,n) \ ((i)=(lua_Unsigned)((n) - floor((n)/SUPUNSIGNED)*SUPUNSIGNED)) #else #define lua_number2unsigned(i,n) ((i)=(lua_Unsigned)(n)) #endif #endif /* } */ #if !defined(lua_unsigned2number) /* on several machines, coercion from unsigned to double is slow, so it may be worth to avoid */ #define lua_unsigned2number(u) \ (((u) <= (lua_Unsigned)INT_MAX) ? (lua_Number)(int)(u) : (lua_Number)(u)) #endif #if defined(ltable_c) && !defined(luai_hashnum) #define luai_hashnum(i,n) (i = lcompat_hashnum(n)) #endif /* ** macro to control inclusion of some hard tests on stack reallocation */ #if !defined(HARDSTACKTESTS) #define condmovestack(L) ((void)0) #else /* realloc stack keeping its size */ #define condmovestack(L) luaD_reallocstack((L), (L)->stacksize) #endif #if !defined(HARDMEMTESTS) #define condchangemem(L) condmovestack(L) #else #define condchangemem(L) \ ((void)(!(G(L)->gcrunning) || (luaC_fullgc(L, 0), 1))) #endif #endif /* END CSTYLED */ Index: vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_kstat.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_kstat.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_kstat.c (revision 366348) @@ -1,519 +1,574 @@ /* * Copyright (c) 2007 Pawel Jakub Dawidek * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * Links to Illumos.org for more information on kstat function: * [1] https://illumos.org/man/1M/kstat * [2] https://illumos.org/man/9f/kstat_create */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include static MALLOC_DEFINE(M_KSTAT, "kstat_data", "Kernel statistics"); SYSCTL_ROOT_NODE(OID_AUTO, kstat, CTLFLAG_RW, 0, "Kernel statistics"); void __kstat_set_raw_ops(kstat_t *ksp, int (*headers)(char *buf, size_t size), int (*data)(char *buf, size_t size, void *data), void *(*addr)(kstat_t *ksp, loff_t index)) { ksp->ks_raw_ops.headers = headers; ksp->ks_raw_ops.data = data; ksp->ks_raw_ops.addr = addr; } +void +__kstat_set_seq_raw_ops(kstat_t *ksp, + int (*headers)(struct seq_file *f), + int (*data)(char *buf, size_t size, void *data), + void *(*addr)(kstat_t *ksp, loff_t index)) +{ + ksp->ks_raw_ops.seq_headers = headers; + ksp->ks_raw_ops.data = data; + ksp->ks_raw_ops.addr = addr; +} + static int kstat_default_update(kstat_t *ksp, int rw) { ASSERT(ksp != NULL); if (rw == KSTAT_WRITE) return (EACCES); return (0); } static int kstat_resize_raw(kstat_t *ksp) { if (ksp->ks_raw_bufsize == KSTAT_RAW_MAX) return (ENOMEM); free(ksp->ks_raw_buf, M_TEMP); ksp->ks_raw_bufsize = MIN(ksp->ks_raw_bufsize * 2, KSTAT_RAW_MAX); ksp->ks_raw_buf = malloc(ksp->ks_raw_bufsize, M_TEMP, M_WAITOK); return (0); } static void * kstat_raw_default_addr(kstat_t *ksp, loff_t n) { if (n == 0) return (ksp->ks_data); return (NULL); } static int kstat_sysctl(SYSCTL_HANDLER_ARGS) { kstat_t *ksp = arg1; kstat_named_t *ksent; uint64_t val; ksent = ksp->ks_data; /* Select the correct element */ ksent += arg2; /* Update the aggsums before reading */ (void) ksp->ks_update(ksp, KSTAT_READ); val = ksent->value.ui64; return (sysctl_handle_64(oidp, &val, 0, req)); } static int kstat_sysctl_string(SYSCTL_HANDLER_ARGS) { kstat_t *ksp = arg1; kstat_named_t *ksent = ksp->ks_data; char *val; uint32_t len = 0; /* Select the correct element */ ksent += arg2; /* Update the aggsums before reading */ (void) ksp->ks_update(ksp, KSTAT_READ); val = KSTAT_NAMED_STR_PTR(ksent); len = KSTAT_NAMED_STR_BUFLEN(ksent); val[len-1] = '\0'; return (sysctl_handle_string(oidp, val, len, req)); } static int kstat_sysctl_io(SYSCTL_HANDLER_ARGS) { struct sbuf *sb; kstat_t *ksp = arg1; kstat_io_t *kip = ksp->ks_data; int rc; sb = sbuf_new_auto(); if (sb == NULL) return (ENOMEM); /* Update the aggsums before reading */ (void) ksp->ks_update(ksp, KSTAT_READ); /* though wlentime & friends are signed, they will never be negative */ sbuf_printf(sb, "%-8llu %-8llu %-8u %-8u %-8llu %-8llu " "%-8llu %-8llu %-8llu %-8llu %-8u %-8u\n", kip->nread, kip->nwritten, kip->reads, kip->writes, kip->wtime, kip->wlentime, kip->wlastupdate, kip->rtime, kip->rlentime, kip->rlastupdate, kip->wcnt, kip->rcnt); rc = sbuf_finish(sb); if (rc == 0) rc = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb)); sbuf_delete(sb); return (rc); } static int kstat_sysctl_raw(SYSCTL_HANDLER_ARGS) { struct sbuf *sb; void *data; kstat_t *ksp = arg1; void *(*addr_op)(kstat_t *ksp, loff_t index); - int n, rc = 0; + int n, has_header, rc = 0; sb = sbuf_new_auto(); if (sb == NULL) return (ENOMEM); if (ksp->ks_raw_ops.addr) addr_op = ksp->ks_raw_ops.addr; else addr_op = kstat_raw_default_addr; mutex_enter(ksp->ks_lock); /* Update the aggsums before reading */ (void) ksp->ks_update(ksp, KSTAT_READ); ksp->ks_raw_bufsize = PAGE_SIZE; ksp->ks_raw_buf = malloc(PAGE_SIZE, M_TEMP, M_WAITOK); n = 0; + has_header = (ksp->ks_raw_ops.headers || + ksp->ks_raw_ops.seq_headers); + restart_headers: if (ksp->ks_raw_ops.headers) { rc = ksp->ks_raw_ops.headers( ksp->ks_raw_buf, ksp->ks_raw_bufsize); + } else if (ksp->ks_raw_ops.seq_headers) { + struct seq_file f; + + f.sf_buf = ksp->ks_raw_buf; + f.sf_size = ksp->ks_raw_bufsize; + rc = ksp->ks_raw_ops.seq_headers(&f); + } + if (has_header) { if (rc == ENOMEM && !kstat_resize_raw(ksp)) goto restart_headers; if (rc == 0) - sbuf_printf(sb, "%s", ksp->ks_raw_buf); + sbuf_printf(sb, "\n%s", ksp->ks_raw_buf); } while ((data = addr_op(ksp, n)) != NULL) { restart: if (ksp->ks_raw_ops.data) { rc = ksp->ks_raw_ops.data(ksp->ks_raw_buf, ksp->ks_raw_bufsize, data); if (rc == ENOMEM && !kstat_resize_raw(ksp)) goto restart; if (rc == 0) sbuf_printf(sb, "%s", ksp->ks_raw_buf); } else { ASSERT(ksp->ks_ndata == 1); sbuf_hexdump(sb, ksp->ks_data, ksp->ks_data_size, NULL, 0); } n++; } free(ksp->ks_raw_buf, M_TEMP); mutex_exit(ksp->ks_lock); rc = sbuf_finish(sb); if (rc == 0) rc = SYSCTL_OUT(req, sbuf_data(sb), sbuf_len(sb)); sbuf_delete(sb); return (rc); } kstat_t * __kstat_create(const char *module, int instance, const char *name, const char *class, uchar_t ks_type, uint_t ks_ndata, uchar_t flags) { + char buf[KSTAT_STRLEN]; struct sysctl_oid *root; kstat_t *ksp; + char *pool; KASSERT(instance == 0, ("instance=%d", instance)); if ((ks_type == KSTAT_TYPE_INTR) || (ks_type == KSTAT_TYPE_IO)) ASSERT(ks_ndata == 1); + if (class == NULL) + class = "misc"; + /* - * Allocate the main structure. We don't need to copy module/class/name - * stuff in here, because it is only used for sysctl node creation + * Allocate the main structure. We don't need to keep a copy of + * module in here, because it is only used for sysctl node creation * done in this function. */ ksp = malloc(sizeof (*ksp), M_KSTAT, M_WAITOK|M_ZERO); ksp->ks_crtime = gethrtime(); ksp->ks_snaptime = ksp->ks_crtime; ksp->ks_instance = instance; - strncpy(ksp->ks_name, name, KSTAT_STRLEN); - strncpy(ksp->ks_class, class, KSTAT_STRLEN); + (void) strlcpy(ksp->ks_name, name, KSTAT_STRLEN); + (void) strlcpy(ksp->ks_class, class, KSTAT_STRLEN); ksp->ks_type = ks_type; ksp->ks_flags = flags; ksp->ks_update = kstat_default_update; mutex_init(&ksp->ks_private_lock, NULL, MUTEX_DEFAULT, NULL); ksp->ks_lock = &ksp->ks_private_lock; switch (ksp->ks_type) { - case KSTAT_TYPE_RAW: - ksp->ks_ndata = 1; - ksp->ks_data_size = ks_ndata; - break; - case KSTAT_TYPE_NAMED: - ksp->ks_ndata = ks_ndata; - ksp->ks_data_size = ks_ndata * sizeof (kstat_named_t); - break; - case KSTAT_TYPE_INTR: - ksp->ks_ndata = ks_ndata; - ksp->ks_data_size = ks_ndata * sizeof (kstat_intr_t); - break; - case KSTAT_TYPE_IO: - ksp->ks_ndata = ks_ndata; - ksp->ks_data_size = ks_ndata * sizeof (kstat_io_t); - break; - case KSTAT_TYPE_TIMER: - ksp->ks_ndata = ks_ndata; - ksp->ks_data_size = ks_ndata * sizeof (kstat_timer_t); - break; - default: - panic("Undefined kstat type %d\n", ksp->ks_type); + case KSTAT_TYPE_RAW: + ksp->ks_ndata = 1; + ksp->ks_data_size = ks_ndata; + break; + case KSTAT_TYPE_NAMED: + ksp->ks_ndata = ks_ndata; + ksp->ks_data_size = ks_ndata * sizeof (kstat_named_t); + break; + case KSTAT_TYPE_INTR: + ksp->ks_ndata = ks_ndata; + ksp->ks_data_size = ks_ndata * sizeof (kstat_intr_t); + break; + case KSTAT_TYPE_IO: + ksp->ks_ndata = ks_ndata; + ksp->ks_data_size = ks_ndata * sizeof (kstat_io_t); + break; + case KSTAT_TYPE_TIMER: + ksp->ks_ndata = ks_ndata; + ksp->ks_data_size = ks_ndata * sizeof (kstat_timer_t); + break; + default: + panic("Undefined kstat type %d\n", ksp->ks_type); } if (ksp->ks_flags & KSTAT_FLAG_VIRTUAL) { ksp->ks_data = NULL; } else { ksp->ks_data = kmem_zalloc(ksp->ks_data_size, KM_SLEEP); if (ksp->ks_data == NULL) { kmem_free(ksp, sizeof (*ksp)); ksp = NULL; } } + /* + * Some kstats use a module name like "zfs/poolname" to distinguish a + * set of kstats belonging to a specific pool. Split on '/' to add an + * extra node for the pool name if needed. + */ + (void) strlcpy(buf, module, KSTAT_STRLEN); + module = buf; + pool = strchr(module, '/'); + if (pool != NULL) + *pool++ = '\0'; + + /* * Create sysctl tree for those statistics: * - * kstat.... + * kstat.[.].. */ sysctl_ctx_init(&ksp->ks_sysctl_ctx); root = SYSCTL_ADD_NODE(&ksp->ks_sysctl_ctx, SYSCTL_STATIC_CHILDREN(_kstat), OID_AUTO, module, CTLFLAG_RW, 0, ""); if (root == NULL) { printf("%s: Cannot create kstat.%s tree!\n", __func__, module); sysctl_ctx_free(&ksp->ks_sysctl_ctx); free(ksp, M_KSTAT); return (NULL); } + if (pool != NULL) { + root = SYSCTL_ADD_NODE(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(root), OID_AUTO, pool, CTLFLAG_RW, 0, ""); + if (root == NULL) { + printf("%s: Cannot create kstat.%s.%s tree!\n", + __func__, module, pool); + sysctl_ctx_free(&ksp->ks_sysctl_ctx); + free(ksp, M_KSTAT); + return (NULL); + } + } root = SYSCTL_ADD_NODE(&ksp->ks_sysctl_ctx, SYSCTL_CHILDREN(root), OID_AUTO, class, CTLFLAG_RW, 0, ""); if (root == NULL) { - printf("%s: Cannot create kstat.%s.%s tree!\n", __func__, - module, class); + if (pool != NULL) + printf("%s: Cannot create kstat.%s.%s.%s tree!\n", + __func__, module, pool, class); + else + printf("%s: Cannot create kstat.%s.%s tree!\n", + __func__, module, class); sysctl_ctx_free(&ksp->ks_sysctl_ctx); free(ksp, M_KSTAT); return (NULL); } if (ksp->ks_type == KSTAT_TYPE_NAMED) { root = SYSCTL_ADD_NODE(&ksp->ks_sysctl_ctx, SYSCTL_CHILDREN(root), OID_AUTO, name, CTLFLAG_RW, 0, ""); if (root == NULL) { - printf("%s: Cannot create kstat.%s.%s.%s tree!\n", - __func__, module, class, name); + if (pool != NULL) + printf("%s: Cannot create kstat.%s.%s.%s.%s " + "tree!\n", __func__, module, pool, class, + name); + else + printf("%s: Cannot create kstat.%s.%s.%s " + "tree!\n", __func__, module, class, name); sysctl_ctx_free(&ksp->ks_sysctl_ctx); free(ksp, M_KSTAT); return (NULL); } } ksp->ks_sysctl_root = root; return (ksp); } static void kstat_install_named(kstat_t *ksp) { kstat_named_t *ksent; char *namelast; int typelast; ksent = ksp->ks_data; VERIFY((ksp->ks_flags & KSTAT_FLAG_VIRTUAL) || ksent != NULL); typelast = 0; namelast = NULL; for (int i = 0; i < ksp->ks_ndata; i++, ksent++) { if (ksent->data_type != 0) { typelast = ksent->data_type; namelast = ksent->name; } switch (typelast) { - case KSTAT_DATA_CHAR: - /* Not Implemented */ - break; - case KSTAT_DATA_INT32: - SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, namelast, - CTLTYPE_S32 | CTLFLAG_RD, ksp, i, - kstat_sysctl, "I", namelast); - break; - case KSTAT_DATA_UINT32: - SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, namelast, - CTLTYPE_U32 | CTLFLAG_RD, ksp, i, - kstat_sysctl, "IU", namelast); - break; - case KSTAT_DATA_INT64: - SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, namelast, - CTLTYPE_S64 | CTLFLAG_RD, ksp, i, - kstat_sysctl, "Q", namelast); - break; - case KSTAT_DATA_UINT64: - SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, namelast, - CTLTYPE_U64 | CTLFLAG_RD, ksp, i, - kstat_sysctl, "QU", namelast); - break; - case KSTAT_DATA_LONG: - SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, namelast, - CTLTYPE_LONG | CTLFLAG_RD, ksp, i, - kstat_sysctl, "L", namelast); - break; - case KSTAT_DATA_ULONG: - SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, namelast, - CTLTYPE_ULONG | CTLFLAG_RD, ksp, i, - kstat_sysctl, "LU", namelast); - break; - case KSTAT_DATA_STRING: - SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, namelast, - CTLTYPE_STRING | CTLFLAG_RD, ksp, i, - kstat_sysctl_string, "A", namelast); - break; - default: - panic("unsupported type: %d", typelast); + case KSTAT_DATA_CHAR: + /* Not Implemented */ + break; + case KSTAT_DATA_INT32: + SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, namelast, + CTLTYPE_S32 | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, i, kstat_sysctl, "I", namelast); + break; + case KSTAT_DATA_UINT32: + SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, namelast, + CTLTYPE_U32 | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, i, kstat_sysctl, "IU", namelast); + break; + case KSTAT_DATA_INT64: + SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, namelast, + CTLTYPE_S64 | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, i, kstat_sysctl, "Q", namelast); + break; + case KSTAT_DATA_UINT64: + SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, namelast, + CTLTYPE_U64 | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, i, kstat_sysctl, "QU", namelast); + break; + case KSTAT_DATA_LONG: + SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, namelast, + CTLTYPE_LONG | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, i, kstat_sysctl, "L", namelast); + break; + case KSTAT_DATA_ULONG: + SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, namelast, + CTLTYPE_ULONG | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, i, kstat_sysctl, "LU", namelast); + break; + case KSTAT_DATA_STRING: + SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, namelast, + CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, i, kstat_sysctl_string, "A", namelast); + break; + default: + panic("unsupported type: %d", typelast); } - } - } void kstat_install(kstat_t *ksp) { struct sysctl_oid *root; if (ksp->ks_ndata == UINT32_MAX) VERIFY(ksp->ks_type == KSTAT_TYPE_RAW); switch (ksp->ks_type) { - case KSTAT_TYPE_NAMED: - return (kstat_install_named(ksp)); - break; - case KSTAT_TYPE_RAW: - if (ksp->ks_raw_ops.data) { - root = SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, ksp->ks_name, - CTLTYPE_STRING | CTLFLAG_RD, ksp, 0, - kstat_sysctl_raw, "A", ksp->ks_name); - } else { - root = SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, - SYSCTL_CHILDREN(ksp->ks_sysctl_root), - OID_AUTO, ksp->ks_name, - CTLTYPE_OPAQUE | CTLFLAG_RD, ksp, 0, - kstat_sysctl_raw, "", ksp->ks_name); - } - VERIFY(root != NULL); - break; - case KSTAT_TYPE_IO: + case KSTAT_TYPE_NAMED: + return (kstat_install_named(ksp)); + case KSTAT_TYPE_RAW: + if (ksp->ks_raw_ops.data) { root = SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, SYSCTL_CHILDREN(ksp->ks_sysctl_root), OID_AUTO, ksp->ks_name, - CTLTYPE_STRING | CTLFLAG_RD, ksp, 0, - kstat_sysctl_io, "A", ksp->ks_name); - break; - case KSTAT_TYPE_TIMER: - case KSTAT_TYPE_INTR: - default: - panic("unsupported kstat type %d\n", ksp->ks_type); + CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, 0, kstat_sysctl_raw, "A", ksp->ks_name); + } else { + root = SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, ksp->ks_name, + CTLTYPE_OPAQUE | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, 0, kstat_sysctl_raw, "", ksp->ks_name); + } + break; + case KSTAT_TYPE_IO: + root = SYSCTL_ADD_PROC(&ksp->ks_sysctl_ctx, + SYSCTL_CHILDREN(ksp->ks_sysctl_root), + OID_AUTO, ksp->ks_name, + CTLTYPE_STRING | CTLFLAG_RD | CTLFLAG_MPSAFE, + ksp, 0, kstat_sysctl_io, "A", ksp->ks_name); + break; + case KSTAT_TYPE_TIMER: + case KSTAT_TYPE_INTR: + default: + panic("unsupported kstat type %d\n", ksp->ks_type); } + VERIFY(root != NULL); ksp->ks_sysctl_root = root; - } void kstat_delete(kstat_t *ksp) { sysctl_ctx_free(&ksp->ks_sysctl_ctx); ksp->ks_lock = NULL; mutex_destroy(&ksp->ks_private_lock); free(ksp, M_KSTAT); } void kstat_waitq_enter(kstat_io_t *kiop) { hrtime_t new, delta; ulong_t wcnt; new = gethrtime(); delta = new - kiop->wlastupdate; kiop->wlastupdate = new; wcnt = kiop->wcnt++; if (wcnt != 0) { kiop->wlentime += delta * wcnt; kiop->wtime += delta; } } void kstat_waitq_exit(kstat_io_t *kiop) { hrtime_t new, delta; ulong_t wcnt; new = gethrtime(); delta = new - kiop->wlastupdate; kiop->wlastupdate = new; wcnt = kiop->wcnt--; ASSERT((int)wcnt > 0); kiop->wlentime += delta * wcnt; kiop->wtime += delta; } void kstat_runq_enter(kstat_io_t *kiop) { hrtime_t new, delta; ulong_t rcnt; new = gethrtime(); delta = new - kiop->rlastupdate; kiop->rlastupdate = new; rcnt = kiop->rcnt++; if (rcnt != 0) { kiop->rlentime += delta * rcnt; kiop->rtime += delta; } } void kstat_runq_exit(kstat_io_t *kiop) { hrtime_t new, delta; ulong_t rcnt; new = gethrtime(); delta = new - kiop->rlastupdate; kiop->rlastupdate = new; rcnt = kiop->rcnt--; ASSERT((int)rcnt > 0); kiop->rlentime += delta * rcnt; kiop->rtime += delta; } Index: vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_procfs_list.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_procfs_list.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_procfs_list.c (revision 366348) @@ -1,79 +1,161 @@ /* * Copyright (c) 2020 iXsystems, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include +typedef struct procfs_list_iter { + procfs_list_t *pli_pl; + void *pli_elt; +} pli_t; + void -seq_printf(struct seq_file *m, const char *fmt, ...) -{} +seq_printf(struct seq_file *f, const char *fmt, ...) +{ + va_list adx; + va_start(adx, fmt); + (void) vsnprintf(f->sf_buf, f->sf_size, fmt, adx); + va_end(adx); +} + +static int +procfs_list_update(kstat_t *ksp, int rw) +{ + procfs_list_t *pl = ksp->ks_private; + + if (rw == KSTAT_WRITE) + pl->pl_clear(pl); + + return (0); +} + +static int +procfs_list_data(char *buf, size_t size, void *data) +{ + pli_t *p; + void *elt; + procfs_list_t *pl; + struct seq_file f; + + p = data; + pl = p->pli_pl; + elt = p->pli_elt; + free(p, M_TEMP); + f.sf_buf = buf; + f.sf_size = size; + return (pl->pl_show(&f, elt)); +} + +static void * +procfs_list_addr(kstat_t *ksp, loff_t n) +{ + procfs_list_t *pl = ksp->ks_private; + void *elt = ksp->ks_private1; + pli_t *p = NULL; + + + if (n == 0) + ksp->ks_private1 = list_head(&pl->pl_list); + else if (elt) + ksp->ks_private1 = list_next(&pl->pl_list, elt); + + if (ksp->ks_private1) { + p = malloc(sizeof (*p), M_TEMP, M_WAITOK); + p->pli_pl = pl; + p->pli_elt = ksp->ks_private1; + } + + return (p); +} + void procfs_list_install(const char *module, + const char *submodule, const char *name, mode_t mode, procfs_list_t *procfs_list, int (*show)(struct seq_file *f, void *p), int (*show_header)(struct seq_file *f), int (*clear)(procfs_list_t *procfs_list), size_t procfs_list_node_off) { + kstat_t *procfs_kstat; + mutex_init(&procfs_list->pl_lock, NULL, MUTEX_DEFAULT, NULL); list_create(&procfs_list->pl_list, procfs_list_node_off + sizeof (procfs_list_node_t), procfs_list_node_off + offsetof(procfs_list_node_t, pln_link)); + procfs_list->pl_show = show; + procfs_list->pl_show_header = show_header; + procfs_list->pl_clear = clear; procfs_list->pl_next_id = 1; procfs_list->pl_node_offset = procfs_list_node_off; + + procfs_kstat = kstat_create(module, 0, name, submodule, + KSTAT_TYPE_RAW, 0, KSTAT_FLAG_VIRTUAL); + + if (procfs_kstat) { + procfs_kstat->ks_lock = &procfs_list->pl_lock; + procfs_kstat->ks_ndata = UINT32_MAX; + procfs_kstat->ks_private = procfs_list; + procfs_kstat->ks_update = procfs_list_update; + kstat_set_seq_raw_ops(procfs_kstat, show_header, + procfs_list_data, procfs_list_addr); + kstat_install(procfs_kstat); + procfs_list->pl_private = procfs_kstat; + } } void procfs_list_uninstall(procfs_list_t *procfs_list) {} void procfs_list_destroy(procfs_list_t *procfs_list) { ASSERT(list_is_empty(&procfs_list->pl_list)); + kstat_delete(procfs_list->pl_private); list_destroy(&procfs_list->pl_list); mutex_destroy(&procfs_list->pl_lock); } #define NODE_ID(procfs_list, obj) \ (((procfs_list_node_t *)(((char *)obj) + \ (procfs_list)->pl_node_offset))->pln_id) void procfs_list_add(procfs_list_t *procfs_list, void *p) { ASSERT(MUTEX_HELD(&procfs_list->pl_lock)); NODE_ID(procfs_list, p) = procfs_list->pl_next_id++; list_insert_tail(&procfs_list->pl_list, p); } Index: vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_taskq.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_taskq.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/spl/spl_taskq.c (revision 366348) @@ -1,409 +1,413 @@ /* * Copyright (c) 2009 Pawel Jakub Dawidek * All rights reserved. * * Copyright (c) 2012 Spectra Logic Corporation. All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #if __FreeBSD_version < 1201522 #define taskqueue_start_threads_in_proc(tqp, count, pri, proc, name, ...) \ taskqueue_start_threads(tqp, count, pri, name, __VA_ARGS__) #endif static uint_t taskq_tsd; static uma_zone_t taskq_zone; taskq_t *system_taskq = NULL; taskq_t *system_delay_taskq = NULL; taskq_t *dynamic_taskq = NULL; proc_t *system_proc; extern int uma_align_cache; static MALLOC_DEFINE(M_TASKQ, "taskq", "taskq structures"); static CK_LIST_HEAD(tqenthashhead, taskq_ent) *tqenthashtbl; static unsigned long tqenthash; static unsigned long tqenthashlock; static struct sx *tqenthashtbl_lock; static uint32_t tqidnext = 1; #define TQIDHASH(tqid) (&tqenthashtbl[(tqid) & tqenthash]) #define TQIDHASHLOCK(tqid) (&tqenthashtbl_lock[((tqid) & tqenthashlock)]) #define TIMEOUT_TASK 1 #define NORMAL_TASK 2 static void system_taskq_init(void *arg) { int i; tsd_create(&taskq_tsd, NULL); tqenthashtbl = hashinit(mp_ncpus * 8, M_TASKQ, &tqenthash); tqenthashlock = (tqenthash + 1) / 8; if (tqenthashlock > 0) tqenthashlock--; tqenthashtbl_lock = malloc(sizeof (*tqenthashtbl_lock) * (tqenthashlock + 1), M_TASKQ, M_WAITOK | M_ZERO); for (i = 0; i < tqenthashlock + 1; i++) sx_init_flags(&tqenthashtbl_lock[i], "tqenthash", SX_DUPOK); tqidnext = 1; taskq_zone = uma_zcreate("taskq_zone", sizeof (taskq_ent_t), NULL, NULL, NULL, NULL, UMA_ALIGN_CACHE, 0); system_taskq = taskq_create("system_taskq", mp_ncpus, minclsyspri, 0, 0, 0); system_delay_taskq = taskq_create("system_delay_taskq", mp_ncpus, minclsyspri, 0, 0, 0); } SYSINIT(system_taskq_init, SI_SUB_CONFIGURE, SI_ORDER_ANY, system_taskq_init, NULL); static void system_taskq_fini(void *arg) { int i; taskq_destroy(system_delay_taskq); taskq_destroy(system_taskq); uma_zdestroy(taskq_zone); tsd_destroy(&taskq_tsd); for (i = 0; i < tqenthashlock + 1; i++) sx_destroy(&tqenthashtbl_lock[i]); for (i = 0; i < tqenthash + 1; i++) VERIFY(CK_LIST_EMPTY(&tqenthashtbl[i])); free(tqenthashtbl_lock, M_TASKQ); free(tqenthashtbl, M_TASKQ); } SYSUNINIT(system_taskq_fini, SI_SUB_CONFIGURE, SI_ORDER_ANY, system_taskq_fini, NULL); static taskq_ent_t * taskq_lookup(taskqid_t tqid) { taskq_ent_t *ent = NULL; sx_xlock(TQIDHASHLOCK(tqid)); CK_LIST_FOREACH(ent, TQIDHASH(tqid), tqent_hash) { if (ent->tqent_id == tqid) break; } if (ent != NULL) refcount_acquire(&ent->tqent_rc); sx_xunlock(TQIDHASHLOCK(tqid)); return (ent); } static taskqid_t taskq_insert(taskq_ent_t *ent) { taskqid_t tqid = atomic_fetchadd_int(&tqidnext, 1); ent->tqent_id = tqid; ent->tqent_registered = B_TRUE; sx_xlock(TQIDHASHLOCK(tqid)); CK_LIST_INSERT_HEAD(TQIDHASH(tqid), ent, tqent_hash); sx_xunlock(TQIDHASHLOCK(tqid)); return (tqid); } static void taskq_remove(taskq_ent_t *ent) { taskqid_t tqid = ent->tqent_id; if (!ent->tqent_registered) return; sx_xlock(TQIDHASHLOCK(tqid)); CK_LIST_REMOVE(ent, tqent_hash); sx_xunlock(TQIDHASHLOCK(tqid)); ent->tqent_registered = B_FALSE; } static void taskq_tsd_set(void *context) { taskq_t *tq = context; +#if defined(__amd64__) || defined(__i386__) || defined(__aarch64__) + if (context != NULL && tsd_get(taskq_tsd) == NULL) + fpu_kern_thread(FPU_KERN_NORMAL); +#endif tsd_set(taskq_tsd, tq); } static taskq_t * taskq_create_impl(const char *name, int nthreads, pri_t pri, proc_t *proc __maybe_unused, uint_t flags) { taskq_t *tq; if ((flags & TASKQ_THREADS_CPU_PCT) != 0) nthreads = MAX((mp_ncpus * nthreads) / 100, 1); tq = kmem_alloc(sizeof (*tq), KM_SLEEP); tq->tq_queue = taskqueue_create(name, M_WAITOK, taskqueue_thread_enqueue, &tq->tq_queue); taskqueue_set_callback(tq->tq_queue, TASKQUEUE_CALLBACK_TYPE_INIT, taskq_tsd_set, tq); taskqueue_set_callback(tq->tq_queue, TASKQUEUE_CALLBACK_TYPE_SHUTDOWN, taskq_tsd_set, NULL); (void) taskqueue_start_threads_in_proc(&tq->tq_queue, nthreads, pri, proc, "%s", name); return ((taskq_t *)tq); } taskq_t * taskq_create(const char *name, int nthreads, pri_t pri, int minalloc __unused, int maxalloc __unused, uint_t flags) { return (taskq_create_impl(name, nthreads, pri, system_proc, flags)); } taskq_t * taskq_create_proc(const char *name, int nthreads, pri_t pri, int minalloc __unused, int maxalloc __unused, proc_t *proc, uint_t flags) { return (taskq_create_impl(name, nthreads, pri, proc, flags)); } void taskq_destroy(taskq_t *tq) { taskqueue_free(tq->tq_queue); kmem_free(tq, sizeof (*tq)); } int taskq_member(taskq_t *tq, kthread_t *thread) { return (taskqueue_member(tq->tq_queue, thread)); } taskq_t * taskq_of_curthread(void) { return (tsd_get(taskq_tsd)); } static void taskq_free(taskq_ent_t *task) { taskq_remove(task); if (refcount_release(&task->tqent_rc)) uma_zfree(taskq_zone, task); } int taskq_cancel_id(taskq_t *tq, taskqid_t tid) { uint32_t pend; int rc; taskq_ent_t *ent; if (tid == 0) return (0); if ((ent = taskq_lookup(tid)) == NULL) return (0); ent->tqent_cancelled = B_TRUE; if (ent->tqent_type == TIMEOUT_TASK) { rc = taskqueue_cancel_timeout(tq->tq_queue, &ent->tqent_timeout_task, &pend); } else rc = taskqueue_cancel(tq->tq_queue, &ent->tqent_task, &pend); if (rc == EBUSY) { taskqueue_drain(tq->tq_queue, &ent->tqent_task); } else if (pend) { /* * Tasks normally free themselves when run, but here the task * was cancelled so it did not free itself. */ taskq_free(ent); } /* Free the extra reference we added with taskq_lookup. */ taskq_free(ent); return (rc); } static void taskq_run(void *arg, int pending __unused) { taskq_ent_t *task = arg; if (!task->tqent_cancelled) task->tqent_func(task->tqent_arg); taskq_free(task); } taskqid_t taskq_dispatch_delay(taskq_t *tq, task_func_t func, void *arg, uint_t flags, clock_t expire_time) { taskq_ent_t *task; taskqid_t tid; clock_t timo; int mflag; timo = expire_time - ddi_get_lbolt(); if (timo <= 0) return (taskq_dispatch(tq, func, arg, flags)); if ((flags & (TQ_SLEEP | TQ_NOQUEUE)) == TQ_SLEEP) mflag = M_WAITOK; else mflag = M_NOWAIT; task = uma_zalloc(taskq_zone, mflag); if (task == NULL) return (0); task->tqent_func = func; task->tqent_arg = arg; task->tqent_type = TIMEOUT_TASK; task->tqent_cancelled = B_FALSE; refcount_init(&task->tqent_rc, 1); tid = taskq_insert(task); TIMEOUT_TASK_INIT(tq->tq_queue, &task->tqent_timeout_task, 0, taskq_run, task); taskqueue_enqueue_timeout(tq->tq_queue, &task->tqent_timeout_task, timo); return (tid); } taskqid_t taskq_dispatch(taskq_t *tq, task_func_t func, void *arg, uint_t flags) { taskq_ent_t *task; int mflag, prio; taskqid_t tid; if ((flags & (TQ_SLEEP | TQ_NOQUEUE)) == TQ_SLEEP) mflag = M_WAITOK; else mflag = M_NOWAIT; /* * If TQ_FRONT is given, we want higher priority for this task, so it * can go at the front of the queue. */ prio = !!(flags & TQ_FRONT); task = uma_zalloc(taskq_zone, mflag); if (task == NULL) return (0); refcount_init(&task->tqent_rc, 1); task->tqent_func = func; task->tqent_arg = arg; task->tqent_cancelled = B_FALSE; task->tqent_type = NORMAL_TASK; tid = taskq_insert(task); TASK_INIT(&task->tqent_task, prio, taskq_run, task); taskqueue_enqueue(tq->tq_queue, &task->tqent_task); VERIFY(tid); return (tid); } static void taskq_run_ent(void *arg, int pending __unused) { taskq_ent_t *task = arg; task->tqent_func(task->tqent_arg); } void taskq_dispatch_ent(taskq_t *tq, task_func_t func, void *arg, uint32_t flags, taskq_ent_t *task) { int prio; /* * If TQ_FRONT is given, we want higher priority for this task, so it * can go at the front of the queue. */ prio = !!(flags & TQ_FRONT); task->tqent_cancelled = B_FALSE; task->tqent_registered = B_FALSE; task->tqent_id = 0; task->tqent_func = func; task->tqent_arg = arg; TASK_INIT(&task->tqent_task, prio, taskq_run_ent, task); taskqueue_enqueue(tq->tq_queue, &task->tqent_task); } void taskq_wait(taskq_t *tq) { taskqueue_quiesce(tq->tq_queue); } void taskq_wait_id(taskq_t *tq, taskqid_t tid) { taskq_ent_t *ent; if (tid == 0) return; if ((ent = taskq_lookup(tid)) == NULL) return; taskqueue_drain(tq->tq_queue, &ent->tqent_task); taskq_free(ent); } void taskq_wait_outstanding(taskq_t *tq, taskqid_t id __unused) { taskqueue_drain_all(tq->tq_queue); } int taskq_empty_ent(taskq_ent_t *t) { return (t->tqent_task.ta_pending == 0); } Index: vendor-sys/openzfs/dist/module/os/freebsd/zfs/kmod_core.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/zfs/kmod_core.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/zfs/kmod_core.c (revision 366348) @@ -1,380 +1,382 @@ /* * Copyright (c) 2020 iXsystems, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include +#include #include #include #include "zfs_namecheck.h" #include "zfs_prop.h" #include "zfs_deleg.h" #include "zfs_comutil.h" SYSCTL_DECL(_vfs_zfs); SYSCTL_DECL(_vfs_zfs_vdev); - +extern uint_t rrw_tsd_key; static int zfs_version_ioctl = ZFS_IOCVER_OZFS; SYSCTL_DECL(_vfs_zfs_version); SYSCTL_INT(_vfs_zfs_version, OID_AUTO, ioctl, CTLFLAG_RD, &zfs_version_ioctl, 0, "ZFS_IOCTL_VERSION"); static struct cdev *zfsdev; static struct root_hold_token *zfs_root_token; extern uint_t rrw_tsd_key; extern uint_t zfs_allow_log_key; extern uint_t zfs_geom_probe_vdev_key; static int zfs__init(void); static int zfs__fini(void); static void zfs_shutdown(void *, int); static eventhandler_tag zfs_shutdown_event_tag; extern zfsdev_state_t *zfsdev_state_list; #define ZFS_MIN_KSTACK_PAGES 4 static int zfsdev_ioctl(struct cdev *dev, ulong_t zcmd, caddr_t arg, int flag, struct thread *td) { uint_t len; int vecnum; zfs_iocparm_t *zp; zfs_cmd_t *zc; zfs_cmd_legacy_t *zcl; int rc, error; void *uaddr; len = IOCPARM_LEN(zcmd); vecnum = zcmd & 0xff; zp = (void *)arg; uaddr = (void *)zp->zfs_cmd; error = 0; zcl = NULL; if (len != sizeof (zfs_iocparm_t)) { printf("len %d vecnum: %d sizeof (zfs_cmd_t) %ju\n", len, vecnum, (uintmax_t)sizeof (zfs_cmd_t)); return (EINVAL); } zc = kmem_zalloc(sizeof (zfs_cmd_t), KM_SLEEP); /* * Remap ioctl code for legacy user binaries */ if (zp->zfs_ioctl_version == ZFS_IOCVER_LEGACY) { vecnum = zfs_ioctl_legacy_to_ozfs(vecnum); if (vecnum < 0) { kmem_free(zc, sizeof (zfs_cmd_t)); return (ENOTSUP); } zcl = kmem_zalloc(sizeof (zfs_cmd_legacy_t), KM_SLEEP); if (copyin(uaddr, zcl, sizeof (zfs_cmd_legacy_t))) { error = SET_ERROR(EFAULT); goto out; } zfs_cmd_legacy_to_ozfs(zcl, zc); } else if (copyin(uaddr, zc, sizeof (zfs_cmd_t))) { error = SET_ERROR(EFAULT); goto out; } error = zfsdev_ioctl_common(vecnum, zc, 0); if (zcl) { zfs_cmd_ozfs_to_legacy(zc, zcl); rc = copyout(zcl, uaddr, sizeof (*zcl)); } else { rc = copyout(zc, uaddr, sizeof (*zc)); } if (error == 0 && rc != 0) error = SET_ERROR(EFAULT); out: if (zcl) kmem_free(zcl, sizeof (zfs_cmd_legacy_t)); kmem_free(zc, sizeof (zfs_cmd_t)); + MPASS(tsd_get(rrw_tsd_key) == NULL); return (error); } static void zfsdev_close(void *data) { zfsdev_state_t *zs, *zsp = data; mutex_enter(&zfsdev_state_lock); for (zs = zfsdev_state_list; zs != NULL; zs = zs->zs_next) { if (zs == zsp) break; } if (zs == NULL || zs->zs_minor <= 0) { mutex_exit(&zfsdev_state_lock); return; } zs->zs_minor = -1; zfs_onexit_destroy(zs->zs_onexit); zfs_zevent_destroy(zs->zs_zevent); mutex_exit(&zfsdev_state_lock); zs->zs_onexit = NULL; zs->zs_zevent = NULL; } static int zfs_ctldev_init(struct cdev *devp) { boolean_t newzs = B_FALSE; minor_t minor; zfsdev_state_t *zs, *zsprev = NULL; ASSERT(MUTEX_HELD(&zfsdev_state_lock)); minor = zfsdev_minor_alloc(); if (minor == 0) return (SET_ERROR(ENXIO)); for (zs = zfsdev_state_list; zs != NULL; zs = zs->zs_next) { if (zs->zs_minor == -1) break; zsprev = zs; } if (!zs) { zs = kmem_zalloc(sizeof (zfsdev_state_t), KM_SLEEP); newzs = B_TRUE; } devfs_set_cdevpriv(zs, zfsdev_close); zfs_onexit_init((zfs_onexit_t **)&zs->zs_onexit); zfs_zevent_init((zfs_zevent_t **)&zs->zs_zevent); if (newzs) { zs->zs_minor = minor; wmb(); zsprev->zs_next = zs; } else { wmb(); zs->zs_minor = minor; } return (0); } static int zfsdev_open(struct cdev *devp, int flag, int mode, struct thread *td) { int error; mutex_enter(&zfsdev_state_lock); error = zfs_ctldev_init(devp); mutex_exit(&zfsdev_state_lock); return (error); } static struct cdevsw zfs_cdevsw = { .d_version = D_VERSION, .d_open = zfsdev_open, .d_ioctl = zfsdev_ioctl, .d_name = ZFS_DRIVER }; int zfsdev_attach(void) { zfsdev = make_dev(&zfs_cdevsw, 0x0, UID_ROOT, GID_OPERATOR, 0666, ZFS_DRIVER); return (0); } void zfsdev_detach(void) { if (zfsdev != NULL) destroy_dev(zfsdev); } int zfs__init(void) { int error; #if KSTACK_PAGES < ZFS_MIN_KSTACK_PAGES printf("ZFS NOTICE: KSTACK_PAGES is %d which could result in stack " "overflow panic!\nPlease consider adding " "'options KSTACK_PAGES=%d' to your kernel config\n", KSTACK_PAGES, ZFS_MIN_KSTACK_PAGES); #endif zfs_root_token = root_mount_hold("ZFS"); if ((error = zfs_kmod_init()) != 0) { printf("ZFS: Failed to Load ZFS Filesystem" ", rc = %d\n", error); root_mount_rel(zfs_root_token); return (error); } tsd_create(&zfs_geom_probe_vdev_key, NULL); printf("ZFS storage pool version: features support (" SPA_VERSION_STRING ")\n"); root_mount_rel(zfs_root_token); ddi_sysevent_init(); return (0); } int zfs__fini(void) { if (zfs_busy() || zvol_busy() || zio_injection_enabled) { return (EBUSY); } zfs_kmod_fini(); tsd_destroy(&zfs_geom_probe_vdev_key); return (0); } static void zfs_shutdown(void *arg __unused, int howto __unused) { /* * ZFS fini routines can not properly work in a panic-ed system. */ if (panicstr == NULL) zfs__fini(); } static int zfs_modevent(module_t mod, int type, void *unused __unused) { int err; switch (type) { case MOD_LOAD: err = zfs__init(); if (err == 0) zfs_shutdown_event_tag = EVENTHANDLER_REGISTER( shutdown_post_sync, zfs_shutdown, NULL, SHUTDOWN_PRI_FIRST); return (err); case MOD_UNLOAD: err = zfs__fini(); if (err == 0 && zfs_shutdown_event_tag != NULL) EVENTHANDLER_DEREGISTER(shutdown_post_sync, zfs_shutdown_event_tag); return (err); case MOD_SHUTDOWN: return (0); default: break; } return (EOPNOTSUPP); } static moduledata_t zfs_mod = { "zfsctrl", zfs_modevent, 0 }; #ifdef _KERNEL EVENTHANDLER_DEFINE(mountroot, spa_boot_init, NULL, 0); #endif DECLARE_MODULE(zfsctrl, zfs_mod, SI_SUB_CLOCKS, SI_ORDER_ANY); MODULE_VERSION(zfsctrl, 1); #if __FreeBSD_version > 1300092 MODULE_DEPEND(zfsctrl, xdr, 1, 1, 1); #else MODULE_DEPEND(zfsctrl, krpc, 1, 1, 1); #endif MODULE_DEPEND(zfsctrl, acl_nfs4, 1, 1, 1); MODULE_DEPEND(zfsctrl, crypto, 1, 1, 1); Index: vendor-sys/openzfs/dist/module/os/freebsd/zfs/sysctl_os.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/zfs/sysctl_os.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/zfs/sysctl_os.c (revision 366348) @@ -1,694 +1,699 @@ /* * Copyright (c) 2020 iXsystems, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* BEGIN CSTYLED */ SYSCTL_DECL(_vfs_zfs); SYSCTL_NODE(_vfs_zfs, OID_AUTO, arc, CTLFLAG_RW, 0, "ZFS adaptive replacement cache"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, condense, CTLFLAG_RW, 0, "ZFS condense"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, dbuf, CTLFLAG_RW, 0, "ZFS disk buf cache"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, dbuf_cache, CTLFLAG_RW, 0, "ZFS disk buf cache"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, deadman, CTLFLAG_RW, 0, "ZFS deadman"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, dedup, CTLFLAG_RW, 0, "ZFS dedup"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, l2arc, CTLFLAG_RW, 0, "ZFS l2arc"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, livelist, CTLFLAG_RW, 0, "ZFS livelist"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, lua, CTLFLAG_RW, 0, "ZFS lua"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, metaslab, CTLFLAG_RW, 0, "ZFS metaslab"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, mg, CTLFLAG_RW, 0, "ZFS metaslab group"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, multihost, CTLFLAG_RW, 0, "ZFS multihost protection"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, prefetch, CTLFLAG_RW, 0, "ZFS prefetch"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, reconstruct, CTLFLAG_RW, 0, "ZFS reconstruct"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, recv, CTLFLAG_RW, 0, "ZFS receive"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, send, CTLFLAG_RW, 0, "ZFS send"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, spa, CTLFLAG_RW, 0, "ZFS space allocation"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, trim, CTLFLAG_RW, 0, "ZFS TRIM"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, txg, CTLFLAG_RW, 0, "ZFS transaction group"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, vdev, CTLFLAG_RW, 0, "ZFS VDEV"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, zevent, CTLFLAG_RW, 0, "ZFS event"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, zil, CTLFLAG_RW, 0, "ZFS ZIL"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, zio, CTLFLAG_RW, 0, "ZFS ZIO"); SYSCTL_NODE(_vfs_zfs_livelist, OID_AUTO, condense, CTLFLAG_RW, 0, "ZFS livelist condense"); SYSCTL_NODE(_vfs_zfs_vdev, OID_AUTO, cache, CTLFLAG_RW, 0, "ZFS VDEV Cache"); SYSCTL_NODE(_vfs_zfs_vdev, OID_AUTO, file, CTLFLAG_RW, 0, "ZFS VDEV file"); SYSCTL_NODE(_vfs_zfs_vdev, OID_AUTO, mirror, CTLFLAG_RD, 0, "ZFS VDEV mirror"); SYSCTL_DECL(_vfs_zfs_version); SYSCTL_CONST_STRING(_vfs_zfs_version, OID_AUTO, module, CTLFLAG_RD, (ZFS_META_VERSION "-" ZFS_META_RELEASE), "OpenZFS module version"); extern arc_state_t ARC_anon; extern arc_state_t ARC_mru; extern arc_state_t ARC_mru_ghost; extern arc_state_t ARC_mfu; extern arc_state_t ARC_mfu_ghost; extern arc_state_t ARC_l2c_only; /* * minimum lifespan of a prefetch block in clock ticks * (initialized in arc_init()) */ /* arc.c */ /* legacy compat */ extern uint64_t l2arc_write_max; /* def max write size */ extern uint64_t l2arc_write_boost; /* extra warmup write */ extern uint64_t l2arc_headroom; /* # of dev writes */ extern uint64_t l2arc_headroom_boost; extern uint64_t l2arc_feed_secs; /* interval seconds */ extern uint64_t l2arc_feed_min_ms; /* min interval msecs */ extern int l2arc_noprefetch; /* don't cache prefetch bufs */ extern int l2arc_feed_again; /* turbo warmup */ extern int l2arc_norw; /* no reads during writes */ SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_write_max, CTLFLAG_RW, &l2arc_write_max, 0, "max write size (LEGACY)"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_write_boost, CTLFLAG_RW, &l2arc_write_boost, 0, "extra write during warmup (LEGACY)"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_headroom, CTLFLAG_RW, &l2arc_headroom, 0, "number of dev writes (LEGACY)"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_feed_secs, CTLFLAG_RW, &l2arc_feed_secs, 0, "interval seconds (LEGACY)"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2arc_feed_min_ms, CTLFLAG_RW, &l2arc_feed_min_ms, 0, "min interval milliseconds (LEGACY)"); SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_noprefetch, CTLFLAG_RW, &l2arc_noprefetch, 0, "don't cache prefetch bufs (LEGACY)"); SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_feed_again, CTLFLAG_RW, &l2arc_feed_again, 0, "turbo warmup (LEGACY)"); SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_norw, CTLFLAG_RW, &l2arc_norw, 0, "no reads during writes (LEGACY)"); #if 0 extern int zfs_compressed_arc_enabled; SYSCTL_INT(_vfs_zfs, OID_AUTO, compressed_arc_enabled, CTLFLAG_RW, &zfs_compressed_arc_enabled, 1, "compressed arc buffers (LEGACY)"); #endif SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, anon_size, CTLFLAG_RD, &ARC_anon.arcs_size.rc_count, 0, "size of anonymous state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, anon_metadata_esize, CTLFLAG_RD, &ARC_anon.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of anonymous state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, anon_data_esize, CTLFLAG_RD, &ARC_anon.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of anonymous state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_size, CTLFLAG_RD, &ARC_mru.arcs_size.rc_count, 0, "size of mru state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_metadata_esize, CTLFLAG_RD, &ARC_mru.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mru state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_data_esize, CTLFLAG_RD, &ARC_mru.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mru state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_ghost_size, CTLFLAG_RD, &ARC_mru_ghost.arcs_size.rc_count, 0, "size of mru ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_ghost_metadata_esize, CTLFLAG_RD, &ARC_mru_ghost.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mru ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mru_ghost_data_esize, CTLFLAG_RD, &ARC_mru_ghost.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mru ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_size, CTLFLAG_RD, &ARC_mfu.arcs_size.rc_count, 0, "size of mfu state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_metadata_esize, CTLFLAG_RD, &ARC_mfu.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mfu state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_data_esize, CTLFLAG_RD, &ARC_mfu.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mfu state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_ghost_size, CTLFLAG_RD, &ARC_mfu_ghost.arcs_size.rc_count, 0, "size of mfu ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_ghost_metadata_esize, CTLFLAG_RD, &ARC_mfu_ghost.arcs_esize[ARC_BUFC_METADATA].rc_count, 0, "size of metadata in mfu ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, mfu_ghost_data_esize, CTLFLAG_RD, &ARC_mfu_ghost.arcs_esize[ARC_BUFC_DATA].rc_count, 0, "size of data in mfu ghost state"); SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, l2c_only_size, CTLFLAG_RD, &ARC_l2c_only.arcs_size.rc_count, 0, "size of mru state"); static int sysctl_vfs_zfs_arc_no_grow_shift(SYSCTL_HANDLER_ARGS) { uint32_t val; int err; val = arc_no_grow_shift; err = sysctl_handle_32(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val >= arc_shrink_shift) return (EINVAL); arc_no_grow_shift = val; return (0); } -SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_no_grow_shift, CTLTYPE_U32 | CTLFLAG_RWTUN, - 0, sizeof (uint32_t), sysctl_vfs_zfs_arc_no_grow_shift, "U", +SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_no_grow_shift, + CTLTYPE_U32 | CTLFLAG_RWTUN | CTLFLAG_MPSAFE, 0, sizeof (uint32_t), + sysctl_vfs_zfs_arc_no_grow_shift, "U", "log2(fraction of ARC which must be free to allow growing)"); int param_set_arc_long(SYSCTL_HANDLER_ARGS) { int err; err = sysctl_handle_long(oidp, arg1, 0, req); if (err != 0 || req->newptr == NULL) return (err); arc_tuning_update(B_TRUE); return (0); } int param_set_arc_int(SYSCTL_HANDLER_ARGS) { int err; err = sysctl_handle_int(oidp, arg1, 0, req); if (err != 0 || req->newptr == NULL) return (err); arc_tuning_update(B_TRUE); return (0); } -SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_min, CTLTYPE_ULONG | CTLFLAG_RWTUN, +SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_min, + CTLTYPE_ULONG | CTLFLAG_RWTUN | CTLFLAG_MPSAFE, &zfs_arc_min, sizeof (zfs_arc_min), param_set_arc_long, "LU", "min arc size (LEGACY)"); -SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_max, CTLTYPE_ULONG | CTLFLAG_RWTUN, +SYSCTL_PROC(_vfs_zfs, OID_AUTO, arc_max, + CTLTYPE_ULONG | CTLFLAG_RWTUN | CTLFLAG_MPSAFE, &zfs_arc_max, sizeof (zfs_arc_max), param_set_arc_long, "LU", "max arc size (LEGACY)"); /* dbuf.c */ /* dmu.c */ /* dmu_zfetch.c */ SYSCTL_NODE(_vfs_zfs, OID_AUTO, zfetch, CTLFLAG_RW, 0, "ZFS ZFETCH (LEGACY)"); /* max bytes to prefetch per stream (default 8MB) */ extern uint32_t zfetch_max_distance; SYSCTL_UINT(_vfs_zfs_zfetch, OID_AUTO, max_distance, CTLFLAG_RWTUN, &zfetch_max_distance, 0, "Max bytes to prefetch per stream (LEGACY)"); /* max bytes to prefetch indirects for per stream (default 64MB) */ extern uint32_t zfetch_max_idistance; SYSCTL_UINT(_vfs_zfs_prefetch, OID_AUTO, max_idistance, CTLFLAG_RWTUN, &zfetch_max_idistance, 0, "Max bytes to prefetch indirects for per stream"); /* dsl_pool.c */ /* dnode.c */ extern int zfs_default_bs; SYSCTL_INT(_vfs_zfs, OID_AUTO, default_bs, CTLFLAG_RWTUN, &zfs_default_bs, 0, "Default dnode block shift"); extern int zfs_default_ibs; SYSCTL_INT(_vfs_zfs, OID_AUTO, default_ibs, CTLFLAG_RWTUN, &zfs_default_ibs, 0, "Default dnode indirect block shift"); /* dsl_scan.c */ /* metaslab.c */ /* * In pools where the log space map feature is not enabled we touch * multiple metaslabs (and their respective space maps) with each * transaction group. Thus, we benefit from having a small space map * block size since it allows us to issue more I/O operations scattered * around the disk. So a sane default for the space map block size * is 8~16K. */ extern int zfs_metaslab_sm_blksz_no_log; SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, sm_blksz_no_log, CTLFLAG_RDTUN, &zfs_metaslab_sm_blksz_no_log, 0, "Block size for space map in pools with log space map disabled. " "Power of 2 and greater than 4096."); /* * When the log space map feature is enabled, we accumulate a lot of * changes per metaslab that are flushed once in a while so we benefit * from a bigger block size like 128K for the metaslab space maps. */ extern int zfs_metaslab_sm_blksz_with_log; SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, sm_blksz_with_log, CTLFLAG_RDTUN, &zfs_metaslab_sm_blksz_with_log, 0, "Block size for space map in pools with log space map enabled. " "Power of 2 and greater than 4096."); /* * The in-core space map representation is more compact than its on-disk form. * The zfs_condense_pct determines how much more compact the in-core * space map representation must be before we compact it on-disk. * Values should be greater than or equal to 100. */ extern int zfs_condense_pct; SYSCTL_INT(_vfs_zfs, OID_AUTO, condense_pct, CTLFLAG_RWTUN, &zfs_condense_pct, 0, "Condense on-disk spacemap when it is more than this many percents" " of in-memory counterpart"); extern int zfs_remove_max_segment; SYSCTL_INT(_vfs_zfs, OID_AUTO, remove_max_segment, CTLFLAG_RWTUN, &zfs_remove_max_segment, 0, "Largest contiguous segment ZFS will attempt to" " allocate when removing a device"); extern int zfs_removal_suspend_progress; SYSCTL_INT(_vfs_zfs, OID_AUTO, removal_suspend_progress, CTLFLAG_RWTUN, &zfs_removal_suspend_progress, 0, "Ensures certain actions can happen while" " in the middle of a removal"); /* * Minimum size which forces the dynamic allocator to change * it's allocation strategy. Once the space map cannot satisfy * an allocation of this size then it switches to using more * aggressive strategy (i.e search by size rather than offset). */ extern uint64_t metaslab_df_alloc_threshold; SYSCTL_QUAD(_vfs_zfs_metaslab, OID_AUTO, df_alloc_threshold, CTLFLAG_RWTUN, &metaslab_df_alloc_threshold, 0, "Minimum size which forces the dynamic allocator to change it's allocation strategy"); /* * The minimum free space, in percent, which must be available * in a space map to continue allocations in a first-fit fashion. * Once the space map's free space drops below this level we dynamically * switch to using best-fit allocations. */ extern int metaslab_df_free_pct; SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, df_free_pct, CTLFLAG_RWTUN, &metaslab_df_free_pct, 0, "The minimum free space, in percent, which must be available in a " "space map to continue allocations in a first-fit fashion"); /* * Percentage of all cpus that can be used by the metaslab taskq. */ extern int metaslab_load_pct; SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, load_pct, CTLFLAG_RWTUN, &metaslab_load_pct, 0, "Percentage of cpus that can be used by the metaslab taskq"); /* * Max number of metaslabs per group to preload. */ extern int metaslab_preload_limit; SYSCTL_INT(_vfs_zfs_metaslab, OID_AUTO, preload_limit, CTLFLAG_RWTUN, &metaslab_preload_limit, 0, "Max number of metaslabs per group to preload"); /* refcount.c */ extern int reference_tracking_enable; SYSCTL_INT(_vfs_zfs, OID_AUTO, reference_tracking_enable, CTLFLAG_RDTUN, &reference_tracking_enable, 0, "Track reference holders to refcount_t objects, used mostly by ZFS"); /* spa.c */ extern int zfs_ccw_retry_interval; SYSCTL_INT(_vfs_zfs, OID_AUTO, ccw_retry_interval, CTLFLAG_RWTUN, &zfs_ccw_retry_interval, 0, "Configuration cache file write, retry after failure, interval (seconds)"); extern uint64_t zfs_max_missing_tvds_cachefile; SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, max_missing_tvds_cachefile, CTLFLAG_RWTUN, &zfs_max_missing_tvds_cachefile, 0, "allow importing pools with missing top-level vdevs in cache file"); extern uint64_t zfs_max_missing_tvds_scan; SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, max_missing_tvds_scan, CTLFLAG_RWTUN, &zfs_max_missing_tvds_scan, 0, "allow importing pools with missing top-level vdevs during scan"); /* spa_misc.c */ extern int zfs_flags; static int sysctl_vfs_zfs_debug_flags(SYSCTL_HANDLER_ARGS) { int err, val; val = zfs_flags; err = sysctl_handle_int(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); /* * ZFS_DEBUG_MODIFY must be enabled prior to boot so all * arc buffers in the system have the necessary additional * checksum data. However, it is safe to disable at any * time. */ if (!(zfs_flags & ZFS_DEBUG_MODIFY)) val &= ~ZFS_DEBUG_MODIFY; zfs_flags = val; return (0); } SYSCTL_PROC(_vfs_zfs, OID_AUTO, debugflags, CTLTYPE_UINT | CTLFLAG_MPSAFE | CTLFLAG_RWTUN, NULL, 0, sysctl_vfs_zfs_debug_flags, "IU", "Debug flags for ZFS testing."); int param_set_deadman_synctime(SYSCTL_HANDLER_ARGS) { unsigned long val; int err; val = zfs_deadman_synctime_ms; err = sysctl_handle_long(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); zfs_deadman_synctime_ms = val; spa_set_deadman_synctime(MSEC2NSEC(zfs_deadman_synctime_ms)); return (0); } int param_set_deadman_ziotime(SYSCTL_HANDLER_ARGS) { unsigned long val; int err; val = zfs_deadman_ziotime_ms; err = sysctl_handle_long(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); zfs_deadman_ziotime_ms = val; spa_set_deadman_ziotime(MSEC2NSEC(zfs_deadman_synctime_ms)); return (0); } int param_set_deadman_failmode(SYSCTL_HANDLER_ARGS) { char buf[16]; int rc; if (req->newptr == NULL) strlcpy(buf, zfs_deadman_failmode, sizeof (buf)); rc = sysctl_handle_string(oidp, buf, sizeof (buf), req); if (rc || req->newptr == NULL) return (rc); if (strcmp(buf, zfs_deadman_failmode) == 0) return (0); if (!strcmp(buf, "wait")) zfs_deadman_failmode = "wait"; if (!strcmp(buf, "continue")) zfs_deadman_failmode = "continue"; if (!strcmp(buf, "panic")) zfs_deadman_failmode = "panic"; return (-param_set_deadman_failmode_common(buf)); } /* spacemap.c */ extern int space_map_ibs; SYSCTL_INT(_vfs_zfs, OID_AUTO, space_map_ibs, CTLFLAG_RWTUN, &space_map_ibs, 0, "Space map indirect block shift"); /* vdev.c */ int param_set_min_auto_ashift(SYSCTL_HANDLER_ARGS) { uint64_t val; int err; val = zfs_vdev_min_auto_ashift; err = sysctl_handle_64(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (SET_ERROR(err)); if (val < ASHIFT_MIN || val > zfs_vdev_max_auto_ashift) return (SET_ERROR(EINVAL)); zfs_vdev_min_auto_ashift = val; return (0); } int param_set_max_auto_ashift(SYSCTL_HANDLER_ARGS) { uint64_t val; int err; val = zfs_vdev_max_auto_ashift; err = sysctl_handle_64(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (SET_ERROR(err)); if (val > ASHIFT_MAX || val < zfs_vdev_min_auto_ashift) return (SET_ERROR(EINVAL)); zfs_vdev_max_auto_ashift = val; return (0); } -SYSCTL_PROC(_vfs_zfs, OID_AUTO, min_auto_ashift, CTLTYPE_U64 | CTLFLAG_RWTUN, +SYSCTL_PROC(_vfs_zfs, OID_AUTO, min_auto_ashift, + CTLTYPE_U64 | CTLFLAG_RWTUN | CTLFLAG_MPSAFE, &zfs_vdev_min_auto_ashift, sizeof (zfs_vdev_min_auto_ashift), param_set_min_auto_ashift, "QU", "Min ashift used when creating new top-level vdev. (LEGACY)"); -SYSCTL_PROC(_vfs_zfs, OID_AUTO, max_auto_ashift, CTLTYPE_U64 | CTLFLAG_RWTUN, +SYSCTL_PROC(_vfs_zfs, OID_AUTO, max_auto_ashift, + CTLTYPE_U64 | CTLFLAG_RWTUN | CTLFLAG_MPSAFE, &zfs_vdev_max_auto_ashift, sizeof (zfs_vdev_max_auto_ashift), param_set_max_auto_ashift, "QU", "Max ashift used when optimizing for logical -> physical sector size on " "new top-level vdevs. (LEGACY)"); /* * Since the DTL space map of a vdev is not expected to have a lot of * entries, we default its block size to 4K. */ extern int zfs_vdev_dtl_sm_blksz; SYSCTL_INT(_vfs_zfs, OID_AUTO, dtl_sm_blksz, CTLFLAG_RDTUN, &zfs_vdev_dtl_sm_blksz, 0, "Block size for DTL space map. Power of 2 and greater than 4096."); /* * vdev-wide space maps that have lots of entries written to them at * the end of each transaction can benefit from a higher I/O bandwidth * (e.g. vdev_obsolete_sm), thus we default their block size to 128K. */ extern int zfs_vdev_standard_sm_blksz; SYSCTL_INT(_vfs_zfs, OID_AUTO, standard_sm_blksz, CTLFLAG_RDTUN, &zfs_vdev_standard_sm_blksz, 0, "Block size for standard space map. Power of 2 and greater than 4096."); extern int vdev_validate_skip; SYSCTL_INT(_vfs_zfs, OID_AUTO, validate_skip, CTLFLAG_RDTUN, &vdev_validate_skip, 0, "Enable to bypass vdev_validate()."); /* vdev_cache.c */ /* vdev_mirror.c */ /* * The load configuration settings below are tuned by default for * the case where all devices are of the same rotational type. * * If there is a mixture of rotating and non-rotating media, setting * non_rotating_seek_inc to 0 may well provide better results as it * will direct more reads to the non-rotating vdevs which are more * likely to have a higher performance. */ /* vdev_queue.c */ #define ZFS_VDEV_QUEUE_KNOB_MIN(name) \ extern uint32_t zfs_vdev_ ## name ## _min_active; \ SYSCTL_UINT(_vfs_zfs_vdev, OID_AUTO, name ## _min_active, CTLFLAG_RWTUN,\ &zfs_vdev_ ## name ## _min_active, 0, \ "Initial number of I/O requests of type " #name \ " active for each device"); #define ZFS_VDEV_QUEUE_KNOB_MAX(name) \ extern uint32_t zfs_vdev_ ## name ## _max_active; \ SYSCTL_UINT(_vfs_zfs_vdev, OID_AUTO, name ## _max_active, CTLFLAG_RWTUN, \ &zfs_vdev_ ## name ## _max_active, 0, \ "Maximum number of I/O requests of type " #name \ " active for each device"); #undef ZFS_VDEV_QUEUE_KNOB extern uint32_t zfs_vdev_max_active; SYSCTL_UINT(_vfs_zfs, OID_AUTO, top_maxinflight, CTLFLAG_RWTUN, &zfs_vdev_max_active, 0, "The maximum number of I/Os of all types active for each device. (LEGACY)"); extern int zfs_vdev_def_queue_depth; SYSCTL_INT(_vfs_zfs_vdev, OID_AUTO, def_queue_depth, CTLFLAG_RWTUN, &zfs_vdev_def_queue_depth, 0, "Default queue depth for each allocator"); /*extern uint64_t zfs_multihost_history; SYSCTL_UQUAD(_vfs_zfs, OID_AUTO, multihost_history, CTLFLAG_RWTUN, &zfs_multihost_history, 0, "Historical staticists for the last N multihost updates");*/ #ifdef notyet SYSCTL_INT(_vfs_zfs_vdev, OID_AUTO, trim_on_init, CTLFLAG_RW, &vdev_trim_on_init, 0, "Enable/disable full vdev trim on initialisation"); #endif /* zio.c */ #if defined(__LP64__) int zio_use_uma = 1; #else int zio_use_uma = 0; #endif SYSCTL_INT(_vfs_zfs_zio, OID_AUTO, use_uma, CTLFLAG_RDTUN, &zio_use_uma, 0, "Use uma(9) for ZIO allocations"); SYSCTL_INT(_vfs_zfs_zio, OID_AUTO, exclude_metadata, CTLFLAG_RDTUN, &zio_exclude_metadata, 0, "Exclude metadata buffers from dumps as well"); int param_set_slop_shift(SYSCTL_HANDLER_ARGS) { int val; int err; val = *(int *)arg1; err = sysctl_handle_int(oidp, &val, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (val < 1 || val > 31) return (EINVAL); *(int *)arg1 = val; return (0); } int param_set_multihost_interval(SYSCTL_HANDLER_ARGS) { int err; err = sysctl_handle_long(oidp, arg1, 0, req); if (err != 0 || req->newptr == NULL) return (err); if (spa_mode_global != SPA_MODE_UNINIT) mmp_signal_all_threads(); return (0); } Index: vendor-sys/openzfs/dist/module/os/freebsd/zfs/zfs_ioctl_compat.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/zfs/zfs_ioctl_compat.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/zfs/zfs_ioctl_compat.c (revision 366348) @@ -1,361 +1,363 @@ /* * Copyright (c) 2020 iXsystems, Inc. * All rights reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * * THIS SOFTWARE IS PROVIDED BY THE AUTHORS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHORS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * */ #include __FBSDID("$FreeBSD$"); #include #include #include #include #include #include #include #include #include #include enum zfs_ioc_legacy { ZFS_IOC_LEGACY_NONE = -1, ZFS_IOC_LEGACY_FIRST = 0, ZFS_LEGACY_IOC = ZFS_IOC_LEGACY_FIRST, ZFS_IOC_LEGACY_POOL_CREATE = ZFS_IOC_LEGACY_FIRST, ZFS_IOC_LEGACY_POOL_DESTROY, ZFS_IOC_LEGACY_POOL_IMPORT, ZFS_IOC_LEGACY_POOL_EXPORT, ZFS_IOC_LEGACY_POOL_CONFIGS, ZFS_IOC_LEGACY_POOL_STATS, ZFS_IOC_LEGACY_POOL_TRYIMPORT, ZFS_IOC_LEGACY_POOL_SCAN, ZFS_IOC_LEGACY_POOL_FREEZE, ZFS_IOC_LEGACY_POOL_UPGRADE, ZFS_IOC_LEGACY_POOL_GET_HISTORY, ZFS_IOC_LEGACY_VDEV_ADD, ZFS_IOC_LEGACY_VDEV_REMOVE, ZFS_IOC_LEGACY_VDEV_SET_STATE, ZFS_IOC_LEGACY_VDEV_ATTACH, ZFS_IOC_LEGACY_VDEV_DETACH, ZFS_IOC_LEGACY_VDEV_SETPATH, ZFS_IOC_LEGACY_VDEV_SETFRU, ZFS_IOC_LEGACY_OBJSET_STATS, ZFS_IOC_LEGACY_OBJSET_ZPLPROPS, ZFS_IOC_LEGACY_DATASET_LIST_NEXT, ZFS_IOC_LEGACY_SNAPSHOT_LIST_NEXT, ZFS_IOC_LEGACY_SET_PROP, ZFS_IOC_LEGACY_CREATE, ZFS_IOC_LEGACY_DESTROY, ZFS_IOC_LEGACY_ROLLBACK, ZFS_IOC_LEGACY_RENAME, ZFS_IOC_LEGACY_RECV, ZFS_IOC_LEGACY_SEND, ZFS_IOC_LEGACY_INJECT_FAULT, ZFS_IOC_LEGACY_CLEAR_FAULT, ZFS_IOC_LEGACY_INJECT_LIST_NEXT, ZFS_IOC_LEGACY_ERROR_LOG, ZFS_IOC_LEGACY_CLEAR, ZFS_IOC_LEGACY_PROMOTE, ZFS_IOC_LEGACY_DESTROY_SNAPS, ZFS_IOC_LEGACY_SNAPSHOT, ZFS_IOC_LEGACY_DSOBJ_TO_DSNAME, ZFS_IOC_LEGACY_OBJ_TO_PATH, ZFS_IOC_LEGACY_POOL_SET_PROPS, ZFS_IOC_LEGACY_POOL_GET_PROPS, ZFS_IOC_LEGACY_SET_FSACL, ZFS_IOC_LEGACY_GET_FSACL, ZFS_IOC_LEGACY_SHARE, ZFS_IOC_LEGACY_INHERIT_PROP, ZFS_IOC_LEGACY_SMB_ACL, ZFS_IOC_LEGACY_USERSPACE_ONE, ZFS_IOC_LEGACY_USERSPACE_MANY, ZFS_IOC_LEGACY_USERSPACE_UPGRADE, ZFS_IOC_LEGACY_HOLD, ZFS_IOC_LEGACY_RELEASE, ZFS_IOC_LEGACY_GET_HOLDS, ZFS_IOC_LEGACY_OBJSET_RECVD_PROPS, ZFS_IOC_LEGACY_VDEV_SPLIT, ZFS_IOC_LEGACY_NEXT_OBJ, ZFS_IOC_LEGACY_DIFF, ZFS_IOC_LEGACY_TMP_SNAPSHOT, ZFS_IOC_LEGACY_OBJ_TO_STATS, ZFS_IOC_LEGACY_JAIL, ZFS_IOC_LEGACY_UNJAIL, ZFS_IOC_LEGACY_POOL_REGUID, ZFS_IOC_LEGACY_SPACE_WRITTEN, ZFS_IOC_LEGACY_SPACE_SNAPS, ZFS_IOC_LEGACY_SEND_PROGRESS, ZFS_IOC_LEGACY_POOL_REOPEN, ZFS_IOC_LEGACY_LOG_HISTORY, ZFS_IOC_LEGACY_SEND_NEW, ZFS_IOC_LEGACY_SEND_SPACE, ZFS_IOC_LEGACY_CLONE, ZFS_IOC_LEGACY_BOOKMARK, ZFS_IOC_LEGACY_GET_BOOKMARKS, ZFS_IOC_LEGACY_DESTROY_BOOKMARKS, ZFS_IOC_LEGACY_NEXTBOOT, ZFS_IOC_LEGACY_CHANNEL_PROGRAM, ZFS_IOC_LEGACY_REMAP, ZFS_IOC_LEGACY_POOL_CHECKPOINT, ZFS_IOC_LEGACY_POOL_DISCARD_CHECKPOINT, ZFS_IOC_LEGACY_POOL_INITIALIZE, ZFS_IOC_LEGACY_POOL_SYNC, ZFS_IOC_LEGACY_LAST }; unsigned static long zfs_ioctl_legacy_to_ozfs_[] = { ZFS_IOC_POOL_CREATE, /* 0x00 */ ZFS_IOC_POOL_DESTROY, /* 0x01 */ ZFS_IOC_POOL_IMPORT, /* 0x02 */ ZFS_IOC_POOL_EXPORT, /* 0x03 */ ZFS_IOC_POOL_CONFIGS, /* 0x04 */ ZFS_IOC_POOL_STATS, /* 0x05 */ ZFS_IOC_POOL_TRYIMPORT, /* 0x06 */ ZFS_IOC_POOL_SCAN, /* 0x07 */ ZFS_IOC_POOL_FREEZE, /* 0x08 */ ZFS_IOC_POOL_UPGRADE, /* 0x09 */ ZFS_IOC_POOL_GET_HISTORY, /* 0x0a */ ZFS_IOC_VDEV_ADD, /* 0x0b */ ZFS_IOC_VDEV_REMOVE, /* 0x0c */ ZFS_IOC_VDEV_SET_STATE, /* 0x0d */ ZFS_IOC_VDEV_ATTACH, /* 0x0e */ ZFS_IOC_VDEV_DETACH, /* 0x0f */ ZFS_IOC_VDEV_SETPATH, /* 0x10 */ ZFS_IOC_VDEV_SETFRU, /* 0x11 */ ZFS_IOC_OBJSET_STATS, /* 0x12 */ ZFS_IOC_OBJSET_ZPLPROPS, /* 0x13 */ ZFS_IOC_DATASET_LIST_NEXT, /* 0x14 */ ZFS_IOC_SNAPSHOT_LIST_NEXT, /* 0x15 */ ZFS_IOC_SET_PROP, /* 0x16 */ ZFS_IOC_CREATE, /* 0x17 */ ZFS_IOC_DESTROY, /* 0x18 */ ZFS_IOC_ROLLBACK, /* 0x19 */ ZFS_IOC_RENAME, /* 0x1a */ ZFS_IOC_RECV, /* 0x1b */ ZFS_IOC_SEND, /* 0x1c */ ZFS_IOC_INJECT_FAULT, /* 0x1d */ ZFS_IOC_CLEAR_FAULT, /* 0x1e */ ZFS_IOC_INJECT_LIST_NEXT, /* 0x1f */ ZFS_IOC_ERROR_LOG, /* 0x20 */ ZFS_IOC_CLEAR, /* 0x21 */ ZFS_IOC_PROMOTE, /* 0x22 */ /* start of mismatch */ ZFS_IOC_DESTROY_SNAPS, /* 0x23:0x3b */ ZFS_IOC_SNAPSHOT, /* 0x24:0x23 */ ZFS_IOC_DSOBJ_TO_DSNAME, /* 0x25:0x24 */ ZFS_IOC_OBJ_TO_PATH, /* 0x26:0x25 */ ZFS_IOC_POOL_SET_PROPS, /* 0x27:0x26 */ ZFS_IOC_POOL_GET_PROPS, /* 0x28:0x27 */ ZFS_IOC_SET_FSACL, /* 0x29:0x28 */ ZFS_IOC_GET_FSACL, /* 0x30:0x29 */ ZFS_IOC_SHARE, /* 0x2b:0x2a */ ZFS_IOC_INHERIT_PROP, /* 0x2c:0x2b */ ZFS_IOC_SMB_ACL, /* 0x2d:0x2c */ ZFS_IOC_USERSPACE_ONE, /* 0x2e:0x2d */ ZFS_IOC_USERSPACE_MANY, /* 0x2f:0x2e */ ZFS_IOC_USERSPACE_UPGRADE, /* 0x30:0x2f */ ZFS_IOC_HOLD, /* 0x31:0x30 */ ZFS_IOC_RELEASE, /* 0x32:0x31 */ ZFS_IOC_GET_HOLDS, /* 0x33:0x32 */ ZFS_IOC_OBJSET_RECVD_PROPS, /* 0x34:0x33 */ ZFS_IOC_VDEV_SPLIT, /* 0x35:0x34 */ ZFS_IOC_NEXT_OBJ, /* 0x36:0x35 */ ZFS_IOC_DIFF, /* 0x37:0x36 */ ZFS_IOC_TMP_SNAPSHOT, /* 0x38:0x37 */ ZFS_IOC_OBJ_TO_STATS, /* 0x39:0x38 */ ZFS_IOC_JAIL, /* 0x3a:0xc2 */ ZFS_IOC_UNJAIL, /* 0x3b:0xc3 */ ZFS_IOC_POOL_REGUID, /* 0x3c:0x3c */ ZFS_IOC_SPACE_WRITTEN, /* 0x3d:0x39 */ ZFS_IOC_SPACE_SNAPS, /* 0x3e:0x3a */ ZFS_IOC_SEND_PROGRESS, /* 0x3f:0x3e */ ZFS_IOC_POOL_REOPEN, /* 0x40:0x3d */ ZFS_IOC_LOG_HISTORY, /* 0x41:0x3f */ ZFS_IOC_SEND_NEW, /* 0x42:0x40 */ ZFS_IOC_SEND_SPACE, /* 0x43:0x41 */ ZFS_IOC_CLONE, /* 0x44:0x42 */ ZFS_IOC_BOOKMARK, /* 0x45:0x43 */ ZFS_IOC_GET_BOOKMARKS, /* 0x46:0x44 */ ZFS_IOC_DESTROY_BOOKMARKS, /* 0x47:0x45 */ ZFS_IOC_NEXTBOOT, /* 0x48:0xc1 */ ZFS_IOC_CHANNEL_PROGRAM, /* 0x49:0x48 */ ZFS_IOC_REMAP, /* 0x4a:0x4c */ ZFS_IOC_POOL_CHECKPOINT, /* 0x4b:0x4d */ ZFS_IOC_POOL_DISCARD_CHECKPOINT, /* 0x4c:0x4e */ ZFS_IOC_POOL_INITIALIZE, /* 0x4d:0x4f */ }; unsigned static long zfs_ioctl_ozfs_to_legacy_common_[] = { ZFS_IOC_POOL_CREATE, /* 0x00 */ ZFS_IOC_POOL_DESTROY, /* 0x01 */ ZFS_IOC_POOL_IMPORT, /* 0x02 */ ZFS_IOC_POOL_EXPORT, /* 0x03 */ ZFS_IOC_POOL_CONFIGS, /* 0x04 */ ZFS_IOC_POOL_STATS, /* 0x05 */ ZFS_IOC_POOL_TRYIMPORT, /* 0x06 */ ZFS_IOC_POOL_SCAN, /* 0x07 */ ZFS_IOC_POOL_FREEZE, /* 0x08 */ ZFS_IOC_POOL_UPGRADE, /* 0x09 */ ZFS_IOC_POOL_GET_HISTORY, /* 0x0a */ ZFS_IOC_VDEV_ADD, /* 0x0b */ ZFS_IOC_VDEV_REMOVE, /* 0x0c */ ZFS_IOC_VDEV_SET_STATE, /* 0x0d */ ZFS_IOC_VDEV_ATTACH, /* 0x0e */ ZFS_IOC_VDEV_DETACH, /* 0x0f */ ZFS_IOC_VDEV_SETPATH, /* 0x10 */ ZFS_IOC_VDEV_SETFRU, /* 0x11 */ ZFS_IOC_OBJSET_STATS, /* 0x12 */ ZFS_IOC_OBJSET_ZPLPROPS, /* 0x13 */ ZFS_IOC_DATASET_LIST_NEXT, /* 0x14 */ ZFS_IOC_SNAPSHOT_LIST_NEXT, /* 0x15 */ ZFS_IOC_SET_PROP, /* 0x16 */ ZFS_IOC_CREATE, /* 0x17 */ ZFS_IOC_DESTROY, /* 0x18 */ ZFS_IOC_ROLLBACK, /* 0x19 */ ZFS_IOC_RENAME, /* 0x1a */ ZFS_IOC_RECV, /* 0x1b */ ZFS_IOC_SEND, /* 0x1c */ ZFS_IOC_INJECT_FAULT, /* 0x1d */ ZFS_IOC_CLEAR_FAULT, /* 0x1e */ ZFS_IOC_INJECT_LIST_NEXT, /* 0x1f */ ZFS_IOC_ERROR_LOG, /* 0x20 */ ZFS_IOC_CLEAR, /* 0x21 */ ZFS_IOC_PROMOTE, /* 0x22 */ /* start of mismatch */ ZFS_IOC_LEGACY_SNAPSHOT, /* 0x23 */ ZFS_IOC_LEGACY_DSOBJ_TO_DSNAME, /* 0x24 */ ZFS_IOC_LEGACY_OBJ_TO_PATH, /* 0x25 */ ZFS_IOC_LEGACY_POOL_SET_PROPS, /* 0x26 */ ZFS_IOC_LEGACY_POOL_GET_PROPS, /* 0x27 */ ZFS_IOC_LEGACY_SET_FSACL, /* 0x28 */ ZFS_IOC_LEGACY_GET_FSACL, /* 0x29 */ ZFS_IOC_LEGACY_SHARE, /* 0x2a */ ZFS_IOC_LEGACY_INHERIT_PROP, /* 0x2b */ ZFS_IOC_LEGACY_SMB_ACL, /* 0x2c */ ZFS_IOC_LEGACY_USERSPACE_ONE, /* 0x2d */ ZFS_IOC_LEGACY_USERSPACE_MANY, /* 0x2e */ ZFS_IOC_LEGACY_USERSPACE_UPGRADE, /* 0x2f */ ZFS_IOC_LEGACY_HOLD, /* 0x30 */ ZFS_IOC_LEGACY_RELEASE, /* 0x31 */ ZFS_IOC_LEGACY_GET_HOLDS, /* 0x32 */ ZFS_IOC_LEGACY_OBJSET_RECVD_PROPS, /* 0x33 */ ZFS_IOC_LEGACY_VDEV_SPLIT, /* 0x34 */ ZFS_IOC_LEGACY_NEXT_OBJ, /* 0x35 */ ZFS_IOC_LEGACY_DIFF, /* 0x36 */ ZFS_IOC_LEGACY_TMP_SNAPSHOT, /* 0x37 */ ZFS_IOC_LEGACY_OBJ_TO_STATS, /* 0x38 */ ZFS_IOC_LEGACY_SPACE_WRITTEN, /* 0x39 */ ZFS_IOC_LEGACY_SPACE_SNAPS, /* 0x3a */ ZFS_IOC_LEGACY_DESTROY_SNAPS, /* 0x3b */ ZFS_IOC_LEGACY_POOL_REGUID, /* 0x3c */ ZFS_IOC_LEGACY_POOL_REOPEN, /* 0x3d */ ZFS_IOC_LEGACY_SEND_PROGRESS, /* 0x3e */ ZFS_IOC_LEGACY_LOG_HISTORY, /* 0x3f */ ZFS_IOC_LEGACY_SEND_NEW, /* 0x40 */ ZFS_IOC_LEGACY_SEND_SPACE, /* 0x41 */ ZFS_IOC_LEGACY_CLONE, /* 0x42 */ ZFS_IOC_LEGACY_BOOKMARK, /* 0x43 */ ZFS_IOC_LEGACY_GET_BOOKMARKS, /* 0x44 */ ZFS_IOC_LEGACY_DESTROY_BOOKMARKS, /* 0x45 */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_RECV_NEW */ ZFS_IOC_LEGACY_POOL_SYNC, /* 0x47 */ ZFS_IOC_LEGACY_CHANNEL_PROGRAM, /* 0x48 */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_LOAD_KEY */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_UNLOAD_KEY */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_CHANGE_KEY */ ZFS_IOC_LEGACY_REMAP, /* 0x4c */ ZFS_IOC_LEGACY_POOL_CHECKPOINT, /* 0x4d */ ZFS_IOC_LEGACY_POOL_DISCARD_CHECKPOINT, /* 0x4e */ ZFS_IOC_LEGACY_POOL_INITIALIZE, /* 0x4f */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_POOL_TRIM */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_REDACT */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_GET_BOOKMARK_PROPS */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_WAIT */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_WAIT_FS */ }; unsigned static long zfs_ioctl_ozfs_to_legacy_platform_[] = { ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_EVENTS_NEXT */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_EVENTS_CLEAR */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_EVENTS_SEEK */ ZFS_IOC_LEGACY_NEXTBOOT, ZFS_IOC_LEGACY_JAIL, ZFS_IOC_LEGACY_UNJAIL, ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_SET_BOOTENV */ ZFS_IOC_LEGACY_NONE, /* ZFS_IOC_GET_BOOTENV */ }; int zfs_ioctl_legacy_to_ozfs(int request) { if (request >= sizeof (zfs_ioctl_legacy_to_ozfs_)/sizeof (long)) return (-1); return (zfs_ioctl_legacy_to_ozfs_[request]); } int zfs_ioctl_ozfs_to_legacy(int request) { if (request > ZFS_IOC_LAST) return (-1); - if (request > ZFS_IOC_PLATFORM) + if (request > ZFS_IOC_PLATFORM) { + request -= ZFS_IOC_PLATFORM + 1; return (zfs_ioctl_ozfs_to_legacy_platform_[request]); + } if (request >= sizeof (zfs_ioctl_ozfs_to_legacy_common_)/sizeof (long)) return (-1); return (zfs_ioctl_ozfs_to_legacy_common_[request]); } void zfs_cmd_legacy_to_ozfs(zfs_cmd_legacy_t *src, zfs_cmd_t *dst) { memcpy(dst, src, offsetof(zfs_cmd_t, zc_objset_stats)); *&dst->zc_objset_stats = *&src->zc_objset_stats; memcpy(&dst->zc_begin_record, &src->zc_begin_record, offsetof(zfs_cmd_t, zc_sendobj) - offsetof(zfs_cmd_t, zc_begin_record)); memcpy(&dst->zc_sendobj, &src->zc_sendobj, sizeof (zfs_cmd_t) - 8 - offsetof(zfs_cmd_t, zc_sendobj)); dst->zc_zoneid = src->zc_jailid; } void zfs_cmd_ozfs_to_legacy(zfs_cmd_t *src, zfs_cmd_legacy_t *dst) { memcpy(dst, src, offsetof(zfs_cmd_t, zc_objset_stats)); *&dst->zc_objset_stats = *&src->zc_objset_stats; *&dst->zc_begin_record.drr_u.drr_begin = *&src->zc_begin_record; dst->zc_begin_record.drr_payloadlen = 0; dst->zc_begin_record.drr_type = 0; memcpy(&dst->zc_inject_record, &src->zc_inject_record, offsetof(zfs_cmd_t, zc_sendobj) - offsetof(zfs_cmd_t, zc_inject_record)); dst->zc_resumable = B_FALSE; memcpy(&dst->zc_sendobj, &src->zc_sendobj, sizeof (zfs_cmd_t) - 8 - offsetof(zfs_cmd_t, zc_sendobj)); dst->zc_jailid = src->zc_zoneid; } Index: vendor-sys/openzfs/dist/module/os/freebsd/zfs/zfs_vfsops.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/zfs/zfs_vfsops.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/zfs/zfs_vfsops.c (revision 366348) @@ -1,2289 +1,2293 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2011 Pawel Jakub Dawidek . * All rights reserved. * Copyright (c) 2012, 2015 by Delphix. All rights reserved. * Copyright (c) 2014 Integros [integros.com] * Copyright 2016 Nexenta Systems, Inc. All rights reserved. */ /* Portions Copyright 2010 Robert Milkowski */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "zfs_comutil.h" #ifndef MNTK_VMSETSIZE_BUG #define MNTK_VMSETSIZE_BUG 0 #endif #ifndef MNTK_NOMSYNC #define MNTK_NOMSYNC 8 #endif /* BEGIN CSTYLED */ struct mtx zfs_debug_mtx; MTX_SYSINIT(zfs_debug_mtx, &zfs_debug_mtx, "zfs_debug", MTX_DEF); SYSCTL_NODE(_vfs, OID_AUTO, zfs, CTLFLAG_RW, 0, "ZFS file system"); int zfs_super_owner; SYSCTL_INT(_vfs_zfs, OID_AUTO, super_owner, CTLFLAG_RW, &zfs_super_owner, 0, "File system owner can perform privileged operation on his file systems"); int zfs_debug_level; SYSCTL_INT(_vfs_zfs, OID_AUTO, debug, CTLFLAG_RWTUN, &zfs_debug_level, 0, "Debug level"); SYSCTL_NODE(_vfs_zfs, OID_AUTO, version, CTLFLAG_RD, 0, "ZFS versions"); static int zfs_version_acl = ZFS_ACL_VERSION; SYSCTL_INT(_vfs_zfs_version, OID_AUTO, acl, CTLFLAG_RD, &zfs_version_acl, 0, "ZFS_ACL_VERSION"); static int zfs_version_spa = SPA_VERSION; SYSCTL_INT(_vfs_zfs_version, OID_AUTO, spa, CTLFLAG_RD, &zfs_version_spa, 0, "SPA_VERSION"); static int zfs_version_zpl = ZPL_VERSION; SYSCTL_INT(_vfs_zfs_version, OID_AUTO, zpl, CTLFLAG_RD, &zfs_version_zpl, 0, "ZPL_VERSION"); /* END CSTYLED */ static int zfs_quotactl(vfs_t *vfsp, int cmds, uid_t id, void *arg); static int zfs_mount(vfs_t *vfsp); static int zfs_umount(vfs_t *vfsp, int fflag); static int zfs_root(vfs_t *vfsp, int flags, vnode_t **vpp); static int zfs_statfs(vfs_t *vfsp, struct statfs *statp); static int zfs_vget(vfs_t *vfsp, ino_t ino, int flags, vnode_t **vpp); static int zfs_sync(vfs_t *vfsp, int waitfor); #if __FreeBSD_version >= 1300098 static int zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, uint64_t *extflagsp, struct ucred **credanonp, int *numsecflavors, int *secflavors); #else static int zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp, struct ucred **credanonp, int *numsecflavors, int **secflavors); #endif static int zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, int flags, vnode_t **vpp); static void zfs_freevfs(vfs_t *vfsp); struct vfsops zfs_vfsops = { .vfs_mount = zfs_mount, .vfs_unmount = zfs_umount, #if __FreeBSD_version >= 1300049 .vfs_root = vfs_cache_root, .vfs_cachedroot = zfs_root, #else .vfs_root = zfs_root, #endif .vfs_statfs = zfs_statfs, .vfs_vget = zfs_vget, .vfs_sync = zfs_sync, .vfs_checkexp = zfs_checkexp, .vfs_fhtovp = zfs_fhtovp, .vfs_quotactl = zfs_quotactl, }; VFS_SET(zfs_vfsops, zfs, VFCF_JAIL | VFCF_DELEGADMIN); /* * We need to keep a count of active fs's. * This is necessary to prevent our module * from being unloaded after a umount -f */ static uint32_t zfs_active_fs_count = 0; int zfs_get_temporary_prop(dsl_dataset_t *ds, zfs_prop_t zfs_prop, uint64_t *val, char *setpoint) { int error; zfsvfs_t *zfvp; vfs_t *vfsp; objset_t *os; uint64_t tmp = *val; error = dmu_objset_from_ds(ds, &os); if (error != 0) return (error); error = getzfsvfs_impl(os, &zfvp); if (error != 0) return (error); if (zfvp == NULL) return (ENOENT); vfsp = zfvp->z_vfs; switch (zfs_prop) { case ZFS_PROP_ATIME: if (vfs_optionisset(vfsp, MNTOPT_NOATIME, NULL)) tmp = 0; if (vfs_optionisset(vfsp, MNTOPT_ATIME, NULL)) tmp = 1; break; case ZFS_PROP_DEVICES: if (vfs_optionisset(vfsp, MNTOPT_NODEVICES, NULL)) tmp = 0; if (vfs_optionisset(vfsp, MNTOPT_DEVICES, NULL)) tmp = 1; break; case ZFS_PROP_EXEC: if (vfs_optionisset(vfsp, MNTOPT_NOEXEC, NULL)) tmp = 0; if (vfs_optionisset(vfsp, MNTOPT_EXEC, NULL)) tmp = 1; break; case ZFS_PROP_SETUID: if (vfs_optionisset(vfsp, MNTOPT_NOSETUID, NULL)) tmp = 0; if (vfs_optionisset(vfsp, MNTOPT_SETUID, NULL)) tmp = 1; break; case ZFS_PROP_READONLY: if (vfs_optionisset(vfsp, MNTOPT_RW, NULL)) tmp = 0; if (vfs_optionisset(vfsp, MNTOPT_RO, NULL)) tmp = 1; break; case ZFS_PROP_XATTR: if (zfvp->z_flags & ZSB_XATTR) tmp = zfvp->z_xattr; break; case ZFS_PROP_NBMAND: if (vfs_optionisset(vfsp, MNTOPT_NONBMAND, NULL)) tmp = 0; if (vfs_optionisset(vfsp, MNTOPT_NBMAND, NULL)) tmp = 1; break; default: vfs_unbusy(vfsp); return (ENOENT); } vfs_unbusy(vfsp); if (tmp != *val) { (void) strcpy(setpoint, "temporary"); *val = tmp; } return (0); } static int zfs_getquota(zfsvfs_t *zfsvfs, uid_t id, int isgroup, struct dqblk64 *dqp) { int error = 0; char buf[32]; uint64_t usedobj, quotaobj; uint64_t quota, used = 0; timespec_t now; usedobj = isgroup ? DMU_GROUPUSED_OBJECT : DMU_USERUSED_OBJECT; quotaobj = isgroup ? zfsvfs->z_groupquota_obj : zfsvfs->z_userquota_obj; if (quotaobj == 0 || zfsvfs->z_replay) { error = ENOENT; goto done; } (void) sprintf(buf, "%llx", (longlong_t)id); if ((error = zap_lookup(zfsvfs->z_os, quotaobj, buf, sizeof (quota), 1, "a)) != 0) { dprintf("%s(%d): quotaobj lookup failed\n", __FUNCTION__, __LINE__); goto done; } /* * quota(8) uses bsoftlimit as "quoota", and hardlimit as "limit". * So we set them to be the same. */ dqp->dqb_bsoftlimit = dqp->dqb_bhardlimit = btodb(quota); error = zap_lookup(zfsvfs->z_os, usedobj, buf, sizeof (used), 1, &used); if (error && error != ENOENT) { dprintf("%s(%d): usedobj failed; %d\n", __FUNCTION__, __LINE__, error); goto done; } dqp->dqb_curblocks = btodb(used); dqp->dqb_ihardlimit = dqp->dqb_isoftlimit = 0; vfs_timestamp(&now); /* * Setting this to 0 causes FreeBSD quota(8) to print * the number of days since the epoch, which isn't * particularly useful. */ dqp->dqb_btime = dqp->dqb_itime = now.tv_sec; done: return (error); } static int zfs_quotactl(vfs_t *vfsp, int cmds, uid_t id, void *arg) { zfsvfs_t *zfsvfs = vfsp->vfs_data; struct thread *td; int cmd, type, error = 0; int bitsize; zfs_userquota_prop_t quota_type; struct dqblk64 dqblk = { 0 }; td = curthread; cmd = cmds >> SUBCMDSHIFT; type = cmds & SUBCMDMASK; ZFS_ENTER(zfsvfs); if (id == -1) { switch (type) { case USRQUOTA: id = td->td_ucred->cr_ruid; break; case GRPQUOTA: id = td->td_ucred->cr_rgid; break; default: error = EINVAL; if (cmd == Q_QUOTAON || cmd == Q_QUOTAOFF) vfs_unbusy(vfsp); goto done; } } /* * Map BSD type to: * ZFS_PROP_USERUSED, * ZFS_PROP_USERQUOTA, * ZFS_PROP_GROUPUSED, * ZFS_PROP_GROUPQUOTA */ switch (cmd) { case Q_SETQUOTA: case Q_SETQUOTA32: if (type == USRQUOTA) quota_type = ZFS_PROP_USERQUOTA; else if (type == GRPQUOTA) quota_type = ZFS_PROP_GROUPQUOTA; else error = EINVAL; break; case Q_GETQUOTA: case Q_GETQUOTA32: if (type == USRQUOTA) quota_type = ZFS_PROP_USERUSED; else if (type == GRPQUOTA) quota_type = ZFS_PROP_GROUPUSED; else error = EINVAL; break; } /* * Depending on the cmd, we may need to get * the ruid and domain (see fuidstr_to_sid?), * the fuid (how?), or other information. * Create fuid using zfs_fuid_create(zfsvfs, id, * ZFS_OWNER or ZFS_GROUP, cr, &fuidp)? * I think I can use just the id? * * Look at zfs_id_overquota() to look up a quota. * zap_lookup(something, quotaobj, fuidstring, * sizeof (long long), 1, "a) * * See zfs_set_userquota() to set a quota. */ if ((uint32_t)type >= MAXQUOTAS) { error = EINVAL; goto done; } switch (cmd) { case Q_GETQUOTASIZE: bitsize = 64; error = copyout(&bitsize, arg, sizeof (int)); break; case Q_QUOTAON: // As far as I can tell, you can't turn quotas on or off on zfs error = 0; vfs_unbusy(vfsp); break; case Q_QUOTAOFF: error = ENOTSUP; vfs_unbusy(vfsp); break; case Q_SETQUOTA: error = copyin(arg, &dqblk, sizeof (dqblk)); if (error == 0) error = zfs_set_userquota(zfsvfs, quota_type, "", id, dbtob(dqblk.dqb_bhardlimit)); break; case Q_GETQUOTA: error = zfs_getquota(zfsvfs, id, type == GRPQUOTA, &dqblk); if (error == 0) error = copyout(&dqblk, arg, sizeof (dqblk)); break; default: error = EINVAL; break; } done: ZFS_EXIT(zfsvfs); return (error); } boolean_t zfs_is_readonly(zfsvfs_t *zfsvfs) { return (!!(zfsvfs->z_vfs->vfs_flag & VFS_RDONLY)); } /*ARGSUSED*/ static int zfs_sync(vfs_t *vfsp, int waitfor) { /* * Data integrity is job one. We don't want a compromised kernel * writing to the storage pool, so we never sync during panic. */ if (panicstr) return (0); /* * Ignore the system syncher. ZFS already commits async data * at zfs_txg_timeout intervals. */ if (waitfor == MNT_LAZY) return (0); if (vfsp != NULL) { /* * Sync a specific filesystem. */ zfsvfs_t *zfsvfs = vfsp->vfs_data; dsl_pool_t *dp; int error; error = vfs_stdsync(vfsp, waitfor); if (error != 0) return (error); ZFS_ENTER(zfsvfs); dp = dmu_objset_pool(zfsvfs->z_os); /* * If the system is shutting down, then skip any * filesystems which may exist on a suspended pool. */ if (rebooting && spa_suspended(dp->dp_spa)) { ZFS_EXIT(zfsvfs); return (0); } if (zfsvfs->z_log != NULL) zil_commit(zfsvfs->z_log, 0); ZFS_EXIT(zfsvfs); } else { /* * Sync all ZFS filesystems. This is what happens when you * run sync(1M). Unlike other filesystems, ZFS honors the * request by waiting for all pools to commit all dirty data. */ spa_sync_allpools(); } return (0); } static void atime_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; if (newval == TRUE) { zfsvfs->z_atime = TRUE; zfsvfs->z_vfs->vfs_flag &= ~MNT_NOATIME; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_NOATIME); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_ATIME, NULL, 0); } else { zfsvfs->z_atime = FALSE; zfsvfs->z_vfs->vfs_flag |= MNT_NOATIME; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_ATIME); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_NOATIME, NULL, 0); } } static void xattr_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; if (newval == ZFS_XATTR_OFF) { zfsvfs->z_flags &= ~ZSB_XATTR; } else { zfsvfs->z_flags |= ZSB_XATTR; if (newval == ZFS_XATTR_SA) zfsvfs->z_xattr_sa = B_TRUE; else zfsvfs->z_xattr_sa = B_FALSE; } } static void blksz_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; ASSERT3U(newval, <=, spa_maxblocksize(dmu_objset_spa(zfsvfs->z_os))); ASSERT3U(newval, >=, SPA_MINBLOCKSIZE); ASSERT(ISP2(newval)); zfsvfs->z_max_blksz = newval; zfsvfs->z_vfs->mnt_stat.f_iosize = newval; } static void readonly_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; if (newval) { /* XXX locking on vfs_flag? */ zfsvfs->z_vfs->vfs_flag |= VFS_RDONLY; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_RW); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_RO, NULL, 0); } else { /* XXX locking on vfs_flag? */ zfsvfs->z_vfs->vfs_flag &= ~VFS_RDONLY; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_RO); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_RW, NULL, 0); } } static void setuid_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; if (newval == FALSE) { zfsvfs->z_vfs->vfs_flag |= VFS_NOSETUID; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_SETUID); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_NOSETUID, NULL, 0); } else { zfsvfs->z_vfs->vfs_flag &= ~VFS_NOSETUID; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_NOSETUID); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_SETUID, NULL, 0); } } static void exec_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; if (newval == FALSE) { zfsvfs->z_vfs->vfs_flag |= VFS_NOEXEC; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_EXEC); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_NOEXEC, NULL, 0); } else { zfsvfs->z_vfs->vfs_flag &= ~VFS_NOEXEC; vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_NOEXEC); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_EXEC, NULL, 0); } } /* * The nbmand mount option can be changed at mount time. * We can't allow it to be toggled on live file systems or incorrect * behavior may be seen from cifs clients * * This property isn't registered via dsl_prop_register(), but this callback * will be called when a file system is first mounted */ static void nbmand_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; if (newval == FALSE) { vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_NBMAND); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_NONBMAND, NULL, 0); } else { vfs_clearmntopt(zfsvfs->z_vfs, MNTOPT_NONBMAND); vfs_setmntopt(zfsvfs->z_vfs, MNTOPT_NBMAND, NULL, 0); } } static void snapdir_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; zfsvfs->z_show_ctldir = newval; } static void vscan_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; zfsvfs->z_vscan = newval; } static void acl_mode_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; zfsvfs->z_acl_mode = newval; } static void acl_inherit_changed_cb(void *arg, uint64_t newval) { zfsvfs_t *zfsvfs = arg; zfsvfs->z_acl_inherit = newval; } static int zfs_register_callbacks(vfs_t *vfsp) { struct dsl_dataset *ds = NULL; objset_t *os = NULL; zfsvfs_t *zfsvfs = NULL; uint64_t nbmand; boolean_t readonly = B_FALSE; boolean_t do_readonly = B_FALSE; boolean_t setuid = B_FALSE; boolean_t do_setuid = B_FALSE; boolean_t exec = B_FALSE; boolean_t do_exec = B_FALSE; boolean_t xattr = B_FALSE; boolean_t atime = B_FALSE; boolean_t do_atime = B_FALSE; boolean_t do_xattr = B_FALSE; int error = 0; ASSERT(vfsp); zfsvfs = vfsp->vfs_data; ASSERT(zfsvfs); os = zfsvfs->z_os; /* * This function can be called for a snapshot when we update snapshot's * mount point, which isn't really supported. */ if (dmu_objset_is_snapshot(os)) return (EOPNOTSUPP); /* * The act of registering our callbacks will destroy any mount * options we may have. In order to enable temporary overrides * of mount options, we stash away the current values and * restore them after we register the callbacks. */ if (vfs_optionisset(vfsp, MNTOPT_RO, NULL) || !spa_writeable(dmu_objset_spa(os))) { readonly = B_TRUE; do_readonly = B_TRUE; } else if (vfs_optionisset(vfsp, MNTOPT_RW, NULL)) { readonly = B_FALSE; do_readonly = B_TRUE; } if (vfs_optionisset(vfsp, MNTOPT_NOSETUID, NULL)) { setuid = B_FALSE; do_setuid = B_TRUE; } else if (vfs_optionisset(vfsp, MNTOPT_SETUID, NULL)) { setuid = B_TRUE; do_setuid = B_TRUE; } if (vfs_optionisset(vfsp, MNTOPT_NOEXEC, NULL)) { exec = B_FALSE; do_exec = B_TRUE; } else if (vfs_optionisset(vfsp, MNTOPT_EXEC, NULL)) { exec = B_TRUE; do_exec = B_TRUE; } if (vfs_optionisset(vfsp, MNTOPT_NOXATTR, NULL)) { zfsvfs->z_xattr = xattr = ZFS_XATTR_OFF; do_xattr = B_TRUE; } else if (vfs_optionisset(vfsp, MNTOPT_XATTR, NULL)) { zfsvfs->z_xattr = xattr = ZFS_XATTR_DIR; do_xattr = B_TRUE; } else if (vfs_optionisset(vfsp, MNTOPT_DIRXATTR, NULL)) { zfsvfs->z_xattr = xattr = ZFS_XATTR_DIR; do_xattr = B_TRUE; } else if (vfs_optionisset(vfsp, MNTOPT_SAXATTR, NULL)) { zfsvfs->z_xattr = xattr = ZFS_XATTR_SA; do_xattr = B_TRUE; } if (vfs_optionisset(vfsp, MNTOPT_NOATIME, NULL)) { atime = B_FALSE; do_atime = B_TRUE; } else if (vfs_optionisset(vfsp, MNTOPT_ATIME, NULL)) { atime = B_TRUE; do_atime = B_TRUE; } /* * We need to enter pool configuration here, so that we can use * dsl_prop_get_int_ds() to handle the special nbmand property below. * dsl_prop_get_integer() can not be used, because it has to acquire * spa_namespace_lock and we can not do that because we already hold * z_teardown_lock. The problem is that spa_write_cachefile() is called * with spa_namespace_lock held and the function calls ZFS vnode * operations to write the cache file and thus z_teardown_lock is * acquired after spa_namespace_lock. */ ds = dmu_objset_ds(os); dsl_pool_config_enter(dmu_objset_pool(os), FTAG); /* * nbmand is a special property. It can only be changed at * mount time. * * This is weird, but it is documented to only be changeable * at mount time. */ if (vfs_optionisset(vfsp, MNTOPT_NONBMAND, NULL)) { nbmand = B_FALSE; } else if (vfs_optionisset(vfsp, MNTOPT_NBMAND, NULL)) { nbmand = B_TRUE; } else if ((error = dsl_prop_get_int_ds(ds, "nbmand", &nbmand) != 0)) { dsl_pool_config_exit(dmu_objset_pool(os), FTAG); return (error); } /* * Register property callbacks. * * It would probably be fine to just check for i/o error from * the first prop_register(), but I guess I like to go * overboard... */ error = dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_ATIME), atime_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_XATTR), xattr_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_RECORDSIZE), blksz_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_READONLY), readonly_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_SETUID), setuid_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_EXEC), exec_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_SNAPDIR), snapdir_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_ACLMODE), acl_mode_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_ACLINHERIT), acl_inherit_changed_cb, zfsvfs); error = error ? error : dsl_prop_register(ds, zfs_prop_to_name(ZFS_PROP_VSCAN), vscan_changed_cb, zfsvfs); dsl_pool_config_exit(dmu_objset_pool(os), FTAG); if (error) goto unregister; /* * Invoke our callbacks to restore temporary mount options. */ if (do_readonly) readonly_changed_cb(zfsvfs, readonly); if (do_setuid) setuid_changed_cb(zfsvfs, setuid); if (do_exec) exec_changed_cb(zfsvfs, exec); if (do_xattr) xattr_changed_cb(zfsvfs, xattr); if (do_atime) atime_changed_cb(zfsvfs, atime); nbmand_changed_cb(zfsvfs, nbmand); return (0); unregister: dsl_prop_unregister_all(ds, zfsvfs); return (error); } /* * Associate this zfsvfs with the given objset, which must be owned. * This will cache a bunch of on-disk state from the objset in the * zfsvfs. */ static int zfsvfs_init(zfsvfs_t *zfsvfs, objset_t *os) { int error; uint64_t val; zfsvfs->z_max_blksz = SPA_OLD_MAXBLOCKSIZE; zfsvfs->z_show_ctldir = ZFS_SNAPDIR_VISIBLE; zfsvfs->z_os = os; error = zfs_get_zplprop(os, ZFS_PROP_VERSION, &zfsvfs->z_version); if (error != 0) return (error); if (zfsvfs->z_version > zfs_zpl_version_map(spa_version(dmu_objset_spa(os)))) { (void) printf("Can't mount a version %lld file system " "on a version %lld pool\n. Pool must be upgraded to mount " "this file system.", (u_longlong_t)zfsvfs->z_version, (u_longlong_t)spa_version(dmu_objset_spa(os))); return (SET_ERROR(ENOTSUP)); } error = zfs_get_zplprop(os, ZFS_PROP_NORMALIZE, &val); if (error != 0) return (error); zfsvfs->z_norm = (int)val; error = zfs_get_zplprop(os, ZFS_PROP_UTF8ONLY, &val); if (error != 0) return (error); zfsvfs->z_utf8 = (val != 0); error = zfs_get_zplprop(os, ZFS_PROP_CASE, &val); if (error != 0) return (error); zfsvfs->z_case = (uint_t)val; /* * Fold case on file systems that are always or sometimes case * insensitive. */ if (zfsvfs->z_case == ZFS_CASE_INSENSITIVE || zfsvfs->z_case == ZFS_CASE_MIXED) zfsvfs->z_norm |= U8_TEXTPREP_TOUPPER; zfsvfs->z_use_fuids = USE_FUIDS(zfsvfs->z_version, zfsvfs->z_os); zfsvfs->z_use_sa = USE_SA(zfsvfs->z_version, zfsvfs->z_os); uint64_t sa_obj = 0; if (zfsvfs->z_use_sa) { /* should either have both of these objects or none */ error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_SA_ATTRS, 8, 1, &sa_obj); if (error != 0) return (error); } error = sa_setup(os, sa_obj, zfs_attr_table, ZPL_END, &zfsvfs->z_attr_table); if (error != 0) return (error); if (zfsvfs->z_version >= ZPL_VERSION_SA) sa_register_update_callback(os, zfs_sa_upgrade); error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_ROOT_OBJ, 8, 1, &zfsvfs->z_root); if (error != 0) return (error); ASSERT(zfsvfs->z_root != 0); error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_UNLINKED_SET, 8, 1, &zfsvfs->z_unlinkedobj); if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, zfs_userquota_prop_prefixes[ZFS_PROP_USERQUOTA], 8, 1, &zfsvfs->z_userquota_obj); if (error == ENOENT) zfsvfs->z_userquota_obj = 0; else if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, zfs_userquota_prop_prefixes[ZFS_PROP_GROUPQUOTA], 8, 1, &zfsvfs->z_groupquota_obj); if (error == ENOENT) zfsvfs->z_groupquota_obj = 0; else if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, zfs_userquota_prop_prefixes[ZFS_PROP_PROJECTQUOTA], 8, 1, &zfsvfs->z_projectquota_obj); if (error == ENOENT) zfsvfs->z_projectquota_obj = 0; else if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, zfs_userquota_prop_prefixes[ZFS_PROP_USEROBJQUOTA], 8, 1, &zfsvfs->z_userobjquota_obj); if (error == ENOENT) zfsvfs->z_userobjquota_obj = 0; else if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, zfs_userquota_prop_prefixes[ZFS_PROP_GROUPOBJQUOTA], 8, 1, &zfsvfs->z_groupobjquota_obj); if (error == ENOENT) zfsvfs->z_groupobjquota_obj = 0; else if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, zfs_userquota_prop_prefixes[ZFS_PROP_PROJECTOBJQUOTA], 8, 1, &zfsvfs->z_projectobjquota_obj); if (error == ENOENT) zfsvfs->z_projectobjquota_obj = 0; else if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_FUID_TABLES, 8, 1, &zfsvfs->z_fuid_obj); if (error == ENOENT) zfsvfs->z_fuid_obj = 0; else if (error != 0) return (error); error = zap_lookup(os, MASTER_NODE_OBJ, ZFS_SHARES_DIR, 8, 1, &zfsvfs->z_shares_dir); if (error == ENOENT) zfsvfs->z_shares_dir = 0; else if (error != 0) return (error); /* * Only use the name cache if we are looking for a * name on a file system that does not require normalization * or case folding. We can also look there if we happen to be * on a non-normalizing, mixed sensitivity file system IF we * are looking for the exact name (which is always the case on * FreeBSD). */ zfsvfs->z_use_namecache = !zfsvfs->z_norm || ((zfsvfs->z_case == ZFS_CASE_MIXED) && !(zfsvfs->z_norm & ~U8_TEXTPREP_TOUPPER)); return (0); } taskq_t *zfsvfs_taskq; static void zfsvfs_task_unlinked_drain(void *context, int pending __unused) { zfs_unlinked_drain((zfsvfs_t *)context); } int zfsvfs_create(const char *osname, boolean_t readonly, zfsvfs_t **zfvp) { objset_t *os; zfsvfs_t *zfsvfs; int error; boolean_t ro = (readonly || (strchr(osname, '@') != NULL)); /* * XXX: Fix struct statfs so this isn't necessary! * * The 'osname' is used as the filesystem's special node, which means * it must fit in statfs.f_mntfromname, or else it can't be * enumerated, so libzfs_mnttab_find() returns NULL, which causes * 'zfs unmount' to think it's not mounted when it is. */ if (strlen(osname) >= MNAMELEN) return (SET_ERROR(ENAMETOOLONG)); zfsvfs = kmem_zalloc(sizeof (zfsvfs_t), KM_SLEEP); error = dmu_objset_own(osname, DMU_OST_ZFS, ro, B_TRUE, zfsvfs, &os); if (error != 0) { kmem_free(zfsvfs, sizeof (zfsvfs_t)); return (error); } error = zfsvfs_create_impl(zfvp, zfsvfs, os); return (error); } int zfsvfs_create_impl(zfsvfs_t **zfvp, zfsvfs_t *zfsvfs, objset_t *os) { int error; zfsvfs->z_vfs = NULL; zfsvfs->z_parent = zfsvfs; mutex_init(&zfsvfs->z_znodes_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&zfsvfs->z_lock, NULL, MUTEX_DEFAULT, NULL); list_create(&zfsvfs->z_all_znodes, sizeof (znode_t), offsetof(znode_t, z_link_node)); TASK_INIT(&zfsvfs->z_unlinked_drain_task, 0, zfsvfs_task_unlinked_drain, zfsvfs); #ifdef DIAGNOSTIC rrm_init(&zfsvfs->z_teardown_lock, B_TRUE); #else rrm_init(&zfsvfs->z_teardown_lock, B_FALSE); #endif ZFS_INIT_TEARDOWN_INACTIVE(zfsvfs); rw_init(&zfsvfs->z_fuid_lock, NULL, RW_DEFAULT, NULL); for (int i = 0; i != ZFS_OBJ_MTX_SZ; i++) mutex_init(&zfsvfs->z_hold_mtx[i], NULL, MUTEX_DEFAULT, NULL); error = zfsvfs_init(zfsvfs, os); if (error != 0) { dmu_objset_disown(os, B_TRUE, zfsvfs); *zfvp = NULL; kmem_free(zfsvfs, sizeof (zfsvfs_t)); return (error); } *zfvp = zfsvfs; return (0); } static int zfsvfs_setup(zfsvfs_t *zfsvfs, boolean_t mounting) { int error; /* * Check for a bad on-disk format version now since we * lied about owning the dataset readonly before. */ if (!(zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) && dmu_objset_incompatible_encryption_version(zfsvfs->z_os)) return (SET_ERROR(EROFS)); error = zfs_register_callbacks(zfsvfs->z_vfs); if (error) return (error); zfsvfs->z_log = zil_open(zfsvfs->z_os, zfs_get_data); /* * If we are not mounting (ie: online recv), then we don't * have to worry about replaying the log as we blocked all * operations out since we closed the ZIL. */ if (mounting) { boolean_t readonly; ASSERT3P(zfsvfs->z_kstat.dk_kstats, ==, NULL); dataset_kstats_create(&zfsvfs->z_kstat, zfsvfs->z_os); /* * During replay we remove the read only flag to * allow replays to succeed. */ readonly = zfsvfs->z_vfs->vfs_flag & VFS_RDONLY; if (readonly != 0) { zfsvfs->z_vfs->vfs_flag &= ~VFS_RDONLY; } else { dsl_dir_t *dd; zap_stats_t zs; if (zap_get_stats(zfsvfs->z_os, zfsvfs->z_unlinkedobj, &zs) == 0) { dataset_kstats_update_nunlinks_kstat( &zfsvfs->z_kstat, zs.zs_num_entries); dprintf_ds(zfsvfs->z_os->os_dsl_dataset, "num_entries in unlinked set: %llu", zs.zs_num_entries); } zfs_unlinked_drain(zfsvfs); dd = zfsvfs->z_os->os_dsl_dataset->ds_dir; dd->dd_activity_cancelled = B_FALSE; } /* * Parse and replay the intent log. * * Because of ziltest, this must be done after * zfs_unlinked_drain(). (Further note: ziltest * doesn't use readonly mounts, where * zfs_unlinked_drain() isn't called.) This is because * ziltest causes spa_sync() to think it's committed, * but actually it is not, so the intent log contains * many txg's worth of changes. * * In particular, if object N is in the unlinked set in * the last txg to actually sync, then it could be * actually freed in a later txg and then reallocated * in a yet later txg. This would write a "create * object N" record to the intent log. Normally, this * would be fine because the spa_sync() would have * written out the fact that object N is free, before * we could write the "create object N" intent log * record. * * But when we are in ziltest mode, we advance the "open * txg" without actually spa_sync()-ing the changes to * disk. So we would see that object N is still * allocated and in the unlinked set, and there is an * intent log record saying to allocate it. */ if (spa_writeable(dmu_objset_spa(zfsvfs->z_os))) { if (zil_replay_disable) { zil_destroy(zfsvfs->z_log, B_FALSE); } else { boolean_t use_nc = zfsvfs->z_use_namecache; zfsvfs->z_use_namecache = B_FALSE; zfsvfs->z_replay = B_TRUE; zil_replay(zfsvfs->z_os, zfsvfs, zfs_replay_vector); zfsvfs->z_replay = B_FALSE; zfsvfs->z_use_namecache = use_nc; } } /* restore readonly bit */ if (readonly != 0) zfsvfs->z_vfs->vfs_flag |= VFS_RDONLY; } /* * Set the objset user_ptr to track its zfsvfs. */ mutex_enter(&zfsvfs->z_os->os_user_ptr_lock); dmu_objset_set_user(zfsvfs->z_os, zfsvfs); mutex_exit(&zfsvfs->z_os->os_user_ptr_lock); return (0); } extern krwlock_t zfsvfs_lock; /* in zfs_znode.c */ void zfsvfs_free(zfsvfs_t *zfsvfs) { int i; /* * This is a barrier to prevent the filesystem from going away in * zfs_znode_move() until we can safely ensure that the filesystem is * not unmounted. We consider the filesystem valid before the barrier * and invalid after the barrier. */ rw_enter(&zfsvfs_lock, RW_READER); rw_exit(&zfsvfs_lock); zfs_fuid_destroy(zfsvfs); mutex_destroy(&zfsvfs->z_znodes_lock); mutex_destroy(&zfsvfs->z_lock); ASSERT(zfsvfs->z_nr_znodes == 0); list_destroy(&zfsvfs->z_all_znodes); rrm_destroy(&zfsvfs->z_teardown_lock); ZFS_DESTROY_TEARDOWN_INACTIVE(zfsvfs); rw_destroy(&zfsvfs->z_fuid_lock); for (i = 0; i != ZFS_OBJ_MTX_SZ; i++) mutex_destroy(&zfsvfs->z_hold_mtx[i]); dataset_kstats_destroy(&zfsvfs->z_kstat); kmem_free(zfsvfs, sizeof (zfsvfs_t)); } static void zfs_set_fuid_feature(zfsvfs_t *zfsvfs) { zfsvfs->z_use_fuids = USE_FUIDS(zfsvfs->z_version, zfsvfs->z_os); if (zfsvfs->z_vfs) { if (zfsvfs->z_use_fuids) { vfs_set_feature(zfsvfs->z_vfs, VFSFT_XVATTR); vfs_set_feature(zfsvfs->z_vfs, VFSFT_SYSATTR_VIEWS); vfs_set_feature(zfsvfs->z_vfs, VFSFT_ACEMASKONACCESS); vfs_set_feature(zfsvfs->z_vfs, VFSFT_ACLONCREATE); vfs_set_feature(zfsvfs->z_vfs, VFSFT_ACCESS_FILTER); vfs_set_feature(zfsvfs->z_vfs, VFSFT_REPARSE); } else { vfs_clear_feature(zfsvfs->z_vfs, VFSFT_XVATTR); vfs_clear_feature(zfsvfs->z_vfs, VFSFT_SYSATTR_VIEWS); vfs_clear_feature(zfsvfs->z_vfs, VFSFT_ACEMASKONACCESS); vfs_clear_feature(zfsvfs->z_vfs, VFSFT_ACLONCREATE); vfs_clear_feature(zfsvfs->z_vfs, VFSFT_ACCESS_FILTER); vfs_clear_feature(zfsvfs->z_vfs, VFSFT_REPARSE); } } zfsvfs->z_use_sa = USE_SA(zfsvfs->z_version, zfsvfs->z_os); } static int zfs_domount(vfs_t *vfsp, char *osname) { uint64_t recordsize, fsid_guid; int error = 0; zfsvfs_t *zfsvfs; ASSERT(vfsp); ASSERT(osname); error = zfsvfs_create(osname, vfsp->mnt_flag & MNT_RDONLY, &zfsvfs); if (error) return (error); zfsvfs->z_vfs = vfsp; if ((error = dsl_prop_get_integer(osname, "recordsize", &recordsize, NULL))) goto out; zfsvfs->z_vfs->vfs_bsize = SPA_MINBLOCKSIZE; zfsvfs->z_vfs->mnt_stat.f_iosize = recordsize; vfsp->vfs_data = zfsvfs; vfsp->mnt_flag |= MNT_LOCAL; vfsp->mnt_kern_flag |= MNTK_LOOKUP_SHARED; vfsp->mnt_kern_flag |= MNTK_SHARED_WRITES; vfsp->mnt_kern_flag |= MNTK_EXTENDED_SHARED; /* * This can cause a loss of coherence between ARC and page cache * on ZoF - unclear if the problem is in FreeBSD or ZoF */ vfsp->mnt_kern_flag |= MNTK_NO_IOPF; /* vn_io_fault can be used */ vfsp->mnt_kern_flag |= MNTK_NOMSYNC; vfsp->mnt_kern_flag |= MNTK_VMSETSIZE_BUG; #if defined(_KERNEL) && !defined(KMEM_DEBUG) vfsp->mnt_kern_flag |= MNTK_FPLOOKUP; #endif /* * The fsid is 64 bits, composed of an 8-bit fs type, which * separates our fsid from any other filesystem types, and a * 56-bit objset unique ID. The objset unique ID is unique to * all objsets open on this system, provided by unique_create(). * The 8-bit fs type must be put in the low bits of fsid[1] * because that's where other Solaris filesystems put it. */ fsid_guid = dmu_objset_fsid_guid(zfsvfs->z_os); ASSERT((fsid_guid & ~((1ULL<<56)-1)) == 0); vfsp->vfs_fsid.val[0] = fsid_guid; vfsp->vfs_fsid.val[1] = ((fsid_guid>>32) << 8) | (vfsp->mnt_vfc->vfc_typenum & 0xFF); /* * Set features for file system. */ zfs_set_fuid_feature(zfsvfs); if (zfsvfs->z_case == ZFS_CASE_INSENSITIVE) { vfs_set_feature(vfsp, VFSFT_DIRENTFLAGS); vfs_set_feature(vfsp, VFSFT_CASEINSENSITIVE); vfs_set_feature(vfsp, VFSFT_NOCASESENSITIVE); } else if (zfsvfs->z_case == ZFS_CASE_MIXED) { vfs_set_feature(vfsp, VFSFT_DIRENTFLAGS); vfs_set_feature(vfsp, VFSFT_CASEINSENSITIVE); } vfs_set_feature(vfsp, VFSFT_ZEROCOPY_SUPPORTED); if (dmu_objset_is_snapshot(zfsvfs->z_os)) { uint64_t pval; atime_changed_cb(zfsvfs, B_FALSE); readonly_changed_cb(zfsvfs, B_TRUE); if ((error = dsl_prop_get_integer(osname, "xattr", &pval, NULL))) goto out; xattr_changed_cb(zfsvfs, pval); zfsvfs->z_issnap = B_TRUE; zfsvfs->z_os->os_sync = ZFS_SYNC_DISABLED; mutex_enter(&zfsvfs->z_os->os_user_ptr_lock); dmu_objset_set_user(zfsvfs->z_os, zfsvfs); mutex_exit(&zfsvfs->z_os->os_user_ptr_lock); } else { if ((error = zfsvfs_setup(zfsvfs, B_TRUE))) goto out; } vfs_mountedfrom(vfsp, osname); if (!zfsvfs->z_issnap) zfsctl_create(zfsvfs); out: if (error) { dmu_objset_disown(zfsvfs->z_os, B_TRUE, zfsvfs); zfsvfs_free(zfsvfs); } else { atomic_inc_32(&zfs_active_fs_count); } return (error); } static void zfs_unregister_callbacks(zfsvfs_t *zfsvfs) { objset_t *os = zfsvfs->z_os; if (!dmu_objset_is_snapshot(os)) dsl_prop_unregister_all(dmu_objset_ds(os), zfsvfs); } static int getpoolname(const char *osname, char *poolname) { char *p; p = strchr(osname, '/'); if (p == NULL) { if (strlen(osname) >= MAXNAMELEN) return (ENAMETOOLONG); (void) strcpy(poolname, osname); } else { if (p - osname >= MAXNAMELEN) return (ENAMETOOLONG); (void) strncpy(poolname, osname, p - osname); poolname[p - osname] = '\0'; } return (0); } /*ARGSUSED*/ static int zfs_mount(vfs_t *vfsp) { kthread_t *td = curthread; vnode_t *mvp = vfsp->mnt_vnodecovered; cred_t *cr = td->td_ucred; char *osname; int error = 0; int canwrite; if (vfs_getopt(vfsp->mnt_optnew, "from", (void **)&osname, NULL)) return (SET_ERROR(EINVAL)); /* * If full-owner-access is enabled and delegated administration is * turned on, we must set nosuid. */ if (zfs_super_owner && dsl_deleg_access(osname, ZFS_DELEG_PERM_MOUNT, cr) != ECANCELED) { secpolicy_fs_mount_clearopts(cr, vfsp); } /* * Check for mount privilege? * * If we don't have privilege then see if * we have local permission to allow it */ error = secpolicy_fs_mount(cr, mvp, vfsp); if (error) { if (dsl_deleg_access(osname, ZFS_DELEG_PERM_MOUNT, cr) != 0) goto out; if (!(vfsp->vfs_flag & MS_REMOUNT)) { vattr_t vattr; /* * Make sure user is the owner of the mount point * or has sufficient privileges. */ vattr.va_mask = AT_UID; vn_lock(mvp, LK_SHARED | LK_RETRY); if (VOP_GETATTR(mvp, &vattr, cr)) { VOP_UNLOCK1(mvp); goto out; } if (secpolicy_vnode_owner(mvp, cr, vattr.va_uid) != 0 && VOP_ACCESS(mvp, VWRITE, cr, td) != 0) { VOP_UNLOCK1(mvp); goto out; } VOP_UNLOCK1(mvp); } secpolicy_fs_mount_clearopts(cr, vfsp); } /* * Refuse to mount a filesystem if we are in a local zone and the * dataset is not visible. */ if (!INGLOBALZONE(curproc) && (!zone_dataset_visible(osname, &canwrite) || !canwrite)) { error = SET_ERROR(EPERM); goto out; } vfsp->vfs_flag |= MNT_NFS4ACLS; /* * When doing a remount, we simply refresh our temporary properties * according to those options set in the current VFS options. */ if (vfsp->vfs_flag & MS_REMOUNT) { zfsvfs_t *zfsvfs = vfsp->vfs_data; /* * Refresh mount options with z_teardown_lock blocking I/O while * the filesystem is in an inconsistent state. * The lock also serializes this code with filesystem * manipulations between entry to zfs_suspend_fs() and return * from zfs_resume_fs(). */ rrm_enter(&zfsvfs->z_teardown_lock, RW_WRITER, FTAG); zfs_unregister_callbacks(zfsvfs); error = zfs_register_callbacks(vfsp); rrm_exit(&zfsvfs->z_teardown_lock, FTAG); goto out; } /* Initial root mount: try hard to import the requested root pool. */ if ((vfsp->vfs_flag & MNT_ROOTFS) != 0 && (vfsp->vfs_flag & MNT_UPDATE) == 0) { char pname[MAXNAMELEN]; error = getpoolname(osname, pname); if (error == 0) error = spa_import_rootpool(pname, false); if (error) goto out; } DROP_GIANT(); error = zfs_domount(vfsp, osname); PICKUP_GIANT(); out: return (error); } static int zfs_statfs(vfs_t *vfsp, struct statfs *statp) { zfsvfs_t *zfsvfs = vfsp->vfs_data; uint64_t refdbytes, availbytes, usedobjs, availobjs; statp->f_version = STATFS_VERSION; ZFS_ENTER(zfsvfs); dmu_objset_space(zfsvfs->z_os, &refdbytes, &availbytes, &usedobjs, &availobjs); /* * The underlying storage pool actually uses multiple block sizes. * We report the fragsize as the smallest block size we support, * and we report our blocksize as the filesystem's maximum blocksize. */ statp->f_bsize = SPA_MINBLOCKSIZE; statp->f_iosize = zfsvfs->z_vfs->mnt_stat.f_iosize; /* * The following report "total" blocks of various kinds in the * file system, but reported in terms of f_frsize - the * "fragment" size. */ statp->f_blocks = (refdbytes + availbytes) >> SPA_MINBLOCKSHIFT; statp->f_bfree = availbytes / statp->f_bsize; statp->f_bavail = statp->f_bfree; /* no root reservation */ /* * statvfs() should really be called statufs(), because it assumes * static metadata. ZFS doesn't preallocate files, so the best * we can do is report the max that could possibly fit in f_files, * and that minus the number actually used in f_ffree. * For f_ffree, report the smaller of the number of object available * and the number of blocks (each object will take at least a block). */ statp->f_ffree = MIN(availobjs, statp->f_bfree); statp->f_files = statp->f_ffree + usedobjs; /* * We're a zfs filesystem. */ strlcpy(statp->f_fstypename, "zfs", sizeof (statp->f_fstypename)); strlcpy(statp->f_mntfromname, vfsp->mnt_stat.f_mntfromname, sizeof (statp->f_mntfromname)); strlcpy(statp->f_mntonname, vfsp->mnt_stat.f_mntonname, sizeof (statp->f_mntonname)); statp->f_namemax = MAXNAMELEN - 1; ZFS_EXIT(zfsvfs); return (0); } static int zfs_root(vfs_t *vfsp, int flags, vnode_t **vpp) { zfsvfs_t *zfsvfs = vfsp->vfs_data; znode_t *rootzp; int error; ZFS_ENTER(zfsvfs); error = zfs_zget(zfsvfs, zfsvfs->z_root, &rootzp); if (error == 0) *vpp = ZTOV(rootzp); ZFS_EXIT(zfsvfs); if (error == 0) { error = vn_lock(*vpp, flags); if (error != 0) { VN_RELE(*vpp); *vpp = NULL; } } return (error); } /* * Teardown the zfsvfs::z_os. * * Note, if 'unmounting' is FALSE, we return with the 'z_teardown_lock' * and 'z_teardown_inactive_lock' held. */ static int zfsvfs_teardown(zfsvfs_t *zfsvfs, boolean_t unmounting) { znode_t *zp; dsl_dir_t *dd; /* * If someone has not already unmounted this file system, * drain the zrele_taskq to ensure all active references to the * zfsvfs_t have been handled only then can it be safely destroyed. */ if (zfsvfs->z_os) { /* * If we're unmounting we have to wait for the list to * drain completely. * * If we're not unmounting there's no guarantee the list * will drain completely, but zreles run from the taskq * may add the parents of dir-based xattrs to the taskq * so we want to wait for these. * * We can safely read z_nr_znodes without locking because the * VFS has already blocked operations which add to the * z_all_znodes list and thus increment z_nr_znodes. */ int round = 0; while (zfsvfs->z_nr_znodes > 0) { taskq_wait_outstanding(dsl_pool_zrele_taskq( dmu_objset_pool(zfsvfs->z_os)), 0); if (++round > 1 && !unmounting) break; } } rrm_enter(&zfsvfs->z_teardown_lock, RW_WRITER, FTAG); if (!unmounting) { /* * We purge the parent filesystem's vfsp as the parent * filesystem and all of its snapshots have their vnode's * v_vfsp set to the parent's filesystem's vfsp. Note, * 'z_parent' is self referential for non-snapshots. */ #ifdef FREEBSD_NAMECACHE +#if __FreeBSD_version >= 1300117 + cache_purgevfs(zfsvfs->z_parent->z_vfs); +#else cache_purgevfs(zfsvfs->z_parent->z_vfs, true); +#endif #endif } /* * Close the zil. NB: Can't close the zil while zfs_inactive * threads are blocked as zil_close can call zfs_inactive. */ if (zfsvfs->z_log) { zil_close(zfsvfs->z_log); zfsvfs->z_log = NULL; } ZFS_WLOCK_TEARDOWN_INACTIVE(zfsvfs); /* * If we are not unmounting (ie: online recv) and someone already * unmounted this file system while we were doing the switcheroo, * or a reopen of z_os failed then just bail out now. */ if (!unmounting && (zfsvfs->z_unmounted || zfsvfs->z_os == NULL)) { ZFS_WUNLOCK_TEARDOWN_INACTIVE(zfsvfs); rrm_exit(&zfsvfs->z_teardown_lock, FTAG); return (SET_ERROR(EIO)); } /* * At this point there are no vops active, and any new vops will * fail with EIO since we have z_teardown_lock for writer (only * relevant for forced unmount). * * Release all holds on dbufs. */ mutex_enter(&zfsvfs->z_znodes_lock); for (zp = list_head(&zfsvfs->z_all_znodes); zp != NULL; zp = list_next(&zfsvfs->z_all_znodes, zp)) if (zp->z_sa_hdl) { ASSERT(ZTOV(zp)->v_count >= 0); zfs_znode_dmu_fini(zp); } mutex_exit(&zfsvfs->z_znodes_lock); /* * If we are unmounting, set the unmounted flag and let new vops * unblock. zfs_inactive will have the unmounted behavior, and all * other vops will fail with EIO. */ if (unmounting) { zfsvfs->z_unmounted = B_TRUE; ZFS_WUNLOCK_TEARDOWN_INACTIVE(zfsvfs); rrm_exit(&zfsvfs->z_teardown_lock, FTAG); } /* * z_os will be NULL if there was an error in attempting to reopen * zfsvfs, so just return as the properties had already been * unregistered and cached data had been evicted before. */ if (zfsvfs->z_os == NULL) return (0); /* * Unregister properties. */ zfs_unregister_callbacks(zfsvfs); /* * Evict cached data */ if (!zfs_is_readonly(zfsvfs)) txg_wait_synced(dmu_objset_pool(zfsvfs->z_os), 0); dmu_objset_evict_dbufs(zfsvfs->z_os); dd = zfsvfs->z_os->os_dsl_dataset->ds_dir; dsl_dir_cancel_waiters(dd); return (0); } /*ARGSUSED*/ static int zfs_umount(vfs_t *vfsp, int fflag) { kthread_t *td = curthread; zfsvfs_t *zfsvfs = vfsp->vfs_data; objset_t *os; cred_t *cr = td->td_ucred; int ret; ret = secpolicy_fs_unmount(cr, vfsp); if (ret) { if (dsl_deleg_access((char *)vfsp->vfs_resource, ZFS_DELEG_PERM_MOUNT, cr)) return (ret); } /* * Unmount any snapshots mounted under .zfs before unmounting the * dataset itself. */ if (zfsvfs->z_ctldir != NULL) { if ((ret = zfsctl_umount_snapshots(vfsp, fflag, cr)) != 0) return (ret); } if (fflag & MS_FORCE) { /* * Mark file system as unmounted before calling * vflush(FORCECLOSE). This way we ensure no future vnops * will be called and risk operating on DOOMED vnodes. */ rrm_enter(&zfsvfs->z_teardown_lock, RW_WRITER, FTAG); zfsvfs->z_unmounted = B_TRUE; rrm_exit(&zfsvfs->z_teardown_lock, FTAG); } /* * Flush all the files. */ ret = vflush(vfsp, 0, (fflag & MS_FORCE) ? FORCECLOSE : 0, td); if (ret != 0) return (ret); while (taskqueue_cancel(zfsvfs_taskq->tq_queue, &zfsvfs->z_unlinked_drain_task, NULL) != 0) taskqueue_drain(zfsvfs_taskq->tq_queue, &zfsvfs->z_unlinked_drain_task); VERIFY(zfsvfs_teardown(zfsvfs, B_TRUE) == 0); os = zfsvfs->z_os; /* * z_os will be NULL if there was an error in * attempting to reopen zfsvfs. */ if (os != NULL) { /* * Unset the objset user_ptr. */ mutex_enter(&os->os_user_ptr_lock); dmu_objset_set_user(os, NULL); mutex_exit(&os->os_user_ptr_lock); /* * Finally release the objset */ dmu_objset_disown(os, B_TRUE, zfsvfs); } /* * We can now safely destroy the '.zfs' directory node. */ if (zfsvfs->z_ctldir != NULL) zfsctl_destroy(zfsvfs); zfs_freevfs(vfsp); return (0); } static int zfs_vget(vfs_t *vfsp, ino_t ino, int flags, vnode_t **vpp) { zfsvfs_t *zfsvfs = vfsp->vfs_data; znode_t *zp; int err; /* * zfs_zget() can't operate on virtual entries like .zfs/ or * .zfs/snapshot/ directories, that's why we return EOPNOTSUPP. * This will make NFS to switch to LOOKUP instead of using VGET. */ if (ino == ZFSCTL_INO_ROOT || ino == ZFSCTL_INO_SNAPDIR || (zfsvfs->z_shares_dir != 0 && ino == zfsvfs->z_shares_dir)) return (EOPNOTSUPP); ZFS_ENTER(zfsvfs); err = zfs_zget(zfsvfs, ino, &zp); if (err == 0 && zp->z_unlinked) { vrele(ZTOV(zp)); err = EINVAL; } if (err == 0) *vpp = ZTOV(zp); ZFS_EXIT(zfsvfs); if (err == 0) { err = vn_lock(*vpp, flags); if (err != 0) vrele(*vpp); } if (err != 0) *vpp = NULL; return (err); } static int #if __FreeBSD_version >= 1300098 zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, uint64_t *extflagsp, struct ucred **credanonp, int *numsecflavors, int *secflavors) #else zfs_checkexp(vfs_t *vfsp, struct sockaddr *nam, int *extflagsp, struct ucred **credanonp, int *numsecflavors, int **secflavors) #endif { zfsvfs_t *zfsvfs = vfsp->vfs_data; /* * If this is regular file system vfsp is the same as * zfsvfs->z_parent->z_vfs, but if it is snapshot, * zfsvfs->z_parent->z_vfs represents parent file system * which we have to use here, because only this file system * has mnt_export configured. */ return (vfs_stdcheckexp(zfsvfs->z_parent->z_vfs, nam, extflagsp, credanonp, numsecflavors, secflavors)); } CTASSERT(SHORT_FID_LEN <= sizeof (struct fid)); CTASSERT(LONG_FID_LEN <= sizeof (struct fid)); static int zfs_fhtovp(vfs_t *vfsp, fid_t *fidp, int flags, vnode_t **vpp) { struct componentname cn; zfsvfs_t *zfsvfs = vfsp->vfs_data; znode_t *zp; vnode_t *dvp; uint64_t object = 0; uint64_t fid_gen = 0; uint64_t gen_mask; uint64_t zp_gen; int i, err; *vpp = NULL; ZFS_ENTER(zfsvfs); /* * On FreeBSD we can get snapshot's mount point or its parent file * system mount point depending if snapshot is already mounted or not. */ if (zfsvfs->z_parent == zfsvfs && fidp->fid_len == LONG_FID_LEN) { zfid_long_t *zlfid = (zfid_long_t *)fidp; uint64_t objsetid = 0; uint64_t setgen = 0; for (i = 0; i < sizeof (zlfid->zf_setid); i++) objsetid |= ((uint64_t)zlfid->zf_setid[i]) << (8 * i); for (i = 0; i < sizeof (zlfid->zf_setgen); i++) setgen |= ((uint64_t)zlfid->zf_setgen[i]) << (8 * i); ZFS_EXIT(zfsvfs); err = zfsctl_lookup_objset(vfsp, objsetid, &zfsvfs); if (err) return (SET_ERROR(EINVAL)); ZFS_ENTER(zfsvfs); } if (fidp->fid_len == SHORT_FID_LEN || fidp->fid_len == LONG_FID_LEN) { zfid_short_t *zfid = (zfid_short_t *)fidp; for (i = 0; i < sizeof (zfid->zf_object); i++) object |= ((uint64_t)zfid->zf_object[i]) << (8 * i); for (i = 0; i < sizeof (zfid->zf_gen); i++) fid_gen |= ((uint64_t)zfid->zf_gen[i]) << (8 * i); } else { ZFS_EXIT(zfsvfs); return (SET_ERROR(EINVAL)); } /* * A zero fid_gen means we are in .zfs or the .zfs/snapshot * directory tree. If the object == zfsvfs->z_shares_dir, then * we are in the .zfs/shares directory tree. */ if ((fid_gen == 0 && (object == ZFSCTL_INO_ROOT || object == ZFSCTL_INO_SNAPDIR)) || (zfsvfs->z_shares_dir != 0 && object == zfsvfs->z_shares_dir)) { ZFS_EXIT(zfsvfs); VERIFY0(zfsctl_root(zfsvfs, LK_SHARED, &dvp)); if (object == ZFSCTL_INO_SNAPDIR) { cn.cn_nameptr = "snapshot"; cn.cn_namelen = strlen(cn.cn_nameptr); cn.cn_nameiop = LOOKUP; cn.cn_flags = ISLASTCN | LOCKLEAF; cn.cn_lkflags = flags; VERIFY0(VOP_LOOKUP(dvp, vpp, &cn)); vput(dvp); } else if (object == zfsvfs->z_shares_dir) { /* * XXX This branch must not be taken, * if it is, then the lookup below will * explode. */ cn.cn_nameptr = "shares"; cn.cn_namelen = strlen(cn.cn_nameptr); cn.cn_nameiop = LOOKUP; cn.cn_flags = ISLASTCN; cn.cn_lkflags = flags; VERIFY0(VOP_LOOKUP(dvp, vpp, &cn)); vput(dvp); } else { *vpp = dvp; } return (err); } gen_mask = -1ULL >> (64 - 8 * i); dprintf("getting %llu [%u mask %llx]\n", object, fid_gen, gen_mask); if ((err = zfs_zget(zfsvfs, object, &zp))) { ZFS_EXIT(zfsvfs); return (err); } (void) sa_lookup(zp->z_sa_hdl, SA_ZPL_GEN(zfsvfs), &zp_gen, sizeof (uint64_t)); zp_gen = zp_gen & gen_mask; if (zp_gen == 0) zp_gen = 1; if (zp->z_unlinked || zp_gen != fid_gen) { dprintf("znode gen (%u) != fid gen (%u)\n", zp_gen, fid_gen); vrele(ZTOV(zp)); ZFS_EXIT(zfsvfs); return (SET_ERROR(EINVAL)); } *vpp = ZTOV(zp); ZFS_EXIT(zfsvfs); err = vn_lock(*vpp, flags); if (err == 0) vnode_create_vobject(*vpp, zp->z_size, curthread); else *vpp = NULL; return (err); } /* * Block out VOPs and close zfsvfs_t::z_os * * Note, if successful, then we return with the 'z_teardown_lock' and * 'z_teardown_inactive_lock' write held. We leave ownership of the underlying * dataset and objset intact so that they can be atomically handed off during * a subsequent rollback or recv operation and the resume thereafter. */ int zfs_suspend_fs(zfsvfs_t *zfsvfs) { int error; if ((error = zfsvfs_teardown(zfsvfs, B_FALSE)) != 0) return (error); return (0); } /* * Rebuild SA and release VOPs. Note that ownership of the underlying dataset * is an invariant across any of the operations that can be performed while the * filesystem was suspended. Whether it succeeded or failed, the preconditions * are the same: the relevant objset and associated dataset are owned by * zfsvfs, held, and long held on entry. */ int zfs_resume_fs(zfsvfs_t *zfsvfs, dsl_dataset_t *ds) { int err; znode_t *zp; ASSERT(RRM_WRITE_HELD(&zfsvfs->z_teardown_lock)); ASSERT(ZFS_TEARDOWN_INACTIVE_WLOCKED(zfsvfs)); /* * We already own this, so just update the objset_t, as the one we * had before may have been evicted. */ objset_t *os; VERIFY3P(ds->ds_owner, ==, zfsvfs); VERIFY(dsl_dataset_long_held(ds)); dsl_pool_t *dp = spa_get_dsl(dsl_dataset_get_spa(ds)); dsl_pool_config_enter(dp, FTAG); VERIFY0(dmu_objset_from_ds(ds, &os)); dsl_pool_config_exit(dp, FTAG); err = zfsvfs_init(zfsvfs, os); if (err != 0) goto bail; ds->ds_dir->dd_activity_cancelled = B_FALSE; VERIFY(zfsvfs_setup(zfsvfs, B_FALSE) == 0); zfs_set_fuid_feature(zfsvfs); /* * Attempt to re-establish all the active znodes with * their dbufs. If a zfs_rezget() fails, then we'll let * any potential callers discover that via ZFS_ENTER_VERIFY_VP * when they try to use their znode. */ mutex_enter(&zfsvfs->z_znodes_lock); for (zp = list_head(&zfsvfs->z_all_znodes); zp; zp = list_next(&zfsvfs->z_all_znodes, zp)) { (void) zfs_rezget(zp); } mutex_exit(&zfsvfs->z_znodes_lock); bail: /* release the VOPs */ ZFS_WUNLOCK_TEARDOWN_INACTIVE(zfsvfs); rrm_exit(&zfsvfs->z_teardown_lock, FTAG); if (err) { /* * Since we couldn't setup the sa framework, try to force * unmount this file system. */ if (vn_vfswlock(zfsvfs->z_vfs->vfs_vnodecovered) == 0) { vfs_ref(zfsvfs->z_vfs); (void) dounmount(zfsvfs->z_vfs, MS_FORCE, curthread); } } return (err); } static void zfs_freevfs(vfs_t *vfsp) { zfsvfs_t *zfsvfs = vfsp->vfs_data; zfsvfs_free(zfsvfs); atomic_dec_32(&zfs_active_fs_count); } #ifdef __i386__ static int desiredvnodes_backup; #include #include #include #include #include #endif static void zfs_vnodes_adjust(void) { #ifdef __i386__ int newdesiredvnodes; desiredvnodes_backup = desiredvnodes; /* * We calculate newdesiredvnodes the same way it is done in * vntblinit(). If it is equal to desiredvnodes, it means that * it wasn't tuned by the administrator and we can tune it down. */ newdesiredvnodes = min(maxproc + vm_cnt.v_page_count / 4, 2 * vm_kmem_size / (5 * (sizeof (struct vm_object) + sizeof (struct vnode)))); if (newdesiredvnodes == desiredvnodes) desiredvnodes = (3 * newdesiredvnodes) / 4; #endif } static void zfs_vnodes_adjust_back(void) { #ifdef __i386__ desiredvnodes = desiredvnodes_backup; #endif } void zfs_init(void) { printf("ZFS filesystem version: " ZPL_VERSION_STRING "\n"); /* * Initialize .zfs directory structures */ zfsctl_init(); /* * Initialize znode cache, vnode ops, etc... */ zfs_znode_init(); /* * Reduce number of vnodes. Originally number of vnodes is calculated * with UFS inode in mind. We reduce it here, because it's too big for * ZFS/i386. */ zfs_vnodes_adjust(); dmu_objset_register_type(DMU_OST_ZFS, zpl_get_file_info); zfsvfs_taskq = taskq_create("zfsvfs", 1, minclsyspri, 0, 0, 0); } void zfs_fini(void) { taskq_destroy(zfsvfs_taskq); zfsctl_fini(); zfs_znode_fini(); zfs_vnodes_adjust_back(); } int zfs_busy(void) { return (zfs_active_fs_count != 0); } /* * Release VOPs and unmount a suspended filesystem. */ int zfs_end_fs(zfsvfs_t *zfsvfs, dsl_dataset_t *ds) { ASSERT(RRM_WRITE_HELD(&zfsvfs->z_teardown_lock)); ASSERT(ZFS_TEARDOWN_INACTIVE_WLOCKED(zfsvfs)); /* * We already own this, so just hold and rele it to update the * objset_t, as the one we had before may have been evicted. */ objset_t *os; VERIFY3P(ds->ds_owner, ==, zfsvfs); VERIFY(dsl_dataset_long_held(ds)); dsl_pool_t *dp = spa_get_dsl(dsl_dataset_get_spa(ds)); dsl_pool_config_enter(dp, FTAG); VERIFY0(dmu_objset_from_ds(ds, &os)); dsl_pool_config_exit(dp, FTAG); zfsvfs->z_os = os; /* release the VOPs */ ZFS_WUNLOCK_TEARDOWN_INACTIVE(zfsvfs); rrm_exit(&zfsvfs->z_teardown_lock, FTAG); /* * Try to force unmount this file system. */ (void) zfs_umount(zfsvfs->z_vfs, 0); zfsvfs->z_unmounted = B_TRUE; return (0); } int zfs_set_version(zfsvfs_t *zfsvfs, uint64_t newvers) { int error; objset_t *os = zfsvfs->z_os; dmu_tx_t *tx; if (newvers < ZPL_VERSION_INITIAL || newvers > ZPL_VERSION) return (SET_ERROR(EINVAL)); if (newvers < zfsvfs->z_version) return (SET_ERROR(EINVAL)); if (zfs_spa_version_map(newvers) > spa_version(dmu_objset_spa(zfsvfs->z_os))) return (SET_ERROR(ENOTSUP)); tx = dmu_tx_create(os); dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, B_FALSE, ZPL_VERSION_STR); if (newvers >= ZPL_VERSION_SA && !zfsvfs->z_use_sa) { dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, B_TRUE, ZFS_SA_ATTRS); dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, FALSE, NULL); } error = dmu_tx_assign(tx, TXG_WAIT); if (error) { dmu_tx_abort(tx); return (error); } error = zap_update(os, MASTER_NODE_OBJ, ZPL_VERSION_STR, 8, 1, &newvers, tx); if (error) { dmu_tx_commit(tx); return (error); } if (newvers >= ZPL_VERSION_SA && !zfsvfs->z_use_sa) { uint64_t sa_obj; ASSERT3U(spa_version(dmu_objset_spa(zfsvfs->z_os)), >=, SPA_VERSION_SA); sa_obj = zap_create(os, DMU_OT_SA_MASTER_NODE, DMU_OT_NONE, 0, tx); error = zap_add(os, MASTER_NODE_OBJ, ZFS_SA_ATTRS, 8, 1, &sa_obj, tx); ASSERT0(error); VERIFY(0 == sa_set_sa_object(os, sa_obj)); sa_register_update_callback(os, zfs_sa_upgrade); } spa_history_log_internal_ds(dmu_objset_ds(os), "upgrade", tx, "from %ju to %ju", (uintmax_t)zfsvfs->z_version, (uintmax_t)newvers); dmu_tx_commit(tx); zfsvfs->z_version = newvers; os->os_version = newvers; zfs_set_fuid_feature(zfsvfs); return (0); } /* * Read a property stored within the master node. */ int zfs_get_zplprop(objset_t *os, zfs_prop_t prop, uint64_t *value) { uint64_t *cached_copy = NULL; /* * Figure out where in the objset_t the cached copy would live, if it * is available for the requested property. */ if (os != NULL) { switch (prop) { case ZFS_PROP_VERSION: cached_copy = &os->os_version; break; case ZFS_PROP_NORMALIZE: cached_copy = &os->os_normalization; break; case ZFS_PROP_UTF8ONLY: cached_copy = &os->os_utf8only; break; case ZFS_PROP_CASE: cached_copy = &os->os_casesensitivity; break; default: break; } } if (cached_copy != NULL && *cached_copy != OBJSET_PROP_UNINITIALIZED) { *value = *cached_copy; return (0); } /* * If the property wasn't cached, look up the file system's value for * the property. For the version property, we look up a slightly * different string. */ const char *pname; int error = ENOENT; if (prop == ZFS_PROP_VERSION) { pname = ZPL_VERSION_STR; } else { pname = zfs_prop_to_name(prop); } if (os != NULL) { ASSERT3U(os->os_phys->os_type, ==, DMU_OST_ZFS); error = zap_lookup(os, MASTER_NODE_OBJ, pname, 8, 1, value); } if (error == ENOENT) { /* No value set, use the default value */ switch (prop) { case ZFS_PROP_VERSION: *value = ZPL_VERSION; break; case ZFS_PROP_NORMALIZE: case ZFS_PROP_UTF8ONLY: *value = 0; break; case ZFS_PROP_CASE: *value = ZFS_CASE_SENSITIVE; break; default: return (error); } error = 0; } /* * If one of the methods for getting the property value above worked, * copy it into the objset_t's cache. */ if (error == 0 && cached_copy != NULL) { *cached_copy = *value; } return (error); } /* * Return true if the corresponding vfs's unmounted flag is set. * Otherwise return false. * If this function returns true we know VFS unmount has been initiated. */ boolean_t zfs_get_vfs_flag_unmounted(objset_t *os) { zfsvfs_t *zfvp; boolean_t unmounted = B_FALSE; ASSERT(dmu_objset_type(os) == DMU_OST_ZFS); mutex_enter(&os->os_user_ptr_lock); zfvp = dmu_objset_get_user(os); if (zfvp != NULL && zfvp->z_vfs != NULL && (zfvp->z_vfs->mnt_kern_flag & MNTK_UNMOUNT)) unmounted = B_TRUE; mutex_exit(&os->os_user_ptr_lock); return (unmounted); } #ifdef _KERNEL void zfsvfs_update_fromname(const char *oldname, const char *newname) { char tmpbuf[MAXPATHLEN]; struct mount *mp; char *fromname; size_t oldlen; oldlen = strlen(oldname); mtx_lock(&mountlist_mtx); TAILQ_FOREACH(mp, &mountlist, mnt_list) { fromname = mp->mnt_stat.f_mntfromname; if (strcmp(fromname, oldname) == 0) { (void) strlcpy(fromname, newname, sizeof (mp->mnt_stat.f_mntfromname)); continue; } if (strncmp(fromname, oldname, oldlen) == 0 && (fromname[oldlen] == '/' || fromname[oldlen] == '@')) { (void) snprintf(tmpbuf, sizeof (tmpbuf), "%s%s", newname, fromname + oldlen); (void) strlcpy(fromname, tmpbuf, sizeof (mp->mnt_stat.f_mntfromname)); continue; } } mtx_unlock(&mountlist_mtx); } #endif Index: vendor-sys/openzfs/dist/module/os/freebsd/zfs/zio_crypt.c =================================================================== --- vendor-sys/openzfs/dist/module/os/freebsd/zfs/zio_crypt.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/freebsd/zfs/zio_crypt.c (revision 366348) @@ -1,1882 +1,1813 @@ /* * CDDL HEADER START * * This file and its contents are supplied under the terms of the * Common Development and Distribution License ("CDDL"), version 1.0. * You may only use this file in accordance with the terms of version * 1.0 of the CDDL. * * A full copy of the text of the CDDL should have accompanied this * source. A copy of the CDDL is also available via the Internet at * http://www.illumos.org/license/CDDL. * * CDDL HEADER END */ /* * Copyright (c) 2017, Datto, Inc. All rights reserved. */ #include #include #include #include #include #include #include #include #include /* * This file is responsible for handling all of the details of generating * encryption parameters and performing encryption and authentication. * * BLOCK ENCRYPTION PARAMETERS: * Encryption /Authentication Algorithm Suite (crypt): * The encryption algorithm, mode, and key length we are going to use. We * currently support AES in either GCM or CCM modes with 128, 192, and 256 bit * keys. All authentication is currently done with SHA512-HMAC. * * Plaintext: * The unencrypted data that we want to encrypt. * * Initialization Vector (IV): * An initialization vector for the encryption algorithms. This is used to * "tweak" the encryption algorithms so that two blocks of the same data are * encrypted into different ciphertext outputs, thus obfuscating block patterns. * The supported encryption modes (AES-GCM and AES-CCM) require that an IV is * never reused with the same encryption key. This value is stored unencrypted * and must simply be provided to the decryption function. We use a 96 bit IV * (as recommended by NIST) for all block encryption. For non-dedup blocks we * derive the IV randomly. The first 64 bits of the IV are stored in the second * word of DVA[2] and the remaining 32 bits are stored in the upper 32 bits of * blk_fill. This is safe because encrypted blocks can't use the upper 32 bits * of blk_fill. We only encrypt level 0 blocks, which normally have a fill count * of 1. The only exception is for DMU_OT_DNODE objects, where the fill count of * level 0 blocks is the number of allocated dnodes in that block. The on-disk * format supports at most 2^15 slots per L0 dnode block, because the maximum * block size is 16MB (2^24). In either case, for level 0 blocks this number * will still be smaller than UINT32_MAX so it is safe to store the IV in the * top 32 bits of blk_fill, while leaving the bottom 32 bits of the fill count * for the dnode code. * * Master key: * This is the most important secret data of an encrypted dataset. It is used * along with the salt to generate that actual encryption keys via HKDF. We * do not use the master key to directly encrypt any data because there are * theoretical limits on how much data can actually be safely encrypted with * any encryption mode. The master key is stored encrypted on disk with the * user's wrapping key. Its length is determined by the encryption algorithm. * For details on how this is stored see the block comment in dsl_crypt.c * * Salt: * Used as an input to the HKDF function, along with the master key. We use a * 64 bit salt, stored unencrypted in the first word of DVA[2]. Any given salt * can be used for encrypting many blocks, so we cache the current salt and the * associated derived key in zio_crypt_t so we do not need to derive it again * needlessly. * * Encryption Key: * A secret binary key, generated from an HKDF function used to encrypt and * decrypt data. * * Message Authentication Code (MAC) * The MAC is an output of authenticated encryption modes such as AES-GCM and * AES-CCM. Its purpose is to ensure that an attacker cannot modify encrypted * data on disk and return garbage to the application. Effectively, it is a * checksum that can not be reproduced by an attacker. We store the MAC in the * second 128 bits of blk_cksum, leaving the first 128 bits for a truncated * regular checksum of the ciphertext which can be used for scrubbing. * * OBJECT AUTHENTICATION: * Some object types, such as DMU_OT_MASTER_NODE cannot be encrypted because * they contain some info that always needs to be readable. To prevent this * data from being altered, we authenticate this data using SHA512-HMAC. This * will produce a MAC (similar to the one produced via encryption) which can * be used to verify the object was not modified. HMACs do not require key * rotation or IVs, so we can keep up to the full 3 copies of authenticated * data. * * ZIL ENCRYPTION: * ZIL blocks have their bp written to disk ahead of the associated data, so we * cannot store the MAC there as we normally do. For these blocks the MAC is * stored in the embedded checksum within the zil_chain_t header. The salt and * IV are generated for the block on bp allocation instead of at encryption * time. In addition, ZIL blocks have some pieces that must be left in plaintext * for claiming even though all of the sensitive user data still needs to be * encrypted. The function zio_crypt_init_uios_zil() handles parsing which * pieces of the block need to be encrypted. All data that is not encrypted is * authenticated using the AAD mechanisms that the supported encryption modes * provide for. In order to preserve the semantics of the ZIL for encrypted * datasets, the ZIL is not protected at the objset level as described below. * * DNODE ENCRYPTION: * Similarly to ZIL blocks, the core part of each dnode_phys_t needs to be left * in plaintext for scrubbing and claiming, but the bonus buffers might contain * sensitive user data. The function zio_crypt_init_uios_dnode() handles parsing * which which pieces of the block need to be encrypted. For more details about * dnode authentication and encryption, see zio_crypt_init_uios_dnode(). * * OBJECT SET AUTHENTICATION: * Up to this point, everything we have encrypted and authenticated has been * at level 0 (or -2 for the ZIL). If we did not do any further work the * on-disk format would be susceptible to attacks that deleted or rearranged * the order of level 0 blocks. Ideally, the cleanest solution would be to * maintain a tree of authentication MACs going up the bp tree. However, this * presents a problem for raw sends. Send files do not send information about * indirect blocks so there would be no convenient way to transfer the MACs and * they cannot be recalculated on the receive side without the master key which * would defeat one of the purposes of raw sends in the first place. Instead, * for the indirect levels of the bp tree, we use a regular SHA512 of the MACs * from the level below. We also include some portable fields from blk_prop such * as the lsize and compression algorithm to prevent the data from being * misinterpreted. * * At the objset level, we maintain 2 separate 256 bit MACs in the * objset_phys_t. The first one is "portable" and is the logical root of the * MAC tree maintained in the metadnode's bps. The second, is "local" and is * used as the root MAC for the user accounting objects, which are also not * transferred via "zfs send". The portable MAC is sent in the DRR_BEGIN payload * of the send file. The useraccounting code ensures that the useraccounting * info is not present upon a receive, so the local MAC can simply be cleared * out at that time. For more info about objset_phys_t authentication, see * zio_crypt_do_objset_hmacs(). * * CONSIDERATIONS FOR DEDUP: * In order for dedup to work, blocks that we want to dedup with one another * need to use the same IV and encryption key, so that they will have the same * ciphertext. Normally, one should never reuse an IV with the same encryption * key or else AES-GCM and AES-CCM can both actually leak the plaintext of both * blocks. In this case, however, since we are using the same plaintext as * well all that we end up with is a duplicate of the original ciphertext we * already had. As a result, an attacker with read access to the raw disk will * be able to tell which blocks are the same but this information is given away * by dedup anyway. In order to get the same IVs and encryption keys for * equivalent blocks of data we use an HMAC of the plaintext. We use an HMAC * here so that a reproducible checksum of the plaintext is never available to * the attacker. The HMAC key is kept alongside the master key, encrypted on * disk. The first 64 bits of the HMAC are used in place of the random salt, and * the next 96 bits are used as the IV. As a result of this mechanism, dedup * will only work within a clone family since encrypted dedup requires use of * the same master and HMAC keys. */ /* * After encrypting many blocks with the same key we may start to run up * against the theoretical limits of how much data can securely be encrypted * with a single key using the supported encryption modes. The most obvious * limitation is that our risk of generating 2 equivalent 96 bit IVs increases * the more IVs we generate (which both GCM and CCM modes strictly forbid). * This risk actually grows surprisingly quickly over time according to the * Birthday Problem. With a total IV space of 2^(96 bits), and assuming we have * generated n IVs with a cryptographically secure RNG, the approximate * probability p(n) of a collision is given as: * * p(n) ~= e^(-n*(n-1)/(2*(2^96))) * * [http://www.math.cornell.edu/~mec/2008-2009/TianyiZheng/Birthday.html] * * Assuming that we want to ensure that p(n) never goes over 1 / 1 trillion * we must not write more than 398,065,730 blocks with the same encryption key. * Therefore, we rotate our keys after 400,000,000 blocks have been written by * generating a new random 64 bit salt for our HKDF encryption key generation * function. */ #define ZFS_KEY_MAX_SALT_USES_DEFAULT 400000000 #define ZFS_CURRENT_MAX_SALT_USES \ (MIN(zfs_key_max_salt_uses, ZFS_KEY_MAX_SALT_USES_DEFAULT)) unsigned long zfs_key_max_salt_uses = ZFS_KEY_MAX_SALT_USES_DEFAULT; /* * Set to a nonzero value to cause zio_do_crypt_uio() to fail 1/this many * calls, to test decryption error handling code paths. */ uint64_t zio_decrypt_fail_fraction = 0; typedef struct blkptr_auth_buf { uint64_t bab_prop; /* blk_prop - portable mask */ uint8_t bab_mac[ZIO_DATA_MAC_LEN]; /* MAC from blk_cksum */ uint64_t bab_pad; /* reserved for future use */ } blkptr_auth_buf_t; zio_crypt_info_t zio_crypt_table[ZIO_CRYPT_FUNCTIONS] = { {"", ZC_TYPE_NONE, 0, "inherit"}, {"", ZC_TYPE_NONE, 0, "on"}, {"", ZC_TYPE_NONE, 0, "off"}, {SUN_CKM_AES_CCM, ZC_TYPE_CCM, 16, "aes-128-ccm"}, {SUN_CKM_AES_CCM, ZC_TYPE_CCM, 24, "aes-192-ccm"}, {SUN_CKM_AES_CCM, ZC_TYPE_CCM, 32, "aes-256-ccm"}, {SUN_CKM_AES_GCM, ZC_TYPE_GCM, 16, "aes-128-gcm"}, {SUN_CKM_AES_GCM, ZC_TYPE_GCM, 24, "aes-192-gcm"}, {SUN_CKM_AES_GCM, ZC_TYPE_GCM, 32, "aes-256-gcm"} }; static void zio_crypt_key_destroy_early(zio_crypt_key_t *key) { rw_destroy(&key->zk_salt_lock); /* free crypto templates */ bzero(&key->zk_session, sizeof (key->zk_session)); /* zero out sensitive data */ bzero(key, sizeof (zio_crypt_key_t)); } void zio_crypt_key_destroy(zio_crypt_key_t *key) { freebsd_crypt_freesession(&key->zk_session); zio_crypt_key_destroy_early(key); } int zio_crypt_key_init(uint64_t crypt, zio_crypt_key_t *key) { int ret; crypto_mechanism_t mech __unused; uint_t keydata_len; zio_crypt_info_t *ci = NULL; ASSERT(key != NULL); ASSERT3U(crypt, <, ZIO_CRYPT_FUNCTIONS); ci = &zio_crypt_table[crypt]; if (ci->ci_crypt_type != ZC_TYPE_GCM && ci->ci_crypt_type != ZC_TYPE_CCM) return (ENOTSUP); keydata_len = zio_crypt_table[crypt].ci_keylen; bzero(key, sizeof (zio_crypt_key_t)); rw_init(&key->zk_salt_lock, NULL, RW_DEFAULT, NULL); /* fill keydata buffers and salt with random data */ ret = random_get_bytes((uint8_t *)&key->zk_guid, sizeof (uint64_t)); if (ret != 0) goto error; ret = random_get_bytes(key->zk_master_keydata, keydata_len); if (ret != 0) goto error; ret = random_get_bytes(key->zk_hmac_keydata, SHA512_HMAC_KEYLEN); if (ret != 0) goto error; ret = random_get_bytes(key->zk_salt, ZIO_DATA_SALT_LEN); if (ret != 0) goto error; /* derive the current key from the master key */ ret = hkdf_sha512(key->zk_master_keydata, keydata_len, NULL, 0, key->zk_salt, ZIO_DATA_SALT_LEN, key->zk_current_keydata, keydata_len); if (ret != 0) goto error; /* initialize keys for the ICP */ key->zk_current_key.ck_format = CRYPTO_KEY_RAW; key->zk_current_key.ck_data = key->zk_current_keydata; key->zk_current_key.ck_length = CRYPTO_BYTES2BITS(keydata_len); key->zk_hmac_key.ck_format = CRYPTO_KEY_RAW; key->zk_hmac_key.ck_data = &key->zk_hmac_key; key->zk_hmac_key.ck_length = CRYPTO_BYTES2BITS(SHA512_HMAC_KEYLEN); ci = &zio_crypt_table[crypt]; if (ci->ci_crypt_type != ZC_TYPE_GCM && ci->ci_crypt_type != ZC_TYPE_CCM) return (ENOTSUP); ret = freebsd_crypt_newsession(&key->zk_session, ci, &key->zk_current_key); if (ret) goto error; key->zk_crypt = crypt; key->zk_version = ZIO_CRYPT_KEY_CURRENT_VERSION; key->zk_salt_count = 0; return (0); error: zio_crypt_key_destroy_early(key); return (ret); } static int zio_crypt_key_change_salt(zio_crypt_key_t *key) { int ret = 0; uint8_t salt[ZIO_DATA_SALT_LEN]; crypto_mechanism_t mech __unused; uint_t keydata_len = zio_crypt_table[key->zk_crypt].ci_keylen; /* generate a new salt */ ret = random_get_bytes(salt, ZIO_DATA_SALT_LEN); if (ret != 0) goto error; rw_enter(&key->zk_salt_lock, RW_WRITER); /* someone beat us to the salt rotation, just unlock and return */ if (key->zk_salt_count < ZFS_CURRENT_MAX_SALT_USES) goto out_unlock; /* derive the current key from the master key and the new salt */ ret = hkdf_sha512(key->zk_master_keydata, keydata_len, NULL, 0, salt, ZIO_DATA_SALT_LEN, key->zk_current_keydata, keydata_len); if (ret != 0) goto out_unlock; /* assign the salt and reset the usage count */ bcopy(salt, key->zk_salt, ZIO_DATA_SALT_LEN); key->zk_salt_count = 0; freebsd_crypt_freesession(&key->zk_session); ret = freebsd_crypt_newsession(&key->zk_session, &zio_crypt_table[key->zk_crypt], &key->zk_current_key); if (ret != 0) goto out_unlock; rw_exit(&key->zk_salt_lock); return (0); out_unlock: rw_exit(&key->zk_salt_lock); error: return (ret); } /* See comment above zfs_key_max_salt_uses definition for details */ int zio_crypt_key_get_salt(zio_crypt_key_t *key, uint8_t *salt) { int ret; boolean_t salt_change; rw_enter(&key->zk_salt_lock, RW_READER); bcopy(key->zk_salt, salt, ZIO_DATA_SALT_LEN); salt_change = (atomic_inc_64_nv(&key->zk_salt_count) >= ZFS_CURRENT_MAX_SALT_USES); rw_exit(&key->zk_salt_lock); if (salt_change) { ret = zio_crypt_key_change_salt(key); if (ret != 0) goto error; } return (0); error: return (ret); } void *failed_decrypt_buf; int failed_decrypt_size; /* * This function handles all encryption and decryption in zfs. When * encrypting it expects puio to reference the plaintext and cuio to * reference the ciphertext. cuio must have enough space for the * ciphertext + room for a MAC. datalen should be the length of the * plaintext / ciphertext alone. */ /* * The implementation for FreeBSD's OpenCrypto. * * The big difference between ICP and FOC is that FOC uses a single * buffer for input and output. This means that (for AES-GCM, the * only one supported right now) the source must be copied into the * destination, and the destination must have the AAD, and the tag/MAC, * already associated with it. (Both implementations can use a uio.) * * Since the auth data is part of the iovec array, all we need to know * is the length: 0 means there's no AAD. * */ static int zio_do_crypt_uio_opencrypto(boolean_t encrypt, freebsd_crypt_session_t *sess, uint64_t crypt, crypto_key_t *key, uint8_t *ivbuf, uint_t datalen, uio_t *uio, uint_t auth_len) { zio_crypt_info_t *ci; int ret; ci = &zio_crypt_table[crypt]; if (ci->ci_crypt_type != ZC_TYPE_GCM && ci->ci_crypt_type != ZC_TYPE_CCM) return (ENOTSUP); ret = freebsd_crypt_uio(encrypt, sess, ci, uio, key, ivbuf, datalen, auth_len); if (ret != 0) { #ifdef FCRYPTO_DEBUG printf("%s(%d): Returning error %s\n", __FUNCTION__, __LINE__, encrypt ? "EIO" : "ECKSUM"); #endif ret = SET_ERROR(encrypt ? EIO : ECKSUM); } return (ret); } int zio_crypt_key_wrap(crypto_key_t *cwkey, zio_crypt_key_t *key, uint8_t *iv, uint8_t *mac, uint8_t *keydata_out, uint8_t *hmac_keydata_out) { int ret; uint64_t aad[3]; /* * With OpenCrypto in FreeBSD, the same buffer is used for * input and output. Also, the AAD (for AES-GMC at least) * needs to logically go in front. */ uio_t cuio; iovec_t iovecs[4]; uint64_t crypt = key->zk_crypt; uint_t enc_len, keydata_len, aad_len; ASSERT3U(crypt, <, ZIO_CRYPT_FUNCTIONS); ASSERT3U(cwkey->ck_format, ==, CRYPTO_KEY_RAW); keydata_len = zio_crypt_table[crypt].ci_keylen; /* generate iv for wrapping the master and hmac key */ ret = random_get_pseudo_bytes(iv, WRAPPING_IV_LEN); if (ret != 0) goto error; /* * Since we only support one buffer, we need to copy * the plain text (source) to the cipher buffer (dest). * We set iovecs[0] -- the authentication data -- below. */ bcopy((void*)key->zk_master_keydata, keydata_out, keydata_len); bcopy((void*)key->zk_hmac_keydata, hmac_keydata_out, SHA512_HMAC_KEYLEN); iovecs[1].iov_base = keydata_out; iovecs[1].iov_len = keydata_len; iovecs[2].iov_base = hmac_keydata_out; iovecs[2].iov_len = SHA512_HMAC_KEYLEN; iovecs[3].iov_base = mac; iovecs[3].iov_len = WRAPPING_MAC_LEN; /* * Although we don't support writing to the old format, we do * support rewrapping the key so that the user can move and * quarantine datasets on the old format. */ if (key->zk_version == 0) { aad_len = sizeof (uint64_t); aad[0] = LE_64(key->zk_guid); } else { ASSERT3U(key->zk_version, ==, ZIO_CRYPT_KEY_CURRENT_VERSION); aad_len = sizeof (uint64_t) * 3; aad[0] = LE_64(key->zk_guid); aad[1] = LE_64(crypt); aad[2] = LE_64(key->zk_version); } iovecs[0].iov_base = aad; iovecs[0].iov_len = aad_len; enc_len = zio_crypt_table[crypt].ci_keylen + SHA512_HMAC_KEYLEN; cuio.uio_iov = iovecs; cuio.uio_iovcnt = 4; cuio.uio_segflg = UIO_SYSSPACE; /* encrypt the keys and store the resulting ciphertext and mac */ ret = zio_do_crypt_uio_opencrypto(B_TRUE, NULL, crypt, cwkey, iv, enc_len, &cuio, aad_len); if (ret != 0) goto error; return (0); error: return (ret); } int zio_crypt_key_unwrap(crypto_key_t *cwkey, uint64_t crypt, uint64_t version, uint64_t guid, uint8_t *keydata, uint8_t *hmac_keydata, uint8_t *iv, uint8_t *mac, zio_crypt_key_t *key) { int ret; uint64_t aad[3]; /* * With OpenCrypto in FreeBSD, the same buffer is used for * input and output. Also, the AAD (for AES-GMC at least) * needs to logically go in front. */ uio_t cuio; iovec_t iovecs[4]; void *src, *dst; uint_t enc_len, keydata_len, aad_len; ASSERT3U(crypt, <, ZIO_CRYPT_FUNCTIONS); ASSERT3U(cwkey->ck_format, ==, CRYPTO_KEY_RAW); keydata_len = zio_crypt_table[crypt].ci_keylen; rw_init(&key->zk_salt_lock, NULL, RW_DEFAULT, NULL); /* * Since we only support one buffer, we need to copy * the encrypted buffer (source) to the plain buffer * (dest). We set iovecs[0] -- the authentication data -- * below. */ dst = key->zk_master_keydata; src = keydata; bcopy(src, dst, keydata_len); dst = key->zk_hmac_keydata; src = hmac_keydata; bcopy(src, dst, SHA512_HMAC_KEYLEN); iovecs[1].iov_base = key->zk_master_keydata; iovecs[1].iov_len = keydata_len; iovecs[2].iov_base = key->zk_hmac_keydata; iovecs[2].iov_len = SHA512_HMAC_KEYLEN; iovecs[3].iov_base = mac; iovecs[3].iov_len = WRAPPING_MAC_LEN; if (version == 0) { aad_len = sizeof (uint64_t); aad[0] = LE_64(guid); } else { ASSERT3U(version, ==, ZIO_CRYPT_KEY_CURRENT_VERSION); aad_len = sizeof (uint64_t) * 3; aad[0] = LE_64(guid); aad[1] = LE_64(crypt); aad[2] = LE_64(version); } enc_len = keydata_len + SHA512_HMAC_KEYLEN; iovecs[0].iov_base = aad; iovecs[0].iov_len = aad_len; cuio.uio_iov = iovecs; cuio.uio_iovcnt = 4; cuio.uio_segflg = UIO_SYSSPACE; /* decrypt the keys and store the result in the output buffers */ ret = zio_do_crypt_uio_opencrypto(B_FALSE, NULL, crypt, cwkey, iv, enc_len, &cuio, aad_len); if (ret != 0) goto error; /* generate a fresh salt */ ret = random_get_bytes(key->zk_salt, ZIO_DATA_SALT_LEN); if (ret != 0) goto error; /* derive the current key from the master key */ ret = hkdf_sha512(key->zk_master_keydata, keydata_len, NULL, 0, key->zk_salt, ZIO_DATA_SALT_LEN, key->zk_current_keydata, keydata_len); if (ret != 0) goto error; /* initialize keys for ICP */ key->zk_current_key.ck_format = CRYPTO_KEY_RAW; key->zk_current_key.ck_data = key->zk_current_keydata; key->zk_current_key.ck_length = CRYPTO_BYTES2BITS(keydata_len); key->zk_hmac_key.ck_format = CRYPTO_KEY_RAW; key->zk_hmac_key.ck_data = key->zk_hmac_keydata; key->zk_hmac_key.ck_length = CRYPTO_BYTES2BITS(SHA512_HMAC_KEYLEN); ret = freebsd_crypt_newsession(&key->zk_session, &zio_crypt_table[crypt], &key->zk_current_key); if (ret != 0) goto error; key->zk_crypt = crypt; key->zk_version = version; key->zk_guid = guid; key->zk_salt_count = 0; return (0); error: zio_crypt_key_destroy_early(key); return (ret); } int zio_crypt_generate_iv(uint8_t *ivbuf) { int ret; /* randomly generate the IV */ ret = random_get_pseudo_bytes(ivbuf, ZIO_DATA_IV_LEN); if (ret != 0) goto error; return (0); error: bzero(ivbuf, ZIO_DATA_IV_LEN); return (ret); } int zio_crypt_do_hmac(zio_crypt_key_t *key, uint8_t *data, uint_t datalen, uint8_t *digestbuf, uint_t digestlen) { uint8_t raw_digestbuf[SHA512_DIGEST_LENGTH]; ASSERT3U(digestlen, <=, SHA512_DIGEST_LENGTH); crypto_mac(&key->zk_hmac_key, data, datalen, raw_digestbuf, SHA512_DIGEST_LENGTH); bcopy(raw_digestbuf, digestbuf, digestlen); return (0); } int zio_crypt_generate_iv_salt_dedup(zio_crypt_key_t *key, uint8_t *data, uint_t datalen, uint8_t *ivbuf, uint8_t *salt) { int ret; uint8_t digestbuf[SHA512_DIGEST_LENGTH]; ret = zio_crypt_do_hmac(key, data, datalen, digestbuf, SHA512_DIGEST_LENGTH); if (ret != 0) return (ret); bcopy(digestbuf, salt, ZIO_DATA_SALT_LEN); bcopy(digestbuf + ZIO_DATA_SALT_LEN, ivbuf, ZIO_DATA_IV_LEN); return (0); } /* * The following functions are used to encode and decode encryption parameters * into blkptr_t and zil_header_t. The ICP wants to use these parameters as * byte strings, which normally means that these strings would not need to deal * with byteswapping at all. However, both blkptr_t and zil_header_t may be * byteswapped by lower layers and so we must "undo" that byteswap here upon * decoding and encoding in a non-native byteorder. These functions require * that the byteorder bit is correct before being called. */ void zio_crypt_encode_params_bp(blkptr_t *bp, uint8_t *salt, uint8_t *iv) { uint64_t val64; uint32_t val32; ASSERT(BP_IS_ENCRYPTED(bp)); if (!BP_SHOULD_BYTESWAP(bp)) { bcopy(salt, &bp->blk_dva[2].dva_word[0], sizeof (uint64_t)); bcopy(iv, &bp->blk_dva[2].dva_word[1], sizeof (uint64_t)); bcopy(iv + sizeof (uint64_t), &val32, sizeof (uint32_t)); BP_SET_IV2(bp, val32); } else { bcopy(salt, &val64, sizeof (uint64_t)); bp->blk_dva[2].dva_word[0] = BSWAP_64(val64); bcopy(iv, &val64, sizeof (uint64_t)); bp->blk_dva[2].dva_word[1] = BSWAP_64(val64); bcopy(iv + sizeof (uint64_t), &val32, sizeof (uint32_t)); BP_SET_IV2(bp, BSWAP_32(val32)); } } void zio_crypt_decode_params_bp(const blkptr_t *bp, uint8_t *salt, uint8_t *iv) { uint64_t val64; uint32_t val32; ASSERT(BP_IS_PROTECTED(bp)); /* for convenience, so callers don't need to check */ if (BP_IS_AUTHENTICATED(bp)) { bzero(salt, ZIO_DATA_SALT_LEN); bzero(iv, ZIO_DATA_IV_LEN); return; } if (!BP_SHOULD_BYTESWAP(bp)) { bcopy(&bp->blk_dva[2].dva_word[0], salt, sizeof (uint64_t)); bcopy(&bp->blk_dva[2].dva_word[1], iv, sizeof (uint64_t)); val32 = (uint32_t)BP_GET_IV2(bp); bcopy(&val32, iv + sizeof (uint64_t), sizeof (uint32_t)); } else { val64 = BSWAP_64(bp->blk_dva[2].dva_word[0]); bcopy(&val64, salt, sizeof (uint64_t)); val64 = BSWAP_64(bp->blk_dva[2].dva_word[1]); bcopy(&val64, iv, sizeof (uint64_t)); val32 = BSWAP_32((uint32_t)BP_GET_IV2(bp)); bcopy(&val32, iv + sizeof (uint64_t), sizeof (uint32_t)); } } void zio_crypt_encode_mac_bp(blkptr_t *bp, uint8_t *mac) { uint64_t val64; ASSERT(BP_USES_CRYPT(bp)); ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_OBJSET); if (!BP_SHOULD_BYTESWAP(bp)) { bcopy(mac, &bp->blk_cksum.zc_word[2], sizeof (uint64_t)); bcopy(mac + sizeof (uint64_t), &bp->blk_cksum.zc_word[3], sizeof (uint64_t)); } else { bcopy(mac, &val64, sizeof (uint64_t)); bp->blk_cksum.zc_word[2] = BSWAP_64(val64); bcopy(mac + sizeof (uint64_t), &val64, sizeof (uint64_t)); bp->blk_cksum.zc_word[3] = BSWAP_64(val64); } } void zio_crypt_decode_mac_bp(const blkptr_t *bp, uint8_t *mac) { uint64_t val64; ASSERT(BP_USES_CRYPT(bp) || BP_IS_HOLE(bp)); /* for convenience, so callers don't need to check */ if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) { bzero(mac, ZIO_DATA_MAC_LEN); return; } if (!BP_SHOULD_BYTESWAP(bp)) { bcopy(&bp->blk_cksum.zc_word[2], mac, sizeof (uint64_t)); bcopy(&bp->blk_cksum.zc_word[3], mac + sizeof (uint64_t), sizeof (uint64_t)); } else { val64 = BSWAP_64(bp->blk_cksum.zc_word[2]); bcopy(&val64, mac, sizeof (uint64_t)); val64 = BSWAP_64(bp->blk_cksum.zc_word[3]); bcopy(&val64, mac + sizeof (uint64_t), sizeof (uint64_t)); } } void zio_crypt_encode_mac_zil(void *data, uint8_t *mac) { zil_chain_t *zilc = data; bcopy(mac, &zilc->zc_eck.zec_cksum.zc_word[2], sizeof (uint64_t)); bcopy(mac + sizeof (uint64_t), &zilc->zc_eck.zec_cksum.zc_word[3], sizeof (uint64_t)); } void zio_crypt_decode_mac_zil(const void *data, uint8_t *mac) { /* * The ZIL MAC is embedded in the block it protects, which will * not have been byteswapped by the time this function has been called. * As a result, we don't need to worry about byteswapping the MAC. */ const zil_chain_t *zilc = data; bcopy(&zilc->zc_eck.zec_cksum.zc_word[2], mac, sizeof (uint64_t)); bcopy(&zilc->zc_eck.zec_cksum.zc_word[3], mac + sizeof (uint64_t), sizeof (uint64_t)); } /* * This routine takes a block of dnodes (src_abd) and copies only the bonus * buffers to the same offsets in the dst buffer. datalen should be the size * of both the src_abd and the dst buffer (not just the length of the bonus * buffers). */ void zio_crypt_copy_dnode_bonus(abd_t *src_abd, uint8_t *dst, uint_t datalen) { uint_t i, max_dnp = datalen >> DNODE_SHIFT; uint8_t *src; dnode_phys_t *dnp, *sdnp, *ddnp; src = abd_borrow_buf_copy(src_abd, datalen); sdnp = (dnode_phys_t *)src; ddnp = (dnode_phys_t *)dst; for (i = 0; i < max_dnp; i += sdnp[i].dn_extra_slots + 1) { dnp = &sdnp[i]; if (dnp->dn_type != DMU_OT_NONE && DMU_OT_IS_ENCRYPTED(dnp->dn_bonustype) && dnp->dn_bonuslen != 0) { bcopy(DN_BONUS(dnp), DN_BONUS(&ddnp[i]), DN_MAX_BONUS_LEN(dnp)); } } abd_return_buf(src_abd, src, datalen); } /* * This function decides what fields from blk_prop are included in * the on-disk various MAC algorithms. */ static void zio_crypt_bp_zero_nonportable_blkprop(blkptr_t *bp, uint64_t version) { int avoidlint = SPA_MINBLOCKSIZE; /* * Version 0 did not properly zero out all non-portable fields * as it should have done. We maintain this code so that we can * do read-only imports of pools on this version. */ if (version == 0) { BP_SET_DEDUP(bp, 0); BP_SET_CHECKSUM(bp, 0); BP_SET_PSIZE(bp, avoidlint); return; } ASSERT3U(version, ==, ZIO_CRYPT_KEY_CURRENT_VERSION); /* * The hole_birth feature might set these fields even if this bp * is a hole. We zero them out here to guarantee that raw sends * will function with or without the feature. */ if (BP_IS_HOLE(bp)) { bp->blk_prop = 0ULL; return; } /* * At L0 we want to verify these fields to ensure that data blocks * can not be reinterpreted. For instance, we do not want an attacker * to trick us into returning raw lz4 compressed data to the user * by modifying the compression bits. At higher levels, we cannot * enforce this policy since raw sends do not convey any information * about indirect blocks, so these values might be different on the * receive side. Fortunately, this does not open any new attack * vectors, since any alterations that can be made to a higher level * bp must still verify the correct order of the layer below it. */ if (BP_GET_LEVEL(bp) != 0) { BP_SET_BYTEORDER(bp, 0); BP_SET_COMPRESS(bp, 0); /* * psize cannot be set to zero or it will trigger * asserts, but the value doesn't really matter as * long as it is constant. */ BP_SET_PSIZE(bp, avoidlint); } BP_SET_DEDUP(bp, 0); BP_SET_CHECKSUM(bp, 0); } static void zio_crypt_bp_auth_init(uint64_t version, boolean_t should_bswap, blkptr_t *bp, blkptr_auth_buf_t *bab, uint_t *bab_len) { blkptr_t tmpbp = *bp; if (should_bswap) byteswap_uint64_array(&tmpbp, sizeof (blkptr_t)); ASSERT(BP_USES_CRYPT(&tmpbp) || BP_IS_HOLE(&tmpbp)); ASSERT0(BP_IS_EMBEDDED(&tmpbp)); zio_crypt_decode_mac_bp(&tmpbp, bab->bab_mac); /* * We always MAC blk_prop in LE to ensure portability. This * must be done after decoding the mac, since the endianness * will get zero'd out here. */ zio_crypt_bp_zero_nonportable_blkprop(&tmpbp, version); bab->bab_prop = LE_64(tmpbp.blk_prop); bab->bab_pad = 0ULL; /* version 0 did not include the padding */ *bab_len = sizeof (blkptr_auth_buf_t); if (version == 0) *bab_len -= sizeof (uint64_t); } static int zio_crypt_bp_do_hmac_updates(crypto_context_t ctx, uint64_t version, boolean_t should_bswap, blkptr_t *bp) { uint_t bab_len; blkptr_auth_buf_t bab; zio_crypt_bp_auth_init(version, should_bswap, bp, &bab, &bab_len); crypto_mac_update(ctx, &bab, bab_len); return (0); } static void zio_crypt_bp_do_indrect_checksum_updates(SHA2_CTX *ctx, uint64_t version, boolean_t should_bswap, blkptr_t *bp) { uint_t bab_len; blkptr_auth_buf_t bab; zio_crypt_bp_auth_init(version, should_bswap, bp, &bab, &bab_len); SHA2Update(ctx, &bab, bab_len); } static void zio_crypt_bp_do_aad_updates(uint8_t **aadp, uint_t *aad_len, uint64_t version, boolean_t should_bswap, blkptr_t *bp) { uint_t bab_len; blkptr_auth_buf_t bab; zio_crypt_bp_auth_init(version, should_bswap, bp, &bab, &bab_len); bcopy(&bab, *aadp, bab_len); *aadp += bab_len; *aad_len += bab_len; } static int zio_crypt_do_dnode_hmac_updates(crypto_context_t ctx, uint64_t version, boolean_t should_bswap, dnode_phys_t *dnp) { int ret, i; dnode_phys_t *adnp; boolean_t le_bswap = (should_bswap == ZFS_HOST_BYTEORDER); uint8_t tmp_dncore[offsetof(dnode_phys_t, dn_blkptr)]; /* authenticate the core dnode (masking out non-portable bits) */ bcopy(dnp, tmp_dncore, sizeof (tmp_dncore)); adnp = (dnode_phys_t *)tmp_dncore; if (le_bswap) { adnp->dn_datablkszsec = BSWAP_16(adnp->dn_datablkszsec); adnp->dn_bonuslen = BSWAP_16(adnp->dn_bonuslen); adnp->dn_maxblkid = BSWAP_64(adnp->dn_maxblkid); adnp->dn_used = BSWAP_64(adnp->dn_used); } adnp->dn_flags &= DNODE_CRYPT_PORTABLE_FLAGS_MASK; adnp->dn_used = 0; crypto_mac_update(ctx, adnp, sizeof (tmp_dncore)); for (i = 0; i < dnp->dn_nblkptr; i++) { ret = zio_crypt_bp_do_hmac_updates(ctx, version, should_bswap, &dnp->dn_blkptr[i]); if (ret != 0) goto error; } if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) { ret = zio_crypt_bp_do_hmac_updates(ctx, version, should_bswap, DN_SPILL_BLKPTR(dnp)); if (ret != 0) goto error; } return (0); error: return (ret); } /* * objset_phys_t blocks introduce a number of exceptions to the normal * authentication process. objset_phys_t's contain 2 separate HMACS for * protecting the integrity of their data. The portable_mac protects the * metadnode. This MAC can be sent with a raw send and protects against * reordering of data within the metadnode. The local_mac protects the user * accounting objects which are not sent from one system to another. * * In addition, objset blocks are the only blocks that can be modified and * written to disk without the key loaded under certain circumstances. During * zil_claim() we need to be able to update the zil_header_t to complete * claiming log blocks and during raw receives we need to write out the * portable_mac from the send file. Both of these actions are possible * because these fields are not protected by either MAC so neither one will * need to modify the MACs without the key. However, when the modified blocks * are written out they will be byteswapped into the host machine's native * endianness which will modify fields protected by the MAC. As a result, MAC * calculation for objset blocks works slightly differently from other block * types. Where other block types MAC the data in whatever endianness is * written to disk, objset blocks always MAC little endian version of their * values. In the code, should_bswap is the value from BP_SHOULD_BYTESWAP() * and le_bswap indicates whether a byteswap is needed to get this block * into little endian format. */ /* ARGSUSED */ int zio_crypt_do_objset_hmacs(zio_crypt_key_t *key, void *data, uint_t datalen, boolean_t should_bswap, uint8_t *portable_mac, uint8_t *local_mac) { int ret; struct hmac_ctx hash_ctx; struct hmac_ctx *ctx = &hash_ctx; objset_phys_t *osp = data; uint64_t intval; boolean_t le_bswap = (should_bswap == ZFS_HOST_BYTEORDER); uint8_t raw_portable_mac[SHA512_DIGEST_LENGTH]; uint8_t raw_local_mac[SHA512_DIGEST_LENGTH]; /* calculate the portable MAC from the portable fields and metadnode */ crypto_mac_init(ctx, &key->zk_hmac_key); /* add in the os_type */ intval = (le_bswap) ? osp->os_type : BSWAP_64(osp->os_type); crypto_mac_update(ctx, &intval, sizeof (uint64_t)); /* add in the portable os_flags */ intval = osp->os_flags; if (should_bswap) intval = BSWAP_64(intval); intval &= OBJSET_CRYPT_PORTABLE_FLAGS_MASK; /* CONSTCOND */ if (!ZFS_HOST_BYTEORDER) intval = BSWAP_64(intval); crypto_mac_update(ctx, &intval, sizeof (uint64_t)); /* add in fields from the metadnode */ ret = zio_crypt_do_dnode_hmac_updates(ctx, key->zk_version, should_bswap, &osp->os_meta_dnode); if (ret) goto error; crypto_mac_final(ctx, raw_portable_mac, SHA512_DIGEST_LENGTH); bcopy(raw_portable_mac, portable_mac, ZIO_OBJSET_MAC_LEN); /* * The local MAC protects the user, group and project accounting. * If these objects are not present, the local MAC is zeroed out. */ if ((datalen >= OBJSET_PHYS_SIZE_V3 && osp->os_userused_dnode.dn_type == DMU_OT_NONE && osp->os_groupused_dnode.dn_type == DMU_OT_NONE && osp->os_projectused_dnode.dn_type == DMU_OT_NONE) || (datalen >= OBJSET_PHYS_SIZE_V2 && osp->os_userused_dnode.dn_type == DMU_OT_NONE && osp->os_groupused_dnode.dn_type == DMU_OT_NONE) || (datalen <= OBJSET_PHYS_SIZE_V1)) { bzero(local_mac, ZIO_OBJSET_MAC_LEN); return (0); } /* calculate the local MAC from the userused and groupused dnodes */ crypto_mac_init(ctx, &key->zk_hmac_key); /* add in the non-portable os_flags */ intval = osp->os_flags; if (should_bswap) intval = BSWAP_64(intval); intval &= ~OBJSET_CRYPT_PORTABLE_FLAGS_MASK; /* CONSTCOND */ if (!ZFS_HOST_BYTEORDER) intval = BSWAP_64(intval); crypto_mac_update(ctx, &intval, sizeof (uint64_t)); /* XXX check dnode type ... */ /* add in fields from the user accounting dnodes */ if (osp->os_userused_dnode.dn_type != DMU_OT_NONE) { ret = zio_crypt_do_dnode_hmac_updates(ctx, key->zk_version, should_bswap, &osp->os_userused_dnode); if (ret) goto error; } if (osp->os_groupused_dnode.dn_type != DMU_OT_NONE) { ret = zio_crypt_do_dnode_hmac_updates(ctx, key->zk_version, should_bswap, &osp->os_groupused_dnode); if (ret) goto error; } if (osp->os_projectused_dnode.dn_type != DMU_OT_NONE && datalen >= OBJSET_PHYS_SIZE_V3) { ret = zio_crypt_do_dnode_hmac_updates(ctx, key->zk_version, should_bswap, &osp->os_projectused_dnode); if (ret) goto error; } crypto_mac_final(ctx, raw_local_mac, SHA512_DIGEST_LENGTH); bcopy(raw_local_mac, local_mac, ZIO_OBJSET_MAC_LEN); return (0); error: bzero(portable_mac, ZIO_OBJSET_MAC_LEN); bzero(local_mac, ZIO_OBJSET_MAC_LEN); return (ret); } static void zio_crypt_destroy_uio(uio_t *uio) { if (uio->uio_iov) kmem_free(uio->uio_iov, uio->uio_iovcnt * sizeof (iovec_t)); } /* * This function parses an uncompressed indirect block and returns a checksum * of all the portable fields from all of the contained bps. The portable * fields are the MAC and all of the fields from blk_prop except for the dedup, * checksum, and psize bits. For an explanation of the purpose of this, see * the comment block on object set authentication. */ static int zio_crypt_do_indirect_mac_checksum_impl(boolean_t generate, void *buf, uint_t datalen, uint64_t version, boolean_t byteswap, uint8_t *cksum) { blkptr_t *bp; int i, epb = datalen >> SPA_BLKPTRSHIFT; SHA2_CTX ctx; uint8_t digestbuf[SHA512_DIGEST_LENGTH]; /* checksum all of the MACs from the layer below */ SHA2Init(SHA512, &ctx); for (i = 0, bp = buf; i < epb; i++, bp++) { zio_crypt_bp_do_indrect_checksum_updates(&ctx, version, byteswap, bp); } SHA2Final(digestbuf, &ctx); if (generate) { bcopy(digestbuf, cksum, ZIO_DATA_MAC_LEN); return (0); } if (bcmp(digestbuf, cksum, ZIO_DATA_MAC_LEN) != 0) { #ifdef FCRYPTO_DEBUG printf("%s(%d): Setting ECKSUM\n", __FUNCTION__, __LINE__); #endif return (SET_ERROR(ECKSUM)); } return (0); } int zio_crypt_do_indirect_mac_checksum(boolean_t generate, void *buf, uint_t datalen, boolean_t byteswap, uint8_t *cksum) { int ret; /* * Unfortunately, callers of this function will not always have * easy access to the on-disk format version. This info is * normally found in the DSL Crypto Key, but the checksum-of-MACs * is expected to be verifiable even when the key isn't loaded. * Here, instead of doing a ZAP lookup for the version for each * zio, we simply try both existing formats. */ ret = zio_crypt_do_indirect_mac_checksum_impl(generate, buf, datalen, ZIO_CRYPT_KEY_CURRENT_VERSION, byteswap, cksum); if (ret == ECKSUM) { ASSERT(!generate); ret = zio_crypt_do_indirect_mac_checksum_impl(generate, buf, datalen, 0, byteswap, cksum); } return (ret); } int zio_crypt_do_indirect_mac_checksum_abd(boolean_t generate, abd_t *abd, uint_t datalen, boolean_t byteswap, uint8_t *cksum) { int ret; void *buf; buf = abd_borrow_buf_copy(abd, datalen); ret = zio_crypt_do_indirect_mac_checksum(generate, buf, datalen, byteswap, cksum); abd_return_buf(abd, buf, datalen); return (ret); } /* * Special case handling routine for encrypting / decrypting ZIL blocks. * We do not check for the older ZIL chain because the encryption feature * was not available before the newer ZIL chain was introduced. The goal * here is to encrypt everything except the blkptr_t of a lr_write_t and * the zil_chain_t header. Everything that is not encrypted is authenticated. */ /* * The OpenCrypto used in FreeBSD does not use separate source and * destination buffers; instead, the same buffer is used. Further, to * accommodate some of the drivers, the authbuf needs to be logically before * the data. This means that we need to copy the source to the destination, * and set up an extra iovec_t at the beginning to handle the authbuf. - * It also means we'll only return one uio_t, which we do via the clumsy - * ifdef in the function declaration. + * It also means we'll only return one uio_t. */ /* ARGSUSED */ static int zio_crypt_init_uios_zil(boolean_t encrypt, uint8_t *plainbuf, uint8_t *cipherbuf, uint_t datalen, boolean_t byteswap, uio_t *puio, uio_t *out_uio, uint_t *enc_len, uint8_t **authbuf, uint_t *auth_len, boolean_t *no_crypt) { - int ret; - uint64_t txtype, lr_len; - uint_t nr_src, nr_dst, crypt_len; - uint_t aad_len = 0, nr_iovecs = 0, total_len = 0; - iovec_t *src_iovecs = NULL, *dst_iovecs = NULL; + uint8_t *aadbuf = zio_buf_alloc(datalen); uint8_t *src, *dst, *slrp, *dlrp, *blkend, *aadp; + iovec_t *dst_iovecs; zil_chain_t *zilc; lr_t *lr; - uint8_t *aadbuf = zio_buf_alloc(datalen); + uint64_t txtype, lr_len; + uint_t crypt_len, nr_iovecs, vec; + uint_t aad_len = 0, total_len = 0; - /* cipherbuf always needs an extra iovec for the MAC */ if (encrypt) { src = plainbuf; dst = cipherbuf; - nr_src = 0; - nr_dst = 1; } else { src = cipherbuf; dst = plainbuf; - nr_src = 1; - nr_dst = 0; } - - /* - * We need at least two iovecs -- one for the AAD, - * one for the MAC. - */ bcopy(src, dst, datalen); - nr_dst = 2; - /* find the start and end record of the log block */ + /* Find the start and end record of the log block. */ zilc = (zil_chain_t *)src; slrp = src + sizeof (zil_chain_t); aadp = aadbuf; blkend = src + ((byteswap) ? BSWAP_64(zilc->zc_nused) : zilc->zc_nused); - /* calculate the number of encrypted iovecs we will need */ + /* + * Calculate the number of encrypted iovecs we will need. + */ + + /* We need at least two iovecs -- one for the AAD, one for the MAC. */ + nr_iovecs = 2; + for (; slrp < blkend; slrp += lr_len) { lr = (lr_t *)slrp; - if (!byteswap) { - txtype = lr->lrc_txtype; - lr_len = lr->lrc_reclen; - } else { + if (byteswap) { txtype = BSWAP_64(lr->lrc_txtype); lr_len = BSWAP_64(lr->lrc_reclen); + } else { + txtype = lr->lrc_txtype; + lr_len = lr->lrc_reclen; } nr_iovecs++; if (txtype == TX_WRITE && lr_len != sizeof (lr_write_t)) nr_iovecs++; } - nr_src = 0; - nr_dst += nr_iovecs; + dst_iovecs = kmem_alloc(nr_iovecs * sizeof (iovec_t), KM_SLEEP); - /* allocate the iovec arrays */ - if (nr_src != 0) { - src_iovecs = kmem_alloc(nr_src * sizeof (iovec_t), KM_SLEEP); - if (src_iovecs == NULL) { - ret = SET_ERROR(ENOMEM); - goto error; - } - bzero(src_iovecs, nr_src * sizeof (iovec_t)); - } - - if (nr_dst != 0) { - dst_iovecs = kmem_alloc(nr_dst * sizeof (iovec_t), KM_SLEEP); - if (dst_iovecs == NULL) { - ret = SET_ERROR(ENOMEM); - goto error; - } - bzero(dst_iovecs, nr_dst * sizeof (iovec_t)); - } - /* * Copy the plain zil header over and authenticate everything except * the checksum that will store our MAC. If we are writing the data * the embedded checksum will not have been calculated yet, so we don't * authenticate that. */ - bcopy(src, dst, sizeof (zil_chain_t)); bcopy(src, aadp, sizeof (zil_chain_t) - sizeof (zio_eck_t)); aadp += sizeof (zil_chain_t) - sizeof (zio_eck_t); aad_len += sizeof (zil_chain_t) - sizeof (zio_eck_t); - /* loop over records again, filling in iovecs */ - /* The first one will contain the authbuf */ - nr_iovecs = 1; - slrp = src + sizeof (zil_chain_t); dlrp = dst + sizeof (zil_chain_t); + /* + * Loop over records again, filling in iovecs. + */ + + /* The first iovec will contain the authbuf. */ + vec = 1; + for (; slrp < blkend; slrp += lr_len, dlrp += lr_len) { lr = (lr_t *)slrp; if (!byteswap) { txtype = lr->lrc_txtype; lr_len = lr->lrc_reclen; } else { txtype = BSWAP_64(lr->lrc_txtype); lr_len = BSWAP_64(lr->lrc_reclen); } /* copy the common lr_t */ bcopy(slrp, dlrp, sizeof (lr_t)); bcopy(slrp, aadp, sizeof (lr_t)); aadp += sizeof (lr_t); aad_len += sizeof (lr_t); - ASSERT3P(dst_iovecs, !=, NULL); - /* * If this is a TX_WRITE record we want to encrypt everything * except the bp if exists. If the bp does exist we want to * authenticate it. */ if (txtype == TX_WRITE) { crypt_len = sizeof (lr_write_t) - sizeof (lr_t) - sizeof (blkptr_t); - dst_iovecs[nr_iovecs].iov_base = (char *)dlrp + + dst_iovecs[vec].iov_base = (char *)dlrp + sizeof (lr_t); - dst_iovecs[nr_iovecs].iov_len = crypt_len; + dst_iovecs[vec].iov_len = crypt_len; /* copy the bp now since it will not be encrypted */ bcopy(slrp + sizeof (lr_write_t) - sizeof (blkptr_t), dlrp + sizeof (lr_write_t) - sizeof (blkptr_t), sizeof (blkptr_t)); bcopy(slrp + sizeof (lr_write_t) - sizeof (blkptr_t), aadp, sizeof (blkptr_t)); aadp += sizeof (blkptr_t); aad_len += sizeof (blkptr_t); - nr_iovecs++; + vec++; total_len += crypt_len; if (lr_len != sizeof (lr_write_t)) { crypt_len = lr_len - sizeof (lr_write_t); - dst_iovecs[nr_iovecs].iov_base = (char *) + dst_iovecs[vec].iov_base = (char *) dlrp + sizeof (lr_write_t); - dst_iovecs[nr_iovecs].iov_len = crypt_len; - nr_iovecs++; + dst_iovecs[vec].iov_len = crypt_len; + vec++; total_len += crypt_len; } } else { crypt_len = lr_len - sizeof (lr_t); - dst_iovecs[nr_iovecs].iov_base = (char *)dlrp + + dst_iovecs[vec].iov_base = (char *)dlrp + sizeof (lr_t); - dst_iovecs[nr_iovecs].iov_len = crypt_len; - nr_iovecs++; + dst_iovecs[vec].iov_len = crypt_len; + vec++; total_len += crypt_len; } } - *no_crypt = (nr_iovecs == 0); - *enc_len = total_len; - *authbuf = aadbuf; - *auth_len = aad_len; + /* The last iovec will contain the MAC. */ + ASSERT3U(vec, ==, nr_iovecs - 1); + + /* AAD */ dst_iovecs[0].iov_base = aadbuf; dst_iovecs[0].iov_len = aad_len; + /* MAC */ + dst_iovecs[vec].iov_base = 0; + dst_iovecs[vec].iov_len = 0; + *no_crypt = (vec == 1); + *enc_len = total_len; + *authbuf = aadbuf; + *auth_len = aad_len; out_uio->uio_iov = dst_iovecs; - out_uio->uio_iovcnt = nr_dst; + out_uio->uio_iovcnt = nr_iovecs; return (0); - -error: - zio_buf_free(aadbuf, datalen); - if (src_iovecs != NULL) - kmem_free(src_iovecs, nr_src * sizeof (iovec_t)); - if (dst_iovecs != NULL) - kmem_free(dst_iovecs, nr_dst * sizeof (iovec_t)); - - *enc_len = 0; - *authbuf = NULL; - *auth_len = 0; - *no_crypt = B_FALSE; - puio->uio_iov = NULL; - puio->uio_iovcnt = 0; - out_uio->uio_iov = NULL; - out_uio->uio_iovcnt = 0; - - return (ret); } /* * Special case handling routine for encrypting / decrypting dnode blocks. */ static int zio_crypt_init_uios_dnode(boolean_t encrypt, uint64_t version, uint8_t *plainbuf, uint8_t *cipherbuf, uint_t datalen, boolean_t byteswap, uio_t *puio, uio_t *out_uio, uint_t *enc_len, uint8_t **authbuf, uint_t *auth_len, boolean_t *no_crypt) { - int ret; - uint_t nr_src, nr_dst, crypt_len; - uint_t aad_len = 0, nr_iovecs = 0, total_len = 0; - uint_t i, j, max_dnp = datalen >> DNODE_SHIFT; - iovec_t *src_iovecs = NULL, *dst_iovecs = NULL; + uint8_t *aadbuf = zio_buf_alloc(datalen); uint8_t *src, *dst, *aadp; dnode_phys_t *dnp, *adnp, *sdnp, *ddnp; - uint8_t *aadbuf = zio_buf_alloc(datalen); + iovec_t *dst_iovecs; + uint_t nr_iovecs, crypt_len, vec; + uint_t aad_len = 0, total_len = 0; + uint_t i, j, max_dnp = datalen >> DNODE_SHIFT; if (encrypt) { src = plainbuf; dst = cipherbuf; - nr_src = 0; - nr_dst = 1; } else { src = cipherbuf; dst = plainbuf; - nr_src = 1; - nr_dst = 0; } - bcopy(src, dst, datalen); - nr_dst = 2; sdnp = (dnode_phys_t *)src; ddnp = (dnode_phys_t *)dst; aadp = aadbuf; /* * Count the number of iovecs we will need to do the encryption by * counting the number of bonus buffers that need to be encrypted. */ + + /* We need at least two iovecs -- one for the AAD, one for the MAC. */ + nr_iovecs = 2; + for (i = 0; i < max_dnp; i += sdnp[i].dn_extra_slots + 1) { /* * This block may still be byteswapped. However, all of the * values we use are either uint8_t's (for which byteswapping * is a noop) or a * != 0 check, which will work regardless * of whether or not we byteswap. */ if (sdnp[i].dn_type != DMU_OT_NONE && DMU_OT_IS_ENCRYPTED(sdnp[i].dn_bonustype) && sdnp[i].dn_bonuslen != 0) { nr_iovecs++; } } - nr_src = 0; - nr_dst += nr_iovecs; + dst_iovecs = kmem_alloc(nr_iovecs * sizeof (iovec_t), KM_SLEEP); - if (nr_src != 0) { - src_iovecs = kmem_alloc(nr_src * sizeof (iovec_t), KM_SLEEP); - if (src_iovecs == NULL) { - ret = SET_ERROR(ENOMEM); - goto error; - } - bzero(src_iovecs, nr_src * sizeof (iovec_t)); - } - - if (nr_dst != 0) { - dst_iovecs = kmem_alloc(nr_dst * sizeof (iovec_t), KM_SLEEP); - if (dst_iovecs == NULL) { - ret = SET_ERROR(ENOMEM); - goto error; - } - bzero(dst_iovecs, nr_dst * sizeof (iovec_t)); - } - - nr_iovecs = 1; - /* * Iterate through the dnodes again, this time filling in the uios * we allocated earlier. We also concatenate any data we want to * authenticate onto aadbuf. */ + + /* The first iovec will contain the authbuf. */ + vec = 1; + for (i = 0; i < max_dnp; i += sdnp[i].dn_extra_slots + 1) { dnp = &sdnp[i]; /* copy over the core fields and blkptrs (kept as plaintext) */ bcopy(dnp, &ddnp[i], (uint8_t *)DN_BONUS(dnp) - (uint8_t *)dnp); if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) { bcopy(DN_SPILL_BLKPTR(dnp), DN_SPILL_BLKPTR(&ddnp[i]), sizeof (blkptr_t)); } /* * Handle authenticated data. We authenticate everything in * the dnode that can be brought over when we do a raw send. * This includes all of the core fields as well as the MACs * stored in the bp checksums and all of the portable bits * from blk_prop. We include the dnode padding here in case it * ever gets used in the future. Some dn_flags and dn_used are * not portable so we mask those out values out of the * authenticated data. */ crypt_len = offsetof(dnode_phys_t, dn_blkptr); bcopy(dnp, aadp, crypt_len); adnp = (dnode_phys_t *)aadp; adnp->dn_flags &= DNODE_CRYPT_PORTABLE_FLAGS_MASK; adnp->dn_used = 0; aadp += crypt_len; aad_len += crypt_len; for (j = 0; j < dnp->dn_nblkptr; j++) { zio_crypt_bp_do_aad_updates(&aadp, &aad_len, version, byteswap, &dnp->dn_blkptr[j]); } if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) { zio_crypt_bp_do_aad_updates(&aadp, &aad_len, version, byteswap, DN_SPILL_BLKPTR(dnp)); } /* * If this bonus buffer needs to be encrypted, we prepare an * iovec_t. The encryption / decryption functions will fill * this in for us with the encrypted or decrypted data. * Otherwise we add the bonus buffer to the authenticated * data buffer and copy it over to the destination. The * encrypted iovec extends to DN_MAX_BONUS_LEN(dnp) so that * we can guarantee alignment with the AES block size * (128 bits). */ crypt_len = DN_MAX_BONUS_LEN(dnp); if (dnp->dn_type != DMU_OT_NONE && DMU_OT_IS_ENCRYPTED(dnp->dn_bonustype) && dnp->dn_bonuslen != 0) { - ASSERT3U(nr_iovecs, <, nr_dst); - ASSERT3P(dst_iovecs, !=, NULL); - dst_iovecs[nr_iovecs].iov_base = DN_BONUS(&ddnp[i]); - dst_iovecs[nr_iovecs].iov_len = crypt_len; + dst_iovecs[vec].iov_base = DN_BONUS(&ddnp[i]); + dst_iovecs[vec].iov_len = crypt_len; - nr_iovecs++; + vec++; total_len += crypt_len; } else { bcopy(DN_BONUS(dnp), DN_BONUS(&ddnp[i]), crypt_len); bcopy(DN_BONUS(dnp), aadp, crypt_len); aadp += crypt_len; aad_len += crypt_len; } } - *no_crypt = (nr_iovecs == 0); - *enc_len = total_len; - *authbuf = aadbuf; - *auth_len = aad_len; + /* The last iovec will contain the MAC. */ + ASSERT3U(vec, ==, nr_iovecs - 1); + /* AAD */ dst_iovecs[0].iov_base = aadbuf; dst_iovecs[0].iov_len = aad_len; + /* MAC */ + dst_iovecs[vec].iov_base = 0; + dst_iovecs[vec].iov_len = 0; + + *no_crypt = (vec == 1); + *enc_len = total_len; + *authbuf = aadbuf; + *auth_len = aad_len; out_uio->uio_iov = dst_iovecs; - out_uio->uio_iovcnt = nr_dst; + out_uio->uio_iovcnt = nr_iovecs; return (0); - -error: - zio_buf_free(aadbuf, datalen); - if (src_iovecs != NULL) - kmem_free(src_iovecs, nr_src * sizeof (iovec_t)); - if (dst_iovecs != NULL) - kmem_free(dst_iovecs, nr_dst * sizeof (iovec_t)); - - *enc_len = 0; - *authbuf = NULL; - *auth_len = 0; - *no_crypt = B_FALSE; - out_uio->uio_iov = NULL; - out_uio->uio_iovcnt = 0; - - return (ret); } /* ARGSUSED */ static int zio_crypt_init_uios_normal(boolean_t encrypt, uint8_t *plainbuf, uint8_t *cipherbuf, uint_t datalen, uio_t *puio, uio_t *out_uio, uint_t *enc_len) { int ret; uint_t nr_plain = 1, nr_cipher = 2; iovec_t *plain_iovecs = NULL, *cipher_iovecs = NULL; void *src, *dst; cipher_iovecs = kmem_alloc(nr_cipher * sizeof (iovec_t), KM_SLEEP); if (!cipher_iovecs) { ret = SET_ERROR(ENOMEM); goto error; } bzero(cipher_iovecs, nr_cipher * sizeof (iovec_t)); if (encrypt) { src = plainbuf; dst = cipherbuf; } else { src = cipherbuf; dst = plainbuf; } bcopy(src, dst, datalen); cipher_iovecs[0].iov_base = dst; cipher_iovecs[0].iov_len = datalen; *enc_len = datalen; out_uio->uio_iov = cipher_iovecs; out_uio->uio_iovcnt = nr_cipher; return (0); error: if (plain_iovecs != NULL) kmem_free(plain_iovecs, nr_plain * sizeof (iovec_t)); if (cipher_iovecs != NULL) kmem_free(cipher_iovecs, nr_cipher * sizeof (iovec_t)); *enc_len = 0; out_uio->uio_iov = NULL; out_uio->uio_iovcnt = 0; return (ret); } /* * This function builds up the plaintext (puio) and ciphertext (cuio) uios so * that they can be used for encryption and decryption by zio_do_crypt_uio(). * Most blocks will use zio_crypt_init_uios_normal(), with ZIL and dnode blocks * requiring special handling to parse out pieces that are to be encrypted. The * authbuf is used by these special cases to store additional authenticated * data (AAD) for the encryption modes. */ static int zio_crypt_init_uios(boolean_t encrypt, uint64_t version, dmu_object_type_t ot, uint8_t *plainbuf, uint8_t *cipherbuf, uint_t datalen, boolean_t byteswap, uint8_t *mac, uio_t *puio, uio_t *cuio, uint_t *enc_len, uint8_t **authbuf, uint_t *auth_len, boolean_t *no_crypt) { int ret; iovec_t *mac_iov; ASSERT(DMU_OT_IS_ENCRYPTED(ot) || ot == DMU_OT_NONE); /* route to handler */ switch (ot) { case DMU_OT_INTENT_LOG: ret = zio_crypt_init_uios_zil(encrypt, plainbuf, cipherbuf, datalen, byteswap, puio, cuio, enc_len, authbuf, auth_len, no_crypt); break; case DMU_OT_DNODE: ret = zio_crypt_init_uios_dnode(encrypt, version, plainbuf, cipherbuf, datalen, byteswap, puio, cuio, enc_len, authbuf, auth_len, no_crypt); break; default: ret = zio_crypt_init_uios_normal(encrypt, plainbuf, cipherbuf, datalen, puio, cuio, enc_len); *authbuf = NULL; *auth_len = 0; *no_crypt = B_FALSE; break; } if (ret != 0) goto error; /* populate the uios */ cuio->uio_segflg = UIO_SYSSPACE; mac_iov = ((iovec_t *)&cuio->uio_iov[cuio->uio_iovcnt - 1]); mac_iov->iov_base = (void *)mac; mac_iov->iov_len = ZIO_DATA_MAC_LEN; return (0); error: return (ret); } void *failed_decrypt_buf; int faile_decrypt_size; /* * Primary encryption / decryption entrypoint for zio data. */ int zio_do_crypt_data(boolean_t encrypt, zio_crypt_key_t *key, dmu_object_type_t ot, boolean_t byteswap, uint8_t *salt, uint8_t *iv, uint8_t *mac, uint_t datalen, uint8_t *plainbuf, uint8_t *cipherbuf, boolean_t *no_crypt) { int ret; boolean_t locked = B_FALSE; uint64_t crypt = key->zk_crypt; uint_t keydata_len = zio_crypt_table[crypt].ci_keylen; uint_t enc_len, auth_len; uio_t puio, cuio; uint8_t enc_keydata[MASTER_KEY_MAX_LEN]; crypto_key_t tmp_ckey, *ckey = NULL; freebsd_crypt_session_t *tmpl = NULL; uint8_t *authbuf = NULL; bzero(&puio, sizeof (uio_t)); bzero(&cuio, sizeof (uio_t)); #ifdef FCRYPTO_DEBUG printf("%s(%s, %p, %p, %d, %p, %p, %u, %s, %p, %p, %p)\n", __FUNCTION__, encrypt ? "encrypt" : "decrypt", key, salt, ot, iv, mac, datalen, byteswap ? "byteswap" : "native_endian", plainbuf, cipherbuf, no_crypt); printf("\tkey = {"); for (int i = 0; i < key->zk_current_key.ck_length/8; i++) printf("%02x ", ((uint8_t *)key->zk_current_key.ck_data)[i]); printf("}\n"); #endif /* create uios for encryption */ ret = zio_crypt_init_uios(encrypt, key->zk_version, ot, plainbuf, cipherbuf, datalen, byteswap, mac, &puio, &cuio, &enc_len, &authbuf, &auth_len, no_crypt); if (ret != 0) return (ret); /* * If the needed key is the current one, just use it. Otherwise we * need to generate a temporary one from the given salt + master key. * If we are encrypting, we must return a copy of the current salt * so that it can be stored in the blkptr_t. */ rw_enter(&key->zk_salt_lock, RW_READER); locked = B_TRUE; if (bcmp(salt, key->zk_salt, ZIO_DATA_SALT_LEN) == 0) { ckey = &key->zk_current_key; tmpl = &key->zk_session; } else { rw_exit(&key->zk_salt_lock); locked = B_FALSE; ret = hkdf_sha512(key->zk_master_keydata, keydata_len, NULL, 0, salt, ZIO_DATA_SALT_LEN, enc_keydata, keydata_len); if (ret != 0) goto error; tmp_ckey.ck_format = CRYPTO_KEY_RAW; tmp_ckey.ck_data = enc_keydata; tmp_ckey.ck_length = CRYPTO_BYTES2BITS(keydata_len); ckey = &tmp_ckey; tmpl = NULL; } /* perform the encryption / decryption */ ret = zio_do_crypt_uio_opencrypto(encrypt, tmpl, key->zk_crypt, ckey, iv, enc_len, &cuio, auth_len); if (ret != 0) goto error; if (locked) { rw_exit(&key->zk_salt_lock); locked = B_FALSE; } if (authbuf != NULL) zio_buf_free(authbuf, datalen); if (ckey == &tmp_ckey) bzero(enc_keydata, keydata_len); zio_crypt_destroy_uio(&puio); zio_crypt_destroy_uio(&cuio); return (0); error: if (!encrypt) { if (failed_decrypt_buf != NULL) kmem_free(failed_decrypt_buf, failed_decrypt_size); failed_decrypt_buf = kmem_alloc(datalen, KM_SLEEP); failed_decrypt_size = datalen; bcopy(cipherbuf, failed_decrypt_buf, datalen); } if (locked) rw_exit(&key->zk_salt_lock); if (authbuf != NULL) zio_buf_free(authbuf, datalen); if (ckey == &tmp_ckey) bzero(enc_keydata, keydata_len); zio_crypt_destroy_uio(&puio); zio_crypt_destroy_uio(&cuio); return (SET_ERROR(ret)); } /* * Simple wrapper around zio_do_crypt_data() to work with abd's instead of * linear buffers. */ int zio_do_crypt_abd(boolean_t encrypt, zio_crypt_key_t *key, dmu_object_type_t ot, boolean_t byteswap, uint8_t *salt, uint8_t *iv, uint8_t *mac, uint_t datalen, abd_t *pabd, abd_t *cabd, boolean_t *no_crypt) { int ret; void *ptmp, *ctmp; if (encrypt) { ptmp = abd_borrow_buf_copy(pabd, datalen); ctmp = abd_borrow_buf(cabd, datalen); } else { ptmp = abd_borrow_buf(pabd, datalen); ctmp = abd_borrow_buf_copy(cabd, datalen); } ret = zio_do_crypt_data(encrypt, key, ot, byteswap, salt, iv, mac, datalen, ptmp, ctmp, no_crypt); if (ret != 0) goto error; if (encrypt) { abd_return_buf(pabd, ptmp, datalen); abd_return_buf_copy(cabd, ctmp, datalen); } else { abd_return_buf_copy(pabd, ptmp, datalen); abd_return_buf(cabd, ctmp, datalen); } return (0); error: if (encrypt) { abd_return_buf(pabd, ptmp, datalen); abd_return_buf_copy(cabd, ctmp, datalen); } else { abd_return_buf_copy(pabd, ptmp, datalen); abd_return_buf(cabd, ctmp, datalen); } return (SET_ERROR(ret)); } #if defined(_KERNEL) && defined(HAVE_SPL) /* BEGIN CSTYLED */ module_param(zfs_key_max_salt_uses, ulong, 0644); MODULE_PARM_DESC(zfs_key_max_salt_uses, "Max number of times a salt value " "can be used for generating encryption keys before it is rotated"); /* END CSTYLED */ #endif Index: vendor-sys/openzfs/dist/module/os/linux/spl/spl-procfs-list.c =================================================================== --- vendor-sys/openzfs/dist/module/os/linux/spl/spl-procfs-list.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/linux/spl/spl-procfs-list.c (revision 366348) @@ -1,264 +1,284 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2018 by Delphix. All rights reserved. */ #include #include #include #include /* * A procfs_list is a wrapper around a linked list which implements the seq_file * interface, allowing the contents of the list to be exposed through procfs. * The kernel already has some utilities to help implement the seq_file * interface for linked lists (seq_list_*), but they aren't appropriate for use * with lists that have many entries, because seq_list_start walks the list at * the start of each read syscall to find where it left off, so reading a file * ends up being quadratic in the number of entries in the list. * * This implementation avoids this penalty by maintaining a separate cursor into * the list per instance of the file that is open. It also maintains some extra * information in each node of the list to prevent reads of entries that have * been dropped from the list. * * Callers should only add elements to the list using procfs_list_add, which * adds an element to the tail of the list. Other operations can be performed * directly on the wrapped list using the normal list manipulation functions, * but elements should only be removed from the head of the list. */ #define NODE_ID(procfs_list, obj) \ (((procfs_list_node_t *)(((char *)obj) + \ (procfs_list)->pl_node_offset))->pln_id) typedef struct procfs_list_cursor { procfs_list_t *procfs_list; /* List into which this cursor points */ void *cached_node; /* Most recently accessed node */ loff_t cached_pos; /* Position of cached_node */ } procfs_list_cursor_t; static int procfs_list_seq_show(struct seq_file *f, void *p) { procfs_list_cursor_t *cursor = f->private; procfs_list_t *procfs_list = cursor->procfs_list; ASSERT(MUTEX_HELD(&procfs_list->pl_lock)); if (p == SEQ_START_TOKEN) { if (procfs_list->pl_show_header != NULL) return (procfs_list->pl_show_header(f)); else return (0); } return (procfs_list->pl_show(f, p)); } static void * procfs_list_next_node(procfs_list_cursor_t *cursor, loff_t *pos) { void *next_node; procfs_list_t *procfs_list = cursor->procfs_list; if (cursor->cached_node == SEQ_START_TOKEN) next_node = list_head(&procfs_list->pl_list); else next_node = list_next(&procfs_list->pl_list, cursor->cached_node); if (next_node != NULL) { cursor->cached_node = next_node; cursor->cached_pos = NODE_ID(procfs_list, cursor->cached_node); *pos = cursor->cached_pos; + } else { + /* + * seq_read() expects ->next() to update the position even + * when there are no more entries. Advance the position to + * prevent a warning from being logged. + */ + cursor->cached_node = NULL; + cursor->cached_pos++; + *pos = cursor->cached_pos; } + return (next_node); } static void * procfs_list_seq_start(struct seq_file *f, loff_t *pos) { procfs_list_cursor_t *cursor = f->private; procfs_list_t *procfs_list = cursor->procfs_list; mutex_enter(&procfs_list->pl_lock); if (*pos == 0) { cursor->cached_node = SEQ_START_TOKEN; cursor->cached_pos = 0; return (SEQ_START_TOKEN); + } else if (cursor->cached_node == NULL) { + return (NULL); } /* * Check if our cached pointer has become stale, which happens if the * the message where we left off has been dropped from the list since * the last read syscall completed. */ void *oldest_node = list_head(&procfs_list->pl_list); if (cursor->cached_node != SEQ_START_TOKEN && (oldest_node == NULL || NODE_ID(procfs_list, oldest_node) > cursor->cached_pos)) return (ERR_PTR(-EIO)); /* * If it isn't starting from the beginning of the file, the seq_file * code will either pick up at the same position it visited last or the * following one. */ if (*pos == cursor->cached_pos) { return (cursor->cached_node); } else { ASSERT3U(*pos, ==, cursor->cached_pos + 1); return (procfs_list_next_node(cursor, pos)); } } static void * procfs_list_seq_next(struct seq_file *f, void *p, loff_t *pos) { procfs_list_cursor_t *cursor = f->private; ASSERT(MUTEX_HELD(&cursor->procfs_list->pl_lock)); return (procfs_list_next_node(cursor, pos)); } static void procfs_list_seq_stop(struct seq_file *f, void *p) { procfs_list_cursor_t *cursor = f->private; procfs_list_t *procfs_list = cursor->procfs_list; mutex_exit(&procfs_list->pl_lock); } static struct seq_operations procfs_list_seq_ops = { .show = procfs_list_seq_show, .start = procfs_list_seq_start, .next = procfs_list_seq_next, .stop = procfs_list_seq_stop, }; static int procfs_list_open(struct inode *inode, struct file *filp) { int rc = seq_open_private(filp, &procfs_list_seq_ops, sizeof (procfs_list_cursor_t)); if (rc != 0) return (rc); struct seq_file *f = filp->private_data; procfs_list_cursor_t *cursor = f->private; cursor->procfs_list = PDE_DATA(inode); cursor->cached_node = NULL; cursor->cached_pos = 0; return (0); } static ssize_t procfs_list_write(struct file *filp, const char __user *buf, size_t len, loff_t *ppos) { struct seq_file *f = filp->private_data; procfs_list_cursor_t *cursor = f->private; procfs_list_t *procfs_list = cursor->procfs_list; int rc; if (procfs_list->pl_clear != NULL && (rc = procfs_list->pl_clear(procfs_list)) != 0) return (-rc); return (len); } static const kstat_proc_op_t procfs_list_operations = { #ifdef HAVE_PROC_OPS_STRUCT .proc_open = procfs_list_open, .proc_write = procfs_list_write, .proc_read = seq_read, .proc_lseek = seq_lseek, .proc_release = seq_release_private, #else .open = procfs_list_open, .write = procfs_list_write, .read = seq_read, .llseek = seq_lseek, .release = seq_release_private, #endif }; /* * Initialize a procfs_list and create a file for it in the proc filesystem * under the kstat namespace. */ void procfs_list_install(const char *module, + const char *submodule, const char *name, mode_t mode, procfs_list_t *procfs_list, int (*show)(struct seq_file *f, void *p), int (*show_header)(struct seq_file *f), int (*clear)(procfs_list_t *procfs_list), size_t procfs_list_node_off) { + char *modulestr; + + if (submodule != NULL) + modulestr = kmem_asprintf("%s/%s", module, submodule); + else + modulestr = kmem_asprintf("%s", module); mutex_init(&procfs_list->pl_lock, NULL, MUTEX_DEFAULT, NULL); list_create(&procfs_list->pl_list, procfs_list_node_off + sizeof (procfs_list_node_t), procfs_list_node_off + offsetof(procfs_list_node_t, pln_link)); procfs_list->pl_next_id = 1; /* Save id 0 for SEQ_START_TOKEN */ procfs_list->pl_show = show; procfs_list->pl_show_header = show_header; procfs_list->pl_clear = clear; procfs_list->pl_node_offset = procfs_list_node_off; - kstat_proc_entry_init(&procfs_list->pl_kstat_entry, module, name); + kstat_proc_entry_init(&procfs_list->pl_kstat_entry, modulestr, name); kstat_proc_entry_install(&procfs_list->pl_kstat_entry, mode, &procfs_list_operations, procfs_list); + kmem_strfree(modulestr); } EXPORT_SYMBOL(procfs_list_install); /* Remove the proc filesystem file corresponding to the given list */ void procfs_list_uninstall(procfs_list_t *procfs_list) { kstat_proc_entry_delete(&procfs_list->pl_kstat_entry); } EXPORT_SYMBOL(procfs_list_uninstall); void procfs_list_destroy(procfs_list_t *procfs_list) { ASSERT(list_is_empty(&procfs_list->pl_list)); list_destroy(&procfs_list->pl_list); mutex_destroy(&procfs_list->pl_lock); } EXPORT_SYMBOL(procfs_list_destroy); /* * Add a new node to the tail of the list. While the standard list manipulation * functions can be use for all other operation, adding elements to the list * should only be done using this helper so that the id of the new node is set * correctly. */ void procfs_list_add(procfs_list_t *procfs_list, void *p) { ASSERT(MUTEX_HELD(&procfs_list->pl_lock)); NODE_ID(procfs_list, p) = procfs_list->pl_next_id++; list_insert_tail(&procfs_list->pl_list, p); } EXPORT_SYMBOL(procfs_list_add); Index: vendor-sys/openzfs/dist/module/os/linux/zfs/vdev_disk.c =================================================================== --- vendor-sys/openzfs/dist/module/os/linux/zfs/vdev_disk.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/linux/zfs/vdev_disk.c (revision 366348) @@ -1,891 +1,901 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (C) 2008-2010 Lawrence Livermore National Security, LLC. * Produced at Lawrence Livermore National Laboratory (cf, DISCLAIMER). * Rewritten for Linux by Brian Behlendorf . * LLNL-CODE-403049. * Copyright (c) 2012, 2019 by Delphix. All rights reserved. */ #include #include #include #include #include #include #include #include #include #include #include typedef struct vdev_disk { struct block_device *vd_bdev; krwlock_t vd_lock; } vdev_disk_t; /* * Unique identifier for the exclusive vdev holder. */ static void *zfs_vdev_holder = VDEV_HOLDER; /* * Wait up to zfs_vdev_open_timeout_ms milliseconds before determining the * device is missing. The missing path may be transient since the links * can be briefly removed and recreated in response to udev events. */ static unsigned zfs_vdev_open_timeout_ms = 1000; /* * Size of the "reserved" partition, in blocks. */ #define EFI_MIN_RESV_SIZE (16 * 1024) /* * Virtual device vector for disks. */ typedef struct dio_request { zio_t *dr_zio; /* Parent ZIO */ atomic_t dr_ref; /* References */ int dr_error; /* Bio error */ int dr_bio_count; /* Count of bio's */ struct bio *dr_bio[0]; /* Attached bio's */ } dio_request_t; static fmode_t vdev_bdev_mode(spa_mode_t spa_mode) { fmode_t mode = 0; if (spa_mode & SPA_MODE_READ) mode |= FMODE_READ; if (spa_mode & SPA_MODE_WRITE) mode |= FMODE_WRITE; return (mode); } /* * Returns the usable capacity (in bytes) for the partition or disk. */ static uint64_t bdev_capacity(struct block_device *bdev) { return (i_size_read(bdev->bd_inode)); } /* * Returns the maximum expansion capacity of the block device (in bytes). * * It is possible to expand a vdev when it has been created as a wholedisk * and the containing block device has increased in capacity. Or when the * partition containing the pool has been manually increased in size. * * This function is only responsible for calculating the potential expansion * size so it can be reported by 'zpool list'. The efi_use_whole_disk() is * responsible for verifying the expected partition layout in the wholedisk * case, and updating the partition table if appropriate. Once the partition * size has been increased the additional capacity will be visible using * bdev_capacity(). * * The returned maximum expansion capacity is always expected to be larger, or * at the very least equal, to its usable capacity to prevent overestimating * the pool expandsize. */ static uint64_t bdev_max_capacity(struct block_device *bdev, uint64_t wholedisk) { uint64_t psize; int64_t available; if (wholedisk && bdev->bd_part != NULL && bdev != bdev->bd_contains) { /* * When reporting maximum expansion capacity for a wholedisk * deduct any capacity which is expected to be lost due to * alignment restrictions. Over reporting this value isn't * harmful and would only result in slightly less capacity * than expected post expansion. * The estimated available space may be slightly smaller than * bdev_capacity() for devices where the number of sectors is * not a multiple of the alignment size and the partition layout * is keeping less than PARTITION_END_ALIGNMENT bytes after the * "reserved" EFI partition: in such cases return the device * usable capacity. */ available = i_size_read(bdev->bd_contains->bd_inode) - ((EFI_MIN_RESV_SIZE + NEW_START_BLOCK + PARTITION_END_ALIGNMENT) << SECTOR_BITS); psize = MAX(available, bdev_capacity(bdev)); } else { psize = bdev_capacity(bdev); } return (psize); } static void vdev_disk_error(zio_t *zio) { /* * This function can be called in interrupt context, for instance while * handling IRQs coming from a misbehaving disk device; use printk() * which is safe from any context. */ printk(KERN_WARNING "zio pool=%s vdev=%s error=%d type=%d " "offset=%llu size=%llu flags=%x\n", spa_name(zio->io_spa), zio->io_vd->vdev_path, zio->io_error, zio->io_type, (u_longlong_t)zio->io_offset, (u_longlong_t)zio->io_size, zio->io_flags); } static int vdev_disk_open(vdev_t *v, uint64_t *psize, uint64_t *max_psize, uint64_t *logical_ashift, uint64_t *physical_ashift) { struct block_device *bdev; fmode_t mode = vdev_bdev_mode(spa_mode(v->vdev_spa)); hrtime_t timeout = MSEC2NSEC(zfs_vdev_open_timeout_ms); vdev_disk_t *vd; /* Must have a pathname and it must be absolute. */ if (v->vdev_path == NULL || v->vdev_path[0] != '/') { v->vdev_stat.vs_aux = VDEV_AUX_BAD_LABEL; vdev_dbgmsg(v, "invalid vdev_path"); return (SET_ERROR(EINVAL)); } /* * Reopen the device if it is currently open. When expanding a * partition force re-scanning the partition table if userland * did not take care of this already. We need to do this while closed * in order to get an accurate updated block device size. Then * since udev may need to recreate the device links increase the * open retry timeout before reporting the device as unavailable. */ vd = v->vdev_tsd; if (vd) { char disk_name[BDEVNAME_SIZE + 6] = "/dev/"; boolean_t reread_part = B_FALSE; rw_enter(&vd->vd_lock, RW_WRITER); bdev = vd->vd_bdev; vd->vd_bdev = NULL; if (bdev) { if (v->vdev_expanding && bdev != bdev->bd_contains) { bdevname(bdev->bd_contains, disk_name + 5); /* * If userland has BLKPG_RESIZE_PARTITION, * then it should have updated the partition * table already. We can detect this by * comparing our current physical size * with that of the device. If they are * the same, then we must not have * BLKPG_RESIZE_PARTITION or it failed to * update the partition table online. We * fallback to rescanning the partition * table from the kernel below. However, * if the capacity already reflects the * updated partition, then we skip * rescanning the partition table here. */ if (v->vdev_psize == bdev_capacity(bdev)) reread_part = B_TRUE; } blkdev_put(bdev, mode | FMODE_EXCL); } if (reread_part) { bdev = blkdev_get_by_path(disk_name, mode | FMODE_EXCL, zfs_vdev_holder); if (!IS_ERR(bdev)) { int error = vdev_bdev_reread_part(bdev); blkdev_put(bdev, mode | FMODE_EXCL); if (error == 0) { timeout = MSEC2NSEC( zfs_vdev_open_timeout_ms * 2); } } } } else { vd = kmem_zalloc(sizeof (vdev_disk_t), KM_SLEEP); rw_init(&vd->vd_lock, NULL, RW_DEFAULT, NULL); rw_enter(&vd->vd_lock, RW_WRITER); } /* * Devices are always opened by the path provided at configuration * time. This means that if the provided path is a udev by-id path * then drives may be re-cabled without an issue. If the provided * path is a udev by-path path, then the physical location information * will be preserved. This can be critical for more complicated * configurations where drives are located in specific physical * locations to maximize the systems tolerance to component failure. * * Alternatively, you can provide your own udev rule to flexibly map * the drives as you see fit. It is not advised that you use the * /dev/[hd]d devices which may be reordered due to probing order. * Devices in the wrong locations will be detected by the higher * level vdev validation. * * The specified paths may be briefly removed and recreated in * response to udev events. This should be exceptionally unlikely * because the zpool command makes every effort to verify these paths * have already settled prior to reaching this point. Therefore, * a ENOENT failure at this point is highly likely to be transient * and it is reasonable to sleep and retry before giving up. In * practice delays have been observed to be on the order of 100ms. */ hrtime_t start = gethrtime(); bdev = ERR_PTR(-ENXIO); while (IS_ERR(bdev) && ((gethrtime() - start) < timeout)) { bdev = blkdev_get_by_path(v->vdev_path, mode | FMODE_EXCL, zfs_vdev_holder); if (unlikely(PTR_ERR(bdev) == -ENOENT)) { schedule_timeout(MSEC_TO_TICK(10)); } else if (IS_ERR(bdev)) { break; } } if (IS_ERR(bdev)) { int error = -PTR_ERR(bdev); vdev_dbgmsg(v, "open error=%d timeout=%llu/%llu", error, (u_longlong_t)(gethrtime() - start), (u_longlong_t)timeout); vd->vd_bdev = NULL; v->vdev_tsd = vd; rw_exit(&vd->vd_lock); return (SET_ERROR(error)); } else { vd->vd_bdev = bdev; v->vdev_tsd = vd; rw_exit(&vd->vd_lock); } struct request_queue *q = bdev_get_queue(vd->vd_bdev); /* Determine the physical block size */ int physical_block_size = bdev_physical_block_size(vd->vd_bdev); /* Determine the logical block size */ int logical_block_size = bdev_logical_block_size(vd->vd_bdev); /* Clear the nowritecache bit, causes vdev_reopen() to try again. */ v->vdev_nowritecache = B_FALSE; /* Set when device reports it supports TRIM. */ v->vdev_has_trim = !!blk_queue_discard(q); /* Set when device reports it supports secure TRIM. */ v->vdev_has_securetrim = !!blk_queue_discard_secure(q); /* Inform the ZIO pipeline that we are non-rotational */ v->vdev_nonrot = blk_queue_nonrot(q); /* Physical volume size in bytes for the partition */ *psize = bdev_capacity(vd->vd_bdev); /* Physical volume size in bytes including possible expansion space */ *max_psize = bdev_max_capacity(vd->vd_bdev, v->vdev_wholedisk); /* Based on the minimum sector size set the block size */ *physical_ashift = highbit64(MAX(physical_block_size, SPA_MINBLOCKSIZE)) - 1; *logical_ashift = highbit64(MAX(logical_block_size, SPA_MINBLOCKSIZE)) - 1; return (0); } static void vdev_disk_close(vdev_t *v) { vdev_disk_t *vd = v->vdev_tsd; if (v->vdev_reopening || vd == NULL) return; if (vd->vd_bdev != NULL) { blkdev_put(vd->vd_bdev, vdev_bdev_mode(spa_mode(v->vdev_spa)) | FMODE_EXCL); } rw_destroy(&vd->vd_lock); kmem_free(vd, sizeof (vdev_disk_t)); v->vdev_tsd = NULL; } static dio_request_t * vdev_disk_dio_alloc(int bio_count) { dio_request_t *dr; int i; dr = kmem_zalloc(sizeof (dio_request_t) + sizeof (struct bio *) * bio_count, KM_SLEEP); if (dr) { atomic_set(&dr->dr_ref, 0); dr->dr_bio_count = bio_count; dr->dr_error = 0; for (i = 0; i < dr->dr_bio_count; i++) dr->dr_bio[i] = NULL; } return (dr); } static void vdev_disk_dio_free(dio_request_t *dr) { int i; for (i = 0; i < dr->dr_bio_count; i++) if (dr->dr_bio[i]) bio_put(dr->dr_bio[i]); kmem_free(dr, sizeof (dio_request_t) + sizeof (struct bio *) * dr->dr_bio_count); } static void vdev_disk_dio_get(dio_request_t *dr) { atomic_inc(&dr->dr_ref); } static int vdev_disk_dio_put(dio_request_t *dr) { int rc = atomic_dec_return(&dr->dr_ref); /* * Free the dio_request when the last reference is dropped and * ensure zio_interpret is called only once with the correct zio */ if (rc == 0) { zio_t *zio = dr->dr_zio; int error = dr->dr_error; vdev_disk_dio_free(dr); if (zio) { zio->io_error = error; ASSERT3S(zio->io_error, >=, 0); if (zio->io_error) vdev_disk_error(zio); zio_delay_interrupt(zio); } } return (rc); } BIO_END_IO_PROTO(vdev_disk_physio_completion, bio, error) { dio_request_t *dr = bio->bi_private; int rc; if (dr->dr_error == 0) { #ifdef HAVE_1ARG_BIO_END_IO_T dr->dr_error = BIO_END_IO_ERROR(bio); #else if (error) dr->dr_error = -(error); else if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) dr->dr_error = EIO; #endif } /* Drop reference acquired by __vdev_disk_physio */ rc = vdev_disk_dio_put(dr); } static inline void vdev_submit_bio_impl(struct bio *bio) { #ifdef HAVE_1ARG_SUBMIT_BIO submit_bio(bio); #else submit_bio(0, bio); #endif } +/* + * preempt_schedule_notrace is GPL-only which breaks the ZFS build, so + * replace it with preempt_schedule under the following condition: + */ +#if defined(CONFIG_ARM64) && \ + defined(CONFIG_PREEMPTION) && \ + defined(CONFIG_BLK_CGROUP) +#define preempt_schedule_notrace(x) preempt_schedule(x) +#endif + #ifdef HAVE_BIO_SET_DEV #if defined(CONFIG_BLK_CGROUP) && defined(HAVE_BIO_SET_DEV_GPL_ONLY) /* * The Linux 5.5 kernel updated percpu_ref_tryget() which is inlined by * blkg_tryget() to use rcu_read_lock() instead of rcu_read_lock_sched(). * As a side effect the function was converted to GPL-only. Define our * own version when needed which uses rcu_read_lock_sched(). */ #if defined(HAVE_BLKG_TRYGET_GPL_ONLY) static inline bool vdev_blkg_tryget(struct blkcg_gq *blkg) { struct percpu_ref *ref = &blkg->refcnt; unsigned long __percpu *count; bool rc; rcu_read_lock_sched(); if (__ref_is_percpu(ref, &count)) { this_cpu_inc(*count); rc = true; } else { rc = atomic_long_inc_not_zero(&ref->count); } rcu_read_unlock_sched(); return (rc); } #elif defined(HAVE_BLKG_TRYGET) #define vdev_blkg_tryget(bg) blkg_tryget(bg) #endif /* * The Linux 5.0 kernel updated the bio_set_dev() macro so it calls the * GPL-only bio_associate_blkg() symbol thus inadvertently converting * the entire macro. Provide a minimal version which always assigns the * request queue's root_blkg to the bio. */ static inline void vdev_bio_associate_blkg(struct bio *bio) { struct request_queue *q = bio->bi_disk->queue; ASSERT3P(q, !=, NULL); ASSERT3P(bio->bi_blkg, ==, NULL); if (q->root_blkg && vdev_blkg_tryget(q->root_blkg)) bio->bi_blkg = q->root_blkg; } #define bio_associate_blkg vdev_bio_associate_blkg #endif #else /* * Provide a bio_set_dev() helper macro for pre-Linux 4.14 kernels. */ static inline void bio_set_dev(struct bio *bio, struct block_device *bdev) { bio->bi_bdev = bdev; } #endif /* HAVE_BIO_SET_DEV */ static inline void vdev_submit_bio(struct bio *bio) { struct bio_list *bio_list = current->bio_list; current->bio_list = NULL; vdev_submit_bio_impl(bio); current->bio_list = bio_list; } static int __vdev_disk_physio(struct block_device *bdev, zio_t *zio, size_t io_size, uint64_t io_offset, int rw, int flags) { dio_request_t *dr; uint64_t abd_offset; uint64_t bio_offset; int bio_size, bio_count = 16; int i = 0, error = 0; struct blk_plug plug; /* * Accessing outside the block device is never allowed. */ if (io_offset + io_size > bdev->bd_inode->i_size) { vdev_dbgmsg(zio->io_vd, "Illegal access %llu size %llu, device size %llu", io_offset, io_size, i_size_read(bdev->bd_inode)); return (SET_ERROR(EIO)); } retry: dr = vdev_disk_dio_alloc(bio_count); if (dr == NULL) return (SET_ERROR(ENOMEM)); if (zio && !(zio->io_flags & (ZIO_FLAG_IO_RETRY | ZIO_FLAG_TRYHARD))) bio_set_flags_failfast(bdev, &flags); dr->dr_zio = zio; /* * When the IO size exceeds the maximum bio size for the request * queue we are forced to break the IO in multiple bio's and wait * for them all to complete. Ideally, all pool users will set * their volume block size to match the maximum request size and * the common case will be one bio per vdev IO request. */ abd_offset = 0; bio_offset = io_offset; bio_size = io_size; for (i = 0; i <= dr->dr_bio_count; i++) { /* Finished constructing bio's for given buffer */ if (bio_size <= 0) break; /* * By default only 'bio_count' bio's per dio are allowed. * However, if we find ourselves in a situation where more * are needed we allocate a larger dio and warn the user. */ if (dr->dr_bio_count == i) { vdev_disk_dio_free(dr); bio_count *= 2; goto retry; } /* bio_alloc() with __GFP_WAIT never returns NULL */ dr->dr_bio[i] = bio_alloc(GFP_NOIO, MIN(abd_nr_pages_off(zio->io_abd, bio_size, abd_offset), BIO_MAX_PAGES)); if (unlikely(dr->dr_bio[i] == NULL)) { vdev_disk_dio_free(dr); return (SET_ERROR(ENOMEM)); } /* Matching put called by vdev_disk_physio_completion */ vdev_disk_dio_get(dr); bio_set_dev(dr->dr_bio[i], bdev); BIO_BI_SECTOR(dr->dr_bio[i]) = bio_offset >> 9; dr->dr_bio[i]->bi_end_io = vdev_disk_physio_completion; dr->dr_bio[i]->bi_private = dr; bio_set_op_attrs(dr->dr_bio[i], rw, flags); /* Remaining size is returned to become the new size */ bio_size = abd_bio_map_off(dr->dr_bio[i], zio->io_abd, bio_size, abd_offset); /* Advance in buffer and construct another bio if needed */ abd_offset += BIO_BI_SIZE(dr->dr_bio[i]); bio_offset += BIO_BI_SIZE(dr->dr_bio[i]); } /* Extra reference to protect dio_request during vdev_submit_bio */ vdev_disk_dio_get(dr); if (dr->dr_bio_count > 1) blk_start_plug(&plug); /* Submit all bio's associated with this dio */ for (i = 0; i < dr->dr_bio_count; i++) if (dr->dr_bio[i]) vdev_submit_bio(dr->dr_bio[i]); if (dr->dr_bio_count > 1) blk_finish_plug(&plug); (void) vdev_disk_dio_put(dr); return (error); } BIO_END_IO_PROTO(vdev_disk_io_flush_completion, bio, error) { zio_t *zio = bio->bi_private; #ifdef HAVE_1ARG_BIO_END_IO_T zio->io_error = BIO_END_IO_ERROR(bio); #else zio->io_error = -error; #endif if (zio->io_error && (zio->io_error == EOPNOTSUPP)) zio->io_vd->vdev_nowritecache = B_TRUE; bio_put(bio); ASSERT3S(zio->io_error, >=, 0); if (zio->io_error) vdev_disk_error(zio); zio_interrupt(zio); } static int vdev_disk_io_flush(struct block_device *bdev, zio_t *zio) { struct request_queue *q; struct bio *bio; q = bdev_get_queue(bdev); if (!q) return (SET_ERROR(ENXIO)); bio = bio_alloc(GFP_NOIO, 0); /* bio_alloc() with __GFP_WAIT never returns NULL */ if (unlikely(bio == NULL)) return (SET_ERROR(ENOMEM)); bio->bi_end_io = vdev_disk_io_flush_completion; bio->bi_private = zio; bio_set_dev(bio, bdev); bio_set_flush(bio); vdev_submit_bio(bio); invalidate_bdev(bdev); return (0); } static void vdev_disk_io_start(zio_t *zio) { vdev_t *v = zio->io_vd; vdev_disk_t *vd = v->vdev_tsd; unsigned long trim_flags = 0; int rw, error; /* * If the vdev is closed, it's likely in the REMOVED or FAULTED state. * Nothing to be done here but return failure. */ if (vd == NULL) { zio->io_error = ENXIO; zio_interrupt(zio); return; } rw_enter(&vd->vd_lock, RW_READER); /* * If the vdev is closed, it's likely due to a failed reopen and is * in the UNAVAIL state. Nothing to be done here but return failure. */ if (vd->vd_bdev == NULL) { rw_exit(&vd->vd_lock); zio->io_error = ENXIO; zio_interrupt(zio); return; } switch (zio->io_type) { case ZIO_TYPE_IOCTL: if (!vdev_readable(v)) { rw_exit(&vd->vd_lock); zio->io_error = SET_ERROR(ENXIO); zio_interrupt(zio); return; } switch (zio->io_cmd) { case DKIOCFLUSHWRITECACHE: if (zfs_nocacheflush) break; if (v->vdev_nowritecache) { zio->io_error = SET_ERROR(ENOTSUP); break; } error = vdev_disk_io_flush(vd->vd_bdev, zio); if (error == 0) { rw_exit(&vd->vd_lock); return; } zio->io_error = error; break; default: zio->io_error = SET_ERROR(ENOTSUP); } rw_exit(&vd->vd_lock); zio_execute(zio); return; case ZIO_TYPE_WRITE: rw = WRITE; break; case ZIO_TYPE_READ: rw = READ; break; case ZIO_TYPE_TRIM: #if defined(BLKDEV_DISCARD_SECURE) if (zio->io_trim_flags & ZIO_TRIM_SECURE) trim_flags |= BLKDEV_DISCARD_SECURE; #endif zio->io_error = -blkdev_issue_discard(vd->vd_bdev, zio->io_offset >> 9, zio->io_size >> 9, GFP_NOFS, trim_flags); rw_exit(&vd->vd_lock); zio_interrupt(zio); return; default: rw_exit(&vd->vd_lock); zio->io_error = SET_ERROR(ENOTSUP); zio_interrupt(zio); return; } zio->io_target_timestamp = zio_handle_io_delay(zio); error = __vdev_disk_physio(vd->vd_bdev, zio, zio->io_size, zio->io_offset, rw, 0); rw_exit(&vd->vd_lock); if (error) { zio->io_error = error; zio_interrupt(zio); return; } } static void vdev_disk_io_done(zio_t *zio) { /* * If the device returned EIO, we revalidate the media. If it is * determined the media has changed this triggers the asynchronous * removal of the device from the configuration. */ if (zio->io_error == EIO) { vdev_t *v = zio->io_vd; vdev_disk_t *vd = v->vdev_tsd; if (check_disk_change(vd->vd_bdev)) { invalidate_bdev(vd->vd_bdev); v->vdev_remove_wanted = B_TRUE; spa_async_request(zio->io_spa, SPA_ASYNC_REMOVE); } } } static void vdev_disk_hold(vdev_t *vd) { ASSERT(spa_config_held(vd->vdev_spa, SCL_STATE, RW_WRITER)); /* We must have a pathname, and it must be absolute. */ if (vd->vdev_path == NULL || vd->vdev_path[0] != '/') return; /* * Only prefetch path and devid info if the device has * never been opened. */ if (vd->vdev_tsd != NULL) return; } static void vdev_disk_rele(vdev_t *vd) { ASSERT(spa_config_held(vd->vdev_spa, SCL_STATE, RW_WRITER)); /* XXX: Implement me as a vnode rele for the device */ } vdev_ops_t vdev_disk_ops = { .vdev_op_open = vdev_disk_open, .vdev_op_close = vdev_disk_close, .vdev_op_asize = vdev_default_asize, .vdev_op_io_start = vdev_disk_io_start, .vdev_op_io_done = vdev_disk_io_done, .vdev_op_state_change = NULL, .vdev_op_need_resilver = NULL, .vdev_op_hold = vdev_disk_hold, .vdev_op_rele = vdev_disk_rele, .vdev_op_remap = NULL, .vdev_op_xlate = vdev_default_xlate, .vdev_op_type = VDEV_TYPE_DISK, /* name of this vdev type */ .vdev_op_leaf = B_TRUE /* leaf vdev */ }; /* * The zfs_vdev_scheduler module option has been deprecated. Setting this * value no longer has any effect. It has not yet been entirely removed * to allow the module to be loaded if this option is specified in the * /etc/modprobe.d/zfs.conf file. The following warning will be logged. */ static int param_set_vdev_scheduler(const char *val, zfs_kernel_param_t *kp) { int error = param_set_charp(val, kp); if (error == 0) { printk(KERN_INFO "The 'zfs_vdev_scheduler' module option " "is not supported.\n"); } return (error); } char *zfs_vdev_scheduler = "unused"; module_param_call(zfs_vdev_scheduler, param_set_vdev_scheduler, param_get_charp, &zfs_vdev_scheduler, 0644); MODULE_PARM_DESC(zfs_vdev_scheduler, "I/O scheduler"); int param_set_min_auto_ashift(const char *buf, zfs_kernel_param_t *kp) { uint64_t val; int error; error = kstrtoull(buf, 0, &val); if (error < 0) return (SET_ERROR(error)); if (val < ASHIFT_MIN || val > zfs_vdev_max_auto_ashift) return (SET_ERROR(-EINVAL)); error = param_set_ulong(buf, kp); if (error < 0) return (SET_ERROR(error)); return (0); } int param_set_max_auto_ashift(const char *buf, zfs_kernel_param_t *kp) { uint64_t val; int error; error = kstrtoull(buf, 0, &val); if (error < 0) return (SET_ERROR(error)); if (val > ASHIFT_MAX || val < zfs_vdev_min_auto_ashift) return (SET_ERROR(-EINVAL)); error = param_set_ulong(buf, kp); if (error < 0) return (SET_ERROR(error)); return (0); } Index: vendor-sys/openzfs/dist/module/os/linux/zfs/zfs_debug.c =================================================================== --- vendor-sys/openzfs/dist/module/os/linux/zfs/zfs_debug.c (revision 366347) +++ vendor-sys/openzfs/dist/module/os/linux/zfs/zfs_debug.c (revision 366348) @@ -1,254 +1,255 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2012, 2014 by Delphix. All rights reserved. */ #include #include typedef struct zfs_dbgmsg { procfs_list_node_t zdm_node; uint64_t zdm_timestamp; int zdm_size; char zdm_msg[1]; /* variable length allocation */ } zfs_dbgmsg_t; procfs_list_t zfs_dbgmsgs; int zfs_dbgmsg_size = 0; int zfs_dbgmsg_maxsize = 4<<20; /* 4MB */ /* * Internal ZFS debug messages are enabled by default. * * # Print debug messages * cat /proc/spl/kstat/zfs/dbgmsg * * # Disable the kernel debug message log. * echo 0 > /sys/module/zfs/parameters/zfs_dbgmsg_enable * * # Clear the kernel debug message log. * echo 0 >/proc/spl/kstat/zfs/dbgmsg */ int zfs_dbgmsg_enable = 1; static int zfs_dbgmsg_show_header(struct seq_file *f) { seq_printf(f, "%-12s %-8s\n", "timestamp", "message"); return (0); } static int zfs_dbgmsg_show(struct seq_file *f, void *p) { zfs_dbgmsg_t *zdm = (zfs_dbgmsg_t *)p; seq_printf(f, "%-12llu %-s\n", (u_longlong_t)zdm->zdm_timestamp, zdm->zdm_msg); return (0); } static void zfs_dbgmsg_purge(int max_size) { while (zfs_dbgmsg_size > max_size) { zfs_dbgmsg_t *zdm = list_remove_head(&zfs_dbgmsgs.pl_list); if (zdm == NULL) return; int size = zdm->zdm_size; kmem_free(zdm, size); zfs_dbgmsg_size -= size; } } static int zfs_dbgmsg_clear(procfs_list_t *procfs_list) { mutex_enter(&zfs_dbgmsgs.pl_lock); zfs_dbgmsg_purge(0); mutex_exit(&zfs_dbgmsgs.pl_lock); return (0); } void zfs_dbgmsg_init(void) { procfs_list_install("zfs", + NULL, "dbgmsg", 0600, &zfs_dbgmsgs, zfs_dbgmsg_show, zfs_dbgmsg_show_header, zfs_dbgmsg_clear, offsetof(zfs_dbgmsg_t, zdm_node)); } void zfs_dbgmsg_fini(void) { procfs_list_uninstall(&zfs_dbgmsgs); zfs_dbgmsg_purge(0); /* * TODO - decide how to make this permanent */ #ifdef _KERNEL procfs_list_destroy(&zfs_dbgmsgs); #endif } void __set_error(const char *file, const char *func, int line, int err) { /* * To enable this: * * $ echo 512 >/sys/module/zfs/parameters/zfs_flags */ if (zfs_flags & ZFS_DEBUG_SET_ERROR) __dprintf(B_FALSE, file, func, line, "error %lu", err); } void __zfs_dbgmsg(char *buf) { int size = sizeof (zfs_dbgmsg_t) + strlen(buf); zfs_dbgmsg_t *zdm = kmem_zalloc(size, KM_SLEEP); zdm->zdm_size = size; zdm->zdm_timestamp = gethrestime_sec(); strcpy(zdm->zdm_msg, buf); mutex_enter(&zfs_dbgmsgs.pl_lock); procfs_list_add(&zfs_dbgmsgs, zdm); zfs_dbgmsg_size += size; zfs_dbgmsg_purge(MAX(zfs_dbgmsg_maxsize, 0)); mutex_exit(&zfs_dbgmsgs.pl_lock); } #ifdef _KERNEL void __dprintf(boolean_t dprint, const char *file, const char *func, int line, const char *fmt, ...) { const char *newfile; va_list adx; size_t size; char *buf; char *nl; int i; char *prefix = (dprint) ? "dprintf: " : ""; size = 1024; buf = kmem_alloc(size, KM_SLEEP); /* * Get rid of annoying prefix to filename. */ newfile = strrchr(file, '/'); if (newfile != NULL) { newfile = newfile + 1; /* Get rid of leading / */ } else { newfile = file; } i = snprintf(buf, size, "%s%s:%d:%s(): ", prefix, newfile, line, func); if (i < size) { va_start(adx, fmt); (void) vsnprintf(buf + i, size - i, fmt, adx); va_end(adx); } /* * Get rid of trailing newline for dprintf logs. */ if (dprint && buf[0] != '\0') { nl = &buf[strlen(buf) - 1]; if (*nl == '\n') *nl = '\0'; } /* * To get this data enable the zfs__dprintf trace point as shown: * * # Enable zfs__dprintf tracepoint, clear the tracepoint ring buffer * $ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable * $ echo 0 > /sys/kernel/debug/tracing/trace * * # Dump the ring buffer. * $ cat /sys/kernel/debug/tracing/trace */ DTRACE_PROBE1(zfs__dprintf, char *, buf); /* * To get this data: * * $ cat /proc/spl/kstat/zfs/dbgmsg * * To clear the buffer: * $ echo 0 > /proc/spl/kstat/zfs/dbgmsg */ __zfs_dbgmsg(buf); kmem_free(buf, size); } #else void zfs_dbgmsg_print(const char *tag) { ssize_t ret __attribute__((unused)); /* * We use write() in this function instead of printf() * so it is safe to call from a signal handler. */ ret = write(STDOUT_FILENO, "ZFS_DBGMSG(", 11); ret = write(STDOUT_FILENO, tag, strlen(tag)); ret = write(STDOUT_FILENO, ") START:\n", 9); mutex_enter(&zfs_dbgmsgs.pl_lock); for (zfs_dbgmsg_t *zdm = list_head(&zfs_dbgmsgs.pl_list); zdm != NULL; zdm = list_next(&zfs_dbgmsgs.pl_list, zdm)) { ret = write(STDOUT_FILENO, zdm->zdm_msg, strlen(zdm->zdm_msg)); ret = write(STDOUT_FILENO, "\n", 1); } ret = write(STDOUT_FILENO, "ZFS_DBGMSG(", 11); ret = write(STDOUT_FILENO, tag, strlen(tag)); ret = write(STDOUT_FILENO, ") END\n", 6); mutex_exit(&zfs_dbgmsgs.pl_lock); } #endif /* _KERNEL */ #ifdef _KERNEL module_param(zfs_dbgmsg_enable, int, 0644); MODULE_PARM_DESC(zfs_dbgmsg_enable, "Enable ZFS debug message log"); module_param(zfs_dbgmsg_maxsize, int, 0644); MODULE_PARM_DESC(zfs_dbgmsg_maxsize, "Maximum ZFS debug log size"); #endif Index: vendor-sys/openzfs/dist/module/zfs/arc.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/arc.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/arc.c (revision 366348) @@ -1,10599 +1,10609 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2018, Joyent, Inc. * Copyright (c) 2011, 2020, Delphix. All rights reserved. * Copyright (c) 2014, Saso Kiselkov. All rights reserved. * Copyright (c) 2017, Nexenta Systems, Inc. All rights reserved. * Copyright (c) 2019, loli10K . All rights reserved. * Copyright (c) 2020, George Amanakis. All rights reserved. * Copyright (c) 2019, Klara Inc. * Copyright (c) 2019, Allan Jude * Copyright (c) 2020, The FreeBSD Foundation [1] * * [1] Portions of this software were developed by Allan Jude * under sponsorship from the FreeBSD Foundation. */ /* * DVA-based Adjustable Replacement Cache * * While much of the theory of operation used here is * based on the self-tuning, low overhead replacement cache * presented by Megiddo and Modha at FAST 2003, there are some * significant differences: * * 1. The Megiddo and Modha model assumes any page is evictable. * Pages in its cache cannot be "locked" into memory. This makes * the eviction algorithm simple: evict the last page in the list. * This also make the performance characteristics easy to reason * about. Our cache is not so simple. At any given moment, some * subset of the blocks in the cache are un-evictable because we * have handed out a reference to them. Blocks are only evictable * when there are no external references active. This makes * eviction far more problematic: we choose to evict the evictable * blocks that are the "lowest" in the list. * * There are times when it is not possible to evict the requested * space. In these circumstances we are unable to adjust the cache * size. To prevent the cache growing unbounded at these times we * implement a "cache throttle" that slows the flow of new data * into the cache until we can make space available. * * 2. The Megiddo and Modha model assumes a fixed cache size. * Pages are evicted when the cache is full and there is a cache * miss. Our model has a variable sized cache. It grows with * high use, but also tries to react to memory pressure from the * operating system: decreasing its size when system memory is * tight. * * 3. The Megiddo and Modha model assumes a fixed page size. All * elements of the cache are therefore exactly the same size. So * when adjusting the cache size following a cache miss, its simply * a matter of choosing a single page to evict. In our model, we * have variable sized cache blocks (ranging from 512 bytes to * 128K bytes). We therefore choose a set of blocks to evict to make * space for a cache miss that approximates as closely as possible * the space used by the new block. * * See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache" * by N. Megiddo & D. Modha, FAST 2003 */ /* * The locking model: * * A new reference to a cache buffer can be obtained in two * ways: 1) via a hash table lookup using the DVA as a key, * or 2) via one of the ARC lists. The arc_read() interface * uses method 1, while the internal ARC algorithms for * adjusting the cache use method 2. We therefore provide two * types of locks: 1) the hash table lock array, and 2) the * ARC list locks. * * Buffers do not have their own mutexes, rather they rely on the * hash table mutexes for the bulk of their protection (i.e. most * fields in the arc_buf_hdr_t are protected by these mutexes). * * buf_hash_find() returns the appropriate mutex (held) when it * locates the requested buffer in the hash table. It returns * NULL for the mutex if the buffer was not in the table. * * buf_hash_remove() expects the appropriate hash mutex to be * already held before it is invoked. * * Each ARC state also has a mutex which is used to protect the * buffer list associated with the state. When attempting to * obtain a hash table lock while holding an ARC list lock you * must use: mutex_tryenter() to avoid deadlock. Also note that * the active state mutex must be held before the ghost state mutex. * * It as also possible to register a callback which is run when the * arc_meta_limit is reached and no buffers can be safely evicted. In * this case the arc user should drop a reference on some arc buffers so * they can be reclaimed and the arc_meta_limit honored. For example, * when using the ZPL each dentry holds a references on a znode. These * dentries must be pruned before the arc buffer holding the znode can * be safely evicted. * * Note that the majority of the performance stats are manipulated * with atomic operations. * * The L2ARC uses the l2ad_mtx on each vdev for the following: * * - L2ARC buflist creation * - L2ARC buflist eviction * - L2ARC write completion, which walks L2ARC buflists * - ARC header destruction, as it removes from L2ARC buflists * - ARC header release, as it removes from L2ARC buflists */ /* * ARC operation: * * Every block that is in the ARC is tracked by an arc_buf_hdr_t structure. * This structure can point either to a block that is still in the cache or to * one that is only accessible in an L2 ARC device, or it can provide * information about a block that was recently evicted. If a block is * only accessible in the L2ARC, then the arc_buf_hdr_t only has enough * information to retrieve it from the L2ARC device. This information is * stored in the l2arc_buf_hdr_t sub-structure of the arc_buf_hdr_t. A block * that is in this state cannot access the data directly. * * Blocks that are actively being referenced or have not been evicted * are cached in the L1ARC. The L1ARC (l1arc_buf_hdr_t) is a structure within * the arc_buf_hdr_t that will point to the data block in memory. A block can * only be read by a consumer if it has an l1arc_buf_hdr_t. The L1ARC * caches data in two ways -- in a list of ARC buffers (arc_buf_t) and * also in the arc_buf_hdr_t's private physical data block pointer (b_pabd). * * The L1ARC's data pointer may or may not be uncompressed. The ARC has the * ability to store the physical data (b_pabd) associated with the DVA of the * arc_buf_hdr_t. Since the b_pabd is a copy of the on-disk physical block, * it will match its on-disk compression characteristics. This behavior can be * disabled by setting 'zfs_compressed_arc_enabled' to B_FALSE. When the * compressed ARC functionality is disabled, the b_pabd will point to an * uncompressed version of the on-disk data. * * Data in the L1ARC is not accessed by consumers of the ARC directly. Each * arc_buf_hdr_t can have multiple ARC buffers (arc_buf_t) which reference it. * Each ARC buffer (arc_buf_t) is being actively accessed by a specific ARC * consumer. The ARC will provide references to this data and will keep it * cached until it is no longer in use. The ARC caches only the L1ARC's physical * data block and will evict any arc_buf_t that is no longer referenced. The * amount of memory consumed by the arc_buf_ts' data buffers can be seen via the * "overhead_size" kstat. * * Depending on the consumer, an arc_buf_t can be requested in uncompressed or * compressed form. The typical case is that consumers will want uncompressed * data, and when that happens a new data buffer is allocated where the data is * decompressed for them to use. Currently the only consumer who wants * compressed arc_buf_t's is "zfs send", when it streams data exactly as it * exists on disk. When this happens, the arc_buf_t's data buffer is shared * with the arc_buf_hdr_t. * * Here is a diagram showing an arc_buf_hdr_t referenced by two arc_buf_t's. The * first one is owned by a compressed send consumer (and therefore references * the same compressed data buffer as the arc_buf_hdr_t) and the second could be * used by any other consumer (and has its own uncompressed copy of the data * buffer). * * arc_buf_hdr_t * +-----------+ * | fields | * | common to | * | L1- and | * | L2ARC | * +-----------+ * | l2arc_buf_hdr_t * | | * +-----------+ * | l1arc_buf_hdr_t * | | arc_buf_t * | b_buf +------------>+-----------+ arc_buf_t * | b_pabd +-+ |b_next +---->+-----------+ * +-----------+ | |-----------| |b_next +-->NULL * | |b_comp = T | +-----------+ * | |b_data +-+ |b_comp = F | * | +-----------+ | |b_data +-+ * +->+------+ | +-----------+ | * compressed | | | | * data | |<--------------+ | uncompressed * +------+ compressed, | data * shared +-->+------+ * data | | * | | * +------+ * * When a consumer reads a block, the ARC must first look to see if the * arc_buf_hdr_t is cached. If the hdr is cached then the ARC allocates a new * arc_buf_t and either copies uncompressed data into a new data buffer from an * existing uncompressed arc_buf_t, decompresses the hdr's b_pabd buffer into a * new data buffer, or shares the hdr's b_pabd buffer, depending on whether the * hdr is compressed and the desired compression characteristics of the * arc_buf_t consumer. If the arc_buf_t ends up sharing data with the * arc_buf_hdr_t and both of them are uncompressed then the arc_buf_t must be * the last buffer in the hdr's b_buf list, however a shared compressed buf can * be anywhere in the hdr's list. * * The diagram below shows an example of an uncompressed ARC hdr that is * sharing its data with an arc_buf_t (note that the shared uncompressed buf is * the last element in the buf list): * * arc_buf_hdr_t * +-----------+ * | | * | | * | | * +-----------+ * l2arc_buf_hdr_t| | * | | * +-----------+ * l1arc_buf_hdr_t| | * | | arc_buf_t (shared) * | b_buf +------------>+---------+ arc_buf_t * | | |b_next +---->+---------+ * | b_pabd +-+ |---------| |b_next +-->NULL * +-----------+ | | | +---------+ * | |b_data +-+ | | * | +---------+ | |b_data +-+ * +->+------+ | +---------+ | * | | | | * uncompressed | | | | * data +------+ | | * ^ +->+------+ | * | uncompressed | | | * | data | | | * | +------+ | * +---------------------------------+ * * Writing to the ARC requires that the ARC first discard the hdr's b_pabd * since the physical block is about to be rewritten. The new data contents * will be contained in the arc_buf_t. As the I/O pipeline performs the write, * it may compress the data before writing it to disk. The ARC will be called * with the transformed data and will bcopy the transformed on-disk block into * a newly allocated b_pabd. Writes are always done into buffers which have * either been loaned (and hence are new and don't have other readers) or * buffers which have been released (and hence have their own hdr, if there * were originally other readers of the buf's original hdr). This ensures that * the ARC only needs to update a single buf and its hdr after a write occurs. * * When the L2ARC is in use, it will also take advantage of the b_pabd. The * L2ARC will always write the contents of b_pabd to the L2ARC. This means * that when compressed ARC is enabled that the L2ARC blocks are identical * to the on-disk block in the main data pool. This provides a significant * advantage since the ARC can leverage the bp's checksum when reading from the * L2ARC to determine if the contents are valid. However, if the compressed * ARC is disabled, then the L2ARC's block must be transformed to look * like the physical block in the main data pool before comparing the * checksum and determining its validity. * * The L1ARC has a slightly different system for storing encrypted data. * Raw (encrypted + possibly compressed) data has a few subtle differences from * data that is just compressed. The biggest difference is that it is not * possible to decrypt encrypted data (or vice-versa) if the keys aren't loaded. * The other difference is that encryption cannot be treated as a suggestion. * If a caller would prefer compressed data, but they actually wind up with * uncompressed data the worst thing that could happen is there might be a * performance hit. If the caller requests encrypted data, however, we must be * sure they actually get it or else secret information could be leaked. Raw * data is stored in hdr->b_crypt_hdr.b_rabd. An encrypted header, therefore, * may have both an encrypted version and a decrypted version of its data at * once. When a caller needs a raw arc_buf_t, it is allocated and the data is * copied out of this header. To avoid complications with b_pabd, raw buffers * cannot be shared. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include +#include #ifndef _KERNEL /* set with ZFS_DEBUG=watch, to enable watchpoints on frozen buffers */ boolean_t arc_watch = B_FALSE; #endif /* * This thread's job is to keep enough free memory in the system, by * calling arc_kmem_reap_soon() plus arc_reduce_target_size(), which improves * arc_available_memory(). */ static zthr_t *arc_reap_zthr; /* * This thread's job is to keep arc_size under arc_c, by calling * arc_evict(), which improves arc_is_overflowing(). */ static zthr_t *arc_evict_zthr; static kmutex_t arc_evict_lock; static boolean_t arc_evict_needed = B_FALSE; /* * Count of bytes evicted since boot. */ static uint64_t arc_evict_count; /* * List of arc_evict_waiter_t's, representing threads waiting for the * arc_evict_count to reach specific values. */ static list_t arc_evict_waiters; /* * When arc_is_overflowing(), arc_get_data_impl() waits for this percent of * the requested amount of data to be evicted. For example, by default for * every 2KB that's evicted, 1KB of it may be "reused" by a new allocation. * Since this is above 100%, it ensures that progress is made towards getting * arc_size under arc_c. Since this is finite, it ensures that allocations * can still happen, even during the potentially long time that arc_size is * more than arc_c. */ int zfs_arc_eviction_pct = 200; /* * The number of headers to evict in arc_evict_state_impl() before * dropping the sublist lock and evicting from another sublist. A lower * value means we're more likely to evict the "correct" header (i.e. the * oldest header in the arc state), but comes with higher overhead * (i.e. more invocations of arc_evict_state_impl()). */ int zfs_arc_evict_batch_limit = 10; /* number of seconds before growing cache again */ int arc_grow_retry = 5; /* * Minimum time between calls to arc_kmem_reap_soon(). */ int arc_kmem_cache_reap_retry_ms = 1000; /* shift of arc_c for calculating overflow limit in arc_get_data_impl */ int zfs_arc_overflow_shift = 8; /* shift of arc_c for calculating both min and max arc_p */ int arc_p_min_shift = 4; /* log2(fraction of arc to reclaim) */ int arc_shrink_shift = 7; /* percent of pagecache to reclaim arc to */ #ifdef _KERNEL uint_t zfs_arc_pc_percent = 0; #endif /* * log2(fraction of ARC which must be free to allow growing). * I.e. If there is less than arc_c >> arc_no_grow_shift free memory, * when reading a new block into the ARC, we will evict an equal-sized block * from the ARC. * * This must be less than arc_shrink_shift, so that when we shrink the ARC, * we will still not allow it to grow. */ int arc_no_grow_shift = 5; /* * minimum lifespan of a prefetch block in clock ticks * (initialized in arc_init()) */ static int arc_min_prefetch_ms; static int arc_min_prescient_prefetch_ms; /* * If this percent of memory is free, don't throttle. */ int arc_lotsfree_percent = 10; /* * The arc has filled available memory and has now warmed up. */ boolean_t arc_warm; /* * These tunables are for performance analysis. */ unsigned long zfs_arc_max = 0; unsigned long zfs_arc_min = 0; unsigned long zfs_arc_meta_limit = 0; unsigned long zfs_arc_meta_min = 0; unsigned long zfs_arc_dnode_limit = 0; unsigned long zfs_arc_dnode_reduce_percent = 10; int zfs_arc_grow_retry = 0; int zfs_arc_shrink_shift = 0; int zfs_arc_p_min_shift = 0; int zfs_arc_average_blocksize = 8 * 1024; /* 8KB */ /* * ARC dirty data constraints for arc_tempreserve_space() throttle. */ unsigned long zfs_arc_dirty_limit_percent = 50; /* total dirty data limit */ unsigned long zfs_arc_anon_limit_percent = 25; /* anon block dirty limit */ unsigned long zfs_arc_pool_dirty_percent = 20; /* each pool's anon allowance */ /* * Enable or disable compressed arc buffers. */ int zfs_compressed_arc_enabled = B_TRUE; /* * ARC will evict meta buffers that exceed arc_meta_limit. This * tunable make arc_meta_limit adjustable for different workloads. */ unsigned long zfs_arc_meta_limit_percent = 75; /* * Percentage that can be consumed by dnodes of ARC meta buffers. */ unsigned long zfs_arc_dnode_limit_percent = 10; /* * These tunables are Linux specific */ unsigned long zfs_arc_sys_free = 0; int zfs_arc_min_prefetch_ms = 0; int zfs_arc_min_prescient_prefetch_ms = 0; int zfs_arc_p_dampener_disable = 1; int zfs_arc_meta_prune = 10000; int zfs_arc_meta_strategy = ARC_STRATEGY_META_BALANCED; int zfs_arc_meta_adjust_restarts = 4096; int zfs_arc_lotsfree_percent = 10; /* The 6 states: */ arc_state_t ARC_anon; arc_state_t ARC_mru; arc_state_t ARC_mru_ghost; arc_state_t ARC_mfu; arc_state_t ARC_mfu_ghost; arc_state_t ARC_l2c_only; arc_stats_t arc_stats = { { "hits", KSTAT_DATA_UINT64 }, { "misses", KSTAT_DATA_UINT64 }, { "demand_data_hits", KSTAT_DATA_UINT64 }, { "demand_data_misses", KSTAT_DATA_UINT64 }, { "demand_metadata_hits", KSTAT_DATA_UINT64 }, { "demand_metadata_misses", KSTAT_DATA_UINT64 }, { "prefetch_data_hits", KSTAT_DATA_UINT64 }, { "prefetch_data_misses", KSTAT_DATA_UINT64 }, { "prefetch_metadata_hits", KSTAT_DATA_UINT64 }, { "prefetch_metadata_misses", KSTAT_DATA_UINT64 }, { "mru_hits", KSTAT_DATA_UINT64 }, { "mru_ghost_hits", KSTAT_DATA_UINT64 }, { "mfu_hits", KSTAT_DATA_UINT64 }, { "mfu_ghost_hits", KSTAT_DATA_UINT64 }, { "deleted", KSTAT_DATA_UINT64 }, { "mutex_miss", KSTAT_DATA_UINT64 }, { "access_skip", KSTAT_DATA_UINT64 }, { "evict_skip", KSTAT_DATA_UINT64 }, { "evict_not_enough", KSTAT_DATA_UINT64 }, { "evict_l2_cached", KSTAT_DATA_UINT64 }, { "evict_l2_eligible", KSTAT_DATA_UINT64 }, { "evict_l2_ineligible", KSTAT_DATA_UINT64 }, { "evict_l2_skip", KSTAT_DATA_UINT64 }, { "hash_elements", KSTAT_DATA_UINT64 }, { "hash_elements_max", KSTAT_DATA_UINT64 }, { "hash_collisions", KSTAT_DATA_UINT64 }, { "hash_chains", KSTAT_DATA_UINT64 }, { "hash_chain_max", KSTAT_DATA_UINT64 }, { "p", KSTAT_DATA_UINT64 }, { "c", KSTAT_DATA_UINT64 }, { "c_min", KSTAT_DATA_UINT64 }, { "c_max", KSTAT_DATA_UINT64 }, { "size", KSTAT_DATA_UINT64 }, { "compressed_size", KSTAT_DATA_UINT64 }, { "uncompressed_size", KSTAT_DATA_UINT64 }, { "overhead_size", KSTAT_DATA_UINT64 }, { "hdr_size", KSTAT_DATA_UINT64 }, { "data_size", KSTAT_DATA_UINT64 }, { "metadata_size", KSTAT_DATA_UINT64 }, { "dbuf_size", KSTAT_DATA_UINT64 }, { "dnode_size", KSTAT_DATA_UINT64 }, { "bonus_size", KSTAT_DATA_UINT64 }, #if defined(COMPAT_FREEBSD11) { "other_size", KSTAT_DATA_UINT64 }, #endif { "anon_size", KSTAT_DATA_UINT64 }, { "anon_evictable_data", KSTAT_DATA_UINT64 }, { "anon_evictable_metadata", KSTAT_DATA_UINT64 }, { "mru_size", KSTAT_DATA_UINT64 }, { "mru_evictable_data", KSTAT_DATA_UINT64 }, { "mru_evictable_metadata", KSTAT_DATA_UINT64 }, { "mru_ghost_size", KSTAT_DATA_UINT64 }, { "mru_ghost_evictable_data", KSTAT_DATA_UINT64 }, { "mru_ghost_evictable_metadata", KSTAT_DATA_UINT64 }, { "mfu_size", KSTAT_DATA_UINT64 }, { "mfu_evictable_data", KSTAT_DATA_UINT64 }, { "mfu_evictable_metadata", KSTAT_DATA_UINT64 }, { "mfu_ghost_size", KSTAT_DATA_UINT64 }, { "mfu_ghost_evictable_data", KSTAT_DATA_UINT64 }, { "mfu_ghost_evictable_metadata", KSTAT_DATA_UINT64 }, { "l2_hits", KSTAT_DATA_UINT64 }, { "l2_misses", KSTAT_DATA_UINT64 }, { "l2_feeds", KSTAT_DATA_UINT64 }, { "l2_rw_clash", KSTAT_DATA_UINT64 }, { "l2_read_bytes", KSTAT_DATA_UINT64 }, { "l2_write_bytes", KSTAT_DATA_UINT64 }, { "l2_writes_sent", KSTAT_DATA_UINT64 }, { "l2_writes_done", KSTAT_DATA_UINT64 }, { "l2_writes_error", KSTAT_DATA_UINT64 }, { "l2_writes_lock_retry", KSTAT_DATA_UINT64 }, { "l2_evict_lock_retry", KSTAT_DATA_UINT64 }, { "l2_evict_reading", KSTAT_DATA_UINT64 }, { "l2_evict_l1cached", KSTAT_DATA_UINT64 }, { "l2_free_on_write", KSTAT_DATA_UINT64 }, { "l2_abort_lowmem", KSTAT_DATA_UINT64 }, { "l2_cksum_bad", KSTAT_DATA_UINT64 }, { "l2_io_error", KSTAT_DATA_UINT64 }, { "l2_size", KSTAT_DATA_UINT64 }, { "l2_asize", KSTAT_DATA_UINT64 }, { "l2_hdr_size", KSTAT_DATA_UINT64 }, { "l2_log_blk_writes", KSTAT_DATA_UINT64 }, { "l2_log_blk_avg_asize", KSTAT_DATA_UINT64 }, { "l2_log_blk_asize", KSTAT_DATA_UINT64 }, { "l2_log_blk_count", KSTAT_DATA_UINT64 }, { "l2_data_to_meta_ratio", KSTAT_DATA_UINT64 }, { "l2_rebuild_success", KSTAT_DATA_UINT64 }, { "l2_rebuild_unsupported", KSTAT_DATA_UINT64 }, { "l2_rebuild_io_errors", KSTAT_DATA_UINT64 }, { "l2_rebuild_dh_errors", KSTAT_DATA_UINT64 }, { "l2_rebuild_cksum_lb_errors", KSTAT_DATA_UINT64 }, { "l2_rebuild_lowmem", KSTAT_DATA_UINT64 }, { "l2_rebuild_size", KSTAT_DATA_UINT64 }, { "l2_rebuild_asize", KSTAT_DATA_UINT64 }, { "l2_rebuild_bufs", KSTAT_DATA_UINT64 }, { "l2_rebuild_bufs_precached", KSTAT_DATA_UINT64 }, { "l2_rebuild_log_blks", KSTAT_DATA_UINT64 }, { "memory_throttle_count", KSTAT_DATA_UINT64 }, { "memory_direct_count", KSTAT_DATA_UINT64 }, { "memory_indirect_count", KSTAT_DATA_UINT64 }, { "memory_all_bytes", KSTAT_DATA_UINT64 }, { "memory_free_bytes", KSTAT_DATA_UINT64 }, { "memory_available_bytes", KSTAT_DATA_INT64 }, { "arc_no_grow", KSTAT_DATA_UINT64 }, { "arc_tempreserve", KSTAT_DATA_UINT64 }, { "arc_loaned_bytes", KSTAT_DATA_UINT64 }, { "arc_prune", KSTAT_DATA_UINT64 }, { "arc_meta_used", KSTAT_DATA_UINT64 }, { "arc_meta_limit", KSTAT_DATA_UINT64 }, { "arc_dnode_limit", KSTAT_DATA_UINT64 }, { "arc_meta_max", KSTAT_DATA_UINT64 }, { "arc_meta_min", KSTAT_DATA_UINT64 }, { "async_upgrade_sync", KSTAT_DATA_UINT64 }, { "demand_hit_predictive_prefetch", KSTAT_DATA_UINT64 }, { "demand_hit_prescient_prefetch", KSTAT_DATA_UINT64 }, { "arc_need_free", KSTAT_DATA_UINT64 }, { "arc_sys_free", KSTAT_DATA_UINT64 }, { "arc_raw_size", KSTAT_DATA_UINT64 }, { "cached_only_in_progress", KSTAT_DATA_UINT64 }, { "abd_chunk_waste_size", KSTAT_DATA_UINT64 }, }; #define ARCSTAT_MAX(stat, val) { \ uint64_t m; \ while ((val) > (m = arc_stats.stat.value.ui64) && \ (m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \ continue; \ } #define ARCSTAT_MAXSTAT(stat) \ ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64) /* * We define a macro to allow ARC hits/misses to be easily broken down by * two separate conditions, giving a total of four different subtypes for * each of hits and misses (so eight statistics total). */ #define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \ if (cond1) { \ if (cond2) { \ ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \ } else { \ ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \ } \ } else { \ if (cond2) { \ ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \ } else { \ ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\ } \ } /* * This macro allows us to use kstats as floating averages. Each time we * update this kstat, we first factor it and the update value by * ARCSTAT_AVG_FACTOR to shrink the new value's contribution to the overall * average. This macro assumes that integer loads and stores are atomic, but * is not safe for multiple writers updating the kstat in parallel (only the * last writer's update will remain). */ #define ARCSTAT_F_AVG_FACTOR 3 #define ARCSTAT_F_AVG(stat, value) \ do { \ uint64_t x = ARCSTAT(stat); \ x = x - x / ARCSTAT_F_AVG_FACTOR + \ (value) / ARCSTAT_F_AVG_FACTOR; \ ARCSTAT(stat) = x; \ _NOTE(CONSTCOND) \ } while (0) kstat_t *arc_ksp; static arc_state_t *arc_anon; static arc_state_t *arc_mru_ghost; static arc_state_t *arc_mfu_ghost; static arc_state_t *arc_l2c_only; arc_state_t *arc_mru; arc_state_t *arc_mfu; /* * There are several ARC variables that are critical to export as kstats -- * but we don't want to have to grovel around in the kstat whenever we wish to * manipulate them. For these variables, we therefore define them to be in * terms of the statistic variable. This assures that we are not introducing * the possibility of inconsistency by having shadow copies of the variables, * while still allowing the code to be readable. */ #define arc_tempreserve ARCSTAT(arcstat_tempreserve) #define arc_loaned_bytes ARCSTAT(arcstat_loaned_bytes) #define arc_meta_limit ARCSTAT(arcstat_meta_limit) /* max size for metadata */ /* max size for dnodes */ #define arc_dnode_size_limit ARCSTAT(arcstat_dnode_limit) #define arc_meta_min ARCSTAT(arcstat_meta_min) /* min size for metadata */ #define arc_meta_max ARCSTAT(arcstat_meta_max) /* max size of metadata */ #define arc_need_free ARCSTAT(arcstat_need_free) /* waiting to be evicted */ /* size of all b_rabd's in entire arc */ #define arc_raw_size ARCSTAT(arcstat_raw_size) /* compressed size of entire arc */ #define arc_compressed_size ARCSTAT(arcstat_compressed_size) /* uncompressed size of entire arc */ #define arc_uncompressed_size ARCSTAT(arcstat_uncompressed_size) /* number of bytes in the arc from arc_buf_t's */ #define arc_overhead_size ARCSTAT(arcstat_overhead_size) /* * There are also some ARC variables that we want to export, but that are * updated so often that having the canonical representation be the statistic * variable causes a performance bottleneck. We want to use aggsum_t's for these * instead, but still be able to export the kstat in the same way as before. * The solution is to always use the aggsum version, except in the kstat update * callback. */ aggsum_t arc_size; aggsum_t arc_meta_used; aggsum_t astat_data_size; aggsum_t astat_metadata_size; aggsum_t astat_dbuf_size; aggsum_t astat_dnode_size; aggsum_t astat_bonus_size; aggsum_t astat_hdr_size; aggsum_t astat_l2_hdr_size; aggsum_t astat_abd_chunk_waste_size; hrtime_t arc_growtime; list_t arc_prune_list; kmutex_t arc_prune_mtx; taskq_t *arc_prune_taskq; #define GHOST_STATE(state) \ ((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \ (state) == arc_l2c_only) #define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_FLAG_IN_HASH_TABLE) #define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) #define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_FLAG_IO_ERROR) #define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_FLAG_PREFETCH) #define HDR_PRESCIENT_PREFETCH(hdr) \ ((hdr)->b_flags & ARC_FLAG_PRESCIENT_PREFETCH) #define HDR_COMPRESSION_ENABLED(hdr) \ ((hdr)->b_flags & ARC_FLAG_COMPRESSED_ARC) #define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_FLAG_L2CACHE) #define HDR_L2_READING(hdr) \ (((hdr)->b_flags & ARC_FLAG_IO_IN_PROGRESS) && \ ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR)) #define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITING) #define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_FLAG_L2_EVICTED) #define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_FLAG_L2_WRITE_HEAD) #define HDR_PROTECTED(hdr) ((hdr)->b_flags & ARC_FLAG_PROTECTED) #define HDR_NOAUTH(hdr) ((hdr)->b_flags & ARC_FLAG_NOAUTH) #define HDR_SHARED_DATA(hdr) ((hdr)->b_flags & ARC_FLAG_SHARED_DATA) #define HDR_ISTYPE_METADATA(hdr) \ ((hdr)->b_flags & ARC_FLAG_BUFC_METADATA) #define HDR_ISTYPE_DATA(hdr) (!HDR_ISTYPE_METADATA(hdr)) #define HDR_HAS_L1HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L1HDR) #define HDR_HAS_L2HDR(hdr) ((hdr)->b_flags & ARC_FLAG_HAS_L2HDR) #define HDR_HAS_RABD(hdr) \ (HDR_HAS_L1HDR(hdr) && HDR_PROTECTED(hdr) && \ (hdr)->b_crypt_hdr.b_rabd != NULL) #define HDR_ENCRYPTED(hdr) \ (HDR_PROTECTED(hdr) && DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot)) #define HDR_AUTHENTICATED(hdr) \ (HDR_PROTECTED(hdr) && !DMU_OT_IS_ENCRYPTED((hdr)->b_crypt_hdr.b_ot)) /* For storing compression mode in b_flags */ #define HDR_COMPRESS_OFFSET (highbit64(ARC_FLAG_COMPRESS_0) - 1) #define HDR_GET_COMPRESS(hdr) ((enum zio_compress)BF32_GET((hdr)->b_flags, \ HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS)) #define HDR_SET_COMPRESS(hdr, cmp) BF32_SET((hdr)->b_flags, \ HDR_COMPRESS_OFFSET, SPA_COMPRESSBITS, (cmp)); #define ARC_BUF_LAST(buf) ((buf)->b_next == NULL) #define ARC_BUF_SHARED(buf) ((buf)->b_flags & ARC_BUF_FLAG_SHARED) #define ARC_BUF_COMPRESSED(buf) ((buf)->b_flags & ARC_BUF_FLAG_COMPRESSED) #define ARC_BUF_ENCRYPTED(buf) ((buf)->b_flags & ARC_BUF_FLAG_ENCRYPTED) /* * Other sizes */ #define HDR_FULL_CRYPT_SIZE ((int64_t)sizeof (arc_buf_hdr_t)) #define HDR_FULL_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_crypt_hdr)) #define HDR_L2ONLY_SIZE ((int64_t)offsetof(arc_buf_hdr_t, b_l1hdr)) /* * Hash table routines */ #define HT_LOCK_ALIGN 64 #define HT_LOCK_PAD (P2NPHASE(sizeof (kmutex_t), (HT_LOCK_ALIGN))) struct ht_lock { kmutex_t ht_lock; #ifdef _KERNEL unsigned char pad[HT_LOCK_PAD]; #endif }; #define BUF_LOCKS 8192 typedef struct buf_hash_table { uint64_t ht_mask; arc_buf_hdr_t **ht_table; struct ht_lock ht_locks[BUF_LOCKS]; } buf_hash_table_t; static buf_hash_table_t buf_hash_table; #define BUF_HASH_INDEX(spa, dva, birth) \ (buf_hash(spa, dva, birth) & buf_hash_table.ht_mask) #define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)]) #define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock)) #define HDR_LOCK(hdr) \ (BUF_HASH_LOCK(BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth))) uint64_t zfs_crc64_table[256]; /* * Level 2 ARC */ #define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */ #define L2ARC_HEADROOM 2 /* num of writes */ /* * If we discover during ARC scan any buffers to be compressed, we boost * our headroom for the next scanning cycle by this percentage multiple. */ #define L2ARC_HEADROOM_BOOST 200 #define L2ARC_FEED_SECS 1 /* caching interval secs */ #define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */ /* * We can feed L2ARC from two states of ARC buffers, mru and mfu, * and each of the state has two types: data and metadata. */ #define L2ARC_FEED_TYPES 4 #define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent) #define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done) /* L2ARC Performance Tunables */ unsigned long l2arc_write_max = L2ARC_WRITE_SIZE; /* def max write size */ unsigned long l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra warmup write */ unsigned long l2arc_headroom = L2ARC_HEADROOM; /* # of dev writes */ unsigned long l2arc_headroom_boost = L2ARC_HEADROOM_BOOST; unsigned long l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */ unsigned long l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval msecs */ int l2arc_noprefetch = B_TRUE; /* don't cache prefetch bufs */ int l2arc_feed_again = B_TRUE; /* turbo warmup */ int l2arc_norw = B_FALSE; /* no reads during writes */ int l2arc_meta_percent = 33; /* limit on headers size */ /* * L2ARC Internals */ static list_t L2ARC_dev_list; /* device list */ static list_t *l2arc_dev_list; /* device list pointer */ static kmutex_t l2arc_dev_mtx; /* device list mutex */ static l2arc_dev_t *l2arc_dev_last; /* last device used */ static list_t L2ARC_free_on_write; /* free after write buf list */ static list_t *l2arc_free_on_write; /* free after write list ptr */ static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */ static uint64_t l2arc_ndev; /* number of devices */ typedef struct l2arc_read_callback { arc_buf_hdr_t *l2rcb_hdr; /* read header */ blkptr_t l2rcb_bp; /* original blkptr */ zbookmark_phys_t l2rcb_zb; /* original bookmark */ int l2rcb_flags; /* original flags */ abd_t *l2rcb_abd; /* temporary buffer */ } l2arc_read_callback_t; typedef struct l2arc_data_free { /* protected by l2arc_free_on_write_mtx */ abd_t *l2df_abd; size_t l2df_size; arc_buf_contents_t l2df_type; list_node_t l2df_list_node; } l2arc_data_free_t; typedef enum arc_fill_flags { ARC_FILL_LOCKED = 1 << 0, /* hdr lock is held */ ARC_FILL_COMPRESSED = 1 << 1, /* fill with compressed data */ ARC_FILL_ENCRYPTED = 1 << 2, /* fill with encrypted data */ ARC_FILL_NOAUTH = 1 << 3, /* don't attempt to authenticate */ ARC_FILL_IN_PLACE = 1 << 4 /* fill in place (special case) */ } arc_fill_flags_t; static kmutex_t l2arc_feed_thr_lock; static kcondvar_t l2arc_feed_thr_cv; static uint8_t l2arc_thread_exit; static kmutex_t l2arc_rebuild_thr_lock; static kcondvar_t l2arc_rebuild_thr_cv; enum arc_hdr_alloc_flags { ARC_HDR_ALLOC_RDATA = 0x1, ARC_HDR_DO_ADAPT = 0x2, }; static abd_t *arc_get_data_abd(arc_buf_hdr_t *, uint64_t, void *, boolean_t); static void *arc_get_data_buf(arc_buf_hdr_t *, uint64_t, void *); static void arc_get_data_impl(arc_buf_hdr_t *, uint64_t, void *, boolean_t); static void arc_free_data_abd(arc_buf_hdr_t *, abd_t *, uint64_t, void *); static void arc_free_data_buf(arc_buf_hdr_t *, void *, uint64_t, void *); static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag); static void arc_hdr_free_abd(arc_buf_hdr_t *, boolean_t); static void arc_hdr_alloc_abd(arc_buf_hdr_t *, int); static void arc_access(arc_buf_hdr_t *, kmutex_t *); static void arc_buf_watch(arc_buf_t *); static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *); static uint32_t arc_bufc_to_flags(arc_buf_contents_t); static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags); static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags); static boolean_t l2arc_write_eligible(uint64_t, arc_buf_hdr_t *); static void l2arc_read_done(zio_t *); static void l2arc_do_free_on_write(void); /* * l2arc_mfuonly : A ZFS module parameter that controls whether only MFU * metadata and data are cached from ARC into L2ARC. */ int l2arc_mfuonly = 0; /* * L2ARC TRIM * l2arc_trim_ahead : A ZFS module parameter that controls how much ahead of * the current write size (l2arc_write_max) we should TRIM if we * have filled the device. It is defined as a percentage of the * write size. If set to 100 we trim twice the space required to * accommodate upcoming writes. A minimum of 64MB will be trimmed. * It also enables TRIM of the whole L2ARC device upon creation or * addition to an existing pool or if the header of the device is * invalid upon importing a pool or onlining a cache device. The * default is 0, which disables TRIM on L2ARC altogether as it can * put significant stress on the underlying storage devices. This * will vary depending of how well the specific device handles * these commands. */ unsigned long l2arc_trim_ahead = 0; /* * Performance tuning of L2ARC persistence: * * l2arc_rebuild_enabled : A ZFS module parameter that controls whether adding * an L2ARC device (either at pool import or later) will attempt * to rebuild L2ARC buffer contents. * l2arc_rebuild_blocks_min_l2size : A ZFS module parameter that controls * whether log blocks are written to the L2ARC device. If the L2ARC * device is less than 1GB, the amount of data l2arc_evict() * evicts is significant compared to the amount of restored L2ARC * data. In this case do not write log blocks in L2ARC in order * not to waste space. */ int l2arc_rebuild_enabled = B_TRUE; unsigned long l2arc_rebuild_blocks_min_l2size = 1024 * 1024 * 1024; /* L2ARC persistence rebuild control routines. */ void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen); static void l2arc_dev_rebuild_thread(void *arg); static int l2arc_rebuild(l2arc_dev_t *dev); /* L2ARC persistence read I/O routines. */ static int l2arc_dev_hdr_read(l2arc_dev_t *dev); static int l2arc_log_blk_read(l2arc_dev_t *dev, const l2arc_log_blkptr_t *this_lp, const l2arc_log_blkptr_t *next_lp, l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb, zio_t *this_io, zio_t **next_io); static zio_t *l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lp, l2arc_log_blk_phys_t *lb); static void l2arc_log_blk_fetch_abort(zio_t *zio); /* L2ARC persistence block restoration routines. */ static void l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb, uint64_t lb_asize, uint64_t lb_daddr); static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev); /* L2ARC persistence write I/O routines. */ static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb); /* L2ARC persistence auxiliary routines. */ boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp); static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *ab); boolean_t l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check); static void l2arc_blk_fetch_done(zio_t *zio); static inline uint64_t l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev); /* * We use Cityhash for this. It's fast, and has good hash properties without * requiring any large static buffers. */ static uint64_t buf_hash(uint64_t spa, const dva_t *dva, uint64_t birth) { return (cityhash4(spa, dva->dva_word[0], dva->dva_word[1], birth)); } #define HDR_EMPTY(hdr) \ ((hdr)->b_dva.dva_word[0] == 0 && \ (hdr)->b_dva.dva_word[1] == 0) #define HDR_EMPTY_OR_LOCKED(hdr) \ (HDR_EMPTY(hdr) || MUTEX_HELD(HDR_LOCK(hdr))) #define HDR_EQUAL(spa, dva, birth, hdr) \ ((hdr)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \ ((hdr)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \ ((hdr)->b_birth == birth) && ((hdr)->b_spa == spa) static void buf_discard_identity(arc_buf_hdr_t *hdr) { hdr->b_dva.dva_word[0] = 0; hdr->b_dva.dva_word[1] = 0; hdr->b_birth = 0; } static arc_buf_hdr_t * buf_hash_find(uint64_t spa, const blkptr_t *bp, kmutex_t **lockp) { const dva_t *dva = BP_IDENTITY(bp); uint64_t birth = BP_PHYSICAL_BIRTH(bp); uint64_t idx = BUF_HASH_INDEX(spa, dva, birth); kmutex_t *hash_lock = BUF_HASH_LOCK(idx); arc_buf_hdr_t *hdr; mutex_enter(hash_lock); for (hdr = buf_hash_table.ht_table[idx]; hdr != NULL; hdr = hdr->b_hash_next) { if (HDR_EQUAL(spa, dva, birth, hdr)) { *lockp = hash_lock; return (hdr); } } mutex_exit(hash_lock); *lockp = NULL; return (NULL); } /* * Insert an entry into the hash table. If there is already an element * equal to elem in the hash table, then the already existing element * will be returned and the new element will not be inserted. * Otherwise returns NULL. * If lockp == NULL, the caller is assumed to already hold the hash lock. */ static arc_buf_hdr_t * buf_hash_insert(arc_buf_hdr_t *hdr, kmutex_t **lockp) { uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth); kmutex_t *hash_lock = BUF_HASH_LOCK(idx); arc_buf_hdr_t *fhdr; uint32_t i; ASSERT(!DVA_IS_EMPTY(&hdr->b_dva)); ASSERT(hdr->b_birth != 0); ASSERT(!HDR_IN_HASH_TABLE(hdr)); if (lockp != NULL) { *lockp = hash_lock; mutex_enter(hash_lock); } else { ASSERT(MUTEX_HELD(hash_lock)); } for (fhdr = buf_hash_table.ht_table[idx], i = 0; fhdr != NULL; fhdr = fhdr->b_hash_next, i++) { if (HDR_EQUAL(hdr->b_spa, &hdr->b_dva, hdr->b_birth, fhdr)) return (fhdr); } hdr->b_hash_next = buf_hash_table.ht_table[idx]; buf_hash_table.ht_table[idx] = hdr; arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE); /* collect some hash table performance data */ if (i > 0) { ARCSTAT_BUMP(arcstat_hash_collisions); if (i == 1) ARCSTAT_BUMP(arcstat_hash_chains); ARCSTAT_MAX(arcstat_hash_chain_max, i); } ARCSTAT_BUMP(arcstat_hash_elements); ARCSTAT_MAXSTAT(arcstat_hash_elements); return (NULL); } static void buf_hash_remove(arc_buf_hdr_t *hdr) { arc_buf_hdr_t *fhdr, **hdrp; uint64_t idx = BUF_HASH_INDEX(hdr->b_spa, &hdr->b_dva, hdr->b_birth); ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx))); ASSERT(HDR_IN_HASH_TABLE(hdr)); hdrp = &buf_hash_table.ht_table[idx]; while ((fhdr = *hdrp) != hdr) { ASSERT3P(fhdr, !=, NULL); hdrp = &fhdr->b_hash_next; } *hdrp = hdr->b_hash_next; hdr->b_hash_next = NULL; arc_hdr_clear_flags(hdr, ARC_FLAG_IN_HASH_TABLE); /* collect some hash table performance data */ ARCSTAT_BUMPDOWN(arcstat_hash_elements); if (buf_hash_table.ht_table[idx] && buf_hash_table.ht_table[idx]->b_hash_next == NULL) ARCSTAT_BUMPDOWN(arcstat_hash_chains); } /* * Global data structures and functions for the buf kmem cache. */ static kmem_cache_t *hdr_full_cache; static kmem_cache_t *hdr_full_crypt_cache; static kmem_cache_t *hdr_l2only_cache; static kmem_cache_t *buf_cache; static void buf_fini(void) { int i; #if defined(_KERNEL) /* * Large allocations which do not require contiguous pages * should be using vmem_free() in the linux kernel\ */ vmem_free(buf_hash_table.ht_table, (buf_hash_table.ht_mask + 1) * sizeof (void *)); #else kmem_free(buf_hash_table.ht_table, (buf_hash_table.ht_mask + 1) * sizeof (void *)); #endif for (i = 0; i < BUF_LOCKS; i++) mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock); kmem_cache_destroy(hdr_full_cache); kmem_cache_destroy(hdr_full_crypt_cache); kmem_cache_destroy(hdr_l2only_cache); kmem_cache_destroy(buf_cache); } /* * Constructor callback - called when the cache is empty * and a new buf is requested. */ /* ARGSUSED */ static int hdr_full_cons(void *vbuf, void *unused, int kmflag) { arc_buf_hdr_t *hdr = vbuf; bzero(hdr, HDR_FULL_SIZE); hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; cv_init(&hdr->b_l1hdr.b_cv, NULL, CV_DEFAULT, NULL); zfs_refcount_create(&hdr->b_l1hdr.b_refcnt); mutex_init(&hdr->b_l1hdr.b_freeze_lock, NULL, MUTEX_DEFAULT, NULL); list_link_init(&hdr->b_l1hdr.b_arc_node); list_link_init(&hdr->b_l2hdr.b_l2node); multilist_link_init(&hdr->b_l1hdr.b_arc_node); arc_space_consume(HDR_FULL_SIZE, ARC_SPACE_HDRS); return (0); } /* ARGSUSED */ static int hdr_full_crypt_cons(void *vbuf, void *unused, int kmflag) { arc_buf_hdr_t *hdr = vbuf; hdr_full_cons(vbuf, unused, kmflag); bzero(&hdr->b_crypt_hdr, sizeof (hdr->b_crypt_hdr)); arc_space_consume(sizeof (hdr->b_crypt_hdr), ARC_SPACE_HDRS); return (0); } /* ARGSUSED */ static int hdr_l2only_cons(void *vbuf, void *unused, int kmflag) { arc_buf_hdr_t *hdr = vbuf; bzero(hdr, HDR_L2ONLY_SIZE); arc_space_consume(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); return (0); } /* ARGSUSED */ static int buf_cons(void *vbuf, void *unused, int kmflag) { arc_buf_t *buf = vbuf; bzero(buf, sizeof (arc_buf_t)); mutex_init(&buf->b_evict_lock, NULL, MUTEX_DEFAULT, NULL); arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS); return (0); } /* * Destructor callback - called when a cached buf is * no longer required. */ /* ARGSUSED */ static void hdr_full_dest(void *vbuf, void *unused) { arc_buf_hdr_t *hdr = vbuf; ASSERT(HDR_EMPTY(hdr)); cv_destroy(&hdr->b_l1hdr.b_cv); zfs_refcount_destroy(&hdr->b_l1hdr.b_refcnt); mutex_destroy(&hdr->b_l1hdr.b_freeze_lock); ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); arc_space_return(HDR_FULL_SIZE, ARC_SPACE_HDRS); } /* ARGSUSED */ static void hdr_full_crypt_dest(void *vbuf, void *unused) { arc_buf_hdr_t *hdr = vbuf; hdr_full_dest(vbuf, unused); arc_space_return(sizeof (hdr->b_crypt_hdr), ARC_SPACE_HDRS); } /* ARGSUSED */ static void hdr_l2only_dest(void *vbuf, void *unused) { arc_buf_hdr_t *hdr __maybe_unused = vbuf; ASSERT(HDR_EMPTY(hdr)); arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS); } /* ARGSUSED */ static void buf_dest(void *vbuf, void *unused) { arc_buf_t *buf = vbuf; mutex_destroy(&buf->b_evict_lock); arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS); } static void buf_init(void) { uint64_t *ct = NULL; uint64_t hsize = 1ULL << 12; int i, j; /* * The hash table is big enough to fill all of physical memory * with an average block size of zfs_arc_average_blocksize (default 8K). * By default, the table will take up * totalmem * sizeof(void*) / 8K (1MB per GB with 8-byte pointers). */ while (hsize * zfs_arc_average_blocksize < arc_all_memory()) hsize <<= 1; retry: buf_hash_table.ht_mask = hsize - 1; #if defined(_KERNEL) /* * Large allocations which do not require contiguous pages * should be using vmem_alloc() in the linux kernel */ buf_hash_table.ht_table = vmem_zalloc(hsize * sizeof (void*), KM_SLEEP); #else buf_hash_table.ht_table = kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP); #endif if (buf_hash_table.ht_table == NULL) { ASSERT(hsize > (1ULL << 8)); hsize >>= 1; goto retry; } hdr_full_cache = kmem_cache_create("arc_buf_hdr_t_full", HDR_FULL_SIZE, 0, hdr_full_cons, hdr_full_dest, NULL, NULL, NULL, 0); hdr_full_crypt_cache = kmem_cache_create("arc_buf_hdr_t_full_crypt", HDR_FULL_CRYPT_SIZE, 0, hdr_full_crypt_cons, hdr_full_crypt_dest, NULL, NULL, NULL, 0); hdr_l2only_cache = kmem_cache_create("arc_buf_hdr_t_l2only", HDR_L2ONLY_SIZE, 0, hdr_l2only_cons, hdr_l2only_dest, NULL, NULL, NULL, 0); buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t), 0, buf_cons, buf_dest, NULL, NULL, NULL, 0); for (i = 0; i < 256; i++) for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--) *ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY); for (i = 0; i < BUF_LOCKS; i++) { mutex_init(&buf_hash_table.ht_locks[i].ht_lock, NULL, MUTEX_DEFAULT, NULL); } } #define ARC_MINTIME (hz>>4) /* 62 ms */ /* * This is the size that the buf occupies in memory. If the buf is compressed, * it will correspond to the compressed size. You should use this method of * getting the buf size unless you explicitly need the logical size. */ uint64_t arc_buf_size(arc_buf_t *buf) { return (ARC_BUF_COMPRESSED(buf) ? HDR_GET_PSIZE(buf->b_hdr) : HDR_GET_LSIZE(buf->b_hdr)); } uint64_t arc_buf_lsize(arc_buf_t *buf) { return (HDR_GET_LSIZE(buf->b_hdr)); } /* * This function will return B_TRUE if the buffer is encrypted in memory. * This buffer can be decrypted by calling arc_untransform(). */ boolean_t arc_is_encrypted(arc_buf_t *buf) { return (ARC_BUF_ENCRYPTED(buf) != 0); } /* * Returns B_TRUE if the buffer represents data that has not had its MAC * verified yet. */ boolean_t arc_is_unauthenticated(arc_buf_t *buf) { return (HDR_NOAUTH(buf->b_hdr) != 0); } void arc_get_raw_params(arc_buf_t *buf, boolean_t *byteorder, uint8_t *salt, uint8_t *iv, uint8_t *mac) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT(HDR_PROTECTED(hdr)); bcopy(hdr->b_crypt_hdr.b_salt, salt, ZIO_DATA_SALT_LEN); bcopy(hdr->b_crypt_hdr.b_iv, iv, ZIO_DATA_IV_LEN); bcopy(hdr->b_crypt_hdr.b_mac, mac, ZIO_DATA_MAC_LEN); *byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ? ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER; } /* * Indicates how this buffer is compressed in memory. If it is not compressed * the value will be ZIO_COMPRESS_OFF. It can be made normally readable with * arc_untransform() as long as it is also unencrypted. */ enum zio_compress arc_get_compression(arc_buf_t *buf) { return (ARC_BUF_COMPRESSED(buf) ? HDR_GET_COMPRESS(buf->b_hdr) : ZIO_COMPRESS_OFF); } /* * Return the compression algorithm used to store this data in the ARC. If ARC * compression is enabled or this is an encrypted block, this will be the same * as what's used to store it on-disk. Otherwise, this will be ZIO_COMPRESS_OFF. */ static inline enum zio_compress arc_hdr_get_compress(arc_buf_hdr_t *hdr) { return (HDR_COMPRESSION_ENABLED(hdr) ? HDR_GET_COMPRESS(hdr) : ZIO_COMPRESS_OFF); } uint8_t arc_get_complevel(arc_buf_t *buf) { return (buf->b_hdr->b_complevel); } static inline boolean_t arc_buf_is_shared(arc_buf_t *buf) { boolean_t shared = (buf->b_data != NULL && buf->b_hdr->b_l1hdr.b_pabd != NULL && abd_is_linear(buf->b_hdr->b_l1hdr.b_pabd) && buf->b_data == abd_to_buf(buf->b_hdr->b_l1hdr.b_pabd)); IMPLY(shared, HDR_SHARED_DATA(buf->b_hdr)); IMPLY(shared, ARC_BUF_SHARED(buf)); IMPLY(shared, ARC_BUF_COMPRESSED(buf) || ARC_BUF_LAST(buf)); /* * It would be nice to assert arc_can_share() too, but the "hdr isn't * already being shared" requirement prevents us from doing that. */ return (shared); } /* * Free the checksum associated with this header. If there is no checksum, this * is a no-op. */ static inline void arc_cksum_free(arc_buf_hdr_t *hdr) { ASSERT(HDR_HAS_L1HDR(hdr)); mutex_enter(&hdr->b_l1hdr.b_freeze_lock); if (hdr->b_l1hdr.b_freeze_cksum != NULL) { kmem_free(hdr->b_l1hdr.b_freeze_cksum, sizeof (zio_cksum_t)); hdr->b_l1hdr.b_freeze_cksum = NULL; } mutex_exit(&hdr->b_l1hdr.b_freeze_lock); } /* * Return true iff at least one of the bufs on hdr is not compressed. * Encrypted buffers count as compressed. */ static boolean_t arc_hdr_has_uncompressed_buf(arc_buf_hdr_t *hdr) { ASSERT(hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY_OR_LOCKED(hdr)); for (arc_buf_t *b = hdr->b_l1hdr.b_buf; b != NULL; b = b->b_next) { if (!ARC_BUF_COMPRESSED(b)) { return (B_TRUE); } } return (B_FALSE); } /* * If we've turned on the ZFS_DEBUG_MODIFY flag, verify that the buf's data * matches the checksum that is stored in the hdr. If there is no checksum, * or if the buf is compressed, this is a no-op. */ static void arc_cksum_verify(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; zio_cksum_t zc; if (!(zfs_flags & ZFS_DEBUG_MODIFY)) return; if (ARC_BUF_COMPRESSED(buf)) return; ASSERT(HDR_HAS_L1HDR(hdr)); mutex_enter(&hdr->b_l1hdr.b_freeze_lock); if (hdr->b_l1hdr.b_freeze_cksum == NULL || HDR_IO_ERROR(hdr)) { mutex_exit(&hdr->b_l1hdr.b_freeze_lock); return; } fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, &zc); if (!ZIO_CHECKSUM_EQUAL(*hdr->b_l1hdr.b_freeze_cksum, zc)) panic("buffer modified while frozen!"); mutex_exit(&hdr->b_l1hdr.b_freeze_lock); } /* * This function makes the assumption that data stored in the L2ARC * will be transformed exactly as it is in the main pool. Because of * this we can verify the checksum against the reading process's bp. */ static boolean_t arc_cksum_is_equal(arc_buf_hdr_t *hdr, zio_t *zio) { ASSERT(!BP_IS_EMBEDDED(zio->io_bp)); VERIFY3U(BP_GET_PSIZE(zio->io_bp), ==, HDR_GET_PSIZE(hdr)); /* * Block pointers always store the checksum for the logical data. * If the block pointer has the gang bit set, then the checksum * it represents is for the reconstituted data and not for an * individual gang member. The zio pipeline, however, must be able to * determine the checksum of each of the gang constituents so it * treats the checksum comparison differently than what we need * for l2arc blocks. This prevents us from using the * zio_checksum_error() interface directly. Instead we must call the * zio_checksum_error_impl() so that we can ensure the checksum is * generated using the correct checksum algorithm and accounts for the * logical I/O size and not just a gang fragment. */ return (zio_checksum_error_impl(zio->io_spa, zio->io_bp, BP_GET_CHECKSUM(zio->io_bp), zio->io_abd, zio->io_size, zio->io_offset, NULL) == 0); } /* * Given a buf full of data, if ZFS_DEBUG_MODIFY is enabled this computes a * checksum and attaches it to the buf's hdr so that we can ensure that the buf * isn't modified later on. If buf is compressed or there is already a checksum * on the hdr, this is a no-op (we only checksum uncompressed bufs). */ static void arc_cksum_compute(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; if (!(zfs_flags & ZFS_DEBUG_MODIFY)) return; ASSERT(HDR_HAS_L1HDR(hdr)); mutex_enter(&buf->b_hdr->b_l1hdr.b_freeze_lock); if (hdr->b_l1hdr.b_freeze_cksum != NULL || ARC_BUF_COMPRESSED(buf)) { mutex_exit(&hdr->b_l1hdr.b_freeze_lock); return; } ASSERT(!ARC_BUF_ENCRYPTED(buf)); ASSERT(!ARC_BUF_COMPRESSED(buf)); hdr->b_l1hdr.b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP); fletcher_2_native(buf->b_data, arc_buf_size(buf), NULL, hdr->b_l1hdr.b_freeze_cksum); mutex_exit(&hdr->b_l1hdr.b_freeze_lock); arc_buf_watch(buf); } #ifndef _KERNEL void arc_buf_sigsegv(int sig, siginfo_t *si, void *unused) { panic("Got SIGSEGV at address: 0x%lx\n", (long)si->si_addr); } #endif /* ARGSUSED */ static void arc_buf_unwatch(arc_buf_t *buf) { #ifndef _KERNEL if (arc_watch) { ASSERT0(mprotect(buf->b_data, arc_buf_size(buf), PROT_READ | PROT_WRITE)); } #endif } /* ARGSUSED */ static void arc_buf_watch(arc_buf_t *buf) { #ifndef _KERNEL if (arc_watch) ASSERT0(mprotect(buf->b_data, arc_buf_size(buf), PROT_READ)); #endif } static arc_buf_contents_t arc_buf_type(arc_buf_hdr_t *hdr) { arc_buf_contents_t type; if (HDR_ISTYPE_METADATA(hdr)) { type = ARC_BUFC_METADATA; } else { type = ARC_BUFC_DATA; } VERIFY3U(hdr->b_type, ==, type); return (type); } boolean_t arc_is_metadata(arc_buf_t *buf) { return (HDR_ISTYPE_METADATA(buf->b_hdr) != 0); } static uint32_t arc_bufc_to_flags(arc_buf_contents_t type) { switch (type) { case ARC_BUFC_DATA: /* metadata field is 0 if buffer contains normal data */ return (0); case ARC_BUFC_METADATA: return (ARC_FLAG_BUFC_METADATA); default: break; } panic("undefined ARC buffer type!"); return ((uint32_t)-1); } void arc_buf_thaw(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); arc_cksum_verify(buf); /* * Compressed buffers do not manipulate the b_freeze_cksum. */ if (ARC_BUF_COMPRESSED(buf)) return; ASSERT(HDR_HAS_L1HDR(hdr)); arc_cksum_free(hdr); arc_buf_unwatch(buf); } void arc_buf_freeze(arc_buf_t *buf) { if (!(zfs_flags & ZFS_DEBUG_MODIFY)) return; if (ARC_BUF_COMPRESSED(buf)) return; ASSERT(HDR_HAS_L1HDR(buf->b_hdr)); arc_cksum_compute(buf); } /* * The arc_buf_hdr_t's b_flags should never be modified directly. Instead, * the following functions should be used to ensure that the flags are * updated in a thread-safe way. When manipulating the flags either * the hash_lock must be held or the hdr must be undiscoverable. This * ensures that we're not racing with any other threads when updating * the flags. */ static inline void arc_hdr_set_flags(arc_buf_hdr_t *hdr, arc_flags_t flags) { ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); hdr->b_flags |= flags; } static inline void arc_hdr_clear_flags(arc_buf_hdr_t *hdr, arc_flags_t flags) { ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); hdr->b_flags &= ~flags; } /* * Setting the compression bits in the arc_buf_hdr_t's b_flags is * done in a special way since we have to clear and set bits * at the same time. Consumers that wish to set the compression bits * must use this function to ensure that the flags are updated in * thread-safe manner. */ static void arc_hdr_set_compress(arc_buf_hdr_t *hdr, enum zio_compress cmp) { ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); /* * Holes and embedded blocks will always have a psize = 0 so * we ignore the compression of the blkptr and set the * want to uncompress them. Mark them as uncompressed. */ if (!zfs_compressed_arc_enabled || HDR_GET_PSIZE(hdr) == 0) { arc_hdr_clear_flags(hdr, ARC_FLAG_COMPRESSED_ARC); ASSERT(!HDR_COMPRESSION_ENABLED(hdr)); } else { arc_hdr_set_flags(hdr, ARC_FLAG_COMPRESSED_ARC); ASSERT(HDR_COMPRESSION_ENABLED(hdr)); } HDR_SET_COMPRESS(hdr, cmp); ASSERT3U(HDR_GET_COMPRESS(hdr), ==, cmp); } /* * Looks for another buf on the same hdr which has the data decompressed, copies * from it, and returns true. If no such buf exists, returns false. */ static boolean_t arc_buf_try_copy_decompressed_data(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; boolean_t copied = B_FALSE; ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT3P(buf->b_data, !=, NULL); ASSERT(!ARC_BUF_COMPRESSED(buf)); for (arc_buf_t *from = hdr->b_l1hdr.b_buf; from != NULL; from = from->b_next) { /* can't use our own data buffer */ if (from == buf) { continue; } if (!ARC_BUF_COMPRESSED(from)) { bcopy(from->b_data, buf->b_data, arc_buf_size(buf)); copied = B_TRUE; break; } } /* * There were no decompressed bufs, so there should not be a * checksum on the hdr either. */ if (zfs_flags & ZFS_DEBUG_MODIFY) EQUIV(!copied, hdr->b_l1hdr.b_freeze_cksum == NULL); return (copied); } /* * Allocates an ARC buf header that's in an evicted & L2-cached state. * This is used during l2arc reconstruction to make empty ARC buffers * which circumvent the regular disk->arc->l2arc path and instead come * into being in the reverse order, i.e. l2arc->arc. */ static arc_buf_hdr_t * arc_buf_alloc_l2only(size_t size, arc_buf_contents_t type, l2arc_dev_t *dev, dva_t dva, uint64_t daddr, int32_t psize, uint64_t birth, enum zio_compress compress, uint8_t complevel, boolean_t protected, boolean_t prefetch) { arc_buf_hdr_t *hdr; ASSERT(size != 0); hdr = kmem_cache_alloc(hdr_l2only_cache, KM_SLEEP); hdr->b_birth = birth; hdr->b_type = type; hdr->b_flags = 0; arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L2HDR); HDR_SET_LSIZE(hdr, size); HDR_SET_PSIZE(hdr, psize); arc_hdr_set_compress(hdr, compress); hdr->b_complevel = complevel; if (protected) arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED); if (prefetch) arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH); hdr->b_spa = spa_load_guid(dev->l2ad_vdev->vdev_spa); hdr->b_dva = dva; hdr->b_l2hdr.b_dev = dev; hdr->b_l2hdr.b_daddr = daddr; return (hdr); } /* * Return the size of the block, b_pabd, that is stored in the arc_buf_hdr_t. */ static uint64_t arc_hdr_size(arc_buf_hdr_t *hdr) { uint64_t size; if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF && HDR_GET_PSIZE(hdr) > 0) { size = HDR_GET_PSIZE(hdr); } else { ASSERT3U(HDR_GET_LSIZE(hdr), !=, 0); size = HDR_GET_LSIZE(hdr); } return (size); } static int arc_hdr_authenticate(arc_buf_hdr_t *hdr, spa_t *spa, uint64_t dsobj) { int ret; uint64_t csize; uint64_t lsize = HDR_GET_LSIZE(hdr); uint64_t psize = HDR_GET_PSIZE(hdr); void *tmpbuf = NULL; abd_t *abd = hdr->b_l1hdr.b_pabd; ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); ASSERT(HDR_AUTHENTICATED(hdr)); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); /* * The MAC is calculated on the compressed data that is stored on disk. * However, if compressed arc is disabled we will only have the * decompressed data available to us now. Compress it into a temporary * abd so we can verify the MAC. The performance overhead of this will * be relatively low, since most objects in an encrypted objset will * be encrypted (instead of authenticated) anyway. */ if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) { tmpbuf = zio_buf_alloc(lsize); abd = abd_get_from_buf(tmpbuf, lsize); abd_take_ownership_of_buf(abd, B_TRUE); csize = zio_compress_data(HDR_GET_COMPRESS(hdr), hdr->b_l1hdr.b_pabd, tmpbuf, lsize, hdr->b_complevel); ASSERT3U(csize, <=, psize); abd_zero_off(abd, csize, psize - csize); } /* * Authentication is best effort. We authenticate whenever the key is * available. If we succeed we clear ARC_FLAG_NOAUTH. */ if (hdr->b_crypt_hdr.b_ot == DMU_OT_OBJSET) { ASSERT3U(HDR_GET_COMPRESS(hdr), ==, ZIO_COMPRESS_OFF); ASSERT3U(lsize, ==, psize); ret = spa_do_crypt_objset_mac_abd(B_FALSE, spa, dsobj, abd, psize, hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS); } else { ret = spa_do_crypt_mac_abd(B_FALSE, spa, dsobj, abd, psize, hdr->b_crypt_hdr.b_mac); } if (ret == 0) arc_hdr_clear_flags(hdr, ARC_FLAG_NOAUTH); else if (ret != ENOENT) goto error; if (tmpbuf != NULL) abd_free(abd); return (0); error: if (tmpbuf != NULL) abd_free(abd); return (ret); } /* * This function will take a header that only has raw encrypted data in * b_crypt_hdr.b_rabd and decrypt it into a new buffer which is stored in * b_l1hdr.b_pabd. If designated in the header flags, this function will * also decompress the data. */ static int arc_hdr_decrypt(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb) { int ret; abd_t *cabd = NULL; void *tmp = NULL; boolean_t no_crypt = B_FALSE; boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS); ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); ASSERT(HDR_ENCRYPTED(hdr)); arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT); ret = spa_do_crypt_abd(B_FALSE, spa, zb, hdr->b_crypt_hdr.b_ot, B_FALSE, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv, hdr->b_crypt_hdr.b_mac, HDR_GET_PSIZE(hdr), hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd, &no_crypt); if (ret != 0) goto error; if (no_crypt) { abd_copy(hdr->b_l1hdr.b_pabd, hdr->b_crypt_hdr.b_rabd, HDR_GET_PSIZE(hdr)); } /* * If this header has disabled arc compression but the b_pabd is * compressed after decrypting it, we need to decompress the newly * decrypted data. */ if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) { /* * We want to make sure that we are correctly honoring the * zfs_abd_scatter_enabled setting, so we allocate an abd here * and then loan a buffer from it, rather than allocating a * linear buffer and wrapping it in an abd later. */ cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, B_TRUE); tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr)); ret = zio_decompress_data(HDR_GET_COMPRESS(hdr), hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr), &hdr->b_complevel); if (ret != 0) { abd_return_buf(cabd, tmp, arc_hdr_size(hdr)); goto error; } abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr)); arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, arc_hdr_size(hdr), hdr); hdr->b_l1hdr.b_pabd = cabd; } return (0); error: arc_hdr_free_abd(hdr, B_FALSE); if (cabd != NULL) arc_free_data_buf(hdr, cabd, arc_hdr_size(hdr), hdr); return (ret); } /* * This function is called during arc_buf_fill() to prepare the header's * abd plaintext pointer for use. This involves authenticated protected * data and decrypting encrypted data into the plaintext abd. */ static int arc_fill_hdr_crypt(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, spa_t *spa, const zbookmark_phys_t *zb, boolean_t noauth) { int ret; ASSERT(HDR_PROTECTED(hdr)); if (hash_lock != NULL) mutex_enter(hash_lock); if (HDR_NOAUTH(hdr) && !noauth) { /* * The caller requested authenticated data but our data has * not been authenticated yet. Verify the MAC now if we can. */ ret = arc_hdr_authenticate(hdr, spa, zb->zb_objset); if (ret != 0) goto error; } else if (HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd == NULL) { /* * If we only have the encrypted version of the data, but the * unencrypted version was requested we take this opportunity * to store the decrypted version in the header for future use. */ ret = arc_hdr_decrypt(hdr, spa, zb); if (ret != 0) goto error; } ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); if (hash_lock != NULL) mutex_exit(hash_lock); return (0); error: if (hash_lock != NULL) mutex_exit(hash_lock); return (ret); } /* * This function is used by the dbuf code to decrypt bonus buffers in place. * The dbuf code itself doesn't have any locking for decrypting a shared dnode * block, so we use the hash lock here to protect against concurrent calls to * arc_buf_fill(). */ static void arc_buf_untransform_in_place(arc_buf_t *buf, kmutex_t *hash_lock) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT(HDR_ENCRYPTED(hdr)); ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE); ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); zio_crypt_copy_dnode_bonus(hdr->b_l1hdr.b_pabd, buf->b_data, arc_buf_size(buf)); buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED; buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; hdr->b_crypt_hdr.b_ebufcnt -= 1; } /* * Given a buf that has a data buffer attached to it, this function will * efficiently fill the buf with data of the specified compression setting from * the hdr and update the hdr's b_freeze_cksum if necessary. If the buf and hdr * are already sharing a data buf, no copy is performed. * * If the buf is marked as compressed but uncompressed data was requested, this * will allocate a new data buffer for the buf, remove that flag, and fill the * buf with uncompressed data. You can't request a compressed buf on a hdr with * uncompressed data, and (since we haven't added support for it yet) if you * want compressed data your buf must already be marked as compressed and have * the correct-sized data buffer. */ static int arc_buf_fill(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb, arc_fill_flags_t flags) { int error = 0; arc_buf_hdr_t *hdr = buf->b_hdr; boolean_t hdr_compressed = (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF); boolean_t compressed = (flags & ARC_FILL_COMPRESSED) != 0; boolean_t encrypted = (flags & ARC_FILL_ENCRYPTED) != 0; dmu_object_byteswap_t bswap = hdr->b_l1hdr.b_byteswap; kmutex_t *hash_lock = (flags & ARC_FILL_LOCKED) ? NULL : HDR_LOCK(hdr); ASSERT3P(buf->b_data, !=, NULL); IMPLY(compressed, hdr_compressed || ARC_BUF_ENCRYPTED(buf)); IMPLY(compressed, ARC_BUF_COMPRESSED(buf)); IMPLY(encrypted, HDR_ENCRYPTED(hdr)); IMPLY(encrypted, ARC_BUF_ENCRYPTED(buf)); IMPLY(encrypted, ARC_BUF_COMPRESSED(buf)); IMPLY(encrypted, !ARC_BUF_SHARED(buf)); /* * If the caller wanted encrypted data we just need to copy it from * b_rabd and potentially byteswap it. We won't be able to do any * further transforms on it. */ if (encrypted) { ASSERT(HDR_HAS_RABD(hdr)); abd_copy_to_buf(buf->b_data, hdr->b_crypt_hdr.b_rabd, HDR_GET_PSIZE(hdr)); goto byteswap; } /* * Adjust encrypted and authenticated headers to accommodate * the request if needed. Dnode blocks (ARC_FILL_IN_PLACE) are * allowed to fail decryption due to keys not being loaded * without being marked as an IO error. */ if (HDR_PROTECTED(hdr)) { error = arc_fill_hdr_crypt(hdr, hash_lock, spa, zb, !!(flags & ARC_FILL_NOAUTH)); if (error == EACCES && (flags & ARC_FILL_IN_PLACE) != 0) { return (error); } else if (error != 0) { if (hash_lock != NULL) mutex_enter(hash_lock); arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR); if (hash_lock != NULL) mutex_exit(hash_lock); return (error); } } /* * There is a special case here for dnode blocks which are * decrypting their bonus buffers. These blocks may request to * be decrypted in-place. This is necessary because there may * be many dnodes pointing into this buffer and there is * currently no method to synchronize replacing the backing * b_data buffer and updating all of the pointers. Here we use * the hash lock to ensure there are no races. If the need * arises for other types to be decrypted in-place, they must * add handling here as well. */ if ((flags & ARC_FILL_IN_PLACE) != 0) { ASSERT(!hdr_compressed); ASSERT(!compressed); ASSERT(!encrypted); if (HDR_ENCRYPTED(hdr) && ARC_BUF_ENCRYPTED(buf)) { ASSERT3U(hdr->b_crypt_hdr.b_ot, ==, DMU_OT_DNODE); if (hash_lock != NULL) mutex_enter(hash_lock); arc_buf_untransform_in_place(buf, hash_lock); if (hash_lock != NULL) mutex_exit(hash_lock); /* Compute the hdr's checksum if necessary */ arc_cksum_compute(buf); } return (0); } if (hdr_compressed == compressed) { if (!arc_buf_is_shared(buf)) { abd_copy_to_buf(buf->b_data, hdr->b_l1hdr.b_pabd, arc_buf_size(buf)); } } else { ASSERT(hdr_compressed); ASSERT(!compressed); ASSERT3U(HDR_GET_LSIZE(hdr), !=, HDR_GET_PSIZE(hdr)); /* * If the buf is sharing its data with the hdr, unlink it and * allocate a new data buffer for the buf. */ if (arc_buf_is_shared(buf)) { ASSERT(ARC_BUF_COMPRESSED(buf)); /* We need to give the buf its own b_data */ buf->b_flags &= ~ARC_BUF_FLAG_SHARED; buf->b_data = arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf); arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA); /* Previously overhead was 0; just add new overhead */ ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr)); } else if (ARC_BUF_COMPRESSED(buf)) { /* We need to reallocate the buf's b_data */ arc_free_data_buf(hdr, buf->b_data, HDR_GET_PSIZE(hdr), buf); buf->b_data = arc_get_data_buf(hdr, HDR_GET_LSIZE(hdr), buf); /* We increased the size of b_data; update overhead */ ARCSTAT_INCR(arcstat_overhead_size, HDR_GET_LSIZE(hdr) - HDR_GET_PSIZE(hdr)); } /* * Regardless of the buf's previous compression settings, it * should not be compressed at the end of this function. */ buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; /* * Try copying the data from another buf which already has a * decompressed version. If that's not possible, it's time to * bite the bullet and decompress the data from the hdr. */ if (arc_buf_try_copy_decompressed_data(buf)) { /* Skip byteswapping and checksumming (already done) */ return (0); } else { error = zio_decompress_data(HDR_GET_COMPRESS(hdr), hdr->b_l1hdr.b_pabd, buf->b_data, HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr), &hdr->b_complevel); /* * Absent hardware errors or software bugs, this should * be impossible, but log it anyway so we can debug it. */ if (error != 0) { zfs_dbgmsg( "hdr %px, compress %d, psize %d, lsize %d", hdr, arc_hdr_get_compress(hdr), HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr)); if (hash_lock != NULL) mutex_enter(hash_lock); arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR); if (hash_lock != NULL) mutex_exit(hash_lock); return (SET_ERROR(EIO)); } } } byteswap: /* Byteswap the buf's data if necessary */ if (bswap != DMU_BSWAP_NUMFUNCS) { ASSERT(!HDR_SHARED_DATA(hdr)); ASSERT3U(bswap, <, DMU_BSWAP_NUMFUNCS); dmu_ot_byteswap[bswap].ob_func(buf->b_data, HDR_GET_LSIZE(hdr)); } /* Compute the hdr's checksum if necessary */ arc_cksum_compute(buf); return (0); } /* * If this function is being called to decrypt an encrypted buffer or verify an * authenticated one, the key must be loaded and a mapping must be made * available in the keystore via spa_keystore_create_mapping() or one of its * callers. */ int arc_untransform(arc_buf_t *buf, spa_t *spa, const zbookmark_phys_t *zb, boolean_t in_place) { int ret; arc_fill_flags_t flags = 0; if (in_place) flags |= ARC_FILL_IN_PLACE; ret = arc_buf_fill(buf, spa, zb, flags); if (ret == ECKSUM) { /* * Convert authentication and decryption errors to EIO * (and generate an ereport) before leaving the ARC. */ ret = SET_ERROR(EIO); spa_log_error(spa, zb); (void) zfs_ereport_post(FM_EREPORT_ZFS_AUTHENTICATION, spa, NULL, zb, NULL, 0); } return (ret); } /* * Increment the amount of evictable space in the arc_state_t's refcount. * We account for the space used by the hdr and the arc buf individually * so that we can add and remove them from the refcount individually. */ static void arc_evictable_space_increment(arc_buf_hdr_t *hdr, arc_state_t *state) { arc_buf_contents_t type = arc_buf_type(hdr); ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(state)) { ASSERT0(hdr->b_l1hdr.b_bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); (void) zfs_refcount_add_many(&state->arcs_esize[type], HDR_GET_LSIZE(hdr), hdr); return; } ASSERT(!GHOST_STATE(state)); if (hdr->b_l1hdr.b_pabd != NULL) { (void) zfs_refcount_add_many(&state->arcs_esize[type], arc_hdr_size(hdr), hdr); } if (HDR_HAS_RABD(hdr)) { (void) zfs_refcount_add_many(&state->arcs_esize[type], HDR_GET_PSIZE(hdr), hdr); } for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { if (arc_buf_is_shared(buf)) continue; (void) zfs_refcount_add_many(&state->arcs_esize[type], arc_buf_size(buf), buf); } } /* * Decrement the amount of evictable space in the arc_state_t's refcount. * We account for the space used by the hdr and the arc buf individually * so that we can add and remove them from the refcount individually. */ static void arc_evictable_space_decrement(arc_buf_hdr_t *hdr, arc_state_t *state) { arc_buf_contents_t type = arc_buf_type(hdr); ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(state)) { ASSERT0(hdr->b_l1hdr.b_bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); (void) zfs_refcount_remove_many(&state->arcs_esize[type], HDR_GET_LSIZE(hdr), hdr); return; } ASSERT(!GHOST_STATE(state)); if (hdr->b_l1hdr.b_pabd != NULL) { (void) zfs_refcount_remove_many(&state->arcs_esize[type], arc_hdr_size(hdr), hdr); } if (HDR_HAS_RABD(hdr)) { (void) zfs_refcount_remove_many(&state->arcs_esize[type], HDR_GET_PSIZE(hdr), hdr); } for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { if (arc_buf_is_shared(buf)) continue; (void) zfs_refcount_remove_many(&state->arcs_esize[type], arc_buf_size(buf), buf); } } /* * Add a reference to this hdr indicating that someone is actively * referencing that memory. When the refcount transitions from 0 to 1, * we remove it from the respective arc_state_t list to indicate that * it is not evictable. */ static void add_reference(arc_buf_hdr_t *hdr, void *tag) { arc_state_t *state; ASSERT(HDR_HAS_L1HDR(hdr)); if (!HDR_EMPTY(hdr) && !MUTEX_HELD(HDR_LOCK(hdr))) { ASSERT(hdr->b_l1hdr.b_state == arc_anon); ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); } state = hdr->b_l1hdr.b_state; if ((zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag) == 1) && (state != arc_anon)) { /* We don't use the L2-only state list. */ if (state != arc_l2c_only) { multilist_remove(state->arcs_list[arc_buf_type(hdr)], hdr); arc_evictable_space_decrement(hdr, state); } /* remove the prefetch flag if we get a reference */ arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH); } } /* * Remove a reference from this hdr. When the reference transitions from * 1 to 0 and we're not anonymous, then we add this hdr to the arc_state_t's * list making it eligible for eviction. */ static int remove_reference(arc_buf_hdr_t *hdr, kmutex_t *hash_lock, void *tag) { int cnt; arc_state_t *state = hdr->b_l1hdr.b_state; ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(state == arc_anon || MUTEX_HELD(hash_lock)); ASSERT(!GHOST_STATE(state)); /* * arc_l2c_only counts as a ghost state so we don't need to explicitly * check to prevent usage of the arc_l2c_only list. */ if (((cnt = zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag)) == 0) && (state != arc_anon)) { multilist_insert(state->arcs_list[arc_buf_type(hdr)], hdr); ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0); arc_evictable_space_increment(hdr, state); } return (cnt); } /* * Returns detailed information about a specific arc buffer. When the * state_index argument is set the function will calculate the arc header * list position for its arc state. Since this requires a linear traversal * callers are strongly encourage not to do this. However, it can be helpful * for targeted analysis so the functionality is provided. */ void arc_buf_info(arc_buf_t *ab, arc_buf_info_t *abi, int state_index) { arc_buf_hdr_t *hdr = ab->b_hdr; l1arc_buf_hdr_t *l1hdr = NULL; l2arc_buf_hdr_t *l2hdr = NULL; arc_state_t *state = NULL; memset(abi, 0, sizeof (arc_buf_info_t)); if (hdr == NULL) return; abi->abi_flags = hdr->b_flags; if (HDR_HAS_L1HDR(hdr)) { l1hdr = &hdr->b_l1hdr; state = l1hdr->b_state; } if (HDR_HAS_L2HDR(hdr)) l2hdr = &hdr->b_l2hdr; if (l1hdr) { abi->abi_bufcnt = l1hdr->b_bufcnt; abi->abi_access = l1hdr->b_arc_access; abi->abi_mru_hits = l1hdr->b_mru_hits; abi->abi_mru_ghost_hits = l1hdr->b_mru_ghost_hits; abi->abi_mfu_hits = l1hdr->b_mfu_hits; abi->abi_mfu_ghost_hits = l1hdr->b_mfu_ghost_hits; abi->abi_holds = zfs_refcount_count(&l1hdr->b_refcnt); } if (l2hdr) { abi->abi_l2arc_dattr = l2hdr->b_daddr; abi->abi_l2arc_hits = l2hdr->b_hits; } abi->abi_state_type = state ? state->arcs_state : ARC_STATE_ANON; abi->abi_state_contents = arc_buf_type(hdr); abi->abi_size = arc_hdr_size(hdr); } /* * Move the supplied buffer to the indicated state. The hash lock * for the buffer must be held by the caller. */ static void arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *hdr, kmutex_t *hash_lock) { arc_state_t *old_state; int64_t refcnt; uint32_t bufcnt; boolean_t update_old, update_new; arc_buf_contents_t buftype = arc_buf_type(hdr); /* * We almost always have an L1 hdr here, since we call arc_hdr_realloc() * in arc_read() when bringing a buffer out of the L2ARC. However, the * L1 hdr doesn't always exist when we change state to arc_anon before * destroying a header, in which case reallocating to add the L1 hdr is * pointless. */ if (HDR_HAS_L1HDR(hdr)) { old_state = hdr->b_l1hdr.b_state; refcnt = zfs_refcount_count(&hdr->b_l1hdr.b_refcnt); bufcnt = hdr->b_l1hdr.b_bufcnt; update_old = (bufcnt > 0 || hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); } else { old_state = arc_l2c_only; refcnt = 0; bufcnt = 0; update_old = B_FALSE; } update_new = update_old; ASSERT(MUTEX_HELD(hash_lock)); ASSERT3P(new_state, !=, old_state); ASSERT(!GHOST_STATE(new_state) || bufcnt == 0); ASSERT(old_state != arc_anon || bufcnt <= 1); /* * If this buffer is evictable, transfer it from the * old state list to the new state list. */ if (refcnt == 0) { if (old_state != arc_anon && old_state != arc_l2c_only) { ASSERT(HDR_HAS_L1HDR(hdr)); multilist_remove(old_state->arcs_list[buftype], hdr); if (GHOST_STATE(old_state)) { ASSERT0(bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); update_old = B_TRUE; } arc_evictable_space_decrement(hdr, old_state); } if (new_state != arc_anon && new_state != arc_l2c_only) { /* * An L1 header always exists here, since if we're * moving to some L1-cached state (i.e. not l2c_only or * anonymous), we realloc the header to add an L1hdr * beforehand. */ ASSERT(HDR_HAS_L1HDR(hdr)); multilist_insert(new_state->arcs_list[buftype], hdr); if (GHOST_STATE(new_state)) { ASSERT0(bufcnt); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); update_new = B_TRUE; } arc_evictable_space_increment(hdr, new_state); } } ASSERT(!HDR_EMPTY(hdr)); if (new_state == arc_anon && HDR_IN_HASH_TABLE(hdr)) buf_hash_remove(hdr); /* adjust state sizes (ignore arc_l2c_only) */ if (update_new && new_state != arc_l2c_only) { ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(new_state)) { ASSERT0(bufcnt); /* * When moving a header to a ghost state, we first * remove all arc buffers. Thus, we'll have a * bufcnt of zero, and no arc buffer to use for * the reference. As a result, we use the arc * header pointer for the reference. */ (void) zfs_refcount_add_many(&new_state->arcs_size, HDR_GET_LSIZE(hdr), hdr); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); } else { uint32_t buffers = 0; /* * Each individual buffer holds a unique reference, * thus we must remove each of these references one * at a time. */ for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { ASSERT3U(bufcnt, !=, 0); buffers++; /* * When the arc_buf_t is sharing the data * block with the hdr, the owner of the * reference belongs to the hdr. Only * add to the refcount if the arc_buf_t is * not shared. */ if (arc_buf_is_shared(buf)) continue; (void) zfs_refcount_add_many( &new_state->arcs_size, arc_buf_size(buf), buf); } ASSERT3U(bufcnt, ==, buffers); if (hdr->b_l1hdr.b_pabd != NULL) { (void) zfs_refcount_add_many( &new_state->arcs_size, arc_hdr_size(hdr), hdr); } if (HDR_HAS_RABD(hdr)) { (void) zfs_refcount_add_many( &new_state->arcs_size, HDR_GET_PSIZE(hdr), hdr); } } } if (update_old && old_state != arc_l2c_only) { ASSERT(HDR_HAS_L1HDR(hdr)); if (GHOST_STATE(old_state)) { ASSERT0(bufcnt); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); /* * When moving a header off of a ghost state, * the header will not contain any arc buffers. * We use the arc header pointer for the reference * which is exactly what we did when we put the * header on the ghost state. */ (void) zfs_refcount_remove_many(&old_state->arcs_size, HDR_GET_LSIZE(hdr), hdr); } else { uint32_t buffers = 0; /* * Each individual buffer holds a unique reference, * thus we must remove each of these references one * at a time. */ for (arc_buf_t *buf = hdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { ASSERT3U(bufcnt, !=, 0); buffers++; /* * When the arc_buf_t is sharing the data * block with the hdr, the owner of the * reference belongs to the hdr. Only * add to the refcount if the arc_buf_t is * not shared. */ if (arc_buf_is_shared(buf)) continue; (void) zfs_refcount_remove_many( &old_state->arcs_size, arc_buf_size(buf), buf); } ASSERT3U(bufcnt, ==, buffers); ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); if (hdr->b_l1hdr.b_pabd != NULL) { (void) zfs_refcount_remove_many( &old_state->arcs_size, arc_hdr_size(hdr), hdr); } if (HDR_HAS_RABD(hdr)) { (void) zfs_refcount_remove_many( &old_state->arcs_size, HDR_GET_PSIZE(hdr), hdr); } } } if (HDR_HAS_L1HDR(hdr)) hdr->b_l1hdr.b_state = new_state; /* * L2 headers should never be on the L2 state list since they don't * have L1 headers allocated. */ ASSERT(multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_DATA]) && multilist_is_empty(arc_l2c_only->arcs_list[ARC_BUFC_METADATA])); } void arc_space_consume(uint64_t space, arc_space_type_t type) { ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES); switch (type) { default: break; case ARC_SPACE_DATA: aggsum_add(&astat_data_size, space); break; case ARC_SPACE_META: aggsum_add(&astat_metadata_size, space); break; case ARC_SPACE_BONUS: aggsum_add(&astat_bonus_size, space); break; case ARC_SPACE_DNODE: aggsum_add(&astat_dnode_size, space); break; case ARC_SPACE_DBUF: aggsum_add(&astat_dbuf_size, space); break; case ARC_SPACE_HDRS: aggsum_add(&astat_hdr_size, space); break; case ARC_SPACE_L2HDRS: aggsum_add(&astat_l2_hdr_size, space); break; case ARC_SPACE_ABD_CHUNK_WASTE: /* * Note: this includes space wasted by all scatter ABD's, not * just those allocated by the ARC. But the vast majority of * scatter ABD's come from the ARC, because other users are * very short-lived. */ aggsum_add(&astat_abd_chunk_waste_size, space); break; } if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE) aggsum_add(&arc_meta_used, space); aggsum_add(&arc_size, space); } void arc_space_return(uint64_t space, arc_space_type_t type) { ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES); switch (type) { default: break; case ARC_SPACE_DATA: aggsum_add(&astat_data_size, -space); break; case ARC_SPACE_META: aggsum_add(&astat_metadata_size, -space); break; case ARC_SPACE_BONUS: aggsum_add(&astat_bonus_size, -space); break; case ARC_SPACE_DNODE: aggsum_add(&astat_dnode_size, -space); break; case ARC_SPACE_DBUF: aggsum_add(&astat_dbuf_size, -space); break; case ARC_SPACE_HDRS: aggsum_add(&astat_hdr_size, -space); break; case ARC_SPACE_L2HDRS: aggsum_add(&astat_l2_hdr_size, -space); break; case ARC_SPACE_ABD_CHUNK_WASTE: aggsum_add(&astat_abd_chunk_waste_size, -space); break; } if (type != ARC_SPACE_DATA && type != ARC_SPACE_ABD_CHUNK_WASTE) { ASSERT(aggsum_compare(&arc_meta_used, space) >= 0); /* * We use the upper bound here rather than the precise value * because the arc_meta_max value doesn't need to be * precise. It's only consumed by humans via arcstats. */ if (arc_meta_max < aggsum_upper_bound(&arc_meta_used)) arc_meta_max = aggsum_upper_bound(&arc_meta_used); aggsum_add(&arc_meta_used, -space); } ASSERT(aggsum_compare(&arc_size, space) >= 0); aggsum_add(&arc_size, -space); } /* * Given a hdr and a buf, returns whether that buf can share its b_data buffer * with the hdr's b_pabd. */ static boolean_t arc_can_share(arc_buf_hdr_t *hdr, arc_buf_t *buf) { /* * The criteria for sharing a hdr's data are: * 1. the buffer is not encrypted * 2. the hdr's compression matches the buf's compression * 3. the hdr doesn't need to be byteswapped * 4. the hdr isn't already being shared * 5. the buf is either compressed or it is the last buf in the hdr list * * Criterion #5 maintains the invariant that shared uncompressed * bufs must be the final buf in the hdr's b_buf list. Reading this, you * might ask, "if a compressed buf is allocated first, won't that be the * last thing in the list?", but in that case it's impossible to create * a shared uncompressed buf anyway (because the hdr must be compressed * to have the compressed buf). You might also think that #3 is * sufficient to make this guarantee, however it's possible * (specifically in the rare L2ARC write race mentioned in * arc_buf_alloc_impl()) there will be an existing uncompressed buf that * is shareable, but wasn't at the time of its allocation. Rather than * allow a new shared uncompressed buf to be created and then shuffle * the list around to make it the last element, this simply disallows * sharing if the new buf isn't the first to be added. */ ASSERT3P(buf->b_hdr, ==, hdr); boolean_t hdr_compressed = arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF; boolean_t buf_compressed = ARC_BUF_COMPRESSED(buf) != 0; return (!ARC_BUF_ENCRYPTED(buf) && buf_compressed == hdr_compressed && hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS && !HDR_SHARED_DATA(hdr) && (ARC_BUF_LAST(buf) || ARC_BUF_COMPRESSED(buf))); } /* * Allocate a buf for this hdr. If you care about the data that's in the hdr, * or if you want a compressed buffer, pass those flags in. Returns 0 if the * copy was made successfully, or an error code otherwise. */ static int arc_buf_alloc_impl(arc_buf_hdr_t *hdr, spa_t *spa, const zbookmark_phys_t *zb, void *tag, boolean_t encrypted, boolean_t compressed, boolean_t noauth, boolean_t fill, arc_buf_t **ret) { arc_buf_t *buf; arc_fill_flags_t flags = ARC_FILL_LOCKED; ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT3U(HDR_GET_LSIZE(hdr), >, 0); VERIFY(hdr->b_type == ARC_BUFC_DATA || hdr->b_type == ARC_BUFC_METADATA); ASSERT3P(ret, !=, NULL); ASSERT3P(*ret, ==, NULL); IMPLY(encrypted, compressed); hdr->b_l1hdr.b_mru_hits = 0; hdr->b_l1hdr.b_mru_ghost_hits = 0; hdr->b_l1hdr.b_mfu_hits = 0; hdr->b_l1hdr.b_mfu_ghost_hits = 0; hdr->b_l1hdr.b_l2_hits = 0; buf = *ret = kmem_cache_alloc(buf_cache, KM_PUSHPAGE); buf->b_hdr = hdr; buf->b_data = NULL; buf->b_next = hdr->b_l1hdr.b_buf; buf->b_flags = 0; add_reference(hdr, tag); /* * We're about to change the hdr's b_flags. We must either * hold the hash_lock or be undiscoverable. */ ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); /* * Only honor requests for compressed bufs if the hdr is actually * compressed. This must be overridden if the buffer is encrypted since * encrypted buffers cannot be decompressed. */ if (encrypted) { buf->b_flags |= ARC_BUF_FLAG_COMPRESSED; buf->b_flags |= ARC_BUF_FLAG_ENCRYPTED; flags |= ARC_FILL_COMPRESSED | ARC_FILL_ENCRYPTED; } else if (compressed && arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) { buf->b_flags |= ARC_BUF_FLAG_COMPRESSED; flags |= ARC_FILL_COMPRESSED; } if (noauth) { ASSERT0(encrypted); flags |= ARC_FILL_NOAUTH; } /* * If the hdr's data can be shared then we share the data buffer and * set the appropriate bit in the hdr's b_flags to indicate the hdr is * sharing it's b_pabd with the arc_buf_t. Otherwise, we allocate a new * buffer to store the buf's data. * * There are two additional restrictions here because we're sharing * hdr -> buf instead of the usual buf -> hdr. First, the hdr can't be * actively involved in an L2ARC write, because if this buf is used by * an arc_write() then the hdr's data buffer will be released when the * write completes, even though the L2ARC write might still be using it. * Second, the hdr's ABD must be linear so that the buf's user doesn't * need to be ABD-aware. It must be allocated via * zio_[data_]buf_alloc(), not as a page, because we need to be able * to abd_release_ownership_of_buf(), which isn't allowed on "linear * page" buffers because the ABD code needs to handle freeing them * specially. */ boolean_t can_share = arc_can_share(hdr, buf) && !HDR_L2_WRITING(hdr) && hdr->b_l1hdr.b_pabd != NULL && abd_is_linear(hdr->b_l1hdr.b_pabd) && !abd_is_linear_page(hdr->b_l1hdr.b_pabd); /* Set up b_data and sharing */ if (can_share) { buf->b_data = abd_to_buf(hdr->b_l1hdr.b_pabd); buf->b_flags |= ARC_BUF_FLAG_SHARED; arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA); } else { buf->b_data = arc_get_data_buf(hdr, arc_buf_size(buf), buf); ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf)); } VERIFY3P(buf->b_data, !=, NULL); hdr->b_l1hdr.b_buf = buf; hdr->b_l1hdr.b_bufcnt += 1; if (encrypted) hdr->b_crypt_hdr.b_ebufcnt += 1; /* * If the user wants the data from the hdr, we need to either copy or * decompress the data. */ if (fill) { ASSERT3P(zb, !=, NULL); return (arc_buf_fill(buf, spa, zb, flags)); } return (0); } static char *arc_onloan_tag = "onloan"; static inline void arc_loaned_bytes_update(int64_t delta) { atomic_add_64(&arc_loaned_bytes, delta); /* assert that it did not wrap around */ ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0); } /* * Loan out an anonymous arc buffer. Loaned buffers are not counted as in * flight data by arc_tempreserve_space() until they are "returned". Loaned * buffers must be returned to the arc before they can be used by the DMU or * freed. */ arc_buf_t * arc_loan_buf(spa_t *spa, boolean_t is_metadata, int size) { arc_buf_t *buf = arc_alloc_buf(spa, arc_onloan_tag, is_metadata ? ARC_BUFC_METADATA : ARC_BUFC_DATA, size); arc_loaned_bytes_update(arc_buf_size(buf)); return (buf); } arc_buf_t * arc_loan_compressed_buf(spa_t *spa, uint64_t psize, uint64_t lsize, enum zio_compress compression_type, uint8_t complevel) { arc_buf_t *buf = arc_alloc_compressed_buf(spa, arc_onloan_tag, psize, lsize, compression_type, complevel); arc_loaned_bytes_update(arc_buf_size(buf)); return (buf); } arc_buf_t * arc_loan_raw_buf(spa_t *spa, uint64_t dsobj, boolean_t byteorder, const uint8_t *salt, const uint8_t *iv, const uint8_t *mac, dmu_object_type_t ot, uint64_t psize, uint64_t lsize, enum zio_compress compression_type, uint8_t complevel) { arc_buf_t *buf = arc_alloc_raw_buf(spa, arc_onloan_tag, dsobj, byteorder, salt, iv, mac, ot, psize, lsize, compression_type, complevel); atomic_add_64(&arc_loaned_bytes, psize); return (buf); } /* * Return a loaned arc buffer to the arc. */ void arc_return_buf(arc_buf_t *buf, void *tag) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(buf->b_data, !=, NULL); ASSERT(HDR_HAS_L1HDR(hdr)); (void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, tag); (void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag); arc_loaned_bytes_update(-arc_buf_size(buf)); } /* Detach an arc_buf from a dbuf (tag) */ void arc_loan_inuse_buf(arc_buf_t *buf, void *tag) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(buf->b_data, !=, NULL); ASSERT(HDR_HAS_L1HDR(hdr)); (void) zfs_refcount_add(&hdr->b_l1hdr.b_refcnt, arc_onloan_tag); (void) zfs_refcount_remove(&hdr->b_l1hdr.b_refcnt, tag); arc_loaned_bytes_update(arc_buf_size(buf)); } static void l2arc_free_abd_on_write(abd_t *abd, size_t size, arc_buf_contents_t type) { l2arc_data_free_t *df = kmem_alloc(sizeof (*df), KM_SLEEP); df->l2df_abd = abd; df->l2df_size = size; df->l2df_type = type; mutex_enter(&l2arc_free_on_write_mtx); list_insert_head(l2arc_free_on_write, df); mutex_exit(&l2arc_free_on_write_mtx); } static void arc_hdr_free_on_write(arc_buf_hdr_t *hdr, boolean_t free_rdata) { arc_state_t *state = hdr->b_l1hdr.b_state; arc_buf_contents_t type = arc_buf_type(hdr); uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr); /* protected by hash lock, if in the hash table */ if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT(state != arc_anon && state != arc_l2c_only); (void) zfs_refcount_remove_many(&state->arcs_esize[type], size, hdr); } (void) zfs_refcount_remove_many(&state->arcs_size, size, hdr); if (type == ARC_BUFC_METADATA) { arc_space_return(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); arc_space_return(size, ARC_SPACE_DATA); } if (free_rdata) { l2arc_free_abd_on_write(hdr->b_crypt_hdr.b_rabd, size, type); } else { l2arc_free_abd_on_write(hdr->b_l1hdr.b_pabd, size, type); } } /* * Share the arc_buf_t's data with the hdr. Whenever we are sharing the * data buffer, we transfer the refcount ownership to the hdr and update * the appropriate kstats. */ static void arc_share_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf) { ASSERT(arc_can_share(hdr, buf)); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!ARC_BUF_ENCRYPTED(buf)); ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); /* * Start sharing the data buffer. We transfer the * refcount ownership to the hdr since it always owns * the refcount whenever an arc_buf_t is shared. */ zfs_refcount_transfer_ownership_many(&hdr->b_l1hdr.b_state->arcs_size, arc_hdr_size(hdr), buf, hdr); hdr->b_l1hdr.b_pabd = abd_get_from_buf(buf->b_data, arc_buf_size(buf)); abd_take_ownership_of_buf(hdr->b_l1hdr.b_pabd, HDR_ISTYPE_METADATA(hdr)); arc_hdr_set_flags(hdr, ARC_FLAG_SHARED_DATA); buf->b_flags |= ARC_BUF_FLAG_SHARED; /* * Since we've transferred ownership to the hdr we need * to increment its compressed and uncompressed kstats and * decrement the overhead size. */ ARCSTAT_INCR(arcstat_compressed_size, arc_hdr_size(hdr)); ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr)); ARCSTAT_INCR(arcstat_overhead_size, -arc_buf_size(buf)); } static void arc_unshare_buf(arc_buf_hdr_t *hdr, arc_buf_t *buf) { ASSERT(arc_buf_is_shared(buf)); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); /* * We are no longer sharing this buffer so we need * to transfer its ownership to the rightful owner. */ zfs_refcount_transfer_ownership_many(&hdr->b_l1hdr.b_state->arcs_size, arc_hdr_size(hdr), hdr, buf); arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA); abd_release_ownership_of_buf(hdr->b_l1hdr.b_pabd); abd_put(hdr->b_l1hdr.b_pabd); hdr->b_l1hdr.b_pabd = NULL; buf->b_flags &= ~ARC_BUF_FLAG_SHARED; /* * Since the buffer is no longer shared between * the arc buf and the hdr, count it as overhead. */ ARCSTAT_INCR(arcstat_compressed_size, -arc_hdr_size(hdr)); ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr)); ARCSTAT_INCR(arcstat_overhead_size, arc_buf_size(buf)); } /* * Remove an arc_buf_t from the hdr's buf list and return the last * arc_buf_t on the list. If no buffers remain on the list then return * NULL. */ static arc_buf_t * arc_buf_remove(arc_buf_hdr_t *hdr, arc_buf_t *buf) { ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); arc_buf_t **bufp = &hdr->b_l1hdr.b_buf; arc_buf_t *lastbuf = NULL; /* * Remove the buf from the hdr list and locate the last * remaining buffer on the list. */ while (*bufp != NULL) { if (*bufp == buf) *bufp = buf->b_next; /* * If we've removed a buffer in the middle of * the list then update the lastbuf and update * bufp. */ if (*bufp != NULL) { lastbuf = *bufp; bufp = &(*bufp)->b_next; } } buf->b_next = NULL; ASSERT3P(lastbuf, !=, buf); IMPLY(hdr->b_l1hdr.b_bufcnt > 0, lastbuf != NULL); IMPLY(hdr->b_l1hdr.b_bufcnt > 0, hdr->b_l1hdr.b_buf != NULL); IMPLY(lastbuf != NULL, ARC_BUF_LAST(lastbuf)); return (lastbuf); } /* * Free up buf->b_data and pull the arc_buf_t off of the arc_buf_hdr_t's * list and free it. */ static void arc_buf_destroy_impl(arc_buf_t *buf) { arc_buf_hdr_t *hdr = buf->b_hdr; /* * Free up the data associated with the buf but only if we're not * sharing this with the hdr. If we are sharing it with the hdr, the * hdr is responsible for doing the free. */ if (buf->b_data != NULL) { /* * We're about to change the hdr's b_flags. We must either * hold the hash_lock or be undiscoverable. */ ASSERT(HDR_EMPTY_OR_LOCKED(hdr)); arc_cksum_verify(buf); arc_buf_unwatch(buf); if (arc_buf_is_shared(buf)) { arc_hdr_clear_flags(hdr, ARC_FLAG_SHARED_DATA); } else { uint64_t size = arc_buf_size(buf); arc_free_data_buf(hdr, buf->b_data, size, buf); ARCSTAT_INCR(arcstat_overhead_size, -size); } buf->b_data = NULL; ASSERT(hdr->b_l1hdr.b_bufcnt > 0); hdr->b_l1hdr.b_bufcnt -= 1; if (ARC_BUF_ENCRYPTED(buf)) { hdr->b_crypt_hdr.b_ebufcnt -= 1; /* * If we have no more encrypted buffers and we've * already gotten a copy of the decrypted data we can * free b_rabd to save some space. */ if (hdr->b_crypt_hdr.b_ebufcnt == 0 && HDR_HAS_RABD(hdr) && hdr->b_l1hdr.b_pabd != NULL && !HDR_IO_IN_PROGRESS(hdr)) { arc_hdr_free_abd(hdr, B_TRUE); } } } arc_buf_t *lastbuf = arc_buf_remove(hdr, buf); if (ARC_BUF_SHARED(buf) && !ARC_BUF_COMPRESSED(buf)) { /* * If the current arc_buf_t is sharing its data buffer with the * hdr, then reassign the hdr's b_pabd to share it with the new * buffer at the end of the list. The shared buffer is always * the last one on the hdr's buffer list. * * There is an equivalent case for compressed bufs, but since * they aren't guaranteed to be the last buf in the list and * that is an exceedingly rare case, we just allow that space be * wasted temporarily. We must also be careful not to share * encrypted buffers, since they cannot be shared. */ if (lastbuf != NULL && !ARC_BUF_ENCRYPTED(lastbuf)) { /* Only one buf can be shared at once */ VERIFY(!arc_buf_is_shared(lastbuf)); /* hdr is uncompressed so can't have compressed buf */ VERIFY(!ARC_BUF_COMPRESSED(lastbuf)); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); arc_hdr_free_abd(hdr, B_FALSE); /* * We must setup a new shared block between the * last buffer and the hdr. The data would have * been allocated by the arc buf so we need to transfer * ownership to the hdr since it's now being shared. */ arc_share_buf(hdr, lastbuf); } } else if (HDR_SHARED_DATA(hdr)) { /* * Uncompressed shared buffers are always at the end * of the list. Compressed buffers don't have the * same requirements. This makes it hard to * simply assert that the lastbuf is shared so * we rely on the hdr's compression flags to determine * if we have a compressed, shared buffer. */ ASSERT3P(lastbuf, !=, NULL); ASSERT(arc_buf_is_shared(lastbuf) || arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF); } /* * Free the checksum if we're removing the last uncompressed buf from * this hdr. */ if (!arc_hdr_has_uncompressed_buf(hdr)) { arc_cksum_free(hdr); } /* clean up the buf */ buf->b_hdr = NULL; kmem_cache_free(buf_cache, buf); } static void arc_hdr_alloc_abd(arc_buf_hdr_t *hdr, int alloc_flags) { uint64_t size; boolean_t alloc_rdata = ((alloc_flags & ARC_HDR_ALLOC_RDATA) != 0); boolean_t do_adapt = ((alloc_flags & ARC_HDR_DO_ADAPT) != 0); ASSERT3U(HDR_GET_LSIZE(hdr), >, 0); ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(!HDR_SHARED_DATA(hdr) || alloc_rdata); IMPLY(alloc_rdata, HDR_PROTECTED(hdr)); if (alloc_rdata) { size = HDR_GET_PSIZE(hdr); ASSERT3P(hdr->b_crypt_hdr.b_rabd, ==, NULL); hdr->b_crypt_hdr.b_rabd = arc_get_data_abd(hdr, size, hdr, do_adapt); ASSERT3P(hdr->b_crypt_hdr.b_rabd, !=, NULL); ARCSTAT_INCR(arcstat_raw_size, size); } else { size = arc_hdr_size(hdr); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); hdr->b_l1hdr.b_pabd = arc_get_data_abd(hdr, size, hdr, do_adapt); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); } ARCSTAT_INCR(arcstat_compressed_size, size); ARCSTAT_INCR(arcstat_uncompressed_size, HDR_GET_LSIZE(hdr)); } static void arc_hdr_free_abd(arc_buf_hdr_t *hdr, boolean_t free_rdata) { uint64_t size = (free_rdata) ? HDR_GET_PSIZE(hdr) : arc_hdr_size(hdr); ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); IMPLY(free_rdata, HDR_HAS_RABD(hdr)); /* * If the hdr is currently being written to the l2arc then * we defer freeing the data by adding it to the l2arc_free_on_write * list. The l2arc will free the data once it's finished * writing it to the l2arc device. */ if (HDR_L2_WRITING(hdr)) { arc_hdr_free_on_write(hdr, free_rdata); ARCSTAT_BUMP(arcstat_l2_free_on_write); } else if (free_rdata) { arc_free_data_abd(hdr, hdr->b_crypt_hdr.b_rabd, size, hdr); } else { arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, size, hdr); } if (free_rdata) { hdr->b_crypt_hdr.b_rabd = NULL; ARCSTAT_INCR(arcstat_raw_size, -size); } else { hdr->b_l1hdr.b_pabd = NULL; } if (hdr->b_l1hdr.b_pabd == NULL && !HDR_HAS_RABD(hdr)) hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; ARCSTAT_INCR(arcstat_compressed_size, -size); ARCSTAT_INCR(arcstat_uncompressed_size, -HDR_GET_LSIZE(hdr)); } static arc_buf_hdr_t * arc_hdr_alloc(uint64_t spa, int32_t psize, int32_t lsize, boolean_t protected, enum zio_compress compression_type, uint8_t complevel, arc_buf_contents_t type, boolean_t alloc_rdata) { arc_buf_hdr_t *hdr; int flags = ARC_HDR_DO_ADAPT; VERIFY(type == ARC_BUFC_DATA || type == ARC_BUFC_METADATA); if (protected) { hdr = kmem_cache_alloc(hdr_full_crypt_cache, KM_PUSHPAGE); } else { hdr = kmem_cache_alloc(hdr_full_cache, KM_PUSHPAGE); } flags |= alloc_rdata ? ARC_HDR_ALLOC_RDATA : 0; ASSERT(HDR_EMPTY(hdr)); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); HDR_SET_PSIZE(hdr, psize); HDR_SET_LSIZE(hdr, lsize); hdr->b_spa = spa; hdr->b_type = type; hdr->b_flags = 0; arc_hdr_set_flags(hdr, arc_bufc_to_flags(type) | ARC_FLAG_HAS_L1HDR); arc_hdr_set_compress(hdr, compression_type); hdr->b_complevel = complevel; if (protected) arc_hdr_set_flags(hdr, ARC_FLAG_PROTECTED); hdr->b_l1hdr.b_state = arc_anon; hdr->b_l1hdr.b_arc_access = 0; hdr->b_l1hdr.b_bufcnt = 0; hdr->b_l1hdr.b_buf = NULL; /* * Allocate the hdr's buffer. This will contain either * the compressed or uncompressed data depending on the block * it references and compressed arc enablement. */ arc_hdr_alloc_abd(hdr, flags); ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); return (hdr); } /* * Transition between the two allocation states for the arc_buf_hdr struct. * The arc_buf_hdr struct can be allocated with (hdr_full_cache) or without * (hdr_l2only_cache) the fields necessary for the L1 cache - the smaller * version is used when a cache buffer is only in the L2ARC in order to reduce * memory usage. */ static arc_buf_hdr_t * arc_hdr_realloc(arc_buf_hdr_t *hdr, kmem_cache_t *old, kmem_cache_t *new) { ASSERT(HDR_HAS_L2HDR(hdr)); arc_buf_hdr_t *nhdr; l2arc_dev_t *dev = hdr->b_l2hdr.b_dev; ASSERT((old == hdr_full_cache && new == hdr_l2only_cache) || (old == hdr_l2only_cache && new == hdr_full_cache)); /* * if the caller wanted a new full header and the header is to be * encrypted we will actually allocate the header from the full crypt * cache instead. The same applies to freeing from the old cache. */ if (HDR_PROTECTED(hdr) && new == hdr_full_cache) new = hdr_full_crypt_cache; if (HDR_PROTECTED(hdr) && old == hdr_full_cache) old = hdr_full_crypt_cache; nhdr = kmem_cache_alloc(new, KM_PUSHPAGE); ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); buf_hash_remove(hdr); bcopy(hdr, nhdr, HDR_L2ONLY_SIZE); if (new == hdr_full_cache || new == hdr_full_crypt_cache) { arc_hdr_set_flags(nhdr, ARC_FLAG_HAS_L1HDR); /* * arc_access and arc_change_state need to be aware that a * header has just come out of L2ARC, so we set its state to * l2c_only even though it's about to change. */ nhdr->b_l1hdr.b_state = arc_l2c_only; /* Verify previous threads set to NULL before freeing */ ASSERT3P(nhdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); } else { ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); ASSERT0(hdr->b_l1hdr.b_bufcnt); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); /* * If we've reached here, We must have been called from * arc_evict_hdr(), as such we should have already been * removed from any ghost list we were previously on * (which protects us from racing with arc_evict_state), * thus no locking is needed during this check. */ ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); /* * A buffer must not be moved into the arc_l2c_only * state if it's not finished being written out to the * l2arc device. Otherwise, the b_l1hdr.b_pabd field * might try to be accessed, even though it was removed. */ VERIFY(!HDR_L2_WRITING(hdr)); VERIFY3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); arc_hdr_clear_flags(nhdr, ARC_FLAG_HAS_L1HDR); } /* * The header has been reallocated so we need to re-insert it into any * lists it was on. */ (void) buf_hash_insert(nhdr, NULL); ASSERT(list_link_active(&hdr->b_l2hdr.b_l2node)); mutex_enter(&dev->l2ad_mtx); /* * We must place the realloc'ed header back into the list at * the same spot. Otherwise, if it's placed earlier in the list, * l2arc_write_buffers() could find it during the function's * write phase, and try to write it out to the l2arc. */ list_insert_after(&dev->l2ad_buflist, hdr, nhdr); list_remove(&dev->l2ad_buflist, hdr); mutex_exit(&dev->l2ad_mtx); /* * Since we're using the pointer address as the tag when * incrementing and decrementing the l2ad_alloc refcount, we * must remove the old pointer (that we're about to destroy) and * add the new pointer to the refcount. Otherwise we'd remove * the wrong pointer address when calling arc_hdr_destroy() later. */ (void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr); (void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(nhdr), nhdr); buf_discard_identity(hdr); kmem_cache_free(old, hdr); return (nhdr); } /* * This function allows an L1 header to be reallocated as a crypt * header and vice versa. If we are going to a crypt header, the * new fields will be zeroed out. */ static arc_buf_hdr_t * arc_hdr_realloc_crypt(arc_buf_hdr_t *hdr, boolean_t need_crypt) { arc_buf_hdr_t *nhdr; arc_buf_t *buf; kmem_cache_t *ncache, *ocache; unsigned nsize, osize; /* * This function requires that hdr is in the arc_anon state. * Therefore it won't have any L2ARC data for us to worry * about copying. */ ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(!HDR_HAS_L2HDR(hdr)); ASSERT3U(!!HDR_PROTECTED(hdr), !=, need_crypt); ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); ASSERT(!list_link_active(&hdr->b_l2hdr.b_l2node)); ASSERT3P(hdr->b_hash_next, ==, NULL); if (need_crypt) { ncache = hdr_full_crypt_cache; nsize = sizeof (hdr->b_crypt_hdr); ocache = hdr_full_cache; osize = HDR_FULL_SIZE; } else { ncache = hdr_full_cache; nsize = HDR_FULL_SIZE; ocache = hdr_full_crypt_cache; osize = sizeof (hdr->b_crypt_hdr); } nhdr = kmem_cache_alloc(ncache, KM_PUSHPAGE); /* * Copy all members that aren't locks or condvars to the new header. * No lists are pointing to us (as we asserted above), so we don't * need to worry about the list nodes. */ nhdr->b_dva = hdr->b_dva; nhdr->b_birth = hdr->b_birth; nhdr->b_type = hdr->b_type; nhdr->b_flags = hdr->b_flags; nhdr->b_psize = hdr->b_psize; nhdr->b_lsize = hdr->b_lsize; nhdr->b_spa = hdr->b_spa; nhdr->b_l1hdr.b_freeze_cksum = hdr->b_l1hdr.b_freeze_cksum; nhdr->b_l1hdr.b_bufcnt = hdr->b_l1hdr.b_bufcnt; nhdr->b_l1hdr.b_byteswap = hdr->b_l1hdr.b_byteswap; nhdr->b_l1hdr.b_state = hdr->b_l1hdr.b_state; nhdr->b_l1hdr.b_arc_access = hdr->b_l1hdr.b_arc_access; nhdr->b_l1hdr.b_mru_hits = hdr->b_l1hdr.b_mru_hits; nhdr->b_l1hdr.b_mru_ghost_hits = hdr->b_l1hdr.b_mru_ghost_hits; nhdr->b_l1hdr.b_mfu_hits = hdr->b_l1hdr.b_mfu_hits; nhdr->b_l1hdr.b_mfu_ghost_hits = hdr->b_l1hdr.b_mfu_ghost_hits; nhdr->b_l1hdr.b_l2_hits = hdr->b_l1hdr.b_l2_hits; nhdr->b_l1hdr.b_acb = hdr->b_l1hdr.b_acb; nhdr->b_l1hdr.b_pabd = hdr->b_l1hdr.b_pabd; /* * This zfs_refcount_add() exists only to ensure that the individual * arc buffers always point to a header that is referenced, avoiding * a small race condition that could trigger ASSERTs. */ (void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, FTAG); nhdr->b_l1hdr.b_buf = hdr->b_l1hdr.b_buf; for (buf = nhdr->b_l1hdr.b_buf; buf != NULL; buf = buf->b_next) { mutex_enter(&buf->b_evict_lock); buf->b_hdr = nhdr; mutex_exit(&buf->b_evict_lock); } zfs_refcount_transfer(&nhdr->b_l1hdr.b_refcnt, &hdr->b_l1hdr.b_refcnt); (void) zfs_refcount_remove(&nhdr->b_l1hdr.b_refcnt, FTAG); ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt)); if (need_crypt) { arc_hdr_set_flags(nhdr, ARC_FLAG_PROTECTED); } else { arc_hdr_clear_flags(nhdr, ARC_FLAG_PROTECTED); } /* unset all members of the original hdr */ bzero(&hdr->b_dva, sizeof (dva_t)); hdr->b_birth = 0; hdr->b_type = ARC_BUFC_INVALID; hdr->b_flags = 0; hdr->b_psize = 0; hdr->b_lsize = 0; hdr->b_spa = 0; hdr->b_l1hdr.b_freeze_cksum = NULL; hdr->b_l1hdr.b_buf = NULL; hdr->b_l1hdr.b_bufcnt = 0; hdr->b_l1hdr.b_byteswap = 0; hdr->b_l1hdr.b_state = NULL; hdr->b_l1hdr.b_arc_access = 0; hdr->b_l1hdr.b_mru_hits = 0; hdr->b_l1hdr.b_mru_ghost_hits = 0; hdr->b_l1hdr.b_mfu_hits = 0; hdr->b_l1hdr.b_mfu_ghost_hits = 0; hdr->b_l1hdr.b_l2_hits = 0; hdr->b_l1hdr.b_acb = NULL; hdr->b_l1hdr.b_pabd = NULL; if (ocache == hdr_full_crypt_cache) { ASSERT(!HDR_HAS_RABD(hdr)); hdr->b_crypt_hdr.b_ot = DMU_OT_NONE; hdr->b_crypt_hdr.b_ebufcnt = 0; hdr->b_crypt_hdr.b_dsobj = 0; bzero(hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN); bzero(hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN); bzero(hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN); } buf_discard_identity(hdr); kmem_cache_free(ocache, hdr); return (nhdr); } /* * This function is used by the send / receive code to convert a newly * allocated arc_buf_t to one that is suitable for a raw encrypted write. It * is also used to allow the root objset block to be updated without altering * its embedded MACs. Both block types will always be uncompressed so we do not * have to worry about compression type or psize. */ void arc_convert_to_raw(arc_buf_t *buf, uint64_t dsobj, boolean_t byteorder, dmu_object_type_t ot, const uint8_t *salt, const uint8_t *iv, const uint8_t *mac) { arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT(ot == DMU_OT_DNODE || ot == DMU_OT_OBJSET); ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); buf->b_flags |= (ARC_BUF_FLAG_COMPRESSED | ARC_BUF_FLAG_ENCRYPTED); if (!HDR_PROTECTED(hdr)) hdr = arc_hdr_realloc_crypt(hdr, B_TRUE); hdr->b_crypt_hdr.b_dsobj = dsobj; hdr->b_crypt_hdr.b_ot = ot; hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ? DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot); if (!arc_hdr_has_uncompressed_buf(hdr)) arc_cksum_free(hdr); if (salt != NULL) bcopy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN); if (iv != NULL) bcopy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN); if (mac != NULL) bcopy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN); } /* * Allocate a new arc_buf_hdr_t and arc_buf_t and return the buf to the caller. * The buf is returned thawed since we expect the consumer to modify it. */ arc_buf_t * arc_alloc_buf(spa_t *spa, void *tag, arc_buf_contents_t type, int32_t size) { arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), size, size, B_FALSE, ZIO_COMPRESS_OFF, 0, type, B_FALSE); arc_buf_t *buf = NULL; VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_FALSE, B_FALSE, B_FALSE, &buf)); arc_buf_thaw(buf); return (buf); } /* * Allocate a compressed buf in the same manner as arc_alloc_buf. Don't use this * for bufs containing metadata. */ arc_buf_t * arc_alloc_compressed_buf(spa_t *spa, void *tag, uint64_t psize, uint64_t lsize, enum zio_compress compression_type, uint8_t complevel) { ASSERT3U(lsize, >, 0); ASSERT3U(lsize, >=, psize); ASSERT3U(compression_type, >, ZIO_COMPRESS_OFF); ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS); arc_buf_hdr_t *hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_FALSE, compression_type, complevel, ARC_BUFC_DATA, B_FALSE); arc_buf_t *buf = NULL; VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_FALSE, B_TRUE, B_FALSE, B_FALSE, &buf)); arc_buf_thaw(buf); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); if (!arc_buf_is_shared(buf)) { /* * To ensure that the hdr has the correct data in it if we call * arc_untransform() on this buf before it's been written to * disk, it's easiest if we just set up sharing between the * buf and the hdr. */ arc_hdr_free_abd(hdr, B_FALSE); arc_share_buf(hdr, buf); } return (buf); } arc_buf_t * arc_alloc_raw_buf(spa_t *spa, void *tag, uint64_t dsobj, boolean_t byteorder, const uint8_t *salt, const uint8_t *iv, const uint8_t *mac, dmu_object_type_t ot, uint64_t psize, uint64_t lsize, enum zio_compress compression_type, uint8_t complevel) { arc_buf_hdr_t *hdr; arc_buf_t *buf; arc_buf_contents_t type = DMU_OT_IS_METADATA(ot) ? ARC_BUFC_METADATA : ARC_BUFC_DATA; ASSERT3U(lsize, >, 0); ASSERT3U(lsize, >=, psize); ASSERT3U(compression_type, >=, ZIO_COMPRESS_OFF); ASSERT3U(compression_type, <, ZIO_COMPRESS_FUNCTIONS); hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, B_TRUE, compression_type, complevel, type, B_TRUE); hdr->b_crypt_hdr.b_dsobj = dsobj; hdr->b_crypt_hdr.b_ot = ot; hdr->b_l1hdr.b_byteswap = (byteorder == ZFS_HOST_BYTEORDER) ? DMU_BSWAP_NUMFUNCS : DMU_OT_BYTESWAP(ot); bcopy(salt, hdr->b_crypt_hdr.b_salt, ZIO_DATA_SALT_LEN); bcopy(iv, hdr->b_crypt_hdr.b_iv, ZIO_DATA_IV_LEN); bcopy(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN); /* * This buffer will be considered encrypted even if the ot is not an * encrypted type. It will become authenticated instead in * arc_write_ready(). */ buf = NULL; VERIFY0(arc_buf_alloc_impl(hdr, spa, NULL, tag, B_TRUE, B_TRUE, B_FALSE, B_FALSE, &buf)); arc_buf_thaw(buf); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); return (buf); } static void arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr) { l2arc_buf_hdr_t *l2hdr = &hdr->b_l2hdr; l2arc_dev_t *dev = l2hdr->b_dev; uint64_t psize = HDR_GET_PSIZE(hdr); uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize); ASSERT(MUTEX_HELD(&dev->l2ad_mtx)); ASSERT(HDR_HAS_L2HDR(hdr)); list_remove(&dev->l2ad_buflist, hdr); ARCSTAT_INCR(arcstat_l2_psize, -psize); ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr)); vdev_space_update(dev->l2ad_vdev, -asize, 0, 0); (void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr); arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR); } static void arc_hdr_destroy(arc_buf_hdr_t *hdr) { if (HDR_HAS_L1HDR(hdr)) { ASSERT(hdr->b_l1hdr.b_buf == NULL || hdr->b_l1hdr.b_bufcnt > 0); ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); } ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT(!HDR_IN_HASH_TABLE(hdr)); if (HDR_HAS_L2HDR(hdr)) { l2arc_dev_t *dev = hdr->b_l2hdr.b_dev; boolean_t buflist_held = MUTEX_HELD(&dev->l2ad_mtx); if (!buflist_held) mutex_enter(&dev->l2ad_mtx); /* * Even though we checked this conditional above, we * need to check this again now that we have the * l2ad_mtx. This is because we could be racing with * another thread calling l2arc_evict() which might have * destroyed this header's L2 portion as we were waiting * to acquire the l2ad_mtx. If that happens, we don't * want to re-destroy the header's L2 portion. */ if (HDR_HAS_L2HDR(hdr)) arc_hdr_l2hdr_destroy(hdr); if (!buflist_held) mutex_exit(&dev->l2ad_mtx); } /* * The header's identify can only be safely discarded once it is no * longer discoverable. This requires removing it from the hash table * and the l2arc header list. After this point the hash lock can not * be used to protect the header. */ if (!HDR_EMPTY(hdr)) buf_discard_identity(hdr); if (HDR_HAS_L1HDR(hdr)) { arc_cksum_free(hdr); while (hdr->b_l1hdr.b_buf != NULL) arc_buf_destroy_impl(hdr->b_l1hdr.b_buf); if (hdr->b_l1hdr.b_pabd != NULL) arc_hdr_free_abd(hdr, B_FALSE); if (HDR_HAS_RABD(hdr)) arc_hdr_free_abd(hdr, B_TRUE); } ASSERT3P(hdr->b_hash_next, ==, NULL); if (HDR_HAS_L1HDR(hdr)) { ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); if (!HDR_PROTECTED(hdr)) { kmem_cache_free(hdr_full_cache, hdr); } else { kmem_cache_free(hdr_full_crypt_cache, hdr); } } else { kmem_cache_free(hdr_l2only_cache, hdr); } } void arc_buf_destroy(arc_buf_t *buf, void* tag) { arc_buf_hdr_t *hdr = buf->b_hdr; if (hdr->b_l1hdr.b_state == arc_anon) { ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); VERIFY0(remove_reference(hdr, NULL, tag)); arc_hdr_destroy(hdr); return; } kmutex_t *hash_lock = HDR_LOCK(hdr); mutex_enter(hash_lock); ASSERT3P(hdr, ==, buf->b_hdr); ASSERT(hdr->b_l1hdr.b_bufcnt > 0); ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); ASSERT3P(hdr->b_l1hdr.b_state, !=, arc_anon); ASSERT3P(buf->b_data, !=, NULL); (void) remove_reference(hdr, hash_lock, tag); arc_buf_destroy_impl(buf); mutex_exit(hash_lock); } /* * Evict the arc_buf_hdr that is provided as a parameter. The resultant * state of the header is dependent on its state prior to entering this * function. The following transitions are possible: * * - arc_mru -> arc_mru_ghost * - arc_mfu -> arc_mfu_ghost * - arc_mru_ghost -> arc_l2c_only * - arc_mru_ghost -> deleted * - arc_mfu_ghost -> arc_l2c_only * - arc_mfu_ghost -> deleted */ static int64_t arc_evict_hdr(arc_buf_hdr_t *hdr, kmutex_t *hash_lock) { arc_state_t *evicted_state, *state; int64_t bytes_evicted = 0; int min_lifetime = HDR_PRESCIENT_PREFETCH(hdr) ? arc_min_prescient_prefetch_ms : arc_min_prefetch_ms; ASSERT(MUTEX_HELD(hash_lock)); ASSERT(HDR_HAS_L1HDR(hdr)); state = hdr->b_l1hdr.b_state; if (GHOST_STATE(state)) { ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); /* * l2arc_write_buffers() relies on a header's L1 portion * (i.e. its b_pabd field) during it's write phase. * Thus, we cannot push a header onto the arc_l2c_only * state (removing its L1 piece) until the header is * done being written to the l2arc. */ if (HDR_HAS_L2HDR(hdr) && HDR_L2_WRITING(hdr)) { ARCSTAT_BUMP(arcstat_evict_l2_skip); return (bytes_evicted); } ARCSTAT_BUMP(arcstat_deleted); bytes_evicted += HDR_GET_LSIZE(hdr); DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, hdr); if (HDR_HAS_L2HDR(hdr)) { ASSERT(hdr->b_l1hdr.b_pabd == NULL); ASSERT(!HDR_HAS_RABD(hdr)); /* * This buffer is cached on the 2nd Level ARC; * don't destroy the header. */ arc_change_state(arc_l2c_only, hdr, hash_lock); /* * dropping from L1+L2 cached to L2-only, * realloc to remove the L1 header. */ hdr = arc_hdr_realloc(hdr, hdr_full_cache, hdr_l2only_cache); } else { arc_change_state(arc_anon, hdr, hash_lock); arc_hdr_destroy(hdr); } return (bytes_evicted); } ASSERT(state == arc_mru || state == arc_mfu); evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost; /* prefetch buffers have a minimum lifespan */ if (HDR_IO_IN_PROGRESS(hdr) || ((hdr->b_flags & (ARC_FLAG_PREFETCH | ARC_FLAG_INDIRECT)) && ddi_get_lbolt() - hdr->b_l1hdr.b_arc_access < MSEC_TO_TICK(min_lifetime))) { ARCSTAT_BUMP(arcstat_evict_skip); return (bytes_evicted); } ASSERT0(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt)); while (hdr->b_l1hdr.b_buf) { arc_buf_t *buf = hdr->b_l1hdr.b_buf; if (!mutex_tryenter(&buf->b_evict_lock)) { ARCSTAT_BUMP(arcstat_mutex_miss); break; } if (buf->b_data != NULL) bytes_evicted += HDR_GET_LSIZE(hdr); mutex_exit(&buf->b_evict_lock); arc_buf_destroy_impl(buf); } if (HDR_HAS_L2HDR(hdr)) { ARCSTAT_INCR(arcstat_evict_l2_cached, HDR_GET_LSIZE(hdr)); } else { if (l2arc_write_eligible(hdr->b_spa, hdr)) { ARCSTAT_INCR(arcstat_evict_l2_eligible, HDR_GET_LSIZE(hdr)); } else { ARCSTAT_INCR(arcstat_evict_l2_ineligible, HDR_GET_LSIZE(hdr)); } } if (hdr->b_l1hdr.b_bufcnt == 0) { arc_cksum_free(hdr); bytes_evicted += arc_hdr_size(hdr); /* * If this hdr is being evicted and has a compressed * buffer then we discard it here before we change states. * This ensures that the accounting is updated correctly * in arc_free_data_impl(). */ if (hdr->b_l1hdr.b_pabd != NULL) arc_hdr_free_abd(hdr, B_FALSE); if (HDR_HAS_RABD(hdr)) arc_hdr_free_abd(hdr, B_TRUE); arc_change_state(evicted_state, hdr, hash_lock); ASSERT(HDR_IN_HASH_TABLE(hdr)); arc_hdr_set_flags(hdr, ARC_FLAG_IN_HASH_TABLE); DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, hdr); } return (bytes_evicted); } static void arc_set_need_free(void) { ASSERT(MUTEX_HELD(&arc_evict_lock)); int64_t remaining = arc_free_memory() - arc_sys_free / 2; arc_evict_waiter_t *aw = list_tail(&arc_evict_waiters); if (aw == NULL) { arc_need_free = MAX(-remaining, 0); } else { arc_need_free = MAX(-remaining, (int64_t)(aw->aew_count - arc_evict_count)); } } static uint64_t arc_evict_state_impl(multilist_t *ml, int idx, arc_buf_hdr_t *marker, uint64_t spa, int64_t bytes) { multilist_sublist_t *mls; uint64_t bytes_evicted = 0; arc_buf_hdr_t *hdr; kmutex_t *hash_lock; int evict_count = 0; ASSERT3P(marker, !=, NULL); IMPLY(bytes < 0, bytes == ARC_EVICT_ALL); mls = multilist_sublist_lock(ml, idx); for (hdr = multilist_sublist_prev(mls, marker); hdr != NULL; hdr = multilist_sublist_prev(mls, marker)) { if ((bytes != ARC_EVICT_ALL && bytes_evicted >= bytes) || (evict_count >= zfs_arc_evict_batch_limit)) break; /* * To keep our iteration location, move the marker * forward. Since we're not holding hdr's hash lock, we * must be very careful and not remove 'hdr' from the * sublist. Otherwise, other consumers might mistake the * 'hdr' as not being on a sublist when they call the * multilist_link_active() function (they all rely on * the hash lock protecting concurrent insertions and * removals). multilist_sublist_move_forward() was * specifically implemented to ensure this is the case * (only 'marker' will be removed and re-inserted). */ multilist_sublist_move_forward(mls, marker); /* * The only case where the b_spa field should ever be * zero, is the marker headers inserted by * arc_evict_state(). It's possible for multiple threads * to be calling arc_evict_state() concurrently (e.g. * dsl_pool_close() and zio_inject_fault()), so we must * skip any markers we see from these other threads. */ if (hdr->b_spa == 0) continue; /* we're only interested in evicting buffers of a certain spa */ if (spa != 0 && hdr->b_spa != spa) { ARCSTAT_BUMP(arcstat_evict_skip); continue; } hash_lock = HDR_LOCK(hdr); /* * We aren't calling this function from any code path * that would already be holding a hash lock, so we're * asserting on this assumption to be defensive in case * this ever changes. Without this check, it would be * possible to incorrectly increment arcstat_mutex_miss * below (e.g. if the code changed such that we called * this function with a hash lock held). */ ASSERT(!MUTEX_HELD(hash_lock)); if (mutex_tryenter(hash_lock)) { uint64_t evicted = arc_evict_hdr(hdr, hash_lock); mutex_exit(hash_lock); bytes_evicted += evicted; /* * If evicted is zero, arc_evict_hdr() must have * decided to skip this header, don't increment * evict_count in this case. */ if (evicted != 0) evict_count++; } else { ARCSTAT_BUMP(arcstat_mutex_miss); } } multilist_sublist_unlock(mls); /* * Increment the count of evicted bytes, and wake up any threads that * are waiting for the count to reach this value. Since the list is * ordered by ascending aew_count, we pop off the beginning of the * list until we reach the end, or a waiter that's past the current * "count". Doing this outside the loop reduces the number of times * we need to acquire the global arc_evict_lock. * * Only wake when there's sufficient free memory in the system * (specifically, arc_sys_free/2, which by default is a bit more than * 1/64th of RAM). See the comments in arc_wait_for_eviction(). */ mutex_enter(&arc_evict_lock); arc_evict_count += bytes_evicted; if ((int64_t)(arc_free_memory() - arc_sys_free / 2) > 0) { arc_evict_waiter_t *aw; while ((aw = list_head(&arc_evict_waiters)) != NULL && aw->aew_count <= arc_evict_count) { list_remove(&arc_evict_waiters, aw); cv_broadcast(&aw->aew_cv); } } arc_set_need_free(); mutex_exit(&arc_evict_lock); /* * If the ARC size is reduced from arc_c_max to arc_c_min (especially * if the average cached block is small), eviction can be on-CPU for * many seconds. To ensure that other threads that may be bound to * this CPU are able to make progress, make a voluntary preemption * call here. */ cond_resched(); return (bytes_evicted); } /* * Evict buffers from the given arc state, until we've removed the * specified number of bytes. Move the removed buffers to the * appropriate evict state. * * This function makes a "best effort". It skips over any buffers * it can't get a hash_lock on, and so, may not catch all candidates. * It may also return without evicting as much space as requested. * * If bytes is specified using the special value ARC_EVICT_ALL, this * will evict all available (i.e. unlocked and evictable) buffers from * the given arc state; which is used by arc_flush(). */ static uint64_t arc_evict_state(arc_state_t *state, uint64_t spa, int64_t bytes, arc_buf_contents_t type) { uint64_t total_evicted = 0; multilist_t *ml = state->arcs_list[type]; int num_sublists; arc_buf_hdr_t **markers; IMPLY(bytes < 0, bytes == ARC_EVICT_ALL); num_sublists = multilist_get_num_sublists(ml); /* * If we've tried to evict from each sublist, made some * progress, but still have not hit the target number of bytes * to evict, we want to keep trying. The markers allow us to * pick up where we left off for each individual sublist, rather * than starting from the tail each time. */ markers = kmem_zalloc(sizeof (*markers) * num_sublists, KM_SLEEP); for (int i = 0; i < num_sublists; i++) { multilist_sublist_t *mls; markers[i] = kmem_cache_alloc(hdr_full_cache, KM_SLEEP); /* * A b_spa of 0 is used to indicate that this header is * a marker. This fact is used in arc_evict_type() and * arc_evict_state_impl(). */ markers[i]->b_spa = 0; mls = multilist_sublist_lock(ml, i); multilist_sublist_insert_tail(mls, markers[i]); multilist_sublist_unlock(mls); } /* * While we haven't hit our target number of bytes to evict, or * we're evicting all available buffers. */ while (total_evicted < bytes || bytes == ARC_EVICT_ALL) { int sublist_idx = multilist_get_random_index(ml); uint64_t scan_evicted = 0; /* * Try to reduce pinned dnodes with a floor of arc_dnode_limit. * Request that 10% of the LRUs be scanned by the superblock * shrinker. */ if (type == ARC_BUFC_DATA && aggsum_compare(&astat_dnode_size, arc_dnode_size_limit) > 0) { arc_prune_async((aggsum_upper_bound(&astat_dnode_size) - arc_dnode_size_limit) / sizeof (dnode_t) / zfs_arc_dnode_reduce_percent); } /* * Start eviction using a randomly selected sublist, * this is to try and evenly balance eviction across all * sublists. Always starting at the same sublist * (e.g. index 0) would cause evictions to favor certain * sublists over others. */ for (int i = 0; i < num_sublists; i++) { uint64_t bytes_remaining; uint64_t bytes_evicted; if (bytes == ARC_EVICT_ALL) bytes_remaining = ARC_EVICT_ALL; else if (total_evicted < bytes) bytes_remaining = bytes - total_evicted; else break; bytes_evicted = arc_evict_state_impl(ml, sublist_idx, markers[sublist_idx], spa, bytes_remaining); scan_evicted += bytes_evicted; total_evicted += bytes_evicted; /* we've reached the end, wrap to the beginning */ if (++sublist_idx >= num_sublists) sublist_idx = 0; } /* * If we didn't evict anything during this scan, we have * no reason to believe we'll evict more during another * scan, so break the loop. */ if (scan_evicted == 0) { /* This isn't possible, let's make that obvious */ ASSERT3S(bytes, !=, 0); /* * When bytes is ARC_EVICT_ALL, the only way to * break the loop is when scan_evicted is zero. * In that case, we actually have evicted enough, * so we don't want to increment the kstat. */ if (bytes != ARC_EVICT_ALL) { ASSERT3S(total_evicted, <, bytes); ARCSTAT_BUMP(arcstat_evict_not_enough); } break; } } for (int i = 0; i < num_sublists; i++) { multilist_sublist_t *mls = multilist_sublist_lock(ml, i); multilist_sublist_remove(mls, markers[i]); multilist_sublist_unlock(mls); kmem_cache_free(hdr_full_cache, markers[i]); } kmem_free(markers, sizeof (*markers) * num_sublists); return (total_evicted); } /* * Flush all "evictable" data of the given type from the arc state * specified. This will not evict any "active" buffers (i.e. referenced). * * When 'retry' is set to B_FALSE, the function will make a single pass * over the state and evict any buffers that it can. Since it doesn't * continually retry the eviction, it might end up leaving some buffers * in the ARC due to lock misses. * * When 'retry' is set to B_TRUE, the function will continually retry the * eviction until *all* evictable buffers have been removed from the * state. As a result, if concurrent insertions into the state are * allowed (e.g. if the ARC isn't shutting down), this function might * wind up in an infinite loop, continually trying to evict buffers. */ static uint64_t arc_flush_state(arc_state_t *state, uint64_t spa, arc_buf_contents_t type, boolean_t retry) { uint64_t evicted = 0; while (zfs_refcount_count(&state->arcs_esize[type]) != 0) { evicted += arc_evict_state(state, spa, ARC_EVICT_ALL, type); if (!retry) break; } return (evicted); } /* * Evict the specified number of bytes from the state specified, * restricting eviction to the spa and type given. This function * prevents us from trying to evict more from a state's list than * is "evictable", and to skip evicting altogether when passed a * negative value for "bytes". In contrast, arc_evict_state() will * evict everything it can, when passed a negative value for "bytes". */ static uint64_t arc_evict_impl(arc_state_t *state, uint64_t spa, int64_t bytes, arc_buf_contents_t type) { int64_t delta; if (bytes > 0 && zfs_refcount_count(&state->arcs_esize[type]) > 0) { delta = MIN(zfs_refcount_count(&state->arcs_esize[type]), bytes); return (arc_evict_state(state, spa, delta, type)); } return (0); } /* * The goal of this function is to evict enough meta data buffers from the * ARC in order to enforce the arc_meta_limit. Achieving this is slightly * more complicated than it appears because it is common for data buffers * to have holds on meta data buffers. In addition, dnode meta data buffers * will be held by the dnodes in the block preventing them from being freed. * This means we can't simply traverse the ARC and expect to always find * enough unheld meta data buffer to release. * * Therefore, this function has been updated to make alternating passes * over the ARC releasing data buffers and then newly unheld meta data * buffers. This ensures forward progress is maintained and meta_used * will decrease. Normally this is sufficient, but if required the ARC * will call the registered prune callbacks causing dentry and inodes to * be dropped from the VFS cache. This will make dnode meta data buffers * available for reclaim. */ static uint64_t arc_evict_meta_balanced(uint64_t meta_used) { int64_t delta, prune = 0, adjustmnt; uint64_t total_evicted = 0; arc_buf_contents_t type = ARC_BUFC_DATA; int restarts = MAX(zfs_arc_meta_adjust_restarts, 0); restart: /* * This slightly differs than the way we evict from the mru in * arc_evict because we don't have a "target" value (i.e. no * "meta" arc_p). As a result, I think we can completely * cannibalize the metadata in the MRU before we evict the * metadata from the MFU. I think we probably need to implement a * "metadata arc_p" value to do this properly. */ adjustmnt = meta_used - arc_meta_limit; if (adjustmnt > 0 && zfs_refcount_count(&arc_mru->arcs_esize[type]) > 0) { delta = MIN(zfs_refcount_count(&arc_mru->arcs_esize[type]), adjustmnt); total_evicted += arc_evict_impl(arc_mru, 0, delta, type); adjustmnt -= delta; } /* * We can't afford to recalculate adjustmnt here. If we do, * new metadata buffers can sneak into the MRU or ANON lists, * thus penalize the MFU metadata. Although the fudge factor is * small, it has been empirically shown to be significant for * certain workloads (e.g. creating many empty directories). As * such, we use the original calculation for adjustmnt, and * simply decrement the amount of data evicted from the MRU. */ if (adjustmnt > 0 && zfs_refcount_count(&arc_mfu->arcs_esize[type]) > 0) { delta = MIN(zfs_refcount_count(&arc_mfu->arcs_esize[type]), adjustmnt); total_evicted += arc_evict_impl(arc_mfu, 0, delta, type); } adjustmnt = meta_used - arc_meta_limit; if (adjustmnt > 0 && zfs_refcount_count(&arc_mru_ghost->arcs_esize[type]) > 0) { delta = MIN(adjustmnt, zfs_refcount_count(&arc_mru_ghost->arcs_esize[type])); total_evicted += arc_evict_impl(arc_mru_ghost, 0, delta, type); adjustmnt -= delta; } if (adjustmnt > 0 && zfs_refcount_count(&arc_mfu_ghost->arcs_esize[type]) > 0) { delta = MIN(adjustmnt, zfs_refcount_count(&arc_mfu_ghost->arcs_esize[type])); total_evicted += arc_evict_impl(arc_mfu_ghost, 0, delta, type); } /* * If after attempting to make the requested adjustment to the ARC * the meta limit is still being exceeded then request that the * higher layers drop some cached objects which have holds on ARC * meta buffers. Requests to the upper layers will be made with * increasingly large scan sizes until the ARC is below the limit. */ if (meta_used > arc_meta_limit) { if (type == ARC_BUFC_DATA) { type = ARC_BUFC_METADATA; } else { type = ARC_BUFC_DATA; if (zfs_arc_meta_prune) { prune += zfs_arc_meta_prune; arc_prune_async(prune); } } if (restarts > 0) { restarts--; goto restart; } } return (total_evicted); } /* * Evict metadata buffers from the cache, such that arc_meta_used is * capped by the arc_meta_limit tunable. */ static uint64_t arc_evict_meta_only(uint64_t meta_used) { uint64_t total_evicted = 0; int64_t target; /* * If we're over the meta limit, we want to evict enough * metadata to get back under the meta limit. We don't want to * evict so much that we drop the MRU below arc_p, though. If * we're over the meta limit more than we're over arc_p, we * evict some from the MRU here, and some from the MFU below. */ target = MIN((int64_t)(meta_used - arc_meta_limit), (int64_t)(zfs_refcount_count(&arc_anon->arcs_size) + zfs_refcount_count(&arc_mru->arcs_size) - arc_p)); total_evicted += arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA); /* * Similar to the above, we want to evict enough bytes to get us * below the meta limit, but not so much as to drop us below the * space allotted to the MFU (which is defined as arc_c - arc_p). */ target = MIN((int64_t)(meta_used - arc_meta_limit), (int64_t)(zfs_refcount_count(&arc_mfu->arcs_size) - (arc_c - arc_p))); total_evicted += arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA); return (total_evicted); } static uint64_t arc_evict_meta(uint64_t meta_used) { if (zfs_arc_meta_strategy == ARC_STRATEGY_META_ONLY) return (arc_evict_meta_only(meta_used)); else return (arc_evict_meta_balanced(meta_used)); } /* * Return the type of the oldest buffer in the given arc state * * This function will select a random sublist of type ARC_BUFC_DATA and * a random sublist of type ARC_BUFC_METADATA. The tail of each sublist * is compared, and the type which contains the "older" buffer will be * returned. */ static arc_buf_contents_t arc_evict_type(arc_state_t *state) { multilist_t *data_ml = state->arcs_list[ARC_BUFC_DATA]; multilist_t *meta_ml = state->arcs_list[ARC_BUFC_METADATA]; int data_idx = multilist_get_random_index(data_ml); int meta_idx = multilist_get_random_index(meta_ml); multilist_sublist_t *data_mls; multilist_sublist_t *meta_mls; arc_buf_contents_t type; arc_buf_hdr_t *data_hdr; arc_buf_hdr_t *meta_hdr; /* * We keep the sublist lock until we're finished, to prevent * the headers from being destroyed via arc_evict_state(). */ data_mls = multilist_sublist_lock(data_ml, data_idx); meta_mls = multilist_sublist_lock(meta_ml, meta_idx); /* * These two loops are to ensure we skip any markers that * might be at the tail of the lists due to arc_evict_state(). */ for (data_hdr = multilist_sublist_tail(data_mls); data_hdr != NULL; data_hdr = multilist_sublist_prev(data_mls, data_hdr)) { if (data_hdr->b_spa != 0) break; } for (meta_hdr = multilist_sublist_tail(meta_mls); meta_hdr != NULL; meta_hdr = multilist_sublist_prev(meta_mls, meta_hdr)) { if (meta_hdr->b_spa != 0) break; } if (data_hdr == NULL && meta_hdr == NULL) { type = ARC_BUFC_DATA; } else if (data_hdr == NULL) { ASSERT3P(meta_hdr, !=, NULL); type = ARC_BUFC_METADATA; } else if (meta_hdr == NULL) { ASSERT3P(data_hdr, !=, NULL); type = ARC_BUFC_DATA; } else { ASSERT3P(data_hdr, !=, NULL); ASSERT3P(meta_hdr, !=, NULL); /* The headers can't be on the sublist without an L1 header */ ASSERT(HDR_HAS_L1HDR(data_hdr)); ASSERT(HDR_HAS_L1HDR(meta_hdr)); if (data_hdr->b_l1hdr.b_arc_access < meta_hdr->b_l1hdr.b_arc_access) { type = ARC_BUFC_DATA; } else { type = ARC_BUFC_METADATA; } } multilist_sublist_unlock(meta_mls); multilist_sublist_unlock(data_mls); return (type); } /* * Evict buffers from the cache, such that arc_size is capped by arc_c. */ static uint64_t arc_evict(void) { uint64_t total_evicted = 0; uint64_t bytes; int64_t target; uint64_t asize = aggsum_value(&arc_size); uint64_t ameta = aggsum_value(&arc_meta_used); /* * If we're over arc_meta_limit, we want to correct that before * potentially evicting data buffers below. */ total_evicted += arc_evict_meta(ameta); /* * Adjust MRU size * * If we're over the target cache size, we want to evict enough * from the list to get back to our target size. We don't want * to evict too much from the MRU, such that it drops below * arc_p. So, if we're over our target cache size more than * the MRU is over arc_p, we'll evict enough to get back to * arc_p here, and then evict more from the MFU below. */ target = MIN((int64_t)(asize - arc_c), (int64_t)(zfs_refcount_count(&arc_anon->arcs_size) + zfs_refcount_count(&arc_mru->arcs_size) + ameta - arc_p)); /* * If we're below arc_meta_min, always prefer to evict data. * Otherwise, try to satisfy the requested number of bytes to * evict from the type which contains older buffers; in an * effort to keep newer buffers in the cache regardless of their * type. If we cannot satisfy the number of bytes from this * type, spill over into the next type. */ if (arc_evict_type(arc_mru) == ARC_BUFC_METADATA && ameta > arc_meta_min) { bytes = arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA); total_evicted += bytes; /* * If we couldn't evict our target number of bytes from * metadata, we try to get the rest from data. */ target -= bytes; total_evicted += arc_evict_impl(arc_mru, 0, target, ARC_BUFC_DATA); } else { bytes = arc_evict_impl(arc_mru, 0, target, ARC_BUFC_DATA); total_evicted += bytes; /* * If we couldn't evict our target number of bytes from * data, we try to get the rest from metadata. */ target -= bytes; total_evicted += arc_evict_impl(arc_mru, 0, target, ARC_BUFC_METADATA); } /* * Re-sum ARC stats after the first round of evictions. */ asize = aggsum_value(&arc_size); ameta = aggsum_value(&arc_meta_used); /* * Adjust MFU size * * Now that we've tried to evict enough from the MRU to get its * size back to arc_p, if we're still above the target cache * size, we evict the rest from the MFU. */ target = asize - arc_c; if (arc_evict_type(arc_mfu) == ARC_BUFC_METADATA && ameta > arc_meta_min) { bytes = arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA); total_evicted += bytes; /* * If we couldn't evict our target number of bytes from * metadata, we try to get the rest from data. */ target -= bytes; total_evicted += arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_DATA); } else { bytes = arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_DATA); total_evicted += bytes; /* * If we couldn't evict our target number of bytes from * data, we try to get the rest from data. */ target -= bytes; total_evicted += arc_evict_impl(arc_mfu, 0, target, ARC_BUFC_METADATA); } /* * Adjust ghost lists * * In addition to the above, the ARC also defines target values * for the ghost lists. The sum of the mru list and mru ghost * list should never exceed the target size of the cache, and * the sum of the mru list, mfu list, mru ghost list, and mfu * ghost list should never exceed twice the target size of the * cache. The following logic enforces these limits on the ghost * caches, and evicts from them as needed. */ target = zfs_refcount_count(&arc_mru->arcs_size) + zfs_refcount_count(&arc_mru_ghost->arcs_size) - arc_c; bytes = arc_evict_impl(arc_mru_ghost, 0, target, ARC_BUFC_DATA); total_evicted += bytes; target -= bytes; total_evicted += arc_evict_impl(arc_mru_ghost, 0, target, ARC_BUFC_METADATA); /* * We assume the sum of the mru list and mfu list is less than * or equal to arc_c (we enforced this above), which means we * can use the simpler of the two equations below: * * mru + mfu + mru ghost + mfu ghost <= 2 * arc_c * mru ghost + mfu ghost <= arc_c */ target = zfs_refcount_count(&arc_mru_ghost->arcs_size) + zfs_refcount_count(&arc_mfu_ghost->arcs_size) - arc_c; bytes = arc_evict_impl(arc_mfu_ghost, 0, target, ARC_BUFC_DATA); total_evicted += bytes; target -= bytes; total_evicted += arc_evict_impl(arc_mfu_ghost, 0, target, ARC_BUFC_METADATA); return (total_evicted); } void arc_flush(spa_t *spa, boolean_t retry) { uint64_t guid = 0; /* * If retry is B_TRUE, a spa must not be specified since we have * no good way to determine if all of a spa's buffers have been * evicted from an arc state. */ ASSERT(!retry || spa == 0); if (spa != NULL) guid = spa_load_guid(spa); (void) arc_flush_state(arc_mru, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mru, guid, ARC_BUFC_METADATA, retry); (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mfu, guid, ARC_BUFC_METADATA, retry); (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mru_ghost, guid, ARC_BUFC_METADATA, retry); (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_DATA, retry); (void) arc_flush_state(arc_mfu_ghost, guid, ARC_BUFC_METADATA, retry); } void arc_reduce_target_size(int64_t to_free) { uint64_t asize = aggsum_value(&arc_size); /* * All callers want the ARC to actually evict (at least) this much * memory. Therefore we reduce from the lower of the current size and * the target size. This way, even if arc_c is much higher than * arc_size (as can be the case after many calls to arc_freed(), we will * immediately have arc_c < arc_size and therefore the arc_evict_zthr * will evict. */ uint64_t c = MIN(arc_c, asize); if (c > to_free && c - to_free > arc_c_min) { arc_c = c - to_free; atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift)); if (arc_p > arc_c) arc_p = (arc_c >> 1); ASSERT(arc_c >= arc_c_min); ASSERT((int64_t)arc_p >= 0); } else { arc_c = arc_c_min; } if (asize > arc_c) { /* See comment in arc_evict_cb_check() on why lock+flag */ mutex_enter(&arc_evict_lock); arc_evict_needed = B_TRUE; mutex_exit(&arc_evict_lock); zthr_wakeup(arc_evict_zthr); } } /* * Determine if the system is under memory pressure and is asking * to reclaim memory. A return value of B_TRUE indicates that the system * is under memory pressure and that the arc should adjust accordingly. */ boolean_t arc_reclaim_needed(void) { return (arc_available_memory() < 0); } void arc_kmem_reap_soon(void) { size_t i; kmem_cache_t *prev_cache = NULL; kmem_cache_t *prev_data_cache = NULL; extern kmem_cache_t *zio_buf_cache[]; extern kmem_cache_t *zio_data_buf_cache[]; #ifdef _KERNEL if ((aggsum_compare(&arc_meta_used, arc_meta_limit) >= 0) && zfs_arc_meta_prune) { /* * We are exceeding our meta-data cache limit. * Prune some entries to release holds on meta-data. */ arc_prune_async(zfs_arc_meta_prune); } #if defined(_ILP32) /* * Reclaim unused memory from all kmem caches. */ kmem_reap(); #endif #endif for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) { #if defined(_ILP32) /* reach upper limit of cache size on 32-bit */ if (zio_buf_cache[i] == NULL) break; #endif if (zio_buf_cache[i] != prev_cache) { prev_cache = zio_buf_cache[i]; kmem_cache_reap_now(zio_buf_cache[i]); } if (zio_data_buf_cache[i] != prev_data_cache) { prev_data_cache = zio_data_buf_cache[i]; kmem_cache_reap_now(zio_data_buf_cache[i]); } } kmem_cache_reap_now(buf_cache); kmem_cache_reap_now(hdr_full_cache); kmem_cache_reap_now(hdr_l2only_cache); kmem_cache_reap_now(zfs_btree_leaf_cache); abd_cache_reap_now(); } /* ARGSUSED */ static boolean_t arc_evict_cb_check(void *arg, zthr_t *zthr) { /* * This is necessary so that any changes which may have been made to * many of the zfs_arc_* module parameters will be propagated to * their actual internal variable counterparts. Without this, * changing those module params at runtime would have no effect. */ arc_tuning_update(B_FALSE); /* * This is necessary in order to keep the kstat information * up to date for tools that display kstat data such as the * mdb ::arc dcmd and the Linux crash utility. These tools * typically do not call kstat's update function, but simply * dump out stats from the most recent update. Without * this call, these commands may show stale stats for the * anon, mru, mru_ghost, mfu, and mfu_ghost lists. Even * with this change, the data might be up to 1 second * out of date(the arc_evict_zthr has a maximum sleep * time of 1 second); but that should suffice. The * arc_state_t structures can be queried directly if more * accurate information is needed. */ if (arc_ksp != NULL) arc_ksp->ks_update(arc_ksp, KSTAT_READ); /* * We have to rely on arc_wait_for_eviction() to tell us when to * evict, rather than checking if we are overflowing here, so that we * are sure to not leave arc_wait_for_eviction() waiting on aew_cv. * If we have become "not overflowing" since arc_wait_for_eviction() * checked, we need to wake it up. We could broadcast the CV here, * but arc_wait_for_eviction() may have not yet gone to sleep. We * would need to use a mutex to ensure that this function doesn't * broadcast until arc_wait_for_eviction() has gone to sleep (e.g. * the arc_evict_lock). However, the lock ordering of such a lock * would necessarily be incorrect with respect to the zthr_lock, * which is held before this function is called, and is held by * arc_wait_for_eviction() when it calls zthr_wakeup(). */ return (arc_evict_needed); } /* * Keep arc_size under arc_c by running arc_evict which evicts data * from the ARC. */ /* ARGSUSED */ static void arc_evict_cb(void *arg, zthr_t *zthr) { uint64_t evicted = 0; fstrans_cookie_t cookie = spl_fstrans_mark(); /* Evict from cache */ evicted = arc_evict(); /* * If evicted is zero, we couldn't evict anything * via arc_evict(). This could be due to hash lock * collisions, but more likely due to the majority of * arc buffers being unevictable. Therefore, even if * arc_size is above arc_c, another pass is unlikely to * be helpful and could potentially cause us to enter an * infinite loop. Additionally, zthr_iscancelled() is * checked here so that if the arc is shutting down, the * broadcast will wake any remaining arc evict waiters. */ mutex_enter(&arc_evict_lock); arc_evict_needed = !zthr_iscancelled(arc_evict_zthr) && evicted > 0 && aggsum_compare(&arc_size, arc_c) > 0; if (!arc_evict_needed) { /* * We're either no longer overflowing, or we * can't evict anything more, so we should wake * arc_get_data_impl() sooner. */ arc_evict_waiter_t *aw; while ((aw = list_remove_head(&arc_evict_waiters)) != NULL) { cv_broadcast(&aw->aew_cv); } arc_set_need_free(); } mutex_exit(&arc_evict_lock); spl_fstrans_unmark(cookie); } /* ARGSUSED */ static boolean_t arc_reap_cb_check(void *arg, zthr_t *zthr) { int64_t free_memory = arc_available_memory(); + static int reap_cb_check_counter = 0; /* * If a kmem reap is already active, don't schedule more. We must * check for this because kmem_cache_reap_soon() won't actually * block on the cache being reaped (this is to prevent callers from * becoming implicitly blocked by a system-wide kmem reap -- which, * on a system with many, many full magazines, can take minutes). */ if (!kmem_cache_reap_active() && free_memory < 0) { arc_no_grow = B_TRUE; arc_warm = B_TRUE; /* * Wait at least zfs_grow_retry (default 5) seconds * before considering growing. */ arc_growtime = gethrtime() + SEC2NSEC(arc_grow_retry); return (B_TRUE); } else if (free_memory < arc_c >> arc_no_grow_shift) { arc_no_grow = B_TRUE; } else if (gethrtime() >= arc_growtime) { arc_no_grow = B_FALSE; } + + /* + * Called unconditionally every 60 seconds to reclaim unused + * zstd compression and decompression context. This is done + * here to avoid the need for an independent thread. + */ + if (!((reap_cb_check_counter++) % 60)) + zfs_zstd_cache_reap_now(); return (B_FALSE); } /* * Keep enough free memory in the system by reaping the ARC's kmem * caches. To cause more slabs to be reapable, we may reduce the * target size of the cache (arc_c), causing the arc_evict_cb() * to free more buffers. */ /* ARGSUSED */ static void arc_reap_cb(void *arg, zthr_t *zthr) { int64_t free_memory; fstrans_cookie_t cookie = spl_fstrans_mark(); /* * Kick off asynchronous kmem_reap()'s of all our caches. */ arc_kmem_reap_soon(); /* * Wait at least arc_kmem_cache_reap_retry_ms between * arc_kmem_reap_soon() calls. Without this check it is possible to * end up in a situation where we spend lots of time reaping * caches, while we're near arc_c_min. Waiting here also gives the * subsequent free memory check a chance of finding that the * asynchronous reap has already freed enough memory, and we don't * need to call arc_reduce_target_size(). */ delay((hz * arc_kmem_cache_reap_retry_ms + 999) / 1000); /* * Reduce the target size as needed to maintain the amount of free * memory in the system at a fraction of the arc_size (1/128th by * default). If oversubscribed (free_memory < 0) then reduce the * target arc_size by the deficit amount plus the fractional * amount. If free memory is positive but less then the fractional * amount, reduce by what is needed to hit the fractional amount. */ free_memory = arc_available_memory(); int64_t to_free = (arc_c >> arc_shrink_shift) - free_memory; if (to_free > 0) { arc_reduce_target_size(to_free); } spl_fstrans_unmark(cookie); } #ifdef _KERNEL /* * Determine the amount of memory eligible for eviction contained in the * ARC. All clean data reported by the ghost lists can always be safely * evicted. Due to arc_c_min, the same does not hold for all clean data * contained by the regular mru and mfu lists. * * In the case of the regular mru and mfu lists, we need to report as * much clean data as possible, such that evicting that same reported * data will not bring arc_size below arc_c_min. Thus, in certain * circumstances, the total amount of clean data in the mru and mfu * lists might not actually be evictable. * * The following two distinct cases are accounted for: * * 1. The sum of the amount of dirty data contained by both the mru and * mfu lists, plus the ARC's other accounting (e.g. the anon list), * is greater than or equal to arc_c_min. * (i.e. amount of dirty data >= arc_c_min) * * This is the easy case; all clean data contained by the mru and mfu * lists is evictable. Evicting all clean data can only drop arc_size * to the amount of dirty data, which is greater than arc_c_min. * * 2. The sum of the amount of dirty data contained by both the mru and * mfu lists, plus the ARC's other accounting (e.g. the anon list), * is less than arc_c_min. * (i.e. arc_c_min > amount of dirty data) * * 2.1. arc_size is greater than or equal arc_c_min. * (i.e. arc_size >= arc_c_min > amount of dirty data) * * In this case, not all clean data from the regular mru and mfu * lists is actually evictable; we must leave enough clean data * to keep arc_size above arc_c_min. Thus, the maximum amount of * evictable data from the two lists combined, is exactly the * difference between arc_size and arc_c_min. * * 2.2. arc_size is less than arc_c_min * (i.e. arc_c_min > arc_size > amount of dirty data) * * In this case, none of the data contained in the mru and mfu * lists is evictable, even if it's clean. Since arc_size is * already below arc_c_min, evicting any more would only * increase this negative difference. */ #endif /* _KERNEL */ /* * Adapt arc info given the number of bytes we are trying to add and * the state that we are coming from. This function is only called * when we are adding new content to the cache. */ static void arc_adapt(int bytes, arc_state_t *state) { int mult; uint64_t arc_p_min = (arc_c >> arc_p_min_shift); int64_t mrug_size = zfs_refcount_count(&arc_mru_ghost->arcs_size); int64_t mfug_size = zfs_refcount_count(&arc_mfu_ghost->arcs_size); ASSERT(bytes > 0); /* * Adapt the target size of the MRU list: * - if we just hit in the MRU ghost list, then increase * the target size of the MRU list. * - if we just hit in the MFU ghost list, then increase * the target size of the MFU list by decreasing the * target size of the MRU list. */ if (state == arc_mru_ghost) { mult = (mrug_size >= mfug_size) ? 1 : (mfug_size / mrug_size); if (!zfs_arc_p_dampener_disable) mult = MIN(mult, 10); /* avoid wild arc_p adjustment */ arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult); } else if (state == arc_mfu_ghost) { uint64_t delta; mult = (mfug_size >= mrug_size) ? 1 : (mrug_size / mfug_size); if (!zfs_arc_p_dampener_disable) mult = MIN(mult, 10); delta = MIN(bytes * mult, arc_p); arc_p = MAX(arc_p_min, arc_p - delta); } ASSERT((int64_t)arc_p >= 0); /* * Wake reap thread if we do not have any available memory */ if (arc_reclaim_needed()) { zthr_wakeup(arc_reap_zthr); return; } if (arc_no_grow) return; if (arc_c >= arc_c_max) return; /* * If we're within (2 * maxblocksize) bytes of the target * cache size, increment the target cache size */ ASSERT3U(arc_c, >=, 2ULL << SPA_MAXBLOCKSHIFT); if (aggsum_upper_bound(&arc_size) >= arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) { atomic_add_64(&arc_c, (int64_t)bytes); if (arc_c > arc_c_max) arc_c = arc_c_max; else if (state == arc_anon) atomic_add_64(&arc_p, (int64_t)bytes); if (arc_p > arc_c) arc_p = arc_c; } ASSERT((int64_t)arc_p >= 0); } /* * Check if arc_size has grown past our upper threshold, determined by * zfs_arc_overflow_shift. */ boolean_t arc_is_overflowing(void) { /* Always allow at least one block of overflow */ int64_t overflow = MAX(SPA_MAXBLOCKSIZE, arc_c >> zfs_arc_overflow_shift); /* * We just compare the lower bound here for performance reasons. Our * primary goals are to make sure that the arc never grows without * bound, and that it can reach its maximum size. This check * accomplishes both goals. The maximum amount we could run over by is * 2 * aggsum_borrow_multiplier * NUM_CPUS * the average size of a block * in the ARC. In practice, that's in the tens of MB, which is low * enough to be safe. */ return (aggsum_lower_bound(&arc_size) >= (int64_t)arc_c + overflow); } static abd_t * arc_get_data_abd(arc_buf_hdr_t *hdr, uint64_t size, void *tag, boolean_t do_adapt) { arc_buf_contents_t type = arc_buf_type(hdr); arc_get_data_impl(hdr, size, tag, do_adapt); if (type == ARC_BUFC_METADATA) { return (abd_alloc(size, B_TRUE)); } else { ASSERT(type == ARC_BUFC_DATA); return (abd_alloc(size, B_FALSE)); } } static void * arc_get_data_buf(arc_buf_hdr_t *hdr, uint64_t size, void *tag) { arc_buf_contents_t type = arc_buf_type(hdr); arc_get_data_impl(hdr, size, tag, B_TRUE); if (type == ARC_BUFC_METADATA) { return (zio_buf_alloc(size)); } else { ASSERT(type == ARC_BUFC_DATA); return (zio_data_buf_alloc(size)); } } /* * Wait for the specified amount of data (in bytes) to be evicted from the * ARC, and for there to be sufficient free memory in the system. Waiting for * eviction ensures that the memory used by the ARC decreases. Waiting for * free memory ensures that the system won't run out of free pages, regardless * of ARC behavior and settings. See arc_lowmem_init(). */ void arc_wait_for_eviction(uint64_t amount) { mutex_enter(&arc_evict_lock); if (arc_is_overflowing()) { arc_evict_needed = B_TRUE; zthr_wakeup(arc_evict_zthr); if (amount != 0) { arc_evict_waiter_t aw; list_link_init(&aw.aew_node); cv_init(&aw.aew_cv, NULL, CV_DEFAULT, NULL); arc_evict_waiter_t *last = list_tail(&arc_evict_waiters); if (last != NULL) { ASSERT3U(last->aew_count, >, arc_evict_count); aw.aew_count = last->aew_count + amount; } else { aw.aew_count = arc_evict_count + amount; } list_insert_tail(&arc_evict_waiters, &aw); arc_set_need_free(); DTRACE_PROBE3(arc__wait__for__eviction, uint64_t, amount, uint64_t, arc_evict_count, uint64_t, aw.aew_count); /* * We will be woken up either when arc_evict_count * reaches aew_count, or when the ARC is no longer * overflowing and eviction completes. */ cv_wait(&aw.aew_cv, &arc_evict_lock); /* * In case of "false" wakeup, we will still be on the * list. */ if (list_link_active(&aw.aew_node)) list_remove(&arc_evict_waiters, &aw); cv_destroy(&aw.aew_cv); } } mutex_exit(&arc_evict_lock); } /* * Allocate a block and return it to the caller. If we are hitting the * hard limit for the cache size, we must sleep, waiting for the eviction * thread to catch up. If we're past the target size but below the hard * limit, we'll only signal the reclaim thread and continue on. */ static void arc_get_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag, boolean_t do_adapt) { arc_state_t *state = hdr->b_l1hdr.b_state; arc_buf_contents_t type = arc_buf_type(hdr); if (do_adapt) arc_adapt(size, state); /* * If arc_size is currently overflowing, we must be adding data * faster than we are evicting. To ensure we don't compound the * problem by adding more data and forcing arc_size to grow even * further past it's target size, we wait for the eviction thread to * make some progress. We also wait for there to be sufficient free * memory in the system, as measured by arc_free_memory(). * * Specifically, we wait for zfs_arc_eviction_pct percent of the * requested size to be evicted. This should be more than 100%, to * ensure that that progress is also made towards getting arc_size * under arc_c. See the comment above zfs_arc_eviction_pct. * * We do the overflowing check without holding the arc_evict_lock to * reduce lock contention in this hot path. Note that * arc_wait_for_eviction() will acquire the lock and check again to * ensure we are truly overflowing before blocking. */ if (arc_is_overflowing()) { arc_wait_for_eviction(size * zfs_arc_eviction_pct / 100); } VERIFY3U(hdr->b_type, ==, type); if (type == ARC_BUFC_METADATA) { arc_space_consume(size, ARC_SPACE_META); } else { arc_space_consume(size, ARC_SPACE_DATA); } /* * Update the state size. Note that ghost states have a * "ghost size" and so don't need to be updated. */ if (!GHOST_STATE(state)) { (void) zfs_refcount_add_many(&state->arcs_size, size, tag); /* * If this is reached via arc_read, the link is * protected by the hash lock. If reached via * arc_buf_alloc, the header should not be accessed by * any other thread. And, if reached via arc_read_done, * the hash lock will protect it if it's found in the * hash table; otherwise no other thread should be * trying to [add|remove]_reference it. */ if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); (void) zfs_refcount_add_many(&state->arcs_esize[type], size, tag); } /* * If we are growing the cache, and we are adding anonymous * data, and we have outgrown arc_p, update arc_p */ if (aggsum_upper_bound(&arc_size) < arc_c && hdr->b_l1hdr.b_state == arc_anon && (zfs_refcount_count(&arc_anon->arcs_size) + zfs_refcount_count(&arc_mru->arcs_size) > arc_p)) arc_p = MIN(arc_c, arc_p + size); } } static void arc_free_data_abd(arc_buf_hdr_t *hdr, abd_t *abd, uint64_t size, void *tag) { arc_free_data_impl(hdr, size, tag); abd_free(abd); } static void arc_free_data_buf(arc_buf_hdr_t *hdr, void *buf, uint64_t size, void *tag) { arc_buf_contents_t type = arc_buf_type(hdr); arc_free_data_impl(hdr, size, tag); if (type == ARC_BUFC_METADATA) { zio_buf_free(buf, size); } else { ASSERT(type == ARC_BUFC_DATA); zio_data_buf_free(buf, size); } } /* * Free the arc data buffer. */ static void arc_free_data_impl(arc_buf_hdr_t *hdr, uint64_t size, void *tag) { arc_state_t *state = hdr->b_l1hdr.b_state; arc_buf_contents_t type = arc_buf_type(hdr); /* protected by hash lock, if in the hash table */ if (multilist_link_active(&hdr->b_l1hdr.b_arc_node)) { ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); ASSERT(state != arc_anon && state != arc_l2c_only); (void) zfs_refcount_remove_many(&state->arcs_esize[type], size, tag); } (void) zfs_refcount_remove_many(&state->arcs_size, size, tag); VERIFY3U(hdr->b_type, ==, type); if (type == ARC_BUFC_METADATA) { arc_space_return(size, ARC_SPACE_META); } else { ASSERT(type == ARC_BUFC_DATA); arc_space_return(size, ARC_SPACE_DATA); } } /* * This routine is called whenever a buffer is accessed. * NOTE: the hash lock is dropped in this function. */ static void arc_access(arc_buf_hdr_t *hdr, kmutex_t *hash_lock) { clock_t now; ASSERT(MUTEX_HELD(hash_lock)); ASSERT(HDR_HAS_L1HDR(hdr)); if (hdr->b_l1hdr.b_state == arc_anon) { /* * This buffer is not in the cache, and does not * appear in our "ghost" list. Add the new buffer * to the MRU state. */ ASSERT0(hdr->b_l1hdr.b_arc_access); hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr); arc_change_state(arc_mru, hdr, hash_lock); } else if (hdr->b_l1hdr.b_state == arc_mru) { now = ddi_get_lbolt(); /* * If this buffer is here because of a prefetch, then either: * - clear the flag if this is a "referencing" read * (any subsequent access will bump this into the MFU state). * or * - move the buffer to the head of the list if this is * another prefetch (to make it less likely to be evicted). */ if (HDR_PREFETCH(hdr) || HDR_PRESCIENT_PREFETCH(hdr)) { if (zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) { /* link protected by hash lock */ ASSERT(multilist_link_active( &hdr->b_l1hdr.b_arc_node)); } else { arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH); atomic_inc_32(&hdr->b_l1hdr.b_mru_hits); ARCSTAT_BUMP(arcstat_mru_hits); } hdr->b_l1hdr.b_arc_access = now; return; } /* * This buffer has been "accessed" only once so far, * but it is still in the cache. Move it to the MFU * state. */ if (ddi_time_after(now, hdr->b_l1hdr.b_arc_access + ARC_MINTIME)) { /* * More than 125ms have passed since we * instantiated this buffer. Move it to the * most frequently used state. */ hdr->b_l1hdr.b_arc_access = now; DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); arc_change_state(arc_mfu, hdr, hash_lock); } atomic_inc_32(&hdr->b_l1hdr.b_mru_hits); ARCSTAT_BUMP(arcstat_mru_hits); } else if (hdr->b_l1hdr.b_state == arc_mru_ghost) { arc_state_t *new_state; /* * This buffer has been "accessed" recently, but * was evicted from the cache. Move it to the * MFU state. */ if (HDR_PREFETCH(hdr) || HDR_PRESCIENT_PREFETCH(hdr)) { new_state = arc_mru; if (zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) > 0) { arc_hdr_clear_flags(hdr, ARC_FLAG_PREFETCH | ARC_FLAG_PRESCIENT_PREFETCH); } DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, hdr); } else { new_state = arc_mfu; DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); } hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); arc_change_state(new_state, hdr, hash_lock); atomic_inc_32(&hdr->b_l1hdr.b_mru_ghost_hits); ARCSTAT_BUMP(arcstat_mru_ghost_hits); } else if (hdr->b_l1hdr.b_state == arc_mfu) { /* * This buffer has been accessed more than once and is * still in the cache. Keep it in the MFU state. * * NOTE: an add_reference() that occurred when we did * the arc_read() will have kicked this off the list. * If it was a prefetch, we will explicitly move it to * the head of the list now. */ atomic_inc_32(&hdr->b_l1hdr.b_mfu_hits); ARCSTAT_BUMP(arcstat_mfu_hits); hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); } else if (hdr->b_l1hdr.b_state == arc_mfu_ghost) { arc_state_t *new_state = arc_mfu; /* * This buffer has been accessed more than once but has * been evicted from the cache. Move it back to the * MFU state. */ if (HDR_PREFETCH(hdr) || HDR_PRESCIENT_PREFETCH(hdr)) { /* * This is a prefetch access... * move this block back to the MRU state. */ new_state = arc_mru; } hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); arc_change_state(new_state, hdr, hash_lock); atomic_inc_32(&hdr->b_l1hdr.b_mfu_ghost_hits); ARCSTAT_BUMP(arcstat_mfu_ghost_hits); } else if (hdr->b_l1hdr.b_state == arc_l2c_only) { /* * This buffer is on the 2nd Level ARC. */ hdr->b_l1hdr.b_arc_access = ddi_get_lbolt(); DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, hdr); arc_change_state(arc_mfu, hdr, hash_lock); } else { cmn_err(CE_PANIC, "invalid arc state 0x%p", hdr->b_l1hdr.b_state); } } /* * This routine is called by dbuf_hold() to update the arc_access() state * which otherwise would be skipped for entries in the dbuf cache. */ void arc_buf_access(arc_buf_t *buf) { mutex_enter(&buf->b_evict_lock); arc_buf_hdr_t *hdr = buf->b_hdr; /* * Avoid taking the hash_lock when possible as an optimization. * The header must be checked again under the hash_lock in order * to handle the case where it is concurrently being released. */ if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) { mutex_exit(&buf->b_evict_lock); return; } kmutex_t *hash_lock = HDR_LOCK(hdr); mutex_enter(hash_lock); if (hdr->b_l1hdr.b_state == arc_anon || HDR_EMPTY(hdr)) { mutex_exit(hash_lock); mutex_exit(&buf->b_evict_lock); ARCSTAT_BUMP(arcstat_access_skip); return; } mutex_exit(&buf->b_evict_lock); ASSERT(hdr->b_l1hdr.b_state == arc_mru || hdr->b_l1hdr.b_state == arc_mfu); DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr); arc_access(hdr, hash_lock); mutex_exit(hash_lock); ARCSTAT_BUMP(arcstat_hits); ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr) && !HDR_PRESCIENT_PREFETCH(hdr), demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data, metadata, hits); } /* a generic arc_read_done_func_t which you can use */ /* ARGSUSED */ void arc_bcopy_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp, arc_buf_t *buf, void *arg) { if (buf == NULL) return; bcopy(buf->b_data, arg, arc_buf_size(buf)); arc_buf_destroy(buf, arg); } /* a generic arc_read_done_func_t */ /* ARGSUSED */ void arc_getbuf_func(zio_t *zio, const zbookmark_phys_t *zb, const blkptr_t *bp, arc_buf_t *buf, void *arg) { arc_buf_t **bufp = arg; if (buf == NULL) { ASSERT(zio == NULL || zio->io_error != 0); *bufp = NULL; } else { ASSERT(zio == NULL || zio->io_error == 0); *bufp = buf; ASSERT(buf->b_data != NULL); } } static void arc_hdr_verify(arc_buf_hdr_t *hdr, blkptr_t *bp) { if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) { ASSERT3U(HDR_GET_PSIZE(hdr), ==, 0); ASSERT3U(arc_hdr_get_compress(hdr), ==, ZIO_COMPRESS_OFF); } else { if (HDR_COMPRESSION_ENABLED(hdr)) { ASSERT3U(arc_hdr_get_compress(hdr), ==, BP_GET_COMPRESS(bp)); } ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp)); ASSERT3U(HDR_GET_PSIZE(hdr), ==, BP_GET_PSIZE(bp)); ASSERT3U(!!HDR_PROTECTED(hdr), ==, BP_IS_PROTECTED(bp)); } } static void arc_read_done(zio_t *zio) { blkptr_t *bp = zio->io_bp; arc_buf_hdr_t *hdr = zio->io_private; kmutex_t *hash_lock = NULL; arc_callback_t *callback_list; arc_callback_t *acb; boolean_t freeable = B_FALSE; /* * The hdr was inserted into hash-table and removed from lists * prior to starting I/O. We should find this header, since * it's in the hash table, and it should be legit since it's * not possible to evict it during the I/O. The only possible * reason for it not to be found is if we were freed during the * read. */ if (HDR_IN_HASH_TABLE(hdr)) { arc_buf_hdr_t *found; ASSERT3U(hdr->b_birth, ==, BP_PHYSICAL_BIRTH(zio->io_bp)); ASSERT3U(hdr->b_dva.dva_word[0], ==, BP_IDENTITY(zio->io_bp)->dva_word[0]); ASSERT3U(hdr->b_dva.dva_word[1], ==, BP_IDENTITY(zio->io_bp)->dva_word[1]); found = buf_hash_find(hdr->b_spa, zio->io_bp, &hash_lock); ASSERT((found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) || (found == hdr && HDR_L2_READING(hdr))); ASSERT3P(hash_lock, !=, NULL); } if (BP_IS_PROTECTED(bp)) { hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp); hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset; zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv); if (BP_GET_TYPE(bp) == DMU_OT_INTENT_LOG) { void *tmpbuf; tmpbuf = abd_borrow_buf_copy(zio->io_abd, sizeof (zil_chain_t)); zio_crypt_decode_mac_zil(tmpbuf, hdr->b_crypt_hdr.b_mac); abd_return_buf(zio->io_abd, tmpbuf, sizeof (zil_chain_t)); } else { zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac); } } if (zio->io_error == 0) { /* byteswap if necessary */ if (BP_SHOULD_BYTESWAP(zio->io_bp)) { if (BP_GET_LEVEL(zio->io_bp) > 0) { hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64; } else { hdr->b_l1hdr.b_byteswap = DMU_OT_BYTESWAP(BP_GET_TYPE(zio->io_bp)); } } else { hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; } if (!HDR_L2_READING(hdr)) { hdr->b_complevel = zio->io_prop.zp_complevel; } } arc_hdr_clear_flags(hdr, ARC_FLAG_L2_EVICTED); if (l2arc_noprefetch && HDR_PREFETCH(hdr)) arc_hdr_clear_flags(hdr, ARC_FLAG_L2CACHE); callback_list = hdr->b_l1hdr.b_acb; ASSERT3P(callback_list, !=, NULL); if (hash_lock && zio->io_error == 0 && hdr->b_l1hdr.b_state == arc_anon) { /* * Only call arc_access on anonymous buffers. This is because * if we've issued an I/O for an evicted buffer, we've already * called arc_access (to prevent any simultaneous readers from * getting confused). */ arc_access(hdr, hash_lock); } /* * If a read request has a callback (i.e. acb_done is not NULL), then we * make a buf containing the data according to the parameters which were * passed in. The implementation of arc_buf_alloc_impl() ensures that we * aren't needlessly decompressing the data multiple times. */ int callback_cnt = 0; for (acb = callback_list; acb != NULL; acb = acb->acb_next) { if (!acb->acb_done) continue; callback_cnt++; if (zio->io_error != 0) continue; int error = arc_buf_alloc_impl(hdr, zio->io_spa, &acb->acb_zb, acb->acb_private, acb->acb_encrypted, acb->acb_compressed, acb->acb_noauth, B_TRUE, &acb->acb_buf); /* * Assert non-speculative zios didn't fail because an * encryption key wasn't loaded */ ASSERT((zio->io_flags & ZIO_FLAG_SPECULATIVE) || error != EACCES); /* * If we failed to decrypt, report an error now (as the zio * layer would have done if it had done the transforms). */ if (error == ECKSUM) { ASSERT(BP_IS_PROTECTED(bp)); error = SET_ERROR(EIO); if ((zio->io_flags & ZIO_FLAG_SPECULATIVE) == 0) { spa_log_error(zio->io_spa, &acb->acb_zb); (void) zfs_ereport_post( FM_EREPORT_ZFS_AUTHENTICATION, zio->io_spa, NULL, &acb->acb_zb, zio, 0); } } if (error != 0) { /* * Decompression or decryption failed. Set * io_error so that when we call acb_done * (below), we will indicate that the read * failed. Note that in the unusual case * where one callback is compressed and another * uncompressed, we will mark all of them * as failed, even though the uncompressed * one can't actually fail. In this case, * the hdr will not be anonymous, because * if there are multiple callbacks, it's * because multiple threads found the same * arc buf in the hash table. */ zio->io_error = error; } } /* * If there are multiple callbacks, we must have the hash lock, * because the only way for multiple threads to find this hdr is * in the hash table. This ensures that if there are multiple * callbacks, the hdr is not anonymous. If it were anonymous, * we couldn't use arc_buf_destroy() in the error case below. */ ASSERT(callback_cnt < 2 || hash_lock != NULL); hdr->b_l1hdr.b_acb = NULL; arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); if (callback_cnt == 0) ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); ASSERT(zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt) || callback_list != NULL); if (zio->io_error == 0) { arc_hdr_verify(hdr, zio->io_bp); } else { arc_hdr_set_flags(hdr, ARC_FLAG_IO_ERROR); if (hdr->b_l1hdr.b_state != arc_anon) arc_change_state(arc_anon, hdr, hash_lock); if (HDR_IN_HASH_TABLE(hdr)) buf_hash_remove(hdr); freeable = zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt); } /* * Broadcast before we drop the hash_lock to avoid the possibility * that the hdr (and hence the cv) might be freed before we get to * the cv_broadcast(). */ cv_broadcast(&hdr->b_l1hdr.b_cv); if (hash_lock != NULL) { mutex_exit(hash_lock); } else { /* * This block was freed while we waited for the read to * complete. It has been removed from the hash table and * moved to the anonymous state (so that it won't show up * in the cache). */ ASSERT3P(hdr->b_l1hdr.b_state, ==, arc_anon); freeable = zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt); } /* execute each callback and free its structure */ while ((acb = callback_list) != NULL) { if (acb->acb_done != NULL) { if (zio->io_error != 0 && acb->acb_buf != NULL) { /* * If arc_buf_alloc_impl() fails during * decompression, the buf will still be * allocated, and needs to be freed here. */ arc_buf_destroy(acb->acb_buf, acb->acb_private); acb->acb_buf = NULL; } acb->acb_done(zio, &zio->io_bookmark, zio->io_bp, acb->acb_buf, acb->acb_private); } if (acb->acb_zio_dummy != NULL) { acb->acb_zio_dummy->io_error = zio->io_error; zio_nowait(acb->acb_zio_dummy); } callback_list = acb->acb_next; kmem_free(acb, sizeof (arc_callback_t)); } if (freeable) arc_hdr_destroy(hdr); } /* * "Read" the block at the specified DVA (in bp) via the * cache. If the block is found in the cache, invoke the provided * callback immediately and return. Note that the `zio' parameter * in the callback will be NULL in this case, since no IO was * required. If the block is not in the cache pass the read request * on to the spa with a substitute callback function, so that the * requested block will be added to the cache. * * If a read request arrives for a block that has a read in-progress, * either wait for the in-progress read to complete (and return the * results); or, if this is a read with a "done" func, add a record * to the read to invoke the "done" func when the read completes, * and return; or just return. * * arc_read_done() will invoke all the requested "done" functions * for readers of this block. */ int arc_read(zio_t *pio, spa_t *spa, const blkptr_t *bp, arc_read_done_func_t *done, void *private, zio_priority_t priority, int zio_flags, arc_flags_t *arc_flags, const zbookmark_phys_t *zb) { arc_buf_hdr_t *hdr = NULL; kmutex_t *hash_lock = NULL; zio_t *rzio; uint64_t guid = spa_load_guid(spa); boolean_t compressed_read = (zio_flags & ZIO_FLAG_RAW_COMPRESS) != 0; boolean_t encrypted_read = BP_IS_ENCRYPTED(bp) && (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0; boolean_t noauth_read = BP_IS_AUTHENTICATED(bp) && (zio_flags & ZIO_FLAG_RAW_ENCRYPT) != 0; boolean_t embedded_bp = !!BP_IS_EMBEDDED(bp); int rc = 0; ASSERT(!embedded_bp || BPE_GET_ETYPE(bp) == BP_EMBEDDED_TYPE_DATA); ASSERT(!BP_IS_HOLE(bp)); ASSERT(!BP_IS_REDACTED(bp)); /* * Normally SPL_FSTRANS will already be set since kernel threads which * expect to call the DMU interfaces will set it when created. System * calls are similarly handled by setting/cleaning the bit in the * registered callback (module/os/.../zfs/zpl_*). * * External consumers such as Lustre which call the exported DMU * interfaces may not have set SPL_FSTRANS. To avoid a deadlock * on the hash_lock always set and clear the bit. */ fstrans_cookie_t cookie = spl_fstrans_mark(); top: if (!embedded_bp) { /* * Embedded BP's have no DVA and require no I/O to "read". * Create an anonymous arc buf to back it. */ hdr = buf_hash_find(guid, bp, &hash_lock); } /* * Determine if we have an L1 cache hit or a cache miss. For simplicity * we maintain encrypted data separately from compressed / uncompressed * data. If the user is requesting raw encrypted data and we don't have * that in the header we will read from disk to guarantee that we can * get it even if the encryption keys aren't loaded. */ if (hdr != NULL && HDR_HAS_L1HDR(hdr) && (HDR_HAS_RABD(hdr) || (hdr->b_l1hdr.b_pabd != NULL && !encrypted_read))) { arc_buf_t *buf = NULL; *arc_flags |= ARC_FLAG_CACHED; if (HDR_IO_IN_PROGRESS(hdr)) { zio_t *head_zio = hdr->b_l1hdr.b_acb->acb_zio_head; if (*arc_flags & ARC_FLAG_CACHED_ONLY) { mutex_exit(hash_lock); ARCSTAT_BUMP(arcstat_cached_only_in_progress); rc = SET_ERROR(ENOENT); goto out; } ASSERT3P(head_zio, !=, NULL); if ((hdr->b_flags & ARC_FLAG_PRIO_ASYNC_READ) && priority == ZIO_PRIORITY_SYNC_READ) { /* * This is a sync read that needs to wait for * an in-flight async read. Request that the * zio have its priority upgraded. */ zio_change_priority(head_zio, priority); DTRACE_PROBE1(arc__async__upgrade__sync, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_async_upgrade_sync); } if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) { arc_hdr_clear_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH); } if (*arc_flags & ARC_FLAG_WAIT) { cv_wait(&hdr->b_l1hdr.b_cv, hash_lock); mutex_exit(hash_lock); goto top; } ASSERT(*arc_flags & ARC_FLAG_NOWAIT); if (done) { arc_callback_t *acb = NULL; acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP); acb->acb_done = done; acb->acb_private = private; acb->acb_compressed = compressed_read; acb->acb_encrypted = encrypted_read; acb->acb_noauth = noauth_read; acb->acb_zb = *zb; if (pio != NULL) acb->acb_zio_dummy = zio_null(pio, spa, NULL, NULL, NULL, zio_flags); ASSERT3P(acb->acb_done, !=, NULL); acb->acb_zio_head = head_zio; acb->acb_next = hdr->b_l1hdr.b_acb; hdr->b_l1hdr.b_acb = acb; mutex_exit(hash_lock); goto out; } mutex_exit(hash_lock); goto out; } ASSERT(hdr->b_l1hdr.b_state == arc_mru || hdr->b_l1hdr.b_state == arc_mfu); if (done) { if (hdr->b_flags & ARC_FLAG_PREDICTIVE_PREFETCH) { /* * This is a demand read which does not have to * wait for i/o because we did a predictive * prefetch i/o for it, which has completed. */ DTRACE_PROBE1( arc__demand__hit__predictive__prefetch, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP( arcstat_demand_hit_predictive_prefetch); arc_hdr_clear_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH); } if (hdr->b_flags & ARC_FLAG_PRESCIENT_PREFETCH) { ARCSTAT_BUMP( arcstat_demand_hit_prescient_prefetch); arc_hdr_clear_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH); } ASSERT(!embedded_bp || !BP_IS_HOLE(bp)); /* Get a buf with the desired data in it. */ rc = arc_buf_alloc_impl(hdr, spa, zb, private, encrypted_read, compressed_read, noauth_read, B_TRUE, &buf); if (rc == ECKSUM) { /* * Convert authentication and decryption errors * to EIO (and generate an ereport if needed) * before leaving the ARC. */ rc = SET_ERROR(EIO); if ((zio_flags & ZIO_FLAG_SPECULATIVE) == 0) { spa_log_error(spa, zb); (void) zfs_ereport_post( FM_EREPORT_ZFS_AUTHENTICATION, spa, NULL, zb, NULL, 0); } } if (rc != 0) { (void) remove_reference(hdr, hash_lock, private); arc_buf_destroy_impl(buf); buf = NULL; } /* assert any errors weren't due to unloaded keys */ ASSERT((zio_flags & ZIO_FLAG_SPECULATIVE) || rc != EACCES); } else if (*arc_flags & ARC_FLAG_PREFETCH && zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 0) { arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH); } DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr); arc_access(hdr, hash_lock); if (*arc_flags & ARC_FLAG_PRESCIENT_PREFETCH) arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH); if (*arc_flags & ARC_FLAG_L2CACHE) arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); mutex_exit(hash_lock); ARCSTAT_BUMP(arcstat_hits); ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr), demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data, metadata, hits); if (done) done(NULL, zb, bp, buf, private); } else { uint64_t lsize = BP_GET_LSIZE(bp); uint64_t psize = BP_GET_PSIZE(bp); arc_callback_t *acb; vdev_t *vd = NULL; uint64_t addr = 0; boolean_t devw = B_FALSE; uint64_t size; abd_t *hdr_abd; int alloc_flags = encrypted_read ? ARC_HDR_ALLOC_RDATA : 0; if (*arc_flags & ARC_FLAG_CACHED_ONLY) { rc = SET_ERROR(ENOENT); if (hash_lock != NULL) mutex_exit(hash_lock); goto out; } /* * Gracefully handle a damaged logical block size as a * checksum error. */ if (lsize > spa_maxblocksize(spa)) { rc = SET_ERROR(ECKSUM); if (hash_lock != NULL) mutex_exit(hash_lock); goto out; } if (hdr == NULL) { /* * This block is not in the cache or it has * embedded data. */ arc_buf_hdr_t *exists = NULL; arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp); hdr = arc_hdr_alloc(spa_load_guid(spa), psize, lsize, BP_IS_PROTECTED(bp), BP_GET_COMPRESS(bp), 0, type, encrypted_read); if (!embedded_bp) { hdr->b_dva = *BP_IDENTITY(bp); hdr->b_birth = BP_PHYSICAL_BIRTH(bp); exists = buf_hash_insert(hdr, &hash_lock); } if (exists != NULL) { /* somebody beat us to the hash insert */ mutex_exit(hash_lock); buf_discard_identity(hdr); arc_hdr_destroy(hdr); goto top; /* restart the IO request */ } } else { /* * This block is in the ghost cache or encrypted data * was requested and we didn't have it. If it was * L2-only (and thus didn't have an L1 hdr), * we realloc the header to add an L1 hdr. */ if (!HDR_HAS_L1HDR(hdr)) { hdr = arc_hdr_realloc(hdr, hdr_l2only_cache, hdr_full_cache); } if (GHOST_STATE(hdr->b_l1hdr.b_state)) { ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT0(zfs_refcount_count( &hdr->b_l1hdr.b_refcnt)); ASSERT3P(hdr->b_l1hdr.b_buf, ==, NULL); ASSERT3P(hdr->b_l1hdr.b_freeze_cksum, ==, NULL); } else if (HDR_IO_IN_PROGRESS(hdr)) { /* * If this header already had an IO in progress * and we are performing another IO to fetch * encrypted data we must wait until the first * IO completes so as not to confuse * arc_read_done(). This should be very rare * and so the performance impact shouldn't * matter. */ cv_wait(&hdr->b_l1hdr.b_cv, hash_lock); mutex_exit(hash_lock); goto top; } /* * This is a delicate dance that we play here. * This hdr might be in the ghost list so we access * it to move it out of the ghost list before we * initiate the read. If it's a prefetch then * it won't have a callback so we'll remove the * reference that arc_buf_alloc_impl() created. We * do this after we've called arc_access() to * avoid hitting an assert in remove_reference(). */ arc_adapt(arc_hdr_size(hdr), hdr->b_l1hdr.b_state); arc_access(hdr, hash_lock); arc_hdr_alloc_abd(hdr, alloc_flags); } if (encrypted_read) { ASSERT(HDR_HAS_RABD(hdr)); size = HDR_GET_PSIZE(hdr); hdr_abd = hdr->b_crypt_hdr.b_rabd; zio_flags |= ZIO_FLAG_RAW; } else { ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); size = arc_hdr_size(hdr); hdr_abd = hdr->b_l1hdr.b_pabd; if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF) { zio_flags |= ZIO_FLAG_RAW_COMPRESS; } /* * For authenticated bp's, we do not ask the ZIO layer * to authenticate them since this will cause the entire * IO to fail if the key isn't loaded. Instead, we * defer authentication until arc_buf_fill(), which will * verify the data when the key is available. */ if (BP_IS_AUTHENTICATED(bp)) zio_flags |= ZIO_FLAG_RAW_ENCRYPT; } if (*arc_flags & ARC_FLAG_PREFETCH && zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) arc_hdr_set_flags(hdr, ARC_FLAG_PREFETCH); if (*arc_flags & ARC_FLAG_PRESCIENT_PREFETCH) arc_hdr_set_flags(hdr, ARC_FLAG_PRESCIENT_PREFETCH); if (*arc_flags & ARC_FLAG_L2CACHE) arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); if (BP_IS_AUTHENTICATED(bp)) arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH); if (BP_GET_LEVEL(bp) > 0) arc_hdr_set_flags(hdr, ARC_FLAG_INDIRECT); if (*arc_flags & ARC_FLAG_PREDICTIVE_PREFETCH) arc_hdr_set_flags(hdr, ARC_FLAG_PREDICTIVE_PREFETCH); ASSERT(!GHOST_STATE(hdr->b_l1hdr.b_state)); acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP); acb->acb_done = done; acb->acb_private = private; acb->acb_compressed = compressed_read; acb->acb_encrypted = encrypted_read; acb->acb_noauth = noauth_read; acb->acb_zb = *zb; ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); hdr->b_l1hdr.b_acb = acb; arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); if (HDR_HAS_L2HDR(hdr) && (vd = hdr->b_l2hdr.b_dev->l2ad_vdev) != NULL) { devw = hdr->b_l2hdr.b_dev->l2ad_writing; addr = hdr->b_l2hdr.b_daddr; /* * Lock out L2ARC device removal. */ if (vdev_is_dead(vd) || !spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER)) vd = NULL; } /* * We count both async reads and scrub IOs as asynchronous so * that both can be upgraded in the event of a cache hit while * the read IO is still in-flight. */ if (priority == ZIO_PRIORITY_ASYNC_READ || priority == ZIO_PRIORITY_SCRUB) arc_hdr_set_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ); else arc_hdr_clear_flags(hdr, ARC_FLAG_PRIO_ASYNC_READ); /* * At this point, we have a level 1 cache miss or a blkptr * with embedded data. Try again in L2ARC if possible. */ ASSERT3U(HDR_GET_LSIZE(hdr), ==, lsize); /* * Skip ARC stat bump for block pointers with embedded * data. The data are read from the blkptr itself via * decode_embedded_bp_compressed(). */ if (!embedded_bp) { DTRACE_PROBE4(arc__miss, arc_buf_hdr_t *, hdr, blkptr_t *, bp, uint64_t, lsize, zbookmark_phys_t *, zb); ARCSTAT_BUMP(arcstat_misses); ARCSTAT_CONDSTAT(!HDR_PREFETCH(hdr), demand, prefetch, !HDR_ISTYPE_METADATA(hdr), data, metadata, misses); } if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) { /* * Read from the L2ARC if the following are true: * 1. The L2ARC vdev was previously cached. * 2. This buffer still has L2ARC metadata. * 3. This buffer isn't currently writing to the L2ARC. * 4. The L2ARC entry wasn't evicted, which may * also have invalidated the vdev. * 5. This isn't prefetch and l2arc_noprefetch is set. */ if (HDR_HAS_L2HDR(hdr) && !HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) && !(l2arc_noprefetch && HDR_PREFETCH(hdr))) { l2arc_read_callback_t *cb; abd_t *abd; uint64_t asize; DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_l2_hits); atomic_inc_32(&hdr->b_l2hdr.b_hits); cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP); cb->l2rcb_hdr = hdr; cb->l2rcb_bp = *bp; cb->l2rcb_zb = *zb; cb->l2rcb_flags = zio_flags; /* * When Compressed ARC is disabled, but the * L2ARC block is compressed, arc_hdr_size() * will have returned LSIZE rather than PSIZE. */ if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr) && HDR_GET_PSIZE(hdr) != 0) { size = HDR_GET_PSIZE(hdr); } asize = vdev_psize_to_asize(vd, size); if (asize != size) { abd = abd_alloc_for_io(asize, HDR_ISTYPE_METADATA(hdr)); cb->l2rcb_abd = abd; } else { abd = hdr_abd; } ASSERT(addr >= VDEV_LABEL_START_SIZE && addr + asize <= vd->vdev_psize - VDEV_LABEL_END_SIZE); /* * l2arc read. The SCL_L2ARC lock will be * released by l2arc_read_done(). * Issue a null zio if the underlying buffer * was squashed to zero size by compression. */ ASSERT3U(arc_hdr_get_compress(hdr), !=, ZIO_COMPRESS_EMPTY); rzio = zio_read_phys(pio, vd, addr, asize, abd, ZIO_CHECKSUM_OFF, l2arc_read_done, cb, priority, zio_flags | ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE); acb->acb_zio_head = rzio; if (hash_lock != NULL) mutex_exit(hash_lock); DTRACE_PROBE2(l2arc__read, vdev_t *, vd, zio_t *, rzio); ARCSTAT_INCR(arcstat_l2_read_bytes, HDR_GET_PSIZE(hdr)); if (*arc_flags & ARC_FLAG_NOWAIT) { zio_nowait(rzio); goto out; } ASSERT(*arc_flags & ARC_FLAG_WAIT); if (zio_wait(rzio) == 0) goto out; /* l2arc read error; goto zio_read() */ if (hash_lock != NULL) mutex_enter(hash_lock); } else { DTRACE_PROBE1(l2arc__miss, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_l2_misses); if (HDR_L2_WRITING(hdr)) ARCSTAT_BUMP(arcstat_l2_rw_clash); spa_config_exit(spa, SCL_L2ARC, vd); } } else { if (vd != NULL) spa_config_exit(spa, SCL_L2ARC, vd); /* * Skip ARC stat bump for block pointers with * embedded data. The data are read from the blkptr * itself via decode_embedded_bp_compressed(). */ if (l2arc_ndev != 0 && !embedded_bp) { DTRACE_PROBE1(l2arc__miss, arc_buf_hdr_t *, hdr); ARCSTAT_BUMP(arcstat_l2_misses); } } rzio = zio_read(pio, spa, bp, hdr_abd, size, arc_read_done, hdr, priority, zio_flags, zb); acb->acb_zio_head = rzio; if (hash_lock != NULL) mutex_exit(hash_lock); if (*arc_flags & ARC_FLAG_WAIT) { rc = zio_wait(rzio); goto out; } ASSERT(*arc_flags & ARC_FLAG_NOWAIT); zio_nowait(rzio); } out: /* embedded bps don't actually go to disk */ if (!embedded_bp) spa_read_history_add(spa, zb, *arc_flags); spl_fstrans_unmark(cookie); return (rc); } arc_prune_t * arc_add_prune_callback(arc_prune_func_t *func, void *private) { arc_prune_t *p; p = kmem_alloc(sizeof (*p), KM_SLEEP); p->p_pfunc = func; p->p_private = private; list_link_init(&p->p_node); zfs_refcount_create(&p->p_refcnt); mutex_enter(&arc_prune_mtx); zfs_refcount_add(&p->p_refcnt, &arc_prune_list); list_insert_head(&arc_prune_list, p); mutex_exit(&arc_prune_mtx); return (p); } void arc_remove_prune_callback(arc_prune_t *p) { boolean_t wait = B_FALSE; mutex_enter(&arc_prune_mtx); list_remove(&arc_prune_list, p); if (zfs_refcount_remove(&p->p_refcnt, &arc_prune_list) > 0) wait = B_TRUE; mutex_exit(&arc_prune_mtx); /* wait for arc_prune_task to finish */ if (wait) taskq_wait_outstanding(arc_prune_taskq, 0); ASSERT0(zfs_refcount_count(&p->p_refcnt)); zfs_refcount_destroy(&p->p_refcnt); kmem_free(p, sizeof (*p)); } /* * Notify the arc that a block was freed, and thus will never be used again. */ void arc_freed(spa_t *spa, const blkptr_t *bp) { arc_buf_hdr_t *hdr; kmutex_t *hash_lock; uint64_t guid = spa_load_guid(spa); ASSERT(!BP_IS_EMBEDDED(bp)); hdr = buf_hash_find(guid, bp, &hash_lock); if (hdr == NULL) return; /* * We might be trying to free a block that is still doing I/O * (i.e. prefetch) or has a reference (i.e. a dedup-ed, * dmu_sync-ed block). If this block is being prefetched, then it * would still have the ARC_FLAG_IO_IN_PROGRESS flag set on the hdr * until the I/O completes. A block may also have a reference if it is * part of a dedup-ed, dmu_synced write. The dmu_sync() function would * have written the new block to its final resting place on disk but * without the dedup flag set. This would have left the hdr in the MRU * state and discoverable. When the txg finally syncs it detects that * the block was overridden in open context and issues an override I/O. * Since this is a dedup block, the override I/O will determine if the * block is already in the DDT. If so, then it will replace the io_bp * with the bp from the DDT and allow the I/O to finish. When the I/O * reaches the done callback, dbuf_write_override_done, it will * check to see if the io_bp and io_bp_override are identical. * If they are not, then it indicates that the bp was replaced with * the bp in the DDT and the override bp is freed. This allows * us to arrive here with a reference on a block that is being * freed. So if we have an I/O in progress, or a reference to * this hdr, then we don't destroy the hdr. */ if (!HDR_HAS_L1HDR(hdr) || (!HDR_IO_IN_PROGRESS(hdr) && zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt))) { arc_change_state(arc_anon, hdr, hash_lock); arc_hdr_destroy(hdr); mutex_exit(hash_lock); } else { mutex_exit(hash_lock); } } /* * Release this buffer from the cache, making it an anonymous buffer. This * must be done after a read and prior to modifying the buffer contents. * If the buffer has more than one reference, we must make * a new hdr for the buffer. */ void arc_release(arc_buf_t *buf, void *tag) { arc_buf_hdr_t *hdr = buf->b_hdr; /* * It would be nice to assert that if its DMU metadata (level > * 0 || it's the dnode file), then it must be syncing context. * But we don't know that information at this level. */ mutex_enter(&buf->b_evict_lock); ASSERT(HDR_HAS_L1HDR(hdr)); /* * We don't grab the hash lock prior to this check, because if * the buffer's header is in the arc_anon state, it won't be * linked into the hash table. */ if (hdr->b_l1hdr.b_state == arc_anon) { mutex_exit(&buf->b_evict_lock); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT(!HDR_IN_HASH_TABLE(hdr)); ASSERT(!HDR_HAS_L2HDR(hdr)); ASSERT(HDR_EMPTY(hdr)); ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1); ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), ==, 1); ASSERT(!list_link_active(&hdr->b_l1hdr.b_arc_node)); hdr->b_l1hdr.b_arc_access = 0; /* * If the buf is being overridden then it may already * have a hdr that is not empty. */ buf_discard_identity(hdr); arc_buf_thaw(buf); return; } kmutex_t *hash_lock = HDR_LOCK(hdr); mutex_enter(hash_lock); /* * This assignment is only valid as long as the hash_lock is * held, we must be careful not to reference state or the * b_state field after dropping the lock. */ arc_state_t *state = hdr->b_l1hdr.b_state; ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); ASSERT3P(state, !=, arc_anon); /* this buffer is not on any list */ ASSERT3S(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt), >, 0); if (HDR_HAS_L2HDR(hdr)) { mutex_enter(&hdr->b_l2hdr.b_dev->l2ad_mtx); /* * We have to recheck this conditional again now that * we're holding the l2ad_mtx to prevent a race with * another thread which might be concurrently calling * l2arc_evict(). In that case, l2arc_evict() might have * destroyed the header's L2 portion as we were waiting * to acquire the l2ad_mtx. */ if (HDR_HAS_L2HDR(hdr)) arc_hdr_l2hdr_destroy(hdr); mutex_exit(&hdr->b_l2hdr.b_dev->l2ad_mtx); } /* * Do we have more than one buf? */ if (hdr->b_l1hdr.b_bufcnt > 1) { arc_buf_hdr_t *nhdr; uint64_t spa = hdr->b_spa; uint64_t psize = HDR_GET_PSIZE(hdr); uint64_t lsize = HDR_GET_LSIZE(hdr); boolean_t protected = HDR_PROTECTED(hdr); enum zio_compress compress = arc_hdr_get_compress(hdr); arc_buf_contents_t type = arc_buf_type(hdr); VERIFY3U(hdr->b_type, ==, type); ASSERT(hdr->b_l1hdr.b_buf != buf || buf->b_next != NULL); (void) remove_reference(hdr, hash_lock, tag); if (arc_buf_is_shared(buf) && !ARC_BUF_COMPRESSED(buf)) { ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf); ASSERT(ARC_BUF_LAST(buf)); } /* * Pull the data off of this hdr and attach it to * a new anonymous hdr. Also find the last buffer * in the hdr's buffer list. */ arc_buf_t *lastbuf = arc_buf_remove(hdr, buf); ASSERT3P(lastbuf, !=, NULL); /* * If the current arc_buf_t and the hdr are sharing their data * buffer, then we must stop sharing that block. */ if (arc_buf_is_shared(buf)) { ASSERT3P(hdr->b_l1hdr.b_buf, !=, buf); VERIFY(!arc_buf_is_shared(lastbuf)); /* * First, sever the block sharing relationship between * buf and the arc_buf_hdr_t. */ arc_unshare_buf(hdr, buf); /* * Now we need to recreate the hdr's b_pabd. Since we * have lastbuf handy, we try to share with it, but if * we can't then we allocate a new b_pabd and copy the * data from buf into it. */ if (arc_can_share(hdr, lastbuf)) { arc_share_buf(hdr, lastbuf); } else { arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT); abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data, psize); } VERIFY3P(lastbuf->b_data, !=, NULL); } else if (HDR_SHARED_DATA(hdr)) { /* * Uncompressed shared buffers are always at the end * of the list. Compressed buffers don't have the * same requirements. This makes it hard to * simply assert that the lastbuf is shared so * we rely on the hdr's compression flags to determine * if we have a compressed, shared buffer. */ ASSERT(arc_buf_is_shared(lastbuf) || arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF); ASSERT(!ARC_BUF_SHARED(buf)); } ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); ASSERT3P(state, !=, arc_l2c_only); (void) zfs_refcount_remove_many(&state->arcs_size, arc_buf_size(buf), buf); if (zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)) { ASSERT3P(state, !=, arc_l2c_only); (void) zfs_refcount_remove_many( &state->arcs_esize[type], arc_buf_size(buf), buf); } hdr->b_l1hdr.b_bufcnt -= 1; if (ARC_BUF_ENCRYPTED(buf)) hdr->b_crypt_hdr.b_ebufcnt -= 1; arc_cksum_verify(buf); arc_buf_unwatch(buf); /* if this is the last uncompressed buf free the checksum */ if (!arc_hdr_has_uncompressed_buf(hdr)) arc_cksum_free(hdr); mutex_exit(hash_lock); /* * Allocate a new hdr. The new hdr will contain a b_pabd * buffer which will be freed in arc_write(). */ nhdr = arc_hdr_alloc(spa, psize, lsize, protected, compress, hdr->b_complevel, type, HDR_HAS_RABD(hdr)); ASSERT3P(nhdr->b_l1hdr.b_buf, ==, NULL); ASSERT0(nhdr->b_l1hdr.b_bufcnt); ASSERT0(zfs_refcount_count(&nhdr->b_l1hdr.b_refcnt)); VERIFY3U(nhdr->b_type, ==, type); ASSERT(!HDR_SHARED_DATA(nhdr)); nhdr->b_l1hdr.b_buf = buf; nhdr->b_l1hdr.b_bufcnt = 1; if (ARC_BUF_ENCRYPTED(buf)) nhdr->b_crypt_hdr.b_ebufcnt = 1; nhdr->b_l1hdr.b_mru_hits = 0; nhdr->b_l1hdr.b_mru_ghost_hits = 0; nhdr->b_l1hdr.b_mfu_hits = 0; nhdr->b_l1hdr.b_mfu_ghost_hits = 0; nhdr->b_l1hdr.b_l2_hits = 0; (void) zfs_refcount_add(&nhdr->b_l1hdr.b_refcnt, tag); buf->b_hdr = nhdr; mutex_exit(&buf->b_evict_lock); (void) zfs_refcount_add_many(&arc_anon->arcs_size, arc_buf_size(buf), buf); } else { mutex_exit(&buf->b_evict_lock); ASSERT(zfs_refcount_count(&hdr->b_l1hdr.b_refcnt) == 1); /* protected by hash lock, or hdr is on arc_anon */ ASSERT(!multilist_link_active(&hdr->b_l1hdr.b_arc_node)); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); hdr->b_l1hdr.b_mru_hits = 0; hdr->b_l1hdr.b_mru_ghost_hits = 0; hdr->b_l1hdr.b_mfu_hits = 0; hdr->b_l1hdr.b_mfu_ghost_hits = 0; hdr->b_l1hdr.b_l2_hits = 0; arc_change_state(arc_anon, hdr, hash_lock); hdr->b_l1hdr.b_arc_access = 0; mutex_exit(hash_lock); buf_discard_identity(hdr); arc_buf_thaw(buf); } } int arc_released(arc_buf_t *buf) { int released; mutex_enter(&buf->b_evict_lock); released = (buf->b_data != NULL && buf->b_hdr->b_l1hdr.b_state == arc_anon); mutex_exit(&buf->b_evict_lock); return (released); } #ifdef ZFS_DEBUG int arc_referenced(arc_buf_t *buf) { int referenced; mutex_enter(&buf->b_evict_lock); referenced = (zfs_refcount_count(&buf->b_hdr->b_l1hdr.b_refcnt)); mutex_exit(&buf->b_evict_lock); return (referenced); } #endif static void arc_write_ready(zio_t *zio) { arc_write_callback_t *callback = zio->io_private; arc_buf_t *buf = callback->awcb_buf; arc_buf_hdr_t *hdr = buf->b_hdr; blkptr_t *bp = zio->io_bp; uint64_t psize = BP_IS_HOLE(bp) ? 0 : BP_GET_PSIZE(bp); fstrans_cookie_t cookie = spl_fstrans_mark(); ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT(!zfs_refcount_is_zero(&buf->b_hdr->b_l1hdr.b_refcnt)); ASSERT(hdr->b_l1hdr.b_bufcnt > 0); /* * If we're reexecuting this zio because the pool suspended, then * cleanup any state that was previously set the first time the * callback was invoked. */ if (zio->io_flags & ZIO_FLAG_REEXECUTED) { arc_cksum_free(hdr); arc_buf_unwatch(buf); if (hdr->b_l1hdr.b_pabd != NULL) { if (arc_buf_is_shared(buf)) { arc_unshare_buf(hdr, buf); } else { arc_hdr_free_abd(hdr, B_FALSE); } } if (HDR_HAS_RABD(hdr)) arc_hdr_free_abd(hdr, B_TRUE); } ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); ASSERT(!HDR_HAS_RABD(hdr)); ASSERT(!HDR_SHARED_DATA(hdr)); ASSERT(!arc_buf_is_shared(buf)); callback->awcb_ready(zio, buf, callback->awcb_private); if (HDR_IO_IN_PROGRESS(hdr)) ASSERT(zio->io_flags & ZIO_FLAG_REEXECUTED); arc_hdr_set_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); if (BP_IS_PROTECTED(bp) != !!HDR_PROTECTED(hdr)) hdr = arc_hdr_realloc_crypt(hdr, BP_IS_PROTECTED(bp)); if (BP_IS_PROTECTED(bp)) { /* ZIL blocks are written through zio_rewrite */ ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG); ASSERT(HDR_PROTECTED(hdr)); if (BP_SHOULD_BYTESWAP(bp)) { if (BP_GET_LEVEL(bp) > 0) { hdr->b_l1hdr.b_byteswap = DMU_BSWAP_UINT64; } else { hdr->b_l1hdr.b_byteswap = DMU_OT_BYTESWAP(BP_GET_TYPE(bp)); } } else { hdr->b_l1hdr.b_byteswap = DMU_BSWAP_NUMFUNCS; } hdr->b_crypt_hdr.b_ot = BP_GET_TYPE(bp); hdr->b_crypt_hdr.b_dsobj = zio->io_bookmark.zb_objset; zio_crypt_decode_params_bp(bp, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv); zio_crypt_decode_mac_bp(bp, hdr->b_crypt_hdr.b_mac); } /* * If this block was written for raw encryption but the zio layer * ended up only authenticating it, adjust the buffer flags now. */ if (BP_IS_AUTHENTICATED(bp) && ARC_BUF_ENCRYPTED(buf)) { arc_hdr_set_flags(hdr, ARC_FLAG_NOAUTH); buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED; if (BP_GET_COMPRESS(bp) == ZIO_COMPRESS_OFF) buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; } else if (BP_IS_HOLE(bp) && ARC_BUF_ENCRYPTED(buf)) { buf->b_flags &= ~ARC_BUF_FLAG_ENCRYPTED; buf->b_flags &= ~ARC_BUF_FLAG_COMPRESSED; } /* this must be done after the buffer flags are adjusted */ arc_cksum_compute(buf); enum zio_compress compress; if (BP_IS_HOLE(bp) || BP_IS_EMBEDDED(bp)) { compress = ZIO_COMPRESS_OFF; } else { ASSERT3U(HDR_GET_LSIZE(hdr), ==, BP_GET_LSIZE(bp)); compress = BP_GET_COMPRESS(bp); } HDR_SET_PSIZE(hdr, psize); arc_hdr_set_compress(hdr, compress); hdr->b_complevel = zio->io_prop.zp_complevel; if (zio->io_error != 0 || psize == 0) goto out; /* * Fill the hdr with data. If the buffer is encrypted we have no choice * but to copy the data into b_radb. If the hdr is compressed, the data * we want is available from the zio, otherwise we can take it from * the buf. * * We might be able to share the buf's data with the hdr here. However, * doing so would cause the ARC to be full of linear ABDs if we write a * lot of shareable data. As a compromise, we check whether scattered * ABDs are allowed, and assume that if they are then the user wants * the ARC to be primarily filled with them regardless of the data being * written. Therefore, if they're allowed then we allocate one and copy * the data into it; otherwise, we share the data directly if we can. */ if (ARC_BUF_ENCRYPTED(buf)) { ASSERT3U(psize, >, 0); ASSERT(ARC_BUF_COMPRESSED(buf)); arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT|ARC_HDR_ALLOC_RDATA); abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize); } else if (zfs_abd_scatter_enabled || !arc_can_share(hdr, buf)) { /* * Ideally, we would always copy the io_abd into b_pabd, but the * user may have disabled compressed ARC, thus we must check the * hdr's compression setting rather than the io_bp's. */ if (BP_IS_ENCRYPTED(bp)) { ASSERT3U(psize, >, 0); arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT|ARC_HDR_ALLOC_RDATA); abd_copy(hdr->b_crypt_hdr.b_rabd, zio->io_abd, psize); } else if (arc_hdr_get_compress(hdr) != ZIO_COMPRESS_OFF && !ARC_BUF_COMPRESSED(buf)) { ASSERT3U(psize, >, 0); arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT); abd_copy(hdr->b_l1hdr.b_pabd, zio->io_abd, psize); } else { ASSERT3U(zio->io_orig_size, ==, arc_hdr_size(hdr)); arc_hdr_alloc_abd(hdr, ARC_HDR_DO_ADAPT); abd_copy_from_buf(hdr->b_l1hdr.b_pabd, buf->b_data, arc_buf_size(buf)); } } else { ASSERT3P(buf->b_data, ==, abd_to_buf(zio->io_orig_abd)); ASSERT3U(zio->io_orig_size, ==, arc_buf_size(buf)); ASSERT3U(hdr->b_l1hdr.b_bufcnt, ==, 1); arc_share_buf(hdr, buf); } out: arc_hdr_verify(hdr, bp); spl_fstrans_unmark(cookie); } static void arc_write_children_ready(zio_t *zio) { arc_write_callback_t *callback = zio->io_private; arc_buf_t *buf = callback->awcb_buf; callback->awcb_children_ready(zio, buf, callback->awcb_private); } /* * The SPA calls this callback for each physical write that happens on behalf * of a logical write. See the comment in dbuf_write_physdone() for details. */ static void arc_write_physdone(zio_t *zio) { arc_write_callback_t *cb = zio->io_private; if (cb->awcb_physdone != NULL) cb->awcb_physdone(zio, cb->awcb_buf, cb->awcb_private); } static void arc_write_done(zio_t *zio) { arc_write_callback_t *callback = zio->io_private; arc_buf_t *buf = callback->awcb_buf; arc_buf_hdr_t *hdr = buf->b_hdr; ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); if (zio->io_error == 0) { arc_hdr_verify(hdr, zio->io_bp); if (BP_IS_HOLE(zio->io_bp) || BP_IS_EMBEDDED(zio->io_bp)) { buf_discard_identity(hdr); } else { hdr->b_dva = *BP_IDENTITY(zio->io_bp); hdr->b_birth = BP_PHYSICAL_BIRTH(zio->io_bp); } } else { ASSERT(HDR_EMPTY(hdr)); } /* * If the block to be written was all-zero or compressed enough to be * embedded in the BP, no write was performed so there will be no * dva/birth/checksum. The buffer must therefore remain anonymous * (and uncached). */ if (!HDR_EMPTY(hdr)) { arc_buf_hdr_t *exists; kmutex_t *hash_lock; ASSERT3U(zio->io_error, ==, 0); arc_cksum_verify(buf); exists = buf_hash_insert(hdr, &hash_lock); if (exists != NULL) { /* * This can only happen if we overwrite for * sync-to-convergence, because we remove * buffers from the hash table when we arc_free(). */ if (zio->io_flags & ZIO_FLAG_IO_REWRITE) { if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp)) panic("bad overwrite, hdr=%p exists=%p", (void *)hdr, (void *)exists); ASSERT(zfs_refcount_is_zero( &exists->b_l1hdr.b_refcnt)); arc_change_state(arc_anon, exists, hash_lock); arc_hdr_destroy(exists); mutex_exit(hash_lock); exists = buf_hash_insert(hdr, &hash_lock); ASSERT3P(exists, ==, NULL); } else if (zio->io_flags & ZIO_FLAG_NOPWRITE) { /* nopwrite */ ASSERT(zio->io_prop.zp_nopwrite); if (!BP_EQUAL(&zio->io_bp_orig, zio->io_bp)) panic("bad nopwrite, hdr=%p exists=%p", (void *)hdr, (void *)exists); } else { /* Dedup */ ASSERT(hdr->b_l1hdr.b_bufcnt == 1); ASSERT(hdr->b_l1hdr.b_state == arc_anon); ASSERT(BP_GET_DEDUP(zio->io_bp)); ASSERT(BP_GET_LEVEL(zio->io_bp) == 0); } } arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); /* if it's not anon, we are doing a scrub */ if (exists == NULL && hdr->b_l1hdr.b_state == arc_anon) arc_access(hdr, hash_lock); mutex_exit(hash_lock); } else { arc_hdr_clear_flags(hdr, ARC_FLAG_IO_IN_PROGRESS); } ASSERT(!zfs_refcount_is_zero(&hdr->b_l1hdr.b_refcnt)); callback->awcb_done(zio, buf, callback->awcb_private); abd_put(zio->io_abd); kmem_free(callback, sizeof (arc_write_callback_t)); } zio_t * arc_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, arc_buf_t *buf, boolean_t l2arc, const zio_prop_t *zp, arc_write_done_func_t *ready, arc_write_done_func_t *children_ready, arc_write_done_func_t *physdone, arc_write_done_func_t *done, void *private, zio_priority_t priority, int zio_flags, const zbookmark_phys_t *zb) { arc_buf_hdr_t *hdr = buf->b_hdr; arc_write_callback_t *callback; zio_t *zio; zio_prop_t localprop = *zp; ASSERT3P(ready, !=, NULL); ASSERT3P(done, !=, NULL); ASSERT(!HDR_IO_ERROR(hdr)); ASSERT(!HDR_IO_IN_PROGRESS(hdr)); ASSERT3P(hdr->b_l1hdr.b_acb, ==, NULL); ASSERT3U(hdr->b_l1hdr.b_bufcnt, >, 0); if (l2arc) arc_hdr_set_flags(hdr, ARC_FLAG_L2CACHE); if (ARC_BUF_ENCRYPTED(buf)) { ASSERT(ARC_BUF_COMPRESSED(buf)); localprop.zp_encrypt = B_TRUE; localprop.zp_compress = HDR_GET_COMPRESS(hdr); localprop.zp_complevel = hdr->b_complevel; localprop.zp_byteorder = (hdr->b_l1hdr.b_byteswap == DMU_BSWAP_NUMFUNCS) ? ZFS_HOST_BYTEORDER : !ZFS_HOST_BYTEORDER; bcopy(hdr->b_crypt_hdr.b_salt, localprop.zp_salt, ZIO_DATA_SALT_LEN); bcopy(hdr->b_crypt_hdr.b_iv, localprop.zp_iv, ZIO_DATA_IV_LEN); bcopy(hdr->b_crypt_hdr.b_mac, localprop.zp_mac, ZIO_DATA_MAC_LEN); if (DMU_OT_IS_ENCRYPTED(localprop.zp_type)) { localprop.zp_nopwrite = B_FALSE; localprop.zp_copies = MIN(localprop.zp_copies, SPA_DVAS_PER_BP - 1); } zio_flags |= ZIO_FLAG_RAW; } else if (ARC_BUF_COMPRESSED(buf)) { ASSERT3U(HDR_GET_LSIZE(hdr), !=, arc_buf_size(buf)); localprop.zp_compress = HDR_GET_COMPRESS(hdr); localprop.zp_complevel = hdr->b_complevel; zio_flags |= ZIO_FLAG_RAW_COMPRESS; } callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP); callback->awcb_ready = ready; callback->awcb_children_ready = children_ready; callback->awcb_physdone = physdone; callback->awcb_done = done; callback->awcb_private = private; callback->awcb_buf = buf; /* * The hdr's b_pabd is now stale, free it now. A new data block * will be allocated when the zio pipeline calls arc_write_ready(). */ if (hdr->b_l1hdr.b_pabd != NULL) { /* * If the buf is currently sharing the data block with * the hdr then we need to break that relationship here. * The hdr will remain with a NULL data pointer and the * buf will take sole ownership of the block. */ if (arc_buf_is_shared(buf)) { arc_unshare_buf(hdr, buf); } else { arc_hdr_free_abd(hdr, B_FALSE); } VERIFY3P(buf->b_data, !=, NULL); } if (HDR_HAS_RABD(hdr)) arc_hdr_free_abd(hdr, B_TRUE); if (!(zio_flags & ZIO_FLAG_RAW)) arc_hdr_set_compress(hdr, ZIO_COMPRESS_OFF); ASSERT(!arc_buf_is_shared(buf)); ASSERT3P(hdr->b_l1hdr.b_pabd, ==, NULL); zio = zio_write(pio, spa, txg, bp, abd_get_from_buf(buf->b_data, HDR_GET_LSIZE(hdr)), HDR_GET_LSIZE(hdr), arc_buf_size(buf), &localprop, arc_write_ready, (children_ready != NULL) ? arc_write_children_ready : NULL, arc_write_physdone, arc_write_done, callback, priority, zio_flags, zb); return (zio); } void arc_tempreserve_clear(uint64_t reserve) { atomic_add_64(&arc_tempreserve, -reserve); ASSERT((int64_t)arc_tempreserve >= 0); } int arc_tempreserve_space(spa_t *spa, uint64_t reserve, uint64_t txg) { int error; uint64_t anon_size; if (!arc_no_grow && reserve > arc_c/4 && reserve * 4 > (2ULL << SPA_MAXBLOCKSHIFT)) arc_c = MIN(arc_c_max, reserve * 4); /* * Throttle when the calculated memory footprint for the TXG * exceeds the target ARC size. */ if (reserve > arc_c) { DMU_TX_STAT_BUMP(dmu_tx_memory_reserve); return (SET_ERROR(ERESTART)); } /* * Don't count loaned bufs as in flight dirty data to prevent long * network delays from blocking transactions that are ready to be * assigned to a txg. */ /* assert that it has not wrapped around */ ASSERT3S(atomic_add_64_nv(&arc_loaned_bytes, 0), >=, 0); anon_size = MAX((int64_t)(zfs_refcount_count(&arc_anon->arcs_size) - arc_loaned_bytes), 0); /* * Writes will, almost always, require additional memory allocations * in order to compress/encrypt/etc the data. We therefore need to * make sure that there is sufficient available memory for this. */ error = arc_memory_throttle(spa, reserve, txg); if (error != 0) return (error); /* * Throttle writes when the amount of dirty data in the cache * gets too large. We try to keep the cache less than half full * of dirty blocks so that our sync times don't grow too large. * * In the case of one pool being built on another pool, we want * to make sure we don't end up throttling the lower (backing) * pool when the upper pool is the majority contributor to dirty * data. To insure we make forward progress during throttling, we * also check the current pool's net dirty data and only throttle * if it exceeds zfs_arc_pool_dirty_percent of the anonymous dirty * data in the cache. * * Note: if two requests come in concurrently, we might let them * both succeed, when one of them should fail. Not a huge deal. */ uint64_t total_dirty = reserve + arc_tempreserve + anon_size; uint64_t spa_dirty_anon = spa_dirty_data(spa); if (total_dirty > arc_c * zfs_arc_dirty_limit_percent / 100 && anon_size > arc_c * zfs_arc_anon_limit_percent / 100 && spa_dirty_anon > anon_size * zfs_arc_pool_dirty_percent / 100) { #ifdef ZFS_DEBUG uint64_t meta_esize = zfs_refcount_count( &arc_anon->arcs_esize[ARC_BUFC_METADATA]); uint64_t data_esize = zfs_refcount_count(&arc_anon->arcs_esize[ARC_BUFC_DATA]); dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK " "anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n", arc_tempreserve >> 10, meta_esize >> 10, data_esize >> 10, reserve >> 10, arc_c >> 10); #endif DMU_TX_STAT_BUMP(dmu_tx_dirty_throttle); return (SET_ERROR(ERESTART)); } atomic_add_64(&arc_tempreserve, reserve); return (0); } static void arc_kstat_update_state(arc_state_t *state, kstat_named_t *size, kstat_named_t *evict_data, kstat_named_t *evict_metadata) { size->value.ui64 = zfs_refcount_count(&state->arcs_size); evict_data->value.ui64 = zfs_refcount_count(&state->arcs_esize[ARC_BUFC_DATA]); evict_metadata->value.ui64 = zfs_refcount_count(&state->arcs_esize[ARC_BUFC_METADATA]); } static int arc_kstat_update(kstat_t *ksp, int rw) { arc_stats_t *as = ksp->ks_data; if (rw == KSTAT_WRITE) { return (SET_ERROR(EACCES)); } else { arc_kstat_update_state(arc_anon, &as->arcstat_anon_size, &as->arcstat_anon_evictable_data, &as->arcstat_anon_evictable_metadata); arc_kstat_update_state(arc_mru, &as->arcstat_mru_size, &as->arcstat_mru_evictable_data, &as->arcstat_mru_evictable_metadata); arc_kstat_update_state(arc_mru_ghost, &as->arcstat_mru_ghost_size, &as->arcstat_mru_ghost_evictable_data, &as->arcstat_mru_ghost_evictable_metadata); arc_kstat_update_state(arc_mfu, &as->arcstat_mfu_size, &as->arcstat_mfu_evictable_data, &as->arcstat_mfu_evictable_metadata); arc_kstat_update_state(arc_mfu_ghost, &as->arcstat_mfu_ghost_size, &as->arcstat_mfu_ghost_evictable_data, &as->arcstat_mfu_ghost_evictable_metadata); ARCSTAT(arcstat_size) = aggsum_value(&arc_size); ARCSTAT(arcstat_meta_used) = aggsum_value(&arc_meta_used); ARCSTAT(arcstat_data_size) = aggsum_value(&astat_data_size); ARCSTAT(arcstat_metadata_size) = aggsum_value(&astat_metadata_size); ARCSTAT(arcstat_hdr_size) = aggsum_value(&astat_hdr_size); ARCSTAT(arcstat_l2_hdr_size) = aggsum_value(&astat_l2_hdr_size); ARCSTAT(arcstat_dbuf_size) = aggsum_value(&astat_dbuf_size); #if defined(COMPAT_FREEBSD11) ARCSTAT(arcstat_other_size) = aggsum_value(&astat_bonus_size) + aggsum_value(&astat_dnode_size) + aggsum_value(&astat_dbuf_size); #endif ARCSTAT(arcstat_dnode_size) = aggsum_value(&astat_dnode_size); ARCSTAT(arcstat_bonus_size) = aggsum_value(&astat_bonus_size); ARCSTAT(arcstat_abd_chunk_waste_size) = aggsum_value(&astat_abd_chunk_waste_size); as->arcstat_memory_all_bytes.value.ui64 = arc_all_memory(); as->arcstat_memory_free_bytes.value.ui64 = arc_free_memory(); as->arcstat_memory_available_bytes.value.i64 = arc_available_memory(); } return (0); } /* * This function *must* return indices evenly distributed between all * sublists of the multilist. This is needed due to how the ARC eviction * code is laid out; arc_evict_state() assumes ARC buffers are evenly * distributed between all sublists and uses this assumption when * deciding which sublist to evict from and how much to evict from it. */ static unsigned int arc_state_multilist_index_func(multilist_t *ml, void *obj) { arc_buf_hdr_t *hdr = obj; /* * We rely on b_dva to generate evenly distributed index * numbers using buf_hash below. So, as an added precaution, * let's make sure we never add empty buffers to the arc lists. */ ASSERT(!HDR_EMPTY(hdr)); /* * The assumption here, is the hash value for a given * arc_buf_hdr_t will remain constant throughout its lifetime * (i.e. its b_spa, b_dva, and b_birth fields don't change). * Thus, we don't need to store the header's sublist index * on insertion, as this index can be recalculated on removal. * * Also, the low order bits of the hash value are thought to be * distributed evenly. Otherwise, in the case that the multilist * has a power of two number of sublists, each sublists' usage * would not be evenly distributed. */ return (buf_hash(hdr->b_spa, &hdr->b_dva, hdr->b_birth) % multilist_get_num_sublists(ml)); } #define WARN_IF_TUNING_IGNORED(tuning, value, do_warn) do { \ if ((do_warn) && (tuning) && ((tuning) != (value))) { \ cmn_err(CE_WARN, \ "ignoring tunable %s (using %llu instead)", \ (#tuning), (value)); \ } \ } while (0) /* * Called during module initialization and periodically thereafter to * apply reasonable changes to the exposed performance tunings. Can also be * called explicitly by param_set_arc_*() functions when ARC tunables are * updated manually. Non-zero zfs_* values which differ from the currently set * values will be applied. */ void arc_tuning_update(boolean_t verbose) { uint64_t allmem = arc_all_memory(); unsigned long limit; /* Valid range: 32M - */ if ((zfs_arc_min) && (zfs_arc_min != arc_c_min) && (zfs_arc_min >= 2ULL << SPA_MAXBLOCKSHIFT) && (zfs_arc_min <= arc_c_max)) { arc_c_min = zfs_arc_min; arc_c = MAX(arc_c, arc_c_min); } WARN_IF_TUNING_IGNORED(zfs_arc_min, arc_c_min, verbose); /* Valid range: 64M - */ if ((zfs_arc_max) && (zfs_arc_max != arc_c_max) && (zfs_arc_max >= 64 << 20) && (zfs_arc_max < allmem) && (zfs_arc_max > arc_c_min)) { arc_c_max = zfs_arc_max; arc_c = MIN(arc_c, arc_c_max); arc_p = (arc_c >> 1); if (arc_meta_limit > arc_c_max) arc_meta_limit = arc_c_max; if (arc_dnode_size_limit > arc_meta_limit) arc_dnode_size_limit = arc_meta_limit; } WARN_IF_TUNING_IGNORED(zfs_arc_max, arc_c_max, verbose); /* Valid range: 16M - */ if ((zfs_arc_meta_min) && (zfs_arc_meta_min != arc_meta_min) && (zfs_arc_meta_min >= 1ULL << SPA_MAXBLOCKSHIFT) && (zfs_arc_meta_min <= arc_c_max)) { arc_meta_min = zfs_arc_meta_min; if (arc_meta_limit < arc_meta_min) arc_meta_limit = arc_meta_min; if (arc_dnode_size_limit < arc_meta_min) arc_dnode_size_limit = arc_meta_min; } WARN_IF_TUNING_IGNORED(zfs_arc_meta_min, arc_meta_min, verbose); /* Valid range: - */ limit = zfs_arc_meta_limit ? zfs_arc_meta_limit : MIN(zfs_arc_meta_limit_percent, 100) * arc_c_max / 100; if ((limit != arc_meta_limit) && (limit >= arc_meta_min) && (limit <= arc_c_max)) arc_meta_limit = limit; WARN_IF_TUNING_IGNORED(zfs_arc_meta_limit, arc_meta_limit, verbose); /* Valid range: - */ limit = zfs_arc_dnode_limit ? zfs_arc_dnode_limit : MIN(zfs_arc_dnode_limit_percent, 100) * arc_meta_limit / 100; if ((limit != arc_dnode_size_limit) && (limit >= arc_meta_min) && (limit <= arc_meta_limit)) arc_dnode_size_limit = limit; WARN_IF_TUNING_IGNORED(zfs_arc_dnode_limit, arc_dnode_size_limit, verbose); /* Valid range: 1 - N */ if (zfs_arc_grow_retry) arc_grow_retry = zfs_arc_grow_retry; /* Valid range: 1 - N */ if (zfs_arc_shrink_shift) { arc_shrink_shift = zfs_arc_shrink_shift; arc_no_grow_shift = MIN(arc_no_grow_shift, arc_shrink_shift -1); } /* Valid range: 1 - N */ if (zfs_arc_p_min_shift) arc_p_min_shift = zfs_arc_p_min_shift; /* Valid range: 1 - N ms */ if (zfs_arc_min_prefetch_ms) arc_min_prefetch_ms = zfs_arc_min_prefetch_ms; /* Valid range: 1 - N ms */ if (zfs_arc_min_prescient_prefetch_ms) { arc_min_prescient_prefetch_ms = zfs_arc_min_prescient_prefetch_ms; } /* Valid range: 0 - 100 */ if ((zfs_arc_lotsfree_percent >= 0) && (zfs_arc_lotsfree_percent <= 100)) arc_lotsfree_percent = zfs_arc_lotsfree_percent; WARN_IF_TUNING_IGNORED(zfs_arc_lotsfree_percent, arc_lotsfree_percent, verbose); /* Valid range: 0 - */ if ((zfs_arc_sys_free) && (zfs_arc_sys_free != arc_sys_free)) arc_sys_free = MIN(MAX(zfs_arc_sys_free, 0), allmem); WARN_IF_TUNING_IGNORED(zfs_arc_sys_free, arc_sys_free, verbose); } static void arc_state_init(void) { arc_anon = &ARC_anon; arc_mru = &ARC_mru; arc_mru_ghost = &ARC_mru_ghost; arc_mfu = &ARC_mfu; arc_mfu_ghost = &ARC_mfu_ghost; arc_l2c_only = &ARC_l2c_only; arc_mru->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mru->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mru_ghost->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mru_ghost->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_mfu_ghost->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_l2c_only->arcs_list[ARC_BUFC_METADATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); arc_l2c_only->arcs_list[ARC_BUFC_DATA] = multilist_create(sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l1hdr.b_arc_node), arc_state_multilist_index_func); zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_create(&arc_anon->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_create(&arc_mru->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_create(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_create(&arc_mfu->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_create(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_create(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_create(&arc_anon->arcs_size); zfs_refcount_create(&arc_mru->arcs_size); zfs_refcount_create(&arc_mru_ghost->arcs_size); zfs_refcount_create(&arc_mfu->arcs_size); zfs_refcount_create(&arc_mfu_ghost->arcs_size); zfs_refcount_create(&arc_l2c_only->arcs_size); aggsum_init(&arc_meta_used, 0); aggsum_init(&arc_size, 0); aggsum_init(&astat_data_size, 0); aggsum_init(&astat_metadata_size, 0); aggsum_init(&astat_hdr_size, 0); aggsum_init(&astat_l2_hdr_size, 0); aggsum_init(&astat_bonus_size, 0); aggsum_init(&astat_dnode_size, 0); aggsum_init(&astat_dbuf_size, 0); aggsum_init(&astat_abd_chunk_waste_size, 0); arc_anon->arcs_state = ARC_STATE_ANON; arc_mru->arcs_state = ARC_STATE_MRU; arc_mru_ghost->arcs_state = ARC_STATE_MRU_GHOST; arc_mfu->arcs_state = ARC_STATE_MFU; arc_mfu_ghost->arcs_state = ARC_STATE_MFU_GHOST; arc_l2c_only->arcs_state = ARC_STATE_L2C_ONLY; } static void arc_state_fini(void) { zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_destroy(&arc_anon->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_destroy(&arc_mru->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_destroy(&arc_mru_ghost->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_destroy(&arc_mfu->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_destroy(&arc_mfu_ghost->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_METADATA]); zfs_refcount_destroy(&arc_l2c_only->arcs_esize[ARC_BUFC_DATA]); zfs_refcount_destroy(&arc_anon->arcs_size); zfs_refcount_destroy(&arc_mru->arcs_size); zfs_refcount_destroy(&arc_mru_ghost->arcs_size); zfs_refcount_destroy(&arc_mfu->arcs_size); zfs_refcount_destroy(&arc_mfu_ghost->arcs_size); zfs_refcount_destroy(&arc_l2c_only->arcs_size); multilist_destroy(arc_mru->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_mru->arcs_list[ARC_BUFC_DATA]); multilist_destroy(arc_mru_ghost->arcs_list[ARC_BUFC_DATA]); multilist_destroy(arc_mfu->arcs_list[ARC_BUFC_DATA]); multilist_destroy(arc_mfu_ghost->arcs_list[ARC_BUFC_DATA]); multilist_destroy(arc_l2c_only->arcs_list[ARC_BUFC_METADATA]); multilist_destroy(arc_l2c_only->arcs_list[ARC_BUFC_DATA]); aggsum_fini(&arc_meta_used); aggsum_fini(&arc_size); aggsum_fini(&astat_data_size); aggsum_fini(&astat_metadata_size); aggsum_fini(&astat_hdr_size); aggsum_fini(&astat_l2_hdr_size); aggsum_fini(&astat_bonus_size); aggsum_fini(&astat_dnode_size); aggsum_fini(&astat_dbuf_size); aggsum_fini(&astat_abd_chunk_waste_size); } uint64_t arc_target_bytes(void) { return (arc_c); } void arc_init(void) { uint64_t percent, allmem = arc_all_memory(); mutex_init(&arc_evict_lock, NULL, MUTEX_DEFAULT, NULL); list_create(&arc_evict_waiters, sizeof (arc_evict_waiter_t), offsetof(arc_evict_waiter_t, aew_node)); arc_min_prefetch_ms = 1000; arc_min_prescient_prefetch_ms = 6000; #if defined(_KERNEL) arc_lowmem_init(); #endif /* Set min cache to 1/32 of all memory, or 32MB, whichever is more. */ arc_c_min = MAX(allmem / 32, 2ULL << SPA_MAXBLOCKSHIFT); /* How to set default max varies by platform. */ arc_c_max = arc_default_max(arc_c_min, allmem); #ifndef _KERNEL /* * In userland, there's only the memory pressure that we artificially * create (see arc_available_memory()). Don't let arc_c get too * small, because it can cause transactions to be larger than * arc_c, causing arc_tempreserve_space() to fail. */ arc_c_min = MAX(arc_c_max / 2, 2ULL << SPA_MAXBLOCKSHIFT); #endif arc_c = arc_c_min; arc_p = (arc_c >> 1); /* Set min to 1/2 of arc_c_min */ arc_meta_min = 1ULL << SPA_MAXBLOCKSHIFT; /* Initialize maximum observed usage to zero */ arc_meta_max = 0; /* * Set arc_meta_limit to a percent of arc_c_max with a floor of * arc_meta_min, and a ceiling of arc_c_max. */ percent = MIN(zfs_arc_meta_limit_percent, 100); arc_meta_limit = MAX(arc_meta_min, (percent * arc_c_max) / 100); percent = MIN(zfs_arc_dnode_limit_percent, 100); arc_dnode_size_limit = (percent * arc_meta_limit) / 100; /* Apply user specified tunings */ arc_tuning_update(B_TRUE); /* if kmem_flags are set, lets try to use less memory */ if (kmem_debugging()) arc_c = arc_c / 2; if (arc_c < arc_c_min) arc_c = arc_c_min; arc_state_init(); buf_init(); list_create(&arc_prune_list, sizeof (arc_prune_t), offsetof(arc_prune_t, p_node)); mutex_init(&arc_prune_mtx, NULL, MUTEX_DEFAULT, NULL); arc_prune_taskq = taskq_create("arc_prune", boot_ncpus, defclsyspri, boot_ncpus, INT_MAX, TASKQ_PREPOPULATE | TASKQ_DYNAMIC); arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED, sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL); if (arc_ksp != NULL) { arc_ksp->ks_data = &arc_stats; arc_ksp->ks_update = arc_kstat_update; kstat_install(arc_ksp); } arc_evict_zthr = zthr_create_timer("arc_evict", arc_evict_cb_check, arc_evict_cb, NULL, SEC2NSEC(1)); arc_reap_zthr = zthr_create_timer("arc_reap", arc_reap_cb_check, arc_reap_cb, NULL, SEC2NSEC(1)); arc_warm = B_FALSE; /* * Calculate maximum amount of dirty data per pool. * * If it has been set by a module parameter, take that. * Otherwise, use a percentage of physical memory defined by * zfs_dirty_data_max_percent (default 10%) with a cap at * zfs_dirty_data_max_max (default 4G or 25% of physical memory). */ #ifdef __LP64__ if (zfs_dirty_data_max_max == 0) zfs_dirty_data_max_max = MIN(4ULL * 1024 * 1024 * 1024, allmem * zfs_dirty_data_max_max_percent / 100); #else if (zfs_dirty_data_max_max == 0) zfs_dirty_data_max_max = MIN(1ULL * 1024 * 1024 * 1024, allmem * zfs_dirty_data_max_max_percent / 100); #endif if (zfs_dirty_data_max == 0) { zfs_dirty_data_max = allmem * zfs_dirty_data_max_percent / 100; zfs_dirty_data_max = MIN(zfs_dirty_data_max, zfs_dirty_data_max_max); } } void arc_fini(void) { arc_prune_t *p; #ifdef _KERNEL arc_lowmem_fini(); #endif /* _KERNEL */ /* Use B_TRUE to ensure *all* buffers are evicted */ arc_flush(NULL, B_TRUE); if (arc_ksp != NULL) { kstat_delete(arc_ksp); arc_ksp = NULL; } taskq_wait(arc_prune_taskq); taskq_destroy(arc_prune_taskq); mutex_enter(&arc_prune_mtx); while ((p = list_head(&arc_prune_list)) != NULL) { list_remove(&arc_prune_list, p); zfs_refcount_remove(&p->p_refcnt, &arc_prune_list); zfs_refcount_destroy(&p->p_refcnt); kmem_free(p, sizeof (*p)); } mutex_exit(&arc_prune_mtx); list_destroy(&arc_prune_list); mutex_destroy(&arc_prune_mtx); (void) zthr_cancel(arc_evict_zthr); (void) zthr_cancel(arc_reap_zthr); mutex_destroy(&arc_evict_lock); list_destroy(&arc_evict_waiters); /* * Free any buffers that were tagged for destruction. This needs * to occur before arc_state_fini() runs and destroys the aggsum * values which are updated when freeing scatter ABDs. */ l2arc_do_free_on_write(); /* * buf_fini() must proceed arc_state_fini() because buf_fin() may * trigger the release of kmem magazines, which can callback to * arc_space_return() which accesses aggsums freed in act_state_fini(). */ buf_fini(); arc_state_fini(); /* * We destroy the zthrs after all the ARC state has been * torn down to avoid the case of them receiving any * wakeup() signals after they are destroyed. */ zthr_destroy(arc_evict_zthr); zthr_destroy(arc_reap_zthr); ASSERT0(arc_loaned_bytes); } /* * Level 2 ARC * * The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk. * It uses dedicated storage devices to hold cached data, which are populated * using large infrequent writes. The main role of this cache is to boost * the performance of random read workloads. The intended L2ARC devices * include short-stroked disks, solid state disks, and other media with * substantially faster read latency than disk. * * +-----------------------+ * | ARC | * +-----------------------+ * | ^ ^ * | | | * l2arc_feed_thread() arc_read() * | | | * | l2arc read | * V | | * +---------------+ | * | L2ARC | | * +---------------+ | * | ^ | * l2arc_write() | | * | | | * V | | * +-------+ +-------+ * | vdev | | vdev | * | cache | | cache | * +-------+ +-------+ * +=========+ .-----. * : L2ARC : |-_____-| * : devices : | Disks | * +=========+ `-_____-' * * Read requests are satisfied from the following sources, in order: * * 1) ARC * 2) vdev cache of L2ARC devices * 3) L2ARC devices * 4) vdev cache of disks * 5) disks * * Some L2ARC device types exhibit extremely slow write performance. * To accommodate for this there are some significant differences between * the L2ARC and traditional cache design: * * 1. There is no eviction path from the ARC to the L2ARC. Evictions from * the ARC behave as usual, freeing buffers and placing headers on ghost * lists. The ARC does not send buffers to the L2ARC during eviction as * this would add inflated write latencies for all ARC memory pressure. * * 2. The L2ARC attempts to cache data from the ARC before it is evicted. * It does this by periodically scanning buffers from the eviction-end of * the MFU and MRU ARC lists, copying them to the L2ARC devices if they are * not already there. It scans until a headroom of buffers is satisfied, * which itself is a buffer for ARC eviction. If a compressible buffer is * found during scanning and selected for writing to an L2ARC device, we * temporarily boost scanning headroom during the next scan cycle to make * sure we adapt to compression effects (which might significantly reduce * the data volume we write to L2ARC). The thread that does this is * l2arc_feed_thread(), illustrated below; example sizes are included to * provide a better sense of ratio than this diagram: * * head --> tail * +---------------------+----------+ * ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC * +---------------------+----------+ | o L2ARC eligible * ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer * +---------------------+----------+ | * 15.9 Gbytes ^ 32 Mbytes | * headroom | * l2arc_feed_thread() * | * l2arc write hand <--[oooo]--' * | 8 Mbyte * | write max * V * +==============================+ * L2ARC dev |####|#|###|###| |####| ... | * +==============================+ * 32 Gbytes * * 3. If an ARC buffer is copied to the L2ARC but then hit instead of * evicted, then the L2ARC has cached a buffer much sooner than it probably * needed to, potentially wasting L2ARC device bandwidth and storage. It is * safe to say that this is an uncommon case, since buffers at the end of * the ARC lists have moved there due to inactivity. * * 4. If the ARC evicts faster than the L2ARC can maintain a headroom, * then the L2ARC simply misses copying some buffers. This serves as a * pressure valve to prevent heavy read workloads from both stalling the ARC * with waits and clogging the L2ARC with writes. This also helps prevent * the potential for the L2ARC to churn if it attempts to cache content too * quickly, such as during backups of the entire pool. * * 5. After system boot and before the ARC has filled main memory, there are * no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru * lists can remain mostly static. Instead of searching from tail of these * lists as pictured, the l2arc_feed_thread() will search from the list heads * for eligible buffers, greatly increasing its chance of finding them. * * The L2ARC device write speed is also boosted during this time so that * the L2ARC warms up faster. Since there have been no ARC evictions yet, * there are no L2ARC reads, and no fear of degrading read performance * through increased writes. * * 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that * the vdev queue can aggregate them into larger and fewer writes. Each * device is written to in a rotor fashion, sweeping writes through * available space then repeating. * * 7. The L2ARC does not store dirty content. It never needs to flush * write buffers back to disk based storage. * * 8. If an ARC buffer is written (and dirtied) which also exists in the * L2ARC, the now stale L2ARC buffer is immediately dropped. * * The performance of the L2ARC can be tweaked by a number of tunables, which * may be necessary for different workloads: * * l2arc_write_max max write bytes per interval * l2arc_write_boost extra write bytes during device warmup * l2arc_noprefetch skip caching prefetched buffers * l2arc_headroom number of max device writes to precache * l2arc_headroom_boost when we find compressed buffers during ARC * scanning, we multiply headroom by this * percentage factor for the next scan cycle, * since more compressed buffers are likely to * be present * l2arc_feed_secs seconds between L2ARC writing * * Tunables may be removed or added as future performance improvements are * integrated, and also may become zpool properties. * * There are three key functions that control how the L2ARC warms up: * * l2arc_write_eligible() check if a buffer is eligible to cache * l2arc_write_size() calculate how much to write * l2arc_write_interval() calculate sleep delay between writes * * These three functions determine what to write, how much, and how quickly * to send writes. * * L2ARC persistence: * * When writing buffers to L2ARC, we periodically add some metadata to * make sure we can pick them up after reboot, thus dramatically reducing * the impact that any downtime has on the performance of storage systems * with large caches. * * The implementation works fairly simply by integrating the following two * modifications: * * *) When writing to the L2ARC, we occasionally write a "l2arc log block", * which is an additional piece of metadata which describes what's been * written. This allows us to rebuild the arc_buf_hdr_t structures of the * main ARC buffers. There are 2 linked-lists of log blocks headed by * dh_start_lbps[2]. We alternate which chain we append to, so they are * time-wise and offset-wise interleaved, but that is an optimization rather * than for correctness. The log block also includes a pointer to the * previous block in its chain. * * *) We reserve SPA_MINBLOCKSIZE of space at the start of each L2ARC device * for our header bookkeeping purposes. This contains a device header, * which contains our top-level reference structures. We update it each * time we write a new log block, so that we're able to locate it in the * L2ARC device. If this write results in an inconsistent device header * (e.g. due to power failure), we detect this by verifying the header's * checksum and simply fail to reconstruct the L2ARC after reboot. * * Implementation diagram: * * +=== L2ARC device (not to scale) ======================================+ * | ___two newest log block pointers__.__________ | * | / \dh_start_lbps[1] | * | / \ \dh_start_lbps[0]| * |.___/__. V V | * ||L2 dev|....|lb |bufs |lb |bufs |lb |bufs |lb |bufs |lb |---(empty)---| * || hdr| ^ /^ /^ / / | * |+------+ ...--\-------/ \-----/--\------/ / | * | \--------------/ \--------------/ | * +======================================================================+ * * As can be seen on the diagram, rather than using a simple linked list, * we use a pair of linked lists with alternating elements. This is a * performance enhancement due to the fact that we only find out the * address of the next log block access once the current block has been * completely read in. Obviously, this hurts performance, because we'd be * keeping the device's I/O queue at only a 1 operation deep, thus * incurring a large amount of I/O round-trip latency. Having two lists * allows us to fetch two log blocks ahead of where we are currently * rebuilding L2ARC buffers. * * On-device data structures: * * L2ARC device header: l2arc_dev_hdr_phys_t * L2ARC log block: l2arc_log_blk_phys_t * * L2ARC reconstruction: * * When writing data, we simply write in the standard rotary fashion, * evicting buffers as we go and simply writing new data over them (writing * a new log block every now and then). This obviously means that once we * loop around the end of the device, we will start cutting into an already * committed log block (and its referenced data buffers), like so: * * current write head__ __old tail * \ / * V V * <--|bufs |lb |bufs |lb | |bufs |lb |bufs |lb |--> * ^ ^^^^^^^^^___________________________________ * | \ * <> may overwrite this blk and/or its bufs --' * * When importing the pool, we detect this situation and use it to stop * our scanning process (see l2arc_rebuild). * * There is one significant caveat to consider when rebuilding ARC contents * from an L2ARC device: what about invalidated buffers? Given the above * construction, we cannot update blocks which we've already written to amend * them to remove buffers which were invalidated. Thus, during reconstruction, * we might be populating the cache with buffers for data that's not on the * main pool anymore, or may have been overwritten! * * As it turns out, this isn't a problem. Every arc_read request includes * both the DVA and, crucially, the birth TXG of the BP the caller is * looking for. So even if the cache were populated by completely rotten * blocks for data that had been long deleted and/or overwritten, we'll * never actually return bad data from the cache, since the DVA with the * birth TXG uniquely identify a block in space and time - once created, * a block is immutable on disk. The worst thing we have done is wasted * some time and memory at l2arc rebuild to reconstruct outdated ARC * entries that will get dropped from the l2arc as it is being updated * with new blocks. * * L2ARC buffers that have been evicted by l2arc_evict() ahead of the write * hand are not restored. This is done by saving the offset (in bytes) * l2arc_evict() has evicted to in the L2ARC device header and taking it * into account when restoring buffers. */ static boolean_t l2arc_write_eligible(uint64_t spa_guid, arc_buf_hdr_t *hdr) { /* * A buffer is *not* eligible for the L2ARC if it: * 1. belongs to a different spa. * 2. is already cached on the L2ARC. * 3. has an I/O in progress (it may be an incomplete read). * 4. is flagged not eligible (zfs property). */ if (hdr->b_spa != spa_guid || HDR_HAS_L2HDR(hdr) || HDR_IO_IN_PROGRESS(hdr) || !HDR_L2CACHE(hdr)) return (B_FALSE); return (B_TRUE); } static uint64_t l2arc_write_size(l2arc_dev_t *dev) { uint64_t size, dev_size, tsize; /* * Make sure our globals have meaningful values in case the user * altered them. */ size = l2arc_write_max; if (size == 0) { cmn_err(CE_NOTE, "Bad value for l2arc_write_max, value must " "be greater than zero, resetting it to the default (%d)", L2ARC_WRITE_SIZE); size = l2arc_write_max = L2ARC_WRITE_SIZE; } if (arc_warm == B_FALSE) size += l2arc_write_boost; /* * Make sure the write size does not exceed the size of the cache * device. This is important in l2arc_evict(), otherwise infinite * iteration can occur. */ dev_size = dev->l2ad_end - dev->l2ad_start; tsize = size + l2arc_log_blk_overhead(size, dev); if (dev->l2ad_vdev->vdev_has_trim && l2arc_trim_ahead > 0) tsize += MAX(64 * 1024 * 1024, (tsize * l2arc_trim_ahead) / 100); if (tsize >= dev_size) { cmn_err(CE_NOTE, "l2arc_write_max or l2arc_write_boost " "plus the overhead of log blocks (persistent L2ARC, " "%llu bytes) exceeds the size of the cache device " "(guid %llu), resetting them to the default (%d)", l2arc_log_blk_overhead(size, dev), dev->l2ad_vdev->vdev_guid, L2ARC_WRITE_SIZE); size = l2arc_write_max = l2arc_write_boost = L2ARC_WRITE_SIZE; if (arc_warm == B_FALSE) size += l2arc_write_boost; } return (size); } static clock_t l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote) { clock_t interval, next, now; /* * If the ARC lists are busy, increase our write rate; if the * lists are stale, idle back. This is achieved by checking * how much we previously wrote - if it was more than half of * what we wanted, schedule the next write much sooner. */ if (l2arc_feed_again && wrote > (wanted / 2)) interval = (hz * l2arc_feed_min_ms) / 1000; else interval = hz * l2arc_feed_secs; now = ddi_get_lbolt(); next = MAX(now, MIN(now + interval, began + interval)); return (next); } /* * Cycle through L2ARC devices. This is how L2ARC load balances. * If a device is returned, this also returns holding the spa config lock. */ static l2arc_dev_t * l2arc_dev_get_next(void) { l2arc_dev_t *first, *next = NULL; /* * Lock out the removal of spas (spa_namespace_lock), then removal * of cache devices (l2arc_dev_mtx). Once a device has been selected, * both locks will be dropped and a spa config lock held instead. */ mutex_enter(&spa_namespace_lock); mutex_enter(&l2arc_dev_mtx); /* if there are no vdevs, there is nothing to do */ if (l2arc_ndev == 0) goto out; first = NULL; next = l2arc_dev_last; do { /* loop around the list looking for a non-faulted vdev */ if (next == NULL) { next = list_head(l2arc_dev_list); } else { next = list_next(l2arc_dev_list, next); if (next == NULL) next = list_head(l2arc_dev_list); } /* if we have come back to the start, bail out */ if (first == NULL) first = next; else if (next == first) break; } while (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild || next->l2ad_trim_all); /* if we were unable to find any usable vdevs, return NULL */ if (vdev_is_dead(next->l2ad_vdev) || next->l2ad_rebuild || next->l2ad_trim_all) next = NULL; l2arc_dev_last = next; out: mutex_exit(&l2arc_dev_mtx); /* * Grab the config lock to prevent the 'next' device from being * removed while we are writing to it. */ if (next != NULL) spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER); mutex_exit(&spa_namespace_lock); return (next); } /* * Free buffers that were tagged for destruction. */ static void l2arc_do_free_on_write(void) { list_t *buflist; l2arc_data_free_t *df, *df_prev; mutex_enter(&l2arc_free_on_write_mtx); buflist = l2arc_free_on_write; for (df = list_tail(buflist); df; df = df_prev) { df_prev = list_prev(buflist, df); ASSERT3P(df->l2df_abd, !=, NULL); abd_free(df->l2df_abd); list_remove(buflist, df); kmem_free(df, sizeof (l2arc_data_free_t)); } mutex_exit(&l2arc_free_on_write_mtx); } /* * A write to a cache device has completed. Update all headers to allow * reads from these buffers to begin. */ static void l2arc_write_done(zio_t *zio) { l2arc_write_callback_t *cb; l2arc_lb_abd_buf_t *abd_buf; l2arc_lb_ptr_buf_t *lb_ptr_buf; l2arc_dev_t *dev; l2arc_dev_hdr_phys_t *l2dhdr; list_t *buflist; arc_buf_hdr_t *head, *hdr, *hdr_prev; kmutex_t *hash_lock; int64_t bytes_dropped = 0; cb = zio->io_private; ASSERT3P(cb, !=, NULL); dev = cb->l2wcb_dev; l2dhdr = dev->l2ad_dev_hdr; ASSERT3P(dev, !=, NULL); head = cb->l2wcb_head; ASSERT3P(head, !=, NULL); buflist = &dev->l2ad_buflist; ASSERT3P(buflist, !=, NULL); DTRACE_PROBE2(l2arc__iodone, zio_t *, zio, l2arc_write_callback_t *, cb); if (zio->io_error != 0) ARCSTAT_BUMP(arcstat_l2_writes_error); /* * All writes completed, or an error was hit. */ top: mutex_enter(&dev->l2ad_mtx); for (hdr = list_prev(buflist, head); hdr; hdr = hdr_prev) { hdr_prev = list_prev(buflist, hdr); hash_lock = HDR_LOCK(hdr); /* * We cannot use mutex_enter or else we can deadlock * with l2arc_write_buffers (due to swapping the order * the hash lock and l2ad_mtx are taken). */ if (!mutex_tryenter(hash_lock)) { /* * Missed the hash lock. We must retry so we * don't leave the ARC_FLAG_L2_WRITING bit set. */ ARCSTAT_BUMP(arcstat_l2_writes_lock_retry); /* * We don't want to rescan the headers we've * already marked as having been written out, so * we reinsert the head node so we can pick up * where we left off. */ list_remove(buflist, head); list_insert_after(buflist, hdr, head); mutex_exit(&dev->l2ad_mtx); /* * We wait for the hash lock to become available * to try and prevent busy waiting, and increase * the chance we'll be able to acquire the lock * the next time around. */ mutex_enter(hash_lock); mutex_exit(hash_lock); goto top; } /* * We could not have been moved into the arc_l2c_only * state while in-flight due to our ARC_FLAG_L2_WRITING * bit being set. Let's just ensure that's being enforced. */ ASSERT(HDR_HAS_L1HDR(hdr)); /* * Skipped - drop L2ARC entry and mark the header as no * longer L2 eligibile. */ if (zio->io_error != 0) { /* * Error - drop L2ARC entry. */ list_remove(buflist, hdr); arc_hdr_clear_flags(hdr, ARC_FLAG_HAS_L2HDR); uint64_t psize = HDR_GET_PSIZE(hdr); ARCSTAT_INCR(arcstat_l2_psize, -psize); ARCSTAT_INCR(arcstat_l2_lsize, -HDR_GET_LSIZE(hdr)); bytes_dropped += vdev_psize_to_asize(dev->l2ad_vdev, psize); (void) zfs_refcount_remove_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr); } /* * Allow ARC to begin reads and ghost list evictions to * this L2ARC entry. */ arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING); mutex_exit(hash_lock); } /* * Free the allocated abd buffers for writing the log blocks. * If the zio failed reclaim the allocated space and remove the * pointers to these log blocks from the log block pointer list * of the L2ARC device. */ while ((abd_buf = list_remove_tail(&cb->l2wcb_abd_list)) != NULL) { abd_free(abd_buf->abd); zio_buf_free(abd_buf, sizeof (*abd_buf)); if (zio->io_error != 0) { lb_ptr_buf = list_remove_head(&dev->l2ad_lbptr_list); /* * L2BLK_GET_PSIZE returns aligned size for log * blocks. */ uint64_t asize = L2BLK_GET_PSIZE((lb_ptr_buf->lb_ptr)->lbp_prop); bytes_dropped += asize; ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize); ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count); zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf); zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf); kmem_free(lb_ptr_buf->lb_ptr, sizeof (l2arc_log_blkptr_t)); kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t)); } } list_destroy(&cb->l2wcb_abd_list); if (zio->io_error != 0) { /* * Restore the lbps array in the header to its previous state. * If the list of log block pointers is empty, zero out the * log block pointers in the device header. */ lb_ptr_buf = list_head(&dev->l2ad_lbptr_list); for (int i = 0; i < 2; i++) { if (lb_ptr_buf == NULL) { /* * If the list is empty zero out the device * header. Otherwise zero out the second log * block pointer in the header. */ if (i == 0) { bzero(l2dhdr, dev->l2ad_dev_hdr_asize); } else { bzero(&l2dhdr->dh_start_lbps[i], sizeof (l2arc_log_blkptr_t)); } break; } bcopy(lb_ptr_buf->lb_ptr, &l2dhdr->dh_start_lbps[i], sizeof (l2arc_log_blkptr_t)); lb_ptr_buf = list_next(&dev->l2ad_lbptr_list, lb_ptr_buf); } } atomic_inc_64(&l2arc_writes_done); list_remove(buflist, head); ASSERT(!HDR_HAS_L1HDR(head)); kmem_cache_free(hdr_l2only_cache, head); mutex_exit(&dev->l2ad_mtx); ASSERT(dev->l2ad_vdev != NULL); vdev_space_update(dev->l2ad_vdev, -bytes_dropped, 0, 0); l2arc_do_free_on_write(); kmem_free(cb, sizeof (l2arc_write_callback_t)); } static int l2arc_untransform(zio_t *zio, l2arc_read_callback_t *cb) { int ret; spa_t *spa = zio->io_spa; arc_buf_hdr_t *hdr = cb->l2rcb_hdr; blkptr_t *bp = zio->io_bp; uint8_t salt[ZIO_DATA_SALT_LEN]; uint8_t iv[ZIO_DATA_IV_LEN]; uint8_t mac[ZIO_DATA_MAC_LEN]; boolean_t no_crypt = B_FALSE; /* * ZIL data is never be written to the L2ARC, so we don't need * special handling for its unique MAC storage. */ ASSERT3U(BP_GET_TYPE(bp), !=, DMU_OT_INTENT_LOG); ASSERT(MUTEX_HELD(HDR_LOCK(hdr))); ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); /* * If the data was encrypted, decrypt it now. Note that * we must check the bp here and not the hdr, since the * hdr does not have its encryption parameters updated * until arc_read_done(). */ if (BP_IS_ENCRYPTED(bp)) { abd_t *eabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, B_TRUE); zio_crypt_decode_params_bp(bp, salt, iv); zio_crypt_decode_mac_bp(bp, mac); ret = spa_do_crypt_abd(B_FALSE, spa, &cb->l2rcb_zb, BP_GET_TYPE(bp), BP_GET_DEDUP(bp), BP_SHOULD_BYTESWAP(bp), salt, iv, mac, HDR_GET_PSIZE(hdr), eabd, hdr->b_l1hdr.b_pabd, &no_crypt); if (ret != 0) { arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr); goto error; } /* * If we actually performed decryption, replace b_pabd * with the decrypted data. Otherwise we can just throw * our decryption buffer away. */ if (!no_crypt) { arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, arc_hdr_size(hdr), hdr); hdr->b_l1hdr.b_pabd = eabd; zio->io_abd = eabd; } else { arc_free_data_abd(hdr, eabd, arc_hdr_size(hdr), hdr); } } /* * If the L2ARC block was compressed, but ARC compression * is disabled we decompress the data into a new buffer and * replace the existing data. */ if (HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) { abd_t *cabd = arc_get_data_abd(hdr, arc_hdr_size(hdr), hdr, B_TRUE); void *tmp = abd_borrow_buf(cabd, arc_hdr_size(hdr)); ret = zio_decompress_data(HDR_GET_COMPRESS(hdr), hdr->b_l1hdr.b_pabd, tmp, HDR_GET_PSIZE(hdr), HDR_GET_LSIZE(hdr), &hdr->b_complevel); if (ret != 0) { abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr)); arc_free_data_abd(hdr, cabd, arc_hdr_size(hdr), hdr); goto error; } abd_return_buf_copy(cabd, tmp, arc_hdr_size(hdr)); arc_free_data_abd(hdr, hdr->b_l1hdr.b_pabd, arc_hdr_size(hdr), hdr); hdr->b_l1hdr.b_pabd = cabd; zio->io_abd = cabd; zio->io_size = HDR_GET_LSIZE(hdr); } return (0); error: return (ret); } /* * A read to a cache device completed. Validate buffer contents before * handing over to the regular ARC routines. */ static void l2arc_read_done(zio_t *zio) { int tfm_error = 0; l2arc_read_callback_t *cb = zio->io_private; arc_buf_hdr_t *hdr; kmutex_t *hash_lock; boolean_t valid_cksum; boolean_t using_rdata = (BP_IS_ENCRYPTED(&cb->l2rcb_bp) && (cb->l2rcb_flags & ZIO_FLAG_RAW_ENCRYPT)); ASSERT3P(zio->io_vd, !=, NULL); ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE); spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd); ASSERT3P(cb, !=, NULL); hdr = cb->l2rcb_hdr; ASSERT3P(hdr, !=, NULL); hash_lock = HDR_LOCK(hdr); mutex_enter(hash_lock); ASSERT3P(hash_lock, ==, HDR_LOCK(hdr)); /* * If the data was read into a temporary buffer, * move it and free the buffer. */ if (cb->l2rcb_abd != NULL) { ASSERT3U(arc_hdr_size(hdr), <, zio->io_size); if (zio->io_error == 0) { if (using_rdata) { abd_copy(hdr->b_crypt_hdr.b_rabd, cb->l2rcb_abd, arc_hdr_size(hdr)); } else { abd_copy(hdr->b_l1hdr.b_pabd, cb->l2rcb_abd, arc_hdr_size(hdr)); } } /* * The following must be done regardless of whether * there was an error: * - free the temporary buffer * - point zio to the real ARC buffer * - set zio size accordingly * These are required because zio is either re-used for * an I/O of the block in the case of the error * or the zio is passed to arc_read_done() and it * needs real data. */ abd_free(cb->l2rcb_abd); zio->io_size = zio->io_orig_size = arc_hdr_size(hdr); if (using_rdata) { ASSERT(HDR_HAS_RABD(hdr)); zio->io_abd = zio->io_orig_abd = hdr->b_crypt_hdr.b_rabd; } else { ASSERT3P(hdr->b_l1hdr.b_pabd, !=, NULL); zio->io_abd = zio->io_orig_abd = hdr->b_l1hdr.b_pabd; } } ASSERT3P(zio->io_abd, !=, NULL); /* * Check this survived the L2ARC journey. */ ASSERT(zio->io_abd == hdr->b_l1hdr.b_pabd || (HDR_HAS_RABD(hdr) && zio->io_abd == hdr->b_crypt_hdr.b_rabd)); zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */ zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */ zio->io_prop.zp_complevel = hdr->b_complevel; valid_cksum = arc_cksum_is_equal(hdr, zio); /* * b_rabd will always match the data as it exists on disk if it is * being used. Therefore if we are reading into b_rabd we do not * attempt to untransform the data. */ if (valid_cksum && !using_rdata) tfm_error = l2arc_untransform(zio, cb); if (valid_cksum && tfm_error == 0 && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) { mutex_exit(hash_lock); zio->io_private = hdr; arc_read_done(zio); } else { /* * Buffer didn't survive caching. Increment stats and * reissue to the original storage device. */ if (zio->io_error != 0) { ARCSTAT_BUMP(arcstat_l2_io_error); } else { zio->io_error = SET_ERROR(EIO); } if (!valid_cksum || tfm_error != 0) ARCSTAT_BUMP(arcstat_l2_cksum_bad); /* * If there's no waiter, issue an async i/o to the primary * storage now. If there *is* a waiter, the caller must * issue the i/o in a context where it's OK to block. */ if (zio->io_waiter == NULL) { zio_t *pio = zio_unique_parent(zio); void *abd = (using_rdata) ? hdr->b_crypt_hdr.b_rabd : hdr->b_l1hdr.b_pabd; ASSERT(!pio || pio->io_child_type == ZIO_CHILD_LOGICAL); zio = zio_read(pio, zio->io_spa, zio->io_bp, abd, zio->io_size, arc_read_done, hdr, zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb); /* * Original ZIO will be freed, so we need to update * ARC header with the new ZIO pointer to be used * by zio_change_priority() in arc_read(). */ for (struct arc_callback *acb = hdr->b_l1hdr.b_acb; acb != NULL; acb = acb->acb_next) acb->acb_zio_head = zio; mutex_exit(hash_lock); zio_nowait(zio); } else { mutex_exit(hash_lock); } } kmem_free(cb, sizeof (l2arc_read_callback_t)); } /* * This is the list priority from which the L2ARC will search for pages to * cache. This is used within loops (0..3) to cycle through lists in the * desired order. This order can have a significant effect on cache * performance. * * Currently the metadata lists are hit first, MFU then MRU, followed by * the data lists. This function returns a locked list, and also returns * the lock pointer. */ static multilist_sublist_t * l2arc_sublist_lock(int list_num) { multilist_t *ml = NULL; unsigned int idx; ASSERT(list_num >= 0 && list_num < L2ARC_FEED_TYPES); switch (list_num) { case 0: ml = arc_mfu->arcs_list[ARC_BUFC_METADATA]; break; case 1: ml = arc_mru->arcs_list[ARC_BUFC_METADATA]; break; case 2: ml = arc_mfu->arcs_list[ARC_BUFC_DATA]; break; case 3: ml = arc_mru->arcs_list[ARC_BUFC_DATA]; break; default: return (NULL); } /* * Return a randomly-selected sublist. This is acceptable * because the caller feeds only a little bit of data for each * call (8MB). Subsequent calls will result in different * sublists being selected. */ idx = multilist_get_random_index(ml); return (multilist_sublist_lock(ml, idx)); } /* * Calculates the maximum overhead of L2ARC metadata log blocks for a given * L2ARC write size. l2arc_evict and l2arc_write_size need to include this * overhead in processing to make sure there is enough headroom available * when writing buffers. */ static inline uint64_t l2arc_log_blk_overhead(uint64_t write_sz, l2arc_dev_t *dev) { if (dev->l2ad_log_entries == 0) { return (0); } else { uint64_t log_entries = write_sz >> SPA_MINBLOCKSHIFT; uint64_t log_blocks = (log_entries + dev->l2ad_log_entries - 1) / dev->l2ad_log_entries; return (vdev_psize_to_asize(dev->l2ad_vdev, sizeof (l2arc_log_blk_phys_t)) * log_blocks); } } /* * Evict buffers from the device write hand to the distance specified in * bytes. This distance may span populated buffers, it may span nothing. * This is clearing a region on the L2ARC device ready for writing. * If the 'all' boolean is set, every buffer is evicted. */ static void l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all) { list_t *buflist; arc_buf_hdr_t *hdr, *hdr_prev; kmutex_t *hash_lock; uint64_t taddr; l2arc_lb_ptr_buf_t *lb_ptr_buf, *lb_ptr_buf_prev; vdev_t *vd = dev->l2ad_vdev; boolean_t rerun; buflist = &dev->l2ad_buflist; /* * We need to add in the worst case scenario of log block overhead. */ distance += l2arc_log_blk_overhead(distance, dev); if (vd->vdev_has_trim && l2arc_trim_ahead > 0) { /* * Trim ahead of the write size 64MB or (l2arc_trim_ahead/100) * times the write size, whichever is greater. */ distance += MAX(64 * 1024 * 1024, (distance * l2arc_trim_ahead) / 100); } top: rerun = B_FALSE; if (dev->l2ad_hand >= (dev->l2ad_end - distance)) { /* * When there is no space to accommodate upcoming writes, * evict to the end. Then bump the write and evict hands * to the start and iterate. This iteration does not * happen indefinitely as we make sure in * l2arc_write_size() that when the write hand is reset, * the write size does not exceed the end of the device. */ rerun = B_TRUE; taddr = dev->l2ad_end; } else { taddr = dev->l2ad_hand + distance; } DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist, uint64_t, taddr, boolean_t, all); if (!all) { /* * This check has to be placed after deciding whether to * iterate (rerun). */ if (dev->l2ad_first) { /* * This is the first sweep through the device. There is * nothing to evict. We have already trimmmed the * whole device. */ goto out; } else { /* * Trim the space to be evicted. */ if (vd->vdev_has_trim && dev->l2ad_evict < taddr && l2arc_trim_ahead > 0) { /* * We have to drop the spa_config lock because * vdev_trim_range() will acquire it. * l2ad_evict already accounts for the label * size. To prevent vdev_trim_ranges() from * adding it again, we subtract it from * l2ad_evict. */ spa_config_exit(dev->l2ad_spa, SCL_L2ARC, dev); vdev_trim_simple(vd, dev->l2ad_evict - VDEV_LABEL_START_SIZE, taddr - dev->l2ad_evict); spa_config_enter(dev->l2ad_spa, SCL_L2ARC, dev, RW_READER); } /* * When rebuilding L2ARC we retrieve the evict hand * from the header of the device. Of note, l2arc_evict() * does not actually delete buffers from the cache * device, but trimming may do so depending on the * hardware implementation. Thus keeping track of the * evict hand is useful. */ dev->l2ad_evict = MAX(dev->l2ad_evict, taddr); } } retry: mutex_enter(&dev->l2ad_mtx); /* * We have to account for evicted log blocks. Run vdev_space_update() * on log blocks whose offset (in bytes) is before the evicted offset * (in bytes) by searching in the list of pointers to log blocks * present in the L2ARC device. */ for (lb_ptr_buf = list_tail(&dev->l2ad_lbptr_list); lb_ptr_buf; lb_ptr_buf = lb_ptr_buf_prev) { lb_ptr_buf_prev = list_prev(&dev->l2ad_lbptr_list, lb_ptr_buf); /* L2BLK_GET_PSIZE returns aligned size for log blocks */ uint64_t asize = L2BLK_GET_PSIZE( (lb_ptr_buf->lb_ptr)->lbp_prop); /* * We don't worry about log blocks left behind (ie * lbp_payload_start < l2ad_hand) because l2arc_write_buffers() * will never write more than l2arc_evict() evicts. */ if (!all && l2arc_log_blkptr_valid(dev, lb_ptr_buf->lb_ptr)) { break; } else { vdev_space_update(vd, -asize, 0, 0); ARCSTAT_INCR(arcstat_l2_log_blk_asize, -asize); ARCSTAT_BUMPDOWN(arcstat_l2_log_blk_count); zfs_refcount_remove_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf); zfs_refcount_remove(&dev->l2ad_lb_count, lb_ptr_buf); list_remove(&dev->l2ad_lbptr_list, lb_ptr_buf); kmem_free(lb_ptr_buf->lb_ptr, sizeof (l2arc_log_blkptr_t)); kmem_free(lb_ptr_buf, sizeof (l2arc_lb_ptr_buf_t)); } } for (hdr = list_tail(buflist); hdr; hdr = hdr_prev) { hdr_prev = list_prev(buflist, hdr); ASSERT(!HDR_EMPTY(hdr)); hash_lock = HDR_LOCK(hdr); /* * We cannot use mutex_enter or else we can deadlock * with l2arc_write_buffers (due to swapping the order * the hash lock and l2ad_mtx are taken). */ if (!mutex_tryenter(hash_lock)) { /* * Missed the hash lock. Retry. */ ARCSTAT_BUMP(arcstat_l2_evict_lock_retry); mutex_exit(&dev->l2ad_mtx); mutex_enter(hash_lock); mutex_exit(hash_lock); goto retry; } /* * A header can't be on this list if it doesn't have L2 header. */ ASSERT(HDR_HAS_L2HDR(hdr)); /* Ensure this header has finished being written. */ ASSERT(!HDR_L2_WRITING(hdr)); ASSERT(!HDR_L2_WRITE_HEAD(hdr)); if (!all && (hdr->b_l2hdr.b_daddr >= dev->l2ad_evict || hdr->b_l2hdr.b_daddr < dev->l2ad_hand)) { /* * We've evicted to the target address, * or the end of the device. */ mutex_exit(hash_lock); break; } if (!HDR_HAS_L1HDR(hdr)) { ASSERT(!HDR_L2_READING(hdr)); /* * This doesn't exist in the ARC. Destroy. * arc_hdr_destroy() will call list_remove() * and decrement arcstat_l2_lsize. */ arc_change_state(arc_anon, hdr, hash_lock); arc_hdr_destroy(hdr); } else { ASSERT(hdr->b_l1hdr.b_state != arc_l2c_only); ARCSTAT_BUMP(arcstat_l2_evict_l1cached); /* * Invalidate issued or about to be issued * reads, since we may be about to write * over this location. */ if (HDR_L2_READING(hdr)) { ARCSTAT_BUMP(arcstat_l2_evict_reading); arc_hdr_set_flags(hdr, ARC_FLAG_L2_EVICTED); } arc_hdr_l2hdr_destroy(hdr); } mutex_exit(hash_lock); } mutex_exit(&dev->l2ad_mtx); out: /* * We need to check if we evict all buffers, otherwise we may iterate * unnecessarily. */ if (!all && rerun) { /* * Bump device hand to the device start if it is approaching the * end. l2arc_evict() has already evicted ahead for this case. */ dev->l2ad_hand = dev->l2ad_start; dev->l2ad_evict = dev->l2ad_start; dev->l2ad_first = B_FALSE; goto top; } ASSERT3U(dev->l2ad_hand + distance, <, dev->l2ad_end); if (!dev->l2ad_first) ASSERT3U(dev->l2ad_hand, <, dev->l2ad_evict); } /* * Handle any abd transforms that might be required for writing to the L2ARC. * If successful, this function will always return an abd with the data * transformed as it is on disk in a new abd of asize bytes. */ static int l2arc_apply_transforms(spa_t *spa, arc_buf_hdr_t *hdr, uint64_t asize, abd_t **abd_out) { int ret; void *tmp = NULL; abd_t *cabd = NULL, *eabd = NULL, *to_write = hdr->b_l1hdr.b_pabd; enum zio_compress compress = HDR_GET_COMPRESS(hdr); uint64_t psize = HDR_GET_PSIZE(hdr); uint64_t size = arc_hdr_size(hdr); boolean_t ismd = HDR_ISTYPE_METADATA(hdr); boolean_t bswap = (hdr->b_l1hdr.b_byteswap != DMU_BSWAP_NUMFUNCS); dsl_crypto_key_t *dck = NULL; uint8_t mac[ZIO_DATA_MAC_LEN] = { 0 }; boolean_t no_crypt = B_FALSE; ASSERT((HDR_GET_COMPRESS(hdr) != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) || HDR_ENCRYPTED(hdr) || HDR_SHARED_DATA(hdr) || psize != asize); ASSERT3U(psize, <=, asize); /* * If this data simply needs its own buffer, we simply allocate it * and copy the data. This may be done to eliminate a dependency on a * shared buffer or to reallocate the buffer to match asize. */ if (HDR_HAS_RABD(hdr) && asize != psize) { ASSERT3U(asize, >=, psize); to_write = abd_alloc_for_io(asize, ismd); abd_copy(to_write, hdr->b_crypt_hdr.b_rabd, psize); if (psize != asize) abd_zero_off(to_write, psize, asize - psize); goto out; } if ((compress == ZIO_COMPRESS_OFF || HDR_COMPRESSION_ENABLED(hdr)) && !HDR_ENCRYPTED(hdr)) { ASSERT3U(size, ==, psize); to_write = abd_alloc_for_io(asize, ismd); abd_copy(to_write, hdr->b_l1hdr.b_pabd, size); if (size != asize) abd_zero_off(to_write, size, asize - size); goto out; } if (compress != ZIO_COMPRESS_OFF && !HDR_COMPRESSION_ENABLED(hdr)) { cabd = abd_alloc_for_io(asize, ismd); tmp = abd_borrow_buf(cabd, asize); psize = zio_compress_data(compress, to_write, tmp, size, hdr->b_complevel); if (psize >= size) { abd_return_buf(cabd, tmp, asize); HDR_SET_COMPRESS(hdr, ZIO_COMPRESS_OFF); to_write = cabd; abd_copy(to_write, hdr->b_l1hdr.b_pabd, size); if (size != asize) abd_zero_off(to_write, size, asize - size); goto encrypt; } ASSERT3U(psize, <=, HDR_GET_PSIZE(hdr)); if (psize < asize) bzero((char *)tmp + psize, asize - psize); psize = HDR_GET_PSIZE(hdr); abd_return_buf_copy(cabd, tmp, asize); to_write = cabd; } encrypt: if (HDR_ENCRYPTED(hdr)) { eabd = abd_alloc_for_io(asize, ismd); /* * If the dataset was disowned before the buffer * made it to this point, the key to re-encrypt * it won't be available. In this case we simply * won't write the buffer to the L2ARC. */ ret = spa_keystore_lookup_key(spa, hdr->b_crypt_hdr.b_dsobj, FTAG, &dck); if (ret != 0) goto error; ret = zio_do_crypt_abd(B_TRUE, &dck->dck_key, hdr->b_crypt_hdr.b_ot, bswap, hdr->b_crypt_hdr.b_salt, hdr->b_crypt_hdr.b_iv, mac, psize, to_write, eabd, &no_crypt); if (ret != 0) goto error; if (no_crypt) abd_copy(eabd, to_write, psize); if (psize != asize) abd_zero_off(eabd, psize, asize - psize); /* assert that the MAC we got here matches the one we saved */ ASSERT0(bcmp(mac, hdr->b_crypt_hdr.b_mac, ZIO_DATA_MAC_LEN)); spa_keystore_dsl_key_rele(spa, dck, FTAG); if (to_write == cabd) abd_free(cabd); to_write = eabd; } out: ASSERT3P(to_write, !=, hdr->b_l1hdr.b_pabd); *abd_out = to_write; return (0); error: if (dck != NULL) spa_keystore_dsl_key_rele(spa, dck, FTAG); if (cabd != NULL) abd_free(cabd); if (eabd != NULL) abd_free(eabd); *abd_out = NULL; return (ret); } static void l2arc_blk_fetch_done(zio_t *zio) { l2arc_read_callback_t *cb; cb = zio->io_private; if (cb->l2rcb_abd != NULL) abd_put(cb->l2rcb_abd); kmem_free(cb, sizeof (l2arc_read_callback_t)); } /* * Find and write ARC buffers to the L2ARC device. * * An ARC_FLAG_L2_WRITING flag is set so that the L2ARC buffers are not valid * for reading until they have completed writing. * The headroom_boost is an in-out parameter used to maintain headroom boost * state between calls to this function. * * Returns the number of bytes actually written (which may be smaller than * the delta by which the device hand has changed due to alignment and the * writing of log blocks). */ static uint64_t l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz) { arc_buf_hdr_t *hdr, *hdr_prev, *head; uint64_t write_asize, write_psize, write_lsize, headroom; boolean_t full; l2arc_write_callback_t *cb = NULL; zio_t *pio, *wzio; uint64_t guid = spa_load_guid(spa); ASSERT3P(dev->l2ad_vdev, !=, NULL); pio = NULL; write_lsize = write_asize = write_psize = 0; full = B_FALSE; head = kmem_cache_alloc(hdr_l2only_cache, KM_PUSHPAGE); arc_hdr_set_flags(head, ARC_FLAG_L2_WRITE_HEAD | ARC_FLAG_HAS_L2HDR); /* * Copy buffers for L2ARC writing. */ for (int try = 0; try < L2ARC_FEED_TYPES; try++) { /* * If try == 1 or 3, we cache MRU metadata and data * respectively. */ if (l2arc_mfuonly) { if (try == 1 || try == 3) continue; } multilist_sublist_t *mls = l2arc_sublist_lock(try); uint64_t passed_sz = 0; VERIFY3P(mls, !=, NULL); /* * L2ARC fast warmup. * * Until the ARC is warm and starts to evict, read from the * head of the ARC lists rather than the tail. */ if (arc_warm == B_FALSE) hdr = multilist_sublist_head(mls); else hdr = multilist_sublist_tail(mls); headroom = target_sz * l2arc_headroom; if (zfs_compressed_arc_enabled) headroom = (headroom * l2arc_headroom_boost) / 100; for (; hdr; hdr = hdr_prev) { kmutex_t *hash_lock; abd_t *to_write = NULL; if (arc_warm == B_FALSE) hdr_prev = multilist_sublist_next(mls, hdr); else hdr_prev = multilist_sublist_prev(mls, hdr); hash_lock = HDR_LOCK(hdr); if (!mutex_tryenter(hash_lock)) { /* * Skip this buffer rather than waiting. */ continue; } passed_sz += HDR_GET_LSIZE(hdr); if (l2arc_headroom != 0 && passed_sz > headroom) { /* * Searched too far. */ mutex_exit(hash_lock); break; } if (!l2arc_write_eligible(guid, hdr)) { mutex_exit(hash_lock); continue; } /* * We rely on the L1 portion of the header below, so * it's invalid for this header to have been evicted out * of the ghost cache, prior to being written out. The * ARC_FLAG_L2_WRITING bit ensures this won't happen. */ ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT3U(HDR_GET_PSIZE(hdr), >, 0); ASSERT3U(arc_hdr_size(hdr), >, 0); ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); uint64_t psize = HDR_GET_PSIZE(hdr); uint64_t asize = vdev_psize_to_asize(dev->l2ad_vdev, psize); if ((write_asize + asize) > target_sz) { full = B_TRUE; mutex_exit(hash_lock); break; } /* * We rely on the L1 portion of the header below, so * it's invalid for this header to have been evicted out * of the ghost cache, prior to being written out. The * ARC_FLAG_L2_WRITING bit ensures this won't happen. */ arc_hdr_set_flags(hdr, ARC_FLAG_L2_WRITING); ASSERT(HDR_HAS_L1HDR(hdr)); ASSERT3U(HDR_GET_PSIZE(hdr), >, 0); ASSERT(hdr->b_l1hdr.b_pabd != NULL || HDR_HAS_RABD(hdr)); ASSERT3U(arc_hdr_size(hdr), >, 0); /* * If this header has b_rabd, we can use this since it * must always match the data exactly as it exists on * disk. Otherwise, the L2ARC can normally use the * hdr's data, but if we're sharing data between the * hdr and one of its bufs, L2ARC needs its own copy of * the data so that the ZIO below can't race with the * buf consumer. To ensure that this copy will be * available for the lifetime of the ZIO and be cleaned * up afterwards, we add it to the l2arc_free_on_write * queue. If we need to apply any transforms to the * data (compression, encryption) we will also need the * extra buffer. */ if (HDR_HAS_RABD(hdr) && psize == asize) { to_write = hdr->b_crypt_hdr.b_rabd; } else if ((HDR_COMPRESSION_ENABLED(hdr) || HDR_GET_COMPRESS(hdr) == ZIO_COMPRESS_OFF) && !HDR_ENCRYPTED(hdr) && !HDR_SHARED_DATA(hdr) && psize == asize) { to_write = hdr->b_l1hdr.b_pabd; } else { int ret; arc_buf_contents_t type = arc_buf_type(hdr); ret = l2arc_apply_transforms(spa, hdr, asize, &to_write); if (ret != 0) { arc_hdr_clear_flags(hdr, ARC_FLAG_L2_WRITING); mutex_exit(hash_lock); continue; } l2arc_free_abd_on_write(to_write, asize, type); } if (pio == NULL) { /* * Insert a dummy header on the buflist so * l2arc_write_done() can find where the * write buffers begin without searching. */ mutex_enter(&dev->l2ad_mtx); list_insert_head(&dev->l2ad_buflist, head); mutex_exit(&dev->l2ad_mtx); cb = kmem_alloc( sizeof (l2arc_write_callback_t), KM_SLEEP); cb->l2wcb_dev = dev; cb->l2wcb_head = head; /* * Create a list to save allocated abd buffers * for l2arc_log_blk_commit(). */ list_create(&cb->l2wcb_abd_list, sizeof (l2arc_lb_abd_buf_t), offsetof(l2arc_lb_abd_buf_t, node)); pio = zio_root(spa, l2arc_write_done, cb, ZIO_FLAG_CANFAIL); } hdr->b_l2hdr.b_dev = dev; hdr->b_l2hdr.b_hits = 0; hdr->b_l2hdr.b_daddr = dev->l2ad_hand; arc_hdr_set_flags(hdr, ARC_FLAG_HAS_L2HDR); mutex_enter(&dev->l2ad_mtx); list_insert_head(&dev->l2ad_buflist, hdr); mutex_exit(&dev->l2ad_mtx); (void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr); wzio = zio_write_phys(pio, dev->l2ad_vdev, hdr->b_l2hdr.b_daddr, asize, to_write, ZIO_CHECKSUM_OFF, NULL, hdr, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE); write_lsize += HDR_GET_LSIZE(hdr); DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio); write_psize += psize; write_asize += asize; dev->l2ad_hand += asize; vdev_space_update(dev->l2ad_vdev, asize, 0, 0); mutex_exit(hash_lock); /* * Append buf info to current log and commit if full. * arcstat_l2_{size,asize} kstats are updated * internally. */ if (l2arc_log_blk_insert(dev, hdr)) l2arc_log_blk_commit(dev, pio, cb); zio_nowait(wzio); } multilist_sublist_unlock(mls); if (full == B_TRUE) break; } /* No buffers selected for writing? */ if (pio == NULL) { ASSERT0(write_lsize); ASSERT(!HDR_HAS_L1HDR(head)); kmem_cache_free(hdr_l2only_cache, head); /* * Although we did not write any buffers l2ad_evict may * have advanced. */ l2arc_dev_hdr_update(dev); return (0); } if (!dev->l2ad_first) ASSERT3U(dev->l2ad_hand, <=, dev->l2ad_evict); ASSERT3U(write_asize, <=, target_sz); ARCSTAT_BUMP(arcstat_l2_writes_sent); ARCSTAT_INCR(arcstat_l2_write_bytes, write_psize); ARCSTAT_INCR(arcstat_l2_lsize, write_lsize); ARCSTAT_INCR(arcstat_l2_psize, write_psize); dev->l2ad_writing = B_TRUE; (void) zio_wait(pio); dev->l2ad_writing = B_FALSE; /* * Update the device header after the zio completes as * l2arc_write_done() may have updated the memory holding the log block * pointers in the device header. */ l2arc_dev_hdr_update(dev); return (write_asize); } static boolean_t l2arc_hdr_limit_reached(void) { int64_t s = aggsum_upper_bound(&astat_l2_hdr_size); return (arc_reclaim_needed() || (s > arc_meta_limit * 3 / 4) || (s > (arc_warm ? arc_c : arc_c_max) * l2arc_meta_percent / 100)); } /* * This thread feeds the L2ARC at regular intervals. This is the beating * heart of the L2ARC. */ /* ARGSUSED */ static void l2arc_feed_thread(void *unused) { callb_cpr_t cpr; l2arc_dev_t *dev; spa_t *spa; uint64_t size, wrote; clock_t begin, next = ddi_get_lbolt(); fstrans_cookie_t cookie; CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG); mutex_enter(&l2arc_feed_thr_lock); cookie = spl_fstrans_mark(); while (l2arc_thread_exit == 0) { CALLB_CPR_SAFE_BEGIN(&cpr); (void) cv_timedwait_idle(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock, next); CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock); next = ddi_get_lbolt() + hz; /* * Quick check for L2ARC devices. */ mutex_enter(&l2arc_dev_mtx); if (l2arc_ndev == 0) { mutex_exit(&l2arc_dev_mtx); continue; } mutex_exit(&l2arc_dev_mtx); begin = ddi_get_lbolt(); /* * This selects the next l2arc device to write to, and in * doing so the next spa to feed from: dev->l2ad_spa. This * will return NULL if there are now no l2arc devices or if * they are all faulted. * * If a device is returned, its spa's config lock is also * held to prevent device removal. l2arc_dev_get_next() * will grab and release l2arc_dev_mtx. */ if ((dev = l2arc_dev_get_next()) == NULL) continue; spa = dev->l2ad_spa; ASSERT3P(spa, !=, NULL); /* * If the pool is read-only then force the feed thread to * sleep a little longer. */ if (!spa_writeable(spa)) { next = ddi_get_lbolt() + 5 * l2arc_feed_secs * hz; spa_config_exit(spa, SCL_L2ARC, dev); continue; } /* * Avoid contributing to memory pressure. */ if (l2arc_hdr_limit_reached()) { ARCSTAT_BUMP(arcstat_l2_abort_lowmem); spa_config_exit(spa, SCL_L2ARC, dev); continue; } ARCSTAT_BUMP(arcstat_l2_feeds); size = l2arc_write_size(dev); /* * Evict L2ARC buffers that will be overwritten. */ l2arc_evict(dev, size, B_FALSE); /* * Write ARC buffers. */ wrote = l2arc_write_buffers(spa, dev, size); /* * Calculate interval between writes. */ next = l2arc_write_interval(begin, size, wrote); spa_config_exit(spa, SCL_L2ARC, dev); } spl_fstrans_unmark(cookie); l2arc_thread_exit = 0; cv_broadcast(&l2arc_feed_thr_cv); CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */ thread_exit(); } boolean_t l2arc_vdev_present(vdev_t *vd) { return (l2arc_vdev_get(vd) != NULL); } /* * Returns the l2arc_dev_t associated with a particular vdev_t or NULL if * the vdev_t isn't an L2ARC device. */ l2arc_dev_t * l2arc_vdev_get(vdev_t *vd) { l2arc_dev_t *dev; mutex_enter(&l2arc_dev_mtx); for (dev = list_head(l2arc_dev_list); dev != NULL; dev = list_next(l2arc_dev_list, dev)) { if (dev->l2ad_vdev == vd) break; } mutex_exit(&l2arc_dev_mtx); return (dev); } /* * Add a vdev for use by the L2ARC. By this point the spa has already * validated the vdev and opened it. */ void l2arc_add_vdev(spa_t *spa, vdev_t *vd) { l2arc_dev_t *adddev; uint64_t l2dhdr_asize; ASSERT(!l2arc_vdev_present(vd)); /* * Create a new l2arc device entry. */ adddev = vmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP); adddev->l2ad_spa = spa; adddev->l2ad_vdev = vd; /* leave extra size for an l2arc device header */ l2dhdr_asize = adddev->l2ad_dev_hdr_asize = MAX(sizeof (*adddev->l2ad_dev_hdr), 1 << vd->vdev_ashift); adddev->l2ad_start = VDEV_LABEL_START_SIZE + l2dhdr_asize; adddev->l2ad_end = VDEV_LABEL_START_SIZE + vdev_get_min_asize(vd); ASSERT3U(adddev->l2ad_start, <, adddev->l2ad_end); adddev->l2ad_hand = adddev->l2ad_start; adddev->l2ad_evict = adddev->l2ad_start; adddev->l2ad_first = B_TRUE; adddev->l2ad_writing = B_FALSE; adddev->l2ad_trim_all = B_FALSE; list_link_init(&adddev->l2ad_node); adddev->l2ad_dev_hdr = kmem_zalloc(l2dhdr_asize, KM_SLEEP); mutex_init(&adddev->l2ad_mtx, NULL, MUTEX_DEFAULT, NULL); /* * This is a list of all ARC buffers that are still valid on the * device. */ list_create(&adddev->l2ad_buflist, sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_l2hdr.b_l2node)); /* * This is a list of pointers to log blocks that are still present * on the device. */ list_create(&adddev->l2ad_lbptr_list, sizeof (l2arc_lb_ptr_buf_t), offsetof(l2arc_lb_ptr_buf_t, node)); vdev_space_update(vd, 0, 0, adddev->l2ad_end - adddev->l2ad_hand); zfs_refcount_create(&adddev->l2ad_alloc); zfs_refcount_create(&adddev->l2ad_lb_asize); zfs_refcount_create(&adddev->l2ad_lb_count); /* * Add device to global list */ mutex_enter(&l2arc_dev_mtx); list_insert_head(l2arc_dev_list, adddev); atomic_inc_64(&l2arc_ndev); mutex_exit(&l2arc_dev_mtx); /* * Decide if vdev is eligible for L2ARC rebuild */ l2arc_rebuild_vdev(adddev->l2ad_vdev, B_FALSE); } void l2arc_rebuild_vdev(vdev_t *vd, boolean_t reopen) { l2arc_dev_t *dev = NULL; l2arc_dev_hdr_phys_t *l2dhdr; uint64_t l2dhdr_asize; spa_t *spa; int err; boolean_t l2dhdr_valid = B_TRUE; dev = l2arc_vdev_get(vd); ASSERT3P(dev, !=, NULL); spa = dev->l2ad_spa; l2dhdr = dev->l2ad_dev_hdr; l2dhdr_asize = dev->l2ad_dev_hdr_asize; /* * The L2ARC has to hold at least the payload of one log block for * them to be restored (persistent L2ARC). The payload of a log block * depends on the amount of its log entries. We always write log blocks * with 1022 entries. How many of them are committed or restored depends * on the size of the L2ARC device. Thus the maximum payload of * one log block is 1022 * SPA_MAXBLOCKSIZE = 16GB. If the L2ARC device * is less than that, we reduce the amount of committed and restored * log entries per block so as to enable persistence. */ if (dev->l2ad_end < l2arc_rebuild_blocks_min_l2size) { dev->l2ad_log_entries = 0; } else { dev->l2ad_log_entries = MIN((dev->l2ad_end - dev->l2ad_start) >> SPA_MAXBLOCKSHIFT, L2ARC_LOG_BLK_MAX_ENTRIES); } /* * Read the device header, if an error is returned do not rebuild L2ARC. */ if ((err = l2arc_dev_hdr_read(dev)) != 0) l2dhdr_valid = B_FALSE; if (l2dhdr_valid && dev->l2ad_log_entries > 0) { /* * If we are onlining a cache device (vdev_reopen) that was * still present (l2arc_vdev_present()) and rebuild is enabled, * we should evict all ARC buffers and pointers to log blocks * and reclaim their space before restoring its contents to * L2ARC. */ if (reopen) { if (!l2arc_rebuild_enabled) { return; } else { l2arc_evict(dev, 0, B_TRUE); /* start a new log block */ dev->l2ad_log_ent_idx = 0; dev->l2ad_log_blk_payload_asize = 0; dev->l2ad_log_blk_payload_start = 0; } } /* * Just mark the device as pending for a rebuild. We won't * be starting a rebuild in line here as it would block pool * import. Instead spa_load_impl will hand that off to an * async task which will call l2arc_spa_rebuild_start. */ dev->l2ad_rebuild = B_TRUE; } else if (spa_writeable(spa)) { /* * In this case TRIM the whole device if l2arc_trim_ahead > 0, * otherwise create a new header. We zero out the memory holding * the header to reset dh_start_lbps. If we TRIM the whole * device the new header will be written by * vdev_trim_l2arc_thread() at the end of the TRIM to update the * trim_state in the header too. When reading the header, if * trim_state is not VDEV_TRIM_COMPLETE and l2arc_trim_ahead > 0 * we opt to TRIM the whole device again. */ if (l2arc_trim_ahead > 0) { dev->l2ad_trim_all = B_TRUE; } else { bzero(l2dhdr, l2dhdr_asize); l2arc_dev_hdr_update(dev); } } } /* * Remove a vdev from the L2ARC. */ void l2arc_remove_vdev(vdev_t *vd) { l2arc_dev_t *remdev = NULL; /* * Find the device by vdev */ remdev = l2arc_vdev_get(vd); ASSERT3P(remdev, !=, NULL); /* * Cancel any ongoing or scheduled rebuild. */ mutex_enter(&l2arc_rebuild_thr_lock); if (remdev->l2ad_rebuild_began == B_TRUE) { remdev->l2ad_rebuild_cancel = B_TRUE; while (remdev->l2ad_rebuild == B_TRUE) cv_wait(&l2arc_rebuild_thr_cv, &l2arc_rebuild_thr_lock); } mutex_exit(&l2arc_rebuild_thr_lock); /* * Remove device from global list */ mutex_enter(&l2arc_dev_mtx); list_remove(l2arc_dev_list, remdev); l2arc_dev_last = NULL; /* may have been invalidated */ atomic_dec_64(&l2arc_ndev); mutex_exit(&l2arc_dev_mtx); /* * Clear all buflists and ARC references. L2ARC device flush. */ l2arc_evict(remdev, 0, B_TRUE); list_destroy(&remdev->l2ad_buflist); ASSERT(list_is_empty(&remdev->l2ad_lbptr_list)); list_destroy(&remdev->l2ad_lbptr_list); mutex_destroy(&remdev->l2ad_mtx); zfs_refcount_destroy(&remdev->l2ad_alloc); zfs_refcount_destroy(&remdev->l2ad_lb_asize); zfs_refcount_destroy(&remdev->l2ad_lb_count); kmem_free(remdev->l2ad_dev_hdr, remdev->l2ad_dev_hdr_asize); vmem_free(remdev, sizeof (l2arc_dev_t)); } void l2arc_init(void) { l2arc_thread_exit = 0; l2arc_ndev = 0; l2arc_writes_sent = 0; l2arc_writes_done = 0; mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL); mutex_init(&l2arc_rebuild_thr_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&l2arc_rebuild_thr_cv, NULL, CV_DEFAULT, NULL); mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL); mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL); l2arc_dev_list = &L2ARC_dev_list; l2arc_free_on_write = &L2ARC_free_on_write; list_create(l2arc_dev_list, sizeof (l2arc_dev_t), offsetof(l2arc_dev_t, l2ad_node)); list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t), offsetof(l2arc_data_free_t, l2df_list_node)); } void l2arc_fini(void) { mutex_destroy(&l2arc_feed_thr_lock); cv_destroy(&l2arc_feed_thr_cv); mutex_destroy(&l2arc_rebuild_thr_lock); cv_destroy(&l2arc_rebuild_thr_cv); mutex_destroy(&l2arc_dev_mtx); mutex_destroy(&l2arc_free_on_write_mtx); list_destroy(l2arc_dev_list); list_destroy(l2arc_free_on_write); } void l2arc_start(void) { if (!(spa_mode_global & SPA_MODE_WRITE)) return; (void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0, TS_RUN, defclsyspri); } void l2arc_stop(void) { if (!(spa_mode_global & SPA_MODE_WRITE)) return; mutex_enter(&l2arc_feed_thr_lock); cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */ l2arc_thread_exit = 1; while (l2arc_thread_exit != 0) cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock); mutex_exit(&l2arc_feed_thr_lock); } /* * Punches out rebuild threads for the L2ARC devices in a spa. This should * be called after pool import from the spa async thread, since starting * these threads directly from spa_import() will make them part of the * "zpool import" context and delay process exit (and thus pool import). */ void l2arc_spa_rebuild_start(spa_t *spa) { ASSERT(MUTEX_HELD(&spa_namespace_lock)); /* * Locate the spa's l2arc devices and kick off rebuild threads. */ for (int i = 0; i < spa->spa_l2cache.sav_count; i++) { l2arc_dev_t *dev = l2arc_vdev_get(spa->spa_l2cache.sav_vdevs[i]); if (dev == NULL) { /* Don't attempt a rebuild if the vdev is UNAVAIL */ continue; } mutex_enter(&l2arc_rebuild_thr_lock); if (dev->l2ad_rebuild && !dev->l2ad_rebuild_cancel) { dev->l2ad_rebuild_began = B_TRUE; (void) thread_create(NULL, 0, l2arc_dev_rebuild_thread, dev, 0, &p0, TS_RUN, minclsyspri); } mutex_exit(&l2arc_rebuild_thr_lock); } } /* * Main entry point for L2ARC rebuilding. */ static void l2arc_dev_rebuild_thread(void *arg) { l2arc_dev_t *dev = arg; VERIFY(!dev->l2ad_rebuild_cancel); VERIFY(dev->l2ad_rebuild); (void) l2arc_rebuild(dev); mutex_enter(&l2arc_rebuild_thr_lock); dev->l2ad_rebuild_began = B_FALSE; dev->l2ad_rebuild = B_FALSE; mutex_exit(&l2arc_rebuild_thr_lock); thread_exit(); } /* * This function implements the actual L2ARC metadata rebuild. It: * starts reading the log block chain and restores each block's contents * to memory (reconstructing arc_buf_hdr_t's). * * Operation stops under any of the following conditions: * * 1) We reach the end of the log block chain. * 2) We encounter *any* error condition (cksum errors, io errors) */ static int l2arc_rebuild(l2arc_dev_t *dev) { vdev_t *vd = dev->l2ad_vdev; spa_t *spa = vd->vdev_spa; int err = 0; l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; l2arc_log_blk_phys_t *this_lb, *next_lb; zio_t *this_io = NULL, *next_io = NULL; l2arc_log_blkptr_t lbps[2]; l2arc_lb_ptr_buf_t *lb_ptr_buf; boolean_t lock_held; this_lb = vmem_zalloc(sizeof (*this_lb), KM_SLEEP); next_lb = vmem_zalloc(sizeof (*next_lb), KM_SLEEP); /* * We prevent device removal while issuing reads to the device, * then during the rebuilding phases we drop this lock again so * that a spa_unload or device remove can be initiated - this is * safe, because the spa will signal us to stop before removing * our device and wait for us to stop. */ spa_config_enter(spa, SCL_L2ARC, vd, RW_READER); lock_held = B_TRUE; /* * Retrieve the persistent L2ARC device state. * L2BLK_GET_PSIZE returns aligned size for log blocks. */ dev->l2ad_evict = MAX(l2dhdr->dh_evict, dev->l2ad_start); dev->l2ad_hand = MAX(l2dhdr->dh_start_lbps[0].lbp_daddr + L2BLK_GET_PSIZE((&l2dhdr->dh_start_lbps[0])->lbp_prop), dev->l2ad_start); dev->l2ad_first = !!(l2dhdr->dh_flags & L2ARC_DEV_HDR_EVICT_FIRST); vd->vdev_trim_action_time = l2dhdr->dh_trim_action_time; vd->vdev_trim_state = l2dhdr->dh_trim_state; /* * In case the zfs module parameter l2arc_rebuild_enabled is false * we do not start the rebuild process. */ if (!l2arc_rebuild_enabled) goto out; /* Prepare the rebuild process */ bcopy(l2dhdr->dh_start_lbps, lbps, sizeof (lbps)); /* Start the rebuild process */ for (;;) { if (!l2arc_log_blkptr_valid(dev, &lbps[0])) break; if ((err = l2arc_log_blk_read(dev, &lbps[0], &lbps[1], this_lb, next_lb, this_io, &next_io)) != 0) goto out; /* * Our memory pressure valve. If the system is running low * on memory, rather than swamping memory with new ARC buf * hdrs, we opt not to rebuild the L2ARC. At this point, * however, we have already set up our L2ARC dev to chain in * new metadata log blocks, so the user may choose to offline/ * online the L2ARC dev at a later time (or re-import the pool) * to reconstruct it (when there's less memory pressure). */ if (l2arc_hdr_limit_reached()) { ARCSTAT_BUMP(arcstat_l2_rebuild_abort_lowmem); cmn_err(CE_NOTE, "System running low on memory, " "aborting L2ARC rebuild."); err = SET_ERROR(ENOMEM); goto out; } spa_config_exit(spa, SCL_L2ARC, vd); lock_held = B_FALSE; /* * Now that we know that the next_lb checks out alright, we * can start reconstruction from this log block. * L2BLK_GET_PSIZE returns aligned size for log blocks. */ uint64_t asize = L2BLK_GET_PSIZE((&lbps[0])->lbp_prop); l2arc_log_blk_restore(dev, this_lb, asize, lbps[0].lbp_daddr); /* * log block restored, include its pointer in the list of * pointers to log blocks present in the L2ARC device. */ lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP); lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP); bcopy(&lbps[0], lb_ptr_buf->lb_ptr, sizeof (l2arc_log_blkptr_t)); mutex_enter(&dev->l2ad_mtx); list_insert_tail(&dev->l2ad_lbptr_list, lb_ptr_buf); ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize); ARCSTAT_BUMP(arcstat_l2_log_blk_count); zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf); zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf); mutex_exit(&dev->l2ad_mtx); vdev_space_update(vd, asize, 0, 0); /* * Protection against loops of log blocks: * * l2ad_hand l2ad_evict * V V * l2ad_start |=======================================| l2ad_end * -----|||----|||---|||----||| * (3) (2) (1) (0) * ---|||---|||----|||---||| * (7) (6) (5) (4) * * In this situation the pointer of log block (4) passes * l2arc_log_blkptr_valid() but the log block should not be * restored as it is overwritten by the payload of log block * (0). Only log blocks (0)-(3) should be restored. We check * whether l2ad_evict lies in between the payload starting * offset of the next log block (lbps[1].lbp_payload_start) * and the payload starting offset of the present log block * (lbps[0].lbp_payload_start). If true and this isn't the * first pass, we are looping from the beginning and we should * stop. */ if (l2arc_range_check_overlap(lbps[1].lbp_payload_start, lbps[0].lbp_payload_start, dev->l2ad_evict) && !dev->l2ad_first) goto out; for (;;) { mutex_enter(&l2arc_rebuild_thr_lock); if (dev->l2ad_rebuild_cancel) { dev->l2ad_rebuild = B_FALSE; cv_signal(&l2arc_rebuild_thr_cv); mutex_exit(&l2arc_rebuild_thr_lock); err = SET_ERROR(ECANCELED); goto out; } mutex_exit(&l2arc_rebuild_thr_lock); if (spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER)) { lock_held = B_TRUE; break; } /* * L2ARC config lock held by somebody in writer, * possibly due to them trying to remove us. They'll * likely to want us to shut down, so after a little * delay, we check l2ad_rebuild_cancel and retry * the lock again. */ delay(1); } /* * Continue with the next log block. */ lbps[0] = lbps[1]; lbps[1] = this_lb->lb_prev_lbp; PTR_SWAP(this_lb, next_lb); this_io = next_io; next_io = NULL; } if (this_io != NULL) l2arc_log_blk_fetch_abort(this_io); out: if (next_io != NULL) l2arc_log_blk_fetch_abort(next_io); vmem_free(this_lb, sizeof (*this_lb)); vmem_free(next_lb, sizeof (*next_lb)); if (!l2arc_rebuild_enabled) { spa_history_log_internal(spa, "L2ARC rebuild", NULL, "disabled"); } else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) > 0) { ARCSTAT_BUMP(arcstat_l2_rebuild_success); spa_history_log_internal(spa, "L2ARC rebuild", NULL, "successful, restored %llu blocks", (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count)); } else if (err == 0 && zfs_refcount_count(&dev->l2ad_lb_count) == 0) { /* * No error but also nothing restored, meaning the lbps array * in the device header points to invalid/non-present log * blocks. Reset the header. */ spa_history_log_internal(spa, "L2ARC rebuild", NULL, "no valid log blocks"); bzero(l2dhdr, dev->l2ad_dev_hdr_asize); l2arc_dev_hdr_update(dev); } else if (err == ECANCELED) { /* * In case the rebuild was canceled do not log to spa history * log as the pool may be in the process of being removed. */ zfs_dbgmsg("L2ARC rebuild aborted, restored %llu blocks", zfs_refcount_count(&dev->l2ad_lb_count)); } else if (err != 0) { spa_history_log_internal(spa, "L2ARC rebuild", NULL, "aborted, restored %llu blocks", (u_longlong_t)zfs_refcount_count(&dev->l2ad_lb_count)); } if (lock_held) spa_config_exit(spa, SCL_L2ARC, vd); return (err); } /* * Attempts to read the device header on the provided L2ARC device and writes * it to `hdr'. On success, this function returns 0, otherwise the appropriate * error code is returned. */ static int l2arc_dev_hdr_read(l2arc_dev_t *dev) { int err; uint64_t guid; l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize; abd_t *abd; guid = spa_guid(dev->l2ad_vdev->vdev_spa); abd = abd_get_from_buf(l2dhdr, l2dhdr_asize); err = zio_wait(zio_read_phys(NULL, dev->l2ad_vdev, VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY | ZIO_FLAG_SPECULATIVE, B_FALSE)); abd_put(abd); if (err != 0) { ARCSTAT_BUMP(arcstat_l2_rebuild_abort_dh_errors); zfs_dbgmsg("L2ARC IO error (%d) while reading device header, " "vdev guid: %llu", err, dev->l2ad_vdev->vdev_guid); return (err); } if (l2dhdr->dh_magic == BSWAP_64(L2ARC_DEV_HDR_MAGIC)) byteswap_uint64_array(l2dhdr, sizeof (*l2dhdr)); if (l2dhdr->dh_magic != L2ARC_DEV_HDR_MAGIC || l2dhdr->dh_spa_guid != guid || l2dhdr->dh_vdev_guid != dev->l2ad_vdev->vdev_guid || l2dhdr->dh_version != L2ARC_PERSISTENT_VERSION || l2dhdr->dh_log_entries != dev->l2ad_log_entries || l2dhdr->dh_end != dev->l2ad_end || !l2arc_range_check_overlap(dev->l2ad_start, dev->l2ad_end, l2dhdr->dh_evict) || (l2dhdr->dh_trim_state != VDEV_TRIM_COMPLETE && l2arc_trim_ahead > 0)) { /* * Attempt to rebuild a device containing no actual dev hdr * or containing a header from some other pool or from another * version of persistent L2ARC. */ ARCSTAT_BUMP(arcstat_l2_rebuild_abort_unsupported); return (SET_ERROR(ENOTSUP)); } return (0); } /* * Reads L2ARC log blocks from storage and validates their contents. * * This function implements a simple fetcher to make sure that while * we're processing one buffer the L2ARC is already fetching the next * one in the chain. * * The arguments this_lp and next_lp point to the current and next log block * address in the block chain. Similarly, this_lb and next_lb hold the * l2arc_log_blk_phys_t's of the current and next L2ARC blk. * * The `this_io' and `next_io' arguments are used for block fetching. * When issuing the first blk IO during rebuild, you should pass NULL for * `this_io'. This function will then issue a sync IO to read the block and * also issue an async IO to fetch the next block in the block chain. The * fetched IO is returned in `next_io'. On subsequent calls to this * function, pass the value returned in `next_io' from the previous call * as `this_io' and a fresh `next_io' pointer to hold the next fetch IO. * Prior to the call, you should initialize your `next_io' pointer to be * NULL. If no fetch IO was issued, the pointer is left set at NULL. * * On success, this function returns 0, otherwise it returns an appropriate * error code. On error the fetching IO is aborted and cleared before * returning from this function. Therefore, if we return `success', the * caller can assume that we have taken care of cleanup of fetch IOs. */ static int l2arc_log_blk_read(l2arc_dev_t *dev, const l2arc_log_blkptr_t *this_lbp, const l2arc_log_blkptr_t *next_lbp, l2arc_log_blk_phys_t *this_lb, l2arc_log_blk_phys_t *next_lb, zio_t *this_io, zio_t **next_io) { int err = 0; zio_cksum_t cksum; abd_t *abd = NULL; uint64_t asize; ASSERT(this_lbp != NULL && next_lbp != NULL); ASSERT(this_lb != NULL && next_lb != NULL); ASSERT(next_io != NULL && *next_io == NULL); ASSERT(l2arc_log_blkptr_valid(dev, this_lbp)); /* * Check to see if we have issued the IO for this log block in a * previous run. If not, this is the first call, so issue it now. */ if (this_io == NULL) { this_io = l2arc_log_blk_fetch(dev->l2ad_vdev, this_lbp, this_lb); } /* * Peek to see if we can start issuing the next IO immediately. */ if (l2arc_log_blkptr_valid(dev, next_lbp)) { /* * Start issuing IO for the next log block early - this * should help keep the L2ARC device busy while we * decompress and restore this log block. */ *next_io = l2arc_log_blk_fetch(dev->l2ad_vdev, next_lbp, next_lb); } /* Wait for the IO to read this log block to complete */ if ((err = zio_wait(this_io)) != 0) { ARCSTAT_BUMP(arcstat_l2_rebuild_abort_io_errors); zfs_dbgmsg("L2ARC IO error (%d) while reading log block, " "offset: %llu, vdev guid: %llu", err, this_lbp->lbp_daddr, dev->l2ad_vdev->vdev_guid); goto cleanup; } /* * Make sure the buffer checks out. * L2BLK_GET_PSIZE returns aligned size for log blocks. */ asize = L2BLK_GET_PSIZE((this_lbp)->lbp_prop); fletcher_4_native(this_lb, asize, NULL, &cksum); if (!ZIO_CHECKSUM_EQUAL(cksum, this_lbp->lbp_cksum)) { ARCSTAT_BUMP(arcstat_l2_rebuild_abort_cksum_lb_errors); zfs_dbgmsg("L2ARC log block cksum failed, offset: %llu, " "vdev guid: %llu, l2ad_hand: %llu, l2ad_evict: %llu", this_lbp->lbp_daddr, dev->l2ad_vdev->vdev_guid, dev->l2ad_hand, dev->l2ad_evict); err = SET_ERROR(ECKSUM); goto cleanup; } /* Now we can take our time decoding this buffer */ switch (L2BLK_GET_COMPRESS((this_lbp)->lbp_prop)) { case ZIO_COMPRESS_OFF: break; case ZIO_COMPRESS_LZ4: abd = abd_alloc_for_io(asize, B_TRUE); abd_copy_from_buf_off(abd, this_lb, 0, asize); if ((err = zio_decompress_data( L2BLK_GET_COMPRESS((this_lbp)->lbp_prop), abd, this_lb, asize, sizeof (*this_lb), NULL)) != 0) { err = SET_ERROR(EINVAL); goto cleanup; } break; default: err = SET_ERROR(EINVAL); goto cleanup; } if (this_lb->lb_magic == BSWAP_64(L2ARC_LOG_BLK_MAGIC)) byteswap_uint64_array(this_lb, sizeof (*this_lb)); if (this_lb->lb_magic != L2ARC_LOG_BLK_MAGIC) { err = SET_ERROR(EINVAL); goto cleanup; } cleanup: /* Abort an in-flight fetch I/O in case of error */ if (err != 0 && *next_io != NULL) { l2arc_log_blk_fetch_abort(*next_io); *next_io = NULL; } if (abd != NULL) abd_free(abd); return (err); } /* * Restores the payload of a log block to ARC. This creates empty ARC hdr * entries which only contain an l2arc hdr, essentially restoring the * buffers to their L2ARC evicted state. This function also updates space * usage on the L2ARC vdev to make sure it tracks restored buffers. */ static void l2arc_log_blk_restore(l2arc_dev_t *dev, const l2arc_log_blk_phys_t *lb, uint64_t lb_asize, uint64_t lb_daddr) { uint64_t size = 0, asize = 0; uint64_t log_entries = dev->l2ad_log_entries; /* * Usually arc_adapt() is called only for data, not headers, but * since we may allocate significant amount of memory here, let ARC * grow its arc_c. */ arc_adapt(log_entries * HDR_L2ONLY_SIZE, arc_l2c_only); for (int i = log_entries - 1; i >= 0; i--) { /* * Restore goes in the reverse temporal direction to preserve * correct temporal ordering of buffers in the l2ad_buflist. * l2arc_hdr_restore also does a list_insert_tail instead of * list_insert_head on the l2ad_buflist: * * LIST l2ad_buflist LIST * HEAD <------ (time) ------ TAIL * direction +-----+-----+-----+-----+-----+ direction * of l2arc <== | buf | buf | buf | buf | buf | ===> of rebuild * fill +-----+-----+-----+-----+-----+ * ^ ^ * | | * | | * l2arc_feed_thread l2arc_rebuild * will place new bufs here restores bufs here * * During l2arc_rebuild() the device is not used by * l2arc_feed_thread() as dev->l2ad_rebuild is set to true. */ size += L2BLK_GET_LSIZE((&lb->lb_entries[i])->le_prop); asize += vdev_psize_to_asize(dev->l2ad_vdev, L2BLK_GET_PSIZE((&lb->lb_entries[i])->le_prop)); l2arc_hdr_restore(&lb->lb_entries[i], dev); } /* * Record rebuild stats: * size Logical size of restored buffers in the L2ARC * asize Aligned size of restored buffers in the L2ARC */ ARCSTAT_INCR(arcstat_l2_rebuild_size, size); ARCSTAT_INCR(arcstat_l2_rebuild_asize, asize); ARCSTAT_INCR(arcstat_l2_rebuild_bufs, log_entries); ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, lb_asize); ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, asize / lb_asize); ARCSTAT_BUMP(arcstat_l2_rebuild_log_blks); } /* * Restores a single ARC buf hdr from a log entry. The ARC buffer is put * into a state indicating that it has been evicted to L2ARC. */ static void l2arc_hdr_restore(const l2arc_log_ent_phys_t *le, l2arc_dev_t *dev) { arc_buf_hdr_t *hdr, *exists; kmutex_t *hash_lock; arc_buf_contents_t type = L2BLK_GET_TYPE((le)->le_prop); uint64_t asize; /* * Do all the allocation before grabbing any locks, this lets us * sleep if memory is full and we don't have to deal with failed * allocations. */ hdr = arc_buf_alloc_l2only(L2BLK_GET_LSIZE((le)->le_prop), type, dev, le->le_dva, le->le_daddr, L2BLK_GET_PSIZE((le)->le_prop), le->le_birth, L2BLK_GET_COMPRESS((le)->le_prop), le->le_complevel, L2BLK_GET_PROTECTED((le)->le_prop), L2BLK_GET_PREFETCH((le)->le_prop)); asize = vdev_psize_to_asize(dev->l2ad_vdev, L2BLK_GET_PSIZE((le)->le_prop)); /* * vdev_space_update() has to be called before arc_hdr_destroy() to * avoid underflow since the latter also calls the former. */ vdev_space_update(dev->l2ad_vdev, asize, 0, 0); ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(hdr)); ARCSTAT_INCR(arcstat_l2_psize, HDR_GET_PSIZE(hdr)); mutex_enter(&dev->l2ad_mtx); list_insert_tail(&dev->l2ad_buflist, hdr); (void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(hdr), hdr); mutex_exit(&dev->l2ad_mtx); exists = buf_hash_insert(hdr, &hash_lock); if (exists) { /* Buffer was already cached, no need to restore it. */ arc_hdr_destroy(hdr); /* * If the buffer is already cached, check whether it has * L2ARC metadata. If not, enter them and update the flag. * This is important is case of onlining a cache device, since * we previously evicted all L2ARC metadata from ARC. */ if (!HDR_HAS_L2HDR(exists)) { arc_hdr_set_flags(exists, ARC_FLAG_HAS_L2HDR); exists->b_l2hdr.b_dev = dev; exists->b_l2hdr.b_daddr = le->le_daddr; mutex_enter(&dev->l2ad_mtx); list_insert_tail(&dev->l2ad_buflist, exists); (void) zfs_refcount_add_many(&dev->l2ad_alloc, arc_hdr_size(exists), exists); mutex_exit(&dev->l2ad_mtx); vdev_space_update(dev->l2ad_vdev, asize, 0, 0); ARCSTAT_INCR(arcstat_l2_lsize, HDR_GET_LSIZE(exists)); ARCSTAT_INCR(arcstat_l2_psize, HDR_GET_PSIZE(exists)); } ARCSTAT_BUMP(arcstat_l2_rebuild_bufs_precached); } mutex_exit(hash_lock); } /* * Starts an asynchronous read IO to read a log block. This is used in log * block reconstruction to start reading the next block before we are done * decoding and reconstructing the current block, to keep the l2arc device * nice and hot with read IO to process. * The returned zio will contain a newly allocated memory buffers for the IO * data which should then be freed by the caller once the zio is no longer * needed (i.e. due to it having completed). If you wish to abort this * zio, you should do so using l2arc_log_blk_fetch_abort, which takes * care of disposing of the allocated buffers correctly. */ static zio_t * l2arc_log_blk_fetch(vdev_t *vd, const l2arc_log_blkptr_t *lbp, l2arc_log_blk_phys_t *lb) { uint32_t asize; zio_t *pio; l2arc_read_callback_t *cb; /* L2BLK_GET_PSIZE returns aligned size for log blocks */ asize = L2BLK_GET_PSIZE((lbp)->lbp_prop); ASSERT(asize <= sizeof (l2arc_log_blk_phys_t)); cb = kmem_zalloc(sizeof (l2arc_read_callback_t), KM_SLEEP); cb->l2rcb_abd = abd_get_from_buf(lb, asize); pio = zio_root(vd->vdev_spa, l2arc_blk_fetch_done, cb, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY); (void) zio_nowait(zio_read_phys(pio, vd, lbp->lbp_daddr, asize, cb->l2rcb_abd, ZIO_CHECKSUM_OFF, NULL, NULL, ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY, B_FALSE)); return (pio); } /* * Aborts a zio returned from l2arc_log_blk_fetch and frees the data * buffers allocated for it. */ static void l2arc_log_blk_fetch_abort(zio_t *zio) { (void) zio_wait(zio); } /* * Creates a zio to update the device header on an l2arc device. */ void l2arc_dev_hdr_update(l2arc_dev_t *dev) { l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; const uint64_t l2dhdr_asize = dev->l2ad_dev_hdr_asize; abd_t *abd; int err; VERIFY(spa_config_held(dev->l2ad_spa, SCL_STATE_ALL, RW_READER)); l2dhdr->dh_magic = L2ARC_DEV_HDR_MAGIC; l2dhdr->dh_version = L2ARC_PERSISTENT_VERSION; l2dhdr->dh_spa_guid = spa_guid(dev->l2ad_vdev->vdev_spa); l2dhdr->dh_vdev_guid = dev->l2ad_vdev->vdev_guid; l2dhdr->dh_log_entries = dev->l2ad_log_entries; l2dhdr->dh_evict = dev->l2ad_evict; l2dhdr->dh_start = dev->l2ad_start; l2dhdr->dh_end = dev->l2ad_end; l2dhdr->dh_lb_asize = zfs_refcount_count(&dev->l2ad_lb_asize); l2dhdr->dh_lb_count = zfs_refcount_count(&dev->l2ad_lb_count); l2dhdr->dh_flags = 0; l2dhdr->dh_trim_action_time = dev->l2ad_vdev->vdev_trim_action_time; l2dhdr->dh_trim_state = dev->l2ad_vdev->vdev_trim_state; if (dev->l2ad_first) l2dhdr->dh_flags |= L2ARC_DEV_HDR_EVICT_FIRST; abd = abd_get_from_buf(l2dhdr, l2dhdr_asize); err = zio_wait(zio_write_phys(NULL, dev->l2ad_vdev, VDEV_LABEL_START_SIZE, l2dhdr_asize, abd, ZIO_CHECKSUM_LABEL, NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE)); abd_put(abd); if (err != 0) { zfs_dbgmsg("L2ARC IO error (%d) while writing device header, " "vdev guid: %llu", err, dev->l2ad_vdev->vdev_guid); } } /* * Commits a log block to the L2ARC device. This routine is invoked from * l2arc_write_buffers when the log block fills up. * This function allocates some memory to temporarily hold the serialized * buffer to be written. This is then released in l2arc_write_done. */ static void l2arc_log_blk_commit(l2arc_dev_t *dev, zio_t *pio, l2arc_write_callback_t *cb) { l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk; l2arc_dev_hdr_phys_t *l2dhdr = dev->l2ad_dev_hdr; uint64_t psize, asize; zio_t *wzio; l2arc_lb_abd_buf_t *abd_buf; uint8_t *tmpbuf; l2arc_lb_ptr_buf_t *lb_ptr_buf; VERIFY3S(dev->l2ad_log_ent_idx, ==, dev->l2ad_log_entries); tmpbuf = zio_buf_alloc(sizeof (*lb)); abd_buf = zio_buf_alloc(sizeof (*abd_buf)); abd_buf->abd = abd_get_from_buf(lb, sizeof (*lb)); lb_ptr_buf = kmem_zalloc(sizeof (l2arc_lb_ptr_buf_t), KM_SLEEP); lb_ptr_buf->lb_ptr = kmem_zalloc(sizeof (l2arc_log_blkptr_t), KM_SLEEP); /* link the buffer into the block chain */ lb->lb_prev_lbp = l2dhdr->dh_start_lbps[1]; lb->lb_magic = L2ARC_LOG_BLK_MAGIC; /* * l2arc_log_blk_commit() may be called multiple times during a single * l2arc_write_buffers() call. Save the allocated abd buffers in a list * so we can free them in l2arc_write_done() later on. */ list_insert_tail(&cb->l2wcb_abd_list, abd_buf); /* try to compress the buffer */ psize = zio_compress_data(ZIO_COMPRESS_LZ4, abd_buf->abd, tmpbuf, sizeof (*lb), 0); /* a log block is never entirely zero */ ASSERT(psize != 0); asize = vdev_psize_to_asize(dev->l2ad_vdev, psize); ASSERT(asize <= sizeof (*lb)); /* * Update the start log block pointer in the device header to point * to the log block we're about to write. */ l2dhdr->dh_start_lbps[1] = l2dhdr->dh_start_lbps[0]; l2dhdr->dh_start_lbps[0].lbp_daddr = dev->l2ad_hand; l2dhdr->dh_start_lbps[0].lbp_payload_asize = dev->l2ad_log_blk_payload_asize; l2dhdr->dh_start_lbps[0].lbp_payload_start = dev->l2ad_log_blk_payload_start; _NOTE(CONSTCOND) L2BLK_SET_LSIZE( (&l2dhdr->dh_start_lbps[0])->lbp_prop, sizeof (*lb)); L2BLK_SET_PSIZE( (&l2dhdr->dh_start_lbps[0])->lbp_prop, asize); L2BLK_SET_CHECKSUM( (&l2dhdr->dh_start_lbps[0])->lbp_prop, ZIO_CHECKSUM_FLETCHER_4); if (asize < sizeof (*lb)) { /* compression succeeded */ bzero(tmpbuf + psize, asize - psize); L2BLK_SET_COMPRESS( (&l2dhdr->dh_start_lbps[0])->lbp_prop, ZIO_COMPRESS_LZ4); } else { /* compression failed */ bcopy(lb, tmpbuf, sizeof (*lb)); L2BLK_SET_COMPRESS( (&l2dhdr->dh_start_lbps[0])->lbp_prop, ZIO_COMPRESS_OFF); } /* checksum what we're about to write */ fletcher_4_native(tmpbuf, asize, NULL, &l2dhdr->dh_start_lbps[0].lbp_cksum); abd_put(abd_buf->abd); /* perform the write itself */ abd_buf->abd = abd_get_from_buf(tmpbuf, sizeof (*lb)); abd_take_ownership_of_buf(abd_buf->abd, B_TRUE); wzio = zio_write_phys(pio, dev->l2ad_vdev, dev->l2ad_hand, asize, abd_buf->abd, ZIO_CHECKSUM_OFF, NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE, ZIO_FLAG_CANFAIL, B_FALSE); DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev, zio_t *, wzio); (void) zio_nowait(wzio); dev->l2ad_hand += asize; /* * Include the committed log block's pointer in the list of pointers * to log blocks present in the L2ARC device. */ bcopy(&l2dhdr->dh_start_lbps[0], lb_ptr_buf->lb_ptr, sizeof (l2arc_log_blkptr_t)); mutex_enter(&dev->l2ad_mtx); list_insert_head(&dev->l2ad_lbptr_list, lb_ptr_buf); ARCSTAT_INCR(arcstat_l2_log_blk_asize, asize); ARCSTAT_BUMP(arcstat_l2_log_blk_count); zfs_refcount_add_many(&dev->l2ad_lb_asize, asize, lb_ptr_buf); zfs_refcount_add(&dev->l2ad_lb_count, lb_ptr_buf); mutex_exit(&dev->l2ad_mtx); vdev_space_update(dev->l2ad_vdev, asize, 0, 0); /* bump the kstats */ ARCSTAT_INCR(arcstat_l2_write_bytes, asize); ARCSTAT_BUMP(arcstat_l2_log_blk_writes); ARCSTAT_F_AVG(arcstat_l2_log_blk_avg_asize, asize); ARCSTAT_F_AVG(arcstat_l2_data_to_meta_ratio, dev->l2ad_log_blk_payload_asize / asize); /* start a new log block */ dev->l2ad_log_ent_idx = 0; dev->l2ad_log_blk_payload_asize = 0; dev->l2ad_log_blk_payload_start = 0; } /* * Validates an L2ARC log block address to make sure that it can be read * from the provided L2ARC device. */ boolean_t l2arc_log_blkptr_valid(l2arc_dev_t *dev, const l2arc_log_blkptr_t *lbp) { /* L2BLK_GET_PSIZE returns aligned size for log blocks */ uint64_t asize = L2BLK_GET_PSIZE((lbp)->lbp_prop); uint64_t end = lbp->lbp_daddr + asize - 1; uint64_t start = lbp->lbp_payload_start; boolean_t evicted = B_FALSE; /* * A log block is valid if all of the following conditions are true: * - it fits entirely (including its payload) between l2ad_start and * l2ad_end * - it has a valid size * - neither the log block itself nor part of its payload was evicted * by l2arc_evict(): * * l2ad_hand l2ad_evict * | | lbp_daddr * | start | | end * | | | | | * V V V V V * l2ad_start ============================================ l2ad_end * --------------------------|||| * ^ ^ * | log block * payload */ evicted = l2arc_range_check_overlap(start, end, dev->l2ad_hand) || l2arc_range_check_overlap(start, end, dev->l2ad_evict) || l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, start) || l2arc_range_check_overlap(dev->l2ad_hand, dev->l2ad_evict, end); return (start >= dev->l2ad_start && end <= dev->l2ad_end && asize > 0 && asize <= sizeof (l2arc_log_blk_phys_t) && (!evicted || dev->l2ad_first)); } /* * Inserts ARC buffer header `hdr' into the current L2ARC log block on * the device. The buffer being inserted must be present in L2ARC. * Returns B_TRUE if the L2ARC log block is full and needs to be committed * to L2ARC, or B_FALSE if it still has room for more ARC buffers. */ static boolean_t l2arc_log_blk_insert(l2arc_dev_t *dev, const arc_buf_hdr_t *hdr) { l2arc_log_blk_phys_t *lb = &dev->l2ad_log_blk; l2arc_log_ent_phys_t *le; if (dev->l2ad_log_entries == 0) return (B_FALSE); int index = dev->l2ad_log_ent_idx++; ASSERT3S(index, <, dev->l2ad_log_entries); ASSERT(HDR_HAS_L2HDR(hdr)); le = &lb->lb_entries[index]; bzero(le, sizeof (*le)); le->le_dva = hdr->b_dva; le->le_birth = hdr->b_birth; le->le_daddr = hdr->b_l2hdr.b_daddr; if (index == 0) dev->l2ad_log_blk_payload_start = le->le_daddr; L2BLK_SET_LSIZE((le)->le_prop, HDR_GET_LSIZE(hdr)); L2BLK_SET_PSIZE((le)->le_prop, HDR_GET_PSIZE(hdr)); L2BLK_SET_COMPRESS((le)->le_prop, HDR_GET_COMPRESS(hdr)); le->le_complevel = hdr->b_complevel; L2BLK_SET_TYPE((le)->le_prop, hdr->b_type); L2BLK_SET_PROTECTED((le)->le_prop, !!(HDR_PROTECTED(hdr))); L2BLK_SET_PREFETCH((le)->le_prop, !!(HDR_PREFETCH(hdr))); dev->l2ad_log_blk_payload_asize += vdev_psize_to_asize(dev->l2ad_vdev, HDR_GET_PSIZE(hdr)); return (dev->l2ad_log_ent_idx == dev->l2ad_log_entries); } /* * Checks whether a given L2ARC device address sits in a time-sequential * range. The trick here is that the L2ARC is a rotary buffer, so we can't * just do a range comparison, we need to handle the situation in which the * range wraps around the end of the L2ARC device. Arguments: * bottom -- Lower end of the range to check (written to earlier). * top -- Upper end of the range to check (written to later). * check -- The address for which we want to determine if it sits in * between the top and bottom. * * The 3-way conditional below represents the following cases: * * bottom < top : Sequentially ordered case: * --------+-------------------+ * | (overlap here?) | * L2ARC dev V V * |---------------============--------------| * * bottom > top: Looped-around case: * --------+------------------+ * | (overlap here?) | * L2ARC dev V V * |===============---------------===========| * ^ ^ * | (or here?) | * +---------------+--------- * * top == bottom : Just a single address comparison. */ boolean_t l2arc_range_check_overlap(uint64_t bottom, uint64_t top, uint64_t check) { if (bottom < top) return (bottom <= check && check <= top); else if (bottom > top) return (check <= top || bottom <= check); else return (check == top); } EXPORT_SYMBOL(arc_buf_size); EXPORT_SYMBOL(arc_write); EXPORT_SYMBOL(arc_read); EXPORT_SYMBOL(arc_buf_info); EXPORT_SYMBOL(arc_getbuf_func); EXPORT_SYMBOL(arc_add_prune_callback); EXPORT_SYMBOL(arc_remove_prune_callback); /* BEGIN CSTYLED */ ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min, param_set_arc_long, param_get_long, ZMOD_RW, "Min arc size"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, max, param_set_arc_long, param_get_long, ZMOD_RW, "Max arc size"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, meta_limit, param_set_arc_long, param_get_long, ZMOD_RW, "Metadata limit for arc size"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, meta_limit_percent, param_set_arc_long, param_get_long, ZMOD_RW, "Percent of arc size for arc meta limit"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, meta_min, param_set_arc_long, param_get_long, ZMOD_RW, "Min arc metadata"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_prune, INT, ZMOD_RW, "Meta objects to scan for prune"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_adjust_restarts, INT, ZMOD_RW, "Limit number of restarts in arc_evict_meta"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, meta_strategy, INT, ZMOD_RW, "Meta reclaim strategy"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, grow_retry, param_set_arc_int, param_get_int, ZMOD_RW, "Seconds before growing arc size"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, p_dampener_disable, INT, ZMOD_RW, "Disable arc_p adapt dampener"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, shrink_shift, param_set_arc_int, param_get_int, ZMOD_RW, "log2(fraction of arc to reclaim)"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, pc_percent, UINT, ZMOD_RW, "Percent of pagecache to reclaim arc to"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, p_min_shift, param_set_arc_int, param_get_int, ZMOD_RW, "arc_c shift to calc min/max arc_p"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, average_blocksize, INT, ZMOD_RD, "Target average block size"); ZFS_MODULE_PARAM(zfs, zfs_, compressed_arc_enabled, INT, ZMOD_RW, "Disable compressed arc buffers"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prefetch_ms, param_set_arc_int, param_get_int, ZMOD_RW, "Min life of prefetch block in ms"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, min_prescient_prefetch_ms, param_set_arc_int, param_get_int, ZMOD_RW, "Min life of prescient prefetched block in ms"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_max, ULONG, ZMOD_RW, "Max write bytes per interval"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, write_boost, ULONG, ZMOD_RW, "Extra write bytes during device warmup"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom, ULONG, ZMOD_RW, "Number of max device writes to precache"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, headroom_boost, ULONG, ZMOD_RW, "Compressed l2arc_headroom multiplier"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, trim_ahead, ULONG, ZMOD_RW, "TRIM ahead L2ARC write size multiplier"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_secs, ULONG, ZMOD_RW, "Seconds between L2ARC writing"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_min_ms, ULONG, ZMOD_RW, "Min feed interval in milliseconds"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, noprefetch, INT, ZMOD_RW, "Skip caching prefetched buffers"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, feed_again, INT, ZMOD_RW, "Turbo L2ARC warmup"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, norw, INT, ZMOD_RW, "No reads during writes"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, meta_percent, INT, ZMOD_RW, "Percent of ARC size allowed for L2ARC-only headers"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_enabled, INT, ZMOD_RW, "Rebuild the L2ARC when importing a pool"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, rebuild_blocks_min_l2size, ULONG, ZMOD_RW, "Min size in bytes to write rebuild log blocks in L2ARC"); ZFS_MODULE_PARAM(zfs_l2arc, l2arc_, mfuonly, INT, ZMOD_RW, "Cache only MFU data from ARC into L2ARC"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, lotsfree_percent, param_set_arc_int, param_get_int, ZMOD_RW, "System free memory I/O throttle in bytes"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, sys_free, param_set_arc_long, param_get_long, ZMOD_RW, "System free memory target size in bytes"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit, param_set_arc_long, param_get_long, ZMOD_RW, "Minimum bytes of dnodes in arc"); ZFS_MODULE_PARAM_CALL(zfs_arc, zfs_arc_, dnode_limit_percent, param_set_arc_long, param_get_long, ZMOD_RW, "Percent of ARC meta buffers for dnodes"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, dnode_reduce_percent, ULONG, ZMOD_RW, "Percentage of excess dnodes to try to unpin"); ZFS_MODULE_PARAM(zfs_arc, zfs_arc_, eviction_pct, INT, ZMOD_RW, "When full, ARC allocation waits for eviction of this % of alloc size"); /* END CSTYLED */ Index: vendor-sys/openzfs/dist/module/zfs/dbuf_stats.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/dbuf_stats.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/dbuf_stats.c (revision 366348) @@ -1,231 +1,232 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ #include #include #include /* * Calculate the index of the arc header for the state, disabled by default. */ int zfs_dbuf_state_index = 0; /* * ========================================================================== * Dbuf Hash Read Routines * ========================================================================== */ typedef struct dbuf_stats_t { kmutex_t lock; kstat_t *kstat; dbuf_hash_table_t *hash; int idx; } dbuf_stats_t; static dbuf_stats_t dbuf_stats_hash_table; static int dbuf_stats_hash_table_headers(char *buf, size_t size) { (void) snprintf(buf, size, "%-96s | %-119s | %s\n" "%-16s %-8s %-8s %-8s %-8s %-10s %-8s %-5s %-5s %-7s %3s | " "%-5s %-5s %-9s %-6s %-8s %-12s " "%-6s %-6s %-6s %-6s %-6s %-8s %-8s %-8s %-6s | " "%-6s %-6s %-8s %-8s %-6s %-6s %-6s %-8s %-8s\n", "dbuf", "arcbuf", "dnode", "pool", "objset", "object", "level", "blkid", "offset", "dbsize", "meta", "state", "dbholds", "dbc", "list", "atype", "flags", "count", "asize", "access", "mru", "gmru", "mfu", "gmfu", "l2", "l2_dattr", "l2_asize", "l2_comp", "aholds", "dtype", "btype", "data_bs", "meta_bs", "bsize", "lvls", "dholds", "blocks", "dsize"); return (0); } static int __dbuf_stats_hash_table_data(char *buf, size_t size, dmu_buf_impl_t *db) { arc_buf_info_t abi = { 0 }; dmu_object_info_t doi = { 0 }; dnode_t *dn = DB_DNODE(db); size_t nwritten; if (db->db_buf) arc_buf_info(db->db_buf, &abi, zfs_dbuf_state_index); __dmu_object_info_from_dnode(dn, &doi); nwritten = snprintf(buf, size, "%-16s %-8llu %-8lld %-8lld %-8lld %-10llu %-8llu %-5d %-5d " "%-7lu %-3d | %-5d %-5d 0x%-7x %-6lu %-8llu %-12llu " "%-6lu %-6lu %-6lu %-6lu %-6lu %-8llu %-8llu %-8d %-6lu | " "%-6d %-6d %-8lu %-8lu %-6llu %-6lu %-6lu %-8llu %-8llu\n", /* dmu_buf_impl_t */ spa_name(dn->dn_objset->os_spa), (u_longlong_t)dmu_objset_id(db->db_objset), (longlong_t)db->db.db_object, (longlong_t)db->db_level, (longlong_t)db->db_blkid, (u_longlong_t)db->db.db_offset, (u_longlong_t)db->db.db_size, !!dbuf_is_metadata(db), db->db_state, (ulong_t)zfs_refcount_count(&db->db_holds), multilist_link_active(&db->db_cache_link), /* arc_buf_info_t */ abi.abi_state_type, abi.abi_state_contents, abi.abi_flags, (ulong_t)abi.abi_bufcnt, (u_longlong_t)abi.abi_size, (u_longlong_t)abi.abi_access, (ulong_t)abi.abi_mru_hits, (ulong_t)abi.abi_mru_ghost_hits, (ulong_t)abi.abi_mfu_hits, (ulong_t)abi.abi_mfu_ghost_hits, (ulong_t)abi.abi_l2arc_hits, (u_longlong_t)abi.abi_l2arc_dattr, (u_longlong_t)abi.abi_l2arc_asize, abi.abi_l2arc_compress, (ulong_t)abi.abi_holds, /* dmu_object_info_t */ doi.doi_type, doi.doi_bonus_type, (ulong_t)doi.doi_data_block_size, (ulong_t)doi.doi_metadata_block_size, (u_longlong_t)doi.doi_bonus_size, (ulong_t)doi.doi_indirection, (ulong_t)zfs_refcount_count(&dn->dn_holds), (u_longlong_t)doi.doi_fill_count, (u_longlong_t)doi.doi_max_offset); if (nwritten >= size) return (size); return (nwritten + 1); } static int dbuf_stats_hash_table_data(char *buf, size_t size, void *data) { dbuf_stats_t *dsh = (dbuf_stats_t *)data; dbuf_hash_table_t *h = dsh->hash; dmu_buf_impl_t *db; int length, error = 0; ASSERT3S(dsh->idx, >=, 0); ASSERT3S(dsh->idx, <=, h->hash_table_mask); - memset(buf, 0, size); + if (size) + buf[0] = 0; mutex_enter(DBUF_HASH_MUTEX(h, dsh->idx)); for (db = h->hash_table[dsh->idx]; db != NULL; db = db->db_hash_next) { /* * Returning ENOMEM will cause the data and header functions * to be called with a larger scratch buffers. */ if (size < 512) { error = SET_ERROR(ENOMEM); break; } mutex_enter(&db->db_mtx); if (db->db_state != DB_EVICTING) { length = __dbuf_stats_hash_table_data(buf, size, db); buf += length; size -= length; } mutex_exit(&db->db_mtx); } mutex_exit(DBUF_HASH_MUTEX(h, dsh->idx)); return (error); } static void * dbuf_stats_hash_table_addr(kstat_t *ksp, loff_t n) { dbuf_stats_t *dsh = ksp->ks_private; ASSERT(MUTEX_HELD(&dsh->lock)); if (n <= dsh->hash->hash_table_mask) { dsh->idx = n; return (dsh); } return (NULL); } static void dbuf_stats_hash_table_init(dbuf_hash_table_t *hash) { dbuf_stats_t *dsh = &dbuf_stats_hash_table; kstat_t *ksp; mutex_init(&dsh->lock, NULL, MUTEX_DEFAULT, NULL); dsh->hash = hash; ksp = kstat_create("zfs", 0, "dbufs", "misc", KSTAT_TYPE_RAW, 0, KSTAT_FLAG_VIRTUAL); dsh->kstat = ksp; if (ksp) { ksp->ks_lock = &dsh->lock; ksp->ks_ndata = UINT32_MAX; ksp->ks_private = dsh; kstat_set_raw_ops(ksp, dbuf_stats_hash_table_headers, dbuf_stats_hash_table_data, dbuf_stats_hash_table_addr); kstat_install(ksp); } } static void dbuf_stats_hash_table_destroy(void) { dbuf_stats_t *dsh = &dbuf_stats_hash_table; kstat_t *ksp; ksp = dsh->kstat; if (ksp) kstat_delete(ksp); mutex_destroy(&dsh->lock); } void dbuf_stats_init(dbuf_hash_table_t *hash) { dbuf_stats_hash_table_init(hash); } void dbuf_stats_destroy(void) { dbuf_stats_hash_table_destroy(); } /* BEGIN CSTYLED */ ZFS_MODULE_PARAM(zfs, zfs_, dbuf_state_index, INT, ZMOD_RW, "Calculate arc header index"); /* END CSTYLED */ Index: vendor-sys/openzfs/dist/module/zfs/dmu_send.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/dmu_send.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/dmu_send.c (revision 366348) @@ -1,3091 +1,3092 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright 2011 Nexenta Systems, Inc. All rights reserved. * Copyright (c) 2011, 2018 by Delphix. All rights reserved. * Copyright (c) 2014, Joyent, Inc. All rights reserved. * Copyright 2014 HybridCluster. All rights reserved. * Copyright 2016 RackTop Systems. * Copyright (c) 2016 Actifio, Inc. All rights reserved. * Copyright (c) 2019, Klara Inc. * Copyright (c) 2019, Allan Jude */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #ifdef _KERNEL #include #endif /* Set this tunable to TRUE to replace corrupt data with 0x2f5baddb10c */ int zfs_send_corrupt_data = B_FALSE; /* * This tunable controls the amount of data (measured in bytes) that will be * prefetched by zfs send. If the main thread is blocking on reads that haven't * completed, this variable might need to be increased. If instead the main * thread is issuing new reads because the prefetches have fallen out of the * cache, this may need to be decreased. */ int zfs_send_queue_length = SPA_MAXBLOCKSIZE; /* * This tunable controls the length of the queues that zfs send worker threads * use to communicate. If the send_main_thread is blocking on these queues, * this variable may need to be increased. If there is a significant slowdown * at the start of a send as these threads consume all the available IO * resources, this variable may need to be decreased. */ int zfs_send_no_prefetch_queue_length = 1024 * 1024; /* * These tunables control the fill fraction of the queues by zfs send. The fill * fraction controls the frequency with which threads have to be cv_signaled. * If a lot of cpu time is being spent on cv_signal, then these should be tuned * down. If the queues empty before the signalled thread can catch up, then * these should be tuned up. */ int zfs_send_queue_ff = 20; int zfs_send_no_prefetch_queue_ff = 20; /* * Use this to override the recordsize calculation for fast zfs send estimates. */ int zfs_override_estimate_recordsize = 0; /* Set this tunable to FALSE to disable setting of DRR_FLAG_FREERECORDS */ int zfs_send_set_freerecords_bit = B_TRUE; /* Set this tunable to FALSE is disable sending unmodified spill blocks. */ int zfs_send_unmodified_spill_blocks = B_TRUE; static inline boolean_t overflow_multiply(uint64_t a, uint64_t b, uint64_t *c) { uint64_t temp = a * b; if (b != 0 && temp / b != a) return (B_FALSE); *c = temp; return (B_TRUE); } struct send_thread_arg { bqueue_t q; objset_t *os; /* Objset to traverse */ uint64_t fromtxg; /* Traverse from this txg */ int flags; /* flags to pass to traverse_dataset */ int error_code; boolean_t cancel; zbookmark_phys_t resume; uint64_t *num_blocks_visited; }; struct redact_list_thread_arg { boolean_t cancel; bqueue_t q; zbookmark_phys_t resume; redaction_list_t *rl; boolean_t mark_redact; int error_code; uint64_t *num_blocks_visited; }; struct send_merge_thread_arg { bqueue_t q; objset_t *os; struct redact_list_thread_arg *from_arg; struct send_thread_arg *to_arg; struct redact_list_thread_arg *redact_arg; int error; boolean_t cancel; }; struct send_range { boolean_t eos_marker; /* Marks the end of the stream */ uint64_t object; uint64_t start_blkid; uint64_t end_blkid; bqueue_node_t ln; enum type {DATA, HOLE, OBJECT, OBJECT_RANGE, REDACT, PREVIOUSLY_REDACTED} type; union { struct srd { dmu_object_type_t obj_type; uint32_t datablksz; // logical size uint32_t datasz; // payload size blkptr_t bp; arc_buf_t *abuf; abd_t *abd; kmutex_t lock; kcondvar_t cv; boolean_t io_outstanding; int io_err; } data; struct srh { uint32_t datablksz; } hole; struct sro { /* * This is a pointer because embedding it in the * struct causes these structures to be massively larger * for all range types; this makes the code much less * memory efficient. */ dnode_phys_t *dnp; blkptr_t bp; } object; struct srr { uint32_t datablksz; } redact; struct sror { blkptr_t bp; } object_range; } sru; }; /* * The list of data whose inclusion in a send stream can be pending from * one call to backup_cb to another. Multiple calls to dump_free(), * dump_freeobjects(), and dump_redact() can be aggregated into a single * DRR_FREE, DRR_FREEOBJECTS, or DRR_REDACT replay record. */ typedef enum { PENDING_NONE, PENDING_FREE, PENDING_FREEOBJECTS, PENDING_REDACT } dmu_pendop_t; typedef struct dmu_send_cookie { dmu_replay_record_t *dsc_drr; dmu_send_outparams_t *dsc_dso; offset_t *dsc_off; objset_t *dsc_os; zio_cksum_t dsc_zc; uint64_t dsc_toguid; uint64_t dsc_fromtxg; int dsc_err; dmu_pendop_t dsc_pending_op; uint64_t dsc_featureflags; uint64_t dsc_last_data_object; uint64_t dsc_last_data_offset; uint64_t dsc_resume_object; uint64_t dsc_resume_offset; boolean_t dsc_sent_begin; boolean_t dsc_sent_end; } dmu_send_cookie_t; static int do_dump(dmu_send_cookie_t *dscp, struct send_range *range); static void range_free(struct send_range *range) { if (range->type == OBJECT) { size_t size = sizeof (dnode_phys_t) * (range->sru.object.dnp->dn_extra_slots + 1); kmem_free(range->sru.object.dnp, size); } else if (range->type == DATA) { mutex_enter(&range->sru.data.lock); while (range->sru.data.io_outstanding) cv_wait(&range->sru.data.cv, &range->sru.data.lock); if (range->sru.data.abd != NULL) abd_free(range->sru.data.abd); if (range->sru.data.abuf != NULL) { arc_buf_destroy(range->sru.data.abuf, &range->sru.data.abuf); } mutex_exit(&range->sru.data.lock); cv_destroy(&range->sru.data.cv); mutex_destroy(&range->sru.data.lock); } kmem_free(range, sizeof (*range)); } /* * For all record types except BEGIN, fill in the checksum (overlaid in * drr_u.drr_checksum.drr_checksum). The checksum verifies everything * up to the start of the checksum itself. */ static int dump_record(dmu_send_cookie_t *dscp, void *payload, int payload_len) { dmu_send_outparams_t *dso = dscp->dsc_dso; ASSERT3U(offsetof(dmu_replay_record_t, drr_u.drr_checksum.drr_checksum), ==, sizeof (dmu_replay_record_t) - sizeof (zio_cksum_t)); (void) fletcher_4_incremental_native(dscp->dsc_drr, offsetof(dmu_replay_record_t, drr_u.drr_checksum.drr_checksum), &dscp->dsc_zc); if (dscp->dsc_drr->drr_type == DRR_BEGIN) { dscp->dsc_sent_begin = B_TRUE; } else { ASSERT(ZIO_CHECKSUM_IS_ZERO(&dscp->dsc_drr->drr_u. drr_checksum.drr_checksum)); dscp->dsc_drr->drr_u.drr_checksum.drr_checksum = dscp->dsc_zc; } if (dscp->dsc_drr->drr_type == DRR_END) { dscp->dsc_sent_end = B_TRUE; } (void) fletcher_4_incremental_native(&dscp->dsc_drr-> drr_u.drr_checksum.drr_checksum, sizeof (zio_cksum_t), &dscp->dsc_zc); *dscp->dsc_off += sizeof (dmu_replay_record_t); dscp->dsc_err = dso->dso_outfunc(dscp->dsc_os, dscp->dsc_drr, sizeof (dmu_replay_record_t), dso->dso_arg); if (dscp->dsc_err != 0) return (SET_ERROR(EINTR)); if (payload_len != 0) { *dscp->dsc_off += payload_len; /* * payload is null when dso_dryrun == B_TRUE (i.e. when we're * doing a send size calculation) */ if (payload != NULL) { (void) fletcher_4_incremental_native( payload, payload_len, &dscp->dsc_zc); } /* * The code does not rely on this (len being a multiple of 8). * We keep this assertion because of the corresponding assertion * in receive_read(). Keeping this assertion ensures that we do * not inadvertently break backwards compatibility (causing the * assertion in receive_read() to trigger on old software). * * Raw sends cannot be received on old software, and so can * bypass this assertion. */ ASSERT((payload_len % 8 == 0) || (dscp->dsc_featureflags & DMU_BACKUP_FEATURE_RAW)); dscp->dsc_err = dso->dso_outfunc(dscp->dsc_os, payload, payload_len, dso->dso_arg); if (dscp->dsc_err != 0) return (SET_ERROR(EINTR)); } return (0); } /* * Fill in the drr_free struct, or perform aggregation if the previous record is * also a free record, and the two are adjacent. * * Note that we send free records even for a full send, because we want to be * able to receive a full send as a clone, which requires a list of all the free * and freeobject records that were generated on the source. */ static int dump_free(dmu_send_cookie_t *dscp, uint64_t object, uint64_t offset, uint64_t length) { struct drr_free *drrf = &(dscp->dsc_drr->drr_u.drr_free); /* * When we receive a free record, dbuf_free_range() assumes * that the receiving system doesn't have any dbufs in the range * being freed. This is always true because there is a one-record * constraint: we only send one WRITE record for any given * object,offset. We know that the one-record constraint is * true because we always send data in increasing order by * object,offset. * * If the increasing-order constraint ever changes, we should find * another way to assert that the one-record constraint is still * satisfied. */ ASSERT(object > dscp->dsc_last_data_object || (object == dscp->dsc_last_data_object && offset > dscp->dsc_last_data_offset)); /* * If there is a pending op, but it's not PENDING_FREE, push it out, * since free block aggregation can only be done for blocks of the * same type (i.e., DRR_FREE records can only be aggregated with * other DRR_FREE records. DRR_FREEOBJECTS records can only be * aggregated with other DRR_FREEOBJECTS records). */ if (dscp->dsc_pending_op != PENDING_NONE && dscp->dsc_pending_op != PENDING_FREE) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } if (dscp->dsc_pending_op == PENDING_FREE) { /* * Check to see whether this free block can be aggregated * with pending one. */ if (drrf->drr_object == object && drrf->drr_offset + drrf->drr_length == offset) { if (offset + length < offset || length == UINT64_MAX) drrf->drr_length = UINT64_MAX; else drrf->drr_length += length; return (0); } else { /* not a continuation. Push out pending record */ if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } } /* create a FREE record and make it pending */ bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_FREE; drrf->drr_object = object; drrf->drr_offset = offset; if (offset + length < offset) drrf->drr_length = DMU_OBJECT_END; else drrf->drr_length = length; drrf->drr_toguid = dscp->dsc_toguid; if (length == DMU_OBJECT_END) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); } else { dscp->dsc_pending_op = PENDING_FREE; } return (0); } /* * Fill in the drr_redact struct, or perform aggregation if the previous record * is also a redaction record, and the two are adjacent. */ static int dump_redact(dmu_send_cookie_t *dscp, uint64_t object, uint64_t offset, uint64_t length) { struct drr_redact *drrr = &dscp->dsc_drr->drr_u.drr_redact; /* * If there is a pending op, but it's not PENDING_REDACT, push it out, * since free block aggregation can only be done for blocks of the * same type (i.e., DRR_REDACT records can only be aggregated with * other DRR_REDACT records). */ if (dscp->dsc_pending_op != PENDING_NONE && dscp->dsc_pending_op != PENDING_REDACT) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } if (dscp->dsc_pending_op == PENDING_REDACT) { /* * Check to see whether this redacted block can be aggregated * with pending one. */ if (drrr->drr_object == object && drrr->drr_offset + drrr->drr_length == offset) { drrr->drr_length += length; return (0); } else { /* not a continuation. Push out pending record */ if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } } /* create a REDACT record and make it pending */ bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_REDACT; drrr->drr_object = object; drrr->drr_offset = offset; drrr->drr_length = length; drrr->drr_toguid = dscp->dsc_toguid; dscp->dsc_pending_op = PENDING_REDACT; return (0); } static int dmu_dump_write(dmu_send_cookie_t *dscp, dmu_object_type_t type, uint64_t object, uint64_t offset, int lsize, int psize, const blkptr_t *bp, void *data) { uint64_t payload_size; boolean_t raw = (dscp->dsc_featureflags & DMU_BACKUP_FEATURE_RAW); struct drr_write *drrw = &(dscp->dsc_drr->drr_u.drr_write); /* * We send data in increasing object, offset order. * See comment in dump_free() for details. */ ASSERT(object > dscp->dsc_last_data_object || (object == dscp->dsc_last_data_object && offset > dscp->dsc_last_data_offset)); dscp->dsc_last_data_object = object; dscp->dsc_last_data_offset = offset + lsize - 1; /* * If there is any kind of pending aggregation (currently either * a grouping of free objects or free blocks), push it out to * the stream, since aggregation can't be done across operations * of different types. */ if (dscp->dsc_pending_op != PENDING_NONE) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } /* write a WRITE record */ bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_WRITE; drrw->drr_object = object; drrw->drr_type = type; drrw->drr_offset = offset; drrw->drr_toguid = dscp->dsc_toguid; drrw->drr_logical_size = lsize; /* only set the compression fields if the buf is compressed or raw */ if (raw || lsize != psize) { ASSERT(raw || dscp->dsc_featureflags & DMU_BACKUP_FEATURE_COMPRESSED); ASSERT(!BP_IS_EMBEDDED(bp)); ASSERT3S(psize, >, 0); if (raw) { ASSERT(BP_IS_PROTECTED(bp)); /* * This is a raw protected block so we need to pass * along everything the receiving side will need to * interpret this block, including the byteswap, salt, * IV, and MAC. */ if (BP_SHOULD_BYTESWAP(bp)) drrw->drr_flags |= DRR_RAW_BYTESWAP; zio_crypt_decode_params_bp(bp, drrw->drr_salt, drrw->drr_iv); zio_crypt_decode_mac_bp(bp, drrw->drr_mac); } else { /* this is a compressed block */ ASSERT(dscp->dsc_featureflags & DMU_BACKUP_FEATURE_COMPRESSED); ASSERT(!BP_SHOULD_BYTESWAP(bp)); ASSERT(!DMU_OT_IS_METADATA(BP_GET_TYPE(bp))); ASSERT3U(BP_GET_COMPRESS(bp), !=, ZIO_COMPRESS_OFF); ASSERT3S(lsize, >=, psize); } /* set fields common to compressed and raw sends */ drrw->drr_compressiontype = BP_GET_COMPRESS(bp); drrw->drr_compressed_size = psize; payload_size = drrw->drr_compressed_size; } else { payload_size = drrw->drr_logical_size; } if (bp == NULL || BP_IS_EMBEDDED(bp) || (BP_IS_PROTECTED(bp) && !raw)) { /* * There's no pre-computed checksum for partial-block writes, * embedded BP's, or encrypted BP's that are being sent as * plaintext, so (like fletcher4-checksummed blocks) userland * will have to compute a dedup-capable checksum itself. */ drrw->drr_checksumtype = ZIO_CHECKSUM_OFF; } else { drrw->drr_checksumtype = BP_GET_CHECKSUM(bp); if (zio_checksum_table[drrw->drr_checksumtype].ci_flags & ZCHECKSUM_FLAG_DEDUP) drrw->drr_flags |= DRR_CHECKSUM_DEDUP; DDK_SET_LSIZE(&drrw->drr_key, BP_GET_LSIZE(bp)); DDK_SET_PSIZE(&drrw->drr_key, BP_GET_PSIZE(bp)); DDK_SET_COMPRESS(&drrw->drr_key, BP_GET_COMPRESS(bp)); DDK_SET_CRYPT(&drrw->drr_key, BP_IS_PROTECTED(bp)); drrw->drr_key.ddk_cksum = bp->blk_cksum; } if (dump_record(dscp, data, payload_size) != 0) return (SET_ERROR(EINTR)); return (0); } static int dump_write_embedded(dmu_send_cookie_t *dscp, uint64_t object, uint64_t offset, int blksz, const blkptr_t *bp) { char buf[BPE_PAYLOAD_SIZE]; struct drr_write_embedded *drrw = &(dscp->dsc_drr->drr_u.drr_write_embedded); if (dscp->dsc_pending_op != PENDING_NONE) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } ASSERT(BP_IS_EMBEDDED(bp)); bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_WRITE_EMBEDDED; drrw->drr_object = object; drrw->drr_offset = offset; drrw->drr_length = blksz; drrw->drr_toguid = dscp->dsc_toguid; drrw->drr_compression = BP_GET_COMPRESS(bp); drrw->drr_etype = BPE_GET_ETYPE(bp); drrw->drr_lsize = BPE_GET_LSIZE(bp); drrw->drr_psize = BPE_GET_PSIZE(bp); decode_embedded_bp_compressed(bp, buf); if (dump_record(dscp, buf, P2ROUNDUP(drrw->drr_psize, 8)) != 0) return (SET_ERROR(EINTR)); return (0); } static int dump_spill(dmu_send_cookie_t *dscp, const blkptr_t *bp, uint64_t object, void *data) { struct drr_spill *drrs = &(dscp->dsc_drr->drr_u.drr_spill); uint64_t blksz = BP_GET_LSIZE(bp); uint64_t payload_size = blksz; if (dscp->dsc_pending_op != PENDING_NONE) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } /* write a SPILL record */ bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_SPILL; drrs->drr_object = object; drrs->drr_length = blksz; drrs->drr_toguid = dscp->dsc_toguid; /* See comment in dump_dnode() for full details */ if (zfs_send_unmodified_spill_blocks && (bp->blk_birth <= dscp->dsc_fromtxg)) { drrs->drr_flags |= DRR_SPILL_UNMODIFIED; } /* handle raw send fields */ if (dscp->dsc_featureflags & DMU_BACKUP_FEATURE_RAW) { ASSERT(BP_IS_PROTECTED(bp)); if (BP_SHOULD_BYTESWAP(bp)) drrs->drr_flags |= DRR_RAW_BYTESWAP; drrs->drr_compressiontype = BP_GET_COMPRESS(bp); drrs->drr_compressed_size = BP_GET_PSIZE(bp); zio_crypt_decode_params_bp(bp, drrs->drr_salt, drrs->drr_iv); zio_crypt_decode_mac_bp(bp, drrs->drr_mac); payload_size = drrs->drr_compressed_size; } if (dump_record(dscp, data, payload_size) != 0) return (SET_ERROR(EINTR)); return (0); } static int dump_freeobjects(dmu_send_cookie_t *dscp, uint64_t firstobj, uint64_t numobjs) { struct drr_freeobjects *drrfo = &(dscp->dsc_drr->drr_u.drr_freeobjects); uint64_t maxobj = DNODES_PER_BLOCK * (DMU_META_DNODE(dscp->dsc_os)->dn_maxblkid + 1); /* * ZoL < 0.7 does not handle large FREEOBJECTS records correctly, * leading to zfs recv never completing. to avoid this issue, don't * send FREEOBJECTS records for object IDs which cannot exist on the * receiving side. */ if (maxobj > 0) { - if (maxobj < firstobj) + if (maxobj <= firstobj) return (0); if (maxobj < firstobj + numobjs) numobjs = maxobj - firstobj; } /* * If there is a pending op, but it's not PENDING_FREEOBJECTS, * push it out, since free block aggregation can only be done for * blocks of the same type (i.e., DRR_FREE records can only be * aggregated with other DRR_FREE records. DRR_FREEOBJECTS records * can only be aggregated with other DRR_FREEOBJECTS records). */ if (dscp->dsc_pending_op != PENDING_NONE && dscp->dsc_pending_op != PENDING_FREEOBJECTS) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } - if (numobjs == 0) - numobjs = UINT64_MAX - firstobj; if (dscp->dsc_pending_op == PENDING_FREEOBJECTS) { /* * See whether this free object array can be aggregated * with pending one */ if (drrfo->drr_firstobj + drrfo->drr_numobjs == firstobj) { drrfo->drr_numobjs += numobjs; return (0); } else { /* can't be aggregated. Push out pending record */ if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } } /* write a FREEOBJECTS record */ bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_FREEOBJECTS; drrfo->drr_firstobj = firstobj; drrfo->drr_numobjs = numobjs; drrfo->drr_toguid = dscp->dsc_toguid; dscp->dsc_pending_op = PENDING_FREEOBJECTS; return (0); } static int dump_dnode(dmu_send_cookie_t *dscp, const blkptr_t *bp, uint64_t object, dnode_phys_t *dnp) { struct drr_object *drro = &(dscp->dsc_drr->drr_u.drr_object); int bonuslen; if (object < dscp->dsc_resume_object) { /* * Note: when resuming, we will visit all the dnodes in * the block of dnodes that we are resuming from. In * this case it's unnecessary to send the dnodes prior to * the one we are resuming from. We should be at most one * block's worth of dnodes behind the resume point. */ ASSERT3U(dscp->dsc_resume_object - object, <, 1 << (DNODE_BLOCK_SHIFT - DNODE_SHIFT)); return (0); } if (dnp == NULL || dnp->dn_type == DMU_OT_NONE) return (dump_freeobjects(dscp, object, 1)); if (dscp->dsc_pending_op != PENDING_NONE) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } /* write an OBJECT record */ bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_OBJECT; drro->drr_object = object; drro->drr_type = dnp->dn_type; drro->drr_bonustype = dnp->dn_bonustype; drro->drr_blksz = dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT; drro->drr_bonuslen = dnp->dn_bonuslen; drro->drr_dn_slots = dnp->dn_extra_slots + 1; drro->drr_checksumtype = dnp->dn_checksum; drro->drr_compress = dnp->dn_compress; drro->drr_toguid = dscp->dsc_toguid; if (!(dscp->dsc_featureflags & DMU_BACKUP_FEATURE_LARGE_BLOCKS) && drro->drr_blksz > SPA_OLD_MAXBLOCKSIZE) drro->drr_blksz = SPA_OLD_MAXBLOCKSIZE; bonuslen = P2ROUNDUP(dnp->dn_bonuslen, 8); if ((dscp->dsc_featureflags & DMU_BACKUP_FEATURE_RAW)) { ASSERT(BP_IS_ENCRYPTED(bp)); if (BP_SHOULD_BYTESWAP(bp)) drro->drr_flags |= DRR_RAW_BYTESWAP; /* needed for reconstructing dnp on recv side */ drro->drr_maxblkid = dnp->dn_maxblkid; drro->drr_indblkshift = dnp->dn_indblkshift; drro->drr_nlevels = dnp->dn_nlevels; drro->drr_nblkptr = dnp->dn_nblkptr; /* * Since we encrypt the entire bonus area, the (raw) part * beyond the bonuslen is actually nonzero, so we need * to send it. */ if (bonuslen != 0) { drro->drr_raw_bonuslen = DN_MAX_BONUS_LEN(dnp); bonuslen = drro->drr_raw_bonuslen; } } /* * DRR_OBJECT_SPILL is set for every dnode which references a * spill block. This allows the receiving pool to definitively * determine when a spill block should be kept or freed. */ if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) drro->drr_flags |= DRR_OBJECT_SPILL; if (dump_record(dscp, DN_BONUS(dnp), bonuslen) != 0) return (SET_ERROR(EINTR)); /* Free anything past the end of the file. */ if (dump_free(dscp, object, (dnp->dn_maxblkid + 1) * (dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT), DMU_OBJECT_END) != 0) return (SET_ERROR(EINTR)); /* * Send DRR_SPILL records for unmodified spill blocks. This is useful * because changing certain attributes of the object (e.g. blocksize) * can cause old versions of ZFS to incorrectly remove a spill block. * Including these records in the stream forces an up to date version * to always be written ensuring they're never lost. Current versions * of the code which understand the DRR_FLAG_SPILL_BLOCK feature can * ignore these unmodified spill blocks. */ if (zfs_send_unmodified_spill_blocks && (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) && (DN_SPILL_BLKPTR(dnp)->blk_birth <= dscp->dsc_fromtxg)) { struct send_range record; blkptr_t *bp = DN_SPILL_BLKPTR(dnp); bzero(&record, sizeof (struct send_range)); record.type = DATA; record.object = object; record.eos_marker = B_FALSE; record.start_blkid = DMU_SPILL_BLKID; record.end_blkid = record.start_blkid + 1; record.sru.data.bp = *bp; record.sru.data.obj_type = dnp->dn_type; record.sru.data.datablksz = BP_GET_LSIZE(bp); if (do_dump(dscp, &record) != 0) return (SET_ERROR(EINTR)); } if (dscp->dsc_err != 0) return (SET_ERROR(EINTR)); return (0); } static int dump_object_range(dmu_send_cookie_t *dscp, const blkptr_t *bp, uint64_t firstobj, uint64_t numslots) { struct drr_object_range *drror = &(dscp->dsc_drr->drr_u.drr_object_range); /* we only use this record type for raw sends */ ASSERT(BP_IS_PROTECTED(bp)); ASSERT(dscp->dsc_featureflags & DMU_BACKUP_FEATURE_RAW); ASSERT3U(BP_GET_COMPRESS(bp), ==, ZIO_COMPRESS_OFF); ASSERT3U(BP_GET_TYPE(bp), ==, DMU_OT_DNODE); ASSERT0(BP_GET_LEVEL(bp)); if (dscp->dsc_pending_op != PENDING_NONE) { if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); dscp->dsc_pending_op = PENDING_NONE; } bzero(dscp->dsc_drr, sizeof (dmu_replay_record_t)); dscp->dsc_drr->drr_type = DRR_OBJECT_RANGE; drror->drr_firstobj = firstobj; drror->drr_numslots = numslots; drror->drr_toguid = dscp->dsc_toguid; if (BP_SHOULD_BYTESWAP(bp)) drror->drr_flags |= DRR_RAW_BYTESWAP; zio_crypt_decode_params_bp(bp, drror->drr_salt, drror->drr_iv); zio_crypt_decode_mac_bp(bp, drror->drr_mac); if (dump_record(dscp, NULL, 0) != 0) return (SET_ERROR(EINTR)); return (0); } static boolean_t send_do_embed(const blkptr_t *bp, uint64_t featureflags) { if (!BP_IS_EMBEDDED(bp)) return (B_FALSE); /* * Compression function must be legacy, or explicitly enabled. */ if ((BP_GET_COMPRESS(bp) >= ZIO_COMPRESS_LEGACY_FUNCTIONS && !(featureflags & DMU_BACKUP_FEATURE_LZ4))) return (B_FALSE); /* * If we have not set the ZSTD feature flag, we can't send ZSTD * compressed embedded blocks, as the receiver may not support them. */ if ((BP_GET_COMPRESS(bp) == ZIO_COMPRESS_ZSTD && !(featureflags & DMU_BACKUP_FEATURE_ZSTD))) return (B_FALSE); /* * Embed type must be explicitly enabled. */ switch (BPE_GET_ETYPE(bp)) { case BP_EMBEDDED_TYPE_DATA: if (featureflags & DMU_BACKUP_FEATURE_EMBED_DATA) return (B_TRUE); break; default: return (B_FALSE); } return (B_FALSE); } /* * This function actually handles figuring out what kind of record needs to be * dumped, and calling the appropriate helper function. In most cases, * the data has already been read by send_reader_thread(). */ static int do_dump(dmu_send_cookie_t *dscp, struct send_range *range) { int err = 0; switch (range->type) { case OBJECT: err = dump_dnode(dscp, &range->sru.object.bp, range->object, range->sru.object.dnp); return (err); case OBJECT_RANGE: { ASSERT3U(range->start_blkid + 1, ==, range->end_blkid); if (!(dscp->dsc_featureflags & DMU_BACKUP_FEATURE_RAW)) { return (0); } uint64_t epb = BP_GET_LSIZE(&range->sru.object_range.bp) >> DNODE_SHIFT; uint64_t firstobj = range->start_blkid * epb; err = dump_object_range(dscp, &range->sru.object_range.bp, firstobj, epb); break; } case REDACT: { struct srr *srrp = &range->sru.redact; err = dump_redact(dscp, range->object, range->start_blkid * srrp->datablksz, (range->end_blkid - range->start_blkid) * srrp->datablksz); return (err); } case DATA: { struct srd *srdp = &range->sru.data; blkptr_t *bp = &srdp->bp; spa_t *spa = dmu_objset_spa(dscp->dsc_os); ASSERT3U(srdp->datablksz, ==, BP_GET_LSIZE(bp)); ASSERT3U(range->start_blkid + 1, ==, range->end_blkid); if (BP_GET_TYPE(bp) == DMU_OT_SA) { arc_flags_t aflags = ARC_FLAG_WAIT; enum zio_flag zioflags = ZIO_FLAG_CANFAIL; if (dscp->dsc_featureflags & DMU_BACKUP_FEATURE_RAW) { ASSERT(BP_IS_PROTECTED(bp)); zioflags |= ZIO_FLAG_RAW; } zbookmark_phys_t zb; ASSERT3U(range->start_blkid, ==, DMU_SPILL_BLKID); zb.zb_objset = dmu_objset_id(dscp->dsc_os); zb.zb_object = range->object; zb.zb_level = 0; zb.zb_blkid = range->start_blkid; arc_buf_t *abuf = NULL; if (!dscp->dsc_dso->dso_dryrun && arc_read(NULL, spa, bp, arc_getbuf_func, &abuf, ZIO_PRIORITY_ASYNC_READ, zioflags, &aflags, &zb) != 0) return (SET_ERROR(EIO)); err = dump_spill(dscp, bp, zb.zb_object, (abuf == NULL ? NULL : abuf->b_data)); if (abuf != NULL) arc_buf_destroy(abuf, &abuf); return (err); } if (send_do_embed(bp, dscp->dsc_featureflags)) { err = dump_write_embedded(dscp, range->object, range->start_blkid * srdp->datablksz, srdp->datablksz, bp); return (err); } ASSERT(range->object > dscp->dsc_resume_object || (range->object == dscp->dsc_resume_object && range->start_blkid * srdp->datablksz >= dscp->dsc_resume_offset)); /* it's a level-0 block of a regular object */ mutex_enter(&srdp->lock); while (srdp->io_outstanding) cv_wait(&srdp->cv, &srdp->lock); err = srdp->io_err; mutex_exit(&srdp->lock); if (err != 0) { if (zfs_send_corrupt_data && !dscp->dsc_dso->dso_dryrun) { /* * Send a block filled with 0x"zfs badd bloc" */ srdp->abuf = arc_alloc_buf(spa, &srdp->abuf, ARC_BUFC_DATA, srdp->datablksz); uint64_t *ptr; for (ptr = srdp->abuf->b_data; (char *)ptr < (char *)srdp->abuf->b_data + srdp->datablksz; ptr++) *ptr = 0x2f5baddb10cULL; } else { return (SET_ERROR(EIO)); } } ASSERT(dscp->dsc_dso->dso_dryrun || srdp->abuf != NULL || srdp->abd != NULL); uint64_t offset = range->start_blkid * srdp->datablksz; char *data = NULL; if (srdp->abd != NULL) { data = abd_to_buf(srdp->abd); ASSERT3P(srdp->abuf, ==, NULL); } else if (srdp->abuf != NULL) { data = srdp->abuf->b_data; } /* * If we have large blocks stored on disk but the send flags * don't allow us to send large blocks, we split the data from * the arc buf into chunks. */ if (srdp->datablksz > SPA_OLD_MAXBLOCKSIZE && !(dscp->dsc_featureflags & DMU_BACKUP_FEATURE_LARGE_BLOCKS)) { while (srdp->datablksz > 0 && err == 0) { int n = MIN(srdp->datablksz, SPA_OLD_MAXBLOCKSIZE); err = dmu_dump_write(dscp, srdp->obj_type, range->object, offset, n, n, NULL, data); offset += n; /* * When doing dry run, data==NULL is used as a * sentinel value by * dmu_dump_write()->dump_record(). */ if (data != NULL) data += n; srdp->datablksz -= n; } } else { err = dmu_dump_write(dscp, srdp->obj_type, range->object, offset, srdp->datablksz, srdp->datasz, bp, data); } return (err); } case HOLE: { struct srh *srhp = &range->sru.hole; if (range->object == DMU_META_DNODE_OBJECT) { uint32_t span = srhp->datablksz >> DNODE_SHIFT; uint64_t first_obj = range->start_blkid * span; uint64_t numobj = range->end_blkid * span - first_obj; return (dump_freeobjects(dscp, first_obj, numobj)); } uint64_t offset = 0; /* * If this multiply overflows, we don't need to send this block. * Even if it has a birth time, it can never not be a hole, so * we don't need to send records for it. */ if (!overflow_multiply(range->start_blkid, srhp->datablksz, &offset)) { return (0); } uint64_t len = 0; if (!overflow_multiply(range->end_blkid, srhp->datablksz, &len)) len = UINT64_MAX; len = len - offset; return (dump_free(dscp, range->object, offset, len)); } default: panic("Invalid range type in do_dump: %d", range->type); } return (err); } static struct send_range * range_alloc(enum type type, uint64_t object, uint64_t start_blkid, uint64_t end_blkid, boolean_t eos) { struct send_range *range = kmem_alloc(sizeof (*range), KM_SLEEP); range->type = type; range->object = object; range->start_blkid = start_blkid; range->end_blkid = end_blkid; range->eos_marker = eos; if (type == DATA) { range->sru.data.abd = NULL; range->sru.data.abuf = NULL; mutex_init(&range->sru.data.lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&range->sru.data.cv, NULL, CV_DEFAULT, NULL); range->sru.data.io_outstanding = 0; range->sru.data.io_err = 0; } return (range); } /* * This is the callback function to traverse_dataset that acts as a worker * thread for dmu_send_impl. */ /*ARGSUSED*/ static int send_cb(spa_t *spa, zilog_t *zilog, const blkptr_t *bp, const zbookmark_phys_t *zb, const struct dnode_phys *dnp, void *arg) { struct send_thread_arg *sta = arg; struct send_range *record; ASSERT(zb->zb_object == DMU_META_DNODE_OBJECT || zb->zb_object >= sta->resume.zb_object); /* * All bps of an encrypted os should have the encryption bit set. * If this is not true it indicates tampering and we report an error. */ if (sta->os->os_encrypted && !BP_IS_HOLE(bp) && !BP_USES_CRYPT(bp)) { spa_log_error(spa, zb); zfs_panic_recover("unencrypted block in encrypted " "object set %llu", dmu_objset_id(sta->os)); return (SET_ERROR(EIO)); } if (sta->cancel) return (SET_ERROR(EINTR)); if (zb->zb_object != DMU_META_DNODE_OBJECT && DMU_OBJECT_IS_SPECIAL(zb->zb_object)) return (0); atomic_inc_64(sta->num_blocks_visited); if (zb->zb_level == ZB_DNODE_LEVEL) { if (zb->zb_object == DMU_META_DNODE_OBJECT) return (0); record = range_alloc(OBJECT, zb->zb_object, 0, 0, B_FALSE); record->sru.object.bp = *bp; size_t size = sizeof (*dnp) * (dnp->dn_extra_slots + 1); record->sru.object.dnp = kmem_alloc(size, KM_SLEEP); bcopy(dnp, record->sru.object.dnp, size); bqueue_enqueue(&sta->q, record, sizeof (*record)); return (0); } if (zb->zb_level == 0 && zb->zb_object == DMU_META_DNODE_OBJECT && !BP_IS_HOLE(bp)) { record = range_alloc(OBJECT_RANGE, 0, zb->zb_blkid, zb->zb_blkid + 1, B_FALSE); record->sru.object_range.bp = *bp; bqueue_enqueue(&sta->q, record, sizeof (*record)); return (0); } if (zb->zb_level < 0 || (zb->zb_level > 0 && !BP_IS_HOLE(bp))) return (0); if (zb->zb_object == DMU_META_DNODE_OBJECT && !BP_IS_HOLE(bp)) return (0); uint64_t span = bp_span_in_blocks(dnp->dn_indblkshift, zb->zb_level); uint64_t start; /* * If this multiply overflows, we don't need to send this block. * Even if it has a birth time, it can never not be a hole, so * we don't need to send records for it. */ if (!overflow_multiply(span, zb->zb_blkid, &start) || (!(zb->zb_blkid == DMU_SPILL_BLKID || DMU_OT_IS_METADATA(dnp->dn_type)) && span * zb->zb_blkid > dnp->dn_maxblkid)) { ASSERT(BP_IS_HOLE(bp)); return (0); } if (zb->zb_blkid == DMU_SPILL_BLKID) ASSERT3U(BP_GET_TYPE(bp), ==, DMU_OT_SA); enum type record_type = DATA; if (BP_IS_HOLE(bp)) record_type = HOLE; else if (BP_IS_REDACTED(bp)) record_type = REDACT; else record_type = DATA; record = range_alloc(record_type, zb->zb_object, start, (start + span < start ? 0 : start + span), B_FALSE); uint64_t datablksz = (zb->zb_blkid == DMU_SPILL_BLKID ? BP_GET_LSIZE(bp) : dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT); if (BP_IS_HOLE(bp)) { record->sru.hole.datablksz = datablksz; } else if (BP_IS_REDACTED(bp)) { record->sru.redact.datablksz = datablksz; } else { record->sru.data.datablksz = datablksz; record->sru.data.obj_type = dnp->dn_type; record->sru.data.bp = *bp; } bqueue_enqueue(&sta->q, record, sizeof (*record)); return (0); } struct redact_list_cb_arg { uint64_t *num_blocks_visited; bqueue_t *q; boolean_t *cancel; boolean_t mark_redact; }; static int redact_list_cb(redact_block_phys_t *rb, void *arg) { struct redact_list_cb_arg *rlcap = arg; atomic_inc_64(rlcap->num_blocks_visited); if (*rlcap->cancel) return (-1); struct send_range *data = range_alloc(REDACT, rb->rbp_object, rb->rbp_blkid, rb->rbp_blkid + redact_block_get_count(rb), B_FALSE); ASSERT3U(data->end_blkid, >, rb->rbp_blkid); if (rlcap->mark_redact) { data->type = REDACT; data->sru.redact.datablksz = redact_block_get_size(rb); } else { data->type = PREVIOUSLY_REDACTED; } bqueue_enqueue(rlcap->q, data, sizeof (*data)); return (0); } /* * This function kicks off the traverse_dataset. It also handles setting the * error code of the thread in case something goes wrong, and pushes the End of * Stream record when the traverse_dataset call has finished. */ static void send_traverse_thread(void *arg) { struct send_thread_arg *st_arg = arg; int err = 0; struct send_range *data; fstrans_cookie_t cookie = spl_fstrans_mark(); err = traverse_dataset_resume(st_arg->os->os_dsl_dataset, st_arg->fromtxg, &st_arg->resume, st_arg->flags, send_cb, st_arg); if (err != EINTR) st_arg->error_code = err; data = range_alloc(DATA, 0, 0, 0, B_TRUE); bqueue_enqueue_flush(&st_arg->q, data, sizeof (*data)); spl_fstrans_unmark(cookie); thread_exit(); } /* * Utility function that causes End of Stream records to compare after of all * others, so that other threads' comparison logic can stay simple. */ static int __attribute__((unused)) send_range_after(const struct send_range *from, const struct send_range *to) { if (from->eos_marker == B_TRUE) return (1); if (to->eos_marker == B_TRUE) return (-1); uint64_t from_obj = from->object; uint64_t from_end_obj = from->object + 1; uint64_t to_obj = to->object; uint64_t to_end_obj = to->object + 1; if (from_obj == 0) { ASSERT(from->type == HOLE || from->type == OBJECT_RANGE); from_obj = from->start_blkid << DNODES_PER_BLOCK_SHIFT; from_end_obj = from->end_blkid << DNODES_PER_BLOCK_SHIFT; } if (to_obj == 0) { ASSERT(to->type == HOLE || to->type == OBJECT_RANGE); to_obj = to->start_blkid << DNODES_PER_BLOCK_SHIFT; to_end_obj = to->end_blkid << DNODES_PER_BLOCK_SHIFT; } if (from_end_obj <= to_obj) return (-1); if (from_obj >= to_end_obj) return (1); int64_t cmp = TREE_CMP(to->type == OBJECT_RANGE, from->type == OBJECT_RANGE); if (unlikely(cmp)) return (cmp); cmp = TREE_CMP(to->type == OBJECT, from->type == OBJECT); if (unlikely(cmp)) return (cmp); if (from->end_blkid <= to->start_blkid) return (-1); if (from->start_blkid >= to->end_blkid) return (1); return (0); } /* * Pop the new data off the queue, check that the records we receive are in * the right order, but do not free the old data. This is used so that the * records can be sent on to the main thread without copying the data. */ static struct send_range * get_next_range_nofree(bqueue_t *bq, struct send_range *prev) { struct send_range *next = bqueue_dequeue(bq); ASSERT3S(send_range_after(prev, next), ==, -1); return (next); } /* * Pop the new data off the queue, check that the records we receive are in * the right order, and free the old data. */ static struct send_range * get_next_range(bqueue_t *bq, struct send_range *prev) { struct send_range *next = get_next_range_nofree(bq, prev); range_free(prev); return (next); } static void redact_list_thread(void *arg) { struct redact_list_thread_arg *rlt_arg = arg; struct send_range *record; fstrans_cookie_t cookie = spl_fstrans_mark(); if (rlt_arg->rl != NULL) { struct redact_list_cb_arg rlcba = {0}; rlcba.cancel = &rlt_arg->cancel; rlcba.q = &rlt_arg->q; rlcba.num_blocks_visited = rlt_arg->num_blocks_visited; rlcba.mark_redact = rlt_arg->mark_redact; int err = dsl_redaction_list_traverse(rlt_arg->rl, &rlt_arg->resume, redact_list_cb, &rlcba); if (err != EINTR) rlt_arg->error_code = err; } record = range_alloc(DATA, 0, 0, 0, B_TRUE); bqueue_enqueue_flush(&rlt_arg->q, record, sizeof (*record)); spl_fstrans_unmark(cookie); thread_exit(); } /* * Compare the start point of the two provided ranges. End of stream ranges * compare last, objects compare before any data or hole inside that object and * multi-object holes that start at the same object. */ static int send_range_start_compare(struct send_range *r1, struct send_range *r2) { uint64_t r1_objequiv = r1->object; uint64_t r1_l0equiv = r1->start_blkid; uint64_t r2_objequiv = r2->object; uint64_t r2_l0equiv = r2->start_blkid; int64_t cmp = TREE_CMP(r1->eos_marker, r2->eos_marker); if (unlikely(cmp)) return (cmp); if (r1->object == 0) { r1_objequiv = r1->start_blkid * DNODES_PER_BLOCK; r1_l0equiv = 0; } if (r2->object == 0) { r2_objequiv = r2->start_blkid * DNODES_PER_BLOCK; r2_l0equiv = 0; } cmp = TREE_CMP(r1_objequiv, r2_objequiv); if (likely(cmp)) return (cmp); cmp = TREE_CMP(r2->type == OBJECT_RANGE, r1->type == OBJECT_RANGE); if (unlikely(cmp)) return (cmp); cmp = TREE_CMP(r2->type == OBJECT, r1->type == OBJECT); if (unlikely(cmp)) return (cmp); return (TREE_CMP(r1_l0equiv, r2_l0equiv)); } enum q_idx { REDACT_IDX = 0, TO_IDX, FROM_IDX, NUM_THREADS }; /* * This function returns the next range the send_merge_thread should operate on. * The inputs are two arrays; the first one stores the range at the front of the * queues stored in the second one. The ranges are sorted in descending * priority order; the metadata from earlier ranges overrules metadata from * later ranges. out_mask is used to return which threads the ranges came from; * bit i is set if ranges[i] started at the same place as the returned range. * * This code is not hardcoded to compare a specific number of threads; it could * be used with any number, just by changing the q_idx enum. * * The "next range" is the one with the earliest start; if two starts are equal, * the highest-priority range is the next to operate on. If a higher-priority * range starts in the middle of the first range, then the first range will be * truncated to end where the higher-priority range starts, and we will operate * on that one next time. In this way, we make sure that each block covered by * some range gets covered by a returned range, and each block covered is * returned using the metadata of the highest-priority range it appears in. * * For example, if the three ranges at the front of the queues were [2,4), * [3,5), and [1,3), then the ranges returned would be [1,2) with the metadata * from the third range, [2,4) with the metadata from the first range, and then * [4,5) with the metadata from the second. */ static struct send_range * find_next_range(struct send_range **ranges, bqueue_t **qs, uint64_t *out_mask) { int idx = 0; // index of the range with the earliest start int i; uint64_t bmask = 0; for (i = 1; i < NUM_THREADS; i++) { if (send_range_start_compare(ranges[i], ranges[idx]) < 0) idx = i; } if (ranges[idx]->eos_marker) { struct send_range *ret = range_alloc(DATA, 0, 0, 0, B_TRUE); *out_mask = 0; return (ret); } /* * Find all the ranges that start at that same point. */ for (i = 0; i < NUM_THREADS; i++) { if (send_range_start_compare(ranges[i], ranges[idx]) == 0) bmask |= 1 << i; } *out_mask = bmask; /* * OBJECT_RANGE records only come from the TO thread, and should always * be treated as overlapping with nothing and sent on immediately. They * are only used in raw sends, and are never redacted. */ if (ranges[idx]->type == OBJECT_RANGE) { ASSERT3U(idx, ==, TO_IDX); ASSERT3U(*out_mask, ==, 1 << TO_IDX); struct send_range *ret = ranges[idx]; ranges[idx] = get_next_range_nofree(qs[idx], ranges[idx]); return (ret); } /* * Find the first start or end point after the start of the first range. */ uint64_t first_change = ranges[idx]->end_blkid; for (i = 0; i < NUM_THREADS; i++) { if (i == idx || ranges[i]->eos_marker || ranges[i]->object > ranges[idx]->object || ranges[i]->object == DMU_META_DNODE_OBJECT) continue; ASSERT3U(ranges[i]->object, ==, ranges[idx]->object); if (first_change > ranges[i]->start_blkid && (bmask & (1 << i)) == 0) first_change = ranges[i]->start_blkid; else if (first_change > ranges[i]->end_blkid) first_change = ranges[i]->end_blkid; } /* * Update all ranges to no longer overlap with the range we're * returning. All such ranges must start at the same place as the range * being returned, and end at or after first_change. Thus we update * their start to first_change. If that makes them size 0, then free * them and pull a new range from that thread. */ for (i = 0; i < NUM_THREADS; i++) { if (i == idx || (bmask & (1 << i)) == 0) continue; ASSERT3U(first_change, >, ranges[i]->start_blkid); ranges[i]->start_blkid = first_change; ASSERT3U(ranges[i]->start_blkid, <=, ranges[i]->end_blkid); if (ranges[i]->start_blkid == ranges[i]->end_blkid) ranges[i] = get_next_range(qs[i], ranges[i]); } /* * Short-circuit the simple case; if the range doesn't overlap with * anything else, or it only overlaps with things that start at the same * place and are longer, send it on. */ if (first_change == ranges[idx]->end_blkid) { struct send_range *ret = ranges[idx]; ranges[idx] = get_next_range_nofree(qs[idx], ranges[idx]); return (ret); } /* * Otherwise, return a truncated copy of ranges[idx] and move the start * of ranges[idx] back to first_change. */ struct send_range *ret = kmem_alloc(sizeof (*ret), KM_SLEEP); *ret = *ranges[idx]; ret->end_blkid = first_change; ranges[idx]->start_blkid = first_change; return (ret); } #define FROM_AND_REDACT_BITS ((1 << REDACT_IDX) | (1 << FROM_IDX)) /* * Merge the results from the from thread and the to thread, and then hand the * records off to send_prefetch_thread to prefetch them. If this is not a * send from a redaction bookmark, the from thread will push an end of stream * record and stop, and we'll just send everything that was changed in the * to_ds since the ancestor's creation txg. If it is, then since * traverse_dataset has a canonical order, we can compare each change as * they're pulled off the queues. That will give us a stream that is * appropriately sorted, and covers all records. In addition, we pull the * data from the redact_list_thread and use that to determine which blocks * should be redacted. */ static void send_merge_thread(void *arg) { struct send_merge_thread_arg *smt_arg = arg; struct send_range *front_ranges[NUM_THREADS]; bqueue_t *queues[NUM_THREADS]; int err = 0; fstrans_cookie_t cookie = spl_fstrans_mark(); if (smt_arg->redact_arg == NULL) { front_ranges[REDACT_IDX] = kmem_zalloc(sizeof (struct send_range), KM_SLEEP); front_ranges[REDACT_IDX]->eos_marker = B_TRUE; front_ranges[REDACT_IDX]->type = REDACT; queues[REDACT_IDX] = NULL; } else { front_ranges[REDACT_IDX] = bqueue_dequeue(&smt_arg->redact_arg->q); queues[REDACT_IDX] = &smt_arg->redact_arg->q; } front_ranges[TO_IDX] = bqueue_dequeue(&smt_arg->to_arg->q); queues[TO_IDX] = &smt_arg->to_arg->q; front_ranges[FROM_IDX] = bqueue_dequeue(&smt_arg->from_arg->q); queues[FROM_IDX] = &smt_arg->from_arg->q; uint64_t mask = 0; struct send_range *range; for (range = find_next_range(front_ranges, queues, &mask); !range->eos_marker && err == 0 && !smt_arg->cancel; range = find_next_range(front_ranges, queues, &mask)) { /* * If the range in question was in both the from redact bookmark * and the bookmark we're using to redact, then don't send it. * It's already redacted on the receiving system, so a redaction * record would be redundant. */ if ((mask & FROM_AND_REDACT_BITS) == FROM_AND_REDACT_BITS) { ASSERT3U(range->type, ==, REDACT); range_free(range); continue; } bqueue_enqueue(&smt_arg->q, range, sizeof (*range)); if (smt_arg->to_arg->error_code != 0) { err = smt_arg->to_arg->error_code; } else if (smt_arg->from_arg->error_code != 0) { err = smt_arg->from_arg->error_code; } else if (smt_arg->redact_arg != NULL && smt_arg->redact_arg->error_code != 0) { err = smt_arg->redact_arg->error_code; } } if (smt_arg->cancel && err == 0) err = SET_ERROR(EINTR); smt_arg->error = err; if (smt_arg->error != 0) { smt_arg->to_arg->cancel = B_TRUE; smt_arg->from_arg->cancel = B_TRUE; if (smt_arg->redact_arg != NULL) smt_arg->redact_arg->cancel = B_TRUE; } for (int i = 0; i < NUM_THREADS; i++) { while (!front_ranges[i]->eos_marker) { front_ranges[i] = get_next_range(queues[i], front_ranges[i]); } range_free(front_ranges[i]); } if (range == NULL) range = kmem_zalloc(sizeof (*range), KM_SLEEP); range->eos_marker = B_TRUE; bqueue_enqueue_flush(&smt_arg->q, range, 1); spl_fstrans_unmark(cookie); thread_exit(); } struct send_reader_thread_arg { struct send_merge_thread_arg *smta; bqueue_t q; boolean_t cancel; boolean_t issue_reads; uint64_t featureflags; int error; }; static void dmu_send_read_done(zio_t *zio) { struct send_range *range = zio->io_private; mutex_enter(&range->sru.data.lock); if (zio->io_error != 0) { abd_free(range->sru.data.abd); range->sru.data.abd = NULL; range->sru.data.io_err = zio->io_error; } ASSERT(range->sru.data.io_outstanding); range->sru.data.io_outstanding = B_FALSE; cv_broadcast(&range->sru.data.cv); mutex_exit(&range->sru.data.lock); } static void issue_data_read(struct send_reader_thread_arg *srta, struct send_range *range) { struct srd *srdp = &range->sru.data; blkptr_t *bp = &srdp->bp; objset_t *os = srta->smta->os; ASSERT3U(range->type, ==, DATA); ASSERT3U(range->start_blkid + 1, ==, range->end_blkid); /* * If we have large blocks stored on disk but * the send flags don't allow us to send large * blocks, we split the data from the arc buf * into chunks. */ boolean_t split_large_blocks = srdp->datablksz > SPA_OLD_MAXBLOCKSIZE && !(srta->featureflags & DMU_BACKUP_FEATURE_LARGE_BLOCKS); /* * We should only request compressed data from the ARC if all * the following are true: * - stream compression was requested * - we aren't splitting large blocks into smaller chunks * - the data won't need to be byteswapped before sending * - this isn't an embedded block * - this isn't metadata (if receiving on a different endian * system it can be byteswapped more easily) */ boolean_t request_compressed = (srta->featureflags & DMU_BACKUP_FEATURE_COMPRESSED) && !split_large_blocks && !BP_SHOULD_BYTESWAP(bp) && !BP_IS_EMBEDDED(bp) && !DMU_OT_IS_METADATA(BP_GET_TYPE(bp)); enum zio_flag zioflags = ZIO_FLAG_CANFAIL; if (srta->featureflags & DMU_BACKUP_FEATURE_RAW) zioflags |= ZIO_FLAG_RAW; else if (request_compressed) zioflags |= ZIO_FLAG_RAW_COMPRESS; srdp->datasz = (zioflags & ZIO_FLAG_RAW_COMPRESS) ? BP_GET_PSIZE(bp) : BP_GET_LSIZE(bp); if (!srta->issue_reads) return; if (BP_IS_REDACTED(bp)) return; if (send_do_embed(bp, srta->featureflags)) return; zbookmark_phys_t zb = { .zb_objset = dmu_objset_id(os), .zb_object = range->object, .zb_level = 0, .zb_blkid = range->start_blkid, }; arc_flags_t aflags = ARC_FLAG_CACHED_ONLY; int arc_err = arc_read(NULL, os->os_spa, bp, arc_getbuf_func, &srdp->abuf, ZIO_PRIORITY_ASYNC_READ, zioflags, &aflags, &zb); /* * If the data is not already cached in the ARC, we read directly * from zio. This avoids the performance overhead of adding a new * entry to the ARC, and we also avoid polluting the ARC cache with * data that is not likely to be used in the future. */ if (arc_err != 0) { srdp->abd = abd_alloc_linear(srdp->datasz, B_FALSE); srdp->io_outstanding = B_TRUE; zio_nowait(zio_read(NULL, os->os_spa, bp, srdp->abd, srdp->datasz, dmu_send_read_done, range, ZIO_PRIORITY_ASYNC_READ, zioflags, &zb)); } } /* * Create a new record with the given values. */ static void enqueue_range(struct send_reader_thread_arg *srta, bqueue_t *q, dnode_t *dn, uint64_t blkid, uint64_t count, const blkptr_t *bp, uint32_t datablksz) { enum type range_type = (bp == NULL || BP_IS_HOLE(bp) ? HOLE : (BP_IS_REDACTED(bp) ? REDACT : DATA)); struct send_range *range = range_alloc(range_type, dn->dn_object, blkid, blkid + count, B_FALSE); if (blkid == DMU_SPILL_BLKID) ASSERT3U(BP_GET_TYPE(bp), ==, DMU_OT_SA); switch (range_type) { case HOLE: range->sru.hole.datablksz = datablksz; break; case DATA: ASSERT3U(count, ==, 1); range->sru.data.datablksz = datablksz; range->sru.data.obj_type = dn->dn_type; range->sru.data.bp = *bp; issue_data_read(srta, range); break; case REDACT: range->sru.redact.datablksz = datablksz; break; default: break; } bqueue_enqueue(q, range, datablksz); } /* * This thread is responsible for two things: First, it retrieves the correct * blkptr in the to ds if we need to send the data because of something from * the from thread. As a result of this, we're the first ones to discover that * some indirect blocks can be discarded because they're not holes. Second, * it issues prefetches for the data we need to send. */ static void send_reader_thread(void *arg) { struct send_reader_thread_arg *srta = arg; struct send_merge_thread_arg *smta = srta->smta; bqueue_t *inq = &smta->q; bqueue_t *outq = &srta->q; objset_t *os = smta->os; fstrans_cookie_t cookie = spl_fstrans_mark(); struct send_range *range = bqueue_dequeue(inq); int err = 0; /* * If the record we're analyzing is from a redaction bookmark from the * fromds, then we need to know whether or not it exists in the tods so * we know whether to create records for it or not. If it does, we need * the datablksz so we can generate an appropriate record for it. * Finally, if it isn't redacted, we need the blkptr so that we can send * a WRITE record containing the actual data. */ uint64_t last_obj = UINT64_MAX; uint64_t last_obj_exists = B_TRUE; while (!range->eos_marker && !srta->cancel && smta->error == 0 && err == 0) { switch (range->type) { case DATA: issue_data_read(srta, range); bqueue_enqueue(outq, range, range->sru.data.datablksz); range = get_next_range_nofree(inq, range); break; case HOLE: case OBJECT: case OBJECT_RANGE: case REDACT: // Redacted blocks must exist bqueue_enqueue(outq, range, sizeof (*range)); range = get_next_range_nofree(inq, range); break; case PREVIOUSLY_REDACTED: { /* * This entry came from the "from bookmark" when * sending from a bookmark that has a redaction * list. We need to check if this object/blkid * exists in the target ("to") dataset, and if * not then we drop this entry. We also need * to fill in the block pointer so that we know * what to prefetch. * * To accomplish the above, we first cache whether or * not the last object we examined exists. If it * doesn't, we can drop this record. If it does, we hold * the dnode and use it to call dbuf_dnode_findbp. We do * this instead of dbuf_bookmark_findbp because we will * often operate on large ranges, and holding the dnode * once is more efficient. */ boolean_t object_exists = B_TRUE; /* * If the data is redacted, we only care if it exists, * so that we don't send records for objects that have * been deleted. */ dnode_t *dn; if (range->object == last_obj && !last_obj_exists) { /* * If we're still examining the same object as * previously, and it doesn't exist, we don't * need to call dbuf_bookmark_findbp. */ object_exists = B_FALSE; } else { err = dnode_hold(os, range->object, FTAG, &dn); if (err == ENOENT) { object_exists = B_FALSE; err = 0; } last_obj = range->object; last_obj_exists = object_exists; } if (err != 0) { break; } else if (!object_exists) { /* * The block was modified, but doesn't * exist in the to dataset; if it was * deleted in the to dataset, then we'll * visit the hole bp for it at some point. */ range = get_next_range(inq, range); continue; } uint64_t file_max = (dn->dn_maxblkid < range->end_blkid ? dn->dn_maxblkid : range->end_blkid); /* * The object exists, so we need to try to find the * blkptr for each block in the range we're processing. */ rw_enter(&dn->dn_struct_rwlock, RW_READER); for (uint64_t blkid = range->start_blkid; blkid < file_max; blkid++) { blkptr_t bp; uint32_t datablksz = dn->dn_phys->dn_datablkszsec << SPA_MINBLOCKSHIFT; uint64_t offset = blkid * datablksz; /* * This call finds the next non-hole block in * the object. This is to prevent a * performance problem where we're unredacting * a large hole. Using dnode_next_offset to * skip over the large hole avoids iterating * over every block in it. */ err = dnode_next_offset(dn, DNODE_FIND_HAVELOCK, &offset, 1, 1, 0); if (err == ESRCH) { offset = UINT64_MAX; err = 0; } else if (err != 0) { break; } if (offset != blkid * datablksz) { /* * if there is a hole from here * (blkid) to offset */ offset = MIN(offset, file_max * datablksz); uint64_t nblks = (offset / datablksz) - blkid; enqueue_range(srta, outq, dn, blkid, nblks, NULL, datablksz); blkid += nblks; } if (blkid >= file_max) break; err = dbuf_dnode_findbp(dn, 0, blkid, &bp, NULL, NULL); if (err != 0) break; ASSERT(!BP_IS_HOLE(&bp)); enqueue_range(srta, outq, dn, blkid, 1, &bp, datablksz); } rw_exit(&dn->dn_struct_rwlock); dnode_rele(dn, FTAG); range = get_next_range(inq, range); } } } if (srta->cancel || err != 0) { smta->cancel = B_TRUE; srta->error = err; } else if (smta->error != 0) { srta->error = smta->error; } while (!range->eos_marker) range = get_next_range(inq, range); bqueue_enqueue_flush(outq, range, 1); spl_fstrans_unmark(cookie); thread_exit(); } #define NUM_SNAPS_NOT_REDACTED UINT64_MAX struct dmu_send_params { /* Pool args */ void *tag; // Tag that dp was held with, will be used to release dp. dsl_pool_t *dp; /* To snapshot args */ const char *tosnap; dsl_dataset_t *to_ds; /* From snapshot args */ zfs_bookmark_phys_t ancestor_zb; uint64_t *fromredactsnaps; /* NUM_SNAPS_NOT_REDACTED if not sending from redaction bookmark */ uint64_t numfromredactsnaps; /* Stream params */ boolean_t is_clone; boolean_t embedok; boolean_t large_block_ok; boolean_t compressok; boolean_t rawok; boolean_t savedok; uint64_t resumeobj; uint64_t resumeoff; uint64_t saved_guid; zfs_bookmark_phys_t *redactbook; /* Stream output params */ dmu_send_outparams_t *dso; /* Stream progress params */ offset_t *off; int outfd; char saved_toname[MAXNAMELEN]; }; static int setup_featureflags(struct dmu_send_params *dspp, objset_t *os, uint64_t *featureflags) { dsl_dataset_t *to_ds = dspp->to_ds; dsl_pool_t *dp = dspp->dp; #ifdef _KERNEL if (dmu_objset_type(os) == DMU_OST_ZFS) { uint64_t version; if (zfs_get_zplprop(os, ZFS_PROP_VERSION, &version) != 0) return (SET_ERROR(EINVAL)); if (version >= ZPL_VERSION_SA) *featureflags |= DMU_BACKUP_FEATURE_SA_SPILL; } #endif /* raw sends imply large_block_ok */ if ((dspp->rawok || dspp->large_block_ok) && dsl_dataset_feature_is_active(to_ds, SPA_FEATURE_LARGE_BLOCKS)) { *featureflags |= DMU_BACKUP_FEATURE_LARGE_BLOCKS; } /* encrypted datasets will not have embedded blocks */ if ((dspp->embedok || dspp->rawok) && !os->os_encrypted && spa_feature_is_active(dp->dp_spa, SPA_FEATURE_EMBEDDED_DATA)) { *featureflags |= DMU_BACKUP_FEATURE_EMBED_DATA; } /* raw send implies compressok */ if (dspp->compressok || dspp->rawok) *featureflags |= DMU_BACKUP_FEATURE_COMPRESSED; if (dspp->rawok && os->os_encrypted) *featureflags |= DMU_BACKUP_FEATURE_RAW; if ((*featureflags & (DMU_BACKUP_FEATURE_EMBED_DATA | DMU_BACKUP_FEATURE_COMPRESSED | DMU_BACKUP_FEATURE_RAW)) != 0 && spa_feature_is_active(dp->dp_spa, SPA_FEATURE_LZ4_COMPRESS)) { *featureflags |= DMU_BACKUP_FEATURE_LZ4; } /* * We specifically do not include DMU_BACKUP_FEATURE_EMBED_DATA here to * allow sending ZSTD compressed datasets to a receiver that does not * support ZSTD */ if ((*featureflags & (DMU_BACKUP_FEATURE_COMPRESSED | DMU_BACKUP_FEATURE_RAW)) != 0 && dsl_dataset_feature_is_active(to_ds, SPA_FEATURE_ZSTD_COMPRESS)) { *featureflags |= DMU_BACKUP_FEATURE_ZSTD; } if (dspp->resumeobj != 0 || dspp->resumeoff != 0) { *featureflags |= DMU_BACKUP_FEATURE_RESUMING; } if (dspp->redactbook != NULL) { *featureflags |= DMU_BACKUP_FEATURE_REDACTED; } if (dsl_dataset_feature_is_active(to_ds, SPA_FEATURE_LARGE_DNODE)) { *featureflags |= DMU_BACKUP_FEATURE_LARGE_DNODE; } return (0); } static dmu_replay_record_t * create_begin_record(struct dmu_send_params *dspp, objset_t *os, uint64_t featureflags) { dmu_replay_record_t *drr = kmem_zalloc(sizeof (dmu_replay_record_t), KM_SLEEP); drr->drr_type = DRR_BEGIN; struct drr_begin *drrb = &drr->drr_u.drr_begin; dsl_dataset_t *to_ds = dspp->to_ds; drrb->drr_magic = DMU_BACKUP_MAGIC; drrb->drr_creation_time = dsl_dataset_phys(to_ds)->ds_creation_time; drrb->drr_type = dmu_objset_type(os); drrb->drr_toguid = dsl_dataset_phys(to_ds)->ds_guid; drrb->drr_fromguid = dspp->ancestor_zb.zbm_guid; DMU_SET_STREAM_HDRTYPE(drrb->drr_versioninfo, DMU_SUBSTREAM); DMU_SET_FEATUREFLAGS(drrb->drr_versioninfo, featureflags); if (dspp->is_clone) drrb->drr_flags |= DRR_FLAG_CLONE; if (dsl_dataset_phys(dspp->to_ds)->ds_flags & DS_FLAG_CI_DATASET) drrb->drr_flags |= DRR_FLAG_CI_DATA; if (zfs_send_set_freerecords_bit) drrb->drr_flags |= DRR_FLAG_FREERECORDS; drr->drr_u.drr_begin.drr_flags |= DRR_FLAG_SPILL_BLOCK; if (dspp->savedok) { drrb->drr_toguid = dspp->saved_guid; strlcpy(drrb->drr_toname, dspp->saved_toname, sizeof (drrb->drr_toname)); } else { dsl_dataset_name(to_ds, drrb->drr_toname); if (!to_ds->ds_is_snapshot) { (void) strlcat(drrb->drr_toname, "@--head--", sizeof (drrb->drr_toname)); } } return (drr); } static void setup_to_thread(struct send_thread_arg *to_arg, objset_t *to_os, dmu_sendstatus_t *dssp, uint64_t fromtxg, boolean_t rawok) { VERIFY0(bqueue_init(&to_arg->q, zfs_send_no_prefetch_queue_ff, MAX(zfs_send_no_prefetch_queue_length, 2 * zfs_max_recordsize), offsetof(struct send_range, ln))); to_arg->error_code = 0; to_arg->cancel = B_FALSE; to_arg->os = to_os; to_arg->fromtxg = fromtxg; to_arg->flags = TRAVERSE_PRE | TRAVERSE_PREFETCH_METADATA; if (rawok) to_arg->flags |= TRAVERSE_NO_DECRYPT; to_arg->num_blocks_visited = &dssp->dss_blocks; (void) thread_create(NULL, 0, send_traverse_thread, to_arg, 0, curproc, TS_RUN, minclsyspri); } static void setup_from_thread(struct redact_list_thread_arg *from_arg, redaction_list_t *from_rl, dmu_sendstatus_t *dssp) { VERIFY0(bqueue_init(&from_arg->q, zfs_send_no_prefetch_queue_ff, MAX(zfs_send_no_prefetch_queue_length, 2 * zfs_max_recordsize), offsetof(struct send_range, ln))); from_arg->error_code = 0; from_arg->cancel = B_FALSE; from_arg->rl = from_rl; from_arg->mark_redact = B_FALSE; from_arg->num_blocks_visited = &dssp->dss_blocks; /* * If from_ds is null, send_traverse_thread just returns success and * enqueues an eos marker. */ (void) thread_create(NULL, 0, redact_list_thread, from_arg, 0, curproc, TS_RUN, minclsyspri); } static void setup_redact_list_thread(struct redact_list_thread_arg *rlt_arg, struct dmu_send_params *dspp, redaction_list_t *rl, dmu_sendstatus_t *dssp) { if (dspp->redactbook == NULL) return; rlt_arg->cancel = B_FALSE; VERIFY0(bqueue_init(&rlt_arg->q, zfs_send_no_prefetch_queue_ff, MAX(zfs_send_no_prefetch_queue_length, 2 * zfs_max_recordsize), offsetof(struct send_range, ln))); rlt_arg->error_code = 0; rlt_arg->mark_redact = B_TRUE; rlt_arg->rl = rl; rlt_arg->num_blocks_visited = &dssp->dss_blocks; (void) thread_create(NULL, 0, redact_list_thread, rlt_arg, 0, curproc, TS_RUN, minclsyspri); } static void setup_merge_thread(struct send_merge_thread_arg *smt_arg, struct dmu_send_params *dspp, struct redact_list_thread_arg *from_arg, struct send_thread_arg *to_arg, struct redact_list_thread_arg *rlt_arg, objset_t *os) { VERIFY0(bqueue_init(&smt_arg->q, zfs_send_no_prefetch_queue_ff, MAX(zfs_send_no_prefetch_queue_length, 2 * zfs_max_recordsize), offsetof(struct send_range, ln))); smt_arg->cancel = B_FALSE; smt_arg->error = 0; smt_arg->from_arg = from_arg; smt_arg->to_arg = to_arg; if (dspp->redactbook != NULL) smt_arg->redact_arg = rlt_arg; smt_arg->os = os; (void) thread_create(NULL, 0, send_merge_thread, smt_arg, 0, curproc, TS_RUN, minclsyspri); } static void setup_reader_thread(struct send_reader_thread_arg *srt_arg, struct dmu_send_params *dspp, struct send_merge_thread_arg *smt_arg, uint64_t featureflags) { VERIFY0(bqueue_init(&srt_arg->q, zfs_send_queue_ff, MAX(zfs_send_queue_length, 2 * zfs_max_recordsize), offsetof(struct send_range, ln))); srt_arg->smta = smt_arg; srt_arg->issue_reads = !dspp->dso->dso_dryrun; srt_arg->featureflags = featureflags; (void) thread_create(NULL, 0, send_reader_thread, srt_arg, 0, curproc, TS_RUN, minclsyspri); } static int setup_resume_points(struct dmu_send_params *dspp, struct send_thread_arg *to_arg, struct redact_list_thread_arg *from_arg, struct redact_list_thread_arg *rlt_arg, struct send_merge_thread_arg *smt_arg, boolean_t resuming, objset_t *os, redaction_list_t *redact_rl, nvlist_t *nvl) { dsl_dataset_t *to_ds = dspp->to_ds; int err = 0; uint64_t obj = 0; uint64_t blkid = 0; if (resuming) { obj = dspp->resumeobj; dmu_object_info_t to_doi; err = dmu_object_info(os, obj, &to_doi); if (err != 0) return (err); blkid = dspp->resumeoff / to_doi.doi_data_block_size; } /* * If we're resuming a redacted send, we can skip to the appropriate * point in the redaction bookmark by binary searching through it. */ if (redact_rl != NULL) { SET_BOOKMARK(&rlt_arg->resume, to_ds->ds_object, obj, 0, blkid); } SET_BOOKMARK(&to_arg->resume, to_ds->ds_object, obj, 0, blkid); if (nvlist_exists(nvl, BEGINNV_REDACT_FROM_SNAPS)) { uint64_t objset = dspp->ancestor_zb.zbm_redaction_obj; /* * Note: If the resume point is in an object whose * blocksize is different in the from vs to snapshots, * we will have divided by the "wrong" blocksize. * However, in this case fromsnap's send_cb() will * detect that the blocksize has changed and therefore * ignore this object. * * If we're resuming a send from a redaction bookmark, * we still cannot accidentally suggest blocks behind * the to_ds. In addition, we know that any blocks in * the object in the to_ds will have to be sent, since * the size changed. Therefore, we can't cause any harm * this way either. */ SET_BOOKMARK(&from_arg->resume, objset, obj, 0, blkid); } if (resuming) { fnvlist_add_uint64(nvl, BEGINNV_RESUME_OBJECT, dspp->resumeobj); fnvlist_add_uint64(nvl, BEGINNV_RESUME_OFFSET, dspp->resumeoff); } return (0); } static dmu_sendstatus_t * setup_send_progress(struct dmu_send_params *dspp) { dmu_sendstatus_t *dssp = kmem_zalloc(sizeof (*dssp), KM_SLEEP); dssp->dss_outfd = dspp->outfd; dssp->dss_off = dspp->off; dssp->dss_proc = curproc; mutex_enter(&dspp->to_ds->ds_sendstream_lock); list_insert_head(&dspp->to_ds->ds_sendstreams, dssp); mutex_exit(&dspp->to_ds->ds_sendstream_lock); return (dssp); } /* * Actually do the bulk of the work in a zfs send. * * The idea is that we want to do a send from ancestor_zb to to_ds. We also * want to not send any data that has been modified by all the datasets in * redactsnaparr, and store the list of blocks that are redacted in this way in * a bookmark named redactbook, created on the to_ds. We do this by creating * several worker threads, whose function is described below. * * There are three cases. * The first case is a redacted zfs send. In this case there are 5 threads. * The first thread is the to_ds traversal thread: it calls dataset_traverse on * the to_ds and finds all the blocks that have changed since ancestor_zb (if * it's a full send, that's all blocks in the dataset). It then sends those * blocks on to the send merge thread. The redact list thread takes the data * from the redaction bookmark and sends those blocks on to the send merge * thread. The send merge thread takes the data from the to_ds traversal * thread, and combines it with the redaction records from the redact list * thread. If a block appears in both the to_ds's data and the redaction data, * the send merge thread will mark it as redacted and send it on to the prefetch * thread. Otherwise, the send merge thread will send the block on to the * prefetch thread unchanged. The prefetch thread will issue prefetch reads for * any data that isn't redacted, and then send the data on to the main thread. * The main thread behaves the same as in a normal send case, issuing demand * reads for data blocks and sending out records over the network * * The graphic below diagrams the flow of data in the case of a redacted zfs * send. Each box represents a thread, and each line represents the flow of * data. * * Records from the | * redaction bookmark | * +--------------------+ | +---------------------------+ * | | v | Send Merge Thread | * | Redact List Thread +----------> Apply redaction marks to | * | | | records as specified by | * +--------------------+ | redaction ranges | * +----^---------------+------+ * | | Merged data * | | * | +------------v--------+ * | | Prefetch Thread | * +--------------------+ | | Issues prefetch | * | to_ds Traversal | | | reads of data blocks| * | Thread (finds +---------------+ +------------+--------+ * | candidate blocks) | Blocks modified | Prefetched data * +--------------------+ by to_ds since | * ancestor_zb +------------v----+ * | Main Thread | File Descriptor * | Sends data over +->(to zfs receive) * | wire | * +-----------------+ * * The second case is an incremental send from a redaction bookmark. The to_ds * traversal thread and the main thread behave the same as in the redacted * send case. The new thread is the from bookmark traversal thread. It * iterates over the redaction list in the redaction bookmark, and enqueues * records for each block that was redacted in the original send. The send * merge thread now has to merge the data from the two threads. For details * about that process, see the header comment of send_merge_thread(). Any data * it decides to send on will be prefetched by the prefetch thread. Note that * you can perform a redacted send from a redaction bookmark; in that case, * the data flow behaves very similarly to the flow in the redacted send case, * except with the addition of the bookmark traversal thread iterating over the * redaction bookmark. The send_merge_thread also has to take on the * responsibility of merging the redact list thread's records, the bookmark * traversal thread's records, and the to_ds records. * * +---------------------+ * | | * | Redact List Thread +--------------+ * | | | * +---------------------+ | * Blocks in redaction list | Ranges modified by every secure snap * of from bookmark | (or EOS if not readcted) * | * +---------------------+ | +----v----------------------+ * | bookmark Traversal | v | Send Merge Thread | * | Thread (finds +---------> Merges bookmark, rlt, and | * | candidate blocks) | | to_ds send records | * +---------------------+ +----^---------------+------+ * | | Merged data * | +------------v--------+ * | | Prefetch Thread | * +--------------------+ | | Issues prefetch | * | to_ds Traversal | | | reads of data blocks| * | Thread (finds +---------------+ +------------+--------+ * | candidate blocks) | Blocks modified | Prefetched data * +--------------------+ by to_ds since +------------v----+ * ancestor_zb | Main Thread | File Descriptor * | Sends data over +->(to zfs receive) * | wire | * +-----------------+ * * The final case is a simple zfs full or incremental send. The to_ds traversal * thread behaves the same as always. The redact list thread is never started. * The send merge thread takes all the blocks that the to_ds traversal thread * sends it, prefetches the data, and sends the blocks on to the main thread. * The main thread sends the data over the wire. * * To keep performance acceptable, we want to prefetch the data in the worker * threads. While the to_ds thread could simply use the TRAVERSE_PREFETCH * feature built into traverse_dataset, the combining and deletion of records * due to redaction and sends from redaction bookmarks mean that we could * issue many unnecessary prefetches. As a result, we only prefetch data * after we've determined that the record is not going to be redacted. To * prevent the prefetching from getting too far ahead of the main thread, the * blocking queues that are used for communication are capped not by the * number of entries in the queue, but by the sum of the size of the * prefetches associated with them. The limit on the amount of data that the * thread can prefetch beyond what the main thread has reached is controlled * by the global variable zfs_send_queue_length. In addition, to prevent poor * performance in the beginning of a send, we also limit the distance ahead * that the traversal threads can be. That distance is controlled by the * zfs_send_no_prefetch_queue_length tunable. * * Note: Releases dp using the specified tag. */ static int dmu_send_impl(struct dmu_send_params *dspp) { objset_t *os; dmu_replay_record_t *drr; dmu_sendstatus_t *dssp; dmu_send_cookie_t dsc = {0}; int err; uint64_t fromtxg = dspp->ancestor_zb.zbm_creation_txg; uint64_t featureflags = 0; struct redact_list_thread_arg *from_arg; struct send_thread_arg *to_arg; struct redact_list_thread_arg *rlt_arg; struct send_merge_thread_arg *smt_arg; struct send_reader_thread_arg *srt_arg; struct send_range *range; redaction_list_t *from_rl = NULL; redaction_list_t *redact_rl = NULL; boolean_t resuming = (dspp->resumeobj != 0 || dspp->resumeoff != 0); boolean_t book_resuming = resuming; dsl_dataset_t *to_ds = dspp->to_ds; zfs_bookmark_phys_t *ancestor_zb = &dspp->ancestor_zb; dsl_pool_t *dp = dspp->dp; void *tag = dspp->tag; err = dmu_objset_from_ds(to_ds, &os); if (err != 0) { dsl_pool_rele(dp, tag); return (err); } /* * If this is a non-raw send of an encrypted ds, we can ensure that * the objset_phys_t is authenticated. This is safe because this is * either a snapshot or we have owned the dataset, ensuring that * it can't be modified. */ if (!dspp->rawok && os->os_encrypted && arc_is_unauthenticated(os->os_phys_buf)) { zbookmark_phys_t zb; SET_BOOKMARK(&zb, to_ds->ds_object, ZB_ROOT_OBJECT, ZB_ROOT_LEVEL, ZB_ROOT_BLKID); err = arc_untransform(os->os_phys_buf, os->os_spa, &zb, B_FALSE); if (err != 0) { dsl_pool_rele(dp, tag); return (err); } ASSERT0(arc_is_unauthenticated(os->os_phys_buf)); } if ((err = setup_featureflags(dspp, os, &featureflags)) != 0) { dsl_pool_rele(dp, tag); return (err); } /* * If we're doing a redacted send, hold the bookmark's redaction list. */ if (dspp->redactbook != NULL) { err = dsl_redaction_list_hold_obj(dp, dspp->redactbook->zbm_redaction_obj, FTAG, &redact_rl); if (err != 0) { dsl_pool_rele(dp, tag); return (SET_ERROR(EINVAL)); } dsl_redaction_list_long_hold(dp, redact_rl, FTAG); } /* * If we're sending from a redaction bookmark, hold the redaction list * so that we can consider sending the redacted blocks. */ if (ancestor_zb->zbm_redaction_obj != 0) { err = dsl_redaction_list_hold_obj(dp, ancestor_zb->zbm_redaction_obj, FTAG, &from_rl); if (err != 0) { if (redact_rl != NULL) { dsl_redaction_list_long_rele(redact_rl, FTAG); dsl_redaction_list_rele(redact_rl, FTAG); } dsl_pool_rele(dp, tag); return (SET_ERROR(EINVAL)); } dsl_redaction_list_long_hold(dp, from_rl, FTAG); } dsl_dataset_long_hold(to_ds, FTAG); from_arg = kmem_zalloc(sizeof (*from_arg), KM_SLEEP); to_arg = kmem_zalloc(sizeof (*to_arg), KM_SLEEP); rlt_arg = kmem_zalloc(sizeof (*rlt_arg), KM_SLEEP); smt_arg = kmem_zalloc(sizeof (*smt_arg), KM_SLEEP); srt_arg = kmem_zalloc(sizeof (*srt_arg), KM_SLEEP); drr = create_begin_record(dspp, os, featureflags); dssp = setup_send_progress(dspp); dsc.dsc_drr = drr; dsc.dsc_dso = dspp->dso; dsc.dsc_os = os; dsc.dsc_off = dspp->off; dsc.dsc_toguid = dsl_dataset_phys(to_ds)->ds_guid; dsc.dsc_fromtxg = fromtxg; dsc.dsc_pending_op = PENDING_NONE; dsc.dsc_featureflags = featureflags; dsc.dsc_resume_object = dspp->resumeobj; dsc.dsc_resume_offset = dspp->resumeoff; dsl_pool_rele(dp, tag); void *payload = NULL; size_t payload_len = 0; nvlist_t *nvl = fnvlist_alloc(); /* * If we're doing a redacted send, we include the snapshots we're * redacted with respect to so that the target system knows what send * streams can be correctly received on top of this dataset. If we're * instead sending a redacted dataset, we include the snapshots that the * dataset was created with respect to. */ if (dspp->redactbook != NULL) { fnvlist_add_uint64_array(nvl, BEGINNV_REDACT_SNAPS, redact_rl->rl_phys->rlp_snaps, redact_rl->rl_phys->rlp_num_snaps); } else if (dsl_dataset_feature_is_active(to_ds, SPA_FEATURE_REDACTED_DATASETS)) { uint64_t *tods_guids; uint64_t length; VERIFY(dsl_dataset_get_uint64_array_feature(to_ds, SPA_FEATURE_REDACTED_DATASETS, &length, &tods_guids)); fnvlist_add_uint64_array(nvl, BEGINNV_REDACT_SNAPS, tods_guids, length); } /* * If we're sending from a redaction bookmark, then we should retrieve * the guids of that bookmark so we can send them over the wire. */ if (from_rl != NULL) { fnvlist_add_uint64_array(nvl, BEGINNV_REDACT_FROM_SNAPS, from_rl->rl_phys->rlp_snaps, from_rl->rl_phys->rlp_num_snaps); } /* * If the snapshot we're sending from is redacted, include the redaction * list in the stream. */ if (dspp->numfromredactsnaps != NUM_SNAPS_NOT_REDACTED) { ASSERT3P(from_rl, ==, NULL); fnvlist_add_uint64_array(nvl, BEGINNV_REDACT_FROM_SNAPS, dspp->fromredactsnaps, (uint_t)dspp->numfromredactsnaps); if (dspp->numfromredactsnaps > 0) { kmem_free(dspp->fromredactsnaps, dspp->numfromredactsnaps * sizeof (uint64_t)); dspp->fromredactsnaps = NULL; } } if (resuming || book_resuming) { err = setup_resume_points(dspp, to_arg, from_arg, rlt_arg, smt_arg, resuming, os, redact_rl, nvl); if (err != 0) goto out; } if (featureflags & DMU_BACKUP_FEATURE_RAW) { uint64_t ivset_guid = (ancestor_zb != NULL) ? ancestor_zb->zbm_ivset_guid : 0; nvlist_t *keynvl = NULL; ASSERT(os->os_encrypted); err = dsl_crypto_populate_key_nvlist(os, ivset_guid, &keynvl); if (err != 0) { fnvlist_free(nvl); goto out; } fnvlist_add_nvlist(nvl, "crypt_keydata", keynvl); fnvlist_free(keynvl); } if (!nvlist_empty(nvl)) { payload = fnvlist_pack(nvl, &payload_len); drr->drr_payloadlen = payload_len; } fnvlist_free(nvl); err = dump_record(&dsc, payload, payload_len); fnvlist_pack_free(payload, payload_len); if (err != 0) { err = dsc.dsc_err; goto out; } setup_to_thread(to_arg, os, dssp, fromtxg, dspp->rawok); setup_from_thread(from_arg, from_rl, dssp); setup_redact_list_thread(rlt_arg, dspp, redact_rl, dssp); setup_merge_thread(smt_arg, dspp, from_arg, to_arg, rlt_arg, os); setup_reader_thread(srt_arg, dspp, smt_arg, featureflags); range = bqueue_dequeue(&srt_arg->q); while (err == 0 && !range->eos_marker) { err = do_dump(&dsc, range); range = get_next_range(&srt_arg->q, range); if (issig(JUSTLOOKING) && issig(FORREAL)) err = SET_ERROR(EINTR); } /* * If we hit an error or are interrupted, cancel our worker threads and * clear the queue of any pending records. The threads will pass the * cancel up the tree of worker threads, and each one will clean up any * pending records before exiting. */ if (err != 0) { srt_arg->cancel = B_TRUE; while (!range->eos_marker) { range = get_next_range(&srt_arg->q, range); } } range_free(range); bqueue_destroy(&srt_arg->q); bqueue_destroy(&smt_arg->q); if (dspp->redactbook != NULL) bqueue_destroy(&rlt_arg->q); bqueue_destroy(&to_arg->q); bqueue_destroy(&from_arg->q); if (err == 0 && srt_arg->error != 0) err = srt_arg->error; if (err != 0) goto out; if (dsc.dsc_pending_op != PENDING_NONE) if (dump_record(&dsc, NULL, 0) != 0) err = SET_ERROR(EINTR); if (err != 0) { if (err == EINTR && dsc.dsc_err != 0) err = dsc.dsc_err; goto out; } /* * Send the DRR_END record if this is not a saved stream. * Otherwise, the omitted DRR_END record will signal to * the receive side that the stream is incomplete. */ if (!dspp->savedok) { bzero(drr, sizeof (dmu_replay_record_t)); drr->drr_type = DRR_END; drr->drr_u.drr_end.drr_checksum = dsc.dsc_zc; drr->drr_u.drr_end.drr_toguid = dsc.dsc_toguid; if (dump_record(&dsc, NULL, 0) != 0) err = dsc.dsc_err; } out: mutex_enter(&to_ds->ds_sendstream_lock); list_remove(&to_ds->ds_sendstreams, dssp); mutex_exit(&to_ds->ds_sendstream_lock); VERIFY(err != 0 || (dsc.dsc_sent_begin && (dsc.dsc_sent_end || dspp->savedok))); kmem_free(drr, sizeof (dmu_replay_record_t)); kmem_free(dssp, sizeof (dmu_sendstatus_t)); kmem_free(from_arg, sizeof (*from_arg)); kmem_free(to_arg, sizeof (*to_arg)); kmem_free(rlt_arg, sizeof (*rlt_arg)); kmem_free(smt_arg, sizeof (*smt_arg)); kmem_free(srt_arg, sizeof (*srt_arg)); dsl_dataset_long_rele(to_ds, FTAG); if (from_rl != NULL) { dsl_redaction_list_long_rele(from_rl, FTAG); dsl_redaction_list_rele(from_rl, FTAG); } if (redact_rl != NULL) { dsl_redaction_list_long_rele(redact_rl, FTAG); dsl_redaction_list_rele(redact_rl, FTAG); } return (err); } int dmu_send_obj(const char *pool, uint64_t tosnap, uint64_t fromsnap, boolean_t embedok, boolean_t large_block_ok, boolean_t compressok, boolean_t rawok, boolean_t savedok, int outfd, offset_t *off, dmu_send_outparams_t *dsop) { int err; dsl_dataset_t *fromds; ds_hold_flags_t dsflags = (rawok) ? 0 : DS_HOLD_FLAG_DECRYPT; struct dmu_send_params dspp = {0}; dspp.embedok = embedok; dspp.large_block_ok = large_block_ok; dspp.compressok = compressok; dspp.outfd = outfd; dspp.off = off; dspp.dso = dsop; dspp.tag = FTAG; dspp.rawok = rawok; dspp.savedok = savedok; err = dsl_pool_hold(pool, FTAG, &dspp.dp); if (err != 0) return (err); err = dsl_dataset_hold_obj_flags(dspp.dp, tosnap, dsflags, FTAG, &dspp.to_ds); if (err != 0) { dsl_pool_rele(dspp.dp, FTAG); return (err); } if (fromsnap != 0) { err = dsl_dataset_hold_obj_flags(dspp.dp, fromsnap, dsflags, FTAG, &fromds); if (err != 0) { dsl_dataset_rele_flags(dspp.to_ds, dsflags, FTAG); dsl_pool_rele(dspp.dp, FTAG); return (err); } dspp.ancestor_zb.zbm_guid = dsl_dataset_phys(fromds)->ds_guid; dspp.ancestor_zb.zbm_creation_txg = dsl_dataset_phys(fromds)->ds_creation_txg; dspp.ancestor_zb.zbm_creation_time = dsl_dataset_phys(fromds)->ds_creation_time; if (dsl_dataset_is_zapified(fromds)) { (void) zap_lookup(dspp.dp->dp_meta_objset, fromds->ds_object, DS_FIELD_IVSET_GUID, 8, 1, &dspp.ancestor_zb.zbm_ivset_guid); } /* See dmu_send for the reasons behind this. */ uint64_t *fromredact; if (!dsl_dataset_get_uint64_array_feature(fromds, SPA_FEATURE_REDACTED_DATASETS, &dspp.numfromredactsnaps, &fromredact)) { dspp.numfromredactsnaps = NUM_SNAPS_NOT_REDACTED; } else if (dspp.numfromredactsnaps > 0) { uint64_t size = dspp.numfromredactsnaps * sizeof (uint64_t); dspp.fromredactsnaps = kmem_zalloc(size, KM_SLEEP); bcopy(fromredact, dspp.fromredactsnaps, size); } - if (!dsl_dataset_is_before(dspp.to_ds, fromds, 0)) { + boolean_t is_before = + dsl_dataset_is_before(dspp.to_ds, fromds, 0); + dspp.is_clone = (dspp.to_ds->ds_dir != + fromds->ds_dir); + dsl_dataset_rele(fromds, FTAG); + if (!is_before) { + dsl_pool_rele(dspp.dp, FTAG); err = SET_ERROR(EXDEV); } else { - dspp.is_clone = (dspp.to_ds->ds_dir != - fromds->ds_dir); - dsl_dataset_rele(fromds, FTAG); err = dmu_send_impl(&dspp); } } else { dspp.numfromredactsnaps = NUM_SNAPS_NOT_REDACTED; err = dmu_send_impl(&dspp); } dsl_dataset_rele(dspp.to_ds, FTAG); return (err); } int dmu_send(const char *tosnap, const char *fromsnap, boolean_t embedok, boolean_t large_block_ok, boolean_t compressok, boolean_t rawok, boolean_t savedok, uint64_t resumeobj, uint64_t resumeoff, const char *redactbook, int outfd, offset_t *off, dmu_send_outparams_t *dsop) { int err = 0; ds_hold_flags_t dsflags = (rawok) ? 0 : DS_HOLD_FLAG_DECRYPT; boolean_t owned = B_FALSE; dsl_dataset_t *fromds = NULL; zfs_bookmark_phys_t book = {0}; struct dmu_send_params dspp = {0}; dspp.tosnap = tosnap; dspp.embedok = embedok; dspp.large_block_ok = large_block_ok; dspp.compressok = compressok; dspp.outfd = outfd; dspp.off = off; dspp.dso = dsop; dspp.tag = FTAG; dspp.resumeobj = resumeobj; dspp.resumeoff = resumeoff; dspp.rawok = rawok; dspp.savedok = savedok; if (fromsnap != NULL && strpbrk(fromsnap, "@#") == NULL) return (SET_ERROR(EINVAL)); err = dsl_pool_hold(tosnap, FTAG, &dspp.dp); if (err != 0) return (err); if (strchr(tosnap, '@') == NULL && spa_writeable(dspp.dp->dp_spa)) { /* * We are sending a filesystem or volume. Ensure * that it doesn't change by owning the dataset. */ if (savedok) { /* * We are looking for the dataset that represents the * partially received send stream. If this stream was * received as a new snapshot of an existing dataset, * this will be saved in a hidden clone named * "//%recv". Otherwise, the stream * will be saved in the live dataset itself. In * either case we need to use dsl_dataset_own_force() * because the stream is marked as inconsistent, * which would normally make it unavailable to be * owned. */ char *name = kmem_asprintf("%s/%s", tosnap, recv_clone_name); err = dsl_dataset_own_force(dspp.dp, name, dsflags, FTAG, &dspp.to_ds); if (err == ENOENT) { err = dsl_dataset_own_force(dspp.dp, tosnap, dsflags, FTAG, &dspp.to_ds); } if (err == 0) { err = zap_lookup(dspp.dp->dp_meta_objset, dspp.to_ds->ds_object, DS_FIELD_RESUME_TOGUID, 8, 1, &dspp.saved_guid); } if (err == 0) { err = zap_lookup(dspp.dp->dp_meta_objset, dspp.to_ds->ds_object, DS_FIELD_RESUME_TONAME, 1, sizeof (dspp.saved_toname), dspp.saved_toname); } if (err != 0) dsl_dataset_disown(dspp.to_ds, dsflags, FTAG); kmem_strfree(name); } else { err = dsl_dataset_own(dspp.dp, tosnap, dsflags, FTAG, &dspp.to_ds); } owned = B_TRUE; } else { err = dsl_dataset_hold_flags(dspp.dp, tosnap, dsflags, FTAG, &dspp.to_ds); } if (err != 0) { dsl_pool_rele(dspp.dp, FTAG); return (err); } if (redactbook != NULL) { char path[ZFS_MAX_DATASET_NAME_LEN]; (void) strlcpy(path, tosnap, sizeof (path)); char *at = strchr(path, '@'); if (at == NULL) { err = EINVAL; } else { (void) snprintf(at, sizeof (path) - (at - path), "#%s", redactbook); err = dsl_bookmark_lookup(dspp.dp, path, NULL, &book); dspp.redactbook = &book; } } if (err != 0) { dsl_pool_rele(dspp.dp, FTAG); if (owned) dsl_dataset_disown(dspp.to_ds, dsflags, FTAG); else dsl_dataset_rele_flags(dspp.to_ds, dsflags, FTAG); return (err); } if (fromsnap != NULL) { zfs_bookmark_phys_t *zb = &dspp.ancestor_zb; int fsnamelen; if (strpbrk(tosnap, "@#") != NULL) fsnamelen = strpbrk(tosnap, "@#") - tosnap; else fsnamelen = strlen(tosnap); /* * If the fromsnap is in a different filesystem, then * mark the send stream as a clone. */ if (strncmp(tosnap, fromsnap, fsnamelen) != 0 || (fromsnap[fsnamelen] != '@' && fromsnap[fsnamelen] != '#')) { dspp.is_clone = B_TRUE; } if (strchr(fromsnap, '@') != NULL) { err = dsl_dataset_hold(dspp.dp, fromsnap, FTAG, &fromds); if (err != 0) { ASSERT3P(fromds, ==, NULL); } else { /* * We need to make a deep copy of the redact * snapshots of the from snapshot, because the * array will be freed when we evict from_ds. */ uint64_t *fromredact; if (!dsl_dataset_get_uint64_array_feature( fromds, SPA_FEATURE_REDACTED_DATASETS, &dspp.numfromredactsnaps, &fromredact)) { dspp.numfromredactsnaps = NUM_SNAPS_NOT_REDACTED; } else if (dspp.numfromredactsnaps > 0) { uint64_t size = dspp.numfromredactsnaps * sizeof (uint64_t); dspp.fromredactsnaps = kmem_zalloc(size, KM_SLEEP); bcopy(fromredact, dspp.fromredactsnaps, size); } if (!dsl_dataset_is_before(dspp.to_ds, fromds, 0)) { err = SET_ERROR(EXDEV); } else { zb->zbm_creation_txg = dsl_dataset_phys(fromds)-> ds_creation_txg; zb->zbm_creation_time = dsl_dataset_phys(fromds)-> ds_creation_time; zb->zbm_guid = dsl_dataset_phys(fromds)->ds_guid; zb->zbm_redaction_obj = 0; if (dsl_dataset_is_zapified(fromds)) { (void) zap_lookup( dspp.dp->dp_meta_objset, fromds->ds_object, DS_FIELD_IVSET_GUID, 8, 1, &zb->zbm_ivset_guid); } } dsl_dataset_rele(fromds, FTAG); } } else { dspp.numfromredactsnaps = NUM_SNAPS_NOT_REDACTED; err = dsl_bookmark_lookup(dspp.dp, fromsnap, dspp.to_ds, zb); if (err == EXDEV && zb->zbm_redaction_obj != 0 && zb->zbm_guid == dsl_dataset_phys(dspp.to_ds)->ds_guid) err = 0; } if (err == 0) { /* dmu_send_impl will call dsl_pool_rele for us. */ err = dmu_send_impl(&dspp); } else { dsl_pool_rele(dspp.dp, FTAG); } } else { dspp.numfromredactsnaps = NUM_SNAPS_NOT_REDACTED; err = dmu_send_impl(&dspp); } if (owned) dsl_dataset_disown(dspp.to_ds, dsflags, FTAG); else dsl_dataset_rele_flags(dspp.to_ds, dsflags, FTAG); return (err); } static int dmu_adjust_send_estimate_for_indirects(dsl_dataset_t *ds, uint64_t uncompressed, uint64_t compressed, boolean_t stream_compressed, uint64_t *sizep) { int err = 0; uint64_t size; /* * Assume that space (both on-disk and in-stream) is dominated by * data. We will adjust for indirect blocks and the copies property, * but ignore per-object space used (eg, dnodes and DRR_OBJECT records). */ uint64_t recordsize; uint64_t record_count; objset_t *os; VERIFY0(dmu_objset_from_ds(ds, &os)); /* Assume all (uncompressed) blocks are recordsize. */ if (zfs_override_estimate_recordsize != 0) { recordsize = zfs_override_estimate_recordsize; } else if (os->os_phys->os_type == DMU_OST_ZVOL) { err = dsl_prop_get_int_ds(ds, zfs_prop_to_name(ZFS_PROP_VOLBLOCKSIZE), &recordsize); } else { err = dsl_prop_get_int_ds(ds, zfs_prop_to_name(ZFS_PROP_RECORDSIZE), &recordsize); } if (err != 0) return (err); record_count = uncompressed / recordsize; /* * If we're estimating a send size for a compressed stream, use the * compressed data size to estimate the stream size. Otherwise, use the * uncompressed data size. */ size = stream_compressed ? compressed : uncompressed; /* * Subtract out approximate space used by indirect blocks. * Assume most space is used by data blocks (non-indirect, non-dnode). * Assume no ditto blocks or internal fragmentation. * * Therefore, space used by indirect blocks is sizeof(blkptr_t) per * block. */ size -= record_count * sizeof (blkptr_t); /* Add in the space for the record associated with each block. */ size += record_count * sizeof (dmu_replay_record_t); *sizep = size; return (0); } int dmu_send_estimate_fast(dsl_dataset_t *origds, dsl_dataset_t *fromds, zfs_bookmark_phys_t *frombook, boolean_t stream_compressed, boolean_t saved, uint64_t *sizep) { int err; dsl_dataset_t *ds = origds; uint64_t uncomp, comp; ASSERT(dsl_pool_config_held(origds->ds_dir->dd_pool)); ASSERT(fromds == NULL || frombook == NULL); /* * If this is a saved send we may actually be sending * from the %recv clone used for resuming. */ if (saved) { objset_t *mos = origds->ds_dir->dd_pool->dp_meta_objset; uint64_t guid; char dsname[ZFS_MAX_DATASET_NAME_LEN + 6]; dsl_dataset_name(origds, dsname); (void) strcat(dsname, "/"); (void) strcat(dsname, recv_clone_name); err = dsl_dataset_hold(origds->ds_dir->dd_pool, dsname, FTAG, &ds); if (err != ENOENT && err != 0) { return (err); } else if (err == ENOENT) { ds = origds; } /* check that this dataset has partially received data */ err = zap_lookup(mos, ds->ds_object, DS_FIELD_RESUME_TOGUID, 8, 1, &guid); if (err != 0) { err = SET_ERROR(err == ENOENT ? EINVAL : err); goto out; } err = zap_lookup(mos, ds->ds_object, DS_FIELD_RESUME_TONAME, 1, sizeof (dsname), dsname); if (err != 0) { err = SET_ERROR(err == ENOENT ? EINVAL : err); goto out; } } /* tosnap must be a snapshot or the target of a saved send */ if (!ds->ds_is_snapshot && ds == origds) return (SET_ERROR(EINVAL)); if (fromds != NULL) { uint64_t used; if (!fromds->ds_is_snapshot) { err = SET_ERROR(EINVAL); goto out; } if (!dsl_dataset_is_before(ds, fromds, 0)) { err = SET_ERROR(EXDEV); goto out; } err = dsl_dataset_space_written(fromds, ds, &used, &comp, &uncomp); if (err != 0) goto out; } else if (frombook != NULL) { uint64_t used; err = dsl_dataset_space_written_bookmark(frombook, ds, &used, &comp, &uncomp); if (err != 0) goto out; } else { uncomp = dsl_dataset_phys(ds)->ds_uncompressed_bytes; comp = dsl_dataset_phys(ds)->ds_compressed_bytes; } err = dmu_adjust_send_estimate_for_indirects(ds, uncomp, comp, stream_compressed, sizep); /* * Add the size of the BEGIN and END records to the estimate. */ *sizep += 2 * sizeof (dmu_replay_record_t); out: if (ds != origds) dsl_dataset_rele(ds, FTAG); return (err); } /* BEGIN CSTYLED */ ZFS_MODULE_PARAM(zfs_send, zfs_send_, corrupt_data, INT, ZMOD_RW, "Allow sending corrupt data"); ZFS_MODULE_PARAM(zfs_send, zfs_send_, queue_length, INT, ZMOD_RW, "Maximum send queue length"); ZFS_MODULE_PARAM(zfs_send, zfs_send_, unmodified_spill_blocks, INT, ZMOD_RW, "Send unmodified spill blocks"); ZFS_MODULE_PARAM(zfs_send, zfs_send_, no_prefetch_queue_length, INT, ZMOD_RW, "Maximum send queue length for non-prefetch queues"); ZFS_MODULE_PARAM(zfs_send, zfs_send_, queue_ff, INT, ZMOD_RW, "Send queue fill fraction"); ZFS_MODULE_PARAM(zfs_send, zfs_send_, no_prefetch_queue_ff, INT, ZMOD_RW, "Send queue fill fraction for non-prefetch queues"); ZFS_MODULE_PARAM(zfs_send, zfs_, override_estimate_recordsize, INT, ZMOD_RW, "Override block size estimate with fixed size"); /* END CSTYLED */ Index: vendor-sys/openzfs/dist/module/zfs/dnode.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/dnode.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/dnode.c (revision 366348) @@ -1,2578 +1,2580 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2012, 2019 by Delphix. All rights reserved. * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include dnode_stats_t dnode_stats = { { "dnode_hold_dbuf_hold", KSTAT_DATA_UINT64 }, { "dnode_hold_dbuf_read", KSTAT_DATA_UINT64 }, { "dnode_hold_alloc_hits", KSTAT_DATA_UINT64 }, { "dnode_hold_alloc_misses", KSTAT_DATA_UINT64 }, { "dnode_hold_alloc_interior", KSTAT_DATA_UINT64 }, { "dnode_hold_alloc_lock_retry", KSTAT_DATA_UINT64 }, { "dnode_hold_alloc_lock_misses", KSTAT_DATA_UINT64 }, { "dnode_hold_alloc_type_none", KSTAT_DATA_UINT64 }, { "dnode_hold_free_hits", KSTAT_DATA_UINT64 }, { "dnode_hold_free_misses", KSTAT_DATA_UINT64 }, { "dnode_hold_free_lock_misses", KSTAT_DATA_UINT64 }, { "dnode_hold_free_lock_retry", KSTAT_DATA_UINT64 }, { "dnode_hold_free_overflow", KSTAT_DATA_UINT64 }, { "dnode_hold_free_refcount", KSTAT_DATA_UINT64 }, { "dnode_free_interior_lock_retry", KSTAT_DATA_UINT64 }, { "dnode_allocate", KSTAT_DATA_UINT64 }, { "dnode_reallocate", KSTAT_DATA_UINT64 }, { "dnode_buf_evict", KSTAT_DATA_UINT64 }, { "dnode_alloc_next_chunk", KSTAT_DATA_UINT64 }, { "dnode_alloc_race", KSTAT_DATA_UINT64 }, { "dnode_alloc_next_block", KSTAT_DATA_UINT64 }, { "dnode_move_invalid", KSTAT_DATA_UINT64 }, { "dnode_move_recheck1", KSTAT_DATA_UINT64 }, { "dnode_move_recheck2", KSTAT_DATA_UINT64 }, { "dnode_move_special", KSTAT_DATA_UINT64 }, { "dnode_move_handle", KSTAT_DATA_UINT64 }, { "dnode_move_rwlock", KSTAT_DATA_UINT64 }, { "dnode_move_active", KSTAT_DATA_UINT64 }, }; static kstat_t *dnode_ksp; static kmem_cache_t *dnode_cache; static dnode_phys_t dnode_phys_zero __maybe_unused; int zfs_default_bs = SPA_MINBLOCKSHIFT; int zfs_default_ibs = DN_MAX_INDBLKSHIFT; #ifdef _KERNEL static kmem_cbrc_t dnode_move(void *, void *, size_t, void *); #endif /* _KERNEL */ static int dbuf_compare(const void *x1, const void *x2) { const dmu_buf_impl_t *d1 = x1; const dmu_buf_impl_t *d2 = x2; int cmp = TREE_CMP(d1->db_level, d2->db_level); if (likely(cmp)) return (cmp); cmp = TREE_CMP(d1->db_blkid, d2->db_blkid); if (likely(cmp)) return (cmp); if (d1->db_state == DB_SEARCH) { ASSERT3S(d2->db_state, !=, DB_SEARCH); return (-1); } else if (d2->db_state == DB_SEARCH) { ASSERT3S(d1->db_state, !=, DB_SEARCH); return (1); } return (TREE_PCMP(d1, d2)); } /* ARGSUSED */ static int dnode_cons(void *arg, void *unused, int kmflag) { dnode_t *dn = arg; int i; rw_init(&dn->dn_struct_rwlock, NULL, RW_NOLOCKDEP, NULL); mutex_init(&dn->dn_mtx, NULL, MUTEX_DEFAULT, NULL); mutex_init(&dn->dn_dbufs_mtx, NULL, MUTEX_DEFAULT, NULL); cv_init(&dn->dn_notxholds, NULL, CV_DEFAULT, NULL); cv_init(&dn->dn_nodnholds, NULL, CV_DEFAULT, NULL); /* * Every dbuf has a reference, and dropping a tracked reference is * O(number of references), so don't track dn_holds. */ zfs_refcount_create_untracked(&dn->dn_holds); zfs_refcount_create(&dn->dn_tx_holds); list_link_init(&dn->dn_link); bzero(&dn->dn_next_nblkptr[0], sizeof (dn->dn_next_nblkptr)); bzero(&dn->dn_next_nlevels[0], sizeof (dn->dn_next_nlevels)); bzero(&dn->dn_next_indblkshift[0], sizeof (dn->dn_next_indblkshift)); bzero(&dn->dn_next_bonustype[0], sizeof (dn->dn_next_bonustype)); bzero(&dn->dn_rm_spillblk[0], sizeof (dn->dn_rm_spillblk)); bzero(&dn->dn_next_bonuslen[0], sizeof (dn->dn_next_bonuslen)); bzero(&dn->dn_next_blksz[0], sizeof (dn->dn_next_blksz)); bzero(&dn->dn_next_maxblkid[0], sizeof (dn->dn_next_maxblkid)); for (i = 0; i < TXG_SIZE; i++) { multilist_link_init(&dn->dn_dirty_link[i]); dn->dn_free_ranges[i] = NULL; list_create(&dn->dn_dirty_records[i], sizeof (dbuf_dirty_record_t), offsetof(dbuf_dirty_record_t, dr_dirty_node)); } dn->dn_allocated_txg = 0; dn->dn_free_txg = 0; dn->dn_assigned_txg = 0; dn->dn_dirty_txg = 0; dn->dn_dirtyctx = 0; dn->dn_dirtyctx_firstset = NULL; dn->dn_bonus = NULL; dn->dn_have_spill = B_FALSE; dn->dn_zio = NULL; dn->dn_oldused = 0; dn->dn_oldflags = 0; dn->dn_olduid = 0; dn->dn_oldgid = 0; dn->dn_oldprojid = ZFS_DEFAULT_PROJID; dn->dn_newuid = 0; dn->dn_newgid = 0; dn->dn_newprojid = ZFS_DEFAULT_PROJID; dn->dn_id_flags = 0; dn->dn_dbufs_count = 0; avl_create(&dn->dn_dbufs, dbuf_compare, sizeof (dmu_buf_impl_t), offsetof(dmu_buf_impl_t, db_link)); dn->dn_moved = 0; return (0); } /* ARGSUSED */ static void dnode_dest(void *arg, void *unused) { int i; dnode_t *dn = arg; rw_destroy(&dn->dn_struct_rwlock); mutex_destroy(&dn->dn_mtx); mutex_destroy(&dn->dn_dbufs_mtx); cv_destroy(&dn->dn_notxholds); cv_destroy(&dn->dn_nodnholds); zfs_refcount_destroy(&dn->dn_holds); zfs_refcount_destroy(&dn->dn_tx_holds); ASSERT(!list_link_active(&dn->dn_link)); for (i = 0; i < TXG_SIZE; i++) { ASSERT(!multilist_link_active(&dn->dn_dirty_link[i])); ASSERT3P(dn->dn_free_ranges[i], ==, NULL); list_destroy(&dn->dn_dirty_records[i]); ASSERT0(dn->dn_next_nblkptr[i]); ASSERT0(dn->dn_next_nlevels[i]); ASSERT0(dn->dn_next_indblkshift[i]); ASSERT0(dn->dn_next_bonustype[i]); ASSERT0(dn->dn_rm_spillblk[i]); ASSERT0(dn->dn_next_bonuslen[i]); ASSERT0(dn->dn_next_blksz[i]); ASSERT0(dn->dn_next_maxblkid[i]); } ASSERT0(dn->dn_allocated_txg); ASSERT0(dn->dn_free_txg); ASSERT0(dn->dn_assigned_txg); ASSERT0(dn->dn_dirty_txg); ASSERT0(dn->dn_dirtyctx); ASSERT3P(dn->dn_dirtyctx_firstset, ==, NULL); ASSERT3P(dn->dn_bonus, ==, NULL); ASSERT(!dn->dn_have_spill); ASSERT3P(dn->dn_zio, ==, NULL); ASSERT0(dn->dn_oldused); ASSERT0(dn->dn_oldflags); ASSERT0(dn->dn_olduid); ASSERT0(dn->dn_oldgid); ASSERT0(dn->dn_oldprojid); ASSERT0(dn->dn_newuid); ASSERT0(dn->dn_newgid); ASSERT0(dn->dn_newprojid); ASSERT0(dn->dn_id_flags); ASSERT0(dn->dn_dbufs_count); avl_destroy(&dn->dn_dbufs); } void dnode_init(void) { ASSERT(dnode_cache == NULL); dnode_cache = kmem_cache_create("dnode_t", sizeof (dnode_t), 0, dnode_cons, dnode_dest, NULL, NULL, NULL, 0); kmem_cache_set_move(dnode_cache, dnode_move); dnode_ksp = kstat_create("zfs", 0, "dnodestats", "misc", KSTAT_TYPE_NAMED, sizeof (dnode_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL); if (dnode_ksp != NULL) { dnode_ksp->ks_data = &dnode_stats; kstat_install(dnode_ksp); } } void dnode_fini(void) { if (dnode_ksp != NULL) { kstat_delete(dnode_ksp); dnode_ksp = NULL; } kmem_cache_destroy(dnode_cache); dnode_cache = NULL; } #ifdef ZFS_DEBUG void dnode_verify(dnode_t *dn) { int drop_struct_lock = FALSE; ASSERT(dn->dn_phys); ASSERT(dn->dn_objset); ASSERT(dn->dn_handle->dnh_dnode == dn); ASSERT(DMU_OT_IS_VALID(dn->dn_phys->dn_type)); if (!(zfs_flags & ZFS_DEBUG_DNODE_VERIFY)) return; if (!RW_WRITE_HELD(&dn->dn_struct_rwlock)) { rw_enter(&dn->dn_struct_rwlock, RW_READER); drop_struct_lock = TRUE; } if (dn->dn_phys->dn_type != DMU_OT_NONE || dn->dn_allocated_txg != 0) { int i; int max_bonuslen = DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots); ASSERT3U(dn->dn_indblkshift, <=, SPA_MAXBLOCKSHIFT); if (dn->dn_datablkshift) { ASSERT3U(dn->dn_datablkshift, >=, SPA_MINBLOCKSHIFT); ASSERT3U(dn->dn_datablkshift, <=, SPA_MAXBLOCKSHIFT); ASSERT3U(1<dn_datablkshift, ==, dn->dn_datablksz); } ASSERT3U(dn->dn_nlevels, <=, 30); ASSERT(DMU_OT_IS_VALID(dn->dn_type)); ASSERT3U(dn->dn_nblkptr, >=, 1); ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR); ASSERT3U(dn->dn_bonuslen, <=, max_bonuslen); ASSERT3U(dn->dn_datablksz, ==, dn->dn_datablkszsec << SPA_MINBLOCKSHIFT); ASSERT3U(ISP2(dn->dn_datablksz), ==, dn->dn_datablkshift != 0); ASSERT3U((dn->dn_nblkptr - 1) * sizeof (blkptr_t) + dn->dn_bonuslen, <=, max_bonuslen); for (i = 0; i < TXG_SIZE; i++) { ASSERT3U(dn->dn_next_nlevels[i], <=, dn->dn_nlevels); } } if (dn->dn_phys->dn_type != DMU_OT_NONE) ASSERT3U(dn->dn_phys->dn_nlevels, <=, dn->dn_nlevels); ASSERT(DMU_OBJECT_IS_SPECIAL(dn->dn_object) || dn->dn_dbuf != NULL); if (dn->dn_dbuf != NULL) { ASSERT3P(dn->dn_phys, ==, (dnode_phys_t *)dn->dn_dbuf->db.db_data + (dn->dn_object % (dn->dn_dbuf->db.db_size >> DNODE_SHIFT))); } if (drop_struct_lock) rw_exit(&dn->dn_struct_rwlock); } #endif void dnode_byteswap(dnode_phys_t *dnp) { uint64_t *buf64 = (void*)&dnp->dn_blkptr; int i; if (dnp->dn_type == DMU_OT_NONE) { bzero(dnp, sizeof (dnode_phys_t)); return; } dnp->dn_datablkszsec = BSWAP_16(dnp->dn_datablkszsec); dnp->dn_bonuslen = BSWAP_16(dnp->dn_bonuslen); dnp->dn_extra_slots = BSWAP_8(dnp->dn_extra_slots); dnp->dn_maxblkid = BSWAP_64(dnp->dn_maxblkid); dnp->dn_used = BSWAP_64(dnp->dn_used); /* * dn_nblkptr is only one byte, so it's OK to read it in either * byte order. We can't read dn_bouslen. */ ASSERT(dnp->dn_indblkshift <= SPA_MAXBLOCKSHIFT); ASSERT(dnp->dn_nblkptr <= DN_MAX_NBLKPTR); for (i = 0; i < dnp->dn_nblkptr * sizeof (blkptr_t)/8; i++) buf64[i] = BSWAP_64(buf64[i]); /* * OK to check dn_bonuslen for zero, because it won't matter if * we have the wrong byte order. This is necessary because the * dnode dnode is smaller than a regular dnode. */ if (dnp->dn_bonuslen != 0) { /* * Note that the bonus length calculated here may be * longer than the actual bonus buffer. This is because * we always put the bonus buffer after the last block * pointer (instead of packing it against the end of the * dnode buffer). */ int off = (dnp->dn_nblkptr-1) * sizeof (blkptr_t); int slots = dnp->dn_extra_slots + 1; size_t len = DN_SLOTS_TO_BONUSLEN(slots) - off; dmu_object_byteswap_t byteswap; ASSERT(DMU_OT_IS_VALID(dnp->dn_bonustype)); byteswap = DMU_OT_BYTESWAP(dnp->dn_bonustype); dmu_ot_byteswap[byteswap].ob_func(dnp->dn_bonus + off, len); } /* Swap SPILL block if we have one */ if (dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) byteswap_uint64_array(DN_SPILL_BLKPTR(dnp), sizeof (blkptr_t)); } void dnode_buf_byteswap(void *vbuf, size_t size) { int i = 0; ASSERT3U(sizeof (dnode_phys_t), ==, (1<dn_type != DMU_OT_NONE) i += dnp->dn_extra_slots * DNODE_MIN_SIZE; } } void dnode_setbonuslen(dnode_t *dn, int newsize, dmu_tx_t *tx) { ASSERT3U(zfs_refcount_count(&dn->dn_holds), >=, 1); dnode_setdirty(dn, tx); rw_enter(&dn->dn_struct_rwlock, RW_WRITER); ASSERT3U(newsize, <=, DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots) - (dn->dn_nblkptr-1) * sizeof (blkptr_t)); if (newsize < dn->dn_bonuslen) { /* clear any data after the end of the new size */ size_t diff = dn->dn_bonuslen - newsize; char *data_end = ((char *)dn->dn_bonus->db.db_data) + newsize; bzero(data_end, diff); } dn->dn_bonuslen = newsize; if (newsize == 0) dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = DN_ZERO_BONUSLEN; else dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen; rw_exit(&dn->dn_struct_rwlock); } void dnode_setbonus_type(dnode_t *dn, dmu_object_type_t newtype, dmu_tx_t *tx) { ASSERT3U(zfs_refcount_count(&dn->dn_holds), >=, 1); dnode_setdirty(dn, tx); rw_enter(&dn->dn_struct_rwlock, RW_WRITER); dn->dn_bonustype = newtype; dn->dn_next_bonustype[tx->tx_txg & TXG_MASK] = dn->dn_bonustype; rw_exit(&dn->dn_struct_rwlock); } void dnode_rm_spill(dnode_t *dn, dmu_tx_t *tx) { ASSERT3U(zfs_refcount_count(&dn->dn_holds), >=, 1); ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock)); dnode_setdirty(dn, tx); dn->dn_rm_spillblk[tx->tx_txg & TXG_MASK] = DN_KILL_SPILLBLK; dn->dn_have_spill = B_FALSE; } static void dnode_setdblksz(dnode_t *dn, int size) { ASSERT0(P2PHASE(size, SPA_MINBLOCKSIZE)); ASSERT3U(size, <=, SPA_MAXBLOCKSIZE); ASSERT3U(size, >=, SPA_MINBLOCKSIZE); ASSERT3U(size >> SPA_MINBLOCKSHIFT, <, 1<<(sizeof (dn->dn_phys->dn_datablkszsec) * 8)); dn->dn_datablksz = size; dn->dn_datablkszsec = size >> SPA_MINBLOCKSHIFT; dn->dn_datablkshift = ISP2(size) ? highbit64(size - 1) : 0; } static dnode_t * dnode_create(objset_t *os, dnode_phys_t *dnp, dmu_buf_impl_t *db, uint64_t object, dnode_handle_t *dnh) { dnode_t *dn; dn = kmem_cache_alloc(dnode_cache, KM_SLEEP); dn->dn_moved = 0; /* * Defer setting dn_objset until the dnode is ready to be a candidate * for the dnode_move() callback. */ dn->dn_object = object; dn->dn_dbuf = db; dn->dn_handle = dnh; dn->dn_phys = dnp; if (dnp->dn_datablkszsec) { dnode_setdblksz(dn, dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT); } else { dn->dn_datablksz = 0; dn->dn_datablkszsec = 0; dn->dn_datablkshift = 0; } dn->dn_indblkshift = dnp->dn_indblkshift; dn->dn_nlevels = dnp->dn_nlevels; dn->dn_type = dnp->dn_type; dn->dn_nblkptr = dnp->dn_nblkptr; dn->dn_checksum = dnp->dn_checksum; dn->dn_compress = dnp->dn_compress; dn->dn_bonustype = dnp->dn_bonustype; dn->dn_bonuslen = dnp->dn_bonuslen; dn->dn_num_slots = dnp->dn_extra_slots + 1; dn->dn_maxblkid = dnp->dn_maxblkid; dn->dn_have_spill = ((dnp->dn_flags & DNODE_FLAG_SPILL_BLKPTR) != 0); dn->dn_id_flags = 0; dmu_zfetch_init(&dn->dn_zfetch, dn); ASSERT(DMU_OT_IS_VALID(dn->dn_phys->dn_type)); ASSERT(zrl_is_locked(&dnh->dnh_zrlock)); ASSERT(!DN_SLOT_IS_PTR(dnh->dnh_dnode)); mutex_enter(&os->os_lock); /* * Exclude special dnodes from os_dnodes so an empty os_dnodes * signifies that the special dnodes have no references from * their children (the entries in os_dnodes). This allows * dnode_destroy() to easily determine if the last child has * been removed and then complete eviction of the objset. */ if (!DMU_OBJECT_IS_SPECIAL(object)) list_insert_head(&os->os_dnodes, dn); membar_producer(); /* * Everything else must be valid before assigning dn_objset * makes the dnode eligible for dnode_move(). */ dn->dn_objset = os; dnh->dnh_dnode = dn; mutex_exit(&os->os_lock); arc_space_consume(sizeof (dnode_t), ARC_SPACE_DNODE); return (dn); } /* * Caller must be holding the dnode handle, which is released upon return. */ static void dnode_destroy(dnode_t *dn) { objset_t *os = dn->dn_objset; boolean_t complete_os_eviction = B_FALSE; ASSERT((dn->dn_id_flags & DN_ID_NEW_EXIST) == 0); mutex_enter(&os->os_lock); POINTER_INVALIDATE(&dn->dn_objset); if (!DMU_OBJECT_IS_SPECIAL(dn->dn_object)) { list_remove(&os->os_dnodes, dn); complete_os_eviction = list_is_empty(&os->os_dnodes) && list_link_active(&os->os_evicting_node); } mutex_exit(&os->os_lock); /* the dnode can no longer move, so we can release the handle */ if (!zrl_is_locked(&dn->dn_handle->dnh_zrlock)) zrl_remove(&dn->dn_handle->dnh_zrlock); dn->dn_allocated_txg = 0; dn->dn_free_txg = 0; dn->dn_assigned_txg = 0; dn->dn_dirty_txg = 0; dn->dn_dirtyctx = 0; dn->dn_dirtyctx_firstset = NULL; if (dn->dn_bonus != NULL) { mutex_enter(&dn->dn_bonus->db_mtx); dbuf_destroy(dn->dn_bonus); dn->dn_bonus = NULL; } dn->dn_zio = NULL; dn->dn_have_spill = B_FALSE; dn->dn_oldused = 0; dn->dn_oldflags = 0; dn->dn_olduid = 0; dn->dn_oldgid = 0; dn->dn_oldprojid = ZFS_DEFAULT_PROJID; dn->dn_newuid = 0; dn->dn_newgid = 0; dn->dn_newprojid = ZFS_DEFAULT_PROJID; dn->dn_id_flags = 0; dmu_zfetch_fini(&dn->dn_zfetch); kmem_cache_free(dnode_cache, dn); arc_space_return(sizeof (dnode_t), ARC_SPACE_DNODE); if (complete_os_eviction) dmu_objset_evict_done(os); } void dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs, dmu_object_type_t bonustype, int bonuslen, int dn_slots, dmu_tx_t *tx) { int i; ASSERT3U(dn_slots, >, 0); ASSERT3U(dn_slots << DNODE_SHIFT, <=, spa_maxdnodesize(dmu_objset_spa(dn->dn_objset))); ASSERT3U(blocksize, <=, spa_maxblocksize(dmu_objset_spa(dn->dn_objset))); if (blocksize == 0) blocksize = 1 << zfs_default_bs; else blocksize = P2ROUNDUP(blocksize, SPA_MINBLOCKSIZE); if (ibs == 0) ibs = zfs_default_ibs; ibs = MIN(MAX(ibs, DN_MIN_INDBLKSHIFT), DN_MAX_INDBLKSHIFT); dprintf("os=%p obj=%llu txg=%llu blocksize=%d ibs=%d dn_slots=%d\n", dn->dn_objset, dn->dn_object, tx->tx_txg, blocksize, ibs, dn_slots); DNODE_STAT_BUMP(dnode_allocate); ASSERT(dn->dn_type == DMU_OT_NONE); ASSERT(bcmp(dn->dn_phys, &dnode_phys_zero, sizeof (dnode_phys_t)) == 0); ASSERT(dn->dn_phys->dn_type == DMU_OT_NONE); ASSERT(ot != DMU_OT_NONE); ASSERT(DMU_OT_IS_VALID(ot)); ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) || (bonustype == DMU_OT_SA && bonuslen == 0) || (bonustype != DMU_OT_NONE && bonuslen != 0)); ASSERT(DMU_OT_IS_VALID(bonustype)); ASSERT3U(bonuslen, <=, DN_SLOTS_TO_BONUSLEN(dn_slots)); ASSERT(dn->dn_type == DMU_OT_NONE); ASSERT0(dn->dn_maxblkid); ASSERT0(dn->dn_allocated_txg); ASSERT0(dn->dn_assigned_txg); ASSERT0(dn->dn_dirty_txg); ASSERT(zfs_refcount_is_zero(&dn->dn_tx_holds)); ASSERT3U(zfs_refcount_count(&dn->dn_holds), <=, 1); ASSERT(avl_is_empty(&dn->dn_dbufs)); for (i = 0; i < TXG_SIZE; i++) { ASSERT0(dn->dn_next_nblkptr[i]); ASSERT0(dn->dn_next_nlevels[i]); ASSERT0(dn->dn_next_indblkshift[i]); ASSERT0(dn->dn_next_bonuslen[i]); ASSERT0(dn->dn_next_bonustype[i]); ASSERT0(dn->dn_rm_spillblk[i]); ASSERT0(dn->dn_next_blksz[i]); ASSERT0(dn->dn_next_maxblkid[i]); ASSERT(!multilist_link_active(&dn->dn_dirty_link[i])); ASSERT3P(list_head(&dn->dn_dirty_records[i]), ==, NULL); ASSERT3P(dn->dn_free_ranges[i], ==, NULL); } dn->dn_type = ot; dnode_setdblksz(dn, blocksize); dn->dn_indblkshift = ibs; dn->dn_nlevels = 1; dn->dn_num_slots = dn_slots; if (bonustype == DMU_OT_SA) /* Maximize bonus space for SA */ dn->dn_nblkptr = 1; else { dn->dn_nblkptr = MIN(DN_MAX_NBLKPTR, 1 + ((DN_SLOTS_TO_BONUSLEN(dn_slots) - bonuslen) >> SPA_BLKPTRSHIFT)); } dn->dn_bonustype = bonustype; dn->dn_bonuslen = bonuslen; dn->dn_checksum = ZIO_CHECKSUM_INHERIT; dn->dn_compress = ZIO_COMPRESS_INHERIT; dn->dn_dirtyctx = 0; dn->dn_free_txg = 0; dn->dn_dirtyctx_firstset = NULL; dn->dn_allocated_txg = tx->tx_txg; dn->dn_id_flags = 0; dnode_setdirty(dn, tx); dn->dn_next_indblkshift[tx->tx_txg & TXG_MASK] = ibs; dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen; dn->dn_next_bonustype[tx->tx_txg & TXG_MASK] = dn->dn_bonustype; dn->dn_next_blksz[tx->tx_txg & TXG_MASK] = dn->dn_datablksz; } void dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, dmu_object_type_t bonustype, int bonuslen, int dn_slots, boolean_t keep_spill, dmu_tx_t *tx) { int nblkptr; ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE); ASSERT3U(blocksize, <=, spa_maxblocksize(dmu_objset_spa(dn->dn_objset))); ASSERT0(blocksize % SPA_MINBLOCKSIZE); ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx)); ASSERT(tx->tx_txg != 0); ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) || (bonustype != DMU_OT_NONE && bonuslen != 0) || (bonustype == DMU_OT_SA && bonuslen == 0)); ASSERT(DMU_OT_IS_VALID(bonustype)); ASSERT3U(bonuslen, <=, DN_BONUS_SIZE(spa_maxdnodesize(dmu_objset_spa(dn->dn_objset)))); ASSERT3U(bonuslen, <=, DN_BONUS_SIZE(dn_slots << DNODE_SHIFT)); dnode_free_interior_slots(dn); DNODE_STAT_BUMP(dnode_reallocate); /* clean up any unreferenced dbufs */ dnode_evict_dbufs(dn); dn->dn_id_flags = 0; rw_enter(&dn->dn_struct_rwlock, RW_WRITER); dnode_setdirty(dn, tx); if (dn->dn_datablksz != blocksize) { /* change blocksize */ ASSERT0(dn->dn_maxblkid); ASSERT(BP_IS_HOLE(&dn->dn_phys->dn_blkptr[0]) || dnode_block_freed(dn, 0)); dnode_setdblksz(dn, blocksize); dn->dn_next_blksz[tx->tx_txg & TXG_MASK] = blocksize; } if (dn->dn_bonuslen != bonuslen) dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = bonuslen; if (bonustype == DMU_OT_SA) /* Maximize bonus space for SA */ nblkptr = 1; else nblkptr = MIN(DN_MAX_NBLKPTR, 1 + ((DN_SLOTS_TO_BONUSLEN(dn_slots) - bonuslen) >> SPA_BLKPTRSHIFT)); if (dn->dn_bonustype != bonustype) dn->dn_next_bonustype[tx->tx_txg & TXG_MASK] = bonustype; if (dn->dn_nblkptr != nblkptr) dn->dn_next_nblkptr[tx->tx_txg & TXG_MASK] = nblkptr; if (dn->dn_phys->dn_flags & DNODE_FLAG_SPILL_BLKPTR && !keep_spill) { dbuf_rm_spill(dn, tx); dnode_rm_spill(dn, tx); } rw_exit(&dn->dn_struct_rwlock); /* change type */ dn->dn_type = ot; /* change bonus size and type */ mutex_enter(&dn->dn_mtx); dn->dn_bonustype = bonustype; dn->dn_bonuslen = bonuslen; dn->dn_num_slots = dn_slots; dn->dn_nblkptr = nblkptr; dn->dn_checksum = ZIO_CHECKSUM_INHERIT; dn->dn_compress = ZIO_COMPRESS_INHERIT; ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR); /* fix up the bonus db_size */ if (dn->dn_bonus) { dn->dn_bonus->db.db_size = DN_SLOTS_TO_BONUSLEN(dn->dn_num_slots) - (dn->dn_nblkptr-1) * sizeof (blkptr_t); ASSERT(dn->dn_bonuslen <= dn->dn_bonus->db.db_size); } dn->dn_allocated_txg = tx->tx_txg; mutex_exit(&dn->dn_mtx); } #ifdef _KERNEL static void dnode_move_impl(dnode_t *odn, dnode_t *ndn) { int i; ASSERT(!RW_LOCK_HELD(&odn->dn_struct_rwlock)); ASSERT(MUTEX_NOT_HELD(&odn->dn_mtx)); ASSERT(MUTEX_NOT_HELD(&odn->dn_dbufs_mtx)); ASSERT(!MUTEX_HELD(&odn->dn_zfetch.zf_lock)); /* Copy fields. */ ndn->dn_objset = odn->dn_objset; ndn->dn_object = odn->dn_object; ndn->dn_dbuf = odn->dn_dbuf; ndn->dn_handle = odn->dn_handle; ndn->dn_phys = odn->dn_phys; ndn->dn_type = odn->dn_type; ndn->dn_bonuslen = odn->dn_bonuslen; ndn->dn_bonustype = odn->dn_bonustype; ndn->dn_nblkptr = odn->dn_nblkptr; ndn->dn_checksum = odn->dn_checksum; ndn->dn_compress = odn->dn_compress; ndn->dn_nlevels = odn->dn_nlevels; ndn->dn_indblkshift = odn->dn_indblkshift; ndn->dn_datablkshift = odn->dn_datablkshift; ndn->dn_datablkszsec = odn->dn_datablkszsec; ndn->dn_datablksz = odn->dn_datablksz; ndn->dn_maxblkid = odn->dn_maxblkid; ndn->dn_num_slots = odn->dn_num_slots; bcopy(&odn->dn_next_type[0], &ndn->dn_next_type[0], sizeof (odn->dn_next_type)); bcopy(&odn->dn_next_nblkptr[0], &ndn->dn_next_nblkptr[0], sizeof (odn->dn_next_nblkptr)); bcopy(&odn->dn_next_nlevels[0], &ndn->dn_next_nlevels[0], sizeof (odn->dn_next_nlevels)); bcopy(&odn->dn_next_indblkshift[0], &ndn->dn_next_indblkshift[0], sizeof (odn->dn_next_indblkshift)); bcopy(&odn->dn_next_bonustype[0], &ndn->dn_next_bonustype[0], sizeof (odn->dn_next_bonustype)); bcopy(&odn->dn_rm_spillblk[0], &ndn->dn_rm_spillblk[0], sizeof (odn->dn_rm_spillblk)); bcopy(&odn->dn_next_bonuslen[0], &ndn->dn_next_bonuslen[0], sizeof (odn->dn_next_bonuslen)); bcopy(&odn->dn_next_blksz[0], &ndn->dn_next_blksz[0], sizeof (odn->dn_next_blksz)); bcopy(&odn->dn_next_maxblkid[0], &ndn->dn_next_maxblkid[0], sizeof (odn->dn_next_maxblkid)); for (i = 0; i < TXG_SIZE; i++) { list_move_tail(&ndn->dn_dirty_records[i], &odn->dn_dirty_records[i]); } bcopy(&odn->dn_free_ranges[0], &ndn->dn_free_ranges[0], sizeof (odn->dn_free_ranges)); ndn->dn_allocated_txg = odn->dn_allocated_txg; ndn->dn_free_txg = odn->dn_free_txg; ndn->dn_assigned_txg = odn->dn_assigned_txg; ndn->dn_dirty_txg = odn->dn_dirty_txg; ndn->dn_dirtyctx = odn->dn_dirtyctx; ndn->dn_dirtyctx_firstset = odn->dn_dirtyctx_firstset; ASSERT(zfs_refcount_count(&odn->dn_tx_holds) == 0); zfs_refcount_transfer(&ndn->dn_holds, &odn->dn_holds); ASSERT(avl_is_empty(&ndn->dn_dbufs)); avl_swap(&ndn->dn_dbufs, &odn->dn_dbufs); ndn->dn_dbufs_count = odn->dn_dbufs_count; ndn->dn_bonus = odn->dn_bonus; ndn->dn_have_spill = odn->dn_have_spill; ndn->dn_zio = odn->dn_zio; ndn->dn_oldused = odn->dn_oldused; ndn->dn_oldflags = odn->dn_oldflags; ndn->dn_olduid = odn->dn_olduid; ndn->dn_oldgid = odn->dn_oldgid; ndn->dn_oldprojid = odn->dn_oldprojid; ndn->dn_newuid = odn->dn_newuid; ndn->dn_newgid = odn->dn_newgid; ndn->dn_newprojid = odn->dn_newprojid; ndn->dn_id_flags = odn->dn_id_flags; dmu_zfetch_init(&ndn->dn_zfetch, NULL); list_move_tail(&ndn->dn_zfetch.zf_stream, &odn->dn_zfetch.zf_stream); ndn->dn_zfetch.zf_dnode = odn->dn_zfetch.zf_dnode; /* * Update back pointers. Updating the handle fixes the back pointer of * every descendant dbuf as well as the bonus dbuf. */ ASSERT(ndn->dn_handle->dnh_dnode == odn); ndn->dn_handle->dnh_dnode = ndn; if (ndn->dn_zfetch.zf_dnode == odn) { ndn->dn_zfetch.zf_dnode = ndn; } /* * Invalidate the original dnode by clearing all of its back pointers. */ odn->dn_dbuf = NULL; odn->dn_handle = NULL; avl_create(&odn->dn_dbufs, dbuf_compare, sizeof (dmu_buf_impl_t), offsetof(dmu_buf_impl_t, db_link)); odn->dn_dbufs_count = 0; odn->dn_bonus = NULL; dmu_zfetch_fini(&odn->dn_zfetch); /* * Set the low bit of the objset pointer to ensure that dnode_move() * recognizes the dnode as invalid in any subsequent callback. */ POINTER_INVALIDATE(&odn->dn_objset); /* * Satisfy the destructor. */ for (i = 0; i < TXG_SIZE; i++) { list_create(&odn->dn_dirty_records[i], sizeof (dbuf_dirty_record_t), offsetof(dbuf_dirty_record_t, dr_dirty_node)); odn->dn_free_ranges[i] = NULL; odn->dn_next_nlevels[i] = 0; odn->dn_next_indblkshift[i] = 0; odn->dn_next_bonustype[i] = 0; odn->dn_rm_spillblk[i] = 0; odn->dn_next_bonuslen[i] = 0; odn->dn_next_blksz[i] = 0; } odn->dn_allocated_txg = 0; odn->dn_free_txg = 0; odn->dn_assigned_txg = 0; odn->dn_dirty_txg = 0; odn->dn_dirtyctx = 0; odn->dn_dirtyctx_firstset = NULL; odn->dn_have_spill = B_FALSE; odn->dn_zio = NULL; odn->dn_oldused = 0; odn->dn_oldflags = 0; odn->dn_olduid = 0; odn->dn_oldgid = 0; odn->dn_oldprojid = ZFS_DEFAULT_PROJID; odn->dn_newuid = 0; odn->dn_newgid = 0; odn->dn_newprojid = ZFS_DEFAULT_PROJID; odn->dn_id_flags = 0; /* * Mark the dnode. */ ndn->dn_moved = 1; odn->dn_moved = (uint8_t)-1; } /*ARGSUSED*/ static kmem_cbrc_t dnode_move(void *buf, void *newbuf, size_t size, void *arg) { dnode_t *odn = buf, *ndn = newbuf; objset_t *os; int64_t refcount; uint32_t dbufs; /* * The dnode is on the objset's list of known dnodes if the objset * pointer is valid. We set the low bit of the objset pointer when * freeing the dnode to invalidate it, and the memory patterns written * by kmem (baddcafe and deadbeef) set at least one of the two low bits. * A newly created dnode sets the objset pointer last of all to indicate * that the dnode is known and in a valid state to be moved by this * function. */ os = odn->dn_objset; if (!POINTER_IS_VALID(os)) { DNODE_STAT_BUMP(dnode_move_invalid); return (KMEM_CBRC_DONT_KNOW); } /* * Ensure that the objset does not go away during the move. */ rw_enter(&os_lock, RW_WRITER); if (os != odn->dn_objset) { rw_exit(&os_lock); DNODE_STAT_BUMP(dnode_move_recheck1); return (KMEM_CBRC_DONT_KNOW); } /* * If the dnode is still valid, then so is the objset. We know that no * valid objset can be freed while we hold os_lock, so we can safely * ensure that the objset remains in use. */ mutex_enter(&os->os_lock); /* * Recheck the objset pointer in case the dnode was removed just before * acquiring the lock. */ if (os != odn->dn_objset) { mutex_exit(&os->os_lock); rw_exit(&os_lock); DNODE_STAT_BUMP(dnode_move_recheck2); return (KMEM_CBRC_DONT_KNOW); } /* * At this point we know that as long as we hold os->os_lock, the dnode * cannot be freed and fields within the dnode can be safely accessed. * The objset listing this dnode cannot go away as long as this dnode is * on its list. */ rw_exit(&os_lock); if (DMU_OBJECT_IS_SPECIAL(odn->dn_object)) { mutex_exit(&os->os_lock); DNODE_STAT_BUMP(dnode_move_special); return (KMEM_CBRC_NO); } ASSERT(odn->dn_dbuf != NULL); /* only "special" dnodes have no parent */ /* * Lock the dnode handle to prevent the dnode from obtaining any new * holds. This also prevents the descendant dbufs and the bonus dbuf * from accessing the dnode, so that we can discount their holds. The * handle is safe to access because we know that while the dnode cannot * go away, neither can its handle. Once we hold dnh_zrlock, we can * safely move any dnode referenced only by dbufs. */ if (!zrl_tryenter(&odn->dn_handle->dnh_zrlock)) { mutex_exit(&os->os_lock); DNODE_STAT_BUMP(dnode_move_handle); return (KMEM_CBRC_LATER); } /* * Ensure a consistent view of the dnode's holds and the dnode's dbufs. * We need to guarantee that there is a hold for every dbuf in order to * determine whether the dnode is actively referenced. Falsely matching * a dbuf to an active hold would lead to an unsafe move. It's possible * that a thread already having an active dnode hold is about to add a * dbuf, and we can't compare hold and dbuf counts while the add is in * progress. */ if (!rw_tryenter(&odn->dn_struct_rwlock, RW_WRITER)) { zrl_exit(&odn->dn_handle->dnh_zrlock); mutex_exit(&os->os_lock); DNODE_STAT_BUMP(dnode_move_rwlock); return (KMEM_CBRC_LATER); } /* * A dbuf may be removed (evicted) without an active dnode hold. In that * case, the dbuf count is decremented under the handle lock before the * dbuf's hold is released. This order ensures that if we count the hold * after the dbuf is removed but before its hold is released, we will * treat the unmatched hold as active and exit safely. If we count the * hold before the dbuf is removed, the hold is discounted, and the * removal is blocked until the move completes. */ refcount = zfs_refcount_count(&odn->dn_holds); ASSERT(refcount >= 0); dbufs = DN_DBUFS_COUNT(odn); /* We can't have more dbufs than dnode holds. */ ASSERT3U(dbufs, <=, refcount); DTRACE_PROBE3(dnode__move, dnode_t *, odn, int64_t, refcount, uint32_t, dbufs); if (refcount > dbufs) { rw_exit(&odn->dn_struct_rwlock); zrl_exit(&odn->dn_handle->dnh_zrlock); mutex_exit(&os->os_lock); DNODE_STAT_BUMP(dnode_move_active); return (KMEM_CBRC_LATER); } rw_exit(&odn->dn_struct_rwlock); /* * At this point we know that anyone with a hold on the dnode is not * actively referencing it. The dnode is known and in a valid state to * move. We're holding the locks needed to execute the critical section. */ dnode_move_impl(odn, ndn); list_link_replace(&odn->dn_link, &ndn->dn_link); /* If the dnode was safe to move, the refcount cannot have changed. */ ASSERT(refcount == zfs_refcount_count(&ndn->dn_holds)); ASSERT(dbufs == DN_DBUFS_COUNT(ndn)); zrl_exit(&ndn->dn_handle->dnh_zrlock); /* handle has moved */ mutex_exit(&os->os_lock); return (KMEM_CBRC_YES); } #endif /* _KERNEL */ static void dnode_slots_hold(dnode_children_t *children, int idx, int slots) { ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK); for (int i = idx; i < idx + slots; i++) { dnode_handle_t *dnh = &children->dnc_children[i]; zrl_add(&dnh->dnh_zrlock); } } static void dnode_slots_rele(dnode_children_t *children, int idx, int slots) { ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK); for (int i = idx; i < idx + slots; i++) { dnode_handle_t *dnh = &children->dnc_children[i]; if (zrl_is_locked(&dnh->dnh_zrlock)) zrl_exit(&dnh->dnh_zrlock); else zrl_remove(&dnh->dnh_zrlock); } } static int dnode_slots_tryenter(dnode_children_t *children, int idx, int slots) { ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK); for (int i = idx; i < idx + slots; i++) { dnode_handle_t *dnh = &children->dnc_children[i]; if (!zrl_tryenter(&dnh->dnh_zrlock)) { for (int j = idx; j < i; j++) { dnh = &children->dnc_children[j]; zrl_exit(&dnh->dnh_zrlock); } return (0); } } return (1); } static void dnode_set_slots(dnode_children_t *children, int idx, int slots, void *ptr) { ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK); for (int i = idx; i < idx + slots; i++) { dnode_handle_t *dnh = &children->dnc_children[i]; dnh->dnh_dnode = ptr; } } static boolean_t dnode_check_slots_free(dnode_children_t *children, int idx, int slots) { ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK); /* * If all dnode slots are either already free or * evictable return B_TRUE. */ for (int i = idx; i < idx + slots; i++) { dnode_handle_t *dnh = &children->dnc_children[i]; dnode_t *dn = dnh->dnh_dnode; if (dn == DN_SLOT_FREE) { continue; } else if (DN_SLOT_IS_PTR(dn)) { mutex_enter(&dn->dn_mtx); boolean_t can_free = (dn->dn_type == DMU_OT_NONE && zfs_refcount_is_zero(&dn->dn_holds) && !DNODE_IS_DIRTY(dn)); mutex_exit(&dn->dn_mtx); if (!can_free) return (B_FALSE); else continue; } else { return (B_FALSE); } } return (B_TRUE); } static void dnode_reclaim_slots(dnode_children_t *children, int idx, int slots) { ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK); for (int i = idx; i < idx + slots; i++) { dnode_handle_t *dnh = &children->dnc_children[i]; ASSERT(zrl_is_locked(&dnh->dnh_zrlock)); if (DN_SLOT_IS_PTR(dnh->dnh_dnode)) { ASSERT3S(dnh->dnh_dnode->dn_type, ==, DMU_OT_NONE); dnode_destroy(dnh->dnh_dnode); dnh->dnh_dnode = DN_SLOT_FREE; } } } void dnode_free_interior_slots(dnode_t *dn) { dnode_children_t *children = dmu_buf_get_user(&dn->dn_dbuf->db); int epb = dn->dn_dbuf->db.db_size >> DNODE_SHIFT; int idx = (dn->dn_object & (epb - 1)) + 1; int slots = dn->dn_num_slots - 1; if (slots == 0) return; ASSERT3S(idx + slots, <=, DNODES_PER_BLOCK); while (!dnode_slots_tryenter(children, idx, slots)) { DNODE_STAT_BUMP(dnode_free_interior_lock_retry); cond_resched(); } dnode_set_slots(children, idx, slots, DN_SLOT_FREE); dnode_slots_rele(children, idx, slots); } void dnode_special_close(dnode_handle_t *dnh) { dnode_t *dn = dnh->dnh_dnode; /* * Ensure dnode_rele_and_unlock() has released dn_mtx, after final * zfs_refcount_remove() */ mutex_enter(&dn->dn_mtx); if (zfs_refcount_count(&dn->dn_holds) > 0) cv_wait(&dn->dn_nodnholds, &dn->dn_mtx); mutex_exit(&dn->dn_mtx); ASSERT3U(zfs_refcount_count(&dn->dn_holds), ==, 0); ASSERT(dn->dn_dbuf == NULL || dmu_buf_get_user(&dn->dn_dbuf->db) == NULL); zrl_add(&dnh->dnh_zrlock); dnode_destroy(dn); /* implicit zrl_remove() */ zrl_destroy(&dnh->dnh_zrlock); dnh->dnh_dnode = NULL; } void dnode_special_open(objset_t *os, dnode_phys_t *dnp, uint64_t object, dnode_handle_t *dnh) { dnode_t *dn; zrl_init(&dnh->dnh_zrlock); VERIFY3U(1, ==, zrl_tryenter(&dnh->dnh_zrlock)); dn = dnode_create(os, dnp, NULL, object, dnh); DNODE_VERIFY(dn); zrl_exit(&dnh->dnh_zrlock); } static void dnode_buf_evict_async(void *dbu) { dnode_children_t *dnc = dbu; DNODE_STAT_BUMP(dnode_buf_evict); for (int i = 0; i < dnc->dnc_count; i++) { dnode_handle_t *dnh = &dnc->dnc_children[i]; dnode_t *dn; /* * The dnode handle lock guards against the dnode moving to * another valid address, so there is no need here to guard * against changes to or from NULL. */ if (!DN_SLOT_IS_PTR(dnh->dnh_dnode)) { zrl_destroy(&dnh->dnh_zrlock); dnh->dnh_dnode = DN_SLOT_UNINIT; continue; } zrl_add(&dnh->dnh_zrlock); dn = dnh->dnh_dnode; /* * If there are holds on this dnode, then there should * be holds on the dnode's containing dbuf as well; thus * it wouldn't be eligible for eviction and this function * would not have been called. */ ASSERT(zfs_refcount_is_zero(&dn->dn_holds)); ASSERT(zfs_refcount_is_zero(&dn->dn_tx_holds)); dnode_destroy(dn); /* implicit zrl_remove() for first slot */ zrl_destroy(&dnh->dnh_zrlock); dnh->dnh_dnode = DN_SLOT_UNINIT; } kmem_free(dnc, sizeof (dnode_children_t) + dnc->dnc_count * sizeof (dnode_handle_t)); } /* * When the DNODE_MUST_BE_FREE flag is set, the "slots" parameter is used * to ensure the hole at the specified object offset is large enough to * hold the dnode being created. The slots parameter is also used to ensure * a dnode does not span multiple dnode blocks. In both of these cases, if * a failure occurs, ENOSPC is returned. Keep in mind, these failure cases * are only possible when using DNODE_MUST_BE_FREE. * * If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. * dnode_hold_impl() will check if the requested dnode is already consumed * as an extra dnode slot by an large dnode, in which case it returns * ENOENT. * * If the DNODE_DRY_RUN flag is set, we don't actually hold the dnode, just * return whether the hold would succeed or not. tag and dnp should set to * NULL in this case. * * errors: * EINVAL - Invalid object number or flags. * ENOSPC - Hole too small to fulfill "slots" request (DNODE_MUST_BE_FREE) * EEXIST - Refers to an allocated dnode (DNODE_MUST_BE_FREE) * - Refers to a freeing dnode (DNODE_MUST_BE_FREE) * - Refers to an interior dnode slot (DNODE_MUST_BE_ALLOCATED) * ENOENT - The requested dnode is not allocated (DNODE_MUST_BE_ALLOCATED) * - The requested dnode is being freed (DNODE_MUST_BE_ALLOCATED) * EIO - I/O error when reading the meta dnode dbuf. * * succeeds even for free dnodes. */ int dnode_hold_impl(objset_t *os, uint64_t object, int flag, int slots, void *tag, dnode_t **dnp) { int epb, idx, err; int drop_struct_lock = FALSE; int type; uint64_t blk; dnode_t *mdn, *dn; dmu_buf_impl_t *db; dnode_children_t *dnc; dnode_phys_t *dn_block; dnode_handle_t *dnh; ASSERT(!(flag & DNODE_MUST_BE_ALLOCATED) || (slots == 0)); ASSERT(!(flag & DNODE_MUST_BE_FREE) || (slots > 0)); IMPLY(flag & DNODE_DRY_RUN, (tag == NULL) && (dnp == NULL)); /* * If you are holding the spa config lock as writer, you shouldn't * be asking the DMU to do *anything* unless it's the root pool * which may require us to read from the root filesystem while * holding some (not all) of the locks as writer. */ ASSERT(spa_config_held(os->os_spa, SCL_ALL, RW_WRITER) == 0 || (spa_is_root(os->os_spa) && spa_config_held(os->os_spa, SCL_STATE, RW_WRITER))); ASSERT((flag & DNODE_MUST_BE_ALLOCATED) || (flag & DNODE_MUST_BE_FREE)); if (object == DMU_USERUSED_OBJECT || object == DMU_GROUPUSED_OBJECT || object == DMU_PROJECTUSED_OBJECT) { if (object == DMU_USERUSED_OBJECT) dn = DMU_USERUSED_DNODE(os); else if (object == DMU_GROUPUSED_OBJECT) dn = DMU_GROUPUSED_DNODE(os); else dn = DMU_PROJECTUSED_DNODE(os); if (dn == NULL) return (SET_ERROR(ENOENT)); type = dn->dn_type; if ((flag & DNODE_MUST_BE_ALLOCATED) && type == DMU_OT_NONE) return (SET_ERROR(ENOENT)); if ((flag & DNODE_MUST_BE_FREE) && type != DMU_OT_NONE) return (SET_ERROR(EEXIST)); DNODE_VERIFY(dn); /* Don't actually hold if dry run, just return 0 */ if (!(flag & DNODE_DRY_RUN)) { (void) zfs_refcount_add(&dn->dn_holds, tag); *dnp = dn; } return (0); } if (object == 0 || object >= DN_MAX_OBJECT) return (SET_ERROR(EINVAL)); mdn = DMU_META_DNODE(os); ASSERT(mdn->dn_object == DMU_META_DNODE_OBJECT); DNODE_VERIFY(mdn); if (!RW_WRITE_HELD(&mdn->dn_struct_rwlock)) { rw_enter(&mdn->dn_struct_rwlock, RW_READER); drop_struct_lock = TRUE; } blk = dbuf_whichblock(mdn, 0, object * sizeof (dnode_phys_t)); db = dbuf_hold(mdn, blk, FTAG); if (drop_struct_lock) rw_exit(&mdn->dn_struct_rwlock); if (db == NULL) { DNODE_STAT_BUMP(dnode_hold_dbuf_hold); return (SET_ERROR(EIO)); } /* * We do not need to decrypt to read the dnode so it doesn't matter * if we get the encrypted or decrypted version. */ - err = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_NO_DECRYPT); + err = dbuf_read(db, NULL, DB_RF_CANFAIL | + DB_RF_NO_DECRYPT | DB_RF_NOPREFETCH); if (err) { DNODE_STAT_BUMP(dnode_hold_dbuf_read); dbuf_rele(db, FTAG); return (err); } ASSERT3U(db->db.db_size, >=, 1<db.db_size >> DNODE_SHIFT; idx = object & (epb - 1); dn_block = (dnode_phys_t *)db->db.db_data; ASSERT(DB_DNODE(db)->dn_type == DMU_OT_DNODE); dnc = dmu_buf_get_user(&db->db); dnh = NULL; if (dnc == NULL) { dnode_children_t *winner; int skip = 0; dnc = kmem_zalloc(sizeof (dnode_children_t) + epb * sizeof (dnode_handle_t), KM_SLEEP); dnc->dnc_count = epb; dnh = &dnc->dnc_children[0]; /* Initialize dnode slot status from dnode_phys_t */ for (int i = 0; i < epb; i++) { zrl_init(&dnh[i].dnh_zrlock); if (skip) { skip--; continue; } if (dn_block[i].dn_type != DMU_OT_NONE) { int interior = dn_block[i].dn_extra_slots; dnode_set_slots(dnc, i, 1, DN_SLOT_ALLOCATED); dnode_set_slots(dnc, i + 1, interior, DN_SLOT_INTERIOR); skip = interior; } else { dnh[i].dnh_dnode = DN_SLOT_FREE; skip = 0; } } dmu_buf_init_user(&dnc->dnc_dbu, NULL, dnode_buf_evict_async, NULL); winner = dmu_buf_set_user(&db->db, &dnc->dnc_dbu); if (winner != NULL) { for (int i = 0; i < epb; i++) zrl_destroy(&dnh[i].dnh_zrlock); kmem_free(dnc, sizeof (dnode_children_t) + epb * sizeof (dnode_handle_t)); dnc = winner; } } ASSERT(dnc->dnc_count == epb); if (flag & DNODE_MUST_BE_ALLOCATED) { slots = 1; dnode_slots_hold(dnc, idx, slots); dnh = &dnc->dnc_children[idx]; if (DN_SLOT_IS_PTR(dnh->dnh_dnode)) { dn = dnh->dnh_dnode; } else if (dnh->dnh_dnode == DN_SLOT_INTERIOR) { DNODE_STAT_BUMP(dnode_hold_alloc_interior); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (SET_ERROR(EEXIST)); } else if (dnh->dnh_dnode != DN_SLOT_ALLOCATED) { DNODE_STAT_BUMP(dnode_hold_alloc_misses); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (SET_ERROR(ENOENT)); } else { dnode_slots_rele(dnc, idx, slots); while (!dnode_slots_tryenter(dnc, idx, slots)) { DNODE_STAT_BUMP(dnode_hold_alloc_lock_retry); cond_resched(); } /* * Someone else won the race and called dnode_create() * after we checked DN_SLOT_IS_PTR() above but before * we acquired the lock. */ if (DN_SLOT_IS_PTR(dnh->dnh_dnode)) { DNODE_STAT_BUMP(dnode_hold_alloc_lock_misses); dn = dnh->dnh_dnode; } else { dn = dnode_create(os, dn_block + idx, db, object, dnh); } } mutex_enter(&dn->dn_mtx); if (dn->dn_type == DMU_OT_NONE || dn->dn_free_txg != 0) { DNODE_STAT_BUMP(dnode_hold_alloc_type_none); mutex_exit(&dn->dn_mtx); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (SET_ERROR(ENOENT)); } /* Don't actually hold if dry run, just return 0 */ if (flag & DNODE_DRY_RUN) { mutex_exit(&dn->dn_mtx); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (0); } DNODE_STAT_BUMP(dnode_hold_alloc_hits); } else if (flag & DNODE_MUST_BE_FREE) { if (idx + slots - 1 >= DNODES_PER_BLOCK) { DNODE_STAT_BUMP(dnode_hold_free_overflow); dbuf_rele(db, FTAG); return (SET_ERROR(ENOSPC)); } dnode_slots_hold(dnc, idx, slots); if (!dnode_check_slots_free(dnc, idx, slots)) { DNODE_STAT_BUMP(dnode_hold_free_misses); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (SET_ERROR(ENOSPC)); } dnode_slots_rele(dnc, idx, slots); while (!dnode_slots_tryenter(dnc, idx, slots)) { DNODE_STAT_BUMP(dnode_hold_free_lock_retry); cond_resched(); } if (!dnode_check_slots_free(dnc, idx, slots)) { DNODE_STAT_BUMP(dnode_hold_free_lock_misses); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (SET_ERROR(ENOSPC)); } /* * Allocated but otherwise free dnodes which would * be in the interior of a multi-slot dnodes need * to be freed. Single slot dnodes can be safely * re-purposed as a performance optimization. */ if (slots > 1) dnode_reclaim_slots(dnc, idx + 1, slots - 1); dnh = &dnc->dnc_children[idx]; if (DN_SLOT_IS_PTR(dnh->dnh_dnode)) { dn = dnh->dnh_dnode; } else { dn = dnode_create(os, dn_block + idx, db, object, dnh); } mutex_enter(&dn->dn_mtx); if (!zfs_refcount_is_zero(&dn->dn_holds) || dn->dn_free_txg) { DNODE_STAT_BUMP(dnode_hold_free_refcount); mutex_exit(&dn->dn_mtx); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (SET_ERROR(EEXIST)); } /* Don't actually hold if dry run, just return 0 */ if (flag & DNODE_DRY_RUN) { mutex_exit(&dn->dn_mtx); dnode_slots_rele(dnc, idx, slots); dbuf_rele(db, FTAG); return (0); } dnode_set_slots(dnc, idx + 1, slots - 1, DN_SLOT_INTERIOR); DNODE_STAT_BUMP(dnode_hold_free_hits); } else { dbuf_rele(db, FTAG); return (SET_ERROR(EINVAL)); } ASSERT0(dn->dn_free_txg); if (zfs_refcount_add(&dn->dn_holds, tag) == 1) dbuf_add_ref(db, dnh); mutex_exit(&dn->dn_mtx); /* Now we can rely on the hold to prevent the dnode from moving. */ dnode_slots_rele(dnc, idx, slots); DNODE_VERIFY(dn); ASSERT3P(dnp, !=, NULL); ASSERT3P(dn->dn_dbuf, ==, db); ASSERT3U(dn->dn_object, ==, object); dbuf_rele(db, FTAG); *dnp = dn; return (0); } /* * Return held dnode if the object is allocated, NULL if not. */ int dnode_hold(objset_t *os, uint64_t object, void *tag, dnode_t **dnp) { return (dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED, 0, tag, dnp)); } /* * Can only add a reference if there is already at least one * reference on the dnode. Returns FALSE if unable to add a * new reference. */ boolean_t dnode_add_ref(dnode_t *dn, void *tag) { mutex_enter(&dn->dn_mtx); if (zfs_refcount_is_zero(&dn->dn_holds)) { mutex_exit(&dn->dn_mtx); return (FALSE); } VERIFY(1 < zfs_refcount_add(&dn->dn_holds, tag)); mutex_exit(&dn->dn_mtx); return (TRUE); } void dnode_rele(dnode_t *dn, void *tag) { mutex_enter(&dn->dn_mtx); dnode_rele_and_unlock(dn, tag, B_FALSE); } void dnode_rele_and_unlock(dnode_t *dn, void *tag, boolean_t evicting) { uint64_t refs; /* Get while the hold prevents the dnode from moving. */ dmu_buf_impl_t *db = dn->dn_dbuf; dnode_handle_t *dnh = dn->dn_handle; refs = zfs_refcount_remove(&dn->dn_holds, tag); if (refs == 0) cv_broadcast(&dn->dn_nodnholds); mutex_exit(&dn->dn_mtx); /* dnode could get destroyed at this point, so don't use it anymore */ /* * It's unsafe to release the last hold on a dnode by dnode_rele() or * indirectly by dbuf_rele() while relying on the dnode handle to * prevent the dnode from moving, since releasing the last hold could * result in the dnode's parent dbuf evicting its dnode handles. For * that reason anyone calling dnode_rele() or dbuf_rele() without some * other direct or indirect hold on the dnode must first drop the dnode * handle. */ ASSERT(refs > 0 || dnh->dnh_zrlock.zr_owner != curthread); /* NOTE: the DNODE_DNODE does not have a dn_dbuf */ if (refs == 0 && db != NULL) { /* * Another thread could add a hold to the dnode handle in * dnode_hold_impl() while holding the parent dbuf. Since the * hold on the parent dbuf prevents the handle from being * destroyed, the hold on the handle is OK. We can't yet assert * that the handle has zero references, but that will be * asserted anyway when the handle gets destroyed. */ mutex_enter(&db->db_mtx); dbuf_rele_and_unlock(db, dnh, evicting); } } /* * Test whether we can create a dnode at the specified location. */ int dnode_try_claim(objset_t *os, uint64_t object, int slots) { return (dnode_hold_impl(os, object, DNODE_MUST_BE_FREE | DNODE_DRY_RUN, slots, NULL, NULL)); } void dnode_setdirty(dnode_t *dn, dmu_tx_t *tx) { objset_t *os = dn->dn_objset; uint64_t txg = tx->tx_txg; if (DMU_OBJECT_IS_SPECIAL(dn->dn_object)) { dsl_dataset_dirty(os->os_dsl_dataset, tx); return; } DNODE_VERIFY(dn); #ifdef ZFS_DEBUG mutex_enter(&dn->dn_mtx); ASSERT(dn->dn_phys->dn_type || dn->dn_allocated_txg); ASSERT(dn->dn_free_txg == 0 || dn->dn_free_txg >= txg); mutex_exit(&dn->dn_mtx); #endif /* * Determine old uid/gid when necessary */ dmu_objset_userquota_get_ids(dn, B_TRUE, tx); multilist_t *dirtylist = os->os_dirty_dnodes[txg & TXG_MASK]; multilist_sublist_t *mls = multilist_sublist_lock_obj(dirtylist, dn); /* * If we are already marked dirty, we're done. */ if (multilist_link_active(&dn->dn_dirty_link[txg & TXG_MASK])) { multilist_sublist_unlock(mls); return; } ASSERT(!zfs_refcount_is_zero(&dn->dn_holds) || !avl_is_empty(&dn->dn_dbufs)); ASSERT(dn->dn_datablksz != 0); ASSERT0(dn->dn_next_bonuslen[txg & TXG_MASK]); ASSERT0(dn->dn_next_blksz[txg & TXG_MASK]); ASSERT0(dn->dn_next_bonustype[txg & TXG_MASK]); dprintf_ds(os->os_dsl_dataset, "obj=%llu txg=%llu\n", dn->dn_object, txg); multilist_sublist_insert_head(mls, dn); multilist_sublist_unlock(mls); /* * The dnode maintains a hold on its containing dbuf as * long as there are holds on it. Each instantiated child * dbuf maintains a hold on the dnode. When the last child * drops its hold, the dnode will drop its hold on the * containing dbuf. We add a "dirty hold" here so that the * dnode will hang around after we finish processing its * children. */ VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg)); (void) dbuf_dirty(dn->dn_dbuf, tx); dsl_dataset_dirty(os->os_dsl_dataset, tx); } void dnode_free(dnode_t *dn, dmu_tx_t *tx) { mutex_enter(&dn->dn_mtx); if (dn->dn_type == DMU_OT_NONE || dn->dn_free_txg) { mutex_exit(&dn->dn_mtx); return; } dn->dn_free_txg = tx->tx_txg; mutex_exit(&dn->dn_mtx); dnode_setdirty(dn, tx); } /* * Try to change the block size for the indicated dnode. This can only * succeed if there are no blocks allocated or dirty beyond first block */ int dnode_set_blksz(dnode_t *dn, uint64_t size, int ibs, dmu_tx_t *tx) { dmu_buf_impl_t *db; int err; ASSERT3U(size, <=, spa_maxblocksize(dmu_objset_spa(dn->dn_objset))); if (size == 0) size = SPA_MINBLOCKSIZE; else size = P2ROUNDUP(size, SPA_MINBLOCKSIZE); if (ibs == dn->dn_indblkshift) ibs = 0; if (size >> SPA_MINBLOCKSHIFT == dn->dn_datablkszsec && ibs == 0) return (0); rw_enter(&dn->dn_struct_rwlock, RW_WRITER); /* Check for any allocated blocks beyond the first */ if (dn->dn_maxblkid != 0) goto fail; mutex_enter(&dn->dn_dbufs_mtx); for (db = avl_first(&dn->dn_dbufs); db != NULL; db = AVL_NEXT(&dn->dn_dbufs, db)) { if (db->db_blkid != 0 && db->db_blkid != DMU_BONUS_BLKID && db->db_blkid != DMU_SPILL_BLKID) { mutex_exit(&dn->dn_dbufs_mtx); goto fail; } } mutex_exit(&dn->dn_dbufs_mtx); if (ibs && dn->dn_nlevels != 1) goto fail; /* resize the old block */ err = dbuf_hold_impl(dn, 0, 0, TRUE, FALSE, FTAG, &db); if (err == 0) { dbuf_new_size(db, size, tx); } else if (err != ENOENT) { goto fail; } dnode_setdblksz(dn, size); dnode_setdirty(dn, tx); dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = size; if (ibs) { dn->dn_indblkshift = ibs; dn->dn_next_indblkshift[tx->tx_txg&TXG_MASK] = ibs; } /* release after we have fixed the blocksize in the dnode */ if (db) dbuf_rele(db, FTAG); rw_exit(&dn->dn_struct_rwlock); return (0); fail: rw_exit(&dn->dn_struct_rwlock); return (SET_ERROR(ENOTSUP)); } static void dnode_set_nlevels_impl(dnode_t *dn, int new_nlevels, dmu_tx_t *tx) { uint64_t txgoff = tx->tx_txg & TXG_MASK; int old_nlevels = dn->dn_nlevels; dmu_buf_impl_t *db; list_t *list; dbuf_dirty_record_t *new, *dr, *dr_next; ASSERT(RW_WRITE_HELD(&dn->dn_struct_rwlock)); dn->dn_nlevels = new_nlevels; ASSERT3U(new_nlevels, >, dn->dn_next_nlevels[txgoff]); dn->dn_next_nlevels[txgoff] = new_nlevels; /* dirty the left indirects */ db = dbuf_hold_level(dn, old_nlevels, 0, FTAG); ASSERT(db != NULL); new = dbuf_dirty(db, tx); dbuf_rele(db, FTAG); /* transfer the dirty records to the new indirect */ mutex_enter(&dn->dn_mtx); mutex_enter(&new->dt.di.dr_mtx); list = &dn->dn_dirty_records[txgoff]; for (dr = list_head(list); dr; dr = dr_next) { dr_next = list_next(&dn->dn_dirty_records[txgoff], dr); if (dr->dr_dbuf->db_level != new_nlevels-1 && dr->dr_dbuf->db_blkid != DMU_BONUS_BLKID && dr->dr_dbuf->db_blkid != DMU_SPILL_BLKID) { ASSERT(dr->dr_dbuf->db_level == old_nlevels-1); list_remove(&dn->dn_dirty_records[txgoff], dr); list_insert_tail(&new->dt.di.dr_children, dr); dr->dr_parent = new; } } mutex_exit(&new->dt.di.dr_mtx); mutex_exit(&dn->dn_mtx); } int dnode_set_nlevels(dnode_t *dn, int nlevels, dmu_tx_t *tx) { int ret = 0; rw_enter(&dn->dn_struct_rwlock, RW_WRITER); if (dn->dn_nlevels == nlevels) { ret = 0; goto out; } else if (nlevels < dn->dn_nlevels) { ret = SET_ERROR(EINVAL); goto out; } dnode_set_nlevels_impl(dn, nlevels, tx); out: rw_exit(&dn->dn_struct_rwlock); return (ret); } /* read-holding callers must not rely on the lock being continuously held */ void dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx, boolean_t have_read, boolean_t force) { int epbs, new_nlevels; uint64_t sz; ASSERT(blkid != DMU_BONUS_BLKID); ASSERT(have_read ? RW_READ_HELD(&dn->dn_struct_rwlock) : RW_WRITE_HELD(&dn->dn_struct_rwlock)); /* * if we have a read-lock, check to see if we need to do any work * before upgrading to a write-lock. */ if (have_read) { if (blkid <= dn->dn_maxblkid) return; if (!rw_tryupgrade(&dn->dn_struct_rwlock)) { rw_exit(&dn->dn_struct_rwlock); rw_enter(&dn->dn_struct_rwlock, RW_WRITER); } } /* * Raw sends (indicated by the force flag) require that we take the * given blkid even if the value is lower than the current value. */ if (!force && blkid <= dn->dn_maxblkid) goto out; /* * We use the (otherwise unused) top bit of dn_next_maxblkid[txgoff] * to indicate that this field is set. This allows us to set the * maxblkid to 0 on an existing object in dnode_sync(). */ dn->dn_maxblkid = blkid; dn->dn_next_maxblkid[tx->tx_txg & TXG_MASK] = blkid | DMU_NEXT_MAXBLKID_SET; /* * Compute the number of levels necessary to support the new maxblkid. * Raw sends will ensure nlevels is set correctly for us. */ new_nlevels = 1; epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT; for (sz = dn->dn_nblkptr; sz <= blkid && sz >= dn->dn_nblkptr; sz <<= epbs) new_nlevels++; ASSERT3U(new_nlevels, <=, DN_MAX_LEVELS); if (!force) { if (new_nlevels > dn->dn_nlevels) dnode_set_nlevels_impl(dn, new_nlevels, tx); } else { ASSERT3U(dn->dn_nlevels, >=, new_nlevels); } out: if (have_read) rw_downgrade(&dn->dn_struct_rwlock); } static void dnode_dirty_l1(dnode_t *dn, uint64_t l1blkid, dmu_tx_t *tx) { dmu_buf_impl_t *db = dbuf_hold_level(dn, 1, l1blkid, FTAG); if (db != NULL) { dmu_buf_will_dirty(&db->db, tx); dbuf_rele(db, FTAG); } } /* * Dirty all the in-core level-1 dbufs in the range specified by start_blkid * and end_blkid. */ static void dnode_dirty_l1range(dnode_t *dn, uint64_t start_blkid, uint64_t end_blkid, dmu_tx_t *tx) { dmu_buf_impl_t *db_search; dmu_buf_impl_t *db; avl_index_t where; db_search = kmem_zalloc(sizeof (dmu_buf_impl_t), KM_SLEEP); mutex_enter(&dn->dn_dbufs_mtx); db_search->db_level = 1; db_search->db_blkid = start_blkid + 1; db_search->db_state = DB_SEARCH; for (;;) { db = avl_find(&dn->dn_dbufs, db_search, &where); if (db == NULL) db = avl_nearest(&dn->dn_dbufs, where, AVL_AFTER); if (db == NULL || db->db_level != 1 || db->db_blkid >= end_blkid) { break; } /* * Setup the next blkid we want to search for. */ db_search->db_blkid = db->db_blkid + 1; ASSERT3U(db->db_blkid, >=, start_blkid); /* * If the dbuf transitions to DB_EVICTING while we're trying * to dirty it, then we will be unable to discover it in * the dbuf hash table. This will result in a call to * dbuf_create() which needs to acquire the dn_dbufs_mtx * lock. To avoid a deadlock, we drop the lock before * dirtying the level-1 dbuf. */ mutex_exit(&dn->dn_dbufs_mtx); dnode_dirty_l1(dn, db->db_blkid, tx); mutex_enter(&dn->dn_dbufs_mtx); } #ifdef ZFS_DEBUG /* * Walk all the in-core level-1 dbufs and verify they have been dirtied. */ db_search->db_level = 1; db_search->db_blkid = start_blkid + 1; db_search->db_state = DB_SEARCH; db = avl_find(&dn->dn_dbufs, db_search, &where); if (db == NULL) db = avl_nearest(&dn->dn_dbufs, where, AVL_AFTER); for (; db != NULL; db = AVL_NEXT(&dn->dn_dbufs, db)) { if (db->db_level != 1 || db->db_blkid >= end_blkid) break; if (db->db_state != DB_EVICTING) ASSERT(db->db_dirtycnt > 0); } #endif kmem_free(db_search, sizeof (dmu_buf_impl_t)); mutex_exit(&dn->dn_dbufs_mtx); } void dnode_set_dirtyctx(dnode_t *dn, dmu_tx_t *tx, void *tag) { /* * Don't set dirtyctx to SYNC if we're just modifying this as we * initialize the objset. */ if (dn->dn_dirtyctx == DN_UNDIRTIED) { dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset; if (ds != NULL) { rrw_enter(&ds->ds_bp_rwlock, RW_READER, tag); } if (!BP_IS_HOLE(dn->dn_objset->os_rootbp)) { if (dmu_tx_is_syncing(tx)) dn->dn_dirtyctx = DN_DIRTY_SYNC; else dn->dn_dirtyctx = DN_DIRTY_OPEN; dn->dn_dirtyctx_firstset = tag; } if (ds != NULL) { rrw_exit(&ds->ds_bp_rwlock, tag); } } } void dnode_free_range(dnode_t *dn, uint64_t off, uint64_t len, dmu_tx_t *tx) { dmu_buf_impl_t *db; uint64_t blkoff, blkid, nblks; int blksz, blkshift, head, tail; int trunc = FALSE; int epbs; blksz = dn->dn_datablksz; blkshift = dn->dn_datablkshift; epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT; if (len == DMU_OBJECT_END) { len = UINT64_MAX - off; trunc = TRUE; } /* * First, block align the region to free: */ if (ISP2(blksz)) { head = P2NPHASE(off, blksz); blkoff = P2PHASE(off, blksz); if ((off >> blkshift) > dn->dn_maxblkid) return; } else { ASSERT(dn->dn_maxblkid == 0); if (off == 0 && len >= blksz) { /* * Freeing the whole block; fast-track this request. */ blkid = 0; nblks = 1; if (dn->dn_nlevels > 1) { rw_enter(&dn->dn_struct_rwlock, RW_WRITER); dnode_dirty_l1(dn, 0, tx); rw_exit(&dn->dn_struct_rwlock); } goto done; } else if (off >= blksz) { /* Freeing past end-of-data */ return; } else { /* Freeing part of the block. */ head = blksz - off; ASSERT3U(head, >, 0); } blkoff = off; } /* zero out any partial block data at the start of the range */ if (head) { int res; ASSERT3U(blkoff + head, ==, blksz); if (len < head) head = len; rw_enter(&dn->dn_struct_rwlock, RW_READER); res = dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, 0, off), TRUE, FALSE, FTAG, &db); rw_exit(&dn->dn_struct_rwlock); if (res == 0) { caddr_t data; boolean_t dirty; db_lock_type_t dblt = dmu_buf_lock_parent(db, RW_READER, FTAG); /* don't dirty if it isn't on disk and isn't dirty */ dirty = !list_is_empty(&db->db_dirty_records) || (db->db_blkptr && !BP_IS_HOLE(db->db_blkptr)); dmu_buf_unlock_parent(db, dblt, FTAG); if (dirty) { dmu_buf_will_dirty(&db->db, tx); data = db->db.db_data; bzero(data + blkoff, head); } dbuf_rele(db, FTAG); } off += head; len -= head; } /* If the range was less than one block, we're done */ if (len == 0) return; /* If the remaining range is past end of file, we're done */ if ((off >> blkshift) > dn->dn_maxblkid) return; ASSERT(ISP2(blksz)); if (trunc) tail = 0; else tail = P2PHASE(len, blksz); ASSERT0(P2PHASE(off, blksz)); /* zero out any partial block data at the end of the range */ if (tail) { int res; if (len < tail) tail = len; rw_enter(&dn->dn_struct_rwlock, RW_READER); res = dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, 0, off+len), TRUE, FALSE, FTAG, &db); rw_exit(&dn->dn_struct_rwlock); if (res == 0) { boolean_t dirty; /* don't dirty if not on disk and not dirty */ db_lock_type_t type = dmu_buf_lock_parent(db, RW_READER, FTAG); dirty = !list_is_empty(&db->db_dirty_records) || (db->db_blkptr && !BP_IS_HOLE(db->db_blkptr)); dmu_buf_unlock_parent(db, type, FTAG); if (dirty) { dmu_buf_will_dirty(&db->db, tx); bzero(db->db.db_data, tail); } dbuf_rele(db, FTAG); } len -= tail; } /* If the range did not include a full block, we are done */ if (len == 0) return; ASSERT(IS_P2ALIGNED(off, blksz)); ASSERT(trunc || IS_P2ALIGNED(len, blksz)); blkid = off >> blkshift; nblks = len >> blkshift; if (trunc) nblks += 1; /* * Dirty all the indirect blocks in this range. Note that only * the first and last indirect blocks can actually be written * (if they were partially freed) -- they must be dirtied, even if * they do not exist on disk yet. The interior blocks will * be freed by free_children(), so they will not actually be written. * Even though these interior blocks will not be written, we * dirty them for two reasons: * * - It ensures that the indirect blocks remain in memory until * syncing context. (They have already been prefetched by * dmu_tx_hold_free(), so we don't have to worry about reading * them serially here.) * * - The dirty space accounting will put pressure on the txg sync * mechanism to begin syncing, and to delay transactions if there * is a large amount of freeing. Even though these indirect * blocks will not be written, we could need to write the same * amount of space if we copy the freed BPs into deadlists. */ if (dn->dn_nlevels > 1) { rw_enter(&dn->dn_struct_rwlock, RW_WRITER); uint64_t first, last; first = blkid >> epbs; dnode_dirty_l1(dn, first, tx); if (trunc) last = dn->dn_maxblkid >> epbs; else last = (blkid + nblks - 1) >> epbs; if (last != first) dnode_dirty_l1(dn, last, tx); dnode_dirty_l1range(dn, first, last, tx); int shift = dn->dn_datablkshift + dn->dn_indblkshift - SPA_BLKPTRSHIFT; for (uint64_t i = first + 1; i < last; i++) { /* * Set i to the blockid of the next non-hole * level-1 indirect block at or after i. Note * that dnode_next_offset() operates in terms of * level-0-equivalent bytes. */ uint64_t ibyte = i << shift; int err = dnode_next_offset(dn, DNODE_FIND_HAVELOCK, &ibyte, 2, 1, 0); i = ibyte >> shift; if (i >= last) break; /* * Normally we should not see an error, either * from dnode_next_offset() or dbuf_hold_level() * (except for ESRCH from dnode_next_offset). * If there is an i/o error, then when we read * this block in syncing context, it will use * ZIO_FLAG_MUSTSUCCEED, and thus hang/panic according * to the "failmode" property. dnode_next_offset() * doesn't have a flag to indicate MUSTSUCCEED. */ if (err != 0) break; dnode_dirty_l1(dn, i, tx); } rw_exit(&dn->dn_struct_rwlock); } done: /* * Add this range to the dnode range list. * We will finish up this free operation in the syncing phase. */ mutex_enter(&dn->dn_mtx); { int txgoff = tx->tx_txg & TXG_MASK; if (dn->dn_free_ranges[txgoff] == NULL) { dn->dn_free_ranges[txgoff] = range_tree_create(NULL, RANGE_SEG64, NULL, 0, 0); } range_tree_clear(dn->dn_free_ranges[txgoff], blkid, nblks); range_tree_add(dn->dn_free_ranges[txgoff], blkid, nblks); } dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n", blkid, nblks, tx->tx_txg); mutex_exit(&dn->dn_mtx); dbuf_free_range(dn, blkid, blkid + nblks - 1, tx); dnode_setdirty(dn, tx); } static boolean_t dnode_spill_freed(dnode_t *dn) { int i; mutex_enter(&dn->dn_mtx); for (i = 0; i < TXG_SIZE; i++) { if (dn->dn_rm_spillblk[i] == DN_KILL_SPILLBLK) break; } mutex_exit(&dn->dn_mtx); return (i < TXG_SIZE); } /* return TRUE if this blkid was freed in a recent txg, or FALSE if it wasn't */ uint64_t dnode_block_freed(dnode_t *dn, uint64_t blkid) { void *dp = spa_get_dsl(dn->dn_objset->os_spa); int i; if (blkid == DMU_BONUS_BLKID) return (FALSE); /* * If we're in the process of opening the pool, dp will not be * set yet, but there shouldn't be anything dirty. */ if (dp == NULL) return (FALSE); if (dn->dn_free_txg) return (TRUE); if (blkid == DMU_SPILL_BLKID) return (dnode_spill_freed(dn)); mutex_enter(&dn->dn_mtx); for (i = 0; i < TXG_SIZE; i++) { if (dn->dn_free_ranges[i] != NULL && range_tree_contains(dn->dn_free_ranges[i], blkid, 1)) break; } mutex_exit(&dn->dn_mtx); return (i < TXG_SIZE); } /* call from syncing context when we actually write/free space for this dnode */ void dnode_diduse_space(dnode_t *dn, int64_t delta) { uint64_t space; dprintf_dnode(dn, "dn=%p dnp=%p used=%llu delta=%lld\n", dn, dn->dn_phys, (u_longlong_t)dn->dn_phys->dn_used, (longlong_t)delta); mutex_enter(&dn->dn_mtx); space = DN_USED_BYTES(dn->dn_phys); if (delta > 0) { ASSERT3U(space + delta, >=, space); /* no overflow */ } else { ASSERT3U(space, >=, -delta); /* no underflow */ } space += delta; if (spa_version(dn->dn_objset->os_spa) < SPA_VERSION_DNODE_BYTES) { ASSERT((dn->dn_phys->dn_flags & DNODE_FLAG_USED_BYTES) == 0); ASSERT0(P2PHASE(space, 1<dn_phys->dn_used = space >> DEV_BSHIFT; } else { dn->dn_phys->dn_used = space; dn->dn_phys->dn_flags |= DNODE_FLAG_USED_BYTES; } mutex_exit(&dn->dn_mtx); } /* * Scans a block at the indicated "level" looking for a hole or data, * depending on 'flags'. * * If level > 0, then we are scanning an indirect block looking at its * pointers. If level == 0, then we are looking at a block of dnodes. * * If we don't find what we are looking for in the block, we return ESRCH. * Otherwise, return with *offset pointing to the beginning (if searching * forwards) or end (if searching backwards) of the range covered by the * block pointer we matched on (or dnode). * * The basic search algorithm used below by dnode_next_offset() is to * use this function to search up the block tree (widen the search) until * we find something (i.e., we don't return ESRCH) and then search back * down the tree (narrow the search) until we reach our original search * level. */ static int dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset, int lvl, uint64_t blkfill, uint64_t txg) { dmu_buf_impl_t *db = NULL; void *data = NULL; uint64_t epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT; uint64_t epb = 1ULL << epbs; uint64_t minfill, maxfill; boolean_t hole; int i, inc, error, span; ASSERT(RW_LOCK_HELD(&dn->dn_struct_rwlock)); hole = ((flags & DNODE_FIND_HOLE) != 0); inc = (flags & DNODE_FIND_BACKWARDS) ? -1 : 1; ASSERT(txg == 0 || !hole); if (lvl == dn->dn_phys->dn_nlevels) { error = 0; epb = dn->dn_phys->dn_nblkptr; data = dn->dn_phys->dn_blkptr; } else { uint64_t blkid = dbuf_whichblock(dn, lvl, *offset); error = dbuf_hold_impl(dn, lvl, blkid, TRUE, FALSE, FTAG, &db); if (error) { if (error != ENOENT) return (error); if (hole) return (0); /* * This can only happen when we are searching up * the block tree for data. We don't really need to * adjust the offset, as we will just end up looking * at the pointer to this block in its parent, and its * going to be unallocated, so we will skip over it. */ return (SET_ERROR(ESRCH)); } error = dbuf_read(db, NULL, - DB_RF_CANFAIL | DB_RF_HAVESTRUCT | DB_RF_NO_DECRYPT); + DB_RF_CANFAIL | DB_RF_HAVESTRUCT | + DB_RF_NO_DECRYPT | DB_RF_NOPREFETCH); if (error) { dbuf_rele(db, FTAG); return (error); } data = db->db.db_data; rw_enter(&db->db_rwlock, RW_READER); } if (db != NULL && txg != 0 && (db->db_blkptr == NULL || db->db_blkptr->blk_birth <= txg || BP_IS_HOLE(db->db_blkptr))) { /* * This can only happen when we are searching up the tree * and these conditions mean that we need to keep climbing. */ error = SET_ERROR(ESRCH); } else if (lvl == 0) { dnode_phys_t *dnp = data; ASSERT(dn->dn_type == DMU_OT_DNODE); ASSERT(!(flags & DNODE_FIND_BACKWARDS)); for (i = (*offset >> DNODE_SHIFT) & (blkfill - 1); i < blkfill; i += dnp[i].dn_extra_slots + 1) { if ((dnp[i].dn_type == DMU_OT_NONE) == hole) break; } if (i == blkfill) error = SET_ERROR(ESRCH); *offset = (*offset & ~(DNODE_BLOCK_SIZE - 1)) + (i << DNODE_SHIFT); } else { blkptr_t *bp = data; uint64_t start = *offset; span = (lvl - 1) * epbs + dn->dn_datablkshift; minfill = 0; maxfill = blkfill << ((lvl - 1) * epbs); if (hole) maxfill--; else minfill++; if (span >= 8 * sizeof (*offset)) { /* This only happens on the highest indirection level */ ASSERT3U((lvl - 1), ==, dn->dn_phys->dn_nlevels - 1); *offset = 0; } else { *offset = *offset >> span; } for (i = BF64_GET(*offset, 0, epbs); i >= 0 && i < epb; i += inc) { if (BP_GET_FILL(&bp[i]) >= minfill && BP_GET_FILL(&bp[i]) <= maxfill && (hole || bp[i].blk_birth > txg)) break; if (inc > 0 || *offset > 0) *offset += inc; } if (span >= 8 * sizeof (*offset)) { *offset = start; } else { *offset = *offset << span; } if (inc < 0) { /* traversing backwards; position offset at the end */ ASSERT3U(*offset, <=, start); *offset = MIN(*offset + (1ULL << span) - 1, start); } else if (*offset < start) { *offset = start; } if (i < 0 || i >= epb) error = SET_ERROR(ESRCH); } if (db != NULL) { rw_exit(&db->db_rwlock); dbuf_rele(db, FTAG); } return (error); } /* * Find the next hole, data, or sparse region at or after *offset. * The value 'blkfill' tells us how many items we expect to find * in an L0 data block; this value is 1 for normal objects, * DNODES_PER_BLOCK for the meta dnode, and some fraction of * DNODES_PER_BLOCK when searching for sparse regions thereof. * * Examples: * * dnode_next_offset(dn, flags, offset, 1, 1, 0); * Finds the next/previous hole/data in a file. * Used in dmu_offset_next(). * * dnode_next_offset(mdn, flags, offset, 0, DNODES_PER_BLOCK, txg); * Finds the next free/allocated dnode an objset's meta-dnode. * Only finds objects that have new contents since txg (ie. * bonus buffer changes and content removal are ignored). * Used in dmu_object_next(). * * dnode_next_offset(mdn, DNODE_FIND_HOLE, offset, 2, DNODES_PER_BLOCK >> 2, 0); * Finds the next L2 meta-dnode bp that's at most 1/4 full. * Used in dmu_object_alloc(). */ int dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset, int minlvl, uint64_t blkfill, uint64_t txg) { uint64_t initial_offset = *offset; int lvl, maxlvl; int error = 0; if (!(flags & DNODE_FIND_HAVELOCK)) rw_enter(&dn->dn_struct_rwlock, RW_READER); if (dn->dn_phys->dn_nlevels == 0) { error = SET_ERROR(ESRCH); goto out; } if (dn->dn_datablkshift == 0) { if (*offset < dn->dn_datablksz) { if (flags & DNODE_FIND_HOLE) *offset = dn->dn_datablksz; } else { error = SET_ERROR(ESRCH); } goto out; } maxlvl = dn->dn_phys->dn_nlevels; for (lvl = minlvl; lvl <= maxlvl; lvl++) { error = dnode_next_offset_level(dn, flags, offset, lvl, blkfill, txg); if (error != ESRCH) break; } while (error == 0 && --lvl >= minlvl) { error = dnode_next_offset_level(dn, flags, offset, lvl, blkfill, txg); } /* * There's always a "virtual hole" at the end of the object, even * if all BP's which physically exist are non-holes. */ if ((flags & DNODE_FIND_HOLE) && error == ESRCH && txg == 0 && minlvl == 1 && blkfill == 1 && !(flags & DNODE_FIND_BACKWARDS)) { error = 0; } if (error == 0 && (flags & DNODE_FIND_BACKWARDS ? initial_offset < *offset : initial_offset > *offset)) error = SET_ERROR(ESRCH); out: if (!(flags & DNODE_FIND_HAVELOCK)) rw_exit(&dn->dn_struct_rwlock); return (error); } #if defined(_KERNEL) EXPORT_SYMBOL(dnode_hold); EXPORT_SYMBOL(dnode_rele); EXPORT_SYMBOL(dnode_set_nlevels); EXPORT_SYMBOL(dnode_set_blksz); EXPORT_SYMBOL(dnode_free_range); EXPORT_SYMBOL(dnode_evict_dbufs); EXPORT_SYMBOL(dnode_evict_bonus); #endif Index: vendor-sys/openzfs/dist/module/zfs/dsl_crypt.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/dsl_crypt.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/dsl_crypt.c (revision 366348) @@ -1,2872 +1,2868 @@ /* * CDDL HEADER START * * This file and its contents are supplied under the terms of the * Common Development and Distribution License ("CDDL"), version 1.0. * You may only use this file in accordance with the terms of version * 1.0 of the CDDL. * * A full copy of the text of the CDDL should have accompanied this * source. A copy of the CDDL is also available via the Internet at * http://www.illumos.org/license/CDDL. * * CDDL HEADER END */ /* * Copyright (c) 2017, Datto, Inc. All rights reserved. * Copyright (c) 2018 by Delphix. All rights reserved. */ #include #include #include #include #include #include #include #include #include /* * This file's primary purpose is for managing master encryption keys in * memory and on disk. For more info on how these keys are used, see the * block comment in zio_crypt.c. * * All master keys are stored encrypted on disk in the form of the DSL * Crypto Key ZAP object. The binary key data in this object is always * randomly generated and is encrypted with the user's wrapping key. This * layer of indirection allows the user to change their key without * needing to re-encrypt the entire dataset. The ZAP also holds on to the * (non-encrypted) encryption algorithm identifier, IV, and MAC needed to * safely decrypt the master key. For more info on the user's key see the * block comment in libzfs_crypto.c * * In-memory encryption keys are managed through the spa_keystore. The * keystore consists of 3 AVL trees, which are as follows: * * The Wrapping Key Tree: * The wrapping key (wkey) tree stores the user's keys that are fed into the * kernel through 'zfs load-key' and related commands. Datasets inherit their * parent's wkey by default, so these structures are refcounted. The wrapping * keys remain in memory until they are explicitly unloaded (with * "zfs unload-key"). Unloading is only possible when no datasets are using * them (refcount=0). * * The DSL Crypto Key Tree: * The DSL Crypto Keys (DCK) are the in-memory representation of decrypted * master keys. They are used by the functions in zio_crypt.c to perform * encryption, decryption, and authentication. Snapshots and clones of a given * dataset will share a DSL Crypto Key, so they are also refcounted. Once the * refcount on a key hits zero, it is immediately zeroed out and freed. * * The Crypto Key Mapping Tree: * The zio layer needs to lookup master keys by their dataset object id. Since * the DSL Crypto Keys can belong to multiple datasets, we maintain a tree of * dsl_key_mapping_t's which essentially just map the dataset object id to its * appropriate DSL Crypto Key. The management for creating and destroying these * mappings hooks into the code for owning and disowning datasets. Usually, * there will only be one active dataset owner, but there are times * (particularly during dataset creation and destruction) when this may not be * true or the dataset may not be initialized enough to own. As a result, this * object is also refcounted. */ /* * This tunable allows datasets to be raw received even if the stream does * not include IVset guids or if the guids don't match. This is used as part * of the resolution for ZPOOL_ERRATA_ZOL_8308_ENCRYPTION. */ int zfs_disable_ivset_guid_check = 0; static void dsl_wrapping_key_hold(dsl_wrapping_key_t *wkey, void *tag) { (void) zfs_refcount_add(&wkey->wk_refcnt, tag); } static void dsl_wrapping_key_rele(dsl_wrapping_key_t *wkey, void *tag) { (void) zfs_refcount_remove(&wkey->wk_refcnt, tag); } static void dsl_wrapping_key_free(dsl_wrapping_key_t *wkey) { ASSERT0(zfs_refcount_count(&wkey->wk_refcnt)); if (wkey->wk_key.ck_data) { bzero(wkey->wk_key.ck_data, CRYPTO_BITS2BYTES(wkey->wk_key.ck_length)); kmem_free(wkey->wk_key.ck_data, CRYPTO_BITS2BYTES(wkey->wk_key.ck_length)); } zfs_refcount_destroy(&wkey->wk_refcnt); kmem_free(wkey, sizeof (dsl_wrapping_key_t)); } static void dsl_wrapping_key_create(uint8_t *wkeydata, zfs_keyformat_t keyformat, uint64_t salt, uint64_t iters, dsl_wrapping_key_t **wkey_out) { dsl_wrapping_key_t *wkey; /* allocate the wrapping key */ wkey = kmem_alloc(sizeof (dsl_wrapping_key_t), KM_SLEEP); /* allocate and initialize the underlying crypto key */ wkey->wk_key.ck_data = kmem_alloc(WRAPPING_KEY_LEN, KM_SLEEP); wkey->wk_key.ck_format = CRYPTO_KEY_RAW; wkey->wk_key.ck_length = CRYPTO_BYTES2BITS(WRAPPING_KEY_LEN); bcopy(wkeydata, wkey->wk_key.ck_data, WRAPPING_KEY_LEN); /* initialize the rest of the struct */ zfs_refcount_create(&wkey->wk_refcnt); wkey->wk_keyformat = keyformat; wkey->wk_salt = salt; wkey->wk_iters = iters; *wkey_out = wkey; } int dsl_crypto_params_create_nvlist(dcp_cmd_t cmd, nvlist_t *props, nvlist_t *crypto_args, dsl_crypto_params_t **dcp_out) { int ret; uint64_t crypt = ZIO_CRYPT_INHERIT; uint64_t keyformat = ZFS_KEYFORMAT_NONE; uint64_t salt = 0, iters = 0; dsl_crypto_params_t *dcp = NULL; dsl_wrapping_key_t *wkey = NULL; uint8_t *wkeydata = NULL; uint_t wkeydata_len = 0; char *keylocation = NULL; dcp = kmem_zalloc(sizeof (dsl_crypto_params_t), KM_SLEEP); dcp->cp_cmd = cmd; /* get relevant arguments from the nvlists */ if (props != NULL) { (void) nvlist_lookup_uint64(props, zfs_prop_to_name(ZFS_PROP_ENCRYPTION), &crypt); (void) nvlist_lookup_uint64(props, zfs_prop_to_name(ZFS_PROP_KEYFORMAT), &keyformat); (void) nvlist_lookup_string(props, zfs_prop_to_name(ZFS_PROP_KEYLOCATION), &keylocation); (void) nvlist_lookup_uint64(props, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), &salt); (void) nvlist_lookup_uint64(props, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), &iters); dcp->cp_crypt = crypt; } if (crypto_args != NULL) { (void) nvlist_lookup_uint8_array(crypto_args, "wkeydata", &wkeydata, &wkeydata_len); } /* check for valid command */ if (dcp->cp_cmd >= DCP_CMD_MAX) { ret = SET_ERROR(EINVAL); goto error; } else { dcp->cp_cmd = cmd; } /* check for valid crypt */ if (dcp->cp_crypt >= ZIO_CRYPT_FUNCTIONS) { ret = SET_ERROR(EINVAL); goto error; } else { dcp->cp_crypt = crypt; } /* check for valid keyformat */ if (keyformat >= ZFS_KEYFORMAT_FORMATS) { ret = SET_ERROR(EINVAL); goto error; } /* check for a valid keylocation (of any kind) and copy it in */ if (keylocation != NULL) { if (!zfs_prop_valid_keylocation(keylocation, B_FALSE)) { ret = SET_ERROR(EINVAL); goto error; } dcp->cp_keylocation = spa_strdup(keylocation); } /* check wrapping key length, if given */ if (wkeydata != NULL && wkeydata_len != WRAPPING_KEY_LEN) { ret = SET_ERROR(EINVAL); goto error; } /* if the user asked for the default crypt, determine that now */ if (dcp->cp_crypt == ZIO_CRYPT_ON) dcp->cp_crypt = ZIO_CRYPT_ON_VALUE; /* create the wrapping key from the raw data */ if (wkeydata != NULL) { /* create the wrapping key with the verified parameters */ dsl_wrapping_key_create(wkeydata, keyformat, salt, iters, &wkey); dcp->cp_wkey = wkey; } /* * Remove the encryption properties from the nvlist since they are not * maintained through the DSL. */ (void) nvlist_remove_all(props, zfs_prop_to_name(ZFS_PROP_ENCRYPTION)); (void) nvlist_remove_all(props, zfs_prop_to_name(ZFS_PROP_KEYFORMAT)); (void) nvlist_remove_all(props, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT)); (void) nvlist_remove_all(props, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS)); *dcp_out = dcp; return (0); error: - if (wkey != NULL) - dsl_wrapping_key_free(wkey); - if (dcp != NULL) - kmem_free(dcp, sizeof (dsl_crypto_params_t)); - + kmem_free(dcp, sizeof (dsl_crypto_params_t)); *dcp_out = NULL; return (ret); } void dsl_crypto_params_free(dsl_crypto_params_t *dcp, boolean_t unload) { if (dcp == NULL) return; if (dcp->cp_keylocation != NULL) spa_strfree(dcp->cp_keylocation); if (unload && dcp->cp_wkey != NULL) dsl_wrapping_key_free(dcp->cp_wkey); kmem_free(dcp, sizeof (dsl_crypto_params_t)); } static int spa_crypto_key_compare(const void *a, const void *b) { const dsl_crypto_key_t *dcka = a; const dsl_crypto_key_t *dckb = b; if (dcka->dck_obj < dckb->dck_obj) return (-1); if (dcka->dck_obj > dckb->dck_obj) return (1); return (0); } static int spa_key_mapping_compare(const void *a, const void *b) { const dsl_key_mapping_t *kma = a; const dsl_key_mapping_t *kmb = b; if (kma->km_dsobj < kmb->km_dsobj) return (-1); if (kma->km_dsobj > kmb->km_dsobj) return (1); return (0); } static int spa_wkey_compare(const void *a, const void *b) { const dsl_wrapping_key_t *wka = a; const dsl_wrapping_key_t *wkb = b; if (wka->wk_ddobj < wkb->wk_ddobj) return (-1); if (wka->wk_ddobj > wkb->wk_ddobj) return (1); return (0); } void spa_keystore_init(spa_keystore_t *sk) { rw_init(&sk->sk_dk_lock, NULL, RW_DEFAULT, NULL); rw_init(&sk->sk_km_lock, NULL, RW_DEFAULT, NULL); rw_init(&sk->sk_wkeys_lock, NULL, RW_DEFAULT, NULL); avl_create(&sk->sk_dsl_keys, spa_crypto_key_compare, sizeof (dsl_crypto_key_t), offsetof(dsl_crypto_key_t, dck_avl_link)); avl_create(&sk->sk_key_mappings, spa_key_mapping_compare, sizeof (dsl_key_mapping_t), offsetof(dsl_key_mapping_t, km_avl_link)); avl_create(&sk->sk_wkeys, spa_wkey_compare, sizeof (dsl_wrapping_key_t), offsetof(dsl_wrapping_key_t, wk_avl_link)); } void spa_keystore_fini(spa_keystore_t *sk) { dsl_wrapping_key_t *wkey; void *cookie = NULL; ASSERT(avl_is_empty(&sk->sk_dsl_keys)); ASSERT(avl_is_empty(&sk->sk_key_mappings)); while ((wkey = avl_destroy_nodes(&sk->sk_wkeys, &cookie)) != NULL) dsl_wrapping_key_free(wkey); avl_destroy(&sk->sk_wkeys); avl_destroy(&sk->sk_key_mappings); avl_destroy(&sk->sk_dsl_keys); rw_destroy(&sk->sk_wkeys_lock); rw_destroy(&sk->sk_km_lock); rw_destroy(&sk->sk_dk_lock); } static int dsl_dir_get_encryption_root_ddobj(dsl_dir_t *dd, uint64_t *rddobj) { if (dd->dd_crypto_obj == 0) return (SET_ERROR(ENOENT)); return (zap_lookup(dd->dd_pool->dp_meta_objset, dd->dd_crypto_obj, DSL_CRYPTO_KEY_ROOT_DDOBJ, 8, 1, rddobj)); } static int dsl_dir_get_encryption_version(dsl_dir_t *dd, uint64_t *version) { *version = 0; if (dd->dd_crypto_obj == 0) return (SET_ERROR(ENOENT)); /* version 0 is implied by ENOENT */ (void) zap_lookup(dd->dd_pool->dp_meta_objset, dd->dd_crypto_obj, DSL_CRYPTO_KEY_VERSION, 8, 1, version); return (0); } boolean_t dsl_dir_incompatible_encryption_version(dsl_dir_t *dd) { int ret; uint64_t version = 0; ret = dsl_dir_get_encryption_version(dd, &version); if (ret != 0) return (B_FALSE); return (version != ZIO_CRYPT_KEY_CURRENT_VERSION); } static int spa_keystore_wkey_hold_ddobj_impl(spa_t *spa, uint64_t ddobj, void *tag, dsl_wrapping_key_t **wkey_out) { int ret; dsl_wrapping_key_t search_wkey; dsl_wrapping_key_t *found_wkey; ASSERT(RW_LOCK_HELD(&spa->spa_keystore.sk_wkeys_lock)); /* init the search wrapping key */ search_wkey.wk_ddobj = ddobj; /* lookup the wrapping key */ found_wkey = avl_find(&spa->spa_keystore.sk_wkeys, &search_wkey, NULL); if (!found_wkey) { ret = SET_ERROR(ENOENT); goto error; } /* increment the refcount */ dsl_wrapping_key_hold(found_wkey, tag); *wkey_out = found_wkey; return (0); error: *wkey_out = NULL; return (ret); } static int spa_keystore_wkey_hold_dd(spa_t *spa, dsl_dir_t *dd, void *tag, dsl_wrapping_key_t **wkey_out) { int ret; dsl_wrapping_key_t *wkey; uint64_t rddobj; boolean_t locked = B_FALSE; if (!RW_WRITE_HELD(&spa->spa_keystore.sk_wkeys_lock)) { rw_enter(&spa->spa_keystore.sk_wkeys_lock, RW_READER); locked = B_TRUE; } /* get the ddobj that the keylocation property was inherited from */ ret = dsl_dir_get_encryption_root_ddobj(dd, &rddobj); if (ret != 0) goto error; /* lookup the wkey in the avl tree */ ret = spa_keystore_wkey_hold_ddobj_impl(spa, rddobj, tag, &wkey); if (ret != 0) goto error; /* unlock the wkey tree if we locked it */ if (locked) rw_exit(&spa->spa_keystore.sk_wkeys_lock); *wkey_out = wkey; return (0); error: if (locked) rw_exit(&spa->spa_keystore.sk_wkeys_lock); *wkey_out = NULL; return (ret); } int dsl_crypto_can_set_keylocation(const char *dsname, const char *keylocation) { int ret = 0; dsl_dir_t *dd = NULL; dsl_pool_t *dp = NULL; uint64_t rddobj; /* hold the dsl dir */ ret = dsl_pool_hold(dsname, FTAG, &dp); if (ret != 0) goto out; ret = dsl_dir_hold(dp, dsname, FTAG, &dd, NULL); if (ret != 0) { dd = NULL; goto out; } /* if dd is not encrypted, the value may only be "none" */ if (dd->dd_crypto_obj == 0) { if (strcmp(keylocation, "none") != 0) { ret = SET_ERROR(EACCES); goto out; } ret = 0; goto out; } /* check for a valid keylocation for encrypted datasets */ if (!zfs_prop_valid_keylocation(keylocation, B_TRUE)) { ret = SET_ERROR(EINVAL); goto out; } /* check that this is an encryption root */ ret = dsl_dir_get_encryption_root_ddobj(dd, &rddobj); if (ret != 0) goto out; if (rddobj != dd->dd_object) { ret = SET_ERROR(EACCES); goto out; } dsl_dir_rele(dd, FTAG); dsl_pool_rele(dp, FTAG); return (0); out: if (dd != NULL) dsl_dir_rele(dd, FTAG); if (dp != NULL) dsl_pool_rele(dp, FTAG); return (ret); } static void dsl_crypto_key_free(dsl_crypto_key_t *dck) { ASSERT(zfs_refcount_count(&dck->dck_holds) == 0); /* destroy the zio_crypt_key_t */ zio_crypt_key_destroy(&dck->dck_key); /* free the refcount, wrapping key, and lock */ zfs_refcount_destroy(&dck->dck_holds); if (dck->dck_wkey) dsl_wrapping_key_rele(dck->dck_wkey, dck); /* free the key */ kmem_free(dck, sizeof (dsl_crypto_key_t)); } static void dsl_crypto_key_rele(dsl_crypto_key_t *dck, void *tag) { if (zfs_refcount_remove(&dck->dck_holds, tag) == 0) dsl_crypto_key_free(dck); } static int dsl_crypto_key_open(objset_t *mos, dsl_wrapping_key_t *wkey, uint64_t dckobj, void *tag, dsl_crypto_key_t **dck_out) { int ret; uint64_t crypt = 0, guid = 0, version = 0; uint8_t raw_keydata[MASTER_KEY_MAX_LEN]; uint8_t raw_hmac_keydata[SHA512_HMAC_KEYLEN]; uint8_t iv[WRAPPING_IV_LEN]; uint8_t mac[WRAPPING_MAC_LEN]; dsl_crypto_key_t *dck; /* allocate and initialize the key */ dck = kmem_zalloc(sizeof (dsl_crypto_key_t), KM_SLEEP); /* fetch all of the values we need from the ZAP */ ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_CRYPTO_SUITE, 8, 1, &crypt); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_GUID, 8, 1, &guid); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_MASTER_KEY, 1, MASTER_KEY_MAX_LEN, raw_keydata); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_HMAC_KEY, 1, SHA512_HMAC_KEYLEN, raw_hmac_keydata); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_IV, 1, WRAPPING_IV_LEN, iv); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_MAC, 1, WRAPPING_MAC_LEN, mac); if (ret != 0) goto error; /* the initial on-disk format for encryption did not have a version */ (void) zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_VERSION, 8, 1, &version); /* * Unwrap the keys. If there is an error return EACCES to indicate * an authentication failure. */ ret = zio_crypt_key_unwrap(&wkey->wk_key, crypt, version, guid, raw_keydata, raw_hmac_keydata, iv, mac, &dck->dck_key); if (ret != 0) { ret = SET_ERROR(EACCES); goto error; } /* finish initializing the dsl_crypto_key_t */ zfs_refcount_create(&dck->dck_holds); dsl_wrapping_key_hold(wkey, dck); dck->dck_wkey = wkey; dck->dck_obj = dckobj; zfs_refcount_add(&dck->dck_holds, tag); *dck_out = dck; return (0); error: if (dck != NULL) { bzero(dck, sizeof (dsl_crypto_key_t)); kmem_free(dck, sizeof (dsl_crypto_key_t)); } *dck_out = NULL; return (ret); } static int spa_keystore_dsl_key_hold_impl(spa_t *spa, uint64_t dckobj, void *tag, dsl_crypto_key_t **dck_out) { int ret; dsl_crypto_key_t search_dck; dsl_crypto_key_t *found_dck; ASSERT(RW_LOCK_HELD(&spa->spa_keystore.sk_dk_lock)); /* init the search key */ search_dck.dck_obj = dckobj; /* find the matching key in the keystore */ found_dck = avl_find(&spa->spa_keystore.sk_dsl_keys, &search_dck, NULL); if (!found_dck) { ret = SET_ERROR(ENOENT); goto error; } /* increment the refcount */ zfs_refcount_add(&found_dck->dck_holds, tag); *dck_out = found_dck; return (0); error: *dck_out = NULL; return (ret); } static int spa_keystore_dsl_key_hold_dd(spa_t *spa, dsl_dir_t *dd, void *tag, dsl_crypto_key_t **dck_out) { int ret; avl_index_t where; dsl_crypto_key_t *dck_io = NULL, *dck_ks = NULL; dsl_wrapping_key_t *wkey = NULL; uint64_t dckobj = dd->dd_crypto_obj; /* Lookup the key in the tree of currently loaded keys */ rw_enter(&spa->spa_keystore.sk_dk_lock, RW_READER); ret = spa_keystore_dsl_key_hold_impl(spa, dckobj, tag, &dck_ks); rw_exit(&spa->spa_keystore.sk_dk_lock); if (ret == 0) { *dck_out = dck_ks; return (0); } /* Lookup the wrapping key from the keystore */ ret = spa_keystore_wkey_hold_dd(spa, dd, FTAG, &wkey); if (ret != 0) { *dck_out = NULL; return (SET_ERROR(EACCES)); } /* Read the key from disk */ ret = dsl_crypto_key_open(spa->spa_meta_objset, wkey, dckobj, tag, &dck_io); if (ret != 0) { dsl_wrapping_key_rele(wkey, FTAG); *dck_out = NULL; return (ret); } /* * Add the key to the keystore. It may already exist if it was * added while performing the read from disk. In this case discard * it and return the key from the keystore. */ rw_enter(&spa->spa_keystore.sk_dk_lock, RW_WRITER); ret = spa_keystore_dsl_key_hold_impl(spa, dckobj, tag, &dck_ks); if (ret != 0) { avl_find(&spa->spa_keystore.sk_dsl_keys, dck_io, &where); avl_insert(&spa->spa_keystore.sk_dsl_keys, dck_io, where); *dck_out = dck_io; } else { dsl_crypto_key_free(dck_io); *dck_out = dck_ks; } /* Release the wrapping key (the dsl key now has a reference to it) */ dsl_wrapping_key_rele(wkey, FTAG); rw_exit(&spa->spa_keystore.sk_dk_lock); return (0); } void spa_keystore_dsl_key_rele(spa_t *spa, dsl_crypto_key_t *dck, void *tag) { rw_enter(&spa->spa_keystore.sk_dk_lock, RW_WRITER); if (zfs_refcount_remove(&dck->dck_holds, tag) == 0) { avl_remove(&spa->spa_keystore.sk_dsl_keys, dck); dsl_crypto_key_free(dck); } rw_exit(&spa->spa_keystore.sk_dk_lock); } int spa_keystore_load_wkey_impl(spa_t *spa, dsl_wrapping_key_t *wkey) { int ret; avl_index_t where; dsl_wrapping_key_t *found_wkey; rw_enter(&spa->spa_keystore.sk_wkeys_lock, RW_WRITER); /* insert the wrapping key into the keystore */ found_wkey = avl_find(&spa->spa_keystore.sk_wkeys, wkey, &where); if (found_wkey != NULL) { ret = SET_ERROR(EEXIST); goto error_unlock; } avl_insert(&spa->spa_keystore.sk_wkeys, wkey, where); rw_exit(&spa->spa_keystore.sk_wkeys_lock); return (0); error_unlock: rw_exit(&spa->spa_keystore.sk_wkeys_lock); return (ret); } int spa_keystore_load_wkey(const char *dsname, dsl_crypto_params_t *dcp, boolean_t noop) { int ret; dsl_dir_t *dd = NULL; dsl_crypto_key_t *dck = NULL; dsl_wrapping_key_t *wkey = dcp->cp_wkey; dsl_pool_t *dp = NULL; uint64_t rddobj, keyformat, salt, iters; /* * We don't validate the wrapping key's keyformat, salt, or iters * since they will never be needed after the DCK has been wrapped. */ if (dcp->cp_wkey == NULL || dcp->cp_cmd != DCP_CMD_NONE || dcp->cp_crypt != ZIO_CRYPT_INHERIT || dcp->cp_keylocation != NULL) return (SET_ERROR(EINVAL)); ret = dsl_pool_hold(dsname, FTAG, &dp); if (ret != 0) goto error; if (!spa_feature_is_enabled(dp->dp_spa, SPA_FEATURE_ENCRYPTION)) { ret = SET_ERROR(ENOTSUP); goto error; } /* hold the dsl dir */ ret = dsl_dir_hold(dp, dsname, FTAG, &dd, NULL); if (ret != 0) { dd = NULL; goto error; } /* confirm that dd is the encryption root */ ret = dsl_dir_get_encryption_root_ddobj(dd, &rddobj); if (ret != 0 || rddobj != dd->dd_object) { ret = SET_ERROR(EINVAL); goto error; } /* initialize the wkey's ddobj */ wkey->wk_ddobj = dd->dd_object; /* verify that the wkey is correct by opening its dsl key */ ret = dsl_crypto_key_open(dp->dp_meta_objset, wkey, dd->dd_crypto_obj, FTAG, &dck); if (ret != 0) goto error; /* initialize the wkey encryption parameters from the DSL Crypto Key */ ret = zap_lookup(dp->dp_meta_objset, dd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_KEYFORMAT), 8, 1, &keyformat); if (ret != 0) goto error; ret = zap_lookup(dp->dp_meta_objset, dd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), 8, 1, &salt); if (ret != 0) goto error; ret = zap_lookup(dp->dp_meta_objset, dd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), 8, 1, &iters); if (ret != 0) goto error; ASSERT3U(keyformat, <, ZFS_KEYFORMAT_FORMATS); ASSERT3U(keyformat, !=, ZFS_KEYFORMAT_NONE); IMPLY(keyformat == ZFS_KEYFORMAT_PASSPHRASE, iters != 0); IMPLY(keyformat == ZFS_KEYFORMAT_PASSPHRASE, salt != 0); IMPLY(keyformat != ZFS_KEYFORMAT_PASSPHRASE, iters == 0); IMPLY(keyformat != ZFS_KEYFORMAT_PASSPHRASE, salt == 0); wkey->wk_keyformat = keyformat; wkey->wk_salt = salt; wkey->wk_iters = iters; /* * At this point we have verified the wkey and confirmed that it can * be used to decrypt a DSL Crypto Key. We can simply cleanup and * return if this is all the user wanted to do. */ if (noop) goto error; /* insert the wrapping key into the keystore */ ret = spa_keystore_load_wkey_impl(dp->dp_spa, wkey); if (ret != 0) goto error; dsl_crypto_key_rele(dck, FTAG); dsl_dir_rele(dd, FTAG); dsl_pool_rele(dp, FTAG); /* create any zvols under this ds */ zvol_create_minors_recursive(dsname); return (0); error: if (dck != NULL) dsl_crypto_key_rele(dck, FTAG); if (dd != NULL) dsl_dir_rele(dd, FTAG); if (dp != NULL) dsl_pool_rele(dp, FTAG); return (ret); } int spa_keystore_unload_wkey_impl(spa_t *spa, uint64_t ddobj) { int ret; dsl_wrapping_key_t search_wkey; dsl_wrapping_key_t *found_wkey; /* init the search wrapping key */ search_wkey.wk_ddobj = ddobj; rw_enter(&spa->spa_keystore.sk_wkeys_lock, RW_WRITER); /* remove the wrapping key from the keystore */ found_wkey = avl_find(&spa->spa_keystore.sk_wkeys, &search_wkey, NULL); if (!found_wkey) { ret = SET_ERROR(EACCES); goto error_unlock; } else if (zfs_refcount_count(&found_wkey->wk_refcnt) != 0) { ret = SET_ERROR(EBUSY); goto error_unlock; } avl_remove(&spa->spa_keystore.sk_wkeys, found_wkey); rw_exit(&spa->spa_keystore.sk_wkeys_lock); /* free the wrapping key */ dsl_wrapping_key_free(found_wkey); return (0); error_unlock: rw_exit(&spa->spa_keystore.sk_wkeys_lock); return (ret); } int spa_keystore_unload_wkey(const char *dsname) { int ret = 0; dsl_dir_t *dd = NULL; dsl_pool_t *dp = NULL; spa_t *spa = NULL; ret = spa_open(dsname, &spa, FTAG); if (ret != 0) return (ret); /* * Wait for any outstanding txg IO to complete, releasing any * remaining references on the wkey. */ if (spa_mode(spa) != SPA_MODE_READ) txg_wait_synced(spa->spa_dsl_pool, 0); spa_close(spa, FTAG); /* hold the dsl dir */ ret = dsl_pool_hold(dsname, FTAG, &dp); if (ret != 0) goto error; if (!spa_feature_is_enabled(dp->dp_spa, SPA_FEATURE_ENCRYPTION)) { ret = (SET_ERROR(ENOTSUP)); goto error; } ret = dsl_dir_hold(dp, dsname, FTAG, &dd, NULL); if (ret != 0) { dd = NULL; goto error; } /* unload the wkey */ ret = spa_keystore_unload_wkey_impl(dp->dp_spa, dd->dd_object); if (ret != 0) goto error; dsl_dir_rele(dd, FTAG); dsl_pool_rele(dp, FTAG); /* remove any zvols under this ds */ zvol_remove_minors(dp->dp_spa, dsname, B_TRUE); return (0); error: if (dd != NULL) dsl_dir_rele(dd, FTAG); if (dp != NULL) dsl_pool_rele(dp, FTAG); return (ret); } void key_mapping_add_ref(dsl_key_mapping_t *km, void *tag) { ASSERT3U(zfs_refcount_count(&km->km_refcnt), >=, 1); zfs_refcount_add(&km->km_refcnt, tag); } /* * The locking here is a little tricky to ensure we don't cause unnecessary * performance problems. We want to release a key mapping whenever someone * decrements the refcount to 0, but freeing the mapping requires removing * it from the spa_keystore, which requires holding sk_km_lock as a writer. * Most of the time we don't want to hold this lock as a writer, since the * same lock is held as a reader for each IO that needs to encrypt / decrypt * data for any dataset and in practice we will only actually free the * mapping after unmounting a dataset. */ void key_mapping_rele(spa_t *spa, dsl_key_mapping_t *km, void *tag) { ASSERT3U(zfs_refcount_count(&km->km_refcnt), >=, 1); if (zfs_refcount_remove(&km->km_refcnt, tag) != 0) return; /* * We think we are going to need to free the mapping. Add a * reference to prevent most other releasers from thinking * this might be their responsibility. This is inherently * racy, so we will confirm that we are legitimately the * last holder once we have the sk_km_lock as a writer. */ zfs_refcount_add(&km->km_refcnt, FTAG); rw_enter(&spa->spa_keystore.sk_km_lock, RW_WRITER); if (zfs_refcount_remove(&km->km_refcnt, FTAG) != 0) { rw_exit(&spa->spa_keystore.sk_km_lock); return; } avl_remove(&spa->spa_keystore.sk_key_mappings, km); rw_exit(&spa->spa_keystore.sk_km_lock); spa_keystore_dsl_key_rele(spa, km->km_key, km); zfs_refcount_destroy(&km->km_refcnt); kmem_free(km, sizeof (dsl_key_mapping_t)); } int spa_keystore_create_mapping(spa_t *spa, dsl_dataset_t *ds, void *tag, dsl_key_mapping_t **km_out) { int ret; avl_index_t where; dsl_key_mapping_t *km, *found_km; boolean_t should_free = B_FALSE; /* Allocate and initialize the mapping */ km = kmem_zalloc(sizeof (dsl_key_mapping_t), KM_SLEEP); zfs_refcount_create(&km->km_refcnt); ret = spa_keystore_dsl_key_hold_dd(spa, ds->ds_dir, km, &km->km_key); if (ret != 0) { zfs_refcount_destroy(&km->km_refcnt); kmem_free(km, sizeof (dsl_key_mapping_t)); if (km_out != NULL) *km_out = NULL; return (ret); } km->km_dsobj = ds->ds_object; rw_enter(&spa->spa_keystore.sk_km_lock, RW_WRITER); /* * If a mapping already exists, simply increment its refcount and * cleanup the one we made. We want to allocate / free outside of * the lock because this lock is also used by the zio layer to lookup * key mappings. Otherwise, use the one we created. Normally, there will * only be one active reference at a time (the objset owner), but there * are times when there could be multiple async users. */ found_km = avl_find(&spa->spa_keystore.sk_key_mappings, km, &where); if (found_km != NULL) { should_free = B_TRUE; zfs_refcount_add(&found_km->km_refcnt, tag); if (km_out != NULL) *km_out = found_km; } else { zfs_refcount_add(&km->km_refcnt, tag); avl_insert(&spa->spa_keystore.sk_key_mappings, km, where); if (km_out != NULL) *km_out = km; } rw_exit(&spa->spa_keystore.sk_km_lock); if (should_free) { spa_keystore_dsl_key_rele(spa, km->km_key, km); zfs_refcount_destroy(&km->km_refcnt); kmem_free(km, sizeof (dsl_key_mapping_t)); } return (0); } int spa_keystore_remove_mapping(spa_t *spa, uint64_t dsobj, void *tag) { int ret; dsl_key_mapping_t search_km; dsl_key_mapping_t *found_km; /* init the search key mapping */ search_km.km_dsobj = dsobj; rw_enter(&spa->spa_keystore.sk_km_lock, RW_READER); /* find the matching mapping */ found_km = avl_find(&spa->spa_keystore.sk_key_mappings, &search_km, NULL); if (found_km == NULL) { ret = SET_ERROR(ENOENT); goto error_unlock; } rw_exit(&spa->spa_keystore.sk_km_lock); key_mapping_rele(spa, found_km, tag); return (0); error_unlock: rw_exit(&spa->spa_keystore.sk_km_lock); return (ret); } /* * This function is primarily used by the zio and arc layer to lookup * DSL Crypto Keys for encryption. Callers must release the key with * spa_keystore_dsl_key_rele(). The function may also be called with * dck_out == NULL and tag == NULL to simply check that a key exists * without getting a reference to it. */ int spa_keystore_lookup_key(spa_t *spa, uint64_t dsobj, void *tag, dsl_crypto_key_t **dck_out) { int ret; dsl_key_mapping_t search_km; dsl_key_mapping_t *found_km; ASSERT((tag != NULL && dck_out != NULL) || (tag == NULL && dck_out == NULL)); /* init the search key mapping */ search_km.km_dsobj = dsobj; rw_enter(&spa->spa_keystore.sk_km_lock, RW_READER); /* remove the mapping from the tree */ found_km = avl_find(&spa->spa_keystore.sk_key_mappings, &search_km, NULL); if (found_km == NULL) { ret = SET_ERROR(ENOENT); goto error_unlock; } if (found_km && tag) zfs_refcount_add(&found_km->km_key->dck_holds, tag); rw_exit(&spa->spa_keystore.sk_km_lock); if (dck_out != NULL) *dck_out = found_km->km_key; return (0); error_unlock: rw_exit(&spa->spa_keystore.sk_km_lock); if (dck_out != NULL) *dck_out = NULL; return (ret); } static int dmu_objset_check_wkey_loaded(dsl_dir_t *dd) { int ret; dsl_wrapping_key_t *wkey = NULL; ret = spa_keystore_wkey_hold_dd(dd->dd_pool->dp_spa, dd, FTAG, &wkey); if (ret != 0) return (SET_ERROR(EACCES)); dsl_wrapping_key_rele(wkey, FTAG); return (0); } static zfs_keystatus_t dsl_dataset_get_keystatus(dsl_dir_t *dd) { /* check if this dd has a has a dsl key */ if (dd->dd_crypto_obj == 0) return (ZFS_KEYSTATUS_NONE); return (dmu_objset_check_wkey_loaded(dd) == 0 ? ZFS_KEYSTATUS_AVAILABLE : ZFS_KEYSTATUS_UNAVAILABLE); } static int dsl_dir_get_crypt(dsl_dir_t *dd, uint64_t *crypt) { if (dd->dd_crypto_obj == 0) { *crypt = ZIO_CRYPT_OFF; return (0); } return (zap_lookup(dd->dd_pool->dp_meta_objset, dd->dd_crypto_obj, DSL_CRYPTO_KEY_CRYPTO_SUITE, 8, 1, crypt)); } static void dsl_crypto_key_sync_impl(objset_t *mos, uint64_t dckobj, uint64_t crypt, uint64_t root_ddobj, uint64_t guid, uint8_t *iv, uint8_t *mac, uint8_t *keydata, uint8_t *hmac_keydata, uint64_t keyformat, uint64_t salt, uint64_t iters, dmu_tx_t *tx) { VERIFY0(zap_update(mos, dckobj, DSL_CRYPTO_KEY_CRYPTO_SUITE, 8, 1, &crypt, tx)); VERIFY0(zap_update(mos, dckobj, DSL_CRYPTO_KEY_ROOT_DDOBJ, 8, 1, &root_ddobj, tx)); VERIFY0(zap_update(mos, dckobj, DSL_CRYPTO_KEY_GUID, 8, 1, &guid, tx)); VERIFY0(zap_update(mos, dckobj, DSL_CRYPTO_KEY_IV, 1, WRAPPING_IV_LEN, iv, tx)); VERIFY0(zap_update(mos, dckobj, DSL_CRYPTO_KEY_MAC, 1, WRAPPING_MAC_LEN, mac, tx)); VERIFY0(zap_update(mos, dckobj, DSL_CRYPTO_KEY_MASTER_KEY, 1, MASTER_KEY_MAX_LEN, keydata, tx)); VERIFY0(zap_update(mos, dckobj, DSL_CRYPTO_KEY_HMAC_KEY, 1, SHA512_HMAC_KEYLEN, hmac_keydata, tx)); VERIFY0(zap_update(mos, dckobj, zfs_prop_to_name(ZFS_PROP_KEYFORMAT), 8, 1, &keyformat, tx)); VERIFY0(zap_update(mos, dckobj, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), 8, 1, &salt, tx)); VERIFY0(zap_update(mos, dckobj, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), 8, 1, &iters, tx)); } static void dsl_crypto_key_sync(dsl_crypto_key_t *dck, dmu_tx_t *tx) { zio_crypt_key_t *key = &dck->dck_key; dsl_wrapping_key_t *wkey = dck->dck_wkey; uint8_t keydata[MASTER_KEY_MAX_LEN]; uint8_t hmac_keydata[SHA512_HMAC_KEYLEN]; uint8_t iv[WRAPPING_IV_LEN]; uint8_t mac[WRAPPING_MAC_LEN]; ASSERT(dmu_tx_is_syncing(tx)); ASSERT3U(key->zk_crypt, <, ZIO_CRYPT_FUNCTIONS); /* encrypt and store the keys along with the IV and MAC */ VERIFY0(zio_crypt_key_wrap(&dck->dck_wkey->wk_key, key, iv, mac, keydata, hmac_keydata)); /* update the ZAP with the obtained values */ dsl_crypto_key_sync_impl(tx->tx_pool->dp_meta_objset, dck->dck_obj, key->zk_crypt, wkey->wk_ddobj, key->zk_guid, iv, mac, keydata, hmac_keydata, wkey->wk_keyformat, wkey->wk_salt, wkey->wk_iters, tx); } typedef struct spa_keystore_change_key_args { const char *skcka_dsname; dsl_crypto_params_t *skcka_cp; } spa_keystore_change_key_args_t; static int spa_keystore_change_key_check(void *arg, dmu_tx_t *tx) { int ret; dsl_dir_t *dd = NULL; dsl_pool_t *dp = dmu_tx_pool(tx); spa_keystore_change_key_args_t *skcka = arg; dsl_crypto_params_t *dcp = skcka->skcka_cp; uint64_t rddobj; /* check for the encryption feature */ if (!spa_feature_is_enabled(dp->dp_spa, SPA_FEATURE_ENCRYPTION)) { ret = SET_ERROR(ENOTSUP); goto error; } /* check for valid key change command */ if (dcp->cp_cmd != DCP_CMD_NEW_KEY && dcp->cp_cmd != DCP_CMD_INHERIT && dcp->cp_cmd != DCP_CMD_FORCE_NEW_KEY && dcp->cp_cmd != DCP_CMD_FORCE_INHERIT) { ret = SET_ERROR(EINVAL); goto error; } /* hold the dd */ ret = dsl_dir_hold(dp, skcka->skcka_dsname, FTAG, &dd, NULL); if (ret != 0) { dd = NULL; goto error; } /* verify that the dataset is encrypted */ if (dd->dd_crypto_obj == 0) { ret = SET_ERROR(EINVAL); goto error; } /* clones must always use their origin's key */ if (dsl_dir_is_clone(dd)) { ret = SET_ERROR(EINVAL); goto error; } /* lookup the ddobj we are inheriting the keylocation from */ ret = dsl_dir_get_encryption_root_ddobj(dd, &rddobj); if (ret != 0) goto error; /* Handle inheritance */ if (dcp->cp_cmd == DCP_CMD_INHERIT || dcp->cp_cmd == DCP_CMD_FORCE_INHERIT) { /* no other encryption params should be given */ if (dcp->cp_crypt != ZIO_CRYPT_INHERIT || dcp->cp_keylocation != NULL || dcp->cp_wkey != NULL) { ret = SET_ERROR(EINVAL); goto error; } /* check that this is an encryption root */ if (dd->dd_object != rddobj) { ret = SET_ERROR(EINVAL); goto error; } /* check that the parent is encrypted */ if (dd->dd_parent->dd_crypto_obj == 0) { ret = SET_ERROR(EINVAL); goto error; } /* if we are rewrapping check that both keys are loaded */ if (dcp->cp_cmd == DCP_CMD_INHERIT) { ret = dmu_objset_check_wkey_loaded(dd); if (ret != 0) goto error; ret = dmu_objset_check_wkey_loaded(dd->dd_parent); if (ret != 0) goto error; } dsl_dir_rele(dd, FTAG); return (0); } /* handle forcing an encryption root without rewrapping */ if (dcp->cp_cmd == DCP_CMD_FORCE_NEW_KEY) { /* no other encryption params should be given */ if (dcp->cp_crypt != ZIO_CRYPT_INHERIT || dcp->cp_keylocation != NULL || dcp->cp_wkey != NULL) { ret = SET_ERROR(EINVAL); goto error; } /* check that this is not an encryption root */ if (dd->dd_object == rddobj) { ret = SET_ERROR(EINVAL); goto error; } dsl_dir_rele(dd, FTAG); return (0); } /* crypt cannot be changed after creation */ if (dcp->cp_crypt != ZIO_CRYPT_INHERIT) { ret = SET_ERROR(EINVAL); goto error; } /* we are not inheritting our parent's wkey so we need one ourselves */ if (dcp->cp_wkey == NULL) { ret = SET_ERROR(EINVAL); goto error; } /* check for a valid keyformat for the new wrapping key */ if (dcp->cp_wkey->wk_keyformat >= ZFS_KEYFORMAT_FORMATS || dcp->cp_wkey->wk_keyformat == ZFS_KEYFORMAT_NONE) { ret = SET_ERROR(EINVAL); goto error; } /* * If this dataset is not currently an encryption root we need a new * keylocation for this dataset's new wrapping key. Otherwise we can * just keep the one we already had. */ if (dd->dd_object != rddobj && dcp->cp_keylocation == NULL) { ret = SET_ERROR(EINVAL); goto error; } /* check that the keylocation is valid if it is not NULL */ if (dcp->cp_keylocation != NULL && !zfs_prop_valid_keylocation(dcp->cp_keylocation, B_TRUE)) { ret = SET_ERROR(EINVAL); goto error; } /* passphrases require pbkdf2 salt and iters */ if (dcp->cp_wkey->wk_keyformat == ZFS_KEYFORMAT_PASSPHRASE) { if (dcp->cp_wkey->wk_salt == 0 || dcp->cp_wkey->wk_iters < MIN_PBKDF2_ITERATIONS) { ret = SET_ERROR(EINVAL); goto error; } } else { if (dcp->cp_wkey->wk_salt != 0 || dcp->cp_wkey->wk_iters != 0) { ret = SET_ERROR(EINVAL); goto error; } } /* make sure the dd's wkey is loaded */ ret = dmu_objset_check_wkey_loaded(dd); if (ret != 0) goto error; dsl_dir_rele(dd, FTAG); return (0); error: if (dd != NULL) dsl_dir_rele(dd, FTAG); return (ret); } /* * This function deals with the intricacies of updating wrapping * key references and encryption roots recursively in the event * of a call to 'zfs change-key' or 'zfs promote'. The 'skip' * parameter should always be set to B_FALSE when called * externally. */ static void spa_keystore_change_key_sync_impl(uint64_t rddobj, uint64_t ddobj, uint64_t new_rddobj, dsl_wrapping_key_t *wkey, boolean_t skip, dmu_tx_t *tx) { int ret; zap_cursor_t *zc; zap_attribute_t *za; dsl_pool_t *dp = dmu_tx_pool(tx); dsl_dir_t *dd = NULL; dsl_crypto_key_t *dck = NULL; uint64_t curr_rddobj; ASSERT(RW_WRITE_HELD(&dp->dp_spa->spa_keystore.sk_wkeys_lock)); /* hold the dd */ VERIFY0(dsl_dir_hold_obj(dp, ddobj, NULL, FTAG, &dd)); /* ignore special dsl dirs */ if (dd->dd_myname[0] == '$' || dd->dd_myname[0] == '%') { dsl_dir_rele(dd, FTAG); return; } ret = dsl_dir_get_encryption_root_ddobj(dd, &curr_rddobj); VERIFY(ret == 0 || ret == ENOENT); /* * Stop recursing if this dsl dir didn't inherit from the root * or if this dd is a clone. */ if (ret == ENOENT || (!skip && (curr_rddobj != rddobj || dsl_dir_is_clone(dd)))) { dsl_dir_rele(dd, FTAG); return; } /* * If we don't have a wrapping key just update the dck to reflect the * new encryption root. Otherwise rewrap the entire dck and re-sync it * to disk. If skip is set, we don't do any of this work. */ if (!skip) { if (wkey == NULL) { VERIFY0(zap_update(dp->dp_meta_objset, dd->dd_crypto_obj, DSL_CRYPTO_KEY_ROOT_DDOBJ, 8, 1, &new_rddobj, tx)); } else { VERIFY0(spa_keystore_dsl_key_hold_dd(dp->dp_spa, dd, FTAG, &dck)); dsl_wrapping_key_hold(wkey, dck); dsl_wrapping_key_rele(dck->dck_wkey, dck); dck->dck_wkey = wkey; dsl_crypto_key_sync(dck, tx); spa_keystore_dsl_key_rele(dp->dp_spa, dck, FTAG); } } zc = kmem_alloc(sizeof (zap_cursor_t), KM_SLEEP); za = kmem_alloc(sizeof (zap_attribute_t), KM_SLEEP); /* Recurse into all child dsl dirs. */ for (zap_cursor_init(zc, dp->dp_meta_objset, dsl_dir_phys(dd)->dd_child_dir_zapobj); zap_cursor_retrieve(zc, za) == 0; zap_cursor_advance(zc)) { spa_keystore_change_key_sync_impl(rddobj, za->za_first_integer, new_rddobj, wkey, B_FALSE, tx); } zap_cursor_fini(zc); /* * Recurse into all dsl dirs of clones. We utilize the skip parameter * here so that we don't attempt to process the clones directly. This * is because the clone and its origin share the same dck, which has * already been updated. */ for (zap_cursor_init(zc, dp->dp_meta_objset, dsl_dir_phys(dd)->dd_clones); zap_cursor_retrieve(zc, za) == 0; zap_cursor_advance(zc)) { dsl_dataset_t *clone; VERIFY0(dsl_dataset_hold_obj(dp, za->za_first_integer, FTAG, &clone)); spa_keystore_change_key_sync_impl(rddobj, clone->ds_dir->dd_object, new_rddobj, wkey, B_TRUE, tx); dsl_dataset_rele(clone, FTAG); } zap_cursor_fini(zc); kmem_free(za, sizeof (zap_attribute_t)); kmem_free(zc, sizeof (zap_cursor_t)); dsl_dir_rele(dd, FTAG); } static void spa_keystore_change_key_sync(void *arg, dmu_tx_t *tx) { dsl_dataset_t *ds; avl_index_t where; dsl_pool_t *dp = dmu_tx_pool(tx); spa_t *spa = dp->dp_spa; spa_keystore_change_key_args_t *skcka = arg; dsl_crypto_params_t *dcp = skcka->skcka_cp; dsl_wrapping_key_t *wkey = NULL, *found_wkey; dsl_wrapping_key_t wkey_search; char *keylocation = dcp->cp_keylocation; uint64_t rddobj, new_rddobj; /* create and initialize the wrapping key */ VERIFY0(dsl_dataset_hold(dp, skcka->skcka_dsname, FTAG, &ds)); ASSERT(!ds->ds_is_snapshot); if (dcp->cp_cmd == DCP_CMD_NEW_KEY || dcp->cp_cmd == DCP_CMD_FORCE_NEW_KEY) { /* * We are changing to a new wkey. Set additional properties * which can be sent along with this ioctl. Note that this * command can set keylocation even if it can't normally be * set via 'zfs set' due to a non-local keylocation. */ if (dcp->cp_cmd == DCP_CMD_NEW_KEY) { wkey = dcp->cp_wkey; wkey->wk_ddobj = ds->ds_dir->dd_object; } else { keylocation = "prompt"; } if (keylocation != NULL) { dsl_prop_set_sync_impl(ds, zfs_prop_to_name(ZFS_PROP_KEYLOCATION), ZPROP_SRC_LOCAL, 1, strlen(keylocation) + 1, keylocation, tx); } VERIFY0(dsl_dir_get_encryption_root_ddobj(ds->ds_dir, &rddobj)); new_rddobj = ds->ds_dir->dd_object; } else { /* * We are inheritting the parent's wkey. Unset any local * keylocation and grab a reference to the wkey. */ if (dcp->cp_cmd == DCP_CMD_INHERIT) { VERIFY0(spa_keystore_wkey_hold_dd(spa, ds->ds_dir->dd_parent, FTAG, &wkey)); } dsl_prop_set_sync_impl(ds, zfs_prop_to_name(ZFS_PROP_KEYLOCATION), ZPROP_SRC_NONE, 0, 0, NULL, tx); rddobj = ds->ds_dir->dd_object; VERIFY0(dsl_dir_get_encryption_root_ddobj(ds->ds_dir->dd_parent, &new_rddobj)); } if (wkey == NULL) { ASSERT(dcp->cp_cmd == DCP_CMD_FORCE_INHERIT || dcp->cp_cmd == DCP_CMD_FORCE_NEW_KEY); } rw_enter(&spa->spa_keystore.sk_wkeys_lock, RW_WRITER); /* recurse through all children and rewrap their keys */ spa_keystore_change_key_sync_impl(rddobj, ds->ds_dir->dd_object, new_rddobj, wkey, B_FALSE, tx); /* * All references to the old wkey should be released now (if it * existed). Replace the wrapping key. */ wkey_search.wk_ddobj = ds->ds_dir->dd_object; found_wkey = avl_find(&spa->spa_keystore.sk_wkeys, &wkey_search, NULL); if (found_wkey != NULL) { ASSERT0(zfs_refcount_count(&found_wkey->wk_refcnt)); avl_remove(&spa->spa_keystore.sk_wkeys, found_wkey); dsl_wrapping_key_free(found_wkey); } if (dcp->cp_cmd == DCP_CMD_NEW_KEY) { avl_find(&spa->spa_keystore.sk_wkeys, wkey, &where); avl_insert(&spa->spa_keystore.sk_wkeys, wkey, where); } else if (wkey != NULL) { dsl_wrapping_key_rele(wkey, FTAG); } rw_exit(&spa->spa_keystore.sk_wkeys_lock); dsl_dataset_rele(ds, FTAG); } int spa_keystore_change_key(const char *dsname, dsl_crypto_params_t *dcp) { spa_keystore_change_key_args_t skcka; /* initialize the args struct */ skcka.skcka_dsname = dsname; skcka.skcka_cp = dcp; /* * Perform the actual work in syncing context. The blocks modified * here could be calculated but it would require holding the pool * lock and traversing all of the datasets that will have their keys * changed. */ return (dsl_sync_task(dsname, spa_keystore_change_key_check, spa_keystore_change_key_sync, &skcka, 15, ZFS_SPACE_CHECK_RESERVED)); } int dsl_dir_rename_crypt_check(dsl_dir_t *dd, dsl_dir_t *newparent) { int ret; uint64_t curr_rddobj, parent_rddobj; if (dd->dd_crypto_obj == 0) return (0); ret = dsl_dir_get_encryption_root_ddobj(dd, &curr_rddobj); if (ret != 0) goto error; /* * if this is not an encryption root, we must make sure we are not * moving dd to a new encryption root */ if (dd->dd_object != curr_rddobj) { ret = dsl_dir_get_encryption_root_ddobj(newparent, &parent_rddobj); if (ret != 0) goto error; if (parent_rddobj != curr_rddobj) { ret = SET_ERROR(EACCES); goto error; } } return (0); error: return (ret); } /* * Check to make sure that a promote from targetdd to origindd will not require * any key rewraps. */ int dsl_dataset_promote_crypt_check(dsl_dir_t *target, dsl_dir_t *origin) { int ret; uint64_t rddobj, op_rddobj, tp_rddobj; /* If the dataset is not encrypted we don't need to check anything */ if (origin->dd_crypto_obj == 0) return (0); /* * If we are not changing the first origin snapshot in a chain * the encryption root won't change either. */ if (dsl_dir_is_clone(origin)) return (0); /* * If the origin is the encryption root we will update * the DSL Crypto Key to point to the target instead. */ ret = dsl_dir_get_encryption_root_ddobj(origin, &rddobj); if (ret != 0) return (ret); if (rddobj == origin->dd_object) return (0); /* * The origin is inheriting its encryption root from its parent. * Check that the parent of the target has the same encryption root. */ ret = dsl_dir_get_encryption_root_ddobj(origin->dd_parent, &op_rddobj); if (ret == ENOENT) return (SET_ERROR(EACCES)); else if (ret != 0) return (ret); ret = dsl_dir_get_encryption_root_ddobj(target->dd_parent, &tp_rddobj); if (ret == ENOENT) return (SET_ERROR(EACCES)); else if (ret != 0) return (ret); if (op_rddobj != tp_rddobj) return (SET_ERROR(EACCES)); return (0); } void dsl_dataset_promote_crypt_sync(dsl_dir_t *target, dsl_dir_t *origin, dmu_tx_t *tx) { uint64_t rddobj; dsl_pool_t *dp = target->dd_pool; dsl_dataset_t *targetds; dsl_dataset_t *originds; char *keylocation; if (origin->dd_crypto_obj == 0) return; if (dsl_dir_is_clone(origin)) return; VERIFY0(dsl_dir_get_encryption_root_ddobj(origin, &rddobj)); if (rddobj != origin->dd_object) return; /* * If the target is being promoted to the encryption root update the * DSL Crypto Key and keylocation to reflect that. We also need to * update the DSL Crypto Keys of all children inheritting their * encryption root to point to the new target. Otherwise, the check * function ensured that the encryption root will not change. */ keylocation = kmem_alloc(ZAP_MAXVALUELEN, KM_SLEEP); VERIFY0(dsl_dataset_hold_obj(dp, dsl_dir_phys(target)->dd_head_dataset_obj, FTAG, &targetds)); VERIFY0(dsl_dataset_hold_obj(dp, dsl_dir_phys(origin)->dd_head_dataset_obj, FTAG, &originds)); VERIFY0(dsl_prop_get_dd(origin, zfs_prop_to_name(ZFS_PROP_KEYLOCATION), 1, ZAP_MAXVALUELEN, keylocation, NULL, B_FALSE)); dsl_prop_set_sync_impl(targetds, zfs_prop_to_name(ZFS_PROP_KEYLOCATION), ZPROP_SRC_LOCAL, 1, strlen(keylocation) + 1, keylocation, tx); dsl_prop_set_sync_impl(originds, zfs_prop_to_name(ZFS_PROP_KEYLOCATION), ZPROP_SRC_NONE, 0, 0, NULL, tx); rw_enter(&dp->dp_spa->spa_keystore.sk_wkeys_lock, RW_WRITER); spa_keystore_change_key_sync_impl(rddobj, origin->dd_object, target->dd_object, NULL, B_FALSE, tx); rw_exit(&dp->dp_spa->spa_keystore.sk_wkeys_lock); dsl_dataset_rele(targetds, FTAG); dsl_dataset_rele(originds, FTAG); kmem_free(keylocation, ZAP_MAXVALUELEN); } int dmu_objset_create_crypt_check(dsl_dir_t *parentdd, dsl_crypto_params_t *dcp, boolean_t *will_encrypt) { int ret; uint64_t pcrypt, crypt; dsl_crypto_params_t dummy_dcp = { 0 }; if (will_encrypt != NULL) *will_encrypt = B_FALSE; if (dcp == NULL) dcp = &dummy_dcp; if (dcp->cp_cmd != DCP_CMD_NONE) return (SET_ERROR(EINVAL)); if (parentdd != NULL) { ret = dsl_dir_get_crypt(parentdd, &pcrypt); if (ret != 0) return (ret); } else { pcrypt = ZIO_CRYPT_OFF; } crypt = (dcp->cp_crypt == ZIO_CRYPT_INHERIT) ? pcrypt : dcp->cp_crypt; ASSERT3U(pcrypt, !=, ZIO_CRYPT_INHERIT); ASSERT3U(crypt, !=, ZIO_CRYPT_INHERIT); /* check for valid dcp with no encryption (inherited or local) */ if (crypt == ZIO_CRYPT_OFF) { /* Must not specify encryption params */ if (dcp->cp_wkey != NULL || (dcp->cp_keylocation != NULL && strcmp(dcp->cp_keylocation, "none") != 0)) return (SET_ERROR(EINVAL)); return (0); } if (will_encrypt != NULL) *will_encrypt = B_TRUE; /* * We will now definitely be encrypting. Check the feature flag. When * creating the pool the caller will check this for us since we won't * technically have the feature activated yet. */ if (parentdd != NULL && !spa_feature_is_enabled(parentdd->dd_pool->dp_spa, SPA_FEATURE_ENCRYPTION)) { return (SET_ERROR(EOPNOTSUPP)); } /* Check for errata #4 (encryption enabled, bookmark_v2 disabled) */ if (parentdd != NULL && !spa_feature_is_enabled(parentdd->dd_pool->dp_spa, SPA_FEATURE_BOOKMARK_V2)) { return (SET_ERROR(EOPNOTSUPP)); } /* handle inheritance */ if (dcp->cp_wkey == NULL) { ASSERT3P(parentdd, !=, NULL); /* key must be fully unspecified */ if (dcp->cp_keylocation != NULL) return (SET_ERROR(EINVAL)); /* parent must have a key to inherit */ if (pcrypt == ZIO_CRYPT_OFF) return (SET_ERROR(EINVAL)); /* check for parent key */ ret = dmu_objset_check_wkey_loaded(parentdd); if (ret != 0) return (ret); return (0); } /* At this point we should have a fully specified key. Check location */ if (dcp->cp_keylocation == NULL || !zfs_prop_valid_keylocation(dcp->cp_keylocation, B_TRUE)) return (SET_ERROR(EINVAL)); /* Must have fully specified keyformat */ switch (dcp->cp_wkey->wk_keyformat) { case ZFS_KEYFORMAT_HEX: case ZFS_KEYFORMAT_RAW: /* requires no pbkdf2 iters and salt */ if (dcp->cp_wkey->wk_salt != 0 || dcp->cp_wkey->wk_iters != 0) return (SET_ERROR(EINVAL)); break; case ZFS_KEYFORMAT_PASSPHRASE: /* requires pbkdf2 iters and salt */ if (dcp->cp_wkey->wk_salt == 0 || dcp->cp_wkey->wk_iters < MIN_PBKDF2_ITERATIONS) return (SET_ERROR(EINVAL)); break; case ZFS_KEYFORMAT_NONE: default: /* keyformat must be specified and valid */ return (SET_ERROR(EINVAL)); } return (0); } void dsl_dataset_create_crypt_sync(uint64_t dsobj, dsl_dir_t *dd, dsl_dataset_t *origin, dsl_crypto_params_t *dcp, dmu_tx_t *tx) { dsl_pool_t *dp = dd->dd_pool; uint64_t crypt; dsl_wrapping_key_t *wkey; /* clones always use their origin's wrapping key */ if (dsl_dir_is_clone(dd)) { ASSERT3P(dcp, ==, NULL); /* * If this is an encrypted clone we just need to clone the * dck into dd. Zapify the dd so we can do that. */ if (origin->ds_dir->dd_crypto_obj != 0) { dmu_buf_will_dirty(dd->dd_dbuf, tx); dsl_dir_zapify(dd, tx); dd->dd_crypto_obj = dsl_crypto_key_clone_sync(origin->ds_dir, tx); VERIFY0(zap_add(dp->dp_meta_objset, dd->dd_object, DD_FIELD_CRYPTO_KEY_OBJ, sizeof (uint64_t), 1, &dd->dd_crypto_obj, tx)); } return; } /* * A NULL dcp at this point indicates this is the origin dataset * which does not have an objset to encrypt. Raw receives will handle * encryption separately later. In both cases we can simply return. */ if (dcp == NULL || dcp->cp_cmd == DCP_CMD_RAW_RECV) return; crypt = dcp->cp_crypt; wkey = dcp->cp_wkey; /* figure out the effective crypt */ if (crypt == ZIO_CRYPT_INHERIT && dd->dd_parent != NULL) VERIFY0(dsl_dir_get_crypt(dd->dd_parent, &crypt)); /* if we aren't doing encryption just return */ if (crypt == ZIO_CRYPT_OFF || crypt == ZIO_CRYPT_INHERIT) return; /* zapify the dd so that we can add the crypto key obj to it */ dmu_buf_will_dirty(dd->dd_dbuf, tx); dsl_dir_zapify(dd, tx); /* use the new key if given or inherit from the parent */ if (wkey == NULL) { VERIFY0(spa_keystore_wkey_hold_dd(dp->dp_spa, dd->dd_parent, FTAG, &wkey)); } else { wkey->wk_ddobj = dd->dd_object; } ASSERT3P(wkey, !=, NULL); /* Create or clone the DSL crypto key and activate the feature */ dd->dd_crypto_obj = dsl_crypto_key_create_sync(crypt, wkey, tx); VERIFY0(zap_add(dp->dp_meta_objset, dd->dd_object, DD_FIELD_CRYPTO_KEY_OBJ, sizeof (uint64_t), 1, &dd->dd_crypto_obj, tx)); dsl_dataset_activate_feature(dsobj, SPA_FEATURE_ENCRYPTION, (void *)B_TRUE, tx); /* * If we inherited the wrapping key we release our reference now. * Otherwise, this is a new key and we need to load it into the * keystore. */ if (dcp->cp_wkey == NULL) { dsl_wrapping_key_rele(wkey, FTAG); } else { VERIFY0(spa_keystore_load_wkey_impl(dp->dp_spa, wkey)); } } typedef struct dsl_crypto_recv_key_arg { uint64_t dcrka_dsobj; uint64_t dcrka_fromobj; dmu_objset_type_t dcrka_ostype; nvlist_t *dcrka_nvl; boolean_t dcrka_do_key; } dsl_crypto_recv_key_arg_t; static int dsl_crypto_recv_raw_objset_check(dsl_dataset_t *ds, dsl_dataset_t *fromds, dmu_objset_type_t ostype, nvlist_t *nvl, dmu_tx_t *tx) { int ret; objset_t *os; dnode_t *mdn; uint8_t *buf = NULL; uint_t len; uint64_t intval, nlevels, blksz, ibs; uint64_t nblkptr, maxblkid; if (ostype != DMU_OST_ZFS && ostype != DMU_OST_ZVOL) return (SET_ERROR(EINVAL)); /* raw receives also need info about the structure of the metadnode */ ret = nvlist_lookup_uint64(nvl, "mdn_compress", &intval); if (ret != 0 || intval >= ZIO_COMPRESS_LEGACY_FUNCTIONS) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint64(nvl, "mdn_checksum", &intval); if (ret != 0 || intval >= ZIO_CHECKSUM_LEGACY_FUNCTIONS) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint64(nvl, "mdn_nlevels", &nlevels); if (ret != 0 || nlevels > DN_MAX_LEVELS) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint64(nvl, "mdn_blksz", &blksz); if (ret != 0 || blksz < SPA_MINBLOCKSIZE) return (SET_ERROR(EINVAL)); else if (blksz > spa_maxblocksize(tx->tx_pool->dp_spa)) return (SET_ERROR(ENOTSUP)); ret = nvlist_lookup_uint64(nvl, "mdn_indblkshift", &ibs); if (ret != 0 || ibs < DN_MIN_INDBLKSHIFT || ibs > DN_MAX_INDBLKSHIFT) return (SET_ERROR(ENOTSUP)); ret = nvlist_lookup_uint64(nvl, "mdn_nblkptr", &nblkptr); if (ret != 0 || nblkptr != DN_MAX_NBLKPTR) return (SET_ERROR(ENOTSUP)); ret = nvlist_lookup_uint64(nvl, "mdn_maxblkid", &maxblkid); if (ret != 0) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint8_array(nvl, "portable_mac", &buf, &len); if (ret != 0 || len != ZIO_OBJSET_MAC_LEN) return (SET_ERROR(EINVAL)); ret = dmu_objset_from_ds(ds, &os); if (ret != 0) return (ret); /* * Useraccounting is not portable and must be done with the keys loaded. * Therefore, whenever we do any kind of receive the useraccounting * must not be present. */ ASSERT0(os->os_flags & OBJSET_FLAG_USERACCOUNTING_COMPLETE); ASSERT0(os->os_flags & OBJSET_FLAG_USEROBJACCOUNTING_COMPLETE); mdn = DMU_META_DNODE(os); /* * If we already created the objset, make sure its unchangeable * properties match the ones received in the nvlist. */ rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG); if (!BP_IS_HOLE(dsl_dataset_get_blkptr(ds)) && (mdn->dn_nlevels != nlevels || mdn->dn_datablksz != blksz || mdn->dn_indblkshift != ibs || mdn->dn_nblkptr != nblkptr)) { rrw_exit(&ds->ds_bp_rwlock, FTAG); return (SET_ERROR(EINVAL)); } rrw_exit(&ds->ds_bp_rwlock, FTAG); /* * Check that the ivset guid of the fromds matches the one from the * send stream. Older versions of the encryption code did not have * an ivset guid on the from dataset and did not send one in the * stream. For these streams we provide the * zfs_disable_ivset_guid_check tunable to allow these datasets to * be received with a generated ivset guid. */ if (fromds != NULL && !zfs_disable_ivset_guid_check) { uint64_t from_ivset_guid = 0; intval = 0; (void) nvlist_lookup_uint64(nvl, "from_ivset_guid", &intval); (void) zap_lookup(tx->tx_pool->dp_meta_objset, fromds->ds_object, DS_FIELD_IVSET_GUID, sizeof (from_ivset_guid), 1, &from_ivset_guid); if (intval == 0 || from_ivset_guid == 0) return (SET_ERROR(ZFS_ERR_FROM_IVSET_GUID_MISSING)); if (intval != from_ivset_guid) return (SET_ERROR(ZFS_ERR_FROM_IVSET_GUID_MISMATCH)); } return (0); } static void dsl_crypto_recv_raw_objset_sync(dsl_dataset_t *ds, dmu_objset_type_t ostype, nvlist_t *nvl, dmu_tx_t *tx) { dsl_pool_t *dp = tx->tx_pool; objset_t *os; dnode_t *mdn; zio_t *zio; uint8_t *portable_mac; uint_t len; uint64_t compress, checksum, nlevels, blksz, ibs, maxblkid; boolean_t newds = B_FALSE; VERIFY0(dmu_objset_from_ds(ds, &os)); mdn = DMU_META_DNODE(os); /* * Fetch the values we need from the nvlist. "to_ivset_guid" must * be set on the snapshot, which doesn't exist yet. The receive * code will take care of this for us later. */ compress = fnvlist_lookup_uint64(nvl, "mdn_compress"); checksum = fnvlist_lookup_uint64(nvl, "mdn_checksum"); nlevels = fnvlist_lookup_uint64(nvl, "mdn_nlevels"); blksz = fnvlist_lookup_uint64(nvl, "mdn_blksz"); ibs = fnvlist_lookup_uint64(nvl, "mdn_indblkshift"); maxblkid = fnvlist_lookup_uint64(nvl, "mdn_maxblkid"); VERIFY0(nvlist_lookup_uint8_array(nvl, "portable_mac", &portable_mac, &len)); /* if we haven't created an objset for the ds yet, do that now */ rrw_enter(&ds->ds_bp_rwlock, RW_READER, FTAG); if (BP_IS_HOLE(dsl_dataset_get_blkptr(ds))) { (void) dmu_objset_create_impl_dnstats(dp->dp_spa, ds, dsl_dataset_get_blkptr(ds), ostype, nlevels, blksz, ibs, tx); newds = B_TRUE; } rrw_exit(&ds->ds_bp_rwlock, FTAG); /* * Set the portable MAC. The local MAC will always be zero since the * incoming data will all be portable and user accounting will be * deferred until the next mount. Afterwards, flag the os to be * written out raw next time. */ arc_release(os->os_phys_buf, &os->os_phys_buf); bcopy(portable_mac, os->os_phys->os_portable_mac, ZIO_OBJSET_MAC_LEN); bzero(os->os_phys->os_local_mac, ZIO_OBJSET_MAC_LEN); os->os_next_write_raw[tx->tx_txg & TXG_MASK] = B_TRUE; /* set metadnode compression and checksum */ mdn->dn_compress = compress; mdn->dn_checksum = checksum; rw_enter(&mdn->dn_struct_rwlock, RW_WRITER); dnode_new_blkid(mdn, maxblkid, tx, B_FALSE, B_TRUE); rw_exit(&mdn->dn_struct_rwlock); /* * We can't normally dirty the dataset in syncing context unless * we are creating a new dataset. In this case, we perform a * pseudo txg sync here instead. */ if (newds) { dsl_dataset_dirty(ds, tx); } else { zio = zio_root(dp->dp_spa, NULL, NULL, ZIO_FLAG_MUSTSUCCEED); dsl_dataset_sync(ds, zio, tx); VERIFY0(zio_wait(zio)); /* dsl_dataset_sync_done will drop this reference. */ dmu_buf_add_ref(ds->ds_dbuf, ds); dsl_dataset_sync_done(ds, tx); } } int dsl_crypto_recv_raw_key_check(dsl_dataset_t *ds, nvlist_t *nvl, dmu_tx_t *tx) { int ret; objset_t *mos = tx->tx_pool->dp_meta_objset; uint8_t *buf = NULL; uint_t len; uint64_t intval, key_guid, version; boolean_t is_passphrase = B_FALSE; ASSERT(dsl_dataset_phys(ds)->ds_flags & DS_FLAG_INCONSISTENT); /* * Read and check all the encryption values from the nvlist. We need * all of the fields of a DSL Crypto Key, as well as a fully specified * wrapping key. */ ret = nvlist_lookup_uint64(nvl, DSL_CRYPTO_KEY_CRYPTO_SUITE, &intval); if (ret != 0 || intval >= ZIO_CRYPT_FUNCTIONS || intval <= ZIO_CRYPT_OFF) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint64(nvl, DSL_CRYPTO_KEY_GUID, &intval); if (ret != 0) return (SET_ERROR(EINVAL)); /* * If this is an incremental receive make sure the given key guid * matches the one we already have. */ if (ds->ds_dir->dd_crypto_obj != 0) { ret = zap_lookup(mos, ds->ds_dir->dd_crypto_obj, DSL_CRYPTO_KEY_GUID, 8, 1, &key_guid); if (ret != 0) return (ret); if (intval != key_guid) return (SET_ERROR(EACCES)); } ret = nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_MASTER_KEY, &buf, &len); if (ret != 0 || len != MASTER_KEY_MAX_LEN) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_HMAC_KEY, &buf, &len); if (ret != 0 || len != SHA512_HMAC_KEYLEN) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_IV, &buf, &len); if (ret != 0 || len != WRAPPING_IV_LEN) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_MAC, &buf, &len); if (ret != 0 || len != WRAPPING_MAC_LEN) return (SET_ERROR(EINVAL)); /* * We don't support receiving old on-disk formats. The version 0 * implementation protected several fields in an objset that were * not always portable during a raw receive. As a result, we call * the old version an on-disk errata #3. */ ret = nvlist_lookup_uint64(nvl, DSL_CRYPTO_KEY_VERSION, &version); if (ret != 0 || version != ZIO_CRYPT_KEY_CURRENT_VERSION) return (SET_ERROR(ENOTSUP)); ret = nvlist_lookup_uint64(nvl, zfs_prop_to_name(ZFS_PROP_KEYFORMAT), &intval); if (ret != 0 || intval >= ZFS_KEYFORMAT_FORMATS || intval == ZFS_KEYFORMAT_NONE) return (SET_ERROR(EINVAL)); is_passphrase = (intval == ZFS_KEYFORMAT_PASSPHRASE); /* * for raw receives we allow any number of pbkdf2iters since there * won't be a chance for the user to change it. */ ret = nvlist_lookup_uint64(nvl, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), &intval); if (ret != 0 || (is_passphrase == (intval == 0))) return (SET_ERROR(EINVAL)); ret = nvlist_lookup_uint64(nvl, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), &intval); if (ret != 0 || (is_passphrase == (intval == 0))) return (SET_ERROR(EINVAL)); return (0); } void dsl_crypto_recv_raw_key_sync(dsl_dataset_t *ds, nvlist_t *nvl, dmu_tx_t *tx) { dsl_pool_t *dp = tx->tx_pool; objset_t *mos = dp->dp_meta_objset; dsl_dir_t *dd = ds->ds_dir; uint_t len; uint64_t rddobj, one = 1; uint8_t *keydata, *hmac_keydata, *iv, *mac; uint64_t crypt, key_guid, keyformat, iters, salt; uint64_t version = ZIO_CRYPT_KEY_CURRENT_VERSION; char *keylocation = "prompt"; /* lookup the values we need to create the DSL Crypto Key */ crypt = fnvlist_lookup_uint64(nvl, DSL_CRYPTO_KEY_CRYPTO_SUITE); key_guid = fnvlist_lookup_uint64(nvl, DSL_CRYPTO_KEY_GUID); keyformat = fnvlist_lookup_uint64(nvl, zfs_prop_to_name(ZFS_PROP_KEYFORMAT)); iters = fnvlist_lookup_uint64(nvl, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS)); salt = fnvlist_lookup_uint64(nvl, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT)); VERIFY0(nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_MASTER_KEY, &keydata, &len)); VERIFY0(nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_HMAC_KEY, &hmac_keydata, &len)); VERIFY0(nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_IV, &iv, &len)); VERIFY0(nvlist_lookup_uint8_array(nvl, DSL_CRYPTO_KEY_MAC, &mac, &len)); /* if this is a new dataset setup the DSL Crypto Key. */ if (dd->dd_crypto_obj == 0) { /* zapify the dsl dir so we can add the key object to it */ dmu_buf_will_dirty(dd->dd_dbuf, tx); dsl_dir_zapify(dd, tx); /* create the DSL Crypto Key on disk and activate the feature */ dd->dd_crypto_obj = zap_create(mos, DMU_OTN_ZAP_METADATA, DMU_OT_NONE, 0, tx); VERIFY0(zap_update(tx->tx_pool->dp_meta_objset, dd->dd_crypto_obj, DSL_CRYPTO_KEY_REFCOUNT, sizeof (uint64_t), 1, &one, tx)); VERIFY0(zap_update(tx->tx_pool->dp_meta_objset, dd->dd_crypto_obj, DSL_CRYPTO_KEY_VERSION, sizeof (uint64_t), 1, &version, tx)); dsl_dataset_activate_feature(ds->ds_object, SPA_FEATURE_ENCRYPTION, (void *)B_TRUE, tx); ds->ds_feature[SPA_FEATURE_ENCRYPTION] = (void *)B_TRUE; /* save the dd_crypto_obj on disk */ VERIFY0(zap_add(mos, dd->dd_object, DD_FIELD_CRYPTO_KEY_OBJ, sizeof (uint64_t), 1, &dd->dd_crypto_obj, tx)); /* * Set the keylocation to prompt by default. If keylocation * has been provided via the properties, this will be overridden * later. */ dsl_prop_set_sync_impl(ds, zfs_prop_to_name(ZFS_PROP_KEYLOCATION), ZPROP_SRC_LOCAL, 1, strlen(keylocation) + 1, keylocation, tx); rddobj = dd->dd_object; } else { VERIFY0(dsl_dir_get_encryption_root_ddobj(dd, &rddobj)); } /* sync the key data to the ZAP object on disk */ dsl_crypto_key_sync_impl(mos, dd->dd_crypto_obj, crypt, rddobj, key_guid, iv, mac, keydata, hmac_keydata, keyformat, salt, iters, tx); } static int dsl_crypto_recv_key_check(void *arg, dmu_tx_t *tx) { int ret; dsl_crypto_recv_key_arg_t *dcrka = arg; dsl_dataset_t *ds = NULL, *fromds = NULL; ret = dsl_dataset_hold_obj(tx->tx_pool, dcrka->dcrka_dsobj, FTAG, &ds); if (ret != 0) goto out; if (dcrka->dcrka_fromobj != 0) { ret = dsl_dataset_hold_obj(tx->tx_pool, dcrka->dcrka_fromobj, FTAG, &fromds); if (ret != 0) goto out; } ret = dsl_crypto_recv_raw_objset_check(ds, fromds, dcrka->dcrka_ostype, dcrka->dcrka_nvl, tx); if (ret != 0) goto out; /* * We run this check even if we won't be doing this part of * the receive now so that we don't make the user wait until * the receive finishes to fail. */ ret = dsl_crypto_recv_raw_key_check(ds, dcrka->dcrka_nvl, tx); if (ret != 0) goto out; out: if (ds != NULL) dsl_dataset_rele(ds, FTAG); if (fromds != NULL) dsl_dataset_rele(fromds, FTAG); return (ret); } static void dsl_crypto_recv_key_sync(void *arg, dmu_tx_t *tx) { dsl_crypto_recv_key_arg_t *dcrka = arg; dsl_dataset_t *ds; VERIFY0(dsl_dataset_hold_obj(tx->tx_pool, dcrka->dcrka_dsobj, FTAG, &ds)); dsl_crypto_recv_raw_objset_sync(ds, dcrka->dcrka_ostype, dcrka->dcrka_nvl, tx); if (dcrka->dcrka_do_key) dsl_crypto_recv_raw_key_sync(ds, dcrka->dcrka_nvl, tx); dsl_dataset_rele(ds, FTAG); } /* * This function is used to sync an nvlist representing a DSL Crypto Key and * the associated encryption parameters. The key will be written exactly as is * without wrapping it. */ int dsl_crypto_recv_raw(const char *poolname, uint64_t dsobj, uint64_t fromobj, dmu_objset_type_t ostype, nvlist_t *nvl, boolean_t do_key) { dsl_crypto_recv_key_arg_t dcrka; dcrka.dcrka_dsobj = dsobj; dcrka.dcrka_fromobj = fromobj; dcrka.dcrka_ostype = ostype; dcrka.dcrka_nvl = nvl; dcrka.dcrka_do_key = do_key; return (dsl_sync_task(poolname, dsl_crypto_recv_key_check, dsl_crypto_recv_key_sync, &dcrka, 1, ZFS_SPACE_CHECK_NORMAL)); } int dsl_crypto_populate_key_nvlist(objset_t *os, uint64_t from_ivset_guid, nvlist_t **nvl_out) { int ret; dsl_dataset_t *ds = os->os_dsl_dataset; dnode_t *mdn; uint64_t rddobj; nvlist_t *nvl = NULL; uint64_t dckobj = ds->ds_dir->dd_crypto_obj; dsl_dir_t *rdd = NULL; dsl_pool_t *dp = ds->ds_dir->dd_pool; objset_t *mos = dp->dp_meta_objset; uint64_t crypt = 0, key_guid = 0, format = 0; uint64_t iters = 0, salt = 0, version = 0; uint64_t to_ivset_guid = 0; uint8_t raw_keydata[MASTER_KEY_MAX_LEN]; uint8_t raw_hmac_keydata[SHA512_HMAC_KEYLEN]; uint8_t iv[WRAPPING_IV_LEN]; uint8_t mac[WRAPPING_MAC_LEN]; ASSERT(dckobj != 0); mdn = DMU_META_DNODE(os); nvl = fnvlist_alloc(); /* lookup values from the DSL Crypto Key */ ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_CRYPTO_SUITE, 8, 1, &crypt); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_GUID, 8, 1, &key_guid); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_MASTER_KEY, 1, MASTER_KEY_MAX_LEN, raw_keydata); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_HMAC_KEY, 1, SHA512_HMAC_KEYLEN, raw_hmac_keydata); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_IV, 1, WRAPPING_IV_LEN, iv); if (ret != 0) goto error; ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_MAC, 1, WRAPPING_MAC_LEN, mac); if (ret != 0) goto error; /* see zfs_disable_ivset_guid_check tunable for errata info */ ret = zap_lookup(mos, ds->ds_object, DS_FIELD_IVSET_GUID, 8, 1, &to_ivset_guid); if (ret != 0) ASSERT3U(dp->dp_spa->spa_errata, !=, 0); /* * We don't support raw sends of legacy on-disk formats. See the * comment in dsl_crypto_recv_key_check() for details. */ ret = zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_VERSION, 8, 1, &version); if (ret != 0 || version != ZIO_CRYPT_KEY_CURRENT_VERSION) { dp->dp_spa->spa_errata = ZPOOL_ERRATA_ZOL_6845_ENCRYPTION; ret = SET_ERROR(ENOTSUP); goto error; } /* * Lookup wrapping key properties. An early version of the code did * not correctly add these values to the wrapping key or the DSL * Crypto Key on disk for non encryption roots, so to be safe we * always take the slightly circuitous route of looking it up from * the encryption root's key. */ ret = dsl_dir_get_encryption_root_ddobj(ds->ds_dir, &rddobj); if (ret != 0) goto error; dsl_pool_config_enter(dp, FTAG); ret = dsl_dir_hold_obj(dp, rddobj, NULL, FTAG, &rdd); if (ret != 0) goto error_unlock; ret = zap_lookup(dp->dp_meta_objset, rdd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_KEYFORMAT), 8, 1, &format); if (ret != 0) goto error_unlock; if (format == ZFS_KEYFORMAT_PASSPHRASE) { ret = zap_lookup(dp->dp_meta_objset, rdd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), 8, 1, &iters); if (ret != 0) goto error_unlock; ret = zap_lookup(dp->dp_meta_objset, rdd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), 8, 1, &salt); if (ret != 0) goto error_unlock; } dsl_dir_rele(rdd, FTAG); dsl_pool_config_exit(dp, FTAG); fnvlist_add_uint64(nvl, DSL_CRYPTO_KEY_CRYPTO_SUITE, crypt); fnvlist_add_uint64(nvl, DSL_CRYPTO_KEY_GUID, key_guid); fnvlist_add_uint64(nvl, DSL_CRYPTO_KEY_VERSION, version); VERIFY0(nvlist_add_uint8_array(nvl, DSL_CRYPTO_KEY_MASTER_KEY, raw_keydata, MASTER_KEY_MAX_LEN)); VERIFY0(nvlist_add_uint8_array(nvl, DSL_CRYPTO_KEY_HMAC_KEY, raw_hmac_keydata, SHA512_HMAC_KEYLEN)); VERIFY0(nvlist_add_uint8_array(nvl, DSL_CRYPTO_KEY_IV, iv, WRAPPING_IV_LEN)); VERIFY0(nvlist_add_uint8_array(nvl, DSL_CRYPTO_KEY_MAC, mac, WRAPPING_MAC_LEN)); VERIFY0(nvlist_add_uint8_array(nvl, "portable_mac", os->os_phys->os_portable_mac, ZIO_OBJSET_MAC_LEN)); fnvlist_add_uint64(nvl, zfs_prop_to_name(ZFS_PROP_KEYFORMAT), format); fnvlist_add_uint64(nvl, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), iters); fnvlist_add_uint64(nvl, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), salt); fnvlist_add_uint64(nvl, "mdn_checksum", mdn->dn_checksum); fnvlist_add_uint64(nvl, "mdn_compress", mdn->dn_compress); fnvlist_add_uint64(nvl, "mdn_nlevels", mdn->dn_nlevels); fnvlist_add_uint64(nvl, "mdn_blksz", mdn->dn_datablksz); fnvlist_add_uint64(nvl, "mdn_indblkshift", mdn->dn_indblkshift); fnvlist_add_uint64(nvl, "mdn_nblkptr", mdn->dn_nblkptr); fnvlist_add_uint64(nvl, "mdn_maxblkid", mdn->dn_maxblkid); fnvlist_add_uint64(nvl, "to_ivset_guid", to_ivset_guid); fnvlist_add_uint64(nvl, "from_ivset_guid", from_ivset_guid); *nvl_out = nvl; return (0); error_unlock: dsl_pool_config_exit(dp, FTAG); error: if (rdd != NULL) dsl_dir_rele(rdd, FTAG); nvlist_free(nvl); *nvl_out = NULL; return (ret); } uint64_t dsl_crypto_key_create_sync(uint64_t crypt, dsl_wrapping_key_t *wkey, dmu_tx_t *tx) { dsl_crypto_key_t dck; uint64_t version = ZIO_CRYPT_KEY_CURRENT_VERSION; uint64_t one = 1ULL; ASSERT(dmu_tx_is_syncing(tx)); ASSERT3U(crypt, <, ZIO_CRYPT_FUNCTIONS); ASSERT3U(crypt, >, ZIO_CRYPT_OFF); /* create the DSL Crypto Key ZAP object */ dck.dck_obj = zap_create(tx->tx_pool->dp_meta_objset, DMU_OTN_ZAP_METADATA, DMU_OT_NONE, 0, tx); /* fill in the key (on the stack) and sync it to disk */ dck.dck_wkey = wkey; VERIFY0(zio_crypt_key_init(crypt, &dck.dck_key)); dsl_crypto_key_sync(&dck, tx); VERIFY0(zap_update(tx->tx_pool->dp_meta_objset, dck.dck_obj, DSL_CRYPTO_KEY_REFCOUNT, sizeof (uint64_t), 1, &one, tx)); VERIFY0(zap_update(tx->tx_pool->dp_meta_objset, dck.dck_obj, DSL_CRYPTO_KEY_VERSION, sizeof (uint64_t), 1, &version, tx)); zio_crypt_key_destroy(&dck.dck_key); bzero(&dck.dck_key, sizeof (zio_crypt_key_t)); return (dck.dck_obj); } uint64_t dsl_crypto_key_clone_sync(dsl_dir_t *origindd, dmu_tx_t *tx) { objset_t *mos = tx->tx_pool->dp_meta_objset; ASSERT(dmu_tx_is_syncing(tx)); VERIFY0(zap_increment(mos, origindd->dd_crypto_obj, DSL_CRYPTO_KEY_REFCOUNT, 1, tx)); return (origindd->dd_crypto_obj); } void dsl_crypto_key_destroy_sync(uint64_t dckobj, dmu_tx_t *tx) { objset_t *mos = tx->tx_pool->dp_meta_objset; uint64_t refcnt; /* Decrement the refcount, destroy if this is the last reference */ VERIFY0(zap_lookup(mos, dckobj, DSL_CRYPTO_KEY_REFCOUNT, sizeof (uint64_t), 1, &refcnt)); if (refcnt != 1) { VERIFY0(zap_increment(mos, dckobj, DSL_CRYPTO_KEY_REFCOUNT, -1, tx)); } else { VERIFY0(zap_destroy(mos, dckobj, tx)); } } void dsl_dataset_crypt_stats(dsl_dataset_t *ds, nvlist_t *nv) { uint64_t intval; dsl_dir_t *dd = ds->ds_dir; dsl_dir_t *enc_root; char buf[ZFS_MAX_DATASET_NAME_LEN]; if (dd->dd_crypto_obj == 0) return; intval = dsl_dataset_get_keystatus(dd); dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_KEYSTATUS, intval); if (dsl_dir_get_crypt(dd, &intval) == 0) dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_ENCRYPTION, intval); if (zap_lookup(dd->dd_pool->dp_meta_objset, dd->dd_crypto_obj, DSL_CRYPTO_KEY_GUID, 8, 1, &intval) == 0) { dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_KEY_GUID, intval); } if (zap_lookup(dd->dd_pool->dp_meta_objset, dd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_KEYFORMAT), 8, 1, &intval) == 0) { dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_KEYFORMAT, intval); } if (zap_lookup(dd->dd_pool->dp_meta_objset, dd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_PBKDF2_SALT), 8, 1, &intval) == 0) { dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_PBKDF2_SALT, intval); } if (zap_lookup(dd->dd_pool->dp_meta_objset, dd->dd_crypto_obj, zfs_prop_to_name(ZFS_PROP_PBKDF2_ITERS), 8, 1, &intval) == 0) { dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_PBKDF2_ITERS, intval); } if (zap_lookup(dd->dd_pool->dp_meta_objset, ds->ds_object, DS_FIELD_IVSET_GUID, 8, 1, &intval) == 0) { dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_IVSET_GUID, intval); } if (dsl_dir_get_encryption_root_ddobj(dd, &intval) == 0) { if (dsl_dir_hold_obj(dd->dd_pool, intval, NULL, FTAG, &enc_root) == 0) { dsl_dir_name(enc_root, buf); dsl_dir_rele(enc_root, FTAG); dsl_prop_nvlist_add_string(nv, ZFS_PROP_ENCRYPTION_ROOT, buf); } } } int spa_crypt_get_salt(spa_t *spa, uint64_t dsobj, uint8_t *salt) { int ret; dsl_crypto_key_t *dck = NULL; /* look up the key from the spa's keystore */ ret = spa_keystore_lookup_key(spa, dsobj, FTAG, &dck); if (ret != 0) goto error; ret = zio_crypt_key_get_salt(&dck->dck_key, salt); if (ret != 0) goto error; spa_keystore_dsl_key_rele(spa, dck, FTAG); return (0); error: if (dck != NULL) spa_keystore_dsl_key_rele(spa, dck, FTAG); return (ret); } /* * Objset blocks are a special case for MAC generation. These blocks have 2 * 256-bit MACs which are embedded within the block itself, rather than a * single 128 bit MAC. As a result, this function handles encoding and decoding * the MACs on its own, unlike other functions in this file. */ int spa_do_crypt_objset_mac_abd(boolean_t generate, spa_t *spa, uint64_t dsobj, abd_t *abd, uint_t datalen, boolean_t byteswap) { int ret; dsl_crypto_key_t *dck = NULL; void *buf = abd_borrow_buf_copy(abd, datalen); objset_phys_t *osp = buf; uint8_t portable_mac[ZIO_OBJSET_MAC_LEN]; uint8_t local_mac[ZIO_OBJSET_MAC_LEN]; /* look up the key from the spa's keystore */ ret = spa_keystore_lookup_key(spa, dsobj, FTAG, &dck); if (ret != 0) goto error; /* calculate both HMACs */ ret = zio_crypt_do_objset_hmacs(&dck->dck_key, buf, datalen, byteswap, portable_mac, local_mac); if (ret != 0) goto error; spa_keystore_dsl_key_rele(spa, dck, FTAG); /* if we are generating encode the HMACs in the objset_phys_t */ if (generate) { bcopy(portable_mac, osp->os_portable_mac, ZIO_OBJSET_MAC_LEN); bcopy(local_mac, osp->os_local_mac, ZIO_OBJSET_MAC_LEN); abd_return_buf_copy(abd, buf, datalen); return (0); } if (bcmp(portable_mac, osp->os_portable_mac, ZIO_OBJSET_MAC_LEN) != 0 || bcmp(local_mac, osp->os_local_mac, ZIO_OBJSET_MAC_LEN) != 0) { abd_return_buf(abd, buf, datalen); return (SET_ERROR(ECKSUM)); } abd_return_buf(abd, buf, datalen); return (0); error: if (dck != NULL) spa_keystore_dsl_key_rele(spa, dck, FTAG); abd_return_buf(abd, buf, datalen); return (ret); } int spa_do_crypt_mac_abd(boolean_t generate, spa_t *spa, uint64_t dsobj, abd_t *abd, uint_t datalen, uint8_t *mac) { int ret; dsl_crypto_key_t *dck = NULL; uint8_t *buf = abd_borrow_buf_copy(abd, datalen); uint8_t digestbuf[ZIO_DATA_MAC_LEN]; /* look up the key from the spa's keystore */ ret = spa_keystore_lookup_key(spa, dsobj, FTAG, &dck); if (ret != 0) goto error; /* perform the hmac */ ret = zio_crypt_do_hmac(&dck->dck_key, buf, datalen, digestbuf, ZIO_DATA_MAC_LEN); if (ret != 0) goto error; abd_return_buf(abd, buf, datalen); spa_keystore_dsl_key_rele(spa, dck, FTAG); /* * Truncate and fill in mac buffer if we were asked to generate a MAC. * Otherwise verify that the MAC matched what we expected. */ if (generate) { bcopy(digestbuf, mac, ZIO_DATA_MAC_LEN); return (0); } if (bcmp(digestbuf, mac, ZIO_DATA_MAC_LEN) != 0) return (SET_ERROR(ECKSUM)); return (0); error: if (dck != NULL) spa_keystore_dsl_key_rele(spa, dck, FTAG); abd_return_buf(abd, buf, datalen); return (ret); } /* * This function serves as a multiplexer for encryption and decryption of * all blocks (except the L2ARC). For encryption, it will populate the IV, * salt, MAC, and cabd (the ciphertext). On decryption it will simply use * these fields to populate pabd (the plaintext). */ int spa_do_crypt_abd(boolean_t encrypt, spa_t *spa, const zbookmark_phys_t *zb, dmu_object_type_t ot, boolean_t dedup, boolean_t bswap, uint8_t *salt, uint8_t *iv, uint8_t *mac, uint_t datalen, abd_t *pabd, abd_t *cabd, boolean_t *no_crypt) { int ret; dsl_crypto_key_t *dck = NULL; uint8_t *plainbuf = NULL, *cipherbuf = NULL; ASSERT(spa_feature_is_active(spa, SPA_FEATURE_ENCRYPTION)); /* look up the key from the spa's keystore */ ret = spa_keystore_lookup_key(spa, zb->zb_objset, FTAG, &dck); if (ret != 0) { ret = SET_ERROR(EACCES); return (ret); } if (encrypt) { plainbuf = abd_borrow_buf_copy(pabd, datalen); cipherbuf = abd_borrow_buf(cabd, datalen); } else { plainbuf = abd_borrow_buf(pabd, datalen); cipherbuf = abd_borrow_buf_copy(cabd, datalen); } /* * Both encryption and decryption functions need a salt for key * generation and an IV. When encrypting a non-dedup block, we * generate the salt and IV randomly to be stored by the caller. Dedup * blocks perform a (more expensive) HMAC of the plaintext to obtain * the salt and the IV. ZIL blocks have their salt and IV generated * at allocation time in zio_alloc_zil(). On decryption, we simply use * the provided values. */ if (encrypt && ot != DMU_OT_INTENT_LOG && !dedup) { ret = zio_crypt_key_get_salt(&dck->dck_key, salt); if (ret != 0) goto error; ret = zio_crypt_generate_iv(iv); if (ret != 0) goto error; } else if (encrypt && dedup) { ret = zio_crypt_generate_iv_salt_dedup(&dck->dck_key, plainbuf, datalen, iv, salt); if (ret != 0) goto error; } /* call lower level function to perform encryption / decryption */ ret = zio_do_crypt_data(encrypt, &dck->dck_key, ot, bswap, salt, iv, mac, datalen, plainbuf, cipherbuf, no_crypt); /* * Handle injected decryption faults. Unfortunately, we cannot inject * faults for dnode blocks because we might trigger the panic in * dbuf_prepare_encrypted_dnode_leaf(), which exists because syncing * context is not prepared to handle malicious decryption failures. */ if (zio_injection_enabled && !encrypt && ot != DMU_OT_DNODE && ret == 0) ret = zio_handle_decrypt_injection(spa, zb, ot, ECKSUM); if (ret != 0) goto error; if (encrypt) { abd_return_buf(pabd, plainbuf, datalen); abd_return_buf_copy(cabd, cipherbuf, datalen); } else { abd_return_buf_copy(pabd, plainbuf, datalen); abd_return_buf(cabd, cipherbuf, datalen); } spa_keystore_dsl_key_rele(spa, dck, FTAG); return (0); error: if (encrypt) { /* zero out any state we might have changed while encrypting */ bzero(salt, ZIO_DATA_SALT_LEN); bzero(iv, ZIO_DATA_IV_LEN); bzero(mac, ZIO_DATA_MAC_LEN); abd_return_buf(pabd, plainbuf, datalen); abd_return_buf_copy(cabd, cipherbuf, datalen); } else { abd_return_buf_copy(pabd, plainbuf, datalen); abd_return_buf(cabd, cipherbuf, datalen); } spa_keystore_dsl_key_rele(spa, dck, FTAG); return (ret); } ZFS_MODULE_PARAM(zfs, zfs_, disable_ivset_guid_check, INT, ZMOD_RW, "Set to allow raw receives without IVset guids"); Index: vendor-sys/openzfs/dist/module/zfs/spa_misc.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/spa_misc.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/spa_misc.c (revision 366348) @@ -1,2921 +1,2922 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2011, 2019 by Delphix. All rights reserved. * Copyright 2015 Nexenta Systems, Inc. All rights reserved. * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved. * Copyright 2013 Saso Kiselkov. All rights reserved. * Copyright (c) 2017 Datto Inc. * Copyright (c) 2017, Intel Corporation. * Copyright (c) 2019, loli10K . All rights reserved. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include "zfs_prop.h" #include #include #include #include /* * SPA locking * * There are three basic locks for managing spa_t structures: * * spa_namespace_lock (global mutex) * * This lock must be acquired to do any of the following: * * - Lookup a spa_t by name * - Add or remove a spa_t from the namespace * - Increase spa_refcount from non-zero * - Check if spa_refcount is zero * - Rename a spa_t * - add/remove/attach/detach devices * - Held for the duration of create/destroy/import/export * * It does not need to handle recursion. A create or destroy may * reference objects (files or zvols) in other pools, but by * definition they must have an existing reference, and will never need * to lookup a spa_t by name. * * spa_refcount (per-spa zfs_refcount_t protected by mutex) * * This reference count keep track of any active users of the spa_t. The * spa_t cannot be destroyed or freed while this is non-zero. Internally, * the refcount is never really 'zero' - opening a pool implicitly keeps * some references in the DMU. Internally we check against spa_minref, but * present the image of a zero/non-zero value to consumers. * * spa_config_lock[] (per-spa array of rwlocks) * * This protects the spa_t from config changes, and must be held in * the following circumstances: * * - RW_READER to perform I/O to the spa * - RW_WRITER to change the vdev config * * The locking order is fairly straightforward: * * spa_namespace_lock -> spa_refcount * * The namespace lock must be acquired to increase the refcount from 0 * or to check if it is zero. * * spa_refcount -> spa_config_lock[] * * There must be at least one valid reference on the spa_t to acquire * the config lock. * * spa_namespace_lock -> spa_config_lock[] * * The namespace lock must always be taken before the config lock. * * * The spa_namespace_lock can be acquired directly and is globally visible. * * The namespace is manipulated using the following functions, all of which * require the spa_namespace_lock to be held. * * spa_lookup() Lookup a spa_t by name. * * spa_add() Create a new spa_t in the namespace. * * spa_remove() Remove a spa_t from the namespace. This also * frees up any memory associated with the spa_t. * * spa_next() Returns the next spa_t in the system, or the * first if NULL is passed. * * spa_evict_all() Shutdown and remove all spa_t structures in * the system. * * spa_guid_exists() Determine whether a pool/device guid exists. * * The spa_refcount is manipulated using the following functions: * * spa_open_ref() Adds a reference to the given spa_t. Must be * called with spa_namespace_lock held if the * refcount is currently zero. * * spa_close() Remove a reference from the spa_t. This will * not free the spa_t or remove it from the * namespace. No locking is required. * * spa_refcount_zero() Returns true if the refcount is currently * zero. Must be called with spa_namespace_lock * held. * * The spa_config_lock[] is an array of rwlocks, ordered as follows: * SCL_CONFIG > SCL_STATE > SCL_ALLOC > SCL_ZIO > SCL_FREE > SCL_VDEV. * spa_config_lock[] is manipulated with spa_config_{enter,exit,held}(). * * To read the configuration, it suffices to hold one of these locks as reader. * To modify the configuration, you must hold all locks as writer. To modify * vdev state without altering the vdev tree's topology (e.g. online/offline), * you must hold SCL_STATE and SCL_ZIO as writer. * * We use these distinct config locks to avoid recursive lock entry. * For example, spa_sync() (which holds SCL_CONFIG as reader) induces * block allocations (SCL_ALLOC), which may require reading space maps * from disk (dmu_read() -> zio_read() -> SCL_ZIO). * * The spa config locks cannot be normal rwlocks because we need the * ability to hand off ownership. For example, SCL_ZIO is acquired * by the issuing thread and later released by an interrupt thread. * They do, however, obey the usual write-wanted semantics to prevent * writer (i.e. system administrator) starvation. * * The lock acquisition rules are as follows: * * SCL_CONFIG * Protects changes to the vdev tree topology, such as vdev * add/remove/attach/detach. Protects the dirty config list * (spa_config_dirty_list) and the set of spares and l2arc devices. * * SCL_STATE * Protects changes to pool state and vdev state, such as vdev * online/offline/fault/degrade/clear. Protects the dirty state list * (spa_state_dirty_list) and global pool state (spa_state). * * SCL_ALLOC * Protects changes to metaslab groups and classes. * Held as reader by metaslab_alloc() and metaslab_claim(). * * SCL_ZIO * Held by bp-level zios (those which have no io_vd upon entry) * to prevent changes to the vdev tree. The bp-level zio implicitly * protects all of its vdev child zios, which do not hold SCL_ZIO. * * SCL_FREE * Protects changes to metaslab groups and classes. * Held as reader by metaslab_free(). SCL_FREE is distinct from * SCL_ALLOC, and lower than SCL_ZIO, so that we can safely free * blocks in zio_done() while another i/o that holds either * SCL_ALLOC or SCL_ZIO is waiting for this i/o to complete. * * SCL_VDEV * Held as reader to prevent changes to the vdev tree during trivial * inquiries such as bp_get_dsize(). SCL_VDEV is distinct from the * other locks, and lower than all of them, to ensure that it's safe * to acquire regardless of caller context. * * In addition, the following rules apply: * * (a) spa_props_lock protects pool properties, spa_config and spa_config_list. * The lock ordering is SCL_CONFIG > spa_props_lock. * * (b) I/O operations on leaf vdevs. For any zio operation that takes * an explicit vdev_t argument -- such as zio_ioctl(), zio_read_phys(), * or zio_write_phys() -- the caller must ensure that the config cannot * cannot change in the interim, and that the vdev cannot be reopened. * SCL_STATE as reader suffices for both. * * The vdev configuration is protected by spa_vdev_enter() / spa_vdev_exit(). * * spa_vdev_enter() Acquire the namespace lock and the config lock * for writing. * * spa_vdev_exit() Release the config lock, wait for all I/O * to complete, sync the updated configs to the * cache, and release the namespace lock. * * vdev state is protected by spa_vdev_state_enter() / spa_vdev_state_exit(). * Like spa_vdev_enter/exit, these are convenience wrappers -- the actual * locking is, always, based on spa_namespace_lock and spa_config_lock[]. */ static avl_tree_t spa_namespace_avl; kmutex_t spa_namespace_lock; static kcondvar_t spa_namespace_cv; int spa_max_replication_override = SPA_DVAS_PER_BP; static kmutex_t spa_spare_lock; static avl_tree_t spa_spare_avl; static kmutex_t spa_l2cache_lock; static avl_tree_t spa_l2cache_avl; kmem_cache_t *spa_buffer_pool; spa_mode_t spa_mode_global = SPA_MODE_UNINIT; #ifdef ZFS_DEBUG /* * Everything except dprintf, set_error, spa, and indirect_remap is on * by default in debug builds. */ int zfs_flags = ~(ZFS_DEBUG_DPRINTF | ZFS_DEBUG_SET_ERROR | ZFS_DEBUG_INDIRECT_REMAP); #else int zfs_flags = 0; #endif /* * zfs_recover can be set to nonzero to attempt to recover from * otherwise-fatal errors, typically caused by on-disk corruption. When * set, calls to zfs_panic_recover() will turn into warning messages. * This should only be used as a last resort, as it typically results * in leaked space, or worse. */ int zfs_recover = B_FALSE; /* * If destroy encounters an EIO while reading metadata (e.g. indirect * blocks), space referenced by the missing metadata can not be freed. * Normally this causes the background destroy to become "stalled", as * it is unable to make forward progress. While in this stalled state, * all remaining space to free from the error-encountering filesystem is * "temporarily leaked". Set this flag to cause it to ignore the EIO, * permanently leak the space from indirect blocks that can not be read, * and continue to free everything else that it can. * * The default, "stalling" behavior is useful if the storage partially * fails (i.e. some but not all i/os fail), and then later recovers. In * this case, we will be able to continue pool operations while it is * partially failed, and when it recovers, we can continue to free the * space, with no leaks. However, note that this case is actually * fairly rare. * * Typically pools either (a) fail completely (but perhaps temporarily, * e.g. a top-level vdev going offline), or (b) have localized, * permanent errors (e.g. disk returns the wrong data due to bit flip or * firmware bug). In case (a), this setting does not matter because the * pool will be suspended and the sync thread will not be able to make * forward progress regardless. In case (b), because the error is * permanent, the best we can do is leak the minimum amount of space, * which is what setting this flag will do. Therefore, it is reasonable * for this flag to normally be set, but we chose the more conservative * approach of not setting it, so that there is no possibility of * leaking space in the "partial temporary" failure case. */ int zfs_free_leak_on_eio = B_FALSE; /* * Expiration time in milliseconds. This value has two meanings. First it is * used to determine when the spa_deadman() logic should fire. By default the * spa_deadman() will fire if spa_sync() has not completed in 600 seconds. * Secondly, the value determines if an I/O is considered "hung". Any I/O that * has not completed in zfs_deadman_synctime_ms is considered "hung" resulting * in one of three behaviors controlled by zfs_deadman_failmode. */ unsigned long zfs_deadman_synctime_ms = 600000UL; /* * This value controls the maximum amount of time zio_wait() will block for an * outstanding IO. By default this is 300 seconds at which point the "hung" * behavior will be applied as described for zfs_deadman_synctime_ms. */ unsigned long zfs_deadman_ziotime_ms = 300000UL; /* * Check time in milliseconds. This defines the frequency at which we check * for hung I/O. */ unsigned long zfs_deadman_checktime_ms = 60000UL; /* * By default the deadman is enabled. */ int zfs_deadman_enabled = 1; /* * Controls the behavior of the deadman when it detects a "hung" I/O. * Valid values are zfs_deadman_failmode=. * * wait - Wait for the "hung" I/O (default) * continue - Attempt to recover from a "hung" I/O * panic - Panic the system */ char *zfs_deadman_failmode = "wait"; /* * The worst case is single-sector max-parity RAID-Z blocks, in which * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1) * times the size; so just assume that. Add to this the fact that * we can have up to 3 DVAs per bp, and one more factor of 2 because * the block may be dittoed with up to 3 DVAs by ddt_sync(). All together, * the worst case is: * (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2 == 24 */ int spa_asize_inflation = 24; /* * Normally, we don't allow the last 3.2% (1/(2^spa_slop_shift)) of space in * the pool to be consumed. This ensures that we don't run the pool * completely out of space, due to unaccounted changes (e.g. to the MOS). * It also limits the worst-case time to allocate space. If we have * less than this amount of free space, most ZPL operations (e.g. write, * create) will return ENOSPC. * * Certain operations (e.g. file removal, most administrative actions) can * use half the slop space. They will only return ENOSPC if less than half * the slop space is free. Typically, once the pool has less than the slop * space free, the user will use these operations to free up space in the pool. * These are the operations that call dsl_pool_adjustedsize() with the netfree * argument set to TRUE. * * Operations that are almost guaranteed to free up space in the absence of * a pool checkpoint can use up to three quarters of the slop space * (e.g zfs destroy). * * A very restricted set of operations are always permitted, regardless of * the amount of free space. These are the operations that call * dsl_sync_task(ZFS_SPACE_CHECK_NONE). If these operations result in a net * increase in the amount of space used, it is possible to run the pool * completely out of space, causing it to be permanently read-only. * * Note that on very small pools, the slop space will be larger than * 3.2%, in an effort to have it be at least spa_min_slop (128MB), * but we never allow it to be more than half the pool size. * * See also the comments in zfs_space_check_t. */ int spa_slop_shift = 5; uint64_t spa_min_slop = 128 * 1024 * 1024; int spa_allocators = 4; /*PRINTFLIKE2*/ void spa_load_failed(spa_t *spa, const char *fmt, ...) { va_list adx; char buf[256]; va_start(adx, fmt); (void) vsnprintf(buf, sizeof (buf), fmt, adx); va_end(adx); zfs_dbgmsg("spa_load(%s, config %s): FAILED: %s", spa->spa_name, spa->spa_trust_config ? "trusted" : "untrusted", buf); } /*PRINTFLIKE2*/ void spa_load_note(spa_t *spa, const char *fmt, ...) { va_list adx; char buf[256]; va_start(adx, fmt); (void) vsnprintf(buf, sizeof (buf), fmt, adx); va_end(adx); zfs_dbgmsg("spa_load(%s, config %s): %s", spa->spa_name, spa->spa_trust_config ? "trusted" : "untrusted", buf); } /* * By default dedup and user data indirects land in the special class */ int zfs_ddt_data_is_special = B_TRUE; int zfs_user_indirect_is_special = B_TRUE; /* * The percentage of special class final space reserved for metadata only. * Once we allocate 100 - zfs_special_class_metadata_reserve_pct we only * let metadata into the class. */ int zfs_special_class_metadata_reserve_pct = 25; /* * ========================================================================== * SPA config locking * ========================================================================== */ static void spa_config_lock_init(spa_t *spa) { for (int i = 0; i < SCL_LOCKS; i++) { spa_config_lock_t *scl = &spa->spa_config_lock[i]; mutex_init(&scl->scl_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&scl->scl_cv, NULL, CV_DEFAULT, NULL); zfs_refcount_create_untracked(&scl->scl_count); scl->scl_writer = NULL; scl->scl_write_wanted = 0; } } static void spa_config_lock_destroy(spa_t *spa) { for (int i = 0; i < SCL_LOCKS; i++) { spa_config_lock_t *scl = &spa->spa_config_lock[i]; mutex_destroy(&scl->scl_lock); cv_destroy(&scl->scl_cv); zfs_refcount_destroy(&scl->scl_count); ASSERT(scl->scl_writer == NULL); ASSERT(scl->scl_write_wanted == 0); } } int spa_config_tryenter(spa_t *spa, int locks, void *tag, krw_t rw) { for (int i = 0; i < SCL_LOCKS; i++) { spa_config_lock_t *scl = &spa->spa_config_lock[i]; if (!(locks & (1 << i))) continue; mutex_enter(&scl->scl_lock); if (rw == RW_READER) { if (scl->scl_writer || scl->scl_write_wanted) { mutex_exit(&scl->scl_lock); spa_config_exit(spa, locks & ((1 << i) - 1), tag); return (0); } } else { ASSERT(scl->scl_writer != curthread); if (!zfs_refcount_is_zero(&scl->scl_count)) { mutex_exit(&scl->scl_lock); spa_config_exit(spa, locks & ((1 << i) - 1), tag); return (0); } scl->scl_writer = curthread; } (void) zfs_refcount_add(&scl->scl_count, tag); mutex_exit(&scl->scl_lock); } return (1); } void spa_config_enter(spa_t *spa, int locks, const void *tag, krw_t rw) { int wlocks_held = 0; ASSERT3U(SCL_LOCKS, <, sizeof (wlocks_held) * NBBY); for (int i = 0; i < SCL_LOCKS; i++) { spa_config_lock_t *scl = &spa->spa_config_lock[i]; if (scl->scl_writer == curthread) wlocks_held |= (1 << i); if (!(locks & (1 << i))) continue; mutex_enter(&scl->scl_lock); if (rw == RW_READER) { while (scl->scl_writer || scl->scl_write_wanted) { cv_wait(&scl->scl_cv, &scl->scl_lock); } } else { ASSERT(scl->scl_writer != curthread); while (!zfs_refcount_is_zero(&scl->scl_count)) { scl->scl_write_wanted++; cv_wait(&scl->scl_cv, &scl->scl_lock); scl->scl_write_wanted--; } scl->scl_writer = curthread; } (void) zfs_refcount_add(&scl->scl_count, tag); mutex_exit(&scl->scl_lock); } ASSERT3U(wlocks_held, <=, locks); } void spa_config_exit(spa_t *spa, int locks, const void *tag) { for (int i = SCL_LOCKS - 1; i >= 0; i--) { spa_config_lock_t *scl = &spa->spa_config_lock[i]; if (!(locks & (1 << i))) continue; mutex_enter(&scl->scl_lock); ASSERT(!zfs_refcount_is_zero(&scl->scl_count)); if (zfs_refcount_remove(&scl->scl_count, tag) == 0) { ASSERT(scl->scl_writer == NULL || scl->scl_writer == curthread); scl->scl_writer = NULL; /* OK in either case */ cv_broadcast(&scl->scl_cv); } mutex_exit(&scl->scl_lock); } } int spa_config_held(spa_t *spa, int locks, krw_t rw) { int locks_held = 0; for (int i = 0; i < SCL_LOCKS; i++) { spa_config_lock_t *scl = &spa->spa_config_lock[i]; if (!(locks & (1 << i))) continue; if ((rw == RW_READER && !zfs_refcount_is_zero(&scl->scl_count)) || (rw == RW_WRITER && scl->scl_writer == curthread)) locks_held |= 1 << i; } return (locks_held); } /* * ========================================================================== * SPA namespace functions * ========================================================================== */ /* * Lookup the named spa_t in the AVL tree. The spa_namespace_lock must be held. * Returns NULL if no matching spa_t is found. */ spa_t * spa_lookup(const char *name) { static spa_t search; /* spa_t is large; don't allocate on stack */ spa_t *spa; avl_index_t where; char *cp; ASSERT(MUTEX_HELD(&spa_namespace_lock)); (void) strlcpy(search.spa_name, name, sizeof (search.spa_name)); /* * If it's a full dataset name, figure out the pool name and * just use that. */ cp = strpbrk(search.spa_name, "/@#"); if (cp != NULL) *cp = '\0'; spa = avl_find(&spa_namespace_avl, &search, &where); return (spa); } /* * Fires when spa_sync has not completed within zfs_deadman_synctime_ms. * If the zfs_deadman_enabled flag is set then it inspects all vdev queues * looking for potentially hung I/Os. */ void spa_deadman(void *arg) { spa_t *spa = arg; /* Disable the deadman if the pool is suspended. */ if (spa_suspended(spa)) return; zfs_dbgmsg("slow spa_sync: started %llu seconds ago, calls %llu", (gethrtime() - spa->spa_sync_starttime) / NANOSEC, ++spa->spa_deadman_calls); if (zfs_deadman_enabled) vdev_deadman(spa->spa_root_vdev, FTAG); spa->spa_deadman_tqid = taskq_dispatch_delay(system_delay_taskq, spa_deadman, spa, TQ_SLEEP, ddi_get_lbolt() + MSEC_TO_TICK(zfs_deadman_checktime_ms)); } static int spa_log_sm_sort_by_txg(const void *va, const void *vb) { const spa_log_sm_t *a = va; const spa_log_sm_t *b = vb; return (TREE_CMP(a->sls_txg, b->sls_txg)); } /* * Create an uninitialized spa_t with the given name. Requires * spa_namespace_lock. The caller must ensure that the spa_t doesn't already * exist by calling spa_lookup() first. */ spa_t * spa_add(const char *name, nvlist_t *config, const char *altroot) { spa_t *spa; spa_config_dirent_t *dp; ASSERT(MUTEX_HELD(&spa_namespace_lock)); spa = kmem_zalloc(sizeof (spa_t), KM_SLEEP); mutex_init(&spa->spa_async_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_errlist_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_errlog_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_evicting_os_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_history_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_proc_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_props_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_cksum_tmpls_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_scrub_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_suspend_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_vdev_top_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_feat_stats_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_flushed_ms_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa->spa_activities_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&spa->spa_async_cv, NULL, CV_DEFAULT, NULL); cv_init(&spa->spa_evicting_os_cv, NULL, CV_DEFAULT, NULL); cv_init(&spa->spa_proc_cv, NULL, CV_DEFAULT, NULL); cv_init(&spa->spa_scrub_io_cv, NULL, CV_DEFAULT, NULL); cv_init(&spa->spa_suspend_cv, NULL, CV_DEFAULT, NULL); cv_init(&spa->spa_activities_cv, NULL, CV_DEFAULT, NULL); cv_init(&spa->spa_waiters_cv, NULL, CV_DEFAULT, NULL); for (int t = 0; t < TXG_SIZE; t++) bplist_create(&spa->spa_free_bplist[t]); (void) strlcpy(spa->spa_name, name, sizeof (spa->spa_name)); spa->spa_state = POOL_STATE_UNINITIALIZED; spa->spa_freeze_txg = UINT64_MAX; spa->spa_final_txg = UINT64_MAX; spa->spa_load_max_txg = UINT64_MAX; spa->spa_proc = &p0; spa->spa_proc_state = SPA_PROC_NONE; spa->spa_trust_config = B_TRUE; spa->spa_hostid = zone_get_hostid(NULL); spa->spa_deadman_synctime = MSEC2NSEC(zfs_deadman_synctime_ms); spa->spa_deadman_ziotime = MSEC2NSEC(zfs_deadman_ziotime_ms); spa_set_deadman_failmode(spa, zfs_deadman_failmode); zfs_refcount_create(&spa->spa_refcount); spa_config_lock_init(spa); spa_stats_init(spa); avl_add(&spa_namespace_avl, spa); /* * Set the alternate root, if there is one. */ if (altroot) spa->spa_root = spa_strdup(altroot); spa->spa_alloc_count = spa_allocators; spa->spa_alloc_locks = kmem_zalloc(spa->spa_alloc_count * sizeof (kmutex_t), KM_SLEEP); spa->spa_alloc_trees = kmem_zalloc(spa->spa_alloc_count * sizeof (avl_tree_t), KM_SLEEP); for (int i = 0; i < spa->spa_alloc_count; i++) { mutex_init(&spa->spa_alloc_locks[i], NULL, MUTEX_DEFAULT, NULL); avl_create(&spa->spa_alloc_trees[i], zio_bookmark_compare, sizeof (zio_t), offsetof(zio_t, io_alloc_node)); } avl_create(&spa->spa_metaslabs_by_flushed, metaslab_sort_by_flushed, sizeof (metaslab_t), offsetof(metaslab_t, ms_spa_txg_node)); avl_create(&spa->spa_sm_logs_by_txg, spa_log_sm_sort_by_txg, sizeof (spa_log_sm_t), offsetof(spa_log_sm_t, sls_node)); list_create(&spa->spa_log_summary, sizeof (log_summary_entry_t), offsetof(log_summary_entry_t, lse_node)); /* * Every pool starts with the default cachefile */ list_create(&spa->spa_config_list, sizeof (spa_config_dirent_t), offsetof(spa_config_dirent_t, scd_link)); dp = kmem_zalloc(sizeof (spa_config_dirent_t), KM_SLEEP); dp->scd_path = altroot ? NULL : spa_strdup(spa_config_path); list_insert_head(&spa->spa_config_list, dp); VERIFY(nvlist_alloc(&spa->spa_load_info, NV_UNIQUE_NAME, KM_SLEEP) == 0); if (config != NULL) { nvlist_t *features; if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_FEATURES_FOR_READ, &features) == 0) { VERIFY(nvlist_dup(features, &spa->spa_label_features, 0) == 0); } VERIFY(nvlist_dup(config, &spa->spa_config, 0) == 0); } if (spa->spa_label_features == NULL) { VERIFY(nvlist_alloc(&spa->spa_label_features, NV_UNIQUE_NAME, KM_SLEEP) == 0); } spa->spa_min_ashift = INT_MAX; spa->spa_max_ashift = 0; /* Reset cached value */ spa->spa_dedup_dspace = ~0ULL; /* * As a pool is being created, treat all features as disabled by * setting SPA_FEATURE_DISABLED for all entries in the feature * refcount cache. */ for (int i = 0; i < SPA_FEATURES; i++) { spa->spa_feat_refcount_cache[i] = SPA_FEATURE_DISABLED; } list_create(&spa->spa_leaf_list, sizeof (vdev_t), offsetof(vdev_t, vdev_leaf_node)); return (spa); } /* * Removes a spa_t from the namespace, freeing up any memory used. Requires * spa_namespace_lock. This is called only after the spa_t has been closed and * deactivated. */ void spa_remove(spa_t *spa) { spa_config_dirent_t *dp; ASSERT(MUTEX_HELD(&spa_namespace_lock)); ASSERT(spa_state(spa) == POOL_STATE_UNINITIALIZED); ASSERT3U(zfs_refcount_count(&spa->spa_refcount), ==, 0); ASSERT0(spa->spa_waiters); nvlist_free(spa->spa_config_splitting); avl_remove(&spa_namespace_avl, spa); cv_broadcast(&spa_namespace_cv); if (spa->spa_root) spa_strfree(spa->spa_root); while ((dp = list_head(&spa->spa_config_list)) != NULL) { list_remove(&spa->spa_config_list, dp); if (dp->scd_path != NULL) spa_strfree(dp->scd_path); kmem_free(dp, sizeof (spa_config_dirent_t)); } for (int i = 0; i < spa->spa_alloc_count; i++) { avl_destroy(&spa->spa_alloc_trees[i]); mutex_destroy(&spa->spa_alloc_locks[i]); } kmem_free(spa->spa_alloc_locks, spa->spa_alloc_count * sizeof (kmutex_t)); kmem_free(spa->spa_alloc_trees, spa->spa_alloc_count * sizeof (avl_tree_t)); avl_destroy(&spa->spa_metaslabs_by_flushed); avl_destroy(&spa->spa_sm_logs_by_txg); list_destroy(&spa->spa_log_summary); list_destroy(&spa->spa_config_list); list_destroy(&spa->spa_leaf_list); nvlist_free(spa->spa_label_features); nvlist_free(spa->spa_load_info); nvlist_free(spa->spa_feat_stats); spa_config_set(spa, NULL); zfs_refcount_destroy(&spa->spa_refcount); spa_stats_destroy(spa); spa_config_lock_destroy(spa); for (int t = 0; t < TXG_SIZE; t++) bplist_destroy(&spa->spa_free_bplist[t]); zio_checksum_templates_free(spa); cv_destroy(&spa->spa_async_cv); cv_destroy(&spa->spa_evicting_os_cv); cv_destroy(&spa->spa_proc_cv); cv_destroy(&spa->spa_scrub_io_cv); cv_destroy(&spa->spa_suspend_cv); cv_destroy(&spa->spa_activities_cv); cv_destroy(&spa->spa_waiters_cv); mutex_destroy(&spa->spa_flushed_ms_lock); mutex_destroy(&spa->spa_async_lock); mutex_destroy(&spa->spa_errlist_lock); mutex_destroy(&spa->spa_errlog_lock); mutex_destroy(&spa->spa_evicting_os_lock); mutex_destroy(&spa->spa_history_lock); mutex_destroy(&spa->spa_proc_lock); mutex_destroy(&spa->spa_props_lock); mutex_destroy(&spa->spa_cksum_tmpls_lock); mutex_destroy(&spa->spa_scrub_lock); mutex_destroy(&spa->spa_suspend_lock); mutex_destroy(&spa->spa_vdev_top_lock); mutex_destroy(&spa->spa_feat_stats_lock); mutex_destroy(&spa->spa_activities_lock); kmem_free(spa, sizeof (spa_t)); } /* * Given a pool, return the next pool in the namespace, or NULL if there is * none. If 'prev' is NULL, return the first pool. */ spa_t * spa_next(spa_t *prev) { ASSERT(MUTEX_HELD(&spa_namespace_lock)); if (prev) return (AVL_NEXT(&spa_namespace_avl, prev)); else return (avl_first(&spa_namespace_avl)); } /* * ========================================================================== * SPA refcount functions * ========================================================================== */ /* * Add a reference to the given spa_t. Must have at least one reference, or * have the namespace lock held. */ void spa_open_ref(spa_t *spa, void *tag) { ASSERT(zfs_refcount_count(&spa->spa_refcount) >= spa->spa_minref || MUTEX_HELD(&spa_namespace_lock)); (void) zfs_refcount_add(&spa->spa_refcount, tag); } /* * Remove a reference to the given spa_t. Must have at least one reference, or * have the namespace lock held. */ void spa_close(spa_t *spa, void *tag) { ASSERT(zfs_refcount_count(&spa->spa_refcount) > spa->spa_minref || MUTEX_HELD(&spa_namespace_lock)); (void) zfs_refcount_remove(&spa->spa_refcount, tag); } /* * Remove a reference to the given spa_t held by a dsl dir that is * being asynchronously released. Async releases occur from a taskq * performing eviction of dsl datasets and dirs. The namespace lock * isn't held and the hold by the object being evicted may contribute to * spa_minref (e.g. dataset or directory released during pool export), * so the asserts in spa_close() do not apply. */ void spa_async_close(spa_t *spa, void *tag) { (void) zfs_refcount_remove(&spa->spa_refcount, tag); } /* * Check to see if the spa refcount is zero. Must be called with * spa_namespace_lock held. We really compare against spa_minref, which is the * number of references acquired when opening a pool */ boolean_t spa_refcount_zero(spa_t *spa) { ASSERT(MUTEX_HELD(&spa_namespace_lock)); return (zfs_refcount_count(&spa->spa_refcount) == spa->spa_minref); } /* * ========================================================================== * SPA spare and l2cache tracking * ========================================================================== */ /* * Hot spares and cache devices are tracked using the same code below, * for 'auxiliary' devices. */ typedef struct spa_aux { uint64_t aux_guid; uint64_t aux_pool; avl_node_t aux_avl; int aux_count; } spa_aux_t; static inline int spa_aux_compare(const void *a, const void *b) { const spa_aux_t *sa = (const spa_aux_t *)a; const spa_aux_t *sb = (const spa_aux_t *)b; return (TREE_CMP(sa->aux_guid, sb->aux_guid)); } static void spa_aux_add(vdev_t *vd, avl_tree_t *avl) { avl_index_t where; spa_aux_t search; spa_aux_t *aux; search.aux_guid = vd->vdev_guid; if ((aux = avl_find(avl, &search, &where)) != NULL) { aux->aux_count++; } else { aux = kmem_zalloc(sizeof (spa_aux_t), KM_SLEEP); aux->aux_guid = vd->vdev_guid; aux->aux_count = 1; avl_insert(avl, aux, where); } } static void spa_aux_remove(vdev_t *vd, avl_tree_t *avl) { spa_aux_t search; spa_aux_t *aux; avl_index_t where; search.aux_guid = vd->vdev_guid; aux = avl_find(avl, &search, &where); ASSERT(aux != NULL); if (--aux->aux_count == 0) { avl_remove(avl, aux); kmem_free(aux, sizeof (spa_aux_t)); } else if (aux->aux_pool == spa_guid(vd->vdev_spa)) { aux->aux_pool = 0ULL; } } static boolean_t spa_aux_exists(uint64_t guid, uint64_t *pool, int *refcnt, avl_tree_t *avl) { spa_aux_t search, *found; search.aux_guid = guid; found = avl_find(avl, &search, NULL); if (pool) { if (found) *pool = found->aux_pool; else *pool = 0ULL; } if (refcnt) { if (found) *refcnt = found->aux_count; else *refcnt = 0; } return (found != NULL); } static void spa_aux_activate(vdev_t *vd, avl_tree_t *avl) { spa_aux_t search, *found; avl_index_t where; search.aux_guid = vd->vdev_guid; found = avl_find(avl, &search, &where); ASSERT(found != NULL); ASSERT(found->aux_pool == 0ULL); found->aux_pool = spa_guid(vd->vdev_spa); } /* * Spares are tracked globally due to the following constraints: * * - A spare may be part of multiple pools. * - A spare may be added to a pool even if it's actively in use within * another pool. * - A spare in use in any pool can only be the source of a replacement if * the target is a spare in the same pool. * * We keep track of all spares on the system through the use of a reference * counted AVL tree. When a vdev is added as a spare, or used as a replacement * spare, then we bump the reference count in the AVL tree. In addition, we set * the 'vdev_isspare' member to indicate that the device is a spare (active or * inactive). When a spare is made active (used to replace a device in the * pool), we also keep track of which pool its been made a part of. * * The 'spa_spare_lock' protects the AVL tree. These functions are normally * called under the spa_namespace lock as part of vdev reconfiguration. The * separate spare lock exists for the status query path, which does not need to * be completely consistent with respect to other vdev configuration changes. */ static int spa_spare_compare(const void *a, const void *b) { return (spa_aux_compare(a, b)); } void spa_spare_add(vdev_t *vd) { mutex_enter(&spa_spare_lock); ASSERT(!vd->vdev_isspare); spa_aux_add(vd, &spa_spare_avl); vd->vdev_isspare = B_TRUE; mutex_exit(&spa_spare_lock); } void spa_spare_remove(vdev_t *vd) { mutex_enter(&spa_spare_lock); ASSERT(vd->vdev_isspare); spa_aux_remove(vd, &spa_spare_avl); vd->vdev_isspare = B_FALSE; mutex_exit(&spa_spare_lock); } boolean_t spa_spare_exists(uint64_t guid, uint64_t *pool, int *refcnt) { boolean_t found; mutex_enter(&spa_spare_lock); found = spa_aux_exists(guid, pool, refcnt, &spa_spare_avl); mutex_exit(&spa_spare_lock); return (found); } void spa_spare_activate(vdev_t *vd) { mutex_enter(&spa_spare_lock); ASSERT(vd->vdev_isspare); spa_aux_activate(vd, &spa_spare_avl); mutex_exit(&spa_spare_lock); } /* * Level 2 ARC devices are tracked globally for the same reasons as spares. * Cache devices currently only support one pool per cache device, and so * for these devices the aux reference count is currently unused beyond 1. */ static int spa_l2cache_compare(const void *a, const void *b) { return (spa_aux_compare(a, b)); } void spa_l2cache_add(vdev_t *vd) { mutex_enter(&spa_l2cache_lock); ASSERT(!vd->vdev_isl2cache); spa_aux_add(vd, &spa_l2cache_avl); vd->vdev_isl2cache = B_TRUE; mutex_exit(&spa_l2cache_lock); } void spa_l2cache_remove(vdev_t *vd) { mutex_enter(&spa_l2cache_lock); ASSERT(vd->vdev_isl2cache); spa_aux_remove(vd, &spa_l2cache_avl); vd->vdev_isl2cache = B_FALSE; mutex_exit(&spa_l2cache_lock); } boolean_t spa_l2cache_exists(uint64_t guid, uint64_t *pool) { boolean_t found; mutex_enter(&spa_l2cache_lock); found = spa_aux_exists(guid, pool, NULL, &spa_l2cache_avl); mutex_exit(&spa_l2cache_lock); return (found); } void spa_l2cache_activate(vdev_t *vd) { mutex_enter(&spa_l2cache_lock); ASSERT(vd->vdev_isl2cache); spa_aux_activate(vd, &spa_l2cache_avl); mutex_exit(&spa_l2cache_lock); } /* * ========================================================================== * SPA vdev locking * ========================================================================== */ /* * Lock the given spa_t for the purpose of adding or removing a vdev. * Grabs the global spa_namespace_lock plus the spa config lock for writing. * It returns the next transaction group for the spa_t. */ uint64_t spa_vdev_enter(spa_t *spa) { mutex_enter(&spa->spa_vdev_top_lock); mutex_enter(&spa_namespace_lock); vdev_autotrim_stop_all(spa); return (spa_vdev_config_enter(spa)); } /* * The same as spa_vdev_enter() above but additionally takes the guid of * the vdev being detached. When there is a rebuild in process it will be * suspended while the vdev tree is modified then resumed by spa_vdev_exit(). * The rebuild is canceled if only a single child remains after the detach. */ uint64_t spa_vdev_detach_enter(spa_t *spa, uint64_t guid) { mutex_enter(&spa->spa_vdev_top_lock); mutex_enter(&spa_namespace_lock); vdev_autotrim_stop_all(spa); if (guid != 0) { vdev_t *vd = spa_lookup_by_guid(spa, guid, B_FALSE); if (vd) { vdev_rebuild_stop_wait(vd->vdev_top); } } return (spa_vdev_config_enter(spa)); } /* * Internal implementation for spa_vdev_enter(). Used when a vdev * operation requires multiple syncs (i.e. removing a device) while * keeping the spa_namespace_lock held. */ uint64_t spa_vdev_config_enter(spa_t *spa) { ASSERT(MUTEX_HELD(&spa_namespace_lock)); spa_config_enter(spa, SCL_ALL, spa, RW_WRITER); return (spa_last_synced_txg(spa) + 1); } /* * Used in combination with spa_vdev_config_enter() to allow the syncing * of multiple transactions without releasing the spa_namespace_lock. */ void spa_vdev_config_exit(spa_t *spa, vdev_t *vd, uint64_t txg, int error, char *tag) { ASSERT(MUTEX_HELD(&spa_namespace_lock)); int config_changed = B_FALSE; ASSERT(txg > spa_last_synced_txg(spa)); spa->spa_pending_vdev = NULL; /* * Reassess the DTLs. */ vdev_dtl_reassess(spa->spa_root_vdev, 0, 0, B_FALSE, B_FALSE); if (error == 0 && !list_is_empty(&spa->spa_config_dirty_list)) { config_changed = B_TRUE; spa->spa_config_generation++; } /* * Verify the metaslab classes. */ ASSERT(metaslab_class_validate(spa_normal_class(spa)) == 0); ASSERT(metaslab_class_validate(spa_log_class(spa)) == 0); ASSERT(metaslab_class_validate(spa_special_class(spa)) == 0); ASSERT(metaslab_class_validate(spa_dedup_class(spa)) == 0); spa_config_exit(spa, SCL_ALL, spa); /* * Panic the system if the specified tag requires it. This * is useful for ensuring that configurations are updated * transactionally. */ if (zio_injection_enabled) zio_handle_panic_injection(spa, tag, 0); /* * Note: this txg_wait_synced() is important because it ensures * that there won't be more than one config change per txg. * This allows us to use the txg as the generation number. */ if (error == 0) txg_wait_synced(spa->spa_dsl_pool, txg); if (vd != NULL) { ASSERT(!vd->vdev_detached || vd->vdev_dtl_sm == NULL); if (vd->vdev_ops->vdev_op_leaf) { mutex_enter(&vd->vdev_initialize_lock); vdev_initialize_stop(vd, VDEV_INITIALIZE_CANCELED, NULL); mutex_exit(&vd->vdev_initialize_lock); mutex_enter(&vd->vdev_trim_lock); vdev_trim_stop(vd, VDEV_TRIM_CANCELED, NULL); mutex_exit(&vd->vdev_trim_lock); } /* * The vdev may be both a leaf and top-level device. */ vdev_autotrim_stop_wait(vd); spa_config_enter(spa, SCL_ALL, spa, RW_WRITER); vdev_free(vd); spa_config_exit(spa, SCL_ALL, spa); } /* * If the config changed, update the config cache. */ if (config_changed) spa_write_cachefile(spa, B_FALSE, B_TRUE); } /* * Unlock the spa_t after adding or removing a vdev. Besides undoing the * locking of spa_vdev_enter(), we also want make sure the transactions have * synced to disk, and then update the global configuration cache with the new * information. */ int spa_vdev_exit(spa_t *spa, vdev_t *vd, uint64_t txg, int error) { vdev_autotrim_restart(spa); vdev_rebuild_restart(spa); spa_vdev_config_exit(spa, vd, txg, error, FTAG); mutex_exit(&spa_namespace_lock); mutex_exit(&spa->spa_vdev_top_lock); return (error); } /* * Lock the given spa_t for the purpose of changing vdev state. */ void spa_vdev_state_enter(spa_t *spa, int oplocks) { int locks = SCL_STATE_ALL | oplocks; /* * Root pools may need to read of the underlying devfs filesystem * when opening up a vdev. Unfortunately if we're holding the * SCL_ZIO lock it will result in a deadlock when we try to issue * the read from the root filesystem. Instead we "prefetch" * the associated vnodes that we need prior to opening the * underlying devices and cache them so that we can prevent * any I/O when we are doing the actual open. */ if (spa_is_root(spa)) { int low = locks & ~(SCL_ZIO - 1); int high = locks & ~low; spa_config_enter(spa, high, spa, RW_WRITER); vdev_hold(spa->spa_root_vdev); spa_config_enter(spa, low, spa, RW_WRITER); } else { spa_config_enter(spa, locks, spa, RW_WRITER); } spa->spa_vdev_locks = locks; } int spa_vdev_state_exit(spa_t *spa, vdev_t *vd, int error) { boolean_t config_changed = B_FALSE; vdev_t *vdev_top; if (vd == NULL || vd == spa->spa_root_vdev) { vdev_top = spa->spa_root_vdev; } else { vdev_top = vd->vdev_top; } if (vd != NULL || error == 0) vdev_dtl_reassess(vdev_top, 0, 0, B_FALSE, B_FALSE); if (vd != NULL) { if (vd != spa->spa_root_vdev) vdev_state_dirty(vdev_top); config_changed = B_TRUE; spa->spa_config_generation++; } if (spa_is_root(spa)) vdev_rele(spa->spa_root_vdev); ASSERT3U(spa->spa_vdev_locks, >=, SCL_STATE_ALL); spa_config_exit(spa, spa->spa_vdev_locks, spa); /* * If anything changed, wait for it to sync. This ensures that, * from the system administrator's perspective, zpool(1M) commands * are synchronous. This is important for things like zpool offline: * when the command completes, you expect no further I/O from ZFS. */ if (vd != NULL) txg_wait_synced(spa->spa_dsl_pool, 0); /* * If the config changed, update the config cache. */ if (config_changed) { mutex_enter(&spa_namespace_lock); spa_write_cachefile(spa, B_FALSE, B_TRUE); mutex_exit(&spa_namespace_lock); } return (error); } /* * ========================================================================== * Miscellaneous functions * ========================================================================== */ void spa_activate_mos_feature(spa_t *spa, const char *feature, dmu_tx_t *tx) { if (!nvlist_exists(spa->spa_label_features, feature)) { fnvlist_add_boolean(spa->spa_label_features, feature); /* * When we are creating the pool (tx_txg==TXG_INITIAL), we can't * dirty the vdev config because lock SCL_CONFIG is not held. * Thankfully, in this case we don't need to dirty the config * because it will be written out anyway when we finish * creating the pool. */ if (tx->tx_txg != TXG_INITIAL) vdev_config_dirty(spa->spa_root_vdev); } } void spa_deactivate_mos_feature(spa_t *spa, const char *feature) { if (nvlist_remove_all(spa->spa_label_features, feature) == 0) vdev_config_dirty(spa->spa_root_vdev); } /* * Return the spa_t associated with given pool_guid, if it exists. If * device_guid is non-zero, determine whether the pool exists *and* contains * a device with the specified device_guid. */ spa_t * spa_by_guid(uint64_t pool_guid, uint64_t device_guid) { spa_t *spa; avl_tree_t *t = &spa_namespace_avl; ASSERT(MUTEX_HELD(&spa_namespace_lock)); for (spa = avl_first(t); spa != NULL; spa = AVL_NEXT(t, spa)) { if (spa->spa_state == POOL_STATE_UNINITIALIZED) continue; if (spa->spa_root_vdev == NULL) continue; if (spa_guid(spa) == pool_guid) { if (device_guid == 0) break; if (vdev_lookup_by_guid(spa->spa_root_vdev, device_guid) != NULL) break; /* * Check any devices we may be in the process of adding. */ if (spa->spa_pending_vdev) { if (vdev_lookup_by_guid(spa->spa_pending_vdev, device_guid) != NULL) break; } } } return (spa); } /* * Determine whether a pool with the given pool_guid exists. */ boolean_t spa_guid_exists(uint64_t pool_guid, uint64_t device_guid) { return (spa_by_guid(pool_guid, device_guid) != NULL); } char * spa_strdup(const char *s) { size_t len; char *new; len = strlen(s); new = kmem_alloc(len + 1, KM_SLEEP); bcopy(s, new, len); new[len] = '\0'; return (new); } void spa_strfree(char *s) { kmem_free(s, strlen(s) + 1); } uint64_t spa_get_random(uint64_t range) { uint64_t r; ASSERT(range != 0); if (range == 1) return (0); (void) random_get_pseudo_bytes((void *)&r, sizeof (uint64_t)); return (r % range); } uint64_t spa_generate_guid(spa_t *spa) { uint64_t guid = spa_get_random(-1ULL); if (spa != NULL) { while (guid == 0 || spa_guid_exists(spa_guid(spa), guid)) guid = spa_get_random(-1ULL); } else { while (guid == 0 || spa_guid_exists(guid, 0)) guid = spa_get_random(-1ULL); } return (guid); } void snprintf_blkptr(char *buf, size_t buflen, const blkptr_t *bp) { char type[256]; char *checksum = NULL; char *compress = NULL; if (bp != NULL) { if (BP_GET_TYPE(bp) & DMU_OT_NEWTYPE) { dmu_object_byteswap_t bswap = DMU_OT_BYTESWAP(BP_GET_TYPE(bp)); (void) snprintf(type, sizeof (type), "bswap %s %s", DMU_OT_IS_METADATA(BP_GET_TYPE(bp)) ? "metadata" : "data", dmu_ot_byteswap[bswap].ob_name); } else { (void) strlcpy(type, dmu_ot[BP_GET_TYPE(bp)].ot_name, sizeof (type)); } if (!BP_IS_EMBEDDED(bp)) { checksum = zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_name; } compress = zio_compress_table[BP_GET_COMPRESS(bp)].ci_name; } SNPRINTF_BLKPTR(snprintf, ' ', buf, buflen, bp, type, checksum, compress); } void spa_freeze(spa_t *spa) { uint64_t freeze_txg = 0; spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER); if (spa->spa_freeze_txg == UINT64_MAX) { freeze_txg = spa_last_synced_txg(spa) + TXG_SIZE; spa->spa_freeze_txg = freeze_txg; } spa_config_exit(spa, SCL_ALL, FTAG); if (freeze_txg != 0) txg_wait_synced(spa_get_dsl(spa), freeze_txg); } void zfs_panic_recover(const char *fmt, ...) { va_list adx; va_start(adx, fmt); vcmn_err(zfs_recover ? CE_WARN : CE_PANIC, fmt, adx); va_end(adx); } /* * This is a stripped-down version of strtoull, suitable only for converting * lowercase hexadecimal numbers that don't overflow. */ uint64_t zfs_strtonum(const char *str, char **nptr) { uint64_t val = 0; char c; int digit; while ((c = *str) != '\0') { if (c >= '0' && c <= '9') digit = c - '0'; else if (c >= 'a' && c <= 'f') digit = 10 + c - 'a'; else break; val *= 16; val += digit; str++; } if (nptr) *nptr = (char *)str; return (val); } void spa_activate_allocation_classes(spa_t *spa, dmu_tx_t *tx) { /* * We bump the feature refcount for each special vdev added to the pool */ ASSERT(spa_feature_is_enabled(spa, SPA_FEATURE_ALLOCATION_CLASSES)); spa_feature_incr(spa, SPA_FEATURE_ALLOCATION_CLASSES, tx); } /* * ========================================================================== * Accessor functions * ========================================================================== */ boolean_t spa_shutting_down(spa_t *spa) { return (spa->spa_async_suspended); } dsl_pool_t * spa_get_dsl(spa_t *spa) { return (spa->spa_dsl_pool); } boolean_t spa_is_initializing(spa_t *spa) { return (spa->spa_is_initializing); } boolean_t spa_indirect_vdevs_loaded(spa_t *spa) { return (spa->spa_indirect_vdevs_loaded); } blkptr_t * spa_get_rootblkptr(spa_t *spa) { return (&spa->spa_ubsync.ub_rootbp); } void spa_set_rootblkptr(spa_t *spa, const blkptr_t *bp) { spa->spa_uberblock.ub_rootbp = *bp; } void spa_altroot(spa_t *spa, char *buf, size_t buflen) { if (spa->spa_root == NULL) buf[0] = '\0'; else (void) strncpy(buf, spa->spa_root, buflen); } int spa_sync_pass(spa_t *spa) { return (spa->spa_sync_pass); } char * spa_name(spa_t *spa) { return (spa->spa_name); } uint64_t spa_guid(spa_t *spa) { dsl_pool_t *dp = spa_get_dsl(spa); uint64_t guid; /* * If we fail to parse the config during spa_load(), we can go through * the error path (which posts an ereport) and end up here with no root * vdev. We stash the original pool guid in 'spa_config_guid' to handle * this case. */ if (spa->spa_root_vdev == NULL) return (spa->spa_config_guid); guid = spa->spa_last_synced_guid != 0 ? spa->spa_last_synced_guid : spa->spa_root_vdev->vdev_guid; /* * Return the most recently synced out guid unless we're * in syncing context. */ if (dp && dsl_pool_sync_context(dp)) return (spa->spa_root_vdev->vdev_guid); else return (guid); } uint64_t spa_load_guid(spa_t *spa) { /* * This is a GUID that exists solely as a reference for the * purposes of the arc. It is generated at load time, and * is never written to persistent storage. */ return (spa->spa_load_guid); } uint64_t spa_last_synced_txg(spa_t *spa) { return (spa->spa_ubsync.ub_txg); } uint64_t spa_first_txg(spa_t *spa) { return (spa->spa_first_txg); } uint64_t spa_syncing_txg(spa_t *spa) { return (spa->spa_syncing_txg); } /* * Return the last txg where data can be dirtied. The final txgs * will be used to just clear out any deferred frees that remain. */ uint64_t spa_final_dirty_txg(spa_t *spa) { return (spa->spa_final_txg - TXG_DEFER_SIZE); } pool_state_t spa_state(spa_t *spa) { return (spa->spa_state); } spa_load_state_t spa_load_state(spa_t *spa) { return (spa->spa_load_state); } uint64_t spa_freeze_txg(spa_t *spa) { return (spa->spa_freeze_txg); } /* * Return the inflated asize for a logical write in bytes. This is used by the * DMU to calculate the space a logical write will require on disk. * If lsize is smaller than the largest physical block size allocatable on this * pool we use its value instead, since the write will end up using the whole * block anyway. */ uint64_t spa_get_worst_case_asize(spa_t *spa, uint64_t lsize) { if (lsize == 0) return (0); /* No inflation needed */ return (MAX(lsize, 1 << spa->spa_max_ashift) * spa_asize_inflation); } /* * Return the amount of slop space in bytes. It is 1/32 of the pool (3.2%), * or at least 128MB, unless that would cause it to be more than half the * pool size. * * See the comment above spa_slop_shift for details. */ uint64_t spa_get_slop_space(spa_t *spa) { uint64_t space = spa_get_dspace(spa); return (MAX(space >> spa_slop_shift, MIN(space >> 1, spa_min_slop))); } uint64_t spa_get_dspace(spa_t *spa) { return (spa->spa_dspace); } uint64_t spa_get_checkpoint_space(spa_t *spa) { return (spa->spa_checkpoint_info.sci_dspace); } void spa_update_dspace(spa_t *spa) { spa->spa_dspace = metaslab_class_get_dspace(spa_normal_class(spa)) + ddt_get_dedup_dspace(spa); if (spa->spa_vdev_removal != NULL) { /* * We can't allocate from the removing device, so * subtract its size. This prevents the DMU/DSL from * filling up the (now smaller) pool while we are in the * middle of removing the device. * * Note that the DMU/DSL doesn't actually know or care * how much space is allocated (it does its own tracking * of how much space has been logically used). So it * doesn't matter that the data we are moving may be * allocated twice (on the old device and the new * device). */ spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER); vdev_t *vd = vdev_lookup_top(spa, spa->spa_vdev_removal->svr_vdev_id); spa->spa_dspace -= spa_deflate(spa) ? vd->vdev_stat.vs_dspace : vd->vdev_stat.vs_space; spa_config_exit(spa, SCL_VDEV, FTAG); } } /* * Return the failure mode that has been set to this pool. The default * behavior will be to block all I/Os when a complete failure occurs. */ uint64_t spa_get_failmode(spa_t *spa) { return (spa->spa_failmode); } boolean_t spa_suspended(spa_t *spa) { return (spa->spa_suspended != ZIO_SUSPEND_NONE); } uint64_t spa_version(spa_t *spa) { return (spa->spa_ubsync.ub_version); } boolean_t spa_deflate(spa_t *spa) { return (spa->spa_deflate); } metaslab_class_t * spa_normal_class(spa_t *spa) { return (spa->spa_normal_class); } metaslab_class_t * spa_log_class(spa_t *spa) { return (spa->spa_log_class); } metaslab_class_t * spa_special_class(spa_t *spa) { return (spa->spa_special_class); } metaslab_class_t * spa_dedup_class(spa_t *spa) { return (spa->spa_dedup_class); } /* * Locate an appropriate allocation class */ metaslab_class_t * spa_preferred_class(spa_t *spa, uint64_t size, dmu_object_type_t objtype, uint_t level, uint_t special_smallblk) { if (DMU_OT_IS_ZIL(objtype)) { if (spa->spa_log_class->mc_groups != 0) return (spa_log_class(spa)); else return (spa_normal_class(spa)); } boolean_t has_special_class = spa->spa_special_class->mc_groups != 0; if (DMU_OT_IS_DDT(objtype)) { if (spa->spa_dedup_class->mc_groups != 0) return (spa_dedup_class(spa)); else if (has_special_class && zfs_ddt_data_is_special) return (spa_special_class(spa)); else return (spa_normal_class(spa)); } /* Indirect blocks for user data can land in special if allowed */ if (level > 0 && (DMU_OT_IS_FILE(objtype) || objtype == DMU_OT_ZVOL)) { if (has_special_class && zfs_user_indirect_is_special) return (spa_special_class(spa)); else return (spa_normal_class(spa)); } if (DMU_OT_IS_METADATA(objtype) || level > 0) { if (has_special_class) return (spa_special_class(spa)); else return (spa_normal_class(spa)); } /* * Allow small file blocks in special class in some cases (like * for the dRAID vdev feature). But always leave a reserve of * zfs_special_class_metadata_reserve_pct exclusively for metadata. */ if (DMU_OT_IS_FILE(objtype) && has_special_class && size <= special_smallblk) { metaslab_class_t *special = spa_special_class(spa); uint64_t alloc = metaslab_class_get_alloc(special); uint64_t space = metaslab_class_get_space(special); uint64_t limit = (space * (100 - zfs_special_class_metadata_reserve_pct)) / 100; if (alloc < limit) return (special); } return (spa_normal_class(spa)); } void spa_evicting_os_register(spa_t *spa, objset_t *os) { mutex_enter(&spa->spa_evicting_os_lock); list_insert_head(&spa->spa_evicting_os_list, os); mutex_exit(&spa->spa_evicting_os_lock); } void spa_evicting_os_deregister(spa_t *spa, objset_t *os) { mutex_enter(&spa->spa_evicting_os_lock); list_remove(&spa->spa_evicting_os_list, os); cv_broadcast(&spa->spa_evicting_os_cv); mutex_exit(&spa->spa_evicting_os_lock); } void spa_evicting_os_wait(spa_t *spa) { mutex_enter(&spa->spa_evicting_os_lock); while (!list_is_empty(&spa->spa_evicting_os_list)) cv_wait(&spa->spa_evicting_os_cv, &spa->spa_evicting_os_lock); mutex_exit(&spa->spa_evicting_os_lock); dmu_buf_user_evict_wait(); } int spa_max_replication(spa_t *spa) { /* * As of SPA_VERSION == SPA_VERSION_DITTO_BLOCKS, we are able to * handle BPs with more than one DVA allocated. Set our max * replication level accordingly. */ if (spa_version(spa) < SPA_VERSION_DITTO_BLOCKS) return (1); return (MIN(SPA_DVAS_PER_BP, spa_max_replication_override)); } int spa_prev_software_version(spa_t *spa) { return (spa->spa_prev_software_version); } uint64_t spa_deadman_synctime(spa_t *spa) { return (spa->spa_deadman_synctime); } spa_autotrim_t spa_get_autotrim(spa_t *spa) { return (spa->spa_autotrim); } uint64_t spa_deadman_ziotime(spa_t *spa) { return (spa->spa_deadman_ziotime); } uint64_t spa_get_deadman_failmode(spa_t *spa) { return (spa->spa_deadman_failmode); } void spa_set_deadman_failmode(spa_t *spa, const char *failmode) { if (strcmp(failmode, "wait") == 0) spa->spa_deadman_failmode = ZIO_FAILURE_MODE_WAIT; else if (strcmp(failmode, "continue") == 0) spa->spa_deadman_failmode = ZIO_FAILURE_MODE_CONTINUE; else if (strcmp(failmode, "panic") == 0) spa->spa_deadman_failmode = ZIO_FAILURE_MODE_PANIC; else spa->spa_deadman_failmode = ZIO_FAILURE_MODE_WAIT; } void spa_set_deadman_ziotime(hrtime_t ns) { spa_t *spa = NULL; if (spa_mode_global != SPA_MODE_UNINIT) { mutex_enter(&spa_namespace_lock); while ((spa = spa_next(spa)) != NULL) spa->spa_deadman_ziotime = ns; mutex_exit(&spa_namespace_lock); } } void spa_set_deadman_synctime(hrtime_t ns) { spa_t *spa = NULL; if (spa_mode_global != SPA_MODE_UNINIT) { mutex_enter(&spa_namespace_lock); while ((spa = spa_next(spa)) != NULL) spa->spa_deadman_synctime = ns; mutex_exit(&spa_namespace_lock); } } uint64_t dva_get_dsize_sync(spa_t *spa, const dva_t *dva) { uint64_t asize = DVA_GET_ASIZE(dva); uint64_t dsize = asize; ASSERT(spa_config_held(spa, SCL_ALL, RW_READER) != 0); if (asize != 0 && spa->spa_deflate) { vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva)); if (vd != NULL) dsize = (asize >> SPA_MINBLOCKSHIFT) * vd->vdev_deflate_ratio; } return (dsize); } uint64_t bp_get_dsize_sync(spa_t *spa, const blkptr_t *bp) { uint64_t dsize = 0; for (int d = 0; d < BP_GET_NDVAS(bp); d++) dsize += dva_get_dsize_sync(spa, &bp->blk_dva[d]); return (dsize); } uint64_t bp_get_dsize(spa_t *spa, const blkptr_t *bp) { uint64_t dsize = 0; spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER); for (int d = 0; d < BP_GET_NDVAS(bp); d++) dsize += dva_get_dsize_sync(spa, &bp->blk_dva[d]); spa_config_exit(spa, SCL_VDEV, FTAG); return (dsize); } uint64_t spa_dirty_data(spa_t *spa) { return (spa->spa_dsl_pool->dp_dirty_total); } /* * ========================================================================== * SPA Import Progress Routines * ========================================================================== */ typedef struct spa_import_progress { uint64_t pool_guid; /* unique id for updates */ char *pool_name; spa_load_state_t spa_load_state; uint64_t mmp_sec_remaining; /* MMP activity check */ uint64_t spa_load_max_txg; /* rewind txg */ procfs_list_node_t smh_node; } spa_import_progress_t; spa_history_list_t *spa_import_progress_list = NULL; static int spa_import_progress_show_header(struct seq_file *f) { seq_printf(f, "%-20s %-14s %-14s %-12s %s\n", "pool_guid", "load_state", "multihost_secs", "max_txg", "pool_name"); return (0); } static int spa_import_progress_show(struct seq_file *f, void *data) { spa_import_progress_t *sip = (spa_import_progress_t *)data; seq_printf(f, "%-20llu %-14llu %-14llu %-12llu %s\n", (u_longlong_t)sip->pool_guid, (u_longlong_t)sip->spa_load_state, (u_longlong_t)sip->mmp_sec_remaining, (u_longlong_t)sip->spa_load_max_txg, (sip->pool_name ? sip->pool_name : "-")); return (0); } /* Remove oldest elements from list until there are no more than 'size' left */ static void spa_import_progress_truncate(spa_history_list_t *shl, unsigned int size) { spa_import_progress_t *sip; while (shl->size > size) { sip = list_remove_head(&shl->procfs_list.pl_list); if (sip->pool_name) spa_strfree(sip->pool_name); kmem_free(sip, sizeof (spa_import_progress_t)); shl->size--; } IMPLY(size == 0, list_is_empty(&shl->procfs_list.pl_list)); } static void spa_import_progress_init(void) { spa_import_progress_list = kmem_zalloc(sizeof (spa_history_list_t), KM_SLEEP); spa_import_progress_list->size = 0; spa_import_progress_list->procfs_list.pl_private = spa_import_progress_list; procfs_list_install("zfs", + NULL, "import_progress", 0644, &spa_import_progress_list->procfs_list, spa_import_progress_show, spa_import_progress_show_header, NULL, offsetof(spa_import_progress_t, smh_node)); } static void spa_import_progress_destroy(void) { spa_history_list_t *shl = spa_import_progress_list; procfs_list_uninstall(&shl->procfs_list); spa_import_progress_truncate(shl, 0); procfs_list_destroy(&shl->procfs_list); kmem_free(shl, sizeof (spa_history_list_t)); } int spa_import_progress_set_state(uint64_t pool_guid, spa_load_state_t load_state) { spa_history_list_t *shl = spa_import_progress_list; spa_import_progress_t *sip; int error = ENOENT; if (shl->size == 0) return (0); mutex_enter(&shl->procfs_list.pl_lock); for (sip = list_tail(&shl->procfs_list.pl_list); sip != NULL; sip = list_prev(&shl->procfs_list.pl_list, sip)) { if (sip->pool_guid == pool_guid) { sip->spa_load_state = load_state; error = 0; break; } } mutex_exit(&shl->procfs_list.pl_lock); return (error); } int spa_import_progress_set_max_txg(uint64_t pool_guid, uint64_t load_max_txg) { spa_history_list_t *shl = spa_import_progress_list; spa_import_progress_t *sip; int error = ENOENT; if (shl->size == 0) return (0); mutex_enter(&shl->procfs_list.pl_lock); for (sip = list_tail(&shl->procfs_list.pl_list); sip != NULL; sip = list_prev(&shl->procfs_list.pl_list, sip)) { if (sip->pool_guid == pool_guid) { sip->spa_load_max_txg = load_max_txg; error = 0; break; } } mutex_exit(&shl->procfs_list.pl_lock); return (error); } int spa_import_progress_set_mmp_check(uint64_t pool_guid, uint64_t mmp_sec_remaining) { spa_history_list_t *shl = spa_import_progress_list; spa_import_progress_t *sip; int error = ENOENT; if (shl->size == 0) return (0); mutex_enter(&shl->procfs_list.pl_lock); for (sip = list_tail(&shl->procfs_list.pl_list); sip != NULL; sip = list_prev(&shl->procfs_list.pl_list, sip)) { if (sip->pool_guid == pool_guid) { sip->mmp_sec_remaining = mmp_sec_remaining; error = 0; break; } } mutex_exit(&shl->procfs_list.pl_lock); return (error); } /* * A new import is in progress, add an entry. */ void spa_import_progress_add(spa_t *spa) { spa_history_list_t *shl = spa_import_progress_list; spa_import_progress_t *sip; char *poolname = NULL; sip = kmem_zalloc(sizeof (spa_import_progress_t), KM_SLEEP); sip->pool_guid = spa_guid(spa); (void) nvlist_lookup_string(spa->spa_config, ZPOOL_CONFIG_POOL_NAME, &poolname); if (poolname == NULL) poolname = spa_name(spa); sip->pool_name = spa_strdup(poolname); sip->spa_load_state = spa_load_state(spa); mutex_enter(&shl->procfs_list.pl_lock); procfs_list_add(&shl->procfs_list, sip); shl->size++; mutex_exit(&shl->procfs_list.pl_lock); } void spa_import_progress_remove(uint64_t pool_guid) { spa_history_list_t *shl = spa_import_progress_list; spa_import_progress_t *sip; mutex_enter(&shl->procfs_list.pl_lock); for (sip = list_tail(&shl->procfs_list.pl_list); sip != NULL; sip = list_prev(&shl->procfs_list.pl_list, sip)) { if (sip->pool_guid == pool_guid) { if (sip->pool_name) spa_strfree(sip->pool_name); list_remove(&shl->procfs_list.pl_list, sip); shl->size--; kmem_free(sip, sizeof (spa_import_progress_t)); break; } } mutex_exit(&shl->procfs_list.pl_lock); } /* * ========================================================================== * Initialization and Termination * ========================================================================== */ static int spa_name_compare(const void *a1, const void *a2) { const spa_t *s1 = a1; const spa_t *s2 = a2; int s; s = strcmp(s1->spa_name, s2->spa_name); return (TREE_ISIGN(s)); } void spa_boot_init(void) { spa_config_load(); } void spa_init(spa_mode_t mode) { mutex_init(&spa_namespace_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa_spare_lock, NULL, MUTEX_DEFAULT, NULL); mutex_init(&spa_l2cache_lock, NULL, MUTEX_DEFAULT, NULL); cv_init(&spa_namespace_cv, NULL, CV_DEFAULT, NULL); avl_create(&spa_namespace_avl, spa_name_compare, sizeof (spa_t), offsetof(spa_t, spa_avl)); avl_create(&spa_spare_avl, spa_spare_compare, sizeof (spa_aux_t), offsetof(spa_aux_t, aux_avl)); avl_create(&spa_l2cache_avl, spa_l2cache_compare, sizeof (spa_aux_t), offsetof(spa_aux_t, aux_avl)); spa_mode_global = mode; #ifndef _KERNEL if (spa_mode_global != SPA_MODE_READ && dprintf_find_string("watch")) { struct sigaction sa; sa.sa_flags = SA_SIGINFO; sigemptyset(&sa.sa_mask); sa.sa_sigaction = arc_buf_sigsegv; if (sigaction(SIGSEGV, &sa, NULL) == -1) { perror("could not enable watchpoints: " "sigaction(SIGSEGV, ...) = "); } else { arc_watch = B_TRUE; } } #endif fm_init(); zfs_refcount_init(); unique_init(); zfs_btree_init(); metaslab_stat_init(); ddt_init(); zio_init(); dmu_init(); zil_init(); vdev_cache_stat_init(); vdev_mirror_stat_init(); vdev_raidz_math_init(); vdev_file_init(); zfs_prop_init(); zpool_prop_init(); zpool_feature_init(); spa_config_load(); l2arc_start(); scan_init(); qat_init(); spa_import_progress_init(); } void spa_fini(void) { l2arc_stop(); spa_evict_all(); vdev_file_fini(); vdev_cache_stat_fini(); vdev_mirror_stat_fini(); vdev_raidz_math_fini(); zil_fini(); dmu_fini(); zio_fini(); ddt_fini(); metaslab_stat_fini(); zfs_btree_fini(); unique_fini(); zfs_refcount_fini(); fm_fini(); scan_fini(); qat_fini(); spa_import_progress_destroy(); avl_destroy(&spa_namespace_avl); avl_destroy(&spa_spare_avl); avl_destroy(&spa_l2cache_avl); cv_destroy(&spa_namespace_cv); mutex_destroy(&spa_namespace_lock); mutex_destroy(&spa_spare_lock); mutex_destroy(&spa_l2cache_lock); } /* * Return whether this pool has slogs. No locking needed. * It's not a problem if the wrong answer is returned as it's only for * performance and not correctness */ boolean_t spa_has_slogs(spa_t *spa) { return (spa->spa_log_class->mc_rotor != NULL); } spa_log_state_t spa_get_log_state(spa_t *spa) { return (spa->spa_log_state); } void spa_set_log_state(spa_t *spa, spa_log_state_t state) { spa->spa_log_state = state; } boolean_t spa_is_root(spa_t *spa) { return (spa->spa_is_root); } boolean_t spa_writeable(spa_t *spa) { return (!!(spa->spa_mode & SPA_MODE_WRITE) && spa->spa_trust_config); } /* * Returns true if there is a pending sync task in any of the current * syncing txg, the current quiescing txg, or the current open txg. */ boolean_t spa_has_pending_synctask(spa_t *spa) { return (!txg_all_lists_empty(&spa->spa_dsl_pool->dp_sync_tasks) || !txg_all_lists_empty(&spa->spa_dsl_pool->dp_early_sync_tasks)); } spa_mode_t spa_mode(spa_t *spa) { return (spa->spa_mode); } uint64_t spa_bootfs(spa_t *spa) { return (spa->spa_bootfs); } uint64_t spa_delegation(spa_t *spa) { return (spa->spa_delegation); } objset_t * spa_meta_objset(spa_t *spa) { return (spa->spa_meta_objset); } enum zio_checksum spa_dedup_checksum(spa_t *spa) { return (spa->spa_dedup_checksum); } /* * Reset pool scan stat per scan pass (or reboot). */ void spa_scan_stat_init(spa_t *spa) { /* data not stored on disk */ spa->spa_scan_pass_start = gethrestime_sec(); if (dsl_scan_is_paused_scrub(spa->spa_dsl_pool->dp_scan)) spa->spa_scan_pass_scrub_pause = spa->spa_scan_pass_start; else spa->spa_scan_pass_scrub_pause = 0; spa->spa_scan_pass_scrub_spent_paused = 0; spa->spa_scan_pass_exam = 0; spa->spa_scan_pass_issued = 0; vdev_scan_stat_init(spa->spa_root_vdev); } /* * Get scan stats for zpool status reports */ int spa_scan_get_stats(spa_t *spa, pool_scan_stat_t *ps) { dsl_scan_t *scn = spa->spa_dsl_pool ? spa->spa_dsl_pool->dp_scan : NULL; if (scn == NULL || scn->scn_phys.scn_func == POOL_SCAN_NONE) return (SET_ERROR(ENOENT)); bzero(ps, sizeof (pool_scan_stat_t)); /* data stored on disk */ ps->pss_func = scn->scn_phys.scn_func; ps->pss_state = scn->scn_phys.scn_state; ps->pss_start_time = scn->scn_phys.scn_start_time; ps->pss_end_time = scn->scn_phys.scn_end_time; ps->pss_to_examine = scn->scn_phys.scn_to_examine; ps->pss_examined = scn->scn_phys.scn_examined; ps->pss_to_process = scn->scn_phys.scn_to_process; ps->pss_processed = scn->scn_phys.scn_processed; ps->pss_errors = scn->scn_phys.scn_errors; /* data not stored on disk */ ps->pss_pass_exam = spa->spa_scan_pass_exam; ps->pss_pass_start = spa->spa_scan_pass_start; ps->pss_pass_scrub_pause = spa->spa_scan_pass_scrub_pause; ps->pss_pass_scrub_spent_paused = spa->spa_scan_pass_scrub_spent_paused; ps->pss_pass_issued = spa->spa_scan_pass_issued; ps->pss_issued = scn->scn_issued_before_pass + spa->spa_scan_pass_issued; return (0); } int spa_maxblocksize(spa_t *spa) { if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_BLOCKS)) return (SPA_MAXBLOCKSIZE); else return (SPA_OLD_MAXBLOCKSIZE); } /* * Returns the txg that the last device removal completed. No indirect mappings * have been added since this txg. */ uint64_t spa_get_last_removal_txg(spa_t *spa) { uint64_t vdevid; uint64_t ret = -1ULL; spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER); /* * sr_prev_indirect_vdev is only modified while holding all the * config locks, so it is sufficient to hold SCL_VDEV as reader when * examining it. */ vdevid = spa->spa_removing_phys.sr_prev_indirect_vdev; while (vdevid != -1ULL) { vdev_t *vd = vdev_lookup_top(spa, vdevid); vdev_indirect_births_t *vib = vd->vdev_indirect_births; ASSERT3P(vd->vdev_ops, ==, &vdev_indirect_ops); /* * If the removal did not remap any data, we don't care. */ if (vdev_indirect_births_count(vib) != 0) { ret = vdev_indirect_births_last_entry_txg(vib); break; } vdevid = vd->vdev_indirect_config.vic_prev_indirect_vdev; } spa_config_exit(spa, SCL_VDEV, FTAG); IMPLY(ret != -1ULL, spa_feature_is_active(spa, SPA_FEATURE_DEVICE_REMOVAL)); return (ret); } int spa_maxdnodesize(spa_t *spa) { if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_DNODE)) return (DNODE_MAX_SIZE); else return (DNODE_MIN_SIZE); } boolean_t spa_multihost(spa_t *spa) { return (spa->spa_multihost ? B_TRUE : B_FALSE); } uint32_t spa_get_hostid(spa_t *spa) { return (spa->spa_hostid); } boolean_t spa_trust_config(spa_t *spa) { return (spa->spa_trust_config); } uint64_t spa_missing_tvds_allowed(spa_t *spa) { return (spa->spa_missing_tvds_allowed); } space_map_t * spa_syncing_log_sm(spa_t *spa) { return (spa->spa_syncing_log_sm); } void spa_set_missing_tvds(spa_t *spa, uint64_t missing) { spa->spa_missing_tvds = missing; } /* * Return the pool state string ("ONLINE", "DEGRADED", "SUSPENDED", etc). */ const char * spa_state_to_name(spa_t *spa) { ASSERT3P(spa, !=, NULL); /* * it is possible for the spa to exist, without root vdev * as the spa transitions during import/export */ vdev_t *rvd = spa->spa_root_vdev; if (rvd == NULL) { return ("TRANSITIONING"); } vdev_state_t state = rvd->vdev_state; vdev_aux_t aux = rvd->vdev_stat.vs_aux; if (spa_suspended(spa) && (spa_get_failmode(spa) != ZIO_FAILURE_MODE_CONTINUE)) return ("SUSPENDED"); switch (state) { case VDEV_STATE_CLOSED: case VDEV_STATE_OFFLINE: return ("OFFLINE"); case VDEV_STATE_REMOVED: return ("REMOVED"); case VDEV_STATE_CANT_OPEN: if (aux == VDEV_AUX_CORRUPT_DATA || aux == VDEV_AUX_BAD_LOG) return ("FAULTED"); else if (aux == VDEV_AUX_SPLIT_POOL) return ("SPLIT"); else return ("UNAVAIL"); case VDEV_STATE_FAULTED: return ("FAULTED"); case VDEV_STATE_DEGRADED: return ("DEGRADED"); case VDEV_STATE_HEALTHY: return ("ONLINE"); default: break; } return ("UNKNOWN"); } boolean_t spa_top_vdevs_spacemap_addressable(spa_t *spa) { vdev_t *rvd = spa->spa_root_vdev; for (uint64_t c = 0; c < rvd->vdev_children; c++) { if (!vdev_is_spacemap_addressable(rvd->vdev_child[c])) return (B_FALSE); } return (B_TRUE); } boolean_t spa_has_checkpoint(spa_t *spa) { return (spa->spa_checkpoint_txg != 0); } boolean_t spa_importing_readonly_checkpoint(spa_t *spa) { return ((spa->spa_import_flags & ZFS_IMPORT_CHECKPOINT) && spa->spa_mode == SPA_MODE_READ); } uint64_t spa_min_claim_txg(spa_t *spa) { uint64_t checkpoint_txg = spa->spa_uberblock.ub_checkpoint_txg; if (checkpoint_txg != 0) return (checkpoint_txg + 1); return (spa->spa_first_txg); } /* * If there is a checkpoint, async destroys may consume more space from * the pool instead of freeing it. In an attempt to save the pool from * getting suspended when it is about to run out of space, we stop * processing async destroys. */ boolean_t spa_suspend_async_destroy(spa_t *spa) { dsl_pool_t *dp = spa_get_dsl(spa); uint64_t unreserved = dsl_pool_unreserved_space(dp, ZFS_SPACE_CHECK_EXTRA_RESERVED); uint64_t used = dsl_dir_phys(dp->dp_root_dir)->dd_used_bytes; uint64_t avail = (unreserved > used) ? (unreserved - used) : 0; if (spa_has_checkpoint(spa) && avail == 0) return (B_TRUE); return (B_FALSE); } #if defined(_KERNEL) int param_set_deadman_failmode_common(const char *val) { spa_t *spa = NULL; char *p; if (val == NULL) return (SET_ERROR(EINVAL)); if ((p = strchr(val, '\n')) != NULL) *p = '\0'; if (strcmp(val, "wait") != 0 && strcmp(val, "continue") != 0 && strcmp(val, "panic")) return (SET_ERROR(EINVAL)); if (spa_mode_global != SPA_MODE_UNINIT) { mutex_enter(&spa_namespace_lock); while ((spa = spa_next(spa)) != NULL) spa_set_deadman_failmode(spa, val); mutex_exit(&spa_namespace_lock); } return (0); } #endif /* Namespace manipulation */ EXPORT_SYMBOL(spa_lookup); EXPORT_SYMBOL(spa_add); EXPORT_SYMBOL(spa_remove); EXPORT_SYMBOL(spa_next); /* Refcount functions */ EXPORT_SYMBOL(spa_open_ref); EXPORT_SYMBOL(spa_close); EXPORT_SYMBOL(spa_refcount_zero); /* Pool configuration lock */ EXPORT_SYMBOL(spa_config_tryenter); EXPORT_SYMBOL(spa_config_enter); EXPORT_SYMBOL(spa_config_exit); EXPORT_SYMBOL(spa_config_held); /* Pool vdev add/remove lock */ EXPORT_SYMBOL(spa_vdev_enter); EXPORT_SYMBOL(spa_vdev_exit); /* Pool vdev state change lock */ EXPORT_SYMBOL(spa_vdev_state_enter); EXPORT_SYMBOL(spa_vdev_state_exit); /* Accessor functions */ EXPORT_SYMBOL(spa_shutting_down); EXPORT_SYMBOL(spa_get_dsl); EXPORT_SYMBOL(spa_get_rootblkptr); EXPORT_SYMBOL(spa_set_rootblkptr); EXPORT_SYMBOL(spa_altroot); EXPORT_SYMBOL(spa_sync_pass); EXPORT_SYMBOL(spa_name); EXPORT_SYMBOL(spa_guid); EXPORT_SYMBOL(spa_last_synced_txg); EXPORT_SYMBOL(spa_first_txg); EXPORT_SYMBOL(spa_syncing_txg); EXPORT_SYMBOL(spa_version); EXPORT_SYMBOL(spa_state); EXPORT_SYMBOL(spa_load_state); EXPORT_SYMBOL(spa_freeze_txg); EXPORT_SYMBOL(spa_get_dspace); EXPORT_SYMBOL(spa_update_dspace); EXPORT_SYMBOL(spa_deflate); EXPORT_SYMBOL(spa_normal_class); EXPORT_SYMBOL(spa_log_class); EXPORT_SYMBOL(spa_special_class); EXPORT_SYMBOL(spa_preferred_class); EXPORT_SYMBOL(spa_max_replication); EXPORT_SYMBOL(spa_prev_software_version); EXPORT_SYMBOL(spa_get_failmode); EXPORT_SYMBOL(spa_suspended); EXPORT_SYMBOL(spa_bootfs); EXPORT_SYMBOL(spa_delegation); EXPORT_SYMBOL(spa_meta_objset); EXPORT_SYMBOL(spa_maxblocksize); EXPORT_SYMBOL(spa_maxdnodesize); /* Miscellaneous support routines */ EXPORT_SYMBOL(spa_guid_exists); EXPORT_SYMBOL(spa_strdup); EXPORT_SYMBOL(spa_strfree); EXPORT_SYMBOL(spa_get_random); EXPORT_SYMBOL(spa_generate_guid); EXPORT_SYMBOL(snprintf_blkptr); EXPORT_SYMBOL(spa_freeze); EXPORT_SYMBOL(spa_upgrade); EXPORT_SYMBOL(spa_evict_all); EXPORT_SYMBOL(spa_lookup_by_guid); EXPORT_SYMBOL(spa_has_spare); EXPORT_SYMBOL(dva_get_dsize_sync); EXPORT_SYMBOL(bp_get_dsize_sync); EXPORT_SYMBOL(bp_get_dsize); EXPORT_SYMBOL(spa_has_slogs); EXPORT_SYMBOL(spa_is_root); EXPORT_SYMBOL(spa_writeable); EXPORT_SYMBOL(spa_mode); EXPORT_SYMBOL(spa_namespace_lock); EXPORT_SYMBOL(spa_trust_config); EXPORT_SYMBOL(spa_missing_tvds_allowed); EXPORT_SYMBOL(spa_set_missing_tvds); EXPORT_SYMBOL(spa_state_to_name); EXPORT_SYMBOL(spa_importing_readonly_checkpoint); EXPORT_SYMBOL(spa_min_claim_txg); EXPORT_SYMBOL(spa_suspend_async_destroy); EXPORT_SYMBOL(spa_has_checkpoint); EXPORT_SYMBOL(spa_top_vdevs_spacemap_addressable); ZFS_MODULE_PARAM(zfs, zfs_, flags, UINT, ZMOD_RW, "Set additional debugging flags"); ZFS_MODULE_PARAM(zfs, zfs_, recover, INT, ZMOD_RW, "Set to attempt to recover from fatal errors"); ZFS_MODULE_PARAM(zfs, zfs_, free_leak_on_eio, INT, ZMOD_RW, "Set to ignore IO errors during free and permanently leak the space"); ZFS_MODULE_PARAM(zfs, zfs_, deadman_checktime_ms, ULONG, ZMOD_RW, "Dead I/O check interval in milliseconds"); ZFS_MODULE_PARAM(zfs, zfs_, deadman_enabled, INT, ZMOD_RW, "Enable deadman timer"); ZFS_MODULE_PARAM(zfs_spa, spa_, asize_inflation, INT, ZMOD_RW, "SPA size estimate multiplication factor"); ZFS_MODULE_PARAM(zfs, zfs_, ddt_data_is_special, INT, ZMOD_RW, "Place DDT data into the special class"); ZFS_MODULE_PARAM(zfs, zfs_, user_indirect_is_special, INT, ZMOD_RW, "Place user data indirect blocks into the special class"); /* BEGIN CSTYLED */ ZFS_MODULE_PARAM_CALL(zfs_deadman, zfs_deadman_, failmode, param_set_deadman_failmode, param_get_charp, ZMOD_RW, "Failmode for deadman timer"); ZFS_MODULE_PARAM_CALL(zfs_deadman, zfs_deadman_, synctime_ms, param_set_deadman_synctime, param_get_ulong, ZMOD_RW, "Pool sync expiration time in milliseconds"); ZFS_MODULE_PARAM_CALL(zfs_deadman, zfs_deadman_, ziotime_ms, param_set_deadman_ziotime, param_get_ulong, ZMOD_RW, "IO expiration time in milliseconds"); ZFS_MODULE_PARAM(zfs, zfs_, special_class_metadata_reserve_pct, INT, ZMOD_RW, "Small file blocks in special vdevs depends on this much " "free space available"); /* END CSTYLED */ ZFS_MODULE_PARAM_CALL(zfs_spa, spa_, slop_shift, param_set_slop_shift, param_get_int, ZMOD_RW, "Reserved free space in pool"); Index: vendor-sys/openzfs/dist/module/zfs/spa_stats.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/spa_stats.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/spa_stats.c (revision 366348) @@ -1,1043 +1,1029 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ #include #include #include #include #include /* * Keeps stats on last N reads per spa_t, disabled by default. */ int zfs_read_history = 0; /* * Include cache hits in history, disabled by default. */ int zfs_read_history_hits = 0; /* * Keeps stats on the last 100 txgs by default. */ int zfs_txg_history = 100; /* * Keeps stats on the last N MMP updates, disabled by default. */ int zfs_multihost_history = 0; /* * ========================================================================== * SPA Read History Routines * ========================================================================== */ /* * Read statistics - Information exported regarding each arc_read call */ typedef struct spa_read_history { hrtime_t start; /* time read completed */ uint64_t objset; /* read from this objset */ uint64_t object; /* read of this object number */ uint64_t level; /* block's indirection level */ uint64_t blkid; /* read of this block id */ char origin[24]; /* read originated from here */ uint32_t aflags; /* ARC flags (cached, prefetch, etc.) */ pid_t pid; /* PID of task doing read */ char comm[16]; /* process name of task doing read */ procfs_list_node_t srh_node; } spa_read_history_t; static int spa_read_history_show_header(struct seq_file *f) { seq_printf(f, "%-8s %-16s %-8s %-8s %-8s %-8s %-8s " "%-24s %-8s %-16s\n", "UID", "start", "objset", "object", "level", "blkid", "aflags", "origin", "pid", "process"); return (0); } static int spa_read_history_show(struct seq_file *f, void *data) { spa_read_history_t *srh = (spa_read_history_t *)data; seq_printf(f, "%-8llu %-16llu 0x%-6llx " "%-8lli %-8lli %-8lli 0x%-6x %-24s %-8i %-16s\n", (u_longlong_t)srh->srh_node.pln_id, srh->start, (longlong_t)srh->objset, (longlong_t)srh->object, (longlong_t)srh->level, (longlong_t)srh->blkid, srh->aflags, srh->origin, srh->pid, srh->comm); return (0); } /* Remove oldest elements from list until there are no more than 'size' left */ static void spa_read_history_truncate(spa_history_list_t *shl, unsigned int size) { spa_read_history_t *srh; while (shl->size > size) { srh = list_remove_head(&shl->procfs_list.pl_list); ASSERT3P(srh, !=, NULL); kmem_free(srh, sizeof (spa_read_history_t)); shl->size--; } if (size == 0) ASSERT(list_is_empty(&shl->procfs_list.pl_list)); } static int spa_read_history_clear(procfs_list_t *procfs_list) { spa_history_list_t *shl = procfs_list->pl_private; mutex_enter(&procfs_list->pl_lock); spa_read_history_truncate(shl, 0); mutex_exit(&procfs_list->pl_lock); return (0); } static void spa_read_history_init(spa_t *spa) { spa_history_list_t *shl = &spa->spa_stats.read_history; - char *module; shl->size = 0; - - module = kmem_asprintf("zfs/%s", spa_name(spa)); - shl->procfs_list.pl_private = shl; - procfs_list_install(module, + procfs_list_install("zfs", + spa_name(spa), "reads", 0600, &shl->procfs_list, spa_read_history_show, spa_read_history_show_header, spa_read_history_clear, offsetof(spa_read_history_t, srh_node)); - - kmem_strfree(module); } static void spa_read_history_destroy(spa_t *spa) { spa_history_list_t *shl = &spa->spa_stats.read_history; procfs_list_uninstall(&shl->procfs_list); spa_read_history_truncate(shl, 0); procfs_list_destroy(&shl->procfs_list); } void spa_read_history_add(spa_t *spa, const zbookmark_phys_t *zb, uint32_t aflags) { spa_history_list_t *shl = &spa->spa_stats.read_history; spa_read_history_t *srh; ASSERT3P(spa, !=, NULL); ASSERT3P(zb, !=, NULL); if (zfs_read_history == 0 && shl->size == 0) return; if (zfs_read_history_hits == 0 && (aflags & ARC_FLAG_CACHED)) return; srh = kmem_zalloc(sizeof (spa_read_history_t), KM_SLEEP); strlcpy(srh->comm, getcomm(), sizeof (srh->comm)); srh->start = gethrtime(); srh->objset = zb->zb_objset; srh->object = zb->zb_object; srh->level = zb->zb_level; srh->blkid = zb->zb_blkid; srh->aflags = aflags; srh->pid = getpid(); mutex_enter(&shl->procfs_list.pl_lock); procfs_list_add(&shl->procfs_list, srh); shl->size++; spa_read_history_truncate(shl, zfs_read_history); mutex_exit(&shl->procfs_list.pl_lock); } /* * ========================================================================== * SPA TXG History Routines * ========================================================================== */ /* * Txg statistics - Information exported regarding each txg sync */ typedef struct spa_txg_history { uint64_t txg; /* txg id */ txg_state_t state; /* active txg state */ uint64_t nread; /* number of bytes read */ uint64_t nwritten; /* number of bytes written */ uint64_t reads; /* number of read operations */ uint64_t writes; /* number of write operations */ uint64_t ndirty; /* number of dirty bytes */ hrtime_t times[TXG_STATE_COMMITTED]; /* completion times */ procfs_list_node_t sth_node; } spa_txg_history_t; static int spa_txg_history_show_header(struct seq_file *f) { seq_printf(f, "%-8s %-16s %-5s %-12s %-12s %-12s " "%-8s %-8s %-12s %-12s %-12s %-12s\n", "txg", "birth", "state", "ndirty", "nread", "nwritten", "reads", "writes", "otime", "qtime", "wtime", "stime"); return (0); } static int spa_txg_history_show(struct seq_file *f, void *data) { spa_txg_history_t *sth = (spa_txg_history_t *)data; uint64_t open = 0, quiesce = 0, wait = 0, sync = 0; char state; switch (sth->state) { case TXG_STATE_BIRTH: state = 'B'; break; case TXG_STATE_OPEN: state = 'O'; break; case TXG_STATE_QUIESCED: state = 'Q'; break; case TXG_STATE_WAIT_FOR_SYNC: state = 'W'; break; case TXG_STATE_SYNCED: state = 'S'; break; case TXG_STATE_COMMITTED: state = 'C'; break; default: state = '?'; break; } if (sth->times[TXG_STATE_OPEN]) open = sth->times[TXG_STATE_OPEN] - sth->times[TXG_STATE_BIRTH]; if (sth->times[TXG_STATE_QUIESCED]) quiesce = sth->times[TXG_STATE_QUIESCED] - sth->times[TXG_STATE_OPEN]; if (sth->times[TXG_STATE_WAIT_FOR_SYNC]) wait = sth->times[TXG_STATE_WAIT_FOR_SYNC] - sth->times[TXG_STATE_QUIESCED]; if (sth->times[TXG_STATE_SYNCED]) sync = sth->times[TXG_STATE_SYNCED] - sth->times[TXG_STATE_WAIT_FOR_SYNC]; seq_printf(f, "%-8llu %-16llu %-5c %-12llu " "%-12llu %-12llu %-8llu %-8llu %-12llu %-12llu %-12llu %-12llu\n", (longlong_t)sth->txg, sth->times[TXG_STATE_BIRTH], state, (u_longlong_t)sth->ndirty, (u_longlong_t)sth->nread, (u_longlong_t)sth->nwritten, (u_longlong_t)sth->reads, (u_longlong_t)sth->writes, (u_longlong_t)open, (u_longlong_t)quiesce, (u_longlong_t)wait, (u_longlong_t)sync); return (0); } /* Remove oldest elements from list until there are no more than 'size' left */ static void spa_txg_history_truncate(spa_history_list_t *shl, unsigned int size) { spa_txg_history_t *sth; while (shl->size > size) { sth = list_remove_head(&shl->procfs_list.pl_list); ASSERT3P(sth, !=, NULL); kmem_free(sth, sizeof (spa_txg_history_t)); shl->size--; } if (size == 0) ASSERT(list_is_empty(&shl->procfs_list.pl_list)); } static int spa_txg_history_clear(procfs_list_t *procfs_list) { spa_history_list_t *shl = procfs_list->pl_private; mutex_enter(&procfs_list->pl_lock); spa_txg_history_truncate(shl, 0); mutex_exit(&procfs_list->pl_lock); return (0); } static void spa_txg_history_init(spa_t *spa) { spa_history_list_t *shl = &spa->spa_stats.txg_history; - char *module; shl->size = 0; - - module = kmem_asprintf("zfs/%s", spa_name(spa)); - shl->procfs_list.pl_private = shl; - procfs_list_install(module, + procfs_list_install("zfs", + spa_name(spa), "txgs", 0644, &shl->procfs_list, spa_txg_history_show, spa_txg_history_show_header, spa_txg_history_clear, offsetof(spa_txg_history_t, sth_node)); - - kmem_strfree(module); } static void spa_txg_history_destroy(spa_t *spa) { spa_history_list_t *shl = &spa->spa_stats.txg_history; procfs_list_uninstall(&shl->procfs_list); spa_txg_history_truncate(shl, 0); procfs_list_destroy(&shl->procfs_list); } /* * Add a new txg to historical record. */ void spa_txg_history_add(spa_t *spa, uint64_t txg, hrtime_t birth_time) { spa_history_list_t *shl = &spa->spa_stats.txg_history; spa_txg_history_t *sth; if (zfs_txg_history == 0 && shl->size == 0) return; sth = kmem_zalloc(sizeof (spa_txg_history_t), KM_SLEEP); sth->txg = txg; sth->state = TXG_STATE_OPEN; sth->times[TXG_STATE_BIRTH] = birth_time; mutex_enter(&shl->procfs_list.pl_lock); procfs_list_add(&shl->procfs_list, sth); shl->size++; spa_txg_history_truncate(shl, zfs_txg_history); mutex_exit(&shl->procfs_list.pl_lock); } /* * Set txg state completion time and increment current state. */ int spa_txg_history_set(spa_t *spa, uint64_t txg, txg_state_t completed_state, hrtime_t completed_time) { spa_history_list_t *shl = &spa->spa_stats.txg_history; spa_txg_history_t *sth; int error = ENOENT; if (zfs_txg_history == 0) return (0); mutex_enter(&shl->procfs_list.pl_lock); for (sth = list_tail(&shl->procfs_list.pl_list); sth != NULL; sth = list_prev(&shl->procfs_list.pl_list, sth)) { if (sth->txg == txg) { sth->times[completed_state] = completed_time; sth->state++; error = 0; break; } } mutex_exit(&shl->procfs_list.pl_lock); return (error); } /* * Set txg IO stats. */ static int spa_txg_history_set_io(spa_t *spa, uint64_t txg, uint64_t nread, uint64_t nwritten, uint64_t reads, uint64_t writes, uint64_t ndirty) { spa_history_list_t *shl = &spa->spa_stats.txg_history; spa_txg_history_t *sth; int error = ENOENT; if (zfs_txg_history == 0) return (0); mutex_enter(&shl->procfs_list.pl_lock); for (sth = list_tail(&shl->procfs_list.pl_list); sth != NULL; sth = list_prev(&shl->procfs_list.pl_list, sth)) { if (sth->txg == txg) { sth->nread = nread; sth->nwritten = nwritten; sth->reads = reads; sth->writes = writes; sth->ndirty = ndirty; error = 0; break; } } mutex_exit(&shl->procfs_list.pl_lock); return (error); } txg_stat_t * spa_txg_history_init_io(spa_t *spa, uint64_t txg, dsl_pool_t *dp) { txg_stat_t *ts; if (zfs_txg_history == 0) return (NULL); ts = kmem_alloc(sizeof (txg_stat_t), KM_SLEEP); spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER); vdev_get_stats(spa->spa_root_vdev, &ts->vs1); spa_config_exit(spa, SCL_CONFIG, FTAG); ts->txg = txg; ts->ndirty = dp->dp_dirty_pertxg[txg & TXG_MASK]; spa_txg_history_set(spa, txg, TXG_STATE_WAIT_FOR_SYNC, gethrtime()); return (ts); } void spa_txg_history_fini_io(spa_t *spa, txg_stat_t *ts) { if (ts == NULL) return; if (zfs_txg_history == 0) { kmem_free(ts, sizeof (txg_stat_t)); return; } spa_config_enter(spa, SCL_CONFIG, FTAG, RW_READER); vdev_get_stats(spa->spa_root_vdev, &ts->vs2); spa_config_exit(spa, SCL_CONFIG, FTAG); spa_txg_history_set(spa, ts->txg, TXG_STATE_SYNCED, gethrtime()); spa_txg_history_set_io(spa, ts->txg, ts->vs2.vs_bytes[ZIO_TYPE_READ] - ts->vs1.vs_bytes[ZIO_TYPE_READ], ts->vs2.vs_bytes[ZIO_TYPE_WRITE] - ts->vs1.vs_bytes[ZIO_TYPE_WRITE], ts->vs2.vs_ops[ZIO_TYPE_READ] - ts->vs1.vs_ops[ZIO_TYPE_READ], ts->vs2.vs_ops[ZIO_TYPE_WRITE] - ts->vs1.vs_ops[ZIO_TYPE_WRITE], ts->ndirty); kmem_free(ts, sizeof (txg_stat_t)); } /* * ========================================================================== * SPA TX Assign Histogram Routines * ========================================================================== */ /* * Tx statistics - Information exported regarding dmu_tx_assign time. */ /* * When the kstat is written zero all buckets. When the kstat is read * count the number of trailing buckets set to zero and update ks_ndata * such that they are not output. */ static int spa_tx_assign_update(kstat_t *ksp, int rw) { spa_t *spa = ksp->ks_private; spa_history_kstat_t *shk = &spa->spa_stats.tx_assign_histogram; int i; if (rw == KSTAT_WRITE) { for (i = 0; i < shk->count; i++) ((kstat_named_t *)shk->priv)[i].value.ui64 = 0; } for (i = shk->count; i > 0; i--) if (((kstat_named_t *)shk->priv)[i-1].value.ui64 != 0) break; ksp->ks_ndata = i; ksp->ks_data_size = i * sizeof (kstat_named_t); return (0); } static void spa_tx_assign_init(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.tx_assign_histogram; char *name; kstat_named_t *ks; kstat_t *ksp; int i; mutex_init(&shk->lock, NULL, MUTEX_DEFAULT, NULL); shk->count = 42; /* power of two buckets for 1ns to 2,199s */ shk->size = shk->count * sizeof (kstat_named_t); shk->priv = kmem_alloc(shk->size, KM_SLEEP); name = kmem_asprintf("zfs/%s", spa_name(spa)); for (i = 0; i < shk->count; i++) { ks = &((kstat_named_t *)shk->priv)[i]; ks->data_type = KSTAT_DATA_UINT64; ks->value.ui64 = 0; (void) snprintf(ks->name, KSTAT_STRLEN, "%llu ns", (u_longlong_t)1 << i); } ksp = kstat_create(name, 0, "dmu_tx_assign", "misc", KSTAT_TYPE_NAMED, 0, KSTAT_FLAG_VIRTUAL); shk->kstat = ksp; if (ksp) { ksp->ks_lock = &shk->lock; ksp->ks_data = shk->priv; ksp->ks_ndata = shk->count; ksp->ks_data_size = shk->size; ksp->ks_private = spa; ksp->ks_update = spa_tx_assign_update; kstat_install(ksp); } kmem_strfree(name); } static void spa_tx_assign_destroy(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.tx_assign_histogram; kstat_t *ksp; ksp = shk->kstat; if (ksp) kstat_delete(ksp); kmem_free(shk->priv, shk->size); mutex_destroy(&shk->lock); } void spa_tx_assign_add_nsecs(spa_t *spa, uint64_t nsecs) { spa_history_kstat_t *shk = &spa->spa_stats.tx_assign_histogram; uint64_t idx = 0; while (((1ULL << idx) < nsecs) && (idx < shk->size - 1)) idx++; atomic_inc_64(&((kstat_named_t *)shk->priv)[idx].value.ui64); } /* * ========================================================================== * SPA IO History Routines * ========================================================================== */ static int spa_io_history_update(kstat_t *ksp, int rw) { if (rw == KSTAT_WRITE) memset(ksp->ks_data, 0, ksp->ks_data_size); return (0); } static void spa_io_history_init(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.io_history; char *name; kstat_t *ksp; mutex_init(&shk->lock, NULL, MUTEX_DEFAULT, NULL); name = kmem_asprintf("zfs/%s", spa_name(spa)); ksp = kstat_create(name, 0, "io", "disk", KSTAT_TYPE_IO, 1, 0); shk->kstat = ksp; if (ksp) { ksp->ks_lock = &shk->lock; ksp->ks_private = spa; ksp->ks_update = spa_io_history_update; kstat_install(ksp); } kmem_strfree(name); } static void spa_io_history_destroy(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.io_history; if (shk->kstat) kstat_delete(shk->kstat); mutex_destroy(&shk->lock); } /* * ========================================================================== * SPA MMP History Routines * ========================================================================== */ /* * MMP statistics - Information exported regarding attempted MMP writes * For MMP writes issued, fields used as per comments below. * For MMP writes skipped, an entry represents a span of time when * writes were skipped for same reason (error from mmp_random_leaf). * Differences are: * timestamp time first write skipped, if >1 skipped in a row * mmp_delay delay value at timestamp * vdev_guid number of writes skipped * io_error one of enum mmp_error * duration time span (ns) of skipped writes */ typedef struct spa_mmp_history { uint64_t mmp_node_id; /* unique # for updates */ uint64_t txg; /* txg of last sync */ uint64_t timestamp; /* UTC time MMP write issued */ uint64_t mmp_delay; /* mmp_thread.mmp_delay at timestamp */ uint64_t vdev_guid; /* unique ID of leaf vdev */ char *vdev_path; int vdev_label; /* vdev label */ int io_error; /* error status of MMP write */ hrtime_t error_start; /* hrtime of start of error period */ hrtime_t duration; /* time from submission to completion */ procfs_list_node_t smh_node; } spa_mmp_history_t; static int spa_mmp_history_show_header(struct seq_file *f) { seq_printf(f, "%-10s %-10s %-10s %-6s %-10s %-12s %-24s " "%-10s %s\n", "id", "txg", "timestamp", "error", "duration", "mmp_delay", "vdev_guid", "vdev_label", "vdev_path"); return (0); } static int spa_mmp_history_show(struct seq_file *f, void *data) { spa_mmp_history_t *smh = (spa_mmp_history_t *)data; char skip_fmt[] = "%-10llu %-10llu %10llu %#6llx %10lld %12llu %-24llu " "%-10lld %s\n"; char write_fmt[] = "%-10llu %-10llu %10llu %6lld %10lld %12llu %-24llu " "%-10lld %s\n"; seq_printf(f, (smh->error_start ? skip_fmt : write_fmt), (u_longlong_t)smh->mmp_node_id, (u_longlong_t)smh->txg, (u_longlong_t)smh->timestamp, (longlong_t)smh->io_error, (longlong_t)smh->duration, (u_longlong_t)smh->mmp_delay, (u_longlong_t)smh->vdev_guid, (u_longlong_t)smh->vdev_label, (smh->vdev_path ? smh->vdev_path : "-")); return (0); } /* Remove oldest elements from list until there are no more than 'size' left */ static void spa_mmp_history_truncate(spa_history_list_t *shl, unsigned int size) { spa_mmp_history_t *smh; while (shl->size > size) { smh = list_remove_head(&shl->procfs_list.pl_list); if (smh->vdev_path) kmem_strfree(smh->vdev_path); kmem_free(smh, sizeof (spa_mmp_history_t)); shl->size--; } if (size == 0) ASSERT(list_is_empty(&shl->procfs_list.pl_list)); } static int spa_mmp_history_clear(procfs_list_t *procfs_list) { spa_history_list_t *shl = procfs_list->pl_private; mutex_enter(&procfs_list->pl_lock); spa_mmp_history_truncate(shl, 0); mutex_exit(&procfs_list->pl_lock); return (0); } static void spa_mmp_history_init(spa_t *spa) { spa_history_list_t *shl = &spa->spa_stats.mmp_history; - char *module; shl->size = 0; - module = kmem_asprintf("zfs/%s", spa_name(spa)); - shl->procfs_list.pl_private = shl; - procfs_list_install(module, + procfs_list_install("zfs", + spa_name(spa), "multihost", 0644, &shl->procfs_list, spa_mmp_history_show, spa_mmp_history_show_header, spa_mmp_history_clear, offsetof(spa_mmp_history_t, smh_node)); - - kmem_strfree(module); } static void spa_mmp_history_destroy(spa_t *spa) { spa_history_list_t *shl = &spa->spa_stats.mmp_history; procfs_list_uninstall(&shl->procfs_list); spa_mmp_history_truncate(shl, 0); procfs_list_destroy(&shl->procfs_list); } /* * Set duration in existing "skip" record to how long we have waited for a leaf * vdev to become available. * * Important that we start search at the tail of the list where new * records are inserted, so this is normally an O(1) operation. */ int spa_mmp_history_set_skip(spa_t *spa, uint64_t mmp_node_id) { spa_history_list_t *shl = &spa->spa_stats.mmp_history; spa_mmp_history_t *smh; int error = ENOENT; if (zfs_multihost_history == 0 && shl->size == 0) return (0); mutex_enter(&shl->procfs_list.pl_lock); for (smh = list_tail(&shl->procfs_list.pl_list); smh != NULL; smh = list_prev(&shl->procfs_list.pl_list, smh)) { if (smh->mmp_node_id == mmp_node_id) { ASSERT3U(smh->io_error, !=, 0); smh->duration = gethrtime() - smh->error_start; smh->vdev_guid++; error = 0; break; } } mutex_exit(&shl->procfs_list.pl_lock); return (error); } /* * Set MMP write duration and error status in existing record. * See comment re: search order above spa_mmp_history_set_skip(). */ int spa_mmp_history_set(spa_t *spa, uint64_t mmp_node_id, int io_error, hrtime_t duration) { spa_history_list_t *shl = &spa->spa_stats.mmp_history; spa_mmp_history_t *smh; int error = ENOENT; if (zfs_multihost_history == 0 && shl->size == 0) return (0); mutex_enter(&shl->procfs_list.pl_lock); for (smh = list_tail(&shl->procfs_list.pl_list); smh != NULL; smh = list_prev(&shl->procfs_list.pl_list, smh)) { if (smh->mmp_node_id == mmp_node_id) { ASSERT(smh->io_error == 0); smh->io_error = io_error; smh->duration = duration; error = 0; break; } } mutex_exit(&shl->procfs_list.pl_lock); return (error); } /* * Add a new MMP historical record. * error == 0 : a write was issued. * error != 0 : a write was not issued because no leaves were found. */ void spa_mmp_history_add(spa_t *spa, uint64_t txg, uint64_t timestamp, uint64_t mmp_delay, vdev_t *vd, int label, uint64_t mmp_node_id, int error) { spa_history_list_t *shl = &spa->spa_stats.mmp_history; spa_mmp_history_t *smh; if (zfs_multihost_history == 0 && shl->size == 0) return; smh = kmem_zalloc(sizeof (spa_mmp_history_t), KM_SLEEP); smh->txg = txg; smh->timestamp = timestamp; smh->mmp_delay = mmp_delay; if (vd) { smh->vdev_guid = vd->vdev_guid; if (vd->vdev_path) smh->vdev_path = kmem_strdup(vd->vdev_path); } smh->vdev_label = label; smh->mmp_node_id = mmp_node_id; if (error) { smh->io_error = error; smh->error_start = gethrtime(); smh->vdev_guid = 1; } mutex_enter(&shl->procfs_list.pl_lock); procfs_list_add(&shl->procfs_list, smh); shl->size++; spa_mmp_history_truncate(shl, zfs_multihost_history); mutex_exit(&shl->procfs_list.pl_lock); } static void * spa_state_addr(kstat_t *ksp, loff_t n) { if (n == 0) return (ksp->ks_private); /* return the spa_t */ return (NULL); } static int spa_state_data(char *buf, size_t size, void *data) { spa_t *spa = (spa_t *)data; (void) snprintf(buf, size, "%s\n", spa_state_to_name(spa)); return (0); } /* * Return the state of the pool in /proc/spl/kstat/zfs//state. * * This is a lock-less read of the pool's state (unlike using 'zpool', which * can potentially block for seconds). Because it doesn't block, it can useful * as a pool heartbeat value. */ static void spa_state_init(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.state; char *name; kstat_t *ksp; mutex_init(&shk->lock, NULL, MUTEX_DEFAULT, NULL); name = kmem_asprintf("zfs/%s", spa_name(spa)); ksp = kstat_create(name, 0, "state", "misc", KSTAT_TYPE_RAW, 0, KSTAT_FLAG_VIRTUAL); shk->kstat = ksp; if (ksp) { ksp->ks_lock = &shk->lock; ksp->ks_data = NULL; ksp->ks_private = spa; ksp->ks_flags |= KSTAT_FLAG_NO_HEADERS; kstat_set_raw_ops(ksp, NULL, spa_state_data, spa_state_addr); kstat_install(ksp); } kmem_strfree(name); } static void spa_health_destroy(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.state; kstat_t *ksp = shk->kstat; if (ksp) kstat_delete(ksp); mutex_destroy(&shk->lock); } static spa_iostats_t spa_iostats_template = { { "trim_extents_written", KSTAT_DATA_UINT64 }, { "trim_bytes_written", KSTAT_DATA_UINT64 }, { "trim_extents_skipped", KSTAT_DATA_UINT64 }, { "trim_bytes_skipped", KSTAT_DATA_UINT64 }, { "trim_extents_failed", KSTAT_DATA_UINT64 }, { "trim_bytes_failed", KSTAT_DATA_UINT64 }, { "autotrim_extents_written", KSTAT_DATA_UINT64 }, { "autotrim_bytes_written", KSTAT_DATA_UINT64 }, { "autotrim_extents_skipped", KSTAT_DATA_UINT64 }, { "autotrim_bytes_skipped", KSTAT_DATA_UINT64 }, { "autotrim_extents_failed", KSTAT_DATA_UINT64 }, { "autotrim_bytes_failed", KSTAT_DATA_UINT64 }, { "simple_trim_extents_written", KSTAT_DATA_UINT64 }, { "simple_trim_bytes_written", KSTAT_DATA_UINT64 }, { "simple_trim_extents_skipped", KSTAT_DATA_UINT64 }, { "simple_trim_bytes_skipped", KSTAT_DATA_UINT64 }, { "simple_trim_extents_failed", KSTAT_DATA_UINT64 }, { "simple_trim_bytes_failed", KSTAT_DATA_UINT64 }, }; #define SPA_IOSTATS_ADD(stat, val) \ atomic_add_64(&iostats->stat.value.ui64, (val)); void spa_iostats_trim_add(spa_t *spa, trim_type_t type, uint64_t extents_written, uint64_t bytes_written, uint64_t extents_skipped, uint64_t bytes_skipped, uint64_t extents_failed, uint64_t bytes_failed) { spa_history_kstat_t *shk = &spa->spa_stats.iostats; kstat_t *ksp = shk->kstat; spa_iostats_t *iostats; if (ksp == NULL) return; iostats = ksp->ks_data; if (type == TRIM_TYPE_MANUAL) { SPA_IOSTATS_ADD(trim_extents_written, extents_written); SPA_IOSTATS_ADD(trim_bytes_written, bytes_written); SPA_IOSTATS_ADD(trim_extents_skipped, extents_skipped); SPA_IOSTATS_ADD(trim_bytes_skipped, bytes_skipped); SPA_IOSTATS_ADD(trim_extents_failed, extents_failed); SPA_IOSTATS_ADD(trim_bytes_failed, bytes_failed); } else if (type == TRIM_TYPE_AUTO) { SPA_IOSTATS_ADD(autotrim_extents_written, extents_written); SPA_IOSTATS_ADD(autotrim_bytes_written, bytes_written); SPA_IOSTATS_ADD(autotrim_extents_skipped, extents_skipped); SPA_IOSTATS_ADD(autotrim_bytes_skipped, bytes_skipped); SPA_IOSTATS_ADD(autotrim_extents_failed, extents_failed); SPA_IOSTATS_ADD(autotrim_bytes_failed, bytes_failed); } else { SPA_IOSTATS_ADD(simple_trim_extents_written, extents_written); SPA_IOSTATS_ADD(simple_trim_bytes_written, bytes_written); SPA_IOSTATS_ADD(simple_trim_extents_skipped, extents_skipped); SPA_IOSTATS_ADD(simple_trim_bytes_skipped, bytes_skipped); SPA_IOSTATS_ADD(simple_trim_extents_failed, extents_failed); SPA_IOSTATS_ADD(simple_trim_bytes_failed, bytes_failed); } } static int spa_iostats_update(kstat_t *ksp, int rw) { if (rw == KSTAT_WRITE) { memcpy(ksp->ks_data, &spa_iostats_template, sizeof (spa_iostats_t)); } return (0); } static void spa_iostats_init(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.iostats; mutex_init(&shk->lock, NULL, MUTEX_DEFAULT, NULL); char *name = kmem_asprintf("zfs/%s", spa_name(spa)); kstat_t *ksp = kstat_create(name, 0, "iostats", "misc", KSTAT_TYPE_NAMED, sizeof (spa_iostats_t) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL); shk->kstat = ksp; if (ksp) { int size = sizeof (spa_iostats_t); ksp->ks_lock = &shk->lock; ksp->ks_private = spa; ksp->ks_update = spa_iostats_update; ksp->ks_data = kmem_alloc(size, KM_SLEEP); memcpy(ksp->ks_data, &spa_iostats_template, size); kstat_install(ksp); } kmem_strfree(name); } static void spa_iostats_destroy(spa_t *spa) { spa_history_kstat_t *shk = &spa->spa_stats.iostats; kstat_t *ksp = shk->kstat; if (ksp) { kmem_free(ksp->ks_data, sizeof (spa_iostats_t)); kstat_delete(ksp); } mutex_destroy(&shk->lock); } void spa_stats_init(spa_t *spa) { spa_read_history_init(spa); spa_txg_history_init(spa); spa_tx_assign_init(spa); spa_io_history_init(spa); spa_mmp_history_init(spa); spa_state_init(spa); spa_iostats_init(spa); } void spa_stats_destroy(spa_t *spa) { spa_iostats_destroy(spa); spa_health_destroy(spa); spa_tx_assign_destroy(spa); spa_txg_history_destroy(spa); spa_read_history_destroy(spa); spa_io_history_destroy(spa); spa_mmp_history_destroy(spa); } /* BEGIN CSTYLED */ ZFS_MODULE_PARAM(zfs, zfs_, read_history, INT, ZMOD_RW, "Historical statistics for the last N reads"); ZFS_MODULE_PARAM(zfs, zfs_, read_history_hits, INT, ZMOD_RW, "Include cache hits in read history"); ZFS_MODULE_PARAM(zfs_txg, zfs_txg_, history, INT, ZMOD_RW, "Historical statistics for the last N txgs"); ZFS_MODULE_PARAM(zfs_multihost, zfs_multihost_, history, INT, ZMOD_RW, "Historical statistics for last N multihost writes"); /* END CSTYLED */ Index: vendor-sys/openzfs/dist/module/zfs/zfs_log.c =================================================================== --- vendor-sys/openzfs/dist/module/zfs/zfs_log.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zfs/zfs_log.c (revision 366348) @@ -1,774 +1,781 @@ /* * CDDL HEADER START * * The contents of this file are subject to the terms of the * Common Development and Distribution License (the "License"). * You may not use this file except in compliance with the License. * * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE * or http://www.opensolaris.org/os/licensing. * See the License for the specific language governing permissions * and limitations under the License. * * When distributing Covered Code, include this CDDL HEADER in each * file and include the License file at usr/src/OPENSOLARIS.LICENSE. * If applicable, add the following below this CDDL HEADER, with the * fields enclosed by brackets "[]" replaced with your own identifying * information: Portions Copyright [yyyy] [name of copyright owner] * * CDDL HEADER END */ /* * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved. * Copyright (c) 2015, 2018 by Delphix. All rights reserved. */ #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include #include /* * These zfs_log_* functions must be called within a dmu tx, in one * of 2 contexts depending on zilog->z_replay: * * Non replay mode * --------------- * We need to record the transaction so that if it is committed to * the Intent Log then it can be replayed. An intent log transaction * structure (itx_t) is allocated and all the information necessary to * possibly replay the transaction is saved in it. The itx is then assigned * a sequence number and inserted in the in-memory list anchored in the zilog. * * Replay mode * ----------- * We need to mark the intent log record as replayed in the log header. * This is done in the same transaction as the replay so that they * commit atomically. */ int zfs_log_create_txtype(zil_create_t type, vsecattr_t *vsecp, vattr_t *vap) { int isxvattr = (vap->va_mask & ATTR_XVATTR); switch (type) { case Z_FILE: if (vsecp == NULL && !isxvattr) return (TX_CREATE); if (vsecp && isxvattr) return (TX_CREATE_ACL_ATTR); if (vsecp) return (TX_CREATE_ACL); else return (TX_CREATE_ATTR); /*NOTREACHED*/ case Z_DIR: if (vsecp == NULL && !isxvattr) return (TX_MKDIR); if (vsecp && isxvattr) return (TX_MKDIR_ACL_ATTR); if (vsecp) return (TX_MKDIR_ACL); else return (TX_MKDIR_ATTR); case Z_XATTRDIR: return (TX_MKXATTR); } ASSERT(0); return (TX_MAX_TYPE); } /* * build up the log data necessary for logging xvattr_t * First lr_attr_t is initialized. following the lr_attr_t * is the mapsize and attribute bitmap copied from the xvattr_t. * Following the bitmap and bitmapsize two 64 bit words are reserved * for the create time which may be set. Following the create time * records a single 64 bit integer which has the bits to set on * replay for the xvattr. */ static void zfs_log_xvattr(lr_attr_t *lrattr, xvattr_t *xvap) { uint32_t *bitmap; uint64_t *attrs; uint64_t *crtime; xoptattr_t *xoap; void *scanstamp; int i; xoap = xva_getxoptattr(xvap); ASSERT(xoap); lrattr->lr_attr_masksize = xvap->xva_mapsize; bitmap = &lrattr->lr_attr_bitmap; for (i = 0; i != xvap->xva_mapsize; i++, bitmap++) { *bitmap = xvap->xva_reqattrmap[i]; } /* Now pack the attributes up in a single uint64_t */ attrs = (uint64_t *)bitmap; crtime = attrs + 1; scanstamp = (caddr_t)(crtime + 2); *attrs = 0; if (XVA_ISSET_REQ(xvap, XAT_READONLY)) *attrs |= (xoap->xoa_readonly == 0) ? 0 : XAT0_READONLY; if (XVA_ISSET_REQ(xvap, XAT_HIDDEN)) *attrs |= (xoap->xoa_hidden == 0) ? 0 : XAT0_HIDDEN; if (XVA_ISSET_REQ(xvap, XAT_SYSTEM)) *attrs |= (xoap->xoa_system == 0) ? 0 : XAT0_SYSTEM; if (XVA_ISSET_REQ(xvap, XAT_ARCHIVE)) *attrs |= (xoap->xoa_archive == 0) ? 0 : XAT0_ARCHIVE; if (XVA_ISSET_REQ(xvap, XAT_IMMUTABLE)) *attrs |= (xoap->xoa_immutable == 0) ? 0 : XAT0_IMMUTABLE; if (XVA_ISSET_REQ(xvap, XAT_NOUNLINK)) *attrs |= (xoap->xoa_nounlink == 0) ? 0 : XAT0_NOUNLINK; if (XVA_ISSET_REQ(xvap, XAT_APPENDONLY)) *attrs |= (xoap->xoa_appendonly == 0) ? 0 : XAT0_APPENDONLY; if (XVA_ISSET_REQ(xvap, XAT_OPAQUE)) *attrs |= (xoap->xoa_opaque == 0) ? 0 : XAT0_APPENDONLY; if (XVA_ISSET_REQ(xvap, XAT_NODUMP)) *attrs |= (xoap->xoa_nodump == 0) ? 0 : XAT0_NODUMP; if (XVA_ISSET_REQ(xvap, XAT_AV_QUARANTINED)) *attrs |= (xoap->xoa_av_quarantined == 0) ? 0 : XAT0_AV_QUARANTINED; if (XVA_ISSET_REQ(xvap, XAT_AV_MODIFIED)) *attrs |= (xoap->xoa_av_modified == 0) ? 0 : XAT0_AV_MODIFIED; if (XVA_ISSET_REQ(xvap, XAT_CREATETIME)) ZFS_TIME_ENCODE(&xoap->xoa_createtime, crtime); if (XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP)) { ASSERT(!XVA_ISSET_REQ(xvap, XAT_PROJID)); bcopy(xoap->xoa_av_scanstamp, scanstamp, AV_SCANSTAMP_SZ); } else if (XVA_ISSET_REQ(xvap, XAT_PROJID)) { /* * XAT_PROJID and XAT_AV_SCANSTAMP will never be valid * at the same time, so we can share the same space. */ bcopy(&xoap->xoa_projid, scanstamp, sizeof (uint64_t)); } if (XVA_ISSET_REQ(xvap, XAT_REPARSE)) *attrs |= (xoap->xoa_reparse == 0) ? 0 : XAT0_REPARSE; if (XVA_ISSET_REQ(xvap, XAT_OFFLINE)) *attrs |= (xoap->xoa_offline == 0) ? 0 : XAT0_OFFLINE; if (XVA_ISSET_REQ(xvap, XAT_SPARSE)) *attrs |= (xoap->xoa_sparse == 0) ? 0 : XAT0_SPARSE; if (XVA_ISSET_REQ(xvap, XAT_PROJINHERIT)) *attrs |= (xoap->xoa_projinherit == 0) ? 0 : XAT0_PROJINHERIT; } static void * zfs_log_fuid_ids(zfs_fuid_info_t *fuidp, void *start) { zfs_fuid_t *zfuid; uint64_t *fuidloc = start; /* First copy in the ACE FUIDs */ for (zfuid = list_head(&fuidp->z_fuids); zfuid; zfuid = list_next(&fuidp->z_fuids, zfuid)) { *fuidloc++ = zfuid->z_logfuid; } return (fuidloc); } static void * zfs_log_fuid_domains(zfs_fuid_info_t *fuidp, void *start) { zfs_fuid_domain_t *zdomain; /* now copy in the domain info, if any */ if (fuidp->z_domain_str_sz != 0) { for (zdomain = list_head(&fuidp->z_domains); zdomain; zdomain = list_next(&fuidp->z_domains, zdomain)) { bcopy((void *)zdomain->z_domain, start, strlen(zdomain->z_domain) + 1); start = (caddr_t)start + strlen(zdomain->z_domain) + 1; } } return (start); } /* * If zp is an xattr node, check whether the xattr owner is unlinked. * We don't want to log anything if the owner is unlinked. */ static int zfs_xattr_owner_unlinked(znode_t *zp) { int unlinked = 0; znode_t *dzp; #ifdef __FreeBSD__ znode_t *tzp = zp; /* * zrele drops the vnode lock which violates the VOP locking contract * on FreeBSD. See comment at the top of zfs_replay.c for more detail. */ /* * if zp is XATTR node, keep walking up via z_xattr_parent until we * get the owner */ while (tzp->z_pflags & ZFS_XATTR) { ASSERT3U(zp->z_xattr_parent, !=, 0); if (zfs_zget(ZTOZSB(tzp), tzp->z_xattr_parent, &dzp) != 0) { unlinked = 1; break; } if (tzp != zp) zrele(tzp); tzp = dzp; unlinked = tzp->z_unlinked; } if (tzp != zp) zrele(tzp); #else zhold(zp); /* * if zp is XATTR node, keep walking up via z_xattr_parent until we * get the owner */ while (zp->z_pflags & ZFS_XATTR) { ASSERT3U(zp->z_xattr_parent, !=, 0); if (zfs_zget(ZTOZSB(zp), zp->z_xattr_parent, &dzp) != 0) { unlinked = 1; break; } zrele(zp); zp = dzp; unlinked = zp->z_unlinked; } zrele(zp); #endif return (unlinked); } /* * Handles TX_CREATE, TX_CREATE_ATTR, TX_MKDIR, TX_MKDIR_ATTR and * TK_MKXATTR transactions. * * TX_CREATE and TX_MKDIR are standard creates, but they may have FUID * domain information appended prior to the name. In this case the * uid/gid in the log record will be a log centric FUID. * * TX_CREATE_ACL_ATTR and TX_MKDIR_ACL_ATTR handle special creates that * may contain attributes, ACL and optional fuid information. * * TX_CREATE_ACL and TX_MKDIR_ACL handle special creates that specify * and ACL and normal users/groups in the ACEs. * * There may be an optional xvattr attribute information similar * to zfs_log_setattr. * * Also, after the file name "domain" strings may be appended. */ void zfs_log_create(zilog_t *zilog, dmu_tx_t *tx, uint64_t txtype, znode_t *dzp, znode_t *zp, char *name, vsecattr_t *vsecp, zfs_fuid_info_t *fuidp, vattr_t *vap) { itx_t *itx; lr_create_t *lr; lr_acl_create_t *lracl; size_t aclsize = 0; size_t xvatsize = 0; size_t txsize; xvattr_t *xvap = (xvattr_t *)vap; void *end; size_t lrsize; size_t namesize = strlen(name) + 1; size_t fuidsz = 0; if (zil_replaying(zilog, tx) || zfs_xattr_owner_unlinked(dzp)) return; /* * If we have FUIDs present then add in space for * domains and ACE fuid's if any. */ if (fuidp) { fuidsz += fuidp->z_domain_str_sz; fuidsz += fuidp->z_fuid_cnt * sizeof (uint64_t); } if (vap->va_mask & ATTR_XVATTR) xvatsize = ZIL_XVAT_SIZE(xvap->xva_mapsize); if ((int)txtype == TX_CREATE_ATTR || (int)txtype == TX_MKDIR_ATTR || (int)txtype == TX_CREATE || (int)txtype == TX_MKDIR || (int)txtype == TX_MKXATTR) { txsize = sizeof (*lr) + namesize + fuidsz + xvatsize; lrsize = sizeof (*lr); } else { txsize = sizeof (lr_acl_create_t) + namesize + fuidsz + ZIL_ACE_LENGTH(aclsize) + xvatsize; lrsize = sizeof (lr_acl_create_t); } itx = zil_itx_create(txtype, txsize); lr = (lr_create_t *)&itx->itx_lr; lr->lr_doid = dzp->z_id; lr->lr_foid = zp->z_id; /* Store dnode slot count in 8 bits above object id. */ LR_FOID_SET_SLOTS(lr->lr_foid, zp->z_dnodesize >> DNODE_SHIFT); lr->lr_mode = zp->z_mode; if (!IS_EPHEMERAL(KUID_TO_SUID(ZTOUID(zp)))) { lr->lr_uid = (uint64_t)KUID_TO_SUID(ZTOUID(zp)); } else { lr->lr_uid = fuidp->z_fuid_owner; } if (!IS_EPHEMERAL(KGID_TO_SGID(ZTOGID(zp)))) { lr->lr_gid = (uint64_t)KGID_TO_SGID(ZTOGID(zp)); } else { lr->lr_gid = fuidp->z_fuid_group; } (void) sa_lookup(zp->z_sa_hdl, SA_ZPL_GEN(ZTOZSB(zp)), &lr->lr_gen, sizeof (uint64_t)); (void) sa_lookup(zp->z_sa_hdl, SA_ZPL_CRTIME(ZTOZSB(zp)), lr->lr_crtime, sizeof (uint64_t) * 2); if (sa_lookup(zp->z_sa_hdl, SA_ZPL_RDEV(ZTOZSB(zp)), &lr->lr_rdev, sizeof (lr->lr_rdev)) != 0) lr->lr_rdev = 0; /* * Fill in xvattr info if any */ if (vap->va_mask & ATTR_XVATTR) { zfs_log_xvattr((lr_attr_t *)((caddr_t)lr + lrsize), xvap); end = (caddr_t)lr + lrsize + xvatsize; } else { end = (caddr_t)lr + lrsize; } /* Now fill in any ACL info */ if (vsecp) { lracl = (lr_acl_create_t *)&itx->itx_lr; lracl->lr_aclcnt = vsecp->vsa_aclcnt; lracl->lr_acl_bytes = aclsize; lracl->lr_domcnt = fuidp ? fuidp->z_domain_cnt : 0; lracl->lr_fuidcnt = fuidp ? fuidp->z_fuid_cnt : 0; if (vsecp->vsa_aclflags & VSA_ACE_ACLFLAGS) lracl->lr_acl_flags = (uint64_t)vsecp->vsa_aclflags; else lracl->lr_acl_flags = 0; bcopy(vsecp->vsa_aclentp, end, aclsize); end = (caddr_t)end + ZIL_ACE_LENGTH(aclsize); } /* drop in FUID info */ if (fuidp) { end = zfs_log_fuid_ids(fuidp, end); end = zfs_log_fuid_domains(fuidp, end); } /* * Now place file name in log record */ bcopy(name, end, namesize); zil_itx_assign(zilog, itx, tx); } /* * Handles both TX_REMOVE and TX_RMDIR transactions. */ void zfs_log_remove(zilog_t *zilog, dmu_tx_t *tx, uint64_t txtype, znode_t *dzp, char *name, uint64_t foid, boolean_t unlinked) { itx_t *itx; lr_remove_t *lr; size_t namesize = strlen(name) + 1; if (zil_replaying(zilog, tx) || zfs_xattr_owner_unlinked(dzp)) return; itx = zil_itx_create(txtype, sizeof (*lr) + namesize); lr = (lr_remove_t *)&itx->itx_lr; lr->lr_doid = dzp->z_id; bcopy(name, (char *)(lr + 1), namesize); itx->itx_oid = foid; /* * Object ids can be re-instantiated in the next txg so * remove any async transactions to avoid future leaks. * This can happen if a fsync occurs on the re-instantiated * object for a WR_INDIRECT or WR_NEED_COPY write, which gets * the new file data and flushes a write record for the old object. */ if (unlinked) { ASSERT((txtype & ~TX_CI) == TX_REMOVE); zil_remove_async(zilog, foid); } zil_itx_assign(zilog, itx, tx); } /* * Handles TX_LINK transactions. */ void zfs_log_link(zilog_t *zilog, dmu_tx_t *tx, uint64_t txtype, znode_t *dzp, znode_t *zp, char *name) { itx_t *itx; lr_link_t *lr; size_t namesize = strlen(name) + 1; if (zil_replaying(zilog, tx)) return; itx = zil_itx_create(txtype, sizeof (*lr) + namesize); lr = (lr_link_t *)&itx->itx_lr; lr->lr_doid = dzp->z_id; lr->lr_link_obj = zp->z_id; bcopy(name, (char *)(lr + 1), namesize); zil_itx_assign(zilog, itx, tx); } /* * Handles TX_SYMLINK transactions. */ void zfs_log_symlink(zilog_t *zilog, dmu_tx_t *tx, uint64_t txtype, znode_t *dzp, znode_t *zp, char *name, char *link) { itx_t *itx; lr_create_t *lr; size_t namesize = strlen(name) + 1; size_t linksize = strlen(link) + 1; if (zil_replaying(zilog, tx)) return; itx = zil_itx_create(txtype, sizeof (*lr) + namesize + linksize); lr = (lr_create_t *)&itx->itx_lr; lr->lr_doid = dzp->z_id; lr->lr_foid = zp->z_id; lr->lr_uid = KUID_TO_SUID(ZTOUID(zp)); lr->lr_gid = KGID_TO_SGID(ZTOGID(zp)); lr->lr_mode = zp->z_mode; (void) sa_lookup(zp->z_sa_hdl, SA_ZPL_GEN(ZTOZSB(zp)), &lr->lr_gen, sizeof (uint64_t)); (void) sa_lookup(zp->z_sa_hdl, SA_ZPL_CRTIME(ZTOZSB(zp)), lr->lr_crtime, sizeof (uint64_t) * 2); bcopy(name, (char *)(lr + 1), namesize); bcopy(link, (char *)(lr + 1) + namesize, linksize); zil_itx_assign(zilog, itx, tx); } /* * Handles TX_RENAME transactions. */ void zfs_log_rename(zilog_t *zilog, dmu_tx_t *tx, uint64_t txtype, znode_t *sdzp, char *sname, znode_t *tdzp, char *dname, znode_t *szp) { itx_t *itx; lr_rename_t *lr; size_t snamesize = strlen(sname) + 1; size_t dnamesize = strlen(dname) + 1; if (zil_replaying(zilog, tx)) return; itx = zil_itx_create(txtype, sizeof (*lr) + snamesize + dnamesize); lr = (lr_rename_t *)&itx->itx_lr; lr->lr_sdoid = sdzp->z_id; lr->lr_tdoid = tdzp->z_id; bcopy(sname, (char *)(lr + 1), snamesize); bcopy(dname, (char *)(lr + 1) + snamesize, dnamesize); itx->itx_oid = szp->z_id; zil_itx_assign(zilog, itx, tx); } /* * zfs_log_write() handles TX_WRITE transactions. The specified callback is * called as soon as the write is on stable storage (be it via a DMU sync or a * ZIL commit). */ long zfs_immediate_write_sz = 32768; void zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype, znode_t *zp, offset_t off, ssize_t resid, int ioflag, zil_callback_t callback, void *callback_data) { dmu_buf_impl_t *db = (dmu_buf_impl_t *)sa_get_db(zp->z_sa_hdl); uint32_t blocksize = zp->z_blksz; itx_wr_state_t write_state; uintptr_t fsync_cnt; if (zil_replaying(zilog, tx) || zp->z_unlinked || zfs_xattr_owner_unlinked(zp)) { if (callback != NULL) callback(callback_data); return; } if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT) write_state = WR_INDIRECT; else if (!spa_has_slogs(zilog->zl_spa) && resid >= zfs_immediate_write_sz) write_state = WR_INDIRECT; else if (ioflag & (O_SYNC | O_DSYNC)) write_state = WR_COPIED; else write_state = WR_NEED_COPY; if ((fsync_cnt = (uintptr_t)tsd_get(zfs_fsyncer_key)) != 0) { (void) tsd_set(zfs_fsyncer_key, (void *)(fsync_cnt - 1)); } while (resid) { itx_t *itx; lr_write_t *lr; itx_wr_state_t wr_state = write_state; ssize_t len = resid; /* * A WR_COPIED record must fit entirely in one log block. * Large writes can use WR_NEED_COPY, which the ZIL will * split into multiple records across several log blocks * if necessary. */ if (wr_state == WR_COPIED && resid > zil_max_copied_data(zilog)) wr_state = WR_NEED_COPY; else if (wr_state == WR_INDIRECT) len = MIN(blocksize - P2PHASE(off, blocksize), resid); itx = zil_itx_create(txtype, sizeof (*lr) + (wr_state == WR_COPIED ? len : 0)); lr = (lr_write_t *)&itx->itx_lr; - DB_DNODE_ENTER(db); - if (wr_state == WR_COPIED && dmu_read_by_dnode(DB_DNODE(db), - off, len, lr + 1, DMU_READ_NO_PREFETCH) != 0) { - zil_itx_destroy(itx); - itx = zil_itx_create(txtype, sizeof (*lr)); - lr = (lr_write_t *)&itx->itx_lr; - wr_state = WR_NEED_COPY; + /* + * For WR_COPIED records, copy the data into the lr_write_t. + */ + if (wr_state == WR_COPIED) { + int err; + DB_DNODE_ENTER(db); + err = dmu_read_by_dnode(DB_DNODE(db), off, len, lr + 1, + DMU_READ_NO_PREFETCH); + if (err != 0) { + zil_itx_destroy(itx); + itx = zil_itx_create(txtype, sizeof (*lr)); + lr = (lr_write_t *)&itx->itx_lr; + wr_state = WR_NEED_COPY; + } + DB_DNODE_EXIT(db); } - DB_DNODE_EXIT(db); itx->itx_wr_state = wr_state; lr->lr_foid = zp->z_id; lr->lr_offset = off; lr->lr_length = len; lr->lr_blkoff = 0; BP_ZERO(&lr->lr_blkptr); itx->itx_private = ZTOZSB(zp); if (!(ioflag & (O_SYNC | O_DSYNC)) && (zp->z_sync_cnt == 0) && (fsync_cnt == 0)) itx->itx_sync = B_FALSE; itx->itx_callback = callback; itx->itx_callback_data = callback_data; zil_itx_assign(zilog, itx, tx); off += len; resid -= len; } } /* * Handles TX_TRUNCATE transactions. */ void zfs_log_truncate(zilog_t *zilog, dmu_tx_t *tx, int txtype, znode_t *zp, uint64_t off, uint64_t len) { itx_t *itx; lr_truncate_t *lr; if (zil_replaying(zilog, tx) || zp->z_unlinked || zfs_xattr_owner_unlinked(zp)) return; itx = zil_itx_create(txtype, sizeof (*lr)); lr = (lr_truncate_t *)&itx->itx_lr; lr->lr_foid = zp->z_id; lr->lr_offset = off; lr->lr_length = len; itx->itx_sync = (zp->z_sync_cnt != 0); zil_itx_assign(zilog, itx, tx); } /* * Handles TX_SETATTR transactions. */ void zfs_log_setattr(zilog_t *zilog, dmu_tx_t *tx, int txtype, znode_t *zp, vattr_t *vap, uint_t mask_applied, zfs_fuid_info_t *fuidp) { itx_t *itx; lr_setattr_t *lr; xvattr_t *xvap = (xvattr_t *)vap; size_t recsize = sizeof (lr_setattr_t); void *start; if (zil_replaying(zilog, tx) || zp->z_unlinked) return; /* * If XVATTR set, then log record size needs to allow * for lr_attr_t + xvattr mask, mapsize and create time * plus actual attribute values */ if (vap->va_mask & ATTR_XVATTR) recsize = sizeof (*lr) + ZIL_XVAT_SIZE(xvap->xva_mapsize); if (fuidp) recsize += fuidp->z_domain_str_sz; itx = zil_itx_create(txtype, recsize); lr = (lr_setattr_t *)&itx->itx_lr; lr->lr_foid = zp->z_id; lr->lr_mask = (uint64_t)mask_applied; lr->lr_mode = (uint64_t)vap->va_mode; if ((mask_applied & ATTR_UID) && IS_EPHEMERAL(vap->va_uid)) lr->lr_uid = fuidp->z_fuid_owner; else lr->lr_uid = (uint64_t)vap->va_uid; if ((mask_applied & ATTR_GID) && IS_EPHEMERAL(vap->va_gid)) lr->lr_gid = fuidp->z_fuid_group; else lr->lr_gid = (uint64_t)vap->va_gid; lr->lr_size = (uint64_t)vap->va_size; ZFS_TIME_ENCODE(&vap->va_atime, lr->lr_atime); ZFS_TIME_ENCODE(&vap->va_mtime, lr->lr_mtime); start = (lr_setattr_t *)(lr + 1); if (vap->va_mask & ATTR_XVATTR) { zfs_log_xvattr((lr_attr_t *)start, xvap); start = (caddr_t)start + ZIL_XVAT_SIZE(xvap->xva_mapsize); } /* * Now stick on domain information if any on end */ if (fuidp) (void) zfs_log_fuid_domains(fuidp, start); itx->itx_sync = (zp->z_sync_cnt != 0); zil_itx_assign(zilog, itx, tx); } /* * Handles TX_ACL transactions. */ void zfs_log_acl(zilog_t *zilog, dmu_tx_t *tx, znode_t *zp, vsecattr_t *vsecp, zfs_fuid_info_t *fuidp) { itx_t *itx; lr_acl_v0_t *lrv0; lr_acl_t *lr; int txtype; int lrsize; size_t txsize; size_t aclbytes = vsecp->vsa_aclentsz; if (zil_replaying(zilog, tx) || zp->z_unlinked) return; txtype = (ZTOZSB(zp)->z_version < ZPL_VERSION_FUID) ? TX_ACL_V0 : TX_ACL; if (txtype == TX_ACL) lrsize = sizeof (*lr); else lrsize = sizeof (*lrv0); txsize = lrsize + ((txtype == TX_ACL) ? ZIL_ACE_LENGTH(aclbytes) : aclbytes) + (fuidp ? fuidp->z_domain_str_sz : 0) + sizeof (uint64_t) * (fuidp ? fuidp->z_fuid_cnt : 0); itx = zil_itx_create(txtype, txsize); lr = (lr_acl_t *)&itx->itx_lr; lr->lr_foid = zp->z_id; if (txtype == TX_ACL) { lr->lr_acl_bytes = aclbytes; lr->lr_domcnt = fuidp ? fuidp->z_domain_cnt : 0; lr->lr_fuidcnt = fuidp ? fuidp->z_fuid_cnt : 0; if (vsecp->vsa_mask & VSA_ACE_ACLFLAGS) lr->lr_acl_flags = (uint64_t)vsecp->vsa_aclflags; else lr->lr_acl_flags = 0; } lr->lr_aclcnt = (uint64_t)vsecp->vsa_aclcnt; if (txtype == TX_ACL_V0) { lrv0 = (lr_acl_v0_t *)lr; bcopy(vsecp->vsa_aclentp, (ace_t *)(lrv0 + 1), aclbytes); } else { void *start = (ace_t *)(lr + 1); bcopy(vsecp->vsa_aclentp, start, aclbytes); start = (caddr_t)start + ZIL_ACE_LENGTH(aclbytes); if (fuidp) { start = zfs_log_fuid_ids(fuidp, start); (void) zfs_log_fuid_domains(fuidp, start); } } itx->itx_sync = (zp->z_sync_cnt != 0); zil_itx_assign(zilog, itx, tx); } /* BEGIN CSTYLED */ ZFS_MODULE_PARAM(zfs, zfs_, immediate_write_sz, LONG, ZMOD_RW, "Largest data block to write to zil"); /* END CSTYLED */ Index: vendor-sys/openzfs/dist/module/zstd/zfs_zstd.c =================================================================== --- vendor-sys/openzfs/dist/module/zstd/zfs_zstd.c (revision 366347) +++ vendor-sys/openzfs/dist/module/zstd/zfs_zstd.c (revision 366348) @@ -1,738 +1,752 @@ /* * BSD 3-Clause New License (https://spdx.org/licenses/BSD-3-Clause.html) * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions are met: * * 1. Redistributions of source code must retain the above copyright notice, * this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright notice, * this list of conditions and the following disclaimer in the documentation * and/or other materials provided with the distribution. * * 3. Neither the name of the copyright holder nor the names of its * contributors may be used to endorse or promote products derived from this * software without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE * POSSIBILITY OF SUCH DAMAGE. */ /* * Copyright (c) 2016-2018, Klara Inc. * Copyright (c) 2016-2018, Allan Jude * Copyright (c) 2018-2020, Sebastian Gottschall * Copyright (c) 2019-2020, Michael Niewöhner * Copyright (c) 2020, The FreeBSD Foundation [1] * * [1] Portions of this software were developed by Allan Jude * under sponsorship from the FreeBSD Foundation. */ #include #include #include #include #include #include #define ZSTD_STATIC_LINKING_ONLY #include "lib/zstd.h" #include "lib/zstd_errors.h" kstat_t *zstd_ksp = NULL; typedef struct zstd_stats { kstat_named_t zstd_stat_alloc_fail; kstat_named_t zstd_stat_alloc_fallback; kstat_named_t zstd_stat_com_alloc_fail; kstat_named_t zstd_stat_dec_alloc_fail; kstat_named_t zstd_stat_com_inval; kstat_named_t zstd_stat_dec_inval; kstat_named_t zstd_stat_dec_header_inval; kstat_named_t zstd_stat_com_fail; kstat_named_t zstd_stat_dec_fail; } zstd_stats_t; static zstd_stats_t zstd_stats = { { "alloc_fail", KSTAT_DATA_UINT64 }, { "alloc_fallback", KSTAT_DATA_UINT64 }, { "compress_alloc_fail", KSTAT_DATA_UINT64 }, { "decompress_alloc_fail", KSTAT_DATA_UINT64 }, { "compress_level_invalid", KSTAT_DATA_UINT64 }, { "decompress_level_invalid", KSTAT_DATA_UINT64 }, { "decompress_header_invalid", KSTAT_DATA_UINT64 }, { "compress_failed", KSTAT_DATA_UINT64 }, { "decompress_failed", KSTAT_DATA_UINT64 }, }; /* Enums describing the allocator type specified by kmem_type in zstd_kmem */ enum zstd_kmem_type { ZSTD_KMEM_UNKNOWN = 0, /* Allocation type using kmem_vmalloc */ ZSTD_KMEM_DEFAULT, /* Pool based allocation using mempool_alloc */ ZSTD_KMEM_POOL, /* Reserved fallback memory for decompression only */ ZSTD_KMEM_DCTX, ZSTD_KMEM_COUNT, }; /* Structure for pooled memory objects */ struct zstd_pool { void *mem; size_t size; kmutex_t barrier; hrtime_t timeout; }; /* Global structure for handling memory allocations */ struct zstd_kmem { enum zstd_kmem_type kmem_type; size_t kmem_size; struct zstd_pool *pool; }; /* Fallback memory structure used for decompression only if memory runs out */ struct zstd_fallback_mem { size_t mem_size; void *mem; kmutex_t barrier; }; struct zstd_levelmap { int16_t zstd_level; enum zio_zstd_levels level; }; /* * ZSTD memory handlers * * For decompression we use a different handler which also provides fallback * memory allocation in case memory runs out. * * The ZSTD handlers were split up for the most simplified implementation. */ static void *zstd_alloc(void *opaque, size_t size); static void *zstd_dctx_alloc(void *opaque, size_t size); static void zstd_free(void *opaque, void *ptr); /* Compression memory handler */ static const ZSTD_customMem zstd_malloc = { zstd_alloc, zstd_free, NULL, }; /* Decompression memory handler */ static const ZSTD_customMem zstd_dctx_malloc = { zstd_dctx_alloc, zstd_free, NULL, }; /* Level map for converting ZFS internal levels to ZSTD levels and vice versa */ static struct zstd_levelmap zstd_levels[] = { {ZIO_ZSTD_LEVEL_1, ZIO_ZSTD_LEVEL_1}, {ZIO_ZSTD_LEVEL_2, ZIO_ZSTD_LEVEL_2}, {ZIO_ZSTD_LEVEL_3, ZIO_ZSTD_LEVEL_3}, {ZIO_ZSTD_LEVEL_4, ZIO_ZSTD_LEVEL_4}, {ZIO_ZSTD_LEVEL_5, ZIO_ZSTD_LEVEL_5}, {ZIO_ZSTD_LEVEL_6, ZIO_ZSTD_LEVEL_6}, {ZIO_ZSTD_LEVEL_7, ZIO_ZSTD_LEVEL_7}, {ZIO_ZSTD_LEVEL_8, ZIO_ZSTD_LEVEL_8}, {ZIO_ZSTD_LEVEL_9, ZIO_ZSTD_LEVEL_9}, {ZIO_ZSTD_LEVEL_10, ZIO_ZSTD_LEVEL_10}, {ZIO_ZSTD_LEVEL_11, ZIO_ZSTD_LEVEL_11}, {ZIO_ZSTD_LEVEL_12, ZIO_ZSTD_LEVEL_12}, {ZIO_ZSTD_LEVEL_13, ZIO_ZSTD_LEVEL_13}, {ZIO_ZSTD_LEVEL_14, ZIO_ZSTD_LEVEL_14}, {ZIO_ZSTD_LEVEL_15, ZIO_ZSTD_LEVEL_15}, {ZIO_ZSTD_LEVEL_16, ZIO_ZSTD_LEVEL_16}, {ZIO_ZSTD_LEVEL_17, ZIO_ZSTD_LEVEL_17}, {ZIO_ZSTD_LEVEL_18, ZIO_ZSTD_LEVEL_18}, {ZIO_ZSTD_LEVEL_19, ZIO_ZSTD_LEVEL_19}, {-1, ZIO_ZSTD_LEVEL_FAST_1}, {-2, ZIO_ZSTD_LEVEL_FAST_2}, {-3, ZIO_ZSTD_LEVEL_FAST_3}, {-4, ZIO_ZSTD_LEVEL_FAST_4}, {-5, ZIO_ZSTD_LEVEL_FAST_5}, {-6, ZIO_ZSTD_LEVEL_FAST_6}, {-7, ZIO_ZSTD_LEVEL_FAST_7}, {-8, ZIO_ZSTD_LEVEL_FAST_8}, {-9, ZIO_ZSTD_LEVEL_FAST_9}, {-10, ZIO_ZSTD_LEVEL_FAST_10}, {-20, ZIO_ZSTD_LEVEL_FAST_20}, {-30, ZIO_ZSTD_LEVEL_FAST_30}, {-40, ZIO_ZSTD_LEVEL_FAST_40}, {-50, ZIO_ZSTD_LEVEL_FAST_50}, {-60, ZIO_ZSTD_LEVEL_FAST_60}, {-70, ZIO_ZSTD_LEVEL_FAST_70}, {-80, ZIO_ZSTD_LEVEL_FAST_80}, {-90, ZIO_ZSTD_LEVEL_FAST_90}, {-100, ZIO_ZSTD_LEVEL_FAST_100}, {-500, ZIO_ZSTD_LEVEL_FAST_500}, {-1000, ZIO_ZSTD_LEVEL_FAST_1000}, }; /* * This variable represents the maximum count of the pool based on the number * of CPUs plus some buffer. We default to cpu count * 4, see init_zstd. */ static int pool_count = 16; #define ZSTD_POOL_MAX pool_count #define ZSTD_POOL_TIMEOUT 60 * 2 static struct zstd_fallback_mem zstd_dctx_fallback; static struct zstd_pool *zstd_mempool_cctx; static struct zstd_pool *zstd_mempool_dctx; /* * Try to get a cached allocated buffer from memory pool or allocate a new one * if necessary. If a object is older than 2 minutes and does not fit the * requested size, it will be released and a new cached entry will be allocated. * If other pooled objects are detected without being used for 2 minutes, they * will be released, too. * * The concept is that high frequency memory allocations of bigger objects are * expensive. So if a lot of work is going on, allocations will be kept for a * while and can be reused in that time frame. * * The scheduled release will be updated every time a object is reused. */ static void * zstd_mempool_alloc(struct zstd_pool *zstd_mempool, size_t size) { struct zstd_pool *pool; struct zstd_kmem *mem = NULL; if (!zstd_mempool) { return (NULL); } /* Seek for preallocated memory slot and free obsolete slots */ for (int i = 0; i < ZSTD_POOL_MAX; i++) { pool = &zstd_mempool[i]; /* * This lock is simply a marker for a pool object beeing in use. * If it's already hold, it will be skipped. * * We need to create it before checking it to avoid race * conditions caused by running in a threaded context. * * The lock is later released by zstd_mempool_free. */ if (mutex_tryenter(&pool->barrier)) { /* * Check if objects fits the size, if so we take it and * update the timestamp. */ - if (!mem && pool->mem && size <= pool->size) { + if (size && !mem && pool->mem && size <= pool->size) { pool->timeout = gethrestime_sec() + ZSTD_POOL_TIMEOUT; mem = pool->mem; continue; } /* Free memory if unused object older than 2 minutes */ if (pool->mem && gethrestime_sec() > pool->timeout) { vmem_free(pool->mem, pool->size); pool->mem = NULL; pool->size = 0; pool->timeout = 0; } mutex_exit(&pool->barrier); } } - if (mem) { + if (!size || mem) { return (mem); } /* * If no preallocated slot was found, try to fill in a new one. * * We run a similar algorithm twice here to avoid pool fragmentation. * The first one may generate holes in the list if objects get released. * We always make sure that these holes get filled instead of adding new * allocations constantly at the end. */ for (int i = 0; i < ZSTD_POOL_MAX; i++) { pool = &zstd_mempool[i]; if (mutex_tryenter(&pool->barrier)) { /* Object is free, try to allocate new one */ if (!pool->mem) { mem = vmem_alloc(size, KM_SLEEP); pool->mem = mem; if (pool->mem) { /* Keep track for later release */ mem->pool = pool; pool->size = size; mem->kmem_type = ZSTD_KMEM_POOL; mem->kmem_size = size; } } if (size <= pool->size) { /* Update timestamp */ pool->timeout = gethrestime_sec() + ZSTD_POOL_TIMEOUT; return (pool->mem); } mutex_exit(&pool->barrier); } } /* * If the pool is full or the allocation failed, try lazy allocation * instead. */ if (!mem) { mem = vmem_alloc(size, KM_NOSLEEP); if (mem) { mem->pool = NULL; mem->kmem_type = ZSTD_KMEM_DEFAULT; mem->kmem_size = size; } } return (mem); } /* Mark object as released by releasing the barrier mutex */ static void zstd_mempool_free(struct zstd_kmem *z) { mutex_exit(&z->pool->barrier); } /* Convert ZFS internal enum to ZSTD level */ static int zstd_enum_to_level(enum zio_zstd_levels level, int16_t *zstd_level) { if (level > 0 && level <= ZIO_ZSTD_LEVEL_19) { *zstd_level = zstd_levels[level - 1].zstd_level; return (0); } if (level >= ZIO_ZSTD_LEVEL_FAST_1 && level <= ZIO_ZSTD_LEVEL_FAST_1000) { *zstd_level = zstd_levels[level - ZIO_ZSTD_LEVEL_FAST_1 + ZIO_ZSTD_LEVEL_19].zstd_level; return (0); } /* Invalid/unknown zfs compression enum - this should never happen. */ return (1); } /* Compress block using zstd */ size_t zfs_zstd_compress(void *s_start, void *d_start, size_t s_len, size_t d_len, int level) { size_t c_len; int16_t zstd_level; zfs_zstdhdr_t *hdr; ZSTD_CCtx *cctx; hdr = (zfs_zstdhdr_t *)d_start; /* Skip compression if the specified level is invalid */ if (zstd_enum_to_level(level, &zstd_level)) { ZSTDSTAT_BUMP(zstd_stat_com_inval); return (s_len); } ASSERT3U(d_len, >=, sizeof (*hdr)); ASSERT3U(d_len, <=, s_len); ASSERT3U(zstd_level, !=, 0); cctx = ZSTD_createCCtx_advanced(zstd_malloc); /* * Out of kernel memory, gently fall through - this will disable * compression in zio_compress_data */ if (!cctx) { ZSTDSTAT_BUMP(zstd_stat_com_alloc_fail); return (s_len); } /* Set the compression level */ ZSTD_CCtx_setParameter(cctx, ZSTD_c_compressionLevel, zstd_level); /* Use the "magicless" zstd header which saves us 4 header bytes */ ZSTD_CCtx_setParameter(cctx, ZSTD_c_format, ZSTD_f_zstd1_magicless); /* * Disable redundant checksum calculation and content size storage since * this is already done by ZFS itself. */ ZSTD_CCtx_setParameter(cctx, ZSTD_c_checksumFlag, 0); ZSTD_CCtx_setParameter(cctx, ZSTD_c_contentSizeFlag, 0); c_len = ZSTD_compress2(cctx, hdr->data, d_len - sizeof (*hdr), s_start, s_len); ZSTD_freeCCtx(cctx); /* Error in the compression routine, disable compression. */ if (ZSTD_isError(c_len)) { /* * If we are aborting the compression because the saves are * too small, that is not a failure. Everything else is a * failure, so increment the compression failure counter. */ if (ZSTD_getErrorCode(c_len) != ZSTD_error_dstSize_tooSmall) { ZSTDSTAT_BUMP(zstd_stat_com_fail); } return (s_len); } /* * Encode the compressed buffer size at the start. We'll need this in * decompression to counter the effects of padding which might be added * to the compressed buffer and which, if unhandled, would confuse the * hell out of our decompression function. */ hdr->c_len = BE_32(c_len); /* * Check version for overflow. * The limit of 24 bits must not be exceeded. This allows a maximum * version 1677.72.15 which we don't expect to be ever reached. */ ASSERT3U(ZSTD_VERSION_NUMBER, <=, 0xFFFFFF); /* * Encode the compression level as well. We may need to know the * original compression level if compressed_arc is disabled, to match * the compression settings to write this block to the L2ARC. * * Encode the actual level, so if the enum changes in the future, we * will be compatible. * * The upper 24 bits store the ZSTD version to be able to provide * future compatibility, since new versions might enhance the * compression algorithm in a way, where the compressed data will * change. * * As soon as such incompatibility occurs, handling code needs to be * added, differentiating between the versions. */ hdr->version = ZSTD_VERSION_NUMBER; hdr->level = level; hdr->raw_version_level = BE_32(hdr->raw_version_level); return (c_len + sizeof (*hdr)); } /* Decompress block using zstd and return its stored level */ int zfs_zstd_decompress_level(void *s_start, void *d_start, size_t s_len, size_t d_len, uint8_t *level) { ZSTD_DCtx *dctx; size_t result; int16_t zstd_level; uint32_t c_len; const zfs_zstdhdr_t *hdr; zfs_zstdhdr_t hdr_copy; hdr = (const zfs_zstdhdr_t *)s_start; c_len = BE_32(hdr->c_len); /* * Make a copy instead of directly converting the header, since we must * not modify the original data that may be used again later. */ hdr_copy.raw_version_level = BE_32(hdr->raw_version_level); /* * NOTE: We ignore the ZSTD version for now. As soon as any * incompatibility occurrs, it has to be handled accordingly. * The version can be accessed via `hdr_copy.version`. */ /* * Convert and check the level * An invalid level is a strong indicator for data corruption! In such * case return an error so the upper layers can try to fix it. */ if (zstd_enum_to_level(hdr_copy.level, &zstd_level)) { ZSTDSTAT_BUMP(zstd_stat_dec_inval); return (1); } ASSERT3U(d_len, >=, s_len); ASSERT3U(hdr_copy.level, !=, ZIO_COMPLEVEL_INHERIT); /* Invalid compressed buffer size encoded at start */ if (c_len + sizeof (*hdr) > s_len) { ZSTDSTAT_BUMP(zstd_stat_dec_header_inval); return (1); } dctx = ZSTD_createDCtx_advanced(zstd_dctx_malloc); if (!dctx) { ZSTDSTAT_BUMP(zstd_stat_dec_alloc_fail); return (1); } /* Set header type to "magicless" */ ZSTD_DCtx_setParameter(dctx, ZSTD_d_format, ZSTD_f_zstd1_magicless); /* Decompress the data and release the context */ result = ZSTD_decompressDCtx(dctx, d_start, d_len, hdr->data, c_len); ZSTD_freeDCtx(dctx); /* * Returns 0 on success (decompression function returned non-negative) * and non-zero on failure (decompression function returned negative. */ if (ZSTD_isError(result)) { ZSTDSTAT_BUMP(zstd_stat_dec_fail); return (1); } if (level) { *level = hdr_copy.level; } return (0); } /* Decompress datablock using zstd */ int zfs_zstd_decompress(void *s_start, void *d_start, size_t s_len, size_t d_len, int level __maybe_unused) { return (zfs_zstd_decompress_level(s_start, d_start, s_len, d_len, NULL)); } /* Allocator for zstd compression context using mempool_allocator */ static void * zstd_alloc(void *opaque __maybe_unused, size_t size) { size_t nbytes = sizeof (struct zstd_kmem) + size; struct zstd_kmem *z = NULL; z = (struct zstd_kmem *)zstd_mempool_alloc(zstd_mempool_cctx, nbytes); if (!z) { ZSTDSTAT_BUMP(zstd_stat_alloc_fail); return (NULL); } return ((void*)z + (sizeof (struct zstd_kmem))); } /* * Allocator for zstd decompression context using mempool_allocator with * fallback to reserved memory if allocation fails */ static void * zstd_dctx_alloc(void *opaque __maybe_unused, size_t size) { size_t nbytes = sizeof (struct zstd_kmem) + size; struct zstd_kmem *z = NULL; enum zstd_kmem_type type = ZSTD_KMEM_DEFAULT; z = (struct zstd_kmem *)zstd_mempool_alloc(zstd_mempool_dctx, nbytes); if (!z) { /* Try harder, decompression shall not fail */ z = vmem_alloc(nbytes, KM_SLEEP); if (z) { z->pool = NULL; } ZSTDSTAT_BUMP(zstd_stat_alloc_fail); } else { return ((void*)z + (sizeof (struct zstd_kmem))); } /* Fallback if everything fails */ if (!z) { /* * Barrier since we only can handle it in a single thread. All * other following threads need to wait here until decompression * is completed. zstd_free will release this barrier later. */ mutex_enter(&zstd_dctx_fallback.barrier); z = zstd_dctx_fallback.mem; type = ZSTD_KMEM_DCTX; ZSTDSTAT_BUMP(zstd_stat_alloc_fallback); } /* Allocation should always be successful */ if (!z) { return (NULL); } z->kmem_type = type; z->kmem_size = nbytes; return ((void*)z + (sizeof (struct zstd_kmem))); } /* Free allocated memory by its specific type */ static void zstd_free(void *opaque __maybe_unused, void *ptr) { struct zstd_kmem *z = (ptr - sizeof (struct zstd_kmem)); enum zstd_kmem_type type; ASSERT3U(z->kmem_type, <, ZSTD_KMEM_COUNT); ASSERT3U(z->kmem_type, >, ZSTD_KMEM_UNKNOWN); type = z->kmem_type; switch (type) { case ZSTD_KMEM_DEFAULT: vmem_free(z, z->kmem_size); break; case ZSTD_KMEM_POOL: zstd_mempool_free(z); break; case ZSTD_KMEM_DCTX: mutex_exit(&zstd_dctx_fallback.barrier); break; default: break; } } /* Allocate fallback memory to ensure safe decompression */ static void __init create_fallback_mem(struct zstd_fallback_mem *mem, size_t size) { mem->mem_size = size; mem->mem = vmem_zalloc(mem->mem_size, KM_SLEEP); mutex_init(&mem->barrier, NULL, MUTEX_DEFAULT, NULL); } /* Initialize memory pool barrier mutexes */ static void __init zstd_mempool_init(void) { zstd_mempool_cctx = (struct zstd_pool *) kmem_zalloc(ZSTD_POOL_MAX * sizeof (struct zstd_pool), KM_SLEEP); zstd_mempool_dctx = (struct zstd_pool *) kmem_zalloc(ZSTD_POOL_MAX * sizeof (struct zstd_pool), KM_SLEEP); for (int i = 0; i < ZSTD_POOL_MAX; i++) { mutex_init(&zstd_mempool_cctx[i].barrier, NULL, MUTEX_DEFAULT, NULL); mutex_init(&zstd_mempool_dctx[i].barrier, NULL, MUTEX_DEFAULT, NULL); } } /* Initialize zstd-related memory handling */ static int __init zstd_meminit(void) { zstd_mempool_init(); /* * Estimate the size of the fallback decompression context. * The expected size on x64 with current ZSTD should be about 160 KB. */ create_fallback_mem(&zstd_dctx_fallback, P2ROUNDUP(ZSTD_estimateDCtxSize() + sizeof (struct zstd_kmem), PAGESIZE)); return (0); } /* Release object from pool and free memory */ static void __exit release_pool(struct zstd_pool *pool) { mutex_destroy(&pool->barrier); vmem_free(pool->mem, pool->size); pool->mem = NULL; pool->size = 0; } /* Release memory pool objects */ static void __exit zstd_mempool_deinit(void) { for (int i = 0; i < ZSTD_POOL_MAX; i++) { release_pool(&zstd_mempool_cctx[i]); release_pool(&zstd_mempool_dctx[i]); } kmem_free(zstd_mempool_dctx, ZSTD_POOL_MAX * sizeof (struct zstd_pool)); kmem_free(zstd_mempool_cctx, ZSTD_POOL_MAX * sizeof (struct zstd_pool)); zstd_mempool_dctx = NULL; zstd_mempool_cctx = NULL; } +/* release unused memory from pool */ + +void +zfs_zstd_cache_reap_now(void) +{ + /* + * calling alloc with zero size seeks + * and releases old unused objects + */ + zstd_mempool_alloc(zstd_mempool_cctx, 0); + zstd_mempool_alloc(zstd_mempool_dctx, 0); +} + extern int __init zstd_init(void) { /* Set pool size by using maximum sane thread count * 4 */ pool_count = (boot_ncpus * 4); zstd_meminit(); /* Initialize kstat */ zstd_ksp = kstat_create("zfs", 0, "zstd", "misc", KSTAT_TYPE_NAMED, sizeof (zstd_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL); if (zstd_ksp != NULL) { zstd_ksp->ks_data = &zstd_stats; kstat_install(zstd_ksp); } return (0); } extern void __exit zstd_fini(void) { /* Deinitialize kstat */ if (zstd_ksp != NULL) { kstat_delete(zstd_ksp); zstd_ksp = NULL; } /* Release fallback memory */ vmem_free(zstd_dctx_fallback.mem, zstd_dctx_fallback.mem_size); mutex_destroy(&zstd_dctx_fallback.barrier); /* Deinit memory pool */ zstd_mempool_deinit(); } #if defined(_KERNEL) module_init(zstd_init); module_exit(zstd_fini); ZFS_MODULE_DESCRIPTION("ZSTD Compression for ZFS"); -ZFS_MODULE_LICENSE("BSD"); +ZFS_MODULE_LICENSE("Dual BSD/GPL"); ZFS_MODULE_VERSION(ZSTD_VERSION_STRING); EXPORT_SYMBOL(zfs_zstd_compress); EXPORT_SYMBOL(zfs_zstd_decompress_level); EXPORT_SYMBOL(zfs_zstd_decompress); +EXPORT_SYMBOL(zfs_zstd_cache_reap_now); #endif Index: vendor-sys/openzfs/dist/tests/runfiles/common.run =================================================================== --- vendor-sys/openzfs/dist/tests/runfiles/common.run (revision 366347) +++ vendor-sys/openzfs/dist/tests/runfiles/common.run (revision 366348) @@ -1,900 +1,900 @@ # # This file and its contents are supplied under the terms of the # Common Development and Distribution License ("CDDL"), version 1.0. # You may only use this file in accordance with the terms of version # 1.0 of the CDDL. # # A full copy of the text of the CDDL should have accompanied this # source. A copy of the CDDL is also available via the Internet at # http://www.illumos.org/license/CDDL. # [DEFAULT] pre = setup quiet = False pre_user = root user = root timeout = 600 post_user = root post = cleanup failsafe_user = root failsafe = callbacks/zfs_failsafe outputdir = /var/tmp/test_results tags = ['functional'] [tests/functional/alloc_class] tests = ['alloc_class_001_pos', 'alloc_class_002_neg', 'alloc_class_003_pos', 'alloc_class_004_pos', 'alloc_class_005_pos', 'alloc_class_006_pos', 'alloc_class_007_pos', 'alloc_class_008_pos', 'alloc_class_009_pos', 'alloc_class_010_pos', 'alloc_class_011_neg', 'alloc_class_012_pos', 'alloc_class_013_pos'] tags = ['functional', 'alloc_class'] [tests/functional/atime] tests = ['atime_001_pos', 'atime_002_neg', 'root_atime_off', 'root_atime_on'] tags = ['functional', 'atime'] [tests/functional/bootfs] tests = ['bootfs_001_pos', 'bootfs_002_neg', 'bootfs_003_pos', 'bootfs_004_neg', 'bootfs_005_neg', 'bootfs_006_pos', 'bootfs_007_pos', 'bootfs_008_pos'] tags = ['functional', 'bootfs'] [tests/functional/btree] tests = ['btree_positive', 'btree_negative'] tags = ['functional', 'btree'] pre = post = [tests/functional/cache] tests = ['cache_001_pos', 'cache_002_pos', 'cache_003_pos', 'cache_004_neg', 'cache_005_neg', 'cache_006_pos', 'cache_007_neg', 'cache_008_neg', 'cache_009_pos', 'cache_010_pos', 'cache_011_pos', 'cache_012_pos'] tags = ['functional', 'cache'] [tests/functional/cachefile] tests = ['cachefile_001_pos', 'cachefile_002_pos', 'cachefile_003_pos', 'cachefile_004_pos'] tags = ['functional', 'cachefile'] [tests/functional/casenorm] tests = ['case_all_values', 'norm_all_values', 'mixed_create_failure', 'sensitive_none_lookup', 'sensitive_none_delete', 'sensitive_formd_lookup', 'sensitive_formd_delete', 'insensitive_none_lookup', 'insensitive_none_delete', 'insensitive_formd_lookup', 'insensitive_formd_delete', 'mixed_none_lookup', 'mixed_none_lookup_ci', 'mixed_none_delete', 'mixed_formd_lookup', 'mixed_formd_lookup_ci', 'mixed_formd_delete'] tags = ['functional', 'casenorm'] [tests/functional/channel_program/lua_core] tests = ['tst.args_to_lua', 'tst.divide_by_zero', 'tst.exists', 'tst.integer_illegal', 'tst.integer_overflow', 'tst.language_functions_neg', 'tst.language_functions_pos', 'tst.large_prog', 'tst.libraries', 'tst.memory_limit', 'tst.nested_neg', 'tst.nested_pos', 'tst.nvlist_to_lua', 'tst.recursive_neg', 'tst.recursive_pos', 'tst.return_large', 'tst.return_nvlist_neg', 'tst.return_nvlist_pos', 'tst.return_recursive_table', 'tst.stack_gsub', 'tst.timeout'] tags = ['functional', 'channel_program', 'lua_core'] [tests/functional/channel_program/synctask_core] tests = ['tst.destroy_fs', 'tst.destroy_snap', 'tst.get_count_and_limit', 'tst.get_index_props', 'tst.get_mountpoint', 'tst.get_neg', 'tst.get_number_props', 'tst.get_string_props', 'tst.get_type', 'tst.get_userquota', 'tst.get_written', 'tst.inherit', 'tst.list_bookmarks', 'tst.list_children', 'tst.list_clones', 'tst.list_holds', 'tst.list_snapshots', 'tst.list_system_props', 'tst.list_user_props', 'tst.parse_args_neg','tst.promote_conflict', 'tst.promote_multiple', 'tst.promote_simple', 'tst.rollback_mult', 'tst.rollback_one', 'tst.set_props', 'tst.snapshot_destroy', 'tst.snapshot_neg', 'tst.snapshot_recursive', 'tst.snapshot_simple', 'tst.bookmark.create', 'tst.bookmark.copy', 'tst.terminate_by_signal' ] tags = ['functional', 'channel_program', 'synctask_core'] [tests/functional/checksum] tests = ['run_sha2_test', 'run_skein_test', 'filetest_001_pos'] tags = ['functional', 'checksum'] [tests/functional/clean_mirror] tests = [ 'clean_mirror_001_pos', 'clean_mirror_002_pos', 'clean_mirror_003_pos', 'clean_mirror_004_pos'] tags = ['functional', 'clean_mirror'] [tests/functional/cli_root/zdb] tests = ['zdb_002_pos', 'zdb_003_pos', 'zdb_004_pos', 'zdb_005_pos', 'zdb_006_pos', 'zdb_args_neg', 'zdb_args_pos', 'zdb_block_size_histogram', 'zdb_checksum', 'zdb_decompress', 'zdb_display_block', 'zdb_object_range_neg', 'zdb_object_range_pos', 'zdb_objset_id', 'zdb_decompress_zstd'] pre = post = tags = ['functional', 'cli_root', 'zdb'] [tests/functional/cli_root/zfs] tests = ['zfs_001_neg', 'zfs_002_pos'] tags = ['functional', 'cli_root', 'zfs'] [tests/functional/cli_root/zfs_bookmark] tests = ['zfs_bookmark_cliargs'] tags = ['functional', 'cli_root', 'zfs_bookmark'] [tests/functional/cli_root/zfs_change-key] tests = ['zfs_change-key', 'zfs_change-key_child', 'zfs_change-key_format', 'zfs_change-key_inherit', 'zfs_change-key_load', 'zfs_change-key_location', 'zfs_change-key_pbkdf2iters', 'zfs_change-key_clones'] tags = ['functional', 'cli_root', 'zfs_change-key'] [tests/functional/cli_root/zfs_clone] tests = ['zfs_clone_001_neg', 'zfs_clone_002_pos', 'zfs_clone_003_pos', 'zfs_clone_004_pos', 'zfs_clone_005_pos', 'zfs_clone_006_pos', 'zfs_clone_007_pos', 'zfs_clone_008_neg', 'zfs_clone_009_neg', 'zfs_clone_010_pos', 'zfs_clone_encrypted', 'zfs_clone_deeply_nested'] tags = ['functional', 'cli_root', 'zfs_clone'] [tests/functional/cli_root/zfs_copies] tests = ['zfs_copies_001_pos', 'zfs_copies_002_pos', 'zfs_copies_003_pos', 'zfs_copies_004_neg', 'zfs_copies_005_neg', 'zfs_copies_006_pos'] tags = ['functional', 'cli_root', 'zfs_copies'] [tests/functional/cli_root/zfs_create] tests = ['zfs_create_001_pos', 'zfs_create_002_pos', 'zfs_create_003_pos', 'zfs_create_004_pos', 'zfs_create_005_pos', 'zfs_create_006_pos', 'zfs_create_007_pos', 'zfs_create_008_neg', 'zfs_create_009_neg', 'zfs_create_010_neg', 'zfs_create_011_pos', 'zfs_create_012_pos', 'zfs_create_013_pos', 'zfs_create_014_pos', 'zfs_create_encrypted', 'zfs_create_crypt_combos', 'zfs_create_dryrun', 'zfs_create_verbose'] tags = ['functional', 'cli_root', 'zfs_create'] [tests/functional/cli_root/zfs_destroy] tests = ['zfs_clone_livelist_condense_and_disable', 'zfs_clone_livelist_condense_races', 'zfs_destroy_001_pos', 'zfs_destroy_002_pos', 'zfs_destroy_003_pos', 'zfs_destroy_004_pos', 'zfs_destroy_005_neg', 'zfs_destroy_006_neg', 'zfs_destroy_007_neg', 'zfs_destroy_008_pos', 'zfs_destroy_009_pos', 'zfs_destroy_010_pos', 'zfs_destroy_011_pos', 'zfs_destroy_012_pos', 'zfs_destroy_013_neg', 'zfs_destroy_014_pos', 'zfs_destroy_015_pos', 'zfs_destroy_016_pos', 'zfs_destroy_clone_livelist', 'zfs_destroy_dev_removal', 'zfs_destroy_dev_removal_condense'] tags = ['functional', 'cli_root', 'zfs_destroy'] [tests/functional/cli_root/zfs_diff] tests = ['zfs_diff_changes', 'zfs_diff_cliargs', 'zfs_diff_timestamp', 'zfs_diff_types', 'zfs_diff_encrypted'] tags = ['functional', 'cli_root', 'zfs_diff'] [tests/functional/cli_root/zfs_get] tests = ['zfs_get_001_pos', 'zfs_get_002_pos', 'zfs_get_003_pos', 'zfs_get_004_pos', 'zfs_get_005_neg', 'zfs_get_006_neg', 'zfs_get_007_neg', 'zfs_get_008_pos', 'zfs_get_009_pos', 'zfs_get_010_neg'] tags = ['functional', 'cli_root', 'zfs_get'] [tests/functional/cli_root/zfs_ids_to_path] tests = ['zfs_ids_to_path_001_pos'] tags = ['functional', 'cli_root', 'zfs_ids_to_path'] [tests/functional/cli_root/zfs_inherit] tests = ['zfs_inherit_001_neg', 'zfs_inherit_002_neg', 'zfs_inherit_003_pos', 'zfs_inherit_mountpoint'] tags = ['functional', 'cli_root', 'zfs_inherit'] [tests/functional/cli_root/zfs_load-key] tests = ['zfs_load-key', 'zfs_load-key_all', 'zfs_load-key_file', 'zfs_load-key_location', 'zfs_load-key_noop', 'zfs_load-key_recursive'] tags = ['functional', 'cli_root', 'zfs_load-key'] [tests/functional/cli_root/zfs_mount] tests = ['zfs_mount_001_pos', 'zfs_mount_002_pos', 'zfs_mount_003_pos', 'zfs_mount_004_pos', 'zfs_mount_005_pos', 'zfs_mount_007_pos', 'zfs_mount_009_neg', 'zfs_mount_010_neg', 'zfs_mount_011_neg', 'zfs_mount_012_pos', 'zfs_mount_all_001_pos', 'zfs_mount_encrypted', 'zfs_mount_remount', 'zfs_mount_all_fail', 'zfs_mount_all_mountpoints', 'zfs_mount_test_race'] tags = ['functional', 'cli_root', 'zfs_mount'] [tests/functional/cli_root/zfs_program] tests = ['zfs_program_json'] tags = ['functional', 'cli_root', 'zfs_program'] [tests/functional/cli_root/zfs_promote] tests = ['zfs_promote_001_pos', 'zfs_promote_002_pos', 'zfs_promote_003_pos', 'zfs_promote_004_pos', 'zfs_promote_005_pos', 'zfs_promote_006_neg', 'zfs_promote_007_neg', 'zfs_promote_008_pos', 'zfs_promote_encryptionroot'] tags = ['functional', 'cli_root', 'zfs_promote'] [tests/functional/cli_root/zfs_property] tests = ['zfs_written_property_001_pos'] tags = ['functional', 'cli_root', 'zfs_property'] [tests/functional/cli_root/zfs_receive] tests = ['zfs_receive_001_pos', 'zfs_receive_002_pos', 'zfs_receive_003_pos', 'zfs_receive_004_neg', 'zfs_receive_005_neg', 'zfs_receive_006_pos', 'zfs_receive_007_neg', 'zfs_receive_008_pos', 'zfs_receive_009_neg', 'zfs_receive_010_pos', 'zfs_receive_011_pos', 'zfs_receive_012_pos', 'zfs_receive_013_pos', 'zfs_receive_014_pos', 'zfs_receive_015_pos', 'zfs_receive_016_pos', 'receive-o-x_props_override', 'zfs_receive_from_encrypted', 'zfs_receive_to_encrypted', 'zfs_receive_raw', 'zfs_receive_raw_incremental', 'zfs_receive_-e', 'zfs_receive_raw_-d', 'zfs_receive_from_zstd', 'zfs_receive_new_props'] tags = ['functional', 'cli_root', 'zfs_receive'] [tests/functional/cli_root/zfs_rename] tests = ['zfs_rename_001_pos', 'zfs_rename_002_pos', 'zfs_rename_003_pos', 'zfs_rename_004_neg', 'zfs_rename_005_neg', 'zfs_rename_006_pos', 'zfs_rename_007_pos', 'zfs_rename_008_pos', 'zfs_rename_009_neg', 'zfs_rename_010_neg', 'zfs_rename_011_pos', 'zfs_rename_012_neg', 'zfs_rename_013_pos', 'zfs_rename_014_neg', 'zfs_rename_encrypted_child', 'zfs_rename_to_encrypted', 'zfs_rename_mountpoint', 'zfs_rename_nounmount'] tags = ['functional', 'cli_root', 'zfs_rename'] [tests/functional/cli_root/zfs_reservation] tests = ['zfs_reservation_001_pos', 'zfs_reservation_002_pos'] tags = ['functional', 'cli_root', 'zfs_reservation'] [tests/functional/cli_root/zfs_rollback] tests = ['zfs_rollback_001_pos', 'zfs_rollback_002_pos', 'zfs_rollback_003_neg', 'zfs_rollback_004_neg'] tags = ['functional', 'cli_root', 'zfs_rollback'] [tests/functional/cli_root/zfs_send] tests = ['zfs_send_001_pos', 'zfs_send_002_pos', 'zfs_send_003_pos', 'zfs_send_004_neg', 'zfs_send_005_pos', 'zfs_send_006_pos', 'zfs_send_007_pos', 'zfs_send_encrypted', 'zfs_send_raw', 'zfs_send_sparse', 'zfs_send-b'] tags = ['functional', 'cli_root', 'zfs_send'] [tests/functional/cli_root/zfs_set] tests = ['cache_001_pos', 'cache_002_neg', 'canmount_001_pos', 'canmount_002_pos', 'canmount_003_pos', 'canmount_004_pos', 'checksum_001_pos', 'compression_001_pos', 'mountpoint_001_pos', 'mountpoint_002_pos', 'reservation_001_neg', 'user_property_002_pos', 'share_mount_001_neg', 'snapdir_001_pos', 'onoffs_001_pos', 'user_property_001_pos', 'user_property_003_neg', 'readonly_001_pos', 'user_property_004_pos', 'version_001_neg', 'zfs_set_001_neg', 'zfs_set_002_neg', 'zfs_set_003_neg', 'property_alias_001_pos', 'mountpoint_003_pos', 'ro_props_001_pos', 'zfs_set_keylocation', 'zfs_set_feature_activation'] tags = ['functional', 'cli_root', 'zfs_set'] [tests/functional/cli_root/zfs_share] tests = ['zfs_share_001_pos', 'zfs_share_002_pos', 'zfs_share_003_pos', 'zfs_share_004_pos', 'zfs_share_006_pos', 'zfs_share_008_neg', 'zfs_share_010_neg', 'zfs_share_011_pos', 'zfs_share_concurrent_shares'] tags = ['functional', 'cli_root', 'zfs_share'] [tests/functional/cli_root/zfs_snapshot] tests = ['zfs_snapshot_001_neg', 'zfs_snapshot_002_neg', 'zfs_snapshot_003_neg', 'zfs_snapshot_004_neg', 'zfs_snapshot_005_neg', 'zfs_snapshot_006_pos', 'zfs_snapshot_007_neg', 'zfs_snapshot_008_neg', 'zfs_snapshot_009_pos'] tags = ['functional', 'cli_root', 'zfs_snapshot'] [tests/functional/cli_root/zfs_unload-key] tests = ['zfs_unload-key', 'zfs_unload-key_all', 'zfs_unload-key_recursive'] tags = ['functional', 'cli_root', 'zfs_unload-key'] [tests/functional/cli_root/zfs_unmount] tests = ['zfs_unmount_001_pos', 'zfs_unmount_002_pos', 'zfs_unmount_003_pos', 'zfs_unmount_004_pos', 'zfs_unmount_005_pos', 'zfs_unmount_006_pos', 'zfs_unmount_007_neg', 'zfs_unmount_008_neg', 'zfs_unmount_009_pos', 'zfs_unmount_all_001_pos', 'zfs_unmount_nested', 'zfs_unmount_unload_keys'] tags = ['functional', 'cli_root', 'zfs_unmount'] [tests/functional/cli_root/zfs_unshare] tests = ['zfs_unshare_001_pos', 'zfs_unshare_002_pos', 'zfs_unshare_003_pos', 'zfs_unshare_004_neg', 'zfs_unshare_005_neg', 'zfs_unshare_006_pos', 'zfs_unshare_007_pos'] tags = ['functional', 'cli_root', 'zfs_unshare'] [tests/functional/cli_root/zfs_upgrade] tests = ['zfs_upgrade_001_pos', 'zfs_upgrade_002_pos', 'zfs_upgrade_003_pos', 'zfs_upgrade_004_pos', 'zfs_upgrade_005_pos', 'zfs_upgrade_006_neg', 'zfs_upgrade_007_neg'] tags = ['functional', 'cli_root', 'zfs_upgrade'] [tests/functional/cli_root/zfs_wait] tests = ['zfs_wait_deleteq'] tags = ['functional', 'cli_root', 'zfs_wait'] [tests/functional/cli_root/zpool] tests = ['zpool_001_neg', 'zpool_002_pos', 'zpool_003_pos', 'zpool_colors'] tags = ['functional', 'cli_root', 'zpool'] [tests/functional/cli_root/zpool_add] tests = ['zpool_add_001_pos', 'zpool_add_002_pos', 'zpool_add_003_pos', 'zpool_add_004_pos', 'zpool_add_006_pos', 'zpool_add_007_neg', 'zpool_add_008_neg', 'zpool_add_009_neg', 'zpool_add_010_pos', 'add-o_ashift', 'add_prop_ashift'] tags = ['functional', 'cli_root', 'zpool_add'] [tests/functional/cli_root/zpool_attach] tests = ['zpool_attach_001_neg', 'attach-o_ashift'] tags = ['functional', 'cli_root', 'zpool_attach'] [tests/functional/cli_root/zpool_clear] tests = ['zpool_clear_001_pos', 'zpool_clear_002_neg', 'zpool_clear_003_neg', 'zpool_clear_readonly'] tags = ['functional', 'cli_root', 'zpool_clear'] [tests/functional/cli_root/zpool_create] tests = ['zpool_create_001_pos', 'zpool_create_002_pos', 'zpool_create_003_pos', 'zpool_create_004_pos', 'zpool_create_005_pos', 'zpool_create_006_pos', 'zpool_create_007_neg', 'zpool_create_008_pos', 'zpool_create_009_neg', 'zpool_create_010_neg', 'zpool_create_011_neg', 'zpool_create_012_neg', 'zpool_create_014_neg', 'zpool_create_015_neg', 'zpool_create_017_neg', 'zpool_create_018_pos', 'zpool_create_019_pos', 'zpool_create_020_pos', 'zpool_create_021_pos', 'zpool_create_022_pos', 'zpool_create_023_neg', 'zpool_create_024_pos', 'zpool_create_encrypted', 'zpool_create_crypt_combos', 'zpool_create_features_001_pos', 'zpool_create_features_002_pos', 'zpool_create_features_003_pos', 'zpool_create_features_004_neg', 'zpool_create_features_005_pos', 'create-o_ashift', 'zpool_create_tempname'] tags = ['functional', 'cli_root', 'zpool_create'] [tests/functional/cli_root/zpool_destroy] tests = ['zpool_destroy_001_pos', 'zpool_destroy_002_pos', 'zpool_destroy_003_neg'] pre = post = tags = ['functional', 'cli_root', 'zpool_destroy'] [tests/functional/cli_root/zpool_detach] tests = ['zpool_detach_001_neg'] tags = ['functional', 'cli_root', 'zpool_detach'] [tests/functional/cli_root/zpool_events] tests = ['zpool_events_clear', 'zpool_events_cliargs', 'zpool_events_follow', 'zpool_events_poolname', 'zpool_events_errors', 'zpool_events_duplicates'] tags = ['functional', 'cli_root', 'zpool_events'] [tests/functional/cli_root/zpool_export] tests = ['zpool_export_001_pos', 'zpool_export_002_pos', 'zpool_export_003_neg', 'zpool_export_004_pos'] tags = ['functional', 'cli_root', 'zpool_export'] [tests/functional/cli_root/zpool_get] tests = ['zpool_get_001_pos', 'zpool_get_002_pos', 'zpool_get_003_pos', 'zpool_get_004_neg', 'zpool_get_005_pos'] tags = ['functional', 'cli_root', 'zpool_get'] [tests/functional/cli_root/zpool_history] tests = ['zpool_history_001_neg', 'zpool_history_002_pos'] tags = ['functional', 'cli_root', 'zpool_history'] [tests/functional/cli_root/zpool_import] tests = ['zpool_import_001_pos', 'zpool_import_002_pos', 'zpool_import_003_pos', 'zpool_import_004_pos', 'zpool_import_005_pos', 'zpool_import_006_pos', 'zpool_import_007_pos', 'zpool_import_008_pos', 'zpool_import_009_neg', 'zpool_import_010_pos', 'zpool_import_011_neg', 'zpool_import_012_pos', 'zpool_import_013_neg', 'zpool_import_014_pos', 'zpool_import_015_pos', 'zpool_import_features_001_pos', 'zpool_import_features_002_neg', 'zpool_import_features_003_pos', 'zpool_import_missing_001_pos', 'zpool_import_missing_002_pos', 'zpool_import_missing_003_pos', 'zpool_import_rename_001_pos', 'zpool_import_all_001_pos', 'zpool_import_encrypted', 'zpool_import_encrypted_load', 'zpool_import_errata3', 'zpool_import_errata4', 'import_cachefile_device_added', 'import_cachefile_device_removed', 'import_cachefile_device_replaced', 'import_cachefile_mirror_attached', 'import_cachefile_mirror_detached', 'import_cachefile_shared_device', 'import_devices_missing', 'import_paths_changed', 'import_rewind_config_changed', 'import_rewind_device_replaced'] tags = ['functional', 'cli_root', 'zpool_import'] [tests/functional/cli_root/zpool_labelclear] tests = ['zpool_labelclear_active', 'zpool_labelclear_exported', 'zpool_labelclear_removed', 'zpool_labelclear_valid'] pre = post = tags = ['functional', 'cli_root', 'zpool_labelclear'] [tests/functional/cli_root/zpool_initialize] tests = ['zpool_initialize_attach_detach_add_remove', 'zpool_initialize_import_export', 'zpool_initialize_offline_export_import_online', 'zpool_initialize_online_offline', 'zpool_initialize_split', 'zpool_initialize_start_and_cancel_neg', 'zpool_initialize_start_and_cancel_pos', 'zpool_initialize_suspend_resume', 'zpool_initialize_unsupported_vdevs', 'zpool_initialize_verify_checksums', 'zpool_initialize_verify_initialized'] pre = tags = ['functional', 'cli_root', 'zpool_initialize'] [tests/functional/cli_root/zpool_offline] tests = ['zpool_offline_001_pos', 'zpool_offline_002_neg', 'zpool_offline_003_pos'] tags = ['functional', 'cli_root', 'zpool_offline'] [tests/functional/cli_root/zpool_online] tests = ['zpool_online_001_pos', 'zpool_online_002_neg'] tags = ['functional', 'cli_root', 'zpool_online'] [tests/functional/cli_root/zpool_remove] tests = ['zpool_remove_001_neg', 'zpool_remove_002_pos', 'zpool_remove_003_pos'] tags = ['functional', 'cli_root', 'zpool_remove'] [tests/functional/cli_root/zpool_replace] tests = ['zpool_replace_001_neg', 'replace-o_ashift', 'replace_prop_ashift'] tags = ['functional', 'cli_root', 'zpool_replace'] [tests/functional/cli_root/zpool_resilver] tests = ['zpool_resilver_bad_args', 'zpool_resilver_restart'] tags = ['functional', 'cli_root', 'zpool_resilver'] [tests/functional/cli_root/zpool_scrub] tests = ['zpool_scrub_001_neg', 'zpool_scrub_002_pos', 'zpool_scrub_003_pos', 'zpool_scrub_004_pos', 'zpool_scrub_005_pos', 'zpool_scrub_encrypted_unloaded', 'zpool_scrub_print_repairing', 'zpool_scrub_offline_device', 'zpool_scrub_multiple_copies'] tags = ['functional', 'cli_root', 'zpool_scrub'] [tests/functional/cli_root/zpool_set] tests = ['zpool_set_001_pos', 'zpool_set_002_neg', 'zpool_set_003_neg', 'zpool_set_ashift', 'zpool_set_features'] tags = ['functional', 'cli_root', 'zpool_set'] [tests/functional/cli_root/zpool_split] tests = ['zpool_split_cliargs', 'zpool_split_devices', 'zpool_split_encryption', 'zpool_split_props', 'zpool_split_vdevs', 'zpool_split_resilver', 'zpool_split_indirect'] tags = ['functional', 'cli_root', 'zpool_split'] [tests/functional/cli_root/zpool_status] tests = ['zpool_status_001_pos', 'zpool_status_002_pos'] tags = ['functional', 'cli_root', 'zpool_status'] [tests/functional/cli_root/zpool_sync] tests = ['zpool_sync_001_pos', 'zpool_sync_002_neg'] tags = ['functional', 'cli_root', 'zpool_sync'] [tests/functional/cli_root/zpool_trim] tests = ['zpool_trim_attach_detach_add_remove', 'zpool_trim_import_export', 'zpool_trim_multiple', 'zpool_trim_neg', 'zpool_trim_offline_export_import_online', 'zpool_trim_online_offline', 'zpool_trim_partial', 'zpool_trim_rate', 'zpool_trim_rate_neg', 'zpool_trim_secure', 'zpool_trim_split', 'zpool_trim_start_and_cancel_neg', 'zpool_trim_start_and_cancel_pos', 'zpool_trim_suspend_resume', 'zpool_trim_unsupported_vdevs', 'zpool_trim_verify_checksums', 'zpool_trim_verify_trimmed'] tags = ['functional', 'zpool_trim'] [tests/functional/cli_root/zpool_upgrade] tests = ['zpool_upgrade_001_pos', 'zpool_upgrade_002_pos', 'zpool_upgrade_003_pos', 'zpool_upgrade_004_pos', 'zpool_upgrade_005_neg', 'zpool_upgrade_006_neg', 'zpool_upgrade_007_pos', 'zpool_upgrade_008_pos', 'zpool_upgrade_009_neg'] tags = ['functional', 'cli_root', 'zpool_upgrade'] [tests/functional/cli_root/zpool_wait] tests = ['zpool_wait_discard', 'zpool_wait_freeing', 'zpool_wait_initialize_basic', 'zpool_wait_initialize_cancel', 'zpool_wait_initialize_flag', 'zpool_wait_multiple', 'zpool_wait_no_activity', 'zpool_wait_remove', 'zpool_wait_remove_cancel', 'zpool_wait_trim_basic', 'zpool_wait_trim_cancel', 'zpool_wait_trim_flag', 'zpool_wait_usage'] tags = ['functional', 'cli_root', 'zpool_wait'] [tests/functional/cli_root/zpool_wait/scan] tests = ['zpool_wait_replace_cancel', 'zpool_wait_rebuild', 'zpool_wait_resilver', 'zpool_wait_scrub_cancel', 'zpool_wait_replace', 'zpool_wait_scrub_basic', 'zpool_wait_scrub_flag'] tags = ['functional', 'cli_root', 'zpool_wait'] [tests/functional/cli_user/misc] tests = ['zdb_001_neg', 'zfs_001_neg', 'zfs_allow_001_neg', 'zfs_clone_001_neg', 'zfs_create_001_neg', 'zfs_destroy_001_neg', 'zfs_get_001_neg', 'zfs_inherit_001_neg', 'zfs_mount_001_neg', 'zfs_promote_001_neg', 'zfs_receive_001_neg', 'zfs_rename_001_neg', 'zfs_rollback_001_neg', 'zfs_send_001_neg', 'zfs_set_001_neg', 'zfs_share_001_neg', 'zfs_snapshot_001_neg', 'zfs_unallow_001_neg', 'zfs_unmount_001_neg', 'zfs_unshare_001_neg', 'zfs_upgrade_001_neg', 'zpool_001_neg', 'zpool_add_001_neg', 'zpool_attach_001_neg', 'zpool_clear_001_neg', 'zpool_create_001_neg', 'zpool_destroy_001_neg', 'zpool_detach_001_neg', 'zpool_export_001_neg', 'zpool_get_001_neg', 'zpool_history_001_neg', 'zpool_import_001_neg', 'zpool_import_002_neg', 'zpool_offline_001_neg', 'zpool_online_001_neg', 'zpool_remove_001_neg', 'zpool_replace_001_neg', 'zpool_scrub_001_neg', 'zpool_set_001_neg', 'zpool_status_001_neg', 'zpool_upgrade_001_neg', 'arcstat_001_pos', 'arc_summary_001_pos', 'arc_summary_002_neg', 'zpool_wait_privilege'] user = tags = ['functional', 'cli_user', 'misc'] [tests/functional/cli_user/zfs_list] tests = ['zfs_list_001_pos', 'zfs_list_002_pos', 'zfs_list_003_pos', 'zfs_list_004_neg', 'zfs_list_007_pos', 'zfs_list_008_neg'] user = tags = ['functional', 'cli_user', 'zfs_list'] [tests/functional/cli_user/zpool_iostat] tests = ['zpool_iostat_001_neg', 'zpool_iostat_002_pos', 'zpool_iostat_003_neg', 'zpool_iostat_004_pos', 'zpool_iostat_005_pos', 'zpool_iostat_-c_disable', 'zpool_iostat_-c_homedir', 'zpool_iostat_-c_searchpath'] user = tags = ['functional', 'cli_user', 'zpool_iostat'] [tests/functional/cli_user/zpool_list] tests = ['zpool_list_001_pos', 'zpool_list_002_neg'] user = tags = ['functional', 'cli_user', 'zpool_list'] [tests/functional/cli_user/zpool_status] tests = ['zpool_status_003_pos', 'zpool_status_-c_disable', 'zpool_status_-c_homedir', 'zpool_status_-c_searchpath'] user = tags = ['functional', 'cli_user', 'zpool_status'] [tests/functional/compression] tests = ['compress_001_pos', 'compress_002_pos', 'compress_003_pos', 'l2arc_compressed_arc', 'l2arc_compressed_arc_disabled', 'l2arc_encrypted', 'l2arc_encrypted_no_compressed_arc'] tags = ['functional', 'compression'] [tests/functional/cp_files] tests = ['cp_files_001_pos'] tags = ['functional', 'cp_files'] [tests/functional/ctime] tests = ['ctime_001_pos' ] tags = ['functional', 'ctime'] [tests/functional/delegate] tests = ['zfs_allow_001_pos', 'zfs_allow_002_pos', 'zfs_allow_003_pos', 'zfs_allow_004_pos', 'zfs_allow_005_pos', 'zfs_allow_006_pos', 'zfs_allow_007_pos', 'zfs_allow_008_pos', 'zfs_allow_009_neg', 'zfs_allow_010_pos', 'zfs_allow_011_neg', 'zfs_allow_012_neg', 'zfs_unallow_001_pos', 'zfs_unallow_002_pos', 'zfs_unallow_003_pos', 'zfs_unallow_004_pos', 'zfs_unallow_005_pos', 'zfs_unallow_006_pos', 'zfs_unallow_007_neg', 'zfs_unallow_008_neg'] tags = ['functional', 'delegate'] [tests/functional/exec] tests = ['exec_001_pos', 'exec_002_neg'] tags = ['functional', 'exec'] [tests/functional/features/async_destroy] tests = ['async_destroy_001_pos'] tags = ['functional', 'features', 'async_destroy'] [tests/functional/features/large_dnode] tests = ['large_dnode_001_pos', 'large_dnode_003_pos', 'large_dnode_004_neg', 'large_dnode_005_pos', 'large_dnode_007_neg', 'large_dnode_009_pos'] tags = ['functional', 'features', 'large_dnode'] [tests/functional/grow] pre = post = tests = ['grow_pool_001_pos', 'grow_replicas_001_pos'] tags = ['functional', 'grow'] [tests/functional/history] tests = ['history_001_pos', 'history_002_pos', 'history_003_pos', 'history_004_pos', 'history_005_neg', 'history_006_neg', 'history_007_pos', 'history_008_pos', 'history_009_pos', 'history_010_pos'] tags = ['functional', 'history'] [tests/functional/hkdf] tests = ['run_hkdf_test'] tags = ['functional', 'hkdf'] [tests/functional/inheritance] tests = ['inherit_001_pos'] pre = tags = ['functional', 'inheritance'] [tests/functional/io] tests = ['sync', 'psync', 'posixaio', 'mmap'] tags = ['functional', 'io'] [tests/functional/inuse] tests = ['inuse_004_pos', 'inuse_005_pos', 'inuse_008_pos', 'inuse_009_pos'] post = tags = ['functional', 'inuse'] [tests/functional/large_files] tests = ['large_files_001_pos', 'large_files_002_pos'] tags = ['functional', 'large_files'] [tests/functional/largest_pool] tests = ['largest_pool_001_pos'] pre = post = tags = ['functional', 'largest_pool'] [tests/functional/limits] tests = ['filesystem_count', 'filesystem_limit', 'snapshot_count', 'snapshot_limit'] tags = ['functional', 'limits'] [tests/functional/link_count] tests = ['link_count_001', 'link_count_root_inode'] tags = ['functional', 'link_count'] [tests/functional/migration] tests = ['migration_001_pos', 'migration_002_pos', 'migration_003_pos', 'migration_004_pos', 'migration_005_pos', 'migration_006_pos', 'migration_007_pos', 'migration_008_pos', 'migration_009_pos', 'migration_010_pos', 'migration_011_pos', 'migration_012_pos'] tags = ['functional', 'migration'] [tests/functional/mmap] tests = ['mmap_write_001_pos', 'mmap_read_001_pos'] tags = ['functional', 'mmap'] [tests/functional/mount] tests = ['umount_001', 'umountall_001'] tags = ['functional', 'mount'] [tests/functional/mv_files] tests = ['mv_files_001_pos', 'mv_files_002_pos', 'random_creation'] tags = ['functional', 'mv_files'] [tests/functional/nestedfs] tests = ['nestedfs_001_pos'] tags = ['functional', 'nestedfs'] [tests/functional/no_space] tests = ['enospc_001_pos', 'enospc_002_pos', 'enospc_003_pos', 'enospc_df'] tags = ['functional', 'no_space'] [tests/functional/nopwrite] tests = ['nopwrite_copies', 'nopwrite_mtime', 'nopwrite_negative', 'nopwrite_promoted_clone', 'nopwrite_recsize', 'nopwrite_sync', 'nopwrite_varying_compression', 'nopwrite_volume'] tags = ['functional', 'nopwrite'] [tests/functional/online_offline] tests = ['online_offline_001_pos', 'online_offline_002_neg', 'online_offline_003_neg'] tags = ['functional', 'online_offline'] [tests/functional/persist_l2arc] tests = ['persist_l2arc_001_pos', 'persist_l2arc_002_pos', 'persist_l2arc_003_neg', 'persist_l2arc_004_pos', 'persist_l2arc_005_pos', 'persist_l2arc_006_pos', 'persist_l2arc_007_pos', 'persist_l2arc_008_pos'] tags = ['functional', 'persist_l2arc'] [tests/functional/pool_checkpoint] tests = ['checkpoint_after_rewind', 'checkpoint_big_rewind', 'checkpoint_capacity', 'checkpoint_conf_change', 'checkpoint_discard', 'checkpoint_discard_busy', 'checkpoint_discard_many', 'checkpoint_indirect', 'checkpoint_invalid', 'checkpoint_lun_expsz', 'checkpoint_open', 'checkpoint_removal', 'checkpoint_rewind', 'checkpoint_ro_rewind', 'checkpoint_sm_scale', 'checkpoint_twice', 'checkpoint_vdev_add', 'checkpoint_zdb', 'checkpoint_zhack_feat'] tags = ['functional', 'pool_checkpoint'] timeout = 1800 [tests/functional/pool_names] tests = ['pool_names_001_pos', 'pool_names_002_neg'] pre = post = tags = ['functional', 'pool_names'] [tests/functional/poolversion] tests = ['poolversion_001_pos', 'poolversion_002_pos'] tags = ['functional', 'poolversion'] [tests/functional/pyzfs] tests = ['pyzfs_unittest'] pre = post = tags = ['functional', 'pyzfs'] [tests/functional/quota] tests = ['quota_001_pos', 'quota_002_pos', 'quota_003_pos', 'quota_004_pos', 'quota_005_pos', 'quota_006_neg'] tags = ['functional', 'quota'] [tests/functional/redacted_send] tests = ['redacted_compressed', 'redacted_contents', 'redacted_deleted', 'redacted_disabled_feature', 'redacted_embedded', 'redacted_holes', 'redacted_incrementals', 'redacted_largeblocks', 'redacted_many_clones', 'redacted_mixed_recsize', 'redacted_mounts', 'redacted_negative', 'redacted_origin', 'redacted_props', 'redacted_resume', 'redacted_size', 'redacted_volume'] tags = ['functional', 'redacted_send'] [tests/functional/raidz] tests = ['raidz_001_neg', 'raidz_002_pos'] tags = ['functional', 'raidz'] [tests/functional/redundancy] tests = ['redundancy_001_pos', 'redundancy_002_pos', 'redundancy_003_pos', 'redundancy_004_neg'] tags = ['functional', 'redundancy'] [tests/functional/refquota] tests = ['refquota_001_pos', 'refquota_002_pos', 'refquota_003_pos', 'refquota_004_pos', 'refquota_005_pos', 'refquota_006_neg', 'refquota_007_neg', 'refquota_008_neg'] tags = ['functional', 'refquota'] [tests/functional/refreserv] tests = ['refreserv_001_pos', 'refreserv_002_pos', 'refreserv_003_pos', 'refreserv_004_pos', 'refreserv_005_pos', 'refreserv_multi_raidz', 'refreserv_raidz'] tags = ['functional', 'refreserv'] [tests/functional/removal] pre = tests = ['removal_all_vdev', 'removal_cancel', 'removal_check_space', 'removal_condense_export', 'removal_multiple_indirection', 'removal_nopwrite', 'removal_remap_deadlists', 'removal_resume_export', 'removal_sanity', 'removal_with_add', 'removal_with_create_fs', 'removal_with_dedup', 'removal_with_errors', 'removal_with_export', 'removal_with_ganging', 'removal_with_faulted', 'removal_with_remove', 'removal_with_scrub', 'removal_with_send', 'removal_with_send_recv', 'removal_with_snapshot', 'removal_with_write', 'removal_with_zdb', 'remove_expanded', 'remove_mirror', 'remove_mirror_sanity', 'remove_raidz', 'remove_indirect'] tags = ['functional', 'removal'] [tests/functional/rename_dirs] tests = ['rename_dirs_001_pos'] tags = ['functional', 'rename_dirs'] [tests/functional/replacement] tests = ['attach_import', 'attach_multiple', 'attach_rebuild', 'attach_resilver', 'detach', 'rebuild_disabled_feature', 'rebuild_multiple', 'rebuild_raidz', 'replace_import', 'replace_rebuild', 'replace_resilver', 'resilver_restart_001', 'resilver_restart_002', 'scrub_cancel'] tags = ['functional', 'replacement'] [tests/functional/reservation] tests = ['reservation_001_pos', 'reservation_002_pos', 'reservation_003_pos', 'reservation_004_pos', 'reservation_005_pos', 'reservation_006_pos', 'reservation_007_pos', 'reservation_008_pos', 'reservation_009_pos', 'reservation_010_pos', 'reservation_011_pos', 'reservation_012_pos', 'reservation_013_pos', 'reservation_014_pos', 'reservation_015_pos', 'reservation_016_pos', 'reservation_017_pos', 'reservation_018_pos', 'reservation_019_pos', 'reservation_020_pos', 'reservation_021_neg', 'reservation_022_pos'] tags = ['functional', 'reservation'] [tests/functional/rootpool] tests = ['rootpool_002_neg', 'rootpool_003_neg', 'rootpool_007_pos'] tags = ['functional', 'rootpool'] [tests/functional/rsend] tests = ['recv_dedup', 'recv_dedup_encrypted_zvol', 'rsend_001_pos', 'rsend_002_pos', 'rsend_003_pos', 'rsend_004_pos', 'rsend_005_pos', 'rsend_006_pos', 'rsend_007_pos', 'rsend_008_pos', 'rsend_009_pos', 'rsend_010_pos', 'rsend_011_pos', 'rsend_012_pos', 'rsend_013_pos', 'rsend_014_pos', 'rsend_016_neg', 'rsend_019_pos', 'rsend_020_pos', 'rsend_021_pos', 'rsend_022_pos', 'rsend_024_pos', 'send-c_verify_ratio', 'send-c_verify_contents', 'send-c_props', 'send-c_incremental', 'send-c_volume', 'send-c_zstreamdump', 'send-c_lz4_disabled', 'send-c_recv_lz4_disabled', 'send-c_mixed_compression', 'send-c_stream_size_estimate', 'send-c_embedded_blocks', 'send-c_resume', 'send-cpL_varied_recsize', 'send-c_recv_dedup', 'send-L_toggle', 'send_encrypted_hierarchy', 'send_encrypted_props', 'send_encrypted_truncated_files', 'send_freeobjects', 'send_realloc_files', 'send_realloc_encrypted_files', 'send_spill_block', 'send_holds', 'send_hole_birth', 'send_mixed_raw', 'send-wR_encrypted_zvol', - 'send_partial_dataset'] + 'send_partial_dataset', 'send_invalid'] tags = ['functional', 'rsend'] [tests/functional/scrub_mirror] tests = ['scrub_mirror_001_pos', 'scrub_mirror_002_pos', 'scrub_mirror_003_pos', 'scrub_mirror_004_pos'] tags = ['functional', 'scrub_mirror'] [tests/functional/slog] tests = ['slog_001_pos', 'slog_002_pos', 'slog_003_pos', 'slog_004_pos', 'slog_005_pos', 'slog_006_pos', 'slog_007_pos', 'slog_008_neg', 'slog_009_neg', 'slog_010_neg', 'slog_011_neg', 'slog_012_neg', 'slog_013_pos', 'slog_014_pos', 'slog_015_neg', 'slog_replay_fs_001', 'slog_replay_fs_002', 'slog_replay_volume'] tags = ['functional', 'slog'] [tests/functional/snapshot] tests = ['clone_001_pos', 'rollback_001_pos', 'rollback_002_pos', 'rollback_003_pos', 'snapshot_001_pos', 'snapshot_002_pos', 'snapshot_003_pos', 'snapshot_004_pos', 'snapshot_005_pos', 'snapshot_006_pos', 'snapshot_007_pos', 'snapshot_008_pos', 'snapshot_009_pos', 'snapshot_010_pos', 'snapshot_011_pos', 'snapshot_012_pos', 'snapshot_013_pos', 'snapshot_014_pos', 'snapshot_017_pos'] tags = ['functional', 'snapshot'] [tests/functional/snapused] tests = ['snapused_001_pos', 'snapused_002_pos', 'snapused_003_pos', 'snapused_004_pos', 'snapused_005_pos'] tags = ['functional', 'snapused'] [tests/functional/sparse] tests = ['sparse_001_pos'] tags = ['functional', 'sparse'] [tests/functional/suid] tests = ['suid_write_to_suid', 'suid_write_to_sgid', 'suid_write_to_suid_sgid', 'suid_write_to_none'] tags = ['functional', 'suid'] [tests/functional/threadsappend] tests = ['threadsappend_001_pos'] tags = ['functional', 'threadsappend'] [tests/functional/trim] tests = ['autotrim_integrity', 'autotrim_config', 'autotrim_trim_integrity', 'trim_integrity', 'trim_config', 'trim_l2arc'] tags = ['functional', 'trim'] [tests/functional/truncate] tests = ['truncate_001_pos', 'truncate_002_pos', 'truncate_timestamps'] tags = ['functional', 'truncate'] [tests/functional/upgrade] tests = ['upgrade_userobj_001_pos', 'upgrade_readonly_pool'] tags = ['functional', 'upgrade'] [tests/functional/userquota] tests = [ 'userquota_001_pos', 'userquota_002_pos', 'userquota_003_pos', 'userquota_004_pos', 'userquota_005_neg', 'userquota_006_pos', 'userquota_007_pos', 'userquota_008_pos', 'userquota_009_pos', 'userquota_010_pos', 'userquota_011_pos', 'userquota_012_neg', 'userspace_001_pos', 'userspace_002_pos'] tags = ['functional', 'userquota'] [tests/functional/vdev_zaps] tests = ['vdev_zaps_001_pos', 'vdev_zaps_002_pos', 'vdev_zaps_003_pos', 'vdev_zaps_004_pos', 'vdev_zaps_005_pos', 'vdev_zaps_006_pos', 'vdev_zaps_007_pos'] tags = ['functional', 'vdev_zaps'] [tests/functional/write_dirs] tests = ['write_dirs_001_pos', 'write_dirs_002_pos'] tags = ['functional', 'write_dirs'] [tests/functional/xattr] tests = ['xattr_001_pos', 'xattr_002_neg', 'xattr_003_neg', 'xattr_004_pos', 'xattr_005_pos', 'xattr_006_pos', 'xattr_007_neg', 'xattr_011_pos', 'xattr_012_pos', 'xattr_013_pos'] tags = ['functional', 'xattr'] [tests/functional/zvol/zvol_ENOSPC] tests = ['zvol_ENOSPC_001_pos'] tags = ['functional', 'zvol', 'zvol_ENOSPC'] [tests/functional/zvol/zvol_cli] tests = ['zvol_cli_001_pos', 'zvol_cli_002_pos', 'zvol_cli_003_neg'] tags = ['functional', 'zvol', 'zvol_cli'] [tests/functional/zvol/zvol_misc] tests = ['zvol_misc_002_pos', 'zvol_misc_hierarchy', 'zvol_misc_rename_inuse', 'zvol_misc_snapdev', 'zvol_misc_volmode', 'zvol_misc_zil'] tags = ['functional', 'zvol', 'zvol_misc'] [tests/functional/zvol/zvol_swap] tests = ['zvol_swap_001_pos', 'zvol_swap_002_pos', 'zvol_swap_004_pos'] tags = ['functional', 'zvol', 'zvol_swap'] [tests/functional/libzfs] tests = ['many_fds', 'libzfs_input'] tags = ['functional', 'libzfs'] [tests/functional/log_spacemap] tests = ['log_spacemap_import_logs'] pre = post = tags = ['functional', 'log_spacemap'] Index: vendor-sys/openzfs/dist/tests/zfs-tests/cmd/Makefile.am =================================================================== --- vendor-sys/openzfs/dist/tests/zfs-tests/cmd/Makefile.am (revision 366347) +++ vendor-sys/openzfs/dist/tests/zfs-tests/cmd/Makefile.am (revision 366348) @@ -1,34 +1,35 @@ EXTRA_DIST = file_common.h SUBDIRS = \ + badsend \ btree_test \ chg_usr_exec \ devname2devid \ dir_rd_update \ file_check \ file_trunc \ file_write \ get_diff \ largest_file \ libzfs_input_check \ mkbusy \ mkfile \ mkfiles \ mktree \ mmap_exec \ mmap_libaio \ mmapwrite \ nvlist_to_lua \ randwritecomp \ readmmap \ rename_dir \ rm_lnkcnt_zero_file \ stride_dd \ threadsappend if BUILD_LINUX SUBDIRS += \ randfree_file \ user_ns_exec \ xattrtest endif Index: vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/.gitignore =================================================================== --- vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/.gitignore (nonexistent) +++ vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/.gitignore (revision 366348) @@ -0,0 +1 @@ +/badsend Index: vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/Makefile.am =================================================================== --- vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/Makefile.am (nonexistent) +++ vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/Makefile.am (revision 366348) @@ -0,0 +1,11 @@ +include $(top_srcdir)/config/Rules.am + +pkgexecdir = $(datadir)/@PACKAGE@/zfs-tests/bin + +pkgexec_PROGRAMS = badsend + +badsend_SOURCES = badsend.c +badsend_LDADD = \ + $(abs_top_builddir)/lib/libzfs_core/libzfs_core.la \ + $(abs_top_builddir)/lib/libzfs/libzfs.la \ + $(abs_top_builddir)/lib/libnvpair/libnvpair.la Property changes on: vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/Makefile.am ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/badsend.c =================================================================== --- vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/badsend.c (nonexistent) +++ vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/badsend.c (revision 366348) @@ -0,0 +1,136 @@ +/* + * CDDL HEADER START + * + * The contents of this file are subject to the terms of the + * Common Development and Distribution License (the "License"). + * You may not use this file except in compliance with the License. + * + * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE + * or http://www.opensolaris.org/os/licensing. + * See the License for the specific language governing permissions + * and limitations under the License. + * + * When distributing Covered Code, include this CDDL HEADER in each + * file and include the License file at usr/src/OPENSOLARIS.LICENSE. + * If applicable, add the following below this CDDL HEADER, with the + * fields enclosed by brackets "[]" replaced with your own identifying + * information: Portions Copyright [yyyy] [name of copyright owner] + * + * CDDL HEADER END + */ + +/* + * Portions Copyright 2020 iXsystems, Inc. + */ + +/* + * Test some invalid send operations with libzfs/libzfs_core. + * + * Specifying the to and from snaps in the wrong order should return EXDEV. + * We are checking that the early return doesn't accidentally leave any + * references held, so this test is designed to trigger a panic when asserts + * are verified with the bug present. + */ + +#include +#include + +#include +#include +#include +#include +#include +#include + +static void +usage(const char *name) +{ + fprintf(stderr, "usage: %s snap0 snap1\n", name); + exit(EX_USAGE); +} + +int +main(int argc, char const * const argv[]) +{ + sendflags_t flags = { 0 }; + libzfs_handle_t *zhdl; + zfs_handle_t *zhp; + const char *fromfull, *tofull, *fsname, *fromsnap, *tosnap, *p; + uint64_t size; + int fd, error; + + if (argc != 3) + usage(argv[0]); + + fromfull = argv[1]; + tofull = argv[2]; + + p = strchr(fromfull, '@'); + if (p == NULL) + usage(argv[0]); + fromsnap = p + 1; + + p = strchr(tofull, '@'); + if (p == NULL) + usage(argv[0]); + tosnap = p + 1; + + fsname = strndup(tofull, p - tofull); + if (strncmp(fsname, fromfull, p - tofull) != 0) + usage(argv[0]); + + fd = open("/dev/null", O_WRONLY); + if (fd == -1) + err(EX_OSERR, "open(\"/dev/null\", O_WRONLY)"); + + zhdl = libzfs_init(); + if (zhdl == NULL) + errx(EX_OSERR, "libzfs_init(): %s", libzfs_error_init(errno)); + + zhp = zfs_open(zhdl, fsname, ZFS_TYPE_FILESYSTEM); + if (zhp == NULL) + err(EX_OSERR, "zfs_open(\"%s\")", fsname); + + /* + * Exercise EXDEV in dmu_send_obj. The error gets translated to + * EZFS_CROSSTARGET in libzfs. + */ + error = zfs_send(zhp, tosnap, fromsnap, &flags, fd, NULL, NULL, NULL); + if (error == 0 || libzfs_errno(zhdl) != EZFS_CROSSTARGET) + errx(EX_OSERR, "zfs_send(\"%s\", \"%s\") should have failed " + "with EZFS_CROSSTARGET, not %d", + tofull, fromfull, libzfs_errno(zhdl)); + printf("zfs_send(\"%s\", \"%s\"): %s\n", + tofull, fromfull, libzfs_error_description(zhdl)); + + zfs_close(zhp); + + /* + * Exercise EXDEV in dmu_send. + */ + error = lzc_send_resume_redacted(fromfull, tofull, fd, 0, 0, 0, NULL); + if (error != EXDEV) + errx(EX_OSERR, "lzc_send_resume_redacted(\"%s\", \"%s\")" + " should have failed with EXDEV, not %d", + fromfull, tofull, error); + printf("lzc_send_resume_redacted(\"%s\", \"%s\"): %s\n", + fromfull, tofull, strerror(error)); + + /* + * Exercise EXDEV in dmu_send_estimate_fast. + */ + error = lzc_send_space_resume_redacted(fromfull, tofull, 0, 0, 0, 0, + NULL, fd, &size); + if (error != EXDEV) + errx(EX_OSERR, "lzc_send_space_resume_redacted(\"%s\", \"%s\")" + " should have failed with EXDEV, not %d", + fromfull, tofull, error); + printf("lzc_send_space_resume_redacted(\"%s\", \"%s\"): %s\n", + fromfull, tofull, strerror(error)); + + close(fd); + libzfs_fini(zhdl); + free((void *)fsname); + + return (EXIT_SUCCESS); +} Property changes on: vendor-sys/openzfs/dist/tests/zfs-tests/cmd/badsend/badsend.c ___________________________________________________________________ Added: svn:eol-style ## -0,0 +1 ## +native \ No newline at end of property Added: svn:keywords ## -0,0 +1 ## +FreeBSD=%H \ No newline at end of property Added: svn:mime-type ## -0,0 +1 ## +text/plain \ No newline at end of property Index: vendor-sys/openzfs/dist/tests/zfs-tests/include/commands.cfg =================================================================== --- vendor-sys/openzfs/dist/tests/zfs-tests/include/commands.cfg (revision 366347) +++ vendor-sys/openzfs/dist/tests/zfs-tests/include/commands.cfg (revision 366348) @@ -1,219 +1,220 @@ # # Copyright (c) 2016, 2019 by Delphix. All rights reserved. # These variables are used by zfs-tests.sh to constrain which utilities # may be used by the suite. The suite will create a directory which is # the only element of $PATH and create symlinks from that dir to the # binaries listed below. # # Please keep the contents of each variable sorted for ease of reading # and maintenance. # export SYSTEM_FILES_COMMON='arp awk base64 basename bc bunzip2 bzcat cat chgrp chmod chown cksum cmp cp cpio cut date dd df diff dirname dmesg du echo egrep expr false file find fio getconf getent getfacl grep gunzip gzip head hostname id iostat kill ksh ln logname ls mkdir mknod mktemp mount mv net od openssl pamtester pax pgrep ping pkill printenv printf ps pwd python python2 python3 quotaon readlink rm rmdir scp script sed seq setfacl sh sleep sort ssh stat strings su sudo sum swapoff swapon sync tail tar tee timeout touch tr true truncate umask umount uname uniq uuidgen vmstat wait wc which xargs' export SYSTEM_FILES_FREEBSD='chflags compress diskinfo dumpon env fsck getextattr gpart jail jexec jls lsextattr md5 mdconfig mkfifo newfs pw rmextattr setextattr sha256 showmount swapctl sysctl uncompress' export SYSTEM_FILES_LINUX='attr bash blkid blockdev chattr dmidecode exportfs fallocate fdisk free getfattr groupadd groupdel groupmod hostid losetup lsattr lsblk lscpu lsmod lsscsi md5sum mkswap modprobe mpstat nproc parted perf setenforce setfattr sha256sum udevadm useradd userdel usermod' export ZFS_FILES='zdb zfs zhack zinject zpool ztest raidz_test arc_summary arcstat dbufstat zed zgenhostid zstream zstreamdump zfs_ids_to_path' -export ZFSTEST_FILES='btree_test +export ZFSTEST_FILES='badsend + btree_test chg_usr_exec devname2devid dir_rd_update file_check file_trunc file_write get_diff largest_file libzfs_input_check mkbusy mkfile mkfiles mktree mmap_exec mmap_libaio mmapwrite nvlist_to_lua randfree_file randwritecomp readmmap rename_dir rm_lnkcnt_zero_file threadsappend user_ns_exec xattrtest stride_dd' Index: vendor-sys/openzfs/dist/tests/zfs-tests/tests/functional/rsend/Makefile.am =================================================================== --- vendor-sys/openzfs/dist/tests/zfs-tests/tests/functional/rsend/Makefile.am (revision 366347) +++ vendor-sys/openzfs/dist/tests/zfs-tests/tests/functional/rsend/Makefile.am (revision 366348) @@ -1,63 +1,64 @@ pkgdatadir = $(datadir)/@PACKAGE@/zfs-tests/tests/functional/rsend dist_pkgdata_SCRIPTS = \ setup.ksh \ cleanup.ksh \ recv_dedup.ksh \ recv_dedup_encrypted_zvol.ksh \ rsend_001_pos.ksh \ rsend_002_pos.ksh \ rsend_003_pos.ksh \ rsend_004_pos.ksh \ rsend_005_pos.ksh \ rsend_006_pos.ksh \ rsend_007_pos.ksh \ rsend_008_pos.ksh \ rsend_009_pos.ksh \ rsend_010_pos.ksh \ rsend_011_pos.ksh \ rsend_012_pos.ksh \ rsend_013_pos.ksh \ rsend_014_pos.ksh \ rsend_016_neg.ksh \ rsend_019_pos.ksh \ rsend_020_pos.ksh \ rsend_021_pos.ksh \ rsend_022_pos.ksh \ rsend_024_pos.ksh \ send_encrypted_files.ksh \ send_encrypted_hierarchy.ksh \ send_encrypted_props.ksh \ send_encrypted_truncated_files.ksh \ send-c_embedded_blocks.ksh \ send-c_incremental.ksh \ send-c_lz4_disabled.ksh \ send-c_mixed_compression.ksh \ send-c_props.ksh \ send-c_recv_dedup.ksh \ send-c_recv_lz4_disabled.ksh \ send-c_resume.ksh \ send-c_stream_size_estimate.ksh \ send-c_verify_contents.ksh \ send-c_verify_ratio.ksh \ send-c_volume.ksh \ send-c_zstreamdump.ksh \ send-cpL_varied_recsize.ksh \ send-L_toggle.ksh \ send_freeobjects.ksh \ send_partial_dataset.ksh \ send_realloc_dnode_size.ksh \ send_realloc_files.ksh \ send_realloc_encrypted_files.ksh \ send_spill_block.ksh \ send_holds.ksh \ send_hole_birth.ksh \ + send_invalid.ksh \ send_mixed_raw.ksh \ send-wR_encrypted_zvol.ksh dist_pkgdata_DATA = \ dedup.zsend.bz2 \ dedup_encrypted_zvol.bz2 \ dedup_encrypted_zvol.zsend.bz2 \ fs.tar.gz \ rsend.cfg \ rsend.kshlib Index: vendor-sys/openzfs/dist/tests/zfs-tests/tests/functional/rsend/send_invalid.ksh =================================================================== --- vendor-sys/openzfs/dist/tests/zfs-tests/tests/functional/rsend/send_invalid.ksh (nonexistent) +++ vendor-sys/openzfs/dist/tests/zfs-tests/tests/functional/rsend/send_invalid.ksh (revision 366348) @@ -0,0 +1,52 @@ +#!/bin/ksh + +# +# This file and its contents are supplied under the terms of the +# Common Development and Distribution License ("CDDL"), version a.0. +# You may only use this file in accordance with the terms of version +# a.0 of the CDDL. +# +# A full copy of the text of the CDDL should have accompanied this +# source. A copy of the CDDL is also available via the Internet at +# http://www.illumos.org/license/CDDL. +# + +# +# Portions Copyright 2020 iXsystems, Inc. +# + +. $STF_SUITE/include/libtest.shlib +. $STF_SUITE/tests/functional/rsend/rsend.kshlib + +# +# Description: +# Verify that send with invalid options will fail gracefully. +# +# Strategy: +# 1. Perform zfs send on the cli with the order of the snapshots reversed +# 2. Perform zfs send using libzfs with the order of the snapshots reversed +# + +verify_runnable "both" + +log_assert "Verify that send with invalid options will fail gracefully." + +function cleanup +{ + datasetexists $testfs && destroy_dataset $testfs -r +} +log_onexit cleanup + +testfs=$POOL/fs + +log_must zfs create $testfs +log_must zfs snap $testfs@snap0 +log_must zfs snap $testfs@snap1 + +# Test bad send with the CLI +log_mustnot eval "zfs send -i $testfs@snap1 $testfs@snap0 >/dev/null" + +# Test bad send with libzfs/libzfs_core +log_must badsend $testfs@snap0 $testfs@snap1 + +log_pass "Send with invalid options fails gracefully."