	Index: stable/8/sys/amd64/include/xen
	===================================================================
	--- stable/8/sys/amd64/include/xen (revision 209273)
	+++ stable/8/sys/amd64/include/xen (revision 209274)

	Property changes on: stable/8/sys/amd64/include/xen
	___________________________________________________________________
	Modified: svn:mergeinfo
	## -0,0 +0,1 ##
	Merged /head/sys/amd64/include/xen:r209093-209101
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 209274)
	@@ -1,5023 +1,5023 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	/*
	* DVA-based Adjustable Replacement Cache
	*
	* While much of the theory of operation used here is
	* based on the self-tuning, low overhead replacement cache
	* presented by Megiddo and Modha at FAST 2003, there are some
	* significant differences:
	*
	* 1. The Megiddo and Modha model assumes any page is evictable.
	* Pages in its cache cannot be "locked" into memory. This makes
	* the eviction algorithm simple: evict the last page in the list.
	* This also makes the performance characteristics easy to reason
	* about. Our cache is not so simple. At any given moment, some
	* subset of the blocks in the cache are un-evictable because we
	* have handed out a reference to them. Blocks are only evictable
	* when there are no external references active. This makes
	* eviction far more problematic: we choose to evict the evictable
	* blocks that are the "lowest" in the list.
	*
	* There are times when it is not possible to evict the requested
	* space. In these circumstances we are unable to adjust the cache
	* size. To prevent the cache growing unbounded at these times we
	* implement a "cache throttle" that slows the flow of new data
	* into the cache until we can make space available.
	*
	* 2. The Megiddo and Modha model assumes a fixed cache size.
	* Pages are evicted when the cache is full and there is a cache
	* miss. Our model has a variable sized cache. It grows with
	* high use, but also tries to react to memory pressure from the
	* operating system: decreasing its size when system memory is
	* tight.
	*
	* 3. The Megiddo and Modha model assumes a fixed page size. All
	* elements of the cache are therefore exactly the same size. So
	* when adjusting the cache size following a cache miss, it's simply
	* a matter of choosing a single page to evict. In our model, we
	* have variable sized cache blocks (ranging from 512 bytes to
	* 128K bytes). We therefore choose a set of blocks to evict to make
	* space for a cache miss that approximates as closely as possible
	* the space used by the new block.
	*
	* See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
	* by N. Megiddo & D. Modha, FAST 2003
	*/
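
The eviction rule described in points 1 and 3 above can be illustrated with a small stand-alone sketch. It is not the ARC's eviction code: `sketch_block` and `sketch_evict` are hypothetical names, and an array stands in for the real lists. It only shows walking from the "lowest" end, skipping referenced blocks, and freeing variable-sized blocks until the requested space is covered.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cached block; not the real arc_buf_hdr_t. */
struct sketch_block {
	uint64_t size;		/* block size in bytes */
	int refcount;		/* > 0 means a reference has been handed out */
	int evicted;		/* set once this sketch has evicted the block */
};

/*
 * Walk the list from its "lowest" end (index 0 in this sketch) and evict
 * unreferenced blocks until at least `needed` bytes are freed or the list
 * is exhausted. Returns the number of bytes freed, which can be less than
 * `needed` when too many blocks are referenced.
 */
static uint64_t
sketch_evict(struct sketch_block *list, size_t n, uint64_t needed)
{
	uint64_t freed = 0;

	assert(list != NULL || n == 0);
	for (size_t i = 0; i < n && freed < needed; i++) {
		if (list[i].refcount > 0 || list[i].evicted)
			continue;	/* un-evictable: skip it */
		list[i].evicted = 1;
		freed += list[i].size;
	}
	return (freed);
}
```

When the return value is smaller than `needed`, the caller is in the situation the comment describes as being unable to adjust the cache size, which is where a throttle on new data would apply.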

	/*
	* The locking model:
	*
	* A new reference to a cache buffer can be obtained in two
	* ways: 1) via a hash table lookup using the DVA as a key,
	* or 2) via one of the ARC lists. The arc_read() interface
	* uses method 1, while the internal arc algorithms for
	* adjusting the cache use method 2. We therefore provide two
	* types of locks: 1) the hash table lock array, and 2) the
	* arc list locks.
	*
	* Buffers do not have their own mutexes, rather they rely on the
	* hash table mutexes for the bulk of their protection (i.e. most
	* fields in the arc_buf_hdr_t are protected by these mutexes).
	*
	* buf_hash_find() returns the appropriate mutex (held) when it
	* locates the requested buffer in the hash table. It returns
	* NULL for the mutex if the buffer was not in the table.
	*
	* buf_hash_remove() expects the appropriate hash mutex to be
	* already held before it is invoked.
	*
	* Each arc state also has a mutex which is used to protect the
	* buffer list associated with the state. When attempting to
	* obtain a hash table lock while holding an arc list lock you
	* must use: mutex_tryenter() to avoid deadlock. Also note that
	* the active state mutex must be held before the ghost state mutex.
	*
	* Arc buffers may have an associated eviction callback function.
	* This function will be invoked prior to removing the buffer (e.g.
	* in arc_do_user_evicts()). Note however that the data associated
	* with the buffer may be evicted prior to the callback. The callback
	* must be made with no locks held (to prevent deadlock). Additionally,
	* the users of callbacks must ensure that their private data is
	* protected from simultaneous callbacks from arc_buf_evict()
	* and arc_do_user_evicts().
	*
	* Note that the majority of the performance stats are manipulated
	* with atomic operations.
	*
	* The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
	*
	* - L2ARC buflist creation
	* - L2ARC buflist eviction
	* - L2ARC write completion, which walks L2ARC buflists
	* - ARC header destruction, as it removes from L2ARC buflists
	* - ARC header release, as it removes from L2ARC buflists
	*/
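
The try-lock rule in the locking model (take a hash lock with `mutex_tryenter()` while an arc list lock is held) can be sketched in portable C. This is an illustration under assumptions, not kernel code: `sketch_lock_t`, `sketch_tryenter`, `sketch_exit` and `sketch_visit_buffer` are hypothetical, and a C11 `atomic_flag` stands in for `kmutex_t`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical minimal lock; stands in for kmutex_t in this sketch. */
typedef struct {
	atomic_flag held;
} sketch_lock_t;

static bool
sketch_tryenter(sketch_lock_t *l)
{
	/* test_and_set returns the previous value: false means we got it */
	return (!atomic_flag_test_and_set(&l->held));
}

static void
sketch_exit(sketch_lock_t *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Assumed to be called with a list lock already held. Blocking on the
 * hash lock here could deadlock against a thread that holds the hash
 * lock and wants the list lock, so the sketch only tries the lock and
 * counts a miss when that fails.
 */
static bool
sketch_visit_buffer(sketch_lock_t *hash_lock, unsigned *mutex_miss)
{
	assert(hash_lock != NULL && mutex_miss != NULL);
	if (!sketch_tryenter(hash_lock)) {
		(*mutex_miss)++;	/* skip this buffer, move on */
		return (false);
	}
	/* ... examine or evict the buffer here ... */
	sketch_exit(hash_lock);
	return (true);
}
```

The miss counter plays the same role as a statistic such as `arcstat_mutex_miss`: a skipped buffer is a cost, but never a deadlock.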

	#include <sys/spa.h>
	#include <sys/zio.h>
	#include <sys/zio_checksum.h>
	#include <sys/zfs_context.h>
	#include <sys/arc.h>
	#include <sys/refcount.h>
	#include <sys/vdev.h>
	#ifdef _KERNEL
	#include <sys/dnlc.h>
	#endif
	#include <sys/callb.h>
	#include <sys/kstat.h>
	#include <sys/sdt.h>

	#include <vm/vm_pageout.h>

	static kmutex_t arc_reclaim_thr_lock;
	static kcondvar_t arc_reclaim_thr_cv; /* used to signal reclaim thr */
	static uint8_t arc_thread_exit;

	extern int zfs_write_limit_shift;
	extern uint64_t zfs_write_limit_max;
	extern kmutex_t zfs_write_limit_lock;

	#define ARC_REDUCE_DNLC_PERCENT 3
	uint_t arc_reduce_dnlc_percent = ARC_REDUCE_DNLC_PERCENT;

	typedef enum arc_reclaim_strategy {
	ARC_RECLAIM_AGGR, /* Aggressive reclaim strategy */
	ARC_RECLAIM_CONS /* Conservative reclaim strategy */
	} arc_reclaim_strategy_t;

	/* number of seconds before growing cache again */
	static int arc_grow_retry = 60;

	/* shift of arc_c for calculating both min and max arc_p */
	static int arc_p_min_shift = 4;

	/* log2(fraction of arc to reclaim) */
	static int arc_shrink_shift = 5;

	/*
	* minimum lifespan of a prefetch block in clock ticks
	* (initialized in arc_init())
	*/
	static int arc_min_prefetch_lifespan;

	static int arc_dead;
	extern int zfs_prefetch_disable;

	/*
	* The arc has filled available memory and has now warmed up.
	*/
	static boolean_t arc_warm;

	/*
	* These tunables are for performance analysis.
	*/
	uint64_t zfs_arc_max;
	uint64_t zfs_arc_min;
	uint64_t zfs_arc_meta_limit = 0;
	int zfs_mdcomp_disable = 0;
	int zfs_arc_grow_retry = 0;
	int zfs_arc_shrink_shift = 0;
	int zfs_arc_p_min_shift = 0;

	TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
	TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
	TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
	TUNABLE_INT("vfs.zfs.mdcomp_disable", &zfs_mdcomp_disable);
	SYSCTL_DECL(_vfs_zfs);
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
	"Maximum ARC size");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
	"Minimum ARC size");
	SYSCTL_INT(_vfs_zfs, OID_AUTO, mdcomp_disable, CTLFLAG_RDTUN,
	&zfs_mdcomp_disable, 0, "Disable metadata compression");

	/*
	* Note that buffers can be in one of 6 states:
	* ARC_anon - anonymous (discussed below)
	* ARC_mru - recently used, currently cached
	* ARC_mru_ghost - recently used, no longer in cache
	* ARC_mfu - frequently used, currently cached
	* ARC_mfu_ghost - frequently used, no longer in cache
	* ARC_l2c_only - exists in L2ARC but not other states
	* When there are no active references to the buffer, they are
	* linked onto a list in one of these arc states. These are
	* the only buffers that can be evicted or deleted. Within each
	* state there are multiple lists, one for meta-data and one for
	* non-meta-data. Meta-data (indirect blocks, blocks of dnodes,
	* etc.) is tracked separately so that it can be managed more
	* explicitly: favored over data, limited explicitly.
	*
	* Anonymous buffers are buffers that are not associated with
	* a DVA. These are buffers that hold dirty block copies
	* before they are written to stable storage. By definition,
	* they are "ref'd" and are considered part of arc_mru
	* that cannot be freed. Generally, they will acquire a DVA
	* as they are written and migrate onto the arc_mru list.
	*
	* The ARC_l2c_only state is for buffers that are in the second
	* level ARC but no longer in any of the ARC_m* lists. The second
	* level ARC itself may also contain buffers that are in any of
	* the ARC_m* states - meaning that a buffer can exist in two
	* places. The reason for the ARC_l2c_only state is to keep the
	* buffer header in the hash table, so that reads that hit the
	* second level ARC benefit from these fast lookups.
	*/

	#define ARCS_LOCK_PAD CACHE_LINE_SIZE
	struct arcs_lock {
	kmutex_t arcs_lock;
	#ifdef _KERNEL
	unsigned char pad[(ARCS_LOCK_PAD - sizeof (kmutex_t))];
	#endif
	};

	/*
	* must be power of two for mask use to work
	*
	*/
	#define ARC_BUFC_NUMDATALISTS 16
	#define ARC_BUFC_NUMMETADATALISTS 16
	#define ARC_BUFC_NUMLISTS (ARC_BUFC_NUMMETADATALISTS + ARC_BUFC_NUMDATALISTS)

	typedef struct arc_state {
	uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
	uint64_t arcs_size; /* total amount of data in this state */
	list_t arcs_lists[ARC_BUFC_NUMLISTS]; /* list of evictable buffers */
	struct arcs_lock arcs_locks[ARC_BUFC_NUMLISTS] __aligned(CACHE_LINE_SIZE);
	} arc_state_t;

	#define ARCS_LOCK(s, i) (&((s)->arcs_locks[(i)].arcs_lock))

	/* The 6 states: */
	static arc_state_t ARC_anon;
	static arc_state_t ARC_mru;
	static arc_state_t ARC_mru_ghost;
	static arc_state_t ARC_mfu;
	static arc_state_t ARC_mfu_ghost;
	static arc_state_t ARC_l2c_only;

	typedef struct arc_stats {
	kstat_named_t arcstat_hits;
	kstat_named_t arcstat_misses;
	kstat_named_t arcstat_demand_data_hits;
	kstat_named_t arcstat_demand_data_misses;
	kstat_named_t arcstat_demand_metadata_hits;
	kstat_named_t arcstat_demand_metadata_misses;
	kstat_named_t arcstat_prefetch_data_hits;
	kstat_named_t arcstat_prefetch_data_misses;
	kstat_named_t arcstat_prefetch_metadata_hits;
	kstat_named_t arcstat_prefetch_metadata_misses;
	kstat_named_t arcstat_mru_hits;
	kstat_named_t arcstat_mru_ghost_hits;
	kstat_named_t arcstat_mfu_hits;
	kstat_named_t arcstat_mfu_ghost_hits;
	kstat_named_t arcstat_allocated;
	kstat_named_t arcstat_deleted;
	kstat_named_t arcstat_stolen;
	kstat_named_t arcstat_recycle_miss;
	kstat_named_t arcstat_mutex_miss;
	kstat_named_t arcstat_evict_skip;
	kstat_named_t arcstat_evict_l2_cached;
	kstat_named_t arcstat_evict_l2_eligible;
	kstat_named_t arcstat_evict_l2_ineligible;
	kstat_named_t arcstat_hash_elements;
	kstat_named_t arcstat_hash_elements_max;
	kstat_named_t arcstat_hash_collisions;
	kstat_named_t arcstat_hash_chains;
	kstat_named_t arcstat_hash_chain_max;
	kstat_named_t arcstat_p;
	kstat_named_t arcstat_c;
	kstat_named_t arcstat_c_min;
	kstat_named_t arcstat_c_max;
	kstat_named_t arcstat_size;
	kstat_named_t arcstat_hdr_size;
	kstat_named_t arcstat_data_size;
	kstat_named_t arcstat_other_size;
	kstat_named_t arcstat_l2_hits;
	kstat_named_t arcstat_l2_misses;
	kstat_named_t arcstat_l2_feeds;
	kstat_named_t arcstat_l2_rw_clash;
	kstat_named_t arcstat_l2_read_bytes;
	kstat_named_t arcstat_l2_write_bytes;
	kstat_named_t arcstat_l2_writes_sent;
	kstat_named_t arcstat_l2_writes_done;
	kstat_named_t arcstat_l2_writes_error;
	kstat_named_t arcstat_l2_writes_hdr_miss;
	kstat_named_t arcstat_l2_evict_lock_retry;
	kstat_named_t arcstat_l2_evict_reading;
	kstat_named_t arcstat_l2_free_on_write;
	kstat_named_t arcstat_l2_abort_lowmem;
	kstat_named_t arcstat_l2_cksum_bad;
	kstat_named_t arcstat_l2_io_error;
	kstat_named_t arcstat_l2_size;
	kstat_named_t arcstat_l2_hdr_size;
	kstat_named_t arcstat_memory_throttle_count;
	kstat_named_t arcstat_l2_write_trylock_fail;
	kstat_named_t arcstat_l2_write_passed_headroom;
	kstat_named_t arcstat_l2_write_spa_mismatch;
	kstat_named_t arcstat_l2_write_in_l2;
	kstat_named_t arcstat_l2_write_hdr_io_in_progress;
	kstat_named_t arcstat_l2_write_not_cacheable;
	kstat_named_t arcstat_l2_write_full;
	kstat_named_t arcstat_l2_write_buffer_iter;
	kstat_named_t arcstat_l2_write_pios;
	kstat_named_t arcstat_l2_write_buffer_bytes_scanned;
	kstat_named_t arcstat_l2_write_buffer_list_iter;
	kstat_named_t arcstat_l2_write_buffer_list_null_iter;
	} arc_stats_t;

	static arc_stats_t arc_stats = {
	{ "hits", KSTAT_DATA_UINT64 },
	{ "misses", KSTAT_DATA_UINT64 },
	{ "demand_data_hits", KSTAT_DATA_UINT64 },
	{ "demand_data_misses", KSTAT_DATA_UINT64 },
	{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
	{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
	{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
	{ "mru_hits", KSTAT_DATA_UINT64 },
	{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
	{ "mfu_hits", KSTAT_DATA_UINT64 },
	{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
	{ "allocated", KSTAT_DATA_UINT64 },
	{ "deleted", KSTAT_DATA_UINT64 },
	{ "stolen", KSTAT_DATA_UINT64 },
	{ "recycle_miss", KSTAT_DATA_UINT64 },
	{ "mutex_miss", KSTAT_DATA_UINT64 },
	{ "evict_skip", KSTAT_DATA_UINT64 },
	{ "evict_l2_cached", KSTAT_DATA_UINT64 },
	{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
	{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
	{ "hash_elements", KSTAT_DATA_UINT64 },
	{ "hash_elements_max", KSTAT_DATA_UINT64 },
	{ "hash_collisions", KSTAT_DATA_UINT64 },
	{ "hash_chains", KSTAT_DATA_UINT64 },
	{ "hash_chain_max", KSTAT_DATA_UINT64 },
	{ "p", KSTAT_DATA_UINT64 },
	{ "c", KSTAT_DATA_UINT64 },
	{ "c_min", KSTAT_DATA_UINT64 },
	{ "c_max", KSTAT_DATA_UINT64 },
	{ "size", KSTAT_DATA_UINT64 },
	{ "hdr_size", KSTAT_DATA_UINT64 },
	{ "data_size", KSTAT_DATA_UINT64 },
	{ "other_size", KSTAT_DATA_UINT64 },
	{ "l2_hits", KSTAT_DATA_UINT64 },
	{ "l2_misses", KSTAT_DATA_UINT64 },
	{ "l2_feeds", KSTAT_DATA_UINT64 },
	{ "l2_rw_clash", KSTAT_DATA_UINT64 },
	{ "l2_read_bytes", KSTAT_DATA_UINT64 },
	{ "l2_write_bytes", KSTAT_DATA_UINT64 },
	{ "l2_writes_sent", KSTAT_DATA_UINT64 },
	{ "l2_writes_done", KSTAT_DATA_UINT64 },
	{ "l2_writes_error", KSTAT_DATA_UINT64 },
	{ "l2_writes_hdr_miss", KSTAT_DATA_UINT64 },
	{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
	{ "l2_evict_reading", KSTAT_DATA_UINT64 },
	{ "l2_free_on_write", KSTAT_DATA_UINT64 },
	{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
	{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
	{ "l2_io_error", KSTAT_DATA_UINT64 },
	{ "l2_size", KSTAT_DATA_UINT64 },
	{ "l2_hdr_size", KSTAT_DATA_UINT64 },
	{ "memory_throttle_count", KSTAT_DATA_UINT64 },
	{ "l2_write_trylock_fail", KSTAT_DATA_UINT64 },
	{ "l2_write_passed_headroom", KSTAT_DATA_UINT64 },
	{ "l2_write_spa_mismatch", KSTAT_DATA_UINT64 },
	{ "l2_write_in_l2", KSTAT_DATA_UINT64 },
	{ "l2_write_io_in_progress", KSTAT_DATA_UINT64 },
	{ "l2_write_not_cacheable", KSTAT_DATA_UINT64 },
	{ "l2_write_full", KSTAT_DATA_UINT64 },
	{ "l2_write_buffer_iter", KSTAT_DATA_UINT64 },
	{ "l2_write_pios", KSTAT_DATA_UINT64 },
	{ "l2_write_buffer_bytes_scanned", KSTAT_DATA_UINT64 },
	{ "l2_write_buffer_list_iter", KSTAT_DATA_UINT64 },
	{ "l2_write_buffer_list_null_iter", KSTAT_DATA_UINT64 }
	};

	#define ARCSTAT(stat) (arc_stats.stat.value.ui64)

	#define ARCSTAT_INCR(stat, val) \
	atomic_add_64(&arc_stats.stat.value.ui64, (val));

	#define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1)
	#define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1)

	#define ARCSTAT_MAX(stat, val) { \
	uint64_t m; \
	while ((val) > (m = arc_stats.stat.value.ui64) && \
	(m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
	continue; \
	}

	#define ARCSTAT_MAXSTAT(stat) \
	ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
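
The compare-and-swap loop in `ARCSTAT_MAX` can be restated with C11 atomics. This is a sketch of the same idea, not the kernel macro: it uses `atomic_compare_exchange_weak` in place of `atomic_cas_64`, and `sketch_stat_max` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/*
 * Raise *stat to val if val is larger, without taking a lock. If
 * another thread changes *stat between the load and the exchange, the
 * exchange fails and the comparison is repeated with the fresh value.
 */
static void
sketch_stat_max(_Atomic uint64_t *stat, uint64_t val)
{
	uint64_t m = atomic_load(stat);

	/*
	 * On failure, atomic_compare_exchange_weak stores the current
	 * value of *stat into m, so no explicit reload is needed.
	 */
	while (val > m && !atomic_compare_exchange_weak(stat, &m, val))
		continue;
}
```

The loop ends either because the stored maximum is already at least `val` or because this thread installed `val`; the stored value never decreases.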

	/*
	* We define a macro to allow ARC hits/misses to be easily broken down by
	* two separate conditions, giving a total of four different subtypes for
	* each of hits and misses (so eight statistics total).
	*/
	#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
	if (cond1) { \
	if (cond2) { \
	ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
	} else { \
	ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
	} \
	} else { \
	if (cond2) { \
	ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
	} else { \
	ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
	} \
	}
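
How the token pasting in `ARCSTAT_CONDSTAT` selects one counter out of four can be seen in a reduced form. The names below (`sketch_stats`, `SKETCH_BUMP`, `SKETCH_CONDSTAT`, `sketch_record_hit`) are hypothetical, the counters are plain integers rather than kstats, and the increment is not atomic.

```c
#include <assert.h>
#include <stdint.h>

/* Four counters named <stat1|notstat1>_<stat2|notstat2>_<stat>. */
struct sketch_stats {
	uint64_t demand_data_hits;
	uint64_t demand_metadata_hits;
	uint64_t prefetch_data_hits;
	uint64_t prefetch_metadata_hits;
};

static struct sketch_stats sketch_stats;

/* Non-atomic on purpose: this sketch is single-threaded. */
#define	SKETCH_BUMP(field)	(sketch_stats.field++)

/* The ## operator glues the chosen name parts into one field name. */
#define	SKETCH_CONDSTAT(c1, s1, n1, c2, s2, n2, stat)		\
	do {							\
		if (c1) {					\
			if (c2)					\
				SKETCH_BUMP(s1##_##s2##_##stat);	\
			else					\
				SKETCH_BUMP(s1##_##n2##_##stat);	\
		} else {					\
			if (c2)					\
				SKETCH_BUMP(n1##_##s2##_##stat);	\
			else					\
				SKETCH_BUMP(n1##_##n2##_##stat);	\
		}						\
	} while (0)

static void
sketch_record_hit(int is_prefetch, int is_metadata)
{
	SKETCH_CONDSTAT(!is_prefetch, demand, prefetch,
	    !is_metadata, data, metadata, hits);
}
```

Calling the macro a second time with `misses` as the last argument would address four more fields, which is how two conditions yield eight statistics in total.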

	kstat_t *arc_ksp;
	static arc_state_t *arc_anon;
	static arc_state_t *arc_mru;
	static arc_state_t *arc_mru_ghost;
	static arc_state_t *arc_mfu;
	static arc_state_t *arc_mfu_ghost;
	static arc_state_t *arc_l2c_only;

	/*
	* There are several ARC variables that are critical to export as kstats --
	* but we don't want to have to grovel around in the kstat whenever we wish to
	* manipulate them. For these variables, we therefore define them to be in
	* terms of the statistic variable. This assures that we are not introducing
	* the possibility of inconsistency by having shadow copies of the variables,
	* while still allowing the code to be readable.
	*/
	#define arc_size ARCSTAT(arcstat_size) /* actual total arc size */
	#define arc_p ARCSTAT(arcstat_p) /* target size of MRU */
	#define arc_c ARCSTAT(arcstat_c) /* target size of cache */
	#define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */
	#define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */

	static int arc_no_grow; /* Don't try to grow cache size */
	static uint64_t arc_tempreserve;
	static uint64_t arc_meta_used;
	static uint64_t arc_meta_limit;
	static uint64_t arc_meta_max = 0;
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_meta_used, CTLFLAG_RDTUN,
	&arc_meta_used, 0, "ARC metadata used");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_meta_limit, CTLFLAG_RDTUN,
	&arc_meta_limit, 0, "ARC metadata limit");

	typedef struct l2arc_buf_hdr l2arc_buf_hdr_t;

	typedef struct arc_callback arc_callback_t;

	struct arc_callback {
	void *acb_private;
	arc_done_func_t *acb_done;
	arc_buf_t *acb_buf;
	zio_t *acb_zio_dummy;
	arc_callback_t *acb_next;
	};

	typedef struct arc_write_callback arc_write_callback_t;

	struct arc_write_callback {
	void *awcb_private;
	arc_done_func_t *awcb_ready;
	arc_done_func_t *awcb_done;
	arc_buf_t *awcb_buf;
	};

	struct arc_buf_hdr {
	/* protected by hash lock */
	dva_t b_dva;
	uint64_t b_birth;
	uint64_t b_cksum0;

	kmutex_t b_freeze_lock;
	zio_cksum_t *b_freeze_cksum;

	arc_buf_hdr_t *b_hash_next;
	arc_buf_t *b_buf;
	uint32_t b_flags;
	uint32_t b_datacnt;

	arc_callback_t *b_acb;
	kcondvar_t b_cv;

	/* immutable */
	arc_buf_contents_t b_type;
	uint64_t b_size;
	spa_t *b_spa;

	/* protected by arc state mutex */
	arc_state_t *b_state;
	list_node_t b_arc_node;

	/* updated atomically */
	clock_t b_arc_access;

	/* self protecting */
	refcount_t b_refcnt;

	l2arc_buf_hdr_t *b_l2hdr;
	list_node_t b_l2node;
	};

	static arc_buf_t *arc_eviction_list;
	static kmutex_t arc_eviction_mtx;
	static arc_buf_hdr_t arc_eviction_hdr;
	static void arc_get_data_buf(arc_buf_t *buf);
	static void arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock);
	static int arc_evict_needed(arc_buf_contents_t type);
	static void arc_evict_ghost(arc_state_t *state, spa_t *spa, int64_t bytes);

	static boolean_t l2arc_write_eligible(spa_t *spa, arc_buf_hdr_t *ab);

	#define GHOST_STATE(state) \
	((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
	(state) == arc_l2c_only)

	/*
	* Private ARC flags. These flags are private ARC only flags that will show up
	* in b_flags in the arc_hdr_buf_t. Some flags are publicly declared, and can
	* be passed in as arc_flags in things like arc_read. However, these flags
	* should never be passed and should only be set by ARC code. When adding new
	* public flags, make sure not to smash the private ones.
	*/

	#define ARC_IN_HASH_TABLE (1 << 9) /* this buffer is hashed */
	#define ARC_IO_IN_PROGRESS (1 << 10) /* I/O in progress for buf */
	#define ARC_IO_ERROR (1 << 11) /* I/O failed for buf */
	#define ARC_FREED_IN_READ (1 << 12) /* buf freed while in read */
	#define ARC_BUF_AVAILABLE (1 << 13) /* block not in active use */
	#define ARC_INDIRECT (1 << 14) /* this is an indirect block */
	#define ARC_FREE_IN_PROGRESS (1 << 15) /* hdr about to be freed */
	#define ARC_L2_WRITING (1 << 16) /* L2ARC write in progress */
	#define ARC_L2_EVICTED (1 << 17) /* evicted during I/O */
	#define ARC_L2_WRITE_HEAD (1 << 18) /* head of write list */
	#define ARC_STORED (1 << 19) /* has been store()d to */

	#define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_IN_HASH_TABLE)
	#define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS)
	#define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_IO_ERROR)
	#define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_PREFETCH)
	#define HDR_FREED_IN_READ(hdr) ((hdr)->b_flags & ARC_FREED_IN_READ)
	#define HDR_BUF_AVAILABLE(hdr) ((hdr)->b_flags & ARC_BUF_AVAILABLE)
	#define HDR_FREE_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FREE_IN_PROGRESS)
	#define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_L2CACHE)
	#define HDR_L2_READING(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS && \
	(hdr)->b_l2hdr != NULL)
	#define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_L2_WRITING)
	#define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_L2_EVICTED)
	#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_L2_WRITE_HEAD)

	/*
	* Other sizes
	*/

	#define HDR_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
	#define L2HDR_SIZE ((int64_t)sizeof (l2arc_buf_hdr_t))

	/*
	* Hash table routines
	*/

	#define HT_LOCK_PAD CACHE_LINE_SIZE

	struct ht_lock {
	kmutex_t ht_lock;
	#ifdef _KERNEL
	unsigned char pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
	#endif
	};

	#define BUF_LOCKS 256
	typedef struct buf_hash_table {
	uint64_t ht_mask;
	arc_buf_hdr_t **ht_table;
	struct ht_lock ht_locks[BUF_LOCKS] __aligned(CACHE_LINE_SIZE);
	} buf_hash_table_t;

	static buf_hash_table_t buf_hash_table;

	#define BUF_HASH_INDEX(spa, dva, birth) \
	(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
	#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
	#define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
	#define HDR_LOCK(buf) \
	(BUF_HASH_LOCK(BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth)))
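
The `BUF_HASH_LOCK` macros map many hash buckets onto a fixed array of 256 locks by masking the bucket index. A minimal sketch of that mapping, with hypothetical names (`SKETCH_BUF_LOCKS`, `sketch_lock_index`):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SKETCH_BUF_LOCKS	256

/* The mask trick below only works for a power-of-two lock count. */
_Static_assert((SKETCH_BUF_LOCKS & (SKETCH_BUF_LOCKS - 1)) == 0,
    "lock count must be a power of two");

/* Map a hash bucket index to the index of the lock that guards it. */
static size_t
sketch_lock_index(uint64_t bucket_idx)
{
	return ((size_t)(bucket_idx & (SKETCH_BUF_LOCKS - 1)));
}
```

Buckets whose indices differ by a multiple of 256 share a lock, so the lock array stays small at the price of some contention between unrelated buckets.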

	uint64_t zfs_crc64_table[256];

	/*
	* Level 2 ARC
	*/

	#define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
	#define L2ARC_HEADROOM 2 /* num of writes */
	#define L2ARC_FEED_SECS 1 /* caching interval secs */
	#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */

	#define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent)
	#define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done)

	/*
	* L2ARC Performance Tunables
	*/
	uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* default max write size */
	uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra write during warmup */
	uint64_t l2arc_headroom = L2ARC_HEADROOM; /* number of dev writes */
	uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
	uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
	boolean_t l2arc_noprefetch = B_FALSE; /* don't cache prefetch bufs */
	boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
	boolean_t l2arc_norw = B_TRUE; /* no reads during writes */

	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_max, CTLFLAG_RW,
	&l2arc_write_max, 0, "max write size");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_boost, CTLFLAG_RW,
	&l2arc_write_boost, 0, "extra write during warmup");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_headroom, CTLFLAG_RW,
	&l2arc_headroom, 0, "number of dev writes");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_secs, CTLFLAG_RW,
	&l2arc_feed_secs, 0, "interval seconds");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_min_ms, CTLFLAG_RW,
	&l2arc_feed_min_ms, 0, "min interval milliseconds");

	SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_noprefetch, CTLFLAG_RW,
	&l2arc_noprefetch, 0, "don't cache prefetch bufs");
	SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_feed_again, CTLFLAG_RW,
	&l2arc_feed_again, 0, "turbo warmup");
	SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_norw, CTLFLAG_RW,
	&l2arc_norw, 0, "no reads during writes");

	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_size, CTLFLAG_RD,
	&ARC_anon.arcs_size, 0, "size of anonymous state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_metadata_lsize, CTLFLAG_RD,
	&ARC_anon.arcs_lsize[ARC_BUFC_METADATA], 0, "size of anonymous state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_data_lsize, CTLFLAG_RD,
	&ARC_anon.arcs_lsize[ARC_BUFC_DATA], 0, "size of anonymous state");

	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_size, CTLFLAG_RD,
	&ARC_mru.arcs_size, 0, "size of mru state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_metadata_lsize, CTLFLAG_RD,
	&ARC_mru.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mru state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_data_lsize, CTLFLAG_RD,
	&ARC_mru.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mru state");

	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_size, CTLFLAG_RD,
	&ARC_mru_ghost.arcs_size, 0, "size of mru ghost state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_metadata_lsize, CTLFLAG_RD,
	&ARC_mru_ghost.arcs_lsize[ARC_BUFC_METADATA], 0,
	"size of metadata in mru ghost state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_data_lsize, CTLFLAG_RD,
	&ARC_mru_ghost.arcs_lsize[ARC_BUFC_DATA], 0,
	"size of data in mru ghost state");

	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_size, CTLFLAG_RD,
	&ARC_mfu.arcs_size, 0, "size of mfu state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_metadata_lsize, CTLFLAG_RD,
	&ARC_mfu.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mfu state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_data_lsize, CTLFLAG_RD,
	&ARC_mfu.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mfu state");

	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_size, CTLFLAG_RD,
	&ARC_mfu_ghost.arcs_size, 0, "size of mfu ghost state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_metadata_lsize, CTLFLAG_RD,
	&ARC_mfu_ghost.arcs_lsize[ARC_BUFC_METADATA], 0,
	"size of metadata in mfu ghost state");
	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_data_lsize, CTLFLAG_RD,
	&ARC_mfu_ghost.arcs_lsize[ARC_BUFC_DATA], 0,
	"size of data in mfu ghost state");

	SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2c_only_size, CTLFLAG_RD,
	&ARC_l2c_only.arcs_size, 0, "size of mru state");

	/*
	* L2ARC Internals
	*/
	typedef struct l2arc_dev {
	vdev_t *l2ad_vdev; /* vdev */
	spa_t *l2ad_spa; /* spa */
	uint64_t l2ad_hand; /* next write location */
	uint64_t l2ad_write; /* desired write size, bytes */
	uint64_t l2ad_boost; /* warmup write boost, bytes */
	uint64_t l2ad_start; /* first addr on device */
	uint64_t l2ad_end; /* last addr on device */
	uint64_t l2ad_evict; /* last addr eviction reached */
	boolean_t l2ad_first; /* first sweep through */
	boolean_t l2ad_writing; /* currently writing */
	list_t *l2ad_buflist; /* buffer list */
	list_node_t l2ad_node; /* device list node */
	} l2arc_dev_t;

	static list_t L2ARC_dev_list; /* device list */
	static list_t *l2arc_dev_list; /* device list pointer */
	static kmutex_t l2arc_dev_mtx; /* device list mutex */
	static l2arc_dev_t *l2arc_dev_last; /* last device used */
	static kmutex_t l2arc_buflist_mtx; /* mutex for all buflists */
	static list_t L2ARC_free_on_write; /* free after write buf list */
	static list_t *l2arc_free_on_write; /* free after write list ptr */
	static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
	static uint64_t l2arc_ndev; /* number of devices */

	typedef struct l2arc_read_callback {
	arc_buf_t *l2rcb_buf; /* read buffer */
	spa_t *l2rcb_spa; /* spa */
	blkptr_t l2rcb_bp; /* original blkptr */
	zbookmark_t l2rcb_zb; /* original bookmark */
	int l2rcb_flags; /* original flags */
	} l2arc_read_callback_t;

	typedef struct l2arc_write_callback {
	l2arc_dev_t *l2wcb_dev; /* device info */
	arc_buf_hdr_t *l2wcb_head; /* head of write buflist */
	} l2arc_write_callback_t;

	struct l2arc_buf_hdr {
	/* protected by arc_buf_hdr mutex */
	l2arc_dev_t *b_dev; /* L2ARC device */
	uint64_t b_daddr; /* disk address, offset byte */
	};

	typedef struct l2arc_data_free {
	/* protected by l2arc_free_on_write_mtx */
	void *l2df_data;
	size_t l2df_size;
	void (*l2df_func)(void *, size_t);
	list_node_t l2df_list_node;
	} l2arc_data_free_t;

	static kmutex_t l2arc_feed_thr_lock;
	static kcondvar_t l2arc_feed_thr_cv;
	static uint8_t l2arc_thread_exit;

	static void l2arc_read_done(zio_t *zio);
	static void l2arc_hdr_stat_add(void);
	static void l2arc_hdr_stat_remove(void);

	static uint64_t
	buf_hash(spa_t *spa, const dva_t *dva, uint64_t birth)
	{
	uintptr_t spav = (uintptr_t)spa;
	uint8_t *vdva = (uint8_t *)dva;
	uint64_t crc = -1ULL;
	int i;

	ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);

	for (i = 0; i < sizeof (dva_t); i++)
	crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];

	crc ^= (spav>>8) ^ birth;

	return (crc);
	}
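
`buf_hash()` depends on a CRC-64 lookup table that is filled in elsewhere in the file. The stand-alone sketch below shows one way such a table can be built and used. The polynomial value and the table-generation loop are assumptions, since neither appears in this excerpt, and all `sketch_` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed value of ZFS_CRC64_POLY; it is not shown in this excerpt. */
#define	SKETCH_CRC64_POLY	0xC96C5795D7870F42ULL

static uint64_t sketch_crc64_table[256];

/* Build the byte-indexed table for a reflected CRC-64. */
static void
sketch_crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t ct = (uint64_t)i;

		for (int j = 0; j < 8; j++)
			ct = (ct >> 1) ^ (-(ct & 1) & SKETCH_CRC64_POLY);
		sketch_crc64_table[i] = ct;
	}
}

/*
 * Same shape as buf_hash(): CRC over the key bytes, then fold in the
 * owner pointer (shifted to drop low alignment bits) and the birth.
 */
static uint64_t
sketch_hash(const void *key, size_t len, uintptr_t owner, uint64_t birth)
{
	const uint8_t *p = key;
	uint64_t crc = -1ULL;

	/* Entry 128 equals the polynomial once the table is initialized. */
	assert(sketch_crc64_table[128] == SKETCH_CRC64_POLY);

	for (size_t i = 0; i < len; i++)
		crc = (crc >> 8) ^ sketch_crc64_table[(crc ^ p[i]) & 0xFF];

	crc ^= (owner >> 8) ^ birth;

	return (crc);
}
```

The assertion mirrors the `ASSERT` in `buf_hash()`: it is a cheap check that the table was initialized before the first hash is computed.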

	#define BUF_EMPTY(buf) \
	((buf)->b_dva.dva_word[0] == 0 && \
	(buf)->b_dva.dva_word[1] == 0 && \
	(buf)->b_birth == 0)

	#define BUF_EQUAL(spa, dva, birth, buf) \
	((buf)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
	((buf)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
	((buf)->b_birth == birth) && ((buf)->b_spa == spa)

	static arc_buf_hdr_t *
	buf_hash_find(spa_t *spa, const dva_t *dva, uint64_t birth, kmutex_t **lockp)
	{
	uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
	kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
	arc_buf_hdr_t *buf;

	mutex_enter(hash_lock);
	for (buf = buf_hash_table.ht_table[idx]; buf != NULL;
	buf = buf->b_hash_next) {
	if (BUF_EQUAL(spa, dva, birth, buf)) {
	*lockp = hash_lock;
	return (buf);
	}
	}
	mutex_exit(hash_lock);
	*lockp = NULL;
	return (NULL);
	}

	/*
	* Insert an entry into the hash table. If there is already an element
	* equal to elem in the hash table, then the already existing element
	* will be returned and the new element will not be inserted.
	* Otherwise returns NULL.
	*/
	static arc_buf_hdr_t *
	buf_hash_insert(arc_buf_hdr_t *buf, kmutex_t **lockp)
	{
	uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
	kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
	arc_buf_hdr_t *fbuf;
	uint32_t i;

	ASSERT(!HDR_IN_HASH_TABLE(buf));
	*lockp = hash_lock;
	mutex_enter(hash_lock);
	for (fbuf = buf_hash_table.ht_table[idx], i = 0; fbuf != NULL;
	fbuf = fbuf->b_hash_next, i++) {
	if (BUF_EQUAL(buf->b_spa, &buf->b_dva, buf->b_birth, fbuf))
	return (fbuf);
	}

	buf->b_hash_next = buf_hash_table.ht_table[idx];
	buf_hash_table.ht_table[idx] = buf;
	buf->b_flags |= ARC_IN_HASH_TABLE;

	/* collect some hash table performance data */
	if (i > 0) {
	ARCSTAT_BUMP(arcstat_hash_collisions);
	if (i == 1)
	ARCSTAT_BUMP(arcstat_hash_chains);

	ARCSTAT_MAX(arcstat_hash_chain_max, i);
	}

	ARCSTAT_BUMP(arcstat_hash_elements);
	ARCSTAT_MAXSTAT(arcstat_hash_elements);

	return (NULL);
	}

	static void
	buf_hash_remove(arc_buf_hdr_t *buf)
	{
	arc_buf_hdr_t *fbuf, **bufp;
	uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);

	ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
	ASSERT(HDR_IN_HASH_TABLE(buf));

	bufp = &buf_hash_table.ht_table[idx];
	while ((fbuf = *bufp) != buf) {
	ASSERT(fbuf != NULL);
	bufp = &fbuf->b_hash_next;
	}
	*bufp = buf->b_hash_next;
	buf->b_hash_next = NULL;
	buf->b_flags &= ~ARC_IN_HASH_TABLE;

	/* collect some hash table performance data */
	ARCSTAT_BUMPDOWN(arcstat_hash_elements);

	if (buf_hash_table.ht_table[idx] &&
	buf_hash_table.ht_table[idx]->b_hash_next == NULL)
	ARCSTAT_BUMPDOWN(arcstat_hash_chains);
	}

	/*
	* Global data structures and functions for the buf kmem cache.
	*/
	static kmem_cache_t *hdr_cache;
	static kmem_cache_t *buf_cache;

	static void
	buf_fini(void)
	{
	int i;

	kmem_free(buf_hash_table.ht_table,
	(buf_hash_table.ht_mask + 1) * sizeof (void *));
	for (i = 0; i < BUF_LOCKS; i++)
	mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
	kmem_cache_destroy(hdr_cache);
	kmem_cache_destroy(buf_cache);
	}

	/*
	* Constructor callback - called when the cache is empty
	* and a new buf is requested.
	*/
	/* ARGSUSED */
	static int
	hdr_cons(void *vbuf, void *unused, int kmflag)
	{
	arc_buf_hdr_t *buf = vbuf;

	bzero(buf, sizeof (arc_buf_hdr_t));
	refcount_create(&buf->b_refcnt);
	cv_init(&buf->b_cv, NULL, CV_DEFAULT, NULL);
	mutex_init(&buf->b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
	arc_space_consume(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);

	return (0);
	}

	/* ARGSUSED */
	static int
	buf_cons(void *vbuf, void *unused, int kmflag)
	{
	arc_buf_t *buf = vbuf;

	bzero(buf, sizeof (arc_buf_t));
	rw_init(&buf->b_lock, NULL, RW_DEFAULT, NULL);
	arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);

	return (0);
	}

	/*
	* Destructor callback - called when a cached buf is
	* no longer required.
	*/
	/* ARGSUSED */
	static void
	hdr_dest(void *vbuf, void *unused)
	{
	arc_buf_hdr_t *buf = vbuf;

	refcount_destroy(&buf->b_refcnt);
	cv_destroy(&buf->b_cv);
	mutex_destroy(&buf->b_freeze_lock);
	arc_space_return(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
	}

	/* ARGSUSED */
	static void
	buf_dest(void *vbuf, void *unused)
	{
	arc_buf_t *buf = vbuf;

	rw_destroy(&buf->b_lock);
	arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
	}

	/*
	* Reclaim callback -- invoked when memory is low.
	*/
	/* ARGSUSED */
	static void
	hdr_recl(void *unused)
	{
	dprintf("hdr_recl called\n");
	/*
	* umem calls the reclaim func when we destroy the buf cache,
	* which is after we do arc_fini().
	*/
	if (!arc_dead)
	cv_signal(&arc_reclaim_thr_cv);
	}

	static void
	buf_init(void)
	{
	uint64_t *ct;
	uint64_t hsize = 1ULL << 12;
	int i, j;

	/*
	* The hash table is big enough to fill all of physical memory
	* with an average 64K block size. The table will take up
	* totalmem*sizeof(void*)/64K (eg. 128KB/GB with 8-byte pointers).
	*/
	while (hsize * 65536 < (uint64_t)physmem * PAGESIZE)
	hsize <<= 1;
	retry:
	buf_hash_table.ht_mask = hsize - 1;
	buf_hash_table.ht_table =
	kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
	if (buf_hash_table.ht_table == NULL) {
	ASSERT(hsize > (1ULL << 8));
	hsize >>= 1;
	goto retry;
	}

	hdr_cache = kmem_cache_create("arc_buf_hdr_t", sizeof (arc_buf_hdr_t),
	0, hdr_cons, hdr_dest, hdr_recl, NULL, NULL, 0);
	buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
	0, buf_cons, buf_dest, NULL, NULL, NULL, 0);

	for (i = 0; i < 256; i++)
	for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
	*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);

	for (i = 0; i < BUF_LOCKS; i++) {
	mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
	NULL, MUTEX_DEFAULT, NULL);
	}
	}

	#define ARC_MINTIME (hz>>4) /* 62 ms */

	static void
	arc_cksum_verify(arc_buf_t *buf)
	{
	zio_cksum_t zc;

	if (!(zfs_flags & ZFS_DEBUG_MODIFY))
	return;

	mutex_enter(&buf->b_hdr->b_freeze_lock);
	if (buf->b_hdr->b_freeze_cksum == NULL ||
	(buf->b_hdr->b_flags & ARC_IO_ERROR)) {
	mutex_exit(&buf->b_hdr->b_freeze_lock);
	return;
	}
	fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
	if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
	panic("buffer modified while frozen!");
	mutex_exit(&buf->b_hdr->b_freeze_lock);
	}

	static int
	arc_cksum_equal(arc_buf_t *buf)
	{
	zio_cksum_t zc;
	int equal;

	mutex_enter(&buf->b_hdr->b_freeze_lock);
	fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
	equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
	mutex_exit(&buf->b_hdr->b_freeze_lock);

	return (equal);
	}

	static void
	arc_cksum_compute(arc_buf_t *buf, boolean_t force)
	{
	if (!force && !(zfs_flags & ZFS_DEBUG_MODIFY))
	return;

	mutex_enter(&buf->b_hdr->b_freeze_lock);
	if (buf->b_hdr->b_freeze_cksum != NULL) {
	mutex_exit(&buf->b_hdr->b_freeze_lock);
	return;
	}
	buf->b_hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
	fletcher_2_native(buf->b_data, buf->b_hdr->b_size,
	buf->b_hdr->b_freeze_cksum);
	mutex_exit(&buf->b_hdr->b_freeze_lock);
	}

	void
	arc_buf_thaw(arc_buf_t *buf)
	{
	if (zfs_flags & ZFS_DEBUG_MODIFY) {
	if (buf->b_hdr->b_state != arc_anon)
	panic("modifying non-anon buffer!");
	if (buf->b_hdr->b_flags & ARC_IO_IN_PROGRESS)
	panic("modifying buffer while i/o in progress!");
	arc_cksum_verify(buf);
	}

	mutex_enter(&buf->b_hdr->b_freeze_lock);
	if (buf->b_hdr->b_freeze_cksum != NULL) {
	kmem_free(buf->b_hdr->b_freeze_cksum, sizeof (zio_cksum_t));
	buf->b_hdr->b_freeze_cksum = NULL;
	}
	mutex_exit(&buf->b_hdr->b_freeze_lock);
	}

	void
	arc_buf_freeze(arc_buf_t *buf)
	{
	if (!(zfs_flags & ZFS_DEBUG_MODIFY))
	return;

	ASSERT(buf->b_hdr->b_freeze_cksum != NULL ||
	buf->b_hdr->b_state == arc_anon);
	arc_cksum_compute(buf, B_FALSE);
	}

	static void
	get_buf_info(arc_buf_hdr_t *ab, arc_state_t *state, list_t **list, kmutex_t **lock)
	{
	uint64_t buf_hashid = buf_hash(ab->b_spa, &ab->b_dva, ab->b_birth);

	if (ab->b_type == ARC_BUFC_METADATA)
	buf_hashid &= (ARC_BUFC_NUMMETADATALISTS - 1);
	else {
	buf_hashid &= (ARC_BUFC_NUMDATALISTS - 1);
	buf_hashid += ARC_BUFC_NUMMETADATALISTS;
	}

	*list = &state->arcs_lists[buf_hashid];
	*lock = ARCS_LOCK(state, buf_hashid);
	}


	static void
	add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
	{

	ASSERT(MUTEX_HELD(hash_lock));

	if ((refcount_add(&ab->b_refcnt, tag) == 1) &&
	(ab->b_state != arc_anon)) {
	uint64_t delta = ab->b_size * ab->b_datacnt;
	uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type];
	list_t *list;
	kmutex_t *lock;

	get_buf_info(ab, ab->b_state, &list, &lock);
	ASSERT(!MUTEX_HELD(lock));
	mutex_enter(lock);
	ASSERT(list_link_active(&ab->b_arc_node));
	list_remove(list, ab);
	if (GHOST_STATE(ab->b_state)) {
	ASSERT3U(ab->b_datacnt, ==, 0);
	ASSERT3P(ab->b_buf, ==, NULL);
	delta = ab->b_size;
	}
	ASSERT(delta > 0);
	ASSERT3U(*size, >=, delta);
	atomic_add_64(size, -delta);
	mutex_exit(lock);
	/* remove the prefetch flag if we get a reference */
	if (ab->b_flags & ARC_PREFETCH)
	ab->b_flags &= ~ARC_PREFETCH;
	}
	}

	static int
	remove_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
	{
	int cnt;
	arc_state_t *state = ab->b_state;

	ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
	ASSERT(!GHOST_STATE(state));

	if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) &&
	(state != arc_anon)) {
	uint64_t *size = &state->arcs_lsize[ab->b_type];
	list_t *list;
	kmutex_t *lock;

	get_buf_info(ab, state, &list, &lock);
	ASSERT(!MUTEX_HELD(lock));
	mutex_enter(lock);
	ASSERT(!list_link_active(&ab->b_arc_node));
	list_insert_head(list, ab);
	ASSERT(ab->b_datacnt > 0);
	atomic_add_64(size, ab->b_size * ab->b_datacnt);
	mutex_exit(lock);
	}
	return (cnt);
	}

	/*
	* Move the supplied buffer to the indicated state. The mutex
	* for the buffer must be held by the caller.
	*/
	static void
	arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *ab, kmutex_t *hash_lock)
	{
	arc_state_t *old_state = ab->b_state;
	int64_t refcnt = refcount_count(&ab->b_refcnt);
	uint64_t from_delta, to_delta;
	list_t *list;
	kmutex_t *lock;

	ASSERT(MUTEX_HELD(hash_lock));
	ASSERT(new_state != old_state);
	ASSERT(refcnt == 0 || ab->b_datacnt > 0);
	ASSERT(ab->b_datacnt == 0 || !GHOST_STATE(new_state));

	from_delta = to_delta = ab->b_datacnt * ab->b_size;

	/*
	* If this buffer is evictable, transfer it from the
	* old state list to the new state list.
	*/
	if (refcnt == 0) {
	if (old_state != arc_anon) {
	int use_mutex;
	uint64_t *size = &old_state->arcs_lsize[ab->b_type];

	get_buf_info(ab, old_state, &list, &lock);
	use_mutex = !MUTEX_HELD(lock);
	if (use_mutex)
	mutex_enter(lock);

	ASSERT(list_link_active(&ab->b_arc_node));
	list_remove(list, ab);

	/*
	* If prefetching out of the ghost cache,
	* we will have a non-null datacnt.
	*/
	if (GHOST_STATE(old_state) && ab->b_datacnt == 0) {
	/* ghost elements have a ghost size */
	ASSERT(ab->b_buf == NULL);
	from_delta = ab->b_size;
	}
	ASSERT3U(*size, >=, from_delta);
	atomic_add_64(size, -from_delta);

	if (use_mutex)
	mutex_exit(lock);
	}
	if (new_state != arc_anon) {
	int use_mutex;
	uint64_t *size = &new_state->arcs_lsize[ab->b_type];

	get_buf_info(ab, new_state, &list, &lock);
	use_mutex = !MUTEX_HELD(lock);
	if (use_mutex)
	mutex_enter(lock);

	list_insert_head(list, ab);

	/* ghost elements have a ghost size */
	if (GHOST_STATE(new_state)) {
	ASSERT(ab->b_datacnt == 0);
	ASSERT(ab->b_buf == NULL);
	to_delta = ab->b_size;
	}
	atomic_add_64(size, to_delta);

	if (use_mutex)
	mutex_exit(lock);
	}
	}

	ASSERT(!BUF_EMPTY(ab));
	if (new_state == arc_anon) {
	buf_hash_remove(ab);
	}

	/* adjust state sizes */
	if (to_delta)
	atomic_add_64(&new_state->arcs_size, to_delta);
	if (from_delta) {
	ASSERT3U(old_state->arcs_size, >=, from_delta);
	atomic_add_64(&old_state->arcs_size, -from_delta);
	}
	ab->b_state = new_state;

	/* adjust l2arc hdr stats */
	if (new_state == arc_l2c_only)
	l2arc_hdr_stat_add();
	else if (old_state == arc_l2c_only)
	l2arc_hdr_stat_remove();
	}

	void
	arc_space_consume(uint64_t space, arc_space_type_t type)
	{
	ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);

	switch (type) {
	case ARC_SPACE_DATA:
	ARCSTAT_INCR(arcstat_data_size, space);
	break;
	case ARC_SPACE_OTHER:
	ARCSTAT_INCR(arcstat_other_size, space);
	break;
	case ARC_SPACE_HDRS:
	ARCSTAT_INCR(arcstat_hdr_size, space);
	break;
	case ARC_SPACE_L2HDRS:
	ARCSTAT_INCR(arcstat_l2_hdr_size, space);
	break;
	}

	atomic_add_64(&arc_meta_used, space);
	atomic_add_64(&arc_size, space);
	}

	void
	arc_space_return(uint64_t space, arc_space_type_t type)
	{
	ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);

	switch (type) {
	case ARC_SPACE_DATA:
	ARCSTAT_INCR(arcstat_data_size, -space);
	break;
	case ARC_SPACE_OTHER:
	ARCSTAT_INCR(arcstat_other_size, -space);
	break;
	case ARC_SPACE_HDRS:
	ARCSTAT_INCR(arcstat_hdr_size, -space);
	break;
	case ARC_SPACE_L2HDRS:
	ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
	break;
	}

	ASSERT(arc_meta_used >= space);
	if (arc_meta_max < arc_meta_used)
	arc_meta_max = arc_meta_used;
	atomic_add_64(&arc_meta_used, -space);
	ASSERT(arc_size >= space);
	atomic_add_64(&arc_size, -space);
	}

	void *
	arc_data_buf_alloc(uint64_t size)
	{
	if (arc_evict_needed(ARC_BUFC_DATA))
	cv_signal(&arc_reclaim_thr_cv);
	atomic_add_64(&arc_size, size);
	return (zio_data_buf_alloc(size));
	}

	void
	arc_data_buf_free(void *buf, uint64_t size)
	{
	zio_data_buf_free(buf, size);
	ASSERT(arc_size >= size);
	atomic_add_64(&arc_size, -size);
	}

	arc_buf_t *
	arc_buf_alloc(spa_t *spa, int size, void *tag, arc_buf_contents_t type)
	{
	arc_buf_hdr_t *hdr;
	arc_buf_t *buf;

	ASSERT3U(size, >, 0);
	hdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
	ASSERT(BUF_EMPTY(hdr));
	hdr->b_size = size;
	hdr->b_type = type;
	hdr->b_spa = spa;
	hdr->b_state = arc_anon;
	hdr->b_arc_access = 0;
	buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_efunc = NULL;
	buf->b_private = NULL;
	buf->b_next = NULL;
	hdr->b_buf = buf;
	arc_get_data_buf(buf);
	hdr->b_datacnt = 1;
	hdr->b_flags = 0;
	ASSERT(refcount_is_zero(&hdr->b_refcnt));
	(void) refcount_add(&hdr->b_refcnt, tag);

	return (buf);
	}

	static arc_buf_t *
	arc_buf_clone(arc_buf_t *from)
	{
	arc_buf_t *buf;
	arc_buf_hdr_t *hdr = from->b_hdr;
	uint64_t size = hdr->b_size;

	buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_efunc = NULL;
	buf->b_private = NULL;
	buf->b_next = hdr->b_buf;
	hdr->b_buf = buf;
	arc_get_data_buf(buf);
	bcopy(from->b_data, buf->b_data, size);
	hdr->b_datacnt += 1;
	return (buf);
	}

	void
	arc_buf_add_ref(arc_buf_t *buf, void *tag)
	{
	arc_buf_hdr_t *hdr;
	kmutex_t *hash_lock;

	/*
	* Check to see if this buffer is evicted. Callers
	* must verify b_data != NULL to know if the add_ref
	* was successful.
	*/
	rw_enter(&buf->b_lock, RW_READER);
	if (buf->b_data == NULL) {
	rw_exit(&buf->b_lock);
	return;
	}
	hdr = buf->b_hdr;
	ASSERT(hdr != NULL);
	hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);
	rw_exit(&buf->b_lock);

	ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
	add_reference(hdr, hash_lock, tag);
	DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
	arc_access(hdr, hash_lock);
	mutex_exit(hash_lock);
	ARCSTAT_BUMP(arcstat_hits);
	ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
	demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
	data, metadata, hits);
	}

	/*
	* Free the arc data buffer. If it is an l2arc write in progress,
	* the buffer is placed on l2arc_free_on_write to be freed later.
	*/
	static void
	arc_buf_data_free(arc_buf_hdr_t *hdr, void (*free_func)(void *, size_t),
	void *data, size_t size)
	{
	if (HDR_L2_WRITING(hdr)) {
	l2arc_data_free_t *df;
	df = kmem_alloc(sizeof (l2arc_data_free_t), KM_SLEEP);
	df->l2df_data = data;
	df->l2df_size = size;
	df->l2df_func = free_func;
	mutex_enter(&l2arc_free_on_write_mtx);
	list_insert_head(l2arc_free_on_write, df);
	mutex_exit(&l2arc_free_on_write_mtx);
	ARCSTAT_BUMP(arcstat_l2_free_on_write);
	} else {
	free_func(data, size);
	}
	}

	static void
	arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all)
	{
	arc_buf_t **bufp;

	/* free up data associated with the buf */
	if (buf->b_data) {
	arc_state_t *state = buf->b_hdr->b_state;
	uint64_t size = buf->b_hdr->b_size;
	arc_buf_contents_t type = buf->b_hdr->b_type;

	arc_cksum_verify(buf);
	if (!recycle) {
	if (type == ARC_BUFC_METADATA) {
	arc_buf_data_free(buf->b_hdr, zio_buf_free,
	buf->b_data, size);
	arc_space_return(size, ARC_SPACE_DATA);
	} else {
	ASSERT(type == ARC_BUFC_DATA);
	arc_buf_data_free(buf->b_hdr,
	zio_data_buf_free, buf->b_data, size);
	ARCSTAT_INCR(arcstat_data_size, -size);
	atomic_add_64(&arc_size, -size);
	}
	}
	if (list_link_active(&buf->b_hdr->b_arc_node)) {
	uint64_t *cnt = &state->arcs_lsize[type];

	ASSERT(refcount_is_zero(&buf->b_hdr->b_refcnt));
	ASSERT(state != arc_anon);

	ASSERT3U(*cnt, >=, size);
	atomic_add_64(cnt, -size);
	}
	ASSERT3U(state->arcs_size, >=, size);
	atomic_add_64(&state->arcs_size, -size);
	buf->b_data = NULL;
	ASSERT(buf->b_hdr->b_datacnt > 0);
	buf->b_hdr->b_datacnt -= 1;
	}

	/* only remove the buf if requested */
	if (!all)
	return;

	/* remove the buf from the hdr list */
	for (bufp = &buf->b_hdr->b_buf; *bufp != buf; bufp = &(*bufp)->b_next)
	continue;
	*bufp = buf->b_next;

	ASSERT(buf->b_efunc == NULL);

	/* clean up the buf */
	buf->b_hdr = NULL;
	kmem_cache_free(buf_cache, buf);
	}

	static void
	arc_hdr_destroy(arc_buf_hdr_t *hdr)
	{
	ASSERT(refcount_is_zero(&hdr->b_refcnt));
	ASSERT3P(hdr->b_state, ==, arc_anon);
	ASSERT(!HDR_IO_IN_PROGRESS(hdr));
	ASSERT(!(hdr->b_flags & ARC_STORED));

	if (hdr->b_l2hdr != NULL) {
	if (!MUTEX_HELD(&l2arc_buflist_mtx)) {
	/*
	* To prevent arc_free() and l2arc_evict() from
	* attempting to free the same buffer at the same time,
	* a FREE_IN_PROGRESS flag is given to arc_free() to
	* give it priority. l2arc_evict() can't destroy this
	* header while we are waiting on l2arc_buflist_mtx.
	*
	* The hdr may be removed from l2ad_buflist before we
	* grab l2arc_buflist_mtx, so b_l2hdr is rechecked.
	*/
	mutex_enter(&l2arc_buflist_mtx);
	if (hdr->b_l2hdr != NULL) {
	list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist,
	hdr);
	}
	mutex_exit(&l2arc_buflist_mtx);
	} else {
	list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist, hdr);
	}
	ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
	kmem_free(hdr->b_l2hdr, sizeof (l2arc_buf_hdr_t));
	if (hdr->b_state == arc_l2c_only)
	l2arc_hdr_stat_remove();
	hdr->b_l2hdr = NULL;
	}

	if (!BUF_EMPTY(hdr)) {
	ASSERT(!HDR_IN_HASH_TABLE(hdr));
	bzero(&hdr->b_dva, sizeof (dva_t));
	hdr->b_birth = 0;
	hdr->b_cksum0 = 0;
	}
	while (hdr->b_buf) {
	arc_buf_t *buf = hdr->b_buf;

	if (buf->b_efunc) {
	mutex_enter(&arc_eviction_mtx);
	rw_enter(&buf->b_lock, RW_WRITER);
	ASSERT(buf->b_hdr != NULL);
	arc_buf_destroy(hdr->b_buf, FALSE, FALSE);
	hdr->b_buf = buf->b_next;
	buf->b_hdr = &arc_eviction_hdr;
	buf->b_next = arc_eviction_list;
	arc_eviction_list = buf;
	rw_exit(&buf->b_lock);
	mutex_exit(&arc_eviction_mtx);
	} else {
	arc_buf_destroy(hdr->b_buf, FALSE, TRUE);
	}
	}
	if (hdr->b_freeze_cksum != NULL) {
	kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
	hdr->b_freeze_cksum = NULL;
	}

	ASSERT(!list_link_active(&hdr->b_arc_node));
	ASSERT3P(hdr->b_hash_next, ==, NULL);
	ASSERT3P(hdr->b_acb, ==, NULL);
	kmem_cache_free(hdr_cache, hdr);
	}

	void
	arc_buf_free(arc_buf_t *buf, void *tag)
	{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	int hashed = hdr->b_state != arc_anon;

	ASSERT(buf->b_efunc == NULL);
	ASSERT(buf->b_data != NULL);

	if (hashed) {
	kmutex_t *hash_lock = HDR_LOCK(hdr);

	mutex_enter(hash_lock);
	(void) remove_reference(hdr, hash_lock, tag);
	if (hdr->b_datacnt > 1)
	arc_buf_destroy(buf, FALSE, TRUE);
	else
	hdr->b_flags |= ARC_BUF_AVAILABLE;
	mutex_exit(hash_lock);
	} else if (HDR_IO_IN_PROGRESS(hdr)) {
	int destroy_hdr;
	/*
	* We are in the middle of an async write. Don't destroy
	* this buffer unless the write completes before we finish
	* decrementing the reference count.
	*/
	mutex_enter(&arc_eviction_mtx);
	(void) remove_reference(hdr, NULL, tag);
	ASSERT(refcount_is_zero(&hdr->b_refcnt));
	destroy_hdr = !HDR_IO_IN_PROGRESS(hdr);
	mutex_exit(&arc_eviction_mtx);
	if (destroy_hdr)
	arc_hdr_destroy(hdr);
	} else {
	if (remove_reference(hdr, NULL, tag) > 0) {
	ASSERT(HDR_IO_ERROR(hdr));
	arc_buf_destroy(buf, FALSE, TRUE);
	} else {
	arc_hdr_destroy(hdr);
	}
	}
	}

	int
	arc_buf_remove_ref(arc_buf_t *buf, void *tag)
	{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	kmutex_t *hash_lock = HDR_LOCK(hdr);
	int no_callback = (buf->b_efunc == NULL);

	if (hdr->b_state == arc_anon) {
	arc_buf_free(buf, tag);
	return (no_callback);
	}

	mutex_enter(hash_lock);
	ASSERT(hdr->b_state != arc_anon);
	ASSERT(buf->b_data != NULL);

	(void) remove_reference(hdr, hash_lock, tag);
	if (hdr->b_datacnt > 1) {
	if (no_callback)
	arc_buf_destroy(buf, FALSE, TRUE);
	} else if (no_callback) {
	ASSERT(hdr->b_buf == buf && buf->b_next == NULL);
	hdr->b_flags |= ARC_BUF_AVAILABLE;
	}
	ASSERT(no_callback || hdr->b_datacnt > 1 ||
	refcount_is_zero(&hdr->b_refcnt));
	mutex_exit(hash_lock);
	return (no_callback);
	}

	int
	arc_buf_size(arc_buf_t *buf)
	{
	return (buf->b_hdr->b_size);
	}

	/*
	* Evict buffers from list until we've removed the specified number of
	* bytes. Move the removed buffers to the appropriate evict state.
	* If the recycle flag is set, then attempt to "recycle" a buffer:
	* - look for a buffer to evict that is `bytes' long.
	* - return the data block from this buffer rather than freeing it.
	* This flag is used by callers that are trying to make space for a
	* new buffer in a full arc cache.
	*
	* This function makes a "best effort". It skips over any buffers
	* it can't get a hash_lock on, and so may not catch all candidates.
	* It may also return without evicting as much space as requested.
	*/
	static void *
	arc_evict(arc_state_t *state, spa_t *spa, int64_t bytes, boolean_t recycle,
	arc_buf_contents_t type)
	{
	arc_state_t *evicted_state;
	uint64_t bytes_evicted = 0, skipped = 0, missed = 0;
	int64_t bytes_remaining;
	arc_buf_hdr_t *ab, *ab_prev = NULL;
	list_t *evicted_list, *list, *evicted_list_start, *list_start;
	kmutex_t *lock, *evicted_lock;
	kmutex_t *hash_lock;
	boolean_t have_lock;
	void *stolen = NULL;
	static int evict_metadata_offset, evict_data_offset;
	int i, idx, offset, list_count, count;

	ASSERT(state == arc_mru || state == arc_mfu);

	evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;

	if (type == ARC_BUFC_METADATA) {
	offset = 0;
	list_count = ARC_BUFC_NUMMETADATALISTS;
	list_start = &state->arcs_lists[0];
	evicted_list_start = &evicted_state->arcs_lists[0];
	idx = evict_metadata_offset;
	} else {
	offset = ARC_BUFC_NUMMETADATALISTS;
	list_start = &state->arcs_lists[offset];
	evicted_list_start = &evicted_state->arcs_lists[offset];
	list_count = ARC_BUFC_NUMDATALISTS;
	idx = evict_data_offset;
	}
	bytes_remaining = evicted_state->arcs_lsize[type];
	count = 0;

	evict_start:
	list = &list_start[idx];
	evicted_list = &evicted_list_start[idx];
	lock = ARCS_LOCK(state, (offset + idx));
	evicted_lock = ARCS_LOCK(evicted_state, (offset + idx));

	mutex_enter(lock);
	mutex_enter(evicted_lock);

	for (ab = list_tail(list); ab; ab = ab_prev) {
	ab_prev = list_prev(list, ab);
	bytes_remaining -= (ab->b_size * ab->b_datacnt);
	/* prefetch buffers have a minimum lifespan */
	if (HDR_IO_IN_PROGRESS(ab) ||
	(spa && ab->b_spa != spa) ||
	(ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
	LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
	skipped++;
	continue;
	}
	/* "lookahead" for better eviction candidate */
	if (recycle && ab->b_size != bytes &&
	ab_prev && ab_prev->b_size == bytes)
	continue;
	hash_lock = HDR_LOCK(ab);
	have_lock = MUTEX_HELD(hash_lock);
	if (have_lock || mutex_tryenter(hash_lock)) {
	ASSERT3U(refcount_count(&ab->b_refcnt), ==, 0);
	ASSERT(ab->b_datacnt > 0);
	while (ab->b_buf) {
	arc_buf_t *buf = ab->b_buf;
	if (!rw_tryenter(&buf->b_lock, RW_WRITER)) {
	missed += 1;
	break;
	}
	if (buf->b_data) {
	bytes_evicted += ab->b_size;
	if (recycle && ab->b_type == type &&
	ab->b_size == bytes &&
	!HDR_L2_WRITING(ab)) {
	stolen = buf->b_data;
	recycle = FALSE;
	}
	}
	if (buf->b_efunc) {
	mutex_enter(&arc_eviction_mtx);
	arc_buf_destroy(buf,
	buf->b_data == stolen, FALSE);
	ab->b_buf = buf->b_next;
	buf->b_hdr = &arc_eviction_hdr;
	buf->b_next = arc_eviction_list;
	arc_eviction_list = buf;
	mutex_exit(&arc_eviction_mtx);
	rw_exit(&buf->b_lock);
	} else {
	rw_exit(&buf->b_lock);
	arc_buf_destroy(buf,
	buf->b_data == stolen, TRUE);
	}
	}

	if (ab->b_l2hdr) {
	ARCSTAT_INCR(arcstat_evict_l2_cached,
	ab->b_size);
	} else {
	if (l2arc_write_eligible(ab->b_spa, ab)) {
	ARCSTAT_INCR(arcstat_evict_l2_eligible,
	ab->b_size);
	} else {
	ARCSTAT_INCR(
	arcstat_evict_l2_ineligible,
	ab->b_size);
	}
	}

	if (ab->b_datacnt == 0) {
	arc_change_state(evicted_state, ab, hash_lock);
	ASSERT(HDR_IN_HASH_TABLE(ab));
	ab->b_flags |= ARC_IN_HASH_TABLE;
	ab->b_flags &= ~ARC_BUF_AVAILABLE;
	DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, ab);
	}
	if (!have_lock)
	mutex_exit(hash_lock);
	if (bytes >= 0 && bytes_evicted >= bytes)
	break;
	if (bytes_remaining > 0) {
	mutex_exit(evicted_lock);
	mutex_exit(lock);
	idx = ((idx + 1) & (list_count - 1));
	count++;
	goto evict_start;
	}
	} else {
	missed += 1;
	}
	}

	mutex_exit(evicted_lock);
	mutex_exit(lock);

	idx = ((idx + 1) & (list_count - 1));
	count++;

	if (bytes_evicted < bytes) {
	if (count < list_count)
	goto evict_start;
	else
	dprintf("only evicted %lld bytes from %p",
	(longlong_t)bytes_evicted, state);
	}
	if (type == ARC_BUFC_METADATA)
	evict_metadata_offset = idx;
	else
	evict_data_offset = idx;

	if (skipped)
	ARCSTAT_INCR(arcstat_evict_skip, skipped);

	if (missed)
	ARCSTAT_INCR(arcstat_mutex_miss, missed);

	/*
	* We have just evicted some data into the ghost state, make
	* sure we also adjust the ghost state size if necessary.
	*/
	if (arc_no_grow &&
	arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size > arc_c) {
	int64_t mru_over = arc_anon->arcs_size + arc_mru->arcs_size +
	arc_mru_ghost->arcs_size - arc_c;

	if (mru_over > 0 && arc_mru_ghost->arcs_lsize[type] > 0) {
	int64_t todelete =
	MIN(arc_mru_ghost->arcs_lsize[type], mru_over);
	arc_evict_ghost(arc_mru_ghost, NULL, todelete);
	} else if (arc_mfu_ghost->arcs_lsize[type] > 0) {
	int64_t todelete = MIN(arc_mfu_ghost->arcs_lsize[type],
	arc_mru_ghost->arcs_size +
	arc_mfu_ghost->arcs_size - arc_c);
	arc_evict_ghost(arc_mfu_ghost, NULL, todelete);
	}
	}
	if (stolen)
	ARCSTAT_BUMP(arcstat_stolen);

	return (stolen);
	}

	/*
	* Remove buffers from list until we've removed the specified number of
	* bytes. Destroy the buffers that are removed.
	*/
	static void
	arc_evict_ghost(arc_state_t *state, spa_t *spa, int64_t bytes)
	{
	arc_buf_hdr_t *ab, *ab_prev;
	list_t *list, *list_start;
	kmutex_t *hash_lock, *lock;
	uint64_t bytes_deleted = 0;
	uint64_t bufs_skipped = 0;
	static int evict_offset;
	int list_count, idx = evict_offset;
	int offset, count = 0;

	ASSERT(GHOST_STATE(state));

	/*
	* data lists come after metadata lists
	*/
	list_start = &state->arcs_lists[ARC_BUFC_NUMMETADATALISTS];
	list_count = ARC_BUFC_NUMDATALISTS;
	offset = ARC_BUFC_NUMMETADATALISTS;

	evict_start:
	list = &list_start[idx];
	lock = ARCS_LOCK(state, idx + offset);

	mutex_enter(lock);
	for (ab = list_tail(list); ab; ab = ab_prev) {
	ab_prev = list_prev(list, ab);
	if (spa && ab->b_spa != spa)
	continue;
	hash_lock = HDR_LOCK(ab);
	if (mutex_tryenter(hash_lock)) {
	ASSERT(!HDR_IO_IN_PROGRESS(ab));
	ASSERT(ab->b_buf == NULL);
	ARCSTAT_BUMP(arcstat_deleted);
	bytes_deleted += ab->b_size;

	if (ab->b_l2hdr != NULL) {
	/*
	* This buffer is cached on the 2nd Level ARC;
	* don't destroy the header.
	*/
	arc_change_state(arc_l2c_only, ab, hash_lock);
	mutex_exit(hash_lock);
	} else {
	arc_change_state(arc_anon, ab, hash_lock);
	mutex_exit(hash_lock);
	arc_hdr_destroy(ab);
	}

	DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, ab);
	if (bytes >= 0 && bytes_deleted >= bytes)
	break;
	} else {
	if (bytes < 0) {
	/*
	* we're draining the ARC, retry
	*/
	mutex_exit(lock);
	mutex_enter(hash_lock);
	mutex_exit(hash_lock);
	goto evict_start;
	}
	bufs_skipped += 1;
	}
	}
	mutex_exit(lock);
	idx = ((idx + 1) & (ARC_BUFC_NUMDATALISTS - 1));
	count++;

	if (count < list_count)
	goto evict_start;

	evict_offset = idx;
	if ((uintptr_t)list > (uintptr_t)&state->arcs_lists[ARC_BUFC_NUMMETADATALISTS] &&
	(bytes < 0 || bytes_deleted < bytes)) {
	list_start = &state->arcs_lists[0];
	list_count = ARC_BUFC_NUMMETADATALISTS;
	offset = count = 0;
	goto evict_start;
	}

	if (bufs_skipped) {
	ARCSTAT_INCR(arcstat_mutex_miss, bufs_skipped);
	ASSERT(bytes >= 0);
	}

	if (bytes_deleted < bytes)
	dprintf("only deleted %lld bytes from %p",
	(longlong_t)bytes_deleted, state);
	}

	static void
	arc_adjust(void)
	{
	int64_t adjustment, delta;

	/*
	* Adjust MRU size
	*/

	adjustment = MIN(arc_size - arc_c,
	arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used - arc_p);

	if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
	delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
	(void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
	adjustment -= delta;
	}

	if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) {
	delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment);
	(void) arc_evict(arc_mru, NULL, delta, FALSE,
	ARC_BUFC_METADATA);
	}

	/*
	* Adjust MFU size
	*/

	adjustment = arc_size - arc_c;

	if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) {
	delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]);
	(void) arc_evict(arc_mfu, NULL, delta, FALSE, ARC_BUFC_DATA);
	adjustment -= delta;
	}

	if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) {
	int64_t delta = MIN(adjustment,
	arc_mfu->arcs_lsize[ARC_BUFC_METADATA]);
	(void) arc_evict(arc_mfu, NULL, delta, FALSE,
	ARC_BUFC_METADATA);
	}

	/*
	* Adjust ghost lists
	*/

	adjustment = arc_mru->arcs_size + arc_mru_ghost->arcs_size - arc_c;

	if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
	delta = MIN(arc_mru_ghost->arcs_size, adjustment);
	arc_evict_ghost(arc_mru_ghost, NULL, delta);
	}

	adjustment =
	arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size - arc_c;

	if (adjustment > 0 && arc_mfu_ghost->arcs_size > 0) {
	delta = MIN(arc_mfu_ghost->arcs_size, adjustment);
	arc_evict_ghost(arc_mfu_ghost, NULL, delta);
	}
	}

	static void
	arc_do_user_evicts(void)
	{
	static arc_buf_t *tmp_arc_eviction_list;

	/*
	* Move list over to avoid LOR
	*/
	restart:
	mutex_enter(&arc_eviction_mtx);
	tmp_arc_eviction_list = arc_eviction_list;
	arc_eviction_list = NULL;
	mutex_exit(&arc_eviction_mtx);

	while (tmp_arc_eviction_list != NULL) {
	arc_buf_t *buf = tmp_arc_eviction_list;
	tmp_arc_eviction_list = buf->b_next;
	rw_enter(&buf->b_lock, RW_WRITER);
	buf->b_hdr = NULL;
	rw_exit(&buf->b_lock);

	if (buf->b_efunc != NULL)
	VERIFY(buf->b_efunc(buf) == 0);

	buf->b_efunc = NULL;
	buf->b_private = NULL;
	kmem_cache_free(buf_cache, buf);
	}

	if (arc_eviction_list != NULL)
	goto restart;
	}

	/*
	* Flush all evictable data from the cache for the given spa.
	* NOTE: this will not touch "active" (i.e. referenced) data.
	*/
	void
	arc_flush(spa_t *spa)
	{
	while (arc_mru->arcs_lsize[ARC_BUFC_DATA]) {
	(void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_DATA);
	if (spa)
	break;
	}
	while (arc_mru->arcs_lsize[ARC_BUFC_METADATA]) {
	(void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_METADATA);
	if (spa)
	break;
	}
	while (arc_mfu->arcs_lsize[ARC_BUFC_DATA]) {
	(void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_DATA);
	if (spa)
	break;
	}
	while (arc_mfu->arcs_lsize[ARC_BUFC_METADATA]) {
	(void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_METADATA);
	if (spa)
	break;
	}

	arc_evict_ghost(arc_mru_ghost, spa, -1);
	arc_evict_ghost(arc_mfu_ghost, spa, -1);

	mutex_enter(&arc_reclaim_thr_lock);
	arc_do_user_evicts();
	mutex_exit(&arc_reclaim_thr_lock);
	ASSERT(spa || arc_eviction_list == NULL);
	}

	void
	arc_shrink(void)
	{
	if (arc_c > arc_c_min) {
	uint64_t to_free;

	#ifdef _KERNEL
	to_free = arc_c >> arc_shrink_shift;
	#else
	to_free = arc_c >> arc_shrink_shift;
	#endif
	if (arc_c > arc_c_min + to_free)
	atomic_add_64(&arc_c, -to_free);
	else
	arc_c = arc_c_min;

	atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
	if (arc_c > arc_size)
	arc_c = MAX(arc_size, arc_c_min);
	if (arc_p > arc_c)
	arc_p = (arc_c >> 1);
	ASSERT(arc_c >= arc_c_min);
	ASSERT((int64_t)arc_p >= 0);
	}

	if (arc_size > arc_c)
	arc_adjust();
	}

	static int needfree = 0;

	static int
	arc_reclaim_needed(void)
	{
	#if 0
	uint64_t extra;
	#endif

	#ifdef _KERNEL
	if (needfree)
	return (1);
	if (arc_size > arc_c_max)
	return (1);
	if (arc_size <= arc_c_min)
	return (0);

	/*
	* If pages are needed, or we're within 2048 pages
	* of needing to page, we need to reclaim.
	*/
	if (vm_pages_needed || (vm_paging_target() > -2048))
	return (1);

	#if 0
	/*
	* take 'desfree' extra pages, so we reclaim sooner, rather than later
	*/
	extra = desfree;

	/*
	* check that we're out of range of the pageout scanner. It starts to
	* schedule paging if freemem is less than lotsfree and needfree.
	* lotsfree is the high-water mark for pageout, and needfree is the
	* number of needed free pages. We add extra pages here to make sure
	* the scanner doesn't start up while we're freeing memory.
	*/
	if (freemem < lotsfree + needfree + extra)
	return (1);

	/*
	* check to make sure that swapfs has enough space so that anon
	* reservations can still succeed. anon_resvmem() checks that the
	* availrmem is greater than swapfs_minfree, and the number of reserved
	* swap pages. We also add a bit of extra here just to prevent
	* circumstances from getting really dire.
	*/
	if (availrmem < swapfs_minfree + swapfs_reserve + extra)
	return (1);

	#if defined(__i386)
	/*
	* If we're on an i386 platform, it's possible that we'll exhaust the
	* kernel heap space before we ever run out of available physical
	* memory. Most checks of the size of the heap_area compare against
	* tune.t_minarmem, which is the minimum available real memory that we
	* can have in the system. However, this is generally fixed at 25 pages
	* which is so low that it's useless. In this comparison, we seek to
	* calculate the total heap-size, and reclaim if more than 3/4ths of the
	* heap is allocated. (Or, in the calculation, if less than 1/4th is
	* free)
	*/
	if (btop(vmem_size(heap_arena, VMEM_FREE)) <
	(btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
	return (1);
	#endif
	#else
	if (kmem_used() > (kmem_size() * 3) / 4)
	return (1);
	#endif

	#else
	if (spa_get_random(100) == 0)
	return (1);
	#endif
	return (0);
	}

	extern kmem_cache_t *zio_buf_cache[];
	extern kmem_cache_t *zio_data_buf_cache[];

	static void
	arc_kmem_reap_now(arc_reclaim_strategy_t strat)
	{
	size_t i;
	kmem_cache_t *prev_cache = NULL;
	kmem_cache_t *prev_data_cache = NULL;

	#ifdef _KERNEL
	if (arc_meta_used >= arc_meta_limit) {
	/*
	* We are exceeding our meta-data cache limit.
	* Purge some DNLC entries to release holds on meta-data.
	*/
	dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
	}
	#if defined(__i386)
	/*
	* Reclaim unused memory from all kmem caches.
	*/
	kmem_reap();
	#endif
	#endif

	/*
	* An aggressive reclamation will shrink the cache size as well as
	* reap free buffers from the arc kmem caches.
	*/
	if (strat == ARC_RECLAIM_AGGR)
	arc_shrink();

	for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
	if (zio_buf_cache[i] != prev_cache) {
	prev_cache = zio_buf_cache[i];
	kmem_cache_reap_now(zio_buf_cache[i]);
	}
	if (zio_data_buf_cache[i] != prev_data_cache) {
	prev_data_cache = zio_data_buf_cache[i];
	kmem_cache_reap_now(zio_data_buf_cache[i]);
	}
	}
	kmem_cache_reap_now(buf_cache);
	kmem_cache_reap_now(hdr_cache);
	}

	static void
	arc_reclaim_thread(void *dummy __unused)
	{
	clock_t growtime = 0;
	arc_reclaim_strategy_t last_reclaim = ARC_RECLAIM_CONS;
	callb_cpr_t cpr;

	CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG);

	mutex_enter(&arc_reclaim_thr_lock);
	while (arc_thread_exit == 0) {
	if (arc_reclaim_needed()) {

	if (arc_no_grow) {
	if (last_reclaim == ARC_RECLAIM_CONS) {
	last_reclaim = ARC_RECLAIM_AGGR;
	} else {
	last_reclaim = ARC_RECLAIM_CONS;
	}
	} else {
	arc_no_grow = TRUE;
	last_reclaim = ARC_RECLAIM_AGGR;
	membar_producer();
	}

	/* reset the growth delay for every reclaim */
	growtime = LBOLT + (arc_grow_retry * hz);

	if (needfree && last_reclaim == ARC_RECLAIM_CONS) {
	/*
	* If needfree is TRUE our vm_lowmem hook
	* was called and in that case we must free some
	* memory, so switch to aggressive mode.
	*/
	arc_no_grow = TRUE;
	last_reclaim = ARC_RECLAIM_AGGR;
	}
	arc_kmem_reap_now(last_reclaim);
	arc_warm = B_TRUE;

	} else if (arc_no_grow && LBOLT >= growtime) {
	arc_no_grow = FALSE;
	}

	if (needfree ||
	(2 * arc_c < arc_size +
	arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size))
	arc_adjust();

	if (arc_eviction_list != NULL)
	arc_do_user_evicts();

	if (arc_reclaim_needed()) {
	needfree = 0;
	#ifdef _KERNEL
	wakeup(&needfree);
	#endif
	}

	/* block until needed, or one second, whichever is shorter */
	CALLB_CPR_SAFE_BEGIN(&cpr);
	(void) cv_timedwait(&arc_reclaim_thr_cv,
	&arc_reclaim_thr_lock, hz);
	CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_thr_lock);
	}

	arc_thread_exit = 0;
	cv_broadcast(&arc_reclaim_thr_cv);
	CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_thr_lock */
	thread_exit();
	}
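
The way arc_reclaim_thread() alternates between the two reclaim strategies can be sketched in isolation. This is a minimal user-space model, not the kernel code: `strategy_t`, `reclaim_state_t`, and `next_strategy()` are hypothetical names, and the sketch covers only the branch taken when a reclaim is needed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for arc_reclaim_strategy_t and the thread's state. */
typedef enum { RECLAIM_CONS, RECLAIM_AGGR } strategy_t;

typedef struct {
	bool no_grow;		/* models arc_no_grow */
	strategy_t last;	/* models last_reclaim */
} reclaim_state_t;

/*
 * Choose the strategy for one pass in which reclaim is needed.
 * The first pass is aggressive and forbids growth; later passes
 * alternate, unless needfree forces an aggressive pass.
 */
static strategy_t
next_strategy(reclaim_state_t *s, bool needfree)
{
	assert(s != NULL);

	if (s->no_grow) {
		s->last = (s->last == RECLAIM_CONS) ?
		    RECLAIM_AGGR : RECLAIM_CONS;
	} else {
		s->no_grow = true;
		s->last = RECLAIM_AGGR;
	}

	/* A low-memory event overrides a conservative choice. */
	if (needfree && s->last == RECLAIM_CONS) {
		s->no_grow = true;
		s->last = RECLAIM_AGGR;
	}
	return (s->last);
}
```

In this model, repeated calls with needfree clear alternate between aggressive and conservative passes, while calls with needfree set always come out aggressive.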

	/*
	* Adapt arc info given the number of bytes we are trying to add and
	* the state that we are comming from. This function is only called
	* when we are adding new content to the cache.
	*/
	static void
	arc_adapt(int bytes, arc_state_t *state)
	{
	int mult;
	uint64_t arc_p_min = (arc_c >> arc_p_min_shift);

	if (state == arc_l2c_only)
	return;

	ASSERT(bytes > 0);
	/*
	* Adapt the target size of the MRU list:
	* - if we just hit in the MRU ghost list, then increase
	* the target size of the MRU list.
	* - if we just hit in the MFU ghost list, then increase
	* the target size of the MFU list by decreasing the
	* target size of the MRU list.
	*/
	if (state == arc_mru_ghost) {
	mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
	1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size));

	arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
	} else if (state == arc_mfu_ghost) {
	uint64_t delta;

	mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ?
	1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size));

	delta = MIN(bytes * mult, arc_p);
	arc_p = MAX(arc_p_min, arc_p - delta);
	}
	ASSERT((int64_t)arc_p >= 0);

	if (arc_reclaim_needed()) {
	cv_signal(&arc_reclaim_thr_cv);
	return;
	}

	if (arc_no_grow)
	return;

	if (arc_c >= arc_c_max)
	return;

	/*
	* If we're within (2 * maxblocksize) bytes of the target
	* cache size, increment the target cache size
	*/
	if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
	atomic_add_64(&arc_c, (int64_t)bytes);
	if (arc_c > arc_c_max)
	arc_c = arc_c_max;
	else if (state == arc_anon)
	atomic_add_64(&arc_p, (int64_t)bytes);
	if (arc_p > arc_c)
	arc_p = arc_c;
	}
	ASSERT((int64_t)arc_p >= 0);
	}
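
The ghost-hit arithmetic in arc_adapt() can be tried out on its own. The sketch below is an illustrative user-space model under stated assumptions: `adapt_p()`, `ghost_hit_t`, and the helper functions are hypothetical names, the globals are passed in as plain parameters, and both ghost sizes are assumed to be non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical tag for which ghost list was hit. */
typedef enum { HIT_MRU_GHOST, HIT_MFU_GHOST } ghost_hit_t;

static uint64_t
min_u64(uint64_t a, uint64_t b)
{
	return (a < b ? a : b);
}

static uint64_t
max_u64(uint64_t a, uint64_t b)
{
	return (a > b ? a : b);
}

/*
 * Return the new MRU target p after a ghost hit of `bytes` bytes.
 * c is the total target size and p_min the floor for p; mru_ghost
 * and mfu_ghost are the ghost list sizes (assumed non-zero here).
 */
static uint64_t
adapt_p(uint64_t p, uint64_t c, uint64_t p_min, uint64_t bytes,
    uint64_t mru_ghost, uint64_t mfu_ghost, ghost_hit_t hit)
{
	uint64_t mult, delta;

	assert(bytes > 0 && mru_ghost > 0 && mfu_ghost > 0);

	if (hit == HIT_MRU_GHOST) {
		/* Grow p faster when the MRU ghost list is the smaller one. */
		mult = (mru_ghost >= mfu_ghost) ? 1 : mfu_ghost / mru_ghost;
		return (min_u64(c - p_min, p + bytes * mult));
	}

	/* MFU ghost hit: shrink p, but never below p_min. */
	mult = (mfu_ghost >= mru_ghost) ? 1 : mru_ghost / mfu_ghost;
	delta = min_u64(bytes * mult, p);
	return (max_u64(p_min, p - delta));
}
```

For example, with p = 500, c = 1000, p_min = 100 and a 10-byte hit, a ghost-size ratio of 3 moves p by 30 bytes in the direction of the list that was hit.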

	/*
	* Check if the cache has reached its limits and eviction is required
	* prior to insert.
	*/
	static int
	arc_evict_needed(arc_buf_contents_t type)
	{
	if (type == ARC_BUFC_METADATA && arc_meta_used >= arc_meta_limit)
	return (1);

	#if 0
	#ifdef _KERNEL
	/*
	* If zio data pages are being allocated out of a separate heap segment,
	* then enforce that the size of available vmem for this area remains
	* above about 1/32nd free.
	*/
	if (type == ARC_BUFC_DATA && zio_arena != NULL &&
	vmem_size(zio_arena, VMEM_FREE) <
	(vmem_size(zio_arena, VMEM_ALLOC) >> 5))
	return (1);
	#endif
	#endif

	if (arc_reclaim_needed())
	return (1);

	return (arc_size > arc_c);
	}

	/*
	* The buffer, supplied as the first argument, needs a data block.
	* So, if we are at cache max, determine which cache should be victimized.
	* We have the following cases:
	*
	* 1. Insert for MRU, p > sizeof(arc_anon + arc_mru) ->
	* In this situation if we're out of space, but the resident size of the MFU is
	* under the limit, victimize the MFU cache to satisfy this insertion request.
	*
	* 2. Insert for MRU, p <= sizeof(arc_anon + arc_mru) ->
	* Here, we've used up all of the available space for the MRU, so we need to
	* evict from our own cache instead. Evict from the set of resident MRU
	* entries.
	*
	* 3. Insert for MFU (c - p) > sizeof(arc_mfu) ->
	* c minus p represents the MFU space in the cache, since p is the size of the
	* cache that is dedicated to the MRU. In this situation there's still space on
	* the MFU side, so the MRU side needs to be victimized.
	*
	* 4. Insert for MFU (c - p) < sizeof(arc_mfu) ->
	* MFU's resident set is consuming more space than it has been allotted. In
	* this situation, we must victimize our own cache, the MFU, for this insertion.
	*/
	static void
	arc_get_data_buf(arc_buf_t *buf)
	{
	arc_state_t *state = buf->b_hdr->b_state;
	uint64_t size = buf->b_hdr->b_size;
	arc_buf_contents_t type = buf->b_hdr->b_type;

	arc_adapt(size, state);

	/*
	* We have not yet reached cache maximum size,
	* just allocate a new buffer.
	*/
	if (!arc_evict_needed(type)) {
	if (type == ARC_BUFC_METADATA) {
	buf->b_data = zio_buf_alloc(size);
	arc_space_consume(size, ARC_SPACE_DATA);
	} else {
	ASSERT(type == ARC_BUFC_DATA);
	buf->b_data = zio_data_buf_alloc(size);
	ARCSTAT_INCR(arcstat_data_size, size);
	atomic_add_64(&arc_size, size);
	}
	goto out;
	}

	/*
	* If we are prefetching from the mfu ghost list, this buffer
	* will end up on the mru list; so steal space from there.
	*/
	if (state == arc_mfu_ghost)
	state = buf->b_hdr->b_flags & ARC_PREFETCH ? arc_mru : arc_mfu;
	else if (state == arc_mru_ghost)
	state = arc_mru;

	if (state == arc_mru || state == arc_anon) {
	uint64_t mru_used = arc_anon->arcs_size + arc_mru->arcs_size;
	state = (arc_mfu->arcs_lsize[type] >= size &&
	arc_p > mru_used) ? arc_mfu : arc_mru;
	} else {
	/* MFU cases */
	uint64_t mfu_space = arc_c - arc_p;
	state = (arc_mru->arcs_lsize[type] >= size &&
	mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu;
	}
	if ((buf->b_data = arc_evict(state, NULL, size, TRUE, type)) == NULL) {
	if (type == ARC_BUFC_METADATA) {
	buf->b_data = zio_buf_alloc(size);
	arc_space_consume(size, ARC_SPACE_DATA);
	} else {
	ASSERT(type == ARC_BUFC_DATA);
	buf->b_data = zio_data_buf_alloc(size);
	ARCSTAT_INCR(arcstat_data_size, size);
	atomic_add_64(&arc_size, size);
	}
	ARCSTAT_BUMP(arcstat_recycle_miss);
	}
	ASSERT(buf->b_data != NULL);
	out:
	/*
	* Update the state size. Note that ghost states have a
	* "ghost size" and so don't need to be updated.
	*/
	if (!GHOST_STATE(buf->b_hdr->b_state)) {
	arc_buf_hdr_t *hdr = buf->b_hdr;

	atomic_add_64(&hdr->b_state->arcs_size, size);
	if (list_link_active(&hdr->b_arc_node)) {
	ASSERT(refcount_is_zero(&hdr->b_refcnt));
	atomic_add_64(&hdr->b_state->arcs_lsize[type], size);
	}
	/*
	* If we are growing the cache, and we are adding anonymous
	* data, and we have outgrown arc_p, update arc_p
	*/
	if (arc_size < arc_c && hdr->b_state == arc_anon &&
	arc_anon->arcs_size + arc_mru->arcs_size > arc_p)
	arc_p = MIN(arc_c, arc_p + size);
	}
	ARCSTAT_BUMP(arcstat_allocated);
	}
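
The four victim-selection cases described above arc_get_data_buf() reduce to two comparisons. The following is a rough model of that decision only; `arc_model_t`, `side_t`, and `pick_victim()` are hypothetical names, and the evictable fields stand in for `arcs_lsize[type]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { SIDE_MRU, SIDE_MFU } side_t;

/* Hypothetical snapshot of the sizes the decision depends on. */
typedef struct {
	uint64_t c;		/* total target size (arc_c) */
	uint64_t p;		/* MRU target size (arc_p) */
	uint64_t anon_size;	/* resident anonymous bytes */
	uint64_t mru_size;	/* resident MRU bytes */
	uint64_t mfu_size;	/* resident MFU bytes */
	uint64_t mru_evictable;	/* evictable MRU bytes of this type */
	uint64_t mfu_evictable;	/* evictable MFU bytes of this type */
} arc_model_t;

/* Decide which side to evict from for an insert of `size` bytes. */
static side_t
pick_victim(const arc_model_t *a, side_t insert_for, uint64_t size)
{
	assert(a != NULL);

	if (insert_for == SIDE_MRU) {
		/* Cases 1 and 2: is the MRU side still under its target? */
		uint64_t mru_used = a->anon_size + a->mru_size;

		return ((a->mfu_evictable >= size && a->p > mru_used) ?
		    SIDE_MFU : SIDE_MRU);
	}

	/* Cases 3 and 4: is the MFU side still under c - p? */
	uint64_t mfu_space = a->c - a->p;

	return ((a->mru_evictable >= size && mfu_space > a->mfu_size) ?
	    SIDE_MRU : SIDE_MFU);
}
```

The other side is victimized only when it actually holds enough evictable bytes; otherwise the inserting side evicts from itself.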

	/*
	* This routine is called whenever a buffer is accessed.
	* NOTE: the hash lock is dropped in this function.
	*/
	static void
	arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock)
	{
	ASSERT(MUTEX_HELD(hash_lock));

	if (buf->b_state == arc_anon) {
	/*
	* This buffer is not in the cache, and does not
	* appear in our "ghost" list. Add the new buffer
	* to the MRU state.
	*/

	ASSERT(buf->b_arc_access == 0);
	buf->b_arc_access = LBOLT;
	DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
	arc_change_state(arc_mru, buf, hash_lock);

	} else if (buf->b_state == arc_mru) {
	/*
	* If this buffer is here because of a prefetch, then either:
	* - clear the flag if this is a "referencing" read
	* (any subsequent access will bump this into the MFU state).
	* or
	* - move the buffer to the head of the list if this is
	* another prefetch (to make it less likely to be evicted).
	*/
	if ((buf->b_flags & ARC_PREFETCH) != 0) {
	if (refcount_count(&buf->b_refcnt) == 0) {
	ASSERT(list_link_active(&buf->b_arc_node));
	} else {
	buf->b_flags &= ~ARC_PREFETCH;
	ARCSTAT_BUMP(arcstat_mru_hits);
	}
	buf->b_arc_access = LBOLT;
	return;
	}

	/*
	* This buffer has been "accessed" only once so far,
	* but it is still in the cache. Move it to the MFU
	* state.
	*/
	if (LBOLT > buf->b_arc_access + ARC_MINTIME) {
	/*
	* More than 125ms have passed since we
	* instantiated this buffer. Move it to the
	* most frequently used state.
	*/
	buf->b_arc_access = LBOLT;
	DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
	arc_change_state(arc_mfu, buf, hash_lock);
	}
	ARCSTAT_BUMP(arcstat_mru_hits);
	} else if (buf->b_state == arc_mru_ghost) {
	arc_state_t *new_state;
	/*
	* This buffer has been "accessed" recently, but
	* was evicted from the cache. Move it to the
	* MFU state.
	*/

	if (buf->b_flags & ARC_PREFETCH) {
	new_state = arc_mru;
	if (refcount_count(&buf->b_refcnt) > 0)
	buf->b_flags &= ~ARC_PREFETCH;
	DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
	} else {
	new_state = arc_mfu;
	DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
	}

	buf->b_arc_access = LBOLT;
	arc_change_state(new_state, buf, hash_lock);

	ARCSTAT_BUMP(arcstat_mru_ghost_hits);
	} else if (buf->b_state == arc_mfu) {
	/*
	* This buffer has been accessed more than once and is
	* still in the cache. Keep it in the MFU state.
	*
	* NOTE: an add_reference() that occurred when we did
	* the arc_read() will have kicked this off the list.
	* If it was a prefetch, we will explicitly move it to
	* the head of the list now.
	*/
	if ((buf->b_flags & ARC_PREFETCH) != 0) {
	ASSERT(refcount_count(&buf->b_refcnt) == 0);
	ASSERT(list_link_active(&buf->b_arc_node));
	}
	ARCSTAT_BUMP(arcstat_mfu_hits);
	buf->b_arc_access = LBOLT;
	} else if (buf->b_state == arc_mfu_ghost) {
	arc_state_t *new_state = arc_mfu;
	/*
	* This buffer has been accessed more than once but has
	* been evicted from the cache. Move it back to the
	* MFU state.
	*/

	if (buf->b_flags & ARC_PREFETCH) {
	/*
	* This is a prefetch access...
	* move this block back to the MRU state.
	*/
	ASSERT3U(refcount_count(&buf->b_refcnt), ==, 0);
	new_state = arc_mru;
	}

	buf->b_arc_access = LBOLT;
	DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
	arc_change_state(new_state, buf, hash_lock);

	ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
	} else if (buf->b_state == arc_l2c_only) {
	/*
	* This buffer is on the 2nd Level ARC.
	*/

	buf->b_arc_access = LBOLT;
	DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
	arc_change_state(arc_mfu, buf, hash_lock);
	} else {
	ASSERT(!"invalid arc state");
	}
	}
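
The state transitions performed by arc_access() can be summarized as a small pure function. This is a simplified sketch, assuming only the target state matters: `state_t` and `next_state()` are hypothetical names, and reference counts, statistics, and list manipulation are left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical tags for the ARC states handled by arc_access(). */
typedef enum {
	ST_ANON, ST_MRU, ST_MRU_GHOST, ST_MFU, ST_MFU_GHOST, ST_L2C_ONLY
} state_t;

/*
 * Return the state a header ends up in after an access.
 * `prefetch` models the ARC_PREFETCH flag; `now`, `last_access` and
 * `mintime` model LBOLT, b_arc_access and ARC_MINTIME in clock ticks.
 */
static state_t
next_state(state_t cur, bool prefetch, long now, long last_access,
    long mintime)
{
	assert(mintime >= 0);

	switch (cur) {
	case ST_ANON:
		return (ST_MRU);
	case ST_MRU:
		/* A prefetched buffer stays put; others need a later re-access. */
		if (prefetch)
			return (ST_MRU);
		return (now > last_access + mintime ? ST_MFU : ST_MRU);
	case ST_MRU_GHOST:
	case ST_MFU_GHOST:
		/* Ghost hits go to MFU unless this is a prefetch access. */
		return (prefetch ? ST_MRU : ST_MFU);
	case ST_MFU:
	case ST_L2C_ONLY:
		return (ST_MFU);
	}
	return (cur);
}
```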

	/* a generic arc_done_func_t which you can use */
	/* ARGSUSED */
	void
	arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
	{
	bcopy(buf->b_data, arg, buf->b_hdr->b_size);
	VERIFY(arc_buf_remove_ref(buf, arg) == 1);
	}

	/* a generic arc_done_func_t */
	void
	arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
	{
	arc_buf_t **bufp = arg;
	if (zio && zio->io_error) {
	VERIFY(arc_buf_remove_ref(buf, arg) == 1);
	*bufp = NULL;
	} else {
	*bufp = buf;
	}
	}

	static void
	arc_read_done(zio_t *zio)
	{
	arc_buf_hdr_t *hdr, *found;
	arc_buf_t *buf;
	arc_buf_t *abuf; /* buffer we're assigning to callback */
	kmutex_t *hash_lock;
	arc_callback_t *callback_list, *acb;
	int freeable = FALSE;

	buf = zio->io_private;
	hdr = buf->b_hdr;

	/*
	* The hdr was inserted into hash-table and removed from lists
	* prior to starting I/O. We should find this header, since
	* it's in the hash table, and it should be legit since it's
	* not possible to evict it during the I/O. The only possible
	* reason for it not to be found is if we were freed during the
	* read.
	*/
	found = buf_hash_find(zio->io_spa, &hdr->b_dva, hdr->b_birth,
	&hash_lock);

	ASSERT((found == NULL && HDR_FREED_IN_READ(hdr) && hash_lock == NULL) ||
	(found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
	(found == hdr && HDR_L2_READING(hdr)));

	hdr->b_flags &= ~ARC_L2_EVICTED;
	if (l2arc_noprefetch && (hdr->b_flags & ARC_PREFETCH))
	hdr->b_flags &= ~ARC_L2CACHE;

	/* byteswap if necessary */
	callback_list = hdr->b_acb;
	ASSERT(callback_list != NULL);
	- if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
	+ if (BP_SHOULD_BYTESWAP(zio->io_bp) && zio->io_error == 0) {
	arc_byteswap_func_t *func = BP_GET_LEVEL(zio->io_bp) > 0 ?
	byteswap_uint64_array :
	dmu_ot[BP_GET_TYPE(zio->io_bp)].ot_byteswap;
	func(buf->b_data, hdr->b_size);
	}

	arc_cksum_compute(buf, B_FALSE);

	/* create copies of the data buffer for the callers */
	abuf = buf;
	for (acb = callback_list; acb; acb = acb->acb_next) {
	if (acb->acb_done) {
	if (abuf == NULL)
	abuf = arc_buf_clone(buf);
	acb->acb_buf = abuf;
	abuf = NULL;
	}
	}
	hdr->b_acb = NULL;
	hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
	ASSERT(!HDR_BUF_AVAILABLE(hdr));
	if (abuf == buf)
	hdr->b_flags |= ARC_BUF_AVAILABLE;

	ASSERT(refcount_is_zero(&hdr->b_refcnt) || callback_list != NULL);

	if (zio->io_error != 0) {
	hdr->b_flags |= ARC_IO_ERROR;
	if (hdr->b_state != arc_anon)
	arc_change_state(arc_anon, hdr, hash_lock);
	if (HDR_IN_HASH_TABLE(hdr))
	buf_hash_remove(hdr);
	freeable = refcount_is_zero(&hdr->b_refcnt);
	}

	/*
	* Broadcast before we drop the hash_lock to avoid the possibility
	* that the hdr (and hence the cv) might be freed before we get to
	* the cv_broadcast().
	*/
	cv_broadcast(&hdr->b_cv);

	if (hash_lock) {
	/*
	* Only call arc_access on anonymous buffers. This is because
	* if we've issued an I/O for an evicted buffer, we've already
	* called arc_access (to prevent any simultaneous readers from
	* getting confused).
	*/
	if (zio->io_error == 0 && hdr->b_state == arc_anon)
	arc_access(hdr, hash_lock);
	mutex_exit(hash_lock);
	} else {
	/*
	* This block was freed while we waited for the read to
	* complete. It has been removed from the hash table and
	* moved to the anonymous state (so that it won't show up
	* in the cache).
	*/
	ASSERT3P(hdr->b_state, ==, arc_anon);
	freeable = refcount_is_zero(&hdr->b_refcnt);
	}

	/* execute each callback and free its structure */
	while ((acb = callback_list) != NULL) {
	if (acb->acb_done)
	acb->acb_done(zio, acb->acb_buf, acb->acb_private);

	if (acb->acb_zio_dummy != NULL) {
	acb->acb_zio_dummy->io_error = zio->io_error;
	zio_nowait(acb->acb_zio_dummy);
	}

	callback_list = acb->acb_next;
	kmem_free(acb, sizeof (arc_callback_t));
	}

	if (freeable)
	arc_hdr_destroy(hdr);
	}
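
The loop in arc_read_done() that hands out buffers to waiting callers gives the original buffer to the first callback with a "done" function and a clone to each later one. The sketch below models just that hand-out; `cb_t`, `assign_buffers()`, and the integer buffer ids are hypothetical stand-ins for `arc_callback_t`, the loop body, and `arc_buf_clone()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for arc_callback_t. */
typedef struct cb {
	bool has_done;		/* models acb_done != NULL */
	int buf_id;		/* models acb_buf; -1 means none assigned */
	struct cb *next;
} cb_t;

/*
 * Give the original buffer to the first callback that wants one and a
 * fresh clone id to every later one.  Returns true if nobody claimed
 * the original, which corresponds to setting ARC_BUF_AVAILABLE.
 */
static bool
assign_buffers(cb_t *list, int orig_id, int *next_clone_id)
{
	bool orig_unclaimed = true;

	assert(next_clone_id != NULL);

	for (cb_t *cb = list; cb != NULL; cb = cb->next) {
		if (!cb->has_done)
			continue;
		if (orig_unclaimed) {
			cb->buf_id = orig_id;
			orig_unclaimed = false;
		} else {
			cb->buf_id = (*next_clone_id)++;
		}
	}
	return (orig_unclaimed);
}
```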

	/*
	* "Read" the block block at the specified DVA (in bp) via the
	* cache. If the block is found in the cache, invoke the provided
	* callback immediately and return. Note that the `zio' parameter
	* in the callback will be NULL in this case, since no IO was
	* required. If the block is not in the cache pass the read request
	* on to the spa with a substitute callback function, so that the
	* requested block will be added to the cache.
	*
	* If a read request arrives for a block that has a read in-progress,
	* either wait for the in-progress read to complete (and return the
	* results); or, if this is a read with a "done" func, add a record
	* to the read to invoke the "done" func when the read completes,
	* and return; or just return.
	*
	* arc_read_done() will invoke all the requested "done" functions
	* for readers of this block.
	*
	* Normal callers should use arc_read and pass the arc buffer and offset
	* for the bp. But if you know you don't need locking, you can use
	* arc_read_bp.
	*/
	int
	arc_read(zio_t *pio, spa_t *spa, blkptr_t *bp, arc_buf_t *pbuf,
	arc_done_func_t *done, void *private, int priority, int zio_flags,
	uint32_t *arc_flags, const zbookmark_t *zb)
	{
	int err;

	ASSERT(!refcount_is_zero(&pbuf->b_hdr->b_refcnt));
	ASSERT3U((char *)bp - (char *)pbuf->b_data, <, pbuf->b_hdr->b_size);
	rw_enter(&pbuf->b_lock, RW_READER);

	err = arc_read_nolock(pio, spa, bp, done, private, priority,
	zio_flags, arc_flags, zb);
	rw_exit(&pbuf->b_lock);
	return (err);
	}

	int
	arc_read_nolock(zio_t *pio, spa_t *spa, blkptr_t *bp,
	arc_done_func_t *done, void *private, int priority, int zio_flags,
	uint32_t *arc_flags, const zbookmark_t *zb)
	{
	arc_buf_hdr_t *hdr;
	arc_buf_t *buf;
	kmutex_t *hash_lock;
	zio_t *rzio;

	top:
	hdr = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
	if (hdr && hdr->b_datacnt > 0) {

	*arc_flags |= ARC_CACHED;

	if (HDR_IO_IN_PROGRESS(hdr)) {

	if (*arc_flags & ARC_WAIT) {
	cv_wait(&hdr->b_cv, hash_lock);
	mutex_exit(hash_lock);
	goto top;
	}
	ASSERT(*arc_flags & ARC_NOWAIT);

	if (done) {
	arc_callback_t *acb = NULL;

	acb = kmem_zalloc(sizeof (arc_callback_t),
	KM_SLEEP);
	acb->acb_done = done;
	acb->acb_private = private;
	if (pio != NULL)
	acb->acb_zio_dummy = zio_null(pio,
	spa, NULL, NULL, zio_flags);

	ASSERT(acb->acb_done != NULL);
	acb->acb_next = hdr->b_acb;
	hdr->b_acb = acb;
	add_reference(hdr, hash_lock, private);
	mutex_exit(hash_lock);
	return (0);
	}
	mutex_exit(hash_lock);
	return (0);
	}

	ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);

	if (done) {
	add_reference(hdr, hash_lock, private);
	/*
	* If this block is already in use, create a new
	* copy of the data so that we will be guaranteed
	* that arc_release() will always succeed.
	*/
	buf = hdr->b_buf;
	ASSERT(buf);
	ASSERT(buf->b_data);
	if (HDR_BUF_AVAILABLE(hdr)) {
	ASSERT(buf->b_efunc == NULL);
	hdr->b_flags &= ~ARC_BUF_AVAILABLE;
	} else {
	buf = arc_buf_clone(buf);
	}
	} else if (*arc_flags & ARC_PREFETCH &&
	refcount_count(&hdr->b_refcnt) == 0) {
	hdr->b_flags |= ARC_PREFETCH;
	}
	DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
	arc_access(hdr, hash_lock);
	if (*arc_flags & ARC_L2CACHE)
	hdr->b_flags |= ARC_L2CACHE;
	mutex_exit(hash_lock);
	ARCSTAT_BUMP(arcstat_hits);
	ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
	demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
	data, metadata, hits);

	if (done)
	done(NULL, buf, private);
	} else {
	uint64_t size = BP_GET_LSIZE(bp);
	arc_callback_t *acb;
	vdev_t *vd = NULL;
	uint64_t addr;
	boolean_t devw = B_FALSE;

	if (hdr == NULL) {
	/* this block is not in the cache */
	arc_buf_hdr_t *exists;
	arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
	buf = arc_buf_alloc(spa, size, private, type);
	hdr = buf->b_hdr;
	hdr->b_dva = *BP_IDENTITY(bp);
	hdr->b_birth = bp->blk_birth;
	hdr->b_cksum0 = bp->blk_cksum.zc_word[0];
	exists = buf_hash_insert(hdr, &hash_lock);
	if (exists) {
	/* somebody beat us to the hash insert */
	mutex_exit(hash_lock);
	bzero(&hdr->b_dva, sizeof (dva_t));
	hdr->b_birth = 0;
	hdr->b_cksum0 = 0;
	(void) arc_buf_remove_ref(buf, private);
	goto top; /* restart the IO request */
	}
	/* if this is a prefetch, we don't have a reference */
	if (*arc_flags & ARC_PREFETCH) {
	(void) remove_reference(hdr, hash_lock,
	private);
	hdr->b_flags |= ARC_PREFETCH;
	}
	if (*arc_flags & ARC_L2CACHE)
	hdr->b_flags |= ARC_L2CACHE;
	if (BP_GET_LEVEL(bp) > 0)
	hdr->b_flags |= ARC_INDIRECT;
	} else {
	/* this block is in the ghost cache */
	ASSERT(GHOST_STATE(hdr->b_state));
	ASSERT(!HDR_IO_IN_PROGRESS(hdr));
	ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 0);
	ASSERT(hdr->b_buf == NULL);

	/* if this is a prefetch, we don't have a reference */
	if (*arc_flags & ARC_PREFETCH)
	hdr->b_flags |= ARC_PREFETCH;
	else
	add_reference(hdr, hash_lock, private);
	if (*arc_flags & ARC_L2CACHE)
	hdr->b_flags |= ARC_L2CACHE;
	buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
	buf->b_hdr = hdr;
	buf->b_data = NULL;
	buf->b_efunc = NULL;
	buf->b_private = NULL;
	buf->b_next = NULL;
	hdr->b_buf = buf;
	arc_get_data_buf(buf);
	ASSERT(hdr->b_datacnt == 0);
	hdr->b_datacnt = 1;

	}

	acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
	acb->acb_done = done;
	acb->acb_private = private;

	ASSERT(hdr->b_acb == NULL);
	hdr->b_acb = acb;
	hdr->b_flags |= ARC_IO_IN_PROGRESS;

	/*
	* If the buffer has been evicted, migrate it to a present state
	* before issuing the I/O. Once we drop the hash-table lock,
	* the header will be marked as I/O in progress and have an
	* attached buffer. At this point, anybody who finds this
	* buffer ought to notice that it's legit but has a pending I/O.
	*/

	if (GHOST_STATE(hdr->b_state))
	arc_access(hdr, hash_lock);

	if (HDR_L2CACHE(hdr) && hdr->b_l2hdr != NULL &&
	(vd = hdr->b_l2hdr->b_dev->l2ad_vdev) != NULL) {
	devw = hdr->b_l2hdr->b_dev->l2ad_writing;
	addr = hdr->b_l2hdr->b_daddr;
	/*
	* Lock out device removal.
	*/
	if (vdev_is_dead(vd) ||
	!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
	vd = NULL;
	}

	mutex_exit(hash_lock);

	ASSERT3U(hdr->b_size, ==, size);
	DTRACE_PROBE3(arc__miss, blkptr_t *, bp, uint64_t, size,
	zbookmark_t *, zb);
	ARCSTAT_BUMP(arcstat_misses);
	ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
	demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
	data, metadata, misses);

	if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
	/*
	* Read from the L2ARC if the following are true:
	* 1. The L2ARC vdev was previously cached.
	* 2. This buffer still has L2ARC metadata.
	* 3. This buffer isn't currently writing to the L2ARC.
	* 4. The L2ARC entry wasn't evicted, which may
	* also have invalidated the vdev.
	* 5. This isn't prefetch and l2arc_noprefetch is set.
	*/
	if (hdr->b_l2hdr != NULL &&
	!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
	!(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
	l2arc_read_callback_t *cb;

	DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
	ARCSTAT_BUMP(arcstat_l2_hits);

	cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
	KM_SLEEP);
	cb->l2rcb_buf = buf;
	cb->l2rcb_spa = spa;
	cb->l2rcb_bp = *bp;
	cb->l2rcb_zb = *zb;
	cb->l2rcb_flags = zio_flags;

	/*
	* l2arc read. The SCL_L2ARC lock will be
	* released by l2arc_read_done().
	*/
	rzio = zio_read_phys(pio, vd, addr, size,
	buf->b_data, ZIO_CHECKSUM_OFF,
	l2arc_read_done, cb, priority, zio_flags |
	ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
	ZIO_FLAG_DONT_PROPAGATE |
	ZIO_FLAG_DONT_RETRY, B_FALSE);
	DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
	zio_t *, rzio);
	ARCSTAT_INCR(arcstat_l2_read_bytes, size);

	if (*arc_flags & ARC_NOWAIT) {
	zio_nowait(rzio);
	return (0);
	}

	ASSERT(*arc_flags & ARC_WAIT);
	if (zio_wait(rzio) == 0)
	return (0);

	/* l2arc read error; goto zio_read() */
	} else {
	DTRACE_PROBE1(l2arc__miss,
	arc_buf_hdr_t *, hdr);
	ARCSTAT_BUMP(arcstat_l2_misses);
	if (HDR_L2_WRITING(hdr))
	ARCSTAT_BUMP(arcstat_l2_rw_clash);
	spa_config_exit(spa, SCL_L2ARC, vd);
	}
	} else {
	if (vd != NULL)
	spa_config_exit(spa, SCL_L2ARC, vd);
	if (l2arc_ndev != 0) {
	DTRACE_PROBE1(l2arc__miss,
	arc_buf_hdr_t *, hdr);
	ARCSTAT_BUMP(arcstat_l2_misses);
	}
	}

	rzio = zio_read(pio, spa, bp, buf->b_data, size,
	arc_read_done, buf, priority, zio_flags, zb);

	if (*arc_flags & ARC_WAIT)
	return (zio_wait(rzio));

	ASSERT(*arc_flags & ARC_NOWAIT);
	zio_nowait(rzio);
	}
	return (0);
	}

	/*
	* arc_read() variant to support pool traversal. If the block is already
	* in the ARC, make a copy of it; otherwise, the caller will do the I/O.
	* The idea is that we don't want pool traversal filling up memory, but
	* if the ARC already has the data anyway, we shouldn't pay for the I/O.
	*/
	int
	arc_tryread(spa_t *spa, blkptr_t *bp, void *data)
	{
	arc_buf_hdr_t *hdr;
	kmutex_t *hash_mtx;
	int rc = 0;

	hdr = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_mtx);

	if (hdr && hdr->b_datacnt > 0 && !HDR_IO_IN_PROGRESS(hdr)) {
	arc_buf_t *buf = hdr->b_buf;

	ASSERT(buf);
	while (buf->b_data == NULL) {
	buf = buf->b_next;
	ASSERT(buf);
	}
	bcopy(buf->b_data, data, hdr->b_size);
	} else {
	rc = ENOENT;
	}

	if (hash_mtx)
	mutex_exit(hash_mtx);

	return (rc);
	}

	void
	arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private)
	{
	ASSERT(buf->b_hdr != NULL);
	ASSERT(buf->b_hdr->b_state != arc_anon);
	ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt) || func == NULL);
	buf->b_efunc = func;
	buf->b_private = private;
	}

	/*
	* This is used by the DMU to let the ARC know that a buffer is
	* being evicted, so the ARC should clean up. If this arc buf
	* is not yet in the evicted state, it will be put there.
	*/
	int
	arc_buf_evict(arc_buf_t *buf)
	{
	arc_buf_hdr_t *hdr;
	kmutex_t *hash_lock;
	arc_buf_t **bufp;
	list_t *list, *evicted_list;
	kmutex_t *lock, *evicted_lock;

	rw_enter(&buf->b_lock, RW_WRITER);
	hdr = buf->b_hdr;
	if (hdr == NULL) {
	/*
	* We are in arc_do_user_evicts().
	*/
	ASSERT(buf->b_data == NULL);
	rw_exit(&buf->b_lock);
	return (0);
	} else if (buf->b_data == NULL) {
	arc_buf_t copy = *buf; /* structure assignment */
	/*
	* We are on the eviction list; process this buffer now
	* but let arc_do_user_evicts() do the reaping.
	*/
	buf->b_efunc = NULL;
	rw_exit(&buf->b_lock);
	VERIFY(copy.b_efunc(&copy) == 0);
	return (1);
	}
	hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);

	ASSERT(buf->b_hdr == hdr);
	ASSERT3U(refcount_count(&hdr->b_refcnt), <, hdr->b_datacnt);
	ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);

	/*
	* Pull this buffer off of the hdr
	*/
	bufp = &hdr->b_buf;
	while (*bufp != buf)
	bufp = &(*bufp)->b_next;
	*bufp = buf->b_next;

	ASSERT(buf->b_data != NULL);
	arc_buf_destroy(buf, FALSE, FALSE);

	if (hdr->b_datacnt == 0) {
	arc_state_t *old_state = hdr->b_state;
	arc_state_t *evicted_state;

	ASSERT(refcount_is_zero(&hdr->b_refcnt));

	evicted_state =
	(old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;

	get_buf_info(hdr, old_state, &list, &lock);
	get_buf_info(hdr, evicted_state, &evicted_list, &evicted_lock);
	mutex_enter(lock);
	mutex_enter(evicted_lock);

	arc_change_state(evicted_state, hdr, hash_lock);
	ASSERT(HDR_IN_HASH_TABLE(hdr));
	hdr->b_flags |= ARC_IN_HASH_TABLE;
	hdr->b_flags &= ~ARC_BUF_AVAILABLE;

	mutex_exit(evicted_lock);
	mutex_exit(lock);
	}
	mutex_exit(hash_lock);
	rw_exit(&buf->b_lock);

	VERIFY(buf->b_efunc(buf) == 0);
	buf->b_efunc = NULL;
	buf->b_private = NULL;
	buf->b_hdr = NULL;
	kmem_cache_free(buf_cache, buf);
	return (1);
	}
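
The "pull this buffer off of the hdr" step in arc_buf_evict() uses the pointer-to-pointer idiom for unlinking a node from a singly linked list without a separate case for the head. A self-contained sketch of the idiom follows; `node_t` and `list_unlink()` are hypothetical names standing in for `arc_buf_t` and the inline loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for arc_buf_t, keeping only the list linkage. */
typedef struct node {
	int id;
	struct node *next;
} node_t;

/*
 * Unlink `target` from the list rooted at *headp.
 * The target must be on the list, as the ARC code also assumes.
 */
static void
list_unlink(node_t **headp, node_t *target)
{
	node_t **linkp = headp;

	/* Walk the chain of "next" pointers until one points at target. */
	while (*linkp != target) {
		assert(*linkp != NULL);
		linkp = &(*linkp)->next;
	}

	/* Splice target out by redirecting whichever pointer led to it. */
	*linkp = target->next;
	target->next = NULL;
}
```

Because `linkp` always addresses the pointer that leads to the current node, removing the head and removing an interior node take the same code path.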

	/*
	* Release this buffer from the cache. This must be done
	* after a read and prior to modifying the buffer contents.
	* If the buffer has more than one reference, we must make
	* a new hdr for the buffer.
	*/
	void
	arc_release(arc_buf_t *buf, void *tag)
	{
	arc_buf_hdr_t *hdr;
	kmutex_t *hash_lock;
	l2arc_buf_hdr_t *l2hdr;
	uint64_t buf_size;
	boolean_t released = B_FALSE;

	rw_enter(&buf->b_lock, RW_WRITER);
	hdr = buf->b_hdr;

	/* this buffer is not on any list */
	ASSERT(refcount_count(&hdr->b_refcnt) > 0);
	ASSERT(!(hdr->b_flags & ARC_STORED));

	if (hdr->b_state == arc_anon) {
	/* this buffer is already released */
	ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 1);
	ASSERT(BUF_EMPTY(hdr));
	ASSERT(buf->b_efunc == NULL);
	arc_buf_thaw(buf);
	rw_exit(&buf->b_lock);
	released = B_TRUE;
	} else {
	hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);
	}

	l2hdr = hdr->b_l2hdr;
	if (l2hdr) {
	mutex_enter(&l2arc_buflist_mtx);
	hdr->b_l2hdr = NULL;
	buf_size = hdr->b_size;
	}

	if (released)
	goto out;

	/*
	* Do we have more than one buf?
	*/
	if (hdr->b_datacnt > 1) {
	arc_buf_hdr_t *nhdr;
	arc_buf_t **bufp;
	uint64_t blksz = hdr->b_size;
	spa_t *spa = hdr->b_spa;
	arc_buf_contents_t type = hdr->b_type;
	uint32_t flags = hdr->b_flags;

	ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
	/*
	* Pull the data off of this buf and attach it to
	* a new anonymous buf.
	*/
	(void) remove_reference(hdr, hash_lock, tag);
	bufp = &hdr->b_buf;
	while (*bufp != buf)
	bufp = &(*bufp)->b_next;
	*bufp = (*bufp)->b_next;
	buf->b_next = NULL;

	ASSERT3U(hdr->b_state->arcs_size, >=, hdr->b_size);
	atomic_add_64(&hdr->b_state->arcs_size, -hdr->b_size);
	if (refcount_is_zero(&hdr->b_refcnt)) {
	uint64_t *size = &hdr->b_state->arcs_lsize[hdr->b_type];
	ASSERT3U(*size, >=, hdr->b_size);
	atomic_add_64(size, -hdr->b_size);
	}
	hdr->b_datacnt -= 1;
	arc_cksum_verify(buf);

	mutex_exit(hash_lock);

	nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
	nhdr->b_size = blksz;
	nhdr->b_spa = spa;
	nhdr->b_type = type;
	nhdr->b_buf = buf;
	nhdr->b_state = arc_anon;
	nhdr->b_arc_access = 0;
	nhdr->b_flags = flags & ARC_L2_WRITING;
	nhdr->b_l2hdr = NULL;
	nhdr->b_datacnt = 1;
	nhdr->b_freeze_cksum = NULL;
	(void) refcount_add(&nhdr->b_refcnt, tag);
	buf->b_hdr = nhdr;
	rw_exit(&buf->b_lock);
	atomic_add_64(&arc_anon->arcs_size, blksz);
	} else {
	rw_exit(&buf->b_lock);
	ASSERT(refcount_count(&hdr->b_refcnt) == 1);
	ASSERT(!list_link_active(&hdr->b_arc_node));
	ASSERT(!HDR_IO_IN_PROGRESS(hdr));
	arc_change_state(arc_anon, hdr, hash_lock);
	hdr->b_arc_access = 0;
	mutex_exit(hash_lock);

	bzero(&hdr->b_dva, sizeof (dva_t));
	hdr->b_birth = 0;
	hdr->b_cksum0 = 0;
	arc_buf_thaw(buf);
	}
	buf->b_efunc = NULL;
	buf->b_private = NULL;

	out:
	if (l2hdr) {
	list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
	kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
	ARCSTAT_INCR(arcstat_l2_size, -buf_size);
	mutex_exit(&l2arc_buflist_mtx);
	}
	}

	int
	arc_released(arc_buf_t *buf)
	{
	int released;

	rw_enter(&buf->b_lock, RW_READER);
	released = (buf->b_data != NULL && buf->b_hdr->b_state == arc_anon);
	rw_exit(&buf->b_lock);
	return (released);
	}

	int
	arc_has_callback(arc_buf_t *buf)
	{
	int callback;

	rw_enter(&buf->b_lock, RW_READER);
	callback = (buf->b_efunc != NULL);
	rw_exit(&buf->b_lock);
	return (callback);
	}

	#ifdef ZFS_DEBUG
	int
	arc_referenced(arc_buf_t *buf)
	{
	int referenced;

	rw_enter(&buf->b_lock, RW_READER);
	referenced = (refcount_count(&buf->b_hdr->b_refcnt));
	rw_exit(&buf->b_lock);
	return (referenced);
	}
	#endif

	static void
	arc_write_ready(zio_t *zio)
	{
	arc_write_callback_t *callback = zio->io_private;
	arc_buf_t *buf = callback->awcb_buf;
	arc_buf_hdr_t *hdr = buf->b_hdr;

	ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt));
	callback->awcb_ready(zio, buf, callback->awcb_private);

	/*
	* If the IO is already in progress, then this is a re-write
	* attempt, so we need to thaw and re-compute the cksum.
	* It is the responsibility of the callback to handle the
	* accounting for any re-write attempt.
	*/
	if (HDR_IO_IN_PROGRESS(hdr)) {
	mutex_enter(&hdr->b_freeze_lock);
	if (hdr->b_freeze_cksum != NULL) {
	kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
	hdr->b_freeze_cksum = NULL;
	}
	mutex_exit(&hdr->b_freeze_lock);
	}
	arc_cksum_compute(buf, B_FALSE);
	hdr->b_flags |= ARC_IO_IN_PROGRESS;
	}

	static void
	arc_write_done(zio_t *zio)
	{
	arc_write_callback_t *callback = zio->io_private;
	arc_buf_t *buf = callback->awcb_buf;
	arc_buf_hdr_t *hdr = buf->b_hdr;

	hdr->b_acb = NULL;

	hdr->b_dva = *BP_IDENTITY(zio->io_bp);
	hdr->b_birth = zio->io_bp->blk_birth;
	hdr->b_cksum0 = zio->io_bp->blk_cksum.zc_word[0];
	/*
	* If the block to be written was all-zero, we may have
	* compressed it away. In this case no write was performed
	* so there will be no dva/birth-date/checksum. The buffer
	* must therefore remain anonymous (and uncached).
	*/
	if (!BUF_EMPTY(hdr)) {
	arc_buf_hdr_t *exists;
	kmutex_t *hash_lock;

	arc_cksum_verify(buf);

	exists = buf_hash_insert(hdr, &hash_lock);
	if (exists) {
	/*
	* This can only happen if we overwrite for
	* sync-to-convergence, because we remove
	* buffers from the hash table when we arc_free().
	*/
	ASSERT(zio->io_flags & ZIO_FLAG_IO_REWRITE);
	ASSERT(DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig),
	BP_IDENTITY(zio->io_bp)));
	ASSERT3U(zio->io_bp_orig.blk_birth, ==,
	zio->io_bp->blk_birth);

	ASSERT(refcount_is_zero(&exists->b_refcnt));
	arc_change_state(arc_anon, exists, hash_lock);
	mutex_exit(hash_lock);
	arc_hdr_destroy(exists);
	exists = buf_hash_insert(hdr, &hash_lock);
	ASSERT3P(exists, ==, NULL);
	}
	hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
	/* if it's not anon, we are doing a scrub */
	if (hdr->b_state == arc_anon)
	arc_access(hdr, hash_lock);
	mutex_exit(hash_lock);
	} else if (callback->awcb_done == NULL) {
	int destroy_hdr;
	/*
	* This is an anonymous buffer with no user callback,
	* destroy it if there are no active references.
	*/
	mutex_enter(&arc_eviction_mtx);
	destroy_hdr = refcount_is_zero(&hdr->b_refcnt);
	hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
	mutex_exit(&arc_eviction_mtx);
	if (destroy_hdr)
	arc_hdr_destroy(hdr);
	} else {
	hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
	}
	hdr->b_flags &= ~ARC_STORED;

	if (callback->awcb_done) {
	ASSERT(!refcount_is_zero(&hdr->b_refcnt));
	callback->awcb_done(zio, buf, callback->awcb_private);
	}

	kmem_free(callback, sizeof (arc_write_callback_t));
	}

	static void
	write_policy(spa_t *spa, const writeprops_t *wp, zio_prop_t *zp)
	{
	boolean_t ismd = (wp->wp_level > 0 || dmu_ot[wp->wp_type].ot_metadata);

	/* Determine checksum setting */
	if (ismd) {
	/*
	* Metadata always gets checksummed. If the data
	* checksum is multi-bit correctable, and it's not a
	* ZBT-style checksum, then it's suitable for metadata
	* as well. Otherwise, the metadata checksum defaults
	* to fletcher4.
	*/
	if (zio_checksum_table[wp->wp_oschecksum].ci_correctable &&
	!zio_checksum_table[wp->wp_oschecksum].ci_zbt)
	zp->zp_checksum = wp->wp_oschecksum;
	else
	zp->zp_checksum = ZIO_CHECKSUM_FLETCHER_4;
	} else {
	zp->zp_checksum = zio_checksum_select(wp->wp_dnchecksum,
	wp->wp_oschecksum);
	}

	/* Determine compression setting */
	if (ismd) {
	/*
	* XXX -- we should design a compression algorithm
	* that specializes in arrays of bps.
	*/
	zp->zp_compress = zfs_mdcomp_disable ? ZIO_COMPRESS_EMPTY :
	ZIO_COMPRESS_LZJB;
	} else {
	zp->zp_compress = zio_compress_select(wp->wp_dncompress,
	wp->wp_oscompress);
	}

	zp->zp_type = wp->wp_type;
	zp->zp_level = wp->wp_level;
	zp->zp_ndvas = MIN(wp->wp_copies + ismd, spa_max_replication(spa));
	}
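
	/*
	* Illustrative sketch, not part of arc.c: the copy-count rule used for
	* zp_ndvas above, isolated so it can be checked by hand. The name
	* sketch_ndvas() and its plain int/bool parameters are hypothetical
	* stand-ins for wp->wp_copies, ismd and spa_max_replication(spa).
	*/

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Metadata asks for one copy more than the dataset's "copies" setting;
 * the result is capped by the pool-wide replication limit.
 */
static int
sketch_ndvas(int copies, bool is_metadata, int max_replication)
{
	int wanted = copies + (is_metadata ? 1 : 0);

	assert(copies >= 1 && max_replication >= 1);
	return (wanted < max_replication ? wanted : max_replication);
}
```

	/*
	* With copies=1 and a limit of 3, data gets one DVA and metadata two;
	* with copies=3 both are capped at three.
	*/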

	zio_t *
	arc_write(zio_t *pio, spa_t *spa, const writeprops_t *wp,
	boolean_t l2arc, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
	arc_done_func_t *ready, arc_done_func_t *done, void *private, int priority,
	int zio_flags, const zbookmark_t *zb)
	{
	arc_buf_hdr_t *hdr = buf->b_hdr;
	arc_write_callback_t *callback;
	zio_t *zio;
	zio_prop_t zp;

	ASSERT(ready != NULL);
	ASSERT(!HDR_IO_ERROR(hdr));
	ASSERT((hdr->b_flags & ARC_IO_IN_PROGRESS) == 0);
	ASSERT(hdr->b_acb == 0);
	if (l2arc)
	hdr->b_flags |= ARC_L2CACHE;
	callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
	callback->awcb_ready = ready;
	callback->awcb_done = done;
	callback->awcb_private = private;
	callback->awcb_buf = buf;

	write_policy(spa, wp, &zp);
	zio = zio_write(pio, spa, txg, bp, buf->b_data, hdr->b_size, &zp,
	arc_write_ready, arc_write_done, callback, priority, zio_flags, zb);

	return (zio);
	}

	int
	arc_free(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
	zio_done_func_t *done, void *private, uint32_t arc_flags)
	{
	arc_buf_hdr_t *ab;
	kmutex_t *hash_lock;
	zio_t *zio;

	/*
	* If this buffer is in the cache, release it, so it
	* can be re-used.
	*/
	ab = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
	if (ab != NULL) {
	/*
	* The checksum of blocks to free is not always
	* preserved (e.g., on the deadlist). However, if it is
	* nonzero, it should match what we have in the cache.
	*/
	ASSERT(bp->blk_cksum.zc_word[0] == 0 ||
	bp->blk_cksum.zc_word[0] == ab->b_cksum0 ||
	bp->blk_fill == BLK_FILL_ALREADY_FREED);

	if (ab->b_state != arc_anon)
	arc_change_state(arc_anon, ab, hash_lock);
	if (HDR_IO_IN_PROGRESS(ab)) {
	/*
	* This should only happen when we prefetch.
	*/
	ASSERT(ab->b_flags & ARC_PREFETCH);
	ASSERT3U(ab->b_datacnt, ==, 1);
	ab->b_flags |= ARC_FREED_IN_READ;
	if (HDR_IN_HASH_TABLE(ab))
	buf_hash_remove(ab);
	ab->b_arc_access = 0;
	bzero(&ab->b_dva, sizeof (dva_t));
	ab->b_birth = 0;
	ab->b_cksum0 = 0;
	ab->b_buf->b_efunc = NULL;
	ab->b_buf->b_private = NULL;
	mutex_exit(hash_lock);
	} else if (refcount_is_zero(&ab->b_refcnt)) {
	ab->b_flags |= ARC_FREE_IN_PROGRESS;
	mutex_exit(hash_lock);
	arc_hdr_destroy(ab);
	ARCSTAT_BUMP(arcstat_deleted);
	} else {
	/*
	* We still have an active reference on this
	* buffer. This can happen, e.g., from
	* dbuf_unoverride().
	*/
	ASSERT(!HDR_IN_HASH_TABLE(ab));
	ab->b_arc_access = 0;
	bzero(&ab->b_dva, sizeof (dva_t));
	ab->b_birth = 0;
	ab->b_cksum0 = 0;
	ab->b_buf->b_efunc = NULL;
	ab->b_buf->b_private = NULL;
	mutex_exit(hash_lock);
	}
	}

	zio = zio_free(pio, spa, txg, bp, done, private, ZIO_FLAG_MUSTSUCCEED);

	if (arc_flags & ARC_WAIT)
	return (zio_wait(zio));

	ASSERT(arc_flags & ARC_NOWAIT);
	zio_nowait(zio);

	return (0);
	}

	static int
	arc_memory_throttle(uint64_t reserve, uint64_t txg)
	{
	#ifdef _KERNEL
	uint64_t inflight_data = arc_anon->arcs_size;
	uint64_t available_memory = ptoa((uintmax_t)cnt.v_free_count);
	static uint64_t page_load = 0;
	static uint64_t last_txg = 0;

	#if 0
	#if defined(__i386)
	available_memory =
	MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
	#endif
	#endif
	if (available_memory >= zfs_write_limit_max)
	return (0);

	if (txg > last_txg) {
	last_txg = txg;
	page_load = 0;
	}
	/*
	* If we are in pageout, we know that memory is already tight,
	* the arc is already going to be evicting, so we just want to
	* continue to let page writes occur as quickly as possible.
	*/
	if (curproc == pageproc) {
	if (page_load > available_memory / 4)
	return (ERESTART);
	/* Note: reserve is inflated, so we deflate */
	page_load += reserve / 8;
	return (0);
	} else if (page_load > 0 && arc_reclaim_needed()) {
	/* memory is low, delay before restarting */
	ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
	return (EAGAIN);
	}
	page_load = 0;

	if (arc_size > arc_c_min) {
	uint64_t evictable_memory =
	arc_mru->arcs_lsize[ARC_BUFC_DATA] +
	arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
	arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
	arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
	available_memory += MIN(evictable_memory, arc_size - arc_c_min);
	}

	if (inflight_data > available_memory / 4) {
	ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
	return (ERESTART);
	}
	#endif
	return (0);
	}
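
	/*
	* Illustrative sketch, not part of arc.c: the final in-flight check of
	* arc_memory_throttle() with the kernel globals replaced by plain
	* parameters. sketch_write_throttled() and its parameter names are
	* hypothetical; the pageout special case and the page_load accounting
	* are left out on purpose.
	*/

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Free memory is credited with whatever the ARC could still evict without
 * dropping below its minimum size; writes are throttled once in-flight
 * (anonymous) data exceeds a quarter of that total.
 */
static bool
sketch_write_throttled(uint64_t inflight, uint64_t free_bytes,
    uint64_t evictable, uint64_t arc_size, uint64_t arc_c_min)
{
	uint64_t available = free_bytes;

	if (arc_size > arc_c_min) {
		uint64_t spare = arc_size - arc_c_min;

		available += (evictable < spare ? evictable : spare);
	}
	return (inflight > available / 4);
}
```

	/*
	* The real function returns ERESTART rather than a boolean and bumps
	* arcstat_memory_throttle_count when it throttles.
	*/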

	void
	arc_tempreserve_clear(uint64_t reserve)
	{
	atomic_add_64(&arc_tempreserve, -reserve);
	ASSERT((int64_t)arc_tempreserve >= 0);
	}

	int
	arc_tempreserve_space(uint64_t reserve, uint64_t txg)
	{
	int error;

	#ifdef ZFS_DEBUG
	/*
	* Once in a while, fail for no reason. Everything should cope.
	*/
	if (spa_get_random(10000) == 0) {
	dprintf("forcing random failure\n");
	return (ERESTART);
	}
	#endif
	if (reserve > arc_c/4 && !arc_no_grow)
	arc_c = MIN(arc_c_max, reserve * 4);
	if (reserve > arc_c)
	return (ENOMEM);

	/*
	* Writes will, almost always, require additional memory allocations
	* in order to compress/encrypt/etc the data. We therefore need to
	* make sure that there is sufficient available memory for this.
	*/
	if ((error = arc_memory_throttle(reserve, txg)) != 0)
	return (error);

	/*
	* Throttle writes when the amount of dirty data in the cache
	* gets too large. We try to keep the cache less than half full
	* of dirty blocks so that our sync times don't grow too large.
	* Note: if two requests come in concurrently, we might let them
	* both succeed, when one of them should fail. Not a huge deal.
	*/
	if (reserve + arc_tempreserve + arc_anon->arcs_size > arc_c / 2 &&
	arc_anon->arcs_size > arc_c / 4) {
	dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
	"anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
	arc_tempreserve>>10,
	arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
	arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
	reserve>>10, arc_c>>10);
	return (ERESTART);
	}
	atomic_add_64(&arc_tempreserve, reserve);
	return (0);
	}
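
	/*
	* Illustrative sketch, not part of arc.c: the dirty-data test from
	* arc_tempreserve_space() as a pure predicate. The function name and
	* parameters are hypothetical stand-ins for arc_tempreserve,
	* arc_anon->arcs_size and arc_c.
	*/

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * A reservation is refused when it would push pending plus dirty data
 * past half of the cache target AND dirty (anonymous) data alone already
 * exceeds a quarter of it.
 */
static bool
sketch_dirty_throttled(uint64_t reserve, uint64_t tempreserve,
    uint64_t anon_size, uint64_t arc_c)
{
	return (reserve + tempreserve + anon_size > arc_c / 2 &&
	    anon_size > arc_c / 4);
}
```

	/*
	* Both conditions must hold, so a large reservation against a cache
	* that holds little dirty data is still admitted.
	*/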

	static kmutex_t arc_lowmem_lock;
	#ifdef _KERNEL
	static eventhandler_tag arc_event_lowmem = NULL;

	static void
	arc_lowmem(void *arg __unused, int howto __unused)
	{

	/* Serialize access via arc_lowmem_lock. */
	mutex_enter(&arc_lowmem_lock);
	needfree = 1;
	cv_signal(&arc_reclaim_thr_cv);
	while (needfree)
	tsleep(&needfree, 0, "zfs:lowmem", hz / 5);
	mutex_exit(&arc_lowmem_lock);
	}
	#endif

	void
	arc_init(void)
	{
	int prefetch_tunable_set = 0;
	int i;

	mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL);
	mutex_init(&arc_lowmem_lock, NULL, MUTEX_DEFAULT, NULL);

	/* Convert seconds to clock ticks */
	arc_min_prefetch_lifespan = 1 * hz;

	/* Start out with 1/8 of all memory */
	arc_c = kmem_size() / 8;
	#if 0
	#ifdef _KERNEL
	/*
	* On architectures where the physical memory can be larger
	* than the addressable space (intel in 32-bit mode), we may
	* need to limit the cache to 1/8 of VM size.
	*/
	arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
	#endif
	#endif
	/* set min cache to 1/32 of all memory, or 16MB, whichever is more */
	arc_c_min = MAX(arc_c / 4, 64<<18);
	/* set max to 5/8 of all memory, or all but 1GB, whichever is more */
	if (arc_c * 8 >= 1<<30)
	arc_c_max = (arc_c * 8) - (1<<30);
	else
	arc_c_max = arc_c_min;
	arc_c_max = MAX(arc_c * 5, arc_c_max);
	#ifdef _KERNEL
	/*
	* Allow the tunables to override our calculations if they are
	* reasonable (i.e., over 16MB)
	*/
	if (zfs_arc_max >= 64<<18 && zfs_arc_max < kmem_size())
	arc_c_max = zfs_arc_max;
	if (zfs_arc_min >= 64<<18 && zfs_arc_min <= arc_c_max)
	arc_c_min = zfs_arc_min;
	#endif
	arc_c = arc_c_max;
	arc_p = (arc_c >> 1);

	/* limit meta-data to 1/4 of the arc capacity */
	arc_meta_limit = arc_c_max / 4;

	/* Allow the tunable to override if it is reasonable */
	if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
	arc_meta_limit = zfs_arc_meta_limit;

	if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
	arc_c_min = arc_meta_limit / 2;

	if (zfs_arc_grow_retry > 0)
	arc_grow_retry = zfs_arc_grow_retry;

	if (zfs_arc_shrink_shift > 0)
	arc_shrink_shift = zfs_arc_shrink_shift;

	if (zfs_arc_p_min_shift > 0)
	arc_p_min_shift = zfs_arc_p_min_shift;

	/* if kmem_flags are set, let's try to use less memory */
	if (kmem_debugging())
	arc_c = arc_c / 2;
	if (arc_c < arc_c_min)
	arc_c = arc_c_min;

	zfs_arc_min = arc_c_min;
	zfs_arc_max = arc_c_max;

	arc_anon = &ARC_anon;
	arc_mru = &ARC_mru;
	arc_mru_ghost = &ARC_mru_ghost;
	arc_mfu = &ARC_mfu;
	arc_mfu_ghost = &ARC_mfu_ghost;
	arc_l2c_only = &ARC_l2c_only;
	arc_size = 0;

	for (i = 0; i < ARC_BUFC_NUMLISTS; i++) {
	mutex_init(&arc_anon->arcs_locks[i].arcs_lock,
	NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mru->arcs_locks[i].arcs_lock,
	NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mru_ghost->arcs_locks[i].arcs_lock,
	NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mfu->arcs_locks[i].arcs_lock,
	NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_mfu_ghost->arcs_locks[i].arcs_lock,
	NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&arc_l2c_only->arcs_locks[i].arcs_lock,
	NULL, MUTEX_DEFAULT, NULL);

	list_create(&arc_mru->arcs_lists[i],
	sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mru_ghost->arcs_lists[i],
	sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mfu->arcs_lists[i],
	sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_mfu_ghost->arcs_lists[i],
	sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	list_create(&arc_l2c_only->arcs_lists[i],
	sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
	}

	buf_init();

	arc_thread_exit = 0;
	arc_eviction_list = NULL;
	mutex_init(&arc_eviction_mtx, NULL, MUTEX_DEFAULT, NULL);
	bzero(&arc_eviction_hdr, sizeof (arc_buf_hdr_t));

	arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
	sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);

	if (arc_ksp != NULL) {
	arc_ksp->ks_data = &arc_stats;
	kstat_install(arc_ksp);
	}

	(void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
	TS_RUN, minclsyspri);

	#ifdef _KERNEL
	arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem, arc_lowmem, NULL,
	EVENTHANDLER_PRI_FIRST);
	#endif

	arc_dead = FALSE;
	arc_warm = B_FALSE;

	if (zfs_write_limit_max == 0)
	zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
	else
	zfs_write_limit_shift = 0;
	mutex_init(&zfs_write_limit_lock, NULL, MUTEX_DEFAULT, NULL);

	#ifdef _KERNEL
	if (TUNABLE_INT_FETCH("vfs.zfs.prefetch_disable", &zfs_prefetch_disable))
	prefetch_tunable_set = 1;

	#ifdef __i386__
	if (prefetch_tunable_set == 0) {
	printf("ZFS NOTICE: Prefetch is disabled by default on i386 "
	"-- to enable,\n");
	printf(" add \"vfs.zfs.prefetch_disable=0\" "
	"to /boot/loader.conf.\n");
	zfs_prefetch_disable = 1;
	}
	#else
	if ((((uint64_t)physmem * PAGESIZE) < (1ULL << 32)) &&
	prefetch_tunable_set == 0) {
	printf("ZFS NOTICE: Prefetch is disabled by default if less "
	"than 4GB of RAM is present;\n"
	" to enable, add \"vfs.zfs.prefetch_disable=0\" "
	"to /boot/loader.conf.\n");
	zfs_prefetch_disable = 1;
	}
	#endif
	/* Warn about ZFS memory and address space requirements. */
	if (((uint64_t)physmem * PAGESIZE) < (256 + 128 + 64) * (1 << 20)) {
	printf("ZFS WARNING: Recommended minimum RAM size is 512MB; "
	"expect unstable behavior.\n");
	}
	if (kmem_size() < 512 * (1 << 20)) {
	printf("ZFS WARNING: Recommended minimum kmem_size is 512MB; "
	"expect unstable behavior.\n");
	printf(" Consider tuning vm.kmem_size and "
	"vm.kmem_size_max\n");
	printf(" in /boot/loader.conf.\n");
	}
	#endif
	}
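
	/*
	* Illustrative sketch, not part of arc.c: the default sizing arithmetic
	* from arc_init(), before any loader tunables are applied. The struct,
	* function and macro names are hypothetical; kmem_bytes stands in for
	* kmem_size().
	*/

```c
#include <stdint.h>

#define	SKETCH_MB	(UINT64_C(1) << 20)
#define	SKETCH_GB	(UINT64_C(1) << 30)

struct sketch_arc_limits {
	uint64_t c_min;
	uint64_t c_max;
};

static struct sketch_arc_limits
sketch_arc_default_limits(uint64_t kmem_bytes)
{
	struct sketch_arc_limits lim;
	uint64_t c = kmem_bytes / 8;	/* start out with 1/8 of memory */

	/* minimum: 1/32 of memory or 16MB, whichever is more */
	lim.c_min = (c / 4 > 16 * SKETCH_MB) ? c / 4 : 16 * SKETCH_MB;

	/* maximum: all but 1GB, but never less than 5/8 of memory */
	if (c * 8 >= SKETCH_GB)
		lim.c_max = c * 8 - SKETCH_GB;
	else
		lim.c_max = lim.c_min;
	if (c * 5 > lim.c_max)
		lim.c_max = c * 5;

	return (lim);
}
```

	/*
	* For a 4GB kmem size this gives a 128MB minimum and a 3GB maximum;
	* for 1GB it gives 32MB and 640MB.
	*/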

	void
	arc_fini(void)
	{
	int i;

	mutex_enter(&arc_reclaim_thr_lock);
	arc_thread_exit = 1;
	cv_signal(&arc_reclaim_thr_cv);
	while (arc_thread_exit != 0)
	cv_wait(&arc_reclaim_thr_cv, &arc_reclaim_thr_lock);
	mutex_exit(&arc_reclaim_thr_lock);

	arc_flush(NULL);

	arc_dead = TRUE;

	if (arc_ksp != NULL) {
	kstat_delete(arc_ksp);
	arc_ksp = NULL;
	}

	mutex_destroy(&arc_eviction_mtx);
	mutex_destroy(&arc_reclaim_thr_lock);
	cv_destroy(&arc_reclaim_thr_cv);

	for (i = 0; i < ARC_BUFC_NUMLISTS; i++) {
	list_destroy(&arc_mru->arcs_lists[i]);
	list_destroy(&arc_mru_ghost->arcs_lists[i]);
	list_destroy(&arc_mfu->arcs_lists[i]);
	list_destroy(&arc_mfu_ghost->arcs_lists[i]);
	list_destroy(&arc_l2c_only->arcs_lists[i]);

	mutex_destroy(&arc_anon->arcs_locks[i].arcs_lock);
	mutex_destroy(&arc_mru->arcs_locks[i].arcs_lock);
	mutex_destroy(&arc_mru_ghost->arcs_locks[i].arcs_lock);
	mutex_destroy(&arc_mfu->arcs_locks[i].arcs_lock);
	mutex_destroy(&arc_mfu_ghost->arcs_locks[i].arcs_lock);
	mutex_destroy(&arc_l2c_only->arcs_locks[i].arcs_lock);
	}

	mutex_destroy(&zfs_write_limit_lock);

	buf_fini();

	mutex_destroy(&arc_lowmem_lock);
	#ifdef _KERNEL
	if (arc_event_lowmem != NULL)
	EVENTHANDLER_DEREGISTER(vm_lowmem, arc_event_lowmem);
	#endif
	}

	/*
	* Level 2 ARC
	*
	* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
	* It uses dedicated storage devices to hold cached data, which are populated
	* using large infrequent writes. The main role of this cache is to boost
	* the performance of random read workloads. The intended L2ARC devices
	* include short-stroked disks, solid state disks, and other media with
	* substantially faster read latency than disk.
	*
	*              +-----------------------+
	*              |          ARC          |
	*              +-----------------------+
	*                 |         ^     ^
	*                 |         |     |
	*      l2arc_feed_thread()  arc_read()
	*                 |         |     |
	*                 |  l2arc read   |
	*                 V         |     |
	*            +---------------+    |
	*            |     L2ARC     |    |
	*            +---------------+    |
	*                |    ^           |
	*  l2arc_write() |    |           |
	*                |    |           |
	*                V    |           |
	*            +-------+        +-------+
	*            | vdev  |        | vdev  |
	*            | cache |        | cache |
	*            +-------+        +-------+
	*            +=========+       .-----.
	*            : L2ARC   :      |-_____-|
	*            : devices :      | Disks |
	*            +=========+      `-_____-'
	*
	* Read requests are satisfied from the following sources, in order:
	*
	* 1) ARC
	* 2) vdev cache of L2ARC devices
	* 3) L2ARC devices
	* 4) vdev cache of disks
	* 5) disks
	*
	* Some L2ARC device types exhibit extremely slow write performance.
	* To accommodate this, there are some significant differences between
	* the L2ARC and traditional cache design:
	*
	* 1. There is no eviction path from the ARC to the L2ARC. Evictions from
	* the ARC behave as usual, freeing buffers and placing headers on ghost
	* lists. The ARC does not send buffers to the L2ARC during eviction as
	* this would add inflated write latencies for all ARC memory pressure.
	*
	* 2. The L2ARC attempts to cache data from the ARC before it is evicted.
	* It does this by periodically scanning buffers from the eviction-end of
	* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
	* not already there. It scans until a headroom of buffers is satisfied,
	* which itself is a buffer for ARC eviction. The thread that does this is
	* l2arc_feed_thread(), illustrated below; example sizes are included to
	* provide a better sense of ratio than this diagram:
	*
	*          head -->                          tail
	*           +---------------------+----------+
	*   ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
	*           +---------------------+----------+   |   o L2ARC eligible
	*   ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
	*           +---------------------+----------+   |
	*                15.9 Gbytes      ^ 32 Mbytes    |
	*                              headroom          |
	*                                       l2arc_feed_thread()
	*                                                |
	*                    l2arc write hand <--[oooo]--'
	*                            |           8 Mbyte
	*                            |          write max
	*                            V
	*             +==============================+
	*   L2ARC dev |####|#|###|###|    |####| ... |
	*             +==============================+
	*                        32 Gbytes
	*
	* 3. If an ARC buffer is copied to the L2ARC but then hit instead of
	* evicted, then the L2ARC has cached a buffer much sooner than it probably
	* needed to, potentially wasting L2ARC device bandwidth and storage. It is
	* safe to say that this is an uncommon case, since buffers at the end of
	* the ARC lists have moved there due to inactivity.
	*
	* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
	* then the L2ARC simply misses copying some buffers. This serves as a
	* pressure valve to prevent heavy read workloads from both stalling the ARC
	* with waits and clogging the L2ARC with writes. This also helps prevent
	* the potential for the L2ARC to churn if it attempts to cache content too
	* quickly, such as during backups of the entire pool.
	*
	* 5. After system boot and before the ARC has filled main memory, there are
	* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
	* lists can remain mostly static. Instead of searching from tail of these
	* lists as pictured, the l2arc_feed_thread() will search from the list heads
	* for eligible buffers, greatly increasing its chance of finding them.
	*
	* The L2ARC device write speed is also boosted during this time so that
	* the L2ARC warms up faster. Since there have been no ARC evictions yet,
	* there are no L2ARC reads, and no fear of degrading read performance
	* through increased writes.
	*
	* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
	* the vdev queue can aggregate them into larger and fewer writes. Each
	* device is written to in a rotor fashion, sweeping writes through
	* available space then repeating.
	*
	* 7. The L2ARC does not store dirty content. It never needs to flush
	* write buffers back to disk based storage.
	*
	* 8. If an ARC buffer is written (and dirtied) which also exists in the
	* L2ARC, the now stale L2ARC buffer is immediately dropped.
	*
	* The performance of the L2ARC can be tweaked by a number of tunables, which
	* may be necessary for different workloads:
	*
	*     l2arc_write_max      max write bytes per interval
	*     l2arc_write_boost    extra write bytes during device warmup
	*     l2arc_noprefetch     skip caching prefetched buffers
	*     l2arc_headroom       number of max device writes to precache
	*     l2arc_feed_secs      seconds between L2ARC writing
	*
	* Tunables may be removed or added as future performance improvements are
	* integrated, and also may become zpool properties.
	*
	* There are three key functions that control how the L2ARC warms up:
	*
	*     l2arc_write_eligible()    check if a buffer is eligible to cache
	*     l2arc_write_size()        calculate how much to write
	*     l2arc_write_interval()    calculate sleep delay between writes
	*
	* These three functions determine what to write, how much, and how quickly
	* to send writes.
	*/

	static boolean_t
	l2arc_write_eligible(spa_t *spa, arc_buf_hdr_t *ab)
	{
	/*
	* A buffer is not eligible for the L2ARC if it:
	* 1. belongs to a different spa.
	* 2. is already cached on the L2ARC.
	* 3. has an I/O in progress (it may be an incomplete read).
	* 4. is flagged not eligible (zfs property).
	*/
	if (ab->b_spa != spa) {
	ARCSTAT_BUMP(arcstat_l2_write_spa_mismatch);
	return (B_FALSE);
	}
	if (ab->b_l2hdr != NULL) {
	ARCSTAT_BUMP(arcstat_l2_write_in_l2);
	return (B_FALSE);
	}
	if (HDR_IO_IN_PROGRESS(ab)) {
	ARCSTAT_BUMP(arcstat_l2_write_hdr_io_in_progress);
	return (B_FALSE);
	}
	if (!HDR_L2CACHE(ab)) {
	ARCSTAT_BUMP(arcstat_l2_write_not_cacheable);
	return (B_FALSE);
	}

	return (B_TRUE);
	}

	static uint64_t
	l2arc_write_size(l2arc_dev_t *dev)
	{
	uint64_t size;

	size = dev->l2ad_write;

	if (arc_warm == B_FALSE)
	size += dev->l2ad_boost;

	return (size);

	}
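
	/*
	* Illustrative sketch, not part of arc.c: how the per-pass write size
	* and the scan headroom relate. The function names are hypothetical.
	* The 8 Mbyte write size and the multiplier of 4 are assumptions taken
	* from the example figures in the L2ARC diagram (8 Mbyte write max,
	* 32 Mbytes headroom), and the 8 Mbyte boost is an assumed value; none
	* of these are read from the tunables' actual defaults.
	*/

```c
#include <stdbool.h>
#include <stdint.h>

/* Bytes to write this pass: boosted until the ARC has warmed up. */
static uint64_t
sketch_write_size(uint64_t write_max, uint64_t boost, bool arc_warm)
{
	return (arc_warm ? write_max : write_max + boost);
}

/* Bytes of each ARC list to scan for eligible buffers. */
static uint64_t
sketch_headroom(uint64_t target_sz, uint64_t headroom_mult)
{
	return (target_sz * headroom_mult);
}
```

	/*
	* With these assumed figures a cold ARC writes 16 Mbytes per pass, a
	* warm one 8 Mbytes, and a warm pass scans 32 Mbytes of each list.
	*/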

	static clock_t
	l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
	{
	clock_t interval, next;

	/*
	* If the ARC lists are busy, increase our write rate; if the
	* lists are stale, idle back. This is achieved by checking
	* how much we previously wrote - if it was more than half of
	* what we wanted, schedule the next write much sooner.
	*/
	if (l2arc_feed_again && wrote > (wanted / 2))
	interval = (hz * l2arc_feed_min_ms) / 1000;
	else
	interval = hz * l2arc_feed_secs;

	next = MAX(LBOLT, MIN(LBOLT + interval, began + interval));

	return (next);
	}
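
	/*
	* Illustrative sketch, not part of arc.c: the clamping expression at
	* the end of l2arc_write_interval() with LBOLT passed in as "now".
	* sketch_next_write() is a hypothetical name and the tick values in
	* the note below are made up for the example.
	*/

```c
#include <stdint.h>

/*
 * The next write is due one interval after the previous pass began, but
 * never in the past: if the pass itself took longer than the interval,
 * the next one is due immediately.
 */
static int64_t
sketch_next_write(int64_t now, int64_t began, int64_t interval)
{
	int64_t soonest = (now + interval < began + interval) ?
	    now + interval : began + interval;

	return (now > soonest ? now : soonest);
}
```

	/*
	* A pass that began at tick 100 and finished at tick 150 with a
	* 200-tick interval is rescheduled for tick 300; one that finished at
	* tick 400 is rescheduled for tick 400.
	*/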

	static void
	l2arc_hdr_stat_add(void)
	{
	ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE + L2HDR_SIZE);
	ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
	}

	static void
	l2arc_hdr_stat_remove(void)
	{
	ARCSTAT_INCR(arcstat_l2_hdr_size, -(HDR_SIZE + L2HDR_SIZE));
	ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
	}

	/*
	* Cycle through L2ARC devices. This is how L2ARC load balances.
	* If a device is returned, this also returns holding the spa config lock.
	*/
	static l2arc_dev_t *
	l2arc_dev_get_next(void)
	{
	l2arc_dev_t *first, *next = NULL;

	/*
	* Lock out the removal of spas (spa_namespace_lock), then removal
	* of cache devices (l2arc_dev_mtx). Once a device has been selected,
	* both locks will be dropped and a spa config lock held instead.
	*/
	mutex_enter(&spa_namespace_lock);
	mutex_enter(&l2arc_dev_mtx);

	/* if there are no vdevs, there is nothing to do */
	if (l2arc_ndev == 0)
	goto out;

	first = NULL;
	next = l2arc_dev_last;
	do {
	/* loop around the list looking for a non-faulted vdev */
	if (next == NULL) {
	next = list_head(l2arc_dev_list);
	} else {
	next = list_next(l2arc_dev_list, next);
	if (next == NULL)
	next = list_head(l2arc_dev_list);
	}

	/* if we have come back to the start, bail out */
	if (first == NULL)
	first = next;
	else if (next == first)
	break;

	} while (vdev_is_dead(next->l2ad_vdev));

	/* if we were unable to find any usable vdevs, return NULL */
	if (vdev_is_dead(next->l2ad_vdev))
	next = NULL;

	l2arc_dev_last = next;

	out:
	mutex_exit(&l2arc_dev_mtx);

	/*
	* Grab the config lock to prevent the 'next' device from being
	* removed while we are writing to it.
	*/
	if (next != NULL)
	spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
	mutex_exit(&spa_namespace_lock);

	return (next);
	}
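
	/*
	* Illustrative sketch, not part of arc.c: the rotor in
	* l2arc_dev_get_next() rewritten over an array instead of a list_t,
	* with no locking. sketch_next_dev() and the dead[] array are
	* hypothetical; dead[i] plays the role of vdev_is_dead() and "last"
	* the role of l2arc_dev_last (-1 meaning NULL).
	*/

```c
#include <stdbool.h>

/* Returns the next usable device index, or -1 if none is usable. */
static int
sketch_next_dev(const bool *dead, int ndev, int last)
{
	int first = -1;
	int next = last;

	/* if there are no devices, there is nothing to do */
	if (ndev == 0)
		return (-1);

	do {
		/* advance, wrapping from the end of the list to the head */
		next = (next + 1) % ndev;

		/* if we have come back to the start, bail out */
		if (first == -1)
			first = next;
		else if (next == first)
			break;
	} while (dead[next]);

	return (dead[next] ? -1 : next);
}
```

	/*
	* Starting after the previously used device is what spreads writes
	* across all cache devices instead of always filling the first one.
	*/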

	/*
	* Free buffers that were tagged for destruction.
	*/
	static void
	l2arc_do_free_on_write(void)
	{
	list_t *buflist;
	l2arc_data_free_t *df, *df_prev;

	mutex_enter(&l2arc_free_on_write_mtx);
	buflist = l2arc_free_on_write;

	for (df = list_tail(buflist); df; df = df_prev) {
	df_prev = list_prev(buflist, df);
	ASSERT(df->l2df_data != NULL);
	ASSERT(df->l2df_func != NULL);
	df->l2df_func(df->l2df_data, df->l2df_size);
	list_remove(buflist, df);
	kmem_free(df, sizeof (l2arc_data_free_t));
	}

	mutex_exit(&l2arc_free_on_write_mtx);
	}

	/*
	* A write to a cache device has completed. Update all headers to allow
	* reads from these buffers to begin.
	*/
	static void
	l2arc_write_done(zio_t *zio)
	{
	l2arc_write_callback_t *cb;
	l2arc_dev_t *dev;
	list_t *buflist;
	arc_buf_hdr_t *head, *ab, *ab_prev;
	l2arc_buf_hdr_t *abl2;
	kmutex_t *hash_lock;

	cb = zio->io_private;
	ASSERT(cb != NULL);
	dev = cb->l2wcb_dev;
	ASSERT(dev != NULL);
	head = cb->l2wcb_head;
	ASSERT(head != NULL);
	buflist = dev->l2ad_buflist;
	ASSERT(buflist != NULL);
	DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
	l2arc_write_callback_t *, cb);

	if (zio->io_error != 0)
	ARCSTAT_BUMP(arcstat_l2_writes_error);

	mutex_enter(&l2arc_buflist_mtx);

	/*
	* All writes completed, or an error was hit.
	*/
	for (ab = list_prev(buflist, head); ab; ab = ab_prev) {
	ab_prev = list_prev(buflist, ab);

	hash_lock = HDR_LOCK(ab);
	if (!mutex_tryenter(hash_lock)) {
	/*
	* This buffer misses out. It may be in a stage
	* of eviction. Its ARC_L2_WRITING flag will be
	* left set, denying reads to this buffer.
	*/
	ARCSTAT_BUMP(arcstat_l2_writes_hdr_miss);
	continue;
	}

	if (zio->io_error != 0) {
	/*
	* Error - drop L2ARC entry.
	*/
	list_remove(buflist, ab);
	abl2 = ab->b_l2hdr;
	ab->b_l2hdr = NULL;
	kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
	ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
	}

	/*
	* Allow ARC to begin reads to this L2ARC entry.
	*/
	ab->b_flags &= ~ARC_L2_WRITING;

	mutex_exit(hash_lock);
	}

	atomic_inc_64(&l2arc_writes_done);
	list_remove(buflist, head);
	kmem_cache_free(hdr_cache, head);
	mutex_exit(&l2arc_buflist_mtx);

	l2arc_do_free_on_write();

	kmem_free(cb, sizeof (l2arc_write_callback_t));
	}

	/*
	* A read to a cache device completed. Validate buffer contents before
	* handing over to the regular ARC routines.
	*/
	static void
	l2arc_read_done(zio_t *zio)
	{
	l2arc_read_callback_t *cb;
	arc_buf_hdr_t *hdr;
	arc_buf_t *buf;
	kmutex_t *hash_lock;
	int equal;

	ASSERT(zio->io_vd != NULL);
	ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);

	spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);

	cb = zio->io_private;
	ASSERT(cb != NULL);
	buf = cb->l2rcb_buf;
	ASSERT(buf != NULL);
	hdr = buf->b_hdr;
	ASSERT(hdr != NULL);

	hash_lock = HDR_LOCK(hdr);
	mutex_enter(hash_lock);

	/*
	* Check this survived the L2ARC journey.
	*/
	equal = arc_cksum_equal(buf);
	if (equal && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
	mutex_exit(hash_lock);
	zio->io_private = buf;
	zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
	zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
	arc_read_done(zio);
	} else {
	mutex_exit(hash_lock);
	/*
	* Buffer didn't survive caching. Increment stats and
	* reissue to the original storage device.
	*/
	if (zio->io_error != 0) {
	ARCSTAT_BUMP(arcstat_l2_io_error);
	} else {
	zio->io_error = EIO;
	}
	if (!equal)
	ARCSTAT_BUMP(arcstat_l2_cksum_bad);

	/*
	* If there's no waiter, issue an async i/o to the primary
	* storage now. If there is a waiter, the caller must
	* issue the i/o in a context where it's OK to block.
	*/
	if (zio->io_waiter == NULL)
	zio_nowait(zio_read(zio->io_parent,
	cb->l2rcb_spa, &cb->l2rcb_bp,
	buf->b_data, zio->io_size, arc_read_done, buf,
	zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb));
	}

	kmem_free(cb, sizeof (l2arc_read_callback_t));
	}

	/*
	* This is the list priority from which the L2ARC will search for pages to
	* cache. This is used within loops (0..3) to cycle through lists in the
	* desired order. This order can have a significant effect on cache
	* performance.
	*
	* Currently the metadata lists are hit first, MFU then MRU, followed by
	* the data lists. This function returns a locked list, and also returns
	* the lock pointer.
	*/
	static list_t *
	l2arc_list_locked(int list_num, kmutex_t **lock)
	{
	list_t *list;
	int idx;

	ASSERT(list_num >= 0 && list_num < 2 * ARC_BUFC_NUMLISTS);

	if (list_num < ARC_BUFC_NUMMETADATALISTS) {
	idx = list_num;
	list = &arc_mfu->arcs_lists[idx];
	*lock = ARCS_LOCK(arc_mfu, idx);
	} else if (list_num < ARC_BUFC_NUMMETADATALISTS * 2) {
	idx = list_num - ARC_BUFC_NUMMETADATALISTS;
	list = &arc_mru->arcs_lists[idx];
	*lock = ARCS_LOCK(arc_mru, idx);
	} else if (list_num < (ARC_BUFC_NUMMETADATALISTS * 2 +
	ARC_BUFC_NUMDATALISTS)) {
	idx = list_num - ARC_BUFC_NUMMETADATALISTS;
	list = &arc_mfu->arcs_lists[idx];
	*lock = ARCS_LOCK(arc_mfu, idx);
	} else {
	idx = list_num - ARC_BUFC_NUMLISTS;
	list = &arc_mru->arcs_lists[idx];
	*lock = ARCS_LOCK(arc_mru, idx);
	}

	ASSERT(!(MUTEX_HELD(*lock)));
	mutex_enter(*lock);
	return (list);
	}
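
	/*
	* Illustrative sketch, not part of arc.c: the list_num to (state, index)
	* mapping of l2arc_list_locked() without the locking. The struct and
	* function names are hypothetical, and the sketch assumes that
	* ARC_BUFC_NUMLISTS equals the metadata list count plus the data list
	* count, with metadata lists stored first in each state's array.
	*/

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_list_ref {
	bool mfu;	/* true: MFU state, false: MRU state */
	int idx;	/* index into that state's list array */
};

static struct sketch_list_ref
sketch_list_for(int list_num, int nmeta, int ndata)
{
	struct sketch_list_ref ref;
	int nlists = nmeta + ndata;

	assert(list_num >= 0 && list_num < 2 * nlists);

	if (list_num < nmeta) {			/* MFU metadata */
		ref.mfu = true;
		ref.idx = list_num;
	} else if (list_num < 2 * nmeta) {	/* MRU metadata */
		ref.mfu = false;
		ref.idx = list_num - nmeta;
	} else if (list_num < 2 * nmeta + ndata) {	/* MFU data */
		ref.mfu = true;
		ref.idx = list_num - nmeta;
	} else {				/* MRU data */
		ref.mfu = false;
		ref.idx = list_num - nlists;
	}
	return (ref);
}
```

	/*
	* With two metadata and two data lists per state, list numbers 0..7
	* visit MFU metadata (0, 1), MRU metadata (0, 1), MFU data (2, 3) and
	* then MRU data (2, 3).
	*/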

	/*
	* Evict buffers from the device write hand to the distance specified in
	* bytes. This distance may span populated buffers, it may span nothing.
	* This is clearing a region on the L2ARC device ready for writing.
	* If the 'all' boolean is set, every buffer is evicted.
	*/
	static void
	l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
	{
	list_t *buflist;
	l2arc_buf_hdr_t *abl2;
	arc_buf_hdr_t *ab, *ab_prev;
	kmutex_t *hash_lock;
	uint64_t taddr;

	buflist = dev->l2ad_buflist;

	if (buflist == NULL)
	return;

	if (!all && dev->l2ad_first) {
	/*
	* This is the first sweep through the device. There is
	* nothing to evict.
	*/
	return;
	}

	if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
	/*
	* When nearing the end of the device, evict to the end
	* before the device write hand jumps to the start.
	*/
	taddr = dev->l2ad_end;
	} else {
	taddr = dev->l2ad_hand + distance;
	}
	DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
	uint64_t, taddr, boolean_t, all);

	top:
	mutex_enter(&l2arc_buflist_mtx);
	for (ab = list_tail(buflist); ab; ab = ab_prev) {
	ab_prev = list_prev(buflist, ab);

	hash_lock = HDR_LOCK(ab);
	if (!mutex_tryenter(hash_lock)) {
	/*
	* Missed the hash lock. Retry.
	*/
	ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
	mutex_exit(&l2arc_buflist_mtx);
	mutex_enter(hash_lock);
	mutex_exit(hash_lock);
	goto top;
	}

	if (HDR_L2_WRITE_HEAD(ab)) {
	/*
	* We hit a write head node. Leave it for
	* l2arc_write_done().
	*/
	list_remove(buflist, ab);
	mutex_exit(hash_lock);
	continue;
	}

	if (!all && ab->b_l2hdr != NULL &&
	(ab->b_l2hdr->b_daddr > taddr ||
	ab->b_l2hdr->b_daddr < dev->l2ad_hand)) {
	/*
	* We've evicted to the target address,
	* or the end of the device.
	*/
	mutex_exit(hash_lock);
	break;
	}

	if (HDR_FREE_IN_PROGRESS(ab)) {
	/*
	* Already on the path to destruction.
	*/
	mutex_exit(hash_lock);
	continue;
	}

	if (ab->b_state == arc_l2c_only) {
	ASSERT(!HDR_L2_READING(ab));
	/*
	* This doesn't exist in the ARC. Destroy.
	* arc_hdr_destroy() will call list_remove()
	* and decrement arcstat_l2_size.
	*/
	arc_change_state(arc_anon, ab, hash_lock);
	arc_hdr_destroy(ab);
	} else {
	/*
	* Invalidate issued or about to be issued
	* reads, since we may be about to write
	* over this location.
	*/
	if (HDR_L2_READING(ab)) {
	ARCSTAT_BUMP(arcstat_l2_evict_reading);
	ab->b_flags |= ARC_L2_EVICTED;
	}

	/*
	* Tell ARC this no longer exists in L2ARC.
	*/
	if (ab->b_l2hdr != NULL) {
	abl2 = ab->b_l2hdr;
	ab->b_l2hdr = NULL;
	kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
	ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
	}
	list_remove(buflist, ab);

	/*
	* This may have been leftover after a
	* failed write.
	*/
	ab->b_flags &= ~ARC_L2_WRITING;
	}
	mutex_exit(hash_lock);
	}
	mutex_exit(&l2arc_buflist_mtx);

	spa_l2cache_space_update(dev->l2ad_vdev, 0, -(taddr - dev->l2ad_evict));
	dev->l2ad_evict = taddr;
	}
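
The eviction window that l2arc_evict() clears can be looked at in isolation. The following is a simplified user-space sketch, not the kernel code: `evict_target()` and `past_window()` are hypothetical helper names, and the plain integer arguments stand in for the `l2ad_hand`, `l2ad_end` and `b_daddr` fields.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: address up to which buffers are evicted.
 * Near the end of the device the whole tail is cleared, because the
 * write hand is about to jump back to the start.
 */
static uint64_t
evict_target(uint64_t hand, uint64_t end, uint64_t distance)
{
	if (hand >= end - (2 * distance))
		return (end);
	return (hand + distance);
}

/*
 * Hypothetical helper: true once a buffer at daddr lies outside the
 * window [hand, taddr]. A sweep that is not evicting everything
 * stops at the first such buffer.
 */
static bool
past_window(uint64_t daddr, uint64_t hand, uint64_t taddr)
{
	return (daddr > taddr || daddr < hand);
}
```

With a hand at 100 on a device ending at 1000 and a distance of 50, the target is 150; with the hand at 900 the target becomes the device end.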

	/*
	* Find and write ARC buffers to the L2ARC device.
	*
	* An ARC_L2_WRITING flag is set so that the L2ARC buffers are not valid
	* for reading until they have completed writing.
	*/
	static uint64_t
	l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
	{
	arc_buf_hdr_t *ab, *ab_prev, *head;
	l2arc_buf_hdr_t *hdrl2;
	list_t *list;
	uint64_t passed_sz, write_sz, buf_sz, headroom;
	void *buf_data;
	kmutex_t *hash_lock, *list_lock;
	boolean_t have_lock, full;
	l2arc_write_callback_t *cb;
	zio_t *pio, *wzio;
	int try;

	ASSERT(dev->l2ad_vdev != NULL);

	pio = NULL;
	write_sz = 0;
	full = B_FALSE;
	head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
	head->b_flags |= ARC_L2_WRITE_HEAD;

	ARCSTAT_BUMP(arcstat_l2_write_buffer_iter);
	/*
	* Copy buffers for L2ARC writing.
	*/
	mutex_enter(&l2arc_buflist_mtx);
	for (try = 0; try < 2 * ARC_BUFC_NUMLISTS; try++) {
	list = l2arc_list_locked(try, &list_lock);
	passed_sz = 0;
	ARCSTAT_BUMP(arcstat_l2_write_buffer_list_iter);

	/*
	* L2ARC fast warmup.
	*
	* Until the ARC is warm and starts to evict, read from the
	* head of the ARC lists rather than the tail.
	*/
	headroom = target_sz * l2arc_headroom;
	if (arc_warm == B_FALSE)
	ab = list_head(list);
	else
	ab = list_tail(list);
	if (ab == NULL)
	ARCSTAT_BUMP(arcstat_l2_write_buffer_list_null_iter);

	for (; ab; ab = ab_prev) {
	if (arc_warm == B_FALSE)
	ab_prev = list_next(list, ab);
	else
	ab_prev = list_prev(list, ab);
	ARCSTAT_INCR(arcstat_l2_write_buffer_bytes_scanned, ab->b_size);

	hash_lock = HDR_LOCK(ab);
	have_lock = MUTEX_HELD(hash_lock);
	if (!have_lock && !mutex_tryenter(hash_lock)) {
	ARCSTAT_BUMP(arcstat_l2_write_trylock_fail);
	/*
	* Skip this buffer rather than waiting.
	*/
	continue;
	}

	passed_sz += ab->b_size;
	if (passed_sz > headroom) {
	/*
	* Searched too far.
	*/
	mutex_exit(hash_lock);
	ARCSTAT_BUMP(arcstat_l2_write_passed_headroom);
	break;
	}

	if (!l2arc_write_eligible(spa, ab)) {
	mutex_exit(hash_lock);
	continue;
	}

	if ((write_sz + ab->b_size) > target_sz) {
	full = B_TRUE;
	mutex_exit(hash_lock);
	ARCSTAT_BUMP(arcstat_l2_write_full);
	break;
	}

	if (pio == NULL) {
	/*
	* Insert a dummy header on the buflist so
	* l2arc_write_done() can find where the
	* write buffers begin without searching.
	*/
	list_insert_head(dev->l2ad_buflist, head);

	cb = kmem_alloc(
	sizeof (l2arc_write_callback_t), KM_SLEEP);
	cb->l2wcb_dev = dev;
	cb->l2wcb_head = head;
	pio = zio_root(spa, l2arc_write_done, cb,
	ZIO_FLAG_CANFAIL);
	ARCSTAT_BUMP(arcstat_l2_write_pios);
	}

	/*
	* Create and add a new L2ARC header.
	*/
	hdrl2 = kmem_zalloc(sizeof (l2arc_buf_hdr_t), KM_SLEEP);
	hdrl2->b_dev = dev;
	hdrl2->b_daddr = dev->l2ad_hand;

	ab->b_flags |= ARC_L2_WRITING;
	ab->b_l2hdr = hdrl2;
	list_insert_head(dev->l2ad_buflist, ab);
	buf_data = ab->b_buf->b_data;
	buf_sz = ab->b_size;

	/*
	* Compute and store the buffer cksum before
	* writing. On debug the cksum is verified first.
	*/
	arc_cksum_verify(ab->b_buf);
	arc_cksum_compute(ab->b_buf, B_TRUE);

	mutex_exit(hash_lock);

	wzio = zio_write_phys(pio, dev->l2ad_vdev,
	dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
	NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE,
	ZIO_FLAG_CANFAIL, B_FALSE);

	DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
	zio_t *, wzio);
	(void) zio_nowait(wzio);

	/*
	* Keep the clock hand suitably device-aligned.
	*/
	buf_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);

	write_sz += buf_sz;
	dev->l2ad_hand += buf_sz;
	}

	mutex_exit(list_lock);

	if (full == B_TRUE)
	break;
	}
	mutex_exit(&l2arc_buflist_mtx);

	if (pio == NULL) {
	ASSERT3U(write_sz, ==, 0);
	kmem_cache_free(hdr_cache, head);
	return (0);
	}

	ASSERT3U(write_sz, <=, target_sz);
	ARCSTAT_BUMP(arcstat_l2_writes_sent);
	ARCSTAT_INCR(arcstat_l2_write_bytes, write_sz);
	ARCSTAT_INCR(arcstat_l2_size, write_sz);
	spa_l2cache_space_update(dev->l2ad_vdev, 0, write_sz);

	/*
	* Bump device hand to the device start if it is approaching the end.
	* l2arc_evict() will already have evicted ahead for this case.
	*/
	if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
	spa_l2cache_space_update(dev->l2ad_vdev, 0,
	dev->l2ad_end - dev->l2ad_hand);
	dev->l2ad_hand = dev->l2ad_start;
	dev->l2ad_evict = dev->l2ad_start;
	dev->l2ad_first = B_FALSE;
	}

	dev->l2ad_writing = B_TRUE;
	(void) zio_wait(pio);
	dev->l2ad_writing = B_FALSE;

	return (write_sz);
	}
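
The way the write hand advances and wraps at the end of l2arc_write_buffers() can be modelled on its own. This is a minimal sketch under stated assumptions: `struct hand_model`, `hand_advance()` and `hand_wrap()` are hypothetical names that stand in for the `l2ad_*` fields, and the space accounting done by spa_l2cache_space_update() is reduced to an output parameter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the l2ad_* fields used by the write path. */
struct hand_model {
	uint64_t start;
	uint64_t end;
	uint64_t hand;
	uint64_t evict;
	bool first;
};

/* Advance the hand by the (already device-aligned) size of one write. */
static void
hand_advance(struct hand_model *d, uint64_t asize)
{
	d->hand += asize;
}

/*
 * Jump back to the device start when fewer than target_sz bytes
 * remain. On a wrap, *skipped receives the unused tail bytes and the
 * device is no longer on its first sweep. Returns true if it wrapped.
 */
static bool
hand_wrap(struct hand_model *d, uint64_t target_sz, uint64_t *skipped)
{
	if (d->hand < d->end - target_sz)
		return (false);
	*skipped = d->end - d->hand;
	d->hand = d->start;
	d->evict = d->start;
	d->first = false;
	return (true);
}
```

Because the eviction pass has already cleared to the device end in this case, the model resets the evict pointer together with the hand.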

	/*
	* This thread feeds the L2ARC at regular intervals. This is the beating
	* heart of the L2ARC.
	*/
	static void
	l2arc_feed_thread(void *dummy __unused)
	{
	callb_cpr_t cpr;
	l2arc_dev_t *dev;
	spa_t *spa;
	uint64_t size, wrote;
	clock_t begin, next = LBOLT;

	CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);

	mutex_enter(&l2arc_feed_thr_lock);

	while (l2arc_thread_exit == 0) {
	CALLB_CPR_SAFE_BEGIN(&cpr);
	(void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
	next - LBOLT);
	CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
	next = LBOLT + hz;

	/*
	* Quick check for L2ARC devices.
	*/
	mutex_enter(&l2arc_dev_mtx);
	if (l2arc_ndev == 0) {
	mutex_exit(&l2arc_dev_mtx);
	continue;
	}
	mutex_exit(&l2arc_dev_mtx);
	begin = LBOLT;

	/*
	* This selects the next l2arc device to write to, and in
	* doing so the next spa to feed from: dev->l2ad_spa. This
	* will return NULL if there are now no l2arc devices or if
	* they are all faulted.
	*
	* If a device is returned, its spa's config lock is also
	* held to prevent device removal. l2arc_dev_get_next()
	* will grab and release l2arc_dev_mtx.
	*/
	if ((dev = l2arc_dev_get_next()) == NULL)
	continue;

	spa = dev->l2ad_spa;
	ASSERT(spa != NULL);

	/*
	* Avoid contributing to memory pressure.
	*/
	if (arc_reclaim_needed()) {
	ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
	spa_config_exit(spa, SCL_L2ARC, dev);
	continue;
	}

	ARCSTAT_BUMP(arcstat_l2_feeds);

	size = l2arc_write_size(dev);

	/*
	* Evict L2ARC buffers that will be overwritten.
	*/
	l2arc_evict(dev, size, B_FALSE);

	/*
	* Write ARC buffers.
	*/
	wrote = l2arc_write_buffers(spa, dev, size);

	/*
	* Calculate interval between writes.
	*/
	next = l2arc_write_interval(begin, size, wrote);
	spa_config_exit(spa, SCL_L2ARC, dev);
	}

	l2arc_thread_exit = 0;
	cv_broadcast(&l2arc_feed_thr_cv);
	CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
	thread_exit();
	}

	boolean_t
	l2arc_vdev_present(vdev_t *vd)
	{
	l2arc_dev_t *dev;

	mutex_enter(&l2arc_dev_mtx);
	for (dev = list_head(l2arc_dev_list); dev != NULL;
	dev = list_next(l2arc_dev_list, dev)) {
	if (dev->l2ad_vdev == vd)
	break;
	}
	mutex_exit(&l2arc_dev_mtx);

	return (dev != NULL);
	}

	/*
	* Add a vdev for use by the L2ARC. By this point the spa has already
	* validated the vdev and opened it.
	*/
	void
	l2arc_add_vdev(spa_t *spa, vdev_t *vd, uint64_t start, uint64_t end)
	{
	l2arc_dev_t *adddev;

	ASSERT(!l2arc_vdev_present(vd));

	/*
	* Create a new l2arc device entry.
	*/
	adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
	adddev->l2ad_spa = spa;
	adddev->l2ad_vdev = vd;
	adddev->l2ad_write = l2arc_write_max;
	adddev->l2ad_boost = l2arc_write_boost;
	adddev->l2ad_start = start;
	adddev->l2ad_end = end;
	adddev->l2ad_hand = adddev->l2ad_start;
	adddev->l2ad_evict = adddev->l2ad_start;
	adddev->l2ad_first = B_TRUE;
	adddev->l2ad_writing = B_FALSE;
	ASSERT3U(adddev->l2ad_write, >, 0);

	/*
	* This is a list of all ARC buffers that are still valid on the
	* device.
	*/
	adddev->l2ad_buflist = kmem_zalloc(sizeof (list_t), KM_SLEEP);
	list_create(adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
	offsetof(arc_buf_hdr_t, b_l2node));

	spa_l2cache_space_update(vd, adddev->l2ad_end - adddev->l2ad_hand, 0);

	/*
	* Add device to global list
	*/
	mutex_enter(&l2arc_dev_mtx);
	list_insert_head(l2arc_dev_list, adddev);
	atomic_inc_64(&l2arc_ndev);
	mutex_exit(&l2arc_dev_mtx);
	}

	/*
	* Remove a vdev from the L2ARC.
	*/
	void
	l2arc_remove_vdev(vdev_t *vd)
	{
	l2arc_dev_t *dev, *nextdev, *remdev = NULL;

	/*
	* Find the device by vdev
	*/
	mutex_enter(&l2arc_dev_mtx);
	for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
	nextdev = list_next(l2arc_dev_list, dev);
	if (vd == dev->l2ad_vdev) {
	remdev = dev;
	break;
	}
	}
	ASSERT(remdev != NULL);

	/*
	* Remove device from global list
	*/
	list_remove(l2arc_dev_list, remdev);
	l2arc_dev_last = NULL; /* may have been invalidated */
	atomic_dec_64(&l2arc_ndev);
	mutex_exit(&l2arc_dev_mtx);

	/*
	* Clear all buflists and ARC references. L2ARC device flush.
	*/
	l2arc_evict(remdev, 0, B_TRUE);
	list_destroy(remdev->l2ad_buflist);
	kmem_free(remdev->l2ad_buflist, sizeof (list_t));
	kmem_free(remdev, sizeof (l2arc_dev_t));
	}

	void
	l2arc_init(void)
	{
	l2arc_thread_exit = 0;
	l2arc_ndev = 0;
	l2arc_writes_sent = 0;
	l2arc_writes_done = 0;

	mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
	mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&l2arc_buflist_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);

	l2arc_dev_list = &L2ARC_dev_list;
	l2arc_free_on_write = &L2ARC_free_on_write;
	list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
	offsetof(l2arc_dev_t, l2ad_node));
	list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
	offsetof(l2arc_data_free_t, l2df_list_node));
	}

	void
	l2arc_fini(void)
	{
	/*
	* This is called from dmu_fini(), which is called from spa_fini();
	* Because of this, we can assume that all l2arc devices have
	* already been removed when the pools themselves were removed.
	*/

	l2arc_do_free_on_write();

	mutex_destroy(&l2arc_feed_thr_lock);
	cv_destroy(&l2arc_feed_thr_cv);
	mutex_destroy(&l2arc_dev_mtx);
	mutex_destroy(&l2arc_buflist_mtx);
	mutex_destroy(&l2arc_free_on_write_mtx);

	list_destroy(l2arc_dev_list);
	list_destroy(l2arc_free_on_write);
	}

	void
	l2arc_start(void)
	{
	if (!(spa_mode & FWRITE))
	return;

	(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
	TS_RUN, minclsyspri);
	}

	void
	l2arc_stop(void)
	{
	if (!(spa_mode & FWRITE))
	return;

	mutex_enter(&l2arc_feed_thr_lock);
	cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
	l2arc_thread_exit = 1;
	while (l2arc_thread_exit != 0)
	cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
	mutex_exit(&l2arc_feed_thr_lock);
	}
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c (revision 209274)
	@@ -1,1066 +1,1066 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	#include <sys/dmu.h>
	#include <sys/dmu_impl.h>
	#include <sys/dbuf.h>
	#include <sys/dmu_tx.h>
	#include <sys/dmu_objset.h>
	#include <sys/dsl_dataset.h> /* for dsl_dataset_block_freeable() */
	#include <sys/dsl_dir.h> /* for dsl_dir_tempreserve_*() */
	#include <sys/dsl_pool.h>
	#include <sys/zap_impl.h> /* for fzap_default_block_shift */
	#include <sys/spa.h>
	#include <sys/zfs_context.h>

	typedef void (*dmu_tx_hold_func_t)(dmu_tx_t *tx, struct dnode *dn,
	uint64_t arg1, uint64_t arg2);


	dmu_tx_t *
	dmu_tx_create_dd(dsl_dir_t *dd)
	{
	dmu_tx_t *tx = kmem_zalloc(sizeof (dmu_tx_t), KM_SLEEP);
	tx->tx_dir = dd;
	if (dd)
	tx->tx_pool = dd->dd_pool;
	list_create(&tx->tx_holds, sizeof (dmu_tx_hold_t),
	offsetof(dmu_tx_hold_t, txh_node));
	#ifdef ZFS_DEBUG
	refcount_create(&tx->tx_space_written);
	refcount_create(&tx->tx_space_freed);
	#endif
	return (tx);
	}

	dmu_tx_t *
	dmu_tx_create(objset_t *os)
	{
	dmu_tx_t *tx = dmu_tx_create_dd(os->os->os_dsl_dataset->ds_dir);
	tx->tx_objset = os;
	tx->tx_lastsnap_txg = dsl_dataset_prev_snap_txg(os->os->os_dsl_dataset);
	return (tx);
	}

	dmu_tx_t *
	dmu_tx_create_assigned(struct dsl_pool *dp, uint64_t txg)
	{
	dmu_tx_t *tx = dmu_tx_create_dd(NULL);

	ASSERT3U(txg, <=, dp->dp_tx.tx_open_txg);
	tx->tx_pool = dp;
	tx->tx_txg = txg;
	tx->tx_anyobj = TRUE;

	return (tx);
	}

	int
	dmu_tx_is_syncing(dmu_tx_t *tx)
	{
	return (tx->tx_anyobj);
	}

	int
	dmu_tx_private_ok(dmu_tx_t *tx)
	{
	return (tx->tx_anyobj);
	}

	static dmu_tx_hold_t *
	dmu_tx_hold_object_impl(dmu_tx_t *tx, objset_t *os, uint64_t object,
	enum dmu_tx_hold_type type, uint64_t arg1, uint64_t arg2)
	{
	dmu_tx_hold_t *txh;
	dnode_t *dn = NULL;
	int err;

	if (object != DMU_NEW_OBJECT) {
	err = dnode_hold(os->os, object, tx, &dn);
	if (err) {
	tx->tx_err = err;
	return (NULL);
	}

	if (err == 0 && tx->tx_txg != 0) {
	mutex_enter(&dn->dn_mtx);
	/*
	* dn->dn_assigned_txg == tx->tx_txg doesn't pose a
	* problem, but there's no way for it to happen (for
	* now, at least).
	*/
	ASSERT(dn->dn_assigned_txg == 0);
	dn->dn_assigned_txg = tx->tx_txg;
	(void) refcount_add(&dn->dn_tx_holds, tx);
	mutex_exit(&dn->dn_mtx);
	}
	}

	txh = kmem_zalloc(sizeof (dmu_tx_hold_t), KM_SLEEP);
	txh->txh_tx = tx;
	txh->txh_dnode = dn;
	#ifdef ZFS_DEBUG
	txh->txh_type = type;
	txh->txh_arg1 = arg1;
	txh->txh_arg2 = arg2;
	#endif
	list_insert_tail(&tx->tx_holds, txh);

	return (txh);
	}

	void
	dmu_tx_add_new_object(dmu_tx_t *tx, objset_t *os, uint64_t object)
	{
	/*
	* If we're syncing, they can manipulate any object anyhow, and
	* the hold on the dnode_t can cause problems.
	*/
	if (!dmu_tx_is_syncing(tx)) {
	(void) dmu_tx_hold_object_impl(tx, os,
	object, THT_NEWOBJECT, 0, 0);
	}
	}

	static int
	dmu_tx_check_ioerr(zio_t *zio, dnode_t *dn, int level, uint64_t blkid)
	{
	int err;
	dmu_buf_impl_t *db;

	rw_enter(&dn->dn_struct_rwlock, RW_READER);
	db = dbuf_hold_level(dn, level, blkid, FTAG);
	rw_exit(&dn->dn_struct_rwlock);
	if (db == NULL)
	return (EIO);
	err = dbuf_read(db, zio, DB_RF_CANFAIL | DB_RF_NOPREFETCH);
	dbuf_rele(db, FTAG);
	return (err);
	}

	/* ARGSUSED */
	static void
	dmu_tx_count_write(dmu_tx_hold_t *txh, uint64_t off, uint64_t len)
	{
	dnode_t *dn = txh->txh_dnode;
	uint64_t start, end, i;
	int min_bs, max_bs, min_ibs, max_ibs, epbs, bits;
	int err = 0;

	if (len == 0)
	return;

	min_bs = SPA_MINBLOCKSHIFT;
	max_bs = SPA_MAXBLOCKSHIFT;
	min_ibs = DN_MIN_INDBLKSHIFT;
	max_ibs = DN_MAX_INDBLKSHIFT;


	/*
	* For i/o error checking, read the first and last level-0
	* blocks (if they are not aligned), and all the level-1 blocks.
	*/

	if (dn) {
	if (dn->dn_maxblkid == 0) {
	err = dmu_tx_check_ioerr(NULL, dn, 0, 0);
	if (err)
	goto out;
	} else {
	zio_t *zio = zio_root(dn->dn_objset->os_spa,
	NULL, NULL, ZIO_FLAG_CANFAIL);

	/* first level-0 block */
	start = off >> dn->dn_datablkshift;
	if (P2PHASE(off, dn->dn_datablksz) ||
	len < dn->dn_datablksz) {
	err = dmu_tx_check_ioerr(zio, dn, 0, start);
	if (err)
	goto out;
	}

	/* last level-0 block */
	end = (off+len-1) >> dn->dn_datablkshift;
	if (end != start &&
	P2PHASE(off+len, dn->dn_datablksz)) {
	err = dmu_tx_check_ioerr(zio, dn, 0, end);
	if (err)
	goto out;
	}

	/* level-1 blocks */
	if (dn->dn_nlevels > 1) {
	start >>= dn->dn_indblkshift - SPA_BLKPTRSHIFT;
	end >>= dn->dn_indblkshift - SPA_BLKPTRSHIFT;
	for (i = start+1; i < end; i++) {
	err = dmu_tx_check_ioerr(zio, dn, 1, i);
	if (err)
	goto out;
	}
	}

	err = zio_wait(zio);
	if (err)
	goto out;
	}
	}

	/*
	* If there's more than one block, the blocksize can't change,
	* so we can make a more precise estimate. Alternatively,
	* if the dnode's ibs is larger than max_ibs, always use that.
	* This ensures that if we reduce DN_MAX_INDBLKSHIFT,
	* the code will still work correctly on existing pools.
	*/
	if (dn && (dn->dn_maxblkid != 0 || dn->dn_indblkshift > max_ibs)) {
	min_ibs = max_ibs = dn->dn_indblkshift;
	if (dn->dn_datablkshift != 0)
	min_bs = max_bs = dn->dn_datablkshift;
	}

	/*
	* 'end' is the last thing we will access, not one past.
	* This way we won't overflow when accessing the last byte.
	*/
	start = P2ALIGN(off, 1ULL << max_bs);
	end = P2ROUNDUP(off + len, 1ULL << max_bs) - 1;
	txh->txh_space_towrite += end - start + 1;

	start >>= min_bs;
	end >>= min_bs;

	epbs = min_ibs - SPA_BLKPTRSHIFT;

	/*
	* The object contains at most 2^(64 - min_bs) blocks,
	* and each indirect level maps 2^epbs.
	*/
	for (bits = 64 - min_bs; bits >= 0; bits -= epbs) {
	start >>= epbs;
	end >>= epbs;
	/*
	* If we increase the number of levels of indirection,
	* we'll need new blkid=0 indirect blocks. If start == 0,
	* we're already accounting for that block; and if end == 0,
	* we can't increase the number of levels beyond that.
	*/
	if (start != 0 && end != 0)
	txh->txh_space_towrite += 1ULL << max_ibs;
	txh->txh_space_towrite += (end - start + 1) << max_ibs;
	}

	ASSERT(txh->txh_space_towrite < 2 * DMU_MAX_ACCESS);

	out:
	if (err)
	txh->txh_tx->tx_err = err;
	}
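
The space estimate in dmu_tx_count_write() combines a rounded data range with one charge per possible indirect level. The following sketch reproduces only that arithmetic, without the I/O error checks; `estimate_towrite()`, `p2align()` and `p2roundup()` are hypothetical names, and `BLKPTR_SHIFT` assumes a 128-byte block pointer in place of `SPA_BLKPTRSHIFT`.

```c
#include <stdint.h>

#define	BLKPTR_SHIFT	7	/* assumed: log2 of a 128-byte block pointer */

/* Round x down to a power-of-two alignment. */
static uint64_t
p2align(uint64_t x, uint64_t align)
{
	return (x & ~(align - 1));
}

/* Round x up to a power-of-two alignment. */
static uint64_t
p2roundup(uint64_t x, uint64_t align)
{
	return ((x + align - 1) & ~(align - 1));
}

/*
 * Hypothetical helper: worst-case bytes to write for [off, off + len).
 * Data is rounded out to the largest block size; indirect blocks are
 * counted with the smallest fan-out and charged at the largest size.
 */
static uint64_t
estimate_towrite(uint64_t off, uint64_t len, int min_bs, int max_bs,
    int min_ibs, int max_ibs)
{
	uint64_t start, end, space;
	int epbs = min_ibs - BLKPTR_SHIFT;
	int bits;

	if (len == 0)
		return (0);

	/* 'end' is the last byte accessed, not one past it. */
	start = p2align(off, 1ULL << max_bs);
	end = p2roundup(off + len, 1ULL << max_bs) - 1;
	space = end - start + 1;

	start >>= min_bs;
	end >>= min_bs;
	for (bits = 64 - min_bs; bits >= 0; bits -= epbs) {
		start >>= epbs;
		end >>= epbs;
		/* A new top level would need a fresh blkid 0 block. */
		if (start != 0 && end != 0)
			space += 1ULL << max_ibs;
		space += (end - start + 1) << max_ibs;
	}
	return (space);
}
```

For a single 128 KB block with 16 KB indirect blocks (shifts 17 and 14), the loop runs seven times, so the estimate is 131072 + 7 * 16384 bytes.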

	static void
	dmu_tx_count_dnode(dmu_tx_hold_t *txh)
	{
	dnode_t *dn = txh->txh_dnode;
	dnode_t *mdn = txh->txh_tx->tx_objset->os->os_meta_dnode;
	uint64_t space = mdn->dn_datablksz +
	((mdn->dn_nlevels-1) << mdn->dn_indblkshift);

	if (dn && dn->dn_dbuf->db_blkptr &&
	dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
	dn->dn_dbuf->db_blkptr->blk_birth)) {
	txh->txh_space_tooverwrite += space;
	} else {
	txh->txh_space_towrite += space;
	if (dn && dn->dn_dbuf->db_blkptr)
	txh->txh_space_tounref += space;
	}
	}

	void
	dmu_tx_hold_write(dmu_tx_t *tx, uint64_t object, uint64_t off, int len)
	{
	dmu_tx_hold_t *txh;

	ASSERT(tx->tx_txg == 0);
	ASSERT(len < DMU_MAX_ACCESS);
	ASSERT(len == 0 || UINT64_MAX - off >= len - 1);

	txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
	object, THT_WRITE, off, len);
	if (txh == NULL)
	return;

	dmu_tx_count_write(txh, off, len);
	dmu_tx_count_dnode(txh);
	}

	static void
	dmu_tx_count_free(dmu_tx_hold_t *txh, uint64_t off, uint64_t len)
	{
	uint64_t blkid, nblks, lastblk;
	uint64_t space = 0, unref = 0, skipped = 0;
	dnode_t *dn = txh->txh_dnode;
	dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset;
	spa_t *spa = txh->txh_tx->tx_pool->dp_spa;
	int epbs;

	if (dn->dn_nlevels == 0)
	return;

	/*
	* The struct_rwlock protects us against dn_nlevels
	* changing, in case (against all odds) we manage to dirty &
	* sync out the changes after we check for being dirty.
	* Also, dbuf_hold_level() wants us to have the struct_rwlock.
	*/
	rw_enter(&dn->dn_struct_rwlock, RW_READER);
	epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
	if (dn->dn_maxblkid == 0) {
	if (off == 0 && len >= dn->dn_datablksz) {
	blkid = 0;
	nblks = 1;
	} else {
	rw_exit(&dn->dn_struct_rwlock);
	return;
	}
	} else {
	blkid = off >> dn->dn_datablkshift;
	nblks = (len + dn->dn_datablksz - 1) >> dn->dn_datablkshift;

	if (blkid >= dn->dn_maxblkid) {
	rw_exit(&dn->dn_struct_rwlock);
	return;
	}
	if (blkid + nblks > dn->dn_maxblkid)
	nblks = dn->dn_maxblkid - blkid;

	}
	if (dn->dn_nlevels == 1) {
	int i;
	for (i = 0; i < nblks; i++) {
	blkptr_t *bp = dn->dn_phys->dn_blkptr;
	ASSERT3U(blkid + i, <, dn->dn_nblkptr);
	bp += blkid + i;
	if (dsl_dataset_block_freeable(ds, bp->blk_birth)) {
	dprintf_bp(bp, "can free old%s", "");
	space += bp_get_dasize(spa, bp);
	}
	unref += BP_GET_ASIZE(bp);
	}
	nblks = 0;
	}

	/*
	* Add in memory requirements of higher-level indirects.
	* This assumes a worst-possible scenario for dn_nlevels.
	*/
	{
	uint64_t blkcnt = 1 + ((nblks >> epbs) >> epbs);
	int level = (dn->dn_nlevels > 1) ? 2 : 1;

	while (level++ < DN_MAX_LEVELS) {
	txh->txh_memory_tohold += blkcnt << dn->dn_indblkshift;
	blkcnt = 1 + (blkcnt >> epbs);
	}
	ASSERT(blkcnt <= dn->dn_nblkptr);
	}

	lastblk = blkid + nblks - 1;
	while (nblks) {
	dmu_buf_impl_t *dbuf;
	uint64_t ibyte, new_blkid;
	int epb = 1 << epbs;
	int err, i, blkoff, tochk;
	blkptr_t *bp;

	ibyte = blkid << dn->dn_datablkshift;
	err = dnode_next_offset(dn,
	DNODE_FIND_HAVELOCK, &ibyte, 2, 1, 0);
	new_blkid = ibyte >> dn->dn_datablkshift;
	if (err == ESRCH) {
	skipped += (lastblk >> epbs) - (blkid >> epbs) + 1;
	break;
	}
	if (err) {
	txh->txh_tx->tx_err = err;
	break;
	}
	if (new_blkid > lastblk) {
	skipped += (lastblk >> epbs) - (blkid >> epbs) + 1;
	break;
	}

	if (new_blkid > blkid) {
	ASSERT((new_blkid >> epbs) > (blkid >> epbs));
	skipped += (new_blkid >> epbs) - (blkid >> epbs) - 1;
	nblks -= new_blkid - blkid;
	blkid = new_blkid;
	}
	blkoff = P2PHASE(blkid, epb);
	tochk = MIN(epb - blkoff, nblks);

	dbuf = dbuf_hold_level(dn, 1, blkid >> epbs, FTAG);

	txh->txh_memory_tohold += dbuf->db.db_size;
	if (txh->txh_memory_tohold > DMU_MAX_ACCESS) {
	txh->txh_tx->tx_err = E2BIG;
	dbuf_rele(dbuf, FTAG);
	break;
	}
	err = dbuf_read(dbuf, NULL, DB_RF_HAVESTRUCT | DB_RF_CANFAIL);
	if (err != 0) {
	txh->txh_tx->tx_err = err;
	dbuf_rele(dbuf, FTAG);
	break;
	}

	bp = dbuf->db.db_data;
	bp += blkoff;

	for (i = 0; i < tochk; i++) {
	if (dsl_dataset_block_freeable(ds, bp[i].blk_birth)) {
	dprintf_bp(&bp[i], "can free old%s", "");
	space += bp_get_dasize(spa, &bp[i]);
	}
	unref += BP_GET_ASIZE(bp);
	}
	dbuf_rele(dbuf, FTAG);

	blkid += tochk;
	nblks -= tochk;
	}
	rw_exit(&dn->dn_struct_rwlock);

	/* account for new level 1 indirect blocks that might show up */
	if (skipped > 0) {
	txh->txh_fudge += skipped << dn->dn_indblkshift;
	skipped = MIN(skipped, DMU_MAX_DELETEBLKCNT >> epbs);
	txh->txh_memory_tohold += skipped << dn->dn_indblkshift;
	}
	txh->txh_space_tofree += space;
	txh->txh_space_tounref += unref;
	}

	void
	dmu_tx_hold_free(dmu_tx_t *tx, uint64_t object, uint64_t off, uint64_t len)
	{
	dmu_tx_hold_t *txh;
	dnode_t *dn;
	uint64_t start, end, i;
	int err, shift;
	zio_t *zio;

	ASSERT(tx->tx_txg == 0);

	txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
	object, THT_FREE, off, len);
	if (txh == NULL)
	return;
	dn = txh->txh_dnode;

	/* first block */
	if (off != 0)
	dmu_tx_count_write(txh, off, 1);
	/* last block */
	if (len != DMU_OBJECT_END)
	dmu_tx_count_write(txh, off+len, 1);

	if (off >= (dn->dn_maxblkid+1) * dn->dn_datablksz)
	return;
	if (len == DMU_OBJECT_END)
	len = (dn->dn_maxblkid+1) * dn->dn_datablksz - off;

	/*
	* For i/o error checking, read the first and last level-0
	* blocks, and all the level-1 blocks. The above count_write's
	* have already taken care of the level-0 blocks.
	*/
	if (dn->dn_nlevels > 1) {
	shift = dn->dn_datablkshift + dn->dn_indblkshift -
	SPA_BLKPTRSHIFT;
	start = off >> shift;
	end = dn->dn_datablkshift ? ((off+len) >> shift) : 0;

	zio = zio_root(tx->tx_pool->dp_spa,
	NULL, NULL, ZIO_FLAG_CANFAIL);
	for (i = start; i <= end; i++) {
	uint64_t ibyte = i << shift;
	err = dnode_next_offset(dn, 0, &ibyte, 2, 1, 0);
	i = ibyte >> shift;
	if (err == ESRCH)
	break;
	if (err) {
	tx->tx_err = err;
	return;
	}

	err = dmu_tx_check_ioerr(zio, dn, 1, i);
	if (err) {
	tx->tx_err = err;
	return;
	}
	}
	err = zio_wait(zio);
	if (err) {
	tx->tx_err = err;
	return;
	}
	}

	dmu_tx_count_dnode(txh);
	dmu_tx_count_free(txh, off, len);
	}

	void
	dmu_tx_hold_zap(dmu_tx_t *tx, uint64_t object, int add, char *name)
	{
	dmu_tx_hold_t *txh;
	dnode_t *dn;
	uint64_t nblocks;
	int epbs, err;

	ASSERT(tx->tx_txg == 0);

	txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
	object, THT_ZAP, add, (uintptr_t)name);
	if (txh == NULL)
	return;
	dn = txh->txh_dnode;

	dmu_tx_count_dnode(txh);

	if (dn == NULL) {
	/*
	* We will be able to fit a new object's entries into one leaf
	* block. So there will be at most 2 blocks total,
	* including the header block.
	*/
	dmu_tx_count_write(txh, 0, 2 << fzap_default_block_shift);
	return;
	}

	ASSERT3P(dmu_ot[dn->dn_type].ot_byteswap, ==, zap_byteswap);

	if (dn->dn_maxblkid == 0 && !add) {
	/*
	* If there is only one block (i.e. this is a micro-zap)
	* and we are not adding anything, the accounting is simple.
	*/
	err = dmu_tx_check_ioerr(NULL, dn, 0, 0);
	if (err) {
	tx->tx_err = err;
	return;
	}

	/*
	* Use max block size here, since we don't know how much
	* the size will change between now and the dbuf dirty call.
	*/
	if (dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
	dn->dn_phys->dn_blkptr[0].blk_birth)) {
	txh->txh_space_tooverwrite += SPA_MAXBLOCKSIZE;
	} else {
	txh->txh_space_towrite += SPA_MAXBLOCKSIZE;
	- txh->txh_space_tounref +=
	- BP_GET_ASIZE(dn->dn_phys->dn_blkptr);
	}
	+ if (dn->dn_phys->dn_blkptr[0].blk_birth)
	+ txh->txh_space_tounref += SPA_MAXBLOCKSIZE;
	return;
	}

	if (dn->dn_maxblkid > 0 && name) {
	/*
	* access the name in this fat-zap so that we'll check
	* for i/o errors to the leaf blocks, etc.
	*/
	err = zap_lookup(&dn->dn_objset->os, dn->dn_object, name,
	8, 0, NULL);
	if (err == EIO) {
	tx->tx_err = err;
	return;
	}
	}

	/*
	* 3 blocks overwritten: target leaf, ptrtbl block, header block
	* 3 new blocks written if adding: new split leaf, 2 grown ptrtbl blocks
	*/
	dmu_tx_count_write(txh, dn->dn_maxblkid * dn->dn_datablksz,
	(3 + (add ? 3 : 0)) << dn->dn_datablkshift);

	/*
	* If the modified blocks are scattered to the four winds,
	* we'll have to modify an indirect twig for each.
	*/
	epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
	for (nblocks = dn->dn_maxblkid >> epbs; nblocks != 0; nblocks >>= epbs)
	txh->txh_space_towrite += 3 << dn->dn_indblkshift;
	}

	void
	dmu_tx_hold_bonus(dmu_tx_t *tx, uint64_t object)
	{
	dmu_tx_hold_t *txh;

	ASSERT(tx->tx_txg == 0);

	txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
	object, THT_BONUS, 0, 0);
	if (txh)
	dmu_tx_count_dnode(txh);
	}

	void
	dmu_tx_hold_space(dmu_tx_t *tx, uint64_t space)
	{
	dmu_tx_hold_t *txh;
	ASSERT(tx->tx_txg == 0);

	txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
	DMU_NEW_OBJECT, THT_SPACE, space, 0);

	txh->txh_space_towrite += space;
	}

	int
	dmu_tx_holds(dmu_tx_t *tx, uint64_t object)
	{
	dmu_tx_hold_t *txh;
	int holds = 0;

	/*
	* By asserting that the tx is assigned, we're counting the
	* number of dn_tx_holds, which is the same as the number of
	* dn_holds. Otherwise, we'd be counting dn_holds, but
	* dn_tx_holds could be 0.
	*/
	ASSERT(tx->tx_txg != 0);

	/* if (tx->tx_anyobj == TRUE) */
	/* return (0); */

	for (txh = list_head(&tx->tx_holds); txh;
	txh = list_next(&tx->tx_holds, txh)) {
	if (txh->txh_dnode && txh->txh_dnode->dn_object == object)
	holds++;
	}

	return (holds);
	}

	#ifdef ZFS_DEBUG
	void
	dmu_tx_dirty_buf(dmu_tx_t *tx, dmu_buf_impl_t *db)
	{
	dmu_tx_hold_t *txh;
	int match_object = FALSE, match_offset = FALSE;
	dnode_t *dn = db->db_dnode;

	ASSERT(tx->tx_txg != 0);
	ASSERT(tx->tx_objset == NULL || dn->dn_objset == tx->tx_objset->os);
	ASSERT3U(dn->dn_object, ==, db->db.db_object);

	if (tx->tx_anyobj)
	return;

	/* XXX No checking on the meta dnode for now */
	if (db->db.db_object == DMU_META_DNODE_OBJECT)
	return;

	for (txh = list_head(&tx->tx_holds); txh;
	txh = list_next(&tx->tx_holds, txh)) {
	ASSERT(dn == NULL || dn->dn_assigned_txg == tx->tx_txg);
	if (txh->txh_dnode == dn && txh->txh_type != THT_NEWOBJECT)
	match_object = TRUE;
	if (txh->txh_dnode == NULL || txh->txh_dnode == dn) {
	int datablkshift = dn->dn_datablkshift ?
	dn->dn_datablkshift : SPA_MAXBLOCKSHIFT;
	int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
	int shift = datablkshift + epbs * db->db_level;
	uint64_t beginblk = shift >= 64 ? 0 :
	(txh->txh_arg1 >> shift);
	uint64_t endblk = shift >= 64 ? 0 :
	((txh->txh_arg1 + txh->txh_arg2 - 1) >> shift);
	uint64_t blkid = db->db_blkid;

	/* XXX txh_arg2 better not be zero... */

	dprintf("found txh type %x beginblk=%llx endblk=%llx\n",
	txh->txh_type, beginblk, endblk);

	switch (txh->txh_type) {
	case THT_WRITE:
	if (blkid >= beginblk && blkid <= endblk)
	match_offset = TRUE;
	/*
	* We will let this hold work for the bonus
	* buffer so that we don't need to hold it
	* when creating a new object.
	*/
	if (blkid == DB_BONUS_BLKID)
	match_offset = TRUE;
	/*
	* They might have to increase nlevels,
	* thus dirtying the new TLIBs. Or they
	* might have to change the block size,
	* thus dirtying the new lvl=0 blk=0.
	*/
	if (blkid == 0)
	match_offset = TRUE;
	break;
	case THT_FREE:
	/*
	* We will dirty all the level 1 blocks in
	* the free range and perhaps the first and
	* last level 0 block.
	*/
	if (blkid >= beginblk && (blkid <= endblk ||
	txh->txh_arg2 == DMU_OBJECT_END))
	match_offset = TRUE;
	break;
	case THT_BONUS:
	if (blkid == DB_BONUS_BLKID)
	match_offset = TRUE;
	break;
	case THT_ZAP:
	match_offset = TRUE;
	break;
	case THT_NEWOBJECT:
	match_object = TRUE;
	break;
	default:
	ASSERT(!"bad txh_type");
	}
	}
	if (match_object && match_offset)
	return;
	}
	panic("dirtying dbuf obj=%llx lvl=%u blkid=%llx but not tx_held\n",
	(u_longlong_t)db->db.db_object, db->db_level,
	(u_longlong_t)db->db_blkid);
	}
	#endif
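
The block-range test that dmu_tx_dirty_buf() applies to a write hold can be sketched separately. This is an illustrative model only: `struct blk_range`, `hold_range()` and `write_hold_covers()` are hypothetical names, the arguments replace the `txh_arg1`/`txh_arg2` and dnode fields, and the bonus-buffer case is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair of first and last block ids covered by a hold. */
struct blk_range {
	uint64_t begin;
	uint64_t end;
};

/*
 * Map the byte range [off, off + len) to block ids at the given
 * level. Each level up covers 2^epbs times more data per block; a
 * shift of 64 or more collapses the range to block 0.
 */
static struct blk_range
hold_range(uint64_t off, uint64_t len, int datablkshift, int epbs, int level)
{
	struct blk_range r = { 0, 0 };
	int shift = datablkshift + epbs * level;

	if (shift < 64) {
		r.begin = off >> shift;
		r.end = (off + len - 1) >> shift;
	}
	return (r);
}

/*
 * A write hold matches blocks inside the range, and also block 0,
 * which may be dirtied when the level count or block size changes.
 */
static bool
write_hold_covers(struct blk_range r, uint64_t blkid)
{
	return ((blkid >= r.begin && blkid <= r.end) || blkid == 0);
}
```

For example, a 128 KB write at offset 256 KB with 128 KB data blocks touches level-0 block 2 only, and at level 1 (with 128 pointers per indirect block) it falls in block 0.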

	static int
	dmu_tx_try_assign(dmu_tx_t *tx, uint64_t txg_how)
	{
	dmu_tx_hold_t *txh;
	spa_t *spa = tx->tx_pool->dp_spa;
	uint64_t memory, asize, fsize, usize;
	uint64_t towrite, tofree, tooverwrite, tounref, tohold, fudge;

	ASSERT3U(tx->tx_txg, ==, 0);

	if (tx->tx_err)
	return (tx->tx_err);

	if (spa_suspended(spa)) {
	/*
	* If the user has indicated a blocking failure mode
	* then return ERESTART which will block in dmu_tx_wait().
	* Otherwise, return EIO so that an error can get
	* propagated back to the VOP calls.
	*
	* Note that we always honor the txg_how flag regardless
	* of the failuremode setting.
	*/
	if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_CONTINUE &&
	txg_how != TXG_WAIT)
	return (EIO);

	return (ERESTART);
	}

	tx->tx_txg = txg_hold_open(tx->tx_pool, &tx->tx_txgh);
	tx->tx_needassign_txh = NULL;

	/*
	* NB: No error returns are allowed after txg_hold_open, but
	* before processing the dnode holds, due to the
	* dmu_tx_unassign() logic.
	*/

	towrite = tofree = tooverwrite = tounref = tohold = fudge = 0;
	for (txh = list_head(&tx->tx_holds); txh;
	txh = list_next(&tx->tx_holds, txh)) {
	dnode_t *dn = txh->txh_dnode;
	if (dn != NULL) {
	mutex_enter(&dn->dn_mtx);
	if (dn->dn_assigned_txg == tx->tx_txg - 1) {
	mutex_exit(&dn->dn_mtx);
	tx->tx_needassign_txh = txh;
	return (ERESTART);
	}
	if (dn->dn_assigned_txg == 0)
	dn->dn_assigned_txg = tx->tx_txg;
	ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);
	(void) refcount_add(&dn->dn_tx_holds, tx);
	mutex_exit(&dn->dn_mtx);
	}
	towrite += txh->txh_space_towrite;
	tofree += txh->txh_space_tofree;
	tooverwrite += txh->txh_space_tooverwrite;
	tounref += txh->txh_space_tounref;
	tohold += txh->txh_memory_tohold;
	fudge += txh->txh_fudge;
	}

	/*
	* NB: This check must be after we've held the dnodes, so that
	* the dmu_tx_unassign() logic will work properly
	*/
	if (txg_how >= TXG_INITIAL && txg_how != tx->tx_txg)
	return (ERESTART);

	/*
	* If a snapshot has been taken since we made our estimates,
	* assume that we won't be able to free or overwrite anything.
	*/
	if (tx->tx_objset &&
	dsl_dataset_prev_snap_txg(tx->tx_objset->os->os_dsl_dataset) >
	tx->tx_lastsnap_txg) {
	towrite += tooverwrite;
	tooverwrite = tofree = 0;
	}

	/* needed allocation: worst-case estimate of write space */
	asize = spa_get_asize(tx->tx_pool->dp_spa, towrite + tooverwrite);
	/* freed space estimate: worst-case overwrite + free estimate */
	fsize = spa_get_asize(tx->tx_pool->dp_spa, tooverwrite) + tofree;
	/* convert unrefd space to worst-case estimate */
	usize = spa_get_asize(tx->tx_pool->dp_spa, tounref);
	/* calculate memory footprint estimate */
	memory = towrite + tooverwrite + tohold;

	#ifdef ZFS_DEBUG
	/*
	* Add in 'tohold' to account for our dirty holds on this memory
	* XXX - the "fudge" factor is to account for skipped blocks that
	* we missed because dnode_next_offset() misses in-core-only blocks.
	*/
	tx->tx_space_towrite = asize +
	spa_get_asize(tx->tx_pool->dp_spa, tohold + fudge);
	tx->tx_space_tofree = tofree;
	tx->tx_space_tooverwrite = tooverwrite;
	tx->tx_space_tounref = tounref;
	#endif

	if (tx->tx_dir && asize != 0) {
	int err = dsl_dir_tempreserve_space(tx->tx_dir, memory,
	asize, fsize, usize, &tx->tx_tempreserve_cookie, tx);
	if (err)
	return (err);
	}

	return (0);
	}

	static void
	dmu_tx_unassign(dmu_tx_t *tx)
	{
	dmu_tx_hold_t *txh;

	if (tx->tx_txg == 0)
	return;

	txg_rele_to_quiesce(&tx->tx_txgh);

	for (txh = list_head(&tx->tx_holds); txh != tx->tx_needassign_txh;
	txh = list_next(&tx->tx_holds, txh)) {
	dnode_t *dn = txh->txh_dnode;

	if (dn == NULL)
	continue;
	mutex_enter(&dn->dn_mtx);
	ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);

	if (refcount_remove(&dn->dn_tx_holds, tx) == 0) {
	dn->dn_assigned_txg = 0;
	cv_broadcast(&dn->dn_notxholds);
	}
	mutex_exit(&dn->dn_mtx);
	}

	txg_rele_to_sync(&tx->tx_txgh);

	tx->tx_lasttried_txg = tx->tx_txg;
	tx->tx_txg = 0;
	}

	/*
	* Assign tx to a transaction group. txg_how can be one of:
	*
	* (1) TXG_WAIT. If the current open txg is full, waits until there's
	* a new one. This should be used when you're not holding locks.
	* It will only fail if we're truly out of space (or over quota).
	*
	* (2) TXG_NOWAIT. If we can't assign into the current open txg without
	* blocking, returns immediately with ERESTART. This should be used
	* whenever you're holding locks. On an ERESTART error, the caller
	* should drop locks, do a dmu_tx_wait(tx), and try again.
	*
	* (3) A specific txg. Use this if you need to ensure that multiple
	* transactions all sync in the same txg. Like TXG_NOWAIT, it
	* returns ERESTART if it can't assign you into the requested txg.
	*/
	int
	dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how)
	{
	int err;

	ASSERT(tx->tx_txg == 0);
	ASSERT(txg_how != 0);
	ASSERT(!dsl_pool_sync_context(tx->tx_pool));

	while ((err = dmu_tx_try_assign(tx, txg_how)) != 0) {
	dmu_tx_unassign(tx);

	if (err != ERESTART || txg_how != TXG_WAIT)
	return (err);

	dmu_tx_wait(tx);
	}

	txg_rele_to_quiesce(&tx->tx_txgh);

	return (0);
	}

	void
	dmu_tx_wait(dmu_tx_t *tx)
	{
	spa_t *spa = tx->tx_pool->dp_spa;

	ASSERT(tx->tx_txg == 0);

	/*
	* It's possible that the pool has become active after this thread
	* has tried to obtain a tx. If that's the case then his
	* tx_lasttried_txg would not have been assigned.
	*/
	if (spa_suspended(spa) || tx->tx_lasttried_txg == 0) {
	txg_wait_synced(tx->tx_pool, spa_last_synced_txg(spa) + 1);
	} else if (tx->tx_needassign_txh) {
	dnode_t *dn = tx->tx_needassign_txh->txh_dnode;

	mutex_enter(&dn->dn_mtx);
	while (dn->dn_assigned_txg == tx->tx_lasttried_txg - 1)
	cv_wait(&dn->dn_notxholds, &dn->dn_mtx);
	mutex_exit(&dn->dn_mtx);
	tx->tx_needassign_txh = NULL;
	} else {
	txg_wait_open(tx->tx_pool, tx->tx_lasttried_txg + 1);
	}
	}

	void
	dmu_tx_willuse_space(dmu_tx_t *tx, int64_t delta)
	{
	#ifdef ZFS_DEBUG
	if (tx->tx_dir == NULL || delta == 0)
	return;

	if (delta > 0) {
	ASSERT3U(refcount_count(&tx->tx_space_written) + delta, <=,
	tx->tx_space_towrite);
	(void) refcount_add_many(&tx->tx_space_written, delta, NULL);
	} else {
	(void) refcount_add_many(&tx->tx_space_freed, -delta, NULL);
	}
	#endif
	}

	void
	dmu_tx_commit(dmu_tx_t *tx)
	{
	dmu_tx_hold_t *txh;

	ASSERT(tx->tx_txg != 0);

	while (txh = list_head(&tx->tx_holds)) {
	dnode_t *dn = txh->txh_dnode;

	list_remove(&tx->tx_holds, txh);
	kmem_free(txh, sizeof (dmu_tx_hold_t));
	if (dn == NULL)
	continue;
	mutex_enter(&dn->dn_mtx);
	ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);

	if (refcount_remove(&dn->dn_tx_holds, tx) == 0) {
	dn->dn_assigned_txg = 0;
	cv_broadcast(&dn->dn_notxholds);
	}
	mutex_exit(&dn->dn_mtx);
	dnode_rele(dn, tx);
	}

	if (tx->tx_tempreserve_cookie)
	dsl_dir_tempreserve_clear(tx->tx_tempreserve_cookie, tx);

	if (tx->tx_anyobj == FALSE)
	txg_rele_to_sync(&tx->tx_txgh);
	list_destroy(&tx->tx_holds);
	#ifdef ZFS_DEBUG
	dprintf("towrite=%llu written=%llu tofree=%llu freed=%llu\n",
	tx->tx_space_towrite, refcount_count(&tx->tx_space_written),
	tx->tx_space_tofree, refcount_count(&tx->tx_space_freed));
	refcount_destroy_many(&tx->tx_space_written,
	refcount_count(&tx->tx_space_written));
	refcount_destroy_many(&tx->tx_space_freed,
	refcount_count(&tx->tx_space_freed));
	#endif
	kmem_free(tx, sizeof (dmu_tx_t));
	}

	void
	dmu_tx_abort(dmu_tx_t *tx)
	{
	dmu_tx_hold_t *txh;

	ASSERT(tx->tx_txg == 0);

	while (txh = list_head(&tx->tx_holds)) {
	dnode_t *dn = txh->txh_dnode;

	list_remove(&tx->tx_holds, txh);
	kmem_free(txh, sizeof (dmu_tx_hold_t));
	if (dn != NULL)
	dnode_rele(dn, tx);
	}
	list_destroy(&tx->tx_holds);
	#ifdef ZFS_DEBUG
	refcount_destroy_many(&tx->tx_space_written,
	refcount_count(&tx->tx_space_written));
	refcount_destroy_many(&tx->tx_space_freed,
	refcount_count(&tx->tx_space_freed));
	#endif
	kmem_free(tx, sizeof (dmu_tx_t));
	}

	uint64_t
	dmu_tx_get_txg(dmu_tx_t *tx)
	{
	ASSERT(tx->tx_txg != 0);
	return (tx->tx_txg);
	}
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c (revision 209274)
	@@ -1,1446 +1,1437 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	#include <sys/zfs_context.h>
	#include <sys/dbuf.h>
	#include <sys/dnode.h>
	#include <sys/dmu.h>
	#include <sys/dmu_impl.h>
	#include <sys/dmu_tx.h>
	#include <sys/dmu_objset.h>
	#include <sys/dsl_dir.h>
	#include <sys/dsl_dataset.h>
	#include <sys/spa.h>
	#include <sys/zio.h>
	#include <sys/dmu_zfetch.h>

	static int free_range_compar(const void *node1, const void *node2);

	static kmem_cache_t *dnode_cache;

	static dnode_phys_t dnode_phys_zero;

	int zfs_default_bs = SPA_MINBLOCKSHIFT;
	int zfs_default_ibs = DN_MAX_INDBLKSHIFT;

	/* ARGSUSED */
	static int
	dnode_cons(void *arg, void *unused, int kmflag)
	{
	int i;
	dnode_t *dn = arg;
	bzero(dn, sizeof (dnode_t));

	rw_init(&dn->dn_struct_rwlock, NULL, RW_DEFAULT, NULL);
	mutex_init(&dn->dn_mtx, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&dn->dn_dbufs_mtx, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&dn->dn_notxholds, NULL, CV_DEFAULT, NULL);

	refcount_create(&dn->dn_holds);
	refcount_create(&dn->dn_tx_holds);

	for (i = 0; i < TXG_SIZE; i++) {
	avl_create(&dn->dn_ranges[i], free_range_compar,
	sizeof (free_range_t),
	offsetof(struct free_range, fr_node));
	list_create(&dn->dn_dirty_records[i],
	sizeof (dbuf_dirty_record_t),
	offsetof(dbuf_dirty_record_t, dr_dirty_node));
	}

	list_create(&dn->dn_dbufs, sizeof (dmu_buf_impl_t),
	offsetof(dmu_buf_impl_t, db_link));

	return (0);
	}

	/* ARGSUSED */
	static void
	dnode_dest(void *arg, void *unused)
	{
	int i;
	dnode_t *dn = arg;

	rw_destroy(&dn->dn_struct_rwlock);
	mutex_destroy(&dn->dn_mtx);
	mutex_destroy(&dn->dn_dbufs_mtx);
	cv_destroy(&dn->dn_notxholds);
	refcount_destroy(&dn->dn_holds);
	refcount_destroy(&dn->dn_tx_holds);

	for (i = 0; i < TXG_SIZE; i++) {
	avl_destroy(&dn->dn_ranges[i]);
	list_destroy(&dn->dn_dirty_records[i]);
	}

	list_destroy(&dn->dn_dbufs);
	}

	void
	dnode_init(void)
	{
	dnode_cache = kmem_cache_create("dnode_t",
	sizeof (dnode_t),
	0, dnode_cons, dnode_dest, NULL, NULL, NULL, 0);
	}

	void
	dnode_fini(void)
	{
	kmem_cache_destroy(dnode_cache);
	}


	#ifdef ZFS_DEBUG
	void
	dnode_verify(dnode_t *dn)
	{
	int drop_struct_lock = FALSE;

	ASSERT(dn->dn_phys);
	ASSERT(dn->dn_objset);

	ASSERT(dn->dn_phys->dn_type < DMU_OT_NUMTYPES);

	if (!(zfs_flags & ZFS_DEBUG_DNODE_VERIFY))
	return;

	if (!RW_WRITE_HELD(&dn->dn_struct_rwlock)) {
	rw_enter(&dn->dn_struct_rwlock, RW_READER);
	drop_struct_lock = TRUE;
	}
	if (dn->dn_phys->dn_type != DMU_OT_NONE || dn->dn_allocated_txg != 0) {
	int i;
	ASSERT3U(dn->dn_indblkshift, >=, 0);
	ASSERT3U(dn->dn_indblkshift, <=, SPA_MAXBLOCKSHIFT);
	if (dn->dn_datablkshift) {
	ASSERT3U(dn->dn_datablkshift, >=, SPA_MINBLOCKSHIFT);
	ASSERT3U(dn->dn_datablkshift, <=, SPA_MAXBLOCKSHIFT);
	ASSERT3U(1<<dn->dn_datablkshift, ==, dn->dn_datablksz);
	}
	ASSERT3U(dn->dn_nlevels, <=, 30);
	ASSERT3U(dn->dn_type, <=, DMU_OT_NUMTYPES);
	ASSERT3U(dn->dn_nblkptr, >=, 1);
	ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
	ASSERT3U(dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
	ASSERT3U(dn->dn_datablksz, ==,
	dn->dn_datablkszsec << SPA_MINBLOCKSHIFT);
	ASSERT3U(ISP2(dn->dn_datablksz), ==, dn->dn_datablkshift != 0);
	ASSERT3U((dn->dn_nblkptr - 1) * sizeof (blkptr_t) +
	dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
	for (i = 0; i < TXG_SIZE; i++) {
	ASSERT3U(dn->dn_next_nlevels[i], <=, dn->dn_nlevels);
	}
	}
	if (dn->dn_phys->dn_type != DMU_OT_NONE)
	ASSERT3U(dn->dn_phys->dn_nlevels, <=, dn->dn_nlevels);
	ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT || dn->dn_dbuf != NULL);
	if (dn->dn_dbuf != NULL) {
	ASSERT3P(dn->dn_phys, ==,
	(dnode_phys_t *)dn->dn_dbuf->db.db_data +
	(dn->dn_object % (dn->dn_dbuf->db.db_size >> DNODE_SHIFT)));
	}
	if (drop_struct_lock)
	rw_exit(&dn->dn_struct_rwlock);
	}
	#endif

	void
	dnode_byteswap(dnode_phys_t *dnp)
	{
	uint64_t *buf64 = (void*)&dnp->dn_blkptr;
	int i;

	if (dnp->dn_type == DMU_OT_NONE) {
	bzero(dnp, sizeof (dnode_phys_t));
	return;
	}

	dnp->dn_datablkszsec = BSWAP_16(dnp->dn_datablkszsec);
	dnp->dn_bonuslen = BSWAP_16(dnp->dn_bonuslen);
	dnp->dn_maxblkid = BSWAP_64(dnp->dn_maxblkid);
	dnp->dn_used = BSWAP_64(dnp->dn_used);

	/*
	* dn_nblkptr is only one byte, so it's OK to read it in either
	* byte order. We can't read dn_bonuslen.
	*/
	ASSERT(dnp->dn_indblkshift <= SPA_MAXBLOCKSHIFT);
	ASSERT(dnp->dn_nblkptr <= DN_MAX_NBLKPTR);
	for (i = 0; i < dnp->dn_nblkptr * sizeof (blkptr_t)/8; i++)
	buf64[i] = BSWAP_64(buf64[i]);

	/*
	* OK to check dn_bonuslen for zero, because it won't matter if
	* we have the wrong byte order. This is necessary because the
	* dnode dnode is smaller than a regular dnode.
	*/
	if (dnp->dn_bonuslen != 0) {
	/*
	* Note that the bonus length calculated here may be
	* longer than the actual bonus buffer. This is because
	* we always put the bonus buffer after the last block
	* pointer (instead of packing it against the end of the
	* dnode buffer).
	*/
	int off = (dnp->dn_nblkptr-1) * sizeof (blkptr_t);
	size_t len = DN_MAX_BONUSLEN - off;
	ASSERT3U(dnp->dn_bonustype, <, DMU_OT_NUMTYPES);
	dmu_ot[dnp->dn_bonustype].ot_byteswap(dnp->dn_bonus + off, len);
	}
	}

	void
	dnode_buf_byteswap(void *vbuf, size_t size)
	{
	dnode_phys_t *buf = vbuf;
	int i;

	ASSERT3U(sizeof (dnode_phys_t), ==, (1<<DNODE_SHIFT));
	ASSERT((size & (sizeof (dnode_phys_t)-1)) == 0);

	size >>= DNODE_SHIFT;
	for (i = 0; i < size; i++) {
	dnode_byteswap(buf);
	buf++;
	}
	}

	static int
	free_range_compar(const void *node1, const void *node2)
	{
	const free_range_t *rp1 = node1;
	const free_range_t *rp2 = node2;

	if (rp1->fr_blkid < rp2->fr_blkid)
	return (-1);
	else if (rp1->fr_blkid > rp2->fr_blkid)
	return (1);
	else return (0);
	}

	void
	dnode_setbonuslen(dnode_t *dn, int newsize, dmu_tx_t *tx)
	{
	ASSERT3U(refcount_count(&dn->dn_holds), >=, 1);

	dnode_setdirty(dn, tx);
	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
	ASSERT3U(newsize, <=, DN_MAX_BONUSLEN -
	(dn->dn_nblkptr-1) * sizeof (blkptr_t));
	dn->dn_bonuslen = newsize;
	if (newsize == 0)
	dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = DN_ZERO_BONUSLEN;
	else
	dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
	rw_exit(&dn->dn_struct_rwlock);
	}

	static void
	dnode_setdblksz(dnode_t *dn, int size)
	{
	ASSERT3U(P2PHASE(size, SPA_MINBLOCKSIZE), ==, 0);
	ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
	ASSERT3U(size, >=, SPA_MINBLOCKSIZE);
	ASSERT3U(size >> SPA_MINBLOCKSHIFT, <,
	1<<(sizeof (dn->dn_phys->dn_datablkszsec) * 8));
	dn->dn_datablksz = size;
	dn->dn_datablkszsec = size >> SPA_MINBLOCKSHIFT;
	dn->dn_datablkshift = ISP2(size) ? highbit(size - 1) : 0;
	}

	static dnode_t *
	dnode_create(objset_impl_t *os, dnode_phys_t *dnp, dmu_buf_impl_t *db,
	uint64_t object)
	{
	dnode_t *dn = kmem_cache_alloc(dnode_cache, KM_SLEEP);

	dn->dn_objset = os;
	dn->dn_object = object;
	dn->dn_dbuf = db;
	dn->dn_phys = dnp;

	if (dnp->dn_datablkszsec)
	dnode_setdblksz(dn, dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
	dn->dn_indblkshift = dnp->dn_indblkshift;
	dn->dn_nlevels = dnp->dn_nlevels;
	dn->dn_type = dnp->dn_type;
	dn->dn_nblkptr = dnp->dn_nblkptr;
	dn->dn_checksum = dnp->dn_checksum;
	dn->dn_compress = dnp->dn_compress;
	dn->dn_bonustype = dnp->dn_bonustype;
	dn->dn_bonuslen = dnp->dn_bonuslen;
	dn->dn_maxblkid = dnp->dn_maxblkid;

	dmu_zfetch_init(&dn->dn_zfetch, dn);

	ASSERT(dn->dn_phys->dn_type < DMU_OT_NUMTYPES);
	mutex_enter(&os->os_lock);
	list_insert_head(&os->os_dnodes, dn);
	mutex_exit(&os->os_lock);

	arc_space_consume(sizeof (dnode_t), ARC_SPACE_OTHER);
	return (dn);
	}

	static void
	dnode_destroy(dnode_t *dn)
	{
	objset_impl_t *os = dn->dn_objset;

	#ifdef ZFS_DEBUG
	int i;

	for (i = 0; i < TXG_SIZE; i++) {
	ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
	ASSERT(NULL == list_head(&dn->dn_dirty_records[i]));
	ASSERT(0 == avl_numnodes(&dn->dn_ranges[i]));
	}
	ASSERT(NULL == list_head(&dn->dn_dbufs));
	#endif

	mutex_enter(&os->os_lock);
	list_remove(&os->os_dnodes, dn);
	mutex_exit(&os->os_lock);

	if (dn->dn_dirtyctx_firstset) {
	kmem_free(dn->dn_dirtyctx_firstset, 1);
	dn->dn_dirtyctx_firstset = NULL;
	}
	dmu_zfetch_rele(&dn->dn_zfetch);
	if (dn->dn_bonus) {
	mutex_enter(&dn->dn_bonus->db_mtx);
	dbuf_evict(dn->dn_bonus);
	dn->dn_bonus = NULL;
	}
	kmem_cache_free(dnode_cache, dn);
	arc_space_return(sizeof (dnode_t), ARC_SPACE_OTHER);
	}

	void
	dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
	dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
	{
	int i;

	if (blocksize == 0)
	blocksize = 1 << zfs_default_bs;
	else if (blocksize > SPA_MAXBLOCKSIZE)
	blocksize = SPA_MAXBLOCKSIZE;
	else
	blocksize = P2ROUNDUP(blocksize, SPA_MINBLOCKSIZE);

	if (ibs == 0)
	ibs = zfs_default_ibs;

	ibs = MIN(MAX(ibs, DN_MIN_INDBLKSHIFT), DN_MAX_INDBLKSHIFT);

	dprintf("os=%p obj=%llu txg=%llu blocksize=%d ibs=%d\n", dn->dn_objset,
	dn->dn_object, tx->tx_txg, blocksize, ibs);

	ASSERT(dn->dn_type == DMU_OT_NONE);
	ASSERT(bcmp(dn->dn_phys, &dnode_phys_zero, sizeof (dnode_phys_t)) == 0);
	ASSERT(dn->dn_phys->dn_type == DMU_OT_NONE);
	ASSERT(ot != DMU_OT_NONE);
	ASSERT3U(ot, <, DMU_OT_NUMTYPES);
	ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
	(bonustype != DMU_OT_NONE && bonuslen != 0));
	ASSERT3U(bonustype, <, DMU_OT_NUMTYPES);
	ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);
	ASSERT(dn->dn_type == DMU_OT_NONE);
	ASSERT3U(dn->dn_maxblkid, ==, 0);
	ASSERT3U(dn->dn_allocated_txg, ==, 0);
	ASSERT3U(dn->dn_assigned_txg, ==, 0);
	ASSERT(refcount_is_zero(&dn->dn_tx_holds));
	ASSERT3U(refcount_count(&dn->dn_holds), <=, 1);
	ASSERT3P(list_head(&dn->dn_dbufs), ==, NULL);

	for (i = 0; i < TXG_SIZE; i++) {
	ASSERT3U(dn->dn_next_nlevels[i], ==, 0);
	ASSERT3U(dn->dn_next_indblkshift[i], ==, 0);
	ASSERT3U(dn->dn_next_bonuslen[i], ==, 0);
	ASSERT3U(dn->dn_next_blksz[i], ==, 0);
	ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
	ASSERT3P(list_head(&dn->dn_dirty_records[i]), ==, NULL);
	ASSERT3U(avl_numnodes(&dn->dn_ranges[i]), ==, 0);
	}

	dn->dn_type = ot;
	dnode_setdblksz(dn, blocksize);
	dn->dn_indblkshift = ibs;
	dn->dn_nlevels = 1;
	dn->dn_nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
	dn->dn_bonustype = bonustype;
	dn->dn_bonuslen = bonuslen;
	dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
	dn->dn_compress = ZIO_COMPRESS_INHERIT;
	dn->dn_dirtyctx = 0;

	dn->dn_free_txg = 0;
	if (dn->dn_dirtyctx_firstset) {
	kmem_free(dn->dn_dirtyctx_firstset, 1);
	dn->dn_dirtyctx_firstset = NULL;
	}

	dn->dn_allocated_txg = tx->tx_txg;

	dnode_setdirty(dn, tx);
	dn->dn_next_indblkshift[tx->tx_txg & TXG_MASK] = ibs;
	dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
	dn->dn_next_blksz[tx->tx_txg & TXG_MASK] = dn->dn_datablksz;
	}

	void
	dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
	dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
	{
	int nblkptr;

	ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE);
	ASSERT3U(blocksize, <=, SPA_MAXBLOCKSIZE);
	ASSERT3U(blocksize % SPA_MINBLOCKSIZE, ==, 0);
	ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx));
	ASSERT(tx->tx_txg != 0);
	ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
	(bonustype != DMU_OT_NONE && bonuslen != 0));
	ASSERT3U(bonustype, <, DMU_OT_NUMTYPES);
	ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);

	/* clean up any unreferenced dbufs */
	dnode_evict_dbufs(dn);

	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
	dnode_setdirty(dn, tx);
	if (dn->dn_datablksz != blocksize) {
	/* change blocksize */
	ASSERT(dn->dn_maxblkid == 0 &&
	(BP_IS_HOLE(&dn->dn_phys->dn_blkptr[0]) ||
	dnode_block_freed(dn, 0)));
	dnode_setdblksz(dn, blocksize);
	dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = blocksize;
	}
	if (dn->dn_bonuslen != bonuslen)
	dn->dn_next_bonuslen[tx->tx_txg&TXG_MASK] = bonuslen;
	nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
	if (dn->dn_nblkptr != nblkptr)
	dn->dn_next_nblkptr[tx->tx_txg&TXG_MASK] = nblkptr;
	rw_exit(&dn->dn_struct_rwlock);

	/* change type */
	dn->dn_type = ot;

	/* change bonus size and type */
	mutex_enter(&dn->dn_mtx);
	dn->dn_bonustype = bonustype;
	dn->dn_bonuslen = bonuslen;
	dn->dn_nblkptr = nblkptr;
	dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
	dn->dn_compress = ZIO_COMPRESS_INHERIT;
	ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);

	/* fix up the bonus db_size */
	if (dn->dn_bonus) {
	dn->dn_bonus->db.db_size =
	DN_MAX_BONUSLEN - (dn->dn_nblkptr-1) * sizeof (blkptr_t);
	ASSERT(dn->dn_bonuslen <= dn->dn_bonus->db.db_size);
	}

	dn->dn_allocated_txg = tx->tx_txg;
	mutex_exit(&dn->dn_mtx);
	}

	void
	dnode_special_close(dnode_t *dn)
	{
	/*
	* Wait for final references to the dnode to clear. This can
	* only happen if the arc is asynchronously evicting state that
	* has a hold on this dnode while we are trying to evict this
	* dnode.
	*/
	while (refcount_count(&dn->dn_holds) > 0)
	delay(1);
	dnode_destroy(dn);
	}

	dnode_t *
	dnode_special_open(objset_impl_t *os, dnode_phys_t *dnp, uint64_t object)
	{
	dnode_t *dn = dnode_create(os, dnp, NULL, object);
	DNODE_VERIFY(dn);
	return (dn);
	}

	static void
	dnode_buf_pageout(dmu_buf_t *db, void *arg)
	{
	dnode_t **children_dnodes = arg;
	int i;
	int epb = db->db_size >> DNODE_SHIFT;

	for (i = 0; i < epb; i++) {
	dnode_t *dn = children_dnodes[i];
	int n;

	if (dn == NULL)
	continue;
	#ifdef ZFS_DEBUG
	/*
	* If there are holds on this dnode, then there should
	* be holds on the dnode's containing dbuf as well; thus
	* it wouldn't be eligible for eviction and this function
	* would not have been called.
	*/
	ASSERT(refcount_is_zero(&dn->dn_holds));
	ASSERT(list_head(&dn->dn_dbufs) == NULL);
	ASSERT(refcount_is_zero(&dn->dn_tx_holds));

	for (n = 0; n < TXG_SIZE; n++)
	ASSERT(!list_link_active(&dn->dn_dirty_link[n]));
	#endif
	children_dnodes[i] = NULL;
	dnode_destroy(dn);
	}
	kmem_free(children_dnodes, epb * sizeof (dnode_t *));
	}

	/*
	* errors:
	* EINVAL - invalid object number.
	* EIO - i/o error.
	* succeeds even for free dnodes.
	*/
	int
	dnode_hold_impl(objset_impl_t *os, uint64_t object, int flag,
	void *tag, dnode_t **dnp)
	{
	int epb, idx, err;
	int drop_struct_lock = FALSE;
	int type;
	uint64_t blk;
	dnode_t *mdn, *dn;
	dmu_buf_impl_t *db;
	dnode_t **children_dnodes;

	/*
	* If you are holding the spa config lock as writer, you shouldn't
	* be asking the DMU to do anything.
	*/
	ASSERT(spa_config_held(os->os_spa, SCL_ALL, RW_WRITER) == 0);

	if (object == 0 || object >= DN_MAX_OBJECT)
	return (EINVAL);

	mdn = os->os_meta_dnode;

	DNODE_VERIFY(mdn);

	if (!RW_WRITE_HELD(&mdn->dn_struct_rwlock)) {
	rw_enter(&mdn->dn_struct_rwlock, RW_READER);
	drop_struct_lock = TRUE;
	}

	blk = dbuf_whichblock(mdn, object * sizeof (dnode_phys_t));

	db = dbuf_hold(mdn, blk, FTAG);
	if (drop_struct_lock)
	rw_exit(&mdn->dn_struct_rwlock);
	if (db == NULL)
	return (EIO);
	err = dbuf_read(db, NULL, DB_RF_CANFAIL);
	if (err) {
	dbuf_rele(db, FTAG);
	return (err);
	}

	ASSERT3U(db->db.db_size, >=, 1<<DNODE_SHIFT);
	epb = db->db.db_size >> DNODE_SHIFT;

	idx = object & (epb-1);

	children_dnodes = dmu_buf_get_user(&db->db);
	if (children_dnodes == NULL) {
	dnode_t **winner;
	children_dnodes = kmem_zalloc(epb * sizeof (dnode_t *),
	KM_SLEEP);
	if (winner = dmu_buf_set_user(&db->db, children_dnodes, NULL,
	dnode_buf_pageout)) {
	kmem_free(children_dnodes, epb * sizeof (dnode_t *));
	children_dnodes = winner;
	}
	}

	if ((dn = children_dnodes[idx]) == NULL) {
	dnode_phys_t *dnp = (dnode_phys_t *)db->db.db_data+idx;
	dnode_t *winner;

	dn = dnode_create(os, dnp, db, object);
	winner = atomic_cas_ptr(&children_dnodes[idx], NULL, dn);
	if (winner != NULL) {
	dnode_destroy(dn);
	dn = winner;
	}
	}

	mutex_enter(&dn->dn_mtx);
	type = dn->dn_type;
	if (dn->dn_free_txg ||
	((flag & DNODE_MUST_BE_ALLOCATED) && type == DMU_OT_NONE) ||
	((flag & DNODE_MUST_BE_FREE) && type != DMU_OT_NONE)) {
	mutex_exit(&dn->dn_mtx);
	dbuf_rele(db, FTAG);
	return (type == DMU_OT_NONE ? ENOENT : EEXIST);
	}
	mutex_exit(&dn->dn_mtx);

	if (refcount_add(&dn->dn_holds, tag) == 1)
	dbuf_add_ref(db, dn);

	DNODE_VERIFY(dn);
	ASSERT3P(dn->dn_dbuf, ==, db);
	ASSERT3U(dn->dn_object, ==, object);
	dbuf_rele(db, FTAG);

	*dnp = dn;
	return (0);
	}

	/*
	* Return held dnode if the object is allocated, NULL if not.
	*/
	int
	dnode_hold(objset_impl_t *os, uint64_t object, void *tag, dnode_t **dnp)
	{
	return (dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED, tag, dnp));
	}

	/*
	* Can only add a reference if there is already at least one
	* reference on the dnode. Returns FALSE if unable to add a
	* new reference.
	*/
	boolean_t
	dnode_add_ref(dnode_t *dn, void *tag)
	{
	mutex_enter(&dn->dn_mtx);
	if (refcount_is_zero(&dn->dn_holds)) {
	mutex_exit(&dn->dn_mtx);
	return (FALSE);
	}
	VERIFY(1 < refcount_add(&dn->dn_holds, tag));
	mutex_exit(&dn->dn_mtx);
	return (TRUE);
	}

	void
	dnode_rele(dnode_t *dn, void *tag)
	{
	uint64_t refs;

	mutex_enter(&dn->dn_mtx);
	refs = refcount_remove(&dn->dn_holds, tag);
	mutex_exit(&dn->dn_mtx);
	/* NOTE: the DNODE_DNODE does not have a dn_dbuf */
	if (refs == 0 && dn->dn_dbuf)
	dbuf_rele(dn->dn_dbuf, dn);
	}

	void
	dnode_setdirty(dnode_t *dn, dmu_tx_t *tx)
	{
	objset_impl_t *os = dn->dn_objset;
	uint64_t txg = tx->tx_txg;

	if (dn->dn_object == DMU_META_DNODE_OBJECT)
	return;

	DNODE_VERIFY(dn);

	#ifdef ZFS_DEBUG
	mutex_enter(&dn->dn_mtx);
	ASSERT(dn->dn_phys->dn_type || dn->dn_allocated_txg);
	/* ASSERT(dn->dn_free_txg == 0 || dn->dn_free_txg >= txg); */
	mutex_exit(&dn->dn_mtx);
	#endif

	mutex_enter(&os->os_lock);

	/*
	* If we are already marked dirty, we're done.
	*/
	if (list_link_active(&dn->dn_dirty_link[txg & TXG_MASK])) {
	mutex_exit(&os->os_lock);
	return;
	}

	ASSERT(!refcount_is_zero(&dn->dn_holds) || list_head(&dn->dn_dbufs));
	ASSERT(dn->dn_datablksz != 0);
	ASSERT3U(dn->dn_next_bonuslen[txg&TXG_MASK], ==, 0);
	ASSERT3U(dn->dn_next_blksz[txg&TXG_MASK], ==, 0);

	dprintf_ds(os->os_dsl_dataset, "obj=%llu txg=%llu\n",
	dn->dn_object, txg);

	if (dn->dn_free_txg > 0 && dn->dn_free_txg <= txg) {
	list_insert_tail(&os->os_free_dnodes[txg&TXG_MASK], dn);
	} else {
	list_insert_tail(&os->os_dirty_dnodes[txg&TXG_MASK], dn);
	}

	mutex_exit(&os->os_lock);

	/*
	* The dnode maintains a hold on its containing dbuf as
	* long as there are holds on it. Each instantiated child
	* dbuf maintains a hold on the dnode. When the last child
	* drops its hold, the dnode will drop its hold on the
	* containing dbuf. We add a "dirty hold" here so that the
	* dnode will hang around after we finish processing its
	* children.
	*/
	VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg));

	(void) dbuf_dirty(dn->dn_dbuf, tx);

	dsl_dataset_dirty(os->os_dsl_dataset, tx);
	}

	void
	dnode_free(dnode_t *dn, dmu_tx_t *tx)
	{
	int txgoff = tx->tx_txg & TXG_MASK;

	dprintf("dn=%p txg=%llu\n", dn, tx->tx_txg);

	/* we should be the only holder... hopefully */
	/* ASSERT3U(refcount_count(&dn->dn_holds), ==, 1); */

	mutex_enter(&dn->dn_mtx);
	if (dn->dn_type == DMU_OT_NONE || dn->dn_free_txg) {
	mutex_exit(&dn->dn_mtx);
	return;
	}
	dn->dn_free_txg = tx->tx_txg;
	mutex_exit(&dn->dn_mtx);

	/*
	* If the dnode is already dirty, it needs to be moved from
	* the dirty list to the free list.
	*/
	mutex_enter(&dn->dn_objset->os_lock);
	if (list_link_active(&dn->dn_dirty_link[txgoff])) {
	list_remove(&dn->dn_objset->os_dirty_dnodes[txgoff], dn);
	list_insert_tail(&dn->dn_objset->os_free_dnodes[txgoff], dn);
	mutex_exit(&dn->dn_objset->os_lock);
	} else {
	mutex_exit(&dn->dn_objset->os_lock);
	dnode_setdirty(dn, tx);
	}
	}

	/*
	* Try to change the block size for the indicated dnode. This can only
	* succeed if there are no blocks allocated or dirty beyond first block
	*/
	int
	dnode_set_blksz(dnode_t *dn, uint64_t size, int ibs, dmu_tx_t *tx)
	{
	dmu_buf_impl_t *db, *db_next;
	int err;

	if (size == 0)
	size = SPA_MINBLOCKSIZE;
	if (size > SPA_MAXBLOCKSIZE)
	size = SPA_MAXBLOCKSIZE;
	else
	size = P2ROUNDUP(size, SPA_MINBLOCKSIZE);

	if (ibs == dn->dn_indblkshift)
	ibs = 0;

	if (size >> SPA_MINBLOCKSHIFT == dn->dn_datablkszsec && ibs == 0)
	return (0);

	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);

	/* Check for any allocated blocks beyond the first */
	if (dn->dn_phys->dn_maxblkid != 0)
	goto fail;

	mutex_enter(&dn->dn_dbufs_mtx);
	for (db = list_head(&dn->dn_dbufs); db; db = db_next) {
	db_next = list_next(&dn->dn_dbufs, db);

	if (db->db_blkid != 0 && db->db_blkid != DB_BONUS_BLKID) {
	mutex_exit(&dn->dn_dbufs_mtx);
	goto fail;
	}
	}
	mutex_exit(&dn->dn_dbufs_mtx);

	if (ibs && dn->dn_nlevels != 1)
	goto fail;

	/* resize the old block */
	err = dbuf_hold_impl(dn, 0, 0, TRUE, FTAG, &db);
	if (err == 0)
	dbuf_new_size(db, size, tx);
	else if (err != ENOENT)
	goto fail;

	dnode_setdblksz(dn, size);
	dnode_setdirty(dn, tx);
	dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = size;
	if (ibs) {
	dn->dn_indblkshift = ibs;
	dn->dn_next_indblkshift[tx->tx_txg&TXG_MASK] = ibs;
	}
	/* rele after we have fixed the blocksize in the dnode */
	if (db)
	dbuf_rele(db, FTAG);

	rw_exit(&dn->dn_struct_rwlock);
	return (0);

	fail:
	rw_exit(&dn->dn_struct_rwlock);
	return (ENOTSUP);
	}

	/* read-holding callers must not rely on the lock being continuously held */
	void
	dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx, boolean_t have_read)
	{
	uint64_t txgoff = tx->tx_txg & TXG_MASK;
	int epbs, new_nlevels;
	uint64_t sz;

	ASSERT(blkid != DB_BONUS_BLKID);

	ASSERT(have_read ?
	RW_READ_HELD(&dn->dn_struct_rwlock) :
	RW_WRITE_HELD(&dn->dn_struct_rwlock));

	/*
	* if we have a read-lock, check to see if we need to do any work
	* before upgrading to a write-lock.
	*/
	if (have_read) {
	if (blkid <= dn->dn_maxblkid)
	return;

	if (!rw_tryupgrade(&dn->dn_struct_rwlock)) {
	rw_exit(&dn->dn_struct_rwlock);
	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
	}
	}

	if (blkid <= dn->dn_maxblkid)
	goto out;

	dn->dn_maxblkid = blkid;

	/*
	* Compute the number of levels necessary to support the new maxblkid.
	*/
	new_nlevels = 1;
	epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
	for (sz = dn->dn_nblkptr;
	sz <= blkid && sz >= dn->dn_nblkptr; sz <<= epbs)
	new_nlevels++;

	if (new_nlevels > dn->dn_nlevels) {
	int old_nlevels = dn->dn_nlevels;
	dmu_buf_impl_t *db;
	list_t *list;
	dbuf_dirty_record_t *new, *dr, *dr_next;

	dn->dn_nlevels = new_nlevels;

	ASSERT3U(new_nlevels, >, dn->dn_next_nlevels[txgoff]);
	dn->dn_next_nlevels[txgoff] = new_nlevels;

	/* dirty the left indirects */
	db = dbuf_hold_level(dn, old_nlevels, 0, FTAG);
	new = dbuf_dirty(db, tx);
	dbuf_rele(db, FTAG);

	/* transfer the dirty records to the new indirect */
	mutex_enter(&dn->dn_mtx);
	mutex_enter(&new->dt.di.dr_mtx);
	list = &dn->dn_dirty_records[txgoff];
	for (dr = list_head(list); dr; dr = dr_next) {
	dr_next = list_next(&dn->dn_dirty_records[txgoff], dr);
	if (dr->dr_dbuf->db_level != new_nlevels-1 &&
	dr->dr_dbuf->db_blkid != DB_BONUS_BLKID) {
	ASSERT(dr->dr_dbuf->db_level == old_nlevels-1);
	list_remove(&dn->dn_dirty_records[txgoff], dr);
	list_insert_tail(&new->dt.di.dr_children, dr);
	dr->dr_parent = new;
	}
	}
	mutex_exit(&new->dt.di.dr_mtx);
	mutex_exit(&dn->dn_mtx);
	}

	out:
	if (have_read)
	rw_downgrade(&dn->dn_struct_rwlock);
	}

	void
	dnode_clear_range(dnode_t *dn, uint64_t blkid, uint64_t nblks, dmu_tx_t *tx)
	{
	avl_tree_t *tree = &dn->dn_ranges[tx->tx_txg&TXG_MASK];
	avl_index_t where;
	free_range_t *rp;
	free_range_t rp_tofind;
	uint64_t endblk = blkid + nblks;

	ASSERT(MUTEX_HELD(&dn->dn_mtx));
	ASSERT(nblks <= UINT64_MAX - blkid); /* no overflow */

	dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n",
	blkid, nblks, tx->tx_txg);
	rp_tofind.fr_blkid = blkid;
	rp = avl_find(tree, &rp_tofind, &where);
	if (rp == NULL)
	rp = avl_nearest(tree, where, AVL_BEFORE);
	if (rp == NULL)
	rp = avl_nearest(tree, where, AVL_AFTER);

	while (rp && (rp->fr_blkid <= blkid + nblks)) {
	uint64_t fr_endblk = rp->fr_blkid + rp->fr_nblks;
	free_range_t *nrp = AVL_NEXT(tree, rp);

	if (blkid <= rp->fr_blkid && endblk >= fr_endblk) {
	/* clear this entire range */
	avl_remove(tree, rp);
	kmem_free(rp, sizeof (free_range_t));
	} else if (blkid <= rp->fr_blkid &&
	endblk > rp->fr_blkid && endblk < fr_endblk) {
	/* clear the beginning of this range */
	rp->fr_blkid = endblk;
	rp->fr_nblks = fr_endblk - endblk;
	} else if (blkid > rp->fr_blkid && blkid < fr_endblk &&
	endblk >= fr_endblk) {
	/* clear the end of this range */
	rp->fr_nblks = blkid - rp->fr_blkid;
	} else if (blkid > rp->fr_blkid && endblk < fr_endblk) {
	/* clear a chunk out of this range */
	free_range_t *new_rp =
	kmem_alloc(sizeof (free_range_t), KM_SLEEP);

	new_rp->fr_blkid = endblk;
	new_rp->fr_nblks = fr_endblk - endblk;
	avl_insert_here(tree, new_rp, rp, AVL_AFTER);
	rp->fr_nblks = blkid - rp->fr_blkid;
	}
	/* there may be no overlap */
	rp = nrp;
	}
	}

	void
	dnode_free_range(dnode_t *dn, uint64_t off, uint64_t len, dmu_tx_t *tx)
	{
	dmu_buf_impl_t *db;
	uint64_t blkoff, blkid, nblks;
	int blksz, blkshift, head, tail;
	int trunc = FALSE;
	int epbs;

	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
	blksz = dn->dn_datablksz;
	blkshift = dn->dn_datablkshift;
	epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;

	if (len == -1ULL) {
	len = UINT64_MAX - off;
	trunc = TRUE;
	}

	/*
	* First, block align the region to free:
	*/
	if (ISP2(blksz)) {
	head = P2NPHASE(off, blksz);
	blkoff = P2PHASE(off, blksz);
	if ((off >> blkshift) > dn->dn_maxblkid)
	goto out;
	} else {
	ASSERT(dn->dn_maxblkid == 0);
	if (off == 0 && len >= blksz) {
	/* Freeing the whole block; fast-track this request */
	blkid = 0;
	nblks = 1;
	goto done;
	} else if (off >= blksz) {
	/* Freeing past end-of-data */
	goto out;
	} else {
	/* Freeing part of the block. */
	head = blksz - off;
	ASSERT3U(head, >, 0);
	}
	blkoff = off;
	}
	/* zero out any partial block data at the start of the range */
	if (head) {
	ASSERT3U(blkoff + head, ==, blksz);
	if (len < head)
	head = len;
	if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, off), TRUE,
	FTAG, &db) == 0) {
	caddr_t data;

	/* don't dirty if it isn't on disk and isn't dirty */
	if (db->db_last_dirty ||
	(db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
	rw_exit(&dn->dn_struct_rwlock);
	dbuf_will_dirty(db, tx);
	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
	data = db->db.db_data;
	bzero(data + blkoff, head);
	}
	dbuf_rele(db, FTAG);
	}
	off += head;
	len -= head;
	}

	/* If the range was less than one block, we're done */
	if (len == 0)
	goto out;

	/* If the remaining range is past end of file, we're done */
	if ((off >> blkshift) > dn->dn_maxblkid)
	goto out;

	ASSERT(ISP2(blksz));
	if (trunc)
	tail = 0;
	else
	tail = P2PHASE(len, blksz);

	ASSERT3U(P2PHASE(off, blksz), ==, 0);
	/* zero out any partial block data at the end of the range */
	if (tail) {
	if (len < tail)
	tail = len;
	if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, off+len),
	TRUE, FTAG, &db) == 0) {
	/* don't dirty if not on disk and not dirty */
	if (db->db_last_dirty ||
	(db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
	rw_exit(&dn->dn_struct_rwlock);
	dbuf_will_dirty(db, tx);
	rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
	bzero(db->db.db_data, tail);
	}
	dbuf_rele(db, FTAG);
	}
	len -= tail;
	}

	/* If the range did not include a full block, we are done */
	if (len == 0)
	goto out;

	ASSERT(IS_P2ALIGNED(off, blksz));
	ASSERT(trunc || IS_P2ALIGNED(len, blksz));
	blkid = off >> blkshift;
	nblks = len >> blkshift;
	if (trunc)
	nblks += 1;

	/*
	* Read in and mark all the level-1 indirects dirty,
	* so that they will stay in memory until syncing phase.
	* Always dirty the first and last indirect to make sure
	* we dirty all the partial indirects.
	*/
	if (dn->dn_nlevels > 1) {
	uint64_t i, first, last;
	int shift = epbs + dn->dn_datablkshift;

	first = blkid >> epbs;
	if (db = dbuf_hold_level(dn, 1, first, FTAG)) {
	dbuf_will_dirty(db, tx);
	dbuf_rele(db, FTAG);
	}
	if (trunc)
	last = dn->dn_maxblkid >> epbs;
	else
	last = (blkid + nblks - 1) >> epbs;
	if (last > first && (db = dbuf_hold_level(dn, 1, last, FTAG))) {
	dbuf_will_dirty(db, tx);
	dbuf_rele(db, FTAG);
	}
	for (i = first + 1; i < last; i++) {
	uint64_t ibyte = i << shift;
	int err;

	err = dnode_next_offset(dn,
	DNODE_FIND_HAVELOCK, &ibyte, 1, 1, 0);
	i = ibyte >> shift;
	if (err == ESRCH || i >= last)
	break;
	ASSERT(err == 0);
	db = dbuf_hold_level(dn, 1, i, FTAG);
	if (db) {
	dbuf_will_dirty(db, tx);
	dbuf_rele(db, FTAG);
	}
	}
	}
	done:
	/*
	* Add this range to the dnode range list.
	* We will finish up this free operation in the syncing phase.
	*/
	mutex_enter(&dn->dn_mtx);
	dnode_clear_range(dn, blkid, nblks, tx);
	{
	free_range_t *rp, *found;
	avl_index_t where;
	avl_tree_t *tree = &dn->dn_ranges[tx->tx_txg&TXG_MASK];

	/* Add new range to dn_ranges */
	rp = kmem_alloc(sizeof (free_range_t), KM_SLEEP);
	rp->fr_blkid = blkid;
	rp->fr_nblks = nblks;
	found = avl_find(tree, rp, &where);
	ASSERT(found == NULL);
	avl_insert(tree, rp, where);
	dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n",
	blkid, nblks, tx->tx_txg);
	}
	mutex_exit(&dn->dn_mtx);

	dbuf_free_range(dn, blkid, blkid + nblks - 1, tx);
	dnode_setdirty(dn, tx);
	out:
	if (trunc && dn->dn_maxblkid >= (off >> blkshift))
	dn->dn_maxblkid = (off >> blkshift ? (off >> blkshift) - 1 : 0);

	rw_exit(&dn->dn_struct_rwlock);
	}

	/* return TRUE if this blkid was freed in a recent txg, or FALSE if it wasn't */
	uint64_t
	dnode_block_freed(dnode_t *dn, uint64_t blkid)
	{
	free_range_t range_tofind;
	void *dp = spa_get_dsl(dn->dn_objset->os_spa);
	int i;

	if (blkid == DB_BONUS_BLKID)
	return (FALSE);

	/*
	* If we're in the process of opening the pool, dp will not be
	* set yet, but there shouldn't be anything dirty.
	*/
	if (dp == NULL)
	return (FALSE);

	if (dn->dn_free_txg)
	return (TRUE);

	range_tofind.fr_blkid = blkid;
	mutex_enter(&dn->dn_mtx);
	for (i = 0; i < TXG_SIZE; i++) {
	free_range_t *range_found;
	avl_index_t idx;

	range_found = avl_find(&dn->dn_ranges[i], &range_tofind, &idx);
	if (range_found) {
	ASSERT(range_found->fr_nblks > 0);
	break;
	}
	range_found = avl_nearest(&dn->dn_ranges[i], idx, AVL_BEFORE);
	if (range_found &&
	range_found->fr_blkid + range_found->fr_nblks > blkid)
	break;
	}
	mutex_exit(&dn->dn_mtx);
	return (i < TXG_SIZE);
	}

	/* call from syncing context when we actually write/free space for this dnode */
	void
	dnode_diduse_space(dnode_t *dn, int64_t delta)
	{
	uint64_t space;
	dprintf_dnode(dn, "dn=%p dnp=%p used=%llu delta=%lld\n",
	dn, dn->dn_phys,
	(u_longlong_t)dn->dn_phys->dn_used,
	(longlong_t)delta);

	mutex_enter(&dn->dn_mtx);
	space = DN_USED_BYTES(dn->dn_phys);
	if (delta > 0) {
	ASSERT3U(space + delta, >=, space); /* no overflow */
	} else {
	ASSERT3U(space, >=, -delta); /* no underflow */
	}
	space += delta;
	if (spa_version(dn->dn_objset->os_spa) < SPA_VERSION_DNODE_BYTES) {
	ASSERT((dn->dn_phys->dn_flags & DNODE_FLAG_USED_BYTES) == 0);
	ASSERT3U(P2PHASE(space, 1<<DEV_BSHIFT), ==, 0);
	dn->dn_phys->dn_used = space >> DEV_BSHIFT;
	} else {
	dn->dn_phys->dn_used = space;
	dn->dn_phys->dn_flags |= DNODE_FLAG_USED_BYTES;
	}
	mutex_exit(&dn->dn_mtx);
	}

	/*
	* Call when we think we're going to write/free space in open context.
	* Be conservative (ie. OK to write less than this or free more than
	* this, but don't write more or free less).
	*/
	void
	dnode_willuse_space(dnode_t *dn, int64_t space, dmu_tx_t *tx)
	{
	objset_impl_t *os = dn->dn_objset;
	dsl_dataset_t *ds = os->os_dsl_dataset;

	if (space > 0)
	space = spa_get_asize(os->os_spa, space);

	if (ds)
	dsl_dir_willuse_space(ds->ds_dir, space, tx);

	dmu_tx_willuse_space(tx, space);
	}

	/*
	* This function scans a block at the indicated "level" looking for
	* a hole or data (depending on 'flags'). If level > 0, then we are
	* scanning an indirect block looking at its pointers. If level == 0,
	* then we are looking at a block of dnodes. If we don't find what we
	* are looking for in the block, we return ESRCH. Otherwise, return
	* with *offset pointing to the beginning (if searching forwards) or
	* end (if searching backwards) of the range covered by the block
	* pointer we matched on (or dnode).
	*
	* The basic search algorithm used below by dnode_next_offset() is to
	* use this function to search up the block tree (widen the search) until
	* we find something (i.e., we don't return ESRCH) and then search back
	* down the tree (narrow the search) until we reach our original search
	* level.
	*/
	static int
	dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
	int lvl, uint64_t blkfill, uint64_t txg)
	{
	dmu_buf_impl_t *db = NULL;
	void *data = NULL;
	uint64_t epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
	uint64_t epb = 1ULL << epbs;
	uint64_t minfill, maxfill;
	boolean_t hole;
	int i, inc, error, span;

	dprintf("probing object %llu offset %llx level %d of %u\n",
	dn->dn_object, *offset, lvl, dn->dn_phys->dn_nlevels);

	hole = flags & DNODE_FIND_HOLE;
	inc = (flags & DNODE_FIND_BACKWARDS) ? -1 : 1;
	ASSERT(txg == 0 || !hole);

	if (lvl == dn->dn_phys->dn_nlevels) {
	error = 0;
	epb = dn->dn_phys->dn_nblkptr;
	data = dn->dn_phys->dn_blkptr;
	} else {
	uint64_t blkid = dbuf_whichblock(dn, *offset) >> (epbs * lvl);
	error = dbuf_hold_impl(dn, lvl, blkid, TRUE, FTAG, &db);
	if (error) {
	if (error != ENOENT)
	return (error);
	if (hole)
	return (0);
	/*
	* This can only happen when we are searching up
	* the block tree for data. We don't really need to
	* adjust the offset, as we will just end up looking
	* at the pointer to this block in its parent, and its
	* going to be unallocated, so we will skip over it.
	*/
	return (ESRCH);
	}
	error = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_HAVESTRUCT);
	if (error) {
	dbuf_rele(db, FTAG);
	return (error);
	}
	data = db->db.db_data;
	}

	if (db && txg &&
	(db->db_blkptr == NULL || db->db_blkptr->blk_birth <= txg)) {
	/*
	* This can only happen when we are searching up the tree
	* and these conditions mean that we need to keep climbing.
	*/
	error = ESRCH;
	} else if (lvl == 0) {
	dnode_phys_t *dnp = data;
	span = DNODE_SHIFT;
	ASSERT(dn->dn_type == DMU_OT_DNODE);

	for (i = (*offset >> span) & (blkfill - 1);
	i >= 0 && i < blkfill; i += inc) {
	- boolean_t newcontents = B_TRUE;
	- if (txg) {
	- int j;
	- newcontents = B_FALSE;
	- for (j = 0; j < dnp[i].dn_nblkptr; j++) {
	- if (dnp[i].dn_blkptr[j].blk_birth > txg)
	- newcontents = B_TRUE;
	- }
	- }
	- if (!dnp[i].dn_type == hole && newcontents)
	+ if ((dnp[i].dn_type == DMU_OT_NONE) == hole)
	break;
	*offset += (1ULL << span) * inc;
	}
	if (i < 0 || i == blkfill)
	error = ESRCH;
	} else {
	blkptr_t *bp = data;
	uint64_t start = *offset;
	span = (lvl - 1) * epbs + dn->dn_datablkshift;
	minfill = 0;
	maxfill = blkfill << ((lvl - 1) * epbs);

	if (hole)
	maxfill--;
	else
	minfill++;

	*offset = *offset >> span;
	for (i = BF64_GET(*offset, 0, epbs);
	i >= 0 && i < epb; i += inc) {
	if (bp[i].blk_fill >= minfill &&
	bp[i].blk_fill <= maxfill &&
	(hole || bp[i].blk_birth > txg))
	break;
	if (inc > 0 || *offset > 0)
	*offset += inc;
	}
	*offset = *offset << span;
	if (inc < 0) {
	/* traversing backwards; position offset at the end */
	ASSERT3U(*offset, <=, start);
	*offset = MIN(*offset + (1ULL << span) - 1, start);
	} else if (*offset < start) {
	*offset = start;
	}
	if (i < 0 || i >= epb)
	error = ESRCH;
	}

	if (db)
	dbuf_rele(db, FTAG);

	return (error);
	}

	/*
	* Find the next hole, data, or sparse region at or after *offset.
	* The value 'blkfill' tells us how many items we expect to find
	* in an L0 data block; this value is 1 for normal objects,
	* DNODES_PER_BLOCK for the meta dnode, and some fraction of
	* DNODES_PER_BLOCK when searching for sparse regions thereof.
	*
	* Examples:
	*
	* dnode_next_offset(dn, flags, offset, 1, 1, 0);
	* Finds the next/previous hole/data in a file.
	* Used in dmu_offset_next().
	*
	* dnode_next_offset(mdn, flags, offset, 0, DNODES_PER_BLOCK, txg);
	* Finds the next free/allocated dnode an objset's meta-dnode.
	* Only finds objects that have new contents since txg (ie.
	* bonus buffer changes and content removal are ignored).
	* Used in dmu_object_next().
	*
	* dnode_next_offset(mdn, DNODE_FIND_HOLE, offset, 2, DNODES_PER_BLOCK >> 2, 0);
	* Finds the next L2 meta-dnode bp that's at most 1/4 full.
	* Used in dmu_object_alloc().
	*/
	int
	dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset,
	int minlvl, uint64_t blkfill, uint64_t txg)
	{
	uint64_t initial_offset = *offset;
	int lvl, maxlvl;
	int error = 0;

	if (!(flags & DNODE_FIND_HAVELOCK))
	rw_enter(&dn->dn_struct_rwlock, RW_READER);

	if (dn->dn_phys->dn_nlevels == 0) {
	error = ESRCH;
	goto out;
	}

	if (dn->dn_datablkshift == 0) {
	if (*offset < dn->dn_datablksz) {
	if (flags & DNODE_FIND_HOLE)
	*offset = dn->dn_datablksz;
	} else {
	error = ESRCH;
	}
	goto out;
	}

	maxlvl = dn->dn_phys->dn_nlevels;

	for (lvl = minlvl; lvl <= maxlvl; lvl++) {
	error = dnode_next_offset_level(dn,
	flags, offset, lvl, blkfill, txg);
	if (error != ESRCH)
	break;
	}

	while (error == 0 && --lvl >= minlvl) {
	error = dnode_next_offset_level(dn,
	flags, offset, lvl, blkfill, txg);
	}

	if (error == 0 && (flags & DNODE_FIND_BACKWARDS ?
	initial_offset < *offset : initial_offset > *offset))
	error = ESRCH;
	out:
	if (!(flags & DNODE_FIND_HAVELOCK))
	rw_exit(&dn->dn_struct_rwlock);

	return (error);
	}
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c (revision 209274)
	@@ -1,1331 +1,1330 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	#include <sys/dmu.h>
	#include <sys/dmu_objset.h>
	#include <sys/dmu_tx.h>
	#include <sys/dsl_dataset.h>
	#include <sys/dsl_dir.h>
	#include <sys/dsl_prop.h>
	#include <sys/dsl_synctask.h>
	#include <sys/dsl_deleg.h>
	#include <sys/spa.h>
	#include <sys/zap.h>
	#include <sys/zio.h>
	#include <sys/arc.h>
	#include <sys/sunddi.h>
	#include "zfs_namecheck.h"

	static uint64_t dsl_dir_space_towrite(dsl_dir_t *dd);
	static void dsl_dir_set_reservation_sync(void *arg1, void *arg2,
	cred_t *cr, dmu_tx_t *tx);


	/* ARGSUSED */
	static void
	dsl_dir_evict(dmu_buf_t *db, void *arg)
	{
	dsl_dir_t *dd = arg;
	dsl_pool_t *dp = dd->dd_pool;
	int t;

	for (t = 0; t < TXG_SIZE; t++) {
	ASSERT(!txg_list_member(&dp->dp_dirty_dirs, dd, t));
	ASSERT(dd->dd_tempreserved[t] == 0);
	ASSERT(dd->dd_space_towrite[t] == 0);
	}

	if (dd->dd_parent)
	dsl_dir_close(dd->dd_parent, dd);

	spa_close(dd->dd_pool->dp_spa, dd);

	/*
	* The props callback list should be empty since they hold the
	* dir open.
	*/
	list_destroy(&dd->dd_prop_cbs);
	mutex_destroy(&dd->dd_lock);
	kmem_free(dd, sizeof (dsl_dir_t));
	}

	int
	dsl_dir_open_obj(dsl_pool_t *dp, uint64_t ddobj,
	const char *tail, void *tag, dsl_dir_t **ddp)
	{
	dmu_buf_t *dbuf;
	dsl_dir_t *dd;
	int err;

	ASSERT(RW_LOCK_HELD(&dp->dp_config_rwlock) ||
	dsl_pool_sync_context(dp));

	err = dmu_bonus_hold(dp->dp_meta_objset, ddobj, tag, &dbuf);
	if (err)
	return (err);
	dd = dmu_buf_get_user(dbuf);
	#ifdef ZFS_DEBUG
	{
	dmu_object_info_t doi;
	dmu_object_info_from_db(dbuf, &doi);
	ASSERT3U(doi.doi_type, ==, DMU_OT_DSL_DIR);
	ASSERT3U(doi.doi_bonus_size, >=, sizeof (dsl_dir_phys_t));
	}
	#endif
	if (dd == NULL) {
	dsl_dir_t *winner;
	- int err;

	dd = kmem_zalloc(sizeof (dsl_dir_t), KM_SLEEP);
	dd->dd_object = ddobj;
	dd->dd_dbuf = dbuf;
	dd->dd_pool = dp;
	dd->dd_phys = dbuf->db_data;
	mutex_init(&dd->dd_lock, NULL, MUTEX_DEFAULT, NULL);

	list_create(&dd->dd_prop_cbs, sizeof (dsl_prop_cb_record_t),
	offsetof(dsl_prop_cb_record_t, cbr_node));

	if (dd->dd_phys->dd_parent_obj) {
	err = dsl_dir_open_obj(dp, dd->dd_phys->dd_parent_obj,
	NULL, dd, &dd->dd_parent);
	if (err)
	goto errout;
	if (tail) {
	#ifdef ZFS_DEBUG
	uint64_t foundobj;

	err = zap_lookup(dp->dp_meta_objset,
	dd->dd_parent->dd_phys->dd_child_dir_zapobj,
	tail, sizeof (foundobj), 1, &foundobj);
	ASSERT(err || foundobj == ddobj);
	#endif
	(void) strcpy(dd->dd_myname, tail);
	} else {
	err = zap_value_search(dp->dp_meta_objset,
	dd->dd_parent->dd_phys->dd_child_dir_zapobj,
	ddobj, 0, dd->dd_myname);
	}
	if (err)
	goto errout;
	} else {
	(void) strcpy(dd->dd_myname, spa_name(dp->dp_spa));
	}

	winner = dmu_buf_set_user_ie(dbuf, dd, &dd->dd_phys,
	dsl_dir_evict);
	if (winner) {
	if (dd->dd_parent)
	dsl_dir_close(dd->dd_parent, dd);
	mutex_destroy(&dd->dd_lock);
	kmem_free(dd, sizeof (dsl_dir_t));
	dd = winner;
	} else {
	spa_open_ref(dp->dp_spa, dd);
	}
	}

	/*
	* The dsl_dir_t has both open-to-close and instantiate-to-evict
	* holds on the spa. We need the open-to-close holds because
	* otherwise the spa_refcnt wouldn't change when we open a
	* dir which the spa also has open, so we could incorrectly
	* think it was OK to unload/export/destroy the pool. We need
	* the instantiate-to-evict hold because the dsl_dir_t has a
	* pointer to the dd_pool, which has a pointer to the spa_t.
	*/
	spa_open_ref(dp->dp_spa, tag);
	ASSERT3P(dd->dd_pool, ==, dp);
	ASSERT3U(dd->dd_object, ==, ddobj);
	ASSERT3P(dd->dd_dbuf, ==, dbuf);
	*ddp = dd;
	return (0);

	errout:
	if (dd->dd_parent)
	dsl_dir_close(dd->dd_parent, dd);
	mutex_destroy(&dd->dd_lock);
	kmem_free(dd, sizeof (dsl_dir_t));
	dmu_buf_rele(dbuf, tag);
	return (err);

	}

	void
	dsl_dir_close(dsl_dir_t *dd, void *tag)
	{
	dprintf_dd(dd, "%s\n", "");
	spa_close(dd->dd_pool->dp_spa, tag);
	dmu_buf_rele(dd->dd_dbuf, tag);
	}

	/* buf must be long enough (MAXNAMELEN + strlen(MOS_DIR_NAME) + 1 should do) */
	void
	dsl_dir_name(dsl_dir_t *dd, char *buf)
	{
	if (dd->dd_parent) {
	dsl_dir_name(dd->dd_parent, buf);
	(void) strcat(buf, "/");
	} else {
	buf[0] = '\0';
	}
	if (!MUTEX_HELD(&dd->dd_lock)) {
	/*
	* recursive mutex so that we can use
	* dprintf_dd() with dd_lock held
	*/
	mutex_enter(&dd->dd_lock);
	(void) strcat(buf, dd->dd_myname);
	mutex_exit(&dd->dd_lock);
	} else {
	(void) strcat(buf, dd->dd_myname);
	}
	}

	/* Calculate name legnth, avoiding all the strcat calls of dsl_dir_name */
	int
	dsl_dir_namelen(dsl_dir_t *dd)
	{
	int result = 0;

	if (dd->dd_parent) {
	/* parent's name + 1 for the "/" */
	result = dsl_dir_namelen(dd->dd_parent) + 1;
	}

	if (!MUTEX_HELD(&dd->dd_lock)) {
	/* see dsl_dir_name */
	mutex_enter(&dd->dd_lock);
	result += strlen(dd->dd_myname);
	mutex_exit(&dd->dd_lock);
	} else {
	result += strlen(dd->dd_myname);
	}

	return (result);
	}

	int
	dsl_dir_is_private(dsl_dir_t *dd)
	{
	int rv = FALSE;

	if (dd->dd_parent && dsl_dir_is_private(dd->dd_parent))
	rv = TRUE;
	if (dataset_name_hidden(dd->dd_myname))
	rv = TRUE;
	return (rv);
	}


	static int
	getcomponent(const char *path, char *component, const char **nextp)
	{
	char *p;
	if (path == NULL)
	return (ENOENT);
	/* This would be a good place to reserve some namespace... */
	p = strpbrk(path, "/@");
	if (p && (p[1] == '/' || p[1] == '@')) {
	/* two separators in a row */
	return (EINVAL);
	}
	if (p == NULL || p == path) {
	/*
	* if the first thing is an @ or /, it had better be an
	* @ and it had better not have any more ats or slashes,
	* and it had better have something after the @.
	*/
	if (p != NULL &&
	(p[0] != '@' || strpbrk(path+1, "/@") || p[1] == '\0'))
	return (EINVAL);
	if (strlen(path) >= MAXNAMELEN)
	return (ENAMETOOLONG);
	(void) strcpy(component, path);
	p = NULL;
	} else if (p[0] == '/') {
	if (p-path >= MAXNAMELEN)
	return (ENAMETOOLONG);
	(void) strncpy(component, path, p - path);
	component[p-path] = '\0';
	p++;
	} else if (p[0] == '@') {
	/*
	* if the next separator is an @, there better not be
	* any more slashes.
	*/
	if (strchr(path, '/'))
	return (EINVAL);
	if (p-path >= MAXNAMELEN)
	return (ENAMETOOLONG);
	(void) strncpy(component, path, p - path);
	component[p-path] = '\0';
	} else {
	ASSERT(!"invalid p");
	}
	*nextp = p;
	return (0);
	}

	/*
	* same as dsl_open_dir, ignore the first component of name and use the
	* spa instead
	*/
	int
	dsl_dir_open_spa(spa_t *spa, const char *name, void *tag,
	dsl_dir_t **ddp, const char **tailp)
	{
	char buf[MAXNAMELEN];
	const char *next, *nextnext = NULL;
	int err;
	dsl_dir_t *dd;
	dsl_pool_t *dp;
	uint64_t ddobj;
	int openedspa = FALSE;

	dprintf("%s\n", name);

	err = getcomponent(name, buf, &next);
	if (err)
	return (err);
	if (spa == NULL) {
	err = spa_open(buf, &spa, FTAG);
	if (err) {
	dprintf("spa_open(%s) failed\n", buf);
	return (err);
	}
	openedspa = TRUE;

	/* XXX this assertion belongs in spa_open */
	ASSERT(!dsl_pool_sync_context(spa_get_dsl(spa)));
	}

	dp = spa_get_dsl(spa);

	rw_enter(&dp->dp_config_rwlock, RW_READER);
	err = dsl_dir_open_obj(dp, dp->dp_root_dir_obj, NULL, tag, &dd);
	if (err) {
	rw_exit(&dp->dp_config_rwlock);
	if (openedspa)
	spa_close(spa, FTAG);
	return (err);
	}

	while (next != NULL) {
	dsl_dir_t *child_ds;
	err = getcomponent(next, buf, &nextnext);
	if (err)
	break;
	ASSERT(next[0] != '\0');
	if (next[0] == '@')
	break;
	dprintf("looking up %s in obj%lld\n",
	buf, dd->dd_phys->dd_child_dir_zapobj);

	err = zap_lookup(dp->dp_meta_objset,
	dd->dd_phys->dd_child_dir_zapobj,
	buf, sizeof (ddobj), 1, &ddobj);
	if (err) {
	if (err == ENOENT)
	err = 0;
	break;
	}

	err = dsl_dir_open_obj(dp, ddobj, buf, tag, &child_ds);
	if (err)
	break;
	dsl_dir_close(dd, tag);
	dd = child_ds;
	next = nextnext;
	}
	rw_exit(&dp->dp_config_rwlock);

	if (err) {
	dsl_dir_close(dd, tag);
	if (openedspa)
	spa_close(spa, FTAG);
	return (err);
	}

	/*
	* It's an error if there's more than one component left, or
	* tailp==NULL and there's any component left.
	*/
	if (next != NULL &&
	(tailp == NULL || (nextnext && nextnext[0] != '\0'))) {
	/* bad path name */
	dsl_dir_close(dd, tag);
	dprintf("next=%p (%s) tail=%p\n", next, next?next:"", tailp);
	err = ENOENT;
	}
	if (tailp)
	*tailp = next;
	if (openedspa)
	spa_close(spa, FTAG);
	*ddp = dd;
	return (err);
	}

	/*
	* Return the dsl_dir_t, and possibly the last component which couldn't
	* be found in *tail. Return NULL if the path is bogus, or if
	* tail==NULL and we couldn't parse the whole name. (*tail)[0] == '@'
	* means that the last component is a snapshot.
	*/
	int
	dsl_dir_open(const char *name, void *tag, dsl_dir_t **ddp, const char **tailp)
	{
	return (dsl_dir_open_spa(NULL, name, tag, ddp, tailp));
	}

	uint64_t
	dsl_dir_create_sync(dsl_pool_t *dp, dsl_dir_t *pds, const char *name,
	dmu_tx_t *tx)
	{
	objset_t *mos = dp->dp_meta_objset;
	uint64_t ddobj;
	dsl_dir_phys_t *dsphys;
	dmu_buf_t *dbuf;

	ddobj = dmu_object_alloc(mos, DMU_OT_DSL_DIR, 0,
	DMU_OT_DSL_DIR, sizeof (dsl_dir_phys_t), tx);
	if (pds) {
	VERIFY(0 == zap_add(mos, pds->dd_phys->dd_child_dir_zapobj,
	name, sizeof (uint64_t), 1, &ddobj, tx));
	} else {
	/* it's the root dir */
	VERIFY(0 == zap_add(mos, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_ROOT_DATASET, sizeof (uint64_t), 1, &ddobj, tx));
	}
	VERIFY(0 == dmu_bonus_hold(mos, ddobj, FTAG, &dbuf));
	dmu_buf_will_dirty(dbuf, tx);
	dsphys = dbuf->db_data;

	dsphys->dd_creation_time = gethrestime_sec();
	if (pds)
	dsphys->dd_parent_obj = pds->dd_object;
	dsphys->dd_props_zapobj = zap_create(mos,
	DMU_OT_DSL_PROPS, DMU_OT_NONE, 0, tx);
	dsphys->dd_child_dir_zapobj = zap_create(mos,
	DMU_OT_DSL_DIR_CHILD_MAP, DMU_OT_NONE, 0, tx);
	if (spa_version(dp->dp_spa) >= SPA_VERSION_USED_BREAKDOWN)
	dsphys->dd_flags |= DD_FLAG_USED_BREAKDOWN;
	dmu_buf_rele(dbuf, FTAG);

	return (ddobj);
	}

	/* ARGSUSED */
	int
	dsl_dir_destroy_check(void *arg1, void *arg2, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	dsl_pool_t *dp = dd->dd_pool;
	objset_t *mos = dp->dp_meta_objset;
	int err;
	uint64_t count;

	/*
	* There should be exactly two holds, both from
	* dsl_dataset_destroy: one on the dd directory, and one on its
	* head ds. Otherwise, someone is trying to lookup something
	* inside this dir while we want to destroy it. The
	* config_rwlock ensures that nobody else opens it after we
	* check.
	*/
	if (dmu_buf_refcount(dd->dd_dbuf) > 2)
	return (EBUSY);

	err = zap_count(mos, dd->dd_phys->dd_child_dir_zapobj, &count);
	if (err)
	return (err);
	if (count != 0)
	return (EEXIST);

	return (0);
	}

	void
	dsl_dir_destroy_sync(void *arg1, void *tag, cred_t *cr, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	objset_t *mos = dd->dd_pool->dp_meta_objset;
	uint64_t val, obj;
	dd_used_t t;

	ASSERT(RW_WRITE_HELD(&dd->dd_pool->dp_config_rwlock));
	ASSERT(dd->dd_phys->dd_head_dataset_obj == 0);

	/* Remove our reservation. */
	val = 0;
	dsl_dir_set_reservation_sync(dd, &val, cr, tx);
	ASSERT3U(dd->dd_phys->dd_used_bytes, ==, 0);
	ASSERT3U(dd->dd_phys->dd_reserved, ==, 0);
	for (t = 0; t < DD_USED_NUM; t++)
	ASSERT3U(dd->dd_phys->dd_used_breakdown[t], ==, 0);

	VERIFY(0 == zap_destroy(mos, dd->dd_phys->dd_child_dir_zapobj, tx));
	VERIFY(0 == zap_destroy(mos, dd->dd_phys->dd_props_zapobj, tx));
	VERIFY(0 == dsl_deleg_destroy(mos, dd->dd_phys->dd_deleg_zapobj, tx));
	VERIFY(0 == zap_remove(mos,
	dd->dd_parent->dd_phys->dd_child_dir_zapobj, dd->dd_myname, tx));

	obj = dd->dd_object;
	dsl_dir_close(dd, tag);
	VERIFY(0 == dmu_object_free(mos, obj, tx));
	}

	boolean_t
	dsl_dir_is_clone(dsl_dir_t *dd)
	{
	return (dd->dd_phys->dd_origin_obj &&
	(dd->dd_pool->dp_origin_snap == NULL ||
	dd->dd_phys->dd_origin_obj !=
	dd->dd_pool->dp_origin_snap->ds_object));
	}

	void
	dsl_dir_stats(dsl_dir_t *dd, nvlist_t *nv)
	{
	mutex_enter(&dd->dd_lock);
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USED,
	dd->dd_phys->dd_used_bytes);
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_QUOTA, dd->dd_phys->dd_quota);
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_RESERVATION,
	dd->dd_phys->dd_reserved);
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_COMPRESSRATIO,
	dd->dd_phys->dd_compressed_bytes == 0 ? 100 :
	(dd->dd_phys->dd_uncompressed_bytes * 100 /
	dd->dd_phys->dd_compressed_bytes));
	if (dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN) {
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDSNAP,
	dd->dd_phys->dd_used_breakdown[DD_USED_SNAP]);
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDDS,
	dd->dd_phys->dd_used_breakdown[DD_USED_HEAD]);
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDREFRESERV,
	dd->dd_phys->dd_used_breakdown[DD_USED_REFRSRV]);
	dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDCHILD,
	dd->dd_phys->dd_used_breakdown[DD_USED_CHILD] +
	dd->dd_phys->dd_used_breakdown[DD_USED_CHILD_RSRV]);
	}
	mutex_exit(&dd->dd_lock);

	rw_enter(&dd->dd_pool->dp_config_rwlock, RW_READER);
	if (dsl_dir_is_clone(dd)) {
	dsl_dataset_t *ds;
	char buf[MAXNAMELEN];

	VERIFY(0 == dsl_dataset_hold_obj(dd->dd_pool,
	dd->dd_phys->dd_origin_obj, FTAG, &ds));
	dsl_dataset_name(ds, buf);
	dsl_dataset_rele(ds, FTAG);
	dsl_prop_nvlist_add_string(nv, ZFS_PROP_ORIGIN, buf);
	}
	rw_exit(&dd->dd_pool->dp_config_rwlock);
	}

	void
	dsl_dir_dirty(dsl_dir_t *dd, dmu_tx_t *tx)
	{
	dsl_pool_t *dp = dd->dd_pool;

	ASSERT(dd->dd_phys);

	if (txg_list_add(&dp->dp_dirty_dirs, dd, tx->tx_txg) == 0) {
	/* up the hold count until we can be written out */
	dmu_buf_add_ref(dd->dd_dbuf, dd);
	}
	}

	static int64_t
	parent_delta(dsl_dir_t *dd, uint64_t used, int64_t delta)
	{
	uint64_t old_accounted = MAX(used, dd->dd_phys->dd_reserved);
	uint64_t new_accounted = MAX(used + delta, dd->dd_phys->dd_reserved);
	return (new_accounted - old_accounted);
	}

	void
	dsl_dir_sync(dsl_dir_t *dd, dmu_tx_t *tx)
	{
	ASSERT(dmu_tx_is_syncing(tx));

	dmu_buf_will_dirty(dd->dd_dbuf, tx);

	mutex_enter(&dd->dd_lock);
	ASSERT3U(dd->dd_tempreserved[tx->tx_txg&TXG_MASK], ==, 0);
	dprintf_dd(dd, "txg=%llu towrite=%lluK\n", tx->tx_txg,
	dd->dd_space_towrite[tx->tx_txg&TXG_MASK] / 1024);
	dd->dd_space_towrite[tx->tx_txg&TXG_MASK] = 0;
	mutex_exit(&dd->dd_lock);

	/* release the hold from dsl_dir_dirty */
	dmu_buf_rele(dd->dd_dbuf, dd);
	}

	static uint64_t
	dsl_dir_space_towrite(dsl_dir_t *dd)
	{
	uint64_t space = 0;
	int i;

	ASSERT(MUTEX_HELD(&dd->dd_lock));

	for (i = 0; i < TXG_SIZE; i++) {
	space += dd->dd_space_towrite[i&TXG_MASK];
	ASSERT3U(dd->dd_space_towrite[i&TXG_MASK], >=, 0);
	}
	return (space);
	}

	/*
	* How much space would dd have available if ancestor had delta applied
	* to it? If ondiskonly is set, we're only interested in what's
	* on-disk, not estimated pending changes.
	*/
	uint64_t
	dsl_dir_space_available(dsl_dir_t *dd,
	dsl_dir_t *ancestor, int64_t delta, int ondiskonly)
	{
	uint64_t parentspace, myspace, quota, used;

	/*
	* If there are no restrictions otherwise, assume we have
	* unlimited space available.
	*/
	quota = UINT64_MAX;
	parentspace = UINT64_MAX;

	if (dd->dd_parent != NULL) {
	parentspace = dsl_dir_space_available(dd->dd_parent,
	ancestor, delta, ondiskonly);
	}

	mutex_enter(&dd->dd_lock);
	if (dd->dd_phys->dd_quota != 0)
	quota = dd->dd_phys->dd_quota;
	used = dd->dd_phys->dd_used_bytes;
	if (!ondiskonly)
	used += dsl_dir_space_towrite(dd);

	if (dd->dd_parent == NULL) {
	uint64_t poolsize = dsl_pool_adjustedsize(dd->dd_pool, FALSE);
	quota = MIN(quota, poolsize);
	}

	if (dd->dd_phys->dd_reserved > used && parentspace != UINT64_MAX) {
	/*
	* We have some space reserved, in addition to what our
	* parent gave us.
	*/
	parentspace += dd->dd_phys->dd_reserved - used;
	}

	if (dd == ancestor) {
	ASSERT(delta <= 0);
	ASSERT(used >= -delta);
	used += delta;
	if (parentspace != UINT64_MAX)
	parentspace -= delta;
	}

	if (used > quota) {
	/* over quota */
	myspace = 0;

	/*
	* While it's OK to be a little over quota, if
	* we think we are using more space than there
	* is in the pool (which is already 1.6% more than
	* dsl_pool_adjustedsize()), something is very
	* wrong.
	*/
	ASSERT3U(used, <=, spa_get_space(dd->dd_pool->dp_spa));
	} else {
	/*
	* the lesser of the space provided by our parent and
	* the space left in our quota
	*/
	myspace = MIN(parentspace, quota - used);
	}

	mutex_exit(&dd->dd_lock);

	return (myspace);
	}

	struct tempreserve {
	list_node_t tr_node;
	dsl_pool_t *tr_dp;
	dsl_dir_t *tr_ds;
	uint64_t tr_size;
	};

	static int
	dsl_dir_tempreserve_impl(dsl_dir_t *dd, uint64_t asize, boolean_t netfree,
	boolean_t ignorequota, boolean_t checkrefquota, list_t *tr_list,
	dmu_tx_t *tx, boolean_t first)
	{
	uint64_t txg = tx->tx_txg;
	uint64_t est_inflight, used_on_disk, quota, parent_rsrv;
	struct tempreserve *tr;
	int enospc = EDQUOT;
	int txgidx = txg & TXG_MASK;
	int i;
	uint64_t ref_rsrv = 0;

	ASSERT3U(txg, !=, 0);
	ASSERT3S(asize, >, 0);

	mutex_enter(&dd->dd_lock);

	/*
	* Check against the dsl_dir's quota. We don't add in the delta
	* when checking for over-quota because they get one free hit.
	*/
	est_inflight = dsl_dir_space_towrite(dd);
	for (i = 0; i < TXG_SIZE; i++)
	est_inflight += dd->dd_tempreserved[i];
	used_on_disk = dd->dd_phys->dd_used_bytes;

	/*
	* On the first iteration, fetch the dataset's used-on-disk and
	* refreservation values. Also, if checkrefquota is set, test if
	* allocating this space would exceed the dataset's refquota.
	*/
	if (first && tx->tx_objset) {
	int error;
	dsl_dataset_t *ds = tx->tx_objset->os->os_dsl_dataset;

	error = dsl_dataset_check_quota(ds, checkrefquota,
	asize, est_inflight, &used_on_disk, &ref_rsrv);
	if (error) {
	mutex_exit(&dd->dd_lock);
	return (error);
	}
	}

	/*
	* If this transaction will result in a net free of space,
	* we want to let it through.
	*/
	if (ignorequota || netfree || dd->dd_phys->dd_quota == 0)
	quota = UINT64_MAX;
	else
	quota = dd->dd_phys->dd_quota;

	/*
	* Adjust the quota against the actual pool size at the root.
	* To ensure that it's possible to remove files from a full
	* pool without inducing transient overcommits, we throttle
	* netfree transactions against a quota that is slightly larger,
	* but still within the pool's allocation slop. In cases where
	* we're very close to full, this will allow a steady trickle of
	* removes to get through.
	*/
	if (dd->dd_parent == NULL) {
	uint64_t poolsize = dsl_pool_adjustedsize(dd->dd_pool, netfree);
	if (poolsize < quota) {
	quota = poolsize;
	enospc = ENOSPC;
	}
	}

	/*
	* If they are requesting more space, and our current estimate
	* is over quota, they get to try again unless the actual
	* on-disk is over quota and there are no pending changes (which
	* may free up space for us).
	*/
	if (used_on_disk + est_inflight > quota) {
	if (est_inflight > 0 || used_on_disk < quota)
	enospc = ERESTART;
	dprintf_dd(dd, "failing: used=%lluK inflight = %lluK "
	"quota=%lluK tr=%lluK err=%d\n",
	used_on_disk>>10, est_inflight>>10,
	quota>>10, asize>>10, enospc);
	mutex_exit(&dd->dd_lock);
	return (enospc);
	}

	/* We need to up our estimated delta before dropping dd_lock */
	dd->dd_tempreserved[txgidx] += asize;

	parent_rsrv = parent_delta(dd, used_on_disk + est_inflight,
	asize - ref_rsrv);
	mutex_exit(&dd->dd_lock);

	tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
	tr->tr_ds = dd;
	tr->tr_size = asize;
	list_insert_tail(tr_list, tr);

	/* see if it's OK with our parent */
	if (dd->dd_parent && parent_rsrv) {
	boolean_t ismos = (dd->dd_phys->dd_head_dataset_obj == 0);

	return (dsl_dir_tempreserve_impl(dd->dd_parent,
	parent_rsrv, netfree, ismos, TRUE, tr_list, tx, FALSE));
	} else {
	return (0);
	}
	}

	/*
	* Reserve space in this dsl_dir, to be used in this tx's txg.
	* After the space has been dirtied (and dsl_dir_willuse_space()
	* has been called), the reservation should be canceled, using
	* dsl_dir_tempreserve_clear().
	*/
	int
	dsl_dir_tempreserve_space(dsl_dir_t *dd, uint64_t lsize, uint64_t asize,
	uint64_t fsize, uint64_t usize, void **tr_cookiep, dmu_tx_t *tx)
	{
	int err;
	list_t *tr_list;

	if (asize == 0) {
	*tr_cookiep = NULL;
	return (0);
	}

	tr_list = kmem_alloc(sizeof (list_t), KM_SLEEP);
	list_create(tr_list, sizeof (struct tempreserve),
	offsetof(struct tempreserve, tr_node));
	ASSERT3S(asize, >, 0);
	ASSERT3S(fsize, >=, 0);

	err = arc_tempreserve_space(lsize, tx->tx_txg);
	if (err == 0) {
	struct tempreserve *tr;

	tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
	tr->tr_size = lsize;
	list_insert_tail(tr_list, tr);

	err = dsl_pool_tempreserve_space(dd->dd_pool, asize, tx);
	} else {
	if (err == EAGAIN) {
	txg_delay(dd->dd_pool, tx->tx_txg, 1);
	err = ERESTART;
	}
	dsl_pool_memory_pressure(dd->dd_pool);
	}

	if (err == 0) {
	struct tempreserve *tr;

	tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
	tr->tr_dp = dd->dd_pool;
	tr->tr_size = asize;
	list_insert_tail(tr_list, tr);

	err = dsl_dir_tempreserve_impl(dd, asize, fsize >= asize,
	FALSE, asize > usize, tr_list, tx, TRUE);
	}

	if (err)
	dsl_dir_tempreserve_clear(tr_list, tx);
	else
	*tr_cookiep = tr_list;

	return (err);
	}

	/*
	* Clear a temporary reservation that we previously made with
	* dsl_dir_tempreserve_space().
	*/
	void
	dsl_dir_tempreserve_clear(void *tr_cookie, dmu_tx_t *tx)
	{
	int txgidx = tx->tx_txg & TXG_MASK;
	list_t *tr_list = tr_cookie;
	struct tempreserve *tr;

	ASSERT3U(tx->tx_txg, !=, 0);

	if (tr_cookie == NULL)
	return;

	while (tr = list_head(tr_list)) {
	if (tr->tr_dp) {
	dsl_pool_tempreserve_clear(tr->tr_dp, tr->tr_size, tx);
	} else if (tr->tr_ds) {
	mutex_enter(&tr->tr_ds->dd_lock);
	ASSERT3U(tr->tr_ds->dd_tempreserved[txgidx], >=,
	tr->tr_size);
	tr->tr_ds->dd_tempreserved[txgidx] -= tr->tr_size;
	mutex_exit(&tr->tr_ds->dd_lock);
	} else {
	arc_tempreserve_clear(tr->tr_size);
	}
	list_remove(tr_list, tr);
	kmem_free(tr, sizeof (struct tempreserve));
	}

	kmem_free(tr_list, sizeof (list_t));
	}

	static void
	dsl_dir_willuse_space_impl(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx)
	{
	int64_t parent_space;
	uint64_t est_used;

	mutex_enter(&dd->dd_lock);
	if (space > 0)
	dd->dd_space_towrite[tx->tx_txg & TXG_MASK] += space;

	est_used = dsl_dir_space_towrite(dd) + dd->dd_phys->dd_used_bytes;
	parent_space = parent_delta(dd, est_used, space);
	mutex_exit(&dd->dd_lock);

	/* Make sure that we clean up dd_space_to* */
	dsl_dir_dirty(dd, tx);

	/* XXX this is potentially expensive and unnecessary... */
	if (parent_space && dd->dd_parent)
	dsl_dir_willuse_space_impl(dd->dd_parent, parent_space, tx);
	}

	/*
	* Call in open context when we think we're going to write/free space,
	* eg. when dirtying data. Be conservative (ie. OK to write less than
	* this or free more than this, but don't write more or free less).
	*/
	void
	dsl_dir_willuse_space(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx)
	{
	dsl_pool_willuse_space(dd->dd_pool, space, tx);
	dsl_dir_willuse_space_impl(dd, space, tx);
	}

	/* call from syncing context when we actually write/free space for this dd */
	void
	dsl_dir_diduse_space(dsl_dir_t *dd, dd_used_t type,
	int64_t used, int64_t compressed, int64_t uncompressed, dmu_tx_t *tx)
	{
	int64_t accounted_delta;
	boolean_t needlock = !MUTEX_HELD(&dd->dd_lock);

	ASSERT(dmu_tx_is_syncing(tx));
	ASSERT(type < DD_USED_NUM);

	dsl_dir_dirty(dd, tx);

	if (needlock)
	mutex_enter(&dd->dd_lock);
	accounted_delta = parent_delta(dd, dd->dd_phys->dd_used_bytes, used);
	ASSERT(used >= 0 || dd->dd_phys->dd_used_bytes >= -used);
	ASSERT(compressed >= 0 ||
	dd->dd_phys->dd_compressed_bytes >= -compressed);
	ASSERT(uncompressed >= 0 ||
	dd->dd_phys->dd_uncompressed_bytes >= -uncompressed);
	dd->dd_phys->dd_used_bytes += used;
	dd->dd_phys->dd_uncompressed_bytes += uncompressed;
	dd->dd_phys->dd_compressed_bytes += compressed;

	if (dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN) {
	ASSERT(used > 0 ||
	dd->dd_phys->dd_used_breakdown[type] >= -used);
	dd->dd_phys->dd_used_breakdown[type] += used;
	#ifdef DEBUG
	dd_used_t t;
	uint64_t u = 0;
	for (t = 0; t < DD_USED_NUM; t++)
	u += dd->dd_phys->dd_used_breakdown[t];
	ASSERT3U(u, ==, dd->dd_phys->dd_used_bytes);
	#endif
	}
	if (needlock)
	mutex_exit(&dd->dd_lock);

	if (dd->dd_parent != NULL) {
	dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD,
	accounted_delta, compressed, uncompressed, tx);
	dsl_dir_transfer_space(dd->dd_parent,
	used - accounted_delta,
	DD_USED_CHILD_RSRV, DD_USED_CHILD, tx);
	}
	}

	void
	dsl_dir_transfer_space(dsl_dir_t *dd, int64_t delta,
	dd_used_t oldtype, dd_used_t newtype, dmu_tx_t *tx)
	{
	boolean_t needlock = !MUTEX_HELD(&dd->dd_lock);

	ASSERT(dmu_tx_is_syncing(tx));
	ASSERT(oldtype < DD_USED_NUM);
	ASSERT(newtype < DD_USED_NUM);

	if (delta == 0 || !(dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN))
	return;

	dsl_dir_dirty(dd, tx);
	if (needlock)
	mutex_enter(&dd->dd_lock);
	ASSERT(delta > 0 ?
	dd->dd_phys->dd_used_breakdown[oldtype] >= delta :
	dd->dd_phys->dd_used_breakdown[newtype] >= -delta);
	ASSERT(dd->dd_phys->dd_used_bytes >= ABS(delta));
	dd->dd_phys->dd_used_breakdown[oldtype] -= delta;
	dd->dd_phys->dd_used_breakdown[newtype] += delta;
	if (needlock)
	mutex_exit(&dd->dd_lock);
	}

	static int
	dsl_dir_set_quota_check(void *arg1, void *arg2, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	uint64_t *quotap = arg2;
	uint64_t new_quota = *quotap;
	int err = 0;
	uint64_t towrite;

	if (new_quota == 0)
	return (0);

	mutex_enter(&dd->dd_lock);
	/*
	* If we are doing the preliminary check in open context, and
	* there are pending changes, then don't fail it, since the
	* pending changes could under-estimate the amount of space to be
	* freed up.
	*/
	towrite = dsl_dir_space_towrite(dd);
	if ((dmu_tx_is_syncing(tx) || towrite == 0) &&
	(new_quota < dd->dd_phys->dd_reserved ||
	new_quota < dd->dd_phys->dd_used_bytes + towrite)) {
	err = ENOSPC;
	}
	mutex_exit(&dd->dd_lock);
	return (err);
	}

	/* ARGSUSED */
	static void
	dsl_dir_set_quota_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	uint64_t *quotap = arg2;
	uint64_t new_quota = *quotap;

	dmu_buf_will_dirty(dd->dd_dbuf, tx);

	mutex_enter(&dd->dd_lock);
	dd->dd_phys->dd_quota = new_quota;
	mutex_exit(&dd->dd_lock);

	spa_history_internal_log(LOG_DS_QUOTA, dd->dd_pool->dp_spa,
	tx, cr, "%lld dataset = %llu ",
	(longlong_t)new_quota, dd->dd_phys->dd_head_dataset_obj);
	}

	int
	dsl_dir_set_quota(const char *ddname, uint64_t quota)
	{
	dsl_dir_t *dd;
	int err;

	err = dsl_dir_open(ddname, FTAG, &dd, NULL);
	if (err)
	return (err);

	if (quota != dd->dd_phys->dd_quota) {
	/*
	* If someone removes a file, then tries to set the quota, we
	* want to make sure the file freeing takes effect.
	*/
	txg_wait_open(dd->dd_pool, 0);

	err = dsl_sync_task_do(dd->dd_pool, dsl_dir_set_quota_check,
	dsl_dir_set_quota_sync, dd, &quota, 0);
	}
	dsl_dir_close(dd, FTAG);
	return (err);
	}

	int
	dsl_dir_set_reservation_check(void *arg1, void *arg2, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	uint64_t *reservationp = arg2;
	uint64_t new_reservation = *reservationp;
	uint64_t used, avail;
	int64_t delta;

	if (new_reservation > INT64_MAX)
	return (EOVERFLOW);

	/*
	* If we are doing the preliminary check in open context, the
	* space estimates may be inaccurate.
	*/
	if (!dmu_tx_is_syncing(tx))
	return (0);

	mutex_enter(&dd->dd_lock);
	used = dd->dd_phys->dd_used_bytes;
	delta = MAX(used, new_reservation) -
	MAX(used, dd->dd_phys->dd_reserved);
	mutex_exit(&dd->dd_lock);

	if (dd->dd_parent) {
	avail = dsl_dir_space_available(dd->dd_parent,
	NULL, 0, FALSE);
	} else {
	avail = dsl_pool_adjustedsize(dd->dd_pool, B_FALSE) - used;
	}

	if (delta > 0 && delta > avail)
	return (ENOSPC);
	if (delta > 0 && dd->dd_phys->dd_quota > 0 &&
	new_reservation > dd->dd_phys->dd_quota)
	return (ENOSPC);
	return (0);
	}

	/* ARGSUSED */
	static void
	dsl_dir_set_reservation_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	uint64_t *reservationp = arg2;
	uint64_t new_reservation = *reservationp;
	uint64_t used;
	int64_t delta;

	dmu_buf_will_dirty(dd->dd_dbuf, tx);

	mutex_enter(&dd->dd_lock);
	used = dd->dd_phys->dd_used_bytes;
	delta = MAX(used, new_reservation) -
	MAX(used, dd->dd_phys->dd_reserved);
	dd->dd_phys->dd_reserved = new_reservation;

	if (dd->dd_parent != NULL) {
	/* Roll up this additional usage into our ancestors */
	dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD_RSRV,
	delta, 0, 0, tx);
	}
	mutex_exit(&dd->dd_lock);

	spa_history_internal_log(LOG_DS_RESERVATION, dd->dd_pool->dp_spa,
	tx, cr, "%lld dataset = %llu",
	(longlong_t)new_reservation, dd->dd_phys->dd_head_dataset_obj);
	}

	int
	dsl_dir_set_reservation(const char *ddname, uint64_t reservation)
	{
	dsl_dir_t *dd;
	int err;

	err = dsl_dir_open(ddname, FTAG, &dd, NULL);
	if (err)
	return (err);
	err = dsl_sync_task_do(dd->dd_pool, dsl_dir_set_reservation_check,
	dsl_dir_set_reservation_sync, dd, &reservation, 0);
	dsl_dir_close(dd, FTAG);
	return (err);
	}

	static dsl_dir_t *
	closest_common_ancestor(dsl_dir_t *ds1, dsl_dir_t *ds2)
	{
	for (; ds1; ds1 = ds1->dd_parent) {
	dsl_dir_t *dd;
	for (dd = ds2; dd; dd = dd->dd_parent) {
	if (ds1 == dd)
	return (dd);
	}
	}
	return (NULL);
	}

	/*
	* If delta is applied to dd, how much of that delta would be applied to
	* ancestor? Syncing context only.
	*/
	static int64_t
	would_change(dsl_dir_t *dd, int64_t delta, dsl_dir_t *ancestor)
	{
	if (dd == ancestor)
	return (delta);

	mutex_enter(&dd->dd_lock);
	delta = parent_delta(dd, dd->dd_phys->dd_used_bytes, delta);
	mutex_exit(&dd->dd_lock);
	return (would_change(dd->dd_parent, delta, ancestor));
	}

	struct renamearg {
	dsl_dir_t *newparent;
	const char *mynewname;
	};

	/*ARGSUSED*/
	static int
	dsl_dir_rename_check(void *arg1, void *arg2, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	struct renamearg *ra = arg2;
	dsl_pool_t *dp = dd->dd_pool;
	objset_t *mos = dp->dp_meta_objset;
	int err;
	uint64_t val;

	/* There should be 2 references: the open and the dirty */
	if (dmu_buf_refcount(dd->dd_dbuf) > 2)
	return (EBUSY);

	/* check for existing name */
	err = zap_lookup(mos, ra->newparent->dd_phys->dd_child_dir_zapobj,
	ra->mynewname, 8, 1, &val);
	if (err == 0)
	return (EEXIST);
	if (err != ENOENT)
	return (err);

	if (ra->newparent != dd->dd_parent) {
	/* is there enough space? */
	uint64_t myspace =
	MAX(dd->dd_phys->dd_used_bytes, dd->dd_phys->dd_reserved);

	/* no rename into our descendant */
	if (closest_common_ancestor(dd, ra->newparent) == dd)
	return (EINVAL);

	if (err = dsl_dir_transfer_possible(dd->dd_parent,
	ra->newparent, myspace))
	return (err);
	}

	return (0);
	}

	static void
	dsl_dir_rename_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
	{
	dsl_dir_t *dd = arg1;
	struct renamearg *ra = arg2;
	dsl_pool_t *dp = dd->dd_pool;
	objset_t *mos = dp->dp_meta_objset;
	int err;

	ASSERT(dmu_buf_refcount(dd->dd_dbuf) <= 2);

	if (ra->newparent != dd->dd_parent) {
	dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD,
	-dd->dd_phys->dd_used_bytes,
	-dd->dd_phys->dd_compressed_bytes,
	-dd->dd_phys->dd_uncompressed_bytes, tx);
	dsl_dir_diduse_space(ra->newparent, DD_USED_CHILD,
	dd->dd_phys->dd_used_bytes,
	dd->dd_phys->dd_compressed_bytes,
	dd->dd_phys->dd_uncompressed_bytes, tx);

	if (dd->dd_phys->dd_reserved > dd->dd_phys->dd_used_bytes) {
	uint64_t unused_rsrv = dd->dd_phys->dd_reserved -
	dd->dd_phys->dd_used_bytes;

	dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD_RSRV,
	-unused_rsrv, 0, 0, tx);
	dsl_dir_diduse_space(ra->newparent, DD_USED_CHILD_RSRV,
	unused_rsrv, 0, 0, tx);
	}
	}

	dmu_buf_will_dirty(dd->dd_dbuf, tx);

	/* remove from old parent zapobj */
	err = zap_remove(mos, dd->dd_parent->dd_phys->dd_child_dir_zapobj,
	dd->dd_myname, tx);
	ASSERT3U(err, ==, 0);

	(void) strcpy(dd->dd_myname, ra->mynewname);
	dsl_dir_close(dd->dd_parent, dd);
	dd->dd_phys->dd_parent_obj = ra->newparent->dd_object;
	VERIFY(0 == dsl_dir_open_obj(dd->dd_pool,
	ra->newparent->dd_object, NULL, dd, &dd->dd_parent));

	/* add to new parent zapobj */
	err = zap_add(mos, ra->newparent->dd_phys->dd_child_dir_zapobj,
	dd->dd_myname, 8, 1, &dd->dd_object, tx);
	ASSERT3U(err, ==, 0);

	spa_history_internal_log(LOG_DS_RENAME, dd->dd_pool->dp_spa,
	tx, cr, "dataset = %llu", dd->dd_phys->dd_head_dataset_obj);
	}

	int
	dsl_dir_rename(dsl_dir_t *dd, const char *newname)
	{
	struct renamearg ra;
	int err;

	/* new parent should exist */
	err = dsl_dir_open(newname, FTAG, &ra.newparent, &ra.mynewname);
	if (err)
	return (err);

	/* can't rename to different pool */
	if (dd->dd_pool != ra.newparent->dd_pool) {
	err = ENXIO;
	goto out;
	}

	/* new name should not already exist */
	if (ra.mynewname == NULL) {
	err = EEXIST;
	goto out;
	}

	err = dsl_sync_task_do(dd->dd_pool,
	dsl_dir_rename_check, dsl_dir_rename_sync, dd, &ra, 3);

	out:
	dsl_dir_close(ra.newparent, FTAG);
	return (err);
	}

	int
	dsl_dir_transfer_possible(dsl_dir_t *sdd, dsl_dir_t *tdd, uint64_t space)
	{
	dsl_dir_t *ancestor;
	int64_t adelta;
	uint64_t avail;

	ancestor = closest_common_ancestor(sdd, tdd);
	adelta = would_change(sdd, -space, ancestor);
	avail = dsl_dir_space_available(tdd, ancestor, adelta, FALSE);
	if (avail < space)
	return (ENOSPC);

	return (0);
	}
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c (revision 209274)
	@@ -1,1025 +1,1027 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	#include <sys/dsl_pool.h>
	#include <sys/dsl_dataset.h>
	#include <sys/dsl_prop.h>
	#include <sys/dsl_dir.h>
	#include <sys/dsl_synctask.h>
	#include <sys/dnode.h>
	#include <sys/dmu_tx.h>
	#include <sys/dmu_objset.h>
	#include <sys/arc.h>
	#include <sys/zap.h>
	#include <sys/zio.h>
	#include <sys/zfs_context.h>
	#include <sys/fs/zfs.h>
	#include <sys/zfs_znode.h>
	#include <sys/spa_impl.h>
	#include <sys/vdev_impl.h>
	#include <sys/zil_impl.h>

	typedef int (scrub_cb_t)(dsl_pool_t *, const blkptr_t *, const zbookmark_t *);

	static scrub_cb_t dsl_pool_scrub_clean_cb;
	static dsl_syncfunc_t dsl_pool_scrub_cancel_sync;

	int zfs_scrub_min_time = 1; /* scrub for at least 1 sec each txg */
	int zfs_resilver_min_time = 3; /* resilver for at least 3 sec each txg */
	boolean_t zfs_no_scrub_io = B_FALSE; /* set to disable scrub i/o */

	extern int zfs_txg_timeout;

	static scrub_cb_t *scrub_funcs[SCRUB_FUNC_NUMFUNCS] = {
	NULL,
	dsl_pool_scrub_clean_cb
	};

	#define SET_BOOKMARK(zb, objset, object, level, blkid) \
	{ \
	(zb)->zb_objset = objset; \
	(zb)->zb_object = object; \
	(zb)->zb_level = level; \
	(zb)->zb_blkid = blkid; \
	}

	/* ARGSUSED */
	static void
	dsl_pool_scrub_setup_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
	{
	dsl_pool_t *dp = arg1;
	enum scrub_func *funcp = arg2;
	dmu_object_type_t ot = 0;
	boolean_t complete = B_FALSE;

	dsl_pool_scrub_cancel_sync(dp, &complete, cr, tx);

	ASSERT(dp->dp_scrub_func == SCRUB_FUNC_NONE);
	ASSERT(*funcp > SCRUB_FUNC_NONE);
	ASSERT(*funcp < SCRUB_FUNC_NUMFUNCS);

	dp->dp_scrub_min_txg = 0;
	dp->dp_scrub_max_txg = tx->tx_txg;

	if (*funcp == SCRUB_FUNC_CLEAN) {
	vdev_t *rvd = dp->dp_spa->spa_root_vdev;

	/* rewrite all disk labels */
	vdev_config_dirty(rvd);

	if (vdev_resilver_needed(rvd,
	&dp->dp_scrub_min_txg, &dp->dp_scrub_max_txg)) {
	spa_event_notify(dp->dp_spa, NULL,
	ESC_ZFS_RESILVER_START);
	dp->dp_scrub_max_txg = MIN(dp->dp_scrub_max_txg,
	tx->tx_txg);
	}

	/* zero out the scrub stats in all vdev_stat_t's */
	vdev_scrub_stat_update(rvd,
	dp->dp_scrub_min_txg ? POOL_SCRUB_RESILVER :
	POOL_SCRUB_EVERYTHING, B_FALSE);

	dp->dp_spa->spa_scrub_started = B_TRUE;
	}

	/* back to the generic stuff */

	if (dp->dp_blkstats == NULL) {
	dp->dp_blkstats =
	kmem_alloc(sizeof (zfs_all_blkstats_t), KM_SLEEP);
	}
	bzero(dp->dp_blkstats, sizeof (zfs_all_blkstats_t));

	if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB)
	ot = DMU_OT_ZAP_OTHER;

	dp->dp_scrub_func = *funcp;
	dp->dp_scrub_queue_obj = zap_create(dp->dp_meta_objset,
	ot ? ot : DMU_OT_SCRUB_QUEUE, DMU_OT_NONE, 0, tx);
	bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));
	dp->dp_scrub_restart = B_FALSE;
	dp->dp_spa->spa_scrub_errors = 0;

	VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_FUNC, sizeof (uint32_t), 1,
	&dp->dp_scrub_func, tx));
	VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_QUEUE, sizeof (uint64_t), 1,
	&dp->dp_scrub_queue_obj, tx));
	VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_MIN_TXG, sizeof (uint64_t), 1,
	&dp->dp_scrub_min_txg, tx));
	VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_MAX_TXG, sizeof (uint64_t), 1,
	&dp->dp_scrub_max_txg, tx));
	VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4,
	&dp->dp_scrub_bookmark, tx));
	VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_ERRORS, sizeof (uint64_t), 1,
	&dp->dp_spa->spa_scrub_errors, tx));

	spa_history_internal_log(LOG_POOL_SCRUB, dp->dp_spa, tx, cr,
	"func=%u mintxg=%llu maxtxg=%llu",
	*funcp, dp->dp_scrub_min_txg, dp->dp_scrub_max_txg);
	}

	int
	dsl_pool_scrub_setup(dsl_pool_t *dp, enum scrub_func func)
	{
	return (dsl_sync_task_do(dp, NULL,
	dsl_pool_scrub_setup_sync, dp, &func, 0));
	}

	/* ARGSUSED */
	static void
	dsl_pool_scrub_cancel_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
	{
	dsl_pool_t *dp = arg1;
	boolean_t *completep = arg2;

	if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
	return;

	mutex_enter(&dp->dp_scrub_cancel_lock);

	if (dp->dp_scrub_restart) {
	dp->dp_scrub_restart = B_FALSE;
	*completep = B_FALSE;
	}

	/* XXX this is scrub-clean specific */
	mutex_enter(&dp->dp_spa->spa_scrub_lock);
	while (dp->dp_spa->spa_scrub_inflight > 0) {
	cv_wait(&dp->dp_spa->spa_scrub_io_cv,
	&dp->dp_spa->spa_scrub_lock);
	}
	mutex_exit(&dp->dp_spa->spa_scrub_lock);
	dp->dp_spa->spa_scrub_started = B_FALSE;
	dp->dp_spa->spa_scrub_active = B_FALSE;

	dp->dp_scrub_func = SCRUB_FUNC_NONE;
	VERIFY(0 == dmu_object_free(dp->dp_meta_objset,
	dp->dp_scrub_queue_obj, tx));
	dp->dp_scrub_queue_obj = 0;
	bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));

	VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_QUEUE, tx));
	VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_MIN_TXG, tx));
	VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_MAX_TXG, tx));
	VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_BOOKMARK, tx));
	VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_FUNC, tx));
	VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_ERRORS, tx));

	spa_history_internal_log(LOG_POOL_SCRUB_DONE, dp->dp_spa, tx, cr,
	"complete=%u", *completep);

	/* below is scrub-clean specific */
	vdev_scrub_stat_update(dp->dp_spa->spa_root_vdev, POOL_SCRUB_NONE,
	*completep);
	/*
	* If the scrub/resilver completed, update all DTLs to reflect this.
	* Whether it succeeded or not, vacate all temporary scrub DTLs.
	*/
	vdev_dtl_reassess(dp->dp_spa->spa_root_vdev, tx->tx_txg,
	*completep ? dp->dp_scrub_max_txg : 0, B_TRUE);
	if (dp->dp_scrub_min_txg && *completep)
	spa_event_notify(dp->dp_spa, NULL, ESC_ZFS_RESILVER_FINISH);
	spa_errlog_rotate(dp->dp_spa);

	/*
	* We may have finished replacing a device.
	* Let the async thread assess this and handle the detach.
	*/
	spa_async_request(dp->dp_spa, SPA_ASYNC_RESILVER_DONE);

	dp->dp_scrub_min_txg = dp->dp_scrub_max_txg = 0;
	mutex_exit(&dp->dp_scrub_cancel_lock);
	}

	int
	dsl_pool_scrub_cancel(dsl_pool_t *dp)
	{
	boolean_t complete = B_FALSE;

	return (dsl_sync_task_do(dp, NULL,
	dsl_pool_scrub_cancel_sync, dp, &complete, 3));
	}

	int
	dsl_free(zio_t *pio, dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp,
	zio_done_func_t *done, void *private, uint32_t arc_flags)
	{
	/*
	* This function will be used by bp-rewrite wad to intercept frees.
	*/
	return (arc_free(pio, dp->dp_spa, txg, (blkptr_t *)bpp,
	done, private, arc_flags));
	}

	static boolean_t
	bookmark_is_zero(const zbookmark_t *zb)
	{
	return (zb->zb_objset == 0 && zb->zb_object == 0 &&
	zb->zb_level == 0 && zb->zb_blkid == 0);
	}

	/* dnp is the dnode for zb1->zb_object */
	static boolean_t
	bookmark_is_before(dnode_phys_t *dnp, const zbookmark_t *zb1,
	const zbookmark_t *zb2)
	{
	uint64_t zb1nextL0, zb2thisobj;

	ASSERT(zb1->zb_objset == zb2->zb_objset);
	ASSERT(zb1->zb_object != -1ULL);
	ASSERT(zb2->zb_level == 0);

	/*
	* A bookmark in the deadlist is considered to be after
	* everything else.
	*/
	if (zb2->zb_object == -1ULL)
	return (B_TRUE);

	/* The objset_phys_t isn't before anything. */
	if (dnp == NULL)
	return (B_FALSE);

	zb1nextL0 = (zb1->zb_blkid + 1) <<
	((zb1->zb_level) * (dnp->dn_indblkshift - SPA_BLKPTRSHIFT));

	zb2thisobj = zb2->zb_object ? zb2->zb_object :
	zb2->zb_blkid << (DNODE_BLOCK_SHIFT - DNODE_SHIFT);

	if (zb1->zb_object == 0) {
	uint64_t nextobj = zb1nextL0 *
	(dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT) >> DNODE_SHIFT;
	return (nextobj <= zb2thisobj);
	}

	if (zb1->zb_object < zb2thisobj)
	return (B_TRUE);
	if (zb1->zb_object > zb2thisobj)
	return (B_FALSE);
	if (zb2->zb_object == 0)
	return (B_FALSE);
	return (zb1nextL0 <= zb2->zb_blkid);
	}

	static boolean_t
	scrub_pause(dsl_pool_t *dp, const zbookmark_t *zb)
	{
	int elapsed_ticks;
	int mintime;

	if (dp->dp_scrub_pausing)
	return (B_TRUE); /* we're already pausing */

	if (!bookmark_is_zero(&dp->dp_scrub_bookmark))
	return (B_FALSE); /* we're resuming */

	/* We only know how to resume from level-0 blocks. */
	if (zb->zb_level != 0)
	return (B_FALSE);

	mintime = dp->dp_scrub_isresilver ? zfs_resilver_min_time :
	zfs_scrub_min_time;
	elapsed_ticks = lbolt64 - dp->dp_scrub_start_time;
	if (elapsed_ticks > hz * zfs_txg_timeout ||
	(elapsed_ticks > hz * mintime && txg_sync_waiting(dp))) {
	dprintf("pausing at %llx/%llx/%llx/%llx\n",
	(longlong_t)zb->zb_objset, (longlong_t)zb->zb_object,
	(longlong_t)zb->zb_level, (longlong_t)zb->zb_blkid);
	dp->dp_scrub_pausing = B_TRUE;
	dp->dp_scrub_bookmark = *zb;
	return (B_TRUE);
	}
	return (B_FALSE);
	}

	typedef struct zil_traverse_arg {
	dsl_pool_t *zta_dp;
	zil_header_t *zta_zh;
	} zil_traverse_arg_t;

	/* ARGSUSED */
	static void
	traverse_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
	{
	zil_traverse_arg_t *zta = arg;
	dsl_pool_t *dp = zta->zta_dp;
	zil_header_t *zh = zta->zta_zh;
	zbookmark_t zb;

	if (bp->blk_birth <= dp->dp_scrub_min_txg)
	return;

	/*
	* One block ("stumpy") can be allocated a long time ago; we
	* want to visit that one because it has been allocated
	* (on-disk) even if it hasn't been claimed (even though for
	* plain scrub there's nothing to do to it).
	*/
	if (claim_txg == 0 && bp->blk_birth >= spa_first_txg(dp->dp_spa))
	return;

	zb.zb_objset = zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET];
	zb.zb_object = 0;
	zb.zb_level = -1;
	zb.zb_blkid = bp->blk_cksum.zc_word[ZIL_ZC_SEQ];
	VERIFY(0 == scrub_funcs[dp->dp_scrub_func](dp, bp, &zb));
	}

	/* ARGSUSED */
	static void
	traverse_zil_record(zilog_t *zilog, lr_t *lrc, void *arg, uint64_t claim_txg)
	{
	if (lrc->lrc_txtype == TX_WRITE) {
	zil_traverse_arg_t *zta = arg;
	dsl_pool_t *dp = zta->zta_dp;
	zil_header_t *zh = zta->zta_zh;
	lr_write_t *lr = (lr_write_t *)lrc;
	blkptr_t *bp = &lr->lr_blkptr;
	zbookmark_t zb;

	if (bp->blk_birth <= dp->dp_scrub_min_txg)
	return;

	/*
	* birth can be < claim_txg if this record's txg is
	* already txg sync'ed (but this log block contains
	* other records that are not synced)
	*/
	if (claim_txg == 0 || bp->blk_birth < claim_txg)
	return;

	zb.zb_objset = zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET];
	zb.zb_object = lr->lr_foid;
	zb.zb_level = BP_GET_LEVEL(bp);
	zb.zb_blkid = lr->lr_offset / BP_GET_LSIZE(bp);
	VERIFY(0 == scrub_funcs[dp->dp_scrub_func](dp, bp, &zb));
	}
	}

	static void
	traverse_zil(dsl_pool_t *dp, zil_header_t *zh)
	{
	uint64_t claim_txg = zh->zh_claim_txg;
	zil_traverse_arg_t zta = { dp, zh };
	zilog_t *zilog;

	/*
	* We only want to visit blocks that have been claimed but not yet
	* replayed (or, in read-only mode, blocks that would be claimed).
	*/
	if (claim_txg == 0 && (spa_mode & FWRITE))
	return;

	zilog = zil_alloc(dp->dp_meta_objset, zh);

	(void) zil_parse(zilog, traverse_zil_block, traverse_zil_record, &zta,
	claim_txg);

	zil_free(zilog);
	}

	static void
	scrub_visitbp(dsl_pool_t *dp, dnode_phys_t *dnp,
	arc_buf_t *pbuf, blkptr_t *bp, const zbookmark_t *zb)
	{
	int err;
	arc_buf_t *buf = NULL;

	if (bp->blk_birth == 0)
	return;

	if (bp->blk_birth <= dp->dp_scrub_min_txg)
	return;

	if (scrub_pause(dp, zb))
	return;

	if (!bookmark_is_zero(&dp->dp_scrub_bookmark)) {
	/*
	* If we already visited this bp & everything below (in
	* a prior txg), don't bother doing it again.
	*/
	if (bookmark_is_before(dnp, zb, &dp->dp_scrub_bookmark))
	return;

	/*
	* If we found the block we're trying to resume from, or
	* we went past it to a different object, zero it out to
	* indicate that it's OK to start checking for pausing
	* again.
	*/
	if (bcmp(zb, &dp->dp_scrub_bookmark, sizeof (*zb)) == 0 ||
	zb->zb_object > dp->dp_scrub_bookmark.zb_object) {
	dprintf("resuming at %llx/%llx/%llx/%llx\n",
	(longlong_t)zb->zb_objset,
	(longlong_t)zb->zb_object,
	(longlong_t)zb->zb_level,
	(longlong_t)zb->zb_blkid);
	bzero(&dp->dp_scrub_bookmark, sizeof (*zb));
	}
	}

	if (BP_GET_LEVEL(bp) > 0) {
	uint32_t flags = ARC_WAIT;
	int i;
	blkptr_t *cbp;
	int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;

	err = arc_read(NULL, dp->dp_spa, bp, pbuf,
	arc_getbuf_func, &buf,
	ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
	if (err) {
	mutex_enter(&dp->dp_spa->spa_scrub_lock);
	dp->dp_spa->spa_scrub_errors++;
	mutex_exit(&dp->dp_spa->spa_scrub_lock);
	return;
	}
	cbp = buf->b_data;

	for (i = 0; i < epb; i++, cbp++) {
	zbookmark_t czb;

	SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
	zb->zb_level - 1,
	zb->zb_blkid * epb + i);
	scrub_visitbp(dp, dnp, buf, cbp, &czb);
	}
	} else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {
	uint32_t flags = ARC_WAIT;
	dnode_phys_t *child_dnp;
	int i, j;
	int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;

	err = arc_read(NULL, dp->dp_spa, bp, pbuf,
	arc_getbuf_func, &buf,
	ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
	if (err) {
	mutex_enter(&dp->dp_spa->spa_scrub_lock);
	dp->dp_spa->spa_scrub_errors++;
	mutex_exit(&dp->dp_spa->spa_scrub_lock);
	return;
	}
	child_dnp = buf->b_data;

	for (i = 0; i < epb; i++, child_dnp++) {
	for (j = 0; j < child_dnp->dn_nblkptr; j++) {
	zbookmark_t czb;

	SET_BOOKMARK(&czb, zb->zb_objset,
	zb->zb_blkid * epb + i,
	child_dnp->dn_nlevels - 1, j);
	scrub_visitbp(dp, child_dnp, buf,
	&child_dnp->dn_blkptr[j], &czb);
	}
	}
	} else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
	uint32_t flags = ARC_WAIT;
	objset_phys_t *osp;
	int j;

	err = arc_read_nolock(NULL, dp->dp_spa, bp,
	arc_getbuf_func, &buf,
	ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
	if (err) {
	mutex_enter(&dp->dp_spa->spa_scrub_lock);
	dp->dp_spa->spa_scrub_errors++;
	mutex_exit(&dp->dp_spa->spa_scrub_lock);
	return;
	}

	osp = buf->b_data;

	traverse_zil(dp, &osp->os_zil_header);

	for (j = 0; j < osp->os_meta_dnode.dn_nblkptr; j++) {
	zbookmark_t czb;

	SET_BOOKMARK(&czb, zb->zb_objset, 0,
	osp->os_meta_dnode.dn_nlevels - 1, j);
	scrub_visitbp(dp, &osp->os_meta_dnode, buf,
	&osp->os_meta_dnode.dn_blkptr[j], &czb);
	}
	}

	(void) scrub_funcs[dp->dp_scrub_func](dp, bp, zb);
	if (buf)
	(void) arc_buf_remove_ref(buf, &buf);
	}

	static void
	scrub_visit_rootbp(dsl_pool_t *dp, dsl_dataset_t *ds, blkptr_t *bp)
	{
	zbookmark_t zb;

	SET_BOOKMARK(&zb, ds ? ds->ds_object : 0, 0, -1, 0);
	scrub_visitbp(dp, NULL, NULL, bp, &zb);
	}

	void
	dsl_pool_ds_destroyed(dsl_dataset_t *ds, dmu_tx_t *tx)
	{
	dsl_pool_t *dp = ds->ds_dir->dd_pool;

	if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
	return;

	if (dp->dp_scrub_bookmark.zb_objset == ds->ds_object) {
	SET_BOOKMARK(&dp->dp_scrub_bookmark, -1, 0, 0, 0);
	} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds->ds_object, tx) != 0) {
	return;
	}

	if (ds->ds_phys->ds_next_snap_obj != 0) {
	VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds->ds_phys->ds_next_snap_obj, tx) == 0);
	}
	ASSERT3U(ds->ds_phys->ds_num_children, <=, 1);
	}

	void
	dsl_pool_ds_snapshotted(dsl_dataset_t *ds, dmu_tx_t *tx)
	{
	dsl_pool_t *dp = ds->ds_dir->dd_pool;

	if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
	return;

	ASSERT(ds->ds_phys->ds_prev_snap_obj != 0);

	if (dp->dp_scrub_bookmark.zb_objset == ds->ds_object) {
	dp->dp_scrub_bookmark.zb_objset =
	ds->ds_phys->ds_prev_snap_obj;
	} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds->ds_object, tx) == 0) {
	VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds->ds_phys->ds_prev_snap_obj, tx) == 0);
	}
	}

	void
	dsl_pool_ds_clone_swapped(dsl_dataset_t *ds1, dsl_dataset_t *ds2, dmu_tx_t *tx)
	{
	dsl_pool_t *dp = ds1->ds_dir->dd_pool;

	if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
	return;

	if (dp->dp_scrub_bookmark.zb_objset == ds1->ds_object) {
	dp->dp_scrub_bookmark.zb_objset = ds2->ds_object;
	} else if (dp->dp_scrub_bookmark.zb_objset == ds2->ds_object) {
	dp->dp_scrub_bookmark.zb_objset = ds1->ds_object;
	}

	if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds1->ds_object, tx) == 0) {
	int err = zap_add_int(dp->dp_meta_objset,
	dp->dp_scrub_queue_obj, ds2->ds_object, tx);
	VERIFY(err == 0 || err == EEXIST);
	if (err == EEXIST) {
	/* Both were there to begin with */
	VERIFY(0 == zap_add_int(dp->dp_meta_objset,
	dp->dp_scrub_queue_obj, ds1->ds_object, tx));
	}
	} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds2->ds_object, tx) == 0) {
	VERIFY(0 == zap_add_int(dp->dp_meta_objset,
	dp->dp_scrub_queue_obj, ds1->ds_object, tx));
	}
	}

	struct enqueue_clones_arg {
	dmu_tx_t *tx;
	uint64_t originobj;
	};

	/* ARGSUSED */
	static int
	enqueue_clones_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
	{
	struct enqueue_clones_arg *eca = arg;
	dsl_dataset_t *ds;
	int err;
	dsl_pool_t *dp;

	err = dsl_dataset_hold_obj(spa->spa_dsl_pool, dsobj, FTAG, &ds);
	if (err)
	return (err);
	dp = ds->ds_dir->dd_pool;

	if (ds->ds_dir->dd_phys->dd_origin_obj == eca->originobj) {
	while (ds->ds_phys->ds_prev_snap_obj != eca->originobj) {
	dsl_dataset_t *prev;
	err = dsl_dataset_hold_obj(dp,
	ds->ds_phys->ds_prev_snap_obj, FTAG, &prev);

	dsl_dataset_rele(ds, FTAG);
	if (err)
	return (err);
	ds = prev;
	}
	VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds->ds_object, eca->tx) == 0);
	}
	dsl_dataset_rele(ds, FTAG);
	return (0);
	}

	static void
	scrub_visitds(dsl_pool_t *dp, uint64_t dsobj, dmu_tx_t *tx)
	{
	dsl_dataset_t *ds;
	uint64_t min_txg_save;

	VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));

	/*
	* Iterate over the bps in this ds.
	*/
	min_txg_save = dp->dp_scrub_min_txg;
	dp->dp_scrub_min_txg =
	MAX(dp->dp_scrub_min_txg, ds->ds_phys->ds_prev_snap_txg);
	scrub_visit_rootbp(dp, ds, &ds->ds_phys->ds_bp);
	dp->dp_scrub_min_txg = min_txg_save;

	if (dp->dp_scrub_pausing)
	goto out;

	/*
	* Add descendent datasets to work queue.
	*/
	if (ds->ds_phys->ds_next_snap_obj != 0) {
	VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds->ds_phys->ds_next_snap_obj, tx) == 0);
	}
	if (ds->ds_phys->ds_num_children > 1) {
	if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
	struct enqueue_clones_arg eca;
	eca.tx = tx;
	eca.originobj = ds->ds_object;

	(void) dmu_objset_find_spa(ds->ds_dir->dd_pool->dp_spa,
	NULL, enqueue_clones_cb, &eca, DS_FIND_CHILDREN);
	} else {
	VERIFY(zap_join(dp->dp_meta_objset,
	ds->ds_phys->ds_next_clones_obj,
	dp->dp_scrub_queue_obj, tx) == 0);
	}
	}

	out:
	dsl_dataset_rele(ds, FTAG);
	}

	/* ARGSUSED */
	static int
	enqueue_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
	{
	dmu_tx_t *tx = arg;
	dsl_dataset_t *ds;
	int err;
	dsl_pool_t *dp;

	err = dsl_dataset_hold_obj(spa->spa_dsl_pool, dsobj, FTAG, &ds);
	if (err)
	return (err);

	dp = ds->ds_dir->dd_pool;

	while (ds->ds_phys->ds_prev_snap_obj != 0) {
	dsl_dataset_t *prev;
	err = dsl_dataset_hold_obj(dp, ds->ds_phys->ds_prev_snap_obj,
	FTAG, &prev);
	if (err) {
	dsl_dataset_rele(ds, FTAG);
	return (err);
	}

	/*
	* If this is a clone, we don't need to worry about it for now.
	*/
	if (prev->ds_phys->ds_next_snap_obj != ds->ds_object) {
	dsl_dataset_rele(ds, FTAG);
	dsl_dataset_rele(prev, FTAG);
	return (0);
	}
	dsl_dataset_rele(ds, FTAG);
	ds = prev;
	}

	VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
	ds->ds_object, tx) == 0);
	dsl_dataset_rele(ds, FTAG);
	return (0);
	}

	void
	dsl_pool_scrub_sync(dsl_pool_t *dp, dmu_tx_t *tx)
	{
	zap_cursor_t zc;
	zap_attribute_t za;
	boolean_t complete = B_TRUE;

	if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
	return;

	/* If the spa is not fully loaded, don't bother. */
	if (dp->dp_spa->spa_load_state != SPA_LOAD_NONE)
	return;

	if (dp->dp_scrub_restart) {
	enum scrub_func func = dp->dp_scrub_func;
	dp->dp_scrub_restart = B_FALSE;
	dsl_pool_scrub_setup_sync(dp, &func, kcred, tx);
	}

	if (dp->dp_spa->spa_root_vdev->vdev_stat.vs_scrub_type == 0) {
	/*
	* We must have resumed after rebooting; reset the vdev
	* stats to know that we're doing a scrub (although it
	* will think we're just starting now).
	*/
	vdev_scrub_stat_update(dp->dp_spa->spa_root_vdev,
	dp->dp_scrub_min_txg ? POOL_SCRUB_RESILVER :
	POOL_SCRUB_EVERYTHING, B_FALSE);
	}

	dp->dp_scrub_pausing = B_FALSE;
	dp->dp_scrub_start_time = lbolt64;
	dp->dp_scrub_isresilver = (dp->dp_scrub_min_txg != 0);
	dp->dp_spa->spa_scrub_active = B_TRUE;

	if (dp->dp_scrub_bookmark.zb_objset == 0) {
	/* First do the MOS & ORIGIN */
	scrub_visit_rootbp(dp, NULL, &dp->dp_meta_rootbp);
	if (dp->dp_scrub_pausing)
	goto out;

	if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
	VERIFY(0 == dmu_objset_find_spa(dp->dp_spa,
	NULL, enqueue_cb, tx, DS_FIND_CHILDREN));
	} else {
	scrub_visitds(dp, dp->dp_origin_snap->ds_object, tx);
	}
	ASSERT(!dp->dp_scrub_pausing);
	} else if (dp->dp_scrub_bookmark.zb_objset != -1ULL) {
	/*
	* If we were paused, continue from here. Note if the
	* ds we were paused on was deleted, the zb_objset will
	* be -1, so we will skip this and find a new objset
	* below.
	*/
	scrub_visitds(dp, dp->dp_scrub_bookmark.zb_objset, tx);
	if (dp->dp_scrub_pausing)
	goto out;
	}

	/*
	* In case we were paused right at the end of the ds, zero the
	* bookmark so we don't think that we're still trying to resume.
	*/
	bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));

	/* keep pulling things out of the zap-object-as-queue */
	while (zap_cursor_init(&zc, dp->dp_meta_objset, dp->dp_scrub_queue_obj),
	zap_cursor_retrieve(&zc, &za) == 0) {
	VERIFY(0 == zap_remove(dp->dp_meta_objset,
	dp->dp_scrub_queue_obj, za.za_name, tx));
	scrub_visitds(dp, za.za_first_integer, tx);
	if (dp->dp_scrub_pausing)
	break;
	zap_cursor_fini(&zc);
	}
	zap_cursor_fini(&zc);
	if (dp->dp_scrub_pausing)
	goto out;

	/* done. */

	dsl_pool_scrub_cancel_sync(dp, &complete, kcred, tx);
	return;
	out:
	VERIFY(0 == zap_update(dp->dp_meta_objset,
	DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4,
	&dp->dp_scrub_bookmark, tx));
	VERIFY(0 == zap_update(dp->dp_meta_objset,
	DMU_POOL_DIRECTORY_OBJECT,
	DMU_POOL_SCRUB_ERRORS, sizeof (uint64_t), 1,
	&dp->dp_spa->spa_scrub_errors, tx));

	/* XXX this is scrub-clean specific */
	mutex_enter(&dp->dp_spa->spa_scrub_lock);
	while (dp->dp_spa->spa_scrub_inflight > 0) {
	cv_wait(&dp->dp_spa->spa_scrub_io_cv,
	&dp->dp_spa->spa_scrub_lock);
	}
	mutex_exit(&dp->dp_spa->spa_scrub_lock);
	}

	void
	dsl_pool_scrub_restart(dsl_pool_t *dp)
	{
	mutex_enter(&dp->dp_scrub_cancel_lock);
	dp->dp_scrub_restart = B_TRUE;
	mutex_exit(&dp->dp_scrub_cancel_lock);
	}

	/*
	* scrub consumers
	*/

	static void
	count_block(zfs_all_blkstats_t *zab, const blkptr_t *bp)
	{
	int i;

	/*
	* If we resume after a reboot, zab will be NULL; don't record
	* incomplete stats in that case.
	*/
	if (zab == NULL)
	return;

	for (i = 0; i < 4; i++) {
	int l = (i < 2) ? BP_GET_LEVEL(bp) : DN_MAX_LEVELS;
	int t = (i & 1) ? BP_GET_TYPE(bp) : DMU_OT_TOTAL;
	zfs_blkstat_t *zb = &zab->zab_type[l][t];
	int equal;

	zb->zb_count++;
	zb->zb_asize += BP_GET_ASIZE(bp);
	zb->zb_lsize += BP_GET_LSIZE(bp);
	zb->zb_psize += BP_GET_PSIZE(bp);
	zb->zb_gangs += BP_COUNT_GANG(bp);

	switch (BP_GET_NDVAS(bp)) {
	case 2:
	if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
	DVA_GET_VDEV(&bp->blk_dva[1]))
	zb->zb_ditto_2_of_2_samevdev++;
	break;
	case 3:
	equal = (DVA_GET_VDEV(&bp->blk_dva[0]) ==
	DVA_GET_VDEV(&bp->blk_dva[1])) +
	(DVA_GET_VDEV(&bp->blk_dva[0]) ==
	DVA_GET_VDEV(&bp->blk_dva[2])) +
	(DVA_GET_VDEV(&bp->blk_dva[1]) ==
	DVA_GET_VDEV(&bp->blk_dva[2]));
	if (equal == 1)
	zb->zb_ditto_2_of_3_samevdev++;
	else if (equal == 3)
	zb->zb_ditto_3_of_3_samevdev++;
	break;
	}
	}
	}

	static void
	dsl_pool_scrub_clean_done(zio_t *zio)
	{
	spa_t *spa = zio->io_spa;

	zio_data_buf_free(zio->io_data, zio->io_size);

	mutex_enter(&spa->spa_scrub_lock);
	spa->spa_scrub_inflight--;
	cv_broadcast(&spa->spa_scrub_io_cv);

	if (zio->io_error && (zio->io_error != ECKSUM ||
	!(zio->io_flags & ZIO_FLAG_SPECULATIVE)))
	spa->spa_scrub_errors++;
	mutex_exit(&spa->spa_scrub_lock);
	}

	static int
	dsl_pool_scrub_clean_cb(dsl_pool_t *dp,
	const blkptr_t *bp, const zbookmark_t *zb)
	{
	size_t size = BP_GET_LSIZE(bp);
	int d;
	spa_t *spa = dp->dp_spa;
	boolean_t needs_io;
	int zio_flags = ZIO_FLAG_SCRUB_THREAD | ZIO_FLAG_CANFAIL;
	int zio_priority;

	count_block(dp->dp_blkstats, bp);

	if (dp->dp_scrub_isresilver == 0) {
	/* It's a scrub */
	zio_flags |= ZIO_FLAG_SCRUB;
	zio_priority = ZIO_PRIORITY_SCRUB;
	needs_io = B_TRUE;
	} else {
	/* It's a resilver */
	zio_flags |= ZIO_FLAG_RESILVER;
	zio_priority = ZIO_PRIORITY_RESILVER;
	needs_io = B_FALSE;
	}

	/* If it's an intent log block, failure is expected. */
	if (zb->zb_level == -1 && BP_GET_TYPE(bp) != DMU_OT_OBJSET)
	zio_flags |= ZIO_FLAG_SPECULATIVE;

	for (d = 0; d < BP_GET_NDVAS(bp); d++) {
	vdev_t *vd = vdev_lookup_top(spa,
	DVA_GET_VDEV(&bp->blk_dva[d]));

	/*
	* Keep track of how much data we've examined so that
	* zpool(1M) status can make useful progress reports.
	*/
	mutex_enter(&vd->vdev_stat_lock);
	vd->vdev_stat.vs_scrub_examined +=
	DVA_GET_ASIZE(&bp->blk_dva[d]);
	mutex_exit(&vd->vdev_stat_lock);

	/* if it's a resilver, this may not be in the target range */
	if (!needs_io) {
	if (DVA_GET_GANG(&bp->blk_dva[d])) {
	/*
	* Gang members may be spread across multiple
	* vdevs, so the best we can do is look at the
	* pool-wide DTL.
	* XXX -- it would be better to change our
	* allocation policy to ensure that this can't
	* happen.
	*/
	vd = spa->spa_root_vdev;
	}
	needs_io = vdev_dtl_contains(&vd->vdev_dtl_map,
	bp->blk_birth, 1);
	}
	}

	if (needs_io && !zfs_no_scrub_io) {
	void *data = zio_data_buf_alloc(size);

	mutex_enter(&spa->spa_scrub_lock);
	while (spa->spa_scrub_inflight >= spa->spa_scrub_maxinflight)
	cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
	spa->spa_scrub_inflight++;
	mutex_exit(&spa->spa_scrub_lock);

	zio_nowait(zio_read(NULL, spa, bp, data, size,
	dsl_pool_scrub_clean_done, NULL, zio_priority,
	zio_flags, zb));
	}

	/* do not relocate this block */
	return (0);
	}

	int
	dsl_pool_scrub_clean(dsl_pool_t *dp)
	{
	+ spa_t *spa = dp->dp_spa;
	+
	/*
	* Purge all vdev caches. We do this here rather than in sync
	* context because this requires a writer lock on the spa_config
	* lock, which we can't do from sync context. The
	* spa_scrub_reopen flag indicates that vdev_open() should not
	* attempt to start another scrub.
	*/
	- spa_config_enter(dp->dp_spa, SCL_ALL, FTAG, RW_WRITER);
	- dp->dp_spa->spa_scrub_reopen = B_TRUE;
	- vdev_reopen(dp->dp_spa->spa_root_vdev);
	- dp->dp_spa->spa_scrub_reopen = B_FALSE;
	- spa_config_exit(dp->dp_spa, SCL_ALL, FTAG);
	+ spa_vdev_state_enter(spa);
	+ spa->spa_scrub_reopen = B_TRUE;
	+ vdev_reopen(spa->spa_root_vdev);
	+ spa->spa_scrub_reopen = B_FALSE;
	+ (void) spa_vdev_state_exit(spa, NULL, 0);

	return (dsl_pool_scrub_setup(dp, SCRUB_FUNC_CLEAN));
	}
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c (revision 209274)
	@@ -1,1209 +1,1209 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/

	/*
	- * Copyright 2008 Sun Microsystems, Inc. All rights reserved.
	+ * Copyright 2009 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	#include <sys/zfs_context.h>
	#include <sys/spa.h>
	#include <sys/vdev_impl.h>
	#include <sys/zio.h>
	#include <sys/zio_checksum.h>
	#include <sys/fs/zfs.h>
	#include <sys/fm/fs/zfs.h>

	/*
	* Virtual device vector for RAID-Z.
	*
	* This vdev supports both single and double parity. For single parity, we
	* use a simple XOR of all the data columns. For double parity, we use both
	* the simple XOR as well as a technique described in "The mathematics of
	* RAID-6" by H. Peter Anvin. This technique defines a Galois field, GF(2^8),
	* over the integers expressable in a single byte. Briefly, the operations on
	* the field are defined as follows:
	*
	* o addition (+) is represented by a bitwise XOR
	* o subtraction (-) is therefore identical to addition: A + B = A - B
	* o multiplication of A by 2 is defined by the following bitwise expression:
	* (A * 2)_7 = A_6
	* (A * 2)_6 = A_5
	* (A * 2)_5 = A_4
	* (A * 2)_4 = A_3 + A_7
	* (A * 2)_3 = A_2 + A_7
	* (A * 2)_2 = A_1 + A_7
	* (A * 2)_1 = A_0
	* (A * 2)_0 = A_7
	*
	* In C, multiplying by 2 is therefore ((a << 1) ^ ((a & 0x80) ? 0x1d : 0)).
	*
	* Observe that any number in the field (except for 0) can be expressed as a
	* power of 2 -- a generator for the field. We store a table of the powers of
	* 2 and logs base 2 for quick look ups, and exploit the fact that A * B can
	* be rewritten as 2^(log_2(A) + log_2(B)) (where '+' is normal addition rather
	* than field addition). The inverse of a field element A (A^-1) is A^254.
	*
	* The two parity columns, P and Q, over several data columns, D_0, ... D_n-1,
	* can be expressed by field operations:
	*
	* P = D_0 + D_1 + ... + D_n-2 + D_n-1
	* Q = 2^n-1 * D_0 + 2^n-2 * D_1 + ... + 2^1 * D_n-2 + 2^0 * D_n-1
	* = ((...((D_0) * 2 + D_1) * 2 + ...) * 2 + D_n-2) * 2 + D_n-1
	*
	* See the reconstruction code below for how P and Q can used individually or
	* in concert to recover missing data columns.
	*/

	typedef struct raidz_col {
	uint64_t rc_devidx; /* child device index for I/O */
	uint64_t rc_offset; /* device offset */
	uint64_t rc_size; /* I/O size */
	void *rc_data; /* I/O data */
	int rc_error; /* I/O error for this device */
	uint8_t rc_tried; /* Did we attempt this I/O column? */
	uint8_t rc_skipped; /* Did we skip this I/O column? */
	} raidz_col_t;

	typedef struct raidz_map {
	uint64_t rm_cols; /* Column count */
	uint64_t rm_bigcols; /* Number of oversized columns */
	uint64_t rm_asize; /* Actual total I/O size */
	uint64_t rm_missingdata; /* Count of missing data devices */
	uint64_t rm_missingparity; /* Count of missing parity devices */
	uint64_t rm_firstdatacol; /* First data column/parity count */
	raidz_col_t rm_col[1]; /* Flexible array of I/O columns */
	} raidz_map_t;

	#define VDEV_RAIDZ_P 0
	#define VDEV_RAIDZ_Q 1

	#define VDEV_RAIDZ_MAXPARITY 2

	#define VDEV_RAIDZ_MUL_2(a) (((a) << 1) ^ (((a) & 0x80) ? 0x1d : 0))

	/*
	* These two tables represent powers and logs of 2 in the Galois field defined
	* above. These values were computed by repeatedly multiplying by 2 as above.
	*/
	static const uint8_t vdev_raidz_pow2[256] = {
	0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
	0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13, 0x26,
	0x4c, 0x98, 0x2d, 0x5a, 0xb4, 0x75, 0xea, 0xc9,
	0x8f, 0x03, 0x06, 0x0c, 0x18, 0x30, 0x60, 0xc0,
	0x9d, 0x27, 0x4e, 0x9c, 0x25, 0x4a, 0x94, 0x35,
	0x6a, 0xd4, 0xb5, 0x77, 0xee, 0xc1, 0x9f, 0x23,
	0x46, 0x8c, 0x05, 0x0a, 0x14, 0x28, 0x50, 0xa0,
	0x5d, 0xba, 0x69, 0xd2, 0xb9, 0x6f, 0xde, 0xa1,
	0x5f, 0xbe, 0x61, 0xc2, 0x99, 0x2f, 0x5e, 0xbc,
	0x65, 0xca, 0x89, 0x0f, 0x1e, 0x3c, 0x78, 0xf0,
	0xfd, 0xe7, 0xd3, 0xbb, 0x6b, 0xd6, 0xb1, 0x7f,
	0xfe, 0xe1, 0xdf, 0xa3, 0x5b, 0xb6, 0x71, 0xe2,
	0xd9, 0xaf, 0x43, 0x86, 0x11, 0x22, 0x44, 0x88,
	0x0d, 0x1a, 0x34, 0x68, 0xd0, 0xbd, 0x67, 0xce,
	0x81, 0x1f, 0x3e, 0x7c, 0xf8, 0xed, 0xc7, 0x93,
	0x3b, 0x76, 0xec, 0xc5, 0x97, 0x33, 0x66, 0xcc,
	0x85, 0x17, 0x2e, 0x5c, 0xb8, 0x6d, 0xda, 0xa9,
	0x4f, 0x9e, 0x21, 0x42, 0x84, 0x15, 0x2a, 0x54,
	0xa8, 0x4d, 0x9a, 0x29, 0x52, 0xa4, 0x55, 0xaa,
	0x49, 0x92, 0x39, 0x72, 0xe4, 0xd5, 0xb7, 0x73,
	0xe6, 0xd1, 0xbf, 0x63, 0xc6, 0x91, 0x3f, 0x7e,
	0xfc, 0xe5, 0xd7, 0xb3, 0x7b, 0xf6, 0xf1, 0xff,
	0xe3, 0xdb, 0xab, 0x4b, 0x96, 0x31, 0x62, 0xc4,
	0x95, 0x37, 0x6e, 0xdc, 0xa5, 0x57, 0xae, 0x41,
	0x82, 0x19, 0x32, 0x64, 0xc8, 0x8d, 0x07, 0x0e,
	0x1c, 0x38, 0x70, 0xe0, 0xdd, 0xa7, 0x53, 0xa6,
	0x51, 0xa2, 0x59, 0xb2, 0x79, 0xf2, 0xf9, 0xef,
	0xc3, 0x9b, 0x2b, 0x56, 0xac, 0x45, 0x8a, 0x09,
	0x12, 0x24, 0x48, 0x90, 0x3d, 0x7a, 0xf4, 0xf5,
	0xf7, 0xf3, 0xfb, 0xeb, 0xcb, 0x8b, 0x0b, 0x16,
	0x2c, 0x58, 0xb0, 0x7d, 0xfa, 0xe9, 0xcf, 0x83,
	0x1b, 0x36, 0x6c, 0xd8, 0xad, 0x47, 0x8e, 0x01
	};
	static const uint8_t vdev_raidz_log2[256] = {
	0x00, 0x00, 0x01, 0x19, 0x02, 0x32, 0x1a, 0xc6,
	0x03, 0xdf, 0x33, 0xee, 0x1b, 0x68, 0xc7, 0x4b,
	0x04, 0x64, 0xe0, 0x0e, 0x34, 0x8d, 0xef, 0x81,
	0x1c, 0xc1, 0x69, 0xf8, 0xc8, 0x08, 0x4c, 0x71,
	0x05, 0x8a, 0x65, 0x2f, 0xe1, 0x24, 0x0f, 0x21,
	0x35, 0x93, 0x8e, 0xda, 0xf0, 0x12, 0x82, 0x45,
	0x1d, 0xb5, 0xc2, 0x7d, 0x6a, 0x27, 0xf9, 0xb9,
	0xc9, 0x9a, 0x09, 0x78, 0x4d, 0xe4, 0x72, 0xa6,
	0x06, 0xbf, 0x8b, 0x62, 0x66, 0xdd, 0x30, 0xfd,
	0xe2, 0x98, 0x25, 0xb3, 0x10, 0x91, 0x22, 0x88,
	0x36, 0xd0, 0x94, 0xce, 0x8f, 0x96, 0xdb, 0xbd,
	0xf1, 0xd2, 0x13, 0x5c, 0x83, 0x38, 0x46, 0x40,
	0x1e, 0x42, 0xb6, 0xa3, 0xc3, 0x48, 0x7e, 0x6e,
	0x6b, 0x3a, 0x28, 0x54, 0xfa, 0x85, 0xba, 0x3d,
	0xca, 0x5e, 0x9b, 0x9f, 0x0a, 0x15, 0x79, 0x2b,
	0x4e, 0xd4, 0xe5, 0xac, 0x73, 0xf3, 0xa7, 0x57,
	0x07, 0x70, 0xc0, 0xf7, 0x8c, 0x80, 0x63, 0x0d,
	0x67, 0x4a, 0xde, 0xed, 0x31, 0xc5, 0xfe, 0x18,
	0xe3, 0xa5, 0x99, 0x77, 0x26, 0xb8, 0xb4, 0x7c,
	0x11, 0x44, 0x92, 0xd9, 0x23, 0x20, 0x89, 0x2e,
	0x37, 0x3f, 0xd1, 0x5b, 0x95, 0xbc, 0xcf, 0xcd,
	0x90, 0x87, 0x97, 0xb2, 0xdc, 0xfc, 0xbe, 0x61,
	0xf2, 0x56, 0xd3, 0xab, 0x14, 0x2a, 0x5d, 0x9e,
	0x84, 0x3c, 0x39, 0x53, 0x47, 0x6d, 0x41, 0xa2,
	0x1f, 0x2d, 0x43, 0xd8, 0xb7, 0x7b, 0xa4, 0x76,
	0xc4, 0x17, 0x49, 0xec, 0x7f, 0x0c, 0x6f, 0xf6,
	0x6c, 0xa1, 0x3b, 0x52, 0x29, 0x9d, 0x55, 0xaa,
	0xfb, 0x60, 0x86, 0xb1, 0xbb, 0xcc, 0x3e, 0x5a,
	0xcb, 0x59, 0x5f, 0xb0, 0x9c, 0xa9, 0xa0, 0x51,
	0x0b, 0xf5, 0x16, 0xeb, 0x7a, 0x75, 0x2c, 0xd7,
	0x4f, 0xae, 0xd5, 0xe9, 0xe6, 0xe7, 0xad, 0xe8,
	0x74, 0xd6, 0xf4, 0xea, 0xa8, 0x50, 0x58, 0xaf,
	};

	/*
	* Multiply a given number by 2 raised to the given power.
	*/
	static uint8_t
	vdev_raidz_exp2(uint_t a, int exp)
	{
	if (a == 0)
	return (0);

	ASSERT(exp >= 0);
	ASSERT(vdev_raidz_log2[a] > 0 || a == 1);

	exp += vdev_raidz_log2[a];
	if (exp > 255)
	exp -= 255;

	return (vdev_raidz_pow2[exp]);
	}

	static void
	vdev_raidz_map_free(zio_t *zio)
	{
	raidz_map_t *rm = zio->io_vsd;
	int c;

	for (c = 0; c < rm->rm_firstdatacol; c++)
	zio_buf_free(rm->rm_col[c].rc_data, rm->rm_col[c].rc_size);

	kmem_free(rm, offsetof(raidz_map_t, rm_col[rm->rm_cols]));
	}

	static raidz_map_t *
	vdev_raidz_map_alloc(zio_t *zio, uint64_t unit_shift, uint64_t dcols,
	uint64_t nparity)
	{
	raidz_map_t *rm;
	uint64_t b = zio->io_offset >> unit_shift;
	uint64_t s = zio->io_size >> unit_shift;
	uint64_t f = b % dcols;
	uint64_t o = (b / dcols) << unit_shift;
	uint64_t q, r, c, bc, col, acols, coff, devidx;

	q = s / (dcols - nparity);
	r = s - q * (dcols - nparity);
	bc = (r == 0 ? 0 : r + nparity);

	acols = (q == 0 ? bc : dcols);

	rm = kmem_alloc(offsetof(raidz_map_t, rm_col[acols]), KM_SLEEP);

	rm->rm_cols = acols;
	rm->rm_bigcols = bc;
	rm->rm_asize = 0;
	rm->rm_missingdata = 0;
	rm->rm_missingparity = 0;
	rm->rm_firstdatacol = nparity;

	for (c = 0; c < acols; c++) {
	col = f + c;
	coff = o;
	if (col >= dcols) {
	col -= dcols;
	coff += 1ULL << unit_shift;
	}
	rm->rm_col[c].rc_devidx = col;
	rm->rm_col[c].rc_offset = coff;
	rm->rm_col[c].rc_size = (q + (c < bc)) << unit_shift;
	rm->rm_col[c].rc_data = NULL;
	rm->rm_col[c].rc_error = 0;
	rm->rm_col[c].rc_tried = 0;
	rm->rm_col[c].rc_skipped = 0;
	rm->rm_asize += rm->rm_col[c].rc_size;
	}

	rm->rm_asize = roundup(rm->rm_asize, (nparity + 1) << unit_shift);

	for (c = 0; c < rm->rm_firstdatacol; c++)
	rm->rm_col[c].rc_data = zio_buf_alloc(rm->rm_col[c].rc_size);

	rm->rm_col[c].rc_data = zio->io_data;

	for (c = c + 1; c < acols; c++)
	rm->rm_col[c].rc_data = (char *)rm->rm_col[c - 1].rc_data +
	rm->rm_col[c - 1].rc_size;

	/*
	* If all data stored spans all columns, there's a danger that parity
	* will always be on the same device and, since parity isn't read
	* during normal operation, that that device's I/O bandwidth won't be
	* used effectively. We therefore switch the parity every 1MB.
	*
	* ... at least that was, ostensibly, the theory. As a practical
	* matter unless we juggle the parity between all devices evenly, we
	* won't see any benefit. Further, occasional writes that aren't a
	* multiple of the LCM of the number of children and the minimum
	* stripe width are sufficient to avoid pessimal behavior.
	* Unfortunately, this decision created an implicit on-disk format
	* requirement that we need to support for all eternity, but only
	* for single-parity RAID-Z.
	*/
	ASSERT(rm->rm_cols >= 2);
	ASSERT(rm->rm_col[0].rc_size == rm->rm_col[1].rc_size);

	if (rm->rm_firstdatacol == 1 && (zio->io_offset & (1ULL << 20))) {
	devidx = rm->rm_col[0].rc_devidx;
	o = rm->rm_col[0].rc_offset;
	rm->rm_col[0].rc_devidx = rm->rm_col[1].rc_devidx;
	rm->rm_col[0].rc_offset = rm->rm_col[1].rc_offset;
	rm->rm_col[1].rc_devidx = devidx;
	rm->rm_col[1].rc_offset = o;
	}

	zio->io_vsd = rm;
	zio->io_vsd_free = vdev_raidz_map_free;
	return (rm);
	}

	static void
	vdev_raidz_generate_parity_p(raidz_map_t *rm)
	{
	uint64_t *p, *src, pcount, ccount, i;
	int c;

	pcount = rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]);

	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	src = rm->rm_col[c].rc_data;
	p = rm->rm_col[VDEV_RAIDZ_P].rc_data;
	ccount = rm->rm_col[c].rc_size / sizeof (src[0]);

	if (c == rm->rm_firstdatacol) {
	ASSERT(ccount == pcount);
	for (i = 0; i < ccount; i++, p++, src++) {
	*p = *src;
	}
	} else {
	ASSERT(ccount <= pcount);
	for (i = 0; i < ccount; i++, p++, src++) {
	*p ^= *src;
	}
	}
	}
	}

	static void
	vdev_raidz_generate_parity_pq(raidz_map_t *rm)
	{
	uint64_t *q, *p, *src, pcount, ccount, mask, i;
	int c;

	pcount = rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]);
	ASSERT(rm->rm_col[VDEV_RAIDZ_P].rc_size ==
	rm->rm_col[VDEV_RAIDZ_Q].rc_size);

	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	src = rm->rm_col[c].rc_data;
	p = rm->rm_col[VDEV_RAIDZ_P].rc_data;
	q = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
	ccount = rm->rm_col[c].rc_size / sizeof (src[0]);

	if (c == rm->rm_firstdatacol) {
	ASSERT(ccount == pcount || ccount == 0);
	for (i = 0; i < ccount; i++, p++, q++, src++) {
	*q = *src;
	*p = *src;
	}
	for (; i < pcount; i++, p++, q++, src++) {
	*q = 0;
	*p = 0;
	}
	} else {
	ASSERT(ccount <= pcount);

	/*
	* Rather than multiplying each byte individually (as
	* described above), we are able to handle 8 at once
	* by generating a mask based on the high bit in each
	* byte and using that to conditionally XOR in 0x1d.
	*/
	for (i = 0; i < ccount; i++, p++, q++, src++) {
	mask = *q & 0x8080808080808080ULL;
	mask = (mask << 1) - (mask >> 7);
	*q = ((*q << 1) & 0xfefefefefefefefeULL) ^
	(mask & 0x1d1d1d1d1d1d1d1dULL);
	*q ^= *src;
	*p ^= *src;
	}

	/*
	* Treat short columns as though they are full of 0s.
	*/
	for (; i < pcount; i++, q++) {
	mask = *q & 0x8080808080808080ULL;
	mask = (mask << 1) - (mask >> 7);
	*q = ((*q << 1) & 0xfefefefefefefefeULL) ^
	(mask & 0x1d1d1d1d1d1d1d1dULL);
	}
	}
	}
	}

	static void
	vdev_raidz_reconstruct_p(raidz_map_t *rm, int x)
	{
	uint64_t *dst, *src, xcount, ccount, count, i;
	int c;

	xcount = rm->rm_col[x].rc_size / sizeof (src[0]);
	ASSERT(xcount <= rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]));
	ASSERT(xcount > 0);

	src = rm->rm_col[VDEV_RAIDZ_P].rc_data;
	dst = rm->rm_col[x].rc_data;
	for (i = 0; i < xcount; i++, dst++, src++) {
	*dst = *src;
	}

	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	src = rm->rm_col[c].rc_data;
	dst = rm->rm_col[x].rc_data;

	if (c == x)
	continue;

	ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
	count = MIN(ccount, xcount);

	for (i = 0; i < count; i++, dst++, src++) {
	*dst ^= *src;
	}
	}
	}

	static void
	vdev_raidz_reconstruct_q(raidz_map_t *rm, int x)
	{
	uint64_t *dst, *src, xcount, ccount, count, mask, i;
	uint8_t *b;
	int c, j, exp;

	xcount = rm->rm_col[x].rc_size / sizeof (src[0]);
	ASSERT(xcount <= rm->rm_col[VDEV_RAIDZ_Q].rc_size / sizeof (src[0]));

	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	src = rm->rm_col[c].rc_data;
	dst = rm->rm_col[x].rc_data;

	if (c == x)
	ccount = 0;
	else
	ccount = rm->rm_col[c].rc_size / sizeof (src[0]);

	count = MIN(ccount, xcount);

	if (c == rm->rm_firstdatacol) {
	for (i = 0; i < count; i++, dst++, src++) {
	*dst = *src;
	}
	for (; i < xcount; i++, dst++) {
	*dst = 0;
	}

	} else {
	/*
	* For an explanation of this, see the comment in
	* vdev_raidz_generate_parity_pq() above.
	*/
	for (i = 0; i < count; i++, dst++, src++) {
	mask = *dst & 0x8080808080808080ULL;
	mask = (mask << 1) - (mask >> 7);
	*dst = ((*dst << 1) & 0xfefefefefefefefeULL) ^
	(mask & 0x1d1d1d1d1d1d1d1dULL);
	*dst ^= *src;
	}

	for (; i < xcount; i++, dst++) {
	mask = *dst & 0x8080808080808080ULL;
	mask = (mask << 1) - (mask >> 7);
	*dst = ((*dst << 1) & 0xfefefefefefefefeULL) ^
	(mask & 0x1d1d1d1d1d1d1d1dULL);
	}
	}
	}

	src = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
	dst = rm->rm_col[x].rc_data;
	exp = 255 - (rm->rm_cols - 1 - x);

	for (i = 0; i < xcount; i++, dst++, src++) {
	*dst ^= *src;
	for (j = 0, b = (uint8_t *)dst; j < 8; j++, b++) {
	*b = vdev_raidz_exp2(*b, exp);
	}
	}
	}

	static void
	vdev_raidz_reconstruct_pq(raidz_map_t *rm, int x, int y)
	{
	uint8_t *p, *q, *pxy, *qxy, *xd, *yd, tmp, a, b, aexp, bexp;
	void *pdata, *qdata;
	uint64_t xsize, ysize, i;

	ASSERT(x < y);
	ASSERT(x >= rm->rm_firstdatacol);
	ASSERT(y < rm->rm_cols);

	ASSERT(rm->rm_col[x].rc_size >= rm->rm_col[y].rc_size);

	/*
	* Move the parity data aside -- we're going to compute parity as
	* though columns x and y were full of zeros -- Pxy and Qxy. We want to
	* reuse the parity generation mechanism without trashing the actual
	* parity so we make those columns appear to be full of zeros by
	* setting their lengths to zero.
	*/
	pdata = rm->rm_col[VDEV_RAIDZ_P].rc_data;
	qdata = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
	xsize = rm->rm_col[x].rc_size;
	ysize = rm->rm_col[y].rc_size;

	rm->rm_col[VDEV_RAIDZ_P].rc_data =
	zio_buf_alloc(rm->rm_col[VDEV_RAIDZ_P].rc_size);
	rm->rm_col[VDEV_RAIDZ_Q].rc_data =
	zio_buf_alloc(rm->rm_col[VDEV_RAIDZ_Q].rc_size);
	rm->rm_col[x].rc_size = 0;
	rm->rm_col[y].rc_size = 0;

	vdev_raidz_generate_parity_pq(rm);

	rm->rm_col[x].rc_size = xsize;
	rm->rm_col[y].rc_size = ysize;

	p = pdata;
	q = qdata;
	pxy = rm->rm_col[VDEV_RAIDZ_P].rc_data;
	qxy = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
	xd = rm->rm_col[x].rc_data;
	yd = rm->rm_col[y].rc_data;

	/*
	* We now have:
	* Pxy = P + D_x + D_y
	* Qxy = Q + 2^(ndevs - 1 - x) * D_x + 2^(ndevs - 1 - y) * D_y
	*
	* We can then solve for D_x:
	* D_x = A * (P + Pxy) + B * (Q + Qxy)
	* where
	* A = 2^(x - y) * (2^(x - y) + 1)^-1
	* B = 2^(ndevs - 1 - x) * (2^(x - y) + 1)^-1
	*
	* With D_x in hand, we can easily solve for D_y:
	* D_y = P + Pxy + D_x
	*/

	a = vdev_raidz_pow2[255 + x - y];
	b = vdev_raidz_pow2[255 - (rm->rm_cols - 1 - x)];
	tmp = 255 - vdev_raidz_log2[a ^ 1];

	aexp = vdev_raidz_log2[vdev_raidz_exp2(a, tmp)];
	bexp = vdev_raidz_log2[vdev_raidz_exp2(b, tmp)];

	for (i = 0; i < xsize; i++, p++, q++, pxy++, qxy++, xd++, yd++) {
	*xd = vdev_raidz_exp2(*p ^ *pxy, aexp) ^
	vdev_raidz_exp2(*q ^ *qxy, bexp);

	if (i < ysize)
	*yd = *p ^ *pxy ^ *xd;
	}

	zio_buf_free(rm->rm_col[VDEV_RAIDZ_P].rc_data,
	rm->rm_col[VDEV_RAIDZ_P].rc_size);
	zio_buf_free(rm->rm_col[VDEV_RAIDZ_Q].rc_data,
	rm->rm_col[VDEV_RAIDZ_Q].rc_size);

	/*
	* Restore the saved parity data.
	*/
	rm->rm_col[VDEV_RAIDZ_P].rc_data = pdata;
	rm->rm_col[VDEV_RAIDZ_Q].rc_data = qdata;
	}


	static int
	vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)
	{
	vdev_t *cvd;
	uint64_t nparity = vd->vdev_nparity;
	int c, error;
	int lasterror = 0;
	int numerrors = 0;

	ASSERT(nparity > 0);

	if (nparity > VDEV_RAIDZ_MAXPARITY ||
	vd->vdev_children < nparity + 1) {
	vd->vdev_stat.vs_aux = VDEV_AUX_BAD_LABEL;
	return (EINVAL);
	}

	for (c = 0; c < vd->vdev_children; c++) {
	cvd = vd->vdev_child[c];

	if ((error = vdev_open(cvd)) != 0) {
	lasterror = error;
	numerrors++;
	continue;
	}

	*asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
	*ashift = MAX(*ashift, cvd->vdev_ashift);
	}

	*asize *= vd->vdev_children;

	if (numerrors > nparity) {
	vd->vdev_stat.vs_aux = VDEV_AUX_NO_REPLICAS;
	return (lasterror);
	}

	return (0);
	}

	static void
	vdev_raidz_close(vdev_t *vd)
	{
	int c;

	for (c = 0; c < vd->vdev_children; c++)
	vdev_close(vd->vdev_child[c]);
	}

	static uint64_t
	vdev_raidz_asize(vdev_t *vd, uint64_t psize)
	{
	uint64_t asize;
	uint64_t ashift = vd->vdev_top->vdev_ashift;
	uint64_t cols = vd->vdev_children;
	uint64_t nparity = vd->vdev_nparity;

	asize = ((psize - 1) >> ashift) + 1;
	asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
	asize = roundup(asize, nparity + 1) << ashift;

	return (asize);
	}

	static void
	vdev_raidz_child_done(zio_t *zio)
	{
	raidz_col_t *rc = zio->io_private;

	rc->rc_error = zio->io_error;
	rc->rc_tried = 1;
	rc->rc_skipped = 0;
	}

	static int
	vdev_raidz_io_start(zio_t *zio)
	{
	vdev_t *vd = zio->io_vd;
	vdev_t *tvd = vd->vdev_top;
	vdev_t *cvd;
	blkptr_t *bp = zio->io_bp;
	raidz_map_t *rm;
	raidz_col_t *rc;
	int c;

	rm = vdev_raidz_map_alloc(zio, tvd->vdev_ashift, vd->vdev_children,
	vd->vdev_nparity);

	ASSERT3U(rm->rm_asize, ==, vdev_psize_to_asize(vd, zio->io_size));

	if (zio->io_type == ZIO_TYPE_WRITE) {
	/*
	* Generate RAID parity in the first virtual columns.
	*/
	if (rm->rm_firstdatacol == 1)
	vdev_raidz_generate_parity_p(rm);
	else
	vdev_raidz_generate_parity_pq(rm);

	for (c = 0; c < rm->rm_cols; c++) {
	rc = &rm->rm_col[c];
	cvd = vd->vdev_child[rc->rc_devidx];
	zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
	rc->rc_offset, rc->rc_data, rc->rc_size,
	zio->io_type, zio->io_priority, 0,
	vdev_raidz_child_done, rc));
	}

	return (ZIO_PIPELINE_CONTINUE);
	}

	ASSERT(zio->io_type == ZIO_TYPE_READ);

	/*
	* Iterate over the columns in reverse order so that we hit the parity
	* last -- any errors along the way will force us to read the parity
	* data.
	*/
	for (c = rm->rm_cols - 1; c >= 0; c--) {
	rc = &rm->rm_col[c];
	cvd = vd->vdev_child[rc->rc_devidx];
	if (!vdev_readable(cvd)) {
	if (c >= rm->rm_firstdatacol)
	rm->rm_missingdata++;
	else
	rm->rm_missingparity++;
	rc->rc_error = ENXIO;
	rc->rc_tried = 1; /* don't even try */
	rc->rc_skipped = 1;
	continue;
	}
	if (vdev_dtl_contains(&cvd->vdev_dtl_map, bp->blk_birth, 1)) {
	if (c >= rm->rm_firstdatacol)
	rm->rm_missingdata++;
	else
	rm->rm_missingparity++;
	rc->rc_error = ESTALE;
	rc->rc_skipped = 1;
	continue;
	}
	if (c >= rm->rm_firstdatacol || rm->rm_missingdata > 0 ||
	- (zio->io_flags & ZIO_FLAG_SCRUB)) {
	+ (zio->io_flags & (ZIO_FLAG_SCRUB | ZIO_FLAG_RESILVER))) {
	zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
	rc->rc_offset, rc->rc_data, rc->rc_size,
	zio->io_type, zio->io_priority, 0,
	vdev_raidz_child_done, rc));
	}
	}

	return (ZIO_PIPELINE_CONTINUE);
	}

	/*
	* Report a checksum error for a child of a RAID-Z device.
	*/
	static void
	raidz_checksum_error(zio_t *zio, raidz_col_t *rc)
	{
	vdev_t *vd = zio->io_vd->vdev_child[rc->rc_devidx];

	if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
	mutex_enter(&vd->vdev_stat_lock);
	vd->vdev_stat.vs_checksum_errors++;
	mutex_exit(&vd->vdev_stat_lock);
	}

	if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE))
	zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
	zio->io_spa, vd, zio, rc->rc_offset, rc->rc_size);
	}

	/*
	* Generate the parity from the data columns. If we tried and were able to
	* read the parity without error, verify that the generated parity matches the
	* data we read. If it doesn't, we fire off a checksum error. Return the
	* number such failures.
	*/
	static int
	raidz_parity_verify(zio_t *zio, raidz_map_t *rm)
	{
	void *orig[VDEV_RAIDZ_MAXPARITY];
	int c, ret = 0;
	raidz_col_t *rc;

	for (c = 0; c < rm->rm_firstdatacol; c++) {
	rc = &rm->rm_col[c];
	if (!rc->rc_tried || rc->rc_error != 0)
	continue;
	orig[c] = zio_buf_alloc(rc->rc_size);
	bcopy(rc->rc_data, orig[c], rc->rc_size);
	}

	if (rm->rm_firstdatacol == 1)
	vdev_raidz_generate_parity_p(rm);
	else
	vdev_raidz_generate_parity_pq(rm);

	for (c = 0; c < rm->rm_firstdatacol; c++) {
	rc = &rm->rm_col[c];
	if (!rc->rc_tried || rc->rc_error != 0)
	continue;
	if (bcmp(orig[c], rc->rc_data, rc->rc_size) != 0) {
	raidz_checksum_error(zio, rc);
	rc->rc_error = ECKSUM;
	ret++;
	}
	zio_buf_free(orig[c], rc->rc_size);
	}

	return (ret);
	}

	static uint64_t raidz_corrected_p;
	static uint64_t raidz_corrected_q;
	static uint64_t raidz_corrected_pq;

	static int
	vdev_raidz_worst_error(raidz_map_t *rm)
	{
	int error = 0;

	for (int c = 0; c < rm->rm_cols; c++)
	error = zio_worst_error(error, rm->rm_col[c].rc_error);

	return (error);
	}

	static void
	vdev_raidz_io_done(zio_t *zio)
	{
	vdev_t *vd = zio->io_vd;
	vdev_t *cvd;
	raidz_map_t *rm = zio->io_vsd;
	raidz_col_t *rc, *rc1;
	int unexpected_errors = 0;
	int parity_errors = 0;
	int parity_untried = 0;
	int data_errors = 0;
	int total_errors = 0;
	int n, c, c1;

	ASSERT(zio->io_bp != NULL); /* XXX need to add code to enforce this */

	ASSERT(rm->rm_missingparity <= rm->rm_firstdatacol);
	ASSERT(rm->rm_missingdata <= rm->rm_cols - rm->rm_firstdatacol);

	for (c = 0; c < rm->rm_cols; c++) {
	rc = &rm->rm_col[c];

	if (rc->rc_error) {
	ASSERT(rc->rc_error != ECKSUM); /* child has no bp */

	if (c < rm->rm_firstdatacol)
	parity_errors++;
	else
	data_errors++;

	if (!rc->rc_skipped)
	unexpected_errors++;

	total_errors++;
	} else if (c < rm->rm_firstdatacol && !rc->rc_tried) {
	parity_untried++;
	}
	}

	if (zio->io_type == ZIO_TYPE_WRITE) {
	/*
	* XXX -- for now, treat partial writes as a success.
	* (If we couldn't write enough columns to reconstruct
	* the data, the I/O failed. Otherwise, good enough.)
	*
	* Now that we support write reallocation, it would be better
	* to treat partial failure as real failure unless there are
	* no non-degraded top-level vdevs left, and not update DTLs
	* if we intend to reallocate.
	*/
	/* XXPOLICY */
	if (total_errors > rm->rm_firstdatacol)
	zio->io_error = vdev_raidz_worst_error(rm);

	return;
	}

	ASSERT(zio->io_type == ZIO_TYPE_READ);
	/*
	* There are three potential phases for a read:
	* 1. produce valid data from the columns read
	* 2. read all disks and try again
	* 3. perform combinatorial reconstruction
	*
	* Each phase is progressively both more expensive and less likely to
	* occur. If we encounter more errors than we can repair or all phases
	* fail, we have no choice but to return an error.
	*/

	/*
	* If the number of errors we saw was correctable -- less than or equal
	* to the number of parity disks read -- attempt to produce data that
	* has a valid checksum. Naturally, this case applies in the absence of
	* any errors.
	*/
	if (total_errors <= rm->rm_firstdatacol - parity_untried) {
	switch (data_errors) {
	case 0:
	if (zio_checksum_error(zio) == 0) {
	/*
	* If we read parity information (unnecessarily
	* as it happens since no reconstruction was
	* needed) regenerate and verify the parity.
	* We also regenerate parity when resilvering
	* so we can write it out to the failed device
	* later.
	*/
	if (parity_errors + parity_untried <
	rm->rm_firstdatacol ||
	(zio->io_flags & ZIO_FLAG_RESILVER)) {
	n = raidz_parity_verify(zio, rm);
	unexpected_errors += n;
	ASSERT(parity_errors + n <=
	rm->rm_firstdatacol);
	}
	goto done;
	}
	break;

	case 1:
	/*
	* We either attempt to read all the parity columns or
	* none of them. If we didn't try to read parity, we
	* wouldn't be here in the correctable case. There must
	* also have been fewer parity errors than parity
	* columns or, again, we wouldn't be in this code path.
	*/
	ASSERT(parity_untried == 0);
	ASSERT(parity_errors < rm->rm_firstdatacol);

	/*
	* Find the column that reported the error.
	*/
	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	rc = &rm->rm_col[c];
	if (rc->rc_error != 0)
	break;
	}
	ASSERT(c != rm->rm_cols);
	ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
	rc->rc_error == ESTALE);

	if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0) {
	vdev_raidz_reconstruct_p(rm, c);
	} else {
	ASSERT(rm->rm_firstdatacol > 1);
	vdev_raidz_reconstruct_q(rm, c);
	}

	if (zio_checksum_error(zio) == 0) {
	if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0)
	atomic_inc_64(&raidz_corrected_p);
	else
	atomic_inc_64(&raidz_corrected_q);

	/*
	* If there's more than one parity disk that
	* was successfully read, confirm that the
	* other parity disk produced the correct data.
	* This routine is suboptimal in that it
	* regenerates both the parity we wish to test
	* as well as the parity we just used to
	* perform the reconstruction, but this should
	* be a relatively uncommon case, and can be
	* optimized if it becomes a problem.
	* We also regenerate parity when resilvering
	* so we can write it out to the failed device
	* later.
	*/
	if (parity_errors < rm->rm_firstdatacol - 1 ||
	(zio->io_flags & ZIO_FLAG_RESILVER)) {
	n = raidz_parity_verify(zio, rm);
	unexpected_errors += n;
	ASSERT(parity_errors + n <=
	rm->rm_firstdatacol);
	}

	goto done;
	}
	break;

	case 2:
	/*
	* Two data column errors require double parity.
	*/
	ASSERT(rm->rm_firstdatacol == 2);

	/*
	* Find the two columns that reported errors.
	*/
	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	rc = &rm->rm_col[c];
	if (rc->rc_error != 0)
	break;
	}
	ASSERT(c != rm->rm_cols);
	ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
	rc->rc_error == ESTALE);

	for (c1 = c++; c < rm->rm_cols; c++) {
	rc = &rm->rm_col[c];
	if (rc->rc_error != 0)
	break;
	}
	ASSERT(c != rm->rm_cols);
	ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
	rc->rc_error == ESTALE);

	vdev_raidz_reconstruct_pq(rm, c1, c);

	if (zio_checksum_error(zio) == 0) {
	atomic_inc_64(&raidz_corrected_pq);
	goto done;
	}
	break;

	default:
	ASSERT(rm->rm_firstdatacol <= 2);
	ASSERT(0);
	}
	}

	/*
	* This isn't a typical situation -- either we got a read error or
	* a child silently returned bad data. Read every block so we can
	* try again with as much data and parity as we can track down. If
	* we've already been through once before, all children will be marked
	* as tried so we'll proceed to combinatorial reconstruction.
	*/
	unexpected_errors = 1;
	rm->rm_missingdata = 0;
	rm->rm_missingparity = 0;

	for (c = 0; c < rm->rm_cols; c++) {
	if (rm->rm_col[c].rc_tried)
	continue;

	zio_vdev_io_redone(zio);
	do {
	rc = &rm->rm_col[c];
	if (rc->rc_tried)
	continue;
	zio_nowait(zio_vdev_child_io(zio, NULL,
	vd->vdev_child[rc->rc_devidx],
	rc->rc_offset, rc->rc_data, rc->rc_size,
	zio->io_type, zio->io_priority, 0,
	vdev_raidz_child_done, rc));
	} while (++c < rm->rm_cols);

	return;
	}

	/*
	* At this point we've attempted to reconstruct the data given the
	* errors we detected, and we've attempted to read all columns. There
	* must, therefore, be one or more additional problems -- silent errors
	* resulting in invalid data rather than explicit I/O errors resulting
	* in absent data. Before we attempt combinatorial reconstruction make
	* sure we have a chance of coming up with the right answer.
	*/
	if (total_errors >= rm->rm_firstdatacol) {
	zio->io_error = vdev_raidz_worst_error(rm);
	/*
	* If there were exactly as many device errors as parity
	* columns, yet we couldn't reconstruct the data, then at
	* least one device must have returned bad data silently.
	*/
	if (total_errors == rm->rm_firstdatacol)
	zio->io_error = zio_worst_error(zio->io_error, ECKSUM);
	goto done;
	}

	if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0) {
	/*
	* Attempt to reconstruct the data from parity P.
	*/
	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	void *orig;
	rc = &rm->rm_col[c];

	orig = zio_buf_alloc(rc->rc_size);
	bcopy(rc->rc_data, orig, rc->rc_size);
	vdev_raidz_reconstruct_p(rm, c);

	if (zio_checksum_error(zio) == 0) {
	zio_buf_free(orig, rc->rc_size);
	atomic_inc_64(&raidz_corrected_p);

	/*
	* If this child didn't know that it returned
	* bad data, inform it.
	*/
	if (rc->rc_tried && rc->rc_error == 0)
	raidz_checksum_error(zio, rc);
	rc->rc_error = ECKSUM;
	goto done;
	}

	bcopy(orig, rc->rc_data, rc->rc_size);
	zio_buf_free(orig, rc->rc_size);
	}
	}

	if (rm->rm_firstdatacol > 1 && rm->rm_col[VDEV_RAIDZ_Q].rc_error == 0) {
	/*
	* Attempt to reconstruct the data from parity Q.
	*/
	for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
	void *orig;
	rc = &rm->rm_col[c];

	orig = zio_buf_alloc(rc->rc_size);
	bcopy(rc->rc_data, orig, rc->rc_size);
	vdev_raidz_reconstruct_q(rm, c);

	if (zio_checksum_error(zio) == 0) {
	zio_buf_free(orig, rc->rc_size);
	atomic_inc_64(&raidz_corrected_q);

	/*
	* If this child didn't know that it returned
	* bad data, inform it.
	*/
	if (rc->rc_tried && rc->rc_error == 0)
	raidz_checksum_error(zio, rc);
	rc->rc_error = ECKSUM;
	goto done;
	}

	bcopy(orig, rc->rc_data, rc->rc_size);
	zio_buf_free(orig, rc->rc_size);
	}
	}

	if (rm->rm_firstdatacol > 1 &&
	rm->rm_col[VDEV_RAIDZ_P].rc_error == 0 &&
	rm->rm_col[VDEV_RAIDZ_Q].rc_error == 0) {
	/*
	* Attempt to reconstruct the data from both P and Q.
	*/
	for (c = rm->rm_firstdatacol; c < rm->rm_cols - 1; c++) {
	void *orig, *orig1;
	rc = &rm->rm_col[c];

	orig = zio_buf_alloc(rc->rc_size);
	bcopy(rc->rc_data, orig, rc->rc_size);

	for (c1 = c + 1; c1 < rm->rm_cols; c1++) {
	rc1 = &rm->rm_col[c1];

	orig1 = zio_buf_alloc(rc1->rc_size);
	bcopy(rc1->rc_data, orig1, rc1->rc_size);

	vdev_raidz_reconstruct_pq(rm, c, c1);

	if (zio_checksum_error(zio) == 0) {
	zio_buf_free(orig, rc->rc_size);
	zio_buf_free(orig1, rc1->rc_size);
	atomic_inc_64(&raidz_corrected_pq);

	/*
	* If these children didn't know they
	* returned bad data, inform them.
	*/
	if (rc->rc_tried && rc->rc_error == 0)
	raidz_checksum_error(zio, rc);
	if (rc1->rc_tried && rc1->rc_error == 0)
	raidz_checksum_error(zio, rc1);

	rc->rc_error = ECKSUM;
	rc1->rc_error = ECKSUM;

	goto done;
	}

	bcopy(orig1, rc1->rc_data, rc1->rc_size);
	zio_buf_free(orig1, rc1->rc_size);
	}

	bcopy(orig, rc->rc_data, rc->rc_size);
	zio_buf_free(orig, rc->rc_size);
	}
	}

	/*
	* All combinations failed to checksum. Generate checksum ereports for
	* all children.
	*/
	zio->io_error = ECKSUM;

	if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
	for (c = 0; c < rm->rm_cols; c++) {
	rc = &rm->rm_col[c];
	zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
	zio->io_spa, vd->vdev_child[rc->rc_devidx], zio,
	rc->rc_offset, rc->rc_size);
	}
	}

	done:
	zio_checksum_verified(zio);

	if (zio->io_error == 0 && (spa_mode & FWRITE) &&
	(unexpected_errors || (zio->io_flags & ZIO_FLAG_RESILVER))) {
	/*
	* Use the good data we have in hand to repair damaged children.
	*/
	for (c = 0; c < rm->rm_cols; c++) {
	rc = &rm->rm_col[c];
	cvd = vd->vdev_child[rc->rc_devidx];

	if (rc->rc_error == 0)
	continue;

	zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
	rc->rc_offset, rc->rc_data, rc->rc_size,
	ZIO_TYPE_WRITE, zio->io_priority,
	ZIO_FLAG_IO_REPAIR, NULL, NULL));
	}
	}
	}

	static void
	vdev_raidz_state_change(vdev_t *vd, int faulted, int degraded)
	{
	if (faulted > vd->vdev_nparity)
	vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
	VDEV_AUX_NO_REPLICAS);
	else if (degraded + faulted != 0)
	vdev_set_state(vd, B_FALSE, VDEV_STATE_DEGRADED, VDEV_AUX_NONE);
	else
	vdev_set_state(vd, B_FALSE, VDEV_STATE_HEALTHY, VDEV_AUX_NONE);
	}

	vdev_ops_t vdev_raidz_ops = {
	vdev_raidz_open,
	vdev_raidz_close,
	vdev_raidz_asize,
	vdev_raidz_io_start,
	vdev_raidz_io_done,
	vdev_raidz_state_change,
	VDEV_TYPE_RAIDZ, /* name of this vdev type */
	B_FALSE /* not a leaf vdev */
	};
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c (revision 209274)
	@@ -1,2712 +1,2719 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	#include <sys/types.h>
	#include <sys/param.h>
	#include <sys/time.h>
	#include <sys/systm.h>
	#include <sys/sysmacros.h>
	#include <sys/resource.h>
	#include <sys/vfs.h>
	#include <sys/vnode.h>
	#include <sys/file.h>
	#include <sys/stat.h>
	#include <sys/kmem.h>
	#include <sys/cmn_err.h>
	#include <sys/errno.h>
	#include <sys/unistd.h>
	#include <sys/sdt.h>
	#include <sys/fs/zfs.h>
	#include <sys/policy.h>
	#include <sys/zfs_znode.h>
	#include <sys/zfs_fuid.h>
	#include <sys/zfs_acl.h>
	#include <sys/zfs_dir.h>
	#include <sys/zfs_vfsops.h>
	#include <sys/dmu.h>
	#include <sys/dnode.h>
	#include <sys/zap.h>
	#include <acl/acl_common.h>

	#define ALLOW ACE_ACCESS_ALLOWED_ACE_TYPE
	#define DENY ACE_ACCESS_DENIED_ACE_TYPE
	#define MAX_ACE_TYPE ACE_SYSTEM_ALARM_CALLBACK_OBJECT_ACE_TYPE
	#define MIN_ACE_TYPE ALLOW

	#define OWNING_GROUP (ACE_GROUP|ACE_IDENTIFIER_GROUP)
	#define EVERYONE_ALLOW_MASK (ACE_READ_ACL|ACE_READ_ATTRIBUTES | \
	ACE_READ_NAMED_ATTRS|ACE_SYNCHRONIZE)
	#define EVERYONE_DENY_MASK (ACE_WRITE_ACL|ACE_WRITE_OWNER | \
	ACE_WRITE_ATTRIBUTES|ACE_WRITE_NAMED_ATTRS)
	#define OWNER_ALLOW_MASK (ACE_WRITE_ACL | ACE_WRITE_OWNER | \
	ACE_WRITE_ATTRIBUTES|ACE_WRITE_NAMED_ATTRS)
	#define WRITE_MASK_DATA (ACE_WRITE_DATA|ACE_APPEND_DATA|ACE_WRITE_NAMED_ATTRS)

	#define ZFS_CHECKED_MASKS (ACE_READ_ACL|ACE_READ_ATTRIBUTES|ACE_READ_DATA| \
	ACE_READ_NAMED_ATTRS|ACE_WRITE_DATA|ACE_WRITE_ATTRIBUTES| \
	ACE_WRITE_NAMED_ATTRS|ACE_APPEND_DATA|ACE_EXECUTE|ACE_WRITE_OWNER| \
	ACE_WRITE_ACL|ACE_DELETE|ACE_DELETE_CHILD|ACE_SYNCHRONIZE)

	#define WRITE_MASK (WRITE_MASK_DATA|ACE_WRITE_ATTRIBUTES|ACE_WRITE_ACL|\
	ACE_WRITE_OWNER|ACE_DELETE|ACE_DELETE_CHILD)

	#define OGE_CLEAR (ACE_READ_DATA|ACE_LIST_DIRECTORY|ACE_WRITE_DATA| \
	ACE_ADD_FILE|ACE_APPEND_DATA|ACE_ADD_SUBDIRECTORY|ACE_EXECUTE)

	#define OKAY_MASK_BITS (ACE_READ_DATA|ACE_LIST_DIRECTORY|ACE_WRITE_DATA| \
	ACE_ADD_FILE|ACE_APPEND_DATA|ACE_ADD_SUBDIRECTORY|ACE_EXECUTE)

	#define ALL_INHERIT (ACE_FILE_INHERIT_ACE|ACE_DIRECTORY_INHERIT_ACE | \
	ACE_NO_PROPAGATE_INHERIT_ACE|ACE_INHERIT_ONLY_ACE|ACE_INHERITED_ACE)

	#define RESTRICTED_CLEAR (ACE_WRITE_ACL|ACE_WRITE_OWNER)

	#define V4_ACL_WIDE_FLAGS (ZFS_ACL_AUTO_INHERIT|ZFS_ACL_DEFAULTED|\
	ZFS_ACL_PROTECTED)

	#define ZFS_ACL_WIDE_FLAGS (V4_ACL_WIDE_FLAGS|ZFS_ACL_TRIVIAL|ZFS_INHERIT_ACE|\
	ZFS_ACL_OBJ_ACE)

	static uint16_t
	zfs_ace_v0_get_type(void *acep)
	{
	return (((zfs_oldace_t *)acep)->z_type);
	}

	static uint16_t
	zfs_ace_v0_get_flags(void *acep)
	{
	return (((zfs_oldace_t *)acep)->z_flags);
	}

	static uint32_t
	zfs_ace_v0_get_mask(void *acep)
	{
	return (((zfs_oldace_t *)acep)->z_access_mask);
	}

	static uint64_t
	zfs_ace_v0_get_who(void *acep)
	{
	return (((zfs_oldace_t *)acep)->z_fuid);
	}

	static void
	zfs_ace_v0_set_type(void *acep, uint16_t type)
	{
	((zfs_oldace_t *)acep)->z_type = type;
	}

	static void
	zfs_ace_v0_set_flags(void *acep, uint16_t flags)
	{
	((zfs_oldace_t *)acep)->z_flags = flags;
	}

	static void
	zfs_ace_v0_set_mask(void *acep, uint32_t mask)
	{
	((zfs_oldace_t *)acep)->z_access_mask = mask;
	}

	static void
	zfs_ace_v0_set_who(void *acep, uint64_t who)
	{
	((zfs_oldace_t *)acep)->z_fuid = who;
	}

	/*ARGSUSED*/
	static size_t
	zfs_ace_v0_size(void *acep)
	{
	return (sizeof (zfs_oldace_t));
	}

	static size_t
	zfs_ace_v0_abstract_size(void)
	{
	return (sizeof (zfs_oldace_t));
	}

	static int
	zfs_ace_v0_mask_off(void)
	{
	return (offsetof(zfs_oldace_t, z_access_mask));
	}

	/*ARGSUSED*/
	static int
	zfs_ace_v0_data(void *acep, void **datap)
	{
	*datap = NULL;
	return (0);
	}

	static acl_ops_t zfs_acl_v0_ops = {
	zfs_ace_v0_get_mask,
	zfs_ace_v0_set_mask,
	zfs_ace_v0_get_flags,
	zfs_ace_v0_set_flags,
	zfs_ace_v0_get_type,
	zfs_ace_v0_set_type,
	zfs_ace_v0_get_who,
	zfs_ace_v0_set_who,
	zfs_ace_v0_size,
	zfs_ace_v0_abstract_size,
	zfs_ace_v0_mask_off,
	zfs_ace_v0_data
	};

	static uint16_t
	zfs_ace_fuid_get_type(void *acep)
	{
	return (((zfs_ace_hdr_t *)acep)->z_type);
	}

	static uint16_t
	zfs_ace_fuid_get_flags(void *acep)
	{
	return (((zfs_ace_hdr_t *)acep)->z_flags);
	}

	static uint32_t
	zfs_ace_fuid_get_mask(void *acep)
	{
	return (((zfs_ace_hdr_t *)acep)->z_access_mask);
	}

	static uint64_t
	zfs_ace_fuid_get_who(void *args)
	{
	uint16_t entry_type;
	zfs_ace_t *acep = args;

	entry_type = acep->z_hdr.z_flags & ACE_TYPE_FLAGS;

	if (entry_type == ACE_OWNER || entry_type == OWNING_GROUP ||
	entry_type == ACE_EVERYONE)
	return (-1);
	return (((zfs_ace_t *)acep)->z_fuid);
	}

	static void
	zfs_ace_fuid_set_type(void *acep, uint16_t type)
	{
	((zfs_ace_hdr_t *)acep)->z_type = type;
	}

	static void
	zfs_ace_fuid_set_flags(void *acep, uint16_t flags)
	{
	((zfs_ace_hdr_t *)acep)->z_flags = flags;
	}

	static void
	zfs_ace_fuid_set_mask(void *acep, uint32_t mask)
	{
	((zfs_ace_hdr_t *)acep)->z_access_mask = mask;
	}

	static void
	zfs_ace_fuid_set_who(void *arg, uint64_t who)
	{
	zfs_ace_t *acep = arg;

	uint16_t entry_type = acep->z_hdr.z_flags & ACE_TYPE_FLAGS;

	if (entry_type == ACE_OWNER || entry_type == OWNING_GROUP ||
	entry_type == ACE_EVERYONE)
	return;
	acep->z_fuid = who;
	}

	static size_t
	zfs_ace_fuid_size(void *acep)
	{
	zfs_ace_hdr_t *zacep = acep;
	uint16_t entry_type;

	switch (zacep->z_type) {
	case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
	case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
	return (sizeof (zfs_object_ace_t));
	case ALLOW:
	case DENY:
	entry_type =
	(((zfs_ace_hdr_t *)acep)->z_flags & ACE_TYPE_FLAGS);
	if (entry_type == ACE_OWNER ||
	entry_type == OWNING_GROUP ||
	entry_type == ACE_EVERYONE)
	return (sizeof (zfs_ace_hdr_t));
	/*FALLTHROUGH*/
	default:
	return (sizeof (zfs_ace_t));
	}
	}

	static size_t
	zfs_ace_fuid_abstract_size(void)
	{
	return (sizeof (zfs_ace_hdr_t));
	}

	static int
	zfs_ace_fuid_mask_off(void)
	{
	return (offsetof(zfs_ace_hdr_t, z_access_mask));
	}

	static int
	zfs_ace_fuid_data(void *acep, void **datap)
	{
	zfs_ace_t *zacep = acep;
	zfs_object_ace_t *zobjp;

	switch (zacep->z_hdr.z_type) {
	case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
	case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
	zobjp = acep;
	*datap = (caddr_t)zobjp + sizeof (zfs_ace_t);
	return (sizeof (zfs_object_ace_t) - sizeof (zfs_ace_t));
	default:
	*datap = NULL;
	return (0);
	}
	}

	static acl_ops_t zfs_acl_fuid_ops = {
	zfs_ace_fuid_get_mask,
	zfs_ace_fuid_set_mask,
	zfs_ace_fuid_get_flags,
	zfs_ace_fuid_set_flags,
	zfs_ace_fuid_get_type,
	zfs_ace_fuid_set_type,
	zfs_ace_fuid_get_who,
	zfs_ace_fuid_set_who,
	zfs_ace_fuid_size,
	zfs_ace_fuid_abstract_size,
	zfs_ace_fuid_mask_off,
	zfs_ace_fuid_data
	};

	static int
	zfs_acl_version(int version)
	{
	if (version < ZPL_VERSION_FUID)
	return (ZFS_ACL_VERSION_INITIAL);
	else
	return (ZFS_ACL_VERSION_FUID);
	}

	static int
	zfs_acl_version_zp(znode_t *zp)
	{
	return (zfs_acl_version(zp->z_zfsvfs->z_version));
	}

	static zfs_acl_t *
	zfs_acl_alloc(int vers)
	{
	zfs_acl_t *aclp;

	aclp = kmem_zalloc(sizeof (zfs_acl_t), KM_SLEEP);
	list_create(&aclp->z_acl, sizeof (zfs_acl_node_t),
	offsetof(zfs_acl_node_t, z_next));
	aclp->z_version = vers;
	if (vers == ZFS_ACL_VERSION_FUID)
	aclp->z_ops = zfs_acl_fuid_ops;
	else
	aclp->z_ops = zfs_acl_v0_ops;
	return (aclp);
	}

	static zfs_acl_node_t *
	zfs_acl_node_alloc(size_t bytes)
	{
	zfs_acl_node_t *aclnode;

	aclnode = kmem_zalloc(sizeof (zfs_acl_node_t), KM_SLEEP);
	if (bytes) {
	aclnode->z_acldata = kmem_alloc(bytes, KM_SLEEP);
	aclnode->z_allocdata = aclnode->z_acldata;
	aclnode->z_allocsize = bytes;
	aclnode->z_size = bytes;
	}

	return (aclnode);
	}

	static void
	zfs_acl_node_free(zfs_acl_node_t *aclnode)
	{
	if (aclnode->z_allocsize)
	kmem_free(aclnode->z_allocdata, aclnode->z_allocsize);
	kmem_free(aclnode, sizeof (zfs_acl_node_t));
	}

	static void
	zfs_acl_release_nodes(zfs_acl_t *aclp)
	{
	zfs_acl_node_t *aclnode;

	while (aclnode = list_head(&aclp->z_acl)) {
	list_remove(&aclp->z_acl, aclnode);
	zfs_acl_node_free(aclnode);
	}
	aclp->z_acl_count = 0;
	aclp->z_acl_bytes = 0;
	}

	void
	zfs_acl_free(zfs_acl_t *aclp)
	{
	zfs_acl_release_nodes(aclp);
	list_destroy(&aclp->z_acl);
	kmem_free(aclp, sizeof (zfs_acl_t));
	}

	static boolean_t
	zfs_acl_valid_ace_type(uint_t type, uint_t flags)
	{
	uint16_t entry_type;

	switch (type) {
	case ALLOW:
	case DENY:
	case ACE_SYSTEM_AUDIT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_ACE_TYPE:
	entry_type = flags & ACE_TYPE_FLAGS;
	return (entry_type == ACE_OWNER ||
	entry_type == OWNING_GROUP ||
	entry_type == ACE_EVERYONE || entry_type == 0 ||
	entry_type == ACE_IDENTIFIER_GROUP);
	default:
	if (type >= MIN_ACE_TYPE && type <= MAX_ACE_TYPE)
	return (B_TRUE);
	}
	return (B_FALSE);
	}

	static boolean_t
	zfs_ace_valid(vtype_t obj_type, zfs_acl_t *aclp, uint16_t type, uint16_t iflags)
	{
	/*
	* first check type of entry
	*/

	if (!zfs_acl_valid_ace_type(type, iflags))
	return (B_FALSE);

	switch (type) {
	case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
	case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
	if (aclp->z_version < ZFS_ACL_VERSION_FUID)
	return (B_FALSE);
	aclp->z_hints |= ZFS_ACL_OBJ_ACE;
	}

	/*
	* next check inheritance level flags
	*/

	if (obj_type == VDIR &&
	(iflags & (ACE_FILE_INHERIT_ACE|ACE_DIRECTORY_INHERIT_ACE)))
	aclp->z_hints |= ZFS_INHERIT_ACE;

	if (iflags & (ACE_INHERIT_ONLY_ACE|ACE_NO_PROPAGATE_INHERIT_ACE)) {
	if ((iflags & (ACE_FILE_INHERIT_ACE|
	ACE_DIRECTORY_INHERIT_ACE)) == 0) {
	return (B_FALSE);
	}
	}

	return (B_TRUE);
	}

	static void *
	zfs_acl_next_ace(zfs_acl_t *aclp, void *start, uint64_t *who,
	uint32_t *access_mask, uint16_t *iflags, uint16_t *type)
	{
	zfs_acl_node_t *aclnode;

	if (start == NULL) {
	aclnode = list_head(&aclp->z_acl);
	if (aclnode == NULL)
	return (NULL);

	aclp->z_next_ace = aclnode->z_acldata;
	aclp->z_curr_node = aclnode;
	aclnode->z_ace_idx = 0;
	}

	aclnode = aclp->z_curr_node;

	if (aclnode == NULL)
	return (NULL);

	if (aclnode->z_ace_idx >= aclnode->z_ace_count) {
	aclnode = list_next(&aclp->z_acl, aclnode);
	if (aclnode == NULL)
	return (NULL);
	else {
	aclp->z_curr_node = aclnode;
	aclnode->z_ace_idx = 0;
	aclp->z_next_ace = aclnode->z_acldata;
	}
	}

	if (aclnode->z_ace_idx < aclnode->z_ace_count) {
	void *acep = aclp->z_next_ace;
	size_t ace_size;

	/*
	* Make sure we don't overstep our bounds
	*/
	ace_size = aclp->z_ops.ace_size(acep);

	if (((caddr_t)acep + ace_size) >
	((caddr_t)aclnode->z_acldata + aclnode->z_size)) {
	return (NULL);
	}

	*iflags = aclp->z_ops.ace_flags_get(acep);
	*type = aclp->z_ops.ace_type_get(acep);
	*access_mask = aclp->z_ops.ace_mask_get(acep);
	*who = aclp->z_ops.ace_who_get(acep);
	aclp->z_next_ace = (caddr_t)aclp->z_next_ace + ace_size;
	aclnode->z_ace_idx++;
	return ((void *)acep);
	}
	return (NULL);
	}

	/*ARGSUSED*/
	static uint64_t
	zfs_ace_walk(void *datap, uint64_t cookie, int aclcnt,
	uint16_t *flags, uint16_t *type, uint32_t *mask)
	{
	zfs_acl_t *aclp = datap;
	zfs_ace_hdr_t *acep = (zfs_ace_hdr_t *)(uintptr_t)cookie;
	uint64_t who;

	acep = zfs_acl_next_ace(aclp, acep, &who, mask,
	flags, type);
	return ((uint64_t)(uintptr_t)acep);
	}

	static zfs_acl_node_t *
	zfs_acl_curr_node(zfs_acl_t *aclp)
	{
	ASSERT(aclp->z_curr_node);
	return (aclp->z_curr_node);
	}

	/*
	* Copy ACE to internal ZFS format.
	* While processing the ACL each ACE will be validated for correctness.
	* ACE FUIDs will be created later.
	*/
	int
	zfs_copy_ace_2_fuid(vtype_t obj_type, zfs_acl_t *aclp, void *datap,
	zfs_ace_t *z_acl, int aclcnt, size_t *size)
	{
	int i;
	uint16_t entry_type;
	zfs_ace_t *aceptr = z_acl;
	ace_t *acep = datap;
	zfs_object_ace_t *zobjacep;
	ace_object_t *aceobjp;

	for (i = 0; i != aclcnt; i++) {
	aceptr->z_hdr.z_access_mask = acep->a_access_mask;
	aceptr->z_hdr.z_flags = acep->a_flags;
	aceptr->z_hdr.z_type = acep->a_type;
	entry_type = aceptr->z_hdr.z_flags & ACE_TYPE_FLAGS;
	if (entry_type != ACE_OWNER && entry_type != OWNING_GROUP &&
	entry_type != ACE_EVERYONE) {
	if (!aclp->z_has_fuids)
	aclp->z_has_fuids = IS_EPHEMERAL(acep->a_who);
	aceptr->z_fuid = (uint64_t)acep->a_who;
	}

	/*
	* Make sure ACE is valid
	*/
	if (zfs_ace_valid(obj_type, aclp, aceptr->z_hdr.z_type,
	aceptr->z_hdr.z_flags) != B_TRUE)
	return (EINVAL);

	switch (acep->a_type) {
	case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
	case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
	zobjacep = (zfs_object_ace_t *)aceptr;
	aceobjp = (ace_object_t *)acep;

	bcopy(aceobjp->a_obj_type, zobjacep->z_object_type,
	sizeof (aceobjp->a_obj_type));
	bcopy(aceobjp->a_inherit_obj_type,
	zobjacep->z_inherit_type,
	sizeof (aceobjp->a_inherit_obj_type));
	acep = (ace_t *)((caddr_t)acep + sizeof (ace_object_t));
	break;
	default:
	acep = (ace_t *)((caddr_t)acep + sizeof (ace_t));
	}

	aceptr = (zfs_ace_t *)((caddr_t)aceptr +
	aclp->z_ops.ace_size(aceptr));
	}

	*size = (caddr_t)aceptr - (caddr_t)z_acl;

	return (0);
	}

	/*
	* Copy ZFS ACEs to fixed size ace_t layout
	*/
	static void
	zfs_copy_fuid_2_ace(zfsvfs_t *zfsvfs, zfs_acl_t *aclp, cred_t *cr,
	void *datap, int filter)
	{
	uint64_t who;
	uint32_t access_mask;
	uint16_t iflags, type;
	zfs_ace_hdr_t *zacep = NULL;
	ace_t *acep = datap;
	ace_object_t *objacep;
	zfs_object_ace_t *zobjacep;
	size_t ace_size;
	uint16_t entry_type;

	while (zacep = zfs_acl_next_ace(aclp, zacep,
	&who, &access_mask, &iflags, &type)) {

	switch (type) {
	case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
	case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
	if (filter) {
	continue;
	}
	zobjacep = (zfs_object_ace_t *)zacep;
	objacep = (ace_object_t *)acep;
	bcopy(zobjacep->z_object_type,
	objacep->a_obj_type,
	sizeof (zobjacep->z_object_type));
	bcopy(zobjacep->z_inherit_type,
	objacep->a_inherit_obj_type,
	sizeof (zobjacep->z_inherit_type));
	ace_size = sizeof (ace_object_t);
	break;
	default:
	ace_size = sizeof (ace_t);
	break;
	}

	entry_type = (iflags & ACE_TYPE_FLAGS);
	if ((entry_type != ACE_OWNER &&
	entry_type != OWNING_GROUP &&
	entry_type != ACE_EVERYONE)) {
	acep->a_who = zfs_fuid_map_id(zfsvfs, who,
	cr, (entry_type & ACE_IDENTIFIER_GROUP) ?
	ZFS_ACE_GROUP : ZFS_ACE_USER);
	} else {
	acep->a_who = (uid_t)(int64_t)who;
	}
	acep->a_access_mask = access_mask;
	acep->a_flags = iflags;
	acep->a_type = type;
	acep = (ace_t *)((caddr_t)acep + ace_size);
	}
	}

	static int
	zfs_copy_ace_2_oldace(vtype_t obj_type, zfs_acl_t *aclp, ace_t *acep,
	zfs_oldace_t *z_acl, int aclcnt, size_t *size)
	{
	int i;
	zfs_oldace_t *aceptr = z_acl;

	for (i = 0; i != aclcnt; i++, aceptr++) {
	aceptr->z_access_mask = acep[i].a_access_mask;
	aceptr->z_type = acep[i].a_type;
	aceptr->z_flags = acep[i].a_flags;
	aceptr->z_fuid = acep[i].a_who;
	/*
	* Make sure ACE is valid
	*/
	if (zfs_ace_valid(obj_type, aclp, aceptr->z_type,
	aceptr->z_flags) != B_TRUE)
	return (EINVAL);
	}
	*size = (caddr_t)aceptr - (caddr_t)z_acl;
	return (0);
	}

	/*
	* convert old ACL format to new
	*/
	void
	zfs_acl_xform(znode_t *zp, zfs_acl_t *aclp)
	{
	zfs_oldace_t *oldaclp;
	int i;
	uint16_t type, iflags;
	uint32_t access_mask;
	uint64_t who;
	void *cookie = NULL;
	zfs_acl_node_t *newaclnode;

	ASSERT(aclp->z_version == ZFS_ACL_VERSION_INITIAL);
	/*
	* First create the ACE in a contiguous piece of memory
	* for zfs_copy_ace_2_fuid().
	*
	* We only convert an ACL once, so this won't happen
	* everytime.
	*/
	oldaclp = kmem_alloc(sizeof (zfs_oldace_t) * aclp->z_acl_count,
	KM_SLEEP);
	i = 0;
	while (cookie = zfs_acl_next_ace(aclp, cookie, &who,
	&access_mask, &iflags, &type)) {
	oldaclp[i].z_flags = iflags;
	oldaclp[i].z_type = type;
	oldaclp[i].z_fuid = who;
	oldaclp[i++].z_access_mask = access_mask;
	}

	newaclnode = zfs_acl_node_alloc(aclp->z_acl_count *
	sizeof (zfs_object_ace_t));
	aclp->z_ops = zfs_acl_fuid_ops;
	VERIFY(zfs_copy_ace_2_fuid(ZTOV(zp)->v_type, aclp, oldaclp,
	newaclnode->z_acldata, aclp->z_acl_count,
	&newaclnode->z_size) == 0);
	newaclnode->z_ace_count = aclp->z_acl_count;
	aclp->z_version = ZFS_ACL_VERSION;
	kmem_free(oldaclp, aclp->z_acl_count * sizeof (zfs_oldace_t));

	/*
	* Release all previous ACL nodes
	*/

	zfs_acl_release_nodes(aclp);

	list_insert_head(&aclp->z_acl, newaclnode);

	aclp->z_acl_bytes = newaclnode->z_size;
	aclp->z_acl_count = newaclnode->z_ace_count;

	}

	/*
	* Convert unix access mask to v4 access mask
	*/
	static uint32_t
	zfs_unix_to_v4(uint32_t access_mask)
	{
	uint32_t new_mask = 0;

	if (access_mask & S_IXOTH)
	new_mask |= ACE_EXECUTE;
	if (access_mask & S_IWOTH)
	new_mask |= ACE_WRITE_DATA;
	if (access_mask & S_IROTH)
	new_mask |= ACE_READ_DATA;
	return (new_mask);
	}

	static void
	zfs_set_ace(zfs_acl_t *aclp, void *acep, uint32_t access_mask,
	uint16_t access_type, uint64_t fuid, uint16_t entry_type)
	{
	uint16_t type = entry_type & ACE_TYPE_FLAGS;

	aclp->z_ops.ace_mask_set(acep, access_mask);
	aclp->z_ops.ace_type_set(acep, access_type);
	aclp->z_ops.ace_flags_set(acep, entry_type);
	if ((type != ACE_OWNER && type != OWNING_GROUP &&
	type != ACE_EVERYONE))
	aclp->z_ops.ace_who_set(acep, fuid);
	}

	/*
	* Determine mode of file based on ACL.
	* Also, create FUIDs for any User/Group ACEs
	*/
	static uint64_t
	zfs_mode_fuid_compute(znode_t *zp, zfs_acl_t *aclp, cred_t *cr,
	zfs_fuid_info_t **fuidp, dmu_tx_t *tx)
	{
	int entry_type;
	mode_t mode;
	mode_t seen = 0;
	zfs_ace_hdr_t *acep = NULL;
	uint64_t who;
	uint16_t iflags, type;
	uint32_t access_mask;

	mode = (zp->z_phys->zp_mode & (S_IFMT | S_ISUID | S_ISGID | S_ISVTX));

	while (acep = zfs_acl_next_ace(aclp, acep, &who,
	&access_mask, &iflags, &type)) {

	if (!zfs_acl_valid_ace_type(type, iflags))
	continue;

	entry_type = (iflags & ACE_TYPE_FLAGS);

	/*
	* Skip over owner@, group@ or everyone@ inherit only ACEs
	*/
	if ((iflags & ACE_INHERIT_ONLY_ACE) &&
	(entry_type == ACE_OWNER || entry_type == ACE_EVERYONE ||
	entry_type == OWNING_GROUP))
	continue;

	if (entry_type == ACE_OWNER) {
	if ((access_mask & ACE_READ_DATA) &&
	(!(seen & S_IRUSR))) {
	seen |= S_IRUSR;
	if (type == ALLOW) {
	mode |= S_IRUSR;
	}
	}
	if ((access_mask & ACE_WRITE_DATA) &&
	(!(seen & S_IWUSR))) {
	seen |= S_IWUSR;
	if (type == ALLOW) {
	mode |= S_IWUSR;
	}
	}
	if ((access_mask & ACE_EXECUTE) &&
	(!(seen & S_IXUSR))) {
	seen |= S_IXUSR;
	if (type == ALLOW) {
	mode |= S_IXUSR;
	}
	}
	} else if (entry_type == OWNING_GROUP) {
	if ((access_mask & ACE_READ_DATA) &&
	(!(seen & S_IRGRP))) {
	seen |= S_IRGRP;
	if (type == ALLOW) {
	mode |= S_IRGRP;
	}
	}
	if ((access_mask & ACE_WRITE_DATA) &&
	(!(seen & S_IWGRP))) {
	seen |= S_IWGRP;
	if (type == ALLOW) {
	mode |= S_IWGRP;
	}
	}
	if ((access_mask & ACE_EXECUTE) &&
	(!(seen & S_IXGRP))) {
	seen |= S_IXGRP;
	if (type == ALLOW) {
	mode |= S_IXGRP;
	}
	}
	} else if (entry_type == ACE_EVERYONE) {
	if ((access_mask & ACE_READ_DATA)) {
	if (!(seen & S_IRUSR)) {
	seen |= S_IRUSR;
	if (type == ALLOW) {
	mode |= S_IRUSR;
	}
	}
	if (!(seen & S_IRGRP)) {
	seen |= S_IRGRP;
	if (type == ALLOW) {
	mode |= S_IRGRP;
	}
	}
	if (!(seen & S_IROTH)) {
	seen |= S_IROTH;
	if (type == ALLOW) {
	mode |= S_IROTH;
	}
	}
	}
	if ((access_mask & ACE_WRITE_DATA)) {
	if (!(seen & S_IWUSR)) {
	seen |= S_IWUSR;
	if (type == ALLOW) {
	mode |= S_IWUSR;
	}
	}
	if (!(seen & S_IWGRP)) {
	seen |= S_IWGRP;
	if (type == ALLOW) {
	mode |= S_IWGRP;
	}
	}
	if (!(seen & S_IWOTH)) {
	seen |= S_IWOTH;
	if (type == ALLOW) {
	mode |= S_IWOTH;
	}
	}
	}
	if ((access_mask & ACE_EXECUTE)) {
	if (!(seen & S_IXUSR)) {
	seen |= S_IXUSR;
	if (type == ALLOW) {
	mode |= S_IXUSR;
	}
	}
	if (!(seen & S_IXGRP)) {
	seen |= S_IXGRP;
	if (type == ALLOW) {
	mode |= S_IXGRP;
	}
	}
	if (!(seen & S_IXOTH)) {
	seen |= S_IXOTH;
	if (type == ALLOW) {
	mode |= S_IXOTH;
	}
	}
	}
	}
	/*
	* Now handle FUID create for user/group ACEs
	*/
	if (entry_type == 0 \|\| entry_type == ACE_IDENTIFIER_GROUP) {
	aclp->z_ops.ace_who_set(acep,
	zfs_fuid_create(zp->z_zfsvfs, who, cr,
	(entry_type == 0) ? ZFS_ACE_USER : ZFS_ACE_GROUP,
	tx, fuidp));
	}
	}
	return (mode);
	}

	static zfs_acl_t *
	zfs_acl_node_read_internal(znode_t *zp, boolean_t will_modify)
	{
	zfs_acl_t *aclp;
	zfs_acl_node_t *aclnode;

	aclp = zfs_acl_alloc(zp->z_phys->zp_acl.z_acl_version);

	/*
	* Version 0 to 1 znode_acl_phys has the size/count fields swapped.
	* Version 0 didn't have a size field, only a count.
	*/
	if (zp->z_phys->zp_acl.z_acl_version == ZFS_ACL_VERSION_INITIAL) {
	aclp->z_acl_count = zp->z_phys->zp_acl.z_acl_size;
	aclp->z_acl_bytes = ZFS_ACL_SIZE(aclp->z_acl_count);
	} else {
	aclp->z_acl_count = zp->z_phys->zp_acl.z_acl_count;
	aclp->z_acl_bytes = zp->z_phys->zp_acl.z_acl_size;
	}

	aclnode = zfs_acl_node_alloc(will_modify ? aclp->z_acl_bytes : 0);
	aclnode->z_ace_count = aclp->z_acl_count;
	if (will_modify) {
	bcopy(zp->z_phys->zp_acl.z_ace_data, aclnode->z_acldata,
	aclp->z_acl_bytes);
	} else {
	aclnode->z_size = aclp->z_acl_bytes;
	aclnode->z_acldata = &zp->z_phys->zp_acl.z_ace_data[0];
	}

	list_insert_head(&aclp->z_acl, aclnode);

	return (aclp);
	}

	/*
	* Read an external acl object.
	*/
	static int
	zfs_acl_node_read(znode_t *zp, zfs_acl_t **aclpp, boolean_t will_modify)
	{
	uint64_t extacl = zp->z_phys->zp_acl.z_acl_extern_obj;
	zfs_acl_t *aclp;
	size_t aclsize;
	size_t acl_count;
	zfs_acl_node_t *aclnode;
	int error;

	ASSERT(MUTEX_HELD(&zp->z_acl_lock));

	if (zp->z_phys->zp_acl.z_acl_extern_obj == 0) {
	*aclpp = zfs_acl_node_read_internal(zp, will_modify);
	return (0);
	}

	aclp = zfs_acl_alloc(zp->z_phys->zp_acl.z_acl_version);
	if (zp->z_phys->zp_acl.z_acl_version == ZFS_ACL_VERSION_INITIAL) {
	zfs_acl_phys_v0_t *zacl0 =
	(zfs_acl_phys_v0_t *)&zp->z_phys->zp_acl;

	aclsize = ZFS_ACL_SIZE(zacl0->z_acl_count);
	acl_count = zacl0->z_acl_count;
	} else {
	aclsize = zp->z_phys->zp_acl.z_acl_size;
	acl_count = zp->z_phys->zp_acl.z_acl_count;
	if (aclsize == 0)
	aclsize = acl_count * sizeof (zfs_ace_t);
	}
	aclnode = zfs_acl_node_alloc(aclsize);
	list_insert_head(&aclp->z_acl, aclnode);
	error = dmu_read(zp->z_zfsvfs->z_os, extacl, 0,
	aclsize, aclnode->z_acldata);
	aclnode->z_ace_count = acl_count;
	aclp->z_acl_count = acl_count;
	aclp->z_acl_bytes = aclsize;

	if (error != 0) {
	zfs_acl_free(aclp);
	/* convert checksum errors into IO errors */
	if (error == ECKSUM)
	error = EIO;
	return (error);
	}

	*aclpp = aclp;
	return (0);
	}

	/*
	* common code for setting ACLs.
	*
	* This function is called from zfs_mode_update, zfs_perm_init, and zfs_setacl.
	* zfs_setacl passes a non-NULL inherit pointer (ihp) to indicate that it's
	* already checked the acl and knows whether to inherit.
	*/
	int
	zfs_aclset_common(znode_t *zp, zfs_acl_t *aclp, cred_t *cr,
	zfs_fuid_info_t **fuidp, dmu_tx_t *tx)
	{
	int error;
	znode_phys_t *zphys = zp->z_phys;
	zfs_acl_phys_t *zacl = &zphys->zp_acl;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	uint64_t aoid = zphys->zp_acl.z_acl_extern_obj;
	uint64_t off = 0;
	dmu_object_type_t otype;
	zfs_acl_node_t *aclnode;

	ASSERT(MUTEX_HELD(&zp->z_lock));
	ASSERT(MUTEX_HELD(&zp->z_acl_lock));

	dmu_buf_will_dirty(zp->z_dbuf, tx);

	zphys->zp_mode = zfs_mode_fuid_compute(zp, aclp, cr, fuidp, tx);

	/*
	* Decide which object type to use. If we are forced to
	* use the old ACL format, then transform the ACL into
	* zfs_oldace_t layout.
	*/
	if (!zfsvfs->z_use_fuids) {
	otype = DMU_OT_OLDACL;
	} else {
	if ((aclp->z_version == ZFS_ACL_VERSION_INITIAL) &&
	(zfsvfs->z_version >= ZPL_VERSION_FUID))
	zfs_acl_xform(zp, aclp);
	ASSERT(aclp->z_version >= ZFS_ACL_VERSION_FUID);
	otype = DMU_OT_ACL;
	}

	if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
	/*
	* If ACL was previously external and we are now
	* converting to new ACL format then release old
	* ACL object and create a new one.
	*/
	if (aoid && aclp->z_version != zacl->z_acl_version) {
	error = dmu_object_free(zfsvfs->z_os,
	zp->z_phys->zp_acl.z_acl_extern_obj, tx);
	if (error)
	return (error);
	aoid = 0;
	}
	if (aoid == 0) {
	aoid = dmu_object_alloc(zfsvfs->z_os,
	otype, aclp->z_acl_bytes,
	otype == DMU_OT_ACL ? DMU_OT_SYSACL : DMU_OT_NONE,
	otype == DMU_OT_ACL ? DN_MAX_BONUSLEN : 0, tx);
	} else {
	(void) dmu_object_set_blocksize(zfsvfs->z_os, aoid,
	aclp->z_acl_bytes, 0, tx);
	}
	zphys->zp_acl.z_acl_extern_obj = aoid;
	for (aclnode = list_head(&aclp->z_acl); aclnode;
	aclnode = list_next(&aclp->z_acl, aclnode)) {
	if (aclnode->z_ace_count == 0)
	continue;
	dmu_write(zfsvfs->z_os, aoid, off,
	aclnode->z_size, aclnode->z_acldata, tx);
	off += aclnode->z_size;
	}
	} else {
	void *start = zacl->z_ace_data;
	/*
	* Migrating back embedded?
	*/
	if (zphys->zp_acl.z_acl_extern_obj) {
	error = dmu_object_free(zfsvfs->z_os,
	zp->z_phys->zp_acl.z_acl_extern_obj, tx);
	if (error)
	return (error);
	zphys->zp_acl.z_acl_extern_obj = 0;
	}

	for (aclnode = list_head(&aclp->z_acl); aclnode;
	aclnode = list_next(&aclp->z_acl, aclnode)) {
	if (aclnode->z_ace_count == 0)
	continue;
	bcopy(aclnode->z_acldata, start, aclnode->z_size);
	start = (caddr_t)start + aclnode->z_size;
	}
	}

	/*
	* If Old version then swap count/bytes to match old
	* layout of znode_acl_phys_t.
	*/
	if (aclp->z_version == ZFS_ACL_VERSION_INITIAL) {
	zphys->zp_acl.z_acl_size = aclp->z_acl_count;
	zphys->zp_acl.z_acl_count = aclp->z_acl_bytes;
	} else {
	zphys->zp_acl.z_acl_size = aclp->z_acl_bytes;
	zphys->zp_acl.z_acl_count = aclp->z_acl_count;
	}

	zphys->zp_acl.z_acl_version = aclp->z_version;

	/*
	* Replace ACL wide bits, but first clear them.
	*/
	zp->z_phys->zp_flags &= ~ZFS_ACL_WIDE_FLAGS;

	zp->z_phys->zp_flags \|= aclp->z_hints;

	if (ace_trivial_common(aclp, 0, zfs_ace_walk) == 0)
	zp->z_phys->zp_flags \|= ZFS_ACL_TRIVIAL;

	zfs_time_stamper_locked(zp, STATE_CHANGED, tx);
	return (0);
	}

	/*
	* Update access mask for prepended ACE
	*
	* This applies the "groupmask" value for aclmode property.
	*/
	static void
	zfs_acl_prepend_fixup(zfs_acl_t *aclp, void *acep, void *origacep,
	mode_t mode, uint64_t owner)
	{
	int rmask, wmask, xmask;
	int user_ace;
	uint16_t aceflags;
	uint32_t origmask, acepmask;
	uint64_t fuid;

	aceflags = aclp->z_ops.ace_flags_get(acep);
	fuid = aclp->z_ops.ace_who_get(acep);
	origmask = aclp->z_ops.ace_mask_get(origacep);
	acepmask = aclp->z_ops.ace_mask_get(acep);

	user_ace = (!(aceflags &
	(ACE_OWNER\|ACE_GROUP\|ACE_IDENTIFIER_GROUP)));

	if (user_ace && (fuid == owner)) {
	rmask = S_IRUSR;
	wmask = S_IWUSR;
	xmask = S_IXUSR;
	} else {
	rmask = S_IRGRP;
	wmask = S_IWGRP;
	xmask = S_IXGRP;
	}

	if (origmask & ACE_READ_DATA) {
	if (mode & rmask) {
	acepmask &= ~ACE_READ_DATA;
	} else {
	acepmask \|= ACE_READ_DATA;
	}
	}

	if (origmask & ACE_WRITE_DATA) {
	if (mode & wmask) {
	acepmask &= ~ACE_WRITE_DATA;
	} else {
	acepmask \|= ACE_WRITE_DATA;
	}
	}

	if (origmask & ACE_APPEND_DATA) {
	if (mode & wmask) {
	acepmask &= ~ACE_APPEND_DATA;
	} else {
	acepmask \|= ACE_APPEND_DATA;
	}
	}

	if (origmask & ACE_EXECUTE) {
	if (mode & xmask) {
	acepmask &= ~ACE_EXECUTE;
	} else {
	acepmask \|= ACE_EXECUTE;
	}
	}
	aclp->z_ops.ace_mask_set(acep, acepmask);
	}

	/*
	* Apply mode to canonical six ACEs.
	*/
	static void
	zfs_acl_fixup_canonical_six(zfs_acl_t *aclp, mode_t mode)
	{
	zfs_acl_node_t *aclnode = list_tail(&aclp->z_acl);
	void *acep;
	int maskoff = aclp->z_ops.ace_mask_off();
	size_t abstract_size = aclp->z_ops.ace_abstract_size();

	ASSERT(aclnode != NULL);

	acep = (void *)((caddr_t)aclnode->z_acldata +
	aclnode->z_size - (abstract_size * 6));

	/*
	* Fixup final ACEs to match the mode
	*/

	adjust_ace_pair_common(acep, maskoff, abstract_size,
	(mode & 0700) >> 6); /* owner@ */

	acep = (caddr_t)acep + (abstract_size * 2);

	adjust_ace_pair_common(acep, maskoff, abstract_size,
	(mode & 0070) >> 3); /* group@ */

	acep = (caddr_t)acep + (abstract_size * 2);
	adjust_ace_pair_common(acep, maskoff,
	abstract_size, mode); /* everyone@ */
	}


	static int
	zfs_acl_ace_match(zfs_acl_t *aclp, void *acep, int allow_deny,
	int entry_type, int accessmask)
	{
	uint32_t mask = aclp->z_ops.ace_mask_get(acep);
	uint16_t type = aclp->z_ops.ace_type_get(acep);
	uint16_t flags = aclp->z_ops.ace_flags_get(acep);

	return (mask == accessmask && type == allow_deny &&
	((flags & ACE_TYPE_FLAGS) == entry_type));
	}

	/*
	* Can prepended ACE be reused?
	*/
	static int
	zfs_reuse_deny(zfs_acl_t *aclp, void *acep, void *prevacep)
	{
	int okay_masks;
	uint16_t prevtype;
	uint16_t prevflags;
	uint16_t flags;
	uint32_t mask, prevmask;

	if (prevacep == NULL)
	return (B_FALSE);

	prevtype = aclp->z_ops.ace_type_get(prevacep);
	prevflags = aclp->z_ops.ace_flags_get(prevacep);
	flags = aclp->z_ops.ace_flags_get(acep);
	mask = aclp->z_ops.ace_mask_get(acep);
	prevmask = aclp->z_ops.ace_mask_get(prevacep);

	if (prevtype != DENY)
	return (B_FALSE);

	if (prevflags != (flags & ACE_IDENTIFIER_GROUP))
	return (B_FALSE);

	okay_masks = (mask & OKAY_MASK_BITS);

	if (prevmask & ~okay_masks)
	return (B_FALSE);

	return (B_TRUE);
	}


	/*
	* Insert new ACL node into chain of zfs_acl_node_t's
	*
	* There are two possible outcomes:
	* 1. If the ACL is currently just a single zfs_acl_node and
	* we are prepending the entry then current acl node will have
	* a new node inserted above it.
	*
	* 2. If we are inserting in the middle of current acl node then
	* the current node will be split in two and new node will be inserted
	* in between the two split nodes.
	*/
	static zfs_acl_node_t *
	zfs_acl_ace_insert(zfs_acl_t *aclp, void *acep)
	{
	zfs_acl_node_t *newnode;
	zfs_acl_node_t *trailernode = NULL;
	zfs_acl_node_t *currnode = zfs_acl_curr_node(aclp);
	int curr_idx = aclp->z_curr_node->z_ace_idx;
	int trailer_count;
	size_t oldsize;

	newnode = zfs_acl_node_alloc(aclp->z_ops.ace_size(acep));
	newnode->z_ace_count = 1;

	oldsize = currnode->z_size;

	if (curr_idx != 1) {
	trailernode = zfs_acl_node_alloc(0);
	trailernode->z_acldata = acep;

	trailer_count = currnode->z_ace_count - curr_idx + 1;
	currnode->z_ace_count = curr_idx - 1;
	currnode->z_size = (caddr_t)acep - (caddr_t)currnode->z_acldata;
	trailernode->z_size = oldsize - currnode->z_size;
	trailernode->z_ace_count = trailer_count;
	}

	aclp->z_acl_count += 1;
	aclp->z_acl_bytes += aclp->z_ops.ace_size(acep);

	if (curr_idx == 1)
	list_insert_before(&aclp->z_acl, currnode, newnode);
	else
	list_insert_after(&aclp->z_acl, currnode, newnode);
	if (trailernode) {
	list_insert_after(&aclp->z_acl, newnode, trailernode);
	aclp->z_curr_node = trailernode;
	trailernode->z_ace_idx = 1;
	}

	return (newnode);
	}

	/*
	* Prepend deny ACE
	*/
	static void *
	zfs_acl_prepend_deny(znode_t *zp, zfs_acl_t *aclp, void *acep,
	mode_t mode)
	{
	zfs_acl_node_t *aclnode;
	void *newacep;
	uint64_t fuid;
	uint16_t flags;

	aclnode = zfs_acl_ace_insert(aclp, acep);
	newacep = aclnode->z_acldata;
	fuid = aclp->z_ops.ace_who_get(acep);
	flags = aclp->z_ops.ace_flags_get(acep);
	zfs_set_ace(aclp, newacep, 0, DENY, fuid, (flags & ACE_TYPE_FLAGS));
	zfs_acl_prepend_fixup(aclp, newacep, acep, mode, zp->z_phys->zp_uid);

	return (newacep);
	}

	/*
	* Split an inherited ACE into inherit_only ACE
	* and original ACE with inheritance flags stripped off.
	*/
	static void
	zfs_acl_split_ace(zfs_acl_t *aclp, zfs_ace_hdr_t *acep)
	{
	zfs_acl_node_t *aclnode;
	zfs_acl_node_t *currnode;
	void *newacep;
	uint16_t type, flags;
	uint32_t mask;
	uint64_t fuid;

	type = aclp->z_ops.ace_type_get(acep);
	flags = aclp->z_ops.ace_flags_get(acep);
	mask = aclp->z_ops.ace_mask_get(acep);
	fuid = aclp->z_ops.ace_who_get(acep);

	aclnode = zfs_acl_ace_insert(aclp, acep);
	newacep = aclnode->z_acldata;

	aclp->z_ops.ace_type_set(newacep, type);
	aclp->z_ops.ace_flags_set(newacep, flags \| ACE_INHERIT_ONLY_ACE);
	aclp->z_ops.ace_mask_set(newacep, mask);
	aclp->z_ops.ace_type_set(newacep, type);
	aclp->z_ops.ace_who_set(newacep, fuid);
	aclp->z_next_ace = acep;
	flags &= ~ALL_INHERIT;
	aclp->z_ops.ace_flags_set(acep, flags);
	currnode = zfs_acl_curr_node(aclp);
	ASSERT(currnode->z_ace_idx >= 1);
	currnode->z_ace_idx -= 1;
	}

	/*
	* Are the final six ACEs of the ACL the canonical six ACEs?
	*/
	static int
	zfs_have_canonical_six(zfs_acl_t *aclp)
	{
	void *acep;
	zfs_acl_node_t *aclnode = list_tail(&aclp->z_acl);
	int i = 0;
	size_t abstract_size = aclp->z_ops.ace_abstract_size();

	ASSERT(aclnode != NULL);

	if (aclnode->z_ace_count < 6)
	return (0);

	acep = (void *)((caddr_t)aclnode->z_acldata +
	aclnode->z_size - (aclp->z_ops.ace_abstract_size() * 6));

	if ((zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
	DENY, ACE_OWNER, 0) &&
	zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
	ALLOW, ACE_OWNER, OWNER_ALLOW_MASK) &&
	zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++), DENY,
	OWNING_GROUP, 0) && zfs_acl_ace_match(aclp, (caddr_t)acep +
	(abstract_size * i++),
	ALLOW, OWNING_GROUP, 0) &&
	zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
	DENY, ACE_EVERYONE, EVERYONE_DENY_MASK) &&
	zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
	ALLOW, ACE_EVERYONE, EVERYONE_ALLOW_MASK))) {
	return (1);
	} else {
	return (0);
	}
	}


	/*
	* Apply step 1g, to group entries
	*
	* Need to deal with corner case where group may have
	* greater permissions than owner. If so then limit
	* group permissions, based on what extra permissions
	* group has.
	*/
	static void
	zfs_fixup_group_entries(zfs_acl_t *aclp, void *acep, void *prevacep,
	mode_t mode)
	{
	uint32_t prevmask = aclp->z_ops.ace_mask_get(prevacep);
	uint32_t mask = aclp->z_ops.ace_mask_get(acep);
	uint16_t prevflags = aclp->z_ops.ace_flags_get(prevacep);
	mode_t extramode = (mode >> 3) & 07;
	mode_t ownermode = (mode >> 6);

	if (prevflags & ACE_IDENTIFIER_GROUP) {

	extramode &= ~ownermode;

	if (extramode) {
	if (extramode & S_IROTH) {
	prevmask &= ~ACE_READ_DATA;
	mask &= ~ACE_READ_DATA;
	}
	if (extramode & S_IWOTH) {
	prevmask &= ~(ACE_WRITE_DATA\|ACE_APPEND_DATA);
	mask &= ~(ACE_WRITE_DATA\|ACE_APPEND_DATA);
	}
	if (extramode & S_IXOTH) {
	prevmask &= ~ACE_EXECUTE;
	mask &= ~ACE_EXECUTE;
	}
	}
	}
	aclp->z_ops.ace_mask_set(acep, mask);
	aclp->z_ops.ace_mask_set(prevacep, prevmask);
	}

	/*
	* Apply the chmod algorithm as described
	* in PSARC/2002/240
	*/
	static void
	zfs_acl_chmod(znode_t *zp, uint64_t mode, zfs_acl_t *aclp)
	{
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	void *acep = NULL, *prevacep = NULL;
	uint64_t who;
	int i;
	int entry_type;
	int reuse_deny;
	int need_canonical_six = 1;
	uint16_t iflags, type;
	uint32_t access_mask;

	ASSERT(MUTEX_HELD(&zp->z_acl_lock));
	ASSERT(MUTEX_HELD(&zp->z_lock));

	aclp->z_hints = (zp->z_phys->zp_flags & V4_ACL_WIDE_FLAGS);

	/*
	* If discard then just discard all ACL nodes which
	* represent the ACEs.
	*
	* New owner@/group@/everyone@ ACEs will be added
	* later.
	*/
	if (zfsvfs->z_acl_mode == ZFS_ACL_DISCARD)
	zfs_acl_release_nodes(aclp);

	while (acep = zfs_acl_next_ace(aclp, acep, &who, &access_mask,
	&iflags, &type)) {

	entry_type = (iflags & ACE_TYPE_FLAGS);
	iflags = (iflags & ALL_INHERIT);

	if ((type != ALLOW && type != DENY) \|\|
	(iflags & ACE_INHERIT_ONLY_ACE)) {
	if (iflags)
	aclp->z_hints \|= ZFS_INHERIT_ACE;
	switch (type) {
	case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
	case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
	aclp->z_hints \|= ZFS_ACL_OBJ_ACE;
	break;
	}
	goto nextace;
	}

	/*
	* Need to split ace into two?
	*/
	if ((iflags & (ACE_FILE_INHERIT_ACE\|
	ACE_DIRECTORY_INHERIT_ACE)) &&
	(!(iflags & ACE_INHERIT_ONLY_ACE))) {
	zfs_acl_split_ace(aclp, acep);
	aclp->z_hints \|= ZFS_INHERIT_ACE;
	goto nextace;
	}

	if (entry_type == ACE_OWNER \|\| entry_type == ACE_EVERYONE \|\|
	(entry_type == OWNING_GROUP)) {
	access_mask &= ~OGE_CLEAR;
	aclp->z_ops.ace_mask_set(acep, access_mask);
	goto nextace;
	} else {
	reuse_deny = B_TRUE;
	if (type == ALLOW) {

	/*
	* Check preceding ACE if any, to see
	* if we need to prepend a DENY ACE.
	* This is only applicable when the acl_mode
	* property == groupmask.
	*/
	if (zfsvfs->z_acl_mode == ZFS_ACL_GROUPMASK) {

	reuse_deny = zfs_reuse_deny(aclp, acep,
	prevacep);

	if (!reuse_deny) {
	prevacep =
	zfs_acl_prepend_deny(zp,
	aclp, acep, mode);
	} else {
	zfs_acl_prepend_fixup(
	aclp, prevacep,
	acep, mode,
	zp->z_phys->zp_uid);
	}
	zfs_fixup_group_entries(aclp, acep,
	prevacep, mode);

	}
	}
	}
	nextace:
	prevacep = acep;
	}

	/*
	* Check out last six aces, if we have six.
	*/

	if (aclp->z_acl_count >= 6) {
	if (zfs_have_canonical_six(aclp)) {
	need_canonical_six = 0;
	}
	}

	if (need_canonical_six) {
	size_t abstract_size = aclp->z_ops.ace_abstract_size();
	void *zacep;
	zfs_acl_node_t *aclnode =
	zfs_acl_node_alloc(abstract_size * 6);

	aclnode->z_size = abstract_size * 6;
	aclnode->z_ace_count = 6;
	aclp->z_acl_bytes += aclnode->z_size;
	list_insert_tail(&aclp->z_acl, aclnode);

	zacep = aclnode->z_acldata;

	i = 0;
	zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
	0, DENY, -1, ACE_OWNER);
	zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
	OWNER_ALLOW_MASK, ALLOW, -1, ACE_OWNER);
	zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++), 0,
	DENY, -1, OWNING_GROUP);
	zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++), 0,
	ALLOW, -1, OWNING_GROUP);
	zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
	EVERYONE_DENY_MASK, DENY, -1, ACE_EVERYONE);
	zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
	EVERYONE_ALLOW_MASK, ALLOW, -1, ACE_EVERYONE);
	aclp->z_acl_count += 6;
	}

	zfs_acl_fixup_canonical_six(aclp, mode);
	}

	int
	zfs_acl_chmod_setattr(znode_t *zp, zfs_acl_t **aclp, uint64_t mode)
	{
	int error;

	mutex_enter(&zp->z_lock);
	mutex_enter(&zp->z_acl_lock);
	*aclp = NULL;
	error = zfs_acl_node_read(zp, aclp, B_TRUE);
	if (error == 0)
	zfs_acl_chmod(zp, mode, *aclp);
	mutex_exit(&zp->z_acl_lock);
	mutex_exit(&zp->z_lock);
	return (error);
	}

	/*
	* strip off write_owner and write_acl
	*/
	static void
	zfs_restricted_update(zfsvfs_t *zfsvfs, zfs_acl_t *aclp, void *acep)
	{
	uint32_t mask = aclp->z_ops.ace_mask_get(acep);

	if ((zfsvfs->z_acl_inherit == ZFS_ACL_RESTRICTED) &&
	(aclp->z_ops.ace_type_get(acep) == ALLOW)) {
	mask &= ~RESTRICTED_CLEAR;
	aclp->z_ops.ace_mask_set(acep, mask);
	}
	}

	/*
	* Should ACE be inherited?
	*/
	static int
	zfs_ace_can_use(znode_t *zp, uint16_t acep_flags)
	{
	int vtype = ZTOV(zp)->v_type;
	int iflags = (acep_flags & 0xf);

	if ((vtype == VDIR) && (iflags & ACE_DIRECTORY_INHERIT_ACE))
	return (1);
	else if (iflags & ACE_FILE_INHERIT_ACE)
	return (!((vtype == VDIR) &&
	(iflags & ACE_NO_PROPAGATE_INHERIT_ACE)));
	return (0);
	}

	/*
	* inherit inheritable ACEs from parent
	*/
	static zfs_acl_t *
	zfs_acl_inherit(znode_t *zp, zfs_acl_t *paclp, uint64_t mode,
	boolean_t *need_chmod)
	{
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	void *pacep;
	void *acep, *acep2;
	zfs_acl_node_t *aclnode, *aclnode2;
	zfs_acl_t *aclp = NULL;
	uint64_t who;
	uint32_t access_mask;
	uint16_t iflags, newflags, type;
	size_t ace_size;
	void *data1, *data2;
	size_t data1sz, data2sz;
	boolean_t vdir = ZTOV(zp)->v_type == VDIR;
	boolean_t vreg = ZTOV(zp)->v_type == VREG;
	boolean_t passthrough, passthrough_x, noallow;

	passthrough_x =
	zfsvfs->z_acl_inherit == ZFS_ACL_PASSTHROUGH_X;
	passthrough = passthrough_x \|\|
	zfsvfs->z_acl_inherit == ZFS_ACL_PASSTHROUGH;
	noallow =
	zfsvfs->z_acl_inherit == ZFS_ACL_NOALLOW;

	*need_chmod = B_TRUE;
	pacep = NULL;
	aclp = zfs_acl_alloc(paclp->z_version);
	if (zfsvfs->z_acl_inherit == ZFS_ACL_DISCARD)
	return (aclp);
	while (pacep = zfs_acl_next_ace(paclp, pacep, &who,
	&access_mask, &iflags, &type)) {

	/*
	* don't inherit bogus ACEs
	*/
	if (!zfs_acl_valid_ace_type(type, iflags))
	continue;

	if (noallow && type == ALLOW)
	continue;

	ace_size = aclp->z_ops.ace_size(pacep);

	if (!zfs_ace_can_use(zp, iflags))
	continue;

	/*
	* If owner@, group@, or everyone@ inheritable
	* then zfs_acl_chmod() isn't needed.
	*/
	if (passthrough &&
	((iflags & (ACE_OWNER\|ACE_EVERYONE)) \|\|
	((iflags & OWNING_GROUP) ==
	OWNING_GROUP)) && (vreg \|\| (vdir && (iflags &
	ACE_DIRECTORY_INHERIT_ACE)))) {
	*need_chmod = B_FALSE;

	if (!vdir && passthrough_x &&
	((mode & (S_IXUSR \| S_IXGRP \| S_IXOTH)) == 0)) {
	access_mask &= ~ACE_EXECUTE;
	}
	}

	aclnode = zfs_acl_node_alloc(ace_size);
	list_insert_tail(&aclp->z_acl, aclnode);
	acep = aclnode->z_acldata;

	zfs_set_ace(aclp, acep, access_mask, type,
	who, iflags\|ACE_INHERITED_ACE);

	/*
	* Copy special opaque data if any
	*/
	if ((data1sz = paclp->z_ops.ace_data(pacep, &data1)) != 0) {
	VERIFY((data2sz = aclp->z_ops.ace_data(acep,
	&data2)) == data1sz);
	bcopy(data1, data2, data2sz);
	}
	aclp->z_acl_count++;
	aclnode->z_ace_count++;
	aclp->z_acl_bytes += aclnode->z_size;
	newflags = aclp->z_ops.ace_flags_get(acep);

	if (vdir)
	aclp->z_hints \|= ZFS_INHERIT_ACE;

	if ((iflags & ACE_NO_PROPAGATE_INHERIT_ACE) \|\| !vdir) {
	newflags &= ~ALL_INHERIT;
	aclp->z_ops.ace_flags_set(acep,
	newflags\|ACE_INHERITED_ACE);
	zfs_restricted_update(zfsvfs, aclp, acep);
	continue;
	}

	ASSERT(vdir);

	newflags = aclp->z_ops.ace_flags_get(acep);
	if ((iflags & (ACE_FILE_INHERIT_ACE \|
	ACE_DIRECTORY_INHERIT_ACE)) !=
	ACE_FILE_INHERIT_ACE) {
	aclnode2 = zfs_acl_node_alloc(ace_size);
	list_insert_tail(&aclp->z_acl, aclnode2);
	acep2 = aclnode2->z_acldata;
	zfs_set_ace(aclp, acep2,
	access_mask, type, who,
	iflags\|ACE_INHERITED_ACE);
	newflags \|= ACE_INHERIT_ONLY_ACE;
	aclp->z_ops.ace_flags_set(acep, newflags);
	newflags &= ~ALL_INHERIT;
	aclp->z_ops.ace_flags_set(acep2,
	newflags\|ACE_INHERITED_ACE);

	/*
	* Copy special opaque data if any
	*/
	if ((data1sz = aclp->z_ops.ace_data(acep,
	&data1)) != 0) {
	VERIFY((data2sz =
	aclp->z_ops.ace_data(acep2,
	&data2)) == data1sz);
	bcopy(data1, data2, data1sz);
	}
	aclp->z_acl_count++;
	aclnode2->z_ace_count++;
	aclp->z_acl_bytes += aclnode->z_size;
	zfs_restricted_update(zfsvfs, aclp, acep2);
	} else {
	newflags \|= ACE_INHERIT_ONLY_ACE;
	aclp->z_ops.ace_flags_set(acep,
	newflags\|ACE_INHERITED_ACE);
	}
	}
	return (aclp);
	}

	/*
	* Create file system object initial permissions
	* including inheritable ACEs.
	*/
	void
	zfs_perm_init(znode_t *zp, znode_t *parent, int flag,
	vattr_t *vap, dmu_tx_t *tx, cred_t *cr,
	zfs_acl_t *setaclp, zfs_fuid_info_t **fuidp)
	{
	uint64_t mode, fuid, fgid;
	int error;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	zfs_acl_t *aclp = NULL;
	zfs_acl_t *paclp;
	xvattr_t *xvap = (xvattr_t *)vap;
	gid_t gid;
	boolean_t need_chmod = B_TRUE;

	if (setaclp)
	aclp = setaclp;

	mode = MAKEIMODE(vap->va_type, vap->va_mode);

	/*
	* Determine uid and gid.
	*/
	if ((flag & (IS_ROOT_NODE \| IS_REPLAY)) \|\|
	((flag & IS_XATTR) && (vap->va_type == VDIR))) {
	fuid = zfs_fuid_create(zfsvfs, vap->va_uid, cr,
	ZFS_OWNER, tx, fuidp);
	fgid = zfs_fuid_create(zfsvfs, vap->va_gid, cr,
	ZFS_GROUP, tx, fuidp);
	gid = vap->va_gid;
	} else {
	fuid = zfs_fuid_create_cred(zfsvfs, ZFS_OWNER, tx, cr, fuidp);
	fgid = 0;
	if (vap->va_mask & AT_GID) {
	fgid = zfs_fuid_create(zfsvfs, vap->va_gid, cr,
	ZFS_GROUP, tx, fuidp);
	gid = vap->va_gid;
	if (fgid != parent->z_phys->zp_gid &&
	!groupmember(vap->va_gid, cr) &&
	secpolicy_vnode_create_gid(cr) != 0)
	fgid = 0;
	}
	if (fgid == 0) {
	if (parent->z_phys->zp_mode & S_ISGID) {
	fgid = parent->z_phys->zp_gid;
	gid = zfs_fuid_map_id(zfsvfs, fgid,
	cr, ZFS_GROUP);
	} else {
	fgid = zfs_fuid_create_cred(zfsvfs,
	ZFS_GROUP, tx, cr, fuidp);
	#ifdef __FreeBSD__
	gid = fgid = parent->z_phys->zp_gid;
	#else
	gid = crgetgid(cr);
	#endif
	}
	}
	}

	/*
	* If we're creating a directory, and the parent directory has the
	* set-GID bit set, set it on the new directory.
	* Otherwise, if the user is neither privileged nor a member of the
	* file's new group, clear the file's set-GID bit.
	*/

	if ((parent->z_phys->zp_mode & S_ISGID) && (vap->va_type == VDIR)) {
	mode \|= S_ISGID;
	} else {
	if ((mode & S_ISGID) &&
	secpolicy_vnode_setids_setgids(ZTOV(zp), cr, gid) != 0)
	mode &= ~S_ISGID;
	}

	zp->z_phys->zp_uid = fuid;
	zp->z_phys->zp_gid = fgid;
	zp->z_phys->zp_mode = mode;

	if (aclp == NULL) {
	mutex_enter(&parent->z_lock);
	if ((ZTOV(parent)->v_type == VDIR &&
	(parent->z_phys->zp_flags & ZFS_INHERIT_ACE)) &&
	!(zp->z_phys->zp_flags & ZFS_XATTR)) {
	mutex_enter(&parent->z_acl_lock);
	VERIFY(0 == zfs_acl_node_read(parent, &paclp, B_FALSE));
	mutex_exit(&parent->z_acl_lock);
	aclp = zfs_acl_inherit(zp, paclp, mode, &need_chmod);
	zfs_acl_free(paclp);
	} else {
	aclp = zfs_acl_alloc(zfs_acl_version_zp(zp));
	}
	mutex_exit(&parent->z_lock);
	mutex_enter(&zp->z_lock);
	mutex_enter(&zp->z_acl_lock);
	if (need_chmod)
	zfs_acl_chmod(zp, mode, aclp);
	} else {
	mutex_enter(&zp->z_lock);
	mutex_enter(&zp->z_acl_lock);
	}

	/* Force auto_inherit on all new directory objects */
	if (vap->va_type == VDIR)
	aclp->z_hints \|= ZFS_ACL_AUTO_INHERIT;

	error = zfs_aclset_common(zp, aclp, cr, fuidp, tx);

	/* Set optional attributes if any */
	if (vap->va_mask & AT_XVATTR)
	zfs_xvattr_set(zp, xvap);

	mutex_exit(&zp->z_lock);
	mutex_exit(&zp->z_acl_lock);
	ASSERT3U(error, ==, 0);

	if (aclp != setaclp)
	zfs_acl_free(aclp);
	}

	/*
	* Retrieve a file's ACL
	*/
	int
	zfs_getacl(znode_t *zp, vsecattr_t *vsecp, boolean_t skipaclchk, cred_t *cr)
	{
	zfs_acl_t *aclp;
	ulong_t mask;
	int error;
	int count = 0;
	int largeace = 0;

	mask = vsecp->vsa_mask & (VSA_ACE \| VSA_ACECNT \|
	VSA_ACE_ACLFLAGS \| VSA_ACE_ALLTYPES);

	if (error = zfs_zaccess(zp, ACE_READ_ACL, 0, skipaclchk, cr))
	return (error);

	if (mask == 0)
	return (ENOSYS);

	mutex_enter(&zp->z_acl_lock);

	error = zfs_acl_node_read(zp, &aclp, B_FALSE);
	if (error != 0) {
	mutex_exit(&zp->z_acl_lock);
	return (error);
	}

	/*
	* Scan ACL to determine number of ACEs
	*/
	if ((zp->z_phys->zp_flags & ZFS_ACL_OBJ_ACE) &&
	!(mask & VSA_ACE_ALLTYPES)) {
	void *zacep = NULL;
	uint64_t who;
	uint32_t access_mask;
	uint16_t type, iflags;

	while (zacep = zfs_acl_next_ace(aclp, zacep,
	&who, &access_mask, &iflags, &type)) {
	switch (type) {
	case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
	case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
	case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
	largeace++;
	continue;
	default:
	count++;
	}
	}
	vsecp->vsa_aclcnt = count;
	} else
	count = aclp->z_acl_count;

	if (mask & VSA_ACECNT) {
	vsecp->vsa_aclcnt = count;
	}

	if (mask & VSA_ACE) {
	size_t aclsz;

	- zfs_acl_node_t *aclnode = list_head(&aclp->z_acl);
	-
	aclsz = count * sizeof (ace_t) +
	sizeof (ace_object_t) * largeace;

	vsecp->vsa_aclentp = kmem_alloc(aclsz, KM_SLEEP);
	vsecp->vsa_aclentsz = aclsz;

	if (aclp->z_version == ZFS_ACL_VERSION_FUID)
	zfs_copy_fuid_2_ace(zp->z_zfsvfs, aclp, cr,
	vsecp->vsa_aclentp, !(mask & VSA_ACE_ALLTYPES));
	else {
	- bcopy(aclnode->z_acldata, vsecp->vsa_aclentp,
	- count * sizeof (ace_t));
	+ zfs_acl_node_t *aclnode;
	+ void *start = vsecp->vsa_aclentp;
	+
	+ for (aclnode = list_head(&aclp->z_acl); aclnode;
	+ aclnode = list_next(&aclp->z_acl, aclnode)) {
	+ bcopy(aclnode->z_acldata, start,
	+ aclnode->z_size);
	+ start = (caddr_t)start + aclnode->z_size;
	+ }
	+ ASSERT((caddr_t)start - (caddr_t)vsecp->vsa_aclentp ==
	+ aclp->z_acl_bytes);
	}
	}
	if (mask & VSA_ACE_ACLFLAGS) {
	vsecp->vsa_aclflags = 0;
	if (zp->z_phys->zp_flags & ZFS_ACL_DEFAULTED)
	vsecp->vsa_aclflags \|= ACL_DEFAULTED;
	if (zp->z_phys->zp_flags & ZFS_ACL_PROTECTED)
	vsecp->vsa_aclflags \|= ACL_PROTECTED;
	if (zp->z_phys->zp_flags & ZFS_ACL_AUTO_INHERIT)
	vsecp->vsa_aclflags \|= ACL_AUTO_INHERIT;
	}

	mutex_exit(&zp->z_acl_lock);

	zfs_acl_free(aclp);

	return (0);
	}

	int
	zfs_vsec_2_aclp(zfsvfs_t *zfsvfs, vtype_t obj_type,
	vsecattr_t *vsecp, zfs_acl_t **zaclp)
	{
	zfs_acl_t *aclp;
	zfs_acl_node_t *aclnode;
	int aclcnt = vsecp->vsa_aclcnt;
	int error;

	if (vsecp->vsa_aclcnt > MAX_ACL_ENTRIES \|\| vsecp->vsa_aclcnt <= 0)
	return (EINVAL);

	aclp = zfs_acl_alloc(zfs_acl_version(zfsvfs->z_version));

	aclp->z_hints = 0;
	aclnode = zfs_acl_node_alloc(aclcnt * sizeof (zfs_object_ace_t));
	if (aclp->z_version == ZFS_ACL_VERSION_INITIAL) {
	if ((error = zfs_copy_ace_2_oldace(obj_type, aclp,
	(ace_t *)vsecp->vsa_aclentp, aclnode->z_acldata,
	aclcnt, &aclnode->z_size)) != 0) {
	zfs_acl_free(aclp);
	zfs_acl_node_free(aclnode);
	return (error);
	}
	} else {
	if ((error = zfs_copy_ace_2_fuid(obj_type, aclp,
	vsecp->vsa_aclentp, aclnode->z_acldata, aclcnt,
	&aclnode->z_size)) != 0) {
	zfs_acl_free(aclp);
	zfs_acl_node_free(aclnode);
	return (error);
	}
	}
	aclp->z_acl_bytes = aclnode->z_size;
	aclnode->z_ace_count = aclcnt;
	aclp->z_acl_count = aclcnt;
	list_insert_head(&aclp->z_acl, aclnode);

	/*
	* If flags are being set then add them to z_hints
	*/
	if (vsecp->vsa_mask & VSA_ACE_ACLFLAGS) {
	if (vsecp->vsa_aclflags & ACL_PROTECTED)
	aclp->z_hints \|= ZFS_ACL_PROTECTED;
	if (vsecp->vsa_aclflags & ACL_DEFAULTED)
	aclp->z_hints \|= ZFS_ACL_DEFAULTED;
	if (vsecp->vsa_aclflags & ACL_AUTO_INHERIT)
	aclp->z_hints \|= ZFS_ACL_AUTO_INHERIT;
	}

	*zaclp = aclp;

	return (0);
	}

	/*
	* Set a file's ACL
	*/
	int
	zfs_setacl(znode_t *zp, vsecattr_t *vsecp, boolean_t skipaclchk, cred_t *cr)
	{
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	zilog_t *zilog = zfsvfs->z_log;
	ulong_t mask = vsecp->vsa_mask & (VSA_ACE \| VSA_ACECNT);
	dmu_tx_t *tx;
	int error;
	zfs_acl_t *aclp;
	zfs_fuid_info_t *fuidp = NULL;

	if (mask == 0)
	return (ENOSYS);

	if (zp->z_phys->zp_flags & ZFS_IMMUTABLE)
	return (EPERM);

	if (error = zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr))
	return (error);

	error = zfs_vsec_2_aclp(zfsvfs, ZTOV(zp)->v_type, vsecp, &aclp);
	if (error)
	return (error);

	/*
	* If ACL wide flags aren't being set then preserve any
	* existing flags.
	*/
	if (!(vsecp->vsa_mask & VSA_ACE_ACLFLAGS)) {
	aclp->z_hints \|= (zp->z_phys->zp_flags & V4_ACL_WIDE_FLAGS);
	}
	top:
	if (error = zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr)) {
	zfs_acl_free(aclp);
	return (error);
	}

	mutex_enter(&zp->z_lock);
	mutex_enter(&zp->z_acl_lock);

	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_bonus(tx, zp->z_id);

	if (zp->z_phys->zp_acl.z_acl_extern_obj) {
	/* Are we upgrading ACL? */
	if (zfsvfs->z_version <= ZPL_VERSION_FUID &&
	zp->z_phys->zp_acl.z_acl_version ==
	ZFS_ACL_VERSION_INITIAL) {
	dmu_tx_hold_free(tx,
	zp->z_phys->zp_acl.z_acl_extern_obj,
	0, DMU_OBJECT_END);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
	0, aclp->z_acl_bytes);
	} else {
	dmu_tx_hold_write(tx,
	zp->z_phys->zp_acl.z_acl_extern_obj,
	0, aclp->z_acl_bytes);
	}
	} else if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, aclp->z_acl_bytes);
	}
	if (aclp->z_has_fuids) {
	if (zfsvfs->z_fuid_obj == 0) {
	dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
	} else {
	dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
	dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	}
	}

	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	mutex_exit(&zp->z_acl_lock);
	mutex_exit(&zp->z_lock);

	if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	zfs_acl_free(aclp);
	return (error);
	}

	error = zfs_aclset_common(zp, aclp, cr, &fuidp, tx);
	ASSERT(error == 0);

	zfs_log_acl(zilog, tx, zp, vsecp, fuidp);

	if (fuidp)
	zfs_fuid_info_free(fuidp);
	zfs_acl_free(aclp);
	dmu_tx_commit(tx);
	done:
	mutex_exit(&zp->z_acl_lock);
	mutex_exit(&zp->z_lock);

	return (error);
	}

	/*
	* working_mode returns the permissions that were not granted
	*/
	static int
	zfs_zaccess_common(znode_t *zp, uint32_t v4_mode, uint32_t *working_mode,
	boolean_t *check_privs, boolean_t skipaclchk, cred_t *cr)
	{
	zfs_acl_t *aclp;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	int error;
	uid_t uid = crgetuid(cr);
	uint64_t who;
	uint16_t type, iflags;
	uint16_t entry_type;
	uint32_t access_mask;
	uint32_t deny_mask = 0;
	zfs_ace_hdr_t *acep = NULL;
	boolean_t checkit;
	uid_t fowner;
	uid_t gowner;

	/*
	* Short circuit empty requests
	*/
	if (v4_mode == 0)
	return (0);

	*check_privs = B_TRUE;

	if (zfsvfs->z_assign >= TXG_INITIAL) { /* ZIL replay */
	*working_mode = 0;
	return (0);
	}

	*working_mode = v4_mode;

	if ((v4_mode & WRITE_MASK) &&
	(zp->z_zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) &&
	(!IS_DEVVP(ZTOV(zp)))) {
	*check_privs = B_FALSE;
	return (EROFS);
	}

	/*
	* Only check for READONLY on non-directories.
	*/
	if ((v4_mode & WRITE_MASK_DATA) &&
	(((ZTOV(zp)->v_type != VDIR) &&
	(zp->z_phys->zp_flags & (ZFS_READONLY | ZFS_IMMUTABLE))) ||
	(ZTOV(zp)->v_type == VDIR &&
	(zp->z_phys->zp_flags & ZFS_IMMUTABLE)))) {
	*check_privs = B_FALSE;
	return (EPERM);
	}

	#ifdef sun
	if ((v4_mode & (ACE_DELETE | ACE_DELETE_CHILD)) &&
	(zp->z_phys->zp_flags & ZFS_NOUNLINK)) {
	*check_privs = B_FALSE;
	return (EPERM);
	}
	#else
	/*
	* In FreeBSD we allow modifying a directory's contents if ZFS_NOUNLINK
	* (sunlnk) is set. We just don't allow directory removal, which is
	* handled in zfs_zaccess_delete().
	*/
	if ((v4_mode & ACE_DELETE) &&
	(zp->z_phys->zp_flags & ZFS_NOUNLINK)) {
	*check_privs = B_FALSE;
	return (EPERM);
	}
	#endif

	if (((v4_mode & (ACE_READ_DATA|ACE_EXECUTE)) &&
	(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED))) {
	*check_privs = B_FALSE;
	return (EACCES);
	}

	/*
	* The caller requested that the ACL check be skipped. This
	* would only happen if the caller checked VOP_ACCESS() with a
	* 32 bit ACE mask and already had the appropriate permissions.
	*/
	if (skipaclchk) {
	*working_mode = 0;
	return (0);
	}

	zfs_fuid_map_ids(zp, cr, &fowner, &gowner);

	mutex_enter(&zp->z_acl_lock);

	error = zfs_acl_node_read(zp, &aclp, B_FALSE);
	if (error != 0) {
	mutex_exit(&zp->z_acl_lock);
	return (error);
	}

	while (acep = zfs_acl_next_ace(aclp, acep, &who, &access_mask,
	&iflags, &type)) {

	if (!zfs_acl_valid_ace_type(type, iflags))
	continue;

	if (ZTOV(zp)->v_type == VDIR && (iflags & ACE_INHERIT_ONLY_ACE))
	continue;

	entry_type = (iflags & ACE_TYPE_FLAGS);

	checkit = B_FALSE;

	switch (entry_type) {
	case ACE_OWNER:
	if (uid == fowner)
	checkit = B_TRUE;
	break;
	case OWNING_GROUP:
	who = gowner;
	/*FALLTHROUGH*/
	case ACE_IDENTIFIER_GROUP:
	checkit = zfs_groupmember(zfsvfs, who, cr);
	break;
	case ACE_EVERYONE:
	checkit = B_TRUE;
	break;

	/* USER Entry */
	default:
	if (entry_type == 0) {
	uid_t newid;

	newid = zfs_fuid_map_id(zfsvfs, who, cr,
	ZFS_ACE_USER);
	if (newid != IDMAP_WK_CREATOR_OWNER_UID &&
	uid == newid)
	checkit = B_TRUE;
	break;
	} else {
	zfs_acl_free(aclp);
	mutex_exit(&zp->z_acl_lock);
	return (EIO);
	}
	}

	if (checkit) {
	uint32_t mask_matched = (access_mask & *working_mode);

	if (mask_matched) {
	if (type == DENY)
	deny_mask |= mask_matched;

	*working_mode &= ~mask_matched;
	}
	}

	/* Are we done? */
	if (*working_mode == 0)
	break;
	}

	mutex_exit(&zp->z_acl_lock);
	zfs_acl_free(aclp);

	/* Put the found 'denies' back on the working mode */
	if (deny_mask) {
	*working_mode |= deny_mask;
	return (EACCES);
	} else if (*working_mode) {
	return (-1);
	}

	return (0);
	}
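Review note: the ACE walk in zfs_zaccess_common() can be modelled in userland as below. This is only a sketch under stated assumptions: struct sk_ace, its applies flag (standing in for the owner/group/everyone matching) and sk_acl_walk() are hypothetical names, and the read-only, immutable and quarantine checks are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified ACE; not the on-disk zfs_ace_hdr_t layout. */
enum sk_ace_type { SK_ALLOW, SK_DENY };

struct sk_ace {
	enum sk_ace_type type;	/* allow or deny entry */
	uint32_t mask;		/* permission bits the entry covers */
	int applies;		/* nonzero if the entry matches the caller */
};

/*
 * Walk the entries in order.  Each matching entry claims the still
 * undecided bits it covers; bits claimed by a deny entry are remembered
 * and put back at the end.  Returns 0 if every requested bit was allowed,
 * EACCES if any bit was explicitly denied, and -1 if some bits were never
 * mentioned.  *working holds the bits that were not granted.
 */
static int
sk_acl_walk(const struct sk_ace *aces, size_t n, uint32_t want,
    uint32_t *working)
{
	uint32_t deny_mask = 0;
	size_t i;

	assert(working != NULL);
	*working = want;
	for (i = 0; i < n && *working != 0; i++) {
		uint32_t matched;

		if (!aces[i].applies)
			continue;
		matched = aces[i].mask & *working;
		if (aces[i].type == SK_DENY)
			deny_mask |= matched;
		*working &= ~matched;
	}
	if (deny_mask != 0) {
		*working |= deny_mask;
		return (EACCES);
	}
	return (*working != 0 ? -1 : 0);
}
```

Because the first matching entry decides each bit, a deny entry listed before an allow entry for the same bit wins. The -1 return is what lets a caller tell "not mentioned" apart from "explicitly denied".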

	static int
	zfs_zaccess_append(znode_t *zp, uint32_t *working_mode, boolean_t *check_privs,
	cred_t *cr)
	{
	if (*working_mode != ACE_WRITE_DATA)
	return (EACCES);

	return (zfs_zaccess_common(zp, ACE_APPEND_DATA, working_mode,
	check_privs, B_FALSE, cr));
	}

	/*
	* Determine whether access should be granted or denied, invoking the
	* least priv subsystem when a deny is determined.
	*/
	int
	zfs_zaccess(znode_t *zp, int mode, int flags, boolean_t skipaclchk, cred_t *cr)
	{
	uint32_t working_mode;
	int error;
	int is_attr;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	boolean_t check_privs;
	znode_t *xzp;
	znode_t *check_zp = zp;

	is_attr = ((zp->z_phys->zp_flags & ZFS_XATTR) &&
	(ZTOV(zp)->v_type == VDIR));

	#ifdef __FreeBSD__
	/*
	* In FreeBSD, we don't care about permissions of individual ADS.
	* Note that not checking them is not just an optimization - without
	* this shortcut, EA operations may bogusly fail with EACCES.
	*/
	if (zp->z_phys->zp_flags & ZFS_XATTR)
	return (0);
	#else
	/*
	* If attribute then validate against base file
	*/
	if (is_attr) {
	if ((error = zfs_zget(zp->z_zfsvfs,
	zp->z_phys->zp_parent, &xzp)) != 0) {
	return (error);
	}

	check_zp = xzp;

	/*
	* fixup mode to map to xattr perms
	*/

	if (mode & (ACE_WRITE_DATA|ACE_APPEND_DATA)) {
	mode &= ~(ACE_WRITE_DATA|ACE_APPEND_DATA);
	mode |= ACE_WRITE_NAMED_ATTRS;
	}

	if (mode & (ACE_READ_DATA|ACE_EXECUTE)) {
	mode &= ~(ACE_READ_DATA|ACE_EXECUTE);
	mode |= ACE_READ_NAMED_ATTRS;
	}
	}
	#endif

	if ((error = zfs_zaccess_common(check_zp, mode, &working_mode,
	&check_privs, skipaclchk, cr)) == 0) {
	if (is_attr)
	VN_RELE(ZTOV(xzp));
	return (0);
	}

	if (error && !check_privs) {
	if (is_attr)
	VN_RELE(ZTOV(xzp));
	return (error);
	}

	if (error && (flags & V_APPEND)) {
	error = zfs_zaccess_append(zp, &working_mode, &check_privs, cr);
	}

	if (error && check_privs) {
	uid_t owner;
	mode_t checkmode = 0;

	owner = zfs_fuid_map_id(zfsvfs, check_zp->z_phys->zp_uid, cr,
	ZFS_OWNER);

	/*
	* First check for implicit owner permission on
	* read_acl/read_attributes
	*/

	error = 0;
	ASSERT(working_mode != 0);

	if ((working_mode & (ACE_READ_ACL|ACE_READ_ATTRIBUTES) &&
	owner == crgetuid(cr)))
	working_mode &= ~(ACE_READ_ACL|ACE_READ_ATTRIBUTES);

	if (working_mode & (ACE_READ_DATA|ACE_READ_NAMED_ATTRS|
	ACE_READ_ACL|ACE_READ_ATTRIBUTES|ACE_SYNCHRONIZE))
	checkmode |= VREAD;
	if (working_mode & (ACE_WRITE_DATA|ACE_WRITE_NAMED_ATTRS|
	ACE_APPEND_DATA|ACE_WRITE_ATTRIBUTES|ACE_SYNCHRONIZE))
	checkmode |= VWRITE;
	if (working_mode & ACE_EXECUTE)
	checkmode |= VEXEC;

	if (checkmode)
	error = secpolicy_vnode_access(cr, ZTOV(check_zp),
	owner, checkmode);

	if (error == 0 && (working_mode & ACE_WRITE_OWNER))
	error = secpolicy_vnode_chown(ZTOV(check_zp), cr, B_TRUE);
	if (error == 0 && (working_mode & ACE_WRITE_ACL))
	error = secpolicy_vnode_setdac(ZTOV(check_zp), cr, owner);

	if (error == 0 && (working_mode &
	(ACE_DELETE|ACE_DELETE_CHILD)))
	error = secpolicy_vnode_remove(ZTOV(check_zp), cr);

	if (error == 0 && (working_mode & ACE_SYNCHRONIZE)) {
	error = secpolicy_vnode_chown(ZTOV(check_zp), cr, B_FALSE);
	}
	if (error == 0) {
	/*
	* See if any bits other than those already checked
	* for are still present. If so then return EACCES
	*/
	if (working_mode & ~(ZFS_CHECKED_MASKS)) {
	error = EACCES;
	}
	}
	}

	if (is_attr)
	VN_RELE(ZTOV(xzp));

	return (error);
	}

	/*
	* Translate traditional unix VREAD/VWRITE/VEXEC mode into
	* native ACL format and call zfs_zaccess()
	*/
	int
	zfs_zaccess_rwx(znode_t *zp, mode_t mode, int flags, cred_t *cr)
	{
	return (zfs_zaccess(zp, zfs_unix_to_v4(mode >> 6), flags, B_FALSE, cr));
	}
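Review note: zfs_zaccess_rwx() shifts the owner-class VREAD/VWRITE/VEXEC bits down by six so that they land in the low three bits before translation. A minimal sketch of that idea follows. It assumes the usual octal layout (0400/0200/0100); the SK_ constants and sk_unix_to_v4() are illustrative stand-ins, not the definitions used by zfs_unix_to_v4().

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative constants; the real values live in the system headers. */
#define	SK_VREAD		0400u
#define	SK_VWRITE		0200u
#define	SK_VEXEC		0100u
#define	SK_ACE_READ_DATA	0x01u
#define	SK_ACE_WRITE_DATA	0x02u
#define	SK_ACE_EXECUTE		0x20u

/* Hypothetical translator: expects rwx in the low three bits. */
static uint32_t
sk_unix_to_v4(uint32_t rwx)
{
	uint32_t v4 = 0;

	assert((rwx & ~07u) == 0);
	if (rwx & 04u)
		v4 |= SK_ACE_READ_DATA;
	if (rwx & 02u)
		v4 |= SK_ACE_WRITE_DATA;
	if (rwx & 01u)
		v4 |= SK_ACE_EXECUTE;
	return (v4);
}

/* Shift the owner-class bits into the low three bits, then translate. */
static uint32_t
sk_vmode_to_v4(uint32_t vmode)
{
	return (sk_unix_to_v4((vmode >> 6) & 07u));
}
```

The extra `& 07u` mask is a choice made for this sketch; the function under review passes `mode >> 6` through unmasked.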

	/*
	* Access function for secpolicy_vnode_setattr
	*/
	int
	zfs_zaccess_unix(znode_t *zp, mode_t mode, cred_t *cr)
	{
	int v4_mode = zfs_unix_to_v4(mode >> 6);

	return (zfs_zaccess(zp, v4_mode, 0, B_FALSE, cr));
	}

	static int
	zfs_delete_final_check(znode_t *zp, znode_t *dzp,
	mode_t missing_perms, cred_t *cr)
	{
	int error;
	uid_t downer;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;

	downer = zfs_fuid_map_id(zfsvfs, dzp->z_phys->zp_uid, cr, ZFS_OWNER);

	error = secpolicy_vnode_access(cr, ZTOV(dzp), downer, missing_perms);

	if (error == 0)
	error = zfs_sticky_remove_access(dzp, zp, cr);

	return (error);
	}

	/*
	* Determine whether access should be granted or denied, without
	* consulting least priv subsystem.
	*
	*
	* The following chart is the recommended NFSv4 enforcement for
	* ability to delete an object.
	*
	* -------------------------------------------------------
	* |   Parent Dir  |      Target Object Permissions      |
	* |  permissions  |                                     |
	* -------------------------------------------------------
	* |               | ACL Allows | ACL Denies| Delete     |
	* |               |  Delete    |  Delete   | unspecified|
	* -------------------------------------------------------
	* |  ACL Allows   | Permit     | Permit    | Permit     |
	* |  DELETE_CHILD |            |           |            |
	* -------------------------------------------------------
	* |  ACL Denies   | Permit     | Deny      | Deny       |
	* |  DELETE_CHILD |            |           |            |
	* -------------------------------------------------------
	* | ACL specifies |            |           |            |
	* | only allow    | Permit     | Permit    | Permit     |
	* | write and     |            |           |            |
	* | execute       |            |           |            |
	* -------------------------------------------------------
	* | ACL denies    |            |           |            |
	* | write and     | Permit     | Deny      | Deny       |
	* | execute       |            |           |            |
	* -------------------------------------------------------
	*    ^
	*    |
	*    No search privilege, can't even look up file?
	*
	*/
	int
	zfs_zaccess_delete(znode_t *dzp, znode_t *zp, cred_t *cr)
	{
	uint32_t dzp_working_mode = 0;
	uint32_t zp_working_mode = 0;
	int dzp_error, zp_error;
	mode_t missing_perms;
	boolean_t dzpcheck_privs = B_TRUE;
	boolean_t zpcheck_privs = B_TRUE;

	/*
	* We want specific DELETE permissions to
	* take precedence over WRITE/EXECUTE. We don't
	* want an ACL such as this to mess us up.
	* user:joe:write_data:deny,user:joe:delete:allow
	*
	* However, deny permissions may ultimately be overridden
	* by secpolicy_vnode_access().
	*
	* We will ask for all of the necessary permissions and then
	* look at the working modes from the directory and target object
	* to determine what was found.
	*/

	if (zp->z_phys->zp_flags & (ZFS_IMMUTABLE | ZFS_NOUNLINK))
	return (EPERM);

	/*
	* First row
	* If the directory permissions allow the delete, we are done.
	*/
	if ((dzp_error = zfs_zaccess_common(dzp, ACE_DELETE_CHILD,
	&dzp_working_mode, &dzpcheck_privs, B_FALSE, cr)) == 0)
	return (0);

	/*
	* If target object has delete permission then we are done
	*/
	if ((zp_error = zfs_zaccess_common(zp, ACE_DELETE, &zp_working_mode,
	&zpcheck_privs, B_FALSE, cr)) == 0)
	return (0);

	ASSERT(dzp_error && zp_error);

	if (!dzpcheck_privs)
	return (dzp_error);
	if (!zpcheck_privs)
	return (zp_error);

	/*
	* Second row
	*
	* If directory returns EACCES then delete_child was denied
	* due to deny delete_child. In this case send the request through
	* secpolicy_vnode_remove(). We don't use zfs_delete_final_check()
	* since that could allow the delete based on write/execute permission
	* and we want delete permissions to override write/execute.
	*/

	if (dzp_error == EACCES)
	return (secpolicy_vnode_remove(ZTOV(dzp), cr)); /* XXXPJD: s/dzp/zp/ ? */

	/*
	* Third Row
	* only need to see if we have write/execute on directory.
	*/

	if ((dzp_error = zfs_zaccess_common(dzp, ACE_EXECUTE|ACE_WRITE_DATA,
	&dzp_working_mode, &dzpcheck_privs, B_FALSE, cr)) == 0)
	return (zfs_sticky_remove_access(dzp, zp, cr));

	if (!dzpcheck_privs)
	return (dzp_error);

	/*
	* Fourth row
	*/

	missing_perms = (dzp_working_mode & ACE_WRITE_DATA) ? VWRITE : 0;
	missing_perms |= (dzp_working_mode & ACE_EXECUTE) ? VEXEC : 0;

	ASSERT(missing_perms);

	return (zfs_delete_final_check(zp, dzp, missing_perms, cr));

	}
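Review note: the NFSv4 delete chart in the comment above zfs_zaccess_delete() can be read as a plain lookup table. The sketch below encodes only the chart. It does not model the secpolicy overrides or the sticky-bit check that the function applies afterwards, and every sk_ name is hypothetical.

```c
#include <assert.h>

/* Rows of the chart: what the parent directory's ACL says. */
enum sk_dir_state {
	SK_DIR_ALLOW_DELETE_CHILD,
	SK_DIR_DENY_DELETE_CHILD,
	SK_DIR_ALLOW_WRITE_EXEC,
	SK_DIR_DENY_WRITE_EXEC,
	SK_DIR_NSTATES
};

/* Columns of the chart: what the target object's ACL says. */
enum sk_obj_state {
	SK_OBJ_ALLOW_DELETE,
	SK_OBJ_DENY_DELETE,
	SK_OBJ_DELETE_UNSPECIFIED,
	SK_OBJ_NSTATES
};

/* Returns 1 for "Permit" and 0 for "Deny", exactly as charted. */
static int
sk_delete_chart(enum sk_dir_state d, enum sk_obj_state o)
{
	static const int chart[SK_DIR_NSTATES][SK_OBJ_NSTATES] = {
		{ 1, 1, 1 },	/* ACL allows DELETE_CHILD */
		{ 1, 0, 0 },	/* ACL denies DELETE_CHILD */
		{ 1, 1, 1 },	/* ACL only allows write and execute */
		{ 1, 0, 0 },	/* ACL denies write and execute */
	};

	assert(d >= 0 && d < SK_DIR_NSTATES);
	assert(o >= 0 && o < SK_OBJ_NSTATES);
	return (chart[d][o]);
}
```

Read this way, an explicit allow of delete on the target object always permits the operation, whatever the directory says.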

	int
	zfs_zaccess_rename(znode_t *sdzp, znode_t *szp, znode_t *tdzp,
	znode_t *tzp, cred_t *cr)
	{
	int add_perm;
	int error;

	if (szp->z_phys->zp_flags & ZFS_AV_QUARANTINED)
	return (EACCES);

	add_perm = (ZTOV(szp)->v_type == VDIR) ?
	ACE_ADD_SUBDIRECTORY : ACE_ADD_FILE;

	/*
	* Rename permissions are combination of delete permission +
	* add file/subdir permission.
	*
	* BSD operating systems also require write permission
	* on the directory being moved from one parent directory
	* to another.
	*/
	if (ZTOV(szp)->v_type == VDIR && ZTOV(sdzp) != ZTOV(tdzp)) {
	if (error = zfs_zaccess(szp, ACE_WRITE_DATA, 0, B_FALSE, cr))
	return (error);
	}

	/*
	* first make sure we do the delete portion.
	*
	* If that succeeds then check for add_file/add_subdir permissions
	*/

	if (error = zfs_zaccess_delete(sdzp, szp, cr))
	return (error);

	/*
	* If we have a tzp, see if we can delete it?
	*/
	if (tzp) {
	if (error = zfs_zaccess_delete(tdzp, tzp, cr))
	return (error);
	}

	/*
	* Now check for add permissions
	*/
	error = zfs_zaccess(tdzp, add_perm, 0, B_FALSE, cr);

	return (error);
	}
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 209274)
	@@ -1,5053 +1,5051 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	/* Portions Copyright 2007 Jeremy Teo */

	#include <sys/types.h>
	#include <sys/param.h>
	#include <sys/time.h>
	#include <sys/systm.h>
	#include <sys/sysmacros.h>
	#include <sys/resource.h>
	#include <sys/vfs.h>
	#include <sys/vnode.h>
	#include <sys/file.h>
	#include <sys/stat.h>
	#include <sys/kmem.h>
	#include <sys/taskq.h>
	#include <sys/uio.h>
	#include <sys/atomic.h>
	#include <sys/namei.h>
	#include <sys/mman.h>
	#include <sys/cmn_err.h>
	#include <sys/errno.h>
	#include <sys/unistd.h>
	#include <sys/zfs_dir.h>
	#include <sys/zfs_ioctl.h>
	#include <sys/fs/zfs.h>
	#include <sys/dmu.h>
	#include <sys/spa.h>
	#include <sys/txg.h>
	#include <sys/dbuf.h>
	#include <sys/zap.h>
	#include <sys/dirent.h>
	#include <sys/policy.h>
	#include <sys/sunddi.h>
	#include <sys/filio.h>
	#include <sys/zfs_ctldir.h>
	#include <sys/zfs_fuid.h>
	#include <sys/dnlc.h>
	#include <sys/zfs_rlock.h>
	#include <sys/extdirent.h>
	#include <sys/kidmap.h>
	#include <sys/bio.h>
	#include <sys/buf.h>
	#include <sys/sf_buf.h>
	#include <sys/sched.h>
	#include <sys/acl.h>

	/*
	* Programming rules.
	*
	* Each vnode op performs some logical unit of work. To do this, the ZPL must
	* properly lock its in-core state, create a DMU transaction, do the work,
	* record this work in the intent log (ZIL), commit the DMU transaction,
	* and wait for the intent log to commit if it is a synchronous operation.
	* Moreover, the vnode ops must work in both normal and log replay context.
	* The ordering of events is important to avoid deadlocks and references
	* to freed memory. The example below illustrates the following Big Rules:
	*
	* (1) A check must be made in each zfs thread for a mounted file system.
	* This is done avoiding races using ZFS_ENTER(zfsvfs).
	* A ZFS_EXIT(zfsvfs) is needed before all returns. Any znodes
	* must be checked with ZFS_VERIFY_ZP(zp). Both of these macros
	* can return EIO from the calling function.
	*
	* (2) VN_RELE() should always be the last thing except for zil_commit()
	* (if necessary) and ZFS_EXIT(). This is for 3 reasons:
	* First, if it's the last reference, the vnode/znode
	* can be freed, so the zp may point to freed memory. Second, the last
	* reference will call zfs_zinactive(), which may induce a lot of work --
	* pushing cached pages (which acquires range locks) and syncing out
	* cached atime changes. Third, zfs_zinactive() may require a new tx,
	* which could deadlock the system if you were already holding one.
	* If you must call VN_RELE() within a tx then use VN_RELE_ASYNC().
	*
	* (3) All range locks must be grabbed before calling dmu_tx_assign(),
	* as they can span dmu_tx_assign() calls.
	*
	* (4) Always pass zfsvfs->z_assign as the second argument to dmu_tx_assign().
	* In normal operation, this will be TXG_NOWAIT. During ZIL replay,
	* it will be a specific txg. Either way, dmu_tx_assign() never blocks.
	* This is critical because we don't want to block while holding locks.
	* Note, in particular, that if a lock is sometimes acquired before
	* the tx assigns, and sometimes after (e.g. z_lock), then failing to
	* use a non-blocking assign can deadlock the system. The scenario:
	*
	* Thread A has grabbed a lock before calling dmu_tx_assign().
	* Thread B is in an already-assigned tx, and blocks for this lock.
	* Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
	* forever, because the previous txg can't quiesce until B's tx commits.
	*
	* If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
	* then drop all locks, call dmu_tx_wait(), and try again.
	*
	* (5) If the operation succeeded, generate the intent log entry for it
	* before dropping locks. This ensures that the ordering of events
	* in the intent log matches the order in which they actually occurred.
	*
	* (6) At the end of each vnode op, the DMU tx must always commit,
	* regardless of whether there were any errors.
	*
	* (7) After dropping all locks, invoke zil_commit(zilog, seq, foid)
	* to ensure that synchronous semantics are provided when necessary.
	*
	* In general, this is how things should be ordered in each vnode op:
	*
	* ZFS_ENTER(zfsvfs); // exit if unmounted
	* top:
	* zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD())
	* rw_enter(...); // grab any other locks you need
	* tx = dmu_tx_create(...); // get DMU tx
	* dmu_tx_hold_*(); // hold each object you might modify
	* error = dmu_tx_assign(tx, zfsvfs->z_assign); // try to assign
	* if (error) {
	* rw_exit(...); // drop locks
	* zfs_dirent_unlock(dl); // unlock directory entry
	* VN_RELE(...); // release held vnodes
	* if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	* dmu_tx_wait(tx);
	* dmu_tx_abort(tx);
	* goto top;
	* }
	* dmu_tx_abort(tx); // abort DMU tx
	* ZFS_EXIT(zfsvfs); // finished in zfs
	* return (error); // really out of space
	* }
	* error = do_real_work(); // do whatever this VOP does
	* if (error == 0)
	* zfs_log_*(...); // on success, make ZIL entry
	* dmu_tx_commit(tx); // commit DMU tx -- error or not
	* rw_exit(...); // drop locks
	* zfs_dirent_unlock(dl); // unlock directory entry
	* VN_RELE(...); // release held vnodes
	* zil_commit(zilog, seq, foid); // synchronous when necessary
	* ZFS_EXIT(zfsvfs); // finished in zfs
	* return (error); // done, report error
	*/
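Review note: rule (4) and the "goto top" pattern in the comment above can be exercised in userland with a stub transaction. This is a sketch under stated assumptions: struct sk_tx, sk_tx_assign(), sk_vnode_op() and SK_ERESTART are hypothetical stand-ins, and the locked flag stands in for whatever locks a real vnode op would hold.

```c
#include <assert.h>
#include <stddef.h>

#define	SK_ERESTART	(-1)	/* hypothetical stand-in for ERESTART */

/* Stub transaction: fails to assign busy_rounds times, then succeeds. */
struct sk_tx {
	int busy_rounds;
	int attempts;
};

/* Non-blocking assign: reports "try again" instead of sleeping. */
static int
sk_tx_assign(struct sk_tx *tx)
{
	tx->attempts++;
	if (tx->busy_rounds > 0) {
		tx->busy_rounds--;
		return (SK_ERESTART);
	}
	return (0);
}

/*
 * Skeleton of a vnode op: take locks, try to assign, and on a restart
 * drop every lock before starting over from the top.
 */
static int
sk_vnode_op(struct sk_tx *tx, int *locked)
{
	int error;

	assert(tx != NULL && locked != NULL);
top:
	*locked = 1;			/* grab locks */
	error = sk_tx_assign(tx);	/* never blocks */
	if (error) {
		*locked = 0;		/* drop locks before waiting */
		if (error == SK_ERESTART)
			goto top;	/* real code waits and aborts first */
		return (error);
	}
	/* the real work, the log entry and the commit would go here */
	*locked = 0;			/* drop locks */
	return (0);
}
```

The point of the pattern is that no lock is held across the wait, which avoids the Thread A / Thread B deadlock described in rule (4).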

	/* ARGSUSED */
	static int
	zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
	{
	znode_t *zp = VTOZ(*vpp);

	if ((flag & FWRITE) && (zp->z_phys->zp_flags & ZFS_APPENDONLY) &&
	((flag & FAPPEND) == 0)) {
	return (EPERM);
	}

	if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
	ZTOV(zp)->v_type == VREG &&
	!(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) &&
	zp->z_phys->zp_size > 0)
	if (fs_vscan(*vpp, cr, 0) != 0)
	return (EACCES);

	/* Keep a count of the synchronous opens in the znode */
	if (flag & (FSYNC | FDSYNC))
	atomic_inc_32(&zp->z_sync_cnt);

	return (0);
	}

	/* ARGSUSED */
	static int
	zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
	caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);

	/* Decrement the synchronous opens in the znode */
	if ((flag & (FSYNC | FDSYNC)) && (count == 1))
	atomic_dec_32(&zp->z_sync_cnt);

	/*
	* Clean up any locks held by this process on the vp.
	*/
	cleanlocks(vp, ddi_get_pid(), 0);
	cleanshares(vp, ddi_get_pid());

	if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
	ZTOV(zp)->v_type == VREG &&
	!(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) &&
	zp->z_phys->zp_size > 0)
	VERIFY(fs_vscan(vp, cr, 1) == 0);

	return (0);
	}

	/*
	* Lseek support for finding holes (cmd == _FIO_SEEK_HOLE) and
	* data (cmd == _FIO_SEEK_DATA). "off" is an in/out parameter.
	*/
	static int
	zfs_holey(vnode_t *vp, u_long cmd, offset_t *off)
	{
	znode_t *zp = VTOZ(vp);
	uint64_t noff = (uint64_t)*off; /* new offset */
	uint64_t file_sz;
	int error;
	boolean_t hole;

	file_sz = zp->z_phys->zp_size;
	if (noff >= file_sz) {
	return (ENXIO);
	}

	if (cmd == _FIO_SEEK_HOLE)
	hole = B_TRUE;
	else
	hole = B_FALSE;

	error = dmu_offset_next(zp->z_zfsvfs->z_os, zp->z_id, hole, &noff);

	/* end of file? */
	if ((error == ESRCH) || (noff > file_sz)) {
	/*
	* Handle the virtual hole at the end of file.
	*/
	if (hole) {
	*off = file_sz;
	return (0);
	}
	return (ENXIO);
	}

	if (noff < *off)
	return (error);
	*off = noff;
	return (error);
	}

	/* ARGSUSED */
	static int
	zfs_ioctl(vnode_t *vp, u_long com, intptr_t data, int flag, cred_t *cred,
	int *rvalp, caller_context_t *ct)
	{
	offset_t off;
	int error;
	zfsvfs_t *zfsvfs;
	znode_t *zp;

	switch (com) {
	case _FIOFFS:
	return (0);

	/*
	* The following two ioctls are used by bfu. Faking out,
	* necessary to avoid bfu errors.
	*/
	case _FIOGDIO:
	case _FIOSDIO:
	return (0);

	case _FIO_SEEK_DATA:
	case _FIO_SEEK_HOLE:
	if (ddi_copyin((void *)data, &off, sizeof (off), flag))
	return (EFAULT);

	zp = VTOZ(vp);
	zfsvfs = zp->z_zfsvfs;
	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	/* offset parameter is in/out */
	error = zfs_holey(vp, com, &off);
	ZFS_EXIT(zfsvfs);
	if (error)
	return (error);
	if (ddi_copyout(&off, (void *)data, sizeof (off), flag))
	return (EFAULT);
	return (0);
	}
	return (ENOTTY);
	}

	/*
	* When a file is memory mapped, we must keep the IO data synchronized
	* between the DMU cache and the memory mapped pages. What this means:
	*
	* On Write: If we find a memory mapped page, we write to both
	* the page and the dmu buffer.
	*
	* NOTE: We will always "break up" the IO into PAGESIZE uiomoves when
	* the file is memory mapped.
	*/
	static int
	mappedwrite(vnode_t *vp, int nbytes, uio_t *uio, dmu_tx_t *tx)
	{
	znode_t *zp = VTOZ(vp);
	objset_t *os = zp->z_zfsvfs->z_os;
	vm_object_t obj;
	vm_page_t m;
	struct sf_buf *sf;
	int64_t start, off;
	int len = nbytes;
	int error = 0;
	uint64_t dirbytes;

	ASSERT(vp->v_mount != NULL);
	obj = vp->v_object;
	ASSERT(obj != NULL);

	start = uio->uio_loffset;
	off = start & PAGEOFFSET;
	dirbytes = 0;
	VM_OBJECT_LOCK(obj);
	for (start &= PAGEMASK; len > 0; start += PAGESIZE) {
	uint64_t bytes = MIN(PAGESIZE - off, len);
	uint64_t fsize;

	again:
	if ((m = vm_page_lookup(obj, OFF_TO_IDX(start))) != NULL &&
	vm_page_is_valid(m, (vm_offset_t)off, bytes)) {
	uint64_t woff;
	caddr_t va;

	if (vm_page_sleep_if_busy(m, FALSE, "zfsmwb"))
	goto again;
	fsize = obj->un_pager.vnp.vnp_size;
	vm_page_busy(m);
	vm_page_lock_queues();
	vm_page_undirty(m);
	vm_page_unlock_queues();
	VM_OBJECT_UNLOCK(obj);
	if (dirbytes > 0) {
	error = dmu_write_uio(os, zp->z_id, uio,
	dirbytes, tx);
	dirbytes = 0;
	}
	if (error == 0) {
	sched_pin();
	sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
	va = (caddr_t)sf_buf_kva(sf);
	woff = uio->uio_loffset - off;
	error = uiomove(va + off, bytes, UIO_WRITE, uio);
	/*
	* The uiomove() above could have been partially
	* successful, that's why we call dmu_write()
	* below unconditionally. The page was marked
	* non-dirty above and we would lose the changes
	* without doing so. If the uiomove() failed
	* entirely, well, we just write what we got
	* before one more time.
	*/
	dmu_write(os, zp->z_id, woff,
	MIN(PAGESIZE, fsize - woff), va, tx);
	sf_buf_free(sf);
	sched_unpin();
	}
	VM_OBJECT_LOCK(obj);
	vm_page_wakeup(m);
	} else {
	if (__predict_false(obj->cache != NULL)) {
	vm_page_cache_free(obj, OFF_TO_IDX(start),
	OFF_TO_IDX(start) + 1);
	}
	dirbytes += bytes;
	}
	len -= bytes;
	off = 0;
	if (error)
	break;
	}
	VM_OBJECT_UNLOCK(obj);
	if (error == 0 && dirbytes > 0)
	error = dmu_write_uio(os, zp->z_id, uio, dirbytes, tx);
	return (error);
	}

	/*
	* When a file is memory mapped, we must keep the IO data synchronized
	* between the DMU cache and the memory mapped pages. What this means:
	*
	* On Read: We "read" preferentially from memory mapped pages,
	* else we default from the dmu buffer.
	*
	* NOTE: We will always "break up" the IO into PAGESIZE uiomoves when
	* the file is memory mapped.
	*/
	static int
	mappedread(vnode_t *vp, int nbytes, uio_t *uio)
	{
	znode_t *zp = VTOZ(vp);
	objset_t *os = zp->z_zfsvfs->z_os;
	vm_object_t obj;
	vm_page_t m;
	struct sf_buf *sf;
	int64_t start, off;
	caddr_t va;
	int len = nbytes;
	int error = 0;
	uint64_t dirbytes;

	ASSERT(vp->v_mount != NULL);
	obj = vp->v_object;
	ASSERT(obj != NULL);

	start = uio->uio_loffset;
	off = start & PAGEOFFSET;
	dirbytes = 0;
	VM_OBJECT_LOCK(obj);
	for (start &= PAGEMASK; len > 0; start += PAGESIZE) {
	uint64_t bytes = MIN(PAGESIZE - off, len);

	again:
	if ((m = vm_page_lookup(obj, OFF_TO_IDX(start))) != NULL &&
	vm_page_is_valid(m, (vm_offset_t)off, bytes)) {
	if (vm_page_sleep_if_busy(m, FALSE, "zfsmrb"))
	goto again;
	vm_page_busy(m);
	VM_OBJECT_UNLOCK(obj);
	if (dirbytes > 0) {
	error = dmu_read_uio(os, zp->z_id, uio,
	dirbytes);
	dirbytes = 0;
	}
	if (error == 0) {
	sched_pin();
	sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
	va = (caddr_t)sf_buf_kva(sf);
	error = uiomove(va + off, bytes, UIO_READ, uio);
	sf_buf_free(sf);
	sched_unpin();
	}
	VM_OBJECT_LOCK(obj);
	vm_page_wakeup(m);
	} else if (m != NULL && uio->uio_segflg == UIO_NOCOPY) {
	/*
	* The code below is here to make sendfile(2) work
	* correctly with ZFS. As pointed out by ups@,
	* sendfile(2) should be changed to use VOP_GETPAGES(),
	* but that would pessimize the performance of sendfile/UFS,
	* which is why I handle this special case in ZFS code.
	*/
	if (vm_page_sleep_if_busy(m, FALSE, "zfsmrb"))
	goto again;
	vm_page_busy(m);
	VM_OBJECT_UNLOCK(obj);
	if (dirbytes > 0) {
	error = dmu_read_uio(os, zp->z_id, uio,
	dirbytes);
	dirbytes = 0;
	}
	if (error == 0) {
	sched_pin();
	sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
	va = (caddr_t)sf_buf_kva(sf);
	error = dmu_read(os, zp->z_id, start + off,
	bytes, (void *)(va + off));
	sf_buf_free(sf);
	sched_unpin();
	}
	VM_OBJECT_LOCK(obj);
	vm_page_wakeup(m);
	if (error == 0)
	uio->uio_resid -= bytes;
	} else {
	dirbytes += bytes;
	}
	len -= bytes;
	off = 0;
	if (error)
	break;
	}
	VM_OBJECT_UNLOCK(obj);
	if (error == 0 && dirbytes > 0)
	error = dmu_read_uio(os, zp->z_id, uio, dirbytes);
	return (error);
	}

	offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */

	/*
	* Read bytes from specified file into supplied buffer.
	*
	* IN: vp - vnode of file to be read from.
	* uio - structure supplying read location, range info,
	* and return buffer.
	* ioflag - SYNC flags; used to provide FRSYNC semantics.
	* cr - credentials of caller.
	* ct - caller context
	*
	* OUT: uio - updated offset and range, buffer filled.
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Side Effects:
	* vp - atime updated if byte count > 0
	*/
	/* ARGSUSED */
	static int
	zfs_read(vnode_t *vp, uio_t *uio, int ioflag, cred_t *cr, caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	objset_t *os;
	ssize_t n, nbytes;
	int error;
	rl_t *rl;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);
	os = zfsvfs->z_os;

	if (zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) {
	ZFS_EXIT(zfsvfs);
	return (EACCES);
	}

	/*
	* Validate file offset
	*/
	if (uio->uio_loffset < (offset_t)0) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	/*
	* Fasttrack empty reads
	*/
	if (uio->uio_resid == 0) {
	ZFS_EXIT(zfsvfs);
	return (0);
	}

	/*
	* Check for mandatory locks
	*/
	if (MANDMODE((mode_t)zp->z_phys->zp_mode)) {
	if (error = chklock(vp, FREAD,
	uio->uio_loffset, uio->uio_resid, uio->uio_fmode, ct)) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}
	}

	/*
	* If we're in FRSYNC mode, sync out this znode before reading it.
	*/
	if (ioflag & FRSYNC)
	zil_commit(zfsvfs->z_log, zp->z_last_itx, zp->z_id);

	/*
	* Lock the range against changes.
	*/
	rl = zfs_range_lock(zp, uio->uio_loffset, uio->uio_resid, RL_READER);

	/*
	* If we are reading past end-of-file we can skip
	* to the end; but we might still need to set atime.
	*/
	if (uio->uio_loffset >= zp->z_phys->zp_size) {
	error = 0;
	goto out;
	}

	ASSERT(uio->uio_loffset < zp->z_phys->zp_size);
	n = MIN(uio->uio_resid, zp->z_phys->zp_size - uio->uio_loffset);

	while (n > 0) {
	nbytes = MIN(n, zfs_read_chunk_size -
	P2PHASE(uio->uio_loffset, zfs_read_chunk_size));

	if (vn_has_cached_data(vp))
	error = mappedread(vp, nbytes, uio);
	else
	error = dmu_read_uio(os, zp->z_id, uio, nbytes);
	if (error) {
	/* convert checksum errors into IO errors */
	if (error == ECKSUM)
	error = EIO;
	break;
	}

	n -= nbytes;
	}

	out:
	zfs_range_unlock(rl);

	ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
	ZFS_EXIT(zfsvfs);
	return (error);
	}
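Review note: the loop in zfs_read() never lets a single transfer cross a zfs_read_chunk_size boundary, because it caps each step at the distance to the next boundary. A small userland sketch of that arithmetic follows. It assumes the chunk size is a power of two; sk_split_chunks() and the SK_ macros are hypothetical names written for this note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of x within its align-sized block; align must be a power of 2. */
#define	SK_P2PHASE(x, align)	((x) & ((align) - 1))
#define	SK_MIN(a, b)		((a) < (b) ? (a) : (b))

/*
 * Split the byte range [off, off + n) into pieces that never cross a
 * chunk boundary.  Piece sizes are stored in out[] (at most max of them)
 * and the number of pieces stored is returned.
 */
static size_t
sk_split_chunks(uint64_t off, uint64_t n, uint64_t chunk, uint64_t *out,
    size_t max)
{
	size_t cnt = 0;

	assert(chunk != 0 && (chunk & (chunk - 1)) == 0);
	while (n > 0 && cnt < max) {
		uint64_t nbytes = SK_MIN(n, chunk - SK_P2PHASE(off, chunk));

		out[cnt++] = nbytes;
		off += nbytes;
		n -= nbytes;
	}
	return (cnt);
}
```

With a 16-byte chunk, a 40-byte request starting at offset 10 is split into pieces of 6, 16, 16 and 2 bytes: only the first and last pieces are short.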

	/*
	* Fault in the pages of the first n bytes specified by the uio structure.
	* 1 byte in each page is touched and the uio struct is unmodified.
	* Any error will exit this routine as this is only a best
	* attempt to get the pages resident. This is a copy of ufs_trans_touch().
	*/
	static void
	zfs_prefault_write(ssize_t n, struct uio *uio)
	{
	struct iovec *iov;
	ulong_t cnt, incr;
	caddr_t p;

	if (uio->uio_segflg != UIO_USERSPACE)
	return;

	iov = uio->uio_iov;

	while (n) {
	cnt = MIN(iov->iov_len, n);
	if (cnt == 0) {
	/* empty iov entry */
	iov++;
	continue;
	}
	n -= cnt;
	/*
	* touch each page in this segment.
	*/
	p = iov->iov_base;
	while (cnt) {
	if (fubyte(p) == -1)
	return;
	incr = MIN(cnt, PAGESIZE);
	p += incr;
	cnt -= incr;
	}
	/*
	* touch the last byte in case it straddles a page.
	*/
	p--;
	if (fubyte(p) == -1)
	return;
	iov++;
	}
	}

	/*
	* Write the bytes to a file.
	*
	* IN: vp - vnode of file to be written to.
	* uio - structure supplying write location, range info,
	* and data buffer.
	* ioflag - IO_APPEND flag set if in append mode.
	* cr - credentials of caller.
	* ct - caller context (NFS/CIFS fem monitor only)
	*
	* OUT: uio - updated offset and range.
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* vp - ctime|mtime updated if byte count > 0
	*/
	/* ARGSUSED */
	static int
	zfs_write(vnode_t *vp, uio_t *uio, int ioflag, cred_t *cr, caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	rlim64_t limit = MAXOFFSET_T;
	ssize_t start_resid = uio->uio_resid;
	ssize_t tx_bytes;
	uint64_t end_size;
	dmu_tx_t *tx;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	zilog_t *zilog;
	offset_t woff;
	ssize_t n, nbytes;
	rl_t *rl;
	int max_blksz = zfsvfs->z_max_blksz;
	uint64_t pflags;
	int error;

	/*
	* Fasttrack empty write
	*/
	n = start_resid;
	if (n == 0)
	return (0);

	if (limit == RLIM64_INFINITY || limit > MAXOFFSET_T)
	limit = MAXOFFSET_T;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	/*
	* If immutable or not appending then return EPERM
	*/
	pflags = zp->z_phys->zp_flags;
	if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
	((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
	(uio->uio_loffset < zp->z_phys->zp_size))) {
	ZFS_EXIT(zfsvfs);
	return (EPERM);
	}

	zilog = zfsvfs->z_log;

	/*
	* Pre-fault the pages to ensure slow (eg NFS) pages
	* don't hold up txg.
	*/
	zfs_prefault_write(n, uio);

	/*
	* If in append mode, set the io offset pointer to eof.
	*/
	if (ioflag & IO_APPEND) {
	/*
	* Range lock for a file append:
	* The value for the start of range will be determined by
	* zfs_range_lock() (to guarantee append semantics).
	* If this write will cause the block size to increase,
	* zfs_range_lock() will lock the entire file, so we must
	* later reduce the range after we grow the block size.
	*/
	rl = zfs_range_lock(zp, 0, n, RL_APPEND);
	if (rl->r_len == UINT64_MAX) {
	/* overlocked, zp_size can't change */
	woff = uio->uio_loffset = zp->z_phys->zp_size;
	} else {
	woff = uio->uio_loffset = rl->r_off;
	}
	} else {
	woff = uio->uio_loffset;
	/*
	* Validate file offset
	*/
	if (woff < 0) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	/*
	* If we need to grow the block size then zfs_range_lock()
	* will lock a wider range than we request here.
	* Later after growing the block size we reduce the range.
	*/
	rl = zfs_range_lock(zp, woff, n, RL_WRITER);
	}

	if (woff >= limit) {
	zfs_range_unlock(rl);
	ZFS_EXIT(zfsvfs);
	return (EFBIG);
	}

	if ((woff + n) > limit || woff > (limit - n))
	n = limit - woff;

	/*
	* Check for mandatory locks
	*/
	if (MANDMODE((mode_t)zp->z_phys->zp_mode) &&
	(error = chklock(vp, FWRITE, woff, n, uio->uio_fmode, ct)) != 0) {
	zfs_range_unlock(rl);
	ZFS_EXIT(zfsvfs);
	return (error);
	}
	end_size = MAX(zp->z_phys->zp_size, woff + n);

	/*
	* Write the file in reasonable size chunks. Each chunk is written
	* in a separate transaction; this keeps the intent log records small
	* and allows us to do more fine-grained space accounting.
	*/
	while (n > 0) {
	/*
	* Start a transaction.
	*/
	woff = uio->uio_loffset;
	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_bonus(tx, zp->z_id);
	dmu_tx_hold_write(tx, zp->z_id, woff, MIN(n, max_blksz));
	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	if (error == ERESTART &&
	zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	continue;
	}
	dmu_tx_abort(tx);
	break;
	}

	/*
	* If zfs_range_lock() over-locked we grow the blocksize
	* and then reduce the lock range. This will only happen
	* on the first iteration since zfs_range_reduce() will
	* shrink down r_len to the appropriate size.
	*/
	if (rl->r_len == UINT64_MAX) {
	uint64_t new_blksz;

	if (zp->z_blksz > max_blksz) {
	ASSERT(!ISP2(zp->z_blksz));
	new_blksz = MIN(end_size, SPA_MAXBLOCKSIZE);
	} else {
	new_blksz = MIN(end_size, max_blksz);
	}
	zfs_grow_blocksize(zp, new_blksz, tx);
	zfs_range_reduce(rl, woff, n);
	}

	/*
	* XXX - should we really limit each write to z_max_blksz?
	* Perhaps we should use SPA_MAXBLOCKSIZE chunks?
	*/
	nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz));

	if (woff + nbytes > zp->z_phys->zp_size)
	vnode_pager_setsize(vp, woff + nbytes);

	rw_enter(&zp->z_map_lock, RW_READER);

	tx_bytes = uio->uio_resid;
	if (vn_has_cached_data(vp)) {
	rw_exit(&zp->z_map_lock);
	error = mappedwrite(vp, nbytes, uio, tx);
	} else {
	error = dmu_write_uio(zfsvfs->z_os, zp->z_id,
	uio, nbytes, tx);
	rw_exit(&zp->z_map_lock);
	}
	tx_bytes -= uio->uio_resid;

	/*
	* If we made no progress, we're done. If we made even
	* partial progress, update the znode and ZIL accordingly.
	*/
	if (tx_bytes == 0) {
	dmu_tx_commit(tx);
	ASSERT(error != 0);
	break;
	}

	/*
	* Clear Set-UID/Set-GID bits on successful write if not
	* privileged and at least one of the execute bits is set.
	*
	* It would be nice to do this after all writes have
	* been done, but that would still expose the ISUID/ISGID
	* to another app after the partial write is committed.
	*
	* Note: we don't call zfs_fuid_map_id() here because
	* user 0 is not an ephemeral uid.
	*/
	mutex_enter(&zp->z_acl_lock);
	if ((zp->z_phys->zp_mode & (S_IXUSR | (S_IXUSR >> 3) |
	(S_IXUSR >> 6))) != 0 &&
	(zp->z_phys->zp_mode & (S_ISUID | S_ISGID)) != 0 &&
	secpolicy_vnode_setid_retain(vp, cr,
	(zp->z_phys->zp_mode & S_ISUID) != 0 &&
	zp->z_phys->zp_uid == 0) != 0) {
	zp->z_phys->zp_mode &= ~(S_ISUID | S_ISGID);
	}
	mutex_exit(&zp->z_acl_lock);

	/*
	* Update time stamp. NOTE: This marks the bonus buffer as
	* dirty, so we don't have to do it again for zp_size.
	*/
	zfs_time_stamper(zp, CONTENT_MODIFIED, tx);

	/*
	* Update the file size (zp_size) if it has changed;
	* account for possible concurrent updates.
	*/
	while ((end_size = zp->z_phys->zp_size) < uio->uio_loffset)
	(void) atomic_cas_64(&zp->z_phys->zp_size, end_size,
	uio->uio_loffset);
	zfs_log_write(zilog, tx, TX_WRITE, zp, woff, tx_bytes, ioflag);
	dmu_tx_commit(tx);

	if (error != 0)
	break;
	ASSERT(tx_bytes == nbytes);
	n -= nbytes;
	}

	zfs_range_unlock(rl);

	/*
	* If we're in replay mode, or we made no progress, return error.
	* Otherwise, it's at least a partial write, so it's successful.
	*/
	if (zfsvfs->z_assign >= TXG_INITIAL || uio->uio_resid == start_resid) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	if (ioflag & (FSYNC | FDSYNC))
	zil_commit(zilog, zp->z_last_itx, zp->z_id);

	ZFS_EXIT(zfsvfs);
	return (0);
	}
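Reviewer note on the chunking in `zfs_write()`: the line `nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz))` keeps each transaction from crossing a `max_blksz` boundary. The standalone sketch below is not part of the committed file; the `sketch_` names are hypothetical, and it assumes `max_blksz` is a power of two, as `P2PHASE` requires.

```c
#include <assert.h>
#include <stdint.h>

/* Offset of x within its power-of-two-sized block (mirrors P2PHASE). */
#define SKETCH_P2PHASE(x, align) ((x) & ((align) - 1))
#define SKETCH_MIN(a, b) ((a) < (b) ? (a) : (b))

/*
 * Size of the next chunk for a write of n bytes starting at woff,
 * such that the chunk never crosses a max_blksz boundary.
 * Assumption: max_blksz is a nonzero power of two.
 */
static uint64_t
sketch_next_chunk(uint64_t woff, uint64_t n, uint64_t max_blksz)
{
	assert(max_blksz != 0 && (max_blksz & (max_blksz - 1)) == 0);
	return (SKETCH_MIN(n, max_blksz - SKETCH_P2PHASE(woff, max_blksz)));
}

/* Number of chunks (one transaction each) the write is split into. */
static unsigned
sketch_count_chunks(uint64_t woff, uint64_t n, uint64_t max_blksz)
{
	unsigned chunks = 0;

	while (n > 0) {
		uint64_t nbytes = sketch_next_chunk(woff, n, max_blksz);

		woff += nbytes;
		n -= nbytes;
		chunks++;
	}
	return (chunks);
}
```

With a 128 KiB block size, a 300000-byte write at offset 100 would be split into chunks of 130972, 131072 and 37956 bytes: only the first chunk is shortened, to reach the next block boundary.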

	void
	zfs_get_done(dmu_buf_t *db, void *vzgd)
	{
	zgd_t *zgd = (zgd_t *)vzgd;
	rl_t *rl = zgd->zgd_rl;
	vnode_t *vp = ZTOV(rl->r_zp);
	objset_t *os = rl->r_zp->z_zfsvfs->z_os;
	int vfslocked;

	vfslocked = VFS_LOCK_GIANT(vp->v_vfsp);
	dmu_buf_rele(db, vzgd);
	zfs_range_unlock(rl);
	/*
	* Release the vnode asynchronously as we currently have the
	* txg stopped from syncing.
	*/
	VN_RELE_ASYNC(vp, dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
	zil_add_block(zgd->zgd_zilog, zgd->zgd_bp);
	kmem_free(zgd, sizeof (zgd_t));
	VFS_UNLOCK_GIANT(vfslocked);
	}

	/*
	* Get data to generate a TX_WRITE intent log record.
	*/
	int
	zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio)
	{
	zfsvfs_t *zfsvfs = arg;
	objset_t *os = zfsvfs->z_os;
	znode_t *zp;
	uint64_t off = lr->lr_offset;
	dmu_buf_t *db;
	rl_t *rl;
	zgd_t *zgd;
	int dlen = lr->lr_length; /* length of user data */
	int error = 0;

	ASSERT(zio);
	ASSERT(dlen != 0);

	/*
	* Nothing to do if the file has been removed
	*/
	if (zfs_zget(zfsvfs, lr->lr_foid, &zp) != 0)
	return (ENOENT);
	if (zp->z_unlinked) {
	/*
	* Release the vnode asynchronously as we currently have the
	* txg stopped from syncing.
	*/
	VN_RELE_ASYNC(ZTOV(zp),
	dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
	return (ENOENT);
	}

	/*
	* Write records come in two flavors: immediate and indirect.
	* For small writes it's cheaper to store the data with the
	* log record (immediate); for large writes it's cheaper to
	* sync the data and get a pointer to it (indirect) so that
	* we don't have to write the data twice.
	*/
	if (buf != NULL) { /* immediate write */
	rl = zfs_range_lock(zp, off, dlen, RL_READER);
	/* test for truncation needs to be done while range locked */
	if (off >= zp->z_phys->zp_size) {
	error = ENOENT;
	goto out;
	}
	VERIFY(0 == dmu_read(os, lr->lr_foid, off, dlen, buf));
	} else { /* indirect write */
	uint64_t boff; /* block starting offset */

	/*
	* Have to lock the whole block to ensure when it's
	* written out and its checksum is being calculated
	* that no one can change the data. We need to re-check
	* blocksize after we get the lock in case it's changed!
	*/
	for (;;) {
	if (ISP2(zp->z_blksz)) {
	boff = P2ALIGN_TYPED(off, zp->z_blksz,
	uint64_t);
	} else {
	boff = 0;
	}
	dlen = zp->z_blksz;
	rl = zfs_range_lock(zp, boff, dlen, RL_READER);
	if (zp->z_blksz == dlen)
	break;
	zfs_range_unlock(rl);
	}
	/* test for truncation needs to be done while range locked */
	if (off >= zp->z_phys->zp_size) {
	error = ENOENT;
	goto out;
	}
	zgd = (zgd_t *)kmem_alloc(sizeof (zgd_t), KM_SLEEP);
	zgd->zgd_rl = rl;
	zgd->zgd_zilog = zfsvfs->z_log;
	zgd->zgd_bp = &lr->lr_blkptr;
	VERIFY(0 == dmu_buf_hold(os, lr->lr_foid, boff, zgd, &db));
	ASSERT(boff == db->db_offset);
	lr->lr_blkoff = off - boff;
	error = dmu_sync(zio, db, &lr->lr_blkptr,
	lr->lr_common.lrc_txg, zfs_get_done, zgd);
	ASSERT((error && error != EINPROGRESS) ||
	lr->lr_length <= zp->z_blksz);
	if (error == 0)
	zil_add_block(zfsvfs->z_log, &lr->lr_blkptr);
	/*
	* If we get EINPROGRESS, then we need to wait for a
	* write IO initiated by dmu_sync() to complete before
	* we can release this dbuf. We will finish everything
	* up in the zfs_get_done() callback.
	*/
	if (error == EINPROGRESS)
	return (0);
	dmu_buf_rele(db, zgd);
	kmem_free(zgd, sizeof (zgd_t));
	}
	out:
	zfs_range_unlock(rl);
	/*
	* Release the vnode asynchronously as we currently have the
	* txg stopped from syncing.
	*/
	VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
	return (error);
	}
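Reviewer note on the indirect-write path in `zfs_get_data()`: the range lock has to cover the whole block, so the code rounds `off` down to a block boundary when the block size is a power of two and otherwise locks from offset 0 (presumably because a file with a non-power-of-two block size consists of a single block). A minimal sketch of that arithmetic follows; it is not part of the committed file and the `sketch_` names are hypothetical. Unlike the kernel's `ISP2`, the helper here treats 0 as not a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if x is a power of two (0 is rejected in this sketch). */
static int
sketch_isp2(uint64_t x)
{
	return (x != 0 && (x & (x - 1)) == 0);
}

/*
 * Starting offset of the block containing off: rounded down to a
 * blksz boundary for power-of-two block sizes, 0 otherwise.
 */
static uint64_t
sketch_block_start(uint64_t off, uint64_t blksz)
{
	assert(blksz != 0);
	if (sketch_isp2(blksz))
		return (off & ~(blksz - 1));
	return (0);
}

/* Offset of the data within its block (what lr_blkoff records). */
static uint64_t
sketch_block_offset(uint64_t off, uint64_t blksz)
{
	return (off - sketch_block_start(off, blksz));
}
```

For example, with a 131072-byte block, offset 200000 falls in the block starting at 131072, at offset 68928 within that block.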

	/*ARGSUSED*/
	static int
	zfs_access(vnode_t *vp, int mode, int flag, cred_t *cr,
	caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	int error;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	if (flag & V_ACE_MASK)
	error = zfs_zaccess(zp, mode, flag, B_FALSE, cr);
	else
	error = zfs_zaccess_rwx(zp, mode, flag, cr);

	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Lookup an entry in a directory, or an extended attribute directory.
	* If it exists, return a held vnode reference for it.
	*
	* IN: dvp - vnode of directory to search.
	* nm - name of entry to lookup.
	* pnp - full pathname to lookup [UNUSED].
	* flags - LOOKUP_XATTR set if looking for an attribute.
	* rdir - root directory vnode [UNUSED].
	* cr - credentials of caller.
	* ct - caller context
	* direntflags - directory lookup flags
	* realpnp - returned pathname.
	*
	* OUT: vpp - vnode of located entry, NULL if not found.
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* NA
	*/
	/* ARGSUSED */
	static int
	zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct componentname *cnp,
	int nameiop, cred_t *cr, kthread_t *td, int flags)
	{
	znode_t *zdp = VTOZ(dvp);
	zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
	int error;
	int *direntflags = NULL;
	void *realpnp = NULL;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zdp);

	*vpp = NULL;

	if (flags & LOOKUP_XATTR) {
	#ifdef TODO
	/*
	* If the xattr property is off, refuse the lookup request.
	*/
	if (!(zfsvfs->z_vfs->vfs_flag & VFS_XATTR)) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}
	#endif

	/*
	* We don't allow recursive attributes..
	* Maybe someday we will.
	*/
	if (zdp->z_phys->zp_flags & ZFS_XATTR) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	if (error = zfs_get_xattrdir(VTOZ(dvp), vpp, cr, flags)) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Do we have permission to get into attribute directory?
	*/

	if (error = zfs_zaccess(VTOZ(*vpp), ACE_EXECUTE, 0,
	B_FALSE, cr)) {
	VN_RELE(*vpp);
	*vpp = NULL;
	}

	ZFS_EXIT(zfsvfs);
	return (error);
	}

	if (dvp->v_type != VDIR) {
	ZFS_EXIT(zfsvfs);
	return (ENOTDIR);
	}

	/*
	* Check accessibility of directory.
	*/

	if (error = zfs_zaccess(zdp, ACE_EXECUTE, 0, B_FALSE, cr)) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	if (zfsvfs->z_utf8 && u8_validate(nm, strlen(nm),
	NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
	ZFS_EXIT(zfsvfs);
	return (EILSEQ);
	}

	error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
	if (error == 0) {
	/*
	* Convert device special files
	*/
	if (IS_DEVVP(*vpp)) {
	vnode_t *svp;

	svp = specvp(*vpp, (*vpp)->v_rdev, (*vpp)->v_type, cr);
	VN_RELE(*vpp);
	if (svp == NULL)
	error = ENOSYS;
	else
	*vpp = svp;
	}
	}

	/* Translate errors and add SAVENAME when needed. */
	if (cnp->cn_flags & ISLASTCN) {
	switch (nameiop) {
	case CREATE:
	case RENAME:
	if (error == ENOENT) {
	error = EJUSTRETURN;
	cnp->cn_flags |= SAVENAME;
	break;
	}
	/* FALLTHROUGH */
	case DELETE:
	if (error == 0)
	cnp->cn_flags |= SAVENAME;
	break;
	}
	}
	if (error == 0 && (nm[0] != '.' || nm[1] != '\0')) {
	int ltype = 0;

	if (cnp->cn_flags & ISDOTDOT) {
	ltype = VOP_ISLOCKED(dvp);
	VOP_UNLOCK(dvp, 0);
	}
	ZFS_EXIT(zfsvfs);
	error = vn_lock(*vpp, cnp->cn_lkflags);
	if (cnp->cn_flags & ISDOTDOT)
	vn_lock(dvp, ltype | LK_RETRY);
	if (error != 0) {
	VN_RELE(*vpp);
	*vpp = NULL;
	return (error);
	}
	} else {
	ZFS_EXIT(zfsvfs);
	}

	#ifdef FREEBSD_NAMECACHE
	/*
	* Insert name into cache (as non-existent) if appropriate.
	*/
	if (error == ENOENT && (cnp->cn_flags & MAKEENTRY) && nameiop != CREATE)
	cache_enter(dvp, *vpp, cnp);
	/*
	* Insert name into cache if appropriate.
	*/
	if (error == 0 && (cnp->cn_flags & MAKEENTRY)) {
	if (!(cnp->cn_flags & ISLASTCN) ||
	(nameiop != DELETE && nameiop != RENAME)) {
	cache_enter(dvp, *vpp, cnp);
	}
	}
	#endif

	return (error);
	}

	/*
	* Attempt to create a new entry in a directory. If the entry
	* already exists, truncate the file if permissible, else return
	* an error. Return the vp of the created or trunc'd file.
	*
	* IN: dvp - vnode of directory to put new file entry in.
	* name - name of new file entry.
	* vap - attributes of new file.
	* excl - flag indicating exclusive or non-exclusive mode.
	* mode - mode to open file with.
	* cr - credentials of caller.
	* flag - large file flag [UNUSED].
	* ct - caller context
	* vsecp - ACL to be set
	*
	* OUT: vpp - vnode of created or trunc'd entry.
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* dvp - ctime|mtime updated if new entry created
	* vp - ctime|mtime always, atime if new
	*/

	/* ARGSUSED */
	static int
	zfs_create(vnode_t *dvp, char *name, vattr_t *vap, int excl, int mode,
	vnode_t **vpp, cred_t *cr, kthread_t *td)
	{
	znode_t *zp, *dzp = VTOZ(dvp);
	zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
	zilog_t *zilog;
	objset_t *os;
	zfs_dirlock_t *dl;
	dmu_tx_t *tx;
	int error;
	zfs_acl_t *aclp = NULL;
	zfs_fuid_info_t *fuidp = NULL;
	void *vsecp = NULL;
	int flag = 0;

	/*
	* If we have an ephemeral id, ACL, or XVATTR then
	* make sure file system is at proper version
	*/

	if (zfsvfs->z_use_fuids == B_FALSE &&
	(vsecp || (vap->va_mask & AT_XVATTR) ||
	IS_EPHEMERAL(crgetuid(cr)) || IS_EPHEMERAL(crgetgid(cr))))
	return (EINVAL);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(dzp);
	os = zfsvfs->z_os;
	zilog = zfsvfs->z_log;

	if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
	NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
	ZFS_EXIT(zfsvfs);
	return (EILSEQ);
	}

	if (vap->va_mask & AT_XVATTR) {
	if ((error = secpolicy_xvattr(dvp, (xvattr_t *)vap,
	crgetuid(cr), cr, vap->va_type)) != 0) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}
	}
	top:
	*vpp = NULL;

	if ((vap->va_mode & S_ISVTX) && secpolicy_vnode_stky_modify(cr))
	vap->va_mode &= ~S_ISVTX;

	if (*name == '\0') {
	/*
	* Null component name refers to the directory itself.
	*/
	VN_HOLD(dvp);
	zp = dzp;
	dl = NULL;
	error = 0;
	} else {
	/* possible VN_HOLD(zp) */
	int zflg = 0;

	if (flag & FIGNORECASE)
	zflg |= ZCILOOK;

	error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
	NULL, NULL);
	if (error) {
	if (strcmp(name, "..") == 0)
	error = EISDIR;
	ZFS_EXIT(zfsvfs);
	if (aclp)
	zfs_acl_free(aclp);
	return (error);
	}
	}
	if (vsecp && aclp == NULL) {
	error = zfs_vsec_2_aclp(zfsvfs, vap->va_type, vsecp, &aclp);
	if (error) {
	ZFS_EXIT(zfsvfs);
	if (dl)
	zfs_dirent_unlock(dl);
	return (error);
	}
	}

	if (zp == NULL) {
	uint64_t txtype;

	/*
	* Create a new file object and update the directory
	* to reference it.
	*/
	if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
	goto out;
	}

	/*
	* We only support the creation of regular files in
	* extended attribute directories.
	*/
	if ((dzp->z_phys->zp_flags & ZFS_XATTR) &&
	(vap->va_type != VREG)) {
	error = EINVAL;
	goto out;
	}

	tx = dmu_tx_create(os);
	dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
	if ((aclp && aclp->z_has_fuids) || IS_EPHEMERAL(crgetuid(cr)) ||
	IS_EPHEMERAL(crgetgid(cr))) {
	if (zfsvfs->z_fuid_obj == 0) {
	dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	dmu_tx_hold_zap(tx, MASTER_NODE_OBJ,
	FALSE, NULL);
	} else {
	dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
	dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	}
	}
	dmu_tx_hold_bonus(tx, dzp->z_id);
	dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
	if ((dzp->z_phys->zp_flags & ZFS_INHERIT_ACE) || aclp) {
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
	0, SPA_MAXBLOCKSIZE);
	}
	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	zfs_dirent_unlock(dl);
	if (error == ERESTART &&
	zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	if (aclp)
	zfs_acl_free(aclp);
	return (error);
	}
	zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, aclp, &fuidp);
	(void) zfs_link_create(dl, zp, tx, ZNEW);
	txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
	if (flag & FIGNORECASE)
	txtype |= TX_CI;
	zfs_log_create(zilog, tx, txtype, dzp, zp, name,
	vsecp, fuidp, vap);
	if (fuidp)
	zfs_fuid_info_free(fuidp);
	dmu_tx_commit(tx);
	} else {
	int aflags = (flag & FAPPEND) ? V_APPEND : 0;

	/*
	* A directory entry already exists for this name.
	*/
	/*
	* Can't truncate an existing file if in exclusive mode.
	*/
	if (excl == EXCL) {
	error = EEXIST;
	goto out;
	}
	/*
	* Can't open a directory for writing.
	*/
	if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
	error = EISDIR;
	goto out;
	}
	/*
	* Verify requested access to file.
	*/
	if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
	goto out;
	}

	mutex_enter(&dzp->z_lock);
	dzp->z_seq++;
	mutex_exit(&dzp->z_lock);

	/*
	* Truncate regular files if requested.
	*/
	if ((ZTOV(zp)->v_type == VREG) &&
	(vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
	/* we can't hold any locks when calling zfs_freesp() */
	zfs_dirent_unlock(dl);
	dl = NULL;
	error = zfs_freesp(zp, 0, 0, mode, TRUE);
	if (error == 0) {
	vnevent_create(ZTOV(zp), ct);
	}
	}
	}
	out:
	if (dl)
	zfs_dirent_unlock(dl);

	if (error) {
	if (zp)
	VN_RELE(ZTOV(zp));
	} else {
	*vpp = ZTOV(zp);
	/*
	* If vnode is for a device return a specfs vnode instead.
	*/
	if (IS_DEVVP(*vpp)) {
	struct vnode *svp;

	svp = specvp(*vpp, (*vpp)->v_rdev, (*vpp)->v_type, cr);
	VN_RELE(*vpp);
	if (svp == NULL) {
	error = ENOSYS;
	}
	*vpp = svp;
	}
	}
	if (aclp)
	zfs_acl_free(aclp);

	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Remove an entry from a directory.
	*
	* IN: dvp - vnode of directory to remove entry from.
	* name - name of entry to remove.
	* cr - credentials of caller.
	* ct - caller context
	* flags - case flags
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* dvp - ctime|mtime
	* vp - ctime (if nlink > 0)
	*/
	/*ARGSUSED*/
	static int
	zfs_remove(vnode_t *dvp, char *name, cred_t *cr, caller_context_t *ct,
	int flags)
	{
	znode_t *zp, *dzp = VTOZ(dvp);
	znode_t *xzp = NULL;
	vnode_t *vp;
	zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
	zilog_t *zilog;
	uint64_t acl_obj, xattr_obj;
	zfs_dirlock_t *dl;
	dmu_tx_t *tx;
	boolean_t may_delete_now, delete_now = FALSE;
	boolean_t unlinked, toobig = FALSE;
	uint64_t txtype;
	pathname_t *realnmp = NULL;
	pathname_t realnm;
	int error;
	int zflg = ZEXISTS;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(dzp);
	zilog = zfsvfs->z_log;

	if (flags & FIGNORECASE) {
	zflg |= ZCILOOK;
	pn_alloc(&realnm);
	realnmp = &realnm;
	}

	top:
	/*
	* Attempt to lock directory; fail if entry doesn't exist.
	*/
	if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
	NULL, realnmp)) {
	if (realnmp)
	pn_free(realnmp);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	vp = ZTOV(zp);

	if (error = zfs_zaccess_delete(dzp, zp, cr)) {
	goto out;
	}

	/*
	* Need to use rmdir for removing directories.
	*/
	if (vp->v_type == VDIR) {
	error = EPERM;
	goto out;
	}

	vnevent_remove(vp, dvp, name, ct);

	if (realnmp)
	dnlc_remove(dvp, realnmp->pn_buf);
	else
	dnlc_remove(dvp, name);

	may_delete_now = FALSE;

	/*
	* We may delete the znode now, or we may put it in the unlinked set;
	* it depends on whether we're the last link, and on whether there are
	* other holds on the vnode. So we dmu_tx_hold() the right things to
	* allow for either case.
	*/
	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
	dmu_tx_hold_bonus(tx, zp->z_id);
	if (may_delete_now) {
	toobig =
	zp->z_phys->zp_size > zp->z_blksz * DMU_MAX_DELETEBLKCNT;
	/* if the file is too big, only hold_free a token amount */
	dmu_tx_hold_free(tx, zp->z_id, 0,
	(toobig ? DMU_MAX_ACCESS : DMU_OBJECT_END));
	}

	/* are there any extended attributes? */
	if ((xattr_obj = zp->z_phys->zp_xattr) != 0) {
	/* XXX - do we need this if we are deleting? */
	dmu_tx_hold_bonus(tx, xattr_obj);
	}

	/* are there any additional acls */
	if ((acl_obj = zp->z_phys->zp_acl.z_acl_extern_obj) != 0 &&
	may_delete_now)
	dmu_tx_hold_free(tx, acl_obj, 0, DMU_OBJECT_END);

	/* charge as an update -- would be nice not to charge at all */
	dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);

	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	zfs_dirent_unlock(dl);
	VN_RELE(vp);
	if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	if (realnmp)
	pn_free(realnmp);
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Remove the directory entry.
	*/
	error = zfs_link_destroy(dl, zp, tx, zflg, &unlinked);

	if (error) {
	dmu_tx_commit(tx);
	goto out;
	}

	if (0 && unlinked) {
	VI_LOCK(vp);
	delete_now = may_delete_now && !toobig &&
	vp->v_count == 1 && !vn_has_cached_data(vp) &&
	zp->z_phys->zp_xattr == xattr_obj &&
	zp->z_phys->zp_acl.z_acl_extern_obj == acl_obj;
	VI_UNLOCK(vp);
	}

	if (delete_now) {
	if (zp->z_phys->zp_xattr) {
	error = zfs_zget(zfsvfs, zp->z_phys->zp_xattr, &xzp);
	ASSERT3U(error, ==, 0);
	ASSERT3U(xzp->z_phys->zp_links, ==, 2);
	dmu_buf_will_dirty(xzp->z_dbuf, tx);
	mutex_enter(&xzp->z_lock);
	xzp->z_unlinked = 1;
	xzp->z_phys->zp_links = 0;
	mutex_exit(&xzp->z_lock);
	zfs_unlinked_add(xzp, tx);
	zp->z_phys->zp_xattr = 0; /* probably unnecessary */
	}
	mutex_enter(&zp->z_lock);
	VI_LOCK(vp);
	vp->v_count--;
	ASSERT3U(vp->v_count, ==, 0);
	VI_UNLOCK(vp);
	mutex_exit(&zp->z_lock);
	zfs_znode_delete(zp, tx);
	} else if (unlinked) {
	zfs_unlinked_add(zp, tx);
	}

	txtype = TX_REMOVE;
	if (flags & FIGNORECASE)
	txtype |= TX_CI;
	zfs_log_remove(zilog, tx, txtype, dzp, name);

	dmu_tx_commit(tx);
	out:
	if (realnmp)
	pn_free(realnmp);

	zfs_dirent_unlock(dl);

	if (!delete_now) {
	VN_RELE(vp);
	} else if (xzp) {
	/* this rele is delayed to prevent nesting transactions */
	VN_RELE(ZTOV(xzp));
	}

	ZFS_EXIT(zfsvfs);
	return (error);
	}
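Reviewer note on the `goto top` pattern used by `zfs_create()`, `zfs_remove()`, `zfs_mkdir()` and `zfs_rmdir()`: when `dmu_tx_assign()` fails with `ERESTART` (and `z_assign` is `TXG_NOWAIT`), the code drops its locks, waits, aborts the transaction and restarts from `top`; any other error is returned. The sketch below shows only that control flow with a stubbed assign function. It is not part of the committed file, every `sketch_` name is hypothetical, and it omits the `TXG_NOWAIT` check and the lock handling.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#ifndef ERESTART
#define ERESTART 85 /* fallback value for this sketch only */
#endif

typedef int (*sketch_assign_fn)(void *state);

/*
 * Call assign() until it returns something other than ERESTART.
 * Returns the final error code and stores the number of attempts.
 */
static int
sketch_assign_with_retry(sketch_assign_fn assign, void *state, int *attempts)
{
	int error;

	assert(assign != NULL && attempts != NULL);
	*attempts = 0;
	for (;;) {
		(*attempts)++;
		error = assign(state);
		if (error != ERESTART)
			return (error);
		/*
		 * The real code releases its locks here, then calls
		 * dmu_tx_wait() and dmu_tx_abort() before "goto top".
		 */
	}
}

/* Hypothetical stub: returns ERESTART until the counter reaches zero. */
static int
sketch_flaky_assign(void *state)
{
	int *remaining = state;

	if (*remaining > 0) {
		(*remaining)--;
		return (ERESTART);
	}
	return (0);
}
```

The important property is that everything acquired before the assign attempt (directory entry lock, vnode hold) is released before retrying, which is why each function re-runs its lookup after `top:`.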

	/*
	* Create a new directory and insert it into dvp using the name
	* provided. Return a pointer to the inserted directory.
	*
	* IN: dvp - vnode of directory to add subdir to.
	* dirname - name of new directory.
	* vap - attributes of new directory.
	* cr - credentials of caller.
	* ct - caller context
	* vsecp - ACL to be set
	*
	* OUT: vpp - vnode of created directory.
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* dvp - ctime|mtime updated
	* vp - ctime|mtime|atime updated
	*/
	/*ARGSUSED*/
	static int
	zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
	caller_context_t *ct, int flags, vsecattr_t *vsecp)
	{
	znode_t *zp, *dzp = VTOZ(dvp);
	zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
	zilog_t *zilog;
	zfs_dirlock_t *dl;
	uint64_t txtype;
	dmu_tx_t *tx;
	int error;
	zfs_acl_t *aclp = NULL;
	zfs_fuid_info_t *fuidp = NULL;
	int zf = ZNEW;

	ASSERT(vap->va_type == VDIR);

	/*
	* If we have an ephemeral id, ACL, or XVATTR then
	* make sure file system is at proper version
	*/

	if (zfsvfs->z_use_fuids == B_FALSE &&
	(vsecp || (vap->va_mask & AT_XVATTR) || IS_EPHEMERAL(crgetuid(cr))||
	IS_EPHEMERAL(crgetgid(cr))))
	return (EINVAL);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(dzp);
	zilog = zfsvfs->z_log;

	if (dzp->z_phys->zp_flags & ZFS_XATTR) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	if (zfsvfs->z_utf8 && u8_validate(dirname,
	strlen(dirname), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
	ZFS_EXIT(zfsvfs);
	return (EILSEQ);
	}
	if (flags & FIGNORECASE)
	zf |= ZCILOOK;

	if (vap->va_mask & AT_XVATTR)
	if ((error = secpolicy_xvattr(dvp, (xvattr_t *)vap,
	crgetuid(cr), cr, vap->va_type)) != 0) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* First make sure the new directory doesn't exist.
	*/
	top:
	*vpp = NULL;

	if (error = zfs_dirent_lock(&dl, dzp, dirname, &zp, zf,
	NULL, NULL)) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
	zfs_dirent_unlock(dl);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	if (vsecp && aclp == NULL) {
	error = zfs_vsec_2_aclp(zfsvfs, vap->va_type, vsecp, &aclp);
	if (error) {
	zfs_dirent_unlock(dl);
	ZFS_EXIT(zfsvfs);
	return (error);
	}
	}
	/*
	* Add a new entry to the directory.
	*/
	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_zap(tx, dzp->z_id, TRUE, dirname);
	dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, FALSE, NULL);
	if ((aclp && aclp->z_has_fuids) || IS_EPHEMERAL(crgetuid(cr)) ||
	IS_EPHEMERAL(crgetgid(cr))) {
	if (zfsvfs->z_fuid_obj == 0) {
	dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
	} else {
	dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
	dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	}
	}
	if ((dzp->z_phys->zp_flags & ZFS_INHERIT_ACE) || aclp)
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
	0, SPA_MAXBLOCKSIZE);
	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	zfs_dirent_unlock(dl);
	if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	if (aclp)
	zfs_acl_free(aclp);
	return (error);
	}

	/*
	* Create new node.
	*/
	zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, aclp, &fuidp);

	if (aclp)
	zfs_acl_free(aclp);

	/*
	* Now put new name in parent dir.
	*/
	(void) zfs_link_create(dl, zp, tx, ZNEW);

	*vpp = ZTOV(zp);

	txtype = zfs_log_create_txtype(Z_DIR, vsecp, vap);
	if (flags & FIGNORECASE)
	txtype |= TX_CI;
	zfs_log_create(zilog, tx, txtype, dzp, zp, dirname, vsecp, fuidp, vap);

	if (fuidp)
	zfs_fuid_info_free(fuidp);
	dmu_tx_commit(tx);

	zfs_dirent_unlock(dl);

	ZFS_EXIT(zfsvfs);
	return (0);
	}

	/*
	* Remove a directory subdir entry. If the current working
	* directory is the same as the subdir to be removed, the
	* remove will fail.
	*
	* IN: dvp - vnode of directory to remove from.
	* name - name of directory to be removed.
	* cwd - vnode of current working directory.
	* cr - credentials of caller.
	* ct - caller context
	* flags - case flags
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* dvp - ctime|mtime updated
	*/
	/*ARGSUSED*/
	static int
	zfs_rmdir(vnode_t *dvp, char *name, vnode_t *cwd, cred_t *cr,
	caller_context_t *ct, int flags)
	{
	znode_t *dzp = VTOZ(dvp);
	znode_t *zp;
	vnode_t *vp;
	zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
	zilog_t *zilog;
	zfs_dirlock_t *dl;
	dmu_tx_t *tx;
	int error;
	int zflg = ZEXISTS;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(dzp);
	zilog = zfsvfs->z_log;

	if (flags & FIGNORECASE)
	zflg |= ZCILOOK;
	top:
	zp = NULL;

	/*
	* Attempt to lock directory; fail if entry doesn't exist.
	*/
	if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
	NULL, NULL)) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	vp = ZTOV(zp);

	if (error = zfs_zaccess_delete(dzp, zp, cr)) {
	goto out;
	}

	if (vp->v_type != VDIR) {
	error = ENOTDIR;
	goto out;
	}

	if (vp == cwd) {
	error = EINVAL;
	goto out;
	}

	vnevent_rmdir(vp, dvp, name, ct);

	/*
	* Grab a lock on the directory to make sure that no one is
	* trying to add (or lookup) entries while we are removing it.
	*/
	rw_enter(&zp->z_name_lock, RW_WRITER);

	/*
	* Grab a lock on the parent pointer to make sure we play well
	* with the treewalk and directory rename code.
	*/
	rw_enter(&zp->z_parent_lock, RW_WRITER);

	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
	dmu_tx_hold_bonus(tx, zp->z_id);
	dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	rw_exit(&zp->z_parent_lock);
	rw_exit(&zp->z_name_lock);
	zfs_dirent_unlock(dl);
	VN_RELE(vp);
	if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	#ifdef FREEBSD_NAMECACHE
	cache_purge(dvp);
	#endif

	error = zfs_link_destroy(dl, zp, tx, zflg, NULL);

	if (error == 0) {
	uint64_t txtype = TX_RMDIR;
	if (flags & FIGNORECASE)
	txtype |= TX_CI;
	zfs_log_remove(zilog, tx, txtype, dzp, name);
	}

	dmu_tx_commit(tx);

	rw_exit(&zp->z_parent_lock);
	rw_exit(&zp->z_name_lock);
	#ifdef FREEBSD_NAMECACHE
	cache_purge(vp);
	#endif
	out:
	zfs_dirent_unlock(dl);

	VN_RELE(vp);

	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Read as many directory entries as will fit into the provided
	* buffer from the given directory cursor position (specified in
	* the uio structure).
	*
	* IN: vp - vnode of directory to read.
	* uio - structure supplying read location, range info,
	* and return buffer.
	* cr - credentials of caller.
	* ct - caller context
	* flags - case flags
	*
	* OUT: uio - updated offset and range, buffer filled.
	* eofp - set to true if end-of-file detected.
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* vp - atime updated
	*
	* Note that the low 4 bits of the cookie returned by zap are always zero.
	* This allows us to use the low range for "special" directory entries:
	* We use 0 for '.', and 1 for '..'. If this is the root of the filesystem,
	* we use the offset 2 for the '.zfs' directory.
	*/
	/* ARGSUSED */
	static int
	zfs_readdir(vnode_t *vp, uio_t *uio, cred_t *cr, int *eofp, int *ncookies, u_long **cookies)
	{
	znode_t *zp = VTOZ(vp);
	iovec_t *iovp;
	edirent_t *eodp;
	dirent64_t *odp;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	objset_t *os;
	caddr_t outbuf;
	size_t bufsize;
	zap_cursor_t zc;
	zap_attribute_t zap;
	uint_t bytes_wanted;
	uint64_t offset; /* must be unsigned; checks for < 1 */
	int local_eof;
	int outcount;
	int error;
	uint8_t prefetch;
	boolean_t check_sysattrs;
	uint8_t type;
	int ncooks;
	u_long *cooks = NULL;
	int flags = 0;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	/*
	* If we are not given an eof variable,
	* use a local one.
	*/
	if (eofp == NULL)
	eofp = &local_eof;

	/*
	* Check for valid iov_len.
	*/
	if (uio->uio_iov->iov_len <= 0) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	/*
	* Quit if directory has been removed (posix)
	*/
	if ((*eofp = zp->z_unlinked) != 0) {
	ZFS_EXIT(zfsvfs);
	return (0);
	}

	error = 0;
	os = zfsvfs->z_os;
	offset = uio->uio_loffset;
	prefetch = zp->z_zn_prefetch;

	/*
	* Initialize the iterator cursor.
	*/
	if (offset <= 3) {
	/*
	* Start iteration from the beginning of the directory.
	*/
	zap_cursor_init(&zc, os, zp->z_id);
	} else {
	/*
	* The offset is a serialized cursor.
	*/
	zap_cursor_init_serialized(&zc, os, zp->z_id, offset);
	}

	/*
	* Get space to change directory entries into fs independent format.
	*/
	iovp = uio->uio_iov;
	bytes_wanted = iovp->iov_len;
	if (uio->uio_segflg != UIO_SYSSPACE || uio->uio_iovcnt != 1) {
	bufsize = bytes_wanted;
	outbuf = kmem_alloc(bufsize, KM_SLEEP);
	odp = (struct dirent64 *)outbuf;
	} else {
	bufsize = bytes_wanted;
	odp = (struct dirent64 *)iovp->iov_base;
	}
	eodp = (struct edirent *)odp;

	if (ncookies != NULL) {
	/*
	* Minimum entry size is dirent size and 1 byte for a file name.
	*/
	ncooks = uio->uio_resid / (sizeof(struct dirent) - sizeof(((struct dirent *)NULL)->d_name) + 1);
	cooks = malloc(ncooks * sizeof(u_long), M_TEMP, M_WAITOK);
	*cookies = cooks;
	*ncookies = ncooks;
	}
	/*
	* If this VFS supports the system attribute view interface; and
	* we're looking at an extended attribute directory; and we care
	* about normalization conflicts on this vfs; then we must check
	* for normalization conflicts with the sysattr name space.
	*/
	#ifdef TODO
	check_sysattrs = vfs_has_feature(vp->v_vfsp, VFSFT_SYSATTR_VIEWS) &&
	(vp->v_flag & V_XATTRDIR) && zfsvfs->z_norm &&
	(flags & V_RDDIR_ENTFLAGS);
	#else
	check_sysattrs = 0;
	#endif

	/*
	* Transform to file-system independent format
	*/
	outcount = 0;
	while (outcount < bytes_wanted) {
	ino64_t objnum;
	ushort_t reclen;
	off64_t *next;

	/*
	* Special case `.', `..', and `.zfs'.
	*/
	if (offset == 0) {
	(void) strcpy(zap.za_name, ".");
	zap.za_normalization_conflict = 0;
	objnum = zp->z_id;
	type = DT_DIR;
	} else if (offset == 1) {
	(void) strcpy(zap.za_name, "..");
	zap.za_normalization_conflict = 0;
	objnum = zp->z_phys->zp_parent;
	type = DT_DIR;
	} else if (offset == 2 && zfs_show_ctldir(zp)) {
	(void) strcpy(zap.za_name, ZFS_CTLDIR_NAME);
	zap.za_normalization_conflict = 0;
	objnum = ZFSCTL_INO_ROOT;
	type = DT_DIR;
	} else {
	/*
	* Grab next entry.
	*/
	if (error = zap_cursor_retrieve(&zc, &zap)) {
	if ((*eofp = (error == ENOENT)) != 0)
	break;
	else
	goto update;
	}

	if (zap.za_integer_length != 8 ||
	zap.za_num_integers != 1) {
	cmn_err(CE_WARN, "zap_readdir: bad directory "
	"entry, obj = %lld, offset = %lld\n",
	(u_longlong_t)zp->z_id,
	(u_longlong_t)offset);
	error = ENXIO;
	goto update;
	}

	objnum = ZFS_DIRENT_OBJ(zap.za_first_integer);
	/*
	* MacOS X can extract the object type here such as:
	* uint8_t type = ZFS_DIRENT_TYPE(zap.za_first_integer);
	*/
	type = ZFS_DIRENT_TYPE(zap.za_first_integer);

	if (check_sysattrs && !zap.za_normalization_conflict) {
	#ifdef TODO
	zap.za_normalization_conflict =
	xattr_sysattr_casechk(zap.za_name);
	#else
	panic("%s:%u: TODO", __func__, __LINE__);
	#endif
	}
	}

	if (flags & V_RDDIR_ENTFLAGS)
	reclen = EDIRENT_RECLEN(strlen(zap.za_name));
	else
	reclen = DIRENT64_RECLEN(strlen(zap.za_name));

	/*
	* Will this entry fit in the buffer?
	*/
	if (outcount + reclen > bufsize) {
	/*
	* Did we manage to fit anything in the buffer?
	*/
	if (!outcount) {
	error = EINVAL;
	goto update;
	}
	break;
	}
	if (flags & V_RDDIR_ENTFLAGS) {
	/*
	* Add extended flag entry:
	*/
	eodp->ed_ino = objnum;
	eodp->ed_reclen = reclen;
	/* NOTE: ed_off is the offset for the next entry */
	next = &(eodp->ed_off);
	eodp->ed_eflags = zap.za_normalization_conflict ?
	ED_CASE_CONFLICT : 0;
	(void) strncpy(eodp->ed_name, zap.za_name,
	EDIRENT_NAMELEN(reclen));
	eodp = (edirent_t *)((intptr_t)eodp + reclen);
	} else {
	/*
	* Add normal entry:
	*/
	odp->d_ino = objnum;
	odp->d_reclen = reclen;
	odp->d_namlen = strlen(zap.za_name);
	(void) strlcpy(odp->d_name, zap.za_name, odp->d_namlen + 1);
	odp->d_type = type;
	odp = (dirent64_t *)((intptr_t)odp + reclen);
	}
	outcount += reclen;

	ASSERT(outcount <= bufsize);

	/* Prefetch znode */
	if (prefetch)
	dmu_prefetch(os, objnum, 0, 0);

	/*
	* Move to the next entry, fill in the previous offset.
	*/
	if (offset > 2 || (offset == 2 && !zfs_show_ctldir(zp))) {
	zap_cursor_advance(&zc);
	offset = zap_cursor_serialize(&zc);
	} else {
	offset += 1;
	}

	if (cooks != NULL) {
	*cooks++ = offset;
	ncooks--;
	KASSERT(ncooks >= 0, ("ncookies=%d", ncooks));
	}
	}
	zp->z_zn_prefetch = B_FALSE; /* a lookup will re-enable pre-fetching */

	/* Subtract unused cookies */
	if (ncookies != NULL)
	*ncookies -= ncooks;

	if (uio->uio_segflg == UIO_SYSSPACE && uio->uio_iovcnt == 1) {
	iovp->iov_base += outcount;
	iovp->iov_len -= outcount;
	uio->uio_resid -= outcount;
	} else if (error = uiomove(outbuf, (long)outcount, UIO_READ, uio)) {
	/*
	* Reset the pointer.
	*/
	offset = uio->uio_loffset;
	}

	update:
	zap_cursor_fini(&zc);
	if (uio->uio_segflg != UIO_SYSSPACE || uio->uio_iovcnt != 1)
	kmem_free(outbuf, bufsize);

	if (error == ENOENT)
	error = 0;

	ZFS_ACCESSTIME_STAMP(zfsvfs, zp);

	uio->uio_loffset = offset;
	ZFS_EXIT(zfsvfs);
	if (error != 0 && cookies != NULL) {
	free(*cookies, M_TEMP);
	*cookies = NULL;
	*ncookies = 0;
	}
	return (error);
	}

	ulong_t zfs_fsync_sync_cnt = 4;

	static int
	zfs_fsync(vnode_t *vp, int syncflag, cred_t *cr, caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;

	(void) tsd_set(zfs_fsyncer_key, (void *)zfs_fsync_sync_cnt);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);
	zil_commit(zfsvfs->z_log, zp->z_last_itx, zp->z_id);
	ZFS_EXIT(zfsvfs);
	return (0);
	}


	/*
	* Get the requested file attributes and place them in the provided
	* vattr structure.
	*
	* IN: vp - vnode of file.
	* vap - va_mask identifies requested attributes.
	* If AT_XVATTR set, then optional attrs are requested
	* flags - ATTR_NOACLCHECK (CIFS server context)
	* cr - credentials of caller.
	* ct - caller context
	*
	* OUT: vap - attribute values.
	*
	* RETURN: 0 (always succeeds)
	*/
	/* ARGSUSED */
	static int
	zfs_getattr(vnode_t *vp, vattr_t *vap, int flags, cred_t *cr,
	caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	znode_phys_t *pzp;
	int error = 0;
	uint32_t blksize;
	u_longlong_t nblocks;
	uint64_t links;
	xvattr_t *xvap = (xvattr_t *)vap; /* vap may be an xvattr_t * */
	xoptattr_t *xoap = NULL;
	boolean_t skipaclchk = (flags & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);
	pzp = zp->z_phys;

	- mutex_enter(&zp->z_lock);
	-
	/*
	* If ACL is trivial don't bother looking for ACE_READ_ATTRIBUTES.
	* Also, if we are the owner don't bother, since owner should
	* always be allowed to read basic attributes of file.
	*/
	if (!(pzp->zp_flags & ZFS_ACL_TRIVIAL) &&
	(pzp->zp_uid != crgetuid(cr))) {
	if (error = zfs_zaccess(zp, ACE_READ_ATTRIBUTES, 0,
	skipaclchk, cr)) {
	- mutex_exit(&zp->z_lock);
	ZFS_EXIT(zfsvfs);
	return (error);
	}
	}

	/*
	* Return all attributes. It's cheaper to provide the answer
	* than to determine whether we were asked the question.
	*/

	+ mutex_enter(&zp->z_lock);
	vap->va_type = IFTOVT(pzp->zp_mode);
	vap->va_mode = pzp->zp_mode & ~S_IFMT;
	zfs_fuid_map_ids(zp, cr, &vap->va_uid, &vap->va_gid);
	// vap->va_fsid = zp->z_zfsvfs->z_vfs->vfs_dev;
	vap->va_nodeid = zp->z_id;
	if ((vp->v_flag & VROOT) && zfs_show_ctldir(zp))
	links = pzp->zp_links + 1;
	else
	links = pzp->zp_links;
	vap->va_nlink = MIN(links, UINT32_MAX); /* nlink_t limit! */
	vap->va_size = pzp->zp_size;
	vap->va_fsid = vp->v_mount->mnt_stat.f_fsid.val[0];
	vap->va_rdev = zfs_cmpldev(pzp->zp_rdev);
	vap->va_seq = zp->z_seq;
	vap->va_flags = 0; /* FreeBSD: Reset chflags(2) flags. */

	/*
	* Add in any requested optional attributes and the create time.
	* Also set the corresponding bits in the returned attribute bitmap.
	*/
	if ((xoap = xva_getxoptattr(xvap)) != NULL && zfsvfs->z_use_fuids) {
	if (XVA_ISSET_REQ(xvap, XAT_ARCHIVE)) {
	xoap->xoa_archive =
	((pzp->zp_flags & ZFS_ARCHIVE) != 0);
	XVA_SET_RTN(xvap, XAT_ARCHIVE);
	}

	if (XVA_ISSET_REQ(xvap, XAT_READONLY)) {
	xoap->xoa_readonly =
	((pzp->zp_flags & ZFS_READONLY) != 0);
	XVA_SET_RTN(xvap, XAT_READONLY);
	}

	if (XVA_ISSET_REQ(xvap, XAT_SYSTEM)) {
	xoap->xoa_system =
	((pzp->zp_flags & ZFS_SYSTEM) != 0);
	XVA_SET_RTN(xvap, XAT_SYSTEM);
	}

	if (XVA_ISSET_REQ(xvap, XAT_HIDDEN)) {
	xoap->xoa_hidden =
	((pzp->zp_flags & ZFS_HIDDEN) != 0);
	XVA_SET_RTN(xvap, XAT_HIDDEN);
	}

	if (XVA_ISSET_REQ(xvap, XAT_NOUNLINK)) {
	xoap->xoa_nounlink =
	((pzp->zp_flags & ZFS_NOUNLINK) != 0);
	XVA_SET_RTN(xvap, XAT_NOUNLINK);
	}

	if (XVA_ISSET_REQ(xvap, XAT_IMMUTABLE)) {
	xoap->xoa_immutable =
	((pzp->zp_flags & ZFS_IMMUTABLE) != 0);
	XVA_SET_RTN(xvap, XAT_IMMUTABLE);
	}

	if (XVA_ISSET_REQ(xvap, XAT_APPENDONLY)) {
	xoap->xoa_appendonly =
	((pzp->zp_flags & ZFS_APPENDONLY) != 0);
	XVA_SET_RTN(xvap, XAT_APPENDONLY);
	}

	if (XVA_ISSET_REQ(xvap, XAT_NODUMP)) {
	xoap->xoa_nodump =
	((pzp->zp_flags & ZFS_NODUMP) != 0);
	XVA_SET_RTN(xvap, XAT_NODUMP);
	}

	if (XVA_ISSET_REQ(xvap, XAT_OPAQUE)) {
	xoap->xoa_opaque =
	((pzp->zp_flags & ZFS_OPAQUE) != 0);
	XVA_SET_RTN(xvap, XAT_OPAQUE);
	}

	if (XVA_ISSET_REQ(xvap, XAT_AV_QUARANTINED)) {
	xoap->xoa_av_quarantined =
	((pzp->zp_flags & ZFS_AV_QUARANTINED) != 0);
	XVA_SET_RTN(xvap, XAT_AV_QUARANTINED);
	}

	if (XVA_ISSET_REQ(xvap, XAT_AV_MODIFIED)) {
	xoap->xoa_av_modified =
	((pzp->zp_flags & ZFS_AV_MODIFIED) != 0);
	XVA_SET_RTN(xvap, XAT_AV_MODIFIED);
	}

	if (XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP) &&
	vp->v_type == VREG &&
	(pzp->zp_flags & ZFS_BONUS_SCANSTAMP)) {
	size_t len;
	dmu_object_info_t doi;

	/*
	* Only VREG files have anti-virus scanstamps, so we
	* won't conflict with symlinks in the bonus buffer.
	*/
	dmu_object_info_from_db(zp->z_dbuf, &doi);
	len = sizeof (xoap->xoa_av_scanstamp) +
	sizeof (znode_phys_t);
	if (len <= doi.doi_bonus_size) {
	/*
	* pzp points to the start of the
	* znode_phys_t. pzp + 1 points to the
	* first byte after the znode_phys_t.
	*/
	(void) memcpy(xoap->xoa_av_scanstamp,
	pzp + 1,
	sizeof (xoap->xoa_av_scanstamp));
	XVA_SET_RTN(xvap, XAT_AV_SCANSTAMP);
	}
	}

	if (XVA_ISSET_REQ(xvap, XAT_CREATETIME)) {
	ZFS_TIME_DECODE(&xoap->xoa_createtime, pzp->zp_crtime);
	XVA_SET_RTN(xvap, XAT_CREATETIME);
	}
	}

	ZFS_TIME_DECODE(&vap->va_atime, pzp->zp_atime);
	ZFS_TIME_DECODE(&vap->va_mtime, pzp->zp_mtime);
	ZFS_TIME_DECODE(&vap->va_ctime, pzp->zp_ctime);
	ZFS_TIME_DECODE(&vap->va_birthtime, pzp->zp_crtime);

	mutex_exit(&zp->z_lock);

	dmu_object_size_from_db(zp->z_dbuf, &blksize, &nblocks);
	vap->va_blksize = blksize;
	vap->va_bytes = nblocks << 9; /* nblocks * 512 */

	if (zp->z_blksz == 0) {
	/*
	* Block size hasn't been set; suggest maximal I/O transfers.
	*/
	vap->va_blksize = zfsvfs->z_max_blksz;
	}

	ZFS_EXIT(zfsvfs);
	return (0);
	}

	/*
	* Set the file attributes to the values contained in the
	* vattr structure.
	*
	* IN: vp - vnode of file to be modified.
	* vap - new attribute values.
	* If AT_XVATTR set, then optional attrs are being set
	* flags - ATTR_UTIME set if non-default time values provided.
	* - ATTR_NOACLCHECK (CIFS context only).
	* cr - credentials of caller.
	* ct - caller context
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* vp - ctime updated, mtime updated if size changed.
	*/
	/* ARGSUSED */
	static int
	zfs_setattr(vnode_t *vp, vattr_t *vap, int flags, cred_t *cr,
	caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	znode_phys_t *pzp;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	zilog_t *zilog;
	dmu_tx_t *tx;
	vattr_t oldva;
	uint_t mask = vap->va_mask;
	uint_t saved_mask;
	uint64_t saved_mode;
	int trim_mask = 0;
	uint64_t new_mode;
	znode_t *attrzp;
	int need_policy = FALSE;
	int err;
	zfs_fuid_info_t *fuidp = NULL;
	xvattr_t *xvap = (xvattr_t *)vap; /* vap may be an xvattr_t * */
	xoptattr_t *xoap;
	zfs_acl_t *aclp = NULL;
	boolean_t skipaclchk = (flags & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;

	if (mask == 0)
	return (0);

	if (mask & AT_NOSET)
	return (EINVAL);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	pzp = zp->z_phys;
	zilog = zfsvfs->z_log;

	/*
	* Make sure that if we have ephemeral uid/gid or xvattr specified
	* that file system is at proper version level
	*/

	if (zfsvfs->z_use_fuids == B_FALSE &&
	(((mask & AT_UID) && IS_EPHEMERAL(vap->va_uid)) ||
	((mask & AT_GID) && IS_EPHEMERAL(vap->va_gid)) ||
	(mask & AT_XVATTR))) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	if (mask & AT_SIZE && vp->v_type == VDIR) {
	ZFS_EXIT(zfsvfs);
	return (EISDIR);
	}

	if (mask & AT_SIZE && vp->v_type != VREG && vp->v_type != VFIFO) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	/*
	* If this is an xvattr_t, then get a pointer to the structure of
	* optional attributes. If this is NULL, then we have a vattr_t.
	*/
	xoap = xva_getxoptattr(xvap);

	/*
	* Immutable files can only alter immutable bit and atime
	*/
	if ((pzp->zp_flags & ZFS_IMMUTABLE) &&
	((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
	((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
	ZFS_EXIT(zfsvfs);
	return (EPERM);
	}

	if ((mask & AT_SIZE) && (pzp->zp_flags & ZFS_READONLY)) {
	ZFS_EXIT(zfsvfs);
	return (EPERM);
	}

	/*
	* Verify timestamps doesn't overflow 32 bits.
	* ZFS can handle large timestamps, but 32bit syscalls can't
	* handle times greater than 2039. This check should be removed
	* once large timestamps are fully supported.
	*/
	if (mask & (AT_ATIME | AT_MTIME)) {
	if (((mask & AT_ATIME) && TIMESPEC_OVERFLOW(&vap->va_atime)) ||
	((mask & AT_MTIME) && TIMESPEC_OVERFLOW(&vap->va_mtime))) {
	ZFS_EXIT(zfsvfs);
	return (EOVERFLOW);
	}
	}

	top:
	attrzp = NULL;

	if (zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) {
	ZFS_EXIT(zfsvfs);
	return (EROFS);
	}

	/*
	* First validate permissions
	*/

	if (mask & AT_SIZE) {
	err = zfs_zaccess(zp, ACE_WRITE_DATA, 0, skipaclchk, cr);
	if (err) {
	ZFS_EXIT(zfsvfs);
	return (err);
	}
	/*
	* XXX - Note, we are not providing any open
	* mode flags here (like FNDELAY), so we may
	* block if there are locks present... this
	* should be addressed in openat().
	*/
	/* XXX - would it be OK to generate a log record here? */
	err = zfs_freesp(zp, vap->va_size, 0, 0, FALSE);
	if (err) {
	ZFS_EXIT(zfsvfs);
	return (err);
	}
	}

	if (mask & (AT_ATIME|AT_MTIME) ||
	((mask & AT_XVATTR) && (XVA_ISSET_REQ(xvap, XAT_HIDDEN) ||
	XVA_ISSET_REQ(xvap, XAT_READONLY) ||
	XVA_ISSET_REQ(xvap, XAT_ARCHIVE) ||
	XVA_ISSET_REQ(xvap, XAT_CREATETIME) ||
	XVA_ISSET_REQ(xvap, XAT_SYSTEM))))
	need_policy = zfs_zaccess(zp, ACE_WRITE_ATTRIBUTES, 0,
	skipaclchk, cr);

	if (mask & (AT_UID|AT_GID)) {
	int idmask = (mask & (AT_UID|AT_GID));
	int take_owner;
	int take_group;

	/*
	* NOTE: even if a new mode is being set,
	* we may clear S_ISUID/S_ISGID bits.
	*/

	if (!(mask & AT_MODE))
	vap->va_mode = pzp->zp_mode;

	/*
	* Take ownership or chgrp to group we are a member of
	*/

	take_owner = (mask & AT_UID) && (vap->va_uid == crgetuid(cr));
	take_group = (mask & AT_GID) &&
	zfs_groupmember(zfsvfs, vap->va_gid, cr);

	/*
	* If both AT_UID and AT_GID are set then take_owner and
	* take_group must both be set in order to allow taking
	* ownership.
	*
	* Otherwise, send the check through secpolicy_vnode_setattr()
	*
	*/

	if (((idmask == (AT_UID|AT_GID)) && take_owner && take_group) ||
	((idmask == AT_UID) && take_owner) ||
	((idmask == AT_GID) && take_group)) {
	if (zfs_zaccess(zp, ACE_WRITE_OWNER, 0,
	skipaclchk, cr) == 0) {
	/*
	* Remove setuid/setgid for non-privileged users
	*/
	secpolicy_setid_clear(vap, vp, cr);
	trim_mask = (mask & (AT_UID|AT_GID));
	} else {
	need_policy = TRUE;
	}
	} else {
	need_policy = TRUE;
	}
	}

	mutex_enter(&zp->z_lock);
	oldva.va_mode = pzp->zp_mode;
	zfs_fuid_map_ids(zp, cr, &oldva.va_uid, &oldva.va_gid);
	if (mask & AT_XVATTR) {
	if ((need_policy == FALSE) &&
	(XVA_ISSET_REQ(xvap, XAT_APPENDONLY) &&
	xoap->xoa_appendonly !=
	((pzp->zp_flags & ZFS_APPENDONLY) != 0)) ||
	(XVA_ISSET_REQ(xvap, XAT_NOUNLINK) &&
	xoap->xoa_nounlink !=
	((pzp->zp_flags & ZFS_NOUNLINK) != 0)) ||
	(XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
	xoap->xoa_immutable !=
	((pzp->zp_flags & ZFS_IMMUTABLE) != 0)) ||
	(XVA_ISSET_REQ(xvap, XAT_NODUMP) &&
	xoap->xoa_nodump !=
	((pzp->zp_flags & ZFS_NODUMP) != 0)) ||
	(XVA_ISSET_REQ(xvap, XAT_AV_MODIFIED) &&
	xoap->xoa_av_modified !=
	((pzp->zp_flags & ZFS_AV_MODIFIED) != 0)) ||
	((XVA_ISSET_REQ(xvap, XAT_AV_QUARANTINED) &&
	((vp->v_type != VREG && xoap->xoa_av_quarantined) ||
	xoap->xoa_av_quarantined !=
	((pzp->zp_flags & ZFS_AV_QUARANTINED) != 0)))) ||
	(XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP)) ||
	(XVA_ISSET_REQ(xvap, XAT_OPAQUE))) {
	need_policy = TRUE;
	}
	}

	mutex_exit(&zp->z_lock);

	if (mask & AT_MODE) {
	if (zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr) == 0) {
	err = secpolicy_setid_setsticky_clear(vp, vap,
	&oldva, cr);
	if (err) {
	ZFS_EXIT(zfsvfs);
	return (err);
	}
	trim_mask |= AT_MODE;
	} else {
	need_policy = TRUE;
	}
	}

	if (need_policy) {
	/*
	* If trim_mask is set then take ownership
	* has been granted or write_acl is present and user
	* has the ability to modify mode. In that case remove
	* UID|GID and or MODE from mask so that
	* secpolicy_vnode_setattr() doesn't revoke it.
	*/

	if (trim_mask) {
	saved_mask = vap->va_mask;
	vap->va_mask &= ~trim_mask;
	if (trim_mask & AT_MODE) {
	/*
	* Save the mode, as secpolicy_vnode_setattr()
	* will overwrite it with ova.va_mode.
	*/
	saved_mode = vap->va_mode;
	}
	}
	err = secpolicy_vnode_setattr(cr, vp, vap, &oldva, flags,
	(int (*)(void *, int, cred_t *))zfs_zaccess_unix, zp);
	if (err) {
	ZFS_EXIT(zfsvfs);
	return (err);
	}

	if (trim_mask) {
	vap->va_mask |= saved_mask;
	if (trim_mask & AT_MODE) {
	/*
	* Recover the mode after
	* secpolicy_vnode_setattr().
	*/
	vap->va_mode = saved_mode;
	}
	}
	}

	/*
	* secpolicy_vnode_setattr, or take ownership may have
	* changed va_mask
	*/
	mask = vap->va_mask;

	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_bonus(tx, zp->z_id);
	if (((mask & AT_UID) && IS_EPHEMERAL(vap->va_uid)) ||
	((mask & AT_GID) && IS_EPHEMERAL(vap->va_gid))) {
	if (zfsvfs->z_fuid_obj == 0) {
	dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
	} else {
	dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
	dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	}
	}

	if (mask & AT_MODE) {
	uint64_t pmode = pzp->zp_mode;

	new_mode = (pmode & S_IFMT) | (vap->va_mode & ~S_IFMT);

	if (err = zfs_acl_chmod_setattr(zp, &aclp, new_mode)) {
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	return (err);
	}
	if (pzp->zp_acl.z_acl_extern_obj) {
	/* Are we upgrading ACL from old V0 format to new V1 */
	if (zfsvfs->z_version <= ZPL_VERSION_FUID &&
	pzp->zp_acl.z_acl_version ==
	ZFS_ACL_VERSION_INITIAL) {
	dmu_tx_hold_free(tx,
	pzp->zp_acl.z_acl_extern_obj, 0,
	DMU_OBJECT_END);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
	0, aclp->z_acl_bytes);
	} else {
	dmu_tx_hold_write(tx,
	pzp->zp_acl.z_acl_extern_obj, 0,
	aclp->z_acl_bytes);
	}
	} else if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
	0, aclp->z_acl_bytes);
	}
	}

	if ((mask & (AT_UID | AT_GID)) && pzp->zp_xattr != 0) {
	err = zfs_zget(zp->z_zfsvfs, pzp->zp_xattr, &attrzp);
	if (err) {
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	if (aclp)
	zfs_acl_free(aclp);
	return (err);
	}
	dmu_tx_hold_bonus(tx, attrzp->z_id);
	}

	err = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (err) {
	if (attrzp)
	VN_RELE(ZTOV(attrzp));

	if (aclp) {
	zfs_acl_free(aclp);
	aclp = NULL;
	}

	if (err == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	return (err);
	}

	dmu_buf_will_dirty(zp->z_dbuf, tx);

	/*
	* Set each attribute requested.
	* We group settings according to the locks they need to acquire.
	*
	* Note: you cannot set ctime directly, although it will be
	* updated as a side-effect of calling this function.
	*/

	mutex_enter(&zp->z_lock);

	if (mask & AT_MODE) {
	mutex_enter(&zp->z_acl_lock);
	zp->z_phys->zp_mode = new_mode;
	err = zfs_aclset_common(zp, aclp, cr, &fuidp, tx);
	ASSERT3U(err, ==, 0);
	mutex_exit(&zp->z_acl_lock);
	}

	if (attrzp)
	mutex_enter(&attrzp->z_lock);

	if (mask & AT_UID) {
	pzp->zp_uid = zfs_fuid_create(zfsvfs,
	vap->va_uid, cr, ZFS_OWNER, tx, &fuidp);
	if (attrzp) {
	attrzp->z_phys->zp_uid = zfs_fuid_create(zfsvfs,
	vap->va_uid, cr, ZFS_OWNER, tx, &fuidp);
	}
	}

	if (mask & AT_GID) {
	pzp->zp_gid = zfs_fuid_create(zfsvfs, vap->va_gid,
	cr, ZFS_GROUP, tx, &fuidp);
	if (attrzp)
	attrzp->z_phys->zp_gid = zfs_fuid_create(zfsvfs,
	vap->va_gid, cr, ZFS_GROUP, tx, &fuidp);
	}

	if (aclp)
	zfs_acl_free(aclp);

	if (attrzp)
	mutex_exit(&attrzp->z_lock);

	if (mask & AT_ATIME)
	ZFS_TIME_ENCODE(&vap->va_atime, pzp->zp_atime);

	if (mask & AT_MTIME)
	ZFS_TIME_ENCODE(&vap->va_mtime, pzp->zp_mtime);

	/* XXX - shouldn't this be done before the ATIME/MTIME checks? */
	if (mask & AT_SIZE)
	zfs_time_stamper_locked(zp, CONTENT_MODIFIED, tx);
	else if (mask != 0)
	zfs_time_stamper_locked(zp, STATE_CHANGED, tx);
	/*
	* Do this after setting timestamps to prevent timestamp
	* update from toggling bit
	*/

	if (xoap && (mask & AT_XVATTR)) {
	if (XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP)) {
	size_t len;
	dmu_object_info_t doi;

	ASSERT(vp->v_type == VREG);

	/* Grow the bonus buffer if necessary. */
	dmu_object_info_from_db(zp->z_dbuf, &doi);
	len = sizeof (xoap->xoa_av_scanstamp) +
	sizeof (znode_phys_t);
	if (len > doi.doi_bonus_size)
	VERIFY(dmu_set_bonus(zp->z_dbuf, len, tx) == 0);
	}
	zfs_xvattr_set(zp, xvap);
	}

	if (mask != 0)
	zfs_log_setattr(zilog, tx, TX_SETATTR, zp, vap, mask, fuidp);

	if (fuidp)
	zfs_fuid_info_free(fuidp);
	mutex_exit(&zp->z_lock);

	if (attrzp)
	VN_RELE(ZTOV(attrzp));

	dmu_tx_commit(tx);

	ZFS_EXIT(zfsvfs);
	return (err);
	}

	typedef struct zfs_zlock {
	krwlock_t *zl_rwlock; /* lock we acquired */
	znode_t *zl_znode; /* znode we held */
	struct zfs_zlock *zl_next; /* next in list */
	} zfs_zlock_t;

	/*
	* Drop locks and release vnodes that were held by zfs_rename_lock().
	*/
	static void
	zfs_rename_unlock(zfs_zlock_t **zlpp)
	{
	zfs_zlock_t *zl;

	while ((zl = *zlpp) != NULL) {
	if (zl->zl_znode != NULL)
	VN_RELE(ZTOV(zl->zl_znode));
	rw_exit(zl->zl_rwlock);
	*zlpp = zl->zl_next;
	kmem_free(zl, sizeof (*zl));
	}
	}

	/*
	* Search back through the directory tree, using the ".." entries.
	* Lock each directory in the chain to prevent concurrent renames.
	* Fail any attempt to move a directory into one of its own descendants.
	* XXX - z_parent_lock can overlap with map or grow locks
	*/
	static int
	zfs_rename_lock(znode_t *szp, znode_t *tdzp, znode_t *sdzp, zfs_zlock_t **zlpp)
	{
	zfs_zlock_t *zl;
	znode_t *zp = tdzp;
	uint64_t rootid = zp->z_zfsvfs->z_root;
	uint64_t *oidp = &zp->z_id;
	krwlock_t *rwlp = &szp->z_parent_lock;
	krw_t rw = RW_WRITER;

	/*
	* First pass write-locks szp and compares to zp->z_id.
	* Later passes read-lock zp and compare to zp->z_parent.
	*/
	do {
	if (!rw_tryenter(rwlp, rw)) {
	/*
	* Another thread is renaming in this path.
	* Note that if we are a WRITER, we don't have any
	* parent_locks held yet.
	*/
	if (rw == RW_READER && zp->z_id > szp->z_id) {
	/*
	* Drop our locks and restart
	*/
	zfs_rename_unlock(&zl);
	*zlpp = NULL;
	zp = tdzp;
	oidp = &zp->z_id;
	rwlp = &szp->z_parent_lock;
	rw = RW_WRITER;
	continue;
	} else {
	/*
	* Wait for other thread to drop its locks
	*/
	rw_enter(rwlp, rw);
	}
	}

	zl = kmem_alloc(sizeof (*zl), KM_SLEEP);
	zl->zl_rwlock = rwlp;
	zl->zl_znode = NULL;
	zl->zl_next = *zlpp;
	*zlpp = zl;

	if (*oidp == szp->z_id) /* We're a descendant of szp */
	return (EINVAL);

	if (*oidp == rootid) /* We've hit the top */
	return (0);

	if (rw == RW_READER) { /* i.e. not the first pass */
	int error = zfs_zget(zp->z_zfsvfs, *oidp, &zp);
	if (error)
	return (error);
	zl->zl_znode = zp;
	}
	oidp = &zp->z_phys->zp_parent;
	rwlp = &zp->z_parent_lock;
	rw = RW_READER;

	} while (zp->z_id != sdzp->z_id);

	return (0);
	}

	/*
	* Move an entry from the provided source directory to the target
	* directory. Change the entry name as indicated.
	*
	* IN: sdvp - Source directory containing the "old entry".
	* snm - Old entry name.
	* tdvp - Target directory to contain the "new entry".
	* tnm - New entry name.
	* cr - credentials of caller.
	* ct - caller context
	* flags - case flags
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* sdvp,tdvp - ctime|mtime updated
	*/
	/*ARGSUSED*/
	static int
	zfs_rename(vnode_t *sdvp, char *snm, vnode_t *tdvp, char *tnm, cred_t *cr,
	caller_context_t *ct, int flags)
	{
	znode_t *tdzp, *szp, *tzp;
	znode_t *sdzp = VTOZ(sdvp);
	zfsvfs_t *zfsvfs = sdzp->z_zfsvfs;
	zilog_t *zilog;
	vnode_t *realvp;
	zfs_dirlock_t *sdl, *tdl;
	dmu_tx_t *tx;
	zfs_zlock_t *zl;
	int cmp, serr, terr;
	int error = 0;
	int zflg = 0;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(sdzp);
	zilog = zfsvfs->z_log;

	/*
	* Make sure we have the real vp for the target directory.
	*/
	if (VOP_REALVP(tdvp, &realvp, ct) == 0)
	tdvp = realvp;

	if (tdvp->v_vfsp != sdvp->v_vfsp) {
	ZFS_EXIT(zfsvfs);
	return (EXDEV);
	}

	tdzp = VTOZ(tdvp);
	ZFS_VERIFY_ZP(tdzp);
	if (zfsvfs->z_utf8 && u8_validate(tnm,
	strlen(tnm), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
	ZFS_EXIT(zfsvfs);
	return (EILSEQ);
	}

	if (flags & FIGNORECASE)
	zflg |= ZCILOOK;

	top:
	szp = NULL;
	tzp = NULL;
	zl = NULL;

	/*
	* This is to prevent the creation of links into attribute space
	* by renaming a linked file into/outof an attribute directory.
	* See the comment in zfs_link() for why this is considered bad.
	*/
	if ((tdzp->z_phys->zp_flags & ZFS_XATTR) !=
	(sdzp->z_phys->zp_flags & ZFS_XATTR)) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	/*
	* Lock source and target directory entries. To prevent deadlock,
	* a lock ordering must be defined. We lock the directory with
	* the smallest object id first, or if it's a tie, the one with
	* the lexically first name.
	*/
	if (sdzp->z_id < tdzp->z_id) {
	cmp = -1;
	} else if (sdzp->z_id > tdzp->z_id) {
	cmp = 1;
	} else {
	/*
	* First compare the two name arguments without
	* considering any case folding.
	*/
	int nofold = (zfsvfs->z_norm & ~U8_TEXTPREP_TOUPPER);

	cmp = u8_strcmp(snm, tnm, 0, nofold, U8_UNICODE_LATEST, &error);
	ASSERT(error == 0 || !zfsvfs->z_utf8);
	if (cmp == 0) {
	/*
	* POSIX: "If the old argument and the new argument
	* both refer to links to the same existing file,
	* the rename() function shall return successfully
	* and perform no other action."
	*/
	ZFS_EXIT(zfsvfs);
	return (0);
	}
	/*
	* If the file system is case-folding, then we may
	* have some more checking to do. A case-folding file
	* system is either supporting mixed case sensitivity
	* access or is completely case-insensitive. Note
	* that the file system is always case preserving.
	*
	* In mixed sensitivity mode case sensitive behavior
	* is the default. FIGNORECASE must be used to
	* explicitly request case insensitive behavior.
	*
	* If the source and target names provided differ only
	* by case (e.g., a request to rename 'tim' to 'Tim'),
	* we will treat this as a special case in the
	* case-insensitive mode: as long as the source name
	* is an exact match, we will allow this to proceed as
	* a name-change request.
	*/
	if ((zfsvfs->z_case == ZFS_CASE_INSENSITIVE ||
	(zfsvfs->z_case == ZFS_CASE_MIXED &&
	flags & FIGNORECASE)) &&
	u8_strcmp(snm, tnm, 0, zfsvfs->z_norm, U8_UNICODE_LATEST,
	&error) == 0) {
	/*
	* case preserving rename request, require exact
	* name matches
	*/
	zflg |= ZCIEXACT;
	zflg &= ~ZCILOOK;
	}
	}

	/*
	* If the source and destination directories are the same, we should
	* grab the z_name_lock of that directory only once.
	*/
	if (sdzp == tdzp) {
	zflg |= ZHAVELOCK;
	rw_enter(&sdzp->z_name_lock, RW_READER);
	}

	if (cmp < 0) {
	serr = zfs_dirent_lock(&sdl, sdzp, snm, &szp,
	ZEXISTS | zflg, NULL, NULL);
	terr = zfs_dirent_lock(&tdl,
	tdzp, tnm, &tzp, ZRENAMING | zflg, NULL, NULL);
	} else {
	terr = zfs_dirent_lock(&tdl,
	tdzp, tnm, &tzp, zflg, NULL, NULL);
	serr = zfs_dirent_lock(&sdl,
	sdzp, snm, &szp, ZEXISTS | ZRENAMING | zflg,
	NULL, NULL);
	}

	if (serr) {
	/*
	* Source entry invalid or not there.
	*/
	if (!terr) {
	zfs_dirent_unlock(tdl);
	if (tzp)
	VN_RELE(ZTOV(tzp));
	}

	if (sdzp == tdzp)
	rw_exit(&sdzp->z_name_lock);

	if (strcmp(snm, ".") == 0 || strcmp(snm, "..") == 0)
	serr = EINVAL;
	ZFS_EXIT(zfsvfs);
	return (serr);
	}
	if (terr) {
	zfs_dirent_unlock(sdl);
	VN_RELE(ZTOV(szp));

	if (sdzp == tdzp)
	rw_exit(&sdzp->z_name_lock);

	if (strcmp(tnm, "..") == 0)
	terr = EINVAL;
	ZFS_EXIT(zfsvfs);
	return (terr);
	}

	/*
	* Must have write access at the source to remove the old entry
	* and write access at the target to create the new entry.
	* Note that if target and source are the same, this can be
	* done in a single check.
	*/

	if (error = zfs_zaccess_rename(sdzp, szp, tdzp, tzp, cr))
	goto out;

	if (ZTOV(szp)->v_type == VDIR) {
	/*
	* Check to make sure rename is valid.
	* Can't do a move like this: /usr/a/b to /usr/a/b/c/d
	*/
	if (error = zfs_rename_lock(szp, tdzp, sdzp, &zl))
	goto out;
	}

	/*
	* Does target exist?
	*/
	if (tzp) {
	/*
	* Source and target must be the same type.
	*/
	if (ZTOV(szp)->v_type == VDIR) {
	if (ZTOV(tzp)->v_type != VDIR) {
	error = ENOTDIR;
	goto out;
	}
	} else {
	if (ZTOV(tzp)->v_type == VDIR) {
	error = EISDIR;
	goto out;
	}
	}
	/*
	* POSIX dictates that when the source and target
	* entries refer to the same file object, rename
	* must do nothing and exit without error.
	*/
	if (szp->z_id == tzp->z_id) {
	error = 0;
	goto out;
	}
	}

	vnevent_rename_src(ZTOV(szp), sdvp, snm, ct);
	if (tzp)
	vnevent_rename_dest(ZTOV(tzp), tdvp, tnm, ct);

	/*
	* notify the target directory if it is not the same
	* as source directory.
	*/
	if (tdvp != sdvp) {
	vnevent_rename_dest_dir(tdvp, ct);
	}

	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_bonus(tx, szp->z_id); /* nlink changes */
	dmu_tx_hold_bonus(tx, sdzp->z_id); /* nlink changes */
	dmu_tx_hold_zap(tx, sdzp->z_id, FALSE, snm);
	dmu_tx_hold_zap(tx, tdzp->z_id, TRUE, tnm);
	if (sdzp != tdzp)
	dmu_tx_hold_bonus(tx, tdzp->z_id); /* nlink changes */
	if (tzp)
	dmu_tx_hold_bonus(tx, tzp->z_id); /* parent changes */
	dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	if (zl != NULL)
	zfs_rename_unlock(&zl);
	zfs_dirent_unlock(sdl);
	zfs_dirent_unlock(tdl);

	if (sdzp == tdzp)
	rw_exit(&sdzp->z_name_lock);

	VN_RELE(ZTOV(szp));
	if (tzp)
	VN_RELE(ZTOV(tzp));
	if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	if (tzp) /* Attempt to remove the existing target */
	error = zfs_link_destroy(tdl, tzp, tx, zflg, NULL);

	if (error == 0) {
	error = zfs_link_create(tdl, szp, tx, ZRENAMING);
	if (error == 0) {
	szp->z_phys->zp_flags |= ZFS_AV_MODIFIED;

	error = zfs_link_destroy(sdl, szp, tx, ZRENAMING, NULL);
	ASSERT(error == 0);

	zfs_log_rename(zilog, tx,
	TX_RENAME | (flags & FIGNORECASE ? TX_CI : 0),
	sdzp, sdl->dl_name, tdzp, tdl->dl_name, szp);

	/* Update path information for the target vnode */
	vn_renamepath(tdvp, ZTOV(szp), tnm, strlen(tnm));
	}
	#ifdef FREEBSD_NAMECACHE
	if (error == 0) {
	cache_purge(sdvp);
	cache_purge(tdvp);
	}
	#endif
	}

	dmu_tx_commit(tx);
	out:
	if (zl != NULL)
	zfs_rename_unlock(&zl);

	zfs_dirent_unlock(sdl);
	zfs_dirent_unlock(tdl);

	if (sdzp == tdzp)
	rw_exit(&sdzp->z_name_lock);

	VN_RELE(ZTOV(szp));
	if (tzp)
	VN_RELE(ZTOV(tzp));

	ZFS_EXIT(zfsvfs);

	return (error);
	}

	/*
	* Insert the indicated symbolic reference entry into the directory.
	*
	* IN: dvp - Directory to contain new symbolic link.
	* link - Name for new symlink entry.
	* vap - Attributes of new entry.
	* target - Target path of new symlink.
	* cr - credentials of caller.
	* ct - caller context
	* flags - case flags
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* dvp - ctime|mtime updated
	*/
	/*ARGSUSED*/
	static int
	zfs_symlink(vnode_t *dvp, vnode_t **vpp, char *name, vattr_t *vap, char *link,
	cred_t *cr, kthread_t *td)
	{
	znode_t *zp, *dzp = VTOZ(dvp);
	zfs_dirlock_t *dl;
	dmu_tx_t *tx;
	zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
	zilog_t *zilog;
	int len = strlen(link);
	int error;
	int zflg = ZNEW;
	zfs_fuid_info_t *fuidp = NULL;
	int flags = 0;

	ASSERT(vap->va_type == VLNK);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(dzp);
	zilog = zfsvfs->z_log;

	if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
	NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
	ZFS_EXIT(zfsvfs);
	return (EILSEQ);
	}
	if (flags & FIGNORECASE)
	zflg |= ZCILOOK;
	top:
	if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	if (len > MAXPATHLEN) {
	ZFS_EXIT(zfsvfs);
	return (ENAMETOOLONG);
	}

	/*
	* Attempt to lock directory; fail if entry already exists.
	*/
	error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg, NULL, NULL);
	if (error) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, MAX(1, len));
	dmu_tx_hold_bonus(tx, dzp->z_id);
	dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
	if (dzp->z_phys->zp_flags & ZFS_INHERIT_ACE)
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, SPA_MAXBLOCKSIZE);
	if (IS_EPHEMERAL(crgetuid(cr)) || IS_EPHEMERAL(crgetgid(cr))) {
	if (zfsvfs->z_fuid_obj == 0) {
	dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
	dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
	} else {
	dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
	dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
	FUID_SIZE_ESTIMATE(zfsvfs));
	}
	}
	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	zfs_dirent_unlock(dl);
	if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	dmu_buf_will_dirty(dzp->z_dbuf, tx);

	/*
	* Create a new object for the symlink.
	* Put the link content into bonus buffer if it will fit;
	* otherwise, store it just like any other file data.
	*/
	if (sizeof (znode_phys_t) + len <= dmu_bonus_max()) {
	zfs_mknode(dzp, vap, tx, cr, 0, &zp, len, NULL, &fuidp);
	if (len != 0)
	bcopy(link, zp->z_phys + 1, len);
	} else {
	dmu_buf_t *dbp;

	zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, NULL, &fuidp);
	/*
	* Nothing can access the znode yet so no locking needed
	* for growing the znode's blocksize.
	*/
	zfs_grow_blocksize(zp, len, tx);

	VERIFY(0 == dmu_buf_hold(zfsvfs->z_os,
	zp->z_id, 0, FTAG, &dbp));
	dmu_buf_will_dirty(dbp, tx);

	ASSERT3U(len, <=, dbp->db_size);
	bcopy(link, dbp->db_data, len);
	dmu_buf_rele(dbp, FTAG);
	}
	zp->z_phys->zp_size = len;

	/*
	* Insert the new object into the directory.
	*/
	(void) zfs_link_create(dl, zp, tx, ZNEW);
	out:
	if (error == 0) {
	uint64_t txtype = TX_SYMLINK;
	if (flags & FIGNORECASE)
	txtype |= TX_CI;
	zfs_log_symlink(zilog, tx, txtype, dzp, zp, name, link);
	*vpp = ZTOV(zp);
	}
	if (fuidp)
	zfs_fuid_info_free(fuidp);

	dmu_tx_commit(tx);

	zfs_dirent_unlock(dl);

	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Return, in the buffer contained in the provided uio structure,
	* the symbolic path referred to by vp.
	*
	* IN: vp - vnode of symbolic link.
	* uio - structure to contain the link path.
	* cr - credentials of caller.
	* ct - caller context
	*
	* OUT: uio - structure to contain the link path.
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* vp - atime updated
	*/
	/* ARGSUSED */
	static int
	zfs_readlink(vnode_t *vp, uio_t *uio, cred_t *cr, caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	size_t bufsz;
	int error;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);

	bufsz = (size_t)zp->z_phys->zp_size;
	if (bufsz + sizeof (znode_phys_t) <= zp->z_dbuf->db_size) {
	error = uiomove(zp->z_phys + 1,
	MIN((size_t)bufsz, uio->uio_resid), UIO_READ, uio);
	} else {
	dmu_buf_t *dbp;
	error = dmu_buf_hold(zfsvfs->z_os, zp->z_id, 0, FTAG, &dbp);
	if (error) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}
	error = uiomove(dbp->db_data,
	MIN((size_t)bufsz, uio->uio_resid), UIO_READ, uio);
	dmu_buf_rele(dbp, FTAG);
	}

	ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Insert a new entry into directory tdvp referencing svp.
	*
	* IN: tdvp - Directory to contain new entry.
	* svp - vnode of new entry.
	* name - name of new entry.
	* cr - credentials of caller.
	* ct - caller context
	*
	* RETURN: 0 if success
	* error code if failure
	*
	* Timestamps:
	* tdvp - ctime|mtime updated
	* svp - ctime updated
	*/
	/* ARGSUSED */
	static int
	zfs_link(vnode_t *tdvp, vnode_t *svp, char *name, cred_t *cr,
	caller_context_t *ct, int flags)
	{
	znode_t *dzp = VTOZ(tdvp);
	znode_t *tzp, *szp;
	zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
	zilog_t *zilog;
	zfs_dirlock_t *dl;
	dmu_tx_t *tx;
	vnode_t *realvp;
	int error;
	int zf = ZNEW;
	uid_t owner;

	ASSERT(tdvp->v_type == VDIR);

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(dzp);
	zilog = zfsvfs->z_log;

	if (VOP_REALVP(svp, &realvp, ct) == 0)
	svp = realvp;

	if (svp->v_vfsp != tdvp->v_vfsp) {
	ZFS_EXIT(zfsvfs);
	return (EXDEV);
	}
	szp = VTOZ(svp);
	ZFS_VERIFY_ZP(szp);

	if (zfsvfs->z_utf8 && u8_validate(name,
	strlen(name), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
	ZFS_EXIT(zfsvfs);
	return (EILSEQ);
	}
	if (flags & FIGNORECASE)
	zf |= ZCILOOK;

	top:
	/*
	* We do not support links between attributes and non-attributes
	* because of the potential security risk of creating links
	* into "normal" file space in order to circumvent restrictions
	* imposed in attribute space.
	*/
	if ((szp->z_phys->zp_flags & ZFS_XATTR) !=
	(dzp->z_phys->zp_flags & ZFS_XATTR)) {
	ZFS_EXIT(zfsvfs);
	return (EINVAL);
	}

	/*
	* POSIX dictates that we return EPERM here.
	* Better choices include ENOTSUP or EISDIR.
	*/
	if (svp->v_type == VDIR) {
	ZFS_EXIT(zfsvfs);
	return (EPERM);
	}

	owner = zfs_fuid_map_id(zfsvfs, szp->z_phys->zp_uid, cr, ZFS_OWNER);
	if (owner != crgetuid(cr) &&
	secpolicy_basic_link(svp, cr) != 0) {
	ZFS_EXIT(zfsvfs);
	return (EPERM);
	}

	if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*
	* Attempt to lock directory; fail if entry already exists.
	*/
	error = zfs_dirent_lock(&dl, dzp, name, &tzp, zf, NULL, NULL);
	if (error) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	tx = dmu_tx_create(zfsvfs->z_os);
	dmu_tx_hold_bonus(tx, szp->z_id);
	dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
	error = dmu_tx_assign(tx, zfsvfs->z_assign);
	if (error) {
	zfs_dirent_unlock(dl);
	if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
	dmu_tx_wait(tx);
	dmu_tx_abort(tx);
	goto top;
	}
	dmu_tx_abort(tx);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	error = zfs_link_create(dl, szp, tx, 0);

	if (error == 0) {
	uint64_t txtype = TX_LINK;
	if (flags & FIGNORECASE)
	txtype |= TX_CI;
	zfs_log_link(zilog, tx, txtype, dzp, szp, name);
	}

	dmu_tx_commit(tx);

	zfs_dirent_unlock(dl);

	if (error == 0) {
	vnevent_link(svp, ct);
	}

	ZFS_EXIT(zfsvfs);
	return (error);
	}

	/*ARGSUSED*/
	void
	zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	int error;

	rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
	if (zp->z_dbuf == NULL) {
	/*
	* The fs has been unmounted, or we did a
	* suspend/resume and this file no longer exists.
	*/
	VI_LOCK(vp);
	vp->v_count = 0; /* count arrives as 1 */
	VI_UNLOCK(vp);
	vrecycle(vp, curthread);
	rw_exit(&zfsvfs->z_teardown_inactive_lock);
	return;
	}

	if (zp->z_atime_dirty && zp->z_unlinked == 0) {
	dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);

	dmu_tx_hold_bonus(tx, zp->z_id);
	error = dmu_tx_assign(tx, TXG_WAIT);
	if (error) {
	dmu_tx_abort(tx);
	} else {
	dmu_buf_will_dirty(zp->z_dbuf, tx);
	mutex_enter(&zp->z_lock);
	zp->z_atime_dirty = 0;
	mutex_exit(&zp->z_lock);
	dmu_tx_commit(tx);
	}
	}

	zfs_zinactive(zp);
	rw_exit(&zfsvfs->z_teardown_inactive_lock);
	}

	CTASSERT(sizeof(struct zfid_short) <= sizeof(struct fid));
	CTASSERT(sizeof(struct zfid_long) <= sizeof(struct fid));

	/*ARGSUSED*/
	static int
	zfs_fid(vnode_t *vp, fid_t *fidp, caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	uint32_t gen;
	uint64_t object = zp->z_id;
	zfid_short_t *zfid;
	int size, i;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);
	gen = (uint32_t)zp->z_gen;

	size = (zfsvfs->z_parent != zfsvfs) ? LONG_FID_LEN : SHORT_FID_LEN;
	fidp->fid_len = size;

	zfid = (zfid_short_t *)fidp;

	zfid->zf_len = size;

	for (i = 0; i < sizeof (zfid->zf_object); i++)
	zfid->zf_object[i] = (uint8_t)(object >> (8 * i));

	/* Must have a non-zero generation number to distinguish from .zfs */
	if (gen == 0)
	gen = 1;
	for (i = 0; i < sizeof (zfid->zf_gen); i++)
	zfid->zf_gen[i] = (uint8_t)(gen >> (8 * i));

	if (size == LONG_FID_LEN) {
	uint64_t objsetid = dmu_objset_id(zfsvfs->z_os);
	zfid_long_t *zlfid;

	zlfid = (zfid_long_t *)fidp;

	for (i = 0; i < sizeof (zlfid->zf_setid); i++)
	zlfid->zf_setid[i] = (uint8_t)(objsetid >> (8 * i));

	/* XXX - this should be the generation number for the objset */
	for (i = 0; i < sizeof (zlfid->zf_setgen); i++)
	zlfid->zf_setgen[i] = 0;
	}

	ZFS_EXIT(zfsvfs);
	return (0);
	}

	static int
	zfs_pathconf(vnode_t *vp, int cmd, ulong_t *valp, cred_t *cr,
	caller_context_t *ct)
	{
	znode_t *zp, *xzp;
	zfsvfs_t *zfsvfs;
	zfs_dirlock_t *dl;
	int error;

	switch (cmd) {
	case _PC_LINK_MAX:
	*valp = INT_MAX;
	return (0);

	case _PC_FILESIZEBITS:
	*valp = 64;
	return (0);

	#if 0
	case _PC_XATTR_EXISTS:
	zp = VTOZ(vp);
	zfsvfs = zp->z_zfsvfs;
	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);
	*valp = 0;
	error = zfs_dirent_lock(&dl, zp, "", &xzp,
	ZXATTR | ZEXISTS | ZSHARED, NULL, NULL);
	if (error == 0) {
	zfs_dirent_unlock(dl);
	if (!zfs_dirempty(xzp))
	*valp = 1;
	VN_RELE(ZTOV(xzp));
	} else if (error == ENOENT) {
	/*
	* If there aren't extended attributes, it's the
	* same as having zero of them.
	*/
	error = 0;
	}
	ZFS_EXIT(zfsvfs);
	return (error);
	#endif

	case _PC_ACL_EXTENDED:
	*valp = 0;
	return (0);

	case _PC_ACL_NFS4:
	*valp = 1;
	return (0);

	case _PC_ACL_PATH_MAX:
	*valp = ACL_MAX_ENTRIES;
	return (0);

	case _PC_MIN_HOLE_SIZE:
	*valp = (int)SPA_MINBLOCKSIZE;
	return (0);

	default:
	return (EOPNOTSUPP);
	}
	}

	/*ARGSUSED*/
	static int
	zfs_getsecattr(vnode_t *vp, vsecattr_t *vsecp, int flag, cred_t *cr,
	caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	int error;
	boolean_t skipaclchk = (flag & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);
	error = zfs_getacl(zp, vsecp, skipaclchk, cr);
	ZFS_EXIT(zfsvfs);

	return (error);
	}

	/*ARGSUSED*/
	static int
	zfs_setsecattr(vnode_t *vp, vsecattr_t *vsecp, int flag, cred_t *cr,
	caller_context_t *ct)
	{
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;
	int error;
	boolean_t skipaclchk = (flag & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;

	ZFS_ENTER(zfsvfs);
	ZFS_VERIFY_ZP(zp);
	error = zfs_setacl(zp, vsecp, skipaclchk, cr);
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	static int
	zfs_freebsd_open(ap)
	struct vop_open_args /* {
	struct vnode *a_vp;
	int a_mode;
	struct ucred *a_cred;
	struct thread *a_td;
	} */ *ap;
	{
	vnode_t *vp = ap->a_vp;
	znode_t *zp = VTOZ(vp);
	int error;

	error = zfs_open(&vp, ap->a_mode, ap->a_cred, NULL);
	if (error == 0)
	vnode_create_vobject(vp, zp->z_phys->zp_size, ap->a_td);
	return (error);
	}

	static int
	zfs_freebsd_close(ap)
	struct vop_close_args /* {
	struct vnode *a_vp;
	int a_fflag;
	struct ucred *a_cred;
	struct thread *a_td;
	} */ *ap;
	{

	return (zfs_close(ap->a_vp, ap->a_fflag, 0, 0, ap->a_cred, NULL));
	}

	static int
	zfs_freebsd_ioctl(ap)
	struct vop_ioctl_args /* {
	struct vnode *a_vp;
	u_long a_command;
	caddr_t a_data;
	int a_fflag;
	struct ucred *cred;
	struct thread *td;
	} */ *ap;
	{

	return (zfs_ioctl(ap->a_vp, ap->a_command, (intptr_t)ap->a_data,
	ap->a_fflag, ap->a_cred, NULL, NULL));
	}

	static int
	zfs_freebsd_read(ap)
	struct vop_read_args /* {
	struct vnode *a_vp;
	struct uio *a_uio;
	int a_ioflag;
	struct ucred *a_cred;
	} */ *ap;
	{

	return (zfs_read(ap->a_vp, ap->a_uio, ap->a_ioflag, ap->a_cred, NULL));
	}

	static int
	zfs_freebsd_write(ap)
	struct vop_write_args /* {
	struct vnode *a_vp;
	struct uio *a_uio;
	int a_ioflag;
	struct ucred *a_cred;
	} */ *ap;
	{

	return (zfs_write(ap->a_vp, ap->a_uio, ap->a_ioflag, ap->a_cred, NULL));
	}

	static int
	zfs_freebsd_access(ap)
	struct vop_access_args /* {
	struct vnode *a_vp;
	accmode_t a_accmode;
	struct ucred *a_cred;
	struct thread *a_td;
	} */ *ap;
	{
	accmode_t accmode;
	int error = 0;

	/*
	* ZFS itself only knows about VREAD, VWRITE, VEXEC and VAPPEND.
	*/
	accmode = ap->a_accmode & (VREAD|VWRITE|VEXEC|VAPPEND);
	if (accmode != 0)
	error = zfs_access(ap->a_vp, accmode, 0, ap->a_cred, NULL);

	/*
	* VADMIN has to be handled by vaccess().
	*/
	if (error == 0) {
	accmode = ap->a_accmode & ~(VREAD|VWRITE|VEXEC|VAPPEND);
	if (accmode != 0) {
	vnode_t *vp = ap->a_vp;
	znode_t *zp = VTOZ(vp);
	znode_phys_t *zphys = zp->z_phys;

	error = vaccess(vp->v_type, zphys->zp_mode,
	zphys->zp_uid, zphys->zp_gid, accmode, ap->a_cred,
	NULL);
	}
	}

	return (error);
	}

	static int
	zfs_freebsd_lookup(ap)
	struct vop_lookup_args /* {
	struct vnode *a_dvp;
	struct vnode **a_vpp;
	struct componentname *a_cnp;
	} */ *ap;
	{
	struct componentname *cnp = ap->a_cnp;
	char nm[NAME_MAX + 1];

	ASSERT(cnp->cn_namelen < sizeof(nm));
	strlcpy(nm, cnp->cn_nameptr, MIN(cnp->cn_namelen + 1, sizeof(nm)));

	return (zfs_lookup(ap->a_dvp, nm, ap->a_vpp, cnp, cnp->cn_nameiop,
	cnp->cn_cred, cnp->cn_thread, 0));
	}

	static int
	zfs_freebsd_create(ap)
	struct vop_create_args /* {
	struct vnode *a_dvp;
	struct vnode **a_vpp;
	struct componentname *a_cnp;
	struct vattr *a_vap;
	} */ *ap;
	{
	struct componentname *cnp = ap->a_cnp;
	vattr_t *vap = ap->a_vap;
	int mode;

	ASSERT(cnp->cn_flags & SAVENAME);

	vattr_init_mask(vap);
	mode = vap->va_mode & ALLPERMS;

	return (zfs_create(ap->a_dvp, cnp->cn_nameptr, vap, !EXCL, mode,
	ap->a_vpp, cnp->cn_cred, cnp->cn_thread));
	}

	static int
	zfs_freebsd_remove(ap)
	struct vop_remove_args /* {
	struct vnode *a_dvp;
	struct vnode *a_vp;
	struct componentname *a_cnp;
	} */ *ap;
	{

	ASSERT(ap->a_cnp->cn_flags & SAVENAME);

	return (zfs_remove(ap->a_dvp, ap->a_cnp->cn_nameptr,
	ap->a_cnp->cn_cred, NULL, 0));
	}

	static int
	zfs_freebsd_mkdir(ap)
	struct vop_mkdir_args /* {
	struct vnode *a_dvp;
	struct vnode **a_vpp;
	struct componentname *a_cnp;
	struct vattr *a_vap;
	} */ *ap;
	{
	vattr_t *vap = ap->a_vap;

	ASSERT(ap->a_cnp->cn_flags & SAVENAME);

	vattr_init_mask(vap);

	return (zfs_mkdir(ap->a_dvp, ap->a_cnp->cn_nameptr, vap, ap->a_vpp,
	ap->a_cnp->cn_cred, NULL, 0, NULL));
	}

	static int
	zfs_freebsd_rmdir(ap)
	struct vop_rmdir_args /* {
	struct vnode *a_dvp;
	struct vnode *a_vp;
	struct componentname *a_cnp;
	} */ *ap;
	{
	struct componentname *cnp = ap->a_cnp;

	ASSERT(cnp->cn_flags & SAVENAME);

	return (zfs_rmdir(ap->a_dvp, cnp->cn_nameptr, NULL, cnp->cn_cred, NULL, 0));
	}

	static int
	zfs_freebsd_readdir(ap)
	struct vop_readdir_args /* {
	struct vnode *a_vp;
	struct uio *a_uio;
	struct ucred *a_cred;
	int *a_eofflag;
	int *a_ncookies;
	u_long **a_cookies;
	} */ *ap;
	{

	return (zfs_readdir(ap->a_vp, ap->a_uio, ap->a_cred, ap->a_eofflag,
	ap->a_ncookies, ap->a_cookies));
	}

	static int
	zfs_freebsd_fsync(ap)
	struct vop_fsync_args /* {
	struct vnode *a_vp;
	int a_waitfor;
	struct thread *a_td;
	} */ *ap;
	{

	vop_stdfsync(ap);
	return (zfs_fsync(ap->a_vp, 0, ap->a_td->td_ucred, NULL));
	}

	static int
	zfs_freebsd_getattr(ap)
	struct vop_getattr_args /* {
	struct vnode *a_vp;
	struct vattr *a_vap;
	struct ucred *a_cred;
	struct thread *a_td;
	} */ *ap;
	{
	vattr_t *vap = ap->a_vap;
	xvattr_t xvap;
	u_long fflags = 0;
	int error;

	xva_init(&xvap);
	xvap.xva_vattr = *vap;
	xvap.xva_vattr.va_mask |= AT_XVATTR;

	/* Convert chflags into ZFS-type flags. */
	/* XXX: what about SF_SETTABLE?. */
	XVA_SET_REQ(&xvap, XAT_IMMUTABLE);
	XVA_SET_REQ(&xvap, XAT_APPENDONLY);
	XVA_SET_REQ(&xvap, XAT_NOUNLINK);
	XVA_SET_REQ(&xvap, XAT_NODUMP);
	error = zfs_getattr(ap->a_vp, (vattr_t *)&xvap, 0, ap->a_cred, NULL);
	if (error != 0)
	return (error);

	/* Convert ZFS xattr into chflags. */
	#define FLAG_CHECK(fflag, xflag, xfield) do { \
	if (XVA_ISSET_RTN(&xvap, (xflag)) && (xfield) != 0) \
	fflags |= (fflag); \
	} while (0)
	FLAG_CHECK(SF_IMMUTABLE, XAT_IMMUTABLE,
	xvap.xva_xoptattrs.xoa_immutable);
	FLAG_CHECK(SF_APPEND, XAT_APPENDONLY,
	xvap.xva_xoptattrs.xoa_appendonly);
	FLAG_CHECK(SF_NOUNLINK, XAT_NOUNLINK,
	xvap.xva_xoptattrs.xoa_nounlink);
	FLAG_CHECK(UF_NODUMP, XAT_NODUMP,
	xvap.xva_xoptattrs.xoa_nodump);
	#undef FLAG_CHECK
	*vap = xvap.xva_vattr;
	vap->va_flags = fflags;
	return (0);
	}

	static int
	zfs_freebsd_setattr(ap)
	struct vop_setattr_args /* {
	struct vnode *a_vp;
	struct vattr *a_vap;
	struct ucred *a_cred;
	struct thread *a_td;
	} */ *ap;
	{
	vnode_t *vp = ap->a_vp;
	vattr_t *vap = ap->a_vap;
	cred_t *cred = ap->a_cred;
	xvattr_t xvap;
	u_long fflags;
	uint64_t zflags;

	vattr_init_mask(vap);
	vap->va_mask &= ~AT_NOSET;

	xva_init(&xvap);
	xvap.xva_vattr = *vap;

	zflags = VTOZ(vp)->z_phys->zp_flags;

	if (vap->va_flags != VNOVAL) {
	zfsvfs_t *zfsvfs = VTOZ(vp)->z_zfsvfs;
	int error;

	if (zfsvfs->z_use_fuids == B_FALSE)
	return (EOPNOTSUPP);

	fflags = vap->va_flags;
	if ((fflags & ~(SF_IMMUTABLE|SF_APPEND|SF_NOUNLINK|UF_NODUMP)) != 0)
	return (EOPNOTSUPP);
	/*
	* Unprivileged processes are not permitted to unset system
	* flags, or modify flags if any system flags are set.
	* Privileged non-jail processes may not modify system flags
	* if securelevel > 0 and any existing system flags are set.
	* Privileged jail processes behave like privileged non-jail
	* processes if the security.jail.chflags_allowed sysctl is
	* non-zero; otherwise, they behave like unprivileged
	* processes.
	*/
	if (secpolicy_fs_owner(vp->v_mount, cred) == 0 ||
	priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0) == 0) {
	if (zflags &
	(ZFS_IMMUTABLE | ZFS_APPENDONLY | ZFS_NOUNLINK)) {
	error = securelevel_gt(cred, 0);
	if (error != 0)
	return (error);
	}
	} else {
	/*
	* Callers may only modify the file flags on objects they
	* have VADMIN rights for.
	*/
	if ((error = VOP_ACCESS(vp, VADMIN, cred, curthread)) != 0)
	return (error);
	if (zflags &
	(ZFS_IMMUTABLE | ZFS_APPENDONLY | ZFS_NOUNLINK)) {
	return (EPERM);
	}
	if (fflags &
	(SF_IMMUTABLE | SF_APPEND | SF_NOUNLINK)) {
	return (EPERM);
	}
	}

	#define FLAG_CHANGE(fflag, zflag, xflag, xfield) do { \
	if (((fflags & (fflag)) && !(zflags & (zflag))) || \
	((zflags & (zflag)) && !(fflags & (fflag)))) { \
	XVA_SET_REQ(&xvap, (xflag)); \
	(xfield) = ((fflags & (fflag)) != 0); \
	} \
	} while (0)
	/* Convert chflags into ZFS-type flags. */
	/* XXX: what about SF_SETTABLE?. */
	FLAG_CHANGE(SF_IMMUTABLE, ZFS_IMMUTABLE, XAT_IMMUTABLE,
	xvap.xva_xoptattrs.xoa_immutable);
	FLAG_CHANGE(SF_APPEND, ZFS_APPENDONLY, XAT_APPENDONLY,
	xvap.xva_xoptattrs.xoa_appendonly);
	FLAG_CHANGE(SF_NOUNLINK, ZFS_NOUNLINK, XAT_NOUNLINK,
	xvap.xva_xoptattrs.xoa_nounlink);
	FLAG_CHANGE(UF_NODUMP, ZFS_NODUMP, XAT_NODUMP,
	xvap.xva_xoptattrs.xoa_nodump);
	#undef FLAG_CHANGE
	}
	return (zfs_setattr(vp, (vattr_t *)&xvap, 0, cred, NULL));
	}

	static int
	zfs_freebsd_rename(ap)
	struct vop_rename_args /* {
	struct vnode *a_fdvp;
	struct vnode *a_fvp;
	struct componentname *a_fcnp;
	struct vnode *a_tdvp;
	struct vnode *a_tvp;
	struct componentname *a_tcnp;
	} */ *ap;
	{
	vnode_t *fdvp = ap->a_fdvp;
	vnode_t *fvp = ap->a_fvp;
	vnode_t *tdvp = ap->a_tdvp;
	vnode_t *tvp = ap->a_tvp;
	int error;

	ASSERT(ap->a_fcnp->cn_flags & (SAVENAME|SAVESTART));
	ASSERT(ap->a_tcnp->cn_flags & (SAVENAME|SAVESTART));

	error = zfs_rename(fdvp, ap->a_fcnp->cn_nameptr, tdvp,
	ap->a_tcnp->cn_nameptr, ap->a_fcnp->cn_cred, NULL, 0);

	if (tdvp == tvp)
	VN_RELE(tdvp);
	else
	VN_URELE(tdvp);
	if (tvp)
	VN_URELE(tvp);
	VN_RELE(fdvp);
	VN_RELE(fvp);

	return (error);
	}

	static int
	zfs_freebsd_symlink(ap)
	struct vop_symlink_args /* {
	struct vnode *a_dvp;
	struct vnode **a_vpp;
	struct componentname *a_cnp;
	struct vattr *a_vap;
	char *a_target;
	} */ *ap;
	{
	struct componentname *cnp = ap->a_cnp;
	vattr_t *vap = ap->a_vap;

	ASSERT(cnp->cn_flags & SAVENAME);

	vap->va_type = VLNK; /* FreeBSD: Syscall only sets va_mode. */
	vattr_init_mask(vap);

	return (zfs_symlink(ap->a_dvp, ap->a_vpp, cnp->cn_nameptr, vap,
	ap->a_target, cnp->cn_cred, cnp->cn_thread));
	}

	static int
	zfs_freebsd_readlink(ap)
	struct vop_readlink_args /* {
	struct vnode *a_vp;
	struct uio *a_uio;
	struct ucred *a_cred;
	} */ *ap;
	{

	return (zfs_readlink(ap->a_vp, ap->a_uio, ap->a_cred, NULL));
	}

	static int
	zfs_freebsd_link(ap)
	struct vop_link_args /* {
	struct vnode *a_tdvp;
	struct vnode *a_vp;
	struct componentname *a_cnp;
	} */ *ap;
	{
	struct componentname *cnp = ap->a_cnp;

	ASSERT(cnp->cn_flags & SAVENAME);

	return (zfs_link(ap->a_tdvp, ap->a_vp, cnp->cn_nameptr, cnp->cn_cred, NULL, 0));
	}

	static int
	zfs_freebsd_inactive(ap)
	struct vop_inactive_args /* {
	struct vnode *a_vp;
	struct thread *a_td;
	} */ *ap;
	{
	vnode_t *vp = ap->a_vp;

	zfs_inactive(vp, ap->a_td->td_ucred, NULL);
	return (0);
	}

	static void
	zfs_reclaim_complete(void *arg, int pending)
	{
	znode_t *zp = arg;
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;

	rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
	if (zp->z_dbuf != NULL) {
	ZFS_OBJ_HOLD_ENTER(zfsvfs, zp->z_id);
	zfs_znode_dmu_fini(zp);
	ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
	}
	zfs_znode_free(zp);
	rw_exit(&zfsvfs->z_teardown_inactive_lock);
	/*
	* If the file system is being unmounted, there is a process waiting
	* for us, wake it up.
	*/
	if (zfsvfs->z_unmounted)
	wakeup_one(zfsvfs);
	}

	static int
	zfs_freebsd_reclaim(ap)
	struct vop_reclaim_args /* {
	struct vnode *a_vp;
	struct thread *a_td;
	} */ *ap;
	{
	vnode_t *vp = ap->a_vp;
	znode_t *zp = VTOZ(vp);
	zfsvfs_t *zfsvfs = zp->z_zfsvfs;

	rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);

	ASSERT(zp != NULL);

	/*
	* Destroy the vm object and flush associated pages.
	*/
	vnode_destroy_vobject(vp);

	mutex_enter(&zp->z_lock);
	ASSERT(zp->z_phys != NULL);
	zp->z_vnode = NULL;
	mutex_exit(&zp->z_lock);

	if (zp->z_unlinked)
	; /* Do nothing. */
	else if (zp->z_dbuf == NULL)
	zfs_znode_free(zp);
	else /* if (!zp->z_unlinked && zp->z_dbuf != NULL) */ {
	int locked;

	locked = MUTEX_HELD(ZFS_OBJ_MUTEX(zfsvfs, zp->z_id)) ? 2 :
	ZFS_OBJ_HOLD_TRYENTER(zfsvfs, zp->z_id);
	if (locked == 0) {
	/*
	* Lock can't be obtained due to deadlock possibility,
	* so defer znode destruction.
	*/
	TASK_INIT(&zp->z_task, 0, zfs_reclaim_complete, zp);
	taskqueue_enqueue(taskqueue_thread, &zp->z_task);
	} else {
	zfs_znode_dmu_fini(zp);
	if (locked == 1)
	ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
	zfs_znode_free(zp);
	}
	}
	VI_LOCK(vp);
	vp->v_data = NULL;
	ASSERT(vp->v_holdcnt >= 1);
	VI_UNLOCK(vp);
	rw_exit(&zfsvfs->z_teardown_inactive_lock);
	return (0);
	}

	static int
	zfs_freebsd_fid(ap)
	struct vop_fid_args /* {
	struct vnode *a_vp;
	struct fid *a_fid;
	} */ *ap;
	{

	return (zfs_fid(ap->a_vp, (void *)ap->a_fid, NULL));
	}

	static int
	zfs_freebsd_pathconf(ap)
	struct vop_pathconf_args /* {
	struct vnode *a_vp;
	int a_name;
	register_t *a_retval;
	} */ *ap;
	{
	ulong_t val;
	int error;

	error = zfs_pathconf(ap->a_vp, ap->a_name, &val, curthread->td_ucred, NULL);
	if (error == 0)
	*ap->a_retval = val;
	else if (error == EOPNOTSUPP)
	error = vop_stdpathconf(ap);
	return (error);
	}

	static int
	zfs_freebsd_fifo_pathconf(ap)
	struct vop_pathconf_args /* {
	struct vnode *a_vp;
	int a_name;
	register_t *a_retval;
	} */ *ap;
	{

	switch (ap->a_name) {
	case _PC_ACL_EXTENDED:
	case _PC_ACL_NFS4:
	case _PC_ACL_PATH_MAX:
	case _PC_MAC_PRESENT:
	return (zfs_freebsd_pathconf(ap));
	default:
	return (fifo_specops.vop_pathconf(ap));
	}
	}

	/*
	* FreeBSD's extended attribute namespace determines the file name prefix
	* used for the ZFS extended attribute name:
	*
	* NAMESPACE PREFIX
	* system freebsd:system:
	* user (none, can be used to access ZFS fsattr(5) attributes
	* created on Solaris)
	*/
	static int
	zfs_create_attrname(int attrnamespace, const char *name, char *attrname,
	size_t size)
	{
	const char *namespace, *prefix, *suffix;

	/* We don't allow '/' character in attribute name. */
	if (strchr(name, '/') != NULL)
	return (EINVAL);
	/* We don't allow attribute names that start with "freebsd:" string. */
	if (strncmp(name, "freebsd:", 8) == 0)
	return (EINVAL);

	bzero(attrname, size);

	switch (attrnamespace) {
	case EXTATTR_NAMESPACE_USER:
	#if 0
	prefix = "freebsd:";
	namespace = EXTATTR_NAMESPACE_USER_STRING;
	suffix = ":";
	#else
	/*
	* This is the default namespace by which we can access all
	* attributes created on Solaris.
	*/
	prefix = namespace = suffix = "";
	#endif
	break;
	case EXTATTR_NAMESPACE_SYSTEM:
	prefix = "freebsd:";
	namespace = EXTATTR_NAMESPACE_SYSTEM_STRING;
	suffix = ":";
	break;
	case EXTATTR_NAMESPACE_EMPTY:
	default:
	return (EINVAL);
	}
	if (snprintf(attrname, size, "%s%s%s%s", prefix, namespace, suffix,
	name) >= size) {
	return (ENAMETOOLONG);
	}
	return (0);
	}

	/*
	* Vnode operation to retrieve a named extended attribute.
	*/
	static int
	zfs_getextattr(struct vop_getextattr_args *ap)
	/*
	vop_getextattr {
	IN struct vnode *a_vp;
	IN int a_attrnamespace;
	IN const char *a_name;
	INOUT struct uio *a_uio;
	OUT size_t *a_size;
	IN struct ucred *a_cred;
	IN struct thread *a_td;
	};
	*/
	{
	zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
	struct thread *td = ap->a_td;
	struct nameidata nd;
	char attrname[255];
	struct vattr va;
	vnode_t *xvp = NULL, *vp;
	int error, flags;

	error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
	ap->a_cred, ap->a_td, VREAD);
	if (error != 0)
	return (error);

	error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
	sizeof(attrname));
	if (error != 0)
	return (error);

	ZFS_ENTER(zfsvfs);

	error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
	LOOKUP_XATTR);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	flags = FREAD;
	NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | MPSAFE, UIO_SYSSPACE, attrname,
	xvp, td);
	error = vn_open_cred(&nd, &flags, 0, 0, ap->a_cred, NULL);
	vp = nd.ni_vp;
	NDFREE(&nd, NDF_ONLY_PNBUF);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	if (error == ENOENT)
	error = ENOATTR;
	return (error);
	}

	if (ap->a_size != NULL) {
	error = VOP_GETATTR(vp, &va, ap->a_cred);
	if (error == 0)
	*ap->a_size = (size_t)va.va_size;
	} else if (ap->a_uio != NULL)
	error = VOP_READ(vp, ap->a_uio, IO_UNIT | IO_SYNC, ap->a_cred);

	VOP_UNLOCK(vp, 0);
	vn_close(vp, flags, ap->a_cred, td);
	ZFS_EXIT(zfsvfs);

	return (error);
	}

	/*
	* Vnode operation to remove a named attribute.
	*/
	int
	zfs_deleteextattr(struct vop_deleteextattr_args *ap)
	/*
	vop_deleteextattr {
	IN struct vnode *a_vp;
	IN int a_attrnamespace;
	IN const char *a_name;
	IN struct ucred *a_cred;
	IN struct thread *a_td;
	};
	*/
	{
	zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
	struct thread *td = ap->a_td;
	struct nameidata nd;
	char attrname[255];
	struct vattr va;
	vnode_t *xvp = NULL, *vp;
	int error, flags;

	error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
	ap->a_cred, ap->a_td, VWRITE);
	if (error != 0)
	return (error);

	error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
	sizeof(attrname));
	if (error != 0)
	return (error);

	ZFS_ENTER(zfsvfs);

	error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
	LOOKUP_XATTR);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	NDINIT_ATVP(&nd, DELETE, NOFOLLOW | LOCKPARENT | LOCKLEAF | MPSAFE,
	UIO_SYSSPACE, attrname, xvp, td);
	error = namei(&nd);
	vp = nd.ni_vp;
	NDFREE(&nd, NDF_ONLY_PNBUF);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	if (error == ENOENT)
	error = ENOATTR;
	return (error);
	}
	error = VOP_REMOVE(nd.ni_dvp, vp, &nd.ni_cnd);

	vput(nd.ni_dvp);
	if (vp == nd.ni_dvp)
	vrele(vp);
	else
	vput(vp);
	ZFS_EXIT(zfsvfs);

	return (error);
	}

	/*
	* Vnode operation to set a named attribute.
	*/
	static int
	zfs_setextattr(struct vop_setextattr_args *ap)
	/*
	vop_setextattr {
	IN struct vnode *a_vp;
	IN int a_attrnamespace;
	IN const char *a_name;
	INOUT struct uio *a_uio;
	IN struct ucred *a_cred;
	IN struct thread *a_td;
	};
	*/
	{
	zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
	struct thread *td = ap->a_td;
	struct nameidata nd;
	char attrname[255];
	struct vattr va;
	vnode_t *xvp = NULL, *vp;
	int error, flags;

	error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
	ap->a_cred, ap->a_td, VWRITE);
	if (error != 0)
	return (error);

	error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
	sizeof(attrname));
	if (error != 0)
	return (error);

	ZFS_ENTER(zfsvfs);

	error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
	LOOKUP_XATTR | CREATE_XATTR_DIR);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	flags = FFLAGS(O_WRONLY | O_CREAT);
	NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | MPSAFE, UIO_SYSSPACE, attrname,
	xvp, td);
	error = vn_open_cred(&nd, &flags, 0600, 0, ap->a_cred, NULL);
	vp = nd.ni_vp;
	NDFREE(&nd, NDF_ONLY_PNBUF);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	VATTR_NULL(&va);
	va.va_size = 0;
	error = VOP_SETATTR(vp, &va, ap->a_cred);
	if (error == 0)
	VOP_WRITE(vp, ap->a_uio, IO_UNIT | IO_SYNC, ap->a_cred);

	VOP_UNLOCK(vp, 0);
	vn_close(vp, flags, ap->a_cred, td);
	ZFS_EXIT(zfsvfs);

	return (error);
	}

	/*
	* Vnode operation to retrieve extended attributes on a vnode.
	*/
	static int
	zfs_listextattr(struct vop_listextattr_args *ap)
	/*
	vop_listextattr {
	IN struct vnode *a_vp;
	IN int a_attrnamespace;
	INOUT struct uio *a_uio;
	OUT size_t *a_size;
	IN struct ucred *a_cred;
	IN struct thread *a_td;
	};
	*/
	{
	zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
	struct thread *td = ap->a_td;
	struct nameidata nd;
	char attrprefix[16];
	u_char dirbuf[sizeof(struct dirent)];
	struct dirent *dp;
	struct iovec aiov;
	struct uio auio, *uio = ap->a_uio;
	size_t *sizep = ap->a_size;
	size_t plen;
	vnode_t *xvp = NULL, *vp;
	int done, error, eof, pos;

	error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
	ap->a_cred, ap->a_td, VREAD);
	if (error != 0)
	return (error);

	error = zfs_create_attrname(ap->a_attrnamespace, "", attrprefix,
	sizeof(attrprefix));
	if (error != 0)
	return (error);
	plen = strlen(attrprefix);

	ZFS_ENTER(zfsvfs);

	if (sizep != NULL)
	*sizep = 0;

	error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
	LOOKUP_XATTR);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	/*
	* ENOATTR means that the EA directory does not yet exist,
	* i.e. there are no extended attributes there.
	*/
	if (error == ENOATTR)
	error = 0;
	return (error);
	}

	NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | LOCKLEAF | LOCKSHARED | MPSAFE,
	UIO_SYSSPACE, ".", xvp, td);
	error = namei(&nd);
	vp = nd.ni_vp;
	NDFREE(&nd, NDF_ONLY_PNBUF);
	if (error != 0) {
	ZFS_EXIT(zfsvfs);
	return (error);
	}

	auio.uio_iov = &aiov;
	auio.uio_iovcnt = 1;
	auio.uio_segflg = UIO_SYSSPACE;
	auio.uio_td = td;
	auio.uio_rw = UIO_READ;
	auio.uio_offset = 0;

	do {
	u_char nlen;

	aiov.iov_base = (void *)dirbuf;
	aiov.iov_len = sizeof(dirbuf);
	auio.uio_resid = sizeof(dirbuf);
	error = VOP_READDIR(vp, &auio, ap->a_cred, &eof, NULL, NULL);
	done = sizeof(dirbuf) - auio.uio_resid;
	if (error != 0)
	break;
	for (pos = 0; pos < done;) {
	dp = (struct dirent *)(dirbuf + pos);
	pos += dp->d_reclen;
	/*
	* XXX: Temporarily we also accept DT_UNKNOWN, as this
	* is what we get when attribute was created on Solaris.
	*/
	if (dp->d_type != DT_REG && dp->d_type != DT_UNKNOWN)
	continue;
	if (plen == 0 && strncmp(dp->d_name, "freebsd:", 8) == 0)
	continue;
	else if (strncmp(dp->d_name, attrprefix, plen) != 0)
	continue;
	nlen = dp->d_namlen - plen;
	if (sizep != NULL)
	*sizep += 1 + nlen;
	else if (uio != NULL) {
	/*
	* Format of extattr name entry is one byte for
	* length and the rest for name.
	*/
	error = uiomove(&nlen, 1, uio->uio_rw, uio);
	if (error == 0) {
	error = uiomove(dp->d_name + plen, nlen,
	uio->uio_rw, uio);
	}
	if (error != 0)
	break;
	}
	}
	} while (!eof && error == 0);

	vput(vp);
	ZFS_EXIT(zfsvfs);

	return (error);
	}

	int
	zfs_freebsd_getacl(ap)
	struct vop_getacl_args /* {
	struct vnode *vp;
	acl_type_t type;
	struct acl *aclp;
	struct ucred *cred;
	struct thread *td;
	} */ *ap;
	{
	int error;
	vsecattr_t vsecattr;

	if (ap->a_type != ACL_TYPE_NFS4)
	return (EINVAL);

	vsecattr.vsa_mask = VSA_ACE | VSA_ACECNT;
	if (error = zfs_getsecattr(ap->a_vp, &vsecattr, 0, ap->a_cred, NULL))
	return (error);

	error = acl_from_aces(ap->a_aclp, vsecattr.vsa_aclentp, vsecattr.vsa_aclcnt);
	if (vsecattr.vsa_aclentp != NULL)
	kmem_free(vsecattr.vsa_aclentp, vsecattr.vsa_aclentsz);

	return (error);
	}

	int
	zfs_freebsd_setacl(ap)
	struct vop_setacl_args /* {
	struct vnode *vp;
	acl_type_t type;
	struct acl *aclp;
	struct ucred *cred;
	struct thread *td;
	} */ *ap;
	{
	int error;
	vsecattr_t vsecattr;
	int aclbsize; /* size of acl list in bytes */
	aclent_t *aaclp;

	if (ap->a_type != ACL_TYPE_NFS4)
	return (EINVAL);

	if (ap->a_aclp->acl_cnt < 1 || ap->a_aclp->acl_cnt > MAX_ACL_ENTRIES)
	return (EINVAL);

	/*
	* With NFSv4 ACLs, chmod(2) may need to add additional entries,
	* splitting every entry into two and appending "canonical six"
	* entries at the end. Don't allow for setting an ACL that would
	* cause chmod(2) to run out of ACL entries.
	*/
	if (ap->a_aclp->acl_cnt * 2 + 6 > ACL_MAX_ENTRIES)
	return (ENOSPC);

	error = acl_nfs4_check(ap->a_aclp, ap->a_vp->v_type == VDIR);
	if (error != 0)
	return (error);

	vsecattr.vsa_mask = VSA_ACE;
	aclbsize = ap->a_aclp->acl_cnt * sizeof(ace_t);
	vsecattr.vsa_aclentp = kmem_alloc(aclbsize, KM_SLEEP);
	aaclp = vsecattr.vsa_aclentp;
	vsecattr.vsa_aclentsz = aclbsize;

	aces_from_acl(vsecattr.vsa_aclentp, &vsecattr.vsa_aclcnt, ap->a_aclp);
	error = zfs_setsecattr(ap->a_vp, &vsecattr, 0, ap->a_cred, NULL);
	kmem_free(aaclp, aclbsize);

	return (error);
	}

	int
	zfs_freebsd_aclcheck(ap)
	struct vop_aclcheck_args /* {
	struct vnode *vp;
	acl_type_t type;
	struct acl *aclp;
	struct ucred *cred;
	struct thread *td;
	} */ *ap;
	{

	return (EOPNOTSUPP);
	}

	struct vop_vector zfs_vnodeops;
	struct vop_vector zfs_fifoops;

	struct vop_vector zfs_vnodeops = {
	.vop_default = &default_vnodeops,
	.vop_inactive = zfs_freebsd_inactive,
	.vop_reclaim = zfs_freebsd_reclaim,
	.vop_access = zfs_freebsd_access,
	#ifdef FREEBSD_NAMECACHE
	.vop_lookup = vfs_cache_lookup,
	.vop_cachedlookup = zfs_freebsd_lookup,
	#else
	.vop_lookup = zfs_freebsd_lookup,
	#endif
	.vop_getattr = zfs_freebsd_getattr,
	.vop_setattr = zfs_freebsd_setattr,
	.vop_create = zfs_freebsd_create,
	.vop_mknod = zfs_freebsd_create,
	.vop_mkdir = zfs_freebsd_mkdir,
	.vop_readdir = zfs_freebsd_readdir,
	.vop_fsync = zfs_freebsd_fsync,
	.vop_open = zfs_freebsd_open,
	.vop_close = zfs_freebsd_close,
	.vop_rmdir = zfs_freebsd_rmdir,
	.vop_ioctl = zfs_freebsd_ioctl,
	.vop_link = zfs_freebsd_link,
	.vop_symlink = zfs_freebsd_symlink,
	.vop_readlink = zfs_freebsd_readlink,
	.vop_read = zfs_freebsd_read,
	.vop_write = zfs_freebsd_write,
	.vop_remove = zfs_freebsd_remove,
	.vop_rename = zfs_freebsd_rename,
	.vop_pathconf = zfs_freebsd_pathconf,
	.vop_bmap = VOP_EOPNOTSUPP,
	.vop_fid = zfs_freebsd_fid,
	.vop_getextattr = zfs_getextattr,
	.vop_deleteextattr = zfs_deleteextattr,
	.vop_setextattr = zfs_setextattr,
	.vop_listextattr = zfs_listextattr,
	.vop_getacl = zfs_freebsd_getacl,
	.vop_setacl = zfs_freebsd_setacl,
	.vop_aclcheck = zfs_freebsd_aclcheck,
	};

	struct vop_vector zfs_fifoops = {
	.vop_default = &fifo_specops,
	.vop_fsync = zfs_freebsd_fsync,
	.vop_access = zfs_freebsd_access,
	.vop_getattr = zfs_freebsd_getattr,
	.vop_inactive = zfs_freebsd_inactive,
	.vop_read = VOP_PANIC,
	.vop_reclaim = zfs_freebsd_reclaim,
	.vop_setattr = zfs_freebsd_setattr,
	.vop_write = VOP_PANIC,
	.vop_pathconf = zfs_freebsd_fifo_pathconf,
	.vop_fid = zfs_freebsd_fid,
	.vop_getacl = zfs_freebsd_getacl,
	.vop_setacl = zfs_freebsd_setacl,
	.vop_aclcheck = zfs_freebsd_aclcheck,
	};
	Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c (revision 209274)
	@@ -1,2276 +1,2276 @@
	/*
	* CDDL HEADER START
	*
	* The contents of this file are subject to the terms of the
	* Common Development and Distribution License (the "License").
	* You may not use this file except in compliance with the License.
	*
	* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
	* or http://www.opensolaris.org/os/licensing.
	* See the License for the specific language governing permissions
	* and limitations under the License.
	*
	* When distributing Covered Code, include this CDDL HEADER in each
	* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
	* If applicable, add the following below this CDDL HEADER, with the
	* fields enclosed by brackets "[]" replaced with your own identifying
	* information: Portions Copyright [yyyy] [name of copyright owner]
	*
	* CDDL HEADER END
	*/
	/*
	* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
	* Use is subject to license terms.
	*/

	#include <sys/zfs_context.h>
	#include <sys/fm/fs/zfs.h>
	#include <sys/spa.h>
	#include <sys/txg.h>
	#include <sys/spa_impl.h>
	#include <sys/vdev_impl.h>
	#include <sys/zio_impl.h>
	#include <sys/zio_compress.h>
	#include <sys/zio_checksum.h>

	SYSCTL_DECL(_vfs_zfs);
	SYSCTL_NODE(_vfs_zfs, OID_AUTO, zio, CTLFLAG_RW, 0, "ZFS ZIO");
	static int zio_use_uma = 0;
	TUNABLE_INT("vfs.zfs.zio.use_uma", &zio_use_uma);
	SYSCTL_INT(_vfs_zfs_zio, OID_AUTO, use_uma, CTLFLAG_RDTUN, &zio_use_uma, 0,
	"Use uma(9) for ZIO allocations");

	/*
	* ==========================================================================
	* I/O priority table
	* ==========================================================================
	*/
	uint8_t zio_priority_table[ZIO_PRIORITY_TABLE_SIZE] = {
	0, /* ZIO_PRIORITY_NOW */
	0, /* ZIO_PRIORITY_SYNC_READ */
	0, /* ZIO_PRIORITY_SYNC_WRITE */
	6, /* ZIO_PRIORITY_ASYNC_READ */
	4, /* ZIO_PRIORITY_ASYNC_WRITE */
	4, /* ZIO_PRIORITY_FREE */
	0, /* ZIO_PRIORITY_CACHE_FILL */
	0, /* ZIO_PRIORITY_LOG_WRITE */
	10, /* ZIO_PRIORITY_RESILVER */
	20, /* ZIO_PRIORITY_SCRUB */
	};

	/*
	* ==========================================================================
	* I/O type descriptions
	* ==========================================================================
	*/
	char *zio_type_name[ZIO_TYPES] = {
	"null", "read", "write", "free", "claim", "ioctl" };

	#define SYNC_PASS_DEFERRED_FREE 1 /* defer frees after this pass */
	#define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
	#define SYNC_PASS_REWRITE 1 /* rewrite new bps after this pass */

	/*
	* ==========================================================================
	* I/O kmem caches
	* ==========================================================================
	*/
	kmem_cache_t *zio_cache;
	kmem_cache_t *zio_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
	kmem_cache_t *zio_data_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];

	#ifdef _KERNEL
	extern vmem_t *zio_alloc_arena;
	#endif

	/*
	* An allocating zio is one that either currently has the DVA allocate
	* stage set or will have it later in its lifetime.
	*/
	#define IO_IS_ALLOCATING(zio) \
	((zio)->io_orig_pipeline & (1U << ZIO_STAGE_DVA_ALLOCATE))

	void
	zio_init(void)
	{
	size_t c;
	zio_cache = kmem_cache_create("zio_cache", sizeof (zio_t), 0,
	NULL, NULL, NULL, NULL, NULL, 0);

	/*
	* For small buffers, we want a cache for each multiple of
	* SPA_MINBLOCKSIZE. For medium-size buffers, we want a cache
	* for each quarter-power of 2. For large buffers, we want
	* a cache for each multiple of PAGESIZE.
	*/
	for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
	size_t size = (c + 1) << SPA_MINBLOCKSHIFT;
	size_t p2 = size;
	size_t align = 0;

	while (p2 & (p2 - 1))
	p2 &= p2 - 1;

	if (size <= 4 * SPA_MINBLOCKSIZE) {
	align = SPA_MINBLOCKSIZE;
	} else if (P2PHASE(size, PAGESIZE) == 0) {
	align = PAGESIZE;
	} else if (P2PHASE(size, p2 >> 2) == 0) {
	align = p2 >> 2;
	}

	if (align != 0) {
	char name[36];
	(void) sprintf(name, "zio_buf_%lu", (ulong_t)size);
	zio_buf_cache[c] = kmem_cache_create(name, size,
	align, NULL, NULL, NULL, NULL, NULL, KMC_NODEBUG);

	(void) sprintf(name, "zio_data_buf_%lu", (ulong_t)size);
	zio_data_buf_cache[c] = kmem_cache_create(name, size,
	align, NULL, NULL, NULL, NULL, NULL, KMC_NODEBUG);
	}
	}

	while (--c != 0) {
	ASSERT(zio_buf_cache[c] != NULL);
	if (zio_buf_cache[c - 1] == NULL)
	zio_buf_cache[c - 1] = zio_buf_cache[c];

	ASSERT(zio_data_buf_cache[c] != NULL);
	if (zio_data_buf_cache[c - 1] == NULL)
	zio_data_buf_cache[c - 1] = zio_data_buf_cache[c];
	}

	zio_inject_init();
	}

	void
	zio_fini(void)
	{
	size_t c;
	kmem_cache_t *last_cache = NULL;
	kmem_cache_t *last_data_cache = NULL;

	for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
	if (zio_buf_cache[c] != last_cache) {
	last_cache = zio_buf_cache[c];
	kmem_cache_destroy(zio_buf_cache[c]);
	}
	zio_buf_cache[c] = NULL;

	if (zio_data_buf_cache[c] != last_data_cache) {
	last_data_cache = zio_data_buf_cache[c];
	kmem_cache_destroy(zio_data_buf_cache[c]);
	}
	zio_data_buf_cache[c] = NULL;
	}

	kmem_cache_destroy(zio_cache);

	zio_inject_fini();
	}

	/*
	* ==========================================================================
	* Allocate and free I/O buffers
	* ==========================================================================
	*/

	/*
	* Use zio_buf_alloc to allocate ZFS metadata. This data will appear in a
	* crashdump if the kernel panics, so use it judiciously. Obviously, it's
	* useful to inspect ZFS metadata, but if possible, we should avoid keeping
	* excess / transient data in-core during a crashdump.
	*/
	void *
	zio_buf_alloc(size_t size)
	{
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

	if (zio_use_uma)
	return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
	else
	return (kmem_alloc(size, KM_SLEEP));
	}

	/*
	* Use zio_data_buf_alloc to allocate data. The data will not appear in a
	* crashdump if the kernel panics. This exists so that we will limit the amount
	* of ZFS data that shows up in a kernel crashdump. (Thus reducing the amount
	* of kernel heap dumped to disk when the kernel panics)
	*/
	void *
	zio_data_buf_alloc(size_t size)
	{
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

	if (zio_use_uma)
	return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
	else
	return (kmem_alloc(size, KM_SLEEP));
	}

	void
	zio_buf_free(void *buf, size_t size)
	{
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

	if (zio_use_uma)
	kmem_cache_free(zio_buf_cache[c], buf);
	else
	kmem_free(buf, size);
	}

	void
	zio_data_buf_free(void *buf, size_t size)
	{
	size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;

	ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);

	if (zio_use_uma)
	kmem_cache_free(zio_data_buf_cache[c], buf);
	else
	kmem_free(buf, size);
	}

	/*
	* ==========================================================================
	* Push and pop I/O transform buffers
	* ==========================================================================
	*/
	static void
	zio_push_transform(zio_t *zio, void *data, uint64_t size, uint64_t bufsize,
	zio_transform_func_t *transform)
	{
	zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);

	zt->zt_orig_data = zio->io_data;
	zt->zt_orig_size = zio->io_size;
	zt->zt_bufsize = bufsize;
	zt->zt_transform = transform;

	zt->zt_next = zio->io_transform_stack;
	zio->io_transform_stack = zt;

	zio->io_data = data;
	zio->io_size = size;
	}

	static void
	zio_pop_transforms(zio_t *zio)
	{
	zio_transform_t *zt;

	while ((zt = zio->io_transform_stack) != NULL) {
	if (zt->zt_transform != NULL)
	zt->zt_transform(zio,
	zt->zt_orig_data, zt->zt_orig_size);

	zio_buf_free(zio->io_data, zt->zt_bufsize);

	zio->io_data = zt->zt_orig_data;
	zio->io_size = zt->zt_orig_size;
	zio->io_transform_stack = zt->zt_next;

	kmem_free(zt, sizeof (zio_transform_t));
	}
	}

	/*
	* ==========================================================================
	* I/O transform callbacks for subblocks and decompression
	* ==========================================================================
	*/
	static void
	zio_subblock(zio_t *zio, void *data, uint64_t size)
	{
	ASSERT(zio->io_size > size);

	if (zio->io_type == ZIO_TYPE_READ)
	bcopy(zio->io_data, data, size);
	}

	static void
	zio_decompress(zio_t *zio, void *data, uint64_t size)
	{
	if (zio->io_error == 0 &&
	zio_decompress_data(BP_GET_COMPRESS(zio->io_bp),
	zio->io_data, zio->io_size, data, size) != 0)
	zio->io_error = EIO;
	}

	/*
	* ==========================================================================
	* I/O parent/child relationships and pipeline interlocks
	* ==========================================================================
	*/

	static void
	zio_add_child(zio_t *pio, zio_t *zio)
	{
	mutex_enter(&pio->io_lock);
	if (zio->io_stage < ZIO_STAGE_READY)
	pio->io_children[zio->io_child_type][ZIO_WAIT_READY]++;
	if (zio->io_stage < ZIO_STAGE_DONE)
	pio->io_children[zio->io_child_type][ZIO_WAIT_DONE]++;
	zio->io_sibling_prev = NULL;
	zio->io_sibling_next = pio->io_child;
	if (pio->io_child != NULL)
	pio->io_child->io_sibling_prev = zio;
	pio->io_child = zio;
	zio->io_parent = pio;
	mutex_exit(&pio->io_lock);
	}

	static void
	zio_remove_child(zio_t *pio, zio_t *zio)
	{
	zio_t *next, *prev;

	ASSERT(zio->io_parent == pio);

	mutex_enter(&pio->io_lock);
	next = zio->io_sibling_next;
	prev = zio->io_sibling_prev;
	if (next != NULL)
	next->io_sibling_prev = prev;
	if (prev != NULL)
	prev->io_sibling_next = next;
	if (pio->io_child == zio)
	pio->io_child = next;
	mutex_exit(&pio->io_lock);
	}

	static boolean_t
	zio_wait_for_children(zio_t *zio, enum zio_child child, enum zio_wait_type wait)
	{
	uint64_t *countp = &zio->io_children[child][wait];
	boolean_t waiting = B_FALSE;

	mutex_enter(&zio->io_lock);
	ASSERT(zio->io_stall == NULL);
	if (*countp != 0) {
	zio->io_stage--;
	zio->io_stall = countp;
	waiting = B_TRUE;
	}
	mutex_exit(&zio->io_lock);

	return (waiting);
	}

	static void
	zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
	{
	uint64_t *countp = &pio->io_children[zio->io_child_type][wait];
	int *errorp = &pio->io_child_error[zio->io_child_type];

	mutex_enter(&pio->io_lock);
	if (zio->io_error && !(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE))
	*errorp = zio_worst_error(*errorp, zio->io_error);
	pio->io_reexecute |= zio->io_reexecute;
	ASSERT3U(*countp, >, 0);
	if (--*countp == 0 && pio->io_stall == countp) {
	pio->io_stall = NULL;
	mutex_exit(&pio->io_lock);
	zio_execute(pio);
	} else {
	mutex_exit(&pio->io_lock);
	}
	}

	static void
	zio_inherit_child_errors(zio_t *zio, enum zio_child c)
	{
	if (zio->io_child_error[c] != 0 && zio->io_error == 0)
	zio->io_error = zio->io_child_error[c];
	}

	/*
	* ==========================================================================
	* Create the various types of I/O (read, write, free, etc)
	* ==========================================================================
	*/
	static zio_t *
	zio_create(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
	void *data, uint64_t size, zio_done_func_t *done, void *private,
	zio_type_t type, int priority, int flags, vdev_t *vd, uint64_t offset,
	const zbookmark_t *zb, uint8_t stage, uint32_t pipeline)
	{
	zio_t *zio;

	ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
	ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
	ASSERT(P2PHASE(offset, SPA_MINBLOCKSIZE) == 0);

	ASSERT(!vd || spa_config_held(spa, SCL_STATE_ALL, RW_READER));
	ASSERT(!bp || !(flags & ZIO_FLAG_CONFIG_WRITER));
	ASSERT(vd || stage == ZIO_STAGE_OPEN);

	zio = kmem_cache_alloc(zio_cache, KM_SLEEP);
	bzero(zio, sizeof (zio_t));

	mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);
	cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);

	if (vd != NULL)
	zio->io_child_type = ZIO_CHILD_VDEV;
	else if (flags & ZIO_FLAG_GANG_CHILD)
	zio->io_child_type = ZIO_CHILD_GANG;
	else
	zio->io_child_type = ZIO_CHILD_LOGICAL;

	if (bp != NULL) {
	zio->io_bp = bp;
	zio->io_bp_copy = *bp;
	zio->io_bp_orig = *bp;
	if (type != ZIO_TYPE_WRITE)
	zio->io_bp = &zio->io_bp_copy; /* so caller can free */
	if (zio->io_child_type == ZIO_CHILD_LOGICAL) {
	if (BP_IS_GANG(bp))
	pipeline |= ZIO_GANG_STAGES;
	zio->io_logical = zio;
	}
	}

	zio->io_spa = spa;
	zio->io_txg = txg;
	zio->io_data = data;
	zio->io_size = size;
	zio->io_done = done;
	zio->io_private = private;
	zio->io_type = type;
	zio->io_priority = priority;
	zio->io_vd = vd;
	zio->io_offset = offset;
	zio->io_orig_flags = zio->io_flags = flags;
	zio->io_orig_stage = zio->io_stage = stage;
	zio->io_orig_pipeline = zio->io_pipeline = pipeline;

	if (zb != NULL)
	zio->io_bookmark = *zb;

	if (pio != NULL) {
	/*
	* Logical I/Os can have logical, gang, or vdev children.
	* Gang I/Os can have gang or vdev children.
	* Vdev I/Os can only have vdev children.
	* The following ASSERT captures all of these constraints.
	*/
	ASSERT(zio->io_child_type <= pio->io_child_type);
	if (zio->io_logical == NULL)
	zio->io_logical = pio->io_logical;
	zio_add_child(pio, zio);
	}

	return (zio);
	}

	static void
	zio_destroy(zio_t *zio)
	{
	spa_t *spa = zio->io_spa;
	uint8_t async_root = zio->io_async_root;

	mutex_destroy(&zio->io_lock);
	cv_destroy(&zio->io_cv);
	kmem_cache_free(zio_cache, zio);

	if (async_root) {
	mutex_enter(&spa->spa_async_root_lock);
	if (--spa->spa_async_root_count == 0)
	cv_broadcast(&spa->spa_async_root_cv);
	mutex_exit(&spa->spa_async_root_lock);
	}
	}

	zio_t *
	zio_null(zio_t *pio, spa_t *spa, zio_done_func_t *done, void *private,
	int flags)
	{
	zio_t *zio;

	zio = zio_create(pio, spa, 0, NULL, NULL, 0, done, private,
	ZIO_TYPE_NULL, ZIO_PRIORITY_NOW, flags, NULL, 0, NULL,
	ZIO_STAGE_OPEN, ZIO_INTERLOCK_PIPELINE);

	return (zio);
	}

	zio_t *
	zio_root(spa_t *spa, zio_done_func_t *done, void *private, int flags)
	{
	return (zio_null(NULL, spa, done, private, flags));
	}

	zio_t *
	zio_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
	void *data, uint64_t size, zio_done_func_t *done, void *private,
	int priority, int flags, const zbookmark_t *zb)
	{
	zio_t *zio;

	zio = zio_create(pio, spa, bp->blk_birth, (blkptr_t *)bp,
	data, size, done, private,
	ZIO_TYPE_READ, priority, flags, NULL, 0, zb,
	ZIO_STAGE_OPEN, ZIO_READ_PIPELINE);

	return (zio);
	}

	zio_t *
	zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
	void *data, uint64_t size, zio_prop_t *zp,
	zio_done_func_t *ready, zio_done_func_t *done, void *private,
	int priority, int flags, const zbookmark_t *zb)
	{
	zio_t *zio;

	ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
	zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
	zp->zp_compress >= ZIO_COMPRESS_OFF &&
	zp->zp_compress < ZIO_COMPRESS_FUNCTIONS &&
	zp->zp_type < DMU_OT_NUMTYPES &&
	zp->zp_level < 32 &&
	zp->zp_ndvas > 0 &&
	zp->zp_ndvas <= spa_max_replication(spa));
	ASSERT(ready != NULL);

	zio = zio_create(pio, spa, txg, bp, data, size, done, private,
	ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
	ZIO_STAGE_OPEN, ZIO_WRITE_PIPELINE);

	zio->io_ready = ready;
	zio->io_prop = *zp;

	return (zio);
	}

	zio_t *
	zio_rewrite(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, void *data,
	uint64_t size, zio_done_func_t *done, void *private, int priority,
	int flags, zbookmark_t *zb)
	{
	zio_t *zio;

	zio = zio_create(pio, spa, txg, bp, data, size, done, private,
	ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
	ZIO_STAGE_OPEN, ZIO_REWRITE_PIPELINE);

	return (zio);
	}

	zio_t *
	zio_free(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
	zio_done_func_t *done, void *private, int flags)
	{
	zio_t *zio;

	ASSERT(!BP_IS_HOLE(bp));

	if (bp->blk_fill == BLK_FILL_ALREADY_FREED)
	return (zio_null(pio, spa, NULL, NULL, flags));

	if (txg == spa->spa_syncing_txg &&
	spa_sync_pass(spa) > SYNC_PASS_DEFERRED_FREE) {
	bplist_enqueue_deferred(&spa->spa_sync_bplist, bp);
	return (zio_null(pio, spa, NULL, NULL, flags));
	}

	zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
	done, private, ZIO_TYPE_FREE, ZIO_PRIORITY_FREE, flags,
	NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_FREE_PIPELINE);

	return (zio);
	}

	zio_t *
	zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
	zio_done_func_t *done, void *private, int flags)
	{
	zio_t *zio;

	/*
	* A claim is an allocation of a specific block. Claims are needed
	* to support immediate writes in the intent log. The issue is that
	* immediate writes contain committed data, but in a txg that was
	* not committed. Upon opening the pool after an unclean shutdown,
	* the intent log claims all blocks that contain immediate write data
	* so that the SPA knows they're in use.
	*
	* All claims must be resolved in the first txg -- before the SPA
	* starts allocating blocks -- so that nothing is allocated twice.
	*/
	ASSERT3U(spa->spa_uberblock.ub_rootbp.blk_birth, <, spa_first_txg(spa));
	ASSERT3U(spa_first_txg(spa), <=, txg);

	zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
	done, private, ZIO_TYPE_CLAIM, ZIO_PRIORITY_NOW, flags,
	NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_CLAIM_PIPELINE);

	return (zio);
	}

	zio_t *
	zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
	zio_done_func_t *done, void *private, int priority, int flags)
	{
	zio_t *zio;
	int c;

	if (vd->vdev_children == 0) {
	zio = zio_create(pio, spa, 0, NULL, NULL, 0, done, private,
	ZIO_TYPE_IOCTL, priority, flags, vd, 0, NULL,
	ZIO_STAGE_OPEN, ZIO_IOCTL_PIPELINE);

	zio->io_cmd = cmd;
	} else {
	zio = zio_null(pio, spa, NULL, NULL, flags);

	for (c = 0; c < vd->vdev_children; c++)
	zio_nowait(zio_ioctl(zio, spa, vd->vdev_child[c], cmd,
	done, private, priority, flags));
	}

	return (zio);
	}

	zio_t *
	zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
	void *data, int checksum, zio_done_func_t *done, void *private,
	int priority, int flags, boolean_t labels)
	{
	zio_t *zio;

	ASSERT(vd->vdev_children == 0);
	ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
	offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
	ASSERT3U(offset + size, <=, vd->vdev_psize);

	zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, done, private,
	ZIO_TYPE_READ, priority, flags, vd, offset, NULL,
	ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);

	zio->io_prop.zp_checksum = checksum;

	return (zio);
	}

	zio_t *
	zio_write_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
	void *data, int checksum, zio_done_func_t *done, void *private,
	int priority, int flags, boolean_t labels)
	{
	zio_t *zio;

	ASSERT(vd->vdev_children == 0);
	ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
	offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
	ASSERT3U(offset + size, <=, vd->vdev_psize);

	zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, done, private,
	ZIO_TYPE_WRITE, priority, flags, vd, offset, NULL,
	ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);

	zio->io_prop.zp_checksum = checksum;

	if (zio_checksum_table[checksum].ci_zbt) {
	/*
	* zbt checksums are necessarily destructive -- they modify
	* the end of the write buffer to hold the verifier/checksum.
	* Therefore, we must make a local copy in case the data is
	* being written to multiple places in parallel.
	*/
	void *wbuf = zio_buf_alloc(size);
	bcopy(data, wbuf, size);
	zio_push_transform(zio, wbuf, size, size, NULL);
	}

	return (zio);
	}

	/*
	* Create a child I/O to do some work for us.
	*/
	zio_t *
	zio_vdev_child_io(zio_t *pio, blkptr_t *bp, vdev_t *vd, uint64_t offset,
	void *data, uint64_t size, int type, int priority, int flags,
	zio_done_func_t *done, void *private)
	{
	uint32_t pipeline = ZIO_VDEV_CHILD_PIPELINE;
	zio_t *zio;

	ASSERT(vd->vdev_parent ==
	(pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev));

	if (type == ZIO_TYPE_READ && bp != NULL) {
	/*
	* If we have the bp, then the child should perform the
	* checksum and the parent need not. This pushes error
	* detection as close to the leaves as possible and
	* eliminates redundant checksums in the interior nodes.
	*/
	pipeline |= 1U << ZIO_STAGE_CHECKSUM_VERIFY;
	pio->io_pipeline &= ~(1U << ZIO_STAGE_CHECKSUM_VERIFY);
	}

	if (vd->vdev_children == 0)
	offset += VDEV_LABEL_START_SIZE;

	zio = zio_create(pio, pio->io_spa, pio->io_txg, bp, data, size,
	done, private, type, priority,
	(pio->io_flags & ZIO_FLAG_VDEV_INHERIT) |
	ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | flags,
	vd, offset, &pio->io_bookmark,
	ZIO_STAGE_VDEV_IO_START - 1, pipeline);

	return (zio);
	}

	zio_t *
	zio_vdev_delegated_io(vdev_t *vd, uint64_t offset, void *data, uint64_t size,
	int type, int priority, int flags, zio_done_func_t *done, void *private)
	{
	zio_t *zio;

	ASSERT(vd->vdev_ops->vdev_op_leaf);

	zio = zio_create(NULL, vd->vdev_spa, 0, NULL,
	data, size, done, private, type, priority,
	flags | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_RETRY,
	vd, offset, NULL,
	ZIO_STAGE_VDEV_IO_START - 1, ZIO_VDEV_CHILD_PIPELINE);

	return (zio);
	}

	void
	zio_flush(zio_t *zio, vdev_t *vd)
	{
	zio_nowait(zio_ioctl(zio, zio->io_spa, vd, DKIOCFLUSHWRITECACHE,
	NULL, NULL, ZIO_PRIORITY_NOW,
	ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY));
	}

	/*
	* ==========================================================================
	* Prepare to read and write logical blocks
	* ==========================================================================
	*/

	static int
	zio_read_bp_init(zio_t *zio)
	{
	blkptr_t *bp = zio->io_bp;

	if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF && zio->io_logical == zio) {
	uint64_t csize = BP_GET_PSIZE(bp);
	void *cbuf = zio_buf_alloc(csize);

	zio_push_transform(zio, cbuf, csize, csize, zio_decompress);
	}

	if (!dmu_ot[BP_GET_TYPE(bp)].ot_metadata && BP_GET_LEVEL(bp) == 0)
	zio->io_flags |= ZIO_FLAG_DONT_CACHE;

	return (ZIO_PIPELINE_CONTINUE);
	}

	static int
	zio_write_bp_init(zio_t *zio)
	{
	zio_prop_t *zp = &zio->io_prop;
	int compress = zp->zp_compress;
	blkptr_t *bp = zio->io_bp;
	void *cbuf;
	uint64_t lsize = zio->io_size;
	uint64_t csize = lsize;
	uint64_t cbufsize = 0;
	int pass = 1;

	/*
	* If our children haven't all reached the ready stage,
	* wait for them and then repeat this pipeline stage.
	*/
	if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
	zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
	return (ZIO_PIPELINE_STOP);

	if (!IO_IS_ALLOCATING(zio))
	return (ZIO_PIPELINE_CONTINUE);

	ASSERT(compress != ZIO_COMPRESS_INHERIT);

	if (bp->blk_birth == zio->io_txg) {
	/*
	* We're rewriting an existing block, which means we're
	* working on behalf of spa_sync(). For spa_sync() to
	* converge, it must eventually be the case that we don't
	* have to allocate new blocks. But compression changes
	* the blocksize, which forces a reallocate, and makes
	* convergence take longer. Therefore, after the first
	* few passes, stop compressing to ensure convergence.
	*/
	pass = spa_sync_pass(zio->io_spa);
	ASSERT(pass > 1);

	if (pass > SYNC_PASS_DONT_COMPRESS)
	compress = ZIO_COMPRESS_OFF;

	/*
	* Only MOS (objset 0) data should need to be rewritten.
	*/
	ASSERT(zio->io_logical->io_bookmark.zb_objset == 0);

	/* Make sure someone doesn't change their mind on overwrites */
	ASSERT(MIN(zp->zp_ndvas + BP_IS_GANG(bp),
	spa_max_replication(zio->io_spa)) == BP_GET_NDVAS(bp));
	}

	if (compress != ZIO_COMPRESS_OFF) {
	if (!zio_compress_data(compress, zio->io_data, zio->io_size,
	&cbuf, &csize, &cbufsize)) {
	compress = ZIO_COMPRESS_OFF;
	} else if (csize != 0) {
	zio_push_transform(zio, cbuf, csize, cbufsize, NULL);
	}
	}

	/*
	* The final pass of spa_sync() must be all rewrites, but the first
	* few passes offer a trade-off: allocating blocks defers convergence,
	* but newly allocated blocks are sequential, so they can be written
	* to disk faster. Therefore, we allow the first few passes of
	* spa_sync() to allocate new blocks, but force rewrites after that.
	* There should only be a handful of blocks after pass 1 in any case.
	*/
	if (bp->blk_birth == zio->io_txg && BP_GET_PSIZE(bp) == csize &&
	pass > SYNC_PASS_REWRITE) {
	ASSERT(csize != 0);
	uint32_t gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;
	zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
	zio->io_flags |= ZIO_FLAG_IO_REWRITE;
	} else {
	BP_ZERO(bp);
	zio->io_pipeline = ZIO_WRITE_PIPELINE;
	}

	if (csize == 0) {
	zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
	} else {
	ASSERT(zp->zp_checksum != ZIO_CHECKSUM_GANG_HEADER);
	BP_SET_LSIZE(bp, lsize);
	BP_SET_PSIZE(bp, csize);
	BP_SET_COMPRESS(bp, compress);
	BP_SET_CHECKSUM(bp, zp->zp_checksum);
	BP_SET_TYPE(bp, zp->zp_type);
	BP_SET_LEVEL(bp, zp->zp_level);
	BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
	}

	return (ZIO_PIPELINE_CONTINUE);
	}

	/*
	* ==========================================================================
	* Execute the I/O pipeline
	* ==========================================================================
	*/

	static void
	zio_taskq_dispatch(zio_t *zio, enum zio_taskq_type q)
	{
	zio_type_t t = zio->io_type;

	/*
	- * If we're a config writer, the normal issue and interrupt threads
	- * may all be blocked waiting for the config lock. In this case,
	- * select the otherwise-unused taskq for ZIO_TYPE_NULL.
	+ * If we're a config writer or a probe, the normal issue and
	+ * interrupt threads may all be blocked waiting for the config lock.
	+ * In this case, select the otherwise-unused taskq for ZIO_TYPE_NULL.
	*/
	- if (zio->io_flags & ZIO_FLAG_CONFIG_WRITER)
	+ if (zio->io_flags & (ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_PROBE))
	t = ZIO_TYPE_NULL;

	/*
	* A similar issue exists for the L2ARC write thread until L2ARC 2.0.
	*/
	if (t == ZIO_TYPE_WRITE && zio->io_vd && zio->io_vd->vdev_aux)
	t = ZIO_TYPE_NULL;

	(void) taskq_dispatch_safe(zio->io_spa->spa_zio_taskq[t][q],
	(task_func_t *)zio_execute, zio, &zio->io_task);
	}

	static boolean_t
	zio_taskq_member(zio_t *zio, enum zio_taskq_type q)
	{
	kthread_t *executor = zio->io_executor;
	spa_t *spa = zio->io_spa;

	for (zio_type_t t = 0; t < ZIO_TYPES; t++)
	if (taskq_member(spa->spa_zio_taskq[t][q], executor))
	return (B_TRUE);

	return (B_FALSE);
	}

	static int
	zio_issue_async(zio_t *zio)
	{
	zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);

	return (ZIO_PIPELINE_STOP);
	}

	void
	zio_interrupt(zio_t *zio)
	{
	zio_taskq_dispatch(zio, ZIO_TASKQ_INTERRUPT);
	}

	/*
	* Execute the I/O pipeline until one of the following occurs:
	* (1) the I/O completes; (2) the pipeline stalls waiting for
	* dependent child I/Os; (3) the I/O issues, so we're waiting
	* for an I/O completion interrupt; (4) the I/O is delegated by
	* vdev-level caching or aggregation; (5) the I/O is deferred
	* due to vdev-level queueing; (6) the I/O is handed off to
	* another thread. In all cases, the pipeline stops whenever
	* there's no CPU work; it never burns a thread in cv_wait().
	*
	* There's no locking on io_stage because there's no legitimate way
	* for multiple threads to be attempting to process the same I/O.
	*/
	static zio_pipe_stage_t *zio_pipeline[ZIO_STAGES];

	void
	zio_execute(zio_t *zio)
	{
	zio->io_executor = curthread;

	while (zio->io_stage < ZIO_STAGE_DONE) {
	uint32_t pipeline = zio->io_pipeline;
	zio_stage_t stage = zio->io_stage;
	int rv;

	ASSERT(!MUTEX_HELD(&zio->io_lock));

	while (((1U << ++stage) & pipeline) == 0)
	continue;

	ASSERT(stage <= ZIO_STAGE_DONE);
	ASSERT(zio->io_stall == NULL);

	/*
	* If we are in interrupt context and this pipeline stage
	* will grab a config lock that is held across I/O,
	* issue async to avoid deadlock.
	*/
	if (((1U << stage) & ZIO_CONFIG_LOCK_BLOCKING_STAGES) &&
	zio->io_vd == NULL &&
	zio_taskq_member(zio, ZIO_TASKQ_INTERRUPT)) {
	zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
	return;
	}

	zio->io_stage = stage;
	rv = zio_pipeline[stage](zio);

	if (rv == ZIO_PIPELINE_STOP)
	return;

	ASSERT(rv == ZIO_PIPELINE_CONTINUE);
	}
	}

	/*
	* ==========================================================================
	* Initiate I/O, either sync or async
	* ==========================================================================
	*/
	int
	zio_wait(zio_t *zio)
	{
	int error;

	ASSERT(zio->io_stage == ZIO_STAGE_OPEN);
	ASSERT(zio->io_executor == NULL);

	zio->io_waiter = curthread;

	zio_execute(zio);

	mutex_enter(&zio->io_lock);
	while (zio->io_executor != NULL)
	cv_wait(&zio->io_cv, &zio->io_lock);
	mutex_exit(&zio->io_lock);

	error = zio->io_error;
	zio_destroy(zio);

	return (error);
	}

	void
	zio_nowait(zio_t *zio)
	{
	ASSERT(zio->io_executor == NULL);

	if (zio->io_parent == NULL && zio->io_child_type == ZIO_CHILD_LOGICAL) {
	/*
	* This is a logical async I/O with no parent to wait for it.
	* Attach it to the pool's global async root zio so that
	* spa_unload() has a way of waiting for async I/O to finish.
	*/
	spa_t *spa = zio->io_spa;
	zio->io_async_root = B_TRUE;
	mutex_enter(&spa->spa_async_root_lock);
	spa->spa_async_root_count++;
	mutex_exit(&spa->spa_async_root_lock);
	}

	zio_execute(zio);
	}

	/*
	* ==========================================================================
	* Reexecute or suspend/resume failed I/O
	* ==========================================================================
	*/

	static void
	zio_reexecute(zio_t *pio)
	{
	zio_t *zio, *zio_next;

	pio->io_flags = pio->io_orig_flags;
	pio->io_stage = pio->io_orig_stage;
	pio->io_pipeline = pio->io_orig_pipeline;
	pio->io_reexecute = 0;
	pio->io_error = 0;
	for (int c = 0; c < ZIO_CHILD_TYPES; c++)
	pio->io_child_error[c] = 0;

	if (IO_IS_ALLOCATING(pio)) {
	/*
	* Remember the failed bp so that the io_ready() callback
	* can update its accounting upon reexecution. The block
	* was already freed in zio_done(); we indicate this with
	* a fill count of -1 so that zio_free() knows to skip it.
	*/
	blkptr_t *bp = pio->io_bp;
	ASSERT(bp->blk_birth == 0 || bp->blk_birth == pio->io_txg);
	bp->blk_fill = BLK_FILL_ALREADY_FREED;
	pio->io_bp_orig = *bp;
	BP_ZERO(bp);
	}

	/*
	* As we reexecute pio's children, new children could be created.
	* New children go to the head of the io_child list, however,
	* so we will (correctly) not reexecute them. The key is that
	* the remainder of the io_child list, from 'zio_next' onward,
	* cannot be affected by any side effects of reexecuting 'zio'.
	*/
	for (zio = pio->io_child; zio != NULL; zio = zio_next) {
	zio_next = zio->io_sibling_next;
	mutex_enter(&pio->io_lock);
	pio->io_children[zio->io_child_type][ZIO_WAIT_READY]++;
	pio->io_children[zio->io_child_type][ZIO_WAIT_DONE]++;
	mutex_exit(&pio->io_lock);
	zio_reexecute(zio);
	}

	/*
	* Now that all children have been reexecuted, execute the parent.
	*/
	zio_execute(pio);
	}

	void
	zio_suspend(spa_t *spa, zio_t *zio)
	{
	if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_PANIC)
	fm_panic("Pool '%s' has encountered an uncorrectable I/O "
	"failure and the failure mode property for this pool "
	"is set to panic.", spa_name(spa));

	zfs_ereport_post(FM_EREPORT_ZFS_IO_FAILURE, spa, NULL, NULL, 0, 0);

	mutex_enter(&spa->spa_suspend_lock);

	if (spa->spa_suspend_zio_root == NULL)
	spa->spa_suspend_zio_root = zio_root(spa, NULL, NULL, 0);

	spa->spa_suspended = B_TRUE;

	if (zio != NULL) {
	ASSERT(zio != spa->spa_suspend_zio_root);
	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
	ASSERT(zio->io_parent == NULL);
	ASSERT(zio->io_stage == ZIO_STAGE_DONE);
	zio_add_child(spa->spa_suspend_zio_root, zio);
	}

	mutex_exit(&spa->spa_suspend_lock);
	}

	void
	zio_resume(spa_t *spa)
	{
	zio_t *pio, *zio;

	/*
	* Reexecute all previously suspended i/o.
	*/
	mutex_enter(&spa->spa_suspend_lock);
	spa->spa_suspended = B_FALSE;
	cv_broadcast(&spa->spa_suspend_cv);
	pio = spa->spa_suspend_zio_root;
	spa->spa_suspend_zio_root = NULL;
	mutex_exit(&spa->spa_suspend_lock);

	if (pio == NULL)
	return;

	while ((zio = pio->io_child) != NULL) {
	zio_remove_child(pio, zio);
	zio->io_parent = NULL;
	zio_reexecute(zio);
	}

	ASSERT(pio->io_children[ZIO_CHILD_LOGICAL][ZIO_WAIT_DONE] == 0);

	(void) zio_wait(pio);
	}

	void
	zio_resume_wait(spa_t *spa)
	{
	mutex_enter(&spa->spa_suspend_lock);
	while (spa_suspended(spa))
	cv_wait(&spa->spa_suspend_cv, &spa->spa_suspend_lock);
	mutex_exit(&spa->spa_suspend_lock);
	}

	/*
	* ==========================================================================
	* Gang blocks.
	*
	* A gang block is a collection of small blocks that looks to the DMU
	* like one large block. When zio_dva_allocate() cannot find a block
	* of the requested size, due to either severe fragmentation or the pool
	* being nearly full, it calls zio_write_gang_block() to construct the
	* block from smaller fragments.
	*
	* A gang block consists of a gang header (zio_gbh_phys_t) and up to
	* three (SPA_GBH_NBLKPTRS) gang members. The gang header is just like
	* an indirect block: it's an array of block pointers. It consumes
	* only one sector and hence is allocatable regardless of fragmentation.
	* The gang header's bps point to its gang members, which hold the data.
	*
	* Gang blocks are self-checksumming, using the bp's <vdev, offset, txg>
	* as the verifier to ensure uniqueness of the SHA256 checksum.
	* Critically, the gang block bp's blk_cksum is the checksum of the data,
	* not the gang header. This ensures that data block signatures (needed for
	* deduplication) are independent of how the block is physically stored.
	*
	* Gang blocks can be nested: a gang member may itself be a gang block.
	* Thus every gang block is a tree in which root and all interior nodes are
	* gang headers, and the leaves are normal blocks that contain user data.
	* The root of the gang tree is called the gang leader.
	*
	* To perform any operation (read, rewrite, free, claim) on a gang block,
	* zio_gang_assemble() first assembles the gang tree (minus data leaves)
	* in the io_gang_tree field of the original logical i/o by recursively
	* reading the gang leader and all gang headers below it. This yields
	* an in-core tree containing the contents of every gang header and the
	* bps for every constituent of the gang block.
	*
	* With the gang tree now assembled, zio_gang_issue() just walks the gang tree
	* and invokes a callback on each bp. To free a gang block, zio_gang_issue()
	* calls zio_free_gang() -- a trivial wrapper around zio_free() -- for each bp.
	* zio_claim_gang() provides a similarly trivial wrapper for zio_claim().
	* zio_read_gang() is a wrapper around zio_read() that omits reading gang
	* headers, since we already have those in io_gang_tree. zio_rewrite_gang()
	* performs a zio_rewrite() of the data or, for gang headers, a zio_rewrite()
	* of the gang header plus zio_checksum_compute() of the data to update the
	* gang header's blk_cksum as described above.
	*
	* The two-phase assemble/issue model solves the problem of partial failure --
	* what if you'd freed part of a gang block but then couldn't read the
	* gang header for another part? Assembling the entire gang tree first
	* ensures that all the necessary gang header I/O has succeeded before
	* starting the actual work of free, claim, or write. Once the gang tree
	* is assembled, free and claim are in-memory operations that cannot fail.
	*
	* In the event that a gang write fails, zio_dva_unallocate() walks the
	* gang tree to immediately free (i.e. insert back into the space map)
	* everything we've allocated. This ensures that we don't get ENOSPC
	* errors during repeated suspend/resume cycles due to a flaky device.
	*
	* Gang rewrites only happen during sync-to-convergence. If we can't assemble
	* the gang tree, we won't modify the block, so we can safely defer the free
	* (knowing that the block is still intact). If we can assemble the gang
	* tree, then even if some of the rewrites fail, zio_dva_unallocate() will free
	* each constituent bp and we can allocate a new block on the next sync pass.
	*
	* In all cases, the gang tree allows complete recovery from partial failure.
	* ==========================================================================
	*/

	static zio_t *
	zio_read_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
	{
	if (gn != NULL)
	return (pio);

	return (zio_read(pio, pio->io_spa, bp, data, BP_GET_PSIZE(bp),
	NULL, NULL, pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
	&pio->io_bookmark));
	}

	zio_t *
	zio_rewrite_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
	{
	zio_t *zio;

	if (gn != NULL) {
	zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
	gn->gn_gbh, SPA_GANGBLOCKSIZE, NULL, NULL, pio->io_priority,
	ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
	/*
	* As we rewrite each gang header, the pipeline will compute
	* a new gang block header checksum for it; but no one will
	* compute a new data checksum, so we do that here. The one
	* exception is the gang leader: the pipeline already computed
	* its data checksum because that stage precedes gang assembly.
	* (Presently, nothing actually uses interior data checksums;
	* this is just good hygiene.)
	*/
	if (gn != pio->io_logical->io_gang_tree) {
	zio_checksum_compute(zio, BP_GET_CHECKSUM(bp),
	data, BP_GET_PSIZE(bp));
	}
	} else {
	zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
	data, BP_GET_PSIZE(bp), NULL, NULL, pio->io_priority,
	ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
	}

	return (zio);
	}

	/* ARGSUSED */
	zio_t *
	zio_free_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
	{
	return (zio_free(pio, pio->io_spa, pio->io_txg, bp,
	NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
	}

	/* ARGSUSED */
	zio_t *
	zio_claim_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
	{
	return (zio_claim(pio, pio->io_spa, pio->io_txg, bp,
	NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
	}

	static zio_gang_issue_func_t *zio_gang_issue_func[ZIO_TYPES] = {
	NULL,
	zio_read_gang,
	zio_rewrite_gang,
	zio_free_gang,
	zio_claim_gang,
	NULL
	};

	static void zio_gang_tree_assemble_done(zio_t *zio);

	static zio_gang_node_t *
	zio_gang_node_alloc(zio_gang_node_t **gnpp)
	{
	zio_gang_node_t *gn;

	ASSERT(*gnpp == NULL);

	gn = kmem_zalloc(sizeof (*gn), KM_SLEEP);
	gn->gn_gbh = zio_buf_alloc(SPA_GANGBLOCKSIZE);
	*gnpp = gn;

	return (gn);
	}

	static void
	zio_gang_node_free(zio_gang_node_t **gnpp)
	{
	zio_gang_node_t *gn = *gnpp;

	for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
	ASSERT(gn->gn_child[g] == NULL);

	zio_buf_free(gn->gn_gbh, SPA_GANGBLOCKSIZE);
	kmem_free(gn, sizeof (*gn));
	*gnpp = NULL;
	}

	static void
	zio_gang_tree_free(zio_gang_node_t **gnpp)
	{
	zio_gang_node_t *gn = *gnpp;

	if (gn == NULL)
	return;

	for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
	zio_gang_tree_free(&gn->gn_child[g]);

	zio_gang_node_free(gnpp);
	}

	static void
	zio_gang_tree_assemble(zio_t *lio, blkptr_t *bp, zio_gang_node_t **gnpp)
	{
	zio_gang_node_t *gn = zio_gang_node_alloc(gnpp);

	ASSERT(lio->io_logical == lio);
	ASSERT(BP_IS_GANG(bp));

	zio_nowait(zio_read(lio, lio->io_spa, bp, gn->gn_gbh,
	SPA_GANGBLOCKSIZE, zio_gang_tree_assemble_done, gn,
	lio->io_priority, ZIO_GANG_CHILD_FLAGS(lio), &lio->io_bookmark));
	}

	static void
	zio_gang_tree_assemble_done(zio_t *zio)
	{
	zio_t *lio = zio->io_logical;
	zio_gang_node_t *gn = zio->io_private;
	blkptr_t *bp = zio->io_bp;

	ASSERT(zio->io_parent == lio);
	ASSERT(zio->io_child == NULL);

	if (zio->io_error)
	return;

	if (BP_SHOULD_BYTESWAP(bp))
	byteswap_uint64_array(zio->io_data, zio->io_size);

	ASSERT(zio->io_data == gn->gn_gbh);
	ASSERT(zio->io_size == SPA_GANGBLOCKSIZE);
	ASSERT(gn->gn_gbh->zg_tail.zbt_magic == ZBT_MAGIC);

	for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
	blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
	if (!BP_IS_GANG(gbp))
	continue;
	zio_gang_tree_assemble(lio, gbp, &gn->gn_child[g]);
	}
	}

	static void
	zio_gang_tree_issue(zio_t *pio, zio_gang_node_t *gn, blkptr_t *bp, void *data)
	{
	zio_t *lio = pio->io_logical;
	zio_t *zio;

	ASSERT(BP_IS_GANG(bp) == !!gn);
	ASSERT(BP_GET_CHECKSUM(bp) == BP_GET_CHECKSUM(lio->io_bp));
	ASSERT(BP_GET_LSIZE(bp) == BP_GET_PSIZE(bp) || gn == lio->io_gang_tree);

	/*
	* If you're a gang header, your data is in gn->gn_gbh.
	* If you're a gang member, your data is in 'data' and gn == NULL.
	*/
	zio = zio_gang_issue_func[lio->io_type](pio, bp, gn, data);

	if (gn != NULL) {
	ASSERT(gn->gn_gbh->zg_tail.zbt_magic == ZBT_MAGIC);

	for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
	blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
	if (BP_IS_HOLE(gbp))
	continue;
	zio_gang_tree_issue(zio, gn->gn_child[g], gbp, data);
	data = (char *)data + BP_GET_PSIZE(gbp);
	}
	}

	if (gn == lio->io_gang_tree)
	ASSERT3P((char *)lio->io_data + lio->io_size, ==, data);

	if (zio != pio)
	zio_nowait(zio);
	}
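
The data-pointer bookkeeping in zio_gang_tree_issue() can be illustrated with a small userland sketch. It is not the kernel code: demo_bp_t, demo_gang_node_t, demo_trace_t, demo_record() and demo_gang_walk() are hypothetical stand-ins invented for this example, and a block pointer is reduced to a size, with 0 standing in for a hole. The sketch only shows how each leaf receives its offset into the logical buffer and how the offset advances by the full size of each header slot, including nested gang blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	DEMO_NBLKPTRS	3	/* stands in for SPA_GBH_NBLKPTRS */

/* Hypothetical, heavily simplified stand-in for blkptr_t. */
typedef struct demo_bp {
	uint64_t size;		/* physical size; 0 means "hole" */
} demo_bp_t;

/* Hypothetical stand-in for zio_gang_node_t. */
typedef struct demo_gang_node {
	demo_bp_t bp[DEMO_NBLKPTRS];	/* contents of the gang header */
	struct demo_gang_node *child[DEMO_NBLKPTRS]; /* set if bp[g] is a gang */
} demo_gang_node_t;

/* Records the (offset, length) of every leaf visited, in order. */
typedef struct demo_trace {
	int n;
	uint64_t off[8];
	uint64_t len[8];
} demo_trace_t;

static void
demo_record(uint64_t offset, uint64_t size, void *arg)
{
	demo_trace_t *t = arg;

	assert(t->n < 8);
	t->off[t->n] = offset;
	t->len[t->n] = size;
	t->n++;
}

/*
 * Walk a gang tree.  A NULL node means 'bp' is a data leaf that starts at
 * 'offset'.  Otherwise descend into each non-hole slot of the header and
 * advance 'offset' by that slot's size, whether or not the slot is itself
 * a gang block.
 */
static void
demo_gang_walk(const demo_bp_t *bp, const demo_gang_node_t *gn,
    uint64_t offset, void (*func)(uint64_t, uint64_t, void *), void *arg)
{
	if (gn == NULL) {
		func(offset, bp->size, arg);
		return;
	}

	for (int g = 0; g < DEMO_NBLKPTRS; g++) {
		const demo_bp_t *gbp = &gn->bp[g];

		if (gbp->size == 0)
			continue;
		demo_gang_walk(gbp, gn->child[g], offset, func, arg);
		offset += gbp->size;
	}
}
```

Calling demo_gang_walk() on a root whose second slot is itself a gang block visits the leaves in buffer order, so each leaf's offset equals the sum of the sizes of all leaves before it.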

	static int
	zio_gang_assemble(zio_t *zio)
	{
	blkptr_t *bp = zio->io_bp;

	ASSERT(BP_IS_GANG(bp) && zio == zio->io_logical);

	zio_gang_tree_assemble(zio, bp, &zio->io_gang_tree);

	return (ZIO_PIPELINE_CONTINUE);
	}

	static int
	zio_gang_issue(zio_t *zio)
	{
	zio_t *lio = zio->io_logical;
	blkptr_t *bp = zio->io_bp;

	if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE))
	return (ZIO_PIPELINE_STOP);

	ASSERT(BP_IS_GANG(bp) && zio == lio);

	if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
	zio_gang_tree_issue(lio, lio->io_gang_tree, bp, lio->io_data);
	else
	zio_gang_tree_free(&lio->io_gang_tree);

	zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;

	return (ZIO_PIPELINE_CONTINUE);
	}

	static void
	zio_write_gang_member_ready(zio_t *zio)
	{
	zio_t *pio = zio->io_parent;
	zio_t *lio = zio->io_logical;
	dva_t *cdva = zio->io_bp->blk_dva;
	dva_t *pdva = pio->io_bp->blk_dva;
	uint64_t asize;

	if (BP_IS_HOLE(zio->io_bp))
	return;

	ASSERT(BP_IS_HOLE(&zio->io_bp_orig));

	ASSERT(zio->io_child_type == ZIO_CHILD_GANG);
	ASSERT3U(zio->io_prop.zp_ndvas, ==, lio->io_prop.zp_ndvas);
	ASSERT3U(zio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(zio->io_bp));
	ASSERT3U(pio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(pio->io_bp));
	ASSERT3U(BP_GET_NDVAS(zio->io_bp), <=, BP_GET_NDVAS(pio->io_bp));

	mutex_enter(&pio->io_lock);
	for (int d = 0; d < BP_GET_NDVAS(zio->io_bp); d++) {
	ASSERT(DVA_GET_GANG(&pdva[d]));
	asize = DVA_GET_ASIZE(&pdva[d]);
	asize += DVA_GET_ASIZE(&cdva[d]);
	DVA_SET_ASIZE(&pdva[d], asize);
	}
	mutex_exit(&pio->io_lock);
	}

	static int
	zio_write_gang_block(zio_t *pio)
	{
	spa_t *spa = pio->io_spa;
	blkptr_t *bp = pio->io_bp;
	zio_t *lio = pio->io_logical;
	zio_t *zio;
	zio_gang_node_t *gn, **gnpp;
	zio_gbh_phys_t *gbh;
	uint64_t txg = pio->io_txg;
	uint64_t resid = pio->io_size;
	uint64_t lsize;
	int ndvas = lio->io_prop.zp_ndvas;
	int gbh_ndvas = MIN(ndvas + 1, spa_max_replication(spa));
	zio_prop_t zp;
	int error;

	error = metaslab_alloc(spa, spa->spa_normal_class, SPA_GANGBLOCKSIZE,
	bp, gbh_ndvas, txg, pio == lio ? NULL : lio->io_bp,
	METASLAB_HINTBP_FAVOR | METASLAB_GANG_HEADER);
	if (error) {
	pio->io_error = error;
	return (ZIO_PIPELINE_CONTINUE);
	}

	if (pio == lio) {
	gnpp = &lio->io_gang_tree;
	} else {
	gnpp = pio->io_private;
	ASSERT(pio->io_ready == zio_write_gang_member_ready);
	}

	gn = zio_gang_node_alloc(gnpp);
	gbh = gn->gn_gbh;
	bzero(gbh, SPA_GANGBLOCKSIZE);

	/*
	* Create the gang header.
	*/
	zio = zio_rewrite(pio, spa, txg, bp, gbh, SPA_GANGBLOCKSIZE, NULL, NULL,
	pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);

	/*
	* Create and nowait the gang children.
	*/
	for (int g = 0; resid != 0; resid -= lsize, g++) {
	lsize = P2ROUNDUP(resid / (SPA_GBH_NBLKPTRS - g),
	SPA_MINBLOCKSIZE);
	ASSERT(lsize >= SPA_MINBLOCKSIZE && lsize <= resid);

	zp.zp_checksum = lio->io_prop.zp_checksum;
	zp.zp_compress = ZIO_COMPRESS_OFF;
	zp.zp_type = DMU_OT_NONE;
	zp.zp_level = 0;
	zp.zp_ndvas = lio->io_prop.zp_ndvas;

	zio_nowait(zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
	(char *)pio->io_data + (pio->io_size - resid), lsize, &zp,
	zio_write_gang_member_ready, NULL, &gn->gn_child[g],
	pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
	&pio->io_bookmark));
	}

	/*
	* Set pio's pipeline to just wait for zio to finish.
	*/
	pio->io_pipeline = ZIO_INTERLOCK_PIPELINE;

	zio_nowait(zio);

	return (ZIO_PIPELINE_CONTINUE);
	}
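
The loop in zio_write_gang_block() that carves the residual size into gang members can be tried in isolation. The following is a minimal userland sketch, assuming a 512-byte minimum block size and three block pointers per gang header; DEMO_MINBLOCKSIZE, DEMO_NBLKPTRS, demo_roundup() and demo_gang_split() are names made up for this illustration, and the round-up is written with plain arithmetic rather than the kernel's P2ROUNDUP macro.

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_MINBLOCKSIZE	512ULL	/* assumed value of SPA_MINBLOCKSIZE */
#define	DEMO_NBLKPTRS		3	/* assumed value of SPA_GBH_NBLKPTRS */

/* Round x up to the next multiple of align (align must be nonzero). */
static uint64_t
demo_roundup(uint64_t x, uint64_t align)
{
	return (((x + align - 1) / align) * align);
}

/*
 * Split 'size' bytes into at most DEMO_NBLKPTRS member sizes, using the
 * same rule as the loop above: each member gets an equal share of what
 * remains, rounded up to the minimum block size.  Returns the number of
 * members written to lsize_out[].  'size' must be a nonzero multiple of
 * DEMO_MINBLOCKSIZE, which guarantees that each share fits in the
 * remainder and that the last slot absorbs whatever is left.
 */
static int
demo_gang_split(uint64_t size, uint64_t lsize_out[DEMO_NBLKPTRS])
{
	uint64_t resid = size;
	int g;

	assert(size != 0 && size % DEMO_MINBLOCKSIZE == 0);

	for (g = 0; resid != 0; g++) {
		uint64_t lsize;

		assert(g < DEMO_NBLKPTRS);
		lsize = demo_roundup(resid / (DEMO_NBLKPTRS - g),
		    DEMO_MINBLOCKSIZE);
		assert(lsize >= DEMO_MINBLOCKSIZE && lsize <= resid);
		lsize_out[g] = lsize;
		resid -= lsize;
	}

	return (g);
}
```

For a 4096-byte block this yields members of 1536, 1536 and 1024 bytes; a 512-byte block yields a single 512-byte member, so fewer than three header slots may be used.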

	/*
	* ==========================================================================
	* Allocate and free blocks
	* ==========================================================================
	*/

	static int
	zio_dva_allocate(zio_t *zio)
	{
	spa_t *spa = zio->io_spa;
	metaslab_class_t *mc = spa->spa_normal_class;
	blkptr_t *bp = zio->io_bp;
	int error;

	ASSERT(BP_IS_HOLE(bp));
	ASSERT3U(BP_GET_NDVAS(bp), ==, 0);
	ASSERT3U(zio->io_prop.zp_ndvas, >, 0);
	ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
	ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));

	error = metaslab_alloc(spa, mc, zio->io_size, bp,
	zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);

	if (error) {
	if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE)
	return (zio_write_gang_block(zio));
	zio->io_error = error;
	}

	return (ZIO_PIPELINE_CONTINUE);
	}

	static int
	zio_dva_free(zio_t *zio)
	{
	metaslab_free(zio->io_spa, zio->io_bp, zio->io_txg, B_FALSE);

	return (ZIO_PIPELINE_CONTINUE);
	}

	static int
	zio_dva_claim(zio_t *zio)
	{
	int error;

	error = metaslab_claim(zio->io_spa, zio->io_bp, zio->io_txg);
	if (error)
	zio->io_error = error;

	return (ZIO_PIPELINE_CONTINUE);
	}

	/*
	* Undo an allocation. This is used by zio_done() when an I/O fails
	* and we want to give back the block we just allocated.
	* This handles both normal blocks and gang blocks.
	*/
	static void
	zio_dva_unallocate(zio_t *zio, zio_gang_node_t *gn, blkptr_t *bp)
	{
	spa_t *spa = zio->io_spa;
	boolean_t now = !(zio->io_flags & ZIO_FLAG_IO_REWRITE);

	ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));

	if (zio->io_bp == bp && !now) {
	/*
	* This is a rewrite for sync-to-convergence.
	* We can't do a metaslab_free(NOW) because bp wasn't allocated
	* during this sync pass, which means that metaslab_sync()
	* already committed the allocation.
	*/
	ASSERT(DVA_EQUAL(BP_IDENTITY(bp),
	BP_IDENTITY(&zio->io_bp_orig)));
	ASSERT(spa_sync_pass(spa) > 1);

	if (BP_IS_GANG(bp) && gn == NULL) {
	/*
	* This is a gang leader whose gang header(s) we
	* couldn't read now, so defer the free until later.
	* The block should still be intact because without
	* the headers, we'd never even start the rewrite.
	*/
	bplist_enqueue_deferred(&spa->spa_sync_bplist, bp);
	return;
	}
	}

	if (!BP_IS_HOLE(bp))
	metaslab_free(spa, bp, bp->blk_birth, now);

	if (gn != NULL) {
	for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
	zio_dva_unallocate(zio, gn->gn_child[g],
	&gn->gn_gbh->zg_blkptr[g]);
	}
	}
	}

	/*
	* Try to allocate an intent log block. Return 0 on success, errno on failure.
	*/
	int
	zio_alloc_blk(spa_t *spa, uint64_t size, blkptr_t *new_bp, blkptr_t *old_bp,
	uint64_t txg)
	{
	int error;

	error = metaslab_alloc(spa, spa->spa_log_class, size,
	new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID);

	if (error)
	error = metaslab_alloc(spa, spa->spa_normal_class, size,
	new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID);

	if (error == 0) {
	BP_SET_LSIZE(new_bp, size);
	BP_SET_PSIZE(new_bp, size);
	BP_SET_COMPRESS(new_bp, ZIO_COMPRESS_OFF);
	BP_SET_CHECKSUM(new_bp, ZIO_CHECKSUM_ZILOG);
	BP_SET_TYPE(new_bp, DMU_OT_INTENT_LOG);
	BP_SET_LEVEL(new_bp, 0);
	BP_SET_BYTEORDER(new_bp, ZFS_HOST_BYTEORDER);
	}

	return (error);
	}

	/*
	* Free an intent log block. We know it can't be a gang block, so there's
	* nothing to do except metaslab_free() it.
	*/
	void
	zio_free_blk(spa_t *spa, blkptr_t *bp, uint64_t txg)
	{
	ASSERT(!BP_IS_GANG(bp));

	metaslab_free(spa, bp, txg, B_FALSE);
	}

	/*
	* ==========================================================================
	* Read and write to physical devices
	* ==========================================================================
	*/

	static void
	zio_vdev_io_probe_done(zio_t *zio)
	{
	zio_t *dio;
	vdev_t *vd = zio->io_private;

	mutex_enter(&vd->vdev_probe_lock);
	ASSERT(vd->vdev_probe_zio == zio);
	vd->vdev_probe_zio = NULL;
	mutex_exit(&vd->vdev_probe_lock);

	while ((dio = zio->io_delegate_list) != NULL) {
	zio->io_delegate_list = dio->io_delegate_next;
	dio->io_delegate_next = NULL;
	if (!vdev_accessible(vd, dio))
	dio->io_error = ENXIO;
	zio_execute(dio);
	}
	}

	/*
	* Probe the device to determine whether I/O failure is specific to this
	* zio (e.g. a bad sector) or affects the entire vdev (e.g. unplugged).
	*/
	static int
	zio_vdev_io_probe(zio_t *zio)
	{
	vdev_t *vd = zio->io_vd;
	zio_t *pio = NULL;
	boolean_t created_pio = B_FALSE;

	/*
	* Don't probe the probe.
	*/
	if (zio->io_flags & ZIO_FLAG_PROBE)
	return (ZIO_PIPELINE_CONTINUE);

	/*
	* To prevent 'probe storms' when a device fails, we create
	* just one probe i/o at a time. All zios that want to probe
	* this vdev will join the probe zio's io_delegate_list.
	*/
	mutex_enter(&vd->vdev_probe_lock);

	if ((pio = vd->vdev_probe_zio) == NULL) {
	vd->vdev_probe_zio = pio = zio_root(zio->io_spa,
	zio_vdev_io_probe_done, vd, ZIO_FLAG_CANFAIL);
	created_pio = B_TRUE;
	vd->vdev_probe_wanted = B_TRUE;
	spa_async_request(zio->io_spa, SPA_ASYNC_PROBE);
	}

	zio->io_delegate_next = pio->io_delegate_list;
	pio->io_delegate_list = zio;

	mutex_exit(&vd->vdev_probe_lock);

	if (created_pio) {
	zio_nowait(vdev_probe(vd, pio));
	zio_nowait(pio);
	}

	return (ZIO_PIPELINE_STOP);
	}

	static int
	zio_vdev_io_start(zio_t *zio)
	{
	vdev_t *vd = zio->io_vd;
	uint64_t align;
	spa_t *spa = zio->io_spa;

	ASSERT(zio->io_error == 0);
	ASSERT(zio->io_child_error[ZIO_CHILD_VDEV] == 0);

	if (vd == NULL) {
	if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
	spa_config_enter(spa, SCL_ZIO, zio, RW_READER);

	/*
	* The mirror_ops handle multiple DVAs in a single BP.
	*/
	return (vdev_mirror_ops.vdev_op_io_start(zio));
	}

	align = 1ULL << vd->vdev_top->vdev_ashift;

	if (P2PHASE(zio->io_size, align) != 0) {
	uint64_t asize = P2ROUNDUP(zio->io_size, align);
	char *abuf = zio_buf_alloc(asize);
	ASSERT(vd == vd->vdev_top);
	if (zio->io_type == ZIO_TYPE_WRITE) {
	bcopy(zio->io_data, abuf, zio->io_size);
	bzero(abuf + zio->io_size, asize - zio->io_size);
	}
	zio_push_transform(zio, abuf, asize, asize, zio_subblock);
	}

	ASSERT(P2PHASE(zio->io_offset, align) == 0);
	ASSERT(P2PHASE(zio->io_size, align) == 0);
	ASSERT(zio->io_type != ZIO_TYPE_WRITE || (spa_mode & FWRITE));

	if (vd->vdev_ops->vdev_op_leaf &&
	(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE)) {

	if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio) == 0)
	return (ZIO_PIPELINE_STOP);

	if ((zio = vdev_queue_io(zio)) == NULL)
	return (ZIO_PIPELINE_STOP);

	if (!vdev_accessible(vd, zio)) {
	zio->io_error = ENXIO;
	zio_interrupt(zio);
	return (ZIO_PIPELINE_STOP);
	}

	}

	return (vd->vdev_ops->vdev_op_io_start(zio));
	}
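
The sub-block padding step in zio_vdev_io_start() (copy the payload, zero the tail, issue the aligned buffer) can be sketched outside the kernel as follows. This is an assumption-laden illustration: demo_padded_copy() is a hypothetical helper, malloc() replaces zio_buf_alloc(), and unlike the kernel code it always returns a fresh copy, even when the size is already aligned.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Return a newly allocated copy of 'data' whose length is 'size' rounded
 * up to a multiple of (1 << ashift), with the padding zero-filled.  The
 * rounded length is stored in *asizep.  Returns NULL only if malloc()
 * fails.  The caller frees the result.
 */
static char *
demo_padded_copy(const void *data, uint64_t size, unsigned ashift,
    uint64_t *asizep)
{
	uint64_t align = 1ULL << ashift;
	uint64_t asize = (size + align - 1) & ~(align - 1);
	char *abuf;

	assert(size != 0);

	abuf = malloc((size_t)asize);
	if (abuf == NULL)
		return (NULL);

	memcpy(abuf, data, (size_t)size);		/* payload */
	memset(abuf + size, 0, (size_t)(asize - size));	/* zero the tail */
	*asizep = asize;

	return (abuf);
}
```

With ashift = 9 (512-byte sectors), a 5-byte payload becomes a 512-byte buffer whose last 507 bytes are zero.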

	static int
	zio_vdev_io_done(zio_t *zio)
	{
	vdev_t *vd = zio->io_vd;
	vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
	boolean_t unexpected_error = B_FALSE;

	if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
	return (ZIO_PIPELINE_STOP);

	ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);

	if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {

	vdev_queue_io_done(zio);

	if (zio->io_type == ZIO_TYPE_WRITE)
	vdev_cache_write(zio);

	if (zio_injection_enabled && zio->io_error == 0)
	zio->io_error = zio_handle_device_injection(vd, EIO);

	if (zio_injection_enabled && zio->io_error == 0)
	zio->io_error = zio_handle_label_injection(zio, EIO);

	if (zio->io_error) {
	if (!vdev_accessible(vd, zio)) {
	zio->io_error = ENXIO;
	} else {
	unexpected_error = B_TRUE;
	}
	}
	}

	ops->vdev_op_io_done(zio);

	if (unexpected_error)
	return (zio_vdev_io_probe(zio));

	return (ZIO_PIPELINE_CONTINUE);
	}

	static int
	zio_vdev_io_assess(zio_t *zio)
	{
	vdev_t *vd = zio->io_vd;

	if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
	return (ZIO_PIPELINE_STOP);

	if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
	spa_config_exit(zio->io_spa, SCL_ZIO, zio);

	if (zio->io_vsd != NULL) {
	zio->io_vsd_free(zio);
	zio->io_vsd = NULL;
	}

	if (zio_injection_enabled && zio->io_error == 0)
	zio->io_error = zio_handle_fault_injection(zio, EIO);

	/*
	* If the I/O failed, determine whether we should attempt to retry it.
	*/
	if (zio->io_error && vd == NULL &&
	!(zio->io_flags & (ZIO_FLAG_DONT_RETRY | ZIO_FLAG_IO_RETRY))) {
	ASSERT(!(zio->io_flags & ZIO_FLAG_DONT_QUEUE)); /* not a leaf */
	ASSERT(!(zio->io_flags & ZIO_FLAG_IO_BYPASS)); /* not a leaf */
	zio->io_error = 0;
	zio->io_flags |= ZIO_FLAG_IO_RETRY |
	ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_AGGREGATE;
	zio->io_stage = ZIO_STAGE_VDEV_IO_START - 1;
	zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
	return (ZIO_PIPELINE_STOP);
	}

	/*
	* If we got an error on a leaf device, convert it to ENXIO
	* if the device is not accessible at all.
	*/
	if (zio->io_error && vd != NULL && vd->vdev_ops->vdev_op_leaf &&
	!vdev_accessible(vd, zio))
	zio->io_error = ENXIO;

	/*
	* If we can't write to an interior vdev (mirror or RAID-Z),
	* set vdev_cant_write so that we stop trying to allocate from it.
	*/
	if (zio->io_error == ENXIO && zio->io_type == ZIO_TYPE_WRITE &&
	vd != NULL && !vd->vdev_ops->vdev_op_leaf)
	vd->vdev_cant_write = B_TRUE;

	if (zio->io_error)
	zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;

	return (ZIO_PIPELINE_CONTINUE);
	}

	void
	zio_vdev_io_reissue(zio_t *zio)
	{
	ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
	ASSERT(zio->io_error == 0);

	zio->io_stage--;
	}

	void
	zio_vdev_io_redone(zio_t *zio)
	{
	ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_DONE);

	zio->io_stage--;
	}

	void
	zio_vdev_io_bypass(zio_t *zio)
	{
	ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
	ASSERT(zio->io_error == 0);

	zio->io_flags |= ZIO_FLAG_IO_BYPASS;
	zio->io_stage = ZIO_STAGE_VDEV_IO_ASSESS - 1;
	}

	/*
	* ==========================================================================
	* Generate and verify checksums
	* ==========================================================================
	*/
	static int
	zio_checksum_generate(zio_t *zio)
	{
	blkptr_t *bp = zio->io_bp;
	enum zio_checksum checksum;

	if (bp == NULL) {
	/*
	* This is zio_write_phys().
	* We're either generating a label checksum, or none at all.
	*/
	checksum = zio->io_prop.zp_checksum;

	if (checksum == ZIO_CHECKSUM_OFF)
	return (ZIO_PIPELINE_CONTINUE);

	ASSERT(checksum == ZIO_CHECKSUM_LABEL);
	} else {
	if (BP_IS_GANG(bp) && zio->io_child_type == ZIO_CHILD_GANG) {
	ASSERT(!IO_IS_ALLOCATING(zio));
	checksum = ZIO_CHECKSUM_GANG_HEADER;
	} else {
	checksum = BP_GET_CHECKSUM(bp);
	}
	}

	zio_checksum_compute(zio, checksum, zio->io_data, zio->io_size);

	return (ZIO_PIPELINE_CONTINUE);
	}

	static int
	zio_checksum_verify(zio_t *zio)
	{
	blkptr_t *bp = zio->io_bp;
	int error;

	if (bp == NULL) {
	/*
	* This is zio_read_phys().
	* We're either verifying a label checksum, or nothing at all.
	*/
	if (zio->io_prop.zp_checksum == ZIO_CHECKSUM_OFF)
	return (ZIO_PIPELINE_CONTINUE);

	ASSERT(zio->io_prop.zp_checksum == ZIO_CHECKSUM_LABEL);
	}

	if ((error = zio_checksum_error(zio)) != 0) {
	zio->io_error = error;
	if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
	zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
	zio->io_spa, zio->io_vd, zio, 0, 0);
	}
	}

	return (ZIO_PIPELINE_CONTINUE);
	}

	/*
	* Called by RAID-Z to ensure we don't compute the checksum twice.
	*/
	void
	zio_checksum_verified(zio_t *zio)
	{
	zio->io_pipeline &= ~(1U << ZIO_STAGE_CHECKSUM_VERIFY);
	}

	/*
	* ==========================================================================
	* Error rank. Errors are ranked in the order 0, ENXIO, ECKSUM, EIO, other.
	* An error of 0 indicates success. ENXIO indicates whole-device failure,
	* which may be transient (e.g. unplugged) or permanent. ECKSUM and EIO
	* indicate errors that are specific to one I/O, and most likely permanent.
	* Any other error is presumed to be worse because we weren't expecting it.
	* ==========================================================================
	*/
	int
	zio_worst_error(int e1, int e2)
	{
	static int zio_error_rank[] = { 0, ENXIO, ECKSUM, EIO };
	int r1, r2;

	for (r1 = 0; r1 < sizeof (zio_error_rank) / sizeof (int); r1++)
	if (e1 == zio_error_rank[r1])
	break;

	for (r2 = 0; r2 < sizeof (zio_error_rank) / sizeof (int); r2++)
	if (e2 == zio_error_rank[r2])
	break;

	return (r1 > r2 ? e1 : e2);
	}
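
The ranking rule used by zio_worst_error() can be exercised in userland with the sketch below. ECKSUM is not a standard errno name, so DEMO_ECKSUM is a placeholder value chosen for this example only; demo_error_rank() and demo_worst_error() are likewise hypothetical names that mirror the logic above rather than reproduce it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	DEMO_ECKSUM	10001	/* placeholder; not the kernel's ECKSUM value */

/*
 * Return the position of 'e' in the ranking 0, ENXIO, ECKSUM, EIO.
 * Any error not in the table gets the highest rank (the table length),
 * i.e. it is treated as worse than every listed error.
 */
static size_t
demo_error_rank(int e)
{
	static const int rank[] = { 0, ENXIO, DEMO_ECKSUM, EIO };
	size_t n = sizeof (rank) / sizeof (rank[0]);
	size_t r;

	for (r = 0; r < n; r++)
		if (e == rank[r])
			break;

	return (r);
}

/* Return whichever error ranks worse; on a tie, the second one wins. */
static int
demo_worst_error(int e1, int e2)
{
	return (demo_error_rank(e1) > demo_error_rank(e2) ? e1 : e2);
}
```

Under this rule an unexpected error such as ENOMEM outranks EIO, and two different unlisted errors tie, in which case the second argument is returned.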

	/*
	* ==========================================================================
	* I/O completion
	* ==========================================================================
	*/
	static int
	zio_ready(zio_t *zio)
	{
	blkptr_t *bp = zio->io_bp;
	zio_t *pio = zio->io_parent;

	if (zio->io_ready) {
	if (BP_IS_GANG(bp) &&
	zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY))
	return (ZIO_PIPELINE_STOP);

	ASSERT(IO_IS_ALLOCATING(zio));
	ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
	ASSERT(zio->io_children[ZIO_CHILD_GANG][ZIO_WAIT_READY] == 0);

	zio->io_ready(zio);
	}

	if (bp != NULL && bp != &zio->io_bp_copy)
	zio->io_bp_copy = *bp;

	if (zio->io_error)
	zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;

	if (pio != NULL)
	zio_notify_parent(pio, zio, ZIO_WAIT_READY);

	return (ZIO_PIPELINE_CONTINUE);
	}

	static int
	zio_done(zio_t *zio)
	{
	spa_t *spa = zio->io_spa;
	zio_t *pio = zio->io_parent;
	zio_t *lio = zio->io_logical;
	blkptr_t *bp = zio->io_bp;
	vdev_t *vd = zio->io_vd;
	uint64_t psize = zio->io_size;

	/*
	* If our children haven't all completed,
	* wait for them and then repeat this pipeline stage.
	*/
	if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) ||
	zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) ||
	zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE))
	return (ZIO_PIPELINE_STOP);

	for (int c = 0; c < ZIO_CHILD_TYPES; c++)
	for (int w = 0; w < ZIO_WAIT_TYPES; w++)
	ASSERT(zio->io_children[c][w] == 0);

	if (bp != NULL) {
	ASSERT(bp->blk_pad[0] == 0);
	ASSERT(bp->blk_pad[1] == 0);
	ASSERT(bp->blk_pad[2] == 0);
	ASSERT(bcmp(bp, &zio->io_bp_copy, sizeof (blkptr_t)) == 0 ||
	(pio != NULL && bp == pio->io_bp));
	if (zio->io_type == ZIO_TYPE_WRITE && !BP_IS_HOLE(bp) &&
	!(zio->io_flags & ZIO_FLAG_IO_REPAIR)) {
	ASSERT(!BP_SHOULD_BYTESWAP(bp));
	ASSERT3U(zio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(bp));
	ASSERT(BP_COUNT_GANG(bp) == 0 ||
	(BP_COUNT_GANG(bp) == BP_GET_NDVAS(bp)));
	}
	}

	/*
	* If there were child vdev or gang errors, they apply to us now.
	*/
	zio_inherit_child_errors(zio, ZIO_CHILD_VDEV);
	zio_inherit_child_errors(zio, ZIO_CHILD_GANG);

	zio_pop_transforms(zio); /* note: may set zio->io_error */

	vdev_stat_update(zio, psize);

	if (zio->io_error) {
	/*
	* If this I/O is attached to a particular vdev,
	* generate an error message describing the I/O failure
	* at the block level. We ignore these errors if the
	* device is currently unavailable.
	*/
	if (zio->io_error != ECKSUM && vd != NULL && !vdev_is_dead(vd))
	zfs_ereport_post(FM_EREPORT_ZFS_IO, spa, vd, zio, 0, 0);

	if ((zio->io_error == EIO ||
	!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) && zio == lio) {
	/*
	* For logical I/O requests, tell the SPA to log the
	* error and generate a logical data ereport.
	*/
	spa_log_error(spa, zio);
	zfs_ereport_post(FM_EREPORT_ZFS_DATA, spa, NULL, zio,
	0, 0);
	}
	}

	if (zio->io_error && zio == lio) {
	/*
	* Determine whether zio should be reexecuted. This will
	* propagate all the way to the root via zio_notify_parent().
	*/
	ASSERT(vd == NULL && bp != NULL);

	if (IO_IS_ALLOCATING(zio))
	if (zio->io_error != ENOSPC)
	zio->io_reexecute |= ZIO_REEXECUTE_NOW;
	else
	zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;

	if ((zio->io_type == ZIO_TYPE_READ ||
	zio->io_type == ZIO_TYPE_FREE) &&
	zio->io_error == ENXIO &&
	spa_get_failmode(spa) != ZIO_FAILURE_MODE_CONTINUE)
	zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;

	if (!(zio->io_flags & ZIO_FLAG_CANFAIL) && !zio->io_reexecute)
	zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
	}

	/*
	* If there were logical child errors, they apply to us now.
	* We defer this until now to avoid conflating logical child
	* errors with errors that happened to the zio itself when
	* updating vdev stats and reporting FMA events above.
	*/
	zio_inherit_child_errors(zio, ZIO_CHILD_LOGICAL);

	if (zio->io_reexecute) {
	/*
	* This is a logical I/O that wants to reexecute.
	*
	* Reexecute is top-down. When an i/o fails, if it's not
	* the root, it simply notifies its parent and sticks around.
	* The parent, seeing that it still has children in zio_done(),
	* does the same. This percolates all the way up to the root.
	* The root i/o will reexecute or suspend the entire tree.
	*
	* This approach ensures that zio_reexecute() honors
	* all the original i/o dependency relationships, e.g.
	* parents not executing until children are ready.
	*/
	ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);

	if (IO_IS_ALLOCATING(zio))
	zio_dva_unallocate(zio, zio->io_gang_tree, bp);

	zio_gang_tree_free(&zio->io_gang_tree);

	if (pio != NULL) {
	/*
	* We're not a root i/o, so there's nothing to do
	* but notify our parent. Don't propagate errors
	* upward since we haven't permanently failed yet.
	*/
	zio->io_flags |= ZIO_FLAG_DONT_PROPAGATE;
	zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
	} else if (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND) {
	/*
	* We'd fail again if we reexecuted now, so suspend
	* until conditions improve (e.g. device comes online).
	*/
	zio_suspend(spa, zio);
	} else {
	/*
	* Reexecution is potentially a huge amount of work.
	* Hand it off to the otherwise-unused claim taskq.
	*/
	(void) taskq_dispatch_safe(
	spa->spa_zio_taskq[ZIO_TYPE_CLAIM][ZIO_TASKQ_ISSUE],
	(task_func_t *)zio_reexecute, zio, &zio->io_task);
	}
	return (ZIO_PIPELINE_STOP);
	}

	ASSERT(zio->io_child == NULL);
	ASSERT(zio->io_reexecute == 0);
	ASSERT(zio->io_error == 0 || (zio->io_flags & ZIO_FLAG_CANFAIL));

	if (zio->io_done)
	zio->io_done(zio);

	zio_gang_tree_free(&zio->io_gang_tree);

	ASSERT(zio->io_delegate_list == NULL);
	ASSERT(zio->io_delegate_next == NULL);

	if (pio != NULL) {
	zio_remove_child(pio, zio);
	zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
	}

	if (zio->io_waiter != NULL) {
	mutex_enter(&zio->io_lock);
	zio->io_executor = NULL;
	cv_broadcast(&zio->io_cv);
	mutex_exit(&zio->io_lock);
	} else {
	zio_destroy(zio);
	}

	return (ZIO_PIPELINE_STOP);
	}

	/*
	* ==========================================================================
	* I/O pipeline definition
	* ==========================================================================
	*/
	static zio_pipe_stage_t *zio_pipeline[ZIO_STAGES] = {
	NULL,
	zio_issue_async,
	zio_read_bp_init,
	zio_write_bp_init,
	zio_checksum_generate,
	zio_gang_assemble,
	zio_gang_issue,
	zio_dva_allocate,
	zio_dva_free,
	zio_dva_claim,
	zio_ready,
	zio_vdev_io_start,
	zio_vdev_io_done,
	zio_vdev_io_assess,
	zio_checksum_verify,
	zio_done
	};
	Index: stable/8/sys/cddl/contrib/opensolaris
	===================================================================
	--- stable/8/sys/cddl/contrib/opensolaris (revision 209273)
	+++ stable/8/sys/cddl/contrib/opensolaris (revision 209274)

	Property changes on: stable/8/sys/cddl/contrib/opensolaris
	___________________________________________________________________
	Modified: svn:mergeinfo
	## -0,0 +0,1 ##
	Merged /head/sys/cddl/contrib/opensolaris:r209093-209101
	Index: stable/8/sys/contrib/dev/acpica
	===================================================================
	--- stable/8/sys/contrib/dev/acpica (revision 209273)
	+++ stable/8/sys/contrib/dev/acpica (revision 209274)

	Property changes on: stable/8/sys/contrib/dev/acpica
	___________________________________________________________________
	Modified: svn:mergeinfo
	## -0,0 +0,1 ##
	Merged /head/sys/contrib/dev/acpica:r209093-209101
	Index: stable/8/sys/contrib/pf
	===================================================================
	--- stable/8/sys/contrib/pf (revision 209273)
	+++ stable/8/sys/contrib/pf (revision 209274)

	Property changes on: stable/8/sys/contrib/pf
	___________________________________________________________________
	Modified: svn:mergeinfo
	## -0,0 +0,1 ##
	Merged /head/sys/contrib/pf:r209093-209101
	Index: stable/8/sys/dev/xen/xenpci
	===================================================================
	--- stable/8/sys/dev/xen/xenpci (revision 209273)
	+++ stable/8/sys/dev/xen/xenpci (revision 209274)

	Property changes on: stable/8/sys/dev/xen/xenpci
	___________________________________________________________________
	Modified: svn:mergeinfo
	## -0,0 +0,1 ##
	Merged /head/sys/dev/xen/xenpci:r209093-209101
	Index: stable/8/sys/geom/sched
	===================================================================
	--- stable/8/sys/geom/sched (revision 209273)
	+++ stable/8/sys/geom/sched (revision 209274)

	Property changes on: stable/8/sys/geom/sched
	___________________________________________________________________
	Modified: svn:mergeinfo
	## -0,0 +0,1 ##
	Merged /head/sys/geom/sched:r209093-209101
	Index: stable/8/sys
	===================================================================
	--- stable/8/sys (revision 209273)
	+++ stable/8/sys (revision 209274)

	Property changes on: stable/8/sys
	___________________________________________________________________
	Modified: svn:mergeinfo
	## -0,0 +0,1 ##
	Merged /head/sys:r209093-209101

