Index: stable/8/sys/amd64/include/xen
===================================================================
--- stable/8/sys/amd64/include/xen (revision 209273)
+++ stable/8/sys/amd64/include/xen (revision 209274)
Property changes on: stable/8/sys/amd64/include/xen
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/amd64/include/xen:r209093-209101
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 209274)
@@ -1,5023 +1,5023 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/*
* DVA-based Adjustable Replacement Cache
*
* While much of the theory of operation used here is
* based on the self-tuning, low overhead replacement cache
* presented by Megiddo and Modha at FAST 2003, there are some
* significant differences:
*
* 1. The Megiddo and Modha model assumes any page is evictable.
* Pages in its cache cannot be "locked" into memory. This makes
* the eviction algorithm simple: evict the last page in the list.
* This also makes the performance characteristics easy to reason
* about. Our cache is not so simple. At any given moment, some
* subset of the blocks in the cache are un-evictable because we
* have handed out a reference to them. Blocks are only evictable
* when there are no external references active. This makes
* eviction far more problematic: we choose to evict the evictable
* blocks that are the "lowest" in the list.
*
* There are times when it is not possible to evict the requested
* space. In these circumstances we are unable to adjust the cache
* size. To prevent the cache growing unbounded at these times we
* implement a "cache throttle" that slows the flow of new data
* into the cache until we can make space available.
*
* 2. The Megiddo and Modha model assumes a fixed cache size.
* Pages are evicted when the cache is full and there is a cache
* miss. Our model has a variable sized cache. It grows with
* high use, but also tries to react to memory pressure from the
* operating system: decreasing its size when system memory is
* tight.
*
* 3. The Megiddo and Modha model assumes a fixed page size. All
* elements of the cache are therefore exactly the same size. So
* when adjusting the cache size following a cache miss, it's simply
* a matter of choosing a single page to evict. In our model, we
* have variable sized cache blocks (ranging from 512 bytes to
* 128K bytes). We therefore choose a set of blocks to evict to make
* space for a cache miss that approximates as closely as possible
* the space used by the new block.
*
* See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
* by N. Megiddo & D. Modha, FAST 2003
*/
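/*
* Illustration only, not part of arc.c: a minimal sketch of point 1 and
* point 3 above, assuming a plain array ordered coldest-first. The names
* demo_blk_t and demo_evict() are hypothetical. Referenced blocks are
* skipped, and variable-sized blocks are evicted until the requested
* number of bytes has been freed or the list is exhausted.
*/

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cached block; not the real arc_buf_hdr_t. */
typedef struct demo_blk {
	size_t	size;		/* block size in bytes (variable) */
	int	refs;		/* external references; > 0 means un-evictable */
	bool	evicted;	/* set once the block has been evicted */
} demo_blk_t;

/*
 * Evict the "lowest" evictable blocks until at least `want` bytes are
 * freed. blks[0] is assumed to be the coldest entry. Returns the number
 * of bytes actually freed, which may be less than `want` when too many
 * blocks are referenced, or more when the last block is large.
 */
static size_t
demo_evict(demo_blk_t *blks, size_t n, size_t want)
{
	size_t freed = 0;

	assert(blks != NULL || n == 0);
	for (size_t i = 0; i < n && freed < want; i++) {
		if (blks[i].refs > 0 || blks[i].evicted)
			continue;	/* skip blocks we cannot evict */
		blks[i].evicted = true;
		freed += blks[i].size;
	}
	return (freed);
}
```

/*
* A caller that gets back fewer bytes than it asked for is in the
* situation the comment above calls out: the cache cannot shrink right
* now, so new data has to be throttled instead.
*/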
/*
* The locking model:
*
* A new reference to a cache buffer can be obtained in two
* ways: 1) via a hash table lookup using the DVA as a key,
* or 2) via one of the ARC lists. The arc_read() interface
* uses method 1, while the internal arc algorithms for
* adjusting the cache use method 2. We therefore provide two
* types of locks: 1) the hash table lock array, and 2) the
* arc list locks.
*
* Buffers do not have their own mutexes, rather they rely on the
* hash table mutexes for the bulk of their protection (i.e. most
* fields in the arc_buf_hdr_t are protected by these mutexes).
*
* buf_hash_find() returns the appropriate mutex (held) when it
* locates the requested buffer in the hash table. It returns
* NULL for the mutex if the buffer was not in the table.
*
* buf_hash_remove() expects the appropriate hash mutex to be
* already held before it is invoked.
*
* Each arc state also has a mutex which is used to protect the
* buffer list associated with the state. When attempting to
* obtain a hash table lock while holding an arc list lock you
* must use mutex_tryenter() to avoid deadlock. Also note that
* the active state mutex must be held before the ghost state mutex.
*
* Arc buffers may have an associated eviction callback function.
* This function will be invoked prior to removing the buffer (e.g.
* in arc_do_user_evicts()). Note however that the data associated
* with the buffer may be evicted prior to the callback. The callback
* must be made with *no locks held* (to prevent deadlock). Additionally,
* the users of callbacks must ensure that their private data is
* protected from simultaneous callbacks from arc_buf_evict()
* and arc_do_user_evicts().
*
* Note that the majority of the performance stats are manipulated
* with atomic operations.
*
* The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
*
* - L2ARC buflist creation
* - L2ARC buflist eviction
* - L2ARC write completion, which walks L2ARC buflists
* - ARC header destruction, as it removes from L2ARC buflists
* - ARC header release, as it removes from L2ARC buflists
*/
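/*
* Illustration only, not part of arc.c: a minimal sketch of the
* try-lock rule described above, assuming a toy lock built on C11
* atomics in place of kmutex_t. All demo_* names are hypothetical.
* The list lock is taken first; the hash lock is only ever
* try-acquired while it is held, and the buffer is skipped on failure.
*/

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical toy lock; stands in for a kmutex_t in this sketch. */
typedef struct demo_lock {
	atomic_bool held;
} demo_lock_t;

/* Non-blocking acquire, analogous in spirit to mutex_tryenter(). */
static bool
demo_tryenter(demo_lock_t *l)
{
	/* atomic_exchange returns the old value: false means we got it. */
	return (!atomic_exchange(&l->held, true));
}

static void
demo_exit(demo_lock_t *l)
{
	assert(atomic_load(&l->held));
	atomic_store(&l->held, false);
}

static demo_lock_t demo_list_lock;	/* plays the arc list lock */
static demo_lock_t demo_hash_lock;	/* plays a hash table lock */

/*
 * Visit one buffer from the list side. Returns true if the buffer was
 * processed, false if the hash lock was busy and the buffer was skipped.
 */
static bool
demo_visit_buffer(void)
{
	bool done = false;

	while (!demo_tryenter(&demo_list_lock))
		continue;	/* spin; a real mutex would block here */
	if (demo_tryenter(&demo_hash_lock)) {
		done = true;	/* header fields could be touched here */
		demo_exit(&demo_hash_lock);
	}
	demo_exit(&demo_list_lock);
	return (done);
}
```

/*
* Because the list side never blocks on the hash lock, a thread that
* holds a hash lock and then waits for the list lock cannot form a
* cycle with it.
*/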
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/zio_checksum.h>
#include <sys/zfs_context.h>
#include <sys/arc.h>
#include <sys/refcount.h>
#include <sys/vdev.h>
#ifdef _KERNEL
#include <sys/dnlc.h>
#endif
#include <sys/callb.h>
#include <sys/kstat.h>
#include <sys/sdt.h>
#include <vm/vm_pageout.h>
static kmutex_t arc_reclaim_thr_lock;
static kcondvar_t arc_reclaim_thr_cv; /* used to signal reclaim thr */
static uint8_t arc_thread_exit;
extern int zfs_write_limit_shift;
extern uint64_t zfs_write_limit_max;
extern kmutex_t zfs_write_limit_lock;
#define ARC_REDUCE_DNLC_PERCENT 3
uint_t arc_reduce_dnlc_percent = ARC_REDUCE_DNLC_PERCENT;
typedef enum arc_reclaim_strategy {
ARC_RECLAIM_AGGR, /* Aggressive reclaim strategy */
ARC_RECLAIM_CONS /* Conservative reclaim strategy */
} arc_reclaim_strategy_t;
/* number of seconds before growing cache again */
static int arc_grow_retry = 60;
/* shift of arc_c for calculating both min and max arc_p */
static int arc_p_min_shift = 4;
/* log2(fraction of arc to reclaim) */
static int arc_shrink_shift = 5;
/*
* minimum lifespan of a prefetch block in clock ticks
* (initialized in arc_init())
*/
static int arc_min_prefetch_lifespan;
static int arc_dead;
extern int zfs_prefetch_disable;
/*
* The arc has filled available memory and has now warmed up.
*/
static boolean_t arc_warm;
/*
* These tunables are for performance analysis.
*/
uint64_t zfs_arc_max;
uint64_t zfs_arc_min;
uint64_t zfs_arc_meta_limit = 0;
int zfs_mdcomp_disable = 0;
int zfs_arc_grow_retry = 0;
int zfs_arc_shrink_shift = 0;
int zfs_arc_p_min_shift = 0;
TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
TUNABLE_INT("vfs.zfs.mdcomp_disable", &zfs_mdcomp_disable);
SYSCTL_DECL(_vfs_zfs);
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
"Maximum ARC size");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
"Minimum ARC size");
SYSCTL_INT(_vfs_zfs, OID_AUTO, mdcomp_disable, CTLFLAG_RDTUN,
&zfs_mdcomp_disable, 0, "Disable metadata compression");
/*
* Note that buffers can be in one of 6 states:
* ARC_anon - anonymous (discussed below)
* ARC_mru - recently used, currently cached
* ARC_mru_ghost - recently used, no longer in cache
* ARC_mfu - frequently used, currently cached
* ARC_mfu_ghost - frequently used, no longer in cache
* ARC_l2c_only - exists in L2ARC but not other states
* When there are no active references to a buffer, it is
* linked onto a list in one of these arc states. These are
* the only buffers that can be evicted or deleted. Within each
* state there are multiple lists, one for meta-data and one for
* non-meta-data. Meta-data (indirect blocks, blocks of dnodes,
* etc.) is tracked separately so that it can be managed more
* explicitly: favored over data, limited explicitly.
*
* Anonymous buffers are buffers that are not associated with
* a DVA. These are buffers that hold dirty block copies
* before they are written to stable storage. By definition,
* they are "ref'd" and are considered part of arc_mru
* that cannot be freed. Generally, they will acquire a DVA
* as they are written and migrate onto the arc_mru list.
*
* The ARC_l2c_only state is for buffers that are in the second
* level ARC but no longer in any of the ARC_m* lists. The second
* level ARC itself may also contain buffers that are in any of
* the ARC_m* states - meaning that a buffer can exist in two
* places. The reason for the ARC_l2c_only state is to keep the
* buffer header in the hash table, so that reads that hit the
* second level ARC benefit from these fast lookups.
*/
#define ARCS_LOCK_PAD CACHE_LINE_SIZE
struct arcs_lock {
kmutex_t arcs_lock;
#ifdef _KERNEL
unsigned char pad[(ARCS_LOCK_PAD - sizeof (kmutex_t))];
#endif
};
/*
* must be power of two for mask use to work
*
*/
#define ARC_BUFC_NUMDATALISTS 16
#define ARC_BUFC_NUMMETADATALISTS 16
#define ARC_BUFC_NUMLISTS (ARC_BUFC_NUMMETADATALISTS + ARC_BUFC_NUMDATALISTS)
typedef struct arc_state {
uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
uint64_t arcs_size; /* total amount of data in this state */
list_t arcs_lists[ARC_BUFC_NUMLISTS]; /* list of evictable buffers */
struct arcs_lock arcs_locks[ARC_BUFC_NUMLISTS] __aligned(CACHE_LINE_SIZE);
} arc_state_t;
#define ARCS_LOCK(s, i) (&((s)->arcs_locks[(i)].arcs_lock))
/* The 6 states: */
static arc_state_t ARC_anon;
static arc_state_t ARC_mru;
static arc_state_t ARC_mru_ghost;
static arc_state_t ARC_mfu;
static arc_state_t ARC_mfu_ghost;
static arc_state_t ARC_l2c_only;
typedef struct arc_stats {
kstat_named_t arcstat_hits;
kstat_named_t arcstat_misses;
kstat_named_t arcstat_demand_data_hits;
kstat_named_t arcstat_demand_data_misses;
kstat_named_t arcstat_demand_metadata_hits;
kstat_named_t arcstat_demand_metadata_misses;
kstat_named_t arcstat_prefetch_data_hits;
kstat_named_t arcstat_prefetch_data_misses;
kstat_named_t arcstat_prefetch_metadata_hits;
kstat_named_t arcstat_prefetch_metadata_misses;
kstat_named_t arcstat_mru_hits;
kstat_named_t arcstat_mru_ghost_hits;
kstat_named_t arcstat_mfu_hits;
kstat_named_t arcstat_mfu_ghost_hits;
kstat_named_t arcstat_allocated;
kstat_named_t arcstat_deleted;
kstat_named_t arcstat_stolen;
kstat_named_t arcstat_recycle_miss;
kstat_named_t arcstat_mutex_miss;
kstat_named_t arcstat_evict_skip;
kstat_named_t arcstat_evict_l2_cached;
kstat_named_t arcstat_evict_l2_eligible;
kstat_named_t arcstat_evict_l2_ineligible;
kstat_named_t arcstat_hash_elements;
kstat_named_t arcstat_hash_elements_max;
kstat_named_t arcstat_hash_collisions;
kstat_named_t arcstat_hash_chains;
kstat_named_t arcstat_hash_chain_max;
kstat_named_t arcstat_p;
kstat_named_t arcstat_c;
kstat_named_t arcstat_c_min;
kstat_named_t arcstat_c_max;
kstat_named_t arcstat_size;
kstat_named_t arcstat_hdr_size;
kstat_named_t arcstat_data_size;
kstat_named_t arcstat_other_size;
kstat_named_t arcstat_l2_hits;
kstat_named_t arcstat_l2_misses;
kstat_named_t arcstat_l2_feeds;
kstat_named_t arcstat_l2_rw_clash;
kstat_named_t arcstat_l2_read_bytes;
kstat_named_t arcstat_l2_write_bytes;
kstat_named_t arcstat_l2_writes_sent;
kstat_named_t arcstat_l2_writes_done;
kstat_named_t arcstat_l2_writes_error;
kstat_named_t arcstat_l2_writes_hdr_miss;
kstat_named_t arcstat_l2_evict_lock_retry;
kstat_named_t arcstat_l2_evict_reading;
kstat_named_t arcstat_l2_free_on_write;
kstat_named_t arcstat_l2_abort_lowmem;
kstat_named_t arcstat_l2_cksum_bad;
kstat_named_t arcstat_l2_io_error;
kstat_named_t arcstat_l2_size;
kstat_named_t arcstat_l2_hdr_size;
kstat_named_t arcstat_memory_throttle_count;
kstat_named_t arcstat_l2_write_trylock_fail;
kstat_named_t arcstat_l2_write_passed_headroom;
kstat_named_t arcstat_l2_write_spa_mismatch;
kstat_named_t arcstat_l2_write_in_l2;
kstat_named_t arcstat_l2_write_hdr_io_in_progress;
kstat_named_t arcstat_l2_write_not_cacheable;
kstat_named_t arcstat_l2_write_full;
kstat_named_t arcstat_l2_write_buffer_iter;
kstat_named_t arcstat_l2_write_pios;
kstat_named_t arcstat_l2_write_buffer_bytes_scanned;
kstat_named_t arcstat_l2_write_buffer_list_iter;
kstat_named_t arcstat_l2_write_buffer_list_null_iter;
} arc_stats_t;
static arc_stats_t arc_stats = {
{ "hits", KSTAT_DATA_UINT64 },
{ "misses", KSTAT_DATA_UINT64 },
{ "demand_data_hits", KSTAT_DATA_UINT64 },
{ "demand_data_misses", KSTAT_DATA_UINT64 },
{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
{ "mru_hits", KSTAT_DATA_UINT64 },
{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
{ "mfu_hits", KSTAT_DATA_UINT64 },
{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
{ "allocated", KSTAT_DATA_UINT64 },
{ "deleted", KSTAT_DATA_UINT64 },
{ "stolen", KSTAT_DATA_UINT64 },
{ "recycle_miss", KSTAT_DATA_UINT64 },
{ "mutex_miss", KSTAT_DATA_UINT64 },
{ "evict_skip", KSTAT_DATA_UINT64 },
{ "evict_l2_cached", KSTAT_DATA_UINT64 },
{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
{ "hash_elements", KSTAT_DATA_UINT64 },
{ "hash_elements_max", KSTAT_DATA_UINT64 },
{ "hash_collisions", KSTAT_DATA_UINT64 },
{ "hash_chains", KSTAT_DATA_UINT64 },
{ "hash_chain_max", KSTAT_DATA_UINT64 },
{ "p", KSTAT_DATA_UINT64 },
{ "c", KSTAT_DATA_UINT64 },
{ "c_min", KSTAT_DATA_UINT64 },
{ "c_max", KSTAT_DATA_UINT64 },
{ "size", KSTAT_DATA_UINT64 },
{ "hdr_size", KSTAT_DATA_UINT64 },
{ "data_size", KSTAT_DATA_UINT64 },
{ "other_size", KSTAT_DATA_UINT64 },
{ "l2_hits", KSTAT_DATA_UINT64 },
{ "l2_misses", KSTAT_DATA_UINT64 },
{ "l2_feeds", KSTAT_DATA_UINT64 },
{ "l2_rw_clash", KSTAT_DATA_UINT64 },
{ "l2_read_bytes", KSTAT_DATA_UINT64 },
{ "l2_write_bytes", KSTAT_DATA_UINT64 },
{ "l2_writes_sent", KSTAT_DATA_UINT64 },
{ "l2_writes_done", KSTAT_DATA_UINT64 },
{ "l2_writes_error", KSTAT_DATA_UINT64 },
{ "l2_writes_hdr_miss", KSTAT_DATA_UINT64 },
{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
{ "l2_evict_reading", KSTAT_DATA_UINT64 },
{ "l2_free_on_write", KSTAT_DATA_UINT64 },
{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
{ "l2_io_error", KSTAT_DATA_UINT64 },
{ "l2_size", KSTAT_DATA_UINT64 },
{ "l2_hdr_size", KSTAT_DATA_UINT64 },
{ "memory_throttle_count", KSTAT_DATA_UINT64 },
{ "l2_write_trylock_fail", KSTAT_DATA_UINT64 },
{ "l2_write_passed_headroom", KSTAT_DATA_UINT64 },
{ "l2_write_spa_mismatch", KSTAT_DATA_UINT64 },
{ "l2_write_in_l2", KSTAT_DATA_UINT64 },
{ "l2_write_io_in_progress", KSTAT_DATA_UINT64 },
{ "l2_write_not_cacheable", KSTAT_DATA_UINT64 },
{ "l2_write_full", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_iter", KSTAT_DATA_UINT64 },
{ "l2_write_pios", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_bytes_scanned", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_list_iter", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_list_null_iter", KSTAT_DATA_UINT64 }
};
#define ARCSTAT(stat) (arc_stats.stat.value.ui64)
#define ARCSTAT_INCR(stat, val) \
atomic_add_64(&arc_stats.stat.value.ui64, (val));
#define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1)
#define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1)
#define ARCSTAT_MAX(stat, val) { \
uint64_t m; \
while ((val) > (m = arc_stats.stat.value.ui64) && \
(m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
continue; \
}
#define ARCSTAT_MAXSTAT(stat) \
ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
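/*
* Illustration only, not part of arc.c: a minimal sketch of the
* compare-and-swap "store if larger" loop that ARCSTAT_MAX expresses,
* rewritten with C11 <stdatomic.h> instead of atomic_cas_64(). The
* names demo_stat_max and demo_stat_update_max() are hypothetical.
*/

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t demo_stat_max;	/* hypothetical high-water mark */

/* Raise demo_stat_max to val if val is larger; never lower it. */
static void
demo_stat_update_max(uint64_t val)
{
	uint64_t cur = atomic_load(&demo_stat_max);

	/*
	 * Retry only while val still exceeds the stored value. When the
	 * exchange fails, cur is refreshed with the current contents, so
	 * the comparison is re-evaluated against up-to-date data.
	 */
	while (val > cur &&
	    !atomic_compare_exchange_weak(&demo_stat_max, &cur, val))
		continue;

	/* The stored maximum only grows, so it is now at least val. */
	assert(atomic_load(&demo_stat_max) >= val);
}
```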
/*
* We define a macro to allow ARC hits/misses to be easily broken down by
* two separate conditions, giving a total of four different subtypes for
* each of hits and misses (so eight statistics total).
*/
#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
if (cond1) { \
if (cond2) { \
ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
} else { \
ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
} \
} else { \
if (cond2) { \
ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
} else { \
ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
} \
}
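/*
* Illustration only, not part of arc.c: a minimal sketch of how the
* token pasting in a macro shaped like ARCSTAT_CONDSTAT selects one of
* four counters. Plain (non-atomic) counters and the demo_* names are
* assumptions made to keep the example small.
*/

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Four hypothetical counters, one per (demand|prefetch) x (data|metadata). */
static uint64_t demo_demand_data_hits;
static uint64_t demo_demand_metadata_hits;
static uint64_t demo_prefetch_data_hits;
static uint64_t demo_prefetch_metadata_hits;

#define	DEMO_BUMP(stat)	((stat)++)

/* Paste the chosen name fragments together to form the counter name. */
#define	DEMO_CONDSTAT(cond1, s1, ns1, cond2, s2, ns2, stat)		\
	do {								\
		if (cond1) {						\
			if (cond2)					\
				DEMO_BUMP(demo_##s1##_##s2##_##stat);	\
			else						\
				DEMO_BUMP(demo_##s1##_##ns2##_##stat);	\
		} else {						\
			if (cond2)					\
				DEMO_BUMP(demo_##ns1##_##s2##_##stat);	\
			else						\
				DEMO_BUMP(demo_##ns1##_##ns2##_##stat);	\
		}							\
	} while (0)

/* Record one cache hit under the matching pair of conditions. */
static void
demo_record_hit(bool is_prefetch, bool is_metadata)
{
	DEMO_CONDSTAT(!is_prefetch, demand, prefetch,
	    !is_metadata, data, metadata, hits);
}
```

/*
* The sketch wraps the body in do { } while (0) so the macro behaves as
* a single statement; that wrapper is a choice made here, not something
* the macro above does.
*/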
kstat_t *arc_ksp;
static arc_state_t *arc_anon;
static arc_state_t *arc_mru;
static arc_state_t *arc_mru_ghost;
static arc_state_t *arc_mfu;
static arc_state_t *arc_mfu_ghost;
static arc_state_t *arc_l2c_only;
/*
* There are several ARC variables that are critical to export as kstats --
* but we don't want to have to grovel around in the kstat whenever we wish to
* manipulate them. For these variables, we therefore define them to be in
* terms of the statistic variable. This assures that we are not introducing
* the possibility of inconsistency by having shadow copies of the variables,
* while still allowing the code to be readable.
*/
#define arc_size ARCSTAT(arcstat_size) /* actual total arc size */
#define arc_p ARCSTAT(arcstat_p) /* target size of MRU */
#define arc_c ARCSTAT(arcstat_c) /* target size of cache */
#define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */
#define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */
static int arc_no_grow; /* Don't try to grow cache size */
static uint64_t arc_tempreserve;
static uint64_t arc_meta_used;
static uint64_t arc_meta_limit;
static uint64_t arc_meta_max = 0;
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_meta_used, CTLFLAG_RDTUN,
&arc_meta_used, 0, "ARC metadata used");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_meta_limit, CTLFLAG_RDTUN,
&arc_meta_limit, 0, "ARC metadata limit");
typedef struct l2arc_buf_hdr l2arc_buf_hdr_t;
typedef struct arc_callback arc_callback_t;
struct arc_callback {
void *acb_private;
arc_done_func_t *acb_done;
arc_buf_t *acb_buf;
zio_t *acb_zio_dummy;
arc_callback_t *acb_next;
};
typedef struct arc_write_callback arc_write_callback_t;
struct arc_write_callback {
void *awcb_private;
arc_done_func_t *awcb_ready;
arc_done_func_t *awcb_done;
arc_buf_t *awcb_buf;
};
struct arc_buf_hdr {
/* protected by hash lock */
dva_t b_dva;
uint64_t b_birth;
uint64_t b_cksum0;
kmutex_t b_freeze_lock;
zio_cksum_t *b_freeze_cksum;
arc_buf_hdr_t *b_hash_next;
arc_buf_t *b_buf;
uint32_t b_flags;
uint32_t b_datacnt;
arc_callback_t *b_acb;
kcondvar_t b_cv;
/* immutable */
arc_buf_contents_t b_type;
uint64_t b_size;
spa_t *b_spa;
/* protected by arc state mutex */
arc_state_t *b_state;
list_node_t b_arc_node;
/* updated atomically */
clock_t b_arc_access;
/* self protecting */
refcount_t b_refcnt;
l2arc_buf_hdr_t *b_l2hdr;
list_node_t b_l2node;
};
static arc_buf_t *arc_eviction_list;
static kmutex_t arc_eviction_mtx;
static arc_buf_hdr_t arc_eviction_hdr;
static void arc_get_data_buf(arc_buf_t *buf);
static void arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock);
static int arc_evict_needed(arc_buf_contents_t type);
static void arc_evict_ghost(arc_state_t *state, spa_t *spa, int64_t bytes);
static boolean_t l2arc_write_eligible(spa_t *spa, arc_buf_hdr_t *ab);
#define GHOST_STATE(state) \
((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
(state) == arc_l2c_only)
/*
* Private ARC flags. These are ARC-private flags that will show up
* in b_flags in the arc_buf_hdr_t. Some flags are publicly declared, and can
* be passed in as arc_flags in things like arc_read. However, these flags
* should never be passed and should only be set by ARC code. When adding new
* public flags, make sure not to smash the private ones.
*/
#define ARC_IN_HASH_TABLE (1 << 9) /* this buffer is hashed */
#define ARC_IO_IN_PROGRESS (1 << 10) /* I/O in progress for buf */
#define ARC_IO_ERROR (1 << 11) /* I/O failed for buf */
#define ARC_FREED_IN_READ (1 << 12) /* buf freed while in read */
#define ARC_BUF_AVAILABLE (1 << 13) /* block not in active use */
#define ARC_INDIRECT (1 << 14) /* this is an indirect block */
#define ARC_FREE_IN_PROGRESS (1 << 15) /* hdr about to be freed */
#define ARC_L2_WRITING (1 << 16) /* L2ARC write in progress */
#define ARC_L2_EVICTED (1 << 17) /* evicted during I/O */
#define ARC_L2_WRITE_HEAD (1 << 18) /* head of write list */
#define ARC_STORED (1 << 19) /* has been store()d to */
#define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_IN_HASH_TABLE)
#define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS)
#define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_IO_ERROR)
#define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_PREFETCH)
#define HDR_FREED_IN_READ(hdr) ((hdr)->b_flags & ARC_FREED_IN_READ)
#define HDR_BUF_AVAILABLE(hdr) ((hdr)->b_flags & ARC_BUF_AVAILABLE)
#define HDR_FREE_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FREE_IN_PROGRESS)
#define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_L2CACHE)
#define HDR_L2_READING(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS && \
(hdr)->b_l2hdr != NULL)
#define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_L2_WRITING)
#define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_L2_EVICTED)
#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_L2_WRITE_HEAD)
/*
* Other sizes
*/
#define HDR_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
#define L2HDR_SIZE ((int64_t)sizeof (l2arc_buf_hdr_t))
/*
* Hash table routines
*/
#define HT_LOCK_PAD CACHE_LINE_SIZE
struct ht_lock {
kmutex_t ht_lock;
#ifdef _KERNEL
unsigned char pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};
#define BUF_LOCKS 256
typedef struct buf_hash_table {
uint64_t ht_mask;
arc_buf_hdr_t **ht_table;
struct ht_lock ht_locks[BUF_LOCKS] __aligned(CACHE_LINE_SIZE);
} buf_hash_table_t;
static buf_hash_table_t buf_hash_table;
#define BUF_HASH_INDEX(spa, dva, birth) \
(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
#define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
#define HDR_LOCK(buf) \
(BUF_HASH_LOCK(BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth)))
uint64_t zfs_crc64_table[256];
/*
* Level 2 ARC
*/
#define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
#define L2ARC_HEADROOM 2 /* num of writes */
#define L2ARC_FEED_SECS 1 /* caching interval secs */
#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */
#define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent)
#define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done)
/*
* L2ARC Performance Tunables
*/
uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* default max write size */
uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra write during warmup */
uint64_t l2arc_headroom = L2ARC_HEADROOM; /* number of dev writes */
uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
boolean_t l2arc_noprefetch = B_FALSE; /* don't cache prefetch bufs */
boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_max, CTLFLAG_RW,
&l2arc_write_max, 0, "max write size");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_boost, CTLFLAG_RW,
&l2arc_write_boost, 0, "extra write during warmup");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_headroom, CTLFLAG_RW,
&l2arc_headroom, 0, "number of dev writes");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_secs, CTLFLAG_RW,
&l2arc_feed_secs, 0, "interval seconds");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_min_ms, CTLFLAG_RW,
&l2arc_feed_min_ms, 0, "min interval milliseconds");
SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_noprefetch, CTLFLAG_RW,
&l2arc_noprefetch, 0, "don't cache prefetch bufs");
SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_feed_again, CTLFLAG_RW,
&l2arc_feed_again, 0, "turbo warmup");
SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_norw, CTLFLAG_RW,
&l2arc_norw, 0, "no reads during writes");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_size, CTLFLAG_RD,
&ARC_anon.arcs_size, 0, "size of anonymous state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_metadata_lsize, CTLFLAG_RD,
&ARC_anon.arcs_lsize[ARC_BUFC_METADATA], 0,
"size of metadata in anonymous state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_data_lsize, CTLFLAG_RD,
&ARC_anon.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in anonymous state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_size, CTLFLAG_RD,
&ARC_mru.arcs_size, 0, "size of mru state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_metadata_lsize, CTLFLAG_RD,
&ARC_mru.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mru state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_data_lsize, CTLFLAG_RD,
&ARC_mru.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mru state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_size, CTLFLAG_RD,
&ARC_mru_ghost.arcs_size, 0, "size of mru ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_metadata_lsize, CTLFLAG_RD,
&ARC_mru_ghost.arcs_lsize[ARC_BUFC_METADATA], 0,
"size of metadata in mru ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_data_lsize, CTLFLAG_RD,
&ARC_mru_ghost.arcs_lsize[ARC_BUFC_DATA], 0,
"size of data in mru ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_size, CTLFLAG_RD,
&ARC_mfu.arcs_size, 0, "size of mfu state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_metadata_lsize, CTLFLAG_RD,
&ARC_mfu.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mfu state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_data_lsize, CTLFLAG_RD,
&ARC_mfu.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mfu state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_size, CTLFLAG_RD,
&ARC_mfu_ghost.arcs_size, 0, "size of mfu ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_metadata_lsize, CTLFLAG_RD,
&ARC_mfu_ghost.arcs_lsize[ARC_BUFC_METADATA], 0,
"size of metadata in mfu ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_data_lsize, CTLFLAG_RD,
&ARC_mfu_ghost.arcs_lsize[ARC_BUFC_DATA], 0,
"size of data in mfu ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2c_only_size, CTLFLAG_RD,
&ARC_l2c_only.arcs_size, 0, "size of l2c_only state");
/*
* L2ARC Internals
*/
typedef struct l2arc_dev {
vdev_t *l2ad_vdev; /* vdev */
spa_t *l2ad_spa; /* spa */
uint64_t l2ad_hand; /* next write location */
uint64_t l2ad_write; /* desired write size, bytes */
uint64_t l2ad_boost; /* warmup write boost, bytes */
uint64_t l2ad_start; /* first addr on device */
uint64_t l2ad_end; /* last addr on device */
uint64_t l2ad_evict; /* last addr eviction reached */
boolean_t l2ad_first; /* first sweep through */
boolean_t l2ad_writing; /* currently writing */
list_t *l2ad_buflist; /* buffer list */
list_node_t l2ad_node; /* device list node */
} l2arc_dev_t;
static list_t L2ARC_dev_list; /* device list */
static list_t *l2arc_dev_list; /* device list pointer */
static kmutex_t l2arc_dev_mtx; /* device list mutex */
static l2arc_dev_t *l2arc_dev_last; /* last device used */
static kmutex_t l2arc_buflist_mtx; /* mutex for all buflists */
static list_t L2ARC_free_on_write; /* free after write buf list */
static list_t *l2arc_free_on_write; /* free after write list ptr */
static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
static uint64_t l2arc_ndev; /* number of devices */
typedef struct l2arc_read_callback {
arc_buf_t *l2rcb_buf; /* read buffer */
spa_t *l2rcb_spa; /* spa */
blkptr_t l2rcb_bp; /* original blkptr */
zbookmark_t l2rcb_zb; /* original bookmark */
int l2rcb_flags; /* original flags */
} l2arc_read_callback_t;
typedef struct l2arc_write_callback {
l2arc_dev_t *l2wcb_dev; /* device info */
arc_buf_hdr_t *l2wcb_head; /* head of write buflist */
} l2arc_write_callback_t;
struct l2arc_buf_hdr {
/* protected by arc_buf_hdr mutex */
l2arc_dev_t *b_dev; /* L2ARC device */
uint64_t b_daddr; /* disk address, offset byte */
};
typedef struct l2arc_data_free {
/* protected by l2arc_free_on_write_mtx */
void *l2df_data;
size_t l2df_size;
void (*l2df_func)(void *, size_t);
list_node_t l2df_list_node;
} l2arc_data_free_t;
static kmutex_t l2arc_feed_thr_lock;
static kcondvar_t l2arc_feed_thr_cv;
static uint8_t l2arc_thread_exit;
static void l2arc_read_done(zio_t *zio);
static void l2arc_hdr_stat_add(void);
static void l2arc_hdr_stat_remove(void);
static uint64_t
buf_hash(spa_t *spa, const dva_t *dva, uint64_t birth)
{
uintptr_t spav = (uintptr_t)spa;
uint8_t *vdva = (uint8_t *)dva;
uint64_t crc = -1ULL;
int i;
ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
for (i = 0; i < sizeof (dva_t); i++)
crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
crc ^= (spav>>8) ^ birth;
return (crc);
}
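/*
* Illustration only, not part of arc.c: a minimal, self-contained
* sketch of a table-driven CRC-64 key hash in the style of buf_hash().
* The polynomial constant and the table-building loop are assumptions
* of this sketch (the table is filled elsewhere in the real code), and
* all demo_* names are hypothetical.
*/

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed polynomial: ECMA-182 in reflected form. */
#define	DEMO_CRC64_POLY	0xC96C5795D7870F42ULL

static uint64_t demo_crc64_table[256];

/* Build the 256-entry lookup table, one byte value per entry. */
static void
demo_crc64_init(void)
{
	for (int i = 0; i < 256; i++) {
		uint64_t c = (uint64_t)i;

		for (int j = 0; j < 8; j++)
			c = (c >> 1) ^ (-(c & 1) & DEMO_CRC64_POLY);
		demo_crc64_table[i] = c;
	}
}

/*
 * Hash `len` key bytes one byte at a time, then fold in a salt, the
 * way buf_hash() folds in the spa pointer and birth txg.
 */
static uint64_t
demo_key_hash(const uint8_t *key, size_t len, uint64_t salt)
{
	uint64_t crc = ~0ULL;

	/* Same sanity check as above: catches an uninitialized table. */
	assert(demo_crc64_table[128] == DEMO_CRC64_POLY);
	for (size_t i = 0; i < len; i++)
		crc = (crc >> 8) ^ demo_crc64_table[(crc ^ key[i]) & 0xFF];
	return (crc ^ salt);
}
```

/*
* Masking the result with (table_size - 1), as BUF_HASH_INDEX does with
* ht_mask, only yields a valid bucket index when the table size is a
* power of two.
*/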
#define BUF_EMPTY(buf) \
((buf)->b_dva.dva_word[0] == 0 && \
(buf)->b_dva.dva_word[1] == 0 && \
(buf)->b_birth == 0)
#define BUF_EQUAL(spa, dva, birth, buf) \
((buf)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
((buf)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
((buf)->b_birth == birth) && ((buf)->b_spa == spa)
static arc_buf_hdr_t *
buf_hash_find(spa_t *spa, const dva_t *dva, uint64_t birth, kmutex_t **lockp)
{
uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
arc_buf_hdr_t *buf;
mutex_enter(hash_lock);
for (buf = buf_hash_table.ht_table[idx]; buf != NULL;
buf = buf->b_hash_next) {
if (BUF_EQUAL(spa, dva, birth, buf)) {
*lockp = hash_lock;
return (buf);
}
}
mutex_exit(hash_lock);
*lockp = NULL;
return (NULL);
}
/*
* Insert an entry into the hash table. If there is already an element
* equal to elem in the hash table, then the already existing element
* will be returned and the new element will not be inserted.
* Otherwise returns NULL.
*/
static arc_buf_hdr_t *
buf_hash_insert(arc_buf_hdr_t *buf, kmutex_t **lockp)
{
uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
arc_buf_hdr_t *fbuf;
uint32_t i;
ASSERT(!HDR_IN_HASH_TABLE(buf));
*lockp = hash_lock;
mutex_enter(hash_lock);
for (fbuf = buf_hash_table.ht_table[idx], i = 0; fbuf != NULL;
fbuf = fbuf->b_hash_next, i++) {
if (BUF_EQUAL(buf->b_spa, &buf->b_dva, buf->b_birth, fbuf))
return (fbuf);
}
buf->b_hash_next = buf_hash_table.ht_table[idx];
buf_hash_table.ht_table[idx] = buf;
buf->b_flags |= ARC_IN_HASH_TABLE;
/* collect some hash table performance data */
if (i > 0) {
ARCSTAT_BUMP(arcstat_hash_collisions);
if (i == 1)
ARCSTAT_BUMP(arcstat_hash_chains);
ARCSTAT_MAX(arcstat_hash_chain_max, i);
}
ARCSTAT_BUMP(arcstat_hash_elements);
ARCSTAT_MAXSTAT(arcstat_hash_elements);
return (NULL);
}
static void
buf_hash_remove(arc_buf_hdr_t *buf)
{
arc_buf_hdr_t *fbuf, **bufp;
uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
ASSERT(HDR_IN_HASH_TABLE(buf));
bufp = &buf_hash_table.ht_table[idx];
while ((fbuf = *bufp) != buf) {
ASSERT(fbuf != NULL);
bufp = &fbuf->b_hash_next;
}
*bufp = buf->b_hash_next;
buf->b_hash_next = NULL;
buf->b_flags &= ~ARC_IN_HASH_TABLE;
/* collect some hash table performance data */
ARCSTAT_BUMPDOWN(arcstat_hash_elements);
if (buf_hash_table.ht_table[idx] &&
buf_hash_table.ht_table[idx]->b_hash_next == NULL)
ARCSTAT_BUMPDOWN(arcstat_hash_chains);
}
/*
* Global data structures and functions for the buf kmem cache.
*/
static kmem_cache_t *hdr_cache;
static kmem_cache_t *buf_cache;
static void
buf_fini(void)
{
int i;
kmem_free(buf_hash_table.ht_table,
(buf_hash_table.ht_mask + 1) * sizeof (void *));
for (i = 0; i < BUF_LOCKS; i++)
mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
kmem_cache_destroy(hdr_cache);
kmem_cache_destroy(buf_cache);
}
/*
* Constructor callback - called when the cache is empty
* and a new buf is requested.
*/
/* ARGSUSED */
static int
hdr_cons(void *vbuf, void *unused, int kmflag)
{
arc_buf_hdr_t *buf = vbuf;
bzero(buf, sizeof (arc_buf_hdr_t));
refcount_create(&buf->b_refcnt);
cv_init(&buf->b_cv, NULL, CV_DEFAULT, NULL);
mutex_init(&buf->b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
arc_space_consume(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
return (0);
}
/* ARGSUSED */
static int
buf_cons(void *vbuf, void *unused, int kmflag)
{
arc_buf_t *buf = vbuf;
bzero(buf, sizeof (arc_buf_t));
rw_init(&buf->b_lock, NULL, RW_DEFAULT, NULL);
arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
return (0);
}
/*
* Destructor callback - called when a cached buf is
* no longer required.
*/
/* ARGSUSED */
static void
hdr_dest(void *vbuf, void *unused)
{
arc_buf_hdr_t *buf = vbuf;
refcount_destroy(&buf->b_refcnt);
cv_destroy(&buf->b_cv);
mutex_destroy(&buf->b_freeze_lock);
arc_space_return(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
}
/* ARGSUSED */
static void
buf_dest(void *vbuf, void *unused)
{
arc_buf_t *buf = vbuf;
rw_destroy(&buf->b_lock);
arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
}
/*
* Reclaim callback -- invoked when memory is low.
*/
/* ARGSUSED */
static void
hdr_recl(void *unused)
{
dprintf("hdr_recl called\n");
/*
* umem calls the reclaim func when we destroy the buf cache,
* which is after we do arc_fini().
*/
if (!arc_dead)
cv_signal(&arc_reclaim_thr_cv);
}
static void
buf_init(void)
{
uint64_t *ct;
uint64_t hsize = 1ULL << 12;
int i, j;
/*
* The hash table is big enough to fill all of physical memory
* with an average 64K block size. The table will take up
* totalmem*sizeof(void*)/64K (eg. 128KB/GB with 8-byte pointers).
*/
while (hsize * 65536 < (uint64_t)physmem * PAGESIZE)
hsize <<= 1;
retry:
buf_hash_table.ht_mask = hsize - 1;
buf_hash_table.ht_table =
kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
if (buf_hash_table.ht_table == NULL) {
ASSERT(hsize > (1ULL << 8));
hsize >>= 1;
goto retry;
}
hdr_cache = kmem_cache_create("arc_buf_hdr_t", sizeof (arc_buf_hdr_t),
0, hdr_cons, hdr_dest, hdr_recl, NULL, NULL, 0);
buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
for (i = 0; i < 256; i++)
for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
for (i = 0; i < BUF_LOCKS; i++) {
mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
NULL, MUTEX_DEFAULT, NULL);
}
}
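/*
* Worked example of the sizing loop in buf_init(), as a sketch only
* (the 4 GB machine size is an assumption, not taken from the source):
* with physmem * PAGESIZE == 2^32 bytes, hsize starts at 2^12 and
* doubles while hsize * 2^16 < 2^32, so it stops at 2^16 == 65536
* buckets. With 8-byte pointers the table occupies 65536 * 8 == 512 KB,
* i.e. 128 KB per GB of physical memory, which agrees with the estimate
* in the comment inside buf_init(). If kmem_zalloc() fails, the retry
* path halves hsize and tries again.
*/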
#define ARC_MINTIME (hz>>4) /* 62 ms */
static void
arc_cksum_verify(arc_buf_t *buf)
{
zio_cksum_t zc;
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
return;
mutex_enter(&buf->b_hdr->b_freeze_lock);
if (buf->b_hdr->b_freeze_cksum == NULL ||
(buf->b_hdr->b_flags & ARC_IO_ERROR)) {
mutex_exit(&buf->b_hdr->b_freeze_lock);
return;
}
fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
panic("buffer modified while frozen!");
mutex_exit(&buf->b_hdr->b_freeze_lock);
}
static int
arc_cksum_equal(arc_buf_t *buf)
{
zio_cksum_t zc;
int equal;
mutex_enter(&buf->b_hdr->b_freeze_lock);
fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
mutex_exit(&buf->b_hdr->b_freeze_lock);
return (equal);
}
static void
arc_cksum_compute(arc_buf_t *buf, boolean_t force)
{
if (!force && !(zfs_flags & ZFS_DEBUG_MODIFY))
return;
mutex_enter(&buf->b_hdr->b_freeze_lock);
if (buf->b_hdr->b_freeze_cksum != NULL) {
mutex_exit(&buf->b_hdr->b_freeze_lock);
return;
}
buf->b_hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
fletcher_2_native(buf->b_data, buf->b_hdr->b_size,
buf->b_hdr->b_freeze_cksum);
mutex_exit(&buf->b_hdr->b_freeze_lock);
}
void
arc_buf_thaw(arc_buf_t *buf)
{
if (zfs_flags & ZFS_DEBUG_MODIFY) {
if (buf->b_hdr->b_state != arc_anon)
panic("modifying non-anon buffer!");
if (buf->b_hdr->b_flags & ARC_IO_IN_PROGRESS)
panic("modifying buffer while i/o in progress!");
arc_cksum_verify(buf);
}
mutex_enter(&buf->b_hdr->b_freeze_lock);
if (buf->b_hdr->b_freeze_cksum != NULL) {
kmem_free(buf->b_hdr->b_freeze_cksum, sizeof (zio_cksum_t));
buf->b_hdr->b_freeze_cksum = NULL;
}
mutex_exit(&buf->b_hdr->b_freeze_lock);
}
void
arc_buf_freeze(arc_buf_t *buf)
{
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
return;
ASSERT(buf->b_hdr->b_freeze_cksum != NULL ||
buf->b_hdr->b_state == arc_anon);
arc_cksum_compute(buf, B_FALSE);
}
static void
get_buf_info(arc_buf_hdr_t *ab, arc_state_t *state, list_t **list, kmutex_t **lock)
{
uint64_t buf_hashid = buf_hash(ab->b_spa, &ab->b_dva, ab->b_birth);
if (ab->b_type == ARC_BUFC_METADATA)
buf_hashid &= (ARC_BUFC_NUMMETADATALISTS - 1);
else {
buf_hashid &= (ARC_BUFC_NUMDATALISTS - 1);
buf_hashid += ARC_BUFC_NUMMETADATALISTS;
}
*list = &state->arcs_lists[buf_hashid];
*lock = ARCS_LOCK(state, buf_hashid);
}
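/*
* Illustration of the list selection in get_buf_info(), assuming for
* the sake of example that ARC_BUFC_NUMMETADATALISTS and
* ARC_BUFC_NUMDATALISTS are both 16 (the real values are defined
* elsewhere, and the masking is only correct if they are powers of
* two): a hash of 42 selects list 42 & 15 == 10 for metadata, and
* list (42 & 15) + 16 == 26 for data, because the data lists are
* stored after the metadata lists in arcs_lists[].
*/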
static void
add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
{
ASSERT(MUTEX_HELD(hash_lock));
if ((refcount_add(&ab->b_refcnt, tag) == 1) &&
(ab->b_state != arc_anon)) {
uint64_t delta = ab->b_size * ab->b_datacnt;
uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type];
list_t *list;
kmutex_t *lock;
get_buf_info(ab, ab->b_state, &list, &lock);
ASSERT(!MUTEX_HELD(lock));
mutex_enter(lock);
ASSERT(list_link_active(&ab->b_arc_node));
list_remove(list, ab);
if (GHOST_STATE(ab->b_state)) {
ASSERT3U(ab->b_datacnt, ==, 0);
ASSERT3P(ab->b_buf, ==, NULL);
delta = ab->b_size;
}
ASSERT(delta > 0);
ASSERT3U(*size, >=, delta);
atomic_add_64(size, -delta);
mutex_exit(lock);
/* remove the prefetch flag if we get a reference */
if (ab->b_flags & ARC_PREFETCH)
ab->b_flags &= ~ARC_PREFETCH;
}
}
static int
remove_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
{
int cnt;
arc_state_t *state = ab->b_state;
ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
ASSERT(!GHOST_STATE(state));
if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) &&
(state != arc_anon)) {
uint64_t *size = &state->arcs_lsize[ab->b_type];
list_t *list;
kmutex_t *lock;
get_buf_info(ab, state, &list, &lock);
ASSERT(!MUTEX_HELD(lock));
mutex_enter(lock);
ASSERT(!list_link_active(&ab->b_arc_node));
list_insert_head(list, ab);
ASSERT(ab->b_datacnt > 0);
atomic_add_64(size, ab->b_size * ab->b_datacnt);
mutex_exit(lock);
}
return (cnt);
}
/*
* Move the supplied buffer to the indicated state. The mutex
* for the buffer must be held by the caller.
*/
static void
arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *ab, kmutex_t *hash_lock)
{
arc_state_t *old_state = ab->b_state;
int64_t refcnt = refcount_count(&ab->b_refcnt);
uint64_t from_delta, to_delta;
list_t *list;
kmutex_t *lock;
ASSERT(MUTEX_HELD(hash_lock));
ASSERT(new_state != old_state);
ASSERT(refcnt == 0 || ab->b_datacnt > 0);
ASSERT(ab->b_datacnt == 0 || !GHOST_STATE(new_state));
from_delta = to_delta = ab->b_datacnt * ab->b_size;
/*
* If this buffer is evictable, transfer it from the
* old state list to the new state list.
*/
if (refcnt == 0) {
if (old_state != arc_anon) {
int use_mutex;
uint64_t *size = &old_state->arcs_lsize[ab->b_type];
get_buf_info(ab, old_state, &list, &lock);
use_mutex = !MUTEX_HELD(lock);
if (use_mutex)
mutex_enter(lock);
ASSERT(list_link_active(&ab->b_arc_node));
list_remove(list, ab);
/*
* If prefetching out of the ghost cache,
* we will have a non-null datacnt.
*/
if (GHOST_STATE(old_state) && ab->b_datacnt == 0) {
/* ghost elements have a ghost size */
ASSERT(ab->b_buf == NULL);
from_delta = ab->b_size;
}
ASSERT3U(*size, >=, from_delta);
atomic_add_64(size, -from_delta);
if (use_mutex)
mutex_exit(lock);
}
if (new_state != arc_anon) {
int use_mutex;
uint64_t *size = &new_state->arcs_lsize[ab->b_type];
get_buf_info(ab, new_state, &list, &lock);
use_mutex = !MUTEX_HELD(lock);
if (use_mutex)
mutex_enter(lock);
list_insert_head(list, ab);
/* ghost elements have a ghost size */
if (GHOST_STATE(new_state)) {
ASSERT(ab->b_datacnt == 0);
ASSERT(ab->b_buf == NULL);
to_delta = ab->b_size;
}
atomic_add_64(size, to_delta);
if (use_mutex)
mutex_exit(lock);
}
}
ASSERT(!BUF_EMPTY(ab));
if (new_state == arc_anon) {
buf_hash_remove(ab);
}
/* adjust state sizes */
if (to_delta)
atomic_add_64(&new_state->arcs_size, to_delta);
if (from_delta) {
ASSERT3U(old_state->arcs_size, >=, from_delta);
atomic_add_64(&old_state->arcs_size, -from_delta);
}
ab->b_state = new_state;
/* adjust l2arc hdr stats */
if (new_state == arc_l2c_only)
l2arc_hdr_stat_add();
else if (old_state == arc_l2c_only)
l2arc_hdr_stat_remove();
}
void
arc_space_consume(uint64_t space, arc_space_type_t type)
{
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
switch (type) {
case ARC_SPACE_DATA:
ARCSTAT_INCR(arcstat_data_size, space);
break;
case ARC_SPACE_OTHER:
ARCSTAT_INCR(arcstat_other_size, space);
break;
case ARC_SPACE_HDRS:
ARCSTAT_INCR(arcstat_hdr_size, space);
break;
case ARC_SPACE_L2HDRS:
ARCSTAT_INCR(arcstat_l2_hdr_size, space);
break;
}
atomic_add_64(&arc_meta_used, space);
atomic_add_64(&arc_size, space);
}
void
arc_space_return(uint64_t space, arc_space_type_t type)
{
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
switch (type) {
case ARC_SPACE_DATA:
ARCSTAT_INCR(arcstat_data_size, -space);
break;
case ARC_SPACE_OTHER:
ARCSTAT_INCR(arcstat_other_size, -space);
break;
case ARC_SPACE_HDRS:
ARCSTAT_INCR(arcstat_hdr_size, -space);
break;
case ARC_SPACE_L2HDRS:
ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
break;
}
ASSERT(arc_meta_used >= space);
if (arc_meta_max < arc_meta_used)
arc_meta_max = arc_meta_used;
atomic_add_64(&arc_meta_used, -space);
ASSERT(arc_size >= space);
atomic_add_64(&arc_size, -space);
}
void *
arc_data_buf_alloc(uint64_t size)
{
if (arc_evict_needed(ARC_BUFC_DATA))
cv_signal(&arc_reclaim_thr_cv);
atomic_add_64(&arc_size, size);
return (zio_data_buf_alloc(size));
}
void
arc_data_buf_free(void *buf, uint64_t size)
{
zio_data_buf_free(buf, size);
ASSERT(arc_size >= size);
atomic_add_64(&arc_size, -size);
}
arc_buf_t *
arc_buf_alloc(spa_t *spa, int size, void *tag, arc_buf_contents_t type)
{
arc_buf_hdr_t *hdr;
arc_buf_t *buf;
ASSERT3U(size, >, 0);
hdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
ASSERT(BUF_EMPTY(hdr));
hdr->b_size = size;
hdr->b_type = type;
hdr->b_spa = spa;
hdr->b_state = arc_anon;
hdr->b_arc_access = 0;
buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
buf->b_hdr = hdr;
buf->b_data = NULL;
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_next = NULL;
hdr->b_buf = buf;
arc_get_data_buf(buf);
hdr->b_datacnt = 1;
hdr->b_flags = 0;
ASSERT(refcount_is_zero(&hdr->b_refcnt));
(void) refcount_add(&hdr->b_refcnt, tag);
return (buf);
}
static arc_buf_t *
arc_buf_clone(arc_buf_t *from)
{
arc_buf_t *buf;
arc_buf_hdr_t *hdr = from->b_hdr;
uint64_t size = hdr->b_size;
buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
buf->b_hdr = hdr;
buf->b_data = NULL;
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_next = hdr->b_buf;
hdr->b_buf = buf;
arc_get_data_buf(buf);
bcopy(from->b_data, buf->b_data, size);
hdr->b_datacnt += 1;
return (buf);
}
void
arc_buf_add_ref(arc_buf_t *buf, void* tag)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_lock;
/*
* Check to see if this buffer is evicted. Callers
* must verify b_data != NULL to know if the add_ref
* was successful.
*/
rw_enter(&buf->b_lock, RW_READER);
if (buf->b_data == NULL) {
rw_exit(&buf->b_lock);
return;
}
hdr = buf->b_hdr;
ASSERT(hdr != NULL);
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
rw_exit(&buf->b_lock);
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
add_reference(hdr, hash_lock, tag);
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
arc_access(hdr, hash_lock);
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_hits);
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
data, metadata, hits);
}
/*
* Free the arc data buffer. If it is an l2arc write in progress,
* the buffer is placed on l2arc_free_on_write to be freed later.
*/
static void
arc_buf_data_free(arc_buf_hdr_t *hdr, void (*free_func)(void *, size_t),
void *data, size_t size)
{
if (HDR_L2_WRITING(hdr)) {
l2arc_data_free_t *df;
df = kmem_alloc(sizeof (l2arc_data_free_t), KM_SLEEP);
df->l2df_data = data;
df->l2df_size = size;
df->l2df_func = free_func;
mutex_enter(&l2arc_free_on_write_mtx);
list_insert_head(l2arc_free_on_write, df);
mutex_exit(&l2arc_free_on_write_mtx);
ARCSTAT_BUMP(arcstat_l2_free_on_write);
} else {
free_func(data, size);
}
}
static void
arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all)
{
arc_buf_t **bufp;
/* free up data associated with the buf */
if (buf->b_data) {
arc_state_t *state = buf->b_hdr->b_state;
uint64_t size = buf->b_hdr->b_size;
arc_buf_contents_t type = buf->b_hdr->b_type;
arc_cksum_verify(buf);
if (!recycle) {
if (type == ARC_BUFC_METADATA) {
arc_buf_data_free(buf->b_hdr, zio_buf_free,
buf->b_data, size);
arc_space_return(size, ARC_SPACE_DATA);
} else {
ASSERT(type == ARC_BUFC_DATA);
arc_buf_data_free(buf->b_hdr,
zio_data_buf_free, buf->b_data, size);
ARCSTAT_INCR(arcstat_data_size, -size);
atomic_add_64(&arc_size, -size);
}
}
if (list_link_active(&buf->b_hdr->b_arc_node)) {
uint64_t *cnt = &state->arcs_lsize[type];
ASSERT(refcount_is_zero(&buf->b_hdr->b_refcnt));
ASSERT(state != arc_anon);
ASSERT3U(*cnt, >=, size);
atomic_add_64(cnt, -size);
}
ASSERT3U(state->arcs_size, >=, size);
atomic_add_64(&state->arcs_size, -size);
buf->b_data = NULL;
ASSERT(buf->b_hdr->b_datacnt > 0);
buf->b_hdr->b_datacnt -= 1;
}
/* only remove the buf if requested */
if (!all)
return;
/* remove the buf from the hdr list */
for (bufp = &buf->b_hdr->b_buf; *bufp != buf; bufp = &(*bufp)->b_next)
continue;
*bufp = buf->b_next;
ASSERT(buf->b_efunc == NULL);
/* clean up the buf */
buf->b_hdr = NULL;
kmem_cache_free(buf_cache, buf);
}
static void
arc_hdr_destroy(arc_buf_hdr_t *hdr)
{
ASSERT(refcount_is_zero(&hdr->b_refcnt));
ASSERT3P(hdr->b_state, ==, arc_anon);
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
ASSERT(!(hdr->b_flags & ARC_STORED));
if (hdr->b_l2hdr != NULL) {
if (!MUTEX_HELD(&l2arc_buflist_mtx)) {
/*
* To prevent arc_free() and l2arc_evict() from
* attempting to free the same buffer at the same time,
* a FREE_IN_PROGRESS flag is given to arc_free() to
* give it priority. l2arc_evict() can't destroy this
* header while we are waiting on l2arc_buflist_mtx.
*
* The hdr may be removed from l2ad_buflist before we
* grab l2arc_buflist_mtx, so b_l2hdr is rechecked.
*/
mutex_enter(&l2arc_buflist_mtx);
if (hdr->b_l2hdr != NULL) {
list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist,
hdr);
}
mutex_exit(&l2arc_buflist_mtx);
} else {
list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist, hdr);
}
ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
kmem_free(hdr->b_l2hdr, sizeof (l2arc_buf_hdr_t));
if (hdr->b_state == arc_l2c_only)
l2arc_hdr_stat_remove();
hdr->b_l2hdr = NULL;
}
if (!BUF_EMPTY(hdr)) {
ASSERT(!HDR_IN_HASH_TABLE(hdr));
bzero(&hdr->b_dva, sizeof (dva_t));
hdr->b_birth = 0;
hdr->b_cksum0 = 0;
}
while (hdr->b_buf) {
arc_buf_t *buf = hdr->b_buf;
if (buf->b_efunc) {
mutex_enter(&arc_eviction_mtx);
rw_enter(&buf->b_lock, RW_WRITER);
ASSERT(buf->b_hdr != NULL);
arc_buf_destroy(hdr->b_buf, FALSE, FALSE);
hdr->b_buf = buf->b_next;
buf->b_hdr = &arc_eviction_hdr;
buf->b_next = arc_eviction_list;
arc_eviction_list = buf;
rw_exit(&buf->b_lock);
mutex_exit(&arc_eviction_mtx);
} else {
arc_buf_destroy(hdr->b_buf, FALSE, TRUE);
}
}
if (hdr->b_freeze_cksum != NULL) {
kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
hdr->b_freeze_cksum = NULL;
}
ASSERT(!list_link_active(&hdr->b_arc_node));
ASSERT3P(hdr->b_hash_next, ==, NULL);
ASSERT3P(hdr->b_acb, ==, NULL);
kmem_cache_free(hdr_cache, hdr);
}
void
arc_buf_free(arc_buf_t *buf, void *tag)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
int hashed = hdr->b_state != arc_anon;
ASSERT(buf->b_efunc == NULL);
ASSERT(buf->b_data != NULL);
if (hashed) {
kmutex_t *hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
(void) remove_reference(hdr, hash_lock, tag);
if (hdr->b_datacnt > 1)
arc_buf_destroy(buf, FALSE, TRUE);
else
hdr->b_flags |= ARC_BUF_AVAILABLE;
mutex_exit(hash_lock);
} else if (HDR_IO_IN_PROGRESS(hdr)) {
int destroy_hdr;
/*
* We are in the middle of an async write. Don't destroy
* this buffer unless the write completes before we finish
* decrementing the reference count.
*/
mutex_enter(&arc_eviction_mtx);
(void) remove_reference(hdr, NULL, tag);
ASSERT(refcount_is_zero(&hdr->b_refcnt));
destroy_hdr = !HDR_IO_IN_PROGRESS(hdr);
mutex_exit(&arc_eviction_mtx);
if (destroy_hdr)
arc_hdr_destroy(hdr);
} else {
if (remove_reference(hdr, NULL, tag) > 0) {
ASSERT(HDR_IO_ERROR(hdr));
arc_buf_destroy(buf, FALSE, TRUE);
} else {
arc_hdr_destroy(hdr);
}
}
}
int
arc_buf_remove_ref(arc_buf_t *buf, void* tag)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
kmutex_t *hash_lock = HDR_LOCK(hdr);
int no_callback = (buf->b_efunc == NULL);
if (hdr->b_state == arc_anon) {
arc_buf_free(buf, tag);
return (no_callback);
}
mutex_enter(hash_lock);
ASSERT(hdr->b_state != arc_anon);
ASSERT(buf->b_data != NULL);
(void) remove_reference(hdr, hash_lock, tag);
if (hdr->b_datacnt > 1) {
if (no_callback)
arc_buf_destroy(buf, FALSE, TRUE);
} else if (no_callback) {
ASSERT(hdr->b_buf == buf && buf->b_next == NULL);
hdr->b_flags |= ARC_BUF_AVAILABLE;
}
ASSERT(no_callback || hdr->b_datacnt > 1 ||
refcount_is_zero(&hdr->b_refcnt));
mutex_exit(hash_lock);
return (no_callback);
}
int
arc_buf_size(arc_buf_t *buf)
{
return (buf->b_hdr->b_size);
}
/*
* Evict buffers from list until we've removed the specified number of
* bytes. Move the removed buffers to the appropriate evict state.
* If the recycle flag is set, then attempt to "recycle" a buffer:
* - look for a buffer to evict that is `bytes' long.
* - return the data block from this buffer rather than freeing it.
* This flag is used by callers that are trying to make space for a
* new buffer in a full arc cache.
*
* This function makes a "best effort". It skips over any buffers
* it can't get a hash_lock on, and so may not catch all candidates.
* It may also return without evicting as much space as requested.
*/
static void *
arc_evict(arc_state_t *state, spa_t *spa, int64_t bytes, boolean_t recycle,
arc_buf_contents_t type)
{
arc_state_t *evicted_state;
uint64_t bytes_evicted = 0, skipped = 0, missed = 0;
int64_t bytes_remaining;
arc_buf_hdr_t *ab, *ab_prev = NULL;
list_t *evicted_list, *list, *evicted_list_start, *list_start;
kmutex_t *lock, *evicted_lock;
kmutex_t *hash_lock;
boolean_t have_lock;
void *stolen = NULL;
static int evict_metadata_offset, evict_data_offset;
int i, idx, offset, list_count, count;
ASSERT(state == arc_mru || state == arc_mfu);
evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
if (type == ARC_BUFC_METADATA) {
offset = 0;
list_count = ARC_BUFC_NUMMETADATALISTS;
list_start = &state->arcs_lists[0];
evicted_list_start = &evicted_state->arcs_lists[0];
idx = evict_metadata_offset;
} else {
offset = ARC_BUFC_NUMMETADATALISTS;
list_start = &state->arcs_lists[offset];
evicted_list_start = &evicted_state->arcs_lists[offset];
list_count = ARC_BUFC_NUMDATALISTS;
idx = evict_data_offset;
}
bytes_remaining = evicted_state->arcs_lsize[type];
count = 0;
evict_start:
list = &list_start[idx];
evicted_list = &evicted_list_start[idx];
lock = ARCS_LOCK(state, (offset + idx));
evicted_lock = ARCS_LOCK(evicted_state, (offset + idx));
mutex_enter(lock);
mutex_enter(evicted_lock);
for (ab = list_tail(list); ab; ab = ab_prev) {
ab_prev = list_prev(list, ab);
bytes_remaining -= (ab->b_size * ab->b_datacnt);
/* prefetch buffers have a minimum lifespan */
if (HDR_IO_IN_PROGRESS(ab) ||
(spa && ab->b_spa != spa) ||
(ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
skipped++;
continue;
}
/* "lookahead" for better eviction candidate */
if (recycle && ab->b_size != bytes &&
ab_prev && ab_prev->b_size == bytes)
continue;
hash_lock = HDR_LOCK(ab);
have_lock = MUTEX_HELD(hash_lock);
if (have_lock || mutex_tryenter(hash_lock)) {
ASSERT3U(refcount_count(&ab->b_refcnt), ==, 0);
ASSERT(ab->b_datacnt > 0);
while (ab->b_buf) {
arc_buf_t *buf = ab->b_buf;
if (!rw_tryenter(&buf->b_lock, RW_WRITER)) {
missed += 1;
break;
}
if (buf->b_data) {
bytes_evicted += ab->b_size;
if (recycle && ab->b_type == type &&
ab->b_size == bytes &&
!HDR_L2_WRITING(ab)) {
stolen = buf->b_data;
recycle = FALSE;
}
}
if (buf->b_efunc) {
mutex_enter(&arc_eviction_mtx);
arc_buf_destroy(buf,
buf->b_data == stolen, FALSE);
ab->b_buf = buf->b_next;
buf->b_hdr = &arc_eviction_hdr;
buf->b_next = arc_eviction_list;
arc_eviction_list = buf;
mutex_exit(&arc_eviction_mtx);
rw_exit(&buf->b_lock);
} else {
rw_exit(&buf->b_lock);
arc_buf_destroy(buf,
buf->b_data == stolen, TRUE);
}
}
if (ab->b_l2hdr) {
ARCSTAT_INCR(arcstat_evict_l2_cached,
ab->b_size);
} else {
if (l2arc_write_eligible(ab->b_spa, ab)) {
ARCSTAT_INCR(arcstat_evict_l2_eligible,
ab->b_size);
} else {
ARCSTAT_INCR(
arcstat_evict_l2_ineligible,
ab->b_size);
}
}
if (ab->b_datacnt == 0) {
arc_change_state(evicted_state, ab, hash_lock);
ASSERT(HDR_IN_HASH_TABLE(ab));
ab->b_flags |= ARC_IN_HASH_TABLE;
ab->b_flags &= ~ARC_BUF_AVAILABLE;
DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, ab);
}
if (!have_lock)
mutex_exit(hash_lock);
if (bytes >= 0 && bytes_evicted >= bytes)
break;
if (bytes_remaining > 0) {
mutex_exit(evicted_lock);
mutex_exit(lock);
idx = ((idx + 1) & (list_count - 1));
count++;
goto evict_start;
}
} else {
missed += 1;
}
}
mutex_exit(evicted_lock);
mutex_exit(lock);
idx = ((idx + 1) & (list_count - 1));
count++;
if (bytes_evicted < bytes) {
if (count < list_count)
goto evict_start;
else
dprintf("only evicted %lld bytes from %p",
(longlong_t)bytes_evicted, state);
}
if (type == ARC_BUFC_METADATA)
evict_metadata_offset = idx;
else
evict_data_offset = idx;
if (skipped)
ARCSTAT_INCR(arcstat_evict_skip, skipped);
if (missed)
ARCSTAT_INCR(arcstat_mutex_miss, missed);
/*
* We have just evicted some data into the ghost state, so make
* sure we also adjust the ghost state size if necessary.
*/
if (arc_no_grow &&
arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size > arc_c) {
int64_t mru_over = arc_anon->arcs_size + arc_mru->arcs_size +
arc_mru_ghost->arcs_size - arc_c;
if (mru_over > 0 && arc_mru_ghost->arcs_lsize[type] > 0) {
int64_t todelete =
MIN(arc_mru_ghost->arcs_lsize[type], mru_over);
arc_evict_ghost(arc_mru_ghost, NULL, todelete);
} else if (arc_mfu_ghost->arcs_lsize[type] > 0) {
int64_t todelete = MIN(arc_mfu_ghost->arcs_lsize[type],
arc_mru_ghost->arcs_size +
arc_mfu_ghost->arcs_size - arc_c);
arc_evict_ghost(arc_mfu_ghost, NULL, todelete);
}
}
if (stolen)
ARCSTAT_BUMP(arcstat_stolen);
return (stolen);
}
/*
* Remove buffers from list until we've removed the specified number of
* bytes. Destroy the buffers that are removed.
*/
static void
arc_evict_ghost(arc_state_t *state, spa_t *spa, int64_t bytes)
{
arc_buf_hdr_t *ab, *ab_prev;
list_t *list, *list_start;
kmutex_t *hash_lock, *lock;
uint64_t bytes_deleted = 0;
uint64_t bufs_skipped = 0;
static int evict_offset;
int list_count, idx = evict_offset;
int offset, count = 0;
ASSERT(GHOST_STATE(state));
/*
* data lists come after metadata lists
*/
list_start = &state->arcs_lists[ARC_BUFC_NUMMETADATALISTS];
list_count = ARC_BUFC_NUMDATALISTS;
offset = ARC_BUFC_NUMMETADATALISTS;
evict_start:
list = &list_start[idx];
lock = ARCS_LOCK(state, idx + offset);
mutex_enter(lock);
for (ab = list_tail(list); ab; ab = ab_prev) {
ab_prev = list_prev(list, ab);
if (spa && ab->b_spa != spa)
continue;
hash_lock = HDR_LOCK(ab);
if (mutex_tryenter(hash_lock)) {
ASSERT(!HDR_IO_IN_PROGRESS(ab));
ASSERT(ab->b_buf == NULL);
ARCSTAT_BUMP(arcstat_deleted);
bytes_deleted += ab->b_size;
if (ab->b_l2hdr != NULL) {
/*
* This buffer is cached on the 2nd Level ARC;
* don't destroy the header.
*/
arc_change_state(arc_l2c_only, ab, hash_lock);
mutex_exit(hash_lock);
} else {
arc_change_state(arc_anon, ab, hash_lock);
mutex_exit(hash_lock);
arc_hdr_destroy(ab);
}
DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, ab);
if (bytes >= 0 && bytes_deleted >= bytes)
break;
} else {
if (bytes < 0) {
/*
* we're draining the ARC, retry
*/
mutex_exit(lock);
mutex_enter(hash_lock);
mutex_exit(hash_lock);
goto evict_start;
}
bufs_skipped += 1;
}
}
mutex_exit(lock);
idx = ((idx + 1) & (ARC_BUFC_NUMDATALISTS - 1));
count++;
if (count < list_count)
goto evict_start;
evict_offset = idx;
if ((uintptr_t)list > (uintptr_t)&state->arcs_lists[ARC_BUFC_NUMMETADATALISTS] &&
(bytes < 0 || bytes_deleted < bytes)) {
list_start = &state->arcs_lists[0];
list_count = ARC_BUFC_NUMMETADATALISTS;
offset = count = 0;
goto evict_start;
}
if (bufs_skipped) {
ARCSTAT_INCR(arcstat_mutex_miss, bufs_skipped);
ASSERT(bytes >= 0);
}
if (bytes_deleted < bytes)
dprintf("only deleted %lld bytes from %p",
(longlong_t)bytes_deleted, state);
}
static void
arc_adjust(void)
{
int64_t adjustment, delta;
/*
* Adjust MRU size
*/
adjustment = MIN(arc_size - arc_c,
arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used - arc_p);
if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
(void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
adjustment -= delta;
}
if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) {
delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment);
(void) arc_evict(arc_mru, NULL, delta, FALSE,
ARC_BUFC_METADATA);
}
/*
* Adjust MFU size
*/
adjustment = arc_size - arc_c;
if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) {
delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]);
(void) arc_evict(arc_mfu, NULL, delta, FALSE, ARC_BUFC_DATA);
adjustment -= delta;
}
if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) {
int64_t delta = MIN(adjustment,
arc_mfu->arcs_lsize[ARC_BUFC_METADATA]);
(void) arc_evict(arc_mfu, NULL, delta, FALSE,
ARC_BUFC_METADATA);
}
/*
* Adjust ghost lists
*/
adjustment = arc_mru->arcs_size + arc_mru_ghost->arcs_size - arc_c;
if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
delta = MIN(arc_mru_ghost->arcs_size, adjustment);
arc_evict_ghost(arc_mru_ghost, NULL, delta);
}
adjustment =
arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size - arc_c;
if (adjustment > 0 && arc_mfu_ghost->arcs_size > 0) {
delta = MIN(arc_mfu_ghost->arcs_size, adjustment);
arc_evict_ghost(arc_mfu_ghost, NULL, delta);
}
}
static void
arc_do_user_evicts(void)
{
static arc_buf_t *tmp_arc_eviction_list;
/*
* Move list over to avoid LOR
*/
restart:
mutex_enter(&arc_eviction_mtx);
tmp_arc_eviction_list = arc_eviction_list;
arc_eviction_list = NULL;
mutex_exit(&arc_eviction_mtx);
while (tmp_arc_eviction_list != NULL) {
arc_buf_t *buf = tmp_arc_eviction_list;
tmp_arc_eviction_list = buf->b_next;
rw_enter(&buf->b_lock, RW_WRITER);
buf->b_hdr = NULL;
rw_exit(&buf->b_lock);
if (buf->b_efunc != NULL)
VERIFY(buf->b_efunc(buf) == 0);
buf->b_efunc = NULL;
buf->b_private = NULL;
kmem_cache_free(buf_cache, buf);
}
if (arc_eviction_list != NULL)
goto restart;
}
/*
* Flush all *evictable* data from the cache for the given spa.
* NOTE: this will not touch "active" (i.e. referenced) data.
*/
void
arc_flush(spa_t *spa)
{
while (arc_mru->arcs_lsize[ARC_BUFC_DATA]) {
(void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_DATA);
if (spa)
break;
}
while (arc_mru->arcs_lsize[ARC_BUFC_METADATA]) {
(void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_METADATA);
if (spa)
break;
}
while (arc_mfu->arcs_lsize[ARC_BUFC_DATA]) {
(void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_DATA);
if (spa)
break;
}
while (arc_mfu->arcs_lsize[ARC_BUFC_METADATA]) {
(void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_METADATA);
if (spa)
break;
}
arc_evict_ghost(arc_mru_ghost, spa, -1);
arc_evict_ghost(arc_mfu_ghost, spa, -1);
mutex_enter(&arc_reclaim_thr_lock);
arc_do_user_evicts();
mutex_exit(&arc_reclaim_thr_lock);
ASSERT(spa || arc_eviction_list == NULL);
}
void
arc_shrink(void)
{
if (arc_c > arc_c_min) {
uint64_t to_free;
#ifdef _KERNEL
to_free = arc_c >> arc_shrink_shift;
#else
to_free = arc_c >> arc_shrink_shift;
#endif
if (arc_c > arc_c_min + to_free)
atomic_add_64(&arc_c, -to_free);
else
arc_c = arc_c_min;
atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
if (arc_c > arc_size)
arc_c = MAX(arc_size, arc_c_min);
if (arc_p > arc_c)
arc_p = (arc_c >> 1);
ASSERT(arc_c >= arc_c_min);
ASSERT((int64_t)arc_p >= 0);
}
if (arc_size > arc_c)
arc_adjust();
}
static int needfree = 0;
static int
arc_reclaim_needed(void)
{
#if 0
uint64_t extra;
#endif
#ifdef _KERNEL
if (needfree)
return (1);
if (arc_size > arc_c_max)
return (1);
if (arc_size <= arc_c_min)
return (0);
/*
* If pages are needed, or we're within 2048 pages
* of needing to page, we need to reclaim.
*/
if (vm_pages_needed || (vm_paging_target() > -2048))
return (1);
#if 0
/*
* take 'desfree' extra pages, so we reclaim sooner, rather than later
*/
extra = desfree;
/*
* check that we're out of range of the pageout scanner. It starts to
* schedule paging if freemem is less than lotsfree and needfree.
* lotsfree is the high-water mark for pageout, and needfree is the
* number of needed free pages. We add extra pages here to make sure
* the scanner doesn't start up while we're freeing memory.
*/
if (freemem < lotsfree + needfree + extra)
return (1);
/*
* check to make sure that swapfs has enough space so that anon
* reservations can still succeed. anon_resvmem() checks that the
* availrmem is greater than swapfs_minfree, and the number of reserved
* swap pages. We also add a bit of extra here just to prevent
* circumstances from getting really dire.
*/
if (availrmem < swapfs_minfree + swapfs_reserve + extra)
return (1);
#if defined(__i386)
/*
* If we're on an i386 platform, it's possible that we'll exhaust the
* kernel heap space before we ever run out of available physical
* memory. Most checks of the size of the heap_area compare against
* tune.t_minarmem, which is the minimum available real memory that we
* can have in the system. However, this is generally fixed at 25 pages
* which is so low that it's useless. In this comparison, we seek to
* calculate the total heap-size, and reclaim if more than 3/4ths of the
* heap is allocated. (Or, in the calculation, if less than 1/4th is
* free)
*/
if (btop(vmem_size(heap_arena, VMEM_FREE)) <
(btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
return (1);
#endif
#else
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
#endif
#else
if (spa_get_random(100) == 0)
return (1);
#endif
return (0);
}
extern kmem_cache_t *zio_buf_cache[];
extern kmem_cache_t *zio_data_buf_cache[];
static void
arc_kmem_reap_now(arc_reclaim_strategy_t strat)
{
size_t i;
kmem_cache_t *prev_cache = NULL;
kmem_cache_t *prev_data_cache = NULL;
#ifdef _KERNEL
if (arc_meta_used >= arc_meta_limit) {
/*
* We are exceeding our meta-data cache limit.
* Purge some DNLC entries to release holds on meta-data.
*/
dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
}
#if defined(__i386)
/*
* Reclaim unused memory from all kmem caches.
*/
kmem_reap();
#endif
#endif
/*
* An aggressive reclamation will shrink the cache size as well as
* reap free buffers from the arc kmem caches.
*/
if (strat == ARC_RECLAIM_AGGR)
arc_shrink();
for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
if (zio_buf_cache[i] != prev_cache) {
prev_cache = zio_buf_cache[i];
kmem_cache_reap_now(zio_buf_cache[i]);
}
if (zio_data_buf_cache[i] != prev_data_cache) {
prev_data_cache = zio_data_buf_cache[i];
kmem_cache_reap_now(zio_data_buf_cache[i]);
}
}
kmem_cache_reap_now(buf_cache);
kmem_cache_reap_now(hdr_cache);
}
static void
arc_reclaim_thread(void *dummy __unused)
{
clock_t growtime = 0;
arc_reclaim_strategy_t last_reclaim = ARC_RECLAIM_CONS;
callb_cpr_t cpr;
CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG);
mutex_enter(&arc_reclaim_thr_lock);
while (arc_thread_exit == 0) {
if (arc_reclaim_needed()) {
if (arc_no_grow) {
if (last_reclaim == ARC_RECLAIM_CONS) {
last_reclaim = ARC_RECLAIM_AGGR;
} else {
last_reclaim = ARC_RECLAIM_CONS;
}
} else {
arc_no_grow = TRUE;
last_reclaim = ARC_RECLAIM_AGGR;
membar_producer();
}
/* reset the growth delay for every reclaim */
growtime = LBOLT + (arc_grow_retry * hz);
if (needfree && last_reclaim == ARC_RECLAIM_CONS) {
/*
* If needfree is TRUE our vm_lowmem hook
* was called and in that case we must free some
* memory, so switch to aggressive mode.
*/
arc_no_grow = TRUE;
last_reclaim = ARC_RECLAIM_AGGR;
}
arc_kmem_reap_now(last_reclaim);
arc_warm = B_TRUE;
} else if (arc_no_grow && LBOLT >= growtime) {
arc_no_grow = FALSE;
}
if (needfree ||
(2 * arc_c < arc_size +
arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size))
arc_adjust();
if (arc_eviction_list != NULL)
arc_do_user_evicts();
if (arc_reclaim_needed()) {
needfree = 0;
#ifdef _KERNEL
wakeup(&needfree);
#endif
}
/* block until needed, or one second, whichever is shorter */
CALLB_CPR_SAFE_BEGIN(&cpr);
(void) cv_timedwait(&arc_reclaim_thr_cv,
&arc_reclaim_thr_lock, hz);
CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_thr_lock);
}
arc_thread_exit = 0;
cv_broadcast(&arc_reclaim_thr_cv);
CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_thr_lock */
thread_exit();
}
/*
* Adapt arc info given the number of bytes we are trying to add and
* the state that we are coming from. This function is only called
* when we are adding new content to the cache.
*/
static void
arc_adapt(int bytes, arc_state_t *state)
{
int mult;
uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
if (state == arc_l2c_only)
return;
ASSERT(bytes > 0);
/*
* Adapt the target size of the MRU list:
* - if we just hit in the MRU ghost list, then increase
* the target size of the MRU list.
* - if we just hit in the MFU ghost list, then increase
* the target size of the MFU list by decreasing the
* target size of the MRU list.
*/
if (state == arc_mru_ghost) {
mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size));
arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
} else if (state == arc_mfu_ghost) {
uint64_t delta;
mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ?
1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size));
delta = MIN(bytes * mult, arc_p);
arc_p = MAX(arc_p_min, arc_p - delta);
}
ASSERT((int64_t)arc_p >= 0);
if (arc_reclaim_needed()) {
cv_signal(&arc_reclaim_thr_cv);
return;
}
if (arc_no_grow)
return;
if (arc_c >= arc_c_max)
return;
/*
* If we're within (2 * maxblocksize) bytes of the target
* cache size, increment the target cache size
*/
if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
atomic_add_64(&arc_c, (int64_t)bytes);
if (arc_c > arc_c_max)
arc_c = arc_c_max;
else if (state == arc_anon)
atomic_add_64(&arc_p, (int64_t)bytes);
if (arc_p > arc_c)
arc_p = arc_c;
}
ASSERT((int64_t)arc_p >= 0);
}
/*
* Check if the cache has reached its limits and eviction is required
* prior to insert.
*/
static int
arc_evict_needed(arc_buf_contents_t type)
{
if (type == ARC_BUFC_METADATA && arc_meta_used >= arc_meta_limit)
return (1);
#if 0
#ifdef _KERNEL
/*
* If zio data pages are being allocated out of a separate heap segment,
* then enforce that the size of available vmem for this area remains
* above about 1/32nd free.
*/
if (type == ARC_BUFC_DATA && zio_arena != NULL &&
vmem_size(zio_arena, VMEM_FREE) <
(vmem_size(zio_arena, VMEM_ALLOC) >> 5))
return (1);
#endif
#endif
if (arc_reclaim_needed())
return (1);
return (arc_size > arc_c);
}
/*
* The buffer, supplied as the first argument, needs a data block.
* So, if we are at cache max, determine which cache should be victimized.
* We have the following cases:
*
* 1. Insert for MRU, p > sizeof(arc_anon + arc_mru) ->
* In this situation if we're out of space, but the resident size of the MFU is
* under the limit, victimize the MFU cache to satisfy this insertion request.
*
* 2. Insert for MRU, p <= sizeof(arc_anon + arc_mru) ->
* Here, we've used up all of the available space for the MRU, so we need to
* evict from our own cache instead. Evict from the set of resident MRU
* entries.
*
* 3. Insert for MFU (c - p) > sizeof(arc_mfu) ->
* c minus p represents the MFU space in the cache, since p is the size of the
* cache that is dedicated to the MRU. In this situation there's still space on
* the MFU side, so the MRU side needs to be victimized.
*
* 4. Insert for MFU (c - p) < sizeof(arc_mfu) ->
* MFU's resident set is consuming more space than it has been allotted. In
* this situation, we must victimize our own cache, the MFU, for this insertion.
*/
static void
arc_get_data_buf(arc_buf_t *buf)
{
arc_state_t *state = buf->b_hdr->b_state;
uint64_t size = buf->b_hdr->b_size;
arc_buf_contents_t type = buf->b_hdr->b_type;
arc_adapt(size, state);
/*
* We have not yet reached cache maximum size,
* just allocate a new buffer.
*/
if (!arc_evict_needed(type)) {
if (type == ARC_BUFC_METADATA) {
buf->b_data = zio_buf_alloc(size);
arc_space_consume(size, ARC_SPACE_DATA);
} else {
ASSERT(type == ARC_BUFC_DATA);
buf->b_data = zio_data_buf_alloc(size);
ARCSTAT_INCR(arcstat_data_size, size);
atomic_add_64(&arc_size, size);
}
goto out;
}
/*
* If we are prefetching from the mfu ghost list, this buffer
* will end up on the mru list; so steal space from there.
*/
if (state == arc_mfu_ghost)
state = buf->b_hdr->b_flags & ARC_PREFETCH ? arc_mru : arc_mfu;
else if (state == arc_mru_ghost)
state = arc_mru;
if (state == arc_mru || state == arc_anon) {
uint64_t mru_used = arc_anon->arcs_size + arc_mru->arcs_size;
state = (arc_mfu->arcs_lsize[type] >= size &&
arc_p > mru_used) ? arc_mfu : arc_mru;
} else {
/* MFU cases */
uint64_t mfu_space = arc_c - arc_p;
state = (arc_mru->arcs_lsize[type] >= size &&
mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu;
}
if ((buf->b_data = arc_evict(state, NULL, size, TRUE, type)) == NULL) {
if (type == ARC_BUFC_METADATA) {
buf->b_data = zio_buf_alloc(size);
arc_space_consume(size, ARC_SPACE_DATA);
} else {
ASSERT(type == ARC_BUFC_DATA);
buf->b_data = zio_data_buf_alloc(size);
ARCSTAT_INCR(arcstat_data_size, size);
atomic_add_64(&arc_size, size);
}
ARCSTAT_BUMP(arcstat_recycle_miss);
}
ASSERT(buf->b_data != NULL);
out:
/*
* Update the state size. Note that ghost states have a
* "ghost size" and so don't need to be updated.
*/
if (!GHOST_STATE(buf->b_hdr->b_state)) {
arc_buf_hdr_t *hdr = buf->b_hdr;
atomic_add_64(&hdr->b_state->arcs_size, size);
if (list_link_active(&hdr->b_arc_node)) {
ASSERT(refcount_is_zero(&hdr->b_refcnt));
atomic_add_64(&hdr->b_state->arcs_lsize[type], size);
}
/*
* If we are growing the cache, and we are adding anonymous
* data, and we have outgrown arc_p, update arc_p
*/
if (arc_size < arc_c && hdr->b_state == arc_anon &&
arc_anon->arcs_size + arc_mru->arcs_size > arc_p)
arc_p = MIN(arc_c, arc_p + size);
}
ARCSTAT_BUMP(arcstat_allocated);
}
/*
* This routine is called whenever a buffer is accessed.
* NOTE: the hash lock is dropped in this function.
*/
static void
arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock)
{
ASSERT(MUTEX_HELD(hash_lock));
if (buf->b_state == arc_anon) {
/*
* This buffer is not in the cache, and does not
* appear in our "ghost" list. Add the new buffer
* to the MRU state.
*/
ASSERT(buf->b_arc_access == 0);
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
arc_change_state(arc_mru, buf, hash_lock);
} else if (buf->b_state == arc_mru) {
/*
* If this buffer is here because of a prefetch, then either:
* - clear the flag if this is a "referencing" read
* (any subsequent access will bump this into the MFU state).
* or
* - move the buffer to the head of the list if this is
* another prefetch (to make it less likely to be evicted).
*/
if ((buf->b_flags & ARC_PREFETCH) != 0) {
if (refcount_count(&buf->b_refcnt) == 0) {
ASSERT(list_link_active(&buf->b_arc_node));
} else {
buf->b_flags &= ~ARC_PREFETCH;
ARCSTAT_BUMP(arcstat_mru_hits);
}
buf->b_arc_access = LBOLT;
return;
}
/*
* This buffer has been "accessed" only once so far,
* but it is still in the cache. Move it to the MFU
* state.
*/
if (LBOLT > buf->b_arc_access + ARC_MINTIME) {
/*
* More than 125ms have passed since we
* instantiated this buffer. Move it to the
* most frequently used state.
*/
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
arc_change_state(arc_mfu, buf, hash_lock);
}
ARCSTAT_BUMP(arcstat_mru_hits);
} else if (buf->b_state == arc_mru_ghost) {
arc_state_t *new_state;
/*
* This buffer has been "accessed" recently, but
* was evicted from the cache. Move it to the
* MFU state.
*/
if (buf->b_flags & ARC_PREFETCH) {
new_state = arc_mru;
if (refcount_count(&buf->b_refcnt) > 0)
buf->b_flags &= ~ARC_PREFETCH;
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
} else {
new_state = arc_mfu;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
}
buf->b_arc_access = LBOLT;
arc_change_state(new_state, buf, hash_lock);
ARCSTAT_BUMP(arcstat_mru_ghost_hits);
} else if (buf->b_state == arc_mfu) {
/*
* This buffer has been accessed more than once and is
* still in the cache. Keep it in the MFU state.
*
* NOTE: an add_reference() that occurred when we did
* the arc_read() will have kicked this off the list.
* If it was a prefetch, we will explicitly move it to
* the head of the list now.
*/
if ((buf->b_flags & ARC_PREFETCH) != 0) {
ASSERT(refcount_count(&buf->b_refcnt) == 0);
ASSERT(list_link_active(&buf->b_arc_node));
}
ARCSTAT_BUMP(arcstat_mfu_hits);
buf->b_arc_access = LBOLT;
} else if (buf->b_state == arc_mfu_ghost) {
arc_state_t *new_state = arc_mfu;
/*
* This buffer has been accessed more than once but has
* been evicted from the cache. Move it back to the
* MFU state.
*/
if (buf->b_flags & ARC_PREFETCH) {
/*
* This is a prefetch access...
* move this block back to the MRU state.
*/
ASSERT3U(refcount_count(&buf->b_refcnt), ==, 0);
new_state = arc_mru;
}
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
arc_change_state(new_state, buf, hash_lock);
ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
} else if (buf->b_state == arc_l2c_only) {
/*
* This buffer is on the 2nd Level ARC.
*/
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
arc_change_state(arc_mfu, buf, hash_lock);
} else {
ASSERT(!"invalid arc state");
}
}
/* a generic arc_done_func_t which you can use */
/* ARGSUSED */
void
arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
{
bcopy(buf->b_data, arg, buf->b_hdr->b_size);
VERIFY(arc_buf_remove_ref(buf, arg) == 1);
}
/* a generic arc_done_func_t */
void
arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
{
arc_buf_t **bufp = arg;
if (zio && zio->io_error) {
VERIFY(arc_buf_remove_ref(buf, arg) == 1);
*bufp = NULL;
} else {
*bufp = buf;
}
}
static void
arc_read_done(zio_t *zio)
{
arc_buf_hdr_t *hdr, *found;
arc_buf_t *buf;
arc_buf_t *abuf; /* buffer we're assigning to callback */
kmutex_t *hash_lock;
arc_callback_t *callback_list, *acb;
int freeable = FALSE;
buf = zio->io_private;
hdr = buf->b_hdr;
/*
* The hdr was inserted into hash-table and removed from lists
* prior to starting I/O. We should find this header, since
* it's in the hash table, and it should be legit since it's
* not possible to evict it during the I/O. The only possible
* reason for it not to be found is if we were freed during the
* read.
*/
found = buf_hash_find(zio->io_spa, &hdr->b_dva, hdr->b_birth,
&hash_lock);
ASSERT((found == NULL && HDR_FREED_IN_READ(hdr) && hash_lock == NULL) ||
(found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
(found == hdr && HDR_L2_READING(hdr)));
hdr->b_flags &= ~ARC_L2_EVICTED;
if (l2arc_noprefetch && (hdr->b_flags & ARC_PREFETCH))
hdr->b_flags &= ~ARC_L2CACHE;
/* byteswap if necessary */
callback_list = hdr->b_acb;
ASSERT(callback_list != NULL);
- if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
+ if (BP_SHOULD_BYTESWAP(zio->io_bp) && zio->io_error == 0) {
arc_byteswap_func_t *func = BP_GET_LEVEL(zio->io_bp) > 0 ?
byteswap_uint64_array :
dmu_ot[BP_GET_TYPE(zio->io_bp)].ot_byteswap;
func(buf->b_data, hdr->b_size);
}
arc_cksum_compute(buf, B_FALSE);
/* create copies of the data buffer for the callers */
abuf = buf;
for (acb = callback_list; acb; acb = acb->acb_next) {
if (acb->acb_done) {
if (abuf == NULL)
abuf = arc_buf_clone(buf);
acb->acb_buf = abuf;
abuf = NULL;
}
}
hdr->b_acb = NULL;
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
ASSERT(!HDR_BUF_AVAILABLE(hdr));
if (abuf == buf)
hdr->b_flags |= ARC_BUF_AVAILABLE;
ASSERT(refcount_is_zero(&hdr->b_refcnt) || callback_list != NULL);
if (zio->io_error != 0) {
hdr->b_flags |= ARC_IO_ERROR;
if (hdr->b_state != arc_anon)
arc_change_state(arc_anon, hdr, hash_lock);
if (HDR_IN_HASH_TABLE(hdr))
buf_hash_remove(hdr);
freeable = refcount_is_zero(&hdr->b_refcnt);
}
/*
* Broadcast before we drop the hash_lock to avoid the possibility
* that the hdr (and hence the cv) might be freed before we get to
* the cv_broadcast().
*/
cv_broadcast(&hdr->b_cv);
if (hash_lock) {
/*
* Only call arc_access on anonymous buffers. This is because
* if we've issued an I/O for an evicted buffer, we've already
* called arc_access (to prevent any simultaneous readers from
* getting confused).
*/
if (zio->io_error == 0 && hdr->b_state == arc_anon)
arc_access(hdr, hash_lock);
mutex_exit(hash_lock);
} else {
/*
* This block was freed while we waited for the read to
* complete. It has been removed from the hash table and
* moved to the anonymous state (so that it won't show up
* in the cache).
*/
ASSERT3P(hdr->b_state, ==, arc_anon);
freeable = refcount_is_zero(&hdr->b_refcnt);
}
/* execute each callback and free its structure */
while ((acb = callback_list) != NULL) {
if (acb->acb_done)
acb->acb_done(zio, acb->acb_buf, acb->acb_private);
if (acb->acb_zio_dummy != NULL) {
acb->acb_zio_dummy->io_error = zio->io_error;
zio_nowait(acb->acb_zio_dummy);
}
callback_list = acb->acb_next;
kmem_free(acb, sizeof (arc_callback_t));
}
if (freeable)
arc_hdr_destroy(hdr);
}
/*
* "Read" the block at the specified DVA (in bp) via the
* cache. If the block is found in the cache, invoke the provided
* callback immediately and return. Note that the `zio' parameter
* in the callback will be NULL in this case, since no IO was
* required. If the block is not in the cache pass the read request
* on to the spa with a substitute callback function, so that the
* requested block will be added to the cache.
*
* If a read request arrives for a block that has a read in-progress,
* either wait for the in-progress read to complete (and return the
* results); or, if this is a read with a "done" func, add a record
* to the read to invoke the "done" func when the read completes,
* and return; or just return.
*
* arc_read_done() will invoke all the requested "done" functions
* for readers of this block.
*
* Normal callers should use arc_read and pass the arc buffer and offset
* for the bp. But if you know you don't need locking, you can use
* arc_read_bp.
*/
int
arc_read(zio_t *pio, spa_t *spa, blkptr_t *bp, arc_buf_t *pbuf,
arc_done_func_t *done, void *private, int priority, int zio_flags,
uint32_t *arc_flags, const zbookmark_t *zb)
{
int err;
ASSERT(!refcount_is_zero(&pbuf->b_hdr->b_refcnt));
ASSERT3U((char *)bp - (char *)pbuf->b_data, <, pbuf->b_hdr->b_size);
rw_enter(&pbuf->b_lock, RW_READER);
err = arc_read_nolock(pio, spa, bp, done, private, priority,
zio_flags, arc_flags, zb);
rw_exit(&pbuf->b_lock);
return (err);
}
int
arc_read_nolock(zio_t *pio, spa_t *spa, blkptr_t *bp,
arc_done_func_t *done, void *private, int priority, int zio_flags,
uint32_t *arc_flags, const zbookmark_t *zb)
{
arc_buf_hdr_t *hdr;
arc_buf_t *buf;
kmutex_t *hash_lock;
zio_t *rzio;
top:
hdr = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
if (hdr && hdr->b_datacnt > 0) {
*arc_flags |= ARC_CACHED;
if (HDR_IO_IN_PROGRESS(hdr)) {
if (*arc_flags & ARC_WAIT) {
cv_wait(&hdr->b_cv, hash_lock);
mutex_exit(hash_lock);
goto top;
}
ASSERT(*arc_flags & ARC_NOWAIT);
if (done) {
arc_callback_t *acb = NULL;
acb = kmem_zalloc(sizeof (arc_callback_t),
KM_SLEEP);
acb->acb_done = done;
acb->acb_private = private;
if (pio != NULL)
acb->acb_zio_dummy = zio_null(pio,
spa, NULL, NULL, zio_flags);
ASSERT(acb->acb_done != NULL);
acb->acb_next = hdr->b_acb;
hdr->b_acb = acb;
add_reference(hdr, hash_lock, private);
mutex_exit(hash_lock);
return (0);
}
mutex_exit(hash_lock);
return (0);
}
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
if (done) {
add_reference(hdr, hash_lock, private);
/*
* If this block is already in use, create a new
* copy of the data so that we will be guaranteed
* that arc_release() will always succeed.
*/
buf = hdr->b_buf;
ASSERT(buf);
ASSERT(buf->b_data);
if (HDR_BUF_AVAILABLE(hdr)) {
ASSERT(buf->b_efunc == NULL);
hdr->b_flags &= ~ARC_BUF_AVAILABLE;
} else {
buf = arc_buf_clone(buf);
}
} else if (*arc_flags & ARC_PREFETCH &&
refcount_count(&hdr->b_refcnt) == 0) {
hdr->b_flags |= ARC_PREFETCH;
}
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
arc_access(hdr, hash_lock);
if (*arc_flags & ARC_L2CACHE)
hdr->b_flags |= ARC_L2CACHE;
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_hits);
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
data, metadata, hits);
if (done)
done(NULL, buf, private);
} else {
uint64_t size = BP_GET_LSIZE(bp);
arc_callback_t *acb;
vdev_t *vd = NULL;
uint64_t addr;
boolean_t devw = B_FALSE;
if (hdr == NULL) {
/* this block is not in the cache */
arc_buf_hdr_t *exists;
arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
buf = arc_buf_alloc(spa, size, private, type);
hdr = buf->b_hdr;
hdr->b_dva = *BP_IDENTITY(bp);
hdr->b_birth = bp->blk_birth;
hdr->b_cksum0 = bp->blk_cksum.zc_word[0];
exists = buf_hash_insert(hdr, &hash_lock);
if (exists) {
/* somebody beat us to the hash insert */
mutex_exit(hash_lock);
bzero(&hdr->b_dva, sizeof (dva_t));
hdr->b_birth = 0;
hdr->b_cksum0 = 0;
(void) arc_buf_remove_ref(buf, private);
goto top; /* restart the IO request */
}
/* if this is a prefetch, we don't have a reference */
if (*arc_flags & ARC_PREFETCH) {
(void) remove_reference(hdr, hash_lock,
private);
hdr->b_flags |= ARC_PREFETCH;
}
if (*arc_flags & ARC_L2CACHE)
hdr->b_flags |= ARC_L2CACHE;
if (BP_GET_LEVEL(bp) > 0)
hdr->b_flags |= ARC_INDIRECT;
} else {
/* this block is in the ghost cache */
ASSERT(GHOST_STATE(hdr->b_state));
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 0);
ASSERT(hdr->b_buf == NULL);
/* if this is a prefetch, we don't have a reference */
if (*arc_flags & ARC_PREFETCH)
hdr->b_flags |= ARC_PREFETCH;
else
add_reference(hdr, hash_lock, private);
if (*arc_flags & ARC_L2CACHE)
hdr->b_flags |= ARC_L2CACHE;
buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
buf->b_hdr = hdr;
buf->b_data = NULL;
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_next = NULL;
hdr->b_buf = buf;
arc_get_data_buf(buf);
ASSERT(hdr->b_datacnt == 0);
hdr->b_datacnt = 1;
}
acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
acb->acb_done = done;
acb->acb_private = private;
ASSERT(hdr->b_acb == NULL);
hdr->b_acb = acb;
hdr->b_flags |= ARC_IO_IN_PROGRESS;
/*
* If the buffer has been evicted, migrate it to a present state
* before issuing the I/O. Once we drop the hash-table lock,
* the header will be marked as I/O in progress and have an
* attached buffer. At this point, anybody who finds this
* buffer ought to notice that it's legit but has a pending I/O.
*/
if (GHOST_STATE(hdr->b_state))
arc_access(hdr, hash_lock);
if (HDR_L2CACHE(hdr) && hdr->b_l2hdr != NULL &&
(vd = hdr->b_l2hdr->b_dev->l2ad_vdev) != NULL) {
devw = hdr->b_l2hdr->b_dev->l2ad_writing;
addr = hdr->b_l2hdr->b_daddr;
/*
* Lock out device removal.
*/
if (vdev_is_dead(vd) ||
!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
vd = NULL;
}
mutex_exit(hash_lock);
ASSERT3U(hdr->b_size, ==, size);
DTRACE_PROBE3(arc__miss, blkptr_t *, bp, uint64_t, size,
zbookmark_t *, zb);
ARCSTAT_BUMP(arcstat_misses);
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
data, metadata, misses);
if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
/*
* Read from the L2ARC if the following are true:
* 1. The L2ARC vdev was previously cached.
* 2. This buffer still has L2ARC metadata.
* 3. This buffer isn't currently writing to the L2ARC.
* 4. The L2ARC entry wasn't evicted, which may
* also have invalidated the vdev.
* 5. This isn't a prefetch while l2arc_noprefetch is set.
*/
if (hdr->b_l2hdr != NULL &&
!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
!(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
l2arc_read_callback_t *cb;
DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
ARCSTAT_BUMP(arcstat_l2_hits);
cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
KM_SLEEP);
cb->l2rcb_buf = buf;
cb->l2rcb_spa = spa;
cb->l2rcb_bp = *bp;
cb->l2rcb_zb = *zb;
cb->l2rcb_flags = zio_flags;
/*
* l2arc read. The SCL_L2ARC lock will be
* released by l2arc_read_done().
*/
rzio = zio_read_phys(pio, vd, addr, size,
buf->b_data, ZIO_CHECKSUM_OFF,
l2arc_read_done, cb, priority, zio_flags |
ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
ZIO_FLAG_DONT_PROPAGATE |
ZIO_FLAG_DONT_RETRY, B_FALSE);
DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
zio_t *, rzio);
ARCSTAT_INCR(arcstat_l2_read_bytes, size);
if (*arc_flags & ARC_NOWAIT) {
zio_nowait(rzio);
return (0);
}
ASSERT(*arc_flags & ARC_WAIT);
if (zio_wait(rzio) == 0)
return (0);
/* l2arc read error; goto zio_read() */
} else {
DTRACE_PROBE1(l2arc__miss,
arc_buf_hdr_t *, hdr);
ARCSTAT_BUMP(arcstat_l2_misses);
if (HDR_L2_WRITING(hdr))
ARCSTAT_BUMP(arcstat_l2_rw_clash);
spa_config_exit(spa, SCL_L2ARC, vd);
}
} else {
if (vd != NULL)
spa_config_exit(spa, SCL_L2ARC, vd);
if (l2arc_ndev != 0) {
DTRACE_PROBE1(l2arc__miss,
arc_buf_hdr_t *, hdr);
ARCSTAT_BUMP(arcstat_l2_misses);
}
}
rzio = zio_read(pio, spa, bp, buf->b_data, size,
arc_read_done, buf, priority, zio_flags, zb);
if (*arc_flags & ARC_WAIT)
return (zio_wait(rzio));
ASSERT(*arc_flags & ARC_NOWAIT);
zio_nowait(rzio);
}
return (0);
}
/*
* arc_read() variant to support pool traversal. If the block is already
* in the ARC, make a copy of it; otherwise, the caller will do the I/O.
* The idea is that we don't want pool traversal filling up memory, but
* if the ARC already has the data anyway, we shouldn't pay for the I/O.
*/
int
arc_tryread(spa_t *spa, blkptr_t *bp, void *data)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_mtx;
int rc = 0;
hdr = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_mtx);
if (hdr && hdr->b_datacnt > 0 && !HDR_IO_IN_PROGRESS(hdr)) {
arc_buf_t *buf = hdr->b_buf;
ASSERT(buf);
while (buf->b_data == NULL) {
buf = buf->b_next;
ASSERT(buf);
}
bcopy(buf->b_data, data, hdr->b_size);
} else {
rc = ENOENT;
}
if (hash_mtx)
mutex_exit(hash_mtx);
return (rc);
}
void
arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private)
{
ASSERT(buf->b_hdr != NULL);
ASSERT(buf->b_hdr->b_state != arc_anon);
ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt) || func == NULL);
buf->b_efunc = func;
buf->b_private = private;
}
/*
* This is used by the DMU to let the ARC know that a buffer is
* being evicted, so the ARC should clean up. If this arc buf
* is not yet in the evicted state, it will be put there.
*/
int
arc_buf_evict(arc_buf_t *buf)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_lock;
arc_buf_t **bufp;
list_t *list, *evicted_list;
kmutex_t *lock, *evicted_lock;
rw_enter(&buf->b_lock, RW_WRITER);
hdr = buf->b_hdr;
if (hdr == NULL) {
/*
* We are in arc_do_user_evicts().
*/
ASSERT(buf->b_data == NULL);
rw_exit(&buf->b_lock);
return (0);
} else if (buf->b_data == NULL) {
arc_buf_t copy = *buf; /* structure assignment */
/*
* We are on the eviction list; process this buffer now
* but let arc_do_user_evicts() do the reaping.
*/
buf->b_efunc = NULL;
rw_exit(&buf->b_lock);
VERIFY(copy.b_efunc(&copy) == 0);
return (1);
}
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
ASSERT(buf->b_hdr == hdr);
ASSERT3U(refcount_count(&hdr->b_refcnt), <, hdr->b_datacnt);
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
/*
* Pull this buffer off of the hdr
*/
bufp = &hdr->b_buf;
while (*bufp != buf)
bufp = &(*bufp)->b_next;
*bufp = buf->b_next;
ASSERT(buf->b_data != NULL);
arc_buf_destroy(buf, FALSE, FALSE);
if (hdr->b_datacnt == 0) {
arc_state_t *old_state = hdr->b_state;
arc_state_t *evicted_state;
ASSERT(refcount_is_zero(&hdr->b_refcnt));
evicted_state =
(old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
get_buf_info(hdr, old_state, &list, &lock);
get_buf_info(hdr, evicted_state, &evicted_list, &evicted_lock);
mutex_enter(lock);
mutex_enter(evicted_lock);
arc_change_state(evicted_state, hdr, hash_lock);
ASSERT(HDR_IN_HASH_TABLE(hdr));
hdr->b_flags |= ARC_IN_HASH_TABLE;
hdr->b_flags &= ~ARC_BUF_AVAILABLE;
mutex_exit(evicted_lock);
mutex_exit(lock);
}
mutex_exit(hash_lock);
rw_exit(&buf->b_lock);
VERIFY(buf->b_efunc(buf) == 0);
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_hdr = NULL;
kmem_cache_free(buf_cache, buf);
return (1);
}
/*
* Release this buffer from the cache. This must be done
* after a read and prior to modifying the buffer contents.
* If the buffer has more than one reference, we must make
* a new hdr for the buffer.
*/
void
arc_release(arc_buf_t *buf, void *tag)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_lock;
l2arc_buf_hdr_t *l2hdr;
uint64_t buf_size;
boolean_t released = B_FALSE;
rw_enter(&buf->b_lock, RW_WRITER);
hdr = buf->b_hdr;
/* this buffer is not on any list */
ASSERT(refcount_count(&hdr->b_refcnt) > 0);
ASSERT(!(hdr->b_flags & ARC_STORED));
if (hdr->b_state == arc_anon) {
/* this buffer is already released */
ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 1);
ASSERT(BUF_EMPTY(hdr));
ASSERT(buf->b_efunc == NULL);
arc_buf_thaw(buf);
rw_exit(&buf->b_lock);
released = B_TRUE;
} else {
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
}
l2hdr = hdr->b_l2hdr;
if (l2hdr) {
mutex_enter(&l2arc_buflist_mtx);
hdr->b_l2hdr = NULL;
buf_size = hdr->b_size;
}
if (released)
goto out;
/*
* Do we have more than one buf?
*/
if (hdr->b_datacnt > 1) {
arc_buf_hdr_t *nhdr;
arc_buf_t **bufp;
uint64_t blksz = hdr->b_size;
spa_t *spa = hdr->b_spa;
arc_buf_contents_t type = hdr->b_type;
uint32_t flags = hdr->b_flags;
ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
/*
* Pull the data off of this buf and attach it to
* a new anonymous buf.
*/
(void) remove_reference(hdr, hash_lock, tag);
bufp = &hdr->b_buf;
while (*bufp != buf)
bufp = &(*bufp)->b_next;
*bufp = (*bufp)->b_next;
buf->b_next = NULL;
ASSERT3U(hdr->b_state->arcs_size, >=, hdr->b_size);
atomic_add_64(&hdr->b_state->arcs_size, -hdr->b_size);
if (refcount_is_zero(&hdr->b_refcnt)) {
uint64_t *size = &hdr->b_state->arcs_lsize[hdr->b_type];
ASSERT3U(*size, >=, hdr->b_size);
atomic_add_64(size, -hdr->b_size);
}
hdr->b_datacnt -= 1;
arc_cksum_verify(buf);
mutex_exit(hash_lock);
nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
nhdr->b_size = blksz;
nhdr->b_spa = spa;
nhdr->b_type = type;
nhdr->b_buf = buf;
nhdr->b_state = arc_anon;
nhdr->b_arc_access = 0;
nhdr->b_flags = flags & ARC_L2_WRITING;
nhdr->b_l2hdr = NULL;
nhdr->b_datacnt = 1;
nhdr->b_freeze_cksum = NULL;
(void) refcount_add(&nhdr->b_refcnt, tag);
buf->b_hdr = nhdr;
rw_exit(&buf->b_lock);
atomic_add_64(&arc_anon->arcs_size, blksz);
} else {
rw_exit(&buf->b_lock);
ASSERT(refcount_count(&hdr->b_refcnt) == 1);
ASSERT(!list_link_active(&hdr->b_arc_node));
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
arc_change_state(arc_anon, hdr, hash_lock);
hdr->b_arc_access = 0;
mutex_exit(hash_lock);
bzero(&hdr->b_dva, sizeof (dva_t));
hdr->b_birth = 0;
hdr->b_cksum0 = 0;
arc_buf_thaw(buf);
}
buf->b_efunc = NULL;
buf->b_private = NULL;
out:
if (l2hdr) {
list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -buf_size);
mutex_exit(&l2arc_buflist_mtx);
}
}
int
arc_released(arc_buf_t *buf)
{
int released;
rw_enter(&buf->b_lock, RW_READER);
released = (buf->b_data != NULL && buf->b_hdr->b_state == arc_anon);
rw_exit(&buf->b_lock);
return (released);
}
int
arc_has_callback(arc_buf_t *buf)
{
int callback;
rw_enter(&buf->b_lock, RW_READER);
callback = (buf->b_efunc != NULL);
rw_exit(&buf->b_lock);
return (callback);
}
#ifdef ZFS_DEBUG
int
arc_referenced(arc_buf_t *buf)
{
int referenced;
rw_enter(&buf->b_lock, RW_READER);
referenced = (refcount_count(&buf->b_hdr->b_refcnt));
rw_exit(&buf->b_lock);
return (referenced);
}
#endif
static void
arc_write_ready(zio_t *zio)
{
arc_write_callback_t *callback = zio->io_private;
arc_buf_t *buf = callback->awcb_buf;
arc_buf_hdr_t *hdr = buf->b_hdr;
ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt));
callback->awcb_ready(zio, buf, callback->awcb_private);
/*
* If the IO is already in progress, then this is a re-write
* attempt, so we need to thaw and re-compute the cksum.
* It is the responsibility of the callback to handle the
* accounting for any re-write attempt.
*/
if (HDR_IO_IN_PROGRESS(hdr)) {
mutex_enter(&hdr->b_freeze_lock);
if (hdr->b_freeze_cksum != NULL) {
kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
hdr->b_freeze_cksum = NULL;
}
mutex_exit(&hdr->b_freeze_lock);
}
arc_cksum_compute(buf, B_FALSE);
hdr->b_flags |= ARC_IO_IN_PROGRESS;
}
static void
arc_write_done(zio_t *zio)
{
arc_write_callback_t *callback = zio->io_private;
arc_buf_t *buf = callback->awcb_buf;
arc_buf_hdr_t *hdr = buf->b_hdr;
hdr->b_acb = NULL;
hdr->b_dva = *BP_IDENTITY(zio->io_bp);
hdr->b_birth = zio->io_bp->blk_birth;
hdr->b_cksum0 = zio->io_bp->blk_cksum.zc_word[0];
/*
* If the block to be written was all-zero, we may have
* compressed it away. In this case no write was performed
* so there will be no dva/birth-date/checksum. The buffer
* must therefore remain anonymous (and uncached).
*/
if (!BUF_EMPTY(hdr)) {
arc_buf_hdr_t *exists;
kmutex_t *hash_lock;
arc_cksum_verify(buf);
exists = buf_hash_insert(hdr, &hash_lock);
if (exists) {
/*
* This can only happen if we overwrite for
* sync-to-convergence, because we remove
* buffers from the hash table when we arc_free().
*/
ASSERT(zio->io_flags & ZIO_FLAG_IO_REWRITE);
ASSERT(DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig),
BP_IDENTITY(zio->io_bp)));
ASSERT3U(zio->io_bp_orig.blk_birth, ==,
zio->io_bp->blk_birth);
ASSERT(refcount_is_zero(&exists->b_refcnt));
arc_change_state(arc_anon, exists, hash_lock);
mutex_exit(hash_lock);
arc_hdr_destroy(exists);
exists = buf_hash_insert(hdr, &hash_lock);
ASSERT3P(exists, ==, NULL);
}
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
/* if it's not anon, we are doing a scrub */
if (hdr->b_state == arc_anon)
arc_access(hdr, hash_lock);
mutex_exit(hash_lock);
} else if (callback->awcb_done == NULL) {
int destroy_hdr;
/*
* This is an anonymous buffer with no user callback,
* destroy it if there are no active references.
*/
mutex_enter(&arc_eviction_mtx);
destroy_hdr = refcount_is_zero(&hdr->b_refcnt);
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
mutex_exit(&arc_eviction_mtx);
if (destroy_hdr)
arc_hdr_destroy(hdr);
} else {
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
}
hdr->b_flags &= ~ARC_STORED;
if (callback->awcb_done) {
ASSERT(!refcount_is_zero(&hdr->b_refcnt));
callback->awcb_done(zio, buf, callback->awcb_private);
}
kmem_free(callback, sizeof (arc_write_callback_t));
}
static void
write_policy(spa_t *spa, const writeprops_t *wp, zio_prop_t *zp)
{
boolean_t ismd = (wp->wp_level > 0 || dmu_ot[wp->wp_type].ot_metadata);
/* Determine checksum setting */
if (ismd) {
/*
* Metadata always gets checksummed. If the data
* checksum is multi-bit correctable, and it's not a
* ZBT-style checksum, then it's suitable for metadata
* as well. Otherwise, the metadata checksum defaults
* to fletcher4.
*/
if (zio_checksum_table[wp->wp_oschecksum].ci_correctable &&
!zio_checksum_table[wp->wp_oschecksum].ci_zbt)
zp->zp_checksum = wp->wp_oschecksum;
else
zp->zp_checksum = ZIO_CHECKSUM_FLETCHER_4;
} else {
zp->zp_checksum = zio_checksum_select(wp->wp_dnchecksum,
wp->wp_oschecksum);
}
/* Determine compression setting */
if (ismd) {
/*
* XXX -- we should design a compression algorithm
* that specializes in arrays of bps.
*/
zp->zp_compress = zfs_mdcomp_disable ? ZIO_COMPRESS_EMPTY :
ZIO_COMPRESS_LZJB;
} else {
zp->zp_compress = zio_compress_select(wp->wp_dncompress,
wp->wp_oscompress);
}
zp->zp_type = wp->wp_type;
zp->zp_level = wp->wp_level;
zp->zp_ndvas = MIN(wp->wp_copies + ismd, spa_max_replication(spa));
}
zio_t *
arc_write(zio_t *pio, spa_t *spa, const writeprops_t *wp,
boolean_t l2arc, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
arc_done_func_t *ready, arc_done_func_t *done, void *private, int priority,
int zio_flags, const zbookmark_t *zb)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
arc_write_callback_t *callback;
zio_t *zio;
zio_prop_t zp;
ASSERT(ready != NULL);
ASSERT(!HDR_IO_ERROR(hdr));
ASSERT((hdr->b_flags & ARC_IO_IN_PROGRESS) == 0);
ASSERT(hdr->b_acb == 0);
if (l2arc)
hdr->b_flags |= ARC_L2CACHE;
callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
callback->awcb_ready = ready;
callback->awcb_done = done;
callback->awcb_private = private;
callback->awcb_buf = buf;
write_policy(spa, wp, &zp);
zio = zio_write(pio, spa, txg, bp, buf->b_data, hdr->b_size, &zp,
arc_write_ready, arc_write_done, callback, priority, zio_flags, zb);
return (zio);
}
int
arc_free(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
zio_done_func_t *done, void *private, uint32_t arc_flags)
{
arc_buf_hdr_t *ab;
kmutex_t *hash_lock;
zio_t *zio;
/*
* If this buffer is in the cache, release it, so it
* can be re-used.
*/
ab = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
if (ab != NULL) {
/*
* The checksum of blocks to free is not always
* preserved (e.g., on the deadlist). However, if it is
* nonzero, it should match what we have in the cache.
*/
ASSERT(bp->blk_cksum.zc_word[0] == 0 ||
bp->blk_cksum.zc_word[0] == ab->b_cksum0 ||
bp->blk_fill == BLK_FILL_ALREADY_FREED);
if (ab->b_state != arc_anon)
arc_change_state(arc_anon, ab, hash_lock);
if (HDR_IO_IN_PROGRESS(ab)) {
/*
* This should only happen when we prefetch.
*/
ASSERT(ab->b_flags & ARC_PREFETCH);
ASSERT3U(ab->b_datacnt, ==, 1);
ab->b_flags |= ARC_FREED_IN_READ;
if (HDR_IN_HASH_TABLE(ab))
buf_hash_remove(ab);
ab->b_arc_access = 0;
bzero(&ab->b_dva, sizeof (dva_t));
ab->b_birth = 0;
ab->b_cksum0 = 0;
ab->b_buf->b_efunc = NULL;
ab->b_buf->b_private = NULL;
mutex_exit(hash_lock);
} else if (refcount_is_zero(&ab->b_refcnt)) {
ab->b_flags |= ARC_FREE_IN_PROGRESS;
mutex_exit(hash_lock);
arc_hdr_destroy(ab);
ARCSTAT_BUMP(arcstat_deleted);
} else {
/*
* We still have an active reference on this
* buffer. This can happen, e.g., from
* dbuf_unoverride().
*/
ASSERT(!HDR_IN_HASH_TABLE(ab));
ab->b_arc_access = 0;
bzero(&ab->b_dva, sizeof (dva_t));
ab->b_birth = 0;
ab->b_cksum0 = 0;
ab->b_buf->b_efunc = NULL;
ab->b_buf->b_private = NULL;
mutex_exit(hash_lock);
}
}
zio = zio_free(pio, spa, txg, bp, done, private, ZIO_FLAG_MUSTSUCCEED);
if (arc_flags & ARC_WAIT)
return (zio_wait(zio));
ASSERT(arc_flags & ARC_NOWAIT);
zio_nowait(zio);
return (0);
}
static int
arc_memory_throttle(uint64_t reserve, uint64_t txg)
{
#ifdef _KERNEL
uint64_t inflight_data = arc_anon->arcs_size;
uint64_t available_memory = ptoa((uintmax_t)cnt.v_free_count);
static uint64_t page_load = 0;
static uint64_t last_txg = 0;
#if 0
#if defined(__i386)
available_memory =
MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
#endif
#endif
if (available_memory >= zfs_write_limit_max)
return (0);
if (txg > last_txg) {
last_txg = txg;
page_load = 0;
}
/*
* If we are in pageout, we know that memory is already tight and
* the arc is already going to be evicting, so we just want to
* continue to let page writes occur as quickly as possible.
*/
if (curproc == pageproc) {
if (page_load > available_memory / 4)
return (ERESTART);
/* Note: reserve is inflated, so we deflate */
page_load += reserve / 8;
return (0);
} else if (page_load > 0 && arc_reclaim_needed()) {
/* memory is low, delay before restarting */
ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
return (EAGAIN);
}
page_load = 0;
if (arc_size > arc_c_min) {
uint64_t evictable_memory =
arc_mru->arcs_lsize[ARC_BUFC_DATA] +
arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
available_memory += MIN(evictable_memory, arc_size - arc_c_min);
}
if (inflight_data > available_memory / 4) {
ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
return (ERESTART);
}
#endif
return (0);
}
void
arc_tempreserve_clear(uint64_t reserve)
{
atomic_add_64(&arc_tempreserve, -reserve);
ASSERT((int64_t)arc_tempreserve >= 0);
}
int
arc_tempreserve_space(uint64_t reserve, uint64_t txg)
{
int error;
#ifdef ZFS_DEBUG
/*
* Once in a while, fail for no reason. Everything should cope.
*/
if (spa_get_random(10000) == 0) {
dprintf("forcing random failure\n");
return (ERESTART);
}
#endif
if (reserve > arc_c/4 && !arc_no_grow)
arc_c = MIN(arc_c_max, reserve * 4);
if (reserve > arc_c)
return (ENOMEM);
/*
* Writes will, almost always, require additional memory allocations
* in order to compress/encrypt/etc the data. We therefore need to
* make sure that there is sufficient available memory for this.
*/
if ((error = arc_memory_throttle(reserve, txg)) != 0)
return (error);
/*
* Throttle writes when the amount of dirty data in the cache
* gets too large. We try to keep the cache less than half full
* of dirty blocks so that our sync times don't grow too large.
* Note: if two requests come in concurrently, we might let them
* both succeed, when one of them should fail. Not a huge deal.
*/
if (reserve + arc_tempreserve + arc_anon->arcs_size > arc_c / 2 &&
arc_anon->arcs_size > arc_c / 4) {
dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
"anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
arc_tempreserve>>10,
arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
reserve>>10, arc_c>>10);
return (ERESTART);
}
atomic_add_64(&arc_tempreserve, reserve);
return (0);
}
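/*
* Illustrative sketch, kept inside a comment so it is not compiled with
* arc.c: the dirty-data check in arc_tempreserve_space() as a pure
* predicate. The name sketch_dirty_throttle() and its parameters are
* hypothetical; tempreserve, anon_size and c stand in for arc_tempreserve,
* arc_anon->arcs_size and arc_c.
*
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// True when the request would be refused with ERESTART: the total of
// outstanding dirty data would exceed half the cache target while the
// anonymous (dirty) buffers alone already exceed a quarter of it.
static bool
sketch_dirty_throttle(uint64_t reserve, uint64_t tempreserve,
    uint64_t anon_size, uint64_t c)
{
	return (reserve + tempreserve + anon_size > c / 2 &&
	    anon_size > c / 4);
}
```
*
* Both conditions must hold, so a large reservation alone does not
* trigger the throttle while little dirty data is actually in the cache.
*/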
static kmutex_t arc_lowmem_lock;
#ifdef _KERNEL
static eventhandler_tag arc_event_lowmem = NULL;
static void
arc_lowmem(void *arg __unused, int howto __unused)
{
/* Serialize access via arc_lowmem_lock. */
mutex_enter(&arc_lowmem_lock);
needfree = 1;
cv_signal(&arc_reclaim_thr_cv);
while (needfree)
tsleep(&needfree, 0, "zfs:lowmem", hz / 5);
mutex_exit(&arc_lowmem_lock);
}
#endif
void
arc_init(void)
{
int prefetch_tunable_set = 0;
int i;
mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL);
mutex_init(&arc_lowmem_lock, NULL, MUTEX_DEFAULT, NULL);
/* Convert seconds to clock ticks */
arc_min_prefetch_lifespan = 1 * hz;
/* Start out with 1/8 of all memory */
arc_c = kmem_size() / 8;
#if 0
#ifdef _KERNEL
/*
* On architectures where the physical memory can be larger
* than the addressable space (intel in 32-bit mode), we may
* need to limit the cache to 1/8 of VM size.
*/
arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
#endif
#endif
/* set min cache to 1/32 of all memory, or 16MB, whichever is more */
arc_c_min = MAX(arc_c / 4, 64<<18);
/* set max to 5/8 of all memory, or all but 1GB, whichever is more */
if (arc_c * 8 >= 1<<30)
arc_c_max = (arc_c * 8) - (1<<30);
else
arc_c_max = arc_c_min;
arc_c_max = MAX(arc_c * 5, arc_c_max);
#ifdef _KERNEL
/*
* Allow the tunables to override our calculations if they are
* reasonable (i.e., at least 16MB)
*/
if (zfs_arc_max >= 64<<18 && zfs_arc_max < kmem_size())
arc_c_max = zfs_arc_max;
if (zfs_arc_min >= 64<<18 && zfs_arc_min <= arc_c_max)
arc_c_min = zfs_arc_min;
#endif
arc_c = arc_c_max;
arc_p = (arc_c >> 1);
/* limit meta-data to 1/4 of the arc capacity */
arc_meta_limit = arc_c_max / 4;
/* Allow the tunable to override if it is reasonable */
if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
arc_meta_limit = zfs_arc_meta_limit;
if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
arc_c_min = arc_meta_limit / 2;
if (zfs_arc_grow_retry > 0)
arc_grow_retry = zfs_arc_grow_retry;
if (zfs_arc_shrink_shift > 0)
arc_shrink_shift = zfs_arc_shrink_shift;
if (zfs_arc_p_min_shift > 0)
arc_p_min_shift = zfs_arc_p_min_shift;
/* if kmem_flags are set, let's try to use less memory */
if (kmem_debugging())
arc_c = arc_c / 2;
if (arc_c < arc_c_min)
arc_c = arc_c_min;
zfs_arc_min = arc_c_min;
zfs_arc_max = arc_c_max;
arc_anon = &ARC_anon;
arc_mru = &ARC_mru;
arc_mru_ghost = &ARC_mru_ghost;
arc_mfu = &ARC_mfu;
arc_mfu_ghost = &ARC_mfu_ghost;
arc_l2c_only = &ARC_l2c_only;
arc_size = 0;
for (i = 0; i < ARC_BUFC_NUMLISTS; i++) {
mutex_init(&arc_anon->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mru->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mru_ghost->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mfu->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mfu_ghost->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_l2c_only->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
list_create(&arc_mru->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_mru_ghost->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_mfu->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_mfu_ghost->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_l2c_only->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
}
buf_init();
arc_thread_exit = 0;
arc_eviction_list = NULL;
mutex_init(&arc_eviction_mtx, NULL, MUTEX_DEFAULT, NULL);
bzero(&arc_eviction_hdr, sizeof (arc_buf_hdr_t));
arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
if (arc_ksp != NULL) {
arc_ksp->ks_data = &arc_stats;
kstat_install(arc_ksp);
}
(void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
TS_RUN, minclsyspri);
#ifdef _KERNEL
arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem, arc_lowmem, NULL,
EVENTHANDLER_PRI_FIRST);
#endif
arc_dead = FALSE;
arc_warm = B_FALSE;
if (zfs_write_limit_max == 0)
zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
else
zfs_write_limit_shift = 0;
mutex_init(&zfs_write_limit_lock, NULL, MUTEX_DEFAULT, NULL);
#ifdef _KERNEL
if (TUNABLE_INT_FETCH("vfs.zfs.prefetch_disable", &zfs_prefetch_disable))
prefetch_tunable_set = 1;
#ifdef __i386__
if (prefetch_tunable_set == 0) {
printf("ZFS NOTICE: Prefetch is disabled by default on i386 "
"-- to enable,\n");
printf(" add \"vfs.zfs.prefetch_disable=0\" "
"to /boot/loader.conf.\n");
zfs_prefetch_disable = 1;
}
#else
if ((((uint64_t)physmem * PAGESIZE) < (1ULL << 32)) &&
prefetch_tunable_set == 0) {
printf("ZFS NOTICE: Prefetch is disabled by default if less "
"than 4GB of RAM is present;\n"
" to enable, add \"vfs.zfs.prefetch_disable=0\" "
"to /boot/loader.conf.\n");
zfs_prefetch_disable = 1;
}
#endif
/* Warn about ZFS memory and address space requirements. */
if (((uint64_t)physmem * PAGESIZE) < (256 + 128 + 64) * (1 << 20)) {
printf("ZFS WARNING: Recommended minimum RAM size is 512MB; "
"expect unstable behavior.\n");
}
if (kmem_size() < 512 * (1 << 20)) {
printf("ZFS WARNING: Recommended minimum kmem_size is 512MB; "
"expect unstable behavior.\n");
printf(" Consider tuning vm.kmem_size and "
"vm.kmem_size_max\n");
printf(" in /boot/loader.conf.\n");
}
#endif
}
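/*
* Illustrative sketch, kept inside a comment so it is not compiled with
* arc.c: the default sizing arithmetic from arc_init(), before any
* tunable overrides, as a standalone function. The names
* sketch_limits_t, sketch_default_limits() and SKETCH_MAX are
* hypothetical; kmem stands in for the value returned by kmem_size().
*
```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAX(a, b) ((a) > (b) ? (a) : (b))

// Hypothetical container for the two limits; not a type from arc.c.
typedef struct sketch_limits {
	uint64_t c_min;
	uint64_t c_max;
} sketch_limits_t;

static sketch_limits_t
sketch_default_limits(uint64_t kmem)
{
	sketch_limits_t l;
	uint64_t c = kmem / 8;			// initial target: 1/8 of memory

	// Minimum: 1/32 of memory or 16MB (64<<18), whichever is more.
	l.c_min = SKETCH_MAX(c / 4, (uint64_t)64 << 18);

	// Maximum: all but 1GB when at least 1GB is present ...
	if (c * 8 >= (uint64_t)1 << 30)
		l.c_max = c * 8 - ((uint64_t)1 << 30);
	else
		l.c_max = l.c_min;
	// ... but never less than 5/8 of memory.
	l.c_max = SKETCH_MAX(c * 5, l.c_max);

	assert(l.c_min <= l.c_max);
	return (l);
}
```
*
* For example, with 4GB this gives a 128MB minimum and a 3GB maximum;
* with 512MB it gives a 16MB minimum and a 320MB maximum.
*/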
void
arc_fini(void)
{
int i;
mutex_enter(&arc_reclaim_thr_lock);
arc_thread_exit = 1;
cv_signal(&arc_reclaim_thr_cv);
while (arc_thread_exit != 0)
cv_wait(&arc_reclaim_thr_cv, &arc_reclaim_thr_lock);
mutex_exit(&arc_reclaim_thr_lock);
arc_flush(NULL);
arc_dead = TRUE;
if (arc_ksp != NULL) {
kstat_delete(arc_ksp);
arc_ksp = NULL;
}
mutex_destroy(&arc_eviction_mtx);
mutex_destroy(&arc_reclaim_thr_lock);
cv_destroy(&arc_reclaim_thr_cv);
for (i = 0; i < ARC_BUFC_NUMLISTS; i++) {
list_destroy(&arc_mru->arcs_lists[i]);
list_destroy(&arc_mru_ghost->arcs_lists[i]);
list_destroy(&arc_mfu->arcs_lists[i]);
list_destroy(&arc_mfu_ghost->arcs_lists[i]);
list_destroy(&arc_l2c_only->arcs_lists[i]);
mutex_destroy(&arc_anon->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mru->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mru_ghost->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mfu->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mfu_ghost->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_l2c_only->arcs_locks[i].arcs_lock);
}
mutex_destroy(&zfs_write_limit_lock);
buf_fini();
mutex_destroy(&arc_lowmem_lock);
#ifdef _KERNEL
if (arc_event_lowmem != NULL)
EVENTHANDLER_DEREGISTER(vm_lowmem, arc_event_lowmem);
#endif
}
/*
* Level 2 ARC
*
* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
* It uses dedicated storage devices to hold cached data, which are populated
* using large infrequent writes. The main role of this cache is to boost
* the performance of random read workloads. The intended L2ARC devices
* include short-stroked disks, solid state disks, and other media with
* substantially faster read latency than disk.
*
*                  +-----------------------+
*                  |         ARC           |
*                  +-----------------------+
*                     |         ^     ^
*                     |         |     |
*            l2arc_feed_thread()  arc_read()
*                     |         |     |
*                     |  l2arc read   |
*                     V         |     |
*                  +---------------+  |
*                  |     L2ARC     |  |
*                  +---------------+  |
*                      |    ^         |
*             l2arc_write() |         |
*                      |    |         |
*                      V    |         |
*                    +-------+      +-------+
*                    | vdev  |      | vdev  |
*                    | cache |      | cache |
*                    +-------+      +-------+
*                    +=========+     .-----.
*                    : L2ARC   :    |-_____-|
*                    : devices :    | Disks |
*                    +=========+    `-_____-'
*
* Read requests are satisfied from the following sources, in order:
*
* 1) ARC
* 2) vdev cache of L2ARC devices
* 3) L2ARC devices
* 4) vdev cache of disks
* 5) disks
*
* Some L2ARC device types exhibit extremely slow write performance.
* To accommodate this, there are some significant differences between
* the L2ARC and traditional cache design:
*
* 1. There is no eviction path from the ARC to the L2ARC. Evictions from
* the ARC behave as usual, freeing buffers and placing headers on ghost
* lists. The ARC does not send buffers to the L2ARC during eviction as
* this would add inflated write latencies for all ARC memory pressure.
*
* 2. The L2ARC attempts to cache data from the ARC before it is evicted.
* It does this by periodically scanning buffers from the eviction-end of
* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
* not already there. It scans until a headroom of buffers is satisfied,
* which itself is a buffer for ARC eviction. The thread that does this is
* l2arc_feed_thread(), illustrated below; example sizes are included to
* provide a better sense of ratio than this diagram:
*
*         head -->                       tail
*          +---------------------+----------+
*  ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->.   # already on L2ARC
*          +---------------------+----------+   |   o L2ARC eligible
*  ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->|   : ARC buffer
*          +---------------------+----------+   |
*               15.9 Gbytes      ^ 32 Mbytes    |
*                             headroom          |
*                                        l2arc_feed_thread()
*                                               |
*                   l2arc write hand <--[oooo]--'
*                           |           8 Mbyte
*                           |          write max
*                           V
*           +==============================+
* L2ARC dev |####|#|###|###|    |####| ... |
*           +==============================+
*                      32 Gbytes
*
* 3. If an ARC buffer is copied to the L2ARC but then hit instead of
* evicted, then the L2ARC has cached a buffer much sooner than it probably
* needed to, potentially wasting L2ARC device bandwidth and storage. It is
* safe to say that this is an uncommon case, since buffers at the end of
* the ARC lists have moved there due to inactivity.
*
* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
* then the L2ARC simply misses copying some buffers. This serves as a
* pressure valve to prevent heavy read workloads from both stalling the ARC
* with waits and clogging the L2ARC with writes. This also helps prevent
* the potential for the L2ARC to churn if it attempts to cache content too
* quickly, such as during backups of the entire pool.
*
* 5. After system boot and before the ARC has filled main memory, there are
* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
* lists can remain mostly static. Instead of searching from the tail of these
* lists as pictured, the l2arc_feed_thread() will search from the list heads
* for eligible buffers, greatly increasing its chance of finding them.
*
* The L2ARC device write speed is also boosted during this time so that
* the L2ARC warms up faster. Since there have been no ARC evictions yet,
* there are no L2ARC reads, and no fear of degrading read performance
* through increased writes.
*
* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
* the vdev queue can aggregate them into larger and fewer writes. Each
* device is written to in a rotor fashion, sweeping writes through
* available space then repeating.
*
* 7. The L2ARC does not store dirty content. It never needs to flush
* write buffers back to disk-based storage.
*
* 8. If an ARC buffer is written (and dirtied) which also exists in the
* L2ARC, the now stale L2ARC buffer is immediately dropped.
*
* The performance of the L2ARC can be tweaked by a number of tunables, which
* may be necessary for different workloads:
*
*   l2arc_write_max      max write bytes per interval
*   l2arc_write_boost    extra write bytes during device warmup
*   l2arc_noprefetch     skip caching prefetched buffers
*   l2arc_headroom       number of max device writes to precache
*   l2arc_feed_secs      seconds between L2ARC writing
*
* Tunables may be removed or added as future performance improvements are
* integrated, and also may become zpool properties.
*
* There are three key functions that control how the L2ARC warms up:
*
*   l2arc_write_eligible()   check if a buffer is eligible to cache
*   l2arc_write_size()       calculate how much to write
*   l2arc_write_interval()   calculate sleep delay between writes
*
* These three functions determine what to write, how much, and how quickly
* to send writes.
*/
static boolean_t
l2arc_write_eligible(spa_t *spa, arc_buf_hdr_t *ab)
{
/*
* A buffer is *not* eligible for the L2ARC if it:
* 1. belongs to a different spa.
* 2. is already cached on the L2ARC.
* 3. has an I/O in progress (it may be an incomplete read).
* 4. is flagged not eligible (zfs property).
*/
if (ab->b_spa != spa) {
ARCSTAT_BUMP(arcstat_l2_write_spa_mismatch);
return (B_FALSE);
}
if (ab->b_l2hdr != NULL) {
ARCSTAT_BUMP(arcstat_l2_write_in_l2);
return (B_FALSE);
}
if (HDR_IO_IN_PROGRESS(ab)) {
ARCSTAT_BUMP(arcstat_l2_write_hdr_io_in_progress);
return (B_FALSE);
}
if (!HDR_L2CACHE(ab)) {
ARCSTAT_BUMP(arcstat_l2_write_not_cacheable);
return (B_FALSE);
}
return (B_TRUE);
}
static uint64_t
l2arc_write_size(l2arc_dev_t *dev)
{
uint64_t size;
size = dev->l2ad_write;
if (arc_warm == B_FALSE)
size += dev->l2ad_boost;
return (size);
}
static clock_t
l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
{
clock_t interval, next;
/*
* If the ARC lists are busy, increase our write rate; if the
* lists are stale, idle back. This is achieved by checking
* how much we previously wrote - if it was more than half of
* what we wanted, schedule the next write much sooner.
*/
if (l2arc_feed_again && wrote > (wanted / 2))
interval = (hz * l2arc_feed_min_ms) / 1000;
else
interval = hz * l2arc_feed_secs;
next = MAX(LBOLT, MIN(LBOLT + interval, began + interval));
return (next);
}
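/*
* Illustrative sketch, kept inside a comment so it is not compiled with
* arc.c: the scheduling arithmetic of l2arc_write_interval() with the
* clock and tunables passed in explicitly. The names sketch_ticks_t and
* sketch_next_feed() are hypothetical; now stands in for LBOLT, and
* feed_again, feed_secs and feed_min_ms for the l2arc_feed_* variables.
*
```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef int64_t sketch_ticks_t;	// stand-in for clock_t

static sketch_ticks_t
sketch_next_feed(sketch_ticks_t now, sketch_ticks_t began, uint64_t wanted,
    uint64_t wrote, bool feed_again, sketch_ticks_t hz,
    sketch_ticks_t feed_secs, sketch_ticks_t feed_min_ms)
{
	sketch_ticks_t interval, next;

	assert(hz > 0);
	// A busy pass (more than half the target written) earns the short
	// interval; otherwise fall back to the regular one.
	if (feed_again && wrote > wanted / 2)
		interval = (hz * feed_min_ms) / 1000;
	else
		interval = hz * feed_secs;

	// Equivalent to MAX(now, MIN(now + interval, began + interval)).
	next = began + interval;
	if (now + interval < next)
		next = now + interval;
	if (next < now)
		next = now;
	return (next);
}
```
*
* Measuring from began rather than now means the time spent writing
* counts toward the interval; if the write overran it, the next pass is
* due immediately.
*/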
static void
l2arc_hdr_stat_add(void)
{
ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE + L2HDR_SIZE);
ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
}
static void
l2arc_hdr_stat_remove(void)
{
ARCSTAT_INCR(arcstat_l2_hdr_size, -(HDR_SIZE + L2HDR_SIZE));
ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
}
/*
* Cycle through L2ARC devices. This is how L2ARC load balances.
* If a device is returned, this also returns holding the spa config lock.
*/
static l2arc_dev_t *
l2arc_dev_get_next(void)
{
l2arc_dev_t *first, *next = NULL;
/*
* Lock out the removal of spas (spa_namespace_lock), then removal
* of cache devices (l2arc_dev_mtx). Once a device has been selected,
* both locks will be dropped and a spa config lock held instead.
*/
mutex_enter(&spa_namespace_lock);
mutex_enter(&l2arc_dev_mtx);
/* if there are no vdevs, there is nothing to do */
if (l2arc_ndev == 0)
goto out;
first = NULL;
next = l2arc_dev_last;
do {
/* loop around the list looking for a non-faulted vdev */
if (next == NULL) {
next = list_head(l2arc_dev_list);
} else {
next = list_next(l2arc_dev_list, next);
if (next == NULL)
next = list_head(l2arc_dev_list);
}
/* if we have come back to the start, bail out */
if (first == NULL)
first = next;
else if (next == first)
break;
} while (vdev_is_dead(next->l2ad_vdev));
/* if we were unable to find any usable vdevs, return NULL */
if (vdev_is_dead(next->l2ad_vdev))
next = NULL;
l2arc_dev_last = next;
out:
mutex_exit(&l2arc_dev_mtx);
/*
* Grab the config lock to prevent the 'next' device from being
* removed while we are writing to it.
*/
if (next != NULL)
spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
mutex_exit(&spa_namespace_lock);
return (next);
}
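/*
* Illustrative sketch, kept inside a comment so it is not compiled with
* arc.c: the round-robin selection in l2arc_dev_get_next() over a plain
* array instead of a list_t, with all locking omitted. The name
* sketch_next_live() is hypothetical; dead[i] stands in for
* vdev_is_dead() on device i, and last is the index chosen previously
* (-1 if none).
*
```c
#include <assert.h>
#include <stdbool.h>

// Returns the index of the next usable device after 'last', wrapping
// around the array, or -1 if there are no devices or all are dead.
static int
sketch_next_live(const bool *dead, int ndev, int last)
{
	int i, idx;

	assert(last >= -1 && last < ndev);
	// Visit every device once, starting just after 'last' and ending
	// on 'last' itself.
	for (i = 1; i <= ndev; i++) {
		idx = (last + i) % ndev;
		if (!dead[idx])
			return (idx);
	}
	return (-1);
}
```
*/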
/*
* Free buffers that were tagged for destruction.
*/
static void
l2arc_do_free_on_write(void)
{
list_t *buflist;
l2arc_data_free_t *df, *df_prev;
mutex_enter(&l2arc_free_on_write_mtx);
buflist = l2arc_free_on_write;
for (df = list_tail(buflist); df; df = df_prev) {
df_prev = list_prev(buflist, df);
ASSERT(df->l2df_data != NULL);
ASSERT(df->l2df_func != NULL);
df->l2df_func(df->l2df_data, df->l2df_size);
list_remove(buflist, df);
kmem_free(df, sizeof (l2arc_data_free_t));
}
mutex_exit(&l2arc_free_on_write_mtx);
}
/*
* A write to a cache device has completed. Update all headers to allow
* reads from these buffers to begin.
*/
static void
l2arc_write_done(zio_t *zio)
{
l2arc_write_callback_t *cb;
l2arc_dev_t *dev;
list_t *buflist;
arc_buf_hdr_t *head, *ab, *ab_prev;
l2arc_buf_hdr_t *abl2;
kmutex_t *hash_lock;
cb = zio->io_private;
ASSERT(cb != NULL);
dev = cb->l2wcb_dev;
ASSERT(dev != NULL);
head = cb->l2wcb_head;
ASSERT(head != NULL);
buflist = dev->l2ad_buflist;
ASSERT(buflist != NULL);
DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
l2arc_write_callback_t *, cb);
if (zio->io_error != 0)
ARCSTAT_BUMP(arcstat_l2_writes_error);
mutex_enter(&l2arc_buflist_mtx);
/*
* All writes completed, or an error was hit.
*/
for (ab = list_prev(buflist, head); ab; ab = ab_prev) {
ab_prev = list_prev(buflist, ab);
hash_lock = HDR_LOCK(ab);
if (!mutex_tryenter(hash_lock)) {
/*
* This buffer misses out. It may be in a stage
* of eviction. Its ARC_L2_WRITING flag will be
* left set, denying reads to this buffer.
*/
ARCSTAT_BUMP(arcstat_l2_writes_hdr_miss);
continue;
}
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
abl2 = ab->b_l2hdr;
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
/*
* Allow ARC to begin reads to this L2ARC entry.
*/
ab->b_flags &= ~ARC_L2_WRITING;
mutex_exit(hash_lock);
}
atomic_inc_64(&l2arc_writes_done);
list_remove(buflist, head);
kmem_cache_free(hdr_cache, head);
mutex_exit(&l2arc_buflist_mtx);
l2arc_do_free_on_write();
kmem_free(cb, sizeof (l2arc_write_callback_t));
}
/*
* A read to a cache device completed. Validate buffer contents before
* handing over to the regular ARC routines.
*/
static void
l2arc_read_done(zio_t *zio)
{
l2arc_read_callback_t *cb;
arc_buf_hdr_t *hdr;
arc_buf_t *buf;
kmutex_t *hash_lock;
int equal;
ASSERT(zio->io_vd != NULL);
ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
cb = zio->io_private;
ASSERT(cb != NULL);
buf = cb->l2rcb_buf;
ASSERT(buf != NULL);
hdr = buf->b_hdr;
ASSERT(hdr != NULL);
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
/*
* Check this survived the L2ARC journey.
*/
equal = arc_cksum_equal(buf);
if (equal && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
mutex_exit(hash_lock);
zio->io_private = buf;
zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
arc_read_done(zio);
} else {
mutex_exit(hash_lock);
/*
* Buffer didn't survive caching. Increment stats and
* reissue to the original storage device.
*/
if (zio->io_error != 0) {
ARCSTAT_BUMP(arcstat_l2_io_error);
} else {
zio->io_error = EIO;
}
if (!equal)
ARCSTAT_BUMP(arcstat_l2_cksum_bad);
/*
* If there's no waiter, issue an async i/o to the primary
* storage now. If there *is* a waiter, the caller must
* issue the i/o in a context where it's OK to block.
*/
if (zio->io_waiter == NULL)
zio_nowait(zio_read(zio->io_parent,
cb->l2rcb_spa, &cb->l2rcb_bp,
buf->b_data, zio->io_size, arc_read_done, buf,
zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb));
}
kmem_free(cb, sizeof (l2arc_read_callback_t));
}
/*
* This is the list priority from which the L2ARC will search for pages to
* cache. This is used within loops (0..3) to cycle through lists in the
* desired order. This order can have a significant effect on cache
* performance.
*
* Currently the metadata lists are hit first, MFU then MRU, followed by
* the data lists. This function returns a locked list, and also returns
* the lock pointer.
*/
static list_t *
l2arc_list_locked(int list_num, kmutex_t **lock)
{
list_t *list;
int idx;
ASSERT(list_num >= 0 && list_num < 2 * ARC_BUFC_NUMLISTS);
if (list_num < ARC_BUFC_NUMMETADATALISTS) {
idx = list_num;
list = &arc_mfu->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mfu, idx);
} else if (list_num < ARC_BUFC_NUMMETADATALISTS * 2) {
idx = list_num - ARC_BUFC_NUMMETADATALISTS;
list = &arc_mru->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mru, idx);
} else if (list_num < (ARC_BUFC_NUMMETADATALISTS * 2 +
ARC_BUFC_NUMDATALISTS)) {
idx = list_num - ARC_BUFC_NUMMETADATALISTS;
list = &arc_mfu->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mfu, idx);
} else {
idx = list_num - ARC_BUFC_NUMLISTS;
list = &arc_mru->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mru, idx);
}
ASSERT(!(MUTEX_HELD(*lock)));
mutex_enter(*lock);
return (list);
}
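/*
* Illustrative sketch, kept inside a comment so it is not compiled with
* arc.c: the list_num mapping used by l2arc_list_locked(), without the
* locking. The names sketch_state, sketch_pick and sketch_list_for() are
* hypothetical; nmeta and ndata stand in for ARC_BUFC_NUMMETADATALISTS
* and ARC_BUFC_NUMDATALISTS, and the values used in any example call are
* made up.
*
```c
#include <assert.h>

enum sketch_state { SKETCH_MFU, SKETCH_MRU };

// Which state's list array to use, and the index within it.
struct sketch_pick {
	enum sketch_state state;
	int idx;
};

// The index arithmetic implies that each state's array holds its
// metadata lists first, followed by its data lists.
static struct sketch_pick
sketch_list_for(int list_num, int nmeta, int ndata)
{
	struct sketch_pick p;

	assert(list_num >= 0 && list_num < 2 * (nmeta + ndata));
	if (list_num < nmeta) {
		p.state = SKETCH_MFU;		// MFU metadata
		p.idx = list_num;
	} else if (list_num < 2 * nmeta) {
		p.state = SKETCH_MRU;		// MRU metadata
		p.idx = list_num - nmeta;
	} else if (list_num < 2 * nmeta + ndata) {
		p.state = SKETCH_MFU;		// MFU data
		p.idx = list_num - nmeta;
	} else {
		p.state = SKETCH_MRU;		// MRU data
		p.idx = list_num - (nmeta + ndata);
	}
	return (p);
}
```
*/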
/*
* Evict buffers from the device write hand to the distance specified in
* bytes. This distance may span populated buffers, or it may span nothing.
* This is clearing a region on the L2ARC device ready for writing.
* If the 'all' boolean is set, every buffer is evicted.
*/
static void
l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
{
list_t *buflist;
l2arc_buf_hdr_t *abl2;
arc_buf_hdr_t *ab, *ab_prev;
kmutex_t *hash_lock;
uint64_t taddr;
buflist = dev->l2ad_buflist;
if (buflist == NULL)
return;
if (!all && dev->l2ad_first) {
/*
* This is the first sweep through the device. There is
* nothing to evict.
*/
return;
}
if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
/*
* When nearing the end of the device, evict to the end
* before the device write hand jumps to the start.
*/
taddr = dev->l2ad_end;
} else {
taddr = dev->l2ad_hand + distance;
}
DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
uint64_t, taddr, boolean_t, all);
top:
mutex_enter(&l2arc_buflist_mtx);
for (ab = list_tail(buflist); ab; ab = ab_prev) {
ab_prev = list_prev(buflist, ab);
hash_lock = HDR_LOCK(ab);
if (!mutex_tryenter(hash_lock)) {
/*
* Missed the hash lock. Retry.
*/
ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
mutex_exit(&l2arc_buflist_mtx);
mutex_enter(hash_lock);
mutex_exit(hash_lock);
goto top;
}
if (HDR_L2_WRITE_HEAD(ab)) {
/*
* We hit a write head node. Leave it for
* l2arc_write_done().
*/
list_remove(buflist, ab);
mutex_exit(hash_lock);
continue;
}
if (!all && ab->b_l2hdr != NULL &&
(ab->b_l2hdr->b_daddr > taddr ||
ab->b_l2hdr->b_daddr < dev->l2ad_hand)) {
/*
* We've evicted to the target address,
* or the end of the device.
*/
mutex_exit(hash_lock);
break;
}
if (HDR_FREE_IN_PROGRESS(ab)) {
/*
* Already on the path to destruction.
*/
mutex_exit(hash_lock);
continue;
}
if (ab->b_state == arc_l2c_only) {
ASSERT(!HDR_L2_READING(ab));
/*
* This doesn't exist in the ARC. Destroy.
* arc_hdr_destroy() will call list_remove()
* and decrement arcstat_l2_size.
*/
arc_change_state(arc_anon, ab, hash_lock);
arc_hdr_destroy(ab);
} else {
/*
* Invalidate issued or about to be issued
* reads, since we may be about to write
* over this location.
*/
if (HDR_L2_READING(ab)) {
ARCSTAT_BUMP(arcstat_l2_evict_reading);
ab->b_flags |= ARC_L2_EVICTED;
}
/*
* Tell ARC this no longer exists in L2ARC.
*/
if (ab->b_l2hdr != NULL) {
abl2 = ab->b_l2hdr;
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
list_remove(buflist, ab);
/*
* This may have been leftover after a
* failed write.
*/
ab->b_flags &= ~ARC_L2_WRITING;
}
mutex_exit(hash_lock);
}
mutex_exit(&l2arc_buflist_mtx);
spa_l2cache_space_update(dev->l2ad_vdev, 0, -(taddr - dev->l2ad_evict));
dev->l2ad_evict = taddr;
}
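/*
* Illustrative sketch, kept inside a comment so it is not compiled with
* arc.c: how l2arc_evict() picks its target address. The name
* sketch_evict_target() is hypothetical; hand and end stand in for
* dev->l2ad_hand and dev->l2ad_end. The sketch assumes the device is at
* least 2 * distance bytes long so the unsigned subtraction cannot wrap.
*
```c
#include <assert.h>
#include <stdint.h>

// Evict 'distance' bytes ahead of the write hand, or all the way to the
// end of the device when the hand is within 2 * distance of it.
static uint64_t
sketch_evict_target(uint64_t hand, uint64_t end, uint64_t distance)
{
	assert(end >= 2 * distance);
	if (hand >= end - 2 * distance)
		return (end);
	return (hand + distance);
}
```
*/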
/*
* Find and write ARC buffers to the L2ARC device.
*
* An ARC_L2_WRITING flag is set so that the L2ARC buffers are not valid
* for reading until they have completed writing.
*/
static uint64_t
l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
{
arc_buf_hdr_t *ab, *ab_prev, *head;
l2arc_buf_hdr_t *hdrl2;
list_t *list;
uint64_t passed_sz, write_sz, buf_sz, headroom;
void *buf_data;
kmutex_t *hash_lock, *list_lock;
boolean_t have_lock, full;
l2arc_write_callback_t *cb;
zio_t *pio, *wzio;
int try;
ASSERT(dev->l2ad_vdev != NULL);
pio = NULL;
write_sz = 0;
full = B_FALSE;
head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
head->b_flags |= ARC_L2_WRITE_HEAD;
ARCSTAT_BUMP(arcstat_l2_write_buffer_iter);
/*
* Copy buffers for L2ARC writing.
*/
mutex_enter(&l2arc_buflist_mtx);
for (try = 0; try < 2 * ARC_BUFC_NUMLISTS; try++) {
list = l2arc_list_locked(try, &list_lock);
passed_sz = 0;
ARCSTAT_BUMP(arcstat_l2_write_buffer_list_iter);
/*
* L2ARC fast warmup.
*
* Until the ARC is warm and starts to evict, read from the
* head of the ARC lists rather than the tail.
*/
headroom = target_sz * l2arc_headroom;
if (arc_warm == B_FALSE)
ab = list_head(list);
else
ab = list_tail(list);
if (ab == NULL)
ARCSTAT_BUMP(arcstat_l2_write_buffer_list_null_iter);
for (; ab; ab = ab_prev) {
if (arc_warm == B_FALSE)
ab_prev = list_next(list, ab);
else
ab_prev = list_prev(list, ab);
ARCSTAT_INCR(arcstat_l2_write_buffer_bytes_scanned, ab->b_size);
hash_lock = HDR_LOCK(ab);
have_lock = MUTEX_HELD(hash_lock);
if (!have_lock && !mutex_tryenter(hash_lock)) {
ARCSTAT_BUMP(arcstat_l2_write_trylock_fail);
/*
* Skip this buffer rather than waiting.
*/
continue;
}
passed_sz += ab->b_size;
if (passed_sz > headroom) {
/*
* Searched too far.
*/
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_l2_write_passed_headroom);
break;
}
if (!l2arc_write_eligible(spa, ab)) {
mutex_exit(hash_lock);
continue;
}
if ((write_sz + ab->b_size) > target_sz) {
full = B_TRUE;
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_l2_write_full);
break;
}
if (pio == NULL) {
/*
* Insert a dummy header on the buflist so
* l2arc_write_done() can find where the
* write buffers begin without searching.
*/
list_insert_head(dev->l2ad_buflist, head);
cb = kmem_alloc(
sizeof (l2arc_write_callback_t), KM_SLEEP);
cb->l2wcb_dev = dev;
cb->l2wcb_head = head;
pio = zio_root(spa, l2arc_write_done, cb,
ZIO_FLAG_CANFAIL);
ARCSTAT_BUMP(arcstat_l2_write_pios);
}
/*
* Create and add a new L2ARC header.
*/
hdrl2 = kmem_zalloc(sizeof (l2arc_buf_hdr_t), KM_SLEEP);
hdrl2->b_dev = dev;
hdrl2->b_daddr = dev->l2ad_hand;
ab->b_flags |= ARC_L2_WRITING;
ab->b_l2hdr = hdrl2;
list_insert_head(dev->l2ad_buflist, ab);
buf_data = ab->b_buf->b_data;
buf_sz = ab->b_size;
/*
* Compute and store the buffer cksum before
* writing. On debug the cksum is verified first.
*/
arc_cksum_verify(ab->b_buf);
arc_cksum_compute(ab->b_buf, B_TRUE);
mutex_exit(hash_lock);
wzio = zio_write_phys(pio, dev->l2ad_vdev,
dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE,
ZIO_FLAG_CANFAIL, B_FALSE);
DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
zio_t *, wzio);
(void) zio_nowait(wzio);
/*
* Keep the clock hand suitably device-aligned.
*/
buf_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
write_sz += buf_sz;
dev->l2ad_hand += buf_sz;
}
mutex_exit(list_lock);
if (full == B_TRUE)
break;
}
mutex_exit(&l2arc_buflist_mtx);
if (pio == NULL) {
ASSERT3U(write_sz, ==, 0);
kmem_cache_free(hdr_cache, head);
return (0);
}
ASSERT3U(write_sz, <=, target_sz);
ARCSTAT_BUMP(arcstat_l2_writes_sent);
ARCSTAT_INCR(arcstat_l2_write_bytes, write_sz);
ARCSTAT_INCR(arcstat_l2_size, write_sz);
spa_l2cache_space_update(dev->l2ad_vdev, 0, write_sz);
/*
* Bump device hand to the device start if it is approaching the end.
* l2arc_evict() will already have evicted ahead for this case.
*/
if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
spa_l2cache_space_update(dev->l2ad_vdev, 0,
dev->l2ad_end - dev->l2ad_hand);
dev->l2ad_hand = dev->l2ad_start;
dev->l2ad_evict = dev->l2ad_start;
dev->l2ad_first = B_FALSE;
}
dev->l2ad_writing = B_TRUE;
(void) zio_wait(pio);
dev->l2ad_writing = B_FALSE;
return (write_sz);
}
/*
* This thread feeds the L2ARC at regular intervals. This is the beating
* heart of the L2ARC.
*/
static void
l2arc_feed_thread(void *dummy __unused)
{
callb_cpr_t cpr;
l2arc_dev_t *dev;
spa_t *spa;
uint64_t size, wrote;
clock_t begin, next = LBOLT;
CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
mutex_enter(&l2arc_feed_thr_lock);
while (l2arc_thread_exit == 0) {
CALLB_CPR_SAFE_BEGIN(&cpr);
(void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
next - LBOLT);
CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
next = LBOLT + hz;
/*
* Quick check for L2ARC devices.
*/
mutex_enter(&l2arc_dev_mtx);
if (l2arc_ndev == 0) {
mutex_exit(&l2arc_dev_mtx);
continue;
}
mutex_exit(&l2arc_dev_mtx);
begin = LBOLT;
/*
* This selects the next l2arc device to write to, and in
* doing so the next spa to feed from: dev->l2ad_spa. This
* will return NULL if there are now no l2arc devices or if
* they are all faulted.
*
* If a device is returned, its spa's config lock is also
* held to prevent device removal. l2arc_dev_get_next()
* will grab and release l2arc_dev_mtx.
*/
if ((dev = l2arc_dev_get_next()) == NULL)
continue;
spa = dev->l2ad_spa;
ASSERT(spa != NULL);
/*
* Avoid contributing to memory pressure.
*/
if (arc_reclaim_needed()) {
ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
spa_config_exit(spa, SCL_L2ARC, dev);
continue;
}
ARCSTAT_BUMP(arcstat_l2_feeds);
size = l2arc_write_size(dev);
/*
* Evict L2ARC buffers that will be overwritten.
*/
l2arc_evict(dev, size, B_FALSE);
/*
* Write ARC buffers.
*/
wrote = l2arc_write_buffers(spa, dev, size);
/*
* Calculate interval between writes.
*/
next = l2arc_write_interval(begin, size, wrote);
spa_config_exit(spa, SCL_L2ARC, dev);
}
l2arc_thread_exit = 0;
cv_broadcast(&l2arc_feed_thr_cv);
CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
thread_exit();
}
boolean_t
l2arc_vdev_present(vdev_t *vd)
{
l2arc_dev_t *dev;
mutex_enter(&l2arc_dev_mtx);
for (dev = list_head(l2arc_dev_list); dev != NULL;
dev = list_next(l2arc_dev_list, dev)) {
if (dev->l2ad_vdev == vd)
break;
}
mutex_exit(&l2arc_dev_mtx);
return (dev != NULL);
}
/*
* Add a vdev for use by the L2ARC. By this point the spa has already
* validated the vdev and opened it.
*/
void
l2arc_add_vdev(spa_t *spa, vdev_t *vd, uint64_t start, uint64_t end)
{
l2arc_dev_t *adddev;
ASSERT(!l2arc_vdev_present(vd));
/*
* Create a new l2arc device entry.
*/
adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
adddev->l2ad_spa = spa;
adddev->l2ad_vdev = vd;
adddev->l2ad_write = l2arc_write_max;
adddev->l2ad_boost = l2arc_write_boost;
adddev->l2ad_start = start;
adddev->l2ad_end = end;
adddev->l2ad_hand = adddev->l2ad_start;
adddev->l2ad_evict = adddev->l2ad_start;
adddev->l2ad_first = B_TRUE;
adddev->l2ad_writing = B_FALSE;
ASSERT3U(adddev->l2ad_write, >, 0);
/*
* This is a list of all ARC buffers that are still valid on the
* device.
*/
adddev->l2ad_buflist = kmem_zalloc(sizeof (list_t), KM_SLEEP);
list_create(adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
offsetof(arc_buf_hdr_t, b_l2node));
spa_l2cache_space_update(vd, adddev->l2ad_end - adddev->l2ad_hand, 0);
/*
* Add device to global list
*/
mutex_enter(&l2arc_dev_mtx);
list_insert_head(l2arc_dev_list, adddev);
atomic_inc_64(&l2arc_ndev);
mutex_exit(&l2arc_dev_mtx);
}
/*
* Remove a vdev from the L2ARC.
*/
void
l2arc_remove_vdev(vdev_t *vd)
{
l2arc_dev_t *dev, *nextdev, *remdev = NULL;
/*
* Find the device by vdev
*/
mutex_enter(&l2arc_dev_mtx);
for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
nextdev = list_next(l2arc_dev_list, dev);
if (vd == dev->l2ad_vdev) {
remdev = dev;
break;
}
}
ASSERT(remdev != NULL);
/*
* Remove device from global list
*/
list_remove(l2arc_dev_list, remdev);
l2arc_dev_last = NULL; /* may have been invalidated */
atomic_dec_64(&l2arc_ndev);
mutex_exit(&l2arc_dev_mtx);
/*
* Clear all buflists and ARC references. L2ARC device flush.
*/
l2arc_evict(remdev, 0, B_TRUE);
list_destroy(remdev->l2ad_buflist);
kmem_free(remdev->l2ad_buflist, sizeof (list_t));
kmem_free(remdev, sizeof (l2arc_dev_t));
}
void
l2arc_init(void)
{
l2arc_thread_exit = 0;
l2arc_ndev = 0;
l2arc_writes_sent = 0;
l2arc_writes_done = 0;
mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&l2arc_buflist_mtx, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
l2arc_dev_list = &L2ARC_dev_list;
l2arc_free_on_write = &L2ARC_free_on_write;
list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
offsetof(l2arc_dev_t, l2ad_node));
list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
offsetof(l2arc_data_free_t, l2df_list_node));
}
void
l2arc_fini(void)
{
/*
* This is called from dmu_fini(), which is called from spa_fini();
* Because of this, we can assume that all l2arc devices have
* already been removed when the pools themselves were removed.
*/
l2arc_do_free_on_write();
mutex_destroy(&l2arc_feed_thr_lock);
cv_destroy(&l2arc_feed_thr_cv);
mutex_destroy(&l2arc_dev_mtx);
mutex_destroy(&l2arc_buflist_mtx);
mutex_destroy(&l2arc_free_on_write_mtx);
list_destroy(l2arc_dev_list);
list_destroy(l2arc_free_on_write);
}
void
l2arc_start(void)
{
if (!(spa_mode & FWRITE))
return;
(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
TS_RUN, minclsyspri);
}
void
l2arc_stop(void)
{
if (!(spa_mode & FWRITE))
return;
mutex_enter(&l2arc_feed_thr_lock);
cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
l2arc_thread_exit = 1;
while (l2arc_thread_exit != 0)
cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
mutex_exit(&l2arc_feed_thr_lock);
}
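The wrap-around of the L2ARC write hand at the end of l2arc_write_buffers() above can be illustrated in isolation. What follows is a minimal userland sketch, not the kernel code: the struct and function names (l2dev_sketch, l2dev_advance) are hypothetical, the fields only mirror l2ad_start, l2ad_end, l2ad_hand, l2ad_evict and l2ad_first, and it folds the hand advance and the wrap check into one step, whereas the real function advances the hand inside its write loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the l2arc_dev_t fields used by the wrap logic. */
struct l2dev_sketch {
	uint64_t start;	/* first usable byte (cf. l2ad_start) */
	uint64_t end;	/* one past the last usable byte (cf. l2ad_end) */
	uint64_t hand;	/* next write offset (cf. l2ad_hand) */
	uint64_t evict;	/* eviction cursor (cf. l2ad_evict) */
	bool first;	/* still on the first sweep (cf. l2ad_first) */
};

/*
 * Advance the hand by 'wrote' bytes. If fewer than 'target_sz' bytes
 * remain before the end of the device, wrap the hand and the eviction
 * cursor back to the start. Returns the number of bytes skipped at the
 * tail of the device, or 0 if no wrap happened.
 */
static uint64_t
l2dev_advance(struct l2dev_sketch *d, uint64_t wrote, uint64_t target_sz)
{
	uint64_t skipped = 0;

	assert(target_sz <= d->end - d->start);
	assert(d->hand + wrote <= d->end);

	d->hand += wrote;
	if (d->hand >= d->end - target_sz) {
		skipped = d->end - d->hand;
		d->hand = d->start;
		d->evict = d->start;
		d->first = false;
	}
	return (skipped);
}
```

The skipped tail corresponds to the amount that the real code also reports through spa_l2cache_space_update() before resetting the hand, so the space accounting treats the unused tail as consumed.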
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c (revision 209274)
@@ -1,1066 +1,1066 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/dmu.h>
#include <sys/dmu_impl.h>
#include <sys/dbuf.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/dsl_dataset.h> /* for dsl_dataset_block_freeable() */
#include <sys/dsl_dir.h> /* for dsl_dir_tempreserve_*() */
#include <sys/dsl_pool.h>
#include <sys/zap_impl.h> /* for fzap_default_block_shift */
#include <sys/spa.h>
#include <sys/zfs_context.h>
typedef void (*dmu_tx_hold_func_t)(dmu_tx_t *tx, struct dnode *dn,
uint64_t arg1, uint64_t arg2);
dmu_tx_t *
dmu_tx_create_dd(dsl_dir_t *dd)
{
dmu_tx_t *tx = kmem_zalloc(sizeof (dmu_tx_t), KM_SLEEP);
tx->tx_dir = dd;
if (dd)
tx->tx_pool = dd->dd_pool;
list_create(&tx->tx_holds, sizeof (dmu_tx_hold_t),
offsetof(dmu_tx_hold_t, txh_node));
#ifdef ZFS_DEBUG
refcount_create(&tx->tx_space_written);
refcount_create(&tx->tx_space_freed);
#endif
return (tx);
}
dmu_tx_t *
dmu_tx_create(objset_t *os)
{
dmu_tx_t *tx = dmu_tx_create_dd(os->os->os_dsl_dataset->ds_dir);
tx->tx_objset = os;
tx->tx_lastsnap_txg = dsl_dataset_prev_snap_txg(os->os->os_dsl_dataset);
return (tx);
}
dmu_tx_t *
dmu_tx_create_assigned(struct dsl_pool *dp, uint64_t txg)
{
dmu_tx_t *tx = dmu_tx_create_dd(NULL);
ASSERT3U(txg, <=, dp->dp_tx.tx_open_txg);
tx->tx_pool = dp;
tx->tx_txg = txg;
tx->tx_anyobj = TRUE;
return (tx);
}
int
dmu_tx_is_syncing(dmu_tx_t *tx)
{
return (tx->tx_anyobj);
}
int
dmu_tx_private_ok(dmu_tx_t *tx)
{
return (tx->tx_anyobj);
}
static dmu_tx_hold_t *
dmu_tx_hold_object_impl(dmu_tx_t *tx, objset_t *os, uint64_t object,
enum dmu_tx_hold_type type, uint64_t arg1, uint64_t arg2)
{
dmu_tx_hold_t *txh;
dnode_t *dn = NULL;
int err;
if (object != DMU_NEW_OBJECT) {
err = dnode_hold(os->os, object, tx, &dn);
if (err) {
tx->tx_err = err;
return (NULL);
}
if (err == 0 && tx->tx_txg != 0) {
mutex_enter(&dn->dn_mtx);
/*
* dn->dn_assigned_txg == tx->tx_txg doesn't pose a
* problem, but there's no way for it to happen (for
* now, at least).
*/
ASSERT(dn->dn_assigned_txg == 0);
dn->dn_assigned_txg = tx->tx_txg;
(void) refcount_add(&dn->dn_tx_holds, tx);
mutex_exit(&dn->dn_mtx);
}
}
txh = kmem_zalloc(sizeof (dmu_tx_hold_t), KM_SLEEP);
txh->txh_tx = tx;
txh->txh_dnode = dn;
#ifdef ZFS_DEBUG
txh->txh_type = type;
txh->txh_arg1 = arg1;
txh->txh_arg2 = arg2;
#endif
list_insert_tail(&tx->tx_holds, txh);
return (txh);
}
void
dmu_tx_add_new_object(dmu_tx_t *tx, objset_t *os, uint64_t object)
{
/*
* If we're syncing, they can manipulate any object anyhow, and
* the hold on the dnode_t can cause problems.
*/
if (!dmu_tx_is_syncing(tx)) {
(void) dmu_tx_hold_object_impl(tx, os,
object, THT_NEWOBJECT, 0, 0);
}
}
static int
dmu_tx_check_ioerr(zio_t *zio, dnode_t *dn, int level, uint64_t blkid)
{
int err;
dmu_buf_impl_t *db;
rw_enter(&dn->dn_struct_rwlock, RW_READER);
db = dbuf_hold_level(dn, level, blkid, FTAG);
rw_exit(&dn->dn_struct_rwlock);
if (db == NULL)
return (EIO);
err = dbuf_read(db, zio, DB_RF_CANFAIL | DB_RF_NOPREFETCH);
dbuf_rele(db, FTAG);
return (err);
}
/* ARGSUSED */
static void
dmu_tx_count_write(dmu_tx_hold_t *txh, uint64_t off, uint64_t len)
{
dnode_t *dn = txh->txh_dnode;
uint64_t start, end, i;
int min_bs, max_bs, min_ibs, max_ibs, epbs, bits;
int err = 0;
if (len == 0)
return;
min_bs = SPA_MINBLOCKSHIFT;
max_bs = SPA_MAXBLOCKSHIFT;
min_ibs = DN_MIN_INDBLKSHIFT;
max_ibs = DN_MAX_INDBLKSHIFT;
/*
* For i/o error checking, read the first and last level-0
* blocks (if they are not aligned), and all the level-1 blocks.
*/
if (dn) {
if (dn->dn_maxblkid == 0) {
err = dmu_tx_check_ioerr(NULL, dn, 0, 0);
if (err)
goto out;
} else {
zio_t *zio = zio_root(dn->dn_objset->os_spa,
NULL, NULL, ZIO_FLAG_CANFAIL);
/* first level-0 block */
start = off >> dn->dn_datablkshift;
if (P2PHASE(off, dn->dn_datablksz) ||
len < dn->dn_datablksz) {
err = dmu_tx_check_ioerr(zio, dn, 0, start);
if (err)
goto out;
}
/* last level-0 block */
end = (off+len-1) >> dn->dn_datablkshift;
if (end != start &&
P2PHASE(off+len, dn->dn_datablksz)) {
err = dmu_tx_check_ioerr(zio, dn, 0, end);
if (err)
goto out;
}
/* level-1 blocks */
if (dn->dn_nlevels > 1) {
start >>= dn->dn_indblkshift - SPA_BLKPTRSHIFT;
end >>= dn->dn_indblkshift - SPA_BLKPTRSHIFT;
for (i = start+1; i < end; i++) {
err = dmu_tx_check_ioerr(zio, dn, 1, i);
if (err)
goto out;
}
}
err = zio_wait(zio);
if (err)
goto out;
}
}
/*
* If there's more than one block, the blocksize can't change,
* so we can make a more precise estimate. Alternatively,
* if the dnode's ibs is larger than max_ibs, always use that.
* This ensures that if we reduce DN_MAX_INDBLKSHIFT,
* the code will still work correctly on existing pools.
*/
if (dn && (dn->dn_maxblkid != 0 || dn->dn_indblkshift > max_ibs)) {
min_ibs = max_ibs = dn->dn_indblkshift;
if (dn->dn_datablkshift != 0)
min_bs = max_bs = dn->dn_datablkshift;
}
/*
* 'end' is the last thing we will access, not one past.
* This way we won't overflow when accessing the last byte.
*/
start = P2ALIGN(off, 1ULL << max_bs);
end = P2ROUNDUP(off + len, 1ULL << max_bs) - 1;
txh->txh_space_towrite += end - start + 1;
start >>= min_bs;
end >>= min_bs;
epbs = min_ibs - SPA_BLKPTRSHIFT;
/*
* The object contains at most 2^(64 - min_bs) blocks,
* and each indirect level maps 2^epbs.
*/
for (bits = 64 - min_bs; bits >= 0; bits -= epbs) {
start >>= epbs;
end >>= epbs;
/*
* If we increase the number of levels of indirection,
* we'll need new blkid=0 indirect blocks. If start == 0,
* we're already accounting for that block; and if end == 0,
* we can't increase the number of levels beyond that.
*/
if (start != 0 && end != 0)
txh->txh_space_towrite += 1ULL << max_ibs;
txh->txh_space_towrite += (end - start + 1) << max_ibs;
}
ASSERT(txh->txh_space_towrite < 2 * DMU_MAX_ACCESS);
out:
if (err)
txh->txh_tx->tx_err = err;
}
static void
dmu_tx_count_dnode(dmu_tx_hold_t *txh)
{
dnode_t *dn = txh->txh_dnode;
dnode_t *mdn = txh->txh_tx->tx_objset->os->os_meta_dnode;
uint64_t space = mdn->dn_datablksz +
((mdn->dn_nlevels-1) << mdn->dn_indblkshift);
if (dn && dn->dn_dbuf->db_blkptr &&
dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
dn->dn_dbuf->db_blkptr->blk_birth)) {
txh->txh_space_tooverwrite += space;
} else {
txh->txh_space_towrite += space;
if (dn && dn->dn_dbuf->db_blkptr)
txh->txh_space_tounref += space;
}
}
void
dmu_tx_hold_write(dmu_tx_t *tx, uint64_t object, uint64_t off, int len)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
ASSERT(len < DMU_MAX_ACCESS);
ASSERT(len == 0 || UINT64_MAX - off >= len - 1);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_WRITE, off, len);
if (txh == NULL)
return;
dmu_tx_count_write(txh, off, len);
dmu_tx_count_dnode(txh);
}
static void
dmu_tx_count_free(dmu_tx_hold_t *txh, uint64_t off, uint64_t len)
{
uint64_t blkid, nblks, lastblk;
uint64_t space = 0, unref = 0, skipped = 0;
dnode_t *dn = txh->txh_dnode;
dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset;
spa_t *spa = txh->txh_tx->tx_pool->dp_spa;
int epbs;
if (dn->dn_nlevels == 0)
return;
/*
* The struct_rwlock protects us against dn_nlevels
* changing, in case (against all odds) we manage to dirty &
* sync out the changes after we check for being dirty.
* Also, dbuf_hold_level() wants us to have the struct_rwlock.
*/
rw_enter(&dn->dn_struct_rwlock, RW_READER);
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
if (dn->dn_maxblkid == 0) {
if (off == 0 && len >= dn->dn_datablksz) {
blkid = 0;
nblks = 1;
} else {
rw_exit(&dn->dn_struct_rwlock);
return;
}
} else {
blkid = off >> dn->dn_datablkshift;
nblks = (len + dn->dn_datablksz - 1) >> dn->dn_datablkshift;
if (blkid >= dn->dn_maxblkid) {
rw_exit(&dn->dn_struct_rwlock);
return;
}
if (blkid + nblks > dn->dn_maxblkid)
nblks = dn->dn_maxblkid - blkid;
}
if (dn->dn_nlevels == 1) {
int i;
for (i = 0; i < nblks; i++) {
blkptr_t *bp = dn->dn_phys->dn_blkptr;
ASSERT3U(blkid + i, <, dn->dn_nblkptr);
bp += blkid + i;
if (dsl_dataset_block_freeable(ds, bp->blk_birth)) {
dprintf_bp(bp, "can free old%s", "");
space += bp_get_dasize(spa, bp);
}
unref += BP_GET_ASIZE(bp);
}
nblks = 0;
}
/*
* Add in memory requirements of higher-level indirects.
* This assumes a worst-possible scenario for dn_nlevels.
*/
{
uint64_t blkcnt = 1 + ((nblks >> epbs) >> epbs);
int level = (dn->dn_nlevels > 1) ? 2 : 1;
while (level++ < DN_MAX_LEVELS) {
txh->txh_memory_tohold += blkcnt << dn->dn_indblkshift;
blkcnt = 1 + (blkcnt >> epbs);
}
ASSERT(blkcnt <= dn->dn_nblkptr);
}
lastblk = blkid + nblks - 1;
while (nblks) {
dmu_buf_impl_t *dbuf;
uint64_t ibyte, new_blkid;
int epb = 1 << epbs;
int err, i, blkoff, tochk;
blkptr_t *bp;
ibyte = blkid << dn->dn_datablkshift;
err = dnode_next_offset(dn,
DNODE_FIND_HAVELOCK, &ibyte, 2, 1, 0);
new_blkid = ibyte >> dn->dn_datablkshift;
if (err == ESRCH) {
skipped += (lastblk >> epbs) - (blkid >> epbs) + 1;
break;
}
if (err) {
txh->txh_tx->tx_err = err;
break;
}
if (new_blkid > lastblk) {
skipped += (lastblk >> epbs) - (blkid >> epbs) + 1;
break;
}
if (new_blkid > blkid) {
ASSERT((new_blkid >> epbs) > (blkid >> epbs));
skipped += (new_blkid >> epbs) - (blkid >> epbs) - 1;
nblks -= new_blkid - blkid;
blkid = new_blkid;
}
blkoff = P2PHASE(blkid, epb);
tochk = MIN(epb - blkoff, nblks);
dbuf = dbuf_hold_level(dn, 1, blkid >> epbs, FTAG);
txh->txh_memory_tohold += dbuf->db.db_size;
if (txh->txh_memory_tohold > DMU_MAX_ACCESS) {
txh->txh_tx->tx_err = E2BIG;
dbuf_rele(dbuf, FTAG);
break;
}
err = dbuf_read(dbuf, NULL, DB_RF_HAVESTRUCT | DB_RF_CANFAIL);
if (err != 0) {
txh->txh_tx->tx_err = err;
dbuf_rele(dbuf, FTAG);
break;
}
bp = dbuf->db.db_data;
bp += blkoff;
for (i = 0; i < tochk; i++) {
if (dsl_dataset_block_freeable(ds, bp[i].blk_birth)) {
dprintf_bp(&bp[i], "can free old%s", "");
space += bp_get_dasize(spa, &bp[i]);
}
unref += BP_GET_ASIZE(bp);
}
dbuf_rele(dbuf, FTAG);
blkid += tochk;
nblks -= tochk;
}
rw_exit(&dn->dn_struct_rwlock);
/* account for new level 1 indirect blocks that might show up */
if (skipped > 0) {
txh->txh_fudge += skipped << dn->dn_indblkshift;
skipped = MIN(skipped, DMU_MAX_DELETEBLKCNT >> epbs);
txh->txh_memory_tohold += skipped << dn->dn_indblkshift;
}
txh->txh_space_tofree += space;
txh->txh_space_tounref += unref;
}
void
dmu_tx_hold_free(dmu_tx_t *tx, uint64_t object, uint64_t off, uint64_t len)
{
dmu_tx_hold_t *txh;
dnode_t *dn;
uint64_t start, end, i;
int err, shift;
zio_t *zio;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_FREE, off, len);
if (txh == NULL)
return;
dn = txh->txh_dnode;
/* first block */
if (off != 0)
dmu_tx_count_write(txh, off, 1);
/* last block */
if (len != DMU_OBJECT_END)
dmu_tx_count_write(txh, off+len, 1);
if (off >= (dn->dn_maxblkid+1) * dn->dn_datablksz)
return;
if (len == DMU_OBJECT_END)
len = (dn->dn_maxblkid+1) * dn->dn_datablksz - off;
/*
* For i/o error checking, read the first and last level-0
* blocks, and all the level-1 blocks. The above count_write's
* have already taken care of the level-0 blocks.
*/
if (dn->dn_nlevels > 1) {
shift = dn->dn_datablkshift + dn->dn_indblkshift -
SPA_BLKPTRSHIFT;
start = off >> shift;
end = dn->dn_datablkshift ? ((off+len) >> shift) : 0;
zio = zio_root(tx->tx_pool->dp_spa,
NULL, NULL, ZIO_FLAG_CANFAIL);
for (i = start; i <= end; i++) {
uint64_t ibyte = i << shift;
err = dnode_next_offset(dn, 0, &ibyte, 2, 1, 0);
i = ibyte >> shift;
if (err == ESRCH)
break;
if (err) {
tx->tx_err = err;
return;
}
err = dmu_tx_check_ioerr(zio, dn, 1, i);
if (err) {
tx->tx_err = err;
return;
}
}
err = zio_wait(zio);
if (err) {
tx->tx_err = err;
return;
}
}
dmu_tx_count_dnode(txh);
dmu_tx_count_free(txh, off, len);
}
void
dmu_tx_hold_zap(dmu_tx_t *tx, uint64_t object, int add, char *name)
{
dmu_tx_hold_t *txh;
dnode_t *dn;
uint64_t nblocks;
int epbs, err;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_ZAP, add, (uintptr_t)name);
if (txh == NULL)
return;
dn = txh->txh_dnode;
dmu_tx_count_dnode(txh);
if (dn == NULL) {
/*
* We will be able to fit a new object's entries into one leaf
* block. So there will be at most 2 blocks total,
* including the header block.
*/
dmu_tx_count_write(txh, 0, 2 << fzap_default_block_shift);
return;
}
ASSERT3P(dmu_ot[dn->dn_type].ot_byteswap, ==, zap_byteswap);
if (dn->dn_maxblkid == 0 && !add) {
/*
* If there is only one block (i.e. this is a micro-zap)
* and we are not adding anything, the accounting is simple.
*/
err = dmu_tx_check_ioerr(NULL, dn, 0, 0);
if (err) {
tx->tx_err = err;
return;
}
/*
* Use max block size here, since we don't know how much
* the size will change between now and the dbuf dirty call.
*/
if (dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
dn->dn_phys->dn_blkptr[0].blk_birth)) {
txh->txh_space_tooverwrite += SPA_MAXBLOCKSIZE;
} else {
txh->txh_space_towrite += SPA_MAXBLOCKSIZE;
- txh->txh_space_tounref +=
- BP_GET_ASIZE(dn->dn_phys->dn_blkptr);
}
+ if (dn->dn_phys->dn_blkptr[0].blk_birth)
+ txh->txh_space_tounref += SPA_MAXBLOCKSIZE;
return;
}
if (dn->dn_maxblkid > 0 && name) {
/*
* access the name in this fat-zap so that we'll check
* for i/o errors to the leaf blocks, etc.
*/
err = zap_lookup(&dn->dn_objset->os, dn->dn_object, name,
8, 0, NULL);
if (err == EIO) {
tx->tx_err = err;
return;
}
}
/*
* 3 blocks overwritten: target leaf, ptrtbl block, header block
* 3 new blocks written if adding: new split leaf, 2 grown ptrtbl blocks
*/
dmu_tx_count_write(txh, dn->dn_maxblkid * dn->dn_datablksz,
(3 + (add ? 3 : 0)) << dn->dn_datablkshift);
/*
* If the modified blocks are scattered to the four winds,
* we'll have to modify an indirect twig for each.
*/
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
for (nblocks = dn->dn_maxblkid >> epbs; nblocks != 0; nblocks >>= epbs)
txh->txh_space_towrite += 3 << dn->dn_indblkshift;
}
void
dmu_tx_hold_bonus(dmu_tx_t *tx, uint64_t object)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_BONUS, 0, 0);
if (txh)
dmu_tx_count_dnode(txh);
}
void
dmu_tx_hold_space(dmu_tx_t *tx, uint64_t space)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
DMU_NEW_OBJECT, THT_SPACE, space, 0);
txh->txh_space_towrite += space;
}
int
dmu_tx_holds(dmu_tx_t *tx, uint64_t object)
{
dmu_tx_hold_t *txh;
int holds = 0;
/*
* By asserting that the tx is assigned, we're counting the
* number of dn_tx_holds, which is the same as the number of
* dn_holds. Otherwise, we'd be counting dn_holds, but
* dn_tx_holds could be 0.
*/
ASSERT(tx->tx_txg != 0);
/* if (tx->tx_anyobj == TRUE) */
/* return (0); */
for (txh = list_head(&tx->tx_holds); txh;
txh = list_next(&tx->tx_holds, txh)) {
if (txh->txh_dnode && txh->txh_dnode->dn_object == object)
holds++;
}
return (holds);
}
#ifdef ZFS_DEBUG
void
dmu_tx_dirty_buf(dmu_tx_t *tx, dmu_buf_impl_t *db)
{
dmu_tx_hold_t *txh;
int match_object = FALSE, match_offset = FALSE;
dnode_t *dn = db->db_dnode;
ASSERT(tx->tx_txg != 0);
ASSERT(tx->tx_objset == NULL || dn->dn_objset == tx->tx_objset->os);
ASSERT3U(dn->dn_object, ==, db->db.db_object);
if (tx->tx_anyobj)
return;
/* XXX No checking on the meta dnode for now */
if (db->db.db_object == DMU_META_DNODE_OBJECT)
return;
for (txh = list_head(&tx->tx_holds); txh;
txh = list_next(&tx->tx_holds, txh)) {
ASSERT(dn == NULL || dn->dn_assigned_txg == tx->tx_txg);
if (txh->txh_dnode == dn && txh->txh_type != THT_NEWOBJECT)
match_object = TRUE;
if (txh->txh_dnode == NULL || txh->txh_dnode == dn) {
int datablkshift = dn->dn_datablkshift ?
dn->dn_datablkshift : SPA_MAXBLOCKSHIFT;
int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
int shift = datablkshift + epbs * db->db_level;
uint64_t beginblk = shift >= 64 ? 0 :
(txh->txh_arg1 >> shift);
uint64_t endblk = shift >= 64 ? 0 :
((txh->txh_arg1 + txh->txh_arg2 - 1) >> shift);
uint64_t blkid = db->db_blkid;
/* XXX txh_arg2 better not be zero... */
dprintf("found txh type %x beginblk=%llx endblk=%llx\n",
txh->txh_type, beginblk, endblk);
switch (txh->txh_type) {
case THT_WRITE:
if (blkid >= beginblk && blkid <= endblk)
match_offset = TRUE;
/*
* We will let this hold work for the bonus
* buffer so that we don't need to hold it
* when creating a new object.
*/
if (blkid == DB_BONUS_BLKID)
match_offset = TRUE;
/*
* They might have to increase nlevels,
* thus dirtying the new TLIBs. Or they
* might have to change the block size,
* thus dirtying the new lvl=0 blk=0.
*/
if (blkid == 0)
match_offset = TRUE;
break;
case THT_FREE:
/*
* We will dirty all the level 1 blocks in
* the free range and perhaps the first and
* last level 0 block.
*/
if (blkid >= beginblk && (blkid <= endblk ||
txh->txh_arg2 == DMU_OBJECT_END))
match_offset = TRUE;
break;
case THT_BONUS:
if (blkid == DB_BONUS_BLKID)
match_offset = TRUE;
break;
case THT_ZAP:
match_offset = TRUE;
break;
case THT_NEWOBJECT:
match_object = TRUE;
break;
default:
ASSERT(!"bad txh_type");
}
}
if (match_object && match_offset)
return;
}
panic("dirtying dbuf obj=%llx lvl=%u blkid=%llx but not tx_held\n",
(u_longlong_t)db->db.db_object, db->db_level,
(u_longlong_t)db->db_blkid);
}
#endif
static int
dmu_tx_try_assign(dmu_tx_t *tx, uint64_t txg_how)
{
dmu_tx_hold_t *txh;
spa_t *spa = tx->tx_pool->dp_spa;
uint64_t memory, asize, fsize, usize;
uint64_t towrite, tofree, tooverwrite, tounref, tohold, fudge;
ASSERT3U(tx->tx_txg, ==, 0);
if (tx->tx_err)
return (tx->tx_err);
if (spa_suspended(spa)) {
/*
* If the user has indicated a blocking failure mode
* then return ERESTART which will block in dmu_tx_wait().
* Otherwise, return EIO so that an error can get
* propagated back to the VOP calls.
*
* Note that we always honor the txg_how flag regardless
* of the failuremode setting.
*/
if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_CONTINUE &&
txg_how != TXG_WAIT)
return (EIO);
return (ERESTART);
}
tx->tx_txg = txg_hold_open(tx->tx_pool, &tx->tx_txgh);
tx->tx_needassign_txh = NULL;
/*
* NB: No error returns are allowed after txg_hold_open, but
* before processing the dnode holds, due to the
* dmu_tx_unassign() logic.
*/
towrite = tofree = tooverwrite = tounref = tohold = fudge = 0;
for (txh = list_head(&tx->tx_holds); txh;
txh = list_next(&tx->tx_holds, txh)) {
dnode_t *dn = txh->txh_dnode;
if (dn != NULL) {
mutex_enter(&dn->dn_mtx);
if (dn->dn_assigned_txg == tx->tx_txg - 1) {
mutex_exit(&dn->dn_mtx);
tx->tx_needassign_txh = txh;
return (ERESTART);
}
if (dn->dn_assigned_txg == 0)
dn->dn_assigned_txg = tx->tx_txg;
ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);
(void) refcount_add(&dn->dn_tx_holds, tx);
mutex_exit(&dn->dn_mtx);
}
towrite += txh->txh_space_towrite;
tofree += txh->txh_space_tofree;
tooverwrite += txh->txh_space_tooverwrite;
tounref += txh->txh_space_tounref;
tohold += txh->txh_memory_tohold;
fudge += txh->txh_fudge;
}
/*
* NB: This check must be after we've held the dnodes, so that
* the dmu_tx_unassign() logic will work properly
*/
if (txg_how >= TXG_INITIAL && txg_how != tx->tx_txg)
return (ERESTART);
/*
* If a snapshot has been taken since we made our estimates,
* assume that we won't be able to free or overwrite anything.
*/
if (tx->tx_objset &&
dsl_dataset_prev_snap_txg(tx->tx_objset->os->os_dsl_dataset) >
tx->tx_lastsnap_txg) {
towrite += tooverwrite;
tooverwrite = tofree = 0;
}
/* needed allocation: worst-case estimate of write space */
asize = spa_get_asize(tx->tx_pool->dp_spa, towrite + tooverwrite);
/* freed space estimate: worst-case overwrite + free estimate */
fsize = spa_get_asize(tx->tx_pool->dp_spa, tooverwrite) + tofree;
/* convert unrefd space to worst-case estimate */
usize = spa_get_asize(tx->tx_pool->dp_spa, tounref);
/* calculate memory footprint estimate */
memory = towrite + tooverwrite + tohold;
#ifdef ZFS_DEBUG
/*
* Add in 'tohold' to account for our dirty holds on this memory
* XXX - the "fudge" factor is to account for skipped blocks that
* we missed because dnode_next_offset() misses in-core-only blocks.
*/
tx->tx_space_towrite = asize +
spa_get_asize(tx->tx_pool->dp_spa, tohold + fudge);
tx->tx_space_tofree = tofree;
tx->tx_space_tooverwrite = tooverwrite;
tx->tx_space_tounref = tounref;
#endif
if (tx->tx_dir && asize != 0) {
int err = dsl_dir_tempreserve_space(tx->tx_dir, memory,
asize, fsize, usize, &tx->tx_tempreserve_cookie, tx);
if (err)
return (err);
}
return (0);
}
static void
dmu_tx_unassign(dmu_tx_t *tx)
{
dmu_tx_hold_t *txh;
if (tx->tx_txg == 0)
return;
txg_rele_to_quiesce(&tx->tx_txgh);
for (txh = list_head(&tx->tx_holds); txh != tx->tx_needassign_txh;
txh = list_next(&tx->tx_holds, txh)) {
dnode_t *dn = txh->txh_dnode;
if (dn == NULL)
continue;
mutex_enter(&dn->dn_mtx);
ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);
if (refcount_remove(&dn->dn_tx_holds, tx) == 0) {
dn->dn_assigned_txg = 0;
cv_broadcast(&dn->dn_notxholds);
}
mutex_exit(&dn->dn_mtx);
}
txg_rele_to_sync(&tx->tx_txgh);
tx->tx_lasttried_txg = tx->tx_txg;
tx->tx_txg = 0;
}
/*
* Assign tx to a transaction group. txg_how can be one of:
*
* (1) TXG_WAIT. If the current open txg is full, waits until there's
* a new one. This should be used when you're not holding locks.
* It will only fail if we're truly out of space (or over quota).
*
* (2) TXG_NOWAIT. If we can't assign into the current open txg without
* blocking, returns immediately with ERESTART. This should be used
* whenever you're holding locks. On an ERESTART error, the caller
* should drop locks, do a dmu_tx_wait(tx), and try again.
*
* (3) A specific txg. Use this if you need to ensure that multiple
* transactions all sync in the same txg. Like TXG_NOWAIT, it
* returns ERESTART if it can't assign you into the requested txg.
*/
int
dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how)
{
int err;
ASSERT(tx->tx_txg == 0);
ASSERT(txg_how != 0);
ASSERT(!dsl_pool_sync_context(tx->tx_pool));
while ((err = dmu_tx_try_assign(tx, txg_how)) != 0) {
dmu_tx_unassign(tx);
if (err != ERESTART || txg_how != TXG_WAIT)
return (err);
dmu_tx_wait(tx);
}
txg_rele_to_quiesce(&tx->tx_txgh);
return (0);
}
void
dmu_tx_wait(dmu_tx_t *tx)
{
spa_t *spa = tx->tx_pool->dp_spa;
ASSERT(tx->tx_txg == 0);
/*
* It's possible that the pool has become active after this thread
* has tried to obtain a tx. If that's the case then his
* tx_lasttried_txg would not have been assigned.
*/
if (spa_suspended(spa) || tx->tx_lasttried_txg == 0) {
txg_wait_synced(tx->tx_pool, spa_last_synced_txg(spa) + 1);
} else if (tx->tx_needassign_txh) {
dnode_t *dn = tx->tx_needassign_txh->txh_dnode;
mutex_enter(&dn->dn_mtx);
while (dn->dn_assigned_txg == tx->tx_lasttried_txg - 1)
cv_wait(&dn->dn_notxholds, &dn->dn_mtx);
mutex_exit(&dn->dn_mtx);
tx->tx_needassign_txh = NULL;
} else {
txg_wait_open(tx->tx_pool, tx->tx_lasttried_txg + 1);
}
}
void
dmu_tx_willuse_space(dmu_tx_t *tx, int64_t delta)
{
#ifdef ZFS_DEBUG
if (tx->tx_dir == NULL || delta == 0)
return;
if (delta > 0) {
ASSERT3U(refcount_count(&tx->tx_space_written) + delta, <=,
tx->tx_space_towrite);
(void) refcount_add_many(&tx->tx_space_written, delta, NULL);
} else {
(void) refcount_add_many(&tx->tx_space_freed, -delta, NULL);
}
#endif
}
void
dmu_tx_commit(dmu_tx_t *tx)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg != 0);
while (txh = list_head(&tx->tx_holds)) {
dnode_t *dn = txh->txh_dnode;
list_remove(&tx->tx_holds, txh);
kmem_free(txh, sizeof (dmu_tx_hold_t));
if (dn == NULL)
continue;
mutex_enter(&dn->dn_mtx);
ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);
if (refcount_remove(&dn->dn_tx_holds, tx) == 0) {
dn->dn_assigned_txg = 0;
cv_broadcast(&dn->dn_notxholds);
}
mutex_exit(&dn->dn_mtx);
dnode_rele(dn, tx);
}
if (tx->tx_tempreserve_cookie)
dsl_dir_tempreserve_clear(tx->tx_tempreserve_cookie, tx);
if (tx->tx_anyobj == FALSE)
txg_rele_to_sync(&tx->tx_txgh);
list_destroy(&tx->tx_holds);
#ifdef ZFS_DEBUG
dprintf("towrite=%llu written=%llu tofree=%llu freed=%llu\n",
tx->tx_space_towrite, refcount_count(&tx->tx_space_written),
tx->tx_space_tofree, refcount_count(&tx->tx_space_freed));
refcount_destroy_many(&tx->tx_space_written,
refcount_count(&tx->tx_space_written));
refcount_destroy_many(&tx->tx_space_freed,
refcount_count(&tx->tx_space_freed));
#endif
kmem_free(tx, sizeof (dmu_tx_t));
}
void
dmu_tx_abort(dmu_tx_t *tx)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
while (txh = list_head(&tx->tx_holds)) {
dnode_t *dn = txh->txh_dnode;
list_remove(&tx->tx_holds, txh);
kmem_free(txh, sizeof (dmu_tx_hold_t));
if (dn != NULL)
dnode_rele(dn, tx);
}
list_destroy(&tx->tx_holds);
#ifdef ZFS_DEBUG
refcount_destroy_many(&tx->tx_space_written,
refcount_count(&tx->tx_space_written));
refcount_destroy_many(&tx->tx_space_freed,
refcount_count(&tx->tx_space_freed));
#endif
kmem_free(tx, sizeof (dmu_tx_t));
}
uint64_t
dmu_tx_get_txg(dmu_tx_t *tx)
{
ASSERT(tx->tx_txg != 0);
return (tx->tx_txg);
}
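The worst-case space estimate computed in dmu_tx_count_write() above is easier to follow with concrete numbers. The sketch below reproduces only the arithmetic in userland C. The function name sketch_count_write and the SK_ macros are hypothetical, the block-pointer shift of 7 (128-byte block pointers) is an assumed value for SPA_BLKPTRSHIFT, and the I/O error checks and the dnode-dependent choice of shifts are left out; the caller passes the shifts directly.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of SPA_BLKPTRSHIFT (128-byte block pointers). */
#define	SK_BLKPTRSHIFT		7

/* Power-of-two helpers, equivalent in intent to P2ALIGN/P2ROUNDUP. */
#define	SK_P2ALIGN(x, a)	((x) & ~((a) - 1))
#define	SK_P2ROUNDUP(x, a)	(((x) + (a) - 1) & ~((a) - 1))

/*
 * Return the estimated number of bytes a write of 'len' bytes at 'off'
 * may dirty: the data blocks, plus one or more indirect blocks at every
 * possible level of indirection.
 */
static uint64_t
sketch_count_write(uint64_t off, uint64_t len, int min_bs, int max_bs,
    int min_ibs, int max_ibs)
{
	uint64_t start, end, towrite = 0;
	int epbs, bits;

	/* epbs must be positive or the loop below would not terminate. */
	assert(min_ibs > SK_BLKPTRSHIFT);

	if (len == 0)
		return (0);

	/* Data blocks: round the range out to the largest block size. */
	start = SK_P2ALIGN(off, 1ULL << max_bs);
	end = SK_P2ROUNDUP(off + len, 1ULL << max_bs) - 1;
	towrite += end - start + 1;

	/* Convert byte offsets to block ids using the smallest block size. */
	start >>= min_bs;
	end >>= min_bs;
	epbs = min_ibs - SK_BLKPTRSHIFT;

	/* Walk up the tree; each level maps 2^epbs blocks of the level below. */
	for (bits = 64 - min_bs; bits >= 0; bits -= epbs) {
		start >>= epbs;
		end >>= epbs;
		/* Extra blkid=0 indirect block if the tree may grow a level. */
		if (start != 0 && end != 0)
			towrite += 1ULL << max_ibs;
		towrite += (end - start + 1) << max_ibs;
	}
	return (towrite);
}
```

With a fixed 4 KB data block (min_bs = max_bs = 12) and 16 KB indirect blocks (min_ibs = max_ibs = 14), a 4096-byte write at offset 0 is charged 4096 bytes of data plus eight levels of one 16 KB indirect block each, for a total of 135168 bytes. The estimate is deliberately pessimistic: it charges for every level the loop can reach, not for the levels the object actually has.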
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c (revision 209274)
@@ -1,1446 +1,1437 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/dbuf.h>
#include <sys/dnode.h>
#include <sys/dmu.h>
#include <sys/dmu_impl.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_dataset.h>
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/dmu_zfetch.h>
static int free_range_compar(const void *node1, const void *node2);
static kmem_cache_t *dnode_cache;
static dnode_phys_t dnode_phys_zero;
int zfs_default_bs = SPA_MINBLOCKSHIFT;
int zfs_default_ibs = DN_MAX_INDBLKSHIFT;
/* ARGSUSED */
static int
dnode_cons(void *arg, void *unused, int kmflag)
{
int i;
dnode_t *dn = arg;
bzero(dn, sizeof (dnode_t));
rw_init(&dn->dn_struct_rwlock, NULL, RW_DEFAULT, NULL);
mutex_init(&dn->dn_mtx, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&dn->dn_dbufs_mtx, NULL, MUTEX_DEFAULT, NULL);
cv_init(&dn->dn_notxholds, NULL, CV_DEFAULT, NULL);
refcount_create(&dn->dn_holds);
refcount_create(&dn->dn_tx_holds);
for (i = 0; i < TXG_SIZE; i++) {
avl_create(&dn->dn_ranges[i], free_range_compar,
sizeof (free_range_t),
offsetof(struct free_range, fr_node));
list_create(&dn->dn_dirty_records[i],
sizeof (dbuf_dirty_record_t),
offsetof(dbuf_dirty_record_t, dr_dirty_node));
}
list_create(&dn->dn_dbufs, sizeof (dmu_buf_impl_t),
offsetof(dmu_buf_impl_t, db_link));
return (0);
}
/* ARGSUSED */
static void
dnode_dest(void *arg, void *unused)
{
int i;
dnode_t *dn = arg;
rw_destroy(&dn->dn_struct_rwlock);
mutex_destroy(&dn->dn_mtx);
mutex_destroy(&dn->dn_dbufs_mtx);
cv_destroy(&dn->dn_notxholds);
refcount_destroy(&dn->dn_holds);
refcount_destroy(&dn->dn_tx_holds);
for (i = 0; i < TXG_SIZE; i++) {
avl_destroy(&dn->dn_ranges[i]);
list_destroy(&dn->dn_dirty_records[i]);
}
list_destroy(&dn->dn_dbufs);
}
void
dnode_init(void)
{
dnode_cache = kmem_cache_create("dnode_t",
sizeof (dnode_t),
0, dnode_cons, dnode_dest, NULL, NULL, NULL, 0);
}
void
dnode_fini(void)
{
kmem_cache_destroy(dnode_cache);
}
#ifdef ZFS_DEBUG
void
dnode_verify(dnode_t *dn)
{
int drop_struct_lock = FALSE;
ASSERT(dn->dn_phys);
ASSERT(dn->dn_objset);
ASSERT(dn->dn_phys->dn_type < DMU_OT_NUMTYPES);
if (!(zfs_flags & ZFS_DEBUG_DNODE_VERIFY))
return;
if (!RW_WRITE_HELD(&dn->dn_struct_rwlock)) {
rw_enter(&dn->dn_struct_rwlock, RW_READER);
drop_struct_lock = TRUE;
}
if (dn->dn_phys->dn_type != DMU_OT_NONE || dn->dn_allocated_txg != 0) {
int i;
ASSERT3U(dn->dn_indblkshift, >=, 0);
ASSERT3U(dn->dn_indblkshift, <=, SPA_MAXBLOCKSHIFT);
if (dn->dn_datablkshift) {
ASSERT3U(dn->dn_datablkshift, >=, SPA_MINBLOCKSHIFT);
ASSERT3U(dn->dn_datablkshift, <=, SPA_MAXBLOCKSHIFT);
ASSERT3U(1<<dn->dn_datablkshift, ==, dn->dn_datablksz);
}
ASSERT3U(dn->dn_nlevels, <=, 30);
ASSERT3U(dn->dn_type, <=, DMU_OT_NUMTYPES);
ASSERT3U(dn->dn_nblkptr, >=, 1);
ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
ASSERT3U(dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
ASSERT3U(dn->dn_datablksz, ==,
dn->dn_datablkszsec << SPA_MINBLOCKSHIFT);
ASSERT3U(ISP2(dn->dn_datablksz), ==, dn->dn_datablkshift != 0);
ASSERT3U((dn->dn_nblkptr - 1) * sizeof (blkptr_t) +
dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
for (i = 0; i < TXG_SIZE; i++) {
ASSERT3U(dn->dn_next_nlevels[i], <=, dn->dn_nlevels);
}
}
if (dn->dn_phys->dn_type != DMU_OT_NONE)
ASSERT3U(dn->dn_phys->dn_nlevels, <=, dn->dn_nlevels);
ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT || dn->dn_dbuf != NULL);
if (dn->dn_dbuf != NULL) {
ASSERT3P(dn->dn_phys, ==,
(dnode_phys_t *)dn->dn_dbuf->db.db_data +
(dn->dn_object % (dn->dn_dbuf->db.db_size >> DNODE_SHIFT)));
}
if (drop_struct_lock)
rw_exit(&dn->dn_struct_rwlock);
}
#endif
void
dnode_byteswap(dnode_phys_t *dnp)
{
uint64_t *buf64 = (void*)&dnp->dn_blkptr;
int i;
if (dnp->dn_type == DMU_OT_NONE) {
bzero(dnp, sizeof (dnode_phys_t));
return;
}
dnp->dn_datablkszsec = BSWAP_16(dnp->dn_datablkszsec);
dnp->dn_bonuslen = BSWAP_16(dnp->dn_bonuslen);
dnp->dn_maxblkid = BSWAP_64(dnp->dn_maxblkid);
dnp->dn_used = BSWAP_64(dnp->dn_used);
/*
* dn_nblkptr is only one byte, so it's OK to read it in either
* byte order. We can't read dn_bonuslen.
*/
ASSERT(dnp->dn_indblkshift <= SPA_MAXBLOCKSHIFT);
ASSERT(dnp->dn_nblkptr <= DN_MAX_NBLKPTR);
for (i = 0; i < dnp->dn_nblkptr * sizeof (blkptr_t)/8; i++)
buf64[i] = BSWAP_64(buf64[i]);
/*
* OK to check dn_bonuslen for zero, because it won't matter if
* we have the wrong byte order. This is necessary because the
* dnode dnode is smaller than a regular dnode.
*/
if (dnp->dn_bonuslen != 0) {
/*
* Note that the bonus length calculated here may be
* longer than the actual bonus buffer. This is because
* we always put the bonus buffer after the last block
* pointer (instead of packing it against the end of the
* dnode buffer).
*/
int off = (dnp->dn_nblkptr-1) * sizeof (blkptr_t);
size_t len = DN_MAX_BONUSLEN - off;
ASSERT3U(dnp->dn_bonustype, <, DMU_OT_NUMTYPES);
dmu_ot[dnp->dn_bonustype].ot_byteswap(dnp->dn_bonus + off, len);
}
}
void
dnode_buf_byteswap(void *vbuf, size_t size)
{
dnode_phys_t *buf = vbuf;
int i;
ASSERT3U(sizeof (dnode_phys_t), ==, (1<<DNODE_SHIFT));
ASSERT((size & (sizeof (dnode_phys_t)-1)) == 0);
size >>= DNODE_SHIFT;
for (i = 0; i < size; i++) {
dnode_byteswap(buf);
buf++;
}
}
static int
free_range_compar(const void *node1, const void *node2)
{
const free_range_t *rp1 = node1;
const free_range_t *rp2 = node2;
if (rp1->fr_blkid < rp2->fr_blkid)
return (-1);
else if (rp1->fr_blkid > rp2->fr_blkid)
return (1);
else return (0);
}
void
dnode_setbonuslen(dnode_t *dn, int newsize, dmu_tx_t *tx)
{
ASSERT3U(refcount_count(&dn->dn_holds), >=, 1);
dnode_setdirty(dn, tx);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
ASSERT3U(newsize, <=, DN_MAX_BONUSLEN -
(dn->dn_nblkptr-1) * sizeof (blkptr_t));
dn->dn_bonuslen = newsize;
if (newsize == 0)
dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = DN_ZERO_BONUSLEN;
else
dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
rw_exit(&dn->dn_struct_rwlock);
}
static void
dnode_setdblksz(dnode_t *dn, int size)
{
ASSERT3U(P2PHASE(size, SPA_MINBLOCKSIZE), ==, 0);
ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
ASSERT3U(size, >=, SPA_MINBLOCKSIZE);
ASSERT3U(size >> SPA_MINBLOCKSHIFT, <,
1<<(sizeof (dn->dn_phys->dn_datablkszsec) * 8));
dn->dn_datablksz = size;
dn->dn_datablkszsec = size >> SPA_MINBLOCKSHIFT;
dn->dn_datablkshift = ISP2(size) ? highbit(size - 1) : 0;
}
static dnode_t *
dnode_create(objset_impl_t *os, dnode_phys_t *dnp, dmu_buf_impl_t *db,
uint64_t object)
{
dnode_t *dn = kmem_cache_alloc(dnode_cache, KM_SLEEP);
dn->dn_objset = os;
dn->dn_object = object;
dn->dn_dbuf = db;
dn->dn_phys = dnp;
if (dnp->dn_datablkszsec)
dnode_setdblksz(dn, dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
dn->dn_indblkshift = dnp->dn_indblkshift;
dn->dn_nlevels = dnp->dn_nlevels;
dn->dn_type = dnp->dn_type;
dn->dn_nblkptr = dnp->dn_nblkptr;
dn->dn_checksum = dnp->dn_checksum;
dn->dn_compress = dnp->dn_compress;
dn->dn_bonustype = dnp->dn_bonustype;
dn->dn_bonuslen = dnp->dn_bonuslen;
dn->dn_maxblkid = dnp->dn_maxblkid;
dmu_zfetch_init(&dn->dn_zfetch, dn);
ASSERT(dn->dn_phys->dn_type < DMU_OT_NUMTYPES);
mutex_enter(&os->os_lock);
list_insert_head(&os->os_dnodes, dn);
mutex_exit(&os->os_lock);
arc_space_consume(sizeof (dnode_t), ARC_SPACE_OTHER);
return (dn);
}
static void
dnode_destroy(dnode_t *dn)
{
objset_impl_t *os = dn->dn_objset;
#ifdef ZFS_DEBUG
int i;
for (i = 0; i < TXG_SIZE; i++) {
ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
ASSERT(NULL == list_head(&dn->dn_dirty_records[i]));
ASSERT(0 == avl_numnodes(&dn->dn_ranges[i]));
}
ASSERT(NULL == list_head(&dn->dn_dbufs));
#endif
mutex_enter(&os->os_lock);
list_remove(&os->os_dnodes, dn);
mutex_exit(&os->os_lock);
if (dn->dn_dirtyctx_firstset) {
kmem_free(dn->dn_dirtyctx_firstset, 1);
dn->dn_dirtyctx_firstset = NULL;
}
dmu_zfetch_rele(&dn->dn_zfetch);
if (dn->dn_bonus) {
mutex_enter(&dn->dn_bonus->db_mtx);
dbuf_evict(dn->dn_bonus);
dn->dn_bonus = NULL;
}
kmem_cache_free(dnode_cache, dn);
arc_space_return(sizeof (dnode_t), ARC_SPACE_OTHER);
}
void
dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
{
int i;
if (blocksize == 0)
blocksize = 1 << zfs_default_bs;
else if (blocksize > SPA_MAXBLOCKSIZE)
blocksize = SPA_MAXBLOCKSIZE;
else
blocksize = P2ROUNDUP(blocksize, SPA_MINBLOCKSIZE);
if (ibs == 0)
ibs = zfs_default_ibs;
ibs = MIN(MAX(ibs, DN_MIN_INDBLKSHIFT), DN_MAX_INDBLKSHIFT);
dprintf("os=%p obj=%llu txg=%llu blocksize=%d ibs=%d\n", dn->dn_objset,
dn->dn_object, tx->tx_txg, blocksize, ibs);
ASSERT(dn->dn_type == DMU_OT_NONE);
ASSERT(bcmp(dn->dn_phys, &dnode_phys_zero, sizeof (dnode_phys_t)) == 0);
ASSERT(dn->dn_phys->dn_type == DMU_OT_NONE);
ASSERT(ot != DMU_OT_NONE);
ASSERT3U(ot, <, DMU_OT_NUMTYPES);
ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
(bonustype != DMU_OT_NONE && bonuslen != 0));
ASSERT3U(bonustype, <, DMU_OT_NUMTYPES);
ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);
ASSERT(dn->dn_type == DMU_OT_NONE);
ASSERT3U(dn->dn_maxblkid, ==, 0);
ASSERT3U(dn->dn_allocated_txg, ==, 0);
ASSERT3U(dn->dn_assigned_txg, ==, 0);
ASSERT(refcount_is_zero(&dn->dn_tx_holds));
ASSERT3U(refcount_count(&dn->dn_holds), <=, 1);
ASSERT3P(list_head(&dn->dn_dbufs), ==, NULL);
for (i = 0; i < TXG_SIZE; i++) {
ASSERT3U(dn->dn_next_nlevels[i], ==, 0);
ASSERT3U(dn->dn_next_indblkshift[i], ==, 0);
ASSERT3U(dn->dn_next_bonuslen[i], ==, 0);
ASSERT3U(dn->dn_next_blksz[i], ==, 0);
ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
ASSERT3P(list_head(&dn->dn_dirty_records[i]), ==, NULL);
ASSERT3U(avl_numnodes(&dn->dn_ranges[i]), ==, 0);
}
dn->dn_type = ot;
dnode_setdblksz(dn, blocksize);
dn->dn_indblkshift = ibs;
dn->dn_nlevels = 1;
dn->dn_nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
dn->dn_bonustype = bonustype;
dn->dn_bonuslen = bonuslen;
dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
dn->dn_compress = ZIO_COMPRESS_INHERIT;
dn->dn_dirtyctx = 0;
dn->dn_free_txg = 0;
if (dn->dn_dirtyctx_firstset) {
kmem_free(dn->dn_dirtyctx_firstset, 1);
dn->dn_dirtyctx_firstset = NULL;
}
dn->dn_allocated_txg = tx->tx_txg;
dnode_setdirty(dn, tx);
dn->dn_next_indblkshift[tx->tx_txg & TXG_MASK] = ibs;
dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
dn->dn_next_blksz[tx->tx_txg & TXG_MASK] = dn->dn_datablksz;
}
void
dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
{
int nblkptr;
ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE);
ASSERT3U(blocksize, <=, SPA_MAXBLOCKSIZE);
ASSERT3U(blocksize % SPA_MINBLOCKSIZE, ==, 0);
ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx));
ASSERT(tx->tx_txg != 0);
ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
(bonustype != DMU_OT_NONE && bonuslen != 0));
ASSERT3U(bonustype, <, DMU_OT_NUMTYPES);
ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);
/* clean up any unreferenced dbufs */
dnode_evict_dbufs(dn);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
dnode_setdirty(dn, tx);
if (dn->dn_datablksz != blocksize) {
/* change blocksize */
ASSERT(dn->dn_maxblkid == 0 &&
(BP_IS_HOLE(&dn->dn_phys->dn_blkptr[0]) ||
dnode_block_freed(dn, 0)));
dnode_setdblksz(dn, blocksize);
dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = blocksize;
}
if (dn->dn_bonuslen != bonuslen)
dn->dn_next_bonuslen[tx->tx_txg&TXG_MASK] = bonuslen;
nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
if (dn->dn_nblkptr != nblkptr)
dn->dn_next_nblkptr[tx->tx_txg&TXG_MASK] = nblkptr;
rw_exit(&dn->dn_struct_rwlock);
/* change type */
dn->dn_type = ot;
/* change bonus size and type */
mutex_enter(&dn->dn_mtx);
dn->dn_bonustype = bonustype;
dn->dn_bonuslen = bonuslen;
dn->dn_nblkptr = nblkptr;
dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
dn->dn_compress = ZIO_COMPRESS_INHERIT;
ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
/* fix up the bonus db_size */
if (dn->dn_bonus) {
dn->dn_bonus->db.db_size =
DN_MAX_BONUSLEN - (dn->dn_nblkptr-1) * sizeof (blkptr_t);
ASSERT(dn->dn_bonuslen <= dn->dn_bonus->db.db_size);
}
dn->dn_allocated_txg = tx->tx_txg;
mutex_exit(&dn->dn_mtx);
}
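The expression `1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT)` used by dnode_allocate() and dnode_reallocate() trades bonus space for block pointers, and the "fix up the bonus db_size" step is its inverse. A minimal standalone sketch follows. The `sketch_` names are hypothetical, and the constants (128-byte block pointers, a 320-byte maximum bonus) are assumptions about this ZFS version that should be checked against dnode.h and spa.h.

```c
#include <assert.h>

/* Assumed values for this ZFS version; verify against dnode.h and spa.h. */
#define	SKETCH_BLKPTRSHIFT	7	/* a block pointer is 1 << 7 = 128 bytes */
#define	SKETCH_MAX_BONUSLEN	320	/* assumed value of DN_MAX_BONUSLEN */

/* Number of block pointers that fit once bonuslen bytes are reserved. */
static int
sketch_nblkptr(int bonuslen)
{
	assert(bonuslen >= 0 && bonuslen <= SKETCH_MAX_BONUSLEN);
	return (1 + ((SKETCH_MAX_BONUSLEN - bonuslen) >> SKETCH_BLKPTRSHIFT));
}

/* Bonus buffer capacity left over after the extra block pointers. */
static int
sketch_bonus_capacity(int nblkptr)
{
	assert(nblkptr >= 1);
	return (SKETCH_MAX_BONUSLEN - ((nblkptr - 1) << SKETCH_BLKPTRSHIFT));
}
```

Under these assumptions a dnode with no bonus data gets three block pointers, while a bonus of 193 bytes or more leaves room for only one. The capacity returned for the resulting pointer count is never smaller than the requested bonus length, which is what the ASSERT after the db_size fix-up checks.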
void
dnode_special_close(dnode_t *dn)
{
/*
* Wait for final references to the dnode to clear. This can
* only happen if the arc is asynchronously evicting state that
* has a hold on this dnode while we are trying to evict this
* dnode.
*/
while (refcount_count(&dn->dn_holds) > 0)
delay(1);
dnode_destroy(dn);
}
dnode_t *
dnode_special_open(objset_impl_t *os, dnode_phys_t *dnp, uint64_t object)
{
dnode_t *dn = dnode_create(os, dnp, NULL, object);
DNODE_VERIFY(dn);
return (dn);
}
static void
dnode_buf_pageout(dmu_buf_t *db, void *arg)
{
dnode_t **children_dnodes = arg;
int i;
int epb = db->db_size >> DNODE_SHIFT;
for (i = 0; i < epb; i++) {
dnode_t *dn = children_dnodes[i];
int n;
if (dn == NULL)
continue;
#ifdef ZFS_DEBUG
/*
* If there are holds on this dnode, then there should
* be holds on the dnode's containing dbuf as well; thus
* it wouldn't be eligible for eviction and this function
* would not have been called.
*/
ASSERT(refcount_is_zero(&dn->dn_holds));
ASSERT(list_head(&dn->dn_dbufs) == NULL);
ASSERT(refcount_is_zero(&dn->dn_tx_holds));
for (n = 0; n < TXG_SIZE; n++)
ASSERT(!list_link_active(&dn->dn_dirty_link[n]));
#endif
children_dnodes[i] = NULL;
dnode_destroy(dn);
}
kmem_free(children_dnodes, epb * sizeof (dnode_t *));
}
/*
* errors:
* EINVAL - invalid object number.
* EIO - i/o error.
* succeeds even for free dnodes.
*/
int
dnode_hold_impl(objset_impl_t *os, uint64_t object, int flag,
void *tag, dnode_t **dnp)
{
int epb, idx, err;
int drop_struct_lock = FALSE;
int type;
uint64_t blk;
dnode_t *mdn, *dn;
dmu_buf_impl_t *db;
dnode_t **children_dnodes;
/*
* If you are holding the spa config lock as writer, you shouldn't
* be asking the DMU to do *anything*.
*/
ASSERT(spa_config_held(os->os_spa, SCL_ALL, RW_WRITER) == 0);
if (object == 0 || object >= DN_MAX_OBJECT)
return (EINVAL);
mdn = os->os_meta_dnode;
DNODE_VERIFY(mdn);
if (!RW_WRITE_HELD(&mdn->dn_struct_rwlock)) {
rw_enter(&mdn->dn_struct_rwlock, RW_READER);
drop_struct_lock = TRUE;
}
blk = dbuf_whichblock(mdn, object * sizeof (dnode_phys_t));
db = dbuf_hold(mdn, blk, FTAG);
if (drop_struct_lock)
rw_exit(&mdn->dn_struct_rwlock);
if (db == NULL)
return (EIO);
err = dbuf_read(db, NULL, DB_RF_CANFAIL);
if (err) {
dbuf_rele(db, FTAG);
return (err);
}
ASSERT3U(db->db.db_size, >=, 1<<DNODE_SHIFT);
epb = db->db.db_size >> DNODE_SHIFT;
idx = object & (epb-1);
children_dnodes = dmu_buf_get_user(&db->db);
if (children_dnodes == NULL) {
dnode_t **winner;
children_dnodes = kmem_zalloc(epb * sizeof (dnode_t *),
KM_SLEEP);
if (winner = dmu_buf_set_user(&db->db, children_dnodes, NULL,
dnode_buf_pageout)) {
kmem_free(children_dnodes, epb * sizeof (dnode_t *));
children_dnodes = winner;
}
}
if ((dn = children_dnodes[idx]) == NULL) {
dnode_phys_t *dnp = (dnode_phys_t *)db->db.db_data+idx;
dnode_t *winner;
dn = dnode_create(os, dnp, db, object);
winner = atomic_cas_ptr(&children_dnodes[idx], NULL, dn);
if (winner != NULL) {
dnode_destroy(dn);
dn = winner;
}
}
mutex_enter(&dn->dn_mtx);
type = dn->dn_type;
if (dn->dn_free_txg ||
((flag & DNODE_MUST_BE_ALLOCATED) && type == DMU_OT_NONE) ||
((flag & DNODE_MUST_BE_FREE) && type != DMU_OT_NONE)) {
mutex_exit(&dn->dn_mtx);
dbuf_rele(db, FTAG);
return (type == DMU_OT_NONE ? ENOENT : EEXIST);
}
mutex_exit(&dn->dn_mtx);
if (refcount_add(&dn->dn_holds, tag) == 1)
dbuf_add_ref(db, dn);
DNODE_VERIFY(dn);
ASSERT3P(dn->dn_dbuf, ==, db);
ASSERT3U(dn->dn_object, ==, object);
dbuf_rele(db, FTAG);
*dnp = dn;
return (0);
}
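dnode_hold_impl() locates an object in two steps: it picks the meta-dnode block that contains the object, then the slot inside that block (`idx = object & (epb-1)`). The arithmetic can be sketched on its own as below. The `sketch_` names are hypothetical, the 512-byte dnode size is an assumption, and the sketch assumes a power-of-two meta-dnode block size so that the block lookup reduces to a shift.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_DNODE_SHIFT	9	/* assumed: 512-byte on-disk dnodes */

typedef struct sketch_dnode_loc {
	uint64_t blk;	/* meta-dnode block that holds the object */
	int idx;	/* slot of the object within that block */
} sketch_dnode_loc_t;

/* blkshift is log2 of the meta-dnode data block size (assumed power of 2). */
static sketch_dnode_loc_t
sketch_locate_dnode(uint64_t object, int blkshift)
{
	sketch_dnode_loc_t loc;
	int epb;

	assert(blkshift >= SKETCH_DNODE_SHIFT && blkshift < 31);
	epb = 1 << (blkshift - SKETCH_DNODE_SHIFT);	/* dnodes per block */
	loc.blk = (object << SKETCH_DNODE_SHIFT) >> blkshift;
	loc.idx = (int)(object & (uint64_t)(epb - 1));
	return (loc);
}
```

With 16 KB blocks (blkshift 14) each block holds 32 dnodes, so object 33 lands in block 1 at slot 1. The mask works only because the number of dnodes per block is a power of two.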
/*
* Return held dnode if the object is allocated, NULL if not.
*/
int
dnode_hold(objset_impl_t *os, uint64_t object, void *tag, dnode_t **dnp)
{
return (dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED, tag, dnp));
}
/*
* Can only add a reference if there is already at least one
* reference on the dnode. Returns FALSE if unable to add a
* new reference.
*/
boolean_t
dnode_add_ref(dnode_t *dn, void *tag)
{
mutex_enter(&dn->dn_mtx);
if (refcount_is_zero(&dn->dn_holds)) {
mutex_exit(&dn->dn_mtx);
return (FALSE);
}
VERIFY(1 < refcount_add(&dn->dn_holds, tag));
mutex_exit(&dn->dn_mtx);
return (TRUE);
}
void
dnode_rele(dnode_t *dn, void *tag)
{
uint64_t refs;
mutex_enter(&dn->dn_mtx);
refs = refcount_remove(&dn->dn_holds, tag);
mutex_exit(&dn->dn_mtx);
/* NOTE: the DNODE_DNODE does not have a dn_dbuf */
if (refs == 0 && dn->dn_dbuf)
dbuf_rele(dn->dn_dbuf, dn);
}
void
dnode_setdirty(dnode_t *dn, dmu_tx_t *tx)
{
objset_impl_t *os = dn->dn_objset;
uint64_t txg = tx->tx_txg;
if (dn->dn_object == DMU_META_DNODE_OBJECT)
return;
DNODE_VERIFY(dn);
#ifdef ZFS_DEBUG
mutex_enter(&dn->dn_mtx);
ASSERT(dn->dn_phys->dn_type || dn->dn_allocated_txg);
/* ASSERT(dn->dn_free_txg == 0 || dn->dn_free_txg >= txg); */
mutex_exit(&dn->dn_mtx);
#endif
mutex_enter(&os->os_lock);
/*
* If we are already marked dirty, we're done.
*/
if (list_link_active(&dn->dn_dirty_link[txg & TXG_MASK])) {
mutex_exit(&os->os_lock);
return;
}
ASSERT(!refcount_is_zero(&dn->dn_holds) || list_head(&dn->dn_dbufs));
ASSERT(dn->dn_datablksz != 0);
ASSERT3U(dn->dn_next_bonuslen[txg&TXG_MASK], ==, 0);
ASSERT3U(dn->dn_next_blksz[txg&TXG_MASK], ==, 0);
dprintf_ds(os->os_dsl_dataset, "obj=%llu txg=%llu\n",
dn->dn_object, txg);
if (dn->dn_free_txg > 0 && dn->dn_free_txg <= txg) {
list_insert_tail(&os->os_free_dnodes[txg&TXG_MASK], dn);
} else {
list_insert_tail(&os->os_dirty_dnodes[txg&TXG_MASK], dn);
}
mutex_exit(&os->os_lock);
/*
* The dnode maintains a hold on its containing dbuf as
* long as there are holds on it. Each instantiated child
* dbuf maintains a hold on the dnode. When the last child
* drops its hold, the dnode will drop its hold on the
* containing dbuf. We add a "dirty hold" here so that the
* dnode will hang around after we finish processing its
* children.
*/
VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg));
(void) dbuf_dirty(dn->dn_dbuf, tx);
dsl_dataset_dirty(os->os_dsl_dataset, tx);
}
void
dnode_free(dnode_t *dn, dmu_tx_t *tx)
{
int txgoff = tx->tx_txg & TXG_MASK;
dprintf("dn=%p txg=%llu\n", dn, tx->tx_txg);
/* we should be the only holder... hopefully */
/* ASSERT3U(refcount_count(&dn->dn_holds), ==, 1); */
mutex_enter(&dn->dn_mtx);
if (dn->dn_type == DMU_OT_NONE || dn->dn_free_txg) {
mutex_exit(&dn->dn_mtx);
return;
}
dn->dn_free_txg = tx->tx_txg;
mutex_exit(&dn->dn_mtx);
/*
* If the dnode is already dirty, it needs to be moved from
* the dirty list to the free list.
*/
mutex_enter(&dn->dn_objset->os_lock);
if (list_link_active(&dn->dn_dirty_link[txgoff])) {
list_remove(&dn->dn_objset->os_dirty_dnodes[txgoff], dn);
list_insert_tail(&dn->dn_objset->os_free_dnodes[txgoff], dn);
mutex_exit(&dn->dn_objset->os_lock);
} else {
mutex_exit(&dn->dn_objset->os_lock);
dnode_setdirty(dn, tx);
}
}
/*
* Try to change the block size for the indicated dnode. This can only
* succeed if there are no blocks allocated or dirty beyond the first block.
*/
int
dnode_set_blksz(dnode_t *dn, uint64_t size, int ibs, dmu_tx_t *tx)
{
dmu_buf_impl_t *db, *db_next;
int err;
if (size == 0)
size = SPA_MINBLOCKSIZE;
if (size > SPA_MAXBLOCKSIZE)
size = SPA_MAXBLOCKSIZE;
else
size = P2ROUNDUP(size, SPA_MINBLOCKSIZE);
if (ibs == dn->dn_indblkshift)
ibs = 0;
if (size >> SPA_MINBLOCKSHIFT == dn->dn_datablkszsec && ibs == 0)
return (0);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
/* Check for any allocated blocks beyond the first */
if (dn->dn_phys->dn_maxblkid != 0)
goto fail;
mutex_enter(&dn->dn_dbufs_mtx);
for (db = list_head(&dn->dn_dbufs); db; db = db_next) {
db_next = list_next(&dn->dn_dbufs, db);
if (db->db_blkid != 0 && db->db_blkid != DB_BONUS_BLKID) {
mutex_exit(&dn->dn_dbufs_mtx);
goto fail;
}
}
mutex_exit(&dn->dn_dbufs_mtx);
if (ibs && dn->dn_nlevels != 1)
goto fail;
/* resize the old block */
err = dbuf_hold_impl(dn, 0, 0, TRUE, FTAG, &db);
if (err == 0)
dbuf_new_size(db, size, tx);
else if (err != ENOENT)
goto fail;
dnode_setdblksz(dn, size);
dnode_setdirty(dn, tx);
dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = size;
if (ibs) {
dn->dn_indblkshift = ibs;
dn->dn_next_indblkshift[tx->tx_txg&TXG_MASK] = ibs;
}
/* rele after we have fixed the blocksize in the dnode */
if (db)
dbuf_rele(db, FTAG);
rw_exit(&dn->dn_struct_rwlock);
return (0);
fail:
rw_exit(&dn->dn_struct_rwlock);
return (ENOTSUP);
}
/* read-holding callers must not rely on the lock being continuously held */
void
dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx, boolean_t have_read)
{
uint64_t txgoff = tx->tx_txg & TXG_MASK;
int epbs, new_nlevels;
uint64_t sz;
ASSERT(blkid != DB_BONUS_BLKID);
ASSERT(have_read ?
RW_READ_HELD(&dn->dn_struct_rwlock) :
RW_WRITE_HELD(&dn->dn_struct_rwlock));
/*
* if we have a read-lock, check to see if we need to do any work
* before upgrading to a write-lock.
*/
if (have_read) {
if (blkid <= dn->dn_maxblkid)
return;
if (!rw_tryupgrade(&dn->dn_struct_rwlock)) {
rw_exit(&dn->dn_struct_rwlock);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
}
}
if (blkid <= dn->dn_maxblkid)
goto out;
dn->dn_maxblkid = blkid;
/*
* Compute the number of levels necessary to support the new maxblkid.
*/
new_nlevels = 1;
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
for (sz = dn->dn_nblkptr;
sz <= blkid && sz >= dn->dn_nblkptr; sz <<= epbs)
new_nlevels++;
if (new_nlevels > dn->dn_nlevels) {
int old_nlevels = dn->dn_nlevels;
dmu_buf_impl_t *db;
list_t *list;
dbuf_dirty_record_t *new, *dr, *dr_next;
dn->dn_nlevels = new_nlevels;
ASSERT3U(new_nlevels, >, dn->dn_next_nlevels[txgoff]);
dn->dn_next_nlevels[txgoff] = new_nlevels;
/* dirty the left indirects */
db = dbuf_hold_level(dn, old_nlevels, 0, FTAG);
new = dbuf_dirty(db, tx);
dbuf_rele(db, FTAG);
/* transfer the dirty records to the new indirect */
mutex_enter(&dn->dn_mtx);
mutex_enter(&new->dt.di.dr_mtx);
list = &dn->dn_dirty_records[txgoff];
for (dr = list_head(list); dr; dr = dr_next) {
dr_next = list_next(&dn->dn_dirty_records[txgoff], dr);
if (dr->dr_dbuf->db_level != new_nlevels-1 &&
dr->dr_dbuf->db_blkid != DB_BONUS_BLKID) {
ASSERT(dr->dr_dbuf->db_level == old_nlevels-1);
list_remove(&dn->dn_dirty_records[txgoff], dr);
list_insert_tail(&new->dt.di.dr_children, dr);
dr->dr_parent = new;
}
}
mutex_exit(&new->dt.di.dr_mtx);
mutex_exit(&dn->dn_mtx);
}
out:
if (have_read)
rw_downgrade(&dn->dn_struct_rwlock);
}
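The loop in dnode_new_blkid() that computes the number of levels needed for a new maxblkid can be tried in isolation. The sketch below copies that loop into a hypothetical helper, `sketch_levels_for_blkid`, and takes the dnode fields as plain parameters. The second loop condition (`sz >= nblkptr`) is what stops the loop once the shift wraps around to a small value.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Levels of block tree needed to address blkid, given nblkptr block
 * pointers in the dnode and 2^(indblkshift - blkptrshift) pointers per
 * indirect block.  Mirrors the loop in dnode_new_blkid().
 */
static int
sketch_levels_for_blkid(uint64_t blkid, uint64_t nblkptr, int indblkshift,
    int blkptrshift)
{
	int epbs = indblkshift - blkptrshift;
	int nlevels = 1;
	uint64_t sz;

	assert(nblkptr >= 1 && epbs > 0 && epbs < 64);
	/* sz is the number of data blocks reachable with nlevels levels. */
	for (sz = nblkptr; sz <= blkid && sz >= nblkptr; sz <<= epbs)
		nlevels++;
	return (nlevels);
}
```

For example, with three block pointers and 128 pointers per indirect block (indblkshift 14, blkptrshift 7), block IDs 0 through 2 need one level, 3 through 383 need two, and 384 is the first ID that needs three.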
void
dnode_clear_range(dnode_t *dn, uint64_t blkid, uint64_t nblks, dmu_tx_t *tx)
{
avl_tree_t *tree = &dn->dn_ranges[tx->tx_txg&TXG_MASK];
avl_index_t where;
free_range_t *rp;
free_range_t rp_tofind;
uint64_t endblk = blkid + nblks;
ASSERT(MUTEX_HELD(&dn->dn_mtx));
ASSERT(nblks <= UINT64_MAX - blkid); /* no overflow */
dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n",
blkid, nblks, tx->tx_txg);
rp_tofind.fr_blkid = blkid;
rp = avl_find(tree, &rp_tofind, &where);
if (rp == NULL)
rp = avl_nearest(tree, where, AVL_BEFORE);
if (rp == NULL)
rp = avl_nearest(tree, where, AVL_AFTER);
while (rp && (rp->fr_blkid <= blkid + nblks)) {
uint64_t fr_endblk = rp->fr_blkid + rp->fr_nblks;
free_range_t *nrp = AVL_NEXT(tree, rp);
if (blkid <= rp->fr_blkid && endblk >= fr_endblk) {
/* clear this entire range */
avl_remove(tree, rp);
kmem_free(rp, sizeof (free_range_t));
} else if (blkid <= rp->fr_blkid &&
endblk > rp->fr_blkid && endblk < fr_endblk) {
/* clear the beginning of this range */
rp->fr_blkid = endblk;
rp->fr_nblks = fr_endblk - endblk;
} else if (blkid > rp->fr_blkid && blkid < fr_endblk &&
endblk >= fr_endblk) {
/* clear the end of this range */
rp->fr_nblks = blkid - rp->fr_blkid;
} else if (blkid > rp->fr_blkid && endblk < fr_endblk) {
/* clear a chunk out of this range */
free_range_t *new_rp =
kmem_alloc(sizeof (free_range_t), KM_SLEEP);
new_rp->fr_blkid = endblk;
new_rp->fr_nblks = fr_endblk - endblk;
avl_insert_here(tree, new_rp, rp, AVL_AFTER);
rp->fr_nblks = blkid - rp->fr_blkid;
}
/* there may be no overlap */
rp = nrp;
}
}
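The four overlap cases handled inside the loop of dnode_clear_range() are easier to follow when applied to a single range outside the AVL tree. The following sketch uses hypothetical names (`sketch_range_t`, `sketch_clear_one`) and returns the surviving pieces in an array instead of editing a tree, which is an illustrative simplification and not how the kernel code stores them.

```c
#include <assert.h>
#include <stdint.h>

typedef struct sketch_range {
	uint64_t blkid;	/* first block of the range */
	uint64_t nblks;	/* number of blocks in the range */
} sketch_range_t;

/*
 * Remove [blkid, blkid + nblks) from range r.  The surviving pieces
 * (zero, one, or two) are written to out[]; the count is returned.
 */
static int
sketch_clear_one(sketch_range_t r, uint64_t blkid, uint64_t nblks,
    sketch_range_t out[2])
{
	uint64_t endblk = blkid + nblks;
	uint64_t r_end = r.blkid + r.nblks;

	assert(nblks <= UINT64_MAX - blkid);	/* no overflow */
	if (blkid <= r.blkid && endblk >= r_end) {
		/* clear this entire range */
		return (0);
	} else if (blkid <= r.blkid && endblk > r.blkid && endblk < r_end) {
		/* clear the beginning of this range */
		out[0] = (sketch_range_t){ endblk, r_end - endblk };
		return (1);
	} else if (blkid > r.blkid && blkid < r_end && endblk >= r_end) {
		/* clear the end of this range */
		out[0] = (sketch_range_t){ r.blkid, blkid - r.blkid };
		return (1);
	} else if (blkid > r.blkid && endblk < r_end) {
		/* clear a chunk out of the middle of this range */
		out[0] = (sketch_range_t){ r.blkid, blkid - r.blkid };
		out[1] = (sketch_range_t){ endblk, r_end - endblk };
		return (2);
	}
	/* no overlap: the range survives unchanged */
	out[0] = r;
	return (1);
}
```

Clearing blocks 12 through 14 from a range covering blocks 10 through 19 leaves two pieces, 10 through 11 and 15 through 19. This corresponds to the branch where the kernel code allocates a second free_range_t and inserts it after the original.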
void
dnode_free_range(dnode_t *dn, uint64_t off, uint64_t len, dmu_tx_t *tx)
{
dmu_buf_impl_t *db;
uint64_t blkoff, blkid, nblks;
int blksz, blkshift, head, tail;
int trunc = FALSE;
int epbs;
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
blksz = dn->dn_datablksz;
blkshift = dn->dn_datablkshift;
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
if (len == -1ULL) {
len = UINT64_MAX - off;
trunc = TRUE;
}
/*
* First, block align the region to free:
*/
if (ISP2(blksz)) {
head = P2NPHASE(off, blksz);
blkoff = P2PHASE(off, blksz);
if ((off >> blkshift) > dn->dn_maxblkid)
goto out;
} else {
ASSERT(dn->dn_maxblkid == 0);
if (off == 0 && len >= blksz) {
/* Freeing the whole block; fast-track this request */
blkid = 0;
nblks = 1;
goto done;
} else if (off >= blksz) {
/* Freeing past end-of-data */
goto out;
} else {
/* Freeing part of the block. */
head = blksz - off;
ASSERT3U(head, >, 0);
}
blkoff = off;
}
/* zero out any partial block data at the start of the range */
if (head) {
ASSERT3U(blkoff + head, ==, blksz);
if (len < head)
head = len;
if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, off), TRUE,
FTAG, &db) == 0) {
caddr_t data;
/* don't dirty if it isn't on disk and isn't dirty */
if (db->db_last_dirty ||
(db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
rw_exit(&dn->dn_struct_rwlock);
dbuf_will_dirty(db, tx);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
data = db->db.db_data;
bzero(data + blkoff, head);
}
dbuf_rele(db, FTAG);
}
off += head;
len -= head;
}
/* If the range was less than one block, we're done */
if (len == 0)
goto out;
/* If the remaining range is past end of file, we're done */
if ((off >> blkshift) > dn->dn_maxblkid)
goto out;
ASSERT(ISP2(blksz));
if (trunc)
tail = 0;
else
tail = P2PHASE(len, blksz);
ASSERT3U(P2PHASE(off, blksz), ==, 0);
/* zero out any partial block data at the end of the range */
if (tail) {
if (len < tail)
tail = len;
if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, off+len),
TRUE, FTAG, &db) == 0) {
/* don't dirty if not on disk and not dirty */
if (db->db_last_dirty ||
(db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
rw_exit(&dn->dn_struct_rwlock);
dbuf_will_dirty(db, tx);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
bzero(db->db.db_data, tail);
}
dbuf_rele(db, FTAG);
}
len -= tail;
}
/* If the range did not include a full block, we are done */
if (len == 0)
goto out;
ASSERT(IS_P2ALIGNED(off, blksz));
ASSERT(trunc || IS_P2ALIGNED(len, blksz));
blkid = off >> blkshift;
nblks = len >> blkshift;
if (trunc)
nblks += 1;
/*
* Read in and mark all the level-1 indirects dirty,
* so that they will stay in memory until syncing phase.
* Always dirty the first and last indirect to make sure
* we dirty all the partial indirects.
*/
if (dn->dn_nlevels > 1) {
uint64_t i, first, last;
int shift = epbs + dn->dn_datablkshift;
first = blkid >> epbs;
if (db = dbuf_hold_level(dn, 1, first, FTAG)) {
dbuf_will_dirty(db, tx);
dbuf_rele(db, FTAG);
}
if (trunc)
last = dn->dn_maxblkid >> epbs;
else
last = (blkid + nblks - 1) >> epbs;
if (last > first && (db = dbuf_hold_level(dn, 1, last, FTAG))) {
dbuf_will_dirty(db, tx);
dbuf_rele(db, FTAG);
}
for (i = first + 1; i < last; i++) {
uint64_t ibyte = i << shift;
int err;
err = dnode_next_offset(dn,
DNODE_FIND_HAVELOCK, &ibyte, 1, 1, 0);
i = ibyte >> shift;
if (err == ESRCH || i >= last)
break;
ASSERT(err == 0);
db = dbuf_hold_level(dn, 1, i, FTAG);
if (db) {
dbuf_will_dirty(db, tx);
dbuf_rele(db, FTAG);
}
}
}
done:
/*
* Add this range to the dnode range list.
* We will finish up this free operation in the syncing phase.
*/
mutex_enter(&dn->dn_mtx);
dnode_clear_range(dn, blkid, nblks, tx);
{
free_range_t *rp, *found;
avl_index_t where;
avl_tree_t *tree = &dn->dn_ranges[tx->tx_txg&TXG_MASK];
/* Add new range to dn_ranges */
rp = kmem_alloc(sizeof (free_range_t), KM_SLEEP);
rp->fr_blkid = blkid;
rp->fr_nblks = nblks;
found = avl_find(tree, rp, &where);
ASSERT(found == NULL);
avl_insert(tree, rp, where);
dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n",
blkid, nblks, tx->tx_txg);
}
mutex_exit(&dn->dn_mtx);
dbuf_free_range(dn, blkid, blkid + nblks - 1, tx);
dnode_setdirty(dn, tx);
out:
if (trunc && dn->dn_maxblkid >= (off >> blkshift))
dn->dn_maxblkid = (off >> blkshift ? (off >> blkshift) - 1 : 0);
rw_exit(&dn->dn_struct_rwlock);
}
/* return TRUE if this blkid was freed in a recent txg, or FALSE if it wasn't */
uint64_t
dnode_block_freed(dnode_t *dn, uint64_t blkid)
{
free_range_t range_tofind;
void *dp = spa_get_dsl(dn->dn_objset->os_spa);
int i;
if (blkid == DB_BONUS_BLKID)
return (FALSE);
/*
* If we're in the process of opening the pool, dp will not be
* set yet, but there shouldn't be anything dirty.
*/
if (dp == NULL)
return (FALSE);
if (dn->dn_free_txg)
return (TRUE);
range_tofind.fr_blkid = blkid;
mutex_enter(&dn->dn_mtx);
for (i = 0; i < TXG_SIZE; i++) {
free_range_t *range_found;
avl_index_t idx;
range_found = avl_find(&dn->dn_ranges[i], &range_tofind, &idx);
if (range_found) {
ASSERT(range_found->fr_nblks > 0);
break;
}
range_found = avl_nearest(&dn->dn_ranges[i], idx, AVL_BEFORE);
if (range_found &&
range_found->fr_blkid + range_found->fr_nblks > blkid)
break;
}
mutex_exit(&dn->dn_mtx);
return (i < TXG_SIZE);
}
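The per-txg lookup in dnode_block_freed() is an exact match on the starting block followed by a "nearest range before" check. The same test can be sketched over a sorted array with a binary search. The `sketch_` names are hypothetical, and the sorted, non-overlapping array stands in for one dn_ranges AVL tree purely for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct sketch_free_range {
	uint64_t blkid;	/* first freed block */
	uint64_t nblks;	/* number of freed blocks */
} sketch_free_range_t;

/*
 * Return 1 if blkid falls inside one of the n ranges, else 0.
 * The ranges must be sorted by blkid and must not overlap.
 */
static int
sketch_block_in_ranges(const sketch_free_range_t *ranges, size_t n,
    uint64_t blkid)
{
	size_t lo = 0, hi = n;

	/* Find the first range that starts after blkid. */
	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (ranges[mid].blkid <= blkid)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo == 0)
		return (0);	/* every range starts after blkid */
	/* The candidate is the last range starting at or before blkid. */
	return (blkid - ranges[lo - 1].blkid < ranges[lo - 1].nblks);
}
```

Only one candidate has to be inspected per tree because the ranges do not overlap: the range that starts at or just before blkid is the only one that can contain it.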
/* call from syncing context when we actually write/free space for this dnode */
void
dnode_diduse_space(dnode_t *dn, int64_t delta)
{
uint64_t space;
dprintf_dnode(dn, "dn=%p dnp=%p used=%llu delta=%lld\n",
dn, dn->dn_phys,
(u_longlong_t)dn->dn_phys->dn_used,
(longlong_t)delta);
mutex_enter(&dn->dn_mtx);
space = DN_USED_BYTES(dn->dn_phys);
if (delta > 0) {
ASSERT3U(space + delta, >=, space); /* no overflow */
} else {
ASSERT3U(space, >=, -delta); /* no underflow */
}
space += delta;
if (spa_version(dn->dn_objset->os_spa) < SPA_VERSION_DNODE_BYTES) {
ASSERT((dn->dn_phys->dn_flags & DNODE_FLAG_USED_BYTES) == 0);
ASSERT3U(P2PHASE(space, 1<<DEV_BSHIFT), ==, 0);
dn->dn_phys->dn_used = space >> DEV_BSHIFT;
} else {
dn->dn_phys->dn_used = space;
dn->dn_phys->dn_flags |= DNODE_FLAG_USED_BYTES;
}
mutex_exit(&dn->dn_mtx);
}
/*
* Call when we think we're going to write/free space in open context.
* Be conservative (i.e., OK to write less than this or free more than
* this, but don't write more or free less).
*/
void
dnode_willuse_space(dnode_t *dn, int64_t space, dmu_tx_t *tx)
{
objset_impl_t *os = dn->dn_objset;
dsl_dataset_t *ds = os->os_dsl_dataset;
if (space > 0)
space = spa_get_asize(os->os_spa, space);
if (ds)
dsl_dir_willuse_space(ds->ds_dir, space, tx);
dmu_tx_willuse_space(tx, space);
}
/*
* This function scans a block at the indicated "level" looking for
* a hole or data (depending on 'flags'). If level > 0, then we are
* scanning an indirect block looking at its pointers. If level == 0,
* then we are looking at a block of dnodes. If we don't find what we
* are looking for in the block, we return ESRCH. Otherwise, return
* with *offset pointing to the beginning (if searching forwards) or
* end (if searching backwards) of the range covered by the block
* pointer we matched on (or dnode).
*
* The basic search algorithm used below by dnode_next_offset() is to
* use this function to search up the block tree (widen the search) until
* we find something (i.e., we don't return ESRCH) and then search back
* down the tree (narrow the search) until we reach our original search
* level.
*/
static int
dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
int lvl, uint64_t blkfill, uint64_t txg)
{
dmu_buf_impl_t *db = NULL;
void *data = NULL;
uint64_t epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
uint64_t epb = 1ULL << epbs;
uint64_t minfill, maxfill;
boolean_t hole;
int i, inc, error, span;
dprintf("probing object %llu offset %llx level %d of %u\n",
dn->dn_object, *offset, lvl, dn->dn_phys->dn_nlevels);
hole = flags & DNODE_FIND_HOLE;
inc = (flags & DNODE_FIND_BACKWARDS) ? -1 : 1;
ASSERT(txg == 0 || !hole);
if (lvl == dn->dn_phys->dn_nlevels) {
error = 0;
epb = dn->dn_phys->dn_nblkptr;
data = dn->dn_phys->dn_blkptr;
} else {
uint64_t blkid = dbuf_whichblock(dn, *offset) >> (epbs * lvl);
error = dbuf_hold_impl(dn, lvl, blkid, TRUE, FTAG, &db);
if (error) {
if (error != ENOENT)
return (error);
if (hole)
return (0);
/*
* This can only happen when we are searching up
* the block tree for data. We don't really need to
* adjust the offset, as we will just end up looking
* at the pointer to this block in its parent, and it's
* going to be unallocated, so we will skip over it.
*/
return (ESRCH);
}
error = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_HAVESTRUCT);
if (error) {
dbuf_rele(db, FTAG);
return (error);
}
data = db->db.db_data;
}
if (db && txg &&
(db->db_blkptr == NULL || db->db_blkptr->blk_birth <= txg)) {
/*
* This can only happen when we are searching up the tree
* and these conditions mean that we need to keep climbing.
*/
error = ESRCH;
} else if (lvl == 0) {
dnode_phys_t *dnp = data;
span = DNODE_SHIFT;
ASSERT(dn->dn_type == DMU_OT_DNODE);
for (i = (*offset >> span) & (blkfill - 1);
i >= 0 && i < blkfill; i += inc) {
- boolean_t newcontents = B_TRUE;
- if (txg) {
- int j;
- newcontents = B_FALSE;
- for (j = 0; j < dnp[i].dn_nblkptr; j++) {
- if (dnp[i].dn_blkptr[j].blk_birth > txg)
- newcontents = B_TRUE;
- }
- }
- if (!dnp[i].dn_type == hole && newcontents)
+ if ((dnp[i].dn_type == DMU_OT_NONE) == hole)
break;
*offset += (1ULL << span) * inc;
}
if (i < 0 || i == blkfill)
error = ESRCH;
} else {
blkptr_t *bp = data;
uint64_t start = *offset;
span = (lvl - 1) * epbs + dn->dn_datablkshift;
minfill = 0;
maxfill = blkfill << ((lvl - 1) * epbs);
if (hole)
maxfill--;
else
minfill++;
*offset = *offset >> span;
for (i = BF64_GET(*offset, 0, epbs);
i >= 0 && i < epb; i += inc) {
if (bp[i].blk_fill >= minfill &&
bp[i].blk_fill <= maxfill &&
(hole || bp[i].blk_birth > txg))
break;
if (inc > 0 || *offset > 0)
*offset += inc;
}
*offset = *offset << span;
if (inc < 0) {
/* traversing backwards; position offset at the end */
ASSERT3U(*offset, <=, start);
*offset = MIN(*offset + (1ULL << span) - 1, start);
} else if (*offset < start) {
*offset = start;
}
if (i < 0 || i >= epb)
error = ESRCH;
}
if (db)
dbuf_rele(db, FTAG);
return (error);
}
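A note on the hunk above: in the removed line, `!` binds tighter than `==`, so `!dnp[i].dn_type == hole` already compared "slot is free" with "looking for a hole". The sketch below is a minimal standalone check of that reading. `slot_matches_new`, `slot_matches_old`, and `OT_NONE` are hypothetical names; `OT_NONE` is assumed to be 0 like `DMU_OT_NONE`, and `hole` is assumed to be 0 or 1. Under those assumptions the two tests agree, which suggests the behavioural change is the dropped per-dnode `blk_birth` filter rather than the comparison itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for DMU_OT_NONE; assumed to be 0 for this sketch. */
enum { OT_NONE = 0 };

/* Added form: match when "slot is free" equals "looking for a hole". */
static bool
slot_matches_new(int dn_type, bool hole)
{
	return ((dn_type == OT_NONE) == hole);
}

/* Removed form without the txg filter: parsed as (!dn_type) == hole. */
static bool
slot_matches_old(int dn_type, bool hole)
{
	return ((!dn_type) == hole);
}
```

Both helpers return true for a free slot when searching for holes and for an allocated slot when searching for data.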
/*
* Find the next hole, data, or sparse region at or after *offset.
* The value 'blkfill' tells us how many items we expect to find
* in an L0 data block; this value is 1 for normal objects,
* DNODES_PER_BLOCK for the meta dnode, and some fraction of
* DNODES_PER_BLOCK when searching for sparse regions thereof.
*
* Examples:
*
* dnode_next_offset(dn, flags, offset, 1, 1, 0);
* Finds the next/previous hole/data in a file.
* Used in dmu_offset_next().
*
* dnode_next_offset(mdn, flags, offset, 0, DNODES_PER_BLOCK, txg);
* Finds the next free/allocated dnode in an objset's meta-dnode.
* Only finds objects that have new contents since txg (ie.
* bonus buffer changes and content removal are ignored).
* Used in dmu_object_next().
*
* dnode_next_offset(mdn, DNODE_FIND_HOLE, offset, 2, DNODES_PER_BLOCK >> 2, 0);
* Finds the next L2 meta-dnode bp that's at most 1/4 full.
* Used in dmu_object_alloc().
*/
int
dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset,
int minlvl, uint64_t blkfill, uint64_t txg)
{
uint64_t initial_offset = *offset;
int lvl, maxlvl;
int error = 0;
if (!(flags & DNODE_FIND_HAVELOCK))
rw_enter(&dn->dn_struct_rwlock, RW_READER);
if (dn->dn_phys->dn_nlevels == 0) {
error = ESRCH;
goto out;
}
if (dn->dn_datablkshift == 0) {
if (*offset < dn->dn_datablksz) {
if (flags & DNODE_FIND_HOLE)
*offset = dn->dn_datablksz;
} else {
error = ESRCH;
}
goto out;
}
maxlvl = dn->dn_phys->dn_nlevels;
for (lvl = minlvl; lvl <= maxlvl; lvl++) {
error = dnode_next_offset_level(dn,
flags, offset, lvl, blkfill, txg);
if (error != ESRCH)
break;
}
while (error == 0 && --lvl >= minlvl) {
error = dnode_next_offset_level(dn,
flags, offset, lvl, blkfill, txg);
}
if (error == 0 && (flags & DNODE_FIND_BACKWARDS ?
initial_offset < *offset : initial_offset > *offset))
error = ESRCH;
out:
if (!(flags & DNODE_FIND_HAVELOCK))
rw_exit(&dn->dn_struct_rwlock);
return (error);
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c (revision 209274)
@@ -1,1331 +1,1330 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/dmu.h>
#include <sys/dmu_objset.h>
#include <sys/dmu_tx.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_prop.h>
#include <sys/dsl_synctask.h>
#include <sys/dsl_deleg.h>
#include <sys/spa.h>
#include <sys/zap.h>
#include <sys/zio.h>
#include <sys/arc.h>
#include <sys/sunddi.h>
#include "zfs_namecheck.h"
static uint64_t dsl_dir_space_towrite(dsl_dir_t *dd);
static void dsl_dir_set_reservation_sync(void *arg1, void *arg2,
cred_t *cr, dmu_tx_t *tx);
/* ARGSUSED */
static void
dsl_dir_evict(dmu_buf_t *db, void *arg)
{
dsl_dir_t *dd = arg;
dsl_pool_t *dp = dd->dd_pool;
int t;
for (t = 0; t < TXG_SIZE; t++) {
ASSERT(!txg_list_member(&dp->dp_dirty_dirs, dd, t));
ASSERT(dd->dd_tempreserved[t] == 0);
ASSERT(dd->dd_space_towrite[t] == 0);
}
if (dd->dd_parent)
dsl_dir_close(dd->dd_parent, dd);
spa_close(dd->dd_pool->dp_spa, dd);
/*
* The props callback list should be empty since they hold the
* dir open.
*/
list_destroy(&dd->dd_prop_cbs);
mutex_destroy(&dd->dd_lock);
kmem_free(dd, sizeof (dsl_dir_t));
}
int
dsl_dir_open_obj(dsl_pool_t *dp, uint64_t ddobj,
const char *tail, void *tag, dsl_dir_t **ddp)
{
dmu_buf_t *dbuf;
dsl_dir_t *dd;
int err;
ASSERT(RW_LOCK_HELD(&dp->dp_config_rwlock) ||
dsl_pool_sync_context(dp));
err = dmu_bonus_hold(dp->dp_meta_objset, ddobj, tag, &dbuf);
if (err)
return (err);
dd = dmu_buf_get_user(dbuf);
#ifdef ZFS_DEBUG
{
dmu_object_info_t doi;
dmu_object_info_from_db(dbuf, &doi);
ASSERT3U(doi.doi_type, ==, DMU_OT_DSL_DIR);
ASSERT3U(doi.doi_bonus_size, >=, sizeof (dsl_dir_phys_t));
}
#endif
if (dd == NULL) {
dsl_dir_t *winner;
- int err;
dd = kmem_zalloc(sizeof (dsl_dir_t), KM_SLEEP);
dd->dd_object = ddobj;
dd->dd_dbuf = dbuf;
dd->dd_pool = dp;
dd->dd_phys = dbuf->db_data;
mutex_init(&dd->dd_lock, NULL, MUTEX_DEFAULT, NULL);
list_create(&dd->dd_prop_cbs, sizeof (dsl_prop_cb_record_t),
offsetof(dsl_prop_cb_record_t, cbr_node));
if (dd->dd_phys->dd_parent_obj) {
err = dsl_dir_open_obj(dp, dd->dd_phys->dd_parent_obj,
NULL, dd, &dd->dd_parent);
if (err)
goto errout;
if (tail) {
#ifdef ZFS_DEBUG
uint64_t foundobj;
err = zap_lookup(dp->dp_meta_objset,
dd->dd_parent->dd_phys->dd_child_dir_zapobj,
tail, sizeof (foundobj), 1, &foundobj);
ASSERT(err || foundobj == ddobj);
#endif
(void) strcpy(dd->dd_myname, tail);
} else {
err = zap_value_search(dp->dp_meta_objset,
dd->dd_parent->dd_phys->dd_child_dir_zapobj,
ddobj, 0, dd->dd_myname);
}
if (err)
goto errout;
} else {
(void) strcpy(dd->dd_myname, spa_name(dp->dp_spa));
}
winner = dmu_buf_set_user_ie(dbuf, dd, &dd->dd_phys,
dsl_dir_evict);
if (winner) {
if (dd->dd_parent)
dsl_dir_close(dd->dd_parent, dd);
mutex_destroy(&dd->dd_lock);
kmem_free(dd, sizeof (dsl_dir_t));
dd = winner;
} else {
spa_open_ref(dp->dp_spa, dd);
}
}
/*
* The dsl_dir_t has both open-to-close and instantiate-to-evict
* holds on the spa. We need the open-to-close holds because
* otherwise the spa_refcnt wouldn't change when we open a
* dir which the spa also has open, so we could incorrectly
* think it was OK to unload/export/destroy the pool. We need
* the instantiate-to-evict hold because the dsl_dir_t has a
* pointer to the dd_pool, which has a pointer to the spa_t.
*/
spa_open_ref(dp->dp_spa, tag);
ASSERT3P(dd->dd_pool, ==, dp);
ASSERT3U(dd->dd_object, ==, ddobj);
ASSERT3P(dd->dd_dbuf, ==, dbuf);
*ddp = dd;
return (0);
errout:
if (dd->dd_parent)
dsl_dir_close(dd->dd_parent, dd);
mutex_destroy(&dd->dd_lock);
kmem_free(dd, sizeof (dsl_dir_t));
dmu_buf_rele(dbuf, tag);
return (err);
}
void
dsl_dir_close(dsl_dir_t *dd, void *tag)
{
dprintf_dd(dd, "%s\n", "");
spa_close(dd->dd_pool->dp_spa, tag);
dmu_buf_rele(dd->dd_dbuf, tag);
}
/* buf must be long enough (MAXNAMELEN + strlen(MOS_DIR_NAME) + 1 should do) */
void
dsl_dir_name(dsl_dir_t *dd, char *buf)
{
if (dd->dd_parent) {
dsl_dir_name(dd->dd_parent, buf);
(void) strcat(buf, "/");
} else {
buf[0] = '\0';
}
if (!MUTEX_HELD(&dd->dd_lock)) {
/*
* recursive mutex so that we can use
* dprintf_dd() with dd_lock held
*/
mutex_enter(&dd->dd_lock);
(void) strcat(buf, dd->dd_myname);
mutex_exit(&dd->dd_lock);
} else {
(void) strcat(buf, dd->dd_myname);
}
}
/* Calculate name length, avoiding all the strcat calls of dsl_dir_name */
int
dsl_dir_namelen(dsl_dir_t *dd)
{
int result = 0;
if (dd->dd_parent) {
/* parent's name + 1 for the "/" */
result = dsl_dir_namelen(dd->dd_parent) + 1;
}
if (!MUTEX_HELD(&dd->dd_lock)) {
/* see dsl_dir_name */
mutex_enter(&dd->dd_lock);
result += strlen(dd->dd_myname);
mutex_exit(&dd->dd_lock);
} else {
result += strlen(dd->dd_myname);
}
return (result);
}
int
dsl_dir_is_private(dsl_dir_t *dd)
{
int rv = FALSE;
if (dd->dd_parent && dsl_dir_is_private(dd->dd_parent))
rv = TRUE;
if (dataset_name_hidden(dd->dd_myname))
rv = TRUE;
return (rv);
}
static int
getcomponent(const char *path, char *component, const char **nextp)
{
char *p;
if (path == NULL)
return (ENOENT);
/* This would be a good place to reserve some namespace... */
p = strpbrk(path, "/@");
if (p && (p[1] == '/' || p[1] == '@')) {
/* two separators in a row */
return (EINVAL);
}
if (p == NULL || p == path) {
/*
* if the first thing is an @ or /, it had better be an
* @ and it had better not have any more ats or slashes,
* and it had better have something after the @.
*/
if (p != NULL &&
(p[0] != '@' || strpbrk(path+1, "/@") || p[1] == '\0'))
return (EINVAL);
if (strlen(path) >= MAXNAMELEN)
return (ENAMETOOLONG);
(void) strcpy(component, path);
p = NULL;
} else if (p[0] == '/') {
if (p-path >= MAXNAMELEN)
return (ENAMETOOLONG);
(void) strncpy(component, path, p - path);
component[p-path] = '\0';
p++;
} else if (p[0] == '@') {
/*
* if the next separator is an @, there better not be
* any more slashes.
*/
if (strchr(path, '/'))
return (EINVAL);
if (p-path >= MAXNAMELEN)
return (ENAMETOOLONG);
(void) strncpy(component, path, p - path);
component[p-path] = '\0';
} else {
ASSERT(!"invalid p");
}
*nextp = p;
return (0);
}
/*
* Same as dsl_dir_open(), but ignore the first component of name and
* use the spa instead.
*/
int
dsl_dir_open_spa(spa_t *spa, const char *name, void *tag,
dsl_dir_t **ddp, const char **tailp)
{
char buf[MAXNAMELEN];
const char *next, *nextnext = NULL;
int err;
dsl_dir_t *dd;
dsl_pool_t *dp;
uint64_t ddobj;
int openedspa = FALSE;
dprintf("%s\n", name);
err = getcomponent(name, buf, &next);
if (err)
return (err);
if (spa == NULL) {
err = spa_open(buf, &spa, FTAG);
if (err) {
dprintf("spa_open(%s) failed\n", buf);
return (err);
}
openedspa = TRUE;
/* XXX this assertion belongs in spa_open */
ASSERT(!dsl_pool_sync_context(spa_get_dsl(spa)));
}
dp = spa_get_dsl(spa);
rw_enter(&dp->dp_config_rwlock, RW_READER);
err = dsl_dir_open_obj(dp, dp->dp_root_dir_obj, NULL, tag, &dd);
if (err) {
rw_exit(&dp->dp_config_rwlock);
if (openedspa)
spa_close(spa, FTAG);
return (err);
}
while (next != NULL) {
dsl_dir_t *child_ds;
err = getcomponent(next, buf, &nextnext);
if (err)
break;
ASSERT(next[0] != '\0');
if (next[0] == '@')
break;
dprintf("looking up %s in obj%lld\n",
buf, dd->dd_phys->dd_child_dir_zapobj);
err = zap_lookup(dp->dp_meta_objset,
dd->dd_phys->dd_child_dir_zapobj,
buf, sizeof (ddobj), 1, &ddobj);
if (err) {
if (err == ENOENT)
err = 0;
break;
}
err = dsl_dir_open_obj(dp, ddobj, buf, tag, &child_ds);
if (err)
break;
dsl_dir_close(dd, tag);
dd = child_ds;
next = nextnext;
}
rw_exit(&dp->dp_config_rwlock);
if (err) {
dsl_dir_close(dd, tag);
if (openedspa)
spa_close(spa, FTAG);
return (err);
}
/*
* It's an error if there's more than one component left, or
* tailp==NULL and there's any component left.
*/
if (next != NULL &&
(tailp == NULL || (nextnext && nextnext[0] != '\0'))) {
/* bad path name */
dsl_dir_close(dd, tag);
dprintf("next=%p (%s) tail=%p\n", next, next?next:"", tailp);
err = ENOENT;
}
if (tailp)
*tailp = next;
if (openedspa)
spa_close(spa, FTAG);
*ddp = dd;
return (err);
}
/*
* Return the dsl_dir_t in *ddp, and possibly the last component which
* couldn't be found in *tailp. Return a nonzero errno if the path is
* bogus, or if tailp==NULL and we couldn't parse the whole name.
* (*tailp)[0] == '@' means that the last component is a snapshot.
*/
int
dsl_dir_open(const char *name, void *tag, dsl_dir_t **ddp, const char **tailp)
{
return (dsl_dir_open_spa(NULL, name, tag, ddp, tailp));
}
uint64_t
dsl_dir_create_sync(dsl_pool_t *dp, dsl_dir_t *pds, const char *name,
dmu_tx_t *tx)
{
objset_t *mos = dp->dp_meta_objset;
uint64_t ddobj;
dsl_dir_phys_t *dsphys;
dmu_buf_t *dbuf;
ddobj = dmu_object_alloc(mos, DMU_OT_DSL_DIR, 0,
DMU_OT_DSL_DIR, sizeof (dsl_dir_phys_t), tx);
if (pds) {
VERIFY(0 == zap_add(mos, pds->dd_phys->dd_child_dir_zapobj,
name, sizeof (uint64_t), 1, &ddobj, tx));
} else {
/* it's the root dir */
VERIFY(0 == zap_add(mos, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_ROOT_DATASET, sizeof (uint64_t), 1, &ddobj, tx));
}
VERIFY(0 == dmu_bonus_hold(mos, ddobj, FTAG, &dbuf));
dmu_buf_will_dirty(dbuf, tx);
dsphys = dbuf->db_data;
dsphys->dd_creation_time = gethrestime_sec();
if (pds)
dsphys->dd_parent_obj = pds->dd_object;
dsphys->dd_props_zapobj = zap_create(mos,
DMU_OT_DSL_PROPS, DMU_OT_NONE, 0, tx);
dsphys->dd_child_dir_zapobj = zap_create(mos,
DMU_OT_DSL_DIR_CHILD_MAP, DMU_OT_NONE, 0, tx);
if (spa_version(dp->dp_spa) >= SPA_VERSION_USED_BREAKDOWN)
dsphys->dd_flags |= DD_FLAG_USED_BREAKDOWN;
dmu_buf_rele(dbuf, FTAG);
return (ddobj);
}
/* ARGSUSED */
int
dsl_dir_destroy_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
dsl_pool_t *dp = dd->dd_pool;
objset_t *mos = dp->dp_meta_objset;
int err;
uint64_t count;
/*
* There should be exactly two holds, both from
* dsl_dataset_destroy: one on the dd directory, and one on its
* head ds. Otherwise, someone is trying to look up something
* inside this dir while we want to destroy it. The
* config_rwlock ensures that nobody else opens it after we
* check.
*/
if (dmu_buf_refcount(dd->dd_dbuf) > 2)
return (EBUSY);
err = zap_count(mos, dd->dd_phys->dd_child_dir_zapobj, &count);
if (err)
return (err);
if (count != 0)
return (EEXIST);
return (0);
}
void
dsl_dir_destroy_sync(void *arg1, void *tag, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
objset_t *mos = dd->dd_pool->dp_meta_objset;
uint64_t val, obj;
dd_used_t t;
ASSERT(RW_WRITE_HELD(&dd->dd_pool->dp_config_rwlock));
ASSERT(dd->dd_phys->dd_head_dataset_obj == 0);
/* Remove our reservation. */
val = 0;
dsl_dir_set_reservation_sync(dd, &val, cr, tx);
ASSERT3U(dd->dd_phys->dd_used_bytes, ==, 0);
ASSERT3U(dd->dd_phys->dd_reserved, ==, 0);
for (t = 0; t < DD_USED_NUM; t++)
ASSERT3U(dd->dd_phys->dd_used_breakdown[t], ==, 0);
VERIFY(0 == zap_destroy(mos, dd->dd_phys->dd_child_dir_zapobj, tx));
VERIFY(0 == zap_destroy(mos, dd->dd_phys->dd_props_zapobj, tx));
VERIFY(0 == dsl_deleg_destroy(mos, dd->dd_phys->dd_deleg_zapobj, tx));
VERIFY(0 == zap_remove(mos,
dd->dd_parent->dd_phys->dd_child_dir_zapobj, dd->dd_myname, tx));
obj = dd->dd_object;
dsl_dir_close(dd, tag);
VERIFY(0 == dmu_object_free(mos, obj, tx));
}
boolean_t
dsl_dir_is_clone(dsl_dir_t *dd)
{
return (dd->dd_phys->dd_origin_obj &&
(dd->dd_pool->dp_origin_snap == NULL ||
dd->dd_phys->dd_origin_obj !=
dd->dd_pool->dp_origin_snap->ds_object));
}
void
dsl_dir_stats(dsl_dir_t *dd, nvlist_t *nv)
{
mutex_enter(&dd->dd_lock);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USED,
dd->dd_phys->dd_used_bytes);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_QUOTA, dd->dd_phys->dd_quota);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_RESERVATION,
dd->dd_phys->dd_reserved);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_COMPRESSRATIO,
dd->dd_phys->dd_compressed_bytes == 0 ? 100 :
(dd->dd_phys->dd_uncompressed_bytes * 100 /
dd->dd_phys->dd_compressed_bytes));
if (dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN) {
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDSNAP,
dd->dd_phys->dd_used_breakdown[DD_USED_SNAP]);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDDS,
dd->dd_phys->dd_used_breakdown[DD_USED_HEAD]);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDREFRESERV,
dd->dd_phys->dd_used_breakdown[DD_USED_REFRSRV]);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDCHILD,
dd->dd_phys->dd_used_breakdown[DD_USED_CHILD] +
dd->dd_phys->dd_used_breakdown[DD_USED_CHILD_RSRV]);
}
mutex_exit(&dd->dd_lock);
rw_enter(&dd->dd_pool->dp_config_rwlock, RW_READER);
if (dsl_dir_is_clone(dd)) {
dsl_dataset_t *ds;
char buf[MAXNAMELEN];
VERIFY(0 == dsl_dataset_hold_obj(dd->dd_pool,
dd->dd_phys->dd_origin_obj, FTAG, &ds));
dsl_dataset_name(ds, buf);
dsl_dataset_rele(ds, FTAG);
dsl_prop_nvlist_add_string(nv, ZFS_PROP_ORIGIN, buf);
}
rw_exit(&dd->dd_pool->dp_config_rwlock);
}
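The compression ratio reported by dsl_dir_stats() is a percentage, with 100 used when nothing has been compressed yet. A minimal sketch of that arithmetic follows; `compressratio_sketch` is a hypothetical helper, not part of the ZFS sources.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the ZFS_PROP_COMPRESSRATIO expression:
 * uncompressed * 100 / compressed, or 100 when compressed is zero.
 */
static uint64_t
compressratio_sketch(uint64_t uncompressed, uint64_t compressed)
{
	return (compressed == 0 ? 100 : uncompressed * 100 / compressed);
}
```

The division is integer division, so fractional percentages are truncated.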
void
dsl_dir_dirty(dsl_dir_t *dd, dmu_tx_t *tx)
{
dsl_pool_t *dp = dd->dd_pool;
ASSERT(dd->dd_phys);
if (txg_list_add(&dp->dp_dirty_dirs, dd, tx->tx_txg) == 0) {
/* up the hold count until we can be written out */
dmu_buf_add_ref(dd->dd_dbuf, dd);
}
}
static int64_t
parent_delta(dsl_dir_t *dd, uint64_t used, int64_t delta)
{
uint64_t old_accounted = MAX(used, dd->dd_phys->dd_reserved);
uint64_t new_accounted = MAX(used + delta, dd->dd_phys->dd_reserved);
return (new_accounted - old_accounted);
}
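parent_delta() charges the parent for MAX(used, reserved), so a change that stays below the reservation is invisible to the parent. The sketch below restates the function without ZFS types; `parent_delta_sketch` is a hypothetical name and `reserved` is passed in where the original reads `dd->dd_phys->dd_reserved`.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_MAX(a, b)	((a) > (b) ? (a) : (b))

/*
 * Hypothetical standalone form of parent_delta(): how much of 'delta'
 * shows up in the parent's accounting, given a reservation.
 */
static int64_t
parent_delta_sketch(uint64_t reserved, uint64_t used, int64_t delta)
{
	uint64_t old_accounted = SKETCH_MAX(used, reserved);
	uint64_t new_accounted = SKETCH_MAX(used + delta, reserved);

	return ((int64_t)(new_accounted - old_accounted));
}
```

With a reservation of 100, growing from 40 to 70 passes nothing to the parent, while growing from 90 to 120 passes only the 20 that exceeds the reservation.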
void
dsl_dir_sync(dsl_dir_t *dd, dmu_tx_t *tx)
{
ASSERT(dmu_tx_is_syncing(tx));
dmu_buf_will_dirty(dd->dd_dbuf, tx);
mutex_enter(&dd->dd_lock);
ASSERT3U(dd->dd_tempreserved[tx->tx_txg&TXG_MASK], ==, 0);
dprintf_dd(dd, "txg=%llu towrite=%lluK\n", tx->tx_txg,
dd->dd_space_towrite[tx->tx_txg&TXG_MASK] / 1024);
dd->dd_space_towrite[tx->tx_txg&TXG_MASK] = 0;
mutex_exit(&dd->dd_lock);
/* release the hold from dsl_dir_dirty */
dmu_buf_rele(dd->dd_dbuf, dd);
}
static uint64_t
dsl_dir_space_towrite(dsl_dir_t *dd)
{
uint64_t space = 0;
int i;
ASSERT(MUTEX_HELD(&dd->dd_lock));
for (i = 0; i < TXG_SIZE; i++) {
space += dd->dd_space_towrite[i&TXG_MASK];
ASSERT3U(dd->dd_space_towrite[i&TXG_MASK], >=, 0);
}
return (space);
}
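dd_space_towrite is indexed by `txg & TXG_MASK`, so it behaves as a small ring of per-txg counters that dsl_dir_sync() clears one slot at a time. A minimal sketch of that pattern follows. The `SKETCH_` constants and the `ring_*` helpers are hypothetical; the size of 4 is an assumption for illustration, and the real TXG_SIZE and TXG_MASK come from the ZFS txg header.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only. */
#define	SKETCH_TXG_SIZE	4
#define	SKETCH_TXG_MASK	(SKETCH_TXG_SIZE - 1)

struct towrite_ring {
	uint64_t slot[SKETCH_TXG_SIZE];
};

/* Open context: record an estimate against the slot for this txg. */
static void
ring_add(struct towrite_ring *r, uint64_t txg, uint64_t bytes)
{
	r->slot[txg & SKETCH_TXG_MASK] += bytes;
}

/* Syncing context: the txg has been written, so clear its slot. */
static void
ring_sync(struct towrite_ring *r, uint64_t txg)
{
	r->slot[txg & SKETCH_TXG_MASK] = 0;
}

/* Sum over all slots, as dsl_dir_space_towrite() does. */
static uint64_t
ring_total(const struct towrite_ring *r)
{
	uint64_t space = 0;
	int i;

	for (i = 0; i < SKETCH_TXG_SIZE; i++)
		space += r->slot[i];
	return (space);
}
```

Because the index wraps, a slot must be cleared at sync time before a later txg reuses it.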
/*
* How much space would dd have available if ancestor had delta applied
* to it? If ondiskonly is set, we're only interested in what's
* on-disk, not estimated pending changes.
*/
uint64_t
dsl_dir_space_available(dsl_dir_t *dd,
dsl_dir_t *ancestor, int64_t delta, int ondiskonly)
{
uint64_t parentspace, myspace, quota, used;
/*
* If there are no restrictions otherwise, assume we have
* unlimited space available.
*/
quota = UINT64_MAX;
parentspace = UINT64_MAX;
if (dd->dd_parent != NULL) {
parentspace = dsl_dir_space_available(dd->dd_parent,
ancestor, delta, ondiskonly);
}
mutex_enter(&dd->dd_lock);
if (dd->dd_phys->dd_quota != 0)
quota = dd->dd_phys->dd_quota;
used = dd->dd_phys->dd_used_bytes;
if (!ondiskonly)
used += dsl_dir_space_towrite(dd);
if (dd->dd_parent == NULL) {
uint64_t poolsize = dsl_pool_adjustedsize(dd->dd_pool, FALSE);
quota = MIN(quota, poolsize);
}
if (dd->dd_phys->dd_reserved > used && parentspace != UINT64_MAX) {
/*
* We have some space reserved, in addition to what our
* parent gave us.
*/
parentspace += dd->dd_phys->dd_reserved - used;
}
if (dd == ancestor) {
ASSERT(delta <= 0);
ASSERT(used >= -delta);
used += delta;
if (parentspace != UINT64_MAX)
parentspace -= delta;
}
if (used > quota) {
/* over quota */
myspace = 0;
/*
* While it's OK to be a little over quota, if
* we think we are using more space than there
* is in the pool (which is already 1.6% more than
* dsl_pool_adjustedsize()), something is very
* wrong.
*/
ASSERT3U(used, <=, spa_get_space(dd->dd_pool->dp_spa));
} else {
/*
* the lesser of the space provided by our parent and
* the space left in our quota
*/
myspace = MIN(parentspace, quota - used);
}
mutex_exit(&dd->dd_lock);
return (myspace);
}
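For a single directory, the result of dsl_dir_space_available() reduces to the lesser of what the parent offers (plus any unused reservation) and what is left under the quota. The sketch below covers only that core step. `space_available_sketch` is a hypothetical helper, and it deliberately omits the ancestor/delta adjustment, the pool-size clamp at the root, and pending writes.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_MIN(a, b)	((a) < (b) ? (a) : (b))

/*
 * Hypothetical single-level form: a quota of 0 means "no quota", and
 * UINT64_MAX for parentspace means "no restriction from the parent".
 */
static uint64_t
space_available_sketch(uint64_t parentspace, uint64_t quota,
    uint64_t reserved, uint64_t used)
{
	if (quota == 0)
		quota = UINT64_MAX;
	/* Unused reservation is available on top of the parent's space. */
	if (reserved > used && parentspace != UINT64_MAX)
		parentspace += reserved - used;
	if (used > quota)
		return (0);	/* over quota */
	return (SKETCH_MIN(parentspace, quota - used));
}
```

With a quota of 500 and 200 used, a parent offering 1000 yields 300, while a parent offering 100 yields 100.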
struct tempreserve {
list_node_t tr_node;
dsl_pool_t *tr_dp;
dsl_dir_t *tr_ds;
uint64_t tr_size;
};
static int
dsl_dir_tempreserve_impl(dsl_dir_t *dd, uint64_t asize, boolean_t netfree,
boolean_t ignorequota, boolean_t checkrefquota, list_t *tr_list,
dmu_tx_t *tx, boolean_t first)
{
uint64_t txg = tx->tx_txg;
uint64_t est_inflight, used_on_disk, quota, parent_rsrv;
struct tempreserve *tr;
int enospc = EDQUOT;
int txgidx = txg & TXG_MASK;
int i;
uint64_t ref_rsrv = 0;
ASSERT3U(txg, !=, 0);
ASSERT3S(asize, >, 0);
mutex_enter(&dd->dd_lock);
/*
* Check against the dsl_dir's quota. We don't add in the delta
* when checking for over-quota because they get one free hit.
*/
est_inflight = dsl_dir_space_towrite(dd);
for (i = 0; i < TXG_SIZE; i++)
est_inflight += dd->dd_tempreserved[i];
used_on_disk = dd->dd_phys->dd_used_bytes;
/*
* On the first iteration, fetch the dataset's used-on-disk and
* refreservation values. Also, if checkrefquota is set, test if
* allocating this space would exceed the dataset's refquota.
*/
if (first && tx->tx_objset) {
int error;
dsl_dataset_t *ds = tx->tx_objset->os->os_dsl_dataset;
error = dsl_dataset_check_quota(ds, checkrefquota,
asize, est_inflight, &used_on_disk, &ref_rsrv);
if (error) {
mutex_exit(&dd->dd_lock);
return (error);
}
}
/*
* If this transaction will result in a net free of space,
* we want to let it through.
*/
if (ignorequota || netfree || dd->dd_phys->dd_quota == 0)
quota = UINT64_MAX;
else
quota = dd->dd_phys->dd_quota;
/*
* Adjust the quota against the actual pool size at the root.
* To ensure that it's possible to remove files from a full
* pool without inducing transient overcommits, we throttle
* netfree transactions against a quota that is slightly larger,
* but still within the pool's allocation slop. In cases where
* we're very close to full, this will allow a steady trickle of
* removes to get through.
*/
if (dd->dd_parent == NULL) {
uint64_t poolsize = dsl_pool_adjustedsize(dd->dd_pool, netfree);
if (poolsize < quota) {
quota = poolsize;
enospc = ENOSPC;
}
}
/*
* If they are requesting more space, and our current estimate
* is over quota, they get to try again unless the actual
* on-disk is over quota and there are no pending changes (which
* may free up space for us).
*/
if (used_on_disk + est_inflight > quota) {
if (est_inflight > 0 || used_on_disk < quota)
enospc = ERESTART;
dprintf_dd(dd, "failing: used=%lluK inflight = %lluK "
"quota=%lluK tr=%lluK err=%d\n",
used_on_disk>>10, est_inflight>>10,
quota>>10, asize>>10, enospc);
mutex_exit(&dd->dd_lock);
return (enospc);
}
/* We need to up our estimated delta before dropping dd_lock */
dd->dd_tempreserved[txgidx] += asize;
parent_rsrv = parent_delta(dd, used_on_disk + est_inflight,
asize - ref_rsrv);
mutex_exit(&dd->dd_lock);
tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
tr->tr_ds = dd;
tr->tr_size = asize;
list_insert_tail(tr_list, tr);
/* see if it's OK with our parent */
if (dd->dd_parent && parent_rsrv) {
boolean_t ismos = (dd->dd_phys->dd_head_dataset_obj == 0);
return (dsl_dir_tempreserve_impl(dd->dd_parent,
parent_rsrv, netfree, ismos, TRUE, tr_list, tx, FALSE));
} else {
return (0);
}
}
/*
* Reserve space in this dsl_dir, to be used in this tx's txg.
* After the space has been dirtied (and dsl_dir_willuse_space()
* has been called), the reservation should be canceled, using
* dsl_dir_tempreserve_clear().
*/
int
dsl_dir_tempreserve_space(dsl_dir_t *dd, uint64_t lsize, uint64_t asize,
uint64_t fsize, uint64_t usize, void **tr_cookiep, dmu_tx_t *tx)
{
int err;
list_t *tr_list;
if (asize == 0) {
*tr_cookiep = NULL;
return (0);
}
tr_list = kmem_alloc(sizeof (list_t), KM_SLEEP);
list_create(tr_list, sizeof (struct tempreserve),
offsetof(struct tempreserve, tr_node));
ASSERT3S(asize, >, 0);
ASSERT3S(fsize, >=, 0);
err = arc_tempreserve_space(lsize, tx->tx_txg);
if (err == 0) {
struct tempreserve *tr;
tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
tr->tr_size = lsize;
list_insert_tail(tr_list, tr);
err = dsl_pool_tempreserve_space(dd->dd_pool, asize, tx);
} else {
if (err == EAGAIN) {
txg_delay(dd->dd_pool, tx->tx_txg, 1);
err = ERESTART;
}
dsl_pool_memory_pressure(dd->dd_pool);
}
if (err == 0) {
struct tempreserve *tr;
tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
tr->tr_dp = dd->dd_pool;
tr->tr_size = asize;
list_insert_tail(tr_list, tr);
err = dsl_dir_tempreserve_impl(dd, asize, fsize >= asize,
FALSE, asize > usize, tr_list, tx, TRUE);
}
if (err)
dsl_dir_tempreserve_clear(tr_list, tx);
else
*tr_cookiep = tr_list;
return (err);
}
/*
* Clear a temporary reservation that we previously made with
* dsl_dir_tempreserve_space().
*/
void
dsl_dir_tempreserve_clear(void *tr_cookie, dmu_tx_t *tx)
{
int txgidx = tx->tx_txg & TXG_MASK;
list_t *tr_list = tr_cookie;
struct tempreserve *tr;
ASSERT3U(tx->tx_txg, !=, 0);
if (tr_cookie == NULL)
return;
while (tr = list_head(tr_list)) {
if (tr->tr_dp) {
dsl_pool_tempreserve_clear(tr->tr_dp, tr->tr_size, tx);
} else if (tr->tr_ds) {
mutex_enter(&tr->tr_ds->dd_lock);
ASSERT3U(tr->tr_ds->dd_tempreserved[txgidx], >=,
tr->tr_size);
tr->tr_ds->dd_tempreserved[txgidx] -= tr->tr_size;
mutex_exit(&tr->tr_ds->dd_lock);
} else {
arc_tempreserve_clear(tr->tr_size);
}
list_remove(tr_list, tr);
kmem_free(tr, sizeof (struct tempreserve));
}
kmem_free(tr_list, sizeof (list_t));
}
static void
dsl_dir_willuse_space_impl(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx)
{
int64_t parent_space;
uint64_t est_used;
mutex_enter(&dd->dd_lock);
if (space > 0)
dd->dd_space_towrite[tx->tx_txg & TXG_MASK] += space;
est_used = dsl_dir_space_towrite(dd) + dd->dd_phys->dd_used_bytes;
parent_space = parent_delta(dd, est_used, space);
mutex_exit(&dd->dd_lock);
/* Make sure that we clean up dd_space_to* */
dsl_dir_dirty(dd, tx);
/* XXX this is potentially expensive and unnecessary... */
if (parent_space && dd->dd_parent)
dsl_dir_willuse_space_impl(dd->dd_parent, parent_space, tx);
}
/*
* Call in open context when we think we're going to write/free space,
* eg. when dirtying data. Be conservative (ie. OK to write less than
* this or free more than this, but don't write more or free less).
*/
void
dsl_dir_willuse_space(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx)
{
dsl_pool_willuse_space(dd->dd_pool, space, tx);
dsl_dir_willuse_space_impl(dd, space, tx);
}
/* call from syncing context when we actually write/free space for this dd */
void
dsl_dir_diduse_space(dsl_dir_t *dd, dd_used_t type,
int64_t used, int64_t compressed, int64_t uncompressed, dmu_tx_t *tx)
{
int64_t accounted_delta;
boolean_t needlock = !MUTEX_HELD(&dd->dd_lock);
ASSERT(dmu_tx_is_syncing(tx));
ASSERT(type < DD_USED_NUM);
dsl_dir_dirty(dd, tx);
if (needlock)
mutex_enter(&dd->dd_lock);
accounted_delta = parent_delta(dd, dd->dd_phys->dd_used_bytes, used);
ASSERT(used >= 0 || dd->dd_phys->dd_used_bytes >= -used);
ASSERT(compressed >= 0 ||
dd->dd_phys->dd_compressed_bytes >= -compressed);
ASSERT(uncompressed >= 0 ||
dd->dd_phys->dd_uncompressed_bytes >= -uncompressed);
dd->dd_phys->dd_used_bytes += used;
dd->dd_phys->dd_uncompressed_bytes += uncompressed;
dd->dd_phys->dd_compressed_bytes += compressed;
if (dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN) {
ASSERT(used > 0 ||
dd->dd_phys->dd_used_breakdown[type] >= -used);
dd->dd_phys->dd_used_breakdown[type] += used;
#ifdef DEBUG
dd_used_t t;
uint64_t u = 0;
for (t = 0; t < DD_USED_NUM; t++)
u += dd->dd_phys->dd_used_breakdown[t];
ASSERT3U(u, ==, dd->dd_phys->dd_used_bytes);
#endif
}
if (needlock)
mutex_exit(&dd->dd_lock);
if (dd->dd_parent != NULL) {
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD,
accounted_delta, compressed, uncompressed, tx);
dsl_dir_transfer_space(dd->dd_parent,
used - accounted_delta,
DD_USED_CHILD_RSRV, DD_USED_CHILD, tx);
}
}
void
dsl_dir_transfer_space(dsl_dir_t *dd, int64_t delta,
dd_used_t oldtype, dd_used_t newtype, dmu_tx_t *tx)
{
boolean_t needlock = !MUTEX_HELD(&dd->dd_lock);
ASSERT(dmu_tx_is_syncing(tx));
ASSERT(oldtype < DD_USED_NUM);
ASSERT(newtype < DD_USED_NUM);
if (delta == 0 || !(dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN))
return;
dsl_dir_dirty(dd, tx);
if (needlock)
mutex_enter(&dd->dd_lock);
ASSERT(delta > 0 ?
dd->dd_phys->dd_used_breakdown[oldtype] >= delta :
dd->dd_phys->dd_used_breakdown[newtype] >= -delta);
ASSERT(dd->dd_phys->dd_used_bytes >= ABS(delta));
dd->dd_phys->dd_used_breakdown[oldtype] -= delta;
dd->dd_phys->dd_used_breakdown[newtype] += delta;
if (needlock)
mutex_exit(&dd->dd_lock);
}
static int
dsl_dir_set_quota_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *quotap = arg2;
uint64_t new_quota = *quotap;
int err = 0;
uint64_t towrite;
if (new_quota == 0)
return (0);
mutex_enter(&dd->dd_lock);
/*
* If we are doing the preliminary check in open context, and
* there are pending changes, then don't fail it, since the
* pending changes could under-estimate the amount of space to be
* freed up.
*/
towrite = dsl_dir_space_towrite(dd);
if ((dmu_tx_is_syncing(tx) || towrite == 0) &&
(new_quota < dd->dd_phys->dd_reserved ||
new_quota < dd->dd_phys->dd_used_bytes + towrite)) {
err = ENOSPC;
}
mutex_exit(&dd->dd_lock);
return (err);
}
/* ARGSUSED */
static void
dsl_dir_set_quota_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *quotap = arg2;
uint64_t new_quota = *quotap;
dmu_buf_will_dirty(dd->dd_dbuf, tx);
mutex_enter(&dd->dd_lock);
dd->dd_phys->dd_quota = new_quota;
mutex_exit(&dd->dd_lock);
spa_history_internal_log(LOG_DS_QUOTA, dd->dd_pool->dp_spa,
tx, cr, "%lld dataset = %llu ",
(longlong_t)new_quota, dd->dd_phys->dd_head_dataset_obj);
}
int
dsl_dir_set_quota(const char *ddname, uint64_t quota)
{
dsl_dir_t *dd;
int err;
err = dsl_dir_open(ddname, FTAG, &dd, NULL);
if (err)
return (err);
if (quota != dd->dd_phys->dd_quota) {
/*
* If someone removes a file, then tries to set the quota, we
* want to make sure the file freeing takes effect.
*/
txg_wait_open(dd->dd_pool, 0);
err = dsl_sync_task_do(dd->dd_pool, dsl_dir_set_quota_check,
dsl_dir_set_quota_sync, dd, &quota, 0);
}
dsl_dir_close(dd, FTAG);
return (err);
}
int
dsl_dir_set_reservation_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *reservationp = arg2;
uint64_t new_reservation = *reservationp;
uint64_t used, avail;
int64_t delta;
if (new_reservation > INT64_MAX)
return (EOVERFLOW);
/*
* If we are doing the preliminary check in open context, the
* space estimates may be inaccurate.
*/
if (!dmu_tx_is_syncing(tx))
return (0);
mutex_enter(&dd->dd_lock);
used = dd->dd_phys->dd_used_bytes;
delta = MAX(used, new_reservation) -
MAX(used, dd->dd_phys->dd_reserved);
mutex_exit(&dd->dd_lock);
if (dd->dd_parent) {
avail = dsl_dir_space_available(dd->dd_parent,
NULL, 0, FALSE);
} else {
avail = dsl_pool_adjustedsize(dd->dd_pool, B_FALSE) - used;
}
if (delta > 0 && delta > avail)
return (ENOSPC);
if (delta > 0 && dd->dd_phys->dd_quota > 0 &&
new_reservation > dd->dd_phys->dd_quota)
return (ENOSPC);
return (0);
}
/* ARGSUSED */
static void
dsl_dir_set_reservation_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *reservationp = arg2;
uint64_t new_reservation = *reservationp;
uint64_t used;
int64_t delta;
dmu_buf_will_dirty(dd->dd_dbuf, tx);
mutex_enter(&dd->dd_lock);
used = dd->dd_phys->dd_used_bytes;
delta = MAX(used, new_reservation) -
MAX(used, dd->dd_phys->dd_reserved);
dd->dd_phys->dd_reserved = new_reservation;
if (dd->dd_parent != NULL) {
/* Roll up this additional usage into our ancestors */
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD_RSRV,
delta, 0, 0, tx);
}
mutex_exit(&dd->dd_lock);
spa_history_internal_log(LOG_DS_RESERVATION, dd->dd_pool->dp_spa,
tx, cr, "%lld dataset = %llu",
(longlong_t)new_reservation, dd->dd_phys->dd_head_dataset_obj);
}
int
dsl_dir_set_reservation(const char *ddname, uint64_t reservation)
{
dsl_dir_t *dd;
int err;
err = dsl_dir_open(ddname, FTAG, &dd, NULL);
if (err)
return (err);
err = dsl_sync_task_do(dd->dd_pool, dsl_dir_set_reservation_check,
dsl_dir_set_reservation_sync, dd, &reservation, 0);
dsl_dir_close(dd, FTAG);
return (err);
}
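Note on the reservation arithmetic above: both dsl_dir_set_reservation_check() and dsl_dir_set_reservation_sync() compute delta as MAX(used, new) - MAX(used, old), so the parent is charged only for the part of a reservation that exceeds current usage. The following is a minimal standalone sketch of that calculation, not ZFS code. The names max_u64 and reservation_delta are hypothetical, and plain integers stand in for the dd_phys fields.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: larger of two unsigned 64-bit values. */
static uint64_t
max_u64(uint64_t a, uint64_t b)
{
	return (a > b ? a : b);
}

/*
 * Illustrative only: change in space charged to the parent when a
 * reservation moves from old_resv to new_resv while `used` bytes are
 * in use.  Mirrors MAX(used, new) - MAX(used, old) from the diff.
 * This sketch assumes every input fits in int64_t; the real check
 * function rejects new_reservation > INT64_MAX before this point.
 */
int64_t
reservation_delta(uint64_t used, uint64_t old_resv, uint64_t new_resv)
{
	assert(used <= INT64_MAX);
	assert(old_resv <= INT64_MAX);
	assert(new_resv <= INT64_MAX);
	return ((int64_t)max_u64(used, new_resv) -
	    (int64_t)max_u64(used, old_resv));
}
```

With 10 bytes used, raising the reservation from 5 to 8 yields a delta of 0, because usage already covers both values; raising it from 5 to 20 yields 10.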
static dsl_dir_t *
closest_common_ancestor(dsl_dir_t *ds1, dsl_dir_t *ds2)
{
for (; ds1; ds1 = ds1->dd_parent) {
dsl_dir_t *dd;
for (dd = ds2; dd; dd = dd->dd_parent) {
if (ds1 == dd)
return (dd);
}
}
return (NULL);
}
/*
* If delta is applied to dd, how much of that delta would be applied to
* ancestor? Syncing context only.
*/
static int64_t
would_change(dsl_dir_t *dd, int64_t delta, dsl_dir_t *ancestor)
{
if (dd == ancestor)
return (delta);
mutex_enter(&dd->dd_lock);
delta = parent_delta(dd, dd->dd_phys->dd_used_bytes, delta);
mutex_exit(&dd->dd_lock);
return (would_change(dd->dd_parent, delta, ancestor));
}
struct renamearg {
dsl_dir_t *newparent;
const char *mynewname;
};
/*ARGSUSED*/
static int
dsl_dir_rename_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
struct renamearg *ra = arg2;
dsl_pool_t *dp = dd->dd_pool;
objset_t *mos = dp->dp_meta_objset;
int err;
uint64_t val;
/* There should be 2 references: the open and the dirty */
if (dmu_buf_refcount(dd->dd_dbuf) > 2)
return (EBUSY);
/* check for existing name */
err = zap_lookup(mos, ra->newparent->dd_phys->dd_child_dir_zapobj,
ra->mynewname, 8, 1, &val);
if (err == 0)
return (EEXIST);
if (err != ENOENT)
return (err);
if (ra->newparent != dd->dd_parent) {
/* is there enough space? */
uint64_t myspace =
MAX(dd->dd_phys->dd_used_bytes, dd->dd_phys->dd_reserved);
/* no rename into our descendant */
if (closest_common_ancestor(dd, ra->newparent) == dd)
return (EINVAL);
if (err = dsl_dir_transfer_possible(dd->dd_parent,
ra->newparent, myspace))
return (err);
}
return (0);
}
static void
dsl_dir_rename_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
struct renamearg *ra = arg2;
dsl_pool_t *dp = dd->dd_pool;
objset_t *mos = dp->dp_meta_objset;
int err;
ASSERT(dmu_buf_refcount(dd->dd_dbuf) <= 2);
if (ra->newparent != dd->dd_parent) {
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD,
-dd->dd_phys->dd_used_bytes,
-dd->dd_phys->dd_compressed_bytes,
-dd->dd_phys->dd_uncompressed_bytes, tx);
dsl_dir_diduse_space(ra->newparent, DD_USED_CHILD,
dd->dd_phys->dd_used_bytes,
dd->dd_phys->dd_compressed_bytes,
dd->dd_phys->dd_uncompressed_bytes, tx);
if (dd->dd_phys->dd_reserved > dd->dd_phys->dd_used_bytes) {
uint64_t unused_rsrv = dd->dd_phys->dd_reserved -
dd->dd_phys->dd_used_bytes;
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD_RSRV,
-unused_rsrv, 0, 0, tx);
dsl_dir_diduse_space(ra->newparent, DD_USED_CHILD_RSRV,
unused_rsrv, 0, 0, tx);
}
}
dmu_buf_will_dirty(dd->dd_dbuf, tx);
/* remove from old parent zapobj */
err = zap_remove(mos, dd->dd_parent->dd_phys->dd_child_dir_zapobj,
dd->dd_myname, tx);
ASSERT3U(err, ==, 0);
(void) strcpy(dd->dd_myname, ra->mynewname);
dsl_dir_close(dd->dd_parent, dd);
dd->dd_phys->dd_parent_obj = ra->newparent->dd_object;
VERIFY(0 == dsl_dir_open_obj(dd->dd_pool,
ra->newparent->dd_object, NULL, dd, &dd->dd_parent));
/* add to new parent zapobj */
err = zap_add(mos, ra->newparent->dd_phys->dd_child_dir_zapobj,
dd->dd_myname, 8, 1, &dd->dd_object, tx);
ASSERT3U(err, ==, 0);
spa_history_internal_log(LOG_DS_RENAME, dd->dd_pool->dp_spa,
tx, cr, "dataset = %llu", dd->dd_phys->dd_head_dataset_obj);
}
int
dsl_dir_rename(dsl_dir_t *dd, const char *newname)
{
struct renamearg ra;
int err;
/* new parent should exist */
err = dsl_dir_open(newname, FTAG, &ra.newparent, &ra.mynewname);
if (err)
return (err);
/* can't rename to different pool */
if (dd->dd_pool != ra.newparent->dd_pool) {
err = ENXIO;
goto out;
}
/* new name should not already exist */
if (ra.mynewname == NULL) {
err = EEXIST;
goto out;
}
err = dsl_sync_task_do(dd->dd_pool,
dsl_dir_rename_check, dsl_dir_rename_sync, dd, &ra, 3);
out:
dsl_dir_close(ra.newparent, FTAG);
return (err);
}
int
dsl_dir_transfer_possible(dsl_dir_t *sdd, dsl_dir_t *tdd, uint64_t space)
{
dsl_dir_t *ancestor;
int64_t adelta;
uint64_t avail;
ancestor = closest_common_ancestor(sdd, tdd);
adelta = would_change(sdd, -space, ancestor);
avail = dsl_dir_space_available(tdd, ancestor, adelta, FALSE);
if (avail < space)
return (ENOSPC);
return (0);
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c (revision 209274)
@@ -1,1025 +1,1027 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/dsl_pool.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_prop.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_synctask.h>
#include <sys/dnode.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/arc.h>
#include <sys/zap.h>
#include <sys/zio.h>
#include <sys/zfs_context.h>
#include <sys/fs/zfs.h>
#include <sys/zfs_znode.h>
#include <sys/spa_impl.h>
#include <sys/vdev_impl.h>
#include <sys/zil_impl.h>
typedef int (scrub_cb_t)(dsl_pool_t *, const blkptr_t *, const zbookmark_t *);
static scrub_cb_t dsl_pool_scrub_clean_cb;
static dsl_syncfunc_t dsl_pool_scrub_cancel_sync;
int zfs_scrub_min_time = 1; /* scrub for at least 1 sec each txg */
int zfs_resilver_min_time = 3; /* resilver for at least 3 sec each txg */
boolean_t zfs_no_scrub_io = B_FALSE; /* set to disable scrub i/o */
extern int zfs_txg_timeout;
static scrub_cb_t *scrub_funcs[SCRUB_FUNC_NUMFUNCS] = {
NULL,
dsl_pool_scrub_clean_cb
};
#define SET_BOOKMARK(zb, objset, object, level, blkid) \
{ \
(zb)->zb_objset = objset; \
(zb)->zb_object = object; \
(zb)->zb_level = level; \
(zb)->zb_blkid = blkid; \
}
/* ARGSUSED */
static void
dsl_pool_scrub_setup_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_pool_t *dp = arg1;
enum scrub_func *funcp = arg2;
dmu_object_type_t ot = 0;
boolean_t complete = B_FALSE;
dsl_pool_scrub_cancel_sync(dp, &complete, cr, tx);
ASSERT(dp->dp_scrub_func == SCRUB_FUNC_NONE);
ASSERT(*funcp > SCRUB_FUNC_NONE);
ASSERT(*funcp < SCRUB_FUNC_NUMFUNCS);
dp->dp_scrub_min_txg = 0;
dp->dp_scrub_max_txg = tx->tx_txg;
if (*funcp == SCRUB_FUNC_CLEAN) {
vdev_t *rvd = dp->dp_spa->spa_root_vdev;
/* rewrite all disk labels */
vdev_config_dirty(rvd);
if (vdev_resilver_needed(rvd,
&dp->dp_scrub_min_txg, &dp->dp_scrub_max_txg)) {
spa_event_notify(dp->dp_spa, NULL,
ESC_ZFS_RESILVER_START);
dp->dp_scrub_max_txg = MIN(dp->dp_scrub_max_txg,
tx->tx_txg);
}
/* zero out the scrub stats in all vdev_stat_t's */
vdev_scrub_stat_update(rvd,
dp->dp_scrub_min_txg ? POOL_SCRUB_RESILVER :
POOL_SCRUB_EVERYTHING, B_FALSE);
dp->dp_spa->spa_scrub_started = B_TRUE;
}
/* back to the generic stuff */
if (dp->dp_blkstats == NULL) {
dp->dp_blkstats =
kmem_alloc(sizeof (zfs_all_blkstats_t), KM_SLEEP);
}
bzero(dp->dp_blkstats, sizeof (zfs_all_blkstats_t));
if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB)
ot = DMU_OT_ZAP_OTHER;
dp->dp_scrub_func = *funcp;
dp->dp_scrub_queue_obj = zap_create(dp->dp_meta_objset,
ot ? ot : DMU_OT_SCRUB_QUEUE, DMU_OT_NONE, 0, tx);
bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));
dp->dp_scrub_restart = B_FALSE;
dp->dp_spa->spa_scrub_errors = 0;
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_FUNC, sizeof (uint32_t), 1,
&dp->dp_scrub_func, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_QUEUE, sizeof (uint64_t), 1,
&dp->dp_scrub_queue_obj, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MIN_TXG, sizeof (uint64_t), 1,
&dp->dp_scrub_min_txg, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MAX_TXG, sizeof (uint64_t), 1,
&dp->dp_scrub_max_txg, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4,
&dp->dp_scrub_bookmark, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_ERRORS, sizeof (uint64_t), 1,
&dp->dp_spa->spa_scrub_errors, tx));
spa_history_internal_log(LOG_POOL_SCRUB, dp->dp_spa, tx, cr,
"func=%u mintxg=%llu maxtxg=%llu",
*funcp, dp->dp_scrub_min_txg, dp->dp_scrub_max_txg);
}
int
dsl_pool_scrub_setup(dsl_pool_t *dp, enum scrub_func func)
{
return (dsl_sync_task_do(dp, NULL,
dsl_pool_scrub_setup_sync, dp, &func, 0));
}
/* ARGSUSED */
static void
dsl_pool_scrub_cancel_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_pool_t *dp = arg1;
boolean_t *completep = arg2;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
mutex_enter(&dp->dp_scrub_cancel_lock);
if (dp->dp_scrub_restart) {
dp->dp_scrub_restart = B_FALSE;
*completep = B_FALSE;
}
/* XXX this is scrub-clean specific */
mutex_enter(&dp->dp_spa->spa_scrub_lock);
while (dp->dp_spa->spa_scrub_inflight > 0) {
cv_wait(&dp->dp_spa->spa_scrub_io_cv,
&dp->dp_spa->spa_scrub_lock);
}
mutex_exit(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_started = B_FALSE;
dp->dp_spa->spa_scrub_active = B_FALSE;
dp->dp_scrub_func = SCRUB_FUNC_NONE;
VERIFY(0 == dmu_object_free(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, tx));
dp->dp_scrub_queue_obj = 0;
bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_QUEUE, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MIN_TXG, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MAX_TXG, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_BOOKMARK, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_FUNC, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_ERRORS, tx));
spa_history_internal_log(LOG_POOL_SCRUB_DONE, dp->dp_spa, tx, cr,
"complete=%u", *completep);
/* below is scrub-clean specific */
vdev_scrub_stat_update(dp->dp_spa->spa_root_vdev, POOL_SCRUB_NONE,
*completep);
/*
* If the scrub/resilver completed, update all DTLs to reflect this.
* Whether it succeeded or not, vacate all temporary scrub DTLs.
*/
vdev_dtl_reassess(dp->dp_spa->spa_root_vdev, tx->tx_txg,
*completep ? dp->dp_scrub_max_txg : 0, B_TRUE);
if (dp->dp_scrub_min_txg && *completep)
spa_event_notify(dp->dp_spa, NULL, ESC_ZFS_RESILVER_FINISH);
spa_errlog_rotate(dp->dp_spa);
/*
* We may have finished replacing a device.
* Let the async thread assess this and handle the detach.
*/
spa_async_request(dp->dp_spa, SPA_ASYNC_RESILVER_DONE);
dp->dp_scrub_min_txg = dp->dp_scrub_max_txg = 0;
mutex_exit(&dp->dp_scrub_cancel_lock);
}
int
dsl_pool_scrub_cancel(dsl_pool_t *dp)
{
boolean_t complete = B_FALSE;
return (dsl_sync_task_do(dp, NULL,
dsl_pool_scrub_cancel_sync, dp, &complete, 3));
}
int
dsl_free(zio_t *pio, dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp,
zio_done_func_t *done, void *private, uint32_t arc_flags)
{
/*
* This function will be used by bp-rewrite wad to intercept frees.
*/
return (arc_free(pio, dp->dp_spa, txg, (blkptr_t *)bpp,
done, private, arc_flags));
}
static boolean_t
bookmark_is_zero(const zbookmark_t *zb)
{
return (zb->zb_objset == 0 && zb->zb_object == 0 &&
zb->zb_level == 0 && zb->zb_blkid == 0);
}
/* dnp is the dnode for zb1->zb_object */
static boolean_t
bookmark_is_before(dnode_phys_t *dnp, const zbookmark_t *zb1,
const zbookmark_t *zb2)
{
uint64_t zb1nextL0, zb2thisobj;
ASSERT(zb1->zb_objset == zb2->zb_objset);
ASSERT(zb1->zb_object != -1ULL);
ASSERT(zb2->zb_level == 0);
/*
* A bookmark in the deadlist is considered to be after
* everything else.
*/
if (zb2->zb_object == -1ULL)
return (B_TRUE);
/* The objset_phys_t isn't before anything. */
if (dnp == NULL)
return (B_FALSE);
zb1nextL0 = (zb1->zb_blkid + 1) <<
((zb1->zb_level) * (dnp->dn_indblkshift - SPA_BLKPTRSHIFT));
zb2thisobj = zb2->zb_object ? zb2->zb_object :
zb2->zb_blkid << (DNODE_BLOCK_SHIFT - DNODE_SHIFT);
if (zb1->zb_object == 0) {
uint64_t nextobj = zb1nextL0 *
(dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT) >> DNODE_SHIFT;
return (nextobj <= zb2thisobj);
}
if (zb1->zb_object < zb2thisobj)
return (B_TRUE);
if (zb1->zb_object > zb2thisobj)
return (B_FALSE);
if (zb2->zb_object == 0)
return (B_FALSE);
return (zb1nextL0 <= zb2->zb_blkid);
}
static boolean_t
scrub_pause(dsl_pool_t *dp, const zbookmark_t *zb)
{
int elapsed_ticks;
int mintime;
if (dp->dp_scrub_pausing)
return (B_TRUE); /* we're already pausing */
if (!bookmark_is_zero(&dp->dp_scrub_bookmark))
return (B_FALSE); /* we're resuming */
/* We only know how to resume from level-0 blocks. */
if (zb->zb_level != 0)
return (B_FALSE);
mintime = dp->dp_scrub_isresilver ? zfs_resilver_min_time :
zfs_scrub_min_time;
elapsed_ticks = lbolt64 - dp->dp_scrub_start_time;
if (elapsed_ticks > hz * zfs_txg_timeout ||
(elapsed_ticks > hz * mintime && txg_sync_waiting(dp))) {
dprintf("pausing at %llx/%llx/%llx/%llx\n",
(longlong_t)zb->zb_objset, (longlong_t)zb->zb_object,
(longlong_t)zb->zb_level, (longlong_t)zb->zb_blkid);
dp->dp_scrub_pausing = B_TRUE;
dp->dp_scrub_bookmark = *zb;
return (B_TRUE);
}
return (B_FALSE);
}
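Note on the timing test in scrub_pause() above: the scrub pauses once it has run longer than the txg timeout, or once it has run for its minimum time and a txg sync is waiting. The sketch below isolates only that condition; it leaves out the earlier checks (already pausing, resuming from a bookmark, non-zero level). The function name scrub_should_pause and its parameter list are hypothetical, and the caller passes in what the kernel code reads from hz, zfs_txg_timeout, and txg_sync_waiting().

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Illustrative only: the elapsed-time condition from scrub_pause().
 *   elapsed_ticks - ticks since this txg's scrub pass started
 *   hz            - clock ticks per second
 *   txg_timeout   - seconds after which the pass must always yield
 *   mintime       - minimum seconds to scrub before yielding to a sync
 *   sync_waiting  - whether a txg sync is waiting on this pass
 */
bool
scrub_should_pause(int64_t elapsed_ticks, int hz, int txg_timeout,
    int mintime, bool sync_waiting)
{
	/* Hard limit: never run past the txg timeout. */
	if (elapsed_ticks > (int64_t)hz * txg_timeout)
		return (true);
	/* Soft limit: yield after mintime only if a sync is waiting. */
	if (elapsed_ticks > (int64_t)hz * mintime && sync_waiting)
		return (true);
	return (false);
}
```

Both comparisons are strict, matching the original: a pass that has run for exactly the timeout does not pause yet.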
typedef struct zil_traverse_arg {
dsl_pool_t *zta_dp;
zil_header_t *zta_zh;
} zil_traverse_arg_t;
/* ARGSUSED */
static void
traverse_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
{
zil_traverse_arg_t *zta = arg;
dsl_pool_t *dp = zta->zta_dp;
zil_header_t *zh = zta->zta_zh;
zbookmark_t zb;
if (bp->blk_birth <= dp->dp_scrub_min_txg)
return;
/*
* One block ("stumpy") can be allocated a long time ago; we
* want to visit that one because it has been allocated
* (on-disk) even if it hasn't been claimed (even though for
* plain scrub there's nothing to do to it).
*/
if (claim_txg == 0 && bp->blk_birth >= spa_first_txg(dp->dp_spa))
return;
zb.zb_objset = zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET];
zb.zb_object = 0;
zb.zb_level = -1;
zb.zb_blkid = bp->blk_cksum.zc_word[ZIL_ZC_SEQ];
VERIFY(0 == scrub_funcs[dp->dp_scrub_func](dp, bp, &zb));
}
/* ARGSUSED */
static void
traverse_zil_record(zilog_t *zilog, lr_t *lrc, void *arg, uint64_t claim_txg)
{
if (lrc->lrc_txtype == TX_WRITE) {
zil_traverse_arg_t *zta = arg;
dsl_pool_t *dp = zta->zta_dp;
zil_header_t *zh = zta->zta_zh;
lr_write_t *lr = (lr_write_t *)lrc;
blkptr_t *bp = &lr->lr_blkptr;
zbookmark_t zb;
if (bp->blk_birth <= dp->dp_scrub_min_txg)
return;
/*
* birth can be < claim_txg if this record's txg is
* already txg sync'ed (but this log block contains
* other records that are not synced)
*/
if (claim_txg == 0 || bp->blk_birth < claim_txg)
return;
zb.zb_objset = zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET];
zb.zb_object = lr->lr_foid;
zb.zb_level = BP_GET_LEVEL(bp);
zb.zb_blkid = lr->lr_offset / BP_GET_LSIZE(bp);
VERIFY(0 == scrub_funcs[dp->dp_scrub_func](dp, bp, &zb));
}
}
static void
traverse_zil(dsl_pool_t *dp, zil_header_t *zh)
{
uint64_t claim_txg = zh->zh_claim_txg;
zil_traverse_arg_t zta = { dp, zh };
zilog_t *zilog;
/*
* We only want to visit blocks that have been claimed but not yet
* replayed (or, in read-only mode, blocks that *would* be claimed).
*/
if (claim_txg == 0 && (spa_mode & FWRITE))
return;
zilog = zil_alloc(dp->dp_meta_objset, zh);
(void) zil_parse(zilog, traverse_zil_block, traverse_zil_record, &zta,
claim_txg);
zil_free(zilog);
}
static void
scrub_visitbp(dsl_pool_t *dp, dnode_phys_t *dnp,
arc_buf_t *pbuf, blkptr_t *bp, const zbookmark_t *zb)
{
int err;
arc_buf_t *buf = NULL;
if (bp->blk_birth == 0)
return;
if (bp->blk_birth <= dp->dp_scrub_min_txg)
return;
if (scrub_pause(dp, zb))
return;
if (!bookmark_is_zero(&dp->dp_scrub_bookmark)) {
/*
* If we already visited this bp & everything below (in
* a prior txg), don't bother doing it again.
*/
if (bookmark_is_before(dnp, zb, &dp->dp_scrub_bookmark))
return;
/*
* If we found the block we're trying to resume from, or
* we went past it to a different object, zero it out to
* indicate that it's OK to start checking for pausing
* again.
*/
if (bcmp(zb, &dp->dp_scrub_bookmark, sizeof (*zb)) == 0 ||
zb->zb_object > dp->dp_scrub_bookmark.zb_object) {
dprintf("resuming at %llx/%llx/%llx/%llx\n",
(longlong_t)zb->zb_objset,
(longlong_t)zb->zb_object,
(longlong_t)zb->zb_level,
(longlong_t)zb->zb_blkid);
bzero(&dp->dp_scrub_bookmark, sizeof (*zb));
}
}
if (BP_GET_LEVEL(bp) > 0) {
uint32_t flags = ARC_WAIT;
int i;
blkptr_t *cbp;
int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
err = arc_read(NULL, dp->dp_spa, bp, pbuf,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err) {
mutex_enter(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_errors++;
mutex_exit(&dp->dp_spa->spa_scrub_lock);
return;
}
cbp = buf->b_data;
for (i = 0; i < epb; i++, cbp++) {
zbookmark_t czb;
SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
zb->zb_level - 1,
zb->zb_blkid * epb + i);
scrub_visitbp(dp, dnp, buf, cbp, &czb);
}
} else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {
uint32_t flags = ARC_WAIT;
dnode_phys_t *child_dnp;
int i, j;
int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;
err = arc_read(NULL, dp->dp_spa, bp, pbuf,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err) {
mutex_enter(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_errors++;
mutex_exit(&dp->dp_spa->spa_scrub_lock);
return;
}
child_dnp = buf->b_data;
for (i = 0; i < epb; i++, child_dnp++) {
for (j = 0; j < child_dnp->dn_nblkptr; j++) {
zbookmark_t czb;
SET_BOOKMARK(&czb, zb->zb_objset,
zb->zb_blkid * epb + i,
child_dnp->dn_nlevels - 1, j);
scrub_visitbp(dp, child_dnp, buf,
&child_dnp->dn_blkptr[j], &czb);
}
}
} else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
uint32_t flags = ARC_WAIT;
objset_phys_t *osp;
int j;
err = arc_read_nolock(NULL, dp->dp_spa, bp,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err) {
mutex_enter(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_errors++;
mutex_exit(&dp->dp_spa->spa_scrub_lock);
return;
}
osp = buf->b_data;
traverse_zil(dp, &osp->os_zil_header);
for (j = 0; j < osp->os_meta_dnode.dn_nblkptr; j++) {
zbookmark_t czb;
SET_BOOKMARK(&czb, zb->zb_objset, 0,
osp->os_meta_dnode.dn_nlevels - 1, j);
scrub_visitbp(dp, &osp->os_meta_dnode, buf,
&osp->os_meta_dnode.dn_blkptr[j], &czb);
}
}
(void) scrub_funcs[dp->dp_scrub_func](dp, bp, zb);
if (buf)
(void) arc_buf_remove_ref(buf, &buf);
}
static void
scrub_visit_rootbp(dsl_pool_t *dp, dsl_dataset_t *ds, blkptr_t *bp)
{
zbookmark_t zb;
SET_BOOKMARK(&zb, ds ? ds->ds_object : 0, 0, -1, 0);
scrub_visitbp(dp, NULL, NULL, bp, &zb);
}
void
dsl_pool_ds_destroyed(dsl_dataset_t *ds, dmu_tx_t *tx)
{
dsl_pool_t *dp = ds->ds_dir->dd_pool;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
if (dp->dp_scrub_bookmark.zb_objset == ds->ds_object) {
SET_BOOKMARK(&dp->dp_scrub_bookmark, -1, 0, 0, 0);
} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, tx) != 0) {
return;
}
if (ds->ds_phys->ds_next_snap_obj != 0) {
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_phys->ds_next_snap_obj, tx) == 0);
}
ASSERT3U(ds->ds_phys->ds_num_children, <=, 1);
}
void
dsl_pool_ds_snapshotted(dsl_dataset_t *ds, dmu_tx_t *tx)
{
dsl_pool_t *dp = ds->ds_dir->dd_pool;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
ASSERT(ds->ds_phys->ds_prev_snap_obj != 0);
if (dp->dp_scrub_bookmark.zb_objset == ds->ds_object) {
dp->dp_scrub_bookmark.zb_objset =
ds->ds_phys->ds_prev_snap_obj;
} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, tx) == 0) {
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_phys->ds_prev_snap_obj, tx) == 0);
}
}
void
dsl_pool_ds_clone_swapped(dsl_dataset_t *ds1, dsl_dataset_t *ds2, dmu_tx_t *tx)
{
dsl_pool_t *dp = ds1->ds_dir->dd_pool;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
if (dp->dp_scrub_bookmark.zb_objset == ds1->ds_object) {
dp->dp_scrub_bookmark.zb_objset = ds2->ds_object;
} else if (dp->dp_scrub_bookmark.zb_objset == ds2->ds_object) {
dp->dp_scrub_bookmark.zb_objset = ds1->ds_object;
}
if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds1->ds_object, tx) == 0) {
int err = zap_add_int(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, ds2->ds_object, tx);
VERIFY(err == 0 || err == EEXIST);
if (err == EEXIST) {
/* Both were there to begin with */
VERIFY(0 == zap_add_int(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, ds1->ds_object, tx));
}
} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds2->ds_object, tx) == 0) {
VERIFY(0 == zap_add_int(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, ds1->ds_object, tx));
}
}
struct enqueue_clones_arg {
dmu_tx_t *tx;
uint64_t originobj;
};
/* ARGSUSED */
static int
enqueue_clones_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
{
struct enqueue_clones_arg *eca = arg;
dsl_dataset_t *ds;
int err;
dsl_pool_t *dp;
err = dsl_dataset_hold_obj(spa->spa_dsl_pool, dsobj, FTAG, &ds);
if (err)
return (err);
dp = ds->ds_dir->dd_pool;
if (ds->ds_dir->dd_phys->dd_origin_obj == eca->originobj) {
while (ds->ds_phys->ds_prev_snap_obj != eca->originobj) {
dsl_dataset_t *prev;
err = dsl_dataset_hold_obj(dp,
ds->ds_phys->ds_prev_snap_obj, FTAG, &prev);
dsl_dataset_rele(ds, FTAG);
if (err)
return (err);
ds = prev;
}
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, eca->tx) == 0);
}
dsl_dataset_rele(ds, FTAG);
return (0);
}
static void
scrub_visitds(dsl_pool_t *dp, uint64_t dsobj, dmu_tx_t *tx)
{
dsl_dataset_t *ds;
uint64_t min_txg_save;
VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
/*
* Iterate over the bps in this ds.
*/
min_txg_save = dp->dp_scrub_min_txg;
dp->dp_scrub_min_txg =
MAX(dp->dp_scrub_min_txg, ds->ds_phys->ds_prev_snap_txg);
scrub_visit_rootbp(dp, ds, &ds->ds_phys->ds_bp);
dp->dp_scrub_min_txg = min_txg_save;
if (dp->dp_scrub_pausing)
goto out;
/*
* Add descendent datasets to work queue.
*/
if (ds->ds_phys->ds_next_snap_obj != 0) {
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_phys->ds_next_snap_obj, tx) == 0);
}
if (ds->ds_phys->ds_num_children > 1) {
if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
struct enqueue_clones_arg eca;
eca.tx = tx;
eca.originobj = ds->ds_object;
(void) dmu_objset_find_spa(ds->ds_dir->dd_pool->dp_spa,
NULL, enqueue_clones_cb, &eca, DS_FIND_CHILDREN);
} else {
VERIFY(zap_join(dp->dp_meta_objset,
ds->ds_phys->ds_next_clones_obj,
dp->dp_scrub_queue_obj, tx) == 0);
}
}
out:
dsl_dataset_rele(ds, FTAG);
}
/* ARGSUSED */
static int
enqueue_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
{
dmu_tx_t *tx = arg;
dsl_dataset_t *ds;
int err;
dsl_pool_t *dp;
err = dsl_dataset_hold_obj(spa->spa_dsl_pool, dsobj, FTAG, &ds);
if (err)
return (err);
dp = ds->ds_dir->dd_pool;
while (ds->ds_phys->ds_prev_snap_obj != 0) {
dsl_dataset_t *prev;
err = dsl_dataset_hold_obj(dp, ds->ds_phys->ds_prev_snap_obj,
FTAG, &prev);
if (err) {
dsl_dataset_rele(ds, FTAG);
return (err);
}
/*
* If this is a clone, we don't need to worry about it for now.
*/
if (prev->ds_phys->ds_next_snap_obj != ds->ds_object) {
dsl_dataset_rele(ds, FTAG);
dsl_dataset_rele(prev, FTAG);
return (0);
}
dsl_dataset_rele(ds, FTAG);
ds = prev;
}
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, tx) == 0);
dsl_dataset_rele(ds, FTAG);
return (0);
}
void
dsl_pool_scrub_sync(dsl_pool_t *dp, dmu_tx_t *tx)
{
zap_cursor_t zc;
zap_attribute_t za;
boolean_t complete = B_TRUE;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
/* If the spa is not fully loaded, don't bother. */
if (dp->dp_spa->spa_load_state != SPA_LOAD_NONE)
return;
if (dp->dp_scrub_restart) {
enum scrub_func func = dp->dp_scrub_func;
dp->dp_scrub_restart = B_FALSE;
dsl_pool_scrub_setup_sync(dp, &func, kcred, tx);
}
if (dp->dp_spa->spa_root_vdev->vdev_stat.vs_scrub_type == 0) {
/*
* We must have resumed after rebooting; reset the vdev
* stats to know that we're doing a scrub (although it
* will think we're just starting now).
*/
vdev_scrub_stat_update(dp->dp_spa->spa_root_vdev,
dp->dp_scrub_min_txg ? POOL_SCRUB_RESILVER :
POOL_SCRUB_EVERYTHING, B_FALSE);
}
dp->dp_scrub_pausing = B_FALSE;
dp->dp_scrub_start_time = lbolt64;
dp->dp_scrub_isresilver = (dp->dp_scrub_min_txg != 0);
dp->dp_spa->spa_scrub_active = B_TRUE;
if (dp->dp_scrub_bookmark.zb_objset == 0) {
/* First do the MOS & ORIGIN */
scrub_visit_rootbp(dp, NULL, &dp->dp_meta_rootbp);
if (dp->dp_scrub_pausing)
goto out;
if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
VERIFY(0 == dmu_objset_find_spa(dp->dp_spa,
NULL, enqueue_cb, tx, DS_FIND_CHILDREN));
} else {
scrub_visitds(dp, dp->dp_origin_snap->ds_object, tx);
}
ASSERT(!dp->dp_scrub_pausing);
} else if (dp->dp_scrub_bookmark.zb_objset != -1ULL) {
/*
* If we were paused, continue from here. Note if the
* ds we were paused on was deleted, the zb_objset will
* be -1, so we will skip this and find a new objset
* below.
*/
scrub_visitds(dp, dp->dp_scrub_bookmark.zb_objset, tx);
if (dp->dp_scrub_pausing)
goto out;
}
/*
* In case we were paused right at the end of the ds, zero the
* bookmark so we don't think that we're still trying to resume.
*/
bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));
/* keep pulling things out of the zap-object-as-queue */
while (zap_cursor_init(&zc, dp->dp_meta_objset, dp->dp_scrub_queue_obj),
zap_cursor_retrieve(&zc, &za) == 0) {
VERIFY(0 == zap_remove(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, za.za_name, tx));
scrub_visitds(dp, za.za_first_integer, tx);
if (dp->dp_scrub_pausing)
break;
zap_cursor_fini(&zc);
}
zap_cursor_fini(&zc);
if (dp->dp_scrub_pausing)
goto out;
/* done. */
dsl_pool_scrub_cancel_sync(dp, &complete, kcred, tx);
return;
out:
VERIFY(0 == zap_update(dp->dp_meta_objset,
DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4,
&dp->dp_scrub_bookmark, tx));
VERIFY(0 == zap_update(dp->dp_meta_objset,
DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_ERRORS, sizeof (uint64_t), 1,
&dp->dp_spa->spa_scrub_errors, tx));
/* XXX this is scrub-clean specific */
mutex_enter(&dp->dp_spa->spa_scrub_lock);
while (dp->dp_spa->spa_scrub_inflight > 0) {
cv_wait(&dp->dp_spa->spa_scrub_io_cv,
&dp->dp_spa->spa_scrub_lock);
}
mutex_exit(&dp->dp_spa->spa_scrub_lock);
}
void
dsl_pool_scrub_restart(dsl_pool_t *dp)
{
mutex_enter(&dp->dp_scrub_cancel_lock);
dp->dp_scrub_restart = B_TRUE;
mutex_exit(&dp->dp_scrub_cancel_lock);
}
/*
* scrub consumers
*/
static void
count_block(zfs_all_blkstats_t *zab, const blkptr_t *bp)
{
int i;
/*
* If we resume after a reboot, zab will be NULL; don't record
* incomplete stats in that case.
*/
if (zab == NULL)
return;
for (i = 0; i < 4; i++) {
int l = (i < 2) ? BP_GET_LEVEL(bp) : DN_MAX_LEVELS;
int t = (i & 1) ? BP_GET_TYPE(bp) : DMU_OT_TOTAL;
zfs_blkstat_t *zb = &zab->zab_type[l][t];
int equal;
zb->zb_count++;
zb->zb_asize += BP_GET_ASIZE(bp);
zb->zb_lsize += BP_GET_LSIZE(bp);
zb->zb_psize += BP_GET_PSIZE(bp);
zb->zb_gangs += BP_COUNT_GANG(bp);
switch (BP_GET_NDVAS(bp)) {
case 2:
if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
DVA_GET_VDEV(&bp->blk_dva[1]))
zb->zb_ditto_2_of_2_samevdev++;
break;
case 3:
equal = (DVA_GET_VDEV(&bp->blk_dva[0]) ==
DVA_GET_VDEV(&bp->blk_dva[1])) +
(DVA_GET_VDEV(&bp->blk_dva[0]) ==
DVA_GET_VDEV(&bp->blk_dva[2])) +
(DVA_GET_VDEV(&bp->blk_dva[1]) ==
DVA_GET_VDEV(&bp->blk_dva[2]));
if (equal == 1)
zb->zb_ditto_2_of_3_samevdev++;
else if (equal == 3)
zb->zb_ditto_3_of_3_samevdev++;
break;
}
}
}
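Note on the three-DVA case in count_block() above: `equal` sums three pairwise comparisons, and the code handles only the values 1 and 3. A sum of 2 cannot occur, because two equal pairs among three values force the third pair to be equal as well. The sketch below shows the counting step in isolation; pairwise_equal3 is a hypothetical name, and plain integers stand in for the DVA_GET_VDEV() results.

```c
#include <stdint.h>

/*
 * Illustrative only: number of equal pairs among three vdev ids.
 * Possible results are 0 (all distinct), 1 (exactly two share a
 * vdev), and 3 (all three share a vdev).
 */
int
pairwise_equal3(uint64_t v0, uint64_t v1, uint64_t v2)
{
	return ((v0 == v1) + (v0 == v2) + (v1 == v2));
}
```

A result of 1 corresponds to zb_ditto_2_of_3_samevdev and a result of 3 to zb_ditto_3_of_3_samevdev in the code above.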
static void
dsl_pool_scrub_clean_done(zio_t *zio)
{
spa_t *spa = zio->io_spa;
zio_data_buf_free(zio->io_data, zio->io_size);
mutex_enter(&spa->spa_scrub_lock);
spa->spa_scrub_inflight--;
cv_broadcast(&spa->spa_scrub_io_cv);
if (zio->io_error && (zio->io_error != ECKSUM ||
!(zio->io_flags & ZIO_FLAG_SPECULATIVE)))
spa->spa_scrub_errors++;
mutex_exit(&spa->spa_scrub_lock);
}
static int
dsl_pool_scrub_clean_cb(dsl_pool_t *dp,
const blkptr_t *bp, const zbookmark_t *zb)
{
size_t size = BP_GET_LSIZE(bp);
int d;
spa_t *spa = dp->dp_spa;
boolean_t needs_io;
int zio_flags = ZIO_FLAG_SCRUB_THREAD | ZIO_FLAG_CANFAIL;
int zio_priority;
count_block(dp->dp_blkstats, bp);
if (dp->dp_scrub_isresilver == 0) {
/* It's a scrub */
zio_flags |= ZIO_FLAG_SCRUB;
zio_priority = ZIO_PRIORITY_SCRUB;
needs_io = B_TRUE;
} else {
/* It's a resilver */
zio_flags |= ZIO_FLAG_RESILVER;
zio_priority = ZIO_PRIORITY_RESILVER;
needs_io = B_FALSE;
}
/* If it's an intent log block, failure is expected. */
if (zb->zb_level == -1 && BP_GET_TYPE(bp) != DMU_OT_OBJSET)
zio_flags |= ZIO_FLAG_SPECULATIVE;
for (d = 0; d < BP_GET_NDVAS(bp); d++) {
vdev_t *vd = vdev_lookup_top(spa,
DVA_GET_VDEV(&bp->blk_dva[d]));
/*
* Keep track of how much data we've examined so that
* zpool(1M) status can make useful progress reports.
*/
mutex_enter(&vd->vdev_stat_lock);
vd->vdev_stat.vs_scrub_examined +=
DVA_GET_ASIZE(&bp->blk_dva[d]);
mutex_exit(&vd->vdev_stat_lock);
/* if it's a resilver, this may not be in the target range */
if (!needs_io) {
if (DVA_GET_GANG(&bp->blk_dva[d])) {
/*
* Gang members may be spread across multiple
* vdevs, so the best we can do is look at the
* pool-wide DTL.
* XXX -- it would be better to change our
* allocation policy to ensure that this can't
* happen.
*/
vd = spa->spa_root_vdev;
}
needs_io = vdev_dtl_contains(&vd->vdev_dtl_map,
bp->blk_birth, 1);
}
}
if (needs_io && !zfs_no_scrub_io) {
void *data = zio_data_buf_alloc(size);
mutex_enter(&spa->spa_scrub_lock);
while (spa->spa_scrub_inflight >= spa->spa_scrub_maxinflight)
cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
spa->spa_scrub_inflight++;
mutex_exit(&spa->spa_scrub_lock);
zio_nowait(zio_read(NULL, spa, bp, data, size,
dsl_pool_scrub_clean_done, NULL, zio_priority,
zio_flags, zb));
}
/* do not relocate this block */
return (0);
}
int
dsl_pool_scrub_clean(dsl_pool_t *dp)
{
+ spa_t *spa = dp->dp_spa;
+
/*
* Purge all vdev caches. We do this here rather than in sync
* context because this requires a writer lock on the spa_config
* lock, which we can't do from sync context. The
* spa_scrub_reopen flag indicates that vdev_open() should not
* attempt to start another scrub.
*/
- spa_config_enter(dp->dp_spa, SCL_ALL, FTAG, RW_WRITER);
- dp->dp_spa->spa_scrub_reopen = B_TRUE;
- vdev_reopen(dp->dp_spa->spa_root_vdev);
- dp->dp_spa->spa_scrub_reopen = B_FALSE;
- spa_config_exit(dp->dp_spa, SCL_ALL, FTAG);
+ spa_vdev_state_enter(spa);
+ spa->spa_scrub_reopen = B_TRUE;
+ vdev_reopen(spa->spa_root_vdev);
+ spa->spa_scrub_reopen = B_FALSE;
+ (void) spa_vdev_state_exit(spa, NULL, 0);
return (dsl_pool_scrub_setup(dp, SCRUB_FUNC_CLEAN));
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c (revision 209274)
@@ -1,1209 +1,1209 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
- * Copyright 2008 Sun Microsystems, Inc. All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/vdev_impl.h>
#include <sys/zio.h>
#include <sys/zio_checksum.h>
#include <sys/fs/zfs.h>
#include <sys/fm/fs/zfs.h>
/*
* Virtual device vector for RAID-Z.
*
* This vdev supports both single and double parity. For single parity, we
* use a simple XOR of all the data columns. For double parity, we use both
* the simple XOR as well as a technique described in "The mathematics of
* RAID-6" by H. Peter Anvin. This technique defines a Galois field, GF(2^8),
* over the integers expressible in a single byte. Briefly, the operations on
* the field are defined as follows:
*
* o addition (+) is represented by a bitwise XOR
* o subtraction (-) is therefore identical to addition: A + B = A - B
* o multiplication of A by 2 is defined by the following bitwise expression:
* (A * 2)_7 = A_6
* (A * 2)_6 = A_5
* (A * 2)_5 = A_4
* (A * 2)_4 = A_3 + A_7
* (A * 2)_3 = A_2 + A_7
* (A * 2)_2 = A_1 + A_7
* (A * 2)_1 = A_0
* (A * 2)_0 = A_7
*
* In C, multiplying by 2 is therefore ((a << 1) ^ ((a & 0x80) ? 0x1d : 0)).
*
* Observe that any number in the field (except for 0) can be expressed as a
* power of 2 -- a generator for the field. We store a table of the powers of
* 2 and logs base 2 for quick look ups, and exploit the fact that A * B can
* be rewritten as 2^(log_2(A) + log_2(B)) (where '+' is normal addition rather
* than field addition). The inverse of a field element A (A^-1) is A^254.
*
* The two parity columns, P and Q, over several data columns, D_0, ... D_n-1,
* can be expressed by field operations:
*
* P = D_0 + D_1 + ... + D_n-2 + D_n-1
* Q = 2^n-1 * D_0 + 2^n-2 * D_1 + ... + 2^1 * D_n-2 + 2^0 * D_n-1
* = ((...((D_0) * 2 + D_1) * 2 + ...) * 2 + D_n-2) * 2 + D_n-1
*
* See the reconstruction code below for how P and Q can be used individually or
* in concert to recover missing data columns.
*/
typedef struct raidz_col {
uint64_t rc_devidx; /* child device index for I/O */
uint64_t rc_offset; /* device offset */
uint64_t rc_size; /* I/O size */
void *rc_data; /* I/O data */
int rc_error; /* I/O error for this device */
uint8_t rc_tried; /* Did we attempt this I/O column? */
uint8_t rc_skipped; /* Did we skip this I/O column? */
} raidz_col_t;
typedef struct raidz_map {
uint64_t rm_cols; /* Column count */
uint64_t rm_bigcols; /* Number of oversized columns */
uint64_t rm_asize; /* Actual total I/O size */
uint64_t rm_missingdata; /* Count of missing data devices */
uint64_t rm_missingparity; /* Count of missing parity devices */
uint64_t rm_firstdatacol; /* First data column/parity count */
raidz_col_t rm_col[1]; /* Flexible array of I/O columns */
} raidz_map_t;
#define VDEV_RAIDZ_P 0
#define VDEV_RAIDZ_Q 1
#define VDEV_RAIDZ_MAXPARITY 2
#define VDEV_RAIDZ_MUL_2(a) (((a) << 1) ^ (((a) & 0x80) ? 0x1d : 0))
/*
* These two tables represent powers and logs of 2 in the Galois field defined
* above. These values were computed by repeatedly multiplying by 2 as above.
*/
static const uint8_t vdev_raidz_pow2[256] = {
0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13, 0x26,
0x4c, 0x98, 0x2d, 0x5a, 0xb4, 0x75, 0xea, 0xc9,
0x8f, 0x03, 0x06, 0x0c, 0x18, 0x30, 0x60, 0xc0,
0x9d, 0x27, 0x4e, 0x9c, 0x25, 0x4a, 0x94, 0x35,
0x6a, 0xd4, 0xb5, 0x77, 0xee, 0xc1, 0x9f, 0x23,
0x46, 0x8c, 0x05, 0x0a, 0x14, 0x28, 0x50, 0xa0,
0x5d, 0xba, 0x69, 0xd2, 0xb9, 0x6f, 0xde, 0xa1,
0x5f, 0xbe, 0x61, 0xc2, 0x99, 0x2f, 0x5e, 0xbc,
0x65, 0xca, 0x89, 0x0f, 0x1e, 0x3c, 0x78, 0xf0,
0xfd, 0xe7, 0xd3, 0xbb, 0x6b, 0xd6, 0xb1, 0x7f,
0xfe, 0xe1, 0xdf, 0xa3, 0x5b, 0xb6, 0x71, 0xe2,
0xd9, 0xaf, 0x43, 0x86, 0x11, 0x22, 0x44, 0x88,
0x0d, 0x1a, 0x34, 0x68, 0xd0, 0xbd, 0x67, 0xce,
0x81, 0x1f, 0x3e, 0x7c, 0xf8, 0xed, 0xc7, 0x93,
0x3b, 0x76, 0xec, 0xc5, 0x97, 0x33, 0x66, 0xcc,
0x85, 0x17, 0x2e, 0x5c, 0xb8, 0x6d, 0xda, 0xa9,
0x4f, 0x9e, 0x21, 0x42, 0x84, 0x15, 0x2a, 0x54,
0xa8, 0x4d, 0x9a, 0x29, 0x52, 0xa4, 0x55, 0xaa,
0x49, 0x92, 0x39, 0x72, 0xe4, 0xd5, 0xb7, 0x73,
0xe6, 0xd1, 0xbf, 0x63, 0xc6, 0x91, 0x3f, 0x7e,
0xfc, 0xe5, 0xd7, 0xb3, 0x7b, 0xf6, 0xf1, 0xff,
0xe3, 0xdb, 0xab, 0x4b, 0x96, 0x31, 0x62, 0xc4,
0x95, 0x37, 0x6e, 0xdc, 0xa5, 0x57, 0xae, 0x41,
0x82, 0x19, 0x32, 0x64, 0xc8, 0x8d, 0x07, 0x0e,
0x1c, 0x38, 0x70, 0xe0, 0xdd, 0xa7, 0x53, 0xa6,
0x51, 0xa2, 0x59, 0xb2, 0x79, 0xf2, 0xf9, 0xef,
0xc3, 0x9b, 0x2b, 0x56, 0xac, 0x45, 0x8a, 0x09,
0x12, 0x24, 0x48, 0x90, 0x3d, 0x7a, 0xf4, 0xf5,
0xf7, 0xf3, 0xfb, 0xeb, 0xcb, 0x8b, 0x0b, 0x16,
0x2c, 0x58, 0xb0, 0x7d, 0xfa, 0xe9, 0xcf, 0x83,
0x1b, 0x36, 0x6c, 0xd8, 0xad, 0x47, 0x8e, 0x01
};
static const uint8_t vdev_raidz_log2[256] = {
0x00, 0x00, 0x01, 0x19, 0x02, 0x32, 0x1a, 0xc6,
0x03, 0xdf, 0x33, 0xee, 0x1b, 0x68, 0xc7, 0x4b,
0x04, 0x64, 0xe0, 0x0e, 0x34, 0x8d, 0xef, 0x81,
0x1c, 0xc1, 0x69, 0xf8, 0xc8, 0x08, 0x4c, 0x71,
0x05, 0x8a, 0x65, 0x2f, 0xe1, 0x24, 0x0f, 0x21,
0x35, 0x93, 0x8e, 0xda, 0xf0, 0x12, 0x82, 0x45,
0x1d, 0xb5, 0xc2, 0x7d, 0x6a, 0x27, 0xf9, 0xb9,
0xc9, 0x9a, 0x09, 0x78, 0x4d, 0xe4, 0x72, 0xa6,
0x06, 0xbf, 0x8b, 0x62, 0x66, 0xdd, 0x30, 0xfd,
0xe2, 0x98, 0x25, 0xb3, 0x10, 0x91, 0x22, 0x88,
0x36, 0xd0, 0x94, 0xce, 0x8f, 0x96, 0xdb, 0xbd,
0xf1, 0xd2, 0x13, 0x5c, 0x83, 0x38, 0x46, 0x40,
0x1e, 0x42, 0xb6, 0xa3, 0xc3, 0x48, 0x7e, 0x6e,
0x6b, 0x3a, 0x28, 0x54, 0xfa, 0x85, 0xba, 0x3d,
0xca, 0x5e, 0x9b, 0x9f, 0x0a, 0x15, 0x79, 0x2b,
0x4e, 0xd4, 0xe5, 0xac, 0x73, 0xf3, 0xa7, 0x57,
0x07, 0x70, 0xc0, 0xf7, 0x8c, 0x80, 0x63, 0x0d,
0x67, 0x4a, 0xde, 0xed, 0x31, 0xc5, 0xfe, 0x18,
0xe3, 0xa5, 0x99, 0x77, 0x26, 0xb8, 0xb4, 0x7c,
0x11, 0x44, 0x92, 0xd9, 0x23, 0x20, 0x89, 0x2e,
0x37, 0x3f, 0xd1, 0x5b, 0x95, 0xbc, 0xcf, 0xcd,
0x90, 0x87, 0x97, 0xb2, 0xdc, 0xfc, 0xbe, 0x61,
0xf2, 0x56, 0xd3, 0xab, 0x14, 0x2a, 0x5d, 0x9e,
0x84, 0x3c, 0x39, 0x53, 0x47, 0x6d, 0x41, 0xa2,
0x1f, 0x2d, 0x43, 0xd8, 0xb7, 0x7b, 0xa4, 0x76,
0xc4, 0x17, 0x49, 0xec, 0x7f, 0x0c, 0x6f, 0xf6,
0x6c, 0xa1, 0x3b, 0x52, 0x29, 0x9d, 0x55, 0xaa,
0xfb, 0x60, 0x86, 0xb1, 0xbb, 0xcc, 0x3e, 0x5a,
0xcb, 0x59, 0x5f, 0xb0, 0x9c, 0xa9, 0xa0, 0x51,
0x0b, 0xf5, 0x16, 0xeb, 0x7a, 0x75, 0x2c, 0xd7,
0x4f, 0xae, 0xd5, 0xe9, 0xe6, 0xe7, 0xad, 0xe8,
0x74, 0xd6, 0xf4, 0xea, 0xa8, 0x50, 0x58, 0xaf,
};
/*
* Multiply a given number by 2 raised to the given power.
*/
static uint8_t
vdev_raidz_exp2(uint_t a, int exp)
{
if (a == 0)
return (0);
ASSERT(exp >= 0);
ASSERT(vdev_raidz_log2[a] > 0 || a == 1);
exp += vdev_raidz_log2[a];
if (exp > 255)
exp -= 255;
return (vdev_raidz_pow2[exp]);
}
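The table-driven arithmetic described in the header comment can be sketched outside the kernel as follows. This is a minimal illustration, not the code from this revision: the names gf_mul2, gf_build_tables, gf_mul and gf_inv are hypothetical, and the tables are rebuilt at run time by repeated multiplication by 2 rather than stored as constants.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for vdev_raidz_pow2[] and vdev_raidz_log2[]. */
static uint8_t gf_pow2[256];
static uint8_t gf_log2[256];

/* Multiply by 2 in GF(2^8), same expression as VDEV_RAIDZ_MUL_2. */
static uint8_t
gf_mul2(uint8_t a)
{
	return ((uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0)));
}

/* Fill both tables by repeatedly multiplying by 2, starting from 1. */
static void
gf_build_tables(void)
{
	uint8_t v = 1;

	for (int i = 0; i < 255; i++) {
		gf_pow2[i] = v;
		gf_log2[v] = (uint8_t)i;
		v = gf_mul2(v);
	}
	gf_pow2[255] = 1;	/* 2^255 == 2^0 */
	gf_log2[0] = 0;		/* log of 0 is undefined; placeholder */
}

/* A * B = 2^(log_2(A) + log_2(B)), with ordinary addition modulo 255. */
static uint8_t
gf_mul(uint8_t a, uint8_t b)
{
	if (a == 0 || b == 0)
		return (0);
	return (gf_pow2[(gf_log2[a] + gf_log2[b]) % 255]);
}

/* A^-1 = A^254 = 2^(255 - log_2(A)) for nonzero A. */
static uint8_t
gf_inv(uint8_t a)
{
	assert(a != 0);
	return (gf_pow2[255 - gf_log2[a]]);
}
```

After gf_build_tables() runs, gf_pow2[8] is 0x1d and gf_log2[3] is 0x19, which agrees with the constant tables above, and gf_mul(a, gf_inv(a)) is 1 for every nonzero a.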
static void
vdev_raidz_map_free(zio_t *zio)
{
raidz_map_t *rm = zio->io_vsd;
int c;
for (c = 0; c < rm->rm_firstdatacol; c++)
zio_buf_free(rm->rm_col[c].rc_data, rm->rm_col[c].rc_size);
kmem_free(rm, offsetof(raidz_map_t, rm_col[rm->rm_cols]));
}
static raidz_map_t *
vdev_raidz_map_alloc(zio_t *zio, uint64_t unit_shift, uint64_t dcols,
uint64_t nparity)
{
raidz_map_t *rm;
uint64_t b = zio->io_offset >> unit_shift;
uint64_t s = zio->io_size >> unit_shift;
uint64_t f = b % dcols;
uint64_t o = (b / dcols) << unit_shift;
uint64_t q, r, c, bc, col, acols, coff, devidx;
q = s / (dcols - nparity);
r = s - q * (dcols - nparity);
bc = (r == 0 ? 0 : r + nparity);
acols = (q == 0 ? bc : dcols);
rm = kmem_alloc(offsetof(raidz_map_t, rm_col[acols]), KM_SLEEP);
rm->rm_cols = acols;
rm->rm_bigcols = bc;
rm->rm_asize = 0;
rm->rm_missingdata = 0;
rm->rm_missingparity = 0;
rm->rm_firstdatacol = nparity;
for (c = 0; c < acols; c++) {
col = f + c;
coff = o;
if (col >= dcols) {
col -= dcols;
coff += 1ULL << unit_shift;
}
rm->rm_col[c].rc_devidx = col;
rm->rm_col[c].rc_offset = coff;
rm->rm_col[c].rc_size = (q + (c < bc)) << unit_shift;
rm->rm_col[c].rc_data = NULL;
rm->rm_col[c].rc_error = 0;
rm->rm_col[c].rc_tried = 0;
rm->rm_col[c].rc_skipped = 0;
rm->rm_asize += rm->rm_col[c].rc_size;
}
rm->rm_asize = roundup(rm->rm_asize, (nparity + 1) << unit_shift);
for (c = 0; c < rm->rm_firstdatacol; c++)
rm->rm_col[c].rc_data = zio_buf_alloc(rm->rm_col[c].rc_size);
rm->rm_col[c].rc_data = zio->io_data;
for (c = c + 1; c < acols; c++)
rm->rm_col[c].rc_data = (char *)rm->rm_col[c - 1].rc_data +
rm->rm_col[c - 1].rc_size;
/*
* If all data stored spans all columns, there's a danger that parity
* will always be on the same device and, since parity isn't read
* during normal operation, that that device's I/O bandwidth won't be
* used effectively. We therefore switch the parity every 1MB.
*
* ... at least that was, ostensibly, the theory. As a practical
* matter unless we juggle the parity between all devices evenly, we
* won't see any benefit. Further, occasional writes that aren't a
* multiple of the LCM of the number of children and the minimum
* stripe width are sufficient to avoid pessimal behavior.
* Unfortunately, this decision created an implicit on-disk format
* requirement that we need to support for all eternity, but only
* for single-parity RAID-Z.
*/
ASSERT(rm->rm_cols >= 2);
ASSERT(rm->rm_col[0].rc_size == rm->rm_col[1].rc_size);
if (rm->rm_firstdatacol == 1 && (zio->io_offset & (1ULL << 20))) {
devidx = rm->rm_col[0].rc_devidx;
o = rm->rm_col[0].rc_offset;
rm->rm_col[0].rc_devidx = rm->rm_col[1].rc_devidx;
rm->rm_col[0].rc_offset = rm->rm_col[1].rc_offset;
rm->rm_col[1].rc_devidx = devidx;
rm->rm_col[1].rc_offset = o;
}
zio->io_vsd = rm;
zio->io_vsd_free = vdev_raidz_map_free;
return (rm);
}
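The column geometry computed at the top of vdev_raidz_map_alloc() can be checked in isolation. The sketch below repeats only the q, r, bc and acols arithmetic and the per-column size expression; the type geom_t and the functions raidz_geom and raidz_col_size are hypothetical names introduced for illustration and do not exist in the source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the values map_alloc derives. */
typedef struct geom {
	uint64_t q;	/* full rows of data sectors */
	uint64_t r;	/* data sectors left over in the last row */
	uint64_t bc;	/* "big" columns that carry one extra sector */
	uint64_t acols;	/* columns actually used by this I/O */
} geom_t;

static geom_t
raidz_geom(uint64_t io_size, uint64_t unit_shift, uint64_t dcols,
    uint64_t nparity)
{
	geom_t g;
	uint64_t s = io_size >> unit_shift;	/* size in sectors */

	g.q = s / (dcols - nparity);
	g.r = s - g.q * (dcols - nparity);
	g.bc = (g.r == 0 ? 0 : g.r + nparity);
	g.acols = (g.q == 0 ? g.bc : dcols);
	return (g);
}

/* Size in bytes of column c, as assigned to rc_size. */
static uint64_t
raidz_col_size(const geom_t *g, uint64_t c, uint64_t unit_shift)
{
	return ((g->q + (c < g->bc)) << unit_shift);
}
```

For a 3584-byte I/O (7 sectors of 512 bytes) on a 5-wide single-parity vdev this gives q = 1, r = 3, bc = 4 and acols = 5: columns 0 through 3 are 1024 bytes and column 4 is 512 bytes, for 9 sectors in total (7 data plus 2 parity).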
static void
vdev_raidz_generate_parity_p(raidz_map_t *rm)
{
uint64_t *p, *src, pcount, ccount, i;
int c;
pcount = rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]);
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
p = rm->rm_col[VDEV_RAIDZ_P].rc_data;
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
if (c == rm->rm_firstdatacol) {
ASSERT(ccount == pcount);
for (i = 0; i < ccount; i++, p++, src++) {
*p = *src;
}
} else {
ASSERT(ccount <= pcount);
for (i = 0; i < ccount; i++, p++, src++) {
*p ^= *src;
}
}
}
}
static void
vdev_raidz_generate_parity_pq(raidz_map_t *rm)
{
uint64_t *q, *p, *src, pcount, ccount, mask, i;
int c;
pcount = rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]);
ASSERT(rm->rm_col[VDEV_RAIDZ_P].rc_size ==
rm->rm_col[VDEV_RAIDZ_Q].rc_size);
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
p = rm->rm_col[VDEV_RAIDZ_P].rc_data;
q = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
if (c == rm->rm_firstdatacol) {
ASSERT(ccount == pcount || ccount == 0);
for (i = 0; i < ccount; i++, p++, q++, src++) {
*q = *src;
*p = *src;
}
for (; i < pcount; i++, p++, q++, src++) {
*q = 0;
*p = 0;
}
} else {
ASSERT(ccount <= pcount);
/*
* Rather than multiplying each byte individually (as
* described above), we are able to handle 8 at once
* by generating a mask based on the high bit in each
* byte and using that to conditionally XOR in 0x1d.
*/
for (i = 0; i < ccount; i++, p++, q++, src++) {
mask = *q & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*q = ((*q << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
*q ^= *src;
*p ^= *src;
}
/*
* Treat short columns as though they are full of 0s.
*/
for (; i < pcount; i++, q++) {
mask = *q & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*q = ((*q << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
}
}
}
}
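The eight-bytes-at-once multiplication used in the loops above can be compared against the per-byte definition. This is a standalone sketch with hypothetical function names (mul2_byte, mul2_word, mul2_word_matches_bytes); it uses the same mask expressions as the parity code but is not taken from it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Per-byte multiply by 2, same expression as VDEV_RAIDZ_MUL_2. */
static uint8_t
mul2_byte(uint8_t a)
{
	return ((uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0)));
}

/* Multiply each of the eight bytes packed in w by 2. */
static uint64_t
mul2_word(uint64_t w)
{
	/* Isolate the high bit of every byte. */
	uint64_t mask = w & 0x8080808080808080ULL;

	/* Turn each 0x80 into 0xff: (0x100 - 0x01) within that byte. */
	mask = (mask << 1) - (mask >> 7);

	/* Shift every byte left, dropping bits that crossed a byte edge. */
	return (((w << 1) & 0xfefefefefefefefeULL) ^
	    (mask & 0x1d1d1d1d1d1d1d1dULL));
}

/* Return 1 if the word-wide result equals eight per-byte results. */
static int
mul2_word_matches_bytes(uint64_t w)
{
	uint8_t in[8], out[8];
	uint64_t r = mul2_word(w);

	memcpy(in, &w, sizeof (in));
	memcpy(out, &r, sizeof (out));
	for (int i = 0; i < 8; i++) {
		if (out[i] != mul2_byte(in[i]))
			return (0);
	}
	return (1);
}
```

The comparison does not depend on byte order because the input and the result are unpacked with the same layout.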
static void
vdev_raidz_reconstruct_p(raidz_map_t *rm, int x)
{
uint64_t *dst, *src, xcount, ccount, count, i;
int c;
xcount = rm->rm_col[x].rc_size / sizeof (src[0]);
ASSERT(xcount <= rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]));
ASSERT(xcount > 0);
src = rm->rm_col[VDEV_RAIDZ_P].rc_data;
dst = rm->rm_col[x].rc_data;
for (i = 0; i < xcount; i++, dst++, src++) {
*dst = *src;
}
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
dst = rm->rm_col[x].rc_data;
if (c == x)
continue;
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
count = MIN(ccount, xcount);
for (i = 0; i < count; i++, dst++, src++) {
*dst ^= *src;
}
}
}
static void
vdev_raidz_reconstruct_q(raidz_map_t *rm, int x)
{
uint64_t *dst, *src, xcount, ccount, count, mask, i;
uint8_t *b;
int c, j, exp;
xcount = rm->rm_col[x].rc_size / sizeof (src[0]);
ASSERT(xcount <= rm->rm_col[VDEV_RAIDZ_Q].rc_size / sizeof (src[0]));
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
dst = rm->rm_col[x].rc_data;
if (c == x)
ccount = 0;
else
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
count = MIN(ccount, xcount);
if (c == rm->rm_firstdatacol) {
for (i = 0; i < count; i++, dst++, src++) {
*dst = *src;
}
for (; i < xcount; i++, dst++) {
*dst = 0;
}
} else {
/*
* For an explanation of this, see the comment in
* vdev_raidz_generate_parity_pq() above.
*/
for (i = 0; i < count; i++, dst++, src++) {
mask = *dst & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*dst = ((*dst << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
*dst ^= *src;
}
for (; i < xcount; i++, dst++) {
mask = *dst & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*dst = ((*dst << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
}
}
}
src = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
dst = rm->rm_col[x].rc_data;
exp = 255 - (rm->rm_cols - 1 - x);
for (i = 0; i < xcount; i++, dst++, src++) {
*dst ^= *src;
for (j = 0, b = (uint8_t *)dst; j < 8; j++, b++) {
*b = vdev_raidz_exp2(*b, exp);
}
}
}
static void
vdev_raidz_reconstruct_pq(raidz_map_t *rm, int x, int y)
{
uint8_t *p, *q, *pxy, *qxy, *xd, *yd, tmp, a, b, aexp, bexp;
void *pdata, *qdata;
uint64_t xsize, ysize, i;
ASSERT(x < y);
ASSERT(x >= rm->rm_firstdatacol);
ASSERT(y < rm->rm_cols);
ASSERT(rm->rm_col[x].rc_size >= rm->rm_col[y].rc_size);
/*
* Move the parity data aside -- we're going to compute parity as
* though columns x and y were full of zeros -- Pxy and Qxy. We want to
* reuse the parity generation mechanism without trashing the actual
* parity so we make those columns appear to be full of zeros by
* setting their lengths to zero.
*/
pdata = rm->rm_col[VDEV_RAIDZ_P].rc_data;
qdata = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
xsize = rm->rm_col[x].rc_size;
ysize = rm->rm_col[y].rc_size;
rm->rm_col[VDEV_RAIDZ_P].rc_data =
zio_buf_alloc(rm->rm_col[VDEV_RAIDZ_P].rc_size);
rm->rm_col[VDEV_RAIDZ_Q].rc_data =
zio_buf_alloc(rm->rm_col[VDEV_RAIDZ_Q].rc_size);
rm->rm_col[x].rc_size = 0;
rm->rm_col[y].rc_size = 0;
vdev_raidz_generate_parity_pq(rm);
rm->rm_col[x].rc_size = xsize;
rm->rm_col[y].rc_size = ysize;
p = pdata;
q = qdata;
pxy = rm->rm_col[VDEV_RAIDZ_P].rc_data;
qxy = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
xd = rm->rm_col[x].rc_data;
yd = rm->rm_col[y].rc_data;
/*
* We now have:
* Pxy = P + D_x + D_y
* Qxy = Q + 2^(ndevs - 1 - x) * D_x + 2^(ndevs - 1 - y) * D_y
*
* We can then solve for D_x:
* D_x = A * (P + Pxy) + B * (Q + Qxy)
* where
* A = 2^(x - y) * (2^(x - y) + 1)^-1
* B = 2^-(ndevs - 1 - x) * (2^(x - y) + 1)^-1
*
* With D_x in hand, we can easily solve for D_y:
* D_y = P + Pxy + D_x
*/
a = vdev_raidz_pow2[255 + x - y];
b = vdev_raidz_pow2[255 - (rm->rm_cols - 1 - x)];
tmp = 255 - vdev_raidz_log2[a ^ 1];
aexp = vdev_raidz_log2[vdev_raidz_exp2(a, tmp)];
bexp = vdev_raidz_log2[vdev_raidz_exp2(b, tmp)];
for (i = 0; i < xsize; i++, p++, q++, pxy++, qxy++, xd++, yd++) {
*xd = vdev_raidz_exp2(*p ^ *pxy, aexp) ^
vdev_raidz_exp2(*q ^ *qxy, bexp);
if (i < ysize)
*yd = *p ^ *pxy ^ *xd;
}
zio_buf_free(rm->rm_col[VDEV_RAIDZ_P].rc_data,
rm->rm_col[VDEV_RAIDZ_P].rc_size);
zio_buf_free(rm->rm_col[VDEV_RAIDZ_Q].rc_data,
rm->rm_col[VDEV_RAIDZ_Q].rc_size);
/*
* Restore the saved parity data.
*/
rm->rm_col[VDEV_RAIDZ_P].rc_data = pdata;
rm->rm_col[VDEV_RAIDZ_Q].rc_data = qdata;
}
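The algebra behind vdev_raidz_reconstruct_pq() can be exercised on a single byte per column. The sketch below is an illustration under simplifying assumptions, not the kernel routine: it indexes data columns only (0 .. n-1, with no parity columns in the array), assumes n is well below 255, and every name in it (gf_exp, gf_log, gf_init, gf_mul, gf_inv, gf_pow2, pq_generate, pq_reconstruct) is hypothetical. The coefficient B is taken with a negative exponent, 2^-(n - 1 - x), which matches how the code above computes b from vdev_raidz_pow2[255 - (rm_cols - 1 - x)].

```c
#include <assert.h>
#include <stdint.h>

static uint8_t gf_exp[256], gf_log[256];

/* Build power and log tables for generator 2 with the 0x1d reduction. */
static void
gf_init(void)
{
	uint8_t v = 1;

	for (int i = 0; i < 255; i++) {
		gf_exp[i] = v;
		gf_log[v] = (uint8_t)i;
		v = (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
	}
	gf_exp[255] = 1;
}

static uint8_t
gf_mul(uint8_t a, uint8_t b)
{
	if (a == 0 || b == 0)
		return (0);
	return (gf_exp[(gf_log[a] + gf_log[b]) % 255]);
}

/* Inverse of a nonzero element. */
static uint8_t
gf_inv(uint8_t a)
{
	return (gf_exp[(255 - gf_log[a]) % 255]);
}

/* 2^e for any integer e, with the exponent reduced modulo 255. */
static uint8_t
gf_pow2(int e)
{
	return (gf_exp[((e % 255) + 255) % 255]);
}

/* P is the XOR of the data bytes; Q weights d[i] by 2^(n - 1 - i). */
static void
pq_generate(const uint8_t *d, int n, uint8_t *p, uint8_t *q)
{
	*p = 0;
	*q = 0;
	for (int i = 0; i < n; i++) {
		*p ^= d[i];
		*q = gf_mul(*q, 2) ^ d[i];	/* Horner form */
	}
}

/* Recover d[x] and d[y] (x < y) from P, Q and the surviving bytes. */
static void
pq_reconstruct(uint8_t *d, int n, int x, int y, uint8_t p, uint8_t q)
{
	uint8_t pxy, qxy, a, b, dinv;

	assert(0 <= x && x < y && y < n);
	d[x] = 0;			/* treat the lost columns as zeros */
	d[y] = 0;
	pq_generate(d, n, &pxy, &qxy);

	a = gf_pow2(x - y);
	b = gf_pow2(-(n - 1 - x));
	dinv = gf_inv(a ^ 1);		/* (2^(x - y) + 1)^-1 */
	d[x] = gf_mul(gf_mul(a, dinv), p ^ pxy) ^
	    gf_mul(gf_mul(b, dinv), q ^ qxy);
	d[y] = p ^ pxy ^ d[x];
}
```

A caller would run gf_init() once, compute P and Q over intact data with pq_generate(), and later pass those two values to pq_reconstruct() together with the indices of the two lost columns.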
static int
vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)
{
vdev_t *cvd;
uint64_t nparity = vd->vdev_nparity;
int c, error;
int lasterror = 0;
int numerrors = 0;
ASSERT(nparity > 0);
if (nparity > VDEV_RAIDZ_MAXPARITY ||
vd->vdev_children < nparity + 1) {
vd->vdev_stat.vs_aux = VDEV_AUX_BAD_LABEL;
return (EINVAL);
}
for (c = 0; c < vd->vdev_children; c++) {
cvd = vd->vdev_child[c];
if ((error = vdev_open(cvd)) != 0) {
lasterror = error;
numerrors++;
continue;
}
*asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
*ashift = MAX(*ashift, cvd->vdev_ashift);
}
*asize *= vd->vdev_children;
if (numerrors > nparity) {
vd->vdev_stat.vs_aux = VDEV_AUX_NO_REPLICAS;
return (lasterror);
}
return (0);
}
static void
vdev_raidz_close(vdev_t *vd)
{
int c;
for (c = 0; c < vd->vdev_children; c++)
vdev_close(vd->vdev_child[c]);
}
static uint64_t
vdev_raidz_asize(vdev_t *vd, uint64_t psize)
{
uint64_t asize;
uint64_t ashift = vd->vdev_top->vdev_ashift;
uint64_t cols = vd->vdev_children;
uint64_t nparity = vd->vdev_nparity;
asize = ((psize - 1) >> ashift) + 1;
asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
asize = roundup(asize, nparity + 1) << ashift;
return (asize);
}
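The allocation arithmetic in vdev_raidz_asize() can be worked through with plain integers. The function below repeats the same three steps without a vdev_t; raidz_asize_sketch is a hypothetical name, and the rounding is written out instead of using the roundup() macro.

```c
#include <assert.h>
#include <stdint.h>

/*
 * psize:   physical size of the block in bytes (must be nonzero)
 * ashift:  log2 of the sector size
 * cols:    number of children in the RAID-Z vdev
 * nparity: number of parity columns (1 or 2)
 */
static uint64_t
raidz_asize_sketch(uint64_t psize, uint64_t ashift, uint64_t cols,
    uint64_t nparity)
{
	uint64_t m = nparity + 1;
	uint64_t asize;

	assert(psize > 0 && cols > nparity);

	/* Data sectors, rounded up. */
	asize = ((psize - 1) >> ashift) + 1;

	/* One set of parity sectors per row of (cols - nparity) sectors. */
	asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));

	/* Round up to a multiple of (nparity + 1) sectors. */
	asize = ((asize + m - 1) / m) * m;

	return (asize << ashift);
}
```

With 512-byte sectors on a 5-wide single-parity vdev, a 128 KB block needs 256 data sectors plus 64 parity sectors, giving 163840 bytes. A 1024-byte block needs 2 data sectors plus 1 parity sector, and the final rounding pads that to 4 sectors, or 2048 bytes.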
static void
vdev_raidz_child_done(zio_t *zio)
{
raidz_col_t *rc = zio->io_private;
rc->rc_error = zio->io_error;
rc->rc_tried = 1;
rc->rc_skipped = 0;
}
static int
vdev_raidz_io_start(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
vdev_t *tvd = vd->vdev_top;
vdev_t *cvd;
blkptr_t *bp = zio->io_bp;
raidz_map_t *rm;
raidz_col_t *rc;
int c;
rm = vdev_raidz_map_alloc(zio, tvd->vdev_ashift, vd->vdev_children,
vd->vdev_nparity);
ASSERT3U(rm->rm_asize, ==, vdev_psize_to_asize(vd, zio->io_size));
if (zio->io_type == ZIO_TYPE_WRITE) {
/*
* Generate RAID parity in the first virtual columns.
*/
if (rm->rm_firstdatacol == 1)
vdev_raidz_generate_parity_p(rm);
else
vdev_raidz_generate_parity_pq(rm);
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
cvd = vd->vdev_child[rc->rc_devidx];
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, rc->rc_data, rc->rc_size,
zio->io_type, zio->io_priority, 0,
vdev_raidz_child_done, rc));
}
return (ZIO_PIPELINE_CONTINUE);
}
ASSERT(zio->io_type == ZIO_TYPE_READ);
/*
* Iterate over the columns in reverse order so that we hit the parity
* last -- any errors along the way will force us to read the parity
* data.
*/
for (c = rm->rm_cols - 1; c >= 0; c--) {
rc = &rm->rm_col[c];
cvd = vd->vdev_child[rc->rc_devidx];
if (!vdev_readable(cvd)) {
if (c >= rm->rm_firstdatacol)
rm->rm_missingdata++;
else
rm->rm_missingparity++;
rc->rc_error = ENXIO;
rc->rc_tried = 1; /* don't even try */
rc->rc_skipped = 1;
continue;
}
if (vdev_dtl_contains(&cvd->vdev_dtl_map, bp->blk_birth, 1)) {
if (c >= rm->rm_firstdatacol)
rm->rm_missingdata++;
else
rm->rm_missingparity++;
rc->rc_error = ESTALE;
rc->rc_skipped = 1;
continue;
}
if (c >= rm->rm_firstdatacol || rm->rm_missingdata > 0 ||
- (zio->io_flags & ZIO_FLAG_SCRUB)) {
+ (zio->io_flags & (ZIO_FLAG_SCRUB | ZIO_FLAG_RESILVER))) {
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, rc->rc_data, rc->rc_size,
zio->io_type, zio->io_priority, 0,
vdev_raidz_child_done, rc));
}
}
return (ZIO_PIPELINE_CONTINUE);
}
/*
* Report a checksum error for a child of a RAID-Z device.
*/
static void
raidz_checksum_error(zio_t *zio, raidz_col_t *rc)
{
vdev_t *vd = zio->io_vd->vdev_child[rc->rc_devidx];
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
mutex_enter(&vd->vdev_stat_lock);
vd->vdev_stat.vs_checksum_errors++;
mutex_exit(&vd->vdev_stat_lock);
}
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE))
zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
zio->io_spa, vd, zio, rc->rc_offset, rc->rc_size);
}
/*
* Generate the parity from the data columns. If we tried and were able to
* read the parity without error, verify that the generated parity matches the
* data we read. If it doesn't, we fire off a checksum error. Return the
* number of such failures.
*/
static int
raidz_parity_verify(zio_t *zio, raidz_map_t *rm)
{
void *orig[VDEV_RAIDZ_MAXPARITY];
int c, ret = 0;
raidz_col_t *rc;
for (c = 0; c < rm->rm_firstdatacol; c++) {
rc = &rm->rm_col[c];
if (!rc->rc_tried || rc->rc_error != 0)
continue;
orig[c] = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig[c], rc->rc_size);
}
if (rm->rm_firstdatacol == 1)
vdev_raidz_generate_parity_p(rm);
else
vdev_raidz_generate_parity_pq(rm);
for (c = 0; c < rm->rm_firstdatacol; c++) {
rc = &rm->rm_col[c];
if (!rc->rc_tried || rc->rc_error != 0)
continue;
if (bcmp(orig[c], rc->rc_data, rc->rc_size) != 0) {
raidz_checksum_error(zio, rc);
rc->rc_error = ECKSUM;
ret++;
}
zio_buf_free(orig[c], rc->rc_size);
}
return (ret);
}
static uint64_t raidz_corrected_p;
static uint64_t raidz_corrected_q;
static uint64_t raidz_corrected_pq;
static int
vdev_raidz_worst_error(raidz_map_t *rm)
{
int error = 0;
for (int c = 0; c < rm->rm_cols; c++)
error = zio_worst_error(error, rm->rm_col[c].rc_error);
return (error);
}
static void
vdev_raidz_io_done(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
vdev_t *cvd;
raidz_map_t *rm = zio->io_vsd;
raidz_col_t *rc, *rc1;
int unexpected_errors = 0;
int parity_errors = 0;
int parity_untried = 0;
int data_errors = 0;
int total_errors = 0;
int n, c, c1;
ASSERT(zio->io_bp != NULL); /* XXX need to add code to enforce this */
ASSERT(rm->rm_missingparity <= rm->rm_firstdatacol);
ASSERT(rm->rm_missingdata <= rm->rm_cols - rm->rm_firstdatacol);
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error) {
ASSERT(rc->rc_error != ECKSUM); /* child has no bp */
if (c < rm->rm_firstdatacol)
parity_errors++;
else
data_errors++;
if (!rc->rc_skipped)
unexpected_errors++;
total_errors++;
} else if (c < rm->rm_firstdatacol && !rc->rc_tried) {
parity_untried++;
}
}
if (zio->io_type == ZIO_TYPE_WRITE) {
/*
* XXX -- for now, treat partial writes as a success.
* (If we couldn't write enough columns to reconstruct
* the data, the I/O failed. Otherwise, good enough.)
*
* Now that we support write reallocation, it would be better
* to treat partial failure as real failure unless there are
* no non-degraded top-level vdevs left, and not update DTLs
* if we intend to reallocate.
*/
/* XXPOLICY */
if (total_errors > rm->rm_firstdatacol)
zio->io_error = vdev_raidz_worst_error(rm);
return;
}
ASSERT(zio->io_type == ZIO_TYPE_READ);
/*
* There are three potential phases for a read:
* 1. produce valid data from the columns read
* 2. read all disks and try again
* 3. perform combinatorial reconstruction
*
* Each phase is progressively both more expensive and less likely to
* occur. If we encounter more errors than we can repair or all phases
* fail, we have no choice but to return an error.
*/
/*
* If the number of errors we saw was correctable -- less than or equal
* to the number of parity disks read -- attempt to produce data that
* has a valid checksum. Naturally, this case applies in the absence of
* any errors.
*/
if (total_errors <= rm->rm_firstdatacol - parity_untried) {
switch (data_errors) {
case 0:
if (zio_checksum_error(zio) == 0) {
/*
* If we read parity information (unnecessarily
* as it happens since no reconstruction was
* needed) regenerate and verify the parity.
* We also regenerate parity when resilvering
* so we can write it out to the failed device
* later.
*/
if (parity_errors + parity_untried <
rm->rm_firstdatacol ||
(zio->io_flags & ZIO_FLAG_RESILVER)) {
n = raidz_parity_verify(zio, rm);
unexpected_errors += n;
ASSERT(parity_errors + n <=
rm->rm_firstdatacol);
}
goto done;
}
break;
case 1:
/*
* We either attempt to read all the parity columns or
* none of them. If we didn't try to read parity, we
* wouldn't be here in the correctable case. There must
* also have been fewer parity errors than parity
* columns or, again, we wouldn't be in this code path.
*/
ASSERT(parity_untried == 0);
ASSERT(parity_errors < rm->rm_firstdatacol);
/*
* Find the column that reported the error.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error != 0)
break;
}
ASSERT(c != rm->rm_cols);
ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
rc->rc_error == ESTALE);
if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0) {
vdev_raidz_reconstruct_p(rm, c);
} else {
ASSERT(rm->rm_firstdatacol > 1);
vdev_raidz_reconstruct_q(rm, c);
}
if (zio_checksum_error(zio) == 0) {
if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0)
atomic_inc_64(&raidz_corrected_p);
else
atomic_inc_64(&raidz_corrected_q);
/*
* If there's more than one parity disk that
* was successfully read, confirm that the
* other parity disk produced the correct data.
* This routine is suboptimal in that it
* regenerates both the parity we wish to test
* as well as the parity we just used to
* perform the reconstruction, but this should
* be a relatively uncommon case, and can be
* optimized if it becomes a problem.
* We also regenerate parity when resilvering
* so we can write it out to the failed device
* later.
*/
if (parity_errors < rm->rm_firstdatacol - 1 ||
(zio->io_flags & ZIO_FLAG_RESILVER)) {
n = raidz_parity_verify(zio, rm);
unexpected_errors += n;
ASSERT(parity_errors + n <=
rm->rm_firstdatacol);
}
goto done;
}
break;
case 2:
/*
* Two data column errors require double parity.
*/
ASSERT(rm->rm_firstdatacol == 2);
/*
* Find the two columns that reported errors.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error != 0)
break;
}
ASSERT(c != rm->rm_cols);
ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
rc->rc_error == ESTALE);
for (c1 = c++; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error != 0)
break;
}
ASSERT(c != rm->rm_cols);
ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
rc->rc_error == ESTALE);
vdev_raidz_reconstruct_pq(rm, c1, c);
if (zio_checksum_error(zio) == 0) {
atomic_inc_64(&raidz_corrected_pq);
goto done;
}
break;
default:
ASSERT(rm->rm_firstdatacol <= 2);
ASSERT(0);
}
}
/*
* This isn't a typical situation -- either we got a read error or
* a child silently returned bad data. Read every block so we can
* try again with as much data and parity as we can track down. If
* we've already been through once before, all children will be marked
* as tried so we'll proceed to combinatorial reconstruction.
*/
unexpected_errors = 1;
rm->rm_missingdata = 0;
rm->rm_missingparity = 0;
for (c = 0; c < rm->rm_cols; c++) {
if (rm->rm_col[c].rc_tried)
continue;
zio_vdev_io_redone(zio);
do {
rc = &rm->rm_col[c];
if (rc->rc_tried)
continue;
zio_nowait(zio_vdev_child_io(zio, NULL,
vd->vdev_child[rc->rc_devidx],
rc->rc_offset, rc->rc_data, rc->rc_size,
zio->io_type, zio->io_priority, 0,
vdev_raidz_child_done, rc));
} while (++c < rm->rm_cols);
return;
}
/*
* At this point we've attempted to reconstruct the data given the
* errors we detected, and we've attempted to read all columns. There
* must, therefore, be one or more additional problems -- silent errors
* resulting in invalid data rather than explicit I/O errors resulting
* in absent data. Before we attempt combinatorial reconstruction make
* sure we have a chance of coming up with the right answer.
*/
if (total_errors >= rm->rm_firstdatacol) {
zio->io_error = vdev_raidz_worst_error(rm);
/*
* If there were exactly as many device errors as parity
* columns, yet we couldn't reconstruct the data, then at
* least one device must have returned bad data silently.
*/
if (total_errors == rm->rm_firstdatacol)
zio->io_error = zio_worst_error(zio->io_error, ECKSUM);
goto done;
}
if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0) {
/*
* Attempt to reconstruct the data from parity P.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
void *orig;
rc = &rm->rm_col[c];
orig = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig, rc->rc_size);
vdev_raidz_reconstruct_p(rm, c);
if (zio_checksum_error(zio) == 0) {
zio_buf_free(orig, rc->rc_size);
atomic_inc_64(&raidz_corrected_p);
/*
* If this child didn't know that it returned
* bad data, inform it.
*/
if (rc->rc_tried && rc->rc_error == 0)
raidz_checksum_error(zio, rc);
rc->rc_error = ECKSUM;
goto done;
}
bcopy(orig, rc->rc_data, rc->rc_size);
zio_buf_free(orig, rc->rc_size);
}
}
if (rm->rm_firstdatacol > 1 && rm->rm_col[VDEV_RAIDZ_Q].rc_error == 0) {
/*
* Attempt to reconstruct the data from parity Q.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
void *orig;
rc = &rm->rm_col[c];
orig = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig, rc->rc_size);
vdev_raidz_reconstruct_q(rm, c);
if (zio_checksum_error(zio) == 0) {
zio_buf_free(orig, rc->rc_size);
atomic_inc_64(&raidz_corrected_q);
/*
* If this child didn't know that it returned
* bad data, inform it.
*/
if (rc->rc_tried && rc->rc_error == 0)
raidz_checksum_error(zio, rc);
rc->rc_error = ECKSUM;
goto done;
}
bcopy(orig, rc->rc_data, rc->rc_size);
zio_buf_free(orig, rc->rc_size);
}
}
if (rm->rm_firstdatacol > 1 &&
rm->rm_col[VDEV_RAIDZ_P].rc_error == 0 &&
rm->rm_col[VDEV_RAIDZ_Q].rc_error == 0) {
/*
* Attempt to reconstruct the data from both P and Q.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols - 1; c++) {
void *orig, *orig1;
rc = &rm->rm_col[c];
orig = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig, rc->rc_size);
for (c1 = c + 1; c1 < rm->rm_cols; c1++) {
rc1 = &rm->rm_col[c1];
orig1 = zio_buf_alloc(rc1->rc_size);
bcopy(rc1->rc_data, orig1, rc1->rc_size);
vdev_raidz_reconstruct_pq(rm, c, c1);
if (zio_checksum_error(zio) == 0) {
zio_buf_free(orig, rc->rc_size);
zio_buf_free(orig1, rc1->rc_size);
atomic_inc_64(&raidz_corrected_pq);
/*
* If these children didn't know they
* returned bad data, inform them.
*/
if (rc->rc_tried && rc->rc_error == 0)
raidz_checksum_error(zio, rc);
if (rc1->rc_tried && rc1->rc_error == 0)
raidz_checksum_error(zio, rc1);
rc->rc_error = ECKSUM;
rc1->rc_error = ECKSUM;
goto done;
}
bcopy(orig1, rc1->rc_data, rc1->rc_size);
zio_buf_free(orig1, rc1->rc_size);
}
bcopy(orig, rc->rc_data, rc->rc_size);
zio_buf_free(orig, rc->rc_size);
}
}
/*
* All combinations failed to checksum. Generate checksum ereports for
* all children.
*/
zio->io_error = ECKSUM;
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
zio->io_spa, vd->vdev_child[rc->rc_devidx], zio,
rc->rc_offset, rc->rc_size);
}
}
done:
zio_checksum_verified(zio);
if (zio->io_error == 0 && (spa_mode & FWRITE) &&
(unexpected_errors || (zio->io_flags & ZIO_FLAG_RESILVER))) {
/*
* Use the good data we have in hand to repair damaged children.
*/
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
cvd = vd->vdev_child[rc->rc_devidx];
if (rc->rc_error == 0)
continue;
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, rc->rc_data, rc->rc_size,
ZIO_TYPE_WRITE, zio->io_priority,
ZIO_FLAG_IO_REPAIR, NULL, NULL));
}
}
}
static void
vdev_raidz_state_change(vdev_t *vd, int faulted, int degraded)
{
if (faulted > vd->vdev_nparity)
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_NO_REPLICAS);
else if (degraded + faulted != 0)
vdev_set_state(vd, B_FALSE, VDEV_STATE_DEGRADED, VDEV_AUX_NONE);
else
vdev_set_state(vd, B_FALSE, VDEV_STATE_HEALTHY, VDEV_AUX_NONE);
}
vdev_ops_t vdev_raidz_ops = {
vdev_raidz_open,
vdev_raidz_close,
vdev_raidz_asize,
vdev_raidz_io_start,
vdev_raidz_io_done,
vdev_raidz_state_change,
VDEV_TYPE_RAIDZ, /* name of this vdev type */
B_FALSE /* not a leaf vdev */
};
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c (revision 209274)
@@ -1,2712 +1,2719 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/types.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/systm.h>
#include <sys/sysmacros.h>
#include <sys/resource.h>
#include <sys/vfs.h>
#include <sys/vnode.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <sys/kmem.h>
#include <sys/cmn_err.h>
#include <sys/errno.h>
#include <sys/unistd.h>
#include <sys/sdt.h>
#include <sys/fs/zfs.h>
#include <sys/policy.h>
#include <sys/zfs_znode.h>
#include <sys/zfs_fuid.h>
#include <sys/zfs_acl.h>
#include <sys/zfs_dir.h>
#include <sys/zfs_vfsops.h>
#include <sys/dmu.h>
#include <sys/dnode.h>
#include <sys/zap.h>
#include <acl/acl_common.h>
#define ALLOW ACE_ACCESS_ALLOWED_ACE_TYPE
#define DENY ACE_ACCESS_DENIED_ACE_TYPE
#define MAX_ACE_TYPE ACE_SYSTEM_ALARM_CALLBACK_OBJECT_ACE_TYPE
#define MIN_ACE_TYPE ALLOW
#define OWNING_GROUP (ACE_GROUP|ACE_IDENTIFIER_GROUP)
#define EVERYONE_ALLOW_MASK (ACE_READ_ACL|ACE_READ_ATTRIBUTES | \
ACE_READ_NAMED_ATTRS|ACE_SYNCHRONIZE)
#define EVERYONE_DENY_MASK (ACE_WRITE_ACL|ACE_WRITE_OWNER | \
ACE_WRITE_ATTRIBUTES|ACE_WRITE_NAMED_ATTRS)
#define OWNER_ALLOW_MASK (ACE_WRITE_ACL | ACE_WRITE_OWNER | \
ACE_WRITE_ATTRIBUTES|ACE_WRITE_NAMED_ATTRS)
#define WRITE_MASK_DATA (ACE_WRITE_DATA|ACE_APPEND_DATA|ACE_WRITE_NAMED_ATTRS)
#define ZFS_CHECKED_MASKS (ACE_READ_ACL|ACE_READ_ATTRIBUTES|ACE_READ_DATA| \
ACE_READ_NAMED_ATTRS|ACE_WRITE_DATA|ACE_WRITE_ATTRIBUTES| \
ACE_WRITE_NAMED_ATTRS|ACE_APPEND_DATA|ACE_EXECUTE|ACE_WRITE_OWNER| \
ACE_WRITE_ACL|ACE_DELETE|ACE_DELETE_CHILD|ACE_SYNCHRONIZE)
#define WRITE_MASK (WRITE_MASK_DATA|ACE_WRITE_ATTRIBUTES|ACE_WRITE_ACL|\
ACE_WRITE_OWNER|ACE_DELETE|ACE_DELETE_CHILD)
#define OGE_CLEAR (ACE_READ_DATA|ACE_LIST_DIRECTORY|ACE_WRITE_DATA| \
ACE_ADD_FILE|ACE_APPEND_DATA|ACE_ADD_SUBDIRECTORY|ACE_EXECUTE)
#define OKAY_MASK_BITS (ACE_READ_DATA|ACE_LIST_DIRECTORY|ACE_WRITE_DATA| \
ACE_ADD_FILE|ACE_APPEND_DATA|ACE_ADD_SUBDIRECTORY|ACE_EXECUTE)
#define ALL_INHERIT (ACE_FILE_INHERIT_ACE|ACE_DIRECTORY_INHERIT_ACE | \
ACE_NO_PROPAGATE_INHERIT_ACE|ACE_INHERIT_ONLY_ACE|ACE_INHERITED_ACE)
#define RESTRICTED_CLEAR (ACE_WRITE_ACL|ACE_WRITE_OWNER)
#define V4_ACL_WIDE_FLAGS (ZFS_ACL_AUTO_INHERIT|ZFS_ACL_DEFAULTED|\
ZFS_ACL_PROTECTED)
#define ZFS_ACL_WIDE_FLAGS (V4_ACL_WIDE_FLAGS|ZFS_ACL_TRIVIAL|ZFS_INHERIT_ACE|\
ZFS_ACL_OBJ_ACE)
static uint16_t
zfs_ace_v0_get_type(void *acep)
{
return (((zfs_oldace_t *)acep)->z_type);
}
static uint16_t
zfs_ace_v0_get_flags(void *acep)
{
return (((zfs_oldace_t *)acep)->z_flags);
}
static uint32_t
zfs_ace_v0_get_mask(void *acep)
{
return (((zfs_oldace_t *)acep)->z_access_mask);
}
static uint64_t
zfs_ace_v0_get_who(void *acep)
{
return (((zfs_oldace_t *)acep)->z_fuid);
}
static void
zfs_ace_v0_set_type(void *acep, uint16_t type)
{
((zfs_oldace_t *)acep)->z_type = type;
}
static void
zfs_ace_v0_set_flags(void *acep, uint16_t flags)
{
((zfs_oldace_t *)acep)->z_flags = flags;
}
static void
zfs_ace_v0_set_mask(void *acep, uint32_t mask)
{
((zfs_oldace_t *)acep)->z_access_mask = mask;
}
static void
zfs_ace_v0_set_who(void *acep, uint64_t who)
{
((zfs_oldace_t *)acep)->z_fuid = who;
}
/*ARGSUSED*/
static size_t
zfs_ace_v0_size(void *acep)
{
return (sizeof (zfs_oldace_t));
}
static size_t
zfs_ace_v0_abstract_size(void)
{
return (sizeof (zfs_oldace_t));
}
static int
zfs_ace_v0_mask_off(void)
{
return (offsetof(zfs_oldace_t, z_access_mask));
}
/*ARGSUSED*/
static int
zfs_ace_v0_data(void *acep, void **datap)
{
*datap = NULL;
return (0);
}
static acl_ops_t zfs_acl_v0_ops = {
zfs_ace_v0_get_mask,
zfs_ace_v0_set_mask,
zfs_ace_v0_get_flags,
zfs_ace_v0_set_flags,
zfs_ace_v0_get_type,
zfs_ace_v0_set_type,
zfs_ace_v0_get_who,
zfs_ace_v0_set_who,
zfs_ace_v0_size,
zfs_ace_v0_abstract_size,
zfs_ace_v0_mask_off,
zfs_ace_v0_data
};
static uint16_t
zfs_ace_fuid_get_type(void *acep)
{
return (((zfs_ace_hdr_t *)acep)->z_type);
}
static uint16_t
zfs_ace_fuid_get_flags(void *acep)
{
return (((zfs_ace_hdr_t *)acep)->z_flags);
}
static uint32_t
zfs_ace_fuid_get_mask(void *acep)
{
return (((zfs_ace_hdr_t *)acep)->z_access_mask);
}
static uint64_t
zfs_ace_fuid_get_who(void *args)
{
uint16_t entry_type;
zfs_ace_t *acep = args;
entry_type = acep->z_hdr.z_flags & ACE_TYPE_FLAGS;
if (entry_type == ACE_OWNER || entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE)
return (-1);
return (((zfs_ace_t *)acep)->z_fuid);
}
static void
zfs_ace_fuid_set_type(void *acep, uint16_t type)
{
((zfs_ace_hdr_t *)acep)->z_type = type;
}
static void
zfs_ace_fuid_set_flags(void *acep, uint16_t flags)
{
((zfs_ace_hdr_t *)acep)->z_flags = flags;
}
static void
zfs_ace_fuid_set_mask(void *acep, uint32_t mask)
{
((zfs_ace_hdr_t *)acep)->z_access_mask = mask;
}
static void
zfs_ace_fuid_set_who(void *arg, uint64_t who)
{
zfs_ace_t *acep = arg;
uint16_t entry_type = acep->z_hdr.z_flags & ACE_TYPE_FLAGS;
if (entry_type == ACE_OWNER || entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE)
return;
acep->z_fuid = who;
}
static size_t
zfs_ace_fuid_size(void *acep)
{
zfs_ace_hdr_t *zacep = acep;
uint16_t entry_type;
switch (zacep->z_type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
return (sizeof (zfs_object_ace_t));
case ALLOW:
case DENY:
entry_type =
(((zfs_ace_hdr_t *)acep)->z_flags & ACE_TYPE_FLAGS);
if (entry_type == ACE_OWNER ||
entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE)
return (sizeof (zfs_ace_hdr_t));
/*FALLTHROUGH*/
default:
return (sizeof (zfs_ace_t));
}
}
static size_t
zfs_ace_fuid_abstract_size(void)
{
return (sizeof (zfs_ace_hdr_t));
}
static int
zfs_ace_fuid_mask_off(void)
{
return (offsetof(zfs_ace_hdr_t, z_access_mask));
}
static int
zfs_ace_fuid_data(void *acep, void **datap)
{
zfs_ace_t *zacep = acep;
zfs_object_ace_t *zobjp;
switch (zacep->z_hdr.z_type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
zobjp = acep;
*datap = (caddr_t)zobjp + sizeof (zfs_ace_t);
return (sizeof (zfs_object_ace_t) - sizeof (zfs_ace_t));
default:
*datap = NULL;
return (0);
}
}
static acl_ops_t zfs_acl_fuid_ops = {
zfs_ace_fuid_get_mask,
zfs_ace_fuid_set_mask,
zfs_ace_fuid_get_flags,
zfs_ace_fuid_set_flags,
zfs_ace_fuid_get_type,
zfs_ace_fuid_set_type,
zfs_ace_fuid_get_who,
zfs_ace_fuid_set_who,
zfs_ace_fuid_size,
zfs_ace_fuid_abstract_size,
zfs_ace_fuid_mask_off,
zfs_ace_fuid_data
};
static int
zfs_acl_version(int version)
{
if (version < ZPL_VERSION_FUID)
return (ZFS_ACL_VERSION_INITIAL);
else
return (ZFS_ACL_VERSION_FUID);
}
static int
zfs_acl_version_zp(znode_t *zp)
{
return (zfs_acl_version(zp->z_zfsvfs->z_version));
}
static zfs_acl_t *
zfs_acl_alloc(int vers)
{
zfs_acl_t *aclp;
aclp = kmem_zalloc(sizeof (zfs_acl_t), KM_SLEEP);
list_create(&aclp->z_acl, sizeof (zfs_acl_node_t),
offsetof(zfs_acl_node_t, z_next));
aclp->z_version = vers;
if (vers == ZFS_ACL_VERSION_FUID)
aclp->z_ops = zfs_acl_fuid_ops;
else
aclp->z_ops = zfs_acl_v0_ops;
return (aclp);
}
static zfs_acl_node_t *
zfs_acl_node_alloc(size_t bytes)
{
zfs_acl_node_t *aclnode;
aclnode = kmem_zalloc(sizeof (zfs_acl_node_t), KM_SLEEP);
if (bytes) {
aclnode->z_acldata = kmem_alloc(bytes, KM_SLEEP);
aclnode->z_allocdata = aclnode->z_acldata;
aclnode->z_allocsize = bytes;
aclnode->z_size = bytes;
}
return (aclnode);
}
static void
zfs_acl_node_free(zfs_acl_node_t *aclnode)
{
if (aclnode->z_allocsize)
kmem_free(aclnode->z_allocdata, aclnode->z_allocsize);
kmem_free(aclnode, sizeof (zfs_acl_node_t));
}
static void
zfs_acl_release_nodes(zfs_acl_t *aclp)
{
zfs_acl_node_t *aclnode;
while (aclnode = list_head(&aclp->z_acl)) {
list_remove(&aclp->z_acl, aclnode);
zfs_acl_node_free(aclnode);
}
aclp->z_acl_count = 0;
aclp->z_acl_bytes = 0;
}
void
zfs_acl_free(zfs_acl_t *aclp)
{
zfs_acl_release_nodes(aclp);
list_destroy(&aclp->z_acl);
kmem_free(aclp, sizeof (zfs_acl_t));
}
static boolean_t
zfs_acl_valid_ace_type(uint_t type, uint_t flags)
{
uint16_t entry_type;
switch (type) {
case ALLOW:
case DENY:
case ACE_SYSTEM_AUDIT_ACE_TYPE:
case ACE_SYSTEM_ALARM_ACE_TYPE:
entry_type = flags & ACE_TYPE_FLAGS;
return (entry_type == ACE_OWNER ||
entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE || entry_type == 0 ||
entry_type == ACE_IDENTIFIER_GROUP);
default:
if (type >= MIN_ACE_TYPE && type <= MAX_ACE_TYPE)
return (B_TRUE);
}
return (B_FALSE);
}
static boolean_t
zfs_ace_valid(vtype_t obj_type, zfs_acl_t *aclp, uint16_t type, uint16_t iflags)
{
/*
* first check type of entry
*/
if (!zfs_acl_valid_ace_type(type, iflags))
return (B_FALSE);
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
if (aclp->z_version < ZFS_ACL_VERSION_FUID)
return (B_FALSE);
aclp->z_hints |= ZFS_ACL_OBJ_ACE;
}
/*
* next check inheritance level flags
*/
if (obj_type == VDIR &&
(iflags & (ACE_FILE_INHERIT_ACE|ACE_DIRECTORY_INHERIT_ACE)))
aclp->z_hints |= ZFS_INHERIT_ACE;
if (iflags & (ACE_INHERIT_ONLY_ACE|ACE_NO_PROPAGATE_INHERIT_ACE)) {
if ((iflags & (ACE_FILE_INHERIT_ACE|
ACE_DIRECTORY_INHERIT_ACE)) == 0) {
return (B_FALSE);
}
}
return (B_TRUE);
}
static void *
zfs_acl_next_ace(zfs_acl_t *aclp, void *start, uint64_t *who,
uint32_t *access_mask, uint16_t *iflags, uint16_t *type)
{
zfs_acl_node_t *aclnode;
if (start == NULL) {
aclnode = list_head(&aclp->z_acl);
if (aclnode == NULL)
return (NULL);
aclp->z_next_ace = aclnode->z_acldata;
aclp->z_curr_node = aclnode;
aclnode->z_ace_idx = 0;
}
aclnode = aclp->z_curr_node;
if (aclnode == NULL)
return (NULL);
if (aclnode->z_ace_idx >= aclnode->z_ace_count) {
aclnode = list_next(&aclp->z_acl, aclnode);
if (aclnode == NULL)
return (NULL);
else {
aclp->z_curr_node = aclnode;
aclnode->z_ace_idx = 0;
aclp->z_next_ace = aclnode->z_acldata;
}
}
if (aclnode->z_ace_idx < aclnode->z_ace_count) {
void *acep = aclp->z_next_ace;
size_t ace_size;
/*
* Make sure we don't overstep our bounds
*/
ace_size = aclp->z_ops.ace_size(acep);
if (((caddr_t)acep + ace_size) >
((caddr_t)aclnode->z_acldata + aclnode->z_size)) {
return (NULL);
}
*iflags = aclp->z_ops.ace_flags_get(acep);
*type = aclp->z_ops.ace_type_get(acep);
*access_mask = aclp->z_ops.ace_mask_get(acep);
*who = aclp->z_ops.ace_who_get(acep);
aclp->z_next_ace = (caddr_t)aclp->z_next_ace + ace_size;
aclnode->z_ace_idx++;
return ((void *)acep);
}
return (NULL);
}
/*ARGSUSED*/
static uint64_t
zfs_ace_walk(void *datap, uint64_t cookie, int aclcnt,
uint16_t *flags, uint16_t *type, uint32_t *mask)
{
zfs_acl_t *aclp = datap;
zfs_ace_hdr_t *acep = (zfs_ace_hdr_t *)(uintptr_t)cookie;
uint64_t who;
acep = zfs_acl_next_ace(aclp, acep, &who, mask,
flags, type);
return ((uint64_t)(uintptr_t)acep);
}
static zfs_acl_node_t *
zfs_acl_curr_node(zfs_acl_t *aclp)
{
ASSERT(aclp->z_curr_node);
return (aclp->z_curr_node);
}
/*
* Copy ACE to internal ZFS format.
* While processing the ACL each ACE will be validated for correctness.
* ACE FUIDs will be created later.
*/
int
zfs_copy_ace_2_fuid(vtype_t obj_type, zfs_acl_t *aclp, void *datap,
zfs_ace_t *z_acl, int aclcnt, size_t *size)
{
int i;
uint16_t entry_type;
zfs_ace_t *aceptr = z_acl;
ace_t *acep = datap;
zfs_object_ace_t *zobjacep;
ace_object_t *aceobjp;
for (i = 0; i != aclcnt; i++) {
aceptr->z_hdr.z_access_mask = acep->a_access_mask;
aceptr->z_hdr.z_flags = acep->a_flags;
aceptr->z_hdr.z_type = acep->a_type;
entry_type = aceptr->z_hdr.z_flags & ACE_TYPE_FLAGS;
if (entry_type != ACE_OWNER && entry_type != OWNING_GROUP &&
entry_type != ACE_EVERYONE) {
if (!aclp->z_has_fuids)
aclp->z_has_fuids = IS_EPHEMERAL(acep->a_who);
aceptr->z_fuid = (uint64_t)acep->a_who;
}
/*
* Make sure ACE is valid
*/
if (zfs_ace_valid(obj_type, aclp, aceptr->z_hdr.z_type,
aceptr->z_hdr.z_flags) != B_TRUE)
return (EINVAL);
switch (acep->a_type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
zobjacep = (zfs_object_ace_t *)aceptr;
aceobjp = (ace_object_t *)acep;
bcopy(aceobjp->a_obj_type, zobjacep->z_object_type,
sizeof (aceobjp->a_obj_type));
bcopy(aceobjp->a_inherit_obj_type,
zobjacep->z_inherit_type,
sizeof (aceobjp->a_inherit_obj_type));
acep = (ace_t *)((caddr_t)acep + sizeof (ace_object_t));
break;
default:
acep = (ace_t *)((caddr_t)acep + sizeof (ace_t));
}
aceptr = (zfs_ace_t *)((caddr_t)aceptr +
aclp->z_ops.ace_size(aceptr));
}
*size = (caddr_t)aceptr - (caddr_t)z_acl;
return (0);
}
/*
* Copy ZFS ACEs to fixed size ace_t layout
*/
static void
zfs_copy_fuid_2_ace(zfsvfs_t *zfsvfs, zfs_acl_t *aclp, cred_t *cr,
void *datap, int filter)
{
uint64_t who;
uint32_t access_mask;
uint16_t iflags, type;
zfs_ace_hdr_t *zacep = NULL;
ace_t *acep = datap;
ace_object_t *objacep;
zfs_object_ace_t *zobjacep;
size_t ace_size;
uint16_t entry_type;
while (zacep = zfs_acl_next_ace(aclp, zacep,
&who, &access_mask, &iflags, &type)) {
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
if (filter) {
continue;
}
zobjacep = (zfs_object_ace_t *)zacep;
objacep = (ace_object_t *)acep;
bcopy(zobjacep->z_object_type,
objacep->a_obj_type,
sizeof (zobjacep->z_object_type));
bcopy(zobjacep->z_inherit_type,
objacep->a_inherit_obj_type,
sizeof (zobjacep->z_inherit_type));
ace_size = sizeof (ace_object_t);
break;
default:
ace_size = sizeof (ace_t);
break;
}
entry_type = (iflags & ACE_TYPE_FLAGS);
if ((entry_type != ACE_OWNER &&
entry_type != OWNING_GROUP &&
entry_type != ACE_EVERYONE)) {
acep->a_who = zfs_fuid_map_id(zfsvfs, who,
cr, (entry_type & ACE_IDENTIFIER_GROUP) ?
ZFS_ACE_GROUP : ZFS_ACE_USER);
} else {
acep->a_who = (uid_t)(int64_t)who;
}
acep->a_access_mask = access_mask;
acep->a_flags = iflags;
acep->a_type = type;
acep = (ace_t *)((caddr_t)acep + ace_size);
}
}
static int
zfs_copy_ace_2_oldace(vtype_t obj_type, zfs_acl_t *aclp, ace_t *acep,
zfs_oldace_t *z_acl, int aclcnt, size_t *size)
{
int i;
zfs_oldace_t *aceptr = z_acl;
for (i = 0; i != aclcnt; i++, aceptr++) {
aceptr->z_access_mask = acep[i].a_access_mask;
aceptr->z_type = acep[i].a_type;
aceptr->z_flags = acep[i].a_flags;
aceptr->z_fuid = acep[i].a_who;
/*
* Make sure ACE is valid
*/
if (zfs_ace_valid(obj_type, aclp, aceptr->z_type,
aceptr->z_flags) != B_TRUE)
return (EINVAL);
}
*size = (caddr_t)aceptr - (caddr_t)z_acl;
return (0);
}
/*
* Convert old ACL format to new.
*/
void
zfs_acl_xform(znode_t *zp, zfs_acl_t *aclp)
{
zfs_oldace_t *oldaclp;
int i;
uint16_t type, iflags;
uint32_t access_mask;
uint64_t who;
void *cookie = NULL;
zfs_acl_node_t *newaclnode;
ASSERT(aclp->z_version == ZFS_ACL_VERSION_INITIAL);
/*
* First create the ACE in a contiguous piece of memory
* for zfs_copy_ace_2_fuid().
*
* We only convert an ACL once, so this won't happen
* every time.
*/
oldaclp = kmem_alloc(sizeof (zfs_oldace_t) * aclp->z_acl_count,
KM_SLEEP);
i = 0;
while (cookie = zfs_acl_next_ace(aclp, cookie, &who,
&access_mask, &iflags, &type)) {
oldaclp[i].z_flags = iflags;
oldaclp[i].z_type = type;
oldaclp[i].z_fuid = who;
oldaclp[i++].z_access_mask = access_mask;
}
newaclnode = zfs_acl_node_alloc(aclp->z_acl_count *
sizeof (zfs_object_ace_t));
aclp->z_ops = zfs_acl_fuid_ops;
VERIFY(zfs_copy_ace_2_fuid(ZTOV(zp)->v_type, aclp, oldaclp,
newaclnode->z_acldata, aclp->z_acl_count,
&newaclnode->z_size) == 0);
newaclnode->z_ace_count = aclp->z_acl_count;
aclp->z_version = ZFS_ACL_VERSION;
kmem_free(oldaclp, aclp->z_acl_count * sizeof (zfs_oldace_t));
/*
* Release all previous ACL nodes
*/
zfs_acl_release_nodes(aclp);
list_insert_head(&aclp->z_acl, newaclnode);
aclp->z_acl_bytes = newaclnode->z_size;
aclp->z_acl_count = newaclnode->z_ace_count;
}
/*
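* Convert a UNIX access mask to a v4 access mask.
* Only the "other" bits (S_IROTH, S_IWOTH, S_IXOTH) are examined, so
* for example an input of 05 (r-x) yields ACE_READ_DATA|ACE_EXECUTE.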
*/
static uint32_t
zfs_unix_to_v4(uint32_t access_mask)
{
uint32_t new_mask = 0;
if (access_mask & S_IXOTH)
new_mask |= ACE_EXECUTE;
if (access_mask & S_IWOTH)
new_mask |= ACE_WRITE_DATA;
if (access_mask & S_IROTH)
new_mask |= ACE_READ_DATA;
return (new_mask);
}
static void
zfs_set_ace(zfs_acl_t *aclp, void *acep, uint32_t access_mask,
uint16_t access_type, uint64_t fuid, uint16_t entry_type)
{
uint16_t type = entry_type & ACE_TYPE_FLAGS;
aclp->z_ops.ace_mask_set(acep, access_mask);
aclp->z_ops.ace_type_set(acep, access_type);
aclp->z_ops.ace_flags_set(acep, entry_type);
if ((type != ACE_OWNER && type != OWNING_GROUP &&
type != ACE_EVERYONE))
aclp->z_ops.ace_who_set(acep, fuid);
}
/*
* Determine mode of file based on ACL.
* Also, create FUIDs for any User/Group ACEs
*/
static uint64_t
zfs_mode_fuid_compute(znode_t *zp, zfs_acl_t *aclp, cred_t *cr,
zfs_fuid_info_t **fuidp, dmu_tx_t *tx)
{
int entry_type;
mode_t mode;
mode_t seen = 0;
zfs_ace_hdr_t *acep = NULL;
uint64_t who;
uint16_t iflags, type;
uint32_t access_mask;
mode = (zp->z_phys->zp_mode & (S_IFMT | S_ISUID | S_ISGID | S_ISVTX));
while (acep = zfs_acl_next_ace(aclp, acep, &who,
&access_mask, &iflags, &type)) {
if (!zfs_acl_valid_ace_type(type, iflags))
continue;
entry_type = (iflags & ACE_TYPE_FLAGS);
/*
* Skip over owner@, group@ or everyone@ inherit only ACEs
*/
if ((iflags & ACE_INHERIT_ONLY_ACE) &&
(entry_type == ACE_OWNER || entry_type == ACE_EVERYONE ||
entry_type == OWNING_GROUP))
continue;
if (entry_type == ACE_OWNER) {
if ((access_mask & ACE_READ_DATA) &&
(!(seen & S_IRUSR))) {
seen |= S_IRUSR;
if (type == ALLOW) {
mode |= S_IRUSR;
}
}
if ((access_mask & ACE_WRITE_DATA) &&
(!(seen & S_IWUSR))) {
seen |= S_IWUSR;
if (type == ALLOW) {
mode |= S_IWUSR;
}
}
if ((access_mask & ACE_EXECUTE) &&
(!(seen & S_IXUSR))) {
seen |= S_IXUSR;
if (type == ALLOW) {
mode |= S_IXUSR;
}
}
} else if (entry_type == OWNING_GROUP) {
if ((access_mask & ACE_READ_DATA) &&
(!(seen & S_IRGRP))) {
seen |= S_IRGRP;
if (type == ALLOW) {
mode |= S_IRGRP;
}
}
if ((access_mask & ACE_WRITE_DATA) &&
(!(seen & S_IWGRP))) {
seen |= S_IWGRP;
if (type == ALLOW) {
mode |= S_IWGRP;
}
}
if ((access_mask & ACE_EXECUTE) &&
(!(seen & S_IXGRP))) {
seen |= S_IXGRP;
if (type == ALLOW) {
mode |= S_IXGRP;
}
}
} else if (entry_type == ACE_EVERYONE) {
if ((access_mask & ACE_READ_DATA)) {
if (!(seen & S_IRUSR)) {
seen |= S_IRUSR;
if (type == ALLOW) {
mode |= S_IRUSR;
}
}
if (!(seen & S_IRGRP)) {
seen |= S_IRGRP;
if (type == ALLOW) {
mode |= S_IRGRP;
}
}
if (!(seen & S_IROTH)) {
seen |= S_IROTH;
if (type == ALLOW) {
mode |= S_IROTH;
}
}
}
if ((access_mask & ACE_WRITE_DATA)) {
if (!(seen & S_IWUSR)) {
seen |= S_IWUSR;
if (type == ALLOW) {
mode |= S_IWUSR;
}
}
if (!(seen & S_IWGRP)) {
seen |= S_IWGRP;
if (type == ALLOW) {
mode |= S_IWGRP;
}
}
if (!(seen & S_IWOTH)) {
seen |= S_IWOTH;
if (type == ALLOW) {
mode |= S_IWOTH;
}
}
}
if ((access_mask & ACE_EXECUTE)) {
if (!(seen & S_IXUSR)) {
seen |= S_IXUSR;
if (type == ALLOW) {
mode |= S_IXUSR;
}
}
if (!(seen & S_IXGRP)) {
seen |= S_IXGRP;
if (type == ALLOW) {
mode |= S_IXGRP;
}
}
if (!(seen & S_IXOTH)) {
seen |= S_IXOTH;
if (type == ALLOW) {
mode |= S_IXOTH;
}
}
}
}
/*
* Now handle FUID create for user/group ACEs
*/
if (entry_type == 0 || entry_type == ACE_IDENTIFIER_GROUP) {
aclp->z_ops.ace_who_set(acep,
zfs_fuid_create(zp->z_zfsvfs, who, cr,
(entry_type == 0) ? ZFS_ACE_USER : ZFS_ACE_GROUP,
tx, fuidp));
}
}
return (mode);
}
static zfs_acl_t *
zfs_acl_node_read_internal(znode_t *zp, boolean_t will_modify)
{
zfs_acl_t *aclp;
zfs_acl_node_t *aclnode;
aclp = zfs_acl_alloc(zp->z_phys->zp_acl.z_acl_version);
/*
* Version 0 to 1 znode_acl_phys has the size/count fields swapped.
* Version 0 didn't have a size field, only a count.
*/
if (zp->z_phys->zp_acl.z_acl_version == ZFS_ACL_VERSION_INITIAL) {
aclp->z_acl_count = zp->z_phys->zp_acl.z_acl_size;
aclp->z_acl_bytes = ZFS_ACL_SIZE(aclp->z_acl_count);
} else {
aclp->z_acl_count = zp->z_phys->zp_acl.z_acl_count;
aclp->z_acl_bytes = zp->z_phys->zp_acl.z_acl_size;
}
aclnode = zfs_acl_node_alloc(will_modify ? aclp->z_acl_bytes : 0);
aclnode->z_ace_count = aclp->z_acl_count;
if (will_modify) {
bcopy(zp->z_phys->zp_acl.z_ace_data, aclnode->z_acldata,
aclp->z_acl_bytes);
} else {
aclnode->z_size = aclp->z_acl_bytes;
aclnode->z_acldata = &zp->z_phys->zp_acl.z_ace_data[0];
}
list_insert_head(&aclp->z_acl, aclnode);
return (aclp);
}
/*
* Read an external acl object.
*/
static int
zfs_acl_node_read(znode_t *zp, zfs_acl_t **aclpp, boolean_t will_modify)
{
uint64_t extacl = zp->z_phys->zp_acl.z_acl_extern_obj;
zfs_acl_t *aclp;
size_t aclsize;
size_t acl_count;
zfs_acl_node_t *aclnode;
int error;
ASSERT(MUTEX_HELD(&zp->z_acl_lock));
if (zp->z_phys->zp_acl.z_acl_extern_obj == 0) {
*aclpp = zfs_acl_node_read_internal(zp, will_modify);
return (0);
}
aclp = zfs_acl_alloc(zp->z_phys->zp_acl.z_acl_version);
if (zp->z_phys->zp_acl.z_acl_version == ZFS_ACL_VERSION_INITIAL) {
zfs_acl_phys_v0_t *zacl0 =
(zfs_acl_phys_v0_t *)&zp->z_phys->zp_acl;
aclsize = ZFS_ACL_SIZE(zacl0->z_acl_count);
acl_count = zacl0->z_acl_count;
} else {
aclsize = zp->z_phys->zp_acl.z_acl_size;
acl_count = zp->z_phys->zp_acl.z_acl_count;
if (aclsize == 0)
aclsize = acl_count * sizeof (zfs_ace_t);
}
aclnode = zfs_acl_node_alloc(aclsize);
list_insert_head(&aclp->z_acl, aclnode);
error = dmu_read(zp->z_zfsvfs->z_os, extacl, 0,
aclsize, aclnode->z_acldata);
aclnode->z_ace_count = acl_count;
aclp->z_acl_count = acl_count;
aclp->z_acl_bytes = aclsize;
if (error != 0) {
zfs_acl_free(aclp);
/* convert checksum errors into IO errors */
if (error == ECKSUM)
error = EIO;
return (error);
}
*aclpp = aclp;
return (0);
}
/*
* common code for setting ACLs.
*
* This function is called from zfs_mode_update, zfs_perm_init, and zfs_setacl.
* zfs_setacl passes a non-NULL inherit pointer (ihp) to indicate that it's
* already checked the acl and knows whether to inherit.
*/
int
zfs_aclset_common(znode_t *zp, zfs_acl_t *aclp, cred_t *cr,
zfs_fuid_info_t **fuidp, dmu_tx_t *tx)
{
int error;
znode_phys_t *zphys = zp->z_phys;
zfs_acl_phys_t *zacl = &zphys->zp_acl;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
uint64_t aoid = zphys->zp_acl.z_acl_extern_obj;
uint64_t off = 0;
dmu_object_type_t otype;
zfs_acl_node_t *aclnode;
ASSERT(MUTEX_HELD(&zp->z_lock));
ASSERT(MUTEX_HELD(&zp->z_acl_lock));
dmu_buf_will_dirty(zp->z_dbuf, tx);
zphys->zp_mode = zfs_mode_fuid_compute(zp, aclp, cr, fuidp, tx);
/*
* Decide which object type to use. If we are forced to
* use the old ACL format, then transform the ACL into zfs_oldace_t
* layout.
*/
if (!zfsvfs->z_use_fuids) {
otype = DMU_OT_OLDACL;
} else {
if ((aclp->z_version == ZFS_ACL_VERSION_INITIAL) &&
(zfsvfs->z_version >= ZPL_VERSION_FUID))
zfs_acl_xform(zp, aclp);
ASSERT(aclp->z_version >= ZFS_ACL_VERSION_FUID);
otype = DMU_OT_ACL;
}
if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
/*
* If ACL was previously external and we are now
* converting to new ACL format then release old
* ACL object and create a new one.
*/
if (aoid && aclp->z_version != zacl->z_acl_version) {
error = dmu_object_free(zfsvfs->z_os,
zp->z_phys->zp_acl.z_acl_extern_obj, tx);
if (error)
return (error);
aoid = 0;
}
if (aoid == 0) {
aoid = dmu_object_alloc(zfsvfs->z_os,
otype, aclp->z_acl_bytes,
otype == DMU_OT_ACL ? DMU_OT_SYSACL : DMU_OT_NONE,
otype == DMU_OT_ACL ? DN_MAX_BONUSLEN : 0, tx);
} else {
(void) dmu_object_set_blocksize(zfsvfs->z_os, aoid,
aclp->z_acl_bytes, 0, tx);
}
zphys->zp_acl.z_acl_extern_obj = aoid;
for (aclnode = list_head(&aclp->z_acl); aclnode;
aclnode = list_next(&aclp->z_acl, aclnode)) {
if (aclnode->z_ace_count == 0)
continue;
dmu_write(zfsvfs->z_os, aoid, off,
aclnode->z_size, aclnode->z_acldata, tx);
off += aclnode->z_size;
}
} else {
void *start = zacl->z_ace_data;
/*
* Migrating back to an embedded ACL?
*/
if (zphys->zp_acl.z_acl_extern_obj) {
error = dmu_object_free(zfsvfs->z_os,
zp->z_phys->zp_acl.z_acl_extern_obj, tx);
if (error)
return (error);
zphys->zp_acl.z_acl_extern_obj = 0;
}
for (aclnode = list_head(&aclp->z_acl); aclnode;
aclnode = list_next(&aclp->z_acl, aclnode)) {
if (aclnode->z_ace_count == 0)
continue;
bcopy(aclnode->z_acldata, start, aclnode->z_size);
start = (caddr_t)start + aclnode->z_size;
}
}
/*
* If this is the old version, swap count/bytes to match the old
* layout of znode_acl_phys_t.
*/
if (aclp->z_version == ZFS_ACL_VERSION_INITIAL) {
zphys->zp_acl.z_acl_size = aclp->z_acl_count;
zphys->zp_acl.z_acl_count = aclp->z_acl_bytes;
} else {
zphys->zp_acl.z_acl_size = aclp->z_acl_bytes;
zphys->zp_acl.z_acl_count = aclp->z_acl_count;
}
zphys->zp_acl.z_acl_version = aclp->z_version;
/*
* Replace ACL wide bits, but first clear them.
*/
zp->z_phys->zp_flags &= ~ZFS_ACL_WIDE_FLAGS;
zp->z_phys->zp_flags |= aclp->z_hints;
if (ace_trivial_common(aclp, 0, zfs_ace_walk) == 0)
zp->z_phys->zp_flags |= ZFS_ACL_TRIVIAL;
zfs_time_stamper_locked(zp, STATE_CHANGED, tx);
return (0);
}
/*
* Update access mask for prepended ACE
*
* This applies the "groupmask" value for aclmode property.
*/
static void
zfs_acl_prepend_fixup(zfs_acl_t *aclp, void *acep, void *origacep,
mode_t mode, uint64_t owner)
{
int rmask, wmask, xmask;
int user_ace;
uint16_t aceflags;
uint32_t origmask, acepmask;
uint64_t fuid;
aceflags = aclp->z_ops.ace_flags_get(acep);
fuid = aclp->z_ops.ace_who_get(acep);
origmask = aclp->z_ops.ace_mask_get(origacep);
acepmask = aclp->z_ops.ace_mask_get(acep);
user_ace = (!(aceflags &
(ACE_OWNER|ACE_GROUP|ACE_IDENTIFIER_GROUP)));
if (user_ace && (fuid == owner)) {
rmask = S_IRUSR;
wmask = S_IWUSR;
xmask = S_IXUSR;
} else {
rmask = S_IRGRP;
wmask = S_IWGRP;
xmask = S_IXGRP;
}
if (origmask & ACE_READ_DATA) {
if (mode & rmask) {
acepmask &= ~ACE_READ_DATA;
} else {
acepmask |= ACE_READ_DATA;
}
}
if (origmask & ACE_WRITE_DATA) {
if (mode & wmask) {
acepmask &= ~ACE_WRITE_DATA;
} else {
acepmask |= ACE_WRITE_DATA;
}
}
if (origmask & ACE_APPEND_DATA) {
if (mode & wmask) {
acepmask &= ~ACE_APPEND_DATA;
} else {
acepmask |= ACE_APPEND_DATA;
}
}
if (origmask & ACE_EXECUTE) {
if (mode & xmask) {
acepmask &= ~ACE_EXECUTE;
} else {
acepmask |= ACE_EXECUTE;
}
}
aclp->z_ops.ace_mask_set(acep, acepmask);
}
/*
* Apply mode to canonical six ACEs.
*/
static void
zfs_acl_fixup_canonical_six(zfs_acl_t *aclp, mode_t mode)
{
zfs_acl_node_t *aclnode = list_tail(&aclp->z_acl);
void *acep;
int maskoff = aclp->z_ops.ace_mask_off();
size_t abstract_size = aclp->z_ops.ace_abstract_size();
ASSERT(aclnode != NULL);
acep = (void *)((caddr_t)aclnode->z_acldata +
aclnode->z_size - (abstract_size * 6));
/*
* Fixup final ACEs to match the mode
*/
adjust_ace_pair_common(acep, maskoff, abstract_size,
(mode & 0700) >> 6); /* owner@ */
acep = (caddr_t)acep + (abstract_size * 2);
adjust_ace_pair_common(acep, maskoff, abstract_size,
(mode & 0070) >> 3); /* group@ */
acep = (caddr_t)acep + (abstract_size * 2);
adjust_ace_pair_common(acep, maskoff,
abstract_size, mode); /* everyone@ */
}
static int
zfs_acl_ace_match(zfs_acl_t *aclp, void *acep, int allow_deny,
int entry_type, int accessmask)
{
uint32_t mask = aclp->z_ops.ace_mask_get(acep);
uint16_t type = aclp->z_ops.ace_type_get(acep);
uint16_t flags = aclp->z_ops.ace_flags_get(acep);
return (mask == accessmask && type == allow_deny &&
((flags & ACE_TYPE_FLAGS) == entry_type));
}
/*
* Can prepended ACE be reused?
*/
static int
zfs_reuse_deny(zfs_acl_t *aclp, void *acep, void *prevacep)
{
int okay_masks;
uint16_t prevtype;
uint16_t prevflags;
uint16_t flags;
uint32_t mask, prevmask;
if (prevacep == NULL)
return (B_FALSE);
prevtype = aclp->z_ops.ace_type_get(prevacep);
prevflags = aclp->z_ops.ace_flags_get(prevacep);
flags = aclp->z_ops.ace_flags_get(acep);
mask = aclp->z_ops.ace_mask_get(acep);
prevmask = aclp->z_ops.ace_mask_get(prevacep);
if (prevtype != DENY)
return (B_FALSE);
if (prevflags != (flags & ACE_IDENTIFIER_GROUP))
return (B_FALSE);
okay_masks = (mask & OKAY_MASK_BITS);
if (prevmask & ~okay_masks)
return (B_FALSE);
return (B_TRUE);
}
/*
* Insert new ACL node into chain of zfs_acl_node_t's
*
* There are two possible outcomes.
* 1. If the ACL is currently just a single zfs_acl_node and
* we are prepending the entry then current acl node will have
* a new node inserted above it.
*
* 2. If we are inserting in the middle of current acl node then
* the current node will be split in two and new node will be inserted
* in between the two split nodes.
*/
static zfs_acl_node_t *
zfs_acl_ace_insert(zfs_acl_t *aclp, void *acep)
{
zfs_acl_node_t *newnode;
zfs_acl_node_t *trailernode = NULL;
zfs_acl_node_t *currnode = zfs_acl_curr_node(aclp);
int curr_idx = aclp->z_curr_node->z_ace_idx;
int trailer_count;
size_t oldsize;
newnode = zfs_acl_node_alloc(aclp->z_ops.ace_size(acep));
newnode->z_ace_count = 1;
oldsize = currnode->z_size;
if (curr_idx != 1) {
trailernode = zfs_acl_node_alloc(0);
trailernode->z_acldata = acep;
trailer_count = currnode->z_ace_count - curr_idx + 1;
currnode->z_ace_count = curr_idx - 1;
currnode->z_size = (caddr_t)acep - (caddr_t)currnode->z_acldata;
trailernode->z_size = oldsize - currnode->z_size;
trailernode->z_ace_count = trailer_count;
}
aclp->z_acl_count += 1;
aclp->z_acl_bytes += aclp->z_ops.ace_size(acep);
if (curr_idx == 1)
list_insert_before(&aclp->z_acl, currnode, newnode);
else
list_insert_after(&aclp->z_acl, currnode, newnode);
if (trailernode) {
list_insert_after(&aclp->z_acl, newnode, trailernode);
aclp->z_curr_node = trailernode;
trailernode->z_ace_idx = 1;
}
return (newnode);
}
/*
* Prepend deny ACE
*/
static void *
zfs_acl_prepend_deny(znode_t *zp, zfs_acl_t *aclp, void *acep,
mode_t mode)
{
zfs_acl_node_t *aclnode;
void *newacep;
uint64_t fuid;
uint16_t flags;
aclnode = zfs_acl_ace_insert(aclp, acep);
newacep = aclnode->z_acldata;
fuid = aclp->z_ops.ace_who_get(acep);
flags = aclp->z_ops.ace_flags_get(acep);
zfs_set_ace(aclp, newacep, 0, DENY, fuid, (flags & ACE_TYPE_FLAGS));
zfs_acl_prepend_fixup(aclp, newacep, acep, mode, zp->z_phys->zp_uid);
return (newacep);
}
/*
* Split an inherited ACE into inherit_only ACE
* and original ACE with inheritance flags stripped off.
*/
static void
zfs_acl_split_ace(zfs_acl_t *aclp, zfs_ace_hdr_t *acep)
{
zfs_acl_node_t *aclnode;
zfs_acl_node_t *currnode;
void *newacep;
uint16_t type, flags;
uint32_t mask;
uint64_t fuid;
type = aclp->z_ops.ace_type_get(acep);
flags = aclp->z_ops.ace_flags_get(acep);
mask = aclp->z_ops.ace_mask_get(acep);
fuid = aclp->z_ops.ace_who_get(acep);
aclnode = zfs_acl_ace_insert(aclp, acep);
newacep = aclnode->z_acldata;
aclp->z_ops.ace_type_set(newacep, type);
aclp->z_ops.ace_flags_set(newacep, flags | ACE_INHERIT_ONLY_ACE);
aclp->z_ops.ace_mask_set(newacep, mask);
aclp->z_ops.ace_type_set(newacep, type);
aclp->z_ops.ace_who_set(newacep, fuid);
aclp->z_next_ace = acep;
flags &= ~ALL_INHERIT;
aclp->z_ops.ace_flags_set(acep, flags);
currnode = zfs_acl_curr_node(aclp);
ASSERT(currnode->z_ace_idx >= 1);
currnode->z_ace_idx -= 1;
}
/*
* Are the last six ACEs in the tail node the canonical six ACEs?
*/
static int
zfs_have_canonical_six(zfs_acl_t *aclp)
{
void *acep;
zfs_acl_node_t *aclnode = list_tail(&aclp->z_acl);
int i = 0;
size_t abstract_size = aclp->z_ops.ace_abstract_size();
ASSERT(aclnode != NULL);
if (aclnode->z_ace_count < 6)
return (0);
acep = (void *)((caddr_t)aclnode->z_acldata +
aclnode->z_size - (aclp->z_ops.ace_abstract_size() * 6));
if ((zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
DENY, ACE_OWNER, 0) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
ALLOW, ACE_OWNER, OWNER_ALLOW_MASK) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++), DENY,
OWNING_GROUP, 0) && zfs_acl_ace_match(aclp, (caddr_t)acep +
(abstract_size * i++),
ALLOW, OWNING_GROUP, 0) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
DENY, ACE_EVERYONE, EVERYONE_DENY_MASK) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
ALLOW, ACE_EVERYONE, EVERYONE_ALLOW_MASK))) {
return (1);
} else {
return (0);
}
}
/*
* Apply step 1g, to group entries
*
* Need to deal with corner case where group may have
* greater permissions than owner. If so then limit
* group permissions, based on what extra permissions
* group has.
*/
static void
zfs_fixup_group_entries(zfs_acl_t *aclp, void *acep, void *prevacep,
mode_t mode)
{
uint32_t prevmask = aclp->z_ops.ace_mask_get(prevacep);
uint32_t mask = aclp->z_ops.ace_mask_get(acep);
uint16_t prevflags = aclp->z_ops.ace_flags_get(prevacep);
mode_t extramode = (mode >> 3) & 07;
mode_t ownermode = (mode >> 6);
if (prevflags & ACE_IDENTIFIER_GROUP) {
extramode &= ~ownermode;
if (extramode) {
if (extramode & S_IROTH) {
prevmask &= ~ACE_READ_DATA;
mask &= ~ACE_READ_DATA;
}
if (extramode & S_IWOTH) {
prevmask &= ~(ACE_WRITE_DATA|ACE_APPEND_DATA);
mask &= ~(ACE_WRITE_DATA|ACE_APPEND_DATA);
}
if (extramode & S_IXOTH) {
prevmask &= ~ACE_EXECUTE;
mask &= ~ACE_EXECUTE;
}
}
}
aclp->z_ops.ace_mask_set(acep, mask);
aclp->z_ops.ace_mask_set(prevacep, prevmask);
}
/*
* Apply the chmod algorithm as described
* in PSARC/2002/240
*/
static void
zfs_acl_chmod(znode_t *zp, uint64_t mode, zfs_acl_t *aclp)
{
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
void *acep = NULL, *prevacep = NULL;
uint64_t who;
int i;
int entry_type;
int reuse_deny;
int need_canonical_six = 1;
uint16_t iflags, type;
uint32_t access_mask;
ASSERT(MUTEX_HELD(&zp->z_acl_lock));
ASSERT(MUTEX_HELD(&zp->z_lock));
aclp->z_hints = (zp->z_phys->zp_flags & V4_ACL_WIDE_FLAGS);
/*
* If discard then just discard all ACL nodes which
* represent the ACEs.
*
* New owner@/group@/everyone@ ACEs will be added
* later.
*/
if (zfsvfs->z_acl_mode == ZFS_ACL_DISCARD)
zfs_acl_release_nodes(aclp);
while (acep = zfs_acl_next_ace(aclp, acep, &who, &access_mask,
&iflags, &type)) {
entry_type = (iflags & ACE_TYPE_FLAGS);
iflags = (iflags & ALL_INHERIT);
if ((type != ALLOW && type != DENY) ||
(iflags & ACE_INHERIT_ONLY_ACE)) {
if (iflags)
aclp->z_hints |= ZFS_INHERIT_ACE;
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
aclp->z_hints |= ZFS_ACL_OBJ_ACE;
break;
}
goto nextace;
}
/*
* Need to split ace into two?
*/
if ((iflags & (ACE_FILE_INHERIT_ACE|
ACE_DIRECTORY_INHERIT_ACE)) &&
(!(iflags & ACE_INHERIT_ONLY_ACE))) {
zfs_acl_split_ace(aclp, acep);
aclp->z_hints |= ZFS_INHERIT_ACE;
goto nextace;
}
if (entry_type == ACE_OWNER || entry_type == ACE_EVERYONE ||
(entry_type == OWNING_GROUP)) {
access_mask &= ~OGE_CLEAR;
aclp->z_ops.ace_mask_set(acep, access_mask);
goto nextace;
} else {
reuse_deny = B_TRUE;
if (type == ALLOW) {
/*
* Check preceding ACE if any, to see
* if we need to prepend a DENY ACE.
* This is only applicable when the acl_mode
* property == groupmask.
*/
if (zfsvfs->z_acl_mode == ZFS_ACL_GROUPMASK) {
reuse_deny = zfs_reuse_deny(aclp, acep,
prevacep);
if (!reuse_deny) {
prevacep =
zfs_acl_prepend_deny(zp,
aclp, acep, mode);
} else {
zfs_acl_prepend_fixup(
aclp, prevacep,
acep, mode,
zp->z_phys->zp_uid);
}
zfs_fixup_group_entries(aclp, acep,
prevacep, mode);
}
}
}
nextace:
prevacep = acep;
}
/*
* Check out last six aces, if we have six.
*/
if (aclp->z_acl_count >= 6) {
if (zfs_have_canonical_six(aclp)) {
need_canonical_six = 0;
}
}
if (need_canonical_six) {
size_t abstract_size = aclp->z_ops.ace_abstract_size();
void *zacep;
zfs_acl_node_t *aclnode =
zfs_acl_node_alloc(abstract_size * 6);
aclnode->z_size = abstract_size * 6;
aclnode->z_ace_count = 6;
aclp->z_acl_bytes += aclnode->z_size;
list_insert_tail(&aclp->z_acl, aclnode);
zacep = aclnode->z_acldata;
i = 0;
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
0, DENY, -1, ACE_OWNER);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
OWNER_ALLOW_MASK, ALLOW, -1, ACE_OWNER);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++), 0,
DENY, -1, OWNING_GROUP);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++), 0,
ALLOW, -1, OWNING_GROUP);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
EVERYONE_DENY_MASK, DENY, -1, ACE_EVERYONE);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
EVERYONE_ALLOW_MASK, ALLOW, -1, ACE_EVERYONE);
aclp->z_acl_count += 6;
}
zfs_acl_fixup_canonical_six(aclp, mode);
}
int
zfs_acl_chmod_setattr(znode_t *zp, zfs_acl_t **aclp, uint64_t mode)
{
int error;
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
*aclp = NULL;
error = zfs_acl_node_read(zp, aclp, B_TRUE);
if (error == 0)
zfs_acl_chmod(zp, mode, *aclp);
mutex_exit(&zp->z_acl_lock);
mutex_exit(&zp->z_lock);
return (error);
}
/*
* strip off write_owner and write_acl
*/
static void
zfs_restricted_update(zfsvfs_t *zfsvfs, zfs_acl_t *aclp, void *acep)
{
uint32_t mask = aclp->z_ops.ace_mask_get(acep);
if ((zfsvfs->z_acl_inherit == ZFS_ACL_RESTRICTED) &&
(aclp->z_ops.ace_type_get(acep) == ALLOW)) {
mask &= ~RESTRICTED_CLEAR;
aclp->z_ops.ace_mask_set(acep, mask);
}
}
/*
* Should ACE be inherited?
*/
static int
zfs_ace_can_use(znode_t *zp, uint16_t acep_flags)
{
int vtype = ZTOV(zp)->v_type;
int iflags = (acep_flags & 0xf);
if ((vtype == VDIR) && (iflags & ACE_DIRECTORY_INHERIT_ACE))
return (1);
else if (iflags & ACE_FILE_INHERIT_ACE)
return (!((vtype == VDIR) &&
(iflags & ACE_NO_PROPAGATE_INHERIT_ACE)));
return (0);
}
/*
* inherit inheritable ACEs from parent
*/
static zfs_acl_t *
zfs_acl_inherit(znode_t *zp, zfs_acl_t *paclp, uint64_t mode,
boolean_t *need_chmod)
{
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
void *pacep;
void *acep, *acep2;
zfs_acl_node_t *aclnode, *aclnode2;
zfs_acl_t *aclp = NULL;
uint64_t who;
uint32_t access_mask;
uint16_t iflags, newflags, type;
size_t ace_size;
void *data1, *data2;
size_t data1sz, data2sz;
boolean_t vdir = ZTOV(zp)->v_type == VDIR;
boolean_t vreg = ZTOV(zp)->v_type == VREG;
boolean_t passthrough, passthrough_x, noallow;
passthrough_x =
zfsvfs->z_acl_inherit == ZFS_ACL_PASSTHROUGH_X;
passthrough = passthrough_x ||
zfsvfs->z_acl_inherit == ZFS_ACL_PASSTHROUGH;
noallow =
zfsvfs->z_acl_inherit == ZFS_ACL_NOALLOW;
*need_chmod = B_TRUE;
pacep = NULL;
aclp = zfs_acl_alloc(paclp->z_version);
if (zfsvfs->z_acl_inherit == ZFS_ACL_DISCARD)
return (aclp);
while (pacep = zfs_acl_next_ace(paclp, pacep, &who,
&access_mask, &iflags, &type)) {
/*
* don't inherit bogus ACEs
*/
if (!zfs_acl_valid_ace_type(type, iflags))
continue;
if (noallow && type == ALLOW)
continue;
ace_size = aclp->z_ops.ace_size(pacep);
if (!zfs_ace_can_use(zp, iflags))
continue;
/*
* If owner@, group@, or everyone@ inheritable
* then zfs_acl_chmod() isn't needed.
*/
if (passthrough &&
((iflags & (ACE_OWNER|ACE_EVERYONE)) ||
((iflags & OWNING_GROUP) ==
OWNING_GROUP)) && (vreg || (vdir && (iflags &
ACE_DIRECTORY_INHERIT_ACE)))) {
*need_chmod = B_FALSE;
if (!vdir && passthrough_x &&
((mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0)) {
access_mask &= ~ACE_EXECUTE;
}
}
aclnode = zfs_acl_node_alloc(ace_size);
list_insert_tail(&aclp->z_acl, aclnode);
acep = aclnode->z_acldata;
zfs_set_ace(aclp, acep, access_mask, type,
who, iflags|ACE_INHERITED_ACE);
/*
* Copy special opaque data if any
*/
if ((data1sz = paclp->z_ops.ace_data(pacep, &data1)) != 0) {
VERIFY((data2sz = aclp->z_ops.ace_data(acep,
&data2)) == data1sz);
bcopy(data1, data2, data2sz);
}
aclp->z_acl_count++;
aclnode->z_ace_count++;
aclp->z_acl_bytes += aclnode->z_size;
newflags = aclp->z_ops.ace_flags_get(acep);
if (vdir)
aclp->z_hints |= ZFS_INHERIT_ACE;
if ((iflags & ACE_NO_PROPAGATE_INHERIT_ACE) || !vdir) {
newflags &= ~ALL_INHERIT;
aclp->z_ops.ace_flags_set(acep,
newflags|ACE_INHERITED_ACE);
zfs_restricted_update(zfsvfs, aclp, acep);
continue;
}
ASSERT(vdir);
newflags = aclp->z_ops.ace_flags_get(acep);
if ((iflags & (ACE_FILE_INHERIT_ACE |
ACE_DIRECTORY_INHERIT_ACE)) !=
ACE_FILE_INHERIT_ACE) {
aclnode2 = zfs_acl_node_alloc(ace_size);
list_insert_tail(&aclp->z_acl, aclnode2);
acep2 = aclnode2->z_acldata;
zfs_set_ace(aclp, acep2,
access_mask, type, who,
iflags|ACE_INHERITED_ACE);
newflags |= ACE_INHERIT_ONLY_ACE;
aclp->z_ops.ace_flags_set(acep, newflags);
newflags &= ~ALL_INHERIT;
aclp->z_ops.ace_flags_set(acep2,
newflags|ACE_INHERITED_ACE);
/*
* Copy special opaque data if any
*/
if ((data1sz = aclp->z_ops.ace_data(acep,
&data1)) != 0) {
VERIFY((data2sz =
aclp->z_ops.ace_data(acep2,
&data2)) == data1sz);
bcopy(data1, data2, data1sz);
}
aclp->z_acl_count++;
aclnode2->z_ace_count++;
aclp->z_acl_bytes += aclnode->z_size;
zfs_restricted_update(zfsvfs, aclp, acep2);
} else {
newflags |= ACE_INHERIT_ONLY_ACE;
aclp->z_ops.ace_flags_set(acep,
newflags|ACE_INHERITED_ACE);
}
}
return (aclp);
}
/*
* Create file system object initial permissions
* including inheritable ACEs.
*/
void
zfs_perm_init(znode_t *zp, znode_t *parent, int flag,
vattr_t *vap, dmu_tx_t *tx, cred_t *cr,
zfs_acl_t *setaclp, zfs_fuid_info_t **fuidp)
{
uint64_t mode, fuid, fgid;
int error;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zfs_acl_t *aclp = NULL;
zfs_acl_t *paclp;
xvattr_t *xvap = (xvattr_t *)vap;
gid_t gid;
boolean_t need_chmod = B_TRUE;
if (setaclp)
aclp = setaclp;
mode = MAKEIMODE(vap->va_type, vap->va_mode);
/*
* Determine uid and gid.
*/
if ((flag & (IS_ROOT_NODE | IS_REPLAY)) ||
((flag & IS_XATTR) && (vap->va_type == VDIR))) {
fuid = zfs_fuid_create(zfsvfs, vap->va_uid, cr,
ZFS_OWNER, tx, fuidp);
fgid = zfs_fuid_create(zfsvfs, vap->va_gid, cr,
ZFS_GROUP, tx, fuidp);
gid = vap->va_gid;
} else {
fuid = zfs_fuid_create_cred(zfsvfs, ZFS_OWNER, tx, cr, fuidp);
fgid = 0;
if (vap->va_mask & AT_GID) {
fgid = zfs_fuid_create(zfsvfs, vap->va_gid, cr,
ZFS_GROUP, tx, fuidp);
gid = vap->va_gid;
if (fgid != parent->z_phys->zp_gid &&
!groupmember(vap->va_gid, cr) &&
secpolicy_vnode_create_gid(cr) != 0)
fgid = 0;
}
if (fgid == 0) {
if (parent->z_phys->zp_mode & S_ISGID) {
fgid = parent->z_phys->zp_gid;
gid = zfs_fuid_map_id(zfsvfs, fgid,
cr, ZFS_GROUP);
} else {
fgid = zfs_fuid_create_cred(zfsvfs,
ZFS_GROUP, tx, cr, fuidp);
#ifdef __FreeBSD__
gid = fgid = parent->z_phys->zp_gid;
#else
gid = crgetgid(cr);
#endif
}
}
}
/*
* If we're creating a directory, and the parent directory has the
* set-GID bit set, set it on the new directory.
* Otherwise, if the user is neither privileged nor a member of the
* file's new group, clear the file's set-GID bit.
*/
if ((parent->z_phys->zp_mode & S_ISGID) && (vap->va_type == VDIR)) {
mode |= S_ISGID;
} else {
if ((mode & S_ISGID) &&
secpolicy_vnode_setids_setgids(ZTOV(zp), cr, gid) != 0)
mode &= ~S_ISGID;
}
zp->z_phys->zp_uid = fuid;
zp->z_phys->zp_gid = fgid;
zp->z_phys->zp_mode = mode;
if (aclp == NULL) {
mutex_enter(&parent->z_lock);
if ((ZTOV(parent)->v_type == VDIR &&
(parent->z_phys->zp_flags & ZFS_INHERIT_ACE)) &&
!(zp->z_phys->zp_flags & ZFS_XATTR)) {
mutex_enter(&parent->z_acl_lock);
VERIFY(0 == zfs_acl_node_read(parent, &paclp, B_FALSE));
mutex_exit(&parent->z_acl_lock);
aclp = zfs_acl_inherit(zp, paclp, mode, &need_chmod);
zfs_acl_free(paclp);
} else {
aclp = zfs_acl_alloc(zfs_acl_version_zp(zp));
}
mutex_exit(&parent->z_lock);
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
if (need_chmod)
zfs_acl_chmod(zp, mode, aclp);
} else {
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
}
/* Force auto_inherit on all new directory objects */
if (vap->va_type == VDIR)
aclp->z_hints |= ZFS_ACL_AUTO_INHERIT;
error = zfs_aclset_common(zp, aclp, cr, fuidp, tx);
/* Set optional attributes if any */
if (vap->va_mask & AT_XVATTR)
zfs_xvattr_set(zp, xvap);
mutex_exit(&zp->z_lock);
mutex_exit(&zp->z_acl_lock);
ASSERT3U(error, ==, 0);
if (aclp != setaclp)
zfs_acl_free(aclp);
}
/*
* Retrieve a file's ACL
*/
int
zfs_getacl(znode_t *zp, vsecattr_t *vsecp, boolean_t skipaclchk, cred_t *cr)
{
zfs_acl_t *aclp;
ulong_t mask;
int error;
int count = 0;
int largeace = 0;
mask = vsecp->vsa_mask & (VSA_ACE | VSA_ACECNT |
VSA_ACE_ACLFLAGS | VSA_ACE_ALLTYPES);
if (error = zfs_zaccess(zp, ACE_READ_ACL, 0, skipaclchk, cr))
return (error);
if (mask == 0)
return (ENOSYS);
mutex_enter(&zp->z_acl_lock);
error = zfs_acl_node_read(zp, &aclp, B_FALSE);
if (error != 0) {
mutex_exit(&zp->z_acl_lock);
return (error);
}
/*
* Scan ACL to determine number of ACEs
*/
if ((zp->z_phys->zp_flags & ZFS_ACL_OBJ_ACE) &&
!(mask & VSA_ACE_ALLTYPES)) {
void *zacep = NULL;
uint64_t who;
uint32_t access_mask;
uint16_t type, iflags;
while (zacep = zfs_acl_next_ace(aclp, zacep,
&who, &access_mask, &iflags, &type)) {
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
largeace++;
continue;
default:
count++;
}
}
vsecp->vsa_aclcnt = count;
} else
count = aclp->z_acl_count;
if (mask & VSA_ACECNT) {
vsecp->vsa_aclcnt = count;
}
if (mask & VSA_ACE) {
size_t aclsz;
- zfs_acl_node_t *aclnode = list_head(&aclp->z_acl);
-
aclsz = count * sizeof (ace_t) +
sizeof (ace_object_t) * largeace;
vsecp->vsa_aclentp = kmem_alloc(aclsz, KM_SLEEP);
vsecp->vsa_aclentsz = aclsz;
if (aclp->z_version == ZFS_ACL_VERSION_FUID)
zfs_copy_fuid_2_ace(zp->z_zfsvfs, aclp, cr,
vsecp->vsa_aclentp, !(mask & VSA_ACE_ALLTYPES));
else {
- bcopy(aclnode->z_acldata, vsecp->vsa_aclentp,
- count * sizeof (ace_t));
+ zfs_acl_node_t *aclnode;
+ void *start = vsecp->vsa_aclentp;
+
+ for (aclnode = list_head(&aclp->z_acl); aclnode;
+ aclnode = list_next(&aclp->z_acl, aclnode)) {
+ bcopy(aclnode->z_acldata, start,
+ aclnode->z_size);
+ start = (caddr_t)start + aclnode->z_size;
+ }
+ ASSERT((caddr_t)start - (caddr_t)vsecp->vsa_aclentp ==
+ aclp->z_acl_bytes);
}
}
if (mask & VSA_ACE_ACLFLAGS) {
vsecp->vsa_aclflags = 0;
if (zp->z_phys->zp_flags & ZFS_ACL_DEFAULTED)
vsecp->vsa_aclflags |= ACL_DEFAULTED;
if (zp->z_phys->zp_flags & ZFS_ACL_PROTECTED)
vsecp->vsa_aclflags |= ACL_PROTECTED;
if (zp->z_phys->zp_flags & ZFS_ACL_AUTO_INHERIT)
vsecp->vsa_aclflags |= ACL_AUTO_INHERIT;
}
mutex_exit(&zp->z_acl_lock);
zfs_acl_free(aclp);
return (0);
}
int
zfs_vsec_2_aclp(zfsvfs_t *zfsvfs, vtype_t obj_type,
vsecattr_t *vsecp, zfs_acl_t **zaclp)
{
zfs_acl_t *aclp;
zfs_acl_node_t *aclnode;
int aclcnt = vsecp->vsa_aclcnt;
int error;
if (vsecp->vsa_aclcnt > MAX_ACL_ENTRIES || vsecp->vsa_aclcnt <= 0)
return (EINVAL);
aclp = zfs_acl_alloc(zfs_acl_version(zfsvfs->z_version));
aclp->z_hints = 0;
aclnode = zfs_acl_node_alloc(aclcnt * sizeof (zfs_object_ace_t));
if (aclp->z_version == ZFS_ACL_VERSION_INITIAL) {
if ((error = zfs_copy_ace_2_oldace(obj_type, aclp,
(ace_t *)vsecp->vsa_aclentp, aclnode->z_acldata,
aclcnt, &aclnode->z_size)) != 0) {
zfs_acl_free(aclp);
zfs_acl_node_free(aclnode);
return (error);
}
} else {
if ((error = zfs_copy_ace_2_fuid(obj_type, aclp,
vsecp->vsa_aclentp, aclnode->z_acldata, aclcnt,
&aclnode->z_size)) != 0) {
zfs_acl_free(aclp);
zfs_acl_node_free(aclnode);
return (error);
}
}
aclp->z_acl_bytes = aclnode->z_size;
aclnode->z_ace_count = aclcnt;
aclp->z_acl_count = aclcnt;
list_insert_head(&aclp->z_acl, aclnode);
/*
* If flags are being set then add them to z_hints
*/
if (vsecp->vsa_mask & VSA_ACE_ACLFLAGS) {
if (vsecp->vsa_aclflags & ACL_PROTECTED)
aclp->z_hints |= ZFS_ACL_PROTECTED;
if (vsecp->vsa_aclflags & ACL_DEFAULTED)
aclp->z_hints |= ZFS_ACL_DEFAULTED;
if (vsecp->vsa_aclflags & ACL_AUTO_INHERIT)
aclp->z_hints |= ZFS_ACL_AUTO_INHERIT;
}
*zaclp = aclp;
return (0);
}
/*
* Set a file's ACL
*/
int
zfs_setacl(znode_t *zp, vsecattr_t *vsecp, boolean_t skipaclchk, cred_t *cr)
{
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zilog_t *zilog = zfsvfs->z_log;
ulong_t mask = vsecp->vsa_mask & (VSA_ACE | VSA_ACECNT);
dmu_tx_t *tx;
int error;
zfs_acl_t *aclp;
zfs_fuid_info_t *fuidp = NULL;
if (mask == 0)
return (ENOSYS);
if (zp->z_phys->zp_flags & ZFS_IMMUTABLE)
return (EPERM);
if (error = zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr))
return (error);
error = zfs_vsec_2_aclp(zfsvfs, ZTOV(zp)->v_type, vsecp, &aclp);
if (error)
return (error);
/*
* If ACL wide flags aren't being set then preserve any
* existing flags.
*/
if (!(vsecp->vsa_mask & VSA_ACE_ACLFLAGS)) {
aclp->z_hints |= (zp->z_phys->zp_flags & V4_ACL_WIDE_FLAGS);
}
top:
if (error = zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr)) {
zfs_acl_free(aclp);
return (error);
}
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
if (zp->z_phys->zp_acl.z_acl_extern_obj) {
/* Are we upgrading ACL? */
if (zfsvfs->z_version <= ZPL_VERSION_FUID &&
zp->z_phys->zp_acl.z_acl_version ==
ZFS_ACL_VERSION_INITIAL) {
dmu_tx_hold_free(tx,
zp->z_phys->zp_acl.z_acl_extern_obj,
0, DMU_OBJECT_END);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, aclp->z_acl_bytes);
} else {
dmu_tx_hold_write(tx,
zp->z_phys->zp_acl.z_acl_extern_obj,
0, aclp->z_acl_bytes);
}
} else if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, aclp->z_acl_bytes);
}
if (aclp->z_has_fuids) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
mutex_exit(&zp->z_acl_lock);
mutex_exit(&zp->z_lock);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
zfs_acl_free(aclp);
return (error);
}
error = zfs_aclset_common(zp, aclp, cr, &fuidp, tx);
ASSERT(error == 0);
zfs_log_acl(zilog, tx, zp, vsecp, fuidp);
if (fuidp)
zfs_fuid_info_free(fuidp);
zfs_acl_free(aclp);
dmu_tx_commit(tx);
done:
mutex_exit(&zp->z_acl_lock);
mutex_exit(&zp->z_lock);
return (error);
}
/*
* working_mode returns the permissions that were not granted
*/
static int
zfs_zaccess_common(znode_t *zp, uint32_t v4_mode, uint32_t *working_mode,
boolean_t *check_privs, boolean_t skipaclchk, cred_t *cr)
{
zfs_acl_t *aclp;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
uid_t uid = crgetuid(cr);
uint64_t who;
uint16_t type, iflags;
uint16_t entry_type;
uint32_t access_mask;
uint32_t deny_mask = 0;
zfs_ace_hdr_t *acep = NULL;
boolean_t checkit;
uid_t fowner;
uid_t gowner;
/*
* Short circuit empty requests
*/
if (v4_mode == 0)
return (0);
*check_privs = B_TRUE;
if (zfsvfs->z_assign >= TXG_INITIAL) { /* ZIL replay */
*working_mode = 0;
return (0);
}
*working_mode = v4_mode;
if ((v4_mode & WRITE_MASK) &&
(zp->z_zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) &&
(!IS_DEVVP(ZTOV(zp)))) {
*check_privs = B_FALSE;
return (EROFS);
}
/*
* Only check for READONLY on non-directories.
*/
if ((v4_mode & WRITE_MASK_DATA) &&
(((ZTOV(zp)->v_type != VDIR) &&
(zp->z_phys->zp_flags & (ZFS_READONLY | ZFS_IMMUTABLE))) ||
(ZTOV(zp)->v_type == VDIR &&
(zp->z_phys->zp_flags & ZFS_IMMUTABLE)))) {
*check_privs = B_FALSE;
return (EPERM);
}
#ifdef sun
if ((v4_mode & (ACE_DELETE | ACE_DELETE_CHILD)) &&
(zp->z_phys->zp_flags & ZFS_NOUNLINK)) {
*check_privs = B_FALSE;
return (EPERM);
}
#else
/*
* In FreeBSD we allow modifying a directory's contents if ZFS_NOUNLINK
* (sunlnk) is set. We just don't allow directory removal, which is
* handled in zfs_zaccess_delete().
*/
if ((v4_mode & ACE_DELETE) &&
(zp->z_phys->zp_flags & ZFS_NOUNLINK)) {
*check_privs = B_FALSE;
return (EPERM);
}
#endif
if (((v4_mode & (ACE_READ_DATA|ACE_EXECUTE)) &&
(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED))) {
*check_privs = B_FALSE;
return (EACCES);
}
/*
* The caller requested that the ACL check be skipped. This
* would only happen if the caller checked VOP_ACCESS() with a
* 32 bit ACE mask and already had the appropriate permissions.
*/
if (skipaclchk) {
*working_mode = 0;
return (0);
}
zfs_fuid_map_ids(zp, cr, &fowner, &gowner);
mutex_enter(&zp->z_acl_lock);
error = zfs_acl_node_read(zp, &aclp, B_FALSE);
if (error != 0) {
mutex_exit(&zp->z_acl_lock);
return (error);
}
while (acep = zfs_acl_next_ace(aclp, acep, &who, &access_mask,
&iflags, &type)) {
if (!zfs_acl_valid_ace_type(type, iflags))
continue;
if (ZTOV(zp)->v_type == VDIR && (iflags & ACE_INHERIT_ONLY_ACE))
continue;
entry_type = (iflags & ACE_TYPE_FLAGS);
checkit = B_FALSE;
switch (entry_type) {
case ACE_OWNER:
if (uid == fowner)
checkit = B_TRUE;
break;
case OWNING_GROUP:
who = gowner;
/*FALLTHROUGH*/
case ACE_IDENTIFIER_GROUP:
checkit = zfs_groupmember(zfsvfs, who, cr);
break;
case ACE_EVERYONE:
checkit = B_TRUE;
break;
/* USER Entry */
default:
if (entry_type == 0) {
uid_t newid;
newid = zfs_fuid_map_id(zfsvfs, who, cr,
ZFS_ACE_USER);
if (newid != IDMAP_WK_CREATOR_OWNER_UID &&
uid == newid)
checkit = B_TRUE;
break;
} else {
zfs_acl_free(aclp);
mutex_exit(&zp->z_acl_lock);
return (EIO);
}
}
if (checkit) {
uint32_t mask_matched = (access_mask & *working_mode);
if (mask_matched) {
if (type == DENY)
deny_mask |= mask_matched;
*working_mode &= ~mask_matched;
}
}
/* Are we done? */
if (*working_mode == 0)
break;
}
mutex_exit(&zp->z_acl_lock);
zfs_acl_free(aclp);
/* Put the found 'denies' back on the working mode */
if (deny_mask) {
*working_mode |= deny_mask;
return (EACCES);
} else if (*working_mode) {
return (-1);
}
return (0);
}
static int
zfs_zaccess_append(znode_t *zp, uint32_t *working_mode, boolean_t *check_privs,
cred_t *cr)
{
if (*working_mode != ACE_WRITE_DATA)
return (EACCES);
return (zfs_zaccess_common(zp, ACE_APPEND_DATA, working_mode,
check_privs, B_FALSE, cr));
}
/*
* Determine whether Access should be granted/denied, invoking least
* priv subsystem when a deny is determined.
*/
int
zfs_zaccess(znode_t *zp, int mode, int flags, boolean_t skipaclchk, cred_t *cr)
{
uint32_t working_mode;
int error;
int is_attr;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
boolean_t check_privs;
znode_t *xzp;
znode_t *check_zp = zp;
is_attr = ((zp->z_phys->zp_flags & ZFS_XATTR) &&
(ZTOV(zp)->v_type == VDIR));
#ifdef __FreeBSD__
/*
* In FreeBSD, we don't care about permissions of individual ADS.
* Note that not checking them is not just an optimization - without
* this shortcut, EA operations may bogusly fail with EACCES.
*/
if (zp->z_phys->zp_flags & ZFS_XATTR)
return (0);
#else
/*
* If attribute then validate against base file
*/
if (is_attr) {
if ((error = zfs_zget(zp->z_zfsvfs,
zp->z_phys->zp_parent, &xzp)) != 0) {
return (error);
}
check_zp = xzp;
/*
* fixup mode to map to xattr perms
*/
if (mode & (ACE_WRITE_DATA|ACE_APPEND_DATA)) {
mode &= ~(ACE_WRITE_DATA|ACE_APPEND_DATA);
mode |= ACE_WRITE_NAMED_ATTRS;
}
if (mode & (ACE_READ_DATA|ACE_EXECUTE)) {
mode &= ~(ACE_READ_DATA|ACE_EXECUTE);
mode |= ACE_READ_NAMED_ATTRS;
}
}
#endif
if ((error = zfs_zaccess_common(check_zp, mode, &working_mode,
&check_privs, skipaclchk, cr)) == 0) {
if (is_attr)
VN_RELE(ZTOV(xzp));
return (0);
}
if (error && !check_privs) {
if (is_attr)
VN_RELE(ZTOV(xzp));
return (error);
}
if (error && (flags & V_APPEND)) {
error = zfs_zaccess_append(zp, &working_mode, &check_privs, cr);
}
if (error && check_privs) {
uid_t owner;
mode_t checkmode = 0;
owner = zfs_fuid_map_id(zfsvfs, check_zp->z_phys->zp_uid, cr,
ZFS_OWNER);
/*
* First check for implicit owner permission on
* read_acl/read_attributes
*/
error = 0;
ASSERT(working_mode != 0);
if ((working_mode & (ACE_READ_ACL|ACE_READ_ATTRIBUTES) &&
owner == crgetuid(cr)))
working_mode &= ~(ACE_READ_ACL|ACE_READ_ATTRIBUTES);
if (working_mode & (ACE_READ_DATA|ACE_READ_NAMED_ATTRS|
ACE_READ_ACL|ACE_READ_ATTRIBUTES|ACE_SYNCHRONIZE))
checkmode |= VREAD;
if (working_mode & (ACE_WRITE_DATA|ACE_WRITE_NAMED_ATTRS|
ACE_APPEND_DATA|ACE_WRITE_ATTRIBUTES|ACE_SYNCHRONIZE))
checkmode |= VWRITE;
if (working_mode & ACE_EXECUTE)
checkmode |= VEXEC;
if (checkmode)
error = secpolicy_vnode_access(cr, ZTOV(check_zp),
owner, checkmode);
if (error == 0 && (working_mode & ACE_WRITE_OWNER))
error = secpolicy_vnode_chown(ZTOV(check_zp), cr, B_TRUE);
if (error == 0 && (working_mode & ACE_WRITE_ACL))
error = secpolicy_vnode_setdac(ZTOV(check_zp), cr, owner);
if (error == 0 && (working_mode &
(ACE_DELETE|ACE_DELETE_CHILD)))
error = secpolicy_vnode_remove(ZTOV(check_zp), cr);
if (error == 0 && (working_mode & ACE_SYNCHRONIZE)) {
error = secpolicy_vnode_chown(ZTOV(check_zp), cr, B_FALSE);
}
if (error == 0) {
/*
* See if any bits other than those already checked
* for are still present. If so then return EACCES
*/
if (working_mode & ~(ZFS_CHECKED_MASKS)) {
error = EACCES;
}
}
}
if (is_attr)
VN_RELE(ZTOV(xzp));
return (error);
}
/*
* Translate traditional unix VREAD/VWRITE/VEXEC mode into
* native ACL format and call zfs_zaccess()
*/
int
zfs_zaccess_rwx(znode_t *zp, mode_t mode, int flags, cred_t *cr)
{
return (zfs_zaccess(zp, zfs_unix_to_v4(mode >> 6), flags, B_FALSE, cr));
}
/*
* Access function for secpolicy_vnode_setattr
*/
int
zfs_zaccess_unix(znode_t *zp, mode_t mode, cred_t *cr)
{
int v4_mode = zfs_unix_to_v4(mode >> 6);
return (zfs_zaccess(zp, v4_mode, 0, B_FALSE, cr));
}
static int
zfs_delete_final_check(znode_t *zp, znode_t *dzp,
mode_t missing_perms, cred_t *cr)
{
int error;
uid_t downer;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
downer = zfs_fuid_map_id(zfsvfs, dzp->z_phys->zp_uid, cr, ZFS_OWNER);
error = secpolicy_vnode_access(cr, ZTOV(dzp), downer, missing_perms);
if (error == 0)
error = zfs_sticky_remove_access(dzp, zp, cr);
return (error);
}
/*
* Determine whether access should be granted/denied, without
* consulting least priv subsystem.
*
*
* The following chart is the recommended NFSv4 enforcement for
* ability to delete an object.
*
* -------------------------------------------------------
* | Parent Dir | Target Object Permissions |
* | permissions | |
* -------------------------------------------------------
* | | ACL Allows | ACL Denies| Delete |
* | | Delete | Delete | unspecified|
* -------------------------------------------------------
* | ACL Allows | Permit | Permit | Permit |
* | DELETE_CHILD | |
* -------------------------------------------------------
* | ACL Denies | Permit | Deny | Deny |
* | DELETE_CHILD | | | |
* -------------------------------------------------------
* | ACL specifies | | | |
* | only allow | Permit | Permit | Permit |
* | write and | | | |
* | execute | | | |
* -------------------------------------------------------
* | ACL denies | | | |
* | write and | Permit | Deny | Deny |
* | execute | | | |
* -------------------------------------------------------
* ^
* |
* No search privilege, can't even look up file?
*
*/
int
zfs_zaccess_delete(znode_t *dzp, znode_t *zp, cred_t *cr)
{
uint32_t dzp_working_mode = 0;
uint32_t zp_working_mode = 0;
int dzp_error, zp_error;
mode_t missing_perms;
boolean_t dzpcheck_privs = B_TRUE;
boolean_t zpcheck_privs = B_TRUE;
/*
* We want specific DELETE permissions to
* take precedence over WRITE/EXECUTE. We don't
* want an ACL such as this to mess us up.
* user:joe:write_data:deny,user:joe:delete:allow
*
* However, deny permissions may ultimately be overridden
* by secpolicy_vnode_access().
*
* We will ask for all of the necessary permissions and then
* look at the working modes from the directory and target object
* to determine what was found.
*/
if (zp->z_phys->zp_flags & (ZFS_IMMUTABLE | ZFS_NOUNLINK))
return (EPERM);
/*
* First row
* If the directory permissions allow the delete, we are done.
*/
if ((dzp_error = zfs_zaccess_common(dzp, ACE_DELETE_CHILD,
&dzp_working_mode, &dzpcheck_privs, B_FALSE, cr)) == 0)
return (0);
/*
* If target object has delete permission then we are done
*/
if ((zp_error = zfs_zaccess_common(zp, ACE_DELETE, &zp_working_mode,
&zpcheck_privs, B_FALSE, cr)) == 0)
return (0);
ASSERT(dzp_error && zp_error);
if (!dzpcheck_privs)
return (dzp_error);
if (!zpcheck_privs)
return (zp_error);
/*
* Second row
*
* If directory returns EACCES then delete_child was denied
* due to deny delete_child. In this case send the request through
* secpolicy_vnode_remove(). We don't use zfs_delete_final_check()
* since that *could* allow the delete based on write/execute permission
* and we want delete permissions to override write/execute.
*/
if (dzp_error == EACCES)
return (secpolicy_vnode_remove(ZTOV(dzp), cr)); /* XXXPJD: s/dzp/zp/ ? */
/*
* Third Row
* only need to see if we have write/execute on directory.
*/
if ((dzp_error = zfs_zaccess_common(dzp, ACE_EXECUTE|ACE_WRITE_DATA,
&dzp_working_mode, &dzpcheck_privs, B_FALSE, cr)) == 0)
return (zfs_sticky_remove_access(dzp, zp, cr));
if (!dzpcheck_privs)
return (dzp_error);
/*
* Fourth row
*/
missing_perms = (dzp_working_mode & ACE_WRITE_DATA) ? VWRITE : 0;
missing_perms |= (dzp_working_mode & ACE_EXECUTE) ? VEXEC : 0;
ASSERT(missing_perms);
return (zfs_delete_final_check(zp, dzp, missing_perms, cr));
}
int
zfs_zaccess_rename(znode_t *sdzp, znode_t *szp, znode_t *tdzp,
znode_t *tzp, cred_t *cr)
{
int add_perm;
int error;
if (szp->z_phys->zp_flags & ZFS_AV_QUARANTINED)
return (EACCES);
add_perm = (ZTOV(szp)->v_type == VDIR) ?
ACE_ADD_SUBDIRECTORY : ACE_ADD_FILE;
/*
* Rename permissions are a combination of delete permission +
* add file/subdir permission.
*
* BSD operating systems also require write permission
* on the directory being moved from one parent directory
* to another.
*/
if (ZTOV(szp)->v_type == VDIR && ZTOV(sdzp) != ZTOV(tdzp)) {
if (error = zfs_zaccess(szp, ACE_WRITE_DATA, 0, B_FALSE, cr))
return (error);
}
/*
* first make sure we do the delete portion.
*
* If that succeeds then check for add_file/add_subdir permissions
*/
if (error = zfs_zaccess_delete(sdzp, szp, cr))
return (error);
/*
* If we have a tzp, see if we can delete it?
*/
if (tzp) {
if (error = zfs_zaccess_delete(tdzp, tzp, cr))
return (error);
}
/*
* Now check for add permissions
*/
error = zfs_zaccess(tdzp, add_perm, 0, B_FALSE, cr);
return (error);
}
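Note on the zfs_getacl() hunk in the zfs_acl.c diff above: the removed lines copied only the head node's z_acldata with a single bcopy(), while the added lines walk every node in aclp->z_acl and advance the destination pointer by each node's z_size. A minimal user-space sketch of that pattern is given here for illustration only. The names struct acl_node and flatten_nodes() are hypothetical and are not part of ZFS, and memcpy() stands in for the kernel's bcopy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for zfs_acl_node_t: one chunk of packed ACE bytes. */
struct acl_node {
	struct acl_node *next;	/* next chunk, or NULL at the tail */
	const void *data;	/* plays the role of z_acldata */
	size_t size;		/* plays the role of z_size */
};

/*
 * Copy every node's bytes into dst, in list order, and return the number
 * of bytes written.  The caller must size dst to the sum of all node
 * sizes (the role played by z_acl_bytes in the diff).
 */
static size_t
flatten_nodes(const struct acl_node *head, void *dst)
{
	char *out = dst;
	const struct acl_node *n;

	for (n = head; n != NULL; n = n->next) {
		memcpy(out, n->data, n->size);
		out += n->size;	/* next chunk lands right after this one */
	}
	return ((size_t)(out - (char *)dst));
}
```

Copying only the head node, as the removed lines did, would silently drop the bytes of any later node. The return value here can be compared against the expected total, which is the check the added ASSERT performs against aclp->z_acl_bytes.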
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 209274)
@@ -1,5053 +1,5051 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/* Portions Copyright 2007 Jeremy Teo */
#include <sys/types.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/systm.h>
#include <sys/sysmacros.h>
#include <sys/resource.h>
#include <sys/vfs.h>
#include <sys/vnode.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <sys/kmem.h>
#include <sys/taskq.h>
#include <sys/uio.h>
#include <sys/atomic.h>
#include <sys/namei.h>
#include <sys/mman.h>
#include <sys/cmn_err.h>
#include <sys/errno.h>
#include <sys/unistd.h>
#include <sys/zfs_dir.h>
#include <sys/zfs_ioctl.h>
#include <sys/fs/zfs.h>
#include <sys/dmu.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/dbuf.h>
#include <sys/zap.h>
#include <sys/dirent.h>
#include <sys/policy.h>
#include <sys/sunddi.h>
#include <sys/filio.h>
#include <sys/zfs_ctldir.h>
#include <sys/zfs_fuid.h>
#include <sys/dnlc.h>
#include <sys/zfs_rlock.h>
#include <sys/extdirent.h>
#include <sys/kidmap.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/sf_buf.h>
#include <sys/sched.h>
#include <sys/acl.h>
/*
* Programming rules.
*
* Each vnode op performs some logical unit of work. To do this, the ZPL must
* properly lock its in-core state, create a DMU transaction, do the work,
* record this work in the intent log (ZIL), commit the DMU transaction,
* and wait for the intent log to commit if it is a synchronous operation.
* Moreover, the vnode ops must work in both normal and log replay context.
* The ordering of events is important to avoid deadlocks and references
* to freed memory. The example below illustrates the following Big Rules:
*
* (1) A check must be made in each zfs thread for a mounted file system.
* This is done avoiding races using ZFS_ENTER(zfsvfs).
* A ZFS_EXIT(zfsvfs) is needed before all returns. Any znodes
* must be checked with ZFS_VERIFY_ZP(zp). Both of these macros
* can return EIO from the calling function.
*
* (2) VN_RELE() should always be the last thing except for zil_commit()
* (if necessary) and ZFS_EXIT(). This is for 3 reasons:
* First, if it's the last reference, the vnode/znode
* can be freed, so the zp may point to freed memory. Second, the last
* reference will call zfs_zinactive(), which may induce a lot of work --
* pushing cached pages (which acquires range locks) and syncing out
* cached atime changes. Third, zfs_zinactive() may require a new tx,
* which could deadlock the system if you were already holding one.
* If you must call VN_RELE() within a tx then use VN_RELE_ASYNC().
*
* (3) All range locks must be grabbed before calling dmu_tx_assign(),
* as they can span dmu_tx_assign() calls.
*
* (4) Always pass zfsvfs->z_assign as the second argument to dmu_tx_assign().
* In normal operation, this will be TXG_NOWAIT. During ZIL replay,
* it will be a specific txg. Either way, dmu_tx_assign() never blocks.
* This is critical because we don't want to block while holding locks.
* Note, in particular, that if a lock is sometimes acquired before
* the tx assigns, and sometimes after (e.g. z_lock), then failing to
* use a non-blocking assign can deadlock the system. The scenario:
*
* Thread A has grabbed a lock before calling dmu_tx_assign().
* Thread B is in an already-assigned tx, and blocks for this lock.
* Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
* forever, because the previous txg can't quiesce until B's tx commits.
*
* If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
* then drop all locks, call dmu_tx_wait(), and try again.
*
* (5) If the operation succeeded, generate the intent log entry for it
* before dropping locks. This ensures that the ordering of events
* in the intent log matches the order in which they actually occurred.
*
* (6) At the end of each vnode op, the DMU tx must always commit,
* regardless of whether there were any errors.
*
* (7) After dropping all locks, invoke zil_commit(zilog, seq, foid)
* to ensure that synchronous semantics are provided when necessary.
*
* In general, this is how things should be ordered in each vnode op:
*
* ZFS_ENTER(zfsvfs); // exit if unmounted
* top:
* zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD())
* rw_enter(...); // grab any other locks you need
* tx = dmu_tx_create(...); // get DMU tx
* dmu_tx_hold_*(); // hold each object you might modify
* error = dmu_tx_assign(tx, zfsvfs->z_assign); // try to assign
* if (error) {
* rw_exit(...); // drop locks
* zfs_dirent_unlock(dl); // unlock directory entry
* VN_RELE(...); // release held vnodes
* if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
* dmu_tx_wait(tx);
* dmu_tx_abort(tx);
* goto top;
* }
* dmu_tx_abort(tx); // abort DMU tx
* ZFS_EXIT(zfsvfs); // finished in zfs
* return (error); // really out of space
* }
* error = do_real_work(); // do whatever this VOP does
* if (error == 0)
* zfs_log_*(...); // on success, make ZIL entry
* dmu_tx_commit(tx); // commit DMU tx -- error or not
* rw_exit(...); // drop locks
* zfs_dirent_unlock(dl); // unlock directory entry
* VN_RELE(...); // release held vnodes
* zil_commit(zilog, seq, foid); // synchronous when necessary
* ZFS_EXIT(zfsvfs); // finished in zfs
* return (error); // done, report error
*/
/* ARGSUSED */
static int
zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(*vpp);
if ((flag & FWRITE) && (zp->z_phys->zp_flags & ZFS_APPENDONLY) &&
((flag & FAPPEND) == 0)) {
return (EPERM);
}
if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
ZTOV(zp)->v_type == VREG &&
!(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) &&
zp->z_phys->zp_size > 0)
if (fs_vscan(*vpp, cr, 0) != 0)
return (EACCES);
/* Keep a count of the synchronous opens in the znode */
if (flag & (FSYNC | FDSYNC))
atomic_inc_32(&zp->z_sync_cnt);
return (0);
}
/* ARGSUSED */
static int
zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
/* Decrement the synchronous opens in the znode */
if ((flag & (FSYNC | FDSYNC)) && (count == 1))
atomic_dec_32(&zp->z_sync_cnt);
/*
* Clean up any locks held by this process on the vp.
*/
cleanlocks(vp, ddi_get_pid(), 0);
cleanshares(vp, ddi_get_pid());
if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
ZTOV(zp)->v_type == VREG &&
!(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) &&
zp->z_phys->zp_size > 0)
VERIFY(fs_vscan(vp, cr, 1) == 0);
return (0);
}
/*
* Lseek support for finding holes (cmd == _FIO_SEEK_HOLE) and
* data (cmd == _FIO_SEEK_DATA). "off" is an in/out parameter.
*/
static int
zfs_holey(vnode_t *vp, u_long cmd, offset_t *off)
{
znode_t *zp = VTOZ(vp);
uint64_t noff = (uint64_t)*off; /* new offset */
uint64_t file_sz;
int error;
boolean_t hole;
file_sz = zp->z_phys->zp_size;
if (noff >= file_sz) {
return (ENXIO);
}
if (cmd == _FIO_SEEK_HOLE)
hole = B_TRUE;
else
hole = B_FALSE;
error = dmu_offset_next(zp->z_zfsvfs->z_os, zp->z_id, hole, &noff);
/* end of file? */
if ((error == ESRCH) || (noff > file_sz)) {
/*
* Handle the virtual hole at the end of file.
*/
if (hole) {
*off = file_sz;
return (0);
}
return (ENXIO);
}
if (noff < *off)
return (error);
*off = noff;
return (error);
}
/* ARGSUSED */
static int
zfs_ioctl(vnode_t *vp, u_long com, intptr_t data, int flag, cred_t *cred,
int *rvalp, caller_context_t *ct)
{
offset_t off;
int error;
zfsvfs_t *zfsvfs;
znode_t *zp;
switch (com) {
case _FIOFFS:
return (0);
/*
* The following two ioctls are used by bfu. Faking out,
* necessary to avoid bfu errors.
*/
case _FIOGDIO:
case _FIOSDIO:
return (0);
case _FIO_SEEK_DATA:
case _FIO_SEEK_HOLE:
if (ddi_copyin((void *)data, &off, sizeof (off), flag))
return (EFAULT);
zp = VTOZ(vp);
zfsvfs = zp->z_zfsvfs;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/* offset parameter is in/out */
error = zfs_holey(vp, com, &off);
ZFS_EXIT(zfsvfs);
if (error)
return (error);
if (ddi_copyout(&off, (void *)data, sizeof (off), flag))
return (EFAULT);
return (0);
}
return (ENOTTY);
}
/*
* When a file is memory mapped, we must keep the IO data synchronized
* between the DMU cache and the memory mapped pages. What this means:
*
* On Write: If we find a memory mapped page, we write to *both*
* the page and the dmu buffer.
*
* NOTE: We will always "break up" the IO into PAGESIZE uiomoves when
* the file is memory mapped.
*/
static int
mappedwrite(vnode_t *vp, int nbytes, uio_t *uio, dmu_tx_t *tx)
{
znode_t *zp = VTOZ(vp);
objset_t *os = zp->z_zfsvfs->z_os;
vm_object_t obj;
vm_page_t m;
struct sf_buf *sf;
int64_t start, off;
int len = nbytes;
int error = 0;
uint64_t dirbytes;
ASSERT(vp->v_mount != NULL);
obj = vp->v_object;
ASSERT(obj != NULL);
start = uio->uio_loffset;
off = start & PAGEOFFSET;
dirbytes = 0;
VM_OBJECT_LOCK(obj);
for (start &= PAGEMASK; len > 0; start += PAGESIZE) {
uint64_t bytes = MIN(PAGESIZE - off, len);
uint64_t fsize;
again:
if ((m = vm_page_lookup(obj, OFF_TO_IDX(start))) != NULL &&
vm_page_is_valid(m, (vm_offset_t)off, bytes)) {
uint64_t woff;
caddr_t va;
if (vm_page_sleep_if_busy(m, FALSE, "zfsmwb"))
goto again;
fsize = obj->un_pager.vnp.vnp_size;
vm_page_busy(m);
vm_page_lock_queues();
vm_page_undirty(m);
vm_page_unlock_queues();
VM_OBJECT_UNLOCK(obj);
if (dirbytes > 0) {
error = dmu_write_uio(os, zp->z_id, uio,
dirbytes, tx);
dirbytes = 0;
}
if (error == 0) {
sched_pin();
sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
va = (caddr_t)sf_buf_kva(sf);
woff = uio->uio_loffset - off;
error = uiomove(va + off, bytes, UIO_WRITE, uio);
/*
* The uiomove() above could have been partially
* successful, that's why we call dmu_write()
* below unconditionally. The page was marked
* non-dirty above and we would lose the changes
* without doing so. If the uiomove() failed
* entirely, well, we just write what we got
* before one more time.
*/
dmu_write(os, zp->z_id, woff,
MIN(PAGESIZE, fsize - woff), va, tx);
sf_buf_free(sf);
sched_unpin();
}
VM_OBJECT_LOCK(obj);
vm_page_wakeup(m);
} else {
if (__predict_false(obj->cache != NULL)) {
vm_page_cache_free(obj, OFF_TO_IDX(start),
OFF_TO_IDX(start) + 1);
}
dirbytes += bytes;
}
len -= bytes;
off = 0;
if (error)
break;
}
VM_OBJECT_UNLOCK(obj);
if (error == 0 && dirbytes > 0)
error = dmu_write_uio(os, zp->z_id, uio, dirbytes, tx);
return (error);
}
/*
* When a file is memory mapped, we must keep the IO data synchronized
* between the DMU cache and the memory mapped pages. What this means:
*
* On Read: We "read" preferentially from memory mapped pages,
* else we fall back to the dmu buffer.
*
* NOTE: We will always "break up" the IO into PAGESIZE uiomoves when
* the file is memory mapped.
*/
static int
mappedread(vnode_t *vp, int nbytes, uio_t *uio)
{
znode_t *zp = VTOZ(vp);
objset_t *os = zp->z_zfsvfs->z_os;
vm_object_t obj;
vm_page_t m;
struct sf_buf *sf;
int64_t start, off;
caddr_t va;
int len = nbytes;
int error = 0;
uint64_t dirbytes;
ASSERT(vp->v_mount != NULL);
obj = vp->v_object;
ASSERT(obj != NULL);
start = uio->uio_loffset;
off = start & PAGEOFFSET;
dirbytes = 0;
VM_OBJECT_LOCK(obj);
for (start &= PAGEMASK; len > 0; start += PAGESIZE) {
uint64_t bytes = MIN(PAGESIZE - off, len);
again:
if ((m = vm_page_lookup(obj, OFF_TO_IDX(start))) != NULL &&
vm_page_is_valid(m, (vm_offset_t)off, bytes)) {
if (vm_page_sleep_if_busy(m, FALSE, "zfsmrb"))
goto again;
vm_page_busy(m);
VM_OBJECT_UNLOCK(obj);
if (dirbytes > 0) {
error = dmu_read_uio(os, zp->z_id, uio,
dirbytes);
dirbytes = 0;
}
if (error == 0) {
sched_pin();
sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
va = (caddr_t)sf_buf_kva(sf);
error = uiomove(va + off, bytes, UIO_READ, uio);
sf_buf_free(sf);
sched_unpin();
}
VM_OBJECT_LOCK(obj);
vm_page_wakeup(m);
} else if (m != NULL && uio->uio_segflg == UIO_NOCOPY) {
/*
* The code below is here to make sendfile(2) work
* correctly with ZFS. As pointed out by ups@
* sendfile(2) should be changed to use VOP_GETPAGES(),
* but that would pessimize performance of sendfile/UFS, that's
* why I handle this special case in ZFS code.
*/
if (vm_page_sleep_if_busy(m, FALSE, "zfsmrb"))
goto again;
vm_page_busy(m);
VM_OBJECT_UNLOCK(obj);
if (dirbytes > 0) {
error = dmu_read_uio(os, zp->z_id, uio,
dirbytes);
dirbytes = 0;
}
if (error == 0) {
sched_pin();
sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
va = (caddr_t)sf_buf_kva(sf);
error = dmu_read(os, zp->z_id, start + off,
bytes, (void *)(va + off));
sf_buf_free(sf);
sched_unpin();
}
VM_OBJECT_LOCK(obj);
vm_page_wakeup(m);
if (error == 0)
uio->uio_resid -= bytes;
} else {
dirbytes += bytes;
}
len -= bytes;
off = 0;
if (error)
break;
}
VM_OBJECT_UNLOCK(obj);
if (error == 0 && dirbytes > 0)
error = dmu_read_uio(os, zp->z_id, uio, dirbytes);
return (error);
}
offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
/*
* Read bytes from specified file into supplied buffer.
*
* IN: vp - vnode of file to be read from.
* uio - structure supplying read location, range info,
* and return buffer.
* ioflag - SYNC flags; used to provide FRSYNC semantics.
* cr - credentials of caller.
* ct - caller context
*
* OUT: uio - updated offset and range, buffer filled.
*
* RETURN: 0 if success
* error code if failure
*
* Side Effects:
* vp - atime updated if byte count > 0
*/
/* ARGSUSED */
static int
zfs_read(vnode_t *vp, uio_t *uio, int ioflag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
objset_t *os;
ssize_t n, nbytes;
int error;
rl_t *rl;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
os = zfsvfs->z_os;
if (zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) {
ZFS_EXIT(zfsvfs);
return (EACCES);
}
/*
* Validate file offset
*/
if (uio->uio_loffset < (offset_t)0) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* Fasttrack empty reads
*/
if (uio->uio_resid == 0) {
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* Check for mandatory locks
*/
if (MANDMODE((mode_t)zp->z_phys->zp_mode)) {
if (error = chklock(vp, FREAD,
uio->uio_loffset, uio->uio_resid, uio->uio_fmode, ct)) {
ZFS_EXIT(zfsvfs);
return (error);
}
}
/*
* If we're in FRSYNC mode, sync out this znode before reading it.
*/
if (ioflag & FRSYNC)
zil_commit(zfsvfs->z_log, zp->z_last_itx, zp->z_id);
/*
* Lock the range against changes.
*/
rl = zfs_range_lock(zp, uio->uio_loffset, uio->uio_resid, RL_READER);
/*
* If we are reading past end-of-file we can skip
* to the end; but we might still need to set atime.
*/
if (uio->uio_loffset >= zp->z_phys->zp_size) {
error = 0;
goto out;
}
ASSERT(uio->uio_loffset < zp->z_phys->zp_size);
n = MIN(uio->uio_resid, zp->z_phys->zp_size - uio->uio_loffset);
while (n > 0) {
nbytes = MIN(n, zfs_read_chunk_size -
P2PHASE(uio->uio_loffset, zfs_read_chunk_size));
if (vn_has_cached_data(vp))
error = mappedread(vp, nbytes, uio);
else
error = dmu_read_uio(os, zp->z_id, uio, nbytes);
if (error) {
/* convert checksum errors into IO errors */
if (error == ECKSUM)
error = EIO;
break;
}
n -= nbytes;
}
out:
zfs_range_unlock(rl);
ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Fault in the pages of the first n bytes specified by the uio structure.
* 1 byte in each page is touched and the uio struct is unmodified.
* Any error will exit this routine as this is only a best
* attempt to get the pages resident. This is a copy of ufs_trans_touch().
*/
static void
zfs_prefault_write(ssize_t n, struct uio *uio)
{
struct iovec *iov;
ulong_t cnt, incr;
caddr_t p;
if (uio->uio_segflg != UIO_USERSPACE)
return;
iov = uio->uio_iov;
while (n) {
cnt = MIN(iov->iov_len, n);
if (cnt == 0) {
/* empty iov entry */
iov++;
continue;
}
n -= cnt;
/*
* touch each page in this segment.
*/
p = iov->iov_base;
while (cnt) {
if (fubyte(p) == -1)
return;
incr = MIN(cnt, PAGESIZE);
p += incr;
cnt -= incr;
}
/*
* touch the last byte in case it straddles a page.
*/
p--;
if (fubyte(p) == -1)
return;
iov++;
}
}
/*
* Write the bytes to a file.
*
* IN: vp - vnode of file to be written to.
* uio - structure supplying write location, range info,
* and data buffer.
* ioflag - IO_APPEND flag set if in append mode.
* cr - credentials of caller.
* ct - caller context (NFS/CIFS fem monitor only)
*
* OUT: uio - updated offset and range.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - ctime|mtime updated if byte count > 0
*/
/* ARGSUSED */
static int
zfs_write(vnode_t *vp, uio_t *uio, int ioflag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
rlim64_t limit = MAXOFFSET_T;
ssize_t start_resid = uio->uio_resid;
ssize_t tx_bytes;
uint64_t end_size;
dmu_tx_t *tx;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zilog_t *zilog;
offset_t woff;
ssize_t n, nbytes;
rl_t *rl;
int max_blksz = zfsvfs->z_max_blksz;
uint64_t pflags;
int error;
/*
* Fasttrack empty write
*/
n = start_resid;
if (n == 0)
return (0);
if (limit == RLIM64_INFINITY || limit > MAXOFFSET_T)
limit = MAXOFFSET_T;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/*
* If immutable or not appending then return EPERM
*/
pflags = zp->z_phys->zp_flags;
if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
(uio->uio_loffset < zp->z_phys->zp_size))) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
zilog = zfsvfs->z_log;
/*
* Pre-fault the pages to ensure slow (e.g. NFS) pages
* don't hold up txg.
*/
zfs_prefault_write(n, uio);
/*
* If in append mode, set the io offset pointer to eof.
*/
if (ioflag & IO_APPEND) {
/*
* Range lock for a file append:
* The value for the start of range will be determined by
* zfs_range_lock() (to guarantee append semantics).
* If this write will cause the block size to increase,
* zfs_range_lock() will lock the entire file, so we must
* later reduce the range after we grow the block size.
*/
rl = zfs_range_lock(zp, 0, n, RL_APPEND);
if (rl->r_len == UINT64_MAX) {
/* overlocked, zp_size can't change */
woff = uio->uio_loffset = zp->z_phys->zp_size;
} else {
woff = uio->uio_loffset = rl->r_off;
}
} else {
woff = uio->uio_loffset;
/*
* Validate file offset
*/
if (woff < 0) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* If we need to grow the block size then zfs_range_lock()
* will lock a wider range than we request here.
* Later after growing the block size we reduce the range.
*/
rl = zfs_range_lock(zp, woff, n, RL_WRITER);
}
if (woff >= limit) {
zfs_range_unlock(rl);
ZFS_EXIT(zfsvfs);
return (EFBIG);
}
if ((woff + n) > limit || woff > (limit - n))
n = limit - woff;
/*
* Check for mandatory locks
*/
if (MANDMODE((mode_t)zp->z_phys->zp_mode) &&
(error = chklock(vp, FWRITE, woff, n, uio->uio_fmode, ct)) != 0) {
zfs_range_unlock(rl);
ZFS_EXIT(zfsvfs);
return (error);
}
end_size = MAX(zp->z_phys->zp_size, woff + n);
/*
* Write the file in reasonable size chunks. Each chunk is written
* in a separate transaction; this keeps the intent log records small
* and allows us to do more fine-grained space accounting.
*/
while (n > 0) {
/*
* Start a transaction.
*/
woff = uio->uio_loffset;
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
dmu_tx_hold_write(tx, zp->z_id, woff, MIN(n, max_blksz));
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
if (error == ERESTART &&
zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
continue;
}
dmu_tx_abort(tx);
break;
}
/*
* If zfs_range_lock() over-locked we grow the blocksize
* and then reduce the lock range. This will only happen
* on the first iteration since zfs_range_reduce() will
* shrink down r_len to the appropriate size.
*/
if (rl->r_len == UINT64_MAX) {
uint64_t new_blksz;
if (zp->z_blksz > max_blksz) {
ASSERT(!ISP2(zp->z_blksz));
new_blksz = MIN(end_size, SPA_MAXBLOCKSIZE);
} else {
new_blksz = MIN(end_size, max_blksz);
}
zfs_grow_blocksize(zp, new_blksz, tx);
zfs_range_reduce(rl, woff, n);
}
/*
* XXX - should we really limit each write to z_max_blksz?
* Perhaps we should use SPA_MAXBLOCKSIZE chunks?
*/
nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz));
if (woff + nbytes > zp->z_phys->zp_size)
vnode_pager_setsize(vp, woff + nbytes);
rw_enter(&zp->z_map_lock, RW_READER);
tx_bytes = uio->uio_resid;
if (vn_has_cached_data(vp)) {
rw_exit(&zp->z_map_lock);
error = mappedwrite(vp, nbytes, uio, tx);
} else {
error = dmu_write_uio(zfsvfs->z_os, zp->z_id,
uio, nbytes, tx);
rw_exit(&zp->z_map_lock);
}
tx_bytes -= uio->uio_resid;
/*
* If we made no progress, we're done. If we made even
* partial progress, update the znode and ZIL accordingly.
*/
if (tx_bytes == 0) {
dmu_tx_commit(tx);
ASSERT(error != 0);
break;
}
/*
* Clear Set-UID/Set-GID bits on successful write if not
* privileged and at least one of the execute bits is set.
*
* It would be nice to do this after all writes have
* been done, but that would still expose the ISUID/ISGID
* to another app after the partial write is committed.
*
* Note: we don't call zfs_fuid_map_id() here because
* user 0 is not an ephemeral uid.
*/
mutex_enter(&zp->z_acl_lock);
if ((zp->z_phys->zp_mode & (S_IXUSR | (S_IXUSR >> 3) |
(S_IXUSR >> 6))) != 0 &&
(zp->z_phys->zp_mode & (S_ISUID | S_ISGID)) != 0 &&
secpolicy_vnode_setid_retain(vp, cr,
(zp->z_phys->zp_mode & S_ISUID) != 0 &&
zp->z_phys->zp_uid == 0) != 0) {
zp->z_phys->zp_mode &= ~(S_ISUID | S_ISGID);
}
mutex_exit(&zp->z_acl_lock);
/*
* Update time stamp. NOTE: This marks the bonus buffer as
* dirty, so we don't have to do it again for zp_size.
*/
zfs_time_stamper(zp, CONTENT_MODIFIED, tx);
/*
* Update the file size (zp_size) if it has changed;
* account for possible concurrent updates.
*/
while ((end_size = zp->z_phys->zp_size) < uio->uio_loffset)
(void) atomic_cas_64(&zp->z_phys->zp_size, end_size,
uio->uio_loffset);
zfs_log_write(zilog, tx, TX_WRITE, zp, woff, tx_bytes, ioflag);
dmu_tx_commit(tx);
if (error != 0)
break;
ASSERT(tx_bytes == nbytes);
n -= nbytes;
}
zfs_range_unlock(rl);
/*
* If we're in replay mode, or we made no progress, return error.
* Otherwise, it's at least a partial write, so it's successful.
*/
if (zfsvfs->z_assign >= TXG_INITIAL || uio->uio_resid == start_resid) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (ioflag & (FSYNC | FDSYNC))
zil_commit(zilog, zp->z_last_itx, zp->z_id);
ZFS_EXIT(zfsvfs);
return (0);
}
void
zfs_get_done(dmu_buf_t *db, void *vzgd)
{
zgd_t *zgd = (zgd_t *)vzgd;
rl_t *rl = zgd->zgd_rl;
vnode_t *vp = ZTOV(rl->r_zp);
objset_t *os = rl->r_zp->z_zfsvfs->z_os;
int vfslocked;
vfslocked = VFS_LOCK_GIANT(vp->v_vfsp);
dmu_buf_rele(db, vzgd);
zfs_range_unlock(rl);
/*
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(vp, dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
zil_add_block(zgd->zgd_zilog, zgd->zgd_bp);
kmem_free(zgd, sizeof (zgd_t));
VFS_UNLOCK_GIANT(vfslocked);
}
/*
* Get data to generate a TX_WRITE intent log record.
*/
int
zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio)
{
zfsvfs_t *zfsvfs = arg;
objset_t *os = zfsvfs->z_os;
znode_t *zp;
uint64_t off = lr->lr_offset;
dmu_buf_t *db;
rl_t *rl;
zgd_t *zgd;
int dlen = lr->lr_length; /* length of user data */
int error = 0;
ASSERT(zio);
ASSERT(dlen != 0);
/*
* Nothing to do if the file has been removed
*/
if (zfs_zget(zfsvfs, lr->lr_foid, &zp) != 0)
return (ENOENT);
if (zp->z_unlinked) {
/*
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(ZTOV(zp),
dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
return (ENOENT);
}
/*
* Write records come in two flavors: immediate and indirect.
* For small writes it's cheaper to store the data with the
* log record (immediate); for large writes it's cheaper to
* sync the data and get a pointer to it (indirect) so that
* we don't have to write the data twice.
*/
if (buf != NULL) { /* immediate write */
rl = zfs_range_lock(zp, off, dlen, RL_READER);
/* test for truncation needs to be done while range locked */
if (off >= zp->z_phys->zp_size) {
error = ENOENT;
goto out;
}
VERIFY(0 == dmu_read(os, lr->lr_foid, off, dlen, buf));
} else { /* indirect write */
uint64_t boff; /* block starting offset */
/*
* Have to lock the whole block to ensure when it's
* written out and its checksum is being calculated
* that no one can change the data. We need to re-check
* blocksize after we get the lock in case it's changed!
*/
for (;;) {
if (ISP2(zp->z_blksz)) {
boff = P2ALIGN_TYPED(off, zp->z_blksz,
uint64_t);
} else {
boff = 0;
}
dlen = zp->z_blksz;
rl = zfs_range_lock(zp, boff, dlen, RL_READER);
if (zp->z_blksz == dlen)
break;
zfs_range_unlock(rl);
}
/* test for truncation needs to be done while range locked */
if (off >= zp->z_phys->zp_size) {
error = ENOENT;
goto out;
}
zgd = (zgd_t *)kmem_alloc(sizeof (zgd_t), KM_SLEEP);
zgd->zgd_rl = rl;
zgd->zgd_zilog = zfsvfs->z_log;
zgd->zgd_bp = &lr->lr_blkptr;
VERIFY(0 == dmu_buf_hold(os, lr->lr_foid, boff, zgd, &db));
ASSERT(boff == db->db_offset);
lr->lr_blkoff = off - boff;
error = dmu_sync(zio, db, &lr->lr_blkptr,
lr->lr_common.lrc_txg, zfs_get_done, zgd);
ASSERT((error && error != EINPROGRESS) ||
lr->lr_length <= zp->z_blksz);
if (error == 0)
zil_add_block(zfsvfs->z_log, &lr->lr_blkptr);
/*
* If we get EINPROGRESS, then we need to wait for a
* write IO initiated by dmu_sync() to complete before
* we can release this dbuf. We will finish everything
* up in the zfs_get_done() callback.
*/
if (error == EINPROGRESS)
return (0);
dmu_buf_rele(db, zgd);
kmem_free(zgd, sizeof (zgd_t));
}
out:
zfs_range_unlock(rl);
/*
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
return (error);
}
/*ARGSUSED*/
static int
zfs_access(vnode_t *vp, int mode, int flag, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
if (flag & V_ACE_MASK)
error = zfs_zaccess(zp, mode, flag, B_FALSE, cr);
else
error = zfs_zaccess_rwx(zp, mode, flag, cr);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Lookup an entry in a directory, or an extended attribute directory.
* If it exists, return a held vnode reference for it.
*
* IN: dvp - vnode of directory to search.
* nm - name of entry to lookup.
* pnp - full pathname to lookup [UNUSED].
* flags - LOOKUP_XATTR set if looking for an attribute.
* rdir - root directory vnode [UNUSED].
* cr - credentials of caller.
* ct - caller context
* direntflags - directory lookup flags
* realpnp - returned pathname.
*
* OUT: vpp - vnode of located entry, NULL if not found.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* NA
*/
/* ARGSUSED */
static int
zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct componentname *cnp,
int nameiop, cred_t *cr, kthread_t *td, int flags)
{
znode_t *zdp = VTOZ(dvp);
zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
int error;
int *direntflags = NULL;
void *realpnp = NULL;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zdp);
*vpp = NULL;
if (flags & LOOKUP_XATTR) {
#ifdef TODO
/*
* If the xattr property is off, refuse the lookup request.
*/
if (!(zfsvfs->z_vfs->vfs_flag & VFS_XATTR)) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
#endif
/*
* We don't allow recursive attributes.
* Maybe someday we will.
*/
if (zdp->z_phys->zp_flags & ZFS_XATTR) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
if (error = zfs_get_xattrdir(VTOZ(dvp), vpp, cr, flags)) {
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Do we have permission to get into attribute directory?
*/
if (error = zfs_zaccess(VTOZ(*vpp), ACE_EXECUTE, 0,
B_FALSE, cr)) {
VN_RELE(*vpp);
*vpp = NULL;
}
ZFS_EXIT(zfsvfs);
return (error);
}
if (dvp->v_type != VDIR) {
ZFS_EXIT(zfsvfs);
return (ENOTDIR);
}
/*
* Check accessibility of directory.
*/
if (error = zfs_zaccess(zdp, ACE_EXECUTE, 0, B_FALSE, cr)) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (zfsvfs->z_utf8 && u8_validate(nm, strlen(nm),
NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
if (error == 0) {
/*
* Convert device special files
*/
if (IS_DEVVP(*vpp)) {
vnode_t *svp;
svp = specvp(*vpp, (*vpp)->v_rdev, (*vpp)->v_type, cr);
VN_RELE(*vpp);
if (svp == NULL)
error = ENOSYS;
else
*vpp = svp;
}
}
/* Translate errors and add SAVENAME when needed. */
if (cnp->cn_flags & ISLASTCN) {
switch (nameiop) {
case CREATE:
case RENAME:
if (error == ENOENT) {
error = EJUSTRETURN;
cnp->cn_flags |= SAVENAME;
break;
}
/* FALLTHROUGH */
case DELETE:
if (error == 0)
cnp->cn_flags |= SAVENAME;
break;
}
}
if (error == 0 && (nm[0] != '.' || nm[1] != '\0')) {
int ltype = 0;
if (cnp->cn_flags & ISDOTDOT) {
ltype = VOP_ISLOCKED(dvp);
VOP_UNLOCK(dvp, 0);
}
ZFS_EXIT(zfsvfs);
error = vn_lock(*vpp, cnp->cn_lkflags);
if (cnp->cn_flags & ISDOTDOT)
vn_lock(dvp, ltype | LK_RETRY);
if (error != 0) {
VN_RELE(*vpp);
*vpp = NULL;
return (error);
}
} else {
ZFS_EXIT(zfsvfs);
}
#ifdef FREEBSD_NAMECACHE
/*
* Insert name into cache (as non-existent) if appropriate.
*/
if (error == ENOENT && (cnp->cn_flags & MAKEENTRY) && nameiop != CREATE)
cache_enter(dvp, *vpp, cnp);
/*
* Insert name into cache if appropriate.
*/
if (error == 0 && (cnp->cn_flags & MAKEENTRY)) {
if (!(cnp->cn_flags & ISLASTCN) ||
(nameiop != DELETE && nameiop != RENAME)) {
cache_enter(dvp, *vpp, cnp);
}
}
#endif
return (error);
}
/*
* Attempt to create a new entry in a directory. If the entry
* already exists, truncate the file if permissible, else return
* an error. Return the vp of the created or trunc'd file.
*
* IN: dvp - vnode of directory to put new file entry in.
* name - name of new file entry.
* vap - attributes of new file.
* excl - flag indicating exclusive or non-exclusive mode.
* mode - mode to open file with.
* cr - credentials of caller.
* flag - large file flag [UNUSED].
* ct - caller context
* vsecp - ACL to be set
*
* OUT: vpp - vnode of created or trunc'd entry.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated if new entry created
* vp - ctime|mtime always, atime if new
*/
/* ARGSUSED */
static int
zfs_create(vnode_t *dvp, char *name, vattr_t *vap, int excl, int mode,
vnode_t **vpp, cred_t *cr, kthread_t *td)
{
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
objset_t *os;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
int error;
zfs_acl_t *aclp = NULL;
zfs_fuid_info_t *fuidp = NULL;
void *vsecp = NULL;
int flag = 0;
/*
* If we have an ephemeral id, ACL, or XVATTR then
* make sure file system is at proper version
*/
if (zfsvfs->z_use_fuids == B_FALSE &&
(vsecp || (vap->va_mask & AT_XVATTR) ||
IS_EPHEMERAL(crgetuid(cr)) || IS_EPHEMERAL(crgetgid(cr))))
return (EINVAL);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
os = zfsvfs->z_os;
zilog = zfsvfs->z_log;
if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (vap->va_mask & AT_XVATTR) {
if ((error = secpolicy_xvattr(dvp, (xvattr_t *)vap,
crgetuid(cr), cr, vap->va_type)) != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
}
top:
*vpp = NULL;
if ((vap->va_mode & S_ISVTX) && secpolicy_vnode_stky_modify(cr))
vap->va_mode &= ~S_ISVTX;
if (*name == '\0') {
/*
* Null component name refers to the directory itself.
*/
VN_HOLD(dvp);
zp = dzp;
dl = NULL;
error = 0;
} else {
/* possible VN_HOLD(zp) */
int zflg = 0;
if (flag & FIGNORECASE)
zflg |= ZCILOOK;
error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
NULL, NULL);
if (error) {
if (strcmp(name, "..") == 0)
error = EISDIR;
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (error);
}
}
if (vsecp && aclp == NULL) {
error = zfs_vsec_2_aclp(zfsvfs, vap->va_type, vsecp, &aclp);
if (error) {
ZFS_EXIT(zfsvfs);
if (dl)
zfs_dirent_unlock(dl);
return (error);
}
}
if (zp == NULL) {
uint64_t txtype;
/*
* Create a new file object and update the directory
* to reference it.
*/
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
goto out;
}
/*
* We only support the creation of regular files in
* extended attribute directories.
*/
if ((dzp->z_phys->zp_flags & ZFS_XATTR) &&
(vap->va_type != VREG)) {
error = EINVAL;
goto out;
}
tx = dmu_tx_create(os);
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
if ((aclp && aclp->z_has_fuids) || IS_EPHEMERAL(crgetuid(cr)) ||
IS_EPHEMERAL(crgetgid(cr))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ,
FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
dmu_tx_hold_bonus(tx, dzp->z_id);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
if ((dzp->z_phys->zp_flags & ZFS_INHERIT_ACE) || aclp) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, SPA_MAXBLOCKSIZE);
}
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART &&
zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (error);
}
zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, aclp, &fuidp);
(void) zfs_link_create(dl, zp, tx, ZNEW);
txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
if (flag & FIGNORECASE)
txtype |= TX_CI;
zfs_log_create(zilog, tx, txtype, dzp, zp, name,
vsecp, fuidp, vap);
if (fuidp)
zfs_fuid_info_free(fuidp);
dmu_tx_commit(tx);
} else {
int aflags = (flag & FAPPEND) ? V_APPEND : 0;
/*
* A directory entry already exists for this name.
*/
/*
* Can't truncate an existing file if in exclusive mode.
*/
if (excl == EXCL) {
error = EEXIST;
goto out;
}
/*
* Can't open a directory for writing.
*/
if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
error = EISDIR;
goto out;
}
/*
* Verify requested access to file.
*/
if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
goto out;
}
mutex_enter(&dzp->z_lock);
dzp->z_seq++;
mutex_exit(&dzp->z_lock);
/*
* Truncate regular files if requested.
*/
if ((ZTOV(zp)->v_type == VREG) &&
(vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
/* we can't hold any locks when calling zfs_freesp() */
zfs_dirent_unlock(dl);
dl = NULL;
error = zfs_freesp(zp, 0, 0, mode, TRUE);
if (error == 0) {
vnevent_create(ZTOV(zp), ct);
}
}
}
out:
if (dl)
zfs_dirent_unlock(dl);
if (error) {
if (zp)
VN_RELE(ZTOV(zp));
} else {
*vpp = ZTOV(zp);
/*
* If vnode is for a device return a specfs vnode instead.
*/
if (IS_DEVVP(*vpp)) {
struct vnode *svp;
svp = specvp(*vpp, (*vpp)->v_rdev, (*vpp)->v_type, cr);
VN_RELE(*vpp);
if (svp == NULL) {
error = ENOSYS;
}
*vpp = svp;
}
}
if (aclp)
zfs_acl_free(aclp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Remove an entry from a directory.
*
* IN: dvp - vnode of directory to remove entry from.
* name - name of entry to remove.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime
* vp - ctime (if nlink > 0)
*/
/*ARGSUSED*/
static int
zfs_remove(vnode_t *dvp, char *name, cred_t *cr, caller_context_t *ct,
int flags)
{
znode_t *zp, *dzp = VTOZ(dvp);
znode_t *xzp = NULL;
vnode_t *vp;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
uint64_t acl_obj, xattr_obj;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
boolean_t may_delete_now, delete_now = FALSE;
boolean_t unlinked, toobig = FALSE;
uint64_t txtype;
pathname_t *realnmp = NULL;
pathname_t realnm;
int error;
int zflg = ZEXISTS;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (flags & FIGNORECASE) {
zflg |= ZCILOOK;
pn_alloc(&realnm);
realnmp = &realnm;
}
top:
/*
* Attempt to lock directory; fail if entry doesn't exist.
*/
if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
NULL, realnmp)) {
if (realnmp)
pn_free(realnmp);
ZFS_EXIT(zfsvfs);
return (error);
}
vp = ZTOV(zp);
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
/*
* Need to use rmdir for removing directories.
*/
if (vp->v_type == VDIR) {
error = EPERM;
goto out;
}
vnevent_remove(vp, dvp, name, ct);
if (realnmp)
dnlc_remove(dvp, realnmp->pn_buf);
else
dnlc_remove(dvp, name);
may_delete_now = FALSE;
/*
* We may delete the znode now, or we may put it in the unlinked set;
* it depends on whether we're the last link, and on whether there are
* other holds on the vnode. So we dmu_tx_hold() the right things to
* allow for either case.
*/
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
dmu_tx_hold_bonus(tx, zp->z_id);
if (may_delete_now) {
toobig =
zp->z_phys->zp_size > zp->z_blksz * DMU_MAX_DELETEBLKCNT;
/* if the file is too big, only hold_free a token amount */
dmu_tx_hold_free(tx, zp->z_id, 0,
(toobig ? DMU_MAX_ACCESS : DMU_OBJECT_END));
}
/* are there any extended attributes? */
if ((xattr_obj = zp->z_phys->zp_xattr) != 0) {
/* XXX - do we need this if we are deleting? */
dmu_tx_hold_bonus(tx, xattr_obj);
}
/* are there any additional acls */
if ((acl_obj = zp->z_phys->zp_acl.z_acl_extern_obj) != 0 &&
may_delete_now)
dmu_tx_hold_free(tx, acl_obj, 0, DMU_OBJECT_END);
/* charge as an update -- would be nice not to charge at all */
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
VN_RELE(vp);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
if (realnmp)
pn_free(realnmp);
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Remove the directory entry.
*/
error = zfs_link_destroy(dl, zp, tx, zflg, &unlinked);
if (error) {
dmu_tx_commit(tx);
goto out;
}
if (0 && unlinked) {
VI_LOCK(vp);
delete_now = may_delete_now && !toobig &&
vp->v_count == 1 && !vn_has_cached_data(vp) &&
zp->z_phys->zp_xattr == xattr_obj &&
zp->z_phys->zp_acl.z_acl_extern_obj == acl_obj;
VI_UNLOCK(vp);
}
if (delete_now) {
if (zp->z_phys->zp_xattr) {
error = zfs_zget(zfsvfs, zp->z_phys->zp_xattr, &xzp);
ASSERT3U(error, ==, 0);
ASSERT3U(xzp->z_phys->zp_links, ==, 2);
dmu_buf_will_dirty(xzp->z_dbuf, tx);
mutex_enter(&xzp->z_lock);
xzp->z_unlinked = 1;
xzp->z_phys->zp_links = 0;
mutex_exit(&xzp->z_lock);
zfs_unlinked_add(xzp, tx);
zp->z_phys->zp_xattr = 0; /* probably unnecessary */
}
mutex_enter(&zp->z_lock);
VI_LOCK(vp);
vp->v_count--;
ASSERT3U(vp->v_count, ==, 0);
VI_UNLOCK(vp);
mutex_exit(&zp->z_lock);
zfs_znode_delete(zp, tx);
} else if (unlinked) {
zfs_unlinked_add(zp, tx);
}
txtype = TX_REMOVE;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_remove(zilog, tx, txtype, dzp, name);
dmu_tx_commit(tx);
out:
if (realnmp)
pn_free(realnmp);
zfs_dirent_unlock(dl);
if (!delete_now) {
VN_RELE(vp);
} else if (xzp) {
/* this rele is delayed to prevent nesting transactions */
VN_RELE(ZTOV(xzp));
}
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Create a new directory and insert it into dvp using the name
* provided. Return a pointer to the inserted directory.
*
* IN: dvp - vnode of directory to add subdir to.
* dirname - name of new directory.
* vap - attributes of new directory.
* cr - credentials of caller.
* ct - caller context
* vsecp - ACL to be set
*
* OUT: vpp - vnode of created directory.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated
* vp - ctime|mtime|atime updated
*/
/*ARGSUSED*/
static int
zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
caller_context_t *ct, int flags, vsecattr_t *vsecp)
{
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
uint64_t txtype;
dmu_tx_t *tx;
int error;
zfs_acl_t *aclp = NULL;
zfs_fuid_info_t *fuidp = NULL;
int zf = ZNEW;
ASSERT(vap->va_type == VDIR);
/*
* If we have an ephemeral id, ACL, or XVATTR then
* make sure file system is at proper version
*/
if (zfsvfs->z_use_fuids == B_FALSE &&
(vsecp || (vap->va_mask & AT_XVATTR) || IS_EPHEMERAL(crgetuid(cr))||
IS_EPHEMERAL(crgetgid(cr))))
return (EINVAL);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (dzp->z_phys->zp_flags & ZFS_XATTR) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
if (zfsvfs->z_utf8 && u8_validate(dirname,
strlen(dirname), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zf |= ZCILOOK;
if (vap->va_mask & AT_XVATTR)
if ((error = secpolicy_xvattr(dvp, (xvattr_t *)vap,
crgetuid(cr), cr, vap->va_type)) != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* First make sure the new directory doesn't exist.
*/
top:
*vpp = NULL;
if (error = zfs_dirent_lock(&dl, dzp, dirname, &zp, zf,
NULL, NULL)) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
if (vsecp && aclp == NULL) {
error = zfs_vsec_2_aclp(zfsvfs, vap->va_type, vsecp, &aclp);
if (error) {
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
}
/*
* Add a new entry to the directory.
*/
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, dirname);
dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, FALSE, NULL);
if ((aclp && aclp->z_has_fuids) || IS_EPHEMERAL(crgetuid(cr)) ||
IS_EPHEMERAL(crgetgid(cr))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
if ((dzp->z_phys->zp_flags & ZFS_INHERIT_ACE) || aclp)
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, SPA_MAXBLOCKSIZE);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (error);
}
/*
* Create new node.
*/
zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, aclp, &fuidp);
if (aclp)
zfs_acl_free(aclp);
/*
* Now put new name in parent dir.
*/
(void) zfs_link_create(dl, zp, tx, ZNEW);
*vpp = ZTOV(zp);
txtype = zfs_log_create_txtype(Z_DIR, vsecp, vap);
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_create(zilog, tx, txtype, dzp, zp, dirname, vsecp, fuidp, vap);
if (fuidp)
zfs_fuid_info_free(fuidp);
dmu_tx_commit(tx);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (0);
}
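The functions above (zfs_create, zfs_remove, zfs_mkdir) share one recovery path when dmu_tx_assign() fails: drop the directory entry lock, and if the error is ERESTART while z_assign is TXG_NOWAIT, wait, abort the transaction, and retry from `top`. The sketch below models only that control flow in user space. Everything in it is a stand-in: `fake_tx`, `fake_assign`, `SKETCH_ERESTART`, and `create_with_retry` are hypothetical names, not DMU interfaces, and the TXG_NOWAIT condition is folded into the single ERESTART test.

```c
#include <errno.h>

#define	SKETCH_ERESTART	(-1)	/* stand-in for the kernel's ERESTART */

/* Hypothetical transaction: reports "retry" a fixed number of times. */
struct fake_tx {
	int busy_rounds;	/* assigns left that will ask for a retry */
	int hard_error;		/* non-zero: fail with this errno instead */
};

static int
fake_assign(struct fake_tx *tx)
{
	if (tx->hard_error != 0)
		return (tx->hard_error);
	if (tx->busy_rounds > 0) {
		tx->busy_rounds--;
		return (SKETCH_ERESTART);
	}
	return (0);
}

/*
 * Returns 0 on success or the hard error; *attempts counts how many
 * times the body after "top:" ran.
 */
static int
create_with_retry(struct fake_tx *tx, int *attempts)
{
	int error;

	*attempts = 0;
top:
	(*attempts)++;
	/* ... take the directory entry lock, declare the holds ... */
	error = fake_assign(tx);
	if (error != 0) {
		/* ... drop the directory entry lock ... */
		if (error == SKETCH_ERESTART) {
			/* real code: dmu_tx_wait(tx); dmu_tx_abort(tx); */
			goto top;
		}
		/* real code: dmu_tx_abort(tx); */
		return (error);
	}
	/* ... do the work, log it, commit ... */
	return (0);
}
```

A transaction that asks for two retries runs the body three times, while a hard error such as ENOSPC returns after a single pass.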
/*
* Remove a directory subdir entry. If the current working
* directory is the same as the subdir to be removed, the
* remove will fail.
*
* IN: dvp - vnode of directory to remove from.
* name - name of directory to be removed.
* cwd - vnode of current working directory.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated
*/
/*ARGSUSED*/
static int
zfs_rmdir(vnode_t *dvp, char *name, vnode_t *cwd, cred_t *cr,
caller_context_t *ct, int flags)
{
znode_t *dzp = VTOZ(dvp);
znode_t *zp;
vnode_t *vp;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
int error;
int zflg = ZEXISTS;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (flags & FIGNORECASE)
zflg |= ZCILOOK;
top:
zp = NULL;
/*
* Attempt to lock directory; fail if entry doesn't exist.
*/
if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
NULL, NULL)) {
ZFS_EXIT(zfsvfs);
return (error);
}
vp = ZTOV(zp);
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
if (vp->v_type != VDIR) {
error = ENOTDIR;
goto out;
}
if (vp == cwd) {
error = EINVAL;
goto out;
}
vnevent_rmdir(vp, dvp, name, ct);
/*
* Grab a lock on the directory to make sure that no one is
* trying to add (or lookup) entries while we are removing it.
*/
rw_enter(&zp->z_name_lock, RW_WRITER);
/*
* Grab a lock on the parent pointer to make sure we play well
* with the treewalk and directory rename code.
*/
rw_enter(&zp->z_parent_lock, RW_WRITER);
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
dmu_tx_hold_bonus(tx, zp->z_id);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
rw_exit(&zp->z_parent_lock);
rw_exit(&zp->z_name_lock);
zfs_dirent_unlock(dl);
VN_RELE(vp);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
#ifdef FREEBSD_NAMECACHE
cache_purge(dvp);
#endif
error = zfs_link_destroy(dl, zp, tx, zflg, NULL);
if (error == 0) {
uint64_t txtype = TX_RMDIR;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_remove(zilog, tx, txtype, dzp, name);
}
dmu_tx_commit(tx);
rw_exit(&zp->z_parent_lock);
rw_exit(&zp->z_name_lock);
#ifdef FREEBSD_NAMECACHE
cache_purge(vp);
#endif
out:
zfs_dirent_unlock(dl);
VN_RELE(vp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Read as many directory entries as will fit into the provided
* buffer from the given directory cursor position (specified in
* the uio structure).
*
* IN: vp - vnode of directory to read.
* uio - structure supplying read location, range info,
* and return buffer.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* OUT: uio - updated offset and range, buffer filled.
* eofp - set to true if end-of-file detected.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - atime updated
*
* Note that the low 4 bits of the cookie returned by zap are always zero.
* This allows us to use the low range for "special" directory entries:
* We use 0 for '.', and 1 for '..'. If this is the root of the filesystem,
* we use the offset 2 for the '.zfs' directory.
*/
/* ARGSUSED */
static int
zfs_readdir(vnode_t *vp, uio_t *uio, cred_t *cr, int *eofp, int *ncookies, u_long **cookies)
{
znode_t *zp = VTOZ(vp);
iovec_t *iovp;
edirent_t *eodp;
dirent64_t *odp;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
objset_t *os;
caddr_t outbuf;
size_t bufsize;
zap_cursor_t zc;
zap_attribute_t zap;
uint_t bytes_wanted;
uint64_t offset; /* must be unsigned; checks for < 1 */
int local_eof;
int outcount;
int error;
uint8_t prefetch;
boolean_t check_sysattrs;
uint8_t type;
int ncooks;
u_long *cooks = NULL;
int flags = 0;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/*
* If we are not given an eof variable,
* use a local one.
*/
if (eofp == NULL)
eofp = &local_eof;
/*
* Check for valid iov_len.
*/
if (uio->uio_iov->iov_len <= 0) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* Quit if directory has been removed (POSIX)
*/
if ((*eofp = zp->z_unlinked) != 0) {
ZFS_EXIT(zfsvfs);
return (0);
}
error = 0;
os = zfsvfs->z_os;
offset = uio->uio_loffset;
prefetch = zp->z_zn_prefetch;
/*
* Initialize the iterator cursor.
*/
if (offset <= 3) {
/*
* Start iteration from the beginning of the directory.
*/
zap_cursor_init(&zc, os, zp->z_id);
} else {
/*
* The offset is a serialized cursor.
*/
zap_cursor_init_serialized(&zc, os, zp->z_id, offset);
}
/*
* Get space to change directory entries into fs independent format.
*/
iovp = uio->uio_iov;
bytes_wanted = iovp->iov_len;
if (uio->uio_segflg != UIO_SYSSPACE || uio->uio_iovcnt != 1) {
bufsize = bytes_wanted;
outbuf = kmem_alloc(bufsize, KM_SLEEP);
odp = (struct dirent64 *)outbuf;
} else {
bufsize = bytes_wanted;
odp = (struct dirent64 *)iovp->iov_base;
}
eodp = (struct edirent *)odp;
if (ncookies != NULL) {
/*
* Minimum entry size is dirent size and 1 byte for a file name.
*/
ncooks = uio->uio_resid / (sizeof(struct dirent) - sizeof(((struct dirent *)NULL)->d_name) + 1);
cooks = malloc(ncooks * sizeof(u_long), M_TEMP, M_WAITOK);
*cookies = cooks;
*ncookies = ncooks;
}
/*
* If this VFS supports the system attribute view interface; and
* we're looking at an extended attribute directory; and we care
* about normalization conflicts on this vfs; then we must check
* for normalization conflicts with the sysattr name space.
*/
#ifdef TODO
check_sysattrs = vfs_has_feature(vp->v_vfsp, VFSFT_SYSATTR_VIEWS) &&
(vp->v_flag & V_XATTRDIR) && zfsvfs->z_norm &&
(flags & V_RDDIR_ENTFLAGS);
#else
check_sysattrs = 0;
#endif
/*
* Transform to file-system independent format
*/
outcount = 0;
while (outcount < bytes_wanted) {
ino64_t objnum;
ushort_t reclen;
off64_t *next;
/*
* Special case `.', `..', and `.zfs'.
*/
if (offset == 0) {
(void) strcpy(zap.za_name, ".");
zap.za_normalization_conflict = 0;
objnum = zp->z_id;
type = DT_DIR;
} else if (offset == 1) {
(void) strcpy(zap.za_name, "..");
zap.za_normalization_conflict = 0;
objnum = zp->z_phys->zp_parent;
type = DT_DIR;
} else if (offset == 2 && zfs_show_ctldir(zp)) {
(void) strcpy(zap.za_name, ZFS_CTLDIR_NAME);
zap.za_normalization_conflict = 0;
objnum = ZFSCTL_INO_ROOT;
type = DT_DIR;
} else {
/*
* Grab next entry.
*/
if (error = zap_cursor_retrieve(&zc, &zap)) {
if ((*eofp = (error == ENOENT)) != 0)
break;
else
goto update;
}
if (zap.za_integer_length != 8 ||
zap.za_num_integers != 1) {
cmn_err(CE_WARN, "zap_readdir: bad directory "
"entry, obj = %lld, offset = %lld\n",
(u_longlong_t)zp->z_id,
(u_longlong_t)offset);
error = ENXIO;
goto update;
}
objnum = ZFS_DIRENT_OBJ(zap.za_first_integer);
/*
* MacOS X can extract the object type here such as:
* uint8_t type = ZFS_DIRENT_TYPE(zap.za_first_integer);
*/
type = ZFS_DIRENT_TYPE(zap.za_first_integer);
if (check_sysattrs && !zap.za_normalization_conflict) {
#ifdef TODO
zap.za_normalization_conflict =
xattr_sysattr_casechk(zap.za_name);
#else
panic("%s:%u: TODO", __func__, __LINE__);
#endif
}
}
if (flags & V_RDDIR_ENTFLAGS)
reclen = EDIRENT_RECLEN(strlen(zap.za_name));
else
reclen = DIRENT64_RECLEN(strlen(zap.za_name));
/*
* Will this entry fit in the buffer?
*/
if (outcount + reclen > bufsize) {
/*
* Did we manage to fit anything in the buffer?
*/
if (!outcount) {
error = EINVAL;
goto update;
}
break;
}
if (flags & V_RDDIR_ENTFLAGS) {
/*
* Add extended flag entry:
*/
eodp->ed_ino = objnum;
eodp->ed_reclen = reclen;
/* NOTE: ed_off is the offset for the *next* entry */
next = &(eodp->ed_off);
eodp->ed_eflags = zap.za_normalization_conflict ?
ED_CASE_CONFLICT : 0;
(void) strncpy(eodp->ed_name, zap.za_name,
EDIRENT_NAMELEN(reclen));
eodp = (edirent_t *)((intptr_t)eodp + reclen);
} else {
/*
* Add normal entry:
*/
odp->d_ino = objnum;
odp->d_reclen = reclen;
odp->d_namlen = strlen(zap.za_name);
(void) strlcpy(odp->d_name, zap.za_name, odp->d_namlen + 1);
odp->d_type = type;
odp = (dirent64_t *)((intptr_t)odp + reclen);
}
outcount += reclen;
ASSERT(outcount <= bufsize);
/* Prefetch znode */
if (prefetch)
dmu_prefetch(os, objnum, 0, 0);
/*
* Move to the next entry, fill in the previous offset.
*/
if (offset > 2 || (offset == 2 && !zfs_show_ctldir(zp))) {
zap_cursor_advance(&zc);
offset = zap_cursor_serialize(&zc);
} else {
offset += 1;
}
if (cooks != NULL) {
*cooks++ = offset;
ncooks--;
KASSERT(ncooks >= 0, ("ncookies=%d", ncooks));
}
}
zp->z_zn_prefetch = B_FALSE; /* a lookup will re-enable pre-fetching */
/* Subtract unused cookies */
if (ncookies != NULL)
*ncookies -= ncooks;
if (uio->uio_segflg == UIO_SYSSPACE && uio->uio_iovcnt == 1) {
iovp->iov_base += outcount;
iovp->iov_len -= outcount;
uio->uio_resid -= outcount;
} else if (error = uiomove(outbuf, (long)outcount, UIO_READ, uio)) {
/*
* Reset the pointer.
*/
offset = uio->uio_loffset;
}
update:
zap_cursor_fini(&zc);
if (uio->uio_segflg != UIO_SYSSPACE || uio->uio_iovcnt != 1)
kmem_free(outbuf, bufsize);
if (error == ENOENT)
error = 0;
ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
uio->uio_loffset = offset;
ZFS_EXIT(zfsvfs);
if (error != 0 && cookies != NULL) {
free(*cookies, M_TEMP);
*cookies = NULL;
*ncookies = 0;
}
return (error);
}
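The offset convention described in the zfs_readdir header comment can be sketched in isolation. This is a minimal illustration, not the kernel code: `entry_kind`, `classify_offset`, and `advances_zap_cursor` are hypothetical names, and the `show_ctldir` argument stands in for the result of zfs_show_ctldir(zp).

```c
#include <stdint.h>

/* Hypothetical labels for the kinds of entry the loop can emit. */
enum entry_kind {
	ENTRY_DOT,	/* offset 0: "." */
	ENTRY_DOTDOT,	/* offset 1: ".." */
	ENTRY_CTLDIR,	/* offset 2: ".zfs", only when the control dir is shown */
	ENTRY_ZAP	/* anything else: next entry from the ZAP cursor */
};

static enum entry_kind
classify_offset(uint64_t offset, int show_ctldir)
{
	if (offset == 0)
		return (ENTRY_DOT);
	if (offset == 1)
		return (ENTRY_DOTDOT);
	if (offset == 2 && show_ctldir)
		return (ENTRY_CTLDIR);
	return (ENTRY_ZAP);
}

/*
 * Mirrors the "move to the next entry" test in the loop: non-zero when
 * the ZAP cursor is advanced and re-serialized, zero when the offset is
 * simply incremented by one.
 */
static int
advances_zap_cursor(uint64_t offset, int show_ctldir)
{
	return (offset > 2 || (offset == 2 && !show_ctldir));
}
```

With the control directory hidden, offset 2 already refers to the first ZAP entry, which is why both helpers take `show_ctldir` into account.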
ulong_t zfs_fsync_sync_cnt = 4;
static int
zfs_fsync(vnode_t *vp, int syncflag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
(void) tsd_set(zfs_fsyncer_key, (void *)zfs_fsync_sync_cnt);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
zil_commit(zfsvfs->z_log, zp->z_last_itx, zp->z_id);
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* Get the requested file attributes and place them in the provided
* vattr structure.
*
* IN: vp - vnode of file.
* vap - va_mask identifies requested attributes.
* If AT_XVATTR set, then optional attrs are requested
* flags - ATTR_NOACLCHECK (CIFS server context)
* cr - credentials of caller.
* ct - caller context
*
* OUT: vap - attribute values.
*
* RETURN: 0 (always succeeds)
*/
/* ARGSUSED */
static int
zfs_getattr(vnode_t *vp, vattr_t *vap, int flags, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
znode_phys_t *pzp;
int error = 0;
uint32_t blksize;
u_longlong_t nblocks;
uint64_t links;
xvattr_t *xvap = (xvattr_t *)vap; /* vap may be an xvattr_t * */
xoptattr_t *xoap = NULL;
boolean_t skipaclchk = (flags & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
pzp = zp->z_phys;
- mutex_enter(&zp->z_lock);
-
/*
* If ACL is trivial don't bother looking for ACE_READ_ATTRIBUTES.
* Also, if we are the owner don't bother, since owner should
* always be allowed to read basic attributes of file.
*/
if (!(pzp->zp_flags & ZFS_ACL_TRIVIAL) &&
(pzp->zp_uid != crgetuid(cr))) {
if (error = zfs_zaccess(zp, ACE_READ_ATTRIBUTES, 0,
skipaclchk, cr)) {
- mutex_exit(&zp->z_lock);
ZFS_EXIT(zfsvfs);
return (error);
}
}
/*
* Return all attributes. It's cheaper to provide the answer
* than to determine whether we were asked the question.
*/
+ mutex_enter(&zp->z_lock);
vap->va_type = IFTOVT(pzp->zp_mode);
vap->va_mode = pzp->zp_mode & ~S_IFMT;
zfs_fuid_map_ids(zp, cr, &vap->va_uid, &vap->va_gid);
// vap->va_fsid = zp->z_zfsvfs->z_vfs->vfs_dev;
vap->va_nodeid = zp->z_id;
if ((vp->v_flag & VROOT) && zfs_show_ctldir(zp))
links = pzp->zp_links + 1;
else
links = pzp->zp_links;
vap->va_nlink = MIN(links, UINT32_MAX); /* nlink_t limit! */
vap->va_size = pzp->zp_size;
vap->va_fsid = vp->v_mount->mnt_stat.f_fsid.val[0];
vap->va_rdev = zfs_cmpldev(pzp->zp_rdev);
vap->va_seq = zp->z_seq;
vap->va_flags = 0; /* FreeBSD: Reset chflags(2) flags. */
/*
* Add in any requested optional attributes and the create time.
* Also set the corresponding bits in the returned attribute bitmap.
*/
if ((xoap = xva_getxoptattr(xvap)) != NULL && zfsvfs->z_use_fuids) {
if (XVA_ISSET_REQ(xvap, XAT_ARCHIVE)) {
xoap->xoa_archive =
((pzp->zp_flags & ZFS_ARCHIVE) != 0);
XVA_SET_RTN(xvap, XAT_ARCHIVE);
}
if (XVA_ISSET_REQ(xvap, XAT_READONLY)) {
xoap->xoa_readonly =
((pzp->zp_flags & ZFS_READONLY) != 0);
XVA_SET_RTN(xvap, XAT_READONLY);
}
if (XVA_ISSET_REQ(xvap, XAT_SYSTEM)) {
xoap->xoa_system =
((pzp->zp_flags & ZFS_SYSTEM) != 0);
XVA_SET_RTN(xvap, XAT_SYSTEM);
}
if (XVA_ISSET_REQ(xvap, XAT_HIDDEN)) {
xoap->xoa_hidden =
((pzp->zp_flags & ZFS_HIDDEN) != 0);
XVA_SET_RTN(xvap, XAT_HIDDEN);
}
if (XVA_ISSET_REQ(xvap, XAT_NOUNLINK)) {
xoap->xoa_nounlink =
((pzp->zp_flags & ZFS_NOUNLINK) != 0);
XVA_SET_RTN(xvap, XAT_NOUNLINK);
}
if (XVA_ISSET_REQ(xvap, XAT_IMMUTABLE)) {
xoap->xoa_immutable =
((pzp->zp_flags & ZFS_IMMUTABLE) != 0);
XVA_SET_RTN(xvap, XAT_IMMUTABLE);
}
if (XVA_ISSET_REQ(xvap, XAT_APPENDONLY)) {
xoap->xoa_appendonly =
((pzp->zp_flags & ZFS_APPENDONLY) != 0);
XVA_SET_RTN(xvap, XAT_APPENDONLY);
}
if (XVA_ISSET_REQ(xvap, XAT_NODUMP)) {
xoap->xoa_nodump =
((pzp->zp_flags & ZFS_NODUMP) != 0);
XVA_SET_RTN(xvap, XAT_NODUMP);
}
if (XVA_ISSET_REQ(xvap, XAT_OPAQUE)) {
xoap->xoa_opaque =
((pzp->zp_flags & ZFS_OPAQUE) != 0);
XVA_SET_RTN(xvap, XAT_OPAQUE);
}
if (XVA_ISSET_REQ(xvap, XAT_AV_QUARANTINED)) {
xoap->xoa_av_quarantined =
((pzp->zp_flags & ZFS_AV_QUARANTINED) != 0);
XVA_SET_RTN(xvap, XAT_AV_QUARANTINED);
}
if (XVA_ISSET_REQ(xvap, XAT_AV_MODIFIED)) {
xoap->xoa_av_modified =
((pzp->zp_flags & ZFS_AV_MODIFIED) != 0);
XVA_SET_RTN(xvap, XAT_AV_MODIFIED);
}
if (XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP) &&
vp->v_type == VREG &&
(pzp->zp_flags & ZFS_BONUS_SCANSTAMP)) {
size_t len;
dmu_object_info_t doi;
/*
* Only VREG files have anti-virus scanstamps, so we
* won't conflict with symlinks in the bonus buffer.
*/
dmu_object_info_from_db(zp->z_dbuf, &doi);
len = sizeof (xoap->xoa_av_scanstamp) +
sizeof (znode_phys_t);
if (len <= doi.doi_bonus_size) {
/*
* pzp points to the start of the
* znode_phys_t. pzp + 1 points to the
* first byte after the znode_phys_t.
*/
(void) memcpy(xoap->xoa_av_scanstamp,
pzp + 1,
sizeof (xoap->xoa_av_scanstamp));
XVA_SET_RTN(xvap, XAT_AV_SCANSTAMP);
}
}
if (XVA_ISSET_REQ(xvap, XAT_CREATETIME)) {
ZFS_TIME_DECODE(&xoap->xoa_createtime, pzp->zp_crtime);
XVA_SET_RTN(xvap, XAT_CREATETIME);
}
}
ZFS_TIME_DECODE(&vap->va_atime, pzp->zp_atime);
ZFS_TIME_DECODE(&vap->va_mtime, pzp->zp_mtime);
ZFS_TIME_DECODE(&vap->va_ctime, pzp->zp_ctime);
ZFS_TIME_DECODE(&vap->va_birthtime, pzp->zp_crtime);
mutex_exit(&zp->z_lock);
dmu_object_size_from_db(zp->z_dbuf, &blksize, &nblocks);
vap->va_blksize = blksize;
vap->va_bytes = nblocks << 9; /* nblocks * 512 */
if (zp->z_blksz == 0) {
/*
* Block size hasn't been set; suggest maximal I/O transfers.
*/
vap->va_blksize = zfsvfs->z_max_blksz;
}
ZFS_EXIT(zfsvfs);
return (0);
}
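Two small conversions in zfs_getattr are easy to check in isolation: va_bytes is the 512-byte block count shifted left by 9, and va_nlink is the 64-bit link count clamped to UINT32_MAX. A minimal sketch, assuming hypothetical helper names `blocks_to_bytes` and `clamp_nlink` that do not exist in the source:

```c
#include <stdint.h>

/* Convert a count of 512-byte blocks to bytes, as va_bytes does. */
static uint64_t
blocks_to_bytes(uint64_t nblocks)
{
	return (nblocks << 9);	/* nblocks * 512 */
}

/* Clamp a 64-bit link count the way MIN(links, UINT32_MAX) does. */
static uint32_t
clamp_nlink(uint64_t links)
{
	return ((uint32_t)(links < UINT32_MAX ? links : UINT32_MAX));
}
```

The clamp matters only for link counts that do not fit in 32 bits; smaller values pass through unchanged.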
/*
* Set the file attributes to the values contained in the
* vattr structure.
*
* IN: vp - vnode of file to be modified.
* vap - new attribute values.
* If AT_XVATTR set, then optional attrs are being set
* flags - ATTR_UTIME set if non-default time values provided.
* - ATTR_NOACLCHECK (CIFS context only).
* cr - credentials of caller.
* ct - caller context
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - ctime updated, mtime updated if size changed.
*/
/* ARGSUSED */
static int
zfs_setattr(vnode_t *vp, vattr_t *vap, int flags, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
znode_phys_t *pzp;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zilog_t *zilog;
dmu_tx_t *tx;
vattr_t oldva;
uint_t mask = vap->va_mask;
uint_t saved_mask;
uint64_t saved_mode;
int trim_mask = 0;
uint64_t new_mode;
znode_t *attrzp;
int need_policy = FALSE;
int err;
zfs_fuid_info_t *fuidp = NULL;
xvattr_t *xvap = (xvattr_t *)vap; /* vap may be an xvattr_t * */
xoptattr_t *xoap;
zfs_acl_t *aclp = NULL;
boolean_t skipaclchk = (flags & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
if (mask == 0)
return (0);
if (mask & AT_NOSET)
return (EINVAL);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
pzp = zp->z_phys;
zilog = zfsvfs->z_log;
/*
* Make sure that if we have ephemeral uid/gid or xvattr specified
* that file system is at proper version level
*/
if (zfsvfs->z_use_fuids == B_FALSE &&
(((mask & AT_UID) && IS_EPHEMERAL(vap->va_uid)) ||
((mask & AT_GID) && IS_EPHEMERAL(vap->va_gid)) ||
(mask & AT_XVATTR))) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
if (mask & AT_SIZE && vp->v_type == VDIR) {
ZFS_EXIT(zfsvfs);
return (EISDIR);
}
if (mask & AT_SIZE && vp->v_type != VREG && vp->v_type != VFIFO) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* If this is an xvattr_t, then get a pointer to the structure of
* optional attributes. If this is NULL, then we have a vattr_t.
*/
xoap = xva_getxoptattr(xvap);
/*
* Immutable files can only alter immutable bit and atime
*/
if ((pzp->zp_flags & ZFS_IMMUTABLE) &&
((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
if ((mask & AT_SIZE) && (pzp->zp_flags & ZFS_READONLY)) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
/*
* Verify that timestamps don't overflow 32 bits.
* ZFS can handle large timestamps, but 32bit syscalls can't
* handle times greater than 2039. This check should be removed
* once large timestamps are fully supported.
*/
if (mask & (AT_ATIME | AT_MTIME)) {
if (((mask & AT_ATIME) && TIMESPEC_OVERFLOW(&vap->va_atime)) ||
((mask & AT_MTIME) && TIMESPEC_OVERFLOW(&vap->va_mtime))) {
ZFS_EXIT(zfsvfs);
return (EOVERFLOW);
}
}
top:
attrzp = NULL;
if (zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) {
ZFS_EXIT(zfsvfs);
return (EROFS);
}
/*
* First validate permissions
*/
if (mask & AT_SIZE) {
err = zfs_zaccess(zp, ACE_WRITE_DATA, 0, skipaclchk, cr);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
/*
* XXX - Note, we are not providing any open
* mode flags here (like FNDELAY), so we may
* block if there are locks present... this
* should be addressed in openat().
*/
/* XXX - would it be OK to generate a log record here? */
err = zfs_freesp(zp, vap->va_size, 0, 0, FALSE);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
}
if (mask & (AT_ATIME|AT_MTIME) ||
((mask & AT_XVATTR) && (XVA_ISSET_REQ(xvap, XAT_HIDDEN) ||
XVA_ISSET_REQ(xvap, XAT_READONLY) ||
XVA_ISSET_REQ(xvap, XAT_ARCHIVE) ||
XVA_ISSET_REQ(xvap, XAT_CREATETIME) ||
XVA_ISSET_REQ(xvap, XAT_SYSTEM))))
need_policy = zfs_zaccess(zp, ACE_WRITE_ATTRIBUTES, 0,
skipaclchk, cr);
if (mask & (AT_UID|AT_GID)) {
int idmask = (mask & (AT_UID|AT_GID));
int take_owner;
int take_group;
/*
* NOTE: even if a new mode is being set,
* we may clear S_ISUID/S_ISGID bits.
*/
if (!(mask & AT_MODE))
vap->va_mode = pzp->zp_mode;
/*
* Take ownership or chgrp to group we are a member of
*/
take_owner = (mask & AT_UID) && (vap->va_uid == crgetuid(cr));
take_group = (mask & AT_GID) &&
zfs_groupmember(zfsvfs, vap->va_gid, cr);
/*
* If both AT_UID and AT_GID are set then take_owner and
* take_group must both be set in order to allow taking
* ownership.
*
* Otherwise, send the check through secpolicy_vnode_setattr()
*
*/
if (((idmask == (AT_UID|AT_GID)) && take_owner && take_group) ||
((idmask == AT_UID) && take_owner) ||
((idmask == AT_GID) && take_group)) {
if (zfs_zaccess(zp, ACE_WRITE_OWNER, 0,
skipaclchk, cr) == 0) {
/*
* Remove setuid/setgid for non-privileged users
*/
secpolicy_setid_clear(vap, vp, cr);
trim_mask = (mask & (AT_UID|AT_GID));
} else {
need_policy = TRUE;
}
} else {
need_policy = TRUE;
}
}
mutex_enter(&zp->z_lock);
oldva.va_mode = pzp->zp_mode;
zfs_fuid_map_ids(zp, cr, &oldva.va_uid, &oldva.va_gid);
if (mask & AT_XVATTR) {
if ((need_policy == FALSE) &&
(XVA_ISSET_REQ(xvap, XAT_APPENDONLY) &&
xoap->xoa_appendonly !=
((pzp->zp_flags & ZFS_APPENDONLY) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_NOUNLINK) &&
xoap->xoa_nounlink !=
((pzp->zp_flags & ZFS_NOUNLINK) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
xoap->xoa_immutable !=
((pzp->zp_flags & ZFS_IMMUTABLE) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_NODUMP) &&
xoap->xoa_nodump !=
((pzp->zp_flags & ZFS_NODUMP) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_AV_MODIFIED) &&
xoap->xoa_av_modified !=
((pzp->zp_flags & ZFS_AV_MODIFIED) != 0)) ||
((XVA_ISSET_REQ(xvap, XAT_AV_QUARANTINED) &&
((vp->v_type != VREG && xoap->xoa_av_quarantined) ||
xoap->xoa_av_quarantined !=
((pzp->zp_flags & ZFS_AV_QUARANTINED) != 0)))) ||
(XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP)) ||
(XVA_ISSET_REQ(xvap, XAT_OPAQUE))) {
need_policy = TRUE;
}
}
mutex_exit(&zp->z_lock);
if (mask & AT_MODE) {
if (zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr) == 0) {
err = secpolicy_setid_setsticky_clear(vp, vap,
&oldva, cr);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
trim_mask |= AT_MODE;
} else {
need_policy = TRUE;
}
}
if (need_policy) {
/*
* If trim_mask is set then take ownership
* has been granted or write_acl is present and user
* has the ability to modify mode. In that case remove
* UID|GID and/or MODE from mask so that
* secpolicy_vnode_setattr() doesn't revoke it.
*/
if (trim_mask) {
saved_mask = vap->va_mask;
vap->va_mask &= ~trim_mask;
if (trim_mask & AT_MODE) {
/*
* Save the mode, as secpolicy_vnode_setattr()
* will overwrite it with ova.va_mode.
*/
saved_mode = vap->va_mode;
}
}
err = secpolicy_vnode_setattr(cr, vp, vap, &oldva, flags,
(int (*)(void *, int, cred_t *))zfs_zaccess_unix, zp);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
if (trim_mask) {
vap->va_mask |= saved_mask;
if (trim_mask & AT_MODE) {
/*
* Recover the mode after
* secpolicy_vnode_setattr().
*/
vap->va_mode = saved_mode;
}
}
}
/*
* secpolicy_vnode_setattr, or take ownership may have
* changed va_mask
*/
mask = vap->va_mask;
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
if (((mask & AT_UID) && IS_EPHEMERAL(vap->va_uid)) ||
((mask & AT_GID) && IS_EPHEMERAL(vap->va_gid))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
if (mask & AT_MODE) {
uint64_t pmode = pzp->zp_mode;
new_mode = (pmode & S_IFMT) | (vap->va_mode & ~S_IFMT);
if (err = zfs_acl_chmod_setattr(zp, &aclp, new_mode)) {
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (err);
}
if (pzp->zp_acl.z_acl_extern_obj) {
/* Are we upgrading ACL from old V0 format to new V1 */
if (zfsvfs->z_version <= ZPL_VERSION_FUID &&
pzp->zp_acl.z_acl_version ==
ZFS_ACL_VERSION_INITIAL) {
dmu_tx_hold_free(tx,
pzp->zp_acl.z_acl_extern_obj, 0,
DMU_OBJECT_END);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, aclp->z_acl_bytes);
} else {
dmu_tx_hold_write(tx,
pzp->zp_acl.z_acl_extern_obj, 0,
aclp->z_acl_bytes);
}
} else if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, aclp->z_acl_bytes);
}
}
if ((mask & (AT_UID | AT_GID)) && pzp->zp_xattr != 0) {
err = zfs_zget(zp->z_zfsvfs, pzp->zp_xattr, &attrzp);
if (err) {
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (err);
}
dmu_tx_hold_bonus(tx, attrzp->z_id);
}
err = dmu_tx_assign(tx, zfsvfs->z_assign);
if (err) {
if (attrzp)
VN_RELE(ZTOV(attrzp));
if (aclp) {
zfs_acl_free(aclp);
aclp = NULL;
}
if (err == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (err);
}
dmu_buf_will_dirty(zp->z_dbuf, tx);
/*
* Set each attribute requested.
* We group settings according to the locks they need to acquire.
*
* Note: you cannot set ctime directly, although it will be
* updated as a side-effect of calling this function.
*/
mutex_enter(&zp->z_lock);
if (mask & AT_MODE) {
mutex_enter(&zp->z_acl_lock);
zp->z_phys->zp_mode = new_mode;
err = zfs_aclset_common(zp, aclp, cr, &fuidp, tx);
ASSERT3U(err, ==, 0);
mutex_exit(&zp->z_acl_lock);
}
if (attrzp)
mutex_enter(&attrzp->z_lock);
if (mask & AT_UID) {
pzp->zp_uid = zfs_fuid_create(zfsvfs,
vap->va_uid, cr, ZFS_OWNER, tx, &fuidp);
if (attrzp) {
attrzp->z_phys->zp_uid = zfs_fuid_create(zfsvfs,
vap->va_uid, cr, ZFS_OWNER, tx, &fuidp);
}
}
if (mask & AT_GID) {
pzp->zp_gid = zfs_fuid_create(zfsvfs, vap->va_gid,
cr, ZFS_GROUP, tx, &fuidp);
if (attrzp)
attrzp->z_phys->zp_gid = zfs_fuid_create(zfsvfs,
vap->va_gid, cr, ZFS_GROUP, tx, &fuidp);
}
if (aclp)
zfs_acl_free(aclp);
if (attrzp)
mutex_exit(&attrzp->z_lock);
if (mask & AT_ATIME)
ZFS_TIME_ENCODE(&vap->va_atime, pzp->zp_atime);
if (mask & AT_MTIME)
ZFS_TIME_ENCODE(&vap->va_mtime, pzp->zp_mtime);
/* XXX - shouldn't this be done *before* the ATIME/MTIME checks? */
if (mask & AT_SIZE)
zfs_time_stamper_locked(zp, CONTENT_MODIFIED, tx);
else if (mask != 0)
zfs_time_stamper_locked(zp, STATE_CHANGED, tx);
/*
* Do this after setting timestamps to prevent timestamp
* update from toggling bit
*/
if (xoap && (mask & AT_XVATTR)) {
if (XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP)) {
size_t len;
dmu_object_info_t doi;
ASSERT(vp->v_type == VREG);
/* Grow the bonus buffer if necessary. */
dmu_object_info_from_db(zp->z_dbuf, &doi);
len = sizeof (xoap->xoa_av_scanstamp) +
sizeof (znode_phys_t);
if (len > doi.doi_bonus_size)
VERIFY(dmu_set_bonus(zp->z_dbuf, len, tx) == 0);
}
zfs_xvattr_set(zp, xvap);
}
if (mask != 0)
zfs_log_setattr(zilog, tx, TX_SETATTR, zp, vap, mask, fuidp);
if (fuidp)
zfs_fuid_info_free(fuidp);
mutex_exit(&zp->z_lock);
if (attrzp)
VN_RELE(ZTOV(attrzp));
dmu_tx_commit(tx);
ZFS_EXIT(zfsvfs);
return (err);
}
typedef struct zfs_zlock {
krwlock_t *zl_rwlock; /* lock we acquired */
znode_t *zl_znode; /* znode we held */
struct zfs_zlock *zl_next; /* next in list */
} zfs_zlock_t;
/*
* Drop locks and release vnodes that were held by zfs_rename_lock().
*/
static void
zfs_rename_unlock(zfs_zlock_t **zlpp)
{
zfs_zlock_t *zl;
while ((zl = *zlpp) != NULL) {
if (zl->zl_znode != NULL)
VN_RELE(ZTOV(zl->zl_znode));
rw_exit(zl->zl_rwlock);
*zlpp = zl->zl_next;
kmem_free(zl, sizeof (*zl));
}
}
/*
* Search back through the directory tree, using the ".." entries.
* Lock each directory in the chain to prevent concurrent renames.
* Fail any attempt to move a directory into one of its own descendants.
* XXX - z_parent_lock can overlap with map or grow locks
*/
static int
zfs_rename_lock(znode_t *szp, znode_t *tdzp, znode_t *sdzp, zfs_zlock_t **zlpp)
{
zfs_zlock_t *zl;
znode_t *zp = tdzp;
uint64_t rootid = zp->z_zfsvfs->z_root;
uint64_t *oidp = &zp->z_id;
krwlock_t *rwlp = &szp->z_parent_lock;
krw_t rw = RW_WRITER;
/*
* First pass write-locks szp and compares to zp->z_id.
* Later passes read-lock zp and compare to zp->z_parent.
*/
do {
if (!rw_tryenter(rwlp, rw)) {
/*
* Another thread is renaming in this path.
* Note that if we are a WRITER, we don't have any
* parent_locks held yet.
*/
if (rw == RW_READER && zp->z_id > szp->z_id) {
/*
* Drop our locks and restart
*/
zfs_rename_unlock(&zl);
*zlpp = NULL;
zp = tdzp;
oidp = &zp->z_id;
rwlp = &szp->z_parent_lock;
rw = RW_WRITER;
continue;
} else {
/*
* Wait for other thread to drop its locks
*/
rw_enter(rwlp, rw);
}
}
zl = kmem_alloc(sizeof (*zl), KM_SLEEP);
zl->zl_rwlock = rwlp;
zl->zl_znode = NULL;
zl->zl_next = *zlpp;
*zlpp = zl;
if (*oidp == szp->z_id) /* We're a descendant of szp */
return (EINVAL);
if (*oidp == rootid) /* We've hit the top */
return (0);
if (rw == RW_READER) { /* i.e. not the first pass */
int error = zfs_zget(zp->z_zfsvfs, *oidp, &zp);
if (error)
return (error);
zl->zl_znode = zp;
}
oidp = &zp->z_phys->zp_parent;
rwlp = &zp->z_parent_lock;
rw = RW_READER;
} while (zp->z_id != sdzp->z_id);
return (0);
}
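/*
* Illustrative sketch, not part of this change: the ancestor walk that
* zfs_rename_lock() performs, reduced to plain pointers. It assumes a
* hypothetical in-memory struct dirnode with a parent link in place of
* znodes and zp_parent object ids. All locking and the restart path are
* left out, so it only shows how a move into a descendant is detected.
*/

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical directory node; `parent` is NULL only for the root. */
struct dirnode {
	uint64_t id;
	const struct dirnode *parent;
};

/*
 * Return 1 if `src` is `target_dir` itself or one of its ancestors,
 * i.e. moving `src` under `target_dir` would create a cycle.
 */
int
would_create_cycle(const struct dirnode *src, const struct dirnode *target_dir)
{
	const struct dirnode *d;

	assert(src != NULL);
	/* Follow the ".." chain from the target up to the root. */
	for (d = target_dir; d != NULL; d = d->parent) {
		if (d->id == src->id)
			return (1);
	}
	return (0);
}
```

/*
* In the sketch a return of 1 corresponds to the EINVAL case above, and
* reaching the root without a match corresponds to the "hit the top" case.
*/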
/*
* Move an entry from the provided source directory to the target
* directory. Change the entry name as indicated.
*
* IN: sdvp - Source directory containing the "old entry".
* snm - Old entry name.
* tdvp - Target directory to contain the "new entry".
* tnm - New entry name.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* sdvp,tdvp - ctime|mtime updated
*/
/*ARGSUSED*/
static int
zfs_rename(vnode_t *sdvp, char *snm, vnode_t *tdvp, char *tnm, cred_t *cr,
caller_context_t *ct, int flags)
{
znode_t *tdzp, *szp, *tzp;
znode_t *sdzp = VTOZ(sdvp);
zfsvfs_t *zfsvfs = sdzp->z_zfsvfs;
zilog_t *zilog;
vnode_t *realvp;
zfs_dirlock_t *sdl, *tdl;
dmu_tx_t *tx;
zfs_zlock_t *zl;
int cmp, serr, terr;
int error = 0;
int zflg = 0;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(sdzp);
zilog = zfsvfs->z_log;
/*
* Make sure we have the real vp for the target directory.
*/
if (VOP_REALVP(tdvp, &realvp, ct) == 0)
tdvp = realvp;
if (tdvp->v_vfsp != sdvp->v_vfsp) {
ZFS_EXIT(zfsvfs);
return (EXDEV);
}
tdzp = VTOZ(tdvp);
ZFS_VERIFY_ZP(tdzp);
if (zfsvfs->z_utf8 && u8_validate(tnm,
strlen(tnm), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zflg |= ZCILOOK;
top:
szp = NULL;
tzp = NULL;
zl = NULL;
/*
* This is to prevent the creation of links into attribute space
* by renaming a linked file into/out of an attribute directory.
* See the comment in zfs_link() for why this is considered bad.
*/
if ((tdzp->z_phys->zp_flags & ZFS_XATTR) !=
(sdzp->z_phys->zp_flags & ZFS_XATTR)) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* Lock source and target directory entries. To prevent deadlock,
* a lock ordering must be defined. We lock the directory with
* the smallest object id first, or if it's a tie, the one with
* the lexically first name.
*/
if (sdzp->z_id < tdzp->z_id) {
cmp = -1;
} else if (sdzp->z_id > tdzp->z_id) {
cmp = 1;
} else {
/*
* First compare the two name arguments without
* considering any case folding.
*/
int nofold = (zfsvfs->z_norm & ~U8_TEXTPREP_TOUPPER);
cmp = u8_strcmp(snm, tnm, 0, nofold, U8_UNICODE_LATEST, &error);
ASSERT(error == 0 || !zfsvfs->z_utf8);
if (cmp == 0) {
/*
* POSIX: "If the old argument and the new argument
* both refer to links to the same existing file,
* the rename() function shall return successfully
* and perform no other action."
*/
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* If the file system is case-folding, then we may
* have some more checking to do. A case-folding file
* system is either supporting mixed case sensitivity
* access or is completely case-insensitive. Note
* that the file system is always case preserving.
*
* In mixed sensitivity mode case sensitive behavior
* is the default. FIGNORECASE must be used to
* explicitly request case insensitive behavior.
*
* If the source and target names provided differ only
* by case (e.g., a request to rename 'tim' to 'Tim'),
* we will treat this as a special case in the
* case-insensitive mode: as long as the source name
* is an exact match, we will allow this to proceed as
* a name-change request.
*/
if ((zfsvfs->z_case == ZFS_CASE_INSENSITIVE ||
(zfsvfs->z_case == ZFS_CASE_MIXED &&
flags & FIGNORECASE)) &&
u8_strcmp(snm, tnm, 0, zfsvfs->z_norm, U8_UNICODE_LATEST,
&error) == 0) {
/*
* case preserving rename request, require exact
* name matches
*/
zflg |= ZCIEXACT;
zflg &= ~ZCILOOK;
}
}
/*
* If the source and destination directories are the same, we should
* grab the z_name_lock of that directory only once.
*/
if (sdzp == tdzp) {
zflg |= ZHAVELOCK;
rw_enter(&sdzp->z_name_lock, RW_READER);
}
if (cmp < 0) {
serr = zfs_dirent_lock(&sdl, sdzp, snm, &szp,
ZEXISTS | zflg, NULL, NULL);
terr = zfs_dirent_lock(&tdl,
tdzp, tnm, &tzp, ZRENAMING | zflg, NULL, NULL);
} else {
terr = zfs_dirent_lock(&tdl,
tdzp, tnm, &tzp, zflg, NULL, NULL);
serr = zfs_dirent_lock(&sdl,
sdzp, snm, &szp, ZEXISTS | ZRENAMING | zflg,
NULL, NULL);
}
if (serr) {
/*
* Source entry invalid or not there.
*/
if (!terr) {
zfs_dirent_unlock(tdl);
if (tzp)
VN_RELE(ZTOV(tzp));
}
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
if (strcmp(snm, ".") == 0 || strcmp(snm, "..") == 0)
serr = EINVAL;
ZFS_EXIT(zfsvfs);
return (serr);
}
if (terr) {
zfs_dirent_unlock(sdl);
VN_RELE(ZTOV(szp));
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
if (strcmp(tnm, "..") == 0)
terr = EINVAL;
ZFS_EXIT(zfsvfs);
return (terr);
}
/*
* Must have write access at the source to remove the old entry
* and write access at the target to create the new entry.
* Note that if target and source are the same, this can be
* done in a single check.
*/
if (error = zfs_zaccess_rename(sdzp, szp, tdzp, tzp, cr))
goto out;
if (ZTOV(szp)->v_type == VDIR) {
/*
* Check to make sure rename is valid.
* Can't do a move like this: /usr/a/b to /usr/a/b/c/d
*/
if (error = zfs_rename_lock(szp, tdzp, sdzp, &zl))
goto out;
}
/*
* Does target exist?
*/
if (tzp) {
/*
* Source and target must be the same type.
*/
if (ZTOV(szp)->v_type == VDIR) {
if (ZTOV(tzp)->v_type != VDIR) {
error = ENOTDIR;
goto out;
}
} else {
if (ZTOV(tzp)->v_type == VDIR) {
error = EISDIR;
goto out;
}
}
/*
* POSIX dictates that when the source and target
* entries refer to the same file object, rename
* must do nothing and exit without error.
*/
if (szp->z_id == tzp->z_id) {
error = 0;
goto out;
}
}
vnevent_rename_src(ZTOV(szp), sdvp, snm, ct);
if (tzp)
vnevent_rename_dest(ZTOV(tzp), tdvp, tnm, ct);
/*
* notify the target directory if it is not the same
* as source directory.
*/
if (tdvp != sdvp) {
vnevent_rename_dest_dir(tdvp, ct);
}
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, szp->z_id); /* nlink changes */
dmu_tx_hold_bonus(tx, sdzp->z_id); /* nlink changes */
dmu_tx_hold_zap(tx, sdzp->z_id, FALSE, snm);
dmu_tx_hold_zap(tx, tdzp->z_id, TRUE, tnm);
if (sdzp != tdzp)
dmu_tx_hold_bonus(tx, tdzp->z_id); /* nlink changes */
if (tzp)
dmu_tx_hold_bonus(tx, tzp->z_id); /* parent changes */
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
if (zl != NULL)
zfs_rename_unlock(&zl);
zfs_dirent_unlock(sdl);
zfs_dirent_unlock(tdl);
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
VN_RELE(ZTOV(szp));
if (tzp)
VN_RELE(ZTOV(tzp));
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
if (tzp) /* Attempt to remove the existing target */
error = zfs_link_destroy(tdl, tzp, tx, zflg, NULL);
if (error == 0) {
error = zfs_link_create(tdl, szp, tx, ZRENAMING);
if (error == 0) {
szp->z_phys->zp_flags |= ZFS_AV_MODIFIED;
error = zfs_link_destroy(sdl, szp, tx, ZRENAMING, NULL);
ASSERT(error == 0);
zfs_log_rename(zilog, tx,
TX_RENAME | (flags & FIGNORECASE ? TX_CI : 0),
sdzp, sdl->dl_name, tdzp, tdl->dl_name, szp);
/* Update path information for the target vnode */
vn_renamepath(tdvp, ZTOV(szp), tnm, strlen(tnm));
}
#ifdef FREEBSD_NAMECACHE
if (error == 0) {
cache_purge(sdvp);
cache_purge(tdvp);
}
#endif
}
dmu_tx_commit(tx);
out:
if (zl != NULL)
zfs_rename_unlock(&zl);
zfs_dirent_unlock(sdl);
zfs_dirent_unlock(tdl);
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
VN_RELE(ZTOV(szp));
if (tzp)
VN_RELE(ZTOV(tzp));
ZFS_EXIT(zfsvfs);
return (error);
}
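/*
* Illustrative sketch, not part of this change: the lock ordering rule
* described in zfs_rename() (smallest object id first, then the lexically
* first name), written as a comparison function. struct entry_ref and
* lock_order_cmp() are hypothetical names, and plain strcmp() stands in
* for the u8_strcmp() normalization used by the real code.
*/

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical description of one directory entry to be locked. */
struct entry_ref {
	uint64_t dir_id;	/* object id of the containing directory */
	const char *name;	/* entry name within that directory */
};

/*
 * Return < 0 if `a` must be locked first, > 0 if `b` must be locked
 * first, and 0 if both refer to the same name in the same directory.
 */
int
lock_order_cmp(const struct entry_ref *a, const struct entry_ref *b)
{
	assert(a != NULL && b != NULL);
	if (a->dir_id < b->dir_id)
		return (-1);
	if (a->dir_id > b->dir_id)
		return (1);
	/* Same directory: fall back to byte-wise name order. */
	return (strcmp(a->name, b->name));
}
```

/*
* Because every caller derives the same order from the same pair of
* entries, two concurrent renames cannot each hold the lock the other
* one is waiting for.
*/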
/*
* Insert the indicated symbolic reference entry into the directory.
*
* IN: dvp - Directory to contain new symbolic link.
* name - Name for new symlink entry.
* vap - Attributes of new entry.
* link - Target path of new symlink.
* cr - credentials of caller.
* td - calling thread
*
* OUT: vpp - vnode of the new symlink.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated
*/
/*ARGSUSED*/
static int
zfs_symlink(vnode_t *dvp, vnode_t **vpp, char *name, vattr_t *vap, char *link,
cred_t *cr, kthread_t *td)
{
znode_t *zp, *dzp = VTOZ(dvp);
zfs_dirlock_t *dl;
dmu_tx_t *tx;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
int len = strlen(link);
int error;
int zflg = ZNEW;
zfs_fuid_info_t *fuidp = NULL;
int flags = 0;
ASSERT(vap->va_type == VLNK);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zflg |= ZCILOOK;
top:
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (len > MAXPATHLEN) {
ZFS_EXIT(zfsvfs);
return (ENAMETOOLONG);
}
/*
* Attempt to lock directory; fail if entry already exists.
*/
error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg, NULL, NULL);
if (error) {
ZFS_EXIT(zfsvfs);
return (error);
}
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, MAX(1, len));
dmu_tx_hold_bonus(tx, dzp->z_id);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
if (dzp->z_phys->zp_flags & ZFS_INHERIT_ACE)
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, SPA_MAXBLOCKSIZE);
if (IS_EPHEMERAL(crgetuid(cr)) || IS_EPHEMERAL(crgetgid(cr))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
dmu_buf_will_dirty(dzp->z_dbuf, tx);
/*
* Create a new object for the symlink.
* Put the link content into bonus buffer if it will fit;
* otherwise, store it just like any other file data.
*/
if (sizeof (znode_phys_t) + len <= dmu_bonus_max()) {
zfs_mknode(dzp, vap, tx, cr, 0, &zp, len, NULL, &fuidp);
if (len != 0)
bcopy(link, zp->z_phys + 1, len);
} else {
dmu_buf_t *dbp;
zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, NULL, &fuidp);
/*
* Nothing can access the znode yet so no locking needed
* for growing the znode's blocksize.
*/
zfs_grow_blocksize(zp, len, tx);
VERIFY(0 == dmu_buf_hold(zfsvfs->z_os,
zp->z_id, 0, FTAG, &dbp));
dmu_buf_will_dirty(dbp, tx);
ASSERT3U(len, <=, dbp->db_size);
bcopy(link, dbp->db_data, len);
dmu_buf_rele(dbp, FTAG);
}
zp->z_phys->zp_size = len;
/*
* Insert the new object into the directory.
*/
(void) zfs_link_create(dl, zp, tx, ZNEW);
out:
if (error == 0) {
uint64_t txtype = TX_SYMLINK;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_symlink(zilog, tx, txtype, dzp, zp, name, link);
*vpp = ZTOV(zp);
}
if (fuidp)
zfs_fuid_info_free(fuidp);
dmu_tx_commit(tx);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Return, in the buffer contained in the provided uio structure,
* the symbolic path referred to by vp.
*
* IN: vp - vnode of symbolic link.
* uio - structure to contain the link path.
* cr - credentials of caller.
* ct - caller context
*
* OUT: uio - structure to contain the link path.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - atime updated
*/
/* ARGSUSED */
static int
zfs_readlink(vnode_t *vp, uio_t *uio, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
size_t bufsz;
int error;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
bufsz = (size_t)zp->z_phys->zp_size;
if (bufsz + sizeof (znode_phys_t) <= zp->z_dbuf->db_size) {
error = uiomove(zp->z_phys + 1,
MIN((size_t)bufsz, uio->uio_resid), UIO_READ, uio);
} else {
dmu_buf_t *dbp;
error = dmu_buf_hold(zfsvfs->z_os, zp->z_id, 0, FTAG, &dbp);
if (error) {
ZFS_EXIT(zfsvfs);
return (error);
}
error = uiomove(dbp->db_data,
MIN((size_t)bufsz, uio->uio_resid), UIO_READ, uio);
dmu_buf_rele(dbp, FTAG);
}
ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Insert a new entry into directory tdvp referencing svp.
*
* IN: tdvp - Directory to contain new entry.
* svp - vnode of new entry.
* name - name of new entry.
* cr - credentials of caller.
* ct - caller context
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* tdvp - ctime|mtime updated
* svp - ctime updated
*/
/* ARGSUSED */
static int
zfs_link(vnode_t *tdvp, vnode_t *svp, char *name, cred_t *cr,
caller_context_t *ct, int flags)
{
znode_t *dzp = VTOZ(tdvp);
znode_t *tzp, *szp;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
vnode_t *realvp;
int error;
int zf = ZNEW;
uid_t owner;
ASSERT(tdvp->v_type == VDIR);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (VOP_REALVP(svp, &realvp, ct) == 0)
svp = realvp;
if (svp->v_vfsp != tdvp->v_vfsp) {
ZFS_EXIT(zfsvfs);
return (EXDEV);
}
szp = VTOZ(svp);
ZFS_VERIFY_ZP(szp);
if (zfsvfs->z_utf8 && u8_validate(name,
strlen(name), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zf |= ZCILOOK;
top:
/*
* We do not support links between attributes and non-attributes
* because of the potential security risk of creating links
* into "normal" file space in order to circumvent restrictions
* imposed in attribute space.
*/
if ((szp->z_phys->zp_flags & ZFS_XATTR) !=
(dzp->z_phys->zp_flags & ZFS_XATTR)) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* POSIX dictates that we return EPERM here.
* Better choices include ENOTSUP or EISDIR.
*/
if (svp->v_type == VDIR) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
owner = zfs_fuid_map_id(zfsvfs, szp->z_phys->zp_uid, cr, ZFS_OWNER);
if (owner != crgetuid(cr) &&
secpolicy_basic_link(svp, cr) != 0) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Attempt to lock directory; fail if entry already exists.
*/
error = zfs_dirent_lock(&dl, dzp, name, &tzp, zf, NULL, NULL);
if (error) {
ZFS_EXIT(zfsvfs);
return (error);
}
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, szp->z_id);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
error = zfs_link_create(dl, szp, tx, 0);
if (error == 0) {
uint64_t txtype = TX_LINK;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_link(zilog, tx, txtype, dzp, szp, name);
}
dmu_tx_commit(tx);
zfs_dirent_unlock(dl);
if (error == 0) {
vnevent_link(svp, ct);
}
ZFS_EXIT(zfsvfs);
return (error);
}
/*ARGSUSED*/
void
zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
if (zp->z_dbuf == NULL) {
/*
* The fs has been unmounted, or we did a
* suspend/resume and this file no longer exists.
*/
VI_LOCK(vp);
vp->v_count = 0; /* count arrives as 1 */
VI_UNLOCK(vp);
vrecycle(vp, curthread);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
return;
}
if (zp->z_atime_dirty && zp->z_unlinked == 0) {
dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
error = dmu_tx_assign(tx, TXG_WAIT);
if (error) {
dmu_tx_abort(tx);
} else {
dmu_buf_will_dirty(zp->z_dbuf, tx);
mutex_enter(&zp->z_lock);
zp->z_atime_dirty = 0;
mutex_exit(&zp->z_lock);
dmu_tx_commit(tx);
}
}
zfs_zinactive(zp);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
}
CTASSERT(sizeof(struct zfid_short) <= sizeof(struct fid));
CTASSERT(sizeof(struct zfid_long) <= sizeof(struct fid));
/*ARGSUSED*/
static int
zfs_fid(vnode_t *vp, fid_t *fidp, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
uint32_t gen;
uint64_t object = zp->z_id;
zfid_short_t *zfid;
int size, i;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
gen = (uint32_t)zp->z_gen;
size = (zfsvfs->z_parent != zfsvfs) ? LONG_FID_LEN : SHORT_FID_LEN;
fidp->fid_len = size;
zfid = (zfid_short_t *)fidp;
zfid->zf_len = size;
for (i = 0; i < sizeof (zfid->zf_object); i++)
zfid->zf_object[i] = (uint8_t)(object >> (8 * i));
/* Must have a non-zero generation number to distinguish from .zfs */
if (gen == 0)
gen = 1;
for (i = 0; i < sizeof (zfid->zf_gen); i++)
zfid->zf_gen[i] = (uint8_t)(gen >> (8 * i));
if (size == LONG_FID_LEN) {
uint64_t objsetid = dmu_objset_id(zfsvfs->z_os);
zfid_long_t *zlfid;
zlfid = (zfid_long_t *)fidp;
for (i = 0; i < sizeof (zlfid->zf_setid); i++)
zlfid->zf_setid[i] = (uint8_t)(objsetid >> (8 * i));
/* XXX - this should be the generation number for the objset */
for (i = 0; i < sizeof (zlfid->zf_setgen); i++)
zlfid->zf_setgen[i] = 0;
}
ZFS_EXIT(zfsvfs);
return (0);
}
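/*
* Illustrative sketch, not part of this change: the byte-at-a-time
* encoding that zfs_fid() uses for zf_object, zf_gen and zf_setid, pulled
* out into a pair of helpers. fid_pack_le() and fid_unpack_le() are
* hypothetical names; the sketch only assumes that a field is a byte
* array of at most eight bytes, filled least significant byte first.
*/

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store the low `len` bytes of `val`, least significant byte first. */
void
fid_pack_le(uint8_t *dst, size_t len, uint64_t val)
{
	size_t i;

	assert(len <= sizeof (val));
	for (i = 0; i < len; i++)
		dst[i] = (uint8_t)(val >> (8 * i));
}

/* Rebuild a value from `len` bytes written by fid_pack_le(). */
uint64_t
fid_unpack_le(const uint8_t *src, size_t len)
{
	uint64_t val = 0;
	size_t i;

	assert(len <= sizeof (val));
	for (i = 0; i < len; i++)
		val |= (uint64_t)src[i] << (8 * i);
	return (val);
}
```

/*
* Writing the bytes explicitly keeps the file handle layout independent
* of host byte order; any bits above the field width are dropped.
*/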
static int
zfs_pathconf(vnode_t *vp, int cmd, ulong_t *valp, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp, *xzp;
zfsvfs_t *zfsvfs;
zfs_dirlock_t *dl;
int error;
switch (cmd) {
case _PC_LINK_MAX:
*valp = INT_MAX;
return (0);
case _PC_FILESIZEBITS:
*valp = 64;
return (0);
#if 0
case _PC_XATTR_EXISTS:
zp = VTOZ(vp);
zfsvfs = zp->z_zfsvfs;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
*valp = 0;
error = zfs_dirent_lock(&dl, zp, "", &xzp,
ZXATTR | ZEXISTS | ZSHARED, NULL, NULL);
if (error == 0) {
zfs_dirent_unlock(dl);
if (!zfs_dirempty(xzp))
*valp = 1;
VN_RELE(ZTOV(xzp));
} else if (error == ENOENT) {
/*
* If there aren't extended attributes, it's the
* same as having zero of them.
*/
error = 0;
}
ZFS_EXIT(zfsvfs);
return (error);
#endif
case _PC_ACL_EXTENDED:
*valp = 0;
return (0);
case _PC_ACL_NFS4:
*valp = 1;
return (0);
case _PC_ACL_PATH_MAX:
*valp = ACL_MAX_ENTRIES;
return (0);
case _PC_MIN_HOLE_SIZE:
*valp = (int)SPA_MINBLOCKSIZE;
return (0);
default:
return (EOPNOTSUPP);
}
}
/*ARGSUSED*/
static int
zfs_getsecattr(vnode_t *vp, vsecattr_t *vsecp, int flag, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
boolean_t skipaclchk = (flag & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
error = zfs_getacl(zp, vsecp, skipaclchk, cr);
ZFS_EXIT(zfsvfs);
return (error);
}
/*ARGSUSED*/
static int
zfs_setsecattr(vnode_t *vp, vsecattr_t *vsecp, int flag, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
boolean_t skipaclchk = (flag & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
error = zfs_setacl(zp, vsecp, skipaclchk, cr);
ZFS_EXIT(zfsvfs);
return (error);
}
static int
zfs_freebsd_open(ap)
struct vop_open_args /* {
struct vnode *a_vp;
int a_mode;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
znode_t *zp = VTOZ(vp);
int error;
error = zfs_open(&vp, ap->a_mode, ap->a_cred, NULL);
if (error == 0)
vnode_create_vobject(vp, zp->z_phys->zp_size, ap->a_td);
return (error);
}
static int
zfs_freebsd_close(ap)
struct vop_close_args /* {
struct vnode *a_vp;
int a_fflag;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
return (zfs_close(ap->a_vp, ap->a_fflag, 0, 0, ap->a_cred, NULL));
}
static int
zfs_freebsd_ioctl(ap)
struct vop_ioctl_args /* {
struct vnode *a_vp;
u_long a_command;
caddr_t a_data;
int a_fflag;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
return (zfs_ioctl(ap->a_vp, ap->a_command, (intptr_t)ap->a_data,
ap->a_fflag, ap->a_cred, NULL, NULL));
}
static int
zfs_freebsd_read(ap)
struct vop_read_args /* {
struct vnode *a_vp;
struct uio *a_uio;
int a_ioflag;
struct ucred *a_cred;
} */ *ap;
{
return (zfs_read(ap->a_vp, ap->a_uio, ap->a_ioflag, ap->a_cred, NULL));
}
static int
zfs_freebsd_write(ap)
struct vop_write_args /* {
struct vnode *a_vp;
struct uio *a_uio;
int a_ioflag;
struct ucred *a_cred;
} */ *ap;
{
return (zfs_write(ap->a_vp, ap->a_uio, ap->a_ioflag, ap->a_cred, NULL));
}
static int
zfs_freebsd_access(ap)
struct vop_access_args /* {
struct vnode *a_vp;
accmode_t a_accmode;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
accmode_t accmode;
int error = 0;
/*
* ZFS itself only knows about VREAD, VWRITE, VEXEC and VAPPEND.
*/
accmode = ap->a_accmode & (VREAD|VWRITE|VEXEC|VAPPEND);
if (accmode != 0)
error = zfs_access(ap->a_vp, accmode, 0, ap->a_cred, NULL);
/*
* VADMIN has to be handled by vaccess().
*/
if (error == 0) {
accmode = ap->a_accmode & ~(VREAD|VWRITE|VEXEC|VAPPEND);
if (accmode != 0) {
vnode_t *vp = ap->a_vp;
znode_t *zp = VTOZ(vp);
znode_phys_t *zphys = zp->z_phys;
error = vaccess(vp->v_type, zphys->zp_mode,
zphys->zp_uid, zphys->zp_gid, accmode, ap->a_cred,
NULL);
}
}
return (error);
}
static int
zfs_freebsd_lookup(ap)
struct vop_lookup_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
char nm[NAME_MAX + 1];
ASSERT(cnp->cn_namelen < sizeof(nm));
strlcpy(nm, cnp->cn_nameptr, MIN(cnp->cn_namelen + 1, sizeof(nm)));
return (zfs_lookup(ap->a_dvp, nm, ap->a_vpp, cnp, cnp->cn_nameiop,
cnp->cn_cred, cnp->cn_thread, 0));
}
static int
zfs_freebsd_create(ap)
struct vop_create_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
struct vattr *a_vap;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
vattr_t *vap = ap->a_vap;
int mode;
ASSERT(cnp->cn_flags & SAVENAME);
vattr_init_mask(vap);
mode = vap->va_mode & ALLPERMS;
return (zfs_create(ap->a_dvp, cnp->cn_nameptr, vap, !EXCL, mode,
ap->a_vpp, cnp->cn_cred, cnp->cn_thread));
}
static int
zfs_freebsd_remove(ap)
struct vop_remove_args /* {
struct vnode *a_dvp;
struct vnode *a_vp;
struct componentname *a_cnp;
} */ *ap;
{
ASSERT(ap->a_cnp->cn_flags & SAVENAME);
return (zfs_remove(ap->a_dvp, ap->a_cnp->cn_nameptr,
ap->a_cnp->cn_cred, NULL, 0));
}
static int
zfs_freebsd_mkdir(ap)
struct vop_mkdir_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
struct vattr *a_vap;
} */ *ap;
{
vattr_t *vap = ap->a_vap;
ASSERT(ap->a_cnp->cn_flags & SAVENAME);
vattr_init_mask(vap);
return (zfs_mkdir(ap->a_dvp, ap->a_cnp->cn_nameptr, vap, ap->a_vpp,
ap->a_cnp->cn_cred, NULL, 0, NULL));
}
static int
zfs_freebsd_rmdir(ap)
struct vop_rmdir_args /* {
struct vnode *a_dvp;
struct vnode *a_vp;
struct componentname *a_cnp;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
ASSERT(cnp->cn_flags & SAVENAME);
return (zfs_rmdir(ap->a_dvp, cnp->cn_nameptr, NULL, cnp->cn_cred, NULL, 0));
}
static int
zfs_freebsd_readdir(ap)
struct vop_readdir_args /* {
struct vnode *a_vp;
struct uio *a_uio;
struct ucred *a_cred;
int *a_eofflag;
int *a_ncookies;
u_long **a_cookies;
} */ *ap;
{
return (zfs_readdir(ap->a_vp, ap->a_uio, ap->a_cred, ap->a_eofflag,
ap->a_ncookies, ap->a_cookies));
}
static int
zfs_freebsd_fsync(ap)
struct vop_fsync_args /* {
struct vnode *a_vp;
int a_waitfor;
struct thread *a_td;
} */ *ap;
{
vop_stdfsync(ap);
return (zfs_fsync(ap->a_vp, 0, ap->a_td->td_ucred, NULL));
}
static int
zfs_freebsd_getattr(ap)
struct vop_getattr_args /* {
struct vnode *a_vp;
struct vattr *a_vap;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
vattr_t *vap = ap->a_vap;
xvattr_t xvap;
u_long fflags = 0;
int error;
xva_init(&xvap);
xvap.xva_vattr = *vap;
xvap.xva_vattr.va_mask |= AT_XVATTR;
/* Request the ZFS attributes that correspond to chflags. */
/* XXX: what about SF_SETTABLE? */
XVA_SET_REQ(&xvap, XAT_IMMUTABLE);
XVA_SET_REQ(&xvap, XAT_APPENDONLY);
XVA_SET_REQ(&xvap, XAT_NOUNLINK);
XVA_SET_REQ(&xvap, XAT_NODUMP);
error = zfs_getattr(ap->a_vp, (vattr_t *)&xvap, 0, ap->a_cred, NULL);
if (error != 0)
return (error);
/* Convert ZFS xattr into chflags. */
#define FLAG_CHECK(fflag, xflag, xfield) do { \
if (XVA_ISSET_RTN(&xvap, (xflag)) && (xfield) != 0) \
fflags |= (fflag); \
} while (0)
FLAG_CHECK(SF_IMMUTABLE, XAT_IMMUTABLE,
xvap.xva_xoptattrs.xoa_immutable);
FLAG_CHECK(SF_APPEND, XAT_APPENDONLY,
xvap.xva_xoptattrs.xoa_appendonly);
FLAG_CHECK(SF_NOUNLINK, XAT_NOUNLINK,
xvap.xva_xoptattrs.xoa_nounlink);
FLAG_CHECK(UF_NODUMP, XAT_NODUMP,
xvap.xva_xoptattrs.xoa_nodump);
#undef FLAG_CHECK
*vap = xvap.xva_vattr;
vap->va_flags = fflags;
return (0);
}
static int
zfs_freebsd_setattr(ap)
struct vop_setattr_args /* {
struct vnode *a_vp;
struct vattr *a_vap;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
vattr_t *vap = ap->a_vap;
cred_t *cred = ap->a_cred;
xvattr_t xvap;
u_long fflags;
uint64_t zflags;
vattr_init_mask(vap);
vap->va_mask &= ~AT_NOSET;
xva_init(&xvap);
xvap.xva_vattr = *vap;
zflags = VTOZ(vp)->z_phys->zp_flags;
if (vap->va_flags != VNOVAL) {
zfsvfs_t *zfsvfs = VTOZ(vp)->z_zfsvfs;
int error;
if (zfsvfs->z_use_fuids == B_FALSE)
return (EOPNOTSUPP);
fflags = vap->va_flags;
if ((fflags & ~(SF_IMMUTABLE|SF_APPEND|SF_NOUNLINK|UF_NODUMP)) != 0)
return (EOPNOTSUPP);
/*
* Unprivileged processes are not permitted to unset system
* flags, or modify flags if any system flags are set.
* Privileged non-jail processes may not modify system flags
* if securelevel > 0 and any existing system flags are set.
* Privileged jail processes behave like privileged non-jail
* processes if the security.jail.chflags_allowed sysctl is
* non-zero; otherwise, they behave like unprivileged
* processes.
*/
if (secpolicy_fs_owner(vp->v_mount, cred) == 0 ||
priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0) == 0) {
if (zflags &
(ZFS_IMMUTABLE | ZFS_APPENDONLY | ZFS_NOUNLINK)) {
error = securelevel_gt(cred, 0);
if (error != 0)
return (error);
}
} else {
/*
* Callers may only modify the file flags on objects they
* have VADMIN rights for.
*/
if ((error = VOP_ACCESS(vp, VADMIN, cred, curthread)) != 0)
return (error);
if (zflags &
(ZFS_IMMUTABLE | ZFS_APPENDONLY | ZFS_NOUNLINK)) {
return (EPERM);
}
if (fflags &
(SF_IMMUTABLE | SF_APPEND | SF_NOUNLINK)) {
return (EPERM);
}
}
#define FLAG_CHANGE(fflag, zflag, xflag, xfield) do { \
if (((fflags & (fflag)) && !(zflags & (zflag))) || \
((zflags & (zflag)) && !(fflags & (fflag)))) { \
XVA_SET_REQ(&xvap, (xflag)); \
(xfield) = ((fflags & (fflag)) != 0); \
} \
} while (0)
/* Convert chflags into ZFS-type flags. */
/* XXX: what about SF_SETTABLE? */
FLAG_CHANGE(SF_IMMUTABLE, ZFS_IMMUTABLE, XAT_IMMUTABLE,
xvap.xva_xoptattrs.xoa_immutable);
FLAG_CHANGE(SF_APPEND, ZFS_APPENDONLY, XAT_APPENDONLY,
xvap.xva_xoptattrs.xoa_appendonly);
FLAG_CHANGE(SF_NOUNLINK, ZFS_NOUNLINK, XAT_NOUNLINK,
xvap.xva_xoptattrs.xoa_nounlink);
FLAG_CHANGE(UF_NODUMP, ZFS_NODUMP, XAT_NODUMP,
xvap.xva_xoptattrs.xoa_nodump);
#undef FLAG_CHANGE
}
return (zfs_setattr(vp, (vattr_t *)&xvap, 0, cred, NULL));
}
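/*
* Illustrative sketch, not part of this change: the test at the heart of
* the FLAG_CHANGE() macro in zfs_freebsd_setattr(). An attribute is only
* requested when the caller's chflags bit and the stored ZFS bit disagree.
* flag_needs_change() is a hypothetical name, and the bit values used with
* it are made up for illustration; they are not the kernel's definitions.
*/

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 if the state requested in `fflags` (bit `fflag`) differs from
 * the state currently stored in `zflags` (bit `zflag`), else 0.
 */
int
flag_needs_change(uint32_t fflags, uint32_t fflag, uint64_t zflags,
    uint64_t zflag)
{
	int want = (fflags & fflag) != 0;	/* state the caller asked for */
	int have = (zflags & zflag) != 0;	/* state stored on the znode */

	assert(fflag != 0 && zflag != 0);
	return (want != have);
}
```

/*
* When the function returns 1, the macro sets the request bit and stores
* the wanted state, ((fflags & fflag) != 0), in the matching xoa_ field.
*/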
static int
zfs_freebsd_rename(ap)
struct vop_rename_args /* {
struct vnode *a_fdvp;
struct vnode *a_fvp;
struct componentname *a_fcnp;
struct vnode *a_tdvp;
struct vnode *a_tvp;
struct componentname *a_tcnp;
} */ *ap;
{
vnode_t *fdvp = ap->a_fdvp;
vnode_t *fvp = ap->a_fvp;
vnode_t *tdvp = ap->a_tdvp;
vnode_t *tvp = ap->a_tvp;
int error;
ASSERT(ap->a_fcnp->cn_flags & (SAVENAME|SAVESTART));
ASSERT(ap->a_tcnp->cn_flags & (SAVENAME|SAVESTART));
error = zfs_rename(fdvp, ap->a_fcnp->cn_nameptr, tdvp,
ap->a_tcnp->cn_nameptr, ap->a_fcnp->cn_cred, NULL, 0);
if (tdvp == tvp)
VN_RELE(tdvp);
else
VN_URELE(tdvp);
if (tvp)
VN_URELE(tvp);
VN_RELE(fdvp);
VN_RELE(fvp);
return (error);
}
static int
zfs_freebsd_symlink(ap)
struct vop_symlink_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
struct vattr *a_vap;
char *a_target;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
vattr_t *vap = ap->a_vap;
ASSERT(cnp->cn_flags & SAVENAME);
vap->va_type = VLNK; /* FreeBSD: Syscall only sets va_mode. */
vattr_init_mask(vap);
return (zfs_symlink(ap->a_dvp, ap->a_vpp, cnp->cn_nameptr, vap,
ap->a_target, cnp->cn_cred, cnp->cn_thread));
}
static int
zfs_freebsd_readlink(ap)
struct vop_readlink_args /* {
struct vnode *a_vp;
struct uio *a_uio;
struct ucred *a_cred;
} */ *ap;
{
return (zfs_readlink(ap->a_vp, ap->a_uio, ap->a_cred, NULL));
}
static int
zfs_freebsd_link(ap)
struct vop_link_args /* {
struct vnode *a_tdvp;
struct vnode *a_vp;
struct componentname *a_cnp;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
ASSERT(cnp->cn_flags & SAVENAME);
return (zfs_link(ap->a_tdvp, ap->a_vp, cnp->cn_nameptr, cnp->cn_cred, NULL, 0));
}
static int
zfs_freebsd_inactive(ap)
struct vop_inactive_args /* {
struct vnode *a_vp;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
zfs_inactive(vp, ap->a_td->td_ucred, NULL);
return (0);
}
static void
zfs_reclaim_complete(void *arg, int pending)
{
znode_t *zp = arg;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
if (zp->z_dbuf != NULL) {
ZFS_OBJ_HOLD_ENTER(zfsvfs, zp->z_id);
zfs_znode_dmu_fini(zp);
ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
}
zfs_znode_free(zp);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
/*
* If the file system is being unmounted, there is a process waiting
* for us; wake it up.
*/
if (zfsvfs->z_unmounted)
wakeup_one(zfsvfs);
}
static int
zfs_freebsd_reclaim(ap)
struct vop_reclaim_args /* {
struct vnode *a_vp;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
ASSERT(zp != NULL);
/*
* Destroy the vm object and flush associated pages.
*/
vnode_destroy_vobject(vp);
mutex_enter(&zp->z_lock);
ASSERT(zp->z_phys != NULL);
zp->z_vnode = NULL;
mutex_exit(&zp->z_lock);
if (zp->z_unlinked)
; /* Do nothing. */
else if (zp->z_dbuf == NULL)
zfs_znode_free(zp);
else /* if (!zp->z_unlinked && zp->z_dbuf != NULL) */ {
int locked;
locked = MUTEX_HELD(ZFS_OBJ_MUTEX(zfsvfs, zp->z_id)) ? 2 :
ZFS_OBJ_HOLD_TRYENTER(zfsvfs, zp->z_id);
if (locked == 0) {
/*
* Lock can't be obtained due to deadlock possibility,
* so defer znode destruction.
*/
TASK_INIT(&zp->z_task, 0, zfs_reclaim_complete, zp);
taskqueue_enqueue(taskqueue_thread, &zp->z_task);
} else {
zfs_znode_dmu_fini(zp);
if (locked == 1)
ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
zfs_znode_free(zp);
}
}
VI_LOCK(vp);
vp->v_data = NULL;
ASSERT(vp->v_holdcnt >= 1);
VI_UNLOCK(vp);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
return (0);
}
static int
zfs_freebsd_fid(ap)
struct vop_fid_args /* {
struct vnode *a_vp;
struct fid *a_fid;
} */ *ap;
{
return (zfs_fid(ap->a_vp, (void *)ap->a_fid, NULL));
}
static int
zfs_freebsd_pathconf(ap)
struct vop_pathconf_args /* {
struct vnode *a_vp;
int a_name;
register_t *a_retval;
} */ *ap;
{
ulong_t val;
int error;
error = zfs_pathconf(ap->a_vp, ap->a_name, &val, curthread->td_ucred, NULL);
if (error == 0)
*ap->a_retval = val;
else if (error == EOPNOTSUPP)
error = vop_stdpathconf(ap);
return (error);
}
static int
zfs_freebsd_fifo_pathconf(ap)
struct vop_pathconf_args /* {
struct vnode *a_vp;
int a_name;
register_t *a_retval;
} */ *ap;
{
switch (ap->a_name) {
case _PC_ACL_EXTENDED:
case _PC_ACL_NFS4:
case _PC_ACL_PATH_MAX:
case _PC_MAC_PRESENT:
return (zfs_freebsd_pathconf(ap));
default:
return (fifo_specops.vop_pathconf(ap));
}
}
/*
* The FreeBSD extended attribute namespace determines the file name prefix
* used for the ZFS extended attribute name:
*
* NAMESPACE PREFIX
* system freebsd:system:
* user (none, can be used to access ZFS fsattr(5) attributes
* created on Solaris)
*/
static int
zfs_create_attrname(int attrnamespace, const char *name, char *attrname,
size_t size)
{
const char *namespace, *prefix, *suffix;
/* We don't allow the '/' character in attribute names. */
if (strchr(name, '/') != NULL)
return (EINVAL);
/* We don't allow attribute names that start with the "freebsd:" string. */
if (strncmp(name, "freebsd:", 8) == 0)
return (EINVAL);
bzero(attrname, size);
switch (attrnamespace) {
case EXTATTR_NAMESPACE_USER:
#if 0
prefix = "freebsd:";
namespace = EXTATTR_NAMESPACE_USER_STRING;
suffix = ":";
#else
/*
* This is the default namespace by which we can access all
* attributes created on Solaris.
*/
prefix = namespace = suffix = "";
#endif
break;
case EXTATTR_NAMESPACE_SYSTEM:
prefix = "freebsd:";
namespace = EXTATTR_NAMESPACE_SYSTEM_STRING;
suffix = ":";
break;
case EXTATTR_NAMESPACE_EMPTY:
default:
return (EINVAL);
}
if (snprintf(attrname, size, "%s%s%s%s", prefix, namespace, suffix,
name) >= size) {
return (ENAMETOOLONG);
}
return (0);
}
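/*
* Illustrative sketch, not part of this change: a minimal userland version of
* the name mapping performed by zfs_create_attrname() above. The demo_*
* names and the enum values are hypothetical stand-ins for the kernel's
* EXTATTR_NAMESPACE_* constants; only the prefix rules and the error
* checks are taken from the function above.
*/

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel's EXTATTR_NAMESPACE_* constants. */
enum demo_ns { DEMO_NS_USER = 1, DEMO_NS_SYSTEM = 2 };

/*
 * Build the stored attribute name for a namespace: reject '/', reject a
 * caller-supplied "freebsd:" prefix, and fail if the result does not fit.
 */
static int
demo_attrname(enum demo_ns ns, const char *name, char *out, size_t size)
{
	const char *prefix;
	int n;

	assert(size > 0);
	if (strchr(name, '/') != NULL)
		return (EINVAL);
	if (strncmp(name, "freebsd:", 8) == 0)
		return (EINVAL);
	switch (ns) {
	case DEMO_NS_USER:
		prefix = "";			/* no prefix for user attributes */
		break;
	case DEMO_NS_SYSTEM:
		prefix = "freebsd:system:";
		break;
	default:
		return (EINVAL);
	}
	n = snprintf(out, size, "%s%s", prefix, name);
	if (n < 0 || (size_t)n >= size)
		return (ENAMETOOLONG);
	return (0);
}
```

/*
* Under these assumptions, a system attribute "md5" would be stored as
* "freebsd:system:md5", while a user attribute "md5" keeps its name.
*/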
/*
* Vnode operation to retrieve a named extended attribute.
*/
static int
zfs_getextattr(struct vop_getextattr_args *ap)
/*
vop_getextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
IN const char *a_name;
INOUT struct uio *a_uio;
OUT size_t *a_size;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrname[255];
struct vattr va;
vnode_t *xvp = NULL, *vp;
int error, flags;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VREAD);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
sizeof(attrname));
if (error != 0)
return (error);
ZFS_ENTER(zfsvfs);
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
flags = FREAD;
NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | MPSAFE, UIO_SYSSPACE, attrname,
xvp, td);
error = vn_open_cred(&nd, &flags, 0, 0, ap->a_cred, NULL);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
if (error == ENOENT)
error = ENOATTR;
return (error);
}
if (ap->a_size != NULL) {
error = VOP_GETATTR(vp, &va, ap->a_cred);
if (error == 0)
*ap->a_size = (size_t)va.va_size;
} else if (ap->a_uio != NULL)
error = VOP_READ(vp, ap->a_uio, IO_UNIT | IO_SYNC, ap->a_cred);
VOP_UNLOCK(vp, 0);
vn_close(vp, flags, ap->a_cred, td);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Vnode operation to remove a named attribute.
*/
int
zfs_deleteextattr(struct vop_deleteextattr_args *ap)
/*
vop_deleteextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
IN const char *a_name;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrname[255];
struct vattr va;
vnode_t *xvp = NULL, *vp;
int error, flags;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VWRITE);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
sizeof(attrname));
if (error != 0)
return (error);
ZFS_ENTER(zfsvfs);
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
NDINIT_ATVP(&nd, DELETE, NOFOLLOW | LOCKPARENT | LOCKLEAF | MPSAFE,
UIO_SYSSPACE, attrname, xvp, td);
error = namei(&nd);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
if (error == ENOENT)
error = ENOATTR;
return (error);
}
error = VOP_REMOVE(nd.ni_dvp, vp, &nd.ni_cnd);
vput(nd.ni_dvp);
if (vp == nd.ni_dvp)
vrele(vp);
else
vput(vp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Vnode operation to set a named attribute.
*/
static int
zfs_setextattr(struct vop_setextattr_args *ap)
/*
vop_setextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
IN const char *a_name;
INOUT struct uio *a_uio;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrname[255];
struct vattr va;
vnode_t *xvp = NULL, *vp;
int error, flags;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VWRITE);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
sizeof(attrname));
if (error != 0)
return (error);
ZFS_ENTER(zfsvfs);
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR | CREATE_XATTR_DIR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
flags = FFLAGS(O_WRONLY | O_CREAT);
NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | MPSAFE, UIO_SYSSPACE, attrname,
xvp, td);
error = vn_open_cred(&nd, &flags, 0600, 0, ap->a_cred, NULL);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
VATTR_NULL(&va);
va.va_size = 0;
error = VOP_SETATTR(vp, &va, ap->a_cred);
if (error == 0)
error = VOP_WRITE(vp, ap->a_uio, IO_UNIT | IO_SYNC, ap->a_cred);
VOP_UNLOCK(vp, 0);
vn_close(vp, flags, ap->a_cred, td);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Vnode operation to list the extended attributes on a vnode.
*/
static int
zfs_listextattr(struct vop_listextattr_args *ap)
/*
vop_listextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
INOUT struct uio *a_uio;
OUT size_t *a_size;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrprefix[16];
u_char dirbuf[sizeof(struct dirent)];
struct dirent *dp;
struct iovec aiov;
struct uio auio, *uio = ap->a_uio;
size_t *sizep = ap->a_size;
size_t plen;
vnode_t *xvp = NULL, *vp;
int done, error, eof, pos;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VREAD);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, "", attrprefix,
sizeof(attrprefix));
if (error != 0)
return (error);
plen = strlen(attrprefix);
ZFS_ENTER(zfsvfs);
if (sizep != NULL)
*sizep = 0;
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
/*
* ENOATTR means that the EA directory does not yet exist,
* i.e. there are no extended attributes there.
*/
if (error == ENOATTR)
error = 0;
return (error);
}
NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | LOCKLEAF | LOCKSHARED | MPSAFE,
UIO_SYSSPACE, ".", xvp, td);
error = namei(&nd);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
auio.uio_iov = &aiov;
auio.uio_iovcnt = 1;
auio.uio_segflg = UIO_SYSSPACE;
auio.uio_td = td;
auio.uio_rw = UIO_READ;
auio.uio_offset = 0;
do {
u_char nlen;
aiov.iov_base = (void *)dirbuf;
aiov.iov_len = sizeof(dirbuf);
auio.uio_resid = sizeof(dirbuf);
error = VOP_READDIR(vp, &auio, ap->a_cred, &eof, NULL, NULL);
done = sizeof(dirbuf) - auio.uio_resid;
if (error != 0)
break;
for (pos = 0; pos < done;) {
dp = (struct dirent *)(dirbuf + pos);
pos += dp->d_reclen;
/*
* XXX: Temporarily we also accept DT_UNKNOWN, as this
* is what we get when the attribute was created on Solaris.
*/
if (dp->d_type != DT_REG && dp->d_type != DT_UNKNOWN)
continue;
if (plen == 0 && strncmp(dp->d_name, "freebsd:", 8) == 0)
continue;
else if (strncmp(dp->d_name, attrprefix, plen) != 0)
continue;
nlen = dp->d_namlen - plen;
if (sizep != NULL)
*sizep += 1 + nlen;
else if (uio != NULL) {
/*
* Format of extattr name entry is one byte for
* length and the rest for name.
*/
error = uiomove(&nlen, 1, uio->uio_rw, uio);
if (error == 0) {
error = uiomove(dp->d_name + plen, nlen,
uio->uio_rw, uio);
}
if (error != 0)
break;
}
}
} while (!eof && error == 0);
vput(vp);
ZFS_EXIT(zfsvfs);
return (error);
}
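/*
* Illustrative sketch, not part of this change: the list returned by
* zfs_listextattr() above is a sequence of entries, each one length byte
* followed by that many name bytes with no terminating NUL. The demo_*
* helpers below are hypothetical userland code that builds and walks a
* buffer in that format.
*/

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Append one entry (length byte, then the name) to buf at offset *off.
 * Returns 0 on success, -1 if the name exceeds 255 bytes or does not fit.
 */
static int
demo_list_append(unsigned char *buf, size_t bufsize, size_t *off,
    const char *name)
{
	size_t nlen = strlen(name);

	assert(*off <= bufsize);
	if (nlen > 255 || bufsize - *off < 1 + nlen)
		return (-1);
	buf[(*off)++] = (unsigned char)nlen;
	memcpy(buf + *off, name, nlen);
	*off += nlen;
	return (0);
}

/* Count the entries in a list of len bytes; -1 if the last one is cut off. */
static int
demo_list_count(const unsigned char *buf, size_t len)
{
	size_t pos = 0;
	int count = 0;

	while (pos < len) {
		size_t nlen = buf[pos];

		if (len - pos < 1 + nlen)
			return (-1);
		pos += 1 + nlen;
		count++;
	}
	return (count);
}
```

/*
* This also shows why the size-only path adds 1 + nlen per attribute:
* each entry costs its name length plus the single length byte.
*/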
int
zfs_freebsd_getacl(ap)
struct vop_getacl_args /* {
struct vnode *vp;
acl_type_t type;
struct acl *aclp;
struct ucred *cred;
struct thread *td;
} */ *ap;
{
int error;
vsecattr_t vsecattr;
if (ap->a_type != ACL_TYPE_NFS4)
return (EINVAL);
vsecattr.vsa_mask = VSA_ACE | VSA_ACECNT;
if (error = zfs_getsecattr(ap->a_vp, &vsecattr, 0, ap->a_cred, NULL))
return (error);
error = acl_from_aces(ap->a_aclp, vsecattr.vsa_aclentp, vsecattr.vsa_aclcnt);
if (vsecattr.vsa_aclentp != NULL)
kmem_free(vsecattr.vsa_aclentp, vsecattr.vsa_aclentsz);
return (error);
}
int
zfs_freebsd_setacl(ap)
struct vop_setacl_args /* {
struct vnode *vp;
acl_type_t type;
struct acl *aclp;
struct ucred *cred;
struct thread *td;
} */ *ap;
{
int error;
vsecattr_t vsecattr;
int aclbsize; /* size of acl list in bytes */
aclent_t *aaclp;
if (ap->a_type != ACL_TYPE_NFS4)
return (EINVAL);
if (ap->a_aclp->acl_cnt < 1 || ap->a_aclp->acl_cnt > MAX_ACL_ENTRIES)
return (EINVAL);
/*
* With NFSv4 ACLs, chmod(2) may need to add additional entries,
* splitting every entry into two and appending "canonical six"
* entries at the end. Don't allow for setting an ACL that would
* cause chmod(2) to run out of ACL entries.
*/
if (ap->a_aclp->acl_cnt * 2 + 6 > ACL_MAX_ENTRIES)
return (ENOSPC);
error = acl_nfs4_check(ap->a_aclp, ap->a_vp->v_type == VDIR);
if (error != 0)
return (error);
vsecattr.vsa_mask = VSA_ACE;
aclbsize = ap->a_aclp->acl_cnt * sizeof(ace_t);
vsecattr.vsa_aclentp = kmem_alloc(aclbsize, KM_SLEEP);
aaclp = vsecattr.vsa_aclentp;
vsecattr.vsa_aclentsz = aclbsize;
aces_from_acl(vsecattr.vsa_aclentp, &vsecattr.vsa_aclcnt, ap->a_aclp);
error = zfs_setsecattr(ap->a_vp, &vsecattr, 0, ap->a_cred, NULL);
kmem_free(aaclp, aclbsize);
return (error);
}
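/*
* Illustrative sketch, not part of this change: the headroom check in
* zfs_freebsd_setacl() above, in isolation. DEMO_ACL_MAX_ENTRIES is an
* assumed value standing in for ACL_MAX_ENTRIES from <sys/acl.h>; check
* the header for the real limit. The separate MAX_ACL_ENTRIES bound is
* left out of the sketch.
*/

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>

/* Assumed value; the kernel takes ACL_MAX_ENTRIES from <sys/acl.h>. */
#define	DEMO_ACL_MAX_ENTRIES	254

/*
 * Reject an ACL that chmod(2) could not later expand: every entry may be
 * split in two and six canonical entries may be appended.
 */
static int
demo_acl_cnt_check(int acl_cnt)
{
	assert(acl_cnt <= (INT_MAX - 6) / 2);	/* guard against overflow */
	if (acl_cnt < 1)
		return (EINVAL);
	if (acl_cnt * 2 + 6 > DEMO_ACL_MAX_ENTRIES)
		return (ENOSPC);
	return (0);
}
```

/*
* With the assumed limit of 254, the largest ACL accepted has 124 entries
* (124 * 2 + 6 == 254); 125 entries would already be refused.
*/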
int
zfs_freebsd_aclcheck(ap)
struct vop_aclcheck_args /* {
struct vnode *vp;
acl_type_t type;
struct acl *aclp;
struct ucred *cred;
struct thread *td;
} */ *ap;
{
return (EOPNOTSUPP);
}
struct vop_vector zfs_vnodeops;
struct vop_vector zfs_fifoops;
struct vop_vector zfs_vnodeops = {
.vop_default = &default_vnodeops,
.vop_inactive = zfs_freebsd_inactive,
.vop_reclaim = zfs_freebsd_reclaim,
.vop_access = zfs_freebsd_access,
#ifdef FREEBSD_NAMECACHE
.vop_lookup = vfs_cache_lookup,
.vop_cachedlookup = zfs_freebsd_lookup,
#else
.vop_lookup = zfs_freebsd_lookup,
#endif
.vop_getattr = zfs_freebsd_getattr,
.vop_setattr = zfs_freebsd_setattr,
.vop_create = zfs_freebsd_create,
.vop_mknod = zfs_freebsd_create,
.vop_mkdir = zfs_freebsd_mkdir,
.vop_readdir = zfs_freebsd_readdir,
.vop_fsync = zfs_freebsd_fsync,
.vop_open = zfs_freebsd_open,
.vop_close = zfs_freebsd_close,
.vop_rmdir = zfs_freebsd_rmdir,
.vop_ioctl = zfs_freebsd_ioctl,
.vop_link = zfs_freebsd_link,
.vop_symlink = zfs_freebsd_symlink,
.vop_readlink = zfs_freebsd_readlink,
.vop_read = zfs_freebsd_read,
.vop_write = zfs_freebsd_write,
.vop_remove = zfs_freebsd_remove,
.vop_rename = zfs_freebsd_rename,
.vop_pathconf = zfs_freebsd_pathconf,
.vop_bmap = VOP_EOPNOTSUPP,
.vop_fid = zfs_freebsd_fid,
.vop_getextattr = zfs_getextattr,
.vop_deleteextattr = zfs_deleteextattr,
.vop_setextattr = zfs_setextattr,
.vop_listextattr = zfs_listextattr,
.vop_getacl = zfs_freebsd_getacl,
.vop_setacl = zfs_freebsd_setacl,
.vop_aclcheck = zfs_freebsd_aclcheck,
};
struct vop_vector zfs_fifoops = {
.vop_default = &fifo_specops,
.vop_fsync = zfs_freebsd_fsync,
.vop_access = zfs_freebsd_access,
.vop_getattr = zfs_freebsd_getattr,
.vop_inactive = zfs_freebsd_inactive,
.vop_read = VOP_PANIC,
.vop_reclaim = zfs_freebsd_reclaim,
.vop_setattr = zfs_freebsd_setattr,
.vop_write = VOP_PANIC,
.vop_pathconf = zfs_freebsd_fifo_pathconf,
.vop_fid = zfs_freebsd_fid,
.vop_getacl = zfs_freebsd_getacl,
.vop_setacl = zfs_freebsd_setacl,
.vop_aclcheck = zfs_freebsd_aclcheck,
};
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c (revision 209274)
@@ -1,2276 +1,2276 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/fm/fs/zfs.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/spa_impl.h>
#include <sys/vdev_impl.h>
#include <sys/zio_impl.h>
#include <sys/zio_compress.h>
#include <sys/zio_checksum.h>
SYSCTL_DECL(_vfs_zfs);
SYSCTL_NODE(_vfs_zfs, OID_AUTO, zio, CTLFLAG_RW, 0, "ZFS ZIO");
static int zio_use_uma = 0;
TUNABLE_INT("vfs.zfs.zio.use_uma", &zio_use_uma);
SYSCTL_INT(_vfs_zfs_zio, OID_AUTO, use_uma, CTLFLAG_RDTUN, &zio_use_uma, 0,
"Use uma(9) for ZIO allocations");
/*
* ==========================================================================
* I/O priority table
* ==========================================================================
*/
uint8_t zio_priority_table[ZIO_PRIORITY_TABLE_SIZE] = {
0, /* ZIO_PRIORITY_NOW */
0, /* ZIO_PRIORITY_SYNC_READ */
0, /* ZIO_PRIORITY_SYNC_WRITE */
6, /* ZIO_PRIORITY_ASYNC_READ */
4, /* ZIO_PRIORITY_ASYNC_WRITE */
4, /* ZIO_PRIORITY_FREE */
0, /* ZIO_PRIORITY_CACHE_FILL */
0, /* ZIO_PRIORITY_LOG_WRITE */
10, /* ZIO_PRIORITY_RESILVER */
20, /* ZIO_PRIORITY_SCRUB */
};
/*
* ==========================================================================
* I/O type descriptions
* ==========================================================================
*/
char *zio_type_name[ZIO_TYPES] = {
"null", "read", "write", "free", "claim", "ioctl" };
#define SYNC_PASS_DEFERRED_FREE 1 /* defer frees after this pass */
#define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
#define SYNC_PASS_REWRITE 1 /* rewrite new bps after this pass */
/*
* ==========================================================================
* I/O kmem caches
* ==========================================================================
*/
kmem_cache_t *zio_cache;
kmem_cache_t *zio_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
kmem_cache_t *zio_data_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
#ifdef _KERNEL
extern vmem_t *zio_alloc_arena;
#endif
/*
* An allocating zio is one that either currently has the DVA allocate
* stage set or will have it later in its lifetime.
*/
#define IO_IS_ALLOCATING(zio) \
((zio)->io_orig_pipeline & (1U << ZIO_STAGE_DVA_ALLOCATE))
void
zio_init(void)
{
size_t c;
zio_cache = kmem_cache_create("zio_cache", sizeof (zio_t), 0,
NULL, NULL, NULL, NULL, NULL, 0);
/*
* For small buffers, we want a cache for each multiple of
* SPA_MINBLOCKSIZE. For medium-size buffers, we want a cache
* for each quarter-power of 2. For large buffers, we want
* a cache for each multiple of PAGESIZE.
*/
for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
size_t size = (c + 1) << SPA_MINBLOCKSHIFT;
size_t p2 = size;
size_t align = 0;
while (p2 & (p2 - 1))
p2 &= p2 - 1;
if (size <= 4 * SPA_MINBLOCKSIZE) {
align = SPA_MINBLOCKSIZE;
} else if (P2PHASE(size, PAGESIZE) == 0) {
align = PAGESIZE;
} else if (P2PHASE(size, p2 >> 2) == 0) {
align = p2 >> 2;
}
if (align != 0) {
char name[36];
(void) sprintf(name, "zio_buf_%lu", (ulong_t)size);
zio_buf_cache[c] = kmem_cache_create(name, size,
align, NULL, NULL, NULL, NULL, NULL, KMC_NODEBUG);
(void) sprintf(name, "zio_data_buf_%lu", (ulong_t)size);
zio_data_buf_cache[c] = kmem_cache_create(name, size,
align, NULL, NULL, NULL, NULL, NULL, KMC_NODEBUG);
}
}
while (--c != 0) {
ASSERT(zio_buf_cache[c] != NULL);
if (zio_buf_cache[c - 1] == NULL)
zio_buf_cache[c - 1] = zio_buf_cache[c];
ASSERT(zio_data_buf_cache[c] != NULL);
if (zio_data_buf_cache[c - 1] == NULL)
zio_data_buf_cache[c - 1] = zio_data_buf_cache[c];
}
zio_inject_init();
}
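/*
* Illustrative sketch, not part of this change: which buffer sizes get
* their own cache in zio_init() above, and how a request size maps to a
* cache slot. The DEMO_* constants are assumptions (512-byte minimum
* block, 128 KB maximum block, 4 KB pages); the kernel uses
* SPA_MINBLOCKSHIFT, SPA_MAXBLOCKSIZE and PAGESIZE instead.
*/

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values; see the lead-in comment. */
#define	DEMO_MINBLOCKSHIFT	9
#define	DEMO_MINBLOCKSIZE	((size_t)1 << DEMO_MINBLOCKSHIFT)
#define	DEMO_MAXBLOCKSIZE	((size_t)128 * 1024)
#define	DEMO_PAGESIZE		((size_t)4096)

/* Index of the cache slot that serves a request of size bytes. */
static size_t
demo_cache_index(size_t size)
{
	assert(size >= 1 && size <= DEMO_MAXBLOCKSIZE);
	return ((size - 1) >> DEMO_MINBLOCKSHIFT);
}

/*
 * Alignment chosen for a buffer size, or 0 if that size gets no cache
 * of its own and ends up sharing the next larger one.
 */
static size_t
demo_cache_align(size_t size)
{
	size_t p2 = size;

	/* Clear low set bits until only the highest one remains. */
	while (p2 & (p2 - 1))
		p2 &= p2 - 1;
	if (size <= 4 * DEMO_MINBLOCKSIZE)
		return (DEMO_MINBLOCKSIZE);
	if ((size & (DEMO_PAGESIZE - 1)) == 0)
		return (DEMO_PAGESIZE);
	if ((size & ((p2 >> 2) - 1)) == 0)
		return (p2 >> 2);
	return (0);
}
```

/*
* Under these assumptions a 4608-byte buffer has no cache of its own
* (alignment 0), so the backward pass in zio_init() points its slot at
* the 5120-byte cache, the next size that qualifies.
*/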
void
zio_fini(void)
{
size_t c;
kmem_cache_t *last_cache = NULL;
kmem_cache_t *last_data_cache = NULL;
for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
if (zio_buf_cache[c] != last_cache) {
last_cache = zio_buf_cache[c];
kmem_cache_destroy(zio_buf_cache[c]);
}
zio_buf_cache[c] = NULL;
if (zio_data_buf_cache[c] != last_data_cache) {
last_data_cache = zio_data_buf_cache[c];
kmem_cache_destroy(zio_data_buf_cache[c]);
}
zio_data_buf_cache[c] = NULL;
}
kmem_cache_destroy(zio_cache);
zio_inject_fini();
}
/*
* ==========================================================================
* Allocate and free I/O buffers
* ==========================================================================
*/
/*
* Use zio_buf_alloc to allocate ZFS metadata. This data will appear in a
* crashdump if the kernel panics, so use it judiciously. Obviously, it's
* useful to inspect ZFS metadata, but if possible, we should avoid keeping
* excess / transient data in-core during a crashdump.
*/
void *
zio_buf_alloc(size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
else
return (kmem_alloc(size, KM_SLEEP));
}
/*
* Use zio_data_buf_alloc to allocate data. The data will not appear in a
* crashdump if the kernel panics. This limits the amount of ZFS data that
* shows up in a kernel crashdump, and so reduces the amount of kernel heap
* dumped to disk when the kernel panics.
*/
void *
zio_data_buf_alloc(size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
else
return (kmem_alloc(size, KM_SLEEP));
}
void
zio_buf_free(void *buf, size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
kmem_cache_free(zio_buf_cache[c], buf);
else
kmem_free(buf, size);
}
void
zio_data_buf_free(void *buf, size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
kmem_cache_free(zio_data_buf_cache[c], buf);
else
kmem_free(buf, size);
}
/*
* ==========================================================================
* Push and pop I/O transform buffers
* ==========================================================================
*/
static void
zio_push_transform(zio_t *zio, void *data, uint64_t size, uint64_t bufsize,
zio_transform_func_t *transform)
{
zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);
zt->zt_orig_data = zio->io_data;
zt->zt_orig_size = zio->io_size;
zt->zt_bufsize = bufsize;
zt->zt_transform = transform;
zt->zt_next = zio->io_transform_stack;
zio->io_transform_stack = zt;
zio->io_data = data;
zio->io_size = size;
}
static void
zio_pop_transforms(zio_t *zio)
{
zio_transform_t *zt;
while ((zt = zio->io_transform_stack) != NULL) {
if (zt->zt_transform != NULL)
zt->zt_transform(zio,
zt->zt_orig_data, zt->zt_orig_size);
zio_buf_free(zio->io_data, zt->zt_bufsize);
zio->io_data = zt->zt_orig_data;
zio->io_size = zt->zt_orig_size;
zio->io_transform_stack = zt->zt_next;
kmem_free(zt, sizeof (zio_transform_t));
}
}
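/*
* Illustrative sketch, not part of this change: the push/pop pattern used
* by zio_push_transform() and zio_pop_transforms() above, reduced to
* userland C. All demo_* names are hypothetical, and malloc()/free() stand
* in for the kernel allocators. Each push saves the current buffer and
* swaps in a temporary one; unwinding runs the callbacks newest first and
* restores the saved buffers.
*/

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct demo_io;
typedef void demo_transform_func(struct demo_io *io, void *data, size_t size);

struct demo_transform {
	void *orig_data;		/* buffer in use before the push */
	size_t orig_size;
	demo_transform_func *func;	/* optional callback run on pop */
	struct demo_transform *next;
};

struct demo_io {
	void *data;
	size_t size;
	struct demo_transform *stack;
};

/* Save the current buffer on the stack and switch the I/O to a new one. */
static void
demo_push(struct demo_io *io, void *data, size_t size,
    demo_transform_func *func)
{
	struct demo_transform *t = malloc(sizeof(*t));

	assert(t != NULL);
	t->orig_data = io->data;
	t->orig_size = io->size;
	t->func = func;
	t->next = io->stack;
	io->stack = t;
	io->data = data;
	io->size = size;
}

/* Unwind every transform, newest first, freeing each temporary buffer. */
static void
demo_pop_all(struct demo_io *io)
{
	struct demo_transform *t;

	while ((t = io->stack) != NULL) {
		if (t->func != NULL)
			t->func(io, t->orig_data, t->orig_size);
		free(io->data);
		io->data = t->orig_data;
		io->size = t->orig_size;
		io->stack = t->next;
		free(t);
	}
}

/* Example callback: copy the leading bytes of the temporary buffer back. */
static void
demo_copy_back(struct demo_io *io, void *data, size_t size)
{
	assert(io->size >= size);
	memcpy(data, io->data, size);
}
```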
/*
* ==========================================================================
* I/O transform callbacks for subblocks and decompression
* ==========================================================================
*/
static void
zio_subblock(zio_t *zio, void *data, uint64_t size)
{
ASSERT(zio->io_size > size);
if (zio->io_type == ZIO_TYPE_READ)
bcopy(zio->io_data, data, size);
}
static void
zio_decompress(zio_t *zio, void *data, uint64_t size)
{
if (zio->io_error == 0 &&
zio_decompress_data(BP_GET_COMPRESS(zio->io_bp),
zio->io_data, zio->io_size, data, size) != 0)
zio->io_error = EIO;
}
/*
* ==========================================================================
* I/O parent/child relationships and pipeline interlocks
* ==========================================================================
*/
static void
zio_add_child(zio_t *pio, zio_t *zio)
{
mutex_enter(&pio->io_lock);
if (zio->io_stage < ZIO_STAGE_READY)
pio->io_children[zio->io_child_type][ZIO_WAIT_READY]++;
if (zio->io_stage < ZIO_STAGE_DONE)
pio->io_children[zio->io_child_type][ZIO_WAIT_DONE]++;
zio->io_sibling_prev = NULL;
zio->io_sibling_next = pio->io_child;
if (pio->io_child != NULL)
pio->io_child->io_sibling_prev = zio;
pio->io_child = zio;
zio->io_parent = pio;
mutex_exit(&pio->io_lock);
}
static void
zio_remove_child(zio_t *pio, zio_t *zio)
{
zio_t *next, *prev;
ASSERT(zio->io_parent == pio);
mutex_enter(&pio->io_lock);
next = zio->io_sibling_next;
prev = zio->io_sibling_prev;
if (next != NULL)
next->io_sibling_prev = prev;
if (prev != NULL)
prev->io_sibling_next = next;
if (pio->io_child == zio)
pio->io_child = next;
mutex_exit(&pio->io_lock);
}
static boolean_t
zio_wait_for_children(zio_t *zio, enum zio_child child, enum zio_wait_type wait)
{
uint64_t *countp = &zio->io_children[child][wait];
boolean_t waiting = B_FALSE;
mutex_enter(&zio->io_lock);
ASSERT(zio->io_stall == NULL);
if (*countp != 0) {
zio->io_stage--;
zio->io_stall = countp;
waiting = B_TRUE;
}
mutex_exit(&zio->io_lock);
return (waiting);
}
static void
zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
{
uint64_t *countp = &pio->io_children[zio->io_child_type][wait];
int *errorp = &pio->io_child_error[zio->io_child_type];
mutex_enter(&pio->io_lock);
if (zio->io_error && !(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE))
*errorp = zio_worst_error(*errorp, zio->io_error);
pio->io_reexecute |= zio->io_reexecute;
ASSERT3U(*countp, >, 0);
if (--*countp == 0 && pio->io_stall == countp) {
pio->io_stall = NULL;
mutex_exit(&pio->io_lock);
zio_execute(pio);
} else {
mutex_exit(&pio->io_lock);
}
}
static void
zio_inherit_child_errors(zio_t *zio, enum zio_child c)
{
if (zio->io_child_error[c] != 0 && zio->io_error == 0)
zio->io_error = zio->io_child_error[c];
}
/*
* ==========================================================================
* Create the various types of I/O (read, write, free, etc)
* ==========================================================================
*/
static zio_t *
zio_create(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
void *data, uint64_t size, zio_done_func_t *done, void *private,
zio_type_t type, int priority, int flags, vdev_t *vd, uint64_t offset,
const zbookmark_t *zb, uint8_t stage, uint32_t pipeline)
{
zio_t *zio;
ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
ASSERT(P2PHASE(offset, SPA_MINBLOCKSIZE) == 0);
ASSERT(!vd || spa_config_held(spa, SCL_STATE_ALL, RW_READER));
ASSERT(!bp || !(flags & ZIO_FLAG_CONFIG_WRITER));
ASSERT(vd || stage == ZIO_STAGE_OPEN);
zio = kmem_cache_alloc(zio_cache, KM_SLEEP);
bzero(zio, sizeof (zio_t));
mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);
if (vd != NULL)
zio->io_child_type = ZIO_CHILD_VDEV;
else if (flags & ZIO_FLAG_GANG_CHILD)
zio->io_child_type = ZIO_CHILD_GANG;
else
zio->io_child_type = ZIO_CHILD_LOGICAL;
if (bp != NULL) {
zio->io_bp = bp;
zio->io_bp_copy = *bp;
zio->io_bp_orig = *bp;
if (type != ZIO_TYPE_WRITE)
zio->io_bp = &zio->io_bp_copy; /* so caller can free */
if (zio->io_child_type == ZIO_CHILD_LOGICAL) {
if (BP_IS_GANG(bp))
pipeline |= ZIO_GANG_STAGES;
zio->io_logical = zio;
}
}
zio->io_spa = spa;
zio->io_txg = txg;
zio->io_data = data;
zio->io_size = size;
zio->io_done = done;
zio->io_private = private;
zio->io_type = type;
zio->io_priority = priority;
zio->io_vd = vd;
zio->io_offset = offset;
zio->io_orig_flags = zio->io_flags = flags;
zio->io_orig_stage = zio->io_stage = stage;
zio->io_orig_pipeline = zio->io_pipeline = pipeline;
if (zb != NULL)
zio->io_bookmark = *zb;
if (pio != NULL) {
/*
* Logical I/Os can have logical, gang, or vdev children.
* Gang I/Os can have gang or vdev children.
* Vdev I/Os can only have vdev children.
* The following ASSERT captures all of these constraints.
*/
ASSERT(zio->io_child_type <= pio->io_child_type);
if (zio->io_logical == NULL)
zio->io_logical = pio->io_logical;
zio_add_child(pio, zio);
}
return (zio);
}
static void
zio_destroy(zio_t *zio)
{
spa_t *spa = zio->io_spa;
uint8_t async_root = zio->io_async_root;
mutex_destroy(&zio->io_lock);
cv_destroy(&zio->io_cv);
kmem_cache_free(zio_cache, zio);
if (async_root) {
mutex_enter(&spa->spa_async_root_lock);
if (--spa->spa_async_root_count == 0)
cv_broadcast(&spa->spa_async_root_cv);
mutex_exit(&spa->spa_async_root_lock);
}
}
zio_t *
zio_null(zio_t *pio, spa_t *spa, zio_done_func_t *done, void *private,
int flags)
{
zio_t *zio;
zio = zio_create(pio, spa, 0, NULL, NULL, 0, done, private,
ZIO_TYPE_NULL, ZIO_PRIORITY_NOW, flags, NULL, 0, NULL,
ZIO_STAGE_OPEN, ZIO_INTERLOCK_PIPELINE);
return (zio);
}
zio_t *
zio_root(spa_t *spa, zio_done_func_t *done, void *private, int flags)
{
return (zio_null(NULL, spa, done, private, flags));
}
zio_t *
zio_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
void *data, uint64_t size, zio_done_func_t *done, void *private,
int priority, int flags, const zbookmark_t *zb)
{
zio_t *zio;
zio = zio_create(pio, spa, bp->blk_birth, (blkptr_t *)bp,
data, size, done, private,
ZIO_TYPE_READ, priority, flags, NULL, 0, zb,
ZIO_STAGE_OPEN, ZIO_READ_PIPELINE);
return (zio);
}
zio_t *
zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
void *data, uint64_t size, zio_prop_t *zp,
zio_done_func_t *ready, zio_done_func_t *done, void *private,
int priority, int flags, const zbookmark_t *zb)
{
zio_t *zio;
ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
zp->zp_compress >= ZIO_COMPRESS_OFF &&
zp->zp_compress < ZIO_COMPRESS_FUNCTIONS &&
zp->zp_type < DMU_OT_NUMTYPES &&
zp->zp_level < 32 &&
zp->zp_ndvas > 0 &&
zp->zp_ndvas <= spa_max_replication(spa));
ASSERT(ready != NULL);
zio = zio_create(pio, spa, txg, bp, data, size, done, private,
ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
ZIO_STAGE_OPEN, ZIO_WRITE_PIPELINE);
zio->io_ready = ready;
zio->io_prop = *zp;
return (zio);
}
zio_t *
zio_rewrite(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, void *data,
uint64_t size, zio_done_func_t *done, void *private, int priority,
int flags, zbookmark_t *zb)
{
zio_t *zio;
zio = zio_create(pio, spa, txg, bp, data, size, done, private,
ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
ZIO_STAGE_OPEN, ZIO_REWRITE_PIPELINE);
return (zio);
}
zio_t *
zio_free(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
zio_done_func_t *done, void *private, int flags)
{
zio_t *zio;
ASSERT(!BP_IS_HOLE(bp));
if (bp->blk_fill == BLK_FILL_ALREADY_FREED)
return (zio_null(pio, spa, NULL, NULL, flags));
if (txg == spa->spa_syncing_txg &&
spa_sync_pass(spa) > SYNC_PASS_DEFERRED_FREE) {
bplist_enqueue_deferred(&spa->spa_sync_bplist, bp);
return (zio_null(pio, spa, NULL, NULL, flags));
}
zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
done, private, ZIO_TYPE_FREE, ZIO_PRIORITY_FREE, flags,
NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_FREE_PIPELINE);
return (zio);
}
zio_t *
zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
zio_done_func_t *done, void *private, int flags)
{
zio_t *zio;
/*
* A claim is an allocation of a specific block. Claims are needed
* to support immediate writes in the intent log. The issue is that
* immediate writes contain committed data, but in a txg that was
* *not* committed. Upon opening the pool after an unclean shutdown,
* the intent log claims all blocks that contain immediate write data
* so that the SPA knows they're in use.
*
* All claims *must* be resolved in the first txg -- before the SPA
* starts allocating blocks -- so that nothing is allocated twice.
*/
ASSERT3U(spa->spa_uberblock.ub_rootbp.blk_birth, <, spa_first_txg(spa));
ASSERT3U(spa_first_txg(spa), <=, txg);
zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
done, private, ZIO_TYPE_CLAIM, ZIO_PRIORITY_NOW, flags,
NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_CLAIM_PIPELINE);
return (zio);
}
zio_t *
zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
zio_done_func_t *done, void *private, int priority, int flags)
{
zio_t *zio;
int c;
if (vd->vdev_children == 0) {
zio = zio_create(pio, spa, 0, NULL, NULL, 0, done, private,
ZIO_TYPE_IOCTL, priority, flags, vd, 0, NULL,
ZIO_STAGE_OPEN, ZIO_IOCTL_PIPELINE);
zio->io_cmd = cmd;
} else {
zio = zio_null(pio, spa, NULL, NULL, flags);
for (c = 0; c < vd->vdev_children; c++)
zio_nowait(zio_ioctl(zio, spa, vd->vdev_child[c], cmd,
done, private, priority, flags));
}
return (zio);
}
zio_t *
zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
void *data, int checksum, zio_done_func_t *done, void *private,
int priority, int flags, boolean_t labels)
{
zio_t *zio;
ASSERT(vd->vdev_children == 0);
ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
ASSERT3U(offset + size, <=, vd->vdev_psize);
zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, done, private,
ZIO_TYPE_READ, priority, flags, vd, offset, NULL,
ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
zio->io_prop.zp_checksum = checksum;
return (zio);
}
zio_t *
zio_write_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
void *data, int checksum, zio_done_func_t *done, void *private,
int priority, int flags, boolean_t labels)
{
zio_t *zio;
ASSERT(vd->vdev_children == 0);
ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
ASSERT3U(offset + size, <=, vd->vdev_psize);
zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, done, private,
ZIO_TYPE_WRITE, priority, flags, vd, offset, NULL,
ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
zio->io_prop.zp_checksum = checksum;
if (zio_checksum_table[checksum].ci_zbt) {
/*
* zbt checksums are necessarily destructive -- they modify
* the end of the write buffer to hold the verifier/checksum.
* Therefore, we must make a local copy in case the data is
* being written to multiple places in parallel.
*/
void *wbuf = zio_buf_alloc(size);
bcopy(data, wbuf, size);
zio_push_transform(zio, wbuf, size, size, NULL);
}
return (zio);
}
/*
* Create a child I/O on vdev 'vd' to do part of the parent's (pio's) work.
*/
zio_t *
zio_vdev_child_io(zio_t *pio, blkptr_t *bp, vdev_t *vd, uint64_t offset,
void *data, uint64_t size, int type, int priority, int flags,
zio_done_func_t *done, void *private)
{
uint32_t pipeline = ZIO_VDEV_CHILD_PIPELINE;
zio_t *zio;
ASSERT(vd->vdev_parent ==
(pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev));
if (type == ZIO_TYPE_READ && bp != NULL) {
/*
* If we have the bp, then the child should perform the
* checksum and the parent need not. This pushes error
* detection as close to the leaves as possible and
* eliminates redundant checksums in the interior nodes.
*/
pipeline |= 1U << ZIO_STAGE_CHECKSUM_VERIFY;
pio->io_pipeline &= ~(1U << ZIO_STAGE_CHECKSUM_VERIFY);
}
if (vd->vdev_children == 0)
offset += VDEV_LABEL_START_SIZE;
zio = zio_create(pio, pio->io_spa, pio->io_txg, bp, data, size,
done, private, type, priority,
(pio->io_flags & ZIO_FLAG_VDEV_INHERIT) |
ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | flags,
vd, offset, &pio->io_bookmark,
ZIO_STAGE_VDEV_IO_START - 1, pipeline);
return (zio);
}
zio_t *
zio_vdev_delegated_io(vdev_t *vd, uint64_t offset, void *data, uint64_t size,
int type, int priority, int flags, zio_done_func_t *done, void *private)
{
zio_t *zio;
ASSERT(vd->vdev_ops->vdev_op_leaf);
zio = zio_create(NULL, vd->vdev_spa, 0, NULL,
data, size, done, private, type, priority,
flags | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_RETRY,
vd, offset, NULL,
ZIO_STAGE_VDEV_IO_START - 1, ZIO_VDEV_CHILD_PIPELINE);
return (zio);
}
void
zio_flush(zio_t *zio, vdev_t *vd)
{
zio_nowait(zio_ioctl(zio, zio->io_spa, vd, DKIOCFLUSHWRITECACHE,
NULL, NULL, ZIO_PRIORITY_NOW,
ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY));
}
/*
* ==========================================================================
* Prepare to read and write logical blocks
* ==========================================================================
*/
static int
zio_read_bp_init(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF && zio->io_logical == zio) {
uint64_t csize = BP_GET_PSIZE(bp);
void *cbuf = zio_buf_alloc(csize);
zio_push_transform(zio, cbuf, csize, csize, zio_decompress);
}
if (!dmu_ot[BP_GET_TYPE(bp)].ot_metadata && BP_GET_LEVEL(bp) == 0)
zio->io_flags |= ZIO_FLAG_DONT_CACHE;
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_write_bp_init(zio_t *zio)
{
zio_prop_t *zp = &zio->io_prop;
int compress = zp->zp_compress;
blkptr_t *bp = zio->io_bp;
void *cbuf;
uint64_t lsize = zio->io_size;
uint64_t csize = lsize;
uint64_t cbufsize = 0;
int pass = 1;
/*
* If our children haven't all reached the ready stage,
* wait for them and then repeat this pipeline stage.
*/
if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
return (ZIO_PIPELINE_STOP);
if (!IO_IS_ALLOCATING(zio))
return (ZIO_PIPELINE_CONTINUE);
ASSERT(compress != ZIO_COMPRESS_INHERIT);
if (bp->blk_birth == zio->io_txg) {
/*
* We're rewriting an existing block, which means we're
* working on behalf of spa_sync(). For spa_sync() to
* converge, it must eventually be the case that we don't
* have to allocate new blocks. But compression changes
* the blocksize, which forces a reallocate, and makes
* convergence take longer. Therefore, after the first
* few passes, stop compressing to ensure convergence.
*/
pass = spa_sync_pass(zio->io_spa);
ASSERT(pass > 1);
if (pass > SYNC_PASS_DONT_COMPRESS)
compress = ZIO_COMPRESS_OFF;
/*
* Only MOS (objset 0) data should need to be rewritten.
*/
ASSERT(zio->io_logical->io_bookmark.zb_objset == 0);
/* Make sure someone doesn't change their mind on overwrites */
ASSERT(MIN(zp->zp_ndvas + BP_IS_GANG(bp),
spa_max_replication(zio->io_spa)) == BP_GET_NDVAS(bp));
}
if (compress != ZIO_COMPRESS_OFF) {
if (!zio_compress_data(compress, zio->io_data, zio->io_size,
&cbuf, &csize, &cbufsize)) {
compress = ZIO_COMPRESS_OFF;
} else if (csize != 0) {
zio_push_transform(zio, cbuf, csize, cbufsize, NULL);
}
}
/*
* The final pass of spa_sync() must be all rewrites, but the first
* few passes offer a trade-off: allocating blocks defers convergence,
* but newly allocated blocks are sequential, so they can be written
* to disk faster. Therefore, we allow the first few passes of
* spa_sync() to allocate new blocks, but force rewrites after that.
* There should only be a handful of blocks after pass 1 in any case.
*/
if (bp->blk_birth == zio->io_txg && BP_GET_PSIZE(bp) == csize &&
pass > SYNC_PASS_REWRITE) {
ASSERT(csize != 0);
uint32_t gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;
zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
zio->io_flags |= ZIO_FLAG_IO_REWRITE;
} else {
BP_ZERO(bp);
zio->io_pipeline = ZIO_WRITE_PIPELINE;
}
if (csize == 0) {
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
} else {
ASSERT(zp->zp_checksum != ZIO_CHECKSUM_GANG_HEADER);
BP_SET_LSIZE(bp, lsize);
BP_SET_PSIZE(bp, csize);
BP_SET_COMPRESS(bp, compress);
BP_SET_CHECKSUM(bp, zp->zp_checksum);
BP_SET_TYPE(bp, zp->zp_type);
BP_SET_LEVEL(bp, zp->zp_level);
BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
}
return (ZIO_PIPELINE_CONTINUE);
}
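The rewrite-versus-allocate choice at the end of zio_write_bp_init() can be read as a small predicate. The sketch below is only an illustration of that condition: the `demo_` names are hypothetical, and `rewrite_after_pass` stands in for SYNC_PASS_REWRITE, whose value does not appear in this excerpt.

```c
#include <assert.h>
#include <stdint.h>

enum demo_choice { DEMO_ALLOCATE, DEMO_REWRITE };

/*
 * Rewrite in place only when the block was born in this txg, the
 * (possibly compressed) size is unchanged, and we are past the pass
 * threshold. Otherwise a fresh allocation is made.
 */
static enum demo_choice
demo_choose(uint64_t blk_birth, uint64_t txg, uint64_t old_psize,
    uint64_t csize, int pass, int rewrite_after_pass)
{
	assert(pass >= 1);	/* sync passes are counted from 1 */
	if (blk_birth == txg && old_psize == csize &&
	    pass > rewrite_after_pass)
		return (DEMO_REWRITE);
	return (DEMO_ALLOCATE);
}
```

A size change (for example, because compression produced a different csize) always falls through to allocation, which is why the source stops compressing after a few passes.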
/*
* ==========================================================================
* Execute the I/O pipeline
* ==========================================================================
*/
static void
zio_taskq_dispatch(zio_t *zio, enum zio_taskq_type q)
{
zio_type_t t = zio->io_type;
/*
- * If we're a config writer, the normal issue and interrupt threads
- * may all be blocked waiting for the config lock. In this case,
- * select the otherwise-unused taskq for ZIO_TYPE_NULL.
+ * If we're a config writer or a probe, the normal issue and
+ * interrupt threads may all be blocked waiting for the config lock.
+ * In this case, select the otherwise-unused taskq for ZIO_TYPE_NULL.
*/
- if (zio->io_flags & ZIO_FLAG_CONFIG_WRITER)
+ if (zio->io_flags & (ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_PROBE))
t = ZIO_TYPE_NULL;
/*
* A similar issue exists for the L2ARC write thread until L2ARC 2.0.
*/
if (t == ZIO_TYPE_WRITE && zio->io_vd && zio->io_vd->vdev_aux)
t = ZIO_TYPE_NULL;
(void) taskq_dispatch_safe(zio->io_spa->spa_zio_taskq[t][q],
(task_func_t *)zio_execute, zio, &zio->io_task);
}
static boolean_t
zio_taskq_member(zio_t *zio, enum zio_taskq_type q)
{
kthread_t *executor = zio->io_executor;
spa_t *spa = zio->io_spa;
for (zio_type_t t = 0; t < ZIO_TYPES; t++)
if (taskq_member(spa->spa_zio_taskq[t][q], executor))
return (B_TRUE);
return (B_FALSE);
}
static int
zio_issue_async(zio_t *zio)
{
zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
return (ZIO_PIPELINE_STOP);
}
void
zio_interrupt(zio_t *zio)
{
zio_taskq_dispatch(zio, ZIO_TASKQ_INTERRUPT);
}
/*
* Execute the I/O pipeline until one of the following occurs:
* (1) the I/O completes; (2) the pipeline stalls waiting for
* dependent child I/Os; (3) the I/O issues, so we're waiting
* for an I/O completion interrupt; (4) the I/O is delegated by
* vdev-level caching or aggregation; (5) the I/O is deferred
* due to vdev-level queueing; (6) the I/O is handed off to
* another thread. In all cases, the pipeline stops whenever
* there's no CPU work; it never burns a thread in cv_wait().
*
* There's no locking on io_stage because there's no legitimate way
* for multiple threads to be attempting to process the same I/O.
*/
static zio_pipe_stage_t *zio_pipeline[ZIO_STAGES];
void
zio_execute(zio_t *zio)
{
zio->io_executor = curthread;
while (zio->io_stage < ZIO_STAGE_DONE) {
uint32_t pipeline = zio->io_pipeline;
zio_stage_t stage = zio->io_stage;
int rv;
ASSERT(!MUTEX_HELD(&zio->io_lock));
while (((1U << ++stage) & pipeline) == 0)
continue;
ASSERT(stage <= ZIO_STAGE_DONE);
ASSERT(zio->io_stall == NULL);
/*
* If we are in interrupt context and this pipeline stage
* will grab a config lock that is held across I/O,
* issue async to avoid deadlock.
*/
if (((1U << stage) & ZIO_CONFIG_LOCK_BLOCKING_STAGES) &&
zio->io_vd == NULL &&
zio_taskq_member(zio, ZIO_TASKQ_INTERRUPT)) {
zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
return;
}
zio->io_stage = stage;
rv = zio_pipeline[stage](zio);
if (rv == ZIO_PIPELINE_STOP)
return;
ASSERT(rv == ZIO_PIPELINE_CONTINUE);
}
}
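The stage scan in zio_execute() treats io_pipeline as a bitmask and advances to the next set bit. A minimal user-space sketch of that loop follows; the `demo_` types and the stage numbers are hypothetical and are not the real ZIO_STAGE_* values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stage numbers; the real ZIO_STAGE_* values differ. */
enum { DEMO_STAGE_OPEN = 0, DEMO_STAGE_DONE = 7 };
enum { DEMO_CONTINUE = 0, DEMO_STOP = 1 };

typedef struct demo_io {
	uint32_t pipeline;	/* bit N set means stage N must run */
	int stage;		/* last stage that ran */
	int stop_at;		/* stage that reports DEMO_STOP, or -1 */
	uint32_t ran;		/* bitmask of stages that actually ran */
} demo_io_t;

/* Stand-in for a pipeline stage function: record the stage, maybe stop. */
static int
demo_stage_run(demo_io_t *io)
{
	io->ran |= 1U << io->stage;
	return (io->stage == io->stop_at ? DEMO_STOP : DEMO_CONTINUE);
}

static void
demo_execute(demo_io_t *io)
{
	/* The DONE bit must be set, or the scan below would run past it. */
	assert(io->pipeline & (1U << DEMO_STAGE_DONE));
	while (io->stage < DEMO_STAGE_DONE) {
		int stage = io->stage;

		/* Skip forward to the next stage whose bit is set. */
		while (((1U << ++stage) & io->pipeline) == 0)
			continue;
		io->stage = stage;
		if (demo_stage_run(io) == DEMO_STOP)
			return;
	}
}
```

Because the current stage is stored in the I/O itself, calling demo_execute() again after a stop resumes at the following stage, which mirrors how a stalled zio is picked up later by another thread.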
/*
* ==========================================================================
* Initiate I/O, either sync or async
* ==========================================================================
*/
int
zio_wait(zio_t *zio)
{
int error;
ASSERT(zio->io_stage == ZIO_STAGE_OPEN);
ASSERT(zio->io_executor == NULL);
zio->io_waiter = curthread;
zio_execute(zio);
mutex_enter(&zio->io_lock);
while (zio->io_executor != NULL)
cv_wait(&zio->io_cv, &zio->io_lock);
mutex_exit(&zio->io_lock);
error = zio->io_error;
zio_destroy(zio);
return (error);
}
void
zio_nowait(zio_t *zio)
{
ASSERT(zio->io_executor == NULL);
if (zio->io_parent == NULL && zio->io_child_type == ZIO_CHILD_LOGICAL) {
/*
* This is a logical async I/O with no parent to wait for it.
* Attach it to the pool's global async root zio so that
* spa_unload() has a way of waiting for async I/O to finish.
*/
spa_t *spa = zio->io_spa;
zio->io_async_root = B_TRUE;
mutex_enter(&spa->spa_async_root_lock);
spa->spa_async_root_count++;
mutex_exit(&spa->spa_async_root_lock);
}
zio_execute(zio);
}
/*
* ==========================================================================
* Reexecute or suspend/resume failed I/O
* ==========================================================================
*/
static void
zio_reexecute(zio_t *pio)
{
zio_t *zio, *zio_next;
pio->io_flags = pio->io_orig_flags;
pio->io_stage = pio->io_orig_stage;
pio->io_pipeline = pio->io_orig_pipeline;
pio->io_reexecute = 0;
pio->io_error = 0;
for (int c = 0; c < ZIO_CHILD_TYPES; c++)
pio->io_child_error[c] = 0;
if (IO_IS_ALLOCATING(pio)) {
/*
* Remember the failed bp so that the io_ready() callback
* can update its accounting upon reexecution. The block
* was already freed in zio_done(); we indicate this with
* a fill count of -1 so that zio_free() knows to skip it.
*/
blkptr_t *bp = pio->io_bp;
ASSERT(bp->blk_birth == 0 || bp->blk_birth == pio->io_txg);
bp->blk_fill = BLK_FILL_ALREADY_FREED;
pio->io_bp_orig = *bp;
BP_ZERO(bp);
}
/*
* As we reexecute pio's children, new children could be created.
* New children go to the head of the io_child list, however,
* so we will (correctly) not reexecute them. The key is that
* the remainder of the io_child list, from 'zio_next' onward,
* cannot be affected by any side effects of reexecuting 'zio'.
*/
for (zio = pio->io_child; zio != NULL; zio = zio_next) {
zio_next = zio->io_sibling_next;
mutex_enter(&pio->io_lock);
pio->io_children[zio->io_child_type][ZIO_WAIT_READY]++;
pio->io_children[zio->io_child_type][ZIO_WAIT_DONE]++;
mutex_exit(&pio->io_lock);
zio_reexecute(zio);
}
/*
* Now that all children have been reexecuted, execute the parent.
*/
zio_execute(pio);
}
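The child walk in zio_reexecute() saves the next sibling before touching the current child, and relies on new children being linked at the head of the list. The toy list below sketches that property; `demo_node_t` and both helpers are hypothetical, and "visiting" a node stands in for reexecuting a child.

```c
#include <assert.h>
#include <stddef.h>

typedef struct demo_node {
	struct demo_node *next;
	int visits;
} demo_node_t;

/* New nodes always go to the head, as new zio children do. */
static void
demo_push_head(demo_node_t **head, demo_node_t *n)
{
	n->next = *head;
	*head = n;
}

/*
 * Visit every node that was on the list when the walk began.
 * Each visit may add a node from 'spare' at the head; those new
 * nodes sit behind the cursor and are never visited by this walk.
 */
static int
demo_revisit(demo_node_t **head, demo_node_t *spare, int nspare)
{
	demo_node_t *n, *next;
	int used = 0, visited = 0;

	assert(head != NULL);
	for (n = *head; n != NULL; n = next) {
		next = n->next;		/* saved before the visit */
		n->visits++;
		visited++;
		if (used < nspare)
			demo_push_head(head, &spare[used++]);
	}
	return (visited);
}
```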
void
zio_suspend(spa_t *spa, zio_t *zio)
{
if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_PANIC)
fm_panic("Pool '%s' has encountered an uncorrectable I/O "
"failure and the failure mode property for this pool "
"is set to panic.", spa_name(spa));
zfs_ereport_post(FM_EREPORT_ZFS_IO_FAILURE, spa, NULL, NULL, 0, 0);
mutex_enter(&spa->spa_suspend_lock);
if (spa->spa_suspend_zio_root == NULL)
spa->spa_suspend_zio_root = zio_root(spa, NULL, NULL, 0);
spa->spa_suspended = B_TRUE;
if (zio != NULL) {
ASSERT(zio != spa->spa_suspend_zio_root);
ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
ASSERT(zio->io_parent == NULL);
ASSERT(zio->io_stage == ZIO_STAGE_DONE);
zio_add_child(spa->spa_suspend_zio_root, zio);
}
mutex_exit(&spa->spa_suspend_lock);
}
void
zio_resume(spa_t *spa)
{
zio_t *pio, *zio;
/*
* Reexecute all previously suspended I/O.
*/
mutex_enter(&spa->spa_suspend_lock);
spa->spa_suspended = B_FALSE;
cv_broadcast(&spa->spa_suspend_cv);
pio = spa->spa_suspend_zio_root;
spa->spa_suspend_zio_root = NULL;
mutex_exit(&spa->spa_suspend_lock);
if (pio == NULL)
return;
while ((zio = pio->io_child) != NULL) {
zio_remove_child(pio, zio);
zio->io_parent = NULL;
zio_reexecute(zio);
}
ASSERT(pio->io_children[ZIO_CHILD_LOGICAL][ZIO_WAIT_DONE] == 0);
(void) zio_wait(pio);
}
void
zio_resume_wait(spa_t *spa)
{
mutex_enter(&spa->spa_suspend_lock);
while (spa_suspended(spa))
cv_wait(&spa->spa_suspend_cv, &spa->spa_suspend_lock);
mutex_exit(&spa->spa_suspend_lock);
}
/*
* ==========================================================================
* Gang blocks.
*
* A gang block is a collection of small blocks that looks to the DMU
* like one large block. When zio_dva_allocate() cannot find a block
* of the requested size, due to either severe fragmentation or the pool
* being nearly full, it calls zio_write_gang_block() to construct the
* block from smaller fragments.
*
* A gang block consists of a gang header (zio_gbh_phys_t) and up to
* three (SPA_GBH_NBLKPTRS) gang members. The gang header is just like
* an indirect block: it's an array of block pointers. It consumes
* only one sector and hence is allocatable regardless of fragmentation.
* The gang header's bps point to its gang members, which hold the data.
*
* Gang blocks are self-checksumming, using the bp's <vdev, offset, txg>
* as the verifier to ensure uniqueness of the SHA256 checksum.
* Critically, the gang block bp's blk_cksum is the checksum of the data,
* not the gang header. This ensures that data block signatures (needed for
* deduplication) are independent of how the block is physically stored.
*
* Gang blocks can be nested: a gang member may itself be a gang block.
* Thus every gang block is a tree in which root and all interior nodes are
* gang headers, and the leaves are normal blocks that contain user data.
* The root of the gang tree is called the gang leader.
*
* To perform any operation (read, rewrite, free, claim) on a gang block,
* zio_gang_assemble() first assembles the gang tree (minus data leaves)
* in the io_gang_tree field of the original logical i/o by recursively
* reading the gang leader and all gang headers below it. This yields
* an in-core tree containing the contents of every gang header and the
* bps for every constituent of the gang block.
*
* With the gang tree now assembled, zio_gang_issue() just walks the gang tree
* and invokes a callback on each bp. To free a gang block, zio_gang_issue()
* calls zio_free_gang() -- a trivial wrapper around zio_free() -- for each bp.
* zio_claim_gang() provides a similarly trivial wrapper for zio_claim().
* zio_read_gang() is a wrapper around zio_read() that omits reading gang
* headers, since we already have those in io_gang_tree. zio_rewrite_gang()
* performs a zio_rewrite() of the data or, for gang headers, a zio_rewrite()
* of the gang header plus zio_checksum_compute() of the data to update the
* gang header's blk_cksum as described above.
*
* The two-phase assemble/issue model solves the problem of partial failure --
* what if you'd freed part of a gang block but then couldn't read the
* gang header for another part? Assembling the entire gang tree first
* ensures that all the necessary gang header I/O has succeeded before
* starting the actual work of free, claim, or write. Once the gang tree
* is assembled, free and claim are in-memory operations that cannot fail.
*
* In the event that a gang write fails, zio_dva_unallocate() walks the
* gang tree to immediately free (i.e. insert back into the space map)
* everything we've allocated. This ensures that we don't get ENOSPC
* errors during repeated suspend/resume cycles due to a flaky device.
*
* Gang rewrites only happen during sync-to-convergence. If we can't assemble
* the gang tree, we won't modify the block, so we can safely defer the free
* (knowing that the block is still intact). If we *can* assemble the gang
* tree, then even if some of the rewrites fail, zio_dva_unallocate() will free
* each constituent bp and we can allocate a new block on the next sync pass.
*
* In all cases, the gang tree allows complete recovery from partial failure.
* ==========================================================================
*/
static zio_t *
zio_read_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
if (gn != NULL)
return (pio);
return (zio_read(pio, pio->io_spa, bp, data, BP_GET_PSIZE(bp),
NULL, NULL, pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
&pio->io_bookmark));
}
zio_t *
zio_rewrite_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
zio_t *zio;
if (gn != NULL) {
zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
gn->gn_gbh, SPA_GANGBLOCKSIZE, NULL, NULL, pio->io_priority,
ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
/*
* As we rewrite each gang header, the pipeline will compute
* a new gang block header checksum for it; but no one will
* compute a new data checksum, so we do that here. The one
* exception is the gang leader: the pipeline already computed
* its data checksum because that stage precedes gang assembly.
* (Presently, nothing actually uses interior data checksums;
* this is just good hygiene.)
*/
if (gn != pio->io_logical->io_gang_tree) {
zio_checksum_compute(zio, BP_GET_CHECKSUM(bp),
data, BP_GET_PSIZE(bp));
}
} else {
zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
data, BP_GET_PSIZE(bp), NULL, NULL, pio->io_priority,
ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
}
return (zio);
}
/* ARGSUSED */
zio_t *
zio_free_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
return (zio_free(pio, pio->io_spa, pio->io_txg, bp,
NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
}
/* ARGSUSED */
zio_t *
zio_claim_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
return (zio_claim(pio, pio->io_spa, pio->io_txg, bp,
NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
}
static zio_gang_issue_func_t *zio_gang_issue_func[ZIO_TYPES] = {
NULL,
zio_read_gang,
zio_rewrite_gang,
zio_free_gang,
zio_claim_gang,
NULL
};
static void zio_gang_tree_assemble_done(zio_t *zio);
static zio_gang_node_t *
zio_gang_node_alloc(zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn;
ASSERT(*gnpp == NULL);
gn = kmem_zalloc(sizeof (*gn), KM_SLEEP);
gn->gn_gbh = zio_buf_alloc(SPA_GANGBLOCKSIZE);
*gnpp = gn;
return (gn);
}
static void
zio_gang_node_free(zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn = *gnpp;
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
ASSERT(gn->gn_child[g] == NULL);
zio_buf_free(gn->gn_gbh, SPA_GANGBLOCKSIZE);
kmem_free(gn, sizeof (*gn));
*gnpp = NULL;
}
static void
zio_gang_tree_free(zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn = *gnpp;
if (gn == NULL)
return;
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
zio_gang_tree_free(&gn->gn_child[g]);
zio_gang_node_free(gnpp);
}
static void
zio_gang_tree_assemble(zio_t *lio, blkptr_t *bp, zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn = zio_gang_node_alloc(gnpp);
ASSERT(lio->io_logical == lio);
ASSERT(BP_IS_GANG(bp));
zio_nowait(zio_read(lio, lio->io_spa, bp, gn->gn_gbh,
SPA_GANGBLOCKSIZE, zio_gang_tree_assemble_done, gn,
lio->io_priority, ZIO_GANG_CHILD_FLAGS(lio), &lio->io_bookmark));
}
static void
zio_gang_tree_assemble_done(zio_t *zio)
{
zio_t *lio = zio->io_logical;
zio_gang_node_t *gn = zio->io_private;
blkptr_t *bp = zio->io_bp;
ASSERT(zio->io_parent == lio);
ASSERT(zio->io_child == NULL);
if (zio->io_error)
return;
if (BP_SHOULD_BYTESWAP(bp))
byteswap_uint64_array(zio->io_data, zio->io_size);
ASSERT(zio->io_data == gn->gn_gbh);
ASSERT(zio->io_size == SPA_GANGBLOCKSIZE);
ASSERT(gn->gn_gbh->zg_tail.zbt_magic == ZBT_MAGIC);
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
if (!BP_IS_GANG(gbp))
continue;
zio_gang_tree_assemble(lio, gbp, &gn->gn_child[g]);
}
}
static void
zio_gang_tree_issue(zio_t *pio, zio_gang_node_t *gn, blkptr_t *bp, void *data)
{
zio_t *lio = pio->io_logical;
zio_t *zio;
ASSERT(BP_IS_GANG(bp) == !!gn);
ASSERT(BP_GET_CHECKSUM(bp) == BP_GET_CHECKSUM(lio->io_bp));
ASSERT(BP_GET_LSIZE(bp) == BP_GET_PSIZE(bp) || gn == lio->io_gang_tree);
/*
* If you're a gang header, your data is in gn->gn_gbh.
* If you're a gang member, your data is in 'data' and gn == NULL.
*/
zio = zio_gang_issue_func[lio->io_type](pio, bp, gn, data);
if (gn != NULL) {
ASSERT(gn->gn_gbh->zg_tail.zbt_magic == ZBT_MAGIC);
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
if (BP_IS_HOLE(gbp))
continue;
zio_gang_tree_issue(zio, gn->gn_child[g], gbp, data);
data = (char *)data + BP_GET_PSIZE(gbp);
}
}
if (gn == lio->io_gang_tree)
ASSERT3P((char *)lio->io_data + lio->io_size, ==, data);
if (zio != pio)
zio_nowait(zio);
}
static int
zio_gang_assemble(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
ASSERT(BP_IS_GANG(bp) && zio == zio->io_logical);
zio_gang_tree_assemble(zio, bp, &zio->io_gang_tree);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_gang_issue(zio_t *zio)
{
zio_t *lio = zio->io_logical;
blkptr_t *bp = zio->io_bp;
if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
ASSERT(BP_IS_GANG(bp) && zio == lio);
if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
zio_gang_tree_issue(lio, lio->io_gang_tree, bp, lio->io_data);
else
zio_gang_tree_free(&lio->io_gang_tree);
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
return (ZIO_PIPELINE_CONTINUE);
}
static void
zio_write_gang_member_ready(zio_t *zio)
{
zio_t *pio = zio->io_parent;
zio_t *lio = zio->io_logical;
dva_t *cdva = zio->io_bp->blk_dva;
dva_t *pdva = pio->io_bp->blk_dva;
uint64_t asize;
if (BP_IS_HOLE(zio->io_bp))
return;
ASSERT(BP_IS_HOLE(&zio->io_bp_orig));
ASSERT(zio->io_child_type == ZIO_CHILD_GANG);
ASSERT3U(zio->io_prop.zp_ndvas, ==, lio->io_prop.zp_ndvas);
ASSERT3U(zio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(zio->io_bp));
ASSERT3U(pio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(pio->io_bp));
ASSERT3U(BP_GET_NDVAS(zio->io_bp), <=, BP_GET_NDVAS(pio->io_bp));
mutex_enter(&pio->io_lock);
for (int d = 0; d < BP_GET_NDVAS(zio->io_bp); d++) {
ASSERT(DVA_GET_GANG(&pdva[d]));
asize = DVA_GET_ASIZE(&pdva[d]);
asize += DVA_GET_ASIZE(&cdva[d]);
DVA_SET_ASIZE(&pdva[d], asize);
}
mutex_exit(&pio->io_lock);
}
static int
zio_write_gang_block(zio_t *pio)
{
spa_t *spa = pio->io_spa;
blkptr_t *bp = pio->io_bp;
zio_t *lio = pio->io_logical;
zio_t *zio;
zio_gang_node_t *gn, **gnpp;
zio_gbh_phys_t *gbh;
uint64_t txg = pio->io_txg;
uint64_t resid = pio->io_size;
uint64_t lsize;
int ndvas = lio->io_prop.zp_ndvas;
int gbh_ndvas = MIN(ndvas + 1, spa_max_replication(spa));
zio_prop_t zp;
int error;
error = metaslab_alloc(spa, spa->spa_normal_class, SPA_GANGBLOCKSIZE,
bp, gbh_ndvas, txg, pio == lio ? NULL : lio->io_bp,
METASLAB_HINTBP_FAVOR | METASLAB_GANG_HEADER);
if (error) {
pio->io_error = error;
return (ZIO_PIPELINE_CONTINUE);
}
if (pio == lio) {
gnpp = &lio->io_gang_tree;
} else {
gnpp = pio->io_private;
ASSERT(pio->io_ready == zio_write_gang_member_ready);
}
gn = zio_gang_node_alloc(gnpp);
gbh = gn->gn_gbh;
bzero(gbh, SPA_GANGBLOCKSIZE);
/*
* Create the gang header.
*/
zio = zio_rewrite(pio, spa, txg, bp, gbh, SPA_GANGBLOCKSIZE, NULL, NULL,
pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
/*
* Create and nowait the gang children.
*/
for (int g = 0; resid != 0; resid -= lsize, g++) {
lsize = P2ROUNDUP(resid / (SPA_GBH_NBLKPTRS - g),
SPA_MINBLOCKSIZE);
ASSERT(lsize >= SPA_MINBLOCKSIZE && lsize <= resid);
zp.zp_checksum = lio->io_prop.zp_checksum;
zp.zp_compress = ZIO_COMPRESS_OFF;
zp.zp_type = DMU_OT_NONE;
zp.zp_level = 0;
zp.zp_ndvas = lio->io_prop.zp_ndvas;
zio_nowait(zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
(char *)pio->io_data + (pio->io_size - resid), lsize, &zp,
zio_write_gang_member_ready, NULL, &gn->gn_child[g],
pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
&pio->io_bookmark));
}
/*
* Set pio's pipeline to just wait for zio to finish.
*/
pio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
zio_nowait(zio);
return (ZIO_PIPELINE_CONTINUE);
}
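The loop in zio_write_gang_block() divides the remaining bytes evenly over the remaining gang slots and rounds each piece up to the minimum block size. The sketch below reproduces only that arithmetic. The `DEMO_` constants are assumptions standing in for SPA_GBH_NBLKPTRS (three, per the gang block comment) and SPA_MINBLOCKSIZE (assumed here to be 512 bytes).

```c
#include <assert.h>
#include <stdint.h>

#define	DEMO_NBLKPTRS	3	/* stands in for SPA_GBH_NBLKPTRS */
#define	DEMO_MINBLOCK	512ULL	/* assumed value of SPA_MINBLOCKSIZE */

/* Round x up to a multiple of align (align need not be a power of 2). */
static uint64_t
demo_roundup(uint64_t x, uint64_t align)
{
	return ((x + align - 1) / align * align);
}

/*
 * Split 'size' bytes into at most DEMO_NBLKPTRS pieces, storing the
 * piece sizes in out[] and returning how many pieces were produced.
 */
static int
demo_gang_split(uint64_t size, uint64_t out[DEMO_NBLKPTRS])
{
	uint64_t resid = size;
	uint64_t lsize;
	int g;

	assert(size != 0 && size % DEMO_MINBLOCK == 0);
	for (g = 0; resid != 0; resid -= lsize, g++) {
		assert(g < DEMO_NBLKPTRS);
		lsize = demo_roundup(resid / (DEMO_NBLKPTRS - g),
		    DEMO_MINBLOCK);
		out[g] = lsize;
	}
	return (g);
}
```

For a 128 KB write this yields pieces of 44032, 43520, and 43520 bytes. A 1 KB write needs only two 512-byte pieces, so the third slot stays unused.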
/*
* ==========================================================================
* Allocate and free blocks
* ==========================================================================
*/
static int
zio_dva_allocate(zio_t *zio)
{
spa_t *spa = zio->io_spa;
metaslab_class_t *mc = spa->spa_normal_class;
blkptr_t *bp = zio->io_bp;
int error;
ASSERT(BP_IS_HOLE(bp));
ASSERT3U(BP_GET_NDVAS(bp), ==, 0);
ASSERT3U(zio->io_prop.zp_ndvas, >, 0);
ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
error = metaslab_alloc(spa, mc, zio->io_size, bp,
zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);
if (error) {
if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE)
return (zio_write_gang_block(zio));
zio->io_error = error;
}
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_dva_free(zio_t *zio)
{
metaslab_free(zio->io_spa, zio->io_bp, zio->io_txg, B_FALSE);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_dva_claim(zio_t *zio)
{
int error;
error = metaslab_claim(zio->io_spa, zio->io_bp, zio->io_txg);
if (error)
zio->io_error = error;
return (ZIO_PIPELINE_CONTINUE);
}
/*
* Undo an allocation. This is used by zio_done() when an I/O fails
* and we want to give back the block we just allocated.
* This handles both normal blocks and gang blocks.
*/
static void
zio_dva_unallocate(zio_t *zio, zio_gang_node_t *gn, blkptr_t *bp)
{
spa_t *spa = zio->io_spa;
boolean_t now = !(zio->io_flags & ZIO_FLAG_IO_REWRITE);
ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
if (zio->io_bp == bp && !now) {
/*
* This is a rewrite for sync-to-convergence.
* We can't do a metaslab_free(NOW) because bp wasn't allocated
* during this sync pass, which means that metaslab_sync()
* already committed the allocation.
*/
ASSERT(DVA_EQUAL(BP_IDENTITY(bp),
BP_IDENTITY(&zio->io_bp_orig)));
ASSERT(spa_sync_pass(spa) > 1);
if (BP_IS_GANG(bp) && gn == NULL) {
/*
* This is a gang leader whose gang header(s) we
* couldn't read now, so defer the free until later.
* The block should still be intact because without
* the headers, we'd never even start the rewrite.
*/
bplist_enqueue_deferred(&spa->spa_sync_bplist, bp);
return;
}
}
if (!BP_IS_HOLE(bp))
metaslab_free(spa, bp, bp->blk_birth, now);
if (gn != NULL) {
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
zio_dva_unallocate(zio, gn->gn_child[g],
&gn->gn_gbh->zg_blkptr[g]);
}
}
}
/*
* Try to allocate an intent log block. Return 0 on success, errno on failure.
*/
int
zio_alloc_blk(spa_t *spa, uint64_t size, blkptr_t *new_bp, blkptr_t *old_bp,
uint64_t txg)
{
int error;
error = metaslab_alloc(spa, spa->spa_log_class, size,
new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID);
if (error)
error = metaslab_alloc(spa, spa->spa_normal_class, size,
new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID);
if (error == 0) {
BP_SET_LSIZE(new_bp, size);
BP_SET_PSIZE(new_bp, size);
BP_SET_COMPRESS(new_bp, ZIO_COMPRESS_OFF);
BP_SET_CHECKSUM(new_bp, ZIO_CHECKSUM_ZILOG);
BP_SET_TYPE(new_bp, DMU_OT_INTENT_LOG);
BP_SET_LEVEL(new_bp, 0);
BP_SET_BYTEORDER(new_bp, ZFS_HOST_BYTEORDER);
}
return (error);
}
/*
* Free an intent log block. We know it can't be a gang block, so there's
* nothing to do except metaslab_free() it.
*/
void
zio_free_blk(spa_t *spa, blkptr_t *bp, uint64_t txg)
{
ASSERT(!BP_IS_GANG(bp));
metaslab_free(spa, bp, txg, B_FALSE);
}
/*
* ==========================================================================
* Read and write to physical devices
* ==========================================================================
*/
static void
zio_vdev_io_probe_done(zio_t *zio)
{
zio_t *dio;
vdev_t *vd = zio->io_private;
mutex_enter(&vd->vdev_probe_lock);
ASSERT(vd->vdev_probe_zio == zio);
vd->vdev_probe_zio = NULL;
mutex_exit(&vd->vdev_probe_lock);
while ((dio = zio->io_delegate_list) != NULL) {
zio->io_delegate_list = dio->io_delegate_next;
dio->io_delegate_next = NULL;
if (!vdev_accessible(vd, dio))
dio->io_error = ENXIO;
zio_execute(dio);
}
}
/*
* Probe the device to determine whether I/O failure is specific to this
* zio (e.g. a bad sector) or affects the entire vdev (e.g. unplugged).
*/
static int
zio_vdev_io_probe(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
zio_t *pio = NULL;
boolean_t created_pio = B_FALSE;
/*
* Don't probe the probe.
*/
if (zio->io_flags & ZIO_FLAG_PROBE)
return (ZIO_PIPELINE_CONTINUE);
/*
* To prevent 'probe storms' when a device fails, we create
* just one probe I/O at a time. All zios that want to probe
* this vdev will join the probe zio's io_delegate_list.
*/
mutex_enter(&vd->vdev_probe_lock);
if ((pio = vd->vdev_probe_zio) == NULL) {
vd->vdev_probe_zio = pio = zio_root(zio->io_spa,
zio_vdev_io_probe_done, vd, ZIO_FLAG_CANFAIL);
created_pio = B_TRUE;
vd->vdev_probe_wanted = B_TRUE;
spa_async_request(zio->io_spa, SPA_ASYNC_PROBE);
}
zio->io_delegate_next = pio->io_delegate_list;
pio->io_delegate_list = zio;
mutex_exit(&vd->vdev_probe_lock);
if (created_pio) {
zio_nowait(vdev_probe(vd, pio));
zio_nowait(pio);
}
return (ZIO_PIPELINE_STOP);
}
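The delegate list used by zio_vdev_io_probe() and zio_vdev_io_probe_done() can be sketched as a single-threaded toy: the first waiter starts the probe, later waiters only join the list, and completion drains it. All `demo_` names are hypothetical, and the locking that vdev_probe_lock provides in the source is deliberately left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef struct demo_io {
	struct demo_io *delegate_next;
	int error;
	int resumed;
} demo_io_t;

typedef struct demo_dev {
	demo_io_t *probe_list;	/* waiters for the in-flight probe */
	int probe_active;	/* nonzero while a probe is outstanding */
	int probes_started;	/* how many probes were ever started */
	int accessible;		/* what the probe will conclude */
} demo_dev_t;

/* Join the current probe, starting one only if none is in flight. */
static void
demo_probe_join(demo_dev_t *dev, demo_io_t *io)
{
	assert(io->delegate_next == NULL);
	if (!dev->probe_active) {
		dev->probe_active = 1;
		dev->probes_started++;
	}
	io->delegate_next = dev->probe_list;
	dev->probe_list = io;
}

/* Probe finished: fail or resume every waiter; return how many. */
static int
demo_probe_done(demo_dev_t *dev)
{
	demo_io_t *io;
	int n = 0;

	dev->probe_active = 0;
	while ((io = dev->probe_list) != NULL) {
		dev->probe_list = io->delegate_next;
		io->delegate_next = NULL;
		if (!dev->accessible)
			io->error = ENXIO;
		io->resumed = 1;
		n++;
	}
	return (n);
}
```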
static int
zio_vdev_io_start(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
uint64_t align;
spa_t *spa = zio->io_spa;
ASSERT(zio->io_error == 0);
ASSERT(zio->io_child_error[ZIO_CHILD_VDEV] == 0);
if (vd == NULL) {
if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
spa_config_enter(spa, SCL_ZIO, zio, RW_READER);
/*
* The mirror_ops handle multiple DVAs in a single BP.
*/
return (vdev_mirror_ops.vdev_op_io_start(zio));
}
align = 1ULL << vd->vdev_top->vdev_ashift;
if (P2PHASE(zio->io_size, align) != 0) {
uint64_t asize = P2ROUNDUP(zio->io_size, align);
char *abuf = zio_buf_alloc(asize);
ASSERT(vd == vd->vdev_top);
if (zio->io_type == ZIO_TYPE_WRITE) {
bcopy(zio->io_data, abuf, zio->io_size);
bzero(abuf + zio->io_size, asize - zio->io_size);
}
zio_push_transform(zio, abuf, asize, asize, zio_subblock);
}
ASSERT(P2PHASE(zio->io_offset, align) == 0);
ASSERT(P2PHASE(zio->io_size, align) == 0);
ASSERT(zio->io_type != ZIO_TYPE_WRITE || (spa_mode & FWRITE));
if (vd->vdev_ops->vdev_op_leaf &&
(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE)) {
if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio) == 0)
return (ZIO_PIPELINE_STOP);
if ((zio = vdev_queue_io(zio)) == NULL)
return (ZIO_PIPELINE_STOP);
if (!vdev_accessible(vd, zio)) {
zio->io_error = ENXIO;
zio_interrupt(zio);
return (ZIO_PIPELINE_STOP);
}
}
return (vd->vdev_ops->vdev_op_io_start(zio));
}
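The alignment step in zio_vdev_io_start() pads an I/O whose size is not a multiple of the top-level vdev's sector size (1 << ashift). The sketch below shows only the size computation. The `demo_` helpers are hypothetical stand-ins for P2PHASE and P2ROUNDUP and assume a power-of-two alignment.

```c
#include <assert.h>
#include <stdint.h>

/* Remainder of x modulo a power-of-two align. */
static uint64_t
demo_p2phase(uint64_t x, uint64_t align)
{
	return (x & (align - 1));
}

/* Round x up to the next multiple of a power-of-two align. */
static uint64_t
demo_p2roundup(uint64_t x, uint64_t align)
{
	return ((x + align - 1) & ~(align - 1));
}

/* Size of the buffer actually sent to a device with this ashift. */
static uint64_t
demo_padded_size(uint64_t size, unsigned ashift)
{
	uint64_t align;

	assert(ashift < 64);
	align = 1ULL << ashift;
	if (demo_p2phase(size, align) != 0)
		return (demo_p2roundup(size, align));
	return (size);
}
```

With ashift 12 (4 KB sectors), a 512-byte I/O is padded to 4096 bytes. For writes, the source zero-fills the tail of the padded buffer before issuing it.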
static int
zio_vdev_io_done(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
boolean_t unexpected_error = B_FALSE;
if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);
if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {
vdev_queue_io_done(zio);
if (zio->io_type == ZIO_TYPE_WRITE)
vdev_cache_write(zio);
if (zio_injection_enabled && zio->io_error == 0)
zio->io_error = zio_handle_device_injection(vd, EIO);
if (zio_injection_enabled && zio->io_error == 0)
zio->io_error = zio_handle_label_injection(zio, EIO);
if (zio->io_error) {
if (!vdev_accessible(vd, zio)) {
zio->io_error = ENXIO;
} else {
unexpected_error = B_TRUE;
}
}
}
ops->vdev_op_io_done(zio);
if (unexpected_error)
return (zio_vdev_io_probe(zio));
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_vdev_io_assess(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
spa_config_exit(zio->io_spa, SCL_ZIO, zio);
if (zio->io_vsd != NULL) {
zio->io_vsd_free(zio);
zio->io_vsd = NULL;
}
if (zio_injection_enabled && zio->io_error == 0)
zio->io_error = zio_handle_fault_injection(zio, EIO);
/*
* If the I/O failed, determine whether we should attempt to retry it.
*/
if (zio->io_error && vd == NULL &&
!(zio->io_flags & (ZIO_FLAG_DONT_RETRY | ZIO_FLAG_IO_RETRY))) {
ASSERT(!(zio->io_flags & ZIO_FLAG_DONT_QUEUE)); /* not a leaf */
ASSERT(!(zio->io_flags & ZIO_FLAG_IO_BYPASS)); /* not a leaf */
zio->io_error = 0;
zio->io_flags |= ZIO_FLAG_IO_RETRY |
ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_AGGREGATE;
zio->io_stage = ZIO_STAGE_VDEV_IO_START - 1;
zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
return (ZIO_PIPELINE_STOP);
}
/*
* If we got an error on a leaf device, convert it to ENXIO
* if the device is not accessible at all.
*/
if (zio->io_error && vd != NULL && vd->vdev_ops->vdev_op_leaf &&
!vdev_accessible(vd, zio))
zio->io_error = ENXIO;
/*
* If we can't write to an interior vdev (mirror or RAID-Z),
* set vdev_cant_write so that we stop trying to allocate from it.
*/
if (zio->io_error == ENXIO && zio->io_type == ZIO_TYPE_WRITE &&
vd != NULL && !vd->vdev_ops->vdev_op_leaf)
vd->vdev_cant_write = B_TRUE;
if (zio->io_error)
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
return (ZIO_PIPELINE_CONTINUE);
}
void
zio_vdev_io_reissue(zio_t *zio)
{
ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
ASSERT(zio->io_error == 0);
zio->io_stage--;
}
void
zio_vdev_io_redone(zio_t *zio)
{
ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_DONE);
zio->io_stage--;
}
void
zio_vdev_io_bypass(zio_t *zio)
{
ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
ASSERT(zio->io_error == 0);
zio->io_flags |= ZIO_FLAG_IO_BYPASS;
zio->io_stage = ZIO_STAGE_VDEV_IO_ASSESS - 1;
}
/*
* ==========================================================================
* Generate and verify checksums
* ==========================================================================
*/
static int
zio_checksum_generate(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
enum zio_checksum checksum;
if (bp == NULL) {
/*
* This is zio_write_phys().
* We're either generating a label checksum, or none at all.
*/
checksum = zio->io_prop.zp_checksum;
if (checksum == ZIO_CHECKSUM_OFF)
return (ZIO_PIPELINE_CONTINUE);
ASSERT(checksum == ZIO_CHECKSUM_LABEL);
} else {
if (BP_IS_GANG(bp) && zio->io_child_type == ZIO_CHILD_GANG) {
ASSERT(!IO_IS_ALLOCATING(zio));
checksum = ZIO_CHECKSUM_GANG_HEADER;
} else {
checksum = BP_GET_CHECKSUM(bp);
}
}
zio_checksum_compute(zio, checksum, zio->io_data, zio->io_size);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_checksum_verify(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
int error;
if (bp == NULL) {
/*
* This is zio_read_phys().
* We're either verifying a label checksum, or nothing at all.
*/
if (zio->io_prop.zp_checksum == ZIO_CHECKSUM_OFF)
return (ZIO_PIPELINE_CONTINUE);
ASSERT(zio->io_prop.zp_checksum == ZIO_CHECKSUM_LABEL);
}
if ((error = zio_checksum_error(zio)) != 0) {
zio->io_error = error;
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
zio->io_spa, zio->io_vd, zio, 0, 0);
}
}
return (ZIO_PIPELINE_CONTINUE);
}
/*
* Called by RAID-Z to ensure we don't compute the checksum twice.
*/
void
zio_checksum_verified(zio_t *zio)
{
zio->io_pipeline &= ~(1U << ZIO_STAGE_CHECKSUM_VERIFY);
}
/*
* ==========================================================================
* Error rank. Errors are ranked in the order 0, ENXIO, ECKSUM, EIO, other.
* An error of 0 indicates success. ENXIO indicates whole-device failure,
* which may be transient (e.g. unplugged) or permanent. ECKSUM and EIO
* indicate errors that are specific to one I/O, and most likely permanent.
* Any other error is presumed to be worse because we weren't expecting it.
* ==========================================================================
*/
int
zio_worst_error(int e1, int e2)
{
static int zio_error_rank[] = { 0, ENXIO, ECKSUM, EIO };
int r1, r2;
for (r1 = 0; r1 < sizeof (zio_error_rank) / sizeof (int); r1++)
if (e1 == zio_error_rank[r1])
break;
for (r2 = 0; r2 < sizeof (zio_error_rank) / sizeof (int); r2++)
if (e2 == zio_error_rank[r2])
break;
return (r1 > r2 ? e1 : e2);
}
/*
* ==========================================================================
* I/O completion
* ==========================================================================
*/
static int
zio_ready(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
zio_t *pio = zio->io_parent;
if (zio->io_ready) {
if (BP_IS_GANG(bp) &&
zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY))
return (ZIO_PIPELINE_STOP);
ASSERT(IO_IS_ALLOCATING(zio));
ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
ASSERT(zio->io_children[ZIO_CHILD_GANG][ZIO_WAIT_READY] == 0);
zio->io_ready(zio);
}
if (bp != NULL && bp != &zio->io_bp_copy)
zio->io_bp_copy = *bp;
if (zio->io_error)
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
if (pio != NULL)
zio_notify_parent(pio, zio, ZIO_WAIT_READY);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_done(zio_t *zio)
{
spa_t *spa = zio->io_spa;
zio_t *pio = zio->io_parent;
zio_t *lio = zio->io_logical;
blkptr_t *bp = zio->io_bp;
vdev_t *vd = zio->io_vd;
uint64_t psize = zio->io_size;
/*
* If our children haven't all completed,
* wait for them and then repeat this pipeline stage.
*/
if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) ||
zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) ||
zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
for (int c = 0; c < ZIO_CHILD_TYPES; c++)
for (int w = 0; w < ZIO_WAIT_TYPES; w++)
ASSERT(zio->io_children[c][w] == 0);
if (bp != NULL) {
ASSERT(bp->blk_pad[0] == 0);
ASSERT(bp->blk_pad[1] == 0);
ASSERT(bp->blk_pad[2] == 0);
ASSERT(bcmp(bp, &zio->io_bp_copy, sizeof (blkptr_t)) == 0 ||
(pio != NULL && bp == pio->io_bp));
if (zio->io_type == ZIO_TYPE_WRITE && !BP_IS_HOLE(bp) &&
!(zio->io_flags & ZIO_FLAG_IO_REPAIR)) {
ASSERT(!BP_SHOULD_BYTESWAP(bp));
ASSERT3U(zio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(bp));
ASSERT(BP_COUNT_GANG(bp) == 0 ||
(BP_COUNT_GANG(bp) == BP_GET_NDVAS(bp)));
}
}
/*
* If there were child vdev or gang errors, they apply to us now.
*/
zio_inherit_child_errors(zio, ZIO_CHILD_VDEV);
zio_inherit_child_errors(zio, ZIO_CHILD_GANG);
zio_pop_transforms(zio); /* note: may set zio->io_error */
vdev_stat_update(zio, psize);
if (zio->io_error) {
/*
* If this I/O is attached to a particular vdev,
* generate an error message describing the I/O failure
* at the block level. We ignore these errors if the
* device is currently unavailable.
*/
if (zio->io_error != ECKSUM && vd != NULL && !vdev_is_dead(vd))
zfs_ereport_post(FM_EREPORT_ZFS_IO, spa, vd, zio, 0, 0);
if ((zio->io_error == EIO ||
!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) && zio == lio) {
/*
* For logical I/O requests, tell the SPA to log the
* error and generate a logical data ereport.
*/
spa_log_error(spa, zio);
zfs_ereport_post(FM_EREPORT_ZFS_DATA, spa, NULL, zio,
0, 0);
}
}
if (zio->io_error && zio == lio) {
/*
* Determine whether zio should be reexecuted. This will
* propagate all the way to the root via zio_notify_parent().
*/
ASSERT(vd == NULL && bp != NULL);
if (IO_IS_ALLOCATING(zio)) {
if (zio->io_error != ENOSPC)
zio->io_reexecute |= ZIO_REEXECUTE_NOW;
else
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
}
if ((zio->io_type == ZIO_TYPE_READ ||
zio->io_type == ZIO_TYPE_FREE) &&
zio->io_error == ENXIO &&
spa_get_failmode(spa) != ZIO_FAILURE_MODE_CONTINUE)
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
if (!(zio->io_flags & ZIO_FLAG_CANFAIL) && !zio->io_reexecute)
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
}
/*
* If there were logical child errors, they apply to us now.
* We defer this until now to avoid conflating logical child
* errors with errors that happened to the zio itself when
* updating vdev stats and reporting FMA events above.
*/
zio_inherit_child_errors(zio, ZIO_CHILD_LOGICAL);
if (zio->io_reexecute) {
/*
* This is a logical I/O that wants to reexecute.
*
* Reexecute is top-down. When an i/o fails, if it's not
* the root, it simply notifies its parent and sticks around.
* The parent, seeing that it still has children in zio_done(),
* does the same. This percolates all the way up to the root.
* The root i/o will reexecute or suspend the entire tree.
*
* This approach ensures that zio_reexecute() honors
* all the original i/o dependency relationships, e.g.
* parents not executing until children are ready.
*/
ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
if (IO_IS_ALLOCATING(zio))
zio_dva_unallocate(zio, zio->io_gang_tree, bp);
zio_gang_tree_free(&zio->io_gang_tree);
if (pio != NULL) {
/*
* We're not a root i/o, so there's nothing to do
* but notify our parent. Don't propagate errors
* upward since we haven't permanently failed yet.
*/
zio->io_flags |= ZIO_FLAG_DONT_PROPAGATE;
zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
} else if (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND) {
/*
* We'd fail again if we reexecuted now, so suspend
* until conditions improve (e.g. device comes online).
*/
zio_suspend(spa, zio);
} else {
/*
* Reexecution is potentially a huge amount of work.
* Hand it off to the otherwise-unused claim taskq.
*/
(void) taskq_dispatch_safe(
spa->spa_zio_taskq[ZIO_TYPE_CLAIM][ZIO_TASKQ_ISSUE],
(task_func_t *)zio_reexecute, zio, &zio->io_task);
}
return (ZIO_PIPELINE_STOP);
}
ASSERT(zio->io_child == NULL);
ASSERT(zio->io_reexecute == 0);
ASSERT(zio->io_error == 0 || (zio->io_flags & ZIO_FLAG_CANFAIL));
if (zio->io_done)
zio->io_done(zio);
zio_gang_tree_free(&zio->io_gang_tree);
ASSERT(zio->io_delegate_list == NULL);
ASSERT(zio->io_delegate_next == NULL);
if (pio != NULL) {
zio_remove_child(pio, zio);
zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
}
if (zio->io_waiter != NULL) {
mutex_enter(&zio->io_lock);
zio->io_executor = NULL;
cv_broadcast(&zio->io_cv);
mutex_exit(&zio->io_lock);
} else {
zio_destroy(zio);
}
return (ZIO_PIPELINE_STOP);
}
/*
* ==========================================================================
* I/O pipeline definition
* ==========================================================================
*/
static zio_pipe_stage_t *zio_pipeline[ZIO_STAGES] = {
NULL,
zio_issue_async,
zio_read_bp_init,
zio_write_bp_init,
zio_checksum_generate,
zio_gang_assemble,
zio_gang_issue,
zio_dva_allocate,
zio_dva_free,
zio_dva_claim,
zio_ready,
zio_vdev_io_start,
zio_vdev_io_done,
zio_vdev_io_assess,
zio_checksum_verify,
zio_done
};
Index: stable/8/sys/cddl/contrib/opensolaris
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris (revision 209274)
Property changes on: stable/8/sys/cddl/contrib/opensolaris
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/cddl/contrib/opensolaris:r209093-209101
Index: stable/8/sys/contrib/dev/acpica
===================================================================
--- stable/8/sys/contrib/dev/acpica (revision 209273)
+++ stable/8/sys/contrib/dev/acpica (revision 209274)
Property changes on: stable/8/sys/contrib/dev/acpica
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/contrib/dev/acpica:r209093-209101
Index: stable/8/sys/contrib/pf
===================================================================
--- stable/8/sys/contrib/pf (revision 209273)
+++ stable/8/sys/contrib/pf (revision 209274)
Property changes on: stable/8/sys/contrib/pf
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/contrib/pf:r209093-209101
Index: stable/8/sys/dev/xen/xenpci
===================================================================
--- stable/8/sys/dev/xen/xenpci (revision 209273)
+++ stable/8/sys/dev/xen/xenpci (revision 209274)
Property changes on: stable/8/sys/dev/xen/xenpci
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/dev/xen/xenpci:r209093-209101
Index: stable/8/sys/geom/sched
===================================================================
--- stable/8/sys/geom/sched (revision 209273)
+++ stable/8/sys/geom/sched (revision 209274)
Property changes on: stable/8/sys/geom/sched
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/geom/sched:r209093-209101
Index: stable/8/sys
===================================================================
--- stable/8/sys (revision 209273)
+++ stable/8/sys (revision 209274)
Property changes on: stable/8/sys
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys:r209093-209101
