Page Menu
Home
FreeBSD
Search
Configure Global Search
Log In
Files
F149891595
No One
Temporary
Actions
Download File
Edit File
Delete File
View Transforms
Subscribe
Mute Notifications
Flag For Later
Award Token
Size
571 KB
Referenced Files
None
Subscribers
None
View Options
This file is larger than 256 KB, so syntax highlighting was skipped.
Index: stable/8/sys/amd64/include/xen
===================================================================
--- stable/8/sys/amd64/include/xen (revision 209273)
+++ stable/8/sys/amd64/include/xen (revision 209274)
Property changes on: stable/8/sys/amd64/include/xen
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/amd64/include/xen:r209093-209101
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c (revision 209274)
@@ -1,5023 +1,5023 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/*
* DVA-based Adjustable Replacement Cache
*
* While much of the theory of operation used here is
* based on the self-tuning, low overhead replacement cache
* presented by Megiddo and Modha at FAST 2003, there are some
* significant differences:
*
* 1. The Megiddo and Modha model assumes any page is evictable.
* Pages in its cache cannot be "locked" into memory. This makes
* the eviction algorithm simple: evict the last page in the list.
* This also make the performance characteristics easy to reason
* about. Our cache is not so simple. At any given moment, some
* subset of the blocks in the cache are un-evictable because we
* have handed out a reference to them. Blocks are only evictable
* when there are no external references active. This makes
* eviction far more problematic: we choose to evict the evictable
* blocks that are the "lowest" in the list.
*
* There are times when it is not possible to evict the requested
* space. In these circumstances we are unable to adjust the cache
* size. To prevent the cache growing unbounded at these times we
* implement a "cache throttle" that slows the flow of new data
* into the cache until we can make space available.
*
* 2. The Megiddo and Modha model assumes a fixed cache size.
* Pages are evicted when the cache is full and there is a cache
* miss. Our model has a variable sized cache. It grows with
* high use, but also tries to react to memory pressure from the
* operating system: decreasing its size when system memory is
* tight.
*
* 3. The Megiddo and Modha model assumes a fixed page size. All
* elements of the cache are therefor exactly the same size. So
* when adjusting the cache size following a cache miss, its simply
* a matter of choosing a single page to evict. In our model, we
* have variable sized cache blocks (rangeing from 512 bytes to
* 128K bytes). We therefor choose a set of blocks to evict to make
* space for a cache miss that approximates as closely as possible
* the space used by the new block.
*
* See also: "ARC: A Self-Tuning, Low Overhead Replacement Cache"
* by N. Megiddo & D. Modha, FAST 2003
*/
/*
* The locking model:
*
* A new reference to a cache buffer can be obtained in two
* ways: 1) via a hash table lookup using the DVA as a key,
* or 2) via one of the ARC lists. The arc_read() interface
* uses method 1, while the internal arc algorithms for
* adjusting the cache use method 2. We therefor provide two
* types of locks: 1) the hash table lock array, and 2) the
* arc list locks.
*
* Buffers do not have their own mutexs, rather they rely on the
* hash table mutexs for the bulk of their protection (i.e. most
* fields in the arc_buf_hdr_t are protected by these mutexs).
*
* buf_hash_find() returns the appropriate mutex (held) when it
* locates the requested buffer in the hash table. It returns
* NULL for the mutex if the buffer was not in the table.
*
* buf_hash_remove() expects the appropriate hash mutex to be
* already held before it is invoked.
*
* Each arc state also has a mutex which is used to protect the
* buffer list associated with the state. When attempting to
* obtain a hash table lock while holding an arc list lock you
* must use: mutex_tryenter() to avoid deadlock. Also note that
* the active state mutex must be held before the ghost state mutex.
*
* Arc buffers may have an associated eviction callback function.
* This function will be invoked prior to removing the buffer (e.g.
* in arc_do_user_evicts()). Note however that the data associated
* with the buffer may be evicted prior to the callback. The callback
* must be made with *no locks held* (to prevent deadlock). Additionally,
* the users of callbacks must ensure that their private data is
* protected from simultaneous callbacks from arc_buf_evict()
* and arc_do_user_evicts().
*
* Note that the majority of the performance stats are manipulated
* with atomic operations.
*
* The L2ARC uses the l2arc_buflist_mtx global mutex for the following:
*
* - L2ARC buflist creation
* - L2ARC buflist eviction
* - L2ARC write completion, which walks L2ARC buflists
* - ARC header destruction, as it removes from L2ARC buflists
* - ARC header release, as it removes from L2ARC buflists
*/
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/zio_checksum.h>
#include <sys/zfs_context.h>
#include <sys/arc.h>
#include <sys/refcount.h>
#include <sys/vdev.h>
#ifdef _KERNEL
#include <sys/dnlc.h>
#endif
#include <sys/callb.h>
#include <sys/kstat.h>
#include <sys/sdt.h>
#include <vm/vm_pageout.h>
static kmutex_t arc_reclaim_thr_lock;
static kcondvar_t arc_reclaim_thr_cv; /* used to signal reclaim thr */
static uint8_t arc_thread_exit;
extern int zfs_write_limit_shift;
extern uint64_t zfs_write_limit_max;
extern kmutex_t zfs_write_limit_lock;
#define ARC_REDUCE_DNLC_PERCENT 3
uint_t arc_reduce_dnlc_percent = ARC_REDUCE_DNLC_PERCENT;
typedef enum arc_reclaim_strategy {
ARC_RECLAIM_AGGR, /* Aggressive reclaim strategy */
ARC_RECLAIM_CONS /* Conservative reclaim strategy */
} arc_reclaim_strategy_t;
/* number of seconds before growing cache again */
static int arc_grow_retry = 60;
/* shift of arc_c for calculating both min and max arc_p */
static int arc_p_min_shift = 4;
/* log2(fraction of arc to reclaim) */
static int arc_shrink_shift = 5;
/*
* minimum lifespan of a prefetch block in clock ticks
* (initialized in arc_init())
*/
static int arc_min_prefetch_lifespan;
static int arc_dead;
extern int zfs_prefetch_disable;
/*
* The arc has filled available memory and has now warmed up.
*/
static boolean_t arc_warm;
/*
* These tunables are for performance analysis.
*/
uint64_t zfs_arc_max;
uint64_t zfs_arc_min;
uint64_t zfs_arc_meta_limit = 0;
int zfs_mdcomp_disable = 0;
int zfs_arc_grow_retry = 0;
int zfs_arc_shrink_shift = 0;
int zfs_arc_p_min_shift = 0;
TUNABLE_QUAD("vfs.zfs.arc_max", &zfs_arc_max);
TUNABLE_QUAD("vfs.zfs.arc_min", &zfs_arc_min);
TUNABLE_QUAD("vfs.zfs.arc_meta_limit", &zfs_arc_meta_limit);
TUNABLE_INT("vfs.zfs.mdcomp_disable", &zfs_mdcomp_disable);
SYSCTL_DECL(_vfs_zfs);
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_max, CTLFLAG_RDTUN, &zfs_arc_max, 0,
"Maximum ARC size");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_min, CTLFLAG_RDTUN, &zfs_arc_min, 0,
"Minimum ARC size");
SYSCTL_INT(_vfs_zfs, OID_AUTO, mdcomp_disable, CTLFLAG_RDTUN,
&zfs_mdcomp_disable, 0, "Disable metadata compression");
/*
* Note that buffers can be in one of 6 states:
* ARC_anon - anonymous (discussed below)
* ARC_mru - recently used, currently cached
* ARC_mru_ghost - recentely used, no longer in cache
* ARC_mfu - frequently used, currently cached
* ARC_mfu_ghost - frequently used, no longer in cache
* ARC_l2c_only - exists in L2ARC but not other states
* When there are no active references to the buffer, they are
* are linked onto a list in one of these arc states. These are
* the only buffers that can be evicted or deleted. Within each
* state there are multiple lists, one for meta-data and one for
* non-meta-data. Meta-data (indirect blocks, blocks of dnodes,
* etc.) is tracked separately so that it can be managed more
* explicitly: favored over data, limited explicitly.
*
* Anonymous buffers are buffers that are not associated with
* a DVA. These are buffers that hold dirty block copies
* before they are written to stable storage. By definition,
* they are "ref'd" and are considered part of arc_mru
* that cannot be freed. Generally, they will aquire a DVA
* as they are written and migrate onto the arc_mru list.
*
* The ARC_l2c_only state is for buffers that are in the second
* level ARC but no longer in any of the ARC_m* lists. The second
* level ARC itself may also contain buffers that are in any of
* the ARC_m* states - meaning that a buffer can exist in two
* places. The reason for the ARC_l2c_only state is to keep the
* buffer header in the hash table, so that reads that hit the
* second level ARC benefit from these fast lookups.
*/
#define ARCS_LOCK_PAD CACHE_LINE_SIZE
struct arcs_lock {
kmutex_t arcs_lock;
#ifdef _KERNEL
unsigned char pad[(ARCS_LOCK_PAD - sizeof (kmutex_t))];
#endif
};
/*
* must be power of two for mask use to work
*
*/
#define ARC_BUFC_NUMDATALISTS 16
#define ARC_BUFC_NUMMETADATALISTS 16
#define ARC_BUFC_NUMLISTS (ARC_BUFC_NUMMETADATALISTS + ARC_BUFC_NUMDATALISTS)
typedef struct arc_state {
uint64_t arcs_lsize[ARC_BUFC_NUMTYPES]; /* amount of evictable data */
uint64_t arcs_size; /* total amount of data in this state */
list_t arcs_lists[ARC_BUFC_NUMLISTS]; /* list of evictable buffers */
struct arcs_lock arcs_locks[ARC_BUFC_NUMLISTS] __aligned(CACHE_LINE_SIZE);
} arc_state_t;
#define ARCS_LOCK(s, i) (&((s)->arcs_locks[(i)].arcs_lock))
/* The 6 states: */
static arc_state_t ARC_anon;
static arc_state_t ARC_mru;
static arc_state_t ARC_mru_ghost;
static arc_state_t ARC_mfu;
static arc_state_t ARC_mfu_ghost;
static arc_state_t ARC_l2c_only;
typedef struct arc_stats {
kstat_named_t arcstat_hits;
kstat_named_t arcstat_misses;
kstat_named_t arcstat_demand_data_hits;
kstat_named_t arcstat_demand_data_misses;
kstat_named_t arcstat_demand_metadata_hits;
kstat_named_t arcstat_demand_metadata_misses;
kstat_named_t arcstat_prefetch_data_hits;
kstat_named_t arcstat_prefetch_data_misses;
kstat_named_t arcstat_prefetch_metadata_hits;
kstat_named_t arcstat_prefetch_metadata_misses;
kstat_named_t arcstat_mru_hits;
kstat_named_t arcstat_mru_ghost_hits;
kstat_named_t arcstat_mfu_hits;
kstat_named_t arcstat_mfu_ghost_hits;
kstat_named_t arcstat_allocated;
kstat_named_t arcstat_deleted;
kstat_named_t arcstat_stolen;
kstat_named_t arcstat_recycle_miss;
kstat_named_t arcstat_mutex_miss;
kstat_named_t arcstat_evict_skip;
kstat_named_t arcstat_evict_l2_cached;
kstat_named_t arcstat_evict_l2_eligible;
kstat_named_t arcstat_evict_l2_ineligible;
kstat_named_t arcstat_hash_elements;
kstat_named_t arcstat_hash_elements_max;
kstat_named_t arcstat_hash_collisions;
kstat_named_t arcstat_hash_chains;
kstat_named_t arcstat_hash_chain_max;
kstat_named_t arcstat_p;
kstat_named_t arcstat_c;
kstat_named_t arcstat_c_min;
kstat_named_t arcstat_c_max;
kstat_named_t arcstat_size;
kstat_named_t arcstat_hdr_size;
kstat_named_t arcstat_data_size;
kstat_named_t arcstat_other_size;
kstat_named_t arcstat_l2_hits;
kstat_named_t arcstat_l2_misses;
kstat_named_t arcstat_l2_feeds;
kstat_named_t arcstat_l2_rw_clash;
kstat_named_t arcstat_l2_read_bytes;
kstat_named_t arcstat_l2_write_bytes;
kstat_named_t arcstat_l2_writes_sent;
kstat_named_t arcstat_l2_writes_done;
kstat_named_t arcstat_l2_writes_error;
kstat_named_t arcstat_l2_writes_hdr_miss;
kstat_named_t arcstat_l2_evict_lock_retry;
kstat_named_t arcstat_l2_evict_reading;
kstat_named_t arcstat_l2_free_on_write;
kstat_named_t arcstat_l2_abort_lowmem;
kstat_named_t arcstat_l2_cksum_bad;
kstat_named_t arcstat_l2_io_error;
kstat_named_t arcstat_l2_size;
kstat_named_t arcstat_l2_hdr_size;
kstat_named_t arcstat_memory_throttle_count;
kstat_named_t arcstat_l2_write_trylock_fail;
kstat_named_t arcstat_l2_write_passed_headroom;
kstat_named_t arcstat_l2_write_spa_mismatch;
kstat_named_t arcstat_l2_write_in_l2;
kstat_named_t arcstat_l2_write_hdr_io_in_progress;
kstat_named_t arcstat_l2_write_not_cacheable;
kstat_named_t arcstat_l2_write_full;
kstat_named_t arcstat_l2_write_buffer_iter;
kstat_named_t arcstat_l2_write_pios;
kstat_named_t arcstat_l2_write_buffer_bytes_scanned;
kstat_named_t arcstat_l2_write_buffer_list_iter;
kstat_named_t arcstat_l2_write_buffer_list_null_iter;
} arc_stats_t;
static arc_stats_t arc_stats = {
{ "hits", KSTAT_DATA_UINT64 },
{ "misses", KSTAT_DATA_UINT64 },
{ "demand_data_hits", KSTAT_DATA_UINT64 },
{ "demand_data_misses", KSTAT_DATA_UINT64 },
{ "demand_metadata_hits", KSTAT_DATA_UINT64 },
{ "demand_metadata_misses", KSTAT_DATA_UINT64 },
{ "prefetch_data_hits", KSTAT_DATA_UINT64 },
{ "prefetch_data_misses", KSTAT_DATA_UINT64 },
{ "prefetch_metadata_hits", KSTAT_DATA_UINT64 },
{ "prefetch_metadata_misses", KSTAT_DATA_UINT64 },
{ "mru_hits", KSTAT_DATA_UINT64 },
{ "mru_ghost_hits", KSTAT_DATA_UINT64 },
{ "mfu_hits", KSTAT_DATA_UINT64 },
{ "mfu_ghost_hits", KSTAT_DATA_UINT64 },
{ "allocated", KSTAT_DATA_UINT64 },
{ "deleted", KSTAT_DATA_UINT64 },
{ "stolen", KSTAT_DATA_UINT64 },
{ "recycle_miss", KSTAT_DATA_UINT64 },
{ "mutex_miss", KSTAT_DATA_UINT64 },
{ "evict_skip", KSTAT_DATA_UINT64 },
{ "evict_l2_cached", KSTAT_DATA_UINT64 },
{ "evict_l2_eligible", KSTAT_DATA_UINT64 },
{ "evict_l2_ineligible", KSTAT_DATA_UINT64 },
{ "hash_elements", KSTAT_DATA_UINT64 },
{ "hash_elements_max", KSTAT_DATA_UINT64 },
{ "hash_collisions", KSTAT_DATA_UINT64 },
{ "hash_chains", KSTAT_DATA_UINT64 },
{ "hash_chain_max", KSTAT_DATA_UINT64 },
{ "p", KSTAT_DATA_UINT64 },
{ "c", KSTAT_DATA_UINT64 },
{ "c_min", KSTAT_DATA_UINT64 },
{ "c_max", KSTAT_DATA_UINT64 },
{ "size", KSTAT_DATA_UINT64 },
{ "hdr_size", KSTAT_DATA_UINT64 },
{ "data_size", KSTAT_DATA_UINT64 },
{ "other_size", KSTAT_DATA_UINT64 },
{ "l2_hits", KSTAT_DATA_UINT64 },
{ "l2_misses", KSTAT_DATA_UINT64 },
{ "l2_feeds", KSTAT_DATA_UINT64 },
{ "l2_rw_clash", KSTAT_DATA_UINT64 },
{ "l2_read_bytes", KSTAT_DATA_UINT64 },
{ "l2_write_bytes", KSTAT_DATA_UINT64 },
{ "l2_writes_sent", KSTAT_DATA_UINT64 },
{ "l2_writes_done", KSTAT_DATA_UINT64 },
{ "l2_writes_error", KSTAT_DATA_UINT64 },
{ "l2_writes_hdr_miss", KSTAT_DATA_UINT64 },
{ "l2_evict_lock_retry", KSTAT_DATA_UINT64 },
{ "l2_evict_reading", KSTAT_DATA_UINT64 },
{ "l2_free_on_write", KSTAT_DATA_UINT64 },
{ "l2_abort_lowmem", KSTAT_DATA_UINT64 },
{ "l2_cksum_bad", KSTAT_DATA_UINT64 },
{ "l2_io_error", KSTAT_DATA_UINT64 },
{ "l2_size", KSTAT_DATA_UINT64 },
{ "l2_hdr_size", KSTAT_DATA_UINT64 },
{ "memory_throttle_count", KSTAT_DATA_UINT64 },
{ "l2_write_trylock_fail", KSTAT_DATA_UINT64 },
{ "l2_write_passed_headroom", KSTAT_DATA_UINT64 },
{ "l2_write_spa_mismatch", KSTAT_DATA_UINT64 },
{ "l2_write_in_l2", KSTAT_DATA_UINT64 },
{ "l2_write_io_in_progress", KSTAT_DATA_UINT64 },
{ "l2_write_not_cacheable", KSTAT_DATA_UINT64 },
{ "l2_write_full", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_iter", KSTAT_DATA_UINT64 },
{ "l2_write_pios", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_bytes_scanned", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_list_iter", KSTAT_DATA_UINT64 },
{ "l2_write_buffer_list_null_iter", KSTAT_DATA_UINT64 }
};
#define ARCSTAT(stat) (arc_stats.stat.value.ui64)
#define ARCSTAT_INCR(stat, val) \
atomic_add_64(&arc_stats.stat.value.ui64, (val));
#define ARCSTAT_BUMP(stat) ARCSTAT_INCR(stat, 1)
#define ARCSTAT_BUMPDOWN(stat) ARCSTAT_INCR(stat, -1)
#define ARCSTAT_MAX(stat, val) { \
uint64_t m; \
while ((val) > (m = arc_stats.stat.value.ui64) && \
(m != atomic_cas_64(&arc_stats.stat.value.ui64, m, (val)))) \
continue; \
}
#define ARCSTAT_MAXSTAT(stat) \
ARCSTAT_MAX(stat##_max, arc_stats.stat.value.ui64)
/*
* We define a macro to allow ARC hits/misses to be easily broken down by
* two separate conditions, giving a total of four different subtypes for
* each of hits and misses (so eight statistics total).
*/
#define ARCSTAT_CONDSTAT(cond1, stat1, notstat1, cond2, stat2, notstat2, stat) \
if (cond1) { \
if (cond2) { \
ARCSTAT_BUMP(arcstat_##stat1##_##stat2##_##stat); \
} else { \
ARCSTAT_BUMP(arcstat_##stat1##_##notstat2##_##stat); \
} \
} else { \
if (cond2) { \
ARCSTAT_BUMP(arcstat_##notstat1##_##stat2##_##stat); \
} else { \
ARCSTAT_BUMP(arcstat_##notstat1##_##notstat2##_##stat);\
} \
}
kstat_t *arc_ksp;
static arc_state_t *arc_anon;
static arc_state_t *arc_mru;
static arc_state_t *arc_mru_ghost;
static arc_state_t *arc_mfu;
static arc_state_t *arc_mfu_ghost;
static arc_state_t *arc_l2c_only;
/*
* There are several ARC variables that are critical to export as kstats --
* but we don't want to have to grovel around in the kstat whenever we wish to
* manipulate them. For these variables, we therefore define them to be in
* terms of the statistic variable. This assures that we are not introducing
* the possibility of inconsistency by having shadow copies of the variables,
* while still allowing the code to be readable.
*/
#define arc_size ARCSTAT(arcstat_size) /* actual total arc size */
#define arc_p ARCSTAT(arcstat_p) /* target size of MRU */
#define arc_c ARCSTAT(arcstat_c) /* target size of cache */
#define arc_c_min ARCSTAT(arcstat_c_min) /* min target cache size */
#define arc_c_max ARCSTAT(arcstat_c_max) /* max target cache size */
static int arc_no_grow; /* Don't try to grow cache size */
static uint64_t arc_tempreserve;
static uint64_t arc_meta_used;
static uint64_t arc_meta_limit;
static uint64_t arc_meta_max = 0;
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_meta_used, CTLFLAG_RDTUN,
&arc_meta_used, 0, "ARC metadata used");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, arc_meta_limit, CTLFLAG_RDTUN,
&arc_meta_limit, 0, "ARC metadata limit");
typedef struct l2arc_buf_hdr l2arc_buf_hdr_t;
typedef struct arc_callback arc_callback_t;
struct arc_callback {
void *acb_private;
arc_done_func_t *acb_done;
arc_buf_t *acb_buf;
zio_t *acb_zio_dummy;
arc_callback_t *acb_next;
};
typedef struct arc_write_callback arc_write_callback_t;
struct arc_write_callback {
void *awcb_private;
arc_done_func_t *awcb_ready;
arc_done_func_t *awcb_done;
arc_buf_t *awcb_buf;
};
struct arc_buf_hdr {
/* protected by hash lock */
dva_t b_dva;
uint64_t b_birth;
uint64_t b_cksum0;
kmutex_t b_freeze_lock;
zio_cksum_t *b_freeze_cksum;
arc_buf_hdr_t *b_hash_next;
arc_buf_t *b_buf;
uint32_t b_flags;
uint32_t b_datacnt;
arc_callback_t *b_acb;
kcondvar_t b_cv;
/* immutable */
arc_buf_contents_t b_type;
uint64_t b_size;
spa_t *b_spa;
/* protected by arc state mutex */
arc_state_t *b_state;
list_node_t b_arc_node;
/* updated atomically */
clock_t b_arc_access;
/* self protecting */
refcount_t b_refcnt;
l2arc_buf_hdr_t *b_l2hdr;
list_node_t b_l2node;
};
static arc_buf_t *arc_eviction_list;
static kmutex_t arc_eviction_mtx;
static arc_buf_hdr_t arc_eviction_hdr;
static void arc_get_data_buf(arc_buf_t *buf);
static void arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock);
static int arc_evict_needed(arc_buf_contents_t type);
static void arc_evict_ghost(arc_state_t *state, spa_t *spa, int64_t bytes);
static boolean_t l2arc_write_eligible(spa_t *spa, arc_buf_hdr_t *ab);
#define GHOST_STATE(state) \
((state) == arc_mru_ghost || (state) == arc_mfu_ghost || \
(state) == arc_l2c_only)
/*
* Private ARC flags. These flags are private ARC only flags that will show up
* in b_flags in the arc_hdr_buf_t. Some flags are publicly declared, and can
* be passed in as arc_flags in things like arc_read. However, these flags
* should never be passed and should only be set by ARC code. When adding new
* public flags, make sure not to smash the private ones.
*/
#define ARC_IN_HASH_TABLE (1 << 9) /* this buffer is hashed */
#define ARC_IO_IN_PROGRESS (1 << 10) /* I/O in progress for buf */
#define ARC_IO_ERROR (1 << 11) /* I/O failed for buf */
#define ARC_FREED_IN_READ (1 << 12) /* buf freed while in read */
#define ARC_BUF_AVAILABLE (1 << 13) /* block not in active use */
#define ARC_INDIRECT (1 << 14) /* this is an indirect block */
#define ARC_FREE_IN_PROGRESS (1 << 15) /* hdr about to be freed */
#define ARC_L2_WRITING (1 << 16) /* L2ARC write in progress */
#define ARC_L2_EVICTED (1 << 17) /* evicted during I/O */
#define ARC_L2_WRITE_HEAD (1 << 18) /* head of write list */
#define ARC_STORED (1 << 19) /* has been store()d to */
#define HDR_IN_HASH_TABLE(hdr) ((hdr)->b_flags & ARC_IN_HASH_TABLE)
#define HDR_IO_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS)
#define HDR_IO_ERROR(hdr) ((hdr)->b_flags & ARC_IO_ERROR)
#define HDR_PREFETCH(hdr) ((hdr)->b_flags & ARC_PREFETCH)
#define HDR_FREED_IN_READ(hdr) ((hdr)->b_flags & ARC_FREED_IN_READ)
#define HDR_BUF_AVAILABLE(hdr) ((hdr)->b_flags & ARC_BUF_AVAILABLE)
#define HDR_FREE_IN_PROGRESS(hdr) ((hdr)->b_flags & ARC_FREE_IN_PROGRESS)
#define HDR_L2CACHE(hdr) ((hdr)->b_flags & ARC_L2CACHE)
#define HDR_L2_READING(hdr) ((hdr)->b_flags & ARC_IO_IN_PROGRESS && \
(hdr)->b_l2hdr != NULL)
#define HDR_L2_WRITING(hdr) ((hdr)->b_flags & ARC_L2_WRITING)
#define HDR_L2_EVICTED(hdr) ((hdr)->b_flags & ARC_L2_EVICTED)
#define HDR_L2_WRITE_HEAD(hdr) ((hdr)->b_flags & ARC_L2_WRITE_HEAD)
/*
* Other sizes
*/
#define HDR_SIZE ((int64_t)sizeof (arc_buf_hdr_t))
#define L2HDR_SIZE ((int64_t)sizeof (l2arc_buf_hdr_t))
/*
* Hash table routines
*/
#define HT_LOCK_PAD CACHE_LINE_SIZE
struct ht_lock {
kmutex_t ht_lock;
#ifdef _KERNEL
unsigned char pad[(HT_LOCK_PAD - sizeof (kmutex_t))];
#endif
};
#define BUF_LOCKS 256
typedef struct buf_hash_table {
uint64_t ht_mask;
arc_buf_hdr_t **ht_table;
struct ht_lock ht_locks[BUF_LOCKS] __aligned(CACHE_LINE_SIZE);
} buf_hash_table_t;
static buf_hash_table_t buf_hash_table;
#define BUF_HASH_INDEX(spa, dva, birth) \
(buf_hash(spa, dva, birth) & buf_hash_table.ht_mask)
#define BUF_HASH_LOCK_NTRY(idx) (buf_hash_table.ht_locks[idx & (BUF_LOCKS-1)])
#define BUF_HASH_LOCK(idx) (&(BUF_HASH_LOCK_NTRY(idx).ht_lock))
#define HDR_LOCK(buf) \
(BUF_HASH_LOCK(BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth)))
uint64_t zfs_crc64_table[256];
/*
* Level 2 ARC
*/
#define L2ARC_WRITE_SIZE (8 * 1024 * 1024) /* initial write max */
#define L2ARC_HEADROOM 2 /* num of writes */
#define L2ARC_FEED_SECS 1 /* caching interval secs */
#define L2ARC_FEED_MIN_MS 200 /* min caching interval ms */
#define l2arc_writes_sent ARCSTAT(arcstat_l2_writes_sent)
#define l2arc_writes_done ARCSTAT(arcstat_l2_writes_done)
/*
* L2ARC Performance Tunables
*/
uint64_t l2arc_write_max = L2ARC_WRITE_SIZE; /* default max write size */
uint64_t l2arc_write_boost = L2ARC_WRITE_SIZE; /* extra write during warmup */
uint64_t l2arc_headroom = L2ARC_HEADROOM; /* number of dev writes */
uint64_t l2arc_feed_secs = L2ARC_FEED_SECS; /* interval seconds */
uint64_t l2arc_feed_min_ms = L2ARC_FEED_MIN_MS; /* min interval milliseconds */
boolean_t l2arc_noprefetch = B_FALSE; /* don't cache prefetch bufs */
boolean_t l2arc_feed_again = B_TRUE; /* turbo warmup */
boolean_t l2arc_norw = B_TRUE; /* no reads during writes */
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_max, CTLFLAG_RW,
&l2arc_write_max, 0, "max write size");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_write_boost, CTLFLAG_RW,
&l2arc_write_boost, 0, "extra write during warmup");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_headroom, CTLFLAG_RW,
&l2arc_headroom, 0, "number of dev writes");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_secs, CTLFLAG_RW,
&l2arc_feed_secs, 0, "interval seconds");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2arc_feed_min_ms, CTLFLAG_RW,
&l2arc_feed_min_ms, 0, "min interval milliseconds");
SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_noprefetch, CTLFLAG_RW,
&l2arc_noprefetch, 0, "don't cache prefetch bufs");
SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_feed_again, CTLFLAG_RW,
&l2arc_feed_again, 0, "turbo warmup");
SYSCTL_INT(_vfs_zfs, OID_AUTO, l2arc_norw, CTLFLAG_RW,
&l2arc_norw, 0, "no reads during writes");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_size, CTLFLAG_RD,
&ARC_anon.arcs_size, 0, "size of anonymous state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_metadata_lsize, CTLFLAG_RD,
&ARC_anon.arcs_lsize[ARC_BUFC_METADATA], 0, "size of anonymous state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, anon_data_lsize, CTLFLAG_RD,
&ARC_anon.arcs_lsize[ARC_BUFC_DATA], 0, "size of anonymous state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_size, CTLFLAG_RD,
&ARC_mru.arcs_size, 0, "size of mru state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_metadata_lsize, CTLFLAG_RD,
&ARC_mru.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mru state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_data_lsize, CTLFLAG_RD,
&ARC_mru.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mru state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_size, CTLFLAG_RD,
&ARC_mru_ghost.arcs_size, 0, "size of mru ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_metadata_lsize, CTLFLAG_RD,
&ARC_mru_ghost.arcs_lsize[ARC_BUFC_METADATA], 0,
"size of metadata in mru ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mru_ghost_data_lsize, CTLFLAG_RD,
&ARC_mru_ghost.arcs_lsize[ARC_BUFC_DATA], 0,
"size of data in mru ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_size, CTLFLAG_RD,
&ARC_mfu.arcs_size, 0, "size of mfu state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_metadata_lsize, CTLFLAG_RD,
&ARC_mfu.arcs_lsize[ARC_BUFC_METADATA], 0, "size of metadata in mfu state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_data_lsize, CTLFLAG_RD,
&ARC_mfu.arcs_lsize[ARC_BUFC_DATA], 0, "size of data in mfu state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_size, CTLFLAG_RD,
&ARC_mfu_ghost.arcs_size, 0, "size of mfu ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_metadata_lsize, CTLFLAG_RD,
&ARC_mfu_ghost.arcs_lsize[ARC_BUFC_METADATA], 0,
"size of metadata in mfu ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, mfu_ghost_data_lsize, CTLFLAG_RD,
&ARC_mfu_ghost.arcs_lsize[ARC_BUFC_DATA], 0,
"size of data in mfu ghost state");
SYSCTL_QUAD(_vfs_zfs, OID_AUTO, l2c_only_size, CTLFLAG_RD,
&ARC_l2c_only.arcs_size, 0, "size of mru state");
/*
* L2ARC Internals
*/
typedef struct l2arc_dev {
vdev_t *l2ad_vdev; /* vdev */
spa_t *l2ad_spa; /* spa */
uint64_t l2ad_hand; /* next write location */
uint64_t l2ad_write; /* desired write size, bytes */
uint64_t l2ad_boost; /* warmup write boost, bytes */
uint64_t l2ad_start; /* first addr on device */
uint64_t l2ad_end; /* last addr on device */
uint64_t l2ad_evict; /* last addr eviction reached */
boolean_t l2ad_first; /* first sweep through */
boolean_t l2ad_writing; /* currently writing */
list_t *l2ad_buflist; /* buffer list */
list_node_t l2ad_node; /* device list node */
} l2arc_dev_t;
static list_t L2ARC_dev_list; /* device list */
static list_t *l2arc_dev_list; /* device list pointer */
static kmutex_t l2arc_dev_mtx; /* device list mutex */
static l2arc_dev_t *l2arc_dev_last; /* last device used */
static kmutex_t l2arc_buflist_mtx; /* mutex for all buflists */
static list_t L2ARC_free_on_write; /* free after write buf list */
static list_t *l2arc_free_on_write; /* free after write list ptr */
static kmutex_t l2arc_free_on_write_mtx; /* mutex for list */
static uint64_t l2arc_ndev; /* number of devices */
typedef struct l2arc_read_callback {
arc_buf_t *l2rcb_buf; /* read buffer */
spa_t *l2rcb_spa; /* spa */
blkptr_t l2rcb_bp; /* original blkptr */
zbookmark_t l2rcb_zb; /* original bookmark */
int l2rcb_flags; /* original flags */
} l2arc_read_callback_t;
typedef struct l2arc_write_callback {
l2arc_dev_t *l2wcb_dev; /* device info */
arc_buf_hdr_t *l2wcb_head; /* head of write buflist */
} l2arc_write_callback_t;
struct l2arc_buf_hdr {
/* protected by arc_buf_hdr mutex */
l2arc_dev_t *b_dev; /* L2ARC device */
uint64_t b_daddr; /* disk address, offset byte */
};
typedef struct l2arc_data_free {
/* protected by l2arc_free_on_write_mtx */
void *l2df_data;
size_t l2df_size;
void (*l2df_func)(void *, size_t);
list_node_t l2df_list_node;
} l2arc_data_free_t;
static kmutex_t l2arc_feed_thr_lock;
static kcondvar_t l2arc_feed_thr_cv;
static uint8_t l2arc_thread_exit;
static void l2arc_read_done(zio_t *zio);
static void l2arc_hdr_stat_add(void);
static void l2arc_hdr_stat_remove(void);
static uint64_t
buf_hash(spa_t *spa, const dva_t *dva, uint64_t birth)
{
uintptr_t spav = (uintptr_t)spa;
uint8_t *vdva = (uint8_t *)dva;
uint64_t crc = -1ULL;
int i;
ASSERT(zfs_crc64_table[128] == ZFS_CRC64_POLY);
for (i = 0; i < sizeof (dva_t); i++)
crc = (crc >> 8) ^ zfs_crc64_table[(crc ^ vdva[i]) & 0xFF];
crc ^= (spav>>8) ^ birth;
return (crc);
}
#define BUF_EMPTY(buf) \
((buf)->b_dva.dva_word[0] == 0 && \
(buf)->b_dva.dva_word[1] == 0 && \
(buf)->b_birth == 0)
#define BUF_EQUAL(spa, dva, birth, buf) \
((buf)->b_dva.dva_word[0] == (dva)->dva_word[0]) && \
((buf)->b_dva.dva_word[1] == (dva)->dva_word[1]) && \
((buf)->b_birth == birth) && ((buf)->b_spa == spa)
static arc_buf_hdr_t *
buf_hash_find(spa_t *spa, const dva_t *dva, uint64_t birth, kmutex_t **lockp)
{
uint64_t idx = BUF_HASH_INDEX(spa, dva, birth);
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
arc_buf_hdr_t *buf;
mutex_enter(hash_lock);
for (buf = buf_hash_table.ht_table[idx]; buf != NULL;
buf = buf->b_hash_next) {
if (BUF_EQUAL(spa, dva, birth, buf)) {
*lockp = hash_lock;
return (buf);
}
}
mutex_exit(hash_lock);
*lockp = NULL;
return (NULL);
}
/*
* Insert an entry into the hash table. If there is already an element
* equal to elem in the hash table, then the already existing element
* will be returned and the new element will not be inserted.
* Otherwise returns NULL.
*/
static arc_buf_hdr_t *
buf_hash_insert(arc_buf_hdr_t *buf, kmutex_t **lockp)
{
uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
kmutex_t *hash_lock = BUF_HASH_LOCK(idx);
arc_buf_hdr_t *fbuf;
uint32_t i;
ASSERT(!HDR_IN_HASH_TABLE(buf));
*lockp = hash_lock;
mutex_enter(hash_lock);
for (fbuf = buf_hash_table.ht_table[idx], i = 0; fbuf != NULL;
fbuf = fbuf->b_hash_next, i++) {
if (BUF_EQUAL(buf->b_spa, &buf->b_dva, buf->b_birth, fbuf))
return (fbuf);
}
buf->b_hash_next = buf_hash_table.ht_table[idx];
buf_hash_table.ht_table[idx] = buf;
buf->b_flags |= ARC_IN_HASH_TABLE;
/* collect some hash table performance data */
if (i > 0) {
ARCSTAT_BUMP(arcstat_hash_collisions);
if (i == 1)
ARCSTAT_BUMP(arcstat_hash_chains);
ARCSTAT_MAX(arcstat_hash_chain_max, i);
}
ARCSTAT_BUMP(arcstat_hash_elements);
ARCSTAT_MAXSTAT(arcstat_hash_elements);
return (NULL);
}
static void
buf_hash_remove(arc_buf_hdr_t *buf)
{
arc_buf_hdr_t *fbuf, **bufp;
uint64_t idx = BUF_HASH_INDEX(buf->b_spa, &buf->b_dva, buf->b_birth);
ASSERT(MUTEX_HELD(BUF_HASH_LOCK(idx)));
ASSERT(HDR_IN_HASH_TABLE(buf));
bufp = &buf_hash_table.ht_table[idx];
while ((fbuf = *bufp) != buf) {
ASSERT(fbuf != NULL);
bufp = &fbuf->b_hash_next;
}
*bufp = buf->b_hash_next;
buf->b_hash_next = NULL;
buf->b_flags &= ~ARC_IN_HASH_TABLE;
/* collect some hash table performance data */
ARCSTAT_BUMPDOWN(arcstat_hash_elements);
if (buf_hash_table.ht_table[idx] &&
buf_hash_table.ht_table[idx]->b_hash_next == NULL)
ARCSTAT_BUMPDOWN(arcstat_hash_chains);
}
/*
* Global data structures and functions for the buf kmem cache.
*/
static kmem_cache_t *hdr_cache;
static kmem_cache_t *buf_cache;
static void
buf_fini(void)
{
int i;
kmem_free(buf_hash_table.ht_table,
(buf_hash_table.ht_mask + 1) * sizeof (void *));
for (i = 0; i < BUF_LOCKS; i++)
mutex_destroy(&buf_hash_table.ht_locks[i].ht_lock);
kmem_cache_destroy(hdr_cache);
kmem_cache_destroy(buf_cache);
}
/*
* Constructor callback - called when the cache is empty
* and a new buf is requested.
*/
/* ARGSUSED */
static int
hdr_cons(void *vbuf, void *unused, int kmflag)
{
arc_buf_hdr_t *buf = vbuf;
bzero(buf, sizeof (arc_buf_hdr_t));
refcount_create(&buf->b_refcnt);
cv_init(&buf->b_cv, NULL, CV_DEFAULT, NULL);
mutex_init(&buf->b_freeze_lock, NULL, MUTEX_DEFAULT, NULL);
arc_space_consume(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
return (0);
}
/* ARGSUSED */
static int
buf_cons(void *vbuf, void *unused, int kmflag)
{
arc_buf_t *buf = vbuf;
bzero(buf, sizeof (arc_buf_t));
rw_init(&buf->b_lock, NULL, RW_DEFAULT, NULL);
arc_space_consume(sizeof (arc_buf_t), ARC_SPACE_HDRS);
return (0);
}
/*
* Destructor callback - called when a cached buf is
* no longer required.
*/
/* ARGSUSED */
static void
hdr_dest(void *vbuf, void *unused)
{
arc_buf_hdr_t *buf = vbuf;
refcount_destroy(&buf->b_refcnt);
cv_destroy(&buf->b_cv);
mutex_destroy(&buf->b_freeze_lock);
arc_space_return(sizeof (arc_buf_hdr_t), ARC_SPACE_HDRS);
}
/* ARGSUSED */
static void
buf_dest(void *vbuf, void *unused)
{
arc_buf_t *buf = vbuf;
rw_destroy(&buf->b_lock);
arc_space_return(sizeof (arc_buf_t), ARC_SPACE_HDRS);
}
/*
* Reclaim callback -- invoked when memory is low.
*/
/* ARGSUSED */
static void
hdr_recl(void *unused)
{
dprintf("hdr_recl called\n");
/*
* umem calls the reclaim func when we destroy the buf cache,
* which is after we do arc_fini().
*/
if (!arc_dead)
cv_signal(&arc_reclaim_thr_cv);
}
static void
buf_init(void)
{
uint64_t *ct;
uint64_t hsize = 1ULL << 12;
int i, j;
/*
* The hash table is big enough to fill all of physical memory
* with an average 64K block size. The table will take up
* totalmem*sizeof(void*)/64K (eg. 128KB/GB with 8-byte pointers).
*/
while (hsize * 65536 < (uint64_t)physmem * PAGESIZE)
hsize <<= 1;
retry:
buf_hash_table.ht_mask = hsize - 1;
buf_hash_table.ht_table =
kmem_zalloc(hsize * sizeof (void*), KM_NOSLEEP);
if (buf_hash_table.ht_table == NULL) {
ASSERT(hsize > (1ULL << 8));
hsize >>= 1;
goto retry;
}
hdr_cache = kmem_cache_create("arc_buf_hdr_t", sizeof (arc_buf_hdr_t),
0, hdr_cons, hdr_dest, hdr_recl, NULL, NULL, 0);
buf_cache = kmem_cache_create("arc_buf_t", sizeof (arc_buf_t),
0, buf_cons, buf_dest, NULL, NULL, NULL, 0);
for (i = 0; i < 256; i++)
for (ct = zfs_crc64_table + i, *ct = i, j = 8; j > 0; j--)
*ct = (*ct >> 1) ^ (-(*ct & 1) & ZFS_CRC64_POLY);
for (i = 0; i < BUF_LOCKS; i++) {
mutex_init(&buf_hash_table.ht_locks[i].ht_lock,
NULL, MUTEX_DEFAULT, NULL);
}
}
#define ARC_MINTIME (hz>>4) /* 62 ms */
static void
arc_cksum_verify(arc_buf_t *buf)
{
zio_cksum_t zc;
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
return;
mutex_enter(&buf->b_hdr->b_freeze_lock);
if (buf->b_hdr->b_freeze_cksum == NULL ||
(buf->b_hdr->b_flags & ARC_IO_ERROR)) {
mutex_exit(&buf->b_hdr->b_freeze_lock);
return;
}
fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
if (!ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc))
panic("buffer modified while frozen!");
mutex_exit(&buf->b_hdr->b_freeze_lock);
}
static int
arc_cksum_equal(arc_buf_t *buf)
{
zio_cksum_t zc;
int equal;
mutex_enter(&buf->b_hdr->b_freeze_lock);
fletcher_2_native(buf->b_data, buf->b_hdr->b_size, &zc);
equal = ZIO_CHECKSUM_EQUAL(*buf->b_hdr->b_freeze_cksum, zc);
mutex_exit(&buf->b_hdr->b_freeze_lock);
return (equal);
}
static void
arc_cksum_compute(arc_buf_t *buf, boolean_t force)
{
if (!force && !(zfs_flags & ZFS_DEBUG_MODIFY))
return;
mutex_enter(&buf->b_hdr->b_freeze_lock);
if (buf->b_hdr->b_freeze_cksum != NULL) {
mutex_exit(&buf->b_hdr->b_freeze_lock);
return;
}
buf->b_hdr->b_freeze_cksum = kmem_alloc(sizeof (zio_cksum_t), KM_SLEEP);
fletcher_2_native(buf->b_data, buf->b_hdr->b_size,
buf->b_hdr->b_freeze_cksum);
mutex_exit(&buf->b_hdr->b_freeze_lock);
}
void
arc_buf_thaw(arc_buf_t *buf)
{
if (zfs_flags & ZFS_DEBUG_MODIFY) {
if (buf->b_hdr->b_state != arc_anon)
panic("modifying non-anon buffer!");
if (buf->b_hdr->b_flags & ARC_IO_IN_PROGRESS)
panic("modifying buffer while i/o in progress!");
arc_cksum_verify(buf);
}
mutex_enter(&buf->b_hdr->b_freeze_lock);
if (buf->b_hdr->b_freeze_cksum != NULL) {
kmem_free(buf->b_hdr->b_freeze_cksum, sizeof (zio_cksum_t));
buf->b_hdr->b_freeze_cksum = NULL;
}
mutex_exit(&buf->b_hdr->b_freeze_lock);
}
void
arc_buf_freeze(arc_buf_t *buf)
{
if (!(zfs_flags & ZFS_DEBUG_MODIFY))
return;
ASSERT(buf->b_hdr->b_freeze_cksum != NULL ||
buf->b_hdr->b_state == arc_anon);
arc_cksum_compute(buf, B_FALSE);
}
static void
get_buf_info(arc_buf_hdr_t *ab, arc_state_t *state, list_t **list, kmutex_t **lock)
{
uint64_t buf_hashid = buf_hash(ab->b_spa, &ab->b_dva, ab->b_birth);
if (ab->b_type == ARC_BUFC_METADATA)
buf_hashid &= (ARC_BUFC_NUMMETADATALISTS - 1);
else {
buf_hashid &= (ARC_BUFC_NUMDATALISTS - 1);
buf_hashid += ARC_BUFC_NUMMETADATALISTS;
}
*list = &state->arcs_lists[buf_hashid];
*lock = ARCS_LOCK(state, buf_hashid);
}
static void
add_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
{
ASSERT(MUTEX_HELD(hash_lock));
if ((refcount_add(&ab->b_refcnt, tag) == 1) &&
(ab->b_state != arc_anon)) {
uint64_t delta = ab->b_size * ab->b_datacnt;
uint64_t *size = &ab->b_state->arcs_lsize[ab->b_type];
list_t *list;
kmutex_t *lock;
get_buf_info(ab, ab->b_state, &list, &lock);
ASSERT(!MUTEX_HELD(lock));
mutex_enter(lock);
ASSERT(list_link_active(&ab->b_arc_node));
list_remove(list, ab);
if (GHOST_STATE(ab->b_state)) {
ASSERT3U(ab->b_datacnt, ==, 0);
ASSERT3P(ab->b_buf, ==, NULL);
delta = ab->b_size;
}
ASSERT(delta > 0);
ASSERT3U(*size, >=, delta);
atomic_add_64(size, -delta);
mutex_exit(lock);
/* remove the prefetch flag if we get a reference */
if (ab->b_flags & ARC_PREFETCH)
ab->b_flags &= ~ARC_PREFETCH;
}
}
static int
remove_reference(arc_buf_hdr_t *ab, kmutex_t *hash_lock, void *tag)
{
int cnt;
arc_state_t *state = ab->b_state;
ASSERT(state == arc_anon || MUTEX_HELD(hash_lock));
ASSERT(!GHOST_STATE(state));
if (((cnt = refcount_remove(&ab->b_refcnt, tag)) == 0) &&
(state != arc_anon)) {
uint64_t *size = &state->arcs_lsize[ab->b_type];
list_t *list;
kmutex_t *lock;
get_buf_info(ab, state, &list, &lock);
ASSERT(!MUTEX_HELD(lock));
mutex_enter(lock);
ASSERT(!list_link_active(&ab->b_arc_node));
list_insert_head(list, ab);
ASSERT(ab->b_datacnt > 0);
atomic_add_64(size, ab->b_size * ab->b_datacnt);
mutex_exit(lock);
}
return (cnt);
}
/*
* Move the supplied buffer to the indicated state. The mutex
* for the buffer must be held by the caller.
*/
static void
arc_change_state(arc_state_t *new_state, arc_buf_hdr_t *ab, kmutex_t *hash_lock)
{
arc_state_t *old_state = ab->b_state;
int64_t refcnt = refcount_count(&ab->b_refcnt);
uint64_t from_delta, to_delta;
list_t *list;
kmutex_t *lock;
ASSERT(MUTEX_HELD(hash_lock));
ASSERT(new_state != old_state);
ASSERT(refcnt == 0 || ab->b_datacnt > 0);
ASSERT(ab->b_datacnt == 0 || !GHOST_STATE(new_state));
from_delta = to_delta = ab->b_datacnt * ab->b_size;
/*
* If this buffer is evictable, transfer it from the
* old state list to the new state list.
*/
if (refcnt == 0) {
if (old_state != arc_anon) {
int use_mutex;
uint64_t *size = &old_state->arcs_lsize[ab->b_type];
get_buf_info(ab, old_state, &list, &lock);
use_mutex = !MUTEX_HELD(lock);
if (use_mutex)
mutex_enter(lock);
ASSERT(list_link_active(&ab->b_arc_node));
list_remove(list, ab);
/*
* If prefetching out of the ghost cache,
* we will have a non-null datacnt.
*/
if (GHOST_STATE(old_state) && ab->b_datacnt == 0) {
/* ghost elements have a ghost size */
ASSERT(ab->b_buf == NULL);
from_delta = ab->b_size;
}
ASSERT3U(*size, >=, from_delta);
atomic_add_64(size, -from_delta);
if (use_mutex)
mutex_exit(lock);
}
if (new_state != arc_anon) {
int use_mutex;
uint64_t *size = &new_state->arcs_lsize[ab->b_type];
get_buf_info(ab, new_state, &list, &lock);
use_mutex = !MUTEX_HELD(lock);
if (use_mutex)
mutex_enter(lock);
list_insert_head(list, ab);
/* ghost elements have a ghost size */
if (GHOST_STATE(new_state)) {
ASSERT(ab->b_datacnt == 0);
ASSERT(ab->b_buf == NULL);
to_delta = ab->b_size;
}
atomic_add_64(size, to_delta);
if (use_mutex)
mutex_exit(lock);
}
}
ASSERT(!BUF_EMPTY(ab));
if (new_state == arc_anon) {
buf_hash_remove(ab);
}
/* adjust state sizes */
if (to_delta)
atomic_add_64(&new_state->arcs_size, to_delta);
if (from_delta) {
ASSERT3U(old_state->arcs_size, >=, from_delta);
atomic_add_64(&old_state->arcs_size, -from_delta);
}
ab->b_state = new_state;
/* adjust l2arc hdr stats */
if (new_state == arc_l2c_only)
l2arc_hdr_stat_add();
else if (old_state == arc_l2c_only)
l2arc_hdr_stat_remove();
}
void
arc_space_consume(uint64_t space, arc_space_type_t type)
{
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
switch (type) {
case ARC_SPACE_DATA:
ARCSTAT_INCR(arcstat_data_size, space);
break;
case ARC_SPACE_OTHER:
ARCSTAT_INCR(arcstat_other_size, space);
break;
case ARC_SPACE_HDRS:
ARCSTAT_INCR(arcstat_hdr_size, space);
break;
case ARC_SPACE_L2HDRS:
ARCSTAT_INCR(arcstat_l2_hdr_size, space);
break;
}
atomic_add_64(&arc_meta_used, space);
atomic_add_64(&arc_size, space);
}
void
arc_space_return(uint64_t space, arc_space_type_t type)
{
ASSERT(type >= 0 && type < ARC_SPACE_NUMTYPES);
switch (type) {
case ARC_SPACE_DATA:
ARCSTAT_INCR(arcstat_data_size, -space);
break;
case ARC_SPACE_OTHER:
ARCSTAT_INCR(arcstat_other_size, -space);
break;
case ARC_SPACE_HDRS:
ARCSTAT_INCR(arcstat_hdr_size, -space);
break;
case ARC_SPACE_L2HDRS:
ARCSTAT_INCR(arcstat_l2_hdr_size, -space);
break;
}
ASSERT(arc_meta_used >= space);
if (arc_meta_max < arc_meta_used)
arc_meta_max = arc_meta_used;
atomic_add_64(&arc_meta_used, -space);
ASSERT(arc_size >= space);
atomic_add_64(&arc_size, -space);
}
void *
arc_data_buf_alloc(uint64_t size)
{
if (arc_evict_needed(ARC_BUFC_DATA))
cv_signal(&arc_reclaim_thr_cv);
atomic_add_64(&arc_size, size);
return (zio_data_buf_alloc(size));
}
void
arc_data_buf_free(void *buf, uint64_t size)
{
zio_data_buf_free(buf, size);
ASSERT(arc_size >= size);
atomic_add_64(&arc_size, -size);
}
arc_buf_t *
arc_buf_alloc(spa_t *spa, int size, void *tag, arc_buf_contents_t type)
{
arc_buf_hdr_t *hdr;
arc_buf_t *buf;
ASSERT3U(size, >, 0);
hdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
ASSERT(BUF_EMPTY(hdr));
hdr->b_size = size;
hdr->b_type = type;
hdr->b_spa = spa;
hdr->b_state = arc_anon;
hdr->b_arc_access = 0;
buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
buf->b_hdr = hdr;
buf->b_data = NULL;
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_next = NULL;
hdr->b_buf = buf;
arc_get_data_buf(buf);
hdr->b_datacnt = 1;
hdr->b_flags = 0;
ASSERT(refcount_is_zero(&hdr->b_refcnt));
(void) refcount_add(&hdr->b_refcnt, tag);
return (buf);
}
static arc_buf_t *
arc_buf_clone(arc_buf_t *from)
{
arc_buf_t *buf;
arc_buf_hdr_t *hdr = from->b_hdr;
uint64_t size = hdr->b_size;
buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
buf->b_hdr = hdr;
buf->b_data = NULL;
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_next = hdr->b_buf;
hdr->b_buf = buf;
arc_get_data_buf(buf);
bcopy(from->b_data, buf->b_data, size);
hdr->b_datacnt += 1;
return (buf);
}
void
arc_buf_add_ref(arc_buf_t *buf, void* tag)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_lock;
/*
* Check to see if this buffer is evicted. Callers
* must verify b_data != NULL to know if the add_ref
* was successful.
*/
rw_enter(&buf->b_lock, RW_READER);
if (buf->b_data == NULL) {
rw_exit(&buf->b_lock);
return;
}
hdr = buf->b_hdr;
ASSERT(hdr != NULL);
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
rw_exit(&buf->b_lock);
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
add_reference(hdr, hash_lock, tag);
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
arc_access(hdr, hash_lock);
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_hits);
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
data, metadata, hits);
}
/*
* Free the arc data buffer. If it is an l2arc write in progress,
* the buffer is placed on l2arc_free_on_write to be freed later.
*/
static void
arc_buf_data_free(arc_buf_hdr_t *hdr, void (*free_func)(void *, size_t),
void *data, size_t size)
{
if (HDR_L2_WRITING(hdr)) {
l2arc_data_free_t *df;
df = kmem_alloc(sizeof (l2arc_data_free_t), KM_SLEEP);
df->l2df_data = data;
df->l2df_size = size;
df->l2df_func = free_func;
mutex_enter(&l2arc_free_on_write_mtx);
list_insert_head(l2arc_free_on_write, df);
mutex_exit(&l2arc_free_on_write_mtx);
ARCSTAT_BUMP(arcstat_l2_free_on_write);
} else {
free_func(data, size);
}
}
static void
arc_buf_destroy(arc_buf_t *buf, boolean_t recycle, boolean_t all)
{
arc_buf_t **bufp;
/* free up data associated with the buf */
if (buf->b_data) {
arc_state_t *state = buf->b_hdr->b_state;
uint64_t size = buf->b_hdr->b_size;
arc_buf_contents_t type = buf->b_hdr->b_type;
arc_cksum_verify(buf);
if (!recycle) {
if (type == ARC_BUFC_METADATA) {
arc_buf_data_free(buf->b_hdr, zio_buf_free,
buf->b_data, size);
arc_space_return(size, ARC_SPACE_DATA);
} else {
ASSERT(type == ARC_BUFC_DATA);
arc_buf_data_free(buf->b_hdr,
zio_data_buf_free, buf->b_data, size);
ARCSTAT_INCR(arcstat_data_size, -size);
atomic_add_64(&arc_size, -size);
}
}
if (list_link_active(&buf->b_hdr->b_arc_node)) {
uint64_t *cnt = &state->arcs_lsize[type];
ASSERT(refcount_is_zero(&buf->b_hdr->b_refcnt));
ASSERT(state != arc_anon);
ASSERT3U(*cnt, >=, size);
atomic_add_64(cnt, -size);
}
ASSERT3U(state->arcs_size, >=, size);
atomic_add_64(&state->arcs_size, -size);
buf->b_data = NULL;
ASSERT(buf->b_hdr->b_datacnt > 0);
buf->b_hdr->b_datacnt -= 1;
}
/* only remove the buf if requested */
if (!all)
return;
/* remove the buf from the hdr list */
for (bufp = &buf->b_hdr->b_buf; *bufp != buf; bufp = &(*bufp)->b_next)
continue;
*bufp = buf->b_next;
ASSERT(buf->b_efunc == NULL);
/* clean up the buf */
buf->b_hdr = NULL;
kmem_cache_free(buf_cache, buf);
}
static void
arc_hdr_destroy(arc_buf_hdr_t *hdr)
{
ASSERT(refcount_is_zero(&hdr->b_refcnt));
ASSERT3P(hdr->b_state, ==, arc_anon);
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
ASSERT(!(hdr->b_flags & ARC_STORED));
if (hdr->b_l2hdr != NULL) {
if (!MUTEX_HELD(&l2arc_buflist_mtx)) {
/*
* To prevent arc_free() and l2arc_evict() from
* attempting to free the same buffer at the same time,
* a FREE_IN_PROGRESS flag is given to arc_free() to
* give it priority. l2arc_evict() can't destroy this
* header while we are waiting on l2arc_buflist_mtx.
*
* The hdr may be removed from l2ad_buflist before we
* grab l2arc_buflist_mtx, so b_l2hdr is rechecked.
*/
mutex_enter(&l2arc_buflist_mtx);
if (hdr->b_l2hdr != NULL) {
list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist,
hdr);
}
mutex_exit(&l2arc_buflist_mtx);
} else {
list_remove(hdr->b_l2hdr->b_dev->l2ad_buflist, hdr);
}
ARCSTAT_INCR(arcstat_l2_size, -hdr->b_size);
kmem_free(hdr->b_l2hdr, sizeof (l2arc_buf_hdr_t));
if (hdr->b_state == arc_l2c_only)
l2arc_hdr_stat_remove();
hdr->b_l2hdr = NULL;
}
if (!BUF_EMPTY(hdr)) {
ASSERT(!HDR_IN_HASH_TABLE(hdr));
bzero(&hdr->b_dva, sizeof (dva_t));
hdr->b_birth = 0;
hdr->b_cksum0 = 0;
}
while (hdr->b_buf) {
arc_buf_t *buf = hdr->b_buf;
if (buf->b_efunc) {
mutex_enter(&arc_eviction_mtx);
rw_enter(&buf->b_lock, RW_WRITER);
ASSERT(buf->b_hdr != NULL);
arc_buf_destroy(hdr->b_buf, FALSE, FALSE);
hdr->b_buf = buf->b_next;
buf->b_hdr = &arc_eviction_hdr;
buf->b_next = arc_eviction_list;
arc_eviction_list = buf;
rw_exit(&buf->b_lock);
mutex_exit(&arc_eviction_mtx);
} else {
arc_buf_destroy(hdr->b_buf, FALSE, TRUE);
}
}
if (hdr->b_freeze_cksum != NULL) {
kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
hdr->b_freeze_cksum = NULL;
}
ASSERT(!list_link_active(&hdr->b_arc_node));
ASSERT3P(hdr->b_hash_next, ==, NULL);
ASSERT3P(hdr->b_acb, ==, NULL);
kmem_cache_free(hdr_cache, hdr);
}
void
arc_buf_free(arc_buf_t *buf, void *tag)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
int hashed = hdr->b_state != arc_anon;
ASSERT(buf->b_efunc == NULL);
ASSERT(buf->b_data != NULL);
if (hashed) {
kmutex_t *hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
(void) remove_reference(hdr, hash_lock, tag);
if (hdr->b_datacnt > 1)
arc_buf_destroy(buf, FALSE, TRUE);
else
hdr->b_flags |= ARC_BUF_AVAILABLE;
mutex_exit(hash_lock);
} else if (HDR_IO_IN_PROGRESS(hdr)) {
int destroy_hdr;
/*
* We are in the middle of an async write. Don't destroy
* this buffer unless the write completes before we finish
* decrementing the reference count.
*/
mutex_enter(&arc_eviction_mtx);
(void) remove_reference(hdr, NULL, tag);
ASSERT(refcount_is_zero(&hdr->b_refcnt));
destroy_hdr = !HDR_IO_IN_PROGRESS(hdr);
mutex_exit(&arc_eviction_mtx);
if (destroy_hdr)
arc_hdr_destroy(hdr);
} else {
if (remove_reference(hdr, NULL, tag) > 0) {
ASSERT(HDR_IO_ERROR(hdr));
arc_buf_destroy(buf, FALSE, TRUE);
} else {
arc_hdr_destroy(hdr);
}
}
}
int
arc_buf_remove_ref(arc_buf_t *buf, void* tag)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
kmutex_t *hash_lock = HDR_LOCK(hdr);
int no_callback = (buf->b_efunc == NULL);
if (hdr->b_state == arc_anon) {
arc_buf_free(buf, tag);
return (no_callback);
}
mutex_enter(hash_lock);
ASSERT(hdr->b_state != arc_anon);
ASSERT(buf->b_data != NULL);
(void) remove_reference(hdr, hash_lock, tag);
if (hdr->b_datacnt > 1) {
if (no_callback)
arc_buf_destroy(buf, FALSE, TRUE);
} else if (no_callback) {
ASSERT(hdr->b_buf == buf && buf->b_next == NULL);
hdr->b_flags |= ARC_BUF_AVAILABLE;
}
ASSERT(no_callback || hdr->b_datacnt > 1 ||
refcount_is_zero(&hdr->b_refcnt));
mutex_exit(hash_lock);
return (no_callback);
}
int
arc_buf_size(arc_buf_t *buf)
{
return (buf->b_hdr->b_size);
}
/*
* Evict buffers from list until we've removed the specified number of
* bytes. Move the removed buffers to the appropriate evict state.
* If the recycle flag is set, then attempt to "recycle" a buffer:
* - look for a buffer to evict that is `bytes' long.
* - return the data block from this buffer rather than freeing it.
* This flag is used by callers that are trying to make space for a
* new buffer in a full arc cache.
*
* This function makes a "best effort". It skips over any buffers
* it can't get a hash_lock on, and so may not catch all candidates.
* It may also return without evicting as much space as requested.
*/
static void *
arc_evict(arc_state_t *state, spa_t *spa, int64_t bytes, boolean_t recycle,
arc_buf_contents_t type)
{
arc_state_t *evicted_state;
uint64_t bytes_evicted = 0, skipped = 0, missed = 0;
int64_t bytes_remaining;
arc_buf_hdr_t *ab, *ab_prev = NULL;
list_t *evicted_list, *list, *evicted_list_start, *list_start;
kmutex_t *lock, *evicted_lock;
kmutex_t *hash_lock;
boolean_t have_lock;
void *stolen = NULL;
static int evict_metadata_offset, evict_data_offset;
int i, idx, offset, list_count, count;
ASSERT(state == arc_mru || state == arc_mfu);
evicted_state = (state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
if (type == ARC_BUFC_METADATA) {
offset = 0;
list_count = ARC_BUFC_NUMMETADATALISTS;
list_start = &state->arcs_lists[0];
evicted_list_start = &evicted_state->arcs_lists[0];
idx = evict_metadata_offset;
} else {
offset = ARC_BUFC_NUMMETADATALISTS;
list_start = &state->arcs_lists[offset];
evicted_list_start = &evicted_state->arcs_lists[offset];
list_count = ARC_BUFC_NUMDATALISTS;
idx = evict_data_offset;
}
bytes_remaining = evicted_state->arcs_lsize[type];
count = 0;
evict_start:
list = &list_start[idx];
evicted_list = &evicted_list_start[idx];
lock = ARCS_LOCK(state, (offset + idx));
evicted_lock = ARCS_LOCK(evicted_state, (offset + idx));
mutex_enter(lock);
mutex_enter(evicted_lock);
for (ab = list_tail(list); ab; ab = ab_prev) {
ab_prev = list_prev(list, ab);
bytes_remaining -= (ab->b_size * ab->b_datacnt);
/* prefetch buffers have a minimum lifespan */
if (HDR_IO_IN_PROGRESS(ab) ||
(spa && ab->b_spa != spa) ||
(ab->b_flags & (ARC_PREFETCH|ARC_INDIRECT) &&
LBOLT - ab->b_arc_access < arc_min_prefetch_lifespan)) {
skipped++;
continue;
}
/* "lookahead" for better eviction candidate */
if (recycle && ab->b_size != bytes &&
ab_prev && ab_prev->b_size == bytes)
continue;
hash_lock = HDR_LOCK(ab);
have_lock = MUTEX_HELD(hash_lock);
if (have_lock || mutex_tryenter(hash_lock)) {
ASSERT3U(refcount_count(&ab->b_refcnt), ==, 0);
ASSERT(ab->b_datacnt > 0);
while (ab->b_buf) {
arc_buf_t *buf = ab->b_buf;
if (!rw_tryenter(&buf->b_lock, RW_WRITER)) {
missed += 1;
break;
}
if (buf->b_data) {
bytes_evicted += ab->b_size;
if (recycle && ab->b_type == type &&
ab->b_size == bytes &&
!HDR_L2_WRITING(ab)) {
stolen = buf->b_data;
recycle = FALSE;
}
}
if (buf->b_efunc) {
mutex_enter(&arc_eviction_mtx);
arc_buf_destroy(buf,
buf->b_data == stolen, FALSE);
ab->b_buf = buf->b_next;
buf->b_hdr = &arc_eviction_hdr;
buf->b_next = arc_eviction_list;
arc_eviction_list = buf;
mutex_exit(&arc_eviction_mtx);
rw_exit(&buf->b_lock);
} else {
rw_exit(&buf->b_lock);
arc_buf_destroy(buf,
buf->b_data == stolen, TRUE);
}
}
if (ab->b_l2hdr) {
ARCSTAT_INCR(arcstat_evict_l2_cached,
ab->b_size);
} else {
if (l2arc_write_eligible(ab->b_spa, ab)) {
ARCSTAT_INCR(arcstat_evict_l2_eligible,
ab->b_size);
} else {
ARCSTAT_INCR(
arcstat_evict_l2_ineligible,
ab->b_size);
}
}
if (ab->b_datacnt == 0) {
arc_change_state(evicted_state, ab, hash_lock);
ASSERT(HDR_IN_HASH_TABLE(ab));
ab->b_flags |= ARC_IN_HASH_TABLE;
ab->b_flags &= ~ARC_BUF_AVAILABLE;
DTRACE_PROBE1(arc__evict, arc_buf_hdr_t *, ab);
}
if (!have_lock)
mutex_exit(hash_lock);
if (bytes >= 0 && bytes_evicted >= bytes)
break;
if (bytes_remaining > 0) {
mutex_exit(evicted_lock);
mutex_exit(lock);
idx = ((idx + 1) & (list_count - 1));
count++;
goto evict_start;
}
} else {
missed += 1;
}
}
mutex_exit(evicted_lock);
mutex_exit(lock);
idx = ((idx + 1) & (list_count - 1));
count++;
if (bytes_evicted < bytes) {
if (count < list_count)
goto evict_start;
else
dprintf("only evicted %lld bytes from %x",
(longlong_t)bytes_evicted, state);
}
if (type == ARC_BUFC_METADATA)
evict_metadata_offset = idx;
else
evict_data_offset = idx;
if (skipped)
ARCSTAT_INCR(arcstat_evict_skip, skipped);
if (missed)
ARCSTAT_INCR(arcstat_mutex_miss, missed);
/*
* We have just evicted some date into the ghost state, make
* sure we also adjust the ghost state size if necessary.
*/
if (arc_no_grow &&
arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size > arc_c) {
int64_t mru_over = arc_anon->arcs_size + arc_mru->arcs_size +
arc_mru_ghost->arcs_size - arc_c;
if (mru_over > 0 && arc_mru_ghost->arcs_lsize[type] > 0) {
int64_t todelete =
MIN(arc_mru_ghost->arcs_lsize[type], mru_over);
arc_evict_ghost(arc_mru_ghost, NULL, todelete);
} else if (arc_mfu_ghost->arcs_lsize[type] > 0) {
int64_t todelete = MIN(arc_mfu_ghost->arcs_lsize[type],
arc_mru_ghost->arcs_size +
arc_mfu_ghost->arcs_size - arc_c);
arc_evict_ghost(arc_mfu_ghost, NULL, todelete);
}
}
if (stolen)
ARCSTAT_BUMP(arcstat_stolen);
return (stolen);
}
/*
* Remove buffers from list until we've removed the specified number of
* bytes. Destroy the buffers that are removed.
*/
static void
arc_evict_ghost(arc_state_t *state, spa_t *spa, int64_t bytes)
{
arc_buf_hdr_t *ab, *ab_prev;
list_t *list, *list_start;
kmutex_t *hash_lock, *lock;
uint64_t bytes_deleted = 0;
uint64_t bufs_skipped = 0;
static int evict_offset;
int list_count, idx = evict_offset;
int offset, count = 0;
ASSERT(GHOST_STATE(state));
/*
* data lists come after metadata lists
*/
list_start = &state->arcs_lists[ARC_BUFC_NUMMETADATALISTS];
list_count = ARC_BUFC_NUMDATALISTS;
offset = ARC_BUFC_NUMMETADATALISTS;
evict_start:
list = &list_start[idx];
lock = ARCS_LOCK(state, idx + offset);
mutex_enter(lock);
for (ab = list_tail(list); ab; ab = ab_prev) {
ab_prev = list_prev(list, ab);
if (spa && ab->b_spa != spa)
continue;
hash_lock = HDR_LOCK(ab);
if (mutex_tryenter(hash_lock)) {
ASSERT(!HDR_IO_IN_PROGRESS(ab));
ASSERT(ab->b_buf == NULL);
ARCSTAT_BUMP(arcstat_deleted);
bytes_deleted += ab->b_size;
if (ab->b_l2hdr != NULL) {
/*
* This buffer is cached on the 2nd Level ARC;
* don't destroy the header.
*/
arc_change_state(arc_l2c_only, ab, hash_lock);
mutex_exit(hash_lock);
} else {
arc_change_state(arc_anon, ab, hash_lock);
mutex_exit(hash_lock);
arc_hdr_destroy(ab);
}
DTRACE_PROBE1(arc__delete, arc_buf_hdr_t *, ab);
if (bytes >= 0 && bytes_deleted >= bytes)
break;
} else {
if (bytes < 0) {
/*
* we're draining the ARC, retry
*/
mutex_exit(lock);
mutex_enter(hash_lock);
mutex_exit(hash_lock);
goto evict_start;
}
bufs_skipped += 1;
}
}
mutex_exit(lock);
idx = ((idx + 1) & (ARC_BUFC_NUMDATALISTS - 1));
count++;
if (count < list_count)
goto evict_start;
evict_offset = idx;
if ((uintptr_t)list > (uintptr_t)&state->arcs_lists[ARC_BUFC_NUMMETADATALISTS] &&
(bytes < 0 || bytes_deleted < bytes)) {
list_start = &state->arcs_lists[0];
list_count = ARC_BUFC_NUMMETADATALISTS;
offset = count = 0;
goto evict_start;
}
if (bufs_skipped) {
ARCSTAT_INCR(arcstat_mutex_miss, bufs_skipped);
ASSERT(bytes >= 0);
}
if (bytes_deleted < bytes)
dprintf("only deleted %lld bytes from %p",
(longlong_t)bytes_deleted, state);
}
static void
arc_adjust(void)
{
int64_t adjustment, delta;
/*
* Adjust MRU size
*/
adjustment = MIN(arc_size - arc_c,
arc_anon->arcs_size + arc_mru->arcs_size + arc_meta_used - arc_p);
if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_DATA] > 0) {
delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_DATA], adjustment);
(void) arc_evict(arc_mru, NULL, delta, FALSE, ARC_BUFC_DATA);
adjustment -= delta;
}
if (adjustment > 0 && arc_mru->arcs_lsize[ARC_BUFC_METADATA] > 0) {
delta = MIN(arc_mru->arcs_lsize[ARC_BUFC_METADATA], adjustment);
(void) arc_evict(arc_mru, NULL, delta, FALSE,
ARC_BUFC_METADATA);
}
/*
* Adjust MFU size
*/
adjustment = arc_size - arc_c;
if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_DATA] > 0) {
delta = MIN(adjustment, arc_mfu->arcs_lsize[ARC_BUFC_DATA]);
(void) arc_evict(arc_mfu, NULL, delta, FALSE, ARC_BUFC_DATA);
adjustment -= delta;
}
if (adjustment > 0 && arc_mfu->arcs_lsize[ARC_BUFC_METADATA] > 0) {
int64_t delta = MIN(adjustment,
arc_mfu->arcs_lsize[ARC_BUFC_METADATA]);
(void) arc_evict(arc_mfu, NULL, delta, FALSE,
ARC_BUFC_METADATA);
}
/*
* Adjust ghost lists
*/
adjustment = arc_mru->arcs_size + arc_mru_ghost->arcs_size - arc_c;
if (adjustment > 0 && arc_mru_ghost->arcs_size > 0) {
delta = MIN(arc_mru_ghost->arcs_size, adjustment);
arc_evict_ghost(arc_mru_ghost, NULL, delta);
}
adjustment =
arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size - arc_c;
if (adjustment > 0 && arc_mfu_ghost->arcs_size > 0) {
delta = MIN(arc_mfu_ghost->arcs_size, adjustment);
arc_evict_ghost(arc_mfu_ghost, NULL, delta);
}
}
static void
arc_do_user_evicts(void)
{
static arc_buf_t *tmp_arc_eviction_list;
/*
* Move list over to avoid LOR
*/
restart:
mutex_enter(&arc_eviction_mtx);
tmp_arc_eviction_list = arc_eviction_list;
arc_eviction_list = NULL;
mutex_exit(&arc_eviction_mtx);
while (tmp_arc_eviction_list != NULL) {
arc_buf_t *buf = tmp_arc_eviction_list;
tmp_arc_eviction_list = buf->b_next;
rw_enter(&buf->b_lock, RW_WRITER);
buf->b_hdr = NULL;
rw_exit(&buf->b_lock);
if (buf->b_efunc != NULL)
VERIFY(buf->b_efunc(buf) == 0);
buf->b_efunc = NULL;
buf->b_private = NULL;
kmem_cache_free(buf_cache, buf);
}
if (arc_eviction_list != NULL)
goto restart;
}
/*
* Flush all *evictable* data from the cache for the given spa.
* NOTE: this will not touch "active" (i.e. referenced) data.
*/
void
arc_flush(spa_t *spa)
{
while (arc_mru->arcs_lsize[ARC_BUFC_DATA]) {
(void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_DATA);
if (spa)
break;
}
while (arc_mru->arcs_lsize[ARC_BUFC_METADATA]) {
(void) arc_evict(arc_mru, spa, -1, FALSE, ARC_BUFC_METADATA);
if (spa)
break;
}
while (arc_mfu->arcs_lsize[ARC_BUFC_DATA]) {
(void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_DATA);
if (spa)
break;
}
while (arc_mfu->arcs_lsize[ARC_BUFC_METADATA]) {
(void) arc_evict(arc_mfu, spa, -1, FALSE, ARC_BUFC_METADATA);
if (spa)
break;
}
arc_evict_ghost(arc_mru_ghost, spa, -1);
arc_evict_ghost(arc_mfu_ghost, spa, -1);
mutex_enter(&arc_reclaim_thr_lock);
arc_do_user_evicts();
mutex_exit(&arc_reclaim_thr_lock);
ASSERT(spa || arc_eviction_list == NULL);
}
void
arc_shrink(void)
{
if (arc_c > arc_c_min) {
uint64_t to_free;
#ifdef _KERNEL
to_free = arc_c >> arc_shrink_shift;
#else
to_free = arc_c >> arc_shrink_shift;
#endif
if (arc_c > arc_c_min + to_free)
atomic_add_64(&arc_c, -to_free);
else
arc_c = arc_c_min;
atomic_add_64(&arc_p, -(arc_p >> arc_shrink_shift));
if (arc_c > arc_size)
arc_c = MAX(arc_size, arc_c_min);
if (arc_p > arc_c)
arc_p = (arc_c >> 1);
ASSERT(arc_c >= arc_c_min);
ASSERT((int64_t)arc_p >= 0);
}
if (arc_size > arc_c)
arc_adjust();
}
static int needfree = 0;
static int
arc_reclaim_needed(void)
{
#if 0
uint64_t extra;
#endif
#ifdef _KERNEL
if (needfree)
return (1);
if (arc_size > arc_c_max)
return (1);
if (arc_size <= arc_c_min)
return (0);
/*
* If pages are needed or we're within 2048 pages
* of needing to page need to reclaim
*/
if (vm_pages_needed || (vm_paging_target() > -2048))
return (1);
#if 0
/*
* take 'desfree' extra pages, so we reclaim sooner, rather than later
*/
extra = desfree;
/*
* check that we're out of range of the pageout scanner. It starts to
* schedule paging if freemem is less than lotsfree and needfree.
* lotsfree is the high-water mark for pageout, and needfree is the
* number of needed free pages. We add extra pages here to make sure
* the scanner doesn't start up while we're freeing memory.
*/
if (freemem < lotsfree + needfree + extra)
return (1);
/*
* check to make sure that swapfs has enough space so that anon
* reservations can still succeed. anon_resvmem() checks that the
* availrmem is greater than swapfs_minfree, and the number of reserved
* swap pages. We also add a bit of extra here just to prevent
* circumstances from getting really dire.
*/
if (availrmem < swapfs_minfree + swapfs_reserve + extra)
return (1);
#if defined(__i386)
/*
* If we're on an i386 platform, it's possible that we'll exhaust the
* kernel heap space before we ever run out of available physical
* memory. Most checks of the size of the heap_area compare against
* tune.t_minarmem, which is the minimum available real memory that we
* can have in the system. However, this is generally fixed at 25 pages
* which is so low that it's useless. In this comparison, we seek to
* calculate the total heap-size, and reclaim if more than 3/4ths of the
* heap is allocated. (Or, in the calculation, if less than 1/4th is
* free)
*/
if (btop(vmem_size(heap_arena, VMEM_FREE)) <
(btop(vmem_size(heap_arena, VMEM_FREE | VMEM_ALLOC)) >> 2))
return (1);
#endif
#else
if (kmem_used() > (kmem_size() * 3) / 4)
return (1);
#endif
#else
if (spa_get_random(100) == 0)
return (1);
#endif
return (0);
}
extern kmem_cache_t *zio_buf_cache[];
extern kmem_cache_t *zio_data_buf_cache[];
static void
arc_kmem_reap_now(arc_reclaim_strategy_t strat)
{
size_t i;
kmem_cache_t *prev_cache = NULL;
kmem_cache_t *prev_data_cache = NULL;
#ifdef _KERNEL
if (arc_meta_used >= arc_meta_limit) {
/*
* We are exceeding our meta-data cache limit.
* Purge some DNLC entries to release holds on meta-data.
*/
dnlc_reduce_cache((void *)(uintptr_t)arc_reduce_dnlc_percent);
}
#if defined(__i386)
/*
* Reclaim unused memory from all kmem caches.
*/
kmem_reap();
#endif
#endif
/*
* An aggressive reclamation will shrink the cache size as well as
* reap free buffers from the arc kmem caches.
*/
if (strat == ARC_RECLAIM_AGGR)
arc_shrink();
for (i = 0; i < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; i++) {
if (zio_buf_cache[i] != prev_cache) {
prev_cache = zio_buf_cache[i];
kmem_cache_reap_now(zio_buf_cache[i]);
}
if (zio_data_buf_cache[i] != prev_data_cache) {
prev_data_cache = zio_data_buf_cache[i];
kmem_cache_reap_now(zio_data_buf_cache[i]);
}
}
kmem_cache_reap_now(buf_cache);
kmem_cache_reap_now(hdr_cache);
}
static void
arc_reclaim_thread(void *dummy __unused)
{
clock_t growtime = 0;
arc_reclaim_strategy_t last_reclaim = ARC_RECLAIM_CONS;
callb_cpr_t cpr;
CALLB_CPR_INIT(&cpr, &arc_reclaim_thr_lock, callb_generic_cpr, FTAG);
mutex_enter(&arc_reclaim_thr_lock);
while (arc_thread_exit == 0) {
if (arc_reclaim_needed()) {
if (arc_no_grow) {
if (last_reclaim == ARC_RECLAIM_CONS) {
last_reclaim = ARC_RECLAIM_AGGR;
} else {
last_reclaim = ARC_RECLAIM_CONS;
}
} else {
arc_no_grow = TRUE;
last_reclaim = ARC_RECLAIM_AGGR;
membar_producer();
}
/* reset the growth delay for every reclaim */
growtime = LBOLT + (arc_grow_retry * hz);
if (needfree && last_reclaim == ARC_RECLAIM_CONS) {
/*
* If needfree is TRUE our vm_lowmem hook
* was called and in that case we must free some
* memory, so switch to aggressive mode.
*/
arc_no_grow = TRUE;
last_reclaim = ARC_RECLAIM_AGGR;
}
arc_kmem_reap_now(last_reclaim);
arc_warm = B_TRUE;
} else if (arc_no_grow && LBOLT >= growtime) {
arc_no_grow = FALSE;
}
if (needfree ||
(2 * arc_c < arc_size +
arc_mru_ghost->arcs_size + arc_mfu_ghost->arcs_size))
arc_adjust();
if (arc_eviction_list != NULL)
arc_do_user_evicts();
if (arc_reclaim_needed()) {
needfree = 0;
#ifdef _KERNEL
wakeup(&needfree);
#endif
}
/* block until needed, or one second, whichever is shorter */
CALLB_CPR_SAFE_BEGIN(&cpr);
(void) cv_timedwait(&arc_reclaim_thr_cv,
&arc_reclaim_thr_lock, hz);
CALLB_CPR_SAFE_END(&cpr, &arc_reclaim_thr_lock);
}
arc_thread_exit = 0;
cv_broadcast(&arc_reclaim_thr_cv);
CALLB_CPR_EXIT(&cpr); /* drops arc_reclaim_thr_lock */
thread_exit();
}
/*
* Adapt arc info given the number of bytes we are trying to add and
* the state that we are comming from. This function is only called
* when we are adding new content to the cache.
*/
static void
arc_adapt(int bytes, arc_state_t *state)
{
int mult;
uint64_t arc_p_min = (arc_c >> arc_p_min_shift);
if (state == arc_l2c_only)
return;
ASSERT(bytes > 0);
/*
* Adapt the target size of the MRU list:
* - if we just hit in the MRU ghost list, then increase
* the target size of the MRU list.
* - if we just hit in the MFU ghost list, then increase
* the target size of the MFU list by decreasing the
* target size of the MRU list.
*/
if (state == arc_mru_ghost) {
mult = ((arc_mru_ghost->arcs_size >= arc_mfu_ghost->arcs_size) ?
1 : (arc_mfu_ghost->arcs_size/arc_mru_ghost->arcs_size));
arc_p = MIN(arc_c - arc_p_min, arc_p + bytes * mult);
} else if (state == arc_mfu_ghost) {
uint64_t delta;
mult = ((arc_mfu_ghost->arcs_size >= arc_mru_ghost->arcs_size) ?
1 : (arc_mru_ghost->arcs_size/arc_mfu_ghost->arcs_size));
delta = MIN(bytes * mult, arc_p);
arc_p = MAX(arc_p_min, arc_p - delta);
}
ASSERT((int64_t)arc_p >= 0);
if (arc_reclaim_needed()) {
cv_signal(&arc_reclaim_thr_cv);
return;
}
if (arc_no_grow)
return;
if (arc_c >= arc_c_max)
return;
/*
* If we're within (2 * maxblocksize) bytes of the target
* cache size, increment the target cache size
*/
if (arc_size > arc_c - (2ULL << SPA_MAXBLOCKSHIFT)) {
atomic_add_64(&arc_c, (int64_t)bytes);
if (arc_c > arc_c_max)
arc_c = arc_c_max;
else if (state == arc_anon)
atomic_add_64(&arc_p, (int64_t)bytes);
if (arc_p > arc_c)
arc_p = arc_c;
}
ASSERT((int64_t)arc_p >= 0);
}
/*
* Check if the cache has reached its limits and eviction is required
* prior to insert.
*/
static int
arc_evict_needed(arc_buf_contents_t type)
{
if (type == ARC_BUFC_METADATA && arc_meta_used >= arc_meta_limit)
return (1);
#if 0
#ifdef _KERNEL
/*
* If zio data pages are being allocated out of a separate heap segment,
* then enforce that the size of available vmem for this area remains
* above about 1/32nd free.
*/
if (type == ARC_BUFC_DATA && zio_arena != NULL &&
vmem_size(zio_arena, VMEM_FREE) <
(vmem_size(zio_arena, VMEM_ALLOC) >> 5))
return (1);
#endif
#endif
if (arc_reclaim_needed())
return (1);
return (arc_size > arc_c);
}
/*
* The buffer, supplied as the first argument, needs a data block.
* So, if we are at cache max, determine which cache should be victimized.
* We have the following cases:
*
* 1. Insert for MRU, p > sizeof(arc_anon + arc_mru) ->
* In this situation if we're out of space, but the resident size of the MFU is
* under the limit, victimize the MFU cache to satisfy this insertion request.
*
* 2. Insert for MRU, p <= sizeof(arc_anon + arc_mru) ->
* Here, we've used up all of the available space for the MRU, so we need to
* evict from our own cache instead. Evict from the set of resident MRU
* entries.
*
* 3. Insert for MFU (c - p) > sizeof(arc_mfu) ->
* c minus p represents the MFU space in the cache, since p is the size of the
* cache that is dedicated to the MRU. In this situation there's still space on
* the MFU side, so the MRU side needs to be victimized.
*
* 4. Insert for MFU (c - p) < sizeof(arc_mfu) ->
* MFU's resident set is consuming more space than it has been allotted. In
* this situation, we must victimize our own cache, the MFU, for this insertion.
*/
static void
arc_get_data_buf(arc_buf_t *buf)
{
arc_state_t *state = buf->b_hdr->b_state;
uint64_t size = buf->b_hdr->b_size;
arc_buf_contents_t type = buf->b_hdr->b_type;
arc_adapt(size, state);
/*
* We have not yet reached cache maximum size,
* just allocate a new buffer.
*/
if (!arc_evict_needed(type)) {
if (type == ARC_BUFC_METADATA) {
buf->b_data = zio_buf_alloc(size);
arc_space_consume(size, ARC_SPACE_DATA);
} else {
ASSERT(type == ARC_BUFC_DATA);
buf->b_data = zio_data_buf_alloc(size);
ARCSTAT_INCR(arcstat_data_size, size);
atomic_add_64(&arc_size, size);
}
goto out;
}
/*
* If we are prefetching from the mfu ghost list, this buffer
* will end up on the mru list; so steal space from there.
*/
if (state == arc_mfu_ghost)
state = buf->b_hdr->b_flags & ARC_PREFETCH ? arc_mru : arc_mfu;
else if (state == arc_mru_ghost)
state = arc_mru;
if (state == arc_mru || state == arc_anon) {
uint64_t mru_used = arc_anon->arcs_size + arc_mru->arcs_size;
state = (arc_mfu->arcs_lsize[type] >= size &&
arc_p > mru_used) ? arc_mfu : arc_mru;
} else {
/* MFU cases */
uint64_t mfu_space = arc_c - arc_p;
state = (arc_mru->arcs_lsize[type] >= size &&
mfu_space > arc_mfu->arcs_size) ? arc_mru : arc_mfu;
}
if ((buf->b_data = arc_evict(state, NULL, size, TRUE, type)) == NULL) {
if (type == ARC_BUFC_METADATA) {
buf->b_data = zio_buf_alloc(size);
arc_space_consume(size, ARC_SPACE_DATA);
} else {
ASSERT(type == ARC_BUFC_DATA);
buf->b_data = zio_data_buf_alloc(size);
ARCSTAT_INCR(arcstat_data_size, size);
atomic_add_64(&arc_size, size);
}
ARCSTAT_BUMP(arcstat_recycle_miss);
}
ASSERT(buf->b_data != NULL);
out:
/*
* Update the state size. Note that ghost states have a
* "ghost size" and so don't need to be updated.
*/
if (!GHOST_STATE(buf->b_hdr->b_state)) {
arc_buf_hdr_t *hdr = buf->b_hdr;
atomic_add_64(&hdr->b_state->arcs_size, size);
if (list_link_active(&hdr->b_arc_node)) {
ASSERT(refcount_is_zero(&hdr->b_refcnt));
atomic_add_64(&hdr->b_state->arcs_lsize[type], size);
}
/*
* If we are growing the cache, and we are adding anonymous
* data, and we have outgrown arc_p, update arc_p
*/
if (arc_size < arc_c && hdr->b_state == arc_anon &&
arc_anon->arcs_size + arc_mru->arcs_size > arc_p)
arc_p = MIN(arc_c, arc_p + size);
}
ARCSTAT_BUMP(arcstat_allocated);
}
/*
* This routine is called whenever a buffer is accessed.
* NOTE: the hash lock is dropped in this function.
*/
static void
arc_access(arc_buf_hdr_t *buf, kmutex_t *hash_lock)
{
ASSERT(MUTEX_HELD(hash_lock));
if (buf->b_state == arc_anon) {
/*
* This buffer is not in the cache, and does not
* appear in our "ghost" list. Add the new buffer
* to the MRU state.
*/
ASSERT(buf->b_arc_access == 0);
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
arc_change_state(arc_mru, buf, hash_lock);
} else if (buf->b_state == arc_mru) {
/*
* If this buffer is here because of a prefetch, then either:
* - clear the flag if this is a "referencing" read
* (any subsequent access will bump this into the MFU state).
* or
* - move the buffer to the head of the list if this is
* another prefetch (to make it less likely to be evicted).
*/
if ((buf->b_flags & ARC_PREFETCH) != 0) {
if (refcount_count(&buf->b_refcnt) == 0) {
ASSERT(list_link_active(&buf->b_arc_node));
} else {
buf->b_flags &= ~ARC_PREFETCH;
ARCSTAT_BUMP(arcstat_mru_hits);
}
buf->b_arc_access = LBOLT;
return;
}
/*
* This buffer has been "accessed" only once so far,
* but it is still in the cache. Move it to the MFU
* state.
*/
if (LBOLT > buf->b_arc_access + ARC_MINTIME) {
/*
* More than 125ms have passed since we
* instantiated this buffer. Move it to the
* most frequently used state.
*/
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
arc_change_state(arc_mfu, buf, hash_lock);
}
ARCSTAT_BUMP(arcstat_mru_hits);
} else if (buf->b_state == arc_mru_ghost) {
arc_state_t *new_state;
/*
* This buffer has been "accessed" recently, but
* was evicted from the cache. Move it to the
* MFU state.
*/
if (buf->b_flags & ARC_PREFETCH) {
new_state = arc_mru;
if (refcount_count(&buf->b_refcnt) > 0)
buf->b_flags &= ~ARC_PREFETCH;
DTRACE_PROBE1(new_state__mru, arc_buf_hdr_t *, buf);
} else {
new_state = arc_mfu;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
}
buf->b_arc_access = LBOLT;
arc_change_state(new_state, buf, hash_lock);
ARCSTAT_BUMP(arcstat_mru_ghost_hits);
} else if (buf->b_state == arc_mfu) {
/*
* This buffer has been accessed more than once and is
* still in the cache. Keep it in the MFU state.
*
* NOTE: an add_reference() that occurred when we did
* the arc_read() will have kicked this off the list.
* If it was a prefetch, we will explicitly move it to
* the head of the list now.
*/
if ((buf->b_flags & ARC_PREFETCH) != 0) {
ASSERT(refcount_count(&buf->b_refcnt) == 0);
ASSERT(list_link_active(&buf->b_arc_node));
}
ARCSTAT_BUMP(arcstat_mfu_hits);
buf->b_arc_access = LBOLT;
} else if (buf->b_state == arc_mfu_ghost) {
arc_state_t *new_state = arc_mfu;
/*
* This buffer has been accessed more than once but has
* been evicted from the cache. Move it back to the
* MFU state.
*/
if (buf->b_flags & ARC_PREFETCH) {
/*
* This is a prefetch access...
* move this block back to the MRU state.
*/
ASSERT3U(refcount_count(&buf->b_refcnt), ==, 0);
new_state = arc_mru;
}
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
arc_change_state(new_state, buf, hash_lock);
ARCSTAT_BUMP(arcstat_mfu_ghost_hits);
} else if (buf->b_state == arc_l2c_only) {
/*
* This buffer is on the 2nd Level ARC.
*/
buf->b_arc_access = LBOLT;
DTRACE_PROBE1(new_state__mfu, arc_buf_hdr_t *, buf);
arc_change_state(arc_mfu, buf, hash_lock);
} else {
ASSERT(!"invalid arc state");
}
}
/* a generic arc_done_func_t which you can use */
/* ARGSUSED */
void
arc_bcopy_func(zio_t *zio, arc_buf_t *buf, void *arg)
{
bcopy(buf->b_data, arg, buf->b_hdr->b_size);
VERIFY(arc_buf_remove_ref(buf, arg) == 1);
}
/* a generic arc_done_func_t */
void
arc_getbuf_func(zio_t *zio, arc_buf_t *buf, void *arg)
{
arc_buf_t **bufp = arg;
if (zio && zio->io_error) {
VERIFY(arc_buf_remove_ref(buf, arg) == 1);
*bufp = NULL;
} else {
*bufp = buf;
}
}
static void
arc_read_done(zio_t *zio)
{
arc_buf_hdr_t *hdr, *found;
arc_buf_t *buf;
arc_buf_t *abuf; /* buffer we're assigning to callback */
kmutex_t *hash_lock;
arc_callback_t *callback_list, *acb;
int freeable = FALSE;
buf = zio->io_private;
hdr = buf->b_hdr;
/*
* The hdr was inserted into hash-table and removed from lists
* prior to starting I/O. We should find this header, since
* it's in the hash table, and it should be legit since it's
* not possible to evict it during the I/O. The only possible
* reason for it not to be found is if we were freed during the
* read.
*/
found = buf_hash_find(zio->io_spa, &hdr->b_dva, hdr->b_birth,
&hash_lock);
ASSERT((found == NULL && HDR_FREED_IN_READ(hdr) && hash_lock == NULL) ||
(found == hdr && DVA_EQUAL(&hdr->b_dva, BP_IDENTITY(zio->io_bp))) ||
(found == hdr && HDR_L2_READING(hdr)));
hdr->b_flags &= ~ARC_L2_EVICTED;
if (l2arc_noprefetch && (hdr->b_flags & ARC_PREFETCH))
hdr->b_flags &= ~ARC_L2CACHE;
/* byteswap if necessary */
callback_list = hdr->b_acb;
ASSERT(callback_list != NULL);
- if (BP_SHOULD_BYTESWAP(zio->io_bp)) {
+ if (BP_SHOULD_BYTESWAP(zio->io_bp) && zio->io_error == 0) {
arc_byteswap_func_t *func = BP_GET_LEVEL(zio->io_bp) > 0 ?
byteswap_uint64_array :
dmu_ot[BP_GET_TYPE(zio->io_bp)].ot_byteswap;
func(buf->b_data, hdr->b_size);
}
arc_cksum_compute(buf, B_FALSE);
/* create copies of the data buffer for the callers */
abuf = buf;
for (acb = callback_list; acb; acb = acb->acb_next) {
if (acb->acb_done) {
if (abuf == NULL)
abuf = arc_buf_clone(buf);
acb->acb_buf = abuf;
abuf = NULL;
}
}
hdr->b_acb = NULL;
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
ASSERT(!HDR_BUF_AVAILABLE(hdr));
if (abuf == buf)
hdr->b_flags |= ARC_BUF_AVAILABLE;
ASSERT(refcount_is_zero(&hdr->b_refcnt) || callback_list != NULL);
if (zio->io_error != 0) {
hdr->b_flags |= ARC_IO_ERROR;
if (hdr->b_state != arc_anon)
arc_change_state(arc_anon, hdr, hash_lock);
if (HDR_IN_HASH_TABLE(hdr))
buf_hash_remove(hdr);
freeable = refcount_is_zero(&hdr->b_refcnt);
}
/*
* Broadcast before we drop the hash_lock to avoid the possibility
* that the hdr (and hence the cv) might be freed before we get to
* the cv_broadcast().
*/
cv_broadcast(&hdr->b_cv);
if (hash_lock) {
/*
* Only call arc_access on anonymous buffers. This is because
* if we've issued an I/O for an evicted buffer, we've already
* called arc_access (to prevent any simultaneous readers from
* getting confused).
*/
if (zio->io_error == 0 && hdr->b_state == arc_anon)
arc_access(hdr, hash_lock);
mutex_exit(hash_lock);
} else {
/*
* This block was freed while we waited for the read to
* complete. It has been removed from the hash table and
* moved to the anonymous state (so that it won't show up
* in the cache).
*/
ASSERT3P(hdr->b_state, ==, arc_anon);
freeable = refcount_is_zero(&hdr->b_refcnt);
}
/* execute each callback and free its structure */
while ((acb = callback_list) != NULL) {
if (acb->acb_done)
acb->acb_done(zio, acb->acb_buf, acb->acb_private);
if (acb->acb_zio_dummy != NULL) {
acb->acb_zio_dummy->io_error = zio->io_error;
zio_nowait(acb->acb_zio_dummy);
}
callback_list = acb->acb_next;
kmem_free(acb, sizeof (arc_callback_t));
}
if (freeable)
arc_hdr_destroy(hdr);
}
/*
* "Read" the block block at the specified DVA (in bp) via the
* cache. If the block is found in the cache, invoke the provided
* callback immediately and return. Note that the `zio' parameter
* in the callback will be NULL in this case, since no IO was
* required. If the block is not in the cache pass the read request
* on to the spa with a substitute callback function, so that the
* requested block will be added to the cache.
*
* If a read request arrives for a block that has a read in-progress,
* either wait for the in-progress read to complete (and return the
* results); or, if this is a read with a "done" func, add a record
* to the read to invoke the "done" func when the read completes,
* and return; or just return.
*
* arc_read_done() will invoke all the requested "done" functions
* for readers of this block.
*
* Normal callers should use arc_read and pass the arc buffer and offset
* for the bp. But if you know you don't need locking, you can use
* arc_read_bp.
*/
int
arc_read(zio_t *pio, spa_t *spa, blkptr_t *bp, arc_buf_t *pbuf,
arc_done_func_t *done, void *private, int priority, int zio_flags,
uint32_t *arc_flags, const zbookmark_t *zb)
{
int err;
ASSERT(!refcount_is_zero(&pbuf->b_hdr->b_refcnt));
ASSERT3U((char *)bp - (char *)pbuf->b_data, <, pbuf->b_hdr->b_size);
rw_enter(&pbuf->b_lock, RW_READER);
err = arc_read_nolock(pio, spa, bp, done, private, priority,
zio_flags, arc_flags, zb);
rw_exit(&pbuf->b_lock);
return (err);
}
int
arc_read_nolock(zio_t *pio, spa_t *spa, blkptr_t *bp,
arc_done_func_t *done, void *private, int priority, int zio_flags,
uint32_t *arc_flags, const zbookmark_t *zb)
{
arc_buf_hdr_t *hdr;
arc_buf_t *buf;
kmutex_t *hash_lock;
zio_t *rzio;
top:
hdr = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
if (hdr && hdr->b_datacnt > 0) {
*arc_flags |= ARC_CACHED;
if (HDR_IO_IN_PROGRESS(hdr)) {
if (*arc_flags & ARC_WAIT) {
cv_wait(&hdr->b_cv, hash_lock);
mutex_exit(hash_lock);
goto top;
}
ASSERT(*arc_flags & ARC_NOWAIT);
if (done) {
arc_callback_t *acb = NULL;
acb = kmem_zalloc(sizeof (arc_callback_t),
KM_SLEEP);
acb->acb_done = done;
acb->acb_private = private;
if (pio != NULL)
acb->acb_zio_dummy = zio_null(pio,
spa, NULL, NULL, zio_flags);
ASSERT(acb->acb_done != NULL);
acb->acb_next = hdr->b_acb;
hdr->b_acb = acb;
add_reference(hdr, hash_lock, private);
mutex_exit(hash_lock);
return (0);
}
mutex_exit(hash_lock);
return (0);
}
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
if (done) {
add_reference(hdr, hash_lock, private);
/*
* If this block is already in use, create a new
* copy of the data so that we will be guaranteed
* that arc_release() will always succeed.
*/
buf = hdr->b_buf;
ASSERT(buf);
ASSERT(buf->b_data);
if (HDR_BUF_AVAILABLE(hdr)) {
ASSERT(buf->b_efunc == NULL);
hdr->b_flags &= ~ARC_BUF_AVAILABLE;
} else {
buf = arc_buf_clone(buf);
}
} else if (*arc_flags & ARC_PREFETCH &&
refcount_count(&hdr->b_refcnt) == 0) {
hdr->b_flags |= ARC_PREFETCH;
}
DTRACE_PROBE1(arc__hit, arc_buf_hdr_t *, hdr);
arc_access(hdr, hash_lock);
if (*arc_flags & ARC_L2CACHE)
hdr->b_flags |= ARC_L2CACHE;
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_hits);
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
data, metadata, hits);
if (done)
done(NULL, buf, private);
} else {
uint64_t size = BP_GET_LSIZE(bp);
arc_callback_t *acb;
vdev_t *vd = NULL;
uint64_t addr;
boolean_t devw = B_FALSE;
if (hdr == NULL) {
/* this block is not in the cache */
arc_buf_hdr_t *exists;
arc_buf_contents_t type = BP_GET_BUFC_TYPE(bp);
buf = arc_buf_alloc(spa, size, private, type);
hdr = buf->b_hdr;
hdr->b_dva = *BP_IDENTITY(bp);
hdr->b_birth = bp->blk_birth;
hdr->b_cksum0 = bp->blk_cksum.zc_word[0];
exists = buf_hash_insert(hdr, &hash_lock);
if (exists) {
/* somebody beat us to the hash insert */
mutex_exit(hash_lock);
bzero(&hdr->b_dva, sizeof (dva_t));
hdr->b_birth = 0;
hdr->b_cksum0 = 0;
(void) arc_buf_remove_ref(buf, private);
goto top; /* restart the IO request */
}
/* if this is a prefetch, we don't have a reference */
if (*arc_flags & ARC_PREFETCH) {
(void) remove_reference(hdr, hash_lock,
private);
hdr->b_flags |= ARC_PREFETCH;
}
if (*arc_flags & ARC_L2CACHE)
hdr->b_flags |= ARC_L2CACHE;
if (BP_GET_LEVEL(bp) > 0)
hdr->b_flags |= ARC_INDIRECT;
} else {
/* this block is in the ghost cache */
ASSERT(GHOST_STATE(hdr->b_state));
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 0);
ASSERT(hdr->b_buf == NULL);
/* if this is a prefetch, we don't have a reference */
if (*arc_flags & ARC_PREFETCH)
hdr->b_flags |= ARC_PREFETCH;
else
add_reference(hdr, hash_lock, private);
if (*arc_flags & ARC_L2CACHE)
hdr->b_flags |= ARC_L2CACHE;
buf = kmem_cache_alloc(buf_cache, KM_PUSHPAGE);
buf->b_hdr = hdr;
buf->b_data = NULL;
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_next = NULL;
hdr->b_buf = buf;
arc_get_data_buf(buf);
ASSERT(hdr->b_datacnt == 0);
hdr->b_datacnt = 1;
}
acb = kmem_zalloc(sizeof (arc_callback_t), KM_SLEEP);
acb->acb_done = done;
acb->acb_private = private;
ASSERT(hdr->b_acb == NULL);
hdr->b_acb = acb;
hdr->b_flags |= ARC_IO_IN_PROGRESS;
/*
* If the buffer has been evicted, migrate it to a present state
* before issuing the I/O. Once we drop the hash-table lock,
* the header will be marked as I/O in progress and have an
* attached buffer. At this point, anybody who finds this
* buffer ought to notice that it's legit but has a pending I/O.
*/
if (GHOST_STATE(hdr->b_state))
arc_access(hdr, hash_lock);
if (HDR_L2CACHE(hdr) && hdr->b_l2hdr != NULL &&
(vd = hdr->b_l2hdr->b_dev->l2ad_vdev) != NULL) {
devw = hdr->b_l2hdr->b_dev->l2ad_writing;
addr = hdr->b_l2hdr->b_daddr;
/*
* Lock out device removal.
*/
if (vdev_is_dead(vd) ||
!spa_config_tryenter(spa, SCL_L2ARC, vd, RW_READER))
vd = NULL;
}
mutex_exit(hash_lock);
ASSERT3U(hdr->b_size, ==, size);
DTRACE_PROBE3(arc__miss, blkptr_t *, bp, uint64_t, size,
zbookmark_t *, zb);
ARCSTAT_BUMP(arcstat_misses);
ARCSTAT_CONDSTAT(!(hdr->b_flags & ARC_PREFETCH),
demand, prefetch, hdr->b_type != ARC_BUFC_METADATA,
data, metadata, misses);
if (vd != NULL && l2arc_ndev != 0 && !(l2arc_norw && devw)) {
/*
* Read from the L2ARC if the following are true:
* 1. The L2ARC vdev was previously cached.
* 2. This buffer still has L2ARC metadata.
* 3. This buffer isn't currently writing to the L2ARC.
* 4. The L2ARC entry wasn't evicted, which may
* also have invalidated the vdev.
* 5. This isn't prefetch and l2arc_noprefetch is set.
*/
if (hdr->b_l2hdr != NULL &&
!HDR_L2_WRITING(hdr) && !HDR_L2_EVICTED(hdr) &&
!(l2arc_noprefetch && HDR_PREFETCH(hdr))) {
l2arc_read_callback_t *cb;
DTRACE_PROBE1(l2arc__hit, arc_buf_hdr_t *, hdr);
ARCSTAT_BUMP(arcstat_l2_hits);
cb = kmem_zalloc(sizeof (l2arc_read_callback_t),
KM_SLEEP);
cb->l2rcb_buf = buf;
cb->l2rcb_spa = spa;
cb->l2rcb_bp = *bp;
cb->l2rcb_zb = *zb;
cb->l2rcb_flags = zio_flags;
/*
* l2arc read. The SCL_L2ARC lock will be
* released by l2arc_read_done().
*/
rzio = zio_read_phys(pio, vd, addr, size,
buf->b_data, ZIO_CHECKSUM_OFF,
l2arc_read_done, cb, priority, zio_flags |
ZIO_FLAG_DONT_CACHE | ZIO_FLAG_CANFAIL |
ZIO_FLAG_DONT_PROPAGATE |
ZIO_FLAG_DONT_RETRY, B_FALSE);
DTRACE_PROBE2(l2arc__read, vdev_t *, vd,
zio_t *, rzio);
ARCSTAT_INCR(arcstat_l2_read_bytes, size);
if (*arc_flags & ARC_NOWAIT) {
zio_nowait(rzio);
return (0);
}
ASSERT(*arc_flags & ARC_WAIT);
if (zio_wait(rzio) == 0)
return (0);
/* l2arc read error; goto zio_read() */
} else {
DTRACE_PROBE1(l2arc__miss,
arc_buf_hdr_t *, hdr);
ARCSTAT_BUMP(arcstat_l2_misses);
if (HDR_L2_WRITING(hdr))
ARCSTAT_BUMP(arcstat_l2_rw_clash);
spa_config_exit(spa, SCL_L2ARC, vd);
}
} else {
if (vd != NULL)
spa_config_exit(spa, SCL_L2ARC, vd);
if (l2arc_ndev != 0) {
DTRACE_PROBE1(l2arc__miss,
arc_buf_hdr_t *, hdr);
ARCSTAT_BUMP(arcstat_l2_misses);
}
}
rzio = zio_read(pio, spa, bp, buf->b_data, size,
arc_read_done, buf, priority, zio_flags, zb);
if (*arc_flags & ARC_WAIT)
return (zio_wait(rzio));
ASSERT(*arc_flags & ARC_NOWAIT);
zio_nowait(rzio);
}
return (0);
}
/*
* arc_read() variant to support pool traversal. If the block is already
* in the ARC, make a copy of it; otherwise, the caller will do the I/O.
* The idea is that we don't want pool traversal filling up memory, but
* if the ARC already has the data anyway, we shouldn't pay for the I/O.
*/
int
arc_tryread(spa_t *spa, blkptr_t *bp, void *data)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_mtx;
int rc = 0;
hdr = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_mtx);
if (hdr && hdr->b_datacnt > 0 && !HDR_IO_IN_PROGRESS(hdr)) {
arc_buf_t *buf = hdr->b_buf;
ASSERT(buf);
while (buf->b_data == NULL) {
buf = buf->b_next;
ASSERT(buf);
}
bcopy(buf->b_data, data, hdr->b_size);
} else {
rc = ENOENT;
}
if (hash_mtx)
mutex_exit(hash_mtx);
return (rc);
}
void
arc_set_callback(arc_buf_t *buf, arc_evict_func_t *func, void *private)
{
ASSERT(buf->b_hdr != NULL);
ASSERT(buf->b_hdr->b_state != arc_anon);
ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt) || func == NULL);
buf->b_efunc = func;
buf->b_private = private;
}
/*
* This is used by the DMU to let the ARC know that a buffer is
* being evicted, so the ARC should clean up. If this arc buf
* is not yet in the evicted state, it will be put there.
*/
int
arc_buf_evict(arc_buf_t *buf)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_lock;
arc_buf_t **bufp;
list_t *list, *evicted_list;
kmutex_t *lock, *evicted_lock;
rw_enter(&buf->b_lock, RW_WRITER);
hdr = buf->b_hdr;
if (hdr == NULL) {
/*
* We are in arc_do_user_evicts().
*/
ASSERT(buf->b_data == NULL);
rw_exit(&buf->b_lock);
return (0);
} else if (buf->b_data == NULL) {
arc_buf_t copy = *buf; /* structure assignment */
/*
* We are on the eviction list; process this buffer now
* but let arc_do_user_evicts() do the reaping.
*/
buf->b_efunc = NULL;
rw_exit(&buf->b_lock);
VERIFY(copy.b_efunc(©) == 0);
return (1);
}
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
ASSERT(buf->b_hdr == hdr);
ASSERT3U(refcount_count(&hdr->b_refcnt), <, hdr->b_datacnt);
ASSERT(hdr->b_state == arc_mru || hdr->b_state == arc_mfu);
/*
* Pull this buffer off of the hdr
*/
bufp = &hdr->b_buf;
while (*bufp != buf)
bufp = &(*bufp)->b_next;
*bufp = buf->b_next;
ASSERT(buf->b_data != NULL);
arc_buf_destroy(buf, FALSE, FALSE);
if (hdr->b_datacnt == 0) {
arc_state_t *old_state = hdr->b_state;
arc_state_t *evicted_state;
ASSERT(refcount_is_zero(&hdr->b_refcnt));
evicted_state =
(old_state == arc_mru) ? arc_mru_ghost : arc_mfu_ghost;
get_buf_info(hdr, old_state, &list, &lock);
get_buf_info(hdr, evicted_state, &evicted_list, &evicted_lock);
mutex_enter(lock);
mutex_enter(evicted_lock);
arc_change_state(evicted_state, hdr, hash_lock);
ASSERT(HDR_IN_HASH_TABLE(hdr));
hdr->b_flags |= ARC_IN_HASH_TABLE;
hdr->b_flags &= ~ARC_BUF_AVAILABLE;
mutex_exit(evicted_lock);
mutex_exit(lock);
}
mutex_exit(hash_lock);
rw_exit(&buf->b_lock);
VERIFY(buf->b_efunc(buf) == 0);
buf->b_efunc = NULL;
buf->b_private = NULL;
buf->b_hdr = NULL;
kmem_cache_free(buf_cache, buf);
return (1);
}
/*
* Release this buffer from the cache. This must be done
* after a read and prior to modifying the buffer contents.
* If the buffer has more than one reference, we must make
* a new hdr for the buffer.
*/
void
arc_release(arc_buf_t *buf, void *tag)
{
arc_buf_hdr_t *hdr;
kmutex_t *hash_lock;
l2arc_buf_hdr_t *l2hdr;
uint64_t buf_size;
boolean_t released = B_FALSE;
rw_enter(&buf->b_lock, RW_WRITER);
hdr = buf->b_hdr;
/* this buffer is not on any list */
ASSERT(refcount_count(&hdr->b_refcnt) > 0);
ASSERT(!(hdr->b_flags & ARC_STORED));
if (hdr->b_state == arc_anon) {
/* this buffer is already released */
ASSERT3U(refcount_count(&hdr->b_refcnt), ==, 1);
ASSERT(BUF_EMPTY(hdr));
ASSERT(buf->b_efunc == NULL);
arc_buf_thaw(buf);
rw_exit(&buf->b_lock);
released = B_TRUE;
} else {
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
}
l2hdr = hdr->b_l2hdr;
if (l2hdr) {
mutex_enter(&l2arc_buflist_mtx);
hdr->b_l2hdr = NULL;
buf_size = hdr->b_size;
}
if (released)
goto out;
/*
* Do we have more than one buf?
*/
if (hdr->b_datacnt > 1) {
arc_buf_hdr_t *nhdr;
arc_buf_t **bufp;
uint64_t blksz = hdr->b_size;
spa_t *spa = hdr->b_spa;
arc_buf_contents_t type = hdr->b_type;
uint32_t flags = hdr->b_flags;
ASSERT(hdr->b_buf != buf || buf->b_next != NULL);
/*
* Pull the data off of this buf and attach it to
* a new anonymous buf.
*/
(void) remove_reference(hdr, hash_lock, tag);
bufp = &hdr->b_buf;
while (*bufp != buf)
bufp = &(*bufp)->b_next;
*bufp = (*bufp)->b_next;
buf->b_next = NULL;
ASSERT3U(hdr->b_state->arcs_size, >=, hdr->b_size);
atomic_add_64(&hdr->b_state->arcs_size, -hdr->b_size);
if (refcount_is_zero(&hdr->b_refcnt)) {
uint64_t *size = &hdr->b_state->arcs_lsize[hdr->b_type];
ASSERT3U(*size, >=, hdr->b_size);
atomic_add_64(size, -hdr->b_size);
}
hdr->b_datacnt -= 1;
arc_cksum_verify(buf);
mutex_exit(hash_lock);
nhdr = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
nhdr->b_size = blksz;
nhdr->b_spa = spa;
nhdr->b_type = type;
nhdr->b_buf = buf;
nhdr->b_state = arc_anon;
nhdr->b_arc_access = 0;
nhdr->b_flags = flags & ARC_L2_WRITING;
nhdr->b_l2hdr = NULL;
nhdr->b_datacnt = 1;
nhdr->b_freeze_cksum = NULL;
(void) refcount_add(&nhdr->b_refcnt, tag);
buf->b_hdr = nhdr;
rw_exit(&buf->b_lock);
atomic_add_64(&arc_anon->arcs_size, blksz);
} else {
rw_exit(&buf->b_lock);
ASSERT(refcount_count(&hdr->b_refcnt) == 1);
ASSERT(!list_link_active(&hdr->b_arc_node));
ASSERT(!HDR_IO_IN_PROGRESS(hdr));
arc_change_state(arc_anon, hdr, hash_lock);
hdr->b_arc_access = 0;
mutex_exit(hash_lock);
bzero(&hdr->b_dva, sizeof (dva_t));
hdr->b_birth = 0;
hdr->b_cksum0 = 0;
arc_buf_thaw(buf);
}
buf->b_efunc = NULL;
buf->b_private = NULL;
out:
if (l2hdr) {
list_remove(l2hdr->b_dev->l2ad_buflist, hdr);
kmem_free(l2hdr, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -buf_size);
mutex_exit(&l2arc_buflist_mtx);
}
}
int
arc_released(arc_buf_t *buf)
{
int released;
rw_enter(&buf->b_lock, RW_READER);
released = (buf->b_data != NULL && buf->b_hdr->b_state == arc_anon);
rw_exit(&buf->b_lock);
return (released);
}
int
arc_has_callback(arc_buf_t *buf)
{
int callback;
rw_enter(&buf->b_lock, RW_READER);
callback = (buf->b_efunc != NULL);
rw_exit(&buf->b_lock);
return (callback);
}
#ifdef ZFS_DEBUG
int
arc_referenced(arc_buf_t *buf)
{
int referenced;
rw_enter(&buf->b_lock, RW_READER);
referenced = (refcount_count(&buf->b_hdr->b_refcnt));
rw_exit(&buf->b_lock);
return (referenced);
}
#endif
static void
arc_write_ready(zio_t *zio)
{
arc_write_callback_t *callback = zio->io_private;
arc_buf_t *buf = callback->awcb_buf;
arc_buf_hdr_t *hdr = buf->b_hdr;
ASSERT(!refcount_is_zero(&buf->b_hdr->b_refcnt));
callback->awcb_ready(zio, buf, callback->awcb_private);
/*
* If the IO is already in progress, then this is a re-write
* attempt, so we need to thaw and re-compute the cksum.
* It is the responsibility of the callback to handle the
* accounting for any re-write attempt.
*/
if (HDR_IO_IN_PROGRESS(hdr)) {
mutex_enter(&hdr->b_freeze_lock);
if (hdr->b_freeze_cksum != NULL) {
kmem_free(hdr->b_freeze_cksum, sizeof (zio_cksum_t));
hdr->b_freeze_cksum = NULL;
}
mutex_exit(&hdr->b_freeze_lock);
}
arc_cksum_compute(buf, B_FALSE);
hdr->b_flags |= ARC_IO_IN_PROGRESS;
}
static void
arc_write_done(zio_t *zio)
{
arc_write_callback_t *callback = zio->io_private;
arc_buf_t *buf = callback->awcb_buf;
arc_buf_hdr_t *hdr = buf->b_hdr;
hdr->b_acb = NULL;
hdr->b_dva = *BP_IDENTITY(zio->io_bp);
hdr->b_birth = zio->io_bp->blk_birth;
hdr->b_cksum0 = zio->io_bp->blk_cksum.zc_word[0];
/*
* If the block to be written was all-zero, we may have
* compressed it away. In this case no write was performed
* so there will be no dva/birth-date/checksum. The buffer
* must therefor remain anonymous (and uncached).
*/
if (!BUF_EMPTY(hdr)) {
arc_buf_hdr_t *exists;
kmutex_t *hash_lock;
arc_cksum_verify(buf);
exists = buf_hash_insert(hdr, &hash_lock);
if (exists) {
/*
* This can only happen if we overwrite for
* sync-to-convergence, because we remove
* buffers from the hash table when we arc_free().
*/
ASSERT(zio->io_flags & ZIO_FLAG_IO_REWRITE);
ASSERT(DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig),
BP_IDENTITY(zio->io_bp)));
ASSERT3U(zio->io_bp_orig.blk_birth, ==,
zio->io_bp->blk_birth);
ASSERT(refcount_is_zero(&exists->b_refcnt));
arc_change_state(arc_anon, exists, hash_lock);
mutex_exit(hash_lock);
arc_hdr_destroy(exists);
exists = buf_hash_insert(hdr, &hash_lock);
ASSERT3P(exists, ==, NULL);
}
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
/* if it's not anon, we are doing a scrub */
if (hdr->b_state == arc_anon)
arc_access(hdr, hash_lock);
mutex_exit(hash_lock);
} else if (callback->awcb_done == NULL) {
int destroy_hdr;
/*
* This is an anonymous buffer with no user callback,
* destroy it if there are no active references.
*/
mutex_enter(&arc_eviction_mtx);
destroy_hdr = refcount_is_zero(&hdr->b_refcnt);
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
mutex_exit(&arc_eviction_mtx);
if (destroy_hdr)
arc_hdr_destroy(hdr);
} else {
hdr->b_flags &= ~ARC_IO_IN_PROGRESS;
}
hdr->b_flags &= ~ARC_STORED;
if (callback->awcb_done) {
ASSERT(!refcount_is_zero(&hdr->b_refcnt));
callback->awcb_done(zio, buf, callback->awcb_private);
}
kmem_free(callback, sizeof (arc_write_callback_t));
}
static void
write_policy(spa_t *spa, const writeprops_t *wp, zio_prop_t *zp)
{
boolean_t ismd = (wp->wp_level > 0 || dmu_ot[wp->wp_type].ot_metadata);
/* Determine checksum setting */
if (ismd) {
/*
* Metadata always gets checksummed. If the data
* checksum is multi-bit correctable, and it's not a
* ZBT-style checksum, then it's suitable for metadata
* as well. Otherwise, the metadata checksum defaults
* to fletcher4.
*/
if (zio_checksum_table[wp->wp_oschecksum].ci_correctable &&
!zio_checksum_table[wp->wp_oschecksum].ci_zbt)
zp->zp_checksum = wp->wp_oschecksum;
else
zp->zp_checksum = ZIO_CHECKSUM_FLETCHER_4;
} else {
zp->zp_checksum = zio_checksum_select(wp->wp_dnchecksum,
wp->wp_oschecksum);
}
/* Determine compression setting */
if (ismd) {
/*
* XXX -- we should design a compression algorithm
* that specializes in arrays of bps.
*/
zp->zp_compress = zfs_mdcomp_disable ? ZIO_COMPRESS_EMPTY :
ZIO_COMPRESS_LZJB;
} else {
zp->zp_compress = zio_compress_select(wp->wp_dncompress,
wp->wp_oscompress);
}
zp->zp_type = wp->wp_type;
zp->zp_level = wp->wp_level;
zp->zp_ndvas = MIN(wp->wp_copies + ismd, spa_max_replication(spa));
}
zio_t *
arc_write(zio_t *pio, spa_t *spa, const writeprops_t *wp,
boolean_t l2arc, uint64_t txg, blkptr_t *bp, arc_buf_t *buf,
arc_done_func_t *ready, arc_done_func_t *done, void *private, int priority,
int zio_flags, const zbookmark_t *zb)
{
arc_buf_hdr_t *hdr = buf->b_hdr;
arc_write_callback_t *callback;
zio_t *zio;
zio_prop_t zp;
ASSERT(ready != NULL);
ASSERT(!HDR_IO_ERROR(hdr));
ASSERT((hdr->b_flags & ARC_IO_IN_PROGRESS) == 0);
ASSERT(hdr->b_acb == 0);
if (l2arc)
hdr->b_flags |= ARC_L2CACHE;
callback = kmem_zalloc(sizeof (arc_write_callback_t), KM_SLEEP);
callback->awcb_ready = ready;
callback->awcb_done = done;
callback->awcb_private = private;
callback->awcb_buf = buf;
write_policy(spa, wp, &zp);
zio = zio_write(pio, spa, txg, bp, buf->b_data, hdr->b_size, &zp,
arc_write_ready, arc_write_done, callback, priority, zio_flags, zb);
return (zio);
}
int
arc_free(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
zio_done_func_t *done, void *private, uint32_t arc_flags)
{
arc_buf_hdr_t *ab;
kmutex_t *hash_lock;
zio_t *zio;
/*
* If this buffer is in the cache, release it, so it
* can be re-used.
*/
ab = buf_hash_find(spa, BP_IDENTITY(bp), bp->blk_birth, &hash_lock);
if (ab != NULL) {
/*
* The checksum of blocks to free is not always
* preserved (eg. on the deadlist). However, if it is
* nonzero, it should match what we have in the cache.
*/
ASSERT(bp->blk_cksum.zc_word[0] == 0 ||
bp->blk_cksum.zc_word[0] == ab->b_cksum0 ||
bp->blk_fill == BLK_FILL_ALREADY_FREED);
if (ab->b_state != arc_anon)
arc_change_state(arc_anon, ab, hash_lock);
if (HDR_IO_IN_PROGRESS(ab)) {
/*
* This should only happen when we prefetch.
*/
ASSERT(ab->b_flags & ARC_PREFETCH);
ASSERT3U(ab->b_datacnt, ==, 1);
ab->b_flags |= ARC_FREED_IN_READ;
if (HDR_IN_HASH_TABLE(ab))
buf_hash_remove(ab);
ab->b_arc_access = 0;
bzero(&ab->b_dva, sizeof (dva_t));
ab->b_birth = 0;
ab->b_cksum0 = 0;
ab->b_buf->b_efunc = NULL;
ab->b_buf->b_private = NULL;
mutex_exit(hash_lock);
} else if (refcount_is_zero(&ab->b_refcnt)) {
ab->b_flags |= ARC_FREE_IN_PROGRESS;
mutex_exit(hash_lock);
arc_hdr_destroy(ab);
ARCSTAT_BUMP(arcstat_deleted);
} else {
/*
* We still have an active reference on this
* buffer. This can happen, e.g., from
* dbuf_unoverride().
*/
ASSERT(!HDR_IN_HASH_TABLE(ab));
ab->b_arc_access = 0;
bzero(&ab->b_dva, sizeof (dva_t));
ab->b_birth = 0;
ab->b_cksum0 = 0;
ab->b_buf->b_efunc = NULL;
ab->b_buf->b_private = NULL;
mutex_exit(hash_lock);
}
}
zio = zio_free(pio, spa, txg, bp, done, private, ZIO_FLAG_MUSTSUCCEED);
if (arc_flags & ARC_WAIT)
return (zio_wait(zio));
ASSERT(arc_flags & ARC_NOWAIT);
zio_nowait(zio);
return (0);
}
static int
arc_memory_throttle(uint64_t reserve, uint64_t txg)
{
#ifdef _KERNEL
uint64_t inflight_data = arc_anon->arcs_size;
uint64_t available_memory = ptoa((uintmax_t)cnt.v_free_count);
static uint64_t page_load = 0;
static uint64_t last_txg = 0;
#if 0
#if defined(__i386)
available_memory =
MIN(available_memory, vmem_size(heap_arena, VMEM_FREE));
#endif
#endif
if (available_memory >= zfs_write_limit_max)
return (0);
if (txg > last_txg) {
last_txg = txg;
page_load = 0;
}
/*
* If we are in pageout, we know that memory is already tight,
* the arc is already going to be evicting, so we just want to
* continue to let page writes occur as quickly as possible.
*/
if (curproc == pageproc) {
if (page_load > available_memory / 4)
return (ERESTART);
/* Note: reserve is inflated, so we deflate */
page_load += reserve / 8;
return (0);
} else if (page_load > 0 && arc_reclaim_needed()) {
/* memory is low, delay before restarting */
ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
return (EAGAIN);
}
page_load = 0;
if (arc_size > arc_c_min) {
uint64_t evictable_memory =
arc_mru->arcs_lsize[ARC_BUFC_DATA] +
arc_mru->arcs_lsize[ARC_BUFC_METADATA] +
arc_mfu->arcs_lsize[ARC_BUFC_DATA] +
arc_mfu->arcs_lsize[ARC_BUFC_METADATA];
available_memory += MIN(evictable_memory, arc_size - arc_c_min);
}
if (inflight_data > available_memory / 4) {
ARCSTAT_INCR(arcstat_memory_throttle_count, 1);
return (ERESTART);
}
#endif
return (0);
}
void
arc_tempreserve_clear(uint64_t reserve)
{
atomic_add_64(&arc_tempreserve, -reserve);
ASSERT((int64_t)arc_tempreserve >= 0);
}
int
arc_tempreserve_space(uint64_t reserve, uint64_t txg)
{
int error;
#ifdef ZFS_DEBUG
/*
* Once in a while, fail for no reason. Everything should cope.
*/
if (spa_get_random(10000) == 0) {
dprintf("forcing random failure\n");
return (ERESTART);
}
#endif
if (reserve > arc_c/4 && !arc_no_grow)
arc_c = MIN(arc_c_max, reserve * 4);
if (reserve > arc_c)
return (ENOMEM);
/*
* Writes will, almost always, require additional memory allocations
* in order to compress/encrypt/etc the data. We therefor need to
* make sure that there is sufficient available memory for this.
*/
if (error = arc_memory_throttle(reserve, txg))
return (error);
/*
* Throttle writes when the amount of dirty data in the cache
* gets too large. We try to keep the cache less than half full
* of dirty blocks so that our sync times don't grow too large.
* Note: if two requests come in concurrently, we might let them
* both succeed, when one of them should fail. Not a huge deal.
*/
if (reserve + arc_tempreserve + arc_anon->arcs_size > arc_c / 2 &&
arc_anon->arcs_size > arc_c / 4) {
dprintf("failing, arc_tempreserve=%lluK anon_meta=%lluK "
"anon_data=%lluK tempreserve=%lluK arc_c=%lluK\n",
arc_tempreserve>>10,
arc_anon->arcs_lsize[ARC_BUFC_METADATA]>>10,
arc_anon->arcs_lsize[ARC_BUFC_DATA]>>10,
reserve>>10, arc_c>>10);
return (ERESTART);
}
atomic_add_64(&arc_tempreserve, reserve);
return (0);
}
static kmutex_t arc_lowmem_lock;
#ifdef _KERNEL
static eventhandler_tag arc_event_lowmem = NULL;
static void
arc_lowmem(void *arg __unused, int howto __unused)
{
/* Serialize access via arc_lowmem_lock. */
mutex_enter(&arc_lowmem_lock);
needfree = 1;
cv_signal(&arc_reclaim_thr_cv);
while (needfree)
tsleep(&needfree, 0, "zfs:lowmem", hz / 5);
mutex_exit(&arc_lowmem_lock);
}
#endif
void
arc_init(void)
{
int prefetch_tunable_set = 0;
int i;
mutex_init(&arc_reclaim_thr_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&arc_reclaim_thr_cv, NULL, CV_DEFAULT, NULL);
mutex_init(&arc_lowmem_lock, NULL, MUTEX_DEFAULT, NULL);
/* Convert seconds to clock ticks */
arc_min_prefetch_lifespan = 1 * hz;
/* Start out with 1/8 of all memory */
arc_c = kmem_size() / 8;
#if 0
#ifdef _KERNEL
/*
* On architectures where the physical memory can be larger
* than the addressable space (intel in 32-bit mode), we may
* need to limit the cache to 1/8 of VM size.
*/
arc_c = MIN(arc_c, vmem_size(heap_arena, VMEM_ALLOC | VMEM_FREE) / 8);
#endif
#endif
/* set min cache to 1/32 of all memory, or 16MB, whichever is more */
arc_c_min = MAX(arc_c / 4, 64<<18);
/* set max to 1/2 of all memory, or all but 1GB, whichever is more */
if (arc_c * 8 >= 1<<30)
arc_c_max = (arc_c * 8) - (1<<30);
else
arc_c_max = arc_c_min;
arc_c_max = MAX(arc_c * 5, arc_c_max);
#ifdef _KERNEL
/*
* Allow the tunables to override our calculations if they are
* reasonable (ie. over 16MB)
*/
if (zfs_arc_max >= 64<<18 && zfs_arc_max < kmem_size())
arc_c_max = zfs_arc_max;
if (zfs_arc_min >= 64<<18 && zfs_arc_min <= arc_c_max)
arc_c_min = zfs_arc_min;
#endif
arc_c = arc_c_max;
arc_p = (arc_c >> 1);
/* limit meta-data to 1/4 of the arc capacity */
arc_meta_limit = arc_c_max / 4;
/* Allow the tunable to override if it is reasonable */
if (zfs_arc_meta_limit > 0 && zfs_arc_meta_limit <= arc_c_max)
arc_meta_limit = zfs_arc_meta_limit;
if (arc_c_min < arc_meta_limit / 2 && zfs_arc_min == 0)
arc_c_min = arc_meta_limit / 2;
if (zfs_arc_grow_retry > 0)
arc_grow_retry = zfs_arc_grow_retry;
if (zfs_arc_shrink_shift > 0)
arc_shrink_shift = zfs_arc_shrink_shift;
if (zfs_arc_p_min_shift > 0)
arc_p_min_shift = zfs_arc_p_min_shift;
/* if kmem_flags are set, lets try to use less memory */
if (kmem_debugging())
arc_c = arc_c / 2;
if (arc_c < arc_c_min)
arc_c = arc_c_min;
zfs_arc_min = arc_c_min;
zfs_arc_max = arc_c_max;
arc_anon = &ARC_anon;
arc_mru = &ARC_mru;
arc_mru_ghost = &ARC_mru_ghost;
arc_mfu = &ARC_mfu;
arc_mfu_ghost = &ARC_mfu_ghost;
arc_l2c_only = &ARC_l2c_only;
arc_size = 0;
for (i = 0; i < ARC_BUFC_NUMLISTS; i++) {
mutex_init(&arc_anon->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mru->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mru_ghost->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mfu->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_mfu_ghost->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
mutex_init(&arc_l2c_only->arcs_locks[i].arcs_lock,
NULL, MUTEX_DEFAULT, NULL);
list_create(&arc_mru->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_mru_ghost->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_mfu->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_mfu_ghost->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_mfu_ghost->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
list_create(&arc_l2c_only->arcs_lists[i],
sizeof (arc_buf_hdr_t), offsetof(arc_buf_hdr_t, b_arc_node));
}
buf_init();
arc_thread_exit = 0;
arc_eviction_list = NULL;
mutex_init(&arc_eviction_mtx, NULL, MUTEX_DEFAULT, NULL);
bzero(&arc_eviction_hdr, sizeof (arc_buf_hdr_t));
arc_ksp = kstat_create("zfs", 0, "arcstats", "misc", KSTAT_TYPE_NAMED,
sizeof (arc_stats) / sizeof (kstat_named_t), KSTAT_FLAG_VIRTUAL);
if (arc_ksp != NULL) {
arc_ksp->ks_data = &arc_stats;
kstat_install(arc_ksp);
}
(void) thread_create(NULL, 0, arc_reclaim_thread, NULL, 0, &p0,
TS_RUN, minclsyspri);
#ifdef _KERNEL
arc_event_lowmem = EVENTHANDLER_REGISTER(vm_lowmem, arc_lowmem, NULL,
EVENTHANDLER_PRI_FIRST);
#endif
arc_dead = FALSE;
arc_warm = B_FALSE;
if (zfs_write_limit_max == 0)
zfs_write_limit_max = ptob(physmem) >> zfs_write_limit_shift;
else
zfs_write_limit_shift = 0;
mutex_init(&zfs_write_limit_lock, NULL, MUTEX_DEFAULT, NULL);
#ifdef _KERNEL
if (TUNABLE_INT_FETCH("vfs.zfs.prefetch_disable", &zfs_prefetch_disable))
prefetch_tunable_set = 1;
#ifdef __i386__
if (prefetch_tunable_set == 0) {
printf("ZFS NOTICE: Prefetch is disabled by default on i386 "
"-- to enable,\n");
printf(" add \"vfs.zfs.prefetch_disable=0\" "
"to /boot/loader.conf.\n");
zfs_prefetch_disable=1;
}
#else
if ((((uint64_t)physmem * PAGESIZE) < (1ULL << 32)) &&
prefetch_tunable_set == 0) {
printf("ZFS NOTICE: Prefetch is disabled by default if less "
"than 4GB of RAM is present;\n"
" to enable, add \"vfs.zfs.prefetch_disable=0\" "
"to /boot/loader.conf.\n");
zfs_prefetch_disable=1;
}
#endif
/* Warn about ZFS memory and address space requirements. */
if (((uint64_t)physmem * PAGESIZE) < (256 + 128 + 64) * (1 << 20)) {
printf("ZFS WARNING: Recommended minimum RAM size is 512MB; "
"expect unstable behavior.\n");
}
if (kmem_size() < 512 * (1 << 20)) {
printf("ZFS WARNING: Recommended minimum kmem_size is 512MB; "
"expect unstable behavior.\n");
printf(" Consider tuning vm.kmem_size and "
"vm.kmem_size_max\n");
printf(" in /boot/loader.conf.\n");
}
#endif
}
void
arc_fini(void)
{
int i;
mutex_enter(&arc_reclaim_thr_lock);
arc_thread_exit = 1;
cv_signal(&arc_reclaim_thr_cv);
while (arc_thread_exit != 0)
cv_wait(&arc_reclaim_thr_cv, &arc_reclaim_thr_lock);
mutex_exit(&arc_reclaim_thr_lock);
arc_flush(NULL);
arc_dead = TRUE;
if (arc_ksp != NULL) {
kstat_delete(arc_ksp);
arc_ksp = NULL;
}
mutex_destroy(&arc_eviction_mtx);
mutex_destroy(&arc_reclaim_thr_lock);
cv_destroy(&arc_reclaim_thr_cv);
for (i = 0; i < ARC_BUFC_NUMLISTS; i++) {
list_destroy(&arc_mru->arcs_lists[i]);
list_destroy(&arc_mru_ghost->arcs_lists[i]);
list_destroy(&arc_mfu->arcs_lists[i]);
list_destroy(&arc_mfu_ghost->arcs_lists[i]);
list_destroy(&arc_l2c_only->arcs_lists[i]);
mutex_destroy(&arc_anon->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mru->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mru_ghost->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mfu->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_mfu_ghost->arcs_locks[i].arcs_lock);
mutex_destroy(&arc_l2c_only->arcs_locks[i].arcs_lock);
}
mutex_destroy(&zfs_write_limit_lock);
buf_fini();
mutex_destroy(&arc_lowmem_lock);
#ifdef _KERNEL
if (arc_event_lowmem != NULL)
EVENTHANDLER_DEREGISTER(vm_lowmem, arc_event_lowmem);
#endif
}
/*
* Level 2 ARC
*
* The level 2 ARC (L2ARC) is a cache layer in-between main memory and disk.
* It uses dedicated storage devices to hold cached data, which are populated
* using large infrequent writes. The main role of this cache is to boost
* the performance of random read workloads. The intended L2ARC devices
* include short-stroked disks, solid state disks, and other media with
* substantially faster read latency than disk.
*
* +-----------------------+
* | ARC |
* +-----------------------+
* | ^ ^
* | | |
* l2arc_feed_thread() arc_read()
* | | |
* | l2arc read |
* V | |
* +---------------+ |
* | L2ARC | |
* +---------------+ |
* | ^ |
* l2arc_write() | |
* | | |
* V | |
* +-------+ +-------+
* | vdev | | vdev |
* | cache | | cache |
* +-------+ +-------+
* +=========+ .-----.
* : L2ARC : |-_____-|
* : devices : | Disks |
* +=========+ `-_____-'
*
* Read requests are satisfied from the following sources, in order:
*
* 1) ARC
* 2) vdev cache of L2ARC devices
* 3) L2ARC devices
* 4) vdev cache of disks
* 5) disks
*
* Some L2ARC device types exhibit extremely slow write performance.
* To accommodate for this there are some significant differences between
* the L2ARC and traditional cache design:
*
* 1. There is no eviction path from the ARC to the L2ARC. Evictions from
* the ARC behave as usual, freeing buffers and placing headers on ghost
* lists. The ARC does not send buffers to the L2ARC during eviction as
* this would add inflated write latencies for all ARC memory pressure.
*
* 2. The L2ARC attempts to cache data from the ARC before it is evicted.
* It does this by periodically scanning buffers from the eviction-end of
* the MFU and MRU ARC lists, copying them to the L2ARC devices if they are
* not already there. It scans until a headroom of buffers is satisfied,
* which itself is a buffer for ARC eviction. The thread that does this is
* l2arc_feed_thread(), illustrated below; example sizes are included to
* provide a better sense of ratio than this diagram:
*
* head --> tail
* +---------------------+----------+
* ARC_mfu |:::::#:::::::::::::::|o#o###o###|-->. # already on L2ARC
* +---------------------+----------+ | o L2ARC eligible
* ARC_mru |:#:::::::::::::::::::|#o#ooo####|-->| : ARC buffer
* +---------------------+----------+ |
* 15.9 Gbytes ^ 32 Mbytes |
* headroom |
* l2arc_feed_thread()
* |
* l2arc write hand <--[oooo]--'
* | 8 Mbyte
* | write max
* V
* +==============================+
* L2ARC dev |####|#|###|###| |####| ... |
* +==============================+
* 32 Gbytes
*
* 3. If an ARC buffer is copied to the L2ARC but then hit instead of
* evicted, then the L2ARC has cached a buffer much sooner than it probably
* needed to, potentially wasting L2ARC device bandwidth and storage. It is
* safe to say that this is an uncommon case, since buffers at the end of
* the ARC lists have moved there due to inactivity.
*
* 4. If the ARC evicts faster than the L2ARC can maintain a headroom,
* then the L2ARC simply misses copying some buffers. This serves as a
* pressure valve to prevent heavy read workloads from both stalling the ARC
* with waits and clogging the L2ARC with writes. This also helps prevent
* the potential for the L2ARC to churn if it attempts to cache content too
* quickly, such as during backups of the entire pool.
*
* 5. After system boot and before the ARC has filled main memory, there are
* no evictions from the ARC and so the tails of the ARC_mfu and ARC_mru
* lists can remain mostly static. Instead of searching from tail of these
* lists as pictured, the l2arc_feed_thread() will search from the list heads
* for eligible buffers, greatly increasing its chance of finding them.
*
* The L2ARC device write speed is also boosted during this time so that
* the L2ARC warms up faster. Since there have been no ARC evictions yet,
* there are no L2ARC reads, and no fear of degrading read performance
* through increased writes.
*
* 6. Writes to the L2ARC devices are grouped and sent in-sequence, so that
* the vdev queue can aggregate them into larger and fewer writes. Each
* device is written to in a rotor fashion, sweeping writes through
* available space then repeating.
*
* 7. The L2ARC does not store dirty content. It never needs to flush
* write buffers back to disk based storage.
*
* 8. If an ARC buffer is written (and dirtied) which also exists in the
* L2ARC, the now stale L2ARC buffer is immediately dropped.
*
* The performance of the L2ARC can be tweaked by a number of tunables, which
* may be necessary for different workloads:
*
* l2arc_write_max max write bytes per interval
* l2arc_write_boost extra write bytes during device warmup
* l2arc_noprefetch skip caching prefetched buffers
* l2arc_headroom number of max device writes to precache
* l2arc_feed_secs seconds between L2ARC writing
*
* Tunables may be removed or added as future performance improvements are
* integrated, and also may become zpool properties.
*
* There are three key functions that control how the L2ARC warms up:
*
* l2arc_write_eligible() check if a buffer is eligible to cache
* l2arc_write_size() calculate how much to write
* l2arc_write_interval() calculate sleep delay between writes
*
* These three functions determine what to write, how much, and how quickly
* to send writes.
*/
static boolean_t
l2arc_write_eligible(spa_t *spa, arc_buf_hdr_t *ab)
{
/*
* A buffer is *not* eligible for the L2ARC if it:
* 1. belongs to a different spa.
* 2. is already cached on the L2ARC.
* 3. has an I/O in progress (it may be an incomplete read).
* 4. is flagged not eligible (zfs property).
*/
if (ab->b_spa != spa) {
ARCSTAT_BUMP(arcstat_l2_write_spa_mismatch);
return (B_FALSE);
}
if (ab->b_l2hdr != NULL) {
ARCSTAT_BUMP(arcstat_l2_write_in_l2);
return (B_FALSE);
}
if (HDR_IO_IN_PROGRESS(ab)) {
ARCSTAT_BUMP(arcstat_l2_write_hdr_io_in_progress);
return (B_FALSE);
}
if (!HDR_L2CACHE(ab)) {
ARCSTAT_BUMP(arcstat_l2_write_not_cacheable);
return (B_FALSE);
}
return (B_TRUE);
}
static uint64_t
l2arc_write_size(l2arc_dev_t *dev)
{
uint64_t size;
size = dev->l2ad_write;
if (arc_warm == B_FALSE)
size += dev->l2ad_boost;
return (size);
}
static clock_t
l2arc_write_interval(clock_t began, uint64_t wanted, uint64_t wrote)
{
clock_t interval, next;
/*
* If the ARC lists are busy, increase our write rate; if the
* lists are stale, idle back. This is achieved by checking
* how much we previously wrote - if it was more than half of
* what we wanted, schedule the next write much sooner.
*/
if (l2arc_feed_again && wrote > (wanted / 2))
interval = (hz * l2arc_feed_min_ms) / 1000;
else
interval = hz * l2arc_feed_secs;
next = MAX(LBOLT, MIN(LBOLT + interval, began + interval));
return (next);
}
static void
l2arc_hdr_stat_add(void)
{
ARCSTAT_INCR(arcstat_l2_hdr_size, HDR_SIZE + L2HDR_SIZE);
ARCSTAT_INCR(arcstat_hdr_size, -HDR_SIZE);
}
static void
l2arc_hdr_stat_remove(void)
{
ARCSTAT_INCR(arcstat_l2_hdr_size, -(HDR_SIZE + L2HDR_SIZE));
ARCSTAT_INCR(arcstat_hdr_size, HDR_SIZE);
}
/*
* Cycle through L2ARC devices. This is how L2ARC load balances.
* If a device is returned, this also returns holding the spa config lock.
*/
static l2arc_dev_t *
l2arc_dev_get_next(void)
{
l2arc_dev_t *first, *next = NULL;
/*
* Lock out the removal of spas (spa_namespace_lock), then removal
* of cache devices (l2arc_dev_mtx). Once a device has been selected,
* both locks will be dropped and a spa config lock held instead.
*/
mutex_enter(&spa_namespace_lock);
mutex_enter(&l2arc_dev_mtx);
/* if there are no vdevs, there is nothing to do */
if (l2arc_ndev == 0)
goto out;
first = NULL;
next = l2arc_dev_last;
do {
/* loop around the list looking for a non-faulted vdev */
if (next == NULL) {
next = list_head(l2arc_dev_list);
} else {
next = list_next(l2arc_dev_list, next);
if (next == NULL)
next = list_head(l2arc_dev_list);
}
/* if we have come back to the start, bail out */
if (first == NULL)
first = next;
else if (next == first)
break;
} while (vdev_is_dead(next->l2ad_vdev));
/* if we were unable to find any usable vdevs, return NULL */
if (vdev_is_dead(next->l2ad_vdev))
next = NULL;
l2arc_dev_last = next;
out:
mutex_exit(&l2arc_dev_mtx);
/*
* Grab the config lock to prevent the 'next' device from being
* removed while we are writing to it.
*/
if (next != NULL)
spa_config_enter(next->l2ad_spa, SCL_L2ARC, next, RW_READER);
mutex_exit(&spa_namespace_lock);
return (next);
}
/*
* Free buffers that were tagged for destruction.
*/
static void
l2arc_do_free_on_write()
{
list_t *buflist;
l2arc_data_free_t *df, *df_prev;
mutex_enter(&l2arc_free_on_write_mtx);
buflist = l2arc_free_on_write;
for (df = list_tail(buflist); df; df = df_prev) {
df_prev = list_prev(buflist, df);
ASSERT(df->l2df_data != NULL);
ASSERT(df->l2df_func != NULL);
df->l2df_func(df->l2df_data, df->l2df_size);
list_remove(buflist, df);
kmem_free(df, sizeof (l2arc_data_free_t));
}
mutex_exit(&l2arc_free_on_write_mtx);
}
/*
* A write to a cache device has completed. Update all headers to allow
* reads from these buffers to begin.
*/
static void
l2arc_write_done(zio_t *zio)
{
l2arc_write_callback_t *cb;
l2arc_dev_t *dev;
list_t *buflist;
arc_buf_hdr_t *head, *ab, *ab_prev;
l2arc_buf_hdr_t *abl2;
kmutex_t *hash_lock;
cb = zio->io_private;
ASSERT(cb != NULL);
dev = cb->l2wcb_dev;
ASSERT(dev != NULL);
head = cb->l2wcb_head;
ASSERT(head != NULL);
buflist = dev->l2ad_buflist;
ASSERT(buflist != NULL);
DTRACE_PROBE2(l2arc__iodone, zio_t *, zio,
l2arc_write_callback_t *, cb);
if (zio->io_error != 0)
ARCSTAT_BUMP(arcstat_l2_writes_error);
mutex_enter(&l2arc_buflist_mtx);
/*
* All writes completed, or an error was hit.
*/
for (ab = list_prev(buflist, head); ab; ab = ab_prev) {
ab_prev = list_prev(buflist, ab);
hash_lock = HDR_LOCK(ab);
if (!mutex_tryenter(hash_lock)) {
/*
* This buffer misses out. It may be in a stage
* of eviction. Its ARC_L2_WRITING flag will be
* left set, denying reads to this buffer.
*/
ARCSTAT_BUMP(arcstat_l2_writes_hdr_miss);
continue;
}
if (zio->io_error != 0) {
/*
* Error - drop L2ARC entry.
*/
list_remove(buflist, ab);
abl2 = ab->b_l2hdr;
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
/*
* Allow ARC to begin reads to this L2ARC entry.
*/
ab->b_flags &= ~ARC_L2_WRITING;
mutex_exit(hash_lock);
}
atomic_inc_64(&l2arc_writes_done);
list_remove(buflist, head);
kmem_cache_free(hdr_cache, head);
mutex_exit(&l2arc_buflist_mtx);
l2arc_do_free_on_write();
kmem_free(cb, sizeof (l2arc_write_callback_t));
}
/*
* A read to a cache device completed. Validate buffer contents before
* handing over to the regular ARC routines.
*/
static void
l2arc_read_done(zio_t *zio)
{
l2arc_read_callback_t *cb;
arc_buf_hdr_t *hdr;
arc_buf_t *buf;
kmutex_t *hash_lock;
int equal;
ASSERT(zio->io_vd != NULL);
ASSERT(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE);
spa_config_exit(zio->io_spa, SCL_L2ARC, zio->io_vd);
cb = zio->io_private;
ASSERT(cb != NULL);
buf = cb->l2rcb_buf;
ASSERT(buf != NULL);
hdr = buf->b_hdr;
ASSERT(hdr != NULL);
hash_lock = HDR_LOCK(hdr);
mutex_enter(hash_lock);
/*
* Check this survived the L2ARC journey.
*/
equal = arc_cksum_equal(buf);
if (equal && zio->io_error == 0 && !HDR_L2_EVICTED(hdr)) {
mutex_exit(hash_lock);
zio->io_private = buf;
zio->io_bp_copy = cb->l2rcb_bp; /* XXX fix in L2ARC 2.0 */
zio->io_bp = &zio->io_bp_copy; /* XXX fix in L2ARC 2.0 */
arc_read_done(zio);
} else {
mutex_exit(hash_lock);
/*
* Buffer didn't survive caching. Increment stats and
* reissue to the original storage device.
*/
if (zio->io_error != 0) {
ARCSTAT_BUMP(arcstat_l2_io_error);
} else {
zio->io_error = EIO;
}
if (!equal)
ARCSTAT_BUMP(arcstat_l2_cksum_bad);
/*
* If there's no waiter, issue an async i/o to the primary
* storage now. If there *is* a waiter, the caller must
* issue the i/o in a context where it's OK to block.
*/
if (zio->io_waiter == NULL)
zio_nowait(zio_read(zio->io_parent,
cb->l2rcb_spa, &cb->l2rcb_bp,
buf->b_data, zio->io_size, arc_read_done, buf,
zio->io_priority, cb->l2rcb_flags, &cb->l2rcb_zb));
}
kmem_free(cb, sizeof (l2arc_read_callback_t));
}
/*
* This is the list priority from which the L2ARC will search for pages to
* cache. This is used within loops (0..3) to cycle through lists in the
* desired order. This order can have a significant effect on cache
* performance.
*
* Currently the metadata lists are hit first, MFU then MRU, followed by
* the data lists. This function returns a locked list, and also returns
* the lock pointer.
*/
static list_t *
l2arc_list_locked(int list_num, kmutex_t **lock)
{
list_t *list;
int idx;
ASSERT(list_num >= 0 && list_num < 2 * ARC_BUFC_NUMLISTS);
if (list_num < ARC_BUFC_NUMMETADATALISTS) {
idx = list_num;
list = &arc_mfu->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mfu, idx);
} else if (list_num < ARC_BUFC_NUMMETADATALISTS * 2) {
idx = list_num - ARC_BUFC_NUMMETADATALISTS;
list = &arc_mru->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mru, idx);
} else if (list_num < (ARC_BUFC_NUMMETADATALISTS * 2 +
ARC_BUFC_NUMDATALISTS)) {
idx = list_num - ARC_BUFC_NUMMETADATALISTS;
list = &arc_mfu->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mfu, idx);
} else {
idx = list_num - ARC_BUFC_NUMLISTS;
list = &arc_mru->arcs_lists[idx];
*lock = ARCS_LOCK(arc_mru, idx);
}
ASSERT(!(MUTEX_HELD(*lock)));
mutex_enter(*lock);
return (list);
}
/*
* Evict buffers from the device write hand to the distance specified in
* bytes. This distance may span populated buffers, it may span nothing.
* This is clearing a region on the L2ARC device ready for writing.
* If the 'all' boolean is set, every buffer is evicted.
*/
static void
l2arc_evict(l2arc_dev_t *dev, uint64_t distance, boolean_t all)
{
list_t *buflist;
l2arc_buf_hdr_t *abl2;
arc_buf_hdr_t *ab, *ab_prev;
kmutex_t *hash_lock;
uint64_t taddr;
buflist = dev->l2ad_buflist;
if (buflist == NULL)
return;
if (!all && dev->l2ad_first) {
/*
* This is the first sweep through the device. There is
* nothing to evict.
*/
return;
}
if (dev->l2ad_hand >= (dev->l2ad_end - (2 * distance))) {
/*
* When nearing the end of the device, evict to the end
* before the device write hand jumps to the start.
*/
taddr = dev->l2ad_end;
} else {
taddr = dev->l2ad_hand + distance;
}
DTRACE_PROBE4(l2arc__evict, l2arc_dev_t *, dev, list_t *, buflist,
uint64_t, taddr, boolean_t, all);
top:
mutex_enter(&l2arc_buflist_mtx);
for (ab = list_tail(buflist); ab; ab = ab_prev) {
ab_prev = list_prev(buflist, ab);
hash_lock = HDR_LOCK(ab);
if (!mutex_tryenter(hash_lock)) {
/*
* Missed the hash lock. Retry.
*/
ARCSTAT_BUMP(arcstat_l2_evict_lock_retry);
mutex_exit(&l2arc_buflist_mtx);
mutex_enter(hash_lock);
mutex_exit(hash_lock);
goto top;
}
if (HDR_L2_WRITE_HEAD(ab)) {
/*
* We hit a write head node. Leave it for
* l2arc_write_done().
*/
list_remove(buflist, ab);
mutex_exit(hash_lock);
continue;
}
if (!all && ab->b_l2hdr != NULL &&
(ab->b_l2hdr->b_daddr > taddr ||
ab->b_l2hdr->b_daddr < dev->l2ad_hand)) {
/*
* We've evicted to the target address,
* or the end of the device.
*/
mutex_exit(hash_lock);
break;
}
if (HDR_FREE_IN_PROGRESS(ab)) {
/*
* Already on the path to destruction.
*/
mutex_exit(hash_lock);
continue;
}
if (ab->b_state == arc_l2c_only) {
ASSERT(!HDR_L2_READING(ab));
/*
* This doesn't exist in the ARC. Destroy.
* arc_hdr_destroy() will call list_remove()
* and decrement arcstat_l2_size.
*/
arc_change_state(arc_anon, ab, hash_lock);
arc_hdr_destroy(ab);
} else {
/*
* Invalidate issued or about to be issued
* reads, since we may be about to write
* over this location.
*/
if (HDR_L2_READING(ab)) {
ARCSTAT_BUMP(arcstat_l2_evict_reading);
ab->b_flags |= ARC_L2_EVICTED;
}
/*
* Tell ARC this no longer exists in L2ARC.
*/
if (ab->b_l2hdr != NULL) {
abl2 = ab->b_l2hdr;
ab->b_l2hdr = NULL;
kmem_free(abl2, sizeof (l2arc_buf_hdr_t));
ARCSTAT_INCR(arcstat_l2_size, -ab->b_size);
}
list_remove(buflist, ab);
/*
* This may have been leftover after a
* failed write.
*/
ab->b_flags &= ~ARC_L2_WRITING;
}
mutex_exit(hash_lock);
}
mutex_exit(&l2arc_buflist_mtx);
spa_l2cache_space_update(dev->l2ad_vdev, 0, -(taddr - dev->l2ad_evict));
dev->l2ad_evict = taddr;
}
/*
* Find and write ARC buffers to the L2ARC device.
*
* An ARC_L2_WRITING flag is set so that the L2ARC buffers are not valid
* for reading until they have completed writing.
*/
static uint64_t
l2arc_write_buffers(spa_t *spa, l2arc_dev_t *dev, uint64_t target_sz)
{
arc_buf_hdr_t *ab, *ab_prev, *head;
l2arc_buf_hdr_t *hdrl2;
list_t *list;
uint64_t passed_sz, write_sz, buf_sz, headroom;
void *buf_data;
kmutex_t *hash_lock, *list_lock;
boolean_t have_lock, full;
l2arc_write_callback_t *cb;
zio_t *pio, *wzio;
int try;
ASSERT(dev->l2ad_vdev != NULL);
pio = NULL;
write_sz = 0;
full = B_FALSE;
head = kmem_cache_alloc(hdr_cache, KM_PUSHPAGE);
head->b_flags |= ARC_L2_WRITE_HEAD;
ARCSTAT_BUMP(arcstat_l2_write_buffer_iter);
/*
* Copy buffers for L2ARC writing.
*/
mutex_enter(&l2arc_buflist_mtx);
for (try = 0; try < 2 * ARC_BUFC_NUMLISTS; try++) {
list = l2arc_list_locked(try, &list_lock);
passed_sz = 0;
ARCSTAT_BUMP(arcstat_l2_write_buffer_list_iter);
/*
* L2ARC fast warmup.
*
* Until the ARC is warm and starts to evict, read from the
* head of the ARC lists rather than the tail.
*/
headroom = target_sz * l2arc_headroom;
if (arc_warm == B_FALSE)
ab = list_head(list);
else
ab = list_tail(list);
if (ab == NULL)
ARCSTAT_BUMP(arcstat_l2_write_buffer_list_null_iter);
for (; ab; ab = ab_prev) {
if (arc_warm == B_FALSE)
ab_prev = list_next(list, ab);
else
ab_prev = list_prev(list, ab);
ARCSTAT_INCR(arcstat_l2_write_buffer_bytes_scanned, ab->b_size);
hash_lock = HDR_LOCK(ab);
have_lock = MUTEX_HELD(hash_lock);
if (!have_lock && !mutex_tryenter(hash_lock)) {
ARCSTAT_BUMP(arcstat_l2_write_trylock_fail);
/*
* Skip this buffer rather than waiting.
*/
continue;
}
passed_sz += ab->b_size;
if (passed_sz > headroom) {
/*
* Searched too far.
*/
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_l2_write_passed_headroom);
break;
}
if (!l2arc_write_eligible(spa, ab)) {
mutex_exit(hash_lock);
continue;
}
if ((write_sz + ab->b_size) > target_sz) {
full = B_TRUE;
mutex_exit(hash_lock);
ARCSTAT_BUMP(arcstat_l2_write_full);
break;
}
if (pio == NULL) {
/*
* Insert a dummy header on the buflist so
* l2arc_write_done() can find where the
* write buffers begin without searching.
*/
list_insert_head(dev->l2ad_buflist, head);
cb = kmem_alloc(
sizeof (l2arc_write_callback_t), KM_SLEEP);
cb->l2wcb_dev = dev;
cb->l2wcb_head = head;
pio = zio_root(spa, l2arc_write_done, cb,
ZIO_FLAG_CANFAIL);
ARCSTAT_BUMP(arcstat_l2_write_pios);
}
/*
* Create and add a new L2ARC header.
*/
hdrl2 = kmem_zalloc(sizeof (l2arc_buf_hdr_t), KM_SLEEP);
hdrl2->b_dev = dev;
hdrl2->b_daddr = dev->l2ad_hand;
ab->b_flags |= ARC_L2_WRITING;
ab->b_l2hdr = hdrl2;
list_insert_head(dev->l2ad_buflist, ab);
buf_data = ab->b_buf->b_data;
buf_sz = ab->b_size;
/*
* Compute and store the buffer cksum before
* writing. On debug the cksum is verified first.
*/
arc_cksum_verify(ab->b_buf);
arc_cksum_compute(ab->b_buf, B_TRUE);
mutex_exit(hash_lock);
wzio = zio_write_phys(pio, dev->l2ad_vdev,
dev->l2ad_hand, buf_sz, buf_data, ZIO_CHECKSUM_OFF,
NULL, NULL, ZIO_PRIORITY_ASYNC_WRITE,
ZIO_FLAG_CANFAIL, B_FALSE);
DTRACE_PROBE2(l2arc__write, vdev_t *, dev->l2ad_vdev,
zio_t *, wzio);
(void) zio_nowait(wzio);
/*
* Keep the clock hand suitably device-aligned.
*/
buf_sz = vdev_psize_to_asize(dev->l2ad_vdev, buf_sz);
write_sz += buf_sz;
dev->l2ad_hand += buf_sz;
}
mutex_exit(list_lock);
if (full == B_TRUE)
break;
}
mutex_exit(&l2arc_buflist_mtx);
if (pio == NULL) {
ASSERT3U(write_sz, ==, 0);
kmem_cache_free(hdr_cache, head);
return (0);
}
ASSERT3U(write_sz, <=, target_sz);
ARCSTAT_BUMP(arcstat_l2_writes_sent);
ARCSTAT_INCR(arcstat_l2_write_bytes, write_sz);
ARCSTAT_INCR(arcstat_l2_size, write_sz);
spa_l2cache_space_update(dev->l2ad_vdev, 0, write_sz);
/*
* Bump device hand to the device start if it is approaching the end.
* l2arc_evict() will already have evicted ahead for this case.
*/
if (dev->l2ad_hand >= (dev->l2ad_end - target_sz)) {
spa_l2cache_space_update(dev->l2ad_vdev, 0,
dev->l2ad_end - dev->l2ad_hand);
dev->l2ad_hand = dev->l2ad_start;
dev->l2ad_evict = dev->l2ad_start;
dev->l2ad_first = B_FALSE;
}
dev->l2ad_writing = B_TRUE;
(void) zio_wait(pio);
dev->l2ad_writing = B_FALSE;
return (write_sz);
}
/*
* This thread feeds the L2ARC at regular intervals. This is the beating
* heart of the L2ARC.
*/
static void
l2arc_feed_thread(void *dummy __unused)
{
callb_cpr_t cpr;
l2arc_dev_t *dev;
spa_t *spa;
uint64_t size, wrote;
clock_t begin, next = LBOLT;
CALLB_CPR_INIT(&cpr, &l2arc_feed_thr_lock, callb_generic_cpr, FTAG);
mutex_enter(&l2arc_feed_thr_lock);
while (l2arc_thread_exit == 0) {
CALLB_CPR_SAFE_BEGIN(&cpr);
(void) cv_timedwait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock,
next - LBOLT);
CALLB_CPR_SAFE_END(&cpr, &l2arc_feed_thr_lock);
next = LBOLT + hz;
/*
* Quick check for L2ARC devices.
*/
mutex_enter(&l2arc_dev_mtx);
if (l2arc_ndev == 0) {
mutex_exit(&l2arc_dev_mtx);
continue;
}
mutex_exit(&l2arc_dev_mtx);
begin = LBOLT;
/*
* This selects the next l2arc device to write to, and in
* doing so the next spa to feed from: dev->l2ad_spa. This
* will return NULL if there are now no l2arc devices or if
* they are all faulted.
*
* If a device is returned, its spa's config lock is also
* held to prevent device removal. l2arc_dev_get_next()
* will grab and release l2arc_dev_mtx.
*/
if ((dev = l2arc_dev_get_next()) == NULL)
continue;
spa = dev->l2ad_spa;
ASSERT(spa != NULL);
/*
* Avoid contributing to memory pressure.
*/
if (arc_reclaim_needed()) {
ARCSTAT_BUMP(arcstat_l2_abort_lowmem);
spa_config_exit(spa, SCL_L2ARC, dev);
continue;
}
ARCSTAT_BUMP(arcstat_l2_feeds);
size = l2arc_write_size(dev);
/*
* Evict L2ARC buffers that will be overwritten.
*/
l2arc_evict(dev, size, B_FALSE);
/*
* Write ARC buffers.
*/
wrote = l2arc_write_buffers(spa, dev, size);
/*
* Calculate interval between writes.
*/
next = l2arc_write_interval(begin, size, wrote);
spa_config_exit(spa, SCL_L2ARC, dev);
}
l2arc_thread_exit = 0;
cv_broadcast(&l2arc_feed_thr_cv);
CALLB_CPR_EXIT(&cpr); /* drops l2arc_feed_thr_lock */
thread_exit();
}
boolean_t
l2arc_vdev_present(vdev_t *vd)
{
l2arc_dev_t *dev;
mutex_enter(&l2arc_dev_mtx);
for (dev = list_head(l2arc_dev_list); dev != NULL;
dev = list_next(l2arc_dev_list, dev)) {
if (dev->l2ad_vdev == vd)
break;
}
mutex_exit(&l2arc_dev_mtx);
return (dev != NULL);
}
/*
* Add a vdev for use by the L2ARC. By this point the spa has already
* validated the vdev and opened it.
*/
void
l2arc_add_vdev(spa_t *spa, vdev_t *vd, uint64_t start, uint64_t end)
{
l2arc_dev_t *adddev;
ASSERT(!l2arc_vdev_present(vd));
/*
* Create a new l2arc device entry.
*/
adddev = kmem_zalloc(sizeof (l2arc_dev_t), KM_SLEEP);
adddev->l2ad_spa = spa;
adddev->l2ad_vdev = vd;
adddev->l2ad_write = l2arc_write_max;
adddev->l2ad_boost = l2arc_write_boost;
adddev->l2ad_start = start;
adddev->l2ad_end = end;
adddev->l2ad_hand = adddev->l2ad_start;
adddev->l2ad_evict = adddev->l2ad_start;
adddev->l2ad_first = B_TRUE;
adddev->l2ad_writing = B_FALSE;
ASSERT3U(adddev->l2ad_write, >, 0);
/*
* This is a list of all ARC buffers that are still valid on the
* device.
*/
adddev->l2ad_buflist = kmem_zalloc(sizeof (list_t), KM_SLEEP);
list_create(adddev->l2ad_buflist, sizeof (arc_buf_hdr_t),
offsetof(arc_buf_hdr_t, b_l2node));
spa_l2cache_space_update(vd, adddev->l2ad_end - adddev->l2ad_hand, 0);
/*
* Add device to global list
*/
mutex_enter(&l2arc_dev_mtx);
list_insert_head(l2arc_dev_list, adddev);
atomic_inc_64(&l2arc_ndev);
mutex_exit(&l2arc_dev_mtx);
}
/*
* Remove a vdev from the L2ARC.
*/
void
l2arc_remove_vdev(vdev_t *vd)
{
l2arc_dev_t *dev, *nextdev, *remdev = NULL;
/*
* Find the device by vdev
*/
mutex_enter(&l2arc_dev_mtx);
for (dev = list_head(l2arc_dev_list); dev; dev = nextdev) {
nextdev = list_next(l2arc_dev_list, dev);
if (vd == dev->l2ad_vdev) {
remdev = dev;
break;
}
}
ASSERT(remdev != NULL);
/*
* Remove device from global list
*/
list_remove(l2arc_dev_list, remdev);
l2arc_dev_last = NULL; /* may have been invalidated */
atomic_dec_64(&l2arc_ndev);
mutex_exit(&l2arc_dev_mtx);
/*
* Clear all buflists and ARC references. L2ARC device flush.
*/
l2arc_evict(remdev, 0, B_TRUE);
list_destroy(remdev->l2ad_buflist);
kmem_free(remdev->l2ad_buflist, sizeof (list_t));
kmem_free(remdev, sizeof (l2arc_dev_t));
}
void
l2arc_init(void)
{
l2arc_thread_exit = 0;
l2arc_ndev = 0;
l2arc_writes_sent = 0;
l2arc_writes_done = 0;
mutex_init(&l2arc_feed_thr_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&l2arc_feed_thr_cv, NULL, CV_DEFAULT, NULL);
mutex_init(&l2arc_dev_mtx, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&l2arc_buflist_mtx, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&l2arc_free_on_write_mtx, NULL, MUTEX_DEFAULT, NULL);
l2arc_dev_list = &L2ARC_dev_list;
l2arc_free_on_write = &L2ARC_free_on_write;
list_create(l2arc_dev_list, sizeof (l2arc_dev_t),
offsetof(l2arc_dev_t, l2ad_node));
list_create(l2arc_free_on_write, sizeof (l2arc_data_free_t),
offsetof(l2arc_data_free_t, l2df_list_node));
}
void
l2arc_fini(void)
{
/*
* This is called from dmu_fini(), which is called from spa_fini();
* Because of this, we can assume that all l2arc devices have
* already been removed when the pools themselves were removed.
*/
l2arc_do_free_on_write();
mutex_destroy(&l2arc_feed_thr_lock);
cv_destroy(&l2arc_feed_thr_cv);
mutex_destroy(&l2arc_dev_mtx);
mutex_destroy(&l2arc_buflist_mtx);
mutex_destroy(&l2arc_free_on_write_mtx);
list_destroy(l2arc_dev_list);
list_destroy(l2arc_free_on_write);
}
void
l2arc_start(void)
{
if (!(spa_mode & FWRITE))
return;
(void) thread_create(NULL, 0, l2arc_feed_thread, NULL, 0, &p0,
TS_RUN, minclsyspri);
}
void
l2arc_stop(void)
{
if (!(spa_mode & FWRITE))
return;
mutex_enter(&l2arc_feed_thr_lock);
cv_signal(&l2arc_feed_thr_cv); /* kick thread out of startup */
l2arc_thread_exit = 1;
while (l2arc_thread_exit != 0)
cv_wait(&l2arc_feed_thr_cv, &l2arc_feed_thr_lock);
mutex_exit(&l2arc_feed_thr_lock);
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c (revision 209274)
@@ -1,1066 +1,1066 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/dmu.h>
#include <sys/dmu_impl.h>
#include <sys/dbuf.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/dsl_dataset.h> /* for dsl_dataset_block_freeable() */
#include <sys/dsl_dir.h> /* for dsl_dir_tempreserve_*() */
#include <sys/dsl_pool.h>
#include <sys/zap_impl.h> /* for fzap_default_block_shift */
#include <sys/spa.h>
#include <sys/zfs_context.h>
typedef void (*dmu_tx_hold_func_t)(dmu_tx_t *tx, struct dnode *dn,
uint64_t arg1, uint64_t arg2);
dmu_tx_t *
dmu_tx_create_dd(dsl_dir_t *dd)
{
dmu_tx_t *tx = kmem_zalloc(sizeof (dmu_tx_t), KM_SLEEP);
tx->tx_dir = dd;
if (dd)
tx->tx_pool = dd->dd_pool;
list_create(&tx->tx_holds, sizeof (dmu_tx_hold_t),
offsetof(dmu_tx_hold_t, txh_node));
#ifdef ZFS_DEBUG
refcount_create(&tx->tx_space_written);
refcount_create(&tx->tx_space_freed);
#endif
return (tx);
}
dmu_tx_t *
dmu_tx_create(objset_t *os)
{
dmu_tx_t *tx = dmu_tx_create_dd(os->os->os_dsl_dataset->ds_dir);
tx->tx_objset = os;
tx->tx_lastsnap_txg = dsl_dataset_prev_snap_txg(os->os->os_dsl_dataset);
return (tx);
}
dmu_tx_t *
dmu_tx_create_assigned(struct dsl_pool *dp, uint64_t txg)
{
dmu_tx_t *tx = dmu_tx_create_dd(NULL);
ASSERT3U(txg, <=, dp->dp_tx.tx_open_txg);
tx->tx_pool = dp;
tx->tx_txg = txg;
tx->tx_anyobj = TRUE;
return (tx);
}
int
dmu_tx_is_syncing(dmu_tx_t *tx)
{
return (tx->tx_anyobj);
}
int
dmu_tx_private_ok(dmu_tx_t *tx)
{
return (tx->tx_anyobj);
}
static dmu_tx_hold_t *
dmu_tx_hold_object_impl(dmu_tx_t *tx, objset_t *os, uint64_t object,
enum dmu_tx_hold_type type, uint64_t arg1, uint64_t arg2)
{
dmu_tx_hold_t *txh;
dnode_t *dn = NULL;
int err;
if (object != DMU_NEW_OBJECT) {
err = dnode_hold(os->os, object, tx, &dn);
if (err) {
tx->tx_err = err;
return (NULL);
}
if (err == 0 && tx->tx_txg != 0) {
mutex_enter(&dn->dn_mtx);
/*
* dn->dn_assigned_txg == tx->tx_txg doesn't pose a
* problem, but there's no way for it to happen (for
* now, at least).
*/
ASSERT(dn->dn_assigned_txg == 0);
dn->dn_assigned_txg = tx->tx_txg;
(void) refcount_add(&dn->dn_tx_holds, tx);
mutex_exit(&dn->dn_mtx);
}
}
txh = kmem_zalloc(sizeof (dmu_tx_hold_t), KM_SLEEP);
txh->txh_tx = tx;
txh->txh_dnode = dn;
#ifdef ZFS_DEBUG
txh->txh_type = type;
txh->txh_arg1 = arg1;
txh->txh_arg2 = arg2;
#endif
list_insert_tail(&tx->tx_holds, txh);
return (txh);
}
void
dmu_tx_add_new_object(dmu_tx_t *tx, objset_t *os, uint64_t object)
{
/*
* If we're syncing, they can manipulate any object anyhow, and
* the hold on the dnode_t can cause problems.
*/
if (!dmu_tx_is_syncing(tx)) {
(void) dmu_tx_hold_object_impl(tx, os,
object, THT_NEWOBJECT, 0, 0);
}
}
static int
dmu_tx_check_ioerr(zio_t *zio, dnode_t *dn, int level, uint64_t blkid)
{
int err;
dmu_buf_impl_t *db;
rw_enter(&dn->dn_struct_rwlock, RW_READER);
db = dbuf_hold_level(dn, level, blkid, FTAG);
rw_exit(&dn->dn_struct_rwlock);
if (db == NULL)
return (EIO);
err = dbuf_read(db, zio, DB_RF_CANFAIL | DB_RF_NOPREFETCH);
dbuf_rele(db, FTAG);
return (err);
}
/* ARGSUSED */
static void
dmu_tx_count_write(dmu_tx_hold_t *txh, uint64_t off, uint64_t len)
{
dnode_t *dn = txh->txh_dnode;
uint64_t start, end, i;
int min_bs, max_bs, min_ibs, max_ibs, epbs, bits;
int err = 0;
if (len == 0)
return;
min_bs = SPA_MINBLOCKSHIFT;
max_bs = SPA_MAXBLOCKSHIFT;
min_ibs = DN_MIN_INDBLKSHIFT;
max_ibs = DN_MAX_INDBLKSHIFT;
/*
* For i/o error checking, read the first and last level-0
* blocks (if they are not aligned), and all the level-1 blocks.
*/
if (dn) {
if (dn->dn_maxblkid == 0) {
err = dmu_tx_check_ioerr(NULL, dn, 0, 0);
if (err)
goto out;
} else {
zio_t *zio = zio_root(dn->dn_objset->os_spa,
NULL, NULL, ZIO_FLAG_CANFAIL);
/* first level-0 block */
start = off >> dn->dn_datablkshift;
if (P2PHASE(off, dn->dn_datablksz) ||
len < dn->dn_datablksz) {
err = dmu_tx_check_ioerr(zio, dn, 0, start);
if (err)
goto out;
}
/* last level-0 block */
end = (off+len-1) >> dn->dn_datablkshift;
if (end != start &&
P2PHASE(off+len, dn->dn_datablksz)) {
err = dmu_tx_check_ioerr(zio, dn, 0, end);
if (err)
goto out;
}
/* level-1 blocks */
if (dn->dn_nlevels > 1) {
start >>= dn->dn_indblkshift - SPA_BLKPTRSHIFT;
end >>= dn->dn_indblkshift - SPA_BLKPTRSHIFT;
for (i = start+1; i < end; i++) {
err = dmu_tx_check_ioerr(zio, dn, 1, i);
if (err)
goto out;
}
}
err = zio_wait(zio);
if (err)
goto out;
}
}
/*
* If there's more than one block, the blocksize can't change,
* so we can make a more precise estimate. Alternatively,
* if the dnode's ibs is larger than max_ibs, always use that.
* This ensures that if we reduce DN_MAX_INDBLKSHIFT,
* the code will still work correctly on existing pools.
*/
if (dn && (dn->dn_maxblkid != 0 || dn->dn_indblkshift > max_ibs)) {
min_ibs = max_ibs = dn->dn_indblkshift;
if (dn->dn_datablkshift != 0)
min_bs = max_bs = dn->dn_datablkshift;
}
/*
* 'end' is the last thing we will access, not one past.
* This way we won't overflow when accessing the last byte.
*/
start = P2ALIGN(off, 1ULL << max_bs);
end = P2ROUNDUP(off + len, 1ULL << max_bs) - 1;
txh->txh_space_towrite += end - start + 1;
start >>= min_bs;
end >>= min_bs;
epbs = min_ibs - SPA_BLKPTRSHIFT;
/*
* The object contains at most 2^(64 - min_bs) blocks,
* and each indirect level maps 2^epbs.
*/
for (bits = 64 - min_bs; bits >= 0; bits -= epbs) {
start >>= epbs;
end >>= epbs;
/*
* If we increase the number of levels of indirection,
* we'll need new blkid=0 indirect blocks. If start == 0,
* we're already accounting for that blocks; and if end == 0,
* we can't increase the number of levels beyond that.
*/
if (start != 0 && end != 0)
txh->txh_space_towrite += 1ULL << max_ibs;
txh->txh_space_towrite += (end - start + 1) << max_ibs;
}
ASSERT(txh->txh_space_towrite < 2 * DMU_MAX_ACCESS);
out:
if (err)
txh->txh_tx->tx_err = err;
}
static void
dmu_tx_count_dnode(dmu_tx_hold_t *txh)
{
dnode_t *dn = txh->txh_dnode;
dnode_t *mdn = txh->txh_tx->tx_objset->os->os_meta_dnode;
uint64_t space = mdn->dn_datablksz +
((mdn->dn_nlevels-1) << mdn->dn_indblkshift);
if (dn && dn->dn_dbuf->db_blkptr &&
dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
dn->dn_dbuf->db_blkptr->blk_birth)) {
txh->txh_space_tooverwrite += space;
} else {
txh->txh_space_towrite += space;
if (dn && dn->dn_dbuf->db_blkptr)
txh->txh_space_tounref += space;
}
}
void
dmu_tx_hold_write(dmu_tx_t *tx, uint64_t object, uint64_t off, int len)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
ASSERT(len < DMU_MAX_ACCESS);
ASSERT(len == 0 || UINT64_MAX - off >= len - 1);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_WRITE, off, len);
if (txh == NULL)
return;
dmu_tx_count_write(txh, off, len);
dmu_tx_count_dnode(txh);
}
static void
dmu_tx_count_free(dmu_tx_hold_t *txh, uint64_t off, uint64_t len)
{
uint64_t blkid, nblks, lastblk;
uint64_t space = 0, unref = 0, skipped = 0;
dnode_t *dn = txh->txh_dnode;
dsl_dataset_t *ds = dn->dn_objset->os_dsl_dataset;
spa_t *spa = txh->txh_tx->tx_pool->dp_spa;
int epbs;
if (dn->dn_nlevels == 0)
return;
/*
* The struct_rwlock protects us against dn_nlevels
* changing, in case (against all odds) we manage to dirty &
* sync out the changes after we check for being dirty.
* Also, dbuf_hold_level() wants us to have the struct_rwlock.
*/
rw_enter(&dn->dn_struct_rwlock, RW_READER);
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
if (dn->dn_maxblkid == 0) {
if (off == 0 && len >= dn->dn_datablksz) {
blkid = 0;
nblks = 1;
} else {
rw_exit(&dn->dn_struct_rwlock);
return;
}
} else {
blkid = off >> dn->dn_datablkshift;
nblks = (len + dn->dn_datablksz - 1) >> dn->dn_datablkshift;
if (blkid >= dn->dn_maxblkid) {
rw_exit(&dn->dn_struct_rwlock);
return;
}
if (blkid + nblks > dn->dn_maxblkid)
nblks = dn->dn_maxblkid - blkid;
}
if (dn->dn_nlevels == 1) {
int i;
for (i = 0; i < nblks; i++) {
blkptr_t *bp = dn->dn_phys->dn_blkptr;
ASSERT3U(blkid + i, <, dn->dn_nblkptr);
bp += blkid + i;
if (dsl_dataset_block_freeable(ds, bp->blk_birth)) {
dprintf_bp(bp, "can free old%s", "");
space += bp_get_dasize(spa, bp);
}
unref += BP_GET_ASIZE(bp);
}
nblks = 0;
}
/*
* Add in memory requirements of higher-level indirects.
* This assumes a worst-possible scenario for dn_nlevels.
*/
{
uint64_t blkcnt = 1 + ((nblks >> epbs) >> epbs);
int level = (dn->dn_nlevels > 1) ? 2 : 1;
while (level++ < DN_MAX_LEVELS) {
txh->txh_memory_tohold += blkcnt << dn->dn_indblkshift;
blkcnt = 1 + (blkcnt >> epbs);
}
ASSERT(blkcnt <= dn->dn_nblkptr);
}
lastblk = blkid + nblks - 1;
while (nblks) {
dmu_buf_impl_t *dbuf;
uint64_t ibyte, new_blkid;
int epb = 1 << epbs;
int err, i, blkoff, tochk;
blkptr_t *bp;
ibyte = blkid << dn->dn_datablkshift;
err = dnode_next_offset(dn,
DNODE_FIND_HAVELOCK, &ibyte, 2, 1, 0);
new_blkid = ibyte >> dn->dn_datablkshift;
if (err == ESRCH) {
skipped += (lastblk >> epbs) - (blkid >> epbs) + 1;
break;
}
if (err) {
txh->txh_tx->tx_err = err;
break;
}
if (new_blkid > lastblk) {
skipped += (lastblk >> epbs) - (blkid >> epbs) + 1;
break;
}
if (new_blkid > blkid) {
ASSERT((new_blkid >> epbs) > (blkid >> epbs));
skipped += (new_blkid >> epbs) - (blkid >> epbs) - 1;
nblks -= new_blkid - blkid;
blkid = new_blkid;
}
blkoff = P2PHASE(blkid, epb);
tochk = MIN(epb - blkoff, nblks);
dbuf = dbuf_hold_level(dn, 1, blkid >> epbs, FTAG);
txh->txh_memory_tohold += dbuf->db.db_size;
if (txh->txh_memory_tohold > DMU_MAX_ACCESS) {
txh->txh_tx->tx_err = E2BIG;
dbuf_rele(dbuf, FTAG);
break;
}
err = dbuf_read(dbuf, NULL, DB_RF_HAVESTRUCT | DB_RF_CANFAIL);
if (err != 0) {
txh->txh_tx->tx_err = err;
dbuf_rele(dbuf, FTAG);
break;
}
bp = dbuf->db.db_data;
bp += blkoff;
for (i = 0; i < tochk; i++) {
if (dsl_dataset_block_freeable(ds, bp[i].blk_birth)) {
dprintf_bp(&bp[i], "can free old%s", "");
space += bp_get_dasize(spa, &bp[i]);
}
unref += BP_GET_ASIZE(bp);
}
dbuf_rele(dbuf, FTAG);
blkid += tochk;
nblks -= tochk;
}
rw_exit(&dn->dn_struct_rwlock);
/* account for new level 1 indirect blocks that might show up */
if (skipped > 0) {
txh->txh_fudge += skipped << dn->dn_indblkshift;
skipped = MIN(skipped, DMU_MAX_DELETEBLKCNT >> epbs);
txh->txh_memory_tohold += skipped << dn->dn_indblkshift;
}
txh->txh_space_tofree += space;
txh->txh_space_tounref += unref;
}
void
dmu_tx_hold_free(dmu_tx_t *tx, uint64_t object, uint64_t off, uint64_t len)
{
dmu_tx_hold_t *txh;
dnode_t *dn;
uint64_t start, end, i;
int err, shift;
zio_t *zio;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_FREE, off, len);
if (txh == NULL)
return;
dn = txh->txh_dnode;
/* first block */
if (off != 0)
dmu_tx_count_write(txh, off, 1);
/* last block */
if (len != DMU_OBJECT_END)
dmu_tx_count_write(txh, off+len, 1);
if (off >= (dn->dn_maxblkid+1) * dn->dn_datablksz)
return;
if (len == DMU_OBJECT_END)
len = (dn->dn_maxblkid+1) * dn->dn_datablksz - off;
/*
* For i/o error checking, read the first and last level-0
* blocks, and all the level-1 blocks. The above count_write's
* have already taken care of the level-0 blocks.
*/
if (dn->dn_nlevels > 1) {
shift = dn->dn_datablkshift + dn->dn_indblkshift -
SPA_BLKPTRSHIFT;
start = off >> shift;
end = dn->dn_datablkshift ? ((off+len) >> shift) : 0;
zio = zio_root(tx->tx_pool->dp_spa,
NULL, NULL, ZIO_FLAG_CANFAIL);
for (i = start; i <= end; i++) {
uint64_t ibyte = i << shift;
err = dnode_next_offset(dn, 0, &ibyte, 2, 1, 0);
i = ibyte >> shift;
if (err == ESRCH)
break;
if (err) {
tx->tx_err = err;
return;
}
err = dmu_tx_check_ioerr(zio, dn, 1, i);
if (err) {
tx->tx_err = err;
return;
}
}
err = zio_wait(zio);
if (err) {
tx->tx_err = err;
return;
}
}
dmu_tx_count_dnode(txh);
dmu_tx_count_free(txh, off, len);
}
void
dmu_tx_hold_zap(dmu_tx_t *tx, uint64_t object, int add, char *name)
{
dmu_tx_hold_t *txh;
dnode_t *dn;
uint64_t nblocks;
int epbs, err;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_ZAP, add, (uintptr_t)name);
if (txh == NULL)
return;
dn = txh->txh_dnode;
dmu_tx_count_dnode(txh);
if (dn == NULL) {
/*
* We will be able to fit a new object's entries into one leaf
* block. So there will be at most 2 blocks total,
* including the header block.
*/
dmu_tx_count_write(txh, 0, 2 << fzap_default_block_shift);
return;
}
ASSERT3P(dmu_ot[dn->dn_type].ot_byteswap, ==, zap_byteswap);
if (dn->dn_maxblkid == 0 && !add) {
/*
* If there is only one block (i.e. this is a micro-zap)
* and we are not adding anything, the accounting is simple.
*/
err = dmu_tx_check_ioerr(NULL, dn, 0, 0);
if (err) {
tx->tx_err = err;
return;
}
/*
* Use max block size here, since we don't know how much
* the size will change between now and the dbuf dirty call.
*/
if (dsl_dataset_block_freeable(dn->dn_objset->os_dsl_dataset,
dn->dn_phys->dn_blkptr[0].blk_birth)) {
txh->txh_space_tooverwrite += SPA_MAXBLOCKSIZE;
} else {
txh->txh_space_towrite += SPA_MAXBLOCKSIZE;
- txh->txh_space_tounref +=
- BP_GET_ASIZE(dn->dn_phys->dn_blkptr);
}
+ if (dn->dn_phys->dn_blkptr[0].blk_birth)
+ txh->txh_space_tounref += SPA_MAXBLOCKSIZE;
return;
}
if (dn->dn_maxblkid > 0 && name) {
/*
* access the name in this fat-zap so that we'll check
* for i/o errors to the leaf blocks, etc.
*/
err = zap_lookup(&dn->dn_objset->os, dn->dn_object, name,
8, 0, NULL);
if (err == EIO) {
tx->tx_err = err;
return;
}
}
/*
* 3 blocks overwritten: target leaf, ptrtbl block, header block
* 3 new blocks written if adding: new split leaf, 2 grown ptrtbl blocks
*/
dmu_tx_count_write(txh, dn->dn_maxblkid * dn->dn_datablksz,
(3 + (add ? 3 : 0)) << dn->dn_datablkshift);
/*
* If the modified blocks are scattered to the four winds,
* we'll have to modify an indirect twig for each.
*/
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
for (nblocks = dn->dn_maxblkid >> epbs; nblocks != 0; nblocks >>= epbs)
txh->txh_space_towrite += 3 << dn->dn_indblkshift;
}
void
dmu_tx_hold_bonus(dmu_tx_t *tx, uint64_t object)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
object, THT_BONUS, 0, 0);
if (txh)
dmu_tx_count_dnode(txh);
}
void
dmu_tx_hold_space(dmu_tx_t *tx, uint64_t space)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
txh = dmu_tx_hold_object_impl(tx, tx->tx_objset,
DMU_NEW_OBJECT, THT_SPACE, space, 0);
txh->txh_space_towrite += space;
}
int
dmu_tx_holds(dmu_tx_t *tx, uint64_t object)
{
dmu_tx_hold_t *txh;
int holds = 0;
/*
* By asserting that the tx is assigned, we're counting the
* number of dn_tx_holds, which is the same as the number of
* dn_holds. Otherwise, we'd be counting dn_holds, but
* dn_tx_holds could be 0.
*/
ASSERT(tx->tx_txg != 0);
/* if (tx->tx_anyobj == TRUE) */
/* return (0); */
for (txh = list_head(&tx->tx_holds); txh;
txh = list_next(&tx->tx_holds, txh)) {
if (txh->txh_dnode && txh->txh_dnode->dn_object == object)
holds++;
}
return (holds);
}
#ifdef ZFS_DEBUG
void
dmu_tx_dirty_buf(dmu_tx_t *tx, dmu_buf_impl_t *db)
{
dmu_tx_hold_t *txh;
int match_object = FALSE, match_offset = FALSE;
dnode_t *dn = db->db_dnode;
ASSERT(tx->tx_txg != 0);
ASSERT(tx->tx_objset == NULL || dn->dn_objset == tx->tx_objset->os);
ASSERT3U(dn->dn_object, ==, db->db.db_object);
if (tx->tx_anyobj)
return;
/* XXX No checking on the meta dnode for now */
if (db->db.db_object == DMU_META_DNODE_OBJECT)
return;
for (txh = list_head(&tx->tx_holds); txh;
txh = list_next(&tx->tx_holds, txh)) {
ASSERT(dn == NULL || dn->dn_assigned_txg == tx->tx_txg);
if (txh->txh_dnode == dn && txh->txh_type != THT_NEWOBJECT)
match_object = TRUE;
if (txh->txh_dnode == NULL || txh->txh_dnode == dn) {
int datablkshift = dn->dn_datablkshift ?
dn->dn_datablkshift : SPA_MAXBLOCKSHIFT;
int epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
int shift = datablkshift + epbs * db->db_level;
uint64_t beginblk = shift >= 64 ? 0 :
(txh->txh_arg1 >> shift);
uint64_t endblk = shift >= 64 ? 0 :
((txh->txh_arg1 + txh->txh_arg2 - 1) >> shift);
uint64_t blkid = db->db_blkid;
/* XXX txh_arg2 better not be zero... */
dprintf("found txh type %x beginblk=%llx endblk=%llx\n",
txh->txh_type, beginblk, endblk);
switch (txh->txh_type) {
case THT_WRITE:
if (blkid >= beginblk && blkid <= endblk)
match_offset = TRUE;
/*
* We will let this hold work for the bonus
* buffer so that we don't need to hold it
* when creating a new object.
*/
if (blkid == DB_BONUS_BLKID)
match_offset = TRUE;
/*
* They might have to increase nlevels,
* thus dirtying the new TLIBs. Or the
* might have to change the block size,
* thus dirying the new lvl=0 blk=0.
*/
if (blkid == 0)
match_offset = TRUE;
break;
case THT_FREE:
/*
* We will dirty all the level 1 blocks in
* the free range and perhaps the first and
* last level 0 block.
*/
if (blkid >= beginblk && (blkid <= endblk ||
txh->txh_arg2 == DMU_OBJECT_END))
match_offset = TRUE;
break;
case THT_BONUS:
if (blkid == DB_BONUS_BLKID)
match_offset = TRUE;
break;
case THT_ZAP:
match_offset = TRUE;
break;
case THT_NEWOBJECT:
match_object = TRUE;
break;
default:
ASSERT(!"bad txh_type");
}
}
if (match_object && match_offset)
return;
}
panic("dirtying dbuf obj=%llx lvl=%u blkid=%llx but not tx_held\n",
(u_longlong_t)db->db.db_object, db->db_level,
(u_longlong_t)db->db_blkid);
}
#endif
static int
dmu_tx_try_assign(dmu_tx_t *tx, uint64_t txg_how)
{
dmu_tx_hold_t *txh;
spa_t *spa = tx->tx_pool->dp_spa;
uint64_t memory, asize, fsize, usize;
uint64_t towrite, tofree, tooverwrite, tounref, tohold, fudge;
ASSERT3U(tx->tx_txg, ==, 0);
if (tx->tx_err)
return (tx->tx_err);
if (spa_suspended(spa)) {
/*
* If the user has indicated a blocking failure mode
* then return ERESTART which will block in dmu_tx_wait().
* Otherwise, return EIO so that an error can get
* propagated back to the VOP calls.
*
* Note that we always honor the txg_how flag regardless
* of the failuremode setting.
*/
if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_CONTINUE &&
txg_how != TXG_WAIT)
return (EIO);
return (ERESTART);
}
tx->tx_txg = txg_hold_open(tx->tx_pool, &tx->tx_txgh);
tx->tx_needassign_txh = NULL;
/*
* NB: No error returns are allowed after txg_hold_open, but
* before processing the dnode holds, due to the
* dmu_tx_unassign() logic.
*/
towrite = tofree = tooverwrite = tounref = tohold = fudge = 0;
for (txh = list_head(&tx->tx_holds); txh;
txh = list_next(&tx->tx_holds, txh)) {
dnode_t *dn = txh->txh_dnode;
if (dn != NULL) {
mutex_enter(&dn->dn_mtx);
if (dn->dn_assigned_txg == tx->tx_txg - 1) {
mutex_exit(&dn->dn_mtx);
tx->tx_needassign_txh = txh;
return (ERESTART);
}
if (dn->dn_assigned_txg == 0)
dn->dn_assigned_txg = tx->tx_txg;
ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);
(void) refcount_add(&dn->dn_tx_holds, tx);
mutex_exit(&dn->dn_mtx);
}
towrite += txh->txh_space_towrite;
tofree += txh->txh_space_tofree;
tooverwrite += txh->txh_space_tooverwrite;
tounref += txh->txh_space_tounref;
tohold += txh->txh_memory_tohold;
fudge += txh->txh_fudge;
}
/*
* NB: This check must be after we've held the dnodes, so that
* the dmu_tx_unassign() logic will work properly
*/
if (txg_how >= TXG_INITIAL && txg_how != tx->tx_txg)
return (ERESTART);
/*
* If a snapshot has been taken since we made our estimates,
* assume that we won't be able to free or overwrite anything.
*/
if (tx->tx_objset &&
dsl_dataset_prev_snap_txg(tx->tx_objset->os->os_dsl_dataset) >
tx->tx_lastsnap_txg) {
towrite += tooverwrite;
tooverwrite = tofree = 0;
}
/* needed allocation: worst-case estimate of write space */
asize = spa_get_asize(tx->tx_pool->dp_spa, towrite + tooverwrite);
/* freed space estimate: worst-case overwrite + free estimate */
fsize = spa_get_asize(tx->tx_pool->dp_spa, tooverwrite) + tofree;
/* convert unrefd space to worst-case estimate */
usize = spa_get_asize(tx->tx_pool->dp_spa, tounref);
/* calculate memory footprint estimate */
memory = towrite + tooverwrite + tohold;
#ifdef ZFS_DEBUG
/*
* Add in 'tohold' to account for our dirty holds on this memory
* XXX - the "fudge" factor is to account for skipped blocks that
* we missed because dnode_next_offset() misses in-core-only blocks.
*/
tx->tx_space_towrite = asize +
spa_get_asize(tx->tx_pool->dp_spa, tohold + fudge);
tx->tx_space_tofree = tofree;
tx->tx_space_tooverwrite = tooverwrite;
tx->tx_space_tounref = tounref;
#endif
if (tx->tx_dir && asize != 0) {
int err = dsl_dir_tempreserve_space(tx->tx_dir, memory,
asize, fsize, usize, &tx->tx_tempreserve_cookie, tx);
if (err)
return (err);
}
return (0);
}
static void
dmu_tx_unassign(dmu_tx_t *tx)
{
dmu_tx_hold_t *txh;
if (tx->tx_txg == 0)
return;
txg_rele_to_quiesce(&tx->tx_txgh);
for (txh = list_head(&tx->tx_holds); txh != tx->tx_needassign_txh;
txh = list_next(&tx->tx_holds, txh)) {
dnode_t *dn = txh->txh_dnode;
if (dn == NULL)
continue;
mutex_enter(&dn->dn_mtx);
ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);
if (refcount_remove(&dn->dn_tx_holds, tx) == 0) {
dn->dn_assigned_txg = 0;
cv_broadcast(&dn->dn_notxholds);
}
mutex_exit(&dn->dn_mtx);
}
txg_rele_to_sync(&tx->tx_txgh);
tx->tx_lasttried_txg = tx->tx_txg;
tx->tx_txg = 0;
}
/*
* Assign tx to a transaction group. txg_how can be one of:
*
* (1) TXG_WAIT. If the current open txg is full, waits until there's
* a new one. This should be used when you're not holding locks.
* If will only fail if we're truly out of space (or over quota).
*
* (2) TXG_NOWAIT. If we can't assign into the current open txg without
* blocking, returns immediately with ERESTART. This should be used
* whenever you're holding locks. On an ERESTART error, the caller
* should drop locks, do a dmu_tx_wait(tx), and try again.
*
* (3) A specific txg. Use this if you need to ensure that multiple
* transactions all sync in the same txg. Like TXG_NOWAIT, it
* returns ERESTART if it can't assign you into the requested txg.
*/
int
dmu_tx_assign(dmu_tx_t *tx, uint64_t txg_how)
{
int err;
ASSERT(tx->tx_txg == 0);
ASSERT(txg_how != 0);
ASSERT(!dsl_pool_sync_context(tx->tx_pool));
while ((err = dmu_tx_try_assign(tx, txg_how)) != 0) {
dmu_tx_unassign(tx);
if (err != ERESTART || txg_how != TXG_WAIT)
return (err);
dmu_tx_wait(tx);
}
txg_rele_to_quiesce(&tx->tx_txgh);
return (0);
}
void
dmu_tx_wait(dmu_tx_t *tx)
{
spa_t *spa = tx->tx_pool->dp_spa;
ASSERT(tx->tx_txg == 0);
/*
* It's possible that the pool has become active after this thread
* has tried to obtain a tx. If that's the case then his
* tx_lasttried_txg would not have been assigned.
*/
if (spa_suspended(spa) || tx->tx_lasttried_txg == 0) {
txg_wait_synced(tx->tx_pool, spa_last_synced_txg(spa) + 1);
} else if (tx->tx_needassign_txh) {
dnode_t *dn = tx->tx_needassign_txh->txh_dnode;
mutex_enter(&dn->dn_mtx);
while (dn->dn_assigned_txg == tx->tx_lasttried_txg - 1)
cv_wait(&dn->dn_notxholds, &dn->dn_mtx);
mutex_exit(&dn->dn_mtx);
tx->tx_needassign_txh = NULL;
} else {
txg_wait_open(tx->tx_pool, tx->tx_lasttried_txg + 1);
}
}
void
dmu_tx_willuse_space(dmu_tx_t *tx, int64_t delta)
{
#ifdef ZFS_DEBUG
if (tx->tx_dir == NULL || delta == 0)
return;
if (delta > 0) {
ASSERT3U(refcount_count(&tx->tx_space_written) + delta, <=,
tx->tx_space_towrite);
(void) refcount_add_many(&tx->tx_space_written, delta, NULL);
} else {
(void) refcount_add_many(&tx->tx_space_freed, -delta, NULL);
}
#endif
}
void
dmu_tx_commit(dmu_tx_t *tx)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg != 0);
while (txh = list_head(&tx->tx_holds)) {
dnode_t *dn = txh->txh_dnode;
list_remove(&tx->tx_holds, txh);
kmem_free(txh, sizeof (dmu_tx_hold_t));
if (dn == NULL)
continue;
mutex_enter(&dn->dn_mtx);
ASSERT3U(dn->dn_assigned_txg, ==, tx->tx_txg);
if (refcount_remove(&dn->dn_tx_holds, tx) == 0) {
dn->dn_assigned_txg = 0;
cv_broadcast(&dn->dn_notxholds);
}
mutex_exit(&dn->dn_mtx);
dnode_rele(dn, tx);
}
if (tx->tx_tempreserve_cookie)
dsl_dir_tempreserve_clear(tx->tx_tempreserve_cookie, tx);
if (tx->tx_anyobj == FALSE)
txg_rele_to_sync(&tx->tx_txgh);
list_destroy(&tx->tx_holds);
#ifdef ZFS_DEBUG
dprintf("towrite=%llu written=%llu tofree=%llu freed=%llu\n",
tx->tx_space_towrite, refcount_count(&tx->tx_space_written),
tx->tx_space_tofree, refcount_count(&tx->tx_space_freed));
refcount_destroy_many(&tx->tx_space_written,
refcount_count(&tx->tx_space_written));
refcount_destroy_many(&tx->tx_space_freed,
refcount_count(&tx->tx_space_freed));
#endif
kmem_free(tx, sizeof (dmu_tx_t));
}
void
dmu_tx_abort(dmu_tx_t *tx)
{
dmu_tx_hold_t *txh;
ASSERT(tx->tx_txg == 0);
while (txh = list_head(&tx->tx_holds)) {
dnode_t *dn = txh->txh_dnode;
list_remove(&tx->tx_holds, txh);
kmem_free(txh, sizeof (dmu_tx_hold_t));
if (dn != NULL)
dnode_rele(dn, tx);
}
list_destroy(&tx->tx_holds);
#ifdef ZFS_DEBUG
refcount_destroy_many(&tx->tx_space_written,
refcount_count(&tx->tx_space_written));
refcount_destroy_many(&tx->tx_space_freed,
refcount_count(&tx->tx_space_freed));
#endif
kmem_free(tx, sizeof (dmu_tx_t));
}
uint64_t
dmu_tx_get_txg(dmu_tx_t *tx)
{
ASSERT(tx->tx_txg != 0);
return (tx->tx_txg);
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c (revision 209274)
@@ -1,1446 +1,1437 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/dbuf.h>
#include <sys/dnode.h>
#include <sys/dmu.h>
#include <sys/dmu_impl.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_dataset.h>
#include <sys/spa.h>
#include <sys/zio.h>
#include <sys/dmu_zfetch.h>
static int free_range_compar(const void *node1, const void *node2);
static kmem_cache_t *dnode_cache;
static dnode_phys_t dnode_phys_zero;
int zfs_default_bs = SPA_MINBLOCKSHIFT;
int zfs_default_ibs = DN_MAX_INDBLKSHIFT;
/* ARGSUSED */
static int
dnode_cons(void *arg, void *unused, int kmflag)
{
int i;
dnode_t *dn = arg;
bzero(dn, sizeof (dnode_t));
rw_init(&dn->dn_struct_rwlock, NULL, RW_DEFAULT, NULL);
mutex_init(&dn->dn_mtx, NULL, MUTEX_DEFAULT, NULL);
mutex_init(&dn->dn_dbufs_mtx, NULL, MUTEX_DEFAULT, NULL);
cv_init(&dn->dn_notxholds, NULL, CV_DEFAULT, NULL);
refcount_create(&dn->dn_holds);
refcount_create(&dn->dn_tx_holds);
for (i = 0; i < TXG_SIZE; i++) {
avl_create(&dn->dn_ranges[i], free_range_compar,
sizeof (free_range_t),
offsetof(struct free_range, fr_node));
list_create(&dn->dn_dirty_records[i],
sizeof (dbuf_dirty_record_t),
offsetof(dbuf_dirty_record_t, dr_dirty_node));
}
list_create(&dn->dn_dbufs, sizeof (dmu_buf_impl_t),
offsetof(dmu_buf_impl_t, db_link));
return (0);
}
/* ARGSUSED */
static void
dnode_dest(void *arg, void *unused)
{
int i;
dnode_t *dn = arg;
rw_destroy(&dn->dn_struct_rwlock);
mutex_destroy(&dn->dn_mtx);
mutex_destroy(&dn->dn_dbufs_mtx);
cv_destroy(&dn->dn_notxholds);
refcount_destroy(&dn->dn_holds);
refcount_destroy(&dn->dn_tx_holds);
for (i = 0; i < TXG_SIZE; i++) {
avl_destroy(&dn->dn_ranges[i]);
list_destroy(&dn->dn_dirty_records[i]);
}
list_destroy(&dn->dn_dbufs);
}
void
dnode_init(void)
{
dnode_cache = kmem_cache_create("dnode_t",
sizeof (dnode_t),
0, dnode_cons, dnode_dest, NULL, NULL, NULL, 0);
}
void
dnode_fini(void)
{
kmem_cache_destroy(dnode_cache);
}
#ifdef ZFS_DEBUG
void
dnode_verify(dnode_t *dn)
{
int drop_struct_lock = FALSE;
ASSERT(dn->dn_phys);
ASSERT(dn->dn_objset);
ASSERT(dn->dn_phys->dn_type < DMU_OT_NUMTYPES);
if (!(zfs_flags & ZFS_DEBUG_DNODE_VERIFY))
return;
if (!RW_WRITE_HELD(&dn->dn_struct_rwlock)) {
rw_enter(&dn->dn_struct_rwlock, RW_READER);
drop_struct_lock = TRUE;
}
if (dn->dn_phys->dn_type != DMU_OT_NONE || dn->dn_allocated_txg != 0) {
int i;
ASSERT3U(dn->dn_indblkshift, >=, 0);
ASSERT3U(dn->dn_indblkshift, <=, SPA_MAXBLOCKSHIFT);
if (dn->dn_datablkshift) {
ASSERT3U(dn->dn_datablkshift, >=, SPA_MINBLOCKSHIFT);
ASSERT3U(dn->dn_datablkshift, <=, SPA_MAXBLOCKSHIFT);
ASSERT3U(1<<dn->dn_datablkshift, ==, dn->dn_datablksz);
}
ASSERT3U(dn->dn_nlevels, <=, 30);
ASSERT3U(dn->dn_type, <=, DMU_OT_NUMTYPES);
ASSERT3U(dn->dn_nblkptr, >=, 1);
ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
ASSERT3U(dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
ASSERT3U(dn->dn_datablksz, ==,
dn->dn_datablkszsec << SPA_MINBLOCKSHIFT);
ASSERT3U(ISP2(dn->dn_datablksz), ==, dn->dn_datablkshift != 0);
ASSERT3U((dn->dn_nblkptr - 1) * sizeof (blkptr_t) +
dn->dn_bonuslen, <=, DN_MAX_BONUSLEN);
for (i = 0; i < TXG_SIZE; i++) {
ASSERT3U(dn->dn_next_nlevels[i], <=, dn->dn_nlevels);
}
}
if (dn->dn_phys->dn_type != DMU_OT_NONE)
ASSERT3U(dn->dn_phys->dn_nlevels, <=, dn->dn_nlevels);
ASSERT(dn->dn_object == DMU_META_DNODE_OBJECT || dn->dn_dbuf != NULL);
if (dn->dn_dbuf != NULL) {
ASSERT3P(dn->dn_phys, ==,
(dnode_phys_t *)dn->dn_dbuf->db.db_data +
(dn->dn_object % (dn->dn_dbuf->db.db_size >> DNODE_SHIFT)));
}
if (drop_struct_lock)
rw_exit(&dn->dn_struct_rwlock);
}
#endif
void
dnode_byteswap(dnode_phys_t *dnp)
{
uint64_t *buf64 = (void*)&dnp->dn_blkptr;
int i;
if (dnp->dn_type == DMU_OT_NONE) {
bzero(dnp, sizeof (dnode_phys_t));
return;
}
dnp->dn_datablkszsec = BSWAP_16(dnp->dn_datablkszsec);
dnp->dn_bonuslen = BSWAP_16(dnp->dn_bonuslen);
dnp->dn_maxblkid = BSWAP_64(dnp->dn_maxblkid);
dnp->dn_used = BSWAP_64(dnp->dn_used);
/*
* dn_nblkptr is only one byte, so it's OK to read it in either
* byte order. We can't read dn_bouslen.
*/
ASSERT(dnp->dn_indblkshift <= SPA_MAXBLOCKSHIFT);
ASSERT(dnp->dn_nblkptr <= DN_MAX_NBLKPTR);
for (i = 0; i < dnp->dn_nblkptr * sizeof (blkptr_t)/8; i++)
buf64[i] = BSWAP_64(buf64[i]);
/*
* OK to check dn_bonuslen for zero, because it won't matter if
* we have the wrong byte order. This is necessary because the
* dnode dnode is smaller than a regular dnode.
*/
if (dnp->dn_bonuslen != 0) {
/*
* Note that the bonus length calculated here may be
* longer than the actual bonus buffer. This is because
* we always put the bonus buffer after the last block
* pointer (instead of packing it against the end of the
* dnode buffer).
*/
int off = (dnp->dn_nblkptr-1) * sizeof (blkptr_t);
size_t len = DN_MAX_BONUSLEN - off;
ASSERT3U(dnp->dn_bonustype, <, DMU_OT_NUMTYPES);
dmu_ot[dnp->dn_bonustype].ot_byteswap(dnp->dn_bonus + off, len);
}
}
void
dnode_buf_byteswap(void *vbuf, size_t size)
{
dnode_phys_t *buf = vbuf;
int i;
ASSERT3U(sizeof (dnode_phys_t), ==, (1<<DNODE_SHIFT));
ASSERT((size & (sizeof (dnode_phys_t)-1)) == 0);
size >>= DNODE_SHIFT;
for (i = 0; i < size; i++) {
dnode_byteswap(buf);
buf++;
}
}
static int
free_range_compar(const void *node1, const void *node2)
{
const free_range_t *rp1 = node1;
const free_range_t *rp2 = node2;
if (rp1->fr_blkid < rp2->fr_blkid)
return (-1);
else if (rp1->fr_blkid > rp2->fr_blkid)
return (1);
else return (0);
}
void
dnode_setbonuslen(dnode_t *dn, int newsize, dmu_tx_t *tx)
{
ASSERT3U(refcount_count(&dn->dn_holds), >=, 1);
dnode_setdirty(dn, tx);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
ASSERT3U(newsize, <=, DN_MAX_BONUSLEN -
(dn->dn_nblkptr-1) * sizeof (blkptr_t));
dn->dn_bonuslen = newsize;
if (newsize == 0)
dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = DN_ZERO_BONUSLEN;
else
dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
rw_exit(&dn->dn_struct_rwlock);
}
static void
dnode_setdblksz(dnode_t *dn, int size)
{
ASSERT3U(P2PHASE(size, SPA_MINBLOCKSIZE), ==, 0);
ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
ASSERT3U(size, >=, SPA_MINBLOCKSIZE);
ASSERT3U(size >> SPA_MINBLOCKSHIFT, <,
1<<(sizeof (dn->dn_phys->dn_datablkszsec) * 8));
dn->dn_datablksz = size;
dn->dn_datablkszsec = size >> SPA_MINBLOCKSHIFT;
dn->dn_datablkshift = ISP2(size) ? highbit(size - 1) : 0;
}
static dnode_t *
dnode_create(objset_impl_t *os, dnode_phys_t *dnp, dmu_buf_impl_t *db,
uint64_t object)
{
dnode_t *dn = kmem_cache_alloc(dnode_cache, KM_SLEEP);
dn->dn_objset = os;
dn->dn_object = object;
dn->dn_dbuf = db;
dn->dn_phys = dnp;
if (dnp->dn_datablkszsec)
dnode_setdblksz(dn, dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT);
dn->dn_indblkshift = dnp->dn_indblkshift;
dn->dn_nlevels = dnp->dn_nlevels;
dn->dn_type = dnp->dn_type;
dn->dn_nblkptr = dnp->dn_nblkptr;
dn->dn_checksum = dnp->dn_checksum;
dn->dn_compress = dnp->dn_compress;
dn->dn_bonustype = dnp->dn_bonustype;
dn->dn_bonuslen = dnp->dn_bonuslen;
dn->dn_maxblkid = dnp->dn_maxblkid;
dmu_zfetch_init(&dn->dn_zfetch, dn);
ASSERT(dn->dn_phys->dn_type < DMU_OT_NUMTYPES);
mutex_enter(&os->os_lock);
list_insert_head(&os->os_dnodes, dn);
mutex_exit(&os->os_lock);
arc_space_consume(sizeof (dnode_t), ARC_SPACE_OTHER);
return (dn);
}
static void
dnode_destroy(dnode_t *dn)
{
objset_impl_t *os = dn->dn_objset;
#ifdef ZFS_DEBUG
int i;
for (i = 0; i < TXG_SIZE; i++) {
ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
ASSERT(NULL == list_head(&dn->dn_dirty_records[i]));
ASSERT(0 == avl_numnodes(&dn->dn_ranges[i]));
}
ASSERT(NULL == list_head(&dn->dn_dbufs));
#endif
mutex_enter(&os->os_lock);
list_remove(&os->os_dnodes, dn);
mutex_exit(&os->os_lock);
if (dn->dn_dirtyctx_firstset) {
kmem_free(dn->dn_dirtyctx_firstset, 1);
dn->dn_dirtyctx_firstset = NULL;
}
dmu_zfetch_rele(&dn->dn_zfetch);
if (dn->dn_bonus) {
mutex_enter(&dn->dn_bonus->db_mtx);
dbuf_evict(dn->dn_bonus);
dn->dn_bonus = NULL;
}
kmem_cache_free(dnode_cache, dn);
arc_space_return(sizeof (dnode_t), ARC_SPACE_OTHER);
}
void
dnode_allocate(dnode_t *dn, dmu_object_type_t ot, int blocksize, int ibs,
dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
{
int i;
if (blocksize == 0)
blocksize = 1 << zfs_default_bs;
else if (blocksize > SPA_MAXBLOCKSIZE)
blocksize = SPA_MAXBLOCKSIZE;
else
blocksize = P2ROUNDUP(blocksize, SPA_MINBLOCKSIZE);
if (ibs == 0)
ibs = zfs_default_ibs;
ibs = MIN(MAX(ibs, DN_MIN_INDBLKSHIFT), DN_MAX_INDBLKSHIFT);
dprintf("os=%p obj=%llu txg=%llu blocksize=%d ibs=%d\n", dn->dn_objset,
dn->dn_object, tx->tx_txg, blocksize, ibs);
ASSERT(dn->dn_type == DMU_OT_NONE);
ASSERT(bcmp(dn->dn_phys, &dnode_phys_zero, sizeof (dnode_phys_t)) == 0);
ASSERT(dn->dn_phys->dn_type == DMU_OT_NONE);
ASSERT(ot != DMU_OT_NONE);
ASSERT3U(ot, <, DMU_OT_NUMTYPES);
ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
(bonustype != DMU_OT_NONE && bonuslen != 0));
ASSERT3U(bonustype, <, DMU_OT_NUMTYPES);
ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);
ASSERT(dn->dn_type == DMU_OT_NONE);
ASSERT3U(dn->dn_maxblkid, ==, 0);
ASSERT3U(dn->dn_allocated_txg, ==, 0);
ASSERT3U(dn->dn_assigned_txg, ==, 0);
ASSERT(refcount_is_zero(&dn->dn_tx_holds));
ASSERT3U(refcount_count(&dn->dn_holds), <=, 1);
ASSERT3P(list_head(&dn->dn_dbufs), ==, NULL);
for (i = 0; i < TXG_SIZE; i++) {
ASSERT3U(dn->dn_next_nlevels[i], ==, 0);
ASSERT3U(dn->dn_next_indblkshift[i], ==, 0);
ASSERT3U(dn->dn_next_bonuslen[i], ==, 0);
ASSERT3U(dn->dn_next_blksz[i], ==, 0);
ASSERT(!list_link_active(&dn->dn_dirty_link[i]));
ASSERT3P(list_head(&dn->dn_dirty_records[i]), ==, NULL);
ASSERT3U(avl_numnodes(&dn->dn_ranges[i]), ==, 0);
}
dn->dn_type = ot;
dnode_setdblksz(dn, blocksize);
dn->dn_indblkshift = ibs;
dn->dn_nlevels = 1;
dn->dn_nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
dn->dn_bonustype = bonustype;
dn->dn_bonuslen = bonuslen;
dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
dn->dn_compress = ZIO_COMPRESS_INHERIT;
dn->dn_dirtyctx = 0;
dn->dn_free_txg = 0;
if (dn->dn_dirtyctx_firstset) {
kmem_free(dn->dn_dirtyctx_firstset, 1);
dn->dn_dirtyctx_firstset = NULL;
}
dn->dn_allocated_txg = tx->tx_txg;
dnode_setdirty(dn, tx);
dn->dn_next_indblkshift[tx->tx_txg & TXG_MASK] = ibs;
dn->dn_next_bonuslen[tx->tx_txg & TXG_MASK] = dn->dn_bonuslen;
dn->dn_next_blksz[tx->tx_txg & TXG_MASK] = dn->dn_datablksz;
}
void
dnode_reallocate(dnode_t *dn, dmu_object_type_t ot, int blocksize,
dmu_object_type_t bonustype, int bonuslen, dmu_tx_t *tx)
{
int nblkptr;
ASSERT3U(blocksize, >=, SPA_MINBLOCKSIZE);
ASSERT3U(blocksize, <=, SPA_MAXBLOCKSIZE);
ASSERT3U(blocksize % SPA_MINBLOCKSIZE, ==, 0);
ASSERT(dn->dn_object != DMU_META_DNODE_OBJECT || dmu_tx_private_ok(tx));
ASSERT(tx->tx_txg != 0);
ASSERT((bonustype == DMU_OT_NONE && bonuslen == 0) ||
(bonustype != DMU_OT_NONE && bonuslen != 0));
ASSERT3U(bonustype, <, DMU_OT_NUMTYPES);
ASSERT3U(bonuslen, <=, DN_MAX_BONUSLEN);
/* clean up any unreferenced dbufs */
dnode_evict_dbufs(dn);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
dnode_setdirty(dn, tx);
if (dn->dn_datablksz != blocksize) {
/* change blocksize */
ASSERT(dn->dn_maxblkid == 0 &&
(BP_IS_HOLE(&dn->dn_phys->dn_blkptr[0]) ||
dnode_block_freed(dn, 0)));
dnode_setdblksz(dn, blocksize);
dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = blocksize;
}
if (dn->dn_bonuslen != bonuslen)
dn->dn_next_bonuslen[tx->tx_txg&TXG_MASK] = bonuslen;
nblkptr = 1 + ((DN_MAX_BONUSLEN - bonuslen) >> SPA_BLKPTRSHIFT);
if (dn->dn_nblkptr != nblkptr)
dn->dn_next_nblkptr[tx->tx_txg&TXG_MASK] = nblkptr;
rw_exit(&dn->dn_struct_rwlock);
/* change type */
dn->dn_type = ot;
/* change bonus size and type */
mutex_enter(&dn->dn_mtx);
dn->dn_bonustype = bonustype;
dn->dn_bonuslen = bonuslen;
dn->dn_nblkptr = nblkptr;
dn->dn_checksum = ZIO_CHECKSUM_INHERIT;
dn->dn_compress = ZIO_COMPRESS_INHERIT;
ASSERT3U(dn->dn_nblkptr, <=, DN_MAX_NBLKPTR);
/* fix up the bonus db_size */
if (dn->dn_bonus) {
dn->dn_bonus->db.db_size =
DN_MAX_BONUSLEN - (dn->dn_nblkptr-1) * sizeof (blkptr_t);
ASSERT(dn->dn_bonuslen <= dn->dn_bonus->db.db_size);
}
dn->dn_allocated_txg = tx->tx_txg;
mutex_exit(&dn->dn_mtx);
}
void
dnode_special_close(dnode_t *dn)
{
/*
* Wait for final references to the dnode to clear. This can
* only happen if the arc is asyncronously evicting state that
* has a hold on this dnode while we are trying to evict this
* dnode.
*/
while (refcount_count(&dn->dn_holds) > 0)
delay(1);
dnode_destroy(dn);
}
dnode_t *
dnode_special_open(objset_impl_t *os, dnode_phys_t *dnp, uint64_t object)
{
dnode_t *dn = dnode_create(os, dnp, NULL, object);
DNODE_VERIFY(dn);
return (dn);
}
static void
dnode_buf_pageout(dmu_buf_t *db, void *arg)
{
dnode_t **children_dnodes = arg;
int i;
int epb = db->db_size >> DNODE_SHIFT;
for (i = 0; i < epb; i++) {
dnode_t *dn = children_dnodes[i];
int n;
if (dn == NULL)
continue;
#ifdef ZFS_DEBUG
/*
* If there are holds on this dnode, then there should
* be holds on the dnode's containing dbuf as well; thus
* it wouldn't be eligable for eviction and this function
* would not have been called.
*/
ASSERT(refcount_is_zero(&dn->dn_holds));
ASSERT(list_head(&dn->dn_dbufs) == NULL);
ASSERT(refcount_is_zero(&dn->dn_tx_holds));
for (n = 0; n < TXG_SIZE; n++)
ASSERT(!list_link_active(&dn->dn_dirty_link[n]));
#endif
children_dnodes[i] = NULL;
dnode_destroy(dn);
}
kmem_free(children_dnodes, epb * sizeof (dnode_t *));
}
/*
* errors:
* EINVAL - invalid object number.
* EIO - i/o error.
* succeeds even for free dnodes.
*/
int
dnode_hold_impl(objset_impl_t *os, uint64_t object, int flag,
void *tag, dnode_t **dnp)
{
int epb, idx, err;
int drop_struct_lock = FALSE;
int type;
uint64_t blk;
dnode_t *mdn, *dn;
dmu_buf_impl_t *db;
dnode_t **children_dnodes;
/*
* If you are holding the spa config lock as writer, you shouldn't
* be asking the DMU to do *anything*.
*/
ASSERT(spa_config_held(os->os_spa, SCL_ALL, RW_WRITER) == 0);
if (object == 0 || object >= DN_MAX_OBJECT)
return (EINVAL);
mdn = os->os_meta_dnode;
DNODE_VERIFY(mdn);
if (!RW_WRITE_HELD(&mdn->dn_struct_rwlock)) {
rw_enter(&mdn->dn_struct_rwlock, RW_READER);
drop_struct_lock = TRUE;
}
blk = dbuf_whichblock(mdn, object * sizeof (dnode_phys_t));
db = dbuf_hold(mdn, blk, FTAG);
if (drop_struct_lock)
rw_exit(&mdn->dn_struct_rwlock);
if (db == NULL)
return (EIO);
err = dbuf_read(db, NULL, DB_RF_CANFAIL);
if (err) {
dbuf_rele(db, FTAG);
return (err);
}
ASSERT3U(db->db.db_size, >=, 1<<DNODE_SHIFT);
epb = db->db.db_size >> DNODE_SHIFT;
idx = object & (epb-1);
children_dnodes = dmu_buf_get_user(&db->db);
if (children_dnodes == NULL) {
dnode_t **winner;
children_dnodes = kmem_zalloc(epb * sizeof (dnode_t *),
KM_SLEEP);
if (winner = dmu_buf_set_user(&db->db, children_dnodes, NULL,
dnode_buf_pageout)) {
kmem_free(children_dnodes, epb * sizeof (dnode_t *));
children_dnodes = winner;
}
}
if ((dn = children_dnodes[idx]) == NULL) {
dnode_phys_t *dnp = (dnode_phys_t *)db->db.db_data+idx;
dnode_t *winner;
dn = dnode_create(os, dnp, db, object);
winner = atomic_cas_ptr(&children_dnodes[idx], NULL, dn);
if (winner != NULL) {
dnode_destroy(dn);
dn = winner;
}
}
mutex_enter(&dn->dn_mtx);
type = dn->dn_type;
if (dn->dn_free_txg ||
((flag & DNODE_MUST_BE_ALLOCATED) && type == DMU_OT_NONE) ||
((flag & DNODE_MUST_BE_FREE) && type != DMU_OT_NONE)) {
mutex_exit(&dn->dn_mtx);
dbuf_rele(db, FTAG);
return (type == DMU_OT_NONE ? ENOENT : EEXIST);
}
mutex_exit(&dn->dn_mtx);
if (refcount_add(&dn->dn_holds, tag) == 1)
dbuf_add_ref(db, dn);
DNODE_VERIFY(dn);
ASSERT3P(dn->dn_dbuf, ==, db);
ASSERT3U(dn->dn_object, ==, object);
dbuf_rele(db, FTAG);
*dnp = dn;
return (0);
}
/*
* Return held dnode if the object is allocated, NULL if not.
*/
int
dnode_hold(objset_impl_t *os, uint64_t object, void *tag, dnode_t **dnp)
{
return (dnode_hold_impl(os, object, DNODE_MUST_BE_ALLOCATED, tag, dnp));
}
/*
* Can only add a reference if there is already at least one
* reference on the dnode. Returns FALSE if unable to add a
* new reference.
*/
boolean_t
dnode_add_ref(dnode_t *dn, void *tag)
{
mutex_enter(&dn->dn_mtx);
if (refcount_is_zero(&dn->dn_holds)) {
mutex_exit(&dn->dn_mtx);
return (FALSE);
}
VERIFY(1 < refcount_add(&dn->dn_holds, tag));
mutex_exit(&dn->dn_mtx);
return (TRUE);
}
void
dnode_rele(dnode_t *dn, void *tag)
{
uint64_t refs;
mutex_enter(&dn->dn_mtx);
refs = refcount_remove(&dn->dn_holds, tag);
mutex_exit(&dn->dn_mtx);
/* NOTE: the DNODE_DNODE does not have a dn_dbuf */
if (refs == 0 && dn->dn_dbuf)
dbuf_rele(dn->dn_dbuf, dn);
}
void
dnode_setdirty(dnode_t *dn, dmu_tx_t *tx)
{
objset_impl_t *os = dn->dn_objset;
uint64_t txg = tx->tx_txg;
if (dn->dn_object == DMU_META_DNODE_OBJECT)
return;
DNODE_VERIFY(dn);
#ifdef ZFS_DEBUG
mutex_enter(&dn->dn_mtx);
ASSERT(dn->dn_phys->dn_type || dn->dn_allocated_txg);
/* ASSERT(dn->dn_free_txg == 0 || dn->dn_free_txg >= txg); */
mutex_exit(&dn->dn_mtx);
#endif
mutex_enter(&os->os_lock);
/*
* If we are already marked dirty, we're done.
*/
if (list_link_active(&dn->dn_dirty_link[txg & TXG_MASK])) {
mutex_exit(&os->os_lock);
return;
}
ASSERT(!refcount_is_zero(&dn->dn_holds) || list_head(&dn->dn_dbufs));
ASSERT(dn->dn_datablksz != 0);
ASSERT3U(dn->dn_next_bonuslen[txg&TXG_MASK], ==, 0);
ASSERT3U(dn->dn_next_blksz[txg&TXG_MASK], ==, 0);
dprintf_ds(os->os_dsl_dataset, "obj=%llu txg=%llu\n",
dn->dn_object, txg);
if (dn->dn_free_txg > 0 && dn->dn_free_txg <= txg) {
list_insert_tail(&os->os_free_dnodes[txg&TXG_MASK], dn);
} else {
list_insert_tail(&os->os_dirty_dnodes[txg&TXG_MASK], dn);
}
mutex_exit(&os->os_lock);
/*
* The dnode maintains a hold on its containing dbuf as
* long as there are holds on it. Each instantiated child
* dbuf maintaines a hold on the dnode. When the last child
* drops its hold, the dnode will drop its hold on the
* containing dbuf. We add a "dirty hold" here so that the
* dnode will hang around after we finish processing its
* children.
*/
VERIFY(dnode_add_ref(dn, (void *)(uintptr_t)tx->tx_txg));
(void) dbuf_dirty(dn->dn_dbuf, tx);
dsl_dataset_dirty(os->os_dsl_dataset, tx);
}
void
dnode_free(dnode_t *dn, dmu_tx_t *tx)
{
int txgoff = tx->tx_txg & TXG_MASK;
dprintf("dn=%p txg=%llu\n", dn, tx->tx_txg);
/* we should be the only holder... hopefully */
/* ASSERT3U(refcount_count(&dn->dn_holds), ==, 1); */
mutex_enter(&dn->dn_mtx);
if (dn->dn_type == DMU_OT_NONE || dn->dn_free_txg) {
mutex_exit(&dn->dn_mtx);
return;
}
dn->dn_free_txg = tx->tx_txg;
mutex_exit(&dn->dn_mtx);
/*
* If the dnode is already dirty, it needs to be moved from
* the dirty list to the free list.
*/
mutex_enter(&dn->dn_objset->os_lock);
if (list_link_active(&dn->dn_dirty_link[txgoff])) {
list_remove(&dn->dn_objset->os_dirty_dnodes[txgoff], dn);
list_insert_tail(&dn->dn_objset->os_free_dnodes[txgoff], dn);
mutex_exit(&dn->dn_objset->os_lock);
} else {
mutex_exit(&dn->dn_objset->os_lock);
dnode_setdirty(dn, tx);
}
}
/*
* Try to change the block size for the indicated dnode. This can only
* succeed if there are no blocks allocated or dirty beyond first block
*/
int
dnode_set_blksz(dnode_t *dn, uint64_t size, int ibs, dmu_tx_t *tx)
{
dmu_buf_impl_t *db, *db_next;
int err;
if (size == 0)
size = SPA_MINBLOCKSIZE;
if (size > SPA_MAXBLOCKSIZE)
size = SPA_MAXBLOCKSIZE;
else
size = P2ROUNDUP(size, SPA_MINBLOCKSIZE);
if (ibs == dn->dn_indblkshift)
ibs = 0;
if (size >> SPA_MINBLOCKSHIFT == dn->dn_datablkszsec && ibs == 0)
return (0);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
/* Check for any allocated blocks beyond the first */
if (dn->dn_phys->dn_maxblkid != 0)
goto fail;
mutex_enter(&dn->dn_dbufs_mtx);
for (db = list_head(&dn->dn_dbufs); db; db = db_next) {
db_next = list_next(&dn->dn_dbufs, db);
if (db->db_blkid != 0 && db->db_blkid != DB_BONUS_BLKID) {
mutex_exit(&dn->dn_dbufs_mtx);
goto fail;
}
}
mutex_exit(&dn->dn_dbufs_mtx);
if (ibs && dn->dn_nlevels != 1)
goto fail;
/* resize the old block */
err = dbuf_hold_impl(dn, 0, 0, TRUE, FTAG, &db);
if (err == 0)
dbuf_new_size(db, size, tx);
else if (err != ENOENT)
goto fail;
dnode_setdblksz(dn, size);
dnode_setdirty(dn, tx);
dn->dn_next_blksz[tx->tx_txg&TXG_MASK] = size;
if (ibs) {
dn->dn_indblkshift = ibs;
dn->dn_next_indblkshift[tx->tx_txg&TXG_MASK] = ibs;
}
/* rele after we have fixed the blocksize in the dnode */
if (db)
dbuf_rele(db, FTAG);
rw_exit(&dn->dn_struct_rwlock);
return (0);
fail:
rw_exit(&dn->dn_struct_rwlock);
return (ENOTSUP);
}
/* read-holding callers must not rely on the lock being continuously held */
void
dnode_new_blkid(dnode_t *dn, uint64_t blkid, dmu_tx_t *tx, boolean_t have_read)
{
uint64_t txgoff = tx->tx_txg & TXG_MASK;
int epbs, new_nlevels;
uint64_t sz;
ASSERT(blkid != DB_BONUS_BLKID);
ASSERT(have_read ?
RW_READ_HELD(&dn->dn_struct_rwlock) :
RW_WRITE_HELD(&dn->dn_struct_rwlock));
/*
* if we have a read-lock, check to see if we need to do any work
* before upgrading to a write-lock.
*/
if (have_read) {
if (blkid <= dn->dn_maxblkid)
return;
if (!rw_tryupgrade(&dn->dn_struct_rwlock)) {
rw_exit(&dn->dn_struct_rwlock);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
}
}
if (blkid <= dn->dn_maxblkid)
goto out;
dn->dn_maxblkid = blkid;
/*
* Compute the number of levels necessary to support the new maxblkid.
*/
new_nlevels = 1;
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
for (sz = dn->dn_nblkptr;
sz <= blkid && sz >= dn->dn_nblkptr; sz <<= epbs)
new_nlevels++;
if (new_nlevels > dn->dn_nlevels) {
int old_nlevels = dn->dn_nlevels;
dmu_buf_impl_t *db;
list_t *list;
dbuf_dirty_record_t *new, *dr, *dr_next;
dn->dn_nlevels = new_nlevels;
ASSERT3U(new_nlevels, >, dn->dn_next_nlevels[txgoff]);
dn->dn_next_nlevels[txgoff] = new_nlevels;
/* dirty the left indirects */
db = dbuf_hold_level(dn, old_nlevels, 0, FTAG);
new = dbuf_dirty(db, tx);
dbuf_rele(db, FTAG);
/* transfer the dirty records to the new indirect */
mutex_enter(&dn->dn_mtx);
mutex_enter(&new->dt.di.dr_mtx);
list = &dn->dn_dirty_records[txgoff];
for (dr = list_head(list); dr; dr = dr_next) {
dr_next = list_next(&dn->dn_dirty_records[txgoff], dr);
if (dr->dr_dbuf->db_level != new_nlevels-1 &&
dr->dr_dbuf->db_blkid != DB_BONUS_BLKID) {
ASSERT(dr->dr_dbuf->db_level == old_nlevels-1);
list_remove(&dn->dn_dirty_records[txgoff], dr);
list_insert_tail(&new->dt.di.dr_children, dr);
dr->dr_parent = new;
}
}
mutex_exit(&new->dt.di.dr_mtx);
mutex_exit(&dn->dn_mtx);
}
out:
if (have_read)
rw_downgrade(&dn->dn_struct_rwlock);
}
void
dnode_clear_range(dnode_t *dn, uint64_t blkid, uint64_t nblks, dmu_tx_t *tx)
{
avl_tree_t *tree = &dn->dn_ranges[tx->tx_txg&TXG_MASK];
avl_index_t where;
free_range_t *rp;
free_range_t rp_tofind;
uint64_t endblk = blkid + nblks;
ASSERT(MUTEX_HELD(&dn->dn_mtx));
ASSERT(nblks <= UINT64_MAX - blkid); /* no overflow */
dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n",
blkid, nblks, tx->tx_txg);
rp_tofind.fr_blkid = blkid;
rp = avl_find(tree, &rp_tofind, &where);
if (rp == NULL)
rp = avl_nearest(tree, where, AVL_BEFORE);
if (rp == NULL)
rp = avl_nearest(tree, where, AVL_AFTER);
while (rp && (rp->fr_blkid <= blkid + nblks)) {
uint64_t fr_endblk = rp->fr_blkid + rp->fr_nblks;
free_range_t *nrp = AVL_NEXT(tree, rp);
if (blkid <= rp->fr_blkid && endblk >= fr_endblk) {
/* clear this entire range */
avl_remove(tree, rp);
kmem_free(rp, sizeof (free_range_t));
} else if (blkid <= rp->fr_blkid &&
endblk > rp->fr_blkid && endblk < fr_endblk) {
/* clear the beginning of this range */
rp->fr_blkid = endblk;
rp->fr_nblks = fr_endblk - endblk;
} else if (blkid > rp->fr_blkid && blkid < fr_endblk &&
endblk >= fr_endblk) {
/* clear the end of this range */
rp->fr_nblks = blkid - rp->fr_blkid;
} else if (blkid > rp->fr_blkid && endblk < fr_endblk) {
/* clear a chunk out of this range */
free_range_t *new_rp =
kmem_alloc(sizeof (free_range_t), KM_SLEEP);
new_rp->fr_blkid = endblk;
new_rp->fr_nblks = fr_endblk - endblk;
avl_insert_here(tree, new_rp, rp, AVL_AFTER);
rp->fr_nblks = blkid - rp->fr_blkid;
}
/* there may be no overlap */
rp = nrp;
}
}
void
dnode_free_range(dnode_t *dn, uint64_t off, uint64_t len, dmu_tx_t *tx)
{
dmu_buf_impl_t *db;
uint64_t blkoff, blkid, nblks;
int blksz, blkshift, head, tail;
int trunc = FALSE;
int epbs;
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
blksz = dn->dn_datablksz;
blkshift = dn->dn_datablkshift;
epbs = dn->dn_indblkshift - SPA_BLKPTRSHIFT;
if (len == -1ULL) {
len = UINT64_MAX - off;
trunc = TRUE;
}
/*
* First, block align the region to free:
*/
if (ISP2(blksz)) {
head = P2NPHASE(off, blksz);
blkoff = P2PHASE(off, blksz);
if ((off >> blkshift) > dn->dn_maxblkid)
goto out;
} else {
ASSERT(dn->dn_maxblkid == 0);
if (off == 0 && len >= blksz) {
/* Freeing the whole block; fast-track this request */
blkid = 0;
nblks = 1;
goto done;
} else if (off >= blksz) {
/* Freeing past end-of-data */
goto out;
} else {
/* Freeing part of the block. */
head = blksz - off;
ASSERT3U(head, >, 0);
}
blkoff = off;
}
/* zero out any partial block data at the start of the range */
if (head) {
ASSERT3U(blkoff + head, ==, blksz);
if (len < head)
head = len;
if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, off), TRUE,
FTAG, &db) == 0) {
caddr_t data;
/* don't dirty if it isn't on disk and isn't dirty */
if (db->db_last_dirty ||
(db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
rw_exit(&dn->dn_struct_rwlock);
dbuf_will_dirty(db, tx);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
data = db->db.db_data;
bzero(data + blkoff, head);
}
dbuf_rele(db, FTAG);
}
off += head;
len -= head;
}
/* If the range was less than one block, we're done */
if (len == 0)
goto out;
/* If the remaining range is past end of file, we're done */
if ((off >> blkshift) > dn->dn_maxblkid)
goto out;
ASSERT(ISP2(blksz));
if (trunc)
tail = 0;
else
tail = P2PHASE(len, blksz);
ASSERT3U(P2PHASE(off, blksz), ==, 0);
/* zero out any partial block data at the end of the range */
if (tail) {
if (len < tail)
tail = len;
if (dbuf_hold_impl(dn, 0, dbuf_whichblock(dn, off+len),
TRUE, FTAG, &db) == 0) {
/* don't dirty if not on disk and not dirty */
if (db->db_last_dirty ||
(db->db_blkptr && !BP_IS_HOLE(db->db_blkptr))) {
rw_exit(&dn->dn_struct_rwlock);
dbuf_will_dirty(db, tx);
rw_enter(&dn->dn_struct_rwlock, RW_WRITER);
bzero(db->db.db_data, tail);
}
dbuf_rele(db, FTAG);
}
len -= tail;
}
/* If the range did not include a full block, we are done */
if (len == 0)
goto out;
ASSERT(IS_P2ALIGNED(off, blksz));
ASSERT(trunc || IS_P2ALIGNED(len, blksz));
blkid = off >> blkshift;
nblks = len >> blkshift;
if (trunc)
nblks += 1;
/*
* Read in and mark all the level-1 indirects dirty,
* so that they will stay in memory until syncing phase.
* Always dirty the first and last indirect to make sure
* we dirty all the partial indirects.
*/
if (dn->dn_nlevels > 1) {
uint64_t i, first, last;
int shift = epbs + dn->dn_datablkshift;
first = blkid >> epbs;
if (db = dbuf_hold_level(dn, 1, first, FTAG)) {
dbuf_will_dirty(db, tx);
dbuf_rele(db, FTAG);
}
if (trunc)
last = dn->dn_maxblkid >> epbs;
else
last = (blkid + nblks - 1) >> epbs;
if (last > first && (db = dbuf_hold_level(dn, 1, last, FTAG))) {
dbuf_will_dirty(db, tx);
dbuf_rele(db, FTAG);
}
for (i = first + 1; i < last; i++) {
uint64_t ibyte = i << shift;
int err;
err = dnode_next_offset(dn,
DNODE_FIND_HAVELOCK, &ibyte, 1, 1, 0);
i = ibyte >> shift;
if (err == ESRCH || i >= last)
break;
ASSERT(err == 0);
db = dbuf_hold_level(dn, 1, i, FTAG);
if (db) {
dbuf_will_dirty(db, tx);
dbuf_rele(db, FTAG);
}
}
}
done:
/*
* Add this range to the dnode range list.
* We will finish up this free operation in the syncing phase.
*/
mutex_enter(&dn->dn_mtx);
dnode_clear_range(dn, blkid, nblks, tx);
{
free_range_t *rp, *found;
avl_index_t where;
avl_tree_t *tree = &dn->dn_ranges[tx->tx_txg&TXG_MASK];
/* Add new range to dn_ranges */
rp = kmem_alloc(sizeof (free_range_t), KM_SLEEP);
rp->fr_blkid = blkid;
rp->fr_nblks = nblks;
found = avl_find(tree, rp, &where);
ASSERT(found == NULL);
avl_insert(tree, rp, where);
dprintf_dnode(dn, "blkid=%llu nblks=%llu txg=%llu\n",
blkid, nblks, tx->tx_txg);
}
mutex_exit(&dn->dn_mtx);
dbuf_free_range(dn, blkid, blkid + nblks - 1, tx);
dnode_setdirty(dn, tx);
out:
if (trunc && dn->dn_maxblkid >= (off >> blkshift))
dn->dn_maxblkid = (off >> blkshift ? (off >> blkshift) - 1 : 0);
rw_exit(&dn->dn_struct_rwlock);
}
/* return TRUE if this blkid was freed in a recent txg, or FALSE if it wasn't */
uint64_t
dnode_block_freed(dnode_t *dn, uint64_t blkid)
{
free_range_t range_tofind;
void *dp = spa_get_dsl(dn->dn_objset->os_spa);
int i;
if (blkid == DB_BONUS_BLKID)
return (FALSE);
/*
* If we're in the process of opening the pool, dp will not be
* set yet, but there shouldn't be anything dirty.
*/
if (dp == NULL)
return (FALSE);
if (dn->dn_free_txg)
return (TRUE);
range_tofind.fr_blkid = blkid;
mutex_enter(&dn->dn_mtx);
for (i = 0; i < TXG_SIZE; i++) {
free_range_t *range_found;
avl_index_t idx;
range_found = avl_find(&dn->dn_ranges[i], &range_tofind, &idx);
if (range_found) {
ASSERT(range_found->fr_nblks > 0);
break;
}
range_found = avl_nearest(&dn->dn_ranges[i], idx, AVL_BEFORE);
if (range_found &&
range_found->fr_blkid + range_found->fr_nblks > blkid)
break;
}
mutex_exit(&dn->dn_mtx);
return (i < TXG_SIZE);
}
/* call from syncing context when we actually write/free space for this dnode */
void
dnode_diduse_space(dnode_t *dn, int64_t delta)
{
uint64_t space;
dprintf_dnode(dn, "dn=%p dnp=%p used=%llu delta=%lld\n",
dn, dn->dn_phys,
(u_longlong_t)dn->dn_phys->dn_used,
(longlong_t)delta);
mutex_enter(&dn->dn_mtx);
space = DN_USED_BYTES(dn->dn_phys);
if (delta > 0) {
ASSERT3U(space + delta, >=, space); /* no overflow */
} else {
ASSERT3U(space, >=, -delta); /* no underflow */
}
space += delta;
if (spa_version(dn->dn_objset->os_spa) < SPA_VERSION_DNODE_BYTES) {
ASSERT((dn->dn_phys->dn_flags & DNODE_FLAG_USED_BYTES) == 0);
ASSERT3U(P2PHASE(space, 1<<DEV_BSHIFT), ==, 0);
dn->dn_phys->dn_used = space >> DEV_BSHIFT;
} else {
dn->dn_phys->dn_used = space;
dn->dn_phys->dn_flags |= DNODE_FLAG_USED_BYTES;
}
mutex_exit(&dn->dn_mtx);
}
/*
* Call when we think we're going to write/free space in open context.
* Be conservative (ie. OK to write less than this or free more than
* this, but don't write more or free less).
*/
void
dnode_willuse_space(dnode_t *dn, int64_t space, dmu_tx_t *tx)
{
objset_impl_t *os = dn->dn_objset;
dsl_dataset_t *ds = os->os_dsl_dataset;
if (space > 0)
space = spa_get_asize(os->os_spa, space);
if (ds)
dsl_dir_willuse_space(ds->ds_dir, space, tx);
dmu_tx_willuse_space(tx, space);
}
/*
* This function scans a block at the indicated "level" looking for
* a hole or data (depending on 'flags'). If level > 0, then we are
* scanning an indirect block looking at its pointers. If level == 0,
* then we are looking at a block of dnodes. If we don't find what we
* are looking for in the block, we return ESRCH. Otherwise, return
* with *offset pointing to the beginning (if searching forwards) or
* end (if searching backwards) of the range covered by the block
* pointer we matched on (or dnode).
*
* The basic search algorithm used below by dnode_next_offset() is to
* use this function to search up the block tree (widen the search) until
* we find something (i.e., we don't return ESRCH) and then search back
* down the tree (narrow the search) until we reach our original search
* level.
*/
static int
dnode_next_offset_level(dnode_t *dn, int flags, uint64_t *offset,
int lvl, uint64_t blkfill, uint64_t txg)
{
dmu_buf_impl_t *db = NULL;
void *data = NULL;
uint64_t epbs = dn->dn_phys->dn_indblkshift - SPA_BLKPTRSHIFT;
uint64_t epb = 1ULL << epbs;
uint64_t minfill, maxfill;
boolean_t hole;
int i, inc, error, span;
dprintf("probing object %llu offset %llx level %d of %u\n",
dn->dn_object, *offset, lvl, dn->dn_phys->dn_nlevels);
hole = flags & DNODE_FIND_HOLE;
inc = (flags & DNODE_FIND_BACKWARDS) ? -1 : 1;
ASSERT(txg == 0 || !hole);
if (lvl == dn->dn_phys->dn_nlevels) {
error = 0;
epb = dn->dn_phys->dn_nblkptr;
data = dn->dn_phys->dn_blkptr;
} else {
uint64_t blkid = dbuf_whichblock(dn, *offset) >> (epbs * lvl);
error = dbuf_hold_impl(dn, lvl, blkid, TRUE, FTAG, &db);
if (error) {
if (error != ENOENT)
return (error);
if (hole)
return (0);
/*
* This can only happen when we are searching up
* the block tree for data. We don't really need to
* adjust the offset, as we will just end up looking
* at the pointer to this block in its parent, and its
* going to be unallocated, so we will skip over it.
*/
return (ESRCH);
}
error = dbuf_read(db, NULL, DB_RF_CANFAIL | DB_RF_HAVESTRUCT);
if (error) {
dbuf_rele(db, FTAG);
return (error);
}
data = db->db.db_data;
}
if (db && txg &&
(db->db_blkptr == NULL || db->db_blkptr->blk_birth <= txg)) {
/*
* This can only happen when we are searching up the tree
* and these conditions mean that we need to keep climbing.
*/
error = ESRCH;
} else if (lvl == 0) {
dnode_phys_t *dnp = data;
span = DNODE_SHIFT;
ASSERT(dn->dn_type == DMU_OT_DNODE);
for (i = (*offset >> span) & (blkfill - 1);
i >= 0 && i < blkfill; i += inc) {
- boolean_t newcontents = B_TRUE;
- if (txg) {
- int j;
- newcontents = B_FALSE;
- for (j = 0; j < dnp[i].dn_nblkptr; j++) {
- if (dnp[i].dn_blkptr[j].blk_birth > txg)
- newcontents = B_TRUE;
- }
- }
- if (!dnp[i].dn_type == hole && newcontents)
+ if ((dnp[i].dn_type == DMU_OT_NONE) == hole)
break;
*offset += (1ULL << span) * inc;
}
if (i < 0 || i == blkfill)
error = ESRCH;
} else {
blkptr_t *bp = data;
uint64_t start = *offset;
span = (lvl - 1) * epbs + dn->dn_datablkshift;
minfill = 0;
maxfill = blkfill << ((lvl - 1) * epbs);
if (hole)
maxfill--;
else
minfill++;
*offset = *offset >> span;
for (i = BF64_GET(*offset, 0, epbs);
i >= 0 && i < epb; i += inc) {
if (bp[i].blk_fill >= minfill &&
bp[i].blk_fill <= maxfill &&
(hole || bp[i].blk_birth > txg))
break;
if (inc > 0 || *offset > 0)
*offset += inc;
}
*offset = *offset << span;
if (inc < 0) {
/* traversing backwards; position offset at the end */
ASSERT3U(*offset, <=, start);
*offset = MIN(*offset + (1ULL << span) - 1, start);
} else if (*offset < start) {
*offset = start;
}
if (i < 0 || i >= epb)
error = ESRCH;
}
if (db)
dbuf_rele(db, FTAG);
return (error);
}
/*
* Find the next hole, data, or sparse region at or after *offset.
* The value 'blkfill' tells us how many items we expect to find
* in an L0 data block; this value is 1 for normal objects,
* DNODES_PER_BLOCK for the meta dnode, and some fraction of
* DNODES_PER_BLOCK when searching for sparse regions thereof.
*
* Examples:
*
* dnode_next_offset(dn, flags, offset, 1, 1, 0);
* Finds the next/previous hole/data in a file.
* Used in dmu_offset_next().
*
* dnode_next_offset(mdn, flags, offset, 0, DNODES_PER_BLOCK, txg);
* Finds the next free/allocated dnode an objset's meta-dnode.
* Only finds objects that have new contents since txg (ie.
* bonus buffer changes and content removal are ignored).
* Used in dmu_object_next().
*
* dnode_next_offset(mdn, DNODE_FIND_HOLE, offset, 2, DNODES_PER_BLOCK >> 2, 0);
* Finds the next L2 meta-dnode bp that's at most 1/4 full.
* Used in dmu_object_alloc().
*/
int
dnode_next_offset(dnode_t *dn, int flags, uint64_t *offset,
int minlvl, uint64_t blkfill, uint64_t txg)
{
uint64_t initial_offset = *offset;
int lvl, maxlvl;
int error = 0;
if (!(flags & DNODE_FIND_HAVELOCK))
rw_enter(&dn->dn_struct_rwlock, RW_READER);
if (dn->dn_phys->dn_nlevels == 0) {
error = ESRCH;
goto out;
}
if (dn->dn_datablkshift == 0) {
if (*offset < dn->dn_datablksz) {
if (flags & DNODE_FIND_HOLE)
*offset = dn->dn_datablksz;
} else {
error = ESRCH;
}
goto out;
}
maxlvl = dn->dn_phys->dn_nlevels;
for (lvl = minlvl; lvl <= maxlvl; lvl++) {
error = dnode_next_offset_level(dn,
flags, offset, lvl, blkfill, txg);
if (error != ESRCH)
break;
}
while (error == 0 && --lvl >= minlvl) {
error = dnode_next_offset_level(dn,
flags, offset, lvl, blkfill, txg);
}
if (error == 0 && (flags & DNODE_FIND_BACKWARDS ?
initial_offset < *offset : initial_offset > *offset))
error = ESRCH;
out:
if (!(flags & DNODE_FIND_HAVELOCK))
rw_exit(&dn->dn_struct_rwlock);
return (error);
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c (revision 209274)
@@ -1,1331 +1,1330 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/dmu.h>
#include <sys/dmu_objset.h>
#include <sys/dmu_tx.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_prop.h>
#include <sys/dsl_synctask.h>
#include <sys/dsl_deleg.h>
#include <sys/spa.h>
#include <sys/zap.h>
#include <sys/zio.h>
#include <sys/arc.h>
#include <sys/sunddi.h>
#include "zfs_namecheck.h"
static uint64_t dsl_dir_space_towrite(dsl_dir_t *dd);
static void dsl_dir_set_reservation_sync(void *arg1, void *arg2,
cred_t *cr, dmu_tx_t *tx);
/* ARGSUSED */
static void
dsl_dir_evict(dmu_buf_t *db, void *arg)
{
dsl_dir_t *dd = arg;
dsl_pool_t *dp = dd->dd_pool;
int t;
for (t = 0; t < TXG_SIZE; t++) {
ASSERT(!txg_list_member(&dp->dp_dirty_dirs, dd, t));
ASSERT(dd->dd_tempreserved[t] == 0);
ASSERT(dd->dd_space_towrite[t] == 0);
}
if (dd->dd_parent)
dsl_dir_close(dd->dd_parent, dd);
spa_close(dd->dd_pool->dp_spa, dd);
/*
* The props callback list should be empty since they hold the
* dir open.
*/
list_destroy(&dd->dd_prop_cbs);
mutex_destroy(&dd->dd_lock);
kmem_free(dd, sizeof (dsl_dir_t));
}
int
dsl_dir_open_obj(dsl_pool_t *dp, uint64_t ddobj,
const char *tail, void *tag, dsl_dir_t **ddp)
{
dmu_buf_t *dbuf;
dsl_dir_t *dd;
int err;
ASSERT(RW_LOCK_HELD(&dp->dp_config_rwlock) ||
dsl_pool_sync_context(dp));
err = dmu_bonus_hold(dp->dp_meta_objset, ddobj, tag, &dbuf);
if (err)
return (err);
dd = dmu_buf_get_user(dbuf);
#ifdef ZFS_DEBUG
{
dmu_object_info_t doi;
dmu_object_info_from_db(dbuf, &doi);
ASSERT3U(doi.doi_type, ==, DMU_OT_DSL_DIR);
ASSERT3U(doi.doi_bonus_size, >=, sizeof (dsl_dir_phys_t));
}
#endif
if (dd == NULL) {
dsl_dir_t *winner;
- int err;
dd = kmem_zalloc(sizeof (dsl_dir_t), KM_SLEEP);
dd->dd_object = ddobj;
dd->dd_dbuf = dbuf;
dd->dd_pool = dp;
dd->dd_phys = dbuf->db_data;
mutex_init(&dd->dd_lock, NULL, MUTEX_DEFAULT, NULL);
list_create(&dd->dd_prop_cbs, sizeof (dsl_prop_cb_record_t),
offsetof(dsl_prop_cb_record_t, cbr_node));
if (dd->dd_phys->dd_parent_obj) {
err = dsl_dir_open_obj(dp, dd->dd_phys->dd_parent_obj,
NULL, dd, &dd->dd_parent);
if (err)
goto errout;
if (tail) {
#ifdef ZFS_DEBUG
uint64_t foundobj;
err = zap_lookup(dp->dp_meta_objset,
dd->dd_parent->dd_phys->dd_child_dir_zapobj,
tail, sizeof (foundobj), 1, &foundobj);
ASSERT(err || foundobj == ddobj);
#endif
(void) strcpy(dd->dd_myname, tail);
} else {
err = zap_value_search(dp->dp_meta_objset,
dd->dd_parent->dd_phys->dd_child_dir_zapobj,
ddobj, 0, dd->dd_myname);
}
if (err)
goto errout;
} else {
(void) strcpy(dd->dd_myname, spa_name(dp->dp_spa));
}
winner = dmu_buf_set_user_ie(dbuf, dd, &dd->dd_phys,
dsl_dir_evict);
if (winner) {
if (dd->dd_parent)
dsl_dir_close(dd->dd_parent, dd);
mutex_destroy(&dd->dd_lock);
kmem_free(dd, sizeof (dsl_dir_t));
dd = winner;
} else {
spa_open_ref(dp->dp_spa, dd);
}
}
/*
* The dsl_dir_t has both open-to-close and instantiate-to-evict
* holds on the spa. We need the open-to-close holds because
* otherwise the spa_refcnt wouldn't change when we open a
* dir which the spa also has open, so we could incorrectly
* think it was OK to unload/export/destroy the pool. We need
* the instantiate-to-evict hold because the dsl_dir_t has a
* pointer to the dd_pool, which has a pointer to the spa_t.
*/
spa_open_ref(dp->dp_spa, tag);
ASSERT3P(dd->dd_pool, ==, dp);
ASSERT3U(dd->dd_object, ==, ddobj);
ASSERT3P(dd->dd_dbuf, ==, dbuf);
*ddp = dd;
return (0);
errout:
if (dd->dd_parent)
dsl_dir_close(dd->dd_parent, dd);
mutex_destroy(&dd->dd_lock);
kmem_free(dd, sizeof (dsl_dir_t));
dmu_buf_rele(dbuf, tag);
return (err);
}
void
dsl_dir_close(dsl_dir_t *dd, void *tag)
{
dprintf_dd(dd, "%s\n", "");
spa_close(dd->dd_pool->dp_spa, tag);
dmu_buf_rele(dd->dd_dbuf, tag);
}
/* buf must be long enough (MAXNAMELEN + strlen(MOS_DIR_NAME) + 1 should do) */
void
dsl_dir_name(dsl_dir_t *dd, char *buf)
{
if (dd->dd_parent) {
dsl_dir_name(dd->dd_parent, buf);
(void) strcat(buf, "/");
} else {
buf[0] = '\0';
}
if (!MUTEX_HELD(&dd->dd_lock)) {
/*
* recursive mutex so that we can use
* dprintf_dd() with dd_lock held
*/
mutex_enter(&dd->dd_lock);
(void) strcat(buf, dd->dd_myname);
mutex_exit(&dd->dd_lock);
} else {
(void) strcat(buf, dd->dd_myname);
}
}
/* Calculate name legnth, avoiding all the strcat calls of dsl_dir_name */
int
dsl_dir_namelen(dsl_dir_t *dd)
{
int result = 0;
if (dd->dd_parent) {
/* parent's name + 1 for the "/" */
result = dsl_dir_namelen(dd->dd_parent) + 1;
}
if (!MUTEX_HELD(&dd->dd_lock)) {
/* see dsl_dir_name */
mutex_enter(&dd->dd_lock);
result += strlen(dd->dd_myname);
mutex_exit(&dd->dd_lock);
} else {
result += strlen(dd->dd_myname);
}
return (result);
}
int
dsl_dir_is_private(dsl_dir_t *dd)
{
int rv = FALSE;
if (dd->dd_parent && dsl_dir_is_private(dd->dd_parent))
rv = TRUE;
if (dataset_name_hidden(dd->dd_myname))
rv = TRUE;
return (rv);
}
static int
getcomponent(const char *path, char *component, const char **nextp)
{
char *p;
if (path == NULL)
return (ENOENT);
/* This would be a good place to reserve some namespace... */
p = strpbrk(path, "/@");
if (p && (p[1] == '/' || p[1] == '@')) {
/* two separators in a row */
return (EINVAL);
}
if (p == NULL || p == path) {
/*
* if the first thing is an @ or /, it had better be an
* @ and it had better not have any more ats or slashes,
* and it had better have something after the @.
*/
if (p != NULL &&
(p[0] != '@' || strpbrk(path+1, "/@") || p[1] == '\0'))
return (EINVAL);
if (strlen(path) >= MAXNAMELEN)
return (ENAMETOOLONG);
(void) strcpy(component, path);
p = NULL;
} else if (p[0] == '/') {
if (p-path >= MAXNAMELEN)
return (ENAMETOOLONG);
(void) strncpy(component, path, p - path);
component[p-path] = '\0';
p++;
} else if (p[0] == '@') {
/*
* if the next separator is an @, there better not be
* any more slashes.
*/
if (strchr(path, '/'))
return (EINVAL);
if (p-path >= MAXNAMELEN)
return (ENAMETOOLONG);
(void) strncpy(component, path, p - path);
component[p-path] = '\0';
} else {
ASSERT(!"invalid p");
}
*nextp = p;
return (0);
}
/*
* same as dsl_open_dir, ignore the first component of name and use the
* spa instead
*/
int
dsl_dir_open_spa(spa_t *spa, const char *name, void *tag,
dsl_dir_t **ddp, const char **tailp)
{
char buf[MAXNAMELEN];
const char *next, *nextnext = NULL;
int err;
dsl_dir_t *dd;
dsl_pool_t *dp;
uint64_t ddobj;
int openedspa = FALSE;
dprintf("%s\n", name);
err = getcomponent(name, buf, &next);
if (err)
return (err);
if (spa == NULL) {
err = spa_open(buf, &spa, FTAG);
if (err) {
dprintf("spa_open(%s) failed\n", buf);
return (err);
}
openedspa = TRUE;
/* XXX this assertion belongs in spa_open */
ASSERT(!dsl_pool_sync_context(spa_get_dsl(spa)));
}
dp = spa_get_dsl(spa);
rw_enter(&dp->dp_config_rwlock, RW_READER);
err = dsl_dir_open_obj(dp, dp->dp_root_dir_obj, NULL, tag, &dd);
if (err) {
rw_exit(&dp->dp_config_rwlock);
if (openedspa)
spa_close(spa, FTAG);
return (err);
}
while (next != NULL) {
dsl_dir_t *child_ds;
err = getcomponent(next, buf, &nextnext);
if (err)
break;
ASSERT(next[0] != '\0');
if (next[0] == '@')
break;
dprintf("looking up %s in obj%lld\n",
buf, dd->dd_phys->dd_child_dir_zapobj);
err = zap_lookup(dp->dp_meta_objset,
dd->dd_phys->dd_child_dir_zapobj,
buf, sizeof (ddobj), 1, &ddobj);
if (err) {
if (err == ENOENT)
err = 0;
break;
}
err = dsl_dir_open_obj(dp, ddobj, buf, tag, &child_ds);
if (err)
break;
dsl_dir_close(dd, tag);
dd = child_ds;
next = nextnext;
}
rw_exit(&dp->dp_config_rwlock);
if (err) {
dsl_dir_close(dd, tag);
if (openedspa)
spa_close(spa, FTAG);
return (err);
}
/*
* It's an error if there's more than one component left, or
* tailp==NULL and there's any component left.
*/
if (next != NULL &&
(tailp == NULL || (nextnext && nextnext[0] != '\0'))) {
/* bad path name */
dsl_dir_close(dd, tag);
dprintf("next=%p (%s) tail=%p\n", next, next?next:"", tailp);
err = ENOENT;
}
if (tailp)
*tailp = next;
if (openedspa)
spa_close(spa, FTAG);
*ddp = dd;
return (err);
}
/*
* Return the dsl_dir_t, and possibly the last component which couldn't
* be found in *tail. Return NULL if the path is bogus, or if
* tail==NULL and we couldn't parse the whole name. (*tail)[0] == '@'
* means that the last component is a snapshot.
*/
int
dsl_dir_open(const char *name, void *tag, dsl_dir_t **ddp, const char **tailp)
{
return (dsl_dir_open_spa(NULL, name, tag, ddp, tailp));
}
uint64_t
dsl_dir_create_sync(dsl_pool_t *dp, dsl_dir_t *pds, const char *name,
dmu_tx_t *tx)
{
objset_t *mos = dp->dp_meta_objset;
uint64_t ddobj;
dsl_dir_phys_t *dsphys;
dmu_buf_t *dbuf;
ddobj = dmu_object_alloc(mos, DMU_OT_DSL_DIR, 0,
DMU_OT_DSL_DIR, sizeof (dsl_dir_phys_t), tx);
if (pds) {
VERIFY(0 == zap_add(mos, pds->dd_phys->dd_child_dir_zapobj,
name, sizeof (uint64_t), 1, &ddobj, tx));
} else {
/* it's the root dir */
VERIFY(0 == zap_add(mos, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_ROOT_DATASET, sizeof (uint64_t), 1, &ddobj, tx));
}
VERIFY(0 == dmu_bonus_hold(mos, ddobj, FTAG, &dbuf));
dmu_buf_will_dirty(dbuf, tx);
dsphys = dbuf->db_data;
dsphys->dd_creation_time = gethrestime_sec();
if (pds)
dsphys->dd_parent_obj = pds->dd_object;
dsphys->dd_props_zapobj = zap_create(mos,
DMU_OT_DSL_PROPS, DMU_OT_NONE, 0, tx);
dsphys->dd_child_dir_zapobj = zap_create(mos,
DMU_OT_DSL_DIR_CHILD_MAP, DMU_OT_NONE, 0, tx);
if (spa_version(dp->dp_spa) >= SPA_VERSION_USED_BREAKDOWN)
dsphys->dd_flags |= DD_FLAG_USED_BREAKDOWN;
dmu_buf_rele(dbuf, FTAG);
return (ddobj);
}
/* ARGSUSED */
int
dsl_dir_destroy_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
dsl_pool_t *dp = dd->dd_pool;
objset_t *mos = dp->dp_meta_objset;
int err;
uint64_t count;
/*
* There should be exactly two holds, both from
* dsl_dataset_destroy: one on the dd directory, and one on its
* head ds. Otherwise, someone is trying to lookup something
* inside this dir while we want to destroy it. The
* config_rwlock ensures that nobody else opens it after we
* check.
*/
if (dmu_buf_refcount(dd->dd_dbuf) > 2)
return (EBUSY);
err = zap_count(mos, dd->dd_phys->dd_child_dir_zapobj, &count);
if (err)
return (err);
if (count != 0)
return (EEXIST);
return (0);
}
void
dsl_dir_destroy_sync(void *arg1, void *tag, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
objset_t *mos = dd->dd_pool->dp_meta_objset;
uint64_t val, obj;
dd_used_t t;
ASSERT(RW_WRITE_HELD(&dd->dd_pool->dp_config_rwlock));
ASSERT(dd->dd_phys->dd_head_dataset_obj == 0);
/* Remove our reservation. */
val = 0;
dsl_dir_set_reservation_sync(dd, &val, cr, tx);
ASSERT3U(dd->dd_phys->dd_used_bytes, ==, 0);
ASSERT3U(dd->dd_phys->dd_reserved, ==, 0);
for (t = 0; t < DD_USED_NUM; t++)
ASSERT3U(dd->dd_phys->dd_used_breakdown[t], ==, 0);
VERIFY(0 == zap_destroy(mos, dd->dd_phys->dd_child_dir_zapobj, tx));
VERIFY(0 == zap_destroy(mos, dd->dd_phys->dd_props_zapobj, tx));
VERIFY(0 == dsl_deleg_destroy(mos, dd->dd_phys->dd_deleg_zapobj, tx));
VERIFY(0 == zap_remove(mos,
dd->dd_parent->dd_phys->dd_child_dir_zapobj, dd->dd_myname, tx));
obj = dd->dd_object;
dsl_dir_close(dd, tag);
VERIFY(0 == dmu_object_free(mos, obj, tx));
}
boolean_t
dsl_dir_is_clone(dsl_dir_t *dd)
{
return (dd->dd_phys->dd_origin_obj &&
(dd->dd_pool->dp_origin_snap == NULL ||
dd->dd_phys->dd_origin_obj !=
dd->dd_pool->dp_origin_snap->ds_object));
}
void
dsl_dir_stats(dsl_dir_t *dd, nvlist_t *nv)
{
mutex_enter(&dd->dd_lock);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USED,
dd->dd_phys->dd_used_bytes);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_QUOTA, dd->dd_phys->dd_quota);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_RESERVATION,
dd->dd_phys->dd_reserved);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_COMPRESSRATIO,
dd->dd_phys->dd_compressed_bytes == 0 ? 100 :
(dd->dd_phys->dd_uncompressed_bytes * 100 /
dd->dd_phys->dd_compressed_bytes));
if (dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN) {
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDSNAP,
dd->dd_phys->dd_used_breakdown[DD_USED_SNAP]);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDDS,
dd->dd_phys->dd_used_breakdown[DD_USED_HEAD]);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDREFRESERV,
dd->dd_phys->dd_used_breakdown[DD_USED_REFRSRV]);
dsl_prop_nvlist_add_uint64(nv, ZFS_PROP_USEDCHILD,
dd->dd_phys->dd_used_breakdown[DD_USED_CHILD] +
dd->dd_phys->dd_used_breakdown[DD_USED_CHILD_RSRV]);
}
mutex_exit(&dd->dd_lock);
rw_enter(&dd->dd_pool->dp_config_rwlock, RW_READER);
if (dsl_dir_is_clone(dd)) {
dsl_dataset_t *ds;
char buf[MAXNAMELEN];
VERIFY(0 == dsl_dataset_hold_obj(dd->dd_pool,
dd->dd_phys->dd_origin_obj, FTAG, &ds));
dsl_dataset_name(ds, buf);
dsl_dataset_rele(ds, FTAG);
dsl_prop_nvlist_add_string(nv, ZFS_PROP_ORIGIN, buf);
}
rw_exit(&dd->dd_pool->dp_config_rwlock);
}
void
dsl_dir_dirty(dsl_dir_t *dd, dmu_tx_t *tx)
{
dsl_pool_t *dp = dd->dd_pool;
ASSERT(dd->dd_phys);
if (txg_list_add(&dp->dp_dirty_dirs, dd, tx->tx_txg) == 0) {
/* up the hold count until we can be written out */
dmu_buf_add_ref(dd->dd_dbuf, dd);
}
}
static int64_t
parent_delta(dsl_dir_t *dd, uint64_t used, int64_t delta)
{
uint64_t old_accounted = MAX(used, dd->dd_phys->dd_reserved);
uint64_t new_accounted = MAX(used + delta, dd->dd_phys->dd_reserved);
return (new_accounted - old_accounted);
}
void
dsl_dir_sync(dsl_dir_t *dd, dmu_tx_t *tx)
{
ASSERT(dmu_tx_is_syncing(tx));
dmu_buf_will_dirty(dd->dd_dbuf, tx);
mutex_enter(&dd->dd_lock);
ASSERT3U(dd->dd_tempreserved[tx->tx_txg&TXG_MASK], ==, 0);
dprintf_dd(dd, "txg=%llu towrite=%lluK\n", tx->tx_txg,
dd->dd_space_towrite[tx->tx_txg&TXG_MASK] / 1024);
dd->dd_space_towrite[tx->tx_txg&TXG_MASK] = 0;
mutex_exit(&dd->dd_lock);
/* release the hold from dsl_dir_dirty */
dmu_buf_rele(dd->dd_dbuf, dd);
}
static uint64_t
dsl_dir_space_towrite(dsl_dir_t *dd)
{
uint64_t space = 0;
int i;
ASSERT(MUTEX_HELD(&dd->dd_lock));
for (i = 0; i < TXG_SIZE; i++) {
space += dd->dd_space_towrite[i&TXG_MASK];
ASSERT3U(dd->dd_space_towrite[i&TXG_MASK], >=, 0);
}
return (space);
}
/*
* How much space would dd have available if ancestor had delta applied
* to it? If ondiskonly is set, we're only interested in what's
* on-disk, not estimated pending changes.
*/
uint64_t
dsl_dir_space_available(dsl_dir_t *dd,
dsl_dir_t *ancestor, int64_t delta, int ondiskonly)
{
uint64_t parentspace, myspace, quota, used;
/*
* If there are no restrictions otherwise, assume we have
* unlimited space available.
*/
quota = UINT64_MAX;
parentspace = UINT64_MAX;
if (dd->dd_parent != NULL) {
parentspace = dsl_dir_space_available(dd->dd_parent,
ancestor, delta, ondiskonly);
}
mutex_enter(&dd->dd_lock);
if (dd->dd_phys->dd_quota != 0)
quota = dd->dd_phys->dd_quota;
used = dd->dd_phys->dd_used_bytes;
if (!ondiskonly)
used += dsl_dir_space_towrite(dd);
if (dd->dd_parent == NULL) {
uint64_t poolsize = dsl_pool_adjustedsize(dd->dd_pool, FALSE);
quota = MIN(quota, poolsize);
}
if (dd->dd_phys->dd_reserved > used && parentspace != UINT64_MAX) {
/*
* We have some space reserved, in addition to what our
* parent gave us.
*/
parentspace += dd->dd_phys->dd_reserved - used;
}
if (dd == ancestor) {
ASSERT(delta <= 0);
ASSERT(used >= -delta);
used += delta;
if (parentspace != UINT64_MAX)
parentspace -= delta;
}
if (used > quota) {
/* over quota */
myspace = 0;
/*
* While it's OK to be a little over quota, if
* we think we are using more space than there
* is in the pool (which is already 1.6% more than
* dsl_pool_adjustedsize()), something is very
* wrong.
*/
ASSERT3U(used, <=, spa_get_space(dd->dd_pool->dp_spa));
} else {
/*
* the lesser of the space provided by our parent and
* the space left in our quota
*/
myspace = MIN(parentspace, quota - used);
}
mutex_exit(&dd->dd_lock);
return (myspace);
}
struct tempreserve {
list_node_t tr_node;
dsl_pool_t *tr_dp;
dsl_dir_t *tr_ds;
uint64_t tr_size;
};
static int
dsl_dir_tempreserve_impl(dsl_dir_t *dd, uint64_t asize, boolean_t netfree,
boolean_t ignorequota, boolean_t checkrefquota, list_t *tr_list,
dmu_tx_t *tx, boolean_t first)
{
uint64_t txg = tx->tx_txg;
uint64_t est_inflight, used_on_disk, quota, parent_rsrv;
struct tempreserve *tr;
int enospc = EDQUOT;
int txgidx = txg & TXG_MASK;
int i;
uint64_t ref_rsrv = 0;
ASSERT3U(txg, !=, 0);
ASSERT3S(asize, >, 0);
mutex_enter(&dd->dd_lock);
/*
* Check against the dsl_dir's quota. We don't add in the delta
* when checking for over-quota because they get one free hit.
*/
est_inflight = dsl_dir_space_towrite(dd);
for (i = 0; i < TXG_SIZE; i++)
est_inflight += dd->dd_tempreserved[i];
used_on_disk = dd->dd_phys->dd_used_bytes;
/*
* On the first iteration, fetch the dataset's used-on-disk and
* refreservation values. Also, if checkrefquota is set, test if
* allocating this space would exceed the dataset's refquota.
*/
if (first && tx->tx_objset) {
int error;
dsl_dataset_t *ds = tx->tx_objset->os->os_dsl_dataset;
error = dsl_dataset_check_quota(ds, checkrefquota,
asize, est_inflight, &used_on_disk, &ref_rsrv);
if (error) {
mutex_exit(&dd->dd_lock);
return (error);
}
}
/*
* If this transaction will result in a net free of space,
* we want to let it through.
*/
if (ignorequota || netfree || dd->dd_phys->dd_quota == 0)
quota = UINT64_MAX;
else
quota = dd->dd_phys->dd_quota;
/*
* Adjust the quota against the actual pool size at the root.
* To ensure that it's possible to remove files from a full
* pool without inducing transient overcommits, we throttle
* netfree transactions against a quota that is slightly larger,
* but still within the pool's allocation slop. In cases where
* we're very close to full, this will allow a steady trickle of
* removes to get through.
*/
if (dd->dd_parent == NULL) {
uint64_t poolsize = dsl_pool_adjustedsize(dd->dd_pool, netfree);
if (poolsize < quota) {
quota = poolsize;
enospc = ENOSPC;
}
}
/*
* If they are requesting more space, and our current estimate
* is over quota, they get to try again unless the actual
* on-disk is over quota and there are no pending changes (which
* may free up space for us).
*/
if (used_on_disk + est_inflight > quota) {
if (est_inflight > 0 || used_on_disk < quota)
enospc = ERESTART;
dprintf_dd(dd, "failing: used=%lluK inflight = %lluK "
"quota=%lluK tr=%lluK err=%d\n",
used_on_disk>>10, est_inflight>>10,
quota>>10, asize>>10, enospc);
mutex_exit(&dd->dd_lock);
return (enospc);
}
/* We need to up our estimated delta before dropping dd_lock */
dd->dd_tempreserved[txgidx] += asize;
parent_rsrv = parent_delta(dd, used_on_disk + est_inflight,
asize - ref_rsrv);
mutex_exit(&dd->dd_lock);
tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
tr->tr_ds = dd;
tr->tr_size = asize;
list_insert_tail(tr_list, tr);
/* see if it's OK with our parent */
if (dd->dd_parent && parent_rsrv) {
boolean_t ismos = (dd->dd_phys->dd_head_dataset_obj == 0);
return (dsl_dir_tempreserve_impl(dd->dd_parent,
parent_rsrv, netfree, ismos, TRUE, tr_list, tx, FALSE));
} else {
return (0);
}
}
/*
* Reserve space in this dsl_dir, to be used in this tx's txg.
* After the space has been dirtied (and dsl_dir_willuse_space()
* has been called), the reservation should be canceled, using
* dsl_dir_tempreserve_clear().
*/
int
dsl_dir_tempreserve_space(dsl_dir_t *dd, uint64_t lsize, uint64_t asize,
uint64_t fsize, uint64_t usize, void **tr_cookiep, dmu_tx_t *tx)
{
int err;
list_t *tr_list;
if (asize == 0) {
*tr_cookiep = NULL;
return (0);
}
tr_list = kmem_alloc(sizeof (list_t), KM_SLEEP);
list_create(tr_list, sizeof (struct tempreserve),
offsetof(struct tempreserve, tr_node));
ASSERT3S(asize, >, 0);
ASSERT3S(fsize, >=, 0);
err = arc_tempreserve_space(lsize, tx->tx_txg);
if (err == 0) {
struct tempreserve *tr;
tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
tr->tr_size = lsize;
list_insert_tail(tr_list, tr);
err = dsl_pool_tempreserve_space(dd->dd_pool, asize, tx);
} else {
if (err == EAGAIN) {
txg_delay(dd->dd_pool, tx->tx_txg, 1);
err = ERESTART;
}
dsl_pool_memory_pressure(dd->dd_pool);
}
if (err == 0) {
struct tempreserve *tr;
tr = kmem_zalloc(sizeof (struct tempreserve), KM_SLEEP);
tr->tr_dp = dd->dd_pool;
tr->tr_size = asize;
list_insert_tail(tr_list, tr);
err = dsl_dir_tempreserve_impl(dd, asize, fsize >= asize,
FALSE, asize > usize, tr_list, tx, TRUE);
}
if (err)
dsl_dir_tempreserve_clear(tr_list, tx);
else
*tr_cookiep = tr_list;
return (err);
}
/*
* Clear a temporary reservation that we previously made with
* dsl_dir_tempreserve_space().
*/
void
dsl_dir_tempreserve_clear(void *tr_cookie, dmu_tx_t *tx)
{
int txgidx = tx->tx_txg & TXG_MASK;
list_t *tr_list = tr_cookie;
struct tempreserve *tr;
ASSERT3U(tx->tx_txg, !=, 0);
if (tr_cookie == NULL)
return;
while (tr = list_head(tr_list)) {
if (tr->tr_dp) {
dsl_pool_tempreserve_clear(tr->tr_dp, tr->tr_size, tx);
} else if (tr->tr_ds) {
mutex_enter(&tr->tr_ds->dd_lock);
ASSERT3U(tr->tr_ds->dd_tempreserved[txgidx], >=,
tr->tr_size);
tr->tr_ds->dd_tempreserved[txgidx] -= tr->tr_size;
mutex_exit(&tr->tr_ds->dd_lock);
} else {
arc_tempreserve_clear(tr->tr_size);
}
list_remove(tr_list, tr);
kmem_free(tr, sizeof (struct tempreserve));
}
kmem_free(tr_list, sizeof (list_t));
}
static void
dsl_dir_willuse_space_impl(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx)
{
int64_t parent_space;
uint64_t est_used;
mutex_enter(&dd->dd_lock);
if (space > 0)
dd->dd_space_towrite[tx->tx_txg & TXG_MASK] += space;
est_used = dsl_dir_space_towrite(dd) + dd->dd_phys->dd_used_bytes;
parent_space = parent_delta(dd, est_used, space);
mutex_exit(&dd->dd_lock);
/* Make sure that we clean up dd_space_to* */
dsl_dir_dirty(dd, tx);
/* XXX this is potentially expensive and unnecessary... */
if (parent_space && dd->dd_parent)
dsl_dir_willuse_space_impl(dd->dd_parent, parent_space, tx);
}
/*
* Call in open context when we think we're going to write/free space,
* eg. when dirtying data. Be conservative (ie. OK to write less than
* this or free more than this, but don't write more or free less).
*/
void
dsl_dir_willuse_space(dsl_dir_t *dd, int64_t space, dmu_tx_t *tx)
{
dsl_pool_willuse_space(dd->dd_pool, space, tx);
dsl_dir_willuse_space_impl(dd, space, tx);
}
/* call from syncing context when we actually write/free space for this dd */
void
dsl_dir_diduse_space(dsl_dir_t *dd, dd_used_t type,
int64_t used, int64_t compressed, int64_t uncompressed, dmu_tx_t *tx)
{
int64_t accounted_delta;
boolean_t needlock = !MUTEX_HELD(&dd->dd_lock);
ASSERT(dmu_tx_is_syncing(tx));
ASSERT(type < DD_USED_NUM);
dsl_dir_dirty(dd, tx);
if (needlock)
mutex_enter(&dd->dd_lock);
accounted_delta = parent_delta(dd, dd->dd_phys->dd_used_bytes, used);
ASSERT(used >= 0 || dd->dd_phys->dd_used_bytes >= -used);
ASSERT(compressed >= 0 ||
dd->dd_phys->dd_compressed_bytes >= -compressed);
ASSERT(uncompressed >= 0 ||
dd->dd_phys->dd_uncompressed_bytes >= -uncompressed);
dd->dd_phys->dd_used_bytes += used;
dd->dd_phys->dd_uncompressed_bytes += uncompressed;
dd->dd_phys->dd_compressed_bytes += compressed;
if (dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN) {
ASSERT(used > 0 ||
dd->dd_phys->dd_used_breakdown[type] >= -used);
dd->dd_phys->dd_used_breakdown[type] += used;
#ifdef DEBUG
dd_used_t t;
uint64_t u = 0;
for (t = 0; t < DD_USED_NUM; t++)
u += dd->dd_phys->dd_used_breakdown[t];
ASSERT3U(u, ==, dd->dd_phys->dd_used_bytes);
#endif
}
if (needlock)
mutex_exit(&dd->dd_lock);
if (dd->dd_parent != NULL) {
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD,
accounted_delta, compressed, uncompressed, tx);
dsl_dir_transfer_space(dd->dd_parent,
used - accounted_delta,
DD_USED_CHILD_RSRV, DD_USED_CHILD, tx);
}
}
void
dsl_dir_transfer_space(dsl_dir_t *dd, int64_t delta,
dd_used_t oldtype, dd_used_t newtype, dmu_tx_t *tx)
{
boolean_t needlock = !MUTEX_HELD(&dd->dd_lock);
ASSERT(dmu_tx_is_syncing(tx));
ASSERT(oldtype < DD_USED_NUM);
ASSERT(newtype < DD_USED_NUM);
if (delta == 0 || !(dd->dd_phys->dd_flags & DD_FLAG_USED_BREAKDOWN))
return;
dsl_dir_dirty(dd, tx);
if (needlock)
mutex_enter(&dd->dd_lock);
ASSERT(delta > 0 ?
dd->dd_phys->dd_used_breakdown[oldtype] >= delta :
dd->dd_phys->dd_used_breakdown[newtype] >= -delta);
ASSERT(dd->dd_phys->dd_used_bytes >= ABS(delta));
dd->dd_phys->dd_used_breakdown[oldtype] -= delta;
dd->dd_phys->dd_used_breakdown[newtype] += delta;
if (needlock)
mutex_exit(&dd->dd_lock);
}
static int
dsl_dir_set_quota_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *quotap = arg2;
uint64_t new_quota = *quotap;
int err = 0;
uint64_t towrite;
if (new_quota == 0)
return (0);
mutex_enter(&dd->dd_lock);
/*
* If we are doing the preliminary check in open context, and
* there are pending changes, then don't fail it, since the
* pending changes could under-estimate the amount of space to be
* freed up.
*/
towrite = dsl_dir_space_towrite(dd);
if ((dmu_tx_is_syncing(tx) || towrite == 0) &&
(new_quota < dd->dd_phys->dd_reserved ||
new_quota < dd->dd_phys->dd_used_bytes + towrite)) {
err = ENOSPC;
}
mutex_exit(&dd->dd_lock);
return (err);
}
/* ARGSUSED */
static void
dsl_dir_set_quota_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *quotap = arg2;
uint64_t new_quota = *quotap;
dmu_buf_will_dirty(dd->dd_dbuf, tx);
mutex_enter(&dd->dd_lock);
dd->dd_phys->dd_quota = new_quota;
mutex_exit(&dd->dd_lock);
spa_history_internal_log(LOG_DS_QUOTA, dd->dd_pool->dp_spa,
tx, cr, "%lld dataset = %llu ",
(longlong_t)new_quota, dd->dd_phys->dd_head_dataset_obj);
}
int
dsl_dir_set_quota(const char *ddname, uint64_t quota)
{
dsl_dir_t *dd;
int err;
err = dsl_dir_open(ddname, FTAG, &dd, NULL);
if (err)
return (err);
if (quota != dd->dd_phys->dd_quota) {
/*
* If someone removes a file, then tries to set the quota, we
* want to make sure the file freeing takes effect.
*/
txg_wait_open(dd->dd_pool, 0);
err = dsl_sync_task_do(dd->dd_pool, dsl_dir_set_quota_check,
dsl_dir_set_quota_sync, dd, "a, 0);
}
dsl_dir_close(dd, FTAG);
return (err);
}
int
dsl_dir_set_reservation_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *reservationp = arg2;
uint64_t new_reservation = *reservationp;
uint64_t used, avail;
int64_t delta;
if (new_reservation > INT64_MAX)
return (EOVERFLOW);
/*
* If we are doing the preliminary check in open context, the
* space estimates may be inaccurate.
*/
if (!dmu_tx_is_syncing(tx))
return (0);
mutex_enter(&dd->dd_lock);
used = dd->dd_phys->dd_used_bytes;
delta = MAX(used, new_reservation) -
MAX(used, dd->dd_phys->dd_reserved);
mutex_exit(&dd->dd_lock);
if (dd->dd_parent) {
avail = dsl_dir_space_available(dd->dd_parent,
NULL, 0, FALSE);
} else {
avail = dsl_pool_adjustedsize(dd->dd_pool, B_FALSE) - used;
}
if (delta > 0 && delta > avail)
return (ENOSPC);
if (delta > 0 && dd->dd_phys->dd_quota > 0 &&
new_reservation > dd->dd_phys->dd_quota)
return (ENOSPC);
return (0);
}
/* ARGSUSED */
static void
dsl_dir_set_reservation_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
uint64_t *reservationp = arg2;
uint64_t new_reservation = *reservationp;
uint64_t used;
int64_t delta;
dmu_buf_will_dirty(dd->dd_dbuf, tx);
mutex_enter(&dd->dd_lock);
used = dd->dd_phys->dd_used_bytes;
delta = MAX(used, new_reservation) -
MAX(used, dd->dd_phys->dd_reserved);
dd->dd_phys->dd_reserved = new_reservation;
if (dd->dd_parent != NULL) {
/* Roll up this additional usage into our ancestors */
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD_RSRV,
delta, 0, 0, tx);
}
mutex_exit(&dd->dd_lock);
spa_history_internal_log(LOG_DS_RESERVATION, dd->dd_pool->dp_spa,
tx, cr, "%lld dataset = %llu",
(longlong_t)new_reservation, dd->dd_phys->dd_head_dataset_obj);
}
int
dsl_dir_set_reservation(const char *ddname, uint64_t reservation)
{
dsl_dir_t *dd;
int err;
err = dsl_dir_open(ddname, FTAG, &dd, NULL);
if (err)
return (err);
err = dsl_sync_task_do(dd->dd_pool, dsl_dir_set_reservation_check,
dsl_dir_set_reservation_sync, dd, &reservation, 0);
dsl_dir_close(dd, FTAG);
return (err);
}
static dsl_dir_t *
closest_common_ancestor(dsl_dir_t *ds1, dsl_dir_t *ds2)
{
for (; ds1; ds1 = ds1->dd_parent) {
dsl_dir_t *dd;
for (dd = ds2; dd; dd = dd->dd_parent) {
if (ds1 == dd)
return (dd);
}
}
return (NULL);
}
/*
* If delta is applied to dd, how much of that delta would be applied to
* ancestor? Syncing context only.
*/
static int64_t
would_change(dsl_dir_t *dd, int64_t delta, dsl_dir_t *ancestor)
{
if (dd == ancestor)
return (delta);
mutex_enter(&dd->dd_lock);
delta = parent_delta(dd, dd->dd_phys->dd_used_bytes, delta);
mutex_exit(&dd->dd_lock);
return (would_change(dd->dd_parent, delta, ancestor));
}
struct renamearg {
dsl_dir_t *newparent;
const char *mynewname;
};
/*ARGSUSED*/
static int
dsl_dir_rename_check(void *arg1, void *arg2, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
struct renamearg *ra = arg2;
dsl_pool_t *dp = dd->dd_pool;
objset_t *mos = dp->dp_meta_objset;
int err;
uint64_t val;
/* There should be 2 references: the open and the dirty */
if (dmu_buf_refcount(dd->dd_dbuf) > 2)
return (EBUSY);
/* check for existing name */
err = zap_lookup(mos, ra->newparent->dd_phys->dd_child_dir_zapobj,
ra->mynewname, 8, 1, &val);
if (err == 0)
return (EEXIST);
if (err != ENOENT)
return (err);
if (ra->newparent != dd->dd_parent) {
/* is there enough space? */
uint64_t myspace =
MAX(dd->dd_phys->dd_used_bytes, dd->dd_phys->dd_reserved);
/* no rename into our descendant */
if (closest_common_ancestor(dd, ra->newparent) == dd)
return (EINVAL);
if (err = dsl_dir_transfer_possible(dd->dd_parent,
ra->newparent, myspace))
return (err);
}
return (0);
}
static void
dsl_dir_rename_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_dir_t *dd = arg1;
struct renamearg *ra = arg2;
dsl_pool_t *dp = dd->dd_pool;
objset_t *mos = dp->dp_meta_objset;
int err;
ASSERT(dmu_buf_refcount(dd->dd_dbuf) <= 2);
if (ra->newparent != dd->dd_parent) {
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD,
-dd->dd_phys->dd_used_bytes,
-dd->dd_phys->dd_compressed_bytes,
-dd->dd_phys->dd_uncompressed_bytes, tx);
dsl_dir_diduse_space(ra->newparent, DD_USED_CHILD,
dd->dd_phys->dd_used_bytes,
dd->dd_phys->dd_compressed_bytes,
dd->dd_phys->dd_uncompressed_bytes, tx);
if (dd->dd_phys->dd_reserved > dd->dd_phys->dd_used_bytes) {
uint64_t unused_rsrv = dd->dd_phys->dd_reserved -
dd->dd_phys->dd_used_bytes;
dsl_dir_diduse_space(dd->dd_parent, DD_USED_CHILD_RSRV,
-unused_rsrv, 0, 0, tx);
dsl_dir_diduse_space(ra->newparent, DD_USED_CHILD_RSRV,
unused_rsrv, 0, 0, tx);
}
}
dmu_buf_will_dirty(dd->dd_dbuf, tx);
/* remove from old parent zapobj */
err = zap_remove(mos, dd->dd_parent->dd_phys->dd_child_dir_zapobj,
dd->dd_myname, tx);
ASSERT3U(err, ==, 0);
(void) strcpy(dd->dd_myname, ra->mynewname);
dsl_dir_close(dd->dd_parent, dd);
dd->dd_phys->dd_parent_obj = ra->newparent->dd_object;
VERIFY(0 == dsl_dir_open_obj(dd->dd_pool,
ra->newparent->dd_object, NULL, dd, &dd->dd_parent));
/* add to new parent zapobj */
err = zap_add(mos, ra->newparent->dd_phys->dd_child_dir_zapobj,
dd->dd_myname, 8, 1, &dd->dd_object, tx);
ASSERT3U(err, ==, 0);
spa_history_internal_log(LOG_DS_RENAME, dd->dd_pool->dp_spa,
tx, cr, "dataset = %llu", dd->dd_phys->dd_head_dataset_obj);
}
int
dsl_dir_rename(dsl_dir_t *dd, const char *newname)
{
struct renamearg ra;
int err;
/* new parent should exist */
err = dsl_dir_open(newname, FTAG, &ra.newparent, &ra.mynewname);
if (err)
return (err);
/* can't rename to different pool */
if (dd->dd_pool != ra.newparent->dd_pool) {
err = ENXIO;
goto out;
}
/* new name should not already exist */
if (ra.mynewname == NULL) {
err = EEXIST;
goto out;
}
err = dsl_sync_task_do(dd->dd_pool,
dsl_dir_rename_check, dsl_dir_rename_sync, dd, &ra, 3);
out:
dsl_dir_close(ra.newparent, FTAG);
return (err);
}
int
dsl_dir_transfer_possible(dsl_dir_t *sdd, dsl_dir_t *tdd, uint64_t space)
{
dsl_dir_t *ancestor;
int64_t adelta;
uint64_t avail;
ancestor = closest_common_ancestor(sdd, tdd);
adelta = would_change(sdd, -space, ancestor);
avail = dsl_dir_space_available(tdd, ancestor, adelta, FALSE);
if (avail < space)
return (ENOSPC);
return (0);
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scrub.c (revision 209274)
@@ -1,1025 +1,1027 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/dsl_pool.h>
#include <sys/dsl_dataset.h>
#include <sys/dsl_prop.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_synctask.h>
#include <sys/dnode.h>
#include <sys/dmu_tx.h>
#include <sys/dmu_objset.h>
#include <sys/arc.h>
#include <sys/zap.h>
#include <sys/zio.h>
#include <sys/zfs_context.h>
#include <sys/fs/zfs.h>
#include <sys/zfs_znode.h>
#include <sys/spa_impl.h>
#include <sys/vdev_impl.h>
#include <sys/zil_impl.h>
typedef int (scrub_cb_t)(dsl_pool_t *, const blkptr_t *, const zbookmark_t *);
static scrub_cb_t dsl_pool_scrub_clean_cb;
static dsl_syncfunc_t dsl_pool_scrub_cancel_sync;
int zfs_scrub_min_time = 1; /* scrub for at least 1 sec each txg */
int zfs_resilver_min_time = 3; /* resilver for at least 3 sec each txg */
boolean_t zfs_no_scrub_io = B_FALSE; /* set to disable scrub i/o */
extern int zfs_txg_timeout;
static scrub_cb_t *scrub_funcs[SCRUB_FUNC_NUMFUNCS] = {
NULL,
dsl_pool_scrub_clean_cb
};
#define SET_BOOKMARK(zb, objset, object, level, blkid) \
{ \
(zb)->zb_objset = objset; \
(zb)->zb_object = object; \
(zb)->zb_level = level; \
(zb)->zb_blkid = blkid; \
}
/* ARGSUSED */
static void
dsl_pool_scrub_setup_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_pool_t *dp = arg1;
enum scrub_func *funcp = arg2;
dmu_object_type_t ot = 0;
boolean_t complete = B_FALSE;
dsl_pool_scrub_cancel_sync(dp, &complete, cr, tx);
ASSERT(dp->dp_scrub_func == SCRUB_FUNC_NONE);
ASSERT(*funcp > SCRUB_FUNC_NONE);
ASSERT(*funcp < SCRUB_FUNC_NUMFUNCS);
dp->dp_scrub_min_txg = 0;
dp->dp_scrub_max_txg = tx->tx_txg;
if (*funcp == SCRUB_FUNC_CLEAN) {
vdev_t *rvd = dp->dp_spa->spa_root_vdev;
/* rewrite all disk labels */
vdev_config_dirty(rvd);
if (vdev_resilver_needed(rvd,
&dp->dp_scrub_min_txg, &dp->dp_scrub_max_txg)) {
spa_event_notify(dp->dp_spa, NULL,
ESC_ZFS_RESILVER_START);
dp->dp_scrub_max_txg = MIN(dp->dp_scrub_max_txg,
tx->tx_txg);
}
/* zero out the scrub stats in all vdev_stat_t's */
vdev_scrub_stat_update(rvd,
dp->dp_scrub_min_txg ? POOL_SCRUB_RESILVER :
POOL_SCRUB_EVERYTHING, B_FALSE);
dp->dp_spa->spa_scrub_started = B_TRUE;
}
/* back to the generic stuff */
if (dp->dp_blkstats == NULL) {
dp->dp_blkstats =
kmem_alloc(sizeof (zfs_all_blkstats_t), KM_SLEEP);
}
bzero(dp->dp_blkstats, sizeof (zfs_all_blkstats_t));
if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB)
ot = DMU_OT_ZAP_OTHER;
dp->dp_scrub_func = *funcp;
dp->dp_scrub_queue_obj = zap_create(dp->dp_meta_objset,
ot ? ot : DMU_OT_SCRUB_QUEUE, DMU_OT_NONE, 0, tx);
bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));
dp->dp_scrub_restart = B_FALSE;
dp->dp_spa->spa_scrub_errors = 0;
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_FUNC, sizeof (uint32_t), 1,
&dp->dp_scrub_func, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_QUEUE, sizeof (uint64_t), 1,
&dp->dp_scrub_queue_obj, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MIN_TXG, sizeof (uint64_t), 1,
&dp->dp_scrub_min_txg, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MAX_TXG, sizeof (uint64_t), 1,
&dp->dp_scrub_max_txg, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4,
&dp->dp_scrub_bookmark, tx));
VERIFY(0 == zap_add(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_ERRORS, sizeof (uint64_t), 1,
&dp->dp_spa->spa_scrub_errors, tx));
spa_history_internal_log(LOG_POOL_SCRUB, dp->dp_spa, tx, cr,
"func=%u mintxg=%llu maxtxg=%llu",
*funcp, dp->dp_scrub_min_txg, dp->dp_scrub_max_txg);
}
int
dsl_pool_scrub_setup(dsl_pool_t *dp, enum scrub_func func)
{
return (dsl_sync_task_do(dp, NULL,
dsl_pool_scrub_setup_sync, dp, &func, 0));
}
/* ARGSUSED */
static void
dsl_pool_scrub_cancel_sync(void *arg1, void *arg2, cred_t *cr, dmu_tx_t *tx)
{
dsl_pool_t *dp = arg1;
boolean_t *completep = arg2;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
mutex_enter(&dp->dp_scrub_cancel_lock);
if (dp->dp_scrub_restart) {
dp->dp_scrub_restart = B_FALSE;
*completep = B_FALSE;
}
/* XXX this is scrub-clean specific */
mutex_enter(&dp->dp_spa->spa_scrub_lock);
while (dp->dp_spa->spa_scrub_inflight > 0) {
cv_wait(&dp->dp_spa->spa_scrub_io_cv,
&dp->dp_spa->spa_scrub_lock);
}
mutex_exit(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_started = B_FALSE;
dp->dp_spa->spa_scrub_active = B_FALSE;
dp->dp_scrub_func = SCRUB_FUNC_NONE;
VERIFY(0 == dmu_object_free(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, tx));
dp->dp_scrub_queue_obj = 0;
bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_QUEUE, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MIN_TXG, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_MAX_TXG, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_BOOKMARK, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_FUNC, tx));
VERIFY(0 == zap_remove(dp->dp_meta_objset, DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_ERRORS, tx));
spa_history_internal_log(LOG_POOL_SCRUB_DONE, dp->dp_spa, tx, cr,
"complete=%u", *completep);
/* below is scrub-clean specific */
vdev_scrub_stat_update(dp->dp_spa->spa_root_vdev, POOL_SCRUB_NONE,
*completep);
/*
* If the scrub/resilver completed, update all DTLs to reflect this.
* Whether it succeeded or not, vacate all temporary scrub DTLs.
*/
vdev_dtl_reassess(dp->dp_spa->spa_root_vdev, tx->tx_txg,
*completep ? dp->dp_scrub_max_txg : 0, B_TRUE);
if (dp->dp_scrub_min_txg && *completep)
spa_event_notify(dp->dp_spa, NULL, ESC_ZFS_RESILVER_FINISH);
spa_errlog_rotate(dp->dp_spa);
/*
* We may have finished replacing a device.
* Let the async thread assess this and handle the detach.
*/
spa_async_request(dp->dp_spa, SPA_ASYNC_RESILVER_DONE);
dp->dp_scrub_min_txg = dp->dp_scrub_max_txg = 0;
mutex_exit(&dp->dp_scrub_cancel_lock);
}
int
dsl_pool_scrub_cancel(dsl_pool_t *dp)
{
boolean_t complete = B_FALSE;
return (dsl_sync_task_do(dp, NULL,
dsl_pool_scrub_cancel_sync, dp, &complete, 3));
}
int
dsl_free(zio_t *pio, dsl_pool_t *dp, uint64_t txg, const blkptr_t *bpp,
zio_done_func_t *done, void *private, uint32_t arc_flags)
{
/*
* This function will be used by bp-rewrite wad to intercept frees.
*/
return (arc_free(pio, dp->dp_spa, txg, (blkptr_t *)bpp,
done, private, arc_flags));
}
static boolean_t
bookmark_is_zero(const zbookmark_t *zb)
{
return (zb->zb_objset == 0 && zb->zb_object == 0 &&
zb->zb_level == 0 && zb->zb_blkid == 0);
}
/* dnp is the dnode for zb1->zb_object */
static boolean_t
bookmark_is_before(dnode_phys_t *dnp, const zbookmark_t *zb1,
const zbookmark_t *zb2)
{
uint64_t zb1nextL0, zb2thisobj;
ASSERT(zb1->zb_objset == zb2->zb_objset);
ASSERT(zb1->zb_object != -1ULL);
ASSERT(zb2->zb_level == 0);
/*
* A bookmark in the deadlist is considered to be after
* everything else.
*/
if (zb2->zb_object == -1ULL)
return (B_TRUE);
/* The objset_phys_t isn't before anything. */
if (dnp == NULL)
return (B_FALSE);
zb1nextL0 = (zb1->zb_blkid + 1) <<
((zb1->zb_level) * (dnp->dn_indblkshift - SPA_BLKPTRSHIFT));
zb2thisobj = zb2->zb_object ? zb2->zb_object :
zb2->zb_blkid << (DNODE_BLOCK_SHIFT - DNODE_SHIFT);
if (zb1->zb_object == 0) {
uint64_t nextobj = zb1nextL0 *
(dnp->dn_datablkszsec << SPA_MINBLOCKSHIFT) >> DNODE_SHIFT;
return (nextobj <= zb2thisobj);
}
if (zb1->zb_object < zb2thisobj)
return (B_TRUE);
if (zb1->zb_object > zb2thisobj)
return (B_FALSE);
if (zb2->zb_object == 0)
return (B_FALSE);
return (zb1nextL0 <= zb2->zb_blkid);
}
static boolean_t
scrub_pause(dsl_pool_t *dp, const zbookmark_t *zb)
{
int elapsed_ticks;
int mintime;
if (dp->dp_scrub_pausing)
return (B_TRUE); /* we're already pausing */
if (!bookmark_is_zero(&dp->dp_scrub_bookmark))
return (B_FALSE); /* we're resuming */
/* We only know how to resume from level-0 blocks. */
if (zb->zb_level != 0)
return (B_FALSE);
mintime = dp->dp_scrub_isresilver ? zfs_resilver_min_time :
zfs_scrub_min_time;
elapsed_ticks = lbolt64 - dp->dp_scrub_start_time;
if (elapsed_ticks > hz * zfs_txg_timeout ||
(elapsed_ticks > hz * mintime && txg_sync_waiting(dp))) {
dprintf("pausing at %llx/%llx/%llx/%llx\n",
(longlong_t)zb->zb_objset, (longlong_t)zb->zb_object,
(longlong_t)zb->zb_level, (longlong_t)zb->zb_blkid);
dp->dp_scrub_pausing = B_TRUE;
dp->dp_scrub_bookmark = *zb;
return (B_TRUE);
}
return (B_FALSE);
}
typedef struct zil_traverse_arg {
dsl_pool_t *zta_dp;
zil_header_t *zta_zh;
} zil_traverse_arg_t;
/* ARGSUSED */
static void
traverse_zil_block(zilog_t *zilog, blkptr_t *bp, void *arg, uint64_t claim_txg)
{
zil_traverse_arg_t *zta = arg;
dsl_pool_t *dp = zta->zta_dp;
zil_header_t *zh = zta->zta_zh;
zbookmark_t zb;
if (bp->blk_birth <= dp->dp_scrub_min_txg)
return;
/*
* One block ("stumpy") can be allocated a long time ago; we
* want to visit that one because it has been allocated
* (on-disk) even if it hasn't been claimed (even though for
* plain scrub there's nothing to do to it).
*/
if (claim_txg == 0 && bp->blk_birth >= spa_first_txg(dp->dp_spa))
return;
zb.zb_objset = zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET];
zb.zb_object = 0;
zb.zb_level = -1;
zb.zb_blkid = bp->blk_cksum.zc_word[ZIL_ZC_SEQ];
VERIFY(0 == scrub_funcs[dp->dp_scrub_func](dp, bp, &zb));
}
/* ARGSUSED */
static void
traverse_zil_record(zilog_t *zilog, lr_t *lrc, void *arg, uint64_t claim_txg)
{
if (lrc->lrc_txtype == TX_WRITE) {
zil_traverse_arg_t *zta = arg;
dsl_pool_t *dp = zta->zta_dp;
zil_header_t *zh = zta->zta_zh;
lr_write_t *lr = (lr_write_t *)lrc;
blkptr_t *bp = &lr->lr_blkptr;
zbookmark_t zb;
if (bp->blk_birth <= dp->dp_scrub_min_txg)
return;
/*
* birth can be < claim_txg if this record's txg is
* already txg sync'ed (but this log block contains
* other records that are not synced)
*/
if (claim_txg == 0 || bp->blk_birth < claim_txg)
return;
zb.zb_objset = zh->zh_log.blk_cksum.zc_word[ZIL_ZC_OBJSET];
zb.zb_object = lr->lr_foid;
zb.zb_level = BP_GET_LEVEL(bp);
zb.zb_blkid = lr->lr_offset / BP_GET_LSIZE(bp);
VERIFY(0 == scrub_funcs[dp->dp_scrub_func](dp, bp, &zb));
}
}
static void
traverse_zil(dsl_pool_t *dp, zil_header_t *zh)
{
uint64_t claim_txg = zh->zh_claim_txg;
zil_traverse_arg_t zta = { dp, zh };
zilog_t *zilog;
/*
* We only want to visit blocks that have been claimed but not yet
* replayed (or, in read-only mode, blocks that *would* be claimed).
*/
if (claim_txg == 0 && (spa_mode & FWRITE))
return;
zilog = zil_alloc(dp->dp_meta_objset, zh);
(void) zil_parse(zilog, traverse_zil_block, traverse_zil_record, &zta,
claim_txg);
zil_free(zilog);
}
static void
scrub_visitbp(dsl_pool_t *dp, dnode_phys_t *dnp,
arc_buf_t *pbuf, blkptr_t *bp, const zbookmark_t *zb)
{
int err;
arc_buf_t *buf = NULL;
if (bp->blk_birth == 0)
return;
if (bp->blk_birth <= dp->dp_scrub_min_txg)
return;
if (scrub_pause(dp, zb))
return;
if (!bookmark_is_zero(&dp->dp_scrub_bookmark)) {
/*
* If we already visited this bp & everything below (in
* a prior txg), don't bother doing it again.
*/
if (bookmark_is_before(dnp, zb, &dp->dp_scrub_bookmark))
return;
/*
* If we found the block we're trying to resume from, or
* we went past it to a different object, zero it out to
* indicate that it's OK to start checking for pausing
* again.
*/
if (bcmp(zb, &dp->dp_scrub_bookmark, sizeof (*zb)) == 0 ||
zb->zb_object > dp->dp_scrub_bookmark.zb_object) {
dprintf("resuming at %llx/%llx/%llx/%llx\n",
(longlong_t)zb->zb_objset,
(longlong_t)zb->zb_object,
(longlong_t)zb->zb_level,
(longlong_t)zb->zb_blkid);
bzero(&dp->dp_scrub_bookmark, sizeof (*zb));
}
}
if (BP_GET_LEVEL(bp) > 0) {
uint32_t flags = ARC_WAIT;
int i;
blkptr_t *cbp;
int epb = BP_GET_LSIZE(bp) >> SPA_BLKPTRSHIFT;
err = arc_read(NULL, dp->dp_spa, bp, pbuf,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err) {
mutex_enter(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_errors++;
mutex_exit(&dp->dp_spa->spa_scrub_lock);
return;
}
cbp = buf->b_data;
for (i = 0; i < epb; i++, cbp++) {
zbookmark_t czb;
SET_BOOKMARK(&czb, zb->zb_objset, zb->zb_object,
zb->zb_level - 1,
zb->zb_blkid * epb + i);
scrub_visitbp(dp, dnp, buf, cbp, &czb);
}
} else if (BP_GET_TYPE(bp) == DMU_OT_DNODE) {
uint32_t flags = ARC_WAIT;
dnode_phys_t *child_dnp;
int i, j;
int epb = BP_GET_LSIZE(bp) >> DNODE_SHIFT;
err = arc_read(NULL, dp->dp_spa, bp, pbuf,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err) {
mutex_enter(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_errors++;
mutex_exit(&dp->dp_spa->spa_scrub_lock);
return;
}
child_dnp = buf->b_data;
for (i = 0; i < epb; i++, child_dnp++) {
for (j = 0; j < child_dnp->dn_nblkptr; j++) {
zbookmark_t czb;
SET_BOOKMARK(&czb, zb->zb_objset,
zb->zb_blkid * epb + i,
child_dnp->dn_nlevels - 1, j);
scrub_visitbp(dp, child_dnp, buf,
&child_dnp->dn_blkptr[j], &czb);
}
}
} else if (BP_GET_TYPE(bp) == DMU_OT_OBJSET) {
uint32_t flags = ARC_WAIT;
objset_phys_t *osp;
int j;
err = arc_read_nolock(NULL, dp->dp_spa, bp,
arc_getbuf_func, &buf,
ZIO_PRIORITY_ASYNC_READ, ZIO_FLAG_CANFAIL, &flags, zb);
if (err) {
mutex_enter(&dp->dp_spa->spa_scrub_lock);
dp->dp_spa->spa_scrub_errors++;
mutex_exit(&dp->dp_spa->spa_scrub_lock);
return;
}
osp = buf->b_data;
traverse_zil(dp, &osp->os_zil_header);
for (j = 0; j < osp->os_meta_dnode.dn_nblkptr; j++) {
zbookmark_t czb;
SET_BOOKMARK(&czb, zb->zb_objset, 0,
osp->os_meta_dnode.dn_nlevels - 1, j);
scrub_visitbp(dp, &osp->os_meta_dnode, buf,
&osp->os_meta_dnode.dn_blkptr[j], &czb);
}
}
(void) scrub_funcs[dp->dp_scrub_func](dp, bp, zb);
if (buf)
(void) arc_buf_remove_ref(buf, &buf);
}
static void
scrub_visit_rootbp(dsl_pool_t *dp, dsl_dataset_t *ds, blkptr_t *bp)
{
zbookmark_t zb;
SET_BOOKMARK(&zb, ds ? ds->ds_object : 0, 0, -1, 0);
scrub_visitbp(dp, NULL, NULL, bp, &zb);
}
void
dsl_pool_ds_destroyed(dsl_dataset_t *ds, dmu_tx_t *tx)
{
dsl_pool_t *dp = ds->ds_dir->dd_pool;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
if (dp->dp_scrub_bookmark.zb_objset == ds->ds_object) {
SET_BOOKMARK(&dp->dp_scrub_bookmark, -1, 0, 0, 0);
} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, tx) != 0) {
return;
}
if (ds->ds_phys->ds_next_snap_obj != 0) {
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_phys->ds_next_snap_obj, tx) == 0);
}
ASSERT3U(ds->ds_phys->ds_num_children, <=, 1);
}
void
dsl_pool_ds_snapshotted(dsl_dataset_t *ds, dmu_tx_t *tx)
{
dsl_pool_t *dp = ds->ds_dir->dd_pool;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
ASSERT(ds->ds_phys->ds_prev_snap_obj != 0);
if (dp->dp_scrub_bookmark.zb_objset == ds->ds_object) {
dp->dp_scrub_bookmark.zb_objset =
ds->ds_phys->ds_prev_snap_obj;
} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, tx) == 0) {
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_phys->ds_prev_snap_obj, tx) == 0);
}
}
void
dsl_pool_ds_clone_swapped(dsl_dataset_t *ds1, dsl_dataset_t *ds2, dmu_tx_t *tx)
{
dsl_pool_t *dp = ds1->ds_dir->dd_pool;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
if (dp->dp_scrub_bookmark.zb_objset == ds1->ds_object) {
dp->dp_scrub_bookmark.zb_objset = ds2->ds_object;
} else if (dp->dp_scrub_bookmark.zb_objset == ds2->ds_object) {
dp->dp_scrub_bookmark.zb_objset = ds1->ds_object;
}
if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds1->ds_object, tx) == 0) {
int err = zap_add_int(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, ds2->ds_object, tx);
VERIFY(err == 0 || err == EEXIST);
if (err == EEXIST) {
/* Both were there to begin with */
VERIFY(0 == zap_add_int(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, ds1->ds_object, tx));
}
} else if (zap_remove_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds2->ds_object, tx) == 0) {
VERIFY(0 == zap_add_int(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, ds1->ds_object, tx));
}
}
struct enqueue_clones_arg {
dmu_tx_t *tx;
uint64_t originobj;
};
/* ARGSUSED */
static int
enqueue_clones_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
{
struct enqueue_clones_arg *eca = arg;
dsl_dataset_t *ds;
int err;
dsl_pool_t *dp;
err = dsl_dataset_hold_obj(spa->spa_dsl_pool, dsobj, FTAG, &ds);
if (err)
return (err);
dp = ds->ds_dir->dd_pool;
if (ds->ds_dir->dd_phys->dd_origin_obj == eca->originobj) {
while (ds->ds_phys->ds_prev_snap_obj != eca->originobj) {
dsl_dataset_t *prev;
err = dsl_dataset_hold_obj(dp,
ds->ds_phys->ds_prev_snap_obj, FTAG, &prev);
dsl_dataset_rele(ds, FTAG);
if (err)
return (err);
ds = prev;
}
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, eca->tx) == 0);
}
dsl_dataset_rele(ds, FTAG);
return (0);
}
static void
scrub_visitds(dsl_pool_t *dp, uint64_t dsobj, dmu_tx_t *tx)
{
dsl_dataset_t *ds;
uint64_t min_txg_save;
VERIFY3U(0, ==, dsl_dataset_hold_obj(dp, dsobj, FTAG, &ds));
/*
* Iterate over the bps in this ds.
*/
min_txg_save = dp->dp_scrub_min_txg;
dp->dp_scrub_min_txg =
MAX(dp->dp_scrub_min_txg, ds->ds_phys->ds_prev_snap_txg);
scrub_visit_rootbp(dp, ds, &ds->ds_phys->ds_bp);
dp->dp_scrub_min_txg = min_txg_save;
if (dp->dp_scrub_pausing)
goto out;
/*
* Add descendent datasets to work queue.
*/
if (ds->ds_phys->ds_next_snap_obj != 0) {
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_phys->ds_next_snap_obj, tx) == 0);
}
if (ds->ds_phys->ds_num_children > 1) {
if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
struct enqueue_clones_arg eca;
eca.tx = tx;
eca.originobj = ds->ds_object;
(void) dmu_objset_find_spa(ds->ds_dir->dd_pool->dp_spa,
NULL, enqueue_clones_cb, &eca, DS_FIND_CHILDREN);
} else {
VERIFY(zap_join(dp->dp_meta_objset,
ds->ds_phys->ds_next_clones_obj,
dp->dp_scrub_queue_obj, tx) == 0);
}
}
out:
dsl_dataset_rele(ds, FTAG);
}
/* ARGSUSED */
static int
enqueue_cb(spa_t *spa, uint64_t dsobj, const char *dsname, void *arg)
{
dmu_tx_t *tx = arg;
dsl_dataset_t *ds;
int err;
dsl_pool_t *dp;
err = dsl_dataset_hold_obj(spa->spa_dsl_pool, dsobj, FTAG, &ds);
if (err)
return (err);
dp = ds->ds_dir->dd_pool;
while (ds->ds_phys->ds_prev_snap_obj != 0) {
dsl_dataset_t *prev;
err = dsl_dataset_hold_obj(dp, ds->ds_phys->ds_prev_snap_obj,
FTAG, &prev);
if (err) {
dsl_dataset_rele(ds, FTAG);
return (err);
}
/*
* If this is a clone, we don't need to worry about it for now.
*/
if (prev->ds_phys->ds_next_snap_obj != ds->ds_object) {
dsl_dataset_rele(ds, FTAG);
dsl_dataset_rele(prev, FTAG);
return (0);
}
dsl_dataset_rele(ds, FTAG);
ds = prev;
}
VERIFY(zap_add_int(dp->dp_meta_objset, dp->dp_scrub_queue_obj,
ds->ds_object, tx) == 0);
dsl_dataset_rele(ds, FTAG);
return (0);
}
void
dsl_pool_scrub_sync(dsl_pool_t *dp, dmu_tx_t *tx)
{
zap_cursor_t zc;
zap_attribute_t za;
boolean_t complete = B_TRUE;
if (dp->dp_scrub_func == SCRUB_FUNC_NONE)
return;
/* If the spa is not fully loaded, don't bother. */
if (dp->dp_spa->spa_load_state != SPA_LOAD_NONE)
return;
if (dp->dp_scrub_restart) {
enum scrub_func func = dp->dp_scrub_func;
dp->dp_scrub_restart = B_FALSE;
dsl_pool_scrub_setup_sync(dp, &func, kcred, tx);
}
if (dp->dp_spa->spa_root_vdev->vdev_stat.vs_scrub_type == 0) {
/*
* We must have resumed after rebooting; reset the vdev
* stats to know that we're doing a scrub (although it
* will think we're just starting now).
*/
vdev_scrub_stat_update(dp->dp_spa->spa_root_vdev,
dp->dp_scrub_min_txg ? POOL_SCRUB_RESILVER :
POOL_SCRUB_EVERYTHING, B_FALSE);
}
dp->dp_scrub_pausing = B_FALSE;
dp->dp_scrub_start_time = lbolt64;
dp->dp_scrub_isresilver = (dp->dp_scrub_min_txg != 0);
dp->dp_spa->spa_scrub_active = B_TRUE;
if (dp->dp_scrub_bookmark.zb_objset == 0) {
/* First do the MOS & ORIGIN */
scrub_visit_rootbp(dp, NULL, &dp->dp_meta_rootbp);
if (dp->dp_scrub_pausing)
goto out;
if (spa_version(dp->dp_spa) < SPA_VERSION_DSL_SCRUB) {
VERIFY(0 == dmu_objset_find_spa(dp->dp_spa,
NULL, enqueue_cb, tx, DS_FIND_CHILDREN));
} else {
scrub_visitds(dp, dp->dp_origin_snap->ds_object, tx);
}
ASSERT(!dp->dp_scrub_pausing);
} else if (dp->dp_scrub_bookmark.zb_objset != -1ULL) {
/*
* If we were paused, continue from here. Note if the
* ds we were paused on was deleted, the zb_objset will
* be -1, so we will skip this and find a new objset
* below.
*/
scrub_visitds(dp, dp->dp_scrub_bookmark.zb_objset, tx);
if (dp->dp_scrub_pausing)
goto out;
}
/*
* In case we were paused right at the end of the ds, zero the
* bookmark so we don't think that we're still trying to resume.
*/
bzero(&dp->dp_scrub_bookmark, sizeof (zbookmark_t));
/* keep pulling things out of the zap-object-as-queue */
while (zap_cursor_init(&zc, dp->dp_meta_objset, dp->dp_scrub_queue_obj),
zap_cursor_retrieve(&zc, &za) == 0) {
VERIFY(0 == zap_remove(dp->dp_meta_objset,
dp->dp_scrub_queue_obj, za.za_name, tx));
scrub_visitds(dp, za.za_first_integer, tx);
if (dp->dp_scrub_pausing)
break;
zap_cursor_fini(&zc);
}
zap_cursor_fini(&zc);
if (dp->dp_scrub_pausing)
goto out;
/* done. */
dsl_pool_scrub_cancel_sync(dp, &complete, kcred, tx);
return;
out:
VERIFY(0 == zap_update(dp->dp_meta_objset,
DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_BOOKMARK, sizeof (uint64_t), 4,
&dp->dp_scrub_bookmark, tx));
VERIFY(0 == zap_update(dp->dp_meta_objset,
DMU_POOL_DIRECTORY_OBJECT,
DMU_POOL_SCRUB_ERRORS, sizeof (uint64_t), 1,
&dp->dp_spa->spa_scrub_errors, tx));
/* XXX this is scrub-clean specific */
mutex_enter(&dp->dp_spa->spa_scrub_lock);
while (dp->dp_spa->spa_scrub_inflight > 0) {
cv_wait(&dp->dp_spa->spa_scrub_io_cv,
&dp->dp_spa->spa_scrub_lock);
}
mutex_exit(&dp->dp_spa->spa_scrub_lock);
}
void
dsl_pool_scrub_restart(dsl_pool_t *dp)
{
mutex_enter(&dp->dp_scrub_cancel_lock);
dp->dp_scrub_restart = B_TRUE;
mutex_exit(&dp->dp_scrub_cancel_lock);
}
/*
* scrub consumers
*/
static void
count_block(zfs_all_blkstats_t *zab, const blkptr_t *bp)
{
int i;
/*
* If we resume after a reboot, zab will be NULL; don't record
* incomplete stats in that case.
*/
if (zab == NULL)
return;
for (i = 0; i < 4; i++) {
int l = (i < 2) ? BP_GET_LEVEL(bp) : DN_MAX_LEVELS;
int t = (i & 1) ? BP_GET_TYPE(bp) : DMU_OT_TOTAL;
zfs_blkstat_t *zb = &zab->zab_type[l][t];
int equal;
zb->zb_count++;
zb->zb_asize += BP_GET_ASIZE(bp);
zb->zb_lsize += BP_GET_LSIZE(bp);
zb->zb_psize += BP_GET_PSIZE(bp);
zb->zb_gangs += BP_COUNT_GANG(bp);
switch (BP_GET_NDVAS(bp)) {
case 2:
if (DVA_GET_VDEV(&bp->blk_dva[0]) ==
DVA_GET_VDEV(&bp->blk_dva[1]))
zb->zb_ditto_2_of_2_samevdev++;
break;
case 3:
equal = (DVA_GET_VDEV(&bp->blk_dva[0]) ==
DVA_GET_VDEV(&bp->blk_dva[1])) +
(DVA_GET_VDEV(&bp->blk_dva[0]) ==
DVA_GET_VDEV(&bp->blk_dva[2])) +
(DVA_GET_VDEV(&bp->blk_dva[1]) ==
DVA_GET_VDEV(&bp->blk_dva[2]));
if (equal == 1)
zb->zb_ditto_2_of_3_samevdev++;
else if (equal == 3)
zb->zb_ditto_3_of_3_samevdev++;
break;
}
}
}
static void
dsl_pool_scrub_clean_done(zio_t *zio)
{
spa_t *spa = zio->io_spa;
zio_data_buf_free(zio->io_data, zio->io_size);
mutex_enter(&spa->spa_scrub_lock);
spa->spa_scrub_inflight--;
cv_broadcast(&spa->spa_scrub_io_cv);
if (zio->io_error && (zio->io_error != ECKSUM ||
!(zio->io_flags & ZIO_FLAG_SPECULATIVE)))
spa->spa_scrub_errors++;
mutex_exit(&spa->spa_scrub_lock);
}
static int
dsl_pool_scrub_clean_cb(dsl_pool_t *dp,
const blkptr_t *bp, const zbookmark_t *zb)
{
size_t size = BP_GET_LSIZE(bp);
int d;
spa_t *spa = dp->dp_spa;
boolean_t needs_io;
int zio_flags = ZIO_FLAG_SCRUB_THREAD | ZIO_FLAG_CANFAIL;
int zio_priority;
count_block(dp->dp_blkstats, bp);
if (dp->dp_scrub_isresilver == 0) {
/* It's a scrub */
zio_flags |= ZIO_FLAG_SCRUB;
zio_priority = ZIO_PRIORITY_SCRUB;
needs_io = B_TRUE;
} else {
/* It's a resilver */
zio_flags |= ZIO_FLAG_RESILVER;
zio_priority = ZIO_PRIORITY_RESILVER;
needs_io = B_FALSE;
}
/* If it's an intent log block, failure is expected. */
if (zb->zb_level == -1 && BP_GET_TYPE(bp) != DMU_OT_OBJSET)
zio_flags |= ZIO_FLAG_SPECULATIVE;
for (d = 0; d < BP_GET_NDVAS(bp); d++) {
vdev_t *vd = vdev_lookup_top(spa,
DVA_GET_VDEV(&bp->blk_dva[d]));
/*
* Keep track of how much data we've examined so that
* zpool(1M) status can make useful progress reports.
*/
mutex_enter(&vd->vdev_stat_lock);
vd->vdev_stat.vs_scrub_examined +=
DVA_GET_ASIZE(&bp->blk_dva[d]);
mutex_exit(&vd->vdev_stat_lock);
/* if it's a resilver, this may not be in the target range */
if (!needs_io) {
if (DVA_GET_GANG(&bp->blk_dva[d])) {
/*
* Gang members may be spread across multiple
* vdevs, so the best we can do is look at the
* pool-wide DTL.
* XXX -- it would be better to change our
* allocation policy to ensure that this can't
* happen.
*/
vd = spa->spa_root_vdev;
}
needs_io = vdev_dtl_contains(&vd->vdev_dtl_map,
bp->blk_birth, 1);
}
}
if (needs_io && !zfs_no_scrub_io) {
void *data = zio_data_buf_alloc(size);
mutex_enter(&spa->spa_scrub_lock);
while (spa->spa_scrub_inflight >= spa->spa_scrub_maxinflight)
cv_wait(&spa->spa_scrub_io_cv, &spa->spa_scrub_lock);
spa->spa_scrub_inflight++;
mutex_exit(&spa->spa_scrub_lock);
zio_nowait(zio_read(NULL, spa, bp, data, size,
dsl_pool_scrub_clean_done, NULL, zio_priority,
zio_flags, zb));
}
/* do not relocate this block */
return (0);
}
int
dsl_pool_scrub_clean(dsl_pool_t *dp)
{
+ spa_t *spa = dp->dp_spa;
+
/*
* Purge all vdev caches. We do this here rather than in sync
* context because this requires a writer lock on the spa_config
* lock, which we can't do from sync context. The
* spa_scrub_reopen flag indicates that vdev_open() should not
* attempt to start another scrub.
*/
- spa_config_enter(dp->dp_spa, SCL_ALL, FTAG, RW_WRITER);
- dp->dp_spa->spa_scrub_reopen = B_TRUE;
- vdev_reopen(dp->dp_spa->spa_root_vdev);
- dp->dp_spa->spa_scrub_reopen = B_FALSE;
- spa_config_exit(dp->dp_spa, SCL_ALL, FTAG);
+ spa_vdev_state_enter(spa);
+ spa->spa_scrub_reopen = B_TRUE;
+ vdev_reopen(spa->spa_root_vdev);
+ spa->spa_scrub_reopen = B_FALSE;
+ (void) spa_vdev_state_exit(spa, NULL, 0);
return (dsl_pool_scrub_setup(dp, SCRUB_FUNC_CLEAN));
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c (revision 209274)
@@ -1,1209 +1,1209 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
- * Copyright 2008 Sun Microsystems, Inc. All rights reserved.
+ * Copyright 2009 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/spa.h>
#include <sys/vdev_impl.h>
#include <sys/zio.h>
#include <sys/zio_checksum.h>
#include <sys/fs/zfs.h>
#include <sys/fm/fs/zfs.h>
/*
* Virtual device vector for RAID-Z.
*
* This vdev supports both single and double parity. For single parity, we
* use a simple XOR of all the data columns. For double parity, we use both
* the simple XOR as well as a technique described in "The mathematics of
* RAID-6" by H. Peter Anvin. This technique defines a Galois field, GF(2^8),
* over the integers expressable in a single byte. Briefly, the operations on
* the field are defined as follows:
*
* o addition (+) is represented by a bitwise XOR
* o subtraction (-) is therefore identical to addition: A + B = A - B
* o multiplication of A by 2 is defined by the following bitwise expression:
* (A * 2)_7 = A_6
* (A * 2)_6 = A_5
* (A * 2)_5 = A_4
* (A * 2)_4 = A_3 + A_7
* (A * 2)_3 = A_2 + A_7
* (A * 2)_2 = A_1 + A_7
* (A * 2)_1 = A_0
* (A * 2)_0 = A_7
*
* In C, multiplying by 2 is therefore ((a << 1) ^ ((a & 0x80) ? 0x1d : 0)).
*
* Observe that any number in the field (except for 0) can be expressed as a
* power of 2 -- a generator for the field. We store a table of the powers of
* 2 and logs base 2 for quick look ups, and exploit the fact that A * B can
* be rewritten as 2^(log_2(A) + log_2(B)) (where '+' is normal addition rather
* than field addition). The inverse of a field element A (A^-1) is A^254.
*
* The two parity columns, P and Q, over several data columns, D_0, ... D_n-1,
* can be expressed by field operations:
*
* P = D_0 + D_1 + ... + D_n-2 + D_n-1
* Q = 2^n-1 * D_0 + 2^n-2 * D_1 + ... + 2^1 * D_n-2 + 2^0 * D_n-1
* = ((...((D_0) * 2 + D_1) * 2 + ...) * 2 + D_n-2) * 2 + D_n-1
*
* See the reconstruction code below for how P and Q can used individually or
* in concert to recover missing data columns.
*/
typedef struct raidz_col {
uint64_t rc_devidx; /* child device index for I/O */
uint64_t rc_offset; /* device offset */
uint64_t rc_size; /* I/O size */
void *rc_data; /* I/O data */
int rc_error; /* I/O error for this device */
uint8_t rc_tried; /* Did we attempt this I/O column? */
uint8_t rc_skipped; /* Did we skip this I/O column? */
} raidz_col_t;
typedef struct raidz_map {
uint64_t rm_cols; /* Column count */
uint64_t rm_bigcols; /* Number of oversized columns */
uint64_t rm_asize; /* Actual total I/O size */
uint64_t rm_missingdata; /* Count of missing data devices */
uint64_t rm_missingparity; /* Count of missing parity devices */
uint64_t rm_firstdatacol; /* First data column/parity count */
raidz_col_t rm_col[1]; /* Flexible array of I/O columns */
} raidz_map_t;
#define VDEV_RAIDZ_P 0
#define VDEV_RAIDZ_Q 1
#define VDEV_RAIDZ_MAXPARITY 2
#define VDEV_RAIDZ_MUL_2(a) (((a) << 1) ^ (((a) & 0x80) ? 0x1d : 0))
/*
* These two tables represent powers and logs of 2 in the Galois field defined
* above. These values were computed by repeatedly multiplying by 2 as above.
*/
static const uint8_t vdev_raidz_pow2[256] = {
0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
0x1d, 0x3a, 0x74, 0xe8, 0xcd, 0x87, 0x13, 0x26,
0x4c, 0x98, 0x2d, 0x5a, 0xb4, 0x75, 0xea, 0xc9,
0x8f, 0x03, 0x06, 0x0c, 0x18, 0x30, 0x60, 0xc0,
0x9d, 0x27, 0x4e, 0x9c, 0x25, 0x4a, 0x94, 0x35,
0x6a, 0xd4, 0xb5, 0x77, 0xee, 0xc1, 0x9f, 0x23,
0x46, 0x8c, 0x05, 0x0a, 0x14, 0x28, 0x50, 0xa0,
0x5d, 0xba, 0x69, 0xd2, 0xb9, 0x6f, 0xde, 0xa1,
0x5f, 0xbe, 0x61, 0xc2, 0x99, 0x2f, 0x5e, 0xbc,
0x65, 0xca, 0x89, 0x0f, 0x1e, 0x3c, 0x78, 0xf0,
0xfd, 0xe7, 0xd3, 0xbb, 0x6b, 0xd6, 0xb1, 0x7f,
0xfe, 0xe1, 0xdf, 0xa3, 0x5b, 0xb6, 0x71, 0xe2,
0xd9, 0xaf, 0x43, 0x86, 0x11, 0x22, 0x44, 0x88,
0x0d, 0x1a, 0x34, 0x68, 0xd0, 0xbd, 0x67, 0xce,
0x81, 0x1f, 0x3e, 0x7c, 0xf8, 0xed, 0xc7, 0x93,
0x3b, 0x76, 0xec, 0xc5, 0x97, 0x33, 0x66, 0xcc,
0x85, 0x17, 0x2e, 0x5c, 0xb8, 0x6d, 0xda, 0xa9,
0x4f, 0x9e, 0x21, 0x42, 0x84, 0x15, 0x2a, 0x54,
0xa8, 0x4d, 0x9a, 0x29, 0x52, 0xa4, 0x55, 0xaa,
0x49, 0x92, 0x39, 0x72, 0xe4, 0xd5, 0xb7, 0x73,
0xe6, 0xd1, 0xbf, 0x63, 0xc6, 0x91, 0x3f, 0x7e,
0xfc, 0xe5, 0xd7, 0xb3, 0x7b, 0xf6, 0xf1, 0xff,
0xe3, 0xdb, 0xab, 0x4b, 0x96, 0x31, 0x62, 0xc4,
0x95, 0x37, 0x6e, 0xdc, 0xa5, 0x57, 0xae, 0x41,
0x82, 0x19, 0x32, 0x64, 0xc8, 0x8d, 0x07, 0x0e,
0x1c, 0x38, 0x70, 0xe0, 0xdd, 0xa7, 0x53, 0xa6,
0x51, 0xa2, 0x59, 0xb2, 0x79, 0xf2, 0xf9, 0xef,
0xc3, 0x9b, 0x2b, 0x56, 0xac, 0x45, 0x8a, 0x09,
0x12, 0x24, 0x48, 0x90, 0x3d, 0x7a, 0xf4, 0xf5,
0xf7, 0xf3, 0xfb, 0xeb, 0xcb, 0x8b, 0x0b, 0x16,
0x2c, 0x58, 0xb0, 0x7d, 0xfa, 0xe9, 0xcf, 0x83,
0x1b, 0x36, 0x6c, 0xd8, 0xad, 0x47, 0x8e, 0x01
};
static const uint8_t vdev_raidz_log2[256] = {
0x00, 0x00, 0x01, 0x19, 0x02, 0x32, 0x1a, 0xc6,
0x03, 0xdf, 0x33, 0xee, 0x1b, 0x68, 0xc7, 0x4b,
0x04, 0x64, 0xe0, 0x0e, 0x34, 0x8d, 0xef, 0x81,
0x1c, 0xc1, 0x69, 0xf8, 0xc8, 0x08, 0x4c, 0x71,
0x05, 0x8a, 0x65, 0x2f, 0xe1, 0x24, 0x0f, 0x21,
0x35, 0x93, 0x8e, 0xda, 0xf0, 0x12, 0x82, 0x45,
0x1d, 0xb5, 0xc2, 0x7d, 0x6a, 0x27, 0xf9, 0xb9,
0xc9, 0x9a, 0x09, 0x78, 0x4d, 0xe4, 0x72, 0xa6,
0x06, 0xbf, 0x8b, 0x62, 0x66, 0xdd, 0x30, 0xfd,
0xe2, 0x98, 0x25, 0xb3, 0x10, 0x91, 0x22, 0x88,
0x36, 0xd0, 0x94, 0xce, 0x8f, 0x96, 0xdb, 0xbd,
0xf1, 0xd2, 0x13, 0x5c, 0x83, 0x38, 0x46, 0x40,
0x1e, 0x42, 0xb6, 0xa3, 0xc3, 0x48, 0x7e, 0x6e,
0x6b, 0x3a, 0x28, 0x54, 0xfa, 0x85, 0xba, 0x3d,
0xca, 0x5e, 0x9b, 0x9f, 0x0a, 0x15, 0x79, 0x2b,
0x4e, 0xd4, 0xe5, 0xac, 0x73, 0xf3, 0xa7, 0x57,
0x07, 0x70, 0xc0, 0xf7, 0x8c, 0x80, 0x63, 0x0d,
0x67, 0x4a, 0xde, 0xed, 0x31, 0xc5, 0xfe, 0x18,
0xe3, 0xa5, 0x99, 0x77, 0x26, 0xb8, 0xb4, 0x7c,
0x11, 0x44, 0x92, 0xd9, 0x23, 0x20, 0x89, 0x2e,
0x37, 0x3f, 0xd1, 0x5b, 0x95, 0xbc, 0xcf, 0xcd,
0x90, 0x87, 0x97, 0xb2, 0xdc, 0xfc, 0xbe, 0x61,
0xf2, 0x56, 0xd3, 0xab, 0x14, 0x2a, 0x5d, 0x9e,
0x84, 0x3c, 0x39, 0x53, 0x47, 0x6d, 0x41, 0xa2,
0x1f, 0x2d, 0x43, 0xd8, 0xb7, 0x7b, 0xa4, 0x76,
0xc4, 0x17, 0x49, 0xec, 0x7f, 0x0c, 0x6f, 0xf6,
0x6c, 0xa1, 0x3b, 0x52, 0x29, 0x9d, 0x55, 0xaa,
0xfb, 0x60, 0x86, 0xb1, 0xbb, 0xcc, 0x3e, 0x5a,
0xcb, 0x59, 0x5f, 0xb0, 0x9c, 0xa9, 0xa0, 0x51,
0x0b, 0xf5, 0x16, 0xeb, 0x7a, 0x75, 0x2c, 0xd7,
0x4f, 0xae, 0xd5, 0xe9, 0xe6, 0xe7, 0xad, 0xe8,
0x74, 0xd6, 0xf4, 0xea, 0xa8, 0x50, 0x58, 0xaf,
};
/*
* Multiply a given number by 2 raised to the given power.
*/
static uint8_t
vdev_raidz_exp2(uint_t a, int exp)
{
if (a == 0)
return (0);
ASSERT(exp >= 0);
ASSERT(vdev_raidz_log2[a] > 0 || a == 1);
exp += vdev_raidz_log2[a];
if (exp > 255)
exp -= 255;
return (vdev_raidz_pow2[exp]);
}
static void
vdev_raidz_map_free(zio_t *zio)
{
raidz_map_t *rm = zio->io_vsd;
int c;
for (c = 0; c < rm->rm_firstdatacol; c++)
zio_buf_free(rm->rm_col[c].rc_data, rm->rm_col[c].rc_size);
kmem_free(rm, offsetof(raidz_map_t, rm_col[rm->rm_cols]));
}
static raidz_map_t *
vdev_raidz_map_alloc(zio_t *zio, uint64_t unit_shift, uint64_t dcols,
uint64_t nparity)
{
raidz_map_t *rm;
uint64_t b = zio->io_offset >> unit_shift;
uint64_t s = zio->io_size >> unit_shift;
uint64_t f = b % dcols;
uint64_t o = (b / dcols) << unit_shift;
uint64_t q, r, c, bc, col, acols, coff, devidx;
q = s / (dcols - nparity);
r = s - q * (dcols - nparity);
bc = (r == 0 ? 0 : r + nparity);
acols = (q == 0 ? bc : dcols);
rm = kmem_alloc(offsetof(raidz_map_t, rm_col[acols]), KM_SLEEP);
rm->rm_cols = acols;
rm->rm_bigcols = bc;
rm->rm_asize = 0;
rm->rm_missingdata = 0;
rm->rm_missingparity = 0;
rm->rm_firstdatacol = nparity;
for (c = 0; c < acols; c++) {
col = f + c;
coff = o;
if (col >= dcols) {
col -= dcols;
coff += 1ULL << unit_shift;
}
rm->rm_col[c].rc_devidx = col;
rm->rm_col[c].rc_offset = coff;
rm->rm_col[c].rc_size = (q + (c < bc)) << unit_shift;
rm->rm_col[c].rc_data = NULL;
rm->rm_col[c].rc_error = 0;
rm->rm_col[c].rc_tried = 0;
rm->rm_col[c].rc_skipped = 0;
rm->rm_asize += rm->rm_col[c].rc_size;
}
rm->rm_asize = roundup(rm->rm_asize, (nparity + 1) << unit_shift);
for (c = 0; c < rm->rm_firstdatacol; c++)
rm->rm_col[c].rc_data = zio_buf_alloc(rm->rm_col[c].rc_size);
rm->rm_col[c].rc_data = zio->io_data;
for (c = c + 1; c < acols; c++)
rm->rm_col[c].rc_data = (char *)rm->rm_col[c - 1].rc_data +
rm->rm_col[c - 1].rc_size;
/*
* If all data stored spans all columns, there's a danger that parity
* will always be on the same device and, since parity isn't read
* during normal operation, that that device's I/O bandwidth won't be
* used effectively. We therefore switch the parity every 1MB.
*
* ... at least that was, ostensibly, the theory. As a practical
* matter unless we juggle the parity between all devices evenly, we
* won't see any benefit. Further, occasional writes that aren't a
* multiple of the LCM of the number of children and the minimum
* stripe width are sufficient to avoid pessimal behavior.
* Unfortunately, this decision created an implicit on-disk format
* requirement that we need to support for all eternity, but only
* for single-parity RAID-Z.
*/
ASSERT(rm->rm_cols >= 2);
ASSERT(rm->rm_col[0].rc_size == rm->rm_col[1].rc_size);
if (rm->rm_firstdatacol == 1 && (zio->io_offset & (1ULL << 20))) {
devidx = rm->rm_col[0].rc_devidx;
o = rm->rm_col[0].rc_offset;
rm->rm_col[0].rc_devidx = rm->rm_col[1].rc_devidx;
rm->rm_col[0].rc_offset = rm->rm_col[1].rc_offset;
rm->rm_col[1].rc_devidx = devidx;
rm->rm_col[1].rc_offset = o;
}
zio->io_vsd = rm;
zio->io_vsd_free = vdev_raidz_map_free;
return (rm);
}
static void
vdev_raidz_generate_parity_p(raidz_map_t *rm)
{
uint64_t *p, *src, pcount, ccount, i;
int c;
pcount = rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]);
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
p = rm->rm_col[VDEV_RAIDZ_P].rc_data;
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
if (c == rm->rm_firstdatacol) {
ASSERT(ccount == pcount);
for (i = 0; i < ccount; i++, p++, src++) {
*p = *src;
}
} else {
ASSERT(ccount <= pcount);
for (i = 0; i < ccount; i++, p++, src++) {
*p ^= *src;
}
}
}
}
static void
vdev_raidz_generate_parity_pq(raidz_map_t *rm)
{
uint64_t *q, *p, *src, pcount, ccount, mask, i;
int c;
pcount = rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]);
ASSERT(rm->rm_col[VDEV_RAIDZ_P].rc_size ==
rm->rm_col[VDEV_RAIDZ_Q].rc_size);
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
p = rm->rm_col[VDEV_RAIDZ_P].rc_data;
q = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
if (c == rm->rm_firstdatacol) {
ASSERT(ccount == pcount || ccount == 0);
for (i = 0; i < ccount; i++, p++, q++, src++) {
*q = *src;
*p = *src;
}
for (; i < pcount; i++, p++, q++, src++) {
*q = 0;
*p = 0;
}
} else {
ASSERT(ccount <= pcount);
/*
* Rather than multiplying each byte individually (as
* described above), we are able to handle 8 at once
* by generating a mask based on the high bit in each
* byte and using that to conditionally XOR in 0x1d.
*/
for (i = 0; i < ccount; i++, p++, q++, src++) {
mask = *q & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*q = ((*q << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
*q ^= *src;
*p ^= *src;
}
/*
* Treat short columns as though they are full of 0s.
*/
for (; i < pcount; i++, q++) {
mask = *q & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*q = ((*q << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
}
}
}
}
static void
vdev_raidz_reconstruct_p(raidz_map_t *rm, int x)
{
uint64_t *dst, *src, xcount, ccount, count, i;
int c;
xcount = rm->rm_col[x].rc_size / sizeof (src[0]);
ASSERT(xcount <= rm->rm_col[VDEV_RAIDZ_P].rc_size / sizeof (src[0]));
ASSERT(xcount > 0);
src = rm->rm_col[VDEV_RAIDZ_P].rc_data;
dst = rm->rm_col[x].rc_data;
for (i = 0; i < xcount; i++, dst++, src++) {
*dst = *src;
}
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
dst = rm->rm_col[x].rc_data;
if (c == x)
continue;
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
count = MIN(ccount, xcount);
for (i = 0; i < count; i++, dst++, src++) {
*dst ^= *src;
}
}
}
static void
vdev_raidz_reconstruct_q(raidz_map_t *rm, int x)
{
uint64_t *dst, *src, xcount, ccount, count, mask, i;
uint8_t *b;
int c, j, exp;
xcount = rm->rm_col[x].rc_size / sizeof (src[0]);
ASSERT(xcount <= rm->rm_col[VDEV_RAIDZ_Q].rc_size / sizeof (src[0]));
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
src = rm->rm_col[c].rc_data;
dst = rm->rm_col[x].rc_data;
if (c == x)
ccount = 0;
else
ccount = rm->rm_col[c].rc_size / sizeof (src[0]);
count = MIN(ccount, xcount);
if (c == rm->rm_firstdatacol) {
for (i = 0; i < count; i++, dst++, src++) {
*dst = *src;
}
for (; i < xcount; i++, dst++) {
*dst = 0;
}
} else {
/*
* For an explanation of this, see the comment in
* vdev_raidz_generate_parity_pq() above.
*/
for (i = 0; i < count; i++, dst++, src++) {
mask = *dst & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*dst = ((*dst << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
*dst ^= *src;
}
for (; i < xcount; i++, dst++) {
mask = *dst & 0x8080808080808080ULL;
mask = (mask << 1) - (mask >> 7);
*dst = ((*dst << 1) & 0xfefefefefefefefeULL) ^
(mask & 0x1d1d1d1d1d1d1d1dULL);
}
}
}
src = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
dst = rm->rm_col[x].rc_data;
exp = 255 - (rm->rm_cols - 1 - x);
for (i = 0; i < xcount; i++, dst++, src++) {
*dst ^= *src;
for (j = 0, b = (uint8_t *)dst; j < 8; j++, b++) {
*b = vdev_raidz_exp2(*b, exp);
}
}
}
static void
vdev_raidz_reconstruct_pq(raidz_map_t *rm, int x, int y)
{
uint8_t *p, *q, *pxy, *qxy, *xd, *yd, tmp, a, b, aexp, bexp;
void *pdata, *qdata;
uint64_t xsize, ysize, i;
ASSERT(x < y);
ASSERT(x >= rm->rm_firstdatacol);
ASSERT(y < rm->rm_cols);
ASSERT(rm->rm_col[x].rc_size >= rm->rm_col[y].rc_size);
/*
* Move the parity data aside -- we're going to compute parity as
* though columns x and y were full of zeros -- Pxy and Qxy. We want to
* reuse the parity generation mechanism without trashing the actual
* parity so we make those columns appear to be full of zeros by
* setting their lengths to zero.
*/
pdata = rm->rm_col[VDEV_RAIDZ_P].rc_data;
qdata = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
xsize = rm->rm_col[x].rc_size;
ysize = rm->rm_col[y].rc_size;
rm->rm_col[VDEV_RAIDZ_P].rc_data =
zio_buf_alloc(rm->rm_col[VDEV_RAIDZ_P].rc_size);
rm->rm_col[VDEV_RAIDZ_Q].rc_data =
zio_buf_alloc(rm->rm_col[VDEV_RAIDZ_Q].rc_size);
rm->rm_col[x].rc_size = 0;
rm->rm_col[y].rc_size = 0;
vdev_raidz_generate_parity_pq(rm);
rm->rm_col[x].rc_size = xsize;
rm->rm_col[y].rc_size = ysize;
p = pdata;
q = qdata;
pxy = rm->rm_col[VDEV_RAIDZ_P].rc_data;
qxy = rm->rm_col[VDEV_RAIDZ_Q].rc_data;
xd = rm->rm_col[x].rc_data;
yd = rm->rm_col[y].rc_data;
/*
* We now have:
* Pxy = P + D_x + D_y
* Qxy = Q + 2^(ndevs - 1 - x) * D_x + 2^(ndevs - 1 - y) * D_y
*
* We can then solve for D_x:
* D_x = A * (P + Pxy) + B * (Q + Qxy)
* where
* A = 2^(x - y) * (2^(x - y) + 1)^-1
* B = 2^(ndevs - 1 - x) * (2^(x - y) + 1)^-1
*
* With D_x in hand, we can easily solve for D_y:
* D_y = P + Pxy + D_x
*/
a = vdev_raidz_pow2[255 + x - y];
b = vdev_raidz_pow2[255 - (rm->rm_cols - 1 - x)];
tmp = 255 - vdev_raidz_log2[a ^ 1];
aexp = vdev_raidz_log2[vdev_raidz_exp2(a, tmp)];
bexp = vdev_raidz_log2[vdev_raidz_exp2(b, tmp)];
for (i = 0; i < xsize; i++, p++, q++, pxy++, qxy++, xd++, yd++) {
*xd = vdev_raidz_exp2(*p ^ *pxy, aexp) ^
vdev_raidz_exp2(*q ^ *qxy, bexp);
if (i < ysize)
*yd = *p ^ *pxy ^ *xd;
}
zio_buf_free(rm->rm_col[VDEV_RAIDZ_P].rc_data,
rm->rm_col[VDEV_RAIDZ_P].rc_size);
zio_buf_free(rm->rm_col[VDEV_RAIDZ_Q].rc_data,
rm->rm_col[VDEV_RAIDZ_Q].rc_size);
/*
* Restore the saved parity data.
*/
rm->rm_col[VDEV_RAIDZ_P].rc_data = pdata;
rm->rm_col[VDEV_RAIDZ_Q].rc_data = qdata;
}
static int
vdev_raidz_open(vdev_t *vd, uint64_t *asize, uint64_t *ashift)
{
vdev_t *cvd;
uint64_t nparity = vd->vdev_nparity;
int c, error;
int lasterror = 0;
int numerrors = 0;
ASSERT(nparity > 0);
if (nparity > VDEV_RAIDZ_MAXPARITY ||
vd->vdev_children < nparity + 1) {
vd->vdev_stat.vs_aux = VDEV_AUX_BAD_LABEL;
return (EINVAL);
}
for (c = 0; c < vd->vdev_children; c++) {
cvd = vd->vdev_child[c];
if ((error = vdev_open(cvd)) != 0) {
lasterror = error;
numerrors++;
continue;
}
*asize = MIN(*asize - 1, cvd->vdev_asize - 1) + 1;
*ashift = MAX(*ashift, cvd->vdev_ashift);
}
*asize *= vd->vdev_children;
if (numerrors > nparity) {
vd->vdev_stat.vs_aux = VDEV_AUX_NO_REPLICAS;
return (lasterror);
}
return (0);
}
static void
vdev_raidz_close(vdev_t *vd)
{
int c;
for (c = 0; c < vd->vdev_children; c++)
vdev_close(vd->vdev_child[c]);
}
static uint64_t
vdev_raidz_asize(vdev_t *vd, uint64_t psize)
{
uint64_t asize;
uint64_t ashift = vd->vdev_top->vdev_ashift;
uint64_t cols = vd->vdev_children;
uint64_t nparity = vd->vdev_nparity;
asize = ((psize - 1) >> ashift) + 1;
asize += nparity * ((asize + cols - nparity - 1) / (cols - nparity));
asize = roundup(asize, nparity + 1) << ashift;
return (asize);
}
static void
vdev_raidz_child_done(zio_t *zio)
{
raidz_col_t *rc = zio->io_private;
rc->rc_error = zio->io_error;
rc->rc_tried = 1;
rc->rc_skipped = 0;
}
static int
vdev_raidz_io_start(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
vdev_t *tvd = vd->vdev_top;
vdev_t *cvd;
blkptr_t *bp = zio->io_bp;
raidz_map_t *rm;
raidz_col_t *rc;
int c;
rm = vdev_raidz_map_alloc(zio, tvd->vdev_ashift, vd->vdev_children,
vd->vdev_nparity);
ASSERT3U(rm->rm_asize, ==, vdev_psize_to_asize(vd, zio->io_size));
if (zio->io_type == ZIO_TYPE_WRITE) {
/*
* Generate RAID parity in the first virtual columns.
*/
if (rm->rm_firstdatacol == 1)
vdev_raidz_generate_parity_p(rm);
else
vdev_raidz_generate_parity_pq(rm);
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
cvd = vd->vdev_child[rc->rc_devidx];
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, rc->rc_data, rc->rc_size,
zio->io_type, zio->io_priority, 0,
vdev_raidz_child_done, rc));
}
return (ZIO_PIPELINE_CONTINUE);
}
ASSERT(zio->io_type == ZIO_TYPE_READ);
/*
* Iterate over the columns in reverse order so that we hit the parity
* last -- any errors along the way will force us to read the parity
* data.
*/
for (c = rm->rm_cols - 1; c >= 0; c--) {
rc = &rm->rm_col[c];
cvd = vd->vdev_child[rc->rc_devidx];
if (!vdev_readable(cvd)) {
if (c >= rm->rm_firstdatacol)
rm->rm_missingdata++;
else
rm->rm_missingparity++;
rc->rc_error = ENXIO;
rc->rc_tried = 1; /* don't even try */
rc->rc_skipped = 1;
continue;
}
if (vdev_dtl_contains(&cvd->vdev_dtl_map, bp->blk_birth, 1)) {
if (c >= rm->rm_firstdatacol)
rm->rm_missingdata++;
else
rm->rm_missingparity++;
rc->rc_error = ESTALE;
rc->rc_skipped = 1;
continue;
}
if (c >= rm->rm_firstdatacol || rm->rm_missingdata > 0 ||
- (zio->io_flags & ZIO_FLAG_SCRUB)) {
+ (zio->io_flags & (ZIO_FLAG_SCRUB | ZIO_FLAG_RESILVER))) {
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, rc->rc_data, rc->rc_size,
zio->io_type, zio->io_priority, 0,
vdev_raidz_child_done, rc));
}
}
return (ZIO_PIPELINE_CONTINUE);
}
/*
* Report a checksum error for a child of a RAID-Z device.
*/
static void
raidz_checksum_error(zio_t *zio, raidz_col_t *rc)
{
vdev_t *vd = zio->io_vd->vdev_child[rc->rc_devidx];
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
mutex_enter(&vd->vdev_stat_lock);
vd->vdev_stat.vs_checksum_errors++;
mutex_exit(&vd->vdev_stat_lock);
}
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE))
zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
zio->io_spa, vd, zio, rc->rc_offset, rc->rc_size);
}
/*
* Generate the parity from the data columns. If we tried and were able to
* read the parity without error, verify that the generated parity matches the
* data we read. If it doesn't, we fire off a checksum error. Return the
* number such failures.
*/
static int
raidz_parity_verify(zio_t *zio, raidz_map_t *rm)
{
void *orig[VDEV_RAIDZ_MAXPARITY];
int c, ret = 0;
raidz_col_t *rc;
for (c = 0; c < rm->rm_firstdatacol; c++) {
rc = &rm->rm_col[c];
if (!rc->rc_tried || rc->rc_error != 0)
continue;
orig[c] = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig[c], rc->rc_size);
}
if (rm->rm_firstdatacol == 1)
vdev_raidz_generate_parity_p(rm);
else
vdev_raidz_generate_parity_pq(rm);
for (c = 0; c < rm->rm_firstdatacol; c++) {
rc = &rm->rm_col[c];
if (!rc->rc_tried || rc->rc_error != 0)
continue;
if (bcmp(orig[c], rc->rc_data, rc->rc_size) != 0) {
raidz_checksum_error(zio, rc);
rc->rc_error = ECKSUM;
ret++;
}
zio_buf_free(orig[c], rc->rc_size);
}
return (ret);
}
static uint64_t raidz_corrected_p;
static uint64_t raidz_corrected_q;
static uint64_t raidz_corrected_pq;
static int
vdev_raidz_worst_error(raidz_map_t *rm)
{
int error = 0;
for (int c = 0; c < rm->rm_cols; c++)
error = zio_worst_error(error, rm->rm_col[c].rc_error);
return (error);
}
static void
vdev_raidz_io_done(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
vdev_t *cvd;
raidz_map_t *rm = zio->io_vsd;
raidz_col_t *rc, *rc1;
int unexpected_errors = 0;
int parity_errors = 0;
int parity_untried = 0;
int data_errors = 0;
int total_errors = 0;
int n, c, c1;
ASSERT(zio->io_bp != NULL); /* XXX need to add code to enforce this */
ASSERT(rm->rm_missingparity <= rm->rm_firstdatacol);
ASSERT(rm->rm_missingdata <= rm->rm_cols - rm->rm_firstdatacol);
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error) {
ASSERT(rc->rc_error != ECKSUM); /* child has no bp */
if (c < rm->rm_firstdatacol)
parity_errors++;
else
data_errors++;
if (!rc->rc_skipped)
unexpected_errors++;
total_errors++;
} else if (c < rm->rm_firstdatacol && !rc->rc_tried) {
parity_untried++;
}
}
if (zio->io_type == ZIO_TYPE_WRITE) {
/*
* XXX -- for now, treat partial writes as a success.
* (If we couldn't write enough columns to reconstruct
* the data, the I/O failed. Otherwise, good enough.)
*
* Now that we support write reallocation, it would be better
* to treat partial failure as real failure unless there are
* no non-degraded top-level vdevs left, and not update DTLs
* if we intend to reallocate.
*/
/* XXPOLICY */
if (total_errors > rm->rm_firstdatacol)
zio->io_error = vdev_raidz_worst_error(rm);
return;
}
ASSERT(zio->io_type == ZIO_TYPE_READ);
/*
* There are three potential phases for a read:
* 1. produce valid data from the columns read
* 2. read all disks and try again
* 3. perform combinatorial reconstruction
*
* Each phase is progressively both more expensive and less likely to
* occur. If we encounter more errors than we can repair or all phases
* fail, we have no choice but to return an error.
*/
/*
* If the number of errors we saw was correctable -- less than or equal
* to the number of parity disks read -- attempt to produce data that
* has a valid checksum. Naturally, this case applies in the absence of
* any errors.
*/
if (total_errors <= rm->rm_firstdatacol - parity_untried) {
switch (data_errors) {
case 0:
if (zio_checksum_error(zio) == 0) {
/*
* If we read parity information (unnecessarily
* as it happens since no reconstruction was
* needed) regenerate and verify the parity.
* We also regenerate parity when resilvering
* so we can write it out to the failed device
* later.
*/
if (parity_errors + parity_untried <
rm->rm_firstdatacol ||
(zio->io_flags & ZIO_FLAG_RESILVER)) {
n = raidz_parity_verify(zio, rm);
unexpected_errors += n;
ASSERT(parity_errors + n <=
rm->rm_firstdatacol);
}
goto done;
}
break;
case 1:
/*
* We either attempt to read all the parity columns or
* none of them. If we didn't try to read parity, we
* wouldn't be here in the correctable case. There must
* also have been fewer parity errors than parity
* columns or, again, we wouldn't be in this code path.
*/
ASSERT(parity_untried == 0);
ASSERT(parity_errors < rm->rm_firstdatacol);
/*
* Find the column that reported the error.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error != 0)
break;
}
ASSERT(c != rm->rm_cols);
ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
rc->rc_error == ESTALE);
if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0) {
vdev_raidz_reconstruct_p(rm, c);
} else {
ASSERT(rm->rm_firstdatacol > 1);
vdev_raidz_reconstruct_q(rm, c);
}
if (zio_checksum_error(zio) == 0) {
if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0)
atomic_inc_64(&raidz_corrected_p);
else
atomic_inc_64(&raidz_corrected_q);
/*
* If there's more than one parity disk that
* was successfully read, confirm that the
* other parity disk produced the correct data.
* This routine is suboptimal in that it
* regenerates both the parity we wish to test
* as well as the parity we just used to
* perform the reconstruction, but this should
* be a relatively uncommon case, and can be
* optimized if it becomes a problem.
* We also regenerate parity when resilvering
* so we can write it out to the failed device
* later.
*/
if (parity_errors < rm->rm_firstdatacol - 1 ||
(zio->io_flags & ZIO_FLAG_RESILVER)) {
n = raidz_parity_verify(zio, rm);
unexpected_errors += n;
ASSERT(parity_errors + n <=
rm->rm_firstdatacol);
}
goto done;
}
break;
case 2:
/*
* Two data column errors require double parity.
*/
ASSERT(rm->rm_firstdatacol == 2);
/*
* Find the two columns that reported errors.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error != 0)
break;
}
ASSERT(c != rm->rm_cols);
ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
rc->rc_error == ESTALE);
for (c1 = c++; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
if (rc->rc_error != 0)
break;
}
ASSERT(c != rm->rm_cols);
ASSERT(!rc->rc_skipped || rc->rc_error == ENXIO ||
rc->rc_error == ESTALE);
vdev_raidz_reconstruct_pq(rm, c1, c);
if (zio_checksum_error(zio) == 0) {
atomic_inc_64(&raidz_corrected_pq);
goto done;
}
break;
default:
ASSERT(rm->rm_firstdatacol <= 2);
ASSERT(0);
}
}
/*
* This isn't a typical situation -- either we got a read error or
* a child silently returned bad data. Read every block so we can
* try again with as much data and parity as we can track down. If
* we've already been through once before, all children will be marked
* as tried so we'll proceed to combinatorial reconstruction.
*/
unexpected_errors = 1;
rm->rm_missingdata = 0;
rm->rm_missingparity = 0;
for (c = 0; c < rm->rm_cols; c++) {
if (rm->rm_col[c].rc_tried)
continue;
zio_vdev_io_redone(zio);
do {
rc = &rm->rm_col[c];
if (rc->rc_tried)
continue;
zio_nowait(zio_vdev_child_io(zio, NULL,
vd->vdev_child[rc->rc_devidx],
rc->rc_offset, rc->rc_data, rc->rc_size,
zio->io_type, zio->io_priority, 0,
vdev_raidz_child_done, rc));
} while (++c < rm->rm_cols);
return;
}
/*
* At this point we've attempted to reconstruct the data given the
* errors we detected, and we've attempted to read all columns. There
* must, therefore, be one or more additional problems -- silent errors
* resulting in invalid data rather than explicit I/O errors resulting
* in absent data. Before we attempt combinatorial reconstruction make
* sure we have a chance of coming up with the right answer.
*/
if (total_errors >= rm->rm_firstdatacol) {
zio->io_error = vdev_raidz_worst_error(rm);
/*
* If there were exactly as many device errors as parity
* columns, yet we couldn't reconstruct the data, then at
* least one device must have returned bad data silently.
*/
if (total_errors == rm->rm_firstdatacol)
zio->io_error = zio_worst_error(zio->io_error, ECKSUM);
goto done;
}
if (rm->rm_col[VDEV_RAIDZ_P].rc_error == 0) {
/*
* Attempt to reconstruct the data from parity P.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
void *orig;
rc = &rm->rm_col[c];
orig = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig, rc->rc_size);
vdev_raidz_reconstruct_p(rm, c);
if (zio_checksum_error(zio) == 0) {
zio_buf_free(orig, rc->rc_size);
atomic_inc_64(&raidz_corrected_p);
/*
* If this child didn't know that it returned
* bad data, inform it.
*/
if (rc->rc_tried && rc->rc_error == 0)
raidz_checksum_error(zio, rc);
rc->rc_error = ECKSUM;
goto done;
}
bcopy(orig, rc->rc_data, rc->rc_size);
zio_buf_free(orig, rc->rc_size);
}
}
if (rm->rm_firstdatacol > 1 && rm->rm_col[VDEV_RAIDZ_Q].rc_error == 0) {
/*
* Attempt to reconstruct the data from parity Q.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols; c++) {
void *orig;
rc = &rm->rm_col[c];
orig = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig, rc->rc_size);
vdev_raidz_reconstruct_q(rm, c);
if (zio_checksum_error(zio) == 0) {
zio_buf_free(orig, rc->rc_size);
atomic_inc_64(&raidz_corrected_q);
/*
* If this child didn't know that it returned
* bad data, inform it.
*/
if (rc->rc_tried && rc->rc_error == 0)
raidz_checksum_error(zio, rc);
rc->rc_error = ECKSUM;
goto done;
}
bcopy(orig, rc->rc_data, rc->rc_size);
zio_buf_free(orig, rc->rc_size);
}
}
if (rm->rm_firstdatacol > 1 &&
rm->rm_col[VDEV_RAIDZ_P].rc_error == 0 &&
rm->rm_col[VDEV_RAIDZ_Q].rc_error == 0) {
/*
* Attempt to reconstruct the data from both P and Q.
*/
for (c = rm->rm_firstdatacol; c < rm->rm_cols - 1; c++) {
void *orig, *orig1;
rc = &rm->rm_col[c];
orig = zio_buf_alloc(rc->rc_size);
bcopy(rc->rc_data, orig, rc->rc_size);
for (c1 = c + 1; c1 < rm->rm_cols; c1++) {
rc1 = &rm->rm_col[c1];
orig1 = zio_buf_alloc(rc1->rc_size);
bcopy(rc1->rc_data, orig1, rc1->rc_size);
vdev_raidz_reconstruct_pq(rm, c, c1);
if (zio_checksum_error(zio) == 0) {
zio_buf_free(orig, rc->rc_size);
zio_buf_free(orig1, rc1->rc_size);
atomic_inc_64(&raidz_corrected_pq);
/*
* If these children didn't know they
* returned bad data, inform them.
*/
if (rc->rc_tried && rc->rc_error == 0)
raidz_checksum_error(zio, rc);
if (rc1->rc_tried && rc1->rc_error == 0)
raidz_checksum_error(zio, rc1);
rc->rc_error = ECKSUM;
rc1->rc_error = ECKSUM;
goto done;
}
bcopy(orig1, rc1->rc_data, rc1->rc_size);
zio_buf_free(orig1, rc1->rc_size);
}
bcopy(orig, rc->rc_data, rc->rc_size);
zio_buf_free(orig, rc->rc_size);
}
}
/*
* All combinations failed to checksum. Generate checksum ereports for
* all children.
*/
zio->io_error = ECKSUM;
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
zio->io_spa, vd->vdev_child[rc->rc_devidx], zio,
rc->rc_offset, rc->rc_size);
}
}
done:
zio_checksum_verified(zio);
if (zio->io_error == 0 && (spa_mode & FWRITE) &&
(unexpected_errors || (zio->io_flags & ZIO_FLAG_RESILVER))) {
/*
* Use the good data we have in hand to repair damaged children.
*/
for (c = 0; c < rm->rm_cols; c++) {
rc = &rm->rm_col[c];
cvd = vd->vdev_child[rc->rc_devidx];
if (rc->rc_error == 0)
continue;
zio_nowait(zio_vdev_child_io(zio, NULL, cvd,
rc->rc_offset, rc->rc_data, rc->rc_size,
ZIO_TYPE_WRITE, zio->io_priority,
ZIO_FLAG_IO_REPAIR, NULL, NULL));
}
}
}
static void
vdev_raidz_state_change(vdev_t *vd, int faulted, int degraded)
{
if (faulted > vd->vdev_nparity)
vdev_set_state(vd, B_FALSE, VDEV_STATE_CANT_OPEN,
VDEV_AUX_NO_REPLICAS);
else if (degraded + faulted != 0)
vdev_set_state(vd, B_FALSE, VDEV_STATE_DEGRADED, VDEV_AUX_NONE);
else
vdev_set_state(vd, B_FALSE, VDEV_STATE_HEALTHY, VDEV_AUX_NONE);
}
vdev_ops_t vdev_raidz_ops = {
vdev_raidz_open,
vdev_raidz_close,
vdev_raidz_asize,
vdev_raidz_io_start,
vdev_raidz_io_done,
vdev_raidz_state_change,
VDEV_TYPE_RAIDZ, /* name of this vdev type */
B_FALSE /* not a leaf vdev */
};
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_acl.c (revision 209274)
@@ -1,2712 +1,2719 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/types.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/systm.h>
#include <sys/sysmacros.h>
#include <sys/resource.h>
#include <sys/vfs.h>
#include <sys/vnode.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <sys/kmem.h>
#include <sys/cmn_err.h>
#include <sys/errno.h>
#include <sys/unistd.h>
#include <sys/sdt.h>
#include <sys/fs/zfs.h>
#include <sys/policy.h>
#include <sys/zfs_znode.h>
#include <sys/zfs_fuid.h>
#include <sys/zfs_acl.h>
#include <sys/zfs_dir.h>
#include <sys/zfs_vfsops.h>
#include <sys/dmu.h>
#include <sys/dnode.h>
#include <sys/zap.h>
#include <acl/acl_common.h>
#define ALLOW ACE_ACCESS_ALLOWED_ACE_TYPE
#define DENY ACE_ACCESS_DENIED_ACE_TYPE
#define MAX_ACE_TYPE ACE_SYSTEM_ALARM_CALLBACK_OBJECT_ACE_TYPE
#define MIN_ACE_TYPE ALLOW
#define OWNING_GROUP (ACE_GROUP|ACE_IDENTIFIER_GROUP)
#define EVERYONE_ALLOW_MASK (ACE_READ_ACL|ACE_READ_ATTRIBUTES | \
ACE_READ_NAMED_ATTRS|ACE_SYNCHRONIZE)
#define EVERYONE_DENY_MASK (ACE_WRITE_ACL|ACE_WRITE_OWNER | \
ACE_WRITE_ATTRIBUTES|ACE_WRITE_NAMED_ATTRS)
#define OWNER_ALLOW_MASK (ACE_WRITE_ACL | ACE_WRITE_OWNER | \
ACE_WRITE_ATTRIBUTES|ACE_WRITE_NAMED_ATTRS)
#define WRITE_MASK_DATA (ACE_WRITE_DATA|ACE_APPEND_DATA|ACE_WRITE_NAMED_ATTRS)
#define ZFS_CHECKED_MASKS (ACE_READ_ACL|ACE_READ_ATTRIBUTES|ACE_READ_DATA| \
ACE_READ_NAMED_ATTRS|ACE_WRITE_DATA|ACE_WRITE_ATTRIBUTES| \
ACE_WRITE_NAMED_ATTRS|ACE_APPEND_DATA|ACE_EXECUTE|ACE_WRITE_OWNER| \
ACE_WRITE_ACL|ACE_DELETE|ACE_DELETE_CHILD|ACE_SYNCHRONIZE)
#define WRITE_MASK (WRITE_MASK_DATA|ACE_WRITE_ATTRIBUTES|ACE_WRITE_ACL|\
ACE_WRITE_OWNER|ACE_DELETE|ACE_DELETE_CHILD)
#define OGE_CLEAR (ACE_READ_DATA|ACE_LIST_DIRECTORY|ACE_WRITE_DATA| \
ACE_ADD_FILE|ACE_APPEND_DATA|ACE_ADD_SUBDIRECTORY|ACE_EXECUTE)
#define OKAY_MASK_BITS (ACE_READ_DATA|ACE_LIST_DIRECTORY|ACE_WRITE_DATA| \
ACE_ADD_FILE|ACE_APPEND_DATA|ACE_ADD_SUBDIRECTORY|ACE_EXECUTE)
#define ALL_INHERIT (ACE_FILE_INHERIT_ACE|ACE_DIRECTORY_INHERIT_ACE | \
ACE_NO_PROPAGATE_INHERIT_ACE|ACE_INHERIT_ONLY_ACE|ACE_INHERITED_ACE)
#define RESTRICTED_CLEAR (ACE_WRITE_ACL|ACE_WRITE_OWNER)
#define V4_ACL_WIDE_FLAGS (ZFS_ACL_AUTO_INHERIT|ZFS_ACL_DEFAULTED|\
ZFS_ACL_PROTECTED)
#define ZFS_ACL_WIDE_FLAGS (V4_ACL_WIDE_FLAGS|ZFS_ACL_TRIVIAL|ZFS_INHERIT_ACE|\
ZFS_ACL_OBJ_ACE)
static uint16_t
zfs_ace_v0_get_type(void *acep)
{
return (((zfs_oldace_t *)acep)->z_type);
}
static uint16_t
zfs_ace_v0_get_flags(void *acep)
{
return (((zfs_oldace_t *)acep)->z_flags);
}
static uint32_t
zfs_ace_v0_get_mask(void *acep)
{
return (((zfs_oldace_t *)acep)->z_access_mask);
}
static uint64_t
zfs_ace_v0_get_who(void *acep)
{
return (((zfs_oldace_t *)acep)->z_fuid);
}
static void
zfs_ace_v0_set_type(void *acep, uint16_t type)
{
((zfs_oldace_t *)acep)->z_type = type;
}
static void
zfs_ace_v0_set_flags(void *acep, uint16_t flags)
{
((zfs_oldace_t *)acep)->z_flags = flags;
}
static void
zfs_ace_v0_set_mask(void *acep, uint32_t mask)
{
((zfs_oldace_t *)acep)->z_access_mask = mask;
}
static void
zfs_ace_v0_set_who(void *acep, uint64_t who)
{
((zfs_oldace_t *)acep)->z_fuid = who;
}
/*ARGSUSED*/
static size_t
zfs_ace_v0_size(void *acep)
{
return (sizeof (zfs_oldace_t));
}
static size_t
zfs_ace_v0_abstract_size(void)
{
return (sizeof (zfs_oldace_t));
}
static int
zfs_ace_v0_mask_off(void)
{
return (offsetof(zfs_oldace_t, z_access_mask));
}
/*ARGSUSED*/
static int
zfs_ace_v0_data(void *acep, void **datap)
{
*datap = NULL;
return (0);
}
static acl_ops_t zfs_acl_v0_ops = {
zfs_ace_v0_get_mask,
zfs_ace_v0_set_mask,
zfs_ace_v0_get_flags,
zfs_ace_v0_set_flags,
zfs_ace_v0_get_type,
zfs_ace_v0_set_type,
zfs_ace_v0_get_who,
zfs_ace_v0_set_who,
zfs_ace_v0_size,
zfs_ace_v0_abstract_size,
zfs_ace_v0_mask_off,
zfs_ace_v0_data
};
static uint16_t
zfs_ace_fuid_get_type(void *acep)
{
return (((zfs_ace_hdr_t *)acep)->z_type);
}
static uint16_t
zfs_ace_fuid_get_flags(void *acep)
{
return (((zfs_ace_hdr_t *)acep)->z_flags);
}
static uint32_t
zfs_ace_fuid_get_mask(void *acep)
{
return (((zfs_ace_hdr_t *)acep)->z_access_mask);
}
static uint64_t
zfs_ace_fuid_get_who(void *args)
{
uint16_t entry_type;
zfs_ace_t *acep = args;
entry_type = acep->z_hdr.z_flags & ACE_TYPE_FLAGS;
if (entry_type == ACE_OWNER || entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE)
return (-1);
return (((zfs_ace_t *)acep)->z_fuid);
}
static void
zfs_ace_fuid_set_type(void *acep, uint16_t type)
{
((zfs_ace_hdr_t *)acep)->z_type = type;
}
static void
zfs_ace_fuid_set_flags(void *acep, uint16_t flags)
{
((zfs_ace_hdr_t *)acep)->z_flags = flags;
}
static void
zfs_ace_fuid_set_mask(void *acep, uint32_t mask)
{
((zfs_ace_hdr_t *)acep)->z_access_mask = mask;
}
static void
zfs_ace_fuid_set_who(void *arg, uint64_t who)
{
zfs_ace_t *acep = arg;
uint16_t entry_type = acep->z_hdr.z_flags & ACE_TYPE_FLAGS;
if (entry_type == ACE_OWNER || entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE)
return;
acep->z_fuid = who;
}
static size_t
zfs_ace_fuid_size(void *acep)
{
zfs_ace_hdr_t *zacep = acep;
uint16_t entry_type;
switch (zacep->z_type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
return (sizeof (zfs_object_ace_t));
case ALLOW:
case DENY:
entry_type =
(((zfs_ace_hdr_t *)acep)->z_flags & ACE_TYPE_FLAGS);
if (entry_type == ACE_OWNER ||
entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE)
return (sizeof (zfs_ace_hdr_t));
/*FALLTHROUGH*/
default:
return (sizeof (zfs_ace_t));
}
}
static size_t
zfs_ace_fuid_abstract_size(void)
{
return (sizeof (zfs_ace_hdr_t));
}
static int
zfs_ace_fuid_mask_off(void)
{
return (offsetof(zfs_ace_hdr_t, z_access_mask));
}
static int
zfs_ace_fuid_data(void *acep, void **datap)
{
zfs_ace_t *zacep = acep;
zfs_object_ace_t *zobjp;
switch (zacep->z_hdr.z_type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
zobjp = acep;
*datap = (caddr_t)zobjp + sizeof (zfs_ace_t);
return (sizeof (zfs_object_ace_t) - sizeof (zfs_ace_t));
default:
*datap = NULL;
return (0);
}
}
static acl_ops_t zfs_acl_fuid_ops = {
zfs_ace_fuid_get_mask,
zfs_ace_fuid_set_mask,
zfs_ace_fuid_get_flags,
zfs_ace_fuid_set_flags,
zfs_ace_fuid_get_type,
zfs_ace_fuid_set_type,
zfs_ace_fuid_get_who,
zfs_ace_fuid_set_who,
zfs_ace_fuid_size,
zfs_ace_fuid_abstract_size,
zfs_ace_fuid_mask_off,
zfs_ace_fuid_data
};
static int
zfs_acl_version(int version)
{
if (version < ZPL_VERSION_FUID)
return (ZFS_ACL_VERSION_INITIAL);
else
return (ZFS_ACL_VERSION_FUID);
}
static int
zfs_acl_version_zp(znode_t *zp)
{
return (zfs_acl_version(zp->z_zfsvfs->z_version));
}
static zfs_acl_t *
zfs_acl_alloc(int vers)
{
zfs_acl_t *aclp;
aclp = kmem_zalloc(sizeof (zfs_acl_t), KM_SLEEP);
list_create(&aclp->z_acl, sizeof (zfs_acl_node_t),
offsetof(zfs_acl_node_t, z_next));
aclp->z_version = vers;
if (vers == ZFS_ACL_VERSION_FUID)
aclp->z_ops = zfs_acl_fuid_ops;
else
aclp->z_ops = zfs_acl_v0_ops;
return (aclp);
}
static zfs_acl_node_t *
zfs_acl_node_alloc(size_t bytes)
{
zfs_acl_node_t *aclnode;
aclnode = kmem_zalloc(sizeof (zfs_acl_node_t), KM_SLEEP);
if (bytes) {
aclnode->z_acldata = kmem_alloc(bytes, KM_SLEEP);
aclnode->z_allocdata = aclnode->z_acldata;
aclnode->z_allocsize = bytes;
aclnode->z_size = bytes;
}
return (aclnode);
}
static void
zfs_acl_node_free(zfs_acl_node_t *aclnode)
{
if (aclnode->z_allocsize)
kmem_free(aclnode->z_allocdata, aclnode->z_allocsize);
kmem_free(aclnode, sizeof (zfs_acl_node_t));
}
static void
zfs_acl_release_nodes(zfs_acl_t *aclp)
{
zfs_acl_node_t *aclnode;
while (aclnode = list_head(&aclp->z_acl)) {
list_remove(&aclp->z_acl, aclnode);
zfs_acl_node_free(aclnode);
}
aclp->z_acl_count = 0;
aclp->z_acl_bytes = 0;
}
void
zfs_acl_free(zfs_acl_t *aclp)
{
zfs_acl_release_nodes(aclp);
list_destroy(&aclp->z_acl);
kmem_free(aclp, sizeof (zfs_acl_t));
}
static boolean_t
zfs_acl_valid_ace_type(uint_t type, uint_t flags)
{
uint16_t entry_type;
switch (type) {
case ALLOW:
case DENY:
case ACE_SYSTEM_AUDIT_ACE_TYPE:
case ACE_SYSTEM_ALARM_ACE_TYPE:
entry_type = flags & ACE_TYPE_FLAGS;
return (entry_type == ACE_OWNER ||
entry_type == OWNING_GROUP ||
entry_type == ACE_EVERYONE || entry_type == 0 ||
entry_type == ACE_IDENTIFIER_GROUP);
default:
if (type >= MIN_ACE_TYPE && type <= MAX_ACE_TYPE)
return (B_TRUE);
}
return (B_FALSE);
}
static boolean_t
zfs_ace_valid(vtype_t obj_type, zfs_acl_t *aclp, uint16_t type, uint16_t iflags)
{
/*
* first check type of entry
*/
if (!zfs_acl_valid_ace_type(type, iflags))
return (B_FALSE);
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
if (aclp->z_version < ZFS_ACL_VERSION_FUID)
return (B_FALSE);
aclp->z_hints |= ZFS_ACL_OBJ_ACE;
}
/*
* next check inheritance level flags
*/
if (obj_type == VDIR &&
(iflags & (ACE_FILE_INHERIT_ACE|ACE_DIRECTORY_INHERIT_ACE)))
aclp->z_hints |= ZFS_INHERIT_ACE;
if (iflags & (ACE_INHERIT_ONLY_ACE|ACE_NO_PROPAGATE_INHERIT_ACE)) {
if ((iflags & (ACE_FILE_INHERIT_ACE|
ACE_DIRECTORY_INHERIT_ACE)) == 0) {
return (B_FALSE);
}
}
return (B_TRUE);
}
static void *
zfs_acl_next_ace(zfs_acl_t *aclp, void *start, uint64_t *who,
uint32_t *access_mask, uint16_t *iflags, uint16_t *type)
{
zfs_acl_node_t *aclnode;
if (start == NULL) {
aclnode = list_head(&aclp->z_acl);
if (aclnode == NULL)
return (NULL);
aclp->z_next_ace = aclnode->z_acldata;
aclp->z_curr_node = aclnode;
aclnode->z_ace_idx = 0;
}
aclnode = aclp->z_curr_node;
if (aclnode == NULL)
return (NULL);
if (aclnode->z_ace_idx >= aclnode->z_ace_count) {
aclnode = list_next(&aclp->z_acl, aclnode);
if (aclnode == NULL)
return (NULL);
else {
aclp->z_curr_node = aclnode;
aclnode->z_ace_idx = 0;
aclp->z_next_ace = aclnode->z_acldata;
}
}
if (aclnode->z_ace_idx < aclnode->z_ace_count) {
void *acep = aclp->z_next_ace;
size_t ace_size;
/*
* Make sure we don't overstep our bounds
*/
ace_size = aclp->z_ops.ace_size(acep);
if (((caddr_t)acep + ace_size) >
((caddr_t)aclnode->z_acldata + aclnode->z_size)) {
return (NULL);
}
*iflags = aclp->z_ops.ace_flags_get(acep);
*type = aclp->z_ops.ace_type_get(acep);
*access_mask = aclp->z_ops.ace_mask_get(acep);
*who = aclp->z_ops.ace_who_get(acep);
aclp->z_next_ace = (caddr_t)aclp->z_next_ace + ace_size;
aclnode->z_ace_idx++;
return ((void *)acep);
}
return (NULL);
}
/*ARGSUSED*/
static uint64_t
zfs_ace_walk(void *datap, uint64_t cookie, int aclcnt,
uint16_t *flags, uint16_t *type, uint32_t *mask)
{
zfs_acl_t *aclp = datap;
zfs_ace_hdr_t *acep = (zfs_ace_hdr_t *)(uintptr_t)cookie;
uint64_t who;
acep = zfs_acl_next_ace(aclp, acep, &who, mask,
flags, type);
return ((uint64_t)(uintptr_t)acep);
}
static zfs_acl_node_t *
zfs_acl_curr_node(zfs_acl_t *aclp)
{
ASSERT(aclp->z_curr_node);
return (aclp->z_curr_node);
}
/*
* Copy ACE to internal ZFS format.
* While processing the ACL each ACE will be validated for correctness.
* ACE FUIDs will be created later.
*/
int
zfs_copy_ace_2_fuid(vtype_t obj_type, zfs_acl_t *aclp, void *datap,
zfs_ace_t *z_acl, int aclcnt, size_t *size)
{
int i;
uint16_t entry_type;
zfs_ace_t *aceptr = z_acl;
ace_t *acep = datap;
zfs_object_ace_t *zobjacep;
ace_object_t *aceobjp;
for (i = 0; i != aclcnt; i++) {
aceptr->z_hdr.z_access_mask = acep->a_access_mask;
aceptr->z_hdr.z_flags = acep->a_flags;
aceptr->z_hdr.z_type = acep->a_type;
entry_type = aceptr->z_hdr.z_flags & ACE_TYPE_FLAGS;
if (entry_type != ACE_OWNER && entry_type != OWNING_GROUP &&
entry_type != ACE_EVERYONE) {
if (!aclp->z_has_fuids)
aclp->z_has_fuids = IS_EPHEMERAL(acep->a_who);
aceptr->z_fuid = (uint64_t)acep->a_who;
}
/*
* Make sure ACE is valid
*/
if (zfs_ace_valid(obj_type, aclp, aceptr->z_hdr.z_type,
aceptr->z_hdr.z_flags) != B_TRUE)
return (EINVAL);
switch (acep->a_type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
zobjacep = (zfs_object_ace_t *)aceptr;
aceobjp = (ace_object_t *)acep;
bcopy(aceobjp->a_obj_type, zobjacep->z_object_type,
sizeof (aceobjp->a_obj_type));
bcopy(aceobjp->a_inherit_obj_type,
zobjacep->z_inherit_type,
sizeof (aceobjp->a_inherit_obj_type));
acep = (ace_t *)((caddr_t)acep + sizeof (ace_object_t));
break;
default:
acep = (ace_t *)((caddr_t)acep + sizeof (ace_t));
}
aceptr = (zfs_ace_t *)((caddr_t)aceptr +
aclp->z_ops.ace_size(aceptr));
}
*size = (caddr_t)aceptr - (caddr_t)z_acl;
return (0);
}
/*
* Copy ZFS ACEs to fixed size ace_t layout
*/
static void
zfs_copy_fuid_2_ace(zfsvfs_t *zfsvfs, zfs_acl_t *aclp, cred_t *cr,
void *datap, int filter)
{
uint64_t who;
uint32_t access_mask;
uint16_t iflags, type;
zfs_ace_hdr_t *zacep = NULL;
ace_t *acep = datap;
ace_object_t *objacep;
zfs_object_ace_t *zobjacep;
size_t ace_size;
uint16_t entry_type;
while (zacep = zfs_acl_next_ace(aclp, zacep,
&who, &access_mask, &iflags, &type)) {
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
if (filter) {
continue;
}
zobjacep = (zfs_object_ace_t *)zacep;
objacep = (ace_object_t *)acep;
bcopy(zobjacep->z_object_type,
objacep->a_obj_type,
sizeof (zobjacep->z_object_type));
bcopy(zobjacep->z_inherit_type,
objacep->a_inherit_obj_type,
sizeof (zobjacep->z_inherit_type));
ace_size = sizeof (ace_object_t);
break;
default:
ace_size = sizeof (ace_t);
break;
}
entry_type = (iflags & ACE_TYPE_FLAGS);
if ((entry_type != ACE_OWNER &&
entry_type != OWNING_GROUP &&
entry_type != ACE_EVERYONE)) {
acep->a_who = zfs_fuid_map_id(zfsvfs, who,
cr, (entry_type & ACE_IDENTIFIER_GROUP) ?
ZFS_ACE_GROUP : ZFS_ACE_USER);
} else {
acep->a_who = (uid_t)(int64_t)who;
}
acep->a_access_mask = access_mask;
acep->a_flags = iflags;
acep->a_type = type;
acep = (ace_t *)((caddr_t)acep + ace_size);
}
}
static int
zfs_copy_ace_2_oldace(vtype_t obj_type, zfs_acl_t *aclp, ace_t *acep,
zfs_oldace_t *z_acl, int aclcnt, size_t *size)
{
int i;
zfs_oldace_t *aceptr = z_acl;
for (i = 0; i != aclcnt; i++, aceptr++) {
aceptr->z_access_mask = acep[i].a_access_mask;
aceptr->z_type = acep[i].a_type;
aceptr->z_flags = acep[i].a_flags;
aceptr->z_fuid = acep[i].a_who;
/*
* Make sure ACE is valid
*/
if (zfs_ace_valid(obj_type, aclp, aceptr->z_type,
aceptr->z_flags) != B_TRUE)
return (EINVAL);
}
*size = (caddr_t)aceptr - (caddr_t)z_acl;
return (0);
}
/*
* convert old ACL format to new
*/
void
zfs_acl_xform(znode_t *zp, zfs_acl_t *aclp)
{
zfs_oldace_t *oldaclp;
int i;
uint16_t type, iflags;
uint32_t access_mask;
uint64_t who;
void *cookie = NULL;
zfs_acl_node_t *newaclnode;
ASSERT(aclp->z_version == ZFS_ACL_VERSION_INITIAL);
/*
* First create the ACE in a contiguous piece of memory
* for zfs_copy_ace_2_fuid().
*
* We only convert an ACL once, so this won't happen
* everytime.
*/
oldaclp = kmem_alloc(sizeof (zfs_oldace_t) * aclp->z_acl_count,
KM_SLEEP);
i = 0;
while (cookie = zfs_acl_next_ace(aclp, cookie, &who,
&access_mask, &iflags, &type)) {
oldaclp[i].z_flags = iflags;
oldaclp[i].z_type = type;
oldaclp[i].z_fuid = who;
oldaclp[i++].z_access_mask = access_mask;
}
newaclnode = zfs_acl_node_alloc(aclp->z_acl_count *
sizeof (zfs_object_ace_t));
aclp->z_ops = zfs_acl_fuid_ops;
VERIFY(zfs_copy_ace_2_fuid(ZTOV(zp)->v_type, aclp, oldaclp,
newaclnode->z_acldata, aclp->z_acl_count,
&newaclnode->z_size) == 0);
newaclnode->z_ace_count = aclp->z_acl_count;
aclp->z_version = ZFS_ACL_VERSION;
kmem_free(oldaclp, aclp->z_acl_count * sizeof (zfs_oldace_t));
/*
* Release all previous ACL nodes
*/
zfs_acl_release_nodes(aclp);
list_insert_head(&aclp->z_acl, newaclnode);
aclp->z_acl_bytes = newaclnode->z_size;
aclp->z_acl_count = newaclnode->z_ace_count;
}
/*
* Convert unix access mask to v4 access mask
*/
static uint32_t
zfs_unix_to_v4(uint32_t access_mask)
{
uint32_t new_mask = 0;
if (access_mask & S_IXOTH)
new_mask |= ACE_EXECUTE;
if (access_mask & S_IWOTH)
new_mask |= ACE_WRITE_DATA;
if (access_mask & S_IROTH)
new_mask |= ACE_READ_DATA;
return (new_mask);
}
static void
zfs_set_ace(zfs_acl_t *aclp, void *acep, uint32_t access_mask,
uint16_t access_type, uint64_t fuid, uint16_t entry_type)
{
uint16_t type = entry_type & ACE_TYPE_FLAGS;
aclp->z_ops.ace_mask_set(acep, access_mask);
aclp->z_ops.ace_type_set(acep, access_type);
aclp->z_ops.ace_flags_set(acep, entry_type);
if ((type != ACE_OWNER && type != OWNING_GROUP &&
type != ACE_EVERYONE))
aclp->z_ops.ace_who_set(acep, fuid);
}
/*
* Determine mode of file based on ACL.
* Also, create FUIDs for any User/Group ACEs
*/
static uint64_t
zfs_mode_fuid_compute(znode_t *zp, zfs_acl_t *aclp, cred_t *cr,
zfs_fuid_info_t **fuidp, dmu_tx_t *tx)
{
int entry_type;
mode_t mode;
mode_t seen = 0;
zfs_ace_hdr_t *acep = NULL;
uint64_t who;
uint16_t iflags, type;
uint32_t access_mask;
mode = (zp->z_phys->zp_mode & (S_IFMT | S_ISUID | S_ISGID | S_ISVTX));
while (acep = zfs_acl_next_ace(aclp, acep, &who,
&access_mask, &iflags, &type)) {
if (!zfs_acl_valid_ace_type(type, iflags))
continue;
entry_type = (iflags & ACE_TYPE_FLAGS);
/*
* Skip over owner@, group@ or everyone@ inherit only ACEs
*/
if ((iflags & ACE_INHERIT_ONLY_ACE) &&
(entry_type == ACE_OWNER || entry_type == ACE_EVERYONE ||
entry_type == OWNING_GROUP))
continue;
if (entry_type == ACE_OWNER) {
if ((access_mask & ACE_READ_DATA) &&
(!(seen & S_IRUSR))) {
seen |= S_IRUSR;
if (type == ALLOW) {
mode |= S_IRUSR;
}
}
if ((access_mask & ACE_WRITE_DATA) &&
(!(seen & S_IWUSR))) {
seen |= S_IWUSR;
if (type == ALLOW) {
mode |= S_IWUSR;
}
}
if ((access_mask & ACE_EXECUTE) &&
(!(seen & S_IXUSR))) {
seen |= S_IXUSR;
if (type == ALLOW) {
mode |= S_IXUSR;
}
}
} else if (entry_type == OWNING_GROUP) {
if ((access_mask & ACE_READ_DATA) &&
(!(seen & S_IRGRP))) {
seen |= S_IRGRP;
if (type == ALLOW) {
mode |= S_IRGRP;
}
}
if ((access_mask & ACE_WRITE_DATA) &&
(!(seen & S_IWGRP))) {
seen |= S_IWGRP;
if (type == ALLOW) {
mode |= S_IWGRP;
}
}
if ((access_mask & ACE_EXECUTE) &&
(!(seen & S_IXGRP))) {
seen |= S_IXGRP;
if (type == ALLOW) {
mode |= S_IXGRP;
}
}
} else if (entry_type == ACE_EVERYONE) {
if ((access_mask & ACE_READ_DATA)) {
if (!(seen & S_IRUSR)) {
seen |= S_IRUSR;
if (type == ALLOW) {
mode |= S_IRUSR;
}
}
if (!(seen & S_IRGRP)) {
seen |= S_IRGRP;
if (type == ALLOW) {
mode |= S_IRGRP;
}
}
if (!(seen & S_IROTH)) {
seen |= S_IROTH;
if (type == ALLOW) {
mode |= S_IROTH;
}
}
}
if ((access_mask & ACE_WRITE_DATA)) {
if (!(seen & S_IWUSR)) {
seen |= S_IWUSR;
if (type == ALLOW) {
mode |= S_IWUSR;
}
}
if (!(seen & S_IWGRP)) {
seen |= S_IWGRP;
if (type == ALLOW) {
mode |= S_IWGRP;
}
}
if (!(seen & S_IWOTH)) {
seen |= S_IWOTH;
if (type == ALLOW) {
mode |= S_IWOTH;
}
}
}
if ((access_mask & ACE_EXECUTE)) {
if (!(seen & S_IXUSR)) {
seen |= S_IXUSR;
if (type == ALLOW) {
mode |= S_IXUSR;
}
}
if (!(seen & S_IXGRP)) {
seen |= S_IXGRP;
if (type == ALLOW) {
mode |= S_IXGRP;
}
}
if (!(seen & S_IXOTH)) {
seen |= S_IXOTH;
if (type == ALLOW) {
mode |= S_IXOTH;
}
}
}
}
/*
* Now handle FUID create for user/group ACEs
*/
if (entry_type == 0 || entry_type == ACE_IDENTIFIER_GROUP) {
aclp->z_ops.ace_who_set(acep,
zfs_fuid_create(zp->z_zfsvfs, who, cr,
(entry_type == 0) ? ZFS_ACE_USER : ZFS_ACE_GROUP,
tx, fuidp));
}
}
return (mode);
}
static zfs_acl_t *
zfs_acl_node_read_internal(znode_t *zp, boolean_t will_modify)
{
zfs_acl_t *aclp;
zfs_acl_node_t *aclnode;
aclp = zfs_acl_alloc(zp->z_phys->zp_acl.z_acl_version);
/*
* Version 0 to 1 znode_acl_phys has the size/count fields swapped.
* Version 0 didn't have a size field, only a count.
*/
if (zp->z_phys->zp_acl.z_acl_version == ZFS_ACL_VERSION_INITIAL) {
aclp->z_acl_count = zp->z_phys->zp_acl.z_acl_size;
aclp->z_acl_bytes = ZFS_ACL_SIZE(aclp->z_acl_count);
} else {
aclp->z_acl_count = zp->z_phys->zp_acl.z_acl_count;
aclp->z_acl_bytes = zp->z_phys->zp_acl.z_acl_size;
}
aclnode = zfs_acl_node_alloc(will_modify ? aclp->z_acl_bytes : 0);
aclnode->z_ace_count = aclp->z_acl_count;
if (will_modify) {
bcopy(zp->z_phys->zp_acl.z_ace_data, aclnode->z_acldata,
aclp->z_acl_bytes);
} else {
aclnode->z_size = aclp->z_acl_bytes;
aclnode->z_acldata = &zp->z_phys->zp_acl.z_ace_data[0];
}
list_insert_head(&aclp->z_acl, aclnode);
return (aclp);
}
/*
* Read an external acl object.
*/
static int
zfs_acl_node_read(znode_t *zp, zfs_acl_t **aclpp, boolean_t will_modify)
{
uint64_t extacl = zp->z_phys->zp_acl.z_acl_extern_obj;
zfs_acl_t *aclp;
size_t aclsize;
size_t acl_count;
zfs_acl_node_t *aclnode;
int error;
ASSERT(MUTEX_HELD(&zp->z_acl_lock));
if (zp->z_phys->zp_acl.z_acl_extern_obj == 0) {
*aclpp = zfs_acl_node_read_internal(zp, will_modify);
return (0);
}
aclp = zfs_acl_alloc(zp->z_phys->zp_acl.z_acl_version);
if (zp->z_phys->zp_acl.z_acl_version == ZFS_ACL_VERSION_INITIAL) {
zfs_acl_phys_v0_t *zacl0 =
(zfs_acl_phys_v0_t *)&zp->z_phys->zp_acl;
aclsize = ZFS_ACL_SIZE(zacl0->z_acl_count);
acl_count = zacl0->z_acl_count;
} else {
aclsize = zp->z_phys->zp_acl.z_acl_size;
acl_count = zp->z_phys->zp_acl.z_acl_count;
if (aclsize == 0)
aclsize = acl_count * sizeof (zfs_ace_t);
}
aclnode = zfs_acl_node_alloc(aclsize);
list_insert_head(&aclp->z_acl, aclnode);
error = dmu_read(zp->z_zfsvfs->z_os, extacl, 0,
aclsize, aclnode->z_acldata);
aclnode->z_ace_count = acl_count;
aclp->z_acl_count = acl_count;
aclp->z_acl_bytes = aclsize;
if (error != 0) {
zfs_acl_free(aclp);
/* convert checksum errors into IO errors */
if (error == ECKSUM)
error = EIO;
return (error);
}
*aclpp = aclp;
return (0);
}
/*
* common code for setting ACLs.
*
* This function is called from zfs_mode_update, zfs_perm_init, and zfs_setacl.
* zfs_setacl passes a non-NULL inherit pointer (ihp) to indicate that it's
* already checked the acl and knows whether to inherit.
*/
int
zfs_aclset_common(znode_t *zp, zfs_acl_t *aclp, cred_t *cr,
zfs_fuid_info_t **fuidp, dmu_tx_t *tx)
{
int error;
znode_phys_t *zphys = zp->z_phys;
zfs_acl_phys_t *zacl = &zphys->zp_acl;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
uint64_t aoid = zphys->zp_acl.z_acl_extern_obj;
uint64_t off = 0;
dmu_object_type_t otype;
zfs_acl_node_t *aclnode;
ASSERT(MUTEX_HELD(&zp->z_lock));
ASSERT(MUTEX_HELD(&zp->z_acl_lock));
dmu_buf_will_dirty(zp->z_dbuf, tx);
zphys->zp_mode = zfs_mode_fuid_compute(zp, aclp, cr, fuidp, tx);
/*
* Decide which opbject type to use. If we are forced to
* use old ACL format than transform ACL into zfs_oldace_t
* layout.
*/
if (!zfsvfs->z_use_fuids) {
otype = DMU_OT_OLDACL;
} else {
if ((aclp->z_version == ZFS_ACL_VERSION_INITIAL) &&
(zfsvfs->z_version >= ZPL_VERSION_FUID))
zfs_acl_xform(zp, aclp);
ASSERT(aclp->z_version >= ZFS_ACL_VERSION_FUID);
otype = DMU_OT_ACL;
}
if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
/*
* If ACL was previously external and we are now
* converting to new ACL format then release old
* ACL object and create a new one.
*/
if (aoid && aclp->z_version != zacl->z_acl_version) {
error = dmu_object_free(zfsvfs->z_os,
zp->z_phys->zp_acl.z_acl_extern_obj, tx);
if (error)
return (error);
aoid = 0;
}
if (aoid == 0) {
aoid = dmu_object_alloc(zfsvfs->z_os,
otype, aclp->z_acl_bytes,
otype == DMU_OT_ACL ? DMU_OT_SYSACL : DMU_OT_NONE,
otype == DMU_OT_ACL ? DN_MAX_BONUSLEN : 0, tx);
} else {
(void) dmu_object_set_blocksize(zfsvfs->z_os, aoid,
aclp->z_acl_bytes, 0, tx);
}
zphys->zp_acl.z_acl_extern_obj = aoid;
for (aclnode = list_head(&aclp->z_acl); aclnode;
aclnode = list_next(&aclp->z_acl, aclnode)) {
if (aclnode->z_ace_count == 0)
continue;
dmu_write(zfsvfs->z_os, aoid, off,
aclnode->z_size, aclnode->z_acldata, tx);
off += aclnode->z_size;
}
} else {
void *start = zacl->z_ace_data;
/*
* Migrating back embedded?
*/
if (zphys->zp_acl.z_acl_extern_obj) {
error = dmu_object_free(zfsvfs->z_os,
zp->z_phys->zp_acl.z_acl_extern_obj, tx);
if (error)
return (error);
zphys->zp_acl.z_acl_extern_obj = 0;
}
for (aclnode = list_head(&aclp->z_acl); aclnode;
aclnode = list_next(&aclp->z_acl, aclnode)) {
if (aclnode->z_ace_count == 0)
continue;
bcopy(aclnode->z_acldata, start, aclnode->z_size);
start = (caddr_t)start + aclnode->z_size;
}
}
/*
* If Old version then swap count/bytes to match old
* layout of znode_acl_phys_t.
*/
if (aclp->z_version == ZFS_ACL_VERSION_INITIAL) {
zphys->zp_acl.z_acl_size = aclp->z_acl_count;
zphys->zp_acl.z_acl_count = aclp->z_acl_bytes;
} else {
zphys->zp_acl.z_acl_size = aclp->z_acl_bytes;
zphys->zp_acl.z_acl_count = aclp->z_acl_count;
}
zphys->zp_acl.z_acl_version = aclp->z_version;
/*
* Replace ACL wide bits, but first clear them.
*/
zp->z_phys->zp_flags &= ~ZFS_ACL_WIDE_FLAGS;
zp->z_phys->zp_flags |= aclp->z_hints;
if (ace_trivial_common(aclp, 0, zfs_ace_walk) == 0)
zp->z_phys->zp_flags |= ZFS_ACL_TRIVIAL;
zfs_time_stamper_locked(zp, STATE_CHANGED, tx);
return (0);
}
/*
* Update access mask for prepended ACE
*
* This applies the "groupmask" value for aclmode property.
*/
static void
zfs_acl_prepend_fixup(zfs_acl_t *aclp, void *acep, void *origacep,
mode_t mode, uint64_t owner)
{
int rmask, wmask, xmask;
int user_ace;
uint16_t aceflags;
uint32_t origmask, acepmask;
uint64_t fuid;
aceflags = aclp->z_ops.ace_flags_get(acep);
fuid = aclp->z_ops.ace_who_get(acep);
origmask = aclp->z_ops.ace_mask_get(origacep);
acepmask = aclp->z_ops.ace_mask_get(acep);
user_ace = (!(aceflags &
(ACE_OWNER|ACE_GROUP|ACE_IDENTIFIER_GROUP)));
if (user_ace && (fuid == owner)) {
rmask = S_IRUSR;
wmask = S_IWUSR;
xmask = S_IXUSR;
} else {
rmask = S_IRGRP;
wmask = S_IWGRP;
xmask = S_IXGRP;
}
if (origmask & ACE_READ_DATA) {
if (mode & rmask) {
acepmask &= ~ACE_READ_DATA;
} else {
acepmask |= ACE_READ_DATA;
}
}
if (origmask & ACE_WRITE_DATA) {
if (mode & wmask) {
acepmask &= ~ACE_WRITE_DATA;
} else {
acepmask |= ACE_WRITE_DATA;
}
}
if (origmask & ACE_APPEND_DATA) {
if (mode & wmask) {
acepmask &= ~ACE_APPEND_DATA;
} else {
acepmask |= ACE_APPEND_DATA;
}
}
if (origmask & ACE_EXECUTE) {
if (mode & xmask) {
acepmask &= ~ACE_EXECUTE;
} else {
acepmask |= ACE_EXECUTE;
}
}
aclp->z_ops.ace_mask_set(acep, acepmask);
}
/*
* Apply mode to canonical six ACEs.
*/
static void
zfs_acl_fixup_canonical_six(zfs_acl_t *aclp, mode_t mode)
{
zfs_acl_node_t *aclnode = list_tail(&aclp->z_acl);
void *acep;
int maskoff = aclp->z_ops.ace_mask_off();
size_t abstract_size = aclp->z_ops.ace_abstract_size();
ASSERT(aclnode != NULL);
acep = (void *)((caddr_t)aclnode->z_acldata +
aclnode->z_size - (abstract_size * 6));
/*
* Fixup final ACEs to match the mode
*/
adjust_ace_pair_common(acep, maskoff, abstract_size,
(mode & 0700) >> 6); /* owner@ */
acep = (caddr_t)acep + (abstract_size * 2);
adjust_ace_pair_common(acep, maskoff, abstract_size,
(mode & 0070) >> 3); /* group@ */
acep = (caddr_t)acep + (abstract_size * 2);
adjust_ace_pair_common(acep, maskoff,
abstract_size, mode); /* everyone@ */
}
static int
zfs_acl_ace_match(zfs_acl_t *aclp, void *acep, int allow_deny,
int entry_type, int accessmask)
{
uint32_t mask = aclp->z_ops.ace_mask_get(acep);
uint16_t type = aclp->z_ops.ace_type_get(acep);
uint16_t flags = aclp->z_ops.ace_flags_get(acep);
return (mask == accessmask && type == allow_deny &&
((flags & ACE_TYPE_FLAGS) == entry_type));
}
/*
* Can prepended ACE be reused?
*/
static int
zfs_reuse_deny(zfs_acl_t *aclp, void *acep, void *prevacep)
{
int okay_masks;
uint16_t prevtype;
uint16_t prevflags;
uint16_t flags;
uint32_t mask, prevmask;
if (prevacep == NULL)
return (B_FALSE);
prevtype = aclp->z_ops.ace_type_get(prevacep);
prevflags = aclp->z_ops.ace_flags_get(prevacep);
flags = aclp->z_ops.ace_flags_get(acep);
mask = aclp->z_ops.ace_mask_get(acep);
prevmask = aclp->z_ops.ace_mask_get(prevacep);
if (prevtype != DENY)
return (B_FALSE);
if (prevflags != (flags & ACE_IDENTIFIER_GROUP))
return (B_FALSE);
okay_masks = (mask & OKAY_MASK_BITS);
if (prevmask & ~okay_masks)
return (B_FALSE);
return (B_TRUE);
}
/*
* Insert new ACL node into chain of zfs_acl_node_t's
*
* This will result in two possible results.
* 1. If the ACL is currently just a single zfs_acl_node and
* we are prepending the entry then current acl node will have
* a new node inserted above it.
*
* 2. If we are inserting in the middle of current acl node then
* the current node will be split in two and new node will be inserted
* in between the two split nodes.
*/
static zfs_acl_node_t *
zfs_acl_ace_insert(zfs_acl_t *aclp, void *acep)
{
zfs_acl_node_t *newnode;
zfs_acl_node_t *trailernode = NULL;
zfs_acl_node_t *currnode = zfs_acl_curr_node(aclp);
int curr_idx = aclp->z_curr_node->z_ace_idx;
int trailer_count;
size_t oldsize;
newnode = zfs_acl_node_alloc(aclp->z_ops.ace_size(acep));
newnode->z_ace_count = 1;
oldsize = currnode->z_size;
if (curr_idx != 1) {
trailernode = zfs_acl_node_alloc(0);
trailernode->z_acldata = acep;
trailer_count = currnode->z_ace_count - curr_idx + 1;
currnode->z_ace_count = curr_idx - 1;
currnode->z_size = (caddr_t)acep - (caddr_t)currnode->z_acldata;
trailernode->z_size = oldsize - currnode->z_size;
trailernode->z_ace_count = trailer_count;
}
aclp->z_acl_count += 1;
aclp->z_acl_bytes += aclp->z_ops.ace_size(acep);
if (curr_idx == 1)
list_insert_before(&aclp->z_acl, currnode, newnode);
else
list_insert_after(&aclp->z_acl, currnode, newnode);
if (trailernode) {
list_insert_after(&aclp->z_acl, newnode, trailernode);
aclp->z_curr_node = trailernode;
trailernode->z_ace_idx = 1;
}
return (newnode);
}
/*
* Prepend deny ACE
*/
static void *
zfs_acl_prepend_deny(znode_t *zp, zfs_acl_t *aclp, void *acep,
mode_t mode)
{
zfs_acl_node_t *aclnode;
void *newacep;
uint64_t fuid;
uint16_t flags;
aclnode = zfs_acl_ace_insert(aclp, acep);
newacep = aclnode->z_acldata;
fuid = aclp->z_ops.ace_who_get(acep);
flags = aclp->z_ops.ace_flags_get(acep);
zfs_set_ace(aclp, newacep, 0, DENY, fuid, (flags & ACE_TYPE_FLAGS));
zfs_acl_prepend_fixup(aclp, newacep, acep, mode, zp->z_phys->zp_uid);
return (newacep);
}
/*
* Split an inherited ACE into inherit_only ACE
* and original ACE with inheritance flags stripped off.
*/
static void
zfs_acl_split_ace(zfs_acl_t *aclp, zfs_ace_hdr_t *acep)
{
zfs_acl_node_t *aclnode;
zfs_acl_node_t *currnode;
void *newacep;
uint16_t type, flags;
uint32_t mask;
uint64_t fuid;
type = aclp->z_ops.ace_type_get(acep);
flags = aclp->z_ops.ace_flags_get(acep);
mask = aclp->z_ops.ace_mask_get(acep);
fuid = aclp->z_ops.ace_who_get(acep);
aclnode = zfs_acl_ace_insert(aclp, acep);
newacep = aclnode->z_acldata;
aclp->z_ops.ace_type_set(newacep, type);
aclp->z_ops.ace_flags_set(newacep, flags | ACE_INHERIT_ONLY_ACE);
aclp->z_ops.ace_mask_set(newacep, mask);
aclp->z_ops.ace_type_set(newacep, type);
aclp->z_ops.ace_who_set(newacep, fuid);
aclp->z_next_ace = acep;
flags &= ~ALL_INHERIT;
aclp->z_ops.ace_flags_set(acep, flags);
currnode = zfs_acl_curr_node(aclp);
ASSERT(currnode->z_ace_idx >= 1);
currnode->z_ace_idx -= 1;
}
/*
* Are ACES started at index i, the canonical six ACES?
*/
static int
zfs_have_canonical_six(zfs_acl_t *aclp)
{
void *acep;
zfs_acl_node_t *aclnode = list_tail(&aclp->z_acl);
int i = 0;
size_t abstract_size = aclp->z_ops.ace_abstract_size();
ASSERT(aclnode != NULL);
if (aclnode->z_ace_count < 6)
return (0);
acep = (void *)((caddr_t)aclnode->z_acldata +
aclnode->z_size - (aclp->z_ops.ace_abstract_size() * 6));
if ((zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
DENY, ACE_OWNER, 0) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
ALLOW, ACE_OWNER, OWNER_ALLOW_MASK) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++), DENY,
OWNING_GROUP, 0) && zfs_acl_ace_match(aclp, (caddr_t)acep +
(abstract_size * i++),
ALLOW, OWNING_GROUP, 0) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
DENY, ACE_EVERYONE, EVERYONE_DENY_MASK) &&
zfs_acl_ace_match(aclp, (caddr_t)acep + (abstract_size * i++),
ALLOW, ACE_EVERYONE, EVERYONE_ALLOW_MASK))) {
return (1);
} else {
return (0);
}
}
/*
* Apply step 1g, to group entries
*
* Need to deal with corner case where group may have
* greater permissions than owner. If so then limit
* group permissions, based on what extra permissions
* group has.
*/
static void
zfs_fixup_group_entries(zfs_acl_t *aclp, void *acep, void *prevacep,
mode_t mode)
{
uint32_t prevmask = aclp->z_ops.ace_mask_get(prevacep);
uint32_t mask = aclp->z_ops.ace_mask_get(acep);
uint16_t prevflags = aclp->z_ops.ace_flags_get(prevacep);
mode_t extramode = (mode >> 3) & 07;
mode_t ownermode = (mode >> 6);
if (prevflags & ACE_IDENTIFIER_GROUP) {
extramode &= ~ownermode;
if (extramode) {
if (extramode & S_IROTH) {
prevmask &= ~ACE_READ_DATA;
mask &= ~ACE_READ_DATA;
}
if (extramode & S_IWOTH) {
prevmask &= ~(ACE_WRITE_DATA|ACE_APPEND_DATA);
mask &= ~(ACE_WRITE_DATA|ACE_APPEND_DATA);
}
if (extramode & S_IXOTH) {
prevmask &= ~ACE_EXECUTE;
mask &= ~ACE_EXECUTE;
}
}
}
aclp->z_ops.ace_mask_set(acep, mask);
aclp->z_ops.ace_mask_set(prevacep, prevmask);
}
/*
* Apply the chmod algorithm as described
* in PSARC/2002/240
*/
static void
zfs_acl_chmod(znode_t *zp, uint64_t mode, zfs_acl_t *aclp)
{
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
void *acep = NULL, *prevacep = NULL;
uint64_t who;
int i;
int entry_type;
int reuse_deny;
int need_canonical_six = 1;
uint16_t iflags, type;
uint32_t access_mask;
ASSERT(MUTEX_HELD(&zp->z_acl_lock));
ASSERT(MUTEX_HELD(&zp->z_lock));
aclp->z_hints = (zp->z_phys->zp_flags & V4_ACL_WIDE_FLAGS);
/*
* If discard then just discard all ACL nodes which
* represent the ACEs.
*
* New owner@/group@/everone@ ACEs will be added
* later.
*/
if (zfsvfs->z_acl_mode == ZFS_ACL_DISCARD)
zfs_acl_release_nodes(aclp);
while (acep = zfs_acl_next_ace(aclp, acep, &who, &access_mask,
&iflags, &type)) {
entry_type = (iflags & ACE_TYPE_FLAGS);
iflags = (iflags & ALL_INHERIT);
if ((type != ALLOW && type != DENY) ||
(iflags & ACE_INHERIT_ONLY_ACE)) {
if (iflags)
aclp->z_hints |= ZFS_INHERIT_ACE;
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
aclp->z_hints |= ZFS_ACL_OBJ_ACE;
break;
}
goto nextace;
}
/*
* Need to split ace into two?
*/
if ((iflags & (ACE_FILE_INHERIT_ACE|
ACE_DIRECTORY_INHERIT_ACE)) &&
(!(iflags & ACE_INHERIT_ONLY_ACE))) {
zfs_acl_split_ace(aclp, acep);
aclp->z_hints |= ZFS_INHERIT_ACE;
goto nextace;
}
if (entry_type == ACE_OWNER || entry_type == ACE_EVERYONE ||
(entry_type == OWNING_GROUP)) {
access_mask &= ~OGE_CLEAR;
aclp->z_ops.ace_mask_set(acep, access_mask);
goto nextace;
} else {
reuse_deny = B_TRUE;
if (type == ALLOW) {
/*
* Check preceding ACE if any, to see
* if we need to prepend a DENY ACE.
* This is only applicable when the acl_mode
* property == groupmask.
*/
if (zfsvfs->z_acl_mode == ZFS_ACL_GROUPMASK) {
reuse_deny = zfs_reuse_deny(aclp, acep,
prevacep);
if (!reuse_deny) {
prevacep =
zfs_acl_prepend_deny(zp,
aclp, acep, mode);
} else {
zfs_acl_prepend_fixup(
aclp, prevacep,
acep, mode,
zp->z_phys->zp_uid);
}
zfs_fixup_group_entries(aclp, acep,
prevacep, mode);
}
}
}
nextace:
prevacep = acep;
}
/*
* Check out last six aces, if we have six.
*/
if (aclp->z_acl_count >= 6) {
if (zfs_have_canonical_six(aclp)) {
need_canonical_six = 0;
}
}
if (need_canonical_six) {
size_t abstract_size = aclp->z_ops.ace_abstract_size();
void *zacep;
zfs_acl_node_t *aclnode =
zfs_acl_node_alloc(abstract_size * 6);
aclnode->z_size = abstract_size * 6;
aclnode->z_ace_count = 6;
aclp->z_acl_bytes += aclnode->z_size;
list_insert_tail(&aclp->z_acl, aclnode);
zacep = aclnode->z_acldata;
i = 0;
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
0, DENY, -1, ACE_OWNER);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
OWNER_ALLOW_MASK, ALLOW, -1, ACE_OWNER);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++), 0,
DENY, -1, OWNING_GROUP);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++), 0,
ALLOW, -1, OWNING_GROUP);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
EVERYONE_DENY_MASK, DENY, -1, ACE_EVERYONE);
zfs_set_ace(aclp, (caddr_t)zacep + (abstract_size * i++),
EVERYONE_ALLOW_MASK, ALLOW, -1, ACE_EVERYONE);
aclp->z_acl_count += 6;
}
zfs_acl_fixup_canonical_six(aclp, mode);
}
int
zfs_acl_chmod_setattr(znode_t *zp, zfs_acl_t **aclp, uint64_t mode)
{
int error;
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
*aclp = NULL;
error = zfs_acl_node_read(zp, aclp, B_TRUE);
if (error == 0)
zfs_acl_chmod(zp, mode, *aclp);
mutex_exit(&zp->z_acl_lock);
mutex_exit(&zp->z_lock);
return (error);
}
/*
* strip off write_owner and write_acl
*/
static void
zfs_restricted_update(zfsvfs_t *zfsvfs, zfs_acl_t *aclp, void *acep)
{
uint32_t mask = aclp->z_ops.ace_mask_get(acep);
if ((zfsvfs->z_acl_inherit == ZFS_ACL_RESTRICTED) &&
(aclp->z_ops.ace_type_get(acep) == ALLOW)) {
mask &= ~RESTRICTED_CLEAR;
aclp->z_ops.ace_mask_set(acep, mask);
}
}
/*
* Should ACE be inherited?
*/
static int
zfs_ace_can_use(znode_t *zp, uint16_t acep_flags)
{
int vtype = ZTOV(zp)->v_type;
int iflags = (acep_flags & 0xf);
if ((vtype == VDIR) && (iflags & ACE_DIRECTORY_INHERIT_ACE))
return (1);
else if (iflags & ACE_FILE_INHERIT_ACE)
return (!((vtype == VDIR) &&
(iflags & ACE_NO_PROPAGATE_INHERIT_ACE)));
return (0);
}
/*
* inherit inheritable ACEs from parent
*/
static zfs_acl_t *
zfs_acl_inherit(znode_t *zp, zfs_acl_t *paclp, uint64_t mode,
boolean_t *need_chmod)
{
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
void *pacep;
void *acep, *acep2;
zfs_acl_node_t *aclnode, *aclnode2;
zfs_acl_t *aclp = NULL;
uint64_t who;
uint32_t access_mask;
uint16_t iflags, newflags, type;
size_t ace_size;
void *data1, *data2;
size_t data1sz, data2sz;
boolean_t vdir = ZTOV(zp)->v_type == VDIR;
boolean_t vreg = ZTOV(zp)->v_type == VREG;
boolean_t passthrough, passthrough_x, noallow;
passthrough_x =
zfsvfs->z_acl_inherit == ZFS_ACL_PASSTHROUGH_X;
passthrough = passthrough_x ||
zfsvfs->z_acl_inherit == ZFS_ACL_PASSTHROUGH;
noallow =
zfsvfs->z_acl_inherit == ZFS_ACL_NOALLOW;
*need_chmod = B_TRUE;
pacep = NULL;
aclp = zfs_acl_alloc(paclp->z_version);
if (zfsvfs->z_acl_inherit == ZFS_ACL_DISCARD)
return (aclp);
while (pacep = zfs_acl_next_ace(paclp, pacep, &who,
&access_mask, &iflags, &type)) {
/*
* don't inherit bogus ACEs
*/
if (!zfs_acl_valid_ace_type(type, iflags))
continue;
if (noallow && type == ALLOW)
continue;
ace_size = aclp->z_ops.ace_size(pacep);
if (!zfs_ace_can_use(zp, iflags))
continue;
/*
* If owner@, group@, or everyone@ inheritable
* then zfs_acl_chmod() isn't needed.
*/
if (passthrough &&
((iflags & (ACE_OWNER|ACE_EVERYONE)) ||
((iflags & OWNING_GROUP) ==
OWNING_GROUP)) && (vreg || (vdir && (iflags &
ACE_DIRECTORY_INHERIT_ACE)))) {
*need_chmod = B_FALSE;
if (!vdir && passthrough_x &&
((mode & (S_IXUSR | S_IXGRP | S_IXOTH)) == 0)) {
access_mask &= ~ACE_EXECUTE;
}
}
aclnode = zfs_acl_node_alloc(ace_size);
list_insert_tail(&aclp->z_acl, aclnode);
acep = aclnode->z_acldata;
zfs_set_ace(aclp, acep, access_mask, type,
who, iflags|ACE_INHERITED_ACE);
/*
* Copy special opaque data if any
*/
if ((data1sz = paclp->z_ops.ace_data(pacep, &data1)) != 0) {
VERIFY((data2sz = aclp->z_ops.ace_data(acep,
&data2)) == data1sz);
bcopy(data1, data2, data2sz);
}
aclp->z_acl_count++;
aclnode->z_ace_count++;
aclp->z_acl_bytes += aclnode->z_size;
newflags = aclp->z_ops.ace_flags_get(acep);
if (vdir)
aclp->z_hints |= ZFS_INHERIT_ACE;
if ((iflags & ACE_NO_PROPAGATE_INHERIT_ACE) || !vdir) {
newflags &= ~ALL_INHERIT;
aclp->z_ops.ace_flags_set(acep,
newflags|ACE_INHERITED_ACE);
zfs_restricted_update(zfsvfs, aclp, acep);
continue;
}
ASSERT(vdir);
newflags = aclp->z_ops.ace_flags_get(acep);
if ((iflags & (ACE_FILE_INHERIT_ACE |
ACE_DIRECTORY_INHERIT_ACE)) !=
ACE_FILE_INHERIT_ACE) {
aclnode2 = zfs_acl_node_alloc(ace_size);
list_insert_tail(&aclp->z_acl, aclnode2);
acep2 = aclnode2->z_acldata;
zfs_set_ace(aclp, acep2,
access_mask, type, who,
iflags|ACE_INHERITED_ACE);
newflags |= ACE_INHERIT_ONLY_ACE;
aclp->z_ops.ace_flags_set(acep, newflags);
newflags &= ~ALL_INHERIT;
aclp->z_ops.ace_flags_set(acep2,
newflags|ACE_INHERITED_ACE);
/*
* Copy special opaque data if any
*/
if ((data1sz = aclp->z_ops.ace_data(acep,
&data1)) != 0) {
VERIFY((data2sz =
aclp->z_ops.ace_data(acep2,
&data2)) == data1sz);
bcopy(data1, data2, data1sz);
}
aclp->z_acl_count++;
aclnode2->z_ace_count++;
aclp->z_acl_bytes += aclnode->z_size;
zfs_restricted_update(zfsvfs, aclp, acep2);
} else {
newflags |= ACE_INHERIT_ONLY_ACE;
aclp->z_ops.ace_flags_set(acep,
newflags|ACE_INHERITED_ACE);
}
}
return (aclp);
}
/*
* Create file system object initial permissions
* including inheritable ACEs.
*/
void
zfs_perm_init(znode_t *zp, znode_t *parent, int flag,
vattr_t *vap, dmu_tx_t *tx, cred_t *cr,
zfs_acl_t *setaclp, zfs_fuid_info_t **fuidp)
{
uint64_t mode, fuid, fgid;
int error;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zfs_acl_t *aclp = NULL;
zfs_acl_t *paclp;
xvattr_t *xvap = (xvattr_t *)vap;
gid_t gid;
boolean_t need_chmod = B_TRUE;
if (setaclp)
aclp = setaclp;
mode = MAKEIMODE(vap->va_type, vap->va_mode);
/*
* Determine uid and gid.
*/
if ((flag & (IS_ROOT_NODE | IS_REPLAY)) ||
((flag & IS_XATTR) && (vap->va_type == VDIR))) {
fuid = zfs_fuid_create(zfsvfs, vap->va_uid, cr,
ZFS_OWNER, tx, fuidp);
fgid = zfs_fuid_create(zfsvfs, vap->va_gid, cr,
ZFS_GROUP, tx, fuidp);
gid = vap->va_gid;
} else {
fuid = zfs_fuid_create_cred(zfsvfs, ZFS_OWNER, tx, cr, fuidp);
fgid = 0;
if (vap->va_mask & AT_GID) {
fgid = zfs_fuid_create(zfsvfs, vap->va_gid, cr,
ZFS_GROUP, tx, fuidp);
gid = vap->va_gid;
if (fgid != parent->z_phys->zp_gid &&
!groupmember(vap->va_gid, cr) &&
secpolicy_vnode_create_gid(cr) != 0)
fgid = 0;
}
if (fgid == 0) {
if (parent->z_phys->zp_mode & S_ISGID) {
fgid = parent->z_phys->zp_gid;
gid = zfs_fuid_map_id(zfsvfs, fgid,
cr, ZFS_GROUP);
} else {
fgid = zfs_fuid_create_cred(zfsvfs,
ZFS_GROUP, tx, cr, fuidp);
#ifdef __FreeBSD__
gid = fgid = parent->z_phys->zp_gid;
#else
gid = crgetgid(cr);
#endif
}
}
}
/*
* If we're creating a directory, and the parent directory has the
* set-GID bit set, set in on the new directory.
* Otherwise, if the user is neither privileged nor a member of the
* file's new group, clear the file's set-GID bit.
*/
if ((parent->z_phys->zp_mode & S_ISGID) && (vap->va_type == VDIR)) {
mode |= S_ISGID;
} else {
if ((mode & S_ISGID) &&
secpolicy_vnode_setids_setgids(ZTOV(zp), cr, gid) != 0)
mode &= ~S_ISGID;
}
zp->z_phys->zp_uid = fuid;
zp->z_phys->zp_gid = fgid;
zp->z_phys->zp_mode = mode;
if (aclp == NULL) {
mutex_enter(&parent->z_lock);
if ((ZTOV(parent)->v_type == VDIR &&
(parent->z_phys->zp_flags & ZFS_INHERIT_ACE)) &&
!(zp->z_phys->zp_flags & ZFS_XATTR)) {
mutex_enter(&parent->z_acl_lock);
VERIFY(0 == zfs_acl_node_read(parent, &paclp, B_FALSE));
mutex_exit(&parent->z_acl_lock);
aclp = zfs_acl_inherit(zp, paclp, mode, &need_chmod);
zfs_acl_free(paclp);
} else {
aclp = zfs_acl_alloc(zfs_acl_version_zp(zp));
}
mutex_exit(&parent->z_lock);
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
if (need_chmod)
zfs_acl_chmod(zp, mode, aclp);
} else {
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
}
/* Force auto_inherit on all new directory objects */
if (vap->va_type == VDIR)
aclp->z_hints |= ZFS_ACL_AUTO_INHERIT;
error = zfs_aclset_common(zp, aclp, cr, fuidp, tx);
/* Set optional attributes if any */
if (vap->va_mask & AT_XVATTR)
zfs_xvattr_set(zp, xvap);
mutex_exit(&zp->z_lock);
mutex_exit(&zp->z_acl_lock);
ASSERT3U(error, ==, 0);
if (aclp != setaclp)
zfs_acl_free(aclp);
}
/*
* Retrieve a files ACL
*/
int
zfs_getacl(znode_t *zp, vsecattr_t *vsecp, boolean_t skipaclchk, cred_t *cr)
{
zfs_acl_t *aclp;
ulong_t mask;
int error;
int count = 0;
int largeace = 0;
mask = vsecp->vsa_mask & (VSA_ACE | VSA_ACECNT |
VSA_ACE_ACLFLAGS | VSA_ACE_ALLTYPES);
if (error = zfs_zaccess(zp, ACE_READ_ACL, 0, skipaclchk, cr))
return (error);
if (mask == 0)
return (ENOSYS);
mutex_enter(&zp->z_acl_lock);
error = zfs_acl_node_read(zp, &aclp, B_FALSE);
if (error != 0) {
mutex_exit(&zp->z_acl_lock);
return (error);
}
/*
* Scan ACL to determine number of ACEs
*/
if ((zp->z_phys->zp_flags & ZFS_ACL_OBJ_ACE) &&
!(mask & VSA_ACE_ALLTYPES)) {
void *zacep = NULL;
uint64_t who;
uint32_t access_mask;
uint16_t type, iflags;
while (zacep = zfs_acl_next_ace(aclp, zacep,
&who, &access_mask, &iflags, &type)) {
switch (type) {
case ACE_ACCESS_ALLOWED_OBJECT_ACE_TYPE:
case ACE_ACCESS_DENIED_OBJECT_ACE_TYPE:
case ACE_SYSTEM_AUDIT_OBJECT_ACE_TYPE:
case ACE_SYSTEM_ALARM_OBJECT_ACE_TYPE:
largeace++;
continue;
default:
count++;
}
}
vsecp->vsa_aclcnt = count;
} else
count = aclp->z_acl_count;
if (mask & VSA_ACECNT) {
vsecp->vsa_aclcnt = count;
}
if (mask & VSA_ACE) {
size_t aclsz;
- zfs_acl_node_t *aclnode = list_head(&aclp->z_acl);
-
aclsz = count * sizeof (ace_t) +
sizeof (ace_object_t) * largeace;
vsecp->vsa_aclentp = kmem_alloc(aclsz, KM_SLEEP);
vsecp->vsa_aclentsz = aclsz;
if (aclp->z_version == ZFS_ACL_VERSION_FUID)
zfs_copy_fuid_2_ace(zp->z_zfsvfs, aclp, cr,
vsecp->vsa_aclentp, !(mask & VSA_ACE_ALLTYPES));
else {
- bcopy(aclnode->z_acldata, vsecp->vsa_aclentp,
- count * sizeof (ace_t));
+ zfs_acl_node_t *aclnode;
+ void *start = vsecp->vsa_aclentp;
+
+ for (aclnode = list_head(&aclp->z_acl); aclnode;
+ aclnode = list_next(&aclp->z_acl, aclnode)) {
+ bcopy(aclnode->z_acldata, start,
+ aclnode->z_size);
+ start = (caddr_t)start + aclnode->z_size;
+ }
+ ASSERT((caddr_t)start - (caddr_t)vsecp->vsa_aclentp ==
+ aclp->z_acl_bytes);
}
}
if (mask & VSA_ACE_ACLFLAGS) {
vsecp->vsa_aclflags = 0;
if (zp->z_phys->zp_flags & ZFS_ACL_DEFAULTED)
vsecp->vsa_aclflags |= ACL_DEFAULTED;
if (zp->z_phys->zp_flags & ZFS_ACL_PROTECTED)
vsecp->vsa_aclflags |= ACL_PROTECTED;
if (zp->z_phys->zp_flags & ZFS_ACL_AUTO_INHERIT)
vsecp->vsa_aclflags |= ACL_AUTO_INHERIT;
}
mutex_exit(&zp->z_acl_lock);
zfs_acl_free(aclp);
return (0);
}
int
zfs_vsec_2_aclp(zfsvfs_t *zfsvfs, vtype_t obj_type,
vsecattr_t *vsecp, zfs_acl_t **zaclp)
{
zfs_acl_t *aclp;
zfs_acl_node_t *aclnode;
int aclcnt = vsecp->vsa_aclcnt;
int error;
if (vsecp->vsa_aclcnt > MAX_ACL_ENTRIES || vsecp->vsa_aclcnt <= 0)
return (EINVAL);
aclp = zfs_acl_alloc(zfs_acl_version(zfsvfs->z_version));
aclp->z_hints = 0;
aclnode = zfs_acl_node_alloc(aclcnt * sizeof (zfs_object_ace_t));
if (aclp->z_version == ZFS_ACL_VERSION_INITIAL) {
if ((error = zfs_copy_ace_2_oldace(obj_type, aclp,
(ace_t *)vsecp->vsa_aclentp, aclnode->z_acldata,
aclcnt, &aclnode->z_size)) != 0) {
zfs_acl_free(aclp);
zfs_acl_node_free(aclnode);
return (error);
}
} else {
if ((error = zfs_copy_ace_2_fuid(obj_type, aclp,
vsecp->vsa_aclentp, aclnode->z_acldata, aclcnt,
&aclnode->z_size)) != 0) {
zfs_acl_free(aclp);
zfs_acl_node_free(aclnode);
return (error);
}
}
aclp->z_acl_bytes = aclnode->z_size;
aclnode->z_ace_count = aclcnt;
aclp->z_acl_count = aclcnt;
list_insert_head(&aclp->z_acl, aclnode);
/*
* If flags are being set then add them to z_hints
*/
if (vsecp->vsa_mask & VSA_ACE_ACLFLAGS) {
if (vsecp->vsa_aclflags & ACL_PROTECTED)
aclp->z_hints |= ZFS_ACL_PROTECTED;
if (vsecp->vsa_aclflags & ACL_DEFAULTED)
aclp->z_hints |= ZFS_ACL_DEFAULTED;
if (vsecp->vsa_aclflags & ACL_AUTO_INHERIT)
aclp->z_hints |= ZFS_ACL_AUTO_INHERIT;
}
*zaclp = aclp;
return (0);
}
/*
* Set a files ACL
*/
int
zfs_setacl(znode_t *zp, vsecattr_t *vsecp, boolean_t skipaclchk, cred_t *cr)
{
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zilog_t *zilog = zfsvfs->z_log;
ulong_t mask = vsecp->vsa_mask & (VSA_ACE | VSA_ACECNT);
dmu_tx_t *tx;
int error;
zfs_acl_t *aclp;
zfs_fuid_info_t *fuidp = NULL;
if (mask == 0)
return (ENOSYS);
if (zp->z_phys->zp_flags & ZFS_IMMUTABLE)
return (EPERM);
if (error = zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr))
return (error);
error = zfs_vsec_2_aclp(zfsvfs, ZTOV(zp)->v_type, vsecp, &aclp);
if (error)
return (error);
/*
* If ACL wide flags aren't being set then preserve any
* existing flags.
*/
if (!(vsecp->vsa_mask & VSA_ACE_ACLFLAGS)) {
aclp->z_hints |= (zp->z_phys->zp_flags & V4_ACL_WIDE_FLAGS);
}
top:
if (error = zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr)) {
zfs_acl_free(aclp);
return (error);
}
mutex_enter(&zp->z_lock);
mutex_enter(&zp->z_acl_lock);
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
if (zp->z_phys->zp_acl.z_acl_extern_obj) {
/* Are we upgrading ACL? */
if (zfsvfs->z_version <= ZPL_VERSION_FUID &&
zp->z_phys->zp_acl.z_acl_version ==
ZFS_ACL_VERSION_INITIAL) {
dmu_tx_hold_free(tx,
zp->z_phys->zp_acl.z_acl_extern_obj,
0, DMU_OBJECT_END);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, aclp->z_acl_bytes);
} else {
dmu_tx_hold_write(tx,
zp->z_phys->zp_acl.z_acl_extern_obj,
0, aclp->z_acl_bytes);
}
} else if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, aclp->z_acl_bytes);
}
if (aclp->z_has_fuids) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
mutex_exit(&zp->z_acl_lock);
mutex_exit(&zp->z_lock);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
zfs_acl_free(aclp);
return (error);
}
error = zfs_aclset_common(zp, aclp, cr, &fuidp, tx);
ASSERT(error == 0);
zfs_log_acl(zilog, tx, zp, vsecp, fuidp);
if (fuidp)
zfs_fuid_info_free(fuidp);
zfs_acl_free(aclp);
dmu_tx_commit(tx);
done:
mutex_exit(&zp->z_acl_lock);
mutex_exit(&zp->z_lock);
return (error);
}
/*
* working_mode returns the permissions that were not granted
*/
static int
zfs_zaccess_common(znode_t *zp, uint32_t v4_mode, uint32_t *working_mode,
boolean_t *check_privs, boolean_t skipaclchk, cred_t *cr)
{
zfs_acl_t *aclp;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
uid_t uid = crgetuid(cr);
uint64_t who;
uint16_t type, iflags;
uint16_t entry_type;
uint32_t access_mask;
uint32_t deny_mask = 0;
zfs_ace_hdr_t *acep = NULL;
boolean_t checkit;
uid_t fowner;
uid_t gowner;
/*
* Short circuit empty requests
*/
if (v4_mode == 0)
return (0);
*check_privs = B_TRUE;
if (zfsvfs->z_assign >= TXG_INITIAL) { /* ZIL replay */
*working_mode = 0;
return (0);
}
*working_mode = v4_mode;
if ((v4_mode & WRITE_MASK) &&
(zp->z_zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) &&
(!IS_DEVVP(ZTOV(zp)))) {
*check_privs = B_FALSE;
return (EROFS);
}
/*
* Only check for READONLY on non-directories.
*/
if ((v4_mode & WRITE_MASK_DATA) &&
(((ZTOV(zp)->v_type != VDIR) &&
(zp->z_phys->zp_flags & (ZFS_READONLY | ZFS_IMMUTABLE))) ||
(ZTOV(zp)->v_type == VDIR &&
(zp->z_phys->zp_flags & ZFS_IMMUTABLE)))) {
*check_privs = B_FALSE;
return (EPERM);
}
#ifdef sun
if ((v4_mode & (ACE_DELETE | ACE_DELETE_CHILD)) &&
(zp->z_phys->zp_flags & ZFS_NOUNLINK)) {
*check_privs = B_FALSE;
return (EPERM);
}
#else
/*
* In FreeBSD we allow to modify directory's content is ZFS_NOUNLINK
* (sunlnk) is set. We just don't allow directory removal, which is
* handled in zfs_zaccess_delete().
*/
if ((v4_mode & ACE_DELETE) &&
(zp->z_phys->zp_flags & ZFS_NOUNLINK)) {
*check_privs = B_FALSE;
return (EPERM);
}
#endif
if (((v4_mode & (ACE_READ_DATA|ACE_EXECUTE)) &&
(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED))) {
*check_privs = B_FALSE;
return (EACCES);
}
/*
* The caller requested that the ACL check be skipped. This
* would only happen if the caller checked VOP_ACCESS() with a
* 32 bit ACE mask and already had the appropriate permissions.
*/
if (skipaclchk) {
*working_mode = 0;
return (0);
}
zfs_fuid_map_ids(zp, cr, &fowner, &gowner);
mutex_enter(&zp->z_acl_lock);
error = zfs_acl_node_read(zp, &aclp, B_FALSE);
if (error != 0) {
mutex_exit(&zp->z_acl_lock);
return (error);
}
while (acep = zfs_acl_next_ace(aclp, acep, &who, &access_mask,
&iflags, &type)) {
if (!zfs_acl_valid_ace_type(type, iflags))
continue;
if (ZTOV(zp)->v_type == VDIR && (iflags & ACE_INHERIT_ONLY_ACE))
continue;
entry_type = (iflags & ACE_TYPE_FLAGS);
checkit = B_FALSE;
switch (entry_type) {
case ACE_OWNER:
if (uid == fowner)
checkit = B_TRUE;
break;
case OWNING_GROUP:
who = gowner;
/*FALLTHROUGH*/
case ACE_IDENTIFIER_GROUP:
checkit = zfs_groupmember(zfsvfs, who, cr);
break;
case ACE_EVERYONE:
checkit = B_TRUE;
break;
/* USER Entry */
default:
if (entry_type == 0) {
uid_t newid;
newid = zfs_fuid_map_id(zfsvfs, who, cr,
ZFS_ACE_USER);
if (newid != IDMAP_WK_CREATOR_OWNER_UID &&
uid == newid)
checkit = B_TRUE;
break;
} else {
zfs_acl_free(aclp);
mutex_exit(&zp->z_acl_lock);
return (EIO);
}
}
if (checkit) {
uint32_t mask_matched = (access_mask & *working_mode);
if (mask_matched) {
if (type == DENY)
deny_mask |= mask_matched;
*working_mode &= ~mask_matched;
}
}
/* Are we done? */
if (*working_mode == 0)
break;
}
mutex_exit(&zp->z_acl_lock);
zfs_acl_free(aclp);
/* Put the found 'denies' back on the working mode */
if (deny_mask) {
*working_mode |= deny_mask;
return (EACCES);
} else if (*working_mode) {
return (-1);
}
return (0);
}
static int
zfs_zaccess_append(znode_t *zp, uint32_t *working_mode, boolean_t *check_privs,
cred_t *cr)
{
if (*working_mode != ACE_WRITE_DATA)
return (EACCES);
return (zfs_zaccess_common(zp, ACE_APPEND_DATA, working_mode,
check_privs, B_FALSE, cr));
}
/*
* Determine whether Access should be granted/denied, invoking least
* priv subsytem when a deny is determined.
*/
int
zfs_zaccess(znode_t *zp, int mode, int flags, boolean_t skipaclchk, cred_t *cr)
{
uint32_t working_mode;
int error;
int is_attr;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
boolean_t check_privs;
znode_t *xzp;
znode_t *check_zp = zp;
is_attr = ((zp->z_phys->zp_flags & ZFS_XATTR) &&
(ZTOV(zp)->v_type == VDIR));
#ifdef __FreeBSD__
/*
* In FreeBSD, we don't care about permissions of individual ADS.
* Note that not checking them is not just an optimization - without
* this shortcut, EA operations may bogusly fail with EACCES.
*/
if (zp->z_phys->zp_flags & ZFS_XATTR)
return (0);
#else
/*
* If attribute then validate against base file
*/
if (is_attr) {
if ((error = zfs_zget(zp->z_zfsvfs,
zp->z_phys->zp_parent, &xzp)) != 0) {
return (error);
}
check_zp = xzp;
/*
* fixup mode to map to xattr perms
*/
if (mode & (ACE_WRITE_DATA|ACE_APPEND_DATA)) {
mode &= ~(ACE_WRITE_DATA|ACE_APPEND_DATA);
mode |= ACE_WRITE_NAMED_ATTRS;
}
if (mode & (ACE_READ_DATA|ACE_EXECUTE)) {
mode &= ~(ACE_READ_DATA|ACE_EXECUTE);
mode |= ACE_READ_NAMED_ATTRS;
}
}
#endif
if ((error = zfs_zaccess_common(check_zp, mode, &working_mode,
&check_privs, skipaclchk, cr)) == 0) {
if (is_attr)
VN_RELE(ZTOV(xzp));
return (0);
}
if (error && !check_privs) {
if (is_attr)
VN_RELE(ZTOV(xzp));
return (error);
}
if (error && (flags & V_APPEND)) {
error = zfs_zaccess_append(zp, &working_mode, &check_privs, cr);
}
if (error && check_privs) {
uid_t owner;
mode_t checkmode = 0;
owner = zfs_fuid_map_id(zfsvfs, check_zp->z_phys->zp_uid, cr,
ZFS_OWNER);
/*
* First check for implicit owner permission on
* read_acl/read_attributes
*/
error = 0;
ASSERT(working_mode != 0);
if ((working_mode & (ACE_READ_ACL|ACE_READ_ATTRIBUTES) &&
owner == crgetuid(cr)))
working_mode &= ~(ACE_READ_ACL|ACE_READ_ATTRIBUTES);
if (working_mode & (ACE_READ_DATA|ACE_READ_NAMED_ATTRS|
ACE_READ_ACL|ACE_READ_ATTRIBUTES|ACE_SYNCHRONIZE))
checkmode |= VREAD;
if (working_mode & (ACE_WRITE_DATA|ACE_WRITE_NAMED_ATTRS|
ACE_APPEND_DATA|ACE_WRITE_ATTRIBUTES|ACE_SYNCHRONIZE))
checkmode |= VWRITE;
if (working_mode & ACE_EXECUTE)
checkmode |= VEXEC;
if (checkmode)
error = secpolicy_vnode_access(cr, ZTOV(check_zp),
owner, checkmode);
if (error == 0 && (working_mode & ACE_WRITE_OWNER))
error = secpolicy_vnode_chown(ZTOV(check_zp), cr, B_TRUE);
if (error == 0 && (working_mode & ACE_WRITE_ACL))
error = secpolicy_vnode_setdac(ZTOV(check_zp), cr, owner);
if (error == 0 && (working_mode &
(ACE_DELETE|ACE_DELETE_CHILD)))
error = secpolicy_vnode_remove(ZTOV(check_zp), cr);
if (error == 0 && (working_mode & ACE_SYNCHRONIZE)) {
error = secpolicy_vnode_chown(ZTOV(check_zp), cr, B_FALSE);
}
if (error == 0) {
/*
* See if any bits other than those already checked
* for are still present. If so then return EACCES
*/
if (working_mode & ~(ZFS_CHECKED_MASKS)) {
error = EACCES;
}
}
}
if (is_attr)
VN_RELE(ZTOV(xzp));
return (error);
}
/*
* Translate traditional unix VREAD/VWRITE/VEXEC mode into
* native ACL format and call zfs_zaccess()
*/
int
zfs_zaccess_rwx(znode_t *zp, mode_t mode, int flags, cred_t *cr)
{
return (zfs_zaccess(zp, zfs_unix_to_v4(mode >> 6), flags, B_FALSE, cr));
}
/*
* Access function for secpolicy_vnode_setattr
*/
int
zfs_zaccess_unix(znode_t *zp, mode_t mode, cred_t *cr)
{
int v4_mode = zfs_unix_to_v4(mode >> 6);
return (zfs_zaccess(zp, v4_mode, 0, B_FALSE, cr));
}
static int
zfs_delete_final_check(znode_t *zp, znode_t *dzp,
mode_t missing_perms, cred_t *cr)
{
int error;
uid_t downer;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
downer = zfs_fuid_map_id(zfsvfs, dzp->z_phys->zp_uid, cr, ZFS_OWNER);
error = secpolicy_vnode_access(cr, ZTOV(dzp), downer, missing_perms);
if (error == 0)
error = zfs_sticky_remove_access(dzp, zp, cr);
return (error);
}
/*
* Determine whether Access should be granted/deny, without
* consulting least priv subsystem.
*
*
* The following chart is the recommended NFSv4 enforcement for
* ability to delete an object.
*
* -------------------------------------------------------
* | Parent Dir | Target Object Permissions |
* | permissions | |
* -------------------------------------------------------
* | | ACL Allows | ACL Denies| Delete |
* | | Delete | Delete | unspecified|
* -------------------------------------------------------
* | ACL Allows | Permit | Permit | Permit |
* | DELETE_CHILD | |
* -------------------------------------------------------
* | ACL Denies | Permit | Deny | Deny |
* | DELETE_CHILD | | | |
* -------------------------------------------------------
* | ACL specifies | | | |
* | only allow | Permit | Permit | Permit |
* | write and | | | |
* | execute | | | |
* -------------------------------------------------------
* | ACL denies | | | |
* | write and | Permit | Deny | Deny |
* | execute | | | |
* -------------------------------------------------------
* ^
* |
* No search privilege, can't even look up file?
*
*/
int
zfs_zaccess_delete(znode_t *dzp, znode_t *zp, cred_t *cr)
{
uint32_t dzp_working_mode = 0;
uint32_t zp_working_mode = 0;
int dzp_error, zp_error;
mode_t missing_perms;
boolean_t dzpcheck_privs = B_TRUE;
boolean_t zpcheck_privs = B_TRUE;
/*
* We want specific DELETE permissions to
* take precedence over WRITE/EXECUTE. We don't
* want an ACL such as this to mess us up.
* user:joe:write_data:deny,user:joe:delete:allow
*
* However, deny permissions may ultimately be overridden
* by secpolicy_vnode_access().
*
* We will ask for all of the necessary permissions and then
* look at the working modes from the directory and target object
* to determine what was found.
*/
if (zp->z_phys->zp_flags & (ZFS_IMMUTABLE | ZFS_NOUNLINK))
return (EPERM);
/*
* First row
* If the directory permissions allow the delete, we are done.
*/
if ((dzp_error = zfs_zaccess_common(dzp, ACE_DELETE_CHILD,
&dzp_working_mode, &dzpcheck_privs, B_FALSE, cr)) == 0)
return (0);
/*
* If target object has delete permission then we are done
*/
if ((zp_error = zfs_zaccess_common(zp, ACE_DELETE, &zp_working_mode,
&zpcheck_privs, B_FALSE, cr)) == 0)
return (0);
ASSERT(dzp_error && zp_error);
if (!dzpcheck_privs)
return (dzp_error);
if (!zpcheck_privs)
return (zp_error);
/*
* Second row
*
* If directory returns EACCES then delete_child was denied
* due to deny delete_child. In this case send the request through
* secpolicy_vnode_remove(). We don't use zfs_delete_final_check()
* since that *could* allow the delete based on write/execute permission
* and we want delete permissions to override write/execute.
*/
if (dzp_error == EACCES)
return (secpolicy_vnode_remove(ZTOV(dzp), cr)); /* XXXPJD: s/dzp/zp/ ? */
/*
* Third Row
* only need to see if we have write/execute on directory.
*/
if ((dzp_error = zfs_zaccess_common(dzp, ACE_EXECUTE|ACE_WRITE_DATA,
&dzp_working_mode, &dzpcheck_privs, B_FALSE, cr)) == 0)
return (zfs_sticky_remove_access(dzp, zp, cr));
if (!dzpcheck_privs)
return (dzp_error);
/*
* Fourth row
*/
missing_perms = (dzp_working_mode & ACE_WRITE_DATA) ? VWRITE : 0;
missing_perms |= (dzp_working_mode & ACE_EXECUTE) ? VEXEC : 0;
ASSERT(missing_perms);
return (zfs_delete_final_check(zp, dzp, missing_perms, cr));
}
int
zfs_zaccess_rename(znode_t *sdzp, znode_t *szp, znode_t *tdzp,
znode_t *tzp, cred_t *cr)
{
int add_perm;
int error;
if (szp->z_phys->zp_flags & ZFS_AV_QUARANTINED)
return (EACCES);
add_perm = (ZTOV(szp)->v_type == VDIR) ?
ACE_ADD_SUBDIRECTORY : ACE_ADD_FILE;
/*
* Rename permissions are combination of delete permission +
* add file/subdir permission.
*
* BSD operating systems also require write permission
* on the directory being moved from one parent directory
* to another.
*/
if (ZTOV(szp)->v_type == VDIR && ZTOV(sdzp) != ZTOV(tdzp)) {
if (error = zfs_zaccess(szp, ACE_WRITE_DATA, 0, B_FALSE, cr))
return (error);
}
/*
* first make sure we do the delete portion.
*
* If that succeeds then check for add_file/add_subdir permissions
*/
if (error = zfs_zaccess_delete(sdzp, szp, cr))
return (error);
/*
* If we have a tzp, see if we can delete it?
*/
if (tzp) {
if (error = zfs_zaccess_delete(tdzp, tzp, cr))
return (error);
}
/*
* Now check for add permissions
*/
error = zfs_zaccess(tdzp, add_perm, 0, B_FALSE, cr);
return (error);
}
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c (revision 209274)
@@ -1,5053 +1,5051 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
/* Portions Copyright 2007 Jeremy Teo */
#include <sys/types.h>
#include <sys/param.h>
#include <sys/time.h>
#include <sys/systm.h>
#include <sys/sysmacros.h>
#include <sys/resource.h>
#include <sys/vfs.h>
#include <sys/vnode.h>
#include <sys/file.h>
#include <sys/stat.h>
#include <sys/kmem.h>
#include <sys/taskq.h>
#include <sys/uio.h>
#include <sys/atomic.h>
#include <sys/namei.h>
#include <sys/mman.h>
#include <sys/cmn_err.h>
#include <sys/errno.h>
#include <sys/unistd.h>
#include <sys/zfs_dir.h>
#include <sys/zfs_ioctl.h>
#include <sys/fs/zfs.h>
#include <sys/dmu.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/dbuf.h>
#include <sys/zap.h>
#include <sys/dirent.h>
#include <sys/policy.h>
#include <sys/sunddi.h>
#include <sys/filio.h>
#include <sys/zfs_ctldir.h>
#include <sys/zfs_fuid.h>
#include <sys/dnlc.h>
#include <sys/zfs_rlock.h>
#include <sys/extdirent.h>
#include <sys/kidmap.h>
#include <sys/bio.h>
#include <sys/buf.h>
#include <sys/sf_buf.h>
#include <sys/sched.h>
#include <sys/acl.h>
/*
* Programming rules.
*
* Each vnode op performs some logical unit of work. To do this, the ZPL must
* properly lock its in-core state, create a DMU transaction, do the work,
* record this work in the intent log (ZIL), commit the DMU transaction,
* and wait for the intent log to commit if it is a synchronous operation.
* Moreover, the vnode ops must work in both normal and log replay context.
* The ordering of events is important to avoid deadlocks and references
* to freed memory. The example below illustrates the following Big Rules:
*
* (1) A check must be made in each zfs thread for a mounted file system.
* This is done avoiding races using ZFS_ENTER(zfsvfs).
* A ZFS_EXIT(zfsvfs) is needed before all returns. Any znodes
* must be checked with ZFS_VERIFY_ZP(zp). Both of these macros
* can return EIO from the calling function.
*
* (2) VN_RELE() should always be the last thing except for zil_commit()
* (if necessary) and ZFS_EXIT(). This is for 3 reasons:
* First, if it's the last reference, the vnode/znode
* can be freed, so the zp may point to freed memory. Second, the last
* reference will call zfs_zinactive(), which may induce a lot of work --
* pushing cached pages (which acquires range locks) and syncing out
* cached atime changes. Third, zfs_zinactive() may require a new tx,
* which could deadlock the system if you were already holding one.
* If you must call VN_RELE() within a tx then use VN_RELE_ASYNC().
*
* (3) All range locks must be grabbed before calling dmu_tx_assign(),
* as they can span dmu_tx_assign() calls.
*
* (4) Always pass zfsvfs->z_assign as the second argument to dmu_tx_assign().
* In normal operation, this will be TXG_NOWAIT. During ZIL replay,
* it will be a specific txg. Either way, dmu_tx_assign() never blocks.
* This is critical because we don't want to block while holding locks.
* Note, in particular, that if a lock is sometimes acquired before
* the tx assigns, and sometimes after (e.g. z_lock), then failing to
* use a non-blocking assign can deadlock the system. The scenario:
*
* Thread A has grabbed a lock before calling dmu_tx_assign().
* Thread B is in an already-assigned tx, and blocks for this lock.
* Thread A calls dmu_tx_assign(TXG_WAIT) and blocks in txg_wait_open()
* forever, because the previous txg can't quiesce until B's tx commits.
*
* If dmu_tx_assign() returns ERESTART and zfsvfs->z_assign is TXG_NOWAIT,
* then drop all locks, call dmu_tx_wait(), and try again.
*
* (5) If the operation succeeded, generate the intent log entry for it
* before dropping locks. This ensures that the ordering of events
* in the intent log matches the order in which they actually occurred.
*
* (6) At the end of each vnode op, the DMU tx must always commit,
* regardless of whether there were any errors.
*
* (7) After dropping all locks, invoke zil_commit(zilog, seq, foid)
* to ensure that synchronous semantics are provided when necessary.
*
* In general, this is how things should be ordered in each vnode op:
*
* ZFS_ENTER(zfsvfs); // exit if unmounted
* top:
* zfs_dirent_lock(&dl, ...) // lock directory entry (may VN_HOLD())
* rw_enter(...); // grab any other locks you need
* tx = dmu_tx_create(...); // get DMU tx
* dmu_tx_hold_*(); // hold each object you might modify
* error = dmu_tx_assign(tx, zfsvfs->z_assign); // try to assign
* if (error) {
* rw_exit(...); // drop locks
* zfs_dirent_unlock(dl); // unlock directory entry
* VN_RELE(...); // release held vnodes
* if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
* dmu_tx_wait(tx);
* dmu_tx_abort(tx);
* goto top;
* }
* dmu_tx_abort(tx); // abort DMU tx
* ZFS_EXIT(zfsvfs); // finished in zfs
* return (error); // really out of space
* }
* error = do_real_work(); // do whatever this VOP does
* if (error == 0)
* zfs_log_*(...); // on success, make ZIL entry
* dmu_tx_commit(tx); // commit DMU tx -- error or not
* rw_exit(...); // drop locks
* zfs_dirent_unlock(dl); // unlock directory entry
* VN_RELE(...); // release held vnodes
* zil_commit(zilog, seq, foid); // synchronous when necessary
* ZFS_EXIT(zfsvfs); // finished in zfs
* return (error); // done, report error
*/
/* ARGSUSED */
static int
zfs_open(vnode_t **vpp, int flag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(*vpp);
if ((flag & FWRITE) && (zp->z_phys->zp_flags & ZFS_APPENDONLY) &&
((flag & FAPPEND) == 0)) {
return (EPERM);
}
if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
ZTOV(zp)->v_type == VREG &&
!(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) &&
zp->z_phys->zp_size > 0)
if (fs_vscan(*vpp, cr, 0) != 0)
return (EACCES);
/* Keep a count of the synchronous opens in the znode */
if (flag & (FSYNC | FDSYNC))
atomic_inc_32(&zp->z_sync_cnt);
return (0);
}
/* ARGSUSED */
static int
zfs_close(vnode_t *vp, int flag, int count, offset_t offset, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
/* Decrement the synchronous opens in the znode */
if ((flag & (FSYNC | FDSYNC)) && (count == 1))
atomic_dec_32(&zp->z_sync_cnt);
/*
* Clean up any locks held by this process on the vp.
*/
cleanlocks(vp, ddi_get_pid(), 0);
cleanshares(vp, ddi_get_pid());
if (!zfs_has_ctldir(zp) && zp->z_zfsvfs->z_vscan &&
ZTOV(zp)->v_type == VREG &&
!(zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) &&
zp->z_phys->zp_size > 0)
VERIFY(fs_vscan(vp, cr, 1) == 0);
return (0);
}
/*
* Lseek support for finding holes (cmd == _FIO_SEEK_HOLE) and
* data (cmd == _FIO_SEEK_DATA). "off" is an in/out parameter.
*/
static int
zfs_holey(vnode_t *vp, u_long cmd, offset_t *off)
{
znode_t *zp = VTOZ(vp);
uint64_t noff = (uint64_t)*off; /* new offset */
uint64_t file_sz;
int error;
boolean_t hole;
file_sz = zp->z_phys->zp_size;
if (noff >= file_sz) {
return (ENXIO);
}
if (cmd == _FIO_SEEK_HOLE)
hole = B_TRUE;
else
hole = B_FALSE;
error = dmu_offset_next(zp->z_zfsvfs->z_os, zp->z_id, hole, &noff);
/* end of file? */
if ((error == ESRCH) || (noff > file_sz)) {
/*
* Handle the virtual hole at the end of file.
*/
if (hole) {
*off = file_sz;
return (0);
}
return (ENXIO);
}
if (noff < *off)
return (error);
*off = noff;
return (error);
}
/* ARGSUSED */
static int
zfs_ioctl(vnode_t *vp, u_long com, intptr_t data, int flag, cred_t *cred,
int *rvalp, caller_context_t *ct)
{
offset_t off;
int error;
zfsvfs_t *zfsvfs;
znode_t *zp;
switch (com) {
case _FIOFFS:
return (0);
/*
* The following two ioctls are used by bfu. Faking out,
* necessary to avoid bfu errors.
*/
case _FIOGDIO:
case _FIOSDIO:
return (0);
case _FIO_SEEK_DATA:
case _FIO_SEEK_HOLE:
if (ddi_copyin((void *)data, &off, sizeof (off), flag))
return (EFAULT);
zp = VTOZ(vp);
zfsvfs = zp->z_zfsvfs;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/* offset parameter is in/out */
error = zfs_holey(vp, com, &off);
ZFS_EXIT(zfsvfs);
if (error)
return (error);
if (ddi_copyout(&off, (void *)data, sizeof (off), flag))
return (EFAULT);
return (0);
}
return (ENOTTY);
}
/*
* When a file is memory mapped, we must keep the IO data synchronized
* between the DMU cache and the memory mapped pages. What this means:
*
* On Write: If we find a memory mapped page, we write to *both*
* the page and the dmu buffer.
*
* NOTE: We will always "break up" the IO into PAGESIZE uiomoves when
* the file is memory mapped.
*/
static int
mappedwrite(vnode_t *vp, int nbytes, uio_t *uio, dmu_tx_t *tx)
{
znode_t *zp = VTOZ(vp);
objset_t *os = zp->z_zfsvfs->z_os;
vm_object_t obj;
vm_page_t m;
struct sf_buf *sf;
int64_t start, off;
int len = nbytes;
int error = 0;
uint64_t dirbytes;
ASSERT(vp->v_mount != NULL);
obj = vp->v_object;
ASSERT(obj != NULL);
start = uio->uio_loffset;
off = start & PAGEOFFSET;
dirbytes = 0;
VM_OBJECT_LOCK(obj);
for (start &= PAGEMASK; len > 0; start += PAGESIZE) {
uint64_t bytes = MIN(PAGESIZE - off, len);
uint64_t fsize;
again:
if ((m = vm_page_lookup(obj, OFF_TO_IDX(start))) != NULL &&
vm_page_is_valid(m, (vm_offset_t)off, bytes)) {
uint64_t woff;
caddr_t va;
if (vm_page_sleep_if_busy(m, FALSE, "zfsmwb"))
goto again;
fsize = obj->un_pager.vnp.vnp_size;
vm_page_busy(m);
vm_page_lock_queues();
vm_page_undirty(m);
vm_page_unlock_queues();
VM_OBJECT_UNLOCK(obj);
if (dirbytes > 0) {
error = dmu_write_uio(os, zp->z_id, uio,
dirbytes, tx);
dirbytes = 0;
}
if (error == 0) {
sched_pin();
sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
va = (caddr_t)sf_buf_kva(sf);
woff = uio->uio_loffset - off;
error = uiomove(va + off, bytes, UIO_WRITE, uio);
/*
* The uiomove() above could have been partially
* successful, that's why we call dmu_write()
* below unconditionally. The page was marked
* non-dirty above and we would lose the changes
* without doing so. If the uiomove() failed
* entirely, well, we just write what we got
* before one more time.
*/
dmu_write(os, zp->z_id, woff,
MIN(PAGESIZE, fsize - woff), va, tx);
sf_buf_free(sf);
sched_unpin();
}
VM_OBJECT_LOCK(obj);
vm_page_wakeup(m);
} else {
if (__predict_false(obj->cache != NULL)) {
vm_page_cache_free(obj, OFF_TO_IDX(start),
OFF_TO_IDX(start) + 1);
}
dirbytes += bytes;
}
len -= bytes;
off = 0;
if (error)
break;
}
VM_OBJECT_UNLOCK(obj);
if (error == 0 && dirbytes > 0)
error = dmu_write_uio(os, zp->z_id, uio, dirbytes, tx);
return (error);
}
/*
* When a file is memory mapped, we must keep the IO data synchronized
* between the DMU cache and the memory mapped pages. What this means:
*
* On Read: We "read" preferentially from memory mapped pages,
* else we default from the dmu buffer.
*
* NOTE: We will always "break up" the IO into PAGESIZE uiomoves when
* the file is memory mapped.
*/
static int
mappedread(vnode_t *vp, int nbytes, uio_t *uio)
{
znode_t *zp = VTOZ(vp);
objset_t *os = zp->z_zfsvfs->z_os;
vm_object_t obj;
vm_page_t m;
struct sf_buf *sf;
int64_t start, off;
caddr_t va;
int len = nbytes;
int error = 0;
uint64_t dirbytes;
ASSERT(vp->v_mount != NULL);
obj = vp->v_object;
ASSERT(obj != NULL);
start = uio->uio_loffset;
off = start & PAGEOFFSET;
dirbytes = 0;
VM_OBJECT_LOCK(obj);
for (start &= PAGEMASK; len > 0; start += PAGESIZE) {
uint64_t bytes = MIN(PAGESIZE - off, len);
again:
if ((m = vm_page_lookup(obj, OFF_TO_IDX(start))) != NULL &&
vm_page_is_valid(m, (vm_offset_t)off, bytes)) {
if (vm_page_sleep_if_busy(m, FALSE, "zfsmrb"))
goto again;
vm_page_busy(m);
VM_OBJECT_UNLOCK(obj);
if (dirbytes > 0) {
error = dmu_read_uio(os, zp->z_id, uio,
dirbytes);
dirbytes = 0;
}
if (error == 0) {
sched_pin();
sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
va = (caddr_t)sf_buf_kva(sf);
error = uiomove(va + off, bytes, UIO_READ, uio);
sf_buf_free(sf);
sched_unpin();
}
VM_OBJECT_LOCK(obj);
vm_page_wakeup(m);
} else if (m != NULL && uio->uio_segflg == UIO_NOCOPY) {
/*
* The code below is here to make sendfile(2) work
* correctly with ZFS. As pointed out by ups@
* sendfile(2) should be changed to use VOP_GETPAGES(),
* but it pessimize performance of sendfile/UFS, that's
* why I handle this special case in ZFS code.
*/
if (vm_page_sleep_if_busy(m, FALSE, "zfsmrb"))
goto again;
vm_page_busy(m);
VM_OBJECT_UNLOCK(obj);
if (dirbytes > 0) {
error = dmu_read_uio(os, zp->z_id, uio,
dirbytes);
dirbytes = 0;
}
if (error == 0) {
sched_pin();
sf = sf_buf_alloc(m, SFB_CPUPRIVATE);
va = (caddr_t)sf_buf_kva(sf);
error = dmu_read(os, zp->z_id, start + off,
bytes, (void *)(va + off));
sf_buf_free(sf);
sched_unpin();
}
VM_OBJECT_LOCK(obj);
vm_page_wakeup(m);
if (error == 0)
uio->uio_resid -= bytes;
} else {
dirbytes += bytes;
}
len -= bytes;
off = 0;
if (error)
break;
}
VM_OBJECT_UNLOCK(obj);
if (error == 0 && dirbytes > 0)
error = dmu_read_uio(os, zp->z_id, uio, dirbytes);
return (error);
}
offset_t zfs_read_chunk_size = 1024 * 1024; /* Tunable */
/*
* Read bytes from specified file into supplied buffer.
*
* IN: vp - vnode of file to be read from.
* uio - structure supplying read location, range info,
* and return buffer.
* ioflag - SYNC flags; used to provide FRSYNC semantics.
* cr - credentials of caller.
* ct - caller context
*
* OUT: uio - updated offset and range, buffer filled.
*
* RETURN: 0 if success
* error code if failure
*
* Side Effects:
* vp - atime updated if byte count > 0
*/
/* ARGSUSED */
static int
zfs_read(vnode_t *vp, uio_t *uio, int ioflag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
objset_t *os;
ssize_t n, nbytes;
int error;
rl_t *rl;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
os = zfsvfs->z_os;
if (zp->z_phys->zp_flags & ZFS_AV_QUARANTINED) {
ZFS_EXIT(zfsvfs);
return (EACCES);
}
/*
* Validate file offset
*/
if (uio->uio_loffset < (offset_t)0) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* Fasttrack empty reads
*/
if (uio->uio_resid == 0) {
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* Check for mandatory locks
*/
if (MANDMODE((mode_t)zp->z_phys->zp_mode)) {
if (error = chklock(vp, FREAD,
uio->uio_loffset, uio->uio_resid, uio->uio_fmode, ct)) {
ZFS_EXIT(zfsvfs);
return (error);
}
}
/*
* If we're in FRSYNC mode, sync out this znode before reading it.
*/
if (ioflag & FRSYNC)
zil_commit(zfsvfs->z_log, zp->z_last_itx, zp->z_id);
/*
* Lock the range against changes.
*/
rl = zfs_range_lock(zp, uio->uio_loffset, uio->uio_resid, RL_READER);
/*
* If we are reading past end-of-file we can skip
* to the end; but we might still need to set atime.
*/
if (uio->uio_loffset >= zp->z_phys->zp_size) {
error = 0;
goto out;
}
ASSERT(uio->uio_loffset < zp->z_phys->zp_size);
n = MIN(uio->uio_resid, zp->z_phys->zp_size - uio->uio_loffset);
while (n > 0) {
nbytes = MIN(n, zfs_read_chunk_size -
P2PHASE(uio->uio_loffset, zfs_read_chunk_size));
if (vn_has_cached_data(vp))
error = mappedread(vp, nbytes, uio);
else
error = dmu_read_uio(os, zp->z_id, uio, nbytes);
if (error) {
/* convert checksum errors into IO errors */
if (error == ECKSUM)
error = EIO;
break;
}
n -= nbytes;
}
out:
zfs_range_unlock(rl);
ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Fault in the pages of the first n bytes specified by the uio structure.
* 1 byte in each page is touched and the uio struct is unmodified.
* Any error will exit this routine as this is only a best
* attempt to get the pages resident. This is a copy of ufs_trans_touch().
*/
static void
zfs_prefault_write(ssize_t n, struct uio *uio)
{
struct iovec *iov;
ulong_t cnt, incr;
caddr_t p;
if (uio->uio_segflg != UIO_USERSPACE)
return;
iov = uio->uio_iov;
while (n) {
cnt = MIN(iov->iov_len, n);
if (cnt == 0) {
/* empty iov entry */
iov++;
continue;
}
n -= cnt;
/*
* touch each page in this segment.
*/
p = iov->iov_base;
while (cnt) {
if (fubyte(p) == -1)
return;
incr = MIN(cnt, PAGESIZE);
p += incr;
cnt -= incr;
}
/*
* touch the last byte in case it straddles a page.
*/
p--;
if (fubyte(p) == -1)
return;
iov++;
}
}
/*
* Write the bytes to a file.
*
* IN: vp - vnode of file to be written to.
* uio - structure supplying write location, range info,
* and data buffer.
* ioflag - IO_APPEND flag set if in append mode.
* cr - credentials of caller.
* ct - caller context (NFS/CIFS fem monitor only)
*
* OUT: uio - updated offset and range.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - ctime|mtime updated if byte count > 0
*/
/* ARGSUSED */
static int
zfs_write(vnode_t *vp, uio_t *uio, int ioflag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
rlim64_t limit = MAXOFFSET_T;
ssize_t start_resid = uio->uio_resid;
ssize_t tx_bytes;
uint64_t end_size;
dmu_tx_t *tx;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zilog_t *zilog;
offset_t woff;
ssize_t n, nbytes;
rl_t *rl;
int max_blksz = zfsvfs->z_max_blksz;
uint64_t pflags;
int error;
/*
* Fasttrack empty write
*/
n = start_resid;
if (n == 0)
return (0);
if (limit == RLIM64_INFINITY || limit > MAXOFFSET_T)
limit = MAXOFFSET_T;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/*
* If immutable or not appending then return EPERM
*/
pflags = zp->z_phys->zp_flags;
if ((pflags & (ZFS_IMMUTABLE | ZFS_READONLY)) ||
((pflags & ZFS_APPENDONLY) && !(ioflag & FAPPEND) &&
(uio->uio_loffset < zp->z_phys->zp_size))) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
zilog = zfsvfs->z_log;
/*
* Pre-fault the pages to ensure slow (eg NFS) pages
* don't hold up txg.
*/
zfs_prefault_write(n, uio);
/*
* If in append mode, set the io offset pointer to eof.
*/
if (ioflag & IO_APPEND) {
/*
* Range lock for a file append:
* The value for the start of range will be determined by
* zfs_range_lock() (to guarantee append semantics).
* If this write will cause the block size to increase,
* zfs_range_lock() will lock the entire file, so we must
* later reduce the range after we grow the block size.
*/
rl = zfs_range_lock(zp, 0, n, RL_APPEND);
if (rl->r_len == UINT64_MAX) {
/* overlocked, zp_size can't change */
woff = uio->uio_loffset = zp->z_phys->zp_size;
} else {
woff = uio->uio_loffset = rl->r_off;
}
} else {
woff = uio->uio_loffset;
/*
* Validate file offset
*/
if (woff < 0) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* If we need to grow the block size then zfs_range_lock()
* will lock a wider range than we request here.
* Later after growing the block size we reduce the range.
*/
rl = zfs_range_lock(zp, woff, n, RL_WRITER);
}
if (woff >= limit) {
zfs_range_unlock(rl);
ZFS_EXIT(zfsvfs);
return (EFBIG);
}
if ((woff + n) > limit || woff > (limit - n))
n = limit - woff;
/*
* Check for mandatory locks
*/
if (MANDMODE((mode_t)zp->z_phys->zp_mode) &&
(error = chklock(vp, FWRITE, woff, n, uio->uio_fmode, ct)) != 0) {
zfs_range_unlock(rl);
ZFS_EXIT(zfsvfs);
return (error);
}
end_size = MAX(zp->z_phys->zp_size, woff + n);
/*
* Write the file in reasonable size chunks. Each chunk is written
* in a separate transaction; this keeps the intent log records small
* and allows us to do more fine-grained space accounting.
*/
while (n > 0) {
/*
* Start a transaction.
*/
woff = uio->uio_loffset;
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
dmu_tx_hold_write(tx, zp->z_id, woff, MIN(n, max_blksz));
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
if (error == ERESTART &&
zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
continue;
}
dmu_tx_abort(tx);
break;
}
/*
* If zfs_range_lock() over-locked we grow the blocksize
* and then reduce the lock range. This will only happen
* on the first iteration since zfs_range_reduce() will
* shrink down r_len to the appropriate size.
*/
if (rl->r_len == UINT64_MAX) {
uint64_t new_blksz;
if (zp->z_blksz > max_blksz) {
ASSERT(!ISP2(zp->z_blksz));
new_blksz = MIN(end_size, SPA_MAXBLOCKSIZE);
} else {
new_blksz = MIN(end_size, max_blksz);
}
zfs_grow_blocksize(zp, new_blksz, tx);
zfs_range_reduce(rl, woff, n);
}
/*
* XXX - should we really limit each write to z_max_blksz?
* Perhaps we should use SPA_MAXBLOCKSIZE chunks?
*/
nbytes = MIN(n, max_blksz - P2PHASE(woff, max_blksz));
if (woff + nbytes > zp->z_phys->zp_size)
vnode_pager_setsize(vp, woff + nbytes);
rw_enter(&zp->z_map_lock, RW_READER);
tx_bytes = uio->uio_resid;
if (vn_has_cached_data(vp)) {
rw_exit(&zp->z_map_lock);
error = mappedwrite(vp, nbytes, uio, tx);
} else {
error = dmu_write_uio(zfsvfs->z_os, zp->z_id,
uio, nbytes, tx);
rw_exit(&zp->z_map_lock);
}
tx_bytes -= uio->uio_resid;
/*
* If we made no progress, we're done. If we made even
* partial progress, update the znode and ZIL accordingly.
*/
if (tx_bytes == 0) {
dmu_tx_commit(tx);
ASSERT(error != 0);
break;
}
/*
* Clear Set-UID/Set-GID bits on successful write if not
* privileged and at least one of the excute bits is set.
*
* It would be nice to to this after all writes have
* been done, but that would still expose the ISUID/ISGID
* to another app after the partial write is committed.
*
* Note: we don't call zfs_fuid_map_id() here because
* user 0 is not an ephemeral uid.
*/
mutex_enter(&zp->z_acl_lock);
if ((zp->z_phys->zp_mode & (S_IXUSR | (S_IXUSR >> 3) |
(S_IXUSR >> 6))) != 0 &&
(zp->z_phys->zp_mode & (S_ISUID | S_ISGID)) != 0 &&
secpolicy_vnode_setid_retain(vp, cr,
(zp->z_phys->zp_mode & S_ISUID) != 0 &&
zp->z_phys->zp_uid == 0) != 0) {
zp->z_phys->zp_mode &= ~(S_ISUID | S_ISGID);
}
mutex_exit(&zp->z_acl_lock);
/*
* Update time stamp. NOTE: This marks the bonus buffer as
* dirty, so we don't have to do it again for zp_size.
*/
zfs_time_stamper(zp, CONTENT_MODIFIED, tx);
/*
* Update the file size (zp_size) if it has changed;
* account for possible concurrent updates.
*/
while ((end_size = zp->z_phys->zp_size) < uio->uio_loffset)
(void) atomic_cas_64(&zp->z_phys->zp_size, end_size,
uio->uio_loffset);
zfs_log_write(zilog, tx, TX_WRITE, zp, woff, tx_bytes, ioflag);
dmu_tx_commit(tx);
if (error != 0)
break;
ASSERT(tx_bytes == nbytes);
n -= nbytes;
}
zfs_range_unlock(rl);
/*
* If we're in replay mode, or we made no progress, return error.
* Otherwise, it's at least a partial write, so it's successful.
*/
if (zfsvfs->z_assign >= TXG_INITIAL || uio->uio_resid == start_resid) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (ioflag & (FSYNC | FDSYNC))
zil_commit(zilog, zp->z_last_itx, zp->z_id);
ZFS_EXIT(zfsvfs);
return (0);
}
void
zfs_get_done(dmu_buf_t *db, void *vzgd)
{
zgd_t *zgd = (zgd_t *)vzgd;
rl_t *rl = zgd->zgd_rl;
vnode_t *vp = ZTOV(rl->r_zp);
objset_t *os = rl->r_zp->z_zfsvfs->z_os;
int vfslocked;
vfslocked = VFS_LOCK_GIANT(vp->v_vfsp);
dmu_buf_rele(db, vzgd);
zfs_range_unlock(rl);
/*
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(vp, dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
zil_add_block(zgd->zgd_zilog, zgd->zgd_bp);
kmem_free(zgd, sizeof (zgd_t));
VFS_UNLOCK_GIANT(vfslocked);
}
/*
* Get data to generate a TX_WRITE intent log record.
*/
int
zfs_get_data(void *arg, lr_write_t *lr, char *buf, zio_t *zio)
{
zfsvfs_t *zfsvfs = arg;
objset_t *os = zfsvfs->z_os;
znode_t *zp;
uint64_t off = lr->lr_offset;
dmu_buf_t *db;
rl_t *rl;
zgd_t *zgd;
int dlen = lr->lr_length; /* length of user data */
int error = 0;
ASSERT(zio);
ASSERT(dlen != 0);
/*
* Nothing to do if the file has been removed
*/
if (zfs_zget(zfsvfs, lr->lr_foid, &zp) != 0)
return (ENOENT);
if (zp->z_unlinked) {
/*
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(ZTOV(zp),
dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
return (ENOENT);
}
/*
* Write records come in two flavors: immediate and indirect.
* For small writes it's cheaper to store the data with the
* log record (immediate); for large writes it's cheaper to
* sync the data and get a pointer to it (indirect) so that
* we don't have to write the data twice.
*/
if (buf != NULL) { /* immediate write */
rl = zfs_range_lock(zp, off, dlen, RL_READER);
/* test for truncation needs to be done while range locked */
if (off >= zp->z_phys->zp_size) {
error = ENOENT;
goto out;
}
VERIFY(0 == dmu_read(os, lr->lr_foid, off, dlen, buf));
} else { /* indirect write */
uint64_t boff; /* block starting offset */
/*
* Have to lock the whole block to ensure when it's
* written out and it's checksum is being calculated
* that no one can change the data. We need to re-check
* blocksize after we get the lock in case it's changed!
*/
for (;;) {
if (ISP2(zp->z_blksz)) {
boff = P2ALIGN_TYPED(off, zp->z_blksz,
uint64_t);
} else {
boff = 0;
}
dlen = zp->z_blksz;
rl = zfs_range_lock(zp, boff, dlen, RL_READER);
if (zp->z_blksz == dlen)
break;
zfs_range_unlock(rl);
}
/* test for truncation needs to be done while range locked */
if (off >= zp->z_phys->zp_size) {
error = ENOENT;
goto out;
}
zgd = (zgd_t *)kmem_alloc(sizeof (zgd_t), KM_SLEEP);
zgd->zgd_rl = rl;
zgd->zgd_zilog = zfsvfs->z_log;
zgd->zgd_bp = &lr->lr_blkptr;
VERIFY(0 == dmu_buf_hold(os, lr->lr_foid, boff, zgd, &db));
ASSERT(boff == db->db_offset);
lr->lr_blkoff = off - boff;
error = dmu_sync(zio, db, &lr->lr_blkptr,
lr->lr_common.lrc_txg, zfs_get_done, zgd);
ASSERT((error && error != EINPROGRESS) ||
lr->lr_length <= zp->z_blksz);
if (error == 0)
zil_add_block(zfsvfs->z_log, &lr->lr_blkptr);
/*
* If we get EINPROGRESS, then we need to wait for a
* write IO initiated by dmu_sync() to complete before
* we can release this dbuf. We will finish everything
* up in the zfs_get_done() callback.
*/
if (error == EINPROGRESS)
return (0);
dmu_buf_rele(db, zgd);
kmem_free(zgd, sizeof (zgd_t));
}
out:
zfs_range_unlock(rl);
/*
* Release the vnode asynchronously as we currently have the
* txg stopped from syncing.
*/
VN_RELE_ASYNC(ZTOV(zp), dsl_pool_vnrele_taskq(dmu_objset_pool(os)));
return (error);
}
/*ARGSUSED*/
static int
zfs_access(vnode_t *vp, int mode, int flag, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
if (flag & V_ACE_MASK)
error = zfs_zaccess(zp, mode, flag, B_FALSE, cr);
else
error = zfs_zaccess_rwx(zp, mode, flag, cr);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Lookup an entry in a directory, or an extended attribute directory.
* If it exists, return a held vnode reference for it.
*
* IN: dvp - vnode of directory to search.
* nm - name of entry to lookup.
* pnp - full pathname to lookup [UNUSED].
* flags - LOOKUP_XATTR set if looking for an attribute.
* rdir - root directory vnode [UNUSED].
* cr - credentials of caller.
* ct - caller context
* direntflags - directory lookup flags
* realpnp - returned pathname.
*
* OUT: vpp - vnode of located entry, NULL if not found.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* NA
*/
/* ARGSUSED */
static int
zfs_lookup(vnode_t *dvp, char *nm, vnode_t **vpp, struct componentname *cnp,
int nameiop, cred_t *cr, kthread_t *td, int flags)
{
znode_t *zdp = VTOZ(dvp);
zfsvfs_t *zfsvfs = zdp->z_zfsvfs;
int error;
int *direntflags = NULL;
void *realpnp = NULL;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zdp);
*vpp = NULL;
if (flags & LOOKUP_XATTR) {
#ifdef TODO
/*
* If the xattr property is off, refuse the lookup request.
*/
if (!(zfsvfs->z_vfs->vfs_flag & VFS_XATTR)) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
#endif
/*
* We don't allow recursive attributes..
* Maybe someday we will.
*/
if (zdp->z_phys->zp_flags & ZFS_XATTR) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
if (error = zfs_get_xattrdir(VTOZ(dvp), vpp, cr, flags)) {
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Do we have permission to get into attribute directory?
*/
if (error = zfs_zaccess(VTOZ(*vpp), ACE_EXECUTE, 0,
B_FALSE, cr)) {
VN_RELE(*vpp);
*vpp = NULL;
}
ZFS_EXIT(zfsvfs);
return (error);
}
if (dvp->v_type != VDIR) {
ZFS_EXIT(zfsvfs);
return (ENOTDIR);
}
/*
* Check accessibility of directory.
*/
if (error = zfs_zaccess(zdp, ACE_EXECUTE, 0, B_FALSE, cr)) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (zfsvfs->z_utf8 && u8_validate(nm, strlen(nm),
NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
error = zfs_dirlook(zdp, nm, vpp, flags, direntflags, realpnp);
if (error == 0) {
/*
* Convert device special files
*/
if (IS_DEVVP(*vpp)) {
vnode_t *svp;
svp = specvp(*vpp, (*vpp)->v_rdev, (*vpp)->v_type, cr);
VN_RELE(*vpp);
if (svp == NULL)
error = ENOSYS;
else
*vpp = svp;
}
}
/* Translate errors and add SAVENAME when needed. */
if (cnp->cn_flags & ISLASTCN) {
switch (nameiop) {
case CREATE:
case RENAME:
if (error == ENOENT) {
error = EJUSTRETURN;
cnp->cn_flags |= SAVENAME;
break;
}
/* FALLTHROUGH */
case DELETE:
if (error == 0)
cnp->cn_flags |= SAVENAME;
break;
}
}
if (error == 0 && (nm[0] != '.' || nm[1] != '\0')) {
int ltype = 0;
if (cnp->cn_flags & ISDOTDOT) {
ltype = VOP_ISLOCKED(dvp);
VOP_UNLOCK(dvp, 0);
}
ZFS_EXIT(zfsvfs);
error = vn_lock(*vpp, cnp->cn_lkflags);
if (cnp->cn_flags & ISDOTDOT)
vn_lock(dvp, ltype | LK_RETRY);
if (error != 0) {
VN_RELE(*vpp);
*vpp = NULL;
return (error);
}
} else {
ZFS_EXIT(zfsvfs);
}
#ifdef FREEBSD_NAMECACHE
/*
* Insert name into cache (as non-existent) if appropriate.
*/
if (error == ENOENT && (cnp->cn_flags & MAKEENTRY) && nameiop != CREATE)
cache_enter(dvp, *vpp, cnp);
/*
* Insert name into cache if appropriate.
*/
if (error == 0 && (cnp->cn_flags & MAKEENTRY)) {
if (!(cnp->cn_flags & ISLASTCN) ||
(nameiop != DELETE && nameiop != RENAME)) {
cache_enter(dvp, *vpp, cnp);
}
}
#endif
return (error);
}
/*
* Attempt to create a new entry in a directory. If the entry
* already exists, truncate the file if permissible, else return
* an error. Return the vp of the created or trunc'd file.
*
* IN: dvp - vnode of directory to put new file entry in.
* name - name of new file entry.
* vap - attributes of new file.
* excl - flag indicating exclusive or non-exclusive mode.
* mode - mode to open file with.
* cr - credentials of caller.
* flag - large file flag [UNUSED].
* ct - caller context
* vsecp - ACL to be set
*
* OUT: vpp - vnode of created or trunc'd entry.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated if new entry created
* vp - ctime|mtime always, atime if new
*/
/* ARGSUSED */
static int
zfs_create(vnode_t *dvp, char *name, vattr_t *vap, int excl, int mode,
vnode_t **vpp, cred_t *cr, kthread_t *td)
{
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
objset_t *os;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
int error;
zfs_acl_t *aclp = NULL;
zfs_fuid_info_t *fuidp = NULL;
void *vsecp = NULL;
int flag = 0;
/*
* If we have an ephemeral id, ACL, or XVATTR then
* make sure file system is at proper version
*/
if (zfsvfs->z_use_fuids == B_FALSE &&
(vsecp || (vap->va_mask & AT_XVATTR) ||
IS_EPHEMERAL(crgetuid(cr)) || IS_EPHEMERAL(crgetgid(cr))))
return (EINVAL);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
os = zfsvfs->z_os;
zilog = zfsvfs->z_log;
if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (vap->va_mask & AT_XVATTR) {
if ((error = secpolicy_xvattr(dvp, (xvattr_t *)vap,
crgetuid(cr), cr, vap->va_type)) != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
}
top:
*vpp = NULL;
if ((vap->va_mode & S_ISVTX) && secpolicy_vnode_stky_modify(cr))
vap->va_mode &= ~S_ISVTX;
if (*name == '\0') {
/*
* Null component name refers to the directory itself.
*/
VN_HOLD(dvp);
zp = dzp;
dl = NULL;
error = 0;
} else {
/* possible VN_HOLD(zp) */
int zflg = 0;
if (flag & FIGNORECASE)
zflg |= ZCILOOK;
error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
NULL, NULL);
if (error) {
if (strcmp(name, "..") == 0)
error = EISDIR;
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (error);
}
}
if (vsecp && aclp == NULL) {
error = zfs_vsec_2_aclp(zfsvfs, vap->va_type, vsecp, &aclp);
if (error) {
ZFS_EXIT(zfsvfs);
if (dl)
zfs_dirent_unlock(dl);
return (error);
}
}
if (zp == NULL) {
uint64_t txtype;
/*
* Create a new file object and update the directory
* to reference it.
*/
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
goto out;
}
/*
* We only support the creation of regular files in
* extended attribute directories.
*/
if ((dzp->z_phys->zp_flags & ZFS_XATTR) &&
(vap->va_type != VREG)) {
error = EINVAL;
goto out;
}
tx = dmu_tx_create(os);
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
if ((aclp && aclp->z_has_fuids) || IS_EPHEMERAL(crgetuid(cr)) ||
IS_EPHEMERAL(crgetgid(cr))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ,
FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
dmu_tx_hold_bonus(tx, dzp->z_id);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
if ((dzp->z_phys->zp_flags & ZFS_INHERIT_ACE) || aclp) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, SPA_MAXBLOCKSIZE);
}
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART &&
zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (error);
}
zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, aclp, &fuidp);
(void) zfs_link_create(dl, zp, tx, ZNEW);
txtype = zfs_log_create_txtype(Z_FILE, vsecp, vap);
if (flag & FIGNORECASE)
txtype |= TX_CI;
zfs_log_create(zilog, tx, txtype, dzp, zp, name,
vsecp, fuidp, vap);
if (fuidp)
zfs_fuid_info_free(fuidp);
dmu_tx_commit(tx);
} else {
int aflags = (flag & FAPPEND) ? V_APPEND : 0;
/*
* A directory entry already exists for this name.
*/
/*
* Can't truncate an existing file if in exclusive mode.
*/
if (excl == EXCL) {
error = EEXIST;
goto out;
}
/*
* Can't open a directory for writing.
*/
if ((ZTOV(zp)->v_type == VDIR) && (mode & S_IWRITE)) {
error = EISDIR;
goto out;
}
/*
* Verify requested access to file.
*/
if (mode && (error = zfs_zaccess_rwx(zp, mode, aflags, cr))) {
goto out;
}
mutex_enter(&dzp->z_lock);
dzp->z_seq++;
mutex_exit(&dzp->z_lock);
/*
* Truncate regular files if requested.
*/
if ((ZTOV(zp)->v_type == VREG) &&
(vap->va_mask & AT_SIZE) && (vap->va_size == 0)) {
/* we can't hold any locks when calling zfs_freesp() */
zfs_dirent_unlock(dl);
dl = NULL;
error = zfs_freesp(zp, 0, 0, mode, TRUE);
if (error == 0) {
vnevent_create(ZTOV(zp), ct);
}
}
}
out:
if (dl)
zfs_dirent_unlock(dl);
if (error) {
if (zp)
VN_RELE(ZTOV(zp));
} else {
*vpp = ZTOV(zp);
/*
* If vnode is for a device return a specfs vnode instead.
*/
if (IS_DEVVP(*vpp)) {
struct vnode *svp;
svp = specvp(*vpp, (*vpp)->v_rdev, (*vpp)->v_type, cr);
VN_RELE(*vpp);
if (svp == NULL) {
error = ENOSYS;
}
*vpp = svp;
}
}
if (aclp)
zfs_acl_free(aclp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Remove an entry from a directory.
*
* IN: dvp - vnode of directory to remove entry from.
* name - name of entry to remove.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime
* vp - ctime (if nlink > 0)
*/
/*ARGSUSED*/
static int
zfs_remove(vnode_t *dvp, char *name, cred_t *cr, caller_context_t *ct,
int flags)
{
znode_t *zp, *dzp = VTOZ(dvp);
znode_t *xzp = NULL;
vnode_t *vp;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
uint64_t acl_obj, xattr_obj;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
boolean_t may_delete_now, delete_now = FALSE;
boolean_t unlinked, toobig = FALSE;
uint64_t txtype;
pathname_t *realnmp = NULL;
pathname_t realnm;
int error;
int zflg = ZEXISTS;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (flags & FIGNORECASE) {
zflg |= ZCILOOK;
pn_alloc(&realnm);
realnmp = &realnm;
}
top:
/*
* Attempt to lock directory; fail if entry doesn't exist.
*/
if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
NULL, realnmp)) {
if (realnmp)
pn_free(realnmp);
ZFS_EXIT(zfsvfs);
return (error);
}
vp = ZTOV(zp);
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
/*
* Need to use rmdir for removing directories.
*/
if (vp->v_type == VDIR) {
error = EPERM;
goto out;
}
vnevent_remove(vp, dvp, name, ct);
if (realnmp)
dnlc_remove(dvp, realnmp->pn_buf);
else
dnlc_remove(dvp, name);
may_delete_now = FALSE;
/*
* We may delete the znode now, or we may put it in the unlinked set;
* it depends on whether we're the last link, and on whether there are
* other holds on the vnode. So we dmu_tx_hold() the right things to
* allow for either case.
*/
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
dmu_tx_hold_bonus(tx, zp->z_id);
if (may_delete_now) {
toobig =
zp->z_phys->zp_size > zp->z_blksz * DMU_MAX_DELETEBLKCNT;
/* if the file is too big, only hold_free a token amount */
dmu_tx_hold_free(tx, zp->z_id, 0,
(toobig ? DMU_MAX_ACCESS : DMU_OBJECT_END));
}
/* are there any extended attributes? */
if ((xattr_obj = zp->z_phys->zp_xattr) != 0) {
/* XXX - do we need this if we are deleting? */
dmu_tx_hold_bonus(tx, xattr_obj);
}
/* are there any additional acls */
if ((acl_obj = zp->z_phys->zp_acl.z_acl_extern_obj) != 0 &&
may_delete_now)
dmu_tx_hold_free(tx, acl_obj, 0, DMU_OBJECT_END);
/* charge as an update -- would be nice not to charge at all */
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
VN_RELE(vp);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
if (realnmp)
pn_free(realnmp);
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Remove the directory entry.
*/
error = zfs_link_destroy(dl, zp, tx, zflg, &unlinked);
if (error) {
dmu_tx_commit(tx);
goto out;
}
if (0 && unlinked) {
VI_LOCK(vp);
delete_now = may_delete_now && !toobig &&
vp->v_count == 1 && !vn_has_cached_data(vp) &&
zp->z_phys->zp_xattr == xattr_obj &&
zp->z_phys->zp_acl.z_acl_extern_obj == acl_obj;
VI_UNLOCK(vp);
}
if (delete_now) {
if (zp->z_phys->zp_xattr) {
error = zfs_zget(zfsvfs, zp->z_phys->zp_xattr, &xzp);
ASSERT3U(error, ==, 0);
ASSERT3U(xzp->z_phys->zp_links, ==, 2);
dmu_buf_will_dirty(xzp->z_dbuf, tx);
mutex_enter(&xzp->z_lock);
xzp->z_unlinked = 1;
xzp->z_phys->zp_links = 0;
mutex_exit(&xzp->z_lock);
zfs_unlinked_add(xzp, tx);
zp->z_phys->zp_xattr = 0; /* probably unnecessary */
}
mutex_enter(&zp->z_lock);
VI_LOCK(vp);
vp->v_count--;
ASSERT3U(vp->v_count, ==, 0);
VI_UNLOCK(vp);
mutex_exit(&zp->z_lock);
zfs_znode_delete(zp, tx);
} else if (unlinked) {
zfs_unlinked_add(zp, tx);
}
txtype = TX_REMOVE;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_remove(zilog, tx, txtype, dzp, name);
dmu_tx_commit(tx);
out:
if (realnmp)
pn_free(realnmp);
zfs_dirent_unlock(dl);
if (!delete_now) {
VN_RELE(vp);
} else if (xzp) {
/* this rele is delayed to prevent nesting transactions */
VN_RELE(ZTOV(xzp));
}
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Create a new directory and insert it into dvp using the name
* provided. Return a pointer to the inserted directory.
*
* IN: dvp - vnode of directory to add subdir to.
* dirname - name of new directory.
* vap - attributes of new directory.
* cr - credentials of caller.
* ct - caller context
* vsecp - ACL to be set
*
* OUT: vpp - vnode of created directory.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated
* vp - ctime|mtime|atime updated
*/
/*ARGSUSED*/
static int
zfs_mkdir(vnode_t *dvp, char *dirname, vattr_t *vap, vnode_t **vpp, cred_t *cr,
caller_context_t *ct, int flags, vsecattr_t *vsecp)
{
znode_t *zp, *dzp = VTOZ(dvp);
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
uint64_t txtype;
dmu_tx_t *tx;
int error;
zfs_acl_t *aclp = NULL;
zfs_fuid_info_t *fuidp = NULL;
int zf = ZNEW;
ASSERT(vap->va_type == VDIR);
/*
* If we have an ephemeral id, ACL, or XVATTR then
* make sure file system is at proper version
*/
if (zfsvfs->z_use_fuids == B_FALSE &&
(vsecp || (vap->va_mask & AT_XVATTR) || IS_EPHEMERAL(crgetuid(cr))||
IS_EPHEMERAL(crgetgid(cr))))
return (EINVAL);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (dzp->z_phys->zp_flags & ZFS_XATTR) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
if (zfsvfs->z_utf8 && u8_validate(dirname,
strlen(dirname), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zf |= ZCILOOK;
if (vap->va_mask & AT_XVATTR)
if ((error = secpolicy_xvattr(dvp, (xvattr_t *)vap,
crgetuid(cr), cr, vap->va_type)) != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* First make sure the new directory doesn't exist.
*/
top:
*vpp = NULL;
if (error = zfs_dirent_lock(&dl, dzp, dirname, &zp, zf,
NULL, NULL)) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (error = zfs_zaccess(dzp, ACE_ADD_SUBDIRECTORY, 0, B_FALSE, cr)) {
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
if (vsecp && aclp == NULL) {
error = zfs_vsec_2_aclp(zfsvfs, vap->va_type, vsecp, &aclp);
if (error) {
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
}
/*
* Add a new entry to the directory.
*/
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, dirname);
dmu_tx_hold_zap(tx, DMU_NEW_OBJECT, FALSE, NULL);
if ((aclp && aclp->z_has_fuids) || IS_EPHEMERAL(crgetuid(cr)) ||
IS_EPHEMERAL(crgetgid(cr))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
if ((dzp->z_phys->zp_flags & ZFS_INHERIT_ACE) || aclp)
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, SPA_MAXBLOCKSIZE);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (error);
}
/*
* Create new node.
*/
zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, aclp, &fuidp);
if (aclp)
zfs_acl_free(aclp);
/*
* Now put new name in parent dir.
*/
(void) zfs_link_create(dl, zp, tx, ZNEW);
*vpp = ZTOV(zp);
txtype = zfs_log_create_txtype(Z_DIR, vsecp, vap);
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_create(zilog, tx, txtype, dzp, zp, dirname, vsecp, fuidp, vap);
if (fuidp)
zfs_fuid_info_free(fuidp);
dmu_tx_commit(tx);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* Remove a directory subdir entry. If the current working
* directory is the same as the subdir to be removed, the
* remove will fail.
*
* IN: dvp - vnode of directory to remove from.
* name - name of directory to be removed.
* cwd - vnode of current working directory.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated
*/
/*ARGSUSED*/
static int
zfs_rmdir(vnode_t *dvp, char *name, vnode_t *cwd, cred_t *cr,
caller_context_t *ct, int flags)
{
znode_t *dzp = VTOZ(dvp);
znode_t *zp;
vnode_t *vp;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
int error;
int zflg = ZEXISTS;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (flags & FIGNORECASE)
zflg |= ZCILOOK;
top:
zp = NULL;
/*
* Attempt to lock directory; fail if entry doesn't exist.
*/
if (error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg,
NULL, NULL)) {
ZFS_EXIT(zfsvfs);
return (error);
}
vp = ZTOV(zp);
if (error = zfs_zaccess_delete(dzp, zp, cr)) {
goto out;
}
if (vp->v_type != VDIR) {
error = ENOTDIR;
goto out;
}
if (vp == cwd) {
error = EINVAL;
goto out;
}
vnevent_rmdir(vp, dvp, name, ct);
/*
* Grab a lock on the directory to make sure that noone is
* trying to add (or lookup) entries while we are removing it.
*/
rw_enter(&zp->z_name_lock, RW_WRITER);
/*
* Grab a lock on the parent pointer to make sure we play well
* with the treewalk and directory rename code.
*/
rw_enter(&zp->z_parent_lock, RW_WRITER);
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_zap(tx, dzp->z_id, FALSE, name);
dmu_tx_hold_bonus(tx, zp->z_id);
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
rw_exit(&zp->z_parent_lock);
rw_exit(&zp->z_name_lock);
zfs_dirent_unlock(dl);
VN_RELE(vp);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
#ifdef FREEBSD_NAMECACHE
cache_purge(dvp);
#endif
error = zfs_link_destroy(dl, zp, tx, zflg, NULL);
if (error == 0) {
uint64_t txtype = TX_RMDIR;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_remove(zilog, tx, txtype, dzp, name);
}
dmu_tx_commit(tx);
rw_exit(&zp->z_parent_lock);
rw_exit(&zp->z_name_lock);
#ifdef FREEBSD_NAMECACHE
cache_purge(vp);
#endif
out:
zfs_dirent_unlock(dl);
VN_RELE(vp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Read as many directory entries as will fit into the provided
* buffer from the given directory cursor position (specified in
* the uio structure.
*
* IN: vp - vnode of directory to read.
* uio - structure supplying read location, range info,
* and return buffer.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* OUT: uio - updated offset and range, buffer filled.
* eofp - set to true if end-of-file detected.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - atime updated
*
* Note that the low 4 bits of the cookie returned by zap is always zero.
* This allows us to use the low range for "special" directory entries:
* We use 0 for '.', and 1 for '..'. If this is the root of the filesystem,
* we use the offset 2 for the '.zfs' directory.
*/
/* ARGSUSED */
static int
zfs_readdir(vnode_t *vp, uio_t *uio, cred_t *cr, int *eofp, int *ncookies, u_long **cookies)
{
znode_t *zp = VTOZ(vp);
iovec_t *iovp;
edirent_t *eodp;
dirent64_t *odp;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
objset_t *os;
caddr_t outbuf;
size_t bufsize;
zap_cursor_t zc;
zap_attribute_t zap;
uint_t bytes_wanted;
uint64_t offset; /* must be unsigned; checks for < 1 */
int local_eof;
int outcount;
int error;
uint8_t prefetch;
boolean_t check_sysattrs;
uint8_t type;
int ncooks;
u_long *cooks = NULL;
int flags = 0;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
/*
* If we are not given an eof variable,
* use a local one.
*/
if (eofp == NULL)
eofp = &local_eof;
/*
* Check for valid iov_len.
*/
if (uio->uio_iov->iov_len <= 0) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* Quit if directory has been removed (posix)
*/
if ((*eofp = zp->z_unlinked) != 0) {
ZFS_EXIT(zfsvfs);
return (0);
}
error = 0;
os = zfsvfs->z_os;
offset = uio->uio_loffset;
prefetch = zp->z_zn_prefetch;
/*
* Initialize the iterator cursor.
*/
if (offset <= 3) {
/*
* Start iteration from the beginning of the directory.
*/
zap_cursor_init(&zc, os, zp->z_id);
} else {
/*
* The offset is a serialized cursor.
*/
zap_cursor_init_serialized(&zc, os, zp->z_id, offset);
}
/*
* Get space to change directory entries into fs independent format.
*/
iovp = uio->uio_iov;
bytes_wanted = iovp->iov_len;
if (uio->uio_segflg != UIO_SYSSPACE || uio->uio_iovcnt != 1) {
bufsize = bytes_wanted;
outbuf = kmem_alloc(bufsize, KM_SLEEP);
odp = (struct dirent64 *)outbuf;
} else {
bufsize = bytes_wanted;
odp = (struct dirent64 *)iovp->iov_base;
}
eodp = (struct edirent *)odp;
if (ncookies != NULL) {
/*
* Minimum entry size is dirent size and 1 byte for a file name.
*/
ncooks = uio->uio_resid / (sizeof(struct dirent) - sizeof(((struct dirent *)NULL)->d_name) + 1);
cooks = malloc(ncooks * sizeof(u_long), M_TEMP, M_WAITOK);
*cookies = cooks;
*ncookies = ncooks;
}
/*
* If this VFS supports the system attribute view interface; and
* we're looking at an extended attribute directory; and we care
* about normalization conflicts on this vfs; then we must check
* for normalization conflicts with the sysattr name space.
*/
#ifdef TODO
check_sysattrs = vfs_has_feature(vp->v_vfsp, VFSFT_SYSATTR_VIEWS) &&
(vp->v_flag & V_XATTRDIR) && zfsvfs->z_norm &&
(flags & V_RDDIR_ENTFLAGS);
#else
check_sysattrs = 0;
#endif
/*
* Transform to file-system independent format
*/
outcount = 0;
while (outcount < bytes_wanted) {
ino64_t objnum;
ushort_t reclen;
off64_t *next;
/*
* Special case `.', `..', and `.zfs'.
*/
if (offset == 0) {
(void) strcpy(zap.za_name, ".");
zap.za_normalization_conflict = 0;
objnum = zp->z_id;
type = DT_DIR;
} else if (offset == 1) {
(void) strcpy(zap.za_name, "..");
zap.za_normalization_conflict = 0;
objnum = zp->z_phys->zp_parent;
type = DT_DIR;
} else if (offset == 2 && zfs_show_ctldir(zp)) {
(void) strcpy(zap.za_name, ZFS_CTLDIR_NAME);
zap.za_normalization_conflict = 0;
objnum = ZFSCTL_INO_ROOT;
type = DT_DIR;
} else {
/*
* Grab next entry.
*/
if (error = zap_cursor_retrieve(&zc, &zap)) {
if ((*eofp = (error == ENOENT)) != 0)
break;
else
goto update;
}
if (zap.za_integer_length != 8 ||
zap.za_num_integers != 1) {
cmn_err(CE_WARN, "zap_readdir: bad directory "
"entry, obj = %lld, offset = %lld\n",
(u_longlong_t)zp->z_id,
(u_longlong_t)offset);
error = ENXIO;
goto update;
}
objnum = ZFS_DIRENT_OBJ(zap.za_first_integer);
/*
* MacOS X can extract the object type here such as:
* uint8_t type = ZFS_DIRENT_TYPE(zap.za_first_integer);
*/
type = ZFS_DIRENT_TYPE(zap.za_first_integer);
if (check_sysattrs && !zap.za_normalization_conflict) {
#ifdef TODO
zap.za_normalization_conflict =
xattr_sysattr_casechk(zap.za_name);
#else
panic("%s:%u: TODO", __func__, __LINE__);
#endif
}
}
if (flags & V_RDDIR_ENTFLAGS)
reclen = EDIRENT_RECLEN(strlen(zap.za_name));
else
reclen = DIRENT64_RECLEN(strlen(zap.za_name));
/*
* Will this entry fit in the buffer?
*/
if (outcount + reclen > bufsize) {
/*
* Did we manage to fit anything in the buffer?
*/
if (!outcount) {
error = EINVAL;
goto update;
}
break;
}
if (flags & V_RDDIR_ENTFLAGS) {
/*
* Add extended flag entry:
*/
eodp->ed_ino = objnum;
eodp->ed_reclen = reclen;
/* NOTE: ed_off is the offset for the *next* entry */
next = &(eodp->ed_off);
eodp->ed_eflags = zap.za_normalization_conflict ?
ED_CASE_CONFLICT : 0;
(void) strncpy(eodp->ed_name, zap.za_name,
EDIRENT_NAMELEN(reclen));
eodp = (edirent_t *)((intptr_t)eodp + reclen);
} else {
/*
* Add normal entry:
*/
odp->d_ino = objnum;
odp->d_reclen = reclen;
odp->d_namlen = strlen(zap.za_name);
(void) strlcpy(odp->d_name, zap.za_name, odp->d_namlen + 1);
odp->d_type = type;
odp = (dirent64_t *)((intptr_t)odp + reclen);
}
outcount += reclen;
ASSERT(outcount <= bufsize);
/* Prefetch znode */
if (prefetch)
dmu_prefetch(os, objnum, 0, 0);
/*
* Move to the next entry, fill in the previous offset.
*/
if (offset > 2 || (offset == 2 && !zfs_show_ctldir(zp))) {
zap_cursor_advance(&zc);
offset = zap_cursor_serialize(&zc);
} else {
offset += 1;
}
if (cooks != NULL) {
*cooks++ = offset;
ncooks--;
KASSERT(ncooks >= 0, ("ncookies=%d", ncooks));
}
}
zp->z_zn_prefetch = B_FALSE; /* a lookup will re-enable pre-fetching */
/* Subtract unused cookies */
if (ncookies != NULL)
*ncookies -= ncooks;
if (uio->uio_segflg == UIO_SYSSPACE && uio->uio_iovcnt == 1) {
iovp->iov_base += outcount;
iovp->iov_len -= outcount;
uio->uio_resid -= outcount;
} else if (error = uiomove(outbuf, (long)outcount, UIO_READ, uio)) {
/*
* Reset the pointer.
*/
offset = uio->uio_loffset;
}
update:
zap_cursor_fini(&zc);
if (uio->uio_segflg != UIO_SYSSPACE || uio->uio_iovcnt != 1)
kmem_free(outbuf, bufsize);
if (error == ENOENT)
error = 0;
ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
uio->uio_loffset = offset;
ZFS_EXIT(zfsvfs);
if (error != 0 && cookies != NULL) {
free(*cookies, M_TEMP);
*cookies = NULL;
*ncookies = 0;
}
return (error);
}
ulong_t zfs_fsync_sync_cnt = 4;
static int
zfs_fsync(vnode_t *vp, int syncflag, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
(void) tsd_set(zfs_fsyncer_key, (void *)zfs_fsync_sync_cnt);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
zil_commit(zfsvfs->z_log, zp->z_last_itx, zp->z_id);
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* Get the requested file attributes and place them in the provided
* vattr structure.
*
* IN: vp - vnode of file.
* vap - va_mask identifies requested attributes.
* If AT_XVATTR set, then optional attrs are requested
* flags - ATTR_NOACLCHECK (CIFS server context)
* cr - credentials of caller.
* ct - caller context
*
* OUT: vap - attribute values.
*
* RETURN: 0 (always succeeds)
*/
/* ARGSUSED */
static int
zfs_getattr(vnode_t *vp, vattr_t *vap, int flags, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
znode_phys_t *pzp;
int error = 0;
uint32_t blksize;
u_longlong_t nblocks;
uint64_t links;
xvattr_t *xvap = (xvattr_t *)vap; /* vap may be an xvattr_t * */
xoptattr_t *xoap = NULL;
boolean_t skipaclchk = (flags & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
pzp = zp->z_phys;
- mutex_enter(&zp->z_lock);
-
/*
* If ACL is trivial don't bother looking for ACE_READ_ATTRIBUTES.
* Also, if we are the owner don't bother, since owner should
* always be allowed to read basic attributes of file.
*/
if (!(pzp->zp_flags & ZFS_ACL_TRIVIAL) &&
(pzp->zp_uid != crgetuid(cr))) {
if (error = zfs_zaccess(zp, ACE_READ_ATTRIBUTES, 0,
skipaclchk, cr)) {
- mutex_exit(&zp->z_lock);
ZFS_EXIT(zfsvfs);
return (error);
}
}
/*
* Return all attributes. It's cheaper to provide the answer
* than to determine whether we were asked the question.
*/
+ mutex_enter(&zp->z_lock);
vap->va_type = IFTOVT(pzp->zp_mode);
vap->va_mode = pzp->zp_mode & ~S_IFMT;
zfs_fuid_map_ids(zp, cr, &vap->va_uid, &vap->va_gid);
// vap->va_fsid = zp->z_zfsvfs->z_vfs->vfs_dev;
vap->va_nodeid = zp->z_id;
if ((vp->v_flag & VROOT) && zfs_show_ctldir(zp))
links = pzp->zp_links + 1;
else
links = pzp->zp_links;
vap->va_nlink = MIN(links, UINT32_MAX); /* nlink_t limit! */
vap->va_size = pzp->zp_size;
vap->va_fsid = vp->v_mount->mnt_stat.f_fsid.val[0];
vap->va_rdev = zfs_cmpldev(pzp->zp_rdev);
vap->va_seq = zp->z_seq;
vap->va_flags = 0; /* FreeBSD: Reset chflags(2) flags. */
/*
* Add in any requested optional attributes and the create time.
* Also set the corresponding bits in the returned attribute bitmap.
*/
if ((xoap = xva_getxoptattr(xvap)) != NULL && zfsvfs->z_use_fuids) {
if (XVA_ISSET_REQ(xvap, XAT_ARCHIVE)) {
xoap->xoa_archive =
((pzp->zp_flags & ZFS_ARCHIVE) != 0);
XVA_SET_RTN(xvap, XAT_ARCHIVE);
}
if (XVA_ISSET_REQ(xvap, XAT_READONLY)) {
xoap->xoa_readonly =
((pzp->zp_flags & ZFS_READONLY) != 0);
XVA_SET_RTN(xvap, XAT_READONLY);
}
if (XVA_ISSET_REQ(xvap, XAT_SYSTEM)) {
xoap->xoa_system =
((pzp->zp_flags & ZFS_SYSTEM) != 0);
XVA_SET_RTN(xvap, XAT_SYSTEM);
}
if (XVA_ISSET_REQ(xvap, XAT_HIDDEN)) {
xoap->xoa_hidden =
((pzp->zp_flags & ZFS_HIDDEN) != 0);
XVA_SET_RTN(xvap, XAT_HIDDEN);
}
if (XVA_ISSET_REQ(xvap, XAT_NOUNLINK)) {
xoap->xoa_nounlink =
((pzp->zp_flags & ZFS_NOUNLINK) != 0);
XVA_SET_RTN(xvap, XAT_NOUNLINK);
}
if (XVA_ISSET_REQ(xvap, XAT_IMMUTABLE)) {
xoap->xoa_immutable =
((pzp->zp_flags & ZFS_IMMUTABLE) != 0);
XVA_SET_RTN(xvap, XAT_IMMUTABLE);
}
if (XVA_ISSET_REQ(xvap, XAT_APPENDONLY)) {
xoap->xoa_appendonly =
((pzp->zp_flags & ZFS_APPENDONLY) != 0);
XVA_SET_RTN(xvap, XAT_APPENDONLY);
}
if (XVA_ISSET_REQ(xvap, XAT_NODUMP)) {
xoap->xoa_nodump =
((pzp->zp_flags & ZFS_NODUMP) != 0);
XVA_SET_RTN(xvap, XAT_NODUMP);
}
if (XVA_ISSET_REQ(xvap, XAT_OPAQUE)) {
xoap->xoa_opaque =
((pzp->zp_flags & ZFS_OPAQUE) != 0);
XVA_SET_RTN(xvap, XAT_OPAQUE);
}
if (XVA_ISSET_REQ(xvap, XAT_AV_QUARANTINED)) {
xoap->xoa_av_quarantined =
((pzp->zp_flags & ZFS_AV_QUARANTINED) != 0);
XVA_SET_RTN(xvap, XAT_AV_QUARANTINED);
}
if (XVA_ISSET_REQ(xvap, XAT_AV_MODIFIED)) {
xoap->xoa_av_modified =
((pzp->zp_flags & ZFS_AV_MODIFIED) != 0);
XVA_SET_RTN(xvap, XAT_AV_MODIFIED);
}
if (XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP) &&
vp->v_type == VREG &&
(pzp->zp_flags & ZFS_BONUS_SCANSTAMP)) {
size_t len;
dmu_object_info_t doi;
/*
* Only VREG files have anti-virus scanstamps, so we
* won't conflict with symlinks in the bonus buffer.
*/
dmu_object_info_from_db(zp->z_dbuf, &doi);
len = sizeof (xoap->xoa_av_scanstamp) +
sizeof (znode_phys_t);
if (len <= doi.doi_bonus_size) {
/*
* pzp points to the start of the
* znode_phys_t. pzp + 1 points to the
* first byte after the znode_phys_t.
*/
(void) memcpy(xoap->xoa_av_scanstamp,
pzp + 1,
sizeof (xoap->xoa_av_scanstamp));
XVA_SET_RTN(xvap, XAT_AV_SCANSTAMP);
}
}
if (XVA_ISSET_REQ(xvap, XAT_CREATETIME)) {
ZFS_TIME_DECODE(&xoap->xoa_createtime, pzp->zp_crtime);
XVA_SET_RTN(xvap, XAT_CREATETIME);
}
}
ZFS_TIME_DECODE(&vap->va_atime, pzp->zp_atime);
ZFS_TIME_DECODE(&vap->va_mtime, pzp->zp_mtime);
ZFS_TIME_DECODE(&vap->va_ctime, pzp->zp_ctime);
ZFS_TIME_DECODE(&vap->va_birthtime, pzp->zp_crtime);
mutex_exit(&zp->z_lock);
dmu_object_size_from_db(zp->z_dbuf, &blksize, &nblocks);
vap->va_blksize = blksize;
vap->va_bytes = nblocks << 9; /* nblocks * 512 */
if (zp->z_blksz == 0) {
/*
* Block size hasn't been set; suggest maximal I/O transfers.
*/
vap->va_blksize = zfsvfs->z_max_blksz;
}
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* Set the file attributes to the values contained in the
* vattr structure.
*
* IN: vp - vnode of file to be modified.
* vap - new attribute values.
* If AT_XVATTR set, then optional attrs are being set
* flags - ATTR_UTIME set if non-default time values provided.
* - ATTR_NOACLCHECK (CIFS context only).
* cr - credentials of caller.
* ct - caller context
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - ctime updated, mtime updated if size changed.
*/
/* ARGSUSED */
static int
zfs_setattr(vnode_t *vp, vattr_t *vap, int flags, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
znode_phys_t *pzp;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
zilog_t *zilog;
dmu_tx_t *tx;
vattr_t oldva;
uint_t mask = vap->va_mask;
uint_t saved_mask;
uint64_t saved_mode;
int trim_mask = 0;
uint64_t new_mode;
znode_t *attrzp;
int need_policy = FALSE;
int err;
zfs_fuid_info_t *fuidp = NULL;
xvattr_t *xvap = (xvattr_t *)vap; /* vap may be an xvattr_t * */
xoptattr_t *xoap;
zfs_acl_t *aclp = NULL;
boolean_t skipaclchk = (flags & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
if (mask == 0)
return (0);
if (mask & AT_NOSET)
return (EINVAL);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
pzp = zp->z_phys;
zilog = zfsvfs->z_log;
/*
* Make sure that if we have ephemeral uid/gid or xvattr specified
* that file system is at proper version level
*/
if (zfsvfs->z_use_fuids == B_FALSE &&
(((mask & AT_UID) && IS_EPHEMERAL(vap->va_uid)) ||
((mask & AT_GID) && IS_EPHEMERAL(vap->va_gid)) ||
(mask & AT_XVATTR))) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
if (mask & AT_SIZE && vp->v_type == VDIR) {
ZFS_EXIT(zfsvfs);
return (EISDIR);
}
if (mask & AT_SIZE && vp->v_type != VREG && vp->v_type != VFIFO) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* If this is an xvattr_t, then get a pointer to the structure of
* optional attributes. If this is NULL, then we have a vattr_t.
*/
xoap = xva_getxoptattr(xvap);
/*
* Immutable files can only alter immutable bit and atime
*/
if ((pzp->zp_flags & ZFS_IMMUTABLE) &&
((mask & (AT_SIZE|AT_UID|AT_GID|AT_MTIME|AT_MODE)) ||
((mask & AT_XVATTR) && XVA_ISSET_REQ(xvap, XAT_CREATETIME)))) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
if ((mask & AT_SIZE) && (pzp->zp_flags & ZFS_READONLY)) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
/*
* Verify timestamps doesn't overflow 32 bits.
* ZFS can handle large timestamps, but 32bit syscalls can't
* handle times greater than 2039. This check should be removed
* once large timestamps are fully supported.
*/
if (mask & (AT_ATIME | AT_MTIME)) {
if (((mask & AT_ATIME) && TIMESPEC_OVERFLOW(&vap->va_atime)) ||
((mask & AT_MTIME) && TIMESPEC_OVERFLOW(&vap->va_mtime))) {
ZFS_EXIT(zfsvfs);
return (EOVERFLOW);
}
}
top:
attrzp = NULL;
if (zfsvfs->z_vfs->vfs_flag & VFS_RDONLY) {
ZFS_EXIT(zfsvfs);
return (EROFS);
}
/*
* First validate permissions
*/
if (mask & AT_SIZE) {
err = zfs_zaccess(zp, ACE_WRITE_DATA, 0, skipaclchk, cr);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
/*
* XXX - Note, we are not providing any open
* mode flags here (like FNDELAY), so we may
* block if there are locks present... this
* should be addressed in openat().
*/
/* XXX - would it be OK to generate a log record here? */
err = zfs_freesp(zp, vap->va_size, 0, 0, FALSE);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
}
if (mask & (AT_ATIME|AT_MTIME) ||
((mask & AT_XVATTR) && (XVA_ISSET_REQ(xvap, XAT_HIDDEN) ||
XVA_ISSET_REQ(xvap, XAT_READONLY) ||
XVA_ISSET_REQ(xvap, XAT_ARCHIVE) ||
XVA_ISSET_REQ(xvap, XAT_CREATETIME) ||
XVA_ISSET_REQ(xvap, XAT_SYSTEM))))
need_policy = zfs_zaccess(zp, ACE_WRITE_ATTRIBUTES, 0,
skipaclchk, cr);
if (mask & (AT_UID|AT_GID)) {
int idmask = (mask & (AT_UID|AT_GID));
int take_owner;
int take_group;
/*
* NOTE: even if a new mode is being set,
* we may clear S_ISUID/S_ISGID bits.
*/
if (!(mask & AT_MODE))
vap->va_mode = pzp->zp_mode;
/*
* Take ownership or chgrp to group we are a member of
*/
take_owner = (mask & AT_UID) && (vap->va_uid == crgetuid(cr));
take_group = (mask & AT_GID) &&
zfs_groupmember(zfsvfs, vap->va_gid, cr);
/*
* If both AT_UID and AT_GID are set then take_owner and
* take_group must both be set in order to allow taking
* ownership.
*
* Otherwise, send the check through secpolicy_vnode_setattr()
*
*/
if (((idmask == (AT_UID|AT_GID)) && take_owner && take_group) ||
((idmask == AT_UID) && take_owner) ||
((idmask == AT_GID) && take_group)) {
if (zfs_zaccess(zp, ACE_WRITE_OWNER, 0,
skipaclchk, cr) == 0) {
/*
* Remove setuid/setgid for non-privileged users
*/
secpolicy_setid_clear(vap, vp, cr);
trim_mask = (mask & (AT_UID|AT_GID));
} else {
need_policy = TRUE;
}
} else {
need_policy = TRUE;
}
}
mutex_enter(&zp->z_lock);
oldva.va_mode = pzp->zp_mode;
zfs_fuid_map_ids(zp, cr, &oldva.va_uid, &oldva.va_gid);
if (mask & AT_XVATTR) {
if ((need_policy == FALSE) &&
(XVA_ISSET_REQ(xvap, XAT_APPENDONLY) &&
xoap->xoa_appendonly !=
((pzp->zp_flags & ZFS_APPENDONLY) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_NOUNLINK) &&
xoap->xoa_nounlink !=
((pzp->zp_flags & ZFS_NOUNLINK) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_IMMUTABLE) &&
xoap->xoa_immutable !=
((pzp->zp_flags & ZFS_IMMUTABLE) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_NODUMP) &&
xoap->xoa_nodump !=
((pzp->zp_flags & ZFS_NODUMP) != 0)) ||
(XVA_ISSET_REQ(xvap, XAT_AV_MODIFIED) &&
xoap->xoa_av_modified !=
((pzp->zp_flags & ZFS_AV_MODIFIED) != 0)) ||
((XVA_ISSET_REQ(xvap, XAT_AV_QUARANTINED) &&
((vp->v_type != VREG && xoap->xoa_av_quarantined) ||
xoap->xoa_av_quarantined !=
((pzp->zp_flags & ZFS_AV_QUARANTINED) != 0)))) ||
(XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP)) ||
(XVA_ISSET_REQ(xvap, XAT_OPAQUE))) {
need_policy = TRUE;
}
}
mutex_exit(&zp->z_lock);
if (mask & AT_MODE) {
if (zfs_zaccess(zp, ACE_WRITE_ACL, 0, skipaclchk, cr) == 0) {
err = secpolicy_setid_setsticky_clear(vp, vap,
&oldva, cr);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
trim_mask |= AT_MODE;
} else {
need_policy = TRUE;
}
}
if (need_policy) {
/*
* If trim_mask is set then take ownership
* has been granted or write_acl is present and user
* has the ability to modify mode. In that case remove
* UID|GID and or MODE from mask so that
* secpolicy_vnode_setattr() doesn't revoke it.
*/
if (trim_mask) {
saved_mask = vap->va_mask;
vap->va_mask &= ~trim_mask;
if (trim_mask & AT_MODE) {
/*
* Save the mode, as secpolicy_vnode_setattr()
* will overwrite it with ova.va_mode.
*/
saved_mode = vap->va_mode;
}
}
err = secpolicy_vnode_setattr(cr, vp, vap, &oldva, flags,
(int (*)(void *, int, cred_t *))zfs_zaccess_unix, zp);
if (err) {
ZFS_EXIT(zfsvfs);
return (err);
}
if (trim_mask) {
vap->va_mask |= saved_mask;
if (trim_mask & AT_MODE) {
/*
* Recover the mode after
* secpolicy_vnode_setattr().
*/
vap->va_mode = saved_mode;
}
}
}
/*
* secpolicy_vnode_setattr, or take ownership may have
* changed va_mask
*/
mask = vap->va_mask;
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
if (((mask & AT_UID) && IS_EPHEMERAL(vap->va_uid)) ||
((mask & AT_GID) && IS_EPHEMERAL(vap->va_gid))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
if (mask & AT_MODE) {
uint64_t pmode = pzp->zp_mode;
new_mode = (pmode & S_IFMT) | (vap->va_mode & ~S_IFMT);
if (err = zfs_acl_chmod_setattr(zp, &aclp, new_mode)) {
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (err);
}
if (pzp->zp_acl.z_acl_extern_obj) {
/* Are we upgrading ACL from old V0 format to new V1 */
if (zfsvfs->z_version <= ZPL_VERSION_FUID &&
pzp->zp_acl.z_acl_version ==
ZFS_ACL_VERSION_INITIAL) {
dmu_tx_hold_free(tx,
pzp->zp_acl.z_acl_extern_obj, 0,
DMU_OBJECT_END);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, aclp->z_acl_bytes);
} else {
dmu_tx_hold_write(tx,
pzp->zp_acl.z_acl_extern_obj, 0,
aclp->z_acl_bytes);
}
} else if (aclp->z_acl_bytes > ZFS_ACE_SPACE) {
dmu_tx_hold_write(tx, DMU_NEW_OBJECT,
0, aclp->z_acl_bytes);
}
}
if ((mask & (AT_UID | AT_GID)) && pzp->zp_xattr != 0) {
err = zfs_zget(zp->z_zfsvfs, pzp->zp_xattr, &attrzp);
if (err) {
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
if (aclp)
zfs_acl_free(aclp);
return (err);
}
dmu_tx_hold_bonus(tx, attrzp->z_id);
}
err = dmu_tx_assign(tx, zfsvfs->z_assign);
if (err) {
if (attrzp)
VN_RELE(ZTOV(attrzp));
if (aclp) {
zfs_acl_free(aclp);
aclp = NULL;
}
if (err == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (err);
}
dmu_buf_will_dirty(zp->z_dbuf, tx);
/*
* Set each attribute requested.
* We group settings according to the locks they need to acquire.
*
* Note: you cannot set ctime directly, although it will be
* updated as a side-effect of calling this function.
*/
mutex_enter(&zp->z_lock);
if (mask & AT_MODE) {
mutex_enter(&zp->z_acl_lock);
zp->z_phys->zp_mode = new_mode;
err = zfs_aclset_common(zp, aclp, cr, &fuidp, tx);
ASSERT3U(err, ==, 0);
mutex_exit(&zp->z_acl_lock);
}
if (attrzp)
mutex_enter(&attrzp->z_lock);
if (mask & AT_UID) {
pzp->zp_uid = zfs_fuid_create(zfsvfs,
vap->va_uid, cr, ZFS_OWNER, tx, &fuidp);
if (attrzp) {
attrzp->z_phys->zp_uid = zfs_fuid_create(zfsvfs,
vap->va_uid, cr, ZFS_OWNER, tx, &fuidp);
}
}
if (mask & AT_GID) {
pzp->zp_gid = zfs_fuid_create(zfsvfs, vap->va_gid,
cr, ZFS_GROUP, tx, &fuidp);
if (attrzp)
attrzp->z_phys->zp_gid = zfs_fuid_create(zfsvfs,
vap->va_gid, cr, ZFS_GROUP, tx, &fuidp);
}
if (aclp)
zfs_acl_free(aclp);
if (attrzp)
mutex_exit(&attrzp->z_lock);
if (mask & AT_ATIME)
ZFS_TIME_ENCODE(&vap->va_atime, pzp->zp_atime);
if (mask & AT_MTIME)
ZFS_TIME_ENCODE(&vap->va_mtime, pzp->zp_mtime);
/* XXX - shouldn't this be done *before* the ATIME/MTIME checks? */
if (mask & AT_SIZE)
zfs_time_stamper_locked(zp, CONTENT_MODIFIED, tx);
else if (mask != 0)
zfs_time_stamper_locked(zp, STATE_CHANGED, tx);
/*
* Do this after setting timestamps to prevent timestamp
* update from toggling bit
*/
if (xoap && (mask & AT_XVATTR)) {
if (XVA_ISSET_REQ(xvap, XAT_AV_SCANSTAMP)) {
size_t len;
dmu_object_info_t doi;
ASSERT(vp->v_type == VREG);
/* Grow the bonus buffer if necessary. */
dmu_object_info_from_db(zp->z_dbuf, &doi);
len = sizeof (xoap->xoa_av_scanstamp) +
sizeof (znode_phys_t);
if (len > doi.doi_bonus_size)
VERIFY(dmu_set_bonus(zp->z_dbuf, len, tx) == 0);
}
zfs_xvattr_set(zp, xvap);
}
if (mask != 0)
zfs_log_setattr(zilog, tx, TX_SETATTR, zp, vap, mask, fuidp);
if (fuidp)
zfs_fuid_info_free(fuidp);
mutex_exit(&zp->z_lock);
if (attrzp)
VN_RELE(ZTOV(attrzp));
dmu_tx_commit(tx);
ZFS_EXIT(zfsvfs);
return (err);
}
typedef struct zfs_zlock {
krwlock_t *zl_rwlock; /* lock we acquired */
znode_t *zl_znode; /* znode we held */
struct zfs_zlock *zl_next; /* next in list */
} zfs_zlock_t;
/*
* Drop locks and release vnodes that were held by zfs_rename_lock().
*/
static void
zfs_rename_unlock(zfs_zlock_t **zlpp)
{
zfs_zlock_t *zl;
while ((zl = *zlpp) != NULL) {
if (zl->zl_znode != NULL)
VN_RELE(ZTOV(zl->zl_znode));
rw_exit(zl->zl_rwlock);
*zlpp = zl->zl_next;
kmem_free(zl, sizeof (*zl));
}
}
/*
* Search back through the directory tree, using the ".." entries.
* Lock each directory in the chain to prevent concurrent renames.
* Fail any attempt to move a directory into one of its own descendants.
* XXX - z_parent_lock can overlap with map or grow locks
*/
static int
zfs_rename_lock(znode_t *szp, znode_t *tdzp, znode_t *sdzp, zfs_zlock_t **zlpp)
{
zfs_zlock_t *zl;
znode_t *zp = tdzp;
uint64_t rootid = zp->z_zfsvfs->z_root;
uint64_t *oidp = &zp->z_id;
krwlock_t *rwlp = &szp->z_parent_lock;
krw_t rw = RW_WRITER;
/*
* First pass write-locks szp and compares to zp->z_id.
* Later passes read-lock zp and compare to zp->z_parent.
*/
do {
if (!rw_tryenter(rwlp, rw)) {
/*
* Another thread is renaming in this path.
* Note that if we are a WRITER, we don't have any
* parent_locks held yet.
*/
if (rw == RW_READER && zp->z_id > szp->z_id) {
/*
* Drop our locks and restart
*/
zfs_rename_unlock(&zl);
*zlpp = NULL;
zp = tdzp;
oidp = &zp->z_id;
rwlp = &szp->z_parent_lock;
rw = RW_WRITER;
continue;
} else {
/*
* Wait for other thread to drop its locks
*/
rw_enter(rwlp, rw);
}
}
zl = kmem_alloc(sizeof (*zl), KM_SLEEP);
zl->zl_rwlock = rwlp;
zl->zl_znode = NULL;
zl->zl_next = *zlpp;
*zlpp = zl;
if (*oidp == szp->z_id) /* We're a descendant of szp */
return (EINVAL);
if (*oidp == rootid) /* We've hit the top */
return (0);
if (rw == RW_READER) { /* i.e. not the first pass */
int error = zfs_zget(zp->z_zfsvfs, *oidp, &zp);
if (error)
return (error);
zl->zl_znode = zp;
}
oidp = &zp->z_phys->zp_parent;
rwlp = &zp->z_parent_lock;
rw = RW_READER;
} while (zp->z_id != sdzp->z_id);
return (0);
}
/*
* Move an entry from the provided source directory to the target
* directory. Change the entry name as indicated.
*
* IN: sdvp - Source directory containing the "old entry".
* snm - Old entry name.
* tdvp - Target directory to contain the "new entry".
* tnm - New entry name.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* sdvp,tdvp - ctime|mtime updated
*/
/*ARGSUSED*/
static int
zfs_rename(vnode_t *sdvp, char *snm, vnode_t *tdvp, char *tnm, cred_t *cr,
caller_context_t *ct, int flags)
{
znode_t *tdzp, *szp, *tzp;
znode_t *sdzp = VTOZ(sdvp);
zfsvfs_t *zfsvfs = sdzp->z_zfsvfs;
zilog_t *zilog;
vnode_t *realvp;
zfs_dirlock_t *sdl, *tdl;
dmu_tx_t *tx;
zfs_zlock_t *zl;
int cmp, serr, terr;
int error = 0;
int zflg = 0;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(sdzp);
zilog = zfsvfs->z_log;
/*
* Make sure we have the real vp for the target directory.
*/
if (VOP_REALVP(tdvp, &realvp, ct) == 0)
tdvp = realvp;
if (tdvp->v_vfsp != sdvp->v_vfsp) {
ZFS_EXIT(zfsvfs);
return (EXDEV);
}
tdzp = VTOZ(tdvp);
ZFS_VERIFY_ZP(tdzp);
if (zfsvfs->z_utf8 && u8_validate(tnm,
strlen(tnm), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zflg |= ZCILOOK;
top:
szp = NULL;
tzp = NULL;
zl = NULL;
/*
* This is to prevent the creation of links into attribute space
* by renaming a linked file into/outof an attribute directory.
* See the comment in zfs_link() for why this is considered bad.
*/
if ((tdzp->z_phys->zp_flags & ZFS_XATTR) !=
(sdzp->z_phys->zp_flags & ZFS_XATTR)) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* Lock source and target directory entries. To prevent deadlock,
* a lock ordering must be defined. We lock the directory with
* the smallest object id first, or if it's a tie, the one with
* the lexically first name.
*/
if (sdzp->z_id < tdzp->z_id) {
cmp = -1;
} else if (sdzp->z_id > tdzp->z_id) {
cmp = 1;
} else {
/*
* First compare the two name arguments without
* considering any case folding.
*/
int nofold = (zfsvfs->z_norm & ~U8_TEXTPREP_TOUPPER);
cmp = u8_strcmp(snm, tnm, 0, nofold, U8_UNICODE_LATEST, &error);
ASSERT(error == 0 || !zfsvfs->z_utf8);
if (cmp == 0) {
/*
* POSIX: "If the old argument and the new argument
* both refer to links to the same existing file,
* the rename() function shall return successfully
* and perform no other action."
*/
ZFS_EXIT(zfsvfs);
return (0);
}
/*
* If the file system is case-folding, then we may
* have some more checking to do. A case-folding file
* system is either supporting mixed case sensitivity
* access or is completely case-insensitive. Note
* that the file system is always case preserving.
*
* In mixed sensitivity mode case sensitive behavior
* is the default. FIGNORECASE must be used to
* explicitly request case insensitive behavior.
*
* If the source and target names provided differ only
* by case (e.g., a request to rename 'tim' to 'Tim'),
* we will treat this as a special case in the
* case-insensitive mode: as long as the source name
* is an exact match, we will allow this to proceed as
* a name-change request.
*/
if ((zfsvfs->z_case == ZFS_CASE_INSENSITIVE ||
(zfsvfs->z_case == ZFS_CASE_MIXED &&
flags & FIGNORECASE)) &&
u8_strcmp(snm, tnm, 0, zfsvfs->z_norm, U8_UNICODE_LATEST,
&error) == 0) {
/*
* case preserving rename request, require exact
* name matches
*/
zflg |= ZCIEXACT;
zflg &= ~ZCILOOK;
}
}
/*
* If the source and destination directories are the same, we should
* grab the z_name_lock of that directory only once.
*/
if (sdzp == tdzp) {
zflg |= ZHAVELOCK;
rw_enter(&sdzp->z_name_lock, RW_READER);
}
if (cmp < 0) {
serr = zfs_dirent_lock(&sdl, sdzp, snm, &szp,
ZEXISTS | zflg, NULL, NULL);
terr = zfs_dirent_lock(&tdl,
tdzp, tnm, &tzp, ZRENAMING | zflg, NULL, NULL);
} else {
terr = zfs_dirent_lock(&tdl,
tdzp, tnm, &tzp, zflg, NULL, NULL);
serr = zfs_dirent_lock(&sdl,
sdzp, snm, &szp, ZEXISTS | ZRENAMING | zflg,
NULL, NULL);
}
if (serr) {
/*
* Source entry invalid or not there.
*/
if (!terr) {
zfs_dirent_unlock(tdl);
if (tzp)
VN_RELE(ZTOV(tzp));
}
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
if (strcmp(snm, ".") == 0 || strcmp(snm, "..") == 0)
serr = EINVAL;
ZFS_EXIT(zfsvfs);
return (serr);
}
if (terr) {
zfs_dirent_unlock(sdl);
VN_RELE(ZTOV(szp));
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
if (strcmp(tnm, "..") == 0)
terr = EINVAL;
ZFS_EXIT(zfsvfs);
return (terr);
}
/*
* Must have write access at the source to remove the old entry
* and write access at the target to create the new entry.
* Note that if target and source are the same, this can be
* done in a single check.
*/
if (error = zfs_zaccess_rename(sdzp, szp, tdzp, tzp, cr))
goto out;
if (ZTOV(szp)->v_type == VDIR) {
/*
* Check to make sure rename is valid.
* Can't do a move like this: /usr/a/b to /usr/a/b/c/d
*/
if (error = zfs_rename_lock(szp, tdzp, sdzp, &zl))
goto out;
}
/*
* Does target exist?
*/
if (tzp) {
/*
* Source and target must be the same type.
*/
if (ZTOV(szp)->v_type == VDIR) {
if (ZTOV(tzp)->v_type != VDIR) {
error = ENOTDIR;
goto out;
}
} else {
if (ZTOV(tzp)->v_type == VDIR) {
error = EISDIR;
goto out;
}
}
/*
* POSIX dictates that when the source and target
* entries refer to the same file object, rename
* must do nothing and exit without error.
*/
if (szp->z_id == tzp->z_id) {
error = 0;
goto out;
}
}
vnevent_rename_src(ZTOV(szp), sdvp, snm, ct);
if (tzp)
vnevent_rename_dest(ZTOV(tzp), tdvp, tnm, ct);
/*
* notify the target directory if it is not the same
* as source directory.
*/
if (tdvp != sdvp) {
vnevent_rename_dest_dir(tdvp, ct);
}
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, szp->z_id); /* nlink changes */
dmu_tx_hold_bonus(tx, sdzp->z_id); /* nlink changes */
dmu_tx_hold_zap(tx, sdzp->z_id, FALSE, snm);
dmu_tx_hold_zap(tx, tdzp->z_id, TRUE, tnm);
if (sdzp != tdzp)
dmu_tx_hold_bonus(tx, tdzp->z_id); /* nlink changes */
if (tzp)
dmu_tx_hold_bonus(tx, tzp->z_id); /* parent changes */
dmu_tx_hold_zap(tx, zfsvfs->z_unlinkedobj, FALSE, NULL);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
if (zl != NULL)
zfs_rename_unlock(&zl);
zfs_dirent_unlock(sdl);
zfs_dirent_unlock(tdl);
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
VN_RELE(ZTOV(szp));
if (tzp)
VN_RELE(ZTOV(tzp));
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
if (tzp) /* Attempt to remove the existing target */
error = zfs_link_destroy(tdl, tzp, tx, zflg, NULL);
if (error == 0) {
error = zfs_link_create(tdl, szp, tx, ZRENAMING);
if (error == 0) {
szp->z_phys->zp_flags |= ZFS_AV_MODIFIED;
error = zfs_link_destroy(sdl, szp, tx, ZRENAMING, NULL);
ASSERT(error == 0);
zfs_log_rename(zilog, tx,
TX_RENAME | (flags & FIGNORECASE ? TX_CI : 0),
sdzp, sdl->dl_name, tdzp, tdl->dl_name, szp);
/* Update path information for the target vnode */
vn_renamepath(tdvp, ZTOV(szp), tnm, strlen(tnm));
}
#ifdef FREEBSD_NAMECACHE
if (error == 0) {
cache_purge(sdvp);
cache_purge(tdvp);
}
#endif
}
dmu_tx_commit(tx);
out:
if (zl != NULL)
zfs_rename_unlock(&zl);
zfs_dirent_unlock(sdl);
zfs_dirent_unlock(tdl);
if (sdzp == tdzp)
rw_exit(&sdzp->z_name_lock);
VN_RELE(ZTOV(szp));
if (tzp)
VN_RELE(ZTOV(tzp));
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Insert the indicated symbolic reference entry into the directory.
*
* IN: dvp - Directory to contain new symbolic link.
* link - Name for new symlink entry.
* vap - Attributes of new entry.
* target - Target path of new symlink.
* cr - credentials of caller.
* ct - caller context
* flags - case flags
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* dvp - ctime|mtime updated
*/
/*ARGSUSED*/
static int
zfs_symlink(vnode_t *dvp, vnode_t **vpp, char *name, vattr_t *vap, char *link,
cred_t *cr, kthread_t *td)
{
znode_t *zp, *dzp = VTOZ(dvp);
zfs_dirlock_t *dl;
dmu_tx_t *tx;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
int len = strlen(link);
int error;
int zflg = ZNEW;
zfs_fuid_info_t *fuidp = NULL;
int flags = 0;
ASSERT(vap->va_type == VLNK);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (zfsvfs->z_utf8 && u8_validate(name, strlen(name),
NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zflg |= ZCILOOK;
top:
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
ZFS_EXIT(zfsvfs);
return (error);
}
if (len > MAXPATHLEN) {
ZFS_EXIT(zfsvfs);
return (ENAMETOOLONG);
}
/*
* Attempt to lock directory; fail if entry already exists.
*/
error = zfs_dirent_lock(&dl, dzp, name, &zp, zflg, NULL, NULL);
if (error) {
ZFS_EXIT(zfsvfs);
return (error);
}
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, MAX(1, len));
dmu_tx_hold_bonus(tx, dzp->z_id);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
if (dzp->z_phys->zp_flags & ZFS_INHERIT_ACE)
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0, SPA_MAXBLOCKSIZE);
if (IS_EPHEMERAL(crgetuid(cr)) || IS_EPHEMERAL(crgetgid(cr))) {
if (zfsvfs->z_fuid_obj == 0) {
dmu_tx_hold_bonus(tx, DMU_NEW_OBJECT);
dmu_tx_hold_write(tx, DMU_NEW_OBJECT, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
dmu_tx_hold_zap(tx, MASTER_NODE_OBJ, FALSE, NULL);
} else {
dmu_tx_hold_bonus(tx, zfsvfs->z_fuid_obj);
dmu_tx_hold_write(tx, zfsvfs->z_fuid_obj, 0,
FUID_SIZE_ESTIMATE(zfsvfs));
}
}
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
dmu_buf_will_dirty(dzp->z_dbuf, tx);
/*
* Create a new object for the symlink.
* Put the link content into bonus buffer if it will fit;
* otherwise, store it just like any other file data.
*/
if (sizeof (znode_phys_t) + len <= dmu_bonus_max()) {
zfs_mknode(dzp, vap, tx, cr, 0, &zp, len, NULL, &fuidp);
if (len != 0)
bcopy(link, zp->z_phys + 1, len);
} else {
dmu_buf_t *dbp;
zfs_mknode(dzp, vap, tx, cr, 0, &zp, 0, NULL, &fuidp);
/*
* Nothing can access the znode yet so no locking needed
* for growing the znode's blocksize.
*/
zfs_grow_blocksize(zp, len, tx);
VERIFY(0 == dmu_buf_hold(zfsvfs->z_os,
zp->z_id, 0, FTAG, &dbp));
dmu_buf_will_dirty(dbp, tx);
ASSERT3U(len, <=, dbp->db_size);
bcopy(link, dbp->db_data, len);
dmu_buf_rele(dbp, FTAG);
}
zp->z_phys->zp_size = len;
/*
* Insert the new object into the directory.
*/
(void) zfs_link_create(dl, zp, tx, ZNEW);
out:
if (error == 0) {
uint64_t txtype = TX_SYMLINK;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_symlink(zilog, tx, txtype, dzp, zp, name, link);
*vpp = ZTOV(zp);
}
if (fuidp)
zfs_fuid_info_free(fuidp);
dmu_tx_commit(tx);
zfs_dirent_unlock(dl);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Return, in the buffer contained in the provided uio structure,
* the symbolic path referred to by vp.
*
* IN: vp - vnode of symbolic link.
* uoip - structure to contain the link path.
* cr - credentials of caller.
* ct - caller context
*
* OUT: uio - structure to contain the link path.
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* vp - atime updated
*/
/* ARGSUSED */
static int
zfs_readlink(vnode_t *vp, uio_t *uio, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
size_t bufsz;
int error;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
bufsz = (size_t)zp->z_phys->zp_size;
if (bufsz + sizeof (znode_phys_t) <= zp->z_dbuf->db_size) {
error = uiomove(zp->z_phys + 1,
MIN((size_t)bufsz, uio->uio_resid), UIO_READ, uio);
} else {
dmu_buf_t *dbp;
error = dmu_buf_hold(zfsvfs->z_os, zp->z_id, 0, FTAG, &dbp);
if (error) {
ZFS_EXIT(zfsvfs);
return (error);
}
error = uiomove(dbp->db_data,
MIN((size_t)bufsz, uio->uio_resid), UIO_READ, uio);
dmu_buf_rele(dbp, FTAG);
}
ZFS_ACCESSTIME_STAMP(zfsvfs, zp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Insert a new entry into directory tdvp referencing svp.
*
* IN: tdvp - Directory to contain new entry.
* svp - vnode of new entry.
* name - name of new entry.
* cr - credentials of caller.
* ct - caller context
*
* RETURN: 0 if success
* error code if failure
*
* Timestamps:
* tdvp - ctime|mtime updated
* svp - ctime updated
*/
/* ARGSUSED */
static int
zfs_link(vnode_t *tdvp, vnode_t *svp, char *name, cred_t *cr,
caller_context_t *ct, int flags)
{
znode_t *dzp = VTOZ(tdvp);
znode_t *tzp, *szp;
zfsvfs_t *zfsvfs = dzp->z_zfsvfs;
zilog_t *zilog;
zfs_dirlock_t *dl;
dmu_tx_t *tx;
vnode_t *realvp;
int error;
int zf = ZNEW;
uid_t owner;
ASSERT(tdvp->v_type == VDIR);
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(dzp);
zilog = zfsvfs->z_log;
if (VOP_REALVP(svp, &realvp, ct) == 0)
svp = realvp;
if (svp->v_vfsp != tdvp->v_vfsp) {
ZFS_EXIT(zfsvfs);
return (EXDEV);
}
szp = VTOZ(svp);
ZFS_VERIFY_ZP(szp);
if (zfsvfs->z_utf8 && u8_validate(name,
strlen(name), NULL, U8_VALIDATE_ENTIRE, &error) < 0) {
ZFS_EXIT(zfsvfs);
return (EILSEQ);
}
if (flags & FIGNORECASE)
zf |= ZCILOOK;
top:
/*
* We do not support links between attributes and non-attributes
* because of the potential security risk of creating links
* into "normal" file space in order to circumvent restrictions
* imposed in attribute space.
*/
if ((szp->z_phys->zp_flags & ZFS_XATTR) !=
(dzp->z_phys->zp_flags & ZFS_XATTR)) {
ZFS_EXIT(zfsvfs);
return (EINVAL);
}
/*
* POSIX dictates that we return EPERM here.
* Better choices include ENOTSUP or EISDIR.
*/
if (svp->v_type == VDIR) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
owner = zfs_fuid_map_id(zfsvfs, szp->z_phys->zp_uid, cr, ZFS_OWNER);
if (owner != crgetuid(cr) &&
secpolicy_basic_link(svp, cr) != 0) {
ZFS_EXIT(zfsvfs);
return (EPERM);
}
if (error = zfs_zaccess(dzp, ACE_ADD_FILE, 0, B_FALSE, cr)) {
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Attempt to lock directory; fail if entry already exists.
*/
error = zfs_dirent_lock(&dl, dzp, name, &tzp, zf, NULL, NULL);
if (error) {
ZFS_EXIT(zfsvfs);
return (error);
}
tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, szp->z_id);
dmu_tx_hold_zap(tx, dzp->z_id, TRUE, name);
error = dmu_tx_assign(tx, zfsvfs->z_assign);
if (error) {
zfs_dirent_unlock(dl);
if (error == ERESTART && zfsvfs->z_assign == TXG_NOWAIT) {
dmu_tx_wait(tx);
dmu_tx_abort(tx);
goto top;
}
dmu_tx_abort(tx);
ZFS_EXIT(zfsvfs);
return (error);
}
error = zfs_link_create(dl, szp, tx, 0);
if (error == 0) {
uint64_t txtype = TX_LINK;
if (flags & FIGNORECASE)
txtype |= TX_CI;
zfs_log_link(zilog, tx, txtype, dzp, szp, name);
}
dmu_tx_commit(tx);
zfs_dirent_unlock(dl);
if (error == 0) {
vnevent_link(svp, ct);
}
ZFS_EXIT(zfsvfs);
return (error);
}
/*ARGSUSED*/
void
zfs_inactive(vnode_t *vp, cred_t *cr, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
if (zp->z_dbuf == NULL) {
/*
* The fs has been unmounted, or we did a
* suspend/resume and this file no longer exists.
*/
VI_LOCK(vp);
vp->v_count = 0; /* count arrives as 1 */
VI_UNLOCK(vp);
vrecycle(vp, curthread);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
return;
}
if (zp->z_atime_dirty && zp->z_unlinked == 0) {
dmu_tx_t *tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_bonus(tx, zp->z_id);
error = dmu_tx_assign(tx, TXG_WAIT);
if (error) {
dmu_tx_abort(tx);
} else {
dmu_buf_will_dirty(zp->z_dbuf, tx);
mutex_enter(&zp->z_lock);
zp->z_atime_dirty = 0;
mutex_exit(&zp->z_lock);
dmu_tx_commit(tx);
}
}
zfs_zinactive(zp);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
}
CTASSERT(sizeof(struct zfid_short) <= sizeof(struct fid));
CTASSERT(sizeof(struct zfid_long) <= sizeof(struct fid));
/*ARGSUSED*/
static int
zfs_fid(vnode_t *vp, fid_t *fidp, caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
uint32_t gen;
uint64_t object = zp->z_id;
zfid_short_t *zfid;
int size, i;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
gen = (uint32_t)zp->z_gen;
size = (zfsvfs->z_parent != zfsvfs) ? LONG_FID_LEN : SHORT_FID_LEN;
fidp->fid_len = size;
zfid = (zfid_short_t *)fidp;
zfid->zf_len = size;
for (i = 0; i < sizeof (zfid->zf_object); i++)
zfid->zf_object[i] = (uint8_t)(object >> (8 * i));
/* Must have a non-zero generation number to distinguish from .zfs */
if (gen == 0)
gen = 1;
for (i = 0; i < sizeof (zfid->zf_gen); i++)
zfid->zf_gen[i] = (uint8_t)(gen >> (8 * i));
if (size == LONG_FID_LEN) {
uint64_t objsetid = dmu_objset_id(zfsvfs->z_os);
zfid_long_t *zlfid;
zlfid = (zfid_long_t *)fidp;
for (i = 0; i < sizeof (zlfid->zf_setid); i++)
zlfid->zf_setid[i] = (uint8_t)(objsetid >> (8 * i));
/* XXX - this should be the generation number for the objset */
for (i = 0; i < sizeof (zlfid->zf_setgen); i++)
zlfid->zf_setgen[i] = 0;
}
ZFS_EXIT(zfsvfs);
return (0);
}
static int
zfs_pathconf(vnode_t *vp, int cmd, ulong_t *valp, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp, *xzp;
zfsvfs_t *zfsvfs;
zfs_dirlock_t *dl;
int error;
switch (cmd) {
case _PC_LINK_MAX:
*valp = INT_MAX;
return (0);
case _PC_FILESIZEBITS:
*valp = 64;
return (0);
#if 0
case _PC_XATTR_EXISTS:
zp = VTOZ(vp);
zfsvfs = zp->z_zfsvfs;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
*valp = 0;
error = zfs_dirent_lock(&dl, zp, "", &xzp,
ZXATTR | ZEXISTS | ZSHARED, NULL, NULL);
if (error == 0) {
zfs_dirent_unlock(dl);
if (!zfs_dirempty(xzp))
*valp = 1;
VN_RELE(ZTOV(xzp));
} else if (error == ENOENT) {
/*
* If there aren't extended attributes, it's the
* same as having zero of them.
*/
error = 0;
}
ZFS_EXIT(zfsvfs);
return (error);
#endif
case _PC_ACL_EXTENDED:
*valp = 0;
return (0);
case _PC_ACL_NFS4:
*valp = 1;
return (0);
case _PC_ACL_PATH_MAX:
*valp = ACL_MAX_ENTRIES;
return (0);
case _PC_MIN_HOLE_SIZE:
*valp = (int)SPA_MINBLOCKSIZE;
return (0);
default:
return (EOPNOTSUPP);
}
}
/*ARGSUSED*/
static int
zfs_getsecattr(vnode_t *vp, vsecattr_t *vsecp, int flag, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
boolean_t skipaclchk = (flag & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
error = zfs_getacl(zp, vsecp, skipaclchk, cr);
ZFS_EXIT(zfsvfs);
return (error);
}
/*ARGSUSED*/
static int
zfs_setsecattr(vnode_t *vp, vsecattr_t *vsecp, int flag, cred_t *cr,
caller_context_t *ct)
{
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
int error;
boolean_t skipaclchk = (flag & ATTR_NOACLCHECK) ? B_TRUE : B_FALSE;
ZFS_ENTER(zfsvfs);
ZFS_VERIFY_ZP(zp);
error = zfs_setacl(zp, vsecp, skipaclchk, cr);
ZFS_EXIT(zfsvfs);
return (error);
}
static int
zfs_freebsd_open(ap)
struct vop_open_args /* {
struct vnode *a_vp;
int a_mode;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
znode_t *zp = VTOZ(vp);
int error;
error = zfs_open(&vp, ap->a_mode, ap->a_cred, NULL);
if (error == 0)
vnode_create_vobject(vp, zp->z_phys->zp_size, ap->a_td);
return (error);
}
static int
zfs_freebsd_close(ap)
struct vop_close_args /* {
struct vnode *a_vp;
int a_fflag;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
return (zfs_close(ap->a_vp, ap->a_fflag, 0, 0, ap->a_cred, NULL));
}
static int
zfs_freebsd_ioctl(ap)
struct vop_ioctl_args /* {
struct vnode *a_vp;
u_long a_command;
caddr_t a_data;
int a_fflag;
struct ucred *cred;
struct thread *td;
} */ *ap;
{
return (zfs_ioctl(ap->a_vp, ap->a_command, (intptr_t)ap->a_data,
ap->a_fflag, ap->a_cred, NULL, NULL));
}
static int
zfs_freebsd_read(ap)
struct vop_read_args /* {
struct vnode *a_vp;
struct uio *a_uio;
int a_ioflag;
struct ucred *a_cred;
} */ *ap;
{
return (zfs_read(ap->a_vp, ap->a_uio, ap->a_ioflag, ap->a_cred, NULL));
}
static int
zfs_freebsd_write(ap)
struct vop_write_args /* {
struct vnode *a_vp;
struct uio *a_uio;
int a_ioflag;
struct ucred *a_cred;
} */ *ap;
{
return (zfs_write(ap->a_vp, ap->a_uio, ap->a_ioflag, ap->a_cred, NULL));
}
static int
zfs_freebsd_access(ap)
struct vop_access_args /* {
struct vnode *a_vp;
accmode_t a_accmode;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
accmode_t accmode;
int error = 0;
/*
* ZFS itself only knowns about VREAD, VWRITE, VEXEC and VAPPEND,
*/
accmode = ap->a_accmode & (VREAD|VWRITE|VEXEC|VAPPEND);
if (accmode != 0)
error = zfs_access(ap->a_vp, accmode, 0, ap->a_cred, NULL);
/*
* VADMIN has to be handled by vaccess().
*/
if (error == 0) {
accmode = ap->a_accmode & ~(VREAD|VWRITE|VEXEC|VAPPEND);
if (accmode != 0) {
vnode_t *vp = ap->a_vp;
znode_t *zp = VTOZ(vp);
znode_phys_t *zphys = zp->z_phys;
error = vaccess(vp->v_type, zphys->zp_mode,
zphys->zp_uid, zphys->zp_gid, accmode, ap->a_cred,
NULL);
}
}
return (error);
}
static int
zfs_freebsd_lookup(ap)
struct vop_lookup_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
char nm[NAME_MAX + 1];
ASSERT(cnp->cn_namelen < sizeof(nm));
strlcpy(nm, cnp->cn_nameptr, MIN(cnp->cn_namelen + 1, sizeof(nm)));
return (zfs_lookup(ap->a_dvp, nm, ap->a_vpp, cnp, cnp->cn_nameiop,
cnp->cn_cred, cnp->cn_thread, 0));
}
static int
zfs_freebsd_create(ap)
struct vop_create_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
struct vattr *a_vap;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
vattr_t *vap = ap->a_vap;
int mode;
ASSERT(cnp->cn_flags & SAVENAME);
vattr_init_mask(vap);
mode = vap->va_mode & ALLPERMS;
return (zfs_create(ap->a_dvp, cnp->cn_nameptr, vap, !EXCL, mode,
ap->a_vpp, cnp->cn_cred, cnp->cn_thread));
}
static int
zfs_freebsd_remove(ap)
struct vop_remove_args /* {
struct vnode *a_dvp;
struct vnode *a_vp;
struct componentname *a_cnp;
} */ *ap;
{
ASSERT(ap->a_cnp->cn_flags & SAVENAME);
return (zfs_remove(ap->a_dvp, ap->a_cnp->cn_nameptr,
ap->a_cnp->cn_cred, NULL, 0));
}
static int
zfs_freebsd_mkdir(ap)
struct vop_mkdir_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
struct vattr *a_vap;
} */ *ap;
{
vattr_t *vap = ap->a_vap;
ASSERT(ap->a_cnp->cn_flags & SAVENAME);
vattr_init_mask(vap);
return (zfs_mkdir(ap->a_dvp, ap->a_cnp->cn_nameptr, vap, ap->a_vpp,
ap->a_cnp->cn_cred, NULL, 0, NULL));
}
static int
zfs_freebsd_rmdir(ap)
struct vop_rmdir_args /* {
struct vnode *a_dvp;
struct vnode *a_vp;
struct componentname *a_cnp;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
ASSERT(cnp->cn_flags & SAVENAME);
return (zfs_rmdir(ap->a_dvp, cnp->cn_nameptr, NULL, cnp->cn_cred, NULL, 0));
}
static int
zfs_freebsd_readdir(ap)
struct vop_readdir_args /* {
struct vnode *a_vp;
struct uio *a_uio;
struct ucred *a_cred;
int *a_eofflag;
int *a_ncookies;
u_long **a_cookies;
} */ *ap;
{
return (zfs_readdir(ap->a_vp, ap->a_uio, ap->a_cred, ap->a_eofflag,
ap->a_ncookies, ap->a_cookies));
}
static int
zfs_freebsd_fsync(ap)
struct vop_fsync_args /* {
struct vnode *a_vp;
int a_waitfor;
struct thread *a_td;
} */ *ap;
{
vop_stdfsync(ap);
return (zfs_fsync(ap->a_vp, 0, ap->a_td->td_ucred, NULL));
}
static int
zfs_freebsd_getattr(ap)
struct vop_getattr_args /* {
struct vnode *a_vp;
struct vattr *a_vap;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
vattr_t *vap = ap->a_vap;
xvattr_t xvap;
u_long fflags = 0;
int error;
xva_init(&xvap);
xvap.xva_vattr = *vap;
xvap.xva_vattr.va_mask |= AT_XVATTR;
/* Convert chflags into ZFS-type flags. */
/* XXX: what about SF_SETTABLE?. */
XVA_SET_REQ(&xvap, XAT_IMMUTABLE);
XVA_SET_REQ(&xvap, XAT_APPENDONLY);
XVA_SET_REQ(&xvap, XAT_NOUNLINK);
XVA_SET_REQ(&xvap, XAT_NODUMP);
error = zfs_getattr(ap->a_vp, (vattr_t *)&xvap, 0, ap->a_cred, NULL);
if (error != 0)
return (error);
/* Convert ZFS xattr into chflags. */
#define FLAG_CHECK(fflag, xflag, xfield) do { \
if (XVA_ISSET_RTN(&xvap, (xflag)) && (xfield) != 0) \
fflags |= (fflag); \
} while (0)
FLAG_CHECK(SF_IMMUTABLE, XAT_IMMUTABLE,
xvap.xva_xoptattrs.xoa_immutable);
FLAG_CHECK(SF_APPEND, XAT_APPENDONLY,
xvap.xva_xoptattrs.xoa_appendonly);
FLAG_CHECK(SF_NOUNLINK, XAT_NOUNLINK,
xvap.xva_xoptattrs.xoa_nounlink);
FLAG_CHECK(UF_NODUMP, XAT_NODUMP,
xvap.xva_xoptattrs.xoa_nodump);
#undef FLAG_CHECK
*vap = xvap.xva_vattr;
vap->va_flags = fflags;
return (0);
}
static int
zfs_freebsd_setattr(ap)
struct vop_setattr_args /* {
struct vnode *a_vp;
struct vattr *a_vap;
struct ucred *a_cred;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
vattr_t *vap = ap->a_vap;
cred_t *cred = ap->a_cred;
xvattr_t xvap;
u_long fflags;
uint64_t zflags;
vattr_init_mask(vap);
vap->va_mask &= ~AT_NOSET;
xva_init(&xvap);
xvap.xva_vattr = *vap;
zflags = VTOZ(vp)->z_phys->zp_flags;
if (vap->va_flags != VNOVAL) {
zfsvfs_t *zfsvfs = VTOZ(vp)->z_zfsvfs;
int error;
if (zfsvfs->z_use_fuids == B_FALSE)
return (EOPNOTSUPP);
fflags = vap->va_flags;
if ((fflags & ~(SF_IMMUTABLE|SF_APPEND|SF_NOUNLINK|UF_NODUMP)) != 0)
return (EOPNOTSUPP);
/*
* Unprivileged processes are not permitted to unset system
* flags, or modify flags if any system flags are set.
* Privileged non-jail processes may not modify system flags
* if securelevel > 0 and any existing system flags are set.
* Privileged jail processes behave like privileged non-jail
* processes if the security.jail.chflags_allowed sysctl is
* is non-zero; otherwise, they behave like unprivileged
* processes.
*/
if (secpolicy_fs_owner(vp->v_mount, cred) == 0 ||
priv_check_cred(cred, PRIV_VFS_SYSFLAGS, 0) == 0) {
if (zflags &
(ZFS_IMMUTABLE | ZFS_APPENDONLY | ZFS_NOUNLINK)) {
error = securelevel_gt(cred, 0);
if (error != 0)
return (error);
}
} else {
/*
* Callers may only modify the file flags on objects they
* have VADMIN rights for.
*/
if ((error = VOP_ACCESS(vp, VADMIN, cred, curthread)) != 0)
return (error);
if (zflags &
(ZFS_IMMUTABLE | ZFS_APPENDONLY | ZFS_NOUNLINK)) {
return (EPERM);
}
if (fflags &
(SF_IMMUTABLE | SF_APPEND | SF_NOUNLINK)) {
return (EPERM);
}
}
#define FLAG_CHANGE(fflag, zflag, xflag, xfield) do { \
if (((fflags & (fflag)) && !(zflags & (zflag))) || \
((zflags & (zflag)) && !(fflags & (fflag)))) { \
XVA_SET_REQ(&xvap, (xflag)); \
(xfield) = ((fflags & (fflag)) != 0); \
} \
} while (0)
/* Convert chflags into ZFS-type flags. */
/* XXX: what about SF_SETTABLE?. */
FLAG_CHANGE(SF_IMMUTABLE, ZFS_IMMUTABLE, XAT_IMMUTABLE,
xvap.xva_xoptattrs.xoa_immutable);
FLAG_CHANGE(SF_APPEND, ZFS_APPENDONLY, XAT_APPENDONLY,
xvap.xva_xoptattrs.xoa_appendonly);
FLAG_CHANGE(SF_NOUNLINK, ZFS_NOUNLINK, XAT_NOUNLINK,
xvap.xva_xoptattrs.xoa_nounlink);
FLAG_CHANGE(UF_NODUMP, ZFS_NODUMP, XAT_NODUMP,
xvap.xva_xoptattrs.xoa_nodump);
#undef FLAG_CHANGE
}
return (zfs_setattr(vp, (vattr_t *)&xvap, 0, cred, NULL));
}
static int
zfs_freebsd_rename(ap)
struct vop_rename_args /* {
struct vnode *a_fdvp;
struct vnode *a_fvp;
struct componentname *a_fcnp;
struct vnode *a_tdvp;
struct vnode *a_tvp;
struct componentname *a_tcnp;
} */ *ap;
{
vnode_t *fdvp = ap->a_fdvp;
vnode_t *fvp = ap->a_fvp;
vnode_t *tdvp = ap->a_tdvp;
vnode_t *tvp = ap->a_tvp;
int error;
ASSERT(ap->a_fcnp->cn_flags & (SAVENAME|SAVESTART));
ASSERT(ap->a_tcnp->cn_flags & (SAVENAME|SAVESTART));
error = zfs_rename(fdvp, ap->a_fcnp->cn_nameptr, tdvp,
ap->a_tcnp->cn_nameptr, ap->a_fcnp->cn_cred, NULL, 0);
if (tdvp == tvp)
VN_RELE(tdvp);
else
VN_URELE(tdvp);
if (tvp)
VN_URELE(tvp);
VN_RELE(fdvp);
VN_RELE(fvp);
return (error);
}
static int
zfs_freebsd_symlink(ap)
struct vop_symlink_args /* {
struct vnode *a_dvp;
struct vnode **a_vpp;
struct componentname *a_cnp;
struct vattr *a_vap;
char *a_target;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
vattr_t *vap = ap->a_vap;
ASSERT(cnp->cn_flags & SAVENAME);
vap->va_type = VLNK; /* FreeBSD: Syscall only sets va_mode. */
vattr_init_mask(vap);
return (zfs_symlink(ap->a_dvp, ap->a_vpp, cnp->cn_nameptr, vap,
ap->a_target, cnp->cn_cred, cnp->cn_thread));
}
static int
zfs_freebsd_readlink(ap)
struct vop_readlink_args /* {
struct vnode *a_vp;
struct uio *a_uio;
struct ucred *a_cred;
} */ *ap;
{
return (zfs_readlink(ap->a_vp, ap->a_uio, ap->a_cred, NULL));
}
static int
zfs_freebsd_link(ap)
struct vop_link_args /* {
struct vnode *a_tdvp;
struct vnode *a_vp;
struct componentname *a_cnp;
} */ *ap;
{
struct componentname *cnp = ap->a_cnp;
ASSERT(cnp->cn_flags & SAVENAME);
return (zfs_link(ap->a_tdvp, ap->a_vp, cnp->cn_nameptr, cnp->cn_cred, NULL, 0));
}
static int
zfs_freebsd_inactive(ap)
struct vop_inactive_args /* {
struct vnode *a_vp;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
zfs_inactive(vp, ap->a_td->td_ucred, NULL);
return (0);
}
static void
zfs_reclaim_complete(void *arg, int pending)
{
znode_t *zp = arg;
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
if (zp->z_dbuf != NULL) {
ZFS_OBJ_HOLD_ENTER(zfsvfs, zp->z_id);
zfs_znode_dmu_fini(zp);
ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
}
zfs_znode_free(zp);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
/*
* If the file system is being unmounted, there is a process waiting
* for us, wake it up.
*/
if (zfsvfs->z_unmounted)
wakeup_one(zfsvfs);
}
static int
zfs_freebsd_reclaim(ap)
struct vop_reclaim_args /* {
struct vnode *a_vp;
struct thread *a_td;
} */ *ap;
{
vnode_t *vp = ap->a_vp;
znode_t *zp = VTOZ(vp);
zfsvfs_t *zfsvfs = zp->z_zfsvfs;
rw_enter(&zfsvfs->z_teardown_inactive_lock, RW_READER);
ASSERT(zp != NULL);
/*
* Destroy the vm object and flush associated pages.
*/
vnode_destroy_vobject(vp);
mutex_enter(&zp->z_lock);
ASSERT(zp->z_phys != NULL);
zp->z_vnode = NULL;
mutex_exit(&zp->z_lock);
if (zp->z_unlinked)
; /* Do nothing. */
else if (zp->z_dbuf == NULL)
zfs_znode_free(zp);
else /* if (!zp->z_unlinked && zp->z_dbuf != NULL) */ {
int locked;
locked = MUTEX_HELD(ZFS_OBJ_MUTEX(zfsvfs, zp->z_id)) ? 2 :
ZFS_OBJ_HOLD_TRYENTER(zfsvfs, zp->z_id);
if (locked == 0) {
/*
* Lock can't be obtained due to deadlock possibility,
* so defer znode destruction.
*/
TASK_INIT(&zp->z_task, 0, zfs_reclaim_complete, zp);
taskqueue_enqueue(taskqueue_thread, &zp->z_task);
} else {
zfs_znode_dmu_fini(zp);
if (locked == 1)
ZFS_OBJ_HOLD_EXIT(zfsvfs, zp->z_id);
zfs_znode_free(zp);
}
}
VI_LOCK(vp);
vp->v_data = NULL;
ASSERT(vp->v_holdcnt >= 1);
VI_UNLOCK(vp);
rw_exit(&zfsvfs->z_teardown_inactive_lock);
return (0);
}
static int
zfs_freebsd_fid(ap)
struct vop_fid_args /* {
struct vnode *a_vp;
struct fid *a_fid;
} */ *ap;
{
return (zfs_fid(ap->a_vp, (void *)ap->a_fid, NULL));
}
static int
zfs_freebsd_pathconf(ap)
struct vop_pathconf_args /* {
struct vnode *a_vp;
int a_name;
register_t *a_retval;
} */ *ap;
{
ulong_t val;
int error;
error = zfs_pathconf(ap->a_vp, ap->a_name, &val, curthread->td_ucred, NULL);
if (error == 0)
*ap->a_retval = val;
else if (error == EOPNOTSUPP)
error = vop_stdpathconf(ap);
return (error);
}
static int
zfs_freebsd_fifo_pathconf(ap)
struct vop_pathconf_args /* {
struct vnode *a_vp;
int a_name;
register_t *a_retval;
} */ *ap;
{
switch (ap->a_name) {
case _PC_ACL_EXTENDED:
case _PC_ACL_NFS4:
case _PC_ACL_PATH_MAX:
case _PC_MAC_PRESENT:
return (zfs_freebsd_pathconf(ap));
default:
return (fifo_specops.vop_pathconf(ap));
}
}
/*
* FreeBSD's extended attributes namespace defines file name prefix for ZFS'
* extended attribute name:
*
* NAMESPACE PREFIX
* system freebsd:system:
* user (none, can be used to access ZFS fsattr(5) attributes
* created on Solaris)
*/
static int
zfs_create_attrname(int attrnamespace, const char *name, char *attrname,
size_t size)
{
const char *namespace, *prefix, *suffix;
/* We don't allow '/' character in attribute name. */
if (strchr(name, '/') != NULL)
return (EINVAL);
/* We don't allow attribute names that start with "freebsd:" string. */
if (strncmp(name, "freebsd:", 8) == 0)
return (EINVAL);
bzero(attrname, size);
switch (attrnamespace) {
case EXTATTR_NAMESPACE_USER:
#if 0
prefix = "freebsd:";
namespace = EXTATTR_NAMESPACE_USER_STRING;
suffix = ":";
#else
/*
* This is the default namespace by which we can access all
* attributes created on Solaris.
*/
prefix = namespace = suffix = "";
#endif
break;
case EXTATTR_NAMESPACE_SYSTEM:
prefix = "freebsd:";
namespace = EXTATTR_NAMESPACE_SYSTEM_STRING;
suffix = ":";
break;
case EXTATTR_NAMESPACE_EMPTY:
default:
return (EINVAL);
}
if (snprintf(attrname, size, "%s%s%s%s", prefix, namespace, suffix,
name) >= size) {
return (ENAMETOOLONG);
}
return (0);
}
/*
* Vnode operating to retrieve a named extended attribute.
*/
static int
zfs_getextattr(struct vop_getextattr_args *ap)
/*
vop_getextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
IN const char *a_name;
INOUT struct uio *a_uio;
OUT size_t *a_size;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrname[255];
struct vattr va;
vnode_t *xvp = NULL, *vp;
int error, flags;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VREAD);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
sizeof(attrname));
if (error != 0)
return (error);
ZFS_ENTER(zfsvfs);
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
flags = FREAD;
NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | MPSAFE, UIO_SYSSPACE, attrname,
xvp, td);
error = vn_open_cred(&nd, &flags, 0, 0, ap->a_cred, NULL);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
if (error == ENOENT)
error = ENOATTR;
return (error);
}
if (ap->a_size != NULL) {
error = VOP_GETATTR(vp, &va, ap->a_cred);
if (error == 0)
*ap->a_size = (size_t)va.va_size;
} else if (ap->a_uio != NULL)
error = VOP_READ(vp, ap->a_uio, IO_UNIT | IO_SYNC, ap->a_cred);
VOP_UNLOCK(vp, 0);
vn_close(vp, flags, ap->a_cred, td);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Vnode operation to remove a named attribute.
*/
int
zfs_deleteextattr(struct vop_deleteextattr_args *ap)
/*
vop_deleteextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
IN const char *a_name;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrname[255];
struct vattr va;
vnode_t *xvp = NULL, *vp;
int error, flags;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VWRITE);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
sizeof(attrname));
if (error != 0)
return (error);
ZFS_ENTER(zfsvfs);
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
NDINIT_ATVP(&nd, DELETE, NOFOLLOW | LOCKPARENT | LOCKLEAF | MPSAFE,
UIO_SYSSPACE, attrname, xvp, td);
error = namei(&nd);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
if (error == ENOENT)
error = ENOATTR;
return (error);
}
error = VOP_REMOVE(nd.ni_dvp, vp, &nd.ni_cnd);
vput(nd.ni_dvp);
if (vp == nd.ni_dvp)
vrele(vp);
else
vput(vp);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Vnode operation to set a named attribute.
*/
static int
zfs_setextattr(struct vop_setextattr_args *ap)
/*
vop_setextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
IN const char *a_name;
INOUT struct uio *a_uio;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrname[255];
struct vattr va;
vnode_t *xvp = NULL, *vp;
int error, flags;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VWRITE);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, ap->a_name, attrname,
sizeof(attrname));
if (error != 0)
return (error);
ZFS_ENTER(zfsvfs);
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR | CREATE_XATTR_DIR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
flags = FFLAGS(O_WRONLY | O_CREAT);
NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | MPSAFE, UIO_SYSSPACE, attrname,
xvp, td);
error = vn_open_cred(&nd, &flags, 0600, 0, ap->a_cred, NULL);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
VATTR_NULL(&va);
va.va_size = 0;
error = VOP_SETATTR(vp, &va, ap->a_cred);
if (error == 0)
VOP_WRITE(vp, ap->a_uio, IO_UNIT | IO_SYNC, ap->a_cred);
VOP_UNLOCK(vp, 0);
vn_close(vp, flags, ap->a_cred, td);
ZFS_EXIT(zfsvfs);
return (error);
}
/*
* Vnode operation to retrieve extended attributes on a vnode.
*/
static int
zfs_listextattr(struct vop_listextattr_args *ap)
/*
vop_listextattr {
IN struct vnode *a_vp;
IN int a_attrnamespace;
INOUT struct uio *a_uio;
OUT size_t *a_size;
IN struct ucred *a_cred;
IN struct thread *a_td;
};
*/
{
zfsvfs_t *zfsvfs = VTOZ(ap->a_vp)->z_zfsvfs;
struct thread *td = ap->a_td;
struct nameidata nd;
char attrprefix[16];
u_char dirbuf[sizeof(struct dirent)];
struct dirent *dp;
struct iovec aiov;
struct uio auio, *uio = ap->a_uio;
size_t *sizep = ap->a_size;
size_t plen;
vnode_t *xvp = NULL, *vp;
int done, error, eof, pos;
error = extattr_check_cred(ap->a_vp, ap->a_attrnamespace,
ap->a_cred, ap->a_td, VREAD);
if (error != 0)
return (error);
error = zfs_create_attrname(ap->a_attrnamespace, "", attrprefix,
sizeof(attrprefix));
if (error != 0)
return (error);
plen = strlen(attrprefix);
ZFS_ENTER(zfsvfs);
if (sizep != NULL)
*sizep = 0;
error = zfs_lookup(ap->a_vp, NULL, &xvp, NULL, 0, ap->a_cred, td,
LOOKUP_XATTR);
if (error != 0) {
ZFS_EXIT(zfsvfs);
/*
* ENOATTR means that the EA directory does not yet exist,
* i.e. there are no extended attributes there.
*/
if (error == ENOATTR)
error = 0;
return (error);
}
NDINIT_ATVP(&nd, LOOKUP, NOFOLLOW | LOCKLEAF | LOCKSHARED | MPSAFE,
UIO_SYSSPACE, ".", xvp, td);
error = namei(&nd);
vp = nd.ni_vp;
NDFREE(&nd, NDF_ONLY_PNBUF);
if (error != 0) {
ZFS_EXIT(zfsvfs);
return (error);
}
auio.uio_iov = &aiov;
auio.uio_iovcnt = 1;
auio.uio_segflg = UIO_SYSSPACE;
auio.uio_td = td;
auio.uio_rw = UIO_READ;
auio.uio_offset = 0;
do {
u_char nlen;
aiov.iov_base = (void *)dirbuf;
aiov.iov_len = sizeof(dirbuf);
auio.uio_resid = sizeof(dirbuf);
error = VOP_READDIR(vp, &auio, ap->a_cred, &eof, NULL, NULL);
done = sizeof(dirbuf) - auio.uio_resid;
if (error != 0)
break;
for (pos = 0; pos < done;) {
dp = (struct dirent *)(dirbuf + pos);
pos += dp->d_reclen;
/*
* XXX: Temporarily we also accept DT_UNKNOWN, as this
* is what we get when attribute was created on Solaris.
*/
if (dp->d_type != DT_REG && dp->d_type != DT_UNKNOWN)
continue;
if (plen == 0 && strncmp(dp->d_name, "freebsd:", 8) == 0)
continue;
else if (strncmp(dp->d_name, attrprefix, plen) != 0)
continue;
nlen = dp->d_namlen - plen;
if (sizep != NULL)
*sizep += 1 + nlen;
else if (uio != NULL) {
/*
* Format of extattr name entry is one byte for
* length and the rest for name.
*/
error = uiomove(&nlen, 1, uio->uio_rw, uio);
if (error == 0) {
error = uiomove(dp->d_name + plen, nlen,
uio->uio_rw, uio);
}
if (error != 0)
break;
}
}
} while (!eof && error == 0);
vput(vp);
ZFS_EXIT(zfsvfs);
return (error);
}
int
zfs_freebsd_getacl(ap)
struct vop_getacl_args /* {
struct vnode *vp;
acl_type_t type;
struct acl *aclp;
struct ucred *cred;
struct thread *td;
} */ *ap;
{
int error;
vsecattr_t vsecattr;
if (ap->a_type != ACL_TYPE_NFS4)
return (EINVAL);
vsecattr.vsa_mask = VSA_ACE | VSA_ACECNT;
if (error = zfs_getsecattr(ap->a_vp, &vsecattr, 0, ap->a_cred, NULL))
return (error);
error = acl_from_aces(ap->a_aclp, vsecattr.vsa_aclentp, vsecattr.vsa_aclcnt);
if (vsecattr.vsa_aclentp != NULL)
kmem_free(vsecattr.vsa_aclentp, vsecattr.vsa_aclentsz);
return (error);
}
int
zfs_freebsd_setacl(ap)
struct vop_setacl_args /* {
struct vnode *vp;
acl_type_t type;
struct acl *aclp;
struct ucred *cred;
struct thread *td;
} */ *ap;
{
int error;
vsecattr_t vsecattr;
int aclbsize; /* size of acl list in bytes */
aclent_t *aaclp;
if (ap->a_type != ACL_TYPE_NFS4)
return (EINVAL);
if (ap->a_aclp->acl_cnt < 1 || ap->a_aclp->acl_cnt > MAX_ACL_ENTRIES)
return (EINVAL);
/*
* With NFSv4 ACLs, chmod(2) may need to add additional entries,
* splitting every entry into two and appending "canonical six"
* entries at the end. Don't allow for setting an ACL that would
* cause chmod(2) to run out of ACL entries.
*/
if (ap->a_aclp->acl_cnt * 2 + 6 > ACL_MAX_ENTRIES)
return (ENOSPC);
error = acl_nfs4_check(ap->a_aclp, ap->a_vp->v_type == VDIR);
if (error != 0)
return (error);
vsecattr.vsa_mask = VSA_ACE;
aclbsize = ap->a_aclp->acl_cnt * sizeof(ace_t);
vsecattr.vsa_aclentp = kmem_alloc(aclbsize, KM_SLEEP);
aaclp = vsecattr.vsa_aclentp;
vsecattr.vsa_aclentsz = aclbsize;
aces_from_acl(vsecattr.vsa_aclentp, &vsecattr.vsa_aclcnt, ap->a_aclp);
error = zfs_setsecattr(ap->a_vp, &vsecattr, 0, ap->a_cred, NULL);
kmem_free(aaclp, aclbsize);
return (error);
}
int
zfs_freebsd_aclcheck(ap)
struct vop_aclcheck_args /* {
struct vnode *vp;
acl_type_t type;
struct acl *aclp;
struct ucred *cred;
struct thread *td;
} */ *ap;
{
return (EOPNOTSUPP);
}
struct vop_vector zfs_vnodeops;
struct vop_vector zfs_fifoops;
struct vop_vector zfs_vnodeops = {
.vop_default = &default_vnodeops,
.vop_inactive = zfs_freebsd_inactive,
.vop_reclaim = zfs_freebsd_reclaim,
.vop_access = zfs_freebsd_access,
#ifdef FREEBSD_NAMECACHE
.vop_lookup = vfs_cache_lookup,
.vop_cachedlookup = zfs_freebsd_lookup,
#else
.vop_lookup = zfs_freebsd_lookup,
#endif
.vop_getattr = zfs_freebsd_getattr,
.vop_setattr = zfs_freebsd_setattr,
.vop_create = zfs_freebsd_create,
.vop_mknod = zfs_freebsd_create,
.vop_mkdir = zfs_freebsd_mkdir,
.vop_readdir = zfs_freebsd_readdir,
.vop_fsync = zfs_freebsd_fsync,
.vop_open = zfs_freebsd_open,
.vop_close = zfs_freebsd_close,
.vop_rmdir = zfs_freebsd_rmdir,
.vop_ioctl = zfs_freebsd_ioctl,
.vop_link = zfs_freebsd_link,
.vop_symlink = zfs_freebsd_symlink,
.vop_readlink = zfs_freebsd_readlink,
.vop_read = zfs_freebsd_read,
.vop_write = zfs_freebsd_write,
.vop_remove = zfs_freebsd_remove,
.vop_rename = zfs_freebsd_rename,
.vop_pathconf = zfs_freebsd_pathconf,
.vop_bmap = VOP_EOPNOTSUPP,
.vop_fid = zfs_freebsd_fid,
.vop_getextattr = zfs_getextattr,
.vop_deleteextattr = zfs_deleteextattr,
.vop_setextattr = zfs_setextattr,
.vop_listextattr = zfs_listextattr,
.vop_getacl = zfs_freebsd_getacl,
.vop_setacl = zfs_freebsd_setacl,
.vop_aclcheck = zfs_freebsd_aclcheck,
};
struct vop_vector zfs_fifoops = {
.vop_default = &fifo_specops,
.vop_fsync = zfs_freebsd_fsync,
.vop_access = zfs_freebsd_access,
.vop_getattr = zfs_freebsd_getattr,
.vop_inactive = zfs_freebsd_inactive,
.vop_read = VOP_PANIC,
.vop_reclaim = zfs_freebsd_reclaim,
.vop_setattr = zfs_freebsd_setattr,
.vop_write = VOP_PANIC,
.vop_pathconf = zfs_freebsd_fifo_pathconf,
.vop_fid = zfs_freebsd_fid,
.vop_getacl = zfs_freebsd_getacl,
.vop_setacl = zfs_freebsd_setacl,
.vop_aclcheck = zfs_freebsd_aclcheck,
};
Index: stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c (revision 209274)
@@ -1,2276 +1,2276 @@
/*
* CDDL HEADER START
*
* The contents of this file are subject to the terms of the
* Common Development and Distribution License (the "License").
* You may not use this file except in compliance with the License.
*
* You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
* or http://www.opensolaris.org/os/licensing.
* See the License for the specific language governing permissions
* and limitations under the License.
*
* When distributing Covered Code, include this CDDL HEADER in each
* file and include the License file at usr/src/OPENSOLARIS.LICENSE.
* If applicable, add the following below this CDDL HEADER, with the
* fields enclosed by brackets "[]" replaced with your own identifying
* information: Portions Copyright [yyyy] [name of copyright owner]
*
* CDDL HEADER END
*/
/*
* Copyright 2008 Sun Microsystems, Inc. All rights reserved.
* Use is subject to license terms.
*/
#include <sys/zfs_context.h>
#include <sys/fm/fs/zfs.h>
#include <sys/spa.h>
#include <sys/txg.h>
#include <sys/spa_impl.h>
#include <sys/vdev_impl.h>
#include <sys/zio_impl.h>
#include <sys/zio_compress.h>
#include <sys/zio_checksum.h>
SYSCTL_DECL(_vfs_zfs);
SYSCTL_NODE(_vfs_zfs, OID_AUTO, zio, CTLFLAG_RW, 0, "ZFS ZIO");
static int zio_use_uma = 0;
TUNABLE_INT("vfs.zfs.zio.use_uma", &zio_use_uma);
SYSCTL_INT(_vfs_zfs_zio, OID_AUTO, use_uma, CTLFLAG_RDTUN, &zio_use_uma, 0,
"Use uma(9) for ZIO allocations");
/*
* ==========================================================================
* I/O priority table
* ==========================================================================
*/
uint8_t zio_priority_table[ZIO_PRIORITY_TABLE_SIZE] = {
0, /* ZIO_PRIORITY_NOW */
0, /* ZIO_PRIORITY_SYNC_READ */
0, /* ZIO_PRIORITY_SYNC_WRITE */
6, /* ZIO_PRIORITY_ASYNC_READ */
4, /* ZIO_PRIORITY_ASYNC_WRITE */
4, /* ZIO_PRIORITY_FREE */
0, /* ZIO_PRIORITY_CACHE_FILL */
0, /* ZIO_PRIORITY_LOG_WRITE */
10, /* ZIO_PRIORITY_RESILVER */
20, /* ZIO_PRIORITY_SCRUB */
};
/*
* ==========================================================================
* I/O type descriptions
* ==========================================================================
*/
char *zio_type_name[ZIO_TYPES] = {
"null", "read", "write", "free", "claim", "ioctl" };
#define SYNC_PASS_DEFERRED_FREE 1 /* defer frees after this pass */
#define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
#define SYNC_PASS_REWRITE 1 /* rewrite new bps after this pass */
/*
* ==========================================================================
* I/O kmem caches
* ==========================================================================
*/
kmem_cache_t *zio_cache;
kmem_cache_t *zio_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
kmem_cache_t *zio_data_buf_cache[SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT];
#ifdef _KERNEL
extern vmem_t *zio_alloc_arena;
#endif
/*
* An allocating zio is one that either currently has the DVA allocate
* stage set or will have it later in its lifetime.
*/
#define IO_IS_ALLOCATING(zio) \
((zio)->io_orig_pipeline & (1U << ZIO_STAGE_DVA_ALLOCATE))
void
zio_init(void)
{
size_t c;
zio_cache = kmem_cache_create("zio_cache", sizeof (zio_t), 0,
NULL, NULL, NULL, NULL, NULL, 0);
/*
* For small buffers, we want a cache for each multiple of
* SPA_MINBLOCKSIZE. For medium-size buffers, we want a cache
* for each quarter-power of 2. For large buffers, we want
* a cache for each multiple of PAGESIZE.
*/
for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
size_t size = (c + 1) << SPA_MINBLOCKSHIFT;
size_t p2 = size;
size_t align = 0;
while (p2 & (p2 - 1))
p2 &= p2 - 1;
if (size <= 4 * SPA_MINBLOCKSIZE) {
align = SPA_MINBLOCKSIZE;
} else if (P2PHASE(size, PAGESIZE) == 0) {
align = PAGESIZE;
} else if (P2PHASE(size, p2 >> 2) == 0) {
align = p2 >> 2;
}
if (align != 0) {
char name[36];
(void) sprintf(name, "zio_buf_%lu", (ulong_t)size);
zio_buf_cache[c] = kmem_cache_create(name, size,
align, NULL, NULL, NULL, NULL, NULL, KMC_NODEBUG);
(void) sprintf(name, "zio_data_buf_%lu", (ulong_t)size);
zio_data_buf_cache[c] = kmem_cache_create(name, size,
align, NULL, NULL, NULL, NULL, NULL, KMC_NODEBUG);
}
}
while (--c != 0) {
ASSERT(zio_buf_cache[c] != NULL);
if (zio_buf_cache[c - 1] == NULL)
zio_buf_cache[c - 1] = zio_buf_cache[c];
ASSERT(zio_data_buf_cache[c] != NULL);
if (zio_data_buf_cache[c - 1] == NULL)
zio_data_buf_cache[c - 1] = zio_data_buf_cache[c];
}
zio_inject_init();
}
void
zio_fini(void)
{
size_t c;
kmem_cache_t *last_cache = NULL;
kmem_cache_t *last_data_cache = NULL;
for (c = 0; c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT; c++) {
if (zio_buf_cache[c] != last_cache) {
last_cache = zio_buf_cache[c];
kmem_cache_destroy(zio_buf_cache[c]);
}
zio_buf_cache[c] = NULL;
if (zio_data_buf_cache[c] != last_data_cache) {
last_data_cache = zio_data_buf_cache[c];
kmem_cache_destroy(zio_data_buf_cache[c]);
}
zio_data_buf_cache[c] = NULL;
}
kmem_cache_destroy(zio_cache);
zio_inject_fini();
}
/*
* ==========================================================================
* Allocate and free I/O buffers
* ==========================================================================
*/
/*
* Use zio_buf_alloc to allocate ZFS metadata. This data will appear in a
* crashdump if the kernel panics, so use it judiciously. Obviously, it's
* useful to inspect ZFS metadata, but if possible, we should avoid keeping
* excess / transient data in-core during a crashdump.
*/
void *
zio_buf_alloc(size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
return (kmem_cache_alloc(zio_buf_cache[c], KM_PUSHPAGE));
else
return (kmem_alloc(size, KM_SLEEP));
}
/*
* Use zio_data_buf_alloc to allocate data. The data will not appear in a
* crashdump if the kernel panics. This exists so that we will limit the amount
* of ZFS data that shows up in a kernel crashdump. (Thus reducing the amount
* of kernel heap dumped to disk when the kernel panics)
*/
void *
zio_data_buf_alloc(size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
return (kmem_cache_alloc(zio_data_buf_cache[c], KM_PUSHPAGE));
else
return (kmem_alloc(size, KM_SLEEP));
}
void
zio_buf_free(void *buf, size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
kmem_cache_free(zio_buf_cache[c], buf);
else
kmem_free(buf, size);
}
void
zio_data_buf_free(void *buf, size_t size)
{
size_t c = (size - 1) >> SPA_MINBLOCKSHIFT;
ASSERT(c < SPA_MAXBLOCKSIZE >> SPA_MINBLOCKSHIFT);
if (zio_use_uma)
kmem_cache_free(zio_data_buf_cache[c], buf);
else
kmem_free(buf, size);
}
/*
* ==========================================================================
* Push and pop I/O transform buffers
* ==========================================================================
*/
static void
zio_push_transform(zio_t *zio, void *data, uint64_t size, uint64_t bufsize,
zio_transform_func_t *transform)
{
zio_transform_t *zt = kmem_alloc(sizeof (zio_transform_t), KM_SLEEP);
zt->zt_orig_data = zio->io_data;
zt->zt_orig_size = zio->io_size;
zt->zt_bufsize = bufsize;
zt->zt_transform = transform;
zt->zt_next = zio->io_transform_stack;
zio->io_transform_stack = zt;
zio->io_data = data;
zio->io_size = size;
}
static void
zio_pop_transforms(zio_t *zio)
{
zio_transform_t *zt;
while ((zt = zio->io_transform_stack) != NULL) {
if (zt->zt_transform != NULL)
zt->zt_transform(zio,
zt->zt_orig_data, zt->zt_orig_size);
zio_buf_free(zio->io_data, zt->zt_bufsize);
zio->io_data = zt->zt_orig_data;
zio->io_size = zt->zt_orig_size;
zio->io_transform_stack = zt->zt_next;
kmem_free(zt, sizeof (zio_transform_t));
}
}
/*
* ==========================================================================
* I/O transform callbacks for subblocks and decompression
* ==========================================================================
*/
static void
zio_subblock(zio_t *zio, void *data, uint64_t size)
{
ASSERT(zio->io_size > size);
if (zio->io_type == ZIO_TYPE_READ)
bcopy(zio->io_data, data, size);
}
static void
zio_decompress(zio_t *zio, void *data, uint64_t size)
{
if (zio->io_error == 0 &&
zio_decompress_data(BP_GET_COMPRESS(zio->io_bp),
zio->io_data, zio->io_size, data, size) != 0)
zio->io_error = EIO;
}
/*
* ==========================================================================
* I/O parent/child relationships and pipeline interlocks
* ==========================================================================
*/
static void
zio_add_child(zio_t *pio, zio_t *zio)
{
mutex_enter(&pio->io_lock);
if (zio->io_stage < ZIO_STAGE_READY)
pio->io_children[zio->io_child_type][ZIO_WAIT_READY]++;
if (zio->io_stage < ZIO_STAGE_DONE)
pio->io_children[zio->io_child_type][ZIO_WAIT_DONE]++;
zio->io_sibling_prev = NULL;
zio->io_sibling_next = pio->io_child;
if (pio->io_child != NULL)
pio->io_child->io_sibling_prev = zio;
pio->io_child = zio;
zio->io_parent = pio;
mutex_exit(&pio->io_lock);
}
static void
zio_remove_child(zio_t *pio, zio_t *zio)
{
zio_t *next, *prev;
ASSERT(zio->io_parent == pio);
mutex_enter(&pio->io_lock);
next = zio->io_sibling_next;
prev = zio->io_sibling_prev;
if (next != NULL)
next->io_sibling_prev = prev;
if (prev != NULL)
prev->io_sibling_next = next;
if (pio->io_child == zio)
pio->io_child = next;
mutex_exit(&pio->io_lock);
}
static boolean_t
zio_wait_for_children(zio_t *zio, enum zio_child child, enum zio_wait_type wait)
{
uint64_t *countp = &zio->io_children[child][wait];
boolean_t waiting = B_FALSE;
mutex_enter(&zio->io_lock);
ASSERT(zio->io_stall == NULL);
if (*countp != 0) {
zio->io_stage--;
zio->io_stall = countp;
waiting = B_TRUE;
}
mutex_exit(&zio->io_lock);
return (waiting);
}
static void
zio_notify_parent(zio_t *pio, zio_t *zio, enum zio_wait_type wait)
{
uint64_t *countp = &pio->io_children[zio->io_child_type][wait];
int *errorp = &pio->io_child_error[zio->io_child_type];
mutex_enter(&pio->io_lock);
if (zio->io_error && !(zio->io_flags & ZIO_FLAG_DONT_PROPAGATE))
*errorp = zio_worst_error(*errorp, zio->io_error);
pio->io_reexecute |= zio->io_reexecute;
ASSERT3U(*countp, >, 0);
if (--*countp == 0 && pio->io_stall == countp) {
pio->io_stall = NULL;
mutex_exit(&pio->io_lock);
zio_execute(pio);
} else {
mutex_exit(&pio->io_lock);
}
}
static void
zio_inherit_child_errors(zio_t *zio, enum zio_child c)
{
if (zio->io_child_error[c] != 0 && zio->io_error == 0)
zio->io_error = zio->io_child_error[c];
}
/*
* ==========================================================================
* Create the various types of I/O (read, write, free, etc)
* ==========================================================================
*/
static zio_t *
zio_create(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
void *data, uint64_t size, zio_done_func_t *done, void *private,
zio_type_t type, int priority, int flags, vdev_t *vd, uint64_t offset,
const zbookmark_t *zb, uint8_t stage, uint32_t pipeline)
{
zio_t *zio;
ASSERT3U(size, <=, SPA_MAXBLOCKSIZE);
ASSERT(P2PHASE(size, SPA_MINBLOCKSIZE) == 0);
ASSERT(P2PHASE(offset, SPA_MINBLOCKSIZE) == 0);
ASSERT(!vd || spa_config_held(spa, SCL_STATE_ALL, RW_READER));
ASSERT(!bp || !(flags & ZIO_FLAG_CONFIG_WRITER));
ASSERT(vd || stage == ZIO_STAGE_OPEN);
zio = kmem_cache_alloc(zio_cache, KM_SLEEP);
bzero(zio, sizeof (zio_t));
mutex_init(&zio->io_lock, NULL, MUTEX_DEFAULT, NULL);
cv_init(&zio->io_cv, NULL, CV_DEFAULT, NULL);
if (vd != NULL)
zio->io_child_type = ZIO_CHILD_VDEV;
else if (flags & ZIO_FLAG_GANG_CHILD)
zio->io_child_type = ZIO_CHILD_GANG;
else
zio->io_child_type = ZIO_CHILD_LOGICAL;
if (bp != NULL) {
zio->io_bp = bp;
zio->io_bp_copy = *bp;
zio->io_bp_orig = *bp;
if (type != ZIO_TYPE_WRITE)
zio->io_bp = &zio->io_bp_copy; /* so caller can free */
if (zio->io_child_type == ZIO_CHILD_LOGICAL) {
if (BP_IS_GANG(bp))
pipeline |= ZIO_GANG_STAGES;
zio->io_logical = zio;
}
}
zio->io_spa = spa;
zio->io_txg = txg;
zio->io_data = data;
zio->io_size = size;
zio->io_done = done;
zio->io_private = private;
zio->io_type = type;
zio->io_priority = priority;
zio->io_vd = vd;
zio->io_offset = offset;
zio->io_orig_flags = zio->io_flags = flags;
zio->io_orig_stage = zio->io_stage = stage;
zio->io_orig_pipeline = zio->io_pipeline = pipeline;
if (zb != NULL)
zio->io_bookmark = *zb;
if (pio != NULL) {
/*
* Logical I/Os can have logical, gang, or vdev children.
* Gang I/Os can have gang or vdev children.
* Vdev I/Os can only have vdev children.
* The following ASSERT captures all of these constraints.
*/
ASSERT(zio->io_child_type <= pio->io_child_type);
if (zio->io_logical == NULL)
zio->io_logical = pio->io_logical;
zio_add_child(pio, zio);
}
return (zio);
}
static void
zio_destroy(zio_t *zio)
{
spa_t *spa = zio->io_spa;
uint8_t async_root = zio->io_async_root;
mutex_destroy(&zio->io_lock);
cv_destroy(&zio->io_cv);
kmem_cache_free(zio_cache, zio);
if (async_root) {
mutex_enter(&spa->spa_async_root_lock);
if (--spa->spa_async_root_count == 0)
cv_broadcast(&spa->spa_async_root_cv);
mutex_exit(&spa->spa_async_root_lock);
}
}
zio_t *
zio_null(zio_t *pio, spa_t *spa, zio_done_func_t *done, void *private,
int flags)
{
zio_t *zio;
zio = zio_create(pio, spa, 0, NULL, NULL, 0, done, private,
ZIO_TYPE_NULL, ZIO_PRIORITY_NOW, flags, NULL, 0, NULL,
ZIO_STAGE_OPEN, ZIO_INTERLOCK_PIPELINE);
return (zio);
}
zio_t *
zio_root(spa_t *spa, zio_done_func_t *done, void *private, int flags)
{
return (zio_null(NULL, spa, done, private, flags));
}
zio_t *
zio_read(zio_t *pio, spa_t *spa, const blkptr_t *bp,
void *data, uint64_t size, zio_done_func_t *done, void *private,
int priority, int flags, const zbookmark_t *zb)
{
zio_t *zio;
zio = zio_create(pio, spa, bp->blk_birth, (blkptr_t *)bp,
data, size, done, private,
ZIO_TYPE_READ, priority, flags, NULL, 0, zb,
ZIO_STAGE_OPEN, ZIO_READ_PIPELINE);
return (zio);
}
zio_t *
zio_write(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
void *data, uint64_t size, zio_prop_t *zp,
zio_done_func_t *ready, zio_done_func_t *done, void *private,
int priority, int flags, const zbookmark_t *zb)
{
zio_t *zio;
ASSERT(zp->zp_checksum >= ZIO_CHECKSUM_OFF &&
zp->zp_checksum < ZIO_CHECKSUM_FUNCTIONS &&
zp->zp_compress >= ZIO_COMPRESS_OFF &&
zp->zp_compress < ZIO_COMPRESS_FUNCTIONS &&
zp->zp_type < DMU_OT_NUMTYPES &&
zp->zp_level < 32 &&
zp->zp_ndvas > 0 &&
zp->zp_ndvas <= spa_max_replication(spa));
ASSERT(ready != NULL);
zio = zio_create(pio, spa, txg, bp, data, size, done, private,
ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
ZIO_STAGE_OPEN, ZIO_WRITE_PIPELINE);
zio->io_ready = ready;
zio->io_prop = *zp;
return (zio);
}
zio_t *
zio_rewrite(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp, void *data,
uint64_t size, zio_done_func_t *done, void *private, int priority,
int flags, zbookmark_t *zb)
{
zio_t *zio;
zio = zio_create(pio, spa, txg, bp, data, size, done, private,
ZIO_TYPE_WRITE, priority, flags, NULL, 0, zb,
ZIO_STAGE_OPEN, ZIO_REWRITE_PIPELINE);
return (zio);
}
zio_t *
zio_free(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
zio_done_func_t *done, void *private, int flags)
{
zio_t *zio;
ASSERT(!BP_IS_HOLE(bp));
if (bp->blk_fill == BLK_FILL_ALREADY_FREED)
return (zio_null(pio, spa, NULL, NULL, flags));
if (txg == spa->spa_syncing_txg &&
spa_sync_pass(spa) > SYNC_PASS_DEFERRED_FREE) {
bplist_enqueue_deferred(&spa->spa_sync_bplist, bp);
return (zio_null(pio, spa, NULL, NULL, flags));
}
zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
done, private, ZIO_TYPE_FREE, ZIO_PRIORITY_FREE, flags,
NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_FREE_PIPELINE);
return (zio);
}
zio_t *
zio_claim(zio_t *pio, spa_t *spa, uint64_t txg, blkptr_t *bp,
zio_done_func_t *done, void *private, int flags)
{
zio_t *zio;
/*
* A claim is an allocation of a specific block. Claims are needed
* to support immediate writes in the intent log. The issue is that
* immediate writes contain committed data, but in a txg that was
* *not* committed. Upon opening the pool after an unclean shutdown,
* the intent log claims all blocks that contain immediate write data
* so that the SPA knows they're in use.
*
* All claims *must* be resolved in the first txg -- before the SPA
* starts allocating blocks -- so that nothing is allocated twice.
*/
ASSERT3U(spa->spa_uberblock.ub_rootbp.blk_birth, <, spa_first_txg(spa));
ASSERT3U(spa_first_txg(spa), <=, txg);
zio = zio_create(pio, spa, txg, bp, NULL, BP_GET_PSIZE(bp),
done, private, ZIO_TYPE_CLAIM, ZIO_PRIORITY_NOW, flags,
NULL, 0, NULL, ZIO_STAGE_OPEN, ZIO_CLAIM_PIPELINE);
return (zio);
}
zio_t *
zio_ioctl(zio_t *pio, spa_t *spa, vdev_t *vd, int cmd,
zio_done_func_t *done, void *private, int priority, int flags)
{
zio_t *zio;
int c;
if (vd->vdev_children == 0) {
zio = zio_create(pio, spa, 0, NULL, NULL, 0, done, private,
ZIO_TYPE_IOCTL, priority, flags, vd, 0, NULL,
ZIO_STAGE_OPEN, ZIO_IOCTL_PIPELINE);
zio->io_cmd = cmd;
} else {
zio = zio_null(pio, spa, NULL, NULL, flags);
for (c = 0; c < vd->vdev_children; c++)
zio_nowait(zio_ioctl(zio, spa, vd->vdev_child[c], cmd,
done, private, priority, flags));
}
return (zio);
}
zio_t *
zio_read_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
void *data, int checksum, zio_done_func_t *done, void *private,
int priority, int flags, boolean_t labels)
{
zio_t *zio;
ASSERT(vd->vdev_children == 0);
ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
ASSERT3U(offset + size, <=, vd->vdev_psize);
zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, done, private,
ZIO_TYPE_READ, priority, flags, vd, offset, NULL,
ZIO_STAGE_OPEN, ZIO_READ_PHYS_PIPELINE);
zio->io_prop.zp_checksum = checksum;
return (zio);
}
zio_t *
zio_write_phys(zio_t *pio, vdev_t *vd, uint64_t offset, uint64_t size,
void *data, int checksum, zio_done_func_t *done, void *private,
int priority, int flags, boolean_t labels)
{
zio_t *zio;
ASSERT(vd->vdev_children == 0);
ASSERT(!labels || offset + size <= VDEV_LABEL_START_SIZE ||
offset >= vd->vdev_psize - VDEV_LABEL_END_SIZE);
ASSERT3U(offset + size, <=, vd->vdev_psize);
zio = zio_create(pio, vd->vdev_spa, 0, NULL, data, size, done, private,
ZIO_TYPE_WRITE, priority, flags, vd, offset, NULL,
ZIO_STAGE_OPEN, ZIO_WRITE_PHYS_PIPELINE);
zio->io_prop.zp_checksum = checksum;
if (zio_checksum_table[checksum].ci_zbt) {
/*
* zbt checksums are necessarily destructive -- they modify
* the end of the write buffer to hold the verifier/checksum.
* Therefore, we must make a local copy in case the data is
* being written to multiple places in parallel.
*/
void *wbuf = zio_buf_alloc(size);
bcopy(data, wbuf, size);
zio_push_transform(zio, wbuf, size, size, NULL);
}
return (zio);
}
/*
* Create a child I/O to do some work for us.
*/
zio_t *
zio_vdev_child_io(zio_t *pio, blkptr_t *bp, vdev_t *vd, uint64_t offset,
void *data, uint64_t size, int type, int priority, int flags,
zio_done_func_t *done, void *private)
{
uint32_t pipeline = ZIO_VDEV_CHILD_PIPELINE;
zio_t *zio;
ASSERT(vd->vdev_parent ==
(pio->io_vd ? pio->io_vd : pio->io_spa->spa_root_vdev));
if (type == ZIO_TYPE_READ && bp != NULL) {
/*
* If we have the bp, then the child should perform the
* checksum and the parent need not. This pushes error
* detection as close to the leaves as possible and
* eliminates redundant checksums in the interior nodes.
*/
pipeline |= 1U << ZIO_STAGE_CHECKSUM_VERIFY;
pio->io_pipeline &= ~(1U << ZIO_STAGE_CHECKSUM_VERIFY);
}
if (vd->vdev_children == 0)
offset += VDEV_LABEL_START_SIZE;
zio = zio_create(pio, pio->io_spa, pio->io_txg, bp, data, size,
done, private, type, priority,
(pio->io_flags & ZIO_FLAG_VDEV_INHERIT) |
ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | flags,
vd, offset, &pio->io_bookmark,
ZIO_STAGE_VDEV_IO_START - 1, pipeline);
return (zio);
}
zio_t *
zio_vdev_delegated_io(vdev_t *vd, uint64_t offset, void *data, uint64_t size,
int type, int priority, int flags, zio_done_func_t *done, void *private)
{
zio_t *zio;
ASSERT(vd->vdev_ops->vdev_op_leaf);
zio = zio_create(NULL, vd->vdev_spa, 0, NULL,
data, size, done, private, type, priority,
flags | ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_RETRY,
vd, offset, NULL,
ZIO_STAGE_VDEV_IO_START - 1, ZIO_VDEV_CHILD_PIPELINE);
return (zio);
}
void
zio_flush(zio_t *zio, vdev_t *vd)
{
zio_nowait(zio_ioctl(zio, zio->io_spa, vd, DKIOCFLUSHWRITECACHE,
NULL, NULL, ZIO_PRIORITY_NOW,
ZIO_FLAG_CANFAIL | ZIO_FLAG_DONT_PROPAGATE | ZIO_FLAG_DONT_RETRY));
}
/*
* ==========================================================================
* Prepare to read and write logical blocks
* ==========================================================================
*/
static int
zio_read_bp_init(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
if (BP_GET_COMPRESS(bp) != ZIO_COMPRESS_OFF && zio->io_logical == zio) {
uint64_t csize = BP_GET_PSIZE(bp);
void *cbuf = zio_buf_alloc(csize);
zio_push_transform(zio, cbuf, csize, csize, zio_decompress);
}
if (!dmu_ot[BP_GET_TYPE(bp)].ot_metadata && BP_GET_LEVEL(bp) == 0)
zio->io_flags |= ZIO_FLAG_DONT_CACHE;
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_write_bp_init(zio_t *zio)
{
zio_prop_t *zp = &zio->io_prop;
int compress = zp->zp_compress;
blkptr_t *bp = zio->io_bp;
void *cbuf;
uint64_t lsize = zio->io_size;
uint64_t csize = lsize;
uint64_t cbufsize = 0;
int pass = 1;
/*
* If our children haven't all reached the ready stage,
* wait for them and then repeat this pipeline stage.
*/
if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY) ||
zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_READY))
return (ZIO_PIPELINE_STOP);
if (!IO_IS_ALLOCATING(zio))
return (ZIO_PIPELINE_CONTINUE);
ASSERT(compress != ZIO_COMPRESS_INHERIT);
if (bp->blk_birth == zio->io_txg) {
/*
* We're rewriting an existing block, which means we're
* working on behalf of spa_sync(). For spa_sync() to
* converge, it must eventually be the case that we don't
* have to allocate new blocks. But compression changes
* the blocksize, which forces a reallocate, and makes
* convergence take longer. Therefore, after the first
* few passes, stop compressing to ensure convergence.
*/
pass = spa_sync_pass(zio->io_spa);
ASSERT(pass > 1);
if (pass > SYNC_PASS_DONT_COMPRESS)
compress = ZIO_COMPRESS_OFF;
/*
* Only MOS (objset 0) data should need to be rewritten.
*/
ASSERT(zio->io_logical->io_bookmark.zb_objset == 0);
/* Make sure someone doesn't change their mind on overwrites */
ASSERT(MIN(zp->zp_ndvas + BP_IS_GANG(bp),
spa_max_replication(zio->io_spa)) == BP_GET_NDVAS(bp));
}
if (compress != ZIO_COMPRESS_OFF) {
if (!zio_compress_data(compress, zio->io_data, zio->io_size,
&cbuf, &csize, &cbufsize)) {
compress = ZIO_COMPRESS_OFF;
} else if (csize != 0) {
zio_push_transform(zio, cbuf, csize, cbufsize, NULL);
}
}
/*
* The final pass of spa_sync() must be all rewrites, but the first
* few passes offer a trade-off: allocating blocks defers convergence,
* but newly allocated blocks are sequential, so they can be written
* to disk faster. Therefore, we allow the first few passes of
* spa_sync() to allocate new blocks, but force rewrites after that.
* There should only be a handful of blocks after pass 1 in any case.
*/
if (bp->blk_birth == zio->io_txg && BP_GET_PSIZE(bp) == csize &&
pass > SYNC_PASS_REWRITE) {
ASSERT(csize != 0);
uint32_t gang_stages = zio->io_pipeline & ZIO_GANG_STAGES;
zio->io_pipeline = ZIO_REWRITE_PIPELINE | gang_stages;
zio->io_flags |= ZIO_FLAG_IO_REWRITE;
} else {
BP_ZERO(bp);
zio->io_pipeline = ZIO_WRITE_PIPELINE;
}
if (csize == 0) {
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
} else {
ASSERT(zp->zp_checksum != ZIO_CHECKSUM_GANG_HEADER);
BP_SET_LSIZE(bp, lsize);
BP_SET_PSIZE(bp, csize);
BP_SET_COMPRESS(bp, compress);
BP_SET_CHECKSUM(bp, zp->zp_checksum);
BP_SET_TYPE(bp, zp->zp_type);
BP_SET_LEVEL(bp, zp->zp_level);
BP_SET_BYTEORDER(bp, ZFS_HOST_BYTEORDER);
}
return (ZIO_PIPELINE_CONTINUE);
}
/*
* ==========================================================================
* Execute the I/O pipeline
* ==========================================================================
*/
static void
zio_taskq_dispatch(zio_t *zio, enum zio_taskq_type q)
{
zio_type_t t = zio->io_type;
/*
- * If we're a config writer, the normal issue and interrupt threads
- * may all be blocked waiting for the config lock. In this case,
- * select the otherwise-unused taskq for ZIO_TYPE_NULL.
+ * If we're a config writer or a probe, the normal issue and
+ * interrupt threads may all be blocked waiting for the config lock.
+ * In this case, select the otherwise-unused taskq for ZIO_TYPE_NULL.
*/
- if (zio->io_flags & ZIO_FLAG_CONFIG_WRITER)
+ if (zio->io_flags & (ZIO_FLAG_CONFIG_WRITER | ZIO_FLAG_PROBE))
t = ZIO_TYPE_NULL;
/*
* A similar issue exists for the L2ARC write thread until L2ARC 2.0.
*/
if (t == ZIO_TYPE_WRITE && zio->io_vd && zio->io_vd->vdev_aux)
t = ZIO_TYPE_NULL;
(void) taskq_dispatch_safe(zio->io_spa->spa_zio_taskq[t][q],
(task_func_t *)zio_execute, zio, &zio->io_task);
}
static boolean_t
zio_taskq_member(zio_t *zio, enum zio_taskq_type q)
{
kthread_t *executor = zio->io_executor;
spa_t *spa = zio->io_spa;
for (zio_type_t t = 0; t < ZIO_TYPES; t++)
if (taskq_member(spa->spa_zio_taskq[t][q], executor))
return (B_TRUE);
return (B_FALSE);
}
static int
zio_issue_async(zio_t *zio)
{
zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
return (ZIO_PIPELINE_STOP);
}
void
zio_interrupt(zio_t *zio)
{
zio_taskq_dispatch(zio, ZIO_TASKQ_INTERRUPT);
}
/*
* Execute the I/O pipeline until one of the following occurs:
* (1) the I/O completes; (2) the pipeline stalls waiting for
* dependent child I/Os; (3) the I/O issues, so we're waiting
* for an I/O completion interrupt; (4) the I/O is delegated by
* vdev-level caching or aggregation; (5) the I/O is deferred
* due to vdev-level queueing; (6) the I/O is handed off to
* another thread. In all cases, the pipeline stops whenever
* there's no CPU work; it never burns a thread in cv_wait().
*
* There's no locking on io_stage because there's no legitimate way
* for multiple threads to be attempting to process the same I/O.
*/
static zio_pipe_stage_t *zio_pipeline[ZIO_STAGES];
void
zio_execute(zio_t *zio)
{
zio->io_executor = curthread;
while (zio->io_stage < ZIO_STAGE_DONE) {
uint32_t pipeline = zio->io_pipeline;
zio_stage_t stage = zio->io_stage;
int rv;
ASSERT(!MUTEX_HELD(&zio->io_lock));
while (((1U << ++stage) & pipeline) == 0)
continue;
ASSERT(stage <= ZIO_STAGE_DONE);
ASSERT(zio->io_stall == NULL);
/*
* If we are in interrupt context and this pipeline stage
* will grab a config lock that is held across I/O,
* issue async to avoid deadlock.
*/
if (((1U << stage) & ZIO_CONFIG_LOCK_BLOCKING_STAGES) &&
zio->io_vd == NULL &&
zio_taskq_member(zio, ZIO_TASKQ_INTERRUPT)) {
zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
return;
}
zio->io_stage = stage;
rv = zio_pipeline[stage](zio);
if (rv == ZIO_PIPELINE_STOP)
return;
ASSERT(rv == ZIO_PIPELINE_CONTINUE);
}
}
/*
* ==========================================================================
* Initiate I/O, either sync or async
* ==========================================================================
*/
int
zio_wait(zio_t *zio)
{
int error;
ASSERT(zio->io_stage == ZIO_STAGE_OPEN);
ASSERT(zio->io_executor == NULL);
zio->io_waiter = curthread;
zio_execute(zio);
mutex_enter(&zio->io_lock);
while (zio->io_executor != NULL)
cv_wait(&zio->io_cv, &zio->io_lock);
mutex_exit(&zio->io_lock);
error = zio->io_error;
zio_destroy(zio);
return (error);
}
void
zio_nowait(zio_t *zio)
{
ASSERT(zio->io_executor == NULL);
if (zio->io_parent == NULL && zio->io_child_type == ZIO_CHILD_LOGICAL) {
/*
* This is a logical async I/O with no parent to wait for it.
* Attach it to the pool's global async root zio so that
* spa_unload() has a way of waiting for async I/O to finish.
*/
spa_t *spa = zio->io_spa;
zio->io_async_root = B_TRUE;
mutex_enter(&spa->spa_async_root_lock);
spa->spa_async_root_count++;
mutex_exit(&spa->spa_async_root_lock);
}
zio_execute(zio);
}
/*
* ==========================================================================
* Reexecute or suspend/resume failed I/O
* ==========================================================================
*/
static void
zio_reexecute(zio_t *pio)
{
zio_t *zio, *zio_next;
pio->io_flags = pio->io_orig_flags;
pio->io_stage = pio->io_orig_stage;
pio->io_pipeline = pio->io_orig_pipeline;
pio->io_reexecute = 0;
pio->io_error = 0;
for (int c = 0; c < ZIO_CHILD_TYPES; c++)
pio->io_child_error[c] = 0;
if (IO_IS_ALLOCATING(pio)) {
/*
* Remember the failed bp so that the io_ready() callback
* can update its accounting upon reexecution. The block
* was already freed in zio_done(); we indicate this with
* a fill count of -1 so that zio_free() knows to skip it.
*/
blkptr_t *bp = pio->io_bp;
ASSERT(bp->blk_birth == 0 || bp->blk_birth == pio->io_txg);
bp->blk_fill = BLK_FILL_ALREADY_FREED;
pio->io_bp_orig = *bp;
BP_ZERO(bp);
}
/*
* As we reexecute pio's children, new children could be created.
* New children go to the head of the io_child list, however,
* so we will (correctly) not reexecute them. The key is that
* the remainder of the io_child list, from 'zio_next' onward,
* cannot be affected by any side effects of reexecuting 'zio'.
*/
for (zio = pio->io_child; zio != NULL; zio = zio_next) {
zio_next = zio->io_sibling_next;
mutex_enter(&pio->io_lock);
pio->io_children[zio->io_child_type][ZIO_WAIT_READY]++;
pio->io_children[zio->io_child_type][ZIO_WAIT_DONE]++;
mutex_exit(&pio->io_lock);
zio_reexecute(zio);
}
/*
* Now that all children have been reexecuted, execute the parent.
*/
zio_execute(pio);
}
void
zio_suspend(spa_t *spa, zio_t *zio)
{
if (spa_get_failmode(spa) == ZIO_FAILURE_MODE_PANIC)
fm_panic("Pool '%s' has encountered an uncorrectable I/O "
"failure and the failure mode property for this pool "
"is set to panic.", spa_name(spa));
zfs_ereport_post(FM_EREPORT_ZFS_IO_FAILURE, spa, NULL, NULL, 0, 0);
mutex_enter(&spa->spa_suspend_lock);
if (spa->spa_suspend_zio_root == NULL)
spa->spa_suspend_zio_root = zio_root(spa, NULL, NULL, 0);
spa->spa_suspended = B_TRUE;
if (zio != NULL) {
ASSERT(zio != spa->spa_suspend_zio_root);
ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
ASSERT(zio->io_parent == NULL);
ASSERT(zio->io_stage == ZIO_STAGE_DONE);
zio_add_child(spa->spa_suspend_zio_root, zio);
}
mutex_exit(&spa->spa_suspend_lock);
}
void
zio_resume(spa_t *spa)
{
zio_t *pio, *zio;
/*
* Reexecute all previously suspended i/o.
*/
mutex_enter(&spa->spa_suspend_lock);
spa->spa_suspended = B_FALSE;
cv_broadcast(&spa->spa_suspend_cv);
pio = spa->spa_suspend_zio_root;
spa->spa_suspend_zio_root = NULL;
mutex_exit(&spa->spa_suspend_lock);
if (pio == NULL)
return;
while ((zio = pio->io_child) != NULL) {
zio_remove_child(pio, zio);
zio->io_parent = NULL;
zio_reexecute(zio);
}
ASSERT(pio->io_children[ZIO_CHILD_LOGICAL][ZIO_WAIT_DONE] == 0);
(void) zio_wait(pio);
}
void
zio_resume_wait(spa_t *spa)
{
mutex_enter(&spa->spa_suspend_lock);
while (spa_suspended(spa))
cv_wait(&spa->spa_suspend_cv, &spa->spa_suspend_lock);
mutex_exit(&spa->spa_suspend_lock);
}
/*
* ==========================================================================
* Gang blocks.
*
* A gang block is a collection of small blocks that looks to the DMU
* like one large block. When zio_dva_allocate() cannot find a block
* of the requested size, due to either severe fragmentation or the pool
* being nearly full, it calls zio_write_gang_block() to construct the
* block from smaller fragments.
*
* A gang block consists of a gang header (zio_gbh_phys_t) and up to
* three (SPA_GBH_NBLKPTRS) gang members. The gang header is just like
* an indirect block: it's an array of block pointers. It consumes
* only one sector and hence is allocatable regardless of fragmentation.
* The gang header's bps point to its gang members, which hold the data.
*
* Gang blocks are self-checksumming, using the bp's <vdev, offset, txg>
* as the verifier to ensure uniqueness of the SHA256 checksum.
* Critically, the gang block bp's blk_cksum is the checksum of the data,
* not the gang header. This ensures that data block signatures (needed for
* deduplication) are independent of how the block is physically stored.
*
* Gang blocks can be nested: a gang member may itself be a gang block.
* Thus every gang block is a tree in which root and all interior nodes are
* gang headers, and the leaves are normal blocks that contain user data.
* The root of the gang tree is called the gang leader.
*
* To perform any operation (read, rewrite, free, claim) on a gang block,
* zio_gang_assemble() first assembles the gang tree (minus data leaves)
* in the io_gang_tree field of the original logical i/o by recursively
* reading the gang leader and all gang headers below it. This yields
* an in-core tree containing the contents of every gang header and the
* bps for every constituent of the gang block.
*
* With the gang tree now assembled, zio_gang_issue() just walks the gang tree
* and invokes a callback on each bp. To free a gang block, zio_gang_issue()
* calls zio_free_gang() -- a trivial wrapper around zio_free() -- for each bp.
* zio_claim_gang() provides a similarly trivial wrapper for zio_claim().
* zio_read_gang() is a wrapper around zio_read() that omits reading gang
* headers, since we already have those in io_gang_tree. zio_rewrite_gang()
* performs a zio_rewrite() of the data or, for gang headers, a zio_rewrite()
* of the gang header plus zio_checksum_compute() of the data to update the
* gang header's blk_cksum as described above.
*
* The two-phase assemble/issue model solves the problem of partial failure --
* what if you'd freed part of a gang block but then couldn't read the
* gang header for another part? Assembling the entire gang tree first
* ensures that all the necessary gang header I/O has succeeded before
* starting the actual work of free, claim, or write. Once the gang tree
* is assembled, free and claim are in-memory operations that cannot fail.
*
* In the event that a gang write fails, zio_dva_unallocate() walks the
* gang tree to immediately free (i.e. insert back into the space map)
* everything we've allocated. This ensures that we don't get ENOSPC
* errors during repeated suspend/resume cycles due to a flaky device.
*
* Gang rewrites only happen during sync-to-convergence. If we can't assemble
* the gang tree, we won't modify the block, so we can safely defer the free
* (knowing that the block is still intact). If we *can* assemble the gang
* tree, then even if some of the rewrites fail, zio_dva_unallocate() will free
* each constituent bp and we can allocate a new block on the next sync pass.
*
* In all cases, the gang tree allows complete recovery from partial failure.
* ==========================================================================
*/
static zio_t *
zio_read_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
if (gn != NULL)
return (pio);
return (zio_read(pio, pio->io_spa, bp, data, BP_GET_PSIZE(bp),
NULL, NULL, pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
&pio->io_bookmark));
}
zio_t *
zio_rewrite_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
zio_t *zio;
if (gn != NULL) {
zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
gn->gn_gbh, SPA_GANGBLOCKSIZE, NULL, NULL, pio->io_priority,
ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
/*
* As we rewrite each gang header, the pipeline will compute
* a new gang block header checksum for it; but no one will
* compute a new data checksum, so we do that here. The one
* exception is the gang leader: the pipeline already computed
* its data checksum because that stage precedes gang assembly.
* (Presently, nothing actually uses interior data checksums;
* this is just good hygiene.)
*/
if (gn != pio->io_logical->io_gang_tree) {
zio_checksum_compute(zio, BP_GET_CHECKSUM(bp),
data, BP_GET_PSIZE(bp));
}
} else {
zio = zio_rewrite(pio, pio->io_spa, pio->io_txg, bp,
data, BP_GET_PSIZE(bp), NULL, NULL, pio->io_priority,
ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
}
return (zio);
}
/* ARGSUSED */
zio_t *
zio_free_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
return (zio_free(pio, pio->io_spa, pio->io_txg, bp,
NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
}
/* ARGSUSED */
zio_t *
zio_claim_gang(zio_t *pio, blkptr_t *bp, zio_gang_node_t *gn, void *data)
{
return (zio_claim(pio, pio->io_spa, pio->io_txg, bp,
NULL, NULL, ZIO_GANG_CHILD_FLAGS(pio)));
}
static zio_gang_issue_func_t *zio_gang_issue_func[ZIO_TYPES] = {
NULL,
zio_read_gang,
zio_rewrite_gang,
zio_free_gang,
zio_claim_gang,
NULL
};
static void zio_gang_tree_assemble_done(zio_t *zio);
static zio_gang_node_t *
zio_gang_node_alloc(zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn;
ASSERT(*gnpp == NULL);
gn = kmem_zalloc(sizeof (*gn), KM_SLEEP);
gn->gn_gbh = zio_buf_alloc(SPA_GANGBLOCKSIZE);
*gnpp = gn;
return (gn);
}
static void
zio_gang_node_free(zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn = *gnpp;
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
ASSERT(gn->gn_child[g] == NULL);
zio_buf_free(gn->gn_gbh, SPA_GANGBLOCKSIZE);
kmem_free(gn, sizeof (*gn));
*gnpp = NULL;
}
static void
zio_gang_tree_free(zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn = *gnpp;
if (gn == NULL)
return;
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++)
zio_gang_tree_free(&gn->gn_child[g]);
zio_gang_node_free(gnpp);
}
static void
zio_gang_tree_assemble(zio_t *lio, blkptr_t *bp, zio_gang_node_t **gnpp)
{
zio_gang_node_t *gn = zio_gang_node_alloc(gnpp);
ASSERT(lio->io_logical == lio);
ASSERT(BP_IS_GANG(bp));
zio_nowait(zio_read(lio, lio->io_spa, bp, gn->gn_gbh,
SPA_GANGBLOCKSIZE, zio_gang_tree_assemble_done, gn,
lio->io_priority, ZIO_GANG_CHILD_FLAGS(lio), &lio->io_bookmark));
}
static void
zio_gang_tree_assemble_done(zio_t *zio)
{
zio_t *lio = zio->io_logical;
zio_gang_node_t *gn = zio->io_private;
blkptr_t *bp = zio->io_bp;
ASSERT(zio->io_parent == lio);
ASSERT(zio->io_child == NULL);
if (zio->io_error)
return;
if (BP_SHOULD_BYTESWAP(bp))
byteswap_uint64_array(zio->io_data, zio->io_size);
ASSERT(zio->io_data == gn->gn_gbh);
ASSERT(zio->io_size == SPA_GANGBLOCKSIZE);
ASSERT(gn->gn_gbh->zg_tail.zbt_magic == ZBT_MAGIC);
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
if (!BP_IS_GANG(gbp))
continue;
zio_gang_tree_assemble(lio, gbp, &gn->gn_child[g]);
}
}
static void
zio_gang_tree_issue(zio_t *pio, zio_gang_node_t *gn, blkptr_t *bp, void *data)
{
zio_t *lio = pio->io_logical;
zio_t *zio;
ASSERT(BP_IS_GANG(bp) == !!gn);
ASSERT(BP_GET_CHECKSUM(bp) == BP_GET_CHECKSUM(lio->io_bp));
ASSERT(BP_GET_LSIZE(bp) == BP_GET_PSIZE(bp) || gn == lio->io_gang_tree);
/*
* If you're a gang header, your data is in gn->gn_gbh.
* If you're a gang member, your data is in 'data' and gn == NULL.
*/
zio = zio_gang_issue_func[lio->io_type](pio, bp, gn, data);
if (gn != NULL) {
ASSERT(gn->gn_gbh->zg_tail.zbt_magic == ZBT_MAGIC);
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
blkptr_t *gbp = &gn->gn_gbh->zg_blkptr[g];
if (BP_IS_HOLE(gbp))
continue;
zio_gang_tree_issue(zio, gn->gn_child[g], gbp, data);
data = (char *)data + BP_GET_PSIZE(gbp);
}
}
if (gn == lio->io_gang_tree)
ASSERT3P((char *)lio->io_data + lio->io_size, ==, data);
if (zio != pio)
zio_nowait(zio);
}
static int
zio_gang_assemble(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
ASSERT(BP_IS_GANG(bp) && zio == zio->io_logical);
zio_gang_tree_assemble(zio, bp, &zio->io_gang_tree);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_gang_issue(zio_t *zio)
{
zio_t *lio = zio->io_logical;
blkptr_t *bp = zio->io_bp;
if (zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
ASSERT(BP_IS_GANG(bp) && zio == lio);
if (zio->io_child_error[ZIO_CHILD_GANG] == 0)
zio_gang_tree_issue(lio, lio->io_gang_tree, bp, lio->io_data);
else
zio_gang_tree_free(&lio->io_gang_tree);
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
return (ZIO_PIPELINE_CONTINUE);
}
static void
zio_write_gang_member_ready(zio_t *zio)
{
zio_t *pio = zio->io_parent;
zio_t *lio = zio->io_logical;
dva_t *cdva = zio->io_bp->blk_dva;
dva_t *pdva = pio->io_bp->blk_dva;
uint64_t asize;
if (BP_IS_HOLE(zio->io_bp))
return;
ASSERT(BP_IS_HOLE(&zio->io_bp_orig));
ASSERT(zio->io_child_type == ZIO_CHILD_GANG);
ASSERT3U(zio->io_prop.zp_ndvas, ==, lio->io_prop.zp_ndvas);
ASSERT3U(zio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(zio->io_bp));
ASSERT3U(pio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(pio->io_bp));
ASSERT3U(BP_GET_NDVAS(zio->io_bp), <=, BP_GET_NDVAS(pio->io_bp));
mutex_enter(&pio->io_lock);
for (int d = 0; d < BP_GET_NDVAS(zio->io_bp); d++) {
ASSERT(DVA_GET_GANG(&pdva[d]));
asize = DVA_GET_ASIZE(&pdva[d]);
asize += DVA_GET_ASIZE(&cdva[d]);
DVA_SET_ASIZE(&pdva[d], asize);
}
mutex_exit(&pio->io_lock);
}
static int
zio_write_gang_block(zio_t *pio)
{
spa_t *spa = pio->io_spa;
blkptr_t *bp = pio->io_bp;
zio_t *lio = pio->io_logical;
zio_t *zio;
zio_gang_node_t *gn, **gnpp;
zio_gbh_phys_t *gbh;
uint64_t txg = pio->io_txg;
uint64_t resid = pio->io_size;
uint64_t lsize;
int ndvas = lio->io_prop.zp_ndvas;
int gbh_ndvas = MIN(ndvas + 1, spa_max_replication(spa));
zio_prop_t zp;
int error;
error = metaslab_alloc(spa, spa->spa_normal_class, SPA_GANGBLOCKSIZE,
bp, gbh_ndvas, txg, pio == lio ? NULL : lio->io_bp,
METASLAB_HINTBP_FAVOR | METASLAB_GANG_HEADER);
if (error) {
pio->io_error = error;
return (ZIO_PIPELINE_CONTINUE);
}
if (pio == lio) {
gnpp = &lio->io_gang_tree;
} else {
gnpp = pio->io_private;
ASSERT(pio->io_ready == zio_write_gang_member_ready);
}
gn = zio_gang_node_alloc(gnpp);
gbh = gn->gn_gbh;
bzero(gbh, SPA_GANGBLOCKSIZE);
/*
* Create the gang header.
*/
zio = zio_rewrite(pio, spa, txg, bp, gbh, SPA_GANGBLOCKSIZE, NULL, NULL,
pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio), &pio->io_bookmark);
/*
* Create and nowait the gang children.
*/
for (int g = 0; resid != 0; resid -= lsize, g++) {
lsize = P2ROUNDUP(resid / (SPA_GBH_NBLKPTRS - g),
SPA_MINBLOCKSIZE);
ASSERT(lsize >= SPA_MINBLOCKSIZE && lsize <= resid);
zp.zp_checksum = lio->io_prop.zp_checksum;
zp.zp_compress = ZIO_COMPRESS_OFF;
zp.zp_type = DMU_OT_NONE;
zp.zp_level = 0;
zp.zp_ndvas = lio->io_prop.zp_ndvas;
zio_nowait(zio_write(zio, spa, txg, &gbh->zg_blkptr[g],
(char *)pio->io_data + (pio->io_size - resid), lsize, &zp,
zio_write_gang_member_ready, NULL, &gn->gn_child[g],
pio->io_priority, ZIO_GANG_CHILD_FLAGS(pio),
&pio->io_bookmark));
}
/*
* Set pio's pipeline to just wait for zio to finish.
*/
pio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
zio_nowait(zio);
return (ZIO_PIPELINE_CONTINUE);
}
/*
* ==========================================================================
* Allocate and free blocks
* ==========================================================================
*/
static int
zio_dva_allocate(zio_t *zio)
{
spa_t *spa = zio->io_spa;
metaslab_class_t *mc = spa->spa_normal_class;
blkptr_t *bp = zio->io_bp;
int error;
ASSERT(BP_IS_HOLE(bp));
ASSERT3U(BP_GET_NDVAS(bp), ==, 0);
ASSERT3U(zio->io_prop.zp_ndvas, >, 0);
ASSERT3U(zio->io_prop.zp_ndvas, <=, spa_max_replication(spa));
ASSERT3U(zio->io_size, ==, BP_GET_PSIZE(bp));
error = metaslab_alloc(spa, mc, zio->io_size, bp,
zio->io_prop.zp_ndvas, zio->io_txg, NULL, 0);
if (error) {
if (error == ENOSPC && zio->io_size > SPA_MINBLOCKSIZE)
return (zio_write_gang_block(zio));
zio->io_error = error;
}
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_dva_free(zio_t *zio)
{
metaslab_free(zio->io_spa, zio->io_bp, zio->io_txg, B_FALSE);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_dva_claim(zio_t *zio)
{
int error;
error = metaslab_claim(zio->io_spa, zio->io_bp, zio->io_txg);
if (error)
zio->io_error = error;
return (ZIO_PIPELINE_CONTINUE);
}
/*
* Undo an allocation. This is used by zio_done() when an I/O fails
* and we want to give back the block we just allocated.
* This handles both normal blocks and gang blocks.
*/
static void
zio_dva_unallocate(zio_t *zio, zio_gang_node_t *gn, blkptr_t *bp)
{
spa_t *spa = zio->io_spa;
boolean_t now = !(zio->io_flags & ZIO_FLAG_IO_REWRITE);
ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
if (zio->io_bp == bp && !now) {
/*
* This is a rewrite for sync-to-convergence.
* We can't do a metaslab_free(NOW) because bp wasn't allocated
* during this sync pass, which means that metaslab_sync()
* already committed the allocation.
*/
ASSERT(DVA_EQUAL(BP_IDENTITY(bp),
BP_IDENTITY(&zio->io_bp_orig)));
ASSERT(spa_sync_pass(spa) > 1);
if (BP_IS_GANG(bp) && gn == NULL) {
/*
* This is a gang leader whose gang header(s) we
* couldn't read now, so defer the free until later.
* The block should still be intact because without
* the headers, we'd never even start the rewrite.
*/
bplist_enqueue_deferred(&spa->spa_sync_bplist, bp);
return;
}
}
if (!BP_IS_HOLE(bp))
metaslab_free(spa, bp, bp->blk_birth, now);
if (gn != NULL) {
for (int g = 0; g < SPA_GBH_NBLKPTRS; g++) {
zio_dva_unallocate(zio, gn->gn_child[g],
&gn->gn_gbh->zg_blkptr[g]);
}
}
}
/*
* Try to allocate an intent log block. Return 0 on success, errno on failure.
*/
int
zio_alloc_blk(spa_t *spa, uint64_t size, blkptr_t *new_bp, blkptr_t *old_bp,
uint64_t txg)
{
int error;
error = metaslab_alloc(spa, spa->spa_log_class, size,
new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID);
if (error)
error = metaslab_alloc(spa, spa->spa_normal_class, size,
new_bp, 1, txg, old_bp, METASLAB_HINTBP_AVOID);
if (error == 0) {
BP_SET_LSIZE(new_bp, size);
BP_SET_PSIZE(new_bp, size);
BP_SET_COMPRESS(new_bp, ZIO_COMPRESS_OFF);
BP_SET_CHECKSUM(new_bp, ZIO_CHECKSUM_ZILOG);
BP_SET_TYPE(new_bp, DMU_OT_INTENT_LOG);
BP_SET_LEVEL(new_bp, 0);
BP_SET_BYTEORDER(new_bp, ZFS_HOST_BYTEORDER);
}
return (error);
}
/*
* Free an intent log block. We know it can't be a gang block, so there's
* nothing to do except metaslab_free() it.
*/
void
zio_free_blk(spa_t *spa, blkptr_t *bp, uint64_t txg)
{
ASSERT(!BP_IS_GANG(bp));
metaslab_free(spa, bp, txg, B_FALSE);
}
/*
* ==========================================================================
* Read and write to physical devices
* ==========================================================================
*/
static void
zio_vdev_io_probe_done(zio_t *zio)
{
zio_t *dio;
vdev_t *vd = zio->io_private;
mutex_enter(&vd->vdev_probe_lock);
ASSERT(vd->vdev_probe_zio == zio);
vd->vdev_probe_zio = NULL;
mutex_exit(&vd->vdev_probe_lock);
while ((dio = zio->io_delegate_list) != NULL) {
zio->io_delegate_list = dio->io_delegate_next;
dio->io_delegate_next = NULL;
if (!vdev_accessible(vd, dio))
dio->io_error = ENXIO;
zio_execute(dio);
}
}
/*
* Probe the device to determine whether I/O failure is specific to this
* zio (e.g. a bad sector) or affects the entire vdev (e.g. unplugged).
*/
static int
zio_vdev_io_probe(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
zio_t *pio = NULL;
boolean_t created_pio = B_FALSE;
/*
* Don't probe the probe.
*/
if (zio->io_flags & ZIO_FLAG_PROBE)
return (ZIO_PIPELINE_CONTINUE);
/*
* To prevent 'probe storms' when a device fails, we create
* just one probe i/o at a time. All zios that want to probe
* this vdev will join the probe zio's io_delegate_list.
*/
mutex_enter(&vd->vdev_probe_lock);
if ((pio = vd->vdev_probe_zio) == NULL) {
vd->vdev_probe_zio = pio = zio_root(zio->io_spa,
zio_vdev_io_probe_done, vd, ZIO_FLAG_CANFAIL);
created_pio = B_TRUE;
vd->vdev_probe_wanted = B_TRUE;
spa_async_request(zio->io_spa, SPA_ASYNC_PROBE);
}
zio->io_delegate_next = pio->io_delegate_list;
pio->io_delegate_list = zio;
mutex_exit(&vd->vdev_probe_lock);
if (created_pio) {
zio_nowait(vdev_probe(vd, pio));
zio_nowait(pio);
}
return (ZIO_PIPELINE_STOP);
}
static int
zio_vdev_io_start(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
uint64_t align;
spa_t *spa = zio->io_spa;
ASSERT(zio->io_error == 0);
ASSERT(zio->io_child_error[ZIO_CHILD_VDEV] == 0);
if (vd == NULL) {
if (!(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
spa_config_enter(spa, SCL_ZIO, zio, RW_READER);
/*
* The mirror_ops handle multiple DVAs in a single BP.
*/
return (vdev_mirror_ops.vdev_op_io_start(zio));
}
align = 1ULL << vd->vdev_top->vdev_ashift;
if (P2PHASE(zio->io_size, align) != 0) {
uint64_t asize = P2ROUNDUP(zio->io_size, align);
char *abuf = zio_buf_alloc(asize);
ASSERT(vd == vd->vdev_top);
if (zio->io_type == ZIO_TYPE_WRITE) {
bcopy(zio->io_data, abuf, zio->io_size);
bzero(abuf + zio->io_size, asize - zio->io_size);
}
zio_push_transform(zio, abuf, asize, asize, zio_subblock);
}
ASSERT(P2PHASE(zio->io_offset, align) == 0);
ASSERT(P2PHASE(zio->io_size, align) == 0);
ASSERT(zio->io_type != ZIO_TYPE_WRITE || (spa_mode & FWRITE));
if (vd->vdev_ops->vdev_op_leaf &&
(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE)) {
if (zio->io_type == ZIO_TYPE_READ && vdev_cache_read(zio) == 0)
return (ZIO_PIPELINE_STOP);
if ((zio = vdev_queue_io(zio)) == NULL)
return (ZIO_PIPELINE_STOP);
if (!vdev_accessible(vd, zio)) {
zio->io_error = ENXIO;
zio_interrupt(zio);
return (ZIO_PIPELINE_STOP);
}
}
return (vd->vdev_ops->vdev_op_io_start(zio));
}
static int
zio_vdev_io_done(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
vdev_ops_t *ops = vd ? vd->vdev_ops : &vdev_mirror_ops;
boolean_t unexpected_error = B_FALSE;
if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
ASSERT(zio->io_type == ZIO_TYPE_READ || zio->io_type == ZIO_TYPE_WRITE);
if (vd != NULL && vd->vdev_ops->vdev_op_leaf) {
vdev_queue_io_done(zio);
if (zio->io_type == ZIO_TYPE_WRITE)
vdev_cache_write(zio);
if (zio_injection_enabled && zio->io_error == 0)
zio->io_error = zio_handle_device_injection(vd, EIO);
if (zio_injection_enabled && zio->io_error == 0)
zio->io_error = zio_handle_label_injection(zio, EIO);
if (zio->io_error) {
if (!vdev_accessible(vd, zio)) {
zio->io_error = ENXIO;
} else {
unexpected_error = B_TRUE;
}
}
}
ops->vdev_op_io_done(zio);
if (unexpected_error)
return (zio_vdev_io_probe(zio));
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_vdev_io_assess(zio_t *zio)
{
vdev_t *vd = zio->io_vd;
if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
if (vd == NULL && !(zio->io_flags & ZIO_FLAG_CONFIG_WRITER))
spa_config_exit(zio->io_spa, SCL_ZIO, zio);
if (zio->io_vsd != NULL) {
zio->io_vsd_free(zio);
zio->io_vsd = NULL;
}
if (zio_injection_enabled && zio->io_error == 0)
zio->io_error = zio_handle_fault_injection(zio, EIO);
/*
* If the I/O failed, determine whether we should attempt to retry it.
*/
if (zio->io_error && vd == NULL &&
!(zio->io_flags & (ZIO_FLAG_DONT_RETRY | ZIO_FLAG_IO_RETRY))) {
ASSERT(!(zio->io_flags & ZIO_FLAG_DONT_QUEUE)); /* not a leaf */
ASSERT(!(zio->io_flags & ZIO_FLAG_IO_BYPASS)); /* not a leaf */
zio->io_error = 0;
zio->io_flags |= ZIO_FLAG_IO_RETRY |
ZIO_FLAG_DONT_CACHE | ZIO_FLAG_DONT_AGGREGATE;
zio->io_stage = ZIO_STAGE_VDEV_IO_START - 1;
zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE);
return (ZIO_PIPELINE_STOP);
}
/*
* If we got an error on a leaf device, convert it to ENXIO
* if the device is not accessible at all.
*/
if (zio->io_error && vd != NULL && vd->vdev_ops->vdev_op_leaf &&
!vdev_accessible(vd, zio))
zio->io_error = ENXIO;
/*
* If we can't write to an interior vdev (mirror or RAID-Z),
* set vdev_cant_write so that we stop trying to allocate from it.
*/
if (zio->io_error == ENXIO && zio->io_type == ZIO_TYPE_WRITE &&
vd != NULL && !vd->vdev_ops->vdev_op_leaf)
vd->vdev_cant_write = B_TRUE;
if (zio->io_error)
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
return (ZIO_PIPELINE_CONTINUE);
}
void
zio_vdev_io_reissue(zio_t *zio)
{
ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
ASSERT(zio->io_error == 0);
zio->io_stage--;
}
void
zio_vdev_io_redone(zio_t *zio)
{
ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_DONE);
zio->io_stage--;
}
void
zio_vdev_io_bypass(zio_t *zio)
{
ASSERT(zio->io_stage == ZIO_STAGE_VDEV_IO_START);
ASSERT(zio->io_error == 0);
zio->io_flags |= ZIO_FLAG_IO_BYPASS;
zio->io_stage = ZIO_STAGE_VDEV_IO_ASSESS - 1;
}
/*
* ==========================================================================
* Generate and verify checksums
* ==========================================================================
*/
static int
zio_checksum_generate(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
enum zio_checksum checksum;
if (bp == NULL) {
/*
* This is zio_write_phys().
* We're either generating a label checksum, or none at all.
*/
checksum = zio->io_prop.zp_checksum;
if (checksum == ZIO_CHECKSUM_OFF)
return (ZIO_PIPELINE_CONTINUE);
ASSERT(checksum == ZIO_CHECKSUM_LABEL);
} else {
if (BP_IS_GANG(bp) && zio->io_child_type == ZIO_CHILD_GANG) {
ASSERT(!IO_IS_ALLOCATING(zio));
checksum = ZIO_CHECKSUM_GANG_HEADER;
} else {
checksum = BP_GET_CHECKSUM(bp);
}
}
zio_checksum_compute(zio, checksum, zio->io_data, zio->io_size);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_checksum_verify(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
int error;
if (bp == NULL) {
/*
* This is zio_read_phys().
* We're either verifying a label checksum, or nothing at all.
*/
if (zio->io_prop.zp_checksum == ZIO_CHECKSUM_OFF)
return (ZIO_PIPELINE_CONTINUE);
ASSERT(zio->io_prop.zp_checksum == ZIO_CHECKSUM_LABEL);
}
if ((error = zio_checksum_error(zio)) != 0) {
zio->io_error = error;
if (!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) {
zfs_ereport_post(FM_EREPORT_ZFS_CHECKSUM,
zio->io_spa, zio->io_vd, zio, 0, 0);
}
}
return (ZIO_PIPELINE_CONTINUE);
}
/*
* Called by RAID-Z to ensure we don't compute the checksum twice.
*/
void
zio_checksum_verified(zio_t *zio)
{
zio->io_pipeline &= ~(1U << ZIO_STAGE_CHECKSUM_VERIFY);
}
/*
* ==========================================================================
* Error rank. Error are ranked in the order 0, ENXIO, ECKSUM, EIO, other.
* An error of 0 indictes success. ENXIO indicates whole-device failure,
* which may be transient (e.g. unplugged) or permament. ECKSUM and EIO
* indicate errors that are specific to one I/O, and most likely permanent.
* Any other error is presumed to be worse because we weren't expecting it.
* ==========================================================================
*/
int
zio_worst_error(int e1, int e2)
{
static int zio_error_rank[] = { 0, ENXIO, ECKSUM, EIO };
int r1, r2;
for (r1 = 0; r1 < sizeof (zio_error_rank) / sizeof (int); r1++)
if (e1 == zio_error_rank[r1])
break;
for (r2 = 0; r2 < sizeof (zio_error_rank) / sizeof (int); r2++)
if (e2 == zio_error_rank[r2])
break;
return (r1 > r2 ? e1 : e2);
}
/*
* ==========================================================================
* I/O completion
* ==========================================================================
*/
static int
zio_ready(zio_t *zio)
{
blkptr_t *bp = zio->io_bp;
zio_t *pio = zio->io_parent;
if (zio->io_ready) {
if (BP_IS_GANG(bp) &&
zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_READY))
return (ZIO_PIPELINE_STOP);
ASSERT(IO_IS_ALLOCATING(zio));
ASSERT(bp->blk_birth == zio->io_txg || BP_IS_HOLE(bp));
ASSERT(zio->io_children[ZIO_CHILD_GANG][ZIO_WAIT_READY] == 0);
zio->io_ready(zio);
}
if (bp != NULL && bp != &zio->io_bp_copy)
zio->io_bp_copy = *bp;
if (zio->io_error)
zio->io_pipeline = ZIO_INTERLOCK_PIPELINE;
if (pio != NULL)
zio_notify_parent(pio, zio, ZIO_WAIT_READY);
return (ZIO_PIPELINE_CONTINUE);
}
static int
zio_done(zio_t *zio)
{
spa_t *spa = zio->io_spa;
zio_t *pio = zio->io_parent;
zio_t *lio = zio->io_logical;
blkptr_t *bp = zio->io_bp;
vdev_t *vd = zio->io_vd;
uint64_t psize = zio->io_size;
/*
* If our of children haven't all completed,
* wait for them and then repeat this pipeline stage.
*/
if (zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) ||
zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) ||
zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE))
return (ZIO_PIPELINE_STOP);
for (int c = 0; c < ZIO_CHILD_TYPES; c++)
for (int w = 0; w < ZIO_WAIT_TYPES; w++)
ASSERT(zio->io_children[c][w] == 0);
if (bp != NULL) {
ASSERT(bp->blk_pad[0] == 0);
ASSERT(bp->blk_pad[1] == 0);
ASSERT(bp->blk_pad[2] == 0);
ASSERT(bcmp(bp, &zio->io_bp_copy, sizeof (blkptr_t)) == 0 ||
(pio != NULL && bp == pio->io_bp));
if (zio->io_type == ZIO_TYPE_WRITE && !BP_IS_HOLE(bp) &&
!(zio->io_flags & ZIO_FLAG_IO_REPAIR)) {
ASSERT(!BP_SHOULD_BYTESWAP(bp));
ASSERT3U(zio->io_prop.zp_ndvas, <=, BP_GET_NDVAS(bp));
ASSERT(BP_COUNT_GANG(bp) == 0 ||
(BP_COUNT_GANG(bp) == BP_GET_NDVAS(bp)));
}
}
/*
* If there were child vdev or gang errors, they apply to us now.
*/
zio_inherit_child_errors(zio, ZIO_CHILD_VDEV);
zio_inherit_child_errors(zio, ZIO_CHILD_GANG);
zio_pop_transforms(zio); /* note: may set zio->io_error */
vdev_stat_update(zio, psize);
if (zio->io_error) {
/*
* If this I/O is attached to a particular vdev,
* generate an error message describing the I/O failure
* at the block level. We ignore these errors if the
* device is currently unavailable.
*/
if (zio->io_error != ECKSUM && vd != NULL && !vdev_is_dead(vd))
zfs_ereport_post(FM_EREPORT_ZFS_IO, spa, vd, zio, 0, 0);
if ((zio->io_error == EIO ||
!(zio->io_flags & ZIO_FLAG_SPECULATIVE)) && zio == lio) {
/*
* For logical I/O requests, tell the SPA to log the
* error and generate a logical data ereport.
*/
spa_log_error(spa, zio);
zfs_ereport_post(FM_EREPORT_ZFS_DATA, spa, NULL, zio,
0, 0);
}
}
if (zio->io_error && zio == lio) {
/*
* Determine whether zio should be reexecuted. This will
* propagate all the way to the root via zio_notify_parent().
*/
ASSERT(vd == NULL && bp != NULL);
if (IO_IS_ALLOCATING(zio))
if (zio->io_error != ENOSPC)
zio->io_reexecute |= ZIO_REEXECUTE_NOW;
else
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
if ((zio->io_type == ZIO_TYPE_READ ||
zio->io_type == ZIO_TYPE_FREE) &&
zio->io_error == ENXIO &&
spa_get_failmode(spa) != ZIO_FAILURE_MODE_CONTINUE)
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
if (!(zio->io_flags & ZIO_FLAG_CANFAIL) && !zio->io_reexecute)
zio->io_reexecute |= ZIO_REEXECUTE_SUSPEND;
}
/*
* If there were logical child errors, they apply to us now.
* We defer this until now to avoid conflating logical child
* errors with errors that happened to the zio itself when
* updating vdev stats and reporting FMA events above.
*/
zio_inherit_child_errors(zio, ZIO_CHILD_LOGICAL);
if (zio->io_reexecute) {
/*
* This is a logical I/O that wants to reexecute.
*
* Reexecute is top-down. When an i/o fails, if it's not
* the root, it simply notifies its parent and sticks around.
* The parent, seeing that it still has children in zio_done(),
* does the same. This percolates all the way up to the root.
* The root i/o will reexecute or suspend the entire tree.
*
* This approach ensures that zio_reexecute() honors
* all the original i/o dependency relationships, e.g.
* parents not executing until children are ready.
*/
ASSERT(zio->io_child_type == ZIO_CHILD_LOGICAL);
if (IO_IS_ALLOCATING(zio))
zio_dva_unallocate(zio, zio->io_gang_tree, bp);
zio_gang_tree_free(&zio->io_gang_tree);
if (pio != NULL) {
/*
* We're not a root i/o, so there's nothing to do
* but notify our parent. Don't propagate errors
* upward since we haven't permanently failed yet.
*/
zio->io_flags |= ZIO_FLAG_DONT_PROPAGATE;
zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
} else if (zio->io_reexecute & ZIO_REEXECUTE_SUSPEND) {
/*
* We'd fail again if we reexecuted now, so suspend
* until conditions improve (e.g. device comes online).
*/
zio_suspend(spa, zio);
} else {
/*
* Reexecution is potentially a huge amount of work.
* Hand it off to the otherwise-unused claim taskq.
*/
(void) taskq_dispatch_safe(
spa->spa_zio_taskq[ZIO_TYPE_CLAIM][ZIO_TASKQ_ISSUE],
(task_func_t *)zio_reexecute, zio, &zio->io_task);
}
return (ZIO_PIPELINE_STOP);
}
ASSERT(zio->io_child == NULL);
ASSERT(zio->io_reexecute == 0);
ASSERT(zio->io_error == 0 || (zio->io_flags & ZIO_FLAG_CANFAIL));
if (zio->io_done)
zio->io_done(zio);
zio_gang_tree_free(&zio->io_gang_tree);
ASSERT(zio->io_delegate_list == NULL);
ASSERT(zio->io_delegate_next == NULL);
if (pio != NULL) {
zio_remove_child(pio, zio);
zio_notify_parent(pio, zio, ZIO_WAIT_DONE);
}
if (zio->io_waiter != NULL) {
mutex_enter(&zio->io_lock);
zio->io_executor = NULL;
cv_broadcast(&zio->io_cv);
mutex_exit(&zio->io_lock);
} else {
zio_destroy(zio);
}
return (ZIO_PIPELINE_STOP);
}
/*
* ==========================================================================
* I/O pipeline definition
* ==========================================================================
*/
static zio_pipe_stage_t *zio_pipeline[ZIO_STAGES] = {
NULL,
zio_issue_async,
zio_read_bp_init,
zio_write_bp_init,
zio_checksum_generate,
zio_gang_assemble,
zio_gang_issue,
zio_dva_allocate,
zio_dva_free,
zio_dva_claim,
zio_ready,
zio_vdev_io_start,
zio_vdev_io_done,
zio_vdev_io_assess,
zio_checksum_verify,
zio_done
};
Index: stable/8/sys/cddl/contrib/opensolaris
===================================================================
--- stable/8/sys/cddl/contrib/opensolaris (revision 209273)
+++ stable/8/sys/cddl/contrib/opensolaris (revision 209274)
Property changes on: stable/8/sys/cddl/contrib/opensolaris
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/cddl/contrib/opensolaris:r209093-209101
Index: stable/8/sys/contrib/dev/acpica
===================================================================
--- stable/8/sys/contrib/dev/acpica (revision 209273)
+++ stable/8/sys/contrib/dev/acpica (revision 209274)
Property changes on: stable/8/sys/contrib/dev/acpica
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/contrib/dev/acpica:r209093-209101
Index: stable/8/sys/contrib/pf
===================================================================
--- stable/8/sys/contrib/pf (revision 209273)
+++ stable/8/sys/contrib/pf (revision 209274)
Property changes on: stable/8/sys/contrib/pf
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/contrib/pf:r209093-209101
Index: stable/8/sys/dev/xen/xenpci
===================================================================
--- stable/8/sys/dev/xen/xenpci (revision 209273)
+++ stable/8/sys/dev/xen/xenpci (revision 209274)
Property changes on: stable/8/sys/dev/xen/xenpci
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/dev/xen/xenpci:r209093-209101
Index: stable/8/sys/geom/sched
===================================================================
--- stable/8/sys/geom/sched (revision 209273)
+++ stable/8/sys/geom/sched (revision 209274)
Property changes on: stable/8/sys/geom/sched
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys/geom/sched:r209093-209101
Index: stable/8/sys
===================================================================
--- stable/8/sys (revision 209273)
+++ stable/8/sys (revision 209274)
Property changes on: stable/8/sys
___________________________________________________________________
Modified: svn:mergeinfo
## -0,0 +0,1 ##
Merged /head/sys:r209093-209101
File Metadata
Details
Attached
Mime Type
text/x-c
Expires
Sun, Mar 29, 10:26 PM (2 d)
Storage Engine
blob
Storage Format
Raw Data
Storage Handle
30495913
Default Alt Text
(571 KB)
Attached To
Mode
rS FreeBSD src repository - subversion
Attached
Detach File
Event Timeline
Log In to Comment